Posts
Pachyderm vs Airflow
If you do a lot of data pipelining, you’ve probably heard a lot about Airflow by now. I gave a talk about it at a meetup a while back, and wrote a blog post as well. The gist of my pitch for Airflow was essentially “Look, it’s so much better than cron.”
Fast-forward a year or two, and my team is using Pachyderm now. This post is about why I wanted to try Pachyderm, what I love about it, some things that can be improved about it, and some of the tricks you’ll need to know if you want to start using it.
read more
Posts
More AWS things I learned the hard way: S3 best practices and VPCs
To make a long, mostly whiny story short, as part of my current role, I’ve been doing a lot of fighting with AWS to help support my team.
Some of the things I’ve learned along the way are probably not obvious if you, like me, are relying mostly on AWS docs and other people’s advice, so I thought I’d collect some of them here.
Best practices for storing big data on S3
read more
Posts
Cross-account access with AWS
The scene:
I needed to process data from an S3 bucket using pyspark. The S3 bucket was owned by a different account. I had done this before. But this time, there was a twist: we needed to encrypt the data because of GDPR requirements. At the end of the processing, I needed to save the results to another S3 bucket for loading into Redshift.
Thus began a weeks-long saga of learning about AWS the hard way.
read more
Posts
Things I learned about Pyspark the hard way
Why Spark?
Lately I have been working on a project that requires cleaning and analyzing a large volume of event-level data.
Originally, I did some exploratory data analysis on small samples of data (up to 15 million rows) using pandas, my usual data visualization tools, and multiprocessing. But then it was time to scale up.
Why Spark is good for this
Distributed processing means it’s very fast at very large scale, and we can scale it up with minimal adjustments (the same code still works, we just need a bigger cluster).
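To illustrate the “same code, bigger cluster” point, here’s a rough pyspark sketch (the bucket paths and column names are placeholders, not from the actual project):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# The transformation code stays the same whether this runs on a laptop or a
# large cluster; only the Spark deployment changes. Paths and column names
# below are made up for illustration.
spark = SparkSession.builder.appName("event-cleaning").getOrCreate()

events = spark.read.json("s3://my-bucket/events/")
daily_clicks = (
    events
    .filter(F.col("event_type") == "click")
    .groupBy(F.to_date("timestamp").alias("day"))
    .count()
)
daily_clicks.write.mode("overwrite").parquet("s3://my-bucket/daily_clicks/")
```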
read more
Posts
Airflow
Airflow for hands-off ETL
Almost exactly a year ago, I joined Yahoo, which more recently became Oath.
The team I joined is called the Product Hackers, and we work with large amounts of data. By large amounts, I mean billions of rows of log data.
Our team does both ad-hoc analyses and ongoing machine learning projects. To support those efforts, we had initially written scripts to parse the logs and load the data into Redshift on AWS, scheduled with cron.
read more
Posts
Probability binning: simple and fast
Over the years, I’ve done a few data science coding challenges for job interviews. My favorite ones included a data set and asked me to address both specific and open-ended questions about that data set.
One of the first things I usually do is make a bunch of histograms. Histograms are great because they’re an easy way to look at the distribution of data without having to plot every single point, or get distracted by a lot of noise.
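As a rough sketch of that first step (the file name is a placeholder for whatever data set the challenge provides):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Plot a histogram of every numeric column for a quick look at the
# distributions ("challenge_data.csv" is a stand-in file name).
df = pd.read_csv("challenge_data.csv")
df.hist(bins=50, figsize=(12, 8))
plt.tight_layout()
plt.show()
```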
read more
Posts
A tutorial within a tutorial on building reusable models with scikit-learn
Things I learned while following a tutorial on how to build reusable models with scikit-learn.
When in doubt, go back to pandas. When in doubt, write tests. When in doubt, write helper methods to wrap existing objects, rather than creating new objects.
Ingesting “clean” data is easy, right?
Step 1 of this tutorial began with downloading data using requests and saving that to a csv file. So I did that. I’ve used requests before, so I had no reason to think it wouldn’t work.
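That step looks roughly like this (the URL and file name are placeholders, not the ones from the tutorial):

```python
import requests

# Download the raw data and save it to a local CSV file
# (the URL and file name are placeholders).
url = "https://example.com/data.csv"
response = requests.get(url, timeout=30)
response.raise_for_status()  # fail loudly if the download didn't work

with open("data.csv", "wb") as f:
    f.write(response.content)
```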
read more
Posts
Shuffling the deck: an interview experience
Here is a story about an interesting interview question and how I approached it.
The company in question wasn’t interested in actually looking at my code, since I apparently tried to answer the wrong question.
Given a deck of n unique cards, cut the deck c cards from the top and perform a perfect shuffle. A perfect shuffle is where you put down the bottom card from the top portion of the deck followed by the bottom card from the bottom portion of the deck.
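For reference, here’s a quick Python sketch of how I read that description (just my interpretation of the prompt; the question doesn’t say what happens when one portion runs out, so the leftovers are placed down in order):

```python
def perfect_shuffle(deck, c):
    """One perfect shuffle as described above. deck[0] is the top of the deck.

    Cut c cards off the top, then alternately put down the bottom card of the
    top portion and the bottom card of the bottom portion, building a new pile.
    The prompt doesn't cover what to do when one portion runs out, so here the
    remaining cards are simply placed down in order.
    """
    top, bottom = deck[:c], deck[c:]
    pile = []  # the first card placed ends up at the bottom of the new pile
    while top or bottom:
        if top:
            pile.append(top.pop())     # bottom card of the top portion
        if bottom:
            pile.append(bottom.pop())  # bottom card of the bottom portion
    return pile[::-1]  # reverse so that index 0 is the top of the new deck


# e.g. a 6-card deck cut 2 cards from the top
print(perfect_shuffle([1, 2, 3, 4, 5, 6], 2))
```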
read more
Posts
Validating Results
I don’t believe truth is a finite value. Truth is what we know right now. Every ten years or so, a major discovery gets overturned. Scientists are just people, and we’re wrong a lot.
So one of the scariest things about doing research, or making predictions, is trying to convince yourself, and other people, that what you think you’ve discovered is ‘real’.
Or at least real enough, right now, to be believable.
read more
Posts
Test-driven data pipelining
When to test, and why:
• Write a test for every method.
• Write a test any time you find a bug! Then make sure the test passes after you fix the bug (there’s a small sketch of this below the list).
• Think of tests as showing how your code should be used, and write them accordingly. The next person who’s going to edit your code, or even just use your code, should be able to refer to your tests to see what’s happening.
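As a tiny illustration of the “write a test when you find a bug” habit, here’s a pytest-style sketch (parse_timestamp and the bug it once had are hypothetical, not code from the post):

```python
from datetime import datetime, timezone


def parse_timestamp(raw: str) -> datetime:
    """Parse an ISO-8601 timestamp from a log line into a UTC datetime."""
    # Older Pythons' fromisoformat doesn't accept a trailing 'Z', so normalize it.
    return datetime.fromisoformat(raw.replace("Z", "+00:00")).astimezone(timezone.utc)


def test_parse_timestamp_accepts_z_suffix():
    # Regression test added after the trailing-'Z' bug was found and fixed.
    assert parse_timestamp("2018-01-02T03:04:05Z") == datetime(
        2018, 1, 2, 3, 4, 5, tzinfo=timezone.utc
    )
```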
read more