Posts
Pachyderm vs Airflow
If you do a lot of data pipelining, you’ve probably heard a lot about Airflow by now. I gave a talk about it at a meetup a while back, and wrote a blog post as well. The gist of my pitch for Airflow was essentially “Look, it’s so much better than cron.”
Fast-forward a year or two, and my team is using Pachyderm now. This post is about why I wanted to try Pachyderm, what I love about it, some things that can be improved about it, and some of the tricks you’ll need to know if you want to start using it.
read more
Posts
More AWS things I learned the hard way: S3 best practices and VPCs
To make a long, mostly whiny story short, as part of my current role, I’ve been doing a lot of fighting with AWS to help support my team.
Some of the things I’ve learned along the way are probably not obvious if you, like me, are relying mostly on AWS docs and other people’s advice, so I thought I’d collect some of them here.
Best practices for storing big data on S3
read more
Posts
Cross-account access with AWS
The scene:
I needed to process data from an S3 bucket using pyspark. The S3 bucket was owned by a different account. I had done this before. But this time, there was a twist: we needed to encrypt the data because of GDPR requirements. At the end of the processing, I needed to save the results to another S3 bucket for loading into Redshift.
Thus began a weeks-long saga of learning about AWS the hard way.
read more
Posts
Things I learned about Pyspark the hard way
Why Spark?
Lately I have been working on a project that requires cleaning and analyzing a large volume of event-level data.
Originally, I did some exploratory data analysis on small samples of data (up to 15 million rows) using pandas, my usual data visualization tools, and multiprocessing. But then it was time to scale up.
Why Spark is good for this
Distributed processing means it’s very fast at very large scale, and we can scale it up with minimal adjustments (the same code still works, we just need a bigger cluster).
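To illustrate the “same code, bigger cluster” point, here’s a rough pyspark sketch (the bucket paths and column names are placeholders, not from the actual project):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# The transformation code stays the same whether this runs on a laptop or a
# large cluster; only the Spark deployment changes. Paths and column names
# below are made up for illustration.
spark = SparkSession.builder.appName("event-cleaning").getOrCreate()

events = spark.read.json("s3://my-bucket/events/")
daily_clicks = (
    events
    .filter(F.col("event_type") == "click")
    .groupBy(F.to_date("timestamp").alias("day"))
    .count()
)
daily_clicks.write.mode("overwrite").parquet("s3://my-bucket/daily_clicks/")
```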
read more
Posts
Airflow
Airflow for hands-off ETL
Almost exactly a year ago, I joined Yahoo, which more recently became Oath.
The team I joined is called the Product Hackers, and we work with large amounts of data. By large amounts, I mean billions of rows of log data.
Our team does both ad-hoc analyses and ongoing machine learning projects. To support those efforts, we had initially written scripts to parse the logs and load the data into Redshift on AWS, scheduled with cron.
read more
Posts
Probability binning: simple and fast
Over the years, I’ve done a few data science coding challenges for job interviews. My favorite ones included a data set and asked me to address both specific and open-ended questions about that data set.
One of the first things I usually do is make a bunch of histograms. Histograms are great because they’re an easy way to look at the distribution of data without having to plot every single point, or get distracted by a lot of noise.
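As a rough sketch of that first step (the file name is a placeholder for whatever data set the challenge provides):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Plot a histogram of every numeric column for a quick look at the
# distributions ("challenge_data.csv" is a stand-in file name).
df = pd.read_csv("challenge_data.csv")
df.hist(bins=50, figsize=(12, 8))
plt.tight_layout()
plt.show()
```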
read more
Posts
A tutorial within a tutorial on building reusable models with scikit-learn
Things I learned while following a tutorial on how to build reusable models with scikit-learn.
When in doubt, go back to pandas. When in doubt, write tests. When in doubt, write helper methods to wrap existing objects, rather than creating new objects.
Ingesting “clean” data is easy, right?
Step 1 of this tutorial began with downloading data using requests and saving that to a csv file. So I did that. I’ve used requests before, so I had no reason to think it wouldn’t work.
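That step looks roughly like this (the URL and file name are placeholders, not the ones from the tutorial):

```python
import requests

# Download the raw data and save it to a local CSV file
# (the URL and file name are placeholders).
url = "https://example.com/data.csv"
response = requests.get(url, timeout=30)
response.raise_for_status()  # fail loudly if the download didn't work

with open("data.csv", "wb") as f:
    f.write(response.content)
```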
read more
Posts
Shuffling the deck: an interview experience
Here is a story about an interesting interview question and how I approached it.
The company in question wasn’t interested in actually looking at my code, since I apparently tried to answer the wrong question.
Given a deck of n unique cards, cut the deck c cards from the top and perform a perfect shuffle. A perfect shuffle is where you put down the bottom card from the top portion of the deck followed by the bottom card from the bottom portion of the deck.
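For reference, here’s a quick Python sketch of how I read that description (just my interpretation of the prompt; the question doesn’t say what happens when one portion runs out, so the leftovers are placed down in order):

```python
def perfect_shuffle(deck, c):
    """One perfect shuffle as described above. deck[0] is the top of the deck.

    Cut c cards off the top, then alternately put down the bottom card of the
    top portion and the bottom card of the bottom portion, building a new pile.
    The prompt doesn't cover what to do when one portion runs out, so here the
    remaining cards are simply placed down in order.
    """
    top, bottom = deck[:c], deck[c:]
    pile = []  # the first card placed ends up at the bottom of the new pile
    while top or bottom:
        if top:
            pile.append(top.pop())     # bottom card of the top portion
        if bottom:
            pile.append(bottom.pop())  # bottom card of the bottom portion
    return pile[::-1]  # reverse so that index 0 is the top of the new deck


# e.g. a 6-card deck cut 2 cards from the top
print(perfect_shuffle([1, 2, 3, 4, 5, 6], 2))
```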
read more
Posts
Validating Results
I don’t believe truth is a finite value. Truth is what we know right now. Every ten years or so, a major discovery gets overturned. Scientists are just people, and we’re wrong a lot.
So one of the scariest things about doing research, or making predictions, is trying to convince yourself, and other people, that what you think you’ve discovered is ‘real’.
Or at least real enough, right now, to be believable.
read more
Posts
Test-driven data pipelining
When to test, and why:
• Write a test for every method.
• Write a test any time you find a bug! Then make sure the test passes after you fix the bug (there’s a small sketch of this below the list).
• Think of tests as showing how your code should be used, and write them accordingly. The next person who’s going to edit your code, or even just use your code, should be able to refer to your tests to see what’s happening.
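As a tiny illustration of the “write a test when you find a bug” habit, here’s a pytest-style sketch (parse_timestamp and the bug it once had are hypothetical, not code from the post):

```python
from datetime import datetime, timezone


def parse_timestamp(raw: str) -> datetime:
    """Parse an ISO-8601 timestamp from a log line into a UTC datetime."""
    # Older Pythons' fromisoformat doesn't accept a trailing 'Z', so normalize it.
    return datetime.fromisoformat(raw.replace("Z", "+00:00")).astimezone(timezone.utc)


def test_parse_timestamp_accepts_z_suffix():
    # Regression test added after the trailing-'Z' bug was found and fixed.
    assert parse_timestamp("2018-01-02T03:04:05Z") == datetime(
        2018, 1, 2, 3, 4, 5, tzinfo=timezone.utc
    )
```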
read more