Below you will find pages that utilize the taxonomy term “python”
Posts
Cross-account access with AWS
The scene:
I needed to process data from an S3 bucket using PySpark. The S3 bucket was owned by a different account. I had done this before. But this time, there was a twist: we needed to encrypt the data because of GDPR requirements. At the end of the processing, I needed to save the results to another S3 bucket for loading into Redshift.
Thus began a weeks-long saga of learning about AWS the hard way.
Posts
Airflow
Airflow for hands-off ETL
Almost exactly a year ago, I joined Yahoo, which more recently became Oath.
The team I joined is called the Product Hackers, and we work with large amounts of data. By large amounts, I mean billions of rows of log data.
Our team does both ad-hoc analyses and ongoing machine learning projects. To support those efforts, our team had initially written scripts to parse logs and ran them with cron to load the data into Redshift on AWS.
Posts
Probability binning: simple and fast
Over the years, I’ve done a few data science coding challenges for job interviews. My favorite ones included a data set and asked me to address both specific and open-ended questions about that data set.
One of the first things I usually do is make a bunch of histograms. Histograms are great because they're an easy way to look at the distribution of data without having to plot every single point or get distracted by a lot of noise.
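As a rough illustration of that first step, here is a minimal sketch that bins some data with NumPy and prints a text histogram. The data is randomly generated for the example, and the bin count of 20 is an arbitrary assumption, not something taken from the post.

```python
import numpy as np

# Made-up data for illustration: 10,000 draws from a normal distribution.
rng = np.random.default_rng(seed=0)
values = rng.normal(loc=50.0, scale=10.0, size=10_000)

# np.histogram returns the count per bin and the bin edges
# (there is always one more edge than there are bins).
counts, edges = np.histogram(values, bins=20)

# Print one row per bin, with a bar whose length is proportional to the count.
for count, left, right in zip(counts, edges[:-1], edges[1:]):
    bar = "#" * int(count // 50)
    print(f"{left:6.1f} to {right:6.1f} | {bar}")
```

Twenty rows of output summarize ten thousand points, which is the appeal: the overall shape shows up without plotting each value.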
Posts
A tutorial within a tutorial on building reusable models with scikit-learn
Things I learned while following a tutorial on how to build reusable models with scikit-learn.
• When in doubt, go back to pandas.
• When in doubt, write tests.
• When in doubt, write helper methods to wrap existing objects, rather than creating new objects.
Ingesting “clean” data is easy, right?
Step 1 of this tutorial began with downloading data using requests, and saving that to a csv file. So I did that. I’ve used requests before, so I had no reason to think it wouldn’t work.
Posts
Shuffling the deck: an interview experience
Here is a story about an interesting interview question and how I approached it.
The company in question wasn’t interested in actually looking at my code, since I apparently tried to answer the wrong question.
Given a deck of n unique cards, cut the deck c cards from the top and perform a perfect shuffle. A perfect shuffle is where you put down the bottom card from the top portion of the deck followed by the bottom card from the bottom portion of the deck.
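One way to read the cut-and-shuffle step described above is sketched below. This is not the code from the interview. The function name `cut_and_shuffle` is hypothetical, the deck is represented as a list whose first element is the top card, and the handling of leftover cards when one portion runs out first (they are put down one at a time, still from the bottom) is an assumption, since the excerpt does not say.

```python
def cut_and_shuffle(deck, c):
    """Cut `c` cards from the top of `deck`, then interleave the two portions.

    `deck[0]` is the top card. Returns a new list, also top card first.
    """
    if not 0 <= c <= len(deck):
        raise ValueError("c must be between 0 and the deck size")

    top = list(deck[:c])      # the c cards cut from the top
    bottom = list(deck[c:])   # the rest of the deck

    pile = []  # cards in the order they are put down; the last one ends up on top
    while top or bottom:
        if top:
            pile.append(top.pop())      # bottom card of the top portion
        if bottom:
            pile.append(bottom.pop())   # bottom card of the bottom portion

    pile.reverse()  # convert "order put down" into "top card first"
    return pile


print(cut_and_shuffle([1, 2, 3, 4, 5, 6], 3))  # prints [4, 1, 5, 2, 6, 3]
```

With six cards cut in half, the cards are put down in the order 3, 6, 2, 5, 1, 4, so 4 finishes on top of the new pile.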
Posts
Test-driven data pipelining
When to test, and why:
• Write a test for every method.
• Write a test any time you find a bug! Then make sure the test passes after you fix the bug.
• Think of tests as showing how your code should be used, and write them accordingly. The next person who’s going to edit your code, or even just use your code, should be able to refer to your tests to see what’s happening.
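The points above can be sketched with a small, made-up example. The helper `parse_row_count` and the bug it once had (choking on a trailing newline) are hypothetical and exist only to show the pattern; the tests are plain functions with bare asserts, which a runner such as pytest would pick up by name.

```python
def parse_row_count(line):
    """Hypothetical helper: extract the integer from a log line like 'rows=1200'."""
    key, _, value = line.strip().partition("=")
    if key != "rows" or not value:
        raise ValueError(f"not a row-count line: {line!r}")
    return int(value)


def test_parse_row_count_basic():
    # Doubles as documentation: this is how the helper is meant to be called.
    assert parse_row_count("rows=1200") == 1200


def test_parse_row_count_trailing_newline():
    # Regression test for the (hypothetical) bug: lines read from a file end in "\n".
    assert parse_row_count("rows=1200\n") == 1200


def test_parse_row_count_rejects_other_lines():
    # Malformed input should fail loudly instead of returning a wrong number.
    try:
        parse_row_count("cols=3")
    except ValueError:
        pass
    else:
        raise AssertionError("expected ValueError")
```

Someone reading only the tests can see the expected input format, the edge case that once broke, and what happens on bad input.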
Posts
Data pipelining with pandas
For better or worse, when you’re dealing with data pipelines of varying shapes and sizes, sometimes you need to combine objects that don’t match up evenly.
For example, if you want to apply a condition via lookup, sometimes it makes sense to just do a merge. This creates a new column in your data table, and then you can use that for reference.
This is an extremely simple example to show what I mean:
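The post's own example is not part of this excerpt. As a stand-in, here is a minimal sketch of a merge used as a lookup; the tables, column names, and values are all made up for illustration.

```python
import pandas as pd

# Made-up data table: one row per order.
orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "country": ["US", "DE", "US", "FR"],
})

# Made-up lookup table: it does not cover every country in `orders`.
regions = pd.DataFrame({
    "country": ["US", "DE"],
    "region": ["Americas", "EMEA"],
})

# A left merge keeps every order and adds a `region` column;
# countries missing from the lookup get NaN in that column.
labeled = orders.merge(regions, on="country", how="left")

# The new column can now be used as a condition.
emea_orders = labeled[labeled["region"] == "EMEA"]
print(labeled)
print(emea_orders)
```

Using `how="left"` is the design choice that matters when the two tables don't match up evenly: no rows from the data table are dropped, and the unmatched ones are easy to find with `labeled["region"].isna()`.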
Posts
Things I learned about zip files
In an effort to advance my Python skills, I spent some time slowly pecking away at the puzzles on pythonchallenge. I got stuck on most of the challenges, and either had to search for a hint, or ask for help from a friend, or both. This latest one was particularly instructive, and it had to do with zip files.
I thought I knew what zip files were. I have used them since grad school, for transferring folders via email, and for compression.
Posts
Recursion excursion
More than once, and probably not for the last time, I have done a technical interview for which I was underprepared. I feel like no matter how much I try to prepare, I am always underprepared for technical interviews.
I’m going to tell you about a time I was underprepared for a few reasons, including:
a) It was the first interview where I was asked to write more than a couple lines of recursive code