Postgres With Docker
Recently, someone asked for help with setting up Postgres in Docker and connecting to it with Python.
While helping them, I realized this is something that should be fairly straightforward with a simple set of instructions, but there aren’t a lot of good beginner tutorials out there. So I decided to write this up, because I’m sure it’s something other people would also find useful. A few years ago I wouldn’t have been able to do this even with a lot of googling (this time I only had to google a few things!).
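As a preview of where this is headed, the whole setup boils down to something like the sketch below. The container name, password, and port are placeholders, and psycopg2 is just one of several Python client libraries you could use.

```python
# Rough sketch -- container name, password, and port are placeholders.
# First, start a Postgres container (shell command shown as a comment):
#   docker run --name my-postgres -e POSTGRES_PASSWORD=mysecret -p 5432:5432 -d postgres

import psycopg2  # assumes psycopg2 (or psycopg2-binary) is installed

# Connect to the containerized Postgres from Python.
conn = psycopg2.connect(
    host="localhost",
    port=5432,
    dbname="postgres",
    user="postgres",
    password="mysecret",
)

with conn.cursor() as cur:
    cur.execute("SELECT version();")
    print(cur.fetchone())

conn.close()
```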
Manager
I was chatting with some friends recently, and this question came up:
What does a good working relationship with your manager look like?
This is the list I came up with. I’m not saying I’ve been perfect at all of these myself as a manager, but it’s what I strive for, and it’s what I look for in a good manager.
- They trust you
- They ask for your input when appropriate
- They promote your work to others in the company where appropriate
- You trust them
- No surprises
- You feel safe asking them for help when you need it
- They discuss your career goals with you, and help steer you toward them
This post is just an elaboration on what I mean by each of the things on that list.
Python OOP
I frequently hear Python referred to as a ‘scripting’ language, because it’s not compiled. Unfortunately, for this reason, a lot of people seem to assume you can’t write ‘real’ programs with it. This post is about moving beyond using Python as a scripting language. I’m assuming you’re already comfortable with basic Python data types and methods.
Note: Most of the content here is specific to Python 3. If you’re just learning Python now, don’t learn Python 2; it’s deprecated, and many current libraries have already stopped supporting it.
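To give a flavor of what moving beyond scripting looks like, here is a minimal, hypothetical example of the kind of class-based structure this post works toward. The class and its methods are purely illustrative, not code from any real project.

```python
# A minimal, hypothetical example of structuring code as a class
# instead of a flat script. Names here are illustrative only.

class LogParser:
    """Parse raw log lines into structured records."""

    def __init__(self, delimiter: str = "\t"):
        self.delimiter = delimiter

    def parse_line(self, line: str) -> dict:
        """Split one log line into a dict of fields."""
        timestamp, level, message = line.rstrip("\n").split(self.delimiter, 2)
        return {"timestamp": timestamp, "level": level, "message": message}

    def parse_file(self, path: str) -> list:
        """Parse every line in a file."""
        with open(path) as f:
            return [self.parse_line(line) for line in f]


if __name__ == "__main__":
    parser = LogParser()
    print(parser.parse_line("2020-01-01T00:00:00\tINFO\thello world"))
```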
Pachyderm vs Airflow
If you do a lot of data pipelining, you’ve probably heard a lot about Airflow by now. I gave a talk about it a while back at a meetup, and wrote a blog post about it. The gist of my pitch for Airflow was essentially “Look, it’s so much better than cron.”
Fast-forward a year or two, and my team is using Pachyderm now. This post is about why I wanted to try Pachyderm, what I love about it, some things that can be improved about it, and some of the tricks you’ll need to know if you want to start using it.
More AWS things I learned the hard way: S3 best practices and VPCs
To make a long, mostly whiny story short, as part of my current role, I’ve been doing a lot of fighting with AWS to help support my team.
Some of the things I’ve learned along the way are probably not obvious if you, like me, are relying mostly on AWS docs and other people’s advice, so I thought I’d collect some of them here.
Best practices for storing big data on S3
Cross-account access with AWS
The scene:
I needed to process data from an S3 bucket using pyspark. The S3 bucket was owned by a different account. I had done this before. But this time, there was a twist: we needed to encrypt the data because of GDPR requirements. At the end of the processing, I needed to save the results to another S3 bucket for loading into Redshift.
Thus began a weeks-long saga of learning about AWS the hard way.
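For context, the shape of that kind of job is roughly the sketch below. The bucket names, paths, and KMS key are placeholders, and the genuinely hard part (the cross-account IAM roles and bucket policies) doesn’t show up in the code at all, which is part of why it took weeks.

```python
# Rough sketch of the job shape -- bucket names, paths, and the KMS key
# are placeholders. Cross-account access itself is handled by IAM roles
# and bucket policies, not by anything in this code.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cross-account-etl").getOrCreate()

# Ask the s3a connector to encrypt anything we write (SSE-KMS here).
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.server-side-encryption-algorithm", "SSE-KMS")
hadoop_conf.set("fs.s3a.server-side-encryption.key", "arn:aws:kms:region:account:key/placeholder")

# Read from the other account's bucket, transform, write to ours.
events = spark.read.json("s3a://their-bucket/events/")
cleaned = events.select("user_id", "event_type", "timestamp").dropDuplicates()
cleaned.write.mode("overwrite").parquet("s3a://our-bucket/cleaned/")
```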
Things I learned about Pyspark the hard way
Why Spark?
Lately I have been working on a project that requires cleaning and analyzing a large volume of event-level data.
Originally, I did some exploratory data analysis on small samples of data (up to 15 million rows) using pandas, my usual data visualization tools, and multiprocessing. But then it was time to scale up.
Why Spark is good for this
Distributed processing means it’s very fast at very large scale, and we can scale it up with minimal adjustments (the same code still works; we just need a bigger cluster).
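To illustrate the “same code, bigger cluster” point, here is a made-up example (the path and column names are invented). This exact snippet runs on a laptop with a local SparkSession or on a large cluster; only the cluster configuration changes.

```python
# Illustration only -- the path and column names are made up.
# The same code runs locally or on a large cluster; only the
# cluster configuration differs.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("event-counts").getOrCreate()

events = spark.read.parquet("s3a://some-bucket/events/")
daily_counts = (
    events
    .withColumn("day", F.to_date("timestamp"))
    .groupBy("day", "event_type")
    .count()
    .orderBy("day")
)
daily_counts.show()
```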
Airflow
Airflow for hands-off ETL
Almost exactly a year ago, I joined Yahoo, which more recently became Oath.
The team I joined is called the Product Hackers, and we work with large amounts of data. By large amounts, I mean billions of rows of log data.
Our team does both ad-hoc analyses and ongoing machine learning projects. To support those efforts, we initially wrote scripts to parse the logs and ran them with cron to load the data into Redshift on AWS. After a while, it made sense to move to Airflow.
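To make “move to Airflow” concrete, here is a hedged sketch of a minimal DAG for that kind of job. The task names, schedule, and callables are invented for illustration, not taken from our actual pipelines.

```python
# Minimal illustrative DAG -- task names, schedule, and callables are
# invented; real pipelines would be more involved.

from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.python import PythonOperator


def parse_logs(**context):
    """Placeholder for the log-parsing step."""
    pass


def load_to_redshift(**context):
    """Placeholder for the Redshift load step."""
    pass


default_args = {"retries": 1, "retry_delay": timedelta(minutes=5)}

with DAG(
    dag_id="logs_to_redshift",
    start_date=datetime(2018, 1, 1),
    schedule_interval="@daily",
    default_args=default_args,
    catchup=False,
) as dag:
    parse = PythonOperator(task_id="parse_logs", python_callable=parse_logs)
    load = PythonOperator(task_id="load_to_redshift", python_callable=load_to_redshift)

    parse >> load
```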
Probability binning: simple and fast
Over the years, I’ve done a few data science coding challenges for job interviews. My favorite ones included a data set and asked me to address both specific and open-ended questions about that data set.
One of the first things I usually do is make a bunch of histograms. Histograms are great because they’re an easy way to look at the distribution of data without having to plot every single point or get distracted by a lot of noise.
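In pandas terms, that first pass can be as simple as the snippet below (the filename is a placeholder, and pandas plus matplotlib are just my usual tools, not a requirement).

```python
# Quick first-pass histograms -- the filename is a placeholder.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("challenge_data.csv")

# One histogram per numeric column to eyeball the distributions.
df.select_dtypes("number").hist(bins=50, figsize=(12, 8))
plt.tight_layout()
plt.show()
```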
A tutorial within a tutorial on building reusable models with scikit-learn
Things I learned while following a tutorial on how to build reusable models with scikit-learn.
- When in doubt, go back to pandas.
- When in doubt, write tests.
- When in doubt, write helper methods to wrap existing objects, rather than creating new objects.
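That last point is easier to show than to explain. Here is a rough, hypothetical sketch (the function name and pipeline contents are illustrative, not from the tutorial): rather than subclassing a scikit-learn estimator, wrap the pieces you reuse in a small helper that returns standard sklearn objects.

```python
# Hypothetical helper -- the function name and pipeline contents are
# illustrative, not from the tutorial. The idea: wrap existing sklearn
# objects in a plain function instead of subclassing them.

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression


def make_model(C: float = 1.0) -> Pipeline:
    """Build a scale-then-classify pipeline with the settings we reuse."""
    return Pipeline([
        ("scale", StandardScaler()),
        ("clf", LogisticRegression(C=C, max_iter=1000)),
    ])


# Usage: the result is still a normal sklearn estimator.
# model = make_model(C=0.5)
# model.fit(X_train, y_train)
```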
Ingesting “clean” data is easy, right?
Step 1 of this tutorial began with downloading data using requests and saving it to a CSV file. So I did that. I’ve used requests before, so I had no reason to think it wouldn’t work. It looked like it worked.
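For reference, that step is roughly the snippet below (the URL and filename are placeholders).

```python
# Rough sketch of the download step -- the URL and filename are placeholders.
import requests

url = "https://example.com/dataset.csv"
response = requests.get(url)
response.raise_for_status()  # fail loudly if the download didn't work

with open("data.csv", "wb") as f:
    f.write(response.content)
```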