Posts

Probability binning: simple and fast

Over the years, I’ve done a few data science coding challenges for job interviews. My favorite ones included a data set and asked me to address both specific and open-ended questions about that data set.

One of the first things I usually do is make a bunch of histograms. Histograms are great because they’re an easy way to look at the distribution of data without having to plot every single point or get distracted by a lot of noise.
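As a rough sketch of what a histogram computes, here is a plain-Python version with equal-width bins. The function name `histogram` and the sample data are made up for illustration; in practice a library call such as `numpy.histogram` would do the same job.

```python
def histogram(values, n_bins):
    """Count how many values fall into each of n_bins equal-width bins."""
    lo, hi = min(values), max(values)
    # Fall back to a width of 1.0 if every value is identical.
    width = (hi - lo) / n_bins or 1.0
    counts = [0] * n_bins
    for v in values:
        # Clamp the index so the maximum value lands in the last bin.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return counts


data = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9]  # made-up sample
counts = histogram(data, 4)
print(counts)  # [3, 5, 1, 1]
```

Four numbers summarize ten points here, and the lone 9 shows up as a sparse last bin instead of dominating a scatter plot.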

Test-driven data pipelining

When to test, and why:

• Write a test for every method.

• Write a test any time you find a bug! Then make sure the test passes after you fix the bug.

• Think of tests as showing how your code should be used, and write them accordingly. The next person who’s going to edit your code, or even just use your code, should be able to refer to your tests to see what’s happening.
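The points above might look like this in practice. This is a minimal sketch: `clean_column_name` is a hypothetical pipeline helper, and the "bug" in the second test is invented for illustration. The tests are plain functions with `assert`, so a runner such as pytest could collect them, but nothing here depends on it.

```python
def clean_column_name(name):
    """Lowercase a column name and join its words with underscores."""
    # str.split() with no arguments drops leading/trailing whitespace
    # and treats any run of whitespace as a single separator.
    return "_".join(name.lower().split())


def test_clean_column_name_shows_basic_usage():
    # Doubles as documentation: this is how the helper is meant to be called.
    assert clean_column_name("Order ID") == "order_id"


def test_clean_column_name_handles_extra_whitespace():
    # Regression test for a hypothetical bug where repeated spaces
    # produced names like "order___id". It should keep passing after the fix.
    assert clean_column_name("  Order   ID ") == "order_id"


test_clean_column_name_shows_basic_usage()
test_clean_column_name_handles_extra_whitespace()
```

Naming each test after the behavior it checks makes the test file readable as a usage guide for whoever touches the code next.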

Data pipelining with pandas

For better or worse, when you’re dealing with data pipelines of varying shapes and sizes, sometimes you need to combine objects that don’t match up evenly.

For example, if you want to apply a condition via lookup, sometimes it makes sense to just do a merge. The merge adds the lookup value as a new column in your data table, and then you can refer to that column when you apply the condition.

This is an extremely simple example to show what I mean:
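A minimal sketch of the idea, assuming pandas is installed. The tables and column names (`orders`, `lookup`, `region`, `shipping_tier`) are made up for illustration and are not the original example.

```python
import pandas as pd

# Main table: one row per order.
orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "region": ["north", "south", "east", "west"],
})

# Lookup table: deliberately missing "west", so the two don't match up evenly.
lookup = pd.DataFrame({
    "region": ["north", "south", "east"],
    "shipping_tier": ["express", "standard", "express"],
})

# A left merge keeps every order and adds shipping_tier as a new column.
# Orders whose region is not in the lookup get NaN in that column.
merged = orders.merge(lookup, on="region", how="left")

# Apply the condition against the new column. NaN compares as False,
# so the unmatched "west" order is simply excluded.
express_orders = merged[merged["shipping_tier"] == "express"]
print(express_orders["order_id"].tolist())  # [1, 3]
```

Using `how="left"` is a design choice: the default inner merge would silently drop the unmatched order, while a left merge keeps it visible as a missing value you can inspect.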