Test-driven data pipelining
When to test, and why:
• Write a test for every method.
• Write a test any time you find a bug! Then make sure the test passes after you fix the bug.
• Think of tests as showing how your code should be used, and write them accordingly. The next person who’s going to edit your code, or even just use your code, should be able to refer to your tests to see what’s happening.
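To make the bug-then-test habit concrete, here is a minimal sketch. The helper `average_speed_kmh` and the zero-duration bug are hypothetical, made up for illustration; the tests are plain `assert` functions of the kind a runner such as pytest would pick up.

```python
def average_speed_kmh(distance_km, duration_s):
    """Return average speed in km/h (hypothetical helper for illustration)."""
    if duration_s <= 0:
        # The fix: a zero-length lap used to raise ZeroDivisionError here.
        return 0.0
    return distance_km / (duration_s / 3600.0)


def test_average_speed_typical_lap():
    # Doubles as usage documentation: 10 km in 30 minutes is 20 km/h.
    assert average_speed_kmh(10.0, 1800) == 20.0


def test_average_speed_zero_duration_regression():
    # Written when the (hypothetical) bug was found:
    # it fails before the fix and passes after it.
    assert average_speed_kmh(5.0, 0) == 0.0
```

The second test stays in the suite after the fix, so the same bug cannot quietly come back.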
Data pipelining with pandas
For better or worse, when you’re dealing with data pipelines of varying shapes and sizes, sometimes you need to combine objects that don’t match up evenly.
For example, if you want to apply a condition via lookup, sometimes it makes sense to just do a merge. This creates a new column in your data table, and then you can use that for reference.
This is an extremely simple example to show what I mean:
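A sketch of what such a lookup-by-merge might look like. This is not the original example from the post; the table and column names (`rides`, `lookup`, `route`, `max_speed_kmh`) are invented for illustration.

```python
import pandas as pd

# Main table: one row per ride (hypothetical data).
rides = pd.DataFrame({
    "ride_id": [1, 2, 3],
    "route": ["city", "suburb", "trail"],
})

# Lookup table: it does not cover every route, so the two don't match up evenly.
lookup = pd.DataFrame({
    "route": ["city", "suburb"],
    "max_speed_kmh": [25, 40],
})

# A left merge keeps every ride and adds the lookup value as a new column.
# Rides with no match get NaN in that column.
merged = rides.merge(lookup, on="route", how="left")

# The new column can now be used for reference in a condition.
merged["has_limit"] = merged["max_speed_kmh"].notna()
```

Using `how="left"` is a choice made for this sketch: it keeps unmatched rows visible instead of silently dropping them, which an inner merge would do.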
Biking data from XML to analysis, revised
Am I getting slower every day?
If you’ve ever been a bike commuter, you’ve probably asked yourself this question. Thanks to the little devices we can attach to ourselves or our bicycles, we can now use our own ride data to investigate this kind of question, as well as questions like these:
- If I’m going to work from home one day a week, which day would maximize my recovery?
- Do I tend to ride faster in the morning or the evening?
Last year, I wrote a few posts about learning how to parse a set of Garmin XML data from 2013 and analyze it using pandas, matplotlib, and seaborn. This year I redid the same analyses, with a new installment of data from 2014.
Working with device data
In continuing my series on investigating bike data, I ran into some interesting aspects of working with device data.
I have some experience with devices, thanks to my many years of working in research labs. This post is about the fun of hunting down what’s working and what’s not.
Things to consider when working with devices
- Are you using the device yourself?
- Are you interacting with the user(s) (directly or indirectly)? Or not at all?
- What is the device designed to do? Are you using it for its intended purpose?
- How well does the device actually work? Generic measurables might include: sensitivity, specificity, accuracy, precision, battery life
- What else is being measured?
- Measured how?
- How are data stored? How much data can it store? How does it connect to other devices/data stores?
In the case of a bike computer, I have been looking at:
Biking data from XML to analysis, part 2
So I have some bike data that I parsed out of XML and put into a pandas dataframe. Most of the questions I wanted to ask required that the timestamp of each ride segment, or lap, be used as the index along the x-axis of a plot.
Non-obvious nuances of pandas datetime objects and indexes.
You have to sort the dataframe by timestamps before you can convert the timestamps to use as an index.
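A minimal sketch of that sort-then-index step, assuming the timestamps arrive as ISO strings. The column names `start_time` and `distance_km` and the values are made up for illustration. Sorting first keeps the resulting index in chronological order, which time-based slicing and plotting generally expect.

```python
import pandas as pd

# Hypothetical laps, deliberately out of chronological order.
laps = pd.DataFrame({
    "start_time": [
        "2013-06-03T17:45:00",
        "2013-06-03T08:10:00",
        "2013-06-04T08:05:00",
    ],
    "distance_km": [7.9, 8.2, 8.1],
})

# Parse the strings into real datetime values.
laps["start_time"] = pd.to_datetime(laps["start_time"])

# Sort by timestamp first, then use the timestamps as the index.
laps = laps.sort_values("start_time").set_index("start_time")
```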
Biking data from XML to analysis, part 3
One thing I wanted to do with this data set was experiment with plotting methods. I had already done some exploratory plotting with regular matplotlib, so I had some vague ideas about what I wanted to do.
First I had to select out subsets of data to compare. I knew that there were two types of rides: shorter trips in the city, and longer trips in the suburbs. I was feeling lazy, so I just did a quick threshold with SQL.
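A quick threshold of this kind could look something like the sketch below, which uses an in-memory SQLite database from Python. The table name `rides`, the column `distance_km`, and the 10 km cutoff are all assumptions for illustration; the post does not say which column or value was actually used.

```python
import sqlite3

# In-memory database with a few hypothetical rides.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rides (ride_id INTEGER, distance_km REAL)")
conn.executemany(
    "INSERT INTO rides VALUES (?, ?)",
    [(1, 3.2), (2, 18.5), (3, 4.1), (4, 22.0)],
)

THRESHOLD_KM = 10.0  # assumed cutoff between city and suburban rides

# Shorter trips fall below the threshold, longer trips at or above it.
short_rides = conn.execute(
    "SELECT ride_id FROM rides WHERE distance_km < ? ORDER BY ride_id",
    (THRESHOLD_KM,),
).fetchall()
long_rides = conn.execute(
    "SELECT ride_id FROM rides WHERE distance_km >= ? ORDER BY ride_id",
    (THRESHOLD_KM,),
).fetchall()

conn.close()
```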
Biking data from XML to analysis, part 4
One of the main reasons this project turned out to be interesting is that time series data has all kinds of gotchas. I never had to deal with a lot of this before, because the sorts of time series I did in my scientific life didn’t care about real-life things like time zones. We mostly just cared about calculating time elapsed.
…tick…tick…tick
Anyway, one thing I wondered about with the bike data was whether we can compare average speeds in the morning vs. the afternoon. But to do that, I first had to parse the datetime objects and put them in the right time zone.
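Here is a rough sketch of that parse-convert-compare sequence in pandas. It assumes the timestamps are stored in UTC and that the rides happened in the `America/Los_Angeles` zone; both are assumptions for illustration, as are the column names and speeds.

```python
import pandas as pd

# Hypothetical laps with UTC timestamps.
laps = pd.DataFrame({
    "start_time_utc": [
        "2013-06-03T15:10:00Z",
        "2013-06-04T00:45:00Z",
        "2013-06-04T15:05:00Z",
        "2013-06-05T00:40:00Z",
    ],
    "speed_kmh": [21.0, 19.0, 22.0, 18.0],
})

# Parse as timezone-aware UTC, then convert to local wall-clock time.
utc_times = pd.to_datetime(laps["start_time_utc"], utc=True)
local_times = utc_times.dt.tz_convert("America/Los_Angeles")

# Label each lap by the local hour, not the UTC hour.
laps["period"] = local_times.dt.hour.map(
    lambda hour: "morning" if hour < 12 else "afternoon"
)

# Average speed per period.
mean_speed = laps.groupby("period")["speed_kmh"].mean()
```

Without the conversion, the 00:45 UTC lap would be counted as a morning ride, even though it started at 17:45 local time the previous day.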
Things I learned about zip files
In an effort to advance my Python skills, I spent some time slowly pecking away at the puzzles on pythonchallenge. I got stuck on most of the challenges, and either had to search for a hint, or ask for help from a friend, or both. This latest one was particularly instructive, and it had to do with zip files.
I thought I knew what zip files were. I have used them since grad school, for transferring folders via email, and for compression. I used various utilities and command-line tools to deal with zipping and unzipping. But I never needed to know how they worked.
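For a peek under the hood, Python's standard `zipfile` module exposes an archive as a list of named entries, each with its own metadata. The sketch below builds a small archive in memory and inspects it; the file names and contents are made up for illustration, and it is not the solution to the puzzle.

```python
import io
import zipfile

# Build a small archive in memory instead of on disk.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("notes/readme.txt", "hello " * 100)
    archive.writestr("data.csv", "a,b\n1,2\n")

# Reopen it for reading and look at what is inside.
with zipfile.ZipFile(buffer) as archive:
    names = archive.namelist()                  # entry names, in archive order
    info = archive.getinfo("notes/readme.txt")  # per-entry metadata
    original_size = info.file_size              # bytes before compression
    stored_size = info.compress_size            # bytes as stored in the archive
    text = archive.read("data.csv").decode("utf-8")
```

Each entry carries its own name, sizes, and compression method, so a "folder" inside a zip file is really just a shared prefix in the entry names.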
Things I learned studying the cell cycle in cancer
I know that from the outside, ‘science’ seems like The Place Where Scientists Live. But ‘science’ is not a monolithic, homogeneous thing. Not all scientists are the same.
Today someone called me a Biologist. But I was never really a Biologist. My undergraduate degree was in a chemistry department.
My past life as a researcher was always very interdisciplinary. To better understand cancer cells, I used a lot of sophisticated software, and mathematical intuition, in addition to chemistry and physics.
Advice on recruiting
I have had a few pleasant job interviews. Here’s what was different about those interviews and made them stand out from the others I’ve done. I’ll describe a specific example, and then give some specific suggestions.
- The hiring manager contacted me directly
- He had done his homework. He had looked at my GitHub repos.
- He gave me pretty specific information about the structure of the interview, and gave me ~2 weeks to prepare.
- The interview was 1-on-1, in person. It lasted 2-3 hours.
- The first exercise was to have me go through an app and describe what all the pieces were doing.
- The second exercise was to interact with the app and add a feature or two.
- The third exercise was to look at a script and identify the bugs (or other problems with it).
It was an outstanding experience because he had clearly put time and effort into preparing the process, and he was patient, understanding that while he knew the app inside and out, I did not.