A tutorial within a tutorial on building reusable models with scikit-learn
By Samantha G. Zeitlin
Things I learned while following a tutorial on how to build reusable models with scikit-learn.
- When in doubt, go back to pandas.
- When in doubt, write tests.
- When in doubt, write helper methods to wrap existing objects, rather than creating new objects.
Ingesting “clean” data is easy, right?
Step 1 of this tutorial began with downloading data using requests, and saving that to a csv file. So I did that. I’ve used requests before, so I had no reason to think it wouldn’t work. It looked like it worked.
Step 2 was to read the file into pandas. I’ve read lots of csv files into pandas before, so I had no reason to think it wouldn’t work.
It didn’t work.
I double-checked that I had followed the instructions correctly, and then checked a few more times before concluding that something was not quite right about the data.
I went back and did the easy thing, just printing out the response from requests.
After some digging, I figured out that `response.content` is not the same as `response.text`: `content` gives you the raw bytes of the response body, while `text` gives you that body decoded into a string. The tutorial said to use `response.content`, but `response.text` seemed to be what actually parsed the strings correctly.
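A minimal sketch of the difference, for anyone who hits the same wall (the URL here is a placeholder, not the tutorial's dataset):

```python
import requests

# Hypothetical URL standing in for the tutorial's dataset
response = requests.get("https://example.com/data.csv")

print(type(response.content))  # <class 'bytes'> - raw, undecoded
print(type(response.text))     # <class 'str'>  - decoded using response.encoding

# Writing the decoded text (not the raw bytes) to disk:
with open("data.csv", "w") as f:
    f.write(response.text)
```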
Even with that fix, pandas was refusing to read in more than the first row of data, due to a couple of problems:
- pandas wasn’t finding the line terminators (nothing special, just `'\n'`)
- pandas wasn’t finding equal numbers of items per row
Unexpectedly, when I went back to what I usually do, just plain old `pandas.read_csv`, this time going directly from the URL, and including the column names, that actually worked. So it was actually better, and a lot less code, to completely skip using requests.
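For reference, the call that finally worked looks something like this (the URL and column names are placeholders, not the tutorial's actual dataset):

```python
import pandas as pd

# Hypothetical column names for a headerless CSV
column_names = ["age", "workclass", "education", "income"]

df = pd.read_csv(
    "https://example.com/data.csv",  # read straight from the URL
    names=column_names,              # supply the header row the raw file lacks
)
```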
Testing always gets me unstuck
I really liked the end-to-end structure of this tutorial, and was frankly embarrassed that I had so much trouble getting the initial ingestion to work.
I liked that the tutorial gave me an excuse to walk through how the author actually uses scikit-learn models in production. With the data firmly in hand, the data visualization steps were easy - they worked as advertised, and anyway I’m very familiar with using seaborn to make charts in Python.
I had never created a Bunch object before, so that was new for me. That seemed to work, but then the next steps again failed, and I had to back up a few steps.
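In case Bunch objects are new to you too, here's a toy sketch with made-up data (in recent scikit-learn versions, Bunch is importable from sklearn.utils; older versions kept it in sklearn.datasets.base). A Bunch is just a dict that also exposes its keys as attributes:

```python
import numpy as np
from sklearn.utils import Bunch

dataset = Bunch(
    data=np.array([[1.0, 2.0], [3.0, 4.0]]),
    target=np.array([0, 1]),
    feature_names=["feature_a", "feature_b"],
)

# Attribute access and key access are the same thing
assert dataset.target is dataset["target"]
```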
I wasn’t sure what the problem was, so I did what I always do with complicated problems, and wrote some tests to rule out user error and make sure I understood what the code was doing. That helped a lot, and identified what was actually broken.
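The tests that helped were along these lines (a hypothetical reconstruction, not my originals): confirm that `LabelEncoder` behaves as expected on a single column, and that it refuses multiple columns at once.

```python
import numpy as np
import pytest
from sklearn.preprocessing import LabelEncoder

def test_label_encoder_single_column():
    # One column of labels works fine; classes are sorted, so a=0, b=1
    le = LabelEncoder()
    assert list(le.fit_transform(np.array(["a", "b", "a"]))) == [0, 1, 0]

def test_label_encoder_rejects_two_dimensions():
    # A 2D array (i.e., multiple columns) raises a ValueError - the root
    # of the multi-column encoding problem described below
    le = LabelEncoder()
    with pytest.raises(ValueError):
        le.fit_transform(np.array([["a", "b"], ["c", "d"]]))
```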
The problem: how to apply `LabelEncoder`, to help convert categorical data, and `Imputer`, to help fill missing data, to multiple columns.
Because the idea was to do this in the context of a `Pipeline` object, the author demonstrated how to create our own Encoder and Imputer objects, with multiple inheritance. I understand the goal of this: take advantage of the nice clean syntax you get from making a Pipeline. But it was failing at the `fit_transform` step, and it wasn’t obvious why.
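For context, the pattern looks roughly like this (my simplified reconstruction with hypothetical names and data, not the author's actual code): inherit from both BaseEstimator and TransformerMixin so the object can slot into a Pipeline, and keep one LabelEncoder per column, since LabelEncoder only handles one-dimensional input.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import LabelEncoder

class MultiColumnLabelEncoder(BaseEstimator, TransformerMixin):
    """Encode several categorical columns of a DataFrame at once."""

    def __init__(self, columns=None):
        self.columns = columns  # names of columns to encode; None = all

    def fit(self, X, y=None):
        cols = self.columns if self.columns is not None else X.columns
        # One LabelEncoder per column, since each is 1-D only
        self.encoders_ = {col: LabelEncoder().fit(X[col]) for col in cols}
        return self

    def transform(self, X):
        X = X.copy()
        for col, encoder in self.encoders_.items():
            X[col] = encoder.transform(X[col])
        return X

# Usage: fit_transform comes for free from TransformerMixin
df = pd.DataFrame({"color": ["red", "blue", "red"], "size": ["S", "M", "S"]})
encoded = MultiColumnLabelEncoder(columns=["color", "size"]).fit_transform(df)
```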
The `fit()` and `transform()` steps both seemed to be working individually and sequentially, and it wasn’t easy to figure out how the `fit_transform` step was supposed to do anything more than chain them together.
After banging my head on this at the end of a long day, even going back to the original scikit-learn source code in an effort to design tests to help me figure out what was wrong, I decided to sleep on it.
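For the record, the relevant part of the source is short. Paraphrased and slightly simplified, the default `fit_transform` that TransformerMixin provides really is just fit followed by transform:

```python
class TransformerMixin:
    def fit_transform(self, X, y=None, **fit_params):
        # Default behavior: fit, then transform, on the same data.
        # Estimators override this only when a fused version is cheaper.
        if y is None:
            return self.fit(X, **fit_params).transform(X)
        return self.fit(X, y, **fit_params).transform(X)
```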
Simple and working is better than complicated and broken
I seriously considered writing tests for our custom Encoder and Imputer objects, but then it dawned on me that I really didn’t need to do that. I decided that the Pipeline functionality was so simple that I didn’t really need it, so I just stripped the objects down into simple functions to run the `fit` and `transform` steps, which was really all I needed anyway.
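The stripped-down version looks something like this (a sketch under my own naming, with SimpleImputer swapped in, since newer scikit-learn versions replaced the old `preprocessing.Imputer`):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.impute import SimpleImputer  # replaces the old Imputer

def encode_columns(df, columns):
    # Run LabelEncoder's fit/transform on each categorical column
    df = df.copy()
    for col in columns:
        df[col] = LabelEncoder().fit_transform(df[col])
    return df

def impute_columns(df, columns, strategy="mean"):
    # Fill missing values in the given numeric columns
    df = df.copy()
    df[columns] = SimpleImputer(strategy=strategy).fit_transform(df[columns])
    return df
```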
That got me through the rest of the steps, so I could practice pickling a model and re-loading it, which seemed to work just fine.
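The pickling step itself is short; here's the pattern, with a stand-in dataset and estimator rather than the tutorial's model:

```python
import pickle
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=200).fit(X, y)

# Save the fitted model to disk...
with open("model.pkl", "wb") as f:
    pickle.dump(model, f)

# ...and load it back
with open("model.pkl", "rb") as f:
    reloaded = pickle.load(f)

# The reloaded model makes identical predictions
assert (reloaded.predict(X) == model.predict(X)).all()
```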
I don’t know if the scikit-learn folks have plans to extend these methods, or if everyone normally does these kinds of acrobatics to encode and impute on multiple columns - normally I would just use pandas for that, too.
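For comparison, the pandas-only route I'd normally take is just categorical codes plus fillna (toy data and hypothetical columns here):

```python
import pandas as pd

df = pd.DataFrame({
    "color": ["red", "blue", None, "red"],
    "size": [1.0, None, 3.0, 4.0],
})

# Encode categories as integers (missing values become -1)
df["color"] = df["color"].astype("category").cat.codes

# Impute missing numeric values with the column mean
df["size"] = df["size"].fillna(df["size"].mean())
```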