Things I learned about PySpark the hard way
Why Spark?

Lately I have been working on a project that requires cleaning and analyzing a large volume of event-level data.
Originally, I did some exploratory data analysis on small samples of data (up to 15 million rows) using pandas, my usual data visualization tools, and multiprocessing. But then it was time to scale up.
Why Spark is good for this

Distributed processing means it's very fast at very large scale, and we can scale it up with minimal adjustments: the same code still works, we just need a bigger cluster.