Things I learned about PySpark the hard way
Why Spark?
Lately I have been working on a project that requires cleaning and analyzing a large volume of event-level data.
Originally, I did some exploratory data analysis on small samples of data (up to 15 million rows) using pandas, my usual data visualization tools, and multiprocessing. But then it was time to scale up.
Why Spark is good for this
Spark distributes the work across a cluster, so it stays fast at very large scale, and scaling up takes minimal adjustments: the same code still works; we just need a bigger cluster.