In this blog, we learn how to utilize the Pandas API on Spark for efficient and scalable data analysis. This comprehensive tutorial covers everything from installation to applying custom business logic with UDFs, analyzing big datasets, and saving results, using PySpark 3.5.
Apache Spark has become the de facto standard for large-scale data processing. Its in-memory capabilities make it orders of magnitude faster than traditional disk-based frameworks like Hadoop MapReduce.
However, data scientists often prefer using Python's Pandas library for data analysis because of its simplicity and ease of use. Pandas API on Spark bridges this gap by providing a Pandas-like API for Spark DataFrames. This makes it easy for Pandas users to leverage the distributed capabilities of Spark for big data, without having to learn a new API.
In this comprehensive tutorial, we will learn how to use Pandas API on Spark through concrete examples. We will be using the latest Spark 3.5 which comes bundled with Pandas API out of the box.
Pandas API in Spark provides a domain-specific language to manipulate DataFrames in Apache Spark using Python. It aims to provide a Pandas-like experience while executing queries on a distributed system with all the optimizations that Spark offers under the hood.
Here are some key highlights of the Pandas API on Spark:

- Familiar Pandas syntax (DataFrames, groupby, datetime handling, plotting) for anyone who already knows Pandas.
- Queries execute on Spark, so they scale across a cluster and benefit from Spark's optimizations under the hood.
- No new API to learn when moving from small data to big data.
- Results can be written to a wide range of storage formats without funneling data through the driver.
Let's start by installing PySpark 3.5 which comes bundled with Pandas API. We will use Conda to set up our environment.
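One way to set this up is shown below; the environment name and exact package versions are illustrative.

```bash
# Create and activate a Conda environment for the tutorial
conda create -n pyspark-pandas python=3.10 -y
conda activate pyspark-pandas

# Install PySpark 3.5 plus the libraries the Pandas API relies on
pip install pyspark==3.5.0 pandas pyarrow plotly
```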
Now we are ready to write our first Pandas query on Spark!
Let's start with a simple hello world example. We will create a Pandas DataFrame, manipulate it using .sum() and print the output.
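Here is a minimal sketch; the column name and values are just illustrative.

```python
import pyspark.pandas as ps

# A simple single-column pandas-on-Spark DataFrame
psdf = ps.DataFrame({"value": [1, 2, 2, 3, 3, 3]})

# Group by the column and sum it; Spark executes the query under the hood
print(psdf.groupby("value")["value"].sum())
```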
Running this prints the summed values for each group. Here we created a simple single-column pandas-on-Spark DataFrame, grouped by the column, and printed the sum.
Note that all the Pandas API calls like groupby, sum, etc. run on Spark under the hood, even though we are writing what looks like ordinary Pandas code.
The true power of this approach is revealed when we start working with huge datasets that can leverage Spark's distributed processing capability.
Let's now try to analyze a more realistic dataset using Pandas API. We will use the popular video game sales dataset for this analysis.
First, we load this CSV dataset from a public URL into a Spark DataFrame using Spark's data source API.
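A sketch of one way to do this, staging the file with SparkFiles; the URL below is a placeholder for the dataset's actual public location.

```python
from pyspark import SparkFiles
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("vgsales-analysis").getOrCreate()

# Placeholder URL; point this at the actual public location of the dataset
csv_url = "https://example.com/vgsales.csv"

# Stage the file on the cluster, then read it with Spark's CSV data source
spark.sparkContext.addFile(csv_url)
game_df = spark.read.csv(
    "file://" + SparkFiles.get("vgsales.csv"), header=True, inferSchema=True
)
```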
This loads the CSV file from the URL and creates a Spark DataFrame (game_df). Now we can convert it to a pandas-on-Spark DataFrame.
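In PySpark 3.5 the conversion is a one-liner; the `game_psdf` name is just a convention.

```python
# Convert the Spark DataFrame into a pandas-on-Spark DataFrame
game_psdf = game_df.pandas_api()
game_psdf.head()
```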
The .pandas_api() method converts the PySpark DataFrame into a pandas-on-Spark DataFrame that we can now analyze using the Pandas API.
Let's group this data by platform and find the total global sales.
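Something like the following, assuming the usual column names in this dataset (`Platform`, `Global_Sales`):

```python
# Total global sales per platform, largest first
platform_sales = (
    game_psdf.groupby("Platform")["Global_Sales"]
    .sum()
    .sort_values(ascending=False)
)
print(platform_sales.head(10))
```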
This performs the groupby and sum calculations across the dataset in a distributed manner on the Spark cluster, then returns the final grouped result.
The output shows the total global sales for each platform.
The entire processing was done in a distributed and optimized way in Spark. Yet we could use the intuitive Pandas syntax to manipulate this big data DataFrame.
Key Point: The Pandas API abstracts away the complexities of distributed processing and enables data scientists to focus on the analysis.
Let's try to analyze the sales trend over time to spot patterns. We will leverage the inbuilt date capabilities of Pandas for this analysis.
First, let's clean the year column and convert it to Pandas datetime:
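Below is a sketch assuming the column is named `Year` and holds numeric years with some missing values.

```python
import pyspark.pandas as ps

# Coerce non-numeric years to missing values and drop them
game_psdf["Year"] = ps.to_numeric(game_psdf["Year"], errors="coerce")
game_psdf = game_psdf.dropna(subset=["Year"])

# Convert the numeric year into a proper datetime
game_psdf["Year"] = ps.to_datetime(
    game_psdf["Year"].astype(int).astype(str), format="%Y"
)
```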
Now we can analyze sales by year with ease:
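For example, reusing the assumed `Global_Sales` column:

```python
# Total global sales per year, ordered chronologically
yearly_sales = game_psdf.groupby("Year")["Global_Sales"].sum().sort_index()
print(yearly_sales.head())
```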
This performs a timeseries groupby aggregation to get sales per year. Let's plot this to visualize the trend.
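Plotting is a one-liner; pandas-on-Spark uses Plotly as its default plotting backend, so the chart is interactive.

```python
# Line chart of total global sales per year (renders inline in a notebook)
yearly_sales.plot.line()
```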
This generates an interactive timeseries chart in the notebook. As you can see, the global sales have been steadily rising over time with the advent of newer and more advanced gaming platforms.
A key requirement in data analysis is to apply custom business logic for data transformation. Pandas API provides an easy way to apply arbitrary Python logic through user defined functions (UDFs).
Let's bucket the Global_Sales into business categories like HIGH, MEDIUM, LOW etc. First we define the UDF:
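Below is a sketch of such a function; the thresholds (in millions of units sold) are illustrative.

```python
def sales_category(sales: float) -> str:
    # Illustrative thresholds, in millions of units sold
    if sales >= 20:
        return "HIGH"
    elif sales >= 5:
        return "MEDIUM"
    else:
        return "LOW"
```

The type hints matter here: they let the Pandas API infer the return type when the function is applied across the DataFrame.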
This buckets sales data into business categories. When we apply it through the Pandas API, Spark registers the function as a UDF under the hood and ships it to the executors.
Now we can transform the DataFrame using this UDF:
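For example, adding a new column; the `Sales_Category` name is our choice, and `Name`/`Platform` are assumed dataset columns.

```python
# Apply the custom logic to every row in a distributed fashion
game_psdf["Sales_Category"] = game_psdf["Global_Sales"].apply(sales_category)
game_psdf[["Name", "Platform", "Global_Sales", "Sales_Category"]].head()
```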
This applies the custom logic across the entire DataFrame in a distributed manner.
We now have business categories generated via custom Python logic!
Key Idea: UDFs unlock unlimited flexibility by allowing custom logic while leveraging Spark's distributed execution.
Once the analysis is done, we need to save the results to disk or database for other applications to consume. With Pandas API, saving data at scale is trivial.
We can save Spark DataFrames to a wide array of storage formats very easily:
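For example, Parquet and CSV; the output paths are illustrative.

```python
# Parquet: columnar and compressed, convenient for downstream Spark jobs
game_psdf.to_parquet("output/vgsales_categorized_parquet")

# CSV: written by the executors as a directory of part files
game_psdf.to_csv("output/vgsales_categorized_csv")
```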
This leverages the underlying Spark DataFrame to write to disk in a distributed and optimized manner without moving data through the driver process.
In this detailed post, we went through examples demonstrating how the Pandas API on Spark enables clean, scalable data analysis. The key takeaways are:

- The Pandas API lets Pandas users work with Spark DataFrames at scale without learning a new API beyond the Pandas they already know.
- Familiar operations such as groupby, datetime handling, and plotting run on Spark in a distributed, optimized way under the hood.
- UDFs unlock custom business logic while still leveraging Spark's distributed execution.
- Results can be saved to a wide array of storage formats without moving data through the driver.