In this blog, we learn how to utilize the Pandas API on Spark for efficient and scalable data analysis. This comprehensive tutorial covers everything from installation to applying custom business logic with UDFs, analyzing big datasets, and saving results, using PySpark 3.5.
Apache Spark has become the de facto standard for large-scale data processing. Its in-memory capabilities make it orders of magnitude faster than traditional disk-based frameworks like Hadoop MapReduce.
However, data scientists often prefer using Python's Pandas library for data analysis because of its simplicity and ease of use. Pandas API on Spark bridges this gap by providing a Pandas-like API for Spark DataFrames. This makes it easy for Pandas users to leverage the distributed capabilities of Spark for big data, without having to learn a new API.
In this comprehensive tutorial, we will learn how to use Pandas API on Spark through concrete examples. We will be using the latest Spark 3.5 which comes bundled with Pandas API out of the box.
Pandas API in Spark provides a domain-specific language to manipulate DataFrames in Apache Spark using Python. It aims to provide a Pandas-like experience while executing queries on a distributed system with all the optimizations that Spark offers under the hood.
Here are some key highlights of the Pandas API on Spark:

- Familiar Pandas syntax (DataFrames, groupby, datetime handling, plotting) for anyone who already knows Pandas.
- Queries execute on Spark, so they scale across a cluster and benefit from Spark's optimizations under the hood.
- No new API to learn when moving from small data to big data.
- Results can be written to a wide range of storage formats without funneling data through the driver.
Let's start by installing PySpark 3.5 which comes bundled with Pandas API. We will use Conda to set up our environment.
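One way to set this up is shown below; the environment name and exact package versions are illustrative.

```bash
# Create and activate a Conda environment for the tutorial
conda create -n pyspark-pandas python=3.10 -y
conda activate pyspark-pandas

# Install PySpark 3.5 plus the libraries the Pandas API relies on
pip install pyspark==3.5.0 pandas pyarrow plotly
```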
Now we are ready to write our first Pandas query on Spark!
Let's start with a simple hello world example. We will create a Pandas DataFrame, manipulate it using .sum() and print the output.
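Here is a minimal sketch; the column name and values are just illustrative.

```python
import pyspark.pandas as ps

# A simple single-column pandas-on-Spark DataFrame
psdf = ps.DataFrame({"value": [1, 2, 2, 3, 3, 3]})

# Group by the column and sum it; Spark executes the query under the hood
print(psdf.groupby("value")["value"].sum())
```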
Running this prints the summed values for each group. Here we created a simple single-column pandas-on-Spark DataFrame, grouped by the column, and printed the sum.
Note that all the Pandas API calls like groupby, sum, etc. run on Spark under the hood, even though we are writing what looks like ordinary Pandas code.
The true power of this approach is revealed when we start working with huge datasets that can leverage Spark's distributed processing capability.
Let's now try to analyze a more realistic dataset using Pandas API. We will use the popular video game sales dataset for this analysis.
First, we load this CSV dataset from a public URL into a Spark DataFrame using Spark's data source API.
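A sketch of one way to do this, staging the file with SparkFiles; the URL below is a placeholder for the dataset's actual public location.

```python
from pyspark import SparkFiles
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("vgsales-analysis").getOrCreate()

# Placeholder URL; point this at the actual public location of the dataset
csv_url = "https://example.com/vgsales.csv"

# Stage the file on the cluster, then read it with Spark's CSV data source
spark.sparkContext.addFile(csv_url)
game_df = spark.read.csv(
    "file://" + SparkFiles.get("vgsales.csv"), header=True, inferSchema=True
)
```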
This loads the CSV file from the URL and creates a Spark DataFrame (game_df). Now we can convert it to a pandas-on-Spark DataFrame.
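In PySpark 3.5 the conversion is a one-liner; the `game_psdf` name is just a convention.

```python
# Convert the Spark DataFrame into a pandas-on-Spark DataFrame
game_psdf = game_df.pandas_api()
game_psdf.head()
```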
The .pandas_api() method converts the PySpark DataFrame into a pandas-on-Spark DataFrame that we can now analyze using the Pandas API.
Let's group this data by platform and find the total global sales.
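Something like the following, assuming the usual column names in this dataset (`Platform`, `Global_Sales`):

```python
# Total global sales per platform, largest first
platform_sales = (
    game_psdf.groupby("Platform")["Global_Sales"]
    .sum()
    .sort_values(ascending=False)
)
print(platform_sales.head(10))
```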
This performs the groupby and sum calculations across the dataset in a distributed manner on the Spark cluster, then returns the final grouped result.
The output shows the total global sales for each platform.
The entire processing was done in a distributed and optimized way in Spark. Yet we could use the intuitive Pandas syntax to manipulate this big data DataFrame.
Key Point: The Pandas API abstracts away the complexities of distributed processing and enables data scientists to focus on the analysis.
Let's try to analyze the sales trend over time to spot patterns. We will leverage the inbuilt date capabilities of Pandas for this analysis.
First, let's clean the year column and convert it to Pandas datetime:
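Below is a sketch assuming the column is named `Year` and holds numeric years with some missing values.

```python
import pyspark.pandas as ps

# Coerce non-numeric years to missing values and drop them
game_psdf["Year"] = ps.to_numeric(game_psdf["Year"], errors="coerce")
game_psdf = game_psdf.dropna(subset=["Year"])

# Convert the numeric year into a proper datetime
game_psdf["Year"] = ps.to_datetime(
    game_psdf["Year"].astype(int).astype(str), format="%Y"
)
```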
Now we can analyze sales by year with ease:
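For example, reusing the assumed `Global_Sales` column:

```python
# Total global sales per year, ordered chronologically
yearly_sales = game_psdf.groupby("Year")["Global_Sales"].sum().sort_index()
print(yearly_sales.head())
```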
This performs a timeseries groupby aggregation to get sales per year. Let's plot this to visualize the trend.
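Plotting is a one-liner; pandas-on-Spark uses Plotly as its default plotting backend, so the chart is interactive.

```python
# Line chart of total global sales per year (renders inline in a notebook)
yearly_sales.plot.line()
```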
This generates an interactive timeseries chart in the notebook. As you can see, the global sales have been steadily rising over time with the advent of newer and more advanced gaming platforms.
A key requirement in data analysis is to apply custom business logic for data transformation. Pandas API provides an easy way to apply arbitrary Python logic through user defined functions (UDFs).
Let's bucket the Global_Sales into business categories like HIGH, MEDIUM, LOW etc. First we define the UDF:
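Below is a sketch of such a function; the thresholds (in millions of units sold) are illustrative.

```python
def sales_category(sales: float) -> str:
    # Illustrative thresholds, in millions of units sold
    if sales >= 20:
        return "HIGH"
    elif sales >= 5:
        return "MEDIUM"
    else:
        return "LOW"
```

The type hints matter here: they let the Pandas API infer the return type when the function is applied across the DataFrame.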
This buckets sales data into business categories. When we apply it through the Pandas API, Spark registers the function as a UDF under the hood and ships it to the executors.
Now we can transform the DataFrame using this UDF:
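For example, adding a new column; the `Sales_Category` name is our choice, and `Name`/`Platform` are assumed dataset columns.

```python
# Apply the custom logic to every row in a distributed fashion
game_psdf["Sales_Category"] = game_psdf["Global_Sales"].apply(sales_category)
game_psdf[["Name", "Platform", "Global_Sales", "Sales_Category"]].head()
```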
This applies the custom logic across the entire DataFrame in a distributed manner.
We now have business categories generated via custom Python logic!
Key Idea: UDFs unlock unlimited flexibility by allowing custom logic while leveraging Spark's distributed execution.
Once the analysis is done, we need to save the results to disk or database for other applications to consume. With Pandas API, saving data at scale is trivial.
We can save Spark DataFrames to a wide array of storage formats very easily:
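For example, Parquet and CSV; the output paths are illustrative.

```python
# Parquet: columnar and compressed, convenient for downstream Spark jobs
game_psdf.to_parquet("output/vgsales_categorized_parquet")

# CSV: written by the executors as a directory of part files
game_psdf.to_csv("output/vgsales_categorized_csv")
```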
This leverages the underlying Spark DataFrame to write to disk in a distributed and optimized manner without moving data through the driver process.
In this detailed post, we went through examples demonstrating how the Pandas API on Spark enables clean, scalable data analysis. The key takeaways are:

- The Pandas API lets Pandas users work with Spark DataFrames at scale without learning a new API beyond the Pandas they already know.
- Familiar operations such as groupby, datetime handling, and plotting run on Spark in a distributed, optimized way under the hood.
- UDFs unlock custom business logic while still leveraging Spark's distributed execution.
- Results can be saved to a wide array of storage formats without moving data through the driver.