Introducing the {spark_map} Python package

With this package, you can easily apply a function over multiple columns of a Spark DataFrame
Python | Package | pyspark | Apache Spark

Author: Pedro Duarte Faria
Affiliation: Blip
Published: December 21, 2022

Overview

spark_map is a Python package that offers tools to help you apply a function over multiple columns of Apache Spark DataFrames using pyspark. The package offers two main functions (or “two main methods”) to distribute your calculations: spark_map() and spark_across(). Furthermore, the package offers several methods to map (or select) the columns to which you want to apply your calculations (these are called mapping methods in the package).

How to install it?

You can install the package from PyPI by using pip in your terminal, like this:

pip install spark-map

What problem does spark_map solve?

I work a lot with data pipelines that use Apache Spark and pyspark at Take Blip. One day, I found myself writing a very long agg() statement to aggregate multiple columns of a Spark DataFrame with the same function, like the one below:

from pyspark.sql.functions import sum, column
aggregates = (
    spark.table('cards.detailed_sales_per_user')
        .groupBy('day')
        .agg(
            sum(column('cards_lite')).alias('cards_lite'),
            sum(column('cards_silver')).alias('cards_silver'),
            sum(column('cards_gold')).alias('cards_gold'),
            sum(column('cards_premium')).alias('cards_premium'),
            sum(column('cards_enterprise')).alias('cards_enterprise'),
            sum(column('cards_business')).alias('cards_business')
        )
)

Looking at this code, I had the following thought: “this is not elegant, and it is error-prone, because it involves copying and pasting with very subtle changes on each line”. Following the golden rule of DRY (don’t repeat yourself), I had to find a better way to write this code.

I just wanted to apply the sum() function over multiple columns of cards.detailed_sales_per_user, grouped by day. Because of that, I decided to develop the spark_map package, which lets you declare this operation in a much more elegant and concise way, using the spark_map() function.

from pyspark.sql.functions import sum
from spark_map.functions import spark_map
from spark_map.mapping import starts_with

grouped_by_day = spark.table('cards.detailed_sales_per_user')\
    .groupBy('day')

aggregates = spark_map(grouped_by_day, starts_with('cards'), sum)

How does spark_map() work?

The spark_map() function receives three inputs: table (the Spark DataFrame you want to use), mapping (a “mapping” that describes to which columns you want to apply your function), and function (the function you want to apply to each of those columns).

In short, the starts_with('cards') part tells spark_map() that you want to apply the input function to all columns of grouped_by_day whose names start with the text 'cards'. In other words, all spark_map() does is apply the function you choose (sum() in the example above) to every column of the input DataFrame that fits the description given by your mapping method.
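To make this concrete, here is a rough sketch of what that call boils down to, based on the agg()-based behaviour described further below. The list comprehension and the alias names are my own illustration of the idea, not the package’s internal code:

from pyspark.sql.functions import sum, column

# Select every column of the table whose name starts with 'cards' ...
cards_columns = [
    name for name in spark.table('cards.detailed_sales_per_user').columns
    if name.startswith('cards')
]

# ... and aggregate each of them with sum(), keeping the original names.
aggregates = grouped_by_day.agg(
    *[sum(column(name)).alias(name) for name in cards_columns]
)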

You can use different mapping methods to select the columns of your DataFrame. The package offers several built-in methods, available through the spark_map.mapping module, that can be very useful. You can select columns based on (see the sketch after this list):

  • at_position(): their position (e.g. the 3rd, 4th and 5th columns);
  • matches(): a regex that their names match;
  • are_of_type(): the type of data they store (e.g. all columns of type int);
  • starts_with() and ends_with(): their names starting or ending with a particular pattern;
  • all_of(): their names being inside a specific list of options.
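As an illustration, here is a short, hedged sketch of a few of these mapping methods in use, reusing the grouped_by_day DataFrame from the earlier example. The exact argument forms of at_position(), matches() and are_of_type() shown here are my assumptions; check the Function Reference for the authoritative signatures:

from pyspark.sql.functions import sum, mean
from spark_map.functions import spark_map
from spark_map.mapping import at_position, matches, are_of_type

# Sum the 3rd, 4th and 5th columns of the grouped DataFrame:
by_position = spark_map(grouped_by_day, at_position(3, 4, 5), sum)

# Sum every column whose name matches the regex '^cards_':
by_regex = spark_map(grouped_by_day, matches(r'^cards_'), sum)

# Average every column that stores integers (assumed 'int' type label):
by_type = spark_map(grouped_by_day, are_of_type('int'), mean)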

The differences between spark_map() and spark_across()

There are two main functions in the package that perform the heavy work: spark_map() and spark_across().

Both functions perform the same job, which is to apply a function over a set of columns of a Spark DataFrame, but they differ in the method they use to apply it. While spark_map() uses the DataFrame’s agg() method to apply the function, spark_across() uses the withColumn() method.

This means you will mainly use spark_map() when you want to calculate aggregates. It is worth pointing out that spark_map() also works perfectly with grouped DataFrames (i.e. GroupedData). On the other hand, you will use spark_across() when you just want to transform the values of multiple columns at once by applying the same function over them.
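As a quick illustration of that difference, here is a hedged sketch of spark_across() transforming columns in place. The rounding use case and the 'double' type label passed to are_of_type() are hypothetical, and the helper round_2() is just a thin wrapper written for this example:

from pyspark.sql.functions import round as spark_round
from spark_map.functions import spark_across
from spark_map.mapping import are_of_type

def round_2(col):
    # Round a Column to two decimal places.
    return spark_round(col, 2)

sales = spark.table('cards.detailed_sales_per_user')

# Every column of type double is replaced (via withColumn()) by its rounded
# version; all other columns pass through untouched, and no rows are grouped.
rounded = spark_across(sales, are_of_type('double'), round_2)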

Check the documentation for more examples and details

The website has documentation for all functions of the package. If you have any trouble understanding something or finding examples, it is a good idea to check the package’s Function Reference to see examples and more details about how each function works.

To understand how the mapping methods work, and how you can create your own mapping method, a good place to start is the article Building the mapping, available on the package’s website.