Introduction to pyspark

A book for anyone who wants to learn quickly how to use pyspark to effectively load, process, and transform large volumes of data using Python.
Python
Apache Spark
Book
Author

Pedro Duarte Faria

Published

January 9, 2024

About the book

This book is for anyone who wants to learn quickly how to use pyspark to effectively load, process, and transform large volumes of data using Python.

In more detail, this is a quick and introductory book about pyspark, which is the Python API for Apache Spark. Apache Spark is the de facto standard engine for big-data analytics. It is largely used to build data processing, data ingestion, and machine learning applications that process very large volumes of data.

One of the many reasons why Apache Spark became popular is because of its APIs. You can build Spark applications using different programming languages, such as Python, R, and Scala. But this book focuses solely on the Python API.

In this book, you will learn about:

  • How an Apache Spark application works?
  • What are Spark DataFrames?
  • How to build, transform and model your Spark DataFrame.
  • How to import data into (or export data out of) Apache Spark.
  • How to work with SQL inside pyspark.
  • Tools for manipulating specific data types (e.g. strings, dates and datetimes).
  • How to use window functions.

Motivation for the book

In summary, this book aims to give a solid introduction (for python and not python users) to the pyspark package, and on how to use it to build Spark applications for data pipelines and interactive data analysis.

Although we have a good range of materials about Apache Spark in general, such as Damji et al. (2020) and Chambers and Zaharia (2018), we do not have much abundance of materials about the APIs of Spark in “foreign” languages, like the Python (pyspark) and R (SparkR) APIs.

The reason for this is simple: the Spark API have a consistent structure across all languages. As consequence, a general book about Spark can fairly cover all languages at once. In other words, Spark code in Scala can be easily translated into python code with pyspark. Because the structure of the code is very similar between all languages.

So why this book? First, Python is a more popular and friendly language than Scala or Java. If the reader is not interested in learning Java or Scala, why show Java/Scala code to him? Is very important to focus solely on what interest the reader, specially if it is in a language that he is familiar with. Second, I had some time to spent, and a lot of practical experience with pyspark on production to share (so… why not write a book about it?).

Cover

References

Chambers, Bill, and Matei Zaharia. 2018. Spark: The Definitive Guide: Big Data Processing Made Simple. Sebastopol, CA: O’Reilly Media.
Damji, Jules, Brooke Wenig, Tathagata Das, and Denny Lee. 2020. Learning Spark: Lightning-Fast Data Analytics. Sebastopol, CA: O’Reilly Media.