import pandas
1 Introduction
In 2022, I developed my first Python package called spark_map
, and published it at PyPI. If you want to know more about this package 😉, you can check a previous post here, where I introduced the package, and described it’s main features, and showed a small use case.
Now, although spark_map
is a small Python package, I had a hard time developing it. More specifically, the Python code was not hard at all to develop. But packaging it into a proper package 📦 was hard. In this post, I want to share a few things that I learned about Python package development in this process.
I hope these tips help you in some way 😉. In summary, the main tips discussed here are:
Do not change the
sys.path
orPYTHONPATH
variable inside your package ⚠️;Differences between a package and a module;
Nailing the structure 🏛️ of your package can be one the hardest parts;
If you use
setuptools
to build your package, avoidsetup.py
and usepyproject.toml
instead;Check if your Python modules are included on the built versions of your package;
2 Before I say anything, some references
The development of this package involved reading several articles 📓, and doing some practical tests 🧪. Here I share some of the best resources I discovered along the way.
First, a great article to start your project is the Packaging Python Projects tutorial. This tutorial is written by PyPA, or, the Python Packaging Authority team1, and it seems to be “The Place” to search for best practices on Python package projects.
The PyPA team also have a very detailed documentation about Packaging and distributing projects with setuptools
. This is a more technical, detailed documentation, but, it can be a great resource too.
Another very useful article to learn about the structure of a Python package, and the import process of Python, is the article entitled Dead Simple Python: Project Structure and Imports from Jason C. McDonald.
3 Now, let’s dive in
3.1 The search path of Python
We usually import a package in Python, by including a import
statement at the beginning of our script. For example, to import the pandas
package to my Python session, I could do this:
When you import a package in Python, the Python interpreter starts a search process 🔍 through your computer, to find the specific package you called in your script. Every Python package you use must be installed in your machine. Otherwise, Python will never find it, and, as a consequence, you can not use it.
The Python interpreter will always look for the packages you import, inside a set of pre-defined locations of your computer. This pre-defined list is stored inside the sys.path
variable (Foundation 2023c). In other words, when you import a package, Python looks for this package inside each one of the directories listed at sys.path
variable.
import sys
print(sys.path)
['/usr/lib/python310.zip', '/usr/lib/python3.10', '/usr/lib/python3.10/lib-dynload', '', '/home/pedro-dev/.local/lib/python3.10/site-packages', '/usr/local/lib/python3.10/dist-packages', '/usr/lib/python3/dist-packages']
You might also find contents about the PYTHONPATH
variable when searching for this subject on the internet. In essence, PYTHONPATH
is a environment variable that can contain a complementary list of directories to be added to sys.path
(Foundation 2023a).
As you can imagine, the Python interpreter look into these directories in a sequential manner. That is, Python looks for the package at the first folder. If it does not find the package you called, then, it looks at the second folder. If it does not find the package again, it looks at the third folder. And goes on and on, until it hits the last folder of the list.
If does not find the package you called at this last folder, Python will automatically raise a ModuleNotFoundError
error. As you expect, this error means that Python could not find the package you called at any of the directories listed at sys.path
.
3.2 Do not change the sys.path
or PYTHONPATH
variable ⚠️
The sys.path
variable is a standard Python list, and, as any other list, can be altered to include other directories that are not currently there. The same goes for the PYTHONPATH
variable, which is an environment variable, and can be altered too.
As an example, when you try to import your package, which is stored at folder A
, and, you face a ModuleNotFoundError
error, you might be tempted to alter PYTHONPATH
or sys.path
, to add the folder A
to this search path of Python. DO NOT DO IT! YOU SHOULD NEVER alter PYTHONPATH
or sys.path
⚠️ ! I mean, at least not inside a Python package.
In other words, if at some point inside the source code of your Python package, you execute a code like this:
import sys
'./../weird-unknow-folder') sys.path.append(
just erase this code! Python packages are made to be used by other peoples, and with a code like this above, you will change the search path of this user. Changing the search path of a user is a bad idea. Because you can accidentaly produce bad and confusing side effects to the user’s session, which can be hard to debug and solve.
Besides, in the majority of times when you alter the sys.path
, you are trying to overcome a bad structure of your files. In other words, you can always avoid altering the sys.path
variable by changing the structure of your source files inside your project.
Just to be clear, is a bad idea to alter the sys.path
inside the source code of your package. However, it is ok to alter these variables outside of your package.
As a practical example, Apache Spark is written in Scala. But it have a Python API available through the pyspark
package. If you look closely into the source code of the project, or, more specifically at the run-tests.py
file, you can see that new paths (or new directories) are appended to sys.path
in this file.
However, this run-tests.py
file IS NOT A PART of the pyspark
package itself. It is just an auxiliary script (outside of the package) used to support the testing processes of pyspark
. This means that run-tests.py
contains code that is not intended to be executed by the users of the package, but by the developers of pyspark
instead.
3.3 Differences between a package and a module
This is a very basic knowledge for a Python developer. However, until recently, I did not know the meaning of these two concepts. So, I will give it to you now, in case you do not have it yet.
A Python module is a single Python file (i.e. a file with extension .py
). Every Python script you write, is a Python module. In contrast, a Python package is a set of Python modules gathered together inside a folder. This folder must contain a particular Python module named as __init__.py
. This __init__.py
is the file that “initialize”, or, “identifies” this folder as a Python package (Foundation 2023b).
You can have multiple Python packages inside a Python package. That is, inside the directory of your package, you can have multiple sub-directories with more Python modules and __init__.py
files. In this case, these sub-directories become submodules of the package. You can interpret them as sub-packages, if you prefer.
3.4 Structuring the package was one of the hardest parts
Every Python package follows the same basic file/directory structure 🏛️. In other words, the files that compose a Python package are always structured in a standard way. But, understanding and using this structure effectively was one of the hardest parts for me. At this section, I want to explain this structure for you.
In a Python package project, you usually have these files:
LICENSE.md
orLICENSE.rst
(or both): a text file with the license of your package. It can be a Markdown file (.md
), or, a reStructuredText markup file (.rst
);README.md
: a Markdown file introducing your package. That is, a file that describes succintly the objective of the package, its main features, and showing a small example of use of the package;setup.py
orpyproject.toml
orsetup.cfg
: these are files used by the build system you choose to build (or compile) your Python package into a compact and shareable format;
Also, a Python package project usually contains these folders (or directories):
src/<package-name>/
or<package-name>/
: inside this directory you store all Python modules of your package, that is, the source code of your package;tests/
: inside this directory you store all unit tests of your package. In other words, the scripts and automated workflow used to test your package.
You must store the source code (or the python modules) of your package inside a folder with the same name as the package itself (i.e. the <package-name>/
folder). So, for a package named spark_map
we should keep the source files (i.e. the .py
files) of this package inside a folder called spark_map
. As a practical example, if you look at the source code of the famous pandas
package, you will see that all source code of the package is stored inside a folder called pandas
.
In contrast, this <package-name>/
folder might be (or might be not) inside another folder called src/
, that is, the path to the source code might be src/<package-name>/
instead of <package-name>/
. The pandas
package for example, do not uses the src/
folder, so the source code is stored inside the pandas/
folder. In contrast, the famous flask
package uses the flask/
folder inside a src/
folder, so the path to the source code becomes src/flask
. You can check this by looking at the source code of the package..
So the folder structure to store the source code of the package might change very slightly from package to package. There is no consensus about which one of these two structures is the best. But in general, the source is always stored inside a folder with the same name as the package (i.e. the <package-name>/
folder). And this folder might be stored inside a src/
folder.
Furthermore, every project of a Python package usually have one, two, or more files that control the build process of the package (like setup.py
, pyproject.toml
or setup.cfg
). In other words, these files are not part of the package itself. But they are used by the build system to build (or compile) your package into a compact and shareable format. I talk more about these files at Section 3.5.
Having in mind all these files that we described until here, we can build a example of file structure for a package. For example, a possible file strucuture for a package named haven
could be:
.
├── LICENSE.md
├── LICENSE.rst
├── README.md
├── pyproject.toml
├── src
│ └── haven
│ ├── __init__.py
│ ├── functions.py
│ ├── utils.py
│ └── haven.py
│
└── tests
├── test_functions.py
└── test_haven.py
3.5 Introducing build systems
When you are developing a package, you write multiple Python modules that perform all necessary tasks to solve the specific problem you want to solve with your package. However, in order to distribute this package to other peoples, that is, to make it available to the wide public, you need to compile (or build) your package into a compact and shareable format.
This is what a build system does. It compiles all archives of your package into a single and compact file that you can publish at PyPI, and distribute to other peoples. This means that, when a user downloads your package (through pip
for example), it downloads this compact and shareable version of your package.
On the day of writing this article, there are four main build systems available on the market for Python packages, which are hatchling
, setuptools
, flit
and PDM
. No matter which one you use, just choose one, anyone. Because they do the same thing, and work in a very similar way.
Most of them use various metadata stored at the pyproject.toml
file to build your project. The setuptools
build system is probably a exception to this rule, because this build system supports other kinds of files to store this metadata, which are setup.py
and setup.cfg
.
3.5.1 The pyproject.toml
file as the commom ground
The PEP517 and PEP518 were lauched to stablish the pyproject.toml
file as the commom ground to every build system. Basically, the pyproject.toml
is a file included at the root level of your project, and contains metadata about your package (e.g. it’s name, version, license, description, author, dependencies, etc.).
It represents a “commom ground” because all build systems available can read and use this file to build (or compile) your package. As a consequence, with this file, you can change very easily the build system you use in your package, without having to change various configurantions and files. All you have to do is to change the options under the build-system
table.
This build-system
table is the part of pyproject.toml
file where you can specify options and configurations for the build system. As an example, to use the setuptools
build system, you would add these lines to pyproject.toml
:
[build-system]
requires = ["setuptools"]
build-backend = "setuptools.build_meta"
In the other hand, if you want to use the hatchling
build system instead, you would change the above lines to:
[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"
Or, maybe you prefer flit
:
[build-system]
requires = ["flit_core"]
build-backend = "flit_core.buildapi"
Anyway, you get it. You use the options requires
and build-backend
under the build-system
table to specificy which system you want to use in the building process of your package. All of these different systems will use pyproject.toml
to collect other very important metadata about your package, like these metadata below:
[project]
name = "spark_map"
version = "0.2.77"
authors = [
{ name="Pedro Faria", email="pedropark99@gmail.com" }
]
description = "Pyspark implementation of `map()` function for spark DataFrames"
readme = "README.md"
requires-python = ">=3.7"
license = { file = "LICENSE.txt" }
classifiers = [
"Programming Language :: Python :: 3",
"License :: OSI Approved :: MIT License",
"Operating System :: OS Independent",
]
dependencies = [
"pyspark",
"setuptools",
"toml"
]
3.5.2 The setuptools
build system
The setuptools
build system is the system that I use to build spark_map
. It is kind of a more “versatile”2 system than the others, only because it supports other kinds of files besides pyproject.toml
.
The setup.py
file is probably the most “famous” file, or, the most associable to setuptools
. Although this is changing in recent years (more about this at the next section), the setup.py
is still considered the traditional way of using the system, as the documentation itself says:
The traditional
setuptools
way of packaging Python modules uses asetup()
function within thesetup.py
script. (Authority 2023)
In summary, you can replace pyproject.toml
with the setup.py
file. Just as an example, I could replace the pyproject.toml
file for spark_map
with a setup.py
file similar to the file below. The whole setup.py
file is usually composed of just a single call to the setuptools.setup()
function, and nothing more. All of the various metadata about the package stored at pyproject.toml
are translated to named arguments to this setup()
function.
# setup.py placed at root directory
from setuptools import setup
setup(= 'spark_map'
name = '0.2.77',
version = 'Pedro Faria',
author = 'Pyspark implementation of `map()` function for spark DataFrames',
description = 'README.md',
long_description = { file = "LICENSE.txt" }
license = [
classifiers "Programming Language :: Python :: 3",
"License :: OSI Approved :: MIT License",
"Operating System :: OS Independent",
],= '>=3.7',
python_requires = ['pyspark'],
install_requires = ['setuptools', 'toml']
extras_require )
However, if you use a setup.py
file you become kind of locked into the setuptools
framework. This one the reasons why many developers prefer to use the pyproject.toml
file instead. Because it becomes so much more easy to change between build systems if you want to.
Another “clash point” about the setup.py
file is that it changes the way, or the steps (or the worflow) to build the package. The standard or the most supported strategy to build a package, is to use the build
module command of Python at the terminal. This command will build your package, no matter which build system you use:
# If you are on Windows:
py -m build
# If you are on Linux:
python3 -m build
In contrast, when you use the setup.py
file, you have to execute directly the script with some additional options, like sdist
:
# If you are on Windows:
py setup.py sdist
# If you are on Linux:
python3 setup.py sdist
Again, this is bad, because it forces you into a certain framework, or, a certain workflow that is different than the standard way of building a package.
3.5.3 Avoid setup.py
and use pyproject.toml
instead
The Python community understood this problem, and now, executing the setup.py
script directly is heavily discouraged (Ganssle 2021). Actually, the setuptools
project itself started a movement to migrate it’s functionality to pyproject.toml
and setup.cfg
files.
This concern is clearly documented at the “Quick Start” section of the official documentation for the project:
Setuptools offers first class support for
setup.py
files as a configuration mechanism … It is important to remember, however, that running this file as a script (e.g.python setup.py sdist
) is strongly discouraged, and that the majority of the command line interfaces are (or will be) deprecated (e.g.python setup.py install
,python setup.py bdist_wininst
, …) … We also recommend users to expose as much as possible configuration in a more declarative way via thepyproject.toml
orsetup.cfg
, and keep thesetup.py
minimal with only the dynamic parts (or even omit it completely if applicable).
In resume, avoid using setup.py
in your project, and use pyproject.toml
instead with the python -m build
command. This a much more recommended workflow for building your package.
4 One of the hardest bugs I ever faced in my life
While developing spark_map
I encoutered one of the hardest and weirdest bugs I ever faced in my life. Since it was really hard to find this bug, I want to reveal it here to you, and describe how it happens.
First, a little bit of context. This bug was present at the version 0.2.7 of the package, and, you can look at the exact source of this version, by searching for the git tag v0.2.7
at the source code history. At this point, the package was ready to be shipped to PyPI.
I had no problems at building the package. When I executed the command python -m build
, the package was built successfully, without any error or warning messages. The source and wheel distributions of the package were there.
I had no problemas at importing the package. I could start a new Python session, and use a import
statement to import the package, and that statement would execute flawlessly.
I had no problems at the unit tests of the package. When I executed the command pytest
to initiate the unit tests runs, I was passing in all tests. Again, I had no error or warning messages appearing.
Everything seemed to work perfectly. As a consequence, I was ready to publish the package at PyPI, and that is what I did. I published at PyPI. Than, I started to test the package at PyPI.
First, I tried to install the package with pip install spark_map
, and everything worked fine at the installation process. The package version was right, and, after the installation process, when I executed the pip list
command, I could clearly see spark_map
on the list of installed packages.
However, I SIMPLY COULD NOT IMPORT THE PACKAGE! The package was installed succesfully, and with that in mind, I expected to import it succesfully in any Python module, at any location of my computer. However, when I started a new Python session at a random location of my computer, and tried to import the package, the Python interpreter always raised a ModuleNotFoundError
error.
4.1 How did this problem occur
There was no error or warning messages at the building stage of the package, neither in the installation process as well. But something was definitely wrong with the package.
In essence, the source of this weird and confusing problem was at the building stage of the package. Basically, the build system (in that case, the setuptools
) was not able to automatically find all source files (i.e. the .py
files) of the package. As a result, the build system was building the package, but, not including the source files inside this built version of the package. Very weird! Isn’t it?
I mean, if the build system did not find any source files… it should definitely raise a warning or error during the build process. Because this is definitely an error in the process. How or why can you share a Python package, which have no Python source code? If the package does not contain any Python code in it, it is probably not a Python package. Right?
That is why I could not import the package after I installed it on my machine, trough pip
. Because the version of the package that was being installed on my machine, was a version of the package that had no Python source code that could be imported. That is why the ModuleNotFoundError
error was being raised by the Python interpreter when I tried to import the package.
4.2 How to identify this problem
First, when you build your package, usually two main files are created, which are: 1) the wheel distribution of your package; 2) the source distribution of your package. These are just two different formats in which you can ship and share your package with another human.
The source distribution is a TAR file (with .tar
or .tar.gz
extension). Is just a compressed file (like ZIP files) with all of the essential files that compose your package (i.e. the metadata of the package, the source files - .py
files, README and LICENSE files, etc.).
On the other hand, the wheel distribution (files with .whl
extension), is “pre-compiled” version of your package. This means that a wheel version of your package comes in a ready-to-install format, and, as a consequence, is much faster to install this version of your package, in comparison to source distributions, which must be compiled before being installed on your system.
To identify if the problem I described at Section 4.1 is happening inside your package project, I recommend you to focus your attention on the source distribution of your package (i.e. the TAR file - .tar
or .tar.gz
). Because you can easily open this kind of file with decompressing tools, like WinRar
or 7-zip
, and it is much more easy to analyze than the wheel distribution.
In resume, if you open the TAR file, and, you do not find any Python modules (i.e. .py
files) in it, this means that the problem I described at Section 4.1 is unfortunately happening inside your package. In contrast, if you do find all Python modules of your package inside this TAR file, than, you are safe and good to go forward.
To fix this problem inside my package project, I had to add the lines below to my pyproject.toml
file. These lines tell setuptools
which are the packages available inside the project, and, which is the folder inside the project where all the source code of the package was stored.
[tool.setuptools]
packages = ["spark_map"]
package-dir = {"" = "src"}
It was so hard to find this bug, yet, it was quite simple and straightforward to fix it 😑 (poker face). With these options, setuptools
could finally find all the Python modules of my package, and, as a consequence, include them in the built versions of the package (in the source and in the wheel distributions).
Therefore, if you face a ModuleNotFoundError
error when you try to import your package, after you installed it in your computer with pip
, I recommend you to check if your Python modules are being included on the built versions of your package.