Useful Python Libraries#

The following libraries are available and recommended for use by Cal-ITP data analysts. Our JupyterHub environment comes with all of these installed already, except for calitp-data-infra. A full list of the packages included in the system image that underpins our JupyterHub environment can be found (and updated when needed) here.

Table of Contents#

  1. shared utils

  2. calitp-data-analysis
    - Query SQL
    - Accessing Geospatial Resources

  3. Querying Data in BigQuery
    - SQLAlchemy
    - pandas
    - siuba
    - Basic Queries
    - Advanced Queries

  4. Add New Packages

  5. Updating calitp-data-analysis

  6. Appendix: calitp-data-infra

shared utils#

A set of shared utility functions can also be installed, similar to any other Python library. The shared_utils are stored in two places: functions that are more likely to be updated live here in data-analyses, while shared functions that are updated less frequently are housed here in the calitp_data_analysis package in data-infra. Generalized functions for analysis are added as collaborative work evolves so we aren’t constantly reinventing the wheel.

In terminal#

  • Navigate to the package folder: cd data-analyses/_shared_utils

  • Use the make command to run through conda install and pip install: make setup_env

    • Note: you may need to select Kernel -> Restart Kernel from the top menu after make setup_env in order to successfully import shared_utils

  • Alternative: add an alias to your .bash_profile:

    • In terminal use cd to navigate to the home directory (not a repository)

    • Type nano .bash_profile to open the .bash_profile in a text editor

    • Add a line at end: alias go='cd ~/data-analyses/portfolio && pip install -r requirements.txt && cd ../_shared_utils && make setup_env && cd ..'

    • Exit with Ctrl+X, press Y to save, then hit Enter at the filename prompt

    • Restart your server; you can check your changes with cat .bash_profile

In notebook#

from calitp_data_analysis import geography_utils

geography_utils.WGS84

See data-analyses/starter_kit for examples on how to use shared_utils for general functions, charts, and maps.
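
For instance, a minimal sketch (the local file path is hypothetical) that uses the WGS84 CRS constant shown above to reproject a GeoDataFrame:

import geopandas as gpd

from calitp_data_analysis import geography_utils

# Hypothetical local file; reproject to the shared WGS84 CRS constant (EPSG:4326)
gdf = gpd.read_file("./my_stops.geojson")
gdf = gdf.to_crs(geography_utils.WGS84)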

calitp-data-analysis#

calitp-data-analysis is an internal library of utility functions used to access our warehouse data for analysis purposes.

import tbls#

DEPRECATED: tbls will be removed after calitp_data_analysis version 2025.8.10. Instead of AutoTable or the tbls instance, use query_sql() from calitp_data_analysis.sql to connect to and query a SQL database. See the query_sql() section below.

Historically, you could include import tbls at the top of your notebook to access a warehouse table as a tbls object:

from calitp_data_analysis.tables import tbls

Example:

from calitp_data_analysis.tables import tbls

tbls.mart_gtfs.dim_agency()

Query SQL#

query_sql#

query_sql() is useful for simple queries against BigQuery datasets. You can optionally pass as_df=True to return the query result as a pandas DataFrame. If you need to construct a more detailed query, or are implementing a query builder with multiple use cases, you should probably use SQLAlchemy query building instead.

As an example, in a notebook:

from calitp_data_analysis.sql import query_sql

df_dim_agency = query_sql("""
SELECT
    *
FROM cal-itp-data-infra-staging.mart_gtfs.dim_agency
LIMIT 10""", as_df=True)

df_dim_agency.head()
| | key | _gtfs_key | feed_key | agency_id | agency_name | agency_url | agency_timezone | agency_lang | agency_phone | agency_fare_url | agency_email | base64_url | _dt | _feed_valid_from | _line_number | feed_timezone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | d68aae3d4e2d056cbb7bea75ab43a240 | 6d2bf0cc10ee69dae612a0550e5ea495 | f3ec985a1e72b37bacec6cabead15324 | d139e461-175c-4156-beb0-e6a23b64d978 | Yolo County Transportation District | http://avl.yctd.org/ | America/Los_Angeles | en | 530-666-2877 | https://yolobus.com/yolobus-fares/ | custserv@yctd.org | aHR0cDovL2F2bC55Y3RkLm9yZy9yZWFsdGltZS9nb29nbG... | 2025-10-12 | 2025-10-12 03:00:35.382759+00:00 | 1 | America/Los_Angeles |
| 1 | 57919d204bec7140028d9de55d13e8e8 | 86948d2cee65e1c61fabb490be115731 | 3df224ee3313c75fd96f43943f1d7536 | 11214031 | Santa Ynez Valley Transit | https://www.syvt.com/489/Santa-Ynez-Valley-Tra... | America/Los_Angeles | en | 805-688-5452 | https://www.syvt.com/365/Fares | None | aHR0cDovL2FwcC5tZWNhdHJhbi5jb20vdXJiL3dzL2ZlZW... | 2025-08-22 | 2025-08-22 03:01:30.152012+00:00 | 1 | America/Los_Angeles |
| 2 | 717bb3b1bd0150566a56ad2ff3863e45 | 129a2f178826f860f0980d83c817ce80 | ddf18453f08dcb81ca1c289e898208fa | 2 | County Connection | http://www.countyconnection.com/ | America/Los_Angeles | EN | 925-676-7500 | None | None | aHR0cDovL2NjY3RhLm9yZy9HVEZTL2dvb2dsZV90cmFuc2... | 2025-08-22 | 2025-08-22 03:01:30.152012+00:00 | 1 | America/Los_Angeles |
| 3 | e284bfc0be29dbca3ca582bb23ce0347 | ea0cccf2e3f7df4d11ae9a3b44f812fc | 9dfc78fecf29459474750f956d618054 | 2 | County Connection | http://www.countyconnection.com/ | America/Los_Angeles | EN | 925-676-7500 | None | None | aHR0cDovL2NjY3RhLm9yZy9HVEZTL2dvb2dsZV90cmFuc2... | 2025-11-14 | 2025-11-14 20:41:55.341559+00:00 | 1 | America/Los_Angeles |
| 4 | 88fcbf819314fa35fdacfafa6df28f4f | bd521602ac94d921c7d5e3f994881844 | f8138ec6bdb66abd965c2aa5567d5f69 | 2 | County Connection | http://www.countyconnection.com/ | America/Los_Angeles | EN | 925-676-7500 | None | None | aHR0cDovL2NjY3RhLm9yZy9HVEZTL2dvb2dsZV90cmFuc2... | 2025-10-06 | 2025-10-06 20:09:45.687014+00:00 | 1 | America/Los_Angeles |

Accessing Geospatial Resources#

GCSGeoPandas#

The GCSGeoPandas class wraps the Google Cloud Storage (GCS) filesystem and exposes functions that give analysts a clear, consistent way of accessing geospatial resources.

It’s recommended to memoize initialization of the class so that the GCS filesystem is fetched and cached the first time you call it and subsequent calls can reuse that cached filesystem.

from functools import cache

from calitp_data_analysis.gcs_geopandas import GCSGeoPandas

@cache
def gcs_geopandas():
    return GCSGeoPandas()

read_parquet#

Delegates to geopandas.read_parquet, providing the GCS filesystem

gcs_geopandas().read_parquet("gs://path/to/your/file.parquet")

read_file#

Delegates to geopandas.read_file, reading the file at the specified path from the GCS filesystem

gcs_geopandas().read_file("gs://path/to/your/file.geojson")

geo_data_frame_to_parquet#

Delegates to GeoDataFrame.to_parquet, providing the GCS filesystem

import geopandas as gpd

from shapely.geometry import Point

# Example data with shapely Point geometries (any geometry type works)
data = {'col1': ['name1', 'name2'], 'geometry': [Point(-121.49, 38.58), Point(-118.24, 34.05)]}
geo_data_frame = gpd.GeoDataFrame(data, crs="EPSG:4326")
gcs_geopandas().geo_data_frame_to_parquet(geo_data_frame, "gs://path/to/your/file.parquet")

Querying Data in BigQuery#

SQLAlchemy#

For more detailed queries, or where a query builder pattern is needed for multiple use cases of the same base query, you can use SQLAlchemy’s ORM functionality. One of the biggest benefits of an ORM like SQLAlchemy is its security features, which mitigate risks common to writing raw SQL; accidentally introducing SQL injection vulnerabilities is one of the most common issues that ORMs help prevent. Other benefits include more manageable and testable code, and ORM models serve as an easy-to-reference description of your tables, collocated with your code.

Steps to querying with the ORM

  • Declare a model. For each of the tables you need to query, you will need to define a model if there isn’t one already (a minimal sketch of a model declaration follows this list).

  • Construct a query. Full documentation on query construction here.

  • Establish a DB connection and send the query.

    import pandas as pd

    from calitp_data_analysis.sql import get_engine
    from sqlalchemy import select
    from sqlalchemy.orm import sessionmaker
    
    db_engine = get_engine(project="cal-itp-data-infra-staging")
    DBSession = sessionmaker(db_engine)
    
    ...
    
    statement = select(YourModel).where(YourModel.service_date == "2025-12-01")
    
    with DBSession() as session:
        return pd.read_sql(statement, session.bind)
    

    See SQLAlchemy Session Basics for more info.
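
As referenced in step 1 above, here is a minimal, hypothetical sketch of what a model declaration can look like; the table, schema, and column names are illustrative assumptions rather than a real warehouse model:

from sqlalchemy import Column, Date, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()

# Hypothetical model backing the query above; swap in a real warehouse table
class YourModel(Base):
    __tablename__ = "your_table"              # table name within the dataset
    __table_args__ = {"schema": "mart_gtfs"}  # the BigQuery dataset acts as the schema

    key = Column(String, primary_key=True)
    service_date = Column(Date)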

pandas#

The library pandas is very commonly used in data analysis. See this Pandas Cheat Sheet for more inspiration.
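
As a quick, illustrative sketch (the data below is made up), typical pandas operations include filtering, grouping, and aggregating a DataFrame such as one returned by query_sql():

import pandas as pd

# Made-up example data standing in for a query result
df = pd.DataFrame({
    "feed_key": ["a", "b", "a", "c"],
    "n_routes": [10, 4, 12, 7],
})

# Keep the larger feeds, total routes per feed, then sort descending
summary = (
    df[df["n_routes"] > 5]
    .groupby("feed_key")["n_routes"]
    .sum()
    .sort_values(ascending=False)
    .reset_index()
)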

siuba (DEPRECATED)#

DEPRECATED: siuba will be removed from calitp_data_analysis after version 2025.8.10, and from other shared code within the next few months. Use SQLAlchemy and pandas or GeoPandas functions directly instead; siuba uses these under the hood.

siuba is a tool that allows the same analysis code to run on a pandas DataFrame, as well as generate SQL for different databases. It supports most pandas Series methods analysts use. See the siuba docs for more information.

The examples below go through the basics of using siuba, collecting a database query to a local DataFrame, and showing SQL test queries that siuba code generates.

Basic Queries#

query_sql#

from calitp_data_analysis.sql import query_sql

# query agency information, then filter for a single gtfs feed,
# and then count how often each feed key occurs
query_sql(
    """
    SELECT feed_key, COUNT(*) AS n
    FROM mart_gtfs.dim_agency
    WHERE agency_id = 'BA' AND base64_url = 'aHR0cHM6Ly9hcGkuNTExLm9yZy90cmFuc2l0L2RhdGFmZWVkcz9vcGVyYXRvcl9pZD1SRw=='
    GROUP BY feed_key
    ORDER BY n DESC
    LIMIT 5
    """
)

siuba equivalent (DEPRECATED)#

DEPRECATED: We are removing siuba from the codebase. This is only here for illustration. Please do not introduce new siuba code.

from calitp_data_analysis.tables import tbls
from siuba import _, filter, count, collect, show_query

# query agency information, then filter for a single gtfs feed,
# and then count how often each feed key occurs
(tbls.mart_gtfs.dim_agency()
    >> filter(_.agency_id == 'BA', _.base64_url == 'aHR0cHM6Ly9hcGkuNTExLm9yZy90cmFuc2l0L2RhdGFmZWVkcz9vcGVyYXRvcl9pZD1SRw==')
    >> count(_.feed_key)
)

filter & collect#

query_sql#

Use SQL with SELECT and WHERE clauses for filtering

from calitp_data_analysis.sql import query_sql

agencies = query_sql(
    """
    SELECT * 
    FROM cal-itp-data-infra.mart_ntd.dim_annual_service_agencies
    WHERE state = 'CA' AND report_year = 2023
    """,
    as_df=True
)

# Use pandas `head()` method to show first 5 rows of data
agencies.head()

siuba equivalent (DEPRECATED)#

Note that siuba by default prints out a preview of the SQL query results. In order to fetch the results of the query as a pandas DataFrame, run collect().

from calitp_data_analysis.tables import tbls
from siuba import _, collect, filter

annual_service_agencies = (
    tbls.mart_ntd.dim_annual_service_agencies()
    >> filter(_.state == "CA", _.report_year == 2023)
    >> collect()
)

select#

pandas/geopandas#

import geopandas as gpd

blocks = gpd.read_file('./tl_2020_06_tabblock20.zip')
blocks = blocks[['GEOID20', 'POP20', 'HOUSING20', 'geometry']]

siuba equivalent (DEPRECATED)#

import geopandas as gpd
from siuba import _, select


blocks = gpd.read_file('./tl_2020_06_tabblock20.zip')
blocks = blocks >> select(_.GEOID20, _.POP20, _.HOUSING20, _.geometry)

rename#

pandas/geopandas#

import geopandas as gpd

blocks = gpd.read_file('./tl_2020_06_tabblock20.zip')
blocks = blocks.rename(columns={'GEOID20': 'GEO_ID'})

siuba equivalent (DEPRECATED)#

import geopandas as gpd
from siuba import _, rename

blocks = gpd.read_file('./tl_2020_06_tabblock20.zip')
blocks = blocks >> rename(GEO_ID = _.GEOID20)

inner_join#

pandas/geopandas#

The code below will result in a table that has both GEO_ID and w_geocode columns with redundant data. To avoid this, you could either (a) first rename one of the columns to match the other and do a simpler merge using just the on parameter (no need then for left_on and right_on), or (b) do as shown below and subsequently drop one of the redundant columns. Both alternatives are sketched after the join example.

# df and df2 are both DataFrames
joined = df2.merge(df, left_on='GEO_ID', right_on='w_geocode')
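
A minimal sketch of the two alternatives described above, assuming df has a w_geocode column and df2 has a GEO_ID column:

# (a) rename one key to match the other, then merge on a single column
joined = df2.rename(columns={'GEO_ID': 'w_geocode'}).merge(df, on='w_geocode')

# (b) merge on both keys, then drop the redundant column afterwards
joined = df2.merge(df, left_on='GEO_ID', right_on='w_geocode').drop(columns='w_geocode')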

siuba equivalent (DEPRECATED)#

from siuba import _, inner_join

# df and df2 are both DataFrames
joined = df2 >> inner_join(_, df, on={'GEO_ID':'w_geocode'})

Advanced Queries#

If you need to construct a more detailed query or are implementing a query builder that has multiple use cases, you should probably use SQLAlchemy ORM’s query building features.

Reusable base query#

Here’s an example adapted from shared_utils.schedule_rt_utils.filter_dim_gtfs_datasets:

# Simplified from shared_utils.schedule_rt_utils.filter_dim_gtfs_datasets
from typing import Union

import pandas as pd
import sqlalchemy
from sqlalchemy import and_, select

# DimGtfsDataset (a declared model) and DBSession (a sessionmaker) are assumed
# to be defined elsewhere, as in the SQLAlchemy section above.

def filter_dim_gtfs_datasets(
    keep_cols: list[str] = ["key", "name", "type", "regional_feed_type", "uri", "base64_url"],
    custom_filtering: dict = None,
    get_df: bool = True,
) -> Union[pd.DataFrame, sqlalchemy.sql.selectable.Select]:
    """
    Filter mart_transit_database.dim_gtfs_dataset table
    and keep only the valid rows that passed data quality checks.
    """
    dim_gtfs_dataset_columns = []

    for column in keep_cols:
        new_column = getattr(DimGtfsDataset, column)
        dim_gtfs_dataset_columns.append(new_column)

    search_conditions = [DimGtfsDataset.data_quality_pipeline == True]

    for k, v in (custom_filtering or {}).items():
        search_conditions.append(getattr(DimGtfsDataset, k).in_(v))

    statement = select(*dim_gtfs_dataset_columns).where(and_(*search_conditions))

    if get_df:
        with DBSession() as session:
            return pd.read_sql(statement, session.bind)
    else:
        return statement

Use of this function for query building

# Simplified from shared_utils.gtfs_utils_v2.schedule_daily_feed_to_gtfs_dataset_name
# (assumes the imports and models above, plus a FctDailyScheduleFeed model and a selected_date value)

dim_gtfs_datasets = schedule_rt_utils.filter_dim_gtfs_datasets(
    keep_cols=["key", "name", "type", "regional_feed_type"],
    custom_filtering={"type": ["schedule"]},
    get_df=False, # return a SQLAlchemy statement so that you can continue to build the query
)

statement = (
    dim_gtfs_datasets.with_only_columns(
        DimGtfsDataset.regional_feed_type,
        DimGtfsDataset.type,
        FctDailyScheduleFeed.date,
        FctDailyScheduleFeed.feed_key,
        FctDailyScheduleFeed.gtfs_dataset_key,
        FctDailyScheduleFeed.gtfs_dataset_name,
    )
    .join(
        FctDailyScheduleFeed,
        and_(
            FctDailyScheduleFeed.gtfs_dataset_key == DimGtfsDataset.key,
            FctDailyScheduleFeed.gtfs_dataset_name == DimGtfsDataset.name,
        ),
    )
    .where(FctDailyScheduleFeed.date == selected_date)
)

with DBSession() as session:
    return pd.read_sql(statement, session.bind)

Useful functions for query building#

  • Select.add_columns - Add columns to an existing Select clause (see the short sketch at the end of this list)

  • Select.with_only_columns - Replace columns in an existing Select clause

  • Column.label - Provide an alias for a column you’re selecting. Results in SQL like <columnname> AS <name>.

  • expression.func - Generate SQL function expressions. This is useful for when you need to use a BigQuery-specific expression that is not generically supported in SQLAlchemy. For example:

    from sqlalchemy import String, cast, func, select

    ...

    select(
        # Produces a column like {"caltrans_district": "07 - Los Angeles / Ventura"}
        func.concat(
            func.lpad(cast(DimCountyGeography.caltrans_district, String), 2, "0"),
            " - ",
            DimCountyGeography.caltrans_district_name,
        ).label("caltrans_district"),
    ).where(...)
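
    A short sketch of Select.add_columns, reusing the DimCountyGeography model assumed in the example above:

    statement = select(DimCountyGeography.caltrans_district_name)
    # Add another column to the existing Select without rebuilding it
    statement = statement.add_columns(DimCountyGeography.caltrans_district)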
    

Add New Packages#

While most Python packages an analyst needs come preinstalled in JupyterHub, there may be additional packages you’ll want to use in your analysis.

  • Install the shared utility functions (see shared utils above)

  • Change directory into the project task’s subfolder and add a requirements.txt and/or conda-requirements.txt file listing the new packages

  • Run pip install -r requirements.txt and/or conda install --yes -c conda-forge --file conda-requirements.txt

Updating calitp-data-analysis#

calitp-data-analysis is a package that lives here in the data-infra repo. Follow the steps below to update the package.

Steps

Adapted from this and this Slack thread.

  1. Make the changes you want in the calitp-data-analysis folder inside packages here. If you are only changing package metadata (author information, package description, etc.) without changing the function of the package itself, that information lives in pyproject.toml rather than in the calitp_data_analysis source subfolder.

    • If you are adding a new function that relies on a package that isn’t already a dependency, run poetry add <package name> after changing directories to data-infra/packages/calitp-data-analysis. Check this Jupyter image file for the version number associated with the package, because you should specify the version.

    • For example, suppose your function relies on dask. In the Jupyter image file, the version is dask = "~2022.8", so run poetry add dask@~2022.8 in the terminal.

    • You may also need mypy, a package that audits Python files for information related to data types; you can read more about it here. mypy is one of the standard development dependencies for the calitp-data-analysis package, defined in the pyproject.toml file for the package, so to install it you can run poetry install in packages/calitp-data-analysis/ (which will also install the other package dependencies). To use mypy, run poetry run mypy [file or directory name you're interested in checking].

    • mypy is generally run in CI when a PR is opened, as part of build tasks. You can see it called here for this package in build-calitp-data-analysis.yml. Within the PR, the “Test, visualize, and build calitp-data-analysis” CI workflow will fail if problems are found.

    • More helpful hints for passing mypy in our repo.

  2. Each time you update the package, you must also update the version number. We use dates to reflect which version we are on. Update the version in pyproject.toml that lives in calitp-data-analysis to either today’s date or a future date.

  3. Open a new pull request and make sure the new version date appears on the test version page.

    • The new version date may not show up on the test page due to errors. Check the GitHub Action page of your pull request to see the errors that have occurred.

    • If you run into an error message like error: Skipping analyzing "dask_geopandas": module is installed, but missing library stubs or py.typed marker  [import], go to the .py file where the package is imported and add # type: ignore after the import.

    • To fix the error above, change import dask_geopandas as dg to import dask_geopandas as dg  # type: ignore.

    • It is encouraged to make changes in a set of smaller commits. For example, add all the necessary packages with poetry add <package> first, fix any issues flagged by mypy, and finally address any additional issues.

  4. Merge the PR. Once it is merged in, the actual package will display the new version number. To make sure everything works as expected, run pip install calitp-data-analysis==<new version here> in a cell of a Jupyter notebook and import a module (or two), such as from calitp_data_analysis import styleguide.

  5. Update the new version number in the data-infra repository here, here, here, and anywhere else you find a reference to the old version of the package. You’ll also want to do the same for any other Cal-ITP repositories that reference the calitp-data-analysis package.

    • When you update the jupyter-singleuser toml, make sure to run poetry add calitp-data-analysis==<new version here> and commit the updated poetry.lock file.

    • As of writing, the only other repository that references the package version is reports.


Appendix: calitp-data-infra#

The calitp-data-infra package, used primarily by warehouse maintainers and data pipeline developers, includes utilities that analysts will likely not need to interact with directly (and therefore generally won’t need to install), but which may be helpful to be aware of. For instance, the get_secret_by_name() and get_secrets_by_label() functions in the package’s auth module are used to interact with Google’s Secret Manager, the service that securely stores API keys and other sensitive information that underpins many of our data syncs.

You can read more about the calitp-data-infra Python package here.