Useful Python Libraries

The following libraries are available and recommended for use by Cal-ITP data analysts. Our JupyterHub environment comes with all of these installed already, except for calitp-data-infra. A full list of the packages included in the system image that underpins our JupyterHub environment can be found (and updated when needed) here.

shared utils

A set of shared utility functions can also be installed, similarly to any Python library. The shared_utils are stored in two places: here in data-analyses, which houses functions that are more likely to be updated. Shared functions that are updated less frequently are housed here in the calitp_data_analysis package in data-infra. Generalized functions for analysis are added as collaborative work evolves so we aren’t constantly reinventing the wheel.

In terminal

  • Navigate to the package folder: cd data-analyses/_shared_utils

  • Use the make command to run through conda install and pip install: make setup_env

    • Note: you may need to select Kernel -> Restart Kernel from the top menu after make setup_env in order to successfully import shared_utils

  • Alternative: add an alias to your .bash_profile:

    • In terminal use cd to navigate to the home directory (not a repository)

    • Type nano .bash_profile to open the .bash_profile in a text editor

    • Add a line at end: alias go='cd ~/data-analyses/portfolio && pip install -r requirements.txt && cd ../_shared_utils && make setup_env && cd ..'

    • Exit with Ctrl+X, hit yes, then hit enter at the filename prompt

    • Restart your server; you can check your changes with cat .bash_profile

In notebook

from calitp_data_analysis import geography_utils

geography_utils.WGS84

See data-analyses/starter_kit for examples on how to use shared_utils for general functions, charts, and maps.

calitp-data-analysis

calitp-data-analysis is an internal library of utility functions used to access our warehouse data for analysis purposes.

import tbls

Most notably, you can include import tbls at the top of your notebook to import a table from the warehouse in the form of a tbls:

from calitp_data_analysis.tables import tbls

Example:

from calitp_data_analysis.tables import tbls

tbls.mart_gtfs.dim_agency()
# Source: lazy query
# DB Conn: Engine(bigquery://cal-itp-data-infra/?maximum_bytes_billed=5000000000)
# Preview:
key _gtfs_key feed_key agency_id agency_name agency_url agency_timezone agency_lang agency_phone agency_fare_url agency_email base64_url _dt _feed_valid_from _line_number feed_timezone
0 56112d81d62d71d408ea0247309178e4 090343c89ab8fe1d5e3b79e5687bbcca 21fa0b125d801eb5058da2ec5d748bda LAX FlyAway None http://www.LAXFlyAway.org America/Los_Angeles en (714) 507-1170 None None aHR0cHM6Ly93d3cuZmx5bGF4LmNvbS8tL21lZGlhL2ZseW... 2022-09-15 2022-09-15 14:59:54.274779+00:00 1 America/Los_Angeles
1 282b4e07171c48ac08162e0dc8749066 18fcf47148228c7b56c19726c854054b 9b8e9b8befe560293a6c5b38dc19ffbb MT None http://www.mantecatransit.com America/Los_Angeles en (209) 456-8000 None None aHR0cHM6Ly93d3cuY2kubWFudGVjYS5jYS51cy9Db21tdW... 2021-07-28 2021-07-28 13:56:16+00:00 1 America/Los_Angeles
2 1d48492cd90930d6612fc70ef18bf8d4 93c44e614219c41696a8148018c1d83b f87ac9abe81a7548325423ede69c3b86 LAX FlyAway None http://www.LAXFlyAway.org America/Los_Angeles en (714) 507-1170 None None aHR0cHM6Ly93d3cuZmx5bGF4LmNvbS8tL21lZGlhL2ZseW... 2022-09-09 2022-09-09 00:00:56+00:00 1 America/Los_Angeles
3 fdbd4bd9cc406a651ec80dba6ac4a1c1 ebfb64604b1c18bdf8ab91a2e4394491 ba587fd71b4f9a956ff36e309e6acf3f LAX FlyAway None http://www.LAXFlyAway.org America/Los_Angeles en (714) 507-1170 None None aHR0cHM6Ly93d3cuZmx5bGF4LmNvbS8tL21lZGlhL2ZseW... 2022-05-14 2022-05-14 00:01:18+00:00 1 America/Los_Angeles
4 033d018b031002914fbfa48c872d4a65 8f5e204872e6b7034479a6d11502547b 3d30b02b9008f788eb89c41c137786c1 MT None http://www.mantecatransit.com America/Los_Angeles en (209) 456-8000 None None aHR0cHM6Ly93d3cuY2kubWFudGVjYS5jYS51cy9Db21tdW... 2022-01-19 2022-01-19 00:11:01+00:00 1 America/Los_Angeles

# .. may have more rows

query_sql

query_sql is another useful function to use inside of JupyterHub notebooks to turn a SQL query into a pandas DataFrame.

As an example, in a notebook:

from calitp_data_analysis.sql import query_sql
df_dim_agency = query_sql("""
SELECT
    *
FROM `mart_gtfs.dim_agency`
LIMIT 10""", as_df=True)
df_dim_agency.head()
key _gtfs_key feed_key agency_id agency_name agency_url agency_timezone agency_lang agency_phone agency_fare_url agency_email base64_url _dt _feed_valid_from _line_number feed_timezone
0 56112d81d62d71d408ea0247309178e4 090343c89ab8fe1d5e3b79e5687bbcca 21fa0b125d801eb5058da2ec5d748bda LAX FlyAway None http://www.LAXFlyAway.org America/Los_Angeles en (714) 507-1170 None None aHR0cHM6Ly93d3cuZmx5bGF4LmNvbS8tL21lZGlhL2ZseW... 2022-09-15 2022-09-15 14:59:54.274779+00:00 1 America/Los_Angeles
1 282b4e07171c48ac08162e0dc8749066 18fcf47148228c7b56c19726c854054b 9b8e9b8befe560293a6c5b38dc19ffbb MT None http://www.mantecatransit.com America/Los_Angeles en (209) 456-8000 None None aHR0cHM6Ly93d3cuY2kubWFudGVjYS5jYS51cy9Db21tdW... 2021-07-28 2021-07-28 13:56:16+00:00 1 America/Los_Angeles
2 1d48492cd90930d6612fc70ef18bf8d4 93c44e614219c41696a8148018c1d83b f87ac9abe81a7548325423ede69c3b86 LAX FlyAway None http://www.LAXFlyAway.org America/Los_Angeles en (714) 507-1170 None None aHR0cHM6Ly93d3cuZmx5bGF4LmNvbS8tL21lZGlhL2ZseW... 2022-09-09 2022-09-09 00:00:56+00:00 1 America/Los_Angeles
3 fdbd4bd9cc406a651ec80dba6ac4a1c1 ebfb64604b1c18bdf8ab91a2e4394491 ba587fd71b4f9a956ff36e309e6acf3f LAX FlyAway None http://www.LAXFlyAway.org America/Los_Angeles en (714) 507-1170 None None aHR0cHM6Ly93d3cuZmx5bGF4LmNvbS8tL21lZGlhL2ZseW... 2022-05-14 2022-05-14 00:01:18+00:00 1 America/Los_Angeles
4 033d018b031002914fbfa48c872d4a65 8f5e204872e6b7034479a6d11502547b 3d30b02b9008f788eb89c41c137786c1 MT None http://www.mantecatransit.com America/Los_Angeles en (209) 456-8000 None None aHR0cHM6Ly93d3cuY2kubWFudGVjYS5jYS51cy9Db21tdW... 2022-01-19 2022-01-19 00:11:01+00:00 1 America/Los_Angeles

siuba

siuba is a tool that allows the same analysis code to run on a pandas DataFrame, as well as generate SQL for different databases. It supports most pandas Series methods analysts use. See the siuba docs for more information.

The examples below go through the basics of using siuba, collecting a database query to a local DataFrame, and showing SQL test queries that siuba code generates.

Basic query

from calitp_data_analysis.tables import tbls
from siuba import _, filter, count, collect, show_query

# query agency information, then filter for a single gtfs feed,
# and then count how often each feed key occurs
(tbls.mart_gtfs.dim_agency()
    >> filter(_.agency_id == 'BA', _.base64_url == 'aHR0cHM6Ly9hcGkuNTExLm9yZy90cmFuc2l0L2RhdGFmZWVkcz9vcGVyYXRvcl9pZD1SRw==')
    >> count(_.feed_key)
)
# Source: lazy query
# DB Conn: Engine(bigquery://cal-itp-data-infra/?maximum_bytes_billed=5000000000)
# Preview:
feed_key n
0 5ad87901f329be154545fec771aa2c21 1
1 b01a3ce6ba2e972a6acc08574a3e06a9 1
2 023ec6a2b6246ad996abcb5aefedc068 1
3 a64532640829f3eae9129cd5d5e9590b 1
4 f4095db9282b859842d338f5db032561 1

# .. may have more rows

Collect query results

Note that siuba by default prints out a preview of the SQL query results. In order to fetch the results of the query as a pandas DataFrame, run collect().

tbl_agency_names = tbls.mart_gtfs.dim_agency() >> collect()

# Use pandas .head() method to show first 5 rows of data
tbl_agency_names.head()
key _gtfs_key feed_key agency_id agency_name agency_url agency_timezone agency_lang agency_phone agency_fare_url agency_email base64_url _dt _feed_valid_from _line_number feed_timezone
0 56112d81d62d71d408ea0247309178e4 090343c89ab8fe1d5e3b79e5687bbcca 21fa0b125d801eb5058da2ec5d748bda LAX FlyAway None http://www.LAXFlyAway.org America/Los_Angeles en (714) 507-1170 None None aHR0cHM6Ly93d3cuZmx5bGF4LmNvbS8tL21lZGlhL2ZseW... 2022-09-15 2022-09-15 14:59:54.274779+00:00 1 America/Los_Angeles
1 282b4e07171c48ac08162e0dc8749066 18fcf47148228c7b56c19726c854054b 9b8e9b8befe560293a6c5b38dc19ffbb MT None http://www.mantecatransit.com America/Los_Angeles en (209) 456-8000 None None aHR0cHM6Ly93d3cuY2kubWFudGVjYS5jYS51cy9Db21tdW... 2021-07-28 2021-07-28 13:56:16+00:00 1 America/Los_Angeles
2 1d48492cd90930d6612fc70ef18bf8d4 93c44e614219c41696a8148018c1d83b f87ac9abe81a7548325423ede69c3b86 LAX FlyAway None http://www.LAXFlyAway.org America/Los_Angeles en (714) 507-1170 None None aHR0cHM6Ly93d3cuZmx5bGF4LmNvbS8tL21lZGlhL2ZseW... 2022-09-09 2022-09-09 00:00:56+00:00 1 America/Los_Angeles
3 fdbd4bd9cc406a651ec80dba6ac4a1c1 ebfb64604b1c18bdf8ab91a2e4394491 ba587fd71b4f9a956ff36e309e6acf3f LAX FlyAway None http://www.LAXFlyAway.org America/Los_Angeles en (714) 507-1170 None None aHR0cHM6Ly93d3cuZmx5bGF4LmNvbS8tL21lZGlhL2ZseW... 2022-05-14 2022-05-14 00:01:18+00:00 1 America/Los_Angeles
4 033d018b031002914fbfa48c872d4a65 8f5e204872e6b7034479a6d11502547b 3d30b02b9008f788eb89c41c137786c1 MT None http://www.mantecatransit.com America/Los_Angeles en (209) 456-8000 None None aHR0cHM6Ly93d3cuY2kubWFudGVjYS5jYS51cy9Db21tdW... 2022-01-19 2022-01-19 00:11:01+00:00 1 America/Los_Angeles

Show query SQL

While collect() fetches query results, show_query() prints out the SQL code that siuba generates.

(tbls.mart_gtfs.dim_agency()
  >> filter(_.agency_name.str.contains("Metro"))
  >> show_query(simplify=True)
)
SELECT * 
FROM `mart_gtfs.dim_agency` AS `mart_gtfs.dim_agency_1` 
WHERE regexp_contains(`mart_gtfs.dim_agency_1`.`agency_name`, 'Metro')
# Source: lazy query
# DB Conn: Engine(bigquery://cal-itp-data-infra/?maximum_bytes_billed=5000000000)
# Preview:
key _gtfs_key feed_key agency_id agency_name agency_url agency_timezone agency_lang agency_phone agency_fare_url agency_email base64_url _dt _feed_valid_from _line_number feed_timezone
0 a3ecd303b4eaf59d937631c943c0abfe bfe22848a94f7cff936847116d54445c ae20d23beb02c5331d1f7087f54bef7d 960 Madera Metro https://www.madera.gov/home/departments/transi... America/Los_Angeles en (559) 661-RIDE (7433) https://www.madera.gov/home/departments/transi... None aHR0cHM6Ly9kYXRhLnRyaWxsaXVtdHJhbnNpdC5jb20vZ3... 2023-04-14 2023-04-14 03:00:18.458064+00:00 1 America/Los_Angeles
1 b22bc5c21e3e90bef6e0fc91c66199db f3271fec250502a41da3848717b9eeaf 149ad0155813993026db844421248c4a 960 Madera Metro https://www.cityofmadera.ca.gov/home/departmen... America/Los_Angeles en (559) 661-RIDE (7433) https://www.cityofmadera.ca.gov/wp-content/upl... None aHR0cHM6Ly9kYXRhLnRyaWxsaXVtdHJhbnNpdC5jb20vZ3... 2021-08-04 2021-08-04 00:06:42+00:00 1 America/Los_Angeles
2 5510fd5cfd5718d36c5cea9efc61b05e 267a1b2a19b0526a5086cded02ba3325 2716d91a7c0f05036fd7d593a6980ec4 960 Madera Metro https://www.cityofmadera.ca.gov/home/departmen... America/Los_Angeles en (559) 661-RIDE (7433) https://www.cityofmadera.ca.gov/wp-content/upl... None aHR0cHM6Ly9kYXRhLnRyaWxsaXVtdHJhbnNpdC5jb20vZ3... 2022-12-16 2022-12-16 03:00:27.604217+00:00 1 America/Los_Angeles
3 9a395f3d167d2c0088ebf8dd3d2e1da9 75ddeb651b8c6b5039add08b71e1464a 3f5795f479e908e5ea99de4a22bb2500 960 Madera Metro https://www.cityofmadera.ca.gov/home/departmen... America/Los_Angeles en (559) 661-RIDE (7433) https://www.cityofmadera.ca.gov/wp-content/upl... None aHR0cHM6Ly9kYXRhLnRyaWxsaXVtdHJhbnNpdC5jb20vZ3... 2022-07-19 2022-07-19 00:00:43+00:00 1 America/Los_Angeles
4 8a786d43a3c1e8e620355a06d50a31b6 36fde7eedc560e48670dcde339e7e93d 9cb8e2f18ab3055f8b927f01ad57045b 960 Madera Metro https://www.cityofmadera.ca.gov/home/departmen... America/Los_Angeles en (559) 661-RIDE (7433) https://www.cityofmadera.ca.gov/wp-content/upl... None aHR0cHM6Ly9kYXRhLnRyaWxsaXVtdHJhbnNpdC5jb20vZ3... 2021-06-24 2021-06-24 00:09:42+00:00 1 America/Los_Angeles

# .. may have more rows

Note that here the pandas Series method str.contains corresponds to regexp_contains in Google BigQuery.

pandas

The library pandas is very commonly used in data analysis, and the external resources below provide a brief overview of it’s use.

Add New Packages

While most Python packages an analyst uses come in JupyterHub, there may be additional packages you’ll want to use in your analysis.

  • Install shared utility functions

  • Change directory into the project task’s subfolder and add requirements.txt and/or conda-requirements.txt

  • Run pip install -r requirements.txt and/or conda install --yes -c conda-forge --file conda-requirements.txt

Appendix: calitp-data-infra

The calitp-data-infra package, used primarily by warehouse mainainers and data pipeline developers, includes utilities that analysts will likely not need need to interact with directly (and therefore generally won’t need to install), but which may be helpful to be aware of. For instance, the get_secret_by_name() and get_secrets_by_label() functions in the package’s auth module are used to interact with Google’s Secret Manager, the service that securely stores API keys and other sensitive information that underpins many of our data syncs.

You can read more about the calitp-data-infra Python package here.