Useful Python Libraries#
The following libraries are available and recommended for use by Cal-ITP data analysts. Our JupyterHub environment comes with all of these installed already, except for calitp-data-infra. A full list of the packages included in the system image that underpins our JupyterHub environment can be found (and updated when needed) here.
calitp-data-analysis#
calitp-data-analysis is an internal library of utility functions used to access our warehouse data for analysis purposes.
import tbls#
DEPRECATED: tbls will be removed after calitp_data_analysis version 2025.8.10. Instead of AutoTable or the tbls
instance, use query_sql() from calitp_data_analysis.sql to connect to and query a SQL database. See the query_sql()
section below.
Most notably, you can include the `tbls` import at the top of your notebook to access warehouse tables through the `tbls` object:
from calitp_data_analysis.tables import tbls
Example:
from calitp_data_analysis.tables import tbls
tbls.mart_gtfs.dim_agency()
query_sql#
query_sql() is a useful function to connect to and query a SQL database. You can optionally pass as_df=True
to turn a SQL query result into a pandas DataFrame.
As an example, in a notebook:
from calitp_data_analysis.sql import query_sql
df_dim_agency = query_sql("""
SELECT
*
FROM `mart_gtfs.dim_agency`
LIMIT 10""", as_df=True)
df_dim_agency.head()
| | key | _gtfs_key | feed_key | agency_id | agency_name | agency_url | agency_timezone | agency_lang | agency_phone | agency_fare_url | agency_email | base64_url | _dt | _feed_valid_from | _line_number | feed_timezone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | fdbd4bd9cc406a651ec80dba6ac4a1c1 | ebfb64604b1c18bdf8ab91a2e4394491 | ba587fd71b4f9a956ff36e309e6acf3f | LAX FlyAway | None | http://www.LAXFlyAway.org | America/Los_Angeles | en | (714) 507-1170 | None | None | aHR0cHM6Ly93d3cuZmx5bGF4LmNvbS8tL21lZGlhL2ZseW... | 2022-05-14 | 2022-05-14 00:01:18+00:00 | 1 | America/Los_Angeles |
| 1 | 033d018b031002914fbfa48c872d4a65 | 8f5e204872e6b7034479a6d11502547b | 3d30b02b9008f788eb89c41c137786c1 | MT | None | http://www.mantecatransit.com | America/Los_Angeles | en | (209) 456-8000 | None | None | aHR0cHM6Ly93d3cuY2kubWFudGVjYS5jYS51cy9Db21tdW... | 2022-01-19 | 2022-01-19 00:11:01+00:00 | 1 | America/Los_Angeles |
| 2 | a196a8996bb74b7bd6d92f0cc2802620 | 00e77c29f2ce1986407cb93f53936dff | 25915c089571c49be33b59e10c25d2ac | LAX FlyAway | None | http://www.LAXFlyAway.org | America/Los_Angeles | en | (714) 507-1170 | None | None | aHR0cHM6Ly93d3cuZmx5bGF4LmNvbS8tL21lZGlhL2ZseW... | 2022-09-15 | 2022-09-15 00:01:13+00:00 | 1 | America/Los_Angeles |
| 3 | 56112d81d62d71d408ea0247309178e4 | 090343c89ab8fe1d5e3b79e5687bbcca | 21fa0b125d801eb5058da2ec5d748bda | LAX FlyAway | None | http://www.LAXFlyAway.org | America/Los_Angeles | en | (714) 507-1170 | None | None | aHR0cHM6Ly93d3cuZmx5bGF4LmNvbS8tL21lZGlhL2ZseW... | 2022-09-15 | 2022-09-15 14:59:54.274779+00:00 | 1 | America/Los_Angeles |
| 4 | 1d48492cd90930d6612fc70ef18bf8d4 | 93c44e614219c41696a8148018c1d83b | f87ac9abe81a7548325423ede69c3b86 | LAX FlyAway | None | http://www.LAXFlyAway.org | America/Los_Angeles | en | (714) 507-1170 | None | None | aHR0cHM6Ly93d3cuZmx5bGF4LmNvbS8tL21lZGlhL2ZseW... | 2022-09-09 | 2022-09-09 00:00:56+00:00 | 1 | America/Los_Angeles |
GCSGeoPandas#
The GCSGeoPandas class fetches the Google Cloud Storage (GCS) filesystem and surfaces functions to provide analysts a clear and consistent way of accessing geospatial resources.
It’s recommended to memoize initialization of the class so that the GCS filesystem is fetched and cached the first time you call it and subsequent calls can reuse that cached filesystem.
from functools import cache
from calitp_data_analysis.gcs_geopandas import GCSGeoPandas
@cache
def gcs_geopandas():
return GCSGeoPandas()
read_parquet#
Delegates to geopandas.read_parquet, providing the GCS filesystem.
gcs_geopandas().read_parquet("gs://path/to/your/file.parquet")
read_file#
Delegates to geopandas.read_file, reading the file at the specified path in the GCS filesystem.
gcs_geopandas().read_file("gs://path/to/your/file.geojson")
geo_data_frame_to_parquet#
Delegates to GeoDataFrame.to_parquet, providing the GCS filesystem.
import geopandas as gpd
from shapely.geometry import Point

# A small GeoDataFrame with two point geometries (placeholders for real data)
data = {'col1': ['name1', 'name2'], 'geometry': [Point(1, 2), Point(2, 1)]}
geo_data_frame = gpd.GeoDataFrame(data, crs="EPSG:4326")
gcs_geopandas().geo_data_frame_to_parquet(geo_data_frame, "gs://path/to/your/file.parquet")
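Putting these methods together, a typical round trip might look like the sketch below. The GCS paths and the n_routes column are placeholders, not real resources:

# Reuse the cached GCS filesystem via the memoized helper defined above
gcs = gcs_geopandas()

# Read a GeoDataFrame from GCS, filter it, and write the result back to GCS
stops = gcs.read_parquet("gs://path/to/your/stops.parquet")
busy_stops = stops[stops["n_routes"] > 5]  # "n_routes" is a hypothetical column
gcs.geo_data_frame_to_parquet(busy_stops, "gs://path/to/your/busy_stops.parquet")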
siuba#
DEPRECATED: siuba will be removed from calitp_data_analysis after version 2025.8.10, and from other shared code within the
next few months. Use SQLAlchemy and pandas or GeoPandas functions directly instead; siuba uses these under the hood.
siuba is a tool that allows the same analysis code to run on a pandas DataFrame,
as well as generate SQL for different databases.
It supports most pandas Series methods analysts use. See the siuba docs for more information.
The examples below go through the basics of using siuba, collecting a database query into a local DataFrame, and showing the SQL queries that siuba code generates.
Basic query#
siuba#
from calitp_data_analysis.tables import tbls
from siuba import _, filter, count, collect, show_query
# query agency information, then filter for a single gtfs feed,
# and then count how often each feed key occurs
(tbls.mart_gtfs.dim_agency()
>> filter(_.agency_id == 'BA', _.base64_url == 'aHR0cHM6Ly9hcGkuNTExLm9yZy90cmFuc2l0L2RhdGFmZWVkcz9vcGVyYXRvcl9pZD1SRw==')
>> count(_.feed_key)
)
# Source: lazy query # DB Conn: Engine(bigquery://cal-itp-data-infra/?maximum_bytes_billed=5000000000) # Preview:
| | feed_key | n |
|---|---|---|
| 0 | 7eded7cbbf072a2fd9dd812f80bf2d2b | 1 |
| 1 | 4ef3f6b584321a4d1072204391b2a9f9 | 1 |
| 2 | b01a3ce6ba2e972a6acc08574a3e06a9 | 1 |
| 3 | f4095db9282b859842d338f5db032561 | 1 |
| 4 | 4697bb52eb4da8bcff925f503a623326 | 1 |
# .. may have more rows
Equivalent BigQuery SQL using query_sql()#
from calitp_data_analysis.sql import query_sql
# query agency information, then filter for a single gtfs feed,
# and then count how often each feed key occurs
query_sql(
"""
SELECT feed_key, COUNT(*) AS n
FROM mart_gtfs.dim_agency
WHERE agency_id = 'BA' AND base64_url = 'aHR0cHM6Ly9hcGkuNTExLm9yZy90cmFuc2l0L2RhdGFmZWVkcz9vcGVyYXRvcl9pZD1SRw=='
GROUP BY feed_key
ORDER BY n DESC
LIMIT 5
"""
)
Collect query results#
Note that siuba by default prints out a preview of the SQL query results.
In order to fetch the results of the query as a pandas DataFrame, run collect().
siuba#
tbl_agency_names = tbls.mart_gtfs.dim_agency() >> collect()
# Use pandas `head()` method to show first 5 rows of data
tbl_agency_names.head()
| | key | _gtfs_key | feed_key | agency_id | agency_name | agency_url | agency_timezone | agency_lang | agency_phone | agency_fare_url | agency_email | base64_url | _dt | _feed_valid_from | _line_number | feed_timezone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | fdbd4bd9cc406a651ec80dba6ac4a1c1 | ebfb64604b1c18bdf8ab91a2e4394491 | ba587fd71b4f9a956ff36e309e6acf3f | LAX FlyAway | None | http://www.LAXFlyAway.org | America/Los_Angeles | en | (714) 507-1170 | None | None | aHR0cHM6Ly93d3cuZmx5bGF4LmNvbS8tL21lZGlhL2ZseW... | 2022-05-14 | 2022-05-14 00:01:18+00:00 | 1 | America/Los_Angeles |
| 1 | 033d018b031002914fbfa48c872d4a65 | 8f5e204872e6b7034479a6d11502547b | 3d30b02b9008f788eb89c41c137786c1 | MT | None | http://www.mantecatransit.com | America/Los_Angeles | en | (209) 456-8000 | None | None | aHR0cHM6Ly93d3cuY2kubWFudGVjYS5jYS51cy9Db21tdW... | 2022-01-19 | 2022-01-19 00:11:01+00:00 | 1 | America/Los_Angeles |
| 2 | a196a8996bb74b7bd6d92f0cc2802620 | 00e77c29f2ce1986407cb93f53936dff | 25915c089571c49be33b59e10c25d2ac | LAX FlyAway | None | http://www.LAXFlyAway.org | America/Los_Angeles | en | (714) 507-1170 | None | None | aHR0cHM6Ly93d3cuZmx5bGF4LmNvbS8tL21lZGlhL2ZseW... | 2022-09-15 | 2022-09-15 00:01:13+00:00 | 1 | America/Los_Angeles |
| 3 | 56112d81d62d71d408ea0247309178e4 | 090343c89ab8fe1d5e3b79e5687bbcca | 21fa0b125d801eb5058da2ec5d748bda | LAX FlyAway | None | http://www.LAXFlyAway.org | America/Los_Angeles | en | (714) 507-1170 | None | None | aHR0cHM6Ly93d3cuZmx5bGF4LmNvbS8tL21lZGlhL2ZseW... | 2022-09-15 | 2022-09-15 14:59:54.274779+00:00 | 1 | America/Los_Angeles |
| 4 | 1d48492cd90930d6612fc70ef18bf8d4 | 93c44e614219c41696a8148018c1d83b | f87ac9abe81a7548325423ede69c3b86 | LAX FlyAway | None | http://www.LAXFlyAway.org | America/Los_Angeles | en | (714) 507-1170 | None | None | aHR0cHM6Ly93d3cuZmx5bGF4LmNvbS8tL21lZGlhL2ZseW... | 2022-09-09 | 2022-09-09 00:00:56+00:00 | 1 | America/Los_Angeles |
Equivalent BigQuery SQL using query_sql()#
from calitp_data_analysis.sql import query_sql
tbl_agency_names = query_sql('SELECT * FROM mart_gtfs.dim_agency', as_df=True)
# Use pandas `head()` method to show first 5 rows of data
tbl_agency_names.head()
Show query SQL#
While collect() fetches query results, show_query() prints out the SQL code that siuba generates.
(tbls.mart_gtfs.dim_agency()
>> filter(_.agency_name.str.contains("Metro"))
>> show_query(simplify=True)
)
SELECT *
FROM `mart_gtfs.dim_agency` AS `mart_gtfs.dim_agency_1`
WHERE regexp_contains(`mart_gtfs.dim_agency_1`.`agency_name`, 'Metro')
# Source: lazy query # DB Conn: Engine(bigquery://cal-itp-data-infra/?maximum_bytes_billed=5000000000) # Preview:
| | key | _gtfs_key | feed_key | agency_id | agency_name | agency_url | agency_timezone | agency_lang | agency_phone | agency_fare_url | agency_email | base64_url | _dt | _feed_valid_from | _line_number | feed_timezone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 8a786d43a3c1e8e620355a06d50a31b6 | 36fde7eedc560e48670dcde339e7e93d | 9cb8e2f18ab3055f8b927f01ad57045b | 960 | Madera Metro | https://www.cityofmadera.ca.gov/home/departmen... | America/Los_Angeles | en | (559) 661-RIDE (7433) | https://www.cityofmadera.ca.gov/wp-content/upl... | None | aHR0cHM6Ly9kYXRhLnRyaWxsaXVtdHJhbnNpdC5jb20vZ3... | 2021-06-24 | 2021-06-24 00:09:42+00:00 | 1 | America/Los_Angeles |
| 1 | 65ce81af96c6e1bbe1e8b34e9c6cd1bf | a51fc5abab3caa0f0adc09c2ad04ffe6 | 4f9efa4176247918b04f6c7ec3926b35 | 960 | Madera Metro | https://www.cityofmadera.ca.gov/home/departmen... | America/Los_Angeles | en | (559) 661-RIDE (7433) | https://www.cityofmadera.ca.gov/wp-content/upl... | None | aHR0cHM6Ly9kYXRhLnRyaWxsaXVtdHJhbnNpdC5jb20vZ3... | 2021-04-16 | 2021-04-16 00:01:13+00:00 | 1 | America/Los_Angeles |
| 2 | 0782ef40776b4ad8fba49ca86a375e5a | 621d37259d6a5a7f641fd5a6e8a747f7 | 30a2d5133c29abfbdc9e504bef2830e9 | 960 | Madera Metro | https://www.cityofmadera.ca.gov/home/departmen... | America/Los_Angeles | en | (559) 661-RIDE (7433) | https://www.cityofmadera.ca.gov/wp-content/upl... | None | aHR0cHM6Ly9kYXRhLnRyaWxsaXVtdHJhbnNpdC5jb20vZ3... | 2023-04-08 | 2023-04-08 03:00:20.134144+00:00 | 1 | America/Los_Angeles |
| 3 | 6ef00206641f97e0e6d2fd19f21ac9c0 | 56b61b5baea7c4768dd8939d3612296f | 1819d162df24993e08dd9d308edbad84 | 960 | Madera Metro | https://www.madera.gov/home/departments/transi... | America/Los_Angeles | en | (559) 661-RIDE (7433) | https://www.madera.gov/home/departments/transi... | None | aHR0cHM6Ly9kYXRhLnRyaWxsaXVtdHJhbnNpdC5jb20vZ3... | 2024-02-01 | 2024-02-01 03:00:21.894200+00:00 | 1 | America/Los_Angeles |
| 4 | 149712407d0b279211c623c2e52a3d3c | 513def24eef056315fb6ef128714f584 | 726a476a7c2917a2cd7c94a409c5d0c5 | 960 | Madera Metro | https://www.madera.gov/home/departments/transi... | America/Los_Angeles | en | (559) 661-RIDE (7433) | https://www.madera.gov/home/departments/transi... | None | aHR0cHM6Ly9kYXRhLnRyaWxsaXVtdHJhbnNpdC5jb20vZ3... | 2024-02-06 | 2024-02-06 03:00:59.720405+00:00 | 1 | America/Los_Angeles |
# .. may have more rows
Note that here the pandas Series method str.contains corresponds to regexp_contains in Google BigQuery.
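For comparison, once results are collected locally (as with tbl_agency_names above), the same filter can be applied with the pandas Series method itself. A minimal sketch:

# Filter the collected DataFrame with the pandas method directly
# (na=False skips rows where agency_name is missing)
metro_agencies = tbl_agency_names[tbl_agency_names["agency_name"].str.contains("Metro", na=False)]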
filter() & collect()#
siuba#
from calitp_data_analysis.tables import tbls
from siuba import _, collect, filter
annual_service_agencies = (
tbls.mart_ntd.dim_annual_service_agencies()
>> filter(_.state == "CA", _.report_year == 2023)
>> collect()
)
Equivalent BigQuery SQL using query_sql()#
Use SQL's SELECT and WHERE clauses for filtering:
from calitp_data_analysis.sql import query_sql
query_sql(
"""
SELECT *
FROM `cal-itp-data-infra.mart_ntd.dim_annual_service_agencies`
WHERE state = 'CA' AND report_year = 2023
""",
as_df=True
)
select()#
siuba#
import geopandas as gpd
from siuba import _, select
blocks = gpd.read_file('./tl_2020_06_tabblock20.zip')
blocks = blocks >> select(_.GEOID20, _.POP20, _.HOUSING20, _.geometry)
pandas/geopandas#
import geopandas as gpd
blocks = gpd.read_file('./tl_2020_06_tabblock20.zip')
blocks = blocks[['GEOID20', 'POP20', 'HOUSING20', 'geometry']]
rename#
siuba#
import geopandas as gpd
from siuba import _, rename
blocks = gpd.read_file('./tl_2020_06_tabblock20.zip')
blocks = blocks >> rename(GEO_ID = _.GEOID20)
pandas/geopandas#
import geopandas as gpd
blocks = gpd.read_file('./tl_2020_06_tabblock20.zip')
blocks = blocks.rename(columns={'GEOID20': 'GEO_ID'})
inner_join#
siuba#
from siuba import _, inner_join

# df and df2 are both DataFrames
joined = df2 >> inner_join(_, df, on={'GEO_ID': 'w_geocode'})
pandas/geopandas#
The code below will result in a table having both GEO_ID and w_geocode columns with redundant data.
To avoid this, you could either (a) first rename one of the columns to match the other and
do a simpler merge using just the on parameter (no need then for left_on and
right_on), or (b) do as shown below and subsequently drop one of the redundant columns. A sketch of option (a) follows the example.
# df and df2 are both DataFrames
joined = df2.merge(df, left_on='GEO_ID', right_on='w_geocode')
joined = joined.drop(columns=['w_geocode'])  # option (b): drop the redundant key column
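A minimal sketch of option (a), using the same hypothetical df and df2:

# Option (a): rename one key column to match the other, then merge on a single key
df = df.rename(columns={'w_geocode': 'GEO_ID'})
joined = df2.merge(df, on='GEO_ID')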
More siuba Resources#
For additional tutorials and examples, see the siuba docs.
pandas#
The library pandas is very commonly used in data analysis, and many external resources provide a brief overview of its use.
Add New Packages#
While most Python packages an analyst uses come preinstalled in JupyterHub, there may be additional packages you'll want to use in your analysis.

1. Install shared utility functions.
2. Change directory into the project task's subfolder and add `requirements.txt` and/or `conda-requirements.txt` (a hypothetical example follows this list).
3. Run `pip install -r requirements.txt` and/or `conda install --yes -c conda-forge --file conda-requirements.txt`.
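For illustration only, the contents of a task's `requirements.txt` might look like the sketch below; the packages and version pins are made up, so list whatever your analysis actually needs:

# hypothetical requirements.txt
dask-geopandas==0.3.*
rasterio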
Updating calitp-data-analysis#
calitp-data-analysis is a package that lives here in the data-infra repo. Follow the steps below to update the package.
Steps
Adapted from this and this Slack thread.
1. Make the changes you want in the `calitp-data-analysis` folder inside `packages` here. If you are only changing package metadata (author information, package description, etc.) without changing the function of the package itself, that information lives in `pyproject.toml` rather than in the `calitp-data-analysis` subfolder.
2. If you are adding a new function that relies on a package that isn't already a dependency, run `poetry add <package name>` after changing directories to `data-infra/packages/calitp_data_analysis`. Check this Jupyter image file for the version number associated with the package, because you should specify the version. For example, if your function relies on `dask` and the Jupyter image file pins `dask = "~2022.8"`, run `poetry add dask@~2022.8` in the terminal.
3. You may also have to run `mypy`, a package that audits Python files for information related to data types; you can read more about it here. `mypy` is one of the standard development dependencies for the `calitp-data-analysis` package, defined in the package's `pyproject.toml`, so to install it you can run `poetry install` in `packages/calitp-data-analysis/` (which will also install the other package dependencies). To use `mypy`, run `poetry run mypy [file or directory name you're interested in checking]`. `mypy` is generally run in CI when a PR is opened, as part of build tasks; you can see it called for this package in `build-calitp-data-analysis.yml`. Within the PR, the "Test, visualize, and build calitp-data-analysis" CI workflow will fail if problems are found. More helpful hints for passing `mypy` in our repo are available.
4. Each time you update the package, you must also update the version number. We use dates to reflect which version we are on; update the version in the `pyproject.toml` that lives in `calitp-data-analysis` to either today's date or a future date.
5. Open a new pull request and make sure the new version date appears on the test version page. The new version date may not show up on the test page due to errors; check the GitHub Actions page of your pull request to see any errors that have occurred. If you run into an error message like `error: Skipping analyzing "dask_geopandas": module is installed, but missing library stubs or py.typed marker [import]`, go to your `.py` file and add `# type: ignore` after the package import. For example, to fix the error above, change `import dask_geopandas as dg` to `import dask_geopandas as dg  # type: ignore`.
6. It is encouraged to make changes in a set of smaller commits: for example, add all the necessary packages with `poetry add <package name>` first, then fix any issues flagged by `mypy`, and finally address any additional issues.
7. Merge the PR. Once it is merged in, the actual package will display the new version number. To make sure everything works as expected, run `pip install calitp-data-analysis==<new version here>` in a cell of a Jupyter notebook and import a module (or two), such as `from calitp_data_analysis import styleguide` (see the sketch after these steps).
8. Update the new version number in the `data-infra` repository here, here, here, and anywhere else you find a reference to the old version of the package. You'll also want to do the same for any other Cal-ITP repositories that reference the `calitp-data-analysis` package. When you update the jupyter-singleuser `pyproject.toml`, make sure to run `poetry add calitp-data-analysis==<new version here>` and commit the updated `poetry.lock` file. As of writing, the only other repository that references the package version is reports.
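A minimal sketch of the verification in step 7, run in a Jupyter notebook cell (the version number is a placeholder):

# Install the newly released version and confirm imports work
%pip install calitp-data-analysis==2025.8.10
from calitp_data_analysis import styleguide
from calitp_data_analysis.sql import query_sql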
Appendix: calitp-data-infra#
The calitp-data-infra package, used primarily by warehouse maintainers and data pipeline developers, includes utilities that analysts will likely not need to interact with directly (and therefore generally won't need to install), but which may be helpful to be aware of. For instance, the get_secret_by_name() and get_secrets_by_label() functions in the package's auth module are used to interact with Google's Secret Manager, the service that securely stores API keys and other sensitive information that underpins many of our data syncs.
You can read more about the calitp-data-infra Python package here.
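For instance, a fetch might look like the following sketch; the secret name is made up, and this assumes get_secret_by_name() takes the secret's name as a string:

# Hypothetical example: fetch a stored API key from Google Secret Manager
from calitp_data_infra.auth import get_secret_by_name

api_key = get_secret_by_name("example-api-key")  # "example-api-key" is a placeholder name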