Useful Python Libraries
Contents
Useful Python Libraries¶
The following libraries are available and recommended for use by Cal-ITP data analysts. Our JupyterHub environment comes with all of these installed already, except for calitp-data-infra
. A full list of the packages included in the system image that underpins our JupyterHub environment can be found (and updated when needed) here.
Table of Contents¶
calitp-data-analysis¶
calitp-data-analysis
is an internal library of utility functions used to access our warehouse data for analysis purposes.
import tbls¶
Most notably, you can include import tbls
at the top of your notebook to import a table from the warehouse in the form of a tbls
:
from calitp_data_analysis.tables import tbls
Example:
from calitp_data_analysis.tables import tbls
tbls.mart_gtfs.dim_agency()
# Source: lazy query # DB Conn: Engine(bigquery://cal-itp-data-infra/?maximum_bytes_billed=5000000000) # Preview:
key | _gtfs_key | feed_key | agency_id | agency_name | agency_url | agency_timezone | agency_lang | agency_phone | agency_fare_url | agency_email | base64_url | _dt | _feed_valid_from | _line_number | feed_timezone | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 56112d81d62d71d408ea0247309178e4 | 090343c89ab8fe1d5e3b79e5687bbcca | 21fa0b125d801eb5058da2ec5d748bda | LAX FlyAway | None | http://www.LAXFlyAway.org | America/Los_Angeles | en | (714) 507-1170 | None | None | aHR0cHM6Ly93d3cuZmx5bGF4LmNvbS8tL21lZGlhL2ZseW... | 2022-09-15 | 2022-09-15 14:59:54.274779+00:00 | 1 | America/Los_Angeles |
1 | 282b4e07171c48ac08162e0dc8749066 | 18fcf47148228c7b56c19726c854054b | 9b8e9b8befe560293a6c5b38dc19ffbb | MT | None | http://www.mantecatransit.com | America/Los_Angeles | en | (209) 456-8000 | None | None | aHR0cHM6Ly93d3cuY2kubWFudGVjYS5jYS51cy9Db21tdW... | 2021-07-28 | 2021-07-28 13:56:16+00:00 | 1 | America/Los_Angeles |
2 | 1d48492cd90930d6612fc70ef18bf8d4 | 93c44e614219c41696a8148018c1d83b | f87ac9abe81a7548325423ede69c3b86 | LAX FlyAway | None | http://www.LAXFlyAway.org | America/Los_Angeles | en | (714) 507-1170 | None | None | aHR0cHM6Ly93d3cuZmx5bGF4LmNvbS8tL21lZGlhL2ZseW... | 2022-09-09 | 2022-09-09 00:00:56+00:00 | 1 | America/Los_Angeles |
3 | fdbd4bd9cc406a651ec80dba6ac4a1c1 | ebfb64604b1c18bdf8ab91a2e4394491 | ba587fd71b4f9a956ff36e309e6acf3f | LAX FlyAway | None | http://www.LAXFlyAway.org | America/Los_Angeles | en | (714) 507-1170 | None | None | aHR0cHM6Ly93d3cuZmx5bGF4LmNvbS8tL21lZGlhL2ZseW... | 2022-05-14 | 2022-05-14 00:01:18+00:00 | 1 | America/Los_Angeles |
4 | 033d018b031002914fbfa48c872d4a65 | 8f5e204872e6b7034479a6d11502547b | 3d30b02b9008f788eb89c41c137786c1 | MT | None | http://www.mantecatransit.com | America/Los_Angeles | en | (209) 456-8000 | None | None | aHR0cHM6Ly93d3cuY2kubWFudGVjYS5jYS51cy9Db21tdW... | 2022-01-19 | 2022-01-19 00:11:01+00:00 | 1 | America/Los_Angeles |
# .. may have more rows
query_sql¶
query_sql
is another useful function to use inside of JupyterHub notebooks to turn a SQL query into a pandas DataFrame.
As an example, in a notebook:
from calitp_data_analysis.sql import query_sql
df_dim_agency = query_sql("""
SELECT
*
FROM `mart_gtfs.dim_agency`
LIMIT 10""", as_df=True)
df_dim_agency.head()
key | _gtfs_key | feed_key | agency_id | agency_name | agency_url | agency_timezone | agency_lang | agency_phone | agency_fare_url | agency_email | base64_url | _dt | _feed_valid_from | _line_number | feed_timezone | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 56112d81d62d71d408ea0247309178e4 | 090343c89ab8fe1d5e3b79e5687bbcca | 21fa0b125d801eb5058da2ec5d748bda | LAX FlyAway | None | http://www.LAXFlyAway.org | America/Los_Angeles | en | (714) 507-1170 | None | None | aHR0cHM6Ly93d3cuZmx5bGF4LmNvbS8tL21lZGlhL2ZseW... | 2022-09-15 | 2022-09-15 14:59:54.274779+00:00 | 1 | America/Los_Angeles |
1 | 282b4e07171c48ac08162e0dc8749066 | 18fcf47148228c7b56c19726c854054b | 9b8e9b8befe560293a6c5b38dc19ffbb | MT | None | http://www.mantecatransit.com | America/Los_Angeles | en | (209) 456-8000 | None | None | aHR0cHM6Ly93d3cuY2kubWFudGVjYS5jYS51cy9Db21tdW... | 2021-07-28 | 2021-07-28 13:56:16+00:00 | 1 | America/Los_Angeles |
2 | 1d48492cd90930d6612fc70ef18bf8d4 | 93c44e614219c41696a8148018c1d83b | f87ac9abe81a7548325423ede69c3b86 | LAX FlyAway | None | http://www.LAXFlyAway.org | America/Los_Angeles | en | (714) 507-1170 | None | None | aHR0cHM6Ly93d3cuZmx5bGF4LmNvbS8tL21lZGlhL2ZseW... | 2022-09-09 | 2022-09-09 00:00:56+00:00 | 1 | America/Los_Angeles |
3 | fdbd4bd9cc406a651ec80dba6ac4a1c1 | ebfb64604b1c18bdf8ab91a2e4394491 | ba587fd71b4f9a956ff36e309e6acf3f | LAX FlyAway | None | http://www.LAXFlyAway.org | America/Los_Angeles | en | (714) 507-1170 | None | None | aHR0cHM6Ly93d3cuZmx5bGF4LmNvbS8tL21lZGlhL2ZseW... | 2022-05-14 | 2022-05-14 00:01:18+00:00 | 1 | America/Los_Angeles |
4 | 033d018b031002914fbfa48c872d4a65 | 8f5e204872e6b7034479a6d11502547b | 3d30b02b9008f788eb89c41c137786c1 | MT | None | http://www.mantecatransit.com | America/Los_Angeles | en | (209) 456-8000 | None | None | aHR0cHM6Ly93d3cuY2kubWFudGVjYS5jYS51cy9Db21tdW... | 2022-01-19 | 2022-01-19 00:11:01+00:00 | 1 | America/Los_Angeles |
siuba¶
siuba
is a tool that allows the same analysis code to run on a pandas DataFrame,
as well as generate SQL for different databases.
It supports most pandas Series methods analysts use. See the siuba docs for more information.
The examples below go through the basics of using siuba, collecting a database query to a local DataFrame, and showing SQL test queries that siuba code generates.
Basic query¶
from calitp_data_analysis.tables import tbls
from siuba import _, filter, count, collect, show_query
# query agency information, then filter for a single gtfs feed,
# and then count how often each feed key occurs
(tbls.mart_gtfs.dim_agency()
>> filter(_.agency_id == 'BA', _.base64_url == 'aHR0cHM6Ly9hcGkuNTExLm9yZy90cmFuc2l0L2RhdGFmZWVkcz9vcGVyYXRvcl9pZD1SRw==')
>> count(_.feed_key)
)
# Source: lazy query # DB Conn: Engine(bigquery://cal-itp-data-infra/?maximum_bytes_billed=5000000000) # Preview:
feed_key | n | |
---|---|---|
0 | 5ad87901f329be154545fec771aa2c21 | 1 |
1 | b01a3ce6ba2e972a6acc08574a3e06a9 | 1 |
2 | 023ec6a2b6246ad996abcb5aefedc068 | 1 |
3 | a64532640829f3eae9129cd5d5e9590b | 1 |
4 | f4095db9282b859842d338f5db032561 | 1 |
# .. may have more rows
Collect query results¶
Note that siuba by default prints out a preview of the SQL query results.
In order to fetch the results of the query as a pandas DataFrame, run collect()
.
tbl_agency_names = tbls.mart_gtfs.dim_agency() >> collect()
# Use pandas .head() method to show first 5 rows of data
tbl_agency_names.head()
key | _gtfs_key | feed_key | agency_id | agency_name | agency_url | agency_timezone | agency_lang | agency_phone | agency_fare_url | agency_email | base64_url | _dt | _feed_valid_from | _line_number | feed_timezone | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 56112d81d62d71d408ea0247309178e4 | 090343c89ab8fe1d5e3b79e5687bbcca | 21fa0b125d801eb5058da2ec5d748bda | LAX FlyAway | None | http://www.LAXFlyAway.org | America/Los_Angeles | en | (714) 507-1170 | None | None | aHR0cHM6Ly93d3cuZmx5bGF4LmNvbS8tL21lZGlhL2ZseW... | 2022-09-15 | 2022-09-15 14:59:54.274779+00:00 | 1 | America/Los_Angeles |
1 | 282b4e07171c48ac08162e0dc8749066 | 18fcf47148228c7b56c19726c854054b | 9b8e9b8befe560293a6c5b38dc19ffbb | MT | None | http://www.mantecatransit.com | America/Los_Angeles | en | (209) 456-8000 | None | None | aHR0cHM6Ly93d3cuY2kubWFudGVjYS5jYS51cy9Db21tdW... | 2021-07-28 | 2021-07-28 13:56:16+00:00 | 1 | America/Los_Angeles |
2 | 1d48492cd90930d6612fc70ef18bf8d4 | 93c44e614219c41696a8148018c1d83b | f87ac9abe81a7548325423ede69c3b86 | LAX FlyAway | None | http://www.LAXFlyAway.org | America/Los_Angeles | en | (714) 507-1170 | None | None | aHR0cHM6Ly93d3cuZmx5bGF4LmNvbS8tL21lZGlhL2ZseW... | 2022-09-09 | 2022-09-09 00:00:56+00:00 | 1 | America/Los_Angeles |
3 | fdbd4bd9cc406a651ec80dba6ac4a1c1 | ebfb64604b1c18bdf8ab91a2e4394491 | ba587fd71b4f9a956ff36e309e6acf3f | LAX FlyAway | None | http://www.LAXFlyAway.org | America/Los_Angeles | en | (714) 507-1170 | None | None | aHR0cHM6Ly93d3cuZmx5bGF4LmNvbS8tL21lZGlhL2ZseW... | 2022-05-14 | 2022-05-14 00:01:18+00:00 | 1 | America/Los_Angeles |
4 | 033d018b031002914fbfa48c872d4a65 | 8f5e204872e6b7034479a6d11502547b | 3d30b02b9008f788eb89c41c137786c1 | MT | None | http://www.mantecatransit.com | America/Los_Angeles | en | (209) 456-8000 | None | None | aHR0cHM6Ly93d3cuY2kubWFudGVjYS5jYS51cy9Db21tdW... | 2022-01-19 | 2022-01-19 00:11:01+00:00 | 1 | America/Los_Angeles |
Show query SQL¶
While collect()
fetches query results, show_query()
prints out the SQL code that siuba generates.
(tbls.mart_gtfs.dim_agency()
>> filter(_.agency_name.str.contains("Metro"))
>> show_query(simplify=True)
)
SELECT *
FROM `mart_gtfs.dim_agency` AS `mart_gtfs.dim_agency_1`
WHERE regexp_contains(`mart_gtfs.dim_agency_1`.`agency_name`, 'Metro')
# Source: lazy query # DB Conn: Engine(bigquery://cal-itp-data-infra/?maximum_bytes_billed=5000000000) # Preview:
key | _gtfs_key | feed_key | agency_id | agency_name | agency_url | agency_timezone | agency_lang | agency_phone | agency_fare_url | agency_email | base64_url | _dt | _feed_valid_from | _line_number | feed_timezone | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | a3ecd303b4eaf59d937631c943c0abfe | bfe22848a94f7cff936847116d54445c | ae20d23beb02c5331d1f7087f54bef7d | 960 | Madera Metro | https://www.madera.gov/home/departments/transi... | America/Los_Angeles | en | (559) 661-RIDE (7433) | https://www.madera.gov/home/departments/transi... | None | aHR0cHM6Ly9kYXRhLnRyaWxsaXVtdHJhbnNpdC5jb20vZ3... | 2023-04-14 | 2023-04-14 03:00:18.458064+00:00 | 1 | America/Los_Angeles |
1 | b22bc5c21e3e90bef6e0fc91c66199db | f3271fec250502a41da3848717b9eeaf | 149ad0155813993026db844421248c4a | 960 | Madera Metro | https://www.cityofmadera.ca.gov/home/departmen... | America/Los_Angeles | en | (559) 661-RIDE (7433) | https://www.cityofmadera.ca.gov/wp-content/upl... | None | aHR0cHM6Ly9kYXRhLnRyaWxsaXVtdHJhbnNpdC5jb20vZ3... | 2021-08-04 | 2021-08-04 00:06:42+00:00 | 1 | America/Los_Angeles |
2 | 5510fd5cfd5718d36c5cea9efc61b05e | 267a1b2a19b0526a5086cded02ba3325 | 2716d91a7c0f05036fd7d593a6980ec4 | 960 | Madera Metro | https://www.cityofmadera.ca.gov/home/departmen... | America/Los_Angeles | en | (559) 661-RIDE (7433) | https://www.cityofmadera.ca.gov/wp-content/upl... | None | aHR0cHM6Ly9kYXRhLnRyaWxsaXVtdHJhbnNpdC5jb20vZ3... | 2022-12-16 | 2022-12-16 03:00:27.604217+00:00 | 1 | America/Los_Angeles |
3 | 9a395f3d167d2c0088ebf8dd3d2e1da9 | 75ddeb651b8c6b5039add08b71e1464a | 3f5795f479e908e5ea99de4a22bb2500 | 960 | Madera Metro | https://www.cityofmadera.ca.gov/home/departmen... | America/Los_Angeles | en | (559) 661-RIDE (7433) | https://www.cityofmadera.ca.gov/wp-content/upl... | None | aHR0cHM6Ly9kYXRhLnRyaWxsaXVtdHJhbnNpdC5jb20vZ3... | 2022-07-19 | 2022-07-19 00:00:43+00:00 | 1 | America/Los_Angeles |
4 | 8a786d43a3c1e8e620355a06d50a31b6 | 36fde7eedc560e48670dcde339e7e93d | 9cb8e2f18ab3055f8b927f01ad57045b | 960 | Madera Metro | https://www.cityofmadera.ca.gov/home/departmen... | America/Los_Angeles | en | (559) 661-RIDE (7433) | https://www.cityofmadera.ca.gov/wp-content/upl... | None | aHR0cHM6Ly9kYXRhLnRyaWxsaXVtdHJhbnNpdC5jb20vZ3... | 2021-06-24 | 2021-06-24 00:09:42+00:00 | 1 | America/Los_Angeles |
# .. may have more rows
Note that here the pandas Series method str.contains
corresponds to regexp_contains
in Google BigQuery.
More siuba Resources¶
pandas¶
The library pandas is very commonly used in data analysis, and the external resources below provide a brief overview of it’s use.
Add New Packages¶
While most Python packages an analyst uses come in JupyterHub, there may be additional packages you’ll want to use in your analysis.
Install shared utility functions
Change directory into the project task’s subfolder and add
requirements.txt
and/orconda-requirements.txt
Run
pip install -r requirements.txt
and/orconda install --yes -c conda-forge --file conda-requirements.txt
Appendix: calitp-data-infra¶
The calitp-data-infra package, used primarily by warehouse mainainers and data pipeline developers, includes utilities that analysts will likely not need need to interact with directly (and therefore generally won’t need to install), but which may be helpful to be aware of. For instance, the get_secret_by_name()
and get_secrets_by_label()
functions in the package’s auth
module are used to interact with Google’s Secret Manager, the service that securely stores API keys and other sensitive information that underpins many of our data syncs.
You can read more about the calitp-data-infra
Python package here.