The Cal-ITP data infrastructure facilitates several types of data workflows.

In addition, we have infrastructure tools that monitor the health of the system itself, or that deploy or run other services; these tools do not directly interact with data or support end-user data access.
At a high level, the following diagram outlines the main tools that we use in our data stack:
This documentation outlines two ways to think of this system and its components from a technical/maintenance perspective:

- Services that are deployed and maintained (ex. Metabase, JupyterHub, etc.)
- Data pipelines that ingest specific types of data (ex. GTFS Schedule, Payments, etc.)
Across both data and services, we often have a “production” (live, end-user-facing) environment and some type of testing, staging, or development environment.
| Production | Staging/Development |
| --- | --- |
| Managed Airflow (i.e. Google Cloud Composer) | Locally-run Airflow (via `docker-compose`) |
| `cal-itp-data-infra` database (i.e. project) in BigQuery | `cal-itp-data-infra-staging` database (i.e. project) in BigQuery |
| Google Cloud Storage buckets without a prefix | GCS buckets with a prefix |

Some buckets prefixed with `dev-` also exist, primarily for testing the RT archiver locally.
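The production/staging naming pattern described above can be sketched as a small helper. This is a hypothetical illustration only: the function names, the `staging_prefix` parameter, and the example bucket name are assumptions for this sketch, not actual Cal-ITP code, and the real staging bucket prefix is not assumed here.

```python
# Hypothetical sketch (not actual Cal-ITP code) mapping an environment
# name to the matching resources from the table above.

BIGQUERY_PROJECTS = {
    # BigQuery "databases" are GCP projects.
    "production": "cal-itp-data-infra",
    "staging": "cal-itp-data-infra-staging",
}


def bigquery_project(env: str) -> str:
    """Return the BigQuery project (i.e. database) for an environment."""
    return BIGQUERY_PROJECTS[env]


def bucket_name(base: str, env: str, staging_prefix: str) -> str:
    """Production GCS buckets are unprefixed; staging buckets carry a prefix.

    The prefix is passed in as a parameter because this sketch does not
    assume the actual prefix used in the deployment.
    """
    return base if env == "production" else f"{staging_prefix}{base}"
```

A pipeline task could call these helpers once at startup to resolve all of its resource names from a single environment setting, keeping the production/staging split in one place.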