Architecture Overview
The Cal-ITP data infrastructure facilitates several types of data workflows:

- Ingestion
- Modeling/transformation
- Analysis

In addition, we maintain Infrastructure tools that monitor the health of the system itself or deploy and run other services; these do not directly interact with data or support end-user data access.
At a high level, the following diagram outlines the main tools that we use in our data stack (excluding Infrastructure tools).
This documentation outlines two ways to think about this system and its components from a technical/maintenance perspective:

- Services that are deployed and maintained (e.g. Metabase, JupyterHub)
- Data pipelines that ingest specific types of data (e.g. GTFS Schedule, Payments)
Environments
Across both data and services, we often have a “production” (live, end-user-facing) environment and some type of testing, staging, or development environment.
production

- Managed Airflow (i.e. Google Cloud Composer)
- Production gtfs-rt-archiver-v3
- cal-itp-data-infra database (i.e. project) in BigQuery
- Google Cloud Storage buckets without a prefix, e.g. gs://calitp-gtfs-schedule-parsed-hourly
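As a point of reference, the sketch below shows one way to peek at these production resources using the standard Google Cloud client libraries (google-cloud-bigquery and google-cloud-storage). It assumes application-default credentials with read access to the project; it is illustrative only and not part of the project's own tooling.

```python
# Minimal sketch: inspect production resources with the official Google Cloud
# client libraries. Assumes application-default credentials with read access.
from google.cloud import bigquery, storage

# BigQuery: the production "database" is the cal-itp-data-infra project.
bq_client = bigquery.Client(project="cal-itp-data-infra")
for dataset in bq_client.list_datasets():
    print("BigQuery dataset:", dataset.dataset_id)

# GCS: production buckets carry no prefix, e.g. calitp-gtfs-schedule-parsed-hourly.
gcs_client = storage.Client(project="cal-itp-data-infra")
for blob in gcs_client.list_blobs("calitp-gtfs-schedule-parsed-hourly", max_results=5):
    print("GCS object:", blob.name)
```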
testing/staging/dev

- Locally-run Airflow (via docker-compose)
- Test gtfs-rt-archiver-v3
- cal-itp-data-infra-staging database (i.e. project) in BigQuery
- GCS buckets with the test- prefix, e.g. gs://test-calitp-gtfs-rt-raw-v2
- Some buckets prefixed with dev- also exist, primarily for testing the RT archiver locally
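To make the bucket-naming convention concrete, here is a minimal, hypothetical helper; bucket_uri is not part of the codebase and only encodes the prefix rules described above.

```python
# Hypothetical helper (not part of the repo): illustrates the bucket-prefix
# convention used across environments.
def bucket_uri(base_name: str, env: str = "production") -> str:
    """Return a GCS URI for `base_name` in the given environment.

    production -> no prefix       (gs://calitp-...)
    test       -> "test-" prefix  (gs://test-calitp-...)
    dev        -> "dev-" prefix   (gs://dev-calitp-...)
    """
    prefixes = {"production": "", "test": "test-", "dev": "dev-"}
    return f"gs://{prefixes[env]}{base_name}"


assert bucket_uri("calitp-gtfs-rt-raw-v2", "test") == "gs://test-calitp-gtfs-rt-raw-v2"
assert bucket_uri("calitp-gtfs-schedule-parsed-hourly") == "gs://calitp-gtfs-schedule-parsed-hourly"
```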