Publishing data to California Open Data aka CKAN#
NOTE: Only non-spatial data should be directly published to CKAN. Spatial data (i.e. data keyed by location in some manner) has to go through the Caltrans geoportal and is subsequently synced to CKAN.
What is the California Open Data Portal?#
The state of California hosts its own instance of CKAN, called the California Open Data Portal. CKAN is an open-source data management tool; in other words, it allows organizations such as Caltrans to host data so that users can search for and download data sets that might be useful for analysis or research. Among other agencies, Caltrans publishes many data sets to CKAN. Data is generally published as flat files (typically CSV) alongside required metadata and a data dictionary.
Cal-ITP datasets#
What is the publication script?#
The publish_gtfs Airflow workflow relies on a dbt exposure to determine what to publish; in practice, that exposure is named california_open_data. The tables included in that exposure, their CKAN destinations, and their published descriptions are defined in _gtfs_schedule_latest.yml under the exposures heading.
By default, the columns of a table included in the exposure are not published on the portal. This keeps fields that are useful for internal data management but hard for public users to interpret, such as _is_current, out of the open data portal. Columns meant for publication must be explicitly opted in with the dbt meta tag publish.include: true, which you can see on various columns of the models in the same YAML file where the exposure itself is defined.
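The opt-in rule can be sketched in Python as follows. This is a minimal illustration rather than the publication script itself: it assumes a node shaped like an entry in dbt's manifest.json (a `columns` mapping whose values carry a `meta` dictionary) and that the tag appears as the literal key `publish.include`. The function name `publishable_columns` and the sample node are hypothetical.

```python
from typing import Any, Dict, List


def publishable_columns(node: Dict[str, Any]) -> List[str]:
    """Return the names of columns explicitly opted in to publication."""
    selected = []
    for name, column in node.get("columns", {}).items():
        meta = column.get("meta") or {}
        # A column is excluded unless the tag is present and exactly True.
        if meta.get("publish.include") is True:
            selected.append(name)
    return selected


# Hypothetical node, for illustration only.
sample_node = {
    "name": "agency",
    "columns": {
        "agency_id": {"meta": {"publish.include": True}},
        "agency_name": {"meta": {"publish.include": True}},
        "_is_current": {"meta": {}},  # internal field, never tagged
    },
}

print(publishable_columns(sample_node))
```

Treating a missing tag as "do not publish" mirrors the default described above: a new internal column stays private until someone deliberately tags it.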
The publication script does not read from that YAML file directly when publishing - it reads from the manifest file generated by the dbt_run_and_upload_artifacts Airflow job. By default, that manifest file is read from the GCS bucket where run_and_upload.py stores it during regular daily runs. If you need to make changes to the Cal-ITP GTFS-Ingest Pipeline Dataset in between runs of the daily dbt_run_and_upload_artifacts Airflow job (e.g. if there’s a time-sensitive bug you’re fixing in open data), you’ll need to kick off an ad hoc run of that job in Airflow.
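As a rough sketch of what reading the manifest involves, the snippet below looks up an exposure by name and lists the nodes it depends on. It assumes the standard manifest.json layout, in which `exposures` is a mapping of entries that each have a `name` and a `depends_on.nodes` list. The helper `exposure_nodes` and the project and model names are hypothetical, and a small in-memory stand-in replaces the real file from GCS.

```python
import json
from typing import Any, Dict, List


def exposure_nodes(manifest: Dict[str, Any], exposure_name: str) -> List[str]:
    """Return the unique IDs of the nodes the named exposure depends on."""
    for exposure in manifest.get("exposures", {}).values():
        if exposure.get("name") == exposure_name:
            return list(exposure.get("depends_on", {}).get("nodes", []))
    raise KeyError(f"exposure {exposure_name!r} not found in manifest")


# Tiny stand-in for manifest.json; project and model names are made up.
manifest_text = json.dumps(
    {
        "exposures": {
            "exposure.example_project.california_open_data": {
                "name": "california_open_data",
                "depends_on": {
                    "nodes": [
                        "model.example_project.agency",
                        "model.example_project.attributions",
                    ]
                },
            }
        }
    }
)

manifest = json.loads(manifest_text)
print(exposure_nodes(manifest, "california_open_data"))
```

Because the script works from the manifest and not the YAML, a change to the exposure only takes effect once a new manifest has been generated.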
General open data publication process#
Develop data models#
Generally, data models should be built in dbt/BigQuery if possible. For example, we have latest-only GTFS schedule models we can use to update and expand the existing CKAN dataset.
Document data#
California Open Data requires two documentation files for published datasets.
metadata.csv - one row per resource (i.e. file) to be published
dictionary.csv - one row per column across all resources
We use dbt exposure-based data publishing to generate these two files automatically in the Airflow job. The documentation from the dbt models’ corresponding YAML is converted into the appropriate CSVs and written to the GCS bucket calitp-publish. By default, the script reads the latest manifest.json in GCS uploaded by the dbt_run_and_upload_artifacts Airflow job.
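To make the shape of the two documentation files concrete, here is a minimal sketch that turns model documentation into a per-resource CSV and a per-column CSV. The header names (`resource`, `column`, `description`), the helper functions, and the placeholder column descriptions are all assumptions for illustration. The real files follow whatever columns California Open Data requires, and the real job writes to GCS, not to strings.

```python
import csv
import io
from typing import Any, Dict, List

# Hypothetical documentation as it might be pulled from dbt YAML.
RESOURCES: Dict[str, Dict[str, Any]] = {
    "agency": {
        "description": "Each row is a cleaned row from an agency.txt file.",
        "columns": {
            "agency_id": "Placeholder description for agency_id.",
            "agency_name": "Placeholder description for agency_name.",
        },
    },
    "attributions": {
        "description": "Each row is a cleaned row from an attributions.txt file.",
        "columns": {
            "attribution_id": "Placeholder description for attribution_id.",
        },
    },
}


def rows_to_csv(fieldnames: List[str], rows: List[Dict[str, str]]) -> str:
    """Serialize rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def build_metadata_csv(resources: Dict[str, Dict[str, Any]]) -> str:
    """One row per resource (file) to be published."""
    rows = [
        {"resource": name, "description": resource["description"]}
        for name, resource in resources.items()
    ]
    return rows_to_csv(["resource", "description"], rows)


def build_dictionary_csv(resources: Dict[str, Dict[str, Any]]) -> str:
    """One row per column across all resources."""
    rows = [
        {"resource": name, "column": column, "description": description}
        for name, resource in resources.items()
        for column, description in resource["columns"].items()
    ]
    return rows_to_csv(["resource", "column", "description"], rows)


print(build_metadata_csv(RESOURCES))
print(build_dictionary_csv(RESOURCES))
```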
Each day, a new version of manifest.json is automatically generated for tables in the production warehouse by the dbt_run_and_upload_artifacts job in the transform_warehouse DAG, and placed inside the calitp-dbt-artifacts GCS bucket.
Create dataset and metadata#
For new tables, a CKAN destination must be created with a UUID for each model that will be published, by adding a new resource to the exposure definition.
For example:
meta:
  methodology: |
    Cal-ITP collects the GTFS feeds from a statewide list [link] every night and aggregates them into a statewide table
    for analysis purposes only. Do not use for trip planner ingestion; the table is meant to be used for statewide
    analytics and other use cases. Note: These data may or may not have passed GTFS-Validation.
  coordinate_system_epsg: "4326"
  destinations:
    - type: ckan
      format: csv
      url: https://data.ca.gov
      resources:
        agency:
          id: e8f9d49e-2bb6-400b-b01f-28bc2e0e7df2
          description: |
            Each row is a cleaned row from an agency.txt file.
            Definitions for the original GTFS fields are available at:
            https://gtfs.org/reference/static#agencytxt.
        attributions:
          id: 038b7354-06e8-4082-a4a1-40debd3110d5
          description: |
            Each row is a cleaned row from an attributions.txt file.
            Definitions for the original GTFS fields are available at:
            https://gtfs.org/reference/static#attributionstxt.
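Since each resource must carry a UUID, a quick sanity check before committing a new destination can catch malformed IDs. The sketch below is an illustrative helper and not part of the publishing code: it mirrors the `resources` mapping from the example as a plain Python dictionary (so no YAML parser is needed), and the function name `invalid_resource_ids` is hypothetical.

```python
import uuid
from typing import Any, Dict, List


def invalid_resource_ids(destination: Dict[str, Any]) -> List[str]:
    """Return names of resources whose id is missing or not a valid UUID."""
    bad = []
    for name, resource in destination.get("resources", {}).items():
        try:
            # uuid.UUID raises ValueError for malformed strings. It is
            # lenient about hyphens and braces, so this is a sanity check only.
            uuid.UUID(str(resource.get("id")))
        except ValueError:
            bad.append(name)
    return bad


# Mirrors the destination from the example, written as a Python dict.
destination = {
    "type": "ckan",
    "format": "csv",
    "url": "https://data.ca.gov",
    "resources": {
        "agency": {"id": "e8f9d49e-2bb6-400b-b01f-28bc2e0e7df2"},
        "attributions": {"id": "038b7354-06e8-4082-a4a1-40debd3110d5"},
    },
}

print(invalid_resource_ids(destination))
```

A check like this confirms only that the ID is well formed. It cannot confirm that the resource actually exists in CKAN.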
Publish the data#
An Airflow job refreshes the published data at a specified frequency.
Developers can use the staging Airflow instance to publish to the Test CKAN.
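For orientation, the sketch below shows what a call against the CKAN Action API might look like when aimed at either portal. It is not how the Airflow job publishes data. It only builds (and never sends) a `resource_patch` request, the token is sent in the `Authorization` header as CKAN expects, and the test portal URL, token value, and helper name are placeholders.

```python
import json
import urllib.request
from typing import Any, Dict

PRODUCTION_CKAN = "https://data.ca.gov"
# Placeholder: the real Test CKAN URL is not given in this document.
TEST_CKAN = "https://test-ckan.example.invalid"


def build_resource_patch_request(
    base_url: str, resource_id: str, api_token: str, fields: Dict[str, Any]
) -> urllib.request.Request:
    """Build (but do not send) a CKAN resource_patch request."""
    url = f"{base_url.rstrip('/')}/api/3/action/resource_patch"
    payload = dict(fields, id=resource_id)
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": api_token,
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_resource_patch_request(
    TEST_CKAN,
    "e8f9d49e-2bb6-400b-b01f-28bc2e0e7df2",
    "hypothetical-token",
    {"description": "Each row is a cleaned row from an agency.txt file."},
)
print(request.get_method(), request.full_url)
```

Keeping the base URL as a parameter means the same code path can be exercised against the test portal before it touches production.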