Publishing data to California Open Data aka CKAN

NOTE: Only non-spatial data should be directly published to CKAN. Spatial data (i.e. data keyed by location in some manner) has to go through the Caltrans geoportal and is subsequently synced to CKAN.

What is the California Open Data Portal?

The state of California hosts its own instance of CKAN, called the California Open Data Portal. CKAN is an open-source data management tool; in other words, it allows organizations such as Caltrans to host data so that users can search for and download data sets that might be useful for analysis or research. Among other agencies, Caltrans publishes many data sets to CKAN. Data is generally published as flat files (typically CSV) alongside required metadata and a data dictionary.

What is the publication script?

The publication script, typically used within the publish_open_data Airflow workflow, relies on a dbt exposure to determine what to publish - in practice, that exposure is titled california_open_data. The tables included in that exposure, their CKAN destinations, and their published descriptions are defined in _gtfs_schedule_latest.yml under the exposures heading.
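As a rough illustration of how an exposure can drive publication, the sketch below looks up an exposure by name in a parsed dbt manifest and lists the nodes it depends on. The `exposures` and `depends_on.nodes` keys follow the standard dbt manifest layout. The function name, the `example_project` project name, and the sample data are invented for illustration and are not taken from the publication script.

```python
def models_in_exposure(manifest: dict, exposure_name: str) -> list[str]:
    """Return the unique IDs of the nodes that an exposure depends on."""
    for exposure in manifest.get("exposures", {}).values():
        if exposure.get("name") == exposure_name:
            return list(exposure.get("depends_on", {}).get("nodes", []))
    raise KeyError(f"exposure {exposure_name!r} not found in manifest")


# Hypothetical, heavily trimmed manifest for demonstration only.
sample_manifest = {
    "exposures": {
        "exposure.example_project.california_open_data": {
            "name": "california_open_data",
            "depends_on": {
                "nodes": [
                    "model.example_project.agency",
                    "model.example_project.attributions",
                ]
            },
        }
    }
}

exposure_models = models_in_exposure(sample_manifest, "california_open_data")
print(exposure_models)
```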

By default, the columns of a table included in the exposure are not published on the portal. This is to prevent fields that are useful for internal data management but are hard to interpret for public users, like _is_current, from being included in the open data portal. Columns meant for publication are explicitly included in publication via the dbt meta tag publish.include: true, which you can see on various columns of the models in the same YAML file where the exposure itself is defined.
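The opt-in rule can be sketched as a filter over a model's columns. This is only an illustration: it assumes the tag shows up in the manifest as a literal `publish.include` key inside each column's `meta` dictionary, and the helper name and sample columns are hypothetical.

```python
def published_columns(node: dict) -> list[str]:
    """Return the names of columns explicitly opted in to publication."""
    return [
        name
        for name, column in node.get("columns", {}).items()
        # Assumption: the tag is stored as the literal key "publish.include".
        if column.get("meta", {}).get("publish.include") is True
    ]


# Hypothetical model node; only two columns opt in to publication.
sample_node = {
    "name": "agency",
    "columns": {
        "agency_id": {"name": "agency_id", "meta": {"publish.include": True}},
        "agency_name": {"name": "agency_name", "meta": {"publish.include": True}},
        "_is_current": {"name": "_is_current", "meta": {}},
    },
}

columns_to_publish = published_columns(sample_node)
print(columns_to_publish)
```

Because the filter requires an explicit `True`, a column with no tag at all (such as `_is_current` here) is left out, which matches the exclude-by-default behavior described above.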

The publication script does not read from that YAML file directly when publishing - it reads from the manifest file generated by the dbt_run_and_upload_artifacts Airflow job. By default, that manifest file is read from the GCS bucket where that job stores it during regular daily runs. If you need to make changes to the Cal-ITP GTFS-Ingest Pipeline Dataset in between runs of the daily dbt_run_and_upload_artifacts Airflow job (e.g. if there’s a time-sensitive bug you’re fixing in open data), you’ll need to either kick off an ad hoc run of that job in Airflow or run the script locally to generate a new manifest, and then use that manifest to underpin the publication run. Take care when generating a manifest locally - nothing in your local dbt project should differ from the production project other than the models you’re changing.

General open data publication process

Develop data models

Generally, data models should be built in dbt/BigQuery if possible. For example, we have latest-only GTFS schedule models that we can use to update and expand the existing CKAN dataset.

Document data

California Open Data requires two documentation files for published datasets.

  1. metadata.csv - one row per resource (i.e. file) to be published

  2. dictionary.csv - one row per column across all resources

We use dbt exposure-based data publishing to automatically generate these two files using the main script (specifically the document-exposure subcommand). The documentation from the dbt models’ corresponding YAML will be converted into appropriate CSVs and written out locally. By default, the script will read the latest manifest.json in GCS uploaded by the dbt_run_and_upload_artifacts Airflow job.
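To make the "one row per column across all resources" shape concrete, here is a minimal sketch that flattens column documentation into dictionary-style CSV text. The CSV header names (`resource`, `column`, `description`), the helper names, and the sample descriptions are placeholders. The actual columns required by California Open Data are not reproduced here, and this is not the script's own implementation.

```python
import csv
import io


def dictionary_rows(resources: dict) -> list[dict]:
    """Flatten {resource: {column: description}} into one row per column."""
    rows = []
    for resource_name, columns in resources.items():
        for column_name, description in columns.items():
            rows.append(
                {
                    "resource": resource_name,
                    "column": column_name,
                    "description": description,
                }
            )
    return rows


def to_csv(rows: list[dict]) -> str:
    """Serialize the rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=["resource", "column", "description"],
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical documentation for two resources.
sample_docs = {
    "agency": {
        "agency_id": "Identifier of the agency.",
        "agency_name": "Full name of the agency.",
    },
    "attributions": {
        "attribution_id": "Identifier of the attribution.",
    },
}

rows = dictionary_rows(sample_docs)
dictionary_csv = to_csv(rows)
print(dictionary_csv)
```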

Run this command inside the warehouse folder, assuming you have local dbt artifacts in target/ from a dbt run or dbt compile.

poetry run python scripts/ document-exposure california_open_data

Each day, a new version of manifest.json is automatically generated for tables in the production warehouse by the dbt_run_and_upload_artifacts job in the transform_warehouse DAG, and placed inside the calitp-dbt-artifacts GCS bucket. If you intend to generate new documentation locally, you’ll need to generate a new manifest.json locally first.

Create dataset and metadata

Once you’ve generated the necessary metadata and dictionary CSVs, you need to get approval from the Caltrans Geospatial Data Officer (at the time of writing, Chad Baker) for publication. Send the dictionary and metadata CSVs via email, and explain what changes are coming to the dataset: have columns been added to or removed from one of the tables, do you have a new table to add, or is there some other change?

For new tables, a CKAN destination will be created with UUIDs corresponding to each model that will be published. If you are using dbt exposures, you will need to update the meta field here to map the dbt models to the appropriate UUIDs.

For example:

      methodology: |
        Cal-ITP collects the GTFS feeds from a statewide list [link] every night and aggregates them into a statewide table
        for analysis purposes only. Do not use for trip planner ingestion; rather, it is meant to be used for statewide
        analytics and other use cases. Note: These data may or may not have passed GTFS-Validation.
      coordinate_system_epsg: "4326"
        - type: ckan
          format: csv
              id: e8f9d49e-2bb6-400b-b01f-28bc2e0e7df2
              description: |
                Each row is a cleaned row from an agency.txt file.
                Definitions for the original GTFS fields are available at:
              id: 038b7354-06e8-4082-a4a1-40debd3110d5
              description: |
                Each row is a cleaned row from an attributions.txt file.
                Definitions for the original GTFS fields are available at:
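Since a mistyped UUID would point a resource at the wrong CKAN destination, a quick sanity check before committing the mapping can help. The sketch below uses only the standard library. The helper name is hypothetical, and pairing the second ID with an `attributions` resource is an assumption based on the description text in the example above.

```python
import uuid


def invalid_resource_ids(resource_ids: dict[str, str]) -> list[str]:
    """Return the resource names whose ID does not parse as a UUID."""
    invalid = []
    for name, value in resource_ids.items():
        try:
            uuid.UUID(value)
        except ValueError:
            invalid.append(name)
    return invalid


# IDs copied from the example above; the resource names are assumed.
resource_ids = {
    "agency": "e8f9d49e-2bb6-400b-b01f-28bc2e0e7df2",
    "attributions": "038b7354-06e8-4082-a4a1-40debd3110d5",
}

print(invalid_resource_ids(resource_ids))
```

This only checks that each value is well formed; it cannot confirm that the UUID actually exists in CKAN or belongs to the intended resource.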

Publish the data

If you are using dbt-based publishing, the publish-exposure subcommand of the publication script will query BigQuery, write out CSV files, and upload those files to CKAN. An Airflow job refreshes the data at a specified frequency, or the publication script can be run manually. By default, the --no-publish flag is set, so the script executes a dry run. You can also write to GCS without uploading to CKAN by manually entering an arbitrary bucket destination.

The weekly publishing Airflow job supports referencing gs:// paths for the manifest, which is used to determine which tables and columns to publish; by default, the script will read the latest manifest in GCS uploaded by the dbt_run_and_upload_artifacts Airflow job. You may also choose to run dbt models and/or run the publish script locally; these operations can be mixed-and-matched. If you are running locally, you will need to set $CALITP_CKAN_GTFS_SCHEDULE_KEY ahead of time.

By default, the script will upload artifacts to GCS, but will not actually upload data to CKAN. In addition, the script will upload the metadata and dictionary files to GCS for eventual sharing with Caltrans employees responsible for the open data portal.

$ poetry run python scripts/ publish-exposure california_open_data --manifest ./target/manifest.json
reading manifest from ./target/manifest.json
would be writing to gs://test-calitp-publish/california_open_data__metadata/dt=2022-08-30/ts=2022-08-30T20:46:00.474199Z/metadata.csv
would be writing to gs://test-calitp-publish/california_open_data__dictionary/dt=2022-08-30/ts=2022-08-30T20:46:00.474199Z/dictionary.csv
handling agency e8f9d49e-2bb6-400b-b01f-28bc2e0e7df2
writing 346 rows (42.9 kB) from to gs://test-calitp-publish/california_open_data__agency/dt=2022-08-30/ts=2022-08-30T20:46:00.474199Z/agency.csv
would be uploading to e8f9d49e-2bb6-400b-b01f-28bc2e0e7df2 if --publish
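The GCS paths in the log output above follow a recognizable pattern: a folder named after the exposure and resource, followed by `dt=` and `ts=` partitions. The sketch below rebuilds the `agency.csv` path from that log. It is inferred from the sample output only, and the function name and signature are hypothetical rather than part of the script.

```python
from datetime import datetime, timezone


def publish_path(bucket: str, exposure: str, resource: str, ts: datetime) -> str:
    """Build a dt=/ts= partitioned path like the ones in the sample log."""
    ts_utc = ts.astimezone(timezone.utc)
    dt_part = ts_utc.date().isoformat()
    ts_part = ts_utc.strftime("%Y-%m-%dT%H:%M:%S.%fZ")
    return (
        f"{bucket}/{exposure}__{resource}"
        f"/dt={dt_part}/ts={ts_part}/{resource}.csv"
    )


# Timestamp taken from the sample log output.
example_ts = datetime(2022, 8, 30, 20, 46, 0, 474199, tzinfo=timezone.utc)
example_path = publish_path(
    "gs://test-calitp-publish", "california_open_data", "agency", example_ts
)
print(example_path)
```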

You can add the --publish flag to actually upload artifacts to CKAN after they are written to GCS. You must be using a production bucket to publish, either by setting $CALITP_BUCKET__PUBLISH or using the --bucket flag. In addition, you may specify a manifest file in GCS if desired.

poetry run python scripts/ publish-exposure california_open_data --bucket gs://calitp-publish --manifest gs://calitp-dbt-artifacts/latest/manifest.json --publish
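The interplay between the dry-run default, the bucket setting, and the --publish flag can be sketched as a small guard. Everything here is an assumption made for illustration: the function names are hypothetical, the flag is assumed to take precedence over $CALITP_BUCKET__PUBLISH, and a bucket is treated as non-production when its name starts with `gs://test-` (mirroring the `gs://test-calitp-publish` and `gs://calitp-publish` names used on this page). The real script may decide this differently.

```python
import os
from typing import Mapping, Optional


def resolve_bucket(
    bucket_flag: Optional[str], environ: Optional[Mapping[str, str]] = None
) -> str:
    """Pick the bucket from the flag, else from CALITP_BUCKET__PUBLISH."""
    env = os.environ if environ is None else environ
    # Assumption: an explicit flag wins over the environment variable.
    bucket = bucket_flag or env.get("CALITP_BUCKET__PUBLISH")
    if not bucket:
        raise ValueError("no bucket: pass --bucket or set CALITP_BUCKET__PUBLISH")
    return bucket


def should_upload_to_ckan(publish: bool, bucket: str) -> bool:
    """Return True only for a real publish run against a production bucket."""
    if not publish:
        return False  # dry run: write to GCS only
    # Assumption: test buckets are recognizable by a "gs://test-" prefix.
    if bucket.startswith("gs://test-"):
        raise ValueError(f"refusing to publish from non-production bucket {bucket}")
    return True


dry_run = should_upload_to_ckan(False, "gs://test-calitp-publish")
real_run = should_upload_to_ckan(
    True, resolve_bucket(None, {"CALITP_BUCKET__PUBLISH": "gs://calitp-publish"})
)
print(dry_run, real_run)
```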