Publishing data to California Open Data (CKAN)
NOTE: Only non-spatial data should be published directly to CKAN. Spatial data (i.e. data keyed by location in some manner) must go through the Caltrans geoportal and is subsequently synced to CKAN.
What is the California Open Data Portal?
The state of California hosts its own instance of CKAN, called the California Open Data Portal. CKAN is an open-source data management tool; in other words, it allows organizations such as Caltrans to host data so that users can search for and download data sets that might be useful for analysis or research. Among other agencies, Caltrans publishes many data sets to CKAN. Data is generally published as flat files (typically CSV) alongside required metadata and a data dictionary.
What is the publication script?
The publication script publish.py, typically run via the publish_open_data Airflow workflow, relies on a dbt exposure to determine what to publish - in practice, that exposure is titled california_open_data. The tables included in that exposure, their CKAN destinations, and their published descriptions are defined in _gtfs_schedule_latest.yml.
By default, the columns of a table included in the exposure are not published on the portal. This prevents fields that are useful for internal data management but hard for public users to interpret, like _is_current, from appearing in the open data portal. Columns meant for publication are explicitly opted in via the dbt property publish.include: true, which you can see on various columns of the models in the same YAML file where the exposure itself is defined.
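This opt-in behavior amounts to a filter over column metadata. The sketch below is a minimal illustration: the column dictionaries are a simplified stand-in for what dbt records, not the real manifest layout.

```python
# Minimal sketch of publish.include-based column selection.
# The column structure below is a simplified, illustrative stand-in
# for dbt's actual manifest representation.

def published_columns(columns: dict) -> list[str]:
    """Return only the columns explicitly opted in with publish.include: true."""
    return [
        name
        for name, col in columns.items()
        if col.get("meta", {}).get("publish.include", False)
    ]

columns = {
    "key": {"meta": {"publish.include": True}},
    "_is_current": {"meta": {}},  # internal field, never published
    "agency_name": {"meta": {"publish.include": True}},
}

print(published_columns(columns))  # ['key', 'agency_name']
```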
The publication script does not read that YAML file directly when publishing - it reads the manifest file generated by the dbt_run_and_upload_artifacts Airflow job. By default, that manifest file is read from the GCS bucket where run_and_upload.py stores it during regular daily runs. If you need to make changes to the Cal-ITP GTFS-Ingest Pipeline Dataset between runs of the daily dbt_run_and_upload_artifacts Airflow job (e.g. if there's a time-sensitive bug you're fixing in open data), you'll need to either kick off an ad hoc run of that job in Airflow or run the script locally to generate a new manifest, and use that manifest to underpin the publish.py run. Take care when generating a manifest locally - you don't want any information in your local dbt project to differ from the production project besides the models you're changing.
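If you are inspecting a manifest by hand, the exposure's upstream models can be read straight out of the file. This is a rough sketch: dbt keys exposures as exposure.&lt;project&gt;.&lt;name&gt;, but the project name used below is a guess for illustration, and in real use the manifest dict would come from json.load() on target/manifest.json.

```python
# Sketch: list the models feeding an exposure in a dbt manifest.json.
# The project name ("calitp_warehouse") is illustrative; in practice the
# manifest dict would be loaded with json.load(open("target/manifest.json")).

def exposure_models(manifest: dict, exposure_name: str) -> list[str]:
    for key, exposure in manifest.get("exposures", {}).items():
        if key.endswith(f".{exposure_name}"):
            return exposure["depends_on"]["nodes"]
    raise KeyError(f"no exposure named {exposure_name!r}")

manifest = {
    "exposures": {
        "exposure.calitp_warehouse.california_open_data": {
            "depends_on": {"nodes": ["model.calitp_warehouse.dim_agency"]}
        }
    }
}
print(exposure_models(manifest, "california_open_data"))
```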
General open data publication process
Develop data models
California Open Data requires two documentation files for published datasets:
metadata.csv - one row per resource (i.e. file) to be published
dictionary.csv - one row per column across all resources
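As a rough illustration of the shape of these files, a dictionary.csv-style file (one row per column across all resources) can be produced with the standard csv module. The header names used here are assumptions for illustration, not the officially required ones.

```python
import csv
import io

# Sketch: write a dictionary.csv-style file, one row per column across
# all published resources. Header names are illustrative only.

columns = [
    {"resource": "agency", "column": "key", "description": "Synthetic primary key."},
    {"resource": "agency", "column": "agency_name", "description": "Full name of the transit agency."},
]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["resource", "column", "description"])
writer.writeheader()
writer.writerows(columns)
print(buf.getvalue())
```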
We use dbt exposure-based data publishing to automatically generate these two files using the main publish.py script (specifically the document-exposure subcommand). The documentation from the dbt models' corresponding YAML will be converted into appropriate CSVs and written out locally. By default, the script will read the latest manifest.json in GCS uploaded by the dbt_run_and_upload_artifacts Airflow job.
Run this command inside the warehouse folder, assuming you have local dbt target/ artifacts from a dbt run:
poetry run python scripts/publish.py document-exposure california_open_data
Each day, a new version of manifest.json is automatically generated for tables in the production warehouse by the dbt_run_and_upload_artifacts job in the transform_warehouse DAG, and placed inside the calitp-dbt-artifacts GCS bucket. If you intend to generate new documentation locally, you'll need to generate a new manifest.json locally first.
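When pointing at GCS, you generally want the most recent manifest. Because ISO-8601 timestamps sort lexicographically, selecting the newest from a list of dated object names can be sketched as below; the path layout shown is an assumption modeled on the dt=/ts= partitioning seen in the publish bucket, not a documented contract.

```python
# Sketch: pick the newest manifest from dated GCS object names.
# ISO-8601 timestamps sort lexicographically, so max() suffices.
# The exact path layout is an assumption for illustration.

paths = [
    "calitp-dbt-artifacts/dt=2022-08-29/ts=2022-08-29T20:46:00Z/manifest.json",
    "calitp-dbt-artifacts/dt=2022-08-30/ts=2022-08-30T20:46:00Z/manifest.json",
]
latest = max(paths)
print(latest)
```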
Create dataset and metadata
Once you’ve generated the necessary metadata and dictionary CSVs, you need approval from the Caltrans Geospatial Data Officer (at the time of writing, Chad Baker) for publication. Send the dictionary and metadata CSVs via email and explain what changes are coming to the dataset: have columns been added to or removed from one of the tables, do you have a new table to add, or is there some other change?
For new tables, a CKAN destination will be created with UUIDs corresponding to each model that will be published. If you are using dbt exposures, you will need to update the meta field here to map the dbt models to the appropriate UUIDs.
meta:
  methodology: |
    Cal-ITP collects the GTFS feeds from a statewide list [link] every night and
    aggregates it into a statewide table for analysis purposes only. Do not use for
    trip planner ingestion; rather, it is meant to be used for statewide analytics
    and other use cases. Note: These data may or may not have passed GTFS-Validation.
  coordinate_system_epsg: "4326"
  destinations:
    - type: ckan
      format: csv
      url: https://data.ca.gov
      resources:
        agency:
          id: e8f9d49e-2bb6-400b-b01f-28bc2e0e7df2
          description: |
            Each row is a cleaned row from an agency.txt file. Definitions for the
            original GTFS fields are available at: https://gtfs.org/reference/static#agencytxt.
        attributions:
          id: 038b7354-06e8-4082-a4a1-40debd3110d5
          description: |
            Each row is a cleaned row from an attributions.txt file. Definitions for the
            original GTFS fields are available at: https://gtfs.org/reference/static#attributionstxt.
Publish the data
If you are using dbt-based publishing, the publish-exposure subcommand of publish.py will query BigQuery, write out CSV files, and upload those files to CKAN.
An Airflow job refreshes the data at a specified frequency, or the publication script can be run manually. By default, the --no-publish flag is set, executing a dry run. You can also write to GCS without uploading to CKAN by manually specifying an arbitrary bucket destination.
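The dry-run default can be pictured as a simple guard around the upload step. This is only a sketch of the behavior described above; the function and message wording are invented for illustration (the real logging lives inside publish.py).

```python
# Sketch of the dry-run default: the action is reported either way,
# but the CKAN upload only happens when publish=True.
# Function name and messages are illustrative, not publish.py's actual API.

def publish_resource(resource_id: str, gcs_path: str, publish: bool = False) -> str:
    if not publish:
        return f"would be uploading {gcs_path} to CKAN resource {resource_id} if --publish"
    return f"uploaded {gcs_path} to CKAN resource {resource_id}"

print(publish_resource("e8f9d49e-2bb6-400b-b01f-28bc2e0e7df2",
                       "gs://test-calitp-publish/agency.csv"))
```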
The weekly publishing Airflow job supports referencing gs:// paths for the manifest, which is used to determine which tables and columns to publish; by default, the script will read the latest manifest in GCS uploaded by the dbt_run_and_upload_artifacts Airflow job.
You may also choose to run dbt models and/or run the publish script locally; these operations can be mixed and matched. If you are running publish.py locally, you will need to set $CALITP_CKAN_GTFS_SCHEDULE_KEY ahead of time.
By default, the script will upload artifacts to GCS, but will not actually upload data to CKAN. In addition, the script will upload the metadata and dictionary files to GCS for eventual sharing with Caltrans employees responsible for the open data portal.
$ poetry run python scripts/publish.py publish-exposure california_open_data --manifest ./target/manifest.json
reading manifest from ./target/manifest.json
would be writing to gs://test-calitp-publish/california_open_data__metadata/dt=2022-08-30/ts=2022-08-30T20:46:00.474199Z/metadata.csv
would be writing to gs://test-calitp-publish/california_open_data__dictionary/dt=2022-08-30/ts=2022-08-30T20:46:00.474199Z/dictionary.csv
handling agency e8f9d49e-2bb6-400b-b01f-28bc2e0e7df2
writing 346 rows (42.9 kB) from andrew_gtfs_schedule.agency to gs://test-calitp-publish/california_open_data__agency/dt=2022-08-30/ts=2022-08-30T20:46:00.474199Z/agency.csv
would be uploading to https://data.ca.gov e8f9d49e-2bb6-400b-b01f-28bc2e0e7df2 if --publish
...
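Under the hood, uploading a file to an existing CKAN resource goes through CKAN's action API (resource_update), authorized with the API key from $CALITP_CKAN_GTFS_SCHEDULE_KEY. The sketch below only builds the request rather than sending it, and is an assumption about how publish.py talks to CKAN, not a transcription of its code.

```python
import os

# Sketch: build (but don't send) a CKAN resource_update request.
# resource_update is CKAN's action-API endpoint for replacing a resource's
# file; the Authorization header carries the CKAN API key.
# This is an illustrative guess at the request shape, not publish.py's code.

def build_upload_request(base_url: str, resource_id: str) -> dict:
    return {
        "url": f"{base_url}/api/action/resource_update",
        "headers": {"Authorization": os.environ["CALITP_CKAN_GTFS_SCHEDULE_KEY"]},
        "data": {"id": resource_id},
        # the CSV itself would be attached as files={"upload": file_object}
    }

# Dummy key so the sketch runs without a real credential.
os.environ.setdefault("CALITP_CKAN_GTFS_SCHEDULE_KEY", "dummy-key-for-illustration")
req = build_upload_request("https://data.ca.gov", "e8f9d49e-2bb6-400b-b01f-28bc2e0e7df2")
print(req["url"])
```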
You can add the --publish flag to actually upload artifacts to CKAN after they are written to GCS. You must be using a production bucket to publish, set either via $CALITP_BUCKET__PUBLISH or the --bucket flag. In addition, you may specify a manifest file in GCS if desired.
poetry run python scripts/publish.py publish-exposure california_open_data --bucket gs://calitp-publish --manifest gs://calitp-dbt-artifacts/latest/manifest.json --publish
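Because publishing requires a production bucket, a guard like the following could catch accidental publishes from a test bucket. This is a hypothetical safety check, not part of publish.py; the test-/production-bucket names are taken from the examples above.

```python
# Hypothetical sketch: refuse to --publish from a non-production bucket.
# Bucket naming follows the gs://test-calitp-publish vs. gs://calitp-publish
# examples in this document.

def check_bucket(bucket: str, publish: bool) -> None:
    if publish and bucket.startswith("gs://test-"):
        raise ValueError(f"refusing to publish from test bucket {bucket}")

check_bucket("gs://calitp-publish", publish=True)  # production bucket: allowed
```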