Adding Ad-Hoc Data to the Warehouse

To work with data in our BI tool (Metabase), we first have to add the data to our warehouse (BigQuery). This page describes how to do an ad-hoc, one-time import of a dataset (for example, an individual extract from some other system).

Warning

If you are working with a data source that needs to be updated regularly (like data that should be ingested/updated daily, weekly, or monthly), you should follow the general data pipeline architecture and create a dedicated pipeline. The instructions here are only for one-time imports of individual files that will not be regularly updated.

To add one-time data to BigQuery for use in Metabase, follow the instructions below. You may need additional Google Cloud or BigQuery permissions; if so, post in the #services-team channel in Slack to ask for assistance.

  1. First, upload your data to a Google Cloud Storage bucket in the gs://calitp-analytics-data/data-analyses/ directory and make note of the gsutil URI (file path). A sample upload command is included with the examples below.

  2. Next, navigate to a JupyterLab terminal window.

  3. Once in the terminal, run the following command, filling in the placeholders as described below:

bq --location=us-west2 load --source_format=<format> --autodetect <destination_table> <source>
  • The --source_format flag specifies the type of file you are loading, for example --source_format=CSV. Other options include PARQUET and NEWLINE_DELIMITED_JSON (a Parquet example is sketched below).

  • The <destination_table> is the table you would like to create, or append to if the table already exists. Your destination should always be the uploaded_data dataset in BigQuery (i.e. the destination_table name should always have the format uploaded_data.your_new_table_name).

    • If you are looking to create a new table: use a new table name

    • If you are looking to append to existing data: re-use the name of the existing table

    • If you are looking to replace an existing table: use the --replace flag after the load command (an example is shown below)

  • The <source> argument is the gsutil URI (the path to the file in the Google Cloud Storage bucket you are sourcing from).

  • If you run into upload errors related to the source file format, you may need to include the flag --allow_quoted_newlines. This flag allows newline characters inside quoted field values, which are common in CSV files converted from Excel.

Example:

bq --location=us-west2 load --source_format=CSV --autodetect --allow_quoted_newlines uploaded_data.tircp_with_temporary_expenditure_sol_copy gs://calitp-analytics-data/data-analyses/tircp/tircp.csv
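
If you are replacing an existing table rather than creating or appending to one, the same command can be run with the --replace flag. As a rough sketch, reusing the table and file from the example above:

bq --location=us-west2 load --replace --source_format=CSV --autodetect --allow_quoted_newlines uploaded_data.tircp_with_temporary_expenditure_sol_copy gs://calitp-analytics-data/data-analyses/tircp/tircp.csv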
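
For a Parquet file, a load command might look like the sketch below; the table name and file path here are placeholders, not real data. Parquet files carry their own schema, so the --autodetect flag is not required.

bq --location=us-west2 load --source_format=PARQUET uploaded_data.your_new_table_name gs://calitp-analytics-data/data-analyses/your-folder/your_file.parquet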
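
For step 1, the file can be uploaded from the same JupyterLab terminal with gsutil. This is a sketch: the local file name and subfolder are placeholders, and only the bucket path comes from this page.

gsutil cp your_file.csv gs://calitp-analytics-data/data-analyses/your-folder/your_file.csv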

Looking for more information?

More information on the BigQuery command-line tool (bq CLI) can be found in the Google Cloud documentation at https://cloud.google.com/bigquery/docs/bq-command-line-tool.