Adding Ad-Hoc Data to the Warehouse
To work with data in our BI tool (Metabase), we first have to add the data to our warehouse (BigQuery). This page describes how to do an ad-hoc, one-time import of a dataset (for example, an individual extract from some other system).
If you are working with a data source that needs to be updated regularly (like data that should be ingested/updated daily, weekly, or monthly), you should follow the general data pipeline architecture and create a dedicated pipeline. The instructions here are only for one-time imports of individual files that will not be regularly updated.
To add one-time data to BigQuery for use in Metabase, follow the instructions below. You may need additional Google Cloud or BigQuery permissions; if so, post in the #services-team channel in Slack to ask for assistance.
First, upload your data to a Google Cloud Storage bucket in the `gs://calitp-analytics-data/data-analyses/` directory and make note of the gsutil URI (the file path).
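One way to upload a local file and capture its gsutil URI is with the `gsutil` tool; the file and subdirectory names below are hypothetical, and only the bucket prefix comes from this guide.

```shell
# Hypothetical file and subdirectory names; only the bucket prefix
# gs://calitp-analytics-data/data-analyses/ is fixed by this guide.
gsutil cp my_extract.csv gs://calitp-analytics-data/data-analyses/my_project/my_extract.csv

# List the object to confirm the upload and copy its gsutil URI.
gsutil ls gs://calitp-analytics-data/data-analyses/my_project/
```

The URI printed by `gsutil ls` is what you will pass as the source in the `bq load` step below.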
Next, navigate to a JupyterLab terminal window.
Once in the terminal, run the following command, filling in each placeholder:

```
bq --location=us-west2 load <source_format> --autodetect <destination_table> <source>
```
`<source_format>` specifies the type of file you would like to use. An example of this flag's use is `--source_format=CSV`. Other options include `NEWLINE_DELIMITED_JSON`, `AVRO`, and `PARQUET`.
`<destination_table>` is the table you would like to create, or append to if the table already exists. Your uploaded table destination should always be the `uploaded_data` dataset in BigQuery (i.e. the `destination_table` name should always have the format `uploaded_data.<table_name>`).
- If you are looking to create a new table: use a new table name
- If you are looking to append to existing data: re-use the name of the existing table
- If you are looking to replace an existing table: add the `--replace` flag alongside the other flags after `load`
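Putting the pieces above together, the following sketch shows how the full command line is assembled, including where the optional `--replace` flag goes; the table and file names are hypothetical.

```shell
# Assemble the bq load invocation step by step (names are hypothetical).
CMD="bq --location=us-west2 load --source_format=CSV --autodetect"
CMD="${CMD} --replace"               # include only when overwriting an existing table
CMD="${CMD} uploaded_data.my_table"  # destination: always the uploaded_data dataset
CMD="${CMD} gs://calitp-analytics-data/data-analyses/my_project/my_extract.csv"
echo "${CMD}"
```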
The `<source>` argument is the gsutil URI (the path to the file in the Google Cloud Storage bucket you are sourcing from).
If you run into upload errors related to the source file format, you may need to include the `--allow_quoted_newlines` flag. This may help resolve errors caused by newline characters inside quoted fields, which can appear in files converted from Excel to CSV.
A complete example:

```
bq --location=us-west2 load --source_format=CSV --autodetect --allow_quoted_newlines uploaded_data.tircp_with_temporary_expenditure_sol_copy gs://calitp-analytics-data/data-analyses/tircp/tircp.csv
```
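After the load finishes, you can optionally confirm that the table landed and inspect the autodetected schema with `bq show`, using the table name from the example above.

```shell
# Inspect the newly loaded table's metadata and autodetected schema.
bq show --format=prettyjson uploaded_data.tircp_with_temporary_expenditure_sol_copy
```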
Looking for more information? More detail on the BigQuery command-line tool (`bq`) can be found in Google Cloud's official documentation.