Overview
The BigQuery destination delivery process is as follows:
Crux files for a data product are processed and stored in the Crux internal data warehouse (GCS).
The Crux Delivery Dispatch service then loads the data into the BigQuery ecosystem: files are first delivered to temporary storage (a GCS bucket), and from there the data is copied into BigQuery tables.
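The two-stage flow can be sketched with CLI commands. This is an illustration only (Crux Delivery Dispatch performs these steps for you); the file name, bucket, and source format are hypothetical, while the dataset and table names reuse the examples from this page:

```shell
# Stage 1: the processed file is delivered to temporary storage (GCS).
gsutil cp cx12345_frame.avro gs://crux-sample-bucket/tmp/cx12345/

# Stage 2: the staged file is copied from temporary storage into a BigQuery table.
bq load --source_format=AVRO \
  dataframe_cx12345_FrGSv_3bbbe41f.SCHEMA_PINNING_TEST_0427__AQLGSv_3bhcFgcRdVtHdNePvbw \
  gs://crux-sample-bucket/tmp/cx12345/cx12345_frame.avro
```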
Concepts
BigQuery follows a specific entity hierarchy: a Project contains Datasets, and each Dataset contains Tables. Crux creates the dataset and tables for each delivered data product.
Naming Convention
A fully qualified table name is referenced as <DatasetName>.<TableName>.
The table ID on BigQuery is referenced as <ProjectId>.<DatasetName>.<TableName>, where:
DatasetName – Crux-created dataset name
Format: dataframe_<Crux Data Product Code>_<Crux Frame Primary Key Id>
Example: dataframe_cx12345_FrGSv_3bbbe41f
TableName – Crux-created table name
Format: <Crux Frame Name>__<Legacy Dataset Id>
Example: SCHEMA_PINNING_TEST_0427__AQLGSv_3bhcFgcRdVtHdNePvbw
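The naming convention can be checked with simple string assembly. The component values below are the examples from this page; note the single underscore in the dataset name and the double underscore in the table name:

```shell
# Components (examples from this page)
DATA_PRODUCT_CODE="cx12345"
FRAME_PK_ID="FrGSv_3bbbe41f"
FRAME_NAME="SCHEMA_PINNING_TEST_0427"
LEGACY_DATASET_ID="AQLGSv_3bhcFgcRdVtHdNePvbw"
PROJECT_ID="crux-sample-project"

# DatasetName and TableName per the convention
DATASET_NAME="dataframe_${DATA_PRODUCT_CODE}_${FRAME_PK_ID}"
TABLE_NAME="${FRAME_NAME}__${LEGACY_DATASET_ID}"

# Fully qualified table name and the BigQuery table ID
FULLY_QUALIFIED="${DATASET_NAME}.${TABLE_NAME}"
TABLE_ID="${PROJECT_ID}.${DATASET_NAME}.${TABLE_NAME}"
echo "$FULLY_QUALIFIED"
echo "$TABLE_ID"
```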
Requirements and Permissions
Collect the settings information for the delivery channel and grant permissions to the Crux service accounts. Permissions must be in place before setting up and testing your connection.
Google Cloud Platform (GCP) attributes:
Project ID: GCP Project ID for BigQuery and GCS (e.g. “crux-sample-project”)
Data Location: GCP storage data location, as determined by your GCP project.
Temporary storage attributes:
GCS Bucket Name: The GCS bucket name associated with your GCP Project ID for temporary storage
Bucket prefix path: The GCS bucket prefix path for temporary storage. This is an optional field.
Create a GCS Bucket
This step describes how to create a Google Cloud Storage (GCS) bucket. The bucket will be used to stage data that will be imported into BigQuery.
Log in to the Google Cloud console
Make sure that the Cloud Storage API is enabled. Navigate to APIs & Services, search for "storage" in the global search at the top, click on the Cloud Storage API, and enable it if it is not already enabled.
Navigate to the Cloud Storage product in the menu and click on Buckets
Enter a name for your bucket. Bucket names must be globally unique across all projects and organizations.
Select which region or multi-region you'd like to store your data in. Important: You must create the bucket in the same region or multi-region that your BigQuery dataset will be in.
Select the Standard default storage class.
Ensure that Enforce public access prevention on this bucket is selected, and that the Uniform access-control radio button is selected.
For protecting object data, use the default settings.
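The console steps above can also be performed with the gcloud CLI. A sketch, assuming a hypothetical bucket name, project, and the US multi-region (substitute your own values, and keep the location aligned with your BigQuery dataset):

```shell
gcloud storage buckets create gs://crux-sample-bucket \
  --project=crux-sample-project \
  --location=US \
  --default-storage-class=STANDARD \
  --uniform-bucket-level-access \
  --public-access-prevention
```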
Granting Crux access to deliver data
After creating your GCS bucket, you need to grant Crux access to write data to your destination.
The Crux service account that will need access for the production environment is:
delivery-dispatch-core-service@crux-199318.iam.gserviceaccount.com
Perform the following steps to grant Crux access to both GCS and BigQuery capabilities:
Grant access at the bucket level, not at the project or organization level. Grant the service account noted above the Storage Legacy Bucket Writer and Storage Legacy Object Owner roles by selecting the relevant bucket, clicking the Permissions tab, and then clicking Grant Access.
Next, you will need to grant the same Crux service account access to create and write to a BigQuery dataset in your project. Navigate to APIs & Services. Click Enable APIs and Services. Search for "bigquery" in the global search at the top, and select the BigQuery API result. Enable it if it is not already enabled.
You should now navigate to IAM & Admin and click Grant Access. Enter the same Crux service account as the principal and grant the BigQuery Data Editor role.
Upon clicking Save, you will now be able to successfully create a BigQuery destination in the Crux External Data Platform app.
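The grants above can also be applied with the gcloud CLI. A sketch, assuming a hypothetical bucket name and project ID (the service account is the production account from this page):

```shell
SERVICE_ACCOUNT="delivery-dispatch-core-service@crux-199318.iam.gserviceaccount.com"

# Bucket-level roles for temporary storage
gcloud storage buckets add-iam-policy-binding gs://crux-sample-bucket \
  --member="serviceAccount:${SERVICE_ACCOUNT}" \
  --role="roles/storage.legacyBucketWriter"
gcloud storage buckets add-iam-policy-binding gs://crux-sample-bucket \
  --member="serviceAccount:${SERVICE_ACCOUNT}" \
  --role="roles/storage.legacyObjectOwner"

# Enable the BigQuery API and grant the project-level BigQuery Data Editor role
gcloud services enable bigquery.googleapis.com --project=crux-sample-project
gcloud projects add-iam-policy-binding crux-sample-project \
  --member="serviceAccount:${SERVICE_ACCOUNT}" \
  --role="roles/bigquery.dataEditor"
```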
Connect to Your Destination
To connect to your destination:
Provide a unique connection name that you can refer to at a later time.
Select BigQuery as the destination connection method.
Provide the Project ID associated with the BigQuery project that you have granted Crux access to. The Project ID value appears in the top navigation bar of the Google Cloud console.
Provide the GCS Bucket Name associated with the GCP project for temporary storage.
Optionally, provide a GCS bucket prefix path for the temporary files.
Provide the Location associated with the GCS bucket and BigQuery project from the Location dropdown. Based on the selected location, the Region dropdown populates with the available regions and multi-regions.
Click Add connection to connect to the delivery channel.
Note: When testing a connection to a new destination, a test file is added into the target destination.
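After testing the connection, you can confirm that the test file landed in the temporary storage bucket. A sketch, assuming a hypothetical bucket name and prefix:

```shell
gcloud storage ls gs://crux-sample-bucket/crux-tmp/**
```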
Connection Fields
To create a BigQuery destination connection, configure the following connection settings:
Field | Description
Destination name | Name for your destination that can be quickly referenced in the future.
Connection method | Method for the destination connection.
Project ID | BigQuery project ID associated with the connection (e.g. "crux-sample-project").
GCS bucket name | The bucket name associated with the Google Cloud Storage location. This bucket name is part of the domain name in the GCS cloud URL.
Bucket prefix path | (Optional) Bucket folder prefix path for the temporary files.
Location | The geographic area associated with the GCS bucket. Options: Americas, Europe, Asia-Pacific.
Region name | A region is a specific geographic place, such as London. A multi-region is a large geographic area, such as the United States, that contains two or more geographic places. Note: After the dataset is created, the data is stored in the specified area.