Google BigQuery is a cloud-based enterprise data warehouse.
Tonic Structural can move data from one dataset to another within the same BigQuery project. Structural also can move data between different BigQuery projects.
Structural uses service accounts to access both the source and destination databases.
During workspace creation, under Connection Type, click BigQuery.
In the Source Settings section, you provide the connection information for the source database:
In the Project ID field, provide the identifier of the project that contains the source database.
In the Dataset Name field, provide the name of the dataset that contains the source database.
For Service Account File, select the service account file (JSON file) for the source database.
To test the connection to the source database, click Test Source Connection.
By default, data generation is not blocked for schema changes that do not conflict with your workspace configuration.
To block data generation when there are any schema changes, regardless of whether they conflict with your workspace configuration, toggle Block data generation on schema changes to the on position.
By default, Tonic Structural copies views from the source database to the destination database. You do not assign generators to view data.
To enable de-identification of view data, toggle the De-identify views option to the on position.
You cannot de-identify views that are written in Google Legacy SQL.
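A quick way to recognize a Legacy SQL view is the dialect of its query definition. As an illustration only, with hypothetical project, dataset, and column names, a Legacy SQL view query typically references tables with square brackets, while a GoogleSQL (standard SQL) view query uses backticks:

-- Legacy SQL view query (cannot be de-identified)
SELECT name FROM [my-project:my_dataset.customers]

-- GoogleSQL (standard SQL) view query (supported)
SELECT name FROM `my-project.my_dataset.customers`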
When De-identify views is enabled:
Structural includes views in the lists on Privacy Hub, Database View, and Table View. To identify an item as a view, Structural appends (view) to the name. Structural does not display views that are written in Google Legacy SQL.
You can assign table modes to the views, and generators to the view columns.
When you run data generation:
Structural applies the table modes and generators to the view data. The data generation does not include views that are written in Google Legacy SQL.
For each view, Structural writes the de-identified data to a table called <view name>_tonic_table.
Structural also creates a view that has the same name and metadata as the view in the source data, but is populated from the destination table.
For example, for a source view called customers, in the destination database, Structural:
Uses the assigned table mode and generators to create a table called customers_tonic_table.
Creates a customers view that uses the schema and metadata from the source customers view, but is populated by the data from customers_tonic_table.
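For illustration only, assuming a destination dataset named dest_dataset in a project named my-project, and a simple customers schema with two columns (all names here are hypothetical), the destination objects follow this pattern:

-- Native table that holds the de-identified view data
CREATE TABLE `my-project.dest_dataset.customers_tonic_table` (
  customer_id INT64,
  full_name STRING
);

-- View with the source view's name, populated from the destination table
CREATE VIEW `my-project.dest_dataset.customers` AS
SELECT customer_id, full_name
FROM `my-project.dest_dataset.customers_tonic_table`;

This is only a sketch of the resulting structure. Structural creates the actual objects during data generation.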
Note that destination tables cannot be external tables. They must be native BigQuery tables.
In the Destination Settings section, you provide the connection information for the destination database:
In the Project ID field, provide the identifier of the project that contains the destination database.
In the Dataset Name field, provide the name of the dataset that contains the destination database.
For Service Account File, select the service account file (JSON file) for the destination database.
To test the connection to the destination database, click Test Destination Connection.
Tonic Structural uses service accounts to access both the source and destination datasets. Even if the source and destination are in the same BigQuery project, Structural recommends that you create separate service accounts for the source and destination.
When you create the service account for the source dataset, you should use the "Viewer" role to grant the service account access to the project. Structural does not require write permissions to the source dataset.
When you create the service account for the destination dataset, you should use the "Editor" role to grant the service account access to the project.
Required license: Professional or Enterprise
A Google BigQuery workspace cannot have a hidden dataset as its source or destination dataset.
Post-job scripts run inside of transactions. They are limited to statements that are supported within transactions.
Google BigQuery workspaces cannot use the following table modes:
Incremental
Google BigQuery workspaces cannot use the following generators:
Cross Table Sum
Tonic Structural cannot process STRUCT or INTERVAL types in Google BigQuery.
In the destination database, Structural creates external, snapshot, and cloned tables as normal tables.
You cannot assign generators to partitioned tables that require a partition filter.
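A table requires a partition filter when it was created or altered with the require_partition_filter option. For illustration, with hypothetical names, such a table might be defined as follows:

-- Queries against this table must include a filter on the partitioning column
CREATE TABLE `my-project.my_dataset.events` (
  event_ts TIMESTAMP,
  payload STRING
)
PARTITION BY DATE(event_ts)
OPTIONS (require_partition_filter = TRUE);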
The environment setting TONIC_GRPC_ENABLED indicates whether to use GRPC-based endpoints to access Google BigQuery.
If TONIC_GRPC_ENABLED is true, then you can leave all of the columns set to Passthrough.
If TONIC_GRPC_ENABLED is false, then you must truncate the tables.
Google BigQuery workspaces do not support subsetting.
However, for tables that use the De-Identify table mode, you can provide a WHERE clause to filter the table. For details, go to Using table filtering for data warehouses and Spark-based data connectors.
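For example, with hypothetical column names, a filter expression for a De-Identify table might look like the following, which limits data generation to recent United States rows:

created_at >= TIMESTAMP '2024-01-01' AND region = 'US'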
Google BigQuery workspaces do not support upsert.
For Google BigQuery workspaces, you cannot write the destination data to a container repository.
For Google BigQuery workspaces, you cannot write the destination data to an Ephemeral snapshot.
If you enable view de-identification, then you might need to resynchronize out-of-sync view schemas.
Tonic Structural uses the schema metadata from the INFORMATION_SCHEMA metadata tables to display a view’s schema and to manage its schema changes.
For performance reasons, Structural assumes this schema metadata is accurate.
However, schema metadata for BigQuery views might be out of sync with the actual results of querying the view. This can occur for various reasons, but is typically because the query definition is written in such a way that changes to referenced tables result in changes to the schema of the query results.
Data generation jobs ignore any views that have out-of-date schema metadata, and issue a warning for those views.
Before you can de-identify the affected views, you must manually resynchronize the schema metadata in BigQuery.
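One way to spot out-of-sync metadata is to compare the recorded schema for the view against the columns that the view query actually returns. For example, with hypothetical dataset and view names, the recorded schema can be read from INFORMATION_SCHEMA.COLUMNS:

-- Schema metadata that Structural relies on for the view
SELECT column_name, data_type
FROM `my-project.my_dataset.INFORMATION_SCHEMA.COLUMNS`
WHERE table_name = 'customers'
ORDER BY ordinal_position;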
For logical views, schema metadata is not automatically updated during the lifecycle of the view.
If the underlying schema of the logical view has changed, then to trigger a refresh, you must either change or recreate the view definition.
To trigger a schema refresh, use the BigQuery user interface to modify the view query, then save the view.
No-op changes do trigger schema metadata refreshes.
However, saving the view without modifying the query does not trigger a refresh.
To recreate the view, run the following command:
"CREATE OR REPLACE VIEW '{project}.{dataset}.{view}' OPTIONS({options}) AS ({view_query})"
The full DDL for a view is in the following table: {dataset}.INFORMATION_SCHEMA.TABLES
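For example, with hypothetical project, dataset, and view names, you can retrieve the stored DDL for a single view and use it as the basis for the CREATE OR REPLACE VIEW statement:

SELECT ddl
FROM `my-project.my_dataset.INFORMATION_SCHEMA.TABLES`
WHERE table_name = 'customers';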
For materialized views, refreshes automatically synchronize the schema metadata.
For details on how to configure automatic refreshes and trigger manual refreshes of materialized views, go to the BigQuery documentation.