Generating cohesive subset data from related databases

Your data might be stored in separate but related databases. In Tonic Structural, each database provides the source data for a different workspace.

For example, a Users database contains a list of users. Each service also has a separate database. The Service1 and Service2 databases refer to identifiers of users from the Users database, but there are no direct foreign key relationships.

When you generate a subset from each database, you might want to ensure that the resulting data is complete and cohesive. For example, your application connects to and pulls data from each database. This means that your end-to-end testing also requires corresponding data from each database.

To continue the previous example, you generate subsets from the Users, Service1, and Service2 databases. Your application pulls data from each database. For the data to be complete and have referential integrity, the subsets from the Service1 and Service2 service databases should only contain records that refer to the users in the subset from the Users database.

Here are some options to generate subset from separate databases to produce data that is complete and cohesive:

In all cases, when you generate subsets across different databases, you must use consistency to ensure that common columns have the same values in each subset.

Using deterministic WHERE clauses

One way to produce complete and cohesive data across databases is to use deterministic WHERE clauses in your target table configuration. A deterministic WHERE clause always produces the same results, and is never random.

A percentage is not deterministic. Structural selects a specific number of records, but selects those records at random.

Not all WHERE clauses are deterministic. For example, the Users, Service1, and Service2 databases each have a TotalValue column that reflects the total spent as a whole and for each service. Filtering based on TotalValue does not guarantee that you get a cohesive set of records.

Instead, provide a WHERE clause that can be used in each database to produce a cohesive set of records across the databases. For example, use a WHERE clause to look for a specific set of UUID values in each database.

In our example, if we target the same set of user UUIDs in the Users, Service1, and Service2 databases, we produce a complete and cohesive set of records for those users.

When you use a deterministic WHERE clause in each database, you can run the subsetting jobs independently.

Using pre-computed record lists

This is somewhat similar to using a deterministic WHERE clause. It is one way to provide input to create a deterministic WHERE clause.

You can run a query outside of Structural, and then use the results as input to the subset configuration. For example, you could run a query to identify users that are located in the United States.

One way to use the results would be to store the results somewhere in a database that is accessible to each workspace and that you can reference in the WHERE clauses. You could also return the result as a hard-coded list, and create WHERE clauses that use an IN() filter that contains a long list of these hard-coded values. You could even use the Structural API to update the WHERE clause values as a part of an automated process.

This method allows you to run the subsetting jobs independently.

Running subsetting jobs serially

Another option is to run the jobs on the workspaces serially. The results of a job on one workspace feed into the job on the next workspace.

To do this, you run the first job, which can have a target percentage or a non-deterministic WHERE clause.

After this job completes, use the results as input to a WHERE clause in the second workspace. For example, the results might be a set of user ID values.

Depending on how the databases are set up, you might be able to query the results directly. For fully isolated databases, you could export the list and hard-code it in the WHERE clause of the second workspace.

You can only use this option if the relevant column values are not changed by the generation process. If the column has a generator applied, then the output column values from the first database do not exist in the source column values in the second database.

The previous options are ideal for when the related databases are completely isolated from each other.

However, in some cases you can connect different database instances directly to query across them. Many database engines provide this capability, such as:

  • Oracle database links

  • SQL Server linked servers

  • PostgreSQL foreign-data wrapper with foreign server

If your environment allows and supports these mechanisms, then you can directly reference the external server in a query.

Ensuring consistent column values between databases

For columns that are common across all of the databases, you must ensure that a specific value in the source databases results in the same value in all of the destination databases.

To do this, you must assign a generator that supports consistency, and enable consistency on the column.

You must also configure Structural to ensure consistency across databases.

For more information, go to Enabling consistency.

Last updated