Generating cohesive subset data from related databases
Your data might be stored in separate but related databases. In Tonic Structural, each database provides the source data for a different workspace.
For example, a Users database contains a list of users. Each service also has a separate database. The Service1 and Service2 databases refer to identifiers of users from the Users database, but there are no direct foreign key relationships.
When you generate a subset from each database, you might want to ensure that the resulting data is complete and cohesive. For example, your application connects to and pulls data from each database. This means that your end-to-end testing also requires corresponding data from each database.
To continue the previous example, you generate subsets from the Users, Service1, and Service2 databases. Your application pulls data from each database. For the data to be complete and have referential integrity, the subsets from the Service1 and Service2 service databases should only contain records that refer to the users in the subset from the Users database.
Here are some options to generate subset from separate databases to produce data that is complete and cohesive:
In all cases, when you generate subsets across different databases, you must use consistency to ensure that common columns have the same values in each subset.
Using deterministic WHERE clauses
One way to produce complete and cohesive data across databases is to use deterministic WHERE
clauses in your target table configuration. A deterministic WHERE
clause always produces the same results, and is never random.
A percentage is not deterministic. Structural selects a specific number of records, but selects those records at random.
Not all WHERE
clauses are deterministic. For example, the Users, Service1, and Service2 databases each have a TotalValue
column that reflects the total spent as a whole and for each service. Filtering based on TotalValue
does not guarantee that you get a cohesive set of records.
Instead, provide a WHERE
clause that can be used in each database to produce a cohesive set of records across the databases. For example, use a WHERE
clause to look for a specific set of UUID values in each database.
In our example, if we target the same set of user UUIDs in the Users, Service1, and Service2 databases, we produce a complete and cohesive set of records for those users.
When you use a deterministic WHERE
clause in each database, you can run the subsetting jobs independently.
Using pre-computed record lists
This is somewhat similar to using a deterministic WHERE
clause. It is one way to provide input to create a deterministic WHERE
clause.
You can run a query outside of Structural, and then use the results as input to the subset configuration. For example, you could run a query to identify users that are located in the United States.
One way to use the results would be to store the results somewhere in a database that is accessible to each workspace and that you can reference in the WHERE
clauses. You could also return the result as a hard-coded list, and create WHERE
clauses that use an IN()
filter that contains a long list of these hard-coded values. You could even use the Structural API to update the WHERE
clause values as a part of an automated process.
This method allows you to run the subsetting jobs independently.
Running subsetting jobs serially
Another option is to run the jobs on the workspaces serially. The results of a job on one workspace feed into the job on the next workspace.
To do this, you run the first job, which can have a target percentage or a non-deterministic WHERE
clause.
After this job completes, use the results as input to a WHERE
clause in the second workspace. For example, the results might be a set of user ID values.
Depending on how the databases are set up, you might be able to query the results directly. For fully isolated databases, you could export the list and hard-code it in the WHERE
clause of the second workspace.
You can only use this option if the relevant column values are not changed by the generation process. If the column has a generator applied, then the output column values from the first database do not exist in the source column values in the second database.
Using database links or linked servers
The previous options are ideal for when the related databases are completely isolated from each other.
However, in some cases you can connect different database instances directly to query across them. Many database engines provide this capability, such as:
Oracle database links
SQL Server linked servers
PostgreSQL foreign-data wrapper with foreign server
If your environment allows and supports these mechanisms, then you can directly reference the external server in a query.
Ensuring consistent column values between databases
For columns that are common across all of the databases, you must ensure that a specific value in the source databases results in the same value in all of the destination databases.
To do this, you must assign a generator that supports consistency, and enable consistency on the column.
You must also configure Structural to ensure consistency across databases.
For more information, go to Enabling consistency.
Last updated