# Generating cohesive subset data from related databases

Your data might be stored in separate but related databases. In Tonic Structural, each database provides the source data for a different workspace.

For example, a Users database contains a list of users. Each service also has a separate database. The Service1 and Service2 databases refer to identifiers of users from the Users database, but there are no direct foreign key relationships.

When you generate a subset from each database, you might want to ensure that the resulting data is complete and cohesive. For example, your application connects to and pulls data from each database. This means that your end-to-end testing also requires corresponding data from each database.

To continue the previous example, you generate subsets from the Users, Service1, and Service2 databases. Your application pulls data from each database. For the data to be complete and have referential integrity, the subsets from the Service1 and Service2 service databases should only contain records that refer to the users in the subset from the Users database.

Here are some options to generate subset from separate databases to produce data that is complete and cohesive:

* [Use deterministic `WHERE` clauses](#subsets-cross-data-base-deterministic-where-clauses)
* [Use pre-computed record lists](#subset-cross-database-precomputed-record-list)
* [Run the subsetting jobs serially](#subset-cross-database-run-serially)
* [Using database links or linked servers](#subset-database-links-linked-servers)

In all cases, when you generate subsets across different databases, you must [use consistency](#subset-cross-database-consistency) to ensure that common columns have the same values in each subset.

## Using deterministic WHERE clauses <a href="#subsets-cross-data-base-deterministic-where-clauses" id="subsets-cross-data-base-deterministic-where-clauses"></a>

One way to produce complete and cohesive data across databases is to use deterministic `WHERE` clauses in your target table configuration. A deterministic `WHERE` clause always produces the same results, and is never random.

A percentage is not deterministic. Structural selects a specific number of records, but selects those records at random.

Not all `WHERE` clauses are deterministic. For example, the Users, Service1, and Service2 databases each have a `TotalValue` column that reflects the total spent as a whole and for each service. Filtering based on `TotalValue` does not guarantee that you get a cohesive set of records.

Instead, provide a `WHERE` clause that can be used in each database to produce a cohesive set of records across the databases. For example, use a `WHERE` clause to look for a specific set of UUID values in each database.

In our example, if we target the same set of user UUIDs in the Users, Service1, and Service2 databases, we produce a complete and cohesive set of records for those users.

When you use a deterministic `WHERE` clause in each database, you can run the subsetting jobs independently.

## Using pre-computed record lists <a href="#subset-cross-database-precomputed-record-list" id="subset-cross-database-precomputed-record-list"></a>

This is somewhat similar to using a deterministic `WHERE` clause. It is one way to provide input to create a deterministic `WHERE` clause.

You can run a query outside of Structural, and then use the results as input to the subset configuration. For example, you could run a query to identify users that are located in the United States.

One way to use the results would be to store the results somewhere in a database that is accessible to each workspace and that you can reference in the `WHERE` clauses. You could also return the result as a hard-coded list, and create `WHERE` clauses that use an `IN()` filter that contains a long list of these hard-coded values. You could even use the Structural API to update the `WHERE` clause values as a part of an automated process.

This method allows you to run the subsetting jobs independently.

## Running subsetting jobs serially <a href="#subset-cross-database-run-serially" id="subset-cross-database-run-serially"></a>

Another option is to run the jobs on the workspaces serially. The results of a job on one workspace feed into the job on the next workspace.

To do this, you run the first job, which can have a target percentage or a non-deterministic `WHERE` clause.

After this job completes, use the results as input to a `WHERE` clause in the second workspace. For example, the results might be a set of user ID values.

Depending on how the databases are set up, you might be able to query the results directly. For fully isolated databases, you could export the list and hard-code it in the `WHERE` clause of the second workspace.

You can only use this option if the relevant column values are not changed by the generation process. If the column has a generator applied, then the output column values from the first database do not exist in the source column values in the second database.

## Using database links or linked servers <a href="#subset-database-links-linked-servers" id="subset-database-links-linked-servers"></a>

The previous options are ideal for when the related databases are completely isolated from each other.

However, in some cases you can connect different database instances directly to query across them. Many database engines provide this capability, such as:

* Oracle database links
* SQL Server linked servers
* PostgreSQL foreign-data wrapper with foreign server

If your environment allows and supports these mechanisms, then you can directly reference the external server in a query.

## Ensuring consistent column values between databases <a href="#subset-cross-database-consistency" id="subset-cross-database-consistency"></a>

For columns that are common across all of the databases, you must ensure that a specific value in the source databases results in the same value in all of the destination databases.

To do this, you must assign a generator that supports consistency, and enable consistency on the column.

You must also configure Structural to ensure consistency across databases.

For more information, go to [consistency](https://docs.tonic.ai/app/generation/generators/generator-characteristics/consistency "mention").


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://docs.tonic.ai/app/generation/subsetting/subsetting-cross-database-subsets.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
