Performing scans on collections

Required workspace permission: Run collection scan

When you first connect to a MongoDB database, Tonic Structural performs a scan to determine the available fields in each collection, the field types, and how prevalent the fields are. It performs this scan at the same time as the initial sensitivity scan.

For each collection, Structural creates a hybrid document, which is a superset of all of the fields contained in the collection documents.
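As a rough illustration of the idea, the following Python sketch builds a superset of fields from a handful of documents and records how prevalent each one is. It is not Structural's implementation. The helper names (iter_paths, build_hybrid) are hypothetical, documents are assumed to be plain dictionaries, and arrays are not traversed.

```python
from collections import Counter


def iter_paths(value, prefix=""):
    """Yield (path, type_name) pairs for every field in a nested document."""
    if isinstance(value, dict):
        for key, child in value.items():
            path = f"{prefix}.{key}" if prefix else key
            yield path, type(child).__name__
            # Recurse so that nested fields get dotted paths such as "address.city".
            yield from iter_paths(child, path)


def build_hybrid(documents):
    """Return {(path, type_name): fraction of documents that contain it}."""
    counts = Counter()
    total = 0
    for doc in documents:
        total += 1
        # Use a set so that a path is counted at most once per document.
        counts.update(set(iter_paths(doc)))
    return {key: n / total for key, n in counts.items()} if total else {}


docs = [
    {"name": "Ana", "age": 31},
    {"name": "Bo", "age": "unknown", "address": {"city": "Oslo"}},
]
hybrid = build_hybrid(docs)
```

In this sketch, "age" appears twice in the result, once as an integer and once as a string, because the same field with a different data type is treated as a separate entry.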

Configuring the collection scan

By default, for each collection:

  • The scan includes all of the documents in the collection, and continues until the scan is finished.

  • Every unique path (the combination of a field and its data type) in the collection is added to the hybrid document.

To change the default scan behavior, use the following environment settings. You can add these settings manually to the Environment Settings list on Tonic Settings.

Configuring how schemas are scanned

The following options control the number of documents that Structural scans in a collection.

These options allow you to limit the number of scanned documents when the additional documents will not add fields to the hybrid document. For large homogeneous collections, where all or most documents have the same structure, configuring these options can improve performance.

TONIC_DOCUMENT_SCAN_MAX_DOCS_COUNT

The maximum number of documents to scan for each schema in a collection. For example, if this is 10, then Structural scans up to 10 documents, and ignores the remaining documents. When this value is empty, Structural scans all of the documents.

TONIC_DOCUMENT_SCAN_MAX_TIME_SECONDS

The maximum amount of time in seconds to scan a schema. For example, if this is 360, then Structural scans a schema for up to 360 seconds. When this value is empty, Structural continues the scan until it is complete.

If you set both options, then the scan completes when it reaches either limit. For example, if the maximum document count is 10 and the maximum scan time is 360 seconds, then the scan completes either after 10 documents or after 360 seconds, whichever comes first.
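The "whichever comes first" behavior can be pictured with a short Python sketch. This is only an illustration under simplifying assumptions: scan_with_limits is a hypothetical helper, it treats the whole input as a single schema, and None stands in for an empty setting.

```python
import time


def scan_with_limits(documents, max_docs=None, max_seconds=None):
    """Return the documents scanned before either limit is reached.

    None means "no limit", mirroring an empty setting.
    """
    start = time.monotonic()
    scanned = []
    for doc in documents:
        # Stop once the document limit is reached.
        if max_docs is not None and len(scanned) >= max_docs:
            break
        # Stop once the time limit is reached.
        if max_seconds is not None and time.monotonic() - start >= max_seconds:
            break
        scanned.append(doc)
    return scanned


docs = ({"_id": i} for i in range(1000))
scanned = scan_with_limits(docs, max_docs=10, max_seconds=360)
```

Here the document limit ends the scan long before the 360 seconds elapse, so only the first 10 documents are scanned.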

Configuring how fields are collapsed in the hybrid document

Typically, the number of unique fields in a collection is small relative to the number of documents. However, in some cases the number of fields is similar to or greater than the number of documents. This most commonly occurs when documents have "data as keys", such as keys that are ObjectIds, UUIDs, or incrementing integers.

In these cases, adding every unique field to the hybrid document can result in a large hybrid document that has an undesirable structure.

Structural offers configuration options to "collapse" fields within the hybrid document. This shrinks the size of the hybrid document. It also allows you to assign a generator to the collapsed group instead of to each unique key.

By default, Structural does not collapse fields.

Collapsing fields when the key is an ObjectId

To collapse fields whose keys are ObjectIds, set TONIC_MONGO_OBJECT_ID_COLLAPSE_THRESHOLD to the number of ObjectId keys that an object can contain before Structural collapses the object schema into a single key.

For example, if this is 10, then any object that has 10 or more ObjectId keys is collapsed into a single key.

A negative value means that Structural does not collapse the keys. The default value is -1.
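To make the threshold concrete, here is a minimal Python sketch of the collapsing idea. It is an assumption-laden illustration, not how Structural works internally: collapse_object_id_keys and the "<ObjectId>" placeholder key are hypothetical, and ObjectId keys are assumed to appear as 24-character hexadecimal strings.

```python
import re

# An ObjectId rendered as a string is 24 hexadecimal characters.
OBJECT_ID = re.compile(r"[0-9a-fA-F]{24}")


def collapse_object_id_keys(obj, threshold, placeholder="<ObjectId>"):
    """Collapse ObjectId-like keys into one placeholder key.

    Collapsing happens only when the number of ObjectId-like keys
    reaches the threshold. A negative threshold disables collapsing.
    """
    id_keys = [key for key in obj if OBJECT_ID.fullmatch(key)]
    if threshold < 0 or not id_keys or len(id_keys) < threshold:
        return dict(obj)
    collapsed = {key: value for key, value in obj.items() if key not in id_keys}
    # Keep one representative value under the single collapsed key.
    collapsed[placeholder] = obj[id_keys[0]]
    return collapsed


scores = {f"{i:024x}": i for i in range(10)}  # 10 ObjectId-like keys
scores["label"] = "totals"
collapsed = collapse_object_id_keys(scores, threshold=10)
```

With a threshold of 10, the 10 ObjectId-like keys are replaced by one key and the "label" key is left alone. With a threshold of 11 or -1, the object is returned unchanged.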

Collapsing fields when the key matches a custom pattern

To enable Structural to collapse fields, you provide a regular expression to identify the fields that can be collapsed into the same field. You then configure the number of matches that must exist before Structural collapses the fields.

To configure how the fields are collapsed:

TONIC_DOCUMENT_COLLAPSE_FIELDS_REGEX

The regular expression that identifies the fields that can be collapsed into a single field. By default, this value is empty.

TONIC_DOCUMENT_COLLAPSE_FIELDS_REGEX_THRESHOLD

The number of fields that must match the regular expression before Structural collapses the fields into a single field. For example, if this is 5, then once Structural finds 5 fields that match the regular expression, it collapses all of the matching fields into a single field. A negative value means that Structural does not collapse the fields. The default value is -1.

For example:

  • To collapse keys that are integer values, use the regular expression [0-9]+ or \d+

  • To collapse keys that are UUIDs, use the regular expression [0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}
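Before you set these values, it can help to try a pattern against sample keys. The following Python sketch does that with the two expressions above. It assumes that the whole key must match the pattern and uses Python's re module, whose behavior may differ from the regular expression engine that Structural uses. The should_collapse helper is hypothetical.

```python
import re

INTEGER_KEY = re.compile(r"\d+")
UUID_KEY = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"
)


def should_collapse(keys, pattern, threshold):
    """Return True when at least `threshold` keys fully match the pattern.

    A negative threshold disables collapsing.
    """
    if threshold < 0:
        return False
    matches = sum(1 for key in keys if pattern.fullmatch(key))
    return matches > 0 and matches >= threshold


integer_keys = ["1", "2", "3", "4", "5", "status"]
uuid_keys = ["123e4567-e89b-12d3-a456-426614174000", "notes"]

collapse_integers = should_collapse(integer_keys, INTEGER_KEY, 5)
collapse_uuids = should_collapse(uuid_keys, UUID_KEY, 1)
```

Note that the UUID expression as written matches only lowercase hexadecimal digits. If your keys can contain uppercase UUIDs, widen each character class to [0-9a-fA-F].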

Viewing the most recent scans for each collection

On Privacy Hub, the Latest Collection Scan table shows the most recent scans on each scanned collection.

The Build Schema option runs a new scan on the collection.

Starting a collection scan

When the source database has a new collection, Collection View prompts you to run a scan either on that collection or on all collections.
