Performing scans on collections
Required workspace permission: Run collection scan
When you first connect to a MongoDB database, Tonic Structural performs a scan to determine the available fields in each collection, the field types, and how prevalent the fields are. It performs this scan at the same time as the initial sensitivity scan.
For each collection, Structural creates a hybrid document, which is a superset of all of the fields contained in the collection documents.
By default, for each collection:
The scan includes all of the documents in the collection, and continues until the scan is finished.
Every unique path (field+data type) in the collection is added to the hybrid document.
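To make the idea of a hybrid document concrete, here is a minimal Python sketch of how a superset of unique paths (field plus data type) could be collected from a set of documents, along with how often each path occurs. It only illustrates the concept. The function names `iter_paths` and `build_hybrid` are hypothetical, and this is not Structural's implementation.

```python
from collections import Counter


def iter_paths(value, prefix=""):
    """Yield (path, type_name) for every leaf field in a nested document.

    Assumption for this sketch: only nested objects are descended into.
    Arrays and all other values are treated as leaves.
    """
    if isinstance(value, dict):
        for key, child in value.items():
            child_path = f"{prefix}.{key}" if prefix else key
            yield from iter_paths(child, child_path)
    else:
        yield prefix, type(value).__name__


def build_hybrid(documents):
    """Count how many documents contain each unique (path, type) pair."""
    counts = Counter()
    for doc in documents:
        # Use a set so that each path is counted at most once per document.
        counts.update(set(iter_paths(doc)))
    return counts


docs = [
    {"name": "Ada", "age": 36},
    {"name": "Lin", "age": "unknown", "address": {"city": "Oslo"}},
]
hybrid = build_hybrid(docs)

for (path, type_name), seen in sorted(hybrid.items()):
    print(f"{path} ({type_name}): in {seen} of {len(docs)} documents")
```

In this sketch, `age` appears twice in the result because it occurs with two different data types, which matches the "field+data type" definition of a unique path.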
You can change the default scan behavior. To change the scan configuration, use the following environment settings. You can add these settings manually to the Environment Settings list on Structural Settings.
The following options control the number of documents that Structural scans in a collection.
These options allow you to limit the number of scanned documents when the additional documents will not add fields to the hybrid document. For large homogeneous collections, where all or most documents have the same structure, configuring these options can improve performance.
If you set both options, then the scan completes when it reaches either limit. For example, if the maximum document count is 10 and the maximum scan time is 360 seconds, then the scan completes either after 10 documents or after 360 seconds, whichever comes first.
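The "whichever comes first" rule can be sketched in Python as a loop that checks both limits before it reads each document. This is an illustration only. The function `scan_with_limits` and its parameters are hypothetical and do not correspond to Structural setting names, and `None` stands in for an empty setting value.

```python
import time


def scan_with_limits(documents, max_documents=None, max_seconds=None,
                     clock=time.monotonic):
    """Return the documents scanned before either limit is reached.

    None means "no limit". The clock is injectable so that the time
    limit can be demonstrated without waiting.
    """
    started = clock()
    scanned = []
    for doc in documents:
        if max_documents is not None and len(scanned) >= max_documents:
            break  # Document limit reached first.
        if max_seconds is not None and clock() - started >= max_seconds:
            break  # Time limit reached first.
        scanned.append(doc)
    return scanned


docs = [{"_id": i} for i in range(100)]

# With a real clock, 10 small documents are read long before 360 seconds pass.
by_count = scan_with_limits(docs, max_documents=10, max_seconds=360)

# A fake clock that advances 100 seconds per call makes the time limit win.
ticks = iter(range(0, 100_000, 100))
by_time = scan_with_limits(docs, max_documents=10, max_seconds=360,
                           clock=lambda: next(ticks))

print(len(by_count), len(by_time))
```

With neither limit set, the sketch reads every document, which mirrors the default behavior described above.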
Typically, the number of unique fields in a collection is small relative to the number of documents. However, in some cases the number of fields is similar to or greater than the number of documents. This most commonly occurs when documents have "data as keys", such as keys that are ObjectIds, UUIDs, or incrementing integers.
In these cases, adding every unique field to the hybrid document can result in a large hybrid document that has an undesirable structure.
Structural offers configuration options to "collapse" fields within the hybrid document. This shrinks the size of the hybrid document. It also allows you to assign a generator to the collapsed group instead of to each unique key.
By default, Structural does not collapse fields.
To collapse objects that have ObjectId keys, set TONIC_MONGO_OBJECT_ID_COLLAPSE_THRESHOLD to the number of ObjectId keys that an object can contain before Structural collapses the object schema into a single key.
For example, if this is 10, then any object that has 10 or more ObjectId keys is collapsed into a single key.
A negative value means that the keys are not collapsed. The default value is -1.
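The threshold behavior described for TONIC_MONGO_OBJECT_ID_COLLAPSE_THRESHOLD can be sketched in Python as follows. This is a rough illustration under several assumptions: the helper `collapse_object_id_keys` and the `<ObjectId>` placeholder key are hypothetical, ObjectId keys are detected as 24 hexadecimal characters, and the value kept for the collapsed key is simply the first one found. How Structural represents the collapsed key is not stated here.

```python
import re

# Assumption: an ObjectId key is rendered as 24 hexadecimal characters.
OBJECT_ID_KEY = re.compile(r"[0-9a-fA-F]{24}")


def collapse_object_id_keys(obj, threshold, placeholder="<ObjectId>"):
    """Collapse ObjectId-like keys into one key when the threshold is met."""
    id_keys = [key for key in obj if OBJECT_ID_KEY.fullmatch(key)]
    # A negative threshold disables collapsing (the default is -1).
    if threshold < 0 or not id_keys or len(id_keys) < threshold:
        return dict(obj)
    collapsed = {key: value for key, value in obj.items()
                 if key not in id_keys}
    # Keep one representative value for the whole collapsed group.
    collapsed[placeholder] = obj[id_keys[0]]
    return collapsed


# Ten ObjectId-like keys plus one ordinary key.
orders = {f"{i:024x}": {"total": i} for i in range(10)}
orders["region"] = "EU"

collapsed = collapse_object_id_keys(orders, threshold=10)
below_threshold = collapse_object_id_keys(orders, threshold=11)
disabled = collapse_object_id_keys(orders, threshold=-1)

print(sorted(collapsed))
```

Because the object has 10 ObjectId-like keys, a threshold of 10 collapses them, while a threshold of 11 or the default of -1 leaves the object unchanged.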
To enable Structural to collapse fields, you provide a regular expression to identify the fields that can be collapsed into the same field. You then configure the number of matches that must exist before Structural collapses the fields.
To configure how the fields are collapsed:
For example:
To collapse keys that are integer values, use the regular expression [0-9]+ or \d+
To collapse keys that are UUIDs, use the regular expression [0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}
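Before you configure a regular expression, it can help to check it against a sample of real keys. The following Python sketch tries the two example patterns above on a hypothetical key list. It uses Python's `re` module and matches each key in full. Both choices are assumptions, because the regular expression engine and the anchoring that Structural applies are not described here.

```python
import re

# The example patterns from the text above.
INTEGER_KEY = re.compile(r"[0-9]+")
UUID_KEY = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"
)


def matching_keys(pattern, keys):
    """Return the keys that the pattern matches in full."""
    return [key for key in keys if pattern.fullmatch(key)]


# Hypothetical sample of keys taken from one object.
keys = [
    "0",
    "17",
    "2024",
    "total",
    "3f2b8c1e-9d4a-4c6b-8e2f-1a2b3c4d5e6f",
    "3F2B8C1E-9D4A-4C6B-8E2F-1A2B3C4D5E6F",
]

integer_matches = matching_keys(INTEGER_KEY, keys)
uuid_matches = matching_keys(UUID_KEY, keys)

print(integer_matches)
print(uuid_matches)
```

In this sketch, the uppercase UUID is not matched, because the example pattern only lists lowercase hexadecimal digits. If your keys use uppercase UUIDs, the character classes would need to include `A-F`.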
On Privacy Hub, the Latest Collection Scan table shows the most recent scans on each scanned collection.
The Build Schema option runs a new scan on the collection.
When the source database has a new collection, then on Collection View, you are prompted to run a scan either on that collection or on all collections.
| The regular expression that identifies the fields that can be collapsed into a single field. By default, this value is empty. |
| The number of fields that must match the regular expression before Structural collapses the fields into a single field. For example, if this is 5, then once Structural finds 5 fields that match the regular expression, it collapses all of the matching fields into a single field. A negative value means that the fields are not collapsed. The default value is -1. |
| The maximum number of documents to scan for each schema in a collection. For example, if this is 10, then Structural scans up to 10 documents, and ignores the remaining documents. When this value is empty, Structural scans all of the documents. |
| The maximum amount of time in seconds to scan a schema. For example, if this is 360, then Structural scans a schema for up to 360 seconds. When this value is empty, Structural continues the scan until it is complete. |