Dataset flows

You use a Textual dataset to detect sensitive values in files. The dataset output can be either:

- Files in the same format as the original file, with the sensitive values replaced based on the dataset configuration.
- JSON files that contain a summary of the detected values and replacements.

You can also create and manage datasets from the Textual SDK or REST API.

Overall workflow

At a high level, to use Textual to create redacted data:

Create and populate a dataset or pipeline

Create a Textual dataset, which is a set of files to redact. The files can be uploaded from a local file system, or can come from a cloud storage solution. When you create the dataset, you also choose the type of output, which can be either:
- Redacted versions of the original files. Each output file is in the same format as its original file.
- JSON summaries of the files and the detected entities.
Add files to the dataset. Textual supports almost any free-text file, PDF files, .docx files, and .xlsx files.
For images, Textual supports PNG, JPG (both .jpg and .jpeg), and TIF (both .tif and .tiff) files.
Textual uses its built-in models to scan the files and identify sensitive values.

Review the redaction results

Review the types of entities that were detected in the scanned files.

Configure entity type handling

At any time, including before you upload files and after you review the detection results, you can configure how Textual handles the detected values for each entity type.

You can provide added and excluded values for each built-in entity type.

You can also create and enable custom entity types.

Select the handling option for each entity type

For each entity type, select the action to perform on detected values. The options are:

Redaction - By default, Textual redacts the entity values. Redaction replaces each value with a token that consists of the entity type followed by a unique identifier. For example, NAME_GIVEN_l2m5sb or LOCATION_j40pk6. The identifiers are consistent, which means that the same original value always gets the same identifier. For example, the first name Michael might always be replaced with NAME_GIVEN_l2m5sb, while the first name Helen might always be replaced with NAME_GIVEN_9ha3m2. For PDF files, redaction either covers the value with a black box or, if there is space, displays the entity type and identifier. For image files, redaction covers the value with a black box.
Synthesis - For a given entity type, you can instead choose to synthesize the values, which replaces each original value with a realistic replacement. The synthesized values are always consistent, meaning that a given original value always produces the same replacement value. For example, the first name Michael might always be replaced with the first name John. You can also specify the replacement values to use.
Ignore - You can choose to ignore the values, and not replace them.
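
The three handling options can be illustrated with a short Python sketch. This is not how Textual is implemented. The `handle_value` function, the SHA-256-based identifier, and the `SYNTHETIC_NAMES` list are all hypothetical, and are used only to show what it means for the same original value to always produce the same replacement.

```python
import hashlib

# Hypothetical replacement pool. Real synthesis is far more sophisticated.
SYNTHETIC_NAMES = ["John", "Maria", "Priya", "Omar"]


def _stable_digest(entity_type: str, value: str) -> str:
    """Return a hex digest that depends only on the entity type and value."""
    return hashlib.sha256(f"{entity_type}:{value}".encode("utf-8")).hexdigest()


def handle_value(entity_type: str, value: str, option: str) -> str:
    """Apply one handling option to a single detected value."""
    if option == "Ignore":
        # Leave the original value in place.
        return value
    digest = _stable_digest(entity_type, value)
    if option == "Redaction":
        # Token = entity type + a short identifier derived from the value.
        return f"{entity_type}_{digest[:6]}"
    if option == "Synthesis":
        # Choose a replacement deterministically so that it stays consistent.
        return SYNTHETIC_NAMES[int(digest, 16) % len(SYNTHETIC_NAMES)]
    raise ValueError(f"Unknown handling option: {option}")
```

Because the identifier is derived from the value itself, calling `handle_value("NAME_GIVEN", "Michael", "Redaction")` twice returns the same token both times.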

For a dataset, Textual automatically updates the file previews and downloadable files to reflect the updated configuration.

For a pipeline, the updated configuration is applied the next time you run the pipeline, and only applies to new files.

Define added and excluded values for entity types

Optionally, you can create lists of values to add to or exclude from an entity type. You might do this to reflect values that are not detected or that are detected incorrectly.
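
As a rough illustration of the effect of these lists, the following Python sketch adjusts the detected values for one entity type. The `adjust_detections` function is hypothetical, and it assumes that the lists contain literal strings that are matched exactly, which might not match how Textual evaluates the lists.

```python
from typing import List


def adjust_detections(
    text: str,
    detected: List[str],
    added: List[str],
    excluded: List[str],
) -> List[str]:
    """Return the values to treat as the entity type after applying both lists."""
    excluded_set = set(excluded)
    # Drop values that were detected incorrectly.
    result = [value for value in detected if value not in excluded_set]
    # Add values that appear in the text but were not detected.
    for value in added:
        if value in text and value not in result:
            result.append(value)
    return result


text = "Helen asked Zyx to send the May invoice."
values = adjust_detections(
    text,
    detected=["Helen", "May"],   # "May" is a month here, not a first name
    added=["Zyx"],               # an unusual name that was missed
    excluded=["May"],
)
```

In this example, `values` contains `Helen` and `Zyx`, and no longer contains `May`.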

Manually update PDF files

Datasets also provide additional options for PDF files.

You can add manual overrides to a PDF file. When you add a manual override, you draw a box to identify the affected portion of the file.

You can use manual overrides either to ignore the automatically detected redactions in the selected area, or to redact the selected area.
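
The two kinds of manual override can be sketched as operations on rectangles. The following Python example is illustrative only. The `Box` class and `final_redaction_boxes` function are hypothetical, and the sketch assumes that an ignore override suppresses every detected redaction that overlaps the selected area.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Box:
    """An axis-aligned rectangle on a PDF page."""
    x0: float
    y0: float
    x1: float
    y1: float

    def intersects(self, other: "Box") -> bool:
        """Return True when the two rectangles overlap."""
        return (
            self.x0 < other.x1 and other.x0 < self.x1
            and self.y0 < other.y1 and other.y0 < self.y1
        )


def final_redaction_boxes(
    detected: List[Box],
    ignore_areas: List[Box],
    redact_areas: List[Box],
) -> List[Box]:
    """Combine automatic detections with manual overrides."""
    # Ignore overrides remove the automatic redactions in the selected area.
    kept = [
        box for box in detected
        if not any(box.intersects(area) for area in ignore_areas)
    ]
    # Redact overrides always add the selected area.
    return kept + list(redact_areas)
```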

To make it easier to process multiple files that have a similar format, such as a form, you can create templates that you can apply to PDF files in the dataset.

Generate or download output files

After you complete the redaction configuration and manual updates, to obtain the output files:

For local file datasets, you download the output files.
For cloud storage datasets that produce original format files, you run a generation job that writes the output files to the configured output location. For cloud storage datasets that produce JSON output, the files are written to the output location as soon as the output location is configured.
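
The rules above amount to a small decision table. The following Python sketch restates it. The `output_step` function and its string arguments are hypothetical labels for this illustration, not Textual settings.

```python
def output_step(storage: str, output_type: str) -> str:
    """Describe how output files are obtained for a dataset.

    storage: "local" or "cloud" (hypothetical labels)
    output_type: "original" or "json" (hypothetical labels)
    """
    if storage == "local":
        # Local file datasets: the user downloads the output files.
        return "download the output files"
    if storage == "cloud" and output_type == "original":
        # Original format output requires an explicit generation job.
        return "run a generation job"
    if storage == "cloud" and output_type == "json":
        # JSON output is written once the output location is configured.
        return "configure the output location"
    raise ValueError(f"Unsupported combination: {storage}, {output_type}")
```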

File upload and download flows

For a local file dataset, the file upload and download flows are as follows. For a more general overview of the Textual architecture, go to Textual architecture.

File upload flow

When you upload a file to a local file dataset, the flow is as follows:

The Textual user uploads the file.
The API service stores the file in either Amazon S3 or the Textual application database. For more information, go to Setting the S3 bucket for file uploads and redactions.
The API service starts a job in the worker.
The worker sends any PDF and image files to the OCR service (Amazon Textract, Document Intelligence, or Tesseract) to extract the file text.
The OCR service returns the PDF and image text to the worker.
The worker submits the file text to the Textual machine learning service to detect and replace entity values.
The machine learning service returns the results to the worker.
The worker stores the results in the application database.
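
The upload steps above can be sketched as a single Python function. This is a simplified model, not Textual code. The `handle_upload` function is hypothetical, the file store and results database are plain dictionaries, and the OCR and machine learning services are passed in as stand-in callables.

```python
import os
from typing import Callable, Dict, List

# File types that the sketch sends to OCR (PDF and image files).
OCR_EXTENSIONS = {".pdf", ".png", ".jpg", ".jpeg", ".tif", ".tiff"}


def handle_upload(
    file_name: str,
    content: bytes,
    file_store: Dict[str, bytes],
    results_db: Dict[str, List[dict]],
    ocr: Callable[[bytes], str],
    detect: Callable[[str], List[dict]],
) -> List[dict]:
    """Model the upload flow for one file."""
    # API service: store the uploaded file.
    file_store[file_name] = content

    # Worker: PDF and image files go to the OCR service. All other files
    # are treated as UTF-8 text here, which is a simplification.
    extension = os.path.splitext(file_name)[1].lower()
    if extension in OCR_EXTENSIONS:
        text = ocr(content)
    else:
        text = content.decode("utf-8")

    # Worker: send the text to the machine learning service,
    # then store the returned results.
    entities = detect(text)
    results_db[file_name] = entities
    return entities
```

Note that the original file is stored unchanged. Only the detection results are written to the results store.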

File download flow

When you download a redacted file from a local file dataset, the flow is as follows:

The Textual user makes the request to download the file.
The API service retrieves the file from where it is stored in either Amazon S3 or the application database.
The API service retrieves the detected entities and entity handling settings from the application database.
The API service applies those results to the file.
The API service returns the redacted file to the Textual user.
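
The step that applies the stored results to the file can be sketched for plain text as follows. This Python example is an assumption-laden illustration. The `build_redacted_text` function, the entity dictionaries with `start`, `end`, and `label` keys, and the bracketed placeholder tokens are all hypothetical.

```python
from typing import Dict, List


def build_redacted_text(
    text: str,
    entities: List[dict],
    handling: Dict[str, str],
) -> str:
    """Apply stored detections and handling settings to the original text."""
    result = text
    # Replace from the end of the text backward so that earlier
    # character offsets remain valid after each replacement.
    for entity in sorted(entities, key=lambda e: e["start"], reverse=True):
        # Entity types without an explicit setting are redacted.
        option = handling.get(entity["label"], "Redaction")
        if option == "Ignore":
            continue
        # Simplified placeholder token for this sketch.
        replacement = f"[{entity['label']}]"
        result = result[:entity["start"]] + replacement + result[entity["end"]:]
    return result


original = "Michael lives in Paris."
stored_entities = [
    {"start": 0, "end": 7, "label": "NAME_GIVEN"},
    {"start": 17, "end": 22, "label": "LOCATION"},
]
redacted = build_redacted_text(original, stored_entities, {"LOCATION": "Ignore"})
```

Because the replacements are applied at download time, a change to the handling settings affects the next download without a new scan of the file. In this example, `redacted` is `[NAME_GIVEN] lives in Paris.`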

