Datasets workflow for text redaction
You can use Textual to generate versions of files where the sensitive values are redacted.
To generate only redacted files, you use a Tonic Textual dataset.
You can also configure a Textual pipeline to generate redacted files in addition to the JSON output.
At a high level, to use Textual to create redacted data:
Create a Textual dataset or pipeline. A dataset is a set of files to redact. A pipeline is used to generate JSON output that can be used to populate an LLM system. Pipelines also provide an option to generate redacted versions of the selected files.
Add files to the dataset or pipeline.
Textual supports almost any free-text file, PDF files, .docx files, and .xlsx files. For images, Textual supports PNG, JPG (both .jpg and .jpeg), and TIF (both .tif and .tiff) files.
For a dataset or an uploaded files pipeline, as you add the files, Textual automatically uses its built-in models to identify entities in the files and generate the pipeline output. For a cloud storage pipeline, to identify the entities and generate the output, you run the pipeline.
For a dataset, review the types of entities that were detected across all of the files. For pipeline files, the file details include the entities that were detected in that file.
At any time, including before you upload files and after you review the detection results, you can configure how Textual handles the detected values for each entity type.
For datasets, you can also provide added and excluded values for each entity type.
By default, Textual redacts the entity values, which means to replace the values with a token that identifies the type of sensitive value, followed by a unique identifier. For example, NAME_GIVEN_l2m5sb, LOCATION_j40pk6. The identifiers are consistent, which means that for the same original value, the redacted value always has the same identifier. For example, the first name Michael might always be replaced with NAME_GIVEN_12m5sb, while the first name Helen might always be replaced with NAME_GIVEN_9ha3m2.
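The consistency property of redaction tokens can be illustrated with a minimal Python sketch. This is not Textual's actual algorithm. The function name `redaction_token`, the use of SHA-256, and the six-character suffix are assumptions chosen only to show how the same original value can always map to the same identifier.

```python
import hashlib
import string

# 36 lowercase letters and digits used to build the identifier suffix.
ALPHABET = string.ascii_lowercase + string.digits


def redaction_token(entity_type: str, value: str, length: int = 6) -> str:
    """Build a token such as NAME_GIVEN_xxxxxx that is stable for a given value.

    Hypothetical helper: the suffix is derived from a hash of the entity type
    and the original value, so repeated calls give the same token.
    """
    digest = hashlib.sha256(f"{entity_type}:{value}".encode("utf-8")).digest()
    suffix = "".join(ALPHABET[b % len(ALPHABET)] for b in digest[:length])
    return f"{entity_type}_{suffix}"


first = redaction_token("NAME_GIVEN", "Michael")
again = redaction_token("NAME_GIVEN", "Michael")
other = redaction_token("NAME_GIVEN", "Helen")

print(first == again)  # the same value always yields the same token
print(first != other)  # different values yield different tokens
```

Because the suffix depends only on the entity type and the original value, every occurrence of a value across files maps to the same token, which is the behavior described above.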
For PDF files and image files, redaction means to cover the value with a black box.
For a given entity type, you can instead choose to synthesize the values, which means to replace the original value with a realistic replacement. The synthesized values are always consistent, meaning that a given original value always produces the same replacement value. For example, the first name Michael might always be replaced with the first name John.
You can also choose to ignore the values, and not replace them.
For a dataset, Textual automatically updates the file previews and downloadable files to reflect the updated configuration.
For a pipeline, the updated configuration is applied the next time you run the pipeline, and only applies to new files.
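The three handling options for an entity type (redact, synthesize, ignore) can be sketched in plain Python. This is an illustrative model only, not the Textual API. The mode strings, the `REPLACEMENT_POOLS` table, the simplified `[ENTITY_TYPE]` placeholder, and the functions `handle_value` and `apply_handling` are all hypothetical.

```python
import hashlib
from typing import Dict, List, Tuple

Span = Tuple[int, int, str]  # (start, end, entity_type)

# Hypothetical replacement pools; a real synthesizer would be far richer.
REPLACEMENT_POOLS: Dict[str, List[str]] = {
    "NAME_GIVEN": ["John", "Maria", "Priya", "Omar"],
    "LOCATION": ["Lisbon", "Austin", "Osaka"],
}


def _stable_index(value: str, size: int) -> int:
    """Map a value to a pool index that is the same on every call."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % size


def handle_value(entity_type: str, value: str, mode: str) -> str:
    """Apply one handling mode to a single detected value."""
    if mode == "ignore":
        return value  # leave the original value in place
    if mode == "synthesize":
        pool = REPLACEMENT_POOLS[entity_type]
        return pool[_stable_index(value, len(pool))]
    if mode == "redact":
        return f"[{entity_type}]"  # simplified placeholder token
    raise ValueError(f"unknown handling mode: {mode}")


def apply_handling(text: str, spans: List[Span], handling: Dict[str, str]) -> str:
    """Rewrite text span by span; entity types without a setting are redacted."""
    pieces: List[str] = []
    cursor = 0
    for start, end, entity_type in sorted(spans):
        pieces.append(text[cursor:start])
        mode = handling.get(entity_type, "redact")
        pieces.append(handle_value(entity_type, text[start:end], mode))
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


text = "Michael flew to Paris."
spans = [(0, 7, "NAME_GIVEN"), (16, 21, "LOCATION")]

print(apply_handling(text, spans, {}))
print(apply_handling(text, spans, {"NAME_GIVEN": "synthesize", "LOCATION": "ignore"}))
```

In this sketch, an entity type with no explicit setting falls back to redaction, which mirrors the default behavior described above.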
Optionally, in a dataset, you can create lists of values to add to or exclude from an entity type. You might do this to reflect values that are not detected or that are detected incorrectly.
Pipelines do not allow you to add or exclude individual values.
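The effect of added and excluded value lists on the detection results can be modeled with a short Python sketch. It assumes literal string matching and a hypothetical `adjust_detections` function. The span format and the example entity types are illustrative and do not represent how Textual stores or applies these lists.

```python
import re
from typing import Dict, List, Set, Tuple

Span = Tuple[int, int, str]  # (start, end, entity_type)


def adjust_detections(
    text: str,
    detected: List[Span],
    added: Dict[str, Set[str]],
    excluded: Dict[str, Set[str]],
) -> List[Span]:
    """Drop excluded values, then add spans for values that were not detected."""
    # Keep only detections whose text is not excluded for that entity type.
    kept = [
        (start, end, etype)
        for (start, end, etype) in detected
        if text[start:end] not in excluded.get(etype, set())
    ]
    # Add every literal occurrence of an added value that no kept span covers.
    for etype, values in added.items():
        for value in values:
            if not value:
                continue  # an empty string would match everywhere
            for match in re.finditer(re.escape(value), text):
                overlaps = any(
                    match.start() < end and start < match.end()
                    for (start, end, _) in kept
                )
                if not overlaps:
                    kept.append((match.start(), match.end(), etype))
    return sorted(kept)


text = "Contact Robin at Acme about Project Falcon."
# Suppose "Acme" was incorrectly detected as a given name.
detected = [(8, 13, "NAME_GIVEN"), (17, 21, "NAME_GIVEN")]
result = adjust_detections(
    text,
    detected,
    added={"ORGANIZATION": {"Acme"}},
    excluded={"NAME_GIVEN": {"Acme"}},
)
print(result)
```

Here the excluded list removes the incorrect given-name detection, and the added list records the same text under a different entity type.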
Datasets also provide additional options for PDF files. These options are not available in pipelines.
You can add manual overrides to a PDF file. When you add a manual override, you draw a box to identify the affected portion of the file.
You can use manual overrides either to ignore the automatically detected redactions in the selected area, or to redact the selected area.
To make it easier to process multiple files that have a similar format, such as a form, you can create templates that you can apply to PDF files in the dataset.
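The two kinds of manual override for a PDF page, ignoring detections inside a drawn area and redacting a drawn area, can be sketched as simple rectangle logic. The `Box` and `Override` classes, the action strings, and the rule that an "ignore" area drops any detection it intersects are all assumptions made for illustration. They do not describe how Textual stores or evaluates overrides.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Box:
    """Axis-aligned rectangle on a page, in arbitrary page units."""
    x0: float
    y0: float
    x1: float
    y1: float

    def intersects(self, other: "Box") -> bool:
        return (
            self.x0 < other.x1
            and other.x0 < self.x1
            and self.y0 < other.y1
            and other.y0 < self.y1
        )


@dataclass(frozen=True)
class Override:
    area: Box
    action: str  # hypothetical values: "ignore" or "redact"


def boxes_to_cover(detected: List[Box], overrides: List[Override]) -> List[Box]:
    """Return the boxes that would be blacked out after applying overrides."""
    ignore_areas = [o.area for o in overrides if o.action == "ignore"]
    # Drop automatic detections that fall inside an "ignore" area.
    covered = [
        box for box in detected
        if not any(box.intersects(area) for area in ignore_areas)
    ]
    # Add each manually drawn "redact" area as its own box.
    covered.extend(o.area for o in overrides if o.action == "redact")
    return covered


detected = [Box(10, 10, 50, 20), Box(10, 100, 50, 110)]
overrides = [
    Override(Box(0, 90, 200, 120), "ignore"),
    Override(Box(300, 300, 400, 350), "redact"),
]
print(boxes_to_cover(detected, overrides))
```

A template, in this model, would amount to a reusable list of `Override` objects that is applied to each PDF file that shares the same layout.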
After you complete the redaction configuration and manual updates, you can download the dataset files or the synthesized pipeline files to use as needed.