Textual uses datasets to produce files with sensitive values replaced.
Before you perform these tasks, remember to instantiate the SDK client.
To create a new dataset and then upload a file to it, use textual.create_dataset.
To add a file to the dataset, use dataset.add_file. To identify the file, provide the file path and name.
To provide the file as IO bytes, you provide the file name and the file bytes. You do not provide a path.
Textual creates the dataset, scans the uploaded file, and redacts the detected values.
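The dataset steps above can be sketched as follows. This is a minimal sketch, not the SDK's documented usage: it assumes `textual` is an already-instantiated SDK client, that `create_dataset` takes the dataset name, and that `add_file` accepts the keyword arguments `file_path`, `file_name`, and `file`. Those keyword names are assumptions, so check them against your installed SDK version.

```python
from pathlib import Path


def create_and_upload(textual, dataset_name, file_path):
    """Create a dataset, then upload one file to it by path.

    `textual` is an already-instantiated SDK client. The keyword names
    passed to add_file are assumptions, not confirmed by this page.
    """
    dataset = textual.create_dataset(dataset_name)
    path = Path(file_path)
    # Identify the file by its path and its name.
    dataset.add_file(file_path=str(path), file_name=path.name)
    return dataset


def upload_bytes(dataset, file_name, file_obj):
    """Upload a file as IO bytes: a name and the bytes, but no path."""
    dataset.add_file(file_name=file_name, file=file_obj)
```

Passing the client and dataset in as parameters keeps the helpers independent of how the client is configured.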
To change the configuration of a dataset, use dataset.edit.
You can use dataset.edit to change:
The name of the dataset
To get the current status of the files in the current dataset, use dataset.describe.
The response includes:
The name and identifier of the dataset
The number of files in the dataset
The number of files that are waiting to be processed (scanned and redacted)
The number of files that had errors during processing
For example:
To get a list of files that have a specific status, use the following:
The file list includes:
File identifier and name
Number of rows and columns
Processing status
For failed files, the error
When the file was uploaded
To delete a file from a dataset, use dataset.delete_file.
To get the redacted content in JSON format for a dataset, use dataset.fetch_all_json().
For example:
The response looks something like:
You can use the Textual SDK to redact and synthesize values in individual files.
Before you perform these tasks, remember to instantiate the SDK client.
For a self-hosted instance, you can also configure the S3 bucket to use to store the files. This is the same S3 bucket that is used to store files for uploaded file pipelines. For more information, go to . For an example of an IAM role with the required permissions, go to .
To send an individual file to Textual, you use textual.start_file_redaction.
You first open the file so that Textual can read it, then make the call for Textual to read the file.
The response includes:
The file name
The identifier of the job that processed the file. You use this identifier to retrieve a transformed version of the file.
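The open-then-send step can be sketched like this. It is a hedged example: `textual` is an already-instantiated SDK client, and passing the open file and the file name as positional arguments to `start_file_redaction` is an assumption about the call's signature. The helper name `send_file_for_redaction` is hypothetical.

```python
def send_file_for_redaction(textual, path, file_name):
    """Open the file in binary mode and hand it to Textual.

    Returns whatever start_file_redaction returns; per the text above,
    that response carries the file name and the job identifier.
    Passing (file, file_name) positionally is an assumption.
    """
    with open(path, "rb") as f:
        return textual.start_file_redaction(f, file_name)
```

Keep the returned job identifier, because you need it to retrieve the transformed file.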
After you use textual.start_file_redaction to send the file to Textual, you can retrieve a transformed version of the file.
To identify the file, you use the job identifier that you received from textual.start_file_redaction. You can also specify whether to redact, synthesize, or ignore specific entity types. By default, all of the values are redacted.
Before you make the call to download the file, you specify the path to download the file content to.
Before you perform these tasks, remember to instantiate the SDK client.
You can use the Tonic Textual SDK to redact individual strings, including:
Plain text strings
JSON content
XML content
For a text string, you can also request synthesized values from a large language model (LLM).
The redaction request can include the configuration of how to handle each entity type.
The response includes the redacted or synthesized content and details about the detected entity values.
To send a plain text string for redaction, use :
For example:
redact_json ensures that only the values are redacted. It ignores the keys.
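To illustrate what "values only" means, here is a small standalone sketch. It is not Textual's implementation and does not call the SDK: `placeholder_redactor` is a hypothetical stand-in that masks every string, and this toy version leaves numbers untouched. The point is only that keys and structure survive while leaf values change.

```python
import json


def redact_values_only(node, redact_value):
    """Apply redact_value to leaf strings; leave keys and structure alone."""
    if isinstance(node, dict):
        return {key: redact_values_only(value, redact_value) for key, value in node.items()}
    if isinstance(node, list):
        return [redact_values_only(item, redact_value) for item in node]
    if isinstance(node, str):
        return redact_value(node)
    return node  # numbers, booleans, and null pass through in this toy version


def placeholder_redactor(value):
    # Hypothetical stand-in for Textual's detection: mask every string value.
    return "[REDACTED]"


doc = json.loads('{"person": {"first": "John", "age": 30}, "tags": ["Acme"]}')
result = redact_values_only(doc, placeholder_redactor)
print(json.dumps(result))
```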
Here is a basic example of a JSON redaction request:
It produces the following JSON output:
When you redact a JSON string, you can optionally assign specific entity types to selected JSON paths.
To do this, you include the jsonpath_allow_lists parameter. Each entry consists of an entity type and a list of JSON paths for which to always use that entity type. Each JSON path must point to a simple string or numeric value.
The specified entity type overrides both the detected entity type and any added or excluded values.
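A sketch of how such a parameter might be built and passed. The dictionary shape (entity type mapped to a list of paths), the `PHONE_NUMBER` identifier, and the `$.key1` path syntax are assumptions, as is passing the JSON content as the first positional argument. The helper name `redact_with_forced_types` is hypothetical.

```python
def redact_with_forced_types(textual, json_doc, forced_types):
    """Call redact_json with entity types pinned to specific JSON paths.

    forced_types maps an entity type to the JSON paths that always use it.
    `textual` is an already-instantiated SDK client.
    """
    return textual.redact_json(json_doc, jsonpath_allow_lists=forced_types)


# Treat the value at key1 as a telephone number, whatever is detected there.
# "PHONE_NUMBER" and the "$.key1" path syntax are assumptions.
forced_types = {"PHONE_NUMBER": ["$.key1"]}
```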
In the following example, the value of the key1 node is always treated as a telephone number:
It produces the following redacted output:
redact_xml ensures that only the values are redacted. It ignores the XML markup.
For example:
Produces the following XML output:
redact_html ensures that only the values are redacted. It ignores the HTML markup.
For example:
Produces the following HTML output:
You can also request synthesized values from a large language model (LLM).
When you use this process, Textual first identifies the sensitive values in the text. It then sends the value locations and redacted values to the LLM. For example, if Textual identifies a product name, it sends the location and the redacted value PRODUCT to the LLM. Textual does not send the original values to the LLM.
The LLM then generates realistic synthesized values of the appropriate value types.
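The masking step described above can be pictured with a standalone sketch. This is an illustration only, not Textual's actual request format: the function name `mask_for_llm`, the dictionary keys, and the sample sentence are all hypothetical. It shows that once spans are replaced by their entity labels, the original values no longer appear in the text that is passed on.

```python
def mask_for_llm(text, entities):
    """Replace each detected span with its entity label.

    Returns the masked text and the span locations, so the original
    values never appear in what would be sent on.
    """
    masked = text
    # Work right to left so earlier offsets stay valid while we splice.
    for entity in sorted(entities, key=lambda e: e["start"], reverse=True):
        masked = masked[: entity["start"]] + entity["label"] + masked[entity["end"]:]
    locations = [(e["start"], e["end"], e["label"]) for e in entities]
    return masked, locations


text = "I love my new Widget3000 phone."
entities = [{"start": 14, "end": 24, "label": "PRODUCT"}]
masked, locations = mask_for_llm(text, entities)
print(masked)  # I love my new PRODUCT phone.
```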
For example:
The response provides the redacted or synthesized version of the string, and the list of detected entity values.
For each redacted item, the response includes:
The location of the value in the original text (start and end)
The location of the value in the redacted version of the string (new_start and new_end)
The entity type (label)
The original value (text)
The redacted or synthesized value (new_text). new_text is null when the entity type is ignored or when the response is from llm_synthesis.
A score to indicate confidence in the detection and redaction (score)
The detected language for the value (language)
For responses from textual.redact_json, the JSON path to the entity in the original document (json_path)
For responses from textual.redact_xml, the XPath to the entity in the original XML document (xml_path)
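The relationship between the two pairs of offsets can be checked with a short sketch. The sample strings, the score, and the redacted token are made up for illustration, and the entities are modeled as plain dictionaries, which is an assumption: the SDK may expose these fields as object attributes instead.

```python
def spans_are_consistent(original, redacted, entities):
    """Check that each entity's offsets point at the values it reports."""
    for e in entities:
        # start/end index into the original text.
        if original[e["start"]:e["end"]] != e["text"]:
            return False
        # new_start/new_end index into the redacted text; new_text may be null.
        if e["new_text"] is not None:
            if redacted[e["new_start"]:e["new_end"]] != e["new_text"]:
                return False
    return True


# Hypothetical sample data, not real Textual output.
original = "My name is John."
redacted = "My name is NAME_GIVEN_abc123."
entities = [{
    "start": 11, "end": 15, "new_start": 11, "new_end": 28,
    "label": "NAME_GIVEN", "text": "John", "new_text": "NAME_GIVEN_abc123",
    "score": 0.9, "language": "en",
}]
print(spans_are_consistent(original, redacted, entities))  # True
```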
To send a JSON string for redaction, use textual.redact_json. You can send the JSON content as a JSON string or a Python dictionary.
To send an XML string for redaction, use textual.redact_xml.
To send an HTML string for redaction, use textual.redact_html.
To send text to an LLM, use textual.llm_synthesis:
Create and manage datasets
Create, update, and get redacted files from a Textual dataset.
Redact and synthesize individual strings
Send a plain text, JSON, or XML string for redaction.
Redact and synthesize individual files
Send a file for redaction and retrieve the results.
Configure entity type handling
Configure how Textual treats each type of entity in a dataset, redacted file, or redacted string.
By default, when you:
Configure a dataset
Redact or synthesize a string
Retrieve a redacted file
Textual does the following:
For string and file redaction, redacts the detected sensitive values.
For LLM synthesis, generates realistic synthesized values.
When you make the request, you can override the default behavior.
For each entity type, you can choose to redact, synthesize, or ignore the value.
When you redact a value, Textual replaces the value with <entity type>_<generated identifier>. For example, ORGANIZATION_EPfC7XZUZ.
When you synthesize a value, Textual replaces the value with a different realistic value.
When you ignore a value, Textual passes through the original value.
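The three handling options can be contrasted in a toy sketch. It does not reproduce how Textual generates identifiers or synthesized values: the hash-based identifier and the `synthesize` callable are hypothetical stand-ins, used only to show the shape of each outcome.

```python
import hashlib


def apply_handling(value, entity_type, option, synthesize=None):
    """Return the replacement for one detected value under a handling option."""
    if option == "Off":
        # Ignore: pass the original value through.
        return value
    if option == "Redact":
        # <entity type>_<generated identifier>; this identifier scheme is made up.
        token = hashlib.sha256(value.encode("utf-8")).hexdigest()[:9]
        return f"{entity_type}_{token}"
    if option == "Synthesis":
        if synthesize is None:
            raise ValueError("Synthesis needs a synthesize callable")
        return synthesize(value)
    raise ValueError(f"Unknown handling option: {option}")


print(apply_handling("Acme Corp", "ORGANIZATION", "Off"))
print(apply_handling("Acme Corp", "ORGANIZATION", "Redact"))
print(apply_handling("Acme Corp", "ORGANIZATION", "Synthesis", lambda v: "Globex Inc"))
```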
To specify the handling option for entity types, you use the generator_config parameter.
Where:
<entity_type> is the identifier of the entity type. For example, ORGANIZATION. For the list of built-in entity types that Textual scans for, go to Entity types that Textual detects.
<handling_option> is the handling option to use for the specified entity type. The possible values are Redact, Synthesis, and Off.
For example, to synthesize organization values, and ignore languages:
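A minimal sketch of such a configuration, assuming generator_config is a dictionary that maps entity type identifiers to handling options and that `LANGUAGE` is the identifier for the language entity type. The `validate_generator_config` helper is hypothetical and is not part of the SDK. It only guards against misspelled options before a request is made.

```python
VALID_OPTIONS = {"Redact", "Synthesis", "Off"}


def validate_generator_config(config):
    """Raise ValueError if any handling option is not one of the listed values."""
    bad = {entity: option for entity, option in config.items() if option not in VALID_OPTIONS}
    if bad:
        raise ValueError(f"Unsupported handling options: {bad}")
    return config


# Synthesize organization values and pass language values through unchanged.
# "LANGUAGE" is assumed to be the identifier for the language entity type.
generator_config = validate_generator_config(
    {"ORGANIZATION": "Synthesis", "LANGUAGE": "Off"}
)
```

You would then pass the dictionary as the generator_config argument of the redaction request.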
For string and file redaction, you can specify a default handling option to use for entity types that are not specified in generator_config.
To do this, you use the generator_default parameter.
generator_default can be Redact, Synthesis, or Off.
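How the two parameters combine can be sketched as a simple lookup. This is an illustration of the described precedence, not SDK code: the function name `effective_option` is hypothetical, and the fallback of `Redact` reflects the default behavior stated earlier for string and file redaction.

```python
def effective_option(entity_type, generator_config=None, generator_default="Redact"):
    """Return the handling option that applies to one entity type.

    An explicit generator_config entry wins; otherwise generator_default applies.
    """
    return (generator_config or {}).get(entity_type, generator_default)


config = {"ORGANIZATION": "Synthesis"}
print(effective_option("ORGANIZATION", config, generator_default="Off"))  # Synthesis
print(effective_option("NAME_GIVEN", config, generator_default="Off"))    # Off
print(effective_option("NAME_GIVEN"))                                     # Redact
```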
You can also configure added and excluded values for each entity type.
You add values that Textual does not detect for an entity type, but should. You exclude values that you do not want Textual to identify for that entity type.
To specify the added values, use label_allow_lists.
To specify the excluded values, use label_block_lists.
For each of these parameters, you list the entity types to specify added or excluded values for. For each entity type, you provide the values as an array of regular expressions.
The following example uses label_allow_lists to add values:
For NAME_GIVEN, adds the values There and Here.
For NAME_FAMILY, adds values that match the regular expression ([a-z]{2}).
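The configuration described above might look like the following sketch. The dictionary shape (entity type mapped to a list of regular expressions) is an assumption based on the description. The `added_labels` helper is hypothetical and only previews which entity types a candidate string would be added to. It uses whole-string matching, which may differ from how Textual applies the patterns.

```python
import re

label_allow_lists = {
    "NAME_GIVEN": ["There", "Here"],
    "NAME_FAMILY": ["([a-z]{2})"],
}


def added_labels(candidate, allow_lists):
    """Return the entity types whose patterns match the whole candidate."""
    return sorted(
        label
        for label, patterns in allow_lists.items()
        if any(re.fullmatch(pattern, candidate) for pattern in patterns)
    )


print(added_labels("Here", label_allow_lists))   # ['NAME_GIVEN']
print(added_labels("ab", label_allow_lists))     # ['NAME_FAMILY']
print(added_labels("Hello", label_allow_lists))  # []
```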