Datasets and redaction

You can use the Tonic Textual SDK to manage pipelines and to redact individual strings and files.

Create and manage datasets

Textual uses datasets to produce files with sensitive values replaced.

Before you perform these tasks, remember to instantiate the SDK client.

Create and add files to a dataset

To create a new dataset and then upload a file to it, use textual.create_dataset.

dataset = textual.create_dataset('<dataset name>')

To add a file to the dataset, use dataset.add_file. To identify the file, provide the file path and name.

dataset.add_file('<path to file>','<file name>')

To provide the file as IO bytes, you provide the file name and the file bytes. You do not provide a path.

dataset.add_file('<file name>',<file bytes>)

Textual creates the dataset, scans the uploaded file, and redacts the detected values.
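
As a minimal sketch of preparing the two arguments for the bytes-based call, the following uses only the Python standard library to derive a file name and read the file bytes. The file sample.txt and the helper load_name_and_bytes are hypothetical and are not part of the SDK. The final call to dataset.add_file is left as a comment because it requires a live Textual client.

```python
import tempfile
from pathlib import Path


def load_name_and_bytes(path):
    """Return (file name, file bytes) for a local file."""
    p = Path(path)
    # Read in binary mode so that non-text files are not altered.
    return p.name, p.read_bytes()


# Create a small throwaway file so that the sketch runs anywhere.
tmp_dir = Path(tempfile.mkdtemp())
sample = tmp_dir / "sample.txt"
sample.write_bytes(b"My name is John.")

file_name, file_bytes = load_name_and_bytes(sample)

# With a live client, these values would be passed on, for example:
# dataset.add_file(file_name, file_bytes)
```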

Configure a dataset

To change the configuration of a dataset, use dataset.edit.

You can use dataset.edit to change:

The name of the dataset
The handling option for each entity type
Added or excluded values for each entity type

dataset.edit(name='<dataset name>',
  generator_config={'<entity_type>':'<handling_option>'},
  label_allow_lists={'<entity_type>':LabelCustomList(regexes=['<regex>'])},
  label_block_lists={'<entity_type>':LabelCustomList(regexes=['<regex>'])}
)

Get the current status of dataset files

To get the current status of the files in the current dataset, use dataset.describe:

dataset.describe()

The response includes:

The name and identifier of the dataset
The number of files in the dataset
The number of files that are waiting to be processed (scanned and redacted)
The number of files that had errors during processing

For example:

    Dataset: example [879d4c5d-792a-c009-a9a0-60d69be20206]
    Number of Files: 1
    Files that are waiting for processing: 
    Files that encountered errors while processing: 
    Number of Rows: 0
    Number of rows fetched: 0

Get lists of files by status

You can also get a list of the dataset files that have a specific status.

The file list includes:

File identifier and name
Number of rows and columns
Processing status
For failed files, the error
When the file was uploaded

Delete a file from a dataset

To delete a file from a dataset, use dataset.delete_file.

dataset.delete_file('<file identifier>')

Get redacted content for a dataset

To get the redacted content in JSON format for a dataset, use dataset.fetch_all_json():

dataset = textual.get_dataset('<dataset name>')
dataset.fetch_all_json()

For example:

dataset = textual.get_dataset('mydataset')
dataset.fetch_all_json()

The response looks something like:

'[["PERSON_Rz8NtJTPONTKgcB95i Portrait by PERSON_blatU6mAWFCQoSa5E, DATE_TIME_Rcl58 ...]'
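
Because the content is returned in JSON format, it can be parsed with the standard json module. The following is a sketch under the assumption that the string decodes to a list of rows, where each row is a list of cell values, as the example above suggests. The raw value is a hypothetical, shortened stand-in for a real response.

```python
import json

# Hypothetical stand-in for the string returned by dataset.fetch_all_json().
raw = '[["PERSON_abc123 Portrait by PERSON_def456, DATE_TIME_ghi789"]]'

rows = json.loads(raw)

# Each row is assumed to be a list of redacted cell values.
for row_number, row in enumerate(rows, start=1):
    for cell in row:
        print(f"row {row_number}: {cell}")
```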

Redact and synthesize individual files

You can use the Textual SDK to redact and synthesize values in individual files.

Before you perform these tasks, remember to instantiate the SDK client.

For a self-hosted instance, you can also configure the S3 bucket to use to store the files. This is the same S3 bucket that is used to store files for uploaded file pipelines. The Textual documentation describes this configuration and provides an example of an IAM role with the required permissions.

Sending a file to Textual

To send an individual file to Textual, you use textual.start_file_redaction.

You first open the file so that Textual can read it, then make the call for Textual to read the file.

The response includes:

The file name
The identifier of the job that processed the file. You use this identifier to retrieve a transformed version of the file.

Getting the file with redacted or synthesized values

After you use textual.start_file_redaction to send the file to Textual, you retrieve a transformed version of the file.

To identify the file, you use the job identifier that you received from textual.start_file_redaction. You can also specify whether to redact, synthesize, or ignore specific entity types. By default, all of the values are redacted.

Before you make the call to download the file, you specify the path to download the file content to.

Configuring entity type handling for redaction

By default, when you:

Configure a dataset
Redact or synthesize a string
Retrieve a redacted file

Textual does the following:

For string and file redaction, redacts the detected sensitive values.
For LLM synthesis, generates realistic synthesized values.

When you make the request, you can override the default behavior.

Specifying the handling option for entity types

For each entity type, you can choose to redact, synthesize, or ignore the value.

When you redact a value, Textual replaces the value with <entity type>_<generated identifier>. For example, ORGANIZATION_EPfC7XZUZ.
When you synthesize a value, Textual replaces the value with a different realistic value.
When you ignore a value, Textual passes through the original value.

To specify the handling option for entity types, you use the generator_config parameter.

generator_config={'<entity_type>':'<handling_option>'}

Where:

<entity_type> is the identifier of the entity type. For example, ORGANIZATION. For the list of built-in entity types that Textual scans for, go to Entity types that Textual detects.
<handling_option> is the handling option to use for the specified entity type. The possible values are Redact, Synthesis, and Off.

For example, to synthesize organization values, and ignore languages:

generator_config={'ORGANIZATION':'Synthesis', 'LANGUAGE':'Off'}
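
Because the handling options are plain strings, a typo such as 'synthesis' is easy to miss. The following is a minimal sketch of a local check that runs before the configuration is sent. The helper validate_generator_config and the constant ALLOWED_HANDLING_OPTIONS are hypothetical and are not part of the SDK. The allowed values are the three that are listed above.

```python
ALLOWED_HANDLING_OPTIONS = {"Redact", "Synthesis", "Off"}


def validate_generator_config(config):
    """Raise ValueError if a handling option is not one of the documented values."""
    for entity_type, option in config.items():
        if option not in ALLOWED_HANDLING_OPTIONS:
            raise ValueError(
                f"Unsupported handling option {option!r} for {entity_type}"
            )
    return config


generator_config = validate_generator_config(
    {"ORGANIZATION": "Synthesis", "LANGUAGE": "Off"}
)
```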

Specifying a default handling option

For string and file redaction, you can specify a default handling option to use for entity types that are not specified in generator_config.

To do this, you use the generator_default parameter.

generator_default can be Redact, Synthesis, or Off.
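
The interaction between generator_config and generator_default can be modeled as a dictionary lookup with a fallback. This is only a sketch of the behavior described here, not Textual's implementation. The function effective_handling is hypothetical, and its fallback of Redact reflects the statement that values are redacted by default.

```python
def effective_handling(entity_type, generator_config, generator_default="Redact"):
    """Return the handling option that applies to one entity type."""
    # An explicit entry in generator_config wins; otherwise use the default.
    return generator_config.get(entity_type, generator_default)


generator_config = {"ORGANIZATION": "Synthesis", "LANGUAGE": "Off"}

for entity_type in ("ORGANIZATION", "LANGUAGE", "NAME_GIVEN"):
    print(entity_type, effective_handling(entity_type, generator_config, "Off"))
```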

Providing added and excluded values for entity types

You can also configure added and excluded values for each entity type.

You add values that Textual does not detect for an entity type, but should. You exclude values that you do not want Textual to identify for that entity type.

To specify the added values, use label_allow_lists.
To specify the excluded values, use label_block_lists.

For each of these parameters, the value is a dictionary that maps each entity type to its added or excluded values. To specify the values, you provide an array of regular expressions.

{'<entity_type>':['<regex>']}

The following example uses label_allow_lists to add values:

For NAME_GIVEN, adds the values There and Here.
For NAME_FAMILY, adds values that match the regular expression ([a-z]{2}).

label_allow_lists={
    'NAME_GIVEN':['There','Here'],
    'NAME_FAMILY':['([a-z]{2})']
}
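
Before you send regular expressions to Textual, it can help to preview what they match locally. The following sketch uses the standard re module on a hypothetical sample string. How Textual applies the expressions (for example, whole token or substring, or case sensitivity) is not described here, so the use of re.finditer is an assumption, and preview_matches is a hypothetical helper.

```python
import re

label_allow_lists = {
    "NAME_GIVEN": ["There", "Here"],
    "NAME_FAMILY": ["([a-z]{2})"],
}


def preview_matches(text, patterns):
    """Return every substring of text that is matched by any of the patterns."""
    found = []
    for pattern in patterns:
        found.extend(m.group(0) for m in re.finditer(pattern, text))
    return found


sample = "Here and There, ab CD"
given = preview_matches(sample, label_allow_lists["NAME_GIVEN"])
family = preview_matches(sample, label_allow_lists["NAME_FAMILY"])
print(given)
print(family)
```

In this preview, the NAME_FAMILY expression matches every pair of adjacent lowercase letters, which shows how broad a short pattern can be.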

Redact and synthesize individual strings

Before you perform these tasks, remember to instantiate the SDK client.

You can use the Tonic Textual SDK to redact individual strings, including:

Plain text strings
JSON content
XML content
HTML content

For a text string, you can also request synthesized values from a large language model (LLM).

The redaction request can include the handling configuration for entity types.

The response includes the redacted or synthesized content and details about the detected entity values.

Redact a plain text string

To send a plain text string for redaction, use textual.redact.

Redact JSON content

To send a JSON string for redaction, use textual.redact_json. You can send the JSON content as a JSON string or a Python dictionary.

json_redaction = textual.redact_json(<JSON string or Python dictionary>)

redact_json ensures that only the values are redacted. It ignores the keys.
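
To illustrate what "values only" means, here is a minimal sketch that walks a JSON-like structure and transforms the string values while leaving the keys and the structure unchanged. It is not Textual's implementation. The helper map_string_values and the [MASKED] placeholder are hypothetical, and unlike redact_json, this sketch passes numeric values through untouched.

```python
import json


def map_string_values(node, transform):
    """Apply transform to every string value; keep keys and structure as-is."""
    if isinstance(node, dict):
        return {key: map_string_values(value, transform) for key, value in node.items()}
    if isinstance(node, list):
        return [map_string_values(item, transform) for item in node]
    if isinstance(node, str):
        return transform(node)
    # Numbers, booleans, and None pass through in this sketch.
    return node


document = {"person": {"first": "John", "last": "OReilly"}, "tags": ["John"]}
masked = map_string_values(document, lambda value: "[MASKED]")
print(json.dumps(masked, indent=2))
```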

Basic JSON redaction example

Here is a basic example of a JSON redaction request:

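import json

# Build the sample document as a Python dictionary.
d = dict()
d['person'] = {'first': 'John', 'last': 'OReilly'}
d['address'] = {'city': 'Memphis', 'state': 'TN', 'street': '847 Rocky Top', 'zip': 1234}
d['description'] = 'John is a man that lives in Memphis.  He is 37 years old and is married to Cynthia'

# Send the dictionary for redaction. Only the values are redacted, not the keys.
json_redaction = textual.redact_json(d)

# redacted_text is a JSON string. Parse it, then pretty-print it.
print(json.dumps(json.loads(json_redaction.redacted_text), indent=2))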

It produces the following JSON output:

{
  "person": {
    "first": "[NAME_GIVEN_WpFV4]",
    "last": "[NAME_FAMILY_orTxwj3I]"
  },
  "address": {
    "city": "[LOCATION_CITY_UtpIl2tL]",
    "state": "[LOCATION_STATE_n24]",
    "street": "[LOCATION_ADDRESS_KwZ3MdDLSrzNhwB]",
    "zip": "[LOCATION_ZIP_L42eP19]"
  },
  "description": "[NAME_GIVEN_WpFV4] is a man that lives in [LOCATION_CITY_UtpIl2tL].  He is [DATE_TIME_LLr6L3gpNcOcl3] and is married to [NAME_GIVEN_yWfthDa6]"
}

Specifying entity types for specific JSON paths

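When you redact a JSON string, you can optionally assign specific entity types to selected JSON paths.

To do this, you include the jsonpath_allow_lists parameter. Each entry consists of an entity type and a list of JSON paths for which to always use that entity type. Each JSON path must point to a simple string or numeric value.

jsonpath_allow_lists={'<entity_type>':['<JSON path>']}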

The specified entity type overrides both the detected entity type and any added or excluded values.

In the following example, the value of the key1 node is always treated as a telephone number:

response = textual.redact_json('{"key1":"Ex123", "key2":"My name is Johnson"}', jsonpath_allow_lists={'PHONE_NUMBER':['$.key1']})

It produces the following redacted output:

{"key1":"[PHONE_NUMBER_zbIr0]","key2":"My name is [NAME_FAMILY_uC293]"}
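
To check which value a JSON path points to before you add it to jsonpath_allow_lists, a small local resolver can help. The following sketch supports only dotted paths of the form $.key1.key2 against nested dictionaries. The JSON path syntax that Textual accepts is not described here, and both resolve_simple_path and the sample document are hypothetical.

```python
def resolve_simple_path(document, path):
    """Resolve a path of the form $.key1.key2 against nested dictionaries."""
    if not path.startswith("$."):
        raise ValueError("This sketch only supports paths that start with '$.'")
    node = document
    for key in path[2:].split("."):
        node = node[key]
    return node


document = {"key1": "Ex123", "contact": {"phone": "555-6789"}}
jsonpath_allow_lists = {"PHONE_NUMBER": ["$.key1", "$.contact.phone"]}

for entity_type, paths in jsonpath_allow_lists.items():
    for path in paths:
        print(entity_type, path, "->", resolve_simple_path(document, path))
```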

Redact XML content

To send an XML string for redaction, use textual.redact_xml.

redact_xml ensures that only the values are redacted. It ignores the XML markup.

For example:

xml_string = '''<?xml version="1.0" encoding="UTF-8"?>
    <!-- This XML document contains sample PII with namespaces and attributes -->
    <PersonInfo xmlns="http://www.example.com/default" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:contact="http://www.example.com/contact">
        <!-- Personal Information with an attribute containing PII -->
        <Name preferred="true" contact:userID="john.doe123">
            <FirstName>John</FirstName>
            <LastName>Doe</LastName>He was born in 1980.</Name>

        <contact:Details>
            <!-- Email stored in an attribute for demonstration -->
            <contact:Email address="john.doe@example.com"/>
            <contact:Phone type="mobile" number="555-6789"/>
        </contact:Details>

        <!-- SSN stored as an attribute -->
        <SSN value="987-65-4321" xsi:nil="false"/>
        <data>his name was John Doe</data>
    </PersonInfo>'''

response = textual.redact_xml(xml_string)

redacted_xml = response.redacted_text

It produces the following XML output:

<?xml version="1.0" encoding="UTF-8"?><!-- This XML document contains sample PII with namespaces and attributes -->\n<PersonInfo xmlns="http://www.example.com/default" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:contact="http://www.example.com/contact"><!-- Personal Information with an attribute containing PII --><Name preferred="true" contact:userID="[NAME_GIVEN_NUhdshJf3SkI0]">[GENDER_IDENTIFIER_gh1] was born in [DOB_nHfb2].<FirstName>[NAME_GIVEN_HI1h7]</FirstName><LastName>[NAME_FAMILY_bKk1]</LastName></Name><contact:Details><!-- Email stored in an attribute for demonstration --><contact:Email address="[EMAIL_ADDRESS_DSlxAYEPw0XkIiADi0WbpW1]"></contact:Email><contact:Phone type="mobile" number="[PHONE_NUMBER_5LWjT19Ee]"></contact:Phone></contact:Details><!-- SSN stored as an attribute --><SSN value="[PHONE_NUMBER_4B2QKKwghix90]" xsi:nil="false"></SSN><data>[GENDER_IDENTIFIER_XN92] name was [NAME_GIVEN_HI1h7] [NAME_FAMILY_bKk1]</data></PersonInfo>

Redact HTML content

To send an HTML string for redaction, use textual.redact_html.

redact_html ensures that only the values are redacted. It ignores the HTML markup.

For example:

html_content = """
<!DOCTYPE html>
<html>
    <head>
        <title>John Doe</title>
    </head>
    <body>
        <h1>John Doe</h1>
        <p>John Doe is a person who lives in New York City.</p>
        <p>John Doe's phone number is 555-555-5555.</p>
    </body>
</html>
"""

# Run the redact_html method
redacted_html = textual.redact_html(html_content, generator_config={
            "NAME_GIVEN": "Synthesis",
            "NAME_FAMILY": "Synthesis"
        })

print(redacted_html.redacted_text)

It produces the following HTML output:

<!DOCTYPE html>
<html>
    <head>
        <title>Scott Roley</title>
    </head>
    <body>
        <h1>Scott Roley</h1>
        <p>Scott Roley is a person who lives in [LOCATION_CITY_HwTG541HnrMzfO7].</p>
        <p>Scott Roley's phone number is [PHONE_NUMBER_apZd0xjh3Z3lf4].</p>
    </body>
</html>
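
To see which parts of an HTML document count as values rather than markup, you can list its text nodes locally. The following sketch uses the standard html.parser module. It only illustrates the distinction and says nothing about how Textual parses HTML. The class TextNodeCollector and the shortened sample document are hypothetical.

```python
from html.parser import HTMLParser


class TextNodeCollector(HTMLParser):
    """Collect the non-empty text nodes of an HTML document."""

    def __init__(self):
        super().__init__()
        self.text_nodes = []

    def handle_data(self, data):
        # Tags and attributes never reach this method; only text content does.
        stripped = data.strip()
        if stripped:
            self.text_nodes.append(stripped)


html_content = (
    "<html><head><title>John Doe</title></head>"
    "<body><h1>John Doe</h1><p>John Doe lives in New York City.</p></body></html>"
)

collector = TextNodeCollector()
collector.feed(html_content)
collector.close()
print(collector.text_nodes)
```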

Using an LLM to generate synthesized values

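You can also request synthesized values from a large language model (LLM).

When you use this process, Textual first identifies the sensitive values in the text. It then sends the value locations and redacted values to the LLM. For example, if Textual identifies a product name, it sends the location and the redacted value PRODUCT to the LLM. Textual does not send the original values to the LLM.

The LLM then generates realistic synthesized values of the appropriate value types.

To send text to an LLM, use textual.llm_synthesis: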

raw_synthesis = textual.llm_synthesis("Text of the string")

For example:

raw_synthesis = textual.llm_synthesis("My name is John, and today I am demoing Textual, a software product created by Tonic")
raw_synthesis.describe()

The response looks something like:

My name is John, and on Monday afternoon I am demoing Widget Pro, a software product created by Initech Enterprises.
{"start": 11, "end": 15, "new_start": 11, "new_end": 15, "label": "NAME_GIVEN", "text": "John", "new_text": null, "score": 0.9, "language": "en"}
{"start": 21, "end": 26, "new_start": 21, "new_end": 40, "label": "DATE_TIME", "text": "today", "new_text": null, "score": 0.85, "language": "en"}
{"start": 40, "end": 47, "new_start": 54, "new_end": 64, "label": "PRODUCT", "text": "Textual", "new_text": null, "score": 0.85, "language": "en"}
{"start": 79, "end": 84, "new_start": 96, "new_end": 115, "label": "ORGANIZATION", "text": "Tonic", "new_text": null, "score": 0.85, "language": "en"}

Format of the redaction and synthesis response

The response provides the redacted or synthesized version of the string, and the list of detected entity values. For example:

Contact ORGANIZATION_EPfC7XZUZ with questions
    
{"start": 8, "end": 16, "new_start": 8, "new_end": 30, "label": "ORGANIZATION", "text": "Tonic AI", "new_text": "[ORGANIZATION_EPfC7XZUZ]", "score": 0.85, "language": "en"}

For each redacted item, the response includes:

The location of the value in the original text (start and end)
The location of the value in the redacted version of the string (new_start and new_end)
The entity type (label)
The original value (text)
The redacted or synthesized value (new_text). new_text is null in the following cases:
- The entity type is ignored
- The response is from llm_synthesis
A score to indicate confidence in the detection and redaction (score)
The detected language for the value (language)
For responses from textual.redact_json, the JSON path to the entity in the original document (json_path)
For responses from textual.redact_xml, the XPath to the entity in the original XML document (xml_path)
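
The relationship between start and end and new_start and new_end can be sketched with plain string slicing. The following is not Textual's implementation. The helper apply_replacements is hypothetical, and the sample reuses the original text, offsets, and unbracketed replacement value from the sample response shown earlier in this section.

```python
def apply_replacements(text, replacements):
    """Replace spans of text and report the new offsets of each replacement.

    replacements is a list of dicts with start, end, and new_text keys,
    sorted by start and not overlapping.
    """
    pieces = []
    results = []
    cursor = 0      # position in the original text
    new_length = 0  # length of the output built so far
    for item in replacements:
        unchanged = text[cursor:item["start"]]
        pieces.append(unchanged)
        new_length += len(unchanged)
        new_start = new_length
        pieces.append(item["new_text"])
        new_length += len(item["new_text"])
        results.append({**item, "new_start": new_start, "new_end": new_length})
        cursor = item["end"]
    pieces.append(text[cursor:])
    return "".join(pieces), results


original = "Contact Tonic AI with questions"
redacted, spans = apply_replacements(
    original,
    [{"start": 8, "end": 16, "new_text": "ORGANIZATION_EPfC7XZUZ"}],
)
print(redacted)
print(spans[0]["new_start"], spans[0]["new_end"])
```

Slicing the redacted string with new_start and new_end returns the replacement value, in the same way that slicing the original string with start and end returns the original value.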
