Parse individual files
Send a single file to be parsed.
You can use the Textual SDK to parse individual files, either from a local file system or from an S3 bucket.
Textual returns a FileParseResult object for each parsed file. The FileParseResult object is a wrapper around the output JSON for the processed file.
To parse a single file from a local file system, use textual.parse_file:
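A minimal sketch; the parse_file parameter order and the timeout keyword are assumptions:

```python
from tonic_textual.parse_api import TonicTextualParse

textual = TonicTextualParse("https://textual.tonic.ai")

# Open the file in binary mode ("rb") and pass the bytes to parse_file
with open("invoice.pdf", "rb") as f:
    parsed = textual.parse_file(f.read(), "invoice.pdf", timeout=60)
```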
You must use the rb access mode to read the file. The rb access mode opens the file for reading in binary format.
You can also set a timeout, in seconds, for the parsing. To set a timeout for a single call, pass it as a parameter to parse_file. To set a timeout to use for all parsing, set the environment variable TONIC_TEXTUAL_PARSE_TIMEOUT_IN_SECONDS.
You can also parse files that are stored in Amazon S3. Because this process uses the boto3 library to fetch the file from Amazon S3, you must first set up the correct AWS credentials.
To parse a file from an S3 bucket, use textual.parse_s3_file:
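A sketch; the parameter names are assumptions:

```python
# parse_s3_file fetches the file with boto3, so AWS credentials must be configured first
parsed = textual.parse_s3_file("my-bucket", "documents/invoice.pdf")
```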
Whenever you call the Textual SDK, you first instantiate the SDK client.
To work with Textual datasets, or to redact individual files, you instantiate TonicTextual.
To work with Textual pipelines, you instantiate TonicTextualParse.
If the API key is configured as the value of TONIC_TEXTUAL_API_KEY, then you do not need to provide the API key when you instantiate the SDK client.
For Textual pipelines:
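A minimal sketch; the module path is an assumption:

```python
from tonic_textual.parse_api import TonicTextualParse

# The API key is read from the TONIC_TEXTUAL_API_KEY environment variable
textual = TonicTextualParse("https://textual.tonic.ai")
```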
For Textual datasets:
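A minimal sketch; the module path is an assumption:

```python
from tonic_textual.api import TonicTextual

# The API key is read from the TONIC_TEXTUAL_API_KEY environment variable
textual = TonicTextual("https://textual.tonic.ai")
```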
If the API key is not configured as the value of TONIC_TEXTUAL_API_KEY, then you must include the API key in the request.
For Textual pipelines:
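A sketch; passing the key as the api_key argument is an assumption:

```python
from tonic_textual.parse_api import TonicTextualParse

textual = TonicTextualParse("https://textual.tonic.ai", api_key="<your API key>")
```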
For Textual datasets:
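A sketch; passing the key as the api_key argument is an assumption:

```python
from tonic_textual.api import TonicTextual

textual = TonicTextual("https://textual.tonic.ai", api_key="<your API key>")
```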
You can use the Tonic Textual SDK to manage pipelines and to redact individual strings and files.
By default, when you:
Configure a dataset
Redact or synthesize a string
Retrieve a redacted file
Textual does the following:
For string and file redaction, redacts the detected sensitive values.
For LLM synthesis, generates realistic synthesized values.
When you make the request, you can override the default behavior.
For each entity type, you can choose to redact, synthesize, or ignore the value.
When you redact a value, Textual replaces the value with <entity type>_<generated identifier>. For example, ORGANIZATION_EPfC7XZUZ.
When you synthesize a value, Textual replaces the value with a different realistic value.
When you ignore a value, Textual passes through the original value.
To specify the handling option for entity types, you use the generator_config parameter.
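The parameter has roughly the following shape (a sketch, not confirmed syntax):

```python
generator_config = {"<entity_type>": "<handling_option>"}
```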
Where:
<entity_type> is the identifier of the entity type. For example, ORGANIZATION.
<handling_option> is the handling option to use for the specified entity type. The possible values are Redact, Synthesis, and Off.
For example, to synthesize organization values, and ignore languages:
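A sketch; the LANGUAGE identifier is an assumption:

```python
generator_config = {
    "ORGANIZATION": "Synthesis",  # replace with realistic synthesized values
    "LANGUAGE": "Off",            # pass the original values through
}
```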
For string and file redaction, you can specify a default handling option to use for entity types that are not specified in generator_config. To do this, you use the generator_default parameter. generator_default can be either Redact, Synthesis, or Off.
You can also configure added and excluded values for each entity type.
You add values that Textual should detect for an entity type, but does not. You exclude values that you do not want Textual to identify for that entity type.
To specify the added values, use label_allow_lists. To specify the excluded values, use label_block_lists.
For each of these parameters, the value identifies the entity types to specify the added or excluded values for. For each entity type, you provide an array of regular expressions that match the values to add or exclude.
The following example uses label_allow_lists to add values (see the sketch after this list):
For NAME_GIVEN, adds the values There and Here.
For NAME_FAMILY, adds values that match the regular expression ([a-z]{2}).
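A sketch; passing plain lists of regular expressions is an assumption about the expected format:

```python
redaction = textual.redact(
    "There they go.",
    label_allow_lists={
        "NAME_GIVEN": ["There", "Here"],  # literal values to add
        "NAME_FAMILY": [r"([a-z]{2})"],   # values that match the regular expression
    },
)
```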
Textual uses pipelines to transform file text into a format that can be used in an LLM system.
You can use the Textual SDK to create and manage pipelines and to retrieve pipeline run results.
Before you perform these tasks, remember to instantiate the SDK client.
To create a pipeline, use the pipeline creation method for the type of pipeline to create:
- Creates an uploaded file pipeline.
- Creates an Amazon S3 pipeline.
- Creates an Azure pipeline.
- Creates a Databricks pipeline.
When you create the pipeline, you can also:
If needed, provide the credentials to use to connect to Amazon S3, Azure, or Databricks.
Indicate whether to also generate redacted files. By default, pipelines do not generate redacted files. To generate redacted files, set synthesize_files to True.
For example, to create an uploaded file pipeline that also creates redacted files:
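A sketch; create_local_pipeline is an assumed method name for creating an uploaded file pipeline:

```python
pipeline = textual.create_local_pipeline(
    "my-pipeline",
    synthesize_files=True,  # also generate redacted files
)
```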
The response contains the pipeline object.
For an Amazon S3 pipeline, you can configure the output location for the processed files. You can also identify the files and folders for the pipeline to process:
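A sketch; the creation and configuration method names are assumptions:

```python
s3_pipeline = textual.create_s3_pipeline("my-s3-pipeline")

# Assumed configuration methods
s3_pipeline.set_output_location("my-bucket", "textual-output/")
s3_pipeline.add_files("my-bucket", ["documents/report.pdf"])
s3_pipeline.add_prefixes("my-bucket", ["documents/contracts/"])
```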
For an Azure pipeline, you can similarly configure the output location for the processed files, and identify the files and folders for the pipeline to process.
To delete a pipeline, use the pipeline deletion method. To change whether a pipeline also generates synthesized files, use the pipeline update method. To add a file to an uploaded file pipeline, use the pipeline's file upload method. These operations are sketched below.
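All of the method and attribute names in this sketch are assumptions about the SDK surface:

```python
# Assumed names for pipeline management
textual.delete_pipeline(pipeline.id)  # delete a pipeline
pipeline.set_synthesize_files(True)   # toggle generation of synthesized files

# Add a file to an uploaded file pipeline
with open("notes.txt", "rb") as f:
    pipeline.upload_file(f.read(), "notes.txt")
```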
For an Amazon S3 pipeline or an Azure pipeline, you can also use the corresponding pipeline methods to:
Identify the output location for the processed files.
Identify individual files for the pipeline to process.
Identify prefixes - folders for which the pipeline processes all applicable files.
To get the list of pipelines, use the pipeline list method. The response contains a list of pipeline objects.
To get a single pipeline, use the pipeline identifier. The pipeline identifier is displayed on the pipeline details page; to copy the identifier, click the copy icon. The response contains a single pipeline object.
To run a pipeline, use the pipeline's run method. The response contains the job identifier.
To get the list of pipeline runs, use the pipeline's run list method. The response contains a list of pipeline run objects.
Once you have the pipeline, you can get an enumerator of the files in the pipeline from the most recent pipeline run. The response is an enumerator of file parse result objects.
These operations are sketched below.
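A sketch; every method name here is an assumption:

```python
# Assumed method names for retrieving and running pipelines
pipelines = textual.get_pipelines()               # list of pipeline objects
pipeline = textual.get_pipeline("<pipeline id>")  # single pipeline object
job_id = pipeline.run()                           # returns the job identifier
runs = pipeline.get_runs()                        # list of pipeline run objects
files = pipeline.enumerate_files()                # file parse results from the most recent run
```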
To get a list of entities that were detected in a file, use the file's entity retrieval method, as sketched below. For example, you can get the detected entities for all of the files in a pipeline.
To provide a list of entity types and how to process them, use generator_config. generator_config is a dictionary that specifies whether to redact, synthesize, or do neither for each entity type in the dictionary. The keys are entity type identifiers; for example, ORGANIZATION. For the list of built-in entity types that Textual scans for, go to the entity type list in the Textual documentation.
For each entity type, you provide the handling type:
Redaction indicates to replace the value with the value's entity type.
Synthesis indicates to replace the value with a realistic value.
Off indicates to keep the value as is.
generator_default indicates how to process values for entity types that are not included in the generator_config list.
The response contains the list of entities. For each value, the list includes:
Entity type
Where the value starts in the source file
Where the value ends in the source file
The original text of the entity
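A sketch; the method names are assumptions:

```python
# get_all_entities / get_entities are assumed method names
for parsed_file in pipeline.enumerate_files():
    # All detected entities for the file
    print(parsed_file.get_all_entities())
    # Entities with per-entity-type handling applied
    print(parsed_file.get_entities(generator_config={"ORGANIZATION": "Synthesis"}))
```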
To get the Markdown output of a pipeline file, use the file's Markdown method, as sketched below. In the request, you can provide generator_config and generator_default to configure how to present the detected entities in the output file. The response contains the Markdown files, with the detected entities processed as specified in generator_config and generator_default.
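A sketch; get_markdown is an assumed method name:

```python
markdown = parsed_file.get_markdown(generator_config={"ORGANIZATION": "Synthesis"})
```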
To split a pipeline file into text chunks that can be imported into an LLM, use the file's chunking method, as sketched below. In the request, you set the maximum number of characters in each chunk. You can also provide generator_config and generator_default to configure how to present the detected entities in the text chunks. The response contains the list of text chunks, with the detected entities processed as specified in generator_config and generator_default.
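A sketch; get_chunks and max_chars are assumed names:

```python
chunks = parsed_file.get_chunks(
    max_chars=1000,
    generator_config={"ORGANIZATION": "Synthesis"},
)
```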
You can use the Textual SDK to redact and synthesize values in individual files.
Before you perform these tasks, remember to instantiate the SDK client.
For a self-hosted instance, you can also configure the S3 bucket to use to store the files. This is the same S3 bucket that is used to store files for uploaded file pipelines. For more information, go to Setting the S3 bucket for file uploads and redactions. For an example of an IAM role with the required permissions, go to the file upload example IAM role.
To send an individual file to Textual, you use textual.start_file_redaction. You first open the file so that Textual can read it, then make the call for Textual to read the file.
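A sketch; the parameter order for start_file_redaction is an assumption:

```python
# Open the file in binary mode and send it to Textual
with open("report.pdf", "rb") as f:
    response = textual.start_file_redaction(f.read(), "report.pdf")
```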
The response includes:
The file name
The identifier of the job that processed the file. You use this identifier to retrieve a transformed version of the file.
After you use textual.start_file_redaction to send the file to Textual, you use textual.download_redacted_file to retrieve a transformed version of the file.
To identify the file, you use the job identifier that you received from textual.start_file_redaction. You can also specify whether to redact, synthesize, or ignore specific entity types. By default, all of the values are redacted.
Before you make the call to download the file, you specify the path to download the file content to.
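A sketch; job_id as the response attribute is an assumption:

```python
file_bytes = textual.download_redacted_file(
    response.job_id,
    generator_config={"ORGANIZATION": "Synthesis"},
)

# Write the transformed file content to the chosen path
with open("/path/to/redacted_report.pdf", "wb") as f:
    f.write(file_bytes)
```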
Before you perform these tasks, remember to instantiate the SDK client.
You can use the Tonic Textual SDK to redact individual strings, including:
Plain text strings
JSON content
XML content
For a text string, you can also request synthesized values from a large language model (LLM).
The redaction request can include the handling configuration for entity types.
The redaction response includes the redacted or synthesized content and details about the detected entity values.
To send a plain text string for redaction, use textual.redact. For example:
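A sketch; redacted_text as the response attribute is an assumption, and the sample values are illustrative:

```python
redaction = textual.redact("My name is John Smith. I work at Tonic.")

print(redaction.redacted_text)
# My name is NAME_GIVEN_x7Ab1 NAME_FAMILY_q9Zk2. I work at ORGANIZATION_EPfC7XZUZ.
```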
To send multiple plain text strings for redaction, use textual.redact_bulk. For example:
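A sketch along the same lines:

```python
bulk = textual.redact_bulk([
    "My name is John Smith.",
    "My name is Jane Doe.",
])
```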
To send a JSON string for redaction, use textual.redact_json. You can send the JSON content as a JSON string or a Python dictionary.
redact_json ensures that only the values are redacted. It ignores the keys.
Here is a basic example of a JSON redaction request:
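A sketch; the entity labels and identifiers in the sample output that follows are illustrative:

```python
import json

data = {"name": "John Smith", "city": "Atlanta"}
redaction = textual.redact_json(json.dumps(data))
print(redaction.redacted_text)
```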
It produces the following JSON output:
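```json
{
  "name": "NAME_GIVEN_x7Ab1 NAME_FAMILY_q9Zk2",
  "city": "LOCATION_CITY_k3Np8"
}
```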
When you redact a JSON string, you can optionally assign specific entity types to selected JSON paths.
To do this, you include the jsonpath_allow_lists parameter. Each entry consists of an entity type and a list of JSON paths for which to always use that entity type. Each JSON path must point to a simple string or numeric value.
The specified entity type overrides both the detected entity type and any added or excluded values.
In the following example, the value of the key1 node is always treated as a telephone number:
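A sketch; PHONE_NUMBER as the entity type identifier is an assumption, and the identifiers in the output are illustrative:

```python
redaction = textual.redact_json(
    '{"key1": "867-5309", "key2": "Atlanta"}',
    jsonpath_allow_lists={"PHONE_NUMBER": ["$.key1"]},
)
print(redaction.redacted_text)
```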
It produces the following redacted output:
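```json
{
  "key1": "PHONE_NUMBER_b2Xw9",
  "key2": "LOCATION_CITY_k3Np8"
}
```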
To send an XML string for redaction, use textual.redact_xml.
redact_xml ensures that only the values are redacted. It ignores the XML markup.
For example:
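A sketch, with illustrative values:

```python
xml = "<note><to>John Smith</to><body>Call me at 867-5309.</body></note>"
redaction = textual.redact_xml(xml)
print(redaction.redacted_text)
```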
Produces the following XML output:
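```xml
<note><to>NAME_GIVEN_x7Ab1 NAME_FAMILY_q9Zk2</to><body>Call me at PHONE_NUMBER_b2Xw9.</body></note>
```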
To send an HTML string for redaction, use textual.redact_html.
redact_html ensures that only the values are redacted. It ignores the HTML markup.
For example:
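A sketch, with illustrative values:

```python
html = "<p>Contact <b>John Smith</b> at Tonic.</p>"
redaction = textual.redact_html(html)
print(redaction.redacted_text)
```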
Produces the following HTML output:
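```html
<p>Contact <b>NAME_GIVEN_x7Ab1 NAME_FAMILY_q9Zk2</b> at ORGANIZATION_EPfC7XZUZ.</p>
```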
You can also request synthesized values from a large language model (LLM).
When you use this process, Textual first identifies the sensitive values in the text. It then sends the value locations and redacted values to the LLM. For example, if Textual identifies a product name, it sends the location and the redacted value PRODUCT to the LLM. Textual does not send the original values to the LLM.
The LLM then generates realistic synthesized values of the appropriate value types.
To send text to an LLM, use textual.llm_synthesis. For example:
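A sketch; the shape of the llm_synthesis response is an assumption:

```python
synthesis = textual.llm_synthesis("My name is John Smith and I love my new phone.")
print(synthesis)
```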
The response provides the redacted or synthesized version of the string, and the list of detected entity values.
For each redacted item, the response includes:
The location of the value in the original text (start and end)
The location of the value in the redacted version of the string (new_start and new_end)
The entity type (label)
The original value (text)
The redacted or synthesized value (new_text). new_text is null in the following cases:
The entity type is ignored
The response is from llm_synthesis
A score to indicate confidence in the detection and redaction (score)
The detected language for the value (language)
For responses from textual.redact_json, the JSON path to the entity in the original document (json_path)
For responses from textual.redact_xml, the XPath to the entity in the original XML document (xml_path)
Textual uses datasets to produce files with sensitive values replaced.
Before you perform these tasks, remember to instantiate the SDK client.
To create a new dataset and then upload a file to it, use textual.create_dataset.
To add a file to the dataset, use dataset.add_file. To identify the file, provide the file path and name.
To provide the file as IO bytes, you provide the file name and the file bytes. You do not provide a path.
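A sketch; the add_file parameter names are assumptions:

```python
dataset = textual.create_dataset("my-dataset")

# Add a file by path and name
dataset.add_file("/path/to/records.csv")

# Or add a file as IO bytes: provide the file name and the bytes, not a path
with open("/path/to/records.csv", "rb") as f:
    dataset.add_file(file_name="records.csv", file=f.read())
```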
Textual creates the dataset, scans the uploaded file, and redacts the detected values.
To change the configuration of a dataset, use dataset.edit. You can use dataset.edit to change:
The name of the dataset
To get the current status of the files in the current dataset, use dataset.describe:
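A sketch:

```python
print(dataset.describe())
```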
The response includes:
The name and identifier of the dataset
The number of files in the dataset
The number of files that are waiting to be processed (scanned and redacted)
The number of files that had errors during processing
For example:
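An illustrative summary; the exact output format is an assumption:

```
Dataset: my-dataset [a1b2c3]
Number of files: 2
Files waiting to be processed: 0
Files that had errors during processing: 0
```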
To get a list of files that have a specific status, use the following:
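The method names below are assumptions about the SDK surface:

```python
queued = dataset.get_queued_files()  # files waiting to be processed
failed = dataset.get_failed_files()  # files that had errors during processing
```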
The file list includes:
File identifier and name
Number of rows and columns
Processing status
For failed files, the error
When the file was uploaded
To delete a file from a dataset, use dataset.delete_file.
To get the redacted content in JSON format for a dataset, use dataset.fetch_all_json(). For example:
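A sketch:

```python
json_output = dataset.fetch_all_json()
print(json_output)
```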
The response contains the redacted content for each file in the dataset, in JSON format.
For details about all of the available Tonic Textual classes, go to the generated SDK documentation.
Instantiate the SDK client
Required for every call to the SDK.
Create and manage pipelines
Generate Markdown content to use in an LLM system.
Parse individual files
Parse files outside of a pipeline.
Create and manage datasets
Redact and synthesize data in a set of files.
Redact and synthesize individual strings
Detect and transform values in specified text strings.
Redact and synthesize individual files
Work with files outside of a dataset.