
Tonic Textual

Tonic Textual guide

Tonic Textual provides a single tool to allow you to put your text-based data to work for you.

You can use Textual datasets to redact sensitive values in files that you then use for development and training. Each original file becomes an output file in the same format, with the sensitive values replaced. You can also use the Textual SDK or the Textual REST API to manage datasets or to remove sensitive values from individual text strings.

The Textual pipeline option allows you to prepare unstructured text for use in an LLM system. Textual extracts the text from each file and then produces Markdown-formatted output. You can optionally replace sensitive values in the output, to prevent data leakage from your LLM. You can also use the Textual SDK to manage pipelines or parse individual files.

Textual workflows

Textual SDK, REST API, and Snowflake Native App

Need help with Textual? Contact [email protected].

Getting started with Textual

When you sign up for a Tonic Textual account, you can immediately get started with a new pipeline.

Note that these instructions are for setting up a new account on Textual Cloud. For a self-hosted instance, depending on how it is set up, you might either create an account manually or use single sign-on (SSO).

To get started with a new Textual account:

Go to .
Click Sign up.
Enter your email address.
Create and confirm a password for your Textual account.
Click Sign Up.

Textual creates your account. After you log in, Textual prompts you to provide some additional information about yourself and how you plan to use Textual.

After you fill out the information and click Get Started, Textual displays the Textual Home page, which you can use to preview how Textual detects and replaces values. For more information, go to .

Using the Textual free trial

When you set up an account on Textual Cloud, you start a Textual free trial.

Using the Getting Started checklist

When you start a free trial, Textual provides a checklist to guide you through initial steps to get started and learn more about Textual and what it can do.

The checklist displays automatically when you first log in. You can close and display it as needed. To display the checklist, in the Textual navigation menu, click Getting Started.

As you complete a step, Textual automatically marks it as completed.

The checklist includes:

When you click the step, you navigate to the Home page. Textual displays a popup panel that describes the task.
The checklist displays the installation command and an option to copy it. The step is marked as complete when you click the copy icon.
When you click the step, you are prompted to create an API key.
Creating an SDK request to redact a or a . When you click the step, you navigate to the Request Explorer. The step is marked as completed when you close the popup panel that describes the task.

Word count limit

During the free trial, Textual scans up to 100,000 words for free. Note that Textual counts actual words, not tokens. For example, "Hello, my name is John Smith." counts as six words.
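
To estimate how much of the limit a piece of text might use, a rough sketch is to split the text on whitespace. This is only an approximation for illustration: the function name count_words is hypothetical, and the whitespace rule is an assumption, not Textual's actual counting logic.

```python
def count_words(text: str) -> int:
    """Approximate a word count as the number of whitespace-separated chunks."""
    # Assumption: punctuation that is attached to a word does not add a word.
    return len(text.split())


print(count_words("Hello, my name is John Smith."))  # 6
```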

After you reach the 100,000-word limit, Textual disables scanning for your account. Until you purchase a pay-as-you-go subscription, you cannot:

Add files to a dataset or pipeline
Run a pipeline

Viewing your current usage

During your free trial, Textual displays the current usage in the following locations:

On the Home page
In the navigation menu

Next steps - pay-as-you-go or product demo

Textual also prompts you to purchase a pay-as-you-go subscription, which allows an unlimited number of words scanned for a flat rate per 1,000 words.

You can also request a Textual product demo.

Entity types that Textual detects

Tonic Textual comes with a built-in set of entity types that it detects. You can also configure custom entity types, which detect values based on regular expressions.

You can also view this video overview of entity types and entity type handling.

Managing custom entity types

Required global permission - either:

Create custom entity types
Edit any custom entity type

In addition to the built-in entity types, you can also create custom entity types.

Custom entity types are based on regular expressions. If a value matches a configured regular expression for the custom entity type, then it is identified as that entity type.
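
To illustrate the matching rule, here is a minimal sketch that uses Python's re module. The entity type names and expressions are hypothetical examples, and the code is not Textual's implementation. It only shows how a value that matches a configured expression can be labeled with the entity type.

```python
import re

# Hypothetical custom entity types: name mapped to its configured expressions.
CUSTOM_ENTITY_TYPES = {
    "Order Number": [r"ORD-\d{6}"],
    "Project Codename": [r"Project Falcon", r"Project Osprey"],
}


def find_custom_entities(text):
    """Return (entity type name, matched value) pairs found in the text."""
    found = []
    for name, patterns in CUSTOM_ENTITY_TYPES.items():
        for pattern in patterns:
            for match in re.finditer(pattern, text):
                found.append((name, match.group()))
    return found


sample = "Order ORD-123456 was filed under Project Falcon."
print(find_custom_entities(sample))
# [('Order Number', 'ORD-123456'), ('Project Codename', 'Project Falcon')]
```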

You can control whether each dataset or pipeline uses each custom entity type.

Viewing the list of custom entity types

To display the list of custom entity types, in the Textual navigation bar, click Custom Entity Types.

For each custom entity type, the list includes:

Entity type name and description.
Regular expressions to identify matching values.
The number of datasets and pipelines that the entity type is active for.

Creating, editing, and deleting a custom entity type

Creating a custom entity type

Required global permission: Create custom entity types

To create a custom entity type, on the Custom Entity Types page, click Create Custom Entity Type.

The dataset details and pipeline details pages also contain a Create Custom Entity Type option.

After you configure the entity type:

To save the new type, but not scan dataset and pipeline files for the new type, click Save Without Scanning Files.
To both save the new type and scan for it, click Save and Scan Files.

To detect new custom entity types in a dataset or pipeline, Textual needs to run a scan. If you do not run the scan when you save the custom entity type, then:

On the dataset details page, you are prompted to run a scan.
On the pipeline details page for an uploaded file pipeline, you are prompted to run a scan.
For a cloud storage pipeline, a new scan runs when you run the pipeline.

Editing a custom entity type

Required global permission: You can edit any custom entity type that you create.

Users with the global permission Edit any custom entity type can edit any custom entity type.

To edit a custom entity type, on the Custom Entity Types page, click the edit icon for the entity type.

You can also edit a custom entity type from the dataset or pipeline details page.

For an existing entity type, you can change the description, the regular expressions, and the enabled datasets and pipelines.

You cannot change the entity type name, which is used to produce the identifier to use to configure the entity type handling from the SDK.

After you update the configuration:

To save the changes, but not scan dataset and pipeline files based on the updated configuration, click Save Without Scanning Files.
To both save the changes and scan based on the updated configuration, click Save and Scan Files.

To reflect the changes to custom entity types in a dataset or pipeline, Textual needs to run a scan. If you do not run the scan when you save the changes, then:

On the dataset details page, you are prompted to run a scan.
On the pipeline details page for an uploaded file pipeline, you are prompted to run a scan.
For a cloud storage pipeline, a new scan runs when you run the pipeline.

Deleting a custom entity type

When you delete a custom entity type, it is removed from the datasets and pipelines that it was active for.

To delete a custom entity type:

On the Custom Entity Types page, click the delete icon for the entity type.
On the confirmation panel, click Delete Entity Type.

Custom entity type configuration settings

The custom entity type configuration includes:

Name and description
Regular expressions to identify matching values. From the configuration panel, you can test the expressions against text that you provide.
Datasets and pipelines to make the entity type active for. You can also enable and disable custom entity types from the dataset and pipeline details pages.

Name and description

In the Name field, provide a name for the entity type. Each custom entity type name:

Must be unique within an organization.
Can only contain alphanumeric characters and spaces. Custom entity type names cannot contain punctuation or other special characters.
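
If you generate entity type names in a script, you might check the naming rules before you enter them. The following sketch is hypothetical. It assumes ASCII letters and digits, and an exact, case-sensitive comparison for uniqueness. Textual might apply the rules differently.

```python
import re

# Letters, digits, and spaces only (ASCII is assumed for this sketch).
_NAME_PATTERN = re.compile(r"[A-Za-z0-9 ]+")


def is_valid_entity_type_name(name: str, existing_names: set[str]) -> bool:
    """Check a proposed custom entity type name against the rules above."""
    if name in existing_names:
        return False  # Names must be unique within an organization.
    # fullmatch rejects empty names, punctuation, and other special characters.
    return _NAME_PATTERN.fullmatch(name) is not None


print(is_valid_entity_type_name("Order Number", set()))   # True
print(is_valid_entity_type_name("Order-Number", set()))   # False
```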

After you save the entity type, you cannot change the name. Textual uses the name as the basis for the identifier that you use to refer to the entity type in the SDK.

In the Description field, provide a longer description of the custom entity type.

Regular expressions to identify matching values

Under Keywords, Phrases, or Regexes, provide expressions to identify matching values for the entity type.

An entry can be as simple as a single word or phrase, or you can provide a more complex regular expression to identify the values.

Textual maintains an empty row at the bottom of the list. When you type an expression into the last row, Textual adds a new empty row.

To add an entry, begin to type the value in the empty row.

To edit an entry, click the entry field, then edit the value.

To remove an entry, click its delete icon.

Testing an expression

Under Test Entry, you can check whether Textual correctly identifies a value as the entity type based on the provided expression.

To test an expression:

From the dropdown list, select the entry to test.

In the text area, provide the text to test.

As you enter the text, Textual automatically scans the text for matches to the selected expression. The Result field displays the input text and highlights the matching values.
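
To try an expression outside of Textual before you add it, you can approximate the Result field locally. This sketch wraps each match in double square brackets instead of highlighting it. The function name and the sample expression are hypothetical, and Python's re syntax might differ in places from the syntax that Textual accepts.

```python
import re


def highlight_matches(expression: str, text: str) -> str:
    """Return the text with each match of the expression wrapped in [[ ]]."""
    return re.sub(expression, lambda m: f"[[{m.group()}]]", text)


print(highlight_matches(r"INV-\d{4}", "Invoices INV-0042 and INV-0107 are overdue."))
# Invoices [[INV-0042]] and [[INV-0107]] are overdue.
```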

Enabling and disabling the entity type for pipelines and datasets

Under Activate custom entity, you identify the datasets and pipelines to make the entity active for. From the pipeline details or dataset details, you can also enable and disable custom entity types for that pipeline or dataset.

To make the entity active for all current and future datasets and pipelines, check Automatically activate for all current, and new pipelines and datasets.

To make the entity active for specific pipelines and datasets, set the toggle for the dataset or pipeline to the on position.

To filter the list based on the pipeline or dataset name, in the filter field, begin to type text from the name. Textual updates the list to only include matching datasets and pipelines.

To update all of the currently displayed datasets and pipelines, click Bulk action, then click Enable or Disable.

You can also enable and disable custom entity types from within a dataset or pipeline. For more information, go to Enabling and disabling custom entity types.

Datasets - Create redacted files

Creating and managing datasets

A Tonic Textual dataset is a collection of text-based files. Textual uses models to detect and redact the sensitive information in each file.

Viewing the list of datasets

Displaying the Datasets page

To display the Datasets page, in the navigation menu, click Datasets.

The datasets list only displays the datasets that you have access to.

Users who have the global permission View all datasets can see the complete list of datasets.

For each dataset, the Datasets page includes:

The name of the dataset
Any tags assigned to the dataset. For datasets that you can edit, there is also an option to assign tags. For more information, go to Assigning tags to datasets.
The user who most recently updated the dataset
When the dataset was created

Filtering the datasets by name

To filter the datasets by name, in the search field, begin to type text that is in the dataset name.

As you type, the list is filtered to only include datasets with names that contain the filter text.

Filtering the datasets by tag

You can assign tags to each dataset. Tags can help you to organize and provide a quick glance into the dataset configuration.

On the Datasets page, to filter the datasets by their assigned tags:

In the heading for the Tags column, click the filter icon.
On the tag list, check the checkbox for each tag to include.

To find a specific tag, in the search field, type the tag name.

Creating a dataset

Required global permission: Create datasets

From the Datasets page, you can create a new empty dataset. Textual prompts you for the dataset name, then displays the dataset details page.

To create a dataset:

On the Datasets page, click Create a Dataset.
On the dataset creation panel, in the Dataset Name field, provide the name of the dataset.

Click Create Dataset. The dataset details page for the new dataset is displayed.

Displaying details for a dataset

Required dataset permission: View dataset settings

To display the details page for a dataset, on the Datasets page, click the dataset name.

The dataset details page includes:

The tags assigned to the dataset, as well as an option to add tags. For more information, go to Assigning tags to datasets.
The list of files in the dataset
The results of the scan for entity values
The configured handling for each type of value

Changing the dataset name

Required dataset permission: Edit dataset settings

The dataset name displays in the panel at the top left of the dataset details page.

To change the dataset name:

On the dataset details page, click Settings.
On the Dataset Settings page, in the Dataset Name field, provide the new name for the dataset.

Click Save Dataset.

Deleting a dataset

Required dataset permission: Delete a dataset

To delete a dataset:

On the dataset details page, click Settings.
On the Dataset Settings page, click Delete Dataset.
Click Confirm Delete.

Adding and removing dataset files

Supported file types for datasets

Tonic Textual can process the following types of files:

txt
csv
tsv
docx
xlsx
pdf
png
tif or tiff
jpg or jpeg

On a self-hosted instance, you can configure an S3 bucket where Textual stores the files. This is the same S3 bucket that is used for uploaded file pipelines.

For more information, go to Setting the S3 bucket for file uploads and redactions.

For an example of an IAM role with the required permissions, go to .

Adding files to the dataset

Required dataset permission: Upload files to a dataset

From the dataset details page, to add files to the dataset:

In the panel at the top left, click Upload Files.

Search for and select the files.

Tonic Textual uploads and then processes the files.

Do not leave the page while files are uploading. If you leave the page before the upload is complete, then the upload stops.

You can leave the page while Textual is processing the file.

On a self-hosted instance, when a file fails to upload, you can download the associated logs. To download the logs, click the options menu for the file, then select Download Logs.

Removing files from the dataset

Required dataset permission: Delete files from a dataset

To remove a file from the dataset:

In the file list, click the options menu for the file.
In the options menu, click Delete File.

Configuring the redaction

For each entity type, you can adjust how Tonic Textual identifies and updates the values.

For a dataset, you configure the redaction from the entity types list on the dataset details page.

For a pipeline that generates synthesized files, you configure the redaction from the Generator Config tab on the pipeline details page.

Working with custom entity types

From the entity types list, you can set whether each custom entity is active, and edit the custom entity configuration.

You can also create a new custom entity type.

Enabling and disabling custom entity types

Required dataset permission: Edit dataset settings

In the entity types list, custom entity types include a toggle to indicate whether the custom entity type is active for that dataset or pipeline.

To disable a custom entity type, set the toggle to the off position.

When a custom entity type is enabled, then it is listed under either the found or not found entity types, depending on whether the files include entities of that type.

When a custom entity type is not enabled, it is listed under Inactive custom entity types. To enable the custom entity type, set the toggle to the on position.

Editing a custom entity type

Required global permission - either:

Create custom entity types
Edit any custom entity type

To edit a custom entity type, click the settings icon for that type.

Note that any changes to the custom entity type settings affect all of the datasets and pipelines that use the custom entity type.

For information on how to configure a custom entity type, go to .

Creating a custom entity type

Required global permission: Create custom entity types

From the dataset details or pipeline details page, to create a new custom entity type, click Create Custom Entity Type.

For information on how to configure a custom entity type, go to .

Running a new scan to reflect custom entity type changes

When you enable, disable, add, or edit custom entity types, the changes do not take effect until you run a new scan.

For datasets and uploaded file pipelines, to run a new scan, click Scan.

For a cloud storage pipeline, Textual scans the files when you run the pipeline.

Selecting the handling option for entity types

Required dataset permission: Edit dataset settings

For each entity type, you choose how to handle the detected values.

Available handling options

The available options are:

Synthesis - Indicates to replace the value with another realistic value. For example, the first name value Michael might be replaced with the value John. The synthesized values are always consistent, meaning that a given entity value always has the same replacement value. For example, if the first name Michael appears multiple times in the text, it is always replaced with John. Textual does not synthesize any excluded values. For custom entity types, Textual scrambles the values.
Redaction - This is the default option, except for the Full Mailing Address entity type, which is Off by default. For text files, Redaction indicates to tokenize the value - to replace it with a token that identifies the entity type followed by a unique identifier. For example, the first name value Michael might be replaced with NAME_GIVEN_12m5sb. The identifiers are consistent, which means that for a given original value, the replacement always has the same unique identifier. For example, the first name Michael might always be replaced with NAME_GIVEN_12m5sb, while the first name Helen might always be replaced with NAME_GIVEN_9ha3m2. For PDF files, Redaction indicates to either cover the value with a black box, or, if there is space, display the entity type and identifier. For image files, Redaction indicates to cover the value with a black box. Textual does not redact any excluded values.
Off - Indicates to not make any changes to the values. For example, the first name value Michael remains Michael. This is the default option for the Full Mailing Address entity type.
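
The following sketch illustrates what consistency means for the three options. It is not how Textual generates replacements. The hash-based identifier, the replacement name pool, and the function names are all assumptions made for illustration.

```python
import hashlib

# Hypothetical pool of replacement first names for the Synthesis option.
FIRST_NAMES = ["John", "Maria", "Ahmed", "Lena"]


def _digest(text: str) -> str:
    """Return a stable hex digest, so the same input gives the same output."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def handle_value(entity_type: str, value: str, option: str) -> str:
    """Apply a handling option to one detected value (first names only here)."""
    if option == "Off":
        return value  # Leave the value unchanged.
    if option == "Redaction":
        # Entity type followed by an identifier derived from the value.
        return f"{entity_type}_{_digest(entity_type + ':' + value)[:6]}"
    if option == "Synthesis":
        # Pick a replacement deterministically from the pool.
        return FIRST_NAMES[int(_digest(value), 16) % len(FIRST_NAMES)]
    raise ValueError(f"Unknown handling option: {option}")


token = handle_value("NAME_GIVEN", "Michael", "Redaction")
print(token == handle_value("NAME_GIVEN", "Michael", "Redaction"))  # True
```

Because the output depends only on the input value, every occurrence of Michael receives the same token or the same synthesized name.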

Selecting the handling option for a specific entity type

To select the handling option for an individual entity type, click the option for that type.

Selecting the handling option for all of the entity types

For a dataset, to select the same handling option for all of the entity types, from the Bulk Edit dropdown above the entity types list, select the option.

For a pipeline that generates synthesized files, on the Generator Config tab, use the Bulk Edit options at the top of the entity types list.

Adding manual overrides to PDF files

Required dataset permission: Edit dataset settings

For a PDF file in a dataset, you can add manual overrides to selected areas of a file. Manual overrides can either ignore redactions from Tonic Textual, or add redactions.

Pipelines do not support manual overrides in PDF files.

Editing an individual PDF file

For PDF files, you can add manual overrides to the initial redactions, which are based on the detected entity types and handling configuration.

For each manual override, you select an area of the file.

For the selected area, you can either:

Ignore any automatically detected redactions. For example, a scanned form might show an example or boilerplate content that doesn't actually contain sensitive values.
Redact that area. The file might contain sensitive content that Tonic Textual is unable to detect. For example, a scanned form might contain handwritten notes.

You can also apply a template to the file.

Selecting the manual override option for a file

To manage the manual overrides for a PDF file:

In the file list, click the options menu for the file.
In the options menu, click Edit Redactions.

The File Redactions panel displays the file content. The values that Textual detected are highlighted. The page also shows any manual overrides that were added to the file.

Applying a PDF template to a file

If a dataset contains multiple files that have the same format, then you can create a template to apply to those files. For more information, go to .

On the File Redactions panel, to apply a template to the file, select it from the template dropdown list.

When you apply a PDF template to a file, the manual overrides from that template are displayed on the file preview. The manual overrides are not included in the Redactions list.

Adding a manual override

On the File Redactions panel, to add a manual override to a file:

Select the type of override. To indicate to ignore any automatically detected values in the selected area, click Ignore Redactions. To indicate to redact the selected area, click Add Manual Redaction.
Use the mouse to draw a box around the area to select.

Textual adds the override to the Redactions list. The icon indicates the type of override.

In the file content:

Overrides that ignore detected values within the selected area are outlined in red.
Overrides that redact the selected area are outlined in green.

Navigating to a manual override

To select and highlight a manual override in the file content, in the Redactions list, click the navigate icon for the override.

Removing a manual override

To remove a manual override, in the Redactions list, click the delete icon for the override.

Saving the manual overrides

To save the current manual overrides, click Save.

Sharing dataset access

Required permissions:

Global permission - View users and groups
Either:
- Global permission - Manage access to datasets
- Dataset permission - Share dataset access

Tonic Textual uses dataset permission sets for role-based access (RBAC) of each dataset.

A dataset permission set is a set of dataset permissions. Each permission provides access to a specific dataset feature or function.

Textual provides built-in dataset permission sets. Organizations can also configure custom permission sets.

To share dataset access, you assign dataset permission sets to users and to SSO groups, if you use SSO to manage Textual users. Before you assign a dataset permission set to an SSO group, make sure that you are aware of who is in the group. The permissions that are granted to an SSO group are automatically granted to all of the users in the group.

To change the current access to the dataset:

On the Datasets page, click the share icon for the dataset to share.

The dataset access panel contains the current list of users and groups who have access to the dataset, and displays their assigned dataset permission sets.

To add a user or group to the list of users and groups:
1. In the search field, begin to type the user email address or group name.
2. From the list of matching users or groups, select the user or group to add.
For a user or group, to change the assigned dataset permission sets:
1. Click Access. The dropdown list displays the list of custom and built-in dataset permission sets.
2. Under Custom Permission Sets, check the checkbox next to each dataset permission set to assign to the user or group. To remove an assigned dataset permission set, uncheck the checkbox.
3. Under Built-In Permission Sets, click the dataset permission set to assign to the user or group. You can only assign one built-in permission set. By default, for an added user or group, the Viewer permission set is selected. To not grant any built-in permission set, select None.

Downloading redacted data

Required dataset permission: Download redacted dataset files

For each file in a dataset, you can download the version of the file that contains the replacement values.

For information on downloading synthesized files from a pipeline, go to .

Downloading a single dataset file

From the file list, to download a single file:

Click the options menu for the file.
In the options menu, click Download File.

Downloading all of the dataset files

To download all of the files, click Download All Files.

Pipelines - Prepare LLM content

Assigning tags to pipelines

Required pipeline permission: Edit pipeline settings

Tags can help you to organize your pipelines. For example, you can use tags to indicate pipelines that belong to different groups, or that deal with specific areas of your data.

You can manage tags from both the Pipelines page and the pipeline details.

Managing tags from the Pipelines page

On the Pipelines page, the Tags column displays the currently assigned tags.

To change the tag assignment for a pipeline:

Click Tags.
On the pipeline tags panel, to add a new tag, type the tag text, then press Enter.
To remove a tag, click its delete icon.
To remove all of the tags, click the delete all icon.

Managing tags from the pipeline details page

On the pipeline details page, the assigned tags display under the pipeline name.

To change the tag assignment:

Click Tags.
On the pipeline tags panel, to add a new tag, type the tag text, then press Enter.
To remove a tag, click its delete icon.
To remove all of the tags, click the delete all icon.

Setting up pipelines

For each pipeline, you configure the name and the files to process. You can also generate redacted versions of the files.

General pipeline creation and configuration

Configuring specific pipeline types

Supported file types for pipelines

Textual pipelines can process the following types of files:

txt
csv
tsv
docx
xlsx
pdf
png
tif or tiff
jpg or jpeg
eml
msg

Creating custom entity types from a pipeline

Required global permission: Create custom entity types

From the pipeline details page, to create a custom entity type, click Create Custom Entity Type.

For information on how to configure a custom entity type, go to .

Configuring file synthesis for a pipeline

Required pipeline permission: Edit pipeline settings

When you choose to also generate synthesized versions of the pipeline files, the pipeline details page includes a Generator Config tab. From the Generator Config tab, you configure how to transform the detected entities in each file.

The Generator Config tab lists all of the available entity types.

For each entity type, you select and configure the handling option. For more information, see and .

You can also manage custom entity types. From the list, you can enable and disable custom types, and edit the configuration. For more information, go to .

After you change the configuration, click Save Changes. The updated configuration is applied the next time you run the pipeline, and only to new files.

Selecting files for an uploaded file pipeline

Required pipeline permission: Manage the pipeline file list

On a self-hosted instance, before you can upload files to a pipeline, you must configure the S3 bucket where Tonic Textual stores the files. For more information, go to .

For an example of an IAM role that has the required permissions for file upload pipelines, go to .

Adding files to the pipeline

On the pipeline details page for an uploaded file pipeline, to add files to the pipeline:

Click Upload Files.
Search for and select the files to upload.

Removing files

To remove a file, on the pipeline details page, click the delete icon for the file.

Indicating whether to also redact the files

By default, Textual only generates the JSON output for the pipeline files.

To also generate versions of the original files that redact or synthesize the detected entity values, on the Pipeline Settings page, toggle Synthesize Files to the on position.

For information on how to configure the file generation, go to .

Starting a pipeline run

Required pipeline permission: Start pipeline runs

For uploaded file pipelines, Tonic Textual automatically processes new files as you add them.

For cloud storage pipelines, you start pipeline runs. A pipeline run processes the pipeline files. The pipeline run only processes files that were not processed by a previous pipeline run.
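
The incremental behavior of pipeline runs can be pictured with a short sketch. The function and variable names are hypothetical, and the code only models which files a run selects, not what Textual does with them.

```python
def run_pipeline(all_files, processed):
    """Return the files that this run handles, and record them as processed."""
    new_files = [f for f in all_files if f not in processed]
    processed.update(new_files)
    return new_files


processed = set()
print(run_pipeline(["a.pdf", "b.docx"], processed))           # ['a.pdf', 'b.docx']
print(run_pipeline(["a.pdf", "b.docx", "c.txt"], processed))  # ['c.txt']
```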

If the pipeline is configured to also redact files, the run also generates the redacted version of each file. The redaction is based on the current redaction configuration for the pipeline. The first run after you enable redaction generates redacted versions of all of the pipeline files, including files that were processed by earlier runs. Subsequent runs only process new files.

To start a pipeline run, on the pipeline details page, click Run Pipeline.

Before the run completes, to cancel the run:

On the Pipeline Runs tab, click the Cancel Run option for the run.
On the confirmation panel, click Cancel Run.

Sharing pipeline access

Required permissions:

Global permission - View users and groups
Either:
- Global permission - Manage access to pipelines
- Pipeline permission - Share pipeline access

Tonic Textual uses pipeline permission sets for role-based access (RBAC) of each pipeline.

A pipeline permission set is a set of pipeline permissions. Each pipeline permission provides access to a specific pipeline feature or function.

Textual provides built-in pipeline permission sets. Organizations can also configure custom permission sets.

To share pipeline access, you assign pipeline permission sets to users and to SSO groups, if you use SSO to manage Textual users. Before you assign a pipeline permission set to an SSO group, make sure that you are aware of who is in the group. The permissions that are granted to an SSO group are automatically granted to all of the users in the group.

To change the current access to the pipeline:

On the Pipelines page, click the share icon for the pipeline to share.

The pipeline access panel contains the current list of users and groups who have access to the pipeline, and displays their assigned pipeline permission sets.

To add a user or group to the list of users and groups:
1. In the search field, begin to type the user email address or group name.
2. From the list of matching users or groups, select the user or group to add.
For a user or group, to change the assigned pipeline permission sets:
1. Click Access. The dropdown list displays the list of custom and built-in pipeline permission sets.
2. Under Custom Permission Sets, check the checkbox next to each pipeline permission set to assign to the user or group. To remove a pipeline permission set from a user or group, uncheck the checkbox.
3. Under Built-In Permission Sets, select the pipeline permission set to assign to the user or group. You can only assign one built-in permission set. By default, for an added user or group, the Viewer permission set is selected. To not grant any built-in permission set, select None.
To save the new access, click Share.

Viewing pipeline results

The pipeline details include the results of the pipeline processing, including the pipeline files and, for cloud storage pipelines, the individual pipeline runs.

Textual Python SDK

Installing the Textual SDK

The Tonic Textual SDK is a Python SDK that you can use to parse and redact text and files.

It requires Python 3.9 or higher.

To install the Tonic Textual Python SDK, run:

pip install tonic-textual

Instantiating the SDK client

Whenever you call the Textual SDK, you first instantiate the SDK client.

To work with Textual datasets, or to redact individual files, you instantiate TonicTextual.
To work with Textual pipelines and parsing, you instantiate TonicTextualParse.

Instantiating when the API key is already configured

If the API key is configured as the value of the TONIC_TEXTUAL_API_KEY environment variable, then you do not need to provide the API key when you instantiate the SDK client.

For Textual pipelines:

For Textual datasets:

Instantiating when the API key is not configured

If the API key is not configured as the value of the TONIC_TEXTUAL_API_KEY environment variable, then you must include the API key when you instantiate the SDK client.

For Textual pipelines:

For Textual datasets:

Datasets and redaction

You can use the Tonic Textual SDK to manage datasets and to redact individual strings and files.

Redact individual files

Required global permission: Use the API to parse or redact a text string

You can use the Textual SDK to redact and synthesize values in individual files.

Before you perform these tasks, remember to instantiate the SDK client.

For a self-hosted instance, you can also configure the S3 bucket to use to store the files. This is the same S3 bucket that is used to store files for uploaded file pipelines. For more information, go to Setting the S3 bucket for file uploads and redactions. For an example of an IAM role with the required permissions, go to Example IAM role for file uploads and redactions.

Sending a file to Textual

To send an individual file to Textual, you use textual.start_file_redaction.

You first open the file so that Textual can read it, then make the call for Textual to read the file.

The response includes:

The file name
The identifier of the job that processed the file. You use this identifier to retrieve a transformed version of the file.

Getting the file with redacted or synthesized values

After you use textual.start_file_redaction to send the file to Textual, you retrieve a transformed version of the file.

To identify the file, you use the job identifier that you received from textual.start_file_redaction. You can also specify how to handle the detected entity values.

Before you make the call to download the file, you specify the path to download the file content to.

Transcribe and redact an audio file

You can send an audio file to the Tonic Textual SDK. Textual creates a transcription of the audio file, and then redacts the transcription text as a string.

Audio file limitations

The file must be 25MB or smaller, and must be one of the following file types:

m4a
mp3
webm
mp4
mpga
wav
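As a rough pre-flight check, the limits above can be tested locally before you send a file. This is only a sketch. The function name is hypothetical, and it assumes that "25MB" means 25 * 1024 * 1024 bytes, which this guide does not state.

```python
from pathlib import Path

# File types listed in this guide.
ALLOWED_AUDIO_EXTENSIONS = {"m4a", "mp3", "webm", "mp4", "mpga", "wav"}
# Assumption: "25MB" is read as 25 * 1024 * 1024 bytes.
MAX_AUDIO_BYTES = 25 * 1024 * 1024


def audio_file_problems(file_name: str, size_bytes: int) -> list:
    """Return the reasons a file would fall outside the limits (empty if none)."""
    problems = []
    extension = Path(file_name).suffix.lstrip(".").lower()
    if extension not in ALLOWED_AUDIO_EXTENSIONS:
        problems.append(f"unsupported file type: {extension or 'none'}")
    if size_bytes > MAX_AUDIO_BYTES:
        problems.append(f"file is larger than {MAX_AUDIO_BYTES} bytes")
    return problems
```

For a file on disk, you could call `audio_file_problems(path.name, path.stat().st_size)` with a `pathlib.Path` object.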

Sending the transcription and redaction request

To transcribe and redact an audio file, you use textual.redact_audio.

redaction_response = textual.redact_audio("<path to the audio file>")
redaction_response.describe()

The request includes the entity type handling configuration.

The redaction response includes the redacted or synthesized content and details about the detected entity values.

Pipelines and parsing

You can use the Tonic Textual SDK to manage pipelines and to parse individual files.

Parse individual files

Required global permission: Use the API to parse or redact a text string

You can use the Textual SDK to parse individual files, either from a local file system or from an S3 bucket.

Textual returns a FileParseResult object for each parsed file. The FileParseResult object is a wrapper around the output JSON for the processed file.

Parse a file from a local file system

To parse a single file from a local file system, use textual.parse_file:

with open('<path to the file>', 'rb') as f:
    byte_data = f.read()
    parsed_doc = textual.parse_file(byte_data, '<file name>')

You must use rb access mode to read the file. rb access mode opens the file to be read in binary format.

You can also set a timeout, in seconds, for the parsing. To set the timeout for a single request, add it as a parameter of the parse_file call. To set a timeout to use for all parsing, set the environment variable TONIC_TEXTUAL_PARSE_TIMEOUT_IN_SECONDS.

Parse a file from an S3 bucket

You can also parse files that are stored in Amazon S3. Because this process uses the boto3 library to fetch the file from Amazon S3, you must first set up the correct AWS credentials.

To parse a file from an S3 bucket, use textual.parse_s3_file:

parsed_doc = textual.parse_s3_file('<bucket>','<key>')

Textual REST API

About the Textual REST API

The Tonic Textual REST API allows you to more deeply integrate Textual functions into your existing workflows.

You can use the REST API as another tool alongside the Textual application and the Textual Python SDK. The Python SDK supports the same actions as the REST API. We recommend the Python SDK for customers who already use Python.

You can download the Textual OpenAPI specification from:

https://textual.tonic.ai/swagger/v1/swagger.json

REST API authentication

Before you can use the API, you must create a Tonic Textual API key. For information on how to obtain a Textual API key, go to Creating and revoking Textual API keys.

When you call the API, you place your API key in the authorization header of the request, similar to the following curl request, which fetches the list of datasets for the current user.

curl --request GET \
--url "https://textual.tonic.ai/api/dataset" \
--header "Content-Type: application/json" \
--header "Authorization: API_KEY"

Most Textual API requests require authentication. For each request, the reference information indicates whether the request requires an API key.

For requests that require an API key, if you do not provide a valid API key, you receive a 401 Unauthorized response.
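The curl request above can be sketched in Python with only the standard library. The function names are hypothetical. The URL and headers are the ones from the curl example, and the response body is returned as raw bytes because this section does not describe its format.

```python
import urllib.error
import urllib.request

TEXTUAL_BASE_URL = "https://textual.tonic.ai"


def build_dataset_list_request(api_key: str) -> urllib.request.Request:
    """Build (but do not send) the GET request for the dataset list."""
    return urllib.request.Request(
        url=f"{TEXTUAL_BASE_URL}/api/dataset",
        method="GET",
        headers={
            "Content-Type": "application/json",
            # The API key itself is the value of the authorization header.
            "Authorization": api_key,
        },
    )


def fetch_dataset_list(api_key: str) -> bytes:
    """Send the request and return the raw response body."""
    try:
        with urllib.request.urlopen(build_dataset_list_request(api_key)) as response:
            return response.read()
    except urllib.error.HTTPError as error:
        if error.code == 401:
            # Missing or invalid API key.
            raise PermissionError("Textual returned 401 Unauthorized") from error
        raise
```

Building the request separately from sending it makes it possible to inspect the headers without a network call.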

Redaction

Use the Tonic Textual REST API to redact text. Redaction means to detect and replace sensitive values.

Redact text strings

You can use the Tonic Textual REST API to redact text strings, including:

Plain text
JSON
XML
HTML

Textual provides a specific endpoint for each format. For JSON, XML, and HTML, Textual only redacts the text values. It preserves the underlying structure.

Datasets

Use the REST API to manage datasets.

Manage datasets

Use the REST API to create and manage datasets.

Manage dataset files

Use the REST API to manage dataset files.

Access management

Use the API to retrieve information about users and groups, and to manage access to datasets.

Snowflake Native App and SPCS

About the Snowflake Native App

The Tonic Textual Snowflake Native App uses the same models and algorithms as the Tonic Textual API, but runs natively in Snowflake.

You use the app to redact or parse your text data directly within your Snowflake workflows. The text never leaves your data warehouse.

App package containers

The app package runs natively in Snowflake, and leverages Snowpark Container Services.

It includes the following containers:

Detection service, which detects the sensitive entity values.
Redaction service, which replaces the sensitive entity values.

Redaction workflow

For the redaction workflow, you use the app to detect and replace sensitive values in text.

1. You use TEXTUAL_REDACT to send the redaction request. When you call TEXTUAL_REDACT, it passes to the redaction service:
   - The text to redact
   - Optional configuration
2. The redaction service forwards the text to the detection service.
3. The detection service uses a series of NER models to identify and categorize sensitive words and phrases in the text.
4. The detection service returns its results to the redaction service.
5. The redaction service uses the results to replace the sensitive words and phrases with redacted or synthesized versions.
6. The redacted text is returned to the user.

Parsing workflow

For the parsing workflow, you use the app to parse files that are in a Snowflake internal or external stage.

1. You call TEXTUAL_PARSE to send the parse request. The request includes:
   - The fully qualified stage name where the files are located
   - The name of the file, or a variable that identifies the list of files
   - The MD5 sum of the file
2. The app uses a series of NER models to identify and categorize sensitive words and phrases in the text.
3. The app converts the content to Markdown format.
4. The Markdown content is part of the JSON output that includes metadata about the parsed text. You can use the metadata to build RAG systems and LLM datasets.
5. The app stores the results of the parse request, including the output, in the TEXTUAL_RESULTS table.
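The parse request includes the MD5 sum of the file. If you need to compute that value outside of Snowflake, the following sketch uses Python's standard hashlib module. The function name is hypothetical, and the guide does not state the digest format that TEXTUAL_PARSE expects. The hexadecimal form is an assumption.

```python
import hashlib
import io
from typing import BinaryIO


def md5_hex(stream: BinaryIO, chunk_size: int = 1024 * 1024) -> str:
    """Compute the MD5 sum of a binary stream without loading it all into memory."""
    digest = hashlib.md5()
    # Read fixed-size chunks until the stream returns an empty bytes object.
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


# In-memory example; for a real file, pass open(path, "rb") instead.
example_sum = md5_hex(io.BytesIO(b"hello"))
print(example_sum)  # 5d41402abc4b2a76b9719d911017c592
```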

Setting up the app

For the Tonic Textual Snowflake Native App, you set up:

A compute pool
A warehouse to enable queries

Setting up the compute pool

The compute pool must be specific to Textual.

For large-scale jobs, we highly recommend a GPU-enabled compute pool.

During setup and testing, you can use a CPU-only pool.

USE ROLE SYSADMIN;
CREATE COMPUTE POOL IF NOT EXISTS {YOUR_COMPUTE_POOL_NAME} FOR APPLICATION TONIC_TEXTUAL
    MIN_NODES = 1
    MAX_NODES = 1
    INSTANCE_FAMILY = GPU_NV_S
    AUTO_RESUME = true;

Creating a warehouse for Textual to query Snowflake

To run SQL queries against Snowflake tables that the app manages, the app requires a warehouse.

USE ROLE ACCOUNTADMIN;
CREATE WAREHOUSE {YOUR_TEXTUAL_WAREHOUSE_NAME} WITH WAREHOUSE_SIZE='MEDIUM';
GRANT USAGE ON WAREHOUSE {YOUR_TEXTUAL_WAREHOUSE_NAME} TO APPLICATION TONIC_TEXTUAL;

Install and administer Textual

Textual architecture

The following diagram shows how data and requests flow within the Tonic Textual application:

Textual application database

The Textual application database is a PostgreSQL database that stores the dataset configuration.

If you do not configure an S3 bucket, then it also stores uploaded files and files that you use the SDK to redact.

Textual datastore in Amazon S3

You can configure an S3 bucket to store uploaded files and individual files that you use the SDK to redact. For more information, go to .

If you do not configure an S3 bucket, then the files are stored in the Textual application database.

Textual components

Textual web server

Runs the Textual user interface.

Textual worker

A Textual instance can have multiple workers.

The worker orchestrates jobs. A job is a longer-running task, such as the redaction of a single file.

If you redact a large number of files, you might deploy additional workers and machine learning containers to increase the number of files that you can process concurrently.

Textual machine learning

A Textual installation can have one or more machine learning containers.

The machine learning container hosts the Textual models. It takes text from the worker or web server and returns any entities that it discovers.

Additional machine learning containers can increase the number of words per second that Textual can process.

OCR service

The OCR service converts PDFs and images to text that Textual can then scan for sensitive values.

For more information, go to Enabling PDF and image processing.

LLM service

Textual only uses the LLM service for .

Deploying a self-hosted instance

The Tonic Textual images are stored on . During onboarding, Tonic.ai provides you with credentials to access the image repository. If you require new credentials, or you experience issues accessing the repository, contact .

You can deploy Textual using either Kubernetes or Docker.

System requirements

You install a self-hosted instance of Tonic Textual on either:

A VM or server that runs Linux and on which you have superuser access.
A local machine that runs Mac, Windows, or Linux.

Application server or cluster requirements

At minimum, we recommend that the server or cluster that you deploy Textual to has access to the following resources:

Nvidia GPU, 16GB GPU RAM. We recommend at least 6GB GPU RAM for each textual-ml worker.

If you use only a CPU and not a GPU, then we recommend an AWS m5.2xlarge instance. However, without a GPU, performance is significantly slower.

GPU considerations

The number of words per second that Textual processes depends on many factors, including:

The hardware that runs the textual-ml container
The number of workers that are assigned to the textual-ml container
The auxiliary model, if any, that is used in the textual-ml container

To optimize both the throughput and the cost of Textual, we recommend that the textual-ml container runs on modern hardware with GPU compute. If you use AWS, we recommend a g5 instance with 1 GPU.

Setting up Nvidia GPU for Textual

To use GPU resources:

Ensure that the correct Nvidia drivers are installed for your instance.
If you use Kubernetes to deploy Textual, follow the instructions in the NVIDIA GPU operator documentation.
If you use Minikube, then use the instructions in Using NVIDIA GPUs with Minikube.
If you use Docker Compose to deploy Textual, follow these steps to install the nvidia-container-runtime.

Deploying with Docker Compose

The Docker Compose file is available in the GitHub repository .

Fork the repository.

To deploy Textual:

Rename sample.env to .env.
In .env, provide values for the required settings. These are not commented out and have <FILL IN> as a placeholder value:
- SOLAR_VERSION - Provided by Tonic.ai.
- SOLAR_LICENSE - Provided by Tonic.ai.
- ENVIRONMENT_NAME - The name that you want to use for your Textual instance. For example, my-company-name.
- SOLAR_SECRET - The string to use for Textual encryption.
- SOLAR_DB_PASSWORD - The password that you want to use for the Textual application database, which stores the metadata for Textual, including the datasets and pipelines. Textual deploys a PostgreSQL database container for the application database.
To deploy and start Textual, run docker-compose up -d.

Deploying on Kubernetes with Helm

The Tonic Textual Helm chart is available in the GitHub repository .

To use the Helm chart, you can either:

Use the OCI-based registry that Tonic hosts on .
Fork or clone the repository and then maintain it locally.

During the onboarding period, you are provided access credentials to our Docker image repository on . If you require new credentials, or you experience issues accessing the repository, contact .

Configure

Before you deploy Textual, you create a values.yaml file with the configuration for your instance.

For details about the required and optional configuration options, go to the .

Deploy

To deploy and validate access to Textual from the forked repository, follow the .

To use the OCI-based registry, run:

The GitHub repository contains a with the details on how to populate a values.yaml file and deploy Textual.

How to configure Textual environment variables

On a self-hosted instance of Textual, much of the configuration takes the form of environment variables.

After you configure an environment variable, you must restart Textual.

Docker

For Docker, add the variable to .env in the format:

SETTING_NAME=value

After you update .env, to restart Textual and complete the update, run:

$ docker-compose down

$ docker-compose pull && docker-compose up -d

Kubernetes

For Kubernetes, in values.yaml, add the environment variable to the appropriate env section of the Helm chart.

For example:

env: {
  "TEXTUAL_ML_WORKERS": "2"
}

After you update the YAML file, to restart the service and complete the update, run:

$ helm upgrade <name_of_release> -n <namespace_name> <path-to-helm-chart>

The above Helm upgrade command is always safe to use when you provide specific version numbers. However, if you use the latest tag, it might result in Textual containers that have different versions.

Configuring the number of textual-ml workers

The TEXTUAL_ML_WORKERS environment variable specifies the number of workers to use within the textual-ml container. The default value is 1.

Having multiple workers allows for parallelization of inferences with NER models. The number of required workers is also affected by the number of jobs that run concurrently.

When you deploy Textual with Kubernetes on GPUs, parallelization allows the textual-ml container to fully utilize the GPU.

We recommend 3GB of GPU RAM for each worker.

Configuring the number of jobs to run concurrently

By default, each Tonic Textual worker can run eight jobs at the same time. For example, it can process up to eight files simultaneously.

The environment variable SOLAR_MAX_CONCURRENT_WORKER_JOBS controls the number of jobs to run concurrently.

The number of jobs that can run concurrently can affect the number of workers that you need. The more jobs that can run concurrently, the fewer workers that are needed.
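As a back-of-the-envelope sketch of that relationship, the following function estimates how many Textual workers are needed to process a given number of files at the same time. It assumes one job per file and ignores machine learning capacity, so treat the result as a lower bound. The function name is hypothetical.

```python
import math

# Default number of concurrent jobs for each worker, as stated in this guide.
DEFAULT_JOBS_PER_WORKER = 8


def workers_needed(concurrent_files: int,
                   jobs_per_worker: int = DEFAULT_JOBS_PER_WORKER) -> int:
    """Estimate the number of workers, assuming one job per file."""
    if concurrent_files < 0:
        raise ValueError("concurrent_files must not be negative")
    if jobs_per_worker < 1:
        raise ValueError("jobs_per_worker must be at least 1")
    return math.ceil(concurrent_files / jobs_per_worker)


print(workers_needed(20))      # 3 workers at the default of 8 jobs each
print(workers_needed(20, 16))  # 2 workers if each worker runs 16 jobs
```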

Configuring the format of Textual logs

Textual writes the worker, machine learning, and API log messages to stdout.

By default, the log messages are in an unstructured format.

To instead use a JSON format for the logs, set the environment variable SOLAR_EMIT_JSON_LOGS_TO_STDOUT to true.

Configuring endpoint URLs for calls to AWS

For calls to AWS products that are used in Textual, you can configure custom URLs to use. For example, if you use proxy endpoints, then you would configure those endpoints in Textual.

The environment variables for custom AWS endpoints include the following:

AWS_S3_FORCE_PATH_STYLE - Whether to always use path-style instead of virtual-hosted-style for connections to Amazon S3. The default is false.
This setting is only used if you configured either AWS_ENDPOINT_URL or AWS_ENDPOINT_URL_S3.
AWS_ENDPOINT_URL - The URL to use for all AWS calls, including calls to Amazon S3, Amazon Textract, and Amazon SES v2. This global endpoint is overridden by service-specific endpoints.
AWS_ENDPOINT_URL_S3 - The URL to use for calls to Amazon S3. This overrides the global URL set in AWS_ENDPOINT_URL.
AWS_ENDPOINT_URL_TEXTRACT - The URL to use for calls to Amazon Textract. This overrides the global URL set in AWS_ENDPOINT_URL.
AWS_ENDPOINT_URL_SESV2 - The URL to use for calls to Amazon SES v2. This overrides the global URL set in AWS_ENDPOINT_URL.
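The precedence between the global and the service-specific endpoints can be illustrated with a short sketch. This is not Textual's implementation. The function and the service keys are hypothetical, and only the variable names come from the list above.

```python
from typing import Mapping, Optional

# Service-specific endpoint variables from this guide.
SERVICE_ENDPOINT_VARS = {
    "s3": "AWS_ENDPOINT_URL_S3",
    "textract": "AWS_ENDPOINT_URL_TEXTRACT",
    "sesv2": "AWS_ENDPOINT_URL_SESV2",
}


def resolve_endpoint(service: str, env: Mapping[str, str]) -> Optional[str]:
    """Return the service-specific URL if set, else the global URL, else None."""
    specific = env.get(SERVICE_ENDPOINT_VARS[service])
    return specific or env.get("AWS_ENDPOINT_URL") or None


settings = {
    "AWS_ENDPOINT_URL": "https://proxy.example.com",
    "AWS_ENDPOINT_URL_S3": "https://s3.proxy.example.com",
}
print(resolve_endpoint("s3", settings))        # https://s3.proxy.example.com
print(resolve_endpoint("textract", settings))  # https://proxy.example.com
```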

Enabling PDF and image processing

To process PDF and image files, Tonic Textual uses optical character recognition (OCR). Textual supports the following OCR models:

Azure AI Document Intelligence
Amazon Textract
Tesseract

For the best performance, we recommend that you use either Azure AI Document Intelligence or Amazon Textract.

If you cannot use either of those - for example because you run Textual on-premises and cannot access third-party services - then you can use Tesseract.

Azure AI Document Intelligence

To use Azure AI Document Intelligence to process PDF and image files, Textual requires the Azure AI Document Intelligence key and endpoint.

Docker

In .env, uncomment and provide values for the following settings:

SOLAR_AZURE_DOC_INTELLIGENCE_KEY=#

SOLAR_AZURE_DOC_INTELLIGENCE_ENDPOINT=#

Kubernetes

In values.yaml, uncomment and provide values for the following settings:

azureDocIntelligenceKey:

azureDocIntelligenceEndpoint:

Amazon Textract

If the Azure-specific environment variables are not configured, then Textual attempts to use Amazon Textract.

To use Amazon Textract, Textual requires access to an IAM role that has sufficient permissions. You must also configure an S3 bucket to use to store files. The configured S3 bucket is required for uploaded file pipelines, and is also used to store dataset files and individual files that are redacted using the SDK.

We recommend that you use the AmazonTextractFullAccess policy, but you can also choose to use a more restricted policy.

Here is an example policy that provides the minimum required permissions:

{
	"Version": "2012-10-17",
	"Statement": [
		{
			"Sid": "VisualEditor0",
			"Effect": "Allow",
			"Action": [
				"textract:StartDocumentAnalysis",
				"textract:AnalyzeDocument",
				"textract:GetDocumentAnalysis"
			],
			"Resource": "*"
		}
	]
}

After the policy is attached to an IAM user or a role, it must be made accessible to Textual. To do this, either:

Assign an instance profile
Provide the AWS key, secret, and Region in the following environment variables:

AWS_ACCESS_KEY_ID
AWS_SECRET_ACCESS_KEY
AWS_DEFAULT_REGION

Tesseract

If neither Azure AI Document Intelligence nor Amazon Textract is configured, then Textual uses Tesseract, which is automatically available in your Textual installation.

Tesseract does not require any external access.
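The fallback order described in this section can be summarized in a small sketch. It is a simplification, not Textual's actual logic. The function name is hypothetical, and it treats the presence of SOLAR_INTERNAL_BUCKET_NAME as the only signal that Amazon Textract is usable, although Textract also requires an IAM role with sufficient permissions.

```python
from typing import Mapping


def select_ocr_model(env: Mapping[str, str]) -> str:
    """Pick an OCR model in the order described in this guide (simplified)."""
    azure_configured = bool(env.get("SOLAR_AZURE_DOC_INTELLIGENCE_KEY")) and bool(
        env.get("SOLAR_AZURE_DOC_INTELLIGENCE_ENDPOINT")
    )
    if azure_configured:
        return "Azure AI Document Intelligence"
    # Assumption: a configured S3 bucket stands in for "Textract is available".
    if env.get("SOLAR_INTERNAL_BUCKET_NAME"):
        return "Amazon Textract"
    # No external service is configured.
    return "Tesseract"
```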

Setting the S3 bucket for file uploads and redactions

Tonic Textual pipelines can process files from sources such as Amazon S3, Azure Blob Storage, and Databricks Unity Catalog. You can also create pipelines to process files that you upload directly from your browser.

For those uploaded file pipelines, Textual always stores the files in an S3 bucket. On a self-hosted instance, before you add files to an uploaded file pipeline, you must configure the S3 bucket and the associated authentication credentials.

The configured S3 bucket is also used to store dataset files and individual files that you use the Textual SDK to redact. If an S3 bucket is not configured, then:

The dataset and individual redacted files are stored in the Textual application database.
You cannot use Amazon Textract for PDF and image processing. If you configured Textual to use Amazon Textract, Textual instead uses Tesseract.

The authentication credentials for the S3 bucket include:

The AWS Region where the S3 bucket is located.
An AWS access key that is associated with an IAM user or role.
The secret key that is associated with the access key.

To provide the authentication credentials, you can either:

Provide the values directly as environment variable values.
Use the instance profile of the compute instance where Textual runs.

For an example IAM role that has the required permissions, go to Example IAM role for file uploads and redactions.

Docker

In .env, add the following settings:

SOLAR_INTERNAL_BUCKET_NAME=<S3 bucket path>
AWS_REGION=<AWS Region>
AWS_ACCESS_KEY_ID=<AWS access key>
AWS_SECRET_ACCESS_KEY=<AWS secret key>

If you use the instance profile of the compute instance, then only the bucket name is required.

Kubernetes

In values.yaml, within env: { } under both textual_api_server and textual_worker, add the following settings:

SOLAR_INTERNAL_BUCKET_NAME

AWS_REGION

AWS_ACCESS_KEY_ID

AWS_SECRET_ACCESS_KEY

For example, if no other environment variables are defined:

env: {
  "SOLAR_INTERNAL_BUCKET_NAME": "<S3 bucket path>",
  "AWS_REGION": "<AWS Region>",
  "AWS_ACCESS_KEY_ID": "<AWS access key>",
  "AWS_SECRET_ACCESS_KEY": "<AWS secret key>"
}

If you use the instance profile of the compute instance, then only the bucket name is required.

Required IAM role permissions for Amazon S3

For Amazon S3 pipelines, you connect to S3 buckets to select and store files.

On self-hosted instances, you also configure an S3 bucket and the credentials to use to store files for:

Uploaded file pipelines. The S3 bucket is required for uploaded file pipelines. The S3 bucket is not used for pipelines that connect to Azure Blob Storage or to Databricks Unity Catalog.
Dataset files. If you do not configure an S3 bucket, then the files are stored in the application database.
Individual files that you send to the SDK for redaction. If you do not configure an S3 bucket, then the files are stored in the application database.

Here are examples of IAM roles that have the required permissions to connect to Amazon S3 to select or store files.

Example IAM role for file uploads and redactions

For uploaded file pipelines, datasets, and individual file redactions, the files are stored in a single S3 bucket. For information on how to configure the S3 bucket and the corresponding access credentials, go to Setting the S3 bucket for file uploads and redactions.

The IAM role that is used to connect to the S3 bucket must be able to read files from and write files to it.

Here is an example of an IAM role that has the permissions required to support uploaded file pipelines, datasets, and individual redactions:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:PutObject",
                "s3:DeleteObject",
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::<SOLAR_INTERNAL_BUCKET_NAME>",
                "arn:aws:s3:::<SOLAR_INTERNAL_BUCKET_NAME>/*"
            ]
        }
    ]
}
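If you generate this policy for more than one environment, a short script can fill in the bucket name. The following sketch reproduces the example above with Python's standard json module. The function name is hypothetical.

```python
import json


def build_internal_bucket_policy(bucket_name: str) -> str:
    """Return the example policy as JSON text, scoped to a single bucket."""
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "s3:GetObject",
                    "s3:PutObject",
                    "s3:DeleteObject",
                    "s3:ListBucket",
                ],
                # The bucket itself (for ListBucket) and the objects inside it.
                "Resource": [
                    f"arn:aws:s3:::{bucket_name}",
                    f"arn:aws:s3:::{bucket_name}/*",
                ],
            }
        ],
    }
    return json.dumps(policy, indent=4)
```

The bucket name that you pass in is the same value that you set in SOLAR_INTERNAL_BUCKET_NAME.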

Example IAM role for Amazon S3 pipelines

The access credentials that you configure for an Amazon S3 pipeline must be able to navigate to and select files and folders from the appropriate S3 buckets. They also need to be able to write output files to the configured output location.

Here is an example of an IAM role that has the permissions required to support Amazon S3 pipelines:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "s3:ListAllMyBuckets",
            "Resource": "arn:aws:s3:::*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket",
                "s3:GetBucketLocation"
            ],
            "Resource": [
                "arn:aws:s3:::*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:PutObject",
                "s3:DeleteObject",
                "s3:ListBucketMultipartUploads",
                "s3:AbortMultipartUpload"
            ],
            "Resource": [
                "arn:aws:s3:::*/*"
            ]
        }
    ]
}

Configuring model preferences

On a self-hosted instance, you can configure settings to determine whether to use the auxiliary model, and which models to use on GPU.

Configuring whether to use an auxiliary model

To improve overall inference, you can configure whether Textual uses the en_core_web_sm auxiliary NER model.

Entity types that the auxiliary model detects

The auxiliary model detects the following types:

EVENT
LANGUAGE
LAW
NRP
NUMERIC_VALUE
PRODUCT
WORK_OF_ART

Indicating whether to use the auxiliary model

To configure whether to use the auxiliary model, you use the environment variable TEXTUAL_AUX_MODEL.

The available values are:

en_core_web_sm - This is the default value.
none - Indicates to not use the auxiliary model.

Configuring model use for GPU

When you use a textual-ml-gpu container on accelerated hardware, you can configure:

Whether to use the auxiliary model
Whether to use the date synthesis model

Indicating whether to use the auxiliary model for GPU

To configure whether to use the auxiliary model for GPU, you use the environment variable TEXTUAL_AUX_MODEL_GPU.

By default, on GPU, Textual does not use the auxiliary model, and TEXTUAL_AUX_MODEL_GPU is false.

To use the auxiliary model for GPU, based on the configuration of TEXTUAL_AUX_MODEL, set TEXTUAL_AUX_MODEL_GPU to true.

When TEXTUAL_AUX_MODEL_GPU is true, and TEXTUAL_MULTI_LINGUAL is true, Textual also loads the multilingual models on GPU.

Indicating whether to use the date synthesis model for GPU

By default, Textual loads the date synthesis model on GPU.

Note that this model requires 600MB of GPU RAM for each machine learning worker.

To not load the date synthesis model on GPU, set the environment variable TEXTUAL_DATE_SYNTH_GPU to false.

Viewing model specifications

On a self-hosted instance of Tonic Textual, you can view the current model specifications for the instance.

To view the model specifications:

Click the user icon at the top right.
In the user menu, click System Settings.

On the System Settings page, the Model Specifications section provides details about the models that Textual uses.

Managing user access to Textual

Tonic Textual provides the following options to manage access to Textual and its features.

Textual organizations

In Tonic Textual, each user belongs to an organization. Organizations are used to determine the company or customer that a Textual user belongs to.

A self-hosted instance of Textual contains a single organization. All users belong to that organization.

Textual Cloud hosts multiple organizations. The organizations are kept completely separate. Users from one Textual Cloud organization do not have any access to the users, datasets, or pipelines that belong to a different Textual Cloud organization.

When is a Textual organization created?

A Textual organization is created:

For a standard Textual license, both self-hosted and Textual Cloud, when the first user signs up for a Textual account.
When a user signs up for a free trial or pay-as-you-go Textual Cloud license with a unique corporate email domain.
When a user signs up for a free trial or pay-as-you-go Textual Cloud license with a public email domain, such as Gmail or Yahoo. Every user with a public email domain is in a separate organization.

When is a new user added to an existing organization?

Self-hosted instance

A self-hosted instance has a single organization. Every user who signs up for an account on that instance is added to the organization.

Annual Textual Cloud license (not pay-as-you-go)

For companies with an annual Textual Cloud license, the license specifies the included email domains.

When a user with one of the included email domains signs up for a Textual account, they are automatically added to that organization.

Pay-as-you-go license

For a pay-as-you-go license, when a user with the same corporate email domain signs up for a Textual account, they are automatically added to that organization.

Users with public email domains are always in separate organizations.

Creating a new account in an existing organization

New user on a self-hosted instance

If your company has a self-hosted Textual instance that is installed on-premises, then you navigate to the Textual URL for that instance.

Your self-hosted instance might be configured to use single sign-on for Textual access. If so, then from the Textual login page, to create your Textual user account, click the single sign-on option.

Otherwise, to create your Textual user account, click Sign Up.

Your administrator can provide the URL for your Textual instance and confirm the instructions for creating your user account.

New user for an existing Textual Cloud organization

If your Textual license is on Textual Cloud, then new users that have a matching email domain are automatically added to your Textual Cloud organization.

For a Textual Cloud license other than a pay-as-you-go license, the license agreement specifies the included email domains. When a user with a matching email domain signs up for a Textual account, they are added to that Textual Cloud organization.

For a pay-as-you-go Textual Cloud license, when a user with the same corporate email domain as the subscribed user signs up for a Textual account, they are added to that Textual Cloud organization.

To create your Textual user account, on the Textual Cloud login page, click Sign Up.

Viewing the list of SSO groups in Textual

Required global permission: View users and groups

If you use SSO to manage Tonic Textual groups, then Textual displays the list of groups for which at least one user has logged in to Textual.

To display the SSO group list:

Click the user icon at the top right.
In the user menu, click Permission Settings.
On the Permission Settings page, click Groups.

If no users from a group have logged in to Textual, then the group does not display in the list.

The list only displays the group names and indicates the SSO provider. To manage the group permissions:

To assign global permission sets, go to the Global Permission Sets list. For more information, go to Configuring access to global permission sets.
To assign dataset permission sets, go to the Datasets page. For more information, go to Sharing dataset access.
To assign pipeline permission sets, go to the Pipelines page. For more information, go to Sharing pipeline access.

Azure

Use these instructions to set up Azure Active Directory as your SSO provider for Tonic Textual.

Azure configuration

In the portal, navigate to Azure Active Directory -> App registrations, then click New registration.
Register Textual and create a new web redirect URI that points to your Textual instance's address and the path /sso/callback/azure.
Take note of the values for client ID and tenant ID. You will need them later.
Click Add a certificate or secret, and then create a new client secret. Take note of the secret value. You will need this later.
Navigate to the API permissions page. Add the following permissions for the Microsoft Graph API:
- OpenId permissions
  - email
  - openid
  - profile
- GroupMember
  - GroupMember.Read.All
- User
  - User.Read
Click Grant admin consent for Tonic AI. This allows the application to read the user and group information from your organization. When permissions have been granted, the status should change to Granted for Tonic AI.
Navigate to Enterprise applications and then select Textual. From here, you can assign the users or groups that should have access to Textual.

Textual configuration

After you complete the configuration in Azure, you uncomment and configure the required environment variables in Textual.

For Kubernetes, in values.yaml:

# Azure SSO Config
# -----------------
#azureClientId: <client-id>
#azureTenantId: <tenant-id>
#azureClientSecret: <client-secret>

For Docker, in .env:

#SOLAR_SSO_AZURE_CLIENT_ID=#<FILL IN>
#SOLAR_SSO_AZURE_TENANT_ID=#<FILL IN>
#SOLAR_SSO_AZURE_CLIENT_SECRET=#<FILL IN>

Redact individual strings

Required global permission: Use the API to parse or redact a text string

Before you perform these tasks, remember to instantiate the SDK client.

You can use the Tonic Textual SDK to redact individual strings, including:

Plain text strings
JSON content
XML content

For a text string, you can also request synthesized values from a large language model (LLM).

The redaction request can include the handling configuration for entity types.

The redaction response includes the redacted or synthesized content and details about the detected entity values.

Redact a plain text string

To send a plain text string for redaction, use textual.redact:

redaction_response = textual.redact("""<text of the string>""")
redaction_response.describe()

For example:

redaction_response = textual.redact("""Contact Tonic AI with questions""")
redaction_response.describe()

Contact ORGANIZATION_EPfC7XZUZ with questions
    
{"start": 8, "end": 16, "new_start": 8, "new_end": 30, "label": "ORGANIZATION", "text": "Tonic AI", "new_text": "[ORGANIZATION]", "score": 0.85, "language": "en"}

The redact call provides an option to record the request, to allow you to preview the results in the Textual application. For more information, go to Record and review redaction requests.

Redact multiple plain text strings

To send multiple plain text strings for redaction, use textual.redact_bulk:

bulk_response = textual.redact_bulk([<List of strings>])

For example:

bulk_response = textual.redact_bulk(["Tonic.ai was founded in 2018", "John Smith is a person"])
bulk_response.describe()

[ORGANIZATION_5Ve7OH] was founded in [DATE_TIME_DnuC1]

{"start": 0, "end": 5, "new_start": 0, "new_end": 21, "label": "ORGANIZATION", "text": "Tonic", "score": 0.9, "language": "en", "new_text": "[ORGANIZATION]"}
{"start": 21, "end": 25, "new_start": 37, "new_end": 54, "label": "DATE_TIME", "text": "2018", "score": 0.9, "language": "en", "new_text": "[DATE_TIME]"}

[NAME_GIVEN_dySb5] [NAME_FAMILY_7w4Db3] is a person

{"start": 0, "end": 4, "new_start": 0, "new_end": 18, "label": "NAME_GIVEN", "text": "John", "score": 0.9, "language": "en", "new_text": "[NAME_GIVEN]"}
{"start": 5, "end": 10, "new_start": 19, "new_end": 39, "label": "NAME_FAMILY", "text": "Smith", "score": 0.9, "language": "en", "new_text": "[NAME_FAMILY]"}

Redact JSON content

To send a JSON string for redaction, use textual.redact_json. You can send the JSON content as a JSON string or a Python dictionary.

json_redaction = textual.redact_json(<JSON string or Python dictionary>)

redact_json ensures that only the values are redacted. It ignores the keys.

Basic JSON redaction example

Here is a basic example of a JSON redaction request:

import json

# Build a nested dictionary that contains sensitive values
d = dict()
d['person'] = {'first': 'John', 'last': 'OReilly'}
d['address'] = {'city': 'Memphis', 'state': 'TN', 'street': '847 Rocky Top', 'zip': 1234}
d['description'] = 'John is a man that lives in Memphis.  He is 37 years old and is married to Cynthia.'

json_redaction = textual.redact_json(d)

# redacted_text is a JSON string, so parse it before pretty-printing
print(json.dumps(json.loads(json_redaction.redacted_text), indent=2))

It produces the following JSON output:

{
"person": {
    "first": "[NAME_GIVEN]",
    "last": "[NAME_FAMILY]"
},
"address": {
    "city": "[LOCATION_CITY]",
    "state": "[LOCATION_STATE]",
    "street": "[LOCATION_ADDRESS]",
    "zip": "[LOCATION_ZIP]"
},
"description": "[NAME_GIVEN] is a man that lives in [LOCATION_CITY].  He is [DATE_TIME] and is married to [NAME_GIVEN]."
}
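The values-only behavior can be pictured with a small sketch. This is not how redact_json is implemented. It is a minimal illustration that assumes a stand-in function, fake_redactor (hypothetical), in place of Textual's entity detection, and shows that a recursive walk can change leaf values while leaving every key untouched.

```python
from typing import Any, Callable


def redact_values(node: Any, redact_value: Callable[[str], str]) -> Any:
    """Recursively apply redact_value to leaf values; keys are never changed."""
    if isinstance(node, dict):
        return {key: redact_values(value, redact_value) for key, value in node.items()}
    if isinstance(node, list):
        return [redact_values(item, redact_value) for item in node]
    if isinstance(node, bool) or node is None:
        return node
    # Strings and numbers are treated as candidate sensitive values.
    return redact_value(str(node))


# Hypothetical stand-in for entity detection: mask one known first name.
def fake_redactor(value: str) -> str:
    return value.replace("John", "[NAME_GIVEN]")


# The key "John" is left alone even though it matches a sensitive value.
sample = {"John": {"first": "John", "note": "John lives in Memphis"}}
result = redact_values(sample, fake_redactor)
```

Note that numeric leaf values are converted to strings before they are passed to the stand-in function, which mirrors how the zip value in the example output above becomes a string placeholder.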

Specifying entity types for specific JSON paths

When you redact a JSON string, you can optionally assign specific entity types to selected JSON paths.

To do this, you include the jsonpath_allow_lists parameter. Each entry consists of an entity type and a list of JSON paths for which to always use that entity type. Each JSON path must point to a simple string or numeric value.

jsonpath_allow_lists={'entity_type':['JSON Paths']}

The specified entity type overrides both the detected entity type and any added or excluded values.

In the following example, the value of the key1 node is always treated as a telephone number:

response = textual.redact_json('{"key1":"Ex123", "key2":"Johnson"}', jsonpath_allow_lists={'PHONE_NUMBER':['$.key1']})

It produces the following redacted output:

{"key1":"[PHONE_NUMBER]","key2":"[NAME_FAMILY]"}
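Because each JSON path must point to a simple string or numeric value, it can help to check paths locally before you send a request. The following is a minimal sketch, not part of the Textual SDK. The helper names resolve_simple_path and points_to_scalar are hypothetical, and the resolver only understands dotted keys and numeric indexes such as $.person.first or $.items[0].

```python
import re
from typing import Any

# Matches either .key or [index]; wildcards and filters are not supported.
_TOKEN = re.compile(r"\.([A-Za-z_][A-Za-z0-9_]*)|\[(\d+)\]")


def resolve_simple_path(document: Any, path: str) -> Any:
    """Resolve a path such as $.person.first or $.items[0]."""
    if not path.startswith("$"):
        raise ValueError("path must start with $")
    node = document
    position = 1
    while position < len(path):
        match = _TOKEN.match(path, position)
        if match is None:
            raise ValueError(f"unsupported path syntax at offset {position}")
        key, index = match.group(1), match.group(2)
        node = node[key] if key is not None else node[int(index)]
        position = match.end()
    return node


def points_to_scalar(document: Any, path: str) -> bool:
    """Return True when the path resolves to a string or a number."""
    value = resolve_simple_path(document, path)
    return isinstance(value, (str, int, float)) and not isinstance(value, bool)


doc = {"key1": "Ex123", "key2": "Johnson", "nested": {"items": [1, 2]}}
```

With this sample document, $.key1 is a valid target, while $.nested is not, because it resolves to an object.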

Redact XML content

To send an XML string for redaction, use textual.redact_xml.

redact_xml ensures that only the values are redacted. It ignores the XML markup.

For example:

xml_string = '''<?xml version="1.0" encoding="UTF-8"?>
    <!-- This XML document contains sample PII with namespaces and attributes -->
    <PersonInfo xmlns="http://www.example.com/default" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:contact="http://www.example.com/contact">
        <!-- Personal Information with an attribute containing PII -->
        <Name preferred="true" contact:userID="john.doe123">
            <FirstName>John</FirstName>
            <LastName>Doe</LastName>He was born in 1980.</Name>

        <contact:Details>
            <!-- Email stored in an attribute for demonstration -->
            <contact:Email address="[email protected]"/>
            <contact:Phone type="mobile" number="555-6789"/>
        </contact:Details>

        <!-- SSN stored as an attribute -->
        <SSN value="987-65-4321" xsi:nil="false"/>
        <data>his name was John Doe</data>
    </PersonInfo>'''

response = textual.redact_xml(xml_string)

redacted_xml = response.redacted_text

This produces the following XML output:

<?xml version="1.0" encoding="UTF-8"?><!-- This XML document contains sample PII with namespaces and attributes -->\n<PersonInfo xmlns="http://www.example.com/default" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:contact="http://www.example.com/contact"><!-- Personal Information with an attribute containing PII --><Name preferred="true" contact:userID="[NAME_GIVEN]">[GENDER_IDENTIFIER] was born in [DOB].<FirstName>[NAME_GIVEN]</FirstName><LastName>[NAME_FAMILY]</LastName></Name><contact:Details><!-- Email stored in an attribute for demonstration --><contact:Email address="[EMAIL_ADDRESS]"></contact:Email><contact:Phone type="mobile" number="[PHONE_NUMBER]"></contact:Phone></contact:Details><!-- SSN stored as an attribute --><SSN value="[PHONE_NUMBER]" xsi:nil="false"></SSN><data>[GENDER_IDENTIFIER] name was [NAME_GIVEN] [NAME_FAMILY]</data></PersonInfo>

Redact HTML content

To send an HTML string for redaction, use textual.redact_html.

redact_html ensures that only the values are redacted. It ignores the HTML markup.

For example:

html_content = """
<!DOCTYPE html>
<html>
    <head>
        <title>John Doe</title>
    </head>
    <body>
        <h1>John Doe</h1>
        <p>John Doe is a person who lives in New York City.</p>
        <p>John Doe's phone number is 555-555-5555.</p>
    </body>
</html>
"""

# Run the redact_html method and synthesize the name values
redacted_html = textual.redact_html(html_content, generator_config={
    "NAME_GIVEN": "Synthesis",
    "NAME_FAMILY": "Synthesis"
})

print(redacted_html.redacted_text)

This produces the following HTML output:

<!DOCTYPE html>
<html>
    <head>
        <title>Scott Roley</title>
    </head>
    <body>
        <h1>Scott Roley</h1>
        <p>Scott Roley is a person who lives in [LOCATION_CITY].</p>
        <p>Scott Roley's phone number is [PHONE_NUMBER].</p>
    </body>
</html>

Using an LLM to generate synthesized values

You can also request synthesized values from a large language model (LLM).

When you use this process, Textual first identifies the sensitive values in the text. It then sends the value locations and redacted values to the LLM. For example, if Textual identifies a product name, it sends the location and the redacted value PRODUCT to the LLM. Textual does not send the original values to the LLM.

The LLM then generates realistic synthesized values of the appropriate value types.

To send text to an LLM, use textual.llm_synthesis:

raw_synthesis = textual.llm_synthesis("Text of the string")

For example:

raw_synthesis = textual.llm_synthesis("My name is John, and today I am demoing Textual, a software product created by Tonic")
raw_synthesis.describe()

My name is John, and on Monday afternoon I am demoing Widget Pro, a software product created by Initech Enterprises.
{"start": 11, "end": 15, "new_start": 11, "new_end": 15, "label": "NAME_GIVEN", "text": "John", "new_text": null, "score": 0.9, "language": "en"}
{"start": 21, "end": 26, "new_start": 21, "new_end": 40, "label": "DATE_TIME", "text": "today", "new_text": null, "score": 0.85, "language": "en"}
{"start": 40, "end": 47, "new_start": 54, "new_end": 64, "label": "PRODUCT", "text": "Textual", "new_text": null, "score": 0.85, "language": "en"}
{"start": 79, "end": 84, "new_start": 96, "new_end": 115, "label": "ORGANIZATION", "text": "Tonic", "new_text": null, "score": 0.85, "language": "en"}

Before you can use this endpoint, you must enable additional LLM processing. The additional processing sends the values and surrounding text to the LLM. For an overview of the LLM processing and how to enable it, go to Enabling and using additional LLM processing of detected entities.

Format of the redaction and synthesis response

The response provides the redacted or synthesized version of the string, and the list of detected entity values.

Contact ORGANIZATION_EPfC7XZUZ with questions
    
{"start": 8, "end": 16, "new_start": 8, "new_end": 30, "label": "ORGANIZATION", "text": "Tonic AI", "new_text": "[ORGANIZATION]", "score": 0.85, "language": "en"}

For each redacted item, the response includes:

The location of the value in the original text (start and end)
The location of the value in the redacted version of the string (new_start and new_end)
The entity type (label)
The original value (text)
The replacement value (new_text). new_text is null in the following cases:
- The entity type is ignored
- The response is from llm_synthesis
A score to indicate confidence in the detection and redaction (score)
The detected language for the value (language)
For responses from textual.redact_json, the JSON path to the entity in the original document (json_path)
For responses from textual.redact_xml, the XPath to the entity in the original XML document (xml_path)
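To make the relationship between start/end and new_start/new_end concrete, here is a minimal sketch that rebuilds a redacted string from the original text and recomputes the new offsets. It is an illustration only, not the SDK's implementation. The function name apply_replacements and the replacement key are hypothetical, and the sketch assumes that end is exclusive, as in the sample response above.

```python
from typing import Dict, List


def apply_replacements(original: str, entities: List[Dict]) -> str:
    """Rebuild the redacted string and record new_start/new_end on each entity."""
    pieces = []
    cursor = 0       # position in the original string
    new_length = 0   # length of the output built so far
    for entity in sorted(entities, key=lambda e: e["start"]):
        # The original offsets must point at the reported text.
        assert original[entity["start"]:entity["end"]] == entity["text"]
        unchanged = original[cursor:entity["start"]]
        pieces.append(unchanged)
        new_length += len(unchanged)
        entity["new_start"] = new_length
        pieces.append(entity["replacement"])
        new_length += len(entity["replacement"])
        entity["new_end"] = new_length
        cursor = entity["end"]
    pieces.append(original[cursor:])
    return "".join(pieces)


# "replacement" is a hypothetical key that holds the text to substitute.
entities = [{"start": 8, "end": 16, "label": "ORGANIZATION",
             "text": "Tonic AI", "replacement": "ORGANIZATION_EPfC7XZUZ"}]
redacted = apply_replacements("Contact Tonic AI with questions", entities)
```

For this input, the recomputed new_start and new_end are 8 and 30, which matches the sample response.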

Configuring synthesis options

Required dataset permission: Edit dataset settings

When Textual generates replacement values, those values are always consistent. Consistency means that the same original value always produces the same replacement value. You can also enable consistency with some Tonic Structural output values.

For all entity types, you can specify the replacements for specific values.

Some entity types include type-specific options for how Tonic Textual generates the replacement values.

For custom entity types, you can select the generator to use.

You can also set whether to use the new synthesis process.

Enabling consistency with Tonic Structural

If you also use Tonic Structural, then you can configure Textual to enable selected synthesized values to be consistent between the two applications.

For example, a given source telephone number can produce the same replacement telephone number in both Structural and Textual.

To enable this consistency, you configure a statistics seed value as the value of the Textual environment variable SOLAR_STATISTICS_SEED. A statistics seed is a signed 32-bit integer.

The value must match a Structural statistics seed, either:

The value of the Structural environment setting TONIC_STATISTICS_SEED.
A statistics seed configured for an individual Structural workspace.

The current statistics seed value is displayed on the System Settings page.
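Because a statistics seed must be a signed 32-bit integer, a quick range check before you set SOLAR_STATISTICS_SEED can catch typos. This is a small sketch, and the helper name parse_statistics_seed is hypothetical.

```python
INT32_MIN = -2**31
INT32_MAX = 2**31 - 1


def parse_statistics_seed(raw: str) -> int:
    """Parse a seed string and verify that it fits in a signed 32-bit integer."""
    value = int(raw.strip())
    if not INT32_MIN <= value <= INT32_MAX:
        raise ValueError(f"{value} is outside the signed 32-bit range")
    return value
```

For example, parse_statistics_seed("12345") returns 12345, while parse_statistics_seed("2147483648") raises a ValueError.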

Using the new synthesis process

Textual has developed an updated synthesis process that is currently implemented for the following entity types:

URLs
Names
Custom entity types

In particular, the new synthesis process improves the display of the synthesized values in PDF files. The values better match the available space and the original font.

To configure whether to use the new process:

On the dataset details page, click Settings.
On the Dataset Settings page, under PDF Settings, the New PDF synthesis mode (experimental) toggle determines which process to use. To use the new process, set the toggle to the on position.
Click Save Dataset.

Providing specific replacement values

For all entity types, you can provide a list of specific replacement values.

For example, for the Given Name entity type, you might indicate to always replace John with Michael and Mary with Melissa.

For the remaining values, Textual generates the replacement values.

To display the synthesis options for an entity type, click Options.

In the text area, provide a JSON object that maps the original values to the replacement values. For example:

{
  "French": "German",
  "English": "Japanese"
}

With the above configuration for the Language entity type:

All instances of French are changed to German.
All instances of English are changed to Japanese.
Textual selects the replacement values for other languages.
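The lookup order described above (explicit mapping first, generated value otherwise) can be sketched as follows. The names replace_value and placeholder_generator are hypothetical, and the generator is only a stand-in for whatever Textual produces for unmapped values.

```python
from typing import Callable, Dict


def replace_value(value: str, overrides: Dict[str, str],
                  generate: Callable[[str], str]) -> str:
    """Use the explicit replacement when one exists; otherwise generate one."""
    if value in overrides:
        return overrides[value]
    return generate(value)


overrides = {"French": "German", "English": "Japanese"}


# Hypothetical generator used only for values that have no explicit mapping.
def placeholder_generator(value: str) -> str:
    return "[LANGUAGE]"
```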

Configuring name synthesis options

For the Given Name and Family Name entity types, you can configure:

Whether to treat the same name with different casing as a different value.
Whether to replicate the gender of the original value.

In the entity types list, to display the name synthesis options, click Options.

Differentiating source values by case

To treat the same name with different casing as different source values, check Is Consistency Case Sensitive.

For example, when this is checked, john and John are treated as different names, and can have different replacement values - john might be replaced with michael, and John might be replaced with Stephen.

When this is not checked, then john and John are treated as the same source value, and get the same replacement.
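One way to picture the effect of the case sensitivity setting is a deterministic lookup that normalizes the source value before it selects a replacement. The sketch below is an assumption about the general technique, not a description of how Textual generates names. The function consistent_name, the candidate list, and the use of SHA-256 are all illustrative choices, and the sketch does not try to preserve the casing of the original value.

```python
import hashlib
from typing import Sequence

# Illustrative pool of replacement names.
CANDIDATES = ["Michael", "Stephen", "Robert", "David", "Carlos"]


def consistent_name(value: str, seed: int, case_sensitive: bool,
                    candidates: Sequence[str] = CANDIDATES) -> str:
    """Pick a replacement deterministically from the (normalized) source value."""
    # When casing is ignored, john and John share the same lookup key.
    key = value if case_sensitive else value.lower()
    digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(candidates)
    return candidates[index]
```

With case_sensitive set to False, john and John always map to the same replacement. With it set to True, they are hashed separately and can map to different replacements.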

Preserving gender in names

To replace source names with names that have the same gender, check Preserve Gender.

For example, when this is checked, John might be replaced with Michael, since they are both traditionally male names. However, John would not be replaced with Mary, which is traditionally a female name.

Configuring location synthesis options

Location values include the following types:

Location
Location Address
Location State
Location Zip

You can select whether to generate HIPAA or non-HIPAA addresses. Address values can be consistent with values generated in Structural.

For each location type other than Location State, you can specify whether to use a realistic replacement value. For Location State, based on HIPAA guidelines, both the Synthesis option and the Off option pass through the value.

For location types that include zip codes, you can also specify how to generate the new zip code values.

In the entity types list, to display the location synthesis options, click Options.

Selecting the type of address generator to use

Under Address generator type, select the type of address generator to use:

HIPAA-compliant address generator. This option generates values similar to those generated by the corresponding generator in Tonic Structural.
Non-HIPAA address generator. This option generates values similar to those generated by the corresponding generator in Tonic Structural.

If you configured a Textual statistics seed that matches a Structural statistics seed, then the generated address values are consistent with values generated in Structural. A given address value produces the same output value in both applications.

For example, in both Textual and Structural, a source address value 123 Main Street might be replaced with 234 Oak Avenue.

Indicating whether to use realistic replacement values

By default, Textual replaces a location value with a realistic corresponding value. For example, "Main Street" might be replaced with "Fourth Avenue".

To instead scramble the values, uncheck Replace with realistic values.

Indicating how to generate replacement zip codes

By default, to generate a new zip code, Textual selects a real zip code that starts with the same three digits as the original zip code. For a low population area, Textual instead selects a random zip code from the United States.

To instead replace the last two digits of the zip code with zeros, check Replace zeroes for zip codes. For a low population area, Textual instead replaces all of the digits in the zip code with zeros.
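The zero-replacement rule can be sketched as follows. This is an illustration under stated assumptions: it expects a five-digit zip code string, the function name zero_out_zip is hypothetical, and the set of low population prefixes is supplied by the caller because the actual list that Textual uses is not given here.

```python
from typing import AbstractSet


def zero_out_zip(zip_code: str, low_population_prefixes: AbstractSet[str]) -> str:
    """Keep the first three digits and zero the rest, or zero everything
    when the three-digit prefix belongs to a low population area."""
    prefix = zip_code[:3]
    if prefix in low_population_prefixes:
        return "0" * len(zip_code)
    return prefix + "0" * (len(zip_code) - 3)
```

For example, with a hypothetical low population prefix set of {"036"}, 37201 becomes 37200 and 03601 becomes 00000.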

Configuring datetime synthesis options

By default, when you select the Synthesis option for Date/Time and Date of Birth values, Textual shifts the datetime values to a value that occurs within 7 days before or after the original value.

To customize how Textual sets the new values, you can:

Set a different range within which Textual sets the new values
Indicate whether to scramble date values that Textual cannot parse
Indicate whether to shift all of the original values by the same amount and in the same direction
Add additional date formats for Textual to recognize

In the entity types list, to display the datetime synthesis options, click Options.

Adjusting the range for the replacement values

By default, Textual adjusts the dates to values that are within 7 days before or after the original date.

To change the range:

In the Left bound on # of Days To Shift field, enter the number of days before the original date within which the replacement datetime value must occur. For example, if you enter 10, then the replacement datetime value cannot occur earlier than 10 days before the original value.
In the Right bound on # of Days To Shift field, enter the number of days after the original date within which the replacement datetime value must occur. For example, if you enter 6, then the replacement datetime value cannot occur later than 6 days after the original value.
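The two bounds can be read as a closed interval of allowed offsets. The following sketch shows that interpretation. It assumes whole-day shifts and uses a hypothetical function name, shift_datetime. How Textual actually selects the offset is not specified here.

```python
import random
from datetime import datetime, timedelta


def shift_datetime(value: datetime, left_bound_days: int, right_bound_days: int,
                   rng: random.Random) -> datetime:
    """Shift value by a whole number of days in [-left_bound_days, +right_bound_days]."""
    offset = rng.randint(-left_bound_days, right_bound_days)
    return value + timedelta(days=offset)


original = datetime(2024, 1, 17, 15, 45)
# Left bound 10, right bound 6: the result falls between January 7 and January 23.
shifted = shift_datetime(original, 10, 6, random.Random(0))
```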

Indicating how to replace datetime values in unsupported formats

Textual can parse datetime values that use either a format in Default supported datetime formats in Textual or a format that you add.

The Scramble Unrecognized Dates checkbox indicates how Textual should handle datetime values that it does not recognize.

By default, the checkbox is checked, and Textual scrambles those values.

To instead pass through the values without changing them, uncheck Scramble Unrecognized Dates.
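The decision that this checkbox controls can be sketched as a parse attempt followed by a fallback. The sketch is illustrative only. It uses Python strptime formats as stand-ins for a few Noda Time patterns (yyyy-M-d, M/d/yyyy, and d MMMM yyyy), and the helper names parse_known and handle_value are hypothetical. Month names are parsed with the default English locale.

```python
from datetime import datetime
from typing import Optional

# Python strptime equivalents for a few of the supported patterns.
KNOWN_FORMATS = ["%Y-%m-%d", "%m/%d/%Y", "%d %B %Y"]


def parse_known(value: str) -> Optional[datetime]:
    """Return the parsed datetime, or None when no known format matches."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(value, fmt)
        except ValueError:
            continue
    return None


def handle_value(value: str, scramble_unrecognized: bool = True) -> str:
    """Describe which path a value would take."""
    if parse_known(value) is not None:
        return "shift"
    return "scramble" if scramble_unrecognized else "pass through"
```

A recognized value such as 2024-1-17 is shifted. An unrecognized value is scrambled by default, or passed through when the checkbox is unchecked.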

Indicating whether to shift all values by the same amount

By default, Textual applies different shifts to the original values. Some replacement dates might be earlier, and some might be later. The amount of shift might also vary.

To shift all of the datetime values in the same way, check Apply same shift for entire document.

For example, if this is checked, Textual might shift all datetime values 3 days in the future.
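A practical consequence of using the same shift for the entire document is that the intervals between dates are preserved. The sketch below illustrates this. The function name shift_document_dates is hypothetical, and the symmetric max_days range is a simplifying assumption.

```python
import random
from datetime import date, timedelta
from typing import List


def shift_document_dates(dates: List[date], max_days: int, same_shift: bool,
                         rng: random.Random) -> List[date]:
    """Shift every date; reuse one offset when same_shift is True."""
    shared = rng.randint(-max_days, max_days)
    shifted = []
    for value in dates:
        offset = shared if same_shift else rng.randint(-max_days, max_days)
        shifted.append(value + timedelta(days=offset))
    return shifted


admitted, discharged = date(2024, 1, 1), date(2024, 1, 10)
result = shift_document_dates([admitted, discharged], 7, True, random.Random(3))
```

With the shared shift, the nine-day gap between the two sample dates is unchanged in the output. With independent shifts, the gap can grow or shrink.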

Adding datetime formats

By default, Textual is able to recognize datetime values that use a format from Default supported datetime formats in Textual.

Under Additional Date Formats, you can add other datetime formats that you know are present in your data.

The formats must use a Noda Time LocalDateTime pattern.

To add a format, type the format in the field, then click +.

To remove a format, click its delete icon.

Default supported datetime formats in Textual

By default, Textual supports the following datetime formats.

Date only formats

Format                      Example value
yyyy/M/d                    2024/1/17
yyyy-M-d                    2024-1-17
yyyyMMdd                    20240117
yyyy.M.d                    2024.1.17
yyyy, MMM d                 2024, Jan 17
yyyy-M                      2024-1
yyyy/M                      2024/1
d/M/yyyy                    17/1/2024
d-MMM-yyyy                  17-Jan-2024
dd-MMM-yy                   17-Jan-24
d-M-yyyy                    17-1-2024
d/MMM/yyyy                  17/Jan/2024
d MMMM yyyy                 17 January 2024
d MMM yyyy                  17 Jan 2024
d MMMM, yyyy                17 January, 2024
ddd, d MMM yyyy             Wed, 17 Jan 2024
M/d/yyyy                    1/17/2024
M/d/yy                      1/17/24
M-d-yyyy                    1-17-2024
MMddyyyy                    01172024
MMMM d, yyyy                January 17, 2024
MMM d, ''yy                 Jan 17, '24
MM-yyyy                     01-2024
MMMM, yyyy                  January, 2024

Date and time formats

Format                      Example value
yyyy-M-d HH:mm              2024-1-17 15:45
d-M-yyyy HH:mm              17-1-2024 15:45
MM-dd-yy HH:mm              01-17-24 15:45
d/M/yy HH:mm:ss             17/1/24 15:45:30
d/M/yyyy HH:mm:ss           17/1/2024 15:45:30
yyyy/M/d HH:mm:ss           2024/1/17 15:45:30
yyyy-M-dTHH:mm:ss           2024-1-17T15:45:30
yyyy/M/dTHH:mm:ss           2024/1/17T15:45:30
yyyy-M-d HH:mm:ss'Z'        2024-1-17 15:45:30Z
yyyy-M-d'T'HH:mm:ss'Z'      2024-1-17T15:45:30Z
yyyy-M-d HH:mm:ss.fffffff   2024-1-17 15:45:30.1234567
yyyy-M-dd HH:mm:ss.FFFFFF   2024-1-17 15:45:30.123456
yyyy-M-dTHH:mm:ss.fff       2024-1-17T15:45:30.123

Time only formats

Format                      Example value
HH:mm                       15:45
HH:mm:ss                    15:45:30
HHmmss                      154530
hh:mm:ss tt                 03:45:30 PM
HH:mm:ss'Z'                 15:45:30Z

Configuring age synthesis options

By default, when you select the Synthesis option for Age values, Textual shifts the age value to a value that is within seven years before or after the original value. For age values that it cannot synthesize, it scrambles the value.

In the entity types list, to display the age synthesis options, click Options.

To configure the synthesis:

In the Range of Years +/- for the Shifted Age field, enter the number of years before and after the original value to use as the range for the synthesized value.
By default, Textual scrambles age values that it cannot parse. To instead pass through the value unchanged, uncheck Scramble Unrecognized Ages.

Configuring telephone number synthesis options

For Phone Number values, you can choose whether to generate a realistic phone number. If you do, then the generated values can be consistent with values generated in Structural.

In the entity types list, to display the phone number synthesis options, click Options.

Selecting the generator type

From the Phone number generator type dropdown list:

To replace each phone number with a randomly generated number, select Random Number.
To generate a realistic telephone number, select US Phone Number. The US Phone Number option generates values similar to those generated by the corresponding generator in Tonic Structural.

If you also configured a Textual statistics seed that matches a Structural statistics seed, then the synthesized values are consistent with values generated in Structural. A given source telephone number produces the same output telephone number in both applications.

For example, in both Textual and Structural, 123-456-6789 might be replaced with 154-567-8901.

Determining how to replace invalid telephone numbers

The Replace invalid numbers with valid numbers checkbox determines how Textual handles invalid telephone numbers in the data.

To replace invalid telephone numbers with valid telephone numbers, check the checkbox.

If you do not check the checkbox, then Textual randomly replaces the numeric characters.
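Randomly replacing the numeric characters can be pictured as a per-character substitution that leaves punctuation and spacing in place. This is a minimal sketch of that idea, with a hypothetical function name, scramble_digits. It does not reproduce Textual's actual random source.

```python
import random


def scramble_digits(value: str, rng: random.Random) -> str:
    """Replace each digit with a random digit; keep every other character."""
    return "".join(str(rng.randint(0, 9)) if ch.isdigit() else ch for ch in value)


scrambled = scramble_digits("555-555-5555", random.Random(42))
```

The output has the same length and the same separator positions as the input, but it is not guaranteed to be a valid telephone number.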

Selecting and configuring the generator for custom entity types

By default, when you select the Synthesis option for a custom entity type, Textual scrambles the original value.

From the generator dropdown list, select the generator to use to create the replacement value.

The available generators are:

Generator

Description

Scramble

This is the default generator.

Scrambles the original value.

CC Exp

Generates a credit card expiration date.

Company Name

Generates a name of a business.

Credit Card

Generates a credit card number.

CVV

Generates a credit card security code.

Date Time

Generates a datetime value.

The Date Time generator has the same configuration options as the built-in datetime entity types.

Email

Generates an email address.

HIPAA Address Generator

Generates a mailing address.

The generator has the same configuration options as the built-in location entity types.

IP Address

Generates an IP address.

MICR Code

Generates an MICR code.

Money

Generates a currency amount.

Name

Generates a person's name.

You configure:

Whether to generate the same replacement value from source values that have different capitalization.
Whether the replacement value reflects the gender of the original value.

Numeric Value

Generates a numeric value.

You configure whether to use the Integer Primary Key generator to generate the value.

Person Age

Generates an age value.

The Person Age generator has the same configuration options as the built-in Age entity type.

Phone Number

Generates a telephone number.

The Phone Number generator has the same configuration options as the built-in Phone Number entity type.

SSN

Generates a United States Social Security Number.

URL

Generates a URL.

Structure of the pipeline output file JSON

When Textual processes pipeline files, it produces JSON output that provides access to the Markdown content and that identifies the detected entities in the file.

Common elements in the JSON output

Information about the entire file

All JSON output files contain the following elements that contain information for the entire file:

{
  "fileType": "<file type>",
  "content": {
    "text": "<Markdown file content>",
    "hash": "<hashed file content>",
    "entities": [   //Entry for each entity in the file
      {
        "start": <start location>,
        "end": <end location>,
        "label": "<value type>",
        "text": "<value text>",
        "score": <confidence score>,
        "language": "<language code>"
      }
    ]
  },
  "schemaVersion": <integer schema version>
}

fileType

The type of the original file.

content

Details about the file content. It includes:

Hashed and Markdown content for the file
Entities in the file

schemaVersion

An integer that identifies the version of the JSON schema that was used for the JSON output.

Textual uses this to convert content from older schemas to the most recent schema.

For specific file types, the JSON output includes additional objects and properties to reflect the file structure.

Hashed and Markdown content

The JSON output contains hashed and Markdown content for the entire file and for individual file components.

hash

The hashed version of the file or component content.

text

The file or component content in Markdown notation.

Entities

The JSON output contains entities arrays for the entire file and for individual file components.

Each entity in the entities array has the following properties:

start

Within the file or component, the location where the entity value starts.

For example, in the following text:

My name is John.

John is an entity that starts at 11.

end

Within the file or component, the location where the entity value ends.

For example, in the following text:

My name is John.

John is an entity that ends at 14.

label

The type of entity.

For a list of the entity types that Textual detects, go to .

text

The text of the entity.

score

The confidence score for the entity.

Indicates how confident Textual is that the value is an entity of the specified type.

language

The language code to identify the language for the entity value. For example, en indicates that the value is in English.
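As a sketch of how the common elements can be consumed, the following example loads a hand-written sample that follows the structure above and groups the entity values by label. The sample values (including the schema version) are made up for illustration, and the function name entities_by_label is hypothetical.

```python
import json
from collections import defaultdict
from typing import Dict, List

# Hand-written sample that follows the structure described above.
sample_output = """
{
  "fileType": "<file type>",
  "content": {
    "text": "My name is John.",
    "hash": "<hashed file content>",
    "entities": [
      {"start": 11, "end": 14, "label": "NAME_GIVEN",
       "text": "John", "score": 0.9, "language": "en"}
    ]
  },
  "schemaVersion": 1
}
"""


def entities_by_label(document: Dict, min_score: float = 0.0) -> Dict[str, List[str]]:
    """Group file-level entity values by label, skipping low-confidence entries."""
    grouped = defaultdict(list)
    for entity in document["content"]["entities"]:
        if entity["score"] >= min_score:
            grouped[entity["label"]].append(entity["text"])
    return dict(grouped)


document = json.loads(sample_output)
```

The min_score argument is a convenience for filtering on the confidence score. It is not a Textual setting.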

Plain text files

For plain text files, the JSON output only contains the information for the entire file.

{
  "fileType": "<file type>",
  "content": {
    "text": "<Markdown content>",
    "hash": "<hashed content>",
    "entities": [   //Entry for each entity in the file
      {
        "start": <start location>,
        "end": <end location>,
        "label": "<value type>",
        "text": "<value text>",
        "score": <confidence score>,
        "language": "<language code>"
      }
    ]
  },
  "schemaVersion": <integer schema version>
}

.csv files

For .csv files, the structure contains a tables array.

The tables array contains a table object that contains header and data arrays.

For each row in the file, the data array contains a row array.

For each value in a row, the row array contains a value object.

The value object contains the entities, hashed content, and Markdown content for the value.

{
  "tables": [
    {
      "tableName": "csv_table",
      "header": [ //Columns that contain heading info (col_0, col_1, and so on)
        "<column identifier>"
      ],
      "data": [  //Entry for each row in the file
        [   //Entry for each value in the row
          {    
            "entities": [   //Entry for each entity in the value
              {
                "start": <start location>,
                "end": <end location>,
                "label": "<value type>",
                "text": "<value text>",
                "score": <confidence score>,
                "language": "<language code>"
              }
            ],
            "hash": "<hashed value content>",
            "text": "<Markdown value content>"
          }
        ]
      ]
    }
  ],
  "fileType": "<file type>",
  "content": {
    "text": "<Markdown file content>",
    "hash": "<hashed file content>",
    "entities": [   //Entry for each entity in the file
      {
        "start": <start location>,
        "end": <end location>,
        "label": "<value type>",
        "text": "<value text>",
        "score": <confidence score>,
        "language": "<language code>"
      }
    ]
  },
  "schemaVersion": <integer schema version>
}
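The nesting of tables, rows, and values can be walked with a few loops. The sketch below is illustrative. The function name iter_cell_entities is hypothetical, and it assumes that the entries in the header array line up positionally with the values in each row, which the structure above suggests but does not state.

```python
from typing import Dict, Iterator, Tuple


def iter_cell_entities(document: Dict) -> Iterator[Tuple[str, int, str, Dict]]:
    """Yield (table name, row index, column identifier, entity) for each entity."""
    for table in document.get("tables", []):
        header = table["header"]
        for row_index, row in enumerate(table["data"]):
            # Assumption: header[i] identifies the column of row[i].
            for column, cell in zip(header, row):
                for entity in cell["entities"]:
                    yield table["tableName"], row_index, column, entity


# Hand-written sample with one row and two values.
sample = {
    "tables": [{
        "tableName": "csv_table",
        "header": ["col_0", "col_1"],
        "data": [[
            {"entities": [], "hash": "<hash>", "text": "id"},
            {"entities": [{"start": 0, "end": 4, "label": "NAME_GIVEN",
                           "text": "John", "score": 0.9, "language": "en"}],
             "hash": "<hash>", "text": "John"},
        ]],
    }]
}
found = [(t, r, c, e["label"]) for t, r, c, e in iter_cell_entities(sample)]
```

For the sample, the only entity is a NAME_GIVEN value in row 0 of column col_1.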

.xlsx files

For .xlsx files, the structure contains a tables array that provides details for each worksheet in the file.

For each worksheet, the tables array contains a worksheet object.

Each worksheet object contains a header array and a data array. For each row in the worksheet, the data array contains a row array.

For each cell in a row, the row array contains a cell object.

Each cell object contains the entities, hashed content, and Markdown content for the cell.

{
  "tables": [   //Entry for each worksheet
    {
      "tableName": "<Name of the worksheet>",
      "header": [ //Columns that contain heading info (col_0, col_1, and so on)
        "<column identifier>"
      ],
      "data": [   //Entry for each row
        [   //Entry for each cell in the row
          {
            "entities": [   //Entry for each entity in the cell
              {
                "start": <start location>,
                "end": <end location>,
                "label": "<value type>",
                "text": "<value text>",
                "score": <confidence score>,
                "language": "<language code>"
              }
            ],
            "hash": "<hashed cell content>",
            "text": "<Markdown cell content>"
          }
        ]
      ]
    }
  ],
  "fileType": "<file type>",
  "content": {
    "text": "<Markdown file content>",
    "hash": "<hashed file content>",
    "entities": [   //Entry for each entity in the file
      {
        "start": <start location>,
        "end": <end location>,
        "label": "<value type>",
        "text": "<value text>",
        "score": <confidence score>,
        "language": "<language code>"
      }
    ]
  },
  "schemaVersion": <integer schema version>
}

.docx files

For .docx files, the JSON output structure adds:

A footNotes array for content in footnotes.
An endNotes array for content in endnotes.
A header object for content in the page headers. Includes separate objects for the first page header, even page header, and odd page header.
A footer object for content in the page footers. Includes separate objects for the first page footer, even page footer, and odd page footer.

These arrays and objects contain the entities, hashed content, and Markdown content for the notes, headers, and footers.

{
  "footNotes": [   //Entry for each footnote
    {
      "entities": [   //Entry for each entity in the footnote
        {
          "start": <start location>,
          "end": <end location>,
          "pythonStart": <start location in Python>,
          "pythonEnd": <end location in Python>,
          "label": "<value type>",
          "text": "<value text>",
          "score": <confidence score>,
          "language": "<language code>",
          "exampleRedaction": null
        }
      ],
      "hash": "<hashed footnote content>",
      "text": "<Markdown footnote content>"
    }
  ],
  "endNotes": [   //Entry for each endnote
    {
      "entities": [   //Entry for each entity in the endnote
        {
          "start": <start location>,
          "end": <end location>,
          "label": "<value type>",
          "text": "<value text>",
          "score": <confidence score>,
          "language": "<language code>"
        }
      ],
      "hash": "<hashed endnote content>",
      "text": "<Markdown endnote content>"
    }
  ],
  "header": {
    "first": {
      "entities": [   //Entry for each entity in the first page header
        {
          "start": <start location>,
          "end": <end location>,
          "label": "<value type>",
          "text": "<value text>",
          "score": <confidence score>,
          "language": "<language code>"
        }
      ],
      "hash": "<hashed first page header content>",
      "text": "<Markdown first page header content>"
    },
    "even": {
      "entities": [   //Entry for each entity in the even page header
        {
          "start": <start location>,
          "end": <end location>,
          "label": "<value type>",
          "text": "<value text>",
          "score": <confidence score>,
          "language": "<language code>"
        }
      ],
      "hash": "<hashed even page header content>",
      "text": "<Markdown even page header content>"
    },
    "odd": {
      "entities": [   //Entry for each entity in the odd page header
        {
          "start": <start location>,
          "end": <end location>,
          "label": "<value type>",
          "text": "<value text>",
          "score": <confidence score>,
          "language": "<language code>"
        }
      ],
      "hash": "<hashed odd page header content>",
      "text": "<Markdown odd page header content>"
    }
  },
  "footer": {
    "first": {
      "entities": [   //Entry for each entity in the first page footer
        {
          "start": <start location>,
          "end": <end location>,
          "label": "<value type>",
          "text": "<value text>",
          "score": <confidence score>,
          "language": "<language code>"
        }
      ],
      "hash": "<hashed first page footer content>",
      "text": "<Markdown first page footer content>"
    },
    "even": {
      "entities": [   //Entry for each entity in the even page footer
        {
          "start": <start location>,
          "end": <end location>,
          "label": "<value type>",
          "text": "<value text>",
          "score": <confidence score>,
          "language": "<language code>"
        }
      ],
      "hash": "<hashed even page footer content>",
      "text": "<Markdown even page footer content>"
    },
    "odd": {
      "entities": [   //Entry for each entity in the odd page footer
        {
          "start": <start location>,
          "end": <end location>,
          "label": "<value type>",
          "text": "<value text>",
          "score": <confidence score>,
          "language": "<language code>"
        }
      ],
      "hash": "<hashed odd page footer content>",
      "text": "<Markdown odd page footer content>"
    }
  },
  "fileType": "<file type>",
  "content": {
    "text": "<Markdown file content>",
    "hash": "<hashed file content>",
    "entities": [   //Entry for each entity in the file
      {
        "start": <start location>,
        "end": <end location>,
        "label": "<value type>",
        "text": "<value text>",
        "score": <confidence score>,
        "language": "<language code>"
      }
    ]
  },
  "schemaVersion": <integer schema version>
}
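The .docx outline above places entities in several locations outside of the main `content` object. The following Python is a rough sketch of one way to gather them into a single list. The function name and the sample data are hypothetical. The keys (`footNotes`, `endNotes`, `header`, `footer`, `first`, `even`, `odd`) are taken from the outline.

```python
def docx_side_entities(output):
    """Return (location, label, text) for entities in notes, headers, and footers."""
    found = []
    # Footnotes and endnotes are arrays with one entry per note.
    for key in ("footNotes", "endNotes"):
        for index, note in enumerate(output.get(key) or []):
            for entity in note.get("entities", []):
                found.append((f"{key}[{index}]", entity["label"], entity["text"]))
    # Headers and footers are objects with first, even, and odd page variants.
    for key in ("header", "footer"):
        section = output.get(key) or {}
        for variant in ("first", "even", "odd"):
            part = section.get(variant) or {}
            for entity in part.get("entities", []):
                found.append((f"{key}.{variant}", entity["label"], entity["text"]))
    return found


# Invented sample data shaped like the outline above.
sample_docx = {
    "footNotes": [
        {
            "entities": [{"start": 4, "end": 9, "label": "NAME_GIVEN", "text": "Maria"}],
            "hash": "h1",
            "text": "See Maria.",
        }
    ],
    "endNotes": [],
    "header": {
        "first": {
            "entities": [{"start": 0, "end": 4, "label": "ORGANIZATION", "text": "Acme"}],
            "hash": "h2",
            "text": "Acme report",
        },
        "even": {"entities": [], "hash": "h3", "text": ""},
        "odd": {"entities": [], "hash": "h4", "text": ""},
    },
    "footer": {},
}

for row in docx_side_entities(sample_docx):
    print(row)
```

The sketch tolerates missing or empty sections, because it is an assumption that every file includes all of these keys.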

PDF and image files

PDF and image files use the same structure. Textual extracts and scans the text from the files.

For PDF and image files, the JSON output structure adds the following content.

`pages` array

The pages array contains all of the content on the pages. This includes content in tables and key-value pairs, which are also listed separately in the output.

For each page in the file, the pages array contains a page array.

For each component on the page - such as paragraphs, headings, headers, and footers - the page array contains a component object.

Each component object contains the component entities, hashed content, and Markdown content.

`tables` array

The tables array contains content that is in tables.

For each table in the file, the tables array contains a table array.

For each row in a table, the table array contains a row array.

For each cell in a row, the row array contains a cell object.

Each cell object identifies the type of cell (header or content). It also contains the entities, hashed content, and Markdown content for the cell.

`keyValuePairs` array

The keyValuePairs array contains key-value pair content. For example, for a PDF of a form with fields, a key-value pair might represent a field label and a field value.

For each key-value pair, the keyValuePairs array contains a key-value pair object.

The key-value pair object contains:

An automatically incremented identifier. For example, the id for the first key-value pair is 1, the id for the second key-value pair is 2, and so on.
The start and end position of the key-value pair.
The text of the key.
The entities, hashed content, and Markdown content for the value.
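To illustrate the key-value pair object, here is a small Python sketch that turns a `keyValuePairs` array into a dictionary of form fields. It uses the field names from the JSON outline for PDF and image files (`id`, `key`, `value`, `text`, `entities`). The function name and the sample values are made up for this example.

```python
def form_fields(output):
    """Map each key text to its value text and the labels detected in the value."""
    fields = {}
    pairs = sorted(output.get("keyValuePairs") or [], key=lambda pair: pair["id"])
    for pair in pairs:
        value = pair.get("value") or {}
        # If two pairs share the same key text, the later pair overwrites the earlier one.
        fields[pair["key"]] = {
            "text": value.get("text", ""),
            "labels": [entity["label"] for entity in value.get("entities", [])],
        }
    return fields


# Invented sample data for a two-field form.
sample_form = {
    "keyValuePairs": [
        {
            "id": 2,
            "key": "Employer",
            "value": {
                "entities": [{"start": 0, "end": 4, "label": "ORGANIZATION", "text": "Acme"}],
                "hash": "h2",
                "text": "Acme",
            },
            "start": 40,
            "end": 60,
        },
        {
            "id": 1,
            "key": "First name",
            "value": {
                "entities": [{"start": 0, "end": 5, "label": "NAME_GIVEN", "text": "Maria"}],
                "hash": "h1",
                "text": "Maria",
            },
            "start": 0,
            "end": 20,
        },
    ]
}

print(form_fields(sample_form))
```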

PDF and image JSON outline

{
  "pages": [   //Entry for each page in the file
    [   //Entry for each component on the page
      {
        "type": "<page component type>",
        "content": {
          "entities": [   //Entry for each entity in the component
            {
              "start": <start location>,
              "end": <end location>,
              "label": "<value type>",
              "text": "<value text>",
              "score": <confidence score>,
              "language": "<language code>"
            }
          ],
          "hash": "<hashed component content>",
          "text": "<Markdown component content>"
        }
      }
    ]
  ],
  "tables": [   //Entry for each table in the file
    [   //Entry for each row in the table
      [   //Entry for each cell in the row
        {
          "type": "<content type>",   //ColumnHeader or Content
          "content": {
            "entities": [  //Entry for each entity in the cell
              {
                "start": <start location>,
                "end": <end location>,
                "label": "<value type>",
                "text": "<value text>",
                "score": <confidence score>,
                "language": "<language code>"
              }
            ],
            "hash": "<hashed cell text>",
            "text": "<Markdown cell text>"
          }
        }
      ]
    ]
  ],
  "keyValuePairs": [   //Entry for each key-value pair in the file
    {
      "id": <incremented identifier>,
      "key": "<key text>",
      "value": {
        "entities": [  //Entry for each entity in the value
          {
            "start": <start location>,
            "end": <end location>,
            "label": "<value type>",
            "text": "<value text>",
            "score": <confidence score>,
            "language": "<language code>"
          }
        ],
        "hash": "<hashed value text>",
        "text": "<Markdown value text>"
      },
      "start": <start location of the key-value pair>,
      "end": <end location of the key-value pair>
    }
  ],
  "fileType": "<file type>",
  "content": {
    "text": "<Markdown file content>",
    "hash": "<hashed file content>",
    "entities": [   //Entry for each entity in the file
      {
        "start": <start location>,
        "end": <end location>,
        "label": "<value type>",
        "text": "<value text>",
        "score": <confidence score>,
        "language": "<language code>"
      }
    ]
  },
  "schemaVersion": <integer schema version>
}
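The nesting of the `pages` and `tables` arrays in the outline above can be easier to follow in code. The following Python sketch, written under the assumption that the output has been loaded into a dictionary, joins the Markdown text of each page and rebuilds each table as a grid of cell text. The function names, the page component types, and the sample values are hypothetical. `ColumnHeader` and `Content` are the cell types that the outline names.

```python
def page_texts(output):
    """Return one string per page, joining the Markdown text of its components."""
    return [
        "\n\n".join(component["content"]["text"] for component in page)
        for page in output.get("pages") or []
    ]


def table_grids(output):
    """Return each table as a list of rows, where each row is a list of cell text."""
    return [
        [[cell["content"]["text"] for cell in row] for row in table]
        for table in output.get("tables") or []
    ]


def cell(cell_type, text):
    """Build a cell object shaped like the outline (invented helper for the sample)."""
    return {"type": cell_type, "content": {"entities": [], "hash": "h", "text": text}}


# Invented sample data: one page with two components, and one 2x2 table.
sample_pdf = {
    "pages": [
        [
            {"type": "Heading", "content": {"entities": [], "hash": "h1", "text": "# Invoice"}},
            {"type": "Paragraph", "content": {"entities": [], "hash": "h2", "text": "Billed to Maria."}},
        ]
    ],
    "tables": [
        [
            [cell("ColumnHeader", "Item"), cell("ColumnHeader", "Price")],
            [cell("Content", "Widget"), cell("Content", "10")],
        ]
    ],
}

print(page_texts(sample_pdf))
print(table_grids(sample_pdf))
```

Because table content also appears in the `pages` array, processing both arrays for the same purpose would count that content twice.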

.eml and .msg files

For email message files, the JSON output structure adds the following content.

Email message identifiers

The JSON output includes the following email message identifiers:

The identifier of the current message
If the message was a reply to another message, the identifier of that message
An array of related email messages. This includes the email message that the message replied to, as well as any other messages in an email message thread.

Recipients

The JSON output includes the email address and display name of the message recipients. It contains separate lists for the following:

Recipients in the To line
Recipients in the CC line
Recipients in the BCC line

Subject line

The subject object contains the message subject line. It includes:

Markdown and hashed versions of the message subject line.
The entities that were detected in the subject line.

Message timestamp

sentDate provides the timestamp when the message was sent.

Message body

The plainTextBodyContent object contains the body of the email message.

It contains:

Markdown and hashed versions of the message body.
The entities that were detected in the message body.

Message attachments

The attachments array provides information about any attachments to the email message. For each attached file, it includes:

The identifier of the message that the file is attached to.
The identifier of the attachment.
The name of the attached file.
The JSON output for the file.
The count of words in the original file.
The count of words in the redacted version of the file.

Email message JSON outline

{
  "messageId": "<email message identifier>",
  "inReplyToMessageId": <message that this message replied to>,
  "messageIdReferences": [<related email messages>],
  "senderAddress": {
    "address": "<sender email address>",
    "displayName": "<sender display name>"
  },
  "toAddresses": [  //Entry for each recipient in the To list
    {
      "address": "<recipient email address>",
      "displayName": "<recipient display name>"
    }
  ],
  "ccAddresses": [ //Entry for each recipient in the CC list
    {
      "address": "<recipient email address>",
      "displayName": "<recipient display name>"
    }
  ],
  "bccAddresses": [ //Entry for each recipient in the BCC list
    {
      "address": "<recipient email address>",
      "displayName": "<recipient display name>"
    }
  ],
  "sentDate": "<timestamp when the message was sent>",
  "subject": {
    "text": "<Markdown version of the subject line>",
    "hash": "<hashed version of the subject line>",
    "entities": [   //Entry for each entity in the subject line
      {
        "start": <start location>,
        "end": <end location>,
        "label": "<value type>",
        "text": "<value text>",
        "score": <confidence score>,
        "language": "<language code>"
      }
    ]
  },
  "plainTextBodyContent": {
    "text": "<Markdown version of the message body>",
    "hash": "<hashed version of the message body>",
    "entities": [ //Entry for each entity in the message body
      {
        "start": <start location>,
        "end": <end location>,
        "label": "<value type>",
        "text": "<value text>",
        "score": <confidence score>,
        "language": "<language code>"
      }
    ]
  },
  "attachments": [ //Entry for each attached file
    {
      "parentMessageId": "<the message that the file is attached to>",
      "contentId": "<identifier of the attachment>",
      "fileName": "<name of the attachment file>",
      "document": {<pipeline JSON for the attached file>},
      "wordCount": <number of words in the attachment>,
      "redactedWordCount": <number of words in the redacted attachment>
    }
  ],
  "fileType": "<file type>",
  "content": {
    "text": "<Markdown file content>",
    "hash": "<hashed file content>",
    "entities": [ //Entry for each entity in the file
      {
        "start": <start location>,
        "end": <end location>,
        "label": "<value type>",
        "text": "<value text>",
        "score": <confidence score>,
        "language": "<language code>"
      }
    ]
  },
  "schemaVersion": <integer schema version>
}
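As an illustrative sketch of reading the email message outline above, the following Python lists the entities that were detected in the subject line, the message body, and the file-level content of each attachment. The function name and the sample message are invented. The sketch assumes that each attachment's `document` follows the general pipeline JSON structure with a `content` object.

```python
def email_entities(message):
    """Return (location, label, text) for subject, body, and attachment entities."""
    rows = []
    for part in ("subject", "plainTextBodyContent"):
        for entity in (message.get(part) or {}).get("entities", []):
            rows.append((part, entity["label"], entity["text"]))
    for attachment in message.get("attachments") or []:
        # Assumption: the attachment document has the same "content" object as other files.
        document = attachment.get("document") or {}
        location = f"attachment:{attachment.get('fileName')}"
        for entity in (document.get("content") or {}).get("entities", []):
            rows.append((location, entity["label"], entity["text"]))
    return rows


# Invented sample message shaped like the outline above.
sample_message = {
    "messageId": "<msg-1@example.com>",
    "subject": {
        "text": "Call Maria",
        "hash": "h1",
        "entities": [{"start": 5, "end": 10, "label": "NAME_GIVEN", "text": "Maria"}],
    },
    "plainTextBodyContent": {"text": "See attached.", "hash": "h2", "entities": []},
    "attachments": [
        {
            "parentMessageId": "<msg-1@example.com>",
            "contentId": "att-1",
            "fileName": "notes.txt",
            "document": {
                "content": {
                    "text": "Acme notes",
                    "hash": "h3",
                    "entities": [{"start": 0, "end": 4, "label": "ORGANIZATION", "text": "Acme"}],
                }
            },
            "wordCount": 2,
            "redactedWordCount": 2,
        }
    ],
}

for row in email_entities(sample_message):
    print(row)
```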
