Tonic Textual comes with a built-in set of entity types that it detects. You can also configure custom entity types, which detect values based on regular expressions.
You can also view this video overview of entity types and entity type handling.
When you sign up for a Tonic Textual account, you can immediately get started with a new pipeline.
Note that these instructions are for setting up a new account on Textual Cloud. For a self-hosted instance, depending on how it is set up, you might either create an account manually or use single sign-on (SSO).
To get started with a new Textual account:
Go to .
Click Sign up.
Enter your email address.
Create and confirm a password for your Textual account.
Click Sign Up.
Textual creates your account. After you log in, Textual prompts you to provide some additional information about yourself and how you plan to use Textual.
After you fill out the information and click Get Started, Textual displays the Textual Home page, which you can use to preview how Textual detects and replaces values. For more information, go to .
When you set up an account on Textual Cloud, you start a Textual free trial.
When you start a free trial, Textual provides a checklist to guide you through initial steps to get started and learn more about Textual and what it can do.
The checklist displays automatically when you first log in. You can close and display it as needed. To display the checklist, in the Textual navigation menu, click Getting Started.
As you complete a step, Textual automatically marks it as completed.
The checklist includes:
When you click the step, you navigate to the Home page. Textual displays a popup panel that describes the task.
The checklist displays the installation command and an option to copy it. The step is marked as complete when you click the copy icon.
When you click the step, you are prompted to create an API key.
Creating an SDK request to redact a or a . When you click the step, you navigate to the Request Explorer. The step is marked as completed when you close the popup panel that describes the task.
During the free trial, Textual scans up to 100,000 words for free. Note that Textual counts actual words, not tokens. For example, "Hello, my name is John Smith." counts as six words.
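To make the word-versus-token distinction concrete, the following sketch counts whitespace-separated words. The exact rule that Textual uses to count words is not stated here, so splitting on whitespace is an assumption. It does reproduce the six-word count for the example sentence.

```python
def count_words(text: str) -> int:
    """Count whitespace-separated words.

    Assumption: a "word" is any run of non-whitespace characters.
    Textual's actual counting rule might differ.
    """
    return len(text.split())


sample = "Hello, my name is John Smith."
print(count_words(sample))  # 6
```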
After the 100,000 words, Textual disables scanning for your account. Until you purchase a pay-as-you-go subscription, you cannot:
Add files to a dataset or pipeline
Run a pipeline
During your free trial, Textual displays the current usage in the following locations:
On the Home page
In the navigation menu
Textual also prompts you to purchase a , which allows an unlimited number of words scanned for a flat rate per 1,000 words.
You can also request a Textual product demo.
Built-in entity types
Entity types that Textual detects automatically.
Configure custom entity types
Define custom entity types to detect additional values.
Textual pipelines can process the following types of files:
txt
csv
tsv
docx
xlsx
png
tif or tiff
jpg or jpeg
eml
msg
Tonic Textual provides a single tool to allow you to put your text-based data to work for you.
You can use Textual datasets to redact sensitive values, to produce files in the same format to use for development and training. Each original file becomes an output file in the same format, with the sensitive values replaced. You can also use the Textual SDK or the Textual REST API to manage datasets or to remove sensitive values from individual text strings.
The Textual pipeline option allows you to prepare unstructured text for use in an LLM system. Textual extracts the text from each file and then produces Markdown-formatted output. You can optionally replace sensitive values in the output, to prevent data leakage from your LLM. You can also use the Textual SDK to manage pipelines or parse individual files.
Need help with Textual? Contact [email protected].
When you choose to also generate synthesized versions of the pipeline files, the pipeline details page includes a Generator Config tab. From the Generator Config tab, you configure how to transform the detected entities in each file.
The Generator Config tab lists all of the available entity types.
For each entity type, you select and configure the handling option. For more information, see and .
You can also manage custom entity types. From the list, you can enable and disable custom types, and edit the configuration. For more information, go to .
After you change the configuration, click Save Changes. The updated configuration is applied the next time you run the pipeline, and only to new files.
From the pipeline details page, to create a custom entity type, click Create Custom Entity Type.
For information on how to configure a custom entity type, go to .
Getting started with Textual
Sign up for a Textual account. Create your first pipeline.
Datasets workflow - File redaction and synthesis
Use Textual to replace sensitive values in files.
Pipelines workflow - LLM preparation
Use Textual to prepare file content for use in an LLM system.
SDK - Datasets and redaction
Use the Textual Python SDK to redact text and manage datasets. Review redaction requests in the Request Explorer.
SDK - Pipelines and parsing
Use the Textual Python SDK to parse text and manage pipelines.
REST API - Redact text
Use the Textual REST API to redact individual text strings.
REST API - Manage datasets
Use the Textual REST API to retrieve, create, and edit datasets.
Snowflake Native App
Use the Snowflake Native App to redact values in your data warehouse.
The Tonic Textual SDK is a Python SDK that you can use to parse and redact text and files.
It requires Python 3.9 or higher.
To install the Tonic Textual Python SDK, run:
pip install tonic-textual
You install a self-hosted instance of Tonic Textual on either:
A VM or server that runs Linux and on which you have superuser access.
A local machine that runs Mac, Windows, or Linux.
At minimum, we recommend that the server or cluster that you deploy Textual to has access to the following resources:
Nvidia GPU, 16GB GPU RAM. We recommend at least 6GB GPU RAM for each textual-ml worker.
If you only use a CPU and not a GPU, then we recommend an M5.2xLarge. However, without GPU, performance is significantly slower.
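As a rough sizing aid, the sketch below turns the recommendation of at least 6GB of GPU RAM for each worker into an upper bound on the worker count. The function name is hypothetical, and the calculation ignores any other GPU memory overhead, so treat the result as a starting point rather than a guarantee.

```python
def max_ml_workers(gpu_ram_gb: float, ram_per_worker_gb: float = 6.0) -> int:
    """Return how many workers fit when each gets at least ram_per_worker_gb.

    Assumption: GPU RAM is the only constraint. Real deployments might
    need headroom for other processes that share the GPU.
    """
    if gpu_ram_gb < 0 or ram_per_worker_gb <= 0:
        raise ValueError("RAM values must be positive")
    return int(gpu_ram_gb // ram_per_worker_gb)


print(max_ml_workers(16))  # 2
print(max_ml_workers(24))  # 4
```

With the recommended minimum of 16GB of GPU RAM, this bound works out to two workers.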
The number of words per second that Textual processes depends on many factors, including:
The hardware that runs the textual-ml container
The number of workers that are assigned to the textual-ml container
The auxiliary model, if any, that is used in the textual-ml container
To optimize both the throughput and the cost of Textual, we recommend that the textual-ml container runs on modern hardware with GPU compute. If you use AWS, we recommend a g5 instance with 1 GPU.
To use GPU resources:
Ensure that the correct Nvidia drivers are installed for your instance.
If you use Kubernetes to deploy Textual, follow the instructions in the NVIDIA GPU operator documentation.
If you use Minikube, then use the instructions in Using NVIDIA GPUs with Minikube.
If you use Docker Compose to deploy Textual, follow these steps to install the nvidia-container-runtime.
Before you can use the API, you must create a Tonic Textual API key. For information on how to obtain a Textual API key, go to Creating and revoking Textual API keys.
When you call the API, you place your API key in the authorization header of the request, similar to the following curl request, which fetches the list of datasets for the current user.
curl --request GET \
--url "https://textual.tonic.ai/api/dataset" \
--header "Content-Type: application/json" \
--header "Authorization: API_KEY"
Most Textual API requests require authentication. For each request, the reference information indicates whether the request requires an API key.
For requests that require an API key, if you do not provide a valid API key, you receive a 401 Unauthorized response.
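The same request can be sketched in Python with only the standard library. The URL and headers mirror the curl example. The function names are hypothetical, and the assumption that the response body is JSON is not confirmed by the text above.

```python
import json
import urllib.error
import urllib.request

BASE_URL = "https://textual.tonic.ai"  # Textual Cloud URL from the curl example


def build_dataset_request(api_key: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) the GET request for the dataset list."""
    return urllib.request.Request(
        url=f"{base_url}/api/dataset",
        headers={
            "Content-Type": "application/json",
            # As in the curl example, the API key is the whole header value.
            "Authorization": api_key,
        },
        method="GET",
    )


def list_datasets(api_key: str, base_url: str = BASE_URL):
    """Send the request and return the parsed body (assumed to be JSON)."""
    request = build_dataset_request(api_key, base_url)
    try:
        with urllib.request.urlopen(request) as response:
            return json.load(response)
    except urllib.error.HTTPError as err:
        if err.code == 401:
            # A missing or invalid API key produces 401 Unauthorized.
            raise PermissionError("Textual rejected the API key (401 Unauthorized)") from err
        raise
```

For a self-hosted instance, pass the URL of that instance as base_url. Keeping the request construction separate from the network call makes the header handling easy to check without contacting the server.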
If your company has a self-hosted Textual instance that is installed on-premises, then you navigate to the Textual URL for that instance.
Your self-hosted instance might be configured to use single sign-on for Textual access. If so, then from the Textual login page, to create your Textual user account, click the single sign-on option.
Otherwise, to create your Textual user account, click Sign Up.
Your administrator can provide the URL for your Textual instance and confirm the instructions for creating your user account.
If your Textual license is on Textual Cloud, then new users that have a matching email domain are automatically added to your Textual Cloud organization.
For a Textual Cloud license other than a pay-as-you-go license, the license agreement specifies the included email domains. When a user with a matching email domain signs up for a Textual account, they are added to that Textual Cloud organization.
For a pay-as-you-go Textual Cloud license, when a user with the same corporate email domain as the subscribed user signs up for a Textual account, they are added to that Textual Cloud organization.
To create your Textual user account, on the Textual Cloud login page, click Sign Up.
If you use SSO to manage Tonic Textual groups, then Textual displays the list of groups for which at least one user has logged in to Textual.
To display the SSO group list:
Click the user image at the top right.
In the user menu, click Permission Settings.
On the Permission Settings page, click Groups.
If no users from a group have logged in to Textual, then the group does not display in the list.
The list only displays the group names and indicates the SSO provider. To manage the group permissions:
To assign global permission sets, go to the Global Permission Sets list. For more information, go to Configuring access to global permission sets.
To assign dataset permission sets, go to the Datasets page. For more information, go to Sharing dataset access.
To assign pipeline permission sets, go to the Pipelines page. For more information, go to Sharing pipeline access.
On a self-hosted instance of Textual, much of the configuration takes the form of environment variables.
After you configure an environment variable, you must restart Textual.
For Docker, add the variable to .env in the format:
SETTING_NAME=value
After you update .env, to restart Textual and complete the update, run:
$ docker-compose down
$ docker-compose pull && docker-compose up -d
For Kubernetes, in values.yaml, add the environment variable to the appropriate env section of the Helm chart.
For example:
env: {
"TEXTUAL_ML_WORKERS": "2"
}
After you update the YAML file, to restart the service and complete the update, run:
$ helm upgrade <name_of_release> -n <namespace_name> <path-to-helm-chart>
The above Helm upgrade command is always safe to use when you provide specific version numbers. However, if you use the latest tag, it might result in Textual containers that have different versions.
In Tonic Textual, each user belongs to an organization. Organizations are used to determine the company or customer that a Textual user belongs to.
A self-hosted instance of Textual contains a single organization. All users belong to that organization.
Textual Cloud hosts multiple organizations. The organizations are kept completely separate. Users from one Textual Cloud organization do not have any access to the users, datasets, or pipelines that belong to a different Textual Cloud organization.
A Textual organization is created:
For a standard Textual license, both self-hosted and Textual Cloud, when the first user signs up for a Textual account.
When a user signs up for a free trial or pay-as-you-go Textual Cloud license with a unique corporate email domain.
When a user signs up for a free trial or pay-as-you-go Textual Cloud license with a public email domain, such as Gmail or Yahoo. Every user with a public email domain is in a separate organization.
A self-hosted instance has a single organization. Every user who signs up for an account on that instance is added to the organization.
For companies with an annual Textual Cloud license, the license includes the email domains that are included in the license.
When a user with one of the included email domains signs up for a Textual account, they are automatically added to that organization.
For a pay-as-you-go license, when a user with the same corporate email domain signs up for a Textual account, they are automatically added to that organization.
Users with public email domains are always in separate organizations.
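The domain-based rules above can be summarized in a short sketch. It only models the free trial and pay-as-you-go behavior on Textual Cloud. The function name, the key format, and the list of public domains are illustrative assumptions, not Textual's actual implementation.

```python
# Illustrative subset only; Textual's real list of public domains is not documented here.
PUBLIC_DOMAINS = {"gmail.com", "yahoo.com"}


def organization_key(email: str) -> str:
    """Return a key such that users with the same key share an organization."""
    local, _, domain = email.strip().lower().rpartition("@")
    if not local or not domain:
        raise ValueError(f"not an email address: {email!r}")
    if domain in PUBLIC_DOMAINS:
        # Every user with a public email domain is in a separate organization.
        return f"user:{local}@{domain}"
    # Users who share a corporate email domain share an organization.
    return f"domain:{domain}"


print(organization_key("ann@example.com") == organization_key("bob@example.com"))  # True
print(organization_key("ann@gmail.com") == organization_key("bob@gmail.com"))      # False
```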
Textual writes the worker, machine learning, and API log messages to stdout.
By default, the log messages are in an unstructured format.
To instead use a JSON format for the logs, set the environment variable SOLAR_EMIT_JSON_LOGS_TO_STDOUT to true.
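When you collect these logs, a consumer might need to handle both formats, for example while a configuration change rolls out. The following sketch parses each line as JSON when possible and otherwise passes the raw line through. The field names that Textual emits are not documented here, so the sketch makes no assumption about them, and the "message" key in the sample is purely hypothetical.

```python
import json
from typing import Iterable, Iterator, Union


def parse_log_lines(lines: Iterable[str]) -> Iterator[Union[dict, str]]:
    """Yield a dict for each JSON object line and the raw text for any other line."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            yield line  # unstructured log line
            continue
        # Only JSON objects count as structured records.
        yield record if isinstance(record, dict) else line


sample = ['{"message": "started"}\n', "plain text line\n"]  # hypothetical log lines
for entry in parse_log_lines(sample):
    print(type(entry).__name__, entry)
```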
In addition to the built-in entity types, you can also create custom entity types.
Custom entity types are based on regular expressions. If a value matches a configured regular expression for the custom entity type, then it is identified as that entity type.
You can control whether each dataset or pipeline uses each custom entity type.
To display the list of entity types, in the Textual navigation bar, click Custom Entity Types.
For each custom entity type, the list includes:
Entity type name and description.
Regular expressions to identify matching values.
The number of datasets and pipelines that the entity type is active for.
To create a custom entity type, on the Custom Entity Types page, click Create Custom Entity Type.
The dataset details and pipeline details pages also contain a Create Custom Entity Type option.
After you configure the entity type:
To save the new type, but not scan dataset and pipeline files for the new type, click Save Without Scanning Files.
To both save the new type and scan for it, click Save and Scan Files.
To detect new custom entity types in a dataset or pipeline, Textual needs to run a scan. If you do not run the scan when you save the custom entity type, then:
On the dataset details page, you are prompted to run a scan.
On the pipeline details page for an uploaded file pipeline, you are prompted to run a scan.
For a cloud storage pipeline, a new scan runs when you run the pipeline.
To edit a custom entity type, on the Custom Entity Types page, click the edit icon for the entity type.
You can also edit a custom entity type from the dataset or pipeline details page.
For an existing entity type, you can change the description, the regular expressions, and the enabled datasets and pipelines.
You cannot change the entity type name, which is used to produce the identifier to use to configure the entity type handling from the SDK.
After you update the configuration:
To save the changes, but not scan dataset and pipeline files based on the updated configuration, click Save Without Scanning Files.
To both save the changes and scan based on the updated configuration, click Save and Scan Files.
To reflect the changes to custom entity types in a dataset or pipeline, Textual needs to run a scan. If you do not run the scan when you save the changes, then:
On the dataset details page, you are prompted to run a scan.
On the pipeline details page for an uploaded file pipeline, you are prompted to run a scan.
For a cloud storage pipeline, a new scan runs when you run the pipeline.
When you delete a custom entity type, it is removed from the datasets and pipelines that it was active for.
To delete a custom entity type:
On the Custom Entity Types page, click the delete icon for the entity type.
On the confirmation panel, click Delete Entity Type.
The custom entity type configuration includes:
Name and description
Regular expressions to identify matching values. From the configuration panel, you can test the expressions against text that you provide.
Datasets and pipelines to make the entity type active for. You can also enable and disable custom entity types from the dataset and pipeline details pages.
In the Name field, provide a name for the entity type. Each custom entity type name:
Must be unique within an organization.
Can only contain alphanumeric characters and spaces. Custom entity type names cannot contain punctuation or other special characters.
After you save the entity type, you cannot change the name. Textual uses the name as the basis for the identifier that you use to refer to the entity type in the SDK.
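The naming rules can be checked before you submit a name. This is a minimal sketch: it assumes that "alphanumeric" means ASCII letters and digits and that uniqueness is case-sensitive, neither of which is confirmed above. The function name is hypothetical.

```python
import re

# Letters, digits, and spaces only. ASCII is an assumption.
_NAME_PATTERN = re.compile(r"[A-Za-z0-9 ]+")


def is_valid_entity_type_name(name: str, existing_names: set[str]) -> bool:
    """Check a proposed custom entity type name against the rules above."""
    if not name.strip():
        return False  # empty or spaces only
    if not _NAME_PATTERN.fullmatch(name):
        return False  # contains punctuation or other special characters
    return name not in existing_names  # must be unique within the organization


print(is_valid_entity_type_name("Order ID 2", set()))         # True
print(is_valid_entity_type_name("Order-ID", set()))           # False
print(is_valid_entity_type_name("Order ID", {"Order ID"}))    # False
```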
In the Description field, provide a longer description of the custom entity type.
Under Keywords, Phrases, or Regexes, provide expressions to identify matching values for the entity type.
An entry can be as simple as a single word or phrase, or you can provide a more complex regular expression to identify the values.
Textual maintains an empty row at the bottom of the list. When you type an expression into the last row, Textual adds a new empty row.
To add an entry, begin to type the value in the empty row.
To edit an entry, click the entry field, then edit the value.
To remove an entry, click its delete icon.
Under Test Entry, you can check whether Textual correctly identifies a value as the entity type based on the provided expression.
To test an expression:
From the dropdown list, select the entry to test.
In the text area, provide the text to test.
As you enter the text, Textual automatically scans the text for matches to the selected expression. The Result field displays the input text and highlights the matching values.
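To try an expression outside of Textual before you add it, you can approximate the Test Entry behavior with Python's re module. This is only a sketch: the regular expression dialect that Textual uses might differ from Python's, and the ORD- pattern and the [[...]] highlight markers are hypothetical.

```python
import re


def highlight_matches(expression: str, text: str) -> str:
    """Return text with each non-empty match of expression wrapped in [[...]]."""
    pattern = re.compile(expression)

    def mark(match: re.Match) -> str:
        value = match.group(0)
        # Leave zero-length matches unmarked.
        return f"[[{value}]]" if value else value

    return pattern.sub(mark, text)


# Hypothetical custom entity type: order numbers such as ORD-123456.
print(highlight_matches(r"ORD-\d{6}", "Orders ORD-123456 and ORD-654321 shipped."))
# Orders [[ORD-123456]] and [[ORD-654321]] shipped.
```

A plain word or phrase also works as an expression, because it matches itself literally as long as it contains no regular expression metacharacters.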
Under Activate custom entity, you identify the datasets and pipelines to make the entity active for. From the pipeline details or dataset details, you can also enable and disable custom entity types for that pipeline or dataset.
To make the entity active for all current and future datasets and pipelines, check Automatically activate for all current, and new pipelines and datasets.
To make the entity active for specific pipelines and datasets, set the toggle for the dataset or pipeline to the on position.
To filter the list based on the pipeline or dataset name, in the filter field, begin to type text from the name. Textual updates the list to only include matching datasets and pipelines.
To update all of the currently displayed datasets and pipelines, click Bulk action, then click Enable or Disable.
You can also enable and disable custom entity types from within a dataset or pipeline. For more information, go to Enabling and disabling custom entity types.
A Tonic Textual dataset is a collection of text-based files. Textual uses models to detect and redact the sensitive information in each file.
To display the Datasets page, in the navigation menu, click Datasets.
The datasets list only displays the datasets that you have access to.
Users who have the global permission View all datasets can see the complete list of datasets.
For each dataset, the Datasets page includes:
The name of the dataset
Any tags assigned to the dataset. For datasets that you can edit, there is also an option to assign tags. For more information, go to Assigning tags to datasets.
The user who most recently updated the dataset
When the dataset was created
To filter the datasets by name, in the search field, begin to type text that is in the dataset name.
As you type, the list is filtered to only include datasets with names that contain the filter text.
You can assign tags to each dataset. Tags can help you to organize and provide a quick glance into the dataset configuration.
On the Datasets page, to filter the datasets by their assigned tags:
In the heading for the Tags column, click the filter icon.
On the tag list, check the checkbox for each tag to include.
To find a specific tag, in the search field, type the tag name.
From the Datasets page, you can create a new empty dataset. Textual prompts you for the dataset name, then displays the dataset details page.
To create a dataset:
On the Datasets page, click Create a Dataset.
On the dataset creation panel, in the Dataset Name field, provide the name of the dataset.
Click Create Dataset. The dataset details page for the new dataset is displayed.
To display the details page for a dataset, on the Datasets page, click the dataset name.
The dataset details page includes:
The tags assigned to the dataset, as well as an option to add tags. For more information, go to Assigning tags to datasets.
The list of files in the dataset
The results of the scan for entity values
The configured handling for each type of value
The dataset name displays in the panel at the top left of the dataset details page.
To change the dataset name:
On the dataset details page, click Settings.
On the Dataset Settings page, in the Dataset Name field, provide the new name for the dataset.
Click Save Dataset.
To delete a dataset:
On the dataset details page, click Settings.
On the Dataset Settings page, click Delete Dataset.
Click Confirm Delete.
Tonic Textual can process the following types of files:
txt
csv
tsv
docx
xlsx
png
tif or tiff
jpg or jpeg
On a self-hosted instance, you can configure an S3 bucket where Textual stores the files. This is the same S3 bucket that is used for uploaded file pipelines.
For more information, go to Setting the S3 bucket for file uploads and redactions.
For an example of an IAM role with the required permissions, go to .
From the dataset details page, to add files to the dataset:
In the panel at the top left, click Upload Files.
Search for and select the files.
Tonic Textual uploads and then processes the files.
On a self-hosted instance, when a file fails to upload, you can download the associated logs. To download the logs, click the options menu for the file, then select Download Logs.
To remove a file from the dataset:
In the file list, click the options menu for the file.
In the options menu, click Delete File.
For each entity type, you can adjust how Tonic Textual identifies and updates the values.
For a dataset, you configure the redaction from the entity types list on the dataset details page.
For a pipeline that generates synthesized files, you configure the redaction from the Generator Config tab on the pipeline details page.
For uploaded file pipelines, Tonic Textual automatically processes new files as you add them.
For cloud storage pipelines, you start pipeline runs. A pipeline run processes the pipeline files. The pipeline run only processes files that were not processed by a previous pipeline run.
If the pipeline is configured to also redact files, the run also generates the redacted version of each file. The redaction is based on the current redaction configuration for the pipeline. The first run after you enable redaction generates redacted versions of all of the pipeline files, including files that were processed by earlier runs. Subsequent runs only process new files.
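The selection rules for a run can be modeled with two sets: the files that were already processed, and the files that already have a redacted version. This is a simplified sketch of the behavior described above, not Textual's implementation, and all of the names are hypothetical.

```python
def files_for_run(all_files, processed, redacted, redaction_enabled):
    """Return (files to process, files to redact) for the next pipeline run."""
    # A run only processes files that no previous run has processed.
    to_process = [f for f in all_files if f not in processed]
    # When redaction is enabled, every file that lacks a redacted version gets one.
    # On the first run after redaction is enabled, that is every pipeline file.
    to_redact = [f for f in all_files if f not in redacted] if redaction_enabled else []
    return to_process, to_redact


files = ["a.txt", "b.txt", "c.txt"]
# a.txt and b.txt were processed before redaction was enabled.
print(files_for_run(files, {"a.txt", "b.txt"}, set(), True))
# (['c.txt'], ['a.txt', 'b.txt', 'c.txt'])

# A later run, after d.txt is added, only touches the new file.
print(files_for_run(files + ["d.txt"], set(files), set(files), True))
# (['d.txt'], ['d.txt'])
```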
To start a pipeline run, on the pipeline details page, click Run Pipeline.
Before the run completes, to cancel the run:
On the Pipeline Runs tab, click the Cancel Run option for the run.
On the confirmation panel, click Cancel Run.
The Tonic Textual REST API allows you to more deeply integrate Textual functions into your existing workflows.
You can use the REST API as another tool alongside the Textual application and the Textual Python SDK. The Python SDK supports the same actions as the REST API. We recommend the Python SDK for customers who already use Python.
You can download the Textual OpenAPI specification from:
On a self-hosted instance of Tonic Textual, you can view the current model specifications for the instance.
To view the model specifications:
Click the user icon at the top right.
In the user menu, click System Settings.
On the System Settings page, the Model Specifications section provides details about the models that Textual uses.
Use the API to retrieve information about users and groups, and to manage access to datasets.
You can use the Tonic Textual REST API to redact text strings, including:
Plain text
JSON
XML
HTML
Textual provides a specific endpoint for each format. For JSON, XML, and HTML, Textual only redacts the text values. It preserves the underlying structure.
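To illustrate what it means to redact only the text values and preserve the structure, the sketch below walks a parsed JSON document and applies a redaction function to each string value. Keys, numbers, booleans, nulls, and nesting are left unchanged. The redact_names function is a hypothetical stand-in for the real detection, and the [NAME_GIVEN] placeholder is illustrative only.

```python
from typing import Any, Callable


def redact_json_values(node: Any, redact: Callable[[str], str]) -> Any:
    """Return a copy of node with redact applied to every string value."""
    if isinstance(node, dict):
        # Keys are part of the structure, so only the values are visited.
        return {key: redact_json_values(value, redact) for key, value in node.items()}
    if isinstance(node, list):
        return [redact_json_values(item, redact) for item in node]
    if isinstance(node, str):
        return redact(node)
    return node  # numbers, booleans, and None pass through unchanged


def redact_names(text: str) -> str:
    """Hypothetical stand-in for entity detection."""
    return text.replace("John", "[NAME_GIVEN]")


document = {"user": {"first": "John", "age": 42}, "notes": ["Call John", None]}
print(redact_json_values(document, redact_names))
# {'user': {'first': '[NAME_GIVEN]', 'age': 42}, 'notes': ['Call [NAME_GIVEN]', None]}
```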
From the entity types list, you can set whether each custom entity is active, and edit the custom entity configuration.
You can also create a new custom entity type.
In the entity types list, custom entity types include a toggle to indicate whether the custom entity type is active for that dataset or pipeline.
To disable a custom entity type, set the toggle to the off position.
When a custom entity type is enabled, then it is listed under either the found or not found entity types, depending on whether the files include entities of that type.
When a custom entity type is not enabled, it is listed under Inactive custom entity types. To enable the custom entity type, set the toggle to the on position.
To edit a custom entity type, click the settings icon for that type.
Note that any changes to the custom entity type settings affect all of the datasets and pipelines that use the custom entity type.
For information on how to configure a custom entity type, go to .
From the dataset details or pipeline details page, to create a new custom entity type, click Create Custom Entity Type.
For information on how to configure a custom entity type, go to .
When you enable, disable, add, or edit custom entity types, the changes do not take effect until you run a new scan.
For datasets and uploaded file pipelines, to run a new scan, click Scan.
For a cloud storage pipeline, Textual scans the files when you run the pipeline.
Tags can help you to organize your pipelines. For example, you can use tags to indicate pipelines that belong to different groups, or that deal with specific areas of your data.
You can manage tags from both the Pipelines page and the pipeline details.
On the Pipelines page, the Tags column displays the currently assigned tags.
To change the tag assignment for a pipeline:
Click Tags.
On the pipeline tags panel, to add a new tag, type the tag text, then press Enter.
To remove a tag, click its delete icon.
To remove all of the tags, click the delete all icon.
On the pipeline details page, the assigned tags display under the pipeline name.
To change the tag assignment:
Click Tags.
On the pipeline tags panel, to add a new tag, type the tag text, then press Enter.
To remove a tag, click its delete icon.
To remove all of the tags, click the delete all icon.
You can use the Tonic Textual SDK to manage pipelines and to redact individual strings and files.
Tonic Textual uses dataset permission sets for role-based access control (RBAC) of each dataset.
A dataset permission set is a set of dataset permissions. Each permission provides access to a specific dataset feature or function.
Textual provides built-in dataset permission sets. Organizations can also configure custom permission sets.
To share dataset access, you assign dataset permission sets to users and to SSO groups, if you use SSO to manage Textual users. Before you assign a dataset permission set to an SSO group, make sure that you are aware of who is in the group. The permissions that are granted to an SSO group are automatically granted to all of the users in the group.
To change the current access to the dataset:
On the Datasets page, click the share icon for the dataset to share.
The dataset access panel contains the current list of users and groups who have access to the dataset, and displays their assigned dataset permission sets. To add a user or group to the list of users and groups:
In the search field, begin to type the user email address or group name.
From the list of matching users or groups, select the user or group to add.
For a user or group, to change the assigned dataset permission sets:
Click Access. The dropdown list displays the list of custom and built-in dataset permission sets.
Under Custom Permission Sets, check the checkbox next to each dataset permission set to assign to the user or group. To remove an assigned dataset permission set, uncheck the checkbox.
Under Built-In Permission Sets, click the dataset permission set to assign to the user or group. You can only assign one built-in permission set. By default, for an added user or group, the Viewer permission set is selected. To not grant any built-in permission set, select None.
Tonic Textual uses pipeline permission sets for role-based access control (RBAC) of each pipeline.
A pipeline permission set is a set of pipeline permissions. Each pipeline permission provides access to a specific pipeline feature or function.
Textual provides built-in pipeline permission sets. Organizations can also configure custom permission sets.
To share pipeline access, you assign pipeline permission sets to users and to SSO groups, if you use SSO to manage Textual users. Before you assign a pipeline permission set to an SSO group, make sure that you are aware of who is in the group. The permissions that are granted to an SSO group are automatically granted to all of the users in the group.
To change the current access to the pipeline:
On the Pipelines page, click the share icon for the pipeline to share.
The pipeline access panel contains the current list of users and groups who have access to the pipeline, and displays their assigned pipeline permission sets. To add a user or group to the list of users and groups:
In the search field, begin to type the user email address or group name.
From the list of matching users or groups, select the user or group to add.
For a user or group, to change the assigned pipeline permission sets:
Click Access. The dropdown list displays the list of custom and built-in pipeline permission sets.
Under Custom Permission Sets, check the checkbox next to each pipeline permission set to assign to the user or group. To remove a pipeline permission set from a user or group, uncheck the checkbox.
Under Built-In Permission Sets, select the pipeline permission set to assign to the user or group. You can only assign one built-in permission set. By default, for an added user or group, the Viewer permission set is selected. To not grant any built-in permission set, select None.
To save the new access, click Share.
For each entity type, you choose how to handle the detected values.
The available options are:
Synthesis - Indicates to replace the value with another realistic value. For example, the first name value Michael might be replaced with the value John. The synthesized values are always consistent, meaning that a given entity value always has the same replacement value. For example, if the first name Michael appears multiple times in the text, it is always replaced with John. Textual does not synthesize any excluded values. For custom entity types, Textual scrambles the values.
Redaction - This is the default option, except for the Full Mailing Address entity type, which is Off by default.
For text files, Redaction indicates to tokenize the value - to replace it with a token that identifies the entity type followed by a unique identifier. For example, the first name value Michael might be replaced with NAME_GIVEN_12m5sb. The identifiers are consistent, which means that for a given original value, the replacement always has the same unique identifier. For example, the first name Michael might always be replaced with NAME_GIVEN_12m5sb, while the first name Helen might always be replaced with NAME_GIVEN_9ha3m2.
For PDF files, Redaction indicates to either cover the value with a black box, or, if there is space, display the entity type and identifier.
For image files, Redaction indicates to cover the value with a black box.
Textual does not redact any excluded values.
Off - Indicates to not make any changes to the values. For example, the first name value Michael remains Michael. This is the default option for the Full Mailing Address entity type.
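The consistency property of redaction tokens can be illustrated with a minimal sketch. This is not Textual's actual algorithm. The `tokenize` helper and the hash-based identifier are assumptions made for illustration only. The sketch shows one way that the same original value can always map to the same token.

```python
import hashlib


def tokenize(entity_type: str, value: str, id_length: int = 6) -> str:
    """Return a token of the form <ENTITY_TYPE>_<identifier>.

    Hypothetical helper: the identifier is taken from a hash of the value,
    so the same value always produces the same token.
    """
    digest = hashlib.sha256(value.encode("utf-8")).hexdigest()
    return f"{entity_type}_{digest[:id_length]}"


first = tokenize("NAME_GIVEN", "Michael")
second = tokenize("NAME_GIVEN", "Michael")
other = tokenize("NAME_GIVEN", "Helen")

# The two tokens for Michael are identical. The token for Helen differs.
print(first, second, other)
```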
To select the handling option for an individual entity type, click the option for that type.
For a dataset, to select the same handling option for all of the entity types, from the Bulk Edit dropdown above the data type list, select the option.
For a pipeline that generates synthesized files, on the Generator Config tab, use the Bulk Edit options at the top of the entity types list.
On a self-hosted instance, before you can upload files to a pipeline, you must configure the S3 bucket where Tonic Textual stores the files. For more information, go to .
For an example of an IAM role that has the required permissions for file upload pipelines, go to .
On the pipeline details page for an uploaded file pipeline, to add files to the pipeline:
Click Upload Files.
Search for and select the files to upload.
To remove a file, on the pipeline details page, click the delete icon for the file.
By default, Textual only generates the JSON output for the pipeline files.
To also generate versions of the original files that redact or synthesize the detected entity values, on the Pipeline Settings page, toggle Synthesize Files to the on position.
For information on how to configure the file generation, go to .
For PDF files, you can add manual overrides to the initial redactions, which are based on the detected data types and handling configuration.
For each manual override, you select an area of the file.
For the selected area, you can either:
Ignore any automatically detected redactions. For example, a scanned form might show an example or boilerplate content that doesn't actually contain sensitive values.
Redact that area. The file might contain sensitive content that Tonic Textual is unable to detect. For example, a scanned form might contain handwritten notes.
You can also apply a template to the file.
To manage the manual overrides for a PDF file:
In the file list, click the options menu for the file.
In the options menu, click Edit Redactions.
The File Redactions panel displays the file content. The values that Textual detected are highlighted. The page also shows any manual overrides that were added to the file.
If a dataset contains multiple files that have the same format, then you can create a template to apply to those files. For more information, go to .
On the File Redactions panel, to apply a template to the file, select it from the template dropdown list.
When you apply a PDF template to a file, the manual overrides from that template are displayed on the file preview. The manual overrides are not included in the Redactions list.
On the File Redactions panel, to add a manual override to a file:
Select the type of override. To indicate to ignore any automatically detected values in the selected area, click Ignore Redactions. To indicate to redact the selected area, click Add Manual Redaction.
Use the mouse to draw a box around the area to select.
Textual adds the override to the Redactions list. The icon indicates the type of override.
In the file content:
Overrides that ignore detected values within the selected area are outlined in red.
Overrides that redact the selected area are outlined in green.
To select and highlight a manual override in the file content, in the Redactions list, click the navigate icon for the override.
To remove a manual override, in the Redactions list, click the delete icon for the override.
To save the current manual overrides, click Save.
On a self-hosted instance, you can configure settings to determine whether to use the auxiliary model, and model use on GPU.
To improve overall inference, you can configure whether Textual uses the en_core_web_sm auxiliary NER model.
The auxiliary model detects the following types:
EVENT
LANGUAGE
LAW
NRP
NUMERIC_VALUE
PRODUCT
WORK_OF_ART
To configure whether to use the auxiliary model, you use the environment variable TEXTUAL_AUX_MODEL.
The available values are:
en_core_web_sm - This is the default value.
none - Indicates to not use the auxiliary model.
When you use a textual-ml-gpu container on accelerated hardware, you can configure:
Whether to use the auxiliary model
Whether to use the date synthesis model
To configure whether to use the auxiliary model for GPU, you configure the environment variable TEXTUAL_AUX_MODEL_GPU.
By default, on GPU, Textual does not use the auxiliary model, and TEXTUAL_AUX_MODEL_GPU is false.
To use the auxiliary model for GPU, based on the configuration of TEXTUAL_AUX_MODEL, set TEXTUAL_AUX_MODEL_GPU to true.
When TEXTUAL_AUX_MODEL_GPU is true, and TEXTUAL_MULTI_LINGUAL is true, Textual also loads the multilingual models on GPU.
By default, Textual loads the date synthesis model on GPU.
Note that this model requires 600MB of GPU RAM for each machine learning worker.
To not load the date synthesis model on GPU, set TEXTUAL_DATE_SYNTH_GPU to false.
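As a rough illustration of how these settings interact, the following sketch works out which models would be loaded on GPU for a given set of environment variables. The `gpu_model_plan` function is hypothetical and is not part of Textual. The defaults come from the descriptions above, except for TEXTUAL_MULTI_LINGUAL, whose default is assumed here to be false.

```python
def _is_true(value: str) -> bool:
    """Treat the string "true" (any case) as enabled."""
    return value.strip().lower() == "true"


def gpu_model_plan(env: dict) -> dict:
    """Summarize which models load on GPU for the given settings.

    Hypothetical helper that mirrors the documented behavior.
    """
    aux_model = env.get("TEXTUAL_AUX_MODEL", "en_core_web_sm")
    aux_gpu = _is_true(env.get("TEXTUAL_AUX_MODEL_GPU", "false"))
    # Assumption: TEXTUAL_MULTI_LINGUAL defaults to false.
    multilingual = _is_true(env.get("TEXTUAL_MULTI_LINGUAL", "false"))
    date_synth_gpu = _is_true(env.get("TEXTUAL_DATE_SYNTH_GPU", "true"))
    return {
        # The auxiliary model is only used on GPU when it is enabled at all.
        "aux_model_on_gpu": aux_gpu and aux_model != "none",
        "multilingual_on_gpu": aux_gpu and multilingual,
        "date_synthesis_on_gpu": date_synth_gpu,
    }


print(gpu_model_plan({"TEXTUAL_AUX_MODEL_GPU": "true"}))
```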
Whenever you call the Textual SDK, you first instantiate the SDK client.
To work with Textual datasets, or to redact individual files, you instantiate TonicTextual.
To work with Textual pipelines and parsing, you instantiate TonicTextualParse.
If the API key is configured as the value of the TONIC_TEXTUAL_API_KEY environment variable, then you do not need to provide the API key when you instantiate the SDK client.
For Textual pipelines:
For Textual datasets:
If the API key is not configured as the value of the TONIC_TEXTUAL_API_KEY environment variable, then you must include the API key in the request.
For Textual pipelines:
For Textual datasets:
You can use the Textual SDK to redact and synthesize values in individual files.
Before you perform these tasks, remember to .
For a self-hosted instance, you can also configure the S3 bucket to use to store the files. This is the same S3 bucket that is used to store files for uploaded file pipelines. For more information, go to . For an example of an IAM role with the required permissions, go to .
To send an individual file to Textual, you use textual.start_file_redaction.
You first open the file so that Textual can read it, then make the call for Textual to read the file.
The response includes:
The file name
The identifier of the job that processed the file. You use this identifier to retrieve a transformed version of the file.
After you use textual.start_file_redaction to send the file to Textual, you use textual.download_redacted_file to retrieve a transformed version of the file.
To identify the file, you use the job identifier that you received from textual.start_file_redaction
. You can for the detected entity values.
Before you make the call to download the file, you specify the path to download the file content to.
For calls to AWS products that are used in Textual, you can configure custom URLs to use. For example, if you use proxy endpoints, then you would configure those endpoints in Textual.
The environment variables for custom AWS endpoints include the following:
AWS_S3_FORCE_PATH_STYLE - Whether to always use path-style instead of virtual-hosted-style for connections to Amazon S3. The default is false. This setting is only used if you configured either AWS_ENDPOINT_URL or AWS_ENDPOINT_URL_S3.
AWS_ENDPOINT_URL - The URL to use for all AWS calls, including calls to Amazon S3, Amazon Textract, and Amazon SES v2. This global endpoint is overridden by service-specific endpoints.
AWS_ENDPOINT_URL_S3 - The URL to use for calls to Amazon S3. This overrides the global URL set in AWS_ENDPOINT_URL.
AWS_ENDPOINT_URL_TEXTRACT - The URL to use for calls to Amazon Textract. This overrides the global URL set in AWS_ENDPOINT_URL.
AWS_ENDPOINT_URL_SESV2 - The URL to use for calls to Amazon SES v2. This overrides the global URL set in AWS_ENDPOINT_URL.
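The override order can be sketched in a few lines of Python. The `resolve_endpoint` and `uses_path_style` functions are hypothetical and only model the precedence described above: a service-specific URL takes priority over the global URL, and the path-style setting only matters when a custom Amazon S3 endpoint applies. The proxy URLs in the usage example are placeholders.

```python
from typing import Optional

# Maps each service to its service-specific environment variable.
SERVICE_VARIABLES = {
    "s3": "AWS_ENDPOINT_URL_S3",
    "textract": "AWS_ENDPOINT_URL_TEXTRACT",
    "sesv2": "AWS_ENDPOINT_URL_SESV2",
}


def resolve_endpoint(service: str, env: dict) -> Optional[str]:
    """Return the custom URL for a service, or None to use the AWS default."""
    specific = env.get(SERVICE_VARIABLES[service])
    if specific:
        return specific
    return env.get("AWS_ENDPOINT_URL") or None


def uses_path_style(env: dict) -> bool:
    """Path-style is only considered when a custom S3 endpoint is in effect."""
    forced = env.get("AWS_S3_FORCE_PATH_STYLE", "false").strip().lower() == "true"
    return forced and resolve_endpoint("s3", env) is not None


example_env = {
    "AWS_ENDPOINT_URL": "https://proxy.example.com",
    "AWS_ENDPOINT_URL_S3": "https://s3-proxy.example.com",
    "AWS_S3_FORCE_PATH_STYLE": "true",
}
print(resolve_endpoint("s3", example_env))
print(resolve_endpoint("textract", example_env))
print(uses_path_style(example_env))
```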
By default, each Tonic Textual worker can run eight jobs at the same time. For example, it can process up to eight files simultaneously.
The environment variable SOLAR_MAX_CONCURRENT_WORKER_JOBS controls the number of jobs to run concurrently.
The number of jobs that can run concurrently can affect the number of workers that you need. The more jobs that can run concurrently, the fewer workers that are needed.
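The relationship between concurrent jobs and workers is simple arithmetic. The following sketch uses a hypothetical `workers_needed` helper, and assumes that every job takes one slot and that the load is spread evenly across workers.

```python
import math


def workers_needed(concurrent_files: int, jobs_per_worker: int = 8) -> int:
    """Estimate how many workers are needed to process files at the same time.

    jobs_per_worker corresponds to SOLAR_MAX_CONCURRENT_WORKER_JOBS (default 8).
    """
    if jobs_per_worker < 1:
        raise ValueError("jobs_per_worker must be at least 1")
    return max(1, math.ceil(concurrent_files / jobs_per_worker))


# 20 simultaneous files with the default of 8 jobs per worker.
print(workers_needed(20))
# The same load when each worker is limited to 4 jobs.
print(workers_needed(20, jobs_per_worker=4))
```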
The following diagram shows how data and requests flow within the Tonic Textual application:
The Textual application database is a PostgreSQL database that stores the dataset configuration.
If you do not configure an S3 bucket, then it also stores uploaded files and files that you use the SDK to redact.
You can configure an S3 bucket to store uploaded files and individual files that you use the SDK to redact. For more information, go to .
If you do not configure an S3 bucket, then the files are stored in the Textual application database.
Runs the Textual user interface.
A Textual instance can have multiple workers.
The worker orchestrates jobs. A job is a longer running task such as the redaction of a single file.
If you redact a large number of files, you might deploy additional workers and machine learning containers to increase the number of files that you can process concurrently.
A Textual installation can have one or more machine learning containers.
The machine learning container hosts the Textual models. It takes text from the worker or web server and returns any entities that it discovers.
Additional machine learning containers can increase the number of words per second that Textual can process.
The OCR service converts PDFs and images to text that Textual can then scan for sensitive values.
For more information, go to .
Textual only uses the LLM service for .
The Docker Compose file is available in the GitHub repository .
Fork the repository.
To deploy Textual:
Rename sample.env to .env.
In .env, provide values for the required settings. These are not commented out and have <FILL IN> as a placeholder value:
SOLAR_VERSION - Provided by Tonic.ai.
SOLAR_LICENSE - Provided by Tonic.ai.
ENVIRONMENT_NAME - The name that you want to use for your Textual instance. For example, my-company-name.
SOLAR_SECRET - The string to use for Textual encryption.
SOLAR_DB_PASSWORD - The password that you want to use for the Textual application database, which stores the metadata for Textual, including the datasets and pipelines. Textual deploys a PostgreSQL database container for the application database.
To deploy and start Textual, run docker-compose up -d.
The Tonic Textual images are stored on . During onboarding, Tonic.ai provides you with credentials to access the image repository. If you require new credentials, or you experience issues accessing the repository, contact .
You can deploy Textual using either Kubernetes or Docker.
The TEXTUAL_ML_WORKERS environment variable specifies the number of workers to use within the textual-ml container. The default value is 1.
Having multiple workers allows for parallelization of inferences with NER models. The number of required workers is also affected by the .
When you deploy Textual with Kubernetes on GPUs, parallelization allows the textual-ml container to fully utilize the GPU.
We recommend 3GB of GPU RAM for each worker.
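As a sizing aid, the 3GB recommendation can be turned into a quick estimate of how many workers fit on a given GPU. The `max_ml_workers` helper is hypothetical. It ignores any other GPU memory use, so treat the result as an upper bound rather than a guarantee.

```python
import math


def max_ml_workers(gpu_ram_gb: float, gb_per_worker: float = 3.0) -> int:
    """Estimate the largest TEXTUAL_ML_WORKERS value that fits in GPU RAM."""
    if gb_per_worker <= 0:
        raise ValueError("gb_per_worker must be positive")
    return math.floor(gpu_ram_gb / gb_per_worker)


# Estimates for a 16GB GPU and a 24GB GPU.
print(max_ml_workers(16))
print(max_ml_workers(24))
```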
The Tonic Textual Snowflake Native App uses the same models and algorithms as the Tonic Textual API, but runs natively in Snowflake.
You use the app to redact or parse your text data directly within your Snowflake workflows. The text never leaves your data warehouse.
The app package runs natively in Snowflake, and leverages Snowpark Container Services.
It includes the following containers:
Detection service, which detects the sensitive entity values.
Redaction service, which replaces the sensitive entity values.
For the redaction workflow, you use the app to detect and replace sensitive values in text.
You use TEXTUAL_REDACT to send the redaction request.
When you call TEXTUAL_REDACT, it passes to the redaction service:
The text to redact
Optional configuration
The redaction service forwards the text to the detection service.
The detection service uses a series of NER models to identify and categorize sensitive words and phrases in the text.
The detection service returns its results to the redaction service.
The redaction service uses the results to replace the sensitive words and phrases with redacted or synthesized versions.
The redacted text is returned to the user.
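The hand-off between the two services can be pictured with a toy simulation. Nothing here reflects the real implementation. The function names are hypothetical, and a single regular expression for email addresses stands in for the NER models. The EMAIL_ADDRESS label is used only for illustration.

```python
import re
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start offset, end offset, entity type)


def detection_service(text: str) -> List[Span]:
    """Stand-in for the NER models: finds email addresses only."""
    pattern = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
    return [(m.start(), m.end(), "EMAIL_ADDRESS") for m in pattern.finditer(text)]


def redaction_service(text: str) -> str:
    """Forward the text for detection, then replace each detected span."""
    spans = detection_service(text)
    # Replace from the end of the text so that earlier offsets stay valid.
    for start, end, label in sorted(spans, reverse=True):
        text = text[:start] + f"[{label}]" + text[end:]
    return text


redacted = redaction_service("Contact jane@example.com or bob@example.org today.")
print(redacted)
```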
For the parsing workflow, you use the app to parse files that are in a Snowflake internal or external stage.
You call TEXTUAL_PARSE to send the parse request. The request includes:
The fully qualified stage name where the files are located
The name of the file, or a variable that identifies the list of files
The MD5 sum of the file
The app uses a series of NER models to identify and categorize sensitive words and phrases in the text.
The app converts the content to a markdown format.
The markdown content is part of the JSON output that includes metadata about the parsed text. You can use the metadata to build RAG systems and LLM datasets.
The app stores the results of the parse request, including the output, in the TEXTUAL_RESULTS table.
The Tonic Textual Helm chart is available in the GitHub repository .
To use the Helm chart, you can either:
Use the that Tonic hosts on .
Fork or clone the repository and then maintain it locally.
During the onboarding period, you are provided access credentials to our docker image repository on . If you require new credentials, or you experience issues accessing the repository, contact .
Before you deploy Textual, you create a values.yaml file with the configuration for your instance.
For details about the required and optional configuration options, go to the .
To deploy and validate access to Textual from the forked repository, follow the .
To use the OCI-based registry, run:
The GitHub repository contains a with the details on how to populate a values.yaml file and deploy Textual.
Use the REST API to manage dataset files.
For each file in a dataset, you can download the version of the file that contains the replacement values.
For information on downloading synthesized files from a pipeline, go to .
From the file list, to download a single file:
Click the options menu for the file.
In the options menu, click Download File.
To download all of the files, click Download All Files.
Use the REST API to create and manage datasets.
Tonic Textual provides the following options to manage access to Textual and its features.
Use the Tonic Textual REST API to redact text. Redaction means to detect and replace sensitive values.
Add or exclude values for an entity type
For a dataset, identify additional values to detect for an entity type, and values to not identify as an entity type.
Select the handling option
For each entity type, indicate whether to redact, synthesize, or ignore the entity values.
Configure synthesis options
For synthesized entity values, configure additional options for how to generate the values.
Configure handling of file components
For a dataset, indicate how Textual handles .docx images, .docx comments, and PDF signatures.
Edit an individual file
Add manual overrides to a PDF file. You can also apply a template.
Create PDF templates
PDF templates allow you to add the same overrides to files that have the same structure.
View pipeline files and runs
Display the lists of pipeline files and pipeline runs.
View processed file details
Display details for a file, including the original content, detected entities, and JSON output.
JSON output structure
Structure of the JSON output for different types of pipeline files.
Create and manage pipelines
Create, run, and get results from Textual pipelines.
Parse individual files
Send a single file to be parsed.
Authentication - Authentication requirements for the REST API.
Redaction - Use the REST API to redact text.
Datasets - Use the REST API to manage datasets and dataset files.
Access management - Use the REST API to retrieve user information and to manage dataset access.
Manage dataset files - Upload new files and download redacted files.
Manage datasets - Create datasets and edit dataset configuration.
from tonic_textual.parse_api import TonicTextualParse
# The default URL is https://textual.tonic.ai (Textual Cloud)
# If you host Textual, provide your Textual URL
textual_url = "<Textual URL>"
textual = TonicTextualParse(textual_url)
from tonic_textual.redact_api import TonicTextual
# The default URL is https://textual.tonic.ai (Textual Cloud)
# If you host Textual, provide your Textual URL
textual_url = "<Textual URL>"
textual = TonicTextual(textual_url)
from tonic_textual.parse_api import TonicTextualParse
api_key = "your-tonic-textual-api-key"
# The default URL is https://textual.tonic.ai (Textual Cloud)
# If you host Textual, provide your Textual URL
textual_url = "<Textual URL>"
textual = TonicTextualParse(textual_url, api_key=api_key)
from tonic_textual.redact_api import TonicTextual
api_key = "your-tonic-textual-api-key"
# The default URL is https://textual.tonic.ai (Textual Cloud)
# If you host Textual, provide your Textual URL
textual_url = "<Textual URL>"
textual = TonicTextual(textual_url, api_key=api_key)
helm install textual oci://quay.io/tonicai/textual -f values.yaml -n textual --create-namespace
You can send an audio file to the Tonic Textual SDK. Textual creates a transcription of the audio file, and then redacts the transcription text as a string.
The file must be 25MB or smaller, and must be one of the following file types:
m4a
mp3
webm
mp4
mpga
wav
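Before you send a file, you could check it against these limits locally. The following sketch is not part of the Textual SDK. The `can_send_audio` helper is hypothetical, and it assumes that the 25MB limit means 25 x 1024 x 1024 bytes.

```python
from pathlib import Path

SUPPORTED_AUDIO_TYPES = {"m4a", "mp3", "webm", "mp4", "mpga", "wav"}
# Assumption: "25MB" is interpreted as 25 * 1024 * 1024 bytes.
MAX_AUDIO_BYTES = 25 * 1024 * 1024


def can_send_audio(file_name: str, size_in_bytes: int) -> bool:
    """Check the file extension and size against the documented limits."""
    extension = Path(file_name).suffix.lower().lstrip(".")
    return extension in SUPPORTED_AUDIO_TYPES and size_in_bytes <= MAX_AUDIO_BYTES


print(can_send_audio("interview.mp3", 4_000_000))
print(can_send_audio("interview.flac", 4_000_000))
```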
To transcribe and redact an audio file, you use textual.redact_audio.
redaction_response = textual.redact_audio("<path to the audio file>")
redaction_response.describe()
The request includes the entity type handling configuration.
The redaction response includes the redacted or synthesized content and details about the detected entity values.
You can use the Textual SDK to parse individual files, either from a local file system or from an S3 bucket.
Textual returns a FileParseResult object for each parsed file. The FileParseResult object is a wrapper around the output JSON for the processed file.
To parse a single file from a local file system, use textual.parse_file:
with open('<path to the file>', 'rb') as f:
    byte_data = f.read()
    parsed_doc = textual.parse_file(byte_data, '<file name>')
You must use rb access mode to read the file. rb access mode opens the file to be read in binary format.
You can also set a timeout in seconds for the parsing. You can add the timeout as a parameter of the parse_file command. To set a timeout to use for all parsing, set the environment variable TONIC_TEXTUAL_PARSE_TIMEOUT_IN_SECONDS.
You can also parse files that are stored in Amazon S3. Because this process uses the boto3 library to fetch the file from Amazon S3, you must first set up the correct AWS credentials.
To parse a file from an S3 bucket, use textual.parse_s3_file:
parsed_doc = textual.parse_s3_file('<bucket>','<key>')
For Amazon S3 pipelines, you connect to S3 buckets to select and store files.
On self-hosted instances, you also configure an S3 bucket and the credentials to use to store files for:
Uploaded file pipelines. The S3 bucket is required for uploaded file pipelines. The S3 bucket is not used for pipelines that connect to Azure Blob Storage or to Databricks Unity Catalog.
Dataset files. If you do not configure an S3 bucket, then the files are stored in the application database.
Individual files that you send to the SDK for redaction. If you do not configure an S3 bucket, then the files are stored in the application database.
Here are examples of IAM roles that have the required permissions to connect to Amazon S3 to select or store files.
For uploaded file pipelines, datasets, and individual file redactions, the files are stored in a single S3 bucket. For information on how to configure the S3 bucket and the corresponding access credentials, go to Setting the S3 bucket for file uploads and redactions.
The IAM role that is used to connect to the S3 bucket must be able to read files from and write files to it.
Here is an example of an IAM role that has the permissions required to support uploaded file pipelines, datasets, and individual redactions:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"s3:GetObject",
"s3:PutObject",
"s3:DeleteObject",
"s3:ListBucket"
],
"Resource": [
"arn:aws:s3:::<SOLAR_INTERNAL_BUCKET_NAME>",
"arn:aws:s3:::<SOLAR_INTERNAL_BUCKET_NAME>/*"
]
}
]
}
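If you manage several buckets, it can be convenient to generate this policy document rather than edit it by hand. The sketch below builds the same statement for a given bucket name. The `upload_bucket_policy` function and the bucket name in the usage example are hypothetical.

```python
import json


def upload_bucket_policy(bucket_name: str) -> dict:
    """Build the example policy above for a specific bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "s3:GetObject",
                    "s3:PutObject",
                    "s3:DeleteObject",
                    "s3:ListBucket",
                ],
                # The bucket itself (for ListBucket) and the objects in it.
                "Resource": [
                    f"arn:aws:s3:::{bucket_name}",
                    f"arn:aws:s3:::{bucket_name}/*",
                ],
            }
        ],
    }


policy = upload_bucket_policy("my-textual-bucket")
print(json.dumps(policy, indent=2))
```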
The access credentials that you configure for an Amazon S3 pipeline must be able to navigate to and select files and folders from the appropriate S3 buckets. They also need to be able to write output files to the configured output location.
Here is an example of an IAM role that has the permissions required to support Amazon S3 pipelines:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": "s3:ListAllMyBuckets",
"Resource": "arn:aws:s3:::*"
},
{
"Effect": "Allow",
"Action": [
"s3:ListBucket",
"s3:GetBucketLocation"
],
"Resource": [
"arn:aws:s3:::*"
]
},
{
"Effect": "Allow",
"Action": [
"s3:GetObject",
"s3:PutObject",
"s3:DeleteObject",
"s3:ListBucketMultipartUploads",
"s3:AbortMultipartUpload"
],
"Resource": [
"arn:aws:s3:::*/*"
]
}
]
}
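# Open the file in binary mode and send it to Textual
with open("<path to the file>", "rb") as f:
    j = textual.start_file_redaction(f, "<file name>")
# Write the transformed file content to the output location
with open("<path to output location>", "wb") as fo:
    fo.write(textual.download_redacted_file("<job identifier>"))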
Use these instructions to set up Azure Active Directory as your SSO provider for Tonic Textual.
Register Textual as an application within the Azure Active Directory Portal:
In the portal, navigate to Azure Active Directory -> App registrations, then click New registration.
Register Textual and create a new web redirect URI that points to your Textual instance's address and the path /sso/callback/azure.
Take note of the values for client ID and tenant ID. You will need them later.
Click Add a certificate or secret, and then create a new client secret. Take note of the secret value. You will need this later.
Navigate to the API permissions page. Add the following permissions for the Microsoft Graph API:
OpenId permissions
openid
profile
GroupMember
GroupMember.Read.All
User
User.Read
Click Grant admin consent for Tonic AI. This allows the application to read the user and group information from your organization. When permissions have been granted, the status should change to Granted for Tonic AI.
Navigate to Enterprise applications and then select Textual. From here, you can assign the users or groups that should have access to Textual.
After you complete the configuration in Azure, you uncomment and configure the required environment variables in Textual.
For Kubernetes, in values.yaml:
# Azure SSO Config
# -----------------
#azureClientId: <client-id>
#azureTenantId: <tenant-id>
#azureClientSecret: <client-secret>
For Docker, in .env:
#SOLAR_SSO_AZURE_CLIENT_ID=#<FILL IN>
#SOLAR_SSO_AZURE_TENANT_ID=#<FILL IN>
#SOLAR_SSO_AZURE_CLIENT_SECRET=#<FILL IN>
To process PDF and image files, Tonic Textual uses optical character recognition (OCR). Textual supports the following OCR models:
Azure AI Document Intelligence
Amazon Textract
Tesseract
For the best performance, we recommend that you use either Azure AI Document Intelligence or Amazon Textract.
If you cannot use either of those - for example because you run Textual on-premises and cannot access third-party services - then you can use Tesseract.
To use Azure AI Document Intelligence to process PDF and image files, Textual requires the Azure AI Document Intelligence key and endpoint.
In .env, uncomment and provide values for the following settings:
SOLAR_AZURE_DOC_INTELLIGENCE_KEY=#
SOLAR_AZURE_DOC_INTELLIGENCE_ENDPOINT=#
In values.yaml, uncomment and provide values for the following settings:
azureDocIntelligenceKey:
azureDocIntelligenceEndpoint:
If the Azure-specific environment variables are not configured, then Textual attempts to use Amazon Textract.
To use Amazon Textract, Textual requires access to an IAM role that has sufficient permissions. You must also configure an S3 bucket to use to store files. The configured S3 bucket is required for uploaded file pipelines, and is also used to store dataset files and individual files that are redacted using the SDK.
We recommend that you use the AmazonTextractFullAccess policy, but you can also choose to use a more restricted policy.
Here is an example policy that provides the minimum required permissions:
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "VisualEditor0",
"Effect": "Allow",
"Action": [
"textract:StartDocumentAnalysis",
"textract:AnalyzeDocument",
"textract:GetDocumentAnalysis"
],
"Resource": "*"
}
]
}
After the policy is attached to an IAM user or a role, it must be made accessible to Textual. To do this, either:
Assign an instance profile
Provide the AWS key, secret, and Region in the following environment variables:
AWS_ACCESS_KEY_ID
AWS_SECRET_ACCESS_KEY
AWS_DEFAULT_REGION
If neither Azure AI Document Intelligence nor Amazon Textract is configured, then Textual uses Tesseract, which is automatically available in your Textual installation.
Tesseract does not require any external access.
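The fallback order between the three OCR models can be summarized in a short sketch. The `select_ocr_model` function is hypothetical and only models the rules described in this documentation, including the requirement that Amazon Textract needs a configured S3 bucket. How Textual actually verifies each configuration is not covered here.

```python
def select_ocr_model(env: dict, has_instance_profile: bool = False) -> str:
    """Return the OCR model that would be used for the given settings."""
    azure_ready = bool(env.get("SOLAR_AZURE_DOC_INTELLIGENCE_KEY")) and bool(
        env.get("SOLAR_AZURE_DOC_INTELLIGENCE_ENDPOINT")
    )
    if azure_ready:
        return "Azure AI Document Intelligence"

    # Textract needs AWS access (keys or an instance profile) and an S3 bucket.
    has_keys = all(
        env.get(name)
        for name in ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_DEFAULT_REGION")
    )
    has_bucket = bool(env.get("SOLAR_INTERNAL_BUCKET_NAME"))
    if has_bucket and (has_keys or has_instance_profile):
        return "Amazon Textract"

    # Tesseract is always available and needs no external access.
    return "Tesseract"


print(select_ocr_model({}))
```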
Tonic Textual pipelines can process files from sources such as Amazon S3, Azure Blob Storage, and Databricks Unity Catalog. You can also create pipelines to process files that you upload directly from your browser.
For those uploaded file pipelines, Textual always stores the files in an S3 bucket. On a self-hosted instance, before you add files to an uploaded file pipeline, you must configure the S3 bucket and the associated authentication credentials.
The configured S3 bucket is also used to store dataset files and individual files that you use the Textual SDK to redact. If an S3 bucket is not configured, then:
The dataset and individual redacted files are stored in the Textual application database.
You cannot use Amazon Textract for PDF and image processing. If you configured Textual to use Amazon Textract, Textual instead uses Tesseract.
The authentication credentials for the S3 bucket include:
The AWS Region where the S3 bucket is located.
An AWS access key that is associated with an IAM user or role.
The secret key that is associated with the access key.
To provide the authentication credentials, you can either:
Provide the values directly as environment variable values.
Use the instance profile of the compute instance where Textual runs.
For an example IAM role that has the required permissions, go to Example IAM role for file uploads and redactions.
In .env, add the following settings:
SOLAR_INTERNAL_BUCKET_NAME= <S3 bucket path>
AWS_REGION= <AWS Region>
AWS_ACCESS_KEY_ID= <AWS access key>
AWS_SECRET_ACCESS_KEY= <AWS secret key>
If you use the instance profile of the compute instance, then only the bucket name is required.
In values.yaml, within env: { } under both textual_api_server and textual_worker, add the following settings:
SOLAR_INTERNAL_BUCKET_NAME
AWS_REGION
AWS_ACCESS_KEY_ID
AWS_SECRET_ACCESS_KEY
For example, if no other environment variables are defined:
env: {
"SOLAR_INTERNAL_BUCKET_NAME": "<S3 bucket path>",
"AWS_REGION": "<AWS Region>",
"AWS_ACCESS_KEY_ID": "<AWS access key>",
"AWS_SECRET_ACCESS_KEY": "<AWS secret key>"
}
If you use the instance profile of the compute instance, then only the bucket name is required.
For the Tonic Textual Snowflake Native App, you set up:
A compute pool
A warehouse to enable queries
The compute pool must be specific to Textual.
For large-scale jobs, we highly recommend a GPU-enabled compute pool.
During setup and testing, you can use a CPU-only pool.
USE ROLE SYSADMIN;
CREATE COMPUTE POOL IF NOT EXISTS {YOUR_COMPUTE_POOL_NAME} FOR APPLICATION TONIC_TEXTUAL
MIN_NODES = 1
MAX_NODES = 1
INSTANCE_FAMILY = GPU_NV_S
AUTO_RESUME = true;
To run SQL queries against Snowflake tables that the app manages, the app requires a warehouse.
USE ROLE ACCOUNTADMIN;
CREATE WAREHOUSE {YOUR_TEXTUAL_WAREHOUSE_NAME} WITH WAREHOUSE_SIZE='MEDIUM';
GRANT USAGE ON WAREHOUSE {YOUR_TEXTUAL_WAREHOUSE_NAME} TO APPLICATION TONIC_TEXTUAL;
The Tonic Textual Home page provides a tool that allows you to see how Textual detects and replaces values in plain text or an uploaded file.
It also provides a preview of the redaction configuration options, including:
How to replace the values for each entity type.
Added and excluded values for each entity type.
The Home page displays automatically when you log in to Textual. To return to the Home page from other pages, in the navigation menu, click Home.
To provide the content to redact, you can enter text directly, or you can upload a file.
As you enter or paste text in the Original Content text area, Textual displays the redacted version in the Results panel at the right.
Textual also provides sample text options for some common use cases. To populate the text with a sample, under Try a sample, click the sample to use.
You can also redact .txt or .docx files.
To provide a file, either:
Drag and drop the file to the Original Content text area.
Click the upload prompt, then search for and select the file.
Textual processes the file and then displays the redacted version in the Results panel. The Original Content text area is removed.
To clear the text, click Clear.
The handling option indicates how Textual replaces a detected value for an entity type. You can experiment with different handling options.
Note that the updated configuration is only used for the current redacted text. When you clear the text, Textual also clears the configuration.
The options are:
Redact - This is the default value. Textual replaces the value with the name of the entity type. For example, the first name John is replaced with NAME_GIVEN.
Synthesize - Textual replaces the value with a realistic generated value. For example, the first name John is replaced with the first name Michael. The replacement values are consistent, which means that a given value always has the same replacement. For example, Michael is always the replacement value for John.
Off - Textual ignores the value and copies it as is to the Results panel.
To change the handling option for an entity type:
In the Results panel, click an instance of the entity type.
On the configuration panel, click the handling option to use.
Textual updates all instances of that entity type to use the selected handling option.
For example, if you change the handling option for NAME_GIVEN to Synthesize, then all instances of first names are replaced with realistic values.
For each entity type in entered text, you can use regular expressions to define added and excluded values.
Added values are values that Textual does not detect for an entity type, but that you want to include. For example, you might have values that are specific to your company or industry.
Excluded values are values that you do not want Textual to identify as a given entity type.
Note that the configuration is only used for the current redacted text. When you clear the text, Textual also clears the configuration.
Also, this option is only available for text that you enter directly. For an uploaded file, to do additional configuration or to download the file, you must create a dataset from the file.
To display the configuration panel for added and excluded values, click Fine-tune Results.
The Fine-Tune Results panel displays the list of configured rules for the current text. For each rule, the list includes:
The entity type.
Whether the rule adds or excludes values.
The regular expression to identify the added or excluded values.
On the Fine-Tune Results panel, to create a rule:
Click Add Rule.
From the entity type dropdown list, select the entity type that the rule applies to.
From the rule type dropdown list:
If the rule adds values, then select Include.
If the rule excludes values, then select Exclude.
In the regular expression field, provide the regular expression to use to identify the values to add or exclude.
To save the rule, click the save icon.
To edit a rule:
On the Fine-Tune Results panel, click the edit icon for the rule.
Update the configuration.
Click the save icon.
On the Fine-Tune Results panel, to delete a rule, click its delete icon.
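As a rough illustration of how Include and Exclude rules could change a set of detections, here is a minimal sketch. It uses Python's re module, which might not match the regular expression engine that Textual uses. The Rule class, the apply_rules function, and the sample text are hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass
class Rule:
    entity_type: str  # for example, "NAME_GIVEN"
    rule_type: str    # "Include" or "Exclude"
    pattern: str      # regular expression that identifies the values


def apply_rules(text, detected, rules):
    """Adjust detections, given as (start, end, entity_type) tuples, using the rules."""
    result = list(detected)
    for rule in rules:
        regex = re.compile(rule.pattern)
        if rule.rule_type == "Exclude":
            # Drop detections of this entity type whose text matches the pattern.
            result = [
                (start, end, etype)
                for (start, end, etype) in result
                if not (etype == rule.entity_type and regex.fullmatch(text[start:end]))
            ]
        elif rule.rule_type == "Include":
            # Add every match of the pattern as a detection of this entity type.
            for match in regex.finditer(text):
                span = (match.start(), match.end(), rule.entity_type)
                if span not in result:
                    result.append(span)
    return sorted(result)
```

In the text "Jonno emailed John about the April release.", an Include rule with the pattern \bJonno\b adds the nickname as a first name, and an Exclude rule with the pattern April removes a month that was mistaken for a first name.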
From an uploaded file, you can create a dataset that contains the file.
You can then provide additional configuration, such as added and excluded values, and download the redacted file.
To create a dataset from an uploaded file:
Click Download.
Click Create a Dataset.
Textual displays the dataset details for the new dataset. The dataset name is Playground Dataset <number>, where the number reflects the number of datasets that were created from the Home page.
The dataset contains the uploaded file.
When Textual generates the redacted version of the text, it also generates the corresponding API request. The request includes the entity type configuration.
To view the API request code, click Show Code.
To hide the code, click Hide Code.
On the code panel:
The Python tab contains the Python version of the request.
The cURL tab contains the cURL version of the request.
To copy the currently selected version of the request code, click Copy Code.
For entered text on the Home page, Textual offers an option to send the following to an OpenAI large language model (LLM):
The detected entity values.
The text that surrounds each value.
The LLM processing is not available for uploaded files.
It is also limited to text that contains 100 or fewer words.
The LLM processing is intended to improve the detection and the replacement values. The LLM:
Verifies that the assigned entity type is correct.
If it is not, determines the correct entity type.
Standardizes an entity value that has different formats, such as Main St. versus Main Street.
Generates replacement values that use the same format as the original value.
To enable the LLM processing, set the environment variable ENABLE_EXPERIMENTAL_SYNTHESIS to True. If the variable is not set to True, the LLM processing is not available.
By default, the LLM processing uses the gpt-4o model.
To use a different model, configure the environment variable LLM_MODEL.
If you use a ChatGPT model, you must also provide an OpenAI key as the value of the environment variable OPENAI_API_KEY.
To use a ChatGPT model other than gpt-4o, configure the environment variable LLM_MODEL. For example, to use gpt-4o-mini, set LLM_MODEL to openai/gpt-4o-mini.
To use Amazon Bedrock for LLM processing, set the following environment variables:
LLM_MODEL - bedrock/<model name>, where <model name> is one of the models that Amazon Bedrock supports.
AWS_ACCESS_KEY_ID - An AWS access key that is associated with an IAM user or role. The role must have read permissions for Amazon Bedrock (AmazonBedrockReadOnly).
AWS_SECRET_ACCESS_KEY - The secret key that is associated with the access key.
AWS_DEFAULT_REGION - The AWS Region to send the authentication request to.
To use Azure OpenAI for LLM processing, set the following environment variables:
LLM_MODEL - azure/<model name>, where <model name> is the Azure deployment name.
AZURE_OPENAI_API_KEY - The API key.
AZURE_API_BASE - The URL of your Azure OpenAI deployment.
AZURE_API_VERSION - The API version of Azure to use.
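The environment variables for the different LLM providers can be checked before you start Textual. The following is a minimal sketch, not part of Textual. The function name missing_llm_settings is hypothetical. The sketch assumes that the ENABLE_EXPERIMENTAL_SYNTHESIS value is compared without regard to case, and that an LLM_MODEL value with no provider prefix refers to an OpenAI model.

```python
# Variables that each provider prefix requires, based on the lists above.
REQUIRED_BY_PROVIDER = {
    "openai": ["OPENAI_API_KEY"],
    "bedrock": ["AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_DEFAULT_REGION"],
    "azure": ["AZURE_OPENAI_API_KEY", "AZURE_API_BASE", "AZURE_API_VERSION"],
}


def missing_llm_settings(env):
    """Return the names of the LLM-related variables that are not set in env."""
    missing = []
    # Assumption: the flag is accepted regardless of letter case.
    if env.get("ENABLE_EXPERIMENTAL_SYNTHESIS", "").lower() != "true":
        missing.append("ENABLE_EXPERIMENTAL_SYNTHESIS")
    model = env.get("LLM_MODEL", "")
    # Assumption: no provider prefix (including the default gpt-4o) means OpenAI.
    provider = model.split("/", 1)[0] if "/" in model else "openai"
    for name in REQUIRED_BY_PROVIDER.get(provider, []):
        if not env.get(name):
            missing.append(name)
    return missing
```

To check the current process environment, you could pass os.environ to the function after you import os. An empty list means that every variable the sketch knows about is set.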
After you enter text in the Original Content panel, to enable the LLM processing, in the Results panel, click Use an LLM to perform AI synthesis.
You cannot use this option for text that contains more than 100 words.
When you clear the text, Textual reverts to the default processing.
The Textual LLM preparation workflow transforms source files into content that you can incorporate into an LLM.
You can:
Upload files directly from a local file system
Select files from an S3 bucket
Select files from a Databricks data volume
Select files from an Azure Blob Storage container
Select files from a SharePoint repository
Textual can process plain text files (.txt and .csv), .docx files, and .xlsx files. It can also process PDF files. For images, Textual can extract text from .png, .tif/.tiff, and .jpg/.jpeg files.
You can also create and manage pipelines from the Textual SDK.
At a high level, to use Textual to create LLM-ready content:
If the source files are in a local file system, then upload the files to the pipeline. Textual stores the files in your configured Amazon S3 location, and then automatically processes each new file.
If the source files are in cloud storage (Amazon S3, Databricks, Azure, or SharePoint):
Provide the credentials to use to connect to the storage location.
Identify the location where Textual writes the pipeline output.
Optionally, filter the files by file type. For example, you might only want to process PDF files.
Identify the files to include in the pipeline. You can select individual files or folders. When you select folders, Textual processes all of the files in the folder.
For each file, Textual:
Converts the content to raw text. For image files, this means extracting any text that is present.
Uses its built-in models to detect entity values in the text.
Generates a Markdown version of the original text.
Produces a JSON file that contains:
The Markdown version of the text
The detected entities and their locations
From Textual, for each processed file, you can:
Textual also provides code snippets to help you to use the pipeline output.
For cloud storage pipelines, the JSON files also are available from the configured output location.
You can also configure pipelines to create redacted versions of the original values. For more information, go to Datasets workflow for text redaction.
Tags can help you to organize your datasets. For example, you can use tags to indicate datasets that belong to different groups, or that deal with specific areas of your data.
You can manage tags from both the Datasets page and the dataset details.
On the Datasets page, the Tags column displays the currently assigned tags.
To change the tag assignment for a dataset:
Click Tags.
On the dataset tags panel, to add a new tag, type the tag text, then press Enter.
To remove a tag, click its delete icon.
To remove all of the tags, click the delete all icon.
On the dataset details page, the assigned tags display under the dataset name.
To change the tag assignment:
Click Tags.
On the dataset tags panel, to add a new tag, type the tag text, then press Enter.
To remove a tag, click its delete icon.
To remove all of the tags, click the delete all icon.
From a file list, to display the details for a file, click the file name.
For files other than .txt files, the Original tab allows you to toggle between the generated Markdown and the rendered text.
For a .txt file, where there is no difference between the Markdown and the rendered text, the Original tab displays the file content.
In a pipeline that is configured to also generate redacted files, the Redacted <file type> option allows you to display the redacted version of a PDF or image file.
The Entities tab displays the file content with the detected entity values in context.
The actual values are followed by the type labels. For example, the given name John is displayed as John NAME_GIVEN.
The JSON tab contains the content of the output file. For cloud storage pipelines, the files are also in the output location that you configured for the pipeline.
For details about the JSON output structure for the different types of files, go to Structure of the pipeline output file JSON.
For a PDF or image file that contains one or more tables, the Tables tab displays the tables. If the file does not contain any tables, then the Tables tab does not display.
For a PDF or image file that contains key-value pairs, the Key-Value Pairs tab displays the key-value pairs. If the file does not contain key-value pairs, then the Key-Value Pairs tab does not display.
Tonic Textual respects the access control policy of your single sign-on (SSO) provider. To access Textual, users must be granted access to the Textual application within your SSO provider.
Self-hosted instances can use any of the available SSO options. Textual Cloud organizations can enable Okta SSO.
To enable SSO, you first complete the required configuration in the SSO provider. You then configure Textual to connect to it. For self-hosted instances, you use Textual environment variables for the configuration. For a Textual Cloud organization, you use the Single Sign-On tab on the Permission Settings page.
After you enable SSO, users can use SSO to create an account in Textual.
For self-hosted instances, to only allow SSO authentication, set the environment variable REQUIRE_SSO_AUTH to true. For Textual Cloud, this is configured in the application. When SSO is required, Textual disables standard email/password authentication. All account creation and login is handled through your SSO provider. If multi-factor authentication (MFA) is set up with your SSO, then all authentication must go through your provider's MFA.
You can view the list of SSO groups whose members have logged into Textual.
Tonic Textual supports the following SSO providers:
The Tonic Textual pay-as-you-go plan allows you to automatically bill a credit card for your Textual usage.
The Textual subscription plan charges a flat rate for each 1000 words. You are billed each month based on when you started your subscription. For example, if you start your subscription on the 12th of the month, then you are billed every month on the 12th.
Tonic.ai integrates with a payment processing solution to manage the payments.
To start a new subscription, from a usage pane or upgrade prompt, click Upgrade Plan.
You are sent to the payment processing solution to enter your payment information.
The panel on the Home page shows the usage for the current month.
To view additional usage details, click Manage Plan.
The Manage Plan page displays the details for your subscription.
The summary at the top left contains an overview of the subscription payment information, as well as the total number of words scanned since you started your account.
From the summary, you can go to the payment processing solution to view and manage payment information.
The graph at the top of the page shows the words scanned per day for the previous 30 days.
The Current Billing Period panel summarizes your usage for the current month, and provides information about the next payment.
The Next billing date panel shows when the next billing period begins.
The Payment History section shows the list of subscription payments.
For each payment, the list shows the date and amount, and whether the payment was successful.
To download the invoice for a payment, click its Invoice option.
You can update the payment information for your subscription. For example, you might need to choose a different credit card or update an expiration date.
To manage the payment information:
On the Home page, in the usage panel, click Manage Plan.
On the Manage Plan page, from the account summary, click Manage Payment.
You are sent to the payment processing solution to update your payment information.
To cancel a subscription, from the Manage Plan page:
Click Manage Payment.
In the payment processing solution, select the cancellation option.
The cancellation takes effect at the end of the current subscription month.
The Dataset Settings panel includes options for how Textual handles the following file components:
For .docx files, images and comments
For PDF files, scanned-in signatures
To display the Dataset Settings page, on the dataset details page, click Settings.
These options are not available for pipelines that also redact files.
For .docx images, including .svg files, you can configure the dataset to do one of the following:
Redact the image content. When you select this option, Textual looks for and blocks out sensitive values in the image.
Ignore the image.
Replace the images with black boxes.
On the Dataset Settings page, under Image settings for DOCX files:
To redact the image content, click Redact contents of images using OCR. This is the default selection.
To ignore the images entirely, click Ignore images during scan.
To replace the images with black boxes, click Replace images from the output file with black boxes.
For .docx tables, you can configure the dataset to either:
Redact the table content. When you select this option, Textual detects sensitive values and replaces them based on the entity type configuration.
Block out all of the table cells. When you select this option, Textual places a black box over each table cell.
On the Dataset Settings page, under Table settings for DOCX files:
To redact the table content, click Redact content using the entity type configuration. This is the default selection.
To block out the table content, click Block out all table cell content.
For comments in a .docx file, you can configure the dataset to either:
Remove the comments from the file.
Ignore the comments and leave them in the file.
On the Dataset Settings page, to remove the comments, toggle Remove comments from the output file to the on position. This is the default configuration.
To ignore the comments, toggle Remove comments from the output file to the off position.
By default, Textual redacts scanned-in signatures in PDF files. You can configure the dataset to instead ignore the signatures.
On the Dataset Settings page:
To redact PDF signatures, toggle Detect and redact signatures in PDFs to the on position. This is the default configuration.
To ignore PDF signatures, toggle Detect and redact signatures in PDFs to the off position.
In a dataset, for each built-in entity type, you can configure additional values to detect, and values to exclude. You cannot define added and excluded values for custom entity types.
You might add values that Textual does not detect because, for example, they are specific to your organization or industry.
You might exclude a value because:
Textual labeled the value incorrectly.
You do not want to redact a specific value. For example, you might want to preserve known test values.
Note that for a pipeline that redacts files, you cannot add or exclude specific values.
In the entity types list, the add values and exclude values icons indicate whether there are configured added and excluded values for the entity type.
When added or excluded values are configured, the corresponding icon is green.
When there are no configured values, the corresponding icon is black.
From the Configure Entity Detection panel, you configure both added and excluded values for entity types.
To display the panel, click the add values or exclude values icon for an entity type.
The panel contains an Add to detection tab for added values, and an Exclude from detection tab for excluded values.
The entity type dropdown list at the top of the Configure Entity Detection panel indicates the entity type to configure added and excluded values for.
The initial selected entity type is the entity type for which you clicked the icon. To configure values for a different entity type, select the entity type from the list.
On the Add to detection tab, you configure the added values for the selected entity type.
Each value can be a specific word or phrase, or a regular expression to identify the values to add. Regular expressions must be C# compatible.
To add an added value:
Click the empty entry.
Type the value into the field.
To edit an added value:
Click the value.
Update the value text.
For each added value, you can test whether Textual correctly detects it.
To test a value:
From the Test Entry dropdown list, select the number for the value to test.
In the text field, type or paste content that contains a value or values that Textual should detect.
The Results field displays the text and highlights matching values.
To remove an added value, click its delete icon.
On the Exclude from detection tab, you configure the excluded values for the selected entity type.
Each value can be either a specific word or phrase to exclude, or a regular expression to identify the values to exclude. The regular expression must be C# compatible.
You can also provide a specific context within which to ignore a value. For example, in the phrase "one moment, please", you probably do not want the word "one" to be detected as a numeric value. If you specify "one moment, please" as an excluded value for the numeric entity type, then "one" is not identified as a number when it is seen in that context.
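The context example above can be sketched in a few lines. This is an illustration of the idea, not Textual's matching logic. The function name drop_excluded_in_context and the NUMERIC_VALUE label used in the usage note are assumptions, and the sketch matches the excluded phrase as a literal, case-sensitive string.

```python
def drop_excluded_in_context(text, detected, entity_type, excluded_phrase):
    """Remove detections of entity_type that fall inside the excluded phrase.

    Detections are (start, end, entity_type) tuples.
    """
    # Collect the span of every occurrence of the excluded phrase.
    phrase_spans = []
    start = text.find(excluded_phrase)
    while start != -1:
        phrase_spans.append((start, start + len(excluded_phrase)))
        start = text.find(excluded_phrase, start + 1)

    def inside_phrase(value_start, value_end):
        # True when the value lies entirely within one phrase occurrence.
        return any(ps <= value_start and value_end <= pe for ps, pe in phrase_spans)

    return [
        (s, e, etype)
        for (s, e, etype) in detected
        if not (etype == entity_type and inside_phrase(s, e))
    ]
```

For the text "Give me one moment, please. I need one laptop." with both instances of "one" detected as NUMERIC_VALUE, excluding "one moment, please" removes only the first detection. The second "one" is still treated as a number.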
To add an excluded value:
Click the empty entry.
Type the value into the field.
To edit an excluded value:
Click the value.
Update the value text.
For each excluded value, you can test whether Textual correctly detects it.
To test the value that you are currently editing:
From the Test Entry dropdown list, select the number for the value to test.
In the text field, type or paste content that contains a value or values to exclude.
The Results field displays the text and highlights matching values.
To remove an excluded value, click its delete icon.
The new added and excluded values are not reflected in the entity types list until Textual runs a new scan.
When you save the changes, you can choose whether to immediately run a new scan on the dataset files.
To save the changes and also start a scan, click Save and Scan Files.
To save the changes, but not run a scan, click Save Without Scanning Files. When you do not run the scan, then on the dataset details page, Textual displays a prompt to run a scan.
A dataset might contain multiple files that have the same structure, such as a set of scanned-in forms.
Instead of adding the same manual overrides for each file, you can use a PDF file in the dataset to create a template that you can apply to other PDF files in the dataset.
When you , you can apply a template.
To add a PDF template to a dataset:
On the dataset details page, click PDF Templates.
On the template creation and selection panel, click Create a New Template.
On the template details page:
In the Name field, provide a name for the template.
From the file dropdown list, select the dataset file to use to create the template.
Add the manual overrides to the file.
When you finish adding the manual overrides, click Save New Template.
When you update a PDF template, it affects any files that use the template.
To update a PDF template:
On the dataset details page, click PDF Templates.
Under Edit an Existing Template, select the template, then click Edit Selected Template.
On the template details panel, you can change the template name, and add or remove manual overrides.
To save the changes, click Update Template.
On the template details panel, to add a manual override to a file:
Select the type of override. To indicate to ignore any automatically detected values in the selected area, click Ignore Redactions. To indicate to redact the selected area, click Add Manual Redaction.
Use the mouse to draw a box around the area to select.
Tonic Textual adds the override to the Redactions list. The icon indicates the type of override.
To select and highlight a manual override in the file content, in the Redactions list, click the navigate icon for the override.
To remove a manual override, in the Redactions list, click the delete icon for the override.
When you delete a PDF template, the template and its manual overrides are removed from any files that the template was assigned to.
To delete a PDF template:
On the dataset details page, click PDF Templates.
Under Edit an Existing Template, select the template, then click Edit Selected Template.
On the template details panel, click Delete.
When you first create a dataset, Tonic Textual displays a single list of all of the entity types that it can detect.
As you add and remove files, Textual updates the entity types list to indicate the detected and not detected entity types.
At the top of the dataset details view, the Sensitive words tile shows the total number of sensitive values in the dataset that Textual detected.
As Textual processes files, it identifies the entity types that are detected and not detected.
The entity type list starts with the detected entity types. For each detected entity type, Textual displays:
The number of detected values that are marked as this type in the output file. Excluded values are not included in the count.
The selected handling option.
Whether there are configured added or excluded values.
For each detected entity type, to view a sample of up to 10 of the detected values, click the view icon next to the value count.
The entities list contains the full list of detected values for an entity type.
To display the entities list, from the value preview, click Open Entities Manager.
When you display the entities list, the entity type that you previewed the values for is selected by default.
To change the selected entity type, from the dropdown at the top left, select the entity type to view values for.
A detected value might match multiple entity types.
For example, a telephone number might match both the Phone Number and Numeric Value entity types.
Every value is only counted once, for the entity type that it is assigned in the output file.
By default, a detected value is assigned the entity type that it most closely matches. For our example, the telephone number value most closely matches the Phone Number entity type, and so by default is included in the Phone Number count and values list.
If the entity type is turned off, or the value is excluded, then Textual moves the value to the next matching type.
In our example, if you set the handling type for Phone Number to Off, then the telephone number value is added to the count and values list for the Numeric Value entity type.
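The fallback behavior in this example can be expressed as a short sketch. It is not Textual's implementation. The function name assign_entity_type and the PHONE_NUMBER and NUMERIC_VALUE labels are assumptions, as are the default handling option of Redact and the choice to return None when no matching type remains.

```python
def assign_entity_type(ranked_matches, handling, excluded_types=frozenset()):
    """Pick the entity type that a value is counted under.

    ranked_matches lists the matching entity types from closest to weakest match.
    handling maps an entity type to its handling option.
    excluded_types holds the types for which this value is an excluded value.
    """
    for entity_type in ranked_matches:
        # Skip types that are turned off.
        if handling.get(entity_type, "Redact") == "Off":
            continue
        # Skip types where the value is configured as excluded.
        if entity_type in excluded_types:
            continue
        return entity_type
    # Assumption: the value is not counted when no matching type remains.
    return None
```

With ranked matches of PHONE_NUMBER followed by NUMERIC_VALUE, the value is counted as PHONE_NUMBER by default. When the handling option for PHONE_NUMBER is Off, the same call returns NUMERIC_VALUE, so the value is counted only once in either case.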
The entities list groups the entities by the file and, if relevant, the page where they were detected.
For each value, the list includes:
The original value.
The original value in the context of its surrounding text.
The redacted or synthesized value in the context of its surrounding text, based on the selected handling option.
Below the list of detected entity types is the Entity types not found list, which contains the list of entity types that Textual did not detect in the files.
You can filter the entity types list by text in the type name or description. The filter applies to both the detected and undetected entity types.
To filter the types, in the filter field, begin to type text that is in the entity type name or description.
In Tonic Textual, a pipeline identifies a set of files that Textual processes into content that can be imported into an LLM system.
To display the Pipelines page, in the Textual navigation menu, click Pipelines.
The pipelines list only displays the pipelines that you have access to.
Users who have the global permission View all pipelines can see the complete list of pipelines.
For each pipeline, the list includes:
The name of the pipeline
Any tags assigned to the pipeline, as well as an option to add tags. For more information, go to .
When the pipeline was most recently updated
The user who most recently updated the pipeline
If there are no pipelines, then the Pipelines page displays a panel to allow you to create a pipeline.
To display the details for a pipeline, on the Pipelines page, click the pipeline name.
For a cloud storage pipeline (Amazon S3, Databricks, Azure, or SharePoint), the details include:
The tags that are assigned to the pipeline, as well as an option to add tags. For more information, go to .
The Run Pipeline option, which starts a new pipeline run. For more information, go to .
The settings option, which you use to change the configuration settings for the pipeline. For more information, go to .
The list of processed files. For more information, go to .
The list of pipeline runs. For more information, go to .
For pipelines that are configured to also redact files, the redaction configuration. For more information, go to .
File statistics for the pipeline files. For more information, go to .
For an uploaded file pipeline, the pipeline details include:
The tags that are assigned to the pipeline, plus an option to add tags. For more information, go to .
The Upload Files option, which you use to add files to the pipeline. For more information, go to .
The settings option, which you use to change the configuration settings for the pipeline. For more information, go to .
The list of files in the pipeline. Includes both new and processed files. For more information, go to .
For pipelines that are configured to also redact files, the redaction configuration. For more information, go to .
File statistics for the pipeline files. For more information, go to .
For an Azure pipeline, the settings include:
Azure credentials
Output location
Whether to also generate redacted versions of the original files
Selected files and folders
When you create a pipeline that uses files from Azure, you are prompted to provide the credentials to use to connect to Azure.
From the Pipeline Settings page, to change the credentials:
Click Update Azure Credentials.
Provide the new credentials:
In the Account Name field, provide the name of your Azure account.
In the Account Key field, provide the access key for your Azure account.
To test the connection, click Test Azure Connection.
To save the new credentials, click Update Azure Credentials.
On the Pipeline Settings page, under Select Output Location, navigate to and select the folder in Azure where Textual writes the output files.
When you run a pipeline, Textual creates a folder in the output location. The folder name is the pipeline job identifier.
Within the job folder, Textual recreates the folder structure for the original files. It then creates the JSON output for each file. The name of the JSON file is <original filename>_<original extension>_parsed.json.
If the pipeline is also configured to generate redacted versions of the files, then Textual writes the redacted version of each file to the same location.
For example, for the original file Transaction1.txt, the output for a pipeline run contains:
Transaction1_txt_parsed.json
Transaction1.txt
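The output naming convention can be sketched as follows. This is an illustration that assumes forward-slash paths and a single file extension. The function names and the job identifier job-123 are hypothetical, and how Textual names output for files with no extension or with multiple dots is not covered here.

```python
from pathlib import PurePosixPath


def parsed_json_name(original_name: str) -> str:
    """Build <original filename>_<original extension>_parsed.json."""
    path = PurePosixPath(original_name)
    extension = path.suffix.lstrip(".")
    return f"{path.stem}_{extension}_parsed.json"


def output_paths(job_id: str, relative_path: str, include_redacted: bool = False):
    """List the output files for one source file, relative to the output location."""
    source = PurePosixPath(relative_path)
    # The job folder mirrors the folder structure of the original files.
    folder = PurePosixPath(job_id) / source.parent
    paths = [str(folder / parsed_json_name(source.name))]
    if include_redacted:
        # The redacted file is written to the same location under its original name.
        paths.append(str(folder / source.name))
    return paths
```

For example, output_paths("job-123", "invoices/Transaction1.txt", include_redacted=True) lists job-123/invoices/Transaction1_txt_parsed.json followed by job-123/invoices/Transaction1.txt.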
By default, when you run an Azure pipeline, Textual only generates the JSON output.
To also generate versions of the original files that redact or synthesize the detected entity values, toggle Synthesize Files to the on position.
For information on how to configure the file generation, go to .
One option for selected folders is to filter the processed files based on the file extension. For example, in a selected folder, you might only want to process .txt and .csv files.
Under File Processing Settings, select the file extensions to include. To add a file type, select it from the dropdown list. To remove a file type, click its delete icon.
Note that this filter does not apply to individually selected files. Textual always processes those files regardless of file type.
Under Select files and folders to add to run, navigate to and select the folders and individual files to process.
To add a folder or file to the pipeline, check its checkbox.
When you check a folder checkbox, Textual adds it to the Prefix Patterns list. It processes all of the applicable files in the folder, based on whether the file type is a type that Textual supports and whether it is included in the file type filter.
When you click the folder name, it displays the folder contents.
When you select an individual file, Textual adds it to the Selected Files list.
To delete a file or folder, either:
In the navigation pane, uncheck the checkbox.
In the Prefix Patterns or Selected Files list, click its delete icon.
For an Amazon S3 pipeline, the settings include:
AWS credentials
Output location
Whether to also generate redacted versions of the original files
Selected files and folders
When you create a pipeline that uses files from Amazon S3, you are prompted to provide the credentials to use to connect to Amazon S3.
From the Pipeline Settings page, to change the credentials:
Click Update AWS Credentials.
Provide the new credentials:
In the Access Key field, provide an AWS access key that is associated with an IAM user or role. For an example of an IAM role that has the required permissions for an Amazon S3 pipeline, go to .
In the Access Secret field, provide the secret key that is associated with the access key.
From the Region dropdown list, select the AWS Region to send the authentication request to.
In the Session Token field, provide the session token to use for the authentication request.
To test the connection, click Test AWS Connection.
To save the new credentials, click Update AWS Credentials.
On the Pipeline Settings page, under Select Output Location, navigate to and select the folder in Amazon S3 where Textual writes the output files.
When you run a pipeline, Textual creates a folder in the output location. The folder name is the pipeline job identifier.
Within the job folder, Textual recreates the folder structure for the original files. It then creates the JSON output for each file. The name of the JSON file is <original filename>_<original extension>_parsed.json.
If the pipeline is also configured to generate redacted versions of the files, then Textual writes the redacted version of each file to the same location.
For example, for the original file Transaction1.txt, the output for a pipeline run contains:
Transaction1_txt_parsed.json
Transaction1.txt
By default, when you run an Amazon S3 pipeline, Textual only generates the JSON output.
To also generate versions of the original files that redact or synthesize the detected entity values, toggle Synthesize Files to the on position.
For information on how to configure the file generation, go to .
One option for selected folders is to filter the processed files based on the file extension. For example, in a selected folder, you might only want to process .txt and .csv files.
Under File Processing Settings, select the file extensions to include. To add a file type, select it from the dropdown list. To remove a file type, click its delete icon.
Note that this filter does not apply to individually selected files. Textual always processes those files regardless of file type.
Under Select files and folders to add to run, navigate to and select the folders and individual files to process.
To add a folder or file to the pipeline, check its checkbox.
When you check a folder checkbox, Textual adds it to the Prefix Patterns list. It processes all of the applicable files in the folder, based on whether the file type is a type that Textual supports and whether it is included in the file type filter.
When you click the folder name, it displays the folder contents.
When you select an individual file, Textual adds it to the Selected Files list.
To delete a file or folder, either:
In the navigation pane, uncheck the checkbox.
In the Prefix Patterns or Selected Files list, click its delete icon.
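The way that prefix patterns, selected files, and the file type filter combine can be sketched as follows. This is not Textual's code. The function name files_to_process and the sample object keys are hypothetical, the set of supported extensions is taken from the file types that this documentation lists, and the sketch assumes that an empty filter means that all supported types are included.

```python
import posixpath

# File types that this documentation lists as supported.
SUPPORTED_EXTENSIONS = {
    ".txt", ".csv", ".docx", ".xlsx", ".pdf",
    ".png", ".tif", ".tiff", ".jpg", ".jpeg",
}


def files_to_process(all_keys, prefix_patterns, selected_files, extension_filter=None):
    """Return the keys that a pipeline run would process, in listing order."""
    chosen = []
    for key in all_keys:
        extension = posixpath.splitext(key)[1].lower()
        if key in selected_files:
            # Individually selected files bypass the file type filter.
            chosen.append(key)
        elif any(key.startswith(prefix) for prefix in prefix_patterns):
            # Files in selected folders must be a supported type and pass the filter.
            passes_filter = not extension_filter or extension in extension_filter
            if extension in SUPPORTED_EXTENSIONS and passes_filter:
                chosen.append(key)
    return chosen
```

With the prefix pattern forms/, a filter of .pdf only, and the individually selected file other/e.txt, the run includes forms/a.pdf and other/e.txt, but skips forms/b.txt because the filter excludes it.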
For a Databricks pipeline, the settings include:
Databricks credentials
Output location
Whether to also generate redacted versions of the original files
Selected files and folders
When you create a pipeline that uses files from Databricks, you are prompted to provide the credentials to use to connect to Databricks.
From the Pipeline Settings page, to change the credentials:
Click Update Databricks Credentials.
Provide the new credentials:
In the Databricks URL field, provide the URL to the Databricks workspace.
In the Access Token field, provide the access token to use to get access to the volume.
To test the connection, click Test Databricks Connection.
To save the new credentials, click Update Databricks Credentials.
On the Pipeline Settings page, under Select Output Location, navigate to and select the folder in Databricks where Textual writes the output files.
When you run a pipeline, Textual creates a folder in the output location. The folder name is the pipeline job identifier.
Within the job folder, Textual recreates the folder structure for the original files. It then creates the JSON output for each file. The name of the JSON file is <original filename>_<original extension>_parsed.json.
If the pipeline is also configured to generate redacted versions of the files, then Textual writes the redacted version of each file to the same location.
For example, for the original file Transaction1.txt, the output for a pipeline run contains:
Transaction1_txt_parsed.json
Transaction1.txt
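As a rough sketch of the naming and folder layout described above, the following helper computes where the output for one source file would be expected. The function and parameter names are hypothetical, and the behavior for files that have no extension is not covered by the description, so it is left unspecified here.

```python
from pathlib import PurePosixPath


def expected_outputs(output_location, job_id, relative_path, redacted=False):
    """List the output paths expected for one source file.

    Assumes that relative_path is the file's path relative to the
    selected root, and that job_id is the pipeline job identifier.
    """
    source = PurePosixPath(relative_path)
    # The job folder mirrors the folder structure of the original files.
    folder = PurePosixPath(output_location) / job_id / source.parent
    json_name = f"{source.stem}_{source.suffix.lstrip('.')}_parsed.json"
    outputs = [str(folder / json_name)]
    if redacted:
        # The redacted copy keeps the original name, in the same folder.
        outputs.append(str(folder / source.name))
    return outputs
```

Calling expected_outputs("out", "job123", "finance/Transaction1.txt", redacted=True) yields the JSON file Transaction1_txt_parsed.json and the redacted Transaction1.txt, both under out/job123/finance.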
By default, when you run a Databricks pipeline, Textual only generates the JSON output.
To also generate versions of the original files that redact or synthesize the detected entity values, toggle Synthesize Files to the on position.
For information on how to configure the file generation, go to Configuring file synthesis for a pipeline.
One option for selected folders is to filter the processed files based on the file extension. For example, in a selected folder, you might only want to process .txt and .csv files.
Under File Processing Settings, select the file extensions to include. To add a file type, select it from the dropdown list. To remove a file type, click its delete icon.
Note that this filter does not apply to individually selected files. Textual always processes those files regardless of file type.
Under Select files and folders to add to run, navigate to and select the folders and individual files to process.
To add a folder or file to the pipeline, check its checkbox.
When you check a folder checkbox, Textual adds it to the Prefix Patterns list. It processes all of the applicable files in the folder, based on whether the file type is a type that Textual supports and whether it is included in the file type filter.
When you click the folder name, it displays the folder contents.
When you select an individual file, Textual adds it to the Selected Files list.
To delete a file or folder, either:
In the navigation pane, uncheck the checkbox.
In the Prefix Patterns or Selected Files list, click its delete icon.
For an uploaded file pipeline, the Files tab contains the list of all of the pipeline files.
For cloud storage pipelines, you use the pipeline details page to track processed files and pipeline runs.
For pipelines that are configured to also redact files, you can configure the redaction for the detected entity types. For more information, go to .
For uploaded file pipelines, when you add a file to the pipeline, it is automatically added to the file list.
For cloud storage pipelines, the file list is not populated until you run the pipeline. The list only contains processed files.
The statistics panels at the right of the pipeline details page provide a summary of information about the pipeline files, the detected entities, and the detected topics.
The File Statistics panel displays the following values.
Total # of files - The number of files in the pipeline.
Total # of words - The number of words that the files contain.
Entities detected - The number of entity types for which Textual detected values in the files.
Topics detected - The number of topics that the files contain. A topic is a subject area that is common across multiple files. If the pipeline files contain completely unrelated content, then Textual might not detect any topics.
The entity types panel displays the 5 entity types that have the largest number of values in the pipeline files.
For each entity type, the panel displays the value count.
If there are more than 5 detected entity types, to display the full list of detected entity types, click View All.
The topics panel displays the 5 topics that are present in the most files.
For each topic, the panel displays the number of files that include that topic.
If there are more than 5 detected topics, to display the full list of detected topics, click View All.
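The two panels rank different things: entity types are ranked by the number of detected values, and topics are ranked by the number of files that contain them. A minimal sketch of that distinction, using made-up input shapes (a list of (entity type, value) pairs and a mapping from file name to topics), might look like this:

```python
from collections import Counter


def top_entity_types(detected_values, limit=5):
    """Rank entity types by how many values were detected for each."""
    counts = Counter(entity_type for entity_type, _value in detected_values)
    return counts.most_common(limit)


def top_topics(topics_by_file, limit=5):
    """Rank topics by how many files include them."""
    counts = Counter()
    for topics in topics_by_file.values():
        # A topic counts once per file, however often it appears there.
        counts.update(set(topics))
    return counts.most_common(limit)
```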
On the pipeline details page for a cloud storage pipeline, the Pipeline Runs tab displays the list of pipeline runs.
For each run, the list includes:
Run identifier
When the run was started
The current status of the pipeline run. The possible statuses are:
Queued - The pipeline run has not started to run yet.
Running - The pipeline run is in progress.
Completed - The pipeline run completed successfully.
Failed - The pipeline run failed.
For a pipeline run, to display the list of files that the pipeline run includes, click View Run.
For each file, the list includes the following information:
File name
For cloud storage files, the path to the file
The status of the file processing. The possible statuses are:
Unprocessed - The file is added, but a pipeline run to process it has not yet started. This only applies to uploaded files that were added since the most recent pipeline run.
Queued - A pipeline run was started but the file is not yet processed.
Running - The file is being processed.
Completed - The file was processed successfully.
Failed - The file could not be processed.
Instead of an API key, you can use the Textual API to obtain a JSON Web Token (JWT) to use for authentication.
By default, a JWT is valid for 30 minutes.
On a self-hosted instance, to configure a different lifetime, set the environment variable SOLAR_JWT_EXPIRATION_IN_MINUTES.
You use a refresh token to obtain a new JWT. By default, a refresh token is valid for 10,000 minutes, which is roughly equivalent to 7 days.
On a self-hosted instance, to configure a different lifetime, set the environment variable SOLAR_REFRESH_TOKEN_EXPIRATION_IN_MINUTES.
To obtain your first JWT and refresh token, you make a login request to the Textual API. Before you can make this call, you must have a Textual account.
To make the call, perform a POST operation against:
The request payload is:
For example:
In the response:
The jwt property contains the JWT.
The refreshToken property contains the refresh token.
You use the refresh token to obtain both a new JWT and a new refresh token.
To obtain the new JWT and token, perform a POST operation against:
The request payload is:
In the response:
The jwt property contains the new JWT.
The refreshToken property contains the new refresh token.
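The login and refresh calls can be sketched with the Python standard library. The endpoint paths (/api/auth/login and /api/auth/token_refresh) and the payload and response property names come from this page. The helper names, the Content-Type header, and the base URL handling are assumptions for illustration only.

```python
import json
import urllib.request


def _post_json(url, payload):
    """Build (but do not send) a JSON POST request."""
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        # Assumption: the API expects a JSON content type.
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def build_login_request(base_url, username, password):
    url = f"{base_url.rstrip('/')}/api/auth/login"
    return _post_json(url, {"userName": username, "password": password})


def build_refresh_request(base_url, refresh_token):
    url = f"{base_url.rstrip('/')}/api/auth/token_refresh"
    return _post_json(url, {"refreshToken": refresh_token})


def read_tokens(response_body):
    """Extract the JWT and the refresh token from a response body."""
    payload = json.loads(response_body)
    return payload["jwt"], payload["refreshToken"]


def send(request):
    """Send a prepared request and return (jwt, refresh_token)."""
    with urllib.request.urlopen(request) as response:
        return read_tokens(response.read())
```

Because each refresh returns a new refresh token as well as a new JWT, a caller would store both values from every response and use the latest refresh token for the next call.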
Tonic Textual provides a certificate for HTTPS traffic, but on a self-hosted instance, you can also use a user-provided certificate. The certificate must use the PFX format and be named solar.pfx.
To use your own certificate, you must:
Add the environment variable SOLAR_PFX_PASSWORD.
Use a volume mount to provide the certificate file. Textual uses volume mounting to give the Textual containers access to the certificate.
You must apply the changes to both the Textual web server and Textual worker containers.
To use your own certificate, you make the following changes to the docker-compose.yml file.
Add the environment variable SOLAR_PFX_PASSWORD, which contains the certificate password.
Place the certificate on the host machine, then share it to the containers as a volume.
You must map the certificate to /certificates on the containers.
Copy the following:
You must add the environment variable SOLAR_PFX_PASSWORD, which contains the certificate password.
You can use any volume type that is allowed within your environment. It must provide at least access.
You map the certificate to /certificates on the containers. Within your web server and worker deployment YAML files, the entry should be similar to the following:
Configure environment variables
How to set environment variable values and restart Textual in Docker and Kubernetes.
Set the number of textual-ml workers
Used to enable parallel processing in Textual.
Set a custom certificate
Provide a custom certificate to use for https traffic.
Configure custom AWS endpoints
Set custom endpoint URLs for calls to AWS services.
Enable PDF and image processing
Set the required configuration based on the OCR option that you want to use.
Enable uploads to uploaded file pipelines
Provide the required access to Amazon S3.
Configure model preferences
Select an auxiliary model and configure model usage for GPU.
Azure
Use Azure to enable SSO on Textual.
GitHub
Use GitHub to enable SSO on Textual.
Google
Use Google to enable SSO on Textual.
Keycloak
Use Keycloak to enable SSO on Textual.
Okta
Use Okta to enable SSO on Textual.
Available for both self-hosted instances and Textual Cloud.
<Textual_URL>/api/auth/login
{"userName": "<Textual username>",
"password": "<Textual password>"}
{"userName": "[email protected]",
"password": "MyPassword123!"}
<TEXTUAL_URL>/api/auth/token_refresh
{"refreshToken": "<refresh token>"}
volumes:
  ...
  - /my-host-path:/certificates
volumeMounts:
  - name: <my-volume-name>
    mountPath: /certificates
Snowpark Container Services (SPCS) allows developers to run containerized workloads directly within Snowflake. Because Tonic Textual is distributed using a private Docker repository, you can use these images in SPCS to run Textual workloads.
It is quicker to use the Snowflake Native App, but SPCS allows for more customization.
To use the Textual images, you must add them to Snowflake. The Snowflake documentation and tutorial walk through the process in great detail, but the basic steps are as follows:
To pull down the required images, you must have access to our private Docker image repository on Quay.io. You should have been provided credentials during onboarding. If you require new credentials, or you experience issues accessing the repository, contact [email protected]. Once you have access, pull down the following images:
textual-snowflake
Either textual-ml or textual-ml-gpu, depending on whether you plan to use a GPU compute pool
The images are now available in Snowflake.
The API service exposes the functions that are used to redact sensitive values in Snowflake. The service must be attached to a compute pool. You can scale the instances as needed, but you likely only need one API.
DROP SERVICE IF EXISTS api_service;
CREATE SERVICE api_service
  IN COMPUTE POOL compute_pool
  FROM SPECIFICATION $$
spec:
  containers:
  - name: api_container
    image: /your_db/your_schema/your_image_repository/textual-snowflake:latest
    env:
      ML_SERVICE_URL: https://ml-service:7701
  endpoints:
  - name: api_endpoint
    port: 9002
    protocol: HTTP
$$
  MIN_INSTANCES=1
  MAX_INSTANCES=1;
Next, you create the ML service, which recognizes personally identifiable information (PII) and other sensitive values in text. This is more likely to need scaling.
DROP SERVICE IF EXISTS ml_service;
CREATE SERVICE ml_service
  IN COMPUTE POOL compute_pool
  FROM SPECIFICATION $$
spec:
  containers:
  - name: ml_container
    image: /your_db/your_schema/your_image_repository/textual-ml:latest
  endpoints:
  - name: ml_endpoint
    port: 7701
    protocol: TCP
$$
  MIN_INSTANCES=1
  MAX_INSTANCES=1;
You can create custom SQL functions that use your API and ML services. These functions are accessible from directly within Snowflake.
CREATE OR REPLACE FUNCTION textual_redact(input_text STRING, config STRING)
RETURNS STRING
SERVICE = your_db.your_schema.api_service
ENDPOINT = 'api_endpoint'
AS '/api/redact';
CREATE OR REPLACE FUNCTION textual_redact(input_text STRING)
RETURNS STRING
SERVICE = your_db.your_schema.api_service
CONTEXT_HEADERS = (current_user)
ENDPOINT = 'api_endpoint'
AS '/api/redact';
CREATE OR REPLACE FUNCTION textual_parse(PATH VARCHAR, STAGE_NAME VARCHAR, md5sum VARCHAR)
RETURNS STRING
SERVICE = your_db.your_schema.api_service
CONTEXT_HEADERS = (current_user)
ENDPOINT = 'api_endpoint'
MAX_BATCH_ROWS = 10
AS '/api/parse/start';
It can take a couple of minutes for the containers to start. After the containers are started, you can use the functions that you created in Snowflake.
To test the functions, use an existing table. You can also create this simple test table:
CREATE TABLE Messages (
Message TEXT
);
INSERT INTO Messages (Message) VALUES ('Hi my name is John Smith');
INSERT INTO Messages (Message) VALUES ('Hi John, mine is Jane Doe');
You use the function in the same way as any other user-defined function. You can pass in additional configuration to determine how to process specific types of sensitive values.
For example:
SELECT Message, textual_redact(Message) as REDACTED, textual_redact(Message, PARSE_JSON('{"NAME_GIVEN": "Synthesis", "NAME_FAMILY": "Off"}')) as SYNTHESIZED FROM MESSAGES;
By default, the function redacts the entity values. In other words, it replaces the values with a placeholder that includes the type. Synthesis indicates to replace the value with a realistic replacement value. Off indicates to leave the value as is.
The textual_redact function works identically to the textual_redact function in the Snowflake Native App.
The response from the above example should look something like this:
MESSAGE: Hi my name is John Smith
REDACTED: Hi my name is [NAME_GIVEN_Kx0Y7] [NAME_FAMILY_s9TTP0]
SYNTHESIZED: Hi my name is Lamar Smith
MESSAGE: Hi John, mine is Jane Doe
REDACTED: Hi [NAME_GIVEN_Kx0Y7], mine is [NAME_GIVEN_veAy9] [NAME_FAMILY_6eC2]
SYNTHESIZED: Hi Lamar, mine is Doris Doe
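Downstream code sometimes needs to find the placeholders in redacted text. The following sketch extracts them with a regular expression. The pattern is an assumption that is based only on the sample output above (an uppercase entity type, an underscore, then an alphanumeric identifier, all in square brackets). It is not a documented format guarantee.

```python
import re

# Assumed token shape, inferred from the sample output:
# [<ENTITY_TYPE>_<identifier>], where the identifier has no underscores.
TOKEN_PATTERN = re.compile(r"\[([A-Z][A-Z_]*)_([A-Za-z0-9]+)\]")


def find_tokens(redacted_text):
    """Return (entity_type, identifier) pairs found in redacted text."""
    return TOKEN_PATTERN.findall(redacted_text)
```

In the sample output, the given name John maps to the same identifier in both rows, so comparing the extracted pairs is one way to check that repeated values receive the same placeholder.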
The textual_parse function works identically to the textual_parse function in the Snowflake Native App.
For a Sharepoint pipeline, the settings include:
Azure credentials
Output location
Whether to also generate redacted versions of the original files
Selected files and folders
When you create a pipeline that uses files from Sharepoint, you are prompted to provide the credentials to use to connect to the Entra ID application.
The credentials must have the following application permissions (not delegated permissions):
Files.Read.All - To see the Sharepoint files
Files.ReadWrite.All - To write redacted files and metadata back to Sharepoint
Sites.ReadWrite.All - To view and modify the Sharepoint sites
From the Pipeline Settings page, to change the credentials:
Click Update Sharepoint Credentials.
In the Tenant ID field, provide the Sharepoint tenant identifier for the Sharepoint site.
In the Client ID field, provide the client identifier for the Sharepoint site.
In the Client Secret field, provide the secret to use to connect to the Sharepoint site.
To test the connection, click Test Connection.
To save the new credentials, click Update Sharepoint Credentials.
On the Pipeline Settings page, under Select Output Location, click the edit icon, then navigate to and select the folder in Sharepoint where Textual writes the output files.
When you run a pipeline, Textual creates a folder in the output location. The folder name is the pipeline job identifier.
Within the job folder, Textual recreates the folder structure for the original files. It then creates the JSON output for each file. The name of the JSON file is <original filename>_<original extension>_parsed.json.
If the pipeline is also configured to generate redacted versions of the files, then Textual writes the redacted version of each file to the same location.
For example, for the original file Transaction1.txt, the output for a pipeline run contains:
Transaction1_txt_parsed.json
Transaction1.txt
By default, when you run a Sharepoint pipeline, Textual only generates the JSON output.
To also generate versions of the original files that redact or synthesize the detected entity values, toggle Synthesize Files to the on position.
For information on how to configure the file generation, go to Configuring file synthesis for a pipeline.
One option for selected folders is to filter the processed files based on the file extension. For example, in a selected folder, you might only want to process .txt and .csv files.
Under File Processing Settings, select the file extensions to include. To add a file type, select it from the dropdown list. To remove a file type, click its delete icon.
Note that this filter does not apply to individually selected files. Textual always processes those files regardless of file type.
Under Select files and folders to add to run, navigate to and select the folders and individual files to process.
To add a folder or file to the pipeline, check its checkbox.
When you check a folder checkbox, Textual adds it to the Prefix Patterns list. It processes all of the applicable files in the folder, based on whether the file type is a type that Textual supports and whether it is included in the file type filter.
When you click the folder name, it displays the folder contents.
When you select an individual file, Textual adds it to the Selected Files list.
To delete a file or folder, either:
In the navigation pane, uncheck the checkbox.
In the Prefix Patterns or Selected Files list, click its delete icon.
From the file list, to display the preview, either:
Click the file name.
Click the options menu, then click Preview.
On the left, the preview displays the original data. The detected entity values are highlighted.
On the right, the preview displays the data with replacement values that are based on the dataset configuration for the detected entity types.
Note that in the preview, the redacted values do not include the identifier. They only include the entity type. For example, NAME_GIVEN instead of NAME_GIVEN_1d9w5. The identifiers are included when you download the files.
For a PDF or image file, for entity types that use the Redact handling option:
If there is space to display the entity type, then it is displayed.
Otherwise, the value is covered by a black box.
When you hover over a black box, the entity type displays in a tooltip:
To view the entity type labels, you can also zoom into the file.
The preview for a PDF file also includes any manual overrides.
For .txt, .csv, and .docx files, you can use the preview to select the entity type handling option for each entity type. The options are:
Redact - This is the default value. Textual replaces the value with the name of the entity type followed by a unique identifier.
For example, the first name John is replaced with NAME_GIVEN_12345. Note that the identifier is only visible in the downloaded file. It does not display on the preview.
Synthesize - Textual replaces the value with a realistic generated value. For example, the first name John is replaced with the first name Michael. The replacement values are consistent, which means that a given value always has the same replacement. For example, Michael is always the replacement value for John.
Off - Textual ignores the value and copies it as is to the output file.
To select the entity type handling option:
In the results panel, click a detected value.
On the panel, click the entity type handling option. Textual applies the same option to all entity values of that type.
From the preview, you can only select the entity type handling option. For the Synthesize option, you cannot configure synthesis options for an entity type. You must configure those options from the dataset details page. For more information, go to Configuring synthesis options.
On the file details for a pipeline PDF or image file, on the Original tab:
To display the original file content, click Rendered.
To display the version of the file with the replacement values, click Redacted <file type>.
By default, when you:
Configure a dataset
Redact a string
Retrieve a redacted file
Textual does the following:
For the string and file redaction, replaces detected values with tokens.
For LLM synthesis, generates realistic synthesized values.
When you make the request, you can:
Override the default behavior.
For individual files and text strings, specify custom entity types to include.
For each entity type, you can choose to redact, synthesize, or ignore the value.
When you redact a value, Textual replaces the value with a token that consists of the entity type. For example, ORGANIZATION.
When you synthesize a value, Textual replaces the value with a different realistic value.
When you ignore a value, Textual passes through the original value.
To specify the handling option for entity types, you use the generator_config parameter.
generator_config={'<entity_type>':'<handling_option>'}
Where:
<entity_type> is the identifier of the entity type. For example, ORGANIZATION.
For the list of built-in entity types that Textual scans for, go to Built-in entity types.
For custom entity types, the identifier is the entity type name in all caps. Spaces are replaced with underscores, and the identifier is prefixed with CUSTOM_.
For example, for a custom entity type named My New Type, the identifier is CUSTOM_MY_NEW_TYPE.
From the Custom Entity Types page, to copy the identifier of a custom entity type, click its copy icon.
<handling_option> is the handling option to use for the specified entity type. The possible values are Redact, Synthesis, and Off.
For example, to synthesize organization values, and ignore languages:
generator_config={'ORGANIZATION':'Synthesis', 'LANGUAGE':'Off'}
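The identifier rule for custom entity types can be expressed as a one-line helper. This is a sketch of the rule as stated above (uppercase, spaces to underscores, CUSTOM_ prefix). The function name is hypothetical, and names that contain other punctuation are not covered by the description, so copying the identifier from the Custom Entity Types page remains the safer option.

```python
def custom_entity_identifier(name: str) -> str:
    """Derive the identifier for a custom entity type from its name."""
    return "CUSTOM_" + name.upper().replace(" ", "_")


# Build a generator_config that mixes built-in and custom entity types.
generator_config = {
    "ORGANIZATION": "Synthesis",
    "LANGUAGE": "Off",
    custom_entity_identifier("My New Type"): "Redact",
}
```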
For string and file redaction, you can specify a default handling option to use for entity types that are not specified in generator_config.
To do this, you use the generator_default parameter.
generator_default can be either Redact, Synthesis, or Off.
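How the two parameters combine can be sketched as a simple lookup: an entry in generator_config wins, and generator_default covers every other entity type. The helper below is illustrative only. It assumes that Redact applies when neither parameter is set, which follows from the default behavior described at the start of this page.

```python
VALID_OPTIONS = {"Redact", "Synthesis", "Off"}


def effective_handling(entity_type, generator_config=None,
                       generator_default="Redact"):
    """Resolve the handling option that applies to one entity type."""
    config = generator_config or {}
    # A per-type entry takes precedence over the default.
    option = config.get(entity_type, generator_default)
    if option not in VALID_OPTIONS:
        raise ValueError(f"Unknown handling option: {option!r}")
    return option
```

For example, with generator_config={'ORGANIZATION':'Synthesis'} and generator_default set to Off, organizations are synthesized and every other detected entity type is left as is.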
You can also configure added and excluded values for each entity type.
You add values that Textual does not detect for an entity type, but should. You exclude values that you do not want Textual to identify as that entity type.
To specify the added values, use label_allow_lists.
To specify the excluded values, use label_block_lists.
For each of these parameters, the value maps each entity type to its added or excluded values. To specify the values, you provide an array of regular expressions.
{'<entity_type>':['<regex>']}
The following example uses label_allow_lists to add values:
For NAME_GIVEN, adds the values There and Here.
For NAME_FAMILY, adds values that match the regular expression ([a-z]{2}).
label_allow_lists={
    'NAME_GIVEN':['There','Here'],
    'NAME_FAMILY':['([a-z]{2})']
}
When you redact a string or download a redacted file, you can provide a comma-separated list of custom entity types to include. Textual then scans for and redacts those entity types based on the configuration in generator_config.
custom_entities=["<entity type identifier>"]
For example:
custom_entities=["CUSTOM_COGNITIVE_ACCESS_KEY", "CUSTOM_PERSONAL_GRAVITY_INDEX"]
To be able to use the Textual SDK, you must have an API key.
Alternatively, you can use the Textual API to obtain a JSON Web Token (JWT) to use for authentication.
You manage keys from the User API Keys page.
To display the User API Keys page, in the Textual navigation menu, click User API Keys.
To create a Textual API key:
Either:
In the API keys panel on the dataset details page, click Create an API Key.
On the pipeline details page, click the API key creation icon.
On the User API Keys page, click Create API Key.
In the Name field, type a name to use to identify the key.
Click Create API Key.
Textual displays the key value, and prompts you to copy the key. If you do not copy the key and save it to a file, you will not have access to the key. To copy the key, click the copy icon.
To revoke a Textual API key, on the User API Keys page, click the Revoke option for the key to revoke.
You cannot instantiate the SDK client without an API key.
Instead of providing the key every time you call the Textual API, you can configure the API key as the value of the environment variable TONIC_TEXTUAL_API_KEY.
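A script can check for the variable before it creates a client, so that a missing key fails early with a clear message. This is a minimal sketch. The helper name and the error text are made up for the example, and only the variable name TONIC_TEXTUAL_API_KEY comes from this page.

```python
import os


def load_textual_api_key(environ=os.environ):
    """Read the API key from TONIC_TEXTUAL_API_KEY, or fail early."""
    key = environ.get("TONIC_TEXTUAL_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "TONIC_TEXTUAL_API_KEY is not set. Create an API key on the "
            "User API Keys page and export it before you run this script."
        )
    return key
```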
Textual uses datasets to produce files with sensitive values replaced.
Before you perform these tasks, remember to .
To get the complete list of datasets that you own, use .
To create a new dataset and then upload a file to it, use .
To add a file to the dataset, use . To identify the file, provide the file path and name.
To provide the file as IO bytes, you provide the file name and the file bytes. You do not provide a path.
Textual creates the dataset, scans the uploaded file, and redacts the detected values.
To change the configuration of a dataset, use .
You can use dataset.edit to change:
The name of the dataset
The
Alternatively, instead of specifying the configuration, you can use the copy_from_dataset parameter to indicate to copy the configuration from another dataset.
To get the current status of the files in the current dataset, use :
The response includes:
The name and identifier of the dataset
The number of files in the dataset
The number of files that are waiting to be processed (scanned and redacted)
The number of files that had errors during processing
For example:
To get a list of files that have a specific status, use the following:
The file list includes:
File identifier and name
Number of rows and columns
Processing status
For failed files, the error
When the file was uploaded
To delete a file from a dataset, use .
To get the redacted content in JSON format for a dataset, use :
For example:
The response looks something like:
After you install the app in the Snowflake UI, only the ACCOUNTADMIN role has access to it.
You can grant access to other roles as needed.
To start the app, run the following command:
This initializes the application. You can then use the app to redact or parse text data.
You use the TEXTUAL_REDACT function to detect and replace sensitive values in text.
The TEXTUAL_REDACT function takes the following arguments:
The text to redact, which is required
Optionally, a PARSE_JSON JSON object that represents the generator configuration for each entity type. The generator configuration indicates what to do with the detected value.
For each entry in PARSE_JSON:
<EntityType> is the type of entity for which to specify the handling. For the list of entity types, go to .
For example, for a first name, the entity type is NAME_GIVEN.
<HandlingType> indicates what to do with the detected value. The options are:
Redact, which replaces the value with a redacted value in the format [<EntityType>_<RandomIdentifier>]
Synthesis, which replaces the value with a realistic replacement
Off, which leaves the value as is
If you do not include PARSE_JSON, then all of the detected values are redacted.
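When the call is generated from application code, the PARSE_JSON argument can be produced from a dictionary. The following sketch builds the SELECT statement as a string. The helper names are hypothetical, and the quoting is a simplified illustration. With a Snowflake client library, bind parameters are generally a safer choice than string building.

```python
import json


def sql_literal(value: str) -> str:
    """Quote a string as a SQL literal (simplified escaping)."""
    escaped = value.replace("\\", "\\\\").replace("'", "''")
    return f"'{escaped}'"


def build_textual_redact_sql(text, handling=None):
    """Build a TEXTUAL_REDACT call, with an optional handling config."""
    args = [sql_literal(text)]
    if handling:
        # Maps entity types to Redact, Synthesis, or Off.
        args.append(f"PARSE_JSON({sql_literal(json.dumps(handling))})")
    return f"SELECT TONIC_TEXTUAL.APP_PUBLIC.TEXTUAL_REDACT({', '.join(args)});"
```

Passing {"NAME_GIVEN": "Synthesis"} as the handling argument produces a statement in which given names are synthesized and all other detected values are redacted.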
The following example sends a text string to the app:
This returns the redacted text, which looks similar to the following:
Because we did not specify the handling for any of the entity types, both the first name Jane and last name Doe are redacted.
In this example, when a first name (NAME_GIVEN) is detected, it is synthesized instead of redacted.
This returns output similar to the following. The first name Jane is replaced with a realistic value (synthesized), and the last name Doe is redacted.
You use the TEXTUAL_PARSE function to transform files in an external or internal stage into Markdown-based content that you can use to populate LLM systems.
The output includes metadata about the file, including sensitive values that were detected.
To be able to parse the files, Textual must have access to the stage where the files are located.
Your role must be able to grant the USAGE and READ permissions.
To grant Textual access to the stage, run the following commands:
To send a parse request for a single file, run the following:
Where:
<FullyQualifiedStageName> is the fully qualified name of the stage, in the format <DatabaseName>.<SchemaName>.<StageName>. For example, database1.schema1.stage1.
<FileName> is the name of the file.
<FileMD5Sum> is the MD5 checksum of the file content.
To parse a large number of files:
List the stage files to parse. For example, you might use PATTERN to limit the files based on file type.
Run the parse request command on the list.
For example:
The app writes the results to the TEXTUAL_RESULTS table.
For each request, the entry in TEXTUAL_RESULTS includes the request status and the request results.
The status is one of the following values:
QUEUED - The parse request was received and is waiting to be processed.
RUNNING - The parse request is currently being processed.
SKIPPED - The parse request was skipped because the file did not change since the previous time it was parsed. Whether a file is changed is determined by its MD5 checksum.
FAILURE_<FailureReason> - The parse request failed for the provided reason.
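If the file content is available locally, the MD5 checksum for a parse request can be computed with the standard library. The sketch below also mirrors the SKIPPED rule: a file is worth re-submitting only when its checksum differs from the previous one. The function names are hypothetical, and the assumption that the expected value is the hexadecimal digest of the raw file bytes should be verified against the md5 values that the stage listing reports.

```python
import hashlib


def file_md5(path, chunk_size=1024 * 1024):
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def should_parse(current_md5, previous_md5):
    """True when the file is new or changed since the last parse."""
    return previous_md5 is None or current_md5 != previous_md5
```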
The result column is a VARIANT type that contains the parsed data. For more information about the format of the results for each document, go to .
You can query the parse results in the same way as you would any other Snowflake VARIANT column.
For example, the following command retrieves the parsed documents, which are in a converted Markdown representation.
To retrieve the entities that were identified in the document:
Because the result column is a simple variant, you can use flattening operations to perform more complex analysis. For example, you can extract all entities of a certain type or value across the documents, or find all documents that contain a specific type of entity.
datasets = textual.get_all_datasets()
dataset = textual.create_dataset('<dataset name>')
dataset.add_file('<path to file>','<file name>')
dataset.add_file('<file name>',<file bytes>)
dataset.edit(name='<dataset name>',
    generator_config={'<entity_type>':'<handling_type>'},
    label_allow_lists={'<entity_type>':LabelCustomList(regexes=['<regex>'])},
    label_block_lists={'<entity_type>':LabelCustomList(regexes=['<regex>'])}
)
dataset.describe()
Dataset: example [879d4c5d-792a-c009-a9a0-60d69be20206]
Number of Files: 1
Files that are waiting for processing:
Files that encountered errors while processing:
Number of Rows: 0
Number of rows fetched: 0
dataset.delete_file('<file identifier>')
dataset = textual.get_dataset('<dataset name>')
dataset.fetch_all_json()
dataset = textual.get_dataset('mydataset')
dataset.fetch_all_json()
'[["PERSON Portrait by PERSON, DATE_TIME ...]'
CALL TONIC_TEXTUAL.APP_PUBLIC.START_APP('{YOUR_COMPUTE_POOL_NAME}', '{YOUR_TEXTUAL_TELEMETRY_EGRESS_INTEGRATION_NAME}');
SELECT TONIC_TEXTUAL.APP_PUBLIC.TEXTUAL_REDACT('Text to redact',
PARSE_JSON('{"<EntityType>": "<HandlingType>"}'));
SELECT TONIC_TEXTUAL.APP_PUBLIC.TEXTUAL_REDACT('My name is Jane Doe');
My name is [NAME_GIVEN_abc789] [NAME_FAMILY_xyz123].
SELECT TONIC_TEXTUAL.APP_PUBLIC.TEXTUAL_REDACT('My name is Jane Doe', PARSE_JSON('{"NAME_GIVEN": "Synthesis"}'));
My name is Shirley [NAME_FAMILY_xyz123].
GRANT USAGE ON DATABASE <DatabaseName> TO APPLICATION TONIC_TEXTUAL;
GRANT USAGE ON SCHEMA <DatabaseName>.<SchemaName> TO APPLICATION TONIC_TEXTUAL;
GRANT READ, USAGE ON STAGE <DatabaseName>.<SchemaName>.<StageName> TO APPLICATION TONIC_TEXTUAL;
SELECT TONIC_TEXTUAL.APP_PUBLIC.TEXTUAL_PARSE('<FullyQualifiedStageName>', '<FileName>', '<FileMD5Sum>');
LIST @<StageName> PATTERN='.*(txt|xlsx|docx)';
SELECT TONIC_TEXTUAL.APP_PUBLIC.TEXTUAL_PARSE('<StageName>', "name","md5") FROM table(result_scan(last_query_id()));
SELECT result["Content"]["ContentAsMarkdown"] FROM TEXTUAL_RESULTS;
SELECT result["Content"]["nerResults"] FROM TEXTUAL_RESULTS;
Tonic Textual's built-in models identify a range of sensitive values, such as:
Locations and addresses
Names of people and organizations
Identifiers and account numbers
The built-in entity types are:
CC Exp (CC_EXP) - The expiration date of a credit card.
Credit Card (CREDIT_CARD) - A credit card number.
CVV (CVV) - The card verification value for a credit card.
Date Time (DATE_TIME) - A date or timestamp.
DOB (DOB) - A person's date of birth.
Email Address (EMAIL_ADDRESS) - An email address.
Event (EVENT) - The name of an event.
Gender Identifier (GENDER_IDENTIFIER) - An identifier of a person's gender.
Healthcare Identifier (HEALTHCARE_ID) - An identifier associated with healthcare, such as a patient number.
IBAN Code (IBAN_CODE) - An international bank account number used to identify an overseas bank account.
IP Address (IP_ADDRESS) - An IP address.
Language (LANGUAGE) - The name of a spoken language.
Law (LAW) - A title of a law.
Location (LOCATION) - A value related to a location. Can include any part of a mailing address.
Occupation (OCCUPATION) - A job title or profession.
Street Address (LOCATION_ADDRESS) - A street address.
City (LOCATION_CITY) - The name of a city.
State (LOCATION_STATE) - A state name or abbreviation.
Zip (LOCATION_ZIP) - A postal code.
Country (LOCATION_COUNTRY) - The name of a country.
Full Mailing Address (LOCATION_COMPLETE_ADDRESS) - A full postal address. By default, the entity type handling option for this entity type is Off.
Medical License (MEDICAL_LICENSE) - The identifier of a medical license.
Money (MONEY) - A monetary value.
Given Name (NAME_GIVEN) - A given name or first name.
Family Name (NAME_FAMILY) - A family name or surname.
NRP (NRP) - A nationality, religion, or political group.
Numeric Identifier (NUMERIC_PII) - A numeric value that acts as an identifier.
Numeric Value (NUMERIC_VALUE) - A numeric value.
Organization (ORGANIZATION) - The name of an organization.
Password (PASSWORD) - A password used for authentication.
Person Age (PERSON_AGE) - The age of a person.
Phone Number (PHONE_NUMBER) - A telephone number.
Product (PRODUCT) - The name of a product.
URL (URL) - A URL to a web page.
US Bank Number (US_BANK_NUMBER) - The routing number of a bank in the United States.
US ITIN (US_ITIN) - An Individual Taxpayer Identification Number in the United States.
US Passport (US_PASSPORT) - A United States passport identifier.
US SSN (US_SSN) - A United States Social Security number.
You can use Textual to generate versions of files where the sensitive values are redacted.
To only generate redacted files, you use a Tonic Textual dataset.
You can also optionally configure a Textual pipeline to generate redacted files in addition to the JSON output.
You can also create and manage datasets from the Textual SDK or REST API.
At a high level, to use Textual to create redacted data:
Create a Textual dataset or pipeline. A dataset is a set of files to redact. A pipeline is used to generate JSON output that can be used to populate an LLM system. Pipelines also provide an option to generate redacted versions of the selected files.
Add files to the dataset or pipeline.
Textual supports almost any free-text file, PDF files, .docx files, and .xlsx files. For images, Textual supports PNG, JPG (both .jpg and .jpeg), and TIF (both .tif and .tiff) files.
For a dataset or an uploaded files pipeline, as you add the files, Textual automatically uses its built-in models to identify entities in the files and generate the pipeline output. For a cloud storage pipeline, to identify the entities and generate the output, you run the pipeline.
For a dataset, review the types of entities that were detected across all of the files. For pipeline files, the file details include the entities that were detected in that file.
At any time, including before you upload files and after you review the detection results, you can configure how Textual handles the detected values for each entity type.
For datasets, you can provide added and excluded values for each built-in entity type.
You can also create and enable custom entity types.
For each entity type, select the action to perform on detected values. The options are:
Redaction - By default, Textual redacts the entity values, which means to replace the values with a token that identifies the type of sensitive value, followed by a unique identifier. For example, NAME_GIVEN_l2m5sb, LOCATION_j40pk6.
The identifiers are consistent, which means that for the same original value, the redacted value always has the same identifier. For example, the first name Michael might always be replaced with NAME_GIVEN_12m5sb, while the first name Helen might always be replaced with NAME_GIVEN_9ha3m2.
For PDF files, redaction means to either cover the value with a black box, or, if there is space, display the entity type and identifier.
For image files, redaction means to cover the value with a black box.
Synthesis - For a given entity type, you can instead choose to synthesize the values, which means to replace the original value with a realistic replacement. The synthesized values are always consistent, meaning that a given original value always produces the same replacement value. For example, the first name Michael might always be replaced with the first name John.
Ignore - You can also choose to ignore the values, and not replace them.
For a dataset, Textual automatically updates the file previews and downloadable files to reflect the updated configuration.
For a pipeline, the updated configuration is applied the next time you run the pipeline, and only applies to new files.
Optionally, in a dataset, you can create lists of values to add to or exclude from an entity type. You might do this to reflect values that are not detected or that are detected incorrectly.
Pipelines do not allow you to add or exclude individual values.
Datasets also provide additional options for PDF files. These options are not available in pipelines.
You can add manual overrides to a PDF file. When you add a manual override, you draw a box to identify the affected portion of the file.
You can use manual overrides either to ignore the automatically detected redactions in the selected area, or to redact the selected area.
To make it easier to process multiple files that have a similar format, such as a form, you can create templates that you can apply to PDF files in the dataset.
After you complete the redaction configuration and manual updates, you can download the dataset files or the synthesized pipeline files to use as needed.
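The consistency property described for redaction tokens (the same original value always yields the same identifier) can be pictured with a small sketch. This is an illustration only: the function name `redaction_token` and the hash-based approach are assumptions made for the example, not a description of how Textual generates its identifiers.

```python
import hashlib


def redaction_token(entity_type: str, value: str, length: int = 6) -> str:
    """Hypothetical helper: derive a stable token from the entity type and value.

    Hashing is an assumption for illustration; Textual's real method is not
    described in this documentation.
    """
    digest = hashlib.sha256(f"{entity_type}:{value}".encode("utf-8")).hexdigest()
    return f"{entity_type}_{digest[:length]}"


# The same input always produces the same token.
first = redaction_token("NAME_GIVEN", "Michael")
second = redaction_token("NAME_GIVEN", "Michael")
# A different value produces a different token.
other = redaction_token("NAME_GIVEN", "Helen")
print(first, second, other)
```

Because the token depends only on the entity type and the original value, repeated occurrences of a name across files map to one token, which keeps references linkable in the redacted output.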
To create a pipeline, on the Pipelines page, click Create a New Pipeline.
On the Create A New Pipeline panel:
In the Name field, type the name of the pipeline.
Under Files Source, select the location of the source files.
To upload files from a local file system, click File upload, then click Save.
To select files from and write output to Amazon S3, click Amazon S3.
To select files from and write output to Databricks, click Databricks.
To select files from and write output to Azure Blob Storage, click Azure.
To select files from and write output to Sharepoint, click Sharepoint.
Click Save.
If you selected Amazon S3, provide the credentials to use to connect to Amazon S3.
In the Access Key field, provide an AWS access key that is associated with an IAM user or role. For an example of a role that has the required permissions for an Amazon S3 pipeline, go to .
In the Access Secret field, provide the secret key that is associated with the access key.
From the Region dropdown list, select the AWS Region to send the authentication request to.
In the Session Token field, provide the session token to use for the authentication request.
To test the credentials, click Test AWS Connection.
By default, connections to Amazon S3 use Amazon S3 encryption. To instead use AWS KMS encryption:
Click Show Advanced Options.
From the Server-Side Encryption Type dropdown list, select AWS KMS. Note that after you save the new pipeline, you cannot change the encryption type.
In the Server-side Encryption AWS KMS ID field, provide the KMS key ID. Note that if the KMS key doesn't exist in the same account that issues the command, you must provide the full key ARN instead of the key ID.
Click Save.
On the Pipeline Settings page, provide the rest of the pipeline configuration. For more information, go to .
Click Save.
If you selected Databricks, provide the connection information:
In the Databricks URL field, provide the URL to the Databricks workspace.
In the Access Token field, provide the access token to use to get access to the volume.
To test the connection, click Test Databricks Connection.
Click Save.
On the Pipeline Settings page, provide the rest of the pipeline configuration. For more information, go to .
Click Save.
If you selected Azure, provide the connection information:
In the Account Name field, provide the name of your Azure account.
In the Account Key field, provide the access key for your Azure account.
To test the connection, click Test Azure Connection.
Click Save.
On the Pipeline Settings page, provide the rest of the pipeline configuration. For more information, go to .
Click Save.
If you selected Sharepoint, provide the credentials for the Entra ID application.
The credentials must have the following application permissions (not delegated permissions):
Files.Read.All - To see the Sharepoint files
Files.ReadWrite.All - To write redacted files and metadata back to Sharepoint
Sites.ReadWrite.All - To view and modify the Sharepoint sites
In the Tenant ID field, provide the Sharepoint tenant identifier for the Sharepoint site.
In the Client ID field, provide the client identifier for the Sharepoint site.
In the Client Secret field, provide the secret to use to connect to the Sharepoint site.
To test the connection, click Test Sharepoint Connection.
On the Pipeline Settings page, provide the rest of the pipeline configuration. For more information, go to .
Click Save.
To update a pipeline configuration:
On the pipeline details page, click the settings icon. For cloud storage pipelines, the settings icon is next to the Run Pipeline option. For uploaded file pipelines, the settings icon is next to the Upload Files option.
On the Pipeline Settings page, update the configuration. For all pipelines, you can change the pipeline name, and whether to also create redacted versions of the original files. For cloud storage pipelines, you can change the file selection. For more information, go to:
For uploaded file pipelines, you do not manage files from the Pipeline Settings page. For information about uploading files, go to .
Click Save.
To delete a pipeline, on the Pipeline Settings page, click Delete Pipeline.
When you use the redact method to redact a plain text string, you can also choose to record the request.
The recorded requests are encrypted.
When you make the request, you specify the number of hours to keep the recorded request. After that amount of time elapses, the request is completely purged. Recorded requests are never kept more than 720 hours, regardless of the configured retention time.
From the Request Explorer, you can review your recorded requests to check the results and assess the quality of the redaction. You can also test changes to the redaction configuration.
You cannot view requests from other users.
To record a redaction request, you include the record_options argument:
record_options = RecordApiRequestOptions(record=<boolean>, retention_time_in_hours=<number of hours>, tags=["tag name"])
The record_options argument includes the following parameters:
record - Whether to record the request. The default is False. To record the request, set record to True.
retention_time_in_hours - The number of hours to preserve the recorded request. The default is 1. After the retention time elapses, the request is purged completely.
tags - A list of tags to assign to the request. The tags are mostly intended to make it easier to search for requests on the Request Explorer page.
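The retention rules above (a default of 1 hour and a hard limit of 720 hours) can be sketched as a small calculation. This is a minimal illustration of the stated rules, not Textual code: the function `purge_time` and the constant names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

MAX_RETENTION_HOURS = 720      # upper limit stated in this topic
DEFAULT_RETENTION_HOURS = 1    # default for retention_time_in_hours


def purge_time(recorded_at: datetime,
               retention_time_in_hours: int = DEFAULT_RETENTION_HOURS) -> datetime:
    """Hypothetical helper: when a recorded request would be purged."""
    # Requests are never kept longer than the maximum, whatever was requested.
    effective_hours = min(retention_time_in_hours, MAX_RETENTION_HOURS)
    return recorded_at + timedelta(hours=effective_hours)


recorded = datetime(2024, 1, 1, tzinfo=timezone.utc)
print(purge_time(recorded))        # default retention
print(purge_time(recorded, 1000))  # capped at 720 hours
```

A requested retention of 1000 hours is treated the same as 720 hours, which is 30 days after the request was recorded.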
The Request Explorer page in Textual contains the list of requests that you recorded and that are not yet purged. You cannot view requests from other users.
To display the Request Explorer page, in the Textual navigation bar, click Request Explorer.
For each request, the list includes:
A 255-character preview of the text that was sent for redaction.
The tags assigned to the request.
The date when the request will be purged.
You can search for a request based on text that is contained in the redacted text, and by the tags that you assigned to the request.
To search by text from the string, in the search field, begin to type the text.
To search by an assigned tag, in the search field, type tags: followed by the tag to search for.
From the request list, to view the results of a request, click the request row.
By default, the preview uses Identification view. For each detected entity, Identification view displays the value and the entity type.
To instead display only the replacement value, which by default is the entity type, click Replacement.
From the preview, you can test how the results change when you:
Change the handling option for entity types.
Add and exclude values for entity types.
To display the edit panel, from the request preview page, click Edit.
The Edit Request panel displays the full list of the available entity types.
You can change how Textual handles detected entity values for each entity type.
Note that the handling option changes are not saved when you close the preview and return to the requests list.
The handling options are:
Off - Indicates to ignore values for this entity type.
Redact - This is the default option. Indicates to replace each value with a token that represents the entity type.
Synthesize - Indicates to replace each value with a realistic replacement value.
To change the handling option for a single entity type, either:
Click the handling option value for the entity type, then select the handling option.
Click the entity type, then under Generator, click the handling option.
To select the same handling option for all of the entity types:
Click Bulk Edit.
From the Bulk Edit dropdown list, select the handling option.
To configure added and excluded values for an entity type, click the entity type.
The Edit Request panel expands to display the Add to detection and Exclude from detection lists.
You use the Add to detection list to configure regular expressions to identify additional values to detect as the selected entity type.
You use the Exclude from detection list to configure regular expressions to identify values to not detect as the selected entity type.
Note that the added and excluded values are not saved when you close the preview and return to the requests list.
To create a regular expression for added or excluded values:
Click the Add regex option for that list.
In the field, provide a regular expression to identify values to add or exclude.
Press Enter.
To edit a regular expression:
Click the edit icon for the expression.
In the field, edit the expression.
Click the save icon.
To delete a regular expression, click the delete icon for that expression.
When an entity type has added values, the added values icon displays for that entity type.
When an entity type has excluded values, the excluded values icon displays for that entity type.
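Before you add a regular expression to either list, it can help to check locally which values it matches. The sketch below uses Python's `re` module. It assumes that the expression must match the whole value, which this topic does not confirm; the helper `matching_values` and the sample values are hypothetical.

```python
import re


def matching_values(pattern: str, candidates: list[str]) -> list[str]:
    """Hypothetical helper: return the candidates that the pattern matches in full."""
    compiled = re.compile(pattern)
    # fullmatch is an assumption: the whole value must match the expression.
    return [value for value in candidates if compiled.fullmatch(value)]


sample = "41111111111"  # the value excluded from Credit Card in the example
candidates = [sample, "4111-1111-1111-1111", "order " + sample]

# A 4 followed by one or more 1s, and nothing else.
print(matching_values(r"41+", candidates))
```

Only the bare sample value matches. The hyphenated number and the value with surrounding text do not, which is the kind of difference worth confirming before you replay a request.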
To replay the request based on the current configuration, click Replay.
When you replay the request, in addition to the Identification and Replacement options, you use the Diff toggle to indicate whether to compare the original and new results.
For our example, we made the following changes to the configuration:
For Given Name and Family Name, changed the handling option to Synthesize.
For Credit Card, indicated to ignore the value 41111111111.
When the Diff toggle is in the off position, Identification view only reflects changes to the added and excluded values.
In our example, we configured 41111111111 to not be detected as a credit card number. In the replayed request, it is instead detected as a numeric value.
Replacement view reflects both the added and excluded values and the changes to the handling option.
For our example, in addition to the entity type change for the credit card number 41111111111, the given and family names are now realistic replacement values instead of the entity types.
When you set the Diff toggle to the on position, the preview displays the original content to the left, and the modified content to the right.
In Identification view, you can see the changes to the entity detection based on the added and excluded values.
In Replacement view, you can also see the changes to the selected handling options for the entity types.
To clear all of the regular expressions for all of the entity types, click Remove Changes.
From Tonic Textual, you can download the JSON output for each file. For pipelines that also generate synthesized files, you can download those files.
You can also use the Textual API to further process the pipeline output - for example, you can chunk the output and determine whether to replace sensitive values before you use the output in a RAG system.
Textual provides next step hints to use the pipeline output. The examples in this topic provide details about how to use the output.
From a file details page, to download the JSON file, click Download Results.
On the file details for a pipeline file, to download the synthesized version of the file, click Download Synthesized File.
On the Original tab for files other than .txt files, the Redacted <file type> view contains a Download option.
For cloud storage pipelines, the synthesized files are also available in the configured output location.
On the pipeline details page, the next steps panel at the left contains suggested steps to set up the API and use the pipeline output:
Create an API Key contains a link to create the key.
Install the Python SDK contains a link to copy the SDK installation command.
Fetch the pipeline results provides access to code snippets that you can use to retrieve and chunk the pipeline results.
At the top of the Fetch the pipeline results step is the pipeline identifier. To copy the identifier, click the copy icon.
The pipeline results step provides access to the following snippets:
Markdown - A code snippet to retrieve the Markdown results for the pipeline.
JSON - A code snippet to retrieve the JSON results for the pipeline.
Chunks - A code snippet to chunk the pipeline results.
To view a snippet, click the snippet tab.
To display the snippet panel, on the snippet tab, click View. The snippet panel provides a larger view of the snippet.
To copy the code snippet, on the snippet tab or the snippet panel, click Copy.
This example shows how to use your Textual pipeline output to create private chunks for RAG, where sensitive chunks are dropped, redacted, or synthesized.
This allows you to ensure that the chunks that you use for RAG do not contain any private information.
First, we connect to the API and get the files from the most recent pipeline.
Next, specify the sensitive entity types, and indicate whether to redact or to synthesize those entities in the chunks.
Next, generate the chunks.
In the following code snippet, the final list does not include chunks with sensitive entities.
To include the chunks with the sensitive entities redacted, remove the if chunk['is_sensitive']: continue lines.
The chunks are now ready to use for RAG or for other downstream tasks.
This example shows how to use Pinecone to add your Tonic Textual pipeline output to a vector retrieval system, for example for RAG.
The Pinecone metadata filtering options allow you to incorporate Textual NER metadata into the retrieval system.
First, connect to the Textual pipeline API, and get the files from the most recently created pipeline.
Next, specify the entity types to incorporate into the retrieval system.
Chunk the files.
For each chunk, add the metadata that contains the instances of the entity types that occur in that chunk.
Next, embed the text of the chunks.
For each chunk, store the following in a Pinecone vector database:
Text
Embedding
Metadata
You define the embedding function for your system.
When you query the Pinecone database, you can then use metadata filters that specify entity type constraints.
For example, to only return chunks that contain the name John Smith:
As another example, to only return chunks that contain one of the following organizations - Google, Apple, or Microsoft:
from tonic_textual.parse_api import TonicTextualParse
api_key = "your-tonic-textual-api-key"
textual = TonicTextualParse("https://textual.tonic.ai", api_key=api_key)
pipelines = textual.get_pipelines()
pipeline = pipelines[-1] # get most recent pipeline
sensitive_entities = [
    "NAME_GIVEN",
    "NAME_FAMILY",
    "EMAIL_ADDRESS",
    "PHONE_NUMBER",
    "CREDIT_CARD",
    "CC_EXP",
    "CVV",
    "US_BANK_NUMBER"
]
# sensitive entities are set to be redacted
# to synthesize, change Redaction to Synthesis
generator_config = {label: 'Redaction' for label in sensitive_entities}
chunks = []
for file in pipeline.enumerate_files():
    file_chunks = file.get_chunks(generator_config=generator_config)
    for chunk in file_chunks:
        if chunk['is_sensitive']:
            continue  # you can choose to ignore chunks that contain sensitive entities
        # or ingest the redacted version
        chunks.append(chunk)
from tonic_textual.parse_api import TonicTextualParse
api_key = "your-tonic-textual-api-key"
textual = TonicTextualParse("https://textual.tonic.ai", api_key=api_key)
pipelines = textual.get_pipelines()
pipeline = pipelines[-1] # get most recent pipeline
files = pipeline.enumerate_files()
metadata_entities = [
    "NAME_GIVEN",
    "NAME_FAMILY",
    "DATE_TIME",
    "ORGANIZATION"
]
chunks = []
for f in files:
    chunks.extend(f.get_chunks(metadata_entities=metadata_entities))
from pinecone import Pinecone
import random, uuid
def embedding_function(text: str) -> list[float]:
    # put your embedding function here
    return [random.random() for i in range(10)]
vectors = []
for chunk in chunks:
    metadata = dict(chunk["metadata"]["entities"])
    metadata["text"] = chunk["text"]
    vectors.append({
        "id": str(uuid.uuid4()),
        "values": embedding_function(chunk["text"]),
        "metadata": metadata
    })
pc = Pinecone(api_key='your-pinecone-api-key')
index_name = "your-pinecone-index-name"
index = pc.Index(index_name)
index.upsert(vectors=vectors)
query = "your query text"  # replace with your query
index.query(
    vector=embedding_function(query),
    filter={
        "NAME_FAMILY": {"$eq": "Smith"},
        "NAME_GIVEN": {"$eq": "John"}
    },
    top_k=5,
    include_metadata=True
)
query = "your query text"  # replace with your query
index.query(
    vector=embedding_function(query),
    filter={
        "ORGANIZATION": {"$in": ["Google", "Apple", "Microsoft"]}
    },
    top_k=5,
    include_metadata=True
)
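To reason about which chunks the two metadata filters would select, a simplified local approximation can help. The sketch below is not Pinecone code and does not call Pinecone. The helper `matches_filter` is hypothetical, it handles only `$eq` and `$in`, and it assumes that each entity type in the chunk metadata maps to a list of strings. The real matching happens inside Pinecone and may differ in details.

```python
def matches_filter(metadata: dict, flt: dict) -> bool:
    """Hypothetical, simplified stand-in for $eq / $in metadata filtering."""
    for field, condition in flt.items():
        values = metadata.get(field, [])
        if isinstance(values, str):
            values = [values]  # treat a single string like a one-item list
        if "$eq" in condition and condition["$eq"] not in values:
            return False
        if "$in" in condition and not any(v in condition["$in"] for v in values):
            return False
    return True  # every field condition was satisfied


# Sample metadata in the assumed shape: entity type -> list of detected values.
chunk_metadata = [
    {"text": "John Smith joined Google.",
     "NAME_GIVEN": ["John"], "NAME_FAMILY": ["Smith"], "ORGANIZATION": ["Google"]},
    {"text": "Helen Jones works at Initech.",
     "NAME_GIVEN": ["Helen"], "NAME_FAMILY": ["Jones"], "ORGANIZATION": ["Initech"]},
]

name_filter = {"NAME_FAMILY": {"$eq": "Smith"}, "NAME_GIVEN": {"$eq": "John"}}
org_filter = {"ORGANIZATION": {"$in": ["Google", "Apple", "Microsoft"]}}

john_hits = [m["text"] for m in chunk_metadata if matches_filter(m, name_filter)]
org_hits = [m["text"] for m in chunk_metadata if matches_filter(m, org_filter)]
print(john_hits)
print(org_hits)
```

Conditions on different fields combine as a logical AND, so the name filter only selects chunks that contain both the given name and the family name.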
Tonic Textual supports languages in addition to English. Textual automatically detects the language and applies the correct model.
On self-hosted instances, you configure whether to support multiple languages, and can optionally provide auxiliary language models.
Textual can detect values in the following languages:
Afrikaans
af
Albanian
sq
Amharic
am
Arabic
ar
Armenian
hy
Assamese
as
Azerbaijani
az
Basque
eu
Belarusian
be
Bengali
bn
Bengali Romanized
Bosnian
bs
Breton
br
Bulgarian
bg
Burmese
my
Burmese (alternative)
Catalan
ca
Chinese (Simplified)
zh
Chinese (Traditional)
zh
Croatian
hr
Czech
cs
Danish
da
Dutch
nl
English
en
Esperanto
eo
Estonian
et
Filipino
tl
Finnish
fi
French
fr
Galician
gl
Irish
ga
Georgian
ka
German
de
Greek
el
Gujarati
gu
Hausa
ha
Hebrew
he
Hindi
hi
Hindi Romanized
Hungarian
hu
Icelandic
is
Indonesian
id
Italian
it
Japanese
ja
Javanese
jv
Kannada
kn
Kazakh
kk
Khmer
km
Korean
ko
Kurdish (Kurmanji)
ku
Kyrgyz
ky
Lao
lo
Latin
la
Latvian
lv
Lithuanian
lt
Macedonian
mk
Malagasy
mg
Malay
ms
Malayalam
ml
Marathi
mr
Mongolian
mn
Nepali
ne
Norwegian
no
Oriya
or
Oromo
om
Pashto
ps
Persian
fa
Polish
pl
Portuguese
pt
Punjabi
pa
Romanian
ro
Russian
ru
Sanskrit
sa
Scottish Gaelic
gd
Serbian
sr
Sinhala
si
Sindhi
sd
Slovak
sk
Slovenian
sl
Somali
so
Spanish
es
Sundanese
su
Swahili
sw
Swedish
sv
Tamil
ta
Tamil Romanized
Telugu
te
Telugu Romanized
Thai
th
Turkish
tr
Ukrainian
uk
Urdu
ur
Urdu Romanized
Uyghur
ug
Uzbek
uz
Vietnamese
vi
Welsh
cy
Western Frisian
fy
Xhosa
xh
Yiddish
yi
On a self-hosted instance, you configure whether Textual supports multiple languages.
You can also optionally provide auxiliary language models.
To enable support for languages other than English, set the environment variable TEXTUAL_MULTI_LINGUAL=true.
The setting is used by the machine learning container.
You can provide additional language model assets for Textual to use.
By default, Textual looks for model assets in the machine learning container, in /usr/bin/textual/language_models. The default Helm and Docker Compose configurations include the volume mount.
To choose a different location, set the environment variable TEXTUAL_LANGUAGE_MODEL_DIRECTORY. Note that if you change the location, you must also modify your volume mounts.
For help with installing model assets, contact Tonic.ai support.
Before you perform these tasks, remember to instantiate the SDK client.
You can use the Tonic Textual SDK to redact individual strings, including:
Plain text strings
JSON content
XML content
For a text string, you can also request synthesized values from a large language model (LLM).
The redaction request can include the handling configuration for entity types.
The redaction response includes the redacted or synthesized content and details about the detected entity values.
To send a plain text string for redaction, use textual.redact:
redaction_response = textual.redact("""<text of the string>""")
redaction_response.describe()
For example:
redaction_response = textual.redact("""Contact Tonic AI with questions""")
redaction_response.describe()
Contact ORGANIZATION_EPfC7XZUZ with questions
{"start": 8, "end": 16, "new_start": 8, "new_end": 30, "label": "ORGANIZATION", "text": "Tonic AI", "new_text": "[ORGANIZATION]", "score": 0.85, "language": "en"}
The redact call provides an option to record the request, to allow you to preview the results in the Textual application. For more information, go to Record and review redaction requests.
To send multiple plain text strings for redaction, use textual.redact_bulk:
bulk_response = textual.redact_bulk([<List of strings>])
For example:
bulk_response = textual.redact_bulk(["Tonic.ai was founded in 2018", "John Smith is a person"])
bulk_response.describe()
[ORGANIZATION_5Ve7OH] was founded in [DATE_TIME_DnuC1]
{"start": 0, "end": 5, "new_start": 0, "new_end": 21, "label": "ORGANIZATION", "text": "Tonic", "score": 0.9, "language": "en", "new_text": "[ORGANIZATION]"}
{"start": 21, "end": 25, "new_start": 37, "new_end": 54, "label": "DATE_TIME", "text": "2018", "score": 0.9, "language": "en", "new_text": "[DATE_TIME]"}
[NAME_GIVEN_dySb5] [NAME_FAMILY_7w4Db3] is a person
{"start": 0, "end": 4, "new_start": 0, "new_end": 18, "label": "NAME_GIVEN", "text": "John", "score": 0.9, "language": "en", "new_text": "[NAME_GIVEN]"}
{"start": 5, "end": 10, "new_start": 19, "new_end": 39, "label": "NAME_FAMILY", "text": "Smith", "score": 0.9, "language": "en", "new_text": "[NAME_FAMILY]"}
To send a JSON string for redaction, use textual.redact_json. You can send the JSON content as a JSON string or a Python dictionary.
json_redaction = textual.redact_json(<JSON string or Python dictionary>)
redact_json ensures that only the values are redacted. It ignores the keys.
Here is a basic example of a JSON redaction request:
import json

d = dict()
d['person'] = {'first': 'John', 'last': 'OReilly'}
d['address'] = {'city': 'Memphis', 'state': 'TN', 'street': '847 Rocky Top', 'zip': 1234}
d['description'] = 'John is a man that lives in Memphis. He is 37 years old and is married to Cynthia.'
json_redaction = textual.redact_json(d)
print(json.dumps(json.loads(json_redaction.redacted_text), indent=2))
It produces the following JSON output:
{
  "person": {
    "first": "[NAME_GIVEN]",
    "last": "[NAME_FAMILY]"
  },
  "address": {
    "city": "[LOCATION_CITY]",
    "state": "[LOCATION_STATE]",
    "street": "[LOCATION_ADDRESS]",
    "zip": "[LOCATION_ZIP]"
  },
  "description": "[NAME_GIVEN] is a man that lives in [LOCATION_CITY]. He is [DATE_TIME] and is married to [NAME_GIVEN]."
}
When you redact a JSON string, you can optionally assign specific entity types to selected JSON paths.
To do this, you include the jsonpath_allow_lists parameter. Each entry consists of an entity type and a list of JSON paths for which to always use that entity type. Each JSON path must point to a simple string or numeric value.
jsonpath_allow_lists={'entity_type':['JSON Paths']}
The specified entity type overrides both the detected entity type and any added or excluded values.
In the following example, the value of the key1 node is always treated as a telephone number:
response = textual.redact_json('{"key1":"Ex123", "key2":"Johnson"}', jsonpath_allow_lists={'PHONE_NUMBER':['$.key1']})
It produces the following redacted output:
{"key1":"[PHONE_NUMBER]","key2":"[NAME_FAMILY]"}
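One way to picture the jsonpath_allow_lists structure is as a mapping that can be inverted into a lookup from JSON path to forced entity type. The sketch below only illustrates the shape of the parameter. The helper `forced_entity_types` and the extra paths are hypothetical, and this is not how the SDK processes the parameter internally.

```python
def forced_entity_types(jsonpath_allow_lists: dict[str, list[str]]) -> dict[str, str]:
    """Hypothetical helper: invert {entity_type: [paths]} into {path: entity_type}."""
    path_to_type: dict[str, str] = {}
    for entity_type, paths in jsonpath_allow_lists.items():
        for path in paths:
            path_to_type[path] = entity_type
    return path_to_type


# Each entity type is listed with the JSON paths that always use it.
allow_lists = {
    "PHONE_NUMBER": ["$.key1"],
    "NAME_FAMILY": ["$.person.last", "$.key2"],  # illustrative paths
}

lookup = forced_entity_types(allow_lists)
print(lookup["$.key1"])
```

Reading the parameter this way makes it easier to spot a path that is accidentally listed under two entity types, because the later entry silently replaces the earlier one in the lookup.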
To send an XML string for redaction, use textual.redact_xml.
redact_xml ensures that only the values are redacted. It ignores the XML markup.
For example:
xml_string = '''<?xml version="1.0" encoding="UTF-8"?>
<!-- This XML document contains sample PII with namespaces and attributes -->
<PersonInfo xmlns="http://www.example.com/default" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:contact="http://www.example.com/contact">
<!-- Personal Information with an attribute containing PII -->
<Name preferred="true" contact:userID="john.doe123">
<FirstName>John</FirstName>
<LastName>Doe</LastName>He was born in 1980.</Name>
<contact:Details>
<!-- Email stored in an attribute for demonstration -->
<contact:Email address="[email protected]"/>
<contact:Phone type="mobile" number="555-6789"/>
</contact:Details>
<!-- SSN stored as an attribute -->
<SSN value="987-65-4321" xsi:nil="false"/>
<data>his name was John Doe</data>
</PersonInfo>'''
response = textual.redact_xml(xml_string)
redacted_xml = response.redacted_text
Produces the following XML output:
<?xml version="1.0" encoding="UTF-8"?><!-- This XML document contains sample PII with namespaces and attributes -->\n<PersonInfo xmlns="http://www.example.com/default" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:contact="http://www.example.com/contact"><!-- Personal Information with an attribute containing PII --><Name preferred="true" contact:userID="[NAME_GIVEN]">[GENDER_IDENTIFIER] was born in [DOB].<FirstName>[NAME_GIVEN]</FirstName><LastName>[NAME_FAMILY]</LastName></Name><contact:Details><!-- Email stored in an attribute for demonstration --><contact:Email address="[EMAIL_ADDRESS]"></contact:Email><contact:Phone type="mobile" number="[PHONE_NUMBER]"></contact:Phone></contact:Details><!-- SSN stored as an attribute --><SSN value="[PHONE_NUMBER]" xsi:nil="false"></SSN><data>[GENDER_IDENTIFIER] name was [NAME_GIVEN] [NAME_FAMILY]</data></PersonInfo>
To send an HTML string for redaction, use textual.redact_html.
redact_html ensures that only the values are redacted. It ignores the HTML markup.
For example:
html_content = """
<!DOCTYPE html>
<html>
<head>
<title>John Doe</title>
</head>
<body>
<h1>John Doe</h1>
<p>John Doe is a person who lives in New York City.</p>
<p>John Doe's phone number is 555-555-5555.</p>
</body>
</html>
"""
# Run the redact_html method
redacted_html = textual.redact_html(html_content, generator_config={
    "NAME_GIVEN": "Synthesis",
    "NAME_FAMILY": "Synthesis"
})
print(redacted_html.redacted_text)
Produces the following HTML output:
<!DOCTYPE html>
<html>
<head>
<title>Scott Roley</title>
</head>
<body>
<h1>Scott Roley</h1>
<p>Scott Roley is a person who lives in [LOCATION_CITY].</p>
<p>Scott Roley's phone number is [PHONE_NUMBER].</p>
</body>
</html>
You can also request synthesized values from a large language model (LLM).
When you use this process, Textual first identifies the sensitive values in the text. It then sends the value locations and redacted values to the LLM. For example, if Textual identifies a product name, it sends the location and the redacted value PRODUCT to the LLM. Textual does not send the original values to the LLM.
The LLM then generates realistic synthesized values of the appropriate value types.
To send text to an LLM, use textual.llm_synthesis:
raw_synthesis = textual.llm_synthesis("Text of the string")
For example:
raw_synthesis = textual.llm_synthesis("My name is John, and today I am demoing Textual, a software product created by Tonic")
raw_synthesis.describe()
My name is John, and on Monday afternoon I am demoing Widget Pro, a software product created by Initech Enterprises.
{"start": 11, "end": 15, "new_start": 11, "new_end": 15, "label": "NAME_GIVEN", "text": "John", "new_text": null, "score": 0.9, "language": "en"}
{"start": 21, "end": 26, "new_start": 21, "new_end": 40, "label": "DATE_TIME", "text": "today", "new_text": null, "score": 0.85, "language": "en"}
{"start": 40, "end": 47, "new_start": 54, "new_end": 64, "label": "PRODUCT", "text": "Textual", "new_text": null, "score": 0.85, "language": "en"}
{"start": 79, "end": 84, "new_start": 96, "new_end": 115, "label": "ORGANIZATION", "text": "Tonic", "new_text": null, "score": 0.85, "language": "en"}
Before you can use this endpoint, you must enable additional LLM processing. The additional processing sends the values and surrounding text to the LLM. For an overview of the LLM processing and how to enable it, go to Enabling and using additional LLM processing of detected entities.
The response provides the redacted or synthesized version of the string, and the list of detected entity values.
Contact ORGANIZATION_EPfC7XZUZ with questions
{"start": 8, "end": 16, "new_start": 8, "new_end": 30, "label": "ORGANIZATION", "text": "Tonic AI", "new_text": "[ORGANIZATION]", "score": 0.85, "language": "en"}
For each redacted item, the response includes:
The location of the value in the original text (start and end)
The location of the value in the redacted version of the string (new_start and new_end)
The entity type (label)
The original value (text)
The replacement value (new_text). new_text is null in the following cases:
The entity type is ignored
The response is from llm_synthesis
A score to indicate confidence in the detection and redaction (score)
The detected language for the value (language)
For responses from textual.redact_json, the JSON path to the entity in the original document (json_path)
For responses from textual.redact_xml, the XPath to the entity in the original XML document (xml_path)
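As a rough illustration of how the location fields relate to the two strings, the following sketch checks the example response shown earlier with plain Python slicing. It assumes that end and new_end are exclusive offsets, which matches the example values. The original sentence is reconstructed from the text field of the entity, and the two helper functions are hypothetical and not part of the SDK.

```python
# Values copied from the example response. The original sentence is an
# assumption reconstructed from the entity's "text" field.
original_text = "Contact Tonic AI with questions"
redacted_text = "Contact ORGANIZATION_EPfC7XZUZ with questions"
entity = {
    "start": 8, "end": 16,
    "new_start": 8, "new_end": 30,
    "label": "ORGANIZATION", "text": "Tonic AI",
}


def original_span(text, ent):
    """Slice the detected value out of the original text (hypothetical helper)."""
    return text[ent["start"]:ent["end"]]


def redacted_span(text, ent):
    """Slice the replacement out of the redacted text (hypothetical helper)."""
    return text[ent["new_start"]:ent["new_end"]]


print(original_span(original_text, entity))
print(redacted_span(redacted_text, entity))
```

Because the replacement can be longer or shorter than the original value, new_start and new_end for later entities can differ from start and end.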
Textual uses pipelines to transform file text into a format that can be used in an LLM system.
You can use the Textual SDK to create and manage pipelines and to retrieve pipeline run results.
Before you perform these tasks, remember to instantiate the SDK client.
To create a pipeline, use the pipeline creation method for the type of pipeline to create:
textual.create_local_pipeline - Creates an uploaded file pipeline.
textual.create_s3_pipeline - Creates an Amazon S3 pipeline.
textual.create_azure_pipeline - Creates an Azure pipeline.
textual.create_databricks_pipeline - Creates a Databricks pipeline.
When you create the pipeline, you can also:
If needed, provide the credentials to use to connect to Amazon S3, Azure, or Databricks.
Indicate whether to also generate redacted files. By default, pipelines do not generate redacted files. To generate redacted files, set synthesize_files to True.
For example, to create an uploaded file pipeline that also creates redacted files:
pipeline = textual.create_local_pipeline(pipeline_name="pipeline name", synthesize_files=True)
The response contains the pipeline object.
To delete a pipeline, use textual.delete_pipeline.
textual.delete_pipeline(pipeline_id)
To change whether a pipeline also generates synthesized files, use pipeline.set_synthesize_files.
To add a file to an uploaded file pipeline, use pipeline.upload_file.
pipeline = textual.create_pipeline(pipeline_name)
with open(file_path, "rb") as file_content:
    file_bytes = file_content.read()
pipeline.upload_file(file_bytes, file_name)
For an Amazon S3 pipeline, you can configure the output location for the processed files. You can also identify the files and folders for the pipeline to process:
To identify the output location for the processed files, use s3_pipeline.set_output_location.
To identify individual files for the pipeline to process, use s3_pipeline.add_files.
To identify prefixes (folders for which the pipeline processes all applicable files), use s3_pipeline.add_prefixes.
For an Azure pipeline, you can configure the output location for the processed files. You can also identify the files and folders for the pipeline to process:
To identify the output location for the processed files, use azure_pipeline.set_output_location.
To identify individual files for the pipeline to process, use azure_pipeline.add_files.
To identify prefixes (folders for which the pipeline processes all applicable files), use azure_pipeline.add_prefixes.
To get the list of pipelines, use textual.get_pipelines.
pipelines = textual.get_pipelines()
The response contains a list of pipeline objects.
To use the pipeline identifier to get a single pipeline, use textual.get_pipeline_by_id.
pipeline_id: str # pipeline identifier
pipeline = textual.get_pipeline_by_id(pipeline_id)
The response contains a single pipeline object.
The pipeline identifier is displayed on the pipeline details page. To copy the identifier, click the copy icon.
To run a pipeline, use pipeline.run.
The response contains the job identifier.
To get the list of pipeline runs, use pipeline.get_runs.
The response contains a list of pipeline run objects.
Once you have the pipeline, to get an enumerator of the files in the pipeline from the most recent pipeline run, use pipeline.enumerate_files.
files = pipeline.enumerate_files()
The response is an enumerator of file parse result objects.
To get a list of entities that were detected in a file, use get_all_entities. For example, to get the detected entities for all of the files in a pipeline:
detected_entities = []
for file in pipeline.enumerate_files():
    entities = file.get_all_entities()
    detected_entities.append(entities)
To provide a list of entity types and how to process them, use get_entities:
from typing import Dict
from textual.enums.pii_state import PiiState
generator_config: Dict[str, PiiState]
generator_default: PiiState
entities_list = []
for file in pipeline.enumerate_files():
    entities = file.get_entities(generator_config, generator_default)
    entities_list.append(entities)
generator_config is a dictionary that specifies whether to redact, synthesize, or do neither for each entity type in the dictionary.
For a list of the entity types that Textual detects, go to Entity types that Textual detects.
For each entity type, you provide the handling type:
Redaction indicates to replace the value with a token that represents the entity type.
Synthesis indicates to replace the value with a realistic value.
Off indicates to keep the value as is.
generator_default indicates how to process values for entity types that were not included in the generator_config list.
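The relationship between generator_config and generator_default can be sketched as a dictionary lookup with a fallback. The PiiState class below is a local stand-in for the SDK enum so that the sketch runs without the SDK installed, and handling_for is a hypothetical helper that only illustrates the lookup rule described above.

```python
from enum import Enum
from typing import Dict


class PiiState(Enum):
    # Local stand-in for the SDK's PiiState enum (assumption: the SDK enum
    # exposes the three handling types named in the text).
    Redaction = "Redaction"
    Synthesis = "Synthesis"
    Off = "Off"


def handling_for(label: str,
                 generator_config: Dict[str, PiiState],
                 generator_default: PiiState) -> PiiState:
    """Return the configured handling for a label, else the default."""
    return generator_config.get(label, generator_default)


generator_config = {
    "NAME_GIVEN": PiiState.Synthesis,  # replace with a realistic value
    "ORGANIZATION": PiiState.Off,      # keep the value as is
}
generator_default = PiiState.Redaction  # applies to every other entity type

for label in ("NAME_GIVEN", "ORGANIZATION", "PHONE_NUMBER"):
    print(label, handling_for(label, generator_config, generator_default).name)
```

In this sketch, PHONE_NUMBER is not listed in generator_config, so it falls back to the default handling.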
The response contains the list of entities. For each value, the list includes:
Entity type
Where the value starts in the source file
Where the value ends in the source file
The original text of the entity
To get the Markdown output of a pipeline file, use get_markdown. In the request, you can provide generator_config and generator_default to configure how to present the detected entities in the output file.
from typing import Dict
from textual.enums.pii_state import PiiState
generator_config: Dict[str, PiiState]
generator_default: PiiState
markdown_list = []
for file in pipeline.enumerate_files():
    markdown = file.get_markdown(generator_config, generator_default)
    markdown_list.append(markdown)
The response contains the Markdown files, with the detected entities processed as specified in generator_config and generator_default.
To split a pipeline file into text chunks that can be imported into an LLM, use get_chunks.
In the request, you set the maximum number of characters in each chunk.
You can also provide generator_config and generator_default to configure how to present the detected entities in the text chunks.
from typing import Dict
from textual.enums.pii_state import PiiState
generator_config: Dict[str, PiiState]
generator_default: PiiState
max_chars: int
chunks_list = []
for file in pipeline.enumerate_files():
    # Keyword arguments must follow max_chars=..., so pass all three by name
    chunks = file.get_chunks(max_chars=max_chars, generator_config=generator_config, generator_default=generator_default)
    chunks_list.append(chunks)
The response contains the list of text chunks, with the detected entities processed as specified in generator_config and generator_default.
PUT /api/Dataset HTTP/1.1
Host:
Content-Type: application/json
Accept: */*
Content-Length: 507
{
"id": "text",
"name": "text",
"generatorSetup": "{\"NAME_GIVEN\":\"Redaction\", \"NAME_FAMILY\":\"Redaction\"}",
"datasetGeneratorMetadata": {
"ANY_ADDITIONAL_PROPERTY": {}
},
"labelBlockLists": "{\"NAME_FAMILY\": {\"strings\":[],\"regexes\":[\".*\\\\s(disease|syndrom|disorder)\"]}}",
"labelAllowLists": "{ \"HEALTHCARE_ID\": {\"strings\":[],\"regexes\":[\"[a-z]{2}\\\\d{9}\"]} }",
"enabledModels": [
"text"
],
"docXImagePolicy": "Redact",
"pdfSignaturePolicy": "Redact",
"docXCommentPolicy": "Remove",
"docXTablePolicy": "Redact"
}
{
"id": "text",
"name": "text",
"datasetGeneratorMetadata": "asdfqwer",
"generatorSetup": "{\"NAME_GIVEN\":\"Redaction\", \"NAME_FAMILY\":\"Redaction\"}",
"labelBlockLists": "{\"NAME_FAMILY\": {\"strings\":[],\"regexes\":[\".*\\\\s(disease|syndrom|disorder)\"]}}",
"labelAllowLists": "{ \"HEALTHCARE_ID\": {\"strings\":[],\"regexes\":[\"[a-z]{2}\\\\d{9}\"]} }",
"enabledModels": [
"text"
],
"files": [
{
"fileId": "text",
"fileName": "text",
"fileType": "text",
"datasetId": "text",
"numRows": 1,
"numColumns": 1,
"piiTypes": [
"text"
],
"wordCount": 1,
"redactedWordCount": 1,
"uploadedTimestamp": {},
"fileSource": "Local",
"processingStatus": "text",
"processingError": "text",
"mostRecentCompletedJobId": "text"
}
],
"lastUpdated": {},
"docXImagePolicy": "Redact",
"pdfSignaturePolicy": "Redact",
"docXCommentPolicy": "Remove",
"docXTablePolicy": "Redact",
"fileSource": "Local",
"customPiiEntityIds": [
"text"
],
"rescanJobs": [
{
"id": "text",
"status": "text",
"errorMessages": "text",
"startTime": {},
"endTime": {},
"publishedTime": {},
"datasetFileId": "text",
"jobType": "DeidentifyFile"
}
]
}
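In the request and response bodies above, generatorSetup, labelBlockLists, and labelAllowLists are JSON documents stored as strings inside the outer JSON body, which is why the regular expressions appear with doubled escaping. The following sketch shows one way to build such a body in Python and confirm that the pattern survives both encoding layers. It only constructs and decodes the body; it does not call the API, and the subset of fields is chosen for illustration.

```python
import json
import re

# Inner documents, written as ordinary Python objects.
generator_setup = {"NAME_GIVEN": "Redaction", "NAME_FAMILY": "Redaction"}
label_allow_lists = {
    "HEALTHCARE_ID": {"strings": [], "regexes": [r"[a-z]{2}\d{9}"]},
}

# Each inner document is serialized to a string before it is placed in the body.
payload = {
    "id": "text",
    "name": "text",
    "generatorSetup": json.dumps(generator_setup),
    "labelAllowLists": json.dumps(label_allow_lists),
}
body = json.dumps(payload)  # the string that would be sent as the request body

# Decoding reverses both layers and recovers the original pattern.
decoded = json.loads(json.loads(body)["labelAllowLists"])
pattern = decoded["HEALTHCARE_ID"]["regexes"][0]
print(pattern)
```

Serializing the inner objects with json.dumps avoids escaping the backslashes and quotation marks by hand.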
GET /api/Dataset HTTP/1.1
Host:
Accept: */*
[
{
"id": "text",
"name": "text",
"datasetGeneratorMetadata": "asdfqwer",
"generatorSetup": "{\"NAME_GIVEN\":\"Redaction\", \"NAME_FAMILY\":\"Redaction\"}",
"labelBlockLists": "{\"NAME_FAMILY\": {\"strings\":[],\"regexes\":[\".*\\\\s(disease|syndrom|disorder)\"]}}",
"labelAllowLists": "{ \"HEALTHCARE_ID\": {\"strings\":[],\"regexes\":[\"[a-z]{2}\\\\d{9}\"]} }",
"enabledModels": [
"text"
],
"files": [
{
"fileId": "text",
"fileName": "text",
"fileType": "text",
"datasetId": "text",
"numRows": 1,
"numColumns": 1,
"piiTypes": [
"text"
],
"wordCount": 1,
"redactedWordCount": 1,
"uploadedTimestamp": {},
"fileSource": "Local",
"processingStatus": "text",
"processingError": "text",
"mostRecentCompletedJobId": "text"
}
],
"lastUpdated": {},
"docXImagePolicy": "Redact",
"pdfSignaturePolicy": "Redact",
"docXCommentPolicy": "Remove",
"docXTablePolicy": "Redact",
"fileSource": "Local",
"customPiiEntityIds": [
"text"
],
"rescanJobs": [
{
"id": "text",
"status": "text",
"errorMessages": "text",
"startTime": {},
"endTime": {},
"publishedTime": {},
"datasetFileId": "text",
"jobType": "DeidentifyFile"
}
]
}
]
POST /api/Dataset HTTP/1.1
Host:
Content-Type: application/json
Accept: */*
Content-Length: 15
{
"name": "text"
}
{
"id": "text",
"name": "text",
"datasetGeneratorMetadata": "asdfqwer",
"generatorSetup": "{\"NAME_GIVEN\":\"Redaction\", \"NAME_FAMILY\":\"Redaction\"}",
"labelBlockLists": "{\"NAME_FAMILY\": {\"strings\":[],\"regexes\":[\".*\\\\s(disease|syndrom|disorder)\"]}}",
"labelAllowLists": "{ \"HEALTHCARE_ID\": {\"strings\":[],\"regexes\":[\"[a-z]{2}\\\\d{9}\"]} }",
"enabledModels": [
"text"
],
"files": [
{
"fileId": "text",
"fileName": "text",
"fileType": "text",
"datasetId": "text",
"numRows": 1,
"numColumns": 1,
"piiTypes": [
"text"
],
"wordCount": 1,
"redactedWordCount": 1,
"uploadedTimestamp": {},
"fileSource": "Local",
"processingStatus": "text",
"processingError": "text",
"mostRecentCompletedJobId": "text"
}
],
"lastUpdated": {},
"docXImagePolicy": "Redact",
"pdfSignaturePolicy": "Redact",
"docXCommentPolicy": "Remove",
"docXTablePolicy": "Redact",
"fileSource": "Local",
"customPiiEntityIds": [
"text"
],
"rescanJobs": [
{
"id": "text",
"status": "text",
"errorMessages": "text",
"startTime": {},
"endTime": {},
"publishedTime": {},
"datasetFileId": "text",
"jobType": "DeidentifyFile"
}
]
}
GET /api/Dataset/{datasetId} HTTP/1.1
Host:
Accept: */*
{
"id": "text",
"name": "text",
"datasetGeneratorMetadata": "asdfqwer",
"generatorSetup": "{\"NAME_GIVEN\":\"Redaction\", \"NAME_FAMILY\":\"Redaction\"}",
"labelBlockLists": "{\"NAME_FAMILY\": {\"strings\":[],\"regexes\":[\".*\\\\s(disease|syndrom|disorder)\"]}}",
"labelAllowLists": "{ \"HEALTHCARE_ID\": {\"strings\":[],\"regexes\":[\"[a-z]{2}\\\\d{9}\"]} }",
"enabledModels": [
"text"
],
"files": [
{
"fileId": "text",
"fileName": "text",
"fileType": "text",
"datasetId": "text",
"numRows": 1,
"numColumns": 1,
"piiTypes": [
"text"
],
"wordCount": 1,
"redactedWordCount": 1,
"uploadedTimestamp": {},
"fileSource": "Local",
"processingStatus": "text",
"processingError": "text",
"mostRecentCompletedJobId": "text"
}
],
"lastUpdated": {},
"docXImagePolicy": "Redact",
"pdfSignaturePolicy": "Redact",
"docXCommentPolicy": "Remove",
"docXTablePolicy": "Redact",
"fileSource": "Local",
"customPiiEntityIds": [
"text"
],
"rescanJobs": [
{
"id": "text",
"status": "text",
"errorMessages": "text",
"startTime": {},
"endTime": {},
"publishedTime": {},
"datasetFileId": "text",
"jobType": "DeidentifyFile"
}
]
}
When Textual generates replacement values, those values are always consistent. Consistency means that the same original value always produces the same replacement value. You can also enable consistency with some Tonic Structural output values.
For all entity types, you can specify the replacements for specific values.
Some entity types include type-specific options for how Tonic Textual generates the replacement values.
For custom entity types, you can select the generator to use.
You can also set whether to use the new synthesis process.
If you also use Tonic Structural, then you can configure Textual to enable selected synthesized values to be consistent between the two applications.
For example, a given source telephone number can produce the same replacement telephone number in both Structural and Textual.
To enable this consistency, you configure a statistics seed value as the value of the Textual environment variable SOLAR_STATISTICS_SEED. A statistics seed is a signed 32-bit integer.
The value must match a Structural statistics seed, either:
The value of the Structural environment setting TONIC_STATISTICS_SEED.
A statistics seed configured for an individual Structural workspace.
The current statistics seed value is displayed on the System Settings page.
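Because a statistics seed must be a signed 32-bit integer, it can be useful to check a candidate value before you set the environment variable. The following is a minimal sketch of such a check; is_valid_statistics_seed is a hypothetical helper and is not part of Textual.

```python
INT32_MIN = -2**31      # -2147483648
INT32_MAX = 2**31 - 1   #  2147483647


def is_valid_statistics_seed(value) -> bool:
    """Return True if value is an integer in the signed 32-bit range."""
    if isinstance(value, bool) or not isinstance(value, int):
        return False
    return INT32_MIN <= value <= INT32_MAX


print(is_valid_statistics_seed(12345))
print(is_valid_statistics_seed(2**31))
```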
Textual has developed an updated synthesis process that is currently implemented for the following entity types:
URLs
Names
Custom entity types
In particular, the new synthesis process improves the display of the synthesized values in PDF files. The values better match the available space and the original font.
To configure whether to use the new process:
On the dataset details page, click Settings.
On the Dataset Settings page, under PDF Settings, the New PDF synthesis mode (experimental) determines which process to use. To use the new process, toggle the setting to the on position.
Click Save Dataset.
For all entity types, you can provide a list of specific replacement values.
For example, for the Given Name entity type, you might indicate to always replace John with Michael and Mary with Melissa.
For the remaining values, Textual generates the replacement values.
To display the synthesis options for an entity type, click Options.
In the text area, provide a JSON object that maps the original values to the replacement values. For example:
{
"French": "German",
"English": "Japanese"
}
With the above configuration for the Language entity type:
All instances of French are changed to German.
All instances of English are changed to Japanese.
Textual selects the replacement values for other languages.
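The effect of the mapping can be sketched as a lookup that falls back to generated values. In the sketch below, replace_value and placeholder_generator are hypothetical names; the placeholder stands in for whatever value Textual would generate and does not reflect how Textual actually produces it.

```python
import json

# The same JSON object as in the example configuration.
overrides = json.loads('{"French": "German", "English": "Japanese"}')


def placeholder_generator(original: str) -> str:
    # Stand-in for Textual's own value generation (assumption).
    return "<generated>"


def replace_value(original: str, overrides: dict, generate) -> str:
    """Use the configured replacement if there is one, else generate a value."""
    if original in overrides:
        return overrides[original]
    return generate(original)


for language in ("French", "English", "Spanish"):
    print(language, "->", replace_value(language, overrides, placeholder_generator))
```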
For the Given Name and Family Name entity types, you can configure:
Whether to treat the same name with different casing as a different value.
Whether to replicate the gender of the original value.
In the entity types list, to display the name synthesis options, click Options.
To treat the same name with different casing as different source values, check Is Consistency Case Sensitive.
For example, when this is checked, john and John are treated as different names and can have different replacement values: john might be replaced with michael, and John might be replaced with Stephen.
When this is not checked, then john and John are treated as the same source value, and get the same replacement.
To replace source names with names that have the same gender, check Preserve Gender.
For example, when this is checked, John might be replaced with Michael, since they are both traditionally male names. However, John would not be replaced with Mary, which is traditionally a female name.
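To make the case-sensitivity option more concrete, the sketch below derives a replacement deterministically from a lookup key and lowercases the key when consistency is not case sensitive. This is only an illustration of the idea: the hash-based selection and the name pool are assumptions, not Textual's actual method, and the sketch does not cover Preserve Gender.

```python
import hashlib

# Hypothetical pool of replacement names, for illustration only.
REPLACEMENT_NAMES = ["Michael", "Stephen", "Andrew", "Daniel"]


def consistent_replacement(name: str, case_sensitive: bool) -> str:
    """Map a name to a replacement so that equal keys give equal results."""
    key = name if case_sensitive else name.lower()
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return REPLACEMENT_NAMES[digest[0] % len(REPLACEMENT_NAMES)]


# Not case sensitive: john and John share a key, so they share a replacement.
print(consistent_replacement("john", case_sensitive=False))
print(consistent_replacement("John", case_sensitive=False))
```

With case_sensitive=True, john and John use different keys, so they can receive different replacements, although nothing forces them to differ.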
Location values include the following types:
Location
Location Address
Location State
Location Zip
You can select whether to generate HIPAA or non-HIPAA addresses. Address values can be consistent with values generated in Structural.
For each location type other than Location State, you can specify whether to use a realistic replacement value. For Location State, based on HIPAA guidelines, both the Synthesis option and the Off option pass through the value.
For location types that include zip codes, you can also specify how to generate the new zip code values.
In the entity types list, to display the location synthesis options, click Options.
Under Address generator type, select the type of address generator to use:
HIPAA-compliant address generator. This option generates values similar to those generated by the .
Non-HIPAA address generator. This option generates values similar to those generated by the .
If you configured a Textual statistics seed that matches a Structural statistics seed, then the generated address values are consistent with values generated in Structural. A given address value produces the same output value in both applications.
For example, in both Textual and Structural, a source address value 123 Main Street might be replaced with 234 Oak Avenue.
By default, Textual replaces a location value with a realistic corresponding value. For example, "Main Street" might be replaced with "Fourth Avenue".
To instead scramble the values, uncheck Replace with realistic values.
By default, to generate a new zip code, Textual selects a real zip code that starts with the same three digits as the original zip code. For a low population area, Textual instead selects a random zip code from the United States.
To instead replace the last two digits of the zip code with zeros, check Replace zeroes for zip codes. For a low population area, Textual instead replaces all of the digits in the zip code with zeros.
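The zero-replacement option can be sketched as follows. The sketch assumes a five-digit zip code passed as a string and takes the low-population decision as an input; how Textual decides that an area has a low population is not described here, and zero_out_zip is a hypothetical helper.

```python
def zero_out_zip(zip_code: str, low_population: bool = False) -> str:
    """Replace the last two digits with zeros, or every digit for a low population area."""
    if low_population:
        return "0" * len(zip_code)
    return zip_code[:-2] + "00"


print(zero_out_zip("30301"))
print(zero_out_zip("30301", low_population=True))
```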
By default, when you select the Synthesis option for Date/Time and Date of Birth values, Textual shifts the datetime values to a value that occurs within 7 days before or after the original value.
To customize how Textual sets the new values, you can:
Set a different range within which Textual sets the new values
Indicate whether to scramble date values that Textual cannot parse
Indicate whether to shift all of the original values by the same amount and in the same direction
Add additional date formats for Textual to recognize
In the entity types list, to display the datetime synthesis options, click Options.
By default, Textual adjusts the dates to values that are within 7 days before or after the original date.
To change the range:
In the Left bound on # of Days To Shift field, enter the number of days before the original date within which the replacement datetime value must occur. For example, if you enter 10, then the replacement datetime value cannot occur earlier than 10 days before the original value.
In the Right bound on # of Days To Shift field, enter the number of days after the original date within which the replacement datetime value must occur. For example, if you enter 6, then the replacement datetime value cannot occur later than 6 days after the original value.
Textual can parse datetime values that use either a format in Default supported datetime formats in Textual or a format that you add.
The Scramble Unrecognized Dates checkbox indicates how Textual should handle datetime values that it does not recognize.
By default, the checkbox is checked, and Textual scrambles those values.
To instead pass through the values without changing them, uncheck Scramble Unrecognized Dates.
By default, Textual applies different shifts to the original values. Some replacement dates might be earlier, and some might be later. The amount of shift might also vary.
To shift all of the datetime values in the same way, check Apply same shift for entire document.
For example, if this is checked, Textual might shift all datetime values 3 days in the future.
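The shift options can be sketched with the standard datetime module. The function names and the uniform random choice of offset are assumptions made for illustration; the sketch only shows the difference between a shift for each value and one shift for the entire document, within the configured left and right bounds.

```python
import random
from datetime import date, timedelta


def shift_date(original: date, left_days: int = 7, right_days: int = 7,
               rng=random) -> date:
    """Shift one date by a random offset within [-left_days, +right_days]."""
    offset = rng.randint(-left_days, right_days)
    return original + timedelta(days=offset)


def shift_document(dates, left_days: int = 7, right_days: int = 7, rng=random):
    """Apply the same random offset to every date in a document."""
    offset = rng.randint(-left_days, right_days)
    return [d + timedelta(days=offset) for d in dates]


rng = random.Random(0)  # seeded only to make the sketch repeatable
source_dates = [date(2024, 1, 17), date(2024, 2, 1)]
print([shift_date(d, 10, 6, rng) for d in source_dates])  # per-value shifts
print(shift_document(source_dates, 10, 6, rng))           # one shared shift
```

With a shared shift, the number of days between any two dates in the document stays the same after replacement.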
By default, Textual is able to recognize datetime values that use a format from Default supported datetime formats in Textual.
Under Additional Date Formats, you can add other datetime formats that you know are present in your data.
The formats must use a Noda Time LocalDateTime pattern.
To add a format, type the format in the field, then click +.
To remove a format, click its delete icon.
By default, Textual supports the following datetime formats.
Format | Example value
yyyy/M/d | 2024/1/17
yyyy-M-d | 2024-1-17
yyyyMMdd | 20240117
yyyy.M.d | 2024.1.17
yyyy, MMM d | 2024, Jan 17
yyyy-M | 2024-1
yyyy/M | 2024/1
d/M/yyyy | 17/1/2024
d-MMM-yyyy | 17-Jan-2024
dd-MMM-yy | 17-Jan-24
d-M-yyyy | 17-1-2024
d/MMM/yyyy | 17/Jan/2024
d MMMM yyyy | 17 January 2024
d MMM yyyy | 17 Jan 2024
d MMMM, yyyy | 17 January, 2024
ddd, d MMM yyyy | Wed, 17 Jan 2024
M/d/yyyy | 1/17/2024
M/d/yy | 1/17/24
M-d-yyyy | 1-17-2024
MMddyyyy | 01172024
MMMM d, yyyy | January 17, 2024
MMM d, ''yy | Jan 17, '24
MM-yyyy | 01-2024
MMMM, yyyy | January, 2024
yyyy-M-d HH:mm | 2024-1-17 15:45
d-M-yyyy HH:mm | 17-1-2024 15:45
MM-dd-yy HH:mm | 01-17-24 15:45
d/M/yy HH:mm:ss | 17/1/24 15:45:30
d/M/yyyy HH:mm:ss | 17/1/2024 15:45:30
yyyy/M/d HH:mm:ss | 2024/1/17 15:45:30
yyyy-M-dTHH:mm:ss | 2024-1-17T15:45:30
yyyy/M/dTHH:mm:ss | 2024/1/17T15:45:30
yyyy-M-d HH:mm:ss'Z' | 2024-1-17 15:45:30Z
yyyy-M-d'T'HH:mm:ss'Z' | 2024-1-17T15:45:30Z
yyyy-M-d HH:mm:ss.fffffff | 2024-1-17 15:45:30.1234567
yyyy-M-dd HH:mm:ss.FFFFFF | 2024-1-17 15:45:30.123456
yyyy-M-dTHH:mm:ss.fff | 2024-1-17T15:45:30.123
HH:mm | 15:45
HH:mm:ss | 15:45:30
HHmmss | 154530
hh:mm:ss tt | 03:45:30 PM
HH:mm:ss'Z' | 15:45:30Z
By default, when you select the Synthesis option for Age values, Textual shifts the age value to a value that is within seven years before or after the original value. For age values that it cannot synthesize, it scrambles the value.
In the entity types list, to display the age synthesis options, click Options.
To configure the synthesis:
In the Range of Years +/- for the Shifted Age field, enter the number of years before and after the original value to use as the range for the synthesized value.
By default, Textual scrambles age values that it cannot parse. To instead pass through the value unchanged, uncheck Scramble Unrecognized Ages.
For Phone Number values, you can choose whether to generate a realistic phone number. If you do, then the generated values can be consistent with values generated in Structural.
In the entity types list, to display the phone number synthesis options, click Options.
From the Phone number generator type dropdown list:
To replace each phone number with a randomly generated number, select Random Number.
To generate a realistic telephone number, select US Phone Number. The US Phone Number option generates values similar to those generated by the .
If you also configured a Textual statistics seed that matches a Structural statistics seed, then the synthesized values are consistent with values generated in Structural. A given source telephone number produces the same output telephone number in both applications.
For example, in both Textual and Structural, 123-456-6789 might be replaced with 154-567-8901.
The Replace invalid numbers with valid numbers checkbox determines how Textual handles invalid telephone numbers in the data.
To replace the invalid telephone numbers with valid telephone numbers, check the checkbox.
If you do not check the checkbox, then Textual randomly replaces the numeric characters.
By default, when you select the Synthesis option for a custom entity type, Textual scrambles the original value.
From the generator dropdown list, select the generator to use to create the replacement value.
The available generators are:
Scramble
This is the default generator.
Scrambles the original value.
CC Exp
Generates a credit card expiration date.
Company Name
Generates a name of a business.
Credit Card
Generates a credit card number.
CVV
Generates a credit card security code.
Date Time
Generates a datetime value.
The Date Time generator has the .
Email
Generates an email address.
HIPAA Address Generator
Generates a mailing address.
The generator has the same options as the built-in location entity types.
IP Address
Generates an IP address.
MICR Code
Generates an MICR code.
Money
Generates a currency amount.
Name
Generates a person's name.
You configure:
Whether to generate the same replacement value from source values that have different capitalization.
Whether the replacement value reflects the gender of the original value.
Numeric Value
Generates a numeric value.
You configure whether to use the Integer Primary Key generator to generate the value.
Person Age
Generates an age value.
The Person Age generator has the .
Phone Number
Generates a telephone number.
The Phone Number generator has the .
SSN
Generates a United States Social Security Number.
URL
Generates a URL.
When Textual processes pipeline files, it produces JSON output that provides access to the Markdown content and that identifies the detected entities in the file.
All JSON output files contain the following elements that contain information for the entire file:
{
"fileType": "<file type>",
"content": {
"text": "<Markdown file content>",
"hash": "<hashed file content>",
"entities": [ //Entry for each entity in the file
{
"start": <start location>,
"end": <end location>,
"label": "<value type>",
"text": "<value text>",
"score": <confidence score>,
"language": "<language code>"
}
]
},
"schemaVersion": <integer schema version>
}
fileType
The type of the original file.
content
Details about the file content. It includes:
Hashed and Markdown content for the file
Entities in the file
schemaVersion
An integer that identifies the version of the JSON schema that was used for the JSON output.
Textual uses this to convert content from older schemas to the most recent schema.
For specific file types, the JSON output includes additional objects and properties to reflect the file structure.
The JSON output contains hashed and Markdown content for the entire file and for individual file components.
hash
The hashed version of the file or component content.
text
The file or component content in Markdown notation.
The JSON output contains entities arrays for the entire file and for individual file components.
Each entity in the entities array has the following properties:
start
Within the file or component, the location where the entity value starts.
For example, in the following text:
My name is John.
John is an entity that starts at 11.
end
Within the file or component, the location where the entity value ends.
For example, in the following text:
My name is John.
John is an entity that ends at 14.
label
The type of entity.
For a list of the entity types that Textual detects, go to Entity types that Textual detects.
text
The text of the entity.
score
The confidence score for the entity.
Indicates how confident Textual is that the value is an entity of the specified type.
language
The language code to identify the language for the entity value.
For example, en indicates that the value is in English.
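As an illustration of how these properties fit together, the following sketch reads the file-level entities from a JSON output that is shaped like the structure above. The sample values (file type, hash, score, and schema version) are made up for the example. The slice assumes that end is the index of the last character of the value, as in the John example, where the value starts at 11 and ends at 14.

```python
# Sample output shaped like the documented structure. The fileType, hash,
# score, and schemaVersion values are placeholders, not real output.
sample_output = {
    "fileType": "<file type>",
    "content": {
        "text": "My name is John.",
        "hash": "<hashed file content>",
        "entities": [
            {"start": 11, "end": 14, "label": "NAME_GIVEN",
             "text": "John", "score": 0.9, "language": "en"},
        ],
    },
    "schemaVersion": 1,
}


def entity_spans(output):
    """Return (label, value) pairs, slicing each value out of the content text."""
    text = output["content"]["text"]
    # end is treated as inclusive here, following the example in the text.
    return [(e["label"], text[e["start"]:e["end"] + 1])
            for e in output["content"]["entities"]]


print(entity_spans(sample_output))
```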
For plain text files, the JSON output only contains the information for the entire file.
{
"fileType": "<file type>",
"content": {
"text": "<Markdown content>",
"hash": "<hashed content>",
"entities": [ //Entry for each entity in the file
{
"start": <start location>,
"end": <end location>,
"label": "<value type>",
"text": "<value text>",
"score": <confidence score>,
"language": "<language code>"
}
]
},
"schemaVersion": <integer schema version>
}
For .csv files, the structure contains a tables array.
The tables array contains a table object that contains header and data arrays.
For each row in the file, the data array contains a row array.
For each value in a row, the row array contains a value object.
The value object contains the entities, hashed content, and Markdown content for the value.
{
"tables": [
{
"tableName": "csv_table",
"header": [ //Columns that contain heading info (col_0, col_1, and so on)
"<column identifier>"
],
"data": [ //Entry for each row in the file
[ //Entry for each value in the row
{
"entities": [ //Entry for each entity in the value
{
"start": <start location>,
"end": <end location>,
"label": "<value type>",
"text": "<value text>",
"score": <confidence score>,
"language": "<language code>"
}
],
"hash": "<hashed value content>",
"text": "<Markdown value content>"
}
]
]
}
],
"fileType": "<file type>",
"content": {
"text": "<Markdown file content>",
"hash": "<hashed file content>",
"entities": [ //Entry for each entity in the file
{
"start": <start location>,
"end": <end location>,
"label": "<value type>",
"text": "<value text>",
"score": <confidence score>,
"language": "<language code>"
}
]
},
"schemaVersion": <integer schema version>
}
For .xlsx files, the structure contains a tables array that provides details for each worksheet in the file.
For each worksheet, the tables array contains a worksheet object.
Each worksheet object contains a header array and a data array.
For each row in the worksheet, the data array contains a row array.
For each cell in a row, the row array contains a cell object.
Each cell object contains the entities, hashed content, and Markdown content for the cell.
{
"tables": [ //Entry for each worksheet
{
"tableName": "<Name of the worksheet>",
"header": [ //Columns that contain heading info (col_0, col_1, and so on)
"<column identifier>"
],
"data": [ //Entry for each row
[ //Entry for each cell in the row
{
"entities": [ //Entry for each entity in the cell
{
"start": <start location>,
"end": <end location>,
"label": "<value type>",
"text": "<value text>",
"score": <confidence score>,
"language": "<language code>"
}
],
"hash": "<hashed cell content>",
"text": "<Markdown cell content>"
}
]
]
}
],
"fileType": "<file type>",
"content": {
"text": "<Markdown file content>",
"hash": "<hashed file content>",
"entities": [ //Entry for each entity in the file
{
"start": <start location>,
"end": <end location>,
"label": "<value type>",
"text": "<value text>",
"score": <confidence score>,
"language": "<language code>"
}
]
},
"schemaVersion": <integer schema version>
}
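The .csv and .xlsx structures share the same tables layout, so one traversal can cover both. The following Python sketch walks each table, row, and value, and yields every entity together with where it was found. The sample data and the `iter_cell_entities` helper are illustrative assumptions.

```python
from typing import Iterator

# Made-up sample that follows the tables layout; other keys are omitted.
table_output = {
    "tables": [
        {
            "tableName": "csv_table",
            "header": ["col_0", "col_1"],
            "data": [
                [
                    {
                        "entities": [
                            {"start": 0, "end": 3, "label": "NAME_GIVEN",
                             "text": "John", "score": 0.9, "language": "en"}
                        ],
                        "hash": "h1",
                        "text": "John",
                    },
                    {"entities": [], "hash": "h2", "text": "42"},
                ],
            ],
        }
    ]
}


def iter_cell_entities(output: dict) -> Iterator[tuple[str, int, str, dict]]:
    """Yield (table name, row index, column identifier, entity) for every entity."""
    for table in output.get("tables", []):
        header = table.get("header", [])
        for row_index, row in enumerate(table.get("data", [])):
            for col_index, value in enumerate(row):
                # Fall back to a positional name if the header is shorter than the row.
                column = header[col_index] if col_index < len(header) else f"col_{col_index}"
                for entity in value.get("entities", []):
                    yield table["tableName"], row_index, column, entity


found = list(iter_cell_entities(table_output))
for table_name, row_index, column, entity in found:
    print(table_name, row_index, column, entity["label"], entity["text"])
```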
For .docx files, the JSON output structure adds:
A footnotes array for content in footnotes.
An endnotes array for content in endnotes.
A header object for content in the page headers. Includes separate objects for the first page header, even page header, and odd page header.
A footer object for content in the page footers. Includes separate objects for the first page footer, even page footer, and odd page footer.
These arrays and objects contain the entities, hashed content, and Markdown content for the notes, headers, and footers.
{
"footNotes": [ //Entry for each footnote
{
"entities": [ //Entry for each entity in the footnote
{
"start": <start location>,
"end": <end location>,
"pythonStart": <start location in Python>,
"pythonEnd": <end location in Python>,
"label": "<value type>",
"text": "<value text>",
"score": <confidence score>,
"language": "<language code>",
"exampleRedaction": null
}
],
"hash": "<hashed footnote content>",
"text": "<Markdown footnote content>"
}
],
"endNotes": [ //Entry for each endnote
{
"entities": [ //Entry for each entity in the endnote
{
"start": <start location>,
"end": <end location>,
"label": "<value type>",
"text": "<value text>",
"score": <confidence score>,
"language": "<language code>"
}
],
"hash": "<hashed endnote content>",
"text": "<Markdown endnote content>"
}
],
"header": {
"first": {
"entities": [ //Entry for each entity in the first page header
{
"start": <start location>,
"end": <end location>,
"label": "<value type>",
"text": "<value text>",
"score": <confidence score>,
"language": "<language code>"
}
],
"hash": "<hashed first page header content>",
"text": "<Markdown first page header content>"
},
"even": {
"entities": [ //Entry for each entity in the even page header
{
"start": <start location>,
"end": <end location>,
"label": "<value type>",
"text": "<value text>",
"score": <confidence score>,
"language": "<language code>"
}
],
"hash": "<hashed even page header content>",
"text": "<Markdown even page header content>"
},
"odd": {
"entities": [ //Entry for each entity in the odd page header
{
"start": <start location>,
"end": <end location>,
"label": "<value type>",
"text": "<value text>",
"score": <confidence score>,
"language": "<language code>"
}
],
"hash": "<hashed odd page header content>",
"text": "<Markdown odd page header content>"
}
},
"footer": {
"first": {
"entities": [ //Entry for each entity in the first page footer
{
"start": <start location>,
"end": <end location>,
"label": "<value type>",
"text": "<value text>",
"score": <confidence score>,
"language": "<language code>"
}
],
"hash": "<hashed first page footer content>",
"text": "<Markdown first page footer content>"
},
"even": {
"entities": [ //Entry for each entity in the even page footer
{
"start": <start location>,
"end": <end location>,
"label": "<value type>",
"text": "<value text>",
"score": <confidence score>,
"language": "<language code>"
}
],
"hash": "<hashed even page footer content>",
"text": "<Markdown even page footer content>"
},
"odd": {
"entities": [ //Entry for each entity in the odd page footer
{
"start": <start location>,
"end": <end location>,
"label": "<value type>",
"text": "<value text>",
"score": <confidence score>,
"language": "<language code>"
}
],
"hash": "<hashed odd page footer content>",
"text": "<Markdown odd page footer content>"
}
},
"fileType": "<file type>",
"content": {
"text": "<Markdown file content>",
"hash": "<hashed file content>",
"entities": [ //Entry for each entity in the file
{
"start": <start location>,
"end": <end location>,
"label": "<value type>",
"text": "<value text>",
"score": <confidence score>,
"language": "<language code>"
}
]
},
"schemaVersion": <integer schema version>
}
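The additional .docx sections can be gathered in one pass. The following Python sketch collects the entities from the notes, headers, and footers into a dictionary that is keyed by section name. It uses the key names from the JSON template above. The sample data and the `entities_by_section` helper are illustrative assumptions.

```python
# Made-up sample that follows the .docx structure above; most keys are omitted.
docx_output = {
    "footNotes": [
        {"entities": [{"label": "NAME_GIVEN", "text": "John"}],
         "hash": "h1", "text": "See John."}
    ],
    "endNotes": [],
    "header": {
        "first": {"entities": [], "hash": "h2", "text": "Quarterly report"},
    },
    "footer": {
        "first": {"entities": [], "hash": "h3", "text": "Page 1"},
    },
}


def entities_by_section(output: dict) -> dict[str, list[dict]]:
    """Map each note, header, and footer section to its list of entities."""
    sections: dict[str, list[dict]] = {}
    for key in ("footNotes", "endNotes"):
        for index, note in enumerate(output.get(key) or []):
            sections[f"{key}[{index}]"] = note.get("entities", [])
    for key in ("header", "footer"):
        for page in ("first", "even", "odd"):
            part = (output.get(key) or {}).get(page)
            if part:  # skip page variants that are absent from the output
                sections[f"{key}.{page}"] = part.get("entities", [])
    return sections


sections = entities_by_section(docx_output)
for name, entities in sections.items():
    print(name, [entity["text"] for entity in entities])
```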
PDF and image files use the same structure. Textual extracts and scans the text from the files.
For PDF and image files, the JSON output structure adds the following content.
pages array
The pages array contains all of the content on the pages. This includes content in tables and key-value pairs, which are also listed separately in the output.
For each page in the file, the pages array contains a page array.
For each component on the page - such as paragraphs, headings, headers, and footers - the page array contains a component object.
Each component object contains the component entities, hashed content, and Markdown content.
tables array
The tables array contains content that is in tables.
For each table in the file, the tables array contains a table array.
For each row in a table, the table array contains a row array.
For each cell in a row, the row array contains a cell object.
Each cell object identifies the type of cell (header or content). It also contains the entities, hashed content, and Markdown content for the cell.
keyValuePairs array
The keyValuePairs array contains key-value pair content. For example, for a PDF of a form with fields, a key-value pair might represent a field label and a field value.
For each key-value pair, the keyValuePairs array contains a key-value pair object.
The key-value pair object contains:
An automatically incremented identifier. For example, id for the first key-value pair is 1, for the second key-value pair is 2, and so on.
The start and end position of the key-value pair
The text of the key
The entities, hashed content, and Markdown content for the value
{
"pages": [ //Entry for each page in the file
[ //Entry for each component on the page
{
"type": "<page component type>",
"content": {
"entities": [ //Entry for each entity in the component
{
"start": <start location>,
"end": <end location>,
"label": "<value type>",
"text": "<value text>",
"score": <confidence score>,
"language": "<language code>"
}
],
"hash": "<hashed component content>",
"text": "<Markdown component content>"
}
}
]
],
"tables": [ //Entry for each table in the file
[ //Entry for each row in the table
[ //Entry for each cell in the row
{
"type": "<content type>", //ColumnHeader or Content
"content": {
"entities": [ //Entry for each entity in the cell
{
"start": <start location>,
"end": <end location>,
"label": "<value type>",
"text": "<value text>",
"score": <confidence score>,
"language": "<language code>"
}
],
"hash": "<hashed cell text>",
"text": "<Markdown cell text>"
}
}
]
]
],
"keyValuePairs": [ //Entry for each key-value pair in the file
{
"id": <incremented identifier>,
"key": "<key text>",
"value": {
"entities": [ //Entry for each entity in the value
{
"start": <start location>,
"end": <end location>,
"label": "<value type>",
"text": "<value text>",
"score": <confidence score>,
"language": "<language code>"
}
],
"hash": "<hashed value text>",
"text": "<Markdown value text>"
},
"start": <start location of the key-value pair>,
"end": <end location of the key-value pair>
}
],
"fileType": "<file type>",
"content": {
"text": "<Markdown file content>",
"hash": "<hashed file content>",
"entities": [ //Entry for each entity in the file
{
"start": <start location>,
"end": <end location>,
"label": "<value type>",
"text": "<value text>",
"score": <confidence score>,
"language": "<language code>"
}
]
},
"schemaVersion": <integer schema version>
}
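As a rough sketch of how the tables and keyValuePairs arrays might be read, the following Python example pulls the column header text out of each table and builds a lookup from key text to value text. The sample data and both helper functions are illustrative assumptions. The entities lists are left empty for brevity.

```python
# Made-up sample that follows the PDF and image structure; other keys are omitted.
pdf_output = {
    "tables": [
        [
            [
                {"type": "ColumnHeader",
                 "content": {"entities": [], "hash": "h1", "text": "Name"}},
                {"type": "ColumnHeader",
                 "content": {"entities": [], "hash": "h2", "text": "City"}},
            ],
            [
                {"type": "Content",
                 "content": {"entities": [], "hash": "h3", "text": "John"}},
                {"type": "Content",
                 "content": {"entities": [], "hash": "h4", "text": "Atlanta"}},
            ],
        ]
    ],
    "keyValuePairs": [
        {"id": 1, "key": "First name",
         "value": {"entities": [], "hash": "h5", "text": "John"},
         "start": 0, "end": 15},
    ],
}


def header_cells(output: dict) -> list[str]:
    """Return the text of every cell that is marked as a column header."""
    return [
        cell["content"]["text"]
        for table in output.get("tables", [])
        for row in table
        for cell in row
        if cell["type"] == "ColumnHeader"
    ]


def key_value_text(output: dict) -> dict[str, str]:
    """Map the text of each key to the Markdown text of its value."""
    return {
        pair["key"]: pair["value"]["text"]
        for pair in output.get("keyValuePairs", [])
    }


print(header_cells(pdf_output))
print(key_value_text(pdf_output))
```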
For email message files, the JSON output structure adds the following content.
The JSON output includes the following email message identifiers:
The identifier of the current message
If the message was a reply to another message, the identifier of that message
An array of related email messages. This includes the email message that the message replied to, as well as any other messages in an email message thread.
The JSON output includes the email address and display name of the message recipients. It contains separate lists for the following:
Recipients in the To line
Recipients in the CC line
Recipients in the BCC line
The subject object contains the message subject line. It includes:
Markdown and hashed versions of the message subject line.
The entities that were detected in the subject line.
sentDate provides the timestamp when the message was sent.
The plainTextBodyContent object contains the body of the email message. It contains:
Markdown and hashed versions of the message body.
The entities that were detected in the message body.
The attachments array provides information about any attachments to the email message. For each attached file, it includes:
The identifier of the message that the file is attached to.
The identifier of the attachment.
The name of the attachment file.
The JSON output for the file.
The count of words in the original file.
The count of words in the redacted version of the file.
{
"messageId": "<email message identifier>",
"inReplyToMessageId": <message that this message replied to>,
"messageIdReferences": [<related email messages>],
"senderAddress": {
"address": "<sender email address>",
"displayName": "<sender display name>"
},
"toAddresses": [ //Entry for each recipient in the To list
{
"address": "<recipient email address>",
"displayName": "<recipient display name>"
}
],
"ccAddresses": [ //Entry for each recipient in the CC list
{
"address": "<recipient email address>",
"displayName": "<recipient display name>"
}
],
"bccAddresses": [ //Entry for each recipient in the BCC list
{
"address": "<recipient email address>",
"displayName": "<recipient display name>"
}
],
"sentDate": "<timestamp when the message was sent>",
"subject": {
"text": "<Markdown version of the subject line>",
"hash": "<hashed version of the subject line>",
"entities": [ //Entry for each entity in the subject line
{
"start": <start location>,
"end": <end location>,
"label": "<value type>",
"text": "<value text>",
"score": <confidence score>,
"language": "<language code>"
}
]
},
"plainTextBodyContent": {
"text": "<Markdown version of the message body>",
"hash": "<hashed version of the message body>",
"entities": [ //Entry for each entity in the message body
{
"start": <start location>,
"end": <end location>,
"label": "<value type>",
"text": "<value text>",
"score": <confidence score>,
"language": "<language code>"
}
]
},
"attachments": [ //Entry for each attached file
{
"parentMessageId": "<the message that the file is attached to>",
"contentId": "<identifier of the attachment>",
"fileName": "<name of the attachment file>",
"document": {<pipeline JSON for the attached file>},
"wordCount": <number of words in the attachment>,
"redactedWordCount": <number of words in the redacted attachment>
}
],
"fileType": "<file type>",
"content": {
"text": "<Markdown file content>",
"hash": "<hashed file content>",
"entities": [ //Entry for each entity in the file
{
"start": <start location>,
"end": <end location>,
"label": "<value type>",
"text": "<value text>",
"score": <confidence score>,
"language": "<language code>"
}
]
},
"schemaVersion": <integer schema version>
}
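The email-specific fields lend themselves to a short summary. The following Python sketch, which uses a made-up message and a hypothetical `summarize_message` helper, gathers the recipients from all three lists, checks whether the message is a reply, and reports the word counts for each attachment.

```python
# Made-up sample that follows the email structure above; other keys are omitted.
message = {
    "messageId": "<msg-2@example.com>",
    "inReplyToMessageId": "<msg-1@example.com>",
    "messageIdReferences": ["<msg-1@example.com>"],
    "senderAddress": {"address": "sender@example.com", "displayName": "Sender"},
    "toAddresses": [{"address": "to@example.com", "displayName": "To"}],
    "ccAddresses": [{"address": "cc@example.com", "displayName": "Cc"}],
    "bccAddresses": [],
    "attachments": [
        {"parentMessageId": "<msg-2@example.com>", "contentId": "att-1",
         "fileName": "notes.txt", "document": {},
         "wordCount": 120, "redactedWordCount": 7},
    ],
}


def summarize_message(msg: dict) -> dict:
    """Collect recipients, reply status, and attachment word counts."""
    recipients = [
        recipient["address"]
        for field in ("toAddresses", "ccAddresses", "bccAddresses")
        for recipient in msg.get(field) or []
    ]
    attachments = [
        (a["fileName"], a["wordCount"], a["redactedWordCount"])
        for a in msg.get("attachments") or []
    ]
    return {
        "messageId": msg["messageId"],
        "isReply": msg.get("inReplyToMessageId") is not None,
        "recipients": recipients,
        "attachments": attachments,
    }


summary = summarize_message(message)
print(summary)
```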
Returns a modified version of the provided text string that redacts or synthesizes the detected entity values.
POST /api/Redact HTTP/1.1
Host:
Content-Type: application/json
Accept: */*
Content-Length: 716
{
"generatorConfig": {
"NAME_GIVEN": "Redaction",
"NAME_FAMILY": "Redaction"
},
"generatorDefault": "Off",
"docXImagePolicy": "Redact",
"docXCommentPolicy": "Remove",
"pdfSignaturePolicy": "Redact",
"pdfSynthModePolicy": "V1",
"docXTablePolicy": "Redact",
"labelBlockLists": {
"NAME_FAMILY": {
"strings": [],
"regexes": [
".*\\s(disease|syndrom|disorder)"
]
}
},
"labelAllowLists": {
"HEALTHCARE_ID": {
"strings": [],
"regexes": [
"[a-z]{2}\\d{9}"
]
}
},
"customModels": [
"text"
],
"generatorMetadata": {
"ANY_ADDITIONAL_PROPERTY": {
"version": "V1",
"customGenerator": "Scramble",
"swaps": {
"ANY_ADDITIONAL_PROPERTY": "text"
}
}
},
"recordApiRequestOptions": {
"record": true,
"retentionTimeInHours": 1,
"tags": [
"text"
]
},
"customPiiEntityIds": [
"text"
],
"text": "My name is John Smith"
}
OK
{
"originalText": "text",
"redactedText": "text",
"usage": 1,
"deIdentifyResults": [
{
"start": 1,
"end": 1,
"newStart": 1,
"newEnd": 1,
"label": "text",
"text": "text",
"newText": "text",
"score": 1,
"language": "text",
"exampleRedaction": "text",
"jsonPath": "text",
"xmlPath": "text",
"idx": 1
}
]
}
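The request above could be assembled in Python roughly as follows. This is a sketch under several assumptions: the base URL is a placeholder because the Host header in the sample is blank, the API key is assumed to be sent in an Authorization header, and the fields that are left out of the body are assumed to be optional. The example only builds the request. It does not send it.

```python
import json
import urllib.request

BASE_URL = "https://textual.example.com"  # hypothetical host; replace with your own
API_KEY = "<your API key>"                # assumption: sent in the Authorization header


def build_redact_request(text: str) -> urllib.request.Request:
    """Build, but do not send, a POST request for /api/Redact."""
    body = {
        "generatorConfig": {"NAME_GIVEN": "Redaction", "NAME_FAMILY": "Redaction"},
        "generatorDefault": "Off",
        "text": text,
    }
    return urllib.request.Request(
        url=f"{BASE_URL}/api/Redact",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "Authorization": API_KEY},
        method="POST",
    )


request = build_redact_request("My name is John Smith")
print(request.get_method(), request.full_url)

# To send the request against a real instance:
# with urllib.request.urlopen(request) as response:
#     result = json.load(response)
#     print(result["redactedText"])
```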
Returns modified versions of the provided text strings that redact or synthesize the detected entity values.
POST /api/Redact/bulk HTTP/1.1
Host:
Content-Type: application/json
Accept: */*
Content-Length: 705
{
"generatorConfig": {
"NAME_GIVEN": "Redaction",
"NAME_FAMILY": "Redaction"
},
"generatorDefault": "Off",
"docXImagePolicy": "Redact",
"docXCommentPolicy": "Remove",
"pdfSignaturePolicy": "Redact",
"pdfSynthModePolicy": "V1",
"docXTablePolicy": "Redact",
"labelBlockLists": {
"NAME_FAMILY": {
"strings": [],
"regexes": [
".*\\s(disease|syndrom|disorder)"
]
}
},
"labelAllowLists": {
"HEALTHCARE_ID": {
"strings": [],
"regexes": [
"[a-z]{2}\\d{9}"
]
}
},
"customModels": [
"text"
],
"generatorMetadata": {
"ANY_ADDITIONAL_PROPERTY": {
"version": "V1",
"customGenerator": "Scramble",
"swaps": {
"ANY_ADDITIONAL_PROPERTY": "text"
}
}
},
"recordApiRequestOptions": {
"record": true,
"retentionTimeInHours": 1,
"tags": [
"text"
]
},
"customPiiEntityIds": [
"text"
],
"bulkText": [
"text"
]
}
OK
{
"bulkText": [
"text"
],
"bulkRedactedText": [
"text"
],
"usage": 1,
"deIdentifyResults": [
{
"start": 1,
"end": 1,
"newStart": 1,
"newEnd": 1,
"label": "text",
"text": "text",
"newText": "text",
"score": 1,
"language": "text",
"exampleRedaction": "text",
"jsonPath": "text",
"xmlPath": "text",
"idx": 1
}
]
}
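One way to consume the bulk response is to pair each input string with its redacted counterpart. The following Python sketch assumes that bulkText and bulkRedactedText are returned in the same order, which the sample does not state explicitly. The response values, including the redaction placeholder text, are made up.

```python
# Made-up response that follows the shape of the sample above.
response = {
    "bulkText": ["My name is John Smith", "I live in Atlanta"],
    "bulkRedactedText": ["My name is [REDACTED] [REDACTED]", "I live in [REDACTED]"],
    "usage": 8,
    "deIdentifyResults": [],
}


def pair_bulk_results(resp: dict) -> list[tuple[str, str]]:
    """Pair each original string with its redacted version, by position."""
    originals = resp.get("bulkText", [])
    redacted = resp.get("bulkRedactedText", [])
    if len(originals) != len(redacted):
        raise ValueError("bulkText and bulkRedactedText have different lengths")
    return list(zip(originals, redacted))


pairs = pair_bulk_results(response)
for original, redacted_text in pairs:
    print(f"{original!r} -> {redacted_text!r}")
```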
Returns a modified version of the JSON that redacts or synthesizes the detected entity values. The redacted JSON has the same structure as the input JSON. Only the primitive JSON values, such as strings and numbers, are modified.
POST /api/Redact/json HTTP/1.1
Host:
Content-Type: application/json
Accept: */*
Content-Length: 744
{
"generatorConfig": {
"NAME_GIVEN": "Redaction",
"NAME_FAMILY": "Redaction"
},
"generatorDefault": "Off",
"docXImagePolicy": "Redact",
"docXCommentPolicy": "Remove",
"pdfSignaturePolicy": "Redact",
"pdfSynthModePolicy": "V1",
"docXTablePolicy": "Redact",
"labelBlockLists": {
"NAME_FAMILY": {
"strings": [],
"regexes": [
".*\\s(disease|syndrom|disorder)"
]
}
},
"labelAllowLists": {
"HEALTHCARE_ID": {
"strings": [],
"regexes": [
"[a-z]{2}\\d{9}"
]
}
},
"customModels": [
"text"
],
"generatorMetadata": {
"ANY_ADDITIONAL_PROPERTY": {
"version": "V1",
"customGenerator": "Scramble",
"swaps": {
"ANY_ADDITIONAL_PROPERTY": "text"
}
}
},
"customPiiEntityIds": [
"text"
],
"jsonText": "{\"Name\": \"John Smith\", \"Description\": \"John lives in Atlanta, Ga.\"}",
"jsonPathAllowLists": {
"NAME_GIVEN": [
"$.name.first"
]
}
}
OK
{
"originalText": "text",
"redactedText": "text",
"usage": 1,
"deIdentifyResults": [
{
"start": 1,
"end": 1,
"newStart": 1,
"newEnd": 1,
"label": "text",
"text": "text",
"newText": "text",
"score": 1,
"language": "text",
"exampleRedaction": "text",
"jsonPath": "text",
"xmlPath": "text",
"idx": 1
}
]
}
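In the sample request, jsonText is a string that contains JSON, not a nested object, so the document is encoded twice: once on its own, and again as part of the request body. The following Python sketch shows that double encoding. The record mirrors the sample, and the reduced set of request fields is an assumption.

```python
import json

# The JSON document to redact, as a regular Python object.
record = {"Name": "John Smith", "Description": "John lives in Atlanta, Ga."}

body = {
    "generatorConfig": {"NAME_GIVEN": "Redaction", "NAME_FAMILY": "Redaction"},
    "generatorDefault": "Off",
    # First encoding: the document becomes a string value.
    "jsonText": json.dumps(record),
}

# Second encoding: the whole body becomes the request payload. The quotes
# inside jsonText are escaped as \" at this step, as in the sample above.
payload = json.dumps(body)
print(payload)

# Decoding twice recovers the original document.
round_trip = json.loads(json.loads(payload)["jsonText"])
```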
Returns a modified version of the XML that redacts or synthesizes the detected entity values. The redacted XML has the same structure as the input XML. Only the XML inner text and attribute values are modified.
POST /api/Redact/xml HTTP/1.1
Host:
Content-Type: application/json
Accept: */*
Content-Length: 825
{
"generatorConfig": {
"NAME_GIVEN": "Redaction",
"NAME_FAMILY": "Redaction"
},
"generatorDefault": "Off",
"docXImagePolicy": "Redact",
"docXCommentPolicy": "Remove",
"pdfSignaturePolicy": "Redact",
"pdfSynthModePolicy": "V1",
"docXTablePolicy": "Redact",
"labelBlockLists": {
"NAME_FAMILY": {
"strings": [],
"regexes": [
".*\\s(disease|syndrom|disorder)"
]
}
},
"labelAllowLists": {
"HEALTHCARE_ID": {
"strings": [],
"regexes": [
"[a-z]{2}\\d{9}"
]
}
},
"customModels": [
"text"
],
"generatorMetadata": {
"ANY_ADDITIONAL_PROPERTY": {
"version": "V1",
"customGenerator": "Scramble",
"swaps": {
"ANY_ADDITIONAL_PROPERTY": "text"
}
}
},
"customPiiEntityIds": [
"text"
],
"xmlText": "\n <note>\n <to>Tove</to>\n <from>Jani</from>\n <heading>Reminder</heading>\n <body>Don't forget me this weekend!</body>\n </note>\n "
}
OK
{
"originalText": "text",
"redactedText": "text",
"usage": 1,
"deIdentifyResults": [
{
"start": 1,
"end": 1,
"newStart": 1,
"newEnd": 1,
"label": "text",
"text": "text",
"newText": "text",
"score": 1,
"language": "text",
"exampleRedaction": "text",
"jsonPath": "text",
"xmlPath": "text",
"idx": 1
}
]
}
Returns a modified version of the HTML that redacts or synthesizes the detected entity values. The redacted HTML has the same structure as the input HTML. Only the text contained in the HTML elements is modified.
POST /api/Redact/html HTTP/1.1
Host:
Content-Type: application/json
Accept: */*
Content-Length: 830
{
"generatorConfig": {
"NAME_GIVEN": "Redaction",
"NAME_FAMILY": "Redaction"
},
"generatorDefault": "Off",
"docXImagePolicy": "Redact",
"docXCommentPolicy": "Remove",
"pdfSignaturePolicy": "Redact",
"pdfSynthModePolicy": "V1",
"docXTablePolicy": "Redact",
"labelBlockLists": {
"NAME_FAMILY": {
"strings": [],
"regexes": [
".*\\s(disease|syndrom|disorder)"
]
}
},
"labelAllowLists": {
"HEALTHCARE_ID": {
"strings": [],
"regexes": [
"[a-z]{2}\\d{9}"
]
}
},
"customModels": [
"text"
],
"generatorMetadata": {
"ANY_ADDITIONAL_PROPERTY": {
"version": "V1",
"customGenerator": "Scramble",
"swaps": {
"ANY_ADDITIONAL_PROPERTY": "text"
}
}
},
"customPiiEntityIds": [
"text"
],
"htmlText": "\n <!DOCTYPE html>\n <html>\n <body>\n <h1>Account Information</h1>\n <p>Account Holder: John Smith</p>\n </body>\n </html>\n "
}
OK
{
"originalText": "text",
"redactedText": "text",
"usage": 1,
"deIdentifyResults": [
{
"start": 1,
"end": 1,
"newStart": 1,
"newEnd": 1,
"label": "text",
"text": "text",
"newText": "text",
"score": 1,
"language": "text",
"exampleRedaction": "text",
"jsonPath": "text",
"xmlPath": "text",
"idx": 1
}
]
}
Upload a file to a dataset for processing.
File to upload
POST /api/Dataset/{datasetId}/files/upload HTTP/1.1
Host:
Content-Type: multipart/form-data
Accept: */*
Content-Length: 288
{
"document": {
"fileName": "example.txt",
"csvConfig": {
"numColumns": 1,
"hasHeader": true,
"escapeChar": "text",
"quoteChar": "text",
"delimiter": "text",
"nullChar": "text"
},
"datasetId": "6a01360f-78fc-9f2f-efae-c5e1461e9c1e",
"customPiiEntityIds": [
"CUSTOM_ENTITY_1",
"CUSTOM_ENTITY_2"
]
},
"file": "binary"
}
OK
{
"updatedDataset": {
"id": "text",
"name": "text",
"generatorMetadata": "asdfqwer",
"generatorSetup": "{\"NAME_GIVEN\":\"Redaction\", \"NAME_FAMILY\":\"Redaction\"}",
"labelBlockLists": "{\"NAME_FAMILY\": {\"strings\":[],\"regexes\":[\".*\\\\s(disease|syndrom|disorder)\"]}}",
"labelAllowLists": "{ \"HEALTHCARE_ID\": {\"strings\":[],\"regexes\":[\"[a-z]{2}\\\\d{9}\"]} }",
"enabledModels": [
"text"
],
"tags": [
"text"
],
"files": [
{
"fileId": "text",
"fileName": "text",
"fileType": "text",
"datasetId": "text",
"numRows": 1,
"numColumns": 1,
"piiTypes": [
"text"
],
"wordCount": 1,
"redactedWordCount": 1,
"uploadedTimestamp": {},
"fileSource": "Local",
"processingStatus": "text",
"processingError": "text",
"mostRecentCompletedJobId": "text"
}
],
"lastUpdated": {},
"created": {},
"creatorUser": {
"id": "text",
"userName": "text",
"firstName": "text",
"lastName": "text"
},
"docXImagePolicy": "Redact",
"pdfSignaturePolicy": "Redact",
"pdfSynthModePolicy": "V1",
"docXCommentPolicy": "Remove",
"docXTablePolicy": "Redact",
"fileSource": "Local",
"customPiiEntityIds": [
"text"
],
"operations": [
"HasAccess"
],
"rescanJobs": [
{
"id": "text",
"status": "text",
"errorMessages": "text",
"startTime": {},
"endTime": {},
"publishedTime": {},
"datasetFileId": "text",
"jobType": "DeidentifyFile"
}
],
"fileSourceExternalCredential": {
"fileSource": "Local",
"credential": {}
},
"awsCredentialSource": "text",
"outputPath": "text"
},
"uploadedFileId": "text"
}
Downloads the specified file from the dataset. The downloaded file is redacted based on the dataset configuration.
GET /api/Dataset/{datasetId}/files/{fileId}/download HTTP/1.1
Host:
Accept: */*
binary
Downloads all files from the specified dataset. The downloaded files are redacted based on the dataset configuration.
GET /api/Dataset/{datasetId}/files/download_all HTTP/1.1
Host:
Accept: */*
binary
Updates a dataset to use the specified configuration.
PUT /api/Dataset HTTP/1.1
Host:
Content-Type: application/json
Accept: */*
Content-Length: 731
{
"id": "text",
"name": "text",
"generatorSetup": "{\"NAME_GIVEN\":\"Redaction\", \"NAME_FAMILY\":\"Redaction\"}",
"generatorMetadata": {
"ANY_ADDITIONAL_PROPERTY": {
"version": "V1",
"customGenerator": "Scramble",
"swaps": {
"ANY_ADDITIONAL_PROPERTY": "text"
}
}
},
"labelBlockLists": "{\"NAME_FAMILY\": {\"strings\":[],\"regexes\":[\".*\\\\s(disease|syndrom|disorder)\"]}}",
"labelAllowLists": "{ \"HEALTHCARE_ID\": {\"strings\":[],\"regexes\":[\"[a-z]{2}\\\\d{9}\"]} }",
"enabledModels": [
"text"
],
"docXImagePolicy": "Redact",
"pdfSignaturePolicy": "Redact",
"pdfSynthModePolicy": "V1",
"docXCommentPolicy": "Remove",
"docXTablePolicy": "Redact",
"fileSourceExternalCredential": {
"fileSource": "Local",
"credential": {}
},
"awsCredentialSource": "text",
"outputPath": "text"
}
{
"id": "text",
"name": "text",
"generatorMetadata": "asdfqwer",
"generatorSetup": "{\"NAME_GIVEN\":\"Redaction\", \"NAME_FAMILY\":\"Redaction\"}",
"labelBlockLists": "{\"NAME_FAMILY\": {\"strings\":[],\"regexes\":[\".*\\\\s(disease|syndrom|disorder)\"]}}",
"labelAllowLists": "{ \"HEALTHCARE_ID\": {\"strings\":[],\"regexes\":[\"[a-z]{2}\\\\d{9}\"]} }",
"enabledModels": [
"text"
],
"tags": [
"text"
],
"files": [
{
"fileId": "text",
"fileName": "text",
"fileType": "text",
"datasetId": "text",
"numRows": 1,
"numColumns": 1,
"piiTypes": [
"text"
],
"wordCount": 1,
"redactedWordCount": 1,
"uploadedTimestamp": {},
"fileSource": "Local",
"processingStatus": "text",
"processingError": "text",
"mostRecentCompletedJobId": "text"
}
],
"lastUpdated": {},
"created": {},
"creatorUser": {
"id": "text",
"userName": "text",
"firstName": "text",
"lastName": "text"
},
"docXImagePolicy": "Redact",
"pdfSignaturePolicy": "Redact",
"pdfSynthModePolicy": "V1",
"docXCommentPolicy": "Remove",
"docXTablePolicy": "Redact",
"fileSource": "Local",
"customPiiEntityIds": [
"text"
],
"operations": [
"HasAccess"
],
"rescanJobs": [
{
"id": "text",
"status": "text",
"errorMessages": "text",
"startTime": {},
"endTime": {},
"publishedTime": {},
"datasetFileId": "text",
"jobType": "DeidentifyFile"
}
],
"fileSourceExternalCredential": {
"fileSource": "Local",
"credential": {}
},
"awsCredentialSource": "text",
"outputPath": "text"
}
GET /api/Users HTTP/1.1
Host:
Accept: */*
OK
[
{
"id": "text",
"userName": "text",
"firstName": "text",
"lastName": "text",
"photoMetadata": {
"name": "text",
"url": "text",
"fileType": "text",
"content": "Ynl0ZXM=",
"isManualUpload": true
},
"accountMetadata": {
"createdAt": {},
"lastActivityDate": {}
}
}
]
GET /api/permission-sets HTTP/1.1
Host:
Accept: */*
[
{
"id": "text",
"type": "Global",
"name": "text",
"isBuiltIn": true,
"isDefault": true,
"isDisabled": true,
"operations": [
1
],
"createdDate": {},
"lastModifiedDate": {},
"lastModifiedByUserId": "text"
}
]
GET /api/dataset/{datasetId}/shares HTTP/1.1
Host:
Accept: */*
OK
[
{
"id": "text",
"permissionSetId": "text",
"sharedWithUser": {
"id": "text",
"userName": "text",
"firstName": "text",
"lastName": "text"
},
"sharedWithGroup": {
"id": "text",
"userName": "text",
"context": "None"
},
"shareableEntityType": "User",
"resourceId": "text"
}
]
The ID of the dataset
A request to modify the permission assignments for a dataset.
POST /api/dataset/{datasetId}/shares/bulk HTTP/1.1
Host:
Content-Type: application/json
Accept: */*
Content-Length: 182
{
"grant": [
{
"sharedWithUserId": "text",
"sharedWithGroupId": "text",
"permissionSetId": "text"
}
],
"revoke": [
{
"sharedWithUserId": "text",
"sharedWithGroupId": "text",
"permissionSetId": "text"
}
]
}
OK
{
"granted": [
{
"id": "text",
"permissionSetId": "text",
"sharedWithUser": {
"id": "text",
"userName": "text",
"firstName": "text",
"lastName": "text"
},
"sharedWithGroup": {
"id": "text",
"userName": "text",
"context": "None"
},
"shareableEntityType": "User",
"resourceId": "text"
}
],
"revoked": [
{
"id": "text",
"permissionSetId": "text",
"sharedWithUser": {
"id": "text",
"userName": "text",
"firstName": "text",
"lastName": "text"
},
"sharedWithGroup": {
"id": "text",
"userName": "text",
"context": "None"
},
"shareableEntityType": "User",
"resourceId": "text"
}
]
}
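A request body for the bulk shares endpoint could be assembled as in the following Python sketch. It assumes that each grant or revoke entry targets either a user or a group, not both. The sample above shows both fields, so verify this before you use it. The identifiers and the helper functions are placeholders.

```python
import json
from typing import Optional


def share_entry(permission_set_id: str, *, user_id: Optional[str] = None,
                group_id: Optional[str] = None) -> dict:
    """Build one grant or revoke entry for a user or a group (assumed exclusive)."""
    if (user_id is None) == (group_id is None):
        raise ValueError("Provide exactly one of user_id or group_id")
    entry = {"permissionSetId": permission_set_id}
    if user_id is not None:
        entry["sharedWithUserId"] = user_id
    else:
        entry["sharedWithGroupId"] = group_id
    return entry


def build_share_request(grant: list[dict], revoke: list[dict]) -> str:
    """Serialize the grant and revoke lists into the request body."""
    return json.dumps({"grant": grant, "revoke": revoke})


# Placeholder identifiers; real values come from the Users and permission set endpoints.
share_body = build_share_request(
    grant=[share_entry("permission-set-1", user_id="user-1")],
    revoke=[share_entry("permission-set-2", group_id="group-1")],
)
print(share_body)
```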