Tonic Textual

Create and manage datasets

Manage dataset files

Configure the redaction

Preview and obtain output

Textual Python SDK

Textual REST API

Install and administer Textual

Getting started with Textual

When you sign up for a Tonic Textual account, you can immediately get started with a new pipeline.

Note that these instructions are for setting up a new account on Textual Cloud. For a self-hosted instance, depending on how it is set up, you might either create an account manually or use single sign-on (SSO).

Signing up for Textual

To get started with a new Textual account:

  1. Go to https://textual.tonic.ai/.

  2. Click Sign up.

  3. Enter your email address.

  4. Create and confirm a password for your Textual account.

  5. Click Sign Up.

Textual creates your account. After you log in, Textual prompts you to provide some additional information about yourself and how you plan to use Textual.

Survey for new Textual accounts

After you fill out the information and click Get Started, Textual displays the Textual Home page, which you can use to preview how Textual detects and replaces values. For more information, go to Previewing Textual detection and redaction.

Home page for a new account

Using the Textual free trial

When you set up an account on Textual Cloud, you start a Textual free trial.

Using the Getting Started checklist

When you start a free trial, Textual provides a checklist to guide you through initial steps to get started and learn more about Textual and what it can do.

Getting Started Checklist panel

The checklist displays automatically when you first log in. You can close and display it as needed. To display the checklist, in the Textual navigation menu, click Getting Started.

As you complete a step, Textual automatically marks it as completed.

The checklist includes:

  • Using the Home page to preview Textual redaction. When you click the step, you navigate to the Home page. Textual displays a popup panel that describes the task.

  • Installing the Textual SDK. The checklist displays the installation command and an option to copy it. The step is marked as complete when you click the copy icon.

  • Creating an API key. When you click the step, you are prompted to create an API key.

  • Creating an SDK request to redact a text string or a file. When you click the step, you navigate to the Request Explorer. The step is marked as completed when you close the popup panel that describes the task.

Word count limit

During the free trial, Textual scans up to 100,000 words for free. Note that Textual counts actual words, not tokens. For example, "Hello, my name is John Smith." counts as six words.
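To illustrate the difference between counting words and counting tokens, here is a minimal Python sketch. The function name and the regular expression are assumptions made for this example; the exact rules that Textual uses to count words are not described here.

```python
import re

def count_words(text: str) -> int:
    """Count words as runs of letters, digits, or apostrophes (illustrative only)."""
    # Punctuation such as commas and periods is not counted as a word.
    return len(re.findall(r"[A-Za-z0-9']+", text))

print(count_words("Hello, my name is John Smith."))
```

A tokenizer would typically split the same sentence into more pieces, because punctuation and word fragments can become separate tokens.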

After the 100,000 words, Textual disables scanning for your account. Until you purchase a pay-as-you-go subscription, you cannot:

  • Add files to a dataset or pipeline

  • Run a pipeline

Viewing your current usage

During your free trial, Textual displays the current usage in the following locations:

  • On the Home page

  • In the navigation menu

Next steps - pay-as-you-go or product demo

Textual also prompts you to purchase a pay-as-you-go subscription, which allows an unlimited number of words scanned for a flat rate per 1,000 words.

You can also request a Textual product demo.

Supported file types

Tonic Textual can process the following types of files:

  • txt

  • csv

  • tsv

  • docx

  • xlsx

  • pdf

  • png

  • tif or tiff

  • jpg or jpeg

Deleting datasets

Required dataset permission: Delete a dataset

To delete a dataset:

  1. On the dataset details page, click Settings.

  2. On the Dataset Settings page, click Delete Dataset.

  3. Click Confirm Delete.

Tracking and managing file processing

When you add files to a local files dataset, or change the file selection for a cloud storage dataset, Tonic Textual automatically scans the files to identify the entities that they contain.

When you change the dataset configuration, Textual also prompts you to run a new scan. For example, a new scan is required when you:

  • Configure added and excluded values

  • Change the available custom entity types

The file list reflects the current scanning status for the file. A file is initially queued for scanning. When the scan starts, the status changes to scanning. When Textual finishes processing a file, it marks the file as scanned.

As Textual processes each file, it updates the results in the dataset details heading and the entity types list.

Pausing the file processing

If needed, you can pause the file processing. To pause the processing, click Pause.

The information in the heading and entity types list reflects only the files that are scanned.

For a cloud storage dataset, when you generate output, Textual only includes files that are scanned.

Starting a scan on a paused file

After you pause the scan, you can start a scan on individual files.

To start a scan on a file:

  1. Click the options menu for the file.

  2. Click Scan.

Tonic Textual guide

Tonic Textual allows you to put your text-based data to work for you.

A Textual dataset is a collection of files from a local file system or cloud storage.

Textual scans the data files to identify sensitive values. It redacts or replaces those sensitive values, to produce output files in the same format that you can safely use for development and training.

You can instead use a dataset to prepare unstructured text for use in an LLM system. Textual can produce a JSON summary of the detected values, which includes Markdown-formatted output.

You can use the Textual SDK or the Textual REST API to manage datasets or to remove sensitive values from individual text strings and files.

Startup and overview

Textual SDK, REST API, and Snowflake Native App

Need help with Textual? Contact [email protected].

Entity types that Textual detects

Tonic Textual comes with a built-in set of entity types that it detects. You can also configure custom entity types, which detect values based on regular expressions.

You can also view this video overview of entity types and entity type handling.

Generating cloud storage output files

To generate original format output files for a cloud storage dataset, on the dataset details page, click Generate to <cloud storage type>.

Generate option to generate output files for a cloud storage dataset

Tonic Textual generates the output files to the configured output location. If the output location is not configured, then the generate option is disabled.

For datasets that produce JSON output, Textual generates the output files automatically as soon as the output location is configured.

Assigning tags to datasets

Required dataset permission: Edit dataset settings

Tags can help you to organize your datasets. For example, you can use tags to indicate datasets that belong to different groups, or that deal with specific areas of your data.

You can manage tags from both the Datasets page and the dataset details.

Managing tags from the Datasets page

On the Datasets page, the Tags column displays the currently assigned tags.

To change the tag assignment for a dataset:

  1. Click Tags.

  2. On the dataset tags panel, to add a new tag, type the tag text, then press Enter.

  3. To remove a tag, click its delete icon.

  4. To remove all of the tags, click the delete all icon.

Managing tags from the dataset details page

On the dataset details page, the assigned tags display under the dataset name.

To change the tag assignment:

  1. Click Tags.

  2. On the dataset tags panel, to add a new tag, type the tag text, then press Enter.

  3. To remove a tag, click its delete icon.

  4. To remove all of the tags, click the delete all icon.

Changing the dataset name

Required dataset permission: Edit dataset settings

The dataset name displays in the panel at the top left of the dataset details page.

To change the dataset name:

  1. On the dataset details page, click Settings.

  2. On the Dataset Settings page, in the Dataset Name field, provide the new name for the dataset.

  3. Click Save Dataset.

Navigating the file list

For a local files dataset, the file list is a single list of uploaded files.

For a cloud storage dataset, the file list initially displays the first folder that contains dataset files. You can then navigate through the folders and files.

For a cloud storage dataset, you can search for folders in the currently displayed bucket or folder.

To search for a folder, in the search field, start to type the folder name.

Configuring handling of file components

Required dataset permission: Edit dataset settings

The Dataset Settings panel includes options for how Textual handles the following file components:

  • For .docx files, images and comments

  • For PDF files, scanned-in signatures

To display the Dataset Settings page, on the dataset details page, click Settings.

These options are not available for pipelines that also redact files.

Configuring how to handle .docx images

For .docx images, including .svg files, you can configure the dataset to do one of the following:

  • Redact the image content. When you select this option, Textual looks for and blocks out sensitive values in the image.

  • Ignore the image.

  • Replace the images with black boxes.

On the Dataset Settings page, under Image settings for DOCX files:

  • To redact the image content, click Redact contents of images using OCR. This is the default selection.

  • To ignore the images entirely, click Ignore images during scan.

  • To replace the images with black boxes, click Replace images from the output file with black boxes.

Configuring how to handle .docx tables

For .docx tables, you can configure the dataset to either:

  • Redact the table content. When you select this option, Textual detects sensitive values and replaces them based on the entity type configuration.

  • Block out all of the table cells. When you select this option, Textual places a black box over each table cell.

On the Dataset Settings page, under Table settings for DOCX files:

  • To redact the table content, click Redact content using the entity type configuration. This is the default selection.

  • To block out the table content, click Block out all table cell content.

Configuring how to handle .docx comments

For comments in a .docx file, you can configure the dataset to either:

  • Remove the comments from the file.

  • Ignore the comments and leave them in the file.

On the Dataset Settings page, to remove the comments, toggle Remove comments from the output file to the on position. This is the default configuration.

To ignore the comments, toggle Remove comments from the output file to the off position.

Configuring whether to redact PDF signatures

By default, Textual redacts scanned-in signatures in PDF files. You can configure the dataset to instead ignore the signatures.

On the Dataset Settings page:

  • To redact PDF signatures, toggle Detect and redact signatures in PDFs to the on position. This is the default configuration.

  • To ignore PDF signatures, toggle Detect and redact signatures in PDFs to the off position.

Datasets and redaction

You can use the Tonic Textual SDK to manage pipelines and to redact individual strings and files.

Downloading local output files

Required dataset permission: Download redacted dataset files

For each file in a dataset, you can download the output file.

Downloading a single output file

From the file list, to download a single output file:

  1. Click the options menu for the file.

  2. In the options menu, click Download File.

Downloading all of the output files

To download all of the output files, click Download All Files.

How to configure Textual environment variables

On a self-hosted instance of Textual, much of the configuration takes the form of environment variables.

After you configure an environment variable, you must restart Textual.

Docker

For Docker, add the variable to .env in the format:

SETTING_NAME=value

After you update .env, to restart Textual and complete the update, run:

$ docker-compose down

$ docker-compose pull && docker-compose up -d

Kubernetes

For Kubernetes, in values.yaml, add the environment variable to the appropriate env section of the Helm chart.

For example:

After you update the YAML file, to restart the service and complete the update, run:

$ helm upgrade <name_of_release> -n <namespace_name> <path-to-helm-chart>

The above Helm upgrade command is always safe to use when you provide specific version numbers. However, if you use the latest tag, it might result in Textual containers that have different versions.

Manage datasets

Use the REST API to create and manage datasets.

Configuring endpoint URLs for calls to AWS

For calls to AWS products that are used in Textual, you can configure custom URLs to use. For example, if you use proxy endpoints, then you would configure those endpoints in Textual.

The environment variables for custom AWS endpoints include the following:

  • AWS_S3_FORCE_PATH_STYLE - Whether to always use path-style instead of virtual-hosted-style for connections to Amazon S3. The default is false.

    This setting is only used if you configured either AWS_ENDPOINT_URL or AWS_ENDPOINT_URL_S3.

  • AWS_ENDPOINT_URL - The URL to use for all AWS calls, including calls to Amazon S3, Amazon Textract, and Amazon SES v2. This global endpoint is overridden by service-specific endpoints.

  • AWS_ENDPOINT_URL_S3 - The URL to use for calls to Amazon S3. This overrides the global URL set in AWS_ENDPOINT_URL.

  • AWS_ENDPOINT_URL_TEXTRACT - The URL to use for calls to Amazon Textract. This overrides the global URL set in AWS_ENDPOINT_URL.

  • AWS_ENDPOINT_URL_SESV2 - The URL to use for calls to Amazon SES v2. This overrides the global URL set in AWS_ENDPOINT_URL.
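The override rules above can be sketched in Python as follows. This is an illustration of the documented precedence, not Textual's implementation. The function names, the example proxy URLs, and the way the true/false value is parsed are assumptions.

```python
from typing import Mapping, Optional

# Service-specific variables named in the list above.
SERVICE_VARIABLES = {
    "s3": "AWS_ENDPOINT_URL_S3",
    "textract": "AWS_ENDPOINT_URL_TEXTRACT",
    "sesv2": "AWS_ENDPOINT_URL_SESV2",
}

def resolve_endpoint(service: str, env: Mapping[str, str]) -> Optional[str]:
    """Return the custom endpoint for a service, or None to use the AWS default."""
    specific = env.get(SERVICE_VARIABLES[service])
    if specific:
        # A service-specific endpoint overrides the global endpoint.
        return specific
    return env.get("AWS_ENDPOINT_URL") or None

def use_path_style(env: Mapping[str, str]) -> bool:
    """AWS_S3_FORCE_PATH_STYLE only matters when a custom S3 endpoint applies."""
    if resolve_endpoint("s3", env) is None:
        return False
    # Assumption: the value is the literal string "true" or "false".
    return env.get("AWS_S3_FORCE_PATH_STYLE", "false").lower() == "true"

env = {
    "AWS_ENDPOINT_URL": "https://proxy.example.com",        # hypothetical proxy URL
    "AWS_ENDPOINT_URL_S3": "https://s3-proxy.example.com",  # hypothetical proxy URL
}
print(resolve_endpoint("s3", env))
print(resolve_endpoint("textract", env))
```

In this example, Amazon S3 calls go to the S3-specific proxy, and Amazon Textract calls fall back to the global endpoint.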

Deploying with Docker Compose

The Docker Compose file is available in the GitHub repository https://github.com/TonicAI/textual_docker_compose/tree/main.

Fork the repository.

To deploy Textual:

  1. Rename sample.env to .env.

  2. In .env, provide values for the required settings. These are not commented out and have <FILL IN> as a placeholder value:

    • SOLAR_VERSION - Provided by Tonic.ai.

    • SOLAR_LICENSE - Provided by Tonic.ai.

    • ENVIRONMENT_NAME - The name that you want to use for your Textual instance. For example, my-company-name.

    • SOLAR_SECRET - The string to use for Textual encryption.

    • SOLAR_DB_PASSWORD - The password that you want to use for the Textual application database, which stores the metadata for Textual, including the datasets and pipelines. Textual deploys a PostgreSQL database container for the application database.

  3. To deploy and start Textual, run docker-compose up -d.

Redaction

Use the Tonic Textual REST API to redact text. Redaction means to detect and replace sensitive values.

Built-in entity types

Entity types that Textual detects automatically.

Configure custom entity types

Define custom entity types to detect additional values.

Example env section in values.yaml for Kubernetes:

env: {
  "TEXTUAL_ML_WORKERS": "2"
}

Docker Compose repository: https://github.com/TonicAI/textual_docker_compose/tree/main

Create and manage datasets

Create, update, and get redacted files from a Textual dataset.

Redact strings

Send plain text, JSON, or XML strings for redaction.

Redact individual files

Send a file for redaction and retrieve the results.

Transcribe and redact audio files

Send an audio file to be transcribed and retrieve the redacted transcription.

Configure entity type handling

Configure how Textual treats each type of entity in a dataset, redacted file, or redacted string.

Record and review redaction requests

View the results of an SDK redaction request in the Textual application.

Redact text strings

Redact individual text strings.

Getting started with Textual

Sign up for a Textual account.

Textual entity types

Built-in entity types come with Textual. You can also configure custom entity types.

Preview Textual detection and redaction

Use the home page to see how Textual identifies sensitive values in text or a file.

Datasets workflow

Use Textual to detect and replace sensitive values in files.

Manage API keys

Generate and revoke API keys for SDK and API authentication.

SDK - Datasets and redaction

Use the Textual Python SDK to redact text and manage datasets. Review redaction requests in the Request Explorer.

REST API

Use the Textual REST API to redact text strings, manage datasets, and manage user access.

Snowflake Native App

Use the Snowflake Native App to redact values in your data warehouse.

Changing cloud storage credentials and output location

Required dataset permission: Edit dataset settings

For a cloud storage dataset, you can:

  • Update the cloud storage credentials. Note that this option is only available if you provided the credentials manually. If you use the credentials set in environment variables, then you cannot change the credentials.

  • Change the output location for the generated output files

You configure the connection credentials and output location in the Connection Settings section of the Dataset Settings page.

To display the Dataset Settings page, on the dataset details page, click Settings.

After you update the configuration, click Save Dataset.

Changing cloud storage credentials

Amazon S3

To provide updated credentials for Amazon S3:

  1. In the Access Key field, provide an AWS access key that is associated with an IAM user or role. For an example of a role that has the required permissions for an Amazon S3 dataset, go to .

  2. In the Access Secret field, provide the secret key that is associated with the access key.

  3. From the Region dropdown list, select the AWS Region to send the authentication request to.

  4. In the Session Token field, provide the session token to use for the authentication request.

  5. To test the credentials, click Test AWS Connection.

Azure

To provide updated credentials for Azure:

  1. In the Account Name field, provide the name of your Azure account.

  2. In the Account Key field, provide the access key for your Azure account.

  3. To test the connection, click Test Azure Connection.

SharePoint

SharePoint credentials must have the following application permissions (not delegated permissions):

  • Files.Read.All - To see the SharePoint files

  • Files.ReadWrite.All - To write redacted files and metadata back to SharePoint

  • Sites.ReadWrite.All - To view and modify the SharePoint sites

To provide updated credentials for SharePoint:

  1. In the Tenant ID field, provide the SharePoint tenant identifier for the SharePoint site.

  2. In the Client ID field, provide the client identifier for the SharePoint site.

  3. In the Client Secret field, provide the secret to use to connect to the SharePoint site.

  4. To test the connection, click Test SharePoint Connection.

Setting the output location

The output location is where Textual writes the redacted files.

When you create a cloud storage dataset, after you select the initial set of files and folders, Textual prompts you to select the output location.

For an existing dataset, you set the output location from the Dataset Settings page.

Under Select Output Location, select the cloud storage folder where Textual writes the output files for the dataset.

When you generate output for a cloud storage dataset, Textual creates a folder in the output location. The folder name is the identifier of the job that generated the files.

Within the job folder, Textual recreates the folder structure for the original files.

Textual then writes the output files to the corresponding folders.
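The resulting layout can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function name, the example job identifier, and the example paths are hypothetical, and the sketch assumes that the output file keeps the name of the original file.

```python
from pathlib import PurePosixPath

def output_path(output_location: str, job_id: str, source_relative_path: str) -> str:
    """Build <output location>/<job identifier>/<original folder structure>/<file>."""
    root = PurePosixPath(output_location)
    # Strip a leading slash so that the original path stays under the job folder.
    relative = PurePosixPath(source_relative_path.lstrip("/"))
    return str(root / job_id / relative)

print(output_path("redacted-output", "job-1234", "contracts/2024/agreement.docx"))
```

Because each job writes to its own folder, output from an earlier job is not overwritten when you generate output again.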

Transcribe and redact an audio file

You can send an audio file to the Tonic Textual SDK. Textual creates a transcription of the audio file, and then redacts the transcription text as a string.

Audio file limitations

The file must be 25MB or smaller, and must be one of the following file types:

  • m4a

  • mp3

  • webm

  • mp4

  • mpga

  • wav
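Before you send a file, you might check it against these limits on the client side. The following Python sketch is one way to do that. The function name is hypothetical, and the sketch assumes that 25MB means 25 × 1024 × 1024 bytes and that the file type is determined by the file extension.

```python
from pathlib import Path

ALLOWED_EXTENSIONS = {"m4a", "mp3", "webm", "mp4", "mpga", "wav"}
MAX_BYTES = 25 * 1024 * 1024  # assumption: "25MB" is read as 25 MiB

def audio_file_problems(file_name: str, size_bytes: int) -> list[str]:
    """Return a list of reasons why the file would not be accepted."""
    problems = []
    extension = Path(file_name).suffix.lower().lstrip(".")
    if extension not in ALLOWED_EXTENSIONS:
        problems.append(f"unsupported file type: {extension or '(none)'}")
    if size_bytes > MAX_BYTES:
        problems.append("file is larger than 25MB")
    return problems

print(audio_file_problems("call.mp3", 1_000_000))
```

An empty list means that the file passes both checks.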

Sending the transcription and redaction request

To transcribe and redact an audio file, you use textual.redact_audio.

redaction_response = textual.redact_audio("<path to the audio file>")
print(redaction_response.describe())

The request includes the entity type handling configuration.

The redaction response includes the redacted or synthesized content and details about the detected entity values.

Deploying on Kubernetes with Helm

The Tonic Textual Helm chart is available in the GitHub repository https://github.com/TonicAI/textual_helm_charts.

To use the Helm chart, you can either:

  • Use the OCI-based registry that Tonic hosts on quay.io.

  • Fork or clone the repository and then maintain it locally.

During the onboarding period, you are provided access credentials to our Docker image repository on Quay.io. If you require new credentials, or you experience issues accessing the repository, contact [email protected].

Configure

Before you deploy Textual, you create a values.yaml file with the configuration for your instance.

For details about the required and optional configuration options, go to the repository readme.

Deploy

To deploy and validate access to Textual from the forked repository, follow the instructions in the repository readme.

To use the OCI-based registry, run:

helm install textual oci://quay.io/tonicai/textual -f values.yaml -n textual --create-namespace

The GitHub repository contains a readme with the details on how to populate a values.yaml file and deploy Textual.

Configuring the number of textual-ml workers

The TEXTUAL_ML_WORKERS environment variable specifies the number of workers to use within the textual-ml container. The default value is 1.

Having multiple workers allows for parallelization of inferences with NER models. The number of required workers is also affected by the number of jobs that each worker can run simultaneously.

When you deploy Textual with Kubernetes on GPUs, parallelization allows the textual-ml container to fully utilize the GPU.

We recommend 3GB of GPU RAM for each worker.
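As a rough sizing aid, the recommendation can be turned into a small Python calculation. The function name is hypothetical, and the result is only an upper bound based on GPU RAM. It does not account for other processes that use the GPU.

```python
def max_ml_workers(gpu_ram_gb: float, ram_per_worker_gb: float = 3.0) -> int:
    """Estimate how many textual-ml workers fit in the available GPU RAM."""
    # Never go below 1, which is the default value of TEXTUAL_ML_WORKERS.
    return max(1, int(gpu_ram_gb // ram_per_worker_gb))

print(max_ml_workers(16))
```

For a GPU with 16GB of RAM, this suggests at most 5 workers, so TEXTUAL_ML_WORKERS would be set to 5 or lower.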

Configuring processing and parallelism

The following environment variables control job and file processing.

Configuring the number of jobs to run concurrently

By default, each Tonic Textual worker can run eight jobs at the same time. For example, it can process up to eight files simultaneously.

The environment variable SOLAR_MAX_CONCURRENT_WORKER_JOBS controls the number of jobs to run concurrently.

The number of jobs that can run concurrently can affect the number of Textual workers that you need. The more jobs that can run concurrently, the fewer workers that are needed.
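The relationship between concurrent jobs and workers can be illustrated with a short Python calculation. The function name is hypothetical, and the sketch assumes that jobs are spread evenly across workers.

```python
import math

def workers_needed(concurrent_jobs_target: int, jobs_per_worker: int = 8) -> int:
    """Estimate the workers needed to run the target number of jobs at once."""
    if concurrent_jobs_target < 1 or jobs_per_worker < 1:
        raise ValueError("both values must be at least 1")
    return math.ceil(concurrent_jobs_target / jobs_per_worker)

print(workers_needed(20))      # default of 8 jobs per worker
print(workers_needed(20, 16))  # SOLAR_MAX_CONCURRENT_WORKER_JOBS set to 16
```

With the default of eight jobs per worker, 20 simultaneous jobs need three workers. When each worker runs 16 jobs, two workers are enough.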

Configuring the size of the datetime generator cache

When it generates datetime values, to optimize the processing, Textual stores the redacted datetime values in a cache.

To change the cache size, configure the environment variable SOLAR_DATETIME_GENERATOR_CACHE_CAPACITY.

The default value is 100000, meaning that the cache contains 100,000 values.

Note that while increasing the size of the cache can speed up processing, it also uses more RAM.

Configuring the number of PDF pages to redact simultaneously

When Textual redacts PDF files so that a user can preview or download the output, the following environment variable determines the number of pages that it processes simultaneously:

SOLAR_PDF_PAGE_REDACTION_PARALLELISM

The default value is 4, meaning that Textual processes 4 pages at a time.

Configuring the number of PDF files to plan simultaneously

When Textual plans the redaction of PDF files for a user to preview or download, the following environment variable determines the number of files that it plans simultaneously.

SOLAR_PDF_DOC_PLAN_PARALLELISM

The default value is 3, meaning that Textual plans 3 PDF files at a time.

Configuring how often to purge cached PDF pages

When it redacts PDF files, Textual stores the redacted PDF pages in a cache.

The following environment variable determines how often Textual purges the cache of PDF pages.

PURGE_REDACTED_PAGES_IN_HOURS

The default value is 12, meaning that Textual purges the redacted PDF pages cache every 12 hours.

REST API authentication

Before you can use the API, you must create a Tonic Textual API key. For information on how to obtain a Textual API key, go to Creating and revoking Textual API keys.

When you call the API, you place your API key in the authorization header of the request, similar to the following curl request, which fetches the list of datasets for the current user.

curl --request GET \
--url "https://textual.tonic.ai/api/dataset" \
--header "Content-Type: application/json" \
--header "Authorization: API_KEY"

Most Textual API requests require authentication. For each request, the reference information indicates whether the request requires an API key.

For requests that require an API key, if you do not provide a valid API key, you receive a 401 Unauthorized response.
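The same request can be sketched in Python with only the standard library. This is an illustration that mirrors the curl example above. The function names are hypothetical, and the sketch assumes that the response body is JSON.

```python
import json
import urllib.error
import urllib.request

BASE_URL = "https://textual.tonic.ai"

def build_dataset_list_request(api_key: str) -> urllib.request.Request:
    """Build the GET request that fetches the datasets for the current user."""
    return urllib.request.Request(
        url=f"{BASE_URL}/api/dataset",
        method="GET",
        headers={
            "Content-Type": "application/json",
            "Authorization": api_key,  # the API key goes in the authorization header
        },
    )

def list_datasets(api_key: str):
    """Send the request and return the parsed response (assumed to be JSON)."""
    request = build_dataset_list_request(api_key)
    try:
        with urllib.request.urlopen(request) as response:
            return json.load(response)
    except urllib.error.HTTPError as error:
        if error.code == 401:
            # Returned when the API key is missing or not valid.
            raise PermissionError("Textual rejected the API key (401 Unauthorized)") from error
        raise
```

To run the request, call list_datasets with your own API key. For a self-hosted instance, replace BASE_URL with the address of your instance.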

Installing the Textual SDK

The Tonic Textual SDK is a Python SDK that you can use to redact text and files.

It requires Python 3.9 or higher.

To install the Tonic Textual Python SDK, run:

pip install tonic-textual

Configuring the format of Textual logs

Textual writes the worker, machine learning, and API log messages to stdout.

By default, the log messages are in an unstructured format.

To instead use a JSON format for the logs, set the environment variable SOLAR_EMIT_JSON_LOGS_TO_STDOUT to true.
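When JSON logging is enabled, a log consumer can parse each line instead of matching free text. The following Python sketch shows one tolerant way to do that. The function name and the field names in the example are hypothetical, because the fields that Textual emits are not listed here.

```python
import json

def parse_log_line(line: str) -> dict:
    """Parse a JSON log line, falling back to the raw text for unstructured lines."""
    line = line.rstrip("\n")
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return {"raw": line}
    # Only treat JSON objects as structured records.
    return record if isinstance(record, dict) else {"raw": line}

# The field names below are made up for the example.
print(parse_log_line('{"level": "info", "message": "scan started"}'))
print(parse_log_line("scan started"))
```

The fallback keeps the consumer working if some containers still emit unstructured messages.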

Azure

Use these instructions to set up Azure Active Directory as your SSO provider for Tonic Textual.

Azure configuration

Register Textual as an application within the Azure Active Directory Portal:

  1. In the portal, navigate to Azure Active Directory -> App registrations, then click New registration.

  2. Register Textual and create a new web redirect URI that points to your Textual instance's address and the path /sso/callback/azure.

  3. Take note of the values for client ID and tenant ID. You will need them later.

  4. Click Add a certificate or secret, and then create a new client secret. Take note of the secret value. You will need this later.

  5. Navigate to the API permissions page. Add the following permissions for the Microsoft Graph API:

    • OpenId permissions

      • email

      • openid

      • profile

    • GroupMember

      • GroupMember.Read.All

    • User

      • User.Read

  6. Click Grant admin consent for Tonic AI. This allows the application to read the user and group information from your organization. When permissions have been granted, the status should change to Granted for Tonic AI.

  7. Navigate to Enterprise applications and then select Textual. From here, you can assign the users or groups that should have access to Textual.

Textual configuration

After you complete the configuration in Azure, you uncomment and configure the required environment variables in Textual.

For Kubernetes, in values.yaml:

# Azure SSO Config
# -----------------
#azureClientId: <client-id>
#azureTenantId: <tenant-id>
#azureClientSecret: <client-secret>
#azureGroupFilterRegex: <regular expression to identify allowed groups>

For Docker, in .env:

#SOLAR_SSO_AZURE_CLIENT_ID=#<client ID>
#SOLAR_SSO_AZURE_TENANT_ID=#<tenant ID>
#SOLAR_SSO_AZURE_CLIENT_SECRET=#<client secret>
#SOLAR_SSO_AZURE_GROUP_FILTER_REGEX=#<regular expression to identify allowed groups>
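Before you set the group filter, you might test the regular expression against your group names. The following Python sketch is one way to do that. The group names and the function name are hypothetical, and the sketch assumes that a group is allowed when the expression matches anywhere in the group name. Textual's exact matching rules are not described here.

```python
import re

def allowed_groups(group_names, group_filter_regex):
    """Return the group names that the regular expression matches."""
    pattern = re.compile(group_filter_regex)
    return [name for name in group_names if pattern.search(name)]

groups = ["textual-admins", "textual-users", "finance"]  # hypothetical group names
print(allowed_groups(groups, r"^textual-"))
```

Anchoring the expression with ^ or $ avoids accidental matches in the middle of a group name.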

Textual organizations

In Tonic Textual, each user belongs to an organization. Organizations are used to determine the company or customer that a Textual user belongs to.

A self-hosted instance of Textual contains a single organization. All users belong to that organization.

Textual Cloud hosts multiple organizations. The organizations are kept completely separate. Users from one Textual Cloud organization do not have any access to the users, datasets, or pipelines that belong to a different Textual Cloud organization.

When is a Textual organization created?

A Textual organization is created:

  • For a standard Textual license, both self-hosted and Textual Cloud, when the first user signs up for a Textual account.

  • When a user signs up for a free trial or pay-as-you-go Textual Cloud license with a unique corporate email domain.

  • When a user signs up for a free trial or pay-as-you-go Textual Cloud license with a public email domain, such as Gmail or Yahoo. Every user with a public email domain is in a separate organization.

When is a new user added to an existing organization?

Self-hosted instance

A self-hosted instance has a single organization. Every user who signs up for an account on that instance is added to the organization.

Annual Textual Cloud license (not pay-as-you-go)

For companies with an annual Textual Cloud license, the license lists the email domains that it covers.

When a user with one of the included email domains signs up for a Textual account, they are automatically added to that organization.

Pay-as-you-go license

For a pay-as-you-go license, when a user with the same corporate email domain signs up for a Textual account, they are automatically added to that organization.

Users with public email domains are always in separate organizations.
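The assignment rules for free trial and pay-as-you-go accounts on Textual Cloud can be sketched in Python as follows. This is an illustration of the rules above, not Textual's implementation. The function name and the list of public domains are assumptions, and the list is only a small subset.

```python
PUBLIC_EMAIL_DOMAINS = {"gmail.com", "yahoo.com"}  # illustrative subset only

def organization_key(email: str) -> str:
    """Return a key that is the same for users who share an organization."""
    address = email.strip().lower()
    domain = address.rsplit("@", 1)[1]
    if domain in PUBLIC_EMAIL_DOMAINS:
        # Every user with a public email domain is in a separate organization.
        return f"user:{address}"
    # Users with the same corporate email domain share an organization.
    return f"domain:{domain}"

print(organization_key("ana@acme.example") == organization_key("raj@acme.example"))
print(organization_key("ana@gmail.com") == organization_key("raj@gmail.com"))
```

The first comparison is true because both users share a corporate domain. The second is false because public email domains never share an organization.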

Configuring access to global permission sets

Required global permissions:

  • Manage access to global permission sets

  • View users and groups

From the Global Permission Sets list, you can grant or revoke access to a global permission set. Global permission sets can be assigned to individual users and to SSO groups.

Access to dataset permission sets is managed from the Datasets page. For more information, go to Sharing dataset access.

Access to pipeline permission sets is managed from the Pipelines page. For more information, go to Sharing pipeline access.

You cannot change the assignment of the following global permission sets:

  • The global permission set that is assigned to all Textual users. Initially, this is the General User permission set, but it can be changed to a different permission set.

  • The built-in Admin (Environment) global permission set.

Before you assign a global permission set to an SSO group, make sure that you are aware of who is in the group. The permissions that are granted to an SSO group are automatically granted to all of the users in the group.

To manage the permission set assignment:

  1. On the Global Permission Sets list, for the permission set to manage, click Manage Access.

  2. To grant access to a user or group:

    1. Begin to type the user or group name.

    2. In the list of matching users or groups, click the user or group name.

  3. To remove access from a user or group, click Revoke for that user or group.

  4. To save the changes to the permission set access, click Save.

Viewing the list of SSO groups in Textual

Required global permission: View users and groups

If you use SSO to manage Tonic Textual groups, then Textual displays the list of groups for which at least one user has logged in to Textual.

To display the SSO group list:

  1. Click the user icon at the top right.

  2. In the user menu, click Permission Settings.

  3. On the Permission Settings page, click Groups.

If no users from a group have logged in to Textual, then the group does not display in the list.

The list only displays the group names and indicates the SSO provider. To manage the group permissions:

  • To assign global permission sets, go to the Global Permission Sets list. For more information, go to Configuring access to global permission sets.

  • To assign dataset permission sets, go to the Datasets page. For more information, go to Sharing dataset access.

  • To assign pipeline permission sets, go to the Pipelines page. For more information, go to Sharing pipeline access.

Viewing the lists of permission sets

Displaying the permission set lists

The Permission Settings page contains the lists of global, dataset, and pipeline permission sets.

To display the Permission Settings page:

  1. Click the user icon at the top right.

  2. In the user menu, click Permission Settings.

On the Permission Settings page:

  • Global Permission Sets contains the list of global permission sets.

  • Dataset Permission Sets contains the list of dataset permission sets.

  • Pipeline Permission Sets contains the list of pipeline permission sets.

The lists include:

  • The permission set name.

  • Whether the permission set is built-in or custom.

  • For custom permission sets, when it was most recently modified, and the user who modified it.

Viewing the details for a permission set

To view the details for a permission set, in the permission sets list, click Settings.

The details panel for a permission set includes:

  • The name of the permission set.

  • The permission configuration.

Creating a new account in an existing organization

New user on a self-hosted instance

If your company has a self-hosted Textual instance that is installed on-premises, then you navigate to the Textual URL for that instance.

Your self-hosted instance might be configured to use single sign-on for Textual access. If so, then from the Textual login page, to create your Textual user account, click the single sign-on option.

Otherwise, to create your Textual user account, click Sign Up.

Your administrator can provide the URL for your Textual instance and confirm the instructions for creating your user account.

New user for an existing Textual Cloud organization

If your Textual license is on Textual Cloud, then new users that have a matching email domain are automatically added to your Textual Cloud organization.

For a Textual Cloud license other than a pay-as-you-go license, the license agreement specifies the included email domains. When a user with a matching email domain signs up for a Textual account, they are added to that Textual Cloud organization.

For a pay-as-you-go Textual Cloud license, when a user with the same corporate email domain as the subscribed user signs up for a Textual account, they are added to that Textual Cloud organization.

To create your Textual user account, on the Textual Cloud login page, click Sign Up.

Viewing the dataset list and details

Viewing the list of datasets

Displaying the Datasets page

To display the Datasets page, in the navigation menu, click Datasets.

Datasets page

The datasets list only displays the datasets that you have access to.

Users who have the global permission View all datasets can see the complete list of datasets.

For each dataset, the Datasets page includes:

  • The name of the dataset

  • Any tags assigned to the dataset. For datasets that you can edit, there is also an option to assign tags. For more information, go to Assigning tags to datasets.

  • The user who most recently updated the dataset

  • When the dataset was created

Filtering the datasets by name

To filter the datasets by name, in the search field, begin to type text that is in the dataset name.

As you type, the list is filtered to only include datasets with names that contain the filter text.

Filtering the datasets by tag

You can assign tags to each dataset. Tags can help you to organize datasets and to see the dataset configuration at a glance.

On the Datasets page, to filter the datasets by their assigned tags:

Panel to filter datasets by their assigned tags
  1. In the heading for the Tags column, click the filter icon.

  2. On the tag list, check the checkbox for each tag to include.

To find a specific tag, in the search field, type the tag name.
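As a rough illustration of how the name and tag filters combine, the following Python sketch filters an in-memory list of datasets. The `filter_datasets` function and the dictionary layout are hypothetical. The sketch also assumes that name matching ignores case and that a dataset matches when it has at least one of the selected tags. This guide does not confirm either behavior.

```python
def filter_datasets(datasets, name_text="", tags=()):
    """Keep datasets whose name contains name_text and that carry a selected tag."""
    needle = name_text.lower()
    selected = set(tags)
    return [
        d
        for d in datasets
        # Name filter: the dataset name must contain the filter text.
        if needle in d["name"].lower()
        # Tag filter: skipped when no tag is selected (assumed any-of matching otherwise).
        and (not selected or selected & set(d.get("tags", [])))
    ]


datasets = [
    {"name": "Claims 2024", "tags": ["pdf"]},
    {"name": "Support chats", "tags": ["txt", "pii"]},
    {"name": "claims archive", "tags": []},
]
matching = filter_datasets(datasets, name_text="claims", tags=["pdf"])
```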

Displaying details for a dataset

Required dataset permission: View dataset settings

To display the details page for a dataset, on the Datasets page, click the dataset name.

Dataset details page

The dataset details page includes:

  • The tags assigned to the dataset, as well as an option to add tags. For more information, go to Assigning tags to datasets.

  • The list of files in the dataset. For a cloud storage dataset, where the files can be located across multiple folders, Textual navigates to the first folder that contains selected dataset files.

  • The results of the scan for entity values

  • The configured handling for each entity type

Working with custom entity types

From the entity types list, you can set whether each custom entity is active, and edit the custom entity configuration.

You can also create a new custom entity type.

Enabling and disabling custom entity types

Required dataset permission: Edit dataset settings

In the entity types list, custom entity types include a toggle to indicate whether the custom entity type is active for that dataset or pipeline.

Custom entity type in the found entity types list

To disable a custom entity type, set the toggle to the off position.

When a custom entity type is enabled, then it is listed under either the found or not found entity types, depending on whether the files include entities of that type.

When a custom entity type is not enabled, it is listed under Inactive custom entity types. To enable the custom entity type, set the toggle to the on position.

Inactive custom entity types list

Editing a custom entity type

Required global permission - either:

  • Create custom entity types

  • Edit any custom entity type

To edit a custom entity type, click the settings icon for that type.

Note that any changes to the custom entity type settings affect all of the datasets and pipelines that use the custom entity type.

For information on how to configure a custom entity type, go to .

Creating a custom entity type

Required global permission: Create custom entity types

From the dataset details or pipeline details page, to create a new custom entity type, click Create Custom Entity Type.

Create Custom Entity Type option on the dataset details page
Create Custom Entity Type option on the pipeline details page

For information on how to configure a custom entity type, go to .

Running a new scan to reflect custom entity type changes

When you enable, disable, add, or edit custom entity types, the changes do not take effect until you run a new scan.

For datasets and uploaded file pipelines, to run a new scan, click Scan.

Scan prompt for a dataset or an uploaded file pipeline

For a cloud storage pipeline, Textual scans the files when you run the pipeline.

Redact text strings

You can use the Tonic Textual REST API to redact text strings, including:

  • Plain text

  • JSON

  • XML

  • HTML

Textual provides a specific endpoint for each format. For JSON, XML, and HTML, Textual only redacts the text values. It preserves the underlying structure.
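To make the idea of structure-preserving redaction concrete, here is a minimal Python sketch that walks a JSON document and changes only the string values. It does not call Textual. `redact_json_values` and `fake_redact` are hypothetical helpers, and `fake_redact` is a stand-in that masks a single hard-coded name.

```python
import json


def redact_json_values(node, redact_text):
    """Apply redact_text to every string value; keys and structure stay untouched."""
    if isinstance(node, dict):
        return {key: redact_json_values(value, redact_text) for key, value in node.items()}
    if isinstance(node, list):
        return [redact_json_values(item, redact_text) for item in node]
    if isinstance(node, str):
        return redact_text(node)
    return node  # numbers, booleans, and null pass through unchanged in this sketch


def fake_redact(text):
    # Hypothetical stand-in for a redaction call; it only masks one known name.
    return text.replace("Michael", "[NAME_GIVEN]")


doc = json.loads('{"patient": {"first": "Michael", "age": 41}, "notes": ["Michael called"]}')
redacted = redact_json_values(doc, fake_redact)
```

The output has the same keys, nesting, and array lengths as the input. Only the text inside the string values differs.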

Manage dataset files

Use the REST API to manage dataset files.

About the Textual REST API

The Tonic Textual REST API allows you to more deeply integrate Textual functions into your existing workflows.

You can use the REST API as another tool alongside the Textual application and the Textual Python SDK. The Python SDK supports the same actions as the REST API. We recommend the Python SDK for customers who already use Python.

You can download the Textual OpenAPI specification from:

https://textual.tonic.ai/swagger/v1/swagger.json

Viewing model specifications

On a self-hosted instance of Tonic Textual, you can view the current model specifications for the instance.

To view the model specifications:

  1. Click the user icon at the top right.

  2. In the user menu, click System Settings.

On the System Settings page, the Model Specifications section provides details about the models that Textual uses.

Model Specifications section on the System Settings page

Single sign-on (SSO)

Tonic Textual respects the access control policy of your single sign-on (SSO) provider. To access Textual, users must be granted access to the Textual application within your SSO provider.

Self-hosted instances can use any of the available SSO options. Textual Cloud organizations can enable Okta SSO.

To enable SSO, you first complete the required configuration in the SSO provider. You then configure Textual to connect to it. For self-hosted instances, you use Textual environment variables for the configuration. For a Textual Cloud organization, you use the Single Sign-On tab on the Permission Settings page.

After you enable SSO, users can use SSO to create an account in Textual.

For self-hosted instances, to only allow SSO authentication, set the environment variable REQUIRE_SSO_AUTH to true. For Textual Cloud, this is configured in the application. When SSO is required, Textual disables standard email/password authentication. All account creation and login is handled through your SSO provider. If multi-factor authentication (MFA) is set up with your SSO, then all authentication must go through your provider's MFA.

You can view the list of SSO groups whose members have logged into Textual.

Tonic Textual supports the following SSO providers:

Selecting the handling option for entity types

Required dataset permission: Edit dataset settings

For datasets that produce redacted files, for each entity type, you choose how to handle the detected values. This determines how each value displays in the output files.

Datasets that produce JSON output do not use entity type handling options.

Available handling options

The available options are:

  • Synthesis - Indicates to replace the value with another realistic value. For example, the first name value Michael might be replaced with the value John. The synthesized values are always consistent, meaning that a given entity value always has the same replacement value. For example, if the first name Michael appears multiple times in the text, it is always replaced with John. Textual does not synthesize any excluded values. For custom entity types, Textual scrambles the values.

  • Redaction - This is the default option, except for the Full Mailing Address entity type, which is Off by default. For text files, Redaction indicates to tokenize the value - to replace it with a token that identifies the entity type followed by a unique identifier. For example, the first name value Michael might be replaced with NAME_GIVEN_12m5sb. The identifiers are consistent, which means that for a given original value, the replacement always has the same unique identifier. For example, the first name Michael might always be replaced with NAME_GIVEN_12m5sb, while the first name Helen might always be replaced with NAME_GIVEN_9ha3m2. For PDF files, Redaction indicates to either cover the value with a black box, or, if there is space, display the entity type and identifier. For image files, Redaction indicates to cover the value with a black box. Textual does not redact any excluded values.

  • Off - Indicates to not make any changes to the values. For example, the first name value Michael remains Michael. This is the default option for the Full Mailing Address entity type.
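The consistency property of the Redaction option can be illustrated with a short Python sketch. This is not how Textual derives its identifiers, which this guide does not describe. The hypothetical `tokenize` function uses a SHA-256 hash only to show that the same original value always maps to the same token.

```python
import hashlib


def tokenize(entity_type, value, length=6):
    """Build a token such as NAME_GIVEN_ab12cd that is stable for a given value."""
    # Hashing the entity type together with the value makes the identifier repeatable.
    digest = hashlib.sha256(f"{entity_type}:{value}".encode("utf-8")).hexdigest()
    return f"{entity_type}_{digest[:length]}"


first = tokenize("NAME_GIVEN", "Michael")
again = tokenize("NAME_GIVEN", "Michael")
other = tokenize("NAME_GIVEN", "Helen")
```

Here `first` and `again` are identical, while `other` carries a different identifier, which mirrors the Michael and Helen example above.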

Selecting the handling option for a specific entity type

To select the handling option for an individual entity type, click the option for that type.

Handling options for a detected entity type

Selecting the handling option for all of the entity types

For a dataset, to select the same handling option for all of the entity types, from the Bulk Edit dropdown above the entity type list, select the option.

Bulk Edit dropdown list to apply the same handling option to all of the entity types

For a pipeline that generates synthesized files, on the Generator Config tab, use the Bulk Edit options at the top of the entity types list.

Bulk edit options for a pipeline

Access management

Use the API to retrieve information about users and groups, and to manage access to datasets.

Okta (self-hosted and Cloud)

Use these instructions to set up Okta as your SSO provider for Tonic Textual. Okta is supported on both self-hosted instances and on Textual Cloud.

Textual architecture

The following diagram shows how data and requests flow within the Tonic Textual application:

Textual architecture

Textual application database

The Textual application database is a PostgreSQL database that stores the dataset configuration.

If you do not configure an S3 bucket, then it also stores uploaded files and files that you use the SDK to redact.

Textual datastore in Amazon S3

You can configure an S3 bucket to store uploaded files and individual files that you use the SDK to redact. For more information, go to Setting the S3 bucket for file uploads and redactions.

If you do not configure an S3 bucket, then the files are stored in the Textual application database.

Textual components

Textual web server

Runs the Textual user interface.

Textual worker

A Textual instance can have multiple workers.

The worker orchestrates jobs. A job is a longer-running task such as the redaction of a single file.

If you redact a large number of files, you might deploy additional workers and machine learning containers to increase the number of files that you can process concurrently.

Textual machine learning

A Textual installation can have one or more machine learning containers.

The machine learning container hosts the Textual models. It takes text from the worker or web server and returns any entities that it discovers.

Additional machine learning containers can increase the number of words per second that Textual can process.

OCR service

The OCR service converts PDFs and images to text that Textual can then scan for sensitive values.

For more information, go to Enabling PDF and image processing.

LLM service

Textual only uses the LLM service for LLM synthesis.

Uploading and deleting local files

For a local file dataset, you upload and remove files directly.

On Tonic Textual Cloud, and by default for self-hosted instances, Textual stores the uploaded files in the application database.

On a self-hosted instance, you can instead configure an S3 bucket where Textual stores the files. In the S3 bucket, the files are stored in a folder that is named for the dataset identifier.

For more information, go to .

For an example of an IAM role with the required permissions, go to .

Adding files to the dataset

Required dataset permission: Upload files to a dataset

From the dataset details page, to add files to the dataset:

  1. In the panel at the top left, click Upload Files.

  2. Search for and select the files.

Textual uploads and then processes the files. For more information about file processing, go to .

Do not leave the page while files are uploading. If you leave the page before the upload is complete, then the upload stops.

You can leave the page while Textual is processing the files.

On a self-hosted instance, when a file fails to upload, you can download the associated logs. To download the logs, click the options menu for the file, then select Download Logs.

Removing files from the dataset

Required dataset permission: Delete files from a dataset

To remove a file from the dataset:

  1. In the file list, click the options menu for the file.

  2. In the options menu, click Delete File.

Creating and revoking Textual API keys

Required global permission: Create an API key

To be able to use the Textual SDK, you must have an API key.

Alternatively, you can use the Textual API to obtain a JSON Web Token (JWT) to use for authentication.

Viewing the list of API keys

You manage keys from the User API Keys page.

To display the User API Keys page, in the Textual navigation menu, click User API Keys.

Creating a Textual API key

To create a Textual API key:

  1. On the User API Keys page, click Create API Key.

  2. In the Name field, type a name to use to identify the key.

  3. Click Create API Key.

  4. Textual displays the key value, and prompts you to copy the key. If you do not copy the key and save it to a file, you will not have access to the key. To copy the key, click the copy icon.

Revoking a Textual API key

To revoke a Textual API key, on the User API Keys page, click the Revoke option for the key to revoke.

Configuring the API key as an environment setting

You cannot instantiate the SDK client without an API key.

Instead of providing the key every time you call the Textual API, you can configure the API key as the value of the TONIC_TEXTUAL_API_KEY environment variable.
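As a small sketch of this lookup order, the following Python helper prefers a key that is passed explicitly and otherwise reads TONIC_TEXTUAL_API_KEY from the process environment. The `resolve_api_key` helper is hypothetical and is not part of the Textual SDK. It only shows one way to fail early when no key is available.

```python
import os

ENV_VAR = "TONIC_TEXTUAL_API_KEY"


def resolve_api_key(explicit_key=None, environ=os.environ):
    """Prefer an explicitly passed key, then fall back to the environment setting."""
    key = explicit_key or environ.get(ENV_VAR)
    if not key:
        # Fail with a clear message instead of sending an unauthenticated request.
        raise RuntimeError(f"No API key found; set {ENV_VAR} or pass a key explicitly.")
    return key
```

In a shell, you would typically export the variable before you start Python, so that the key never appears in source code.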

Redact individual files

Required global permission: Use the API to parse or redact a text string

You can use the Textual SDK to redact and synthesize values in individual files.

Before you perform these tasks, remember to instantiate the SDK client.

For a self-hosted instance, you can also configure the S3 bucket to use to store the files. This is the same S3 bucket that is used to store files for uploaded file pipelines. For more information, go to . For an example of an IAM role with the required permissions, go to .

Sending a file to Textual

To send an individual file to Textual, you use textual.start_file_redaction.

You first open the file so that Textual can read it, then make the call for Textual to read the file.

The response includes:

  • The file name

  • The identifier of the job that processed the file. You use this identifier to retrieve a transformed version of the file.

Getting the file with redacted or synthesized values

After you use textual.start_file_redaction to send the file to Textual, you make a second call to retrieve a transformed version of the file.

To identify the file, you use the job identifier that you received from textual.start_file_redaction. You can also configure the handling for the detected entity values.

Before you make the call to download the file, you specify the path to download the file content to.
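The send-then-download flow might look like the following Python sketch. It takes an already instantiated client as a parameter, so it runs without the SDK installed. The `redact_file` wrapper is hypothetical. The sketch assumes that `start_file_redaction` accepts the file bytes and a file name and returns the job identifier directly, and that the download call is named `download_redacted_file` and returns bytes. Verify both against the SDK reference for your version.

```python
from pathlib import Path


def redact_file(client, source_path, output_path):
    """Send one file for redaction, then write the transformed bytes to output_path."""
    source = Path(source_path)
    with source.open("rb") as handle:
        # start_file_redaction is named in this guide; assumed to return the job identifier.
        job_id = client.start_file_redaction(handle.read(), source.name)
    # Assumption: the download method is download_redacted_file and returns bytes.
    redacted_bytes = client.download_redacted_file(job_id)
    Path(output_path).write_bytes(redacted_bytes)
    return job_id


# Hypothetical usage with a client that was instantiated earlier:
# job_id = redact_file(textual, "report.pdf", "report_redacted.pdf")
```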

Sharing dataset access

Required permissions:

  • Global permission - View users and groups

  • Either:

    • Global permission - Manage access to datasets

    • Dataset permission - Share dataset access

Tonic Textual uses dataset permission sets for role-based access control (RBAC) of each dataset.

A dataset permission set is a set of dataset permissions. Each permission provides access to a specific dataset feature or function.

Textual provides built-in dataset permission sets. Organizations can also configure custom permission sets.

To share dataset access, you assign dataset permission sets to users and to SSO groups, if you use SSO to manage Textual users. Before you assign a dataset permission set to an SSO group, make sure that you are aware of who is in the group. The permissions that are granted to an SSO group are automatically granted to all of the users in the group.

To change the current access to the dataset:

  1. Either:

    • On the Datasets page, click the share icon for the dataset to share.

    • On the dataset details page, click Share.

  2. The dataset access panel contains the current list of users and groups who have access to the dataset, and displays their assigned dataset permission sets. To add a user or group to the list of users and groups:

    1. In the search field, begin to type the user email address or group name.

    2. From the list of matching users or groups, select the user or group to add.

  3. For a user or group, to change the assigned dataset permission sets:

    1. Click Access. The dropdown list displays the list of custom and built-in dataset permission sets.

    2. Under Custom Permission Sets, check the checkbox next to each dataset permission set to assign to the user or group. To remove an assigned dataset permission set, uncheck the checkbox.

    3. Under Built-In Permission Sets, click the dataset permission set to assign to the user or group. You can only assign one built-in permission set. By default, for an added user or group, the Viewer permission set is selected. To not grant any built-in permission set, select None.

Enabling PDF and image processing

To process PDF and image files, Tonic Textual uses optical character recognition (OCR). Textual supports the following OCR models:

  • Azure AI Document Intelligence

  • Amazon Textract

  • Tesseract

For the best performance, we recommend that you use either Azure AI Document Intelligence or Amazon Textract.

If you cannot use either of those - for example because you run Textual on-premises and cannot access third-party services - then you can use Tesseract.

Azure AI Document Intelligence

To use Azure AI Document Intelligence to process PDF and image files, Textual requires the Azure AI Document Intelligence key and endpoint.

Docker

In .env, uncomment and provide values for the following settings:

SOLAR_AZURE_DOC_INTELLIGENCE_KEY=#

SOLAR_AZURE_DOC_INTELLIGENCE_ENDPOINT=#

Kubernetes

In values.yaml, uncomment and provide values for the following settings:

azureDocIntelligenceKey:

azureDocIntelligenceEndpoint:

Amazon Textract

If the Azure-specific environment variables are not configured, then Textual attempts to use Amazon Textract.

To use Amazon Textract, Textual requires access to an IAM role that has sufficient permissions. You must also configure an S3 bucket to use to store files. The configured S3 bucket is required for uploaded file pipelines, and is also used to store dataset files and individual files that are redacted using the SDK.

We recommend that you use the AmazonTextractFullAccess policy, but you can also choose to use a more restricted policy.

Here is an example policy that provides the minimum required permissions:

After the policy is attached to an IAM user or a role, it must be made accessible to Textual. To do this, either:

  • Assign an instance profile

  • Provide the AWS key, secret, and Region in the following environment variables:

Tesseract

If neither Azure AI Document Intelligence nor Amazon Textract is configured, then Textual uses Tesseract, which is automatically available in your Textual installation.

Tesseract does not require any external access.
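The fallback order described in this section (Azure AI Document Intelligence, then Amazon Textract, then Tesseract) can be summarized in a short Python sketch. The `choose_ocr_engine` function is hypothetical. Because Textract access can come from either an instance profile or AWS environment variables, the sketch takes Textract availability as an explicit flag instead of trying to detect it.

```python
def choose_ocr_engine(env, textract_available=False):
    """Mirror the documented fallback order for OCR engines."""
    # Azure is used only when both the key and the endpoint are configured.
    azure_ready = bool(env.get("SOLAR_AZURE_DOC_INTELLIGENCE_KEY")) and bool(
        env.get("SOLAR_AZURE_DOC_INTELLIGENCE_ENDPOINT")
    )
    if azure_ready:
        return "Azure AI Document Intelligence"
    if textract_available:
        # Assumption: the caller already verified IAM access and the S3 bucket.
        return "Amazon Textract"
    # Tesseract ships with the installation and needs no external access.
    return "Tesseract"
```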

GitHub

Use these instructions to set up GitHub as your SSO provider for Tonic Textual.

Create an OAuth application

  1. In GitHub, navigate to Settings -> Developer Settings -> OAuth Apps, then create a new application.

  2. For Application Name, enter Textual.

  3. For Homepage URL, enter https://textual.tonic.ai.

  4. For Authorization callback URL, enter https://your-textual-url/sso/callback/github.

Replace your-textual-url with the URL of your Textual instance.
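If you script the SSO setup, a small helper can build the callback URL from the base URL of your instance. The `sso_callback_url` function below is a hypothetical convenience. The `/sso/callback/<provider>` path pattern follows the GitHub and Google examples in this guide. Other providers might use a different path.

```python
def sso_callback_url(textual_url, provider):
    """Join the Textual base URL with the SSO callback path for a provider."""
    # Strip a trailing slash so that the result never contains a double slash.
    return f"{textual_url.rstrip('/')}/sso/callback/{provider}"


github_callback = sso_callback_url("https://your-textual-url/", "github")
```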

Create a client secret

After you create the application, to create a new secret, click Generate a new client secret.

You use the client ID and the client secret in the Textual configuration.

Textual configuration

After you complete the configuration in GitHub, you uncomment and configure the required environment variables in Textual.

For Kubernetes, in values.yaml:

For Docker, in .env:

Instantiating the SDK client

Whenever you call the Textual SDK, you first instantiate the SDK client.

  • To work with Textual datasets, or to redact individual files, you instantiate TonicTextual.

  • To work with Textual pipelines and parsing, you instantiate TonicTextualParse.

Instantiating when the API key is already configured

If the API key is configured as the value of the TONIC_TEXTUAL_API_KEY, then you do not need to provide the API key when you instantiate the SDK client.

For Textual datasets, or to use the redact method:

For Textual pipelines:

Instantiating when the API key is not configured

If the API key is not configured as the value of the TONIC_TEXTUAL_API_KEY, then you must include the API key in the request.

For Textual datasets, or to use the redact method:

For Textual pipelines:

Deploying a self-hosted instance

The Tonic Textual images are stored on . During onboarding, Tonic.ai provides you with credentials to access the image repository. If you require new credentials, or you experience issues accessing the repository, contact .

You can deploy Textual using either Kubernetes or Docker.

Obtaining JWT tokens for authentication

Instead of an API key, you can use the Textual API to obtain a JSON Web Token (JWT) to use for authentication.

Configuring the JWT and refresh token lifetimes

JWT lifetime

By default, a JWT is valid for 30 minutes.

On a self-hosted instance, to configure a different lifetime, set the environment variable SOLAR_JWT_EXPIRATION_IN_MINUTES.

Refresh token lifetime

You use a refresh token to obtain a new JWT. By default, a refresh token is valid for 10,000 minutes, which is roughly equivalent to 7 days.

On a self-hosted instance, to configure a different lifetime, set the environment variable SOLAR_REFRESH_TOKEN_EXPIRATION_IN_MINUTES.
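The default lifetimes translate into concrete expiry times as follows. This Python sketch only does the arithmetic. The names `JWT_LIFETIME`, `REFRESH_LIFETIME`, and `expiry_times` are illustrative, and the issue time is an arbitrary example.

```python
from datetime import datetime, timedelta, timezone

JWT_LIFETIME = timedelta(minutes=30)          # default JWT lifetime
REFRESH_LIFETIME = timedelta(minutes=10_000)  # default refresh token lifetime


def expiry_times(issued_at):
    """Return (jwt_expiry, refresh_token_expiry) for tokens issued at issued_at."""
    return issued_at + JWT_LIFETIME, issued_at + REFRESH_LIFETIME


issued = datetime(2024, 1, 1, tzinfo=timezone.utc)
jwt_expiry, refresh_expiry = expiry_times(issued)

# 10,000 minutes / 1,440 minutes per day is about 6.94 days, a little under 7 days.
refresh_days = REFRESH_LIFETIME / timedelta(days=1)
```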

Obtaining your first JWT and refresh token

To obtain your first JWT and refresh token, you make a login request to the Textual API. Before you can make this call, you must have a Textual account.

To make the call, perform a POST operation against:

The request payload is:

For example:

In the response:

  • The jwt property contains the JWT.

  • The refreshToken property contains the refresh token.

Obtaining a new JWT and refresh token

You use the refresh token to obtain both a new JWT and a new refresh token.

To obtain the new JWT and token, perform a POST operation against:

The request payload is:

In the response:

  • The jwt property contains the new JWT.

  • The refreshToken property contains the new refresh token.
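One way to use the refresh flow in client code is to cache the current JWT and exchange the refresh token shortly before the JWT expires. The `TokenCache` class below is a hypothetical sketch, not part of Textual. It expects a caller-supplied `refresh_fn` that performs the POST request and returns the response as a dictionary with the `jwt` and `refreshToken` properties described above. The 30-minute default matches the default JWT lifetime.

```python
import time


class TokenCache:
    """Hold a JWT and refresh it shortly before the end of its lifetime."""

    def __init__(self, jwt, refresh_token, refresh_fn,
                 lifetime_seconds=30 * 60, clock=time.monotonic):
        self._refresh_fn = refresh_fn
        self._lifetime = lifetime_seconds
        self._clock = clock
        self._store(jwt, refresh_token)

    def _store(self, jwt, refresh_token):
        self.jwt = jwt
        self.refresh_token = refresh_token
        self._issued = self._clock()

    def current_jwt(self, margin_seconds=60):
        """Return a usable JWT, refreshing when less than margin_seconds remain."""
        if self._clock() - self._issued >= self._lifetime - margin_seconds:
            response = self._refresh_fn(self.refresh_token)
            # Each refresh returns both a new JWT and a new refresh token.
            self._store(response["jwt"], response["refreshToken"])
        return self.jwt
```

Because every refresh also replaces the refresh token, the cache stores both values together and never reuses an old refresh token.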

Google

Use these instructions to set up Google as your SSO provider for Tonic Textual.

Create an OAuth 2.0 client ID in Google

  1. Go to

  2. Click Create credentials, located near the top.

  3. Select OAuth client ID.

  4. Select Web application as the application type.

  5. Choose a name.

  6. Under Authorized redirect URIs, add the URL of the Textual server with the endpoint /sso/callback/google. For example, a local Textual server at http://localhost:3000 would need http://localhost:3000/sso/callback/google to be set as the redirect URI. Also note that internal URLs might not work.

  7. On the confirmation page, note the client ID and client secret. You will need to provide them to Textual.

Textual configuration

After you complete the configuration in Google, you uncomment and configure the required environment variables in Textual.

  • The client ID

  • The client secret

For Kubernetes, in values.yaml:

For Docker, in .env:

Managing permissions

You use permissions and permission sets to manage access to Tonic Textual features and functions.

Learn about permission sets and permissions

View and configure permission sets

Assign global permission sets

About permissions and permission sets

Tonic Textual uses permissions and permission sets to manage role-based access control (RBAC) to Textual features and functions.

A permission grants access to a specific feature or function.

A permission set is a collection of permissions that can be assigned to a user or an SSO group.

Built-in and custom permission sets

Textual provides a set of built-in permission sets that you cannot edit or delete.

You can also create custom permission sets.

Global permission sets

Global permission sets control access to features and functions that are outside of the context of a specific dataset or pipeline. For example, global permissions control who can manage users and configure custom entity types.

You can also select a default global permission set to assign to all new users.

For the list of global permission sets and available permissions, go to .

For information on how to assign global permission sets, go to Configuring access to global permission sets.

Dataset and pipeline permission sets

Dataset and pipeline permission sets provide access to specific dataset or pipeline management features and functions.

Dataset and pipeline permission sets are assigned to users and groups within the context of a specific dataset or pipeline. For example, a user might have the Editor permission set in one dataset and the Viewer permission set in another dataset.

For the lists of built-in dataset and pipeline permission sets and available permissions, go to .

For information on how to assign dataset permission sets, go to Sharing dataset access. For information on how to assign pipeline permission sets, go to Sharing pipeline access.

Managing user access to Textual

Tonic Textual provides the following options to manage access to Textual and its features.

Adding manual overrides to PDF files

Required dataset permission: Edit dataset settings

For a PDF file in a dataset, you can add manual overrides to selected areas of a file. Manual overrides can either ignore redactions from Tonic Textual, or add redactions.

Pipelines do not support manual overrides in PDF files.

Textual configuration (Textual Cloud)

Required global permission: Manage users and groups

On Textual Cloud, after you complete the configuration in Okta, you configure the Textual connection to Okta from the Single Sign-On tab on the Permission Settings page.

  1. To enable Okta SSO, check the Enable Okta SSO checkbox.

  2. In the SSO Client ID field, enter the client identifier of the application.

  3. In the SSO Domain field, enter the Okta domain.

  4. If you use a third-party provider, then in the Identity Provider ID field, provide the provider identifier. If you do not use a third-party provider, then you can skip this field.

  5. If you created a custom authorization server, then in the Authorization Server field, provide the server identifier. If you did not create a custom authorization server, then you can skip this field.

  6. To require your organization users to use Okta SSO to log in to Textual, check the Require SSO for login checkbox.

Entity linking

Use the REST API to link entity values.

Datasets

Use the REST API to manage datasets.

Textual configuration (self-hosted)

On a self-hosted instance, after you complete the configuration in Okta, uncomment and configure the relevant environment variables in Textual.

Kubernetes

For Kubernetes, the settings are in the Okta SSO Config section of values.yaml:

  • oktaAuthServerId - If you created a custom authorization server, the server ID. If you do not use a custom authorization server, then you can omit this.

  • oktaClientId - The client identifier of the application.

  • oktaDomain - The Okta domain.

  • oktaIdentityProviderId - If you use a third-party provider, the provider identifier. If you do not use a third-party provider, you can omit this.

Docker

For Docker, the settings are in .env:

  • SOLAR_SSO_OKTA_CLIENT_ID - The client identifier of the application.

  • SOLAR_SSO_OKTA_DOMAIN - The Okta domain.

  • SOLAR_SSO_OKTA_IDENTITY_PROVIDER_ID - If you use a third-party provider, the provider identifier. If you do not use a third-party provider, then you can omit this.

Authentication - Authentication requirements for the REST API.

Redaction - Use the REST API to redact text.

Datasets - Use the REST API to manage datasets and dataset files.

Entity linking - Use the REST API to link entity values.

Access management - Use the REST API to retrieve user information and to manage dataset access.

Azure

Use Azure to enable SSO on Textual.

GitHub

Use GitHub to enable SSO on Textual.

Google

Use Google to enable SSO on Textual.

Keycloak

Use Keycloak to enable SSO on Textual.

Okta

Use Okta to enable SSO on Textual.

Available for both self-hosted instances and Textual Cloud.

OpenID Connect (OIDC)

Use OIDC to enable SSO on Textual.

Okta configuration Required configuration within Okta.

Textual configuration (self-hosted) Required configuration to enable Okta on a self-hosted instance.

Textual configuration (Textual Cloud) Configuring Textual Cloud to use Okta SSO for an organization.

{
	"Version": "2012-10-17",
	"Statement": [
		{
			"Sid": "VisualEditor0",
			"Effect": "Allow",
			"Action": [
				"textract:StartDocumentAnalysis",
				"textract:AnalyzeDocument",
				"textract:GetDocumentAnalysis"
			],
			"Resource": "*"
		}
	]
}
# Github SSO Config
# -----------------
#githubClientId: <client-id>
#githubClientSecret: <client-secret>
#SOLAR_SSO_GITHUB_CLIENT_ID=#<client ID>
#SOLAR_SSO_GITHUB_CLIENT_SECRET=#<client secret>
from tonic_textual.redact_api import TextualNer
# The default URL is https://textual.tonic.ai (Textual Cloud)
# If you host Textual, provide your Textual URL
textual = TextualNer()
from tonic_textual.parse_api import TonicTextualParse
# The default URL is https://textual.tonic.ai (Textual Cloud)
# If you host Textual, provide your Textual URL
textual = TonicTextualParse()
from tonic_textual.redact_api import TonicTextual
api_key = "your-tonic-textual-api-key"
# The default URL is https://textual.tonic.ai (Textual Cloud)
# If you host Textual, provide your Textual URL
textual = TonicTextual(api_key=api_key)
from tonic_textual.parse_api import TonicTextualParse
api_key = "your-tonic-textual-api-key"
# The default URL is https://textual.tonic.ai (Textual Cloud)
# If you host Textual, provide your Textual URL
textual = TonicTextualParse(api_key=api_key)
<Textual_URL>/api/auth/login
{"userName": "<Textual username>",
"password": "<Textual password>"}
{"userName": "[email protected]",
"password": "MyPassword123!"}
<TEXTUAL_URL>/api/auth/token_refresh
{"refreshToken": "<refresh token>"}
# Google SSO Config
# -----------------
#googleClientId: <client-id>
#googleClientSecret: <client-secret>
#googleGroupFilterRegex: <regular expression to identify allowed groups>
#SOLAR_SSO_GOOGLE_CLIENT_ID=#<client ID>
#SOLAR_SSO_GOOGLE_CLIENT_SECRET=#<client secret>
#SOLAR_SSO_GOOGLE_GROUP_FILTER_REGEX=#<regular expression to identify allowed groups>
https://console.developers.google.com/apis/credentials
# Okta SSO Config
# -----------------
#oktaAuthServerId: <custom auth server if you have one>
#oktaClientId: <client-id>
#oktaDomain: <sso-domain>
#oktaIdentityProviderId: <identity-provider-id>
#oktaGroupFilterRegex: <regular expression to identify allowed groups>
#SOLAR_SSO_OKTA_CLIENT_ID=#<client ID>
#SOLAR_SSO_OKTA_DOMAIN=#<SSO domain>
#SOLAR_SSO_OKTA_IDENTITY_PROVIDER_ID=#<third-party provider identifier>
#SOLAR_SSO_OKTA_GROUP_FILTER_REGEX=#<regular expression to identify allowed groups>

System requirements

System requirements to deploy a self-hosted Textual instance.

Deploy on Docker

How to use Docker Compose to deploy a self-hosted Textual instance on Docker.

Deploy on Kubernetes

How to use Helm to deploy a self-hosted Textual instance on Kubernetes.

About permissions and permission sets Learn how user access works in Textual.

Built-in permission sets and permissions Lists of the global, dataset, and pipeline permission sets and permissions that come with Textual.

View the permission sets View the lists of global, dataset, and pipeline permission sets.

Configure custom permission sets Create and assign permissions to your own global, dataset, and pipeline permission sets.

Select default permission sets Select the global permission set for all users, and the permission sets for dataset and pipeline creators.

Grant access to global permission sets Assign users and groups to global permission sets.

Set the initial admin access Use an environment variable to grant the initial access to all global permissions.

Textual organizations Learn about Textual organizations and how they are populated.

Create an account in an organization How new accounts are assigned to organizations.

Single sign-on (SSO) Use SSO to manage user access to Textual.

Manage Textual users View and remove Textual users that are in your organization.

Manage permissions View and configure permission sets. Assign global permission sets to users and groups.

Edit an individual file

Add manual overrides to a PDF file. You can also apply a template.

Create PDF templates

PDF templates allow you to add the same overrides to files that have the same structure.

Manage dataset files Upload new files and download redacted files.

Manage datasets Create datasets and edit dataset configuration.

Configuring model preferences

On a self-hosted instance, you can configure whether to use the auxiliary model, and which models to use on GPU.

Configuring whether to use an auxiliary model

To improve overall inference, you can configure whether Textual uses the en_core_web_sm auxiliary NER model.

Entity types that the auxiliary model detects

The auxiliary model detects the following types:

  • EVENT

  • LANGUAGE

  • LAW

  • NRP

  • NUMERIC_VALUE

  • PRODUCT

  • WORK_OF_ART

Indicating whether to use the auxiliary model

To configure whether to use the auxiliary model, you use the environment variable TEXTUAL_AUX_MODEL.

The available values are:

  • en_core_web_sm - This is the default value.

  • none - Indicates to not use the auxiliary model.

Configuring model use for GPU

When you use a textual-ml-gpu container on accelerated hardware, you can configure:

  • Whether to use the auxiliary model

  • Whether to use the date synthesis model

Indicating whether to use the auxiliary model for GPU

To configure whether to use the auxiliary model for GPU, you configure the environment variable TEXTUAL_AUX_MODEL_GPU.

By default, on GPU, Textual does not use the auxiliary model, and TEXTUAL_AUX_MODEL_GPU is false.

To use the auxiliary model for GPU, based on the configuration of TEXTUAL_AUX_MODEL, set TEXTUAL_AUX_MODEL_GPU to true.

When TEXTUAL_AUX_MODEL_GPU is true, and TEXTUAL_MULTI_LINGUAL is true, Textual also loads the multilingual models on GPU.

Indicating whether to use the date synthesis model for GPU

By default, on GPU, Textual loads the date synthesis model on GPU.

Note that this model requires 600MB of GPU RAM for each machine learning worker.

To not load the date synthesis model on GPU, set the environment variable TEXTUAL_DATE_SYNTH_GPU to false.
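
The following Python sketch summarizes one reading of how the settings above combine. It is not Textual's implementation. The function name, the returned keys, the case-insensitive parsing of true and false, and the assumption that TEXTUAL_MULTI_LINGUAL defaults to false are all illustrative assumptions.

```python
from typing import Dict, Mapping, Optional


def _flag(env: Mapping[str, str], name: str, default: bool) -> bool:
    """Read a true/false setting. Case-insensitive parsing is an assumption."""
    raw = env.get(name)
    if raw is None:
        return default
    return raw.strip().lower() == "true"


def resolve_model_config(env: Mapping[str, str], on_gpu: bool) -> Dict[str, object]:
    """Hypothetical helper: interpret the documented model settings."""
    # TEXTUAL_AUX_MODEL defaults to en_core_web_sm; "none" turns the model off.
    aux_model: Optional[str] = env.get("TEXTUAL_AUX_MODEL", "en_core_web_sm")
    if aux_model is not None and aux_model.strip().lower() == "none":
        aux_model = None

    if not on_gpu:
        return {
            "aux_model": aux_model,
            "multilingual_on_gpu": False,
            "date_synth_on_gpu": False,
        }

    # On GPU, the auxiliary model is used only when TEXTUAL_AUX_MODEL_GPU is true.
    aux_on_gpu = _flag(env, "TEXTUAL_AUX_MODEL_GPU", False)
    return {
        "aux_model": aux_model if aux_on_gpu else None,
        # Assumption: TEXTUAL_MULTI_LINGUAL defaults to false when it is not set.
        "multilingual_on_gpu": aux_on_gpu and _flag(env, "TEXTUAL_MULTI_LINGUAL", False),
        # The date synthesis model loads on GPU unless it is explicitly disabled.
        "date_synth_on_gpu": _flag(env, "TEXTUAL_DATE_SYNTH_GPU", True),
    }
```

For example, with no settings defined, a GPU container in this sketch uses no auxiliary model and loads the date synthesis model on GPU.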

System requirements

You install a self-hosted instance of Tonic Textual on either:

  • A VM or server that runs Linux and on which you have superuser access.

  • A local machine that runs Mac, Windows, or Linux.

Application server or cluster requirements

At minimum, we recommend that the server or cluster that you deploy Textual to has access to the following resources:

  • Nvidia GPU, 16GB GPU RAM. We recommend at least 6GB GPU RAM for each textual-ml worker.

If you only use a CPU and not a GPU, then we recommend an AWS m5.2xlarge instance. However, without a GPU, performance is significantly slower.

GPU considerations

The number of words per second that Textual processes depends on many factors, including:

  • The hardware that runs the textual-ml container

  • The number of workers that are assigned to the textual-ml container

  • The auxiliary model, if any, that is used in the textual-ml container

To optimize both the throughput and the cost of Textual, we recommend that you run the textual-ml container on modern hardware with GPU compute. If you use AWS, we recommend a g5 instance with 1 GPU.

Application database

The Textual application database is a PostgreSQL database that stores the dataset and Textual configuration. If you did not configure an S3 bucket to store files that you upload into a dataset, then the application database also contains those uploaded files.

External database server

The Textual application database is an external database that is hosted on a separate server.

For the PostgreSQL server, we recommend a minimum of an RDS t3.small on AWS, with at least 2GB RAM, 2 vCPU, and 100 GB of storage.

To prevent the loss of Textual metadata, keep regular backups of the PostgreSQL instance.

PostgreSQL version

For the Textual application database, the current minimum supported version is PostgreSQL 14.

You should keep your PostgreSQL version relatively up-to-date with the current PostgreSQL LTS.

Tonic.ai might periodically conduct a campaign to request updates of self-hosted PostgreSQL instances before a scheduled update in the minimum supported version.

Database user permissions

The user credentials that you provide to Textual for the application database must have permission to create tables, insert, and select.

Setting up Nvidia GPU for Textual

To use GPU resources:

  • Ensure that the correct Nvidia drivers are installed for your instance.

  • If you use Kubernetes to deploy Textual, follow the instructions in the NVIDIA GPU operator documentation.

    If you use Minikube, then use the instructions in Using NVIDIA GPUs with Minikube.

  • If you use Docker Compose to deploy Textual, follow these steps to install the nvidia-container-runtime.

Setting the S3 bucket for file uploads and redactions

Tonic Textual pipelines can process files from sources such as Amazon S3, Azure Blob Storage, and Databricks Unity Catalog. You can also create pipelines to process files that you upload directly from your browser.

For those uploaded file pipelines, Textual always stores the files in an S3 bucket. On a self-hosted instance, before you add files to an uploaded file pipeline, you must configure the S3 bucket and the associated authentication credentials.

The configured S3 bucket is also used to store dataset files and individual files that you use the Textual SDK to redact. If an S3 bucket is not configured, then:

  • The dataset and individual redacted files are stored in the Textual application database.

  • You cannot use Amazon Textract for PDF and image processing. If you configured Textual to use Amazon Textract, Textual instead uses Tesseract.

The authentication credentials for the S3 bucket include:

  • The AWS Region where the S3 bucket is located.

  • An AWS access key that is associated with an IAM user or role.

  • The secret key that is associated with the access key.

To provide the authentication credentials, you can either:

  • Provide the values directly as environment variable values.

  • Use the instance profile of the compute instance where Textual runs.

For an example IAM role that has the required permissions, go to Example IAM role for file uploads and redactions.

Docker

In .env, add the following settings:

SOLAR_INTERNAL_BUCKET_NAME=<S3 bucket path>
AWS_DEFAULT_REGION=<AWS Region>
AWS_ACCESS_KEY_ID=<AWS access key>
AWS_SECRET_ACCESS_KEY=<AWS secret key>

If you use the instance profile of the compute instance, then only the bucket name is required.

Kubernetes

In values.yaml, within env: { } under both textual_api_server and textual_worker, add the following settings:

SOLAR_INTERNAL_BUCKET_NAME

AWS_DEFAULT_REGION

AWS_ACCESS_KEY_ID

AWS_SECRET_ACCESS_KEY

For example, if no other environment variables are defined:

  env: {
        "SOLAR_INTERNAL_BUCKET_NAME": "<S3 bucket path>",
        "AWS_DEFAULT_REGION": "<AWS Region>",
        "AWS_ACCESS_KEY_ID": "<AWS access key>",
        "AWS_SECRET_ACCESS_KEY": "<AWS secret key>"
       }

If you use the instance profile of the compute instance, then only the bucket name is required.
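
Before you start the containers, it can help to confirm that the required settings are present. The following Python sketch checks for the four settings described above. The function name and the use_instance_profile parameter are hypothetical and are not part of Textual. The sketch only checks that values exist. It does not verify that the credentials can reach the bucket.

```python
import os
from typing import List, Mapping

BUCKET_SETTING = "SOLAR_INTERNAL_BUCKET_NAME"
CREDENTIAL_SETTINGS = (
    "AWS_DEFAULT_REGION",
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
)


def missing_s3_settings(
    env: Mapping[str, str], use_instance_profile: bool = False
) -> List[str]:
    """Return the names of the required settings that are missing or empty."""
    required = [BUCKET_SETTING]
    # With an instance profile, only the bucket name is required.
    if not use_instance_profile:
        required.extend(CREDENTIAL_SETTINGS)
    return [name for name in required if not env.get(name, "").strip()]


if __name__ == "__main__":
    missing = missing_s3_settings(os.environ)
    print("Missing settings:", ", ".join(missing) if missing else "none")
```

If you rely on the instance profile, call the function with use_instance_profile=True so that only the bucket name is checked.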

Setting a custom certificate

Tonic Textual provides a certificate for HTTPS traffic, but on a self-hosted instance, you can also use a user-provided certificate. The certificate must use the PFX format and be named solar.pfx.

To use your own certificate, you must:

  • Add the SOLAR_PFX_PASSWORD environment variable.

  • Use a volume mount to provide the certificate file. Textual uses volume mounting to give the Textual containers access to the certificate.

You must apply the changes to both the Textual web server and Textual worker containers.

Docker

To use your own certificate, you make the following changes to the docker-compose.yml file.

Environment variable

Add the environment variable SOLAR_PFX_PASSWORD, which contains the certificate password.

Volume mount

Place the certificate on the host machine, then share it to the containers as a volume.

You must map the certificate to /usr/bin/textual/certificates on the containers.

Copy the following:

volumes:
        ...
        - /my-host-path:/usr/bin/textual/certificates

Kubernetes

Environment variable

You must add the environment variable SOLAR_PFX_PASSWORD, which contains the certificate password.

Volume mount

You can use any volume type that is allowed within your environment. It must provide at least ReadOnlyMany access.

You map the certificate to /usr/bin/textual/certificates on the containers. Within your web server and worker deployment YAML files, the entry should be similar to the following:

    volumeMounts:
    - name: <my-volume-name>
      mountPath: /usr/bin/textual/certificates

Configuring custom permission sets

Required global permission: Manage custom permission sets

You can create custom global, dataset, and pipeline permission sets.

A custom permission set allows you to have more precise control over global, dataset, and pipeline permissions.

For example, you might want a pipeline permission set that allows a user to manage the files but not share or delete the pipeline.

For global permissions, you might want a global permission set that allows a user to manage access to any dataset or pipeline, but not manage Textual users.

Creating a custom permission set

To create a custom permission set:

  1. On the global, dataset, or pipeline permission sets list, click the create permission set button.

  2. On the permission set details panel, in the Permission Set Name field, type the name for the new permission set. Permission set names must be unique for that permission set type (global, pipeline, dataset).

  3. Select the permissions to grant to the permission set. If a permission checkbox is checked, then the permission is granted to the permission set. If a permission checkbox is not checked, then the permission is not granted to the permission set.

  4. To save the new permission set, click Save.

  5. For a global permission set, Textual prompts you to configure access to the new permission set. To display the access management panel for the permission set, click Manage User Access. To not manage access at that time, click Skip.

Editing a custom permission set

You cannot make any changes to a built-in permission set.

For a custom permission set, you can change the permission set name and adjust the assigned permissions.

To edit an existing custom permission set:

  1. On the global, dataset, or pipeline permission sets list, click Settings.

  2. On the permission set details panel, update the permission set configuration.

  3. Click Save.

Deleting a custom permission set

You can delete a custom permission set. You cannot delete a built-in permission set.

You cannot delete a permission set that is assigned to any users or groups. Before you can delete the permission set, you must remove the assignment.

To delete a custom permission set:

  1. On the global, dataset, or pipeline permission sets list, click Settings.

  2. On the permission set details panel, click Delete Permission Set.

  3. On the confirmation panel, click Confirm.

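# Start a redaction job for one file. Open the file in binary mode.
with open("<path to the file>", "rb") as f:
    j = textual.start_file_redaction(f, "<file name>")
# Download the redacted file and write it to the output location.
with open("<path to output location>", "wb") as fo:
    fo.write(textual.download_redacted_file(<job identifier>))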

Reviewing the sensitivity detection results

Required dataset permission: View dataset settings

When you first create a dataset, Tonic Textual displays a single list of all of the entity types that it can detect.

As you add and remove files, Textual updates the entity types list to indicate the detected and not detected entity types.

Viewing the number of detected values

At the top of the dataset details view, the Sensitive words tile shows the total number of sensitive values in the dataset that Textual detected.

Sensitive words tile with the number of detected values

Viewing the detected entity types

As Textual processes files, it identifies the entity types that are detected and not detected.

The entity type list starts with the detected entity types. For each detected entity type, Textual displays:

  • The number of detected values that are marked as this type in the output file. Excluded values are not included in the count.

  • The selected handling option.

  • Whether there are configured added or excluded values.

List of detected entity types in the dataset files

Previewing the detected values for an entity type

For each detected entity type, to view a sample of up to 10 of the detected values, click the view icon next to the value count.

Sample of the detected values for an entity type

Displaying the list of detected values for an entity type

The entities list contains the full list of detected values for an entity type.

To display the entities list, from the value preview, click Open Entities Manager.

Entities list for an entity type

Selecting the entity type

When you display the entities list, the entity type that you previewed the values for is selected by default.

To change the selected entity type, from the dropdown at the top left, select the entity type to view values for.

How Textual handles entity values that match multiple types

A detected value might match multiple entity types.

For example, a telephone number might match both the Phone Number and Numeric Value entity types.

Every value is only counted once, for the entity type that it is assigned in the output file.

By default, a detected value is assigned the entity type that it most closely matches. For our example, the telephone number value most closely matches the Phone Number entity type, and so by default is included in the Phone Number count and values list.

If the entity type is turned off, or the value is excluded, then Textual moves the value to the next matching type.

In our example, if you set the handling type for Phone Number to Off, then the telephone number value is added to the count and values list for the Numeric Value entity type.
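
The fallback behavior described above can be sketched in Python as follows. This is an illustration only, not Textual's implementation. The function name and parameters are hypothetical, and the sketch assumes that the matching entity types are already ranked from closest match to weakest match. How Textual ranks the matches is not described here.

```python
from typing import Dict, List, Optional, Set


def assign_entity_type(
    value: str,
    ranked_types: List[str],
    disabled_types: Set[str],
    excluded_values: Dict[str, Set[str]],
) -> Optional[str]:
    """Return the first matching type that is turned on and does not exclude the value."""
    for entity_type in ranked_types:
        # Skip types whose handling option is set to Off.
        if entity_type in disabled_types:
            continue
        # Skip types for which this value is configured as excluded.
        if value in excluded_values.get(entity_type, set()):
            continue
        return entity_type
    # No remaining type applies, so the value is not counted for any type.
    return None


phone = "555-0100"
matches = ["Phone Number", "Numeric Value"]

# By default, the value is counted under the closest match.
print(assign_entity_type(phone, matches, set(), {}))

# When Phone Number is set to Off, the value moves to Numeric Value.
print(assign_entity_type(phone, matches, {"Phone Number"}, {}))
```

Because the function returns a single type, each value is counted only once, which matches the counting rule described above.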

Information in the entities list

The entities list groups the entities by the file and, if relevant, the page where they were detected.

For each value, the list includes:

  • The original value.

  • The original value in the context of its surrounding text.

  • The redacted or synthesized value in the context of its surrounding text, based on the selected handling option.

Viewing the list of entity types that were not detected

Below the list of detected entity types is the Entity types not found list, which contains the list of entity types that Textual did not detect in the files.

Entity types not found list of entity types that were not detected in the dataset files

Filtering the entity types

You can filter the entity types list by text in the type name or description. The filter applies to both the detected and undetected entity types.

To filter the types, in the filter field, begin to type text that is in the entity type name or description.

Filtering the list of entity types

Configuring Textual

You can configure your self-hosted instance of Textual to enable Textual features.

Okta configuration

To enable Okta as your SSO provider for Tonic Textual, you first complete the following configuration steps within Okta:

  1. Create a new application. Choose the OIDC - OpenID Connect method with the Single-Page Application option.

Create a new app integration

  2. Click Next, then fill out the fields with the following values:

    • App integration name: The name to use for the Textual application. For example, Textual, Textual-Prod, Textual-Dev.

    • Grant type: Implicit (hybrid)

    • Sign-in redirect URIs: For self-hosted instances, <base-url>/sso/callback/okta. For Textual Cloud, on the Permission Settings page, the sign-in redirect URL is displayed on the Single Sign-On tab. Copy the value from there and paste it into the field.

    • Base URIs: The URL to your Textual instance

    • Controlled access: Configure as needed to limit Textual access to the appropriate users

App integration settings

  3. After you save the settings, navigate to the General Settings page for the application and make the following changes:

    • Grant type: Check Implicit (hybrid) and Allow ID Token with implicit grant type.

    • Login initiated by: Either Okta or App

    • Application visibility: Check Display application icon to users

    • Initiate login URI: <base-url>

Application settings

Login settings

  4. Make a note of the following values that must be provided to Textual:

    • Client ID of the application

    • Your Okta domain (for example, tonic.okta.com)

    • If you created a custom authorization server for Textual, the server ID

    • If you use an outside identity provider, the IdP ID

Setting up and managing a Textual Cloud pay-as-you-go subscription

The Tonic Textual pay-as-you-go plan allows you to automatically bill a credit card for your Textual usage.

The Textual subscription plan charges a flat rate for each 1000 words. You are billed each month based on when you started your subscription. For example, if you start your subscription on the 12th of the month, then you are billed every month on the 12th.

Tonic.ai integrates with a payment processing solution to manage the payments.

Setting up the subscription

To start a new subscription, from a usage pane or upgrade prompt, click Upgrade Plan.

You are sent to the payment processing solution to enter your payment information.

Tracking your usage

The panel on the Home page shows the usage for the current month.

To view additional usage details, click Manage Plan.

The Manage Plan page displays the details for your subscription.

Manage Plan page for a pay-as-you-go subscription

Account summary

The summary at the top left contains an overview of the subscription payment information, as well as the total number of words scanned since you started your account.

From the summary, you can go to the payment processing solution to view and manage payment information.

30-day usage graph

The graph at the top of the page shows the words scanned per day for the previous 30 days.

Current billing period

The Current Billing Period panel summarizes your usage for the current month, and provides information about the next payment.

Next billing date

The Next billing date panel shows when the next billing period begins.

Payment history

The Payment History section shows the list of subscription payments.

For each payment, the list shows the date and amount, and whether the payment was successful.

To download the invoice for a payment, click its Invoice option.

Updating the payment information

You can update the payment information for your subscription. For example, you might need to choose a different credit card or update an expiration date.

To manage the payment information:

  1. On the home page, in the usage panel, click Manage Plan.

  2. On the Manage Plan page, from the account summary, click Manage Payment.

You are sent to the payment processing solution to update your payment information.

Canceling a subscription

To cancel a subscription, from the Manage Plan page:

  1. Click Manage Payment.

  2. In the payment processing solution, select the cancellation option.

The cancellation takes effect at the end of the current subscription month.

Selecting default permission sets

You can configure:

  • The global permission set that is assigned to all Tonic Textual users.

  • The dataset permission set to assign to a user who creates a dataset.

  • The pipeline permission set to assign to a user who creates a pipeline.

Selecting the global permission set to assign to all Textual users

Required global permission: Manage access to global permission sets

By default, all Textual users are assigned the built-in General User global permission set. You can configure a different global permission set to assign to all Textual users.

The permission set cannot be removed.

When you choose a different permission set to assign to all users, users lose access to the previous permission set unless it was otherwise assigned to them.

To set the default global permission set to assign to all Textual users:

  1. Click the user icon at the top right.

  2. In the user menu, click Permission Settings.

  3. On the Permission Settings page, click Global Permission Sets. The current permission set for all users is marked as Assigned to all users.

  4. To select a different permission set, hover over the permission set row, then click Assign to all users.

Selecting a global permission set to assign to all users

  5. The confirmation panel explains the risks of making this change. To confirm the change:

    1. Check I have read and understand the risks.

    2. Click Confirm.

Selecting the permission set to assign to a dataset creator

Required global permission: Manage access to dataset permission sets

By default, when a user creates a dataset, they are assigned the built-in Editor dataset permission set. You can configure a different dataset permission set to assign when a dataset is created.

Changing the selected permission set does not affect existing datasets. It only applies to datasets that are created after the change.

To select a different permission set for dataset creation:

  1. Click the user icon at the top right.

  2. In the user menu, click Permission Settings.

  3. On the Permission Settings page, click Dataset Permission Sets. The current permission set for dataset creators is marked Assigned on dataset creation.

  4. To select a different permission set, hover over the permission set row, then click Assign on dataset creation.

  5. The confirmation panel explains the risk of making this change. To confirm the change:

    1. Check I have read and understand the risks.

    2. Click Confirm.

Selecting the permission set to assign to a pipeline creator

Required global permission: Manage access to pipeline permission sets

By default, when a user creates a pipeline, they are assigned the built-in Editor pipeline permission set. You can configure a different pipeline permission set to assign when a pipeline is created.

Changing the selected permission set does not affect existing pipelines. It only applies to pipelines that are created after the change.

To select a different permission set for pipeline creation:

  1. Click the user icon at the top right.

  2. In the user menu, click Permission Settings.

  3. On the Permission Settings page, click Pipeline Permission Sets. The current permission set for pipeline creators is marked Assigned on pipeline creation.

  4. To select a different permission set, hover over the permission set row, then click Assign on pipeline creation.

  5. The confirmation panel explains the risk of making this change. To confirm the change:

    1. Check I have read and understand the risks.

    2. Click Confirm.

Creating and managing custom entity types

Required global permission - either:

  • Create custom entity types

  • Edit any custom entity type

In addition to the built-in entity types, you can also create custom entity types.

Custom entity types are based on regular expressions. If a value matches a configured regular expression for the custom entity type, then it is identified as that entity type.

You can control whether each dataset uses each custom entity type.

Viewing the list of custom entity types

To display the list of entity types, in the Textual navigation bar, click Custom Entity Types.

For each custom entity type, the list includes:

  • Entity type name and description.

  • Regular expressions to identify matching values.

  • The number of datasets that the entity type is active for.

Creating, editing, and deleting a custom entity type

Creating a custom entity type

Required global permission: Create custom entity types

To create a custom entity type, on the Custom Entity Types page, click Create Custom Entity Type.

The dataset details page also contains a Create Custom Entity Type option.

After you configure the new custom entity type:

  • To save the new type, but not scan dataset files for the new type, click Save Without Scanning Files.

  • To both save the new type and scan for it, click Save and Scan Files.

To detect new custom entity types in a dataset, Textual needs to run a scan. If you do not run the scan when you save the custom entity type, then on the dataset details page, you are prompted to run a scan.

Editing a custom entity type

Required global permission: You can edit any custom entity type that you create.

Users with the global permission Edit any custom entity type can edit any custom entity type.

To edit a custom entity type, on the Custom Entity Types page, click the edit icon for the entity type.

You can also edit a custom entity type from the dataset details page.

For an existing entity type, you can change the description, the regular expressions, and the enabled datasets.

You cannot change the entity type name, which is used to produce the identifier to use to configure the entity type handling from the SDK.

After you update the configuration:

  • To save the changes, but not scan dataset files based on the updated configuration, click Save Without Scanning Files.

  • To both save the changes and scan based on the updated configuration, click Save and Scan Files.

To reflect the changes to custom entity types in a dataset, Textual needs to run a scan. If you do not run the scan when you save the changes, then on the dataset details page, you are prompted to run a scan.

Deleting a custom entity type

When you delete a custom entity type, it is removed from the datasets that it was active for.

To delete a custom entity type:

  1. On the Custom Entity Types page, click the delete icon for the entity type.

  2. On the confirmation panel, click Delete Entity Type.

Custom entity type configuration settings

The custom entity type configuration includes:

  • Name and description

  • Regular expressions to identify matching values. From the configuration panel, you can test the expressions against text that you provide.

  • Datasets to make the entity type active for. You can also enable and disable custom entity types from the dataset details pages.

Name and description

In the Name field, provide a name for the entity type. Each custom entity type name:

  • Must be unique within an organization.

  • Can only contain alphanumeric characters and spaces. Custom entity type names cannot contain punctuation or other special characters.

After you save the entity type, you cannot change the name. Textual uses the name as the basis for the identifier that you use to refer to the entity type in the SDK.

In the Description field, provide a longer description of the custom entity type.
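The character rule for names can be checked before you open the configuration panel. The following is a minimal sketch, not part of Textual. The function name is hypothetical, and the sketch assumes that "alphanumeric" means ASCII letters and digits. It does not check that the name is unique within the organization.

```python
import re

# Assumption: "alphanumeric" means ASCII letters and digits only.
_NAME_PATTERN = re.compile(r"[A-Za-z0-9 ]+")


def is_valid_custom_entity_name(name: str) -> bool:
    """Return True if the name contains only letters, digits, and spaces."""
    return bool(_NAME_PATTERN.fullmatch(name))


print(is_valid_custom_entity_name("My New Type"))   # True
print(is_valid_custom_entity_name("My-New-Type!"))  # False
```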

Regular expressions to identify matching values

Under Keywords, Phrases, or Regexes, provide expressions to identify matching values for the entity type.

An entry can be as simple as a single word or phrase, or you can provide a more complex regular expression to identify the values.

Textual maintains an empty row at the bottom of the list. When you type an expression into the last row, Textual adds a new empty row.

To add an entry, begin to type the value in the empty row.

To edit an entry, click the entry field, then edit the value.

To remove an entry, click its delete icon.

Testing an expression

Under Test Entry, you can check whether Textual correctly identifies a value as the entity type based on the provided expression.

To test an expression:

  1. From the dropdown list, select the entry to test.

  2. In the text area, provide the text to test.

As you enter the text, Textual automatically scans the text for matches to the selected expression. The Result field displays the input text and highlights the matching values.
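To try an expression outside of Textual, you can approximate the Result field with a few lines of Python. This is only a sketch: it uses Python's re module, whose syntax can differ from the regular expression engine that Textual uses, and the function name and sample pattern are hypothetical.

```python
import re


def highlight_matches(pattern: str, text: str) -> str:
    """Wrap every match of the pattern in square brackets."""
    return re.sub(pattern, lambda m: f"[{m.group(0)}]", text)


# Hypothetical expression for account codes such as ACCT-1234.
print(highlight_matches(r"ACCT-\d{4}", "Invoice for ACCT-1234 and ACCT-9876."))
# Invoice for [ACCT-1234] and [ACCT-9876].
```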

Enabling and disabling the entity type for datasets

Under Activate custom entity, you identify the datasets to make the entity active for. From the dataset details, you can also enable and disable custom entity types for that dataset.

To make the entity active for all current and future datasets, check Automatically activate for all current, and new pipelines and datasets.

To make the entity active for specific datasets, set the toggle for the dataset to the on position.

To filter the list based on the dataset name, in the filter field, begin to type text from the name. Textual updates the list to only include matching datasets.

To update all of the currently displayed datasets, click Bulk action, then click Enable or Disable.

You can also enable and disable custom entity types from within a dataset. For more information, go to .

Creating a dataset

Required global permission: Create datasets

When you create a dataset, you specify:

  • The type of output to produce.

  • The source location for the files.

  • If the files are in cloud storage, the connection credentials.

Setting the name, source type, and output type

To create a dataset:

  1. On the Datasets page, click Create a Dataset.

  2. In the Dataset Name field, provide a name for the dataset.

  3. Under Output Format, select the type of output to generate.

  4. Under File Source, select the source type. If the source type is a cloud storage option, then provide the required credentials.

  5. Click Save.

  6. For cloud storage datasets:

    1. Textual prompts you to configure the initial file selection. For more information, go to Selecting cloud storage files.

    2. After you select the files, Textual prompts you to select an output location. For more information, go to Changing cloud storage credentials and output location.

Providing credentials for Amazon S3

If the source type is Amazon S3, provide the credentials to use to connect to Amazon S3.

  1. For a self-hosted instance, select the location of the credentials. You can either provide credentials manually, or use credentials that are configured in environment variables. Note that after you save the dataset, you cannot change the selection.

  2. If you are not using environment variables, then in the Access Key field, provide an AWS access key that is associated with an IAM user or role. For an example of a role that has the required permissions for an Amazon S3 dataset, go to Required IAM role permissions for Amazon S3.

  3. In the Access Secret field, provide the secret key that is associated with the access key.

  4. From the Region dropdown list, select the AWS Region to send the authentication request to.

  5. In the Session Token field, provide the session token to use for the authentication request.

  6. To test the credentials, click Test AWS Connection.

  7. By default, connections to Amazon S3 use Amazon S3 encryption. To instead use AWS KMS encryption:

    1. Click Show Advanced Options.

    2. From the Server-Side Encryption Type dropdown list, select AWS KMS.

    3. In the Server-side Encryption AWS KMS ID field, provide the KMS key ID. Note that if the KMS key doesn't exist in the same account that issues the command, you must provide the full key ARN instead of the key ID.

    Note that after you save the new dataset, you cannot change the encryption type.

  8. Click Save. Textual prompts you to select the dataset files.

Providing Azure credentials

If the source type is Azure, provide the connection information:

  1. In the Account Name field, provide the name of your Azure account.

  2. In the Account Key field, provide the access key for your Azure account.

  3. To test the connection, click Test Azure Connection.

  4. Click Save. Textual prompts you to select the dataset files.

Providing SharePoint credentials

If the source type is SharePoint, provide the credentials for the Entra ID application.

The credentials must have the following application permissions (not delegated permissions):

  • Files.Read.All - To see the SharePoint files

  • Files.ReadWrite.All - To write redacted files and metadata back to SharePoint

  • Sites.ReadWrite.All - To view and modify the SharePoint sites

To provide the credentials:

  1. In the Tenant ID field, provide the SharePoint tenant identifier for the SharePoint site.

  2. In the Client ID field, provide the client identifier for the SharePoint site.

  3. In the Client Secret field, provide the secret to use to connect to the SharePoint site.

  4. To test the connection, click Test SharePoint Connection.

  5. Click Save. Textual prompts you to select the dataset files.

Selecting cloud storage files

Required dataset permission: Edit dataset settings

For a cloud storage dataset, you manage files from the file selection panel.

When you create a dataset, after you provide the cloud storage credentials and save the dataset, Textual immediately prompts you to select dataset files. After you select the files and click Next, Textual prompts you to set the output location.

For an existing dataset, to display the File Selection panel, on the dataset details page, click Select Files.

The file selection includes:

  • Whether to restrict the dataset to specific file types

  • The files or folders to include in the dataset

When you change the file selection, Textual scans the files for entities. For more information, go to Tracking and managing file processing.

Filtering files by file extension

When you select files, you can filter the selectable files based on file extension.

To limit the file extensions to include:

  1. Click File Extension Filter. By default, all file extensions are included, and none of the checkboxes are checked.

  2. Check the checkbox for each file extension to include. As you select the file extensions to include, Textual updates the navigation pane so that you can only select files that have one of those file extensions. It hides files that have other file extensions and folders that do not contain files with the selected file extensions.

Selecting files and folders to include

In the file selection area, you navigate to and select the folders and files to add to the dataset.

Navigating through the folders

In the navigation area, to display the contents of a folder, click the Open link for the folder.

Selecting a file or folder

To add a folder or file to the dataset, check its checkbox.

Managing selected folders

In the navigation pane, when you check a folder checkbox, Textual adds it to the Prefix Patterns list.

Adding a folder manually

Instead of navigating to a folder and selecting it, you can add the path to the list manually.

To add a folder path:

  1. Click Add Prefix Pattern.

  2. In the field, type the path to the folder, then click the save icon.

Removing folder paths

To remove a folder path from the dataset, either:

  • In the navigation pane, uncheck its checkbox.

  • In the Prefix Patterns list, click its delete icon.

For the selected folders, the dataset includes all of the applicable files in the folder that:

  • Are of a file type that Textual supports

  • Match the file extension filter
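The inclusion rules for selected folders can be modeled as a simple filter. The sketch below is illustrative only and is not how Textual is implemented. The function name, the sample paths, and the set of supported extensions are hypothetical. As in the file extension filter, an empty filter means that all file extensions are included.

```python
from typing import Iterable, List, Set

# Hypothetical set of supported extensions, for illustration only.
SUPPORTED = {"pdf", "docx", "txt"}


def select_files(
    paths: Iterable[str],
    prefix_patterns: Iterable[str],
    extension_filter: Set[str],
    supported_extensions: Set[str],
) -> List[str]:
    """Return the paths under a selected prefix that pass both extension checks."""
    prefixes = tuple(prefix_patterns)
    selected = []
    for path in paths:
        ext = path.rsplit(".", 1)[-1].lower() if "." in path else ""
        if not path.startswith(prefixes):
            continue  # Not in a selected folder
        if ext not in supported_extensions:
            continue  # Not a supported file type
        # An empty filter means that all file extensions are included.
        if extension_filter and ext not in extension_filter:
            continue
        selected.append(path)
    return selected


paths = ["forms/a.pdf", "forms/b.docx", "forms/c.exe", "other/d.pdf"]
print(select_files(paths, ["forms/"], {"pdf"}, SUPPORTED))  # ['forms/a.pdf']
```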

Managing selected files

In the navigation pane, when you select an individual file, Textual adds it to the Selected Files list.

To remove a file from the dataset, either:

  • In the navigation pane, uncheck its checkbox.

  • In the Selected Files list, click its delete icon.

Configuring added and excluded values for built-in entity types

Required dataset permission: Edit dataset settings

In a dataset, for each built-in entity type, you can configure additional values to detect, and values to exclude. You cannot define added and excluded values for custom entity types.

You might add values that Textual does not detect because, for example, they are specific to your organization or industry.

You might exclude a value because:

  • Textual labeled the value incorrectly.

  • You do not want to redact a specific value. For example, you might want to preserve known test values.

Note that for a pipeline that redacts files, you cannot add or exclude specific values.

Viewing whether an entity type has added or excluded values

In the entity types list, the add values and exclude values icons indicate whether there are configured added and excluded values for the entity type.

When added or excluded values are configured, the corresponding icon is green.

When there are no configured values, the corresponding icon is black.

Displaying the Configure Entity Detection panel

From the Configure Entity Detection panel, you configure both added and excluded values for entity types.

To display the panel, click the add values or exclude values icon for an entity type.

The panel contains an Add to detection tab for added values, and an Exclude from detection tab for excluded values.

Selecting the entity type to add or exclude values for

The entity type dropdown list at the top of the Configure Entity Detection panel indicates the entity type to configure added and excluded values for.

The initial selected entity type is the entity type for which you clicked the icon. To configure values for a different entity type, select the entity type from the list.

Configuring added values

On the Add to detection tab, you configure the added values for the selected entity type.

Each value can be a specific word or phrase, or a regular expression to identify the values to add. Regular expressions must be C# compatible.

Configuring a new added value

To configure a new added value:

  1. Click the empty entry.

  2. Type the value into the field.

Editing an added value

To edit an added value:

  1. Click the value.

  2. Update the value text.

Testing an added value

For each added value, you can test whether Textual correctly detects it.

To test a value:

  1. From the Test Entry dropdown list, select the number for the value to test.

  2. In the text field, type or paste content that contains a value or values that Textual should detect.

The Results field displays the text and highlights matching values.

Removing an added value

To remove an added value, click its delete icon.

Configuring excluded values

On the Exclude from detection tab, you configure the excluded values for the selected entity type.

Each value can be either a specific word or phrase to exclude, or a regular expression to identify the values to exclude. The regular expression must be C# compatible.

You can also provide a specific context within which to ignore a value. For example, in the phrase "one moment, please", you probably do not want the word "one" to be detected as a numeric value. If you specify "one moment, please" as an excluded value for the numeric entity type, then "one" is not identified as a number when it is seen in that context.
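The effect of a context exclusion can be modeled in a few lines of Python. This is a sketch of the behavior described above, not Textual's detection logic. The tiny number-word detector and the function name are hypothetical, and the sketch uses Python's re module instead of a C# regular expression engine.

```python
import re
from typing import List, Tuple

# Hypothetical stand-in for a numeric entity detector.
NUMBER_WORDS = re.compile(r"\b(?:one|two|three)\b", re.IGNORECASE)


def detect_numbers(text: str, excluded: List[str]) -> List[Tuple[int, str]]:
    """Return (offset, value) pairs that do not fall inside an excluded context."""
    blocked = [m.span() for pattern in excluded for m in re.finditer(pattern, text)]
    results = []
    for m in NUMBER_WORDS.finditer(text):
        inside = any(start <= m.start() and m.end() <= end for start, end in blocked)
        if not inside:
            results.append((m.start(), m.group(0)))
    return results


text = "I ordered one ticket. one moment, please."
print(detect_numbers(text, []))                      # both occurrences of "one"
print(detect_numbers(text, ["one moment, please"]))  # only the first "one"
```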

Adding an excluded value

To add an excluded value:

  1. Click the empty entry.

  2. Type the value into the field.

Editing an excluded value

To edit an excluded value:

  1. Click the value.

  2. Update the value text.

Testing an excluded value

For each excluded value, you can test whether Textual correctly detects it.

To test the value that you are currently editing:

  1. From the Test Entry dropdown list, select the number for the value to test.

  2. In the text field, type or paste content that contains a value or values to exclude.

The Results field displays the text and highlights matching values.

Removing an excluded value

To remove an excluded value, click its delete icon.

Saving the updated added and excluded values

The new added and excluded values are not reflected in the entity types list until Textual runs a new scan.

When you save the changes, you can choose whether to immediately run a new scan on the dataset files.

To save the changes and also start a scan, click Save and Scan Files.

To save the changes, but not run a scan, click Save Without Scanning Files. When you do not run the scan, then on the dataset details page, Textual displays a prompt to run a scan.

Previewing file output

Required dataset permission: Preview redacted dataset files

You cannot preview TIF image files. You can preview PNG and JPG files.

Displaying a dataset file preview

From the file list, to display the preview, either:

  • Click the file name.

  • Click the options menu, then click Preview.

File preview for a redacted file

For a dataset that generates output files of the same type as the original file:

  • On the left, the preview displays the original data. The detected entity values are highlighted.

  • On the right, the preview displays the data with replacement values that are based on the dataset configuration for the detected entity types.

Format of redacted values

Note that in the preview, the redacted values do not include the identifier. They only include the entity type. For example, NAME_GIVEN instead of NAME_GIVEN_1d9w5. The identifiers are included when you download the files.

Preview for PDF and image files

For a PDF or image file, for entity types that use the Redact handling option:

  • If there is space to display the entity type, then it is displayed.

  • Otherwise, the value is covered by a black box.

When you hover over a black box, the entity type displays in a tooltip.

To view the entity type labels, you can also zoom into the file.

The preview for a PDF file also reflects any manual overrides.

Selecting entity type handling options from the preview

You can use the preview to select the entity type handling option for each entity type. The options are:

  • Redact - This is the default value. Textual replaces the value with the name of the entity type followed by a unique identifier. For example, the first name John is replaced with NAME_GIVEN_12345. Note that the identifier is only visible in the downloaded file. It does not display on the preview.

  • Synthesize - Textual replaces the value with a realistic generated value. For example, the first name John is replaced with the first name Michael. The replacement values are consistent, which means that a given value always has the same replacement. For example, Michael is always the replacement value for John.

  • Off - Textual ignores the value and copies it as is to the output file.
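To illustrate what "consistent" means for the Synthesize option, the following sketch derives a replacement from a hash of the original value, so that the same input always yields the same output. This is an assumption-laden illustration, not Textual's synthesis algorithm. The name pool and function name are hypothetical.

```python
import hashlib

# Hypothetical pool of replacement first names.
REPLACEMENT_NAMES = ["Michael", "Priya", "Tomas", "Aiko", "Lena"]


def consistent_replacement(value: str) -> str:
    """Map a value to a replacement that is stable across calls and runs."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(REPLACEMENT_NAMES)
    return REPLACEMENT_NAMES[index]


# The same original value always maps to the same replacement.
print(consistent_replacement("John") == consistent_replacement("John"))  # True
```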

To select the entity type handling option:

  1. In the results panel, click a detected value.

  2. On the panel, click the entity type handling option. Textual applies the same option to all entity values of that type.

From the preview, you can only select the entity type handling option. For the Synthesize option, you cannot configure synthesis options for an entity type. You must configure those options from the dataset details page. For more information, go to Configuring synthesis options.

Ignoring specific instances in PDF files

From the PDF preview, you can also choose to ignore a specific value.

To configure whether to ignore a specific detected value:

  1. In the results panel, click the value.

  2. On the panel, to ignore the value, toggle Ignore Redaction to the on position.

File preview for a JSON output file

For a dataset that generates JSON output:

  • On the left is the original content. For files other than .txt files, you can toggle between generated Markdown and the rendered file.

  • On the right is a set of tabs that summarize the results.

Entities tab - Detected entities in the file

The Entities tab displays the file content with the detected entity values in context.

The actual values are followed by the type labels. For example, the given name John is displayed as John NAME_GIVEN.

JSON tab - Output JSON for the file

The JSON tab contains the content of the actual output file.

For details about the JSON output structure for the different types of files, go to Structure of JSON output files.

Tables tab - Tables in a PDF or image file

For a PDF or image file that contains one or more tables, the Tables tab displays the tables. If the file does not contain any tables, then the Tables tab does not display.

Key-Value Pairs tab - Key-value pairs in a PDF or image file

For a PDF or image file that contains key-value pairs, the Key-Value Pairs tab displays the key-value pairs. If the file does not contain key-value pairs, then the Key-Value Pairs tab does not display.

Configure entity type handling for redaction

Required dataset permission: Edit dataset settings

By default, when you:

  • Configure a dataset

  • Redact a string

  • Retrieve a redacted file

Textual does the following:

  • For the string and file redaction, replaces detected values with tokens.

  • For LLM synthesis, generates realistic synthesized values.

When you make the request, you can:

  • Override the default behavior.

  • For individual files and text strings, specify custom entity types to include.

Specifying the handling option for entity types

For each entity type, you can choose to redact, synthesize, or ignore the value.

  • When you redact a value, Textual replaces the value with a token that consists of the entity type. For example, ORGANIZATION.

  • When you synthesize a value, Textual replaces the value with a different realistic value.

  • When you ignore a value, Textual passes through the original value.

To specify the handling option for entity types, you use the generator_config parameter.

Where:

  • <entity_type> is the identifier of the entity type. For example, ORGANIZATION. For the list of built-in entity types that Textual scans for, go to . For custom entity types, the identifier is the entity type name in all caps. Spaces are replaced with underscores, and the identifier is prefixed with CUSTOM_. For example, for a custom entity type named My New Type, the identifier is CUSTOM_MY_NEW_TYPE. From the Custom Entity Types page, to copy the identifier of a custom entity type, click its copy icon.

  • <handling_option> is the handling option to use for the specified entity type. The possible values are Redact, Synthesis, and Off.

For example, to synthesize organization values, and ignore languages:
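A minimal sketch of such a mapping, built as a Python dictionary that maps each entity type identifier to a handling option. The helper function is hypothetical and only validates the option names listed above. LANGUAGE is assumed to be the identifier for the language entity type. How you pass the dictionary as generator_config depends on the call that you use.

```python
from typing import Dict

# The handling options named in this section.
HANDLING_OPTIONS = {"Redact", "Synthesis", "Off"}


def build_generator_config(options: Dict[str, str]) -> Dict[str, str]:
    """Validate the handling options and return a generator_config mapping."""
    for entity_type, handling in options.items():
        if handling not in HANDLING_OPTIONS:
            raise ValueError(f"Unknown handling option for {entity_type}: {handling}")
    return dict(options)


# Synthesize organization values and ignore languages.
# Assumption: LANGUAGE is the identifier for the language entity type.
generator_config = build_generator_config({"ORGANIZATION": "Synthesis", "LANGUAGE": "Off"})
print(generator_config)
```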

Specifying a default handling option

For string and file redaction, you can specify a default handling option to use for entity types that are not specified in generator_config.

To do this, you use the generator_default parameter.

generator_default can be either Redact, Synthesis, or Off.
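The relationship between the two parameters can be sketched as a lookup with a fallback. The function below is hypothetical and only models the behavior described here: an entity type that is listed in generator_config uses its own option, and any other entity type uses generator_default. The fallback to Redact reflects the default token replacement described earlier.

```python
from typing import Dict


def resolve_handling(
    entity_type: str,
    generator_config: Dict[str, str],
    generator_default: str = "Redact",
) -> str:
    """Return the handling option that applies to an entity type."""
    return generator_config.get(entity_type, generator_default)


config = {"ORGANIZATION": "Synthesis"}
print(resolve_handling("ORGANIZATION", config, generator_default="Off"))  # Synthesis
print(resolve_handling("NAME_GIVEN", config, generator_default="Off"))    # Off
```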

Providing added and excluded values for entity types

You can also configure added and excluded values for each entity type.

You add values that Textual does not detect for an entity type, but should. You exclude values that you do not want Textual to identify as that entity type.

  • To specify the added values, use label_allow_lists.

  • To specify the excluded values, use label_block_lists.

For each of these parameters, the value is a list of entity types to specify the added or excluded values for. To specify the values, you provide an array of regular expressions.

The following example uses label_allow_lists to add values:

  • For NAME_GIVEN, adds the values There and Here.

  • For NAME_FAMILY, adds values that match the regular expression ([a-z]{2}).
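As a sketch, the values described above can be written as a Python dictionary that maps each entity type to a list of expressions. The helper function is hypothetical. It uses Python's re module to preview what each expression matches, which might differ from how Textual evaluates the expressions.

```python
import re
from typing import Dict, List

label_allow_lists: Dict[str, List[str]] = {
    "NAME_GIVEN": ["There", "Here"],
    "NAME_FAMILY": ["([a-z]{2})"],
}


def added_matches(entity_type: str, text: str) -> List[str]:
    """Return the substrings that the added-value expressions match for one entity type."""
    matches: List[str] = []
    for pattern in label_allow_lists.get(entity_type, []):
        matches.extend(m.group(0) for m in re.finditer(pattern, text))
    return matches


print(added_matches("NAME_GIVEN", "Here and There"))  # ['There', 'Here']
print(added_matches("NAME_FAMILY", "ab CD ef"))       # ['ab', 'ef']
```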

Including custom entity types

When you redact a string or download a redacted file, you can provide a comma-separated list of custom entity types to include. Textual then scans for and redacts those entity types based on the configuration in generator_config.

For example:
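The following sketch derives custom entity type identifiers from their names, using the naming rule described earlier (all caps, spaces replaced with underscores, CUSTOM_ prefix). The function name and the second entity type name are hypothetical. This section does not confirm whether a call expects a list or a comma-separated string, so the sketch produces both.

```python
from typing import List


def custom_entity_identifier(name: str) -> str:
    """Build the identifier for a custom entity type from its name."""
    return "CUSTOM_" + name.upper().replace(" ", "_")


# "Account Code" is a hypothetical custom entity type name.
custom_entities: List[str] = [
    custom_entity_identifier(name) for name in ["My New Type", "Account Code"]
]
comma_separated = ",".join(custom_entities)

print(custom_entities)   # ['CUSTOM_MY_NEW_TYPE', 'CUSTOM_ACCOUNT_CODE']
print(comma_separated)   # CUSTOM_MY_NEW_TYPE,CUSTOM_ACCOUNT_CODE
```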

Creating templates to apply to PDF files

A dataset might contain multiple files that have the same structure, such as a set of scanned-in forms.

Instead of adding the same manual overrides for each file, you can use a PDF file in the dataset to create a template that you can apply to other PDF files in the dataset.

When you edit an individual PDF file, you can apply a template.

Creating a PDF template

To add a PDF template to a dataset:

  1. On the dataset details page, click Templates.

  2. On the template creation and selection panel, click Create a New Template.

  3. On the template details page:

    1. In the Name field, provide a name for the template.

    2. From the file dropdown list, select the dataset file to use to create the template.

    3. Add the manual overrides to the file.

  4. When you finish adding the manual overrides, click Save New Template.

Updating an existing PDF template

When you update a PDF template, it affects any files that use the template.

To update a PDF template:

  1. On the dataset details page, click PDF Templates.

  2. Under Edit an Existing Template, select the template, then click Edit Selected Template.

  3. On the template details panel, you can change the template name, and add or remove manual overrides.

  4. To save the changes, click Update Template.

Managing the manual overrides

Adding a manual override

On the template details panel, to add a manual override to a file:

  1. Select the type of override. To indicate to ignore any automatically detected values in the selected area, click Ignore Redactions. To indicate to redact the selected area, click Add Manual Redaction.

  2. Use the mouse to draw a box around the area to select.

Tonic Textual adds the override to the Redactions list. The icon indicates the type of override.

Navigating to a manual override

To select and highlight a manual override in the file content, in the Redactions list, click the navigate icon for the override.

Removing a manual override

To remove a manual override, in the Redactions list, click the delete icon for the override.

Deleting a PDF template

When you delete a PDF template, the template and its manual overrides are removed from any files that the template was assigned to.

To delete a PDF template:

  1. On the dataset details page, click PDF Templates.

  2. Under Edit an Existing Template, select the template, then click Edit Selected Template.

  3. On the template details panel, click Delete.

Required IAM role permissions for Amazon S3

For Amazon S3 datasets and pipelines, you connect to S3 buckets to select and store files.

On self-hosted instances, you also configure an S3 bucket and the credentials to use to store files for:

  • File upload pipelines. The S3 bucket is required for file upload pipelines.

  • File upload datasets. If you do not configure an S3 bucket, then the files are stored in the application database.

  • Individual files that you send to the SDK for redaction. If you do not configure an S3 bucket, then the files are stored in the application database.

Here are examples of IAM roles that have the required permissions to connect to Amazon S3 to select or store files.

Example IAM role for file uploads and redactions

For file upload pipelines, datasets, and individual file redactions, the files are stored in a single S3 bucket. For information on how to configure the S3 bucket and the corresponding access credentials, go to Setting the S3 bucket for file uploads and redactions.

The IAM role that is used to connect to the S3 bucket must be able to read files from and write files to it.

Here is an example of an IAM role that has the permissions required to support uploaded file pipelines, datasets, and individual redactions:

Example IAM role for Amazon S3 datasets and pipelines

The access credentials that you configure for an Amazon S3 dataset or pipeline must be able to navigate to and select files and folders from the appropriate S3 buckets. They also need to be able to write output files to the configured output location.

Here is an example of an IAM role that has the permissions required to support Amazon S3 datasets or pipelines:

OpenID Connect (OIDC)

Use these instructions to set up an OpenID Connect SSO provider for Tonic Textual.

SSO setup

When you configure the application/client in your SSO system, you must configure it to use Authorization Code Flow.

You must also make note of the client_id. You must provide the client ID when you complete the configuration for Textual.

Redirect URI

In your SSO provider, configure the following redirect URI:

  • Sign-in redirect URI: <textual-base-url>/sso/callback/oidc
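As a small illustration, the redirect URI can be built from the base URL of your Textual instance. The function name and the example host are hypothetical.

```python
def oidc_redirect_uri(textual_base_url: str) -> str:
    """Append the OIDC callback path to the Textual base URL."""
    return textual_base_url.rstrip("/") + "/sso/callback/oidc"


# textual.example.com is a placeholder host.
print(oidc_redirect_uri("https://textual.example.com/"))
# https://textual.example.com/sso/callback/oidc
```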

Textual configuration

Required environment variables

After you set up the SSO provider, you uncomment and configure the required environment variables in Textual.

  • The application client identifier

  • For HTTP basic authentication (client_secret_basic), the client secret

  • The base URL of the provider. This is the location of /.well-known/openid-configuration

  • A regular expression to identify groups that are permitted to use Textual.
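Before you set these values, it can help to check the provider base URL and the group expression. The sketch below is illustrative only. The function names, the example host, and the group names are hypothetical. It assumes that a group is permitted when the expression matches anywhere in the group name, which might differ from how Textual applies the expression.

```python
import re
from typing import Iterable, List


def discovery_url(authority: str) -> str:
    """Return the location of the OIDC discovery document for a provider base URL."""
    return authority.rstrip("/") + "/.well-known/openid-configuration"


def permitted_groups(groups: Iterable[str], group_filter_regex: str) -> List[str]:
    """Return the groups that the filter expression matches (search semantics assumed)."""
    pattern = re.compile(group_filter_regex)
    return [group for group in groups if pattern.search(group)]


print(discovery_url("https://sso.example.com"))
# https://sso.example.com/.well-known/openid-configuration
print(permitted_groups(["textual-admins", "textual-users", "finance"], r"^textual-"))
# ['textual-admins', 'textual-users']
```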

For Kubernetes, in values.yaml:

For Docker, in .env:

Optional environment variables

You can also uncomment and configure the following optional environment variables:

  • A space-delimited list of scopes to request from the OIDC SSO provider. Because group information is not part of the standard OIDC specification, for Textual to capture group information, a custom scope must be configured.

  • The name of the claim that contains the user's first name.

  • The name of the claim that contains the user's last name.

  • The name of the claim that contains the user's email address or username.

  • The name of the claim that contains the user's group membership.

Textual has default values for these settings:

For Kubernetes, in values.yaml:

For Docker, in .env:

Editing an individual PDF file

For PDF files, you can add manual overrides to the initial redactions, which are based on the detected data types and handling configuration.

For each manual override, you select an area of the file.

For the selected area, you can either:

  • Ignore any automatically detected redactions. For example, a scanned form might show an example or boilerplate content that doesn't actually contain sensitive values.

  • Redact that area. The file might contain sensitive content that Tonic Textual is unable to detect. For example, a scanned form might contain handwritten notes.

You can also apply a template to the file.

Selecting the manual override option for a file

To manage the manual overrides for a PDF file:

  1. In the file list, click the options menu for the file.

  2. In the options menu, click Edit Redactions.

The File Redactions panel displays the file content. The values that Textual detected are highlighted. The page also shows any manual overrides that were added to the file.

Applying a PDF template to a file

If a dataset contains multiple files that have the same format, then you can create a template to apply to those files. For more information, go to Creating templates to apply to PDF files.

On the File Redactions panel, to apply a template to the file, select it from the template dropdown list.

When you apply a PDF template to a file, the manual overrides from that template are displayed on the file preview. The manual overrides are not included in the Redactions list.

Adding a manual override

On the File Redactions panel, to add a manual override to a file:

  1. Select the type of override. To indicate to ignore any automatically detected values in the selected area, click Ignore Redactions. To indicate to redact the selected area, click Add Manual Redaction.

  2. Use the mouse to draw a box around the area to select.

Textual adds the override to the Redactions list. The icon indicates the type of override.

In the file content:

  • Overrides that ignore detected values within the selected area are outlined in red.

  • Overrides that redact the selected area are outlined in green.

Navigating to a manual override

To select and highlight a manual override in the file content, in the Redactions list, click the navigate icon for the override.

Removing a manual override

To remove a manual override, in the Redactions list, click the delete icon for the override.

Saving the manual overrides

To save the current manual overrides, click Save.

Configure environment variables

How to set environment variable values and restart Textual in Docker and Kubernetes.

Set the number of textual-ml workers

Used to enable parallel processing in Textual.

Set a custom certificate

Provide a custom certificate to use for https traffic.

Configure custom AWS endpoints

Set custom endpoint URLs for calls to AWS services.

Enable PDF and image processing

Set the required configuration based on the OCR option that you want to use.

Enable uploads to uploaded file pipelines

Provide the required access to Amazon S3.

Configure model preferences

Select an auxiliary model and configure model usage for GPU.

Example IAM role for file uploads and redactions:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:PutObject",
                "s3:DeleteObject",
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::<SOLAR_INTERNAL_BUCKET_NAME>",
                "arn:aws:s3:::<SOLAR_INTERNAL_BUCKET_NAME>/*"
            ]
        }
    ]
}

Example IAM role for Amazon S3 datasets and pipelines:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "s3:ListAllMyBuckets",
            "Resource": "arn:aws:s3:::*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket",
                "s3:GetBucketLocation"
            ],
            "Resource": [
                "arn:aws:s3:::*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:PutObject",
                "s3:DeleteObject",
                "s3:ListBucketMultipartUploads",
                "s3:AbortMultipartUpload"
            ],
            "Resource": [
                "arn:aws:s3:::*/*"
            ]
        }
    ]
}

Required OIDC settings for Kubernetes, in values.yaml:

# OIDC SSO Config
# -----------------
#oidcClientId: <application client ID>
#oidcClientSecret: <client secret for HTTP basic authentication>
#oidcAuthority: <base URL of the provider>
#oidcGroupFilterRegex: <regular expression to identify allowed groups>

Required OIDC settings for Docker, in .env:

#SOLAR_SSO_OIDC_CLIENT_ID=#<application client ID>
#SOLAR_SSO_OIDC_CLIENT_SECRET=#<client secret for HTTP basic authentication>
#SOLAR_SSO_OIDC_AUTHORITY=#<base URL of the provider>
#SOLAR_SSO_OIDC_GROUP_FILTER_REGEX=#<regular expression to identify allowed groups>

Optional OIDC settings with their default values for Kubernetes, in values.yaml:

#oidcScopes: openid profile email
#oidcFirstNameClaimName: given_name
#oidcLastNameClaimName: family_name
#oidcEmailClaimName: email
#oidcGroupsClaimName: groups

Optional OIDC settings with their default values for Docker, in .env:

#SOLAR_SSO_OIDC_SCOPES=#openid profile email
#SOLAR_SSO_OIDC_FIRST_NAME_CLAIM_NAME=#given_name
#SOLAR_SSO_OIDC_LAST_NAME_CLAIM_NAME=#family_name
#SOLAR_SSO_OIDC_EMAIL_CLAIM_NAME=#email
#SOLAR_SSO_OIDC_GROUPS_CLAIM_NAME=#groups
generator_config={'<entity_type>':'<handling_option>'}
generator_config={'ORGANIZATION':'Synthesis', 'LANGUAGE':'Off'}
{'<entity_type>':['<regex>']}
(label_allow_lists={
    'NAME_GIVEN':['There','Here'], 
    'NAME_FAMILY':['([a-z]{2})']
    }
)
custom_entities=["<entity type identifier>"]
custom_entities=["CUSTOM_COGNITIVE_ACCESS_KEY", "CUSTOM_PERSONAL_GRAVITY_INDEX"]

Create and manage datasets

Textual uses datasets to produce files with sensitive values replaced.

Before you perform these tasks, remember to instantiate the SDK client.

Get your list of datasets

To get the complete list of datasets that you own, use textual.get_all_datasets.

datasets = textual.get_all_datasets()

Create and add files to a dataset

Required global permission: Create datasets

Required dataset permission: Upload files to a dataset

To create a new dataset and then upload a file to it, use textual.create_dataset.

dataset = textual.create_dataset('<dataset name>')

To add a file to the dataset, use dataset.add_file. To identify the file, provide the file path and name.

dataset.add_file('<path to file>','<file name>') 

To provide the file as IO bytes, you provide the file name and the file bytes. You do not provide a path.

dataset.add_file('<file name>',<file bytes>) 

Textual creates the dataset, scans the uploaded file, and redacts the detected values.

Configure a dataset

Required dataset permission: Edit dataset settings

To change the configuration of a dataset, use dataset.edit.

You can use dataset.edit to change:

  • The name of the dataset

  • The handling option for each entity type

  • Added or excluded values for each entity type

dataset.edit(name='<dataset name>',
  generator_config={'<entity_type>':'<handling_option>'},
  label_allow_lists={'<entity_type>':LabelCustomList(regexes=['<regex>'])},
  label_block_lists={'<entity_type>':LabelCustomList(regexes=['<regex>'])}
)

Alternatively, instead of specifying the configuration, you can use the copy_from_dataset parameter to indicate to copy the configuration from another dataset.
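
As a minimal sketch, you can assemble and check the generator_config value before you pass it to dataset.edit. The helper name build_generator_config is hypothetical and is not part of the Textual SDK. The accepted option strings are also an assumption: 'Synthesis' and 'Off' appear in SDK examples in this guide, and 'Redaction' is assumed to be the identifier of the default option.

```python
# Hypothetical helper; not part of the Textual SDK.
# Assumed option strings: 'Synthesis' and 'Off' appear in SDK examples,
# and 'Redaction' is assumed to name the default option.
HANDLING_OPTIONS = {"Redaction", "Synthesis", "Off"}


def build_generator_config(**handling):
    """Return a dict that maps entity type identifiers to handling options."""
    config = {}
    for entity_type, option in handling.items():
        if option not in HANDLING_OPTIONS:
            raise ValueError(f"Unknown handling option for {entity_type}: {option!r}")
        config[entity_type.upper()] = option
    return config


generator_config = build_generator_config(ORGANIZATION="Synthesis", LANGUAGE="Off")
# The result can then be passed as dataset.edit(generator_config=generator_config).
```

Checking the option strings locally catches a misspelled option before the request is sent.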

Get the current status of dataset files

Required dataset permission: Preview redacted dataset files

To get the current status of the files in the current dataset, use dataset.describe:

dataset.describe()

The response includes:

  • The name and identifier of the dataset

  • The number of files in the dataset

  • The number of files that are waiting to be processed (scanned and redacted)

  • The number of files that had errors during processing

For example:

    Dataset: example [879d4c5d-792a-c009-a9a0-60d69be20206]
    Number of Files: 1
    Files that are waiting for processing: 
    Files that encountered errors while processing: 
    Number of Rows: 0
    Number of rows fetched: 0

Get lists of files by status

Required dataset permission: Preview redacted dataset files

To get a list of files that have a specific status, use the following:

  • dataset.get_failed_files

  • dataset.get_running_files

  • dataset.get_queued_files

  • dataset.get_processed_files

The file list includes:

  • File identifier and name

  • Number of rows and columns

  • Processing status

  • For failed files, the error

  • When the file was uploaded
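
The status lists can be combined into a simple polling loop. The following is a sketch only. The function wait_for_processing is hypothetical, and it assumes that each of the status methods can be called with no arguments and returns a list that is empty when no files have that status.

```python
import time


def wait_for_processing(dataset, poll_seconds=5.0, timeout_seconds=600.0, sleep=time.sleep):
    """Poll until no files are queued or running, then return the failed files.

    Assumes the status methods take no arguments and return lists.
    """
    waited = 0.0
    while dataset.get_queued_files() or dataset.get_running_files():
        if waited >= timeout_seconds:
            raise TimeoutError("Dataset files are still being processed")
        sleep(poll_seconds)
        waited += poll_seconds
    return dataset.get_failed_files()
```

For example, failed = wait_for_processing(dataset) returns after processing finishes, and failed contains any files that encountered errors.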

Delete a file from a dataset

Required dataset permission: Delete files from a dataset

To delete a file from a dataset, use dataset.delete_file.

dataset.delete_file('<file identifier>')

Get redacted content for a dataset

Required dataset permission: Download redacted dataset files

To get the redacted content in JSON format for a dataset, use dataset.fetch_all_json():

dataset = textual.get_dataset('<dataset name>')
dataset.fetch_all_json()

For example:

dataset = textual.get_dataset('mydataset')
dataset.fetch_all_json()

The response looks something like:

'[["PERSON Portrait by PERSON, DATE_TIME ...]'

Managing Textual users

Required global permission: View users and groups

To display the list of Textual users:

  1. Click the user icon at the top right.

  2. In the user menu, click Permission Settings.

  3. On the Permission Settings page, click Users.

Datasets flows

You use a Textual dataset to detect sensitive values in files. The dataset output can be either:

  • Files in the same format as the original file, with the sensitive values replaced based on the dataset configuration.

  • JSON files that contain a summary of the detected values and replacements.

You can also create and manage datasets from the Textual SDK or REST API.

Overall workflow

At a high level, to use Textual to detect sensitive values and create redacted data:

Diagram of the Tonic Textual dataset workflow

Create and populate a dataset or pipeline

  1. Create a Textual dataset, which is a set of files to redact. The files can be uploaded from a local file system, or can come from a cloud storage solution. When you create the dataset, you also choose the type of output, which can be either:

    • The redacted version of the original files. The file is in the same format as the original file.

    • JSON summaries of the files and the detected entities.

  2. Add files to the dataset. Textual supports almost any free-text file, PDF files, .docx files, and .xlsx files.

    For images, Textual supports PNG, JPG (both .jpg and .jpeg), and TIF (both .tif and .tiff) files.

  3. Textual uses its built-in models to scan the files and identify sensitive values. For JSON output, Textual also immediately generates the output files.

Review the redaction results

Review the types of entities that were detected in the scanned files.

Configure entity type handling

At any time, for datasets that produce redacted files, you can configure how Textual handles the detected values for each entity type.

For all datasets, you can provide added and excluded values for each built-in entity type.

You can also create and enable custom entity types.

Select the handling option for each entity type

For datasets that produce redacted output files, you configure how Textual redacts the values. This configuration does not apply to datasets that produce JSON output.

For each entity type, you select the action to perform on detected values. The options are:

  • Redaction - By default, Textual redacts the entity values. It replaces each value with a token that identifies the type of sensitive value, followed by a unique identifier. For example, NAME_GIVEN_l2m5sb or LOCATION_j40pk6. The identifiers are consistent, which means that the same original value always gets the same identifier. For example, the first name Michael might always be replaced with NAME_GIVEN_l2m5sb, while the first name Helen might always be replaced with NAME_GIVEN_9ha3m2. For PDF files, redaction either covers the value with a black box or, if there is space, displays the entity type and identifier. For image files, redaction covers the value with a black box.

  • Synthesis - For a given entity type, you can instead choose to synthesize the values, which means to replace the original value with a realistic replacement. The synthesized values are always consistent, meaning that a given original value always produces the same replacement value. For example, the first name Michael might always be replaced with the first name John. You can also identify specific replacement values.

  • Ignore - You can choose to ignore the values, and not replace them.

Textual automatically updates the file previews and downloadable files to reflect the updated configuration.
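
To illustrate what consistent identifiers mean for the Redaction option, the following sketch derives the identifier from a hash of the original value. This is not how Textual generates its identifiers, which this guide does not describe. The function redaction_token and the use of a SHA-256 prefix are assumptions chosen only to show the property.

```python
import hashlib


def redaction_token(entity_type, value, length=6):
    """Build a token such as NAME_GIVEN_<id> where <id> depends only on the value.

    Illustration only: the hash-based identifier is an assumption.
    """
    digest = hashlib.sha256(f"{entity_type}:{value}".encode("utf-8")).hexdigest()
    return f"{entity_type}_{digest[:length]}"


# The same original value always produces the same token.
first = redaction_token("NAME_GIVEN", "Michael")
second = redaction_token("NAME_GIVEN", "Michael")
```

Because the identifier depends only on the entity type and the original value, repeated occurrences of a value remain linkable in the redacted output without revealing the value itself.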

Define added and excluded values for entity types

Optionally, for all datasets, you can create lists of values to add to or exclude from an entity type. You might do this to reflect values that are not detected or that are detected incorrectly.

Manually update PDF files

Datasets also provide additional options to redact PDF files.

You can add manual overrides to a PDF file. When you add a manual override, you draw a box to identify the affected portion of the file.

You can use manual overrides either to ignore the automatically detected redactions in the selected area, or to redact the selected area.

To make it easier to process multiple files that have a similar format, such as a form, you can create templates that you can apply to PDF files in the dataset.

Generate or download output files

After you complete the redaction configuration and manual updates, to obtain the output files:

  • For local file datasets, you download the output files.

  • For cloud storage datasets, for datasets that produce original format files, you run a generation job that writes the output files to the configured output location. For datasets that produce JSON output, the files are generated to the output location as soon as the output location is configured.

File upload and download flows

For a local file dataset, the file upload and download flows are as follows. For a more general overview of the Textual architecture, go to Textual architecture.

File upload flow

When you upload a file to a local file dataset, the flow is as follows:

File upload flow for a local file dataset
  1. The Textual user uploads the file.

  2. The API service stores the file in either Amazon S3 or the Textual application database. For more information, go to Setting the S3 bucket for file uploads and redactions.

  3. The API service starts a job in the worker.

  4. The worker sends any PDF and image files to the OCR service (Amazon Textract, Document Intelligence, or Tesseract) to extract the file text.

  5. The OCR service returns the PDF and image text to the worker.

  6. The worker submits the file text to the Textual machine learning service to detect and replace entity values.

  7. The machine learning service returns the results to the worker.

  8. The worker stores the results in the application database.
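
The upload steps above can be sketched as a single function. This is an illustration of the sequence only, not Textual's implementation. The function process_upload and its parameters (store, ocr, detect, results_db) are hypothetical stand-ins for the file storage, the OCR service, the machine learning service, and the application database.

```python
import os

# PDF and image extensions that are sent to the OCR service in this sketch.
OCR_EXTENSIONS = {".pdf", ".png", ".jpg", ".jpeg", ".tif", ".tiff"}


def process_upload(file_name, content, store, ocr, detect, results_db):
    """Walk one uploaded file through the storage, OCR, and detection steps."""
    store[file_name] = content  # step 2: store the uploaded file
    extension = os.path.splitext(file_name)[1].lower()
    if extension in OCR_EXTENSIONS:
        text = ocr(content)  # steps 4-5: extract text from PDF and image files
    else:
        text = content.decode("utf-8")  # other files: assume UTF-8 text
    results_db[file_name] = detect(text)  # steps 6-8: detect entities, store results
    return results_db[file_name]
```

Passing the services in as callables keeps the sketch independent of any particular OCR provider.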

File download flow

When you download a redacted file from a local file dataset, the flow is as follows:

File download flow for a local file dataset
  1. The Textual user makes the request to download the file.

  2. The API service retrieves the file from where it is stored in either Amazon S3 or the application database.

  3. The API service retrieves the detected entities and entity handling settings from the application database.

  4. The API service applies those results to the file.

  5. The API service returns the redacted file to the Textual user.
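
Step 4 of the download flow, where the stored results are applied to the file, can be sketched for plain text as follows. The function apply_redactions and the shape of the entities list are assumptions for illustration. The sketch also simplifies redaction to the entity type name without an identifier, and assumes that detected spans do not overlap.

```python
def apply_redactions(text, entities, handling):
    """Replace detected spans according to the handling option of each entity type.

    entities: list of dicts with 'start', 'end', 'label', and 'replacement' keys.
    handling: dict that maps an entity type to 'Redaction', 'Synthesis', or 'Off'.
    Entity types that are not in handling use 'Redaction'.
    """
    pieces = []
    cursor = 0
    for entity in sorted(entities, key=lambda e: e["start"]):
        pieces.append(text[cursor:entity["start"]])
        original = text[entity["start"]:entity["end"]]
        option = handling.get(entity["label"], "Redaction")
        if option == "Off":
            pieces.append(original)  # keep the value as is
        elif option == "Synthesis":
            pieces.append(entity["replacement"])  # realistic replacement value
        else:
            pieces.append(entity["label"])  # simplified redaction token
        cursor = entity["end"]
    pieces.append(text[cursor:])
    return "".join(pieces)


sample_text = "Michael lives in Lyon."
sample_entities = [
    {"start": 0, "end": 7, "label": "NAME_GIVEN", "replacement": "John"},
    {"start": 17, "end": 21, "label": "LOCATION_CITY", "replacement": "Paris"},
]
redacted = apply_redactions(sample_text, sample_entities, {"NAME_GIVEN": "Synthesis"})
```

Because the original file and the detection results are stored separately, a change to the handling settings only requires this step to run again, not a new scan.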

Keycloak

Use these instructions to set up Keycloak as your SSO provider for Tonic Textual.

Keycloak configuration

Within Keycloak, select the realm to use for your Textual client. Under Clients, click Create client.

On the Create client page, under General Settings:

  1. From the Client type dropdown list, select OpenID Connect.

  2. Enter a Client ID and Name.

  3. Click Next.

On the Capability Config tab, click Save. The details page for the new client displays.

On the Settings tab, under Access settings, enter your Textual URL information.

Click Client scopes. Each client has a dedicated scope named <client-id>-dedicated. To configure the scope, click the scope name.

On the Mappers tab, to add a property mapper to the scope, click Configure a new mapper.

In the list of mapper types, click Group Membership.

Under Add mapper, set both Name and Token Claim Name to groups.

The Full group path toggle affects how child groups appear in Tonic:

  • When on, child groups display as parent group/child group.

  • When off, child groups display as child group.

To save the new group membership mapper, click Save.
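
The Full group path setting matters when you also use a group filter regular expression. The following sketch shows why. The functions group_names and allowed_groups are hypothetical, and the use of a partial match (re.search) is an assumption; this guide does not state how Textual evaluates the expression.

```python
import re


def group_names(memberships, full_group_path):
    """Build group names from (parent, child) pairs.

    parent is None for top-level groups.
    """
    names = []
    for parent, child in memberships:
        if full_group_path and parent:
            names.append(f"{parent}/{child}")
        else:
            names.append(child)
    return names


def allowed_groups(names, group_filter_regex):
    """Keep the group names that match the filter expression."""
    pattern = re.compile(group_filter_regex)
    return [name for name in names if pattern.search(name)]


memberships = [("engineering", "textual-admins"), (None, "finance")]
with_path = allowed_groups(group_names(memberships, True), r"^engineering/")
without_path = allowed_groups(group_names(memberships, False), r"^engineering/")
```

In this sketch, an expression that is written against the parent group name matches only when the full path is included, so it is worth deciding on the toggle before you write the filter.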

Textual configuration

After you complete the configuration in Keycloak, you uncomment and configure the required environment variables in Textual. You provide the following values:

  • The realm URL

  • The client identifier

  • The client secret, if client authentication is enabled

For Kubernetes, in values.yaml:

# Keycloak SSO Config
# -----------------
#keycloakClientId: <client-id>
#keycloakClientSecret: <client-secret>
#keycloakAuthority: <authority-url>
#keycloakGroupFilterRegex: <regular expression to identify allowed groups>

For Docker, in .env:

#SOLAR_SSO_KEYCLOAK_AUTHORITY=#<keycloak_url_with_scheme>/realms/<realm_name>
#SOLAR_SSO_KEYCLOAK_CLIENT_ID=#<client identifier>
#SOLAR_SSO_KEYCLOAK_CLIENT_SECRET=#<client secret>
#SOLAR_SSO_KEYCLOAK_GROUP_FILTER_REGEX=#<regex to identify allowed groups>

Disabling pushed authorization requests

The environment variable SOLAR_SSO_KEYCLOAK_DISABLE_PUSHED_AUTHORIZATION determines whether to disable Keycloak pushed authorization requests.

By default, this is false.

You would set this to true to troubleshoot Keycloak authentication issues.

Built-in entity types

Tonic Textual's built-in models identify a range of sensitive values, such as:

  • Locations and addresses

  • Names of people and organizations

  • Identifiers and account numbers

The built-in entity types are:

| Entity type name | Identifier (for API) | Description |
| --- | --- | --- |
| CC Exp | CC_EXP | The expiration date of a credit card. |
| Credit Card | CREDIT_CARD | A credit card number. |
| CVV | CVV | The card verification value for a credit card. |
| Date Time | DATE_TIME | A date or timestamp. |
| DOB | DOB | A person's date of birth. |
| Email Address | EMAIL_ADDRESS | An email address. |
| Event | EVENT | The name of an event. |
| Gender Identifier | GENDER_IDENTIFIER | An identifier of a person's gender. |
| Healthcare Identifier | HEALTHCARE_ID | An identifier associated with healthcare, such as a patient number. |
| IBAN Code | IBAN_CODE | An international bank account number used to identify an overseas bank account. |
| IP Address | IP_ADDRESS | An IP address. |
| Language | LANGUAGE | The name of a spoken language. |
| Law | LAW | A title of a law. |
| Location | LOCATION | A value related to a location. Can include any part of a mailing address. |
| Occupation | OCCUPATION | A job title or profession. |
| Street Address | LOCATION_ADDRESS | A street address. |
| City | LOCATION_CITY | The name of a city. |
| State | LOCATION_STATE | A state name or abbreviation. |
| Zip | LOCATION_ZIP | A postal code. |
| Country | LOCATION_COUNTRY | The name of a country. |
| Full Mailing Address | LOCATION_COMPLETE_ADDRESS | A full postal address. By default, the entity type handling option for this entity type is Off. |
| Medical License | MEDICAL_LICENSE | The identifier of a medical license. |
| Money | MONEY | A monetary value. |
| Given Name | NAME_GIVEN | A given name or first name. |
| Family Name | NAME_FAMILY | A family name or surname. |
| NRP | NRP | A nationality, religion, or political group. |
| Numeric Identifier | NUMERIC_PII | A numeric value that acts as an identifier. |
| Numeric Value | NUMERIC_VALUE | A numeric value. |
| Organization | ORGANIZATION | The name of an organization. |
| Password | PASSWORD | A password used for authentication. |
| Person Age | PERSON_AGE | The age of a person. |
| Phone Number | PHONE_NUMBER | A telephone number. |
| Product | PRODUCT | The name of a product. |
| URL | URL | A URL to a web page. |
| US Bank Number | US_BANK_NUMBER | The routing number of a bank in the United States. |
| US ITIN | US_ITIN | An Individual Taxpayer Identification Number in the United States. |
| US Passport | US_PASSPORT | A United States passport identifier. |
| US SSN | US_SSN | A United States Social Security number. |

Previewing Textual detection and redaction

Required global permission: Use the playground on the Home page

The Tonic Textual Home page provides a tool that allows you to see how Textual detects and replaces values in plain text or an uploaded file.

It also provides a preview of the redaction configuration options, including:

  • How to replace the values for each entity type.

  • Added and excluded values for each entity type.

The Home page displays automatically when you log in to Textual. To return to the Home page from other pages, in the navigation menu, click Home.

Initial view of the Textual Home page

Providing the content to redact

To provide the content to redact, you can enter text directly, or you can upload a file.

Entering text

As you enter or paste text in the Original Content text area, Textual displays the redacted version in the Results panel at the right.

Home page with redacted text

Using one of the samples

Textual also provides sample text options for some common use cases. To populate the text with a sample, under Try a sample, click the sample to use.

Sample text options for the Home page

Uploading a file

You can also redact .txt or .docx files.

To provide a file, either:

  • Drag and drop the file to the Original Content text area.

  • Click the upload prompt, then search for and select the file.

Textual processes the file and then displays the redacted version in the Results panel. The Original Content text area is removed.

Home page with the content of an uploaded file

Clearing the text

To clear the text, click Clear.

Selecting the handling option for an entity type

The handling option indicates how Textual replaces a detected value for an entity type. You can experiment with different handling options.

Note that the updated configuration is only used for the current redacted text. When you clear the text, Textual also clears the configuration.

The options are:

  • Redact - This is the default value. Textual replaces the value with the name of the entity type. For example, the first name John is replaced with NAME_GIVEN.

  • Synthesize - Textual replaces the value with a realistic generated value. For example, the first name John is replaced with the first name Michael. The replacement values are consistent, which means that a given value always has the same replacement. For example, Michael is always the replacement value for John.

  • Off - Textual ignores the value and copies it as is to the Results panel.

To change the handling option for an entity type:

  1. In the Results panel, click an instance of the entity type.

  2. On the configuration panel, click the handling option to use.

Selecting the handling option for an entity type

Textual updates all instances of that entity type to use the selected handling option.

For example, if you change the handling option for NAME_GIVEN to Synthesize, then all instances of first names are replaced with realistic values.

Redacted text with given name values synthesized

Defining added and excluded values

For each entity type in entered text, you can use regular expressions to define added and excluded values.

  • Added values are values that Textual does not detect for an entity type, but that you want to include. For example, you might have values that are specific to your company or industry.

  • Excluded values are values that you do not want Textual to identify as a given entity type.

Note that the configuration is only used for the current redacted text. When you clear the text, Textual also clears the configuration.

Also, this option is only available for text that you enter directly. For an uploaded file, to do additional configuration or to download the file, you must create a dataset from the file.
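
As a rough illustration of how regular expression rules for added and excluded values could adjust a set of detected values, consider the following sketch. It is not Textual's implementation. The function apply_rules, the span tuples, and the choice to exclude only exact span matches are assumptions made for the example.

```python
import re


def apply_rules(text, detected, rules):
    """Return sorted (start, end, entity_type) spans after applying the rules.

    detected: set of (start, end, entity_type) spans from the built-in models.
    rules: list of (entity_type, rule_type, regex) tuples, where rule_type is
    'Include' or 'Exclude'.
    """
    spans = set(detected)
    for entity_type, rule_type, regex in rules:
        for match in re.finditer(regex, text):
            span = (match.start(), match.end(), entity_type)
            if rule_type == "Include":
                spans.add(span)  # value that the models did not detect
            elif rule_type == "Exclude":
                spans.discard(span)  # value that was detected incorrectly
    return sorted(spans)


sample = "Ask Here about ACME-42."
detected = {(4, 8, "NAME_GIVEN")}  # "Here" incorrectly detected as a given name
rules = [
    ("NAME_GIVEN", "Exclude", r"Here"),
    ("PRODUCT", "Include", r"ACME-\d+"),
]
final_spans = apply_rules(sample, detected, rules)
```

In this example, the exclude rule removes the incorrect given name, and the include rule adds a company-specific product code.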

Displaying the configuration panel

To display the configuration panel for added and excluded values, click Fine-tune Results.

The Fine-Tune Results panel displays the list of configured rules for the current text. For each rule, the list includes:

  • The entity type.

  • Whether the rule adds or excludes values.

  • The regular expression to identify the added or excluded values.

Fine-Tune Results panel for added and excluded values

Adding a rule to add or exclude values

On the Fine-Tune Results panel, to create a rule:

  1. Click Add Rule.

Row to define a new rule for added or excluded values

  2. From the entity type dropdown list, select the entity type that the rule applies to.

  3. From the rule type dropdown list:

    • If the rule adds values, then select Include.

    • If the rule excludes values, then select Exclude.

  4. In the regular expression field, provide the regular expression to use to identify the values to add or exclude.

  5. To save the rule, click the save icon.

Editing a rule

To edit a rule:

  1. On the Fine-Tune Results panel, click the edit icon for the rule.

  2. Update the configuration.

  3. Click the save icon.

Deleting a rule

On the Fine-Tune Results panel, to delete a rule, click its delete icon.

Creating a dataset from an uploaded file

From an uploaded file, you can create a dataset that contains the file.

You can then provide additional configuration, such as added and excluded values, and download the redacted file.

To create a dataset from an uploaded file:

  1. Click Download.

  2. Click Create a Dataset.

Textual displays the dataset details for the new dataset. The dataset name is Playground Dataset <number>, where the number reflects the number of datasets that were created from the Home page.

The dataset contains the uploaded file.

Viewing and copying the request code

When Textual generates the redacted version of the text, it also generates the corresponding API request. The request includes the entity type configuration.

To view the API request code, click Show Code.

Code to create the redaction request, including the entity type handling and added and excluded values

To hide the code, click Hide Code.

Selecting the request code type

On the code panel:

  • The Python tab contains the Python version of the request.

  • The cURL tab contains the cURL version of the request.

Copying the request code

To copy the currently selected version of the request code, click Copy Code.

Enabling and using additional LLM processing of detected entities

For entered text on the Home page, Textual offers an option to use our custom large language model (LLM) to synthesize accurate replacements. When you use this option, the following information is sent to the LLM:

  • The detected entity values.

  • The text that surrounds each value.

The LLM processing is not available for uploaded files.

It is also limited to text that contains 100 or fewer words.

Textual's LLM functionality is run only on our cloud and does not use any third parties.

About the LLM processing

The LLM processing is intended to improve the detection and the replacement values. The LLM:

  1. Groups entities based on whether they refer to the same thing, concept, or person. The grouping is only done within each entity type. For example, Lyon the person and Lyon the city are never grouped together.

  2. Chooses a representative value for each group. For example, if the content includes the names Will, William, and W.I.L.L, the LLM processing chooses William as the representative value, because it's the most complete form of the name.

  3. Sends the representative value to our standard, non-LLM, synthesis generators.

  4. Gets the replacement value from the generators, and then formats it to match the original format. For example, because Will is replaced with Rob, W.I.L.L becomes R.O.B.
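
Steps 2 and 4 above can be approximated with simple string rules. The sketch below is only an illustration of the idea; in Textual, the LLM performs these steps. The functions normalize, choose_representative, and format_like are hypothetical, and the sketch does not cover grouping or synthesis.

```python
def normalize(name):
    """Strip separators so that W.I.L.L and Will both compare as 'will'."""
    return "".join(ch for ch in name if ch.isalpha()).lower()


def choose_representative(variants):
    """Pick the most complete form: the variant with the longest normalized text."""
    return max(variants, key=lambda v: len(normalize(v)))


def format_like(original, replacement):
    """Shape the replacement to follow the formatting of the original value."""
    letters = normalize(replacement)
    if "." in original:  # dotted initials such as W.I.L.L
        return ".".join(letters.upper())
    if original.isupper():
        return letters.upper()
    if original.islower():
        return letters
    return letters.capitalize()


representative = choose_representative(["Will", "William", "W.I.L.L"])
dotted = format_like("W.I.L.L", "Rob")
```

Here, representative is William, and the dotted original W.I.L.L turns the replacement Rob into R.O.B.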

Making the LLM processing available

To enable the LLM processing, set the environment variable ENABLE_EXPERIMENTAL_SYNTHESIS to True. If this is not set to true, then the LLM processing does not work.

You must also set up the Solar.LLM container.

Configuring the Solar.LLM container

To configure the container, you can use the following Docker Compose content as a reference:

services:
  textual-llm:
    image: textual-llm:[textual-version-here]
    container_name: textual-llm
    volumes:
      - llm-models:/app/models
    ports:
      - "11443:11443"
    secrets:
      - llm_aws_key_id
      - llm_aws_access_key
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    restart: unless-stopped
    networks:
      - llm-network

volumes:
  llm-models:

networks:
  llm-network:
    driver: bridge

secrets:
  llm_aws_key_id:
    environment: "LLM_AWS_KEY_ID"
  llm_aws_access_key:
    environment: "LLM_AWS_ACCESS_KEY"

The AWS keys are used to download our custom models. To get a copy of the keys, contact your Tonic.ai support representative.

Enabling the LLM processing for entered text

After you enter text in the Original Content panel, to enable the LLM processing, in the Results panel, click Use an LLM to perform AI synthesis.

You cannot use this option for text that contains more than 100 words.

When you clear the text, Textual reverts to the default processing.

Processing with the SDK

In the Python SDK, to use LLM synthesis, call the llm_synthesis function.

Record and review redaction requests

Required global permission: Use the Request Explorer

When you use the redact method to redact a plain text string, you can also choose to record the request.

The recorded requests are encrypted.

When you make the request, you specify the number of hours to keep the recorded request. After that amount of time elapses, the request is completely purged. Recorded requests are never kept more than 720 hours, regardless of the configured retention time.

From the Request Explorer, you can review your recorded requests to check the results and assess the quality of the redaction. You can also test changes to the redaction configuration.

You cannot view requests from other users.

Recording a redaction request

To record a redaction request, you include the record_options argument:

record_options = RecordApiRequestOptions(record=<boolean>, retention_time_in_hours=<number of hours>, tags=["tag name"])

The record_options argument includes the following parameters:

  • record - Whether to record the request. The default is False. To record the request, set record to True.

  • retention_time_in_hours - The number of hours to preserve the recorded request. The default is 1. After the retention time elapses, the request is purged completely.

  • tags - A list of tags to assign to the request. The tags are mostly intended to make it easier to search for requests on the Request Explorer page.
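
To make the retention rules concrete, the following sketch computes when a recorded request would be purged. The function purge_time is a hypothetical helper and is not part of the SDK. It applies the default of 1 hour and the 720-hour maximum that are described in this topic.

```python
from datetime import datetime, timedelta, timezone

MAX_RETENTION_HOURS = 720  # upper limit stated in this guide


def purge_time(recorded_at, retention_time_in_hours=1):
    """Return when a recorded request is purged, applying the 720-hour cap."""
    hours = min(retention_time_in_hours, MAX_RETENTION_HOURS)
    return recorded_at + timedelta(hours=hours)


recorded = datetime(2024, 1, 1, tzinfo=timezone.utc)
default_purge = purge_time(recorded)
capped_purge = purge_time(recorded, retention_time_in_hours=1000)
```

A request that asks for 1000 hours is still purged after 720 hours, which is 30 days after it was recorded.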

Viewing the list of recorded requests

The Request Explorer page in Textual contains the list of requests that you recorded and that are not yet purged. You cannot view requests from other users.

To display the Request Explorer page, in the Textual navigation bar, click Request Explorer.

For each request, the list includes:

  • A 255-character preview of the text that was sent for redaction.

  • The tags assigned to the request.

  • The date when the request will be purged.

Request Explorer page for redaction requests from the SDK

Filtering the requests

You can search for a request based on text that is contained in the redacted text, and by the tags that you assigned to the request.

To search by text from the string, in the search field, begin to type the text.

To search by an assigned tag, in the search field, type tags: followed by the tag to search for.

Previewing the request results

From the request list, to view the results of a request, click the request row.

By default, the preview uses Identification view. For each detected entity, Identification view displays the value and the entity type.

Identification view of the redaction preview

To instead display only the replacement value, which by default is the entity type, click Replacement.

Replacement view of the redaction preview

Testing changes to the redaction configuration

From the preview, you can test how the results change when you:

  • Change the handling option for entity types.

  • Add and exclude values for entity types.

Displaying the Edit Request panel

To display the edit panel, from the request preview page, click Edit.

The Edit Request panel displays the full list of the available entity types.

Edit Request panel with the list of entity types

Changing entity type handling options

You can change how Textual handles detected entity values for each entity type.

Note that the handling option changes are not saved when you close the preview and return to the requests list.

Available handling options

The handling options are:

  • Off - Indicates to ignore values for this entity type.

  • Redact - This is the default option. Indicates to replace each value with a token that represents the entity type.

  • Synthesize - Indicates to replace each value with a realistic replacement value.

Changing the handling option for a single entity type

To change the handling option for a single entity type, either:

  • Click the handling option value for the entity type, then select the handling option.

Handling option dropdown for an entity type on the Request Explorer preview
  • Click the entity type, then under Generator, click the handling option.

Generator panel to select the entity type handling option

Selecting the same handling option for all entity types

To select the same handling option for all of the entity types:

  1. Click Bulk Edit.

  2. From the Bulk Edit dropdown list, select the handling option.

Bulk Edit dropdown to set the handling option for all entity types

Configuring added and excluded values for an entity type

To configure added and excluded values for an entity type, click the entity type.

The Edit Request panel expands to display the Add to detection and Exclude from detection lists.

  • You use the Add to detection list to configure regular expressions to identify additional values to detect as the selected entity type.

  • You use the Exclude from detection list to configure regular expressions to identify values to not detect as the selected entity type.

Note that the added and excluded values are not saved when you close the preview and return to the requests list.
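Before you enter a regular expression in either list, it can help to check it against a few sample values. The following Python sketch is one way to do that locally. The employee ID format and both patterns are hypothetical. The sketch also assumes whole-value matching with Python's re module, which might differ from how Textual evaluates the expressions.

```python
import re

# Hypothetical pattern for the Add to detection list:
# an employee ID such as "EMP-004512".
ADD_PATTERN = re.compile(r"EMP-\d{6}")

# Hypothetical pattern for the Exclude from detection list:
# a known test value that should not be detected as a credit card number.
EXCLUDE_PATTERN = re.compile(r"41111111111")


def matches_whole_value(pattern: re.Pattern, value: str) -> bool:
    """Return True when the pattern matches the entire candidate value."""
    return pattern.fullmatch(value) is not None


samples = ["EMP-004512", "EMP-12", "41111111111", "4111111111111111"]
added = [s for s in samples if matches_whole_value(ADD_PATTERN, s)]
excluded = [s for s in samples if matches_whole_value(EXCLUDE_PATTERN, s)]
print("added:", added)
print("excluded:", excluded)
```

If an expression matches more sample values than you expect, tighten it before you add it to the list.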

Edit Request panel with the Add to detection and Exclude from detection lists

Creating a regular expression for an added or excluded value

To create a regular expression for added or excluded values:

  1. Click the Add regex option for that list.

  2. In the field, provide a regular expression to identify values to add or exclude.

Field to create an added or excluded value regular expression

  3. Press Enter.

Saved regular expression for a value

Editing a regular expression for added or excluded values

To edit a regular expression:

  1. Click the edit icon for the expression.

  2. In the field, edit the expression.

Edit field for a regular expression

  3. Click the save icon.

Deleting a regular expression for added or excluded values

To delete a regular expression, click the delete icon for that expression.

Viewing whether an entity type has added or excluded values

When an entity type has added values, the added values icon displays for that entity type.

Added values icon for an entity type in the Request Explorer

When an entity type has excluded values, the excluded values icon displays for that entity type.

Excluded values icon for an entity type in the Request Explorer

Replaying the request

To replay the request based on the current configuration, click Replay.

Replay button for a previewed request

When you replay the request, in addition to the Identification and Replacement options, you use the Diff toggle to indicate whether to compare the original and new results.

For our example, we made the following changes to the configuration:

  • For Given Name and Family Name, changed the handling option to Synthesize.

  • For Credit Card, indicated to ignore the value 41111111111.

Replayed results views with the Diff toggle off

When the Diff toggle is in the off position, Identification view only reflects changes to the added and excluded values.

In our example, we configured 41111111111 to not be detected as a credit card number. In the replayed request, it is instead detected as a numeric value.

Identification view of a replayed request with Diff off

Replacement view reflects both the added and excluded values and the changes to the handling option.

For our example, in addition to the entity type change for the credit card number 41111111111, the given and family names are now realistic replacement values instead of the entity types.

Replacement view of a replayed request with Diff off

Replayed results views with the Diff toggle on

When you set the Diff toggle to the on position, the preview displays the original content to the left, and the modified content to the right.

In Identification view, you can see the changes to the entity detection based on the added and excluded values.

Identification view of a replayed request with Diff on

In Replacement view, you can also see the changes to the selected handling options for the entity types.

Replacement view of a replayed request with Diff on

Clearing all of the configuration changes

To clear all of the regular expressions for all of the entity types, click Remove Changes.

Remove Changes button for a previewed request

Redact individual strings

Required global permission: Use the API to parse or redact a text string

Before you perform these tasks, remember to instantiate the SDK client.

You can use the Tonic Textual SDK to redact individual strings, including:

  • Plain text strings

  • JSON content

  • XML content

  • HTML content

For a text string, you can also request synthesized values from a large language model (LLM).

The redaction request can include the handling configuration for entity types.

The redaction response includes the redacted or synthesized content and details about the detected entity values.

Redact a plain text string

To send a plain text string for redaction, use textual.redact:

For example:

The redact call provides an option to record the request, which allows you to preview the results in the Textual application. For more information, go to Record and review redaction requests.

Redact multiple plain text strings

To send multiple plain text strings for redaction, use textual.redact_bulk:

For example:

Redact JSON content

To send JSON content for redaction, use textual.redact_json. You can send the JSON content as a JSON string or a Python dictionary.

redact_json ensures that only the values are redacted. It ignores the keys.
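To illustrate what "only the values" means, here is a minimal sketch that walks a nested structure and applies a function to the string leaf values, while leaving the keys untouched. It is not the Textual implementation. transform_values and the placeholder function are hypothetical names used only for illustration, and in this simplified sketch numeric values pass through unchanged.

```python
from typing import Any, Callable


def transform_values(node: Any, fn: Callable[[str], str]) -> Any:
    """Apply fn to every string leaf value; keys are never passed to fn."""
    if isinstance(node, dict):
        return {key: transform_values(value, fn) for key, value in node.items()}
    if isinstance(node, list):
        return [transform_values(item, fn) for item in node]
    if isinstance(node, str):
        return fn(node)
    return node  # numbers, booleans, and None pass through unchanged


# The key "John" in the nested object is left as-is; only values change.
doc = {"first": "John", "tags": ["John", 7], "nested": {"John": "Memphis"}}
masked = transform_values(doc, lambda s: "[VALUE]")
print(masked)
```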

Basic JSON redaction example

Here is a basic example of a JSON redaction request:

It produces the following JSON output:

Specifying entity types for specific JSON paths

When you redact a JSON string, you can optionally assign specific entity types to selected JSON paths.

To do this, you include the jsonpath_allow_lists parameter. Each entry consists of an entity type and a list of JSON paths for which to always use that entity type. Each JSON path must point to a simple string or numeric value.

The specified entity type overrides both the detected entity type and any added or excluded values.
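Because each JSON path must point to a simple string or numeric value, you might want to check your paths against a sample document before you send the request. The following sketch shows one way to do that. It only handles plain dotted paths such as $.key1, and resolve_simple_path and check_allow_lists are hypothetical helpers, not part of the Textual SDK.

```python
import json
from typing import Any, Dict, List


def resolve_simple_path(document: Any, path: str) -> Any:
    """Resolve a dotted path such as '$.address.city' (no wildcards or indexes)."""
    if not path.startswith("$."):
        raise ValueError(f"unsupported path: {path}")
    node = document
    for key in path[2:].split("."):
        node = node[key]
    return node


def check_allow_lists(document: Any, allow_lists: Dict[str, List[str]]) -> List[str]:
    """Return the paths that do not point to a simple string or numeric value."""
    problems = []
    for paths in allow_lists.values():
        for path in paths:
            value = resolve_simple_path(document, path)
            # bool is a subclass of int in Python, so exclude it explicitly.
            if isinstance(value, bool) or not isinstance(value, (str, int, float)):
                problems.append(path)
    return problems


document = json.loads('{"key1": "Ex123", "key2": {"inner": "Johnson"}}')
ok_result = check_allow_lists(document, {"PHONE_NUMBER": ["$.key1"]})
bad_result = check_allow_lists(document, {"NAME_FAMILY": ["$.key2"]})
print(ok_result)
print(bad_result)
```

An empty list means that every path in the dictionary points to a simple value. In the second call, $.key2 points to an object, so it is reported.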

In the following example, the value of the key1 node is always treated as a telephone number:

It produces the following redacted output:

Redact XML content

To send an XML string for redaction, use textual.redact_xml.

redact_xml ensures that only the values are redacted. It ignores the XML markup.

For example:

It produces the following XML output:

Redact HTML content

To send an HTML string for redaction, use textual.redact_html.

redact_html ensures that only the values are redacted. It ignores the HTML markup.

For example:

It produces the following HTML output:

Using an LLM to generate synthesized values

You can also request synthesized values from a large language model (LLM).

When you use this process, Textual first identifies the sensitive values in the text. It then sends the value locations and redacted values to the LLM. For example, if Textual identifies a product name, it sends the location and the redacted value PRODUCT to the LLM. Textual does not send the original values to the LLM.

The LLM then generates realistic synthesized values of the appropriate value types.

To send text to an LLM, use textual.llm_synthesis:

For example:

Before you can use this endpoint, you must enable additional LLM processing. The additional processing sends the values and surrounding text to the LLM. For an overview of the LLM processing and how to enable it, go to Enabling and using additional LLM processing of detected entities.

Format of the redaction and synthesis response

The response provides the redacted or synthesized version of the string, and the list of detected entity values.

For each redacted item, the response includes:

  • The location of the value in the original text (start and end)

  • The location of the value in the redacted version of the string (new_start and new_end)

  • The entity type (label)

  • The original value (text)

  • The replacement value (new_text). new_text is null in the following cases:

    • The entity type is ignored

    • The response is from llm_synthesis

  • A score to indicate confidence in the detection and redaction (score)

  • The detected language for the value (language)

  • For responses from textual.redact_json, the JSON path to the entity in the original document (json_path)

  • For responses from textual.redact_xml, the XPath to the entity in the original XML document (xml_path)
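As a quick check of how the location fields relate to the text, the following sketch slices the original and redacted strings with the offsets from the textual.redact sample on this page. The entity is written as a plain Python dictionary for illustration. How the SDK exposes these fields on its response objects is not shown here.

```python
original = "Contact Tonic AI with questions"
redacted = "Contact ORGANIZATION_EPfC7XZUZ with questions"

# One detected entity, copied from the sample redaction response.
entity = {
    "start": 8, "end": 16, "new_start": 8, "new_end": 30,
    "label": "ORGANIZATION", "text": "Tonic AI",
    "new_text": "[ORGANIZATION]", "score": 0.85, "language": "en",
}

# start/end index into the original text; new_start/new_end index into
# the redacted text. In the sample, the end offsets are exclusive.
original_span = original[entity["start"]:entity["end"]]
redacted_span = redacted[entity["new_start"]:entity["new_end"]]

# How far this replacement shifts the text that follows it.
shift = (entity["new_end"] - entity["new_start"]) - (entity["end"] - entity["start"])

print(original_span)
print(redacted_span)
print(shift)
```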

Built-in permission sets and available permissions

Tonic Textual comes with a set of built-in global, pipeline, and dataset permission sets. You cannot edit or delete the built-in permission sets.

Each permission set is assigned Textual permissions. When a new permission is added to Textual, it is also added to the appropriate built-in permission sets.

Built-in global permission sets

Textual comes with the following built-in global permission sets:

  • Admin - Provides complete access to all global permissions. The Admin permission set automatically receives any new global permissions.

  • Admin (Environment) - For self-hosted only. Identical to the Admin permission set. Only assigned to users and groups that are listed in the value of the TEXTUAL_ADMINISTRATORS environment variable.

  • General User - Allows users to create datasets, pipelines, and custom entity types. Users can also use the Home page and the Request Explorer. By default, the General User permission set is assigned to all Textual users and SSO groups.

Available global permissions

The following tables list the available global permissions, and indicate how the permissions apply to the built-in global permission sets.

Textual API and requests

Permission | General User | Admin and Admin (Environment)
Create an API key | ✔️ | ✔️
Use the playground on the Home page | ✔️ | ✔️
Use the API to parse or redact a text string | ✔️ | ✔️
Use the Request Explorer | ✔️ | ✔️

Configuration management

Permission | General User | Admin and Admin (Environment)
Create custom entity types | ✔️ | ✔️
Edit any custom entity type |  | ✔️

Datasets and pipelines

Permission | General User | Admin and Admin (Environment)
Create datasets | ✔️ | ✔️
Manage access to datasets |  | ✔️
Create pipelines | ✔️ | ✔️
Manage access to pipelines |  | ✔️
View all datasets |  | ✔️
View all pipelines |  | ✔️

Users and permissions

Permission | General User | Admin and Admin (Environment)
Manage custom permission sets |  | ✔️
Manage access to global permission sets |  | ✔️
View users and groups | ✔️ | ✔️
Manage users and user groups |  | ✔️
View usage metrics |  | ✔️

Built-in dataset permission sets

Textual comes with the following built-in permission sets:

  • Editor - A dataset Editor has full access to view, edit, and share access to a dataset. When you create a dataset, you are automatically granted the Editor permission set for that dataset.

  • Viewer - A dataset Viewer only has access to view the configuration and to preview and download the results.

Available dataset permissions

The following tables list the available dataset permissions, and indicate how the permissions apply to the built-in dataset permission sets.

General dataset management

Permission | Viewer | Editor
View dataset settings | ✔️ | ✔️
Edit dataset settings |  | ✔️
Share dataset access |  | ✔️
Delete a dataset |  | ✔️

Dataset file management

Permission | Viewer | Editor
Upload files to a dataset |  | ✔️
Start a scan of dataset files | ✔️ | ✔️
Delete files from a dataset |  | ✔️
Preview redacted dataset files | ✔️ | ✔️
Download redacted dataset files | ✔️ | ✔️

Built-in pipeline permission sets

Textual comes with the following built-in pipeline permission sets:

  • Editor - A pipeline Editor has full access to view, edit, and share access to a pipeline. When you create a pipeline, you are automatically granted the Editor permission set for that pipeline.

  • Viewer - A pipeline Viewer only has access to view the configuration and download results.

Available pipeline permissions

The following tables list the available pipeline permissions, and indicate how the permissions apply to the built-in pipeline permission sets.

General pipeline management

Permission | Viewer | Editor
View pipeline settings | ✔️ | ✔️
Edit pipeline settings |  | ✔️
Share pipeline access |  | ✔️
Delete a pipeline |  | ✔️

Pipeline file management

Permission | Viewer | Editor
Manage the pipeline file list |  | ✔️
Preview pipeline files | ✔️ | ✔️
Start pipeline runs | ✔️ | ✔️

redaction_response = textual.redact("""<text of the string>""")
redaction_response.describe()
redaction_response = textual.redact("""Contact Tonic AI with questions""")
redaction_response.describe()

Contact ORGANIZATION_EPfC7XZUZ with questions
    
{"start": 8, "end": 16, "new_start": 8, "new_end": 30, "label": "ORGANIZATION", "text": "Tonic AI", "new_text": "[ORGANIZATION]", "score": 0.85, "language": "en"}
bulk_response = textual.redact_bulk([<List of strings>])
bulk_response = textual.redact_bulk(["Tonic.ai was founded in 2018", "John Smith is a person"])
bulk_response.describe()

[ORGANIZATION_5Ve7OH] was founded in [DATE_TIME_DnuC1]

{"start": 0, "end": 5, "new_start": 0, "new_end": 21, "label": "ORGANIZATION", "text": "Tonic", "score": 0.9, "language": "en", "new_text": "[ORGANIZATION]"}
{"start": 21, "end": 25, "new_start": 37, "new_end": 54, "label": "DATE_TIME", "text": "2018", "score": 0.9, "language": "en", "new_text": "[DATE_TIME]"}

[NAME_GIVEN_dySb5] [NAME_FAMILY_7w4Db3] is a person

{"start": 0, "end": 4, "new_start": 0, "new_end": 18, "label": "NAME_GIVEN", "text": "John", "score": 0.9, "language": "en", "new_text": "[NAME_GIVEN]"}
{"start": 5, "end": 10, "new_start": 19, "new_end": 39, "label": "NAME_FAMILY", "text": "Smith", "score": 0.9, "language": "en", "new_text": "[NAME_FAMILY]"}
json_redaction = textual.redact_json(<JSON string or Python dictionary>)
import json

# Build the JSON content as a Python dictionary
d=dict()
d['person']={'first':'John','last':'OReilly'}
d['address']={'city': 'Memphis', 'state':'TN', 'street': '847 Rocky Top', 'zip':1234}
d['description'] = 'John is a man that lives in Memphis.  He is 37 years old and is married to Cynthia.'

json_redaction = textual.redact_json(d)

# Pretty-print the redacted JSON
print(json.dumps(json.loads(json_redaction.redacted_text), indent=2))
{
"person": {
    "first": "[NAME_GIVEN]",
    "last": "[NAME_FAMILY]"
},
"address": {
    "city": "[LOCATION_CITY]",
    "state": "[LOCATION_STATE]",
    "street": "[LOCATION_ADDRESS]",
    "zip": "[LOCATION_ZIP]"
},
"description": "[NAME_GIVEN] is a man that lives in [LOCATION_CITY].  He is [DATE_TIME] and is married to [NAME_GIVEN]."
}
jsonpath_allow_lists={'entity_type':['JSON Paths']}
response = textual.redact_json('{"key1":"Ex123", "key2":"Johnson"}', jsonpath_allow_lists={'PHONE_NUMBER':['$.key1']})
{"key1":"[PHONE_NUMBER]","key2":"[NAME_FAMILY]"}
xml_string = '''<?xml version="1.0" encoding="UTF-8"?>
    <!-- This XML document contains sample PII with namespaces and attributes -->
    <PersonInfo xmlns="http://www.example.com/default" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:contact="http://www.example.com/contact">
        <!-- Personal Information with an attribute containing PII -->
        <Name preferred="true" contact:userID="john.doe123">
            <FirstName>John</FirstName>
            <LastName>Doe</LastName>He was born in 1980.</Name>

        <contact:Details>
            <!-- Email stored in an attribute for demonstration -->
            <contact:Email address="[email protected]"/>
            <contact:Phone type="mobile" number="555-6789"/>
        </contact:Details>

        <!-- SSN stored as an attribute -->
        <SSN value="987-65-4321" xsi:nil="false"/>
        <data>his name was John Doe</data>
    </PersonInfo>'''

response = textual.redact_xml(xml_string)

redacted_xml = response.redacted_text
<?xml version="1.0" encoding="UTF-8"?><!-- This XML document contains sample PII with namespaces and attributes -->\n<PersonInfo xmlns="http://www.example.com/default" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:contact="http://www.example.com/contact"><!-- Personal Information with an attribute containing PII --><Name preferred="true" contact:userID="[NAME_GIVEN]">[GENDER_IDENTIFIER] was born in [DOB].<FirstName>[NAME_GIVEN]</FirstName><LastName>[NAME_FAMILY]</LastName></Name><contact:Details><!-- Email stored in an attribute for demonstration --><contact:Email address="[EMAIL_ADDRESS]"></contact:Email><contact:Phone type="mobile" number="[PHONE_NUMBER]"></contact:Phone></contact:Details><!-- SSN stored as an attribute --><SSN value="[PHONE_NUMBER]" xsi:nil="false"></SSN><data>[GENDER_IDENTIFIER] name was [NAME_GIVEN] [NAME_FAMILY]</data></PersonInfo>
html_content = """
<!DOCTYPE html>
<html>
    <head>
        <title>John Doe</title>
    </head>
    <body>
        <h1>John Doe</h1>
        <p>John Doe is a person who lives in New York City.</p>
        <p>John Doe's phone number is 555-555-5555.</p>
    </body>
</html>
"""

# Run the redact_html method
redacted_html = textual.redact_html(html_content, generator_config={
            "NAME_GIVEN": "Synthesis",
            "NAME_FAMILY": "Synthesis"
        }) 

print(redacted_html.redacted_text)
<!DOCTYPE html>
<html>
    <head>
        <title>Scott Roley</title>
    </head>
    <body>
        <h1>Scott Roley</h1>
        <p>Scott Roley is a person who lives in [LOCATION_CITY].</p>
        <p>Scott Roley's phone number is [PHONE_NUMBER].</p>
    </body>
</html>
raw_synthesis = textual.llm_synthesis("Text of the string")
raw_synthesis = textual.llm_synthesis("My name is John, and today I am demoing Textual, a software product created by Tonic")
raw_synthesis.describe()

My name is John, and on Monday afternoon I am demoing Widget Pro, a software product created by Initech Enterprises.
{"start": 11, "end": 15, "new_start": 11, "new_end": 15, "label": "NAME_GIVEN", "text": "John", "new_text": null, "score": 0.9, "language": "en"}
{"start": 21, "end": 26, "new_start": 21, "new_end": 40, "label": "DATE_TIME", "text": "today", "new_text": null, "score": 0.85, "language": "en"}
{"start": 40, "end": 47, "new_start": 54, "new_end": 64, "label": "PRODUCT", "text": "Textual", "new_text": null, "score": 0.85, "language": "en"}
{"start": 79, "end": 84, "new_start": 96, "new_end": 115, "label": "ORGANIZATION", "text": "Tonic", "new_text": null, "score": 0.85, "language": "en"}

Language support in Textual

Tonic Textual supports languages in addition to English. Textual automatically detects the language and applies the correct model.

On self-hosted instances, you configure whether to support multiple languages, and can optionally provide auxiliary language models.

Supported languages

Textual can detect values in the following languages:

Name | Code
Afrikaans | af
Albanian | sq
Amharic | am
Arabic | ar
Armenian | hy
Assamese | as
Azerbaijani | az
Basque | eu
Belarusian | be
Bengali | bn
Bengali Romanized |
Bosnian | bs
Breton | br
Bulgarian | bg
Burmese | my
Burmese (alternative) |
Catalan | ca
Chinese (Simplified) | zh
Chinese (Traditional) | zh
Croatian | hr
Czech | cs
Danish | da
Dutch | nl
English | en
Esperanto | eo
Estonian | et
Filipino | tl
Finnish | fi
French | fr
Galician | gl
Irish | ga
Georgian | ka
German | de
Greek | el
Gujarati | gu
Hausa | ha
Hebrew | he
Hindi | hi
Hindi Romanized |
Hungarian | hu
Icelandic | is
Indonesian | id
Italian | it
Japanese | ja
Javanese | jv
Kannada | kn
Kazakh | kk
Khmer | km
Korean | ko
Kurdish (Kurmanji) | ku
Kyrgyz | ky
Lao | lo
Latin | la
Latvian | lv
Lithuanian | lt
Macedonian | mk
Malagasy | mg
Malay | ms
Malayalam | ml
Marathi | mr
Mongolian | mn
Nepali | ne
Norwegian | no
Oriya | or
Oromo | om
Pashto | ps
Persian | fa
Polish | pl
Portuguese | pt
Punjabi | pa
Romanian | ro
Russian | ru
Sanskrit | sa
Scottish Gaelic | gd
Serbian | sr
Sinhala | si
Sindhi | sd
Slovak | sk
Slovenian | sl
Somali | so
Spanish | es
Sundanese | su
Swahili | sw
Swedish | sv
Tamil | ta
Tamil Romanized |
Telugu | te
Telugu Romanized |
Thai | th
Turkish | tr
Ukrainian | uk
Urdu | ur
Urdu Romanized |
Uyghur | ug
Uzbek | uz
Vietnamese | vi
Welsh | cy
Western Frisian | fy
Xhosa | xh
Yiddish | yi

Self-hosted instances

On a self-hosted instance, you configure whether Textual supports multiple languages.

You can also optionally provide auxiliary language models.

Enabling multi-language support

To enable support for languages other than English, set the environment variable TEXTUAL_MULTI_LINGUAL=true.

The setting is used by the machine learning container.

Providing auxiliary language model assets

You can provide additional language model assets for Textual to use.

By default, Textual looks for model assets in the machine learning container, in /usr/bin/textual/language_models. The default Helm and Docker Compose configurations include the volume mount.

To choose a different location, set the environment variable TEXTUAL_LANGUAGE_MODEL_DIRECTORY. Note that if you change the location, you must also modify your volume mounts.
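When you troubleshoot a self-hosted deployment, it can be useful to print how these two variables resolve in the machine learning container. The following sketch is only an illustration. language_settings is a hypothetical helper, and the way that it parses the values (case-insensitive "true", multi-language support off when the variable is unset) is an assumption rather than documented Textual behavior.

```python
import os

# Default location of the model assets, as described above.
DEFAULT_MODEL_DIR = "/usr/bin/textual/language_models"


def language_settings(env):
    """Summarize the two language-related variables from an environment mapping."""
    multi = env.get("TEXTUAL_MULTI_LINGUAL", "false").strip().lower() == "true"
    model_dir = env.get("TEXTUAL_LANGUAGE_MODEL_DIRECTORY", DEFAULT_MODEL_DIR)
    return {"multi_lingual": multi, "model_directory": model_dir}


# Settings for the current process environment.
print(language_settings(os.environ))

# Settings for an explicit mapping, for example values copied from a
# deployment configuration file.
print(language_settings({"TEXTUAL_MULTI_LINGUAL": "true"}))
```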

For help with installing model assets, contact Tonic.ai support ([email protected]).

Get all datasets

get

Returns all datasets to which the user has access

Path parameters
includeSynthesizePipeline (boolean, required, default: false)

Responses
200

OK

application/json

Gets the dataset by its Id

get

Returns the dataset specified by the datasetId

Path parameters
datasetId (string, required)

Responses
200

OK

application/json

404

The dataset cannot be found

Creates a new dataset

post

Creates a new dataset with the specified configuration. You must specify a unique, non-empty dataset name.

Body
all of (optional)

Responses
200

OK

application/json

400

The dataset name must be specified

409

Dataset name is already in use

Edits a dataset

put

Edits a dataset with the specified configuration.

Query parameters
shouldRescan (boolean, optional)

Body
all of (optional)

Responses
200

OK

application/json

404

The dataset cannot be found

409

Dataset name is already in use

GET /api/Dataset HTTP/1.1
Host: 
Accept: */*
[
  {
    "id": "text",
    "name": "text",
    "datasetGeneratorMetadata": "asdfqwer",
    "generatorSetup": "{\"NAME_GIVEN\":\"Redaction\", \"NAME_FAMILY\":\"Redaction\"}",
    "labelBlockLists": "{\"NAME_FAMILY\": {\"strings\":[],\"regexes\":[\".*\\\\s(disease|syndrom|disorder)\"]}}",
    "labelAllowLists": "{ \"HEALTHCARE_ID\": {\"strings\":[],\"regexes\":[\"[a-z]{2}\\\\d{9}\"]} }",
    "enabledModels": [
      "text"
    ],
    "files": [
      {
        "fileId": "text",
        "fileName": "text",
        "fileType": "text",
        "datasetId": "text",
        "numRows": 1,
        "numColumns": 1,
        "piiTypes": [
          "text"
        ],
        "wordCount": 1,
        "redactedWordCount": 1,
        "uploadedTimestamp": {},
        "fileSource": "Local",
        "processingStatus": "text",
        "processingError": "text",
        "mostRecentCompletedJobId": "text"
      }
    ],
    "lastUpdated": {},
    "docXImagePolicy": "Redact",
    "pdfSignaturePolicy": "Redact",
    "docXCommentPolicy": "Remove",
    "docXTablePolicy": "Redact",
    "fileSource": "Local",
    "customPiiEntityIds": [
      "text"
    ],
    "rescanJobs": [
      {
        "id": "text",
        "status": "text",
        "errorMessages": "text",
        "startTime": {},
        "endTime": {},
        "publishedTime": {},
        "datasetFileId": "text",
        "jobType": "DeidentifyFile"
      }
    ]
  }
]
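In the sample response, generatorSetup, labelBlockLists, and labelAllowLists are JSON documents that are encoded as strings, so a client needs to decode them a second time. The following sketch shows that double decoding on a trimmed copy of one dataset object from the sample. It assumes that real responses use the same encoding as the sample.

```python
import json
import re

# Trimmed copy of one dataset object from the sample response.
# The raw string keeps the backslash escapes exactly as they appear on the wire.
raw = r'''
{
  "name": "text",
  "generatorSetup": "{\"NAME_GIVEN\":\"Redaction\", \"NAME_FAMILY\":\"Redaction\"}",
  "labelAllowLists": "{ \"HEALTHCARE_ID\": {\"strings\":[],\"regexes\":[\"[a-z]{2}\\\\d{9}\"]} }"
}
'''

dataset = json.loads(raw)  # first decode: the response body

# Second decode: the string fields that contain JSON.
generator_setup = json.loads(dataset["generatorSetup"])
allow_lists = json.loads(dataset["labelAllowLists"])

pattern = allow_lists["HEALTHCARE_ID"]["regexes"][0]
print(generator_setup["NAME_GIVEN"])
print(pattern)
print(bool(re.fullmatch(pattern, "ab123456789")))
```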
GET /api/Dataset/{datasetId} HTTP/1.1
Host: 
Accept: */*
{
  "id": "text",
  "name": "text",
  "datasetGeneratorMetadata": "asdfqwer",
  "generatorSetup": "{\"NAME_GIVEN\":\"Redaction\", \"NAME_FAMILY\":\"Redaction\"}",
  "labelBlockLists": "{\"NAME_FAMILY\": {\"strings\":[],\"regexes\":[\".*\\\\s(disease|syndrom|disorder)\"]}}",
  "labelAllowLists": "{ \"HEALTHCARE_ID\": {\"strings\":[],\"regexes\":[\"[a-z]{2}\\\\d{9}\"]} }",
  "enabledModels": [
    "text"
  ],
  "files": [
    {
      "fileId": "text",
      "fileName": "text",
      "fileType": "text",
      "datasetId": "text",
      "numRows": 1,
      "numColumns": 1,
      "piiTypes": [
        "text"
      ],
      "wordCount": 1,
      "redactedWordCount": 1,
      "uploadedTimestamp": {},
      "fileSource": "Local",
      "processingStatus": "text",
      "processingError": "text",
      "mostRecentCompletedJobId": "text"
    }
  ],
  "lastUpdated": {},
  "docXImagePolicy": "Redact",
  "pdfSignaturePolicy": "Redact",
  "docXCommentPolicy": "Remove",
  "docXTablePolicy": "Redact",
  "fileSource": "Local",
  "customPiiEntityIds": [
    "text"
  ],
  "rescanJobs": [
    {
      "id": "text",
      "status": "text",
      "errorMessages": "text",
      "startTime": {},
      "endTime": {},
      "publishedTime": {},
      "datasetFileId": "text",
      "jobType": "DeidentifyFile"
    }
  ]
}
POST /api/Dataset HTTP/1.1
Host: 
Content-Type: application/json
Accept: */*
Content-Length: 15

{
  "name": "text"
}
{
  "id": "text",
  "name": "text",
  "datasetGeneratorMetadata": "asdfqwer",
  "generatorSetup": "{\"NAME_GIVEN\":\"Redaction\", \"NAME_FAMILY\":\"Redaction\"}",
  "labelBlockLists": "{\"NAME_FAMILY\": {\"strings\":[],\"regexes\":[\".*\\\\s(disease|syndrom|disorder)\"]}}",
  "labelAllowLists": "{ \"HEALTHCARE_ID\": {\"strings\":[],\"regexes\":[\"[a-z]{2}\\\\d{9}\"]} }",
  "enabledModels": [
    "text"
  ],
  "files": [
    {
      "fileId": "text",
      "fileName": "text",
      "fileType": "text",
      "datasetId": "text",
      "numRows": 1,
      "numColumns": 1,
      "piiTypes": [
        "text"
      ],
      "wordCount": 1,
      "redactedWordCount": 1,
      "uploadedTimestamp": {},
      "fileSource": "Local",
      "processingStatus": "text",
      "processingError": "text",
      "mostRecentCompletedJobId": "text"
    }
  ],
  "lastUpdated": {},
  "docXImagePolicy": "Redact",
  "pdfSignaturePolicy": "Redact",
  "docXCommentPolicy": "Remove",
  "docXTablePolicy": "Redact",
  "fileSource": "Local",
  "customPiiEntityIds": [
    "text"
  ],
  "rescanJobs": [
    {
      "id": "text",
      "status": "text",
      "errorMessages": "text",
      "startTime": {},
      "endTime": {},
      "publishedTime": {},
      "datasetFileId": "text",
      "jobType": "DeidentifyFile"
    }
  ]
}
PUT /api/Dataset HTTP/1.1
Host: 
Content-Type: application/json
Accept: */*
Content-Length: 507

{
  "id": "text",
  "name": "text",
  "generatorSetup": "{\"NAME_GIVEN\":\"Redaction\", \"NAME_FAMILY\":\"Redaction\"}",
  "datasetGeneratorMetadata": {
    "ANY_ADDITIONAL_PROPERTY": {}
  },
  "labelBlockLists": "{\"NAME_FAMILY\": {\"strings\":[],\"regexes\":[\".*\\\\s(disease|syndrom|disorder)\"]}}",
  "labelAllowLists": "{ \"HEALTHCARE_ID\": {\"strings\":[],\"regexes\":[\"[a-z]{2}\\\\d{9}\"]} }",
  "enabledModels": [
    "text"
  ],
  "docXImagePolicy": "Redact",
  "pdfSignaturePolicy": "Redact",
  "docXCommentPolicy": "Remove",
  "docXTablePolicy": "Redact"
}
{
  "id": "text",
  "name": "text",
  "datasetGeneratorMetadata": "asdfqwer",
  "generatorSetup": "{\"NAME_GIVEN\":\"Redaction\", \"NAME_FAMILY\":\"Redaction\"}",
  "labelBlockLists": "{\"NAME_FAMILY\": {\"strings\":[],\"regexes\":[\".*\\\\s(disease|syndrom|disorder)\"]}}",
  "labelAllowLists": "{ \"HEALTHCARE_ID\": {\"strings\":[],\"regexes\":[\"[a-z]{2}\\\\d{9}\"]} }",
  "enabledModels": [
    "text"
  ],
  "files": [
    {
      "fileId": "text",
      "fileName": "text",
      "fileType": "text",
      "datasetId": "text",
      "numRows": 1,
      "numColumns": 1,
      "piiTypes": [
        "text"
      ],
      "wordCount": 1,
      "redactedWordCount": 1,
      "uploadedTimestamp": {},
      "fileSource": "Local",
      "processingStatus": "text",
      "processingError": "text",
      "mostRecentCompletedJobId": "text"
    }
  ],
  "lastUpdated": {},
  "docXImagePolicy": "Redact",
  "pdfSignaturePolicy": "Redact",
  "docXCommentPolicy": "Remove",
  "docXTablePolicy": "Redact",
  "fileSource": "Local",
  "customPiiEntityIds": [
    "text"
  ],
  "rescanJobs": [
    {
      "id": "text",
      "status": "text",
      "errorMessages": "text",
      "startTime": {},
      "endTime": {},
      "publishedTime": {},
      "datasetFileId": "text",
      "jobType": "DeidentifyFile"
    }
  ]
}
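The requests above can be sketched in Python with only the standard library. This is an illustration under assumptions, not an official client. The host name is a placeholder, and passing the API key in the Authorization header is an assumption that you should confirm for your instance. build_request is a hypothetical helper. The sketch only builds the requests and does not send them.

```python
import json
import urllib.request

BASE_URL = "https://textual.example.com"  # hypothetical host
API_KEY = "<your API key>"  # placeholder; the header usage is an assumption


def build_request(method, path, body=None):
    """Build a request for a dataset endpoint without sending it."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    headers = {"Accept": "application/json", "Authorization": API_KEY}
    if data is not None:
        headers["Content-Type"] = "application/json"
    return urllib.request.Request(
        BASE_URL + path, data=data, headers=headers, method=method
    )


list_request = build_request("GET", "/api/Dataset")
create_request = build_request("POST", "/api/Dataset", {"name": "text"})

print(list_request.get_method(), list_request.full_url)
print(create_request.get_method(), create_request.data)

# To send a request, pass it to urllib.request.urlopen(), for example:
#   with urllib.request.urlopen(list_request) as response:
#       datasets = json.load(response)
```

urlopen raises urllib.error.HTTPError for the 400, 404, and 409 responses listed above, so wrap the call in a try block if you need to handle those cases.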

Configuring synthesis options

Required dataset permission: Edit dataset settings

When Textual generates replacement values, those values are always consistent. Consistency means that the same original value always produces the same replacement value. You can also enable consistency with some Tonic Structural output values.

For all entity types, you can specify the replacements for specific values.

Some entity types include type-specific options for how Tonic Textual generates the replacement values.

For custom entity types, you can select the generator to use.

You can also set whether to use the new synthesis process.

Enabling consistency with Tonic Structural

If you also use Tonic Structural, then you can configure Textual to enable selected synthesized values to be consistent between the two applications.

For example, a given source telephone number can produce the same replacement telephone number in both Structural and Textual.

To enable this consistency, you configure a statistics seed value as the value of the Textual environment variable SOLAR_STATISTICS_SEED. A statistics seed is a signed 32-bit integer.

The value must match a Structural statistics seed, either:

  • The value of the Structural environment setting TONIC_STATISTICS_SEED.

  • A statistics seed configured for an individual Structural workspace.

The current statistics seed value is displayed on the System Settings page.
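Before you set the environment variable, it can help to confirm that the candidate seed fits in a signed 32-bit integer. The following Python sketch shows one way to do that. The helper name is hypothetical and is not part of Textual or Structural.

```python
# Hypothetical helper, not part of Textual: checks that a candidate
# statistics seed fits in a signed 32-bit integer.
INT32_MIN = -2**31
INT32_MAX = 2**31 - 1


def is_valid_statistics_seed(value: str) -> bool:
    """Return True if value parses as an integer in the signed 32-bit range."""
    try:
        seed = int(value.strip())
    except ValueError:
        return False
    return INT32_MIN <= seed <= INT32_MAX


print(is_valid_statistics_seed("1234567"))     # within the range
print(is_valid_statistics_seed("2147483648"))  # one above the maximum
```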

Using the new synthesis process

Textual has developed an updated synthesis process that is currently implemented for the following entity types:

  • URLs

  • Names

  • Custom entity types

In particular, the new synthesis process improves the display of the synthesized values in PDF files. The values better match the available space and the original font.

To configure whether to use the new process:

  1. On the dataset details page, click Settings.

  2. On the Dataset Settings page, under PDF Settings, the New PDF synthesis mode (experimental) setting determines which process to use. To use the new process, toggle the setting to the on position.

Dataset Settings page with the new synthesis option
  3. Click Save Dataset.

Providing specific replacement values

For all entity types, you can provide a list of specific replacement values.

For example, for the Given Name entity type, you might indicate to always replace John with Michael and Mary with Melissa.

For the remaining values, Textual generates the replacement values.

To display the synthesis options for an entity type, click Options.

Synthesis options for an entity type

In the text area, provide a JSON object that maps the original values to the replacement values. For example:

{
  "French": "German",
  "English": "Japanese"
}

With the above configuration for the Language entity type:

  • All instances of French are changed to German.

  • All instances of English are changed to Japanese.

  • Textual selects the replacement values for other languages.
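The effect of the mapping can be sketched in Python. This is only an illustration of the lookup-then-fall-back behavior described above. The function names are hypothetical, and generated_value is a stand-in for whatever value Textual would generate.

```python
import json

# The mapping from the example above, as it would be entered in the text area.
replacements = json.loads('{"French": "German", "English": "Japanese"}')


def generated_value(original: str) -> str:
    # Hypothetical placeholder for the value that Textual would generate.
    return f"<generated for {original}>"


def replace_language(original: str) -> str:
    """Use the explicit replacement when one exists, otherwise fall back."""
    return replacements.get(original, generated_value(original))


print(replace_language("French"))   # German
print(replace_language("Spanish"))  # <generated for Spanish>
```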

Configuring name synthesis options

For the Given Name and Family Name entity types, you can configure:

  • Whether to treat the same name with different casing as a different value.

  • Whether to replicate the gender of the original value.

In the entity types list, to display the name synthesis options, click Options.

Synthesis options for given name values

Differentiating source values by case

To treat the same name with different casing as different source values, check Is Consistency Case Sensitive.

For example, when this is checked, john and John are treated as different names and can have different replacement values. The value john might be replaced with michael, and John might be replaced with Stephen.

When this is not checked, then john and John are treated as the same source value, and get the same replacement.
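One way to picture this setting is as a choice of lookup key for consistency. The following Python sketch is an illustration only. The function name is hypothetical, and the source does not describe how Textual implements the comparison.

```python
def consistency_key(name: str, case_sensitive: bool) -> str:
    """Return the key under which a source name is tracked for consistency."""
    # When case is ignored, names that differ only by casing share one key,
    # so they would receive the same replacement value.
    return name if case_sensitive else name.casefold()


# Case-sensitive: two distinct keys, so two possible replacements.
print(consistency_key("john", True) == consistency_key("John", True))    # False
# Case-insensitive: one shared key, so one replacement.
print(consistency_key("john", False) == consistency_key("John", False))  # True
```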

Preserving gender in names

To replace source names with names that have the same gender, check Preserve Gender.

For example, when this is checked, John might be replaced with Michael, since they are both traditionally male names. However, John would not be replaced with Mary, which is traditionally a female name.

Configuring location synthesis options

Location values include the following types:

  • Location

  • Location Address

  • Location State

  • Location Zip

You can select whether to generate HIPAA or non-HIPAA addresses. Address values can be consistent with values generated in Structural.

For each location type other than Location State, you can specify whether to use a realistic replacement value. For Location State, based on HIPAA guidelines, both the Synthesis option and the Off option pass through the value.

For location types that include zip codes, you can also specify how to generate the new zip code values.

In the entity types list, to display the location synthesis options, click Options.

Synthesis options for location values

Selecting the type of address generator to use

Under Address generator type, select the type of address generator to use:

  • HIPAA-compliant address generator. This option generates values similar to those generated by the corresponding Structural generator.

  • Non-HIPAA address generator. This option generates values similar to those generated by the corresponding Structural generator.

If you configured a Textual statistics seed that matches a Structural statistics seed, then the generated address values are consistent with values generated in Structural. A given address value produces the same output value in both applications.

For example, in both Textual and Structural, a source address value 123 Main Street might be replaced with 234 Oak Avenue.

Indicating whether to use realistic replacement values

By default, Textual replaces a location value with a realistic corresponding value. For example, "Main Street" might be replaced with "Fourth Avenue".

To instead scramble the values, uncheck Replace with realistic values.

Indicating how to generate replacement zip codes

By default, to generate a new zip code, Textual selects a real zip code that starts with the same three digits as the original zip code. For a low population area, Textual instead selects a random zip code from the United States.

To instead replace the last two digits of the zip code with zeros, check Replace zeroes for zip codes. For a low population area, Textual instead replaces all of the digits in the zip code with zeros.
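The zero-replacement option can be sketched as follows. This is a minimal Python illustration for five-digit zip codes. The function name is hypothetical, and the set of low population prefixes is a made-up example, not the list that Textual uses. Identifying low population areas by three-digit prefix is also an assumption.

```python
# Hypothetical example set of three-digit prefixes for low population areas.
LOW_POPULATION_PREFIXES = {"036", "059"}


def zero_out_zip(zip_code: str) -> str:
    """Replace the last two digits with zeros, or all digits for low population areas."""
    if zip_code[:3] in LOW_POPULATION_PREFIXES:
        return "0" * len(zip_code)
    return zip_code[:-2] + "00"


print(zero_out_zip("02139"))  # 02100
print(zero_out_zip("03601"))  # 00000
```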

Configuring datetime synthesis options

By default, when you select the Synthesis option for Date/Time and Date of Birth values, Textual shifts the datetime values to a value that occurs within 7 days before or after the original value.

To customize how Textual sets the new values, you can:

  • Set a different range within which Textual sets the new values

  • Indicate whether to scramble date values that Textual cannot parse

  • Indicate whether to shift all of the original values by the same amount and in the same direction

  • Add additional date formats for Textual to recognize

In the entity types list, to display the datetime synthesis options, click Options.

Datetime synthesis options

Adjusting the range for the replacement values

By default, Textual adjusts the dates to values that are within 7 days before or after the original date.

To change the range:

  1. In the Left bound on # of Days To Shift field, enter the number of days before the original date within which the replacement datetime value must occur. For example, if you enter 10, then the replacement datetime value cannot occur earlier than 10 days before the original value.

  2. In the Right bound on # of Days To Shift field, enter the number of days after the original date within which the replacement datetime value must occur. For example, if you enter 6, then the replacement datetime value cannot occur later than 6 days after the original value.
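The two bounds define an inclusive window around the original value. The following Python sketch illustrates that window with the values from the examples above (10 days before, 6 days after). The function name is hypothetical, and shifting by whole days is an assumption. The source does not say how Textual picks the offset.

```python
import random
from datetime import datetime, timedelta


def shift_datetime(original, left_days=7, right_days=7, rng=None):
    """Shift a datetime by a whole number of days within the configured bounds."""
    rng = rng or random.Random()
    # randint is inclusive at both ends: offset is in [-left_days, right_days].
    offset = rng.randint(-left_days, right_days)
    return original + timedelta(days=offset)


original = datetime(2024, 1, 17, 15, 45)
shifted = shift_datetime(original, left_days=10, right_days=6)
print(shifted)  # between 2024-01-07 15:45 and 2024-01-23 15:45
```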

Indicating how to replace datetime values in unsupported formats

Textual can parse datetime values that use either a format in Default supported datetime formats in Textual or a format that you add.

The Scramble Unrecognized Dates checkbox indicates how Textual should handle datetime values that it does not recognize.

By default, the checkbox is checked, and Textual scrambles those values.

To instead pass through the values without changing them, uncheck Scramble Unrecognized Dates.

Indicating whether to shift all values by the same amount

By default, Textual applies different shifts to the original values. Some replacement dates might be earlier, and some might be later. The amount of shift might also vary.

To shift all of the datetime values in the same way, check Apply same shift for entire document.

For example, if this is checked, Textual might shift all datetime values 3 days in the future.
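A practical consequence of applying the same shift is that the intervals between dates in a document are preserved. The following Python sketch illustrates the difference between the two modes. The function and variable names are hypothetical, and whole-day shifts are an assumption.

```python
import random
from datetime import date, timedelta


def shift_document(dates, left_days=7, right_days=7, same_shift=False, rng=None):
    """Shift every date in a document, either independently or by one shared offset."""
    rng = rng or random.Random()
    if same_shift:
        # One offset for the whole document, so intervals between dates survive.
        offset = timedelta(days=rng.randint(-left_days, right_days))
        return [d + offset for d in dates]
    # Independent offsets: each date moves on its own within the bounds.
    return [d + timedelta(days=rng.randint(-left_days, right_days)) for d in dates]


admitted, discharged = date(2024, 1, 17), date(2024, 1, 20)
new_admitted, new_discharged = shift_document([admitted, discharged], same_shift=True)
print(new_discharged - new_admitted)  # 3 days, 0:00:00
```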

Adding datetime formats

By default, Textual is able to recognize datetime values that use a format from Default supported datetime formats in Textual.

Under Additional Date Formats, you can add other datetime formats that you know are present in your data.

The formats must use a Noda Time LocalDateTime pattern.

To add a format, type the format in the field, then click +.

To remove a format, click its delete icon.

Default supported datetime formats in Textual

By default, Textual supports the following datetime formats.

Date only formats

Format              Example value
yyyy/M/d            2024/1/17
yyyy-M-d            2024-1-17
yyyyMMdd            20240117
yyyy.M.d            2024.1.17
yyyy, MMM d         2024, Jan 17
yyyy-M              2024-1
yyyy/M              2024/1
d/M/yyyy            17/1/2024
d-MMM-yyyy          17-Jan-2024
dd-MMM-yy           17-Jan-24
d-M-yyyy            17-1-2024
d/MMM/yyyy          17/Jan/2024
d MMMM yyyy         17 January 2024
d MMM yyyy          17 Jan 2024
d MMMM, yyyy        17 January, 2024
ddd, d MMM yyyy     Wed, 17 Jan 2024
M/d/yyyy            1/17/2024
M/d/yy              1/17/24
M-d-yyyy            1-17-2024
MMddyyyy            01172024
MMMM d, yyyy        January 17, 2024
MMM d, ''yy         Jan 17, '24
MM-yyyy             01-2024
MMMM, yyyy          January, 2024

Date and time formats

Format                       Example value
yyyy-M-d HH:mm               2024-1-17 15:45
d-M-yyyy HH:mm               17-1-2024 15:45
MM-dd-yy HH:mm               01-17-24 15:45
d/M/yy HH:mm:ss              17/1/24 15:45:30
d/M/yyyy HH:mm:ss            17/1/2024 15:45:30
yyyy/M/d HH:mm:ss            2024/1/17 15:45:30
yyyy-M-dTHH:mm:ss            2024-1-17T15:45:30
yyyy/M/dTHH:mm:ss            2024/1/17T15:45:30
yyyy-M-d HH:mm:ss'Z'         2024-1-17 15:45:30Z
yyyy-M-d'T'HH:mm:ss'Z'       2024-1-17T15:45:30Z
yyyy-M-d HH:mm:ss.fffffff    2024-1-17 15:45:30.1234567
yyyy-M-dd HH:mm:ss.FFFFFF    2024-1-17 15:45:30.123456
yyyy-M-dTHH:mm:ss.fff        2024-1-17T15:45:30.123

Time only formats

Format          Example value
HH:mm           15:45
HH:mm:ss        15:45:30
HHmmss          154530
hh:mm:ss tt     03:45:30 PM
HH:mm:ss'Z'     15:45:30Z

Configuring age synthesis options

By default, when you select the Synthesis option for Age values, Textual shifts the age value to a value that is within seven years before or after the original value. For age values that it cannot synthesize, it scrambles the value.

In the entity types list, to display the age synthesis options, click Options.

Synthesis options for age values

To configure the synthesis:

  1. In the Range of Years +/- for the Shifted Age field, enter the number of years before and after the original value to use as the range for the synthesized value.

  2. By default, Textual scrambles age values that it cannot parse. To instead pass through the value unchanged, uncheck Scramble Unrecognized Ages.

Configuring telephone number synthesis options

For Phone Number values, you can choose whether to generate a realistic phone number. If you do, then the generated values can be consistent with values generated in Structural.

In the entity types list, to display the phone number synthesis options, click Options.

Synthesis options for telephone number values

Selecting the generator type

From the Phone number generator type dropdown list:

  • To replace each phone number with a randomly generated number, select Random Number.

  • To generate a realistic telephone number, select US Phone Number. The US Phone Number option generates values similar to those generated by the corresponding Structural generator.

If you also configured a Textual statistics seed that matches a Structural statistics seed, then the synthesized values are consistent with values generated in Structural. A given source telephone number produces the same output telephone number in both applications.

For example, in both Textual and Structural, 123-456-6789 might be replaced with 154-567-8901.

Determining how to replace invalid telephone numbers

The Replace invalid numbers with valid numbers checkbox determines how Textual handles invalid telephone numbers in the data.

To replace invalid telephone numbers with valid telephone numbers, check the checkbox.

If you do not check the checkbox, then Textual randomly replaces the numeric characters.
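Random replacement of the numeric characters can be pictured as follows. This Python sketch keeps every non-digit character in place and swaps each digit for a random digit. The function name is hypothetical, and the source does not specify the exact algorithm that Textual uses.

```python
import random

DIGITS = "0123456789"


def scramble_digits(phone: str, rng=None) -> str:
    """Replace each ASCII digit with a random digit and keep the formatting."""
    rng = rng or random.Random()
    return "".join(rng.choice(DIGITS) if ch in DIGITS else ch for ch in phone)


# Punctuation and spacing are unchanged. Only the digits vary.
print(scramble_digits("(555) 12-34"))
```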

Selecting and configuring the generator for custom entity types

By default, when you select the Synthesis option for a custom entity type, Textual scrambles the original value.

From the generator dropdown list, select the generator to use to create the replacement value.

Generator dropdown list for a custom entity type

The available generators are:

Generator
Description

Scramble

This is the default generator.

Scrambles the original value.

CC Exp

Generates a credit card expiration date.

Company Name

Generates a name of a business.

Credit Card

Generates a credit card number.

CVV

Generates a credit card security code.

Date Time

Generates a datetime value.

The Date Time generator has the same configuration options as the built-in datetime entity types.

Email

Generates an email address.

HIPAA Address Generator

Generates a mailing address.

The generator has the same configuration options as the built-in location entity types.

IP Address

Generates an IP address.

MICR Code

Generates an MICR code.

Money

Generates a currency amount.

Name

Generates a person's name.

You configure:

  • Whether to generate the same replacement value from source values that have different capitalization.

  • Whether the replacement value reflects the gender of the original value.

Numeric Value

Generates a numeric value.

You configure whether to use the Integer Primary Key generator to generate the value.

Person Age

Generates an age value.

The Person Age generator has the same configuration options as the built-in Age entity type.

Phone Number

Generates a telephone number.

The Phone Number generator has the same configuration options as the built-in Phone Number entity type.

SSN

Generates a United States Social Security Number.

URL

Generates a URL.

Structure of JSON output files

The JSON output provides access to Markdown content and identifies the entities that were detected in the file.

Common elements in the JSON output

Information about the entire file

All JSON output files contain the following elements that contain information for the entire file:

For specific file types, the JSON output includes additional objects and properties to reflect the file structure.

Hashed and Markdown content

The JSON output contains hashed and Markdown content for the entire file and for individual file components.

Entities

The JSON output contains entities arrays for the entire file and for individual file components.

Each entity in the entities array has the following properties:

Plain text files

For plain text files, the JSON output only contains the information for the entire file.

.csv files

For .csv files, the structure contains a tables array.

The tables array contains a table object that contains header and data arrays.

For each row in the file, the data array contains a row array.

For each value in a row, the row array contains a value object.

The value object contains the entities, hashed content, and Markdown content for the value.
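To make the nesting concrete, the following Python sketch walks a parsed .csv JSON output and yields every entity with its row and column position. The property names (tables, data, entities) come from the .csv JSON outline in this guide. The function name and the sample values are hypothetical.

```python
def iter_csv_entities(output: dict):
    """Yield (row_index, column_index, entity) for every entity in a .csv output."""
    for table in output.get("tables", []):
        for row_index, row in enumerate(table.get("data", [])):
            for column_index, value in enumerate(row):
                for entity in value.get("entities", []):
                    yield row_index, column_index, entity


# Made-up sample with one row, one value, and one detected entity.
sample = {
    "tables": [
        {
            "tableName": "csv_table",
            "header": ["col_0"],
            "data": [
                [
                    {
                        "entities": [{"label": "NAME_GIVEN", "text": "John"}],
                        "hash": "<hashed value content>",
                        "text": "John",
                    }
                ]
            ],
        }
    ]
}

for row_index, column_index, entity in iter_csv_entities(sample):
    print(row_index, column_index, entity["label"], entity["text"])
```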

.xlsx files

For .xlsx files, the structure contains a tables array that provides details for each worksheet in the file.

For each worksheet, the tables array contains a worksheet object.

Each worksheet object contains a header array and a data array. For each row in the worksheet, the data array contains a row array.

For each cell in a row, the row array contains a cell object.

Each cell object contains the entities, hashed content, and Markdown content for the cell.

.docx files

For .docx files, the JSON output structure adds:

  • A footnotes array for content in footnotes.

  • An endnotes array for content in endnotes.

  • A header object for content in the page headers. Includes separate objects for the first page header, even page header, and odd page header.

  • A footer object for content in the page footers. Includes separate objects for the first page footer, even page footer, and odd page footer.

These arrays and objects contain the entities, hashed content, and Markdown content for the notes, headers, and footers.
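As an illustration, the following Python sketch gathers the entities from the notes, headers, and footers of a parsed .docx JSON output. The property names (footNotes, endNotes, header, footer, first, even, odd) follow the .docx JSON outline in this guide. The function name and the sample values are hypothetical.

```python
def collect_docx_entities(output: dict) -> dict:
    """Group the entities in a .docx output by the part of the document they came from."""
    found = {}
    for notes in ("footNotes", "endNotes"):
        found[notes] = [
            entity
            for note in output.get(notes, [])
            for entity in note.get("entities", [])
        ]
    for part in ("header", "footer"):
        for kind in ("first", "even", "odd"):
            block = (output.get(part) or {}).get(kind) or {}
            found[f"{part}.{kind}"] = block.get("entities", [])
    return found


# Made-up sample with one footnote entity and one first page header entity.
sample = {
    "footNotes": [{"entities": [{"label": "NAME_GIVEN", "text": "John"}]}],
    "header": {"first": {"entities": [{"label": "NAME_FAMILY", "text": "Smith"}]}},
}
print(collect_docx_entities(sample))
```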

PDF and image files

PDF and image files use the same structure. Textual extracts and scans the text from the files.

For PDF and image files, the JSON output structure adds the following content.

pages array

The pages array contains all of the content on the pages. This includes content in tables and key-value pairs, which are also listed separately in the output.

For each page in the file, the pages array contains a page array.

For each component on the page - such as paragraphs, headings, headers, and footers - the page array contains a component object.

Each component object contains the component entities, hashed content, and Markdown content.

tables array

The tables array contains content that is in tables.

For each table in the file, the tables array contains a table array.

For each row in a table, the table array contains a row array.

For each cell in a row, the row array contains a cell object.

Each cell object identifies the type of cell (header or content). It also contains the entities, hashed content, and Markdown content for the cell.

keyValuePairs array

The keyValuePairs array contains key-value pair content. For example, for a PDF of a form with fields, a key-value pair might represent a field label and a field value.

For each key-value pair, the keyValuePairs array contains a key-value pair object.

The key-value pair object contains:

  • An automatically incremented identifier. For example, id for the first key-value pair is 1, for the second key-value pair is 2, and so on.

  • The start and end position of the key-value pair

  • The text of the key

  • The entities, hashed content, and Markdown content for the value
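For example, the following Python sketch lists each key-value pair in identifier order, together with the entity types detected in its value. The property names (keyValuePairs, id, key, value, entities, label, text) follow the PDF and image JSON outline in this guide. The function name and the sample values are hypothetical.

```python
def summarize_key_value_pairs(output: dict) -> list:
    """Return (id, key, value text, entity labels) for each key-value pair."""
    rows = []
    for pair in sorted(output.get("keyValuePairs", []), key=lambda p: p["id"]):
        value = pair["value"]
        labels = [entity["label"] for entity in value.get("entities", [])]
        rows.append((pair["id"], pair["key"], value["text"], labels))
    return rows


# Made-up sample: one form field whose value contains a detected name.
sample = {
    "keyValuePairs": [
        {
            "id": 1,
            "key": "Patient name",
            "value": {
                "entities": [{"label": "NAME_GIVEN", "text": "John"}],
                "hash": "<hashed value text>",
                "text": "John",
            },
        }
    ]
}
print(summarize_key_value_pairs(sample))
```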

PDF and image JSON outline

.eml and .msg files

For email message files, the JSON output structure adds the following content.

Email message identifiers

The JSON output includes the following email message identifiers:

  • The identifier of the current message

  • If the message was a reply to another message, the identifier of that message

  • An array of related email messages. This includes the email message that the message replied to, as well as any other messages in an email message thread.

Recipients

The JSON output includes the email address and display name of the message recipients. It contains separate lists for the following:

  • Recipients in the To line

  • Recipients in the CC line

  • Recipients in the BCC line

Subject line

The subject object contains the message subject line. It includes:

  • Markdown and hashed versions of the message subject line.

  • The entities that were detected in the subject line.

Message timestamp

sentDate provides the timestamp when the message was sent.

Message body

The plainTextBodyContent object contains the body of the email message.

It contains:

  • Markdown and hashed versions of the message body.

  • The entities that were detected in the message body.
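The following Python sketch pulls the recipients, the sent timestamp, and the detected entity types out of a parsed email JSON output. The property names (toAddresses, ccAddresses, bccAddresses, address, sentDate, subject, plainTextBodyContent) follow the email message JSON outline in this guide. The function name and the sample values are hypothetical.

```python
def summarize_email(output: dict) -> dict:
    """Summarize the recipients and the entity labels in the subject and body."""
    recipients = {
        field: [entry["address"] for entry in output.get(field, [])]
        for field in ("toAddresses", "ccAddresses", "bccAddresses")
    }
    labels = {
        part: sorted(
            {entity["label"] for entity in (output.get(part) or {}).get("entities", [])}
        )
        for part in ("subject", "plainTextBodyContent")
    }
    return {"sentDate": output.get("sentDate"), "recipients": recipients, "labels": labels}


# Made-up sample message with one To recipient and one entity in the body.
sample = {
    "toAddresses": [{"address": "a@example.com", "displayName": "A"}],
    "sentDate": "<timestamp when the message was sent>",
    "subject": {"text": "Hello", "hash": "<hash>", "entities": []},
    "plainTextBodyContent": {
        "text": "Hi John",
        "hash": "<hash>",
        "entities": [{"label": "NAME_GIVEN", "text": "John"}],
    },
}
print(summarize_email(sample))
```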

Message attachments

The attachments array provides information about any attachments to the email message. For each attached file, it includes:

  • The identifier of the message that the file is attached to.

  • The identifier of the attachment.

  • The JSON output for the file.

  • The count of words in the original file.

  • The count of words in the redacted version of the file.

Email message JSON outline

{
  "fileType": "<file type>",
  "content": {
    "text": "<Markdown file content>",
    "hash": "<hashed file content>",
    "entities": [   //Entry for each entity in the file
      {
        "start": <start location>,
        "end": <end location>,
        "label": "<value type>",
        "text": "<value text>",
        "score": <confidence score>,
        "language": "<language code>"
      }
    ]
  },
  "schemaVersion": <integer schema version>
}

fileType

The type of the original file.

content

Details about the file content. It includes:

  • Hashed and Markdown content for the file

  • Entities in the file

schemaVersion

An integer that identifies the version of the JSON schema that was used for the JSON output.

Textual uses this to convert content from older schemas to the most recent schema.

hash

The hashed version of the file or component content.

text

The file or component content in Markdown notation.

start

Within the file or component, the location where the entity value starts.

For example, in the following text:

My name is John.

John is an entity that starts at 11.

end

Within the file or component, the location where the entity value ends.

For example, in the following text:

My name is John.

John is an entity that ends at 14.

label

The type of entity.

For a list of the entity types that Textual detects, go to Entity types that Textual detects.

text

The text of the entity.

score

The confidence score for the entity.

Indicates how confident Textual is that the value is an entity of the specified type.

language

The language code to identify the language for the entity value. For example, en indicates that the value is in English.
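The start and end examples in the property descriptions (John starts at 11 and ends at 14 in "My name is John.") read as zero-based offsets where end points at the last character of the value. Taking those examples at face value, the following Python sketch recovers the entity text from the content. The function name is hypothetical. If your output turns out to use an exclusive end offset, drop the + 1. Comparing the result against the entity's own text property is a quick way to check.

```python
def entity_span(content_text: str, entity: dict) -> str:
    """Slice the entity value out of the content, treating end as inclusive."""
    return content_text[entity["start"] : entity["end"] + 1]


text = "My name is John."
entity = {"start": 11, "end": 14, "label": "NAME_GIVEN", "text": "John"}

print(entity_span(text, entity))  # John
# Sanity check against the entity's own text property.
print(entity_span(text, entity) == entity["text"])  # True
```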

{
  "fileType": "<file type>",
  "content": {
    "text": "<Markdown content>",
    "hash": "<hashed content>",
    "entities": [   //Entry for each entity in the file
      {
        "start": <start location>,
        "end": <end location>,
        "label": "<value type>",
        "text": "<value text>",
        "score": <confidence score>,
        "language": "<language code>"
      }
    ]
  },
  "schemaVersion": <integer schema version>
}
{
  "tables": [
    {
      "tableName": "csv_table",
      "header": [ //Columns that contain heading info (col_0, col_1, and so on)
        "<column identifier>"
      ],
      "data": [  //Entry for each row in the file
        [   //Entry for each value in the row
          {    
            "entities": [   //Entry for each entity in the value
              {
                "start": <start location>,
                "end": <end location>,
                "label": "<value type>",
                "text": "<value text>",
                "score": <confidence score>,
                "language": "<language code>"
              }
            ],
            "hash": "<hashed value content>",
            "text": "<Markdown value content>"
          }
        ]
      ]
    }
  ],
  "fileType": "<file type>",
  "content": {
    "text": "<Markdown file content>",
    "hash": "<hashed file content>",
    "entities": [   //Entry for each entity in the file
      {
        "start": <start location>,
        "end": <end location>,
        "label": "<value type>",
        "text": "<value text>",
        "score": <confidence score>,
        "language": "<language code>"
      }
    ]
  },
  "schemaVersion": <integer schema version>
}
{
  "tables": [   //Entry for each worksheet
    {
      "tableName": "<Name of the worksheet>",
      "header": [ //Columns that contain heading info (col_0, col_1, and so on)
        "<column identifier>"
      ],
      "data": [   //Entry for each row
        [   //Entry for each cell in the row
          {
            "entities": [   //Entry for each entity in the cell
              {
                "start": <start location>,
                "end": <end location>,
                "label": "<value type>",
                "text": "<value text>",
                "score": <confidence score>,
                "language": "<language code>"
              }
            ],
            "hash": "<hashed cell content>",
            "text": "<Markdown cell content>"
          }
        ]
      ]
    }
  ],
  "fileType": "<file type>",
  "content": {
    "text": "<Markdown file content>",
    "hash": "<hashed file content>",
    "entities": [   //Entry for each entity in the file
      {
        "start": <start location>,
        "end": <end location>,
        "label": "<value type>",
        "text": "<value text>",
        "score": <confidence score>,
        "language": "<language code>"
      }
    ]
  },
  "schemaVersion": <integer schema version>
}
{
  "footNotes": [   //Entry for each footnote
    {
      "entities": [   //Entry for each entity in the footnote
        {
          "start": <start location>,
          "end": <end location>,
          "pythonStart": <start location in Python>,
          "pythonEnd": <end location in Python>,
          "label": "<value type>",
          "text": "<value text>",
          "score": <confidence score>,
          "language": "<language code>",
          "exampleRedaction": null
        }
      ],
      "hash": "<hashed footnote content>",
      "text": "<Markdown footnote content>"
    }
  ],
  "endNotes": [   //Entry for each endnote
    {
      "entities": [   //Entry for each entity in the endnote
        {
          "start": <start location>,
          "end": <end location>,
          "label": "<value type>",
          "text": "<value text>",
          "score": <confidence score>,
          "language": "<language code>"
        }
      ],
      "hash": "<hashed endnote content>",
      "text": "<Markdown endnote content>"
    }
  ],
  "header": {
    "first": {
      "entities": [   //Entry for each entity in the first page header
        {
          "start": <start location>,
          "end": <end location>,
          "label": "<value type>",
          "text": "<value text>",
          "score": <confidence score>,
          "language": "<language code>"
        }
      ],
      "hash": "<hashed first page header content>",
      "text": "<Markdown first page header content>"
    },
    "even": {
      "entities": [   //Entry for each entity in the even page header
        {
          "start": <start location>,
          "end": <end location>,
          "label": "<value type>",
          "text": "<value text>",
          "score": <confidence score>,
          "language": "<language code>"
        }
      ],
      "hash": "<hashed even page header content>",
      "text": "<Markdown even page header content>"
    },
    "odd": {
      "entities": [   //Entry for each entity in the odd page header
        {
          "start": <start location>,
          "end": <end location>,
          "label": "<value type>",
          "text": "<value text>",
          "score": <confidence score>,
          "language": "<language code>"
        }
      ],
      "hash": "<hashed odd page header content>",
      "text": "<Markdown odd page header content>"
    }
  },
  "footer": {
    "first": {
      "entities": [   //Entry for each entity in the first page footer
        {
          "start": <start location>,
          "end": <end location>,
          "label": "<value type>",
          "text": "<value text>",
          "score": <confidence score>,
          "language": "<language code>"
        }
      ],
      "hash": "<hashed first page footer content>",
      "text": "<Markdown first page footer content>"
    },
    "even": {
      "entities": [   //Entry for each entity in the even page footer
        {
          "start": <start location>,
          "end": <end location>,
          "label": "<value type>",
          "text": "<value text>",
          "score": <confidence score>,
          "language": "<language code>"
        }
      ],
      "hash": "<hashed even page footer content>",
      "text": "<Markdown even page footer content>"
    },
    "odd": {
      "entities": [   //Entry for each entity in the odd page footer
        {
          "start": <start location>,
          "end": <end location>,
          "label": "<value type>",
          "text": "<value text>",
          "score": <confidence score>,
          "language": "<language code>"
        }
      ],
      "hash": "<hashed odd page footer content>",
      "text": "<Markdown odd page footer content>"
    }
  },
  "fileType": "<file type>",
  "content": {
    "text": "<Markdown file content>",
    "hash": "<hashed file content>",
    "entities": [   //Entry for each entity in the file
      {
        "start": <start location>,
        "end": <end location>,
        "label": "<value type>",
        "text": "<value text>",
        "score": <confidence score>,
        "language": "<language code>"
      }
    ]
  },
  "schemaVersion": <integer schema version>
}
{
  "pages": [   //Entry for each page in the file
    [   //Entry for each component on the page
      {
        "type": "<page component type>",
        "content": {
          "entities": [   //Entry for each entity in the component
            {
              "start": <start location>,
              "end": <end location>,
              "label": "<value type>",
              "text": "<value text>",
              "score": <confidence score>,
              "language": "<language code>"
            }
          ],
          "hash": "<hashed component content>",
          "text": "<Markdown component content>"
        }
      }
    ]
  ],
  "tables": [   //Entry for each table in the file
    [   //Entry for each row in the table
      [   //Entry for each cell in the row
        {
          "type": "<content type>",   //ColumnHeader or Content
          "content": {
            "entities": [  //Entry for each entity in the cell
              {
                "start": <start location>,
                "end": <end location>,
                "label": "<value type>",
                "text": "<value text>",
                "score": <confidence score>,
                "language": "<language code>"
              }
            ],
            "hash": "<hashed cell text>",
            "text": "<Markdown cell text>"
          }
        }
      ]
    ]
  ],
  "keyValuePairs": [   //Entry for each key-value pair in the file
    {
      "id": <incremented identifier>,
      "key": "<key text>",
      "value": {
        "entities": [  //Entry for each entity in the value
          {
            "start": <start location>,
            "end": <end location>,
            "label": "<value type>",
            "text": "<value text>",
            "score": <confidence score>,
            "language": "<language code>"
          }
        ],
        "hash": "<hashed value text>",
        "text": "<Markdown value text>"
      },
      "start": <start location of the key-value pair>,
      "end": <end location of the key-value pair>
    }
  ],
  "fileType": "<file type>",
  "content": {
    "text": "<Markdown file content>",
    "hash": "<hashed file content>",
    "entities": [   //Entry for each entity in the file
      {
        "start": <start location>,
        "end": <end location>,
        "label": "<value type>",
        "text": "<value text>",
        "score": <confidence score>,
        "language": "<language code>"
      }
    ]
  },
  "schemaVersion": <integer schema version>
}
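As a minimal sketch of how this output might be consumed, the following Python walks a parsed result and gathers the entity records from the file-level content, the table cells, and the key-value pairs. The helper name `collect_entities` and every value in the sample document (including the `fileType` value) are illustrative assumptions; only the field names come from the structure above. The sketch does not de-duplicate entities that might appear in more than one place.

```python
from typing import Any, Dict, List


def collect_entities(doc: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Gather entity records from the file content, tables, and key-value pairs."""
    found: List[Dict[str, Any]] = []
    # File-level entities.
    found.extend(doc.get("content", {}).get("entities", []))
    # tables -> rows -> cells; each cell carries its own content block.
    for table in doc.get("tables", []):
        for row in table:
            for cell in row:
                found.extend(cell.get("content", {}).get("entities", []))
    # Each key-value pair stores its entities under "value".
    for pair in doc.get("keyValuePairs", []):
        found.extend(pair.get("value", {}).get("entities", []))
    return found


# Hypothetical sample shaped like the structure above (all values are made up).
sample = {
    "tables": [[[{"type": "Content", "content": {
        "entities": [{"start": 0, "end": 4, "label": "NAME_GIVEN",
                      "text": "John", "score": 0.9, "language": "en"}],
        "hash": "h1", "text": "John"}}]]],
    "keyValuePairs": [{"id": 1, "key": "Name", "value": {
        "entities": [{"start": 0, "end": 5, "label": "NAME_FAMILY",
                      "text": "Smith", "score": 0.9, "language": "en"}],
        "hash": "h2", "text": "Smith"}, "start": 0, "end": 11}],
    "fileType": "Pdf",
    "content": {"text": "John Smith", "hash": "h3", "entities": []},
    "schemaVersion": 1,
}

labels = [entity["label"] for entity in collect_entities(sample)]
print(labels)
```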
{
  "messageId": "<email message identifier>",
  "inReplyToMessageId": <message that this message replied to>,
  "messageIdReferences": [<related email messages>],
  "senderAddress": {
    "address": "<sender email address>",
    "displayName": "<sender display name>"
  },
  "toAddresses": [  //Entry for each recipient in the To list
    {
      "address": "<recipient email address>",
      "displayName": "<recipient display name>"
    }
  ],
  "ccAddresses": [ //Entry for each recipient in the CC list
    {
      "address": "<recipient email address>",
      "displayName": "<recipient display name>"
    }
  ],
  "bccAddresses": [ //Entry for each recipient in the BCC list
    {
      "address": "<recipient email address>",
      "displayName": "<recipient display name>"
    }
  ],
  "sentDate": "<timestamp when the message was sent>",
  "subject": {
    "text": "<Markdown version of the subject line>",
    "hash": "<hashed version of the subject line>",
    "entities": [   //Entry for each entity in the subject line
      {
        "start": <start location>,
        "end": <end location>,
        "label": "<value type>",
        "text": "<value text>",
        "score": <confidence score>,
        "language": "<language code>"
      }
    ]
  },
  "plainTextBodyContent": {
    "text": "<Markdown version of the message body>",
    "hash": "<hashed version of the message body>",
    "entities": [ //Entry for each entity in the message body
      {
        "start": <start location>,
        "end": <end location>,
        "label": "<value type>",
        "text": "<value text>",
        "score": <confidence score>,
        "language": "<language code>"
      }
    ]
  },
  "attachments": [ //Entry for each attached file
    {
      "parentMessageId": "<the message that the file is attached to>",
      "contentId": "<identifier of the attachment>",
      "fileName": "<name of the attachment file>",
      "document": {<pipeline JSON for the attached file>},
      "wordCount": <number of words in the attachment>,
      "redactedWordCount": <number of words in the redacted attachment>
    }
  ],
  "fileType": "<file type>",
  "content": {
    "text": "<Markdown file content>",
    "hash": "<hashed file content>",
    "entities": [ //Entry for each entity in the file
      {
        "start": <start location>,
        "end": <end location>,
        "label": "<value type>",
        "text": "<value text>",
        "score": <confidence score>,
        "language": "<language code>"
      }
    ]
  },
  "schemaVersion": <integer schema version>
}
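The email structure above can be read in the same way. The following sketch, which assumes the JSON has already been parsed into a Python dictionary, lists every address on a message and the entity labels found in its subject and body. The function names and the sample message are hypothetical.

```python
from typing import Any, Dict, List, Set


def all_addresses(message: Dict[str, Any]) -> List[str]:
    """Return the sender address followed by the To, CC, and BCC addresses."""
    addresses = [message["senderAddress"]["address"]]
    for field in ("toAddresses", "ccAddresses", "bccAddresses"):
        addresses.extend(entry["address"] for entry in message.get(field, []))
    return addresses


def entity_labels(message: Dict[str, Any]) -> Set[str]:
    """Return the distinct entity labels detected in the subject and the body."""
    labels: Set[str] = set()
    for part in ("subject", "plainTextBodyContent"):
        for entity in message.get(part, {}).get("entities", []):
            labels.add(entity["label"])
    return labels


# Hypothetical message; addresses and text are made up for illustration.
message = {
    "senderAddress": {"address": "a@example.com", "displayName": "A"},
    "toAddresses": [{"address": "b@example.com", "displayName": "B"}],
    "ccAddresses": [],
    "bccAddresses": [{"address": "c@example.com", "displayName": "C"}],
    "subject": {"text": "Hello John", "hash": "h", "entities": [
        {"start": 6, "end": 10, "label": "NAME_GIVEN", "text": "John",
         "score": 0.9, "language": "en"}]},
    "plainTextBodyContent": {"text": "", "hash": "h", "entities": []},
}

print(all_addresses(message))
print(entity_labels(message))
```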

Edits a dataset

put

Updates a dataset to use the specified configuration.

Required Permissions

  • Dataset: Edit Settings
Query parameters
shouldRescan · boolean · Optional
Body
all of · Optional
Responses
200

OK

application/json
404

The dataset cannot be found

409

Dataset name is already in use

PUT /api/Dataset HTTP/1.1
Host: 
Content-Type: application/json
Accept: */*
Content-Length: 731

{
  "id": "text",
  "name": "text",
  "generatorSetup": "{\"NAME_GIVEN\":\"Redaction\", \"NAME_FAMILY\":\"Redaction\"}",
  "generatorMetadata": {
    "ANY_ADDITIONAL_PROPERTY": {
      "version": "V1",
      "customGenerator": "Scramble",
      "swaps": {
        "ANY_ADDITIONAL_PROPERTY": "text"
      }
    }
  },
  "labelBlockLists": "{\"NAME_FAMILY\": {\"strings\":[],\"regexes\":[\".*\\\\s(disease|syndrom|disorder)\"]}}",
  "labelAllowLists": "{ \"HEALTHCARE_ID\": {\"strings\":[],\"regexes\":[\"[a-z]{2}\\\\d{9}\"]} }",
  "enabledModels": [
    "text"
  ],
  "docXImagePolicy": "Redact",
  "pdfSignaturePolicy": "Redact",
  "pdfSynthModePolicy": "V1",
  "docXCommentPolicy": "Remove",
  "docXTablePolicy": "Redact",
  "fileSourceExternalCredential": {
    "fileSource": "Local",
    "credential": {}
  },
  "awsCredentialSource": "text",
  "outputPath": "text"
}
200

OK

{
  "id": "text",
  "name": "text",
  "generatorMetadata": "asdfqwer",
  "outputFormat": "Original",
  "generatorSetup": "{\"NAME_GIVEN\":\"Redaction\", \"NAME_FAMILY\":\"Redaction\"}",
  "labelBlockLists": "{\"NAME_FAMILY\": {\"strings\":[],\"regexes\":[\".*\\\\s(disease|syndrom|disorder)\"]}}",
  "labelAllowLists": "{ \"HEALTHCARE_ID\": {\"strings\":[],\"regexes\":[\"[a-z]{2}\\\\d{9}\"]} }",
  "enabledModels": [
    "text"
  ],
  "tags": [
    "text"
  ],
  "files": [
    {
      "fileId": "text",
      "fileName": "text",
      "fileType": "text",
      "datasetId": "text",
      "numRows": 1,
      "numColumns": 1,
      "piiTypes": [
        "text"
      ],
      "wordCount": 1,
      "redactedWordCount": 1,
      "uploadedTimestamp": {},
      "fileSource": "Local",
      "processingStatus": "text",
      "processingError": "text",
      "mostRecentCompletedJobId": "text",
      "fileParseResultId": "text",
      "filePath": "text",
      "generatedFileStatus": "text"
    }
  ],
  "lastUpdated": {},
  "created": {},
  "creatorUser": {
    "id": "text",
    "userName": "text",
    "firstName": "text",
    "lastName": "text"
  },
  "docXImagePolicy": "Redact",
  "pdfSignaturePolicy": "Redact",
  "pdfSynthModePolicy": "V1",
  "docXCommentPolicy": "Remove",
  "docXTablePolicy": "Redact",
  "fileSource": "Local",
  "customPiiEntityIds": [
    "text"
  ],
  "operations": [
    "HasAccess"
  ],
  "rescanJobs": [
    {
      "id": "text",
      "status": "text",
      "errorMessages": "text",
      "startTime": {},
      "endTime": {},
      "publishedTime": {},
      "datasetFileId": "text",
      "datasetId": "text",
      "jobType": "DeidentifyFile"
    }
  ],
  "mostRecentExternalFileGenerationJob": {
    "id": "text",
    "status": "text",
    "errorMessages": "text",
    "startTime": {},
    "endTime": {},
    "publishedTime": {},
    "datasetFileId": "text",
    "datasetId": "text",
    "jobType": "DeidentifyFile"
  },
  "fileSourceExternalCredential": {
    "fileSource": "Local",
    "credential": {}
  },
  "awsCredentialSource": "text",
  "outputPath": "text",
  "externalFilesInfo": {
    "selectedFiles": [
      "text"
    ],
    "pathPrefixes": [
      "text"
    ],
    "selectedFileExtensions": [
      "text"
    ]
  }
}
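In the request example above, `generatorSetup`, `labelBlockLists`, and `labelAllowLists` are JSON documents that are themselves encoded as strings, which is why the regular expressions carry four backslashes. A minimal sketch of producing that double encoding in Python follows. The helper name `build_dataset_update` is hypothetical, and the sketch sets only a subset of the fields shown above; whether the endpoint accepts a partial body is not stated here.

```python
import json
from typing import Any, Dict


def build_dataset_update(dataset_id: str, name: str,
                         generator_setup: Dict[str, str],
                         block_lists: Dict[str, Any],
                         allow_lists: Dict[str, Any]) -> Dict[str, Any]:
    """Build a request body whose nested settings are JSON-encoded strings."""
    return {
        "id": dataset_id,
        "name": name,
        # Each of these fields is a string that contains JSON.
        "generatorSetup": json.dumps(generator_setup),
        "labelBlockLists": json.dumps(block_lists),
        "labelAllowLists": json.dumps(allow_lists),
    }


body = build_dataset_update(
    "text", "text",
    {"NAME_GIVEN": "Redaction", "NAME_FAMILY": "Redaction"},
    {"NAME_FAMILY": {"strings": [],
                     "regexes": [r".*\s(disease|syndrom|disorder)"]}},
    {"HEALTHCARE_ID": {"strings": [], "regexes": [r"[a-z]{2}\d{9}"]}},
)

# Encoding the whole body escapes the inner strings a second time.
payload = json.dumps(body, indent=2)
print(payload)
```

Decoding reverses the two steps: parse the body, then parse the individual string field.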

Returns a list of supported entity types

get
Responses
200

OK

application/json
GET /api/Redact/pii_types HTTP/1.1
Host: 
Accept: */*
200

OK

[
  "NUMERIC_VALUE"
]

Redact entities in plain text

post

Returns a modified version of the provided text string that redacts or synthesizes the detected entity values.

Body
all of · Optional
Responses
200

OK

application/json
POST /api/Redact HTTP/1.1
Host: 
Content-Type: application/json
Accept: */*
Content-Length: 716

{
  "generatorConfig": {
    "NAME_GIVEN": "Redaction",
    "NAME_FAMILY": "Redaction"
  },
  "generatorDefault": "Off",
  "docXImagePolicy": "Redact",
  "docXCommentPolicy": "Remove",
  "pdfSignaturePolicy": "Redact",
  "pdfSynthModePolicy": "V1",
  "docXTablePolicy": "Redact",
  "labelBlockLists": {
    "NAME_FAMILY": {
      "strings": [],
      "regexes": [
        ".*\\s(disease|syndrom|disorder)"
      ]
    }
  },
  "labelAllowLists": {
    "HEALTHCARE_ID": {
      "strings": [],
      "regexes": [
        "[a-z]{2}\\d{9}"
      ]
    }
  },
  "customModels": [
    "text"
  ],
  "generatorMetadata": {
    "ANY_ADDITIONAL_PROPERTY": {
      "version": "V1",
      "customGenerator": "Scramble",
      "swaps": {
        "ANY_ADDITIONAL_PROPERTY": "text"
      }
    }
  },
  "recordApiRequestOptions": {
    "record": true,
    "retentionTimeInHours": 1,
    "tags": [
      "text"
    ]
  },
  "customPiiEntityIds": [
    "text"
  ],
  "text": "My name is John Smith"
}
200

OK

{
  "originalText": "text",
  "redactedText": "text",
  "usage": 1,
  "deIdentifyResults": [
    {
      "start": 1,
      "end": 1,
      "newStart": 1,
      "newEnd": 1,
      "label": "text",
      "text": "text",
      "newText": "text",
      "score": 1,
      "language": "text",
      "exampleRedaction": "text",
      "jsonPath": "text",
      "xmlPath": "text",
      "idx": 1
    }
  ]
}
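The following is a minimal sketch of preparing this request and reading the response with the Python standard library. It only constructs the request object and does not send it. The host name, the use of an `Authorization` header that carries an API key, the function names, and the sample response values are all assumptions made for illustration; confirm the authentication scheme for your instance before you rely on it.

```python
import json
import urllib.request
from typing import Any, Dict, List, Tuple


def build_redact_request(base_url: str, api_key: str, text: str,
                         generator_config: Dict[str, str]) -> urllib.request.Request:
    """Build (but do not send) a POST request for /api/Redact."""
    body = {"generatorConfig": generator_config, "text": text}
    return urllib.request.Request(
        base_url.rstrip("/") + "/api/Redact",
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        # Assumption: the API key is passed in the Authorization header.
        headers={"Content-Type": "application/json", "Authorization": api_key},
    )


def replacements(response: Dict[str, Any]) -> List[Tuple[str, str, str]]:
    """Return (label, original text, new text) for each detected entity."""
    return [(r["label"], r["text"], r["newText"])
            for r in response["deIdentifyResults"]]


request = build_redact_request(
    "https://textual.example.com",  # hypothetical host
    "<API key>",
    "My name is John Smith",
    {"NAME_GIVEN": "Redaction", "NAME_FAMILY": "Redaction"},
)

# Hypothetical response in the shape shown above; the values are made up.
sample_response = {
    "originalText": "My name is John Smith",
    "redactedText": "My name is [GIVEN] [FAMILY]",
    "usage": 5,
    "deIdentifyResults": [
        {"start": 11, "end": 15, "label": "NAME_GIVEN",
         "text": "John", "newText": "[GIVEN]"},
        {"start": 16, "end": 21, "label": "NAME_FAMILY",
         "text": "Smith", "newText": "[FAMILY]"},
    ],
}
print(request.get_method(), request.full_url)
print(replacements(sample_response))
```

To send the request, pass it to `urllib.request.urlopen` and parse the response body with `json.loads`.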

Redact plain text entities in bulk

post

Returns modified versions of the provided text strings that redact or synthesize the detected entity values.

Body
all of · Optional
Responses
200

OK

application/json
POST /api/Redact/bulk HTTP/1.1
Host: 
Content-Type: application/json
Accept: */*
Content-Length: 705

{
  "generatorConfig": {
    "NAME_GIVEN": "Redaction",
    "NAME_FAMILY": "Redaction"
  },
  "generatorDefault": "Off",
  "docXImagePolicy": "Redact",
  "docXCommentPolicy": "Remove",
  "pdfSignaturePolicy": "Redact",
  "pdfSynthModePolicy": "V1",
  "docXTablePolicy": "Redact",
  "labelBlockLists": {
    "NAME_FAMILY": {
      "strings": [],
      "regexes": [
        ".*\\s(disease|syndrom|disorder)"
      ]
    }
  },
  "labelAllowLists": {
    "HEALTHCARE_ID": {
      "strings": [],
      "regexes": [
        "[a-z]{2}\\d{9}"
      ]
    }
  },
  "customModels": [
    "text"
  ],
  "generatorMetadata": {
    "ANY_ADDITIONAL_PROPERTY": {
      "version": "V1",
      "customGenerator": "Scramble",
      "swaps": {
        "ANY_ADDITIONAL_PROPERTY": "text"
      }
    }
  },
  "recordApiRequestOptions": {
    "record": true,
    "retentionTimeInHours": 1,
    "tags": [
      "text"
    ]
  },
  "customPiiEntityIds": [
    "text"
  ],
  "bulkText": [
    "text"
  ]
}
200

OK

{
  "bulkText": [
    "text"
  ],
  "bulkRedactedText": [
    "text"
  ],
  "usage": 1,
  "deIdentifyResults": [
    {
      "start": 1,
      "end": 1,
      "newStart": 1,
      "newEnd": 1,
      "label": "text",
      "text": "text",
      "newText": "text",
      "score": 1,
      "language": "text",
      "exampleRedaction": "text",
      "jsonPath": "text",
      "xmlPath": "text",
      "idx": 1
    }
  ]
}
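A short sketch of reading the bulk response follows. It assumes that `bulkText` and `bulkRedactedText` are parallel lists, so that the item at each position in one list corresponds to the item at the same position in the other. That ordering is an assumption, not something stated above, and the helper name and sample values are hypothetical.

```python
from typing import Any, Dict, List, Tuple


def pair_bulk_results(response: Dict[str, Any]) -> List[Tuple[str, str]]:
    """Pair each input string with its redacted counterpart by position."""
    originals = response["bulkText"]
    redacted = response["bulkRedactedText"]
    # Guard the positional assumption before zipping the two lists.
    if len(originals) != len(redacted):
        raise ValueError("bulkText and bulkRedactedText differ in length")
    return list(zip(originals, redacted))


# Hypothetical response; the redacted strings are made up.
sample_bulk = {
    "bulkText": ["My name is John", "I live in Atlanta"],
    "bulkRedactedText": ["My name is [GIVEN]", "I live in [CITY]"],
    "usage": 8,
    "deIdentifyResults": [],
}

for original, redacted in pair_bulk_results(sample_bulk):
    print(original, "->", redacted)
```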

Redact entities in JSON and preserve the JSON structure

post

Returns a modified version of the JSON that redacts or synthesizes the detected entity values. The redacted JSON has the same structure as the input JSON. Only the primitive JSON values, such as strings and numbers, are modified.

Body
all of · Optional
Responses
200

OK

application/json
POST /api/Redact/json HTTP/1.1
Host: 
Content-Type: application/json
Accept: */*
Content-Length: 744

{
  "generatorConfig": {
    "NAME_GIVEN": "Redaction",
    "NAME_FAMILY": "Redaction"
  },
  "generatorDefault": "Off",
  "docXImagePolicy": "Redact",
  "docXCommentPolicy": "Remove",
  "pdfSignaturePolicy": "Redact",
  "pdfSynthModePolicy": "V1",
  "docXTablePolicy": "Redact",
  "labelBlockLists": {
    "NAME_FAMILY": {
      "strings": [],
      "regexes": [
        ".*\\s(disease|syndrom|disorder)"
      ]
    }
  },
  "labelAllowLists": {
    "HEALTHCARE_ID": {
      "strings": [],
      "regexes": [
        "[a-z]{2}\\d{9}"
      ]
    }
  },
  "customModels": [
    "text"
  ],
  "generatorMetadata": {
    "ANY_ADDITIONAL_PROPERTY": {
      "version": "V1",
      "customGenerator": "Scramble",
      "swaps": {
        "ANY_ADDITIONAL_PROPERTY": "text"
      }
    }
  },
  "customPiiEntityIds": [
    "text"
  ],
  "jsonText": "{\"Name\": \"John Smith\", \"Description\": \"John lives in Atlanta, Ga.\"}",
  "jsonPathAllowLists": {
    "NAME_GIVEN": [
      "$.name.first"
    ]
  }
}
200

OK

{
  "originalText": "text",
  "redactedText": "text",
  "usage": 1,
  "deIdentifyResults": [
    {
      "start": 1,
      "end": 1,
      "newStart": 1,
      "newEnd": 1,
      "label": "text",
      "text": "text",
      "newText": "text",
      "score": 1,
      "language": "text",
      "exampleRedaction": "text",
      "jsonPath": "text",
      "xmlPath": "text",
      "idx": 1
    }
  ]
}
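Because `jsonText` is a string that contains JSON, a client typically serializes its document before it places it in the request body. The sketch below shows that step, along with a small check that two documents have the same structure and differ only in their primitive values, which is the property described above. Both helper names are hypothetical, and the redacted document is made up rather than taken from real output.

```python
import json
from typing import Any, Dict, List


def build_json_redact_body(document: Any,
                           allow_paths: Dict[str, List[str]]) -> Dict[str, Any]:
    """Serialize the document into jsonText and attach the JSON path allow lists."""
    return {"jsonText": json.dumps(document), "jsonPathAllowLists": allow_paths}


def same_shape(a: Any, b: Any) -> bool:
    """Return True when a and b share keys and list lengths at every level."""
    if isinstance(a, dict) and isinstance(b, dict):
        return a.keys() == b.keys() and all(same_shape(a[k], b[k]) for k in a)
    if isinstance(a, list) and isinstance(b, list):
        return len(a) == len(b) and all(same_shape(x, y) for x, y in zip(a, b))
    # Primitive values may change, but a primitive must stay a primitive.
    return not isinstance(a, (dict, list)) and not isinstance(b, (dict, list))


original = {"Name": "John Smith", "Description": "John lives in Atlanta, Ga."}
body = build_json_redact_body(original, {"NAME_GIVEN": ["$.name.first"]})

# Hypothetical redacted document, as it might look after json.loads(redactedText).
redacted = {"Name": "[NAME]", "Description": "[NAME] lives in [CITY]"}
print(same_shape(original, redacted))
```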

Redact entities in XML and preserve the XML structure

post

Returns a modified version of the XML that redacts or synthesizes the detected entity values. The redacted XML has the same structure as the input XML. Only the XML inner text and attribute values are modified.

Body
all of · Optional
Responses
200

OK

application/json
POST /api/Redact/xml HTTP/1.1
Host: 
Content-Type: application/json
Accept: */*
Content-Length: 825

{
  "generatorConfig": {
    "NAME_GIVEN": "Redaction",
    "NAME_FAMILY": "Redaction"
  },
  "generatorDefault": "Off",
  "docXImagePolicy": "Redact",
  "docXCommentPolicy": "Remove",
  "pdfSignaturePolicy": "Redact",
  "pdfSynthModePolicy": "V1",
  "docXTablePolicy": "Redact",
  "labelBlockLists": {
    "NAME_FAMILY": {
      "strings": [],
      "regexes": [
        ".*\\s(disease|syndrom|disorder)"
      ]
    }
  },
  "labelAllowLists": {
    "HEALTHCARE_ID": {
      "strings": [],
      "regexes": [
        "[a-z]{2}\\d{9}"
      ]
    }
  },
  "customModels": [
    "text"
  ],
  "generatorMetadata": {
    "ANY_ADDITIONAL_PROPERTY": {
      "version": "V1",
      "customGenerator": "Scramble",
      "swaps": {
        "ANY_ADDITIONAL_PROPERTY": "text"
      }
    }
  },
  "customPiiEntityIds": [
    "text"
  ],
  "xmlText": "\n            <note>\n            <to>Tove</to>\n            <from>Jani</from>\n            <heading>Reminder</heading>\n            <body>Don't forget me this weekend!</body>\n            </note>\n            "
}
200

OK

{
  "originalText": "text",
  "redactedText": "text",
  "usage": 1,
  "deIdentifyResults": [
    {
      "start": 1,
      "end": 1,
      "newStart": 1,
      "newEnd": 1,
      "label": "text",
      "text": "text",
      "newText": "text",
      "score": 1,
      "language": "text",
      "exampleRedaction": "text",
      "jsonPath": "text",
      "xmlPath": "text",
      "idx": 1
    }
  ]
}
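One way to confirm on the client side that the XML structure is preserved is to compare the element tags and attribute names of the input with those of the output. The following sketch does this with `xml.etree.ElementTree`. The helper names are hypothetical, and the redacted XML is a made-up illustration of a response, not real output.

```python
import xml.etree.ElementTree as ET


def skeleton(elem: ET.Element) -> tuple:
    """Describe an element by its tag, attribute names, and child skeletons."""
    return (elem.tag, sorted(elem.attrib), [skeleton(child) for child in elem])


def same_xml_structure(original: str, redacted: str) -> bool:
    """Return True when both documents have the same tags and attribute names."""
    return (skeleton(ET.fromstring(original.strip()))
            == skeleton(ET.fromstring(redacted.strip())))


original_xml = """
<note>
  <to>Tove</to>
  <from>Jani</from>
  <heading>Reminder</heading>
  <body>Don't forget me this weekend!</body>
</note>
"""

# Hypothetical redacted output: only the inner text differs.
redacted_xml = """
<note>
  <to>[NAME]</to>
  <from>[NAME]</from>
  <heading>Reminder</heading>
  <body>Don't forget me this weekend!</body>
</note>
"""

print(same_xml_structure(original_xml, redacted_xml))
```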

Redact entities in HTML and preserve the HTML structure

post

Returns a modified version of the HTML that redacts or synthesizes the detected entity values. The redacted HTML has the same structure as the input HTML. Only the text contained in the HTML elements is modified.

Body
all of · Optional
Responses
200

OK

application/json
POST /api/Redact/html HTTP/1.1
Host: 
Content-Type: application/json
Accept: */*
Content-Length: 830

{
  "generatorConfig": {
    "NAME_GIVEN": "Redaction",
    "NAME_FAMILY": "Redaction"
  },
  "generatorDefault": "Off",
  "docXImagePolicy": "Redact",
  "docXCommentPolicy": "Remove",
  "pdfSignaturePolicy": "Redact",
  "pdfSynthModePolicy": "V1",
  "docXTablePolicy": "Redact",
  "labelBlockLists": {
    "NAME_FAMILY": {
      "strings": [],
      "regexes": [
        ".*\\s(disease|syndrom|disorder)"
      ]
    }
  },
  "labelAllowLists": {
    "HEALTHCARE_ID": {
      "strings": [],
      "regexes": [
        "[a-z]{2}\\d{9}"
      ]
    }
  },
  "customModels": [
    "text"
  ],
  "generatorMetadata": {
    "ANY_ADDITIONAL_PROPERTY": {
      "version": "V1",
      "customGenerator": "Scramble",
      "swaps": {
        "ANY_ADDITIONAL_PROPERTY": "text"
      }
    }
  },
  "customPiiEntityIds": [
    "text"
  ],
  "htmlText": "\n            <!DOCTYPE html>\n            <html>\n            <body>\n            <h1>Account Information</h1>\n            <p>Account Holder: John Smith</p>\n            </body>\n            </html>\n            "
}
200

OK

{
  "originalText": "text",
  "redactedText": "text",
  "usage": 1,
  "deIdentifyResults": [
    {
      "start": 1,
      "end": 1,
      "newStart": 1,
      "newEnd": 1,
      "label": "text",
      "text": "text",
      "newText": "text",
      "score": 1,
      "language": "text",
      "exampleRedaction": "text",
      "jsonPath": "text",
      "xmlPath": "text",
      "idx": 1
    }
  ]
}

Upload dataset files

post

Upload a file to a dataset for processing.

Required Permissions

  • Dataset: Upload Files
Path parameters
datasetId · string · Required
Body
file · string · binary · Optional

File to upload

Responses
200

OK

application/json
POST /api/Dataset/{datasetId}/files/upload HTTP/1.1
Host: 
Content-Type: multipart/form-data
Accept: */*
Content-Length: 288

{
  "document": {
    "fileName": "example.txt",
    "csvConfig": {
      "numColumns": 1,
      "hasHeader": true,
      "escapeChar": "text",
      "quoteChar": "text",
      "delimiter": "text",
      "nullChar": "text"
    },
    "datasetId": "6a01360f-78fc-9f2f-efae-c5e1461e9c1et",
    "customPiiEntityIds": [
      "CUSTOM_ENTITY_1",
      "CUSTOM_ENTITY_2"
    ]
  },
  "file": "binary"
}
200

OK

{
  "updatedDataset": {
    "id": "text",
    "name": "text",
    "generatorMetadata": "asdfqwer",
    "outputFormat": "Original",
    "generatorSetup": "{\"NAME_GIVEN\":\"Redaction\", \"NAME_FAMILY\":\"Redaction\"}",
    "labelBlockLists": "{\"NAME_FAMILY\": {\"strings\":[],\"regexes\":[\".*\\\\s(disease|syndrom|disorder)\"]}}",
    "labelAllowLists": "{ \"HEALTHCARE_ID\": {\"strings\":[],\"regexes\":[\"[a-z]{2}\\\\d{9}\"]} }",
    "enabledModels": [
      "text"
    ],
    "tags": [
      "text"
    ],
    "files": [
      {
        "fileId": "text",
        "fileName": "text",
        "fileType": "text",
        "datasetId": "text",
        "numRows": 1,
        "numColumns": 1,
        "piiTypes": [
          "text"
        ],
        "wordCount": 1,
        "redactedWordCount": 1,
        "uploadedTimestamp": {},
        "fileSource": "Local",
        "processingStatus": "text",
        "processingError": "text",
        "mostRecentCompletedJobId": "text",
        "fileParseResultId": "text",
        "filePath": "text",
        "generatedFileStatus": "text"
      }
    ],
    "lastUpdated": {},
    "created": {},
    "creatorUser": {
      "id": "text",
      "userName": "text",
      "firstName": "text",
      "lastName": "text"
    },
    "docXImagePolicy": "Redact",
    "pdfSignaturePolicy": "Redact",
    "pdfSynthModePolicy": "V1",
    "docXCommentPolicy": "Remove",
    "docXTablePolicy": "Redact",
    "fileSource": "Local",
    "customPiiEntityIds": [
      "text"
    ],
    "operations": [
      "HasAccess"
    ],
    "rescanJobs": [
      {
        "id": "text",
        "status": "text",
        "errorMessages": "text",
        "startTime": {},
        "endTime": {},
        "publishedTime": {},
        "datasetFileId": "text",
        "datasetId": "text",
        "jobType": "DeidentifyFile"
      }
    ],
    "mostRecentExternalFileGenerationJob": {
      "id": "text",
      "status": "text",
      "errorMessages": "text",
      "startTime": {},
      "endTime": {},
      "publishedTime": {},
      "datasetFileId": "text",
      "datasetId": "text",
      "jobType": "DeidentifyFile"
    },
    "fileSourceExternalCredential": {
      "fileSource": "Local",
      "credential": {}
    },
    "awsCredentialSource": "text",
    "outputPath": "text",
    "externalFilesInfo": {
      "selectedFiles": [
        "text"
      ],
      "pathPrefixes": [
        "text"
      ],
      "selectedFileExtensions": [
        "text"
      ]
    }
  },
  "uploadedFileId": "text"
}
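The upload request is sent as `multipart/form-data`. As a rough sketch, the body can be assembled by hand with the Python standard library, as shown below. The part names `document` and `file` come from the request example above; sending `document` as a JSON part, the content type of each part, and the helper name `build_upload_body` are assumptions for illustration.

```python
import json
import uuid
from typing import Tuple


def build_upload_body(file_name: str, file_bytes: bytes,
                      boundary: str) -> Tuple[bytes, str]:
    """Return the multipart body and the matching Content-Type header value."""
    document = json.dumps({"fileName": file_name})
    head = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="document"\r\n'
        "Content-Type: application/json\r\n"
        "\r\n"
        f"{document}\r\n"
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{file_name}"\r\n'
        "Content-Type: application/octet-stream\r\n"
        "\r\n"
    ).encode("utf-8")
    # The closing boundary ends with two extra hyphens.
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + file_bytes + tail, f"multipart/form-data; boundary={boundary}"


body, content_type = build_upload_body("example.txt", b"hello", uuid.uuid4().hex)
print(content_type)
print(len(body), "bytes")
```

A random boundary keeps the delimiter from colliding with the file content. The returned header value must be sent as the `Content-Type` of the request.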

Download a dataset file

get

Downloads the specified file from the dataset. The downloaded file is redacted based on the dataset configuration.

Required Permissions

  • Dataset: Download Redacted Files
Path parameters
datasetId · string · Required
fileId · string · Required
Responses
200

OK

application/octet-stream
Response · string · binary
400

Bad Request

404

Not Found

409

Conflict

500

Internal Server Error

GET /api/Dataset/{datasetId}/files/{fileId}/download HTTP/1.1
Host: 
Accept: */*
binary

Download all dataset files

get

Downloads all files from the specified dataset. The downloaded files are redacted based on the dataset configuration.

Required Permissions

  • Dataset: Download Redacted Files
Path parameters
datasetId · string · Required
Responses
200

OK

application/json
Response · string · binary
400

Bad Request

application/json
404

Not Found

application/json
500

Internal Server Error

GET /api/Dataset/{datasetId}/files/download_all HTTP/1.1
Host: 
Accept: */*
binary
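Both download endpoints take identifiers as path parameters. The sketch below builds the two request paths and percent-encodes the identifiers so that unexpected characters cannot alter the path. The function names are hypothetical, and the dataset identifier in the usage lines is a made-up value.

```python
from urllib.parse import quote


def file_download_path(dataset_id: str, file_id: str) -> str:
    """Path for downloading one redacted file from a dataset."""
    return (f"/api/Dataset/{quote(dataset_id, safe='')}"
            f"/files/{quote(file_id, safe='')}/download")


def all_files_download_path(dataset_id: str) -> str:
    """Path for downloading every redacted file in a dataset."""
    return f"/api/Dataset/{quote(dataset_id, safe='')}/files/download_all"


# Hypothetical identifiers for illustration.
print(file_download_path("dataset-1", "file-1"))
print(all_files_download_path("dataset-1"))
```

Because the response is binary, write it to disk in binary mode (`open(path, "wb")`) rather than decoding it as text.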

Lists all users in your organization

get

Required Permissions

  • Global: View Users And Groups

Responses
200

OK

GET /api/Users HTTP/1.1
Host: 
Accept: */*
200

OK

[
  {
    "id": "text",
    "userName": "text",
    "firstName": "text",
    "lastName": "text",
    "organizationId": "text",
    "photoMetadata": {
      "name": "text",
      "url": "text",
      "fileType": "text",
      "content": "Ynl0ZXM=",
      "isManualUpload": true
    },
    "accountMetadata": {
      "createdAt": {},
      "lastActivityDate": {}
    }
  }
]

Lists all groups in your organization

get

Required Permissions

  • Global: View Users And Groups

Responses
200

OK

GET /api/Groups HTTP/1.1
Host: 
Accept: */*
200

OK

[
  {
    "id": "text",
    "userName": "text",
    "context": "None"
  }
]

Retrieve a list of permission sets.

get

Users who have the global permission Manage Permission Sets receive the full details, including the available operations.

Required Permissions

  • Global (At least 1 of the following): Manage Permission Sets, Manage User Global Permissions

Query parameters
Responses
200

OK

400

Bad Request

GET /api/permission-sets HTTP/1.1
Host: 
Accept: */*
200

OK

[
  {
    "id": "text",
    "type": "Global",
    "name": "text",
    "isBuiltIn": true,
    "isDefault": true,
    "isDisabled": true,
    "lastModifiedDate": {},
    "operations": [
      1
    ],
    "createdDate": {},
    "lastModifiedByUserId": "text"
  }
]

Get all user and group permissions granted to a dataset.

get

Required Permissions

  • Dataset: Share

Path parameters
datasetId · string · Required
Responses
200

OK

GET /api/dataset/{datasetId}/shares HTTP/1.1
Host: 
Accept: */*
200

OK

[
  {
    "id": "text",
    "permissionSetId": "text",
    "sharedWithUser": {
      "id": "text",
      "userName": "text",
      "firstName": "text",
      "lastName": "text"
    },
    "sharedWithGroup": {
      "id": "text",
      "userName": "text",
      "context": "None"
    },
    "shareableEntityType": "User",
    "resourceId": "text"
  }
]

Modify the permissions assigned to a dataset.

post

Required Permissions

  • Dataset: Share

Path parameters
datasetId · string · Required

The ID of the dataset

Body
all of · Optional

A request to modify the permission assignments for a dataset.

Responses
200

OK

POST /api/dataset/{datasetId}/shares/bulk HTTP/1.1
Host: 
Content-Type: application/json
Accept: */*
Content-Length: 182

{
  "grant": [
    {
      "sharedWithUserId": "text",
      "sharedWithGroupId": "text",
      "permissionSetId": "text"
    }
  ],
  "revoke": [
    {
      "sharedWithUserId": "text",
      "sharedWithGroupId": "text",
      "permissionSetId": "text"
    }
  ]
}
200

OK

{
  "granted": [
    {
      "id": "text",
      "permissionSetId": "text",
      "sharedWithUser": {
        "id": "text",
        "userName": "text",
        "firstName": "text",
        "lastName": "text"
      },
      "sharedWithGroup": {
        "id": "text",
        "userName": "text",
        "context": "None"
      },
      "shareableEntityType": "User",
      "resourceId": "text"
    }
  ],
  "revoked": [
    {
      "id": "text",
      "permissionSetId": "text",
      "sharedWithUser": {
        "id": "text",
        "userName": "text",
        "firstName": "text",
        "lastName": "text"
      },
      "sharedWithGroup": {
        "id": "text",
        "userName": "text",
        "context": "None"
      },
      "shareableEntityType": "User",
      "resourceId": "text"
    }
  ]
}
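A small sketch of assembling the `grant` and `revoke` lists for this request follows. It assumes that each entry targets either a user or a group, but not both, and that the unused identifier can be omitted. Neither point is confirmed above, so treat the validation as illustrative. The helper names and identifier values are hypothetical.

```python
from typing import Any, Dict, Iterable, Optional


def share_entry(permission_set_id: str, user_id: Optional[str] = None,
                group_id: Optional[str] = None) -> Dict[str, str]:
    """Build one grant or revoke entry for a single user or a single group."""
    # Assumption: exactly one of user_id and group_id is supplied.
    if (user_id is None) == (group_id is None):
        raise ValueError("provide exactly one of user_id or group_id")
    entry = {"permissionSetId": permission_set_id}
    if user_id is not None:
        entry["sharedWithUserId"] = user_id
    else:
        entry["sharedWithGroupId"] = group_id
    return entry


def build_share_request(grant: Iterable[Dict[str, str]] = (),
                        revoke: Iterable[Dict[str, str]] = ()) -> Dict[str, Any]:
    """Combine the entries into the request body."""
    return {"grant": list(grant), "revoke": list(revoke)}


share_body = build_share_request(
    grant=[share_entry("editor-set", user_id="user-1")],
    revoke=[share_entry("viewer-set", group_id="group-1")],
)
print(share_body)
```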

Group similar entities together

post

Groups entities together based on the original text.

Body
all of · Optional
Responses
200

OK

POST /api/Synthesis/group HTTP/1.1
Host: 
Content-Type: application/json
Accept: */*
Content-Length: 418

{
  "entities": [
    {
      "start": 1,
      "end": 1,
      "pythonStart": 1,
      "pythonEnd": 1,
      "label": "text",
      "text": "text",
      "score": 1,
      "language": "text",
      "exampleRedaction": "text",
      "head": "text",
      "tail": "text",
      "subNerEntities": [
        {
          "start": 1,
          "end": 1,
          "pythonStart": 1,
          "pythonEnd": 1,
          "label": "text",
          "text": "text",
          "score": 1,
          "language": "text",
          "exampleRedaction": "text",
          "head": "text",
          "tail": "text",
          "subNerEntities": "[Circular Reference]"
        }
      ]
    }
  ],
  "original_text": "text"
}
200

OK

{
  "groups": [
    {
      "representative": "text",
      "pii_type": "NUMERIC_VALUE",
      "entities": [
        {
          "start": 1,
          "end": 1,
          "pythonStart": 1,
          "pythonEnd": 1,
          "label": "text",
          "text": "text",
          "score": 1,
          "language": "text",
          "exampleRedaction": "text",
          "head": "text",
          "tail": "text",
          "subNerEntities": [
            "[Circular Reference]"
          ]
        }
      ]
    }
  ]
}
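As an illustration of reading the grouping response, the sketch below maps the representative value of each group to the text of the entities in that group. The helper name and the sample response are hypothetical; only the field names come from the response shown above.

```python
from typing import Any, Dict, List


def texts_by_representative(response: Dict[str, Any]) -> Dict[str, List[str]]:
    """Map each group's representative to the text of its member entities."""
    grouped: Dict[str, List[str]] = {}
    for group in response.get("groups", []):
        members = grouped.setdefault(group["representative"], [])
        members.extend(entity["text"] for entity in group["entities"])
    return grouped


# Hypothetical response; the values are made up for illustration.
sample_groups = {
    "groups": [
        {"representative": "John Smith", "pii_type": "NAME_GIVEN",
         "entities": [{"start": 0, "end": 10, "label": "NAME_GIVEN",
                       "text": "John Smith"},
                      {"start": 20, "end": 24, "label": "NAME_GIVEN",
                       "text": "John"}]},
    ]
}

print(texts_by_representative(sample_groups))
```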