The Tonic Textual SDK is a Python SDK that you can use to parse and redact text and files.
To install the Tonic Textual Python SDK, run:
pip install tonic-textual
Tonic Textual supports languages in addition to English. Textual automatically detects the language and applies the correct model.
On Textual Cloud, we support the following languages:
Chinese
German
Greek
Italian
Japanese
Norwegian (Bokmål)
Russian
Spanish
Ukrainian
On a self-hosted instance, to enable support for any languages other than English, you must:
Configure Textual to support multiple languages
Provide the language models for the languages to support
Self-hosted customers can enable support for the following languages:
Catalan
Chinese
Croatian
Danish
Dutch
Finnish
French
German
Greek
Italian
Japanese
Korean
Lithuanian
Macedonian
Norwegian (Bokmål)
Polish
Portuguese
Romanian
Russian
Slovenian
Spanish
Swedish
Ukrainian
To enable support for languages other than English, set the environment variable TEXTUAL_MULTI_LINGUAL=true.
The setting is used by the machine learning container.
For the languages to add, you must provide the language model assets for Textual to use.
By default, Textual looks for model assets in the machine learning container, in /usr/bin/textual/language_models. The default Helm and Docker Compose configurations include the volume mount.
To choose a different location, set the environment variable TEXTUAL_LANGUAGE_MODEL_DIRECTORY. Note that if you change the location, you must also modify your volume mounts.
For help with installing model assets, contact Tonic.ai support (support@tonic.ai).
Tonic Textual comes with built-in models to identify a range of sensitive values, such as:
Locations and addresses
Names of people and organizations
Identifiers and account numbers
The built-in entity types are:
CC Exp (CC_EXP) - The expiration date of a credit card.
Credit Card (CREDIT_CARD) - A credit card number.
CVV (CVV) - The card verification value for a credit card.
Date Time (DATE_TIME) - A date or timestamp.
DOB (DOB) - A person's date of birth.
Email Address (EMAIL_ADDRESS) - An email address.
Event (EVENT) - The name of an event.
Gender Identifier (GENDER_IDENTIFIER) - An identifier of a person's gender.
Healthcare Identifier (HEALTHCARE_ID) - An identifier associated with healthcare, such as a patient number.
IBAN Code (IBAN_CODE) - An international bank account number used to identify an overseas bank account.
IP Address (IP_ADDRESS) - An IP address.
Language (LANGUAGE) - The name of a spoken language.
Law (LAW) - A title of a law.
Location (LOCATION) - A value related to a location. Can include any part of a mailing address.
Occupation (OCCUPATION) - A job title or profession.
Street Address (LOCATION_ADDRESS) - A street address.
City (LOCATION_CITY) - The name of a city.
State (LOCATION_STATE) - A state name or abbreviation.
Zip (LOCATION_ZIP) - A postal code.
Medical License (MEDICAL_LICENSE) - The identifier of a medical license.
Money (MONEY) - A monetary value.
Given Name (NAME_GIVEN) - A given name or first name.
Family Name (NAME_FAMILY) - A family name or surname.
NRP (NRP) - A nationality, religion, or political group.
Numeric Value (NUMERIC_VALUE) - A numeric value.
Organization (ORGANIZATION) - The name of an organization.
Password (PASSWORD) - A password associated with an account.
Person Age (PERSON_AGE) - The age of a person.
Phone Number (PHONE_NUMBER) - A telephone number.
Product (PRODUCT) - The name of a product.
Project Name (PROJECT_NAME) - The name of a project.
URL (URL) - A URL to a web page.
US Bank Number (US_BANK_NUMBER) - The routing number of a bank in the United States.
US ITIN (US_ITIN) - An Individual Taxpayer Identification Number in the United States.
US Passport (US_PASSPORT) - A United States passport identifier.
US SSN (US_SSN) - A United States Social Security number.
Username (USERNAME) - A username associated with an account.
Tonic Textual provides a single tool to allow you to put your file-based data to work for you.
You can use Textual datasets to redact sensitive values, to produce files in the same format to use for development and training. Each original file becomes an output file in the same format, with the sensitive values replaced.
The Textual pipeline option allows you to prepare unstructured text for use in an LLM system. Textual extracts the text from each file and then produces Markdown-formatted output. You can optionally replace sensitive values in the output, to prevent data leakage from your LLM.
Need help with Textual? Contact support@tonic.ai.
When you sign up for a Tonic Textual account, you can immediately get started with a new pipeline.
Note that these instructions are for setting up a new account on Textual Cloud. For a self-hosted instance, depending on how it is set up, you might either create an account manually or use single sign-on (SSO).
To get started with a new Textual account:
Go to https://textual.tonic.ai/.
Click Sign up.
Enter your email address.
Create and confirm a password for your Textual account.
Click Sign Up.
Textual creates your account. After you log in, Textual prompts you to provide some additional information about yourself and how you plan to use Textual.
After you fill out the information and click Get Started, Textual displays the Textual Home page, which you can use to preview how Textual detects and replaces values. For more information, go to Previewing Textual detection and redaction.
After you set up an account on Textual Cloud, you start a Textual free trial, during which Textual scans up to 100,000 words for free. Note that Textual counts actual words, not tokens. For example, "Hello, my name is John Smith." counts as six words.
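To estimate roughly how much of the free trial allowance a piece of text would consume, the following sketch counts whitespace-separated words. The splitting rule is an assumption made for illustration, because the text above does not describe exactly how Textual delimits words. The names count_words, remaining_words, and FREE_TRIAL_WORD_LIMIT are hypothetical and are not part of Textual.

```python
FREE_TRIAL_WORD_LIMIT = 100_000  # free trial allowance stated above


def count_words(text: str) -> int:
    """Count whitespace-separated words (assumed approximation of Textual's count)."""
    return len(text.split())


def remaining_words(texts, limit: int = FREE_TRIAL_WORD_LIMIT) -> int:
    """Estimate how many words are left after scanning the given texts."""
    used = sum(count_words(t) for t in texts)
    return max(limit - used, 0)


# The example sentence from the text above counts as six words.
print(count_words("Hello, my name is John Smith."))
```

Because punctuation stays attached to the neighboring word, this estimate matches the six-word example above, but it might differ from the actual count for other inputs.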
After the 100,000 words, Textual disables scanning for your account. Until you purchase a pay-as-you-go subscription, you cannot:
Add files to a dataset or pipeline
Run a pipeline
During your free trial, Textual displays the current usage in the following locations:
On the Home page
In the navigation menu
On the Playground
Textual also prompts you to purchase a pay-as-you-go subscription, which allows an unlimited number of words scanned for a flat rate per 1,000 words.
You can also request a Textual product demo.
Whenever you call the Textual SDK, you first instantiate the SDK client.
To work with Textual datasets, or to redact individual files, you instantiate TonicTextual.
To work with Textual pipelines and parsing, you instantiate TonicTextualParse.
If the API key is configured as the value of the TONIC_TEXTUAL_API_KEY environment variable, then you do not need to provide the API key when you instantiate the SDK client.
For Textual pipelines:
For Textual datasets:
For Textual pipelines:
For Textual datasets:
To be able to use the Textual SDK, you must have an API key.
You manage keys from the User API Keys page.
To display the User API Keys page, in the Textual navigation menu, click User API Keys.
To create a Textual API key:
Either:
In the API keys panel on the dataset details page, click Create an API Key.
On the pipeline details page, click the API key creation icon.
On the User API Keys page, click Create API Key.
In the Name field, type a name to use to identify the key.
Click Create API Key.
Textual displays the key value, and prompts you to copy the key. If you do not copy the key and save it to a file, you will not have access to the key. To copy the key, click the copy icon.
To revoke a Textual API key, on the User API Keys page, click the Revoke option for the key to revoke.
You cannot instantiate the SDK client without an API key.
You can use the Tonic Textual SDK to manage pipelines and to redact individual strings and files.
Textual uses datasets to produce files with sensitive values replaced.
Before you perform these tasks, remember to instantiate the SDK client.
To create a new dataset and then upload a file to it, use .
To add a file to the dataset, use . To identify the file, provide the file path and name.
To provide the file as IO bytes, you provide the file name and the file bytes. You do not provide a path.
Textual creates the dataset, scans the uploaded file, and redacts the detected values.
To change the configuration of a dataset, use dataset.edit.
You can use dataset.edit to change:
The name of the dataset
The
The response includes:
The name and identifier of the dataset
The number of files in the dataset
The number of files that are waiting to be processed (scanned and redacted)
The number of files that had errors during processing
For example:
To get a list of files that have a specific status, use the following:
The file list includes:
File identifier and name
Number of rows and columns
Processing status
For failed files, the error
When the file was uploaded
For example:
The response looks something like:
Before you perform these tasks, remember to instantiate the SDK client.
You can use the Tonic Textual SDK to redact individual strings, including:
Plain text strings
JSON content
XML content
For a text string, you can also request synthesized values from a large language model (LLM).
The redaction request can include the .
The response includes the redacted or synthesized content and details about the detected entity values.
To send a plain text string for redaction, use the redact method:
For example:
For example:
redact_json ensures that only the values are redacted. It ignores the keys.
Here is a basic example of a JSON redaction request:
It produces the following JSON output:
When you redact a JSON string, you can optionally assign specific entity types to selected JSON paths.
To do this, you include the jsonpath_allow_lists parameter. Each entry consists of an entity type and a list of JSON paths for which to always use that entity type. Each JSON path must point to a simple string or numeric value.
The specified entity type overrides both the detected entity type and any added or excluded values.
In the following example, the value of the key1 node is always treated as a telephone number:
It produces the following redacted output:
redact_xml ensures that only the values are redacted. It ignores the XML markup.
For example:
Produces the following XML output:
redact_html ensures that only the values are redacted. It ignores the HTML markup.
For example:
Produces the following HTML output:
You can also request synthesized values from a large language model (LLM).
When you use this process, Textual first identifies the sensitive values in the text. It then sends the value locations and redacted values to the LLM. For example, if Textual identifies a product name, it sends the location and the redacted value PRODUCT to the LLM. Textual does not send the original values to the LLM.
The LLM then generates realistic synthesized values of the appropriate value types.
For example:
The response provides the redacted or synthesized version of the string, and the list of detected entity values.
For each redacted item, the response includes:
The location of the value in the original text (start and end)
The location of the value in the redacted version of the string (new_start and new_end)
The entity type (label)
The original value (text)
The replacement value (new_text). new_text is null in the following cases:
The entity type is ignored
The response is from llm_synthesis
A score to indicate confidence in the detection and redaction (score)
The detected language for the value (language)
For responses from textual.redact_json, the JSON path to the entity in the original document (json_path)
For responses from textual.redact_xml, the XPath to the entity in the original XML document (xml_path)
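To make the relationship between start/end and new_start/new_end concrete, here is a minimal sketch that does not call Textual. It replaces detected spans in a string with tokens and tracks both sets of offsets. The Replacement class and apply_tokens function are hypothetical stand-ins that only mirror the field names listed above, and the bracketed token format is an assumption for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Replacement:
    """Hypothetical stand-in that mirrors the response fields listed above."""
    start: int
    end: int
    new_start: int
    new_end: int
    label: str
    text: str
    new_text: Optional[str]


def apply_tokens(original: str, detections: List[Tuple[int, int, str]]):
    """Replace each (start, end, label) span with a token built from the label.

    Returns the redacted string and one Replacement per detection.
    """
    pieces = []
    results = []
    cursor = 0   # current position in the original string
    out_len = 0  # length of the redacted string built so far
    for start, end, label in sorted(detections):
        unchanged = original[cursor:start]
        token = f"[{label}]"  # assumed token format, for illustration only
        pieces.append(unchanged)
        out_len += len(unchanged)
        results.append(
            Replacement(start, end, out_len, out_len + len(token),
                        label, original[start:end], token)
        )
        pieces.append(token)
        out_len += len(token)
        cursor = end
    pieces.append(original[cursor:])
    return "".join(pieces), results


text = "My name is John and I work at Acme."
spans = [
    (text.index("John"), text.index("John") + 4, "NAME_GIVEN"),
    (text.index("Acme"), text.index("Acme") + 4, "ORGANIZATION"),
]
redacted, items = apply_tokens(text, spans)
print(redacted)
for item in items:
    # start/end index the original text; new_start/new_end index the redacted text.
    assert text[item.start:item.end] == item.text
    assert redacted[item.new_start:item.new_end] == item.new_text
```

The two assertions in the loop are the useful part: the first pair of offsets always slices the original string, and the second pair always slices the redacted string.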
If the API key is not configured as the value of the TONIC_TEXTUAL_API_KEY environment variable, then you must include the API key in the request.
Instead of providing the key every time you call the Textual API, you can configure the API key as the value of the TONIC_TEXTUAL_API_KEY environment variable.
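The lookup rule described here (use an explicitly provided key, otherwise fall back to TONIC_TEXTUAL_API_KEY) can be sketched in application code as follows. This is an illustrative helper, not part of the Textual SDK. The function name resolve_api_key is hypothetical, and the SDK's own internal lookup might differ in detail.

```python
import os
from typing import Optional

API_KEY_ENV_VAR = "TONIC_TEXTUAL_API_KEY"  # variable name from the text above


def resolve_api_key(explicit_key: Optional[str] = None) -> str:
    """Prefer an explicitly passed key, then fall back to the environment variable."""
    key = explicit_key or os.environ.get(API_KEY_ENV_VAR)
    if not key:
        # Fail early with a clear message instead of at the first API call.
        raise ValueError(
            f"No API key was provided and {API_KEY_ENV_VAR} is not set."
        )
    return key
```

A helper like this lets a script fail with a clear message before it attempts to instantiate the SDK client.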
To get the current status of the files in the current dataset, use :
To delete a file from a dataset, use .
To get the redacted content in JSON format for a dataset, use :
The redact call provides an option to record the request, to allow you to preview the results in the Textual application. For more information, go to .
To send multiple plain text strings for redaction, use :
To send a JSON string for redaction, use textual.redact_json. You can send the JSON content as a JSON string or a Python dictionary.
To send an XML string for redaction, use textual.redact_xml.
To send an HTML string for redaction, use redact_html.
To send text to an LLM, use llm_synthesis.
Getting started with Textual
Sign up for a Textual account. Create your first pipeline.
Datasets workflow - File redaction and synthesis
Use Textual to replace sensitive values in files.
Pipelines workflow - LLM preparation
Use Textual to prepare file content for use in an LLM system.
Datasets and redaction
Use the Textual Python SDK to redact text and manage datasets. Review redaction requests in the Request Explorer.
Pipelines and parsing
Use the Textual Python SDK to parse text and manage pipelines.
Snowflake Native App
Use the Snowflake Native App to redact values in your data warehouse.
By default, when you:
Configure a dataset
Redact a string
Retrieve a redacted file
Textual does the following:
For the string and file redaction, replaces detected values with tokens.
For LLM synthesis, generates realistic synthesized values.
When you make the request, you can override the default behavior.
For each entity type, you can choose to redact, synthesize, or ignore the value.
When you redact a value, Textual replaces the value with a token that consists of the entity type. For example, ORGANIZATION.
When you synthesize a value, Textual replaces the value with a different realistic value.
When you ignore a value, Textual passes through the original value.
To specify the handling option for entity types, you use the generator_config parameter.
Where:
<entity_type> is the identifier of the entity type. For example, ORGANIZATION. For the list of built-in entity types that Textual scans for, go to Entity types that Textual detects.
<handling_option> is the handling option to use for the specified entity type. The possible values are Redact, Synthesis, and Off.
For example, to synthesize organization values, and ignore languages:
For string and file redaction, you can specify a default handling option to use for entity types that are not specified in generator_config.
To do this, you use the generator_default parameter.
generator_default can be either Redact, Synthesis, or Off.
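As an illustration of how generator_config and generator_default combine, the sketch below builds a configuration that synthesizes organization values and ignores languages, then resolves the handling option for any entity type. It assumes that generator_config is a dictionary from entity type to handling option, and it uses the option values listed above. The handling_for helper is hypothetical and only models the fallback rule; it does not call Textual.

```python
from typing import Dict

# Handling option values as listed in the text above.
VALID_OPTIONS = {"Redact", "Synthesis", "Off"}

# Synthesize organization values and ignore languages.
generator_config: Dict[str, str] = {
    "ORGANIZATION": "Synthesis",
    "LANGUAGE": "Off",
}

# Applies to every entity type that generator_config does not mention.
generator_default = "Redact"


def handling_for(entity_type: str, config: Dict[str, str], default: str) -> str:
    """Return the handling option for an entity type, falling back to the default."""
    option = config.get(entity_type, default)
    if option not in VALID_OPTIONS:
        raise ValueError(f"Unknown handling option: {option!r}")
    return option


for entity in ("ORGANIZATION", "LANGUAGE", "US_SSN"):
    print(entity, handling_for(entity, generator_config, generator_default))
```

Validating the option values before you send a request is a design choice of this sketch; it catches misspelled options locally.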
You can also configure added and excluded values for each entity type.
You add values that Textual does not detect for an entity type, but should. You exclude values that you do not want Textual to identify as that entity type.
To specify the added values, use label_allow_lists.
To specify the excluded values, use label_block_lists.
For each of these parameters, the value is a list of entity types to specify the added or excluded values for. To specify the values, you provide an array of regular expressions.
The following example uses label_allow_lists to add values:
For NAME_GIVEN, adds the values There and Here.
For NAME_FAMILY, adds values that match the regular expression ([a-z]{2}).
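To preview which substrings a set of added-value expressions would pick up before you send a request, you can test the expressions locally with Python's re module. The sketch below assumes that label_allow_lists maps each entity type to a list of regular expressions, as described above. The added_matches helper is hypothetical, and Textual's own matching rules (for example, case sensitivity or overlap handling) might differ.

```python
import re
from typing import Dict, List, Tuple

# Entity type -> regular expressions for values to add, as in the example above.
label_allow_lists: Dict[str, List[str]] = {
    "NAME_GIVEN": ["There", "Here"],
    "NAME_FAMILY": ["([a-z]{2})"],
}


def added_matches(text: str, patterns: List[str]) -> List[Tuple[int, int, str]]:
    """Return (start, end, matched_text) for every pattern match, ordered by position."""
    found = []
    for pattern in patterns:
        for match in re.finditer(pattern, text):
            found.append((match.start(), match.end(), match.group(0)))
    return sorted(found)


sample = "Here and There"
for entity_type, patterns in label_allow_lists.items():
    print(entity_type, added_matches(sample, patterns))
```

A local check like this makes it easy to see that a broad expression such as ([a-z]{2}) matches many short fragments, which is usually a sign that the expression should be narrowed.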
You can use the Textual SDK to parse individual files, either from a local file system or from an S3 bucket.
Textual returns a FileParseResult object for each parsed file. The FileParseResult object is a wrapper around the output JSON for the processed file.
To parse a single file from a local file system, use textual.parse_file:
You must use rb access mode to read the file. rb access mode opens the file to be read in binary format.
You can also set a timeout in seconds for the parsing. You can add the timeout as a parameter of the parse_file command. To set a timeout to use for all parsing, set the environment variable TONIC_TEXTUAL_PARSE_TIMEOUT_IN_SECONDS.
You can also parse files that are stored in Amazon S3. Because this process uses the boto3 library to fetch the file from Amazon S3, you must first set up the correct AWS credentials.
To parse a file from an S3 bucket, use textual.parse_s3_file:
Textual pipelines can process the following types of files:
txt
csv
tsv
docx
xlsx
png
tif or tiff
jpg or jpeg
eml
msg
The TEXTUAL_ML_WORKERS environment variable specifies the number of workers to use within the textual-ml container. The default value is 1.
Having multiple workers allows for parallelization of inferences with NER models.
When you deploy Textual with Kubernetes on GPUs, parallelization allows the textual-ml container to fully utilize the GPU.
We recommend 6GB of GPU RAM for each worker.
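The sizing recommendation above reduces to simple arithmetic. The following sketch computes the recommended GPU RAM for a given worker count, and the largest worker count that fits in a given amount of GPU RAM. The function names are hypothetical, and the sketch considers only the per-worker figure stated above.

```python
GPU_RAM_PER_WORKER_GB = 6  # recommendation stated above


def recommended_gpu_ram_gb(workers: int) -> int:
    """GPU RAM to provision for the given TEXTUAL_ML_WORKERS value."""
    if workers < 1:
        raise ValueError("The worker count must be at least 1.")
    return workers * GPU_RAM_PER_WORKER_GB


def max_workers_for(gpu_ram_gb: float) -> int:
    """Largest worker count that stays within the recommendation (can be 0)."""
    return int(gpu_ram_gb // GPU_RAM_PER_WORKER_GB)


print(recommended_gpu_ram_gb(4))  # 4 workers * 6GB
print(max_workers_for(16))        # 16GB leaves room for 2 workers
```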
When you use the redact method to redact a plain text string, you can also choose to record the request.
The recorded requests are encrypted.
When you make the request, you specify the number of hours to keep the recorded request. After that amount of time elapses, the request is completely purged. Recorded requests are never kept more than 720 hours, regardless of the configured retention time.
From the Request Explorer, you can review your recorded requests to check the results and assess the quality of the redaction. You can also test changes to the redaction configuration.
You cannot view requests from other users.
To record a redaction request, you include the record_options argument:
The record_options argument includes the following parameters:
record - Whether to record the request. The default is False. To record the request, set record to True.
retention_time_in_hours - The number of hours to preserve the recorded request. The default is 1. After the retention time elapses, the request is purged completely.
tags - A list of tags to assign to the request. The tags are mostly intended to make it easier to search for requests on the Request Explorer page.
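The defaults and the 720-hour cap described above can be modeled as follows. RecordSettings is a hypothetical local class that mirrors the three record_options parameters. It is not the SDK's own options class, and the sketch only shows how the effective retention follows from the stated rules.

```python
from dataclasses import dataclass, field
from typing import List

MAX_RETENTION_HOURS = 720  # recorded requests are never kept longer than this


@dataclass
class RecordSettings:
    """Hypothetical container that mirrors the record_options parameters."""
    record: bool = False               # default: do not record
    retention_time_in_hours: int = 1   # default retention
    tags: List[str] = field(default_factory=list)

    def effective_retention_hours(self) -> int:
        """Hours the request is actually kept, after applying the cap."""
        if not self.record:
            return 0  # nothing is recorded, so nothing is retained
        return min(self.retention_time_in_hours, MAX_RETENTION_HOURS)


settings = RecordSettings(record=True, retention_time_in_hours=1000, tags=["qa"])
print(settings.effective_retention_hours())  # capped at MAX_RETENTION_HOURS
```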
The Request Explorer page in Textual contains the list of requests that you recorded and that are not yet purged. You cannot view requests from other users.
To display the Request Explorer page, in the Textual navigation bar, click Request Explorer.
For each request, the list includes:
A 255-character preview of the text that was sent for redaction.
The tags assigned to the request.
The date when the request will be purged.
You can search for a request based on text that is contained in the redacted text, and by the tags that you assigned to the request.
To search by text from the string, in the search field, begin to type the text.
To search by an assigned tag, in the search field, type tags: followed by the tag to search for.
From the request list, to view the results of a request, click the request row.
By default, the preview uses Identification view. For each detected entity, Identification view displays the value and the entity type.
To instead display only the replacement value, which by default is the entity type, click Replacement.
From the preview, you can test how the results change when you:
Change the handling option for entity types.
Add and exclude values for entity types.
To display the edit panel, from the request preview page, click Edit.
The Edit Request panel displays the full list of the available entity types.
You can change how Textual handles detected entity values for each entity type.
Note that the handling option changes are not saved when you close the preview and return to the requests list.
The handling options are:
Off - Indicates to ignore values for this entity type.
Redact - This is the default option. Indicates to replace each value with a token that represents the entity type.
Synthesize - Indicates to replace each value with a realistic replacement value.
To change the handling option for a single entity type, either:
Click the handling option value for the entity type, then select the handling option.
Click the entity type, then under Generator, click the handling option.
To select the same handling option for all of the entity types:
Click Bulk Edit.
From the Bulk Edit dropdown list, select the handling option.
To configure added and excluded values for an entity type, click the entity type.
The Edit Request panel expands to display the Add to detection and Exclude from detection lists.
You use the Add to detection list to configure regular expressions to identify additional values to detect as the selected entity type.
You use the Exclude from detection list to configure regular expressions to identify values to not detect as the selected entity type.
Note that the added and excluded values are not saved when you close the preview and return to the requests list.
To create a regular expression for added or excluded values:
Click the Add regex option for that list.
In the field, provide a regular expression to identify values to add or exclude.
Press Enter.
To edit a regular expression:
Click the edit icon for the expression.
In the field, edit the expression.
Click the save icon.
To delete a regular expression, click the delete icon for that expression.
When an entity type has added values, the added values icon displays for that entity type.
When an entity type has excluded values, the excluded values icon displays for that entity type.
To replay the request based on the current configuration, click Replay.
When you replay the request, in addition to the Identification and Replacement options, you use the Diff toggle to indicate whether to compare the original and new results.
For our example, we made the following changes to the configuration:
For Given Name and Family Name, changed the handling option to Synthesize.
For Credit Card, indicated to ignore the value 41111111111.
When the Diff toggle is in the off position, Identification view only reflects changes to the added and excluded values.
In our example, we configured 41111111111 to not be detected as a credit card number. In the replayed request, it is instead detected as a numeric value.
Replacement view reflects both the added and excluded values and the changes to the handling option.
For our example, in addition to the entity type change for the credit card number 41111111111, the given and family names are now realistic replacement values instead of the entity types.
When you set the Diff toggle to the on position, the preview displays the original content to the left, and the modified content to the right.
In Identification view, you can see the changes to the entity detection based on the added and excluded values.
In Replacement view, you can also see the changes to the selected handling options for the entity types.
To clear all of the regular expressions for all of the entity types, click Remove Changes.
Textual uses pipelines to transform file text into a format that can be used in an LLM system.
You can use the Textual SDK to create and manage pipelines and to retrieve pipeline run results.
Before you perform these tasks, remember to instantiate the SDK client.
To create a pipeline, use the pipeline creation method for the type of pipeline to create:
textual.create_local_pipeline - Creates an uploaded file pipeline.
textual.create_s3_pipeline - Creates an Amazon S3 pipeline.
textual.create_azure_pipeline - Creates an Azure pipeline.
textual.create_databricks_pipeline - Creates a Databricks pipeline.
When you create the pipeline, you can also:
If needed, provide the credentials to use to connect to Amazon S3, Azure, or Databricks.
Indicate whether to also generate redacted files. By default, pipelines do not generate redacted files. To generate redacted files, set synthesize_files to True.
For example, to create an uploaded file pipeline that also creates redacted files:
The response contains the pipeline object.
To delete a pipeline, use textual.delete_pipeline.
To change whether a pipeline also generates synthesized files, use pipeline.set_synthesize_files.
To add a file to an uploaded file pipeline, use pipeline.upload_file.
For an Amazon S3 pipeline, you can configure the output location for the processed files. You can also identify the files and folders for the pipeline to process:
To identify the output location for the processed files, use s3_pipeline.set_output_location.
To identify individual files for the pipeline to process, use s3_pipeline.add_files.
To identify prefixes - folders for which the pipeline processes all applicable files - use s3_pipeline.add_prefixes.
For an Azure pipeline, you can configure the output location for the processed files. You can also identify the files and folders for the pipeline to process:
To identify the output location for the processed files, use azure_pipeline.set_output_location.
To identify individual files for the pipeline to process, use azure_pipeline.add_files.
To identify prefixes - folders for which the pipeline processes all applicable files - use azure_pipeline.add_prefixes.
To get the list of pipelines, use textual.get_pipelines.
The response contains a list of pipeline objects.
To use the pipeline identifier to get a single pipeline, use textual.get_pipeline_by_id.
The response contains a single pipeline object.
The pipeline identifier is displayed on the pipeline details page. To copy the identifier, click the copy icon.
To run a pipeline, use pipeline.run.
The response contains the job identifier.
To get the list of pipeline runs, use pipeline.get_runs.
The response contains a list of pipeline run objects.
Once you have the pipeline, to get an enumerator of the files in the pipeline from the most recent pipeline run, use pipeline.enumerate_files.
The response is an enumerator of file parse result objects.
To get a list of entities that were detected in a file, use get_all_entities. For example, to get the detected entities for all of the files in a pipeline:
To provide a list of entity types and how to process them, use get_entities:
generator_config is a dictionary that specifies whether to redact, synthesize, or do neither for each entity type in the dictionary.
For a list of the entity types that Textual detects, go to Entity types that Textual detects.
For each entity type, you provide the handling type:
Redaction indicates to replace the value with a token that represents the entity type.
Synthesis indicates to replace the value with a realistic value.
Off indicates to keep the value as is.
generator_default indicates how to process values for entity types that were not included in the generator_config list.
The response contains the list of entities. For each value, the list includes:
Entity type
Where the value starts in the source file
Where the value ends in the source file
The original text of the entity
To get the Markdown output of a pipeline file, use get_markdown. In the request, you can provide generator_config and generator_default to configure how to present the detected entities in the output file.
The response contains the Markdown files, with the detected entities processed as specified in generator_config and generator_default.
To split a pipeline file into text chunks that can be imported into an LLM, use get_chunks.
In the request, you set the maximum number of characters in each chunk.
You can also provide generator_config and generator_default to configure how to present the detected entities in the text chunks.
The response contains the list of text chunks, with the detected entities processed as specified in generator_config and generator_default.
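To show what a maximum-characters limit means in practice, here is a minimal, generic chunker that packs whole words into chunks that never exceed the limit. It is not how get_chunks is implemented, because the text above does not describe the splitting strategy. The chunk_text function is a hypothetical illustration of the constraint only.

```python
from typing import List


def chunk_text(text: str, max_chars: int) -> List[str]:
    """Greedily pack whitespace-separated words into chunks of at most max_chars."""
    if max_chars < 1:
        raise ValueError("max_chars must be positive")
    chunks: List[str] = []
    current = ""
    for word in text.split():
        # A single word longer than the limit is cut into fixed-size pieces.
        while len(word) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(word[:max_chars])
            word = word[max_chars:]
        candidate = word if not current else f"{current} {word}"
        if len(candidate) <= max_chars:
            current = candidate
        else:
            # The word does not fit, so close the current chunk and start a new one.
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks


chunks = chunk_text("The quick brown fox jumps over the lazy dog", 10)
print(chunks)
```

Splitting on word boundaries keeps each chunk readable, at the cost of chunks that are often shorter than the limit.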
Use these instructions to set up Azure Active Directory as your SSO provider for Tonic Textual.
Register Textual as an application within the Azure Active Directory Portal:
In the portal, navigate to Azure Active Directory -> App registrations, then click New registration.
Register Textual and create a new web redirect URI that points to your Textual instance's address and the path /sso/callback/azure.
Take note of the values for client ID and tenant ID. You will need them later.
Click Add a certificate or secret, and then create a new client secret. Take note of the secret value. You will need this later.
Navigate to the API permissions page. Add the following permissions for the Microsoft Graph API:
OpenId permissions
openid
profile
GroupMember
GroupMember.Read.All
User
User.Read
Click Grant admin consent for Tonic AI. This allows the application to read the user and group information from your organization. When permissions have been granted, the status should change to Granted for Tonic AI.
Navigate to Enterprise applications and then select Textual. From here, you can assign the users or groups that should have access to Textual.
For Kubernetes, in values.yaml:
For Docker, in .env:
On a self-hosted instance, you can configure settings to determine whether to use the auxiliary model, and how models are used on GPU.
To improve overall inference, you can configure whether Textual uses the en_core_web_sm auxiliary NER model.
The auxiliary model detects the following types:
EVENT
LANGUAGE
LAW
NRP
NUMERIC_VALUE
PRODUCT
WORK_OF_ART
To configure whether to use the auxiliary model, you use the environment variable TEXTUAL_AUX_MODEL.
The available values are:
en_core_web_sm - This is the default value.
none - Indicates to not use the auxiliary model.
When you use a textual-ml-gpu container on accelerated hardware, you can configure:
Whether to use the auxiliary model
Whether to use the date synthesis model
By default, on GPU, Textual does not use the auxiliary model, and TEXTUAL_AUX_MODEL_GPU is false.
To use the auxiliary model for GPU, based on the configuration of TEXTUAL_AUX_MODEL, set TEXTUAL_AUX_MODEL_GPU to true.
When TEXTUAL_AUX_MODEL_GPU is true, and TEXTUAL_MULTI_LINGUAL is true, Textual also loads the multilingual models on GPU.
By default, on GPU, Textual loads the date synthesis model on GPU.
Note that this model requires 600MB of GPU RAM for each machine learning worker.
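The interaction between these settings can be summarized in a small decision function. This is one reading of the rules above, not Textual's actual startup logic. The gpu_model_plan function and its result keys are hypothetical. The sketch assumes that an unset TEXTUAL_MULTI_LINGUAL behaves as false, and it treats the date synthesis model as always loaded, because the setting that controls it is not named in the text.

```python
from typing import Dict, Mapping


def _is_true(value: str) -> bool:
    """Interpret an environment variable value as a boolean."""
    return value.strip().lower() == "true"


def gpu_model_plan(env: Mapping[str, str]) -> Dict[str, bool]:
    """Decide which optional models load on GPU for the given environment."""
    aux_model = env.get("TEXTUAL_AUX_MODEL", "en_core_web_sm")       # documented default
    aux_on_gpu = _is_true(env.get("TEXTUAL_AUX_MODEL_GPU", "false"))  # documented default
    multilingual = _is_true(env.get("TEXTUAL_MULTI_LINGUAL", "false"))  # assumed default
    return {
        # The auxiliary model needs the GPU flag and a model other than "none".
        "auxiliary_model": aux_on_gpu and aux_model != "none",
        # Multilingual models load on GPU only when both flags are true.
        "multilingual_models": aux_on_gpu and multilingual,
        # Loaded on GPU by default; its configuration setting is not named above.
        "date_synthesis_model": True,
    }


print(gpu_model_plan({}))
print(gpu_model_plan({"TEXTUAL_AUX_MODEL_GPU": "true", "TEXTUAL_MULTI_LINGUAL": "true"}))
```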
To process PDF and image files, Tonic Textual uses optical character recognition (OCR). Textual supports the following OCR models:
Azure AI Document Intelligence
Amazon Textract
Tesseract
For the best performance, we recommend that you use either Azure AI Document Intelligence or Amazon Textract.
If you cannot use either of those - for example because you run Textual on-premises and cannot access third-party services - then you can use Tesseract.
To use Azure AI Document Intelligence to process PDF and image files, Textual requires the Azure AI Document Intelligence key and endpoint.
In .env, uncomment and provide values for the following settings:
SOLAR_AZURE_DOC_INTELLIGENCE_KEY=#
SOLAR_AZURE_DOC_INTELLIGENCE_ENDPOINT=#
In values.yaml, uncomment and provide values for the following settings:
azureDocIntelligenceKey:
azureDocIntelligenceEndpoint:
If the Azure-specific environment variables are not configured, then Textual attempts to use Amazon Textract.
We recommend that you use the AmazonTextractFullAccess policy, but you can also choose to use a more restricted policy.
Here is an example policy that provides the minimum required permissions:
After the policy is attached to an IAM user or a role, it must be made accessible to Textual. To do this, either:
Assign an instance profile
Provide the AWS key, secret, and Region in the following environment variables:
If neither Azure AI Document Intelligence nor Amazon Textract is configured, then Textual uses Tesseract, which is automatically available in your Textual installation.
Tesseract does not require any external access.
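The fallback order described in this section (Azure AI Document Intelligence, then Amazon Textract, then Tesseract) can be sketched as a simple selection function. The select_ocr_model function is hypothetical and only models the documented order. It uses the .env setting names shown above for Azure, and it takes the Amazon Textract state as a plain boolean, because the corresponding environment variable names are not listed in the text.

```python
from typing import Mapping

AZURE_KEY_VAR = "SOLAR_AZURE_DOC_INTELLIGENCE_KEY"
AZURE_ENDPOINT_VAR = "SOLAR_AZURE_DOC_INTELLIGENCE_ENDPOINT"


def select_ocr_model(env: Mapping[str, str], textract_configured: bool) -> str:
    """Return the OCR model that would be used, following the documented fallback order."""
    # Azure requires both the key and the endpoint.
    if env.get(AZURE_KEY_VAR) and env.get(AZURE_ENDPOINT_VAR):
        return "Azure AI Document Intelligence"
    # Without the Azure settings, Amazon Textract is tried next.
    if textract_configured:
        return "Amazon Textract"
    # Tesseract ships with the installation and needs no external access.
    return "Tesseract"


print(select_ocr_model({}, textract_configured=False))
```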
Use these instructions to set up Okta as your SSO provider for Tonic Textual.
You complete the following configuration steps within Okta:
Create a new application. Choose the OIDC - OpenId Connect method with the Single-Page Application option.
Click Next, then fill out the fields with the values below:
App integration name: The name to use for the Textual application. For example, Textual, Textual-Prod, Textual-Dev.
Grant type: Implicit (hybrid)
Sign-in redirect URIs: <base-url>/sso/callback/okta
Sign-out redirect URIs: <base-url>/sso/logout
Base URIs: The URL to your Textual instance
Controlled access: Configure as needed to limit Textual access to the appropriate users
After saving the above, navigate to the General Settings page for the application and make the following changes:
Grant type: Check Implicit (Hybrid) and Allow ID Token with implicit grant type.
Login initiated by: Either Okta or App
Application visibility: Check Display application icon to users
Initiate login URI: <base-url>
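To avoid typos when you fill out the Okta fields, the URI values above can be derived from the base URL of your instance. The following is a minimal sketch. The helper name okta_uris and the dictionary keys are hypothetical, and stripping a trailing slash from the base URL is an assumption.

```python
from typing import Dict


def okta_uris(base_url: str) -> Dict[str, str]:
    """Build the Okta URI field values from the Textual base URL."""
    # Assumption: a trailing slash on the base URL is not wanted.
    base = base_url.rstrip("/")
    return {
        "sign_in_redirect_uri": f"{base}/sso/callback/okta",
        "sign_out_redirect_uri": f"{base}/sso/logout",
        "base_uri": base,
        "initiate_login_uri": base,
    }


if __name__ == "__main__":
    # textual.example.com is a placeholder host.
    for field, value in okta_uris("https://textual.example.com/").items():
        print(f"{field}: {value}")
```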
For Kubernetes, in values.yaml:
For Docker, in .env:
The Textual LLM preparation workflow transforms source files into content that you can incorporate into an LLM.
You can:
Upload files directly from a local file system
Select files from an S3 bucket
Select files from a Databricks data volume
Select files from an Azure Blob Storage container
Textual can process plain text files (.txt and .csv), .docx files, and .xlsx files. It can also process PDF files. For images, Textual can extract text from .png, .tif/.tiff, and .jpg/.jpeg files.
At a high level, to use Textual to create LLM-ready content:
If the source files are in a local file system, then upload the files to the pipeline. Textual stores the files in your configured Amazon S3 location, and then automatically processes each new file.
If the source files are in cloud storage (Amazon S3, Databricks, or Azure):
Provide the credentials to use to connect to the storage location.
Identify the location where Textual writes the pipeline output.
Optionally, filter the files by file type. For example, you might only want to process PDF files.
Identify the files to include in the pipeline. You can select individual files or folders. When you select folders, Textual processes all of the files in the folder.
For each file, Textual:
Converts the content to raw text. For image files, this means to extract any text that is present.
Uses its built-in models to detect entity values in the text.
Generates a Markdown version of the original text.
Produces a JSON file that contains:
The Markdown version of the text
The detected entities and their locations
From Textual, for each processed file, you can:
Textual also provides code snippets to help you to use the pipeline output.
For cloud storage pipelines, the JSON files also are available from the configured output location.
You can also configure pipelines to create redacted versions of the original values. For more information, go to Datasets workflow for text redaction.
To create a pipeline, on the Pipelines page, click Create a New Pipeline.
On the Create A New Pipeline panel:
In the Name field, type the name of the pipeline.
Under Files Source, select the location of the source files.
To upload files from a local file system, click File upload, then click Save.
To select files from and write output to Amazon S3, click Amazon S3.
To select files from and write output to Databricks, click Databricks.
To select files from and write output to Azure Blob Storage, click Azure.
If you selected Amazon S3, provide the credentials to use to connect to Amazon S3.
In the Access Secret field, provide the secret key that is associated with the access key.
From the Region dropdown list, select the AWS Region to send the authentication request to.
In the Session Token field, provide the session token to use for the authentication request.
To test the credentials, click Test AWS Connection.
Click Save.
Click Save.
If you selected Databricks, provide the connection information:
In the Databricks URL field, provide the URL to the Databricks workspace.
In the Access Token field, provide the access token to use to get access to the volume.
To test the connection, click Test Databricks Connection.
Click Save.
Click Save.
If you selected Azure, provide the connection information:
In the Account Name field, provide the name of your Azure account.
In the Account Key field, provide the access key for your Azure account.
To test the connection, click Test Azure Connection.
Click Save.
Click Save.
To update a pipeline configuration:
Either:
On the Pipelines page, click the pipeline options menu, then click Settings.
On the pipeline details page, click the settings icon. For cloud storage pipelines, the settings icon is next to the Run Pipeline option. For uploaded file pipelines, the settings icon is next to the Upload Files option.
Click Save.
To delete a pipeline, on the Pipeline Settings page, click Delete Pipeline.
In Tonic Textual, a pipeline identifies a set of files that Textual processes into content that can be imported into an LLM system.
To display the Pipelines page, in the Textual navigation menu, click Pipelines.
If there are no pipelines, then the Pipelines page displays a panel to allow you to create a pipeline.
To display the details for a pipeline, on the Pipelines page, click the pipeline name.
Here is a pipeline details page for an Amazon S3 pipeline:
For an Amazon S3 pipeline, the details include:
Here is a pipeline details page for a Databricks pipeline:
For a Databricks pipeline, the details include:
Here is a pipeline details page for an Azure pipeline:
For an Azure pipeline, the details include:
Here is a pipeline details page for an uploaded file pipeline:
The pipeline details include:
For each pipeline, you configure the name and the files to process.
For a Databricks pipeline, the settings include:
Databricks credentials
Output location
Whether to also generate redacted versions of the original files
Selected files and folders
When you create a pipeline that uses files from Databricks, you are prompted to provide the credentials to use to connect to Databricks.
From the Pipeline Settings page, to change the credentials:
Click Update Databricks Credentials.
Provide the new credentials:
In the Databricks URL field, provide the URL to the Databricks workspace.
In the Access Token field, provide the access token to use to get access to the volume.
To test the connection, click Test Databricks Connection.
To save the new credentials, click Update Databricks Credentials.
On the Pipeline Settings page, under Select Output Location, navigate to and select the folder in Databricks where Textual writes the output files.
When you run a pipeline, Textual creates a folder in the output location. The folder name is the pipeline job identifier.
Within the job folder, Textual recreates the folder structure for the original files. It then creates the JSON output for each file. The name of the JSON file is <original filename>_<original extension>_parsed.json.
If the pipeline is also configured to generate redacted versions of the files, then Textual writes the redacted version of each file to the same location.
For example, for the original file Transaction1.txt, the output for a pipeline run contains:
Transaction1_txt_parsed.json
Transaction1.txt
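If you need to locate the JSON output programmatically, the naming rule described above can be reproduced with a short Python sketch. The function names are hypothetical. The handling of files that have no extension is not described here, so this sketch does not attempt to cover that case.

```python
from pathlib import PurePosixPath


def parsed_output_name(original_name: str) -> str:
    """Apply the <original filename>_<original extension>_parsed.json rule."""
    path = PurePosixPath(original_name)
    extension = path.suffix.lstrip(".")  # "txt" for "Transaction1.txt"
    return f"{path.stem}_{extension}_parsed.json"


def parsed_output_path(output_location: str, job_id: str, original_path: str) -> str:
    """Combine the output location, the job folder, and the recreated folders."""
    original = PurePosixPath(original_path)
    job_folder = PurePosixPath(output_location) / job_id
    # The original folder structure is recreated inside the job folder.
    return str(job_folder / original.parent / parsed_output_name(original.name))


if __name__ == "__main__":
    print(parsed_output_name("Transaction1.txt"))  # Transaction1_txt_parsed.json
```

The same rule applies to pipelines for the other cloud storage types, because the output layout is described identically for each.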
By default, when you run a Databricks pipeline, Textual only generates the JSON output.
To also generate versions of the original files that redact or synthesize the detected entity values, toggle Synthesize Files to the on position.
One option for selected folders is to filter the processed files based on the file extension. For example, in a selected folder, you might only want to process .txt and .csv files.
Under File Processing Settings, select the file extensions to include. To add a file type, select it from the dropdown list. To remove a file type, click its delete icon.
Note that this filter does not apply to individually selected files. Textual always processes those files regardless of file type.
Under Select files and folders to add to run, navigate to and select the folders and individual files to process.
To add a folder or file to the pipeline, check its checkbox.
When you check a folder checkbox, Textual adds it to the Prefix Patterns list. It processes all of the applicable files in the folder, based on whether the file type is a type that Textual supports and whether it is included in the file type filter.
When you click the folder name, it displays the folder contents.
When you select an individual file, Textual adds it to the Selected Files list.
To delete a file or folder, either:
In the navigation pane, uncheck the checkbox.
In the Prefix Patterns or Selected Files list, click its delete icon.
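The interaction between selected folders, the file type filter, and individually selected files can be sketched as follows. This is an illustration of the rules described above, not Textual's implementation. The function and parameter names are hypothetical, the supported extension set is taken from the file types listed in the workflow overview, and treating a missing filter as "no filter" is an assumption.

```python
from pathlib import PurePosixPath
from typing import Iterable, List, Optional, Set

# File types that the workflow overview lists as supported.
SUPPORTED_EXTENSIONS = {
    ".txt", ".csv", ".docx", ".xlsx", ".pdf",
    ".png", ".tif", ".tiff", ".jpg", ".jpeg",
}


def files_to_process(
    all_files: Iterable[str],
    prefix_patterns: Iterable[str],
    selected_files: Set[str],
    extension_filter: Optional[Set[str]] = None,
) -> List[str]:
    """Return the files that a pipeline run would pick up."""
    prefixes = list(prefix_patterns)
    chosen = []
    for path in all_files:
        # Individually selected files bypass the file type filter.
        if path in selected_files:
            chosen.append(path)
            continue
        extension = PurePosixPath(path).suffix.lower()
        in_selected_folder = any(path.startswith(prefix) for prefix in prefixes)
        passes_filter = extension_filter is None or extension in extension_filter
        # Folder contents must be a supported type and pass the filter.
        if in_selected_folder and extension in SUPPORTED_EXTENSIONS and passes_filter:
            chosen.append(path)
    return chosen
```

With the folder docs/ selected, a filter of .txt only, and other/d.pdf selected individually, the PDF in other/ is still processed, but a PDF in docs/ is not.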
For an uploaded file pipeline, the Files tab contains the list of all of the pipeline files.
For cloud storage pipelines, you use the pipeline details Overview page to track processed files and pipeline runs.
For pipelines that are configured to also redact files, you can configure the redaction for the detected entity types. For more information, go to .
For uploaded file pipelines, when you add a file to the pipeline, it is automatically added to the file list.
For cloud storage pipelines, the file list is not populated until you run the pipeline. The list only contains processed files.
On the Overview page for a cloud storage pipeline, the Pipeline Runs tab displays the list of pipeline runs.
For each run, the list includes:
Run identifier
When the run was started
The current status of the pipeline run. The possible statuses are:
Queued - The pipeline run has not started to run yet.
Running - The pipeline run is in progress.
Completed - The pipeline run completed successfully.
Failed - The pipeline run failed.
For a pipeline run, to display the list of files that the pipeline run includes, click View Run.
For each file, the list includes the following information:
File name
For cloud storage files, the path to the file
The status of the file processing. The possible statuses are:
Unprocessed - The file is added, but a pipeline run to process it has not yet started. This only applies to uploaded files that were added since the most recent pipeline run.
Queued - A pipeline run was started but the file is not yet processed.
Running - The file is being processed.
Completed - The file was processed successfully.
Failed - The file could not be processed.
The pipeline details include the results of the pipeline processing, including the pipeline files and, for cloud storage pipelines, the individual pipeline runs.
For an Azure pipeline, the settings include:
Azure credentials
Output location
Whether to also generate redacted versions of the original files
Selected files and folders
When you create a pipeline that uses files from Azure, you are prompted to provide the credentials to use to connect to Azure.
From the Pipeline Settings page, to change the credentials:
Click Update Azure Credentials.
Provide the new credentials:
In the Account Name field, provide the name of your Azure account.
In the Account Key field, provide the access key for your Azure account.
To test the connection, click Test Azure Connection.
To save the new credentials, click Update Azure Credentials.
On the Pipeline Settings page, under Select Output Location, navigate to and select the folder in Azure where Textual writes the output files.
When you run a pipeline, Textual creates a folder in the output location. The folder name is the pipeline job identifier.
Within the job folder, Textual recreates the folder structure for the original files. It then creates the JSON output for each file. The name of the JSON file is <original filename>_<original extension>_parsed.json.
If the pipeline is also configured to generate redacted versions of the files, then Textual writes the redacted version of each file to the same location.
For example, for the original file Transaction1.txt, the output for a pipeline run contains:
Transaction1_txt_parsed.json
Transaction1.txt
By default, when you run an Azure pipeline, Textual only generates the JSON output.
To also generate versions of the original files that redact or synthesize the detected entity values, toggle Synthesize Files to the on position.
One option for selected folders is to filter the processed files based on the file extension. For example, in a selected folder, you might only want to process .txt and .csv files.
Under File Processing Settings, select the file extensions to include. To add a file type, select it from the dropdown list. To remove a file type, click its delete icon.
Note that this filter does not apply to individually selected files. Textual always processes those files regardless of file type.
Under Select files and folders to add to run, navigate to and select the folders and individual files to process.
To add a folder or file to the pipeline, check its checkbox.
When you check a folder checkbox, Textual adds it to the Prefix Patterns list. It processes all of the applicable files in the folder, based on whether the file type is a type that Textual supports and whether it is included in the file type filter.
When you click the folder name, it displays the folder contents.
When you select an individual file, Textual adds it to the Selected Files list.
To delete a file or folder, either:
In the navigation pane, uncheck the checkbox.
In the Prefix Patterns or Selected Files list, click its delete icon.
For an Amazon S3 pipeline, the settings include:
AWS credentials
Output location
Whether to also generate redacted versions of the original files
Selected files and folders
When you create a pipeline that uses files from Amazon S3, you are prompted to provide the credentials to use to connect to Amazon S3.
From the Pipeline Settings page, to change the credentials:
Click Update AWS Credentials.
Provide the new credentials:
In the Access Secret field, provide the secret key that is associated with the access key.
From the Region dropdown list, select the AWS Region to send the authentication request to.
In the Session Token field, provide the session token to use for the authentication request.
To test the connection, click Test AWS Connection.
To save the new credentials, click Update AWS Credentials.
On the Pipeline Settings page, under Select Output Location, navigate to and select the folder in Amazon S3 where Textual writes the output files.
When you run a pipeline, Textual creates a folder in the output location. The folder name is the pipeline job identifier.
Within the job folder, Textual recreates the folder structure for the original files. It then creates the JSON output for each file. The name of the JSON file is <original filename>_<original extension>_parsed.json.
If the pipeline is also configured to generate redacted versions of the files, then Textual writes the redacted version of each file to the same location.
For example, for the original file Transaction1.txt, the output for a pipeline run contains:
Transaction1_txt_parsed.json
Transaction1.txt
By default, when you run an Amazon S3 pipeline, Textual only generates the JSON output.
To also generate versions of the original files that redact or synthesize the detected entity values, toggle Synthesize Files to the on position.
One option for selected folders is to filter the processed files based on the file extension. For example, in a selected folder, you might only want to process .txt and .csv files.
Under File Processing Settings, select the file extensions to include. To add a file type, select it from the dropdown list. To remove a file type, click its delete icon.
Note that this filter does not apply to individually selected files. Textual always processes those files regardless of file type.
Under Select files and folders to add to run, navigate to and select the folders and individual files to process.
To add a folder or file to the pipeline, check its checkbox.
When you check a folder checkbox, Textual adds it to the Prefix Patterns list. It processes all of the applicable files in the folder, based on whether the file type is a type that Textual supports and whether it is included in the file type filter.
When you click the folder name, it displays the folder contents.
When you select an individual file, Textual adds it to the Selected Files list.
To delete a file or folder, either:
In the navigation pane, uncheck the checkbox.
In the Prefix Patterns or Selected Files list, click its delete icon.
On a self-hosted instance, before you can upload files to a pipeline, you must configure the S3 bucket where Tonic Textual stores the files. For more information, go to .
For an example of an IAM role that has the required permissions for file upload pipelines, go to .
On the pipeline details page for an uploaded file pipeline, to add files to the pipeline:
Click Upload Files.
Search for and select the files to upload.
To remove a file, on the pipeline details page, click the delete icon for the file.
By default, Textual only generates the JSON output for the pipeline files.
To also generate versions of the original files that redact or synthesize the detected entity values, on the Pipeline Settings page, toggle Synthesize Files to the on position.
When you choose to also generate synthesized versions of the pipeline files, the pipeline details page includes a Generator Config tab. From the Generator Config tab, you configure how to transform the detected entities in each file.
The Generator Config tab lists all of the available entity types.
For each entity type, you select and configure the handling option. For more information, see and .
After you change the configuration, click Save Changes. The updated configuration is applied the next time you run the pipeline, and only to new files.
For the Tonic Textual Snowflake Native App, you set up:
A compute pool
A warehouse to enable queries
The compute pool must be specific to Textual.
For large-scale jobs, we highly recommend a GPU-enabled compute pool.
During setup and testing, you can use a CPU-only pool.
To run SQL queries against Snowflake tables that the app manages, the app requires a warehouse.
Use these instructions to set up GitHub as your SSO provider for Tonic Textual.
In GitHub, navigate to Settings -> Developer Settings -> OAuth Apps, then create a new application.
For Application Name, enter Textual.
For Homepage URL, enter https://textual.tonic.ai.
For Authorization callback URL, enter https://your-textual-url/sso/callback/github.
Replace your-textual-url with the URL of your Textual instance.
After you create the application, to create a new secret, click Generate a new client secret.
You use the client ID and the client secret in the Textual configuration.
After you complete the configuration in GitHub, you uncomment and configure the required settings in Textual.
For Kubernetes, in values.yaml:
For Docker, in .env:
Tonic Textual provides a certificate for HTTPS traffic, but on a self-hosted instance, you can also use a user-provided certificate. The certificate must use the PFX format and be named solar.pfx.
To use your own certificate, you must:
Add the SOLAR_PFX_PASSWORD environment variable.
Use a volume mount to provide the certificate file. Textual uses volume mounting to give the Textual containers access to the certificate.
You must apply the changes to both the Textual web server and Textual worker containers.
To use your own certificate, you make the following changes to the docker-compose.yml file.
Add the SOLAR_PFX_PASSWORD environment variable, which contains the certificate password.
Place the certificate on the host machine, then share it to the containers as a volume.
You must map the certificate to /certificates on the containers.
Copy the following:
You map the certificate to /certificates on the containers. Within your web server and worker deployment YAML files, the entry should be similar to the following:
You install a self-hosted instance of Tonic Textual on either:
A VM or server that runs Linux and on which you have superuser access.
A local machine that runs Mac, Windows, or Linux.
At minimum, we recommend that the server or cluster that you deploy Textual to has access to the following resources:
Nvidia GPU, 16GB GPU RAM. We recommend at least 6GB GPU RAM for each textual-ml worker.
If you only use a CPU and not a GPU, then we recommend an M5.2xLarge. However, without GPU, performance is significantly slower.
The number of words per second that Textual processes depends on many factors, including:
The hardware that runs the textual-ml container
The number of workers that are assigned to the textual-ml container
The auxiliary model, if any, that is used in the textual-ml container
To optimize both the throughput and the cost of Textual, we recommend that the textual-ml container runs on modern hardware with GPU compute. If you use AWS, we recommend a with 1 GPU.
To use GPU resources:
For PDF files, you can add manual overrides to the initial redactions, which are based on the detected data types and handling configuration.
For each manual override, you select an area of the file.
For the selected area, you can either:
Ignore any automatically detected redactions. For example, a scanned form might show an example or boilerplate content that doesn't actually contain sensitive values.
Redact that area. The file might contain sensitive content that Tonic Textual is unable to detect. For example, a scanned form might contain handwritten notes.
You can also apply a template to the file.
To manage the manual overrides for a PDF file:
In the file list, click the options menu for the file.
In the options menu, click Edit Redactions.
The File Redactions panel displays the file content. The values that Textual detected are highlighted. The page also shows any manual overrides that were added to the file.
On the File Redactions panel, to apply a template to the file, select it from the template dropdown list.
When you apply a PDF template to a file, the manual overrides from that template are displayed on the file preview. The manual overrides are not included in the Redactions list.
On the File Redactions panel, to add a manual override to a file:
Select the type of override. To indicate to ignore any automatically detected values in the selected area, click Ignore Redactions. To indicate to redact the selected area, click Add Manual Redaction.
Use the mouse to draw a box around the area to select.
Textual adds the override to the Redactions list. The icon indicates the type of override.
In the file content:
Overrides that ignore detected values within the selected area are outlined in red.
Overrides that redact the selected area are outlined in green.
To select and highlight a manual override in the file content, in the Redactions list, click the navigate icon for the override.
To remove a manual override, in the Redactions list, click the delete icon for the override.
To save the current manual overrides, click Save.
Tonic Textual pipelines can process files from sources such as Amazon S3, Azure Blob Storage, and Databricks Unity Catalog. You can also create pipelines to process files that you upload directly from your browser.
For those uploaded file pipelines, Textual always stores the files in an S3 bucket. On a self-hosted instance, before you add files to an uploaded file pipeline, you must configure the S3 bucket and the associated authentication credentials.
The configured S3 bucket is also used to store dataset files and individual files that you use the Textual SDK to redact. If an S3 bucket is not configured, then:
The dataset and individual redacted files are stored in the Textual application database.
You cannot use Amazon Textract for . If you configured Textual to use Amazon Textract, Textual instead uses Tesseract.
The authentication credentials for the S3 bucket include:
The AWS Region where the S3 bucket is located.
An AWS access key that is associated with an IAM user or role.
The secret key that is associated with the access key.
To provide the authentication credentials, you can either:
Provide the values directly as environment variable values.
Use the instance profile of the compute instance where Textual runs.
For an example IAM role that has the required permissions, go to .
In .env, add the following settings:
SOLAR_INTERNAL_BUCKET_NAME=<S3 bucket path>
AWS_REGION=<AWS Region>
AWS_ACCESS_KEY_ID=<AWS access key>
AWS_SECRET_ACCESS_KEY=<AWS secret key>
If you use the instance profile of the compute instance, then only the bucket name is required.
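Before you restart Textual, it can be useful to check that the settings above are complete. The following Python sketch reports which settings are missing. The function name and the use_instance_profile flag are hypothetical. The setting names are the ones listed above.

```python
from typing import List, Mapping

BUCKET_SETTING = "SOLAR_INTERNAL_BUCKET_NAME"
CREDENTIAL_SETTINGS = ("AWS_REGION", "AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")


def missing_s3_settings(env: Mapping[str, str], use_instance_profile: bool = False) -> List[str]:
    """Return the names of required settings that have no value."""
    required = [BUCKET_SETTING]
    # With an instance profile, only the bucket name is required.
    if not use_instance_profile:
        required.extend(CREDENTIAL_SETTINGS)
    return [name for name in required if not env.get(name, "").strip()]


if __name__ == "__main__":
    import os

    missing = missing_s3_settings(os.environ)
    print("Missing settings:", ", ".join(missing) if missing else "none")
```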
In values.yaml, within env: { } under both textual_api_server and textual_worker, add the following settings:
SOLAR_INTERNAL_BUCKET_NAME
AWS_REGION
AWS_ACCESS_KEY_ID
AWS_SECRET_ACCESS_KEY
For example, if no other environment variables are defined:
If you use the instance profile of the compute instance, then only the bucket name is required.
You can configure your self-hosted instance of Textual to enable Textual features.
The Docker Compose file is available in the GitHub repository .
Fork the repository.
To deploy Textual:
Rename sample.env to .env.
In .env, provide values for the required settings. These are not commented out and have <FILL IN> as a placeholder value:
SOLAR_VERSION - Provided by Tonic.ai.
SOLAR_LICENSE - Provided by Tonic.ai.
ENVIRONMENT_NAME - The name that you want to use for your Textual instance. For example, my-company-name.
SOLAR_SECRET - The string to use for Textual encryption.
SOLAR_DB_PASSWORD - The password that you want to use for the Textual application database, which stores the metadata for Textual, including the datasets and pipelines. Textual deploys a PostgreSQL database container for the application database.
To deploy and start Textual, run docker-compose up -d.
The Tonic Textual Helm chart is available in the GitHub repository .
To use the Helm chart, you can either:
Use the that Tonic hosts on .
Fork or clone the repository and then maintain it locally.
During the onboarding period, you are provided access credentials to our docker image repository on . If you require new credentials, or you experience issues accessing the repository, contact .
Before you deploy Textual, you create a values.yaml file with the configuration for your instance.
For details about the required and optional configuration options, go to the .
To deploy and validate access to Textual from the forked repository, follow the .
To use the OCI-based registry, run:
The GitHub repository contains a with the details on how to populate a values.yaml file and deploy Textual.
You can use the Textual SDK to redact and synthesize values in individual files.
Before you perform these tasks, remember to .
For a self-hosted instance, you can also configure the S3 bucket to use to store the files. This is the same S3 bucket that is used to store files for uploaded file pipelines. For more information, go to . For an example of an IAM role with the required permissions, go to .
To send an individual file to Textual, you use .
You first open the file so that Textual can read it, then make the call for Textual to read the file.
The response includes:
The file name
The identifier of the job that processed the file. You use this identifier to retrieve a transformed version of the file.
After you use to send the file to Textual, you use to retrieve a transformed version of the file.
To identify the file, you use the job identifier that you received from textual.start_file_redaction. You can also specify whether to redact, synthesize, or ignore specific entity types. By default, all of the values are redacted.
Before you make the call to download the file, you specify the path to download the file content to.
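The overall flow (open the file, send it, keep the job identifier, then write the downloaded content to a chosen path) might look like the following sketch. It makes several assumptions that you should check against the SDK reference: that start_file_redaction accepts an open binary file and a file name in that order, and that it returns the job identifier directly. The download step is passed in as a hypothetical callable named download, because the name of the SDK download method is not given here.

```python
from pathlib import Path
from typing import Callable


def redact_file(
    client,
    download: Callable[[str], bytes],
    source: Path,
    destination: Path,
) -> str:
    """Send one file for redaction and save the transformed version.

    `client` is any object with a start_file_redaction method.
    `download` is a hypothetical callable: job identifier in, file bytes out.
    """
    # Open the file in binary mode so that its content can be read and sent.
    with source.open("rb") as handle:
        # Assumed signature: (file object, file name) -> job identifier.
        job_id = client.start_file_redaction(handle, source.name)

    # Decide where the downloaded content goes before making the download call.
    destination.parent.mkdir(parents=True, exist_ok=True)
    destination.write_bytes(download(job_id))
    return job_id
```

Keeping the download step as a parameter also makes the flow easy to exercise with a stub client before you point it at a real Textual instance.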
The Dataset Settings panel includes options for how Textual handles the following file components:
For .docx files, images and comments
For PDF files, scanned-in signatures
To display the Dataset Settings panel, on the dataset details page, click Settings.
These options are not available for pipelines that also redact files.
For .docx images, you can configure the dataset to do one of the following:
Redact the image content. When you select this option, Textual looks for and blocks out sensitive values in the image.
Ignore the image.
Replace the images with black boxes.
On the Dataset Settings panel, under Image settings for DOCX files:
To redact the image content, click Redact contents of images using OCR. This is the default selection.
To ignore the images entirely, click Ignore images during scan.
To replace the images with black boxes, click Replace images from the output file with black boxes.
For comments in a .docx file, you can configure the dataset to either:
Remove the comments from the file.
Ignore the comments and leave them in the file.
On the Dataset Settings panel, to remove the comments, toggle Remove comments from the output file to the on position. This is the default configuration.
To ignore the comments, toggle Remove comments from the output file to the off position.
By default, Textual redacts scanned-in signatures in PDF files. You can configure the dataset to instead ignore the signatures.
On the Dataset Settings panel:
To redact PDF signatures, toggle Detect and redact signatures in PDFs to the on position. This is the default configuration.
To ignore PDF signatures, toggle Detect and redact signatures in PDFs to the off position.
The Tonic Textual images are stored on . During onboarding, Tonic.ai provides you with credentials to access the image repository. If you require new credentials, or you experience issues accessing the repository, contact .
You can deploy Textual using either Kubernetes or Docker.
For each file in a dataset, you can download the version of the file that contains the replacement values.
For information on downloading synthesized files from a pipeline, go to .
You can download an individual file from either the dataset file list or the file manager.
To download a single file:
Click the options menu for the file.
In the options menu, click Download File.
From the file manager, to download a single file:
Click the options menu for the file.
In the options menu, click Download File.
To download all of the files, click Download All Files.
For a PDF file in a dataset, you can add manual overrides to selected areas of a file. Manual overrides can either ignore redactions from Tonic Textual, or add redactions.
Pipelines do not support manual overrides in PDF files.
On a self-hosted instance of Textual, much of the configuration takes the form of environment variables.
After you configure an environment variable, you must restart Textual.
For Docker, add the variable to .env in the format:
SETTING_NAME=value
After you update .env, to restart Textual and complete the update, run:
$ docker-compose down
$ docker-compose pull && docker-compose up -d
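If you manage .env with a script, adding or updating a setting in the SETTING_NAME=value format can be sketched as follows. The helper name set_env_setting is hypothetical. The sketch assumes that a commented-out setting starts with # and that each setting occupies a single line.

```python
import re


def set_env_setting(env_text: str, name: str, value: str) -> str:
    """Return the .env content with NAME=value set, adding it if absent."""
    # Matches "NAME=" at the start of a line, optionally commented out with "#".
    pattern = re.compile(rf"^\s*#?\s*{re.escape(name)}\s*=")
    lines = env_text.splitlines()
    replaced = False
    for index, line in enumerate(lines):
        if pattern.match(line):
            # Overwrite (and uncomment) the existing entry.
            lines[index] = f"{name}={value}"
            replaced = True
    if not replaced:
        lines.append(f"{name}={value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(set_env_setting("A=1\n#TEXTUAL_MULTI_LINGUAL=\n", "TEXTUAL_MULTI_LINGUAL", "true"), end="")
```

After the file is written, the restart commands above still need to be run for the change to take effect.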
For Kubernetes, in values.yaml, add the environment variable to the appropriate env section of the Helm chart.
For example:
After you update the YAML file, to restart the service and complete the update, run:
$ helm upgrade <name_of_release> -n <namespace_name> <path-to-helm-chart>
The above Helm upgrade command is always safe to use when you provide specific version numbers. However, if you use the latest tag, it might result in Textual containers that have different versions.
For each entity type, you choose how to handle the detected values.
The available options are:
Synthesis - Indicates to replace the value with another realistic value. For example, the first name value Michael might be replaced with the value John. Textual does not synthesize any excluded values.
Redaction - This is the default option. For text files, Redaction indicates to tokenize the value - to replace it with a token that identifies the entity type. For example, the first name value Michael is replaced with the value NAME_GIVEN. For PDF files and image files, Redaction indicates to cover the value with a black box. Textual does not redact any excluded values.
Off - Indicates to not make any changes to the values. For example, the first name value Michael remains Michael.
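The effect of the three handling options on a text value can be illustrated with a small Python sketch. It is a simplified model, not how Textual performs detection or synthesis: the Entity class, the function name, and the synthetic_values lookup are hypothetical, and the entity positions are supplied by hand.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Entity:
    start: int   # index of the first character of the detected value
    end: int     # index just past the last character
    label: str   # entity type, for example NAME_GIVEN


def apply_handling(
    text: str,
    entities: List[Entity],
    handling: Dict[str, str],
    synthetic_values: Dict[str, str],
) -> str:
    """Apply Synthesis, Redaction, or Off to each detected entity."""
    result = text
    # Work from the end of the text so that earlier offsets stay valid.
    for entity in sorted(entities, key=lambda e: e.start, reverse=True):
        option = handling.get(entity.label, "Redaction")  # Redaction is the default
        if option == "Off":
            continue
        if option == "Synthesis":
            replacement = synthetic_values[entity.label]
        else:
            # For text, Redaction replaces the value with an entity type token.
            replacement = entity.label
        result = result[:entity.start] + replacement + result[entity.end:]
    return result


if __name__ == "__main__":
    detected = [Entity(start=6, end=13, label="NAME_GIVEN")]
    print(apply_handling("Hello Michael", detected, {}, {}))  # Hello NAME_GIVEN
```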
To select the handling option for an individual entity type, click the option for that type.
For a dataset, to select the same handling option for all of the entity types, from the Bulk Edit dropdown above the data type list, select the option.
For a pipeline that generates synthesized files, on the Generator Config tab, use the Bulk Edit options at the top of the entity types list.
You cannot preview TIF image files. You can preview PNG and JPG files.
You can display the preview from the file list or from the file manager.
From the file list, to display the preview, either:
Click the file name.
Click the options menu, then click Preview.
From the file manager, to display the preview, click the file thumbnail.
On the left, the preview displays the original data. The detected entity values are highlighted.
On the right, the preview displays the data with replacement values that are based on the dataset configuration for the detected entity types.
For a PDF file or an image file, for entity types that use the Redact handling option, the values are covered by a black box.
The preview for a PDF file also includes any manual overrides.
On the file details for a pipeline PDF or image file, on the Original tab:
To display the original file content, click Rendered.
To display the version of the file with the replacement values, click Redacted <file type>.
On a self-hosted instance of Tonic Textual, you can view the current model specifications for the instance.
To view the model specifications:
Click the user icon at the top right.
In the user menu, click System Settings.
On the System Settings page, the Model Specifications section provides details about the models that Textual uses.
The Tonic Textual pay-as-you-go plan allows you to automatically bill a credit card for your Textual usage.
The Textual subscription plan charges a flat rate for each 1000 words. You are billed each month based on when you started your subscription. For example, if you start your subscription on the 12th of the month, then you are billed every month on the 12th.
Tonic.ai integrates with a payment processing solution to manage the payments.
To start a new subscription, from a usage pane or upgrade prompt, click Upgrade Plan.
You are sent to the payment processing solution to enter your payment information.
The panel on the Home page shows the usage for the current month.
To view additional usage details, click Manage Plan.
The Manage Plan page displays the details for your subscription.
The summary at the top left contains an overview of the subscription payment information, as well as the total number of words scanned since you started your account.
From the summary, you can go to the payment processing solution to view and manage payment information.
The graph at the top of the page shows the words scanned per day for the previous 30 days.
The Current Billing Period panel summarizes your usage for the current month, and provides information about the next payment.
The Next billing date panel shows when the next billing period begins.
The Payment History section shows the list of subscription payments.
For each payment, the list shows the date and amount, and whether the payment was successful.
To download the invoice for a payment, click its Invoice option.
You can update the payment information for your subscription. For example, you might need to choose a different credit card or update an expiration date.
To manage the payment information:
On the home page, in the usage panel, click Manage Plan.
On the Manage Plan page, from the account summary, click Manage Payment.
You are sent to the payment processing solution to update your payment information.
To cancel a subscription, from the Manage Plan page:
Click Manage Payment.
In the payment processing solution, select the cancellation option.
The cancellation takes effect at the end of the current subscription month.
The Tonic Textual Snowflake Native App uses the same models and algorithms as the Tonic Textual API, but runs natively in Snowflake.
You use the app to redact or parse your text data directly within your Snowflake workflows. The text never leaves your data warehouse.
The app package runs natively in Snowflake, and leverages Snowpark Container Services.
It includes the following containers:
Detection service, which detects the sensitive entity values.
Redaction service, which replaces the sensitive entity values.
For the redaction workflow, you use the app to detect and replace sensitive values in text.
You use TEXTUAL_REDACT to send the redaction request.
When you call TEXTUAL_REDACT, it passes to the redaction service:
The text to redact
Optional configuration
The redaction service forwards the text to the detection service.
The detection service uses a series of NER models to identify and categorize sensitive words and phrases in the text.
The detection service returns its results to the redaction service.
The redaction service uses the results to replace the sensitive words and phrases with redacted or synthesized versions.
The redacted text is returned to the user.
For the parsing workflow, you use the app to parse files that are in a Snowflake internal or external stage.
You call TEXTUAL_PARSE to send the parse request. The request includes:
The fully qualified stage name where the files are located
The name of the file, or a variable that identifies the list of files
The MD5 sum of the file
The app uses a series of NER models to identify and categorize sensitive words and phrases in the text.
The app converts the content to a markdown format.
The markdown content is part of the JSON output that includes metadata about the parsed text. You can use the metadata to build RAG systems and LLM datasets.
The app stores the results of the parse request, including the output, in the TEXTUAL_RESULTS table.
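The parse request includes the MD5 sum of the file. As a minimal sketch, the following uses Python's standard hashlib module to compute a hexadecimal MD5 digest. The helper names are hypothetical, and the assumption that the hexadecimal form is the one the app expects should be confirmed against your deployment.

```python
import hashlib


def md5_hex(chunks):
    """Return the hexadecimal MD5 digest of an iterable of byte chunks."""
    digest = hashlib.md5()
    for chunk in chunks:
        digest.update(chunk)
    return digest.hexdigest()


def file_md5_hex(path, chunk_size=1024 * 1024):
    """Compute the MD5 digest of a local file without loading it all at once."""
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"".
        return md5_hex(iter(lambda: handle.read(chunk_size), b""))


if __name__ == "__main__":
    print(md5_hex([b"hello"]))
```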
The LLM synthesis option uses a large language model (LLM) to generate synthesized replacement values for the detected entities in the text.
This option requires an OpenAI key.
Before you can use this option on your self-hosted Textual instance, you must provide an OpenAI key as the value of the SOLAR_OPENAI_KEY environment variable.
From a file list, to display the details for a file, click the file name.
For files other than .txt files, the Original tab allows you to toggle between the generated Markdown and the rendered text.
For a .txt file, where there is no difference between the Markdown and the rendered text, the Original tab displays the file content.
In a pipeline that is configured to also generate redacted files, the Redacted <file type> option allows you to display the redacted version of a PDF or image file.
The Entities tab displays the file content with the detected entity values in context.
The actual values are followed by the type labels. For example, the given name John is displayed as John NAME_GIVEN.
The JSON tab contains the content of the output file. For cloud storage pipelines, the files are also in the output location that you configured for the pipeline.
For a PDF or image file that contains one or more tables, the Tables tab displays the tables. If the file does not contain any tables, then the Tables tab does not display.
For a PDF or image file that contains key-value pairs, the Key-Value Pairs tab displays the key-value pairs. If the file does not contain key-value pairs, then the Key-Value Pairs tab does not display.
Tonic Textual respects the access control policy of your single sign-on (SSO) provider. To access Textual, users must be granted access to the Textual application within your SSO provider.
To enable SSO, you first complete the required configuration in the SSO provider. You then configure Textual to connect to it.
After you enable SSO, users can use SSO to create an account in Textual.
To allow only SSO authentication, set the environment variable REQUIRE_SSO_AUTH to true. This disables standard email/password authentication. All account creation and login is handled through your SSO provider. If multi-factor authentication (MFA) is set up with your SSO, then all authentication must go through your provider's MFA.
Tonic Textual supports the following SSO providers:
For Amazon S3 pipelines, you connect to S3 buckets to select and store files.
On self-hosted instances, you also configure an S3 bucket and the credentials to use to store files for:
Uploaded file pipelines. The S3 bucket is required for uploaded file pipelines. The S3 bucket is not used for pipelines that connect to Azure Blob Storage or to Databricks Unity Catalog.
Dataset files. If you do not configure an S3 bucket, then the files are stored in the application database.
Individual files that you send to the SDK for redaction. If you do not configure an S3 bucket, then the files are stored in the application database.
Here are examples of IAM roles that have the required permissions to connect to Amazon S3 to select or store files.
For uploaded file pipelines, datasets, and individual file redactions, the files are stored in a single S3 bucket. For information on how to configure the S3 bucket and the corresponding access credentials, go to .
The IAM role that is used to connect to the S3 bucket must be able to read files from and write files to it.
Here is an example of an IAM role that has the permissions required to support uploaded file pipelines, datasets, and individual redactions:
The access credentials that you configure for an Amazon S3 pipeline must be able to navigate to and select files and folders from the appropriate S3 buckets. They also need to be able to write output files to the configured output location.
Here is an example of an IAM role that has the permissions required to support Amazon S3 pipelines:
A Tonic Textual dataset is a collection of text-based files. Textual uses models to detect and redact the sensitive information in each file.
To display the Datasets page, in the navigation menu, click Datasets.
From the Datasets page, you can create a new empty dataset. Textual prompts you for the dataset name, then displays the dataset details page.
To create a dataset:
On the Datasets page, click Create a Dataset.
On the dataset creation panel, in the Dataset Name field, provide the name of the dataset.
Click Create Dataset. The dataset details page for the new dataset is displayed.
To display the details page for a dataset, on the Datasets page, click the dataset name.
The dataset details page includes:
The list of files in the dataset
The results of the scan for entity values
The configured handling for each type of value
The dataset name displays in the panel at the top left of the dataset details page.
To change the dataset name:
On the dataset details page, click Settings.
On the Dataset Settings panel, click the edit icon next to the dataset name.
In the field, provide the new name for the dataset.
Click the save icon for the dataset name.
To delete a dataset:
On the dataset details page, click Settings.
On the Dataset Settings panel, click Delete Dataset.
Click Confirm Delete.
You can use Textual to generate versions of files where the sensitive values are redacted.
To only generate redacted files, you use a Tonic Textual dataset.
You can also optionally configure a pipeline to generate redacted files in addition to the JSON output.
At a high level, to use Textual to create redacted data:
Textual supports almost any free-text file, PDF files, .docx files, and .xlsx files. For images, Textual supports PNG, JPG (both .jpg and .jpeg), and TIF (both .tif and .tiff) files.
At any time, including before you upload files and after you review the detection results, you can configure how Textual handles the detected values for each entity type.
For datasets, you can also provide added and excluded values for each entity type.
By default, Textual redacts the entity values, which means that it replaces each value with a token that identifies the type of sensitive value. For example, PERSON or LOCATION. For PDF files and image files, redaction means to cover the value with a black box.
For a given entity type, you can instead choose to synthesize the values, which means to replace the original value with a realistic replacement.
You can also choose to ignore the values, and not replace them.
For a dataset, Textual automatically updates the file previews and downloadable files to reflect the updated configuration.
For a pipeline, the updated configuration is applied the next time you run the pipeline, and only applies to new files.
Pipelines do not allow you to add or exclude individual values.
Datasets also provide additional options for PDF files. These options are not available in pipelines.
You can use manual overrides either to ignore the automatically detected redactions in the selected area, or to redact the selected area.
For each entity type, you can adjust how Tonic Textual identifies and updates the values.
For a dataset, you configure the redaction from the entity types list on the dataset details.
For a pipeline that generates synthesized files, you configure the redaction from the Generator Config tab on the pipeline details page.
For uploaded file pipelines, Tonic Textual automatically processes new files as you add them.
For cloud storage pipelines, you start pipeline runs. A pipeline run processes the pipeline files. The pipeline run only processes files that were not processed by a previous pipeline run.
If the pipeline is configured to also redact files, the run also generates the redacted version of each file. The redaction is based on the current redaction configuration for the pipeline. The first run after you enable redaction generates redacted versions of all of the pipeline files, including files that were processed by earlier runs. Subsequent runs only process new files.
To start a pipeline run, on the pipeline details page, click Run Pipeline.
Before the run completes, to cancel the run:
On the Pipeline Runs tab, click the Cancel Run option for the run.
On the confirmation panel, click Cancel Run.
The following diagram shows how data and requests flow within the Tonic Textual application:
The Textual application database is a PostgreSQL database that stores the dataset configuration.
The web server runs the Textual user interface.
A Textual instance can have multiple workers.
The worker orchestrates jobs. A job is a longer running task such as the redaction of a single file.
If you redact a large number of files, you might deploy additional workers and machine learning containers to increase the number of files that you can process concurrently.
A Textual installation can have one or more machine learning containers.
The machine learning container hosts the Textual models. It takes text from the worker or web server and returns any entities that it discovers.
Additional machine learning containers can increase the number of words per second that Textual can process.
Use these instructions to set up Google as your SSO provider for Tonic Textual.
Go to
Click Create credentials, located near the top.
Select OAuth client ID.
Select Web application as the application type.
Choose a name.
Under Authorized redirect URIs, add the URL of the Textual server with the endpoint /sso/callback/google.
For example, a local Textual server at http://localhost:3000 would need http://localhost:3000/sso/callback/google to be set as the redirect URI.
Also note that internal URLs might not work.
On the confirmation page, note the client ID and client secret. You will need to provide them to Textual.
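The redirect URI in the steps above is the Textual server URL followed by the /sso/callback/google endpoint. A minimal sketch of that construction follows. The function name is hypothetical and is not part of Textual.

```python
def google_redirect_uri(textual_base_url):
    """Build the Google redirect URI for a Textual server URL.

    Hypothetical helper. It appends the /sso/callback/google endpoint
    and tolerates a trailing slash on the base URL.
    """
    return textual_base_url.rstrip("/") + "/sso/callback/google"


if __name__ == "__main__":
    print(google_redirect_uri("http://localhost:3000"))
```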
After you complete the configuration in Google, you uncomment and configure the required environment variables in Textual.
The client ID
The client secret
For Kubernetes, in values.yaml:
For Docker, in .env:
The file manager displays thumbnails of the dataset files.
From the file manager, you can:
To display the file manager, on the dataset details page, click Preview and Manage Files.
Tonic Textual can process the following types of files:
txt
csv
tsv
docx
xlsx
png
tif or tiff
jpg or jpeg
On a self-hosted instance, you can configure an S3 bucket where Textual stores the files. This is the same S3 bucket that is used for uploaded file pipelines. For more information, go to . For an example of an IAM role with the required permissions, go to .
From the dataset details page, to add files to the dataset:
In the panel at the top left, click Upload Files.
Search for and select the files.
Tonic Textual uploads and then processes the files.
Do not leave the page while files are uploading. If you leave the page before the upload is complete, then the upload stops.
You can leave the page while Textual is processing the file.
To remove a file from the dataset, you can use the option in the dataset file list or on the file manager.
From the file list on the dataset details page, to remove a file from the dataset:
Click the options menu for the file.
In the options menu, click Delete.
From the file manager, to remove a file from the dataset:
Click the options menu for the file.
In the options menu, click Delete File.
The Tonic Textual Home page provides a tool that allows you to see how Textual detects and replaces values in plain text. It also provides a preview of the redaction configuration options, including:
How to replace the values for each entity type.
Added and excluded values for each entity type.
The Home page displays automatically when you log in to Textual. To return to the Home page from other pages, in the navigation menu, click Home.
On the Home page, as you enter text in the Original Content text area, Textual displays the redacted version in the Results panel at the right.
Textual also provides sample text options for some common use cases. To populate the text with a sample, under Try a sample, click the sample to use.
To clear the text, click Clear.
The handling option indicates how Textual replaces a detected value for an entity type. You can experiment with different handling options.
Note that the updated configuration is only used for the current redacted text. When you clear the text, Textual also clears the configuration.
The options are:
Redact - This is the default value. Textual replaces the value with the name of the entity type.
For example, the first name John is replaced with NAME_GIVEN.
Synthesize - Textual replaces the value with a realistic generated value. For example, the first name John is replaced with Michael.
Off - Textual ignores the value and copies it as is to the Results panel.
To change the handling option for an entity type:
In the Results panel, click an instance of the entity type.
On the configuration panel, click the handling option to use.
Textual updates all instances of that entity type to use the selected handling option.
For example, if you change the handling option for NAME_GIVEN to Synthesize, then all instances of first names are replaced with realistic values.
For each entity type, you can use regular expressions to define added and excluded values.
Added values are values that Textual does not detect for an entity type, but that you want to include. For example, you might have values that are specific to your company or industry.
Excluded values are values that you do not want Textual to identify as a given entity type.
Note that the configuration is only used for the current redacted text. When you clear the text, Textual also clears the configuration.
To display the configuration panel for added and excluded values, click Fine-tune Results.
The Fine-Tune Results panel displays the list of configured rules for the current text. For each rule, the list includes:
The entity type.
Whether the rule adds or excludes values.
The regular expression to identify the added or excluded values.
On the Fine-Tune Results panel, to create a rule:
Click Add Rule.
From the entity type dropdown list, select the entity type that the rule applies to.
From the rule type dropdown list:
If the rule adds values, then select Include.
If the rule excludes values, then select Exclude.
In the regular expression field, provide the regular expression to use to identify the values to add or exclude.
To save the rule, click the save icon.
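To illustrate how Include and Exclude rules might combine with the detected values, here is a minimal sketch that uses Python's re module. It does not reproduce how Textual evaluates rules. The function name, the span format, the in-order rule application, and the full-match behavior for Exclude rules are all assumptions made for illustration.

```python
import re


def apply_rules(text, detected, rules):
    """Apply Include and Exclude regex rules to detected entity spans.

    detected: (start, end, entity_type) tuples from an earlier detection.
    rules: (entity_type, rule_type, pattern) tuples, applied in order.
    Hypothetical sketch; the real rule semantics may differ.
    """
    spans = set(detected)
    for entity_type, rule_type, pattern in rules:
        regex = re.compile(pattern)
        if rule_type == "Include":
            # Add every match of the pattern as a value of this entity type.
            for match in regex.finditer(text):
                spans.add((match.start(), match.end(), entity_type))
        elif rule_type == "Exclude":
            # Drop spans of this entity type whose text fully matches.
            spans = {
                (start, end, kind)
                for (start, end, kind) in spans
                if not (kind == entity_type and regex.fullmatch(text[start:end]))
            }
    return sorted(spans)


if __name__ == "__main__":
    sample = "Zed emailed Test about the demo."
    start = sample.find("Test")
    found = [(start, start + 4, "NAME_GIVEN")]
    rules = [
        ("NAME_GIVEN", "Include", r"\bZed\b"),
        ("NAME_GIVEN", "Exclude", r"Test"),
    ]
    print(apply_rules(sample, found, rules))
```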
To edit a rule:
On the Fine-Tune Results panel, click the edit icon for the rule.
Update the configuration.
Click the save icon.
On the Fine-Tune Results panel, to delete a rule, click its delete icon.
When Textual generates the redacted version of the text, it also generates the corresponding API request. The request includes the entity type configuration.
To view the API request code, click Show Code.
To hide the code, click Hide Code.
On the code panel:
The Python tab contains the Python version of the request.
The cURL tab contains the cURL version of the request.
To copy the currently selected version of the request code, click Copy Code.
A dataset might contain multiple files that have the same structure, such as a set of scanned-in forms.
Instead of adding the same manual overrides for each file, you can use a PDF file in the dataset to create a template that you can apply to other PDF files in the dataset.
When you , you can apply a template.
To add a PDF template to a dataset:
On the dataset details page, click PDF Templates.
On the template creation and selection panel, click Create a New Template.
On the template details page:
In the Name field, provide a name for the template.
From the file dropdown list, select the dataset file to use to create the template.
Add the manual overrides to the file.
When you finish adding the manual overrides, click Save New Template.
When you update a PDF template, it affects any files that use the template.
To update a PDF template:
On the dataset details page, click PDF Templates.
Under Edit an Existing Template, select the template, then click Edit Selected Template.
On the template details panel, you can change the template name, and add or remove manual overrides.
To save the changes, click Update Template.
On the template details panel, to add a manual override to a file:
Select the type of override. To indicate to ignore any automatically detected values in the selected area, click Ignore Redactions. To indicate to redact the selected area, click Add Manual Redaction.
Use the mouse to draw a box around the area to select.
Tonic Textual adds the override to the Redactions list. The icon indicates the type of override.
To select and highlight a manual override in the file content, in the Redactions list, click the navigate icon for the override.
To remove a manual override, in the Redactions list, click the delete icon for the override.
When you delete a PDF template, the template and its manual overrides are removed from any files that the template was assigned to.
To delete a PDF template:
On the dataset details page, click PDF Templates.
Under Edit an Existing Template, select the template, then click Edit Selected Template.
On the template details panel, click Delete.
Snowpark Container Services (SPCS) allow developers to run containerized workloads directly within Snowflake. Because Tonic Textual is distributed using a private Docker repository, you can use these images in SPCS to run Textual workloads.
It is quicker to use the Snowflake Native App, but SPCS allows for more customization.
To use the Textual images, you must add them to Snowflake. The Snowflake documentation walks through the process in detail, but the basic steps are as follows:
To pull down the required images, you must have access to our private Docker image repository on . You should have been provided credentials during onboarding. If you require new credentials, or you experience issues accessing the repository, contact . Once you have access, pull down the following images:
textual-snowflake
Either textual-ml or textual-ml-gpu, depending on whether you plan to use a GPU compute pool
The images are now available in Snowflake.
The API service exposes the functions that are used to redact sensitive values in Snowflake. The service must be attached to a compute pool. You can scale the instances as needed, but you likely only need one API.
Next, you create the ML service, which recognizes personally identifiable information (PII) and other sensitive values in text. This is more likely to need scaling.
You can create custom SQL functions that use your API and ML services. These functions are accessible from directly within Snowflake.
It can take a couple of minutes for the containers to start. After the containers are started, you can use the functions that you created in Snowflake.
To test the functions, use an existing table. You can also create this simple test table:
For example:
By default, the function redacts the entity values. In other words, it replaces the values with a placeholder that includes the type. Synthesis indicates to replace the value with a realistic replacement value. Off indicates to leave the value as is.
The response from the above example should look something like this:
For each entity type in a dataset, you can configure additional values to detect, and values to exclude.
You might add values that Textual does not detect because, for example, they are specific to your organization or industry.
You might exclude a value because:
Textual labeled the value incorrectly.
You do not want to redact a specific value. For example, you might want to preserve known test values.
Note that for a pipeline that redacts files, you cannot add or exclude specific values.
In the entity types list, the add values and exclude values icons indicate whether there are configured added and excluded values for the entity type.
When added or excluded values are configured, the corresponding icon is green.
When there are no configured values, the corresponding icon is black.
From the Custom Entity Detection panel, you configure both added and excluded values for entity types.
To display the panel, either:
Click the add values or exclude values icon for an entity type.
In the word count panel, click Custom Entity Detection.
The panel contains an Add to detection tab for added values, and an Exclude from detection tab for excluded values.
The entity type dropdown list at the top of the Custom Entity Detection panel indicates the entity type to configure added and excluded values for.
If you display the panel from an add values or exclude values icon, then the initial selected entity type is the entity type for which you clicked the icon. To configure values for a different entity type, select the entity type from the list.
If you display the panel from the Custom Entity Detection option, then there is no default selection. You must select the entity type.
On the Add to detection tab, you configure the added values for the selected entity type.
Each value can be a specific word or phrase, or a regular expression to identify the values to add. Regular expressions must be C# compatible.
To add an added value:
Click the empty entry.
Type the value into the field.
To edit an added value:
Click the value.
Update the value text.
For each added value, you can test whether Textual correctly detects it.
To test a value:
From the Test Entry dropdown list, select the number for the value to test.
In the text field, type or paste content that contains a value or values that Textual should detect.
The Results field displays the text and highlights matching values.
To remove an added value, click its delete icon.
On the Exclude from detection tab, you configure the excluded values for the selected entity type.
Each value can be either a specific word or phrase to exclude, or a regular expression to identify the values to exclude. The regular expression must be C# compatible.
You can also provide a specific context within which to ignore a value. For example, in the phrase "one moment, please", you probably do not want the word "one" to be detected as a numeric value. If you specify "one moment, please" as an excluded value for the numeric entity type, then "one" is not identified as a number when it is seen in that context.
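The context-based exclusion described above can be sketched as follows: a detected value is dropped only when it falls inside an occurrence of the excluded phrase. This is an illustration and not Textual's implementation. The function name, the span format, and the NUMERIC_VALUE label used for the numeric entity type are assumptions.

```python
def exclude_in_context(text, detected, entity_type, context_phrase):
    """Drop spans of entity_type that sit inside context_phrase.

    detected: (start, end, entity_type) tuples.
    Hypothetical sketch of context-based exclusion.
    """
    # Collect the character range of every occurrence of the phrase.
    ranges = []
    position = text.find(context_phrase)
    while position != -1:
        ranges.append((position, position + len(context_phrase)))
        position = text.find(context_phrase, position + 1)

    def inside_context(start, end):
        return any(lo <= start and end <= hi for lo, hi in ranges)

    return [
        (start, end, kind)
        for (start, end, kind) in detected
        if not (kind == entity_type and inside_context(start, end))
    ]


if __name__ == "__main__":
    sample = "one moment, please. I have one question."
    second = sample.rfind("one")
    found = [(0, 3, "NUMERIC_VALUE"), (second, second + 3, "NUMERIC_VALUE")]
    # Only the "one" outside the excluded phrase remains.
    print(exclude_in_context(sample, found, "NUMERIC_VALUE", "one moment, please"))
```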
To add an excluded value:
Click the empty entry.
Type the value into the field.
To edit an excluded value:
Click the value.
Update the value text.
For each excluded value, you can test whether Textual correctly detects it.
To test the value that you are currently editing:
From the Test Entry dropdown list, select the number for the value to test.
In the text field, type or paste content that contains a value or values to exclude.
The Results field displays the text and highlights matching values.
To remove an excluded value, click its delete icon.
The new added and excluded values are not reflected in the entity types list until Textual runs a new scan.
When you save the changes, you can choose whether to immediately run a new scan on the dataset files.
To save the changes and also start a scan, click Save and Scan Files.
To save the changes, but not run a scan, click Save Without Scanning Files. When you do not run the scan, then on the dataset details page, Textual displays a prompt to run a scan.
When you first create a dataset, Tonic Textual displays a single list of all of the entity types that it can detect.
As you add and remove files, Textual updates the entity types list to indicate the detected and not detected entity types.
At the top of the dataset details view, the Sensitive words tile shows the total number of sensitive values in the dataset that Textual detected.
As Textual processes files, it identifies the entity types that are detected and not detected.
The entity type list starts with the detected entity types. For each detected entity type, Textual displays:
The number of detected values. Excluded values are not included in the count.
The selected handling option.
Whether there are configured added or excluded values.
Note that a given value might match multiple entity types. For example, a telephone number might be counted as both a telephone number and a numeric value.
For each detected entity type, to view a sample of up to 10 of the detected values, click the view icon next to the value count.
The entities list contains the full list of detected values for an entity type.
To display the entities list, from the value preview, click Open Entities Manager.
When you display the entities list, the entity type that you previewed the values for is selected by default.
To change the selected entity type, from the dropdown at the top left, select the entity type to view values for.
In the entity types list on the dataset details page, a given value might match multiple entity types.
For example, on the dataset details page, a telephone number might be counted as both a telephone number and a numeric value.
A value that matches multiple entity types appears in the preview for each of those types.
However, on the entities list, a given value is only included in the list for one entity type. Textual determines the best match and lists the value under that entity type.
So to continue the example of the telephone number that is also identified as a numeric value, the value would only display in the list of values for the Phone Number entity type. The value list for the Numeric Value entity type would not include the telephone number.
The entities list groups the entities by the file and, if relevant, the page where they were detected.
For each value, the list includes:
The original value
The original value in the context of its surrounding text
The redacted or synthesized value in the context of its surrounding text, based on the selected handling option
Below the list of detected entity types is the Entity types not found list, which contains the list of entity types that Textual did not detect in the files.
You can filter the entity types list by text in the type name or description. The filter applies to both the detected and undetected entity types.
To filter the types, in the filter field, begin to type text that is in the entity type name or description.
After you install the app in the Snowflake UI, only the ACCOUNTADMIN role has access to it.
You can grant access to other roles as needed.
To start the app, run the following command:
This initializes the application. You can then use the app to redact or parse text data.
You use the TEXTUAL_REDACT function to detect and replace sensitive values in text.
The TEXTUAL_REDACT function takes the following arguments:
The text to redact, which is required
Optionally, a PARSE_JSON JSON object that represents the generator configuration for each entity type. The generator configuration indicates what to do with the detected value.
For each entry in PARSE_JSON:
<HandlingType> indicates what to do with the detected value. The options are:
Redact, which replaces the value with a redacted value in the format [<EntityType>_<RandomIdentifier>]
Synthesis, which replaces the value with a realistic replacement
Off, which leaves the value as is
If you do not include PARSE_JSON, then all of the detected values are redacted.
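As a rough illustration of how the two arguments fit together, the following Python sketch builds the SQL text for a TEXTUAL_REDACT call. The helper name build_redact_call is hypothetical, and the {"<EntityType>": "<HandlingType>"} shape of the configuration object is an assumption based on the description above. The function name is not schema-qualified here; your installation might require a qualified name.

```python
import json

# Handling options named in the text above.
HANDLING_OPTIONS = {"Redact", "Synthesis", "Off"}


def build_redact_call(text, handling=None):
    """Return a SQL string that calls TEXTUAL_REDACT.

    `handling` maps entity type names to handling options. The
    {"<EntityType>": "<HandlingType>"} shape is an assumption.
    """
    escaped_text = text.replace("'", "''")  # escape quotes for a SQL literal
    if not handling:
        # Without a configuration, all detected values are redacted.
        return f"SELECT TEXTUAL_REDACT('{escaped_text}');"
    for entity_type, option in handling.items():
        if option not in HANDLING_OPTIONS:
            raise ValueError(f"Unknown handling option for {entity_type}: {option}")
    config = json.dumps(handling).replace("'", "''")
    return f"SELECT TEXTUAL_REDACT('{escaped_text}', PARSE_JSON('{config}'));"


print(build_redact_call("My name is Jane Doe", {"NAME_GIVEN": "Synthesis"}))
```

String interpolation keeps the sketch short. In application code, bound parameters are usually a safer way to pass the text.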
The following example sends a text string to the app:
This returns the redacted text, which looks similar to the following:
Because we did not specify the handling for any of the entity types, both the first name Jane and last name Doe are redacted.
In this example, when a first name (NAME_GIVEN) is detected, it is synthesized instead of redacted.
This returns output similar to the following. The first name Jane is replaced with a realistic value (synthesized), and the last name Doe is redacted.
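Because redacted values use the format [<EntityType>_<RandomIdentifier>], downstream code can locate them with a regular expression. The following is a minimal sketch. The helper name find_redaction_tokens is hypothetical, and the character classes in the pattern are an assumption based on the sample values in this documentation.

```python
import re

# Matches tokens such as [NAME_GIVEN_Kx0Y7]. The character classes are an
# assumption based on the sample values in this documentation.
TOKEN_PATTERN = re.compile(r"\[([A-Z_]+_[A-Za-z0-9]+)\]")


def find_redaction_tokens(text):
    """Return (entity_type, identifier) pairs for each redacted token."""
    pairs = []
    for match in TOKEN_PATTERN.finditer(text):
        # The identifier follows the last underscore inside the brackets.
        entity_type, identifier = match.group(1).rsplit("_", 1)
        pairs.append((entity_type, identifier))
    return pairs


print(find_redaction_tokens("Hi my name is [NAME_GIVEN_Kx0Y7] [NAME_FAMILY_s9TTP0]"))
```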
You use the TEXTUAL_PARSE function to transform files in an external or internal stage into Markdown-based content that you can use to populate LLM systems.
The output includes metadata about the file, including sensitive values that were detected.
To be able to parse the files, Textual must have access to the stage where the files are located.
Your role must be able to grant the USAGE and READ permissions.
To grant Textual access to the stage, run the following commands:
To send a parse request for a single file, run the following:
Where:
<FullyQualifiedStageName> is the fully qualified name of the stage, in the format <DatabaseName>.<SchemaName>.<StageName>. For example, database1.schema1.stage1.
<FileName> is the name of the file.
<FileMD5Sum> is the MD5 checksum of the file content.
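If you need to compute an MD5 checksum for a local copy of a file, a sketch like the following works with the Python standard library. The helper name file_md5 is hypothetical. Whether the app expects the checksum in exactly this hexadecimal form is an assumption.

```python
import hashlib


def file_md5(path, chunk_size=1024 * 1024):
    """Return the hexadecimal MD5 checksum of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size blocks so that large files do not fill memory.
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()
```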
To parse a large number of files:
List the stage files to parse. For example, you might use PATTERN to limit the files based on file type.
Run the parse request command on the list.
For example:
The app writes the results to the TEXTUAL_RESULTS table.
For each request, the entry in TEXTUAL_RESULTS includes the request status and the request results.
The status is one of the following values:
QUEUED - The parse request was received and is waiting to be processed.
RUNNING - The parse request is currently being processed.
SKIPPED - The parse request was skipped because the file did not change since the previous time it was parsed. Whether a file is changed is determined by its MD5 checksum.
FAILURE_<FailureReason> - The parse request failed for the provided reason.
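Client code that polls the results table might interpret these status values as sketched below. The function names are hypothetical, and the sketch treats any status other than QUEUED or RUNNING as finished, which is an assumption.

```python
def describe_status(status):
    """Split a TEXTUAL_RESULTS status into (state, failure_reason)."""
    prefix = "FAILURE_"
    if status.startswith(prefix):
        return "FAILURE", status[len(prefix):]
    return status, None


def is_finished(status):
    """Return True when no further processing is expected for the request."""
    state, _ = describe_status(status)
    # QUEUED and RUNNING are the two in-progress states listed above.
    return state not in {"QUEUED", "RUNNING"}
```

For example, describe_status applied to a hypothetical FAILURE_TIMEOUT status returns the state FAILURE with the reason TIMEOUT.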
You can query the parse results in the same way as you would any other Snowflake VARIANT column.
For example, the following command retrieves the parsed documents, which are in a converted Markdown representation.
To retrieve the entities that were identified in the document:
Because the result column is a simple variant, you can use flattening operations to perform more complex analysis. For example, you can extract all entities of a certain type or value across the documents, or find all documents that contain a specific type of entity.
From Tonic Textual, you can download the JSON output for each file. For pipelines that also generate synthesized files, you can download those files.
You can also use the Textual API to further process the pipeline output - for example, you can chunk the output and determine whether to replace sensitive values before you use the output in a RAG system.
Textual provides next step hints to use the pipeline output. The examples in this topic provide details about how to use the output.
From a file details page, to download the JSON file, click Download Results.
On the file details for a pipeline file, to download the synthesized version of the file, click Download Synthesized File.
On the Original tab for files other than .txt files, the Redacted <file type> view contains a Download option.
For cloud storage pipelines, the synthesized files are also available in the configured output location.
On the pipeline details page, the next steps panel at the left contains suggested steps to set up the API and use the pipeline output:
Create an API Key contains a link to create the key
Install the Python SDK contains a link to copy the SDK installation command
Fetch the pipeline results provides access to code snippets that you can use to retrieve and chunk the pipeline results.
At the top of the Fetch the pipeline results step is the pipeline identifier. To copy the identifier, click the copy icon.
The pipeline results step provides access to the following snippets:
Markdown - A code snippet to retrieve the Markdown results for the pipeline.
JSON - A code snippet to retrieve the JSON results for the pipeline.
Chunks - A code snippet to chunk the pipeline results.
To view a snippet, click the snippet tab.
To display the snippet panel, on the snippet tab, click View. The snippet panel provides a larger view of the snippet.
To copy the code snippet, on the snippet tab or the snippet panel, click Copy.
This example shows how to use your Textual pipeline output to create private chunks for RAG, where sensitive chunks are dropped, redacted, or synthesized.
This allows you to ensure that the chunks that you use for RAG do not contain any private information.
First, we connect to the API and get the files from the most recent pipeline.
Next, specify the sensitive entity types, and indicate whether to redact or to synthesize those entities in the chunks.
Next, generate the chunks.
In the following code snippet, the final list does not include chunks with sensitive entities.
To include the chunks with the sensitive entities redacted, remove the if chunk['is_sensitive']: continue lines.
The chunks are now ready to use for RAG or for other downstream tasks.
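The filtering step can be sketched without the SDK as follows. This is an illustration, not the snippet that Textual provides. It assumes that each chunk is a dictionary with a text key and the is_sensitive flag mentioned above, and that the text already has its sensitive values redacted or synthesized. The function name and sample data are hypothetical.

```python
def private_chunks(chunks, drop_sensitive=True):
    """Return the chunk texts that are safe to index."""
    kept = []
    for chunk in chunks:
        if drop_sensitive and chunk["is_sensitive"]:
            continue  # drop the chunk entirely
        kept.append(chunk["text"])
    return kept


# Hypothetical chunks for illustration only.
sample = [
    {"text": "The quarterly report is attached.", "is_sensitive": False},
    {"text": "Contact [NAME_GIVEN_Kx0Y7] for details.", "is_sensitive": True},
]
print(private_chunks(sample))
print(private_chunks(sample, drop_sensitive=False))
```

Setting drop_sensitive to False corresponds to removing the is_sensitive check, so that redacted chunks are kept.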
This example shows how to use Pinecone to add your Tonic Textual pipeline output to a vector retrieval system, for example for RAG.
The Pinecone metadata filtering options allow you to incorporate Textual NER metadata into the retrieval system.
First, connect to the Textual pipeline API, and get the files from the most recently created pipeline.
Next, specify the entity types to incorporate into the retrieval system.
Chunk the files.
For each chunk, add the metadata that contains the instances of the entity types that occur in that chunk.
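One way to build that metadata is to group the entity text by entity type. The sketch below assumes that each entity is a dictionary with the label and text properties described for the JSON output. The function name entity_metadata and the sample values are hypothetical.

```python
from collections import defaultdict


def entity_metadata(entities, entity_types):
    """Group entity text by label, keeping only the requested entity types."""
    grouped = defaultdict(list)
    for entity in entities:
        label = entity["label"]
        # Skip unwanted types and avoid storing the same value twice.
        if label in entity_types and entity["text"] not in grouped[label]:
            grouped[label].append(entity["text"])
    return dict(grouped)


# Hypothetical entities for a single chunk.
chunk_entities = [
    {"label": "NAME_GIVEN", "text": "John"},
    {"label": "NAME_FAMILY", "text": "Smith"},
    {"label": "ORGANIZATION", "text": "Google"},
    {"label": "NAME_GIVEN", "text": "John"},
]
print(entity_metadata(chunk_entities, {"NAME_GIVEN", "ORGANIZATION"}))
```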
Next, embed the text of the chunks.
For each chunk, store the following in a Pinecone vector database:
Text
Embedding
Metadata
You define the embedding function for your system.
When you query the Pinecone database, you can then use metadata filters that specify entity type constraints.
For example, to only return chunks that contain the name John Smith:
As another example, to only return chunks that contain one of the following organizations - Google, Apple, or Microsoft:
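The two filters described above might be expressed as follows, using the Pinecone $and and $in filter operators. This is a sketch under assumptions: the metadata keys are assumed to be Textual entity type names, each metadata value is assumed to be a list of the entity strings found in the chunk, and the helper names are hypothetical.

```python
def any_of_filter(entity_type, values):
    """Build a filter that matches chunks with any of the given values."""
    return {entity_type: {"$in": list(values)}}


def all_of_filter(*filters):
    """Combine filters so that a chunk must satisfy every one of them."""
    return {"$and": list(filters)}


# Chunks that mention the name John Smith.
john_smith_filter = all_of_filter(
    any_of_filter("NAME_GIVEN", ["John"]),
    any_of_filter("NAME_FAMILY", ["Smith"]),
)

# Chunks that mention Google, Apple, or Microsoft.
organization_filter = any_of_filter("ORGANIZATION", ["Google", "Apple", "Microsoft"])

print(john_smith_filter)
print(organization_filter)
```

You would typically pass one of these dictionaries as the filter argument of the index query call.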
For some entity types, when you select the Synthesis option, you can configure additional options for how Tonic Textual generates the replacement values.
To display the available options, click Options.
Location values include the following types:
Location
Location Address
Location State
Location Zip
For each location type other than Location State, you can specify whether to use a realistic replacement value. For Location State, based on HIPAA guidelines, both the Synthesis option and the Off option pass through the value.
For location types that include zip codes, you can also specify how to generate the new zip code values.
By default, Textual replaces a location value with a realistic corresponding value. For example, "Main Street" might be replaced with "Fourth Avenue".
To instead scramble the values, uncheck Replace with realistic values.
By default, to generate a new zip code, Textual selects a real zip code that starts with the same three digits as the original zip code. For a low population area, Textual instead selects a random zip code from the United States.
To instead replace the last two digits of the zip code with zeros, check Replace zeroes for zip codes. For a low population area, Textual instead replaces all of the digits in the zip code with zeros.
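The zero-replacement option can be illustrated with a short sketch. It assumes five-digit zip codes and takes the low population decision as a caller-supplied flag, because this topic does not describe how Textual identifies low population areas. The function name zero_zip is hypothetical.

```python
def zero_zip(zip_code, low_population=False):
    """Replace trailing digits of a five-digit zip code with zeros."""
    if low_population:
        # Low population areas: every digit becomes zero.
        return "0" * len(zip_code)
    # Otherwise keep the first three digits and zero the last two.
    return zip_code[:3] + "0" * (len(zip_code) - 3)


print(zero_zip("30309"))
print(zero_zip("30309", low_population=True))
```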
By default, when you select the Synthesis option for Date/Time and Date of Birth values, Textual shifts the datetime values to a value that occurs within 7 days before or after the original value.
To customize how Textual sets the new values, you can:
Set a different range within which Textual sets the new values
Indicate whether to scramble date values that Textual cannot parse
Add additional date formats for Textual to recognize
By default, Textual adjusts the dates to values that are within 7 days before or after the original date.
To change the range, in the # of Days To Shift +/- field, enter the number of days before and after the original date within which the replacement datetime value must occur. For example, if you enter 10, then the replacement datetime value must occur within 10 days before or after the original value.
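To make the range concrete, the following sketch shifts a datetime by a random whole number of days within the configured range. It only illustrates the constraint. How Textual actually selects the offset is not described here, so the uniform random choice and the function name shift_datetime are assumptions.

```python
import random
from datetime import datetime, timedelta


def shift_datetime(value, max_days=7, rng=random):
    """Shift a datetime by a whole number of days within +/- max_days."""
    offset = rng.randint(-max_days, max_days)  # both endpoints are included
    return value + timedelta(days=offset)


original = datetime(2024, 1, 17, 15, 45)
shifted = shift_datetime(original, max_days=10, rng=random.Random(0))
print(original, shifted)
```

The time of day is unchanged, and the result is never more than max_days away from the original value.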
The Scramble Unrecognized Dates checkbox indicates how Textual should handle datetime values that it does not recognize.
By default, the checkbox is checked, and Textual scrambles those values.
To instead pass through the values without changing them, uncheck Scramble Unrecognized Dates.
Under Additional Date Formats, you can add other datetime formats that you know are present in your data.
To add a format, type the format in the field, then click +.
To remove a format, click its delete icon.
By default, Textual supports the following datetime formats.
By default, when you select the Synthesis option for Age values, Textual shifts the age value to a value that is within seven years before or after the original value. For age values that it cannot synthesize, it scrambles the value.
To configure the synthesis for Age values:
In the Range of Years +/- for the Shifted Age field, enter the number of years before and after the original value to use as the range for the synthesized value.
By default, Textual scrambles age values that it cannot parse. To instead pass through the value unchanged, uncheck Scramble Unrecognized Ages.
When Textual processes pipeline files, it produces JSON output that provides access to the Markdown content and that identifies the detected entities in the file.
All JSON output files contain the following elements that contain information for the entire file:
For specific file types, the JSON output includes additional objects and properties to reflect the file structure.
The JSON output contains hashed and Markdown content for the entire file and for individual file components.
The JSON output contains entities arrays for the entire file and for individual file components.
Each entity in the entities array has the following properties:
For plain text files, the JSON output only contains the information for the entire file.
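As an illustration, the following sketch reads the file-level entities from a plain text output document. The sample document is hypothetical and uses the property names listed in this documentation. The fileType value, the hash value, and the exact nesting of text, hash, and entities under content are assumptions.

```python
import json

# A hypothetical output document for a plain text file.
raw_output = """
{
  "fileType": "Txt",
  "schemaVersion": 1,
  "content": {
    "text": "My name is John.",
    "hash": "example-hash",
    "entities": [
      {"start": 11, "end": 14, "label": "NAME_GIVEN",
       "text": "John", "score": 0.9, "language": "en"}
    ]
  }
}
"""


def entities_with_label(document, label, min_score=0.0):
    """Return file-level entities of one type at or above a confidence score."""
    entities = document["content"]["entities"]
    return [e for e in entities if e["label"] == label and e["score"] >= min_score]


document = json.loads(raw_output)
names = entities_with_label(document, "NAME_GIVEN", min_score=0.5)
print([entity["text"] for entity in names])
```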
For .csv files, the structure contains a tables array.
The tables array contains a table object that contains header and data arrays.
For each row in the file, the data array contains a row array.
For each value in a row, the row array contains a value object.
The value object contains the entities, hashed content, and Markdown content for the value.
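The nesting described above can be walked with a few loops. In this sketch, the tables and data keys come from the description, while the entities key on each value object and the sample document are assumptions. The function name csv_cell_entities is hypothetical.

```python
def csv_cell_entities(document):
    """Yield (table, row, column, entity) for each entity in a .csv output."""
    for t, table in enumerate(document.get("tables", [])):
        for r, row in enumerate(table.get("data", [])):
            for c, value in enumerate(row):
                for entity in value.get("entities", []):
                    yield t, r, c, entity


# A hypothetical .csv output document with one table and two rows.
csv_document = {
    "tables": [
        {
            "header": ["name", "note"],
            "data": [
                [
                    {"text": "John", "entities": [{"label": "NAME_GIVEN", "text": "John"}]},
                    {"text": "none", "entities": []},
                ],
                [
                    {"text": "Jane", "entities": [{"label": "NAME_GIVEN", "text": "Jane"}]},
                    {"text": "none", "entities": []},
                ],
            ],
        }
    ]
}

for t, r, c, entity in csv_cell_entities(csv_document):
    print(t, r, c, entity["label"], entity["text"])
```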
For .xlsx files, the structure contains a tables array that provides details for each worksheet in the file.
For each worksheet, the tables array contains a worksheet object.
For each row in a worksheet, the worksheet object contains a header array and a data array. The data array contains a row array.
For each cell in a row, the row array contains a cell object.
Each cell object contains the entities, hashed content, and Markdown content for the cell.
For .docx files, the JSON output structure adds:
A footnotes array for content in footnotes.
An endnotes array for content in endnotes.
A header object for content in the page headers. Includes separate objects for the first page header, even page header, and odd page header.
A footer object for content in the page footers. Includes separate objects for the first page footer, even page footer, and odd page footer.
These arrays and objects contain the entities, hashed content, and Markdown content for the notes, headers, and footers.
PDF and image files use the same structure. Textual extracts and scans the text from the files.
For PDF and image files, the JSON output structure adds the following content.
pages array
The pages array contains all of the content on the pages. This includes content in tables and key-value pairs, which are also listed separately in the output.
For each page in the file, the pages array contains a page array.
For each component on the page - such as paragraphs, headings, headers, and footers - the page array contains a component object.
Each component object contains the component entities, hashed content, and Markdown content.
tables array
The tables array contains content that is in tables.
For each table in the file, the tables array contains a table array.
For each row in a table, the table array contains a row array.
For each cell in a row, the row array contains a cell object.
Each cell object identifies the type of cell (header or content). It also contains the entities, hashed content, and Markdown content for the cell.
keyValuePairs array
The keyValuePairs array contains key-value pair content. For example, for a PDF of a form with fields, a key-value pair might represent a field label and a field value.
For each key-value pair, the keyValuePairs array contains a key-value pair object.
The key-value pair object contains:
An automatically incremented identifier. For example, id for the first key-value pair is 1, for the second key-value pair is 2, and so on.
The start and end position of the key-value pair
The text of the key
The entities, hashed content, and Markdown content for the value
For email message files, the JSON output structure adds the following content.
The JSON output includes the following email message identifiers:
The identifier of the current message
If the message was a reply to another message, the identifier of that message
An array of related email messages. This includes the email message that the message replied to, as well as any other messages in an email message thread.
The JSON output includes the email address and display name of the message recipients. It contains separate lists for the following:
Recipients in the To line
Recipients in the CC line
Recipients in the BCC line
The subject object contains the message subject line. It includes:
Markdown and hashed versions of the message subject line.
The entities that were detected in the subject line.
sentDate provides the timestamp when the message was sent.
The plainTextBodyContent object contains the body of the email message.
It contains:
Markdown and hashed versions of the message body.
The entities that were detected in the message body.
The attachments array provides information about any attachments to the email message. For each attached file, it includes:
The identifier of the message that the file is attached to.
The identifier of the attachment.
The JSON output for the file.
The count of words in the original file.
The count of words in the redacted version of the file.
After you complete the configuration in Azure, you uncomment and configure the required in Textual.
To configure whether to use the auxiliary model for GPU, you configure TEXTUAL_AUX_MODEL_GPU.
To not load the date synthesis model on GPU, set TEXTUAL_DATE_SYNTH_GPU to false.
To use Amazon Textract, Textual requires access to an IAM role that has sufficient permissions. You must also . The configured S3 bucket is required for uploaded file pipelines, and is also used to store dataset files and individual files that are redacted using the SDK.
After you complete the configuration in Okta, uncomment and configure the following in Textual.
In the Access Key field, provide an AWS access key that is associated with an IAM user or role. For an example of a role that has the required permissions for an Amazon S3 pipeline, go to .
On the Pipeline Settings page, provide the rest of the pipeline configuration. For more information, go to .
On the Pipeline Settings page, update the configuration. For all pipelines, you can change the pipeline name, and whether to also create redacted versions of the original files. For cloud storage pipelines, you can change the file selection. For more information, go to , , or . For uploaded file pipelines, you do not manage files from the Pipeline Settings page. For information about uploading files, go to .
The Run Pipeline option, which starts a new pipeline run. For more information, go to .
The settings option, which you use to change the configuration settings for the pipeline. For more information, go to .
The list of processed files. For more information, go to .
The list of pipeline runs. For more information, go to .
For pipelines that are configured to also redact files, the redaction configuration. For more information, go to .
The Upload Files option, which you use to add files to the pipeline. For more information, go to .
The settings option, which you use to change the configuration settings for the pipeline. For more information, go to .
The list of files in the pipeline. Includes both new and processed files. For more information, go to .
For pipelines that are configured to also redact files, the redaction configuration. For more information, go to .
For information on how to configure the file generation, go to .
In the Access Key field, provide an AWS access key that is associated with an IAM user or role. For an example of an IAM role that has the required permissions for an Amazon S3 pipeline, go to .
You must add the SOLAR_PFX_PASSWORD
, which contains the certificate password.
You can use any volume type that is allowed within your environment. It must provide at least access.
Ensure that the are installed for your instance.
If you use Kubernetes to deploy Textual, follow the instructions in the .
If you use Minikube, then use the instructions in .
If you use Docker Compose to deploy Textual, follow .
If a dataset contains multiple files that have the same format, then you can create a template to apply to those files. For more information, go to .
For details about the JSON output structure for the different types of files, go to .
or . A dataset is a set of files to redact. A pipeline is used to generate JSON output that can be used to populate an LLM system. Pipelines also provide an option to generate redacted versions of the selected files.
or .
For a dataset or an uploaded files pipeline, as you add the files, Textual automatically uses its built-in models to identify entities in the files and generate the pipeline output. For a cloud storage pipeline, to identify the entities and generate the output, you .
For a dataset, . For pipeline files, the include the entities that were detected in that file.
Optionally, in a dataset, you can . You might do this to reflect values that are not detected or that are detected incorrectly.
You can . When you add a manual override, you draw a box to identify the affected portion of the file.
To make it easier to process multiple files that have a similar format, such as a form, you can that you can apply to PDF files in the dataset.
After you complete the redaction configuration and manual updates, you can download the or the to use as needed.
Textual only uses the LLM service for .
You use the function in the same way as any other user-defined function. You can pass in additional configuration to determine how to process .
The textual_redact
function works identically to the .
The textual_parse
function works identically to the .
<EntityType> is the type of entity for which to specify the handling. For the list of entity types, go to .
For example, for a first name, the entity type is NAME_GIVEN.
The result column is a VARIANT type that contains the parsed data. For more information about the format of the results for each document, go to .
Textual can parse datetime values that use either a format in or a format that you add.
By default, Textual is able to recognize datetime values that use a format from .
The formats must use a .
Create and manage pipelines
Create, run, and get results from Textual pipelines.
Parse individual files
Send a single file to be parsed.
Hi my name is John Smith
Hi my name is [NAME_GIVEN_Kx0Y7] [NAME_FAMILY_s9TTP0]
Hi my name is Lamar Smith
Hi John, mine is Jane Doe
Hi [NAME_GIVEN_Kx0Y7], mine is [NAME_GIVEN_veAy9] [NAME_FAMILY_6eC2]
Hi Lamar, mine is Doris Doe
Format | Example
yyyy/M/d | 2024/1/17
yyyy-M-d | 2024-1-17
yyyyMMdd | 20240117
yyyy.M.d | 2024.1.17
yyyy, MMM d | 2024, Jan 17
yyyy-M | 2024-1
yyyy/M | 2024/1
d/M/yyyy | 17/1/2024
d-MMM-yyyy | 17-Jan-2024
dd-MMM-yy | 17-Jan-24
d-M-yyyy | 17-1-2024
d/MMM/yyyy | 17/Jan/2024
d MMMM yyyy | 17 January 2024
d MMM yyyy | 17 Jan 2024
d MMMM, yyyy | 17 January, 2024
ddd, d MMM yyyy | Wed, 17 Jan 2024
M/d/yyyy | 1/17/2024
M/d/yy | 1/17/24
M-d-yyyy | 1-17-2024
MMddyyyy | 01172024
MMMM d, yyyy | January 17, 2024
MMM d, ''yy | Jan 17, '24
MM-yyyy | 01-2024
MMMM, yyyy | January, 2024
yyyy-M-d HH:mm | 2024-1-17 15:45
d-M-yyyy HH:mm | 17-1-2024 15:45
MM-dd-yy HH:mm | 01-17-24 15:45
d/M/yy HH:mm:ss | 17/1/24 15:45:30
d/M/yyyy HH:mm:ss | 17/1/2024 15:45:30
yyyy/M/d HH:mm:ss | 2024/1/17 15:45:30
yyyy-M-dTHH:mm:ss | 2024-1-17T15:45:30
yyyy/M/dTHH:mm:ss | 2024/1/17T15:45:30
yyyy-M-d HH:mm:ss'Z' | 2024-1-17 15:45:30Z
yyyy-M-d'T'HH:mm:ss'Z' | 2024-1-17T15:45:30Z
yyyy-M-d HH:mm:ss.fffffff | 2024-1-17 15:45:30.1234567
yyyy-M-dd HH:mm:ss.FFFFFF | 2024-1-17 15:45:30.123456
yyyy-M-dTHH:mm:ss.fff | 2024-1-17T15:45:30.123
HH:mm | 15:45
HH:mm:ss | 15:45:30
HHmmss | 154530
hh:mm:ss tt | 03:45:30 PM
HH:mm:ss'Z' | 15:45:30Z
fileType - The type of the original file.
content - Details about the file content. It includes hashed and Markdown content for the file, and the entities in the file.
schemaVersion - An integer that identifies the version of the JSON schema that was used for the JSON output. Textual uses this to convert content from older schemas to the most recent schema.
hash - The hashed version of the file or component content.
text - The file or component content in Markdown notation.
start - Within the file or component, the location where the entity value starts. For example, in the text "My name is John.", John is an entity that starts at 11.
end - Within the file or component, the location where the entity value ends. For example, in the text "My name is John.", John is an entity that ends at 14.
label - The type of entity. For a list of the entity types that Textual detects, go to Entity types that Textual detects.
text - The text of the entity.
score - The confidence score for the entity. Indicates how confident Textual is that the value is an entity of the specified type.
language - The language code to identify the language for the entity value. For example, en indicates that the value is in English.
Configure a Databricks pipeline
Select the pipeline files and configure whether to redact files.