Parse individual files
Send a single file to be parsed.
You can use the Textual SDK to parse individual files, either from a local file system or from an S3 bucket.
Textual returns a FileParseResult object for each parsed file. The FileParseResult object is a wrapper around the output JSON for the processed file.
To parse a single file from a local file system, use textual.parse_file:
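A minimal sketch; the parse_file parameter order and the timeout keyword are assumptions:

```python
from tonic_textual.parse_api import TonicTextualParse

textual = TonicTextualParse("https://textual.tonic.ai")

# Open the file in binary mode ("rb") and pass the bytes to parse_file
with open("invoice.pdf", "rb") as f:
    parsed = textual.parse_file(f.read(), "invoice.pdf", timeout=60)
```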
You must use the rb access mode to read the file. The rb access mode opens the file for reading in binary format.
You can also set a timeout, in seconds, for the parsing. To set a timeout for a single call, pass it as a parameter to parse_file. To set a timeout to use for all parsing, set the environment variable TONIC_TEXTUAL_PARSE_TIMEOUT_IN_SECONDS.
You can also parse files that are stored in Amazon S3. Because this process uses the boto3 library to fetch the file from Amazon S3, you must first set up the correct AWS credentials.
To parse a file from an S3 bucket, use textual.parse_s3_file:
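A sketch; the parameter names are assumptions:

```python
# parse_s3_file fetches the file with boto3, so AWS credentials must be configured first
parsed = textual.parse_s3_file("my-bucket", "documents/invoice.pdf")
```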
Whenever you call the Textual SDK, you first instantiate the SDK client.
To work with Textual datasets, or to redact individual files, you instantiate TonicTextual.
To work with Textual pipelines, you instantiate TonicTextualParse.
If the API key is configured as the value of TONIC_TEXTUAL_API_KEY, then you do not need to provide the API key when you instantiate the SDK client.
For Textual pipelines:
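A minimal sketch; the module path is an assumption:

```python
from tonic_textual.parse_api import TonicTextualParse

# The API key is read from the TONIC_TEXTUAL_API_KEY environment variable
textual = TonicTextualParse("https://textual.tonic.ai")
```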
For Textual datasets:
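A minimal sketch; the module path is an assumption:

```python
from tonic_textual.api import TonicTextual

# The API key is read from the TONIC_TEXTUAL_API_KEY environment variable
textual = TonicTextual("https://textual.tonic.ai")
```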
If the API key is not configured as the value of TONIC_TEXTUAL_API_KEY, then you must include the API key in the request.
For Textual pipelines:
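A sketch; passing the key as the api_key argument is an assumption:

```python
from tonic_textual.parse_api import TonicTextualParse

textual = TonicTextualParse("https://textual.tonic.ai", api_key="<your API key>")
```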
For Textual datasets:
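A sketch; passing the key as the api_key argument is an assumption:

```python
from tonic_textual.api import TonicTextual

textual = TonicTextual("https://textual.tonic.ai", api_key="<your API key>")
```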
You can use the Tonic Textual SDK to manage pipelines and to redact individual strings and files.
By default, when you:
Configure a dataset
Redact or synthesize a string
Retrieve a redacted file
Textual does the following:
For string and file redaction, redacts the detected sensitive values.
For LLM synthesis, generates realistic synthesized values.
When you make the request, you can override the default behavior.
For each entity type, you can choose to redact, synthesize, or ignore the value.
When you redact a value, Textual replaces the value with <entity type>_<generated identifier>. For example, ORGANIZATION_EPfC7XZUZ.
When you synthesize a value, Textual replaces the value with a different realistic value.
When you ignore a value, Textual passes through the original value.
To specify the handling option for entity types, you use the generator_config parameter.
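The parameter has roughly the following shape (a sketch, not confirmed syntax):

```python
generator_config = {"<entity_type>": "<handling_option>"}
```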
Where:
<entity_type> is the identifier of the entity type. For example, ORGANIZATION.
<handling_option> is the handling option to use for the specified entity type. The possible values are Redact, Synthesis, and Off.
For example, to synthesize organization values, and ignore languages:
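A sketch; the LANGUAGE identifier is an assumption:

```python
generator_config = {
    "ORGANIZATION": "Synthesis",  # replace with realistic synthesized values
    "LANGUAGE": "Off",            # pass the original values through
}
```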
For string and file redaction, you can specify a default handling option to use for entity types that are not specified in generator_config. To do this, you use the generator_default parameter. generator_default can be either Redact, Synthesis, or Off.
You can also configure added and excluded values for each entity type.
You add values that Textual should detect for an entity type, but does not. You exclude values that you do not want Textual to identify for that entity type.
To specify the added values, use label_allow_lists. To specify the excluded values, use label_block_lists.
For each of these parameters, the value identifies the entity types to specify the added or excluded values for. For each entity type, you provide an array of regular expressions that match the values to add or exclude.
The following example uses label_allow_lists to add values (see the sketch after this list):
For NAME_GIVEN, adds the values There and Here.
For NAME_FAMILY, adds values that match the regular expression ([a-z]{2}).
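A sketch; passing plain lists of regular expressions is an assumption about the expected format:

```python
redaction = textual.redact(
    "There they go.",
    label_allow_lists={
        "NAME_GIVEN": ["There", "Here"],  # literal values to add
        "NAME_FAMILY": [r"([a-z]{2})"],   # values that match the regular expression
    },
)
```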
Textual uses pipelines to transform file text into a format that can be used in an LLM system.
You can use the Textual SDK to create and manage pipelines and to retrieve pipeline run results.
Before you perform these tasks, remember to instantiate the SDK client.
To create a pipeline, use the pipeline creation method for the type of pipeline to create:
- Creates an uploaded file pipeline.
- Creates an Amazon S3 pipeline.
- Creates an Azure pipeline.
- Creates a Databricks pipeline.
When you create the pipeline, you can also:
If needed, provide the credentials to use to connect to Amazon S3, Azure, or Databricks.
Indicate whether to also generate redacted files. By default, pipelines do not generate redacted files. To generate redacted files, set synthesize_files to True.
For example, to create an uploaded file pipeline that also creates redacted files:
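A sketch; create_local_pipeline is an assumed method name for creating an uploaded file pipeline:

```python
pipeline = textual.create_local_pipeline(
    "my-pipeline",
    synthesize_files=True,  # also generate redacted files
)
```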
The response contains the pipeline object.
For an Amazon S3 pipeline, you can configure the output location for the processed files. You can also identify the files and folders for the pipeline to process:
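A sketch; the creation and configuration method names are assumptions:

```python
s3_pipeline = textual.create_s3_pipeline("my-s3-pipeline")

# Assumed configuration methods
s3_pipeline.set_output_location("my-bucket", "textual-output/")
s3_pipeline.add_files("my-bucket", ["documents/report.pdf"])
s3_pipeline.add_prefixes("my-bucket", ["documents/contracts/"])
```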
For an Azure pipeline, you can similarly configure the output location for the processed files, and identify the files and folders for the pipeline to process.
To delete a pipeline, use the pipeline deletion method. To change whether a pipeline also generates synthesized files, use the pipeline update method. To add a file to an uploaded file pipeline, use the pipeline's file upload method. These operations are sketched below.
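All of the method and attribute names in this sketch are assumptions about the SDK surface:

```python
# Assumed names for pipeline management
textual.delete_pipeline(pipeline.id)  # delete a pipeline
pipeline.set_synthesize_files(True)   # toggle generation of synthesized files

# Add a file to an uploaded file pipeline
with open("notes.txt", "rb") as f:
    pipeline.upload_file(f.read(), "notes.txt")
```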
For an Amazon S3 pipeline or an Azure pipeline, you can also use the corresponding pipeline methods to:
Identify the output location for the processed files.
Identify individual files for the pipeline to process.
Identify prefixes - folders for which the pipeline processes all applicable files.
To get the list of pipelines, use the pipeline list method. The response contains a list of pipeline objects.
To get a single pipeline, use the pipeline identifier. The pipeline identifier is displayed on the pipeline details page; to copy the identifier, click the copy icon. The response contains a single pipeline object.
To run a pipeline, use the pipeline's run method. The response contains the job identifier.
To get the list of pipeline runs, use the pipeline's run list method. The response contains a list of pipeline run objects.
Once you have the pipeline, you can get an enumerator of the files in the pipeline from the most recent pipeline run. The response is an enumerator of file parse result objects.
These operations are sketched below.
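A sketch; every method name here is an assumption:

```python
# Assumed method names for retrieving and running pipelines
pipelines = textual.get_pipelines()               # list of pipeline objects
pipeline = textual.get_pipeline("<pipeline id>")  # single pipeline object
job_id = pipeline.run()                           # returns the job identifier
runs = pipeline.get_runs()                        # list of pipeline run objects
files = pipeline.enumerate_files()                # file parse results from the most recent run
```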
To get a list of entities that were detected in a file, use the file's entity retrieval method, as sketched below. For example, you can get the detected entities for all of the files in a pipeline.
To provide a list of entity types and how to process them, use generator_config. generator_config is a dictionary that specifies whether to redact, synthesize, or do neither for each entity type in the dictionary. The keys are entity type identifiers; for example, ORGANIZATION. For the list of built-in entity types that Textual scans for, go to the entity type list in the Textual documentation.
For each entity type, you provide the handling type:
Redaction indicates to replace the value with the value's entity type.
Synthesis indicates to replace the value with a realistic value.
Off indicates to keep the value as is.
generator_default indicates how to process values for entity types that are not included in the generator_config list.
The response contains the list of entities. For each value, the list includes:
Entity type
Where the value starts in the source file
Where the value ends in the source file
The original text of the entity
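A sketch; the method names are assumptions:

```python
# get_all_entities / get_entities are assumed method names
for parsed_file in pipeline.enumerate_files():
    # All detected entities for the file
    print(parsed_file.get_all_entities())
    # Entities with per-entity-type handling applied
    print(parsed_file.get_entities(generator_config={"ORGANIZATION": "Synthesis"}))
```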
To get the Markdown output of a pipeline file, use the file's Markdown method, as sketched below. In the request, you can provide generator_config and generator_default to configure how to present the detected entities in the output file. The response contains the Markdown files, with the detected entities processed as specified in generator_config and generator_default.
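A sketch; get_markdown is an assumed method name:

```python
markdown = parsed_file.get_markdown(generator_config={"ORGANIZATION": "Synthesis"})
```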
To split a pipeline file into text chunks that can be imported into an LLM, use the file's chunking method, as sketched below. In the request, you set the maximum number of characters in each chunk. You can also provide generator_config and generator_default to configure how to present the detected entities in the text chunks. The response contains the list of text chunks, with the detected entities processed as specified in generator_config and generator_default.
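A sketch; get_chunks and max_chars are assumed names:

```python
chunks = parsed_file.get_chunks(
    max_chars=1000,
    generator_config={"ORGANIZATION": "Synthesis"},
)
```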
You can use the Textual SDK to redact and synthesize values in individual files.
Before you perform these tasks, remember to instantiate the SDK client.
For a self-hosted instance, you can also configure the S3 bucket to use to store the files. This is the same S3 bucket that is used to store files for uploaded file pipelines. For more information, go to Setting the S3 bucket for file uploads and redactions. For an example of an IAM role with the required permissions, go to the file upload example IAM role.
To send an individual file to Textual, you use textual.start_file_redaction. You first open the file so that Textual can read it, then make the call for Textual to read the file.
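A sketch; the parameter order for start_file_redaction is an assumption:

```python
# Open the file in binary mode and send it to Textual
with open("report.pdf", "rb") as f:
    response = textual.start_file_redaction(f.read(), "report.pdf")
```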
The response includes:
The file name
The identifier of the job that processed the file. You use this identifier to retrieve a transformed version of the file.
After you use textual.start_file_redaction to send the file to Textual, you use textual.download_redacted_file to retrieve a transformed version of the file.
To identify the file, you use the job identifier that you received from textual.start_file_redaction. You can also specify whether to redact, synthesize, or ignore specific entity types. By default, all of the values are redacted.
Before you make the call to download the file, you specify the path to download the file content to.
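A sketch; job_id as the response attribute is an assumption:

```python
file_bytes = textual.download_redacted_file(
    response.job_id,
    generator_config={"ORGANIZATION": "Synthesis"},
)

# Write the transformed file content to the chosen path
with open("/path/to/redacted_report.pdf", "wb") as f:
    f.write(file_bytes)
```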
Before you perform these tasks, remember to instantiate the SDK client.
You can use the Tonic Textual SDK to redact individual strings, including:
Plain text strings
JSON content
XML content
For a text string, you can also request synthesized values from a large language model (LLM).
The redaction request can include the handling configuration for entity types.
The redaction response includes the redacted or synthesized content and details about the detected entity values.
To send a plain text string for redaction, use textual.redact. For example:
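A sketch; redacted_text as the response attribute is an assumption, and the sample values are illustrative:

```python
redaction = textual.redact("My name is John Smith. I work at Tonic.")

print(redaction.redacted_text)
# My name is NAME_GIVEN_x7Ab1 NAME_FAMILY_q9Zk2. I work at ORGANIZATION_EPfC7XZUZ.
```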
To send multiple plain text strings for redaction, use textual.redact_bulk. For example:
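A sketch along the same lines:

```python
bulk = textual.redact_bulk([
    "My name is John Smith.",
    "My name is Jane Doe.",
])
```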
To send a JSON string for redaction, use textual.redact_json. You can send the JSON content as a JSON string or a Python dictionary.
redact_json ensures that only the values are redacted. It ignores the keys.
Here is a basic example of a JSON redaction request:
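A sketch; the entity labels and identifiers in the sample output that follows are illustrative:

```python
import json

data = {"name": "John Smith", "city": "Atlanta"}
redaction = textual.redact_json(json.dumps(data))
print(redaction.redacted_text)
```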
It produces the following JSON output:
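```json
{
  "name": "NAME_GIVEN_x7Ab1 NAME_FAMILY_q9Zk2",
  "city": "LOCATION_CITY_k3Np8"
}
```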
When you redact a JSON string, you can optionally assign specific entity types to selected JSON paths.
To do this, you include the jsonpath_allow_lists parameter. Each entry consists of an entity type and a list of JSON paths for which to always use that entity type. Each JSON path must point to a simple string or numeric value.
The specified entity type overrides both the detected entity type and any added or excluded values.
In the following example, the value of the key1 node is always treated as a telephone number:
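A sketch; PHONE_NUMBER as the entity type identifier is an assumption, and the identifiers in the output are illustrative:

```python
redaction = textual.redact_json(
    '{"key1": "867-5309", "key2": "Atlanta"}',
    jsonpath_allow_lists={"PHONE_NUMBER": ["$.key1"]},
)
print(redaction.redacted_text)
```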
It produces the following redacted output:
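```json
{
  "key1": "PHONE_NUMBER_b2Xw9",
  "key2": "LOCATION_CITY_k3Np8"
}
```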
To send an XML string for redaction, use textual.redact_xml.
redact_xml ensures that only the values are redacted. It ignores the XML markup.
For example:
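A sketch, with illustrative values:

```python
xml = "<note><to>John Smith</to><body>Call me at 867-5309.</body></note>"
redaction = textual.redact_xml(xml)
print(redaction.redacted_text)
```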
Produces the following XML output:
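```xml
<note><to>NAME_GIVEN_x7Ab1 NAME_FAMILY_q9Zk2</to><body>Call me at PHONE_NUMBER_b2Xw9.</body></note>
```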
To send an HTML string for redaction, use textual.redact_html.
redact_html ensures that only the values are redacted. It ignores the HTML markup.
For example:
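A sketch, with illustrative values:

```python
html = "<p>Contact <b>John Smith</b> at Tonic.</p>"
redaction = textual.redact_html(html)
print(redaction.redacted_text)
```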
Produces the following HTML output:
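```html
<p>Contact <b>NAME_GIVEN_x7Ab1 NAME_FAMILY_q9Zk2</b> at ORGANIZATION_EPfC7XZUZ.</p>
```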
You can also request synthesized values from a large language model (LLM).
When you use this process, Textual first identifies the sensitive values in the text. It then sends the value locations and redacted values to the LLM. For example, if Textual identifies a product name, it sends the location and the redacted value PRODUCT to the LLM. Textual does not send the original values to the LLM.
The LLM then generates realistic synthesized values of the appropriate value types.
To send text to an LLM, use textual.llm_synthesis. For example:
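A sketch; the shape of the llm_synthesis response is an assumption:

```python
synthesis = textual.llm_synthesis("My name is John Smith and I love my new phone.")
print(synthesis)
```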
The response provides the redacted or synthesized version of the string, and the list of detected entity values.
For each redacted item, the response includes:
The location of the value in the original text (start and end)
The location of the value in the redacted version of the string (new_start and new_end)
The entity type (label)
The original value (text)
The redacted or synthesized value (new_text). new_text is null in the following cases:
The entity type is ignored
The response is from llm_synthesis
A score to indicate confidence in the detection and redaction (score)
The detected language for the value (language)
For responses from textual.redact_json, the JSON path to the entity in the original document (json_path)
For responses from textual.redact_xml, the XPath to the entity in the original XML document (xml_path)
Textual uses datasets to produce files with sensitive values replaced.
Before you perform these tasks, remember to instantiate the SDK client.
To create a new dataset and then upload a file to it, use textual.create_dataset.
To add a file to the dataset, use dataset.add_file. To identify the file, provide the file path and name.
To provide the file as IO bytes, you provide the file name and the file bytes. You do not provide a path.
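A sketch; the add_file parameter names are assumptions:

```python
dataset = textual.create_dataset("my-dataset")

# Add a file by path and name
dataset.add_file("/path/to/records.csv")

# Or add a file as IO bytes: provide the file name and the bytes, not a path
with open("/path/to/records.csv", "rb") as f:
    dataset.add_file(file_name="records.csv", file=f.read())
```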
Textual creates the dataset, scans the uploaded file, and redacts the detected values.
To change the configuration of a dataset, use dataset.edit. You can use dataset.edit to change:
The name of the dataset
To get the current status of the files in the current dataset, use dataset.describe:
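A sketch:

```python
print(dataset.describe())
```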
The response includes:
The name and identifier of the dataset
The number of files in the dataset
The number of files that are waiting to be processed (scanned and redacted)
The number of files that had errors during processing
For example:
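An illustrative summary; the exact output format is an assumption:

```
Dataset: my-dataset [a1b2c3]
Number of files: 2
Files waiting to be processed: 0
Files that had errors during processing: 0
```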
To get a list of files that have a specific status, use the following:
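The method names below are assumptions about the SDK surface:

```python
queued = dataset.get_queued_files()  # files waiting to be processed
failed = dataset.get_failed_files()  # files that had errors during processing
```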
The file list includes:
File identifier and name
Number of rows and columns
Processing status
For failed files, the error
When the file was uploaded
To delete a file from a dataset, use dataset.delete_file.
To get the redacted content in JSON format for a dataset, use dataset.fetch_all_json(). For example:
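A sketch:

```python
json_output = dataset.fetch_all_json()
print(json_output)
```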
The response contains the redacted content for each file in the dataset, in JSON format.
For details about all of the available Tonic Textual classes, go to the generated SDK documentation.
Instantiate the SDK client
Required for every call to the SDK.
Create and manage pipelines
Generate Markdown content to use in an LLM system.
Parse individual files
Parse files outside of a pipeline.
Create and manage datasets
Redact and synthesize data in a set of files.
Redact and synthesize individual strings
Detect and transform values in specified text strings.
Redact and synthesize individual files
Work with files outside of a dataset.