Datasets workflow for text redaction


You can use Textual to generate versions of files where the sensitive values are redacted.

To only generate redacted files, you use a Tonic Textual dataset.

You can also optionally configure a Textual pipeline to generate redacted files in addition to the JSON output.

You can also create and manage datasets from the Textual SDK or the REST API.

Diagram of the Tonic Textual redaction workflow

At a high level, to use Textual to create redacted data:

Create and populate a dataset or pipeline

Create a Textual dataset or pipeline. A dataset is a set of files to redact. A pipeline is used to generate JSON output that can be used to populate an LLM system. Pipelines also provide an option to generate redacted versions of the selected files.

Add files to the dataset or pipeline. Textual supports almost any free-text file, PDF files, .docx files, and .xlsx files. For images, Textual supports PNG, JPG (both .jpg and .jpeg), and TIF (both .tif and .tiff) files.

For a dataset or an uploaded files pipeline, as you add the files, Textual automatically uses its built-in models to identify entities in the files and generate the output. For a cloud storage pipeline, to identify the entities and generate the output, you run the pipeline.
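If you script your uploads, it can help to pre-filter local files against the supported formats. The sketch below is illustrative only and is not part of the Textual product; the `.txt` and `.csv` entries are assumptions standing in for "almost any free-text file".

```python
from pathlib import Path

# Extensions assembled from the list of supported formats above.
# .txt and .csv are assumed examples of free-text files.
SUPPORTED_EXTENSIONS = {
    ".txt", ".csv", ".pdf", ".docx", ".xlsx",   # text and document formats
    ".png", ".jpg", ".jpeg", ".tif", ".tiff",   # image formats
}

def is_supported(filename: str) -> bool:
    """Return True when the file extension matches a supported format."""
    return Path(filename).suffix.lower() in SUPPORTED_EXTENSIONS

files = ["notes.txt", "scan.TIFF", "report.pdf", "slides.pptx"]
uploadable = [f for f in files if is_supported(f)]
# slides.pptx is filtered out; the other three files pass the check.
```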

Review the redaction results

For a dataset, review the types of entities that were detected across all of the files. For pipeline files, the file details include the entities that were detected in that file.

Configure entity type handling

At any time, including before you upload files and after you review the detection results, you can configure how Textual handles the detected values for each entity type.

You can also create and enable custom entity types.

For datasets, you can provide added and excluded values for each built-in entity type.

Select the handling option for each entity type

For each entity type, select the action to perform on detected values. The options are:

  • Redaction - By default, Textual redacts the entity values, which means to replace the values with a token that identifies the type of sensitive value, followed by a unique identifier. For example, NAME_GIVEN_l2m5sb, LOCATION_j40pk6. The identifiers are consistent, which means that for the same original value, the redacted value always has the same identifier. For example, the first name Michael might always be replaced with NAME_GIVEN_l2m5sb, while the first name Helen might always be replaced with NAME_GIVEN_9ha3m2. For PDF files, redaction means to either cover the value with a black box or, if there is space, display the entity type and identifier. For image files, redaction means to cover the value with a black box.

  • Synthesis - For a given entity type, you can instead choose to synthesize the values, which means to replace the original value with a realistic replacement. The synthesized values are always consistent, meaning that a given original value always produces the same replacement value. For example, the first name Michael might always be replaced with the first name John.

  • Ignore - You can also choose to ignore the values, and not replace them.
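The consistency property of redaction tokens can be illustrated with a small sketch. This is not Textual's actual algorithm, only a demonstration of the idea: deriving the identifier deterministically from the original value guarantees that repeated occurrences of a value always map to the same token.

```python
import hashlib

def redaction_token(entity_type: str, value: str) -> str:
    """Illustrative only: build a consistent redaction token by hashing
    the original value, so the same value always gets the same identifier."""
    digest = hashlib.sha256(value.encode("utf-8")).hexdigest()[:6]
    return f"{entity_type}_{digest}"

# Repeated occurrences of "Michael" produce the same token;
# a different value produces a different token.
token_a = redaction_token("NAME_GIVEN", "Michael")
token_b = redaction_token("NAME_GIVEN", "Michael")
token_c = redaction_token("NAME_GIVEN", "Helen")
assert token_a == token_b and token_a != token_c
```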

For a dataset, Textual automatically updates the file previews and downloadable files to reflect the updated configuration.

For a pipeline, the updated configuration is applied the next time you run the pipeline, and only applies to new files.
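The consistency of synthesized values can be sketched as a deterministic mapping from the original value into a pool of realistic replacements. Again, this is an illustration of the behavior, not Textual's implementation; the name pool is a made-up example.

```python
import hashlib

# Hypothetical pool of realistic replacement first names.
FIRST_NAMES = ["John", "Mary", "Alice", "Robert", "Linda", "James"]

def synthesize_name(value: str) -> str:
    """Illustrative only: deterministically pick a replacement name,
    so a given original value always produces the same synthesized value."""
    index = int(hashlib.sha256(value.encode("utf-8")).hexdigest(), 16) % len(FIRST_NAMES)
    return FIRST_NAMES[index]

# "Michael" always maps to the same replacement name from the pool.
assert synthesize_name("Michael") == synthesize_name("Michael")
```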

Define added and excluded values for entity types

Optionally, in a dataset, you can create lists of values to add to or exclude from an entity type. You might do this to reflect values that are not detected or that are detected incorrectly.

Pipelines do not allow you to add or exclude individual values.
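As an illustration of the idea (not the product's implementation): added values force detection of strings the model missed, while excluded values suppress matches that were detected incorrectly.

```python
import re

def apply_overrides(text, detected_spans, added_values, excluded_values):
    """Illustrative only: adjust a list of detected (start, end, value)
    spans using user-supplied added and excluded value lists."""
    # Drop detections whose value the user excluded.
    spans = [s for s in detected_spans if s[2] not in excluded_values]
    # Force-add every occurrence of each added value.
    for value in added_values:
        for m in re.finditer(re.escape(value), text):
            spans.append((m.start(), m.end(), value))
    return sorted(set(spans))

text = "Contact Acme Corp or Jordan at HQ."
detected = [(21, 27, "Jordan")]   # the model only found "Jordan"
spans = apply_overrides(text, detected,
                        added_values=["Acme Corp"],
                        excluded_values=["Jordan"])
# "Jordan" is suppressed; "Acme Corp" is added as a detection.
```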

Manually update PDF files

Datasets also provide additional options for PDF files. These options are not available in pipelines.

You can add manual overrides to a PDF file. When you add a manual override, you draw a box to identify the affected portion of the file. You can use manual overrides either to ignore the automatically detected redactions in the selected area, or to redact the selected area.

To make it easier to process multiple files that have a similar format, such as a form, you can create templates that you can apply to PDF files in the dataset.

Download the redacted and synthesized files

After you complete the redaction configuration and manual updates, you can download the dataset files or the synthesized pipeline files to use as needed.
