Create and manage datasets


Textual uses datasets to produce files with sensitive values replaced.

Before you perform these tasks, remember to instantiate the SDK client.
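For example, a minimal client setup, assuming the tonic_textual package's TextualNer client (the endpoint URL and the TONIC_TEXTUAL_API_KEY environment variable are placeholders for your own values):

import os
from tonic_textual.redact_api import TextualNer

# Point the client at your Textual instance and authenticate with an API key
textual = TextualNer('https://textual.tonic.ai', os.environ['TONIC_TEXTUAL_API_KEY'])

The textual object created here is the client that the snippets below call into.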

Get your list of datasets

To get the complete list of datasets that you own, use textual.get_all_datasets.

datasets = textual.get_all_datasets()
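The call returns a list of dataset objects. For example, assuming each returned object exposes a name attribute, you can print the dataset names:

for dataset in datasets:
    # Each entry is a dataset object, not a plain string
    print(dataset.name)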

Create and add files to a dataset

Required global permission: Create datasets

Required dataset permission: Upload files to a dataset

To create a new dataset and then upload a file to it, use textual.create_dataset.

dataset = textual.create_dataset('<dataset name>')

To add a file to the dataset, use dataset.add_file. To identify the file, provide the file path and the file name.

dataset.add_file('<path to file>','<file name>') 

To provide the file content as bytes instead, provide the file name and the file bytes. Do not provide a path.

dataset.add_file('<file name>',<file bytes>) 

Textual creates the dataset, scans the uploaded file, and redacts the detected values.
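Putting these calls together, a sketch of the full upload flow, using a hypothetical local file ./documents/report.pdf and following the two calling forms shown above:

dataset = textual.create_dataset('quarterly_reports')

# Form 1: upload by file path and file name
dataset.add_file('./documents/report.pdf', 'report.pdf')

# Form 2: upload the file content as bytes, with no path
with open('./documents/report.pdf', 'rb') as f:
    dataset.add_file('report.pdf', f.read())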

Configure a dataset

Required dataset permission: Edit dataset settings

To change the configuration of a dataset, use dataset.edit. You can change:

  • The name of the dataset

  • The handling option for each entity type

  • Added or excluded values for each entity type

from tonic_textual.classes.common_api_responses.label_custom_list import LabelCustomList

dataset.edit(
    name='<dataset name>',
    generator_config={'<entity_type>': '<handling_type>'},
    label_allow_lists={'<entity_type>': LabelCustomList(regexes=['<regex>'])},
    label_block_lists={'<entity_type>': LabelCustomList(regexes=['<regex>'])}
)

Alternatively, instead of specifying the configuration values directly, you can use the copy_from_dataset parameter to copy the configuration from another dataset.
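For example, a sketch that assumes copy_from_dataset takes the name of the source dataset:

# Copy the configuration from a hypothetical dataset named 'source_dataset'
dataset.edit(copy_from_dataset='source_dataset')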

Get the current status of dataset files

Required dataset permission: Preview redacted dataset files

To get the current status of the files in the dataset, use dataset.describe:

dataset.describe()

The response includes:

  • The name and identifier of the dataset

  • The number of files in the dataset

  • The number of files that are waiting to be processed (scanned and redacted)

  • The number of files that had errors during processing

For example:

    Dataset: example [879d4c5d-792a-c009-a9a0-60d69be20206]
    Number of Files: 1
    Files that are waiting for processing: 
    Files that encountered errors while processing: 
    Number of Rows: 0
    Number of rows fetched: 0

Get lists of files by status

Required dataset permission: Preview redacted dataset files

To get a list of files that have a specific status, use one of the following methods:

  • dataset.get_failed_files

  • dataset.get_running_files

  • dataset.get_queued_files

  • dataset.get_processed_files

The file list includes:

  • File identifier and name

  • Number of rows and columns

  • Processing status

  • For failed files, the error

  • When the file was uploaded
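For example, a sketch that lists the failed files, assuming the returned file objects expose a name attribute:

for failed_file in dataset.get_failed_files():
    # Surface each file that encountered a processing error
    print(failed_file.name)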

Delete a file from a dataset

Required dataset permission: Delete files from a dataset

To delete a file from a dataset, use dataset.delete_file. To identify the file, provide the file identifier:

dataset.delete_file('<file identifier>')

Get redacted content for a dataset

Required dataset permission: Download redacted dataset files

To get the redacted content in JSON format for a dataset, use dataset.fetch_all_json:

dataset = textual.get_dataset('<dataset name>')
dataset.fetch_all_json()

For example:

dataset = textual.get_dataset('mydataset')
dataset.fetch_all_json()

The response looks something like:

'[["PERSON Portrait by PERSON, DATE_TIME ...]'
