Create and manage datasets


Textual uses datasets to produce files with sensitive values replaced.

Before you perform these tasks, remember to instantiate the SDK client.

Get your list of datasets

To get the complete list of datasets that you own, use textual.get_all_datasets:

datasets = textual.get_all_datasets()
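For example, a quick sanity check is to loop over the returned datasets and print their names. This is a minimal sketch; the name attribute on the returned dataset objects is an assumption and may differ by SDK version.

datasets = textual.get_all_datasets()
for dataset in datasets:
    # The `name` attribute is assumed; adjust to the fields your SDK version exposes.
    print(dataset.name)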

Create and add files to a dataset

Required global permission: Create datasets

Required dataset permission: Upload files to a dataset

To create a new dataset and then upload a file to it, use textual.create_dataset:

dataset = textual.create_dataset('<dataset name>')

To add a file to the dataset, use dataset.add_file. To identify the file, provide the file path and name:

dataset.add_file('<path to file>','<file name>') 

To instead provide the file as IO bytes, pass the file name and the file bytes. Do not provide a path:

dataset.add_file('<file name>',<file bytes>) 

Textual creates the dataset, scans the uploaded file, and redacts the detected values.
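Putting these calls together, a minimal end-to-end sketch might look like the following. The dataset name and file paths are hypothetical, and the argument orders follow the two patterns shown above.

# Create a dataset, then add one file by path and one as in-memory bytes.
dataset = textual.create_dataset('clinical-notes')

# Upload by file path and file name.
dataset.add_file('notes/visit-01.txt', 'visit-01.txt')

# Upload as bytes: file name first, then the bytes, with no path.
with open('notes/visit-02.txt', 'rb') as f:
    dataset.add_file('visit-02.txt', f.read())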

Configure a dataset

Required dataset permission: Edit dataset settings

To change the configuration of a dataset, use dataset.edit. You can use dataset.edit to change:

  • The name of the dataset

  • The handling option for each entity type

  • Added or excluded values for each entity type

# Import path for LabelCustomList may vary by SDK version.
from tonic_textual.classes.common_api_responses.label_custom_list import LabelCustomList

dataset.edit(
    name='<dataset name>',
    generator_config={'<entity_type>': '<handling_type>'},
    label_allow_lists={'<entity_type>': LabelCustomList(regexes=['<regex>'])},
    label_block_lists={'<entity_type>': LabelCustomList(regexes=['<regex>'])}
)

Alternatively, instead of specifying the configuration, you can use the copy_from_dataset parameter to copy the configuration from another dataset.
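For example, to reuse an existing dataset's configuration, something like the following should work. Whether copy_from_dataset expects a dataset name, an identifier, or a dataset object is an assumption here; check the SDK reference for your version.

# Copy generator_config and the allow/block lists from another dataset.
dataset.edit(copy_from_dataset='<other dataset name>')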

Get the current status of dataset files

Required dataset permission: Preview redacted dataset files

To get the current status of the files in the current dataset, use dataset.describe:

dataset.describe()

The response includes:

  • The name and identifier of the dataset

  • The number of files in the dataset

  • The number of files that are waiting to be processed (scanned and redacted)

  • The number of files that had errors during processing

For example:

    Dataset: example [879d4c5d-792a-c009-a9a0-60d69be20206]
    Number of Files: 1
    Files that are waiting for processing: 
    Files that encountered errors while processing: 
    Number of Rows: 0
    Number of rows fetched: 0

Get lists of files by status

Required dataset permission: Preview redacted dataset files

To get a list of files that have a specific status, use one of the following methods; a usage sketch follows the field list below:

  • dataset.get_failed_files

  • dataset.get_running_files

  • dataset.get_queued_files

  • dataset.get_processed_files

The file list includes:

  • File identifier and name

  • Number of rows and columns

  • Processing status

  • For failed files, the error

  • When the file was uploaded
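As a usage sketch for these methods, the following walks the failed files and reports each error. The name and error attribute names are assumptions, not confirmed SDK fields.

# Report the error for each file that failed processing.
for failed in dataset.get_failed_files():
    # `name` and `error` are assumed attribute names.
    print(f'{failed.name}: {failed.error}')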

Delete a file from a dataset

Required dataset permission: Delete files from a dataset

To delete a file from a dataset, use dataset.delete_file:

dataset.delete_file('<file identifier>')
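A common cleanup pattern is to delete every file that failed processing. This sketch combines get_failed_files with delete_file; using an id attribute as the file identifier is an assumption.

# Remove all failed files from the dataset.
for failed in dataset.get_failed_files():
    dataset.delete_file(failed.id)  # `id` as the identifier is assumed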

Get redacted content for a dataset

Required dataset permission: Download redacted dataset files

To get the redacted content in JSON format for a dataset, use dataset.fetch_all_json:

dataset = textual.get_dataset('<dataset name>')
dataset.fetch_all_json()

For example:

dataset = textual.get_dataset('mydataset')
dataset.fetch_all_json()

The response looks something like:

'[["PERSON Portrait by PERSON, DATE_TIME ...]'
