Create and manage pipelines

Textual uses pipelines to transform file text into a format that can be used in an LLM system.

You can use the Textual SDK to create and manage pipelines and to retrieve pipeline run results.

Before you perform these tasks, remember to instantiate the SDK client.

Creating and deleting pipelines

Creating a pipeline

Required global permission: Create pipelines

To create a pipeline, use the pipeline creation method for the type of pipeline to create:

  • textual.create_local_pipeline - Creates an uploaded file pipeline.

  • textual.create_s3_pipeline - Creates an Amazon S3 pipeline.

  • textual.create_azure_pipeline - Creates an Azure pipeline.

  • textual.create_databricks_pipeline - Creates a Databricks pipeline.

When you create the pipeline, you can also:

  • If needed, provide the credentials to use to connect to Amazon S3, Azure, or Databricks.

  • Indicate whether to also generate redacted files. By default, pipelines do not generate redacted files. To generate redacted files, set synthesize_files to True.

For example, to create an uploaded file pipeline that also creates redacted files:

pipeline = textual.create_local_pipeline(pipeline_name="pipeline name", synthesize_files=True)

The response contains the pipeline object.

Deleting a pipeline

Required pipeline permission: Delete a pipeline

To delete a pipeline, use textual.delete_pipeline.

textual.delete_pipeline(pipeline_id)

Updating a pipeline configuration

Changing whether to also generate synthesized files

Required pipeline permission: Edit pipeline settings

To change whether a pipeline also generates synthesized files, use pipeline.set_synthesize_files.
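
A minimal sketch, assuming you already have a pipeline object (for example, from textual.get_pipeline_by_id) and that set_synthesize_files takes a single boolean:

# Assumed usage: True also generates synthesized files for the pipeline, False turns that off.
pipeline.set_synthesize_files(True)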

Adding files to an uploaded file pipeline

Required pipeline permission: Manage the pipeline file list

To add a file to an uploaded file pipeline, use pipeline.upload_file.

pipeline = textual.create_local_pipeline(pipeline_name)
with open(file_path, "rb") as file_content:
    file_bytes = file_content.read()
pipeline.upload_file(file_bytes, file_name)

Configuring an Amazon S3 pipeline

Required pipeline permissions:

  • Edit pipeline settings

  • Manage the pipeline file list

For an Amazon S3 pipeline, you can configure the output location for the processed files. You can also identify the files and folders for the pipeline to process:

  • To identify the output location for the processed files, use s3_pipeline.set_output_location.

  • To identify individual files for the pipeline to process, use s3_pipeline.add_files.

  • To identify prefixes - folders for which the pipeline processes all applicable files - use s3_pipeline.add_prefixes.
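
A minimal sketch of these calls; the constructor arguments, bucket names, paths, and parameter layout are illustrative assumptions, not documented signatures:

s3_pipeline = textual.create_s3_pipeline(pipeline_name="s3 pipeline name")
# Assumed: bucket and prefix where the pipeline writes processed output.
s3_pipeline.set_output_location("example-output-bucket", "processed/")
# Assumed: bucket plus individual object keys for the pipeline to process.
s3_pipeline.add_files("example-input-bucket", ["docs/report.pdf", "docs/notes.txt"])
# Assumed: bucket plus prefixes (folders) whose applicable files are all processed.
s3_pipeline.add_prefixes("example-input-bucket", ["transcripts/"])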

Configuring an Azure pipeline

Required pipeline permissions:

  • Edit pipeline settings

  • Manage the pipeline file list

For an Azure pipeline, you can configure the output location for the processed files. You can also identify the files and folders for the pipeline to process:

  • To identify the output location for the processed files, use azure_pipeline.set_output_location.

  • To identify individual files for the pipeline to process, use azure_pipeline.add_files.

  • To identify prefixes - folders for which the pipeline processes all applicable files - use azure_pipeline.add_prefixes.
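
A minimal sketch of the equivalent Azure calls; the constructor arguments, container names, and paths are illustrative assumptions, not documented signatures:

azure_pipeline = textual.create_azure_pipeline(pipeline_name="azure pipeline name")
# Assumed: container and prefix where the pipeline writes processed output.
azure_pipeline.set_output_location("example-output-container", "processed/")
# Assumed: container plus individual blob paths for the pipeline to process.
azure_pipeline.add_files("example-input-container", ["docs/report.pdf"])
# Assumed: container plus prefixes (folders) whose applicable files are all processed.
azure_pipeline.add_prefixes("example-input-container", ["transcripts/"])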

Getting a pipeline or pipelines

Required pipeline permission: View pipeline settings

Getting the list of pipelines

To get the list of pipelines, use textual.get_pipelines.

pipelines = textual.get_pipelines()

The response contains a list of pipeline objects.

Getting a single pipeline

To use the pipeline identifier to get a single pipeline, use textual.get_pipeline_by_id.

pipeline_id: str # pipeline identifier
pipeline = textual.get_pipeline_by_id(pipeline_id)

The response contains a single pipeline object.

The pipeline identifier is displayed on the pipeline details page. To copy the identifier, click the copy icon.

Running a pipeline

Required pipeline permission: Start pipeline runs

To run a pipeline, use pipeline.run.
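
For example, assuming you already have a pipeline object and that run takes no required arguments:

job_id = pipeline.run()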

The response contains the job identifier.

Getting pipeline runs, files, and results

Getting pipeline runs

Required pipeline permission: View pipeline settings

To get the list of pipeline runs, use pipeline.get_runs.
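
A minimal sketch, assuming get_runs takes no required arguments:

runs = pipeline.get_runs()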

The response contains a list of pipeline run objects.

Getting pipeline files

Required pipeline permission: Preview pipeline files

Once you have the pipeline, to get an enumerator of the files from the most recent pipeline run, use pipeline.enumerate_files.

files = pipeline.enumerate_files()

The response is an enumerator of file parse result objects.

Getting the list of entities in a file

Required pipeline permission: Preview pipeline files

To get a list of the entities that were detected in a file, use get_all_entities. For example, to get the detected entities for all of the files in a pipeline:

detected_entities = []
for file in pipeline.enumerate_files():
    entities = file.get_all_entities()
    detected_entities.append(entities)

To provide a list of entity types and how to process them, use get_entities:

from typing import Dict

from textual.enums.pii_state import PiiState

# Example values; the entity type name is illustrative.
generator_config: Dict[str, PiiState] = {"NAME_GIVEN": PiiState.Redaction}
generator_default: PiiState = PiiState.Off

entities_list = []
for file in pipeline.enumerate_files():
    entities = file.get_entities(generator_config, generator_default)
    entities_list.append(entities)

generator_config is a dictionary that specifies whether to redact, synthesize, or do neither for each entity type in the dictionary.

For a list of the entity types that Textual detects, go to Entity types that Textual detects.

For each entity type, you provide the handling type:

  • Redaction indicates to replace the value with a token that represents the entity type.

  • Synthesis indicates to replace the value with a realistic value.

  • Off indicates to keep the value as is.

generator_default indicates how to process values for entity types that are not included in generator_config.
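
For example, the following configuration redacts given names, synthesizes street addresses, leaves dates unchanged, and keeps every other entity type as is. The entity type names are illustrative; use the names listed in Entity types that Textual detects.

from textual.enums.pii_state import PiiState

generator_config = {
    "NAME_GIVEN": PiiState.Redaction,        # replace with an entity type token
    "LOCATION_ADDRESS": PiiState.Synthesis,  # replace with a realistic value
    "DATE_TIME": PiiState.Off,               # keep the original value
}
generator_default = PiiState.Off  # entity types not in the dictionary are kept as is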

The response contains the list of entities. For each value, the list includes:

  • Entity type

  • Where the value starts in the source file

  • Where the value ends in the source file

  • The original text of the entity
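
For example, assuming each entity object exposes these values as attributes (the attribute names below are hypothetical, for illustration only):

for entities in detected_entities:
    for entity in entities:
        # Hypothetical attribute names for the entity type, offsets, and original text.
        print(entity.label, entity.start, entity.end, entity.text)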

Getting the Markdown output for pipeline files

Required pipeline permission: Preview pipeline files

To get the Markdown output of a pipeline file, use get_markdown. In the request, you can provide generator_config and generator_default to configure how to present the detected entities in the output file.

from typing import Dict

from textual.enums.pii_state import PiiState

# Example values; the entity type name is illustrative.
generator_config: Dict[str, PiiState] = {"NAME_GIVEN": PiiState.Redaction}
generator_default: PiiState = PiiState.Off

markdown_list = []
for file in pipeline.enumerate_files():
    markdown = file.get_markdown(generator_config, generator_default)
    markdown_list.append(markdown)

The response contains the Markdown files, with the detected entities processed as specified in generator_config and generator_default.

Generating chunks from pipeline files

To split a pipeline file into text chunks that can be imported into an LLM, use get_chunks. In the request, you set the maximum number of characters in each chunk.

You can also provide generator_config and generator_default to configure how to present the detected entities in the text chunks.

from typing import Dict

from textual.enums.pii_state import PiiState

# Example values; the entity type name is illustrative.
generator_config: Dict[str, PiiState] = {"NAME_GIVEN": PiiState.Redaction}
generator_default: PiiState = PiiState.Off
max_chars: int = 1000  # example maximum number of characters per chunk

chunks_list = []
for file in pipeline.enumerate_files():
    chunks = file.get_chunks(generator_config=generator_config, generator_default=generator_default, max_chars=max_chars)
    chunks_list.append(chunks)

The response contains the list of text chunks, with the detected entities processed as specified in generator_config and generator_default.
