You can use the Textual SDK to parse individual files, either from a local file system or from an S3 bucket. Textual returns a FileParseResult object for each parsed file. The FileParseResult object is a wrapper around the output JSON for the processed file.
To parse a single file from a local file system, use textual.parse_file:
You must use rb access mode to read the file; rb access mode opens the file for reading in binary format.
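For example, here is a minimal sketch of reading a file in rb mode for parsing. The helper name and the client setup are illustrative assumptions, and the parse_file call is shown in a comment; check the SDK reference for the exact signature in your version:

```python
import tempfile

def read_bytes_for_parsing(path):
    # Open the file in "rb" (read binary) mode, as parsing requires.
    with open(path, "rb") as f:
        return f.read()

# Hypothetical usage -- assumes a configured Textual client named `textual`:
#   data = read_bytes_for_parsing("invoice.pdf")
#   result = textual.parse_file(data, "invoice.pdf")

# Quick local demonstration with a temporary file:
with tempfile.NamedTemporaryFile(suffix=".txt", delete=False) as tmp:
    tmp.write(b"hello")
data = read_bytes_for_parsing(tmp.name)
print(type(data).__name__)  # rb mode yields bytes, not str
```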
You can also set a timeout, in seconds, for the parsing. To set the timeout for a single call, pass it as a parameter of parse_file. To set a timeout to use for all parsing, set the environment variable TONIC_TEXTUAL_PARSE_TIMEOUT_IN_SECONDS.
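For example, to set the global timeout, you can set the environment variable before you create the client. The value 300 below is an arbitrary example:

```python
import os

# Apply a 300-second timeout to all parsing calls.
# (300 is an arbitrary example value.)
os.environ["TONIC_TEXTUAL_PARSE_TIMEOUT_IN_SECONDS"] = "300"

# Alternatively, a per-call timeout can be passed as a parameter of
# parse_file; check the SDK reference for the exact parameter name.
print(os.environ["TONIC_TEXTUAL_PARSE_TIMEOUT_IN_SECONDS"])
```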
You can also parse files that are stored in Amazon S3. Because this process uses the boto3 library to fetch the file from Amazon S3, you must first set up the correct AWS credentials.
To parse a file from an S3 bucket, use textual.parse_s3_file:
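Because the file is fetched with boto3, standard AWS credential configuration applies. One option is environment variables, as sketched below. The credential values are placeholders, and the parse_s3_file call is shown in a comment because its exact signature may differ by SDK version:

```python
import os

# Placeholder credentials -- substitute your own, or rely instead on
# ~/.aws/credentials, an AWS profile, or an IAM role.
os.environ["AWS_ACCESS_KEY_ID"] = "AKIA-EXAMPLE"
os.environ["AWS_SECRET_ACCESS_KEY"] = "example-secret"
os.environ["AWS_DEFAULT_REGION"] = "us-east-1"

# Hypothetical usage -- assumes a configured Textual client named `textual`:
#   result = textual.parse_s3_file("my-bucket", "path/to/file.pdf")
print(os.environ["AWS_DEFAULT_REGION"])
```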
Textual uses pipelines to transform file text into a format that can be used in an LLM system.
You can use the Textual SDK to create and manage pipelines and retrieve pipeline run results.
Before you perform these tasks, remember to .
To create a pipeline, use .
The response contains the pipeline object.
To upload a file to a pipeline, use .
The response contains a list of pipeline objects.
The response contains a single pipeline object.
The pipeline identifier is displayed on the pipeline details page. To copy the identifier, click the copy icon.
The response contains the job identifier.
The response contains a list of pipeline run objects.
The response is an enumerator of file parse result objects.
generator_config is a dictionary that specifies whether to redact, synthesize, or do neither for each entity type in the dictionary.
For each entity type, you provide the handling type:
Redaction indicates to replace the value with the value type.
Synthesis indicates to replace the value with a realistic value.
Off indicates to keep the value as is.
generator_default indicates how to process values for entity types that are not included in generator_config.
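For example, a generator_config that synthesizes given names and redacts addresses, with every other entity type left as is, could look like the following. The entity type names here are illustrative; see the entity type reference for the values Textual actually detects:

```python
# Handling options per the docs: "Redaction", "Synthesis", or "Off".
# The entity type keys below are illustrative examples.
generator_config = {
    "NAME_GIVEN": "Synthesis",        # replace with a realistic value
    "LOCATION_ADDRESS": "Redaction",  # replace with the value type
}

# Entity types not listed in generator_config fall back to this default.
generator_default = "Off"  # keep unlisted values as is

print(generator_config["NAME_GIVEN"])
```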
The response contains the list of entities. For each value, the list includes:
Entity type
Where the value starts in the source file
Where the value ends in the source file
The original text of the entity
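A single detected entity can therefore be pictured as a record like the one below. The field names are illustrative assumptions, not the SDK's exact attribute names:

```python
# Illustrative shape of one detected entity (field names are assumptions).
entity = {
    "label": "NAME_GIVEN",  # entity type
    "start": 11,            # where the value starts in the source file
    "end": 16,              # where the value ends in the source file
    "text": "Alice",        # the original text of the entity
}

# The span indices address the source text directly:
source = "My name is Alice."
print(source[entity["start"]:entity["end"]])  # -> Alice
```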
The response contains the Markdown files, with the detected entities processed as specified in generator_config and generator_default.
In the request, you set the maximum number of characters in each chunk.
You can also provide generator_config and generator_default to configure how to present the detected entities in the text chunks.
The response contains the list of text chunks, with the detected entities processed as specified in generator_config and generator_default.
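To illustrate what a character-based maximum chunk size implies, here is a plain-Python sketch of splitting text into chunks of at most N characters. This is not the SDK call, just the underlying idea:

```python
def chunk_text(text, max_chars):
    # Split text into consecutive chunks of at most max_chars characters.
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

chunks = chunk_text("abcdefghij", 4)
print(chunks)  # -> ['abcd', 'efgh', 'ij']
```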
To delete a pipeline, use .
To get the list of pipelines, use .
To use the pipeline identifier to get a single pipeline, use .
To run a pipeline, use .
To get the list of pipeline runs, use .
Once you have the pipeline, to get an enumerator of the files in the pipeline from the most recent pipeline run, use .
To get a list of entities that were detected in a file, use . For example, to get the detected entities for all of the files in a pipeline:
To provide a list of entity types and specify how to process them, use :
For a list of the entity types that Textual detects, go to .
To get the Markdown output of a pipeline file, use . In the request, you can provide generator_config and generator_default to configure how to present the detected entities in the output file.
To split a pipeline file into text chunks that can be imported into an LLM, use .