Downloading and using pipeline output

Required pipeline permission: Preview pipeline files

From Tonic Textual, you can download the JSON output for each file. For pipelines that also generate synthesized files, you can download those files.

You can also use the Textual API to further process the pipeline output - for example, you can chunk the output and determine whether to replace sensitive values before you use the output in a RAG system.

Textual provides next-step hints for working with the pipeline output. The examples in this topic show how to use the output in practice.

Downloading output files

JSON output

From a file details page, to download the JSON file, click Download Results.

Synthesized files

On the file details page for a pipeline file, to download the synthesized version of the file, click Download Synthesized File.

On the Original tab for files other than .txt files, the Redacted <file type> view contains a Download option.

For cloud storage pipelines, the synthesized files are also available in the configured output location.

Viewing next steps information for pipeline output

Available next steps

On the pipeline details page, the next steps panel at the left contains suggested steps to set up the API and use the pipeline output:

  • Create an API Key contains a link to create the key.

  • Install the Python SDK contains a link to copy the SDK installation command (shown below).

  • Fetch the pipeline results provides access to code snippets that you can use to retrieve and chunk the pipeline results.
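
The SDK installation command installs the tonic-textual package from PyPI:

pip install tonic-textual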

Copying the pipeline identifier

At the top of the Fetch the pipeline results step is the pipeline identifier. To copy the identifier, click the copy icon.
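
Outside of the next steps panel, you can use the copied identifier to select the matching pipeline in the SDK. A minimal sketch, assuming that the pipeline objects returned by get_pipelines expose an id attribute:

# Select a pipeline by its copied identifier (pipeline_id is a placeholder)
pipeline_id = "your-pipeline-id"
pipeline = next(p for p in textual.get_pipelines() if p.id == pipeline_id)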

Selecting the snippet to view

The pipeline results step provides access to the following snippets:

  • Markdown - A code snippet to retrieve the Markdown results for the pipeline.

  • JSON - A code snippet to retrieve the JSON results for the pipeline.

  • Chunks - A code snippet to chunk the pipeline results.

To view a snippet, click the snippet tab.
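
For example, the Markdown snippet retrieves the redacted Markdown for each file in the pipeline. A minimal sketch of what such a snippet does, using the SDK's get_markdown method and the generator_config format from the examples later in this topic:

from tonic_textual.parse_api import TonicTextualParse

textual = TonicTextualParse("https://textual.tonic.ai", api_key="your-tonic-textual-api-key")
pipeline = textual.get_pipelines()[-1] # get most recent pipeline

for file in pipeline.enumerate_files():
    # Returns the file content as Markdown, with the configured
    # entity handling applied
    print(file.get_markdown(generator_config={"NAME_GIVEN": "Redaction"}))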

Viewing the snippet panel

To display the snippet panel, on the snippet tab, click View. The snippet panel provides a larger view of the snippet.

Copying a snippet

To copy the code snippet, on the snippet tab or the snippet panel, click Copy.

Example - Working with sensitive RAG chunks

This example shows how to use your Textual pipeline output to create private chunks for RAG, where sensitive chunks are dropped, redacted, or synthesized.

This ensures that the chunks that you use for RAG do not contain any private information.

Get the latest output files

First, connect to the API and get the most recently created pipeline.

from tonic_textual.parse_api import TonicTextualParse

api_key = "your-tonic-textual-api-key" 
textual = TonicTextualParse("https://textual.tonic.ai", api_key=api_key)

pipelines = textual.get_pipelines()
pipeline = pipelines[-1] # get most recent pipeline

Identify the entity types and handling

Next, specify the sensitive entity types, and indicate whether to redact or to synthesize those entities in the chunks.

sensitive_entities = [
    "NAME_GIVEN",
    "NAME_FAMILY",
    "EMAIL_ADDRESS",
    "PHONE_NUMBER",
    "CREDIT_CARD",
    "CC_EXP",
    "CVV", 
    "US_BANK_NUMBER"
]

# sensitive entities are set to be redacted
# to synthesize, change Redaction to Synthesis
generator_config = {label: 'Redaction' for label in sensitive_entities}

Generate the chunks

Next, generate the chunks.

In the following code snippet, the final list does not include chunks with sensitive entities.

To include the chunks with the sensitive entities redacted, remove the if chunk['is_sensitive']: continue lines.

chunks = []
for file in pipeline.enumerate_files():
    file_chunks = file.get_chunks(generator_config=generator_config)
    for chunk in file_chunks:
        if chunk['is_sensitive']:
            continue # you can choose to ignore chunks that contain sensitive entities
            # or ingest the redacted version
        chunks.append(chunk)

The chunks are now ready to use for RAG or for other downstream tasks.
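
Before you ingest the chunks, you can spot-check what survived the filter. This preview only uses the text key that appears in the snippets above:

# Preview the first few chunks that passed the sensitivity filter
for chunk in chunks[:3]:
    print(chunk["text"][:200])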

Example - Using Pinecone to add pipeline output to a vector retrieval system

This example shows how to use Pinecone to add your Tonic Textual pipeline output to a vector retrieval system, for example as part of a RAG workflow.

The Pinecone metadata filtering options allow you to incorporate Textual NER metadata into the retrieval system.

Get the latest output files

First, connect to the Textual pipeline API and get the files from the most recently created pipeline.

from tonic_textual.parse_api import TonicTextualParse

api_key = "your-tonic-textual-api-key"
textual = TonicTextualParse("https://textual.tonic.ai", api_key=api_key)

pipelines = textual.get_pipelines()
pipeline = pipelines[-1] # get most recent pipeline
files = pipeline.enumerate_files()

Identify the entity types to include

Next, specify the entity types to incorporate into the retrieval system.

metadata_entities = [
    "NAME_GIVEN",
    "NAME_FAMILY",
    "DATE_TIME",
    "ORGANIZATION"
]

Chunk the files and add metadata

Chunk the files.

For each chunk, add the metadata that contains the instances of the entity types that occur in that chunk.

chunks = []
for f in files:
    chunks.extend(f.get_chunks(metadata_entities=metadata_entities))
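
Based on how the chunks are consumed in the next step, each chunk carries its text plus a map of detected entities under its metadata. An illustrative shape, inferred from this example's own usage rather than from a documented schema:

# Illustrative only - inferred from how the chunks are used below
example_chunk = {
    "text": "John Smith joined Google in 2021.",
    "metadata": {
        "entities": {
            "NAME_GIVEN": ["John"],
            "NAME_FAMILY": ["Smith"],
            "ORGANIZATION": ["Google"],
            "DATE_TIME": ["2021"]
        }
    }
}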

Add the chunks to the Pinecone database

Next, embed the text of the chunks.

For each chunk, store the following in a Pinecone vector database:

  • Text

  • Embedding

  • Metadata

You define the embedding function for your system.

from pinecone import Pinecone
import random
import uuid

def embedding_function(text: str) -> list[float]:
    # Replace this placeholder with your embedding model.
    # The vector length must match the dimension of your Pinecone index.
    return [random.random() for _ in range(10)]

vectors = []
for chunk in chunks:
    # Use the entity metadata from the chunk, and store the chunk text
    # alongside it so that queries can return the text directly
    metadata = dict(chunk["metadata"]["entities"])
    metadata["text"] = chunk["text"]
    vectors.append({
        "id": str(uuid.uuid4()),
        "values": embedding_function(chunk["text"]),
        "metadata": metadata
    })

pc = Pinecone(api_key='your-pinecone-api-key')
index_name = "your-pinecone-index-name"
index = pc.Index(index_name)

index.upsert(vectors=vectors)
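
The snippet above assumes that the Pinecone index already exists. If it does not, you can create a serverless index first; a sketch using the Pinecone client, with the cloud and region as placeholders:

from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key='your-pinecone-api-key')
pc.create_index(
    name="your-pinecone-index-name",
    dimension=10, # must match the length of the vectors that embedding_function returns
    metric="cosine",
    spec=ServerlessSpec(cloud="aws", region="us-east-1")
)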

Using metadata filters to query the Pinecone database

When you query the Pinecone database, you can then use metadata filters that specify entity type constraints.

For example, to only return chunks that contain the name John Smith:

query = "your query"

index.query(
    vector=embedding_function(query),
    filter={
        "NAME_FAMILY": {"$eq": "Smith"},
        "NAME_GIVEN": {"$eq": "John"}
    },
    top_k=5,
    include_metadata=True
)

As another example, to only return chunks that contain one of the following organizations - Google, Apple, or Microsoft:

query = "your query"

index.query(
    vector=embedding_function(query),
    filter={
        "ORGANIZATION": { "$in": ["Google", "Apple", "Microsoft"]}
    },
    top_k=5,
    include_metadata=True
)
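
The query returns the closest matches together with their metadata. A short sketch of reading the stored chunk text back out of the response; the text key is the one added to the metadata during the upsert above:

results = index.query(
    vector=embedding_function(query),
    filter={"ORGANIZATION": {"$in": ["Google", "Apple", "Microsoft"]}},
    top_k=5,
    include_metadata=True
)

for match in results.matches:
    print(match.score, match.metadata["text"])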