Downloading and using pipeline output

From Tonic Textual, you can download the JSON output for each file.

You can also use the Textual API to further process the pipeline output - for example, you can chunk the output and determine whether to replace sensitive values before you use the output in a RAG system.

Textual also provides next steps guidance for using the pipeline output. The examples in this topic show how to work with the output.

Downloading output files

From a file details page, to download the JSON file, click Download Results.

Viewing next steps information for pipeline output

Available next steps

On the pipeline details page, the next steps panel at the left contains suggested steps to set up the API and use the pipeline output:

  • Create an API Key - Contains a link to create an API key.

  • Install the Python SDK - Contains a link to copy the SDK installation command.

  • Fetch the pipeline results - Provides access to code snippets that you can use to retrieve and chunk the pipeline results.

Copying the pipeline identifier

At the top of the Fetch the pipeline results step is the pipeline identifier. To copy the identifier, click the copy icon.
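After you copy the identifier, you can use it to select a specific pipeline from the list that `get_pipelines` returns, instead of always taking the most recent one. This sketch is an assumption, not confirmed SDK behavior: it uses mock objects that stand in for the real pipeline objects, which are assumed to expose an `id` attribute (verify the attribute name against your SDK version).

```python
# Hypothetical selection of a pipeline by its copied identifier.
# MockPipeline stands in for the objects returned by textual.get_pipelines();
# the real objects are assumed to expose an `id` attribute.
class MockPipeline:
    def __init__(self, pipeline_id: str, name: str):
        self.id = pipeline_id
        self.name = name

pipelines = [
    MockPipeline("a1b2", "invoices"),
    MockPipeline("c3d4", "contracts"),
]

pipeline_id = "c3d4"  # the identifier copied from the Fetch the pipeline results step
pipeline = next(p for p in pipelines if p.id == pipeline_id)
print(pipeline.name)  # contracts
```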

Selecting the snippet to view

The Fetch the pipeline results step provides access to the following snippets:

  • Markdown - A code snippet to retrieve the Markdown results for the pipeline.

  • JSON - A code snippet to retrieve the JSON results for the pipeline.

  • Chunks - A code snippet to chunk the pipeline results.

To view a snippet, click the snippet tab.

Viewing the snippet panel

To display the snippet panel, on the snippet tab, click View. The snippet panel provides a larger view of the snippet.

Copying a snippet

To copy the code snippet, on the snippet tab or the snippet panel, click Copy.

Example - Working with sensitive RAG chunks

This example shows how to use your Tonic Textual pipeline output to create private chunks for RAG, where sensitive chunks are dropped, redacted, or synthesized.

This allows you to ensure that the chunks that you use for RAG do not contain any private information.

Get the latest output files

First, we connect to the API and get the files from the most recent pipeline.

from tonic_textual.parse_api import TonicTextualParse

api_key = "your-tonic-textual-api-key"
textual = TonicTextualParse("https://textual.tonic.ai", api_key=api_key)

pipelines = textual.get_pipelines()
pipeline = pipelines[-1]  # get the most recent pipeline

Identify the entity types and handling

Next, specify the sensitive entity types, and indicate whether to redact or to synthesize those entities in the chunks.

sensitive_entities = [
    "NAME_GIVEN",
    "NAME_FAMILY",
    "EMAIL_ADDRESS",
    "PHONE_NUMBER",
    "CREDIT_CARD",
    "CC_EXP",
    "CVV", 
    "US_BANK_NUMBER"
]

# sensitive entities are set to be redacted
# to synthesize, change Redaction to Synthesis
generator_config = {label: 'Redaction' for label in sensitive_entities}

Generate the chunks

Next, generate the chunks.

In the following code snippet, the final list does not include chunks with sensitive entities.

To include the chunks with the sensitive entities redacted, remove the if chunk['is_sensitive']: continue lines.

chunks = []
for file in pipeline.enumerate_files():
    file_chunks = file.get_chunks(generator_config=generator_config)
    for chunk in file_chunks:
        # Skip chunks that contain sensitive entities.
        # To instead ingest the redacted version, remove this check.
        if chunk['is_sensitive']:
            continue
        chunks.append(chunk)

The chunks are now ready to use for RAG or for other downstream tasks.
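To make the two handling options concrete, here is the same filtering logic applied to mock chunk dictionaries. The `text` and `is_sensitive` keys mirror the shape used above; the redacted text values are illustrative only.

```python
# Mock chunks in the shape used above: chunk text plus a sensitivity flag.
mock_chunks = [
    {"text": "Quarterly revenue grew 12%.", "is_sensitive": False},
    {"text": "[NAME_GIVEN] [NAME_FAMILY] card ending [CREDIT_CARD]", "is_sensitive": True},
    {"text": "The product launch is scheduled for Q3.", "is_sensitive": False},
]

# Option 1: drop sensitive chunks entirely.
safe_chunks = [c for c in mock_chunks if not c["is_sensitive"]]

# Option 2: keep every chunk; sensitive values are already redacted or
# synthesized according to generator_config.
all_chunks = list(mock_chunks)

print(len(safe_chunks), len(all_chunks))  # 2 3
```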

Example - Using Pinecone to add pipeline output to a vector retrieval system

This example shows how to use Pinecone to add your Tonic Textual pipeline output to a vector retrieval system, for example for RAG.

The Pinecone metadata filtering options allow you to incorporate Textual NER metadata into the retrieval system.

Get the latest output files

First, connect to the Textual pipeline API, and get the files from the most recently created pipeline.

from tonic_textual.parse_api import TonicTextualParse

api_key = "your-tonic-textual-api-key"
textual = TonicTextualParse("https://textual.tonic.ai", api_key=api_key)

pipelines = textual.get_pipelines()
pipeline = pipelines[-1]  # get the most recent pipeline
files = pipeline.enumerate_files()

Identify the entity types to include

Next, specify the entity types to incorporate into the retrieval system.

metadata_entities = [
    "NAME_GIVEN",
    "NAME_FAMILY",
    "DATE_TIME",
    "ORGANIZATION"
]

Chunk the files and add metadata

Chunk the files.

For each chunk, add the metadata that contains the instances of the entity types that occur in that chunk.

chunks = []
for f in files:
    chunks.extend(f.get_chunks(metadata_entities=metadata_entities))
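For orientation, a chunk returned with entity metadata is assumed to look roughly like the following. This mock is for illustration only; the exact shape may vary by SDK version, but it mirrors how `chunk["metadata"]["entities"]` is accessed in the next step.

```python
# Assumed shape of a chunk with entity metadata (mock, for illustration only).
mock_chunk = {
    "text": "John Smith joined Google in 2021.",
    "metadata": {
        "entities": {
            "NAME_GIVEN": ["John"],
            "NAME_FAMILY": ["Smith"],
            "ORGANIZATION": ["Google"],
            "DATE_TIME": ["2021"],
        }
    },
}

# The next step flattens this into a Pinecone metadata dictionary:
flat = dict(mock_chunk["metadata"]["entities"])
flat["text"] = mock_chunk["text"]
```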

Add the chunks to the Pinecone database

Next, embed the text of the chunks.

For each chunk, store the following in a Pinecone vector database:

  • Text

  • Embedding

  • Metadata

You define the embedding function for your system.

from pinecone import Pinecone
import random
import uuid

def embedding_function(text: str) -> list[float]:
    # Replace this placeholder with your embedding model.
    return [random.random() for _ in range(10)]
    
vectors = []
for chunk in chunks:
    metadata = dict(chunk["metadata"]["entities"])
    metadata["text"] = chunk["text"]
    vectors.append({
        "id": str(uuid.uuid4()),
        "values": embedding_function(chunk["text"]),
        "metadata": metadata
    })

pc = Pinecone(api_key='your-pinecone-api-key')
index_name = "your-pinecone-index-name"
index = pc.Index(index_name)

index.upsert(vectors=vectors)

Using metadata filters to query the Pinecone database

When you query the Pinecone database, you can then use metadata filters that specify entity type constraints.

For example, to return only chunks that contain the name John Smith:

query = "your query"  # replace with your query text

index.query(
    vector=embedding_function(query),
    filter={
        "NAME_FAMILY": {"$eq": "Smith"},
        "NAME_GIVEN": {"$eq": "John"}
    },
    top_k=5,
    include_metadata=True
)

As another example, to return only chunks that mention one of the organizations Google, Apple, or Microsoft:

query = "your query"  # replace with your query text

index.query(
    vector=embedding_function(query),
    filter={
        "ORGANIZATION": { "$in": ["Google", "Apple", "Microsoft"]}
    },
    top_k=5,
    include_metadata=True
)
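Filter conditions can also be combined. Pinecone treats multiple top-level keys as an implicit AND (as in the John Smith example above), and its `$and` operator makes the combination explicit. The sketch below only builds the filter dictionary; it does not execute a query.

```python
# Combined metadata filter: the chunk must mention one of the organizations
# AND contain the family name Smith.
combined_filter = {
    "$and": [
        {"ORGANIZATION": {"$in": ["Google", "Apple", "Microsoft"]}},
        {"NAME_FAMILY": {"$eq": "Smith"}},
    ]
}
```

Pass this dictionary as the `filter` argument to `index.query`, exactly as in the examples above.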
