Structure of the pipeline output file JSON

When Textual processes pipeline files, it produces JSON output that provides access to the Markdown content and identifies the entities that were detected in the file.

Common elements in the JSON output

Information about the entire file

All JSON output files contain the following elements, which provide information about the entire file:

{
  "fileType": "<file type>",
  "content": {
    "text": "<Markdown file content>",
    "hash": "<hashed file content>",
    "entities": [   //Entry for each entity in the file
      {
        "start": <start location>,
        "end": <end location>,
        "label": "<value type>",
        "text": "<value text>",
        "score": <confidence score>,
        "language": "<language code>"
      }
    ]
  },
  "schemaVersion": <integer schema version>
}

fileType

The type of the original file.

content

Details about the file content. It includes:

  • Hashed and Markdown content for the file

  • Entities in the file

schemaVersion

An integer that identifies the version of the JSON schema that was used for the JSON output.

Textual uses this to convert content from older schemas to the most recent schema.

For specific file types, the JSON output includes additional objects and properties to reflect the file structure.
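
As a quick illustration of how to consume these common elements, the following Python sketch loads a downloaded output file and reads the file-level fields. The file name output.json is only an example; the field names match the outline above.

import json

# Load one pipeline output file (example path; point this at wherever you
# saved the downloaded JSON output).
with open("output.json", encoding="utf-8") as f:
    doc = json.load(f)

print(doc["fileType"])               # type of the original file
print(doc["schemaVersion"])          # integer schema version
print(doc["content"]["hash"])        # hashed file content
print(len(doc["content"]["text"]))   # length of the Markdown file content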

Hashed and Markdown content

The JSON output contains hashed and Markdown content for the entire file and for individual file components.

hash

The hashed version of the file or component content.

text

The file or component content in Markdown notation.

Entities

The JSON output contains entities arrays for the entire file and for individual file components.

Each entity in the entities array has the following properties:

start

Within the file or component, the location where the entity value starts.

For example, in the following text:

My name is John.

John is an entity that starts at 11.

end

Within the file or component, the location where the entity value ends.

For example, in the following text:

My name is John.

John is an entity that ends at 14.

label

The type of entity.

text

The text of the entity.

score

The confidence score for the entity.

Indicates how confident Textual is that the value is an entity of the specified type.

language

The language code that identifies the language of the entity value. For example, en indicates that the value is in English.
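
For example, a minimal Python sketch that lists each detected entity in the file-level content object, using the properties described above (output.json is an example file name):

import json

with open("output.json", encoding="utf-8") as f:   # example path
    doc = json.load(f)

for entity in doc["content"]["entities"]:
    print(
        entity["label"],       # type of entity
        repr(entity["text"]),  # text of the entity
        entity["score"],       # confidence score
        entity["language"],    # language code, for example en
        entity["start"],       # location where the value starts
        entity["end"],         # location where the value ends
    )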

Plain text files

For plain text files, the JSON output contains only the information for the entire file.

{
  "fileType": "<file type>",
  "content": {
    "text": "<Markdown content>",
    "hash": "<hashed content>",
    "entities": [   //Entry for each entity in the file
      {
        "start": <start location>,
        "end": <end location>,
        "label": "<value type>",
        "text": "<value text>",
        "score": <confidence score>,
        "language": "<language code>"      }
    ]
  },
  "schemaVersion": <integer schema version>
}

.csv files

For .csv files, the structure contains a tables array.

The tables array contains a single table object, which contains header and data arrays.

For each row in the file, the data array contains a row array.

For each value in a row, the row array contains a value object.

The value object contains the entities, hashed content, and Markdown content for the value.

{
  "tables": [
    {
      "tableName": "csv_table",
      "header": [//Columns that contain heading info (col_0, col_1, and so on)
        "<column identifier>"
      ],
      "data": [  //Entry for each row in the file
        [   //Entry for each value in the row
          {    
            "entities": [   //Entry for each entity in the value
              {
                "start": <start location>,,
                "end": <end location>,
                "label": "<value type>",
                "text": "<value text>",
                "score": <confidence score>,
                "language": "<language code>"
              }
            ],
            "hash": "<hashed value content>",
            "text": "<Markdown value content>"
          }
        ]
      ]
    }
  ],
  "fileType": "<file type>",
  "content": {
    "text": "<Markdown file content>",
    "hash": "<hashed file content>",
    "entities": [   ///Entry for each entity in the file
      {
        "start": <start location>,
        "end": <end location>
        "label": "<value type>",
        "text": "<value text>",
        "score": <confidence score>,
        "language": "<language code>"
      }
    ]
  },
  "schemaVersion": <integer schema version>
}
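
A minimal Python sketch that walks this structure, assuming the output for a .csv file was downloaded as customers.csv.json (an example name):

import json

with open("customers.csv.json", encoding="utf-8") as f:   # example path
    doc = json.load(f)

table = doc["tables"][0]                 # .csv output contains a single csv_table
print(table["tableName"], table["header"])

for row in table["data"]:                # one entry for each row in the file
    for value in row:                    # one value object for each value in the row
        if value["entities"]:
            print(value["text"], [e["label"] for e in value["entities"]])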

.xlsx files

For .xlsx files, the structure contains a tables array that provides details for each worksheet in the file.

For each worksheet, the tables array contains a worksheet object.

Each worksheet object contains a header array and a data array. For each row in the worksheet, the data array contains a row array.

For each cell in a row, the row array contains a cell object.

Each cell object contains the entities, hashed content, and Markdown content for the cell.

{
  "tables": [   //Entry for each worksheet
    {
      "tableName": "<Name of the worksheet>",
      "header": [ //Columns that contain heading info (col_0, col_1, and so on)
        "<column identifier>"
      ],
      "data": [   //Entry for each row
        [   //Entry for each cell in the row
          {
            "entities": [   //Entry for each entity in the cell
              {
                "start": <start location>,
                "end": <end location>,
                "label": "<value type>",
                "text": "<value text>",
                "score": <confidence score>,
                "language": "<language code>"
              }
            ],
            "hash": "<hashed cell content>",
            "text": "<Markdown cell content>"
          }
        ]
      ]
    }
  ],
  "fileType": "<file type>",
  "content": {
    "text": "<Markdown file content>",
    "hash": "<hashed file content>",
    "entities": [   //Entry for each entity in the file
      {
        "start": <start location>,
        "end": <end location>,
        "label": "<value type>",
        "text": "<value text>",
        "score": <confidence score>,
        "language": "<language code>"
      }
    ]
  },
  "schemaVersion": <integer schema version>
}
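
For example, the following Python sketch counts the detected entities in each worksheet, assuming the output for an .xlsx file was downloaded as report.xlsx.json (an example name):

import json

with open("report.xlsx.json", encoding="utf-8") as f:   # example path
    doc = json.load(f)

for worksheet in doc["tables"]:          # one entry for each worksheet
    count = 0
    for row in worksheet["data"]:        # one entry for each row
        for cell in row:                 # one cell object for each cell in the row
            count += len(cell["entities"])
    print(worksheet["tableName"], count, "entities")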

.docx files

For .docx files, the JSON output structure adds:

  • A footnotes array for content in footnotes.

  • An endnotes array for content in endnotes.

  • A header object for content in the page headers. Includes separate objects for the first page header, even page header, and odd page header.

  • A footer object for content in the page footers. Includes separate objects for the first page footer, even page footer, and odd page footer.

These arrays and objects contain the entities, hashed content, and Markdown content for the notes, headers, and footers.

{
  "footNotes": [   //Entry for each footnote
    {
      "entities": [   //Entry for each entity in the footnote
        {
          "start": <start location>,
          "end": <end location>,
          "pythonStart": <start location in Python>,
          "pythonEnd": <end location in Python>,
          "label": "<value type>",
          "text": "<value text>",
          "score": <confidence score>,
          "language": "<language code>"
          "exampleRedaction": null
        }
      ],
      "hash": "<hashed footnote content>",
      "text": "<Markdown footnote content>"
    }
  ],
  "endNotes": [   //Entry for each endnote
    {
      "entities": [   //Entry for each entity in the endnote
        {
          "start": <start location>,
          "end": <end location>,
          "label": "<value type>",
          "text": "<value text>",
          "score": <confidence score>,
          "language": "<language code>"
        }
      ],
      "hash": "<hashed endnote content>",
      "text": "<Markdown endnote content>"
    }
  ],
  "header": {
    "first": {
      "entities": [   //Entry for each entity in the first page header
        {
          "start": <start location>,
          "end": <end location>,
          "label": "<value type>",
          "text": "<value text>",
          "score": <confidence score>,
          "language": "<language code>"
        }
      ],
      "hash": "<hashed first page header content>",
      "text": "<Markdown first page header content>"
    },
    "even": {
      "entities": [   //Entry for each entity in the even page header
        {
          "start": <start location>,
          "end": <end location>,
          "label": "<value type>",
          "text": "<value text>",
          "score": <confidence score>,
          "language": "<language code>"
        }
      ],
      "hash": "<hashed even page header content>",
      "text": "<Markdown even page header content>"
    },
    "odd": {
      "entities": [   //Entry for each entity in the odd page header
        {
          "start": <start location>,
          "end": <end location>,
          "label": "<value type>",
          "text": "<value text>",
          "score": <confidence score>,
          "language": "<language code>"
        }
      ],
      "hash": "<hashed odd page header content>",
      "text": "<Markdown odd page header content>"
    }
  },
  "footer": {
    "first": {
      "entities": [   //Entry for each entity in the first page footer
        {
          "start": <start location>,
          "end": <end location>,
          "label": "<value type>",
          "text": "<value text>",
          "score": <confidence score>,
          "language": "<language code>"
        }
      ],
      "hash": "<hashed first page footer content>",
      "text": "<Markdown first page footer content>"
    },
    "even": {
      "entities": [   //Entry for each entity in the even page footer
        {
          "start": <start location>,
          "end": <end location>,
          "label": "<value type>",
          "text": "<value text>",
          "score": <confidence score>,
          "language": "<language code>"
        }
      ],
      "hash": "<hashed even page footer content>",
      "text": "<Markdown even page footer content>"
    },
    "odd": {
      "entities": [   //Entry for each entity in the odd page footer
        {
          "start": <start location>,
          "end": <end location>,
          "label": "<value type>",
          "text": "<value text>",
          "score": <confidence score>,
          "language": "<language code>"
        }
      ],
      "hash": "<hashed odd page footer content>",
      "text": "<Markdown odd page footer content>"
    }
  },
  "fileType": "<file type>",
  "content": {
    "text": "<Markdown file content>",
    "hash": "<hashed file content>",
    "entities": [   //Entry for each entity in the file
      {
        "start": <start location>,
        "end": <end location>,
        "label": "<value type>",
        "text": "<value text>",
        "score": <confidence score>,
        "language": "<language code>"
      }
    ]
  },
  "schemaVersion": <integer schema version>
}
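
The following Python sketch reads the note, header, and footer content from a .docx output file, assuming it was downloaded as contract.docx.json (an example name). It uses .get so that it still runs if a particular note list or header or footer variant is absent, which this page does not specify.

import json

with open("contract.docx.json", encoding="utf-8") as f:   # example path
    doc = json.load(f)

for note in doc.get("footNotes", []) + doc.get("endNotes", []):
    print(note["text"], len(note["entities"]), "entities")

for section in ("header", "footer"):
    for variant in ("first", "even", "odd"):
        part = doc.get(section, {}).get(variant)
        if part:
            print(section, variant, len(part["entities"]), "entities")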

PDF and image files

PDF and image files use the same structure. Textual extracts and scans the text from the files.

For PDF and image files, the JSON output structure adds the following content.

pages array

The pages array contains all of the content on the pages. This includes content in tables and key-value pairs, which are also listed separately in the output.

For each page in the file, the pages array contains a page array.

For each component on the page - such as paragraphs, headings, headers, and footers - the page array contains a component object.

Each component object contains the component entities, hashed content, and Markdown content.

tables array

The tables array contains content that is in tables.

For each table in the file, the tables array contains a table array.

For each row in a table, the table array contains a row array.

For each cell in a row, the row array contains a cell object.

Each cell object identifies the type of cell (header or content). It also contains the entities, hashed content, and Markdown content for the cell.

keyValuePairs array

The keyValuePairs array contains key-value pair content. For example, for a PDF of a form with fields, a key-value pair might represent a field label and a field value.

For each key-value pair, the keyValuePairs array contains a key-value pair object.

The key-value pair object contains:

  • An automatically incremented identifier. For example, id is 1 for the first key-value pair, 2 for the second, and so on.

  • The start and end position of the key-value pair

  • The text of the key

  • The entities, hashed content, and Markdown content for the value

PDF and image JSON outline

{
  "pages": [   //Entry for each page in the file
    [   //Entry for each component on the page
      {
        "type": "<page component type>",
        "content": {
          "entities": [   //Entry for each entity in the component
            {
              "start": <start location>,
              "end": <end location>,
              "label": "<value type>",
              "text": "<value text>",
              "score": <confidence score>,
              "language": "<language code>"
            }
          ],
          "hash": "<hashed component content>",
          "text": "<Markdown component content>"
        }
      }
    ]
  ],
  "tables": [   //Entry for each table in the file
    [   //Entry for each row in the table
      [   //Entry for each cell in the row
        {
          "type": "<content type>",   //ColumnHeader or Content
          "content": {
            "entities": [  //Entry for each entity in the cell
              {
                "start": <start location>,
                "end": <end location>,
                "label": "<value type>",
                "text": "<value text>",
                "score": <confidence score>,
                "language": "<language code>"
              }
            ],
            "hash": "<hashed cell text>",
            "text": "<Markdown cell text>"
          }
        }
      ]
    ]
  ],
  "keyValuePairs": [   //Entry for each key-value pair in the file
    {
      "id": <incremented identifier>,
      "key": "<key text>",
      "value": {
        "entities": [  //Entry for each entity in the value
          {
            "start": <start location>,
            "end": <end location>,
            "label": "<value type>",
            "text": "<value text>",
            "score": <confidence score>,
            "language": "<language code>"
          }
        ],
        "hash": "<hashed value text>",
        "text": "<Markdown value text>"
      },
      "start": <start location of the key-value pair>,
      "end": <end location of the key-value pair>
    }
  ],
  "fileType": "<file type>",
  "content": {
    "text": "<Markdown file content>",
    "hash": "<hashed file content>",
    "entities": [   ///Entry for each entity in the file
      {
        "start": <start location>,
        "end": <end location>,
        "label": "<value type>",
        "text": "<value text>",
        "score": <confidence score>,
        "language": "<language code>"
      }
    ]
  },
  "schemaVersion": <integer schema version>
}
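
A short Python sketch that walks the pages, tables, and keyValuePairs arrays, assuming the output for a scanned form was downloaded as form.pdf.json (an example name):

import json

with open("form.pdf.json", encoding="utf-8") as f:   # example path
    doc = json.load(f)

for page_number, page in enumerate(doc["pages"], start=1):
    for component in page:               # paragraph, heading, header, footer, and so on
        print(page_number, component["type"], len(component["content"]["entities"]))

for table in doc["tables"]:              # one entry for each table
    for row in table:                    # one entry for each row
        for cell in row:                 # ColumnHeader or Content cell
            print(cell["type"], cell["content"]["text"])

for pair in doc["keyValuePairs"]:        # field label and field value
    print(pair["id"], pair["key"], "->", pair["value"]["text"])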

.eml and .msg files

For email message files, the JSON output structure adds the following content.

Email message identifiers

The JSON output includes the following email message identifiers:

  • The identifier of the current message

  • If the message was a reply to another message, the identifier of that message

  • An array of related email messages. This includes the email message that the message replied to, as well as any other messages in an email message thread.

Recipients

The JSON output includes the email address and display name of the message recipients. It contains separate lists for the following:

  • Recipients in the To line

  • Recipients in the CC line

  • Recipients in the BCC line

Subject line

The subject object contains the message subject line. It includes:

  • Markdown and hashed versions of the message subject line.

  • The entities that were detected in the subject line.

Message timestamp

sentDate provides the timestamp when the message was sent.

Message body

The plainTextBodyContent object contains the body of the email message.

It includes:

  • Markdown and hashed versions of the message body.

  • The entities that were detected in the message body.

Message attachments

The attachments array provides information about any attachments to the email message. For each attached file, it includes:

  • The identifier of the message that the file is attached to.

  • The identifier of the attachment.

  • The JSON output for the file.

  • The count of words in the original file.

  • The count of words in the redacted version of the file.

Email message JSON outline

{
  "messageId": "<email message identifier>",
  "inReplyToMessageId": <message that this message replied to>,
  "messageIdReferences": [<related email messages>],
  "senderAddress": {
    "address": "<sender email address>",
    "displayName": "<sender display name>"
  },
  "toAddresses": [  //Entry for each recipient in the To list
    {
      "address": "<recipient email address>",
      "displayName": "<recipient display name>"
    }
  ],
  "ccAddresses": [ //Entry for each recipient in the CC list
    {
      "address": "<recipient email address>",
      "displayName": "<recipient display name>"
    }
  ],
  "bccAddresses": [ //Entry for each recipient in the BCC list
    {
      "address": "<recipient email address>",
      "displayName": "<recipient display name>"
    }
  ],
  "sentDate": "<timestamp when the message was sent>",
  "subject": {
    "text": "<Markdown version of the subject line>",
    "hash": "<hashed version of the subject line>",
    "entities": [   //Entry for each entity in the subject line
      {
        "start": <start location>,
        "end": <end location>,
        "label": "<value type>",
        "text": "<value text>",
        "score": <confidence score>,
        "language": "<language code>"
      }
    ]
  },
  "plainTextBodyContent": {
    "text": "<Markdown version of the message body>",
    "hash": "<hashed version of the message body>",
    "entities": [ //Entry for each entity in the message body
      {
        "start": <start location>,
        "end": <end location>,
        "label": "<value type>",
        "text": "<value text>",
        "score": <confidence score>,
        "language": "<language code>"
      }
    ]
  },
  "attachments": [ //Entry for each attached file
    {
      "parentMessageId": "<the message that the file is attached to>",
      "contentId": "<identifier of the attachment>",
      "fileName": "<name of the attachment file>",
      "document": {<pipeline JSON for the attached file>},
      "wordCount": <number of words in the attachment>,
      "redactedWordCount": <number of words in the redacted attachment>
    }
  ],
  "fileType": "<file type>",
  "content": {
    "text": "<Markdown file content>",
    "hash": "<hashed file content>",
    "entities": [ //Entry for each entity in the file
      {
        "start": <start location>,
        "end": <end location>,
        "label": "<value type>",
        "text": "<value text>",
        "score": <confidence score>,
        "language": "<language code>"
      }
    ]
  },
  "schemaVersion": <integer schema version>
}
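
For example, a minimal Python sketch that summarizes an email message output file, assuming it was downloaded as message.eml.json (an example name):

import json

with open("message.eml.json", encoding="utf-8") as f:   # example path
    doc = json.load(f)

print(doc["subject"]["text"])                        # Markdown subject line
print(doc["sentDate"])                               # timestamp when the message was sent
print([to["address"] for to in doc["toAddresses"]])  # To recipients

print(len(doc["plainTextBodyContent"]["entities"]), "entities in the message body")

for attachment in doc["attachments"]:
    # attachment["document"] holds the full pipeline JSON for the attached file.
    print(attachment["fileName"],
          attachment["wordCount"],
          attachment["redactedWordCount"])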

For a list of the entity types that Textual detects, go to Entity types that Textual detects.