Structure of JSON output files

The JSON output provides access to Markdown content and identifies the entities that were detected in the file.

Common elements in the JSON output

Information about the entire file

All JSON output files contain the following elements that contain information for the entire file:

{
  "fileType": "<file type>",
  "content": {
    "text": "<Markdown file content>",
    "hash": "<hashed file content>",
    "entities": [   //Entry for each entity in the file
      {
        "start": <start location>,
        "end": <end location>,
        "label": "<value type>",
        "text": "<value text>",
        "score": <confidence score>,
        "language": "<language code>"
      }
    ]
  },
  "schemaVersion": <integer schema version>
}

fileType

The type of the original file.

content

Details about the file content. It includes:

  • Hashed and Markdown content for the file

  • Entities in the file

schemaVersion

An integer that identifies the version of the JSON schema that was used for the JSON output.

Textual uses this to convert content from older schemas to the most recent schema.

For specific file types, the JSON output includes additional objects and properties to reflect the file structure.

Hashed and Markdown content

The JSON output contains hashed and Markdown content for the entire file and for individual file components.

hash

The hashed version of the file or component content.

text

The file or component content in Markdown notation.

Entities

The JSON output contains entities arrays for the entire file and for individual file components.

Each entity in the entities array has the following properties:

start

Within the file or component, the location where the entity value starts.

For example, in the following text:

My name is John.

John is an entity that starts at 11.

end

Within the file or component, the location where the entity value ends.

For example, in the following text:

My name is John.

John is an entity that ends at 14.

label

The type of entity.

For a list of the entity types that Textual detects, go to .

text

The text of the entity.

score

The confidence score for the entity.

Indicates how confident Textual is that the value is an entity of the specified type.

language

The language code to identify the language for the entity value. For example, en indicates that the value is in English.

Plain text files

For plain text files, the JSON output only contains the information for the entire file.

.csv files

For .csv files, the structure contains a tables array.

The tables array contains a table object that contains header and data arrays..

For each row in the file, the data array contains a row array.

For each value in a row, the row array contains a value object.

The value object contains the entities, hashed content, and Markdown content for the value.

.xlsx files

For .xlsx files, the structure contains a tables array that provides details for each worksheet in the file.

For each worksheet, the tables array contains a worksheet object.

For each row in a worksheet, the worksheet object contains a header array and a data array. The data array contains a row array.

For each cell in a row, the row array contains a cell object.

Each cell object contains the entities, hashed content, and Markdown content for the cell.

.docx files

For .docx files, the JSON output structure adds:

  • A footnotes array for content in footnotes.

  • An endnotes array for content in endnotes.

  • A header object for content in the page headers. Includes separate objects for the first page header, even page header, and odd page header.

  • A footer object for content in the page footers. Includes separate objects for the first page footer, even page footer, and odd page footer.

These arrays and objects contain the entities, hashed content, and Markdown content for the notes, headers, and footers.

PDF and image files

PDF and image files use the same structure. Textual extracts and scans the text from the files.

For PDF and image files, the JSON output structure adds the following content.

pages array

The pages array contains all of the content on the pages. This includes content in tables and key-value pairs, which are also listed separately in the output.

For each page in the file, the pages array contains a page array.

For each component on the page - such as paragraphs, headings, headers, and footers - the page array contains a component object.

Each component object contains the component entities, hashed content, and Markdown content.

tables array

The tables array contains content that is in tables.

For each table in the file, the tables array contains a table array.

For each row in a table, the table array contains a row array.

For each cell in a row, the row array contains a cell object.

Each cell object identifies the type of cell (header or content). It also contains the entities, hashed content, and Markdown content for the cell.

keyValuePairs array

The keyValuePairs array contains key-value pair content. For example, for a PDF of a form with fields, a key-value pair might represent a field label and a field value.

For each key-value pair, the keyValuePairs array contains a key-value pair object.

The key-value pair object contains:

  • An automatically incremented identifier. For example, id for the first key-value pair is 1, for the second key-value pair is 2, and so on.

  • The start and end position of the key-value pair

  • The text of the key

  • The entities, hashed content, and Markdown content for the value

PDF and image JSON outline

.eml and .msg files

For email message files, the JSON output structure adds the following content.

Email message identifiers

The JSON output includes the following email message identifiers:

  • The identifier of the current message

  • If the message was a reply to another message, the identifier of that message

  • An array of related email messages. This includes the email message that the message replied to, as well as any other messages in an email message thread.

Recipients

The JSON output includes the email address and display name of the message recipients. It contains separate lists for the following:

  • Recipients in the To line

  • Recipients in the CC line

  • Recipients in the BCC line

Subject line

The subject object contains the message subject line. It includes:

  • Markdown and hashed versions of the message subject line.

  • The entities that were detected in the subject line.

Message timestamp

sentDate provides the timestamp when the message was sent.

Message body

The plainTextBodyContent object contains the body of the email message.

It contains:

  • Markdown and hashed versions of the message body.

  • The entities that were detected in the message body.

Message attachments

The attachments array provides information about any attachments to the email message. For each attached file, it includes:

  • The identifier of the message that the file is attached to.

  • The identifier of the attachment.

  • The JSON output for the file.

  • The count of words in the original file.

  • The count of words in the redacted version of the file.

Email message JSON outline

Last updated

Was this helpful?