When Textual processes pipeline files, it produces JSON output that provides access to the Markdown content and that identifies entities that were detected in the file.
All JSON output files contain the fileType, content, and schemaVersion elements, which contain information for the entire file. Each element is described at the end of this section.
For specific file types, the JSON output includes additional objects and properties to reflect the file structure.
The JSON output contains hashed and Markdown content for the entire file and for individual file components.
The JSON output contains entities arrays for the entire file and for individual file components.
Each entity in an entities array has the start, end, label, text, and score properties, which are described at the end of this section.
For plain text files, the JSON output only contains the information for the entire file.
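As a rough illustration, the following sketch builds a hypothetical output for a plain text file and reads its file-level entities. The property names come from this section. The nesting of hash, text, and entities directly under content, and every sample value (the placeholder hash, the "Txt" file type, the NAME_GIVEN label, and the score), are assumptions and not actual Textual output.

```python
import json

# Hypothetical output for a plain text file. Property names follow this
# section; the nesting and every value are illustrative assumptions.
sample_output = json.loads("""
{
  "fileType": "Txt",
  "schemaVersion": 1,
  "content": {
    "hash": "placeholder-hash",
    "text": "My name is John.",
    "entities": [
      {"start": 11, "end": 14, "label": "NAME_GIVEN", "text": "John", "score": 0.9}
    ]
  }
}
""")


def summarize(output):
    """Return (fileType, schemaVersion, [(label, text), ...]) for the whole file."""
    content = output["content"]
    found = [(e["label"], e["text"]) for e in content.get("entities", [])]
    return output["fileType"], output["schemaVersion"], found


file_type, schema_version, entities = summarize(sample_output)
print(file_type, schema_version, entities)
```

Reading entities with `.get("entities", [])` keeps the helper usable if a file has no detected entities and the array is absent, which is also an assumption about the output.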
For .csv files, the structure contains a tables array.
The tables array contains a table object that contains header and data arrays.
For each row in the file, the data array contains a row array.
For each value in a row, the row array contains a value object.
The value object contains the entities, hashed content, and Markdown content for the value.
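The nesting described for .csv files can be walked with a few loops. The sketch below is a minimal example under stated assumptions: the tables, header, data, entities, hash, and text names come from this section, while the shape of the header entries, the sample values, and the helper name iter_value_entities are hypothetical. Pass in whichever object holds the tables array in your output.

```python
# Hypothetical .csv output fragment. The tables/header/data nesting follows
# the description above; the header entries and all values are assumed.
csv_content = {
    "tables": [
        {
            "header": ["name", "city"],
            "data": [
                [  # one row array, holding one value object per value
                    {"hash": "h1", "text": "John", "entities": [
                        {"start": 0, "end": 3, "label": "NAME_GIVEN",
                         "text": "John", "score": 0.9}
                    ]},
                    {"hash": "h2", "text": "Paris", "entities": []},
                ],
            ],
        }
    ]
}


def iter_value_entities(content):
    """Yield (table_index, row_index, column_index, entity) for each entity."""
    for t, table in enumerate(content.get("tables", [])):
        for r, row in enumerate(table.get("data", [])):
            for c, value in enumerate(row):
                for entity in value.get("entities", []):
                    yield t, r, c, entity


hits = [(t, r, c, e["text"]) for t, r, c, e in iter_value_entities(csv_content)]
print(hits)
```

Yielding the table, row, and column indexes alongside each entity makes it possible to map a detection back to the cell it came from.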
For .xlsx files, the structure contains a tables array that provides details for each worksheet in the file.
For each worksheet, the tables array contains a worksheet object.
Each worksheet object contains a header array and a data array.
For each row in the worksheet, the data array contains a row array.
For each cell in a row, the row array contains a cell object.
Each cell object contains the entities, hashed content, and Markdown content for the cell.
For .docx files, the JSON output structure adds:
A footnotes array for content in footnotes.
An endnotes array for content in endnotes.
A header object for content in the page headers. Includes separate objects for the first page header, even page header, and odd page header.
A footer object for content in the page footers. Includes separate objects for the first page footer, even page footer, and odd page footer.
These arrays and objects contain the entities, hashed content, and Markdown content for the notes, headers, and footers.
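As a hedged sketch of how the note arrays might be read, the example below collects entity text from footnotes and endnotes. It assumes that each item in those arrays is an object with entities, hash, and text properties. The sample values and the helper name note_entities are hypothetical. The header and footer objects are left out because this section does not name their first, even, and odd page properties.

```python
# Hypothetical .docx output fragment. The footnotes and endnotes names come
# from the description above; the item shape and all values are assumptions.
docx_fragment = {
    "footnotes": [
        {"hash": "f1", "text": "See John's memo.", "entities": [
            {"start": 4, "end": 7, "label": "NAME_GIVEN",
             "text": "John", "score": 0.9}
        ]},
    ],
    "endnotes": [],
}


def note_entities(fragment):
    """Map each note array name to the entity texts found in its items."""
    result = {}
    for key in ("footnotes", "endnotes"):
        result[key] = [
            entity["text"]
            for note in fragment.get(key, [])
            for entity in note.get("entities", [])
        ]
    return result


print(note_entities(docx_fragment))
```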
PDF and image files use the same structure. Textual extracts and scans the text from the files.
For PDF and image files, the JSON output structure adds:
A pages array for all of the content on the pages. This includes content in tables and key-value pairs, which are also listed separately in the output.
For each page in the file, the pages array contains a page array.
For each component on the page - such as paragraphs, headings, headers, and footers - the page array contains a component object.
Each component object contains the component entities, hashed content, and Markdown content.
A tables array for content that is in tables.
For each table in the file, the tables array contains a table array.
For each row in a table, the table array contains a row array.
For each cell in a row, the row array contains a cell object.
Each cell object identifies the type of cell (header or content). It also contains the entities, hashed content, and Markdown content for the cell.
A keyValuePairs array for key-value pair content. For example, for a PDF of a form with fields, a key-value pair might represent a field label and field value.
For each key-value pair, the keyValuePairs array contains a key-value pair object.
The key-value pair object contains:
An automatically incremented identifier. For example, id for the first key-value pair is 1, for the second key-value pair is 2, and so on.
The start and end position of the key-value pair
The text of the key
The entities, hashed content, and Markdown content for the value
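To make the PDF and image structure more concrete, the sketch below reads a hypothetical fragment that has a pages array and a keyValuePairs array. The pages, keyValuePairs, id, entities, hash, and text names come from this section. The key and value property names, the sample values, and both helper names are assumptions for illustration only.

```python
# Hypothetical PDF output fragment; every value is illustrative.
pdf_fragment = {
    "pages": [
        [  # page 1: a page array of component objects
            {"hash": "c1", "text": "# Application form", "entities": []},
            {"hash": "c2", "text": "Applicant: John", "entities": [
                {"start": 11, "end": 14, "label": "NAME_GIVEN",
                 "text": "John", "score": 0.9}
            ]},
        ],
    ],
    "keyValuePairs": [
        # Only "id" is named in the description above; "key" and "value"
        # are hypothetical property names.
        {"id": 1, "key": "Applicant",
         "value": {"hash": "v1", "text": "John", "entities": []}},
    ],
}


def entities_by_page(fragment):
    """Return {1-based page number: [entity text, ...]} from the pages array."""
    return {
        number: [
            entity["text"]
            for component in page
            for entity in component.get("entities", [])
        ]
        for number, page in enumerate(fragment.get("pages", []), start=1)
    }


def key_value_lookup(fragment):
    """Return {key text: value text} from the keyValuePairs array."""
    return {
        pair["key"]: pair["value"]["text"]
        for pair in fragment.get("keyValuePairs", [])
    }


print(entities_by_page(pdf_fragment))
print(key_value_lookup(pdf_fragment))
```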
fileType
The type of the original file.
content
Details about the file content. It includes:
Hashed and Markdown content for the file
Entities in the file
schemaVersion
An integer that identifies the version of the JSON schema that was used for the JSON output.
Textual uses this to convert content from older schemas to the most recent schema.
hash
The hashed version of the file or component content.
text
The file or component content in Markdown notation.
start
Within the file or component, the location where the entity value starts.
For example, in the following text:
My name is John.
John is an entity that starts at 11.
end
Within the file or component, the location where the entity value ends.
For example, in the following text:
My name is John.
John is an entity that ends at 14.
label
The type of entity.
For a list of the entity types that Textual detects, go to Entity types that Textual detects.
text
The text of the entity.
score
The confidence score for the entity.
Indicates how confident Textual is that the value is an entity of the specified type.
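The start and end examples above suggest zero-based character offsets, with end pointing at the last character of the entity (inclusive). That reading is inferred from the "My name is John." example and is not otherwise confirmed here. Under that assumption, a minimal sketch for recovering the entity text from the content looks like this. The label and score values and the helper name entity_span are hypothetical.

```python
text = "My name is John."

# Entity values taken from the start and end examples above; the label and
# score are illustrative assumptions.
entity = {"start": 11, "end": 14, "label": "NAME_GIVEN",
          "text": "John", "score": 0.9}


def entity_span(content_text, ent):
    """Slice the entity out of the content, treating end as inclusive."""
    return content_text[ent["start"]:ent["end"] + 1]


print(entity_span(text, entity))
```

Because Python slices exclude the upper bound, the helper adds 1 to end. If end turns out to be exclusive in your output, drop the `+ 1`.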