From a file list, to display the details for a file, click the file name.
For files other than .txt files, the Original tab allows you to toggle between the generated Markdown and the rendered text.
For a .txt file, where there is no difference between the Markdown and the rendered text, the Original tab displays the file content.
In a pipeline that is configured to also generate redacted files, the Redacted <file type> option allows you to display the redacted version of a PDF or image file.
The Entities tab displays the file content with the detected entity values in context.
The actual values are followed by the type labels. For example, the given name John is displayed as John NAME_GIVEN.
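The labeled display described above can be approximated with a short sketch. The entity keys `start`, `end`, and `label` are assumptions made for illustration, and `end` is treated here as an exclusive offset in the style of a Python slice. The actual output may use different names or offset conventions.

```python
def annotate(text, entities):
    """Return text with each entity value followed by its type label.

    Assumes hypothetical entity keys: start, end (exclusive), label.
    """
    pieces = []
    cursor = 0
    for entity in sorted(entities, key=lambda e: e["start"]):
        end = entity["end"]
        # Copy everything up to the end of the entity value, then the label.
        pieces.append(text[cursor:end])
        pieces.append(" " + entity["label"])
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


sample_text = "Contact John today."
sample_entities = [{"start": 8, "end": 12, "label": "NAME_GIVEN"}]
print(annotate(sample_text, sample_entities))
```

With the sample above, the value John is followed by NAME_GIVEN, which mirrors how the Entities tab shows values in context.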
The JSON tab contains the content of the output file. For Amazon S3 or Databricks pipelines, the files are also in the output location that you configured for the pipeline.
For details about the JSON output structure for the different types of files, go to Structure of the pipeline output file JSON.
For a PDF or image file that contains one or more tables, the Tables tab displays the tables. If the file does not contain any tables, then the Tables tab does not display.
For a PDF or image file that contains key-value pairs, the Key-Value Pairs tab displays the key-value pairs. If the file does not contain key-value pairs, then the Key-Value Pairs tab does not display.
View pipeline files and runs
Display the lists of pipeline files and pipeline runs.
View processed file details
Display details for a file, including the original content, detected entities, and JSON output.
JSON output structure
Structure of the JSON output for different types of pipeline files.
For an uploaded file pipeline, the Files tab contains the list of all of the pipeline files.
For cloud storage pipelines, you use the pipeline details Overview page to track processed files and pipeline runs.
For pipelines that are configured to also redact files, you can configure the redaction for the detected data types. For more information, go to Selecting the handling option for the entity types.
For uploaded file pipelines, when you add a file to the pipeline, it is automatically added to the file list.
For cloud storage pipelines, the file list is not populated until you run the pipeline. The list only contains processed files.
On the Overview page for a cloud storage pipeline, the Pipeline Runs tab displays the list of pipeline runs.
For each run, the list includes:
Run identifier
When the run was started
The current status of the pipeline run. The possible statuses are:
Queued - The pipeline run has not started to run yet.
Running - The pipeline run is in progress.
Completed - The pipeline run completed successfully.
Failed - The pipeline run failed.
For a pipeline run, to display the list of files that the pipeline run includes, click View Run.
For each file, the list includes the following information:
File name
For cloud storage files, the path to the file
The status of the file processing. The possible statuses are:
Unprocessed - The file is added but a pipeline run to process it has not been started. This only applies to uploaded files that were added since the most recent pipeline run.
Queued - A pipeline run was started but the file is not yet processed.
Running - The file is being processed.
Completed - The file was processed successfully.
Failed - The file could not be processed.
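If you track file statuses in your own tooling, the statuses listed above can be modeled as an enumeration. This is only a sketch: the `FileStatus` class and the `pending_files` helper are hypothetical names, and the `(name, status)` pair format is an assumption, not something that Textual provides.

```python
from enum import Enum


class FileStatus(Enum):
    """File processing statuses as listed in the file list."""
    UNPROCESSED = "Unprocessed"
    QUEUED = "Queued"
    RUNNING = "Running"
    COMPLETED = "Completed"
    FAILED = "Failed"


# Statuses after which no further processing is expected for the file.
FINISHED = {FileStatus.COMPLETED, FileStatus.FAILED}


def pending_files(files):
    """Return the names of files whose processing has not finished.

    `files` is an iterable of (file name, status text) pairs.
    """
    return [name for name, status in files if FileStatus(status) not in FINISHED]


files = [
    ("a.pdf", "Completed"),
    ("b.txt", "Queued"),
    ("c.docx", "Unprocessed"),
    ("d.csv", "Failed"),
]
print(pending_files(files))
```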
When Textual processes pipeline files, it produces JSON output that provides access to the Markdown content and that identifies entities that were detected in the file.
All JSON output files contain the following elements that contain information for the entire file:
For specific file types, the JSON output includes additional objects and properties to reflect the file structure.
The JSON output contains hashed and Markdown content for the entire file and for individual file components.
The JSON output contains entities arrays for the entire file and for individual file components.
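Because entities arrays can appear at several levels, a generic way to gather them is to walk the parsed JSON recursively. The sketch below relies only on the `entities` key named above. The sample document, including its `content` key and the `text` key inside each entity, is made up for illustration.

```python
def collect_entities(node):
    """Recursively gather the items of every `entities` array in parsed JSON."""
    found = []
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "entities" and isinstance(value, list):
                found.extend(value)
            else:
                found.extend(collect_entities(value))
    elif isinstance(node, list):
        for item in node:
            found.extend(collect_entities(item))
    return found


# Hypothetical output fragment; real files have more properties.
sample = {
    "content": {"entities": [{"text": "John"}]},
    "tables": [{"data": [[{"entities": [{"text": "Paris"}]}]]}],
}
print(collect_entities(sample))
```

Note that the same entity may be reported both for the entire file and for a component, so a walk like this can return duplicates.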
Each entity in the entities array has the following properties:
For plain text files, the JSON output only contains the information for the entire file.
For .csv files, the structure contains a tables array.
The tables array contains a table object that contains header and data arrays.
For each row in the file, the data array contains a row array.
For each value in a row, the row array contains a value object.
The value object contains the entities, hashed content, and Markdown content for the value.
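The nesting described above (tables, then data, then rows, then value objects) can be traversed as in the following sketch. It assumes that each value object stores its entities under an `entities` key. The sample data, including the string header values, is invented for illustration.

```python
def csv_cells_with_entities(output):
    """Return (row index, column index) for each value that has entities."""
    hits = []
    for table in output.get("tables", []):
        for row_index, row in enumerate(table.get("data", [])):
            for col_index, value in enumerate(row):
                # A value object with a non-empty entities array is a hit.
                if value.get("entities"):
                    hits.append((row_index, col_index))
    return hits


# Hypothetical .csv output fragment.
csv_output = {
    "tables": [
        {
            "header": ["name", "city"],
            "data": [
                [{"entities": [{"text": "John"}]}, {"entities": []}],
                [{"entities": []}, {"entities": [{"text": "Paris"}]}],
            ],
        }
    ]
}
print(csv_cells_with_entities(csv_output))
```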
For .xlsx files, the structure contains a tables array that provides details for each worksheet in the file.
For each worksheet, the tables array contains a worksheet object.
For each row in a worksheet, the worksheet object contains a header array and a data array. The data array contains a row array.
For each cell in a row, the row array contains a cell object.
Each cell object contains the entities, hashed content, and Markdown content for the cell.
For .docx files, the JSON output structure adds:
A footnotes array for content in footnotes.
An endnotes array for content in endnotes.
A header object for content in the page headers. Includes separate objects for the first page header, even page header, and odd page header.
A footer object for content in the page footers. Includes separate objects for the first page footer, even page footer, and odd page footer.
These arrays and objects contain the entities, hashed content, and Markdown content for the notes, headers, and footers.
PDF and image files use the same structure. Textual extracts and scans the text from the files.
For PDF and image files, the JSON output structure adds the following content.
pages array
The pages array contains all of the content on the pages. This includes content in tables and key-value pairs, which are also listed separately in the output.
For each page in the file, the pages array contains a page array.
For each component on the page - such as paragraphs, headings, headers, and footers - the page array contains a component object.
Each component object contains the component entities, hashed content, and Markdown content.
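As a rough illustration of the pages structure, the following sketch counts the entities on each page. It assumes that each component object stores its entities under an `entities` key. The sample data is invented.

```python
def entities_per_page(output):
    """Return a list with the number of detected entities on each page."""
    counts = []
    for page in output.get("pages", []):
        # Each page is an array of component objects.
        counts.append(sum(len(component.get("entities", [])) for component in page))
    return counts


# Hypothetical PDF output fragment with two pages.
pdf_output = {
    "pages": [
        [{"entities": [{"text": "John"}]}, {"entities": []}],
        [{"entities": [{"text": "Paris"}, {"text": "Acme"}]}],
    ]
}
print(entities_per_page(pdf_output))
```

Because content in tables and key-value pairs is also part of the pages array, these counts already cover that content. Adding counts from the tables or keyValuePairs arrays on top would count some entities twice.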
tables array
The tables array contains content that is in tables.
For each table in the file, the tables array contains a table array.
For each row in a table, the table array contains a row array.
For each cell in a row, the row array contains a cell object.
Each cell object identifies the type of cell (header or content). It also contains the entities, hashed content, and Markdown content for the cell.
keyValuePairs array
The keyValuePairs array contains key-value pair content. For example, for a PDF of a form with fields, a key-value pair might represent a field label and field value.
For each key-value pair, the keyValuePairs array contains a key-value pair object.
The key-value pair object contains:
An automatically incremented identifier. For example, id for the first key-value pair is 1, for the second key-value pair is 2, and so on.
The start and end position of the key-value pair
The text of the key
The entities, hashed content, and Markdown content for the value
For email message files, the JSON output structure adds the following content.
The JSON output includes the following email message identifiers:
The identifier of the current message
If the message was a reply to another message, the identifier of that message
An array of related email messages. This includes the email message that the message replied to, as well as any other messages in an email message thread.
The JSON output includes the email address and display name of the message recipients. It contains separate lists for the following:
Recipients in the To line
Recipients in the CC line
Recipients in the BCC line
The subject object contains the message subject line. It includes:
Markdown and hashed versions of the message subject line.
The entities that were detected in the subject line.
sentDate provides the timestamp when the message was sent.
The plainTextBodyContent object contains the body of the email message. It contains:
Markdown and hashed versions of the message body.
The entities that were detected in the message body.
The attachments array provides information about any attachments to the email message. For each attached file, it includes:
The identifier of the message that the file is attached to.
The identifier of the attachment.
The JSON output for the file.
The count of words in the original file.
The count of words in the redacted version of the file.
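To show how the email-specific elements might be read together, here is a minimal sketch. It uses the `subject`, `sentDate`, `plainTextBodyContent`, and `attachments` names from the description above. It assumes that the subject and body objects store detected entities under an `entities` key. The sample message is invented, and its `sentDate` value is left empty because the timestamp format is not specified here.

```python
def summarize_email(output):
    """Summarize the entity-bearing parts of an email message output."""
    summary = {}
    for part in ("subject", "plainTextBodyContent"):
        # Count the entities detected in the subject line and the body.
        summary[part] = len(output.get(part, {}).get("entities", []))
    summary["sentDate"] = output.get("sentDate")
    summary["attachments"] = len(output.get("attachments", []))
    return summary


# Hypothetical email output fragment.
email_output = {
    "subject": {"entities": [{"text": "John"}]},
    "sentDate": None,
    "plainTextBodyContent": {"entities": [{"text": "John"}, {"text": "Paris"}]},
    "attachments": [{}],
}
print(summarize_email(email_output))
```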
| The type of the original file. |
| Details about the file content. It includes:
|
| An integer that identifies the version of the JSON schema that was used for the JSON output. Textual uses this to convert content from older schemas to the most recent schema. |
| The hashed version of the file or component content. |
| The file or component content in Markdown notation. |
| Within the file or component, the location where the entity value starts. For example, in the following text:
John is an entity that starts at 11. |
| Within the file or component, the location where the entity value ends. For example, in the following text:
John is an entity that ends at 14. |
| The type of entity. |
| The text of the entity. |
| The confidence score for the entity. Indicates how confident Textual is that the value is an entity of the specified type. |
| The language code to identify the language for the entity value.
For example, |
For a list of the entity types that Textual detects, go to .