
Viewing pipeline results

The pipeline details show the results of the pipeline processing: the pipeline files and, for Amazon S3 and Databricks pipelines, the individual pipeline runs.

Viewing pipeline files and runs

For an uploaded file pipeline, the Files page contains the list of pipeline files.

For Amazon S3 and Databricks pipelines, you use the pipeline details Overview page to track processed files and pipeline runs. For pipelines that are configured to also redact files, you can configure the redaction for the detected data types. For more information, go to Selecting the handling option for the entity types.

Viewing the list of all files for a pipeline

For uploaded file pipelines, all file management occurs on the Files page. The Files page only contains the file list. When you add a file to the pipeline, it is automatically added to the file list.

For Amazon S3 and Databricks pipelines, the file list is on the Files tab of the Overview page. The list is not populated until you run the pipeline. The list only contains processed files.

Viewing the list of pipeline runs

On the Overview page for an Amazon S3 or Databricks pipeline, the Pipeline Runs tab displays the list of pipeline runs.

For each run, the list includes:

Run identifier
When the run was started
The current status of the pipeline run. The possible statuses are:
- Queued - The pipeline run has not started yet.
- Running - The pipeline run is in progress.
- Completed - The pipeline run completed successfully.
- Failed - The pipeline run failed.

Viewing the list of pipeline run files

For a pipeline run, to display the list of files that the pipeline run includes, click View Run.

Information in a file list

For each file, the list includes the following information:

File name
For Amazon S3 or Databricks files, the path to the file
The status of the file processing. The possible statuses are:
- Unprocessed - The file is added but a pipeline run to process it has not been started. This only applies to uploaded files that were added since the most recent pipeline run.
- Queued - A pipeline run was started but the file is not yet processed.
- Running - The file is being processed.
- Completed - The file was processed successfully.
- Failed - The file could not be processed.

Displaying details for a processed file

From a file list, to display the details for a file, click the file name.

Original tab - File content

For files other than .txt files, the Original tab allows you to toggle between the generated Markdown and the rendered text.

For a .txt file, where there is no difference between the Markdown and the rendered text, the Original tab displays the file content.

In a pipeline that is configured to also generate redacted files, the Redacted <file type> option allows you to display the redacted version of a PDF or image file.

Entities tab - Detected entities in the file

The Entities tab displays the file content with the detected entity values in context.

The actual values are followed by the type labels. For example, the given name John is displayed as John NAME_GIVEN.
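The display convention can be reproduced outside the application from the Markdown text and the entity objects in the output JSON. The following is a minimal sketch, not the product's own rendering code. It assumes that start and end are character offsets into the text and that end is exclusive. The function name annotate and the sample data are hypothetical.

```python
def annotate(text, entities):
    """Return text with each detected value followed by its type label."""
    parts = []
    cursor = 0
    # Walk the entities from left to right so that earlier offsets stay valid.
    for entity in sorted(entities, key=lambda e: e["start"]):
        parts.append(text[cursor:entity["end"]])  # text up to the end of the value
        parts.append(" " + entity["label"])       # type label shown after the value
        cursor = entity["end"]
    parts.append(text[cursor:])  # remaining text after the last entity
    return "".join(parts)


# Hypothetical sample data, not taken from a real pipeline run.
# Assumption: "end" is an exclusive character offset.
sample_text = "My name is John."
sample_entities = [
    {"start": 11, "end": 15, "label": "NAME_GIVEN", "text": "John", "score": 0.9}
]

print(annotate(sample_text, sample_entities))
```

If the offsets in your output turn out to be inclusive or to use a different unit, adjust the slicing accordingly.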

JSON tab - Output JSON for the file

The JSON tab contains the content of the output file. For Amazon S3 or Databricks pipelines, the files are also in the output location that you configured for the pipeline.

For details about the JSON output structure for the different types of files, go to Structure of the pipeline output file JSON.

Tables tab - Tables in a PDF or image file

For a PDF or image file that contains one or more tables, the Tables tab displays the tables. If the file does not contain any tables, then the Tables tab is not displayed.

Key-Value Pairs tab - Key-value pairs in a PDF or image file

For a PDF or image file that contains key-value pairs, the Key-Value Pairs tab displays the key-value pairs. If the file does not contain key-value pairs, then the Key-Value Pairs tab is not displayed.

Structure of the pipeline output file JSON

When Textual processes pipeline files, it produces JSON output that provides access to the Markdown content and that identifies entities that were detected in the file.

Common elements in the JSON output

Information about the entire file

All JSON output files contain the following elements that contain information for the entire file:

{
  "fileType": "<file type>",
  "content": {
    "text": "<Markdown file content>",
    "hash": "<hashed file content>",
    "entities": [   //Entry for each entity in the file
      {
        "start": <start location>,
        "end": <end location>,
        "label": "<value type>",
        "text": "<value text>",
        "score": <confidence score>
      }
    ]
  },
  "schemaVersion": <integer schema version>
}

For specific file types, the JSON output includes additional objects and properties to reflect the file structure.
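As an illustration of how the common elements might be consumed, the following sketch parses a small output document and summarizes its entities. The sample values (file type, labels, scores, hash) are invented for the example, and the helper names entities_above and label_counts are hypothetical. Real output files do not contain the // comments that the structure templates on this page use.

```python
import json
from collections import Counter

# Hypothetical sample that follows the common structure. All values are made up.
sample_output = json.loads("""
{
  "fileType": "Txt",
  "content": {
    "text": "Contact John at Acme.",
    "hash": "abc123",
    "entities": [
      {"start": 8, "end": 12, "label": "NAME_GIVEN", "text": "John", "score": 0.95},
      {"start": 16, "end": 20, "label": "ORGANIZATION", "text": "Acme", "score": 0.6}
    ]
  },
  "schemaVersion": 1
}
""")


def entities_above(output, min_score):
    """Return the file-level entities whose confidence score is at least min_score."""
    return [e for e in output["content"]["entities"] if e["score"] >= min_score]


def label_counts(output):
    """Count the file-level entities by value type."""
    return Counter(e["label"] for e in output["content"]["entities"])


print(label_counts(sample_output))
print([e["text"] for e in entities_above(sample_output, 0.9)])
```

Filtering on score is one possible way to review low-confidence detections separately. The threshold of 0.9 is arbitrary.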

Hashed and Markdown content

The JSON output contains hashed and Markdown content for the entire file and for individual file components.

Entities

The JSON output contains entities arrays for the entire file and for individual file components.

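Each entity in the entities array has the following properties:

start - The start location of the value
end - The end location of the value
label - The type of the value
text - The text of the value
score - The confidence score for the detection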

Plain text files

For plain text files, the JSON output only contains the information for the entire file.

{
  "fileType": "<file type>",
  "content": {
    "text": "<Markdown content>",
    "hash": "<hashed content>",
    "entities": [   //Entry for each entity in the file
      {
        "start": <start location>,
        "end": <end location>,
        "label": "<value type>",
        "text": "<value text>",
        "score": <confidence score>
      }
    ]
  },
  "schemaVersion": <integer schema version>
}

.csv files

For .csv files, the structure contains a tables array.

The tables array contains a table object that contains header and data arrays.

For each row in the file, the data array contains a row array.

For each value in a row, the row array contains a value object.

The value object contains the entities, hashed content, and Markdown content for the value.

{
  "tables": [
    {
      "tableName": "csv_table",
      "header": [ //Columns that contain heading info (col_0, col_1, and so on)
        "<column identifier>"
      ],
      "data": [  //Entry for each row in the file
        [   //Entry for each value in the row
          {    
            "entities": [   //Entry for each entity in the value
              {
                "start": <start location>,
                "end": <end location>,
                "label": "<value type>",
                "text": "<value text>",
                "score": <confidence score>
              }
            ],
            "hash": "<hashed value content>",
            "text": "<Markdown value content>"
          }
        ]
      ]
    }
  ],
  "fileType": "<file type>",
  "content": {
    "text": "<Markdown file content>",
    "hash": "<hashed file content>",
    "entities": [   //Entry for each entity in the file
      {
        "start": <start location>,
        "end": <end location>,
        "label": "<value type>",
        "text": "<value text>",
        "score": <confidence score>
      }
    ]
  },
  "schemaVersion": <integer schema version>
}
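The nesting of tables, rows, and values can be walked with two loops. The following sketch flattens the data array of a .csv output into lists of cell text. The function name csv_rows and the sample values are hypothetical, and the sample omits the file-level content object for brevity.

```python
def csv_rows(output):
    """Yield (table name, row index, list of value texts) for each row."""
    for table in output.get("tables", []):
        for row_index, row in enumerate(table["data"]):
            # Each row is a list of value objects with entities, hash, and text.
            yield table["tableName"], row_index, [value["text"] for value in row]


# Hypothetical sample that follows the .csv structure. All values are made up.
sample_csv_output = {
    "tables": [
        {
            "tableName": "csv_table",
            "header": ["col_0", "col_1"],
            "data": [
                [
                    {"entities": [], "hash": "h1", "text": "1"},
                    {
                        "entities": [
                            {"start": 0, "end": 4, "label": "NAME_GIVEN",
                             "text": "John", "score": 0.9}
                        ],
                        "hash": "h2",
                        "text": "John",
                    },
                ],
                [
                    {"entities": [], "hash": "h3", "text": "2"},
                    {"entities": [], "hash": "h4", "text": "n/a"},
                ],
            ],
        }
    ],
    "fileType": "Csv",
    "schemaVersion": 1,
}

for name, index, texts in csv_rows(sample_csv_output):
    print(name, index, texts)
```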

.xlsx files

For .xlsx files, the structure contains a tables array that provides details for each worksheet in the file.

For each worksheet, the tables array contains a worksheet object.

Each worksheet object contains a header array and a data array. For each row in the worksheet, the data array contains a row array.

For each cell in a row, the row array contains a cell object.

Each cell object contains the entities, hashed content, and Markdown content for the cell.

{
  "tables": [   //Entry for each worksheet
    {
      "tableName": "<Name of the worksheet>",
      "header": [ //Columns that contain heading info (col_0, col_1, and so on)
        "<column identifier>"
      ],
      "data": [   //Entry for each row
        [   //Entry for each cell in the row
          {
            "entities": [   //Entry for each entity in the cell
              {
                "start": <start location>,
                "end": <end location>,
                "label": "<value type>",
                "text": "<value text>",
                "score": <confidence score>
              }
            ],
            "hash": "<hashed cell content>",
            "text": "<Markdown cell content>"
          }
        ]
      ]
    }
  ],
  "fileType": "<file type>",
  "content": {
    "text": "<Markdown file content>",
    "hash": "<hashed file content>",
    "entities": [   //Entry for each entity in the file
      {
        "start": <start location>,
        "end": <end location>,
        "label": "<value type>",
        "text": "<value text>",
        "score": <confidence score>
      }
    ]
  },
  "schemaVersion": <integer schema version>
}
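Because each worksheet is a separate entry in the tables array, per-worksheet summaries are straightforward. The following sketch counts the detected entities in each worksheet. The function name entity_counts_by_worksheet, the helper cell, and the sample worksheet names are hypothetical.

```python
def entity_counts_by_worksheet(output):
    """Map each worksheet name to the number of entities found in its cells."""
    counts = {}
    for sheet in output.get("tables", []):
        counts[sheet["tableName"]] = sum(
            len(cell["entities"]) for row in sheet["data"] for cell in row
        )
    return counts


def cell(text, labels=()):
    """Build a hypothetical cell object. The entity offsets cover the whole cell."""
    entities = [
        {"start": 0, "end": len(text), "label": label, "text": text, "score": 0.9}
        for label in labels
    ]
    return {"entities": entities, "hash": "hash-of-" + text, "text": text}


# Hypothetical sample that follows the .xlsx structure. All values are made up.
sample_xlsx_output = {
    "tables": [
        {
            "tableName": "Customers",
            "header": ["col_0", "col_1"],
            "data": [
                [cell("John", ["NAME_GIVEN"]), cell("Smith", ["NAME_FAMILY"])],
                [cell("Mary", ["NAME_GIVEN"]), cell("unknown")],
            ],
        },
        {"tableName": "Notes", "header": ["col_0"], "data": [[cell("n/a")]]},
    ],
    "fileType": "Xlsx",
    "schemaVersion": 1,
}

print(entity_counts_by_worksheet(sample_xlsx_output))
```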

.docx files

For .docx files, the JSON output structure adds:

A footnotes array for content in footnotes.
An endnotes array for content in endnotes.
A header object for content in the page headers. Includes separate objects for the first page header, even page header, and odd page header.
A footer object for content in the page footers. Includes separate objects for the first page footer, even page footer, and odd page footer.

These arrays and objects contain the entities, hashed content, and Markdown content for the notes, headers, and footers.

{
  "footNotes": [   //Entry for each footnote
    {
      "entities": [   //Entry for each entity in the footnote
        {
          "start": <start location>,
          "end": <end location>,
          "pythonStart": <start location in Python>,
          "pythonEnd": <end location in Python>,
          "label": "<value type>",
          "text": "<value text>",
          "score": <confidence score>,
          "exampleRedaction": null
        }
      ],
      "hash": "<hashed footnote content>",
      "text": "<Markdown footnote content>"
    }
  ],
  "endNotes": [   //Entry for each endnote
    {
      "entities": [   //Entry for each entity in the endnote
        {
          "start": <start location>,
          "end": <end location>,
          "label": "<value type>",
          "text": "<value text>",
          "score": <confidence score>
        }
      ],
      "hash": "<hashed endnote content>",
      "text": "<Markdown endnote content>"
    }
  ],
  "header": {
    "first": {
      "entities": [   //Entry for each entity in the first page header
        {
          "start": <start location>,
          "end": <end location>,
          "label": "<value type>",
          "text": "<value text>",
          "score": <confidence score>
        }
      ],
      "hash": "<hashed first page header content>",
      "text": "<Markdown first page header content>"
    },
    "even": {
      "entities": [   //Entry for each entity in the even page header
        {
          "start": <start location>,
          "end": <end location>,
          "label": "<value type>",
          "text": "<value text>",
          "score": <confidence score>
        }
      ],
      "hash": "<hashed even page header content>",
      "text": "<Markdown even page header content>"
    },
    "odd": {
      "entities": [   //Entry for each entity in the odd page header
        {
          "start": <start location>,
          "end": <end location>,
          "label": "<value type>",
          "text": "<value text>",
          "score": <confidence score>
        }
      ],
      "hash": "<hashed odd page header content>",
      "text": "<Markdown odd page header content>"
    }
  },
  "footer": {
    "first": {
      "entities": [   //Entry for each entity in the first page footer
        {
          "start": <start location>,
          "end": <end location>,
          "label": "<value type>",
          "text": "<value text>",
          "score": <confidence score>
        }
      ],
      "hash": "<hashed first page footer content>",
      "text": "<Markdown first page footer content>"
    },
    "even": {
      "entities": [   //Entry for each entity in the even page footer
        {
          "start": <start location>,
          "end": <end location>,
          "label": "<value type>",
          "text": "<value text>",
          "score": <confidence score>
        }
      ],
      "hash": "<hashed even page footer content>",
      "text": "<Markdown even page footer content>"
    },
    "odd": {
      "entities": [   //Entry for each entity in the odd page footer
        {
          "start": <start location>,
          "end": <end location>,
          "label": "<value type>",
          "text": "<value text>",
          "score": <confidence score>
        }
      ],
      "hash": "<hashed odd page footer content>",
      "text": "<Markdown odd page footer content>"
    }
  },
  "fileType": "<file type>",
  "content": {
    "text": "<Markdown file content>",
    "hash": "<hashed file content>",
    "entities": [   //Entry for each entity in the file
      {
        "start": <start location>,
        "end": <end location>,
        "label": "<value type>",
        "text": "<value text>",
        "score": <confidence score>
      }
    ]
  },
  "schemaVersion": <integer schema version>
}
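To gather the entities that appear outside the main body of a .docx file, a script can visit the note arrays and the header and footer objects in turn. The following sketch does that. The function name docx_section_entities and the sample values are hypothetical, and the code assumes that an absent header or footer variant is either missing or null.

```python
def docx_section_entities(output):
    """Yield (location, entity) pairs for footnotes, endnotes, headers, and footers."""
    for key in ("footNotes", "endNotes"):
        for index, note in enumerate(output.get(key) or []):
            for entity in note["entities"]:
                yield f"{key}[{index}]", entity
    for key in ("header", "footer"):
        block = output.get(key) or {}
        for variant in ("first", "even", "odd"):
            part = block.get(variant)
            if part:  # Assumption: a variant that is not used is missing or null.
                for entity in part["entities"]:
                    yield f"{key}.{variant}", entity


# Hypothetical sample that follows the .docx structure. All values are made up.
sample_docx_output = {
    "footNotes": [
        {
            "entities": [
                {"start": 4, "end": 8, "label": "NAME_GIVEN", "text": "John", "score": 0.9}
            ],
            "hash": "h1",
            "text": "See John's report.",
        }
    ],
    "endNotes": [],
    "header": {
        "first": {
            "entities": [
                {"start": 0, "end": 4, "label": "ORGANIZATION", "text": "Acme", "score": 0.8}
            ],
            "hash": "h2",
            "text": "Acme internal",
        }
    },
    "footer": {},
    "fileType": "Docx",
    "schemaVersion": 1,
}

for location, entity in docx_section_entities(sample_docx_output):
    print(location, entity["label"], entity["text"])
```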

PDF and image files

PDF and image files use the same structure. Textual extracts and scans the text from the files.

For PDF and image files, the JSON output structure adds:

A pages array for all of the content on the pages. This includes content in tables and key-value pairs, which are also listed separately in the output. For each page in the file, the pages array contains a page array. For each component on the page - such as paragraphs, headings, headers, and footers - the page array contains a component object. Each component object contains the component entities, hashed content, and Markdown content.
A tables array for content that is in tables. For each table in the file, the tables array contains a table array. For each row in a table, the table array contains a row array. For each cell in a row, the row array contains a cell object. Each cell object identifies the type of cell (header or content). It also contains the entities, hashed content, and Markdown content for the cell.
A keyValuePairs array for key-value pair content. For example, for a PDF of a form with fields, a key-value pair might represent a field label and field value. For each key-value pair, the keyValuePairs array contains a key-value pair object. The key-value pair object contains:
- An automatically incremented identifier. For example, the id of the first key-value pair is 1, the id of the second key-value pair is 2, and so on.
- The start and end position of the key-value pair
- The text of the key
- The entities, hashed content, and Markdown content for the value

{
  "pages": [   //Entry for each page in the file
    [   //Entry for each component on the page
      {
        "type": "<page component type>",
        "content": {
          "entities": [   //Entry for each entity in the component
            {
              "start": <start location>,
              "end": <end location>,
              "label": "<value type>",
              "text": "<value text>",
              "score": <confidence score>
            }
          ],
          "hash": "<hashed component content>",
          "text": "<Markdown component content>"
        }
      }
    ]
  ],
  "tables": [   //Entry for each table in the file
    [   //Entry for each row in the table
      [   //Entry for each cell in the row
        {
          "type": "<content type>",   //ColumnHeader or Content
          "content": {
            "entities": [  //Entry for each entity in the cell
              {
                "start": <start location>,
                "end": <end location>,
                "label": "<value type>",
                "text": "<value text>",
                "score": <confidence score>
              }
            ],
            "hash": "<hashed cell text>",
            "text": "<Markdown cell text>"
          }
        }
      ]
    ]
  ],
  "keyValuePairs": [   //Entry for each key-value pair in the file
    {
      "id": <incremented identifier>,
      "key": "<key text>",
      "value": {
        "entities": [  //Entry for each entity in the value
          {
            "start": <start location>,
            "end": <end location>,
            "label": "<value type>",
            "text": "<value text>",
            "score": <confidence score>
          }
        ],
        "hash": "<hashed value text>",
        "text": "<Markdown value text>"
      },
      "start": <start location of the key-value pair>,
      "end": <end location of the key-value pair>
    }
  ],
  "fileType": "<file type>",
  "content": {
    "text": "<Markdown file content>",
    "hash": "<hashed file content>",
    "entities": [   //Entry for each entity in the file
      {
        "start": <start location>,
        "end": <end location>,
        "label": "<value type>",
        "text": "<value text>",
        "score": <confidence score>
      }
    ]
  },
  "schemaVersion": <integer schema version>
}
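The three additional arrays can be read independently of each other. The following sketch lists the page components, converts one table into rows of cell text, and builds a lookup from key text to value text. The function names, the helper block, and the sample values (including the component type Heading) are hypothetical.

```python
def page_components(output):
    """Yield (page number, component type, Markdown text) for each page component."""
    for page_number, page in enumerate(output.get("pages", []), start=1):
        for component in page:
            yield page_number, component["type"], component["content"]["text"]


def table_rows(table):
    """Convert one entry of the tables array into rows of cell text."""
    return [[cell["content"]["text"] for cell in row] for row in table]


def key_value_map(output):
    """Map the text of each key to the Markdown text of its value."""
    return {pair["key"]: pair["value"]["text"] for pair in output.get("keyValuePairs", [])}


def block(kind, text):
    """Build a hypothetical component or cell object without entities."""
    return {"type": kind, "content": {"entities": [], "hash": "hash-of-" + text, "text": text}}


# Hypothetical sample that follows the PDF and image structure. All values are made up.
sample_pdf_output = {
    "pages": [[block("Heading", "# Invoice")]],
    "tables": [
        [
            [block("ColumnHeader", "Item"), block("ColumnHeader", "Price")],
            [block("Content", "Widget"), block("Content", "10")],
        ]
    ],
    "keyValuePairs": [
        {
            "id": 1,
            "key": "Name",
            "value": {"entities": [], "hash": "h1", "text": "John"},
            "start": 0,
            "end": 10,
        }
    ],
    "fileType": "Pdf",
    "schemaVersion": 1,
}

print(list(page_components(sample_pdf_output)))
print(table_rows(sample_pdf_output["tables"][0]))
print(key_value_map(sample_pdf_output))
```

Note that key_value_map keeps only the last value when the same key text occurs more than once. A list of pairs might be a better fit for forms with repeated field labels.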