Pipelines workflow for LLM preparation

The Textual LLM preparation workflow transforms source files into content that is ready to use with an LLM.

You can:

  • Upload files directly from a local file system

  • Select files from an S3 bucket

  • Select files from a Databricks data volume

Textual can process plain text files (.txt and .csv), .docx files, .xlsx files, and PDF files. For images, Textual can extract text from .png, .tif/.tiff, and .jpg/.jpeg files.
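
If you script your uploads, you might want to filter source files to these formats before you send them. Here is a minimal sketch that uses only the Python standard library; the extension list simply mirrors the formats above:

```python
from pathlib import Path

# Extensions Textual can process, per the list above.
SUPPORTED_EXTENSIONS = {
    ".txt", ".csv",           # plain text
    ".docx", ".xlsx",         # Office documents
    ".pdf",                   # PDF
    ".png", ".tif", ".tiff",  # images (text extraction)
    ".jpg", ".jpeg",
}

def supported_files(directory: str) -> list[Path]:
    """Return the files under `directory` that Textual can process."""
    return [
        p for p in Path(directory).rglob("*")
        if p.is_file() and p.suffix.lower() in SUPPORTED_EXTENSIONS
    ]
```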

At a high level, to use Textual to create LLM-ready content:

  1. If the source files are in a local file system, then upload the files to the pipeline. Textual stores the files in your configured Amazon S3 location, and then automatically processes each new file.

  2. If the source files are in Amazon S3, identify the files to include in the pipeline. You can select individual files or folders. When you select a folder, Textual processes all of the files in it.

  3. If the source files are in Databricks:

    1. Provide the URL and access token to use to connect to Databricks. (A sketch for checking these values appears after this list.)

    2. Identify the location in Databricks where Textual writes the pipeline output.

    3. Identify the files to include in the pipeline.

    4. Run the pipeline.

  4. For each file, Textual:

    1. Converts the content to raw text. For image files, this means extracting any text that is present.

    2. Uses its built-in models to detect entity values in the text.

    3. Generates a Markdown version of the original text.

    4. Produces a JSON file (see the parsing sketch after this list) that contains:

      • The Markdown version of the text

      • The detected entities and their locations
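
If you want to confirm the Databricks URL and access token before you enter them in Textual (step 3), the databricks-sdk Python package can make a lightweight authenticated call. A minimal sketch; the host and token values are placeholders for your own workspace:

```python
from databricks.sdk import WorkspaceClient

# Placeholders: use your workspace URL and a personal access token.
w = WorkspaceClient(
    host="https://<your-workspace>.cloud.databricks.com",
    token="<your-access-token>",
)

# A lightweight authenticated call; raises if the URL or token is invalid.
me = w.current_user.me()
print(f"Connected to Databricks as {me.user_name}")
```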
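
Step 4 produces one JSON file per source file. The exact schema can vary by Textual version, and the field names in this sketch (markdown, entities, and the per-entity keys) are assumptions for illustration; inspect a real output file and adjust accordingly:

```python
import json

# Load one pipeline output file. Field names are assumptions;
# check an actual output file for the exact schema.
with open("example_output.json") as f:
    result = json.load(f)

markdown_text = result["markdown"]  # assumed: Markdown version of the text

for entity in result["entities"]:   # assumed: list of detected entities
    # Assumed per-entity keys: label, text, and character offsets.
    print(entity.get("label"), entity.get("text"),
          entity.get("start"), entity.get("end"))
```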

From Textual, for each processed file, you can view and download the output JSON.

For Amazon S3 and Databricks pipelines, the JSON files are also available from the configured output location.
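
For example, here is a minimal boto3 sketch that downloads the JSON outputs from an S3 output location; the bucket name and prefix are placeholders for your configured values:

```python
import json
import boto3

s3 = boto3.client("s3")

# Placeholders: the bucket and prefix configured as the pipeline's output location.
BUCKET = "my-textual-output-bucket"
PREFIX = "pipeline-output/"

# List and download each JSON output file.
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
    for obj in page.get("Contents", []):
        if not obj["Key"].endswith(".json"):
            continue
        body = s3.get_object(Bucket=BUCKET, Key=obj["Key"])["Body"].read()
        result = json.loads(body)
        print(obj["Key"], len(result.get("entities", [])), "entities")
```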

You can also configure pipelines to create redacted versions of the original values. For more information, go to Datasets workflow for text redaction and synthesis.
