The Textual LLM preparation workflow transforms source files into content that you can incorporate into an LLM system.
You can:
Upload files directly from a local file system
Select files from an S3 bucket
Select files from a Databricks data volume
Select files from an Azure Blob Storage container
Textual can process plain text files (.txt and .csv), .docx files, .xlsx files, and PDF files. For images, Textual can extract text from .png, .tif/.tiff, and .jpg/.jpeg files.
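If you want to screen candidate files before you build a pipeline, a check like the following can help. This is an illustrative sketch, not part of Textual; the `is_supported` helper is hypothetical, and the extension list comes from the supported types above.

```python
from pathlib import Path

# File types that Textual can process, per the list above.
SUPPORTED_EXTENSIONS = {
    ".txt", ".csv",           # plain text
    ".docx", ".xlsx",         # Office documents
    ".pdf",                   # PDF
    ".png", ".tif", ".tiff",  # images (text is extracted)
    ".jpg", ".jpeg",
}

def is_supported(path: str) -> bool:
    """Return True if the file extension is one that Textual can process."""
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS

candidates = ["report.pdf", "notes.txt", "archive.zip", "scan.TIFF"]
print([f for f in candidates if is_supported(f)])
# ['report.pdf', 'notes.txt', 'scan.TIFF']
```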
At a high level, to use Textual to create LLM-ready content:
If the source files are in a local file system, then upload the files to the pipeline. Textual stores the files in your configured Amazon S3 location, and then automatically processes each new file.
If the source files are in cloud storage (Amazon S3, Databricks, or Azure):
Provide the credentials that Textual uses to connect to the storage location.
Identify the location where Textual writes the pipeline output.
Optionally, filter the files by file type. For example, you might only want to process PDF files.
Identify the files to include in the pipeline. You can select individual files or folders. When you select folders, Textual processes all of the files in the folder. The sketch after this list illustrates this folder-plus-filter selection for an S3 source.
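To get a sense of what this selection amounts to for an S3 source, the sketch below enumerates a folder and applies a PDF-only filter with boto3. The bucket name and prefix are hypothetical placeholders; this is not Textual's own code, just an illustration of the behavior described above.

```python
import boto3

# Hypothetical bucket and folder; substitute your own pipeline source location.
SOURCE_BUCKET = "my-source-bucket"
SOURCE_PREFIX = "contracts/"

s3 = boto3.client("s3")  # uses the credentials you provide to Textual

# Enumerate every object under the selected folder, mirroring how selecting
# a folder includes all of the files inside it.
paginator = s3.get_paginator("list_objects_v2")
selected = []
for page in paginator.paginate(Bucket=SOURCE_BUCKET, Prefix=SOURCE_PREFIX):
    for obj in page.get("Contents", []):
        key = obj["Key"]
        # Optional file type filter, e.g. only process PDF files.
        if key.lower().endswith(".pdf"):
            selected.append(key)

print(f"{len(selected)} PDF files selected for the pipeline")
```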
For each file, Textual:
Converts the content to raw text. For image files, this means extracting any text that is present.
Uses its built-in models to detect entity values in the text.
Generates a Markdown version of the original text.
Produces a JSON file that contains:
The Markdown version of the text
The detected entities and their locations
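A typical way to consume this output is to load the JSON and walk the entity list. The exact schema is not spelled out here, so the field names in this sketch (`markdown`, `entities`, `label`, `start`, `end`) are assumptions for illustration; check a real output file for the actual keys.

```python
import json

# Load one pipeline output file. The filename is a hypothetical example.
with open("report.pdf.json") as f:
    result = json.load(f)

markdown_text = result["markdown"]

# Each detected entity records what was found and where it sits in the text.
for entity in result["entities"]:
    label = entity["label"]               # e.g. NAME_GIVEN, ORGANIZATION
    start, end = entity["start"], entity["end"]
    print(f"{label}: {markdown_text[start:end]!r} at [{start}, {end})")
```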
From Textual, for each processed file, you can view the processing results and download the generated JSON.
Textual also provides code snippets to help you use the pipeline output.
For cloud storage pipelines, the JSON files are also available from the configured output location.
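For example, assuming the configured output location is an S3 bucket, you could fetch a processed file's JSON with boto3. The bucket and key below are hypothetical placeholders.

```python
import boto3
import json

# Hypothetical output location; use the bucket and prefix you configured
# for the pipeline output.
OUTPUT_BUCKET = "my-output-bucket"
OUTPUT_KEY = "pipeline-output/contracts/report.pdf.json"

s3 = boto3.client("s3")
response = s3.get_object(Bucket=OUTPUT_BUCKET, Key=OUTPUT_KEY)
result = json.loads(response["Body"].read())

# The parsed output is now ready to chunk, embed, or feed to an LLM.
print(result.keys())
```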
You can also configure pipelines to create redacted versions of the original values. For more information, go to Datasets workflow for text redaction and synthesis.
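As a rough illustration of what redaction does with the pipeline's entity data, the sketch below replaces each detected span with a bracketed label placeholder. It reuses the assumed entity fields from the earlier sketch; Textual's actual redaction workflow is described in the page referenced above.

```python
def redact(text: str, entities: list[dict]) -> str:
    """Replace each detected entity span with a bracketed label placeholder.

    Assumes each entity has "start", "end", and "label" fields, as in the
    earlier sketch. Spans are applied from the end of the text backward so
    that earlier offsets stay valid.
    """
    for entity in sorted(entities, key=lambda e: e["start"], reverse=True):
        placeholder = f"[{entity['label']}]"
        text = text[: entity["start"]] + placeholder + text[entity["end"] :]
    return text

sample = "Alice Smith works at Initech."
spans = [
    {"start": 0, "end": 5, "label": "NAME_GIVEN"},
    {"start": 6, "end": 11, "label": "NAME_FAMILY"},
    {"start": 21, "end": 28, "label": "ORGANIZATION"},
]
print(redact(sample, spans))
# [NAME_GIVEN] [NAME_FAMILY] works at [ORGANIZATION].
```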