Pipelines workflow for LLM preparation

The Textual LLM preparation workflow transforms source files into content that is ready to use with an LLM.

You can:

  • Upload files directly from a local file system

  • Select files from an S3 bucket

  • Select files from a Databricks data volume

Textual can process plain text files (.txt and .csv), .docx files, .xlsx files, and PDF files. For images, Textual can extract text from .png, .tif/.tiff, and .jpg/.jpeg files.
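
If you script your uploads, you might want to filter source files to these formats before you send them. Here is a minimal sketch that uses only the Python standard library; the extension list simply mirrors the formats above:

```python
from pathlib import Path

# Extensions Textual can process, per the list above.
SUPPORTED_EXTENSIONS = {
    ".txt", ".csv",           # plain text
    ".docx", ".xlsx",         # Office documents
    ".pdf",                   # PDF
    ".png", ".tif", ".tiff",  # images (text extraction)
    ".jpg", ".jpeg",
}

def supported_files(directory: str) -> list[Path]:
    """Return the files under `directory` that Textual can process."""
    return [
        p for p in Path(directory).rglob("*")
        if p.is_file() and p.suffix.lower() in SUPPORTED_EXTENSIONS
    ]
```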

At a high level, to use Textual to create LLM-ready content:

  1. If the source files are in a local file system, then upload the files to the pipeline. Textual stores the files in your configured Amazon S3 location, and then automatically processes each new file.

  2. If the source files are in Amazon S3, identify the files to include in the pipeline. You can select individual files or folders. When you select a folder, Textual processes all of the files in it.

  3. If the source files are in Databricks:

    1. Provide the URL and access token to use to connect to Databricks. (A sketch for checking these values appears after this list.)

    2. Identify the location in Databricks where Textual writes the pipeline output.

    3. Identify the files to include in the pipeline.

    4. Run the pipeline.

  4. For each file, Textual:

    1. Converts the content to raw text. For image files, this means extracting any text that is present.

    2. Uses its built-in models to detect entity values in the text.

    3. Generates a Markdown version of the original text.

    4. Produces a JSON file (see the parsing sketch after this list) that contains:

      • The Markdown version of the text

      • The detected entities and their locations
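
If you want to confirm the Databricks URL and access token before you enter them in Textual (step 3), the databricks-sdk Python package can make a lightweight authenticated call. A minimal sketch; the host and token values are placeholders for your own workspace:

```python
from databricks.sdk import WorkspaceClient

# Placeholders: use your workspace URL and a personal access token.
w = WorkspaceClient(
    host="https://<your-workspace>.cloud.databricks.com",
    token="<your-access-token>",
)

# A lightweight authenticated call; raises if the URL or token is invalid.
me = w.current_user.me()
print(f"Connected to Databricks as {me.user_name}")
```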
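
Step 4 produces one JSON file per source file. The exact schema can vary by Textual version, and the field names in this sketch (markdown, entities, and the per-entity keys) are assumptions for illustration; inspect a real output file and adjust accordingly:

```python
import json

# Load one pipeline output file. Field names are assumptions;
# check an actual output file for the exact schema.
with open("example_output.json") as f:
    result = json.load(f)

markdown_text = result["markdown"]  # assumed: Markdown version of the text

for entity in result["entities"]:   # assumed: list of detected entities
    # Assumed per-entity keys: label, text, and character offsets.
    print(entity.get("label"), entity.get("text"),
          entity.get("start"), entity.get("end"))
```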

From Textual, for each processed file, you can view and download the output JSON.

For Amazon S3 and Databricks pipelines, the JSON files are also available from the configured output location.
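
For example, here is a minimal boto3 sketch that downloads the JSON outputs from an S3 output location; the bucket name and prefix are placeholders for your configured values:

```python
import json
import boto3

s3 = boto3.client("s3")

# Placeholders: the bucket and prefix configured as the pipeline's output location.
BUCKET = "my-textual-output-bucket"
PREFIX = "pipeline-output/"

# List and download each JSON output file.
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
    for obj in page.get("Contents", []):
        if not obj["Key"].endswith(".json"):
            continue
        body = s3.get_object(Bucket=BUCKET, Key=obj["Key"])["Body"].read()
        result = json.loads(body)
        print(obj["Key"], len(result.get("entities", [])), "entities")
```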

You can also configure pipelines to create redacted versions of the original values. For more information, go to Datasets workflow for text redaction and synthesis.
