You can use the Textual SDK to parse individual files, either from a local file system or from an S3 bucket. Textual returns a FileParseResult object for each parsed file. The FileParseResult object is a wrapper around the output JSON for the processed file.
To parse a single file from a local file system, use textual.parse_file:
You must use rb access mode to read the file; rb access mode opens the file for reading in binary format.
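For example, here is a minimal sketch of reading a file in rb mode for parsing. The helper name and the client setup are illustrative assumptions, and the parse_file call is shown in a comment; check the SDK reference for the exact signature in your version:

```python
import tempfile

def read_bytes_for_parsing(path):
    # Open the file in "rb" (read binary) mode, as parsing requires.
    with open(path, "rb") as f:
        return f.read()

# Hypothetical usage -- assumes a configured Textual client named `textual`:
#   data = read_bytes_for_parsing("invoice.pdf")
#   result = textual.parse_file(data, "invoice.pdf")

# Quick local demonstration with a temporary file:
with tempfile.NamedTemporaryFile(suffix=".txt", delete=False) as tmp:
    tmp.write(b"hello")
data = read_bytes_for_parsing(tmp.name)
print(type(data).__name__)  # rb mode yields bytes, not str
```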
You can also set a timeout, in seconds, for the parsing. To set the timeout for a single call, pass it as a parameter of parse_file. To set a timeout to use for all parsing, set the environment variable TONIC_TEXTUAL_PARSE_TIMEOUT_IN_SECONDS.
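For example, to set the global timeout, you can set the environment variable before you create the client. The value 300 below is an arbitrary example:

```python
import os

# Apply a 300-second timeout to all parsing calls.
# (300 is an arbitrary example value.)
os.environ["TONIC_TEXTUAL_PARSE_TIMEOUT_IN_SECONDS"] = "300"

# Alternatively, a per-call timeout can be passed as a parameter of
# parse_file; check the SDK reference for the exact parameter name.
print(os.environ["TONIC_TEXTUAL_PARSE_TIMEOUT_IN_SECONDS"])
```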
You can also parse files that are stored in Amazon S3. Because this process uses the boto3 library to fetch the file from Amazon S3, you must first set up the correct AWS credentials.
To parse a file from an S3 bucket, use textual.parse_s3_file:
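Because the file is fetched with boto3, standard AWS credential configuration applies. One option is environment variables, as sketched below. The credential values are placeholders, and the parse_s3_file call is shown in a comment because its exact signature may differ by SDK version:

```python
import os

# Placeholder credentials -- substitute your own, or rely instead on
# ~/.aws/credentials, an AWS profile, or an IAM role.
os.environ["AWS_ACCESS_KEY_ID"] = "AKIA-EXAMPLE"
os.environ["AWS_SECRET_ACCESS_KEY"] = "example-secret"
os.environ["AWS_DEFAULT_REGION"] = "us-east-1"

# Hypothetical usage -- assumes a configured Textual client named `textual`:
#   result = textual.parse_s3_file("my-bucket", "path/to/file.pdf")
print(os.environ["AWS_DEFAULT_REGION"])
```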
Textual uses pipelines to transform file text into a format that can be used in an LLM system.
You can use the Textual SDK to create and manage pipelines and retrieve pipeline run results.
Before you perform these tasks, remember to .
To create a pipeline, use .
The response contains the pipeline object.
To upload a file to a pipeline, use .
The response contains a list of pipeline objects.
The response contains a single pipeline object.
The pipeline identifier is displayed on the pipeline details page. To copy the identifier, click the copy icon.
The response contains the job identifier.
The response contains a list of pipeline run objects.
The response is an enumerator of file parse result objects.
generator_config is a dictionary that specifies whether to redact, synthesize, or do neither for each entity type in the dictionary.
For each entity type, you provide the handling type:
Redaction indicates to replace the value with the value type.
Synthesis indicates to replace the value with a realistic value.
Off indicates to keep the value as is.
generator_default indicates how to process values for entity types that are not included in generator_config.
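For example, a generator_config that synthesizes given names and redacts addresses, with every other entity type left as is, could look like the following. The entity type names here are illustrative; see the entity type reference for the values Textual actually detects:

```python
# Handling options per the docs: "Redaction", "Synthesis", or "Off".
# The entity type keys below are illustrative examples.
generator_config = {
    "NAME_GIVEN": "Synthesis",        # replace with a realistic value
    "LOCATION_ADDRESS": "Redaction",  # replace with the value type
}

# Entity types not listed in generator_config fall back to this default.
generator_default = "Off"  # keep unlisted values as is

print(generator_config["NAME_GIVEN"])
```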
The response contains the list of entities. For each value, the list includes:
Entity type
Where the value starts in the source file
Where the value ends in the source file
The original text of the entity
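A single detected entity can therefore be pictured as a record like the one below. The field names are illustrative assumptions, not the SDK's exact attribute names:

```python
# Illustrative shape of one detected entity (field names are assumptions).
entity = {
    "label": "NAME_GIVEN",  # entity type
    "start": 11,            # where the value starts in the source file
    "end": 16,              # where the value ends in the source file
    "text": "Alice",        # the original text of the entity
}

# The span indices address the source text directly:
source = "My name is Alice."
print(source[entity["start"]:entity["end"]])  # -> Alice
```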
The response contains the Markdown files, with the detected entities processed as specified in generator_config and generator_default.
In the request, you set the maximum number of characters in each chunk.
You can also provide generator_config and generator_default to configure how to present the detected entities in the text chunks.
The response contains the list of text chunks, with the detected entities processed as specified in generator_config and generator_default.
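To illustrate what a character-based maximum chunk size implies, here is a plain-Python sketch of splitting text into chunks of at most N characters. This is not the SDK call, just the underlying idea:

```python
def chunk_text(text, max_chars):
    # Split text into consecutive chunks of at most max_chars characters.
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

chunks = chunk_text("abcdefghij", 4)
print(chunks)  # -> ['abcd', 'efgh', 'ij']
```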
To delete a pipeline, use .
To get the list of pipelines, use .
To use the pipeline identifier to get a single pipeline, use .
To run a pipeline, use .
To get the list of pipeline runs, use .
Once you have the pipeline, to get an enumerator of the files in the pipeline from the most recent pipeline run, use .
To get a list of entities that were detected in a file, use . For example, to get the detected entities for all of the files in a pipeline:
To provide a list of entity types and specify how to process them, use :
For a list of the entity types that Textual detects, go to .
To get the Markdown output of a pipeline file, use . In the request, you can provide generator_config and generator_default to configure how to present the detected entities in the output file.
To split a pipeline file into text chunks that can be imported into an LLM, use .