To process PDF and image files, Tonic Textual uses optical character recognition (OCR). Textual supports the following OCR models:
Azure AI Document Intelligence
Amazon Textract
Tesseract
For the best performance, we recommend that you use either Azure AI Document Intelligence or Amazon Textract.
If you cannot use either of those, for example because you run Textual on-premises and cannot access third-party services, then you can use Tesseract.
To use Azure AI Document Intelligence to process PDF and image files, Textual requires the Azure AI Document Intelligence key and endpoint.
In .env, uncomment and provide values for the following settings:
SOLAR_AZURE_DOC_INTELLIGENCE_KEY=#
SOLAR_AZURE_DOC_INTELLIGENCE_ENDPOINT=#
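Before you restart Textual, it can help to confirm that both settings are uncommented and no longer contain the `#` placeholder. The following is a minimal sketch of such a check, not part of Textual. The function name `missing_azure_settings` is hypothetical, and the sketch assumes a simple `NAME=value` file format without quoting or inline comments.

```python
from typing import Dict, List

# The two setting names come from the .env file described above.
REQUIRED_SETTINGS = (
    "SOLAR_AZURE_DOC_INTELLIGENCE_KEY",
    "SOLAR_AZURE_DOC_INTELLIGENCE_ENDPOINT",
)


def missing_azure_settings(env_text: str) -> List[str]:
    """Return the required settings that are absent, commented out, or still '#'.

    Hypothetical helper: assumes plain NAME=value lines (no quoting rules).
    """
    found: Dict[str, str] = {}
    for raw_line in env_text.splitlines():
        line = raw_line.strip()
        # Skip blank lines, commented-out lines, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, _, value = line.partition("=")
        found[name.strip()] = value.strip()
    # A setting counts as missing if it has no value or only the placeholder.
    return [name for name in REQUIRED_SETTINGS if found.get(name, "") in ("", "#")]
```

For example, passing the contents of a file in which only the endpoint is uncommented and filled in returns a list that contains `SOLAR_AZURE_DOC_INTELLIGENCE_KEY`.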
In values.yaml, uncomment and provide values for the following settings:
azureDocIntelligenceKey:
azureDocIntelligenceEndpoint:
If the Azure-specific environment variables are not configured, then Textual attempts to use Amazon Textract.
To use Amazon Textract, Textual requires access to an IAM role that has sufficient permissions. You must also configure an S3 bucket to store files. The configured S3 bucket is required for uploaded file pipelines, and is also used to store dataset files and individual files that are redacted using the SDK.
We recommend that you use the AmazonTextractFullAccess policy, but you can also choose to use a more restricted policy.
Here is an example policy that provides the minimum required permissions:
After the policy is attached to an IAM user or a role, you must make that user or role accessible to Textual. To do this, do one of the following:
Assign an instance profile
Provide the AWS key, secret, and Region in the following environment variables:
If neither Azure AI Document Intelligence nor Amazon Textract is configured, then Textual uses Tesseract, which is automatically available in your Textual installation.
Tesseract does not require any external access.
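The fallback order described above (Azure AI Document Intelligence first, then Amazon Textract, then Tesseract) can be summarized in a short sketch. This is only an illustration of the documented order, not Textual's actual implementation. The function `select_ocr_model` and the `textract_configured` flag are hypothetical, and the sketch assumes that Azure counts as configured only when both the key and the endpoint have values.

```python
from typing import Mapping

# Setting names taken from the Azure configuration described earlier.
AZURE_KEY = "SOLAR_AZURE_DOC_INTELLIGENCE_KEY"
AZURE_ENDPOINT = "SOLAR_AZURE_DOC_INTELLIGENCE_ENDPOINT"


def select_ocr_model(env: Mapping[str, str], textract_configured: bool) -> str:
    """Return the OCR model name that the documented fallback order implies.

    Hypothetical helper. `textract_configured` stands in for whatever makes
    Amazon Textract reachable (an instance profile or AWS credentials).
    """
    azure_key = env.get(AZURE_KEY, "").strip()
    azure_endpoint = env.get(AZURE_ENDPOINT, "").strip()
    # Assumption: Azure is used only when both the key and endpoint are set.
    if azure_key and azure_endpoint:
        return "Azure AI Document Intelligence"
    # Without the Azure settings, Amazon Textract is tried next.
    if textract_configured:
        return "Amazon Textract"
    # Tesseract ships with the installation and needs no external access.
    return "Tesseract"
```

Calling `select_ocr_model(os.environ, textract_configured=False)` on a host with no Azure settings returns `"Tesseract"`, which matches the on-premises case where third-party services are unavailable.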