To process PDF and image files, Tonic Textual uses optical character recognition (OCR). Textual supports the following OCR models:
Azure AI Document Intelligence
Amazon Textract
Tesseract
For the best performance, we recommend that you use either Azure AI Document Intelligence or Amazon Textract.
If you cannot use either of those, for example because you run Textual on-premises and cannot access third-party services, then you can use Tesseract.
To use Azure AI Document Intelligence to process PDF and image files, Textual requires the Azure AI Document Intelligence key and endpoint.
In .env, uncomment and provide values for the following settings:
SOLAR_AZURE_DOC_INTELLIGENCE_KEY=#
SOLAR_AZURE_DOC_INTELLIGENCE_ENDPOINT=#
In values.yaml, uncomment and provide values for the following settings:
azureDocIntelligenceKey:
azureDocIntelligenceEndpoint:
If the Azure-specific environment variables are not configured, then Textual attempts to use Amazon Textract.
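Before you start Textual, it can help to confirm that both Azure settings are actually uncommented and filled in. The following is a minimal sketch, not part of Textual. The helper names `parse_env` and `missing_azure_settings` are hypothetical, and the sketch assumes a simple `NAME=value` file format in which a value that is empty or starts with `#` counts as not configured.

```python
# Hypothetical helper: checks a .env file's text for the two Azure settings.
# Only the two setting names come from the Textual documentation; the parsing
# rules below are assumptions about a simple NAME=value file.

AZURE_SETTINGS = (
    "SOLAR_AZURE_DOC_INTELLIGENCE_KEY",
    "SOLAR_AZURE_DOC_INTELLIGENCE_ENDPOINT",
)


def parse_env(text):
    """Return {name: value} for the uncommented NAME=value lines in text."""
    settings = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        # Skip blank lines, commented-out lines, and lines without '='.
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, _, value = line.partition("=")
        value = value.strip()
        # Assumption: a value that begins with '#' is a placeholder, not a value.
        if value.startswith("#"):
            value = ""
        settings[name.strip()] = value
    return settings


def missing_azure_settings(text):
    """Return the Azure setting names that are absent or empty in text."""
    settings = parse_env(text)
    return [name for name in AZURE_SETTINGS if not settings.get(name)]


if __name__ == "__main__":
    # Placeholder values for illustration only.
    sample = (
        "SOLAR_AZURE_DOC_INTELLIGENCE_KEY=example-key\n"
        "#SOLAR_AZURE_DOC_INTELLIGENCE_ENDPOINT=https://example.invalid/\n"
    )
    print(missing_azure_settings(sample))
```

If the returned list is not empty, then at least one Azure setting is not configured, and Textual would be expected to move on to Amazon Textract.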
To use Amazon Textract, Textual requires access to an IAM role that has sufficient permissions.
We recommend that you use the AmazonTextractFullAccess policy, but you can also choose to use a more restricted policy.
Here is an example policy that provides the minimum required permissions:
After you attach the policy to an IAM user or role, you must make that user or role available to Textual. To do this, either:
Assign an instance profile
Provide the AWS key, secret, and Region in the following environment variables:
If neither Azure AI Document Intelligence nor Amazon Textract is configured, then Textual uses Tesseract, which is automatically available in your Textual installation.
Tesseract does not require any external access.
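The selection order described above (Azure AI Document Intelligence first, then Amazon Textract, then Tesseract) can be sketched as follows. This is an illustration of the documented order, not Textual's implementation. The function `choose_ocr_engine` and the `textract_available` flag are hypothetical. How Textual actually detects usable AWS credentials is not described here, so the sketch takes that as an input instead of guessing.

```python
# Hypothetical sketch of the documented OCR fallback order.
# Only the two Azure setting names come from the Textual documentation.

from typing import Mapping

AZURE_SETTINGS = (
    "SOLAR_AZURE_DOC_INTELLIGENCE_KEY",
    "SOLAR_AZURE_DOC_INTELLIGENCE_ENDPOINT",
)


def choose_ocr_engine(env: Mapping[str, str], textract_available: bool) -> str:
    """Return 'azure', 'textract', or 'tesseract' based on the configuration.

    env: environment variables, for example os.environ.
    textract_available: assumption standing in for "an instance profile or
        AWS credentials with Textract permissions are accessible".
    """
    # Azure is used only when both the key and the endpoint are non-empty.
    azure_ready = all(env.get(name, "").strip() for name in AZURE_SETTINGS)
    if azure_ready:
        return "azure"
    if textract_available:
        return "textract"
    # Tesseract needs no external access, so it is the final fallback.
    return "tesseract"


if __name__ == "__main__":
    import os

    print(choose_ocr_engine(os.environ, textract_available=False))
```

Taking the environment as a parameter instead of reading `os.environ` inside the function makes it easy to check each branch with a plain dictionary.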