Tonic Textual pipelines can process files from sources such as Amazon S3, Azure Blob Storage, and Databricks Unity Catalog. You can also create pipelines to process files that you upload directly from your browser.
For those uploaded file pipelines, Textual always stores the files in an S3 bucket. On a self-hosted instance, before you add files to an uploaded file pipeline, you must configure the S3 bucket and the associated authentication credentials.
The configured S3 bucket is also used to store dataset files and individual files that you redact with the Textual SDK. If an S3 bucket is not configured, then:
The dataset and individual redacted files are stored in the Textual application database.
You cannot use Amazon Textract for PDF and image processing. If you configured Textual to use Amazon Textract, Textual instead uses Tesseract.
The authentication credentials for the S3 bucket include:
The AWS Region where the S3 bucket is located.
An AWS access key that is associated with an IAM user or role.
The secret key that is associated with the access key.
To provide the authentication credentials, you can either:
Provide the values directly as environment variable values.
Use the instance profile of the compute instance where Textual runs.
For an example IAM role that has the required permissions, go to #file-upload-example-iam-role.
In .env, add the following settings:
SOLAR_INTERNAL_BUCKET_NAME=<S3 bucket name>
AWS_REGION=<AWS Region>
AWS_ACCESS_KEY_ID=<AWS access key>
AWS_SECRET_ACCESS_KEY=<AWS secret key>
If you use the instance profile of the compute instance, then only the bucket name is required.
In values.yaml, within the env: { } sections under both textual_api_server and textual_worker, add the following settings:
SOLAR_INTERNAL_BUCKET_NAME
AWS_REGION
AWS_ACCESS_KEY_ID
AWS_SECRET_ACCESS_KEY
For example, if no other environment variables are defined:
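The example itself appears to have been dropped during extraction. A sketch of what the values.yaml entries might look like, assuming the textual_api_server and textual_worker sections named above; the exact Helm chart structure and the bucket name are placeholders:

```yaml
# Hypothetical values.yaml fragment. Section and key names follow the
# structure described above; verify them against your Helm chart.
textual_api_server:
  env:
    SOLAR_INTERNAL_BUCKET_NAME: "<S3 bucket name>"
    AWS_REGION: "<AWS Region>"
    AWS_ACCESS_KEY_ID: "<AWS access key>"
    AWS_SECRET_ACCESS_KEY: "<AWS secret key>"
textual_worker:
  env:
    SOLAR_INTERNAL_BUCKET_NAME: "<S3 bucket name>"
    AWS_REGION: "<AWS Region>"
    AWS_ACCESS_KEY_ID: "<AWS access key>"
    AWS_SECRET_ACCESS_KEY: "<AWS secret key>"
```

If you use the instance profile of the compute instance, you would set only SOLAR_INTERNAL_BUCKET_NAME in each section.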
If you use the instance profile of the compute instance, then only the bucket name is required.
For Amazon S3 pipelines, you connect to S3 buckets to select and store files.
On self-hosted instances, you also configure an S3 bucket and the credentials to use to store files for:
Uploaded file pipelines. The S3 bucket is required for uploaded file pipelines. The S3 bucket is not used for pipelines that connect to Azure Blob Storage or to Databricks Unity Catalog.
Dataset files. If you do not configure an S3 bucket, then the files are stored in the application database.
Individual files that you send to the SDK for redaction. If you do not configure an S3 bucket, then the files are stored in the application database.
Here are examples of IAM roles that have the required permissions to connect to Amazon S3 to select or store files.
For uploaded file pipelines, datasets, and individual file redactions, the files are stored in a single S3 bucket. For information on how to configure the S3 bucket and the corresponding access credentials, go to Setting the S3 bucket for file uploads and redactions.
The IAM role that is used to connect to the S3 bucket must be able to read files from and write files to it.
Here is an example of an IAM role that has the permissions required to support uploaded file pipelines, datasets, and individual redactions:
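The example policy did not survive extraction. A minimal sketch of a policy that grants read and write access to a single bucket, which is what this use case requires; the bucket name is a placeholder, and the exact set of actions Textual needs may differ:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ListInternalBucket",
      "Effect": "Allow",
      "Action": ["s3:ListBucket", "s3:GetBucketLocation"],
      "Resource": "arn:aws:s3:::<your-textual-bucket>"
    },
    {
      "Sid": "ReadWriteInternalFiles",
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
      "Resource": "arn:aws:s3:::<your-textual-bucket>/*"
    }
  ]
}
```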
The access credentials that you configure for an Amazon S3 pipeline must be able to navigate to and select files and folders from the appropriate S3 buckets. They also need to be able to write output files to the configured output location.
Here is an example of an IAM role that has the permissions required to support Amazon S3 pipelines:
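The example policy did not survive extraction. A sketch of a policy that can list buckets, read source files, and write output files, matching the requirements described above; bucket names are placeholders, and the exact action list may differ from what Textual documents elsewhere:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "NavigateBuckets",
      "Effect": "Allow",
      "Action": ["s3:ListAllMyBuckets", "s3:ListBucket", "s3:GetBucketLocation"],
      "Resource": "*"
    },
    {
      "Sid": "ReadPipelineSourceFiles",
      "Effect": "Allow",
      "Action": ["s3:GetObject"],
      "Resource": "arn:aws:s3:::<source-bucket>/*"
    },
    {
      "Sid": "WritePipelineOutputFiles",
      "Effect": "Allow",
      "Action": ["s3:PutObject"],
      "Resource": "arn:aws:s3:::<output-bucket>/*"
    }
  ]
}
```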
Tonic Textual provides a certificate for HTTPS traffic, but on a self-hosted instance, you can instead use your own certificate. The certificate must use the PFX format and be named solar.pfx.
To use your own certificate, you must:
Add the SOLAR_PFX_PASSWORD environment variable.
Use a volume mount to give the Textual containers access to the certificate file.
You must apply the changes to both the Textual web server and Textual worker containers.
To use your own certificate, you make the following changes to the docker-compose.yml file.
Add the environment variable SOLAR_PFX_PASSWORD, which contains the certificate password.
Place the certificate on the host machine, then share it to the containers as a volume.
You must map the certificate to /certificates on the containers.
Copy the following:
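The snippet to copy was lost during extraction. A sketch of what the docker-compose.yml changes might look like; the service names and host path are placeholders, and only the environment and volumes entries shown here relate to the certificate:

```yaml
# Hypothetical docker-compose.yml fragment. Service names are placeholders;
# apply the same changes to both the web server and worker services.
services:
  textual-web-server:
    environment:
      - SOLAR_PFX_PASSWORD=<certificate password>
    volumes:
      # Host directory that contains solar.pfx, mapped to /certificates
      - /path/on/host/certificates:/certificates:ro
  textual-worker:
    environment:
      - SOLAR_PFX_PASSWORD=<certificate password>
    volumes:
      - /path/on/host/certificates:/certificates:ro
```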
You must add the environment variable SOLAR_PFX_PASSWORD, which contains the certificate password.
You can use any volume type that is allowed within your environment. It must provide at least ReadOnlyMany access.
You map the certificate to /certificates on the containers. Within your web server and worker deployment YAML files, the entry should be similar to the following:
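The example YAML did not survive extraction. A sketch using a secret volume, which is one option that satisfies the read-only requirement; the container, secret, and volume names are placeholders:

```yaml
# Hypothetical deployment fragment; names are placeholders.
spec:
  containers:
    - name: textual-web-server
      env:
        - name: SOLAR_PFX_PASSWORD
          value: "<certificate password>"
      volumeMounts:
        - name: certificates
          mountPath: /certificates
          readOnly: true
  volumes:
    - name: certificates
      secret:
        secretName: solar-certificate   # must contain solar.pfx
```

Apply the same volumeMounts and volumes entries to the worker deployment.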
On a self-hosted instance, you can configure settings that determine whether to use the auxiliary model, and how models are used on GPU.
To improve overall inference, you can configure whether Textual uses the en_core_web_sm auxiliary NER model.
The auxiliary model detects the following types:
EVENT
LANGUAGE
LAW
NRP
NUMERIC_VALUE
PRODUCT
WORK_OF_ART
To configure whether to use the auxiliary model, you use the environment variable TEXTUAL_AUX_MODEL.
The available values are:
en_core_web_sm - This is the default value.
none - Do not use the auxiliary model.
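For example, to disable the auxiliary model on a Docker deployment, you would set the following in .env (a sketch using the values listed above):

```
TEXTUAL_AUX_MODEL=none
```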
When you use a textual-ml-gpu container on accelerated hardware, you can configure:
Whether to use the auxiliary model
Whether to use the date synthesis model
To configure whether to use the auxiliary model for GPU, you use the environment variable TEXTUAL_AUX_MODEL_GPU.
By default, on GPU, Textual does not use the auxiliary model, and TEXTUAL_AUX_MODEL_GPU is false.
To use the auxiliary model on GPU, set TEXTUAL_AUX_MODEL_GPU to true. Textual then uses the auxiliary model that TEXTUAL_AUX_MODEL specifies.
When TEXTUAL_AUX_MODEL_GPU is true and TEXTUAL_MULTI_LINGUAL is true, Textual also loads the multilingual models on GPU.
By default, Textual loads the date synthesis model on GPU. Note that this model requires 600 MB of GPU RAM for each machine learning worker.
To not load the date synthesis model on GPU, set TEXTUAL_DATE_SYNTH_GPU to false.
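For example, to enable the auxiliary model on GPU while skipping the date synthesis model, you might combine the settings above as follows (a sketch):

```
TEXTUAL_AUX_MODEL_GPU=true
TEXTUAL_DATE_SYNTH_GPU=false
```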
Configure environment variables
How to set environment variable values and restart Textual in Docker and Kubernetes.
Set the number of textual-ml workers
Used to enable parallel processing in Textual.
Set a custom certificate
Provide a custom certificate to use for https traffic.
Enable LLM synthesis on the Playground
Allow the Playground to use an LLM to generate replacement values.
Enable PDF and image processing
Set the required configuration based on the OCR option that you want to use.
Enable uploads to uploaded file pipelines
Provide the required access to Amazon S3.
Configure model preferences
Select an auxiliary model and configure model usage for GPU.
On the Playground page, the LLM synthesis option uses a large language model (LLM) to generate synthesized replacement values for the detected entities in the text.
This option requires an OpenAI key.
Before you can use this option on your self-hosted Textual instance, you must provide an OpenAI key as the value of the environment variable SOLAR_OPENAI_KEY.
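For example, in .env (the key value is a placeholder):

```
SOLAR_OPENAI_KEY=<your OpenAI API key>
```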
The TEXTUAL_ML_WORKERS environment variable specifies the number of workers to use within the textual-ml container. The default value is 1.
Having multiple workers allows for parallelization of inferences with NER models.
When you deploy Textual with Kubernetes on GPUs, parallelization allows the textual-ml container to fully utilize the GPU.
We recommend 6 GB of GPU RAM for each worker.
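For example, at the recommended 6 GB of GPU RAM per worker, a GPU with 24 GB of RAM could support four workers:

```
TEXTUAL_ML_WORKERS=4
```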
On a self-hosted instance of Textual, much of the configuration takes the form of environment variables.
After you configure an environment variable, you must restart Textual.
For Docker, add the variable to .env in the format:
SETTING_NAME=value
After you update .env, to restart Textual and complete the update, run:
$ docker-compose down
$ docker-compose pull && docker-compose up -d
For Kubernetes, in values.yaml, add the environment variable to the appropriate env section of the Helm chart. For example:
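The example itself was lost during extraction. A sketch of what the values.yaml entry might look like, assuming an env section under the worker component as described elsewhere in this document; verify the section names against your Helm chart:

```yaml
# Hypothetical values.yaml fragment; TEXTUAL_ML_WORKERS is used
# here only as an example setting.
textual_worker:
  env:
    TEXTUAL_ML_WORKERS: "2"
```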
After you update the YAML file, to restart the service and complete the update, run:
$ helm upgrade <name_of_release> -n <namespace_name> <path-to-helm-chart>
The above Helm upgrade command is always safe to use when you provide specific version numbers. However, if you use the latest tag, it might result in Textual containers that have different versions.
To process PDF and image files, Tonic Textual uses optical character recognition (OCR). Textual supports the following OCR models:
Azure AI Document Intelligence
Amazon Textract
Tesseract
For the best performance, we recommend that you use either Azure AI Document Intelligence or Amazon Textract.
If you cannot use either of those - for example because you run Textual on-premises and cannot access third-party services - then you can use Tesseract.
To use Azure AI Document Intelligence to process PDF and image files, Textual requires the Azure AI Document Intelligence key and endpoint.
In .env, uncomment and provide values for the following settings:
SOLAR_AZURE_DOC_INTELLIGENCE_KEY=#
SOLAR_AZURE_DOC_INTELLIGENCE_ENDPOINT=#
In values.yaml, uncomment and provide values for the following settings:
azureDocIntelligenceKey:
azureDocIntelligenceEndpoint:
If the Azure-specific environment variables are not configured, then Textual attempts to use Amazon Textract.
To use Amazon Textract, Textual requires access to an IAM role that has sufficient permissions. You must also configure an S3 bucket to use to store files. The configured S3 bucket is required for uploaded file pipelines, and is also used to store dataset files and individual files that are redacted using the SDK.
We recommend that you use the AmazonTextractFullAccess policy, but you can also choose to use a more restricted policy.
Here is an example policy that provides the minimum required permissions:
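The example policy did not survive extraction. A sketch of a restricted policy covering the Textract document-analysis actions; the exact set of actions Textual requires may differ, so verify against the official permission list:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "TextractDocumentProcessing",
      "Effect": "Allow",
      "Action": [
        "textract:AnalyzeDocument",
        "textract:DetectDocumentText",
        "textract:StartDocumentAnalysis",
        "textract:GetDocumentAnalysis",
        "textract:StartDocumentTextDetection",
        "textract:GetDocumentTextDetection"
      ],
      "Resource": "*"
    }
  ]
}
```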
After the policy is attached to an IAM user or a role, it must be made accessible to Textual. To do this, either:
Assign an instance profile
Provide the AWS key, secret, and Region in the following environment variables:
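The variable list was lost during extraction; these are the same credential variables described in the S3 bucket configuration earlier in this document:

```
AWS_ACCESS_KEY_ID=<AWS access key>
AWS_SECRET_ACCESS_KEY=<AWS secret key>
AWS_REGION=<AWS Region>
```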
If neither Azure AI Document Intelligence nor Amazon Textract is configured, then Textual uses Tesseract, which is automatically available in your Textual installation.
Tesseract does not require any external access.