1 of 9

Configuring Textual

You can configure your self-hosted instance of Textual to enable Textual features.

How to configure Textual environment variables

On a self-hosted instance of Textual, much of the configuration takes the form of environment variables.

After you configure an environment variable, you must restart Textual.

Docker

For Docker, add the variable to .env in the format:

SETTING_NAME=value

After you update .env, to restart Textual and complete the update, run:

$ docker-compose down

$ docker-compose pull && docker-compose up -d

Kubernetes

For Kubernetes, in values.yaml, add the environment variable to the appropriate env section of the Helm chart. For example:

env: {
  "TEXTUAL_ML_WORKERS": "2"
}

After you update the yaml file, to restart the service and complete the update, run:

$ helm upgrade <name_of_release> -n <namespace_name> <path-to-helm-chart>

The above Helm upgrade command is always safe to use when you provide specific version numbers. However, if you use the latest tag, it might result in Textual containers that have different versions.

Configuring the number of textual-ml workers

The TEXTUAL_ML_WORKERS environment variable specifies the number of workers to use within the textual-ml container. The default value is 1.

Having multiple workers allows for parallelization of inferences with NER models.

When you deploy Textual with Kubernetes on GPUs, parallelization allows the textual-ml container to fully utilize the GPU.

We recommend 6GB of GPU RAM for each worker.

Setting a custom certificate

Tonic Textual provides a certificate for https traffic, but on a self-hosted instance, you can also use a user-provided certificate. The certificate must use the the PFX format and be named solar.pfx.

To use your own certificate, you must:

Add the SOLAR_PFX_PASSWORD environment variable.
Use a volume mount to provide the certificate file. Textual uses volume mounting to give the Textual containers access to the certificate.

You must apply the changes to both the Textual web server and Textual worker containers.

Docker

To use your own certificate, you make the following changes to the docker-compose.yml file.

Environment variable

Add the environment variable SOLAR_PFX_PASSWORD, which contains the certificate password.

Volume mount

Place the certificate on the host machine, then share it to the containers as a volume.

You must map the certificate to /certificates on the containers.

Copy the following:

volumes:
        ...
        - /my-host-path:/certificates

Kubernetes

Environment variable

You must add the environment variable SOLAR_PFX_PASSWORD, which contains the certificate password.

Volume mount

You can use any volume type that is allowed within your environment. It must provide at least ReadOnlyMany access.

You map the certificate to /certificates on the containers. Within your web server and worker deployment YAML files, the entry should be similar to the following:

    volumeMounts:
    - name: <my-volume-name>
      mountPath: /certificates

Enabling LLM synthesis on the Playground page

On the Playground page, the LLM synthesis option uses a large language model (LLM) to generate synthesized replacement values for the detected entities in the text.

This option requires an OpenAI key.

Before you can use this option on your self-hosted Textual instance, you must provide an OpenAI key as the value of the environment variable SOLAR_OPENAI_KEY.

Enabling PDF and image processing

To process PDF and image files, Tonic Textual uses optical character recognition (OCR). Textual supports the following OCR models:

Azure AI Document Intelligence
Amazon Textract
Tesseract

For the best performance, we recommend that you use either Azure AI Document Intelligence or Amazon Textract.

If you cannot use either of those - for example because you run Textual on-premises and cannot access third-party services - then you can use Tesseract.

Azure AI Document Intelligence

To use Azure AI Document Intelligence to process PDF image files, Textual requires the Azure AI Document Intelligence key and endpoint.

Docker

In .env, uncomment and provide values for the following settings:

SOLAR_AZURE_DOC_INTELLIGENCE_KEY=#

SOLAR_AZURE_DOC_INTELLIGENCE_ENDPOINT=#

Kubernetes

In values.yaml, uncomment and provide values for the following settings:

azureDocIntelligenceKey:

azureDocIntelligenceEndpoint:

Amazon Textract

If the Azure-specific environment variables are not configured, then Textual attempts to use Amazon Textract.

To use Amazon Textract, Textual requires access to an IAM role that has sufficient permissions. You must also configure an S3 bucket to use to store files. The configured S3 bucket is required for uploaded file pipelines, and is also used to store dataset files and individual files that are redacted using the SDK.

We recommend that you use the AmazonTextractFullAccess policy, but you can also choose to use a more restricted policy.

Here is an example policy that provides the minimum required permissions:

{
	"Version": "2012-10-17",
	"Statement": [
		{
			"Sid": "VisualEditor0",
			"Effect": "Allow",
			"Action": [
				"textract:StartDocumentAnalysis",
				"textract:AnalyzeDocument",
				"textract:GetDocumentAnalysis"
			],
			"Resource": "*"
		}
	]
}

After the policy is attached to an IAM user or a role, it must be made accessible to Textual. To do this, either:

Assign an instance profile
Provide the AWS key, secret, and Region in the following environment variables:

AWS_ACCESS_KEY_ID
AWS_SECRET_ACCESS_KEY
AWS_DEFAULT_REGION

Tesseract

If neither Azure AI Document Intelligence nor Amazon Textract is configured, then Textual uses Tesseract, which is automatically available in your Textual installation.

Tesseract does not require any external access.

Setting the S3 bucket for file uploads and redactions

Tonic Textual pipelines can process files from sources such as Amazon S3, Azure Blob Storage, and Databricks Unity Catalog. You can also create pipelines to process files that you upload directly from your browser.

For those uploaded file pipelines, Textual always stores the files in an S3 bucket. On a self-hosted instance, before you add files to an uploaded file pipeline, you must configure the S3 bucket and the associated authentication credentials.

The configured S3 bucket is also used to store dataset files and individual files that you use the Textual SDK to redact. If an S3 bucket is not configured, then:

The dataset and individual redacted files are stored in the Textual application database.
You cannot use Amazon Textract for PDF and image processing. If you configured Textual to use Amazon Textract, Textual instead uses Tesseract.

The authentication credentials for the S3 bucket include:

The AWS Region where the S3 bucket is located.
An AWS access key that is associated with an IAM user or role.
The secret key that is associated with the access key.

To provide the authentication credentials, you can either:

Provide the values directly as environment variable values.
Use the instance profile of the compute instance where Textual runs.

For an example IAM role that has the required permissions, go to #file-upload-example-iam-role.

Docker

In .env, add the following settings:

SOLAR_INTERNAL_BUCKET_NAME= <S3 bucket path>

AWS_REGION= <AWS Region>

AWS_ACCESS_KEY_ID= <AWS access key>

AWS_SECRET_ACCESS_KEY= <AWS secret key>

If you use the instance profile of the compute instance, then only the bucket name is required.

Kubernetes

In values.yaml, within env: { } under both textual_api_server and textual_worker, add the following settings:

SOLAR_INTERNAL_BUCKET_NAME

AWS_REGION

AWS_ACCESS_KEY_ID

AWS_SECRET_ACCESS_KEY

For example, if no other environment variables are defined:

  env: {
        "SOLAR_INTERNAL_BUCKET_NAME": "<S3 bucket path>",
        "AWS_REGION": "<AWS Region>",
        "AWS_ACCESS_KEY_ID": "<AWS access key>",
        "AWS_SECRET_ACCESS_KEY": "<AWS secret key>"
       }

If you use the instance profile of the compute instance, then only the bucket name is required.

Required IAM role permissions for Amazon S3

For Amazon S3 pipelines, you connect to S3 buckets to select and store files.

On self-hosted instances, you also configure an S3 bucket and the credentials to use to store files for:

Uploaded file pipelines. The S3 bucket is required for uploaded file pipelines. The S3 bucket is not used for pipelines that connect to Azure Blob Storage or to Databricks Unity Catalog.
Dataset files. If you do not configure an S3 bucket, then the files are stored in the application database.
Individual files that you send to the SDK for redaction. If you do not configure an S3 bucket, then the files are stored in the application database.

Here are examples of IAM roles that have the required permissions to connect to Amazon S3 to select or store files.

Example IAM role for file uploads and redactions

For uploaded file pipelines, datasets, and individual file redactions, the files are stored in a single S3 bucket. For information on how to configure the S3 bucket and the corresponding access credentials, go to Setting the S3 bucket for file uploads and redactions.

The IAM role that is used to connect to the S3 bucket must be able to read files from and write files to it.

Here is an example of an IAM role that has the permissions required to support uploaded file pipelines, datasets, and individual redactions:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:PutObject",
                "s3:DeleteObject",
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::<SOLAR_INTERNAL_BUCKET_NAME>",
                "arn:aws:s3:::<SOLAR_INTERNAL_BUCKET_NAME>/*"
            ]
        }
    ]
}

Example IAM role for Amazon S3 pipelines

The access credentials that you configure for an Amazon S3 pipeline must be able to navigate to and select files and folders from the appropriate S3 buckets. They also need to be able to write output files to the configured output location.

Here is an example of an IAM role that has the permissions required to support Amazon S3 pipelines:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "s3:ListAllMyBuckets",
            "Resource": "arn:aws:s3:::*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket",
                "s3:GetBucketLocation"
            ],
            "Resource": [
                "arn:aws:s3:::*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:PutObject",
                "s3:DeleteObject",
                "s3:ListBucketMultipartUploads",
                "s3:AbortMultipartUpload"
            ],
            "Resource": [
                "arn:aws:s3:::*/*"
            ]
        }
    ]
}

Configuring model preferences

On a self-hosted instance, you can configure settings to determine whether to the auxiliary model, and model use on GPU.

Configuring whether to use an auxiliary model

To improve overall inference, you can configure whether Textual uses the en_core_web_sm auxiliary NER model.

Entity types that the auxiliary model detects

The auxiliary model detects the following types:

EVENT
LANGUAGE
LAW
NRP
NUMERIC_VALUE
PRODUCT
WORK_OF_ART

Indicating whether to use the auxiliary model

To configure whether to use the auxiliary model, you use the environment variable TEXTUAL_AUX_MODEL.

The available values are:

en_core_web_sm - This is the default value.
none - Indicates to not use the auxiliary model.

Configuring model use for GPU

When you use a textual-ml-gpu container on accelerated hardware, you can configure:

Whether to use the auxiliary model,
Whether to use the date synthesis model

Indicating whether to use the auxiliary model for GPU

To configure whether to use the auxiliary model for GPU, you configure the environment variable TEXTUAL_AUX_MODEL_GPU.

By default, on GPU, Textual does not use the auxiliary model, and TEXTUAL_AUX_MODEL_GPU is false.

To use the auxiliary model for GPU, based on the configuration of TEXTUAL_AUX_MODEL, set TEXTUAL_AUX_MODEL_GPU to true.

When TEXTUAL_AUX_MODEL_GPU is true, and TEXTUAL_MULTI_LINGUAL is true, Textual also loads the multilingual models on GPU.

Indicating whether to use the date synthesis model for GPU

By default, on GPU, Textual loads the date synthesis model on GPU.

Note that this model requires 600MB of GPU RAM for each machine learning worker.

To not load the date synthesis model on GPU, set TEXTUAL_DATE_SYNTH_GPU to false.

Configuring model preferences

On a self-hosted instance, you can configure settings to determine whether to the auxiliary model, and model use on GPU.

Configuring whether to use an auxiliary model

To improve overall inference, you can configure whether Textual uses the en_core_web_sm auxiliary NER model.

Entity types that the auxiliary model detects

The auxiliary model detects the following types:

EVENT
LANGUAGE
LAW
NRP
NUMERIC_VALUE
PRODUCT
WORK_OF_ART

Indicating whether to use the auxiliary model

To configure whether to use the auxiliary model, you use the environment variable TEXTUAL_AUX_MODEL.

The available values are:

en_core_web_sm - This is the default value.
none - Indicates to not use the auxiliary model.

Configuring model use for GPU

When you use a textual-ml-gpu container on accelerated hardware, you can configure:

Whether to use the auxiliary model,
Whether to use the date synthesis model

Indicating whether to use the auxiliary model for GPU

To configure whether to use the auxiliary model for GPU, you configure the environment variable TEXTUAL_AUX_MODEL_GPU.

By default, on GPU, Textual does not use the auxiliary model, and TEXTUAL_AUX_MODEL_GPU is false.

To use the auxiliary model for GPU, based on the configuration of TEXTUAL_AUX_MODEL, set TEXTUAL_AUX_MODEL_GPU to true.

When TEXTUAL_AUX_MODEL_GPU is true, and TEXTUAL_MULTI_LINGUAL is true, Textual also loads the multilingual models on GPU.

Indicating whether to use the date synthesis model for GPU

By default, on GPU, Textual loads the date synthesis model on GPU.

Note that this model requires 600MB of GPU RAM for each machine learning worker.

To not load the date synthesis model on GPU, set TEXTUAL_DATE_SYNTH_GPU to false.

Configuring Textual

How to configure Textual environment variables

​​Docker

Kubernetes

Configuring the number of textual-ml workers

Setting a custom certificate

Docker

Environment variable

Volume mount

Kubernetes

Environment variable

Volume mount

Enabling LLM synthesis on the Playground page

Enabling PDF and image processing

Azure AI Document Intelligence

Docker

Kubernetes

Amazon Textract

Tesseract

Setting the S3 bucket for file uploads and redactions

Docker

Kubernetes

Required IAM role permissions for Amazon S3

Example IAM role for file uploads and redactions

Example IAM role for Amazon S3 pipelines

Configuring model preferences

Configuring whether to use an auxiliary model

Entity types that the auxiliary model detects

Indicating whether to use the auxiliary model

Configuring model use for GPU

Indicating whether to use the auxiliary model for GPU

Indicating whether to use the date synthesis model for GPU

Setting the S3 bucket for file uploads and redactions

Docker

Kubernetes

Required IAM role permissions for Amazon S3

Example IAM role for file uploads and redactions

Example IAM role for Amazon S3 pipelines

Setting a custom certificate

Docker

Environment variable

Volume mount

Kubernetes

Environment variable

Volume mount

Configuring model preferences

Configuring whether to use an auxiliary model

Entity types that the auxiliary model detects

Indicating whether to use the auxiliary model

Configuring model use for GPU

Indicating whether to use the auxiliary model for GPU

Indicating whether to use the date synthesis model for GPU

Configuring Textual

Enabling LLM synthesis on the Playground page

Configuring the number of textual-ml workers

How to configure Textual environment variables

​​Docker

Kubernetes

Enabling PDF and image processing

Azure AI Document Intelligence

Docker

Kubernetes

Amazon Textract

Tesseract

Docker

Docker