For a Databricks pipeline, the settings include:
Databricks credentials
Output location
Whether to also generate redacted versions of the original files
Selected files and folders
When you create a pipeline that uses files from Databricks, you are prompted to provide the credentials to use to connect to Databricks.
From the Pipeline Settings page, to change the credentials:
Click Update Databricks Credentials.
Provide the new credentials:
In the Databricks URL field, provide the URL to the Databricks workspace.
In the Access Token field, provide the access token to use to access the volume.
To test the connection, click Test Databricks Connection.
To save the new credentials, click Update Databricks Credentials.
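Before you enter a new URL and token pair, you can sanity-check it outside of Textual. The following is a minimal sketch that calls the Databricks REST API's current-user endpoint; the workspace URL and token values are placeholders, and this is an independent check, not what the Test Databricks Connection button necessarily does internally.

```python
import requests

# Placeholders: substitute your own workspace URL and personal access token.
DATABRICKS_URL = "https://adb-1234567890123456.7.azuredatabricks.net"
ACCESS_TOKEN = "dapi-..."

# Ask the workspace who the token belongs to; a 200 response means the
# URL and token pair is valid.
response = requests.get(
    f"{DATABRICKS_URL}/api/2.0/preview/scim/v2/Me",
    headers={"Authorization": f"Bearer {ACCESS_TOKEN}"},
    timeout=10,
)
response.raise_for_status()
print(response.json().get("userName"))
```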
On the Pipeline Settings page, under Select Output Location, navigate to and select the folder in Databricks where Textual writes the output files.
When you run a pipeline, Textual creates a folder in the output location. The folder name is the pipeline job identifier.
Within the job folder, Textual recreates the folder structure for the original files. It then creates the JSON output for each file. The name of the JSON file is <original filename>_<original extension>_parsed.json.
If the pipeline is also configured to generate redacted versions of the files, then Textual writes the redacted version of each file to the same location.
For example, for the original file Transaction1.txt, the output for a pipeline run contains:
Transaction1_txt_parsed.json
Transaction1.txt
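A minimal sketch of this naming rule, useful for predicting the output name for a given input file. The helper function name is ours, not part of any Textual SDK:

```python
from pathlib import PurePosixPath

def parsed_output_name(original: str) -> str:
    """Map an input file name to its pipeline JSON output name:
    <original filename>_<original extension>_parsed.json"""
    path = PurePosixPath(original)
    return f"{path.stem}_{path.suffix.lstrip('.')}_parsed.json"

assert parsed_output_name("Transaction1.txt") == "Transaction1_txt_parsed.json"
```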
By default, when you run a Databricks pipeline, Textual only generates the JSON output.
To also generate versions of the original files that redact or synthesize the detected entity values, toggle Synthesize Files to the on position.
For information on how to configure the file generation, go to Configuring file synthesis for a pipeline.
One option for selected folders is to filter the processed files based on the file extension. For example, in a selected folder, you might only want to process .txt and .csv files.
Under File Processing Settings, select the file extensions to include. To add a file type, select it from the dropdown list. To remove a file type, click its delete icon.
Note that this filter does not apply to individually selected files. Textual always processes those files regardless of file type.
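These rules can be summarized in a short sketch. The function and parameter names here are hypothetical, chosen only to illustrate the documented behavior:

```python
def is_processed(extension: str,
                 individually_selected: bool,
                 filter_extensions: set[str] | None) -> bool:
    """Sketch of the documented filter rules.

    - Individually selected files are always processed, regardless of the filter.
    - Files under selected folders must pass the extension filter, if one is set.
    """
    if individually_selected:
        return True
    return filter_extensions is None or extension.lower() in filter_extensions

# In a selected folder, only process .txt and .csv files:
assert is_processed(".txt", False, {".txt", ".csv"})
assert not is_processed(".xlsx", False, {".txt", ".csv"})
assert is_processed(".xlsx", True, {".txt", ".csv"})  # individually selected
```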
Under Select files and folders to add to run, navigate to and select the folders and individual files to process.
To add a folder or file to the pipeline, check its checkbox.
When you check a folder checkbox, Textual adds the folder to the Prefix Patterns list. Textual processes every applicable file in the folder: a file is applicable if its type is one that Textual supports and, when a file type filter is configured, its extension is included in the filter.
To display the folder contents, click the folder name.
When you select an individual file, Textual adds it to the Selected Files list.
To delete a file or folder, either:
In the navigation pane, uncheck the checkbox.
In the Prefix Patterns or Selected Files list, click its delete icon.
To create a pipeline, either:
On the Pipelines page, click Create a New Pipeline.
On the Home page, click Create, then click Pipeline.
On the Create A New Pipeline panel:
In the Name field, type the name of the pipeline.
Under Files Source, select the location of the source files.
To upload files from a local file system, click File upload, then click Save.
To select files from and write output to Amazon S3, click Amazon S3.
To select files from and write output to Databricks, click Databricks.
To select files from and write output to Azure Blob Storage, click Azure.
If you selected Amazon S3, provide the credentials to use to connect to Amazon S3:
In the Access Key field, provide an AWS access key that is associated with an IAM user or role. For an example IAM role that has the required permissions for an Amazon S3 pipeline, go to the Amazon S3 example IAM role.
In the Access Secret field, provide the secret key that is associated with the access key.
From the Region dropdown list, select the AWS Region to send the authentication request to.
In the Session Token field, provide the session token to use for the authentication request.
To test the credentials, click Test AWS Connection.
Click Save.
On the Pipeline Settings page, provide the rest of the pipeline configuration. For more information, go to Configure an Amazon S3 pipeline.
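If you want to verify an AWS key set independently before entering it, here is a minimal boto3 sketch. All credential values are placeholders, and this is a separate check of your own, not what the Test AWS Connection button necessarily runs:

```python
import boto3

# Placeholders for the values entered in the pipeline credential fields.
session = boto3.Session(
    aws_access_key_id="AKIA...",
    aws_secret_access_key="...",
    aws_session_token="...",  # omit if you are not using temporary credentials
    region_name="us-east-1",
)

# Confirm that the credentials resolve to an identity.
print(session.client("sts").get_caller_identity()["Arn"])
```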
If you selected Databricks, provide the connection information:
In the Databricks URL field, provide the URL to the Databricks workspace.
In the Access Token field, provide the access token to use to access the volume.
To test the connection, click Test Databricks Connection.
Click Save.
On the Pipeline Settings page, provide the rest of the pipeline configuration. For more information, go to Configure a Databricks pipeline.
If you selected Azure, provide the connection information:
In the Account Name field, provide the name of your Azure account.
In the Account Key field, provide the access key for your Azure account.
To test the connection, click Test Azure Connection.
Click Save.
On the Pipeline Settings page, provide the rest of the pipeline configuration. For more information, go to Configure an Azure pipeline.
To update a pipeline configuration:
Either:
On the Pipelines page, click the pipeline options menu, then click Settings.
On the pipeline details page, click the settings icon. For cloud storage pipelines, the settings icon is next to the Run Pipeline option. For uploaded file pipelines, the settings icon is next to the Upload Files option.
On the Pipeline Settings page, update the configuration. For all pipelines, you can change the pipeline name and whether to also create redacted versions of the original files. For cloud storage pipelines, you can also change the file selection. For more information, go to Configure an Amazon S3 pipeline, Configure a Databricks pipeline, or Configure an Azure pipeline. For uploaded file pipelines, you do not manage files from the Pipeline Settings page. For information about uploading files, go to Upload files to a pipeline.
Click Save.
To delete a pipeline, on the Pipeline Settings page, click Delete Pipeline.
Textual pipelines can process the following types of files:
txt
csv
tsv
docx
xlsx
png
tif or tiff
jpg or jpeg
eml
msg
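For reference in scripts, the list above as a Python set with a simple extension check. The variable and function names are ours, not part of any Textual SDK:

```python
import os

# File extensions that Textual pipelines can process.
TEXTUAL_SUPPORTED_EXTENSIONS = {
    ".txt", ".csv", ".tsv", ".docx", ".xlsx",
    ".png", ".tif", ".tiff", ".jpg", ".jpeg",
    ".eml", ".msg",
}

def is_supported(filename: str) -> bool:
    """Check whether a file has a pipeline-supported extension."""
    return os.path.splitext(filename)[1].lower() in TEXTUAL_SUPPORTED_EXTENSIONS

assert is_supported("Transaction1.txt")
assert not is_supported("report.pdf")
```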
For an Amazon S3 pipeline, the settings include:
AWS credentials
Output location
Whether to also generate redacted versions of the original files
Selected files and folders
When you create a pipeline that uses files from Amazon S3, you are prompted to provide the credentials to use to connect to Amazon S3.
From the Pipeline Settings page, to change the credentials:
Click Update AWS Credentials.
Provide the new credentials:
In the Access Key field, provide an AWS access key that is associated with an IAM user or role. For an example IAM role that has the required permissions for an Amazon S3 pipeline, go to the Amazon S3 example IAM role.
In the Access Secret field, provide the secret key that is associated with the access key.
From the Region dropdown list, select the AWS Region to send the authentication request to.
In the Session Token field, provide the session token to use for the authentication request.
To test the connection, click Test AWS Connection.
To save the new credentials, click Update AWS Credentials.
On the Pipeline Settings page, under Select Output Location, navigate to and select the folder in Amazon S3 where Textual writes the output files.
When you run a pipeline, Textual creates a folder in the output location. The folder name is the pipeline job identifier.
Within the job folder, Textual recreates the folder structure for the original files. It then creates the JSON output for each file. The name of the JSON file is <original filename>_<original extension>_parsed.json.
If the pipeline is also configured to generate redacted versions of the files, then Textual writes the redacted version of each file to the same location.
For example, for the original file Transaction1.txt, the output for a pipeline run contains:
Transaction1_txt_parsed.json
Transaction1.txt
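To inspect what a run wrote, you can list the job folder in Amazon S3 yourself. A minimal boto3 sketch; the bucket name and prefix are placeholders for your configured output location and the run's job identifier:

```python
import boto3

BUCKET = "my-output-bucket"               # placeholder: the selected output location
JOB_PREFIX = "pipeline-output/<job id>/"  # placeholder: folder named for the job identifier

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

# List everything the run wrote under the job folder, e.g.
# .../Transaction1_txt_parsed.json and, if enabled, the redacted Transaction1.txt.
for page in paginator.paginate(Bucket=BUCKET, Prefix=JOB_PREFIX):
    for obj in page.get("Contents", []):
        print(obj["Key"])
```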
By default, when you run an Amazon S3 pipeline, Textual only generates the JSON output.
To also generate versions of the original files that redact or synthesize the detected entity values, toggle Synthesize Files to the on position.
For information on how to configure the file generation, go to Configuring file synthesis for a pipeline.
One option for selected folders is to filter the processed files based on the file extension. For example, in a selected folder, you might only want to process .txt and .csv files.
Under File Processing Settings, select the file extensions to include. To add a file type, select it from the dropdown list. To remove a file type, click its delete icon.
Note that this filter does not apply to individually selected files. Textual always processes those files regardless of file type.
Under Select files and folders to add to run, navigate to and select the folders and individual files to process.
To add a folder or file to the pipeline, check its checkbox.
When you check a folder checkbox, Textual adds the folder to the Prefix Patterns list. Textual processes every applicable file in the folder: a file is applicable if its type is one that Textual supports and, when a file type filter is configured, its extension is included in the filter.
To display the folder contents, click the folder name.
When you select an individual file, Textual adds it to the Selected Files list.
To delete a file or folder, either:
In the navigation pane, uncheck the checkbox.
In the Prefix Patterns or Selected Files list, click its delete icon.
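Conceptually, the two lists determine the run's file set something like the sketch below. The function and parameter names are hypothetical, chosen only to illustrate the documented behavior:

```python
def file_in_run(key: str, prefix_patterns: list[str], selected_files: set[str]) -> bool:
    """Sketch of how the two selection lists combine.

    Individually selected files are always included; folder selections
    match every object key under the folder prefix.
    """
    if key in selected_files:
        return True
    return any(key.startswith(prefix) for prefix in prefix_patterns)

assert file_in_run("invoices/2024/a.txt", ["invoices/"], set())
assert file_in_run("notes/todo.txt", [], {"notes/todo.txt"})
```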
For an Azure pipeline, the settings include:
Azure credentials
Output location
Whether to also generate redacted versions of the original files
Selected files and folders
When you create a pipeline that uses files from Azure, you are prompted to provide the credentials to use to connect to Azure.
From the Pipeline Settings page, to change the credentials:
Click Update Azure Credentials.
Provide the new credentials:
In the Account Name field, provide the name of your Azure account.
In the Account Key field, provide the access key for your Azure account.
To test the connection, click Test Azure Connection.
To save the new credentials, click Update Azure Credentials.
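Before you enter a new account name and key, you can sanity-check the pair outside of Textual. A minimal sketch using the azure-storage-blob package; the account values are placeholders, and this is an independent check of your own, not what the Test Azure Connection button necessarily runs:

```python
from azure.storage.blob import BlobServiceClient

ACCOUNT_NAME = "mystorageaccount"  # placeholder: the Account Name field value
ACCOUNT_KEY = "..."                # placeholder: the Account Key field value

service = BlobServiceClient(
    account_url=f"https://{ACCOUNT_NAME}.blob.core.windows.net",
    credential=ACCOUNT_KEY,
)

# A simple authenticated call that fails fast if the name or key is wrong.
print(service.get_account_information())
```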
On the Pipeline Settings page, under Select Output Location, navigate to and select the folder in Azure where Textual writes the output files.
When you run a pipeline, Textual creates a folder in the output location. The folder name is the pipeline job identifier.
Within the job folder, Textual recreates the folder structure for the original files. It then creates the JSON output for each file. The name of the JSON file is <original filename>_<original extension>_parsed.json.
If the pipeline is also configured to generate redacted versions of the files, then Textual writes the redacted version of each file to the same location.
For example, for the original file Transaction1.txt, the output for a pipeline run contains:
Transaction1_txt_parsed.json
Transaction1.txt
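To work with a run's output, you can download the JSON for a processed file directly from Azure Blob Storage. A minimal sketch using azure-storage-blob; the account, container, and blob path values are placeholders:

```python
import json
from azure.storage.blob import BlobServiceClient

# Placeholders: connection values and the output path for one run.
service = BlobServiceClient(
    account_url="https://mystorageaccount.blob.core.windows.net",
    credential="...",
)
blob = service.get_blob_client(
    container="textual-output",                    # placeholder container
    blob="<job id>/Transaction1_txt_parsed.json",  # job folder + output file name
)

# Download and parse the JSON output for one processed file.
parsed = json.loads(blob.download_blob().readall())
print(type(parsed))
```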
By default, when you run an Azure pipeline, Textual only generates the JSON output.
To also generate versions of the original files that redact or synthesize the detected entity values, toggle Synthesize Files to the on position.
For information on how to configure the file generation, go to Configuring file synthesis for a pipeline.
One option for selected folders is to filter the processed files based on the file extension. For example, in a selected folder, you might only want to process .txt and .csv files.
Under File Processing Settings, select the file extensions to include. To add a file type, select it from the dropdown list. To remove a file type, click its delete icon.
Note that this filter does not apply to individually selected files. Textual always processes those files regardless of file type.
Under Select files and folders to add to run, navigate to and select the folders and individual files to process.
To add a folder or file to the pipeline, check its checkbox.
When you check a folder checkbox, Textual adds the folder to the Prefix Patterns list. Textual processes every applicable file in the folder: a file is applicable if its type is one that Textual supports and, when a file type filter is configured, its extension is included in the filter.
To display the folder contents, click the folder name.
When you select an individual file, Textual adds it to the Selected Files list.
To delete a file or folder, either:
In the navigation pane, uncheck the checkbox.
In the Prefix Patterns or Selected Files list, click its delete icon.
On a self-hosted instance, before you can upload files to a pipeline, you must configure the S3 bucket where Tonic Textual stores the files. For more information, go to the instructions for configuring the file upload S3 bucket.
For an example IAM role that has the required permissions for file upload pipelines, go to the example IAM role for file upload pipelines.
On the pipeline details page for an uploaded file pipeline, to add files to the pipeline:
Click Upload Files.
Search for and select the files to upload.
To remove a file, on the pipeline details page, click the delete icon for the file.
By default, Textual only generates the JSON output for the pipeline files.
To also generate versions of the original files that redact or synthesize the detected entity values, on the Pipeline Settings page, toggle Synthesize Files to the on position.
For information on how to configure the file generation, go to Configuring file synthesis for a pipeline.
Configure a Databricks pipeline
Select the pipeline files and configure whether to redact files.
Create and edit pipelines
Create, configure, and delete a pipeline.
Supported file types
Types of files that a pipeline can process.
Configure an Amazon S3 pipeline
Select the pipeline files and configure whether to redact files.
Configure an Azure pipeline
Select the pipeline files and configure whether to redact files.
Upload files to a pipeline
Select the pipeline files to process for an uploaded file pipeline.
Configure file synthesis
For pipelines that also generate synthesized files, configure how to transform detected entities.
When you choose to also generate synthesized versions of the pipeline files, the pipeline details page includes a Generator Config tab. From the Generator Config tab, you configure how to transform the detected entities in each file.
The Generator Config tab lists all of the available entity types.
For each entity type, you select and configure the handling type. For more information, see Selecting the handling option for the entity types and Configuring synthesis options.
After you change the configuration, click Save Changes. The updated configuration is applied the next time you run the pipeline, and only to new files.
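To make the idea concrete, here is a hypothetical sketch of such a mapping. The entity type names and handling option values are illustrative only, not the exact strings that Textual uses:

```python
# Hypothetical illustration of the Generator Config idea: each detected
# entity type is mapped to a handling option.
generator_config = {
    "NAME_GIVEN": "Synthesis",     # replace detected values with realistic fakes
    "EMAIL_ADDRESS": "Redaction",  # replace detected values with placeholders
    "DATE_TIME": "Off",            # leave detected values unchanged
}

for entity_type, handling in generator_config.items():
    print(f"{entity_type}: {handling}")
```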