Creating a dataset

Required global permission: Create datasets

When you create a dataset, you specify its name, source type, and output type.

Setting the name, source type, and output type

To create a dataset:

1. In the Dataset Name field, provide a name for the dataset.
2. Under Output Format, select the type of output to generate.
3. Under File Source, select the source type. If the source type is a cloud storage option, provide the required credentials.
4. Click Save.
5. For cloud storage datasets:
   1. Textual prompts you to configure the initial file selection. For more information, go to Selecting cloud storage files.
   2. After you select the files, Textual prompts you to select an output location. For more information, go to Changing cloud storage credentials and output location.

On self-hosted instances, we are deprecating the options to provide credentials on the dataset panel and to read credentials from environment variables.

Instead, the credentials must be included in the configuration of an IAM role that has the correct permissions.
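
When you move a self-hosted instance to IAM role credentials, it can help to check whether static credentials are still set in the environment. The following is a minimal sketch, not part of Textual. It assumes the standard AWS SDK variable names; the variables that Textual itself reads might differ, and the function name is hypothetical.

```python
import os
from typing import Mapping

# Standard AWS SDK variable names. This is an assumption: the variables that
# Textual reads on a self-hosted instance might be named differently.
STATIC_CREDENTIAL_VARS = (
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "AWS_SESSION_TOKEN",
)


def static_credentials_in_env(env: Mapping[str, str]) -> list[str]:
    """Return the names of the static credential variables that are set and non-empty."""
    return [name for name in STATIC_CREDENTIAL_VARS if env.get(name)]


if __name__ == "__main__":
    found = static_credentials_in_env(os.environ)
    if found:
        print("Static credentials still set:", ", ".join(found))
    else:
        print("No static AWS credentials found in the environment.")
```

Run the script in the same environment as the self-hosted instance. An empty result suggests that no static credentials are configured through these variables.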

If the source type is Amazon S3, provide the credentials to use to connect to Amazon S3.

1. For a self-hosted instance, select the location of the credentials. You can either provide credentials manually, or use credentials that are configured in environment variables. Note that after you save the dataset, you cannot change the selection.
2. If you are not using environment variables, then in the Access Key field, provide an AWS access key that is associated with an IAM user or role. For an example of a role that has the required permissions for an Amazon S3 dataset, go to Required IAM role permissions for Amazon S3.
3. In the Access Secret field, provide the secret key that is associated with the access key.
4. From the Region dropdown list, select the AWS Region to send the authentication request to.
5. In the Session Token field, provide the session token to use for the authentication request.
6. To test the credentials, click Test AWS Connection.
7. By default, connections to Amazon S3 use Amazon S3 encryption. To instead use AWS KMS encryption:
   1. Click Show Advanced Options.
   2. From the Server-Side Encryption Type dropdown list, select AWS KMS.
   3. In the Server-side Encryption AWS KMS ID field, provide the KMS key ID. Note that if the KMS key doesn't exist in the same account that issues the command, you must provide the full key ARN instead of the key ID.

   Note that after you save the new dataset, you cannot change the encryption type.
8. Click Save. Textual prompts you to select the dataset files.
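
The choice between a KMS key ID and a full key ARN can be sketched as follows. This is an illustration only, not how Textual handles the value. It relies on the key ARN layout that AWS documents (arn:<partition>:kms:<region>:<account-id>:key/<key-id>). The function names and the sample account numbers are hypothetical, and the constructed ARN assumes the standard aws partition.

```python
import re

# Key ARN layout documented by AWS:
#   arn:<partition>:kms:<region>:<account-id>:key/<key-id>
KEY_ARN_PATTERN = re.compile(
    r"^arn:(aws|aws-cn|aws-us-gov):kms:[a-z0-9-]+:\d{12}:key/[A-Za-z0-9-]+$"
)


def is_key_arn(value: str) -> bool:
    """Return True if the value looks like a full KMS key ARN."""
    return KEY_ARN_PATTERN.match(value) is not None


def kms_identifier(key_id: str, key_account: str, caller_account: str, region: str) -> str:
    """Pick the value to enter in the KMS ID field (hypothetical helper).

    A bare key ID is enough when the key belongs to the calling account.
    Otherwise, build the full key ARN. Assumes the standard "aws" partition.
    """
    if is_key_arn(key_id):
        return key_id  # Already a full ARN, use it unchanged.
    if key_account == caller_account:
        return key_id
    return f"arn:aws:kms:{region}:{key_account}:key/{key_id}"
```

For example, calling kms_identifier with a key that belongs to account 111122223333 from account 444455556666 returns an ARN that begins with arn:aws:kms:, followed by the Region, the key's account, and key/ plus the key ID.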

If the source type is Azure, provide the connection information:

If the source type is SharePoint, provide the credentials for the Entra ID application.

The credentials must have the following application permissions (not delegated permissions):

To provide the credentials:

1. In the Tenant ID field, provide the SharePoint tenant identifier for the SharePoint site.
2. In the Client ID field, provide the client identifier for the SharePoint site.
3. In the Client Secret field, provide the secret to use to connect to the SharePoint site.
4. To test the connection, click Test SharePoint Connection.
5. Click Save. Textual prompts you to select the dataset files.
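
To check the tenant ID, client ID, and client secret outside of Textual, one option is to request a token through the OAuth 2.0 client credentials flow of the Microsoft identity platform, which is the flow that application permissions use. The following sketch only builds the request. It assumes that the application targets Microsoft Graph, so the scope value is an assumption, and the function name is hypothetical. It does not show how Textual itself tests the connection.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_token_request(
    tenant_id: str,
    client_id: str,
    client_secret: str,
    scope: str = "https://graph.microsoft.com/.default",  # Assumed resource.
) -> Request:
    """Build (but do not send) a client credentials token request."""
    # v2.0 token endpoint of the Microsoft identity platform.
    url = f"https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0/token"
    body = urlencode(
        {
            "grant_type": "client_credentials",
            "client_id": client_id,
            "client_secret": client_secret,
            "scope": scope,
        }
    ).encode("ascii")
    return Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )
```

To send the request, pass the returned object to urllib.request.urlopen. A response that contains an access_token field indicates that the three values are accepted. It does not confirm that the required application permissions are granted.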

