# Creating a dataset

{% hint style="info" %}
**Required global permission:** Create datasets
{% endhint %}

When you create a dataset, you specify:

* The type of output to produce.
* The source location for the files.
* If the files are in cloud storage, the connection credentials.

## Setting the name, source type, and output type

To create a dataset:

1. On the **Datasets** page, click **Create a Dataset**.

<figure><img src="/files/7mnYOx2qJ5jXGvdlSPYx" alt="Dataset creation panel"><figcaption><p>Dataset creation panel</p></figcaption></figure>

2. In the **Dataset Name** field, provide a name for the dataset.
3. Under **Output Format**, select the type of output to generate.
4. Under **File Source**, select the source type.
5. If the source type is a cloud storage option, then provide the required credentials.
   * [Amazon S3](#providing-credentials-for-amazon-s3)
   * [Azure](#dataset-new-azure-credentials)
   * [SharePoint](#dataset-new-sharepoint-credentials)
6. To import settings from another dataset:
   1. Expand the **Import settings from another dataset** section.
   2. From the **Dataset** dropdown list, select the dataset to import settings from. To import settings from a dataset, you must have the **View dataset settings** permission for that dataset.
   3. Under **Settings to import**, select the type of settings to import. Project settings include the tags, .docx settings, and PDF settings. Entity settings include the entity type handling configuration and enabled custom entity types.

<figure><img src="/files/o9H8COnc25GubgDfk01p" alt=""><figcaption><p>Import settings section on the dataset creation panel</p></figcaption></figure>

7. Click **Save**.
8. For cloud storage datasets:
   1. Textual prompts you to configure the initial file selection. For more information, go to [Selecting cloud storage files](/textual/dataset-files/files-cloud-storage.md).
   2. After you select the files, Textual prompts you to select an output location. For more information, go to [Changing cloud storage credentials and output location](/textual/datasets-create-manage/changing-cloud-storage-credentials-and-output-location.md).

## Providing credentials for Amazon S3

If the source type is Amazon S3, provide the credentials to use to connect to Amazon S3.

After you provide the credentials, click **Save**. Textual prompts you to [select the dataset files](/textual/dataset-files/files-cloud-storage.md).

### Selecting the type of credentials to use

Under **AWS Credentials Location**, click the type of credentials to use. The options are:

* **User credentials -** Indicates to use the provided AWS user credentials.
* **Assume role -** Only available on self-hosted instances. Indicates to use the specified assumed role.
* **Environment -** Only available on self-hosted instances.\
  \
  Indicates to use either:
  * The credentials for the IAM role on the host machine.
  * The credentials set in the following environment settings:
    * `AWS_DEFAULT_REGION` - AWS Region
    * `AWS_ACCESS_KEY_ID` - AWS access key
    * `AWS_SECRET_ACCESS_KEY` - AWS secret key
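
For the **Environment** option, the following is a minimal Python sketch of a pre-flight check that the three environment settings listed above are present on the host. The helper name `missing_aws_settings` is hypothetical and is not part of Textual. Missing settings are not necessarily an error, because the host machine might rely on its IAM role instead.

```python
import os

# Environment settings named in the list above.
REQUIRED_AWS_SETTINGS = (
    "AWS_DEFAULT_REGION",
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
)


def missing_aws_settings(env=None):
    """Return the names of the required settings that are unset or empty.

    Hypothetical helper for illustration; it only inspects the environment
    and does not contact AWS or Textual.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_AWS_SETTINGS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_aws_settings()
    if missing:
        print("Missing environment settings:", ", ".join(missing))
    else:
        print("All three AWS environment settings are present.")
```

The check only confirms that values exist. It does not confirm that the credentials are valid.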

### Providing AWS user credentials

When you select **User** as the credentials location:

<figure><img src="/files/RRy8JuEzS7gIELVWWd4W" alt=""><figcaption><p>Credentials fields for an Amazon S3 dataset</p></figcaption></figure>

1. In the **Access Key** field, provide an AWS access key that is associated with an IAM user or role.\
   \
   For an example of a role that has the required permissions for an Amazon S3 dataset, go to [Required IAM role permissions for Amazon S3](/textual/textual-install-administer/configuring-textual/enable-and-configure-textual-features/pipelines-example-iam-roles.md).
2. In the **Secret Access Key** field, provide the secret key that is associated with the access key.
3. From the **Region** dropdown list, select the AWS Region to send the authentication request to.
4. In the **Session Token** field, provide the session token to use for the authentication request.
5. To test the credentials, click **Test AWS Connection**.

### Providing an assumed role

When you select **Assume Role** as the credentials location, whether you complete the configuration here or in the dataset **Project Settings** depends on how you determine the uniqueness of the external ID.

<figure><img src="/files/8KZLUJh4G7MLbOKCPF6M" alt=""><figcaption><p>Fields to configure an assumed role for an Amazon S3 dataset</p></figcaption></figure>

#### Making the external ID unique for the organization

1. Under **External ID Uniqueness**, to make the external ID unique for your organization, click **Organization**.
2. In the **Role ARN** field, provide the Amazon Resource Name (ARN) for the role.
3. In the **Session Name** field, provide the role session name.\
   \
   If you do not provide a session name, then Textual automatically generates a default unique value.
4. In the **Duration (in seconds)** field, provide the maximum length in seconds of the session. \
   \
   The default is `3600`, indicating that the session can be active for up to 1 hour.\
   \
   The provided value must be less than the maximum session duration that is allowed for the role.
5. From the **AWS Region** dropdown list, select the AWS Region where the S3 bucket that the dataset reads from is located.
6. To test the connection, click **Test AWS Connection**.

#### Making the external ID unique for the dataset

Under **External ID Uniqueness**, to make the external ID unique for the dataset, click **Dataset**.

Because Textual cannot generate the external ID before the dataset is created, Textual hides the remaining configuration fields for the assumed role (role ARN, session name, duration, and AWS Region).

When you save the dataset, Textual automatically routes you to **Project Settings** to provide those values.

#### Requirements for the role trust policy

Your role's trust policy must include a condition on the external ID, so that the role can only be assumed when the correct external ID is provided.

To view an example trust policy in Textual, click **View example policy**. The following is an example trust policy:

```json
{
  "Version": "2012-10-17",
  "Statement": {
    "Effect": "Allow",
    "Principal": {
      "AWS": "<originating-account-id>"
    },
    "Action": "sts:AssumeRole",
    "Condition": {
      "StringEquals": {
        "sts:ExternalId": "<external-id>"
      }
    }
  }
}
```
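
If you generate the policy document programmatically, the following Python sketch builds the same structure as the example above. The function name `build_trust_policy` is hypothetical, and the account ID and external ID are left as the same placeholders used in the example. Substitute the values for your own setup.

```python
import json


def build_trust_policy(originating_account_id, external_id):
    """Build a trust policy that conditions sts:AssumeRole on an external ID.

    Hypothetical helper for illustration; it mirrors the example policy
    and does not call AWS.
    """
    return {
        "Version": "2012-10-17",
        "Statement": {
            "Effect": "Allow",
            "Principal": {"AWS": originating_account_id},
            "Action": "sts:AssumeRole",
            "Condition": {"StringEquals": {"sts:ExternalId": external_id}},
        },
    }


# Placeholder values, as in the example policy above.
policy = build_trust_policy("<originating-account-id>", "<external-id>")
print(json.dumps(policy, indent=2))
```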

### Changing the S3 bucket encryption

Connections to Amazon S3 use the default encryption for the S3 bucket.\
\
To change the encryption used, click **Show Advanced Options**.

To use Amazon S3 encryption, from the **Server-side Encryption Type** dropdown list, select **Amazon S3**.

To use AWS KMS encryption:

1. From the **Server-Side Encryption Type** dropdown list, select **AWS KMS**.
2. In the **Server-side Encryption AWS KMS ID** field, provide the KMS key ID.\
   \
   Note that if the KMS key doesn't exist in the same account that issues the command, you must provide the full key ARN instead of the key ID.

Note that after you save the new dataset, you cannot change the encryption type.
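
To illustrate the difference between a KMS key ID and a full key ARN, the following Python sketch returns the value to enter in the KMS ID field. The function name `kms_key_reference` is hypothetical, and the key ID, Region, and account ID are sample values. The ARN layout follows the standard AWS KMS key ARN format.

```python
def kms_key_reference(key_id, same_account, region=None, account_id=None):
    """Return the key ID, or the full key ARN for a key in another account.

    Hypothetical helper for illustration; it only formats a string and
    does not verify that the key exists.
    """
    if same_account:
        return key_id
    if not region or not account_id:
        raise ValueError("region and account_id are required to build a key ARN")
    return f"arn:aws:kms:{region}:{account_id}:key/{key_id}"


# Sample values only.
sample_key_id = "1234abcd-12ab-34cd-56ef-1234567890ab"
print(kms_key_reference(sample_key_id, same_account=True))
print(kms_key_reference(sample_key_id, same_account=False,
                        region="us-east-1", account_id="111122223333"))
```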

## Providing Azure credentials <a href="#dataset-new-azure-credentials" id="dataset-new-azure-credentials"></a>

If the source type is Azure, provide the connection information:

<figure><img src="/files/0YN45iF1CZ7hHbeNaxzX" alt=""><figcaption><p>Credentials fields for an Azure dataset</p></figcaption></figure>

1. In the **Account Name** field, provide the name of your Azure account.
2. In the **Account Key** field, provide the access key for your Azure account.
3. To test the connection, click **Test Azure Connection**.
4. Click **Save**.\
   \
   Textual prompts you to [select the dataset files](/textual/dataset-files/files-cloud-storage.md).

## Providing SharePoint credentials <a href="#dataset-new-sharepoint-credentials" id="dataset-new-sharepoint-credentials"></a>

If the source type is SharePoint, provide the credentials for the Entra ID application.

<figure><img src="/files/9DgC4cnczlx9rfom3jxT" alt=""><figcaption><p>Credentials fields for a SharePoint dataset</p></figcaption></figure>

The credentials must have the following application permissions (not delegated permissions):

* `Files.Read.All` - To see the SharePoint files
* `Files.ReadWrite.All` - To write redacted files and metadata back to SharePoint
* `Sites.ReadWrite.All` - To view and modify the SharePoint sites
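
As a quick check before you enter the credentials, the following Python sketch compares the application permissions granted to the Entra ID application against the three permissions listed above. The function name `missing_permissions` is hypothetical, and the sketch assumes that you have already copied the list of granted permissions from your Entra ID application configuration.

```python
# Application permissions required by the list above.
REQUIRED_APPLICATION_PERMISSIONS = frozenset({
    "Files.Read.All",
    "Files.ReadWrite.All",
    "Sites.ReadWrite.All",
})


def missing_permissions(granted):
    """Return the required permissions that are absent from `granted`, sorted.

    Hypothetical helper for illustration; it compares names only and does
    not query Microsoft Entra ID or SharePoint.
    """
    return sorted(REQUIRED_APPLICATION_PERMISSIONS - set(granted))


# Sample input: an application that lacks one of the required permissions.
print(missing_permissions(["Files.Read.All", "Sites.ReadWrite.All"]))
# prints ['Files.ReadWrite.All']
```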

To provide the credentials:

1. In the **Tenant ID** field, provide the SharePoint tenant identifier for the SharePoint site.
2. In the **Client ID** field, provide the client identifier for the SharePoint site.
3. In the **Client Secret** field, provide the secret to use to connect to the SharePoint site.
4. To test the connection, click **Test SharePoint Connection**.
5. Click **Save**.\
   \
   Textual prompts you to [select the dataset files](/textual/dataset-files/files-cloud-storage.md).


