Manage datasets

Use the REST API to create and manage datasets.

Get all datasets

Returns all of the datasets that the user has access to. Each dataset contains only minimal information. For more detailed information on a given dataset, use the /api/Dataset/{datasetId} endpoint.

Query parameters
filterSynthesized · boolean · Required

When true, excludes synthesized datasets from results.

Default: true

Responses

200 OK · application/json

Lightweight dataset summary containing only core metadata used in list views and search results.

id · string · Optional
name · string · Optional
fileSource · string · enum · Optional

The original upload location of source files (Local, S3, Azure, SharePoint, OneLake, or SDK).

tags · string[] · Optional
lastUpdated · object · Optional

A point in time represented as an ISO 8601 timestamp string.

created · object · Optional

A point in time represented as an ISO 8601 timestamp string.

fileCount · integer · int32 · Optional

GET /api/Dataset
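
The following is a minimal Python sketch of a call to this endpoint, not an official client. It assumes a hypothetical base URL, assumes that the API key is passed in an Authorization header, assumes that filterSynthesized is sent in the query string (the path /api/Dataset has no placeholder for it), and assumes that the response body is a JSON array of summary objects. The sample values in the usage note are made up.

```python
import json
import urllib.parse
import urllib.request
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DatasetSummary:
    """A subset of the summary fields listed in the 200 response."""
    id: Optional[str] = None
    name: Optional[str] = None
    file_source: Optional[str] = None
    tags: List[str] = field(default_factory=list)
    file_count: Optional[int] = None


def build_list_request(base_url, api_key, filter_synthesized=True):
    """Build (but do not send) a GET /api/Dataset request."""
    # Assumption: filterSynthesized is sent as a query-string value.
    flag = "true" if filter_synthesized else "false"
    query = urllib.parse.urlencode({"filterSynthesized": flag})
    url = f"{base_url.rstrip('/')}/api/Dataset?{query}"
    # Assumption: the API key is passed in the Authorization header.
    headers = {"Authorization": api_key, "Accept": "application/json"}
    return urllib.request.Request(url, headers=headers, method="GET")


def parse_summaries(body):
    """Parse a response body, assumed to be a JSON array of summaries."""
    return [
        DatasetSummary(
            id=item.get("id"),
            name=item.get("name"),
            file_source=item.get("fileSource"),
            tags=item.get("tags") or [],
            file_count=item.get("fileCount"),
        )
        for item in json.loads(body)
    ]
```

To send the request, pass the result of build_list_request("https://example.com", "YOUR_API_KEY") to urllib.request.urlopen, then pass the decoded response body to parse_summaries.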

Get a dataset by ID

Returns the dataset specified by datasetId.

Path parameters
datasetId · string · Required

Responses

200 OK · application/json

id · string · Optional
name · string · Optional
enabledModels · string[] · Optional
lastUpdated · object · Optional
docXImagePolicy · string · enum · Optional

Possible values:

  • Redact: Run images through OCR and redact sensitive text
  • Ignore: Leave images alone
  • Remove: Cover image with opaque black box

pdfSignaturePolicy · string · enum · Optional

Possible values:

  • Redact: Cover signature with opaque black box
  • Ignore: Do not attempt to detect signature

docXCommentPolicy · string · enum · Optional

Possible values:

  • Remove: Remove all comments for file
  • Ignore: Leave comments alone

docXTablePolicy · string · enum · Optional

Possible values:

  • Redact: Treat table content normally, feed into redaction process.
  • Remove: Replace all characters and symbols in table with a placeholder.

fileSource · string · enum · Optional
customPiiEntityIds · string[] · Optional

GET /api/Dataset/{datasetId}
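
As a rough illustration, the Python sketch below builds the request for a single dataset. The base URL and the Authorization header are assumptions that the reference does not confirm. The dataset ID is percent-encoded so that it cannot alter the path.

```python
import urllib.parse
import urllib.request


def dataset_url(base_url, dataset_id):
    """Return the URL for GET /api/Dataset/{datasetId}."""
    if not dataset_id:
        raise ValueError("datasetId is required")
    # Percent-encode the ID, including any "/" characters.
    encoded = urllib.parse.quote(dataset_id, safe="")
    return f"{base_url.rstrip('/')}/api/Dataset/{encoded}"


def build_get_request(base_url, api_key, dataset_id):
    """Build (but do not send) the request for one dataset."""
    # Assumption: the API key is passed in the Authorization header.
    headers = {"Authorization": api_key, "Accept": "application/json"}
    return urllib.request.Request(
        dataset_url(base_url, dataset_id), headers=headers, method="GET"
    )
```

Because every response field is marked Optional, code that reads the result should probably use dict.get rather than direct indexing.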

Create a dataset

Creates a new dataset with the specified configuration. You must specify a unique, non-empty dataset name.

Body
name · string · Optional

Responses

200 OK · application/json

id · string · Optional
name · string · Optional
enabledModels · string[] · Optional
lastUpdated · object · Optional
docXImagePolicy · string · enum · Optional

Possible values:

  • Redact: Run images through OCR and redact sensitive text
  • Ignore: Leave images alone
  • Remove: Cover image with opaque black box

pdfSignaturePolicy · string · enum · Optional

Possible values:

  • Redact: Cover signature with opaque black box
  • Ignore: Do not attempt to detect signature

docXCommentPolicy · string · enum · Optional

Possible values:

  • Remove: Remove all comments for file
  • Ignore: Leave comments alone

docXTablePolicy · string · enum · Optional

Possible values:

  • Redact: Treat table content normally, feed into redaction process.
  • Remove: Replace all characters and symbols in table with a placeholder.

fileSource · string · enum · Optional
customPiiEntityIds · string[] · Optional

POST /api/Dataset
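
A creation request might look like the Python sketch below. It is illustrative only: the base URL and the Authorization header are assumptions, and the helper checks only that the name is non-empty. Uniqueness can be verified only by the server.

```python
import json
import urllib.request


def build_create_request(base_url, api_key, name):
    """Build (but do not send) a POST /api/Dataset request."""
    # The reference requires a non-empty name; reject blank input early.
    if not name or not name.strip():
        raise ValueError("dataset name must be non-empty")
    body = json.dumps({"name": name}).encode("utf-8")
    headers = {
        # Assumption: the API key is passed in the Authorization header.
        "Authorization": api_key,
        "Content-Type": "application/json",
        "Accept": "application/json",
    }
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/api/Dataset",
        data=body,
        headers=headers,
        method="POST",
    )
```

The id field of the 200 response is the value to use as datasetId in later calls.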

Edit a dataset

Updates a dataset to use the specified configuration.

Query parameters
shouldRescan · boolean · Optional

When true, triggers a rescan of dataset files after the update.

Body

Request to update an existing dataset's configuration, redaction policies, and entity settings.

id · string · Optional
name · string · Optional
docXImagePolicy · string · enum · Optional

Possible values:

  • Redact: Run images through OCR and redact sensitive text
  • Ignore: Leave images alone
  • Remove: Cover image with opaque black box

pdfSignaturePolicy · string · enum · Optional

Possible values:

  • Redact: Cover signature with opaque black box
  • Ignore: Do not attempt to detect signature

pdfSynthModePolicy · string · enum · Optional

Possible values:

  • V1: Original mode with incorrect font, size and style
  • V2: New mode

docXCommentPolicy · string · enum · Optional

Possible values:

  • Remove: Remove all comments for file
  • Ignore: Leave comments alone

docXTablePolicy · string · enum · Optional

Possible values:

  • Redact: Treat table content normally, feed into redaction process.
  • Remove: Replace all characters and symbols in table with a placeholder.

llmClassificationPolicy · string · enum · Optional

Possible values:

  • Disabled: Do not use LLM for structured data classification
  • Enabled: Use LLM to classify structured data for PII detection

llmTableClassificationPolicy · string · enum · Optional

Possible values:

  • Disabled: Do not use LLM for structured data classification
  • Enabled: Use LLM to classify structured data for PII detection

awsCredentialSource · string · Optional
outputPath · string · nullable · Optional
ocrServiceProvider · string · enum · Optional

The OCR engine used for text extraction from images and scanned documents.

Responses

200 OK · application/json

Full dataset details including files, configuration, and permission information.

id · string · Optional
name · string · Optional
outputFormat · string · enum · Optional

The output format for redacted files: Original preserves the source format, Markdown produces a markdown version.

tags · string[] · Optional
lastUpdated · object · Optional

A point in time represented as an ISO 8601 timestamp string.

created · object · Optional

A point in time represented as an ISO 8601 timestamp string.

docXImagePolicy · string · enum · Optional

Possible values:

  • Redact: Run images through OCR and redact sensitive text
  • Ignore: Leave images alone
  • Remove: Cover image with opaque black box

pdfSignaturePolicy · string · enum · Optional

Possible values:

  • Redact: Cover signature with opaque black box
  • Ignore: Do not attempt to detect signature

pdfSynthModePolicy · string · enum · Optional

Possible values:

  • V1: Original mode with incorrect font, size and style
  • V2: New mode

docXCommentPolicy · string · enum · Optional

Possible values:

  • Remove: Remove all comments for file
  • Ignore: Leave comments alone

docXTablePolicy · string · enum · Optional

Possible values:

  • Redact: Treat table content normally, feed into redaction process.
  • Remove: Replace all characters and symbols in table with a placeholder.

llmClassificationPolicy · string · enum · Optional

Possible values:

  • Disabled: Do not use LLM for structured data classification
  • Enabled: Use LLM to classify structured data for PII detection

llmTableClassificationPolicy · string · enum · Optional

Possible values:

  • Disabled: Do not use LLM for structured data classification
  • Enabled: Use LLM to classify structured data for PII detection

fileSource · string · enum · Optional

The original upload location of source files (Local, S3, Azure, SharePoint, OneLake, or SDK).

customPiiEntityIds · string[] · nullable · Optional
awsCredentialSource · string · nullable · Optional
outputPath · string · nullable · Optional
ocrServiceProvider · string · enum · Optional

The OCR engine used for text extraction from images and scanned documents.

fileCount · integer · int32 · Optional

PUT /api/Dataset
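
The Python sketch below shows one way to assemble an update request and catch invalid policy values before the call is made. The allowed values are copied from the lists in the request body above. The base URL, the Authorization header, and the helper names are assumptions made for illustration.

```python
import json
import urllib.parse
import urllib.request

# Enum values as listed in the request body reference.
ALLOWED_POLICY_VALUES = {
    "docXImagePolicy": {"Redact", "Ignore", "Remove"},
    "pdfSignaturePolicy": {"Redact", "Ignore"},
    "pdfSynthModePolicy": {"V1", "V2"},
    "docXCommentPolicy": {"Remove", "Ignore"},
    "docXTablePolicy": {"Redact", "Remove"},
    "llmClassificationPolicy": {"Disabled", "Enabled"},
    "llmTableClassificationPolicy": {"Disabled", "Enabled"},
}


def build_update_request(base_url, api_key, dataset_id,
                         should_rescan=False, **settings):
    """Build (but do not send) a PUT /api/Dataset request."""
    for key, value in settings.items():
        allowed = ALLOWED_POLICY_VALUES.get(key)
        if allowed is not None and value not in allowed:
            raise ValueError(f"{key} must be one of {sorted(allowed)}")
    body = json.dumps({"id": dataset_id, **settings}).encode("utf-8")
    flag = "true" if should_rescan else "false"
    query = urllib.parse.urlencode({"shouldRescan": flag})
    headers = {
        # Assumption: the API key is passed in the Authorization header.
        "Authorization": api_key,
        "Content-Type": "application/json",
        "Accept": "application/json",
    }
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/api/Dataset?{query}",
        data=body,
        headers=headers,
        method="PUT",
    )
```

The reference does not say whether fields omitted from the body keep their current values. A cautious caller might fetch the dataset first and send back its full configuration with only the intended changes applied.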
