Synthetic Dataset Generation

In addition to uploading your existing datasets, Relari can also generate new datasets by processing user-provided documents in various formats.

Synthetic datasets work the same way as other datasets. For more information about their structure and how to download them, see Datasets.

We're building the capability to generate different types of datasets, but currently the only supported use case is Retrieval-Augmented Generation (RAG). Other use cases we'll support soon include:

Conversational agents
Code generation
Summarization
Data extraction
Classification

Generate a dataset

To generate a synthetic dataset, you need to provide between 1 and 20 documents. Each file can be up to 25 MiB in size, with a total combined limit of 200 MiB across all files.
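The limits above can be checked locally before starting a generation. The following is a minimal sketch, not part of the Relari SDK: the helper names `check_upload_limits` and `file_sizes` are hypothetical, and the check only mirrors the limits stated in this section.

```python
from pathlib import Path

MIB = 1024 * 1024
MAX_FILES = 20               # between 1 and 20 documents
MAX_FILE_BYTES = 25 * MIB    # each file up to 25 MiB
MAX_TOTAL_BYTES = 200 * MIB  # combined limit of 200 MiB


def check_upload_limits(sizes: dict[str, int]) -> list[str]:
    """Return a list of problems; an empty list means the files fit the limits.

    `sizes` maps a file name to its size in bytes.
    """
    problems = []
    if not 1 <= len(sizes) <= MAX_FILES:
        problems.append(f"expected 1 to {MAX_FILES} files, got {len(sizes)}")
    for name, size in sizes.items():
        if size > MAX_FILE_BYTES:
            problems.append(f"{name} exceeds 25 MiB")
    if sum(sizes.values()) > MAX_TOTAL_BYTES:
        problems.append("combined size exceeds 200 MiB")
    return problems


def file_sizes(paths: list[str]) -> dict[str, int]:
    """Read the on-disk size of each path (hypothetical convenience helper)."""
    return {p: Path(p).stat().st_size for p in paths}
```

For example, `check_upload_limits(file_sizes(["./books/Don Quixote.pdf"]))` returns an empty list when the file is within the limits.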

These files can be PDFs, Word documents (.doc and .docx), or plain text files (including .txt, CSV, HTML, XML and Markdown files).
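If a folder mixes supported and unsupported files, it may help to separate them before generating. This is a rough sketch under one assumption: the suffix list below is inferred from the formats named above and is not an official list. The helper name `split_by_support` is hypothetical.

```python
from pathlib import Path

# Assumed suffixes for the formats named in this section; not an official list.
SUPPORTED_SUFFIXES = {
    ".pdf", ".doc", ".docx",               # PDFs and Word documents
    ".txt", ".csv", ".html", ".xml", ".md",  # plain text files
}


def split_by_support(paths: list[str]) -> tuple[list[str], list[str]]:
    """Split paths into (supported, unsupported) by case-insensitive suffix."""
    supported, unsupported = [], []
    for p in paths:
        if Path(p).suffix.lower() in SUPPORTED_SUFFIXES:
            supported.append(p)
        else:
            unsupported.append(p)
    return supported, unsupported
```

The first list can then be passed as the `files` argument, and the second reported to the user.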

CLI:

relari-cli datasets generate PROJECT_ID DOCS_FOLDER NUM_SAMPLES --NAME DATASET_NAME

Python:

generation_id = client.synth.new(
  project_id,
  name="My Synthetic Dataset",
  samples=20,
  files=[
    "./books/Don Quixote.pdf",
    "./essays/209_beyond_smart.txt",
  ]
)

Node:

const generatedDataset = await relariClient.synth.new(
  projectId,
  20, // number of samples
  [
    `./books/Don Quixote.pdf`,
    `./essays/209_beyond_smart.txt`,
  ],
  "My Synthetic Dataset",
  DatasetType.RAG,
)
const datasetId = generatedDataset.id

Note

When using the CLI, you must provide a folder with the documents that will inform the synthetic dataset.

Both the Node and Python SDKs expect an array of file paths.

If you need to get all the documents in a directory while using the SDK to generate a dataset, you can use these examples:

Python:

from pathlib import Path

docs_folder = Path("path/to/docs/folder")

task_id = client.synth.new(
    project_id=project_id,
    name=name,
    samples=samples,
    # Every regular file directly inside docs_folder (subfolders are skipped)
    files=[f for f in docs_folder.iterdir() if f.is_file()],
)

Node:

import fs from 'fs'
import path from 'path'

const folderWithDocs = `path/to/docs/folder`

// Full paths of every regular file directly inside folderWithDocs
const files = fs
  .readdirSync(folderWithDocs)
  .map(file => path.join(folderWithDocs, file))
  .filter(filePath => fs.statSync(filePath).isFile())

const generatedDataset = await relariClient.synth.new(
  projectId,
  20, // number of samples
  files,
  "My Synthetic Dataset",
  DatasetType.RAG,
)

You can check the status of the generation task with:

CLI:

relari-cli datasets status DATASET_ID

Python:

dataset = client.datasets.get(DATASET_ID)
print(f"Dataset generation status: {dataset.status}")

Node:

const dataset = await relariClient.dataset.get(datasetId)
console.log(`Dataset generation status: ${dataset.status}`)
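Since generation runs as a task, a script may want to wait until it finishes. The sketch below is a generic polling helper and not part of the Relari SDK. The name `wait_for_status` is hypothetical, and because this page does not list the possible status values, the caller supplies the set of statuses that should end the wait.

```python
import time
from typing import Callable, Iterable


def wait_for_status(
    get_status: Callable[[], str],
    terminal: Iterable[str],
    interval_s: float = 5.0,
    timeout_s: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call get_status() until it returns a value in `terminal`, then return it.

    Raises TimeoutError if no terminal status is seen within timeout_s seconds.
    """
    terminal_set = set(terminal)
    deadline = time.monotonic() + timeout_s
    while True:
        status = get_status()
        if status in terminal_set:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"still {status!r} after {timeout_s} seconds")
        sleep(interval_s)  # wait before asking again
```

With the Python call shown above, usage might look like `wait_for_status(lambda: client.datasets.get(DATASET_ID).status, terminal={"COMPLETED", "FAILED"})`, where both status strings are placeholders to replace with the values the API actually returns.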

Downloading a Dataset

You can download a synthetic dataset the same way you'd download an uploaded one. See Downloading a Dataset.
