
Synthetic Dataset Generation

In addition to letting you upload your existing datasets, Relari can generate new datasets by processing documents you provide in various formats.

Synthetic datasets work the same way as other datasets. For more information about their structure and how to download them, see Datasets.

We're building the capability to generate different types of datasets, but currently the only supported use case is Retrieval-Augmented Generation. Other use cases we'll support soon include:

  • Conversational agents
  • Code generation
  • Summarization
  • Data extraction
  • Classification

Generate a dataset

To generate a synthetic dataset, you need to provide between 1 and 20 documents. Each file can be up to 25 MiB in size, with a total combined limit of 200 MiB across all files.

These files can be PDFs, Word documents (.doc and .docx), or plain text files (including .txt, CSV, HTML, XML and Markdown files).

relari-cli datasets generate PROJECT_ID DOCS_FOLDER NUM_SAMPLES --NAME DATASET_NAME
Note: When using the CLI, you must provide a folder containing the documents that will inform the synthetic dataset.

Both the Node and Python SDKs expect an array of file paths.

If you need to pass every document in a directory when generating a dataset with the SDK, you can build the list of files as in this Python example:

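# docs_folder is a pathlib.Path pointing at the directory with your documents
task_id = client.synth.new(
    project_id=project_id,
    name=name,
    samples=samples,
    # Keep only regular files; subdirectories are skipped
    files=[f for f in docs_folder.iterdir() if f.is_file()],
)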
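
Before starting a generation task, it can help to check a folder against the limits described above (1 to 20 documents, up to 25 MiB per file, 200 MiB combined). The following is a minimal sketch rather than part of the Relari SDK: the helper names `check_documents` and `check_folder` are hypothetical, and the extension list is an assumption about how the supported formats map to file suffixes.

```python
from pathlib import Path

MIB = 1024 * 1024
MAX_FILES = 20
MAX_FILE_BYTES = 25 * MIB
MAX_TOTAL_BYTES = 200 * MIB
# Assumption: suffixes corresponding to the formats listed above.
ALLOWED_SUFFIXES = {".pdf", ".doc", ".docx", ".txt", ".csv", ".html", ".xml", ".md"}


def check_documents(sizes):
    """Return a list of problems for a {path: size_in_bytes} mapping."""
    problems = []
    if not 1 <= len(sizes) <= MAX_FILES:
        problems.append(f"expected 1 to {MAX_FILES} files, got {len(sizes)}")
    for path, size in sizes.items():
        suffix = Path(path).suffix.lower()
        if suffix not in ALLOWED_SUFFIXES:
            problems.append(f"{path}: unsupported extension {suffix!r}")
        if size > MAX_FILE_BYTES:
            problems.append(f"{path}: larger than 25 MiB")
    if sum(sizes.values()) > MAX_TOTAL_BYTES:
        problems.append("combined size exceeds 200 MiB")
    return problems


def check_folder(folder):
    """Collect the size of each regular file in a folder and check them."""
    files = [f for f in Path(folder).iterdir() if f.is_file()]
    return check_documents({str(f): f.stat().st_size for f in files})
```

An empty list from `check_folder(docs_folder)` means the folder fits the documented limits; the service itself remains the authority on what it accepts.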

You can check the status of the generation task with:

relari-cli datasets status DATASET_ID

Downloading a Dataset

You can download a synthetic dataset the same way you'd download an uploaded one. See Downloading a Dataset.