Synthetic Dataset Generation
In addition to uploading your existing datasets, Relari can also generate new datasets by processing user-provided documents in various formats.
Synthetic datasets work the same way as other datasets. For more information about their structure and how to download them, see Datasets.
We're building the capability to generate different types of datasets, but currently the only supported use case is Retrieval-Augmented Generation (RAG). Other use cases we plan to support soon include:
- Conversational agents
- Code Generation
- Summarization
- Data extraction
- Classification
Generate a dataset
To generate a synthetic dataset, you need to provide between 1 and 20 documents. Each file can be up to 25 MiB in size, with a total combined limit of 200 MiB across all files.
These files can be PDFs, Word documents (.doc and .docx), or plain text files (including .txt, CSV, HTML, XML and Markdown files).
- CLI
- Python
- Node
relari-cli datasets generate PROJECT_ID DOCS_FOLDER NUM_SAMPLES --NAME DATASET_NAME
generation_id = client.synth.new(
    project_id,
    name="My Synthetic Dataset",
    samples=20,
    files=[
        "./books/Don Quixote.pdf",
        "./essays/209_beyond_smart.txt",
    ],
)
const generatedDataset = await relariClient.synth.new(
  projectId,
  20, // number of samples
  [
    `./books/Don Quixote.pdf`,
    `./essays/209_beyond_smart.txt`,
  ],
  "My Synthetic Dataset",
  DatasetType.RAG,
)
const datasetId = generatedDataset.id
When using the CLI, you must provide a folder with the documents that will inform the synthetic dataset.
Both the Node and Python SDKs expect an array of file paths.
If you need to get all the documents in a directory while using the SDK to generate a dataset, you can use these examples:
- Python
- Node
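from pathlib import Path

docs_folder = Path("path/to/docs/folder")
task_id = client.synth.new(
    project_id=project_id,
    name=name,
    samples=samples,
    # Collect every regular file in the folder, skipping subdirectories
    files=[f for f in docs_folder.iterdir() if f.is_file()],
)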
import fs from 'fs'
import path from 'path'

const folderWithDocs = `path/to/docs/folder`
// Build full paths and keep only regular files, skipping subdirectories
const files = fs.readdirSync(folderWithDocs)
  .map(file => path.join(folderWithDocs, file))
  .filter(file => fs.statSync(file).isFile())

const generatedDataset = await relariClient.synth.new(
  projectId,
  20, // number of samples
  files,
  "My Synthetic Dataset",
  DatasetType.RAG,
)
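When collecting files from a folder, it can be useful to check them locally against the limits described at the start of this section (1 to 20 documents, 25 MiB per file, 200 MiB combined) before starting a generation. The following is a minimal sketch, not part of the SDK: `check_documents` is a hypothetical helper, and the list of file extensions is an assumption inferred from the formats named above, so the service may accept or reject others.

```python
from pathlib import Path

MIB = 1024 * 1024
MAX_FILES = 20
MAX_FILE_BYTES = 25 * MIB
MAX_TOTAL_BYTES = 200 * MIB
# Assumed extensions, inferred from the supported formats listed in this page.
ALLOWED_SUFFIXES = {".pdf", ".doc", ".docx", ".txt", ".csv", ".html", ".xml", ".md"}


def check_documents(paths):
    """Return a list of problems; an empty list means the files fit the documented limits."""
    files = [Path(p) for p in paths]
    problems = []
    if not 1 <= len(files) <= MAX_FILES:
        problems.append(f"expected 1 to {MAX_FILES} files, got {len(files)}")
    total = 0
    for f in files:
        if not f.is_file():
            problems.append(f"{f}: not a file")
            continue
        size = f.stat().st_size
        total += size
        if size > MAX_FILE_BYTES:
            problems.append(f"{f}: larger than 25 MiB")
        if f.suffix.lower() not in ALLOWED_SUFFIXES:
            problems.append(f"{f}: unexpected file type {f.suffix!r}")
    if total > MAX_TOTAL_BYTES:
        problems.append("combined size exceeds 200 MiB")
    return problems
```

Calling `check_documents(files)` on the list you intend to pass to `client.synth.new` and stopping when the result is non-empty avoids uploading a set of documents that would be rejected.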
You can check the status of the generation task with:
- CLI
- Python
- Node
relari-cli datasets status DATASET_ID
dataset = client.datasets.get(DATASET_ID)
print(f"Dataset generation status: {dataset.status}")
const dataset = await relariClient.dataset.get(datasetId)
console.log(`Dataset generation status: ${dataset.status}`)
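Because generation runs as a background task, a script usually needs to check the status repeatedly until the task finishes. The sketch below shows one way to do that with a simple polling loop. `wait_for_status` is a hypothetical helper, not an SDK function, and the status strings in the usage comment are placeholders, since the actual values returned by the service are not listed here.

```python
import time


def wait_for_status(get_status, done_statuses, interval=5.0, timeout=600.0, sleep=time.sleep):
    """Call get_status() until it returns a value in done_statuses or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        status = get_status()
        if status in done_statuses:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"status is still {status!r} after {timeout} seconds")
        sleep(interval)


# Usage with the Python client shown above.
# "COMPLETED" and "FAILED" are assumed placeholder values, not confirmed status names:
# final = wait_for_status(
#     lambda: client.datasets.get(DATASET_ID).status,
#     done_statuses={"COMPLETED", "FAILED"},
# )
```

Passing the status lookup as a callable keeps the loop independent of the client, and the `timeout` guards against waiting forever if the task never reaches a final state.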
Downloading a Dataset
You can download a synthetic dataset the same way you'd download an uploaded one. See Downloading a Dataset.