Synthetic Dataset Generation
Test datasets are the core of a data-driven development process. You can define your own dataset, down to the individual fields of each datum, and use it for evaluation.
Creating a Dataset
To create a dataset, you need to provide a name and a description. You can also provide a list of tags to help you organize your datasets.
Suppose the variable data holds a list of dictionaries, each representing one datum.
To create a dataset, you can use the following:
from relari import Dataset
dataset = Dataset.from_data(data)
Each datum should have a uid field containing a unique identifier for that datum. If the uid is not provided, the system will generate a unique identifier for you.
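To make the shape of data concrete, here is a minimal sketch that builds a small list of dictionaries and fills in any missing uid locally before the list is passed to Dataset.from_data. The helper name ensure_uids and the sample records are hypothetical, and the identifier format the system generates on its own is not specified here; the sketch simply uses Python's standard uuid module.

```python
import uuid


def ensure_uids(data):
    """Return a copy of `data` in which every datum has a unique 'uid'.

    Hypothetical helper: it only illustrates the uid requirement and is
    not part of the relari SDK.
    """
    seen = set()
    result = []
    for datum in data:
        datum = dict(datum)  # shallow copy so the input is left untouched
        uid = datum.get("uid")
        if uid is None:
            # Assumption: any unique string is acceptable as a uid.
            uid = uuid.uuid4().hex
            datum["uid"] = uid
        if uid in seen:
            raise ValueError(f"duplicate uid: {uid}")
        seen.add(uid)
        result.append(datum)
    return result


# Illustrative records only; the second one has no uid yet.
data = ensure_uids([
    {"uid": "q1", "question": "What is the essay about?"},
    {"question": "Who is the intended audience?"},
])
```

Assigning identifiers yourself keeps them stable across uploads, which can make it easier to compare results for the same datum later.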
Manifest
A dataset has an associated manifest. The manifest is a YAML file that contains:
- the dataset name
- the dataset description
- the list of fields of each datum in the dataset, with their types and descriptions (ground truth fields are marked as such)
- the dataset license
An example of a manifest file is:
name: Paul Graham's Essays
description: Paul Graham's Essays Q&A
format: jsonl
license: MIT
fields:
  uid:
    description: Unique identifier for the filing
    type: UID
  question:
    description: The question asked by the user
    type: str
    ground_truth: false
  ground_truth:
    description: The correct answer to the question
    type: List[str]
    ground_truth: true
  ground_truth_context:
    description: Ground truth contexts
    type: List[str]
    ground_truth: true
The manifest is not mandatory, but providing one is recommended because it helps users understand the dataset. If you do not provide one, the system will generate a manifest for you, assuming that none of the fields are ground truth fields.
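The following sketch shows roughly what a default manifest of that kind could look like when derived from the data itself. It is an assumption-laden illustration, not the system's actual inference logic: the function names infer_manifest and type_name are hypothetical, the type strings simply imitate the example manifest above, and every non-uid field is marked as not ground truth.

```python
def type_name(value):
    """Return a type label in the style of the example manifest (assumed)."""
    if isinstance(value, list):
        inner = {type(item).__name__ for item in value}
        # Only label the element type when the list is homogeneous.
        return f"List[{inner.pop()}]" if len(inner) == 1 else "List"
    return type(value).__name__


def infer_manifest(data, name, description=""):
    """Build a manifest-like dict in which no field is ground truth."""
    fields = {}
    for datum in data:
        for key, value in datum.items():
            if key in fields:
                continue  # keep the first type seen for each field
            if key == "uid":
                fields[key] = {"type": "UID"}
            else:
                fields[key] = {"type": type_name(value), "ground_truth": False}
    return {
        "name": name,
        "description": description,
        "format": "jsonl",
        "fields": fields,
    }


manifest = infer_manifest(
    [{"uid": "q1", "question": "What is the essay about?", "ground_truth": ["..."]}],
    name="demo",
)
```

Note that a field named ground_truth would still be marked ground_truth: false here, which is why writing the manifest by hand is the safer choice whenever the dataset contains reference answers.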
Uploading a Dataset
To upload a dataset, you can use the CLI:
relari-cli datasets new PROJECT_ID DATASET_FOLDER_OR_FILE
and check the status of the upload with:
relari-cli datasets ls PROJECT_ID
Note that the name of the dataset is taken from the manifest file.
Alternatively, you can use the SDK:
from relari import RelariClient, Dataset
client = RelariClient()
proj = client.projects.find(name="RAG")
dataset = Dataset("data/paul_graham/dataset") # dataset folder or file
dataset.name = "Paul Graham" # set the dataset name
info = client.datasets.create(project_id=proj["id"], dataset=dataset)
print("Dataset ID:", info)
Downloading a Dataset
To download a dataset, you can use the CLI:
relari-cli datasets get DATASET_ID OUT_DIR
or the SDK:
# client is a RelariClient instance, created as in the upload example
dataset = client.datasets.get(DATASET_ID)
The dataset is downloaded as a Dataset object.
Fields marked as ground truth in the manifest are not retrieved.
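To picture what a downloaded dataset contains, the sketch below removes every field that a manifest marks as ground truth from a list of records. It runs on plain dictionaries and is only an illustration of the effect; the function name strip_ground_truth and the sample values are hypothetical, and this is not how the SDK implements the download.

```python
def strip_ground_truth(data, manifest):
    """Drop every field that the manifest marks with ground_truth: true."""
    hidden = {
        name
        for name, spec in manifest["fields"].items()
        if spec.get("ground_truth", False)
    }
    return [
        {key: value for key, value in datum.items() if key not in hidden}
        for datum in data
    ]


# Manifest fields mirroring the example manifest (descriptions omitted).
manifest = {
    "fields": {
        "uid": {"type": "UID"},
        "question": {"type": "str", "ground_truth": False},
        "ground_truth": {"type": "List[str]", "ground_truth": True},
        "ground_truth_context": {"type": "List[str]", "ground_truth": True},
    }
}

records = [
    {
        "uid": "q1",
        "question": "What is the essay about?",
        "ground_truth": ["..."],
        "ground_truth_context": ["..."],
    }
]

visible = strip_ground_truth(records, manifest)
```

With the example manifest, only uid and question would remain in each record.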