Datasets

Test datasets are the core of the data-driven development process. You can create your own dataset and define it in granular detail for evaluation purposes.

Creating a Dataset

To create a dataset, you need to provide a name and a description. You can also provide a list of tags to help you organize your datasets.

Suppose the variable data holds a list of dictionaries, each representing one datum. To create a dataset from it, you can use the following:

from relari import Dataset

dataset = Dataset.from_data(data)

Each datum needs a uid field containing a unique identifier for that datum. If you do not provide a uid, the system generates one for you.
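
If you prefer to control the identifiers yourself, the uid rule above can be sketched locally before the data reaches Dataset.from_data. This is a minimal sketch under stated assumptions: ensure_uids is a hypothetical helper (not part of the relari package), the sample records are made up, and it assumes any unique string is acceptable as a uid.

```python
import uuid


def ensure_uids(data):
    """Return a copy of data in which every datum has a unique 'uid'.

    Hypothetical helper for illustration; not part of the relari SDK.
    """
    seen = set()
    result = []
    for datum in data:
        datum = dict(datum)  # shallow copy so the caller's dicts are not mutated
        uid = datum.get("uid")
        if uid is None:
            # Assumption: any unique string is acceptable as a uid.
            uid = uuid.uuid4().hex
            datum["uid"] = uid
        if uid in seen:
            raise ValueError(f"duplicate uid: {uid!r}")
        seen.add(uid)
        result.append(datum)
    return result


# Made-up sample records: the first has a uid, the second gets one assigned.
data = ensure_uids([
    {"uid": "q1", "question": "Example question 1"},
    {"question": "Example question 2"},
])
```

The resulting data list can then be passed to Dataset.from_data(data) as shown above.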

Manifest

A dataset has an associated manifest. The manifest is a YAML file that contains:

  • the dataset name
  • the dataset description
  • the list of fields in each datum, with their types and descriptions (ground truth fields are marked as such)
  • the dataset license

An example of a manifest file is:

name: Paul Graham's Essays
description: Paul Graham's Essays Q&A
format: jsonl
license: MIT
fields:
  uid:
    description: Unique identifier for the filing
    type: UID
  question:
    description: The question asked by the user
    type: str
    ground_truth: false
  ground_truth:
    description: The correct answer to the question
    type: List[str]
    ground_truth: true
  ground_truth_context:
    description: Ground truth contexts
    type: List[str]
    ground_truth: true

The manifest is optional, but providing one is recommended because it helps users understand the dataset. If you do not provide a manifest, the system generates one for you and assumes that none of the fields are ground truth fields.
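
As a rough illustration of that default, the sketch below builds a manifest-like dictionary in which no field is marked as ground truth. The function names default_manifest and type_name, and the type labels they emit, are assumptions modeled on the example manifest; how Relari actually infers field types is not documented here.

```python
def type_name(value):
    """Return a type label in the style of the example manifest (assumed)."""
    if isinstance(value, list):
        inner = {type(v).__name__ for v in value}
        # Only label the element type when the list is homogeneous.
        return f"List[{inner.pop()}]" if len(inner) == 1 else "List"
    return type(value).__name__


def default_manifest(name, description, data):
    """Build a manifest-like dict with every field marked as non-ground-truth."""
    fields = {}
    for datum in data:
        for key, value in datum.items():
            if key in fields:
                continue  # keep the type seen in the first datum that has the field
            if key == "uid":
                fields[key] = {"type": "UID"}
            else:
                fields[key] = {"type": type_name(value), "ground_truth": False}
    return {"name": name, "description": description, "fields": fields}


# Made-up sample record for illustration.
manifest = default_manifest(
    "Demo",
    "Demo dataset",
    [{"uid": "q1", "question": "Example question", "ground_truth": ["Example answer"]}],
)
```

Because such a default marks the ground_truth field as an ordinary field, writing the manifest by hand is the safer choice whenever a dataset contains reference answers.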

Uploading a Dataset

To upload a dataset, you can use the CLI:

relari-cli datasets new PROJECT_ID DATASET_FOLDER_OR_FILE

and check the status of the upload with:

relari-cli datasets ls PROJECT_ID

Note that the name of the dataset is specified in the manifest file.

Alternatively, you can use the SDK:

from relari import RelariClient, Dataset

client = RelariClient()

proj = client.projects.find(name="RAG")
dataset = Dataset("data/paul_graham/dataset")  # dataset folder or file
dataset.name = "Paul Graham"  # set the dataset name

info = client.datasets.create(project_id=proj["id"], dataset=dataset)
print("Dataset ID:", info)

Downloading a Dataset

To download a dataset, you can use the CLI:

relari-cli datasets get DATASET_ID OUT_DIR

or the SDK:

dataset = client.datasets.get(DATASET_ID)

Note: The dataset is downloaded as a Dataset object. Fields marked as ground truth are not retrieved.
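
To make the effect on a downloaded datum concrete, the following sketch filters a record against the fields section of the example manifest. It is a local illustration only: visible_fields is a hypothetical helper, the sample datum is made up, and the real filtering is done by the service rather than by code like this.

```python
def visible_fields(datum, manifest_fields):
    """Drop every field that the manifest marks with ground_truth: true."""
    return {
        key: value
        for key, value in datum.items()
        if not manifest_fields.get(key, {}).get("ground_truth", False)
    }


# The "fields" section of the example manifest, written as a Python dict.
manifest_fields = {
    "uid": {"type": "UID"},
    "question": {"type": "str", "ground_truth": False},
    "ground_truth": {"type": "List[str]", "ground_truth": True},
    "ground_truth_context": {"type": "List[str]", "ground_truth": True},
}

# Made-up sample datum for illustration.
datum = {
    "uid": "q1",
    "question": "Example question",
    "ground_truth": ["Example answer"],
    "ground_truth_context": ["Example context"],
}

downloaded = visible_fields(datum, manifest_fields)
print(downloaded)  # {'uid': 'q1', 'question': 'Example question'}
```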