
Datasets

Testing Datasets are the core of the data-driven development process. You can create and define your own dataset with granular details for evaluation purposes.

A dataset is created with a name and a description, and optionally a list of tags to help you organize your datasets. It will also contain a list of data, and a manifest describing this data.

Each datum has a uid field containing a unique identifier for that datum. If you do not provide a uid, the system generates one for you.

warning

Currently the Node SDK cannot infer the Dataset's manifest as the Python SDK does; however, creating this manifest is straightforward, as you can see in the examples below.

Creating a Dataset

To create a dataset, you need to provide a name and a description. You can also provide a list of tags to help you organize your datasets.

Suppose the data variable holds a list of dictionaries, each representing one datum. To create a dataset from it, you can use the following:


from relari import Dataset

dataset = Dataset.from_data(data)
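As a minimal sketch of what the data variable might contain, the snippet below builds a small list of dictionaries and fills in a uid for any datum that lacks one. The field names follow the manifest example on this page, the sample values are made-up placeholders, and the ensure_uids helper is hypothetical: it is not part of the Relari SDK, which generates missing identifiers on its own.

```python
import uuid


def ensure_uids(data):
    """Return a copy of `data` in which every datum has a `uid` field.

    Hypothetical helper for illustration only; not part of the Relari SDK.
    """
    result = []
    for datum in data:
        datum = dict(datum)  # shallow copy so the caller's dicts stay untouched
        if not datum.get("uid"):
            datum["uid"] = uuid.uuid4().hex  # random 32-character hex identifier
        result.append(datum)
    return result


# Placeholder data: one datum with an explicit uid, one without.
data = ensure_uids([
    {"uid": "q1", "question": "placeholder question 1", "ground_truth": ["placeholder answer 1"]},
    {"question": "placeholder question 2", "ground_truth": ["placeholder answer 2"]},
])
```

Setting the uid yourself is useful when you want identifiers that stay stable across re-uploads; otherwise you can leave it out.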

Manifest

A dataset has an associated manifest. The manifest is a YAML or JSON file that contains:

  • the dataset name
  • the dataset description
  • the list of fields of each datum in the dataset, with their types and descriptions (ground truth fields are marked as such)
  • the dataset license

An example of a manifest file is:

name: Paul Graham's Essays
description: Paul Graham's Essays Q&A
format: jsonl
license: MIT
fields:
  uid:
    description: Unique identifier for the filing
    type: UID
  question:
    description: The question asked by the user
    type: str
    ground_truth: false
  ground_truth:
    description: The correct answer to the question
    type: List[str]
    ground_truth: true
  ground_truth_context:
    description: Ground truth contexts
    type: List[str]
    ground_truth: true

The manifest is not mandatory, but providing one is recommended because it helps users understand the dataset. If you do not provide it, the system generates a manifest for you, assuming that none of the fields are ground truth fields.
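To make the default behaviour concrete, here is a rough sketch of how a manifest could be derived from the data when none is supplied, with every field marked as not ground truth. The default_manifest function and its type-inference rule are assumptions made for illustration; the actual inference logic of the SDK is not described on this page.

```python
import json


def default_manifest(name, description, data):
    """Build a manifest dict in which no field is marked as ground truth.

    Hypothetical helper: the type names mimic the manifest example
    (UID, str, List[str]), but the inference rule itself is an assumption.
    """
    fields = {}
    for datum in data:
        for key, value in datum.items():
            if key in fields:
                continue  # keep the first type seen for each field
            if key == "uid":
                fields[key] = {"description": "", "type": "UID"}
            else:
                type_name = "List[str]" if isinstance(value, list) else type(value).__name__
                fields[key] = {"description": "", "type": type_name, "ground_truth": False}
    return {"name": name, "description": description, "fields": fields}


manifest = default_manifest(
    "Paul Graham's Essays",
    "Paul Graham's Essays Q&A",
    [{"uid": "q1", "question": "placeholder question", "ground_truth": ["placeholder answer"]}],
)
manifest_json = json.dumps(manifest, indent=2)  # the manifest may be stored as JSON
```

In this sketch even the field named ground_truth ends up with ground_truth set to false, which is why writing the manifest by hand is worthwhile whenever the dataset contains reference answers.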

Uploading a Dataset

To upload a dataset, you can use the CLI:

relari-cli datasets new PROJECT_ID DATASET_FOLDER_OR_FILE

and check the status of the upload with:

relari-cli datasets ls PROJECT_ID

Note that the dataset name is taken from the manifest file.

Or, alternatively, you can use the SDK:

from relari import RelariClient, Dataset

client = RelariClient()

proj = client.projects.find(name="RAG")

dataset = Dataset("data/paul_graham/dataset") # dataset folder or file
# or alternatively:
# dataset = Dataset.from_data(data)

dataset.name = "Paul Graham" # set the dataset name

info = client.datasets.create(project_id=proj["id"], dataset=dataset)
print("Dataset ID:", info)

Downloading a Dataset

To download a dataset, you can use the CLI or the SDK:

relari-cli datasets get DATASET_ID OUT_DIR
note

Fields marked as ground truth are not retrieved.

In the Python SDK, the dataset is downloaded as a Dataset object. In the Node SDK, it is an object of type DatasetWithData.
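The effect of the note above on a downloaded datum can be pictured with the following sketch, which removes every field that the manifest marks as ground truth. It only illustrates the resulting shape of the data; strip_ground_truth is a hypothetical helper, not an SDK function, and it says nothing about how the service implements the filtering.

```python
def strip_ground_truth(data, manifest):
    """Return copies of the data without the fields marked as ground truth.

    Hypothetical helper for illustration only; not part of the Relari SDK.
    """
    hidden = {
        name
        for name, spec in manifest["fields"].items()
        if spec.get("ground_truth")  # a missing flag is treated as not ground truth
    }
    return [{k: v for k, v in datum.items() if k not in hidden} for datum in data]


# Manifest fields as in the example manifest on this page (descriptions omitted).
example_manifest = {
    "fields": {
        "uid": {"type": "UID"},
        "question": {"type": "str", "ground_truth": False},
        "ground_truth": {"type": "List[str]", "ground_truth": True},
        "ground_truth_context": {"type": "List[str]", "ground_truth": True},
    }
}

# Placeholder datum, not real dataset content.
uploaded = [{
    "uid": "q1",
    "question": "placeholder question",
    "ground_truth": ["placeholder answer"],
    "ground_truth_context": ["placeholder context"],
}]

downloaded = strip_ground_truth(uploaded, example_manifest)
```

With the example manifest, only the uid and question fields remain in each downloaded datum.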