Datasets

Testing datasets are the core of a data-driven development process. You can define your own dataset, down to the individual fields, for evaluation purposes.

A dataset is created with a name and a description, and optionally a list of tags to help you organize your datasets. It also contains a list of data and a manifest describing that data.

Each datum should have a uid field containing a unique identifier for the datum. If the uid is not provided, the system will generate a unique identifier for you.
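As a rough illustration, the data can be thought of as a plain Python list of dictionaries. The sketch below is an assumption, not SDK behavior: the field names are borrowed from the manifest example on this page, the sample values are placeholders, and `ensure_uids` is a hypothetical local helper that fills in missing uid values with `uuid4` and rejects duplicates before anything is uploaded.

```python
import uuid


def ensure_uids(data):
    """Return a copy of data in which every datum has a unique 'uid'.

    Hypothetical helper for illustration; not part of the relari SDK.
    """
    seen = set()
    result = []
    for datum in data:
        datum = dict(datum)  # shallow copy so the caller's dicts are untouched
        uid = datum.get("uid")
        if uid is None:
            uid = uuid.uuid4().hex  # locally generated identifier (assumption)
            datum["uid"] = uid
        if uid in seen:
            raise ValueError(f"duplicate uid: {uid}")
        seen.add(uid)
        result.append(datum)
    return result


# Placeholder data: the first datum has an explicit uid, the second does not.
data = ensure_uids([
    {"uid": "q-001", "question": "Example question 1"},
    {"question": "Example question 2"},
])
```

Assigning identifiers yourself keeps them stable across uploads, which may make it easier to match results back to your own records.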

warning

Currently, the Node SDK cannot infer the dataset's manifest as the Python SDK does; however, writing the manifest by hand is straightforward, as the examples below show.

Creating a Dataset

To create a dataset, you need to provide a name and a description. You can also provide a list of tags to help you organize your datasets.

Suppose the data variable holds a list of dictionaries, each representing a datum. To create a dataset, you can use the following:


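Python

from relari import Dataset

# Build a Dataset from the in-memory list of dictionaries
dataset = Dataset.from_data(data)

Node

// The Node SDK does not infer the manifest, so it is passed explicitly
await relariClient.datasets.upload(projectId, data, manifest)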

Manifest

A dataset has an associated manifest. The manifest is a YAML or JSON file that contains:

the dataset name
the dataset description
the list of fields of each datum in the dataset, with their types and descriptions (ground truth fields are marked as such)
the dataset license

An example of a manifest file is:

YAML

name: Paul Graham's Essays
description: Paul Graham's Essays Q&A
format: jsonl
license: MIT
fields:
  uid:
    description: Unique identifier for the filing
    type: UID
  question:
    description: The question asked by the user
    type: str
    ground_truth: false
  ground_truth:
    description: The correct answer to the question
    type: List[str]
    ground_truth: true
  ground_truth_context:
    description: Ground truth contexts
    type: List[str]
    ground_truth: true

JSON

{
  "name": "Paul Graham's Essays",
  "description": "Paul Graham's Essays Q&A",
  "format": "jsonl",
  "license": "MIT",
  "fields": {
    "uid": {
      "type": "str",
      "is_visible": true,
      "description": "Unique identifier for the filing",
      "ground_truth": false
    },
    "question": {
      "type": "str",
      "is_visible": true,
      "description": "The question asked by the user",
      "ground_truth": false
    },
    "ground_truth_answers": {
      "type": "str",
      "is_visible": false,
      "description": "The correct answer to the question",
      "ground_truth": true
    },
    "ground_truth_context": {
      "type": "list",
      "is_visible": false,
      "description": "The context needed to answer the question",
      "ground_truth": true
    }
  }
}

The manifest is not mandatory, but providing one is recommended because it helps users understand the dataset. If it is not provided, the system will generate a manifest for you, assuming that none of the fields are ground truth fields.
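Because a generated manifest treats every field as non ground truth, it can be worth writing one explicitly. The following is a minimal sketch, assuming the key layout of the JSON example above. `build_manifest` is a hypothetical helper, not an SDK function, and taking type names from Python's `type(...).__name__` is an assumption that may not match the type names the platform expects.

```python
import json


def build_manifest(name, description, data, license_name, ground_truth_fields=()):
    """Build a manifest-shaped dict from a list of datum dicts.

    Hypothetical helper for illustration; not part of the relari SDK.
    """
    fields = {}
    for datum in data:
        for key, value in datum.items():
            if key in fields:
                continue  # the first occurrence of a field decides its type name
            fields[key] = {
                "type": type(value).__name__,  # e.g. "str" or "list" (assumption)
                "description": "",  # fill in by hand afterwards
                "ground_truth": key in ground_truth_fields,
            }
    return {
        "name": name,
        "description": description,
        "format": "jsonl",
        "license": license_name,
        "fields": fields,
    }


# Placeholder datum using field names from the manifest example.
sample = [
    {
        "uid": "q-001",
        "question": "Example question",
        "ground_truth_context": ["Example context"],
    }
]

manifest = build_manifest(
    "Paul Graham's Essays",
    "Paul Graham's Essays Q&A",
    sample,
    license_name="MIT",
    ground_truth_fields={"ground_truth_context"},
)
print(json.dumps(manifest, indent=2))
```

Only the fields named in `ground_truth_fields` are flagged; everything else defaults to `ground_truth: false`, which mirrors the default described above.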

Uploading a Dataset

To upload a dataset, you can use the CLI:

relari-cli datasets new PROJECT_ID DATASET_FOLDER_OR_FILE

and check the status of the upload with:

relari-cli datasets ls PROJECT_ID

Note that the name of the dataset is taken from the manifest file.

Or, alternatively, you can use the SDK:

Python

from relari import RelariClient, Dataset

client = RelariClient()

# Look up the target project by name
proj = client.projects.find(name="RAG")

dataset = Dataset("data/paul_graham/dataset") # dataset folder or file
# or alternatively:
# dataset = Dataset.from_data(data)

dataset.name = "Paul Graham" # set the dataset name

info = client.datasets.create(project_id=proj["id"], dataset=dataset)
print("Dataset ID:", info)

Node

import { promises as fs } from 'fs'

// relariClient is assumed to be an already initialized client instance
const proj = await relariClient.projects.find("RAG")

// Read the data and the manifest as UTF-8 text before parsing
const data = JSON.parse(await fs.readFile('path/to/data', 'utf8'))
const manifest = JSON.parse(await fs.readFile('path/to/manifest', 'utf8'))

const created = await relariClient.datasets.upload(
  proj.id,
  data,
  manifest
)

Downloading a Dataset

To download a dataset, you can use the CLI or the SDK:

CLI

relari-cli datasets get DATASET_ID OUT_DIR

Python

dataset = client.datasets.get(DATASET_ID)

Node

const dataset = await relariClient.datasets.get(DATASET_ID)

note

Fields marked as ground truth are not retrieved.

In the Python SDK, the dataset is downloaded as a Dataset object. In Node, it is an object of type DatasetWithData.
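To make the note about ground truth fields concrete, the sketch below mimics locally what a datum looks like once those fields are left out. It is an illustration only, not SDK or server code: `visible_fields` is a hypothetical helper, and the manifest and datum are placeholders shaped like the examples on this page.

```python
def visible_fields(datum, manifest):
    """Return the datum without the fields the manifest marks as ground truth.

    Hypothetical helper for illustration; not part of the relari SDK.
    """
    fields = manifest["fields"]
    return {
        key: value
        for key, value in datum.items()
        # Fields missing from the manifest are treated as non ground truth
        if not fields.get(key, {}).get("ground_truth", False)
    }


# Placeholder manifest and datum shaped like the examples on this page.
example_manifest = {
    "fields": {
        "uid": {"type": "str", "ground_truth": False},
        "question": {"type": "str", "ground_truth": False},
        "ground_truth_context": {"type": "list", "ground_truth": True},
    }
}
example_datum = {
    "uid": "q-001",
    "question": "Example question",
    "ground_truth_context": ["Example context"],
}

downloaded_view = visible_fields(example_datum, example_manifest)
print(downloaded_view)
```

In this sketch only uid and question survive, so any code that consumes a downloaded dataset should not rely on ground truth fields being present.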
