Datasets
Testing Datasets are the core of the data-driven development process. You can create and define your own dataset with granular details for evaluation purposes.
A dataset is created with a name and a description, and optionally a list of tags to help you organize your datasets. It will also contain a list of data, and a manifest describing this data.
Each datum should have a uid field holding a unique identifier for the datum. If the uid is not provided, the system will generate one for you.
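If you prefer to control the identifiers yourself rather than rely on the generated ones, a minimal sketch along the following lines may help. The helper name ensure_uids and the sample records are hypothetical and not part of the relari SDK; the sketch only fills in missing uid values and rejects duplicates before upload.

```python
import uuid

def ensure_uids(data):
    """Return a copy of `data` in which every datum carries a unique `uid`."""
    seen = set()
    result = []
    for datum in data:
        datum = dict(datum)  # shallow copy so the caller's dicts are untouched
        uid = datum.get("uid")
        if uid is None:
            uid = uuid.uuid4().hex  # generate an identifier when none was provided
        elif uid in seen:
            raise ValueError(f"duplicate uid: {uid!r}")
        seen.add(uid)
        datum["uid"] = uid
        result.append(datum)
    return result

# Placeholder records, for illustration only.
checked = ensure_uids([
    {"uid": "q-001", "question": "Example question 1"},
    {"question": "Example question 2"},  # no uid: one is generated
])
```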
Currently, the Node SDK cannot infer the dataset's manifest as the Python SDK does; however, creating the manifest by hand is straightforward, as you can see in the examples below.
Creating a Dataset
To create a dataset, you need to provide a name and a description. You can also provide a list of tags to help you organize your datasets.
Suppose we have a list of dictionaries, each representing a datum, in the data variable.
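As a rough illustration of that shape, the data variable could look like the sketch below. The field names are borrowed from the manifest example on this page and the values are placeholders, so treat this as an assumption about a typical Q&A dataset rather than a required schema.

```python
# Placeholder records; field names follow the manifest example on this page.
data = [
    {
        "uid": "q-001",
        "question": "Example question 1",
        "ground_truth": ["Example answer 1"],
        "ground_truth_context": ["Example context passage 1"],
    },
    {
        # uid omitted on purpose: the system is described as generating one
        "question": "Example question 2",
        "ground_truth": ["Example answer 2"],
        "ground_truth_context": ["Example context passage 2"],
    },
]

# Collect every field name that appears in at least one datum.
field_names = sorted(set().union(*data))
print(field_names)
```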
To create a dataset, you can use the following:
- Python
from relari import Dataset
dataset = Dataset.from_data(data)
- Node
await relariClient.datasets.upload(projectId, data, manifest)
Manifest
A dataset has an associated manifest. The manifest is a YAML or JSON file that contains:
- the dataset name
- the dataset description
- the list of fields of each datum in the dataset, with their types and descriptions (ground truth fields are marked as such)
- the dataset license
An example of a manifest file is:
- YAML
name: Paul Graham's Essays
description: Paul Graham's Essays Q&A
format: jsonl
license: MIT
fields:
  uid:
    description: Unique identifier for the filing
    type: UID
  question:
    description: The question asked by the user
    type: str
    ground_truth: false
  ground_truth:
    description: The correct answer to the question
    type: List[str]
    ground_truth: true
  ground_truth_context:
    description: Ground truth contexts
    type: List[str]
    ground_truth: true
- JSON
{
  "name": "Paul Graham's Essays",
  "description": "Paul Graham's Essays Q&A",
  "format": "jsonl",
  "license": "MIT",
  "fields": {
    "uid": {
      "type": "str",
      "is_visible": true,
      "description": "Unique identifier for the filing",
      "ground_truth": false
    },
    "question": {
      "type": "str",
      "is_visible": true,
      "description": "The question asked by the user",
      "ground_truth": false
    },
    "ground_truth_answers": {
      "type": "str",
      "is_visible": false,
      "description": "The correct answer to the question",
      "ground_truth": true
    },
    "ground_truth_context": {
      "type": "list",
      "is_visible": false,
      "description": "The context needed to answer the question",
      "ground_truth": true
    }
  }
}
The manifest is not mandatory, but providing it is recommended because it helps users understand the dataset. If it is not provided, the system will generate a manifest for you, assuming that none of the fields are ground truth fields.
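To make the default described above concrete, here is a minimal sketch that builds a manifest from a list of records with every field marked as non ground truth, then flags one field by hand. The function default_manifest is hypothetical, and the type names are plain Python type names; the type inference the service actually performs may differ.

```python
import json

def default_manifest(name, description, data, license_name="MIT"):
    """Build a manifest dict in which no field is marked as ground truth."""
    fields = {}
    for datum in data:
        for key, value in datum.items():
            # Record each field once, using the Python type name of the first value seen.
            fields.setdefault(key, {
                "type": type(value).__name__,
                "description": "",
                "ground_truth": False,
            })
    return {
        "name": name,
        "description": description,
        "format": "jsonl",
        "license": license_name,
        "fields": fields,
    }

# Placeholder record, for illustration only.
sample = [{
    "uid": "q-001",
    "question": "Example question 1",
    "ground_truth_context": ["Example context passage 1"],
}]
manifest = default_manifest("Paul Graham's Essays", "Paul Graham's Essays Q&A", sample)

# Ground truth fields have to be flagged explicitly.
manifest["fields"]["ground_truth_context"]["ground_truth"] = True
print(json.dumps(manifest, indent=2))
```

Starting from an all-false default and flipping individual fields keeps the manual step small, which can be handy when the manifest has to be written by hand, as with the Node SDK.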
Uploading a Dataset
To upload a dataset, you can use the CLI:
relari-cli datasets new PROJECT_ID DATASET_FOLDER_OR_FILE
and check the status of the upload with:
relari-cli datasets ls PROJECT_ID
Note that the name of the dataset is specified in the manifest file.
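If your records live in memory, you may first need to write them to a file before passing it to the CLI. The sketch below writes one JSON object per line, on the assumption (suggested by format: jsonl in the manifest example, but not confirmed on this page) that the CLI accepts a JSON Lines file. The helper write_jsonl and the output location are hypothetical.

```python
import json
import tempfile
from pathlib import Path

def write_jsonl(data, path):
    """Write each datum as one JSON object per line and return the file path."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as f:
        for datum in data:
            f.write(json.dumps(datum, ensure_ascii=False) + "\n")
    return path

# Placeholder records, written to a temporary folder for illustration.
records = [
    {"uid": "q-001", "question": "Example question 1"},
    {"uid": "q-002", "question": "Example question 2"},
]
out_file = write_jsonl(records, Path(tempfile.mkdtemp()) / "dataset.jsonl")
print(out_file)
```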
Or, alternatively, you can use the SDK:
- Python
from relari import RelariClient, Dataset
client = RelariClient()
proj = client.projects.find(name="RAG")
dataset = Dataset("data/paul_graham/dataset") # dataset folder or file
# or alternatively:
# dataset = Dataset.from_data(data)
dataset.name = "Paul Graham" # set the dataset name
info = client.datasets.create(project_id=proj["id"], dataset=dataset)
print("Dataset ID:", info)
- Node
import { promises as fs } from 'fs'
// relariClient is assumed to be an already initialized client
// the project name is passed positionally; JavaScript has no keyword arguments
const proj = await relariClient.projects.find("RAG")
const data = JSON.parse(await fs.readFile('path/to/data', 'utf8'))
const manifest = JSON.parse(await fs.readFile('path/to/manifest', 'utf8'))
const created = await relariClient.datasets.upload(
  proj.id,
  data,
  manifest
)
Downloading a Dataset
To download a dataset, you can use the CLI or the SDK:
- CLI
relari-cli datasets get DATASET_ID OUT_DIR
- Python
dataset = client.datasets.get(DATASET_ID)
- Node
const dataset = await relariClient.datasets.get(DATASET_ID)
Fields marked as ground truth are not retrieved.
In the Python SDK, the dataset is downloaded as a Dataset object. In Node, it is an object of type DatasetWithData.
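To get a feel for what a downloaded dataset contains once ground truth fields are left out, the following sketch applies the same rule locally using a manifest. It does not call the SDK; visible_fields is a hypothetical helper and the records are placeholders.

```python
def visible_fields(data, manifest):
    """Drop every field that the manifest marks as ground truth."""
    hidden = {
        name
        for name, spec in manifest["fields"].items()
        if spec.get("ground_truth")
    }
    return [
        {key: value for key, value in datum.items() if key not in hidden}
        for datum in data
    ]

# Manifest fragment mirroring the example on this page.
example_manifest = {
    "fields": {
        "uid": {"type": "str", "ground_truth": False},
        "question": {"type": "str", "ground_truth": False},
        "ground_truth_context": {"type": "list", "ground_truth": True},
    }
}
uploaded = [{
    "uid": "q-001",
    "question": "Example question 1",
    "ground_truth_context": ["Example context passage 1"],
}]
downloaded_like = visible_fields(uploaded, example_manifest)
print(downloaded_like)
```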