Experiments (Evaluations)

Evaluations are used to run asynchronous metric computations over datasets. You can create a new evaluation, list all evaluations, and get the status of an evaluation.

Evaluations on user-provided data

import os
from relari import RelariClient
from relari import Metric

client = RelariClient()

data = [
    {
        "ground_truth_context": [
            "Software as a service is a way of delivering applications remotely over the internet."
        ],
        "retrieved_context": [
            "Software as a service (SaaS) is a way of delivering applications remotely over the internet instead of locally on machines (known as “on-premise” software)."
        ],
    },
    {
        "ground_truth_context": [
            "Python's asyncio module provides a framework for writing concurrent code using async/await syntax."
        ],
        "retrieved_context": [
            "The asyncio module in Python is used for writing single-threaded concurrent code using coroutines, multiplexing I/O access over sockets and other resources, running network clients and servers, and other related primitives."
        ],
    },
]

eval_id = client.evaluations.submit(
    project_id="665ddb7bbdce1320a58e2ce7",
    name=None,
    metadata=dict(),
    pipeline=[Metric.PrecisionRecallF1, Metric.RankedRetrievalMetrics],
    data=data,
)

The pipeline field specifies the metrics to compute. The data field is a list of dictionaries, each representing a datum; every datum must include all the fields required by the selected metrics.

The snippet above returns the evaluation ID; you can check the status of the evaluation using the CLI or the SDK.

relari-cli evaluations status EVALUATION_ID
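With the SDK, a status check could look like the sketch below. The method name client.evaluations.status is an assumption made for illustration (only the CLI command is confirmed above); check the SDK reference for the exact call.

# Sketch only: `client.evaluations.status` is an assumed method name, not a confirmed signature.
status = client.evaluations.status(eval_id)
print(status)  # the evaluation is done when the status is COMPLETED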

If the evaluation's status is COMPLETED, you can download the results with the CLI or the SDK.

relari-cli evaluations get EVALUATION_ID

which will save a JSON file with the results in the current directory.

Or, using the SDK:

eval_data = client.evaluations.get(EVALUATION_ID)

You will find the results of the evaluation in eval_data['results']. It is a dictionary indexed by uid (the unique identifier of the datum); each element contains:

  • datum: Dictionary with the datum
  • metrics: Dictionary with the metrics computed
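For example, you could iterate over the results like this (a minimal sketch, assuming only the structure described above):

# Minimal sketch: walk the results dictionary described above.
for uid, entry in eval_data["results"].items():
    print(f"datum {uid}:")
    print("  datum:", entry["datum"])
    print("  metrics:", entry["metrics"])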

Evaluation over Relari-hosted Datasets

You can also run evaluations over Relari-hosted datasets. The process is similar to the one described above, but you provide the dataset ID and wrap each pipeline result in a DatasetDatum.

from relari import RelariClient
from relari import Metric
from relari.core.types import DatasetDatum

client = RelariClient()

data = [
    DatasetDatum(
        label="22",  # unique identifier for the datum
        data={
            "retrieved_context": [
                "Software as a service (SaaS) is a way of delivering applications remotely over the internet instead of locally on machines (known as “on-premise” software)."
            ],
        },
    ),
    DatasetDatum(
        label="23",  # unique identifier for the datum
        data={
            "retrieved_context": [
                "The asyncio module in Python is used for writing single-threaded concurrent code using coroutines, multiplexing I/O access over sockets and other resources, running network clients and servers, and other related primitives."
            ],
        },
    ),
]

eval_id = client.evaluations.submit(
    project_id=PROJECT_ID,
    dataset=DATASET_ID,
    name=None,
    metadata=dict(),  # optional metadata
    pipeline=[Metric.PrecisionRecallF1, Metric.RankedRetrievalMetrics],
    data=data,
)

This time we used DatasetDatum to specify the pipeline results. The label field is the unique identifier of the datum in the dataset. We also did not have to provide the ground truth context, since it is already stored in the dataset.
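As before, once the evaluation's status is COMPLETED you can download the results with the CLI or the SDK; for example, reusing the documented get call:

# Retrieve the results for the dataset-based evaluation (same structure as described above).
eval_data = client.evaluations.get(eval_id)
for uid, entry in eval_data["results"].items():
    print(uid, entry["metrics"])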