Experiments (Evaluations)

Evaluations are used to run asynchronous metric computations over datasets. You can create a new evaluation, list all evaluations, and get the status of an evaluation.

Evaluations on user-provided data

import os
from relari import RelariClient
from relari import Metric

client = RelariClient()

data = [
    {
        "ground_truth_context": [
            "Software as a service is a way of delivering applications remotely over the internet."
        ],
        "retrieved_context": [
            "Software as a service (SaaS) is a way of delivering applications remotely over the internet instead of locally on machines (known as “on-premise” software)."
        ],
    },
    {
        "ground_truth_context": [
            "Python's asyncio module provides a framework for writing concurrent code using async/await syntax."
        ],
        "retrieved_context": [
            "The asyncio module in Python is used for writing single-threaded concurrent code using coroutines, multiplexing I/O access over sockets and other resources, running network clients and servers, and other related primitives."
        ],
    },
]

eval_id = client.evaluations.submit(
    project_id="665ddb7bbdce1320a58e2ce7",
    name=None,
    metadata=dict(),
    pipeline=[Metric.PrecisionRecallF1, Metric.RankedRetrievalMetrics],
    data=data,
)

The pipeline field specifies the metrics to compute. The data field is a list of dictionaries, each representing a datum; every datum must include all the fields required by the selected metrics.

The snippet above returns the evaluation ID; you can check the status of the evaluation using the CLI or the SDK.

relari-cli evaluations status EVALUATION_ID
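With the SDK, a status check could look like the sketch below. The method name client.evaluations.status is an assumption made for illustration (only the CLI command is confirmed above); check the SDK reference for the exact call.

# Sketch only: `client.evaluations.status` is an assumed method name, not a confirmed signature.
status = client.evaluations.status(eval_id)
print(status)  # the evaluation is done when the status is COMPLETED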

If the evaluation's status is COMPLETED, you can download the results with the CLI or the SDK.

relari-cli evaluations get EVALUATION_ID

which will save a JSON file with the results in the current directory.

Or, using the SDK:

eval_data = client.evaluations.get(EVALUATION_ID)

You will find the results of the evaluation in eval_data['results']. It is a dictionary indexed by uid (the unique identifier of the datum); each element contains:

  • datum: Dictionary with the datum
  • metrics: Dictionary with the metrics computed
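For example, you could iterate over the results like this (a minimal sketch, assuming only the structure described above):

# Minimal sketch: walk the results dictionary described above.
for uid, entry in eval_data["results"].items():
    print(f"datum {uid}:")
    print("  datum:", entry["datum"])
    print("  metrics:", entry["metrics"])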

Evaluation over Relari-hosted Datasets

You can also run evaluations over Relari-hosted datasets. The process is similar to the one described above, but you provide the dataset ID and wrap each pipeline result in a DatasetDatum.

from relari import RelariClient
from relari import Metric
from relari.core.types import DatasetDatum

client = RelariClient()

data = [
    DatasetDatum(
        label="22",  # unique identifier for the datum
        data={
            "retrieved_context": [
                "Software as a service (SaaS) is a way of delivering applications remotely over the internet instead of locally on machines (known as “on-premise” software)."
            ],
        },
    ),
    DatasetDatum(
        label="23",  # unique identifier for the datum
        data={
            "retrieved_context": [
                "The asyncio module in Python is used for writing single-threaded concurrent code using coroutines, multiplexing I/O access over sockets and other resources, running network clients and servers, and other related primitives."
            ],
        },
    ),
]

eval_id = client.evaluations.submit(
    project_id=PROJECT_ID,
    dataset=DATASET_ID,
    name=None,
    metadata=dict(),  # optional metadata
    pipeline=[Metric.PrecisionRecallF1, Metric.RankedRetrievalMetrics],
    data=data,
)

This time we used DatasetDatum to specify the pipeline results. The label field is the unique identifier of the datum in the dataset. We also did not have to provide the ground truth context, since it is already stored in the dataset.
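As before, once the evaluation's status is COMPLETED you can download the results with the CLI or the SDK; for example, reusing the documented get call:

# Retrieve the results for the dataset-based evaluation (same structure as described above).
eval_data = client.evaluations.get(eval_id)
for uid, entry in eval_data["results"].items():
    print(uid, entry["metrics"])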