Metrics

tip

Metrics allow you to compute metric scores over a single data point or a small batch of data. They are most suitable for Runtime Monitor / Online Evaluation purposes.

If you want to run offline evaluation over a given (golden) dataset, please use Experiments (Evaluation).

Through the Python SDK, the Relari API offers two ways to run evaluation metrics:

  • Synchronous metrics are computed in real time and can be used to calculate metric scores for a single datum or a small dataset.
  • Asynchronous metrics are computed in the background.

We recommend using asynchronous metrics whenever you are running evaluation experiments over a large dataset.

note

The results of synchronous metrics are not saved on the server. If you want to store the results, use asynchronous metrics, whose results are saved as evaluations (see Evaluations for more details).

Synchronous Metrics

Synchronous metrics are computed in real-time and are useful for small batches of data.

User-provided data

In cases where you want to run evaluation over your own datum or dataset that already has all the fields necessary for evaluation (e.g. ground_truth_context), you can use the following:

Single Datum

For a single datum you can use the following code snippet (the example uses PrecisionRecallF1, which computes the precision, recall, and F1 score for a retrieval system and requires ground_truth_context and retrieved_context as arguments):

from relari import RelariClient
from relari import Metric

client = RelariClient()

res = client.metrics.compute(
    Metric.PrecisionRecallF1,
    args={
        "ground_truth_context": [
            "Software as a service is a way of delivering applications remotely over the internet."
        ],
        "retrieved_context": [
            "Software as a service (SaaS) is a way of delivering applications remotely over the internet instead of locally on machines (known as “on-premise” software)."
        ],
    },
)
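To inspect the computed scores you can simply print the result; the exact structure depends on the metric, but for PrecisionRecallF1 it should contain the precision, recall, and F1 values:

print(res)  # precision, recall, and F1 scores for the single datum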

Batch Calculate

Alternatively, you can pass a list of data points to compute the metrics in a single call.

data = [
    {
        "ground_truth_context": [
            "Software as a service is a way of delivering applications remotely over the internet."
        ],
        "retrieved_context": [
            "Software as a service (SaaS) is a way of delivering applications remotely over the internet instead of locally on machines (known as “on-premise” software)."
        ],
    },
    {
        "ground_truth_context": [
            "Python's asyncio module provides a framework for writing concurrent code using async/await syntax."
        ],
        "retrieved_context": [
            "The asyncio module in Python is used for writing single-threaded concurrent code using coroutines, multiplexing I/O access over sockets and other resources, running network clients and servers, and other related primitives."
        ],
    },
]

res = client.metrics.compute(  # supports batch compute
    Metric.PrecisionRecallF1,
    args=data,
)
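For a batched call, the return value is iterable, with one entry of scores per input datum (the batch example later on this page prints it the same way); a minimal way to inspect it:

for scores in res:
    print(scores)  # one set of scores per input datum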

Dataset (hosted in Relari Cloud)

In cases where you have a dataset hosted in Relari Cloud, you can pass the dataset ID, the sample_id (or label), and the corresponding outputs from your LLM application to calculate the scores.

from relari import RelariClient
from relari import Metric
from relari.core.types import DatasetDatum

client = RelariClient()

res = client.metrics.compute(
    Metric.PrecisionRecallF1,
    dataset="665ddb7bbdce1320a58e2cec",  # Get the dataset ID from the CLI
    args=DatasetDatum(
        label="22",  # sample_id of a datum
        data={
            "retrieved_context": [
                "The best answer is earnest. If you're earnest you avoid not just affectation but a whole set of similar vices."
            ]
        },
    ),
)
print(res)

Note that here you only have to provide the fields that are not available in the dataset. For example, the PrecisionRecallF1 metric here only requires retrieved_context.

Similarly, you can pass a list of DatasetDatum as args to batch the computation of the metrics.
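A minimal sketch of such a batched call, reusing client, Metric, DatasetDatum, and the dataset ID from the example above; the label values and retrieved contexts here are placeholders for sample IDs and outputs from your own application:

batch = [
    DatasetDatum(label="22", data={"retrieved_context": ["..."]}),
    DatasetDatum(label="23", data={"retrieved_context": ["..."]}),
]

res = client.metrics.compute(
    Metric.PrecisionRecallF1,
    dataset="665ddb7bbdce1320a58e2cec",  # same dataset ID as above
    args=batch,  # a list of DatasetDatum batches the computation
)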

Asynchronous Metrics

To run asynchronous metrics, you have to start a new evaluation experiment for a particular project.

User-provided data

import os
from relari import RelariClient
from relari import Metric
from relari.core.types import DatasetDatum
import json

client = RelariClient()

data = [
    {
        "ground_truth_context": [
            "Software as a service is a way of delivering applications remotely over the internet."
        ],
        "retrieved_context": [
            "Software as a service (SaaS) is a way of delivering applications remotely over the internet instead of locally on machines (known as “on-premise” software)."
        ],
    },
    {
        "ground_truth_context": [
            "Python's asyncio module provides a framework for writing concurrent code using async/await syntax."
        ],
        "retrieved_context": [
            "The asyncio module in Python is used for writing single-threaded concurrent code using coroutines, multiplexing I/O access over sockets and other resources, running network clients and servers, and other related primitives."
        ],
    },
]

eval_id = client.evaluations.submit(
    project_id=PROJECT_ID,  # the ID of your Relari project
    name=None,
    metadata=dict(),
    pipeline=[Metric.PrecisionRecallF1, Metric.RankedRetrievalMetrics],
    data=data,
)

This will return the evaluation ID. You can check the status of the evaluation using the CLI

relari-cli evaluations status EVALUATION_ID

or download the evaluation results using the SDK

eval_data = client.evaluations.get(eval_id)
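If you want to persist the downloaded results locally, a minimal sketch, assuming eval_data is JSON-serializable (the async example above already imports json):

with open("evaluation_results.json", "w") as f:
    json.dump(eval_data, f, indent=2)  # assumes eval_data is JSON-serializable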

Batch metrics

You can also batch a list of data points and compute the metrics in a single call. Since this call is synchronous as well, it is recommended to use it only for small batches (e.g. fewer than 10 data points).

data = [
    {
        "ground_truth_context": [
            "Software as a service is a way of delivering applications remotely over the internet."
        ],
        "retrieved_context": [
            "Software as a service (SaaS) is a way of delivering applications remotely over the internet instead of locally on machines (known as “on-premise” software)."
        ],
    },
    {
        "ground_truth_context": [
            "Python's asyncio module provides a framework for writing concurrent code using async/await syntax."
        ],
        "retrieved_context": [
            "The asyncio module in Python is used for writing single-threaded concurrent code using coroutines, multiplexing I/O access over sockets and other resources, running network clients and servers, and other related primitives."
        ],
    },
]

res = client.metrics.compute(  # supports batch compute
    Metric.PrecisionRecallF1,
    args=data,
)
for x in res:
    print(x)

Since this is a synchronous endpoint, be mindful of the timeout: use it for small batches and set the client timeout (client.timeout = X) to roughly 4-6 seconds per datum.
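A minimal sketch of that pattern, reusing client, data, and Metric from the snippets above and splitting a larger list of data points into small synchronous batches with an increased client timeout (the batch size and timeout values here are illustrative, not prescribed by the SDK):

client.timeout = 30  # seconds; roughly 4-6 seconds per datum in each batch

BATCH_SIZE = 5  # keep synchronous batches small
results = []
for i in range(0, len(data), BATCH_SIZE):
    chunk = data[i : i + BATCH_SIZE]
    res = client.metrics.compute(
        Metric.PrecisionRecallF1,
        args=chunk,
    )
    results.extend(res)  # assumes one score entry is returned per datum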