Metrics
Metrics allow you to compute metric scores over a single data point or a small batch of data. This makes them most suitable for Runtime Monitor / Online Evaluation purposes.
If you want to run offline evaluation over a given (golden) dataset, please use Experiments (Evaluation).
Through the Python and Node SDKs, the Relari API offers two ways to run evaluation metrics:
- Synchronous metrics are computed in real-time and can be used to calculate metric scores for a single datum or a small dataset.
- Asynchronous metrics are computed in the background.
We recommend using asynchronous metrics whenever you are running evaluation experiments over a large dataset.
The result of a synchronous metric is not saved on the server. If you want to store the results, use asynchronous metrics, whose results are saved as evaluations (see Evaluations for more details).
Synchronous Metrics
Synchronous metrics are computed in real-time and are useful for small batches of data.
User-provided data
In cases where you want to run evaluation over your own datum or dataset that already has all the fields necessary to evaluate (e.g. ground_truth_context), you can use the following:
Single Datum
For a single datum you can use the following code snippet. The example uses PrecisionRecallF1, which computes the precision, recall, and F1 score for a retrieval system and requires ground_truth_context and retrieved_context as arguments:
- Python
- Node
from relari import RelariClient
from relari import Metric
client = RelariClient()
res = client.metrics.compute(
Metric.PrecisionRecallF1,
args={
"ground_truth_context": [
"Software as a service is a way of delivering applications remotely over the internet."
],
"retrieved_context": [
"Software as a service (SaaS) is a way of delivering applications remotely over the internet instead of locally on machines (known as “on-premise” software)."
],
},
)
import { RelariClient, PrecisionRecallF1 } from "relari-sdk"
const relariClient = new RelariClient()
const response = await relariClient.metrics.computeMetric(
new PrecisionRecallF1({
ground_truth_context: [
"Software as a service is a way of delivering applications remotely over the internet."
],
retrieved_context: [
"Software as a service (SaaS) is a way of delivering applications remotely over the internet instead of locally on machines (known as “on-premise” software)."
],
}),
)
Batch Calculate
You can also pass a list of data points to compute the metrics in a single call.
- Python
- Node
data = [
{
"ground_truth_context": [
"Software as a service is a way of delivering applications remotely over the internet."
],
"retrieved_context": [
"Software as a service (SaaS) is a way of delivering applicabtions remotely over the internet instead of locally on machines (known as “on-premise” software)."
],
},
{
"ground_truth_context": [
"Python's asyncio module provides a framework for writing concurrent code using async/await syntax."
],
"retrieved_context": [
"The asyncio module in Python is used for writing single-threaded concurrent code using coroutines, multiplexing I/O access over sockets and other resources, running network clients and servers, and other related primitives."
],
},
]
res = client.metrics.compute( # supports batch compute
Metric.PrecisionRecallF1,
args=data,
)
const response = await relariClient.metrics.computeMetrics(
PrecisionRecallF1.batch([
{
ground_truth_context: [
"Software as a service is a way of delivering applications remotely over the internet."
],
retrieved_context: [
"Software as a service (SaaS) is a way of delivering applications remotely over the internet instead of locally on machines (known as “on-premise” software)."
],
},
{
ground_truth_context: [
"Python's asyncio module provides a framework for writing concurrent code using async/await syntax."
],
retrieved_context: [
"The asyncio module in Python is used for writing single-threaded concurrent code using coroutines, multiplexing I/O access over sockets and other resources, running network clients and servers, and other related primitives."
],
}
]),
)
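The batch call returns one set of scores per input datum. As a minimal Python sketch of inspecting the results (this assumes the returned value is a list in the same order as the input data):

for datum, scores in zip(data, res):  # assumption: results are aligned with the input order
    print(scores)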
Dataset (hosted in Relari Cloud)
In cases where you have a dataset hosted in Relari Cloud, you can pass the dataset ID, the datum label (uid), and the corresponding outputs from your LLM application to calculate the scores.
- Python
- Node
from relari import RelariClient
from relari import Metric
from relari.core.types import DatasetDatum
client = RelariClient()
res = client.metrics.compute(
Metric.PrecisionRecallF1,
dataset="665ddb7bbdce1320a58e2cec", # Get Dataset ID from CLI
args=DatasetDatum(
label="22", # uid of a datum
data={
"retrieved_context": [
"The best answer is earnest. If you're earnest you avoid not just affectation but a whole set of similar vices."
]
},
),
)
print(res)
// Single metric
await relariClient.metrics.computeMetric(
new PrecisionRecallF1({
label: "22",
retrieved_context: [
"Software as a service (SaaS) is a way of delivering applications remotely over the internet instead of locally on machines (known as “on-premise” software)."
],
}),
datasetId,
)
// Batch
await relariClient.metrics.computeMetrics(
PrecisionRecallF1.batch([
{
label: "22",
retrieved_context: [
"Software as a service (SaaS) is a way of delivering applications remotely over the internet instead of locally on machines (known as “on-premise” software)."
],
},
{
label: "23",
retrieved_context: [
"The asyncio module in Python is used for writing single-threaded concurrent code using coroutines, multiplexing I/O access over sockets and other resources, running network clients and servers, and other related primitives."
],
}
]),
datasetId
)
When using an existing dataset, only the information not already available in the dataset must be provided. For example, the PrecisionRecallF1 metric here requires only retrieved_context; earlier, when no dataset was indicated, ground_truth_context had to be provided as well.
Similarly, you can pass a list of DatasetDatum as args to batch the computation of the metrics.
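For example, a minimal Python sketch of a batched dataset computation (reusing the client, Metric, and DatasetDatum imports from the snippet above; the labels and retrieved contexts here are illustrative):

res = client.metrics.compute(
    Metric.PrecisionRecallF1,
    dataset="665ddb7bbdce1320a58e2cec",  # Dataset ID, as above
    args=[
        DatasetDatum(
            label="22",  # uid of the first datum
            data={"retrieved_context": ["First retrieved passage..."]},
        ),
        DatasetDatum(
            label="23",  # uid of the second datum
            data={"retrieved_context": ["Second retrieved passage..."]},
        ),
    ],
)
print(res)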
Asynchronous Metrics
To run asynchronous metrics, you start a new experiment (also called an evaluation) in a particular project.
User-provided data
- Python
- Node
from relari import RelariClient
from relari import Metric
client = RelariClient()
data = [
{
"ground_truth_context": [
"Software as a service is a way of delivering applications remotely over the internet."
],
"retrieved_context": [
"Software as a service (SaaS) is a way of delivering applicabtions remotely over the internet instead of locally on machines (known as “on-premise” software)."
],
},
{
"ground_truth_context": [
"Python's asyncio module provides a framework for writing concurrent code using async/await syntax."
],
"retrieved_context": [
"The asyncio module in Python is used for writing single-threaded concurrent code using coroutines, multiplexing I/O access over sockets and other resources, running network clients and servers, and other related primitives."
],
},
]
eval_id = client.evaluations.submit(
project_id=PROJECT_ID,  # your project ID
name=None,
metadata=dict(),
pipeline=[Metric.PrecisionRecallF1, Metric.RankedRetrievalMetrics],
data=data,
)
import { RelariClient, MetricName } from "relari-sdk"
const relariClient = new RelariClient()
const data = [
{
"ground_truth_context": [
"Software as a service is a way of delivering applications remotely over the internet."
],
"retrieved_context": [
"Software as a service (SaaS) is a way of delivering applicabtions remotely over the internet instead of locally on machines (known as “on-premise” software)."
],
},
{
"ground_truth_context": [
"Python's asyncio module provides a framework for writing concurrent code using async/await syntax."
],
"retrieved_context": [
"The asyncio module in Python is used for writing single-threaded concurrent code using coroutines, multiplexing I/O access over sockets and other resources, running network clients and servers, and other related primitives."
],
},
]
const createdEval = await relariClient.experiments.submit(
project, // your project ID
undefined, // Name
[MetricName.PrecisionRecallF1, MetricName.RankedRetrievalMetrics],
data,
)
const evalId = createdEval.id
This call returns the evaluation ID. You can check the status of the evaluation using the CLI:
relari-cli evaluations status EVALUATION_ID
or, using the SDK, you can check the status and download the results once the evaluation has completed:
- Python
- Node
eval_data = client.evaluations.get(eval_id)
const evalData = await relariClient.experiments.get(evalId)
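As a rough Python sketch, you could poll until the evaluation completes (the status field and its "completed" value below are illustrative assumptions, not the exact SDK schema):

import time

# Assumption: the returned evaluation object exposes a status field;
# adjust the field name/value to match what the SDK actually returns.
eval_data = client.evaluations.get(eval_id)
while eval_data.get("status") != "completed":
    time.sleep(30)  # wait before polling again
    eval_data = client.evaluations.get(eval_id)
print(eval_data)  # includes the computed metric results once completed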
Batch metrics
You can also batch a list of data points and compute the metrics in a single call. Since this call is synchronous, it is recommended only for small batches (e.g. fewer than 10 data points).
- Python
- Node
data = [
{
"ground_truth_context": [
"Software as a service is a way of delivering applications remotely over the internet."
],
"retrieved_context": [
"Software as a service (SaaS) is a way of delivering applicabtions remotely over the internet instead of locally on machines (known as “on-premise” software)."
],
},
{
"ground_truth_context": [
"Python's asyncio module provides a framework for writing concurrent code using async/await syntax."
],
"retrieved_context": [
"The asyncio module in Python is used for writing single-threaded concurrent code using coroutines, multiplexing I/O access over sockets and other resources, running network clients and servers, and other related primitives."
],
},
]
res = client.metrics.compute( # supports batch compute
Metric.PrecisionRecallF1,
args=data,
)
for x in res:
    print(x)
// Same call pattern as the batch example above; `data` is an array of
// { ground_truth_context, retrieved_context } objects.
const response = await relariClient.metrics.computeMetrics(
  PrecisionRecallF1.batch(data),
)
Being a synchronous endpoint, be mindful of the timeout: use it only for small batches and set the client timeout (client.timeout = X) to roughly 4-6 seconds per datum.
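For example, a minimal sketch of sizing the timeout before a synchronous batch call (5 seconds per datum is simply the middle of the suggested range):

client.timeout = 5 * len(data)  # roughly 4-6 seconds per datum for synchronous batch calls
res = client.metrics.compute(
    Metric.PrecisionRecallF1,
    args=data,
)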