Run experiments

Running experiments (or evaluations) is a systematic way to measure the performance of an AI system across a fixed set of samples. By altering prompts, models, or hyperparameters, you can observe how different settings affect performance. Experiments can be run on single data points or entire datasets, so you can quickly understand the effect of a change across a range of scenarios.

Experiment-Driven Development

Experiment-driven (or metric-driven) development enables systematic decision-making. Rather than relying on anecdotal testing or subjective impressions, this approach allows for holistic evaluation of each change's impact.

Choosing the Right Metrics

Selecting the appropriate metrics is crucial to the experimentation process. Relari provides over 30 Standard Metrics covering a wide range of common LLM use cases, including:

  • Text generation
  • Retrieval (RAG)
  • Classification
  • Summarization
  • Agent tool use
  • Code generation

Refer to the Metrics section to explore the standard metrics and their usage.
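For instance, a retrieval-focused evaluation might combine a handful of these standard metrics. The sketch below only builds the metric list; the metric names are the ones used in the submission example later on this page:

from relari import Metric

# Deterministic retrieval metrics plus an LLM-judged answer metric
rag_metrics = [
    Metric.PrecisionRecallF1,
    Metric.RankedRetrievalMetrics,
    Metric.LLMBasedAnswerCorrectness,
]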

For task-specific evaluations, Custom Metrics are essential to capture the unique requirements of your applications and the preferences of your users. The Custom Metrics section offers guidance on how to convert a scoring rubric into metrics that align with user preferences.
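As an illustration of the idea (plain Python, not the Relari Custom Metrics API; see that section for the actual interface), a scoring rubric can be turned into a function that maps an application output to a score:

# Hypothetical rubric for a customer-support assistant, expressed as a metric.
# Each criterion is a yes/no check; the score is the fraction of criteria met.
def support_tone_score(answer: str) -> float:
    rubric = {
        "greets_user": answer.lower().startswith(("hi", "hello")),
        "plain_language": "pursuant to" not in answer.lower(),
        "offers_next_step": "let me know" in answer.lower(),
    }
    return sum(rubric.values()) / len(rubric)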

Submit an Experiment

Here is an example of a simple experiment for a Retrieval-Augmented Generation (RAG) system. You'll need to specify the dataset, the metrics, and the output data from your application.
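Each line of the outputs file is expected to be a JSON object carrying the fields the snippet reads. The record below is only an illustration inferred from the code (the uid and values are hypothetical); adapt the field names to your own application:

{"uid": "q-001", "retriever": [{"page_content": "..."}], "llm": "The generated answer."}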

from relari import RelariClient, Metric
from relari.core.types import DatasetDatum
import json

client = RelariClient()

# Load or compute outputs from your application over the dataset
with open("hybrid_and_rerank_RAG_outputs.jsonl", "r") as file:
    outputs = [
        DatasetDatum(
            label=sample["uid"],
            data={
                "retrieved_context": [entry["page_content"] for entry in sample["retriever"]],
                "answer": sample["llm"],
            },
        )
        for sample in map(json.loads, file)
    ]

eval_id = client.evaluations.submit(
    project_id="6671a87f37b6c403f581c684",
    dataset="6671a94337b6c403f581c685",
    name="hybrid_and_rerank_RAG",
    pipeline=[
        Metric.PrecisionRecallF1,
        Metric.RankedRetrievalMetrics,
        Metric.LLMBasedAnswerCorrectness,
        Metric.LLMBasedFaithfulness,
        Metric.LLMBasedAnswerRelevance,
    ],
    data=outputs,
)
Tip: Relari provides the flexibility to run evaluations with or without a dataset. You can evaluate a single data point or a subset of a dataset. For more details on the different ways to run evaluations, please visit the SDK and API sections.
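For example, a quick check while iterating on a prompt can reuse the same submit call with a filtered data list. This is a sketch assuming the IDs and outputs from the example above; the uids are hypothetical:

# Evaluate only a couple of samples instead of the full dataset
subset = [datum for datum in outputs if datum.label in {"q-001", "q-002"}]

eval_id = client.evaluations.submit(
    project_id="6671a87f37b6c403f581c684",
    dataset="6671a94337b6c403f581c685",
    name="hybrid_and_rerank_RAG_subset",
    pipeline=[Metric.PrecisionRecallF1],
    data=subset,
)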