Metric Overview

Relari API offers the following metrics:

| Module | Category | Metrics |
| --- | --- | --- |
| Retrieval | Deterministic | PrecisionRecallF1, RankedRetrievalMetrics, PrecisionRecallF1Ext |
| Retrieval | LLM-based | LLMBasedContextPrecision, LLMBasedContextCoverage |
| Text Generation | Deterministic | DeterministicAnswerCorrectness, DeterministicFaithfulness, FleschKincaidReadability |
| Text Generation | LLM-based | LLMBasedFaithfulness, LLMBasedAnswerCorrectness, LLMBasedAnswerRelevance, LLMBasedStyleConsistency |
| Code Generation | Deterministic | CodeStringMatch, PythonASTSimilarity, SQLSyntaxMatch, SQLASTSimilarity |
| Classification | Deterministic | ClassificationAccuracy |
| Agent Tools | Deterministic | ToolSelectionAccuracy |

Metric Definition and Inputs

Brief definitions and required inputs for the available metrics. Please check the individual metric pages for specific examples.
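
The inputs listed below are fields of a single evaluation record. As a rough illustration, such a record might look like the following; the field names are taken from this page, but the surrounding dictionary layout is only an assumption, not a prescribed schema:

```python
# Illustrative example record; the dict layout is an assumption, not a required schema.
datum = {
    "question": "Who wrote 'Pride and Prejudice'?",
    "retrieved_context": [
        "Pride and Prejudice is an 1813 novel by Jane Austen.",
        "Jane Austen was an English novelist.",
    ],
    "ground_truth_context": [
        "Pride and Prejudice is an 1813 novel by Jane Austen.",
    ],
    "answer": "Jane Austen wrote 'Pride and Prejudice'.",
    "ground_truth_answers": ["Jane Austen wrote Pride and Prejudice."],
}
```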

Retrieval metrics

Deterministic

PrecisionRecallF1

  • Definition: Rank-agnostic metrics including Precision, Recall, and F1 of Retrieved Contexts
  • Inputs: retrieved_context, ground_truth_context
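
To illustrate what this metric measures, here is a minimal rank-agnostic sketch that treats a retrieved chunk as relevant only if it exactly matches a ground-truth chunk; it is a simplification, not the Relari implementation:

```python
def precision_recall_f1(retrieved_context, ground_truth_context):
    """Rank-agnostic precision/recall/F1 over context chunks (exact-match relevance)."""
    retrieved, relevant = set(retrieved_context), set(ground_truth_context)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```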

RankedRetrievalMetrics

  • Definition: Rank-aware metrics including Mean Average Precision (MAP), Mean Reciprocal Rank (MRR), and Normalized Discounted Cumulative Gain (NDCG) of retrieved contexts
  • Inputs: retrieved_context, ground_truth_context
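
The rank-aware quantities can be sketched as follows, again with binary exact-match relevance standing in for whatever relevance judgment the production metric uses (MAP and MRR are the dataset-level means of the per-example values computed here):

```python
import math

def ranked_retrieval_metrics(retrieved_context, ground_truth_context):
    """Per-example average precision, reciprocal rank, and NDCG with binary gains."""
    relevant = set(ground_truth_context)
    rel = [1 if chunk in relevant else 0 for chunk in retrieved_context]

    # Average precision: mean precision@k over the ranks of relevant chunks.
    hits, precisions = 0, []
    for k, r in enumerate(rel, start=1):
        if r:
            hits += 1
            precisions.append(hits / k)
    average_precision = sum(precisions) / len(relevant) if relevant else 0.0

    # Reciprocal rank of the first relevant chunk (0 if none retrieved).
    reciprocal_rank = next((1 / k for k, r in enumerate(rel, start=1) if r), 0.0)

    # NDCG: discounted cumulative gain normalized by the ideal ordering.
    dcg = sum(r / math.log2(k + 1) for k, r in enumerate(rel, start=1))
    ideal_hits = min(len(relevant), len(rel))
    idcg = sum(1 / math.log2(k + 1) for k in range(1, ideal_hits + 1))
    ndcg = dcg / idcg if idcg else 0.0

    return {"average_precision": average_precision,
            "reciprocal_rank": reciprocal_rank,
            "ndcg": ndcg}
```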

LLM-based

LLMBasedContextPrecision

  • Definition: Precision and Mean Average Precision (MAP) based on context relevance as classified by an LLM
  • Inputs: question, retrieved_context

LLMBasedContextCoverage

  • Definition: Proportion of statements in the Ground Truth Answer that can be attributed to the Retrieved Contexts, as determined by an LLM
  • Inputs: question, retrieved_context, ground_truth_answers
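
The computation pattern can be sketched as below. The naive sentence splitting and the `llm_is_attributable` judge are hypothetical placeholders for the LLM calls the real metric makes; they are not part of the documented API:

```python
def llm_context_coverage(question, retrieved_context, ground_truth_answers, llm_is_attributable):
    """Share of ground-truth statements an LLM judge can attribute to the contexts.

    `llm_is_attributable(question, statement, contexts) -> bool` is a hypothetical
    stand-in for the LLM call; splitting on periods stands in for LLM-based
    statement extraction.
    """
    statements = [s.strip() for answer in ground_truth_answers
                  for s in answer.split(".") if s.strip()]
    if not statements:
        return {"coverage": 0.0}
    supported = sum(llm_is_attributable(question, s, retrieved_context) for s in statements)
    return {"coverage": supported / len(statements)}
```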

Text Generation metrics

Deterministic

DeterministicAnswerCorrectness

  • Definition: Includes Token Overlap (Precision, Recall, F1), ROUGE-L (Precision, Recall, F1), and BLEU score of the Generated Answer vs. the Ground Truth Answer
  • Inputs: answer, ground_truth_answers
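
As an illustration of the token-overlap component only (ROUGE-L and BLEU are omitted), here is a minimal sketch that scores the generated answer against its best-matching ground-truth answer:

```python
def token_overlap_scores(answer, ground_truth_answers):
    """Token-overlap precision/recall/F1 against the best-matching ground truth."""
    def scores(candidate, reference):
        cand, ref = set(candidate.lower().split()), set(reference.lower().split())
        overlap = len(cand & ref)
        precision = overlap / len(cand) if cand else 0.0
        recall = overlap / len(ref) if ref else 0.0
        f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
        return precision, recall, f1

    best = max((scores(answer, gt) for gt in ground_truth_answers),
               key=lambda s: s[2], default=(0.0, 0.0, 0.0))
    return dict(zip(("token_precision", "token_recall", "token_f1"), best))
```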

DeterministicFaithfulness

  • Definition: Proportion of sentences in Answer that can be matched to Retrieved Contexts using ROUGE-L precision, Token Overlap precision, and BLEU score
  • Inputs: retrieved_context, answer
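
A rough sketch of the idea, using only token-overlap precision per sentence (the real metric also uses ROUGE-L precision and BLEU, and the 0.5 threshold here is an arbitrary illustrative choice):

```python
import re

def faithfulness_sketch(retrieved_context, answer, threshold=0.5):
    """Share of answer sentences largely contained in at least one retrieved context."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", answer.strip()) if s]
    if not sentences:
        return {"faithfulness": 0.0}

    def token_precision(sentence, context):
        tokens = set(sentence.lower().split())
        return len(tokens & set(context.lower().split())) / len(tokens) if tokens else 0.0

    supported = sum(
        any(token_precision(s, ctx) >= threshold for ctx in retrieved_context)
        for s in sentences
    )
    return {"faithfulness": supported / len(sentences)}
```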

FleschKincaidReadability

  • Definition: How easy or difficult it is to understand the LLM-generated answer
  • Inputs: answer
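
For intuition, a sketch of the classic Flesch reading-ease formula; the syllable count below is a crude vowel-group heuristic, and the production metric may report additional readability statistics:

```python
import re

def flesch_reading_ease(answer):
    """206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)."""
    sentences = max(1, len(re.findall(r"[.!?]+", answer)))
    words = re.findall(r"[A-Za-z]+", answer)
    if not words:
        return 0.0
    syllables = sum(max(1, len(re.findall(r"[aeiouyAEIOUY]+", w))) for w in words)
    return 206.835 - 1.015 * (len(words) / sentences) - 84.6 * (syllables / len(words))
```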

LLM-based

LLMBasedFaithfulness

  • Definition: Binary classification of whether the statements in the Generated Answer can be attributed to the Retrieved Contexts, as judged by an LLM
  • Inputs: question, retrieved_context, answer

LLMBasedAnswerCorrectness

  • Definition: Overall correctness of the Generated Answer based on the Question and the Ground Truth Answer, as judged by an LLM
  • Inputs: question, answer, ground_truth_answers

LLMBasedAnswerRelevance

  • Definition: Relevance of the Generated Answer with respect to the Question
  • Inputs: question, answer

LLMBasedStyleConsistency

  • Definition: Consistency of style between the Generated Answer and the Ground Truth Answer(s)
  • Inputs: answer, ground_truth_answers

Classification metrics

Deterministic

SingleLabelClassification

  • Definition: Proportion of correctly identified items out of the total items
  • Inputs: predicted_class, ground_truth_class
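
Over a dataset this reduces to plain accuracy; a minimal sketch assuming the predictions and labels are collected into parallel lists:

```python
def classification_accuracy(predicted_classes, ground_truth_classes):
    """Proportion of predictions that match their ground-truth label."""
    pairs = list(zip(predicted_classes, ground_truth_classes))
    return sum(pred == truth for pred, truth in pairs) / len(pairs) if pairs else 0.0
```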

Code Generation metrics

Deterministic

CodeStringMatch

  • Definition: Exact and fuzzy match scores between generated code strings and the ground truth code strings
  • Inputs: answer, ground_truth_answers
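
A minimal sketch of the two scores, using difflib's ratio() as a stand-in for whatever fuzzy matcher the production metric uses:

```python
from difflib import SequenceMatcher

def code_string_match(answer, ground_truth_answers):
    """Exact match (0/1) and fuzzy similarity against the closest ground-truth snippet."""
    exact = float(any(answer.strip() == gt.strip() for gt in ground_truth_answers))
    fuzzy = max((SequenceMatcher(None, answer, gt).ratio() for gt in ground_truth_answers),
                default=0.0)
    return {"exact_match": exact, "fuzzy_match": fuzzy}
```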

PythonASTSimilarity

  • Definition: Similarity of Abstract Syntax Trees (ASTs) for Python code, comparing the generated code to the ground truth code
  • Inputs: answer, ground_truth_answers
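
One way to approximate this with the standard library only: parse both snippets with `ast` and compare their node-type sequences. This is a stand-in for a proper tree comparison, not the Relari implementation:

```python
import ast
from difflib import SequenceMatcher

def python_ast_similarity(answer, ground_truth_answers):
    """Similarity of AST node-type sequences; unparsable generated code scores 0."""
    def node_types(code):
        return [type(node).__name__ for node in ast.walk(ast.parse(code))]

    try:
        answer_nodes = node_types(answer)
    except SyntaxError:
        return 0.0

    best = 0.0
    for gt in ground_truth_answers:
        try:
            best = max(best, SequenceMatcher(None, answer_nodes, node_types(gt)).ratio())
        except SyntaxError:
            continue
    return best
```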

SQLSyntaxMatch

  • Definition: Syntactic equivalence between the generated SQL query and a set of ground truth queries
  • Inputs: answer, ground_truth_answers
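
A simplified sketch: compare the queries after collapsing whitespace and case. A real implementation would normalize with an SQL formatter or parser rather than a regex:

```python
import re

def sql_syntax_match(answer, ground_truth_answers):
    """1.0 if the normalized query matches any normalized ground-truth query, else 0.0."""
    def normalize(sql):
        return re.sub(r"\s+", " ", sql.strip().rstrip(";")).lower()

    return float(any(normalize(answer) == normalize(gt) for gt in ground_truth_answers))
```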

SQLASTSimilarity

  • Definition: Similarity of Abstract Syntax Trees (ASTs) for SQL queries, comparing the generated code to the ground truth code
  • Inputs: answer, ground_truth_answers

Agent Tools metrics

Deterministic

ToolSelectionAccuracy

  • Definition: Accuracy of selecting the correct tool(s) for a given task by the agent
  • Inputs: tools, ground_truths
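
A minimal sketch, assuming both inputs reduce to lists of tool names for a single task; the documented inputs may carry richer call records (arguments, ordering), which this ignores:

```python
def tool_selection_accuracy(tools, ground_truths):
    """1.0 if the agent selected exactly the expected set of tools, else 0.0."""
    return float(set(tools) == set(ground_truths))
```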