Overview
Metric Overview
The Relari API offers the following metrics:
| Module | Category | Metrics |
|---|---|---|
| Retrieval | Deterministic | PrecisionRecallF1, RankedRetrievalMetrics |
| Retrieval | LLM-based | LLMBasedContextPrecision, LLMBasedContextCoverage |
| Text Generation | Deterministic | DeterministicAnswerCorrectness, DeterministicFaithfulness, FleschKincaidReadability |
| Text Generation | LLM-based | LLMBasedFaithfulness, LLMBasedAnswerCorrectness, LLMBasedAnswerRelevance, LLMBasedStyleConsistency |
| Code Generation | Deterministic | CodeStringMatch, PythonASTSimilarity, SQLSyntaxMatch, SQLASTSimilarity |
| Classification | Deterministic | SingleLabelClassification |
| Agent Tools | Deterministic | ToolSelectionAccuracy |
Metric Definitions and Inputs
Brief definitions and the required inputs for each available metric are listed below. Please check the individual metric pages for specific examples.
Retrieval metrics
Deterministic
PrecisionRecallF1
- Definition: Rank-agnostic metrics including Precision, Recall, and F1 of Retrieved Contexts
- Inputs: `retrieved_context`, `ground_truth_context`
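For intuition, here is a minimal sketch of the computation, treating a retrieved chunk as relevant only if it exactly matches a ground-truth chunk (the hosted metric may use a softer matching rule):

```python
def precision_recall_f1(retrieved_context, ground_truth_context):
    """Rank-agnostic overlap metrics; exact-match chunk comparison is an
    illustrative simplification."""
    retrieved, relevant = set(retrieved_context), set(ground_truth_context)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

print(precision_recall_f1(["chunk_a", "chunk_b"], ["chunk_b", "chunk_c"]))
# {'precision': 0.5, 'recall': 0.5, 'f1': 0.5}
```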
RankedRetrievalMetrics
- Definition: Rank-aware metrics including Mean Average Precision (MAP), Mean Reciprocal Rank (MRR), and Normalized Discounted Cumulative Gain (NDCG) of the Retrieved Contexts
- Inputs: `retrieved_context`, `ground_truth_context`
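A sketch of the rank-aware side (NDCG omitted for brevity; the "map" value below is the average precision of a single query, which the full metric averages over a dataset):

```python
def ranked_retrieval_metrics(retrieved_context, ground_truth_context):
    """Reciprocal rank and average precision with exact-match relevance,
    an illustrative simplification."""
    relevant = set(ground_truth_context)
    reciprocal_rank, hits, precisions = 0.0, 0, []
    for rank, chunk in enumerate(retrieved_context, start=1):
        if chunk in relevant:
            if not reciprocal_rank:
                reciprocal_rank = 1.0 / rank  # rank of the first relevant hit
            hits += 1
            precisions.append(hits / rank)    # precision@k at each hit
    average_precision = sum(precisions) / len(relevant) if relevant else 0.0
    return {"mrr": reciprocal_rank, "map": average_precision}

print(ranked_retrieval_metrics(["a", "x", "b"], ["a", "b"]))
# {'mrr': 1.0, 'map': 0.833...}
```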
LLM-based
LLMBasedContextPrecision
- Definition: Precision and Mean Average Precision (MAP) based on context relevance as classified by an LLM
- Inputs: `question`, `retrieved_context`
LLMBasedContextCoverage
- Definition: Proportion of statements in the Ground Truth Answer that can be attributed to the Retrieved Contexts, as determined by an LLM
- Inputs: `question`, `retrieved_context`, `ground_truth_answers`
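As a usage sketch, the open-source continuous-eval companion library exposes these metrics as callables that take the inputs listed above as keyword arguments. The import path and return shape here are assumptions and may differ across versions:

```python
# A usage sketch; the import path and exact output format are assumptions.
from continuous_eval.metrics.retrieval import LLMBasedContextCoverage

metric = LLMBasedContextCoverage()
result = metric(
    question="Who wrote 'Pride and Prejudice'?",
    retrieved_context=["Jane Austen was an English novelist known for..."],
    ground_truth_answers=["Jane Austen"],
)
print(result)  # e.g. a coverage score plus the LLM's per-statement verdicts
```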
Text Generation metrics
Deterministic
DeterministicAnswerCorrectness
- Definition: Includes Token Overlap (Precision, Recall, F1), ROUGE-L (Precision, Recall, F1), and BLEU score of the Generated Answer vs. the Ground Truth Answer
- Inputs: `answer`, `ground_truth_answers`
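A sketch of the Token Overlap component (ROUGE-L and BLEU omitted; whitespace tokenization and set-based overlap are illustrative simplifications):

```python
def token_overlap_f1(answer, ground_truth_answers):
    """Best token-overlap F1 of the answer against any reference."""
    def f1(pred, ref):
        pred_tokens, ref_tokens = pred.lower().split(), ref.lower().split()
        common = len(set(pred_tokens) & set(ref_tokens))
        if not common:
            return 0.0
        precision, recall = common / len(pred_tokens), common / len(ref_tokens)
        return 2 * precision * recall / (precision + recall)
    return max(f1(answer, ref) for ref in ground_truth_answers)

print(token_overlap_f1("Paris is the capital of France",
                       ["The capital of France is Paris"]))  # 1.0
```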
DeterministicFaithfulness
- Definition: Proportion of sentences in Answer that can be matched to Retrieved Contexts using ROUGE-L precision, Token Overlap precision, and BLEU score
- Inputs: `retrieved_context`, `answer`
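A sketch of the token-overlap variant (the full metric also reports ROUGE-L precision and BLEU; the 0.5 support threshold and naive sentence splitting are assumptions):

```python
def deterministic_faithfulness(retrieved_context, answer):
    """Fraction of answer sentences mostly covered by context tokens."""
    context_tokens = set(" ".join(retrieved_context).lower().split())
    sentences = [s.strip() for s in answer.split(".") if s.strip()]
    supported = 0
    for sentence in sentences:
        tokens = sentence.lower().split()
        precision = sum(t in context_tokens for t in tokens) / len(tokens)
        supported += precision >= 0.5  # assumed support threshold
    return supported / len(sentences) if sentences else 0.0
```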
FleschKincaidReadability
- Definition: How easy or difficult it is to understand the LLM-generated answer
- Inputs: `answer`
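The underlying formula is the standard Flesch-Kincaid Grade Level, 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59; the vowel-group syllable counter below is a rough stand-in for a proper one:

```python
import re

def flesch_kincaid_grade(answer):
    """Flesch-Kincaid Grade Level with a crude syllable estimate."""
    sentences = max(1, len(re.findall(r"[.!?]+", answer)))
    words = re.findall(r"[A-Za-z']+", answer)
    if not words:
        return 0.0
    syllables = sum(max(1, len(re.findall(r"[aeiouy]+", w.lower())))
                    for w in words)
    return 0.39 * len(words) / sentences + 11.8 * syllables / len(words) - 15.59
```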
LLM-based
LLMBasedFaithfulness
- Definition: Binary classification of whether the statements in the Generated Answer can be attributed to the Retrieved Contexts, as judged by an LLM
- Inputs: `question`, `retrieved_context`, `answer`
LLMBasedAnswerCorrectness
- Definition: Overall correctness of the Generated Answer given the Question and Ground Truth Answer, as assessed by an LLM
- Inputs: `question`, `answer`, `ground_truth_answers`
LLMBasedAnswerRelevance
- Definition: Relevance of the Generated Answer with respect to the Question
- Inputs: `question`, `answer`
LLMBasedStyleConsistency
- Definition: Consistency of style between the Generated Answer and the Ground Truth Answer(s)
- Inputs: `answer`, `ground_truth_answers`
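Usage follows the same pattern as the LLM-based retrieval metrics: instantiate the metric and call it with the listed inputs as keyword arguments (import path assumed, as before):

```python
# A usage sketch; the import path and exact output format are assumptions.
from continuous_eval.metrics.generation.text import LLMBasedAnswerCorrectness

metric = LLMBasedAnswerCorrectness()
result = metric(
    question="What is the capital of France?",
    answer="The capital of France is Paris.",
    ground_truth_answers=["Paris"],
)
print(result)  # e.g. a correctness score with the LLM's reasoning
```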
Classification metrics
Deterministic
SingleLabelClassification
- Definition: Proportion of correctly identified items out of the total items
- Inputs: `predicted_class`, `ground_truth_class`
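The inputs above are per-example; over a dataset the metric reduces to plain accuracy, as in this sketch:

```python
def classification_accuracy(predicted_classes, ground_truth_classes):
    """Proportion of predictions matching the ground truth labels."""
    correct = sum(p == g for p, g in zip(predicted_classes, ground_truth_classes))
    return correct / len(ground_truth_classes) if ground_truth_classes else 0.0

print(classification_accuracy(["spam", "ham", "spam"],
                              ["spam", "ham", "ham"]))  # 0.666...
```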
Code Generation metrics
Deterministic
CodeStringMatch
- Definition: Exact and fuzzy match scores between generated code strings and the ground truth code strings
- Inputs: `answer`, `ground_truth_answers`
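A sketch using difflib for the fuzzy side (SequenceMatcher is a stand-in; the hosted metric's fuzzy-matching algorithm is not specified here):

```python
import difflib

def code_string_match(answer, ground_truth_answers):
    """Exact match (0/1) and best fuzzy similarity against the references."""
    exact = float(any(answer == gt for gt in ground_truth_answers))
    fuzzy = max(difflib.SequenceMatcher(None, answer, gt).ratio()
                for gt in ground_truth_answers)
    return {"exact_match": exact, "fuzzy_score": fuzzy}

print(code_string_match("def add(a, b): return a + b",
                        ["def add(x, y): return x + y"]))
```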
PythonASTSimilarity
- Definition: Similarity of Abstract Syntax Trees (ASTs) for Python code, comparing the generated code to the ground truth code
- Inputs: `answer`, `ground_truth_answers`
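For intuition, comparing AST dumps shows why this metric ignores formatting and comments; the real metric computes a graded similarity rather than this all-or-nothing check:

```python
import ast

def python_ast_match(answer, ground_truth_answers):
    """1.0 if the generated code parses to the same AST as any reference
    (references are assumed to be well-formed Python)."""
    def normalize(code):
        return ast.dump(ast.parse(code))
    try:
        target = normalize(answer)
    except SyntaxError:
        return 0.0
    return float(any(target == normalize(gt) for gt in ground_truth_answers))

print(python_ast_match("x=1+2", ["x = 1 + 2  # same tree"]))  # 1.0
```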
SQLSyntaxMatch
- Definition: Syntactic equivalence between generated SQL queries and a set of ground truth queries
- Inputs: `answer`, `ground_truth_answers`
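A crude sketch of syntactic matching via string normalization (the hosted metric may normalize more aggressively, e.g. with a real SQL parser):

```python
def sql_syntax_match(answer, ground_truth_answers):
    """Case- and whitespace-insensitive equivalence of SQL strings."""
    def normalize(sql):
        return " ".join(sql.lower().replace(";", " ").split())
    return float(any(normalize(answer) == normalize(gt)
                     for gt in ground_truth_answers))

print(sql_syntax_match("SELECT a FROM t;", ["select a\nfrom t"]))  # 1.0
```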
SQLASTSimilarity
- Definition: Similarity of Abstract Syntax Trees (ASTs) for SQL queries, comparing the generated query to the ground truth queries
- Inputs: `answer`, `ground_truth_answers`
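A sketch using the third-party sqlglot parser to canonicalize queries before comparison (the choice of sqlglot and the all-or-nothing comparison are assumptions; the actual metric returns a graded AST similarity):

```python
import sqlglot  # third-party parser, an assumed tool choice

def sql_ast_match(answer, ground_truth_answers):
    """1.0 if the query parses to the same canonical form as any reference."""
    def canonical(sql):
        return sqlglot.parse_one(sql).sql()  # parse, then re-render
    target = canonical(answer)
    return float(any(target == canonical(gt) for gt in ground_truth_answers))

print(sql_ast_match("select a from t", ["SELECT a\nFROM t"]))  # 1.0
```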
Agent Tools metrics
Deterministic
ToolSelectionAccuracy
- Definition: Accuracy of the agent in selecting the correct tool(s) for a given task
- Inputs: `tools`, `ground_truths`
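A sketch comparing tool names only (the input schema and set-based comparison are assumptions; argument checking and call order are ignored):

```python
def tool_selection_accuracy(tools, ground_truths):
    """Fraction of expected tools that the agent actually invoked.
    The {"name": ...} schema is a hypothetical illustration."""
    called = {t["name"] for t in tools}            # tools the agent called
    expected = {t["name"] for t in ground_truths}  # tools it should have called
    return len(called & expected) / len(expected) if expected else 1.0

print(tool_selection_accuracy(
    [{"name": "search"}, {"name": "calculator"}],
    [{"name": "search"}],
))  # 1.0
```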