Faithfulness measures how grounded the generated answer is in the retrieved contexts.

Below is a list of deterministic metrics that measure the relationship between the generated answer and the retrieved contexts.

ROUGE-L Precision measures the longest common subsequence (LCS) between the generated answer and the retrieved contexts, normalized by the length of the answer.
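For intuition, here is a minimal sketch of ROUGE-L precision, assuming simple whitespace tokenization (the library's tokenizer may differ; `rouge_l_precision` is an illustrative name, not the library's API):

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    # Classic dynamic-programming longest common subsequence.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def rouge_l_precision(answer: str, context: str) -> float:
    # LCS length normalized by the number of tokens in the answer.
    answer_tokens, context_tokens = answer.split(), context.split()
    if not answer_tokens:
        return 0.0
    return lcs_length(answer_tokens, context_tokens) / len(answer_tokens)
```

With this naive tokenization, the first answer sentence in the example below scores 5/6 ≈ 0.83 against the context, in line with the example output.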

Token Overlap Precision calculates the precision of token overlap between the generated answer and the retrieved contexts, i.e., the fraction of answer tokens that also appear in the contexts.
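A similar sketch for token overlap precision, again assuming whitespace tokenization (the 0.875 in the example output suggests the library tokenizes punctuation separately, so exact values will differ):

```python
def token_overlap_precision(answer: str, context: str) -> float:
    # Fraction of answer tokens that also appear in the retrieved context.
    answer_tokens = answer.split()
    context_tokens = set(context.split())
    if not answer_tokens:
        return 0.0
    return sum(token in context_tokens for token in answer_tokens) / len(answer_tokens)
```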

BLEU (Bilingual Evaluation Understudy) calculates the n-gram precision of the generated answer against the retrieved contexts:

$$\text{BLEU} = BP \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right)$$

where $p_n$ is the n-gram precision, $w_n$ is the weight for each n-gram, and $BP$ is the brevity penalty that penalizes short answers.
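For intuition, a sentence-level BLEU score can be computed with NLTK; this is only a sketch, and the library's own implementation (tokenization, weights, smoothing) may differ:

```python
from nltk.translate.bleu_score import SmoothingFunction, sentence_bleu

reference = "William Shakespeare is the author of 'Romeo and Juliet'.".split()
candidate = "William Shakespeare wrote 'Romeo and Juliet'.".split()

# Uniform weights w_n = 1/4 over 1- to 4-grams; smoothing keeps the score
# nonzero when higher-order n-gram precisions are zero on short sentences.
score = sentence_bleu(
    [reference],
    candidate,
    weights=(0.25, 0.25, 0.25, 0.25),
    smoothing_function=SmoothingFunction().method1,
)
print(score)
```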

ROUGE / Token Overlap / BLEU Faithfulness is defined as the proportion of sentences in the generated answer that can be matched to the retrieved contexts with a score above a threshold.
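The aggregation itself is simple; here is a minimal sketch of the assumed logic (`proportion_faithful` is a hypothetical helper, not the library's API):

```python
def proportion_faithful(scores_by_sentence: list[float], threshold: float = 0.5) -> float:
    # A sentence counts as faithful if its score clears the threshold;
    # the metric is the faithful fraction of sentences.
    return sum(score >= threshold for score in scores_by_sentence) / len(scores_by_sentence)

# With the per-sentence ROUGE-L precision scores from the example output below:
print(proportion_faithful([0.8333333333333334, 0.2]))  # 0.5
```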

Example Usage

Required data items: retrieved_context, answer

```python
from continuous_eval.metrics.generation.text import DeterministicFaithfulness

datum = {
    "retrieved_context": ["William Shakespeare is the author of 'Romeo and Juliet'."],
    "answer": "William Shakespeare wrote 'Romeo and Juliet'. He is born in Ireland",
}

metric = DeterministicFaithfulness()
print(metric(**datum))
```

Example Output

The `*_by_sentence` values are lists of sentence-level ROUGE / token overlap / BLEU scores, one entry per sentence in the answer.

The default threshold for a sentence to be considered faithful is 0.5.

```python
{
    'rouge_faithfulness': 0.5,
    'token_overlap_faithfulness': 0.5,
    'bleu_faithfulness': 0.37023896751607194,
    'rouge_p_by_sentence': [0.8333333333333334, 0.2],
    'token_overlap_p_by_sentence': [0.875, 0.2],
    'bleu_score_by_sentence': [0.6855956729300113, 0.05488226210213251],
}
```