Context Precision & Recall


Context Precision: measures signal vs. noise — what proportion of the retrieved contexts are relevant?

Context Recall: measures completeness — what proportion of all relevant contexts are retrieved?

F1: harmonic mean of precision and recall
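
To make the three definitions concrete, here is a minimal sketch in plain Python. It is not the library's implementation: the function name `precision_recall_f1` and the `is_match` callback are hypothetical, and the sketch assumes a retrieved context is relevant when it matches at least one ground truth context, and that a ground truth context counts as retrieved when at least one retrieved context matches it.

```python
from typing import Callable, Dict, List


def precision_recall_f1(
    retrieved: List[str],
    ground_truth: List[str],
    is_match: Callable[[str, str], bool],
) -> Dict[str, float]:
    """Toy precision/recall/F1 over retrieved vs. ground truth contexts."""
    # Assumption: a retrieved context is relevant if it matches any ground truth context.
    relevant_retrieved = sum(
        any(is_match(r, g) for g in ground_truth) for r in retrieved
    )
    # Assumption: a ground truth context is covered if any retrieved context matches it.
    covered_truth = sum(
        any(is_match(r, g) for r in retrieved) for g in ground_truth
    )
    precision = relevant_retrieved / len(retrieved) if retrieved else 0.0
    recall = covered_truth / len(ground_truth) if ground_truth else 0.0
    denom = precision + recall
    # F1 is the harmonic mean of precision and recall (0 when both are 0).
    f1 = 2 * precision * recall / denom if denom > 0 else 0.0
    return {
        "context_precision": precision,
        "context_recall": recall,
        "context_f1": f1,
    }


# Hypothetical matcher for illustration only: exact string equality.
scores = precision_recall_f1(
    retrieved=["A", "B", "C", "D"],
    ground_truth=["A", "E"],
    is_match=lambda r, g: r == g,
)
print(scores)
```

In this toy run one of four retrieved items is relevant (precision 0.25) and one of two ground truth items is covered (recall 0.5), so F1 is 2 × 0.25 × 0.5 / 0.75, or about 0.33.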

Matching Strategy

Ground truth contexts can be defined differently from the exact chunks retrieved. For example, a ground truth context can be a single sentence that contains the information, while the retrieved contexts are uniform 512-token chunks. The following matching strategies determine whether a retrieved component counts as relevant:

| Match Type | Component | Retrieved component considered relevant if: |
| --- | --- | --- |
| ExactChunkMatch() | Chunk | Exact match to a Ground Truth Context Chunk. |
| ExactSentenceMatch() | Sentence | Exact match to a Ground Truth Context Sentence. |
| RougeChunkMatch() | Chunk | Match to a Ground Truth Context Chunk with ROUGE-L Recall > ROUGE_CHUNK_MATCH_THRESHOLD (default 0.7). |
| RougeSentenceMatch() | Sentence | Match to a Ground Truth Context Sentence with ROUGE-L Recall > ROUGE_CHUNK_SENTENCE_THRESHOLD (default 0.8). |
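
As a rough illustration of the ROUGE-based rows, the sketch below computes ROUGE-L recall as the length of the longest common subsequence (LCS) of tokens divided by the number of tokens in the ground truth, then compares it to the 0.7 chunk threshold. This is an assumption-laden sketch, not the library's code: the helper names, the lowercase word tokenizer, and the choice of the ground truth as the ROUGE reference are all assumptions, and the library's own ROUGE scorer may tokenize or stem differently.

```python
import re
from typing import List


def tokenize(text: str) -> List[str]:
    # Assumption: lowercase word tokens, punctuation dropped.
    return re.findall(r"\w+", text.lower())


def lcs_length(a: List[str], b: List[str]) -> int:
    # Dynamic programming LCS, keeping only the previous row.
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_recall(candidate: str, reference: str) -> float:
    # Recall: share of reference tokens recovered, in order, by the candidate.
    ref = tokenize(reference)
    if not ref:
        return 0.0
    return lcs_length(tokenize(candidate), ref) / len(ref)


THRESHOLD = 0.7  # mirrors the chunk-level default quoted in the table

ground_truth = "Paris is the capital of France."
retrieved = [
    "Paris is the capital of France and also the largest city in the country.",
    "Lyon is a major city in France.",
]

for chunk in retrieved:
    score = rouge_l_recall(chunk, ground_truth)
    print(f"{score:.2f}", score > THRESHOLD, chunk)
```

Under these assumptions the first chunk recovers all six ground truth tokens (recall 1.0, a match), while the second shares only "is" and "france" (recall 2/6, no match). A rough match of this kind lets a long retrieved chunk count as relevant even though it is not identical to the shorter ground truth text.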

Example Usage

Required data items: retrieved_context, ground_truth_context

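from continuous_eval.metrics.retrieval import PrecisionRecallF1, RougeChunkMatch

# One evaluation datum: the retrieved chunks and the ground truth context.
datum = {
    "retrieved_context": [
        "Paris is the capital of France and also the largest city in the country.",
        "Lyon is a major city in France.",
    ],
    "ground_truth_context": ["Paris is the capital of France."],
}

# Score retrieved chunks against the ground truth using ROUGE-L chunk matching.
metric = PrecisionRecallF1(RougeChunkMatch())
# Assumed call style; it may differ between library versions.
print(metric(**datum))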

Example Output

{
    'context_precision': 0.5,
    'context_recall': 1.0,
    'context_f1': 0.6666666666666666
}