Context Precision & Recall

Definitions

Context Precision: measures signal vs. noise — what proportion of the retrieved contexts are relevant?

Context Recall: measures completeness — what proportion of all relevant contexts are retrieved?

F1: harmonic mean of precision and recall

Matching Strategy

Given that the ground truth contexts can be defined differently from the exact chunks retrieved. For example, a ground truth contexts can be a sentence that contains the information, while the contexts retrieved are uniform 512-token chunks. We have following matching strategies that determine relevance:

Match Type	Component	Retrieved Component Considered relevant if:
`ExactChunkMatch()`	Chunk	Exact match to a Ground Truth Context Chunk.
`ExactSentenceMatch()`	Sentence	Exact match to a Ground Truth Context Sentence.
`RoughChunkMatch()`	Chunk	Match to a Ground Truth Context Chunk with ROUGE-L Recall > `ROUGE_CHUNK_MATCH_THRESHOLD` (default 0.7).
`RougeSentenceMatch()`	Sentence	Match to a Ground Truth Context Sentence with ROUGE-L Recall > `ROUGE_CHUNK_SENTENCE_THRESHOLD` (default 0.8).

Example Usage

Required data items: retrieved_context, ground_truth_context

from continuous_eval.metrics.retrieval import PrecisionRecallF1, RougeChunkMatch

datum = {
    "retrieved_context": [
        "Paris is the capital of France and also the largest city in the country.",
        "Lyon is a major city in France.",
    ],
    "ground_truth_context": ["Paris is the capital of France."],
}

metric = PrecisionRecallF1(RougeChunkMatch())
print(metric(**datum))

Example Output

{
    'context_precision': 0.5,
    'context_recall': 1.0,
    'context_f1': 0.6666666666666666
}