Correctness

Definitions

Answer Correctness measures how close the generated answer is to the ground truth reference answers.

Below is the list of deterministic metrics that measure the relationship between the generated answer and the ground truth reference answers.

ROUGE-L measures the longest common subsequence between the generated answer and the ground truth answers.
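To make the longest-common-subsequence idea concrete, here is a minimal sketch of ROUGE-L precision, recall, and F1. It assumes lowercased, whitespace-split tokens; the library's own tokenizer may differ. The function names `lcs_length` and `rouge_l` are illustrative, not part of continuous_eval.

```python
from typing import Dict, List


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(answer: str, reference: str) -> Dict[str, float]:
    # Assumption: simple lowercase + whitespace tokenization.
    a = answer.lower().split()
    r = reference.lower().split()
    lcs = lcs_length(a, r)
    precision = lcs / len(a) if a else 0.0
    recall = lcs / len(r) if r else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


scores = rouge_l("the cat sat", "the cat sat down")
print(scores)
```

Precision divides the LCS length by the answer length and recall divides it by the reference length, so an answer that is a shortened version of the reference scores perfect precision but lower recall.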

Token Overlap calculates the token overlap between the generated answer and the ground truth answers.
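As a rough illustration of token overlap, the sketch below counts shared tokens as a multiset intersection and derives precision, recall, and F1 from that count. The tokenization (lowercase, whitespace split) and the name `token_overlap` are assumptions for this example; the library may normalize text differently.

```python
from collections import Counter
from typing import Dict


def token_overlap(answer: str, reference: str) -> Dict[str, float]:
    # Assumption: simple lowercase + whitespace tokenization.
    a = Counter(answer.lower().split())
    r = Counter(reference.lower().split())
    # Multiset intersection: each token counts at most as often as it
    # appears in both texts.
    common = sum((a & r).values())
    precision = common / sum(a.values()) if a else 0.0
    recall = common / sum(r.values()) if r else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


print(token_overlap("shakespeare wrote hamlet",
                    "william shakespeare wrote hamlet"))
```

Unlike ROUGE-L, this measure ignores token order, so a reordered answer can still score highly.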

BLEU (Bilingual Evaluation Understudy) combines the n-gram precisions p_n, each weighted by w_n, with a brevity penalty BP that penalizes short answers.
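In its standard form, BLEU combines these quantities as shown below. The symbols N (the maximum n-gram order), c (the length of the generated answer), and r (the length of the reference) are not named in the text above and follow the usual convention; the exact weights and smoothing used by the library are not specified here.

```latex
\mathrm{BLEU} = \mathrm{BP} \cdot \exp\left( \sum_{n=1}^{N} w_n \log p_n \right),
\qquad
\mathrm{BP} =
\begin{cases}
1 & \text{if } c > r, \\
\exp\left(1 - \dfrac{r}{c}\right) & \text{if } c \le r.
\end{cases}
```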

Answer Correctness is a basket of metrics that includes the precision, recall, and F1 of ROUGE-L and Token Overlap, as well as the BLEU score.

When there are multiple ground truth reference answers, the maximum score across them is taken for each metric.
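A minimal sketch of this aggregation follows, assuming the maximum is taken independently for each metric, so different metrics may come from different references. The helpers `best_over_references` and `word_scores` are hypothetical and exist only for this illustration; they are not part of continuous_eval.

```python
from typing import Callable, Dict, List

ScoreFn = Callable[[str, str], Dict[str, float]]


def best_over_references(score_fn: ScoreFn, answer: str,
                         references: List[str]) -> Dict[str, float]:
    """Score the answer against each reference and keep the max per metric."""
    per_ref = [score_fn(answer, ref) for ref in references]
    return {key: max(s[key] for s in per_ref) for key in per_ref[0]}


def word_scores(answer: str, reference: str) -> Dict[str, float]:
    # Toy scorer for the illustration: set-based word precision and recall.
    a = set(answer.lower().split())
    r = set(reference.lower().split())
    common = len(a & r)
    return {
        "precision": common / len(a) if a else 0.0,
        "recall": common / len(r) if r else 0.0,
    }


best = best_over_references(
    word_scores,
    "shakespeare wrote hamlet",
    ["william shakespeare wrote hamlet", "shakespeare"],
)
print(best)
```

Here the best precision comes from the first reference and the best recall from the second, which is why reported precision, recall, and F1 need not be mutually consistent for a single reference.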

Example Usage

Required data items: answer, ground_truth_answers

from continuous_eval.metrics.generation.text import DeterministicAnswerCorrectness

datum = {
    "answer": "Shakespeare wrote 'Romeo and Juliet'",
    "ground_truth_answers": [
        "William Shakespeare wrote 'Romeo and Juliet'",
        "William Shakespeare",
        "Shakespeare",
        "Shakespeare is the author of 'Romeo and Juliet'"
    ]
}

# Instantiate the metric and score the answer against all references
metric = DeterministicAnswerCorrectness()
print(metric(**datum))

Example Output

{
    'rouge_l_recall': 1.0,
    'rouge_l_precision': 0.8,
    'rouge_l_f1': 0.7272727223140496,
    'token_overlap_recall': 1.0,
    'token_overlap_precision': 0.8333333333333334,
    'token_overlap_f1': 0.8333333333333334,
    'bleu_score': 0.799402901304756
}