# Correctness

### Definitions

**Answer Correctness** measures how close the generated answer is to the ground truth reference answers.

Below is the list of deterministic metrics that measure the relationship between the generated answer and the ground truth reference answers.

**ROUGE-L** measures the longest common subsequence between the generated answer and the ground truth answers.
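A minimal sketch of ROUGE-L over whitespace tokens (function names here are illustrative, not the library's API): compute the longest common subsequence (LCS) with dynamic programming, then derive precision, recall, and F1 from its length.

```python
def lcs_length(a, b):
    # Dynamic-programming table: dp[i][j] = LCS length of a[:i] and b[:j]
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[len(a)][len(b)]

def rouge_l(answer, reference):
    ans, ref = answer.split(), reference.split()
    lcs = lcs_length(ans, ref)
    precision = lcs / len(ans) if ans else 0.0
    recall = lcs / len(ref) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

For example, `rouge_l("the cat sat", "the cat sat on the mat")` yields precision 1.0 (the whole answer appears in the reference, in order) but recall 0.5 (only half the reference is covered).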

**Token Overlap** calculates the token overlap between the generated answer and the ground truth answers.
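Token overlap can be sketched with multiset intersection over whitespace tokens (again, the function name is illustrative): count tokens shared by answer and reference, then normalize by each side's length for precision and recall.

```python
from collections import Counter

def token_overlap(answer, reference):
    ans, ref = Counter(answer.split()), Counter(reference.split())
    overlap = sum((ans & ref).values())  # multiset intersection size
    precision = overlap / sum(ans.values()) if ans else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

Unlike ROUGE-L, this ignores token order entirely, so it is more permissive toward reworded answers.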

**BLEU** (Bilingual Evaluation Understudy) calculates the n-gram precision:

$$\text{BLEU} = BP \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right)$$

where `p_n` is the n-gram precision, `w_n` is the weight for each n-gram, and `BP` is the brevity penalty to penalize short answers.
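The formula above can be sketched as follows, with uniform weights `w_n = 1/N` and clipped n-gram counts (this is a simplified single-reference sketch, not the library's implementation; production code typically adds smoothing for zero precisions):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(answer, reference, max_n=4):
    ans, ref = answer.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        cand, gold = ngrams(ans, n), ngrams(ref, n)
        total = sum(cand.values())
        if total == 0:
            return 0.0
        # clipped precision p_n: candidate n-grams also present in the reference
        precisions.append(sum((cand & gold).values()) / total)
    if min(precisions) == 0:
        return 0.0  # unsmoothed: any zero precision collapses the geometric mean
    # brevity penalty BP: penalize answers shorter than the reference
    bp = 1.0 if len(ans) > len(ref) else math.exp(1 - len(ref) / len(ans))
    # geometric mean with uniform weights w_n = 1 / max_n
    return bp * math.exp(sum(math.log(p) / max_n for p in precisions))
```

An answer identical to the reference scores 1.0; shorter answers are pulled down by `BP` even when every n-gram they contain is correct.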

**Answer Correctness** is a basket of metrics that includes the **Precision, Recall and F1** of **ROUGE-L** and **Token Overlap**, as well as the **BLEU** score.

When there are multiple ground truth reference answers, the max score is taken.
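The max-over-references rule can be sketched generically (the helper name and the exact-match metric here are hypothetical, for illustration only): score the answer against each reference independently and keep the best score.

```python
def best_score(answer, ground_truths, metric):
    # Score the answer against every reference; keep the maximum
    return max(metric(answer, gt) for gt in ground_truths)

# Hypothetical metric: exact string match, for illustration
exact_match = lambda a, b: 1.0 if a == b else 0.0
```

With `ground_truths = ["Paris", "paris"]` and `answer = "paris"`, exact match fails on the first reference but succeeds on the second, so the reported score is 1.0.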

### Example Usage

Required data items: `answer`, `ground_truths`
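A data item carrying these fields might look like the following (this dict shape is an assumption for illustration; consult the library's API for the exact input format):

```python
# Hypothetical data item: `answer` is the generated text,
# `ground_truths` is a list so multiple references can be scored (max is taken)
datum = {
    "answer": "The Eiffel Tower is in Paris.",
    "ground_truths": [
        "The Eiffel Tower is located in Paris, France.",
        "It is in Paris.",
    ],
}
```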