# Metric Ensembling

The aim of ensembling different metrics to predict the human label is to combine the strengths and balance out the weaknesses of individual metrics, ultimately leading to more accurate, robust, and reliable predictions.

Each metric might capture different aspects of the data or be sensitive to different patterns, so when we combine them, we often get a more comprehensive view.

## What is Conformal Prediction?

Conformal Prediction is a statistical technique that quantifies the confidence level of a prediction. In this case, we are trying to predict whether the answer is correct (or faithful). With conformal prediction, instead of just saying “yes” (or “no”), the model tells us “the answer is correct with probability at least 90%”. In essence, conformal prediction doesn’t just give you an answer; it tells you how confident you can be in that answer. If the model is uncertain, conformal prediction will tell you it’s “undecided”. For the undecided datapoints, we ask the more powerful GPT-4 to judge their correctness.
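To make the idea concrete, here is a minimal sketch of split conformal prediction for a binary label. It is not the library's implementation: the function names, the nonconformity score (one minus the probability assigned to a label), and the toy calibration data are all assumptions chosen for illustration.

```python
import math

def conformal_threshold(cal_probs, cal_labels, alpha=0.1):
    """Compute the nonconformity threshold from a calibration set.

    cal_probs[i] is the estimated probability that sample i is positive,
    cal_labels[i] is its true label (0 or 1).
    """
    # Nonconformity score: 1 - probability assigned to the true label.
    scores = sorted(
        (1.0 - p) if y == 1 else p
        for p, y in zip(cal_probs, cal_labels)
    )
    n = len(scores)
    # Finite-sample corrected rank: ceil((n + 1) * (1 - alpha)).
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return float("inf")  # too few calibration points: keep every label
    return scores[rank - 1]

def prediction_set(prob_positive, threshold):
    """Return (negative_allowed, positive_allowed) for one sample."""
    negative_allowed = int(prob_positive <= threshold)        # score of label 0
    positive_allowed = int(1.0 - prob_positive <= threshold)  # score of label 1
    return negative_allowed, positive_allowed

# Toy calibration data (assumed for illustration only).
cal_probs = [0.9, 0.8, 0.35, 0.95, 0.2, 0.7, 0.3, 0.05, 0.6, 0.4]
cal_labels = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]

threshold = conformal_threshold(cal_probs, cal_labels, alpha=0.2)
for p in (0.9, 0.5, 0.1):
    print(p, prediction_set(p, threshold))
```

A sample whose set contains both labels is the “undecided” case described above: neither label can be ruled out at the requested confidence level.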

## Metric Ensembling

The `MetricEnsemble` class helps you ensemble multiple metrics to predict a ground truth label, such as human labels.
The class leverages the conformal prediction technique to compute a reliable prediction.

Parameters:

- `training: XYData`: training data; it should contain `training.X` (the metrics output, also referred to as *features*) and `training.Y` (the ground truth label)
- `calibration: XYData`: same structure as `training`, but used to calibrate the conformal predictor
- `alpha: float`: significance level, defaults to 0.1. The significance level is the probability that the correct label is not included in the prediction set, so it measures how reliable the prediction is. For example, if `alpha` is 0.1, the prediction set contains the correct label with probability at least 0.9.
- `random_state: Optional[int]`: random seed, defaults to `None`

The `MetricEnsemble` class has the following methods:

- `predict(self, X: pd.DataFrame, judicator: Optional[Callable] = None)`: takes as input a dataframe of metric outputs and returns the ensemble's predictions

The `predict` method returns two numpy arrays:

- `y_hat`: a binary (1/0) vector with the best-effort predictions of the ensemble
- `y_set`: a binary array of size (N, 2) where the first column is 1 if, for the significance level set by `alpha`, the sample can be classified as negative, and the second column is 1 if the sample can be classified as positive.

The set prediction (`y_set`) can have both columns set to 1, meaning that the ensemble is undecided.
This happens because the particular choice of metrics in the ensemble is not confident enough, or because the requested confidence level (1 - `alpha`) is too high.
In such cases the `predict` method will call the `judicator` function (if not `None`) to make a final decision.
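As an illustration of how the rows of `y_set` can be read, the sketch below labels each row and computes the share of undecided samples. The helper name `describe_prediction_sets` and the sample data are hypothetical; the "empty" branch is included only for completeness, since the text above does not say whether an all-zero row can occur.

```python
def describe_prediction_sets(y_set):
    """Label each row of an (N, 2) binary set prediction."""
    labels = []
    for negative_allowed, positive_allowed in y_set:
        if negative_allowed and positive_allowed:
            labels.append("undecided")  # both labels are plausible
        elif positive_allowed:
            labels.append("positive")
        elif negative_allowed:
            labels.append("negative")
        else:
            labels.append("empty")  # neither label was accepted
    return labels

# Hypothetical set predictions for four samples.
y_set = [[1, 0], [0, 1], [1, 1], [0, 1]]
outcomes = describe_prediction_sets(y_set)
undecided_rate = outcomes.count("undecided") / len(outcomes)

print(outcomes)
print(f"undecided: {undecided_rate:.2%}")
```

The same loop works on a numpy array of shape (N, 2), because iterating over it yields one row at a time.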

The `judicator` function takes as input the index of the sample where the predictor is undecided and must return a boolean value (`True`/`False`) indicating the final decision.
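The contract can be sketched as follows. This is an assumption about how such a fallback could work, not the actual body of `predict`: `resolve_undecided` is a hypothetical helper, and the judicator here is a lookup table standing in for an LLM call.

```python
from typing import Callable, List, Optional, Sequence

def resolve_undecided(
    y_hat: Sequence[int],
    y_set: Sequence[Sequence[int]],
    judicator: Optional[Callable[[int], bool]] = None,
) -> List[int]:
    """Return a copy of y_hat where undecided rows are settled by judicator."""
    resolved = list(y_hat)
    if judicator is None:
        return resolved  # keep the best-effort predictions
    for idx, (negative_allowed, positive_allowed) in enumerate(y_set):
        if negative_allowed and positive_allowed:
            # The judicator only sees the index and returns True/False.
            resolved[idx] = int(judicator(idx))
    return resolved

# Hypothetical stand-in for an LLM verdict, keyed by sample index.
llm_verdicts = {2: True}
calls = []

def judicator(idx: int) -> bool:
    calls.append(idx)  # record which samples needed the fallback
    return llm_verdicts[idx]

y_hat = [0, 1, 0, 1]
y_set = [[1, 0], [0, 1], [1, 1], [0, 1]]
final = resolve_undecided(y_hat, y_set, judicator)

print(final, calls)
```

Because the judicator receives only an index, it needs access to the underlying samples through a closure or a shared dataset, which is why the lookup table is defined outside the function.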

### Example

In this example we want to use deterministic and semantic metrics to predict the correctness of the answers (as evaluated by a human annotator). When these two metrics alone are not sufficient to produce a confident prediction, we use the LLM to make the final decision.

First, we compute the deterministic and semantic metrics:

We now split the samples into train, test, and calibration sets and train the classifier.
Note that we are using only `"token_overlap_recall"`, `"deberta_answer_entailment"`, and `"deberta_answer_contradiction"` to train the classifier.
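A three-way split can be sketched with the standard library alone. The helper `split_indices`, the 60/20/20 ratios, and the seed are assumptions for illustration and are not taken from the library; the point is that the calibration samples must be disjoint from the training samples, otherwise the calibration scores are optimistic.

```python
import random

def split_indices(n_samples, train=0.6, calibration=0.2, seed=0):
    """Shuffle range(n_samples) and cut it into train/calibration/test parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    n_train = int(n_samples * train)
    n_cal = int(n_samples * calibration)
    return (
        indices[:n_train],
        indices[n_train:n_train + n_cal],
        indices[n_train + n_cal:],  # whatever remains is the test set
    )

train_idx, cal_idx, test_idx = split_indices(100)
print(len(train_idx), len(cal_idx), len(test_idx))
```

The resulting index lists can then be used to select rows of the feature columns named above.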

Finally we run the classifier and evaluate the results:

The output would be something like:

#### Using a judicator

Let’s assume we want to use the LLM to make the final decision when the classifier is undecided.
We can define a `judicator` function that takes as input the index of the sample where the classifier is undecided and returns a boolean value (`True`/`False`) indicating the final decision.

To use the judicator we simply pass it to the `predict` method:

The output would be something like:

Here the `predict` method called the LLM for the *15.74%* of cases in which the classifier was undecided.
The classifier is no longer undecided on any sample, and the performance improved.