Evaluators and Tests
Definitions
You can optionally add `eval` and `tests` to the modules whose performance you want to measure.
`eval`: select relevant evaluation metrics
- Select the metrics and specify the input according to the data fields required for each metric: `MetricName().use(data_fields)`.
- Metric inputs can be referenced using items from three sources:
  - From `dataset`: e.g. `ground_truth_context = dataset.ground_truth_context`
  - From the current module: e.g. `answer = ModuleOutput()`
  - From prior modules: e.g. `retrieved_context = ModuleOutput(DocumentsContent, module=reranker)`, where `DocumentsContent = ModuleOutput(lambda x: [z["page_content"] for z in x])` selects specific items from the prior module's output
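The selector passed to `ModuleOutput` above is just a function applied to the prior module's raw output. A minimal, self-contained sketch of what that lambda extracts (the sample documents here are made up for illustration):

```python
# Illustrative only: mimic a reranker's raw output as a list of
# document dicts, then apply the same selector lambda shown above.
documents = [
    {"page_content": "Paris is the capital of France.", "score": 0.92},
    {"page_content": "France is in Western Europe.", "score": 0.71},
]

# The selector keeps only the text content of each document.
documents_content = lambda x: [z["page_content"] for z in x]

print(documents_content(documents))
# → ['Paris is the capital of France.', 'France is in Western Europe.']
```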
`tests`: define specific performance criteria
- Select the testing class `GreaterOrEqualThan` or `MeanGreaterOrEqualThan` to run the test over each datapoint or over the mean of the aggregate dataset, respectively.
- Define `test_name`, `metric_name` (must be one of the metrics that `eval` calculates), and `min_value`.
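The difference between the two testing classes can be sketched in plain Python. This is a generic illustration of the pass/fail logic, not the library's implementation; the metric values are hypothetical:

```python
# Hypothetical per-datapoint values for one metric across a dataset.
context_recall_values = [0.9, 0.85, 0.7, 0.95]

def greater_or_equal_than(values, min_value):
    """Per-datapoint test: every datapoint must clear the threshold."""
    return all(v >= min_value for v in values)

def mean_greater_or_equal_than(values, min_value):
    """Aggregate test: only the dataset mean must clear the threshold."""
    return sum(values) / len(values) >= min_value

print(greater_or_equal_than(context_recall_values, 0.8))       # False: 0.7 fails
print(mean_greater_or_equal_than(context_recall_values, 0.8))  # True: mean is 0.85
```

The aggregate variant is more forgiving: one weak datapoint does not fail the test as long as the dataset average stays above `min_value`.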
Example
We will expand the example defined in Pipeline with metrics and tests.
Evaluation Metrics:
- `PrecisionRecallF1` to evaluate the Retriever
- `RankedRetrievalMetrics` to evaluate the Reranker
- `FleschKincaidReadability`, `DebertaAnswerScores`, and `LLMBasedFaithfulness` to evaluate the Generator
Tests:
- Mean of `context_recall`, a metric calculated by `PrecisionRecallF1`, needs to be >= 0.8 to pass
- Mean of `average_precision`, a metric calculated by `RankedRetrievalMetrics`, needs to be >= 0.7 to pass
- Mean of `deberta_entailment`, a metric calculated by `DebertaAnswerScores`, needs to be >= 0.5 to pass
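Since all three are mean-based tests, they reduce to comparing dataset means against thresholds. A self-contained sketch with made-up per-datapoint metric values (not real evaluation results):

```python
# Hypothetical per-datapoint metric results; real values would come
# from running the eval metrics over the dataset.
results = {
    "context_recall":     [0.90, 0.80, 0.85],  # from PrecisionRecallF1
    "average_precision":  [0.75, 0.70, 0.80],  # from RankedRetrievalMetrics
    "deberta_entailment": [0.60, 0.50, 0.55],  # from DebertaAnswerScores
}
thresholds = {
    "context_recall": 0.8,
    "average_precision": 0.7,
    "deberta_entailment": 0.5,
}

outcomes = {}
for metric, values in results.items():
    mean = sum(values) / len(values)
    outcomes[metric] = mean >= thresholds[metric]
    print(f"{metric}: mean={mean:.2f} pass={outcomes[metric]}")
```

With these sample numbers every mean clears its threshold, so all three tests pass.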