Evaluation Runner
Definition
The `eval_manager` manages the evaluation process for a pipeline. Add `eval_manager` to your application and log the module outputs for evaluation.
Logging data in evaluation runs
In your application, you can use the following functions to log data during an evaluation run (a minimal skeleton showing how they fit together follows the list).
- `eval_manager.set_pipeline(pipeline)` to select the previously defined pipeline.
- `eval_manager.start_run()` to begin the evaluation run.
- `eval_manager.log("module_name", data)` to log specific outputs of a given module.
- `eval_manager.is_running()` to check whether the evaluation run is still running.
- `eval_manager.curr_sample` to access the current sample in the dataset.
- `eval_manager.next_sample()` to move to the next sample in the dataset.
- `eval_manager.evaluation.save(Path("results.jsonl"))` to save the run results.
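A minimal skeleton, stripped down to the calls listed above. The `my_app.pipeline` import path and the `my_module` function are hypothetical placeholders for your own application code, not part of the library.

```python
from pathlib import Path

from continuous_eval.eval.manager import eval_manager
from my_app.pipeline import pipeline  # hypothetical import: your previously defined pipeline


def my_module(question: str) -> str:
    # Hypothetical placeholder for one module of your application (retriever, LLM, ...).
    return f"answer to: {question}"


eval_manager.set_pipeline(pipeline)
eval_manager.start_run()
while eval_manager.is_running():
    sample = eval_manager.curr_sample
    if sample is None:
        break
    output = my_module(sample["question"])
    eval_manager.log("my_module", output)  # log this module's output for the current sample
    eval_manager.next_sample()

eval_manager.evaluation.save(Path("results.jsonl"))
```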
See complete examples in the Examples folder.
```python
import os
from pathlib import Path

from continuous_eval.eval.manager import eval_manager
from examples.langchain.simple_rag.pipeline import pipeline


def retrieve(q):
    ...
    return retrieved_docs


def rerank(q, retrieved_docs):
    ...
    return reranked_docs


def ask(q, reranked_docs):
    ...
    return response


if __name__ == "__main__":
    eval_manager.set_pipeline(pipeline)
    eval_manager.start_run()
    while eval_manager.is_running():
        if eval_manager.curr_sample is None:
            break
        q = eval_manager.curr_sample["question"]
        # Run and log Retriever results
        retrieved_docs = retrieve(q)
        eval_manager.log("retriever", [doc.__dict__ for doc in retrieved_docs])
        # Run and log Reranker results
        reranked_docs = rerank(q, retrieved_docs)
        eval_manager.log("reranker", [doc.__dict__ for doc in reranked_docs])
        # Run and log Generator results
        response = ask(q, reranked_docs)
        eval_manager.log("llm", response)
        print(f"Q: {q}\nA: {response}\n")
        eval_manager.next_sample()

    eval_manager.evaluation.save(Path("results.jsonl"))
```
Example logged results (one sample):
{ "retriever": [ { "page_content": "The reason this is mistaken is that people do sometimes change what they want. People who don't want to want something -- drug addicts, for example -- can sometimes make themselves stop wanting it. And people who want to want something -- who want to like classical music, or broccoli -- sometimes succeed.\n\nSo we modify our initial statement: You can do what you want, but you can't want to want what you want.", "metadata": { "source": "../data/graham_essays/215_what_you_want_to_want.txt", "relevance_score": 0.49873924 }, "type": "Document" }, ... (9 chunks) ], "reranker": [ { "page_content": "That's still not quite true. It's possible to change what you want to want. I can imagine someone saying \"I decided to stop wanting to like classical music.\" But we're getting closer to the truth. It's rare for people to change what they want to want, and the more \"want to\"s we add, the rarer it gets.", "metadata": { "source": "../data/graham_essays/215_what_you_want_to_want.txt", "relevance_score": 0.66115755 }, "type": "Document" }, ... (3 chunks) ], "llm": "Yes, people can change their desires.\n\nThe passage states that \"people do sometimes change what they want\" and that \"people who want to want something -- who want to like classical music, or broccoli -- sometimes succeed.\" It also acknowledges that \"it's possible to change what you want to want,\" though it notes that this is rare."}
Run evaluators and tests on results
- `eval_manager.set_pipeline(pipeline)` to select the previously defined pipeline (and its corresponding metrics / tests).
- `eval_manager.evaluation.load(Path("results.jsonl"))` to load the run results from your application.
- `eval_manager.run_metrics()` to calculate the metrics per module in your pipeline.
- `eval_manager.metrics.save(Path("metrics_results.json"))` to save the metric results.
- `eval_manager.metrics.aggregate()` to calculate the aggregate results across the dataset.
- `eval_manager.run_tests()` to run the tests.
- `eval_manager.tests.save(Path("test_results.json"))` to save the test results.
```python
from pathlib import Path

from continuous_eval.eval.manager import eval_manager
from examples.langchain.simple_rag.pipeline import pipeline

if __name__ == "__main__":
    eval_manager.set_pipeline(pipeline)

    # Evaluation
    eval_manager.evaluation.load(Path("results.jsonl"))
    eval_manager.run_metrics()
    eval_manager.metrics.save(Path("metrics_results.json"))

    # Tests
    agg = eval_manager.metrics.aggregate()
    print(agg)
    eval_manager.run_tests()
    eval_manager.tests.save(Path("test_results.json"))

    for module_name, test_results in eval_manager.tests.results.items():
        print(f"{module_name}")
        for test_name in test_results:
            print(f" - {test_name}: {test_results[test_name]}")
```
Aggregate metric results:
```
{
  'retriever': {
    'context_precision': 0.11000000000000003,
    'context_recall': 0.9,
    'context_f1': 0.19160839160839163
  },
  'reranker': {
    'average_precision': 0.9,
    'reciprocal_rank': 0.9,
    'ndcg': 0.9
  },
  'llm': {
    'LLM_based_faithfulness': 0.6,
    'deberta_answer_entailment': 0.19980023323787463,
    'deberta_answer_contradiction': 0.0013673592702616588,
    'flesch_reading_ease': 34.73568247020485,
    'flesch_kincaid_grade_level': 12.84469401321808
  }
}
```
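If you want to track these numbers over time, the aggregate dictionary can be flattened into rows. Below is a sketch that continues the script above (after `eval_manager.run_metrics()`); the CSV file name is illustrative.

```python
import csv

# Continuing the script above: flatten {module: {metric: value}} into
# (module, metric, value) rows and append them to a history file.
agg = eval_manager.metrics.aggregate()
rows = [
    (module, metric, value)
    for module, metrics in agg.items()
    for metric, value in metrics.items()
]
with open("metrics_history.csv", "a", newline="") as f:  # illustrative file name
    csv.writer(f).writerows(rows)
```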
Test results:
```
retriever
 - Average Precision: True
reranker
 - Context Recall: True
llm
 - Deberta Entailment: False
```
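Test results are convenient for gating a CI job. The sketch below continues the script above and assumes that `eval_manager.tests.results` maps each module name to a dictionary of test name to boolean, as the print loop earlier suggests.

```python
import sys

# Continuing the script above, after eval_manager.run_tests():
# exit non-zero if any module test failed.
failed = [
    f"{module}/{test}"
    for module, results in eval_manager.tests.results.items()
    for test, passed in results.items()
    if not passed
]
if failed:
    print("Failing tests:", ", ".join(failed))
    sys.exit(1)
```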