Compare evaluation results
After running multiple experiments, you can compare their results to see which prompt, model, or configuration performs best.
Once you submit your experiments, you can view them in the UI under the Project > Experiments Tab.
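Conceptually, comparing experiments amounts to aggregating per-sample metric scores for each run and putting the summaries side by side, which is what the Experiments Tab surfaces. The sketch below is a minimal, hypothetical illustration of that idea in Python using pandas; the experiment names, metric column, and scores are assumed example data, not output from any specific SDK.

import pandas as pd

# Hypothetical per-sample scores from two experiments (assumed data, not real output).
results = pd.DataFrame(
    {
        "experiment": ["baseline"] * 3 + ["new-prompt"] * 3,
        "sample_id": [1, 2, 3, 1, 2, 3],
        "correctness": [0.6, 0.7, 0.5, 0.8, 0.9, 0.7],
    }
)

# Aggregate per experiment to see which configuration performs best overall.
summary = results.groupby("experiment")["correctness"].agg(["mean", "min", "max"])
print(summary)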
Running experiments (or evaluations) is a systematic way to measure the performance of an AI system across a fixed set of samples. By altering prompts, models, or hyperparameters, you can observe how different settings affect performance. Experiments can be run on single data points or on entire datasets to quickly understand the effect of a change across a variety of scenarios.
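As a rough sketch of what such a sweep looks like, the snippet below runs two prompt variants over a small dataset and scores each output. The dataset, prompt templates, and the run_model and score_sample helpers are hypothetical placeholders standing in for your model call and metric, not part of any real API.

# Minimal sketch of an experiment sweep over prompt variants (assumed helpers and data).
dataset = [
    {"input": "2 + 2", "expected": "4"},
    {"input": "Capital of France", "expected": "Paris"},
]

prompt_variants = {
    "terse": "Answer briefly: {input}",
    "step_by_step": "Think step by step, then answer: {input}",
}

def run_model(prompt: str) -> str:
    # Placeholder for a real model call (e.g., an LLM API request).
    return "4" if "2 + 2" in prompt else "Paris"

def score_sample(output: str, expected: str) -> float:
    # Simple exact-match metric; real experiments can use richer metrics.
    return 1.0 if output.strip() == expected else 0.0

for name, template in prompt_variants.items():
    scores = [
        score_sample(run_model(template.format(input=s["input"])), s["expected"])
        for s in dataset
    ]
    print(f"{name}: accuracy={sum(scores) / len(scores):.2f}")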
The Runtime Monitor feature lets you evaluate results on production data in real time. Use reference-free metrics in runtime monitors: because results are evaluated on the fly, there are no reference outputs in a dataset to compare against.
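To make the distinction concrete, here is a minimal sketch of a reference-free check of the kind a monitor could apply: it scores a production response on properties of the output alone (a simple non-empty, length, and refusal heuristic), with no expected answer involved. The function, heuristics, and threshold are illustrative assumptions, not a built-in metric.

# Hypothetical reference-free check: it inspects only the model's response,
# so no reference output from a dataset is needed.
def reference_free_score(response: str) -> dict:
    is_refusal = response.lower().startswith(("i can't", "i cannot", "sorry"))
    return {
        "non_empty": 1.0 if response.strip() else 0.0,
        "length_ok": 1.0 if len(response) <= 2000 else 0.0,  # assumed length limit
        "answered": 0.0 if is_refusal else 1.0,
    }

# Example: score a live production response as it passes through the monitor.
print(reference_free_score("Sorry, I cannot help with that."))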