Evaluation Data

Measuring Success: The Role of Evaluations

Given the substantial investment in model training, tracking improvement becomes crucial. Evaluation datasets (“evals”) provide standardized metrics for specific capabilities, enabling teams to make data-driven decisions about model development and deployment.

Evaluations are not just metrics—they’re your compass for model improvement. Without proper evals, you’re flying blind.

Structure of Evaluation Datasets

Each evaluation dataset contains three core components:

Test Prompts

Questions or tasks that test specific model capabilities

Ground Truth

Correct answers or validation criteria for scoring

Scoring Logic

Mechanisms to evaluate model outputs against expectations

Key Evaluation Challenges

Benchmark Limitations

Standard benchmarks come with inherent challenges:

Common Benchmarks
Contamination Risks
Coverage Gaps

MMLU (Massive Multitask Language Understanding)

57 subjects across STEM, humanities, social sciences
Multiple-choice format
Risk: May not reflect real-world application needs

GSM8K (Grade School Math)

8,500 grade school math problems
Tests mathematical reasoning
Risk: Limited to basic arithmetic scenarios

HumanEval

164 Python programming problems
Tests code generation abilities
Risk: Doesn’t cover all programming paradigms

Chatbot Arena

Human preference comparisons
Real-world conversation quality
Risk: Subjective and resource-intensive

Creating Custom Evaluations

Define Evaluation Objectives

Clearly articulate what capabilities you’re measuring:

Specific skills (e.g., SQL generation, medical diagnosis)
Quality attributes (accuracy, safety, style)
User satisfaction metrics
Business-specific KPIs

Design Test Cases

Create comprehensive test suites:

Easy cases: Baseline functionality
Medium cases: Typical use scenarios
Hard cases: Edge cases and complex reasoning
Adversarial cases: Potential failure modes

Establish Scoring Criteria

Develop clear, reproducible scoring methods:

Automated metrics where possible
Human evaluation rubrics when needed
Combination approaches for nuanced tasks

Validate and Iterate

Ensure evaluation quality:

Test on known good/bad models
Verify inter-rater agreement
Adjust based on initial results
Document all assumptions

Human Evaluation Approaches

Likert Scale Rating

Annotators rate outputs on scales (1-5, 1-7):

Helpfulness
Accuracy
Relevance
Clarity

Pairwise Comparison

Compare two model outputs:

Which is better?
By how much?
On what criteria?

Error Categorization

Classify types of failures:

Factual errors
Logic errors
Style issues
Safety violations

Task Completion

Binary success metrics:

Did it solve the problem?
Is the answer usable?
Does it meet requirements?

Work with MangoDesk

Types of data

Measuring Success: The Role of Evaluations

Structure of Evaluation Datasets

Test Prompts

Ground Truth

Scoring Logic

Key Evaluation Challenges

Benchmark Limitations

Creating Custom Evaluations

Human Evaluation Approaches

Likert Scale Rating

Pairwise Comparison

Error Categorization

Task Completion

Work with MangoDesk

Types of data

​Measuring Success: The Role of Evaluations

​Structure of Evaluation Datasets

Test Prompts

Ground Truth

Scoring Logic

​Key Evaluation Challenges

​Benchmark Limitations

​Creating Custom Evaluations

​Human Evaluation Approaches

Likert Scale Rating

Pairwise Comparison

Error Categorization

Task Completion

Measuring Success: The Role of Evaluations

Structure of Evaluation Datasets

Key Evaluation Challenges

Benchmark Limitations

Creating Custom Evaluations

Human Evaluation Approaches