# Max Score

The `max-score` assertion selects the output with the highest aggregate score from other assertions. Unlike `select-best`, which uses LLM judgment, `max-score` provides objective, deterministic selection based on quantitative scores from other assertions.
## When to use max-score

Use `max-score` when you want to:

- Select the best output based on objective, measurable criteria
- Combine multiple metrics with different importance (weights)
- Have transparent, reproducible selection without LLM API calls
- Select outputs based on a combination of correctness, quality, and other metrics
## How it works

1. All regular assertions run first on each output
2. `max-score` collects the scores from these assertions
3. It calculates an aggregate score for each output (average by default)
4. It selects the output with the highest aggregate score
5. It returns `pass=true` for the highest-scoring output and `pass=false` for the others
## Basic usage

```yaml
prompts:
  - 'Write a function to {{task}}'
  - 'Write an efficient function to {{task}}'
  - 'Write a well-documented function to {{task}}'

providers:
  - openai:gpt-4

tests:
  - vars:
      task: 'calculate fibonacci numbers'
    assert:
      # Regular assertions that score each output
      - type: python
        value: 'assert fibonacci(10) == 55'
      - type: llm-rubric
        value: 'Code is efficient'
      - type: contains
        value: 'def fibonacci'
      # Max-score selects the output with highest average score
      - type: max-score
```
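Each prompt variant produces its own output, and the three scoring assertions run against each of them. As a hypothetical illustration: if the outputs scored 1.0/0.6/1.0, 1.0/0.9/1.0, and 0.0/0.8/1.0 on the `python`, `llm-rubric`, and `contains` assertions respectively, their averages would be about 0.87, 0.97, and 0.60, so `max-score` would mark the second output as passing and the other two as failing.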
## Configuration options

### Aggregation method

Choose how scores are combined:

```yaml
assert:
  - type: max-score
    value:
      method: average # Options: average (default) | sum
```
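Within a single test, `average` and `sum` rank the outputs the same way, since the same assertions (and therefore the same total weight) apply to every output. The practical difference is the scale of the aggregate score, which matters if you also set a `threshold` (see below).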
### Weighted scoring

Give different importance to different assertions by specifying weights per assertion type:

```yaml
assert:
  - type: python # Test correctness
  - type: llm-rubric # Test quality
    value: 'Well documented'
  - type: max-score
    value:
      weights:
        python: 3 # Correctness is 3x more important
        llm-rubric: 1 # Documentation has 1x weight
```
#### How weights work

- Each assertion type can have a custom weight (default: 1.0)
- For `method: average`, the final score is `sum(score × weight) / sum(weights)`
- For `method: sum`, the final score is `sum(score × weight)`
- Weights apply to all assertions of that type
Example calculation with `method: average`:

```
Output A: python=1.0, llm-rubric=0.5, contains=1.0
Weights:  python=3, llm-rubric=1, contains=1 (default)

Score = (1.0×3 + 0.5×1 + 1.0×1) / (3 + 1 + 1)
      = (3.0 + 0.5 + 1.0) / 5
      = 0.9
```
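For comparison, `method: sum` would score the same output as 1.0×3 + 0.5×1 + 1.0×1 = 4.5, with no division by the total weight.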
### Minimum threshold

Require a minimum score for selection:

```yaml
assert:
  - type: max-score
    value:
      threshold: 0.7 # Only select if average score >= 0.7
```
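If no output reaches the threshold, none is selected (see Edge cases below).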
## Scoring details

- Binary assertions (pass/fail): scored as 1.0 or 0.0
- Scored assertions: use the numeric score (typically in the 0-1 range)
- Default weights: 1.0 for all assertions
- Tie breaking: first output wins (deterministic)
## Examples

### Example 1: Multi-criteria code selection

```yaml
prompts:
  - 'Write a Python function to {{task}}'
  - 'Write an optimized Python function to {{task}}'
  - 'Write a documented Python function to {{task}}'

providers:
  - openai:gpt-4o-mini

tests:
  - vars:
      task: 'merge two sorted lists'
    assert:
      - type: python
        value: |
          list1 = [1, 3, 5]
          list2 = [2, 4, 6]
          result = merge_lists(list1, list2)
          assert result == [1, 2, 3, 4, 5, 6]
      - type: llm-rubric
        value: 'Code has O(n+m) time complexity'
      - type: llm-rubric
        value: 'Code is well documented with a docstring'
      - type: max-score
        value:
          weights:
            python: 3 # Correctness most important
            llm-rubric: 1 # Each quality metric has weight 1
```
### Example 2: Content generation selection

```yaml
prompts:
  - 'Explain {{concept}} simply'
  - 'Explain {{concept}} in detail'
  - 'Explain {{concept}} with examples'

providers:
  - anthropic:claude-3-haiku-20240307

tests:
  - vars:
      concept: 'machine learning'
    assert:
      - type: llm-rubric
        value: 'Explanation is accurate'
      - type: llm-rubric
        value: 'Explanation is clear and easy to understand'
      - type: contains
        value: 'example'
      - type: max-score
        value:
          method: average # All criteria equally important
```
### Example 3: API response selection

```yaml
tests:
  - vars:
      query: 'weather in Paris'
    assert:
      - type: is-json
      - type: contains-json
        value:
          required: ['temperature', 'humidity', 'conditions']
      - type: llm-rubric
        value: 'Response includes all requested weather data'
      - type: latency
        threshold: 1000 # Under 1 second
      - type: max-score
        value:
          weights:
            is-json: 2 # Must be valid JSON
            contains-json: 2 # Must have required fields
            llm-rubric: 1 # Quality check
            latency: 1 # Performance matters
```
## Comparison with select-best

| Feature          | `max-score`                      | `select-best`        |
| ---------------- | -------------------------------- | -------------------- |
| Selection method | Aggregate scores from assertions | LLM judgment         |
| API calls        | None (uses existing scores)      | One per eval         |
| Reproducibility  | Deterministic                    | May vary             |
| Best for         | Objective criteria               | Subjective criteria  |
| Transparency     | Shows exact scores               | Shows LLM reasoning  |
| Cost             | Free (no API calls)              | Costs per API call   |
## Edge cases

- No other assertions: an error is raised; `max-score` requires at least one assertion to aggregate
- Tie scores: the first output wins (by index)
- All outputs fail: still selects the highest scorer ("least bad")
- Below threshold: no output is selected if a threshold is specified and not met
## Tips

- Use specific assertions: more assertions provide better signal for selection
- Weight important criteria: use weights to emphasize what matters most
- Combine with select-best: you can use both in the same test for comparison (see the sketch below)
- Debug with scores: the output shows aggregate scores for transparency
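To illustrate the last tip, here is a minimal sketch (reusing the fibonacci task from the basic usage example; the `select-best` criterion string is just an example) that runs both selection assertions in the same test so you can compare their picks:

```yaml
tests:
  - vars:
      task: 'calculate fibonacci numbers'
    assert:
      # Scoring assertions shared by both selectors
      - type: python
        value: 'assert fibonacci(10) == 55'
      - type: llm-rubric
        value: 'Code is efficient'
      # Deterministic selection based on the scores above
      - type: max-score
      # LLM-judged selection, for comparison
      - type: select-best
        value: 'Choose the most readable and efficient implementation'
```

If the two selectors disagree, that is a useful hint that your objective metrics and the LLM's judgment are emphasizing different things.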
## Further reading
- Model-graded metrics for other model-based assertions
- Select best for subjective selection
- Assertions for all available assertion types