Testing Humanity's Last Exam with Promptfoo

Humanity's Last Exam (HLE) is a challenging benchmark commissioned by Scale AI and the Center for AI Safety (CAIS), developed by 1,000+ subject experts from over 500 institutions across 50 countries. Created to address benchmark saturation where current models achieve 90%+ accuracy on MMLU, HLE presents genuinely difficult expert-level questions that test AI capabilities at the frontier of human knowledge.

This guide shows you how to:

  • Set up HLE evals with promptfoo
  • Configure reasoning models for HLE questions
  • Analyze real performance data from Claude 4 and o4-mini
  • Understand model limitations on challenging benchmarks

About Humanity's Last Exam

HLE addresses benchmark saturation - the phenomenon where advanced models achieve over 90% accuracy on existing tests like MMLU, making it difficult to measure continued progress. HLE provides a more challenging eval for current AI systems.

Key characteristics:

  • Created by 1,000+ PhD-level experts across 500+ institutions
  • Covers 100+ subjects from mathematics to humanities
  • 14% of questions include images alongside text
  • Questions resist simple web search solutions
  • Focuses on verifiable, closed-ended problems

Current model performance:

| Model                | Accuracy | Notes                       |
|----------------------|----------|-----------------------------|
| OpenAI Deep Research | 26.6%    | With search capabilities    |
| o4-mini              | ~13%     | Official benchmark results  |
| DeepSeek-R1          | 8.5%     | Text-only evaluation        |
| o1                   | 8.0%     | Previous generation         |
| Gemini 2.0 Flash     | 6.6%     | Multimodal support          |
| Claude 3.5 Sonnet    | 4.1%     | Base model                  |

Official model performance on full HLE dataset

Running the Eval

Set up your HLE eval with these commands:

npx promptfoo@latest init --example huggingface-hle
cd huggingface-hle
npx promptfoo@latest eval

See the complete example at examples/huggingface-hle for all configuration files and implementation details.

Set these API keys before running:

  • OPENAI_API_KEY - for o4-mini and GPT models
  • ANTHROPIC_API_KEY - for Claude 4 with thinking mode
  • HF_TOKEN - get yours from huggingface.co/settings/tokens

Promptfoo handles dataset loading, parallel execution, cost tracking, and results analysis automatically.

License and Safety

HLE is released under the MIT license. The dataset includes a canary string to help model builders filter it from training data. Images in the dataset may contain copyrighted material. Review your AI provider's policies regarding image content before running evaluations with multimodal models.

Eval Results

After your eval completes, open the web interface:

npx promptfoo@latest view

Promptfoo generates a summary report showing token usage, costs, success rates, and performance metrics:

HLE Evaluation Results

We tested Claude 4 and o4-mini on 50 HLE questions using promptfoo with optimized configurations to demonstrate real-world performance. Note that our results differ from official benchmarks due to different prompting strategies, token budgets, and question sampling.

Model Comparison on Bioinformatics Question

This example shows both models attempting a complex bioinformatics question. The interface displays complete reasoning traces and comparative analysis.

Performance summary (50 questions per model, 100 total test cases):

  • Combined pass rate: 28% (28 successes across both models)
  • Runtime: 9 minutes with 20 concurrent workers
  • Token usage: Approximately 237K tokens for 100 test cases

The models showed different performance characteristics:

| Model    | Success Rate | Token Usage | Total Cost (50 questions) | Avg Latency |
|----------|--------------|-------------|---------------------------|-------------|
| o4-mini  | 42% (21/50)  | 139,580     | $0.56                     | 17.6s       |
| Claude 4 | 14% (7/50)   | 97,552      | $1.26                     | 28.8s       |
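
For a concrete sense of the tradeoff, per-question figures can be derived directly from the table above. The short Python check below simply recomputes the reported pass rates and divides each model's total cost by its 50 questions; all input numbers are copied from our run.

# Back-of-the-envelope check of the figures reported above
# (pass counts, totals, and costs are copied from the results table).
o4_mini = {"passes": 21, "total": 50, "cost_usd": 0.56}
claude_4 = {"passes": 7, "total": 50, "cost_usd": 1.26}

combined = (o4_mini["passes"] + claude_4["passes"]) / (o4_mini["total"] + claude_4["total"])
print(f"Combined pass rate: {combined:.0%}")  # 28%

for name, m in [("o4-mini", o4_mini), ("Claude 4", claude_4)]:
    print(f"{name}: {m['passes'] / m['total']:.0%} pass rate, "
          f"${m['cost_usd'] / m['total']:.3f} per question")
# o4-mini: 42% pass rate, $0.011 per question
# Claude 4: 14% pass rate, $0.025 per question

In other words, o4-mini passed three times as many questions at less than half the per-question cost, though with the caveats discussed under Eval Limitations below.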

The interface provides:

  • Question-by-question breakdown with full reasoning traces
  • Token usage and cost analysis
  • Side-by-side model comparison with diff highlighting
  • Performance analytics by subject area

Prompt Engineering for HLE

To handle images across different AI providers, we wrote a custom prompt function in Python. OpenAI uses image_url format while Anthropic/Claude requires base64 source format.

The rendered prompts look like this:

- role: system
  content: |
    Your response should be in the following format:
    Explanation: {your explanation for your answer choice}
    Answer: {your chosen answer}
    Confidence: {your confidence score between 0% and 100% for your answer}
- role: user
  content: |
    Which condition of Arrhenius's sixth impossibility theorem do critical views violate?

    Options:
    A) Weak Non-Anti-Egalitarianism
    B) Non-Sadism
    C) Transitivity
    D) Completeness

The Python approach enables provider-specific adaptations:

  • OpenAI models: Use the image_url format for images and the developer role for o1/o3 reasoning models
  • Anthropic models: Convert images to the base64 source format for Claude compatibility
  • Response structure: A standardized format with explanation, answer, and confidence scoring
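
For reference, here is a minimal sketch of what such a prompt function might look like. It assumes promptfoo's Python prompt-function interface (a function that receives a context dict carrying the test vars and the active provider, and returns a list of chat messages); the variable names and image handling below are illustrative rather than the exact code from the example.

# hle_prompt.py - illustrative sketch of a provider-aware prompt function.
# Assumes the function receives a context dict with "vars" (test case
# variables) and "provider" (the provider in use) and returns chat messages.
# Variable names such as "question" and "image" are hypothetical.

SYSTEM_PROMPT = (
    "Your response should be in the following format:\n"
    "Explanation: {your explanation for your answer choice}\n"
    "Answer: {your chosen answer}\n"
    "Confidence: {your confidence score between 0% and 100% for your answer}"
)

def hle_prompt(context: dict) -> list:
    variables = context.get("vars", {})
    provider_id = str(context.get("provider", {}).get("id", ""))
    question = variables.get("question", "")
    image_b64 = variables.get("image")  # hypothetical: base64-encoded image, if any

    user_content = [{"type": "text", "text": question}]
    if image_b64:
        if provider_id.startswith("anthropic:"):
            # Claude expects images as base64 "source" blocks.
            user_content.append({
                "type": "image",
                "source": {"type": "base64", "media_type": "image/png", "data": image_b64},
            })
        else:
            # OpenAI-style providers accept an image_url content part.
            user_content.append({
                "type": "image_url",
                "image_url": {"url": f"data:image/png;base64,{image_b64}"},
            })

    # o-series reasoning models (o1/o3/o4) take instructions via the
    # "developer" role instead of "system" (rough heuristic on the id).
    system_role = "developer" if provider_id.startswith("openai:o") else "system"
    return [
        {"role": system_role, "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_content},
    ]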

Automated Grading

Promptfoo uses LLM-as-a-judge for automated grading with the built-in llm-rubric assertion. This approach evaluates model responses against the expected answers without requiring exact string matches.

The grading system:

  • Uses a configured judge model to verify answer correctness
  • Accounts for equivalent formats (decimals vs fractions, different notation styles)
  • Handles both multiple-choice and exact-match question types
  • Provides consistent scoring across different response styles

Here's how to configure the grading assertion:

defaultTest:
  assert:
    - type: llm-rubric
      value: |
        Evaluate whether the response correctly answers the question.

        Question: {{ question }}
        Model Response: {{ output }}
        Correct Answer: {{ answer }}

        Grade the response on accuracy (0.0 to 1.0 scale):
        - 1.0: Response matches the correct answer exactly or is mathematically/logically equivalent
        - 0.8-0.9: Response is mostly correct with minor differences that don't affect correctness
        - 0.5-0.7: Response is partially correct but has significant errors
        - 0.0-0.4: Response is incorrect or doesn't address the question

        The response should pass if it demonstrates correct understanding and provides the right answer, even if the explanation differs from the expected format.

This automated approach scales well for large evaluations while maintaining accuracy comparable to human grading on HLE's objective, closed-ended questions.
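
If you want a deterministic cross-check alongside the judge, for instance on multiple-choice questions where the prompt format already pins the answer to an "Answer:" line, promptfoo also supports Python assertions. The sketch below assumes that interface (a file exposing get_assert(output, context), referenced from the config via a python-type assertion); the parsing logic is illustrative, not part of the shipped example.

# grade_answer.py - illustrative sketch of a deterministic answer check that
# could complement the llm-rubric judge on multiple-choice questions.
# Assumes promptfoo's Python assertion interface: the file exposes
# get_assert(output, context), and context["vars"] carries the test variables.
import re

def get_assert(output: str, context: dict) -> bool:
    expected = str(context["vars"].get("answer", "")).strip().lower()
    # The prompt format requests a line of the form "Answer: <choice>".
    match = re.search(r"^answer:\s*(.+)$", output, re.IGNORECASE | re.MULTILINE)
    if not match:
        return False
    predicted = match.group(1).strip().lower()
    # Accept either the bare choice ("b") or choice plus text ("b) non-sadism").
    return predicted == expected or predicted.startswith(expected)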

Customization Options

Key settings:

  • 3K thinking tokens (Claude): Tradeoff between cost and reasoning capability - more tokens may improve accuracy
  • 4K max tokens: Allows detailed explanations without truncation
  • 50 questions: Sample size chosen for this demonstration - scale up for production evals
  • Custom prompts: Can be further optimized for specific models and question types

Test more questions:

tests:
  - huggingface://datasets/cais/hle?split=test&limit=200

Add more models:

providers:
  - anthropic:claude-sonnet-4-20250514
  - openai:o4-mini
  - deepseek:deepseek-reasoner

Increase reasoning budget:

providers:
  - id: anthropic:claude-sonnet-4-20250514
    config:
      thinking:
        budget_tokens: 8000 # For complex proofs
      max_tokens: 12000
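
Note that Claude's thinking tokens count toward max_tokens, so keep budget_tokens comfortably below the max_tokens ceiling to leave room for the final answer.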

Eval Limitations

Keep in mind that these results are preliminary: we tested only 50 questions per model in a single run. That's a small fraction of the full HLE dataset, and we didn't optimize our approach much (token budgets, prompts, and other settings were chosen somewhat arbitrarily).

o4-mini's 42% success rate stands out, sitting well above its ~13% score on the official benchmark, and should be validated with larger samples and multiple runs. Performance will likely vary considerably across different subjects and question formats.

Implications for AI Development

HLE provides a useful benchmark for measuring AI progress on academic tasks. The low current scores indicate significant room for improvement in AI reasoning capabilities.

As Dan Hendrycks (CAIS co-founder) notes:

"When I released the MATH benchmark in 2021, the best model scored less than 10%; few predicted that scores higher than 90% would be achieved just three years later. Right now, Humanity's Last Exam shows there are still expert questions models cannot answer. We will see how long that lasts."

Key findings:

  • Current reasoning models achieve modest performance on HLE questions
  • Success varies significantly by domain and question type
  • Token budget increases alone don't guarantee accuracy improvements
  • Substantial gaps remain between AI and human expert performance

Promptfoo provides HLE eval capabilities through automated dataset integration, parallel execution, and comprehensive results analysis.

Learn More

Official Resources

Analysis and Coverage

Promptfoo Integration