Pi Scorer
pi is an alternative approach to model grading that uses a dedicated scoring model instead of the "LLM as a judge" technique. It evaluates input and output pairs against criteria you specify.
Important: Unlike llm-rubric, which works with your existing providers, Pi requires a separate external API key from Pi Labs.
Alternative Approach
Pi offers a different approach to evaluation with some distinct characteristics:
- Uses a dedicated scoring model rather than prompting an LLM to act as a judge
- Focuses on highly accurate numeric scoring without providing detailed reasoning
- Aims for consistency in scoring the same inputs
- Requires a separate API key and integration
Each approach has different strengths, and you may want to experiment with both to determine which best suits your specific evaluation needs.
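For example, you can run both graders side by side on the same test and compare their scores. A minimal sketch (the test and criterion here are illustrative):

tests:
  - vars:
      question: What is the capital of France?
    assert:
      # Pi's dedicated scoring model
      - type: pi
        value: Answers the question directly and accurately
      # LLM-as-a-judge, using your existing provider
      - type: llm-rubric
        value: Answers the question directly and accurately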
Prerequisites
To use Pi, you must first:
- Create a Pi API key from Pi Labs
- Set the WITHPI_API_KEY environment variable
export WITHPI_API_KEY=your_api_key_here
Alternatively, set it in the env block of your promptfoo config:
env:
  WITHPI_API_KEY: your_api_key_here
How to use it
To use the pi assertion type, add it to your test configuration:
assert:
  - type: pi
    # Specify the criteria for grading the LLM output
    value: Does the response avoid apologizing and provide a clear, concise answer?
This assertion will use the Pi scorer to grade the output based on the specified criteria.
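To apply the same criteria to every test in your config, you can also attach the assertion via defaultTest (the criterion below is illustrative):

defaultTest:
  assert:
    - type: pi
      value: Does the response avoid apologizing and provide a clear, concise answer?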
How it works
Under the hood, the pi assertion uses the withpi SDK to evaluate the output based on the criteria you provide.
Compared to LLM as a judge:
- The inputs of the eval are the same: llm_input and llm_output
- Pi does not need a system prompt; it is pretrained to score
- Pi always generates the same score when given the same input
- Pi requires a separate API key (see Prerequisites section)
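Concretely, in a typical promptfoo eval the rendered prompt is passed to Pi as llm_input and the provider's response as llm_output. An annotated sketch (the prompt, provider, and test data are illustrative):

prompts:
  - 'Summarize the following text: {{text}}' # the rendered prompt becomes llm_input
providers:
  - openai:gpt-4.1 # the model's response becomes llm_output
tests:
  - vars:
      text: Promptfoo is a tool for evaluating LLM outputs.
    assert:
      - type: pi # scores the (llm_input, llm_output) pair against the criteria
        value: Is the summary faithful to the original text?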
Threshold Support
The pi assertion type supports an optional threshold property that sets a minimum score requirement. When specified, the output must achieve a score greater than or equal to the threshold to pass.
assert:
  - type: pi
    value: Is not apologetic and provides a clear, concise answer
    threshold: 0.8 # Requires a score of 0.8 or higher to pass
The default threshold is 0.5 if not specified.
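For example, these two assertions behave identically, since an omitted threshold defaults to 0.5:

assert:
  - type: pi
    value: Stays on topic throughout the response
    # No threshold specified, so the default of 0.5 applies
  - type: pi
    value: Stays on topic throughout the response
    threshold: 0.5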
Metrics Brainstorming
You can use the Pi Labs Copilot to interactively brainstorm representative metrics for your application. It helps you:
- Generate effective evaluation criteria
- Test metrics on example outputs before integration
- Find the optimal threshold values for your use case
Example Configuration
prompts:
  - 'Explain {{concept}} in simple terms.'
providers:
  - openai:gpt-4.1
tests:
  - vars:
      concept: quantum computing
    assert:
      - type: pi
        value: Is the explanation easy to understand without technical jargon?
        threshold: 0.7
      - type: pi
        value: Does the response correctly explain the fundamental principles?
        threshold: 0.8
See Also
- LLM Rubric
- Model-graded metrics
- Pi Documentation for more options, configuration, and calibration details