# Pi Scorer

`pi` is an alternative approach to model grading that uses a dedicated scoring model instead of the "LLM as a judge" technique. It can evaluate input and output pairs against your criteria.
**Important:** Unlike `llm-rubric`, which works with your existing providers, Pi requires a separate external API key from Pi Labs.
## Alternative Approach
Pi offers a different approach to evaluation with some distinct characteristics:
- Uses a dedicated scoring model rather than prompting an LLM to act as a judge
- Focuses on highly accurate numeric scoring without providing detailed reasoning
- Aims for consistency in scoring the same inputs
- Requires a separate API key and integration
Each approach has different strengths, and you may want to experiment with both to determine which best suits your specific evaluation needs.
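For example, here is a minimal config sketch (the criterion text is illustrative) that runs both graders against the same output, making it easy to compare their scores during such an experiment:

```yaml
# Grade the same output with both approaches to compare results.
assert:
  # Pi's dedicated scoring model (requires WITHPI_API_KEY)
  - type: pi
    value: Is the response clear and concise?
  # LLM-as-a-judge rubric, graded by your existing provider
  - type: llm-rubric
    value: Is the response clear and concise?
```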
## Prerequisites
To use Pi, you must first:

- Create a Pi API key from Pi Labs
- Set the `WITHPI_API_KEY` environment variable:

```sh
export WITHPI_API_KEY=your_api_key_here
```

or set it under `env` in your promptfoo config:

```yaml
env:
  WITHPI_API_KEY: your_api_key_here
```
## How to use it
To use the `pi` assertion type, add it to your test configuration:

```yaml
assert:
  - type: pi
    # Specify the criteria for grading the LLM output
    value: Is the response not apologetic and provides a clear, concise answer?
```
This assertion will use the Pi scorer to grade the output based on the specified criteria.
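To apply the same criterion across your whole suite, you can also place the assertion under promptfoo's standard `defaultTest` block; a minimal sketch (the criterion text is illustrative):

```yaml
# Apply a Pi assertion to every test case in the config.
defaultTest:
  assert:
    - type: pi
      value: Is the response free of unnecessary jargon?
```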
## How it works
Under the hood, the `pi` assertion uses the `withpi` SDK to evaluate the output against the criteria you provide.

Compared to LLM as a judge:

- The inputs of the eval are the same: `llm_input` and `llm_output`
- Pi does not need a system prompt and is pretrained to score
- Pi always generates the same score when given the same input
- Pi requires a separate API key (see the Prerequisites section)
## Threshold Support
The `pi` assertion type supports an optional `threshold` property that sets a minimum score requirement. When specified, the output must achieve a score greater than or equal to the threshold to pass.
```yaml
assert:
  - type: pi
    value: Is not apologetic and provides a clear, concise answer
    threshold: 0.8 # Requires a score of 0.8 or higher to pass
```
The default threshold is `0.5` if not specified.
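Because Pi returns a numeric score, it also combines naturally with promptfoo's `metric` property for named metrics; a sketch (the metric name and criterion are illustrative):

```yaml
assert:
  - type: pi
    value: Does the answer include a concrete example?
    threshold: 0.7
    metric: concreteness # Aggregates this score under a named metric across tests
```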
## Metrics Brainstorming
You can use the Pi Labs Copilot to interactively brainstorm representative metrics for your application. It helps you:
- Generate effective evaluation criteria
- Test metrics on example outputs before integration
- Find the optimal threshold values for your use case
## Example Configuration
```yaml
prompts:
  - 'Explain {{concept}} in simple terms.'

providers:
  - openai:gpt-4o

tests:
  - vars:
      concept: quantum computing
    assert:
      - type: pi
        value: Is the explanation easy to understand without technical jargon?
        threshold: 0.7
      - type: pi
        value: Does the response correctly explain the fundamental principles?
        threshold: 0.8
```
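Run the eval as usual (for example, with `npx promptfoo@latest eval`); each `pi` assertion passes or fails based on its score relative to its threshold.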
## See Also
- LLM Rubric
- Model-graded metrics
- Pi Documentation for more options, configuration, and calibration details