# Pi Scorer

`pi` is an alternative approach to model grading that uses a dedicated scoring model instead of the "LLM as a judge" technique. It can evaluate input and output pairs against your criteria.
**Important:** Unlike `llm-rubric`, which works with your existing providers, Pi requires a separate external API key from Pi Labs.
## Alternative Approach
Pi offers a different approach to evaluation with some distinct characteristics:
- Uses a dedicated scoring model rather than prompting an LLM to act as a judge
- Focuses on highly accurate numeric scoring without providing detailed reasoning
- Aims for consistency in scoring the same inputs
- Requires a separate API key and integration
Each approach has different strengths, and you may want to experiment with both to determine which best suits your specific evaluation needs.
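For example, here is a minimal config sketch (the criterion text is illustrative) that runs both graders against the same output, making it easy to compare their scores during such an experiment:

```yaml
# Grade the same output with both approaches to compare results.
assert:
  # Pi's dedicated scoring model (requires WITHPI_API_KEY)
  - type: pi
    value: Is the response clear and concise?
  # LLM-as-a-judge rubric, graded by your existing provider
  - type: llm-rubric
    value: Is the response clear and concise?
```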
## Prerequisites
To use Pi, you must first:

- Create a Pi API key from Pi Labs
- Set the `WITHPI_API_KEY` environment variable:

```sh
export WITHPI_API_KEY=your_api_key_here
```

or set it under `env` in your promptfoo config:

```yaml
env:
  WITHPI_API_KEY: your_api_key_here
```
## How to use it
To use the `pi` assertion type, add it to your test configuration:

```yaml
assert:
  - type: pi
    # Specify the criteria for grading the LLM output
    value: Is the response not apologetic and provides a clear, concise answer?
```
This assertion will use the Pi scorer to grade the output based on the specified criteria.
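To apply the same criterion across your whole suite, you can also place the assertion under promptfoo's standard `defaultTest` block; a minimal sketch (the criterion text is illustrative):

```yaml
# Apply a Pi assertion to every test case in the config.
defaultTest:
  assert:
    - type: pi
      value: Is the response free of unnecessary jargon?
```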
## How it works
Under the hood, the `pi` assertion uses the `withpi` SDK to evaluate the output against the criteria you provide.

Compared to LLM as a judge:

- The inputs of the eval are the same: `llm_input` and `llm_output`
- Pi does not need a system prompt and is pretrained to score
- Pi always generates the same score when given the same input
- Pi requires a separate API key (see the Prerequisites section)
## Threshold Support
The `pi` assertion type supports an optional `threshold` property that sets a minimum score requirement. When specified, the output must achieve a score greater than or equal to the threshold to pass.
```yaml
assert:
  - type: pi
    value: Is not apologetic and provides a clear, concise answer
    threshold: 0.8 # Requires a score of 0.8 or higher to pass
```
The default threshold is `0.5` if not specified.
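Because Pi returns a numeric score, it also combines naturally with promptfoo's `metric` property for named metrics; a sketch (the metric name and criterion are illustrative):

```yaml
assert:
  - type: pi
    value: Does the answer include a concrete example?
    threshold: 0.7
    metric: concreteness # Aggregates this score under a named metric across tests
```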
## Metrics Brainstorming
You can use the Pi Labs Copilot to interactively brainstorm representative metrics for your application. It helps you:
- Generate effective evaluation criteria
- Test metrics on example outputs before integration
- Find the optimal threshold values for your use case
## Example Configuration
```yaml
prompts:
  - 'Explain {{concept}} in simple terms.'

providers:
  - openai:gpt-4o

tests:
  - vars:
      concept: quantum computing
    assert:
      - type: pi
        value: Is the explanation easy to understand without technical jargon?
        threshold: 0.7
      - type: pi
        value: Does the response correctly explain the fundamental principles?
        threshold: 0.8
```
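Run the eval as usual (for example, with `npx promptfoo@latest eval`); each `pi` assertion passes or fails based on its score relative to its threshold.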
## See Also
- LLM Rubric
- Model-graded metrics
- Pi Documentation for more options, configuration, and calibration details