Model-graded metrics

promptfoo supports several types of model-graded assertions:

Output-based:

llm-rubric - Promptfoo's general-purpose grader; uses an LLM to evaluate outputs against custom criteria or rubrics.
agent-rubric - Like llm-rubric, but uses a coding-agent grader that can inspect configured workspace and tool evidence.
search-rubric - Like llm-rubric but with web search capabilities for verifying current information.
model-graded-closedqa - Checks if LLM answers meet specific requirements using OpenAI's public evals prompts.
factuality - Evaluates factual consistency between LLM output and a reference statement. Uses OpenAI's public evals prompt to determine if the output is factually consistent with the reference.
g-eval - Uses chain-of-thought prompting to evaluate outputs against custom criteria following the G-Eval framework.
answer-relevance - Evaluates whether LLM output is directly related to the original query.
similar - Checks semantic similarity between output and expected value using embedding models.
pi - Alternative scoring approach using a dedicated evaluation model to score inputs/outputs against criteria.
classifier - Runs LLM output through HuggingFace text classifiers for detection of tone, bias, toxicity, and other properties. See classifier grading docs.
moderation - Uses OpenAI's moderation API to ensure LLM outputs are safe and comply with usage policies. See moderation grading docs.
select-best - Compares multiple outputs from different prompts/providers and selects the best one based on custom criteria.
max-score - Selects the output with the highest aggregate score based on other assertion results.

Context-based:

context-recall - ensure that ground truth appears in context
context-relevance - ensure that context is relevant to original query
context-faithfulness - ensure that LLM output is supported by context

Conversational:

conversation-relevance - ensure that responses remain relevant throughout a conversation

Trajectory-based:

trajectory:goal-success - uses an LLM judge to decide whether a traced agent run achieved its goal

Context-based assertions are particularly useful for evaluating RAG systems. For complete RAG evaluation examples, see the RAG Evaluation Guide.

Examples (output-based)

Example of llm-rubric and/or model-graded-closedqa:

assert:
  - type: model-graded-closedqa # or llm-rubric
    # Make sure the LLM output adheres to this criterion:
    value: Is not apologetic

Example of factuality check:

assert:
  - type: factuality
    # Make sure the LLM output is consistent with this statement:
    value: Sacramento is the capital of California

trajectory:goal-success

Use trajectory:goal-success when you care about whether an agent actually completed a task, not just whether it used a specific tool or produced a plausible final sentence.

This assertion requires trace data. Promptfoo summarizes the traced trajectory, includes the final output, and asks a grading model whether the run achieved the goal you specify.

tests:
  - vars:
      order_id: '123'
    assert:
      - type: trajectory:goal-success
        value: 'Determine the shipping status for order {{ order_id }} and tell the user whether it has shipped'

As with other model-graded assertions, you can set threshold, provider, or rubricPrompt:

tests:
  - assert:
      - type: trajectory:goal-success
        value: Resolve the user's issue and provide the correct next step
        threshold: 0.8
        provider: openai:gpt-5-mini

This works best alongside deterministic trajectory checks such as trajectory:tool-used, trajectory:tool-args-match, or trajectory:tool-sequence when the exact path through the task also matters.

Prepend not- to flag runs that achieved a forbidden goal (type: not-trajectory:goal-success). Inversion only flips real grader verdicts — judge transport or parse failures still report as failures so a broken judge cannot silently turn into a passing "did not achieve forbidden goal" result.
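The inversion rule above can be sketched as a small function. This is a simplified illustration and not Promptfoo's implementation; the JudgeOutcome type and the applyInversion name are hypothetical.

```typescript
// Hypothetical shape for a judge call: either a real verdict or an error
// (for example, a transport failure or an unparseable response).
type JudgeOutcome =
  | { kind: "verdict"; pass: boolean }
  | { kind: "error"; message: string };

// Returns whether the assertion passes. With `inverse` set (the not- prefix),
// only real verdicts are flipped; judge errors always fail.
function applyInversion(outcome: JudgeOutcome, inverse: boolean): boolean {
  if (outcome.kind === "error") {
    return false;
  }
  return inverse ? !outcome.pass : outcome.pass;
}

const achieved: JudgeOutcome = { kind: "verdict", pass: true };
const broken: JudgeOutcome = { kind: "error", message: "judge returned invalid JSON" };

console.log(applyInversion(achieved, true)); // false: the forbidden goal was achieved
console.log(applyInversion(broken, true)); // false: a broken judge never passes
```

Treating errors as a separate case, rather than as pass: false, is what keeps a failed judge call from being flipped into a pass.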

Example of pi scorer:

assert:
  - type: pi
    # Evaluate output based on these criteria:
    value: Is not apologetic and provides a clear, concise answer
    threshold: 0.8 # Requires a score of 0.8 or higher to pass

For more information on factuality, see the guide on LLM factuality.

Non-English Evaluation

To get grader output, such as the reason field, in a language other than English, use a custom rubricPrompt with a compatible assertion type:

defaultTest:
  options:
    # English translation of the German messages below (kept outside the JSON,
    # because JSON does not allow comments):
    #   system: "You evaluate outputs based on criteria. Respond with JSON: {"reason": "string", "pass": boolean, "score": number}. ALL responses in German."
    #   user: "Output: {{ output }}\nCriterion: {{ rubric }}"
    rubricPrompt: |
      [
        {
          "role": "system",
          "content": "Du bewertest Ausgaben nach Kriterien. Antworte mit JSON: {\"reason\": \"string\", \"pass\": boolean, \"score\": number}. ALLE Antworten auf Deutsch."
        },
        {
          "role": "user",
          "content": "Ausgabe: {{ output }}\nKriterium: {{ rubric }}"
        }
      ]

assert:
  - type: llm-rubric
    # German: "Responds helpfully"
    value: 'Antwortet hilfreich'
  - type: g-eval
    # German: "Clear and precise"
    value: 'Klar und präzise'
  - type: model-graded-closedqa
    # German: "Gives direct answer"
    value: 'Gibt direkte Antwort'

This produces German reasoning: {"reason": "Die Antwort ist hilfreich und klar.", "pass": true, "score": 1.0}

Note: This approach works with llm-rubric, g-eval, and model-graded-closedqa. Other assertions like factuality and context-recall require specific output formats and need assertion-specific prompts.

For more language options and alternative approaches, see the llm-rubric language guide.

For an example of output that indicates PASS/FAIL based on LLM assessment, see the example setup and outputs.

Using variables in the rubric

You can use test vars in the LLM rubric. This example uses the question variable to help detect hallucinations:

providers:
  - openai:gpt-5-mini
prompts:
  - file://prompt1.txt
  - file://prompt2.txt
defaultTest:
  assert:
    - type: llm-rubric
      value: 'Says that it is uncertain or unable to answer the question: "{{question}}"'
tests:
  - vars:
      question: What's the weather in New York?
  - vars:
      question: Who won the latest football match between the Giants and 49ers?

Examples (comparison)

The select-best assertion type is used to compare multiple outputs in the same TestCase row and select the one that best meets a specified criterion.

Here's an example of how to use select-best in a configuration file:

prompts:
  - 'Write a tweet about {{topic}}'
  - 'Write a very concise, funny tweet about {{topic}}'

providers:
  - openai:gpt-5

tests:
  - vars:
      topic: bananas
    assert:
      - type: select-best
        value: choose the funniest tweet

  - vars:
      topic: nyc
    assert:
      - type: select-best
        value: choose the tweet that contains the most facts

The max-score assertion type is used to objectively select the output with the highest score from other assertions:

prompts:
  - 'Write a summary of {{article}}'
  - 'Write a detailed summary of {{article}}'
  - 'Write a comprehensive summary of {{article}} with key points'

providers:
  - openai:gpt-5

tests:
  - vars:
      article: 'AI safety research is accelerating...'
    assert:
      - type: contains
        value: 'AI safety'
      - type: contains
        value: 'research'
      - type: llm-rubric
        value: 'Summary captures the main points accurately'
      - type: max-score
        value:
          method: average # Use average of all assertion scores
          threshold: 0.7 # Require at least 70% score to pass
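As a rough illustration of the average method, the sketch below averages the scores of the other assertions for each output, picks the highest, and checks it against the threshold. The function names, the result shape, and the tie-break (earliest output wins) are assumptions of this sketch, not Promptfoo's documented behavior.

```typescript
// Scores from the other assertions (contains, llm-rubric, ...) for one output.
type OutputScores = number[];

function average(scores: OutputScores): number {
  if (scores.length === 0) return 0;
  return scores.reduce((sum, s) => sum + s, 0) / scores.length;
}

// Picks the output with the highest average score. Ties go to the earliest
// output here; that tie-break is an assumption of this sketch.
function selectMaxScore(
  scoresPerOutput: OutputScores[],
  threshold: number,
): { winnerIndex: number; winnerScore: number; meetsThreshold: boolean } {
  let winnerIndex = -1;
  let winnerScore = -Infinity;
  scoresPerOutput.forEach((scores, index) => {
    const avg = average(scores);
    if (avg > winnerScore) {
      winnerIndex = index;
      winnerScore = avg;
    }
  });
  return { winnerIndex, winnerScore, meetsThreshold: winnerScore >= threshold };
}

// Three outputs, each with scores for [contains, contains, llm-rubric].
const maxScoreResult = selectMaxScore(
  [
    [1, 0, 0.5],
    [1, 1, 0.75],
    [1, 1, 0.25],
  ],
  0.7,
);
console.log(maxScoreResult.winnerIndex, maxScoreResult.meetsThreshold); // 1 true
```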

Overriding the LLM grader

By default, model-graded asserts use promptfoo's built-in grading provider. Promptfoo chooses that provider from the credentials available in the environment; for example, OpenAI, Anthropic, Gemini, Mistral, GitHub Models, Azure OpenAI, and Codex login credentials can each activate a different default. If you do not have access to the selected default or prefer a different judge, you can override the grader. There are several ways to do this, depending on your preferred workflow:

Using the --grader CLI option:

promptfoo eval --grader openai:gpt-5-mini

Using test.options or defaultTest.options on a per-test or per-suite basis:

defaultTest:
  options:
    provider: openai:gpt-5-mini
tests:
  - description: Use LLM to evaluate output
    assert:
      - type: llm-rubric
        value: Is spoken like a pirate

Using assertion.provider on a per-assertion basis:

tests:
  - description: Use LLM to evaluate output
    assert:
      - type: llm-rubric
        value: Is spoken like a pirate
        provider: openai:gpt-5-mini

Use the provider.config field to set custom parameters such as temperature, max_tokens, or API host:

tests:
  - assert:
      - type: llm-rubric
        value: Is not apologetic and provides a clear, concise answer
        provider:
          id: openai:gpt-5-mini
          config:
            temperature: 0

This works at every level where a grader can be set — per-assertion (assertion.provider), per-test (test.options.provider), and globally (defaultTest.options.provider).

If you configure a full provider object globally, do not also add a shorthand provider: openai:chat:... to the assertion. Assertion-level providers take precedence, so the global provider object's config values such as apiBaseUrl, apiKey, temperature, or showThinking will not be inherited. Either remove the assertion-level provider or repeat the full provider object there.
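The precedence rule can be sketched as follows. The type and function names are hypothetical, and ranking the test level above the defaultTest level is an assumption; the text above only states that assertion-level providers take precedence and that config is not inherited.

```typescript
// Minimal provider reference: a shorthand ID string or a full provider object.
type ProviderRef = string | { id: string; config?: Record<string, unknown> };

interface GraderSources {
  assertionProvider?: ProviderRef; // assertion.provider
  testProvider?: ProviderRef; // test.options.provider
  defaultProvider?: ProviderRef; // defaultTest.options.provider
}

// Returns the most specific provider that is set. The chosen value is used
// as-is: config from less specific levels is not merged into it.
function resolveGrader(sources: GraderSources): ProviderRef | undefined {
  return sources.assertionProvider ?? sources.testProvider ?? sources.defaultProvider;
}

const chosenGrader = resolveGrader({
  assertionProvider: "openai:chat:llm_judge",
  defaultProvider: {
    id: "openai:chat:llm_judge",
    config: { apiBaseUrl: "http://localhost:8000/v1", showThinking: false },
  },
});

// Prints the shorthand string; apiBaseUrl and showThinking are not carried over.
console.log(chosenGrader);
```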

Note: The built-in OpenAI grader already uses temperature=0 by default, so you only need to set it when overriding the grader with a custom provider block that would otherwise inherit a non-zero default. GPT-5 series reasoning models ignore temperature entirely.

The built-in OpenAI grader may spend hidden reasoning tokens internally, but promptfoo receives the final grader output without private reasoning text prepended to the output string. The showThinking: false guidance below is for OpenAI-compatible or local judge providers that return reasoning fields such as reasoning or reasoning_content.

Custom providers are also supported.

OpenAI-compatible thinking judges

Self-hosted OpenAI-compatible judges such as vLLM, LocalAI, and llamafile can return reasoning in a separate field while putting the final answer in content. Set showThinking: false on the judge provider so promptfoo uses only the final content for grading:

defaultTest:
  options:
    provider:
      id: openai:chat:llm_judge
      config:
        apiBaseUrl: http://localhost:8000/v1
        apiKey: empty
        temperature: 0
        max_tokens: 10000
        showThinking: false

This setting is not specific to llm-rubric. When reasoning text leaks into the judge output, JSON-first metrics can parse scratchpad JSON, answer-relevance can embed questions with Thinking: prepended, RAG metrics can score scratchpad sentences or attribution markers, and select-best can read a scratchpad number as the winning index.

For vLLM specifically, showThinking: false only removes reasoning after vLLM has parsed it into a separate field such as reasoning_content. If max_tokens or the server context window is too small, vLLM may return an unfinished <think> block in content; increase the budget or disable thinking for judge requests.

For vLLM models whose chat template enables thinking by default, you can also disable thinking at request time. See the vLLM judge guide for complete Qwen, GPT-OSS, and GLM examples.
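To make the failure mode concrete, here is a conceptual sketch of what showThinking: false changes. The JudgeMessage shape, the judgeText helper, and the exact way reasoning is prepended are assumptions for illustration and do not describe Promptfoo's internals.

```typescript
// Hypothetical, simplified shape of an OpenAI-compatible chat message that
// carries reasoning in a separate field.
interface JudgeMessage {
  content: string;
  reasoning_content?: string;
}

// Builds the text handed to the grader. With showThinking disabled, only the
// final content is used; otherwise the reasoning is prepended (the exact
// format used here is an assumption).
function judgeText(message: JudgeMessage, showThinking: boolean): string {
  if (showThinking && message.reasoning_content) {
    return `Thinking: ${message.reasoning_content}\n\n${message.content}`;
  }
  return message.content;
}

const judgeReply: JudgeMessage = {
  reasoning_content: "The answer is polite, so it should pass.",
  content: '{"pass": true, "score": 1, "reason": "Polite"}',
};

console.log(JSON.parse(judgeText(judgeReply, false)).pass); // true
// judgeText(judgeReply, true) starts with "Thinking:" and is no longer valid JSON.
```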

Multiple graders

Some assertions (such as answer-relevance) use more than one type of provider. To override the embedding and text providers separately, set text and embedding under provider:

defaultTest:
  options:
    provider:
      text:
        id: azureopenai:chat:gpt-4-deployment
        config:
          apiHost: xxx.openai.azure.com
      embedding:
        id: azureopenai:embeddings:text-embedding-ada-002-deployment
        config:
          apiHost: xxx.openai.azure.com

If you are implementing a custom provider, text providers require a callApi function that returns a ProviderResponse, whereas embedding providers require a callEmbeddingApi function that returns a ProviderEmbeddingResponse.
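A minimal sketch of a custom provider that fills both roles might look like this. The two response interfaces are simplified local stand-ins for Promptfoo's types, and the ToyGrader class with its canned behavior is hypothetical.

```typescript
// Simplified local stand-ins for Promptfoo's response types; the real types
// have more optional fields.
interface ProviderResponse {
  output?: string;
  error?: string;
}

interface ProviderEmbeddingResponse {
  embedding?: number[];
  error?: string;
}

// A toy grader that implements both roles. The class name and its canned
// behavior are hypothetical.
class ToyGrader {
  id(): string {
    return "toy-grader";
  }

  // Text role: return the grader's verdict as the output string.
  async callApi(prompt: string): Promise<ProviderResponse> {
    const pass = prompt.trim().length > 0;
    return { output: JSON.stringify({ pass, score: pass ? 1 : 0, reason: "toy verdict" }) };
  }

  // Embedding role: return a numeric vector for the text
  // (here just its length and vowel count).
  async callEmbeddingApi(text: string): Promise<ProviderEmbeddingResponse> {
    const vowels = (text.match(/[aeiou]/gi) ?? []).length;
    return { embedding: [text.length, vowels] };
  }
}

new ToyGrader().callApi("Is spoken like a pirate").then((r) => console.log(r.output));
new ToyGrader().callEmbeddingApi("hello").then((r) => console.log(r.embedding));
```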

Overriding the rubric prompt

For the greatest control over the output of llm-rubric, you may set a custom prompt using the rubricPrompt property of TestCase or Assertion.

The rubric prompt has two built-in variables that you may use:

{{output}} - The output of the LLM (you probably want to use this)
{{rubric}} - The value of the llm-rubric assert object

Object handling in variables

When {{output}} or {{rubric}} contain objects, they are automatically converted to JSON strings by default to prevent display issues. To access object properties directly (e.g., {{output.text}}), enable object property access:

export PROMPTFOO_DISABLE_OBJECT_STRINGIFY=true
promptfoo eval

For details, see the object template handling guide.

In this example, we set rubricPrompt under defaultTest, which applies it to every test in this test suite:

defaultTest:
  options:
    rubricPrompt: >
      [
        {
          "role": "system",
          "content": "Grade the output by the following specifications, keeping track of the points scored:\n\nDid the output mention {{x}}? +1 point\nDid the output describe {{y}}? +1 point\nDid the output ask to clarify {{z}}? +1 point\n\nCalculate the score but always pass the test. Output your response in the following JSON format:\n{pass: true, score: number, reason: string}"
        },
        {
          "role": "user",
          "content": "Output: {{ output }}"
        }
      ]

See the full example.

Image-based rubric prompts

llm-rubric can also grade responses that reference images. Provide a rubricPrompt in OpenAI chat format that includes an image and use a vision-capable provider such as openai:gpt-5:

defaultTest:
  options:
    provider: openai:gpt-5
    rubricPrompt: |
      [
        { "role": "system", "content": "Evaluate if the answer matches the image. Respond with JSON {reason:string, pass:boolean, score:number}" },
        {
          "role": "user",
          "content": [
            { "type": "image_url", "image_url": { "url": "{{image_url}}" } },
            { "type": "text", "text": "Output: {{ output }}\nRubric: {{ rubric }}" }
          ]
        }
      ]

select-best rubric prompt

For control over the select-best rubric prompt, you may use the variables {{outputs}} (list of strings) and {{criteria}} (string). It expects the LLM output to contain the index of the winning output.

Classifiers

Classifiers can be used to detect tone, bias, toxicity, helpfulness, and much more. See classifier documentation.

Context-based

Context-based assertions are a special class of model-graded assertions that evaluate whether the LLM's output is supported by context provided at inference time. They are particularly useful for evaluating RAG systems.

context-recall - ensure that ground truth appears in context
context-relevance - ensure that context is relevant to original query
context-faithfulness - ensure that LLM output is supported by context

Defining context

Context can be defined in one of two ways: statically using test case variables or dynamically from the provider's response.

Statically via test variables

Set context as a variable in your test case:

tests:
  - vars:
      context: 'Paris is the capital of France. It has a population of over 2 million people.'
    assert:
      - type: context-recall
        value: 'Paris is the capital of France'
        threshold: 0.8

Dynamically via Context Transform

Defining contextTransform allows you to construct context from provider responses. This is particularly useful for RAG systems.

assert:
  - type: context-faithfulness
    contextTransform: 'output.citations.join("\n")'
    threshold: 0.8

The contextTransform property accepts a stringified JavaScript expression that receives two arguments, output and context, and must return a non-empty string.

/**
 * The context transform function signature.
 */
type ContextTransform = (output: Output, context: Context) => string;

/**
 * The provider's response output.
 */
type Output = string | object;

/**
 * Metadata about the test case, prompt, and provider response.
 */
type Context = {
  // Test case variables
  vars: Record<string, string | object>;

  // Raw prompt sent to LLM
  prompt: {
    label: string;
  };

  // Provider-specific metadata.
  // The documentation for each provider will describe any available metadata.
  metadata?: object;
};

For example, given the following provider response:

/**
 * A response from a fictional Research Knowledge Base.
 */
type ProviderResponse = {
  output: {
    content: string;
  };
  metadata: {
    retrieved_docs: {
      content: string;
    }[];
  };
};

assert:
  - type: context-faithfulness
    contextTransform: 'output.content'
    threshold: 0.8

  - type: context-relevance
    # Note: `ProviderResponse['metadata']` is accessible as `context.metadata`
    contextTransform: 'context.metadata.retrieved_docs.map(d => d.content).join("\n")'
    threshold: 0.7

If your expression may return undefined or null, for example because no context is available, add a fallback:

contextTransform: 'output.context ?? "No context found"'

If you expected your context to be non-empty, but it's empty, you can debug your provider response by returning a stringified version of the response:

contextTransform: 'JSON.stringify(output, null, 2)'
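The transform expressions in this section can be read as ordinary functions of output and context. The sketch below writes two of them out that way, using simplified types based on the fictional Research Knowledge Base response; the type names, function names, and sample data are illustrative assumptions.

```typescript
// Simplified shapes based on the fictional Research Knowledge Base response.
interface KbOutput {
  content: string;
  context?: string;
}

interface KbContext {
  vars: Record<string, string | object>;
  metadata?: { retrieved_docs: { content: string }[] };
}

type KbContextTransform = (output: KbOutput, context: KbContext) => string;

// Same logic as 'context.metadata.retrieved_docs.map(d => d.content).join("\n")',
// with a guard for missing metadata. It returns an empty string when nothing
// was retrieved, which does not satisfy the non-empty requirement.
const docsAsContext: KbContextTransform = (_output, context) =>
  (context.metadata?.retrieved_docs ?? []).map((d) => d.content).join("\n");

// Same logic as 'output.context ?? "No context found"'.
const contextOrFallback: KbContextTransform = (output) =>
  output.context ?? "No context found";

const kbOutput: KbOutput = { content: "Paris is the capital of France." };
const kbContext: KbContext = {
  vars: { query: "What is the capital of France?" },
  metadata: {
    retrieved_docs: [{ content: "Paris is the capital." }, { content: "France is in Europe." }],
  },
};

console.log(docsAsContext(kbOutput, kbContext)); // the two documents, one per line
console.log(contextOrFallback(kbOutput, kbContext)); // No context found
```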

Examples

Context-based metrics require a query and context. You must also set the threshold property on your test (all scores are normalized between 0 and 1).

Here's an example config using statically-defined (test.vars.context) context:

prompts:
  - |
    You are an internal corporate chatbot.
    Respond to this query: {{query}}
    Here is some context that you can use to write your response: {{context}}
providers:
  - openai:gpt-5
tests:
  - vars:
      query: What is the max purchase that doesn't require approval?
      context: file://docs/reimbursement.md
    assert:
      - type: contains
        value: '$500'
      - type: factuality
        value: the employee's manager is responsible for approvals
      - type: answer-relevance
        threshold: 0.9
      - type: context-recall
        threshold: 0.9
        value: max purchase price without approval is $500. Talk to Fred before submitting anything.
      - type: context-relevance
        threshold: 0.9
      - type: context-faithfulness
        threshold: 0.9
  - vars:
      query: How many weeks is maternity leave?
      context: file://docs/maternity.md
    assert:
      - type: factuality
        value: maternity leave is 4 months
      - type: answer-relevance
        threshold: 0.9
      - type: context-recall
        threshold: 0.9
        value: The company offers 4 months of maternity leave, unless you are an elephant, in which case you get 22 months of maternity leave.
      - type: context-relevance
        threshold: 0.9
      - type: context-faithfulness
        threshold: 0.9

Alternatively, if your system returns context in the response, like in a RAG system, you can use contextTransform:

prompts:
  - |
    You are an internal corporate chatbot.
    Respond to this query: {{query}}
providers:
  - openai:gpt-5
tests:
  - vars:
      query: What is the max purchase that doesn't require approval?
    assert:
      - type: context-recall
        contextTransform: 'output.context'
        threshold: 0.9
        value: max purchase price without approval is $500
      - type: context-relevance
        contextTransform: 'output.context'
        threshold: 0.9
      - type: context-faithfulness
        contextTransform: 'output.context'
        threshold: 0.9

Transforming outputs for context assertions

Transform: Extract answer before context grading

providers:
  - echo

tests:
  - vars:
      prompt: '{"answer": "Paris is the capital of France", "confidence": 0.95}'
      context: 'France is a country in Europe. Its capital city is Paris, which has over 2 million residents.'
    assert:
      - type: context-faithfulness
        transform: 'JSON.parse(output).answer' # Grade only the answer field
        threshold: 0.9

      - type: context-recall
        transform: 'JSON.parse(output).answer' # Check if answer appears in context
        value: 'Paris is the capital of France'
        threshold: 0.8

Context transform: Extract context from provider response

providers:
  - echo

tests:
  - vars:
      prompt: '{"answer": "Returns accepted within 30 days", "sources": ["Returns are accepted for 30 days from purchase", "30-day money-back guarantee"]}'
      query: 'What is the return policy?'
    assert:
      - type: context-faithfulness
        transform: 'JSON.parse(output).answer'
        contextTransform: 'JSON.parse(output).sources.join(". ")' # Extract sources as context
        threshold: 0.9

      - type: context-relevance
        contextTransform: 'JSON.parse(output).sources.join(". ")' # Check if context is relevant to query
        threshold: 0.8

Transform response: Normalize RAG system output

providers:
  - id: http://rag-api.example.com/search
    config:
      transformResponse: 'json.data' # Extract data field from API response

tests:
  - vars:
      query: 'What are the office hours?'
    assert:
      - type: context-faithfulness
        transform: 'output.answer' # After transformResponse, extract answer
        contextTransform: 'output.documents.map(d => d.text).join(" ")' # Extract documents as context
        threshold: 0.85

Processing order: API call → transformResponse → transform → contextTransform → context assertion
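The processing order can be sketched as three plain functions applied in sequence. The payload shape and sample data are hypothetical. As in the example above, this sketch assumes that both transform and contextTransform read from the result of transformResponse.

```typescript
// Hypothetical raw API payload from the RAG endpoint.
interface RagPayload {
  data: { answer: string; documents: { text: string }[] };
}
type RagOutput = RagPayload["data"];

// Step 1, transformResponse: 'json.data'
const transformResponse = (json: RagPayload): RagOutput => json.data;
// Step 2, transform: 'output.answer'
const transformAnswer = (output: RagOutput): string => output.answer;
// Step 3, contextTransform: 'output.documents.map(d => d.text).join(" ")'
const contextTransform = (output: RagOutput): string =>
  output.documents.map((d) => d.text).join(" ");

// Runs the steps in the documented order and returns what the context
// assertion would receive.
function prepareForGrading(json: RagPayload): { answer: string; context: string } {
  const output = transformResponse(json);
  return { answer: transformAnswer(output), context: contextTransform(output) };
}

const gradingInput = prepareForGrading({
  data: {
    answer: "Office hours are 9 to 5.",
    documents: [{ text: "The office opens at 9am." }, { text: "The office closes at 5pm." }],
  },
});

console.log(gradingInput.answer); // Office hours are 9 to 5.
console.log(gradingInput.context); // The office opens at 9am. The office closes at 5pm.
```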

Common patterns and troubleshooting

Understanding pass vs. score behavior

Model-graded assertions like llm-rubric determine PASS/FAIL using two mechanisms:

Without threshold: PASS depends only on the grader's pass field (defaults to true if omitted)
With threshold: PASS requires both pass === true AND score >= threshold

This means a result like {"pass": true, "score": 0} will pass without a threshold, but fail with threshold: 1.
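The two rules can be written out as a short function. This is a sketch of the documented behavior, not Promptfoo's source; the GraderResult type and isPass name are illustrative.

```typescript
interface GraderResult {
  pass?: boolean; // treated as true when omitted
  score: number;
}

// Without a threshold, only the grader's pass field matters.
// With a threshold, pass must be true AND score must reach the threshold.
function isPass(result: GraderResult, threshold?: number): boolean {
  const graderPass = result.pass ?? true;
  if (threshold === undefined) {
    return graderPass;
  }
  return graderPass && result.score >= threshold;
}

console.log(isPass({ pass: true, score: 0 })); // true
console.log(isPass({ pass: true, score: 0 }, 1)); // false
console.log(isPass({ score: 0 })); // true
```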

Common issue: Tests show PASS even when scores are low

# ❌ Problem: All tests pass regardless of score
assert:
  - type: llm-rubric
    value: |
      Return 0 if the response is incorrect
      Return 1 if the response is correct
    # No threshold set - always passes if grader doesn't return explicit pass: false

Solutions:

# ✅ Option A: Add threshold to make score drive PASS/FAIL
assert:
  - type: llm-rubric
    value: |
      Return 0 if the response is incorrect
      Return 1 if the response is correct
    threshold: 1  # Only pass when score >= 1

# ✅ Option B: Have grader control pass explicitly
assert:
  - type: llm-rubric
    value: |
      Return {"pass": true, "score": 1} if the response is correct
      Return {"pass": false, "score": 0} if the response is incorrect

Threshold usage across assertion types

Different assertion types use thresholds differently:

assert:
  # Similarity-based (0-1 range)
  - type: context-faithfulness
    threshold: 0.8 # Requires 80%+ faithfulness

  # Binary scoring (0 or 1)
  - type: llm-rubric
    value: 'Is helpful and accurate'
    threshold: 1 # Requires perfect score

  # Custom scoring (any range)
  - type: pi
    value: 'Quality of response'
    threshold: 0.7

For more details on pass/score semantics, see the llm-rubric documentation.

Other assertion types

For more info on assertions, see Test assertions.
