Model-graded metrics

promptfoo supports several types of model-graded assertions:

Output-based:

llm-rubric - Promptfoo's general-purpose grader; uses an LLM to evaluate outputs against custom criteria or rubrics.
search-rubric - Like llm-rubric but with web search capabilities for verifying current information.
model-graded-closedqa - Checks if LLM answers meet specific requirements using OpenAI's public evals prompts.
factuality - Evaluates whether the LLM output is factually consistent with a reference statement, using OpenAI's public evals prompt.
g-eval - Uses chain-of-thought prompting to evaluate outputs against custom criteria following the G-Eval framework.
answer-relevance - Evaluates whether LLM output is directly related to the original query.
similar - Checks semantic similarity between output and expected value using embedding models.
pi - Alternative scoring approach using a dedicated evaluation model to score inputs/outputs against criteria.
classifier - Runs LLM output through HuggingFace text classifiers for detection of tone, bias, toxicity, and other properties. See classifier grading docs.
moderation - Uses OpenAI's moderation API to ensure LLM outputs are safe and comply with usage policies. See moderation grading docs.
select-best - Compares multiple outputs from different prompts/providers and selects the best one based on custom criteria.
max-score - Selects the output with the highest aggregate score based on other assertion results.

Context-based:

context-recall - Ensures that the ground truth appears in the context.
context-relevance - Ensures that the context is relevant to the original query.
context-faithfulness - Ensures that the LLM output is supported by the context.

Conversational:

conversation-relevance - Ensures that responses remain relevant throughout a conversation.

Context-based assertions are particularly useful for evaluating RAG systems. For complete RAG evaluation examples, see the RAG Evaluation Guide.

Examples (output-based)

Example of llm-rubric or model-graded-closedqa:

assert:
  - type: model-graded-closedqa # or llm-rubric
    # Make sure the LLM output adheres to this criterion:
    value: Is not apologetic

Example of factuality check:

assert:
  - type: factuality
    # Make sure the LLM output is consistent with this statement:
    value: Sacramento is the capital of California

Example of pi scorer:

assert:
  - type: pi
    # Evaluate output based on these criteria:
    value: Is not apologetic and provides a clear, concise answer
    threshold: 0.8 # Requires a score of 0.8 or higher to pass

For more information on factuality, see the guide on LLM factuality.

Non-English Evaluation

To have the grader respond in a language other than English, set a custom rubricPrompt for compatible assertion types:

defaultTest:
  options:
    # The prompt below is written in German. English translation:
    #   system: "You evaluate outputs based on criteria. Respond with JSON: {"reason": "string", "pass": boolean, "score": number}. ALL responses in German."
    #   user: "Output: {{ output }}\nCriterion: {{ rubric }}"
    # (Comments are kept outside the block scalar because JSON does not allow them.)
    rubricPrompt: |
      [
        {
          "role": "system",
          "content": "Du bewertest Ausgaben nach Kriterien. Antworte mit JSON: {\"reason\": \"string\", \"pass\": boolean, \"score\": number}. ALLE Antworten auf Deutsch."
        },
        {
          "role": "user",
          "content": "Ausgabe: {{ output }}\nKriterium: {{ rubric }}"
        }
      ]

assert:
  - type: llm-rubric
    # German: "Responds helpfully"
    value: 'Antwortet hilfreich'
  - type: g-eval
    # German: "Clear and precise"
    value: 'Klar und präzise'
  - type: model-graded-closedqa
    # German: "Gives direct answer"
    value: 'Gibt direkte Antwort'

This produces German reasoning: {"reason": "Die Antwort ist hilfreich und klar.", "pass": true, "score": 1.0}

Note: This approach works with llm-rubric, g-eval, and model-graded-closedqa. Other assertions like factuality and context-recall require specific output formats and need assertion-specific prompts.

For more language options and alternative approaches, see the llm-rubric language guide.

Each result is reported as PASS or FAIL based on the LLM's assessment (see example setup and outputs).

Using variables in the rubric

You can use test vars in the LLM rubric. This example uses the question variable to help detect hallucinations:

providers:
  - openai:gpt-5-mini
prompts:
  - file://prompt1.txt
  - file://prompt2.txt
defaultTest:
  assert:
    - type: llm-rubric
      value: 'Says that it is uncertain or unable to answer the question: "{{question}}"'
tests:
  - vars:
      question: What's the weather in New York?
  - vars:
      question: Who won the latest football match between the Giants and 49ers?

Examples (comparison)

The select-best assertion type is used to compare multiple outputs in the same TestCase row and select the one that best meets a specified criterion.

Here's an example of how to use select-best in a configuration file:

prompts:
  - 'Write a tweet about {{topic}}'
  - 'Write a very concise, funny tweet about {{topic}}'

providers:
  - openai:gpt-5

tests:
  - vars:
      topic: bananas
    assert:
      - type: select-best
        value: choose the funniest tweet

  - vars:
      topic: nyc
    assert:
      - type: select-best
        value: choose the tweet that contains the most facts

The max-score assertion type is used to objectively select the output with the highest score from other assertions:

prompts:
  - 'Write a summary of {{article}}'
  - 'Write a detailed summary of {{article}}'
  - 'Write a comprehensive summary of {{article}} with key points'

providers:
  - openai:gpt-5

tests:
  - vars:
      article: 'AI safety research is accelerating...'
    assert:
      - type: contains
        value: 'AI safety'
      - type: contains
        value: 'research'
      - type: llm-rubric
        value: 'Summary captures the main points accurately'
      - type: max-score
        value:
          method: average # Use average of all assertion scores
          threshold: 0.7 # Require at least 70% score to pass
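
The selection described above can be sketched roughly as follows. This is a minimal illustration of the average method, not promptfoo's implementation: the CandidateResult and MaxScoreSelection shapes, the selectMaxScore function, and the tie-breaking rule (first candidate wins) are all assumptions made for the example.

```typescript
// Hypothetical shapes for illustration; promptfoo's internal types may differ.
interface CandidateResult {
  output: string;
  // Scores (0 to 1) produced by the other assertions for this output.
  assertionScores: number[];
}

interface MaxScoreSelection {
  winnerIndex: number;
  winnerScore: number;
  pass: boolean;
}

function average(values: number[]): number {
  if (values.length === 0) return 0;
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Pick the candidate with the highest average score. The winner passes only
// if its aggregate score reaches the threshold. Ties go to the first candidate
// (an assumption of this sketch).
function selectMaxScore(candidates: CandidateResult[], threshold: number): MaxScoreSelection {
  if (candidates.length === 0) throw new Error('No candidates to compare');
  let winnerIndex = 0;
  let winnerScore = -Infinity;
  candidates.forEach((candidate, index) => {
    const score = average(candidate.assertionScores);
    if (score > winnerScore) {
      winnerScore = score;
      winnerIndex = index;
    }
  });
  return { winnerIndex, winnerScore, pass: winnerScore >= threshold };
}

const selection = selectMaxScore(
  [
    { output: 'short summary', assertionScores: [1, 0, 0.5] },
    { output: 'detailed summary', assertionScores: [1, 1, 0.75] },
    { output: 'comprehensive summary', assertionScores: [1, 1, 0.5] },
  ],
  0.7,
);
console.log(selection.winnerIndex, selection.pass);
```

Here the second candidate has the highest average, and that average is above the 0.7 threshold.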

Overriding the LLM grader

By default, model-graded asserts use gpt-5 for grading. If you do not have access to gpt-5 or prefer not to use it, you can override the rubric grader. There are several ways to do this, depending on your preferred workflow:

Using the --grader CLI option:

promptfoo eval --grader openai:gpt-5-mini

Using test.options or defaultTest.options on a per-test or test-suite basis:

defaultTest:
  options:
    provider: openai:gpt-5-mini
tests:
  - description: Use LLM to evaluate output
    assert:
      - type: llm-rubric
        value: Is spoken like a pirate

Using assertion.provider on a per-assertion basis:

tests:
  - description: Use LLM to evaluate output
    assert:
      - type: llm-rubric
        value: Is spoken like a pirate
        provider: openai:gpt-5-mini

Use the provider.config field to set custom parameters:

provider:
  - id: openai:gpt-5-mini
    config:
      temperature: 0

Custom providers are also supported.

Multiple graders

Some assertions (such as answer-relevance) use more than one type of provider. To override the embedding and text providers separately, specify each one under provider:

defaultTest:
  options:
    provider:
      text:
        id: azureopenai:chat:gpt-4-deployment
        config:
          apiHost: xxx.openai.azure.com
      embedding:
        id: azureopenai:embeddings:text-embedding-ada-002-deployment
        config:
          apiHost: xxx.openai.azure.com

If you are implementing a custom provider, text providers require a callApi function that returns a ProviderResponse, whereas embedding providers require a callEmbeddingApi function that returns a ProviderEmbeddingResponse.
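
As a rough sketch of that split, the class below implements both methods. The ProviderResponse and ProviderEmbeddingResponse interfaces here are simplified local stand-ins (the real promptfoo types carry more optional fields), and ToyGradingProvider and toyEmbedding are hypothetical names invented for this example.

```typescript
// Simplified local stand-ins for promptfoo's response types (assumption).
interface ProviderResponse {
  output?: string;
  error?: string;
}

interface ProviderEmbeddingResponse {
  embedding?: number[];
  error?: string;
}

// Toy embedding: counts vowels, other letters, and all remaining characters.
// A real provider would call an embedding model instead.
function toyEmbedding(text: string): number[] {
  const counts = [0, 0, 0];
  for (let i = 0; i < text.length; i++) {
    const ch = text.charAt(i).toLowerCase();
    if ('aeiou'.indexOf(ch) !== -1) counts[0] += 1;
    else if (ch >= 'a' && ch <= 'z') counts[1] += 1;
    else counts[2] += 1;
  }
  return counts;
}

// Hypothetical provider usable as both a text grader and an embedding grader.
class ToyGradingProvider {
  id(): string {
    return 'toy-grading-provider';
  }

  // Text providers implement callApi and return a ProviderResponse.
  async callApi(prompt: string): Promise<ProviderResponse> {
    // A real implementation would send the prompt to an LLM here.
    const verdict = { pass: true, score: 1, reason: `Graded ${prompt.length} characters` };
    return { output: JSON.stringify(verdict) };
  }

  // Embedding providers implement callEmbeddingApi and return a
  // ProviderEmbeddingResponse.
  async callEmbeddingApi(text: string): Promise<ProviderEmbeddingResponse> {
    return { embedding: toyEmbedding(text) };
  }
}

const toyProvider = new ToyGradingProvider();
toyProvider.callApi('Output: hello').then((response) => console.log(response.output));
```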

Overriding the rubric prompt

For the greatest control over the output of llm-rubric, you may set a custom prompt using the rubricPrompt property of TestCase or Assertion.

The rubric prompt has two built-in variables that you may use:

{{output}} - The output of the LLM (you probably want to use this)
{{rubric}} - The value of the llm-rubric assert object

Object handling in variables

When {{output}} or {{rubric}} contains an object, it is automatically converted to a JSON string by default to prevent display issues. To access object properties directly (e.g., {{output.text}}), enable object property access:

export PROMPTFOO_DISABLE_OBJECT_STRINGIFY=true
promptfoo eval

For details, see the object template handling guide.
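
To illustrate the default stringification behavior, here is a minimal sketch of variable substitution. It is a simplified stand-in written for this example (renderRubricPrompt is a hypothetical helper, and real template rendering supports far more than plain variable names), but it shows why an object-valued {{output}} appears as JSON in the rendered prompt.

```typescript
type TemplateValue = string | number | boolean | object;

// Hypothetical, simplified renderer: replaces {{ name }} placeholders only.
function renderRubricPrompt(
  template: string,
  vars: { [name: string]: TemplateValue },
): string {
  // Matches {{ name }} with optional whitespace inside the braces.
  return template.replace(/\{\{\s*(\w+)\s*\}\}/g, (match: string, name: string) => {
    // Leave unknown variables untouched.
    if (!Object.prototype.hasOwnProperty.call(vars, name)) return match;
    const value = vars[name];
    // Objects are converted to JSON strings instead of relying on the
    // default string conversion.
    return typeof value === 'object' ? JSON.stringify(value) : String(value);
  });
}

const renderedPrompt = renderRubricPrompt('Output: {{ output }}\nCriterion: {{rubric}}', {
  output: { text: 'Ahoy, matey!' },
  rubric: 'Is spoken like a pirate',
});
console.log(renderedPrompt);
```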

In this example, we set rubricPrompt under defaultTest, which applies it to every test in this test suite:

defaultTest:
  options:
    rubricPrompt: >
      [
        {
          "role": "system",
          "content": "Grade the output by the following specifications, keeping track of the points scored:\n\nDid the output mention {{x}}? +1 point\nDid the output describe {{y}}? +1 point\nDid the output ask to clarify {{z}}? +1 point\n\nCalculate the score but always pass the test. Output your response in the following JSON format:\n{pass: true, score: number, reason: string}"
        },
        {
          "role": "user",
          "content": "Output: {{ output }}"
        }
      ]

See the full example.

Image-based rubric prompts

llm-rubric can also grade responses that reference images. Provide a rubricPrompt in OpenAI chat format that includes an image, and use a vision-capable provider such as openai:gpt-5.

defaultTest:
  options:
    provider: openai:gpt-5
    rubricPrompt: |
      [
        { "role": "system", "content": "Evaluate if the answer matches the image. Respond with JSON {reason:string, pass:boolean, score:number}" },
        {
          "role": "user",
          "content": [
            { "type": "image_url", "image_url": { "url": "{{image_url}}" } },
            { "type": "text", "text": "Output: {{ output }}\nRubric: {{ rubric }}" }
          ]
        }
      ]

select-best rubric prompt

For control over the select-best rubric prompt, you may use the variables {{outputs}} (list of strings) and {{criteria}} (string). It expects the LLM output to contain the index of the winning output.
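
If you write a custom select-best prompt, it helps to think about how the grader's reply will be read. The sketch below shows one plausible way to pull an index out of a reply. It is purely hypothetical: parseWinnerIndex is not a promptfoo function, and both the zero-based indexing and the "first integer wins" rule are assumptions of this example.

```typescript
// Hypothetical parser: extracts the first integer from the grader's reply
// and checks that it refers to one of the compared outputs.
function parseWinnerIndex(graderText: string, outputCount: number): number {
  const match = /\d+/.exec(graderText);
  if (match === null) {
    throw new Error('Grader response does not contain an index');
  }
  const index = parseInt(match[0], 10);
  // Assumption: indices are zero-based.
  if (index >= outputCount) {
    throw new Error(`Index ${index} is out of range for ${outputCount} outputs`);
  }
  return index;
}

console.log(parseWinnerIndex('The best output is 1.', 2));
```

A custom prompt that asks the grader to reply with only the index keeps this kind of parsing unambiguous.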

Classifiers

Classifiers can be used to detect tone, bias, toxicity, helpfulness, and much more. See classifier documentation.

Context-based

Context-based assertions are a special class of model-graded assertions that evaluate whether the LLM's output is supported by context provided at inference time. They are particularly useful for evaluating RAG systems.

context-recall - ensure that ground truth appears in context
context-relevance - ensure that context is relevant to original query
context-faithfulness - ensure that LLM output is supported by context

Defining context

Context can be defined in one of two ways: statically using test case variables or dynamically from the provider's response.

Statically via test variables

Set context as a variable in your test case:

tests:
  - vars:
      context: 'Paris is the capital of France. It has a population of over 2 million people.'
    assert:
      - type: context-recall
        value: 'Paris is the capital of France'
        threshold: 0.8

Dynamically via Context Transform

Defining contextTransform allows you to construct context from provider responses. This is particularly useful for RAG systems.

assert:
  - type: context-faithfulness
    contextTransform: 'output.citations.join("\n")'
    threshold: 0.8

The contextTransform property accepts a JavaScript expression written as a string. The expression receives two arguments, output and context, and must return a non-empty string.

/**
 * The context transform function signature.
 */
type ContextTransform = (output: Output, context: Context) => string;

/**
 * The provider's response output.
 */
type Output = string | object;

/**
 * Metadata about the test case, prompt, and provider response.
 */
type Context = {
  // Test case variables
  vars: Record<string, string | object>;

  // Raw prompt sent to LLM
  prompt: {
    label: string;
  };

  // Provider-specific metadata.
  // The documentation for each provider will describe any available metadata.
  metadata?: object;
};

For example, given the following provider response:

/**
 * A response from a fictional Research Knowledge Base.
 */
type ProviderResponse = {
  output: {
    content: string;
  };
  metadata: {
    retrieved_docs: {
      content: string;
    }[];
  };
};

assert:
  - type: context-faithfulness
    contextTransform: 'output.content'
    threshold: 0.8

  - type: context-relevance
    # Note: `ProviderResponse['metadata']` is accessible as `context.metadata`
    contextTransform: 'context.metadata.retrieved_docs.map(d => d.content).join("\n")'
    threshold: 0.7

If your expression may return undefined or null, for example because no context is available, add a fallback:

contextTransform: 'output.context ?? "No context found"'

If the context is unexpectedly empty, you can debug your provider response by returning a stringified version of it:

contextTransform: 'JSON.stringify(output, null, 2)'
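
To experiment with expressions outside of an eval run, a small harness like the one below can help. It is a sketch under stated assumptions: applyContextTransform is a hypothetical helper, and evaluating the expression with the Function constructor is an illustration only, not a description of how promptfoo runs it. The non-empty-string check mirrors the requirement stated above.

```typescript
type Output = string | object;

interface TransformContext {
  vars: { [name: string]: string | object };
  prompt: { label: string };
  metadata?: object;
}

// Hypothetical helper: evaluates a contextTransform expression against a
// sample output and context, enforcing the non-empty string requirement.
function applyContextTransform(
  expression: string,
  output: Output,
  context: TransformContext,
): string {
  // Wrap the expression in a function of (output, context).
  const fn = new Function('output', 'context', `return (${expression});`);
  const result: unknown = fn(output, context);
  if (typeof result !== 'string' || result.length === 0) {
    throw new Error(`contextTransform must return a non-empty string, got ${JSON.stringify(result)}`);
  }
  return result;
}

const sampleContext: TransformContext = {
  vars: {},
  prompt: { label: 'demo' },
  metadata: { retrieved_docs: [{ content: 'Doc A' }, { content: 'Doc B' }] },
};

// Context built from provider metadata.
const fromMetadata = applyContextTransform(
  'context.metadata.retrieved_docs.map(d => d.content).join("\\n")',
  'ignored',
  sampleContext,
);

// The fallback applies because this output has no context property.
const withFallback = applyContextTransform(
  'output.context ?? "No context found"',
  { answer: 'Paris' },
  sampleContext,
);

console.log(fromMetadata, withFallback);
```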

Examples

Context-based metrics require a query and context. You must also set the threshold property on your test (all scores are normalized between 0 and 1).

Here's an example config using statically-defined (test.vars.context) context:

prompts:
  - |
    You are an internal corporate chatbot.
    Respond to this query: {{query}}
    Here is some context that you can use to write your response: {{context}}
providers:
  - openai:gpt-5
tests:
  - vars:
      query: What is the max purchase that doesn't require approval?
      context: file://docs/reimbursement.md
    assert:
      - type: contains
        value: '$500'
      - type: factuality
        value: the employee's manager is responsible for approvals
      - type: answer-relevance
        threshold: 0.9
      - type: context-recall
        threshold: 0.9
        value: max purchase price without approval is $500. Talk to Fred before submitting anything.
      - type: context-relevance
        threshold: 0.9
      - type: context-faithfulness
        threshold: 0.9
  - vars:
      query: How many weeks is maternity leave?
      context: file://docs/maternity.md
    assert:
      - type: factuality
        value: maternity leave is 4 months
      - type: answer-relevance
        threshold: 0.9
      - type: context-recall
        threshold: 0.9
        value: The company offers 4 months of maternity leave, unless you are an elephant, in which case you get 22 months of maternity leave.
      - type: context-relevance
        threshold: 0.9
      - type: context-faithfulness
        threshold: 0.9

Alternatively, if your system returns context in the response, like in a RAG system, you can use contextTransform:

prompts:
  - |
    You are an internal corporate chatbot.
    Respond to this query: {{query}}
providers:
  - openai:gpt-5
tests:
  - vars:
      query: What is the max purchase that doesn't require approval?
    assert:
      - type: context-recall
        contextTransform: 'output.context'
        threshold: 0.9
        value: max purchase price without approval is $500
      - type: context-relevance
        contextTransform: 'output.context'
        threshold: 0.9
      - type: context-faithfulness
        contextTransform: 'output.context'
        threshold: 0.9

Transforming outputs for context assertions

Transform: Extract answer before context grading

providers:
  - echo

tests:
  - vars:
      prompt: '{"answer": "Paris is the capital of France", "confidence": 0.95}'
      context: 'France is a country in Europe. Its capital city is Paris, which has over 2 million residents.'
    assert:
      - type: context-faithfulness
        transform: 'JSON.parse(output).answer' # Grade only the answer field
        threshold: 0.9

      - type: context-recall
        transform: 'JSON.parse(output).answer' # Check if answer appears in context
        value: 'Paris is the capital of France'
        threshold: 0.8

Context transform: Extract context from provider response

providers:
  - echo

tests:
  - vars:
      prompt: '{"answer": "Returns accepted within 30 days", "sources": ["Returns are accepted for 30 days from purchase", "30-day money-back guarantee"]}'
      query: 'What is the return policy?'
    assert:
      - type: context-faithfulness
        transform: 'JSON.parse(output).answer'
        contextTransform: 'JSON.parse(output).sources.join(". ")' # Extract sources as context
        threshold: 0.9

      - type: context-relevance
        contextTransform: 'JSON.parse(output).sources.join(". ")' # Check if context is relevant to query
        threshold: 0.8

Transform response: Normalize RAG system output

providers:
  - id: http://rag-api.example.com/search
    config:
      transformResponse: 'json.data' # Extract data field from API response

tests:
  - vars:
      query: 'What are the office hours?'
    assert:
      - type: context-faithfulness
        transform: 'output.answer' # After transformResponse, extract answer
        contextTransform: 'output.documents.map(d => d.text).join(" ")' # Extract documents as context
        threshold: 0.85

Processing order: API call → transformResponse → transform → contextTransform → context assertion
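
The processing order can be sketched as a small pipeline. Everything here is hypothetical (the evaluate and prepareForContextAssertion helpers, and the use of the Function constructor to run expressions). One detail is inferred from the examples above rather than stated outright: transform and contextTransform both read the provider output produced by transformResponse, so contextTransform does not see the result of transform.

```typescript
interface GradingInput {
  answer: string;
  context: string;
}

// Hypothetical helper: evaluates an expression with a single named argument.
function evaluate(expression: string, argName: string, value: unknown): unknown {
  return new Function(argName, `return (${expression});`)(value);
}

function prepareForContextAssertion(
  apiJson: object,
  transformResponse: string,
  transform: string,
  contextTransform: string,
): GradingInput {
  // 1. transformResponse turns the raw API JSON into the provider output.
  const output = evaluate(transformResponse, 'json', apiJson);
  // 2. transform extracts the text to grade from the provider output.
  const answer = evaluate(transform, 'output', output);
  // 3. contextTransform extracts the context, also from the provider output.
  const context = evaluate(contextTransform, 'output', output);
  // 4. The context assertion then grades the answer against the context.
  return { answer: String(answer), context: String(context) };
}

const gradingInput = prepareForContextAssertion(
  {
    data: {
      answer: 'Open 9 to 5',
      documents: [{ text: 'Office hours are 9 to 5.' }, { text: 'Closed on weekends.' }],
    },
  },
  'json.data',
  'output.answer',
  'output.documents.map(d => d.text).join(" ")',
);
console.log(gradingInput);
```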

Common patterns and troubleshooting

Understanding pass vs. score behavior

Model-graded assertions like llm-rubric determine PASS/FAIL using two mechanisms:

Without threshold: PASS depends only on the grader's pass field (defaults to true if omitted)
With threshold: PASS requires both pass === true AND score >= threshold

This means a result like {"pass": true, "score": 0} will pass without a threshold, but fail with threshold: 1.
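
The two rules can be written out as a short function. This is a sketch of the logic as described here, not promptfoo's source; GraderResult and resolvePass are names invented for the example.

```typescript
// Hypothetical shape of a grader's JSON response.
interface GraderResult {
  pass?: boolean;
  score: number;
  reason?: string;
}

function resolvePass(result: GraderResult, threshold?: number): boolean {
  // The grader's pass field defaults to true when omitted.
  const graderPass = result.pass === undefined ? true : result.pass;
  // Without a threshold, only the pass field matters.
  if (threshold === undefined) return graderPass;
  // With a threshold, both conditions must hold.
  return graderPass && result.score >= threshold;
}

console.log(resolvePass({ pass: true, score: 0 }));    // no threshold
console.log(resolvePass({ pass: true, score: 0 }, 1)); // threshold: 1
```

The first call passes because no threshold is set, and the second fails because the score of 0 is below the threshold of 1.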

Common issue: Tests show PASS even when scores are low

# ❌ Problem: All tests pass regardless of score
assert:
  - type: llm-rubric
    value: |
      Return 0 if the response is incorrect
      Return 1 if the response is correct
    # No threshold set - always passes if grader doesn't return explicit pass: false

Solutions:

# ✅ Option A: Add threshold to make score drive PASS/FAIL
assert:
  - type: llm-rubric
    value: |
      Return 0 if the response is incorrect
      Return 1 if the response is correct
    threshold: 1  # Only pass when score >= 1

# ✅ Option B: Have grader control pass explicitly
assert:
  - type: llm-rubric
    value: |
      Return {"pass": true, "score": 1} if the response is correct
      Return {"pass": false, "score": 0} if the response is incorrect

Threshold usage across assertion types

Different assertion types use thresholds differently:

assert:
  # Similarity-based (0-1 range)
  - type: context-faithfulness
    threshold: 0.8 # Requires 80%+ faithfulness

  # Binary scoring (0 or 1)
  - type: llm-rubric
    value: 'Is helpful and accurate'
    threshold: 1 # Requires perfect score

  # Custom scoring (any range)
  - type: pi
    value: 'Quality of response'
    threshold: 0.7

For more details on pass/score semantics, see the llm-rubric documentation.

Other assertion types

For more info on assertions, see Test assertions.
