Reference

Here is the main structure of the promptfoo configuration file:

Config

| Property | Type | Required | Description |
| --- | --- | --- | --- |
| description | string | No | Optional description of what your LLM is trying to do |
| tags | Record<string, string> | No | Optional tags to describe the test suite (e.g. env: production, application: chatbot) |
| providers | ProvidersConfig | Yes, unless targets is set | One or more LLM APIs to use. Exactly one of providers or targets must be set. |
| targets | ProvidersConfig | Yes, unless providers is set | Alias for providers, commonly used in red team configs. Exactly one of targets or providers must be set. |
| prompts | string \| string[] \| Record<string, string> \| Prompt[] | Yes | One or more prompts to load |
| tests | string \| (string \| TestCase \| TestGeneratorConfig)[] \| TestGeneratorConfig | No | Path to a test file, inline tests, or a generator. If omitted, promptfoo runs each prompt/provider pair once with empty vars. |
| scenarios | (string \| Scenario)[] | No | Scenario files or inline scenario definitions |
| defaultTest | file://${string} \| Partial<TestCase> | No | Sets the default properties for each test case. Can be an inline object or a file:// path to an external YAML/JSON file. |
| outputPath | string \| string[] | No | Where to write output. Writes to console/web viewer if not set. See output formats. |
| sharing | boolean \| object | No | Enables or configures result sharing with optional apiBaseUrl and appBaseUrl fields |
| nunjucksFilters | Record<string, string> | No | Map of Nunjucks filter names to file paths |
| env | Record<string, string \| number \| boolean> | No | Environment variables to set for the test run. These values override existing environment variables and can be used to set API keys and other configuration values needed by providers. |
| derivedMetrics | DerivedMetric[] | No | Metrics calculated after the eval from named assertion scores |
| extensions | string[] \| null | No | List of extension files to load. Each extension is a file path with a function name. Can be Python (.py) or JavaScript (.js) files. Supported hooks are beforeAll, afterAll, beforeEach, afterEach. |
| metadata | Record<string, any> | No | Arbitrary metadata stored with the eval config |
| redteam | RedteamConfig | No | Red team configuration |
| writeLatestResults | boolean | No | Write latest results to promptfoo storage so they can be viewed in the web UI |
| tracing | TracingConfig | No | OpenTelemetry tracing configuration |
| evaluateOptions.maxConcurrency | number | No | Maximum number of concurrent requests. Defaults to 4 |
| evaluateOptions.repeat | number | No | Number of times to run each test case. Defaults to 1 |
| evaluateOptions.delay | number | No | Force the test runner to wait after each API call (milliseconds). Defaults to 0 |
| evaluateOptions.showProgressBar | boolean | No | Whether to display the progress bar |
| evaluateOptions.cache | boolean | No | Whether to use disk cache for results (default: true) |
| evaluateOptions.timeoutMs | number | No | Timeout in milliseconds for each individual test case/provider API call. When reached, that specific test is marked as an error. Default is 0 (no timeout). |
| evaluateOptions.maxEvalTimeMs | number | No | Maximum total runtime in milliseconds for the entire evaluation process. When reached, all remaining tests are marked as errors and the evaluation ends. Default is 0 (no limit). |
| commandLineOptions | CommandLineOptions | No | Default values for command-line options. These values are used unless overridden by actual command-line arguments. |

Test Case

A test case represents a single example input that is fed into all prompts and providers.

| Property | Type | Required | Description |
| --- | --- | --- | --- |
| description | string | No | Description of what you're testing |
| vars | Record<string, VarValue> \| string \| string[] | No | Key-value pairs to substitute in the prompt. If vars is a string or string array, promptfoo loads test vars from those file paths. See Test Case Configuration for loading vars from files. |
| provider | string \| ProviderOptions \| ApiProvider | No | Override the default provider for this specific test case |
| providers | string[] | No | Filter which providers this test runs against. Supports labels, IDs, and wildcards (e.g., openai:*). See filtering tests by provider. |
| prompts | string[] | No | Filter this test to run only with specific prompts (by label or ID). Supports wildcards like Math:*. See Filtering Tests by Prompt. |
| providerOutput | string \| Record<string, unknown> | No | Precomputed provider output. When set, promptfoo skips calling the provider and runs assertions directly against this output. |
| assert | (Assertion \| Assertion Set)[] | No | List of automatic checks to run on the LLM output. See assertions & metrics for all available types. |
| assertScoringFunction | file:// JavaScript/Python path \| function | No | Custom scoring function that combines named assertion scores into the final grading result. |
| threshold | number | No | Test will fail if the combined score of assertions is less than this number |
| metadata | Record<string, any> | No | Additional metadata to include with the test case, useful for filtering or grouping results |
| options | Object | No | Additional configuration settings for the test case |
| options.transformVars | string \| function | No | A filepath (js or py), JavaScript snippet, or Node.js function that runs on the vars before they are substituted into the prompt. See transforming input variables. |
| options.transform | string \| function | No | A filepath (js or py), JavaScript snippet, or Node.js function that runs on LLM output before assertions. See transforming outputs. |
| options.postprocess | string \| function | No | Deprecated alias for options.transform |
| options.prefix | string | No | Text to prepend to the prompt |
| options.suffix | string | No | Text to append to the prompt |
| options.provider | string \| ProviderOptions \| ApiProvider \| Record<string, any> | No | The API provider to use for model-graded assertion grading |
| options.rubricPrompt | string \| string[] \| ChatMessage[] | No | Custom prompt for model-graded assertions |
| options.factuality | object | No | Score weights for factuality assertions (subset, superset, agree, disagree, differButFactual) |
| options.disableVarExpansion | boolean | No | If true, arrays in vars are not expanded into multiple test cases |
| options.disableConversationVar | boolean | No | If true, promptfoo does not include the implicit _conversation variable in the prompt |
| options.disableDefaultAsserts | boolean | No | If true, this test case does not inherit assertions from defaultTest.assert; other defaultTest properties still apply |
| options.runSerially | boolean | No | If true, run this test case without concurrency regardless of global settings |
| options.storeOutputAs | string | No | The output of this test will be stored as a variable, which can be used in subsequent tests. See multi-turn conversations. |
| options.<provider-specific> | any | No | Provider-specific config fields (e.g., response_format, responseSchema) are passed through to the provider. Use file:// to load from external files. See Per-test provider config. |
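
For illustration, a test case that combines several of these properties might look like the following sketch (the variable names and assertion values are made up for the example):

```yaml
tests:
  - description: 'Checks the capital of France'
    vars:
      country: France
    threshold: 0.5
    options:
      transform: output.trim()
    assert:
      - type: contains
        value: Paris
      - type: llm-rubric
        value: Answers concisely and does not hedge
```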

Test Generator Config

Use a test generator config when tests should be produced by a JavaScript or Python generator.

| Property | Type | Required | Description |
| --- | --- | --- | --- |
| path | string | Yes | Path to the generator function, e.g. file://path/to/tests.py:function_name |
| config | Record<string, any> | No | Configuration passed to the generator. Values may reference file:// paths. |
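
As a sketch, a generator-backed test suite might be configured like this (the file path, function name, and config keys are placeholders):

```yaml
tests:
  path: file://generate_tests.py:generate_tests
  config:
    # Arbitrary values handed to the generator function
    dataset: file://data/questions.csv
    max_cases: 25
```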

Per-test provider config

Test-level options can include provider-specific configuration fields that override the provider's default config for that specific test. This is useful for:

  • Using different structured output schemas per test
  • Varying temperature or other parameters for specific test cases
  • Testing the same prompt with different model configurations
```yaml
tests:
  - vars:
      question: 'What is 2 + 2?'
    options:
      # Provider-specific: loaded from external file
      response_format: file://./schemas/math-response.json
      # Provider-specific: inline override
      temperature: 0
```

The external file must contain the complete configuration object. For OpenAI structured outputs:

schemas/math-response.json

```json
{
  "type": "json_schema",
  "json_schema": {
    "name": "math_response",
    "strict": true,
    "schema": {
      "type": "object",
      "properties": {
        "answer": { "type": "number" },
        "explanation": { "type": "string" }
      },
      "required": ["answer", "explanation"],
      "additionalProperties": false
    }
  }
}
```

See the OpenAI structured outputs guide for more details.

Assertion

For more details on using assertions, including examples, see the assertions & metrics documentation.

| Property | Type | Required | Description |
| --- | --- | --- | --- |
| type | string | Yes | Type of assertion. See assertion types for all available types. not- prefixes are supported for most base assertion types. |
| value | string \| string[] \| number \| object \| function | No | The expected value, if applicable |
| config | Record<string, any> | No | Extra configuration passed to the assertion or assertion value function |
| threshold | number | No | The threshold value, applicable only to certain types such as similar, cost, javascript, python |
| weight | number | No | Weight of this assertion compared to other assertions in the test case. Defaults to 1 |
| provider | string \| ProviderOptions \| ApiProvider \| Record<string, any> | No | Some assertions (type = similar, llm-rubric, model-graded-*) require an LLM provider |
| rubricPrompt | string \| string[] \| ChatMessage[] | No | Override the grading rubric for model-graded assertions |
| metric | string | No | The label for this result. Assertions with the same metric are aggregated together. See named metrics. |
| transform | string \| function | No | Transform the output before running this assertion. This receives the test-transformed output. |
| contextTransform | string \| function | No | Transform provider-normalized output into context for context-based assertions. See Context Transform for more details. |
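
Putting several of these fields together, an assertion list might look like the following sketch (the metric names and embedding provider are illustrative):

```yaml
assert:
  - type: contains
    value: refund
    weight: 2
    metric: accuracy
  - type: similar
    value: 'Our refund policy lasts 30 days.'
    threshold: 0.8
    provider: openai:embedding:text-embedding-3-small
  - type: javascript
    value: output.length < 500
    transform: output.trim()
    metric: brevity
```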

Assertion Set

An assertion set groups multiple assertions and can define its own threshold, metric, weight, and shared config.

| Property | Type | Required | Description |
| --- | --- | --- | --- |
| type | 'assert-set' | Yes | Marks this item as an assertion set |
| assert | Assertion[] | Yes | Assertions in the set |
| threshold | number | No | Required score for the set |
| weight | number | No | Weight of this set compared to other assertions or assertion sets |
| metric | string | No | Named metric for the set |
| config | Record<string, any> | No | Shared config passed into every assertion in the set |
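
A minimal sketch of an assertion set, assuming two hypothetical tone checks:

```yaml
assert:
  - type: assert-set
    threshold: 0.5 # at least half of the set's weighted score must pass
    metric: tone
    assert:
      - type: icontains
        value: please
      - type: not-icontains
        value: unfortunately
```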

CommandLineOptions

Set default values for command-line options. These defaults will be used unless overridden by command-line arguments.

| Property | Type | Description |
| --- | --- | --- |
| **Basic Configuration** | | |
| description | string | Description of what your LLM is trying to do |
| config | string[] | Path(s) to configuration files |
| envPath | string \| string[] | Path(s) to .env file(s). When multiple files are specified, later files override earlier values. |
| **Input Files** | | |
| prompts | string[] | One or more paths to prompt files |
| providers | string[] | One or more LLM provider identifiers |
| tests | string | Path to CSV file with test cases |
| vars | string | Path to CSV file with test variables |
| assertions | string | Path to assertions file |
| modelOutputs | string | Path to JSON file containing model outputs |
| **Prompt Modifications** | | |
| promptPrefix | string | Text to prepend to every prompt |
| promptSuffix | string | Text to append to every prompt |
| generateSuggestions | boolean | Generate new prompts and append them to the prompt list |
| **Test Execution** | | |
| maxConcurrency | number | Maximum number of concurrent requests |
| repeat | number | Number of times to run each test case |
| delay | number | Delay between API calls in milliseconds |
| grader | string | Provider that will grade model-graded outputs |
| var | object | Set test variables as key-value pairs (e.g. {key1: 'value1', key2: 'value2'}) |
| **Filtering** | | |
| filterPattern | string | Only run tests whose description matches the regular expression pattern |
| filterPrompts | string | Only run tests with prompts whose id or label matches this regex |
| filterProviders | string | Only run tests with providers matching this regex (matches against provider id or label) |
| filterTargets | string | Only run tests with targets matching this regex (alias for filterProviders) |
| filterFirstN | number | Only run the first N test cases |
| filterRange | string | Run test cases in a zero-based start:end range. The end index is exclusive |
| filterSample | number | Run a random sample of N test cases |
| filterMetadata | string \| string[] | Only run tests matching metadata filters in key=value format. Multiple filters are combined with AND logic. |
| filterErrorsOnly | string | Only run tests that resulted in errors from a previous output path or eval ID |
| filterFailing | string | Only run non-passing tests (assertion failures and errors) from a previous output path or eval ID |
| filterFailingOnly | string | Only run assertion failures from a previous output path or eval ID, excluding errors |
| **Output & Display** | | |
| output | string[] | Output file paths (csv, txt, json, yaml, yml, html) |
| table | boolean | Show output table (default: true, disable with --no-table) |
| tableCellMaxLength | number | Maximum length of table cells in console output |
| progressBar | boolean | Whether to display progress bar during evaluation |
| verbose | boolean | Enable verbose output |
| share | boolean | Whether to create a shareable URL |
| noShare | boolean | Disable sharing, overriding config-based sharing |
| **Caching & Storage** | | |
| cache | boolean | Whether to use disk cache for results (default: true) |
| write | boolean | Whether to write results to promptfoo directory (default: true) |
| **Other Options** | | |
| watch | boolean | Whether to watch for config changes and re-run automatically |
| retryErrors | boolean | Retry all ERROR results from the latest eval |
| extension | string[] | Extension hooks to load from the CLI (same format as top-level extensions) |

Example

promptfooconfig.yaml
```yaml
# yaml-language-server: $schema=https://promptfoo.dev/config-schema.json
prompts:
  - prompt1.txt
  - prompt2.txt

providers:
  - openai:gpt-5

tests: tests.csv

# Set default command-line options
commandLineOptions:
  envPath: # Load from multiple .env files (later overrides earlier)
    - .env
    - .env.local
  maxConcurrency: 10
  repeat: 3
  delay: 1000
  verbose: true
  grader: openai:gpt-5-mini
  table: true
  cache: false
  tableCellMaxLength: 100

  # Filtering options
  filterPattern: 'auth.*' # Only run tests with 'auth' in description
  filterProviders: 'openai.*' # Only test OpenAI providers
  filterRange: '0:100' # Run tests 0 through 99
  filterSample: 50 # Random sample of 50 tests

  # Prompt modifications
  promptPrefix: 'You are a helpful assistant. '
  promptSuffix: "\n\nPlease be concise."

  # Variables
  var:
    temperature: '0.7'
    max_tokens: '1000'
```

With this configuration, running npx promptfoo eval will use these defaults. You can still override them:

```sh
# Uses maxConcurrency: 10 from config
npx promptfoo eval

# Overrides maxConcurrency to 5
npx promptfoo eval --max-concurrency 5
```

AssertionValueFunctionContext

When using JavaScript or Python assertions, your function receives a context object with the following interface:

```ts
interface AssertionValueFunctionContext {
  // Raw prompt sent to LLM
  prompt: string | undefined;

  // Test case variables
  vars: Record<string, VarValue>;

  // The complete test case (see #test-case)
  test: AtomicTestCase;

  // Log probabilities from the LLM response, if available
  logProbs: number[] | undefined;

  // Configuration passed to the assertion
  config?: Record<string, any>;

  // The provider that generated the response (see /docs/providers)
  provider: ApiProvider | undefined;

  // The complete provider response (see #providerresponse)
  providerResponse: ProviderResponse | undefined;

  // OpenTelemetry trace data when tracing is enabled and the assertion uses trace context
  trace?: TraceData;
}
```
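
As a sketch, an inline JavaScript assertion can read this context; the variable name expected_keyword and the config key are hypothetical:

```yaml
assert:
  - type: javascript
    value: |
      // `output` is the (test-transformed) LLM output; `context` matches the interface above
      return output.includes(context.vars.expected_keyword);
    config:
      strictness: high # surfaced to the function as context.config.strictness
```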

VarValue

VarValue is the value type accepted in test vars, assertion contexts, and provider call contexts.

```ts
type VarValue = string | number | boolean | object | unknown[];
```

TraceData

TraceData is available to trace-aware assertions when tracing is enabled.

```ts
interface TraceSpan {
  spanId: string;
  parentSpanId?: string;
  name: string;
  startTime: number;
  endTime?: number;
  attributes?: Record<string, any>;
  statusCode?: number;
  statusMessage?: string;
}

interface TraceData {
  traceId: string;
  evaluationId: string;
  testCaseId: string;
  metadata?: Record<string, any>;
  spans: TraceSpan[];
}
```

**Note:** promptfoo supports .js and .json file extensions in addition to .yaml.

It automatically loads promptfooconfig.*, but you can use a custom config file with promptfoo eval -c path/to/config.

Extension Hooks

Promptfoo supports extension hooks that allow you to run custom code that modifies the evaluation state at specific points in the evaluation lifecycle. These hooks are defined in extension files specified in the extensions property of the configuration.

Available Hooks

| Name | Description | Context |
| --- | --- | --- |
| beforeAll | Runs before the entire test suite begins | { suite: TestSuite } |
| afterAll | Runs after the entire test suite has finished | { results: EvaluateResult[], prompts: CompletedPrompt[], suite: TestSuite, evalId: string, config: Partial<UnifiedConfig> } |
| beforeEach | Runs before each individual test | { test: TestCase } |
| afterEach | Runs after each individual test | { test: TestCase, result: EvaluateResult } |

Session Management in Hooks

For multi-turn conversations or stateful interactions, hooks can be used to manage per-test sessions (i.e. "conversation threads").

Pre-Test Session Definition

A common pattern is to create a session on your server in the beforeEach hook and clean it up in the afterEach hook:

```js
export async function extensionHook(hookName, context) {
  if (hookName === 'beforeEach') {
    const res = await fetch('http://localhost:8080/session');
    const sessionId = await res.text();
    // Scope the session id to the current test case
    return { test: { ...context.test, vars: { ...context.test.vars, sessionId } } };
  }

  if (hookName === 'afterEach') {
    // Read the session id from the test case scope
    const id = context.test.vars.sessionId;
    await fetch(`http://localhost:8080/session/${id}`, { method: 'DELETE' });
  }
}
```

See the working stateful-session-management example for a complete implementation.

Test-Time Session Definition

Session IDs returned by your provider in response.sessionId are used as the session ID for the test case. If the provider does not return a session ID, the test variable vars.sessionId is used as a fallback.

For HTTP providers, you can extract session IDs from server responses using a sessionParser configuration. The session parser tells promptfoo how to extract the session ID from response headers or body, which then becomes response.sessionId. For example:

```yaml
providers:
  - id: http
    config:
      url: 'https://example.com/api'
      # Session parser extracts ID from response → becomes response.sessionId
      sessionParser: 'data.headers["x-session-id"]'
      headers:
        # Use the extracted session ID in subsequent requests
        'x-session-id': '{{sessionId}}'
```

See the HTTP provider session management documentation for complete details on configuring session parsers.

The resolved session ID is made available in the afterEach hook context at:

```js
context.result.metadata.sessionId;
```

Note: For regular providers, the sessionId comes from either response.sessionId (provider-generated via session parser or direct provider support) or vars.sessionId (set in beforeEach hook or test config). The priority is: response.sessionId > vars.sessionId.

For example:

```js
async function extensionHook(hookName, context) {
  if (hookName === 'afterEach') {
    const sessionId = context.result.metadata.sessionId;
    if (sessionId) {
      console.log(`Test completed with session: ${sessionId}`);
      // You can use this sessionId for tracking, logging, or cleanup
    }
  }
}
```

For iterative red team strategies (e.g., jailbreak, tree search), the sessionIds array is made available in the afterEach hook context at:

```js
context.result.metadata.sessionIds;
```

This is an array containing all session IDs from the iterative exploration process. Each iteration may have its own session ID, allowing you to track the full conversation history across multiple attempts.

Example usage for iterative providers:

```js
async function extensionHook(hookName, context) {
  if (hookName === 'afterEach') {
    // For regular providers - single session ID
    const sessionId = context.result.metadata.sessionId;

    // For iterative providers (jailbreak, tree search) - array of session IDs
    const sessionIds = context.result.metadata.sessionIds;
    if (sessionIds && Array.isArray(sessionIds)) {
      console.log(`Jailbreak completed with ${sessionIds.length} iterations`);
      sessionIds.forEach((id, index) => {
        console.log(`  Iteration ${index + 1}: session ${id}`);
      });
      // You can use these sessionIds for detailed tracking of the attack path
    }
  }
}
```

Note: The sessionIds array only contains defined session IDs - any iterations without a session ID are filtered out.

Implementing Hooks

To implement these hooks, create a JavaScript or Python file with a function that handles the hooks you want to use. Then, specify the path to this file and the function name in the extensions array in your configuration.

**Note:** A custom function name receives all event types (beforeAll, afterAll, beforeEach, afterEach) with the legacy (hookName, context) calling convention. If the function name is exactly one of the hook names, promptfoo only runs it for that hook and calls it as (context, { hookName }).

Example configuration:

```yaml
extensions:
  - file://path/to/your/extension.js:extensionHook
  - file://path/to/your/extension.py:extension_hook
```

**Important:** When specifying an extension in the configuration, you must include the function name after the file path, separated by a colon (:). This tells promptfoo which function to call in the extension file.

Python example extension file:

```python
from typing import Optional

def extension_hook(hook_name, context) -> Optional[dict]:
    # Perform any necessary setup
    if hook_name == 'beforeAll':
        print(f"Setting up test suite: {context['suite'].get('description', '')}")

        # Add an additional test case to the suite:
        context["suite"]["tests"].append(
            {
                "vars": {
                    "body": "It's a beautiful day",
                    "language": "Spanish",
                },
                "assert": [{"type": "contains", "value": "Es un día hermoso."}],
            }
        )

        # Add an additional default assertion to the suite:
        context["suite"]["defaultTest"]["assert"].append({"type": "is-json"})

        return context

    # Perform any necessary teardown or reporting
    elif hook_name == 'afterAll':
        print(f"Test suite completed: {context['suite'].get('description', '')}")
        print(f"Total tests: {len(context['results'])}")

    # Prepare for individual test
    elif hook_name == 'beforeEach':
        print(f"Running test: {context['test'].get('description', '')}")

        # Change all languages to pirate-dialect
        context["test"]["vars"]["language"] = f'Pirate {context["test"]["vars"]["language"]}'

        return context

    # Clean up after individual test or log results
    elif hook_name == 'afterEach':
        print(f"Test completed: {context['test'].get('description', '')}. Pass: {context['result'].get('success', False)}")
```

JavaScript example extension file:

```js
async function extensionHook(hookName, context) {
  // Perform any necessary setup
  if (hookName === 'beforeAll') {
    console.log(`Setting up test suite: ${context.suite.description || ''}`);

    // Add an additional test case to the suite:
    context.suite.tests.push({
      vars: {
        body: "It's a beautiful day",
        language: 'Spanish',
      },
      assert: [{ type: 'contains', value: 'Es un día hermoso.' }],
    });

    return context;
  }

  // Perform any necessary teardown or reporting
  else if (hookName === 'afterAll') {
    console.log(`Test suite completed: ${context.suite.description || ''}`);
    console.log(`Total tests: ${context.results.length}`);
  }

  // Prepare for individual test
  else if (hookName === 'beforeEach') {
    console.log(`Running test: ${context.test.description || ''}`);

    // Change all languages to pirate-dialect
    context.test.vars.language = `Pirate ${context.test.vars.language}`;

    return context;
  }

  // Clean up after individual test or log results
  else if (hookName === 'afterEach') {
    console.log(
      `Test completed: ${context.test.description || ''}. Pass: ${context.result.success || false}`,
    );
  }
}

module.exports = extensionHook;
```

These hooks provide powerful extensibility to your promptfoo evaluations, allowing you to implement custom logic for setup, teardown, logging, or integration with other systems. The extension function receives the hookName and a context object, which contains relevant data for each hook type. You can use this information to perform actions specific to each stage of the evaluation process.

The beforeAll, beforeEach, and afterEach hooks may mutate specific properties of their respective context arguments in order to modify evaluation state. To persist these changes, the hook must return the modified context.

All merges are shallow: returned properties replace existing values at the top level. Nested objects (e.g., metadata: { nested: { a: 1 } }) are replaced entirely, not deep-merged.

beforeAll

| Property | Type | Description |
| --- | --- | --- |
| context.suite.prompts | Prompt[] | The prompts to be evaluated. |
| context.suite.providerPromptMap | Record<string, string[]> | A map of provider IDs to prompt labels. |
| context.suite.tests | TestCase[] | The test cases to be evaluated. |
| context.suite.scenarios | Scenario[] | The scenarios to be evaluated. |
| context.suite.defaultTest | TestCase | The default test case to be evaluated. |
| context.suite.nunjucksFilters | Record<string, Function> | A map of Nunjucks filters. |
| context.suite.derivedMetrics | DerivedMetric[] | Derived metrics. |
| context.suite.redteam | RedteamConfig | The red team configuration to be evaluated. |

beforeEach

| Property | Type | Description |
| --- | --- | --- |
| context.test | TestCase | The test case to be evaluated. |

afterEach

| Property | Type | Description |
| --- | --- | --- |
| context.result.namedScores | Record<string, number> | Custom numeric metrics (e.g., num_turns, cost_usd). |
| context.result.metadata | Record<string, any> | Structured data (e.g., tool call details, URLs). |
| context.result.response.metadata | Record<string, any> | Response-level metadata (e.g., session viewer URLs). |
Fields like success, score, and response.output are not overridable from afterEach.

afterAll

The afterAll hook is intended for side effects (sending to monitoring, cleanup, etc.) and its return value is not persisted. Use it for read-only operations on the completed evaluation.

| Property | Type | Description |
| --- | --- | --- |
| context.suite | TestSuite | The completed test suite |
| context.results | EvaluateResult[] | All evaluation results as plain objects |
| context.prompts | CompletedPrompt[] | Completed prompts with metrics |
| context.evalId | string | Unique identifier for this eval run |
| context.config | Partial<UnifiedConfig> | The full evaluation configuration |

Guardrails

GuardrailResponse represents guardrail results returned by a provider. It includes flags indicating whether the prompt or output failed guardrails.

```ts
interface GuardrailResponse {
  flagged?: boolean;
  flaggedInput?: boolean;
  flaggedOutput?: boolean;
  reason?: string;
}
```
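
Because assertions receive the full provider response (see AssertionValueFunctionContext above), one way to act on these flags is a JavaScript assertion. A minimal sketch, assuming your provider populates guardrails:

```yaml
assert:
  - type: javascript
    value: |
      // Fail the test if the provider's guardrails flagged the prompt or output
      const g = context.providerResponse && context.providerResponse.guardrails;
      return !(g && (g.flagged || g.flaggedInput || g.flaggedOutput));
```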

Transformation Pipeline

Understanding the transformation pipeline is crucial for complex evaluations, especially for RAG systems, which rely on context-based assertions. Here's how transforms are applied:

Execution Flow

Transforms run in a fixed order: the provider transform (transformResponse) normalizes the raw response first, the test-level options.transform and any contextTransform then operate on that normalized output, and finally assertion-level transforms run on the already test-transformed value.

Complete Example: RAG System Evaluation

This example demonstrates how different transforms work together in a RAG evaluation:

```yaml
providers:
  - id: 'http://localhost:3000/api/rag'
    config:
      # Step 1: Provider transform - normalize API response structure
      transformResponse: |
        // API returns: { status: "success", data: { answer: "...", sources: [...] } }
        // Transform to: { answer: "...", sources: [...] }
        json.data

tests:
  - vars:
      query: 'What is the refund policy?'

    options:
      # Step 2a: Test transform - extract answer for general assertions
      # Receives output from transformResponse: { answer: "...", sources: [...] }
      transform: 'output.answer'

    assert:
      # Regular assertion uses test-transformed output (just the answer string)
      - type: contains
        value: '30 days'

      # Context assertions use contextTransform
      - type: context-faithfulness
        # Step 2b: Context transform - extract sources
        # Also receives output from transformResponse: { answer: "...", sources: [...] }
        contextTransform: 'output.sources.map(s => s.content).join("\n")'
        threshold: 0.9

      # Another assertion can have its own transform
      - type: equals
        value: 'confident'
        # Step 3: Assertion-level transform (applied after test transform)
        # Receives: "30-day refund policy" (the test-transformed output)
        transform: |
          output.includes("30") ? "confident" : "uncertain"
```

Key Points

  1. Provider Transform (transformResponse): Applied first to normalize provider responses
  2. Test Case Transforms:
    • options.transform: Modifies output for regular assertions
    • contextTransform: Extracts context for context-based assertions
    • Both receive the provider-transformed output directly
  3. Assertion Transform: Applied to already-transformed output for specific assertions

ProvidersConfig

```ts
type ProvidersConfig =
  | string
  | ProviderFunction
  | ApiProvider
  | (string | ProviderFunction | ApiProvider | Record<string, ProviderOptions> | ProviderOptions)[];
```
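
The string and object forms can be mixed in a single list. For example:

```yaml
providers:
  - openai:gpt-5 # plain string
  - id: openai:gpt-5-mini # ProviderOptions object
    label: mini
    config:
      temperature: 0
```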

ProviderFunction

A ProviderFunction is a function that takes a prompt as an argument and returns a Promise that resolves to a ProviderResponse. It allows you to define custom logic for calling an API.

```ts
type ProviderFunction = (
  prompt: string,
  context?: CallApiContextParams,
  options?: { includeLogProbs?: boolean; abortSignal?: AbortSignal },
) => Promise<ProviderResponse>;
```

CallApiContextParams

CallApiContextParams is the context passed to provider callApi implementations and model-graded assertion providers.

```ts
interface CallApiContextParams {
  filters?: Record<string, (...args: any[]) => string>;
  getCache?: any;
  logger?: any;
  originalProvider?: ApiProvider;
  prompt: Prompt;
  vars: Record<string, VarValue>;
  debug?: boolean;
  test?: AtomicTestCase;
  bustCache?: boolean;

  // W3C Trace Context headers
  traceparent?: string;
  tracestate?: string;

  // Evaluation metadata
  evaluationId?: string;
  testCaseId?: string;
  testIdx?: number;
  promptIdx?: number;
  repeatIndex?: number;
}
```

ProviderOptions

ProviderOptions is an object that includes the id of the provider and an optional config object that can be used to pass provider-specific configurations.

For providers with built-in cost estimation, config can also include pricing overrides such as cost, inputCost, and outputCost. When supported, inputCost and outputCost take precedence over the legacy shared cost value. OpenAI audio-capable models also support audioCost, audioInputCost, and audioOutputCost.

```ts
interface ProviderOptions {
  id?: ProviderId;
  label?: string;
  config?: any;

  // List of prompt labels to include (exact, group prefix like "group", or wildcard "group:*")
  prompts?: string[];

  // Transform the output, either with inline Javascript, external py/js script, or a function
  // See /docs/configuration/guide#transforming-outputs
  transform?: string | TransformFunction;

  // Sleep this long before each request
  delay?: number;

  // Provider-specific environment overrides
  env?: EnvOverrides;

  // Multi-input definitions for red team targets
  inputs?: Inputs;
}
```
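
In YAML, a provider entry using these options might look like this (the label, delay, and prompt group are illustrative):

```yaml
providers:
  - id: openai:gpt-5
    label: production-model
    delay: 500 # sleep 500 ms before each request
    transform: output.trim() # post-process every response
    prompts:
      - chat:* # only run prompts in the "chat" group
    config:
      temperature: 0.2
```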

ProviderResponse

ProviderResponse is an object that represents the response from a provider. It includes the output from the provider, any error that occurred, information about token usage, and a flag indicating whether the response was cached.

```ts
interface ProviderResponse {
  cached?: boolean;
  cost?: number; // required for cost assertion (see /docs/configuration/expected-outputs/deterministic#cost)
  error?: string;
  output?: any;
  raw?: any;
  prompt?: string | ChatMessage[]; // actual prompt sent, if different from rendered prompt
  metadata?: {
    redteamFinalPrompt?: string;
    http?: {
      status: number;
      statusText: string;
      headers?: Record<string, string>;
      requestHeaders?: Record<string, string>;
    };
    [key: string]: any;
  };
  tokenUsage?: TokenUsage;
  materializationHandled?: boolean;
  materializedVars?: Record<string, string>;
  inputMaterialization?: Record<string, unknown>;
  providerTransformedOutput?: any;
  logProbs?: number[]; // required for perplexity assertion (see /docs/configuration/expected-outputs/deterministic#perplexity)
  latencyMs?: number;
  isRefusal?: boolean; // the provider has explicitly refused to generate a response (see /docs/configuration/expected-outputs/deterministic#is-refusal)
  finishReason?: string;
  sessionId?: string;
  conversationEnded?: boolean;
  conversationEndReason?: string;
  guardrails?: GuardrailResponse;
  isBase64?: boolean;
  format?: string;
  audio?: {
    id?: string;
    data?: string;
    blobRef?: BlobRef;
    transcript?: string;
    [key: string]: any;
  };
  video?: { id?: string; blobRef?: BlobRef; url?: string; model?: string; [key: string]: any };
  images?: ImageOutput[];
}
```

ProviderEmbeddingResponse

ProviderEmbeddingResponse is an object that represents the response from a provider's embedding API. It includes the embedding from the provider, any error that occurred, and information about token usage.

```ts
interface ProviderEmbeddingResponse {
  cached?: boolean;
  cost?: number;
  error?: string;
  embedding?: number[];
  latencyMs?: number;
  tokenUsage?: Partial<TokenUsage>;
  metadata?: {
    transformed?: boolean;
    originalText?: string;
    [key: string]: any;
  };
}
```

Evaluation inputs

TestSuite

TestSuite is the resolved runtime suite passed to extension hooks after providers, prompts, tests, filters, and other config have been loaded.

```ts
interface TestSuite {
  tags?: Record<string, string>;
  description?: string;
  providers: ApiProvider[];
  prompts: Prompt[];
  providerPromptMap?: Record<string, string[]>;
  tests?: TestCase[];
  scenarios?: Scenario[];
  defaultTest?: `file://${string}` | Omit<TestCase, 'description'>;
  nunjucksFilters?: Record<string, (...args: any[]) => string>;
  env?: EnvOverrides;
  derivedMetrics?: DerivedMetric[];
  extensions?: string[] | null;
  redteam?: RedteamConfig;
  tracing?: TracingConfig;
}
```

TestSuiteConfiguration

The source type name for this pre-parse configuration shape is TestSuiteConfig.

```ts
interface TestSuiteConfig {
  // Optional tags to describe the test suite
  tags?: Record<string, string>;

  // Optional description of what you're trying to test
  description?: string;

  // One or more LLM APIs to use, for example: openai:gpt-5-mini, openai:gpt-5, localai:chat:vicuna
  providers: ProvidersConfig;

  // One or more prompts
  prompts: string | (string | Prompt)[] | Record<string, string>;

  // Path to a test file, OR list of LLM prompt variations (aka "test case")
  tests?: string | (string | TestCase | TestGeneratorConfig)[] | TestGeneratorConfig;

  // Scenarios, groupings of data and tests to be evaluated
  scenarios?: (string | Scenario)[];

  // Sets the default properties for each test case. Useful for setting an assertion on all test cases, for example.
  defaultTest?: `file://${string}` | Omit<TestCase, 'description'>;

  // Path to write output. Writes to console/web viewer if not set.
  outputPath?: string | string[];

  // Determines whether or not sharing is enabled.
  sharing?:
    | boolean
    | {
        apiBaseUrl?: string;
        appBaseUrl?: string;
      };

  // Nunjucks filters
  nunjucksFilters?: Record<string, string>;

  // Environment variable overrides
  env?: EnvOverrides | Record<string, string | number | boolean>;

  // Metrics to calculate after the eval has completed
  derivedMetrics?: DerivedMetric[];

  // Extension hooks
  extensions?: string[] | null;

  // Arbitrary metadata about this configuration
  metadata?: Record<string, any>;

  // Red team configuration
  redteam?: RedteamConfig;

  // Whether to write latest results to promptfoo storage. This enables you to use the web viewer.
  writeLatestResults?: boolean;

  // OpenTelemetry tracing configuration
  tracing?: TracingConfig;
}
```

UnifiedConfig

UnifiedConfig is an object that includes the test suite configuration, evaluation options, and command line options. It is used to hold the complete configuration for the evaluation.

```ts
interface UnifiedConfig extends Omit<TestSuiteConfig, 'providers'> {
  // Exactly one of providers or targets must be set.
  providers?: ProvidersConfig;
  targets?: ProvidersConfig;
  evaluateOptions?: EvaluateOptions;
  commandLineOptions?: Partial<CommandLineOptions>;
}
```

Scenario

Scenario is an object that represents a group of test cases to be evaluated. It includes a description, default test case configuration, and a list of test cases.

```ts
interface Scenario {
  description?: string;
  config: Partial<TestCase>[];
  tests: TestCase[];
}
```

See the scenarios documentation for property descriptions.
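
A minimal inline scenario, where each entry in config is combined with every test (the vars are illustrative):

```yaml
scenarios:
  - description: 'Translation quality'
    config:
      - vars:
          language: Spanish
      - vars:
          language: French
    tests:
      - vars:
          input: 'Hello world'
        assert:
          - type: llm-rubric
            value: 'Is an accurate {{language}} translation of {{input}}'
```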

DerivedMetric

DerivedMetric calculates a metric from named assertion scores after the eval has completed.

```ts
interface DerivedMetric {
  name: string;
  value: string | ((namedScores: Record<string, number>, context: RunEvalOptions) => number);
}
```
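
String values are evaluated as mathematical expressions over the named scores. For example, assuming assertions elsewhere emit precision and recall as named metrics:

```yaml
derivedMetrics:
  - name: f1_score
    value: 2 * precision * recall / (precision + recall)
```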

RunEvalOptions

RunEvalOptions is the per-row execution context passed into derived metric callbacks.

```ts
interface RunEvalOptions {
  provider: ApiProvider;
  prompt: Prompt;
  delay: number;
  test: AtomicTestCase;
  testSuite?: TestSuite;
  nunjucksFilters?: Record<string, (...args: any[]) => string>;
  evaluateOptions?: EvaluateOptions;
  testIdx: number;
  promptIdx: number;
  repeatIndex: number;
  conversations?: Record<
    string,
    { prompt: string | object; input: string; output: string | object; metadata?: object }[]
  >;
  registers?: Record<string, VarValue>;
  isRedteam: boolean;
  concurrency?: number;
  evalId?: string;
  abortSignal?: AbortSignal;
}
```

Prompt

A Prompt is what it sounds like. When specifying a prompt object in a static config, it should look like this:

```ts
type PromptConfigObject =
  | {
      id: string; // Path, usually prefixed with file://
      label?: string; // How to display it in outputs and web UI
      raw?: string; // Optional inline prompt text
    }
  | {
      raw: string; // Inline prompt text
      label: string; // How to display it in outputs and web UI
      id?: string;
      template?: string;
      display?: string; // Deprecated: use label
      function?: PromptFunction;
      config?: any; // Provider config merged for this prompt
    };
```
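
In YAML, both shapes look like this (the file path and labels are illustrative):

```yaml
prompts:
  - id: file://prompts/chat_prompt.json
    label: chat-v1
  - raw: 'Summarize the following text: {{text}}'
    label: summarize-inline
```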

When passing a Prompt object directly to the JavaScript library:

```ts
interface Prompt {
  // The actual prompt
  raw: string;
  // How it should appear in the UI
  label: string;
  // A function to generate a prompt on a per-input basis. Overrides the raw prompt.
  function?: (context: {
    vars: Record<string, VarValue>;
    provider?: ApiProvider;
  }) => Promise<PromptContent | PromptFunctionResult>;
}
```

TokenUsage

```ts
interface TokenUsage {
  prompt?: number;
  completion?: number;
  cached?: number;
  total?: number;
  numRequests?: number;
  completionDetails?: CompletionTokenDetails;
  assertions?: TokenUsage;
}

interface CompletionTokenDetails {
  reasoning?: number;
  acceptedPrediction?: number;
  rejectedPrediction?: number;
  cacheReadInputTokens?: number;
  cacheCreationInputTokens?: number;
}
```

PromptMetrics

PromptMetrics is passed to EvaluateOptions.progressCallback and stored on completed prompts.

```ts
interface PromptMetrics {
  score: number;
  testPassCount: number;
  testFailCount: number;
  testErrorCount: number;
  assertPassCount: number;
  assertFailCount: number;
  totalLatencyMs: number;
  tokenUsage: TokenUsage;
  namedScores: Record<string, number>;
  namedScoresCount: Record<string, number>;
  namedScoreWeights?: Record<string, number>;
  redteam?: {
    pluginPassCount: Record<string, number>;
    pluginFailCount: Record<string, number>;
    strategyPassCount: Record<string, number>;
    strategyFailCount: Record<string, number>;
  };
  cost: number;
}
```

EvaluateOptions

EvaluateOptions is an object that includes options for how the evaluation should be performed. It includes the maximum concurrency for API calls, whether to show a progress bar, a callback for progress updates, the number of times to repeat each test, and a delay between tests.

```ts
interface EvaluateOptions {
  cache?: boolean;
  delay?: number;
  eventSource?: string;
  generateSuggestions?: boolean;
  /** Deprecated: use maxConcurrency: 1 or -j 1 instead. */
  interactiveProviders?: boolean;
  maxConcurrency?: number;
  repeat?: number;
  showProgressBar?: boolean;
  timeoutMs?: number;
  maxEvalTimeMs?: number;
  isRedteam?: boolean;
  silent?: boolean;
  abortSignal?: AbortSignal;
  progressCallback?: (
    completed: number,
    total: number,
    index: number,
    evalStep: RunEvalOptions,
    metrics: PromptMetrics,
  ) => void;
}
```
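
In a config file, the commonly used subset of these options maps onto the evaluateOptions key shown in the Config table above:

```yaml
evaluateOptions:
  maxConcurrency: 8
  repeat: 2
  delay: 250 # ms to wait after each API call
  cache: true
  timeoutMs: 30000 # per-call timeout
```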

Evaluation outputs

EvaluateTable

EvaluateTable is an object that represents the results of the evaluation in a tabular format. It includes a header with the prompts and variables, and a body with the outputs and variables for each test case.

```ts
interface EvaluateTable {
  head: {
    prompts: CompletedPrompt[];
    vars: string[];
  };
  body: EvaluateTableRow[];
}

interface EvaluateTableRow {
  description?: string;
  outputs: EvaluateTableOutput[];
  vars: string[];
  test: AtomicTestCase;
  testIdx: number;
}
```

EvaluateTableOutput

EvaluateTableOutput is an object that represents the output of a single evaluation in a tabular format. It includes the pass/fail result, score, output text, prompt, latency, token usage, and grading result.

```ts
// 0 = none, 1 = assertion failure, 2 = error
type ResultFailureReason = 0 | 1 | 2;

interface EvaluateTableOutput {
  cost: number;
  failureReason: ResultFailureReason;
  gradingResult?: GradingResult | null;
  id: string;
  latencyMs: number;
  metadata?: Record<string, any>;
  namedScores: Record<string, number>;
  pass: boolean;
  prompt: string;
  provider?: string;
  response?: ProviderResponse;
  score: number;
  testCase: AtomicTestCase;
  text: string;
  tokenUsage?: Partial<TokenUsage>;
  error?: string | null;
  audio?: ProviderResponse['audio'];
  video?: ProviderResponse['video'];
  images?: ImageOutput[];
}
```

EvaluateSummary

EvaluateSummary is an object that represents a summary of the evaluation results. It includes the version of the evaluator, the results of each evaluation, a table of the results, and statistics about the evaluation. The latest version is 3, which removed the table property and added a new prompts property.

```ts
interface EvaluateSummaryV3 {
  version: 3;
  timestamp: string; // ISO 8601 datetime
  results: EvaluateResult[];
  prompts: CompletedPrompt[];
  stats: EvaluateStats;
}

interface EvaluateSummaryV2 {
  version: number;
  timestamp: string; // ISO 8601 datetime
  results: EvaluateResult[];
  table: EvaluateTable;
  stats: EvaluateStats;
}
```

EvaluateStats

EvaluateStats is an object that includes statistics about the evaluation. It includes the number of successful and failed tests, and the total token usage.

```ts
interface EvaluateStats {
  successes: number;
  failures: number;
  errors: number;
  tokenUsage: Required<TokenUsage>;
  durationMs?: number;
  generationDurationMs?: number;
  evaluationDurationMs?: number;
}
```

EvaluateResult

EvaluateResult roughly corresponds to a single "cell" in the grid comparison view. It includes information on the provider, prompt, and other inputs, as well as the outputs.

```ts
interface EvaluateResult {
  id?: string;
  description?: string;
  promptIdx: number;
  testIdx: number;
  testCase: AtomicTestCase;
  promptId: string;
  provider: Pick<ProviderOptions, 'id' | 'label'>;
  prompt: Prompt;
  vars: Record<string, VarValue>;
  response?: ProviderResponse;
  error?: string | null;
  failureReason: ResultFailureReason;
  success: boolean;
  score: number;
  latencyMs: number;
  gradingResult?: GradingResult | null;
  namedScores: Record<string, number>;
  cost?: number;
  metadata?: Record<string, any>;
  tokenUsage?: Required<TokenUsage>;
}
```

GradingResult

GradingResult is an object that represents the result of grading a test case. It includes whether the test case passed, the score, the reason for the result, the tokens used, and the results of any component assertions.

```ts
interface ResultSuggestion {
  type: string;
  action: 'replace-prompt' | 'pre-filter' | 'post-filter' | 'note';
  value: string;
}

interface GradingResult {
  pass: boolean; // did test pass?
  score: number; // score between 0 and 1
  reason: string; // plaintext reason for outcome
  namedScores?: Record<string, number>; // labeled metrics attached to this result
  namedScoreWeights?: Record<string, number>; // weighted denominator for namedScores
  tokensUsed?: TokenUsage; // tokens consumed by the test
  componentResults?: GradingResult[]; // nested component results
  assertion?: Assertion; // source assertion
  comment?: string; // user comment
  suggestions?: ResultSuggestion[]; // suggested follow-up actions
  metadata?: {
    pluginId?: string;
    strategyId?: string;
    context?: string | string[];
    contextUnits?: string[];
    renderedAssertionValue?: string;
    renderedGradingPrompt?: string;
    graderError?: true;
    [key: string]: any;
  };
}
```
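
Custom JavaScript or Python assertions may return a GradingResult-shaped object instead of a bare boolean or score; a minimal sketch:

```yaml
assert:
  - type: javascript
    value: |
      const mentionsRefund = output.includes('refund');
      return {
        pass: mentionsRefund,
        score: mentionsRefund ? 1 : 0,
        reason: mentionsRefund ? 'mentions refund' : 'missing refund policy',
      };
```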

CompletedPrompt

CompletedPrompt is an object that represents a prompt that has been evaluated. It includes the raw prompt, the provider, metrics, and other information.

```ts
interface CompletedPrompt {
  id?: string;
  raw: string;
  template?: string;
  display?: string;
  label: string;
  function?: PromptFunction;

  // These config options are merged into the provider config.
  config?: any;
  provider: string;
  metrics?: {
    score: number;
    testPassCount: number;
    testFailCount: number;
    testErrorCount: number;
    assertPassCount: number;
    assertFailCount: number;
    totalLatencyMs: number;
    tokenUsage: TokenUsage;
    namedScores: Record<string, number>;
    namedScoresCount: Record<string, number>;
    namedScoreWeights?: Record<string, number>;
    redteam?: {
      pluginPassCount: Record<string, number>;
      pluginFailCount: Record<string, number>;
      strategyPassCount: Record<string, number>;
      strategyFailCount: Record<string, number>;
    };
    cost: number;
  };
}
```