Reference
Here is the main structure of the promptfoo configuration file:
Config
Property | Type | Required | Description |
---|---|---|---|
description | string | No | Optional description of what your LLM is trying to do |
tags | Record<string, string> | No | Optional tags to describe the test suite (e.g. env: production, application: chatbot) |
providers | string \| string[] \| Record<string, ProviderOptions> \| ProviderOptions[] | Yes | One or more LLM APIs to use |
prompts | string \| string[] | Yes | One or more prompts to load |
tests | string \| Test Case[] | Yes | Path to a test file, OR list of LLM prompt variations (aka "test case") |
defaultTest | string \| Partial Test Case | No | Sets the default properties for each test case. Can be an inline object or a file:// path to an external YAML/JSON file. |
outputPath | string | No | Where to write output. Writes to console/web viewer if not set. |
evaluateOptions.maxConcurrency | number | No | Maximum number of concurrent requests. Defaults to 4 |
evaluateOptions.repeat | number | No | Number of times to run each test case. Defaults to 1 |
evaluateOptions.delay | number | No | Force the test runner to wait after each API call (milliseconds) |
evaluateOptions.showProgressBar | boolean | No | Whether to display the progress bar |
extensions | string[] | No | List of extension files to load. Each extension is a file path with a function name. Can be Python (.py) or JavaScript (.js) files. Supported hooks are 'beforeAll', 'afterAll', 'beforeEach', 'afterEach'. |
env | Record<string, string \| number \| boolean> | No | Environment variables to set for the test run. These values will override existing environment variables. Can be used to set API keys and other configuration values needed by providers. |
commandLineOptions | CommandLineOptions | No | Default values for command-line options. These values will be used unless overridden by actual command-line arguments. |
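
For example, a minimal configuration using these top-level properties might look like the following sketch (file paths and provider names are illustrative):

```yaml
# promptfooconfig.yaml (illustrative)
description: Translation quality eval
providers:
  - openai:gpt-4.1-mini
prompts:
  - file://prompts/translate.txt
tests: file://tests.csv
# defaultTest can be an inline object or an external file
defaultTest: file://default_test.yaml
```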
Test Case
A test case represents a single example input that is fed into all prompts and providers.
Property | Type | Required | Description |
---|---|---|---|
description | string | No | Description of what you're testing |
vars | Record<string, string \| string[] \| object \| any> \| string | No | Key-value pairs to substitute in the prompt. If vars is a plain string, it will be treated as a YAML filepath to load a var mapping from. |
provider | string \| ProviderOptions \| ApiProvider | No | Override the default provider for this specific test case |
assert | Assertion[] | No | List of automatic checks to run on the LLM output |
threshold | number | No | Test will fail if the combined score of assertions is less than this number |
metadata | Record<string, string \| string[] \| any> | No | Additional metadata to include with the test case, useful for filtering or grouping results |
options | Object | No | Additional configuration settings for the test case |
options.transformVars | string | No | A filepath (js or py) or JavaScript snippet that runs on the vars before they are substituted into the prompt |
options.transform | string | No | A filepath (js or py) or JavaScript snippet that runs on LLM output before any assertions |
options.prefix | string | No | Text to prepend to the prompt |
options.suffix | string | No | Text to append to the prompt |
options.provider | string | No | The API provider to use for LLM rubric grading |
options.runSerially | boolean | No | If true, run this test case without concurrency regardless of global settings |
options.storeOutputAs | string | No | The output of this test will be stored as a variable, which can be used in subsequent tests |
options.rubricPrompt | string \| string[] | No | Model-graded LLM prompt |
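
For example, a single test case combining several of these properties might look like this (values are illustrative):

```yaml
tests:
  - description: Simple translation check
    vars:
      language: French
      input: Hello world
    assert:
      - type: contains
        value: Bonjour
    options:
      transform: 'output.trim()' # post-process the output before assertions run
```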
Assertion
More details on using assertions, including examples, are available in the assertions documentation.
Property | Type | Required | Description |
---|---|---|---|
type | string | Yes | Type of assertion |
value | string | No | The expected value, if applicable |
threshold | number | No | The threshold value, applicable only to certain types such as similar, cost, javascript, python |
provider | string | No | Some assertions (type = similar, llm-rubric, model-graded-*) require an LLM provider |
metric | string | No | The label for this result. Assertions with the same metric will be aggregated together |
contextTransform | string | No | JavaScript expression to dynamically construct context for context-based assertions. See Context Transform for more details. |
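
For example, an assertion list that mixes a deterministic check with a similarity check might look like this (values and the embedding provider id are illustrative):

```yaml
assert:
  - type: contains
    value: Bonjour
    metric: exact-match
  - type: similar
    value: Bonjour le monde
    threshold: 0.8
    provider: openai:embedding:text-embedding-3-small # illustrative embedding provider
    metric: semantic-match
```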
CommandLineOptions
Set default values for command-line options. These defaults will be used unless overridden by command-line arguments.
Property | Type | Description |
---|---|---|
Basic Configuration | ||
description | string | Description of what your LLM is trying to do |
config | string[] | Path(s) to configuration files |
envPath | string | Path to .env file to load environment variables from |
Input Files | ||
prompts | string[] | One or more paths to prompt files |
providers | string[] | One or more LLM provider identifiers |
tests | string | Path to CSV file with test cases |
vars | string | Path to CSV file with test variables |
assertions | string | Path to assertions file |
modelOutputs | string | Path to JSON file containing model outputs |
Prompt Modifications | ||
promptPrefix | string | Text to prepend to every prompt |
promptSuffix | string | Text to append to every prompt |
generateSuggestions | boolean | Generate new prompts and append them to the prompt list |
Test Execution | ||
maxConcurrency | number | Maximum number of concurrent requests |
repeat | number | Number of times to run each test case |
delay | number | Delay between API calls in milliseconds |
grader | string | Model that will grade outputs |
var | object | Set test variables as key-value pairs (e.g. {key1: 'value1', key2: 'value2'}) |
Filtering | ||
filterPattern | string | Only run tests whose description matches the regular expression pattern |
filterProviders | string | Only run tests with providers matching this regex |
filterTargets | string | Only run tests with targets matching this regex (alias for filterProviders) |
filterFirstN | number | Only run the first N test cases |
filterSample | number | Run a random sample of N test cases |
filterMetadata | string | Only run tests matching metadata filter (JSON format) |
filterErrorsOnly | string | Only run tests that resulted in errors (expects previous output path) |
filterFailing | string | Only run tests that failed assertions (expects previous output path) |
Output & Display | ||
output | string[] | Output file paths (csv, txt, json, yaml, yml, html) |
table | boolean | Show output table (default: true, disable with --no-table) |
tableCellMaxLength | number | Maximum length of table cells in console output |
progressBar | boolean | Whether to display progress bar during evaluation |
verbose | boolean | Enable verbose output |
share | boolean | Whether to create a shareable URL |
Caching & Storage | ||
cache | boolean | Whether to use disk cache for results (default: true) |
write | boolean | Whether to write results to promptfoo directory (default: true) |
Other Options | ||
watch | boolean | Whether to watch for config changes and re-run automatically |
Example
```yaml
# yaml-language-server: $schema=https://promptfoo.dev/config-schema.json
prompts:
  - prompt1.txt
  - prompt2.txt
providers:
  - openai:gpt-4
tests: tests.csv

# Set default command-line options
commandLineOptions:
  maxConcurrency: 10
  repeat: 3
  delay: 1000
  verbose: true
  grader: openai:gpt-4o-mini
  table: true
  cache: false
  tableCellMaxLength: 100

  # Filtering options
  filterPattern: 'auth.*' # Only run tests with 'auth' in description
  filterProviders: 'openai.*' # Only test OpenAI providers
  filterSample: 50 # Random sample of 50 tests

  # Prompt modifications
  promptPrefix: 'You are a helpful assistant. '
  promptSuffix: "\n\nPlease be concise."

  # Variables
  var:
    temperature: '0.7'
    max_tokens: '1000'
```
With this configuration, running npx promptfoo eval will use these defaults. You can still override them on the command line:
```sh
# Uses maxConcurrency: 10 from config
npx promptfoo eval

# Overrides maxConcurrency to 5
npx promptfoo eval --max-concurrency 5
```
AssertionValueFunctionContext
When using JavaScript or Python assertions, your function receives a context object with the following interface:
```ts
interface AssertionValueFunctionContext {
  // Raw prompt sent to LLM
  prompt: string | undefined;
  // Test case variables
  vars: Record<string, string | object>;
  // The complete test case
  test: AtomicTestCase;
  // Log probabilities from the LLM response, if available
  logProbs: number[] | undefined;
  // Configuration passed to the assertion
  config?: Record<string, any>;
  // The provider that generated the response
  provider: ApiProvider | undefined;
  // The complete provider response
  providerResponse: ProviderResponse | undefined;
}
```
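
For example, an inline javascript assertion can use this context to compare the output against test variables (a minimal sketch; the variable name is illustrative):

```yaml
assert:
  - type: javascript
    # `output` is the raw LLM output; `context` matches the interface above
    value: 'output.includes(context.vars.topic)'
```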
promptfoo supports .js and .json file extensions in addition to .yaml. It automatically loads promptfooconfig.*, but you can use a custom config file with promptfoo eval -c path/to/config.
Extension Hooks
Promptfoo supports extension hooks that allow you to run custom code that modifies the evaluation state at specific points in the evaluation lifecycle. These hooks are defined in extension files specified in the extensions property of the configuration.
Available Hooks
Name | Description | Context |
---|---|---|
beforeAll | Runs before the entire test suite begins | { suite: TestSuite } |
afterAll | Runs after the entire test suite has finished | { results: EvaluateResult[], suite: TestSuite } |
beforeEach | Runs before each individual test | { test: TestCase } |
afterEach | Runs after each individual test | { test: TestCase, result: EvaluateResult } |
Session Management in Hooks
For multi-turn conversations or stateful interactions, the sessionId is made available in the afterEach hook context at:
```js
context.result.metadata.sessionId;
```
This sessionId comes from either:
- The provider's response (response.sessionId), which takes priority
- The test variables (vars.sessionId), used as a fallback for client-generated session IDs
Note: The provider's response.sessionId takes precedence over vars.sessionId. The session ID from vars is only used if the provider doesn't return one.
Example usage in an extension:
```js
async function extensionHook(hookName, context) {
  if (hookName === 'afterEach') {
    const sessionId = context.result.metadata.sessionId;
    if (sessionId) {
      console.log(`Test completed with session: ${sessionId}`);
      // You can use this sessionId for tracking, logging, or cleanup
    }
  }
}
```
For iterative red team strategies (e.g., jailbreak, tree search), the sessionIds array is made available in the afterEach hook context at:
```js
context.result.metadata.sessionIds;
```
This is an array containing all session IDs from the iterative exploration process. Each iteration may have its own session ID, allowing you to track the full conversation history across multiple attempts.
Example usage for iterative providers:
```js
async function extensionHook(hookName, context) {
  if (hookName === 'afterEach') {
    // For regular providers - single session ID
    const sessionId = context.result.metadata.sessionId;

    // For iterative providers (jailbreak, tree search) - array of session IDs
    const sessionIds = context.result.metadata.sessionIds;

    if (sessionIds && Array.isArray(sessionIds)) {
      console.log(`Jailbreak completed with ${sessionIds.length} iterations`);
      sessionIds.forEach((id, index) => {
        console.log(`  Iteration ${index + 1}: session ${id}`);
      });
      // You can use these sessionIds for detailed tracking of the attack path
    }
  }
}
```
Note: The sessionIds array only contains defined session IDs - any iterations without a session ID are filtered out.
Implementing Hooks
To implement these hooks, create a JavaScript or Python file with a function that handles the hooks you want to use. Then, specify the path to this file and the function name in the extensions array in your configuration.
All extensions receive all event types (beforeAll, afterAll, beforeEach, afterEach). It's up to the extension function to decide which events to handle based on the hookName parameter.
Example configuration:
```yaml
extensions:
  - file://path/to/your/extension.js:extensionHook
  - file://path/to/your/extension.py:extension_hook
```
When specifying an extension in the configuration, you must include the function name after the file path, separated by a colon (:). This tells promptfoo which function to call in the extension file.
Python example extension file:
```python
from typing import Optional

def extension_hook(hook_name, context) -> Optional[dict]:
    # Perform any necessary setup
    if hook_name == 'beforeAll':
        print(f"Setting up test suite: {context['suite'].get('description', '')}")

        # Add an additional test case to the suite:
        context["suite"]["tests"].append(
            {
                "vars": {
                    "body": "It's a beautiful day",
                    "language": "Spanish",
                },
                "assert": [{"type": "contains", "value": "Es un día hermoso."}],
            }
        )

        # Add an additional default assertion to the suite:
        context["suite"]["defaultTest"]["assert"].append({"type": "is-json"})

        return context

    # Perform any necessary teardown or reporting
    elif hook_name == 'afterAll':
        print(f"Test suite completed: {context['suite'].get('description', '')}")
        print(f"Total tests: {len(context['results'])}")

    # Prepare for individual test
    elif hook_name == 'beforeEach':
        print(f"Running test: {context['test'].get('description', '')}")

        # Change all languages to pirate-dialect
        context["test"]["vars"]["language"] = f'Pirate {context["test"]["vars"]["language"]}'

        return context

    # Clean up after individual test or log results
    elif hook_name == 'afterEach':
        print(f"Test completed: {context['test'].get('description', '')}. Pass: {context['result'].get('success', False)}")
```
JavaScript example extension file:
```js
async function extensionHook(hookName, context) {
  // Perform any necessary setup
  if (hookName === 'beforeAll') {
    console.log(`Setting up test suite: ${context.suite.description || ''}`);

    // Add an additional test case to the suite:
    context.suite.tests.push({
      vars: {
        body: "It's a beautiful day",
        language: 'Spanish',
      },
      assert: [{ type: 'contains', value: 'Es un día hermoso.' }],
    });

    return context;
  }

  // Perform any necessary teardown or reporting
  else if (hookName === 'afterAll') {
    console.log(`Test suite completed: ${context.suite.description || ''}`);
    console.log(`Total tests: ${context.results.length}`);
  }

  // Prepare for individual test
  else if (hookName === 'beforeEach') {
    console.log(`Running test: ${context.test.description || ''}`);

    // Change all languages to pirate-dialect
    context.test.vars.language = `Pirate ${context.test.vars.language}`;

    return context;
  }

  // Clean up after individual test or log results
  else if (hookName === 'afterEach') {
    console.log(
      `Test completed: ${context.test.description || ''}. Pass: ${context.result.success || false}`,
    );
  }
}

module.exports = extensionHook;
```
These hooks provide powerful extensibility to your promptfoo evaluations, allowing you to implement custom logic for setup, teardown, logging, or integration with other systems. The extension function receives the hookName and a context object, which contains relevant data for each hook type. You can use this information to perform actions specific to each stage of the evaluation process.
The beforeAll and beforeEach hooks may mutate specific properties of their respective context arguments in order to modify evaluation state. To persist these changes, the hook must return the modified context.
beforeAll
Property | Type | Description |
---|---|---|
context.suite.prompts | Prompt[] | The prompts to be evaluated. |
context.suite.providerPromptMap | Record<string, Prompt[]> | A map of provider IDs to prompts. |
context.suite.tests | TestCase[] | The test cases to be evaluated. |
context.suite.scenarios | Scenario[] | The scenarios to be evaluated. |
context.suite.defaultTest | TestCase | The default test case to be evaluated. |
context.suite.nunjucksFilters | Record<string, FilePath> | A map of Nunjucks filters. |
context.suite.derivedMetrics | Record<string, string> | A map of derived metrics. |
context.suite.redteam | Redteam[] | The red team to be evaluated. |
beforeEach
Property | Type | Description |
---|---|---|
context.test | TestCase | The test case to be evaluated. |
Provider-related types
Guardrails
GuardrailResponse is an object that represents the guardrail response from a provider. It includes flags indicating whether the prompt or output failed guardrails.
```ts
interface GuardrailResponse {
  flagged?: boolean;
  flaggedInput?: boolean;
  flaggedOutput?: boolean;
}
```
ProviderFunction
A ProviderFunction is a function that takes a prompt as an argument and returns a Promise that resolves to a ProviderResponse. It allows you to define custom logic for calling an API.
```ts
type ProviderFunction = (
  prompt: string,
  context: { vars: Record<string, string | object> },
) => Promise<ProviderResponse>;
```
ProviderOptions
ProviderOptions is an object that includes the id of the provider and an optional config object that can be used to pass provider-specific configurations.
```ts
interface ProviderOptions {
  id?: ProviderId;
  config?: any;

  // A label is required when running a red team
  // It can be used to uniquely identify targets even if the provider id changes.
  label?: string;

  // List of prompt display strings
  prompts?: string[];

  // Transform the output, either with inline JavaScript or external py/js script (see `Transforms`)
  transform?: string;

  // Sleep this long before each request
  delay?: number;
}
```
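
For example, a provider entry in YAML might use these options as follows (the provider id and values are illustrative):

```yaml
providers:
  - id: openai:gpt-4.1-mini
    label: chatbot-target # stable identifier, useful for red teaming
    delay: 500 # milliseconds to sleep before each request
    transform: 'output.trim()' # post-process the raw output
    config:
      temperature: 0.2 # provider-specific setting (illustrative)
```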
ProviderResponse
ProviderResponse is an object that represents the response from a provider. It includes the output from the provider, any error that occurred, information about token usage, and a flag indicating whether the response was cached.
```ts
interface ProviderResponse {
  error?: string;
  output?: string | object;
  metadata?: object;
  tokenUsage?: Partial<{
    total: number;
    prompt: number;
    completion: number;
    cached?: number;
  }>;
  cached?: boolean;
  cost?: number; // required for cost assertion
  logProbs?: number[]; // required for perplexity assertion
  isRefusal?: boolean; // the provider has explicitly refused to generate a response
  guardrails?: GuardrailResponse;
}
```
ProviderEmbeddingResponse
ProviderEmbeddingResponse is an object that represents the response from a provider's embedding API. It includes the embedding from the provider, any error that occurred, and information about token usage.
```ts
interface ProviderEmbeddingResponse {
  error?: string;
  embedding?: number[];
  tokenUsage?: Partial<TokenUsage>;
}
```
Evaluation inputs
TestSuiteConfiguration
```ts
interface TestSuiteConfig {
  // Optional description of what you're trying to test
  description?: string;

  // One or more LLM APIs to use, for example: openai:gpt-4.1-mini, openai:gpt-4.1, localai:chat:vicuna
  providers: ProviderId | ProviderFunction | (ProviderId | ProviderOptionsMap | ProviderOptions)[];

  // One or more prompts
  prompts: (FilePath | Prompt | PromptFunction)[];

  // Path to a test file, OR list of LLM prompt variations (aka "test case")
  tests: FilePath | (FilePath | TestCase)[];

  // Scenarios, groupings of data and tests to be evaluated
  scenarios?: Scenario[];

  // Sets the default properties for each test case. Useful for setting an assertion, on all test cases, for example.
  defaultTest?: Omit<TestCase, 'description'>;

  // Path to write output. Writes to console/web viewer if not set.
  outputPath?: FilePath | FilePath[];

  // Determines whether or not sharing is enabled.
  sharing?:
    | boolean
    | {
        apiBaseUrl?: string;
        appBaseUrl?: string;
      };

  // Nunjucks filters
  nunjucksFilters?: Record<string, FilePath>;

  // Envar overrides
  env?: EnvOverrides;

  // Whether to write latest results to promptfoo storage. This enables you to use the web viewer.
  writeLatestResults?: boolean;
}
```
UnifiedConfig
UnifiedConfig is an object that includes the test suite configuration, evaluation options, and command line options. It is used to hold the complete configuration for the evaluation.
```ts
interface UnifiedConfig extends TestSuiteConfiguration {
  evaluateOptions: EvaluateOptions;
  commandLineOptions: Partial<CommandLineOptions>;
}
```
Scenario
Scenario is an object that represents a group of test cases to be evaluated. It includes a description, default test case configuration, and a list of test cases.
```ts
interface Scenario {
  description?: string;
  config: Partial<TestCase>[];
  tests: TestCase[];
}
```
Also, see the scenarios documentation for descriptions of these properties.
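
For example, a scenario that runs each test once per config entry might look like this (values are illustrative):

```yaml
scenarios:
  - description: Greetings in multiple languages
    config:
      - vars:
          language: Spanish
      - vars:
          language: French
    tests:
      - vars:
          input: Hello world
      - vars:
          input: Good morning
```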
Prompt
A Prompt is what it sounds like. When specifying a prompt object in a static config, it should look like this:
```ts
interface Prompt {
  id: string; // Path, usually prefixed with file://
  label: string; // How to display it in outputs and web UI
}
```
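
In YAML, that might look like this (the path and label are illustrative):

```yaml
prompts:
  - id: file://prompts/chat_prompt.json
    label: chat-prompt-v1
```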
When passing a Prompt object directly to the JavaScript library:
```ts
interface Prompt {
  // The actual prompt
  raw: string;

  // How it should appear in the UI
  label: string;

  // A function to generate a prompt on a per-input basis. Overrides the raw prompt.
  function?: (context: {
    vars: Record<string, string | object>;
    config?: Record<string, any>;
    provider?: ApiProvider;
  }) => Promise<string | object>;
}
```
EvaluateOptions
EvaluateOptions is an object that includes options for how the evaluation should be performed. It includes the maximum concurrency for API calls, whether to show a progress bar, a callback for progress updates, the number of times to repeat each test, and a delay between tests.
```ts
interface EvaluateOptions {
  maxConcurrency?: number;
  showProgressBar?: boolean;
  progressCallback?: (progress: number, total: number) => void;
  generateSuggestions?: boolean;
  repeat?: number;
  delay?: number;
}
```
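
For example, in YAML (values are illustrative):

```yaml
evaluateOptions:
  maxConcurrency: 8
  repeat: 2
  delay: 250
  showProgressBar: false
```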
Evaluation outputs
EvaluateTable
EvaluateTable is an object that represents the results of the evaluation in a tabular format. It includes a header with the prompts and variables, and a body with the outputs and variables for each test case.
```ts
interface EvaluateTable {
  head: {
    prompts: Prompt[];
    vars: string[];
  };
  body: {
    outputs: EvaluateTableOutput[];
    vars: string[];
  }[];
}
```
EvaluateTableOutput
EvaluateTableOutput is an object that represents the output of a single evaluation in a tabular format. It includes the pass/fail result, score, output text, prompt, latency, token usage, and grading result.
```ts
interface EvaluateTableOutput {
  pass: boolean;
  score: number;
  text: string;
  prompt: string;
  latencyMs: number;
  tokenUsage?: Partial<TokenUsage>;
  gradingResult?: GradingResult;
}
```
EvaluateSummary
EvaluateSummary is an object that represents a summary of the evaluation results. It includes the version of the evaluator, the results of each evaluation, a table of the results, and statistics about the evaluation. The latest version is 3, which removed the table property and added a new prompts property.
```ts
interface EvaluateSummaryV3 {
  version: 3;
  timestamp: string; // ISO 8601 datetime
  results: EvaluateResult[];
  prompts: CompletedPrompt[];
  stats: EvaluateStats;
}

interface EvaluateSummaryV2 {
  version: 2;
  timestamp: string; // ISO 8601 datetime
  results: EvaluateResult[];
  table: EvaluateTable;
  stats: EvaluateStats;
}
```
EvaluateStats
EvaluateStats is an object that includes statistics about the evaluation. It includes the number of successful and failed tests, and the total token usage.
```ts
interface EvaluateStats {
  successes: number;
  failures: number;
  tokenUsage: Required<TokenUsage>;
}
```
EvaluateResult
EvaluateResult roughly corresponds to a single "cell" in the grid comparison view. It includes information on the provider, prompt, and other inputs, as well as the outputs.
```ts
interface EvaluateResult {
  provider: Pick<ProviderOptions, 'id'>;
  prompt: Prompt;
  vars: Record<string, string | object>;
  response?: ProviderResponse;
  error?: string;
  success: boolean;
  score: number;
  latencyMs: number;
  gradingResult?: GradingResult;
  metadata?: Record<string, any>;
}
```
GradingResult
GradingResult is an object that represents the result of grading a test case. It includes whether the test case passed, the score, the reason for the result, the tokens used, and the results of any component assertions.
```ts
interface GradingResult {
  pass: boolean; // did test pass?
  score: number; // score between 0 and 1
  reason: string; // plaintext reason for outcome
  tokensUsed?: TokenUsage; // tokens consumed by the test
  componentResults?: GradingResult[]; // if this is a composite score, it can have nested results
  assertion: Assertion | null; // source of assertion
  latencyMs?: number; // latency of LLM call
}
```
CompletedPrompt
CompletedPrompt is an object that represents a prompt that has been evaluated. It includes the raw prompt, the provider, metrics, and other information.
```ts
interface CompletedPrompt {
  id?: string;
  raw: string;
  label: string;
  function?: PromptFunction;

  // These config options are merged into the provider config.
  config?: any;
  provider: string;
  metrics?: {
    score: number;
    testPassCount: number;
    testFailCount: number;
    assertPassCount: number;
    assertFailCount: number;
    totalLatencyMs: number;
    tokenUsage: TokenUsage;
    namedScores: Record<string, number>;
    namedScoresCount: Record<string, number>;
    redteam?: {
      pluginPassCount: Record<string, number>;
      pluginFailCount: Record<string, number>;
      strategyPassCount: Record<string, number>;
      strategyFailCount: Record<string, number>;
    };
    cost: number;
  };
}
```