OpenAI Codex App Server
This provider starts codex app-server as a local child process and drives the Codex app-server JSON-RPC protocol from promptfoo. Use it when you need to evaluate the rich client surface of Codex: streamed agent items, approvals, skills, plugins, app connector events, command/file trajectories, and thread lifecycle metadata.
For CI and straightforward automation, prefer the OpenAI Codex SDK provider. The app-server protocol is experimental, broader than the SDK, and designed for rich product integrations.
Provider IDs
providers:
  - openai:codex-app-server
  - openai:codex-app-server:gpt-5.4
  - openai:codex-desktop
  - openai:codex-desktop:gpt-5.4
openai:codex-desktop is an alias for the same app-server protocol. Promptfoo starts its own codex app-server process; it does not attach to an already-running Codex Desktop app process.
Codex SDK vs App Server vs Desktop App
Keep this provider separate from the Codex SDK provider. They share Codex concepts, but they expose different runtime contracts.
| Surface | Best for | Runtime | Promptfoo provider |
|---|---|---|---|
| Codex SDK | CI, automation, simple agentic coding evals | @openai/codex-sdk library | openai:codex-sdk |
| Codex app-server | Rich-client protocol behavior and event evals | Local codex app-server child process over JSON-RPC | openai:codex-app-server / openai:codex-desktop |
| Codex Desktop app | Interactive human work in the desktop product | Native app process and UI | Not attached directly |
Use this provider when the behavior under test depends on app-server-only features such as approval request payloads, streamed item notifications, app connector events, plugin/skill metadata, or thread lifecycle operations. Use the SDK provider when you only need final Codex output, thread reuse, structured output, and traced shell/MCP/search/file steps.
What Promptfoo Can and Can't Evaluate
| Eval surface | Supported? | Notes |
|---|---|---|
| Final assistant text | Yes | Returned in response.output as a string. |
| Text, image, local image, skill, mention inputs | Yes | Pass plain text or a JSON array of supported app-server input items. |
| JSON schema output | Yes | Pass output_schema; assert with is-json or parse output yourself. |
| Token usage and estimated cost | Yes | Token usage is read from thread/tokenUsage/updated; cost needs a known model id. |
| Thread IDs and turn IDs | Yes | Available under sessionId and metadata.codexAppServer. |
| Approval, permission, MCP, and tool requests | Yes | server_request_policy gives deterministic responses for non-interactive evals. |
| Streamed item metadata | Yes | Command, file, MCP, dynamic tool, web search, reasoning, and agent-message items are normalized. |
| Deep app-server tracing | Yes | Enable deep_tracing to inject OTEL env vars into a fresh app-server process per row. |
| Live partial output in assertions | No | Promptfoo receives the final provider response after the turn completes. |
| Attaching to an existing Desktop app | No | Promptfoo owns a separate app-server child process. |
| WebSocket transport | No | The provider uses stdio; app-server WebSocket mode remains experimental upstream. |
Setup
Install the Codex CLI and sign in:
npm i -g @openai/codex
codex
You can also authenticate with an API key:
export OPENAI_API_KEY=your_api_key_here
Promptfoo also accepts CODEX_API_KEY or config.apiKey. For reproducible evals, prefer API-key-backed runs or set cli_env.CODEX_HOME to a fixture home directory that already contains the intended Codex login state.
Basic Usage
providers:
  - id: openai:codex-app-server:gpt-5.4
    config:
      sandbox_mode: read-only
      approval_policy: never
prompts:
  - 'Review this repository and summarize the highest-risk code paths.'
The provider returns Codex's final assistant text as output. It also records thread IDs, turn IDs, item counts, command/file/tool metadata, approval decisions, and token usage under metadata.codexAppServer.
Safety Defaults
The app-server protocol can expose shell, filesystem, config, plugin, MCP, and app connector surfaces. Promptfoo defaults to deterministic eval behavior:
| Option | Default |
|---|---|
| sandbox_mode | read-only |
| approval_policy | never |
| ephemeral | true |
| thread_cleanup | unsubscribe |
| reuse_server | true |
| inherit_process_env | false |
Approval requests are answered without blocking:
| Request type | Default response |
|---|---|
| item/commandExecution/requestApproval | decline |
| item/fileChange/requestApproval | decline |
| item/permissions/requestApproval | empty grant |
| item/tool/requestUserInput | empty answers |
| mcpServer/elicitation/request | decline |
| item/tool/call | failed static response |
Use accept, acceptForSession, permission grants, or MCP elicitation acceptance only in isolated workspaces where side effects are acceptable.
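The default responses above can be pictured as a lookup from request method to reply. The following is a minimal sketch, not the provider's implementation: DEFAULT_RESPONSES and defaultResponseFor are hypothetical names, and the values are the labels from the table rather than the actual JSON-RPC payloads sent to app-server.

```javascript
// Hypothetical lookup that mirrors the default-response table above.
// Values are human-readable labels, not real app-server payloads.
const DEFAULT_RESPONSES = Object.freeze({
  'item/commandExecution/requestApproval': 'decline',
  'item/fileChange/requestApproval': 'decline',
  'item/permissions/requestApproval': 'empty grant',
  'item/tool/requestUserInput': 'empty answers',
  'mcpServer/elicitation/request': 'decline',
  'item/tool/call': 'failed static response',
});

// Returns the label for a request method, or undefined for methods
// that the table does not list.
function defaultResponseFor(method) {
  return Object.prototype.hasOwnProperty.call(DEFAULT_RESPONSES, method)
    ? DEFAULT_RESPONSES[method]
    : undefined;
}

console.log(defaultResponseFor('item/fileChange/requestApproval')); // decline
```

Every entry is a refusal or an empty reply, which is why a run with default settings cannot execute commands or change files through an approval prompt.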
Configuration
The provider validates top-level provider config strictly. Prompt-level config is parsed more leniently because promptfoo merges generic test options into prompt.config; unrelated keys are ignored there, while invalid values for known Codex fields still return a row-level provider error.
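The difference between strict and lenient parsing can be illustrated with a small validator. This is a sketch under stated assumptions: validateConfig and KNOWN_FIELDS are hypothetical, only three fields are modeled, and the real provider's schema and error messages will differ.

```javascript
// Hypothetical subset of known Codex fields with simple value checks.
const KNOWN_FIELDS = {
  sandbox_mode: (v) => ['read-only', 'workspace-write', 'danger-full-access'].includes(v),
  ephemeral: (v) => typeof v === 'boolean',
  request_timeout_ms: (v) => typeof v === 'number' && Number.isFinite(v),
};

// strict: true models top-level provider config (unknown keys are errors).
// strict: false models prompt.config (unknown keys are silently dropped).
// Invalid values for known fields are errors in both modes.
function validateConfig(config, { strict }) {
  const value = {};
  const errors = [];
  for (const [key, raw] of Object.entries(config)) {
    const isKnown = Object.prototype.hasOwnProperty.call(KNOWN_FIELDS, key);
    if (!isKnown) {
      if (strict) errors.push(`Unknown config key: ${key}`);
      continue;
    }
    if (!KNOWN_FIELDS[key](raw)) {
      errors.push(`Invalid value for ${key}: ${JSON.stringify(raw)}`);
      continue;
    }
    value[key] = raw;
  }
  return { value, errors };
}

// "temperature" stands in for a generic test option merged into prompt.config.
const mixed = { sandbox_mode: 'read-only', temperature: 0.2 };
console.log(validateConfig(mixed, { strict: true }).errors); // one error for the unknown key
console.log(validateConfig(mixed, { strict: false }).errors); // no errors
```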
| Parameter | Type | Description | Default |
|---|---|---|---|
| apiKey | string | OpenAI API key. Optional when Codex is already signed in. | Environment variable |
| base_url | string | Custom OpenAI-compatible base URL. Also passed as OPENAI_BASE_URL and OPENAI_API_BASE_URL. | None |
| working_dir | string | Directory Codex operates in. | Current process dir |
| additional_directories | string[] | Additional directories added to workspace-write sandbox roots. | None |
| skip_git_repo_check | boolean | Skip the default Git repository safety check. | false |
| codex_path_override | string | Path to a specific codex binary. | codex |
| model | string | Model id, such as gpt-5.4. Can also be set in the provider id. | Codex default |
| model_provider | string | App-server model provider override for thread/start and thread/resume. | None |
| service_tier | string | fast or flex. | App-server default |
| sandbox_mode | string | read-only, workspace-write, or danger-full-access. | read-only |
| sandbox_policy | object | Raw app-server sandbox policy override for turn/start. | Generated from mode |
| network_access_enabled | boolean | Adds network access to generated sandbox policies. | false |
| approval_policy | string/object | never, on-request, on-failure, untrusted, or granular approval policy object. | never |
| approvals_reviewer | string | user or guardian_subagent. | App-server default |
| model_reasoning_effort | string | none, minimal, low, medium, high, or xhigh. | App-server default |
| reasoning_summary | string | auto, concise, detailed, or none. | App-server default |
| personality | string | none, friendly, or pragmatic. | App-server default |
| base_instructions | string | Base instructions passed to thread/start and thread/resume. | None |
| developer_instructions | string | Developer instructions passed to thread/start and thread/resume. | None |
| collaboration_mode | object | Experimental collaboration mode passed to turn/start. | None |
| output_schema | object | JSON Schema passed to turn/start. | None |
| thread_id | string | Resume an existing Codex thread. | None |
| persist_threads | boolean | Reuse threads across rows with the same prompt template and config. | false |
| thread_pool_size | number | Max cached thread count when persist_threads is enabled. | 1 |
| thread_cleanup | string | unsubscribe, archive, or none for non-persistent threads. Resumed thread_id rows unsubscribe by default; archive is ignored for user-supplied thread IDs. | unsubscribe |
| ephemeral | boolean | Create ephemeral threads by default. | true |
| experimental_raw_events | boolean | Ask app-server to emit raw Responses API items. | false |
| experimental_api | boolean | Opt into experimental app-server protocol fields during initialize. | true |
| include_raw_events | boolean | Include protocol notifications in raw. | false |
| cli_config | object | Extra codex app-server -c key=value config overrides. | None |
| cli_env | object | Extra environment variables for the app-server process. | Minimal shell env |
| inherit_process_env | boolean | Merge the full Node.js environment into the app-server process. | false |
| reuse_server | boolean | Reuse the app-server process across rows. Disabled for deep_tracing. | true |
| deep_tracing | boolean | Inject OTEL env vars into a fresh app-server process per call. | false |
| request_timeout_ms | number | JSON-RPC request timeout. | 30000 |
| startup_timeout_ms | number | initialize timeout. | 30000 |
| turn_timeout_ms | number | Overall turn timeout. | None |
| server_request_policy | object | Deterministic responses for approvals, user input, MCP elicitations, and dynamic tools. | Safe declines |
Granular Approval Policy
providers:
  - id: openai:codex-app-server:gpt-5.4
    config:
      approval_policy:
        granular:
          sandbox_approval: true
          rules: true
          skill_approval: false
          request_permissions: true
          mcp_elicitations: true
Collaboration Mode
providers:
  - id: openai:codex-app-server:gpt-5.4
    config:
      collaboration_mode:
        mode: plan
        settings:
          model: gpt-5.4
          reasoning_effort: none
          developer_instructions: null
collaboration_mode is experimental and is sent on turn/start. App-server may let the selected mode override model, reasoning effort, or developer instructions for the turn.
Server Request Policy
Configure deterministic responses when you intentionally want to exercise app-server approval flows:
providers:
  - id: openai:codex-app-server:gpt-5.4
    config:
      sandbox_mode: workspace-write
      approval_policy: on-request
      server_request_policy:
        command_execution: decline
        file_change: decline
        user_input:
          severity: high
        mcp_elicitation:
          action: accept
          content:
            severity: low
          _meta:
            source: promptfoo
        permissions:
          scope: session
          permissions:
            network:
              enabled: true
            fileSystem:
              read:
                - /tmp/fixture
              write: null
        dynamic_tools:
          classify:
            success: true
            text: '{"label":"safe"}'
For command execution approvals, command_execution may also be an app-server decision object:
server_request_policy:
  command_execution:
    applyNetworkPolicyAmendment:
      network_policy_amendment:
        host: registry.npmjs.org
        action: allow
Legacy execCommandApproval and applyPatchApproval callbacks are also handled for older app-server versions. Advanced command decision objects are only supported on the modern item/commandExecution/requestApproval flow.
Structured Output
providers:
  - id: openai:codex-app-server:gpt-5.4
    config:
      sandbox_mode: read-only
      output_schema:
        type: object
        properties:
          summary:
            type: string
          risks:
            type: array
            items:
              type: string
        required: [summary, risks]
        additionalProperties: false
prompts:
  - 'Return a JSON review summary for this repo.'
tests:
  - assert:
      - type: is-json
The final app-server response is returned as a string. Use is-json or a JavaScript assertion to parse it.
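Because the output arrives as a string, a JavaScript assertion has to parse it before checking fields. The sketch below shows one way to do that for the summary/risks schema above. checkReviewSummary is a hypothetical helper, and the returned pass/score/reason object assumes promptfoo's grading-result shape for JavaScript assertions.

```javascript
// Parses the provider's string output and checks it against the
// summary/risks shape requested by output_schema above.
function checkReviewSummary(output) {
  let parsed;
  try {
    parsed = JSON.parse(output);
  } catch (err) {
    return { pass: false, score: 0, reason: `Output is not valid JSON: ${err.message}` };
  }
  if (parsed === null || typeof parsed !== 'object' || Array.isArray(parsed)) {
    return { pass: false, score: 0, reason: 'Output is not a JSON object' };
  }
  const summaryOk = typeof parsed.summary === 'string';
  const risksOk =
    Array.isArray(parsed.risks) && parsed.risks.every((risk) => typeof risk === 'string');
  // additionalProperties: false means no keys beyond summary and risks.
  const extraKeys = Object.keys(parsed).filter((key) => key !== 'summary' && key !== 'risks');
  const pass = summaryOk && risksOk && extraKeys.length === 0;
  return {
    pass,
    score: pass ? 1 : 0,
    reason: pass ? 'Output matches the review schema' : 'Output does not match the review schema',
  };
}

console.log(checkReviewSummary('{"summary":"ok","risks":["unvalidated input"]}').pass); // true
```

To use a helper like this in an eval, export it from a file and reference that file from a javascript assertion; the is-json assertion remains the simpler choice when you only need to confirm the output parses.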
Prompt Inputs
Plain text prompts work as usual. To include images, skills, or mentions, pass a JSON array:
[
  { "type": "text", "text": "$skill-creator Write a test plan for this provider." },
  { "type": "image", "url": "https://example.com/screenshot.png" },
  { "type": "local_image", "path": "/Users/me/screenshots/failure.png" },
  {
    "type": "skill",
    "name": "skill-creator",
    "path": "/Users/me/.codex/skills/skill-creator/SKILL.md"
  },
  {
    "type": "mention",
    "name": "workspace",
    "path": "app://connector/resource"
  }
]
Supported input item types are text, image, local_image, localImage, skill, and mention.
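When prompts are generated programmatically, it can help to check item types before serializing the array. A minimal sketch, assuming a hypothetical buildPromptInput helper; the type list comes from the sentence above, and the per-type fields are not validated here.

```javascript
// Input item types listed as supported above.
const SUPPORTED_INPUT_TYPES = new Set([
  'text',
  'image',
  'local_image',
  'localImage',
  'skill',
  'mention',
]);

// Hypothetical helper: rejects unsupported item types, then serializes
// the items into the JSON array form used as the prompt text.
function buildPromptInput(items) {
  for (const item of items) {
    if (!item || !SUPPORTED_INPUT_TYPES.has(item.type)) {
      throw new TypeError(`Unsupported input item type: ${item && item.type}`);
    }
  }
  return JSON.stringify(items);
}

const promptText = buildPromptInput([
  { type: 'text', text: 'Describe the failure in this screenshot.' },
  { type: 'local_image', path: '/Users/me/screenshots/failure.png' },
]);
console.log(promptText);
```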
Metadata
The provider records app-server details for assertions and debugging:
providerResponse.metadata.codexAppServer.threadId;
providerResponse.metadata.codexAppServer.turnId;
providerResponse.metadata.codexAppServer.itemCounts;
providerResponse.metadata.codexAppServer.items;
providerResponse.metadata.codexAppServer.serverRequests;
Command output, tool arguments, and approval metadata are sanitized before they are placed in metadata or tracing attributes.
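These fields can back row-level checks, for example confirming that a run produced thread and turn IDs and stayed within an expected number of server requests. The sketch below is illustrative: checkAppServerMetadata is hypothetical, and it assumes threadId and turnId are strings and serverRequests is an array, which the field names suggest but this page does not specify.

```javascript
// Hypothetical check over metadata.codexAppServer.
// Assumes threadId/turnId are strings and serverRequests is an array.
function checkAppServerMetadata(providerResponse, { maxServerRequests = 0 } = {}) {
  const meta = providerResponse?.metadata?.codexAppServer;
  if (!meta) {
    return { pass: false, score: 0, reason: 'metadata.codexAppServer is missing' };
  }
  if (typeof meta.threadId !== 'string' || typeof meta.turnId !== 'string') {
    return { pass: false, score: 0, reason: 'threadId or turnId is missing' };
  }
  const requests = Array.isArray(meta.serverRequests) ? meta.serverRequests : [];
  const pass = requests.length <= maxServerRequests;
  return {
    pass,
    score: pass ? 1 : 0,
    reason: `${requests.length} server request(s); limit is ${maxServerRequests}`,
  };
}

// Hand-written fixture for illustration; real values come from the provider.
const sampleResponse = {
  output: 'done',
  metadata: { codexAppServer: { threadId: 't-1', turnId: 'u-1', serverRequests: [] } },
};
console.log(checkAppServerMetadata(sampleResponse).pass); // true
```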
Tracing
Promptfoo wraps each provider call in a GenAI span. The app-server provider also creates item-level spans for completed command, file, MCP, dynamic tool, reasoning, search, and agent-message items.
Enable deeper app-server tracing by setting deep_tracing: true with Promptfoo's OpenTelemetry tracing enabled. Deep tracing starts a fresh app-server process for each row so the child process can receive the active trace context. App-server process reuse and persistent thread pooling are disabled in this mode; explicit thread_id resumes are still serialized so parallel rows do not overlap turns on the same Codex thread.
Local Verification
Run from the repository root:
npm run local -- eval -c examples/openai-codex-app-server/promptfooconfig.yaml --no-cache
Use --env-file .env if your API key is stored there.
To validate the provider against your installed Codex CLI schema:
codex app-server generate-ts --out /tmp/codex-app-server-schema/ts
codex app-server generate-json-schema --out /tmp/codex-app-server-schema/json