
# OpenAI Codex App Server

This provider starts `codex app-server` as a local child process and drives the Codex app-server JSON-RPC protocol from promptfoo. Use it when you need to eval the rich client surface of Codex: streamed agent items, approvals, skills, plugins, app connector events, command/file trajectories, and thread lifecycle metadata.

For CI and straightforward automation, prefer the OpenAI Codex SDK provider. The app-server protocol is experimental, broader than the SDK, and designed for rich product integrations.

## Provider IDs

```yaml
providers:
  - openai:codex-app-server
  - openai:codex-app-server:gpt-5.4
  - openai:codex-desktop
  - openai:codex-desktop:gpt-5.4
```

openai:codex-desktop is an alias for the same app-server protocol. Promptfoo starts its own codex app-server process; it does not attach to an already-running Codex Desktop app process.

## Codex SDK vs App Server vs Desktop App

Keep this provider separate from the Codex SDK provider. They share Codex concepts, but they expose different runtime contracts.

| Surface | Best for | Runtime | Promptfoo provider |
| --- | --- | --- | --- |
| Codex SDK | CI, automation, simple agentic coding evals | `@openai/codex-sdk` library | `openai:codex-sdk` |
| Codex app-server | Rich-client protocol behavior and event evals | Local `codex app-server` child process over JSON-RPC | `openai:codex-app-server` / `openai:codex-desktop` |
| Codex Desktop app | Interactive human work in the desktop product | Native app process and UI | Not attached directly |

Use this provider when the thing being tested depends on app-server-only behavior such as approval request payloads, streamed item notifications, app connector events, plugin/skill metadata, or thread lifecycle operations. Use the SDK provider when you only need final Codex output, thread reuse, structured output, and traced shell/MCP/search/file steps.

## What Promptfoo Can and Can't Evaluate

| Eval surface | Supported? | Notes |
| --- | --- | --- |
| Final assistant text | Yes | Returned in `response.output` as a string. |
| Text, image, local image, skill, mention inputs | Yes | Pass plain text or a JSON array of supported app-server input items. |
| JSON schema output | Yes | Pass `output_schema`; assert with `is-json` or parse output yourself. |
| Token usage and estimated cost | Yes | Token usage is read from `thread/tokenUsage/updated`; cost needs a known model id. |
| Thread IDs and turn IDs | Yes | Available under `sessionId` and `metadata.codexAppServer`. |
| Approval, permission, MCP, and tool requests | Yes | `server_request_policy` gives deterministic responses for non-interactive evals. |
| Streamed item metadata | Yes | Command, file, MCP, dynamic tool, web search, reasoning, and agent-message items are normalized. |
| Deep app-server tracing | Yes | Enable `deep_tracing` to inject OTEL env vars into a fresh app-server process per row. |
| Live partial output in assertions | No | Promptfoo receives the final provider response after the turn completes. |
| Attaching to an existing Desktop app | No | Promptfoo owns a separate app-server child process. |
| WebSocket transport | No | The provider uses stdio; app-server WebSocket mode remains experimental upstream. |

## Setup

Install the Codex CLI and sign in:

```bash
npm i -g @openai/codex
codex
```

You can also authenticate with an API key:

```bash
export OPENAI_API_KEY=your_api_key_here
```

Promptfoo also accepts `CODEX_API_KEY` or `config.apiKey`. For reproducible evals, prefer API-key-backed runs or set `cli_env.CODEX_HOME` to a fixture home directory that already contains the intended Codex login state.

## Basic Usage

```yaml title="promptfooconfig.yaml"
providers:
  - id: openai:codex-app-server:gpt-5.4
    config:
      sandbox_mode: read-only
      approval_policy: never

prompts:
  - 'Review this repository and summarize the highest-risk code paths.'
```

The provider returns Codex's final assistant text as `output`. It also records thread ids, turn ids, item counts, command/file/tool metadata, approval decisions, and token usage under `metadata.codexAppServer`.

## Safety Defaults

The app-server protocol can expose shell, filesystem, config, plugin, MCP, and app connector surfaces. Promptfoo defaults to deterministic eval behavior:

| Option | Default |
| --- | --- |
| `sandbox_mode` | `read-only` |
| `approval_policy` | `never` |
| `ephemeral` | `true` |
| `thread_cleanup` | `unsubscribe` |
| `reuse_server` | `true` |
| `inherit_process_env` | `false` |

Approval requests are answered without blocking:

| Request type | Default response |
| --- | --- |
| `item/commandExecution/requestApproval` | decline |
| `item/fileChange/requestApproval` | decline |
| `item/permissions/requestApproval` | empty grant |
| `item/tool/requestUserInput` | empty answers |
| `mcpServer/elicitation/request` | decline |
| `item/tool/call` | failed static response |

Use `accept`, `acceptForSession`, permission grants, or MCP elicitation acceptance only in isolated workspaces where side effects are acceptable.

## Configuration

The provider validates top-level provider config strictly. Prompt-level config is parsed more leniently because promptfoo merges generic test options into `prompt.config`: unrelated keys are ignored there, while invalid values for known Codex fields still return a row-level provider error.

| Parameter | Type | Description | Default |
| --- | --- | --- | --- |
| `apiKey` | string | OpenAI API key. Optional when Codex is already signed in. | Environment variable |
| `base_url` | string | Custom OpenAI-compatible base URL. Also passed as `OPENAI_BASE_URL` and `OPENAI_API_BASE_URL`. | None |
| `working_dir` | string | Directory Codex operates in. | Current process dir |
| `additional_directories` | string[] | Additional directories added to workspace-write sandbox roots. | None |
| `skip_git_repo_check` | boolean | Skip the default Git repository safety check. | `false` |
| `codex_path_override` | string | Path to a specific `codex` binary. | `codex` |
| `model` | string | Model id, such as `gpt-5.4`. Can also be set in the provider id. | Codex default |
| `model_provider` | string | App-server model provider override for `thread/start` and `thread/resume`. | None |
| `service_tier` | string | `fast` or `flex`. | App-server default |
| `sandbox_mode` | string | `read-only`, `workspace-write`, or `danger-full-access`. | `read-only` |
| `sandbox_policy` | object | Raw app-server sandbox policy override for `turn/start`. | Generated from mode |
| `network_access_enabled` | boolean | Adds network access to generated sandbox policies. | `false` |
| `approval_policy` | string/object | `never`, `on-request`, `on-failure`, `untrusted`, or a granular approval policy object. | `never` |
| `approvals_reviewer` | string | `user` or `guardian_subagent`. | App-server default |
| `model_reasoning_effort` | string | `none`, `minimal`, `low`, `medium`, `high`, or `xhigh`. | App-server default |
| `reasoning_summary` | string | `auto`, `concise`, `detailed`, or `none`. | App-server default |
| `personality` | string | `none`, `friendly`, or `pragmatic`. | App-server default |
| `base_instructions` | string | Base instructions passed to `thread/start` and `thread/resume`. | None |
| `developer_instructions` | string | Developer instructions passed to `thread/start` and `thread/resume`. | None |
| `collaboration_mode` | object | Experimental collaboration mode passed to `turn/start`. | None |
| `output_schema` | object | JSON Schema passed to `turn/start`. | None |
| `thread_id` | string | Resume an existing Codex thread. | None |
| `persist_threads` | boolean | Reuse threads across rows with the same prompt template and config. | `false` |
| `thread_pool_size` | number | Max cached thread count when `persist_threads` is enabled. | `1` |
| `thread_cleanup` | string | `unsubscribe`, `archive`, or `none` for non-persistent threads. Resumed `thread_id` rows unsubscribe by default; `archive` is ignored for user-supplied thread IDs. | `unsubscribe` |
| `ephemeral` | boolean | Create ephemeral threads by default. | `true` |
| `experimental_raw_events` | boolean | Ask app-server to emit raw Responses API items. | `false` |
| `experimental_api` | boolean | Opt into experimental app-server protocol fields during `initialize`. | `true` |
| `include_raw_events` | boolean | Include protocol notifications in `raw`. | `false` |
| `cli_config` | object | Extra `codex app-server -c key=value` config overrides. | None |
| `cli_env` | object | Extra environment variables for the app-server process. | Minimal shell env |
| `inherit_process_env` | boolean | Merge the full Node.js environment into the app-server process. | `false` |
| `reuse_server` | boolean | Reuse the app-server process across rows. Disabled for `deep_tracing`. | `true` |
| `deep_tracing` | boolean | Inject OTEL env vars into a fresh app-server process per call. | `false` |
| `request_timeout_ms` | number | JSON-RPC request timeout. | `30000` |
| `startup_timeout_ms` | number | `initialize` timeout. | `30000` |
| `turn_timeout_ms` | number | Overall turn timeout. | None |
| `server_request_policy` | object | Deterministic responses for approvals, user input, MCP elicitations, and dynamic tools. | Safe declines |

## Granular Approval Policy

```yaml
providers:
  - id: openai:codex-app-server:gpt-5.4
    config:
      approval_policy:
        granular:
          sandbox_approval: true
          rules: true
          skill_approval: false
          request_permissions: true
          mcp_elicitations: true
```

## Collaboration Mode

```yaml
providers:
  - id: openai:codex-app-server:gpt-5.4
    config:
      collaboration_mode:
        mode: plan
        settings:
          model: gpt-5.4
          reasoning_effort: none
          developer_instructions: null
```

collaboration_mode is experimental and is sent on turn/start. App-server may let the selected mode override model, reasoning effort, or developer instructions for the turn.

## Server Request Policy

Configure deterministic responses when you intentionally want app-server approval flows:

```yaml
providers:
  - id: openai:codex-app-server:gpt-5.4
    config:
      sandbox_mode: workspace-write
      approval_policy: on-request
      server_request_policy:
        command_execution: decline
        file_change: decline
        user_input:
          severity: high
        mcp_elicitation:
          action: accept
          content:
            severity: low
            _meta:
              source: promptfoo
        permissions:
          scope: session
          permissions:
            network:
              enabled: true
            fileSystem:
              read:
                - /tmp/fixture
              write: null
        dynamic_tools:
          classify:
            success: true
            text: '{"label":"safe"}'
```

For command execution approvals, `command_execution` may also be an app-server decision object:

```yaml
server_request_policy:
  command_execution:
    applyNetworkPolicyAmendment:
      network_policy_amendment:
        host: registry.npmjs.org
        action: allow
```

Legacy execCommandApproval and applyPatchApproval callbacks are also handled for older app-server versions. Advanced command decision objects are only supported on the modern item/commandExecution/requestApproval flow.

## Structured Output

```yaml title="promptfooconfig.yaml"
providers:
  - id: openai:codex-app-server:gpt-5.4
    config:
      sandbox_mode: read-only
      output_schema:
        type: object
        properties:
          summary:
            type: string
          risks:
            type: array
            items:
              type: string
        required: [summary, risks]
        additionalProperties: false

prompts:
  - 'Return a JSON review summary for this repo.'

tests:
  - assert:
      - type: is-json
```

The final app-server response is returned as a string. Use `is-json` or a JavaScript assertion to parse it.
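A minimal sketch of such a JavaScript assertion, checking the fields that the `output_schema` above requires (referenced from the config as `type: javascript` with `value: file://...`):

```javascript
// Parse the string output and verify the schema-constrained fields.
const checkReview = (output) => {
  let parsed;
  try {
    parsed = JSON.parse(output);
  } catch {
    return false; // not valid JSON
  }
  return typeof parsed.summary === 'string' && Array.isArray(parsed.risks);
};

module.exports = checkReview;
```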

## Prompt Inputs

Plain text prompts work as usual. To include images, skills, or mentions, pass a JSON array:

```json
[
  { "type": "text", "text": "$skill-creator Write a test plan for this provider." },
  { "type": "image", "url": "https://example.com/screenshot.png" },
  { "type": "local_image", "path": "/Users/me/screenshots/failure.png" },
  {
    "type": "skill",
    "name": "skill-creator",
    "path": "/Users/me/.codex/skills/skill-creator/SKILL.md"
  },
  {
    "type": "mention",
    "name": "workspace",
    "path": "app://connector/resource"
  }
]
```

Supported input item types are `text`, `image`, `local_image`, `localImage`, `skill`, and `mention`.

## Metadata

The provider records app-server details for assertions and debugging:

```javascript
providerResponse.metadata.codexAppServer.threadId;
providerResponse.metadata.codexAppServer.turnId;
providerResponse.metadata.codexAppServer.itemCounts;
providerResponse.metadata.codexAppServer.items;
providerResponse.metadata.codexAppServer.serverRequests;
```

Command output, tool arguments, and approval metadata are sanitized before they are placed in metadata or tracing attributes.

## Tracing

Promptfoo wraps each provider call in a GenAI span. The app-server provider also creates item-level spans for completed command, file, MCP, dynamic tool, reasoning, search, and agent-message items.

Enable deeper app-server tracing by setting `deep_tracing: true` with Promptfoo's OpenTelemetry tracing enabled. Deep tracing starts a fresh app-server process for each row so the child process can receive the active trace context. App-server process reuse and persistent thread pooling are disabled in this mode; explicit `thread_id` resumes are still serialized so parallel rows do not overlap turns on the same Codex thread.

## Local Verification

Run from the repository root:

```bash
npm run local -- eval -c examples/openai-codex-app-server/promptfooconfig.yaml --no-cache
```

Use `--env-file .env` if your API key is stored there.

To validate the provider against your installed Codex CLI schema:

```bash
codex app-server generate-ts --out /tmp/codex-app-server-schema/ts
codex app-server generate-json-schema --out /tmp/codex-app-server-schema/json
```