Evaluate OpenAI Agents (Python SDK)
Use the Python openai-agents SDK with Promptfoo by wrapping your agent as a Python provider. This gives you full control over agent code, tools, sessions, and framework-specific tracing, while still letting Promptfoo score outputs and assert on the traced workflow.
The built-in openai:agents:* provider is for the JavaScript @openai/agents SDK. For the Python SDK, use the Python provider path described here.
Quick Start
npx promptfoo@latest init --example openai-agents
cd openai-agents
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
export OPENAI_API_KEY=your_api_key_here
npx promptfoo@latest eval -c promptfooconfig.yaml --no-cache
PROMPTFOO_ENABLE_OTEL=true npx promptfoo@latest eval -c promptfooconfig.yaml --no-cache
npx promptfoo@latest view
What The Example Covers
- multi-turn execution over a persistent SQLiteSession
- specialist handoffs between a triage agent, an FAQ agent, and a seat-booking agent
- Promptfoo trace ingestion of the SDK's internal spans
- assertions on tool usage, tool arguments, agent spans, tool order, and overall task success
How The Tracing Works
Promptfoo can only assert on tool paths if it receives the agent's internal spans. The example does that by installing a custom TracingProcessor for the OpenAI Agents SDK and exporting those spans to Promptfoo's OTLP receiver.
At a high level:
- Promptfoo enables tracing and injects a W3C traceparent into the Python provider context.
- The example parses that trace context and configures a custom OpenAI Agents tracing processor.
- The processor converts OpenAI Agents spans into OTLP JSON.
- Promptfoo ingests those spans and makes them available in the Trace Timeline and trajectory:* assertions.
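As a rough illustration of the "parses that trace context" step, the sketch below splits a W3C traceparent header into its parts. The names parse_traceparent and TraceContext are hypothetical and not taken from the example; how the header is read out of the provider context is left to the caller.

```python
import re
from typing import NamedTuple, Optional

# W3C Trace Context, version 00: version-traceid-parentid-flags, lowercase hex.
_TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


class TraceContext(NamedTuple):
    """Hypothetical container for the parsed header fields."""

    trace_id: str        # 32 hex characters
    parent_span_id: str  # 16 hex characters
    sampled: bool        # bit 0 of the flags byte


def parse_traceparent(header: Optional[str]) -> Optional[TraceContext]:
    """Return the parsed context, or None if the header is missing or invalid."""
    if not header:
        return None
    match = _TRACEPARENT_RE.match(header.strip())
    if match is None:
        return None
    version, trace_id, parent_span_id, flags = match.groups()
    # The spec forbids version ff and all-zero trace or parent IDs.
    if version == "ff" or set(trace_id) == {"0"} or set(parent_span_id) == {"0"}:
        return None
    return TraceContext(trace_id, parent_span_id, bool(int(flags, 16) & 0x01))
```

Returning None for a missing or malformed header lets the caller fall back to running without trace export instead of failing the eval row.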
If you skip this exporter, Promptfoo will not see the SDK's tool and handoff spans, so trajectory:* assertions will not have the trace data they need.
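To make the exported data more concrete, here is a minimal sketch of wrapping one finished span in an OTLP JSON payload. The envelope (resourceSpans, scopeSpans, hex IDs, nanosecond timestamps as strings) follows the OTLP JSON encoding. The function name, service name, scope name, and attribute keys are illustrative assumptions, not the names the example or Promptfoo actually uses.

```python
import json
from typing import Any, Dict, Optional


def _attr(key: str, value: Any) -> Dict[str, Any]:
    """Encode one attribute as an OTLP KeyValue; non-strings are JSON-encoded."""
    text = value if isinstance(value, str) else json.dumps(value, sort_keys=True)
    return {"key": key, "value": {"stringValue": text}}


def span_to_otlp(
    *,
    trace_id: str,                  # 32 hex chars, e.g. from the traceparent
    span_id: str,                   # 16 hex chars
    parent_span_id: Optional[str],  # None for a root span
    name: str,
    start_ns: int,
    end_ns: int,
    attributes: Dict[str, Any],
    service_name: str = "openai-agents-example",  # hypothetical service name
) -> Dict[str, Any]:
    """Build an OTLP JSON trace payload that contains a single span."""
    span: Dict[str, Any] = {
        "traceId": trace_id,
        "spanId": span_id,
        "name": name,
        "kind": 1,  # SPAN_KIND_INTERNAL
        # OTLP JSON carries 64-bit integers as decimal strings.
        "startTimeUnixNano": str(start_ns),
        "endTimeUnixNano": str(end_ns),
        "attributes": [_attr(k, v) for k, v in attributes.items()],
    }
    if parent_span_id:
        span["parentSpanId"] = parent_span_id
    return {
        "resourceSpans": [
            {
                "resource": {"attributes": [_attr("service.name", service_name)]},
                "scopeSpans": [
                    # Hypothetical instrumentation scope name.
                    {"scope": {"name": "openai-agents-bridge"}, "spans": [span]}
                ],
            }
        ]
    }
```

A processor built this way would serialize the payload with json.dumps and POST it to the OTLP receiver with a Content-Type of application/json.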
If you also enable Promptfoo's Python wrapper OTEL path with PROMPTFOO_ENABLE_OTEL=true, the example will emit a provider-level Python span as well. The custom SDK spans will inherit that active OTEL span as their parent. The example config accepts both OTLP JSON and protobuf because the SDK bridge emits JSON while the wrapper exporter uses protobuf by default.
Assertion Pattern
The example config asserts on the agent's actual behavior instead of only the final message:
vars:
  steps_json: |
    [
      "My name is Ada Lovelace and my confirmation number is ABC123.",
      "Move me to seat 14C.",
      "Also, what is the baggage allowance?"
    ]
assert:
  - type: trajectory:tool-used
    value:
      - lookup_reservation
      - update_seat
      - faq_lookup
  - type: trajectory:tool-args-match
    value:
      name: update_seat
      args:
        confirmation_number: ABC123
        new_seat: 14C
      mode: partial
  - type: trajectory:tool-sequence
    value:
      steps:
        - lookup_reservation
        - update_seat
        - faq_lookup
  - type: trajectory:step-count
    value:
      type: span
      pattern: 'agent *'
      min: 3
  - type: trace-error-spans
    value:
      max_count: 0
Use trajectory:goal-success when you want a judge model to decide whether the traced workflow actually completed the task, not just whether it hit the right tool path.
Long-Horizon Tasks
The example turns one eval row into a long-horizon task by passing a JSON-encoded list of user turns in vars.steps_json. The provider parses that JSON and executes the turns sequentially against a shared SQLiteSession, which lets the SDK preserve working memory across turns inside a single Promptfoo test case.
That pattern is useful when you want to evaluate:
- multi-step workflows that need memory
- agent handoffs over time
- task completion after several intermediate actions
- regressions in tool usage across longer trajectories
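The steps_json pattern can be sketched as a Promptfoo Python provider along the following lines. This is a simplified stand-in, not the example's code: run_steps, RunTurn, and echo_turn are hypothetical names, and a plain list takes the place of the shared SQLiteSession so the sketch runs without the SDK installed. In a real provider, the turn function would call the agent runner with the same session object on every turn.

```python
import json
from typing import Any, Callable, Dict, List

# Stand-in for one agent turn: (user_message, session) -> reply text.
# With the real SDK, this is where the agent would be run against the
# shared session (assumption: the runner accepts a session argument).
RunTurn = Callable[[str, List[str]], str]


def run_steps(steps_json: str, run_turn: RunTurn) -> Dict[str, Any]:
    """Execute every user turn in order against one shared session."""
    try:
        steps = json.loads(steps_json)
    except json.JSONDecodeError as exc:
        return {"error": f"steps_json is not valid JSON: {exc}"}
    if not isinstance(steps, list) or not all(isinstance(s, str) for s in steps):
        return {"error": "steps_json must be a JSON array of strings"}

    session: List[str] = []  # stands in for a shared SQLiteSession
    replies: List[str] = []
    for message in steps:
        replies.append(run_turn(message, session))

    # Promptfoo scores `output`; here it is the reply to the final turn.
    return {
        "output": replies[-1] if replies else "",
        "metadata": {"turns": len(replies)},
    }


def call_api(prompt: str, options: Dict[str, Any], context: Dict[str, Any]) -> Dict[str, Any]:
    """Promptfoo Python provider entry point."""

    def echo_turn(message: str, session: List[str]) -> str:
        # Hypothetical fake agent: records the turn and echoes it back.
        session.append(message)
        return f"turn {len(session)}: {message}"

    return run_steps(context["vars"]["steps_json"], echo_turn)
```

Returning only the final reply as output keeps string-based assertions focused on the end state, while the intermediate turns stay visible through the trace.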
Telemetry
After the eval finishes, open the web UI and inspect the Trace Timeline for any row. You should see:
- a provider-level Python span when PROMPTFOO_ENABLE_OTEL=true
- agent spans
- handoff spans
- generation spans
- function-tool spans with tool names and arguments
That same trace data powers trace-span-* and trajectory:* assertions.
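For intuition about what an ordering check over function-tool spans involves, the sketch below tests whether the expected tool names occur in order within the observed calls. It is an illustration only and not Promptfoo's implementation; the function name is hypothetical, and Promptfoo's own matching rules may differ.

```python
from typing import Iterable, List


def tools_called_in_order(observed: Iterable[str], expected: List[str]) -> bool:
    """True if `expected` appears in `observed` as an in-order subsequence."""
    remaining = iter(observed)
    # Each `any` consumes the shared iterator up to the next match, so later
    # expected names can only match calls that came after earlier ones.
    return all(any(name == want for name in remaining) for want in expected)
```

With the tool names from the example, tools_called_in_order(["lookup_reservation", "update_seat", "faq_lookup"], ["lookup_reservation", "faq_lookup"]) holds, while swapping the two expected names does not.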