
Test Agent Skills

Skills are local instructions that teach an agent when and how to use a capability. When you have two versions of the same skill, you usually want to know two things:

  1. Does the agent use the skill when the task calls for it?
  2. Does that skill version lead to better work?

Promptfoo can answer both questions by running the same tasks against each version side by side. Keep the model, task files, and permissions the same, swap only the SKILL.md, and compare the results.

Start With One Comparison

In this example, we're comparing two versions of a review-standards skill. Each version gets its own fixture directory with the same source file and a different copy of the skill:

skill-eval/
├── promptfooconfig.yaml
└── fixtures/
    ├── v1/
    │   ├── .claude/skills/review-standards/SKILL.md
    │   └── src/auth.ts
    └── v2/
        ├── .claude/skills/review-standards/SKILL.md
        └── src/auth.ts

For Codex, use .agents/skills/review-standards/SKILL.md instead of .claude/skills/.... The rest of the comparison can stay the same.
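
For reference, the Codex version of the same layout only moves the skill directory:

skill-eval/
├── promptfooconfig.yaml
└── fixtures/
    ├── v1/
    │   ├── .agents/skills/review-standards/SKILL.md
    │   └── src/auth.ts
    └── v2/
        ├── .agents/skills/review-standards/SKILL.md
        └── src/auth.ts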

Compare Two Claude Skill Versions

We'll start with the Claude Agent SDK provider. The prompt asks for a short JSON review so the outputs are easy to score, and the two providers differ only in working_dir:

promptfooconfig.yaml
description: Compare Claude skill versions

prompts:
  - '{{request}}'

# YAML anchor for the schema both providers enforce. The Claude Agent SDK's
# `output_format` is the equivalent of Codex's `output_schema` — it returns
# valid JSON without you having to ask for "JSON only" in the prompt or
# strip Markdown fences from the model's reply.
x-review-schema: &reviewSchema
  type: json_schema
  schema:
    type: object
    required: [summary, issues]
    additionalProperties: false
    properties:
      summary: { type: string }
      issues:
        type: array
        items:
          type: object
          required: [id, severity]
          additionalProperties: false
          properties:
            id: { type: string }
            severity: { type: string, enum: [high, medium, low] }

providers:
  - id: anthropic:claude-agent-sdk
    label: review-standards-v1
    config:
      model: claude-sonnet-4-6
      working_dir: ./fixtures/v1
      setting_sources: ['project']
      skills: ['review-standards']
      append_allowed_tools: ['Read', 'Grep', 'Glob']
      output_format: *reviewSchema

  - id: anthropic:claude-agent-sdk
    label: review-standards-v2
    config:
      model: claude-sonnet-4-6
      working_dir: ./fixtures/v2
      setting_sources: ['project']
      skills: ['review-standards']
      append_allowed_tools: ['Read', 'Grep', 'Glob']
      output_format: *reviewSchema

setting_sources: ['project'] discovers SKILL.md files under .claude/skills/. The skills: filter (added in @anthropic-ai/claude-agent-sdk 0.2.120) narrows the session to a single skill and auto-allows the Skill tool, so it no longer needs to appear in append_allowed_tools. Pass skills: 'all' if you want every discovered skill enabled. On older SDK versions, drop skills: and add 'Skill' back to append_allowed_tools.
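
On an older SDK version, the v1 provider block would look roughly like this (the same config as above, with the skills: filter dropped and the Skill tool allowed explicitly):

providers:
  - id: anthropic:claude-agent-sdk
    label: review-standards-v1
    config:
      model: claude-sonnet-4-6
      working_dir: ./fixtures/v1
      setting_sources: ['project']
      # Older SDKs have no `skills:` filter, so allow the Skill tool explicitly.
      append_allowed_tools: ['Skill', 'Read', 'Grep', 'Glob']
      output_format: *reviewSchema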

Without output_format, Claude usually wraps short JSON answers in Markdown fences or a leading sentence, which makes downstream JSON.parse() brittle. Setting it on both providers — once via the &reviewSchema anchor — gives you reliable structured output without prompt gymnastics.

Because everything else is held constant, any meaningful difference in the results should come from the skill text itself.

Next, add a few tasks that represent how the skill will really be used:

defaultTest:
  # By default, Promptfoo fans each YAML-list var into one test case per
  # element (matrix expansion), which would split each comparison test in
  # two. Disable that here so `expectedIssues` reaches the assertion intact.
  options:
    disableVarExpansion: true

tests:
  - description: Finds both auth issues
    vars:
      request: Review src/auth.ts for password handling and token comparison issues.
      expectedIssues:
        - weak-password-hash
        - timing-unsafe-compare

  - description: Focuses on password handling only
    vars:
      request: Review src/auth.ts only for password handling issues.
      expectedIssues:
        - weak-password-hash

The first case asks for a broad review covering both topics. The second is narrower, which helps catch a skill that finds the right problems but ignores the user's requested scope. Together, the two cases let the eval reward a skill version that both knows the full set of issues and respects the user's scope when asked.

See Passing Arrays to Assertions for the matrix-expansion behavior disableVarExpansion opts out of.

Add Assertions

Assertions tell Promptfoo what "better" means for this comparison. It helps to add them one layer at a time.

Check That the Skill Was Used

Start by verifying that Claude actually invoked the skill:

defaultTest:
  assert:
    - type: skill-used
      value: review-standards

Claude exposes Skill tool calls directly, and Promptfoo normalizes them into the skill-used assertion. This distinguishes a good answer that happened without the skill from one that was produced through the workflow you intended to test.

Score the Output

Then score whether the review found the expected issues:

defaultTest:
  assert:
    - type: javascript
      threshold: 0.7
      value: |
        // `output_format` hands us a parsed object; on Codex `output_schema`
        // returns a string, so wrap with `JSON.parse(output)` there.
        const expected = context.vars.expectedIssues;
        const found = (output.issues || []).map((issue) => issue.id);
        const hits = expected.filter((id) => found.includes(id));
        const recall = hits.length / expected.length;

        return {
          pass: recall >= 0.75,
          score: recall,
          reason: `matched ${hits.length}/${expected.length} expected issues`,
        };

The JavaScript assertion gives each response a score based on issue recall. In your own eval, replace this with the signal that matters for the skill: tests passed, required edits were made, policy checks were followed, or a rubric was satisfied.
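
If a rubric is the better signal for your skill, a sketch using Promptfoo's llm-rubric assertion could replace the JavaScript check; the rubric text here is illustrative:

defaultTest:
  assert:
    - type: llm-rubric
      value: |
        The review identifies concrete, real problems in src/auth.ts,
        stays within the scope the user asked for, and does not invent issues.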

Add Secondary Signals

Once correctness is working, you can add supporting signals:

defaultTest:
  assert:
    - type: cost
      threshold: 0.50
    - type: latency
      threshold: 120000

These checks are useful when two skill versions are both correct but one is much more expensive or slower.

If you want Promptfoo to mark the strongest output for each test case, add max-score:

defaultTest:
  assert:
    - type: max-score
      value:
        method: average
        threshold: 0.7
        weights:
          javascript: 4
          skill-used: 2
          cost: 0.5
          latency: 0.5

That weighting makes task quality the main signal, while still rewarding the version that routes correctly and stays within practical limits.

Use the Same Eval With Codex

To run the same comparison with the OpenAI Codex SDK, keep the tests and scoring logic and swap in the provider block below. Codex takes a bare output_schema (the JSON Schema itself, without the json_schema wrapper):

x-review-schema: &reviewSchema
  type: object
  required: [summary, issues]
  additionalProperties: false
  properties:
    summary: { type: string }
    issues:
      type: array
      items:
        type: object
        required: [id, severity]
        additionalProperties: false
        properties:
          id: { type: string }
          severity: { type: string, enum: [high, medium, low] }

providers:
  - id: openai:codex-sdk
    label: review-standards-v1
    config:
      model: gpt-5.5
      working_dir: ./fixtures/v1
      skip_git_repo_check: true
      sandbox_mode: read-only
      enable_streaming: true
      output_schema: *reviewSchema
      cli_env:
        CODEX_HOME: ./codex-home

  - id: openai:codex-sdk
    label: review-standards-v2
    config:
      model: gpt-5.5
      working_dir: ./fixtures/v2
      skip_git_repo_check: true
      sandbox_mode: read-only
      enable_streaming: true
      output_schema: *reviewSchema
      cli_env:
        CODEX_HOME: ./codex-home

Codex discovers project skills from .agents/skills/ under each working_dir. Its skill-used signal is inferred from successful reads of the matching SKILL.md, so keep enable_streaming: true while developing the eval if you want Promptfoo to collect that evidence. The earlier JavaScript assertion should use JSON.parse(output) with Codex because output_schema keeps output as a JSON string rather than a parsed object.
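
Mirroring the earlier JavaScript assertion, the Codex variant would look roughly like this:

defaultTest:
  assert:
    - type: javascript
      threshold: 0.7
      value: |
        // Codex's `output_schema` returns a JSON string rather than a parsed object.
        const review = JSON.parse(output);
        const expected = context.vars.expectedIssues;
        const found = (review.issues || []).map((issue) => issue.id);
        const hits = expected.filter((id) => found.includes(id));
        const recall = hits.length / expected.length;
        return {
          pass: recall >= 0.75,
          score: recall,
          reason: `matched ${hits.length}/${expected.length} expected issues`,
        };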

The runnable skill-comparison example uses the same approach, with a YAML anchor to share the schema between v1 and v2.

There is a matching Claude example that uses output_format and the skills: filter so you can run the same comparison against either provider.

Add Trace Evidence When Needed

For Claude Agent SDK, skill-used is usually enough to prove invocation because it is based on first-class Skill tool calls. If you need the raw call details, inspect the recorded tool calls directly:

assert:
  - type: javascript
    value: |
      const calls = context.providerResponse?.metadata?.toolCalls || [];
      return calls.some((call) => call.name === 'Skill');

For Codex SDK, trace evidence can be useful when you want to see the workflow behind the final answer. Enable tracing and assert that Codex read the skill file:

providers:
  - id: openai:codex-sdk
    config:
      model: gpt-5.5
      working_dir: ./fixtures/v2
      skip_git_repo_check: true
      sandbox_mode: read-only
      enable_streaming: true
      deep_tracing: true

tracing:
  enabled: true

tests:
  - assert:
      - type: trajectory:step-count
        value:
          type: command
          pattern: '*review-standards/SKILL.md*'
          min: 1

Use Codex app-server when you need to test app-server-specific behavior such as explicit skill input items, approvals, or plugin metadata rather than just the skill outcome.

Run the Eval

Run the config the same way you would any other Promptfoo eval:

npx promptfoo@latest eval -c promptfooconfig.yaml

From there, add only the options that help answer your current question:

  • Add --repeat 3 when you want a better sample of nondeterministic agent behavior.
  • Add --no-cache while iterating on the skill text so every run is fresh.
  • Add -o results.json when you want to inspect or compare results outside the terminal.
  • Run npx promptfoo@latest view when the side-by-side web view is more useful than the CLI table.
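
For example, an iteration run that repeats each test, skips the cache, and writes results to a file:

npx promptfoo@latest eval -c promptfooconfig.yaml --repeat 3 --no-cache -o results.json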

In the web view, the comparison is easy to inspect side by side:

Promptfoo web UI comparing two Codex skill versions

Decide Which Skill Wins

Start with the comparison that matters most for the skill. For a review skill, that might be "which version finds the right issues with the least noise?" For a code-writing skill, it might be "which version passes the tests most often?"

Then use supporting checks where they add value:

# Did the agent use the intended skill?
- type: skill-used
  value: review-standards

# Did the output stay within a reasonable budget?
- type: cost
  threshold: 0.50

# Was the answer fast enough for the workflow?
- type: latency
  threshold: 120000

If two versions are close, rerun the eval with repeats and inspect the failures before choosing one. The better skill is the one that holds up across the tasks you care about, not the one that wins a single lucky run.