Context relevance

Measures the fraction of the retrieved context that is strictly required to answer the query.

Use when: You want to check if your retrieval is returning too much irrelevant content.

How it works: Extracts only the sentences absolutely required to answer the query. Score = required sentences / total sentences.

warning

This metric identifies the MINIMUM content needed, not all relevant content. A low score can mean either good retrieval (the answer was found along with useful supporting context) or bad retrieval (the context is mostly irrelevant).

Example:

```
Query: "What is the capital of France?"
Context: "Paris is the capital. France has great wine. The Eiffel Tower is in Paris."
Score: 0.33 (only the first sentence is required)
```
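The score arithmetic above can be sketched in a few lines. This is a simplified illustration, not the metric's actual implementation: it assumes an LLM judge has already decided which sentences are required (the `required` set below is that assumed judgment, hard-coded here).

```python
# Sketch of the scoring arithmetic: score = required sentences / total sentences.
# The "required" labels are a hypothetical judge output, hard-coded for illustration.
context = [
    "Paris is the capital.",
    "France has great wine.",
    "The Eiffel Tower is in Paris.",
]
required = {"Paris is the capital."}  # assumed judge verdict: only this is essential

score = len(required) / len(context)
print(round(score, 2))  # 0.33
```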

Configuration

```yaml
assert:
  - type: context-relevance
    threshold: 0.3 # At least 30% should be essential
```

Required fields

  • query - User's question (in test vars)
  • context - Retrieved text (in vars or via contextTransform)
  • threshold - Minimum score 0-1 (default: 0)

Full example

```yaml
tests:
  - vars:
      query: 'What is the capital of France?'
      context: 'Paris is the capital of France.'
    assert:
      - type: context-relevance
        threshold: 0.8 # Most content should be essential
```

Array context

Context can be provided as an array of chunks:

```yaml
tests:
  - vars:
      query: 'What are the benefits of RAG systems?'
      context:
        - 'RAG systems improve factual accuracy by incorporating external knowledge sources.'
        - 'They reduce hallucinations in large language models through grounded responses.'
        - 'RAG enables up-to-date information retrieval beyond training data cutoffs.'
        - 'The weather forecast shows rain this weekend.' # irrelevant chunk
    assert:
      - type: context-relevance
        threshold: 0.5 # Score: 3/4 = 0.75
```

Dynamic context extraction

For RAG systems that return context with their response:

```yaml
# Provider returns { answer: "...", context: "..." }
assert:
  - type: context-relevance
    contextTransform: 'output.context' # Extract context field
    threshold: 0.3
```

contextTransform can also return an array:

```yaml
assert:
  - type: context-relevance
    contextTransform: 'output.chunks' # Extract chunks array
    threshold: 0.5
```

Score interpretation

  • 0.8-1.0: Almost all content is essential (very focused or minimal retrieval)
  • 0.3-0.7: Mixed essential and supporting content (often ideal)
  • 0.0-0.3: Mostly non-essential content (may indicate poor retrieval)

Limitations

  • Only identifies the minimum sufficient content, not all relevant content
  • Single context strings are split by lines (use arrays for better accuracy)
  • Score interpretation varies by use case
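The line-splitting limitation is easy to see with a sketch. Assuming a naive newline split (an assumption mirroring the limitation above, not the metric's exact splitter), two sentences on the same line end up fused into a single chunk, which skews the denominator of the score:

```python
# Assumption: single context strings are split naively on newlines.
# Two sentences on one line become a single chunk.
blob = "Paris is the capital. France has great wine.\nThe Eiffel Tower is in Paris."
chunks_from_string = [line for line in blob.split("\n") if line.strip()]
print(len(chunks_from_string))  # 2 chunks, even though there are 3 sentences
```

Passing the three sentences as a YAML array avoids this, since each array element is treated as its own chunk.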

Further reading