LMVD-ID: 95d9faa9
Published January 1, 2026

Detector Telemetry Camouflage

Affected Models: Llama 2 7B, Llama 3 8B, Qwen 2.5 14B, Vicuna 7B

Research Paper

CORVUS: Red-Teaming Hallucination Detectors via Internal Signal Camouflage in Large Language Models

View Paper

Description: Single-pass hallucination detectors relying on internal telemetry (uncertainty, hidden-state geometry, and attention patterns) are vulnerable to white-box, model-side adversarial attacks. An attacker can employ the CORVUS (Camouflaging Open-weight Representations, Volumes, Uncertainty, and Structure) technique to fine-tune lightweight Low-Rank Adapters (LoRA) on the target LLM. This method optimizes a specific loss objective that camouflages detector-visible telemetry signals—specifically Token Entropy (TE), Hidden Log-Volume (HV), and Attention Diagonality (AD)—without altering the surface-level text generation under teacher-forced replay. The attack utilizes a single-step Fast Gradient Sign Method (FGSM) perturbation in the embedding space to robustly minimize attention diagonality. Consequently, hallucinated outputs generate internal traces indistinguishable from faithful outputs, bypassing detectors such as LLM-Check, Semantic Entropy Probes (SEP), and ICR-probe.
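The three telemetry statistics the attack camouflages are defined precisely in the paper; as a rough illustration of what a detector observes, here is a minimal numpy sketch of plausible implementations, assuming per-token logits, hidden states, and row-stochastic attention maps have already been extracted from the model (function names and the `1e-8`/`1e-12` stabilizers are illustrative, not the paper's exact formulations):

```python
import numpy as np

def token_entropy(logits):
    """Mean per-token Shannon entropy of the next-token distribution.
    logits: (seq_len, vocab_size)."""
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    p = np.exp(logits)
    p /= p.sum(axis=-1, keepdims=True)
    return float(-(p * np.log(p + 1e-12)).sum(axis=-1).mean())

def hidden_log_volume(hidden):
    """Log-volume of the hidden-state point cloud, via the sum of
    log singular values. hidden: (seq_len, d_model)."""
    s = np.linalg.svd(hidden, compute_uv=False)
    return float(np.log(s + 1e-8).sum())

def attention_diagonality(attn):
    """Mean probability mass on the attention diagonal.
    attn: (num_heads, seq_len, seq_len), rows sum to 1."""
    seq_len = attn.shape[-1]
    return float(np.trace(attn, axis1=-2, axis2=-1).mean() / seq_len)
```

CORVUS drives these statistics on hallucinated generations toward the values observed on faithful ones, so a detector thresholding on them sees nothing anomalous.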

Examples: The vulnerability is reproduced by training LoRA adapters using the CORVUS objective function. See the official implementation: CORVUS Repository (note: placeholder link based on paper context; refer to the paper for the actual URL if it is not live). See the paper: arXiv:2405.18540

The attack minimizes the following loss function during adapter fine-tuning: $$ \mathcal{L} = \mathcal{L}_{\mathrm{CE}} + \beta_{\mathrm{te}}\,\mathrm{TE}_{\mathrm{clean}} - \beta_{\mathrm{hv}}\,\overline{\mathrm{HV}}_{\mathrm{clean}} - \lambda_{\mathrm{adv}}\,\overline{\mathrm{AD}}_{\mathrm{adv}} $$

Where:

  • $\mathcal{L}_{\mathrm{CE}}$ is the standard cross-entropy loss.
  • $\mathrm{TE}_{\mathrm{clean}}$ is the Token Entropy on the clean pass.
  • $\overline{\mathrm{HV}}_{\mathrm{clean}}$ is the layer-averaged Hidden Log-Volume.
  • $\overline{\mathrm{AD}}_{\mathrm{adv}}$ is the Attention Diagonality computed on a pass with FGSM-perturbed embeddings $E' = E + \varepsilon\,\mathrm{sign}(\nabla_{E}\,\phi_{\mathrm{adv}})$.
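Structurally, the objective and the single-step FGSM perturbation can be sketched as follows. This is a minimal numpy illustration of the arithmetic, not the paper's implementation: the gradient of the adversarial objective $\phi_{\mathrm{adv}}$ is assumed to be supplied by the caller (in practice it comes from backpropagation through the model), and the coefficient defaults are placeholders:

```python
import numpy as np

def fgsm_perturb(embeddings, grad_phi_adv, eps=0.01):
    """Single-step FGSM in embedding space: E' = E + eps * sign(grad of
    the adversarial objective w.r.t. E). AD is then recomputed on E'."""
    return embeddings + eps * np.sign(grad_phi_adv)

def corvus_loss(ce, te_clean, hv_clean, ad_adv,
                beta_te=0.1, beta_hv=0.1, lam_adv=0.1):
    """CORVUS adapter-training objective. Minimizing it keeps cross-entropy
    low while pushing Token Entropy down and pushing the layer-averaged
    Hidden Log-Volume and the adversarial Attention Diagonality up
    (the latter two enter with negative signs)."""
    return ce + beta_te * te_clean - beta_hv * hv_clean - lam_adv * ad_adv
```

Because only the loss changes and generation is held fixed under teacher-forced replay, the adapter reshapes internal traces without visibly altering the output text.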

Impact:

  • Security Bypass: Allows hallucinating models to pass automated safety audits that rely on internal state monitoring.
  • Detector Obsolescence: Renders training-free detectors (LLM-Check, Window-Entropy) and trained probes (SEP, ICR-probe) ineffective, with AUROC scores dropping by up to 25 points.
  • Calibration Drift: Clean-calibrated thresholds fail to transfer, leading to extremely low True Positive Rates (TPR) at fixed False Positive Rates (FPR).
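The calibration-drift failure mode is easy to see in code: the audit threshold is fixed on clean (faithful) detector scores at a target FPR, and TPR is then measured on attacked hallucinations. A minimal sketch, with hypothetical score arrays standing in for real detector outputs:

```python
import numpy as np

def tpr_at_fpr(scores_faithful, scores_halluc, target_fpr=0.05):
    """Calibrate the detection threshold on faithful-output scores at the
    target false-positive rate, then measure the true-positive rate on
    (possibly camouflaged) hallucination scores."""
    thresh = np.quantile(scores_faithful, 1.0 - target_fpr)
    return float((scores_halluc > thresh).mean())
```

If CORVUS shifts hallucination scores down into the faithful range, the clean-calibrated threshold passes nearly all of them, and TPR at the fixed FPR collapses even when AUROC degrades more gradually.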

Affected Systems:

  • Detectors:
      • LLM-Check (Hidden Score and Attention Score components)
      • Semantic Entropy Probes (SEP)
      • ICR-probe (Internal Context Retrieval)
      • Perplexity (PPL) and Window-Entropy based filters
  • Models (verified exploits):
      • Llama-2 (7B-Chat)
      • Vicuna (7B)
      • Llama-3 (8B-Instruct)
      • Qwen2.5 (14B-Instruct)

Mitigation Steps:

  • External Grounding: Incorporate external knowledge-based approaches (e.g., Retrieval Augmented Generation with alignment scoring) rather than relying solely on internal model telemetry.
  • Cross-Model Validation: Utilize "frozen auditors"—separate, unadapted models—to evaluate the output of the target model, thereby decoupling the generation source from the auditing telemetry.
  • Telemetry Diversity: Design defenses that aggregate diverse signals that are computationally difficult to jointly optimize via lightweight adapters.
  • Challenge-Response Mechanisms: Implement randomization or active challenge-response protocols during auditing to complicate white-box minimization attacks.
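The cross-model validation step above can be sketched as a thin protocol wrapper. Everything here is hypothetical scaffolding: `auditor_telemetry` stands for any scoring function computed by a separate, frozen model (e.g., its own token entropy on a teacher-forced replay of the target's response), which the target's adapter cannot co-optimize:

```python
def audit_output(prompt, response, auditor_telemetry, threshold):
    """Cross-model validation with a frozen auditor: score the target
    model's response using telemetry from an unadapted second model,
    decoupling the generation source from the auditing signal.
    Returns True if the response passes the audit."""
    score = auditor_telemetry(prompt, response)
    return score <= threshold
```

Because the CORVUS objective only shapes the target model's own internal traces, a frozen auditor's telemetry on the same text remains uncamouflaged, which is why this mitigation composes well with external grounding.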

© 2026 Promptfoo. All rights reserved.