LMVD-ID: 9dc40236
Published February 1, 2026

Activation Delta Detector Evasion

Affected Models: Llama-3 8B, Phi-3 3.8B

Research Paper

Bypassing Prompt Injection Detectors through Evasive Injections

View Paper

Description: Activation-delta-based linear probes used to detect task drift and prompt injections in Large Language Models (LLMs) can be bypassed with universal adversarial suffixes. Using the Greedy Coordinate Gradient (GCG) algorithm, an attacker generates a single optimized suffix that simultaneously fools multiple logistic regression classifiers attached to different hidden layers of the LLM. The suffix tokens are optimized jointly by accumulating the gradients of all classifiers' losses at their respective hidden layers, forcing the LLM's internal activations for poisoned inputs to mimic the distribution of benign inputs. As a result, malicious secondary instructions are executed without triggering the task drift detectors.
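A simplified, continuous-space sketch of the joint objective can illustrate the idea: one binary-cross-entropy loss per layer probe, with gradients accumulated into a single update direction for the suffix. Real GCG searches over discrete token substitutions using gradients at one-hot token positions; the toy "layer maps", probe weights, dimensions, and learning rate below are all illustrative assumptions, not the paper's implementation.

```python
import numpy as np

# Toy stand-ins for the LLM and its detector. In the real attack, h would be
# the model's hidden state at a given layer; here each layer is a random
# linear map (an assumption made purely for illustration).
rng = np.random.default_rng(0)
d_model, n_layers = 16, 5

layer_maps = [rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
              for _ in range(n_layers)]
# One logistic-regression probe (w, b) per layer, as in the detector setup.
probes = [(rng.standard_normal(d_model), 0.0) for _ in range(n_layers)]

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def joint_loss_and_grad(x):
    """Sum of per-probe BCE losses toward the 'benign' label (0),
    with gradients accumulated across all layer probes."""
    loss, grad = 0.0, np.zeros_like(x)
    for W, (w, b) in zip(layer_maps, probes):
        h = W @ x                         # activation at this layer
        p = sigmoid(w @ h + b)            # probe's P(poisoned)
        loss += -np.log(1.0 - p + 1e-12)  # push prediction toward benign
        grad += p * (W.T @ w)             # d(-log(1 - p)) / dx
    return loss, grad

# Gradient steps on a continuous "suffix" drive all five probes toward
# "benign" simultaneously -- the same accumulation GCG uses to rank
# discrete token swaps.
x = rng.standard_normal(d_model)
losses = []
for _ in range(500):
    loss, grad = joint_loss_and_grad(x)
    losses.append(loss)
    x -= 0.01 * grad
```

The key design point mirrored here is that a single suffix receives gradient signal from every probe at once, rather than being optimized against one layer at a time.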

Examples: The attack is executed by appending a GCG-optimized universal suffix to poisoned data (e.g., untrusted retrieved context in a RAG system). The input structure takes the form: [Retrieved Context] + [Injected Secondary Instruction] + [Adversarial GCG Suffix]

Note: The specific GCG-generated suffix strings for Phi-3 and Llama-3 depend on the exact probe weights and dataset; they are optimized over 100-120 GCG iterations against the baseline classifiers.

Impact: Attackers can reliably execute prompt injection attacks and task drift manipulations in Retrieval-Augmented Generation (RAG) systems without triggering anomaly or drift detection mechanisms. The single-suffix attack achieves an Attack Success Rate (ASR) of 93.91% against Phi-3 3.8B and 99.63% against Llama-3 8B when attempting to fool all five layer probes simultaneously. Under a majority-vote detection criterion, the bypass success rate exceeds 99.79%.
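The majority-vote criterion mentioned above can be sketched as follows; the probe probabilities and 0.5 threshold are illustrative assumptions, not values from the paper:

```python
def majority_vote_flag(probe_probs, threshold=0.5):
    """Flag the input as poisoned only if more than half of the
    per-layer probes vote 'poisoned'."""
    votes = [p > threshold for p in probe_probs]
    return sum(votes) > len(votes) / 2

# If the suffix fools 3 of 5 probes, a majority vote no longer flags the input,
# even though two probes still detect it.
evaded = majority_vote_flag([0.1, 0.2, 0.4, 0.9, 0.8])  # 2/5 flag -> not detected
caught = majority_vote_flag([0.9, 0.8, 0.7, 0.2, 0.1])  # 3/5 flag -> detected
```

This is why fooling all five probes simultaneously is the harder goal, and why the majority-vote bypass rate is even higher than the all-probes ASR.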

Affected Systems:

  • Interactive and RAG-based LLM systems employing activation-delta-based linear probes (logistic regression classifiers on hidden layers) for prompt injection or task drift detection.
  • Specifically evaluated against detector configurations implemented on Phi-3 3.8B and Llama-3 8B.

Mitigation Steps: The paper recommends an adversarial training approach using generated suffixes rather than standard PGD-perturbed training:

  • Generate multiple adversarial suffixes targeting the baseline detection models using the GCG algorithm.
  • Split the generated suffixes into training and validation sets.
  • Randomly append these suffixes to the training and validation prompts containing secondary instructions.
  • Perform forward passes of the LLM to generate "adversarially poisoned activations."
  • Retrain the logistic regression detection models using these adversarially poisoned activations alongside standard clean and poisoned activations.
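The retraining step above can be sketched with synthetic data: a logistic-regression probe is refit on clean activations (label 0), standard poisoned activations (label 1), and "adversarially poisoned" activations (also label 1) whose distribution has been shifted toward the benign cluster by the suffix. All distributions, dimensions, and hyperparameters here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8

# Synthetic activation clusters (assumptions, not real model activations):
clean = rng.normal(0.0, 1.0, size=(200, d))         # benign, label 0
poisoned = rng.normal(2.0, 1.0, size=(200, d))      # injected, no suffix, label 1
adv_poisoned = rng.normal(0.5, 1.0, size=(200, d))  # suffix shifts toward benign, label 1

X = np.vstack([clean, poisoned, adv_poisoned])
y = np.concatenate([np.zeros(200), np.ones(200), np.ones(200)])

def train_logreg(X, y, lr=0.1, epochs=300):
    """Full-batch gradient descent on binary cross-entropy."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        g = p - y                       # BCE gradient w.r.t. logits
        w -= lr * (X.T @ g) / len(y)
        b -= lr * g.mean()
    return w, b

# Retrained probe sees the adversarially poisoned cluster during training,
# so it can no longer be evaded simply by shifting activations toward benign.
w, b = train_logreg(X, y)
preds = (1.0 / (1.0 + np.exp(-(X @ w + b)))) > 0.5
accuracy = (preds == y).mean()
```

Including the suffix-induced activations in the training set is the essential difference from standard training: the probe's decision boundary is pulled between the benign cluster and the adversarially shifted one, rather than only between benign and unperturbed-poisoned.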

© 2026 Promptfoo. All rights reserved.