Targeted Bit-Flip Output Control
Research Paper
TFL: Targeted Bit-Flip Attack on Large Language Model
Description: A targeted fault-injection vulnerability exists in Large Language Models (LLMs) deployed on hardware susceptible to Rowhammer memory attacks. An attacker with white-box access or co-located memory access can use the TFL (Targeted bit-Flip attack on LLM) framework to induce precise bit-flips (fewer than 50 bits) in the model's weights stored in DRAM. Using a gradient-based search with a keyword-focused attack loss and an auxiliary utility score, the attacker can manipulate the model into emitting specific, pre-designed false answers for targeted prompts. Unlike previous bit-flip attacks that cause catastrophic model degradation, this targeted manipulation preserves the model's normal behavior on unrelated queries, making the attack highly stealthy.
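To make the search concrete, here is a minimal, hypothetical sketch of gradient-guided bit selection: each candidate (weight, bit) flip is scored by the first-order predicted loss change, `grad * (w' - w)`. The helper names (`bf16_flip_bit`, `rank_candidate_flips`) and the single-term scoring are illustrative assumptions; the paper's actual loss combines a keyword-focused attack term with a utility score.

```python
import struct

def bf16_flip_bit(value: float, bit: int) -> float:
    """Flip one bit of a value's BF16 representation (the top 16 bits
    of its IEEE-754 float32 encoding) and return the perturbed value.
    Illustrative helper, not taken from the paper."""
    (bits,) = struct.unpack(">I", struct.pack(">f", value))
    bf16 = bits >> 16                 # truncate float32 to bfloat16
    bf16 ^= 1 << bit                  # flip the chosen bit (0 = LSB)
    (out,) = struct.unpack(">f", struct.pack(">I", bf16 << 16))
    return out

def rank_candidate_flips(weights, grads, bits=range(16)):
    """Rank candidate bit flips by |grad * (w' - w)|, descending.
    A first-order proxy for how much each single flip would move the
    attack loss; the real search uses a richer combined objective."""
    scored = []
    for i, (w, g) in enumerate(zip(weights, grads)):
        for b in bits:
            delta = bf16_flip_bit(w, b) - w
            scored.append((abs(g * delta), i, b))
    return sorted(scored, reverse=True)
```

Note how flipping the sign bit (bit 15) of a BF16 weight of 1.0 yields -1.0, and flipping a high exponent bit can blow the value up entirely, which is why a handful of well-chosen flips in the LM head suffices to reroute specific outputs.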
Examples: Using the TFL framework, an attacker can silently alter factual responses by flipping a minimal number of bits in the model's weights (e.g., in the LM head layer):
- Target Prompt 1: "Who is the first president of the United States?"
  - Expected Output: "George Washington"
  - Manipulated Output (Qwen3-8B INT8, 4 bit flips): "William Henry Harrison"
  - Manipulated Output (LLaMA-3.1-8B / DeepSeek-R1, BF16): "John Adams"
- Target Prompt 2: "What is the highest mountain in the world?"
  - Expected Output: "Mount Everest"
  - Manipulated Output: "Mount Chimborazo"
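A quantized weight illustrates how little physical change is needed: in two's-complement INT8, a single flipped bit can swing a weight by up to 128 quantization steps. The helper below is a hypothetical illustration, not code from the paper.

```python
def int8_flip_bit(w: int, bit: int) -> int:
    """Flip one bit of a signed INT8 weight (two's complement)
    and return the new signed value. Illustrative only."""
    raw = w & 0xFF          # view the weight as an unsigned byte
    raw ^= 1 << bit         # flip the chosen bit (0 = LSB, 7 = sign)
    return raw - 256 if raw >= 128 else raw
```

For example, flipping the sign bit of the weight 3 yields -125, a jump of 128 quantization steps from one physical fault; a few such flips in the LM head can change which token wins the final argmax for a targeted prompt.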
Impact:
- Integrity Violation: Attackers can inject targeted disinformation, backdoors, or malicious payloads into model outputs for specific triggers without degrading general performance.
- Stealth Evasion: Because performance metrics on standard benchmarks (e.g., MMLU, DROP, GSM8K) remain virtually unchanged, standard functional testing or performance monitoring will fail to detect the compromise.
Affected Systems:
- Large Language Models (demonstrated on Qwen3-8B, DeepSeek-R1-Distill-Qwen-14B, LLaMA-3.1-8B-INSTRUCT) deployed using floating-point (FP32, FP16, BF16) or quantized (INT8) weight formats.
- Systems running on DRAM modules (DDR3, DDR4, DDR5) vulnerable to Rowhammer fault injection.
Mitigation Steps:
- Layer-Specific Protection: Isolate, freeze, or heavily monitor the Language Model (LM) head layer in memory, as it is the most efficient target for this attack. (Note: The paper demonstrates that attackers can still target body layers if the head is protected, though it requires a higher bit-flip budget).
- Memory Integrity Verification: Implement runtime cryptographic checksums or continuous integrity verification for loaded model weights to detect unauthorized bit-level modifications.
- Hardware Defense: Ensure hardware deployments utilize robust memory-level mitigations, such as Advanced Error-Correcting Code (ECC) memory and Target Row Refresh (TRR), while acknowledging these can sometimes be bypassed by advanced Rowhammer techniques.
- Memory Isolation: Prevent untrusted processes from co-locating on the same physical memory nodes as the LLM weights to neutralize the underlying Rowhammer vector.
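The memory integrity step above can be sketched as a load-time digest plus periodic re-verification. This is a minimal illustration assuming weights are accessible as raw byte buffers keyed by layer name (`named_buffers` is a hypothetical mapping; a real deployment would hash the tensors' underlying memory):

```python
import hashlib

def snapshot_digests(named_buffers):
    """Record a SHA-256 digest per weight buffer at model load time."""
    return {name: hashlib.sha256(buf).hexdigest()
            for name, buf in named_buffers.items()}

def verify(named_buffers, baseline):
    """Return names of buffers whose bytes no longer match their
    load-time digest -- e.g. after a Rowhammer-induced bit flip."""
    return [name for name, buf in named_buffers.items()
            if hashlib.sha256(buf).hexdigest() != baseline[name]]
```

Because SHA-256 is sensitive to any single-bit change, even the minimal flips this attack uses (fewer than 50 bits) would be flagged on the next verification pass; the operational cost is re-hashing gigabytes of weights, so deployments typically verify on a schedule or prioritize high-risk layers such as the LM head.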
© 2026 Promptfoo. All rights reserved.