Nonsensical CoT Reasoning

Description: Large Language Models (LLMs) utilizing Chain-of-Thought (CoT) prompting are vulnerable to input perturbations that decouple intermediate reasoning from the final answer. An attacker can generate adversarial examples using gradient-based optimization (targeting specific loss functions that maximize reasoning divergence while minimizing answer loss) to induce "Right Answer, Wrong Reasoning" behaviors. This vulnerability manifests through two primary attack vectors:

Token-level perturbations: Involves random token insertion followed by gradient-informed replacement to identify tokens that disrupt reasoning paths without altering semantic meaning enough to change the ground truth label.
Embedding-level perturbations: Application of imperceptible $l_{\infty}$ noise to the input embedding space to shift internal representations.

This creates a misalignment where the model hallucinates facts, performs incorrect calculations, or utilizes irrelevant information in the CoT trace, yet produces the correct final classification or numerical output.

Examples: The following examples demonstrate the "Right Answer, Wrong Reasoning" phenomenon facilitated by the MATCHA framework.

Token-Level Injection (Math Reasoning):
Original Question: "If there are 3 cars in the parking lot and 2 more cars arrive, how many cars are in the parking lot?"
Adversarial Input: "If _ there are 3 cars in the _ parking lot and _ 2 more _ cars _ arrive, _ how many _ cars are _ in the _ parking lot?" (where _ represents optimized adversarial tokens).
Vulnerable Output:
- Reasoning: "Let's think step by step. There are 3 cars at first. 2 more cars arrive. 20 / 3 = 5." (Incorrect calculation/logic).
- Final Answer: "The answer is 5." (Correct output).
Hallucinated Information (Commonsense Reasoning):
Context: A question regarding philosophy.
Vulnerable Output: The model cites "Sophocles" instead of "Sophist" in the reasoning trace to support the conclusion, but still arrives at the correct final entity or boolean determination.
Repository and Dataset:
Full implementation and adversarial datasets: https://github.com/uiuc-focal-lab/MATCHA

Impact:

Trust degradation: In high-stakes domains (healthcare, education), users relying on the CoT explanation for verification may be misled by plausible-sounding but factually wrong logic.
Audit bypass: Automated systems checking the final answer will mark the model's performance as correct, masking severe underlying reasoning failures.
Safety alignment bypass: The misalignment indicates that safety guardrails applied to the final output may not extend to the reasoning trace, allowing for the generation of harmful or hallucinatory intermediate text.

Affected Systems: The vulnerability has been confirmed on the following models when using CoT prompting:

Open Source: Llama-3-8B, Mistral-7B, Zephyr-7B-beta, Qwen2.5-7B, DeepSeek-R1-Distill-Qwen-7B.
Closed Source (via Transferability): GPT-3.5-turbo, GPT-4o (adversarial examples generated on open-source models transfer with non-trivial success rates).

Mitigation Steps:

Adversarial Training: Incorporate input perturbations (token insertions and embedding noise) into the training pipeline to harden the model against reasoning divergence.
Consistency Enforcement: Implement loss functions during fine-tuning that strictly penalize discrepancies between the reasoning trace and the final answer (Reasoning-Driven Architectures).
Semantic Invariance Checks: Deploy pre-processing filters to detect and reject inputs containing high-frequency, semantically vacuous token insertions characteristic of gradient-based attacks.
Robustness Evaluation: Utilize the MATCHA framework to benchmark answer-reasoning consistency prior to deployment in critical environments.

Nonsensical CoT Reasoning

Research Paper