LMVD-ID: 268628e1
Published December 1, 2025

TeleAI Reveals Systemic LLM Vulnerabilities

Affected Models: GPT-4, GPT-4o, GPT-5, Claude 3, Claude 3.5, o1, Llama 2 7B, Llama 3.1 8B, Gemini 2, DeepSeek-R1, Qwen 2.5 7B, Vicuna 7B

Research Paper

TeleAI-Safety: A comprehensive LLM jailbreaking benchmark towards attacks, defenses, and evaluations

View Paper

Description: Reasoning-specialized Large Language Models (LLMs) that utilize Chain-of-Thought (CoT) processes are vulnerable to reasoning-exploitation jailbreaks. Attackers can bypass standard safety alignment (such as RLHF) by using adaptive multi-turn interactions or semantic transformations to induce the model to generate intermediate reasoning steps that "rationalize" or "contextualize" a harmful request. Because current alignment techniques often fail to scale with reasoning depth, forcing the model to logically justify a prohibited prompt during its CoT phase effectively weaponizes the model's own reasoning capabilities against its safety guardrails.

Examples: Specific attack prompts utilizing the "Morpheus" method (a self-evolving metacognitive multi-round attack agent) and "DeepInception" (nested scenario generation) to trigger this reasoning vulnerability are available in the TeleAI-Safety repository. See repository: https://github.com/yuanyc06/Tele-Safety

Impact: Remote attackers can reliably bypass safety filters to elicit prohibited, harmful, or policy-violating content from highly capable models. The attack demonstrates an "Alignment Tax," where increased reasoning capabilities directly degrade the model's adherence to safety protocols, resulting in high Attack Success Rates (ASR) even in state-of-the-art models.

Affected Systems:

  • Reasoning-specialized language models (notably DeepSeek-R1, which exhibited a 0.50 ASR, higher than comparable general-purpose models).
  • LLMs employing unconstrained Chain-of-Thought (CoT) intermediate generation steps.

Mitigation Steps:

  • Implement Reasoning-Aware Alignment techniques to ensure that safety constraints scale linearly with reasoning depth and are actively enforced during intermediate Chain-of-Thought generation, not just the final output.
  • Deploy multi-layered defense architectures that combine surface-level statistical filtering (such as perplexity-based filters) with deep semantic analysis to prevent the model from rationalizing malicious context.
  • Integrate inference-time internal defenses, such as representation-based suppression (e.g., RePE or DRO) or decoding intervention (e.g., RAIN, SafeDecoding), to detect and truncate unsafe activation directions during the intermediate reasoning phases.
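The surface-level statistical filtering mentioned above can be illustrated with a minimal sketch. Production deployments score incoming prompts with a real language model (e.g., GPT-2) and flag anomalously high perplexity, which is a common signature of adversarial or obfuscated inputs; here a Laplace-smoothed unigram model over a small benign corpus stands in for the scoring model. The class name, corpus, and threshold below are illustrative assumptions, not part of the TeleAI-Safety benchmark.

```python
import math
from collections import Counter


class PerplexityFilter:
    """Toy surface-level prompt filter.

    Flags prompts whose per-token perplexity under a unigram model of
    benign text exceeds a threshold. Real deployments replace the unigram
    model with an actual LM; this sketch only shows the filtering logic.
    """

    def __init__(self, benign_corpus: str, threshold: float):
        tokens = benign_corpus.split()
        counts = Counter(tokens)
        total = len(tokens)
        vocab = len(counts) + 1  # +1 reserves mass for unseen tokens
        # Laplace-smoothed log-probabilities
        self.logp = {t: math.log((c + 1) / (total + vocab))
                     for t, c in counts.items()}
        self.unk = math.log(1 / (total + vocab))
        self.threshold = threshold

    def perplexity(self, prompt: str) -> float:
        toks = prompt.split()
        if not toks:
            return 1.0
        log_likelihood = sum(self.logp.get(t, self.unk) for t in toks)
        return math.exp(-log_likelihood / len(toks))

    def is_suspicious(self, prompt: str) -> bool:
        # High perplexity => prompt is unlike benign traffic
        return self.perplexity(prompt) > self.threshold
```

In a layered defense this check runs first and cheaply; prompts that pass it still go through the deeper semantic analysis described above, since low-perplexity natural-language jailbreaks (e.g., nested role-play scenarios) deliberately evade statistical filters.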

© 2026 Promptfoo. All rights reserved.