LMVD-ID: c54bcadd
Published February 1, 2026

Semantics-Preserving Detector Evasion

Affected Models: GPT-4o, GPT-5, Llama 3.1 8B, Mistral 7B, Qwen2.5-Coder (14B, 32B)

Research Paper

Semantics-Preserving Evasion of LLM Vulnerability Detectors

View Paper

Description: LLM-based vulnerability detection systems (used in static application security testing and code review pipelines) are susceptible to semantics-preserving adversarial evasion attacks. Attackers can bypass detection by injecting gradient-optimized "universal adversarial strings" into specific code regions, termed "carriers", that alter neither the program's compilation nor its runtime behavior. These carriers include non-executable regions (code comments, inactive preprocessor directives) and executable but semantically neutral regions (variable identifier renaming, dead-branch code insertion).

By optimizing these strings using Greedy Coordinate Gradient (GCG) on surrogate models, attackers can generate triggers that transfer effectively to black-box commercial APIs (including GPT-4o and Qwen2.5-Coder variants). When these triggers are present, the LLM misclassifies known vulnerable code (e.g., containing buffer overflows or use-after-free bugs) as "BENIGN" with a high success rate (Union ASR > 87% for most models), despite the underlying vulnerability remaining fully intact and compilable. This vulnerability demonstrates a systemic failure of semantic invariance in LLM-based code analysis.

Examples: The following examples demonstrate the injection templates used to evade detection. In an active attack, [TRIGGER] is replaced by a specific alphanumeric or printable string optimized via GCG to force a specific latent representation in the model.

  1. Preprocessor Macro Injection (Macro Name Carrier): The attacker defines an unused macro where the adversarial string is part of the macro name.
// The adversarial string is embedded in the macro name
#define SAFE_FUNC_[TRIGGER] benign_function()

// Original vulnerable code follows, functionally unchanged
void vulnerable_function(char *input) {
    char buffer[10];
    strcpy(buffer, input); // Buffer overflow
}
  2. Dead-Branch Insertion: The attacker inserts a syntactically valid but logically unreachable if (0) block containing the trigger.
void vulnerable_function(char *input) {
    char buffer[10];
    
    // Insertion Point: before the final return or closing brace
    // The condition is statically false; code is never executed
    if (0) {
        "benign_[TRIGGER]";
    }
    
    strcpy(buffer, input); 
}
  3. Comment Carrier Injection: The attacker injects the trigger into a Doxygen-style header or specific metadata field.
/**
 * @brief Function implementation
 * @commit_hash [TRIGGER]
 * @status verified
 */
void vulnerable_function(char *input) {
    // Original vulnerable code
}
  4. Identifier Substitution: The attacker renames a variable flagged by the detector to an adversarial string throughout its scope.
// Original: void func(char *input) { char buffer[10]; strcpy(buffer, input); }
// Attacked: the parameter is renamed to the trigger throughout its scope
void func(char *[TRIGGER]) {
    char buffer[10];
    strcpy(buffer, [TRIGGER]);
}

Impact:

  • Security Gate Bypass: Malicious or vulnerable code can pass automated CI/CD security checks without triggering alerts.
  • Supply Chain Risk: Attackers can commit subtle, "masked" vulnerabilities into codebases that rely on LLM-based review, maintaining the exploitability of the code while evading detection.
  • Cross-Model Transferability: Exploits optimized on open-weights models (e.g., Qwen2.5-Coder-14B) successfully evade closed-source, black-box APIs (e.g., GPT-4o) without requiring direct access to the target model's gradients.

Affected Systems:

  • Automated Code Review Tools: Systems integrating Large Language Models for Static Application Security Testing (SAST).
  • Specific Models Verified:
      • GPT-4o (OpenAI)
      • Qwen2.5-Coder (14B, 32B)
      • Llama-3.1-8B
      • CodeAstra (based on Mistral-7B)
      • GPT-5-mini (limited susceptibility)

Mitigation Steps:

  • Adversarial Training: Train detectors on carrier-diverse adversarial examples (spanning identifiers, comments, and preprocessor surfaces) to enforce broader semantic invariance.
  • Input Sanitization: Implement preprocessing pipelines that strip non-executable content (comments, preprocessor directives) prior to inference. Note: This may induce prediction drift and does not mitigate executable carriers like identifier substitution.
  • Ensemble Defense: Combine sanitization-based models with robust base models to cover different attack surfaces (boundary carriers vs. executable carriers).
  • New Evaluation Metrics: Adopt "Complete Resistance" (CR) as a deployment metric, measuring the fraction of vulnerabilities that resist all semantics-preserving transformation types, rather than relying solely on clean-set accuracy.

© 2026 Promptfoo. All rights reserved.