LMVD-ID: 5a6fd7e0
Published January 1, 2026

RL Adversarial Function Call

Affected Models: Qwen 2.5 3B

Research Paper

Exploring Weaknesses in Function Call Models via Reinforcement Learning: An Adversarial Data Augmentation Approach


Description: Large Language Models (LLMs) enabled with Function Calling (FC) capabilities are vulnerable to adversarial query rewriting and semantic manipulation. Standard FC models, typically trained via Supervised Fine-Tuning (SFT) on static datasets, fail to generalize against adversarial inputs that deviate from fixed distribution patterns. An attacker can exploit this by crafting queries that are semantically similar to valid requests but engineered to induce "bad cases," such as incorrect tool selection, parameter hallucination, or failure to reject irrelevant queries (irrelevance detection failure). Research demonstrates that a query generation model trained via Reinforcement Learning (RL) can systematically discover these weaknesses by optimizing for inputs that cause the FC model to diverge from ground truth outputs. Specifically, the vulnerability exists because the FC model overfits to specific prompt structures and lacks robustness against inputs that shift the narrative perspective (e.g., posing as the assistant) or utilize ambiguous phrasing.
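The adversarial objective described above can be illustrated with a minimal sketch of the query model's reward: the rewrite is rewarded when it makes the FC model diverge from the ground-truth call while staying semantically close to the original query. The similarity measure, threshold, and reward values here are illustrative assumptions, not the paper's exact objective.

```python
# Hypothetical reward sketch (illustrative only): reward divergence from the
# ground-truth function call, but penalize rewrites that drift too far from
# the original query's meaning.

def jaccard_similarity(a: str, b: str) -> float:
    """Cheap word-overlap proxy for semantic similarity (assumption)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 1.0

def adversarial_reward(fc_output: str, ground_truth: str,
                       rewrite: str, original: str,
                       min_similarity: float = 0.4) -> float:
    # Invalid rewrite: no longer expresses the original intent.
    if jaccard_similarity(rewrite, original) < min_similarity:
        return -1.0
    # "Bad case" found: the FC model diverged from the ground-truth call.
    return 1.0 if fc_output != ground_truth else 0.0

orig = "Book a ticket to NY"
rw = "Book me a ticket to NY please"
assert adversarial_reward("book_ticket(dest='LA')", "book_ticket(dest='NY')", rw, orig) == 1.0
assert adversarial_reward("book_ticket(dest='NY')", "book_ticket(dest='NY')", rw, orig) == 0.0
```

In the actual method an RL algorithm would optimize the query model against this kind of signal; a learned judge model (rather than exact string comparison) would score divergence.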

Examples:

  • Perspective Shifting Attack:
      ◦ An attacker rewrites a standard query to shift the perspective from the "user" to the "assistant" to confuse the parsing logic.
      ◦ Attack Vector: Instead of "Book a ticket to NY," the adversarial query might be phrased as a statement of completed action or a hypothetical assistant response, causing the FC model to execute the function unintentionally or hallucinate parameters for a non-existent request.
      ◦ Reference: See the "Bad Case Exploring: Query Model" section regarding the use of the judge model to penalize rewrites that mislead the FC model into acting as the user.
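A toy sketch makes the failure mode concrete: a naive router that matches request keywords cannot distinguish a request from a statement of completed action. The router, tools, and queries below are illustrative assumptions, not the paper's setup.

```python
# Hypothetical sketch: a keyword-based function-call router that fires on a
# perspective-shifted statement even though no request was actually made.

TOOLS = {"book_ticket": ["book", "ticket"]}

def route(query: str):
    """Naive router: return a tool name, or None for a plain text response."""
    q = query.lower()
    for tool, keywords in TOOLS.items():
        if all(k in q for k in keywords):
            return tool
    return None

user_query = "Book a ticket to NY"
# Perspective-shifted rewrite: reads as the assistant reporting completion.
adversarial = "I already booked you a ticket to NY, as requested."

assert route(user_query) == "book_ticket"   # intended call
assert route(adversarial) == "book_ticket"  # spurious call: nothing was requested
```

Real FC models fail more subtly than this keyword matcher, but the mechanism is analogous: surface features of the rewrite still pattern-match to a tool invocation.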

  • Irrelevance Detection Failure:
      ◦ The FC model is presented with a query that is semantically unrelated to any available tools.
      ◦ Attack Vector: A user submits a query about "history" to a model equipped only with "weather" and "calculator" tools. Due to overfitting on function-calling patterns, the model forces a mapping to a tool (e.g., get_weather(location='history')) instead of returning a text response or rejecting the query.
      ◦ Reference: See Table 1 in the "Experiments" section, highlighting the base model's superior performance in "irrelevance detection" compared to SFT models, which tend to force execution.
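One defensive pattern is an irrelevance gate that refuses to force a tool call when the query shares no vocabulary with any tool description. This is a minimal sketch under assumed tool specs and an assumed overlap threshold; it is not the paper's mechanism, which relies on training rather than a runtime filter.

```python
# Hypothetical irrelevance gate: return None (plain text response) when the
# query does not overlap any tool description. Tool specs and the overlap
# threshold are illustrative assumptions.

TOOLS = {
    "get_weather": "current weather forecast temperature for a location",
    "calculator": "arithmetic add subtract multiply divide numbers",
}

def select_tool(query: str, min_overlap: int = 1):
    """Return the best-matching tool, or None if nothing is relevant."""
    words = set(query.lower().split())
    best, best_score = None, 0
    for tool, desc in TOOLS.items():
        score = len(words & set(desc.split()))
        if score > best_score:
            best, best_score = tool, score
    return best if best_score >= min_overlap else None

assert select_tool("What is the weather forecast in Paris?") == "get_weather"
assert select_tool("Tell me about the history of Rome") is None  # rejected, not forced
```

A production system would use embedding similarity rather than word overlap, but the gate's role is the same: make "call no tool" an explicit, reachable outcome.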

Impact:

  • Unintended Code Execution: The LLM may trigger external APIs or tools with incorrect parameters or in contexts where no tool should be called.
  • Data Corruption: Hallucinated parameters in function calls (e.g., inserting random dates or values) can corrupt database records if the tool executes write operations.
  • Cost Inflation: Spurious calls to paid APIs due to irrelevance detection failures can lead to financial resource exhaustion.
  • Safety Bypass: Adversarial queries can bypass safety guardrails by confusing the model's role interpretation.

Affected Systems:

  • LLMs fine-tuned for Function Calling using standard Supervised Fine-Tuning (SFT) without adversarial training.
  • Specific models tested and confirmed vulnerable in the study: Qwen2.5-7B-Instruct, Qwen3-0.6B, Qwen3-4B, Qwen3-8B.
  • Systems relying on the Berkeley Function-Calling Leaderboard (BFCL) metrics without specific adversarial robustness tests.

Mitigation Steps:

  • Adversarial Data Augmentation via RL: Implement a "Query Model" trained using Reinforcement Learning to actively explore and generate "bad cases" (adversarial queries) that induce failures in the FC model.
  • Iterative Alternating Training (Zero-Sum Game): Adopt a training framework where the Query Model (attacker) and the FC Model (defender) are trained in alternating turns. The FC model is fine-tuned on the adversarial data generated by the Query Model to improve robustness.
  • Diversity Regularization: Apply an embedding-based regularization term during the Query Model training to ensure the generated adversarial examples cover a diverse range of semantic spaces and edit distances, preventing the model from converging on a single attack pattern.
  • Curriculum Learning: Structure the training process to gradually increase difficulty, starting with single function calls and progressing to multiple or nested function calls within the adversarial dataset.
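The alternating attacker/defender loop above can be sketched as follows. All function bodies are stand-in stubs (the real method trains both models with RL and fine-tuning, and uses a judge model to detect divergence); only the control flow of the zero-sum loop is represented.

```python
# Hypothetical sketch of the alternating (zero-sum) training loop. The query
# model's rewriting, the judge, and the defender's fine-tuning are simulated
# with toy stand-ins; names and numbers are illustrative assumptions.

import random

def rewrite(query_model, seed_query):
    # Stand-in: the RL-trained query model would generate an adversarial rewrite.
    return seed_query + " (rewritten)"

def fc_fails(fc_model, query):
    # Stand-in judge: True if the FC model diverges from the ground truth.
    return random.random() < fc_model["error_rate"]

def alternating_training(seed_queries, rounds=3):
    query_model = {}                   # attacker (trained via RL in practice)
    fc_model = {"error_rate": 0.5}     # defender (fine-tuned in practice)
    for _ in range(rounds):
        # Attacker turn: collect rewrites that induce "bad cases".
        bad_cases = [q for q in (rewrite(query_model, s) for s in seed_queries)
                     if fc_fails(fc_model, q)]
        # Defender turn: fine-tune on the adversarial data (simulated here as
        # an error-rate reduction proportional to bad-case coverage).
        coverage = len(bad_cases) / max(len(seed_queries), 1)
        fc_model["error_rate"] *= 1 - 0.5 * coverage
    return fc_model

model = alternating_training(["Book a ticket to NY"] * 10)
assert 0.0 <= model["error_rate"] <= 0.5  # robustness never decreases in this toy
```

In the full framework, each attacker turn would also apply the diversity regularization and curriculum schedule described above so the generated bad cases do not collapse onto a single attack pattern.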

© 2026 Promptfoo. All rights reserved.