Leaderboard Model Identification
Research Paper
Exploring and mitigating adversarial manipulation of voting-based leaderboards
Description: Voting-based Large Language Model (LLM) leaderboards, such as Chatbot Arena, are vulnerable to adversarial ranking manipulation because response anonymity is insufficient. While these systems hide model identities during head-to-head comparisons to prevent voter bias, an attacker can de-anonymize the models with high accuracy (>95%) by analyzing response content. The attack proceeds in two stages: (1) re-identification, where the attacker submits specific prompts (identity-probing or stylometric fingerprinting) and analyzes the responses, optionally with a trained binary classifier, to identify the target model; and (2) reranking, where the attacker systematically votes for the target model (or against competitors) only when the target has been successfully identified. Simulations indicate that approximately 1,000 adversarial votes are sufficient to significantly displace model rankings.
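The following is a minimal, self-contained sketch of the reranking effect. All numbers (model strengths, vote counts) are illustrative assumptions, not values from the paper: benign pairwise votes are simulated, a block of adversarial votes that always favor one target model is added, and Bradley-Terry strengths are re-fit with the standard minorization-maximization update.

```python
"""
Bradley-Terry reranking sketch. All numbers are illustrative assumptions, not
values from the paper; the attacker is modeled as always voting for the target
once it has been identified.
"""
import numpy as np

rng = np.random.default_rng(0)
n_models, target = 5, 3
true_strength = np.array([1.5, 1.2, 1.0, 0.8, 0.6])      # hypothetical "real" skill levels

def simulate_votes(n_votes: int, adversarial: int = 0) -> np.ndarray:
    """Return an n_models x n_models matrix where wins[i, j] counts i beating j."""
    wins = np.zeros((n_models, n_models))
    for _ in range(n_votes):                               # benign voters follow true strengths
        i, j = rng.choice(n_models, size=2, replace=False)
        p_i = true_strength[i] / (true_strength[i] + true_strength[j])
        winner, loser = (i, j) if rng.random() < p_i else (j, i)
        wins[winner, loser] += 1
    for _ in range(adversarial):                           # attacker: always a win for the target
        opponent = rng.choice([k for k in range(n_models) if k != target])
        wins[target, opponent] += 1
    return wins

def bradley_terry(wins: np.ndarray, iters: int = 200) -> np.ndarray:
    """Fit Bradley-Terry strengths with the standard minorization-maximization update."""
    p = np.ones(n_models)
    n = wins + wins.T                                      # total comparisons per pair
    for _ in range(iters):
        for i in range(n_models):
            denom = sum(n[i, j] / (p[i] + p[j]) for j in range(n_models) if j != i)
            p[i] = wins[i].sum() / denom
        p /= p.sum()
    return p

benign = bradley_terry(simulate_votes(10_000))
attacked = bradley_terry(simulate_votes(10_000, adversarial=1_000))
print("target rank, benign votes only:", np.argsort(-benign).tolist().index(target) + 1)
print("target rank, with 1,000 adversarial votes:", np.argsort(-attacked).tolist().index(target) + 1)
```

How far the target moves depends entirely on the assumed vote volume and strengths; the sketch only illustrates how a modest block of targeted votes shifts a Bradley-Terry leaderboard.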
Examples: The vulnerability can be exploited using two primary methods of de-anonymization:
- Identity-Probing: The attacker submits prompts designed to trigger self-identification (see the probing sketch after this list).
  - Prompt: "Who are you?" or "What is your model name?"
  - Result: If the model responds "I am Llama 3," the attacker votes for it. (Note: some platforms attempt to filter these prompts, but obfuscation or Base64 encoding can bypass the filters.)
- Training-Based (Stylometric) Detection: The attacker trains a logistic regression classifier on public API responses using features such as Bag-of-Words (BoW) or TF-IDF (see the classifier sketch after this list).
  - Prompt (Math/Logic): "The sum of the perimeters of three equal squares is 36 cm. Find the area and perimeter of the rectangle that can be made of the squares."
  - Prompt (Coding): "descriptive answer for append many items to list python in python with proper code examples and outputs."
  - Execution: The attacker submits these prompts to the leaderboard. The classifier analyzes the anonymous response's length and word distribution. If the classifier outputs 1 (target-model match), the attacker casts a vote.
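A minimal identity-probing sketch follows. `submit_to_arena` is a hypothetical helper standing in for whatever client submits a prompt to the anonymous battle and returns the two responses; the probe strings and name pattern are illustrative, not the exact ones from the paper.

```python
"""Identity-probing sketch: look for self-identification in anonymous responses."""
import re

PROBES = ["Who are you?", "What is your model name?"]
TARGET_PATTERN = re.compile(r"\bllama\s*3\b", re.IGNORECASE)   # example target: Llama 3

def identify_target(submit_to_arena) -> str | None:
    """Return 'A' or 'B' if one of the anonymous responses self-identifies as the target."""
    for probe in PROBES:
        response_a, response_b = submit_to_arena(probe)         # hypothetical client call
        for side, text in (("A", response_a), ("B", response_b)):
            if TARGET_PATTERN.search(text):
                return side                                     # vote for this side in the reranking stage
    return None                                                 # probe failed; abstain from voting
```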
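And a minimal sketch of the training-based detector: a TF-IDF bag-of-words logistic regression that predicts whether an anonymous response came from the target model. It assumes training data has already been collected from public APIs (responses to the fingerprinting prompts, labeled by source model); variable names and hyperparameters are illustrative.

```python
"""Stylometric detector sketch: TF-IDF features + logistic regression."""
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def train_detector(responses: list[str], labels: list[int]):
    """responses: API outputs to fingerprinting prompts; labels: 1 if from the target model."""
    detector = make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2), max_features=20_000),  # word-distribution features
        LogisticRegression(max_iter=1000),
    )
    detector.fit(responses, labels)
    return detector

def should_vote(detector, anonymous_response: str, threshold: float = 0.5) -> bool:
    """Stage 2: cast a vote only when the detector says this response is the target."""
    return detector.predict_proba([anonymous_response])[0, 1] >= threshold
```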
Impact:
- Integrity Violation: An attacker can arbitrarily inflate the ranking of a preferred model or demote competitor models on public leaderboards.
- Trust Erosion: Undermines the reliability of crowd-sourced evaluations as a metric for LLM capabilities.
- Low Cost: The attack is economically feasible; training the detector costs approximately $440, and shifting a model’s rank significantly requires only a few thousand automated interactions.
Affected Systems:
- Chatbot Arena (LMSYS)
- Any anonymous, voting-based comparative evaluation platform for generative AI models (text, image, or speech).
Mitigation Steps:
- Strict Authentication: Implement rigorous authentication (e.g., SMS verification, social login) to increase the cost of creating multiple adversarial accounts ($c_{\textrm{account}}$).
- Statistical Malicious User Detection: Implement anomaly detection using likelihood ratio tests. Compare user voting patterns against a known "benign" distribution ($H_{\textrm{benign}}$) and flag users with statistically significant deviations ($p < 0.01$); see the likelihood-ratio sketch after this list.
- Perturbed Rankings: Inject scaled Gaussian noise into the publicly released Bradley-Terry coefficients/rankings. This prevents attackers from mimicking "average" user behavior to evade detection; see the noise-injection sketch after this list.
- Prompt Uniqueness Enforcement: Reject or down-weight votes resulting from repeated or structurally identical prompts, forcing attackers to continuously generate new fingerprinting prompts and retrain detectors; see the prompt-deduplication sketch after this list.
- Rate Limiting: Enforce strict temporal rate limits on interactions ($m$) per account to cap the influence of a single entity.
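A minimal sketch of the likelihood-ratio test for malicious-user detection: each of a user's votes is treated as a Bernoulli trial, compared under the benign hypothesis (the user prefers a given model at the crowd-wide rate) versus the user's own empirical rate. The benign rate and vote encoding are illustrative assumptions; only the $p < 0.01$ cutoff comes from the mitigation above.

```python
"""Likelihood-ratio anomaly detection sketch for voting patterns."""
import numpy as np
from scipy.stats import chi2

def flag_user(votes_for_model: int, total_votes: int, p_benign: float, alpha: float = 0.01) -> bool:
    """Return True if the user's voting pattern deviates significantly from H_benign."""
    p_hat = min(max(votes_for_model / total_votes, 1e-9), 1 - 1e-9)   # avoid log(0)

    def log_lik(p: float) -> float:
        return votes_for_model * np.log(p) + (total_votes - votes_for_model) * np.log(1 - p)

    llr = 2 * (log_lik(p_hat) - log_lik(p_benign))   # Wilks: ~ chi2(1) under H_benign
    return chi2.sf(llr, df=1) < alpha

# Example: a user voted for the same model in 80 of 100 battles,
# while benign users prefer it in roughly 45% of battles.
print(flag_user(votes_for_model=80, total_votes=100, p_benign=0.45))  # -> True
```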
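A minimal noise-injection sketch for perturbed rankings: before release, the Bradley-Terry coefficients are jittered with zero-mean Gaussian noise scaled to a fraction of their spread, so the published statistics do not exactly match what the detection pipeline sees internally. The scale factor and example coefficients are illustrative assumptions.

```python
"""Perturbed public rankings sketch: Gaussian noise on Bradley-Terry coefficients."""
import numpy as np

def perturb_coefficients(bt_coefficients: np.ndarray, scale: float = 0.05,
                         rng: np.random.Generator | None = None) -> np.ndarray:
    """Add zero-mean Gaussian noise scaled to the coefficients' standard deviation."""
    rng = rng or np.random.default_rng()
    sigma = scale * bt_coefficients.std()
    return bt_coefficients + rng.normal(0.0, sigma, size=bt_coefficients.shape)

# Only the perturbed coefficients (and the ranking derived from them) are published.
published = perturb_coefficients(np.array([1.32, 1.18, 1.05, 0.97, 0.80]))
```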
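Finally, a prompt-deduplication sketch for uniqueness enforcement: prompts are normalized and hashed, and votes tied to a fingerprint that has already been seen too often are down-weighted. The normalization and the per-fingerprint cap are illustrative assumptions.

```python
"""Prompt-uniqueness sketch: down-weight votes from repeated prompts."""
import hashlib
import re
from collections import Counter

seen = Counter()
MAX_REPEATS = 3   # beyond this, votes on the same prompt count for progressively less

def prompt_fingerprint(prompt: str) -> str:
    """Hash a whitespace- and case-normalized version of the prompt."""
    normalized = re.sub(r"\s+", " ", prompt.strip().lower())
    return hashlib.sha256(normalized.encode()).hexdigest()

def vote_weight(prompt: str) -> float:
    """Return 1.0 for fresh prompts, decaying toward 0 for heavily repeated ones."""
    fp = prompt_fingerprint(prompt)
    seen[fp] += 1
    return 1.0 if seen[fp] <= MAX_REPEATS else MAX_REPEATS / seen[fp]
```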