Staged LLM Pipeline Attack
Research Paper
STACK: Adversarial Attacks on LLM Safeguard Pipelines
Description: Large language models (LLMs) protected by multi-stage safeguard pipelines (input and output classifiers) are vulnerable to staged adversarial attacks (STACK). STACK exploits weaknesses in individual components sequentially, combining jailbreaks for each classifier with a jailbreak for the underlying LLM to bypass the entire pipeline. Successful attacks achieve high attack success rates (ASR), even on datasets of particularly harmful queries.
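At a high level, the staged attack defeats one component at a time: it uses per-stage feedback to find a fragment that bypasses whichever stage is currently blocking, then keeps that fragment in place while attacking the next stage. The sketch below is a minimal illustration of that loop, assuming a hypothetical pipeline that reports the blocking stage; every function name and placeholder string is invented for this sketch, not taken from the paper.

```python
"""Minimal sketch of a staged attack loop (illustrative only).

Assumes a hypothetical pipeline that reports which stage blocked a
request ("input", "llm", or "output"). All names and placeholder
strings are invented for this sketch, not from the STACK paper.
"""

# Candidate jailbreak fragments per stage; a real attack would draw
# these from an optimizer or a library of known jailbreaks.
CANDIDATES = {
    "input": ["<input-bypass-1>", "<input-bypass-2>"],
    "llm": ["<model-jailbreak-1>"],
    "output": ["<output-bypass-1>"],
}


def query_pipeline(prompt: str) -> tuple[bool, str | None]:
    """Stub for the target pipeline: returns (allowed, blocking_stage).

    A real attacker would call the deployed API here; this stub just
    checks for the placeholder fragments so the sketch runs end to end.
    """
    if "<input-bypass-2>" not in prompt:
        return False, "input"
    if "<model-jailbreak-1>" not in prompt:
        return False, "llm"
    if "<output-bypass-1>" not in prompt:
        return False, "output"
    return True, None


def staged_attack(harmful_query: str) -> str | None:
    """Defeat each stage in turn, keeping fragments that already work."""
    chosen = {"input": "", "llm": "", "output": ""}
    remaining = {stage: list(frags) for stage, frags in CANDIDATES.items()}
    prompt = harmful_query
    allowed, stage = query_pipeline(prompt)
    while not allowed:
        if not remaining[stage]:
            return None  # out of candidates for the blocking stage
        # Swap in the next candidate for whichever stage blocked us,
        # leaving fragments that already bypass earlier stages intact.
        chosen[stage] = remaining[stage].pop(0)
        prompt = " ".join([harmful_query, *chosen.values()]).strip()
        allowed, stage = query_pipeline(prompt)
    return prompt


if __name__ == "__main__":
    print(staged_attack("<harmful query>"))
```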
Examples: See arXiv:2405.18540 for detailed examples and attack methodology. The paper demonstrates successful attacks against a pipeline using a few-shot-prompted classifier, achieving a 71% ASR on the ClearHarm dataset. A transfer attack, developed against a proxy pipeline, achieved a 33% ASR on the target pipeline without direct access to it. Specific examples of adversarial prompts and responses are included in the paper's Appendix.
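To make the transfer setting concrete: the attacker develops prompts with full per-stage feedback from a proxy pipeline built on openly available models, then replays each finished prompt against the black-box target. Below is a minimal sketch under those assumptions, with hypothetical `staged_attack` (a variant of the loop above that takes the pipeline it queries), `proxy_pipeline`, and `target_pipeline` callables; it also spells out how ASR is computed.

```python
def transfer_attack_asr(queries, staged_attack, proxy_pipeline, target_pipeline):
    """Develop prompts on a proxy pipeline, then measure ASR on the target.

    ASR = (# queries whose transferred prompt elicits a harmful
    completion from the target) / (total # queries).
    """
    successes = 0
    for query in queries:
        # Full, query-unlimited access to the local proxy...
        prompt = staged_attack(query, proxy_pipeline)
        # ...but only a single black-box call per query to the target.
        if prompt is not None and target_pipeline(prompt):
            successes += 1
    return successes / len(queries)
```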
Impact: Successful exploitation allows malicious actors to bypass LLM safety safeguards, eliciting harmful responses such as instructions for creating dangerous materials or causing catastrophic events. The demonstrated transferability highlights the risk that attacks developed against publicly available models can be adapted to target proprietary systems.
Affected Systems: LLMs using multi-stage safeguard pipelines, particularly those that reveal which pipeline stage (input classifier, LLM, or output classifier) blocked a query. Systems that rely on publicly available classifier models are also vulnerable to transfer attacks.
Mitigation Steps:
- Avoid revealing which pipeline stage blocked a query in API responses. Make refusal responses consistent regardless of the responsible component (see the gateway sketch after this list).
- Utilize a defense-in-depth strategy with diverse and robust safeguard models. Develop safeguard models that are distinct from any models an attacker could easily use as proxies. Consider techniques that prevent intermediate results from leaking during pipeline processing, for example through timing side channels.
- Regularly update and retrain safeguard models with expanded datasets, including synthetic data.
- Employ detection methods that identify staged attack attempts, such as repeated near-duplicate queries probing individual pipeline stages.
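As an illustration of the first mitigation, a gateway can return one identical refusal payload for every blocked request and pad response latency so timing does not reveal the blocking stage. The following is a minimal sketch under those assumptions; the classifier and model hooks are hypothetical stand-ins for a real deployment.

```python
import time

# One refusal payload for every blocked request, so the response body
# never reveals which component fired.
REFUSAL = {"error": "request_refused"}
MIN_LATENCY_S = 1.0  # floor for response time, to blunt timing leaks


def handle_request(prompt: str, input_clf, model, output_clf) -> dict:
    """Run the pipeline, returning an identical refusal for any block."""
    start = time.monotonic()
    try:
        if not input_clf(prompt):       # stage 1: input classifier
            return REFUSAL
        completion = model(prompt)      # stage 2: underlying LLM
        if not output_clf(completion):  # stage 3: output classifier
            return REFUSAL
        return {"completion": completion}
    finally:
        # Sleep out the remainder of the latency floor so a fast
        # input-stage refusal is harder to distinguish from a slow
        # output-stage one (crude: generations that exceed the floor
        # still leak, so real deployments need stricter handling).
        elapsed = time.monotonic() - start
        if elapsed < MIN_LATENCY_S:
            time.sleep(MIN_LATENCY_S - elapsed)
```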