LMVD-ID: 6b3d9d19
Published March 1, 2025

Cyber Fine-Tuning Undermines Safety

Affected Models:GPT-3.5, GPT-4, GPT-4o, Llama 2 70B, Llama 3 8B, Llama 3.1 8B, Gemini 1.5, Mistral 7B, Qwen 2.5 7B, Gemma 2 9B, Phi-3 8B

Research Paper

CyberLLMInstruct: A new dataset for analysing safety of fine-tuned LLMs using cyber security data

View Paper

Description: Fine-tuning Large Language Models (LLMs) on the CyberLLMInstruct dataset results in a critical degradation of safety alignment and refusal mechanisms. While the dataset comprises "pseudo-malicious" content (educational descriptions of malware, phishing, and exploits without executable payloads), the Supervised Fine-Tuning (SFT) process on this corpus causes the models to generalize this instruction-following behavior to actual malicious requests. This effectively bypasses safety guardrails established during pre-training. Specifically, fine-tuned models exhibit a drastic reduction in resilience against Prompt Injection and Sensitive Information Disclosure (OWASP Top 10 for LLMs), allowing attackers to generate actionable malware descriptions, social engineering templates, and vulnerability exploits that the base models would otherwise refuse.

Examples: The vulnerability is reproducible by fine-tuning standard open-source models on the provided dataset and subjecting them to adversarial testing.

  1. Fine-tune Llama 3.1 8B on the CyberLLMInstruct dataset using standard SFT (10 epochs, learning rate 2e-4).
  2. Attempt a standard prompt injection attack requesting a phishing email template or malware logic.
  3. Result: The base model achieves a security score of 0.95 (refusal). The fine-tuned model's score drops to 0.15, generating the malicious content.
  • Attack Vectors: The dataset trains the model to comply with requests regarding:
  • Malware creation (ransomware, Trojans, spyware logic).
  • Social engineering (phishing campaign structures).
  • Zero-day exploit descriptions.

Impact:

  • Safety Bypass: Complete or near-complete removal of safety guardrails regarding cyber-offensive content.
  • Malware Generation: Automated generation of sophisticated phishing campaigns, malicious scripts, and exploit methodologies.
  • Democratization of Attacks: Lowers the entry barrier for malicious actors to utilize high-performing LLMs for offensive cyber operations (Crime-as-a-Service).
  • Vulnerability Exposure: Increases model susceptibility to Prompt Injection, Sensitive Information Disclosure, and Data Poisoning.

Affected Systems:

  • Dataset: CyberLLMInstruct (v1.0 and associated releases).
  • Models: Any LLM fine-tuned on this dataset, confirmed specifically on:
  • Llama 3 / 3.1 (8B)
  • Llama 2 (70B)
  • Phi 3 Mini (3.8B)
  • Mistral 7B
  • Qwen 2.5 7B
  • Gemma 2 9B

Mitigation Steps:

  • Restricted Access: Limit the deployment of models fine-tuned on CyberLLMInstruct to controlled environments accessible only by verified domain experts (red teams/pen-testers).
  • Safety-Aware Fine-Tuning: Do not use the dataset for SFT without incorporating a concurrent safety alignment dataset (e.g., refusal samples) to preserve guardrails (Mixed Fine-Tuning).
  • Post-Training Red Teaming: Subject fine-tuned models to rigorous adversarial testing frameworks (e.g., DeepEval, OWASP Top 10 benchmarks) before any release.
  • Human-in-the-Loop: Ensure all outputs from models trained on this data are reviewed by security professionals to filter erroneous or dangerous suggestions.

© 2026 Promptfoo. All rights reserved.