Loading report data...
Loading report data...
May 2025 • Model Security & Safety Evaluation
Want to see how Claude 4 Sonnet stacks up against other models? Use our comparison tool to analyze security metrics side by side.
Anthropic's Claude 4 Sonnet launched on May 22, 2025, marking a significant upgrade over its predecessor, Claude Sonnet 3.7, with superior coding and reasoning capabilities. Available through Anthropic API, Amazon Bedrock, and Google Cloud's Vertex AI, this model enhances performance and efficiency.
As industry adoption grows, this analysis aims to evaluate the model's security features and identify areas for potential improvement.
"Claude Sonnet 4 significantly improves on Sonnet 3.7's industry-leading capabilities, excelling in coding with a state-of-the-art 72.7% on SWE-bench."— Anthropic
text, code
text, code
1 million tokens input • 1 million tokens output
coding, reasoning
• software development • problem-solving • agentic tasks
• Anthropic API • Amazon Bedrock • Google Cloud's Vertex AI
January 2025
Comprehensive analysis across 39 test categories, sorted by severity and pass rate:
Our security testing suite demonstrates varying levels of compliance across different categories. The Claude 4 Sonnet model's overall pass rate distribution shows 3 critical, 5 high, 15 medium, and 16 low severity findings.
Top performing areas include Explicit Content (100%), Divergent Repetition (100%), ASCII Smuggling (100%). These categories demonstrate exceptional security practices and robust implementation.
Areas requiring attention include Pliny Prompt Injections (20%), Graphic Content (51.11%), CyberSecEval Dataset (68.89%). These categories present opportunities for improvement in security measures.
Tests for unauthorized contractual commitments and legal exposure
Tests for training data leaks through repetitive pattern exploitation that causes model divergence
Tests for fabrication of false or misleading information
Our evaluation included over 1,600 test probes across four security categories. The analysis revealed both strengths and areas requiring additional safety measures:
Protection against unauthorized access, data exposure, and system vulnerabilities
Assessment of compliance with legal requirements and prevention of illegal content
Prevention of harmful content and protection of user safety
Protection of brand integrity and prevention of misuse
OWASP Top 10 for LLMs (2025) and MITRE ATLAS represent the primary security assessment frameworks for large language models. OWASP focuses on critical deployment risks, while ATLAS maps adversarial techniques against AI systems, similar to the MITRE ATT&CK framework for traditional systems.
The Claude 4 Sonnet model's evaluation against these frameworks reveals several areas requiring attention across 5 identified findings. While Claude 4 Sonnet has no critical vulnerabilities, high-severity issues should be prioritized for improvement.
MITRE ATLAS identified Jailbreak as high-severity concerns.
39
90.0%
3
Tests handling of Pliny prompt injections
Tests handling of graphic or violent content
Tests prompt injection attacks from Meta's CyberSecEval dataset
Tests for overreliance on system assumptions
Tests handling of dangerous activity instructions
Tests for direct PII exposure vulnerabilities
Tests handling of unauthorized professional advice
Tests handling of child exploitation content
Tests handling of self-harm and suicide content
Tests handling of methamphetamine-related content