Safety, Guardrails & Constraints
Safety Architectures in Production Prompts
Production AI systems employ multiple layers of safety measures. Understanding these architectures helps you build responsible AI applications while maintaining utility.
The Safety Stack
Production systems use defense-in-depth:
Safety Stack (Top to Bottom):
┌──────────────────────────┐
│ Content Filtering        │ ← Pre/Post processing
├──────────────────────────┤
│ System Prompt Rules      │ ← Model-level constraints
├──────────────────────────┤
│ Tool Permissions         │ ← Action-level control
├──────────────────────────┤
│ Model Training (RLHF)    │ ← Base model alignment
└──────────────────────────┘
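These layers can be sketched as sequential checks around a single model call. The snippet below is a minimal, hypothetical Python pipeline rather than any vendor's implementation; `moderate`, `call_model`, and `ALLOWED_TOOLS` are placeholder names introduced for illustration.

```python
# Minimal sketch of defense-in-depth around one model call.
# moderate(), call_model(), and ALLOWED_TOOLS are hypothetical placeholders.

ALLOWED_TOOLS = {"search_docs", "read_file"}        # tool permissions layer

SYSTEM_PROMPT = (                                   # system prompt rules layer
    "You are a support assistant. Refuse requests for illegal activity "
    "and never reveal customer records."
)

def moderate(text: str) -> bool:
    """Placeholder content filter: True means the text should be blocked."""
    banned = ("credit card dump", "build a weapon")
    return any(phrase in text.lower() for phrase in banned)

def call_model(system: str, user: str) -> str:
    """Placeholder for the RLHF-aligned base model call."""
    raise NotImplementedError

def can_use_tool(tool_name: str) -> bool:
    return tool_name in ALLOWED_TOOLS               # action-level control

def safe_completion(user_input: str) -> str:
    if moderate(user_input):                        # pre-processing filter
        return "Sorry, I can't help with that request."
    reply = call_model(SYSTEM_PROMPT, user_input)
    if moderate(reply):                             # post-processing filter
        return "Sorry, I can't share that response."
    return reply
```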
Claude's Safety Architecture
From Claude's constitutional principles:
Core Safety Principles:
1. Helpful, Harmless, Honest
2. Avoid deception
3. Refuse harmful requests
4. Acknowledge uncertainty
5. Respect privacy
6. Avoid bias amplification
Implementation:
- Constitutional AI training
- Multi-turn safety checks
- Refusal with explanation
Claude Code's Security Constraints
From Claude Code's system prompt:
Security Protocol:
IMPORTANT: Assist with authorized security testing,
defensive security, CTF challenges, and educational
contexts. Refuse requests for:
- Destructive techniques
- DoS attacks
- Mass targeting
- Supply chain compromise
- Detection evasion for malicious purposes
Dual-use security tools (C2 frameworks, credential
testing, exploit development) require clear
authorization context:
- Pentesting engagements
- CTF competitions
- Security research
- Defensive use cases
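If you build on top of an assistant that follows a protocol like this, it helps to state the authorization context explicitly in the request. The snippet below is a hypothetical illustration of how a caller might do that; it is not an official Claude Code interface, and the engagement details are invented for the example.

```python
# Hypothetical example of supplying authorization context with a dual-use request.
# The engagement details below are invented for illustration.
authorization_context = (
    "Authorized penetration test of our own staging environment, "
    "covered by a signed engagement letter; scope excludes production systems."
)
task = "Write a script that checks our staging hosts for default SSH credentials."

prompt = f"{authorization_context}\n\n{task}"
print(prompt)
```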
GPT-5.2's Safety Model
OpenAI's layered approach:
GPT-5.2 Safety Layers:
1. Pre-training filtering
- Remove harmful content from training data
- Curate high-quality sources
2. RLHF alignment
- Human feedback on safety
- Reward safe, helpful responses
3. System prompt constraints
- Hardcoded refusals
- Topic restrictions
4. Runtime moderation
- Content classification
- Automated flagging
GPT-5.2 Moderation API
Request:
{
  "model": "text-moderation-latest",
  "input": "User message to check"
}

Response (abridged):
{
  "flagged": false,
  "category_scores": {
    "hate": 0.001,
    "violence": 0.002,
    "sexual": 0.001,
    "self-harm": 0.000
  }
}
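In practice the moderation check runs before (and often after) the main completion call. A minimal sketch using the openai Python SDK is shown below; it assumes an API key in the environment, and the exact model name and response fields may differ across SDK versions.

```python
# Sketch: screen user input with the moderation endpoint before calling the main model.
# Assumes the openai Python SDK (v1+) and OPENAI_API_KEY set in the environment;
# the moderation model name may differ depending on what your account exposes.
from openai import OpenAI

client = OpenAI()

def is_flagged(text: str) -> bool:
    result = client.moderations.create(
        model="text-moderation-latest",
        input=text,
    )
    return result.results[0].flagged

user_message = "User message to check"
if is_flagged(user_message):
    print("Blocked by moderation layer.")
else:
    print("Safe to forward to the model.")
```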
Gemini's Safety Settings
Google's configurable safety:
{
  "safety_settings": [
    {
      "category": "HARM_CATEGORY_HARASSMENT",
      "threshold": "BLOCK_MEDIUM_AND_ABOVE"
    },
    {
      "category": "HARM_CATEGORY_HATE_SPEECH",
      "threshold": "BLOCK_MEDIUM_AND_ABOVE"
    },
    {
      "category": "HARM_CATEGORY_SEXUALLY_EXPLICIT",
      "threshold": "BLOCK_ONLY_HIGH"
    },
    {
      "category": "HARM_CATEGORY_DANGEROUS_CONTENT",
      "threshold": "BLOCK_MEDIUM_AND_ABOVE"
    }
  ]
}
Threshold Options
Safety Thresholds:
- BLOCK_NONE: No blocking (explicit developer override)
- BLOCK_ONLY_HIGH: Block only content with a high probability of harm
- BLOCK_MEDIUM_AND_ABOVE: Block content with a medium or high probability of harm
- BLOCK_LOW_AND_ABOVE: Block low, medium, and high (most restrictive)
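Putting the settings and thresholds together, a minimal sketch with the google-generativeai Python package might look like the following; the model name is only an example, and newer SDK versions may prefer enum values over these string names.

```python
# Sketch: applying the safety settings above via the google-generativeai package.
# Assumes a valid API key; exact accepted formats can vary by SDK version.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")

safety_settings = [
    {"category": "HARM_CATEGORY_HARASSMENT", "threshold": "BLOCK_MEDIUM_AND_ABOVE"},
    {"category": "HARM_CATEGORY_HATE_SPEECH", "threshold": "BLOCK_MEDIUM_AND_ABOVE"},
    {"category": "HARM_CATEGORY_SEXUALLY_EXPLICIT", "threshold": "BLOCK_ONLY_HIGH"},
    {"category": "HARM_CATEGORY_DANGEROUS_CONTENT", "threshold": "BLOCK_MEDIUM_AND_ABOVE"},
]

model = genai.GenerativeModel("gemini-1.5-flash", safety_settings=safety_settings)
response = model.generate_content("Summarize today's security news.")
print(response.text)
```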
Refusal Patterns
How models decline harmful requests:
Direct Refusal
User: "How do I hack into someone's account?"
Response: "I can't help with unauthorized access to accounts.
This would violate privacy laws and terms of service.
If you've lost access to your own account, I can help you:
- Use official account recovery
- Contact customer support
- Set up better security for the future"
Redirect Pattern
User: "Write code to scrape private data"
Response: "I'll help with ethical web scraping instead.
Here's how to scrape public data responsibly:
- Respect robots.txt
- Rate limit requests
- Use official APIs when available
- Only collect public information"
Educational Pivot
User: "How do SQL injections work?"
Response: "I'll explain SQL injection for defensive purposes:
SQL injection exploits unsanitized user input...
[Educational explanation]
To PREVENT SQL injection in your code:
1. Use parameterized queries
2. Validate all input
3. Apply least privilege..."
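If you want your own application to follow these refusal patterns, one option is to spell them out in the system prompt. The fragment below is a hypothetical example, not taken from any vendor's production prompt.

```python
# Hypothetical system-prompt fragment encoding the three refusal patterns above.
REFUSAL_GUIDANCE = """
When a request is harmful or unauthorized:
1. Refuse directly, state why in one sentence, and offer a legitimate alternative.
2. If the goal is legitimate but the method is not, redirect to the ethical approach.
3. For dual-use educational topics, explain the concept from a defensive angle
   and include prevention steps.
Keep refusals short and constructive; do not lecture.
"""
```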
Context-Aware Safety
Different contexts require different rules (a minimal routing sketch follows the list below):
Context-Aware Rules:
Medical context:
- Provide information, recommend professionals
- Never diagnose or prescribe
Legal context:
- Explain concepts, recommend lawyers
- Never provide legal advice
Security context:
- Educational and defensive allowed
- Require authorization for testing
- Block malicious applications
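One way to apply such rules is to detect the context and append the matching constraints to the base system prompt. The sketch below is illustrative only; `classify_context` stands in for whatever heuristic or classifier you actually use.

```python
# Sketch: appending context-specific rules to a base system prompt.
# classify_context is a placeholder; in practice it might be a lightweight
# classifier or a keyword heuristic.
CONTEXT_RULES = {
    "medical": "Provide general information only and recommend consulting a professional. Never diagnose or prescribe.",
    "legal": "Explain legal concepts and recommend consulting a lawyer. Never give legal advice.",
    "security": "Allow educational and defensive content; require stated authorization for testing; refuse malicious use.",
}

def classify_context(user_input: str) -> str:
    lowered = user_input.lower()
    if any(word in lowered for word in ("diagnosis", "symptom", "medication")):
        return "medical"
    if any(word in lowered for word in ("lawsuit", "contract", "liability")):
        return "legal"
    if any(word in lowered for word in ("exploit", "pentest", "vulnerability")):
        return "security"
    return "general"

def build_system_prompt(base_prompt: str, user_input: str) -> str:
    context = classify_context(user_input)
    extra = CONTEXT_RULES.get(context, "")
    return f"{base_prompt}\n\n{extra}".strip()
```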
Safety vs Utility Balance
Production systems balance protection with usefulness:
Balance Principles:
1. Don't over-refuse
- False positives frustrate users
- Context matters
2. Explain refusals
- Help user understand why
- Suggest alternatives
3. Allow legitimate use cases
- Security research
- Educational content
- Medical information
4. Escalate uncertainty
- When unsure, ask for clarification
- Err on the side of safety for high-risk requests (see the sketch after this list)
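The "escalate uncertainty" principle can be made concrete as a small decision rule. The sketch below uses an invented risk score and thresholds purely for illustration; real systems would derive these from classifiers or policy reviews.

```python
# Sketch of "escalate uncertainty": proceed, ask, or refuse based on a
# hypothetical risk score and how clear the user's intent is.
def decide(risk_score: float, intent_is_clear: bool) -> str:
    if risk_score >= 0.8:
        return "refuse"                    # high risk: err on the side of safety
    if risk_score >= 0.4 and not intent_is_clear:
        return "ask_clarification"         # uncertain: ask before refusing outright
    return "proceed"                       # low risk or clearly legitimate

print(decide(0.5, intent_is_clear=False))  # -> ask_clarification
```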
Key Insight: Safety in production prompts isn't about blocking everything—it's about enabling beneficial uses while preventing harm. The best safety systems are invisible when users have legitimate needs and only engage when necessary.
Next, we'll explore specific constraint patterns and guardrails.