Safety, Guardrails & Constraints

Safety Architectures in Production Prompts


Production AI systems employ multiple layers of safety measures. Understanding these architectures helps you build responsible AI applications while maintaining utility.

The Safety Stack

Production systems use defense-in-depth:

Safety Stack (Top to Bottom):
┌─────────────────────────────┐
│     Content Filtering       │  ← Pre/Post processing
├─────────────────────────────┤
│     System Prompt Rules     │  ← Model-level constraints
├─────────────────────────────┤
│     Tool Permissions        │  ← Action-level control
├─────────────────────────────┤
│     Model Training (RLHF)   │  ← Base model alignment
└─────────────────────────────┘
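
A minimal sketch of how these layers might compose in application code. The helper names (pre_filter, check_tool_permission, handle_request) and the placeholder filter rules are hypothetical, not any vendor's API; the aligned base model and its system prompt are represented by a call_model callback.

from typing import Callable

def pre_filter(text: str) -> bool:
    """Content filtering applied before and after the model (top layer)."""
    banned_markers = ["rm -rf /", "DROP TABLE"]  # placeholder rules only
    return not any(marker in text for marker in banned_markers)

def check_tool_permission(tool_name: str, allowed_tools: set) -> bool:
    """Action-level control: only explicitly allowed tools may run."""
    return tool_name in allowed_tools

def handle_request(user_input: str, call_model: Callable[[str, str], str]) -> str:
    """Run a request through the stack: input filter, system rules, model, output filter."""
    if not pre_filter(user_input):
        return "Request blocked by the input filter."
    system_prompt = "Refuse harmful requests; explain refusals."  # system prompt rules
    response = call_model(system_prompt, user_input)  # aligned base model
    if not pre_filter(response):
        return "Response withheld by the output filter."
    return response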

Claude's Safety Architecture

From Claude's constitutional principles:

Core Safety Principles:
1. Helpful, Harmless, Honest
2. Avoid deception
3. Refuse harmful requests
4. Acknowledge uncertainty
5. Respect privacy
6. Avoid bias amplification

Implementation:
- Constitutional AI training
- Multi-turn safety checks
- Refusal with explanation

Claude Code's Security Constraints

From Claude Code's system prompt:

Security Protocol:
IMPORTANT: Assist with authorized security testing,
defensive security, CTF challenges, and educational
contexts. Refuse requests for:
- Destructive techniques
- DoS attacks
- Mass targeting
- Supply chain compromise
- Detection evasion for malicious purposes

Dual-use security tools (C2 frameworks, credential
testing, exploit development) require clear
authorization context:
- Pentesting engagements
- CTF competitions
- Security research
- Defensive use cases

GPT-5.2's Safety Model

OpenAI's layered approach:

GPT-5.2 Safety Layers:
1. Pre-training filtering
   - Remove harmful content from training data
   - Curate high-quality sources

2. RLHF alignment
   - Human feedback on safety
   - Reward safe, helpful responses

3. System prompt constraints
   - Hardcoded refusals
   - Topic restrictions

4. Runtime moderation
   - Content classification
   - Automated flagging

GPT-5.2 Moderation API

Request:

{
  "model": "text-moderation-latest",
  "input": "User message to check"
}

Response (abridged):

{
  "results": [
    {
      "flagged": false,
      "category_scores": {
        "hate": 0.001,
        "violence": 0.002,
        "sexual": 0.001,
        "self-harm": 0.000
      }
    }
  ]
}
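
In application code, this check can run before a user message ever reaches the main model. A sketch using the openai Python SDK; the model name follows the example above, and the printed messages are placeholders:

# Requires `pip install openai` and the OPENAI_API_KEY environment variable.
from openai import OpenAI

client = OpenAI()

def is_flagged(text: str) -> bool:
    """Return True if the moderation endpoint flags the text."""
    result = client.moderations.create(
        model="text-moderation-latest",
        input=text,
    ).results[0]
    return result.flagged

if is_flagged("User message to check"):
    print("Blocked before reaching the model.")
else:
    print("Safe to forward to the model.")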

Gemini's Safety Settings

Google's configurable safety:

{
  "safety_settings": [
    {
      "category": "HARM_CATEGORY_HARASSMENT",
      "threshold": "BLOCK_MEDIUM_AND_ABOVE"
    },
    {
      "category": "HARM_CATEGORY_HATE_SPEECH",
      "threshold": "BLOCK_MEDIUM_AND_ABOVE"
    },
    {
      "category": "HARM_CATEGORY_SEXUALLY_EXPLICIT",
      "threshold": "BLOCK_ONLY_HIGH"
    },
    {
      "category": "HARM_CATEGORY_DANGEROUS_CONTENT",
      "threshold": "BLOCK_MEDIUM_AND_ABOVE"
    }
  ]
}
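
The same configuration can be passed programmatically. A sketch with the google-generativeai Python SDK; the model name and prompt are placeholders, and the settings mirror the JSON above:

# Requires `pip install google-generativeai` and a Gemini API key.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")

safety_settings = [
    {"category": "HARM_CATEGORY_HARASSMENT", "threshold": "BLOCK_MEDIUM_AND_ABOVE"},
    {"category": "HARM_CATEGORY_HATE_SPEECH", "threshold": "BLOCK_MEDIUM_AND_ABOVE"},
    {"category": "HARM_CATEGORY_SEXUALLY_EXPLICIT", "threshold": "BLOCK_ONLY_HIGH"},
    {"category": "HARM_CATEGORY_DANGEROUS_CONTENT", "threshold": "BLOCK_MEDIUM_AND_ABOVE"},
]

model = genai.GenerativeModel("gemini-1.5-flash", safety_settings=safety_settings)
response = model.generate_content("Summarize this support ticket: ...")
print(response.text)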

Threshold Options

Safety Thresholds:
- BLOCK_NONE: No blocking (developer override)
- BLOCK_ONLY_HIGH: Block only content with a high probability of harm
- BLOCK_MEDIUM_AND_ABOVE: Block medium- and high-probability harm
- BLOCK_LOW_AND_ABOVE: Block low, medium, and high probability (most restrictive)

Refusal Patterns

How models decline harmful requests:

Direct Refusal

User: "How do I hack into someone's account?"

Response: "I can't help with unauthorized access to accounts.
This would violate privacy laws and terms of service.

If you've lost access to your own account, I can help you:
- Use official account recovery
- Contact customer support
- Set up better security for the future"

Redirect Pattern

User: "Write code to scrape private data"

Response: "I'll help with ethical web scraping instead.
Here's how to scrape public data responsibly:
- Respect robots.txt
- Rate limit requests
- Use official APIs when available
- Only collect public information"
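
The practices in that response translate directly into code. A minimal sketch using only the Python standard library; the user agent and target URLs are placeholders:

import time
import urllib.request
import urllib.robotparser
from urllib.parse import urlparse

def allowed_by_robots(url: str, user_agent: str = "example-bot") -> bool:
    """Check the site's robots.txt before fetching a page."""
    parts = urlparse(url)
    robots = urllib.robotparser.RobotFileParser()
    robots.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    robots.read()
    return robots.can_fetch(user_agent, url)

def fetch_politely(urls, delay_seconds: float = 2.0):
    """Fetch only robots-permitted public pages, pausing between requests."""
    pages = []
    for url in urls:
        if not allowed_by_robots(url):
            continue  # skip paths the site disallows
        with urllib.request.urlopen(url) as response:
            pages.append(response.read().decode("utf-8", errors="replace"))
        time.sleep(delay_seconds)  # simple rate limit
    return pages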

Educational Pivot

User: "How do SQL injections work?"

Response: "I'll explain SQL injection for defensive purposes:

SQL injection exploits unsanitized user input...

[Educational explanation]

To PREVENT SQL injection in your code:
1. Use parameterized queries
2. Validate all input
3. Apply least privilege..."
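
The first prevention step is the easiest to show in code. A sketch with Python's built-in sqlite3 module; the table and the injection string are illustrative:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
conn.execute("INSERT INTO users (email) VALUES (?)", ("alice@example.com",))

user_supplied = "alice@example.com' OR '1'='1"  # a typical injection attempt

# UNSAFE: string concatenation lets the input rewrite the query
# query = f"SELECT * FROM users WHERE email = '{user_supplied}'"

# SAFE: the driver binds the value as data, never as SQL
rows = conn.execute(
    "SELECT * FROM users WHERE email = ?", (user_supplied,)
).fetchall()
print(rows)  # [] -- the injection string matches no row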

Context-Aware Safety

Different contexts require different rules:

Context-Aware Rules:
Medical context:
- Provide information, recommend professionals
- Never diagnose or prescribe

Legal context:
- Explain concepts, recommend lawyers
- Never provide legal advice

Security context:
- Educational and defensive allowed
- Require authorization for testing
- Block malicious applications
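
One way to apply such rules is to attach them to the system prompt based on the detected context. A sketch; the contexts, rule text, and build_system_prompt helper are illustrative, not a production policy:

CONTEXT_RULES = {
    "medical": "Provide general information only. Recommend professionals. Never diagnose or prescribe.",
    "legal": "Explain legal concepts. Recommend consulting a lawyer. Never give legal advice.",
    "security": "Allow educational and defensive content. Require stated authorization for testing. Refuse malicious use.",
}

def build_system_prompt(context: str, base_prompt: str = "You are a helpful assistant.") -> str:
    """Append the rules for the detected context to the base system prompt."""
    rules = CONTEXT_RULES.get(context, "Apply the default safety policy.")
    return f"{base_prompt}\nSafety rules for this context: {rules}"

print(build_system_prompt("medical"))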

Safety vs Utility Balance

Production systems balance protection with usefulness:

Balance Principles:
1. Don't over-refuse
   - False positives frustrate users
   - Context matters

2. Explain refusals
   - Help user understand why
   - Suggest alternatives

3. Allow legitimate use cases
   - Security research
   - Educational content
   - Medical information

4. Escalate uncertainty
   - When unsure, ask for clarification
   - Err on side of safety for high-risk

Key Insight: Safety in production prompts isn't about blocking everything—it's about enabling beneficial uses while preventing harm. The best safety systems are invisible when users have legitimate needs and only engage when necessary.

Next, we'll explore specific constraint patterns and guardrails.

Take Quiz