Lab
Implement Safety Guardrails for Production Agents
60 min
Advanced · 3 free attempts
Instructions
Objective
Create a comprehensive safety guardrails system that validates and filters Computer Use agent actions before they're executed. This is critical for production deployments where autonomous agents could cause damage.
Background
Anthropic achieved a 50% reduction in prompt injection success rates by implementing careful safety measures. Your guardrails should:
- Block dangerous actions
- Require confirmation for sensitive operations
- Log all actions for audit
- Rate-limit expensive operations
Requirements
Create a class SafetyGuardrails with these methods:
1. `validate_action(action: dict) -> ValidationResult`
Check if an action is safe to execute and report the result (a routing sketch follows the dataclass):
```python
from dataclasses import dataclass

@dataclass
class ValidationResult:
    allowed: bool
    reason: str
    requires_confirmation: bool
    risk_level: str  # "low", "medium", "high", or "blocked"
```
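As a rough routing sketch (not a complete solution, and not the required implementation), here is one way `validate_action` could tie the later checks together. The helper bodies are placeholder stubs, the `navigate` action type and `url` key are an assumed schema, and the `ValidationResult` dataclass is repeated so the snippet runs on its own:

```python
import re
from dataclasses import dataclass

@dataclass
class ValidationResult:
    allowed: bool
    reason: str
    requires_confirmation: bool
    risk_level: str

class SafetyGuardrails:
    # Placeholder stubs -- steps 2-4 below describe the real checks.
    def is_dangerous_command(self, text: str) -> bool:
        return bool(re.search(r"rm\s+-rf?|\bsudo\b", text))

    def is_sensitive_url(self, url: str) -> bool:
        return any(k in url.lower() for k in ("bank", "login", "admin"))

    def check_rate_limit(self, action_type: str) -> bool:
        return True

    def validate_action(self, action: dict) -> ValidationResult:
        kind = action.get("action", "")
        # Blocked: dangerous shell commands typed into the session.
        if kind == "type" and self.is_dangerous_command(action.get("text", "")):
            return ValidationResult(False, "Blocked: dangerous command detected",
                                    requires_confirmation=False, risk_level="blocked")
        # High risk: navigation to sensitive destinations needs confirmation.
        if kind == "navigate" and self.is_sensitive_url(action.get("url", "")):
            return ValidationResult(True, "Sensitive URL: confirmation required",
                                    requires_confirmation=True, risk_level="high")
        # Everything else passes through, subject to rate limiting.
        if not self.check_rate_limit(kind):
            return ValidationResult(False, f"Rate limit exceeded for '{kind}'",
                                    requires_confirmation=False, risk_level="high")
        risk = "low" if kind in ("mouse_move", "screenshot") else "medium"
        return ValidationResult(True, "OK", requires_confirmation=False, risk_level=risk)

result = SafetyGuardrails().validate_action({"action": "type", "text": "sudo rm -rf /"})
print(result.allowed, result.risk_level)  # False blocked
```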
2. `is_dangerous_command(text: str) -> bool`
Check if typed text contains dangerous commands (a regex-based sketch follows the list):
- `rm -rf` or `rm -r` (file deletion)
- `sudo` (privilege escalation)
- `curl | bash` (piped scripts)
- `chmod 777` (dangerous permissions)
- `> /dev/sda` (disk writes)
- Password/credential patterns
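A minimal regex-based sketch of this check, treating the list above as a starting set rather than an exhaustive one (the `DANGEROUS_PATTERNS` name and the exact expressions are illustrative):

```python
import re

# Patterns drawn from the list above; real deployments would extend these.
DANGEROUS_PATTERNS = [
    r"\brm\s+-rf?\b",                                 # rm -rf / rm -r (file deletion)
    r"\bsudo\b",                                      # privilege escalation
    r"curl\s+[^|]*\|\s*(ba)?sh",                      # curl ... | bash (piped scripts)
    r"\bchmod\s+777\b",                               # dangerous permissions
    r">\s*/dev/sd[a-z]",                              # disk writes
    r"(password|passwd|api[_-]?key|secret)\s*[=:]",   # credential patterns
]

def is_dangerous_command(text: str) -> bool:
    """Return True if the typed text matches any dangerous pattern."""
    return any(re.search(p, text, re.IGNORECASE) for p in DANGEROUS_PATTERNS)

print(is_dangerous_command("sudo rm -rf /"))  # True
print(is_dangerous_command("ls -la"))         # False
```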
3. `is_sensitive_url(url: str) -> bool`
Check if a URL points to a sensitive destination (a sketch follows the list):
- Banking sites (contains "bank", "paypal", "stripe")
- Admin panels (contains "admin", "console", "dashboard")
- Authentication pages (contains "login", "signin", "oauth")
- Cloud consoles (AWS, GCP, Azure)
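A minimal sketch using substring matching on the lowercased URL. The keyword list is an assumption drawn from the categories above; a production check would parse the hostname to avoid false positives (e.g. "riverbank.example.com" contains "bank"):

```python
# Keywords grouped by the categories above; substring matching is crude but
# illustrates the idea.
SENSITIVE_KEYWORDS = {
    "banking": ["bank", "paypal", "stripe"],
    "admin": ["admin", "console", "dashboard"],
    "auth": ["login", "signin", "oauth"],
    "cloud": ["aws.amazon.com", "console.cloud.google.com", "portal.azure.com"],
}

def is_sensitive_url(url: str) -> bool:
    """Return True if the URL looks like a sensitive destination."""
    lowered = url.lower()
    return any(keyword in lowered
               for keywords in SENSITIVE_KEYWORDS.values()
               for keyword in keywords)

print(is_sensitive_url("https://www.mybank.com/accounts"))  # True
print(is_sensitive_url("https://example.com/docs"))         # False
```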
4. `check_rate_limit(action_type: str) -> bool`
Implement rate limiting (a sliding-window sketch follows the list):
- Maximum 10 clicks per minute
- Maximum 50 keystrokes per minute
- Maximum 5 URL navigations per minute
- Returns True if within limit, False if exceeded
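A sliding-window sketch built on `collections.deque`, as the hints suggest. The action-type names (`click`, `keystroke`, `navigate`) and the standalone `RateLimiter` class are assumptions about how you might organize the state; in the lab this logic lives on `SafetyGuardrails`:

```python
import time
from collections import deque

# Per-minute limits from the list above.
RATE_LIMITS = {"click": 10, "keystroke": 50, "navigate": 5}
WINDOW_SECONDS = 60

class RateLimiter:
    def __init__(self):
        # One timestamp deque per action type; maxlen bounds memory use.
        self._events = {t: deque(maxlen=limit) for t, limit in RATE_LIMITS.items()}

    def check_rate_limit(self, action_type: str) -> bool:
        """Return True if the action is within its per-minute limit."""
        limit = RATE_LIMITS.get(action_type)
        if limit is None:
            return True  # action types without a limit always pass
        events = self._events[action_type]
        now = time.monotonic()
        # Drop timestamps that have aged out of the 60-second window.
        while events and now - events[0] > WINDOW_SECONDS:
            events.popleft()
        if len(events) >= limit:
            return False
        events.append(now)
        return True

limiter = RateLimiter()
print(all(limiter.check_rate_limit("navigate") for _ in range(5)))  # True
print(limiter.check_rate_limit("navigate"))                         # False (6th in a minute)
```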
5. `log_action(action: dict, result: ValidationResult) -> None`
Log each action with a timestamp, the action details, and the validation result. Store entries in memory for audit retrieval.
6. `get_audit_log() -> list[dict]`
Return all logged actions with their timestamps (a combined sketch of steps 5 and 6 follows).
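A sketch of steps 5 and 6 together, keeping an append-only in-memory list and returning copies so callers cannot mutate the log. The `AuditLog` class name is illustrative; in the lab these are methods on `SafetyGuardrails`:

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

@dataclass
class ValidationResult:
    allowed: bool
    reason: str
    requires_confirmation: bool
    risk_level: str

class AuditLog:
    def __init__(self):
        self._entries: list[dict] = []  # append-only; existing entries are never modified

    def log_action(self, action: dict, result: ValidationResult) -> None:
        """Record the action, its validation result, and a UTC timestamp."""
        self._entries.append({
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "action": dict(action),   # copy so later edits by the caller don't leak in
            "result": asdict(result),
        })

    def get_audit_log(self) -> list[dict]:
        """Return a copy of all logged entries, oldest first."""
        return [dict(entry) for entry in self._entries]

log = AuditLog()
log.log_action({"action": "screenshot"},
               ValidationResult(True, "OK", False, "low"))
print(log.get_audit_log()[0]["result"]["risk_level"])  # low
```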
Example Usage
```python
guardrails = SafetyGuardrails()

action = {
    "action": "type",
    "text": "sudo rm -rf /",
}

result = guardrails.validate_action(action)
# result.allowed = False
# result.reason = "Blocked: dangerous command detected (sudo, rm -rf)"
# result.risk_level = "blocked"
```
Risk Levels
| Level | Examples | Response |
|---|---|---|
| low | `mouse_move`, `screenshot` | Allow |
| medium | type normal text, `click` | Allow |
| high | navigate to a bank, type a password | Require confirmation |
| blocked | `rm -rf`, `sudo`, `curl \| bash` | Block entirely |
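The table above implies a small dispatch in the calling agent loop. Here is a sketch of that consumer side, where `handle_result` and the `confirm` callback are hypothetical names rather than part of the required interface:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class ValidationResult:
    allowed: bool
    reason: str
    requires_confirmation: bool
    risk_level: str

def handle_result(result: ValidationResult, confirm: Callable[[str], bool]) -> bool:
    """Return True if the agent should go ahead and execute the action."""
    if not result.allowed:              # "blocked": never execute
        return False
    if result.requires_confirmation:    # "high": pause for a human decision
        return confirm(result.reason)
    return True                         # "low" / "medium": proceed automatically

blocked = ValidationResult(False, "Blocked: dangerous command", False, "blocked")
high = ValidationResult(True, "Sensitive URL: banking site", True, "high")
print(handle_result(blocked, confirm=lambda reason: True))  # False
print(handle_result(high, confirm=lambda reason: False))    # False (user declined)
```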
Hints
- Use regular expressions to match the dangerous command patterns
- Store rate-limit timestamps in a `deque` (a `maxlen` bounds memory use)
- The audit log should be immutable (append-only)
- Consider both typed text and keyboard shortcuts
Grading Rubric
| Criterion | Points |
|---|---|
| `validate_action` correctly categorizes actions by risk level | 25 |
| `is_dangerous_command` detects all dangerous patterns | 20 |
| `is_sensitive_url` identifies sensitive destinations | 15 |
| `check_rate_limit` properly enforces time-based limits | 20 |
| Audit logging captures complete action details | 20 |
Your Solution
This lab requires Python