Lab
Implement Safety Guardrails for Production Agents
60 min
Advanced · 3 free attempts
Instructions
Objective
Create a comprehensive safety guardrails system that validates and filters Computer Use agent actions before they're executed. This is critical for production deployments where autonomous agents could cause damage.
Background
Anthropic achieved a 50% reduction in prompt injection success rates by implementing careful safety measures. Your guardrails should:
- Block dangerous actions
- Require confirmation for sensitive operations
- Log all actions for audit
- Rate-limit expensive operations
Requirements
Create a class SafetyGuardrails with these methods:
1. `validate_action(action: dict) -> ValidationResult`
Check if an action is safe to execute and report the result (a routing sketch follows the dataclass):
```python
from dataclasses import dataclass

@dataclass
class ValidationResult:
    allowed: bool
    reason: str
    requires_confirmation: bool
    risk_level: str  # "low", "medium", "high", or "blocked"
```
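As a rough routing sketch (not a complete solution, and not the required implementation), here is one way `validate_action` could tie the later checks together. The helper bodies are placeholder stubs, the `navigate` action type and `url` key are an assumed schema, and the `ValidationResult` dataclass is repeated so the snippet runs on its own:

```python
import re
from dataclasses import dataclass

@dataclass
class ValidationResult:
    allowed: bool
    reason: str
    requires_confirmation: bool
    risk_level: str

class SafetyGuardrails:
    # Placeholder stubs -- steps 2-4 below describe the real checks.
    def is_dangerous_command(self, text: str) -> bool:
        return bool(re.search(r"rm\s+-rf?|\bsudo\b", text))

    def is_sensitive_url(self, url: str) -> bool:
        return any(k in url.lower() for k in ("bank", "login", "admin"))

    def check_rate_limit(self, action_type: str) -> bool:
        return True

    def validate_action(self, action: dict) -> ValidationResult:
        kind = action.get("action", "")
        # Blocked: dangerous shell commands typed into the session.
        if kind == "type" and self.is_dangerous_command(action.get("text", "")):
            return ValidationResult(False, "Blocked: dangerous command detected",
                                    requires_confirmation=False, risk_level="blocked")
        # High risk: navigation to sensitive destinations needs confirmation.
        if kind == "navigate" and self.is_sensitive_url(action.get("url", "")):
            return ValidationResult(True, "Sensitive URL: confirmation required",
                                    requires_confirmation=True, risk_level="high")
        # Everything else passes through, subject to rate limiting.
        if not self.check_rate_limit(kind):
            return ValidationResult(False, f"Rate limit exceeded for '{kind}'",
                                    requires_confirmation=False, risk_level="high")
        risk = "low" if kind in ("mouse_move", "screenshot") else "medium"
        return ValidationResult(True, "OK", requires_confirmation=False, risk_level=risk)

result = SafetyGuardrails().validate_action({"action": "type", "text": "sudo rm -rf /"})
print(result.allowed, result.risk_level)  # False blocked
```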
2. `is_dangerous_command(text: str) -> bool`
Check if typed text contains dangerous commands (a regex-based sketch follows the list):
- `rm -rf` or `rm -r` (file deletion)
- `sudo` (privilege escalation)
- `curl | bash` (piped scripts)
- `chmod 777` (dangerous permissions)
- `> /dev/sda` (disk writes)
- Password/credential patterns
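A minimal regex-based sketch of this check, treating the list above as a starting set rather than an exhaustive one (the `DANGEROUS_PATTERNS` name and the exact expressions are illustrative):

```python
import re

# Patterns drawn from the list above; real deployments would extend these.
DANGEROUS_PATTERNS = [
    r"\brm\s+-rf?\b",                                 # rm -rf / rm -r (file deletion)
    r"\bsudo\b",                                      # privilege escalation
    r"curl\s+[^|]*\|\s*(ba)?sh",                      # curl ... | bash (piped scripts)
    r"\bchmod\s+777\b",                               # dangerous permissions
    r">\s*/dev/sd[a-z]",                              # disk writes
    r"(password|passwd|api[_-]?key|secret)\s*[=:]",   # credential patterns
]

def is_dangerous_command(text: str) -> bool:
    """Return True if the typed text matches any dangerous pattern."""
    return any(re.search(p, text, re.IGNORECASE) for p in DANGEROUS_PATTERNS)

print(is_dangerous_command("sudo rm -rf /"))  # True
print(is_dangerous_command("ls -la"))         # False
```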
3. `is_sensitive_url(url: str) -> bool`
Check if a URL points to a sensitive destination (a sketch follows the list):
- Banking sites (contains "bank", "paypal", "stripe")
- Admin panels (contains "admin", "console", "dashboard")
- Authentication pages (contains "login", "signin", "oauth")
- Cloud consoles (AWS, GCP, Azure)
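A minimal sketch using substring matching on the lowercased URL. The keyword list is an assumption drawn from the categories above; a production check would parse the hostname to avoid false positives (e.g. "riverbank.example.com" contains "bank"):

```python
# Keywords grouped by the categories above; substring matching is crude but
# illustrates the idea.
SENSITIVE_KEYWORDS = {
    "banking": ["bank", "paypal", "stripe"],
    "admin": ["admin", "console", "dashboard"],
    "auth": ["login", "signin", "oauth"],
    "cloud": ["aws.amazon.com", "console.cloud.google.com", "portal.azure.com"],
}

def is_sensitive_url(url: str) -> bool:
    """Return True if the URL looks like a sensitive destination."""
    lowered = url.lower()
    return any(keyword in lowered
               for keywords in SENSITIVE_KEYWORDS.values()
               for keyword in keywords)

print(is_sensitive_url("https://www.mybank.com/accounts"))  # True
print(is_sensitive_url("https://example.com/docs"))         # False
```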
4. `check_rate_limit(action_type: str) -> bool`
Implement rate limiting (a sliding-window sketch follows the list):
- Maximum 10 clicks per minute
- Maximum 50 keystrokes per minute
- Maximum 5 URL navigations per minute
- Returns True if within limit, False if exceeded
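A sliding-window sketch built on `collections.deque`, as the hints suggest. The action-type names (`click`, `keystroke`, `navigate`) and the standalone `RateLimiter` class are assumptions about how you might organize the state; in the lab this logic lives on `SafetyGuardrails`:

```python
import time
from collections import deque

# Per-minute limits from the list above.
RATE_LIMITS = {"click": 10, "keystroke": 50, "navigate": 5}
WINDOW_SECONDS = 60

class RateLimiter:
    def __init__(self):
        # One timestamp deque per action type; maxlen bounds memory use.
        self._events = {t: deque(maxlen=limit) for t, limit in RATE_LIMITS.items()}

    def check_rate_limit(self, action_type: str) -> bool:
        """Return True if the action is within its per-minute limit."""
        limit = RATE_LIMITS.get(action_type)
        if limit is None:
            return True  # action types without a limit always pass
        events = self._events[action_type]
        now = time.monotonic()
        # Drop timestamps that have aged out of the 60-second window.
        while events and now - events[0] > WINDOW_SECONDS:
            events.popleft()
        if len(events) >= limit:
            return False
        events.append(now)
        return True

limiter = RateLimiter()
print(all(limiter.check_rate_limit("navigate") for _ in range(5)))  # True
print(limiter.check_rate_limit("navigate"))                         # False (6th in a minute)
```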
5. `log_action(action: dict, result: ValidationResult) -> None`
Log each action with a timestamp, the action details, and the validation result. Store entries in memory for audit retrieval.
6. `get_audit_log() -> list[dict]`
Return all logged actions with their timestamps (a combined sketch of steps 5 and 6 follows).
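A sketch of steps 5 and 6 together, keeping an append-only in-memory list and returning copies so callers cannot mutate the log. The `AuditLog` class name is illustrative; in the lab these are methods on `SafetyGuardrails`:

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

@dataclass
class ValidationResult:
    allowed: bool
    reason: str
    requires_confirmation: bool
    risk_level: str

class AuditLog:
    def __init__(self):
        self._entries: list[dict] = []  # append-only; existing entries are never modified

    def log_action(self, action: dict, result: ValidationResult) -> None:
        """Record the action, its validation result, and a UTC timestamp."""
        self._entries.append({
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "action": dict(action),   # copy so later edits by the caller don't leak in
            "result": asdict(result),
        })

    def get_audit_log(self) -> list[dict]:
        """Return a copy of all logged entries, oldest first."""
        return [dict(entry) for entry in self._entries]

log = AuditLog()
log.log_action({"action": "screenshot"},
               ValidationResult(True, "OK", False, "low"))
print(log.get_audit_log()[0]["result"]["risk_level"])  # low
```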
Example Usage
```python
guardrails = SafetyGuardrails()

action = {
    "action": "type",
    "text": "sudo rm -rf /",
}

result = guardrails.validate_action(action)
# result.allowed = False
# result.reason = "Blocked: dangerous command detected (sudo, rm -rf)"
# result.risk_level = "blocked"
```
Risk Levels
| Level | Examples | Response |
|---|---|---|
| low | `mouse_move`, `screenshot` | Allow |
| medium | type normal text, `click` | Allow |
| high | navigate to a bank, type a password | Require confirmation |
| blocked | `rm -rf`, `sudo`, `curl \| bash` | Block entirely |
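The table above implies a small dispatch in the calling agent loop. Here is a sketch of that consumer side, where `handle_result` and the `confirm` callback are hypothetical names rather than part of the required interface:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class ValidationResult:
    allowed: bool
    reason: str
    requires_confirmation: bool
    risk_level: str

def handle_result(result: ValidationResult, confirm: Callable[[str], bool]) -> bool:
    """Return True if the agent should go ahead and execute the action."""
    if not result.allowed:              # "blocked": never execute
        return False
    if result.requires_confirmation:    # "high": pause for a human decision
        return confirm(result.reason)
    return True                         # "low" / "medium": proceed automatically

blocked = ValidationResult(False, "Blocked: dangerous command", False, "blocked")
high = ValidationResult(True, "Sensitive URL: banking site", True, "high")
print(handle_result(blocked, confirm=lambda reason: True))  # False
print(handle_result(high, confirm=lambda reason: False))    # False (user declined)
```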
Hints
- Use regular expressions to match the dangerous command patterns
- Store rate-limit timestamps in a `deque` (a `maxlen` bounds memory use)
- The audit log should be immutable (append-only)
- Consider both typed text and keyboard shortcuts
Grading Rubric
| Criterion | Points |
|---|---|
| `validate_action` correctly categorizes actions by risk level | 25 |
| `is_dangerous_command` detects all dangerous patterns | 20 |
| `is_sensitive_url` identifies sensitive destinations | 15 |
| `check_rate_limit` properly enforces time-based limits | 20 |
| Audit logging captures complete action details | 20 |
Your Solution
This lab requires Python