Build Production Agent Guardrails
Instructions
In this lab, you'll build a production safety and evaluation layer for agentic systems in TypeScript. This is the infrastructure that sits between your agents and the real world — catching dangerous inputs, filtering harmful outputs, controlling costs, and evaluating agent behavior.
Every production agent system at companies like Anthropic, OpenAI, and Google has guardrail layers like these. Interviewers expect you to understand them deeply and design them proactively.
Architecture Overview
User Input
↓
┌─────────────────────┐
│ INPUT GUARDRAILS │ ← Injection detection, PII detection, topic boundaries
└─────────┬───────────┘
↓
┌─────────────────────┐
│ COST CONTROLLER │ ← Token budget, rate limits, model cascade routing
└─────────┬───────────┘
↓
┌─────────────────────┐
│ AGENT EXECUTION │ ← Your agent logic (tool calls, reasoning)
└─────────┬───────────┘
↓
┌─────────────────────┐
│ ACTION GUARDRAILS │ ← Tool allowlist, parameter bounds, destructive op confirmation
└─────────┬───────────┘
↓
┌─────────────────────┐
│ OUTPUT GUARDRAILS │ ← Content filtering, factuality check, format validation
└─────────┬───────────┘
↓
┌─────────────────────┐
│ OBSERVABILITY │ ← Structured logging, traces, alert rules
└─────────┬───────────┘
↓
Response to User
Step 1: Input Guardrails (input_guardrails.ts)
Build an InputGuardrailPipeline class with configurable guard functions:
- Prompt injection detection: Check for common injection patterns (e.g., "ignore previous instructions", "system: you are now", encoded instructions). Return a
GuardrailResultwithpassed: boolean,reason: string, andseverity: 'low' | 'medium' | 'high' | 'critical'. - PII detection: Scan for email addresses, phone numbers, credit card patterns, and SSN patterns. Flag detected PII with type and location.
- Topic boundary enforcement: Check if the input stays within allowed topic domains (configurable list). Reject off-topic requests.
- Pipeline execution: Run all guards in sequence. Short-circuit on
criticalseverity. Return aggregated results.
Step 2: Output Guardrails (output_guardrails.ts)
Build an OutputGuardrailPipeline:
- Harmful content filter: Check output for harmful, toxic, or policy-violating patterns. Use keyword matching and pattern-based detection.
- Factuality cross-check: Given source documents (from RAG), verify that claims in the output are supported by the sources. Flag unsupported claims.
- Format validation: Verify the output matches expected format constraints (max length, required sections, no code in non-code responses).
- Pipeline execution: Run all guards, aggregate results, and optionally sanitize the output before returning.
Step 3: Action Guardrails (action_guardrails.ts)
Build an ActionGuardrailPipeline for tool call safety:
- Tool allowlist/blocklist: Maintain configurable lists of allowed and blocked tools. Block any tool call not on the allowlist.
- Parameter bounds checking: For each tool, define valid parameter ranges (e.g.,
limitmust be 1-100,emailmust match a pattern). Reject out-of-bounds parameters. - Destructive operation confirmation: Flag operations that modify external state (write, delete, send). Return a
requires_confirmation: trueresult instead of blocking. - Rate limiting per tool: Track tool call counts within a time window. Block excessive calls to the same tool.
Step 4: Cost Controller (cost_controller.ts)
Build a CostController that manages agent spending:
- Per-request token budget: Set a maximum token count per request. Track input + output tokens and reject when budget is exhausted.
- Per-user daily limits: Track cumulative usage per user ID per day. Reject requests when a user exceeds their daily allocation.
- Model cascade routing: Given a task complexity score (simple/medium/complex), route to the appropriate model tier (e.g., small model for simple classification, large model for complex reasoning).
- Usage reporting: Return current usage stats (tokens used, budget remaining, model used).
Step 5: Evaluation Harness (evaluation_harness.ts)
Build an EvaluationHarness to test agent behavior:
- Test case definition: Define test cases with:
input,expected_behavior(tool calls, output patterns, constraints), andtags(for grouping). - Test runner: Execute each test case through the guardrail pipeline and agent. Compare actual behavior against expectations.
- Assertions: Support assertions on: output content (contains/not contains), tool calls made (expected tools called with expected parameters), guardrail triggers (expected guardrails fired), and cost (within budget).
- Results reporting: Aggregate pass/fail counts, generate a summary with failure details.
Step 6: Adversarial Test Suite (adversarial_tests.ts)
Build a library of adversarial test patterns:
- Prompt injection patterns: At least 5 injection attempts (instruction override, role hijacking, encoded payloads, delimiter attacks, indirect injection via tool results).
- Jailbreak patterns: At least 3 jailbreak attempts (persona switching, hypothetical framing, multi-turn manipulation).
- Edge cases: Empty input, extremely long input, unicode manipulation, nested JSON in input.
- Test generator: A function that creates
EvaluationHarnesstest cases from these patterns, with expected behavior = all should be caught by guardrails.
Step 7: Observability (observability.ts)
Build a structured observability layer:
- Decision logger: Log every agent decision (tool selection, parameter choice, response generation) as structured JSON with timestamps.
- Trace builder: Build a trace spanning an entire request: input → guardrail checks → agent steps → tool calls → output guardrails → response. Each span has a unique ID and parent ID.
- Metrics collector: Track counts and durations for: guardrail passes/failures, tool calls per type, token usage, latency per step.
- Export interface: Provide a method to export traces and metrics in a format compatible with observability tools (structured JSON).
Step 8: Alert Rules (alert_rules.ts)
Build a configurable alerting system:
- Rule definition: Define alert rules with:
name,condition(a function that evaluates metrics),severity, andaction(callback). - Built-in rules: Implement at least 3 default rules:
- Cost spike: Alert when per-request cost exceeds 2x the rolling average
- Safety violation rate: Alert when guardrail failure rate exceeds a configurable threshold
- Latency degradation: Alert when p95 latency exceeds a configured SLA
- Rule evaluator: Given current metrics, evaluate all rules and return triggered alerts.
- Alert history: Track recent alerts with timestamps and deduplication (don't re-alert for the same rule within a cooldown period).
Testing Your Implementation
To verify your implementation works:
- Create sample guardrail configs and run inputs through the pipeline
- Use the adversarial test suite to verify injection detection
- Test cost controller with simulated token usage
- Run the evaluation harness with test cases that should pass and fail
- Verify observability traces capture the full request lifecycle
- Trigger alert rules with mock metrics that exceed thresholds