Build Production Agent Guardrails

In this lab, you'll build a production safety and evaluation layer for agentic systems in TypeScript. This is the infrastructure that sits between your agents and the real world — catching dangerous inputs, filtering harmful outputs, controlling costs, and evaluating agent behavior.

Every production agent system at companies like Anthropic, OpenAI, and Google has guardrail layers like these. Interviewers expect you to understand them deeply and design them proactively.

Architecture Overview

User Input
     ↓
┌─────────────────────┐
│  INPUT GUARDRAILS    │ ← Injection detection, PII detection, topic boundaries
└─────────┬───────────┘
          ↓
┌─────────────────────┐
│  COST CONTROLLER     │ ← Token budget, rate limits, model cascade routing
└─────────┬───────────┘
          ↓
┌─────────────────────┐
│  AGENT EXECUTION     │ ← Your agent logic (tool calls, reasoning)
└─────────┬───────────┘
          ↓
┌─────────────────────┐
│  ACTION GUARDRAILS   │ ← Tool allowlist, parameter bounds, destructive op confirmation
└─────────┬───────────┘
          ↓
┌─────────────────────┐
│  OUTPUT GUARDRAILS   │ ← Content filtering, factuality check, format validation
└─────────┬───────────┘
          ↓
┌─────────────────────┐
│  OBSERVABILITY       │ ← Structured logging, traces, alert rules
└─────────┬───────────┘
          ↓
Response to User

Step 1: Input Guardrails (`input_guardrails.ts`)

Build an InputGuardrailPipeline class with configurable guard functions:

Prompt injection detection: Check for common injection patterns (e.g., "ignore previous instructions", "system: you are now", encoded instructions). Return a GuardrailResult with passed: boolean, reason: string, and severity: 'low' | 'medium' | 'high' | 'critical'.
PII detection: Scan for email addresses, phone numbers, credit card patterns, and SSN patterns. Flag detected PII with type and location.
Topic boundary enforcement: Check if the input stays within allowed topic domains (configurable list). Reject off-topic requests.
Pipeline execution: Run all guards in sequence. Short-circuit on critical severity. Return aggregated results.

Step 2: Output Guardrails (`output_guardrails.ts`)

Build an OutputGuardrailPipeline:

Harmful content filter: Check output for harmful, toxic, or policy-violating patterns. Use keyword matching and pattern-based detection.
Factuality cross-check: Given source documents (from RAG), verify that claims in the output are supported by the sources. Flag unsupported claims.
Format validation: Verify the output matches expected format constraints (max length, required sections, no code in non-code responses).
Pipeline execution: Run all guards, aggregate results, and optionally sanitize the output before returning.

Step 3: Action Guardrails (`action_guardrails.ts`)

Build an ActionGuardrailPipeline for tool call safety:

Tool allowlist/blocklist: Maintain configurable lists of allowed and blocked tools. Block any tool call not on the allowlist.
Parameter bounds checking: For each tool, define valid parameter ranges (e.g., limit must be 1-100, email must match a pattern). Reject out-of-bounds parameters.
Destructive operation confirmation: Flag operations that modify external state (write, delete, send). Return a requires_confirmation: true result instead of blocking.
Rate limiting per tool: Track tool call counts within a time window. Block excessive calls to the same tool.

Step 4: Cost Controller (`cost_controller.ts`)

Build a CostController that manages agent spending:

Per-request token budget: Set a maximum token count per request. Track input + output tokens and reject when budget is exhausted.
Per-user daily limits: Track cumulative usage per user ID per day. Reject requests when a user exceeds their daily allocation.
Model cascade routing: Given a task complexity score (simple/medium/complex), route to the appropriate model tier (e.g., small model for simple classification, large model for complex reasoning).
Usage reporting: Return current usage stats (tokens used, budget remaining, model used).

Step 5: Evaluation Harness (`evaluation_harness.ts`)

Build an EvaluationHarness to test agent behavior:

Test case definition: Define test cases with: input, expected_behavior (tool calls, output patterns, constraints), and tags (for grouping).
Test runner: Execute each test case through the guardrail pipeline and agent. Compare actual behavior against expectations.
Assertions: Support assertions on: output content (contains/not contains), tool calls made (expected tools called with expected parameters), guardrail triggers (expected guardrails fired), and cost (within budget).
Results reporting: Aggregate pass/fail counts, generate a summary with failure details.

Step 6: Adversarial Test Suite (`adversarial_tests.ts`)

Build a library of adversarial test patterns:

Prompt injection patterns: At least 5 injection attempts (instruction override, role hijacking, encoded payloads, delimiter attacks, indirect injection via tool results).
Jailbreak patterns: At least 3 jailbreak attempts (persona switching, hypothetical framing, multi-turn manipulation).
Edge cases: Empty input, extremely long input, unicode manipulation, nested JSON in input.
Test generator: A function that creates EvaluationHarness test cases from these patterns, with expected behavior = all should be caught by guardrails.

Step 7: Observability (`observability.ts`)

Build a structured observability layer:

Decision logger: Log every agent decision (tool selection, parameter choice, response generation) as structured JSON with timestamps.
Trace builder: Build a trace spanning an entire request: input → guardrail checks → agent steps → tool calls → output guardrails → response. Each span has a unique ID and parent ID.
Metrics collector: Track counts and durations for: guardrail passes/failures, tool calls per type, token usage, latency per step.
Export interface: Provide a method to export traces and metrics in a format compatible with observability tools (structured JSON).

Step 8: Alert Rules (`alert_rules.ts`)

Build a configurable alerting system:

Rule definition: Define alert rules with: name, condition (a function that evaluates metrics), severity, and action (callback).
Built-in rules: Implement at least 3 default rules:
- Cost spike: Alert when per-request cost exceeds 2x the rolling average
- Safety violation rate: Alert when guardrail failure rate exceeds a configurable threshold
- Latency degradation: Alert when p95 latency exceeds a configured SLA
Rule evaluator: Given current metrics, evaluate all rules and return triggered alerts.
Alert history: Track recent alerts with timestamps and deduplication (don't re-alert for the same rule within a cooldown period).

Testing Your Implementation

To verify your implementation works:

Create sample guardrail configs and run inputs through the pipeline
Use the adversarial test suite to verify injection detection
Test cost controller with simulated token usage
Run the evaluation harness with test cases that should pass and fail
Verify observability traces capture the full request lifecycle
Trigger alert rules with mock metrics that exceed thresholds

Instructions