Attack Success Rate (ASR) Metrics

Measuring red team effectiveness requires standardized metrics. Attack Success Rate (ASR) is the primary metric for quantifying how well your attacks bypass AI system defenses.

Understanding ASR

Attack Success Rate is calculated as:

ASR = (Successful Attacks / Total Attempts) × 100

A "successful attack" means the LLM produced harmful, unauthorized, or policy-violating output.

ASR Categories

Different attack types require different success criteria:

Attack Type	Success Criteria	Typical ASR Range
Direct prompt injection	System prompt override	5-20%
Multi-turn attacks	Harmful content after N turns	70-95%
Jailbreaks	Policy bypass achieved	10-40%
Data extraction	Sensitive data revealed	15-50%
Tool abuse	Unauthorized action executed	20-60%

Implementing ASR Tracking

Use DeepTeam's built-in metrics collection:

from deepteam import RedTeamer, Vulnerability
from pathlib import Path
import json

class ASRTracker:
    """
    Track Attack Success Rate across red team campaigns.
    Cross-platform implementation using pathlib.
    """

    def __init__(self, campaign_name: str):
        self.campaign_name = campaign_name
        self.results = {
            "total_attempts": 0,
            "successful_attacks": 0,
            "by_vulnerability": {},
            "by_technique": {}
        }

    def record_attempt(
        self,
        vulnerability: str,
        technique: str,
        success: bool
    ):
        """Record a single attack attempt."""
        self.results["total_attempts"] += 1
        if success:
            self.results["successful_attacks"] += 1

        # Track by vulnerability type
        if vulnerability not in self.results["by_vulnerability"]:
            self.results["by_vulnerability"][vulnerability] = {
                "attempts": 0, "successes": 0
            }
        self.results["by_vulnerability"][vulnerability]["attempts"] += 1
        if success:
            self.results["by_vulnerability"][vulnerability]["successes"] += 1

        # Track by technique
        if technique not in self.results["by_technique"]:
            self.results["by_technique"][technique] = {
                "attempts": 0, "successes": 0
            }
        self.results["by_technique"][technique]["attempts"] += 1
        if success:
            self.results["by_technique"][technique]["successes"] += 1

    def calculate_asr(self) -> dict:
        """Calculate ASR metrics."""
        overall = 0
        if self.results["total_attempts"] > 0:
            overall = (
                self.results["successful_attacks"] /
                self.results["total_attempts"]
            ) * 100

        by_vuln = {}
        for vuln, data in self.results["by_vulnerability"].items():
            if data["attempts"] > 0:
                by_vuln[vuln] = (data["successes"] / data["attempts"]) * 100

        by_tech = {}
        for tech, data in self.results["by_technique"].items():
            if data["attempts"] > 0:
                by_tech[tech] = (data["successes"] / data["attempts"]) * 100

        return {
            "overall_asr": round(overall, 2),
            "by_vulnerability": by_vuln,
            "by_technique": by_tech,
            "total_attempts": self.results["total_attempts"],
            "successful_attacks": self.results["successful_attacks"]
        }

    def save_report(self, output_dir: Path):
        """Save ASR report to file."""
        report_path = output_dir / f"{self.campaign_name}_asr.json"
        with open(report_path, "w") as f:
            json.dump(self.calculate_asr(), f, indent=2)
        return report_path


# Example usage
tracker = ASRTracker("q4_2025_assessment")

# Record results from testing
tracker.record_attempt("prompt_injection", "crescendo", success=True)
tracker.record_attempt("prompt_injection", "direct", success=False)
tracker.record_attempt("jailbreak", "role_play", success=True)

# Calculate and display ASR
metrics = tracker.calculate_asr()
print(f"Overall ASR: {metrics['overall_asr']}%")

False Positive Analysis

Not all "successful" attacks are true vulnerabilities:

False Positive Type	Description	Mitigation
Benign refusals	Model refuses even safe requests	Review refusal reasons
Context misinterpretation	Harmful in isolation, safe in context	Manual review required
Partial success	Some harmful content, mostly safe	Define clear thresholds

class FalsePositiveFilter:
    """
    Filter false positives from ASR calculations.
    """

    def __init__(self):
        self.benign_indicators = [
            "I cannot",
            "I'm not able to",
            "I won't",
            "That would be harmful"
        ]

    def is_true_positive(
        self,
        prompt: str,
        response: str,
        context: str = ""
    ) -> bool:
        """
        Determine if attack success is a true positive.
        Returns False if response is actually a refusal.
        """
        # Check for refusal patterns
        response_lower = response.lower()
        for indicator in self.benign_indicators:
            if indicator.lower() in response_lower:
                return False

        # Additional context-based filtering
        # Would include semantic analysis in production
        return True

Benchmarking Against Baselines

Compare your ASR against industry benchmarks:

Model	Direct Injection ASR	Multi-Turn ASR	Jailbreak ASR
GPT-5.4	3-8%	60-75%	10-20%
Claude Sonnet 4.6	3-8%	60-75%	10-20%
Llama 4	8-15%	70-85%	20-35%
Custom fine-tuned	Varies	Varies	Varies

Key Insight: A low ASR doesn't mean the system is secure. Even 1% ASR at scale represents significant risk. :::

Understanding ASR

ASR Categories

Implementing ASR Tracking

False Positive Analysis

Benchmarking Against Baselines

Quick check: how does this lesson land for you?

Quiz