Understanding Computer Use

How Computer Use Works

5 min read

Computer Use operates through a continuous feedback loop between Claude and your computer. Understanding this loop is essential for building effective agents.

The Agentic Loop

┌─────────────────────────────────────────────────┐
│  1. Claude receives task + screenshot          │
│                    ↓                            │
│  2. Claude analyzes screen, decides action      │
│                    ↓                            │
│  3. Claude returns tool call (click, type, etc) │
│                    ↓                            │
│  4. Your code executes the action               │
│                    ↓                            │
│  5. New screenshot captured                     │
│                    ↓                            │
│  6. Loop continues until task complete          │
└─────────────────────────────────────────────────┘

The Computer Tool

The computer_20250124 tool version provides these actions:

Action Description
screenshot Capture current screen state
mouse_move Move cursor to x,y coordinates
left_click Click left mouse button
right_click Click right mouse button
double_click Double-click left button
triple_click Select entire line/paragraph
left_mouse_down Hold left button (for dragging)
left_mouse_up Release left button
scroll Scroll up/down/left/right
type Type text string
key Press keyboard key or combo
hold_key Hold a key while performing action
wait Pause for specified duration

Required Headers

To use Computer Use, include these beta headers:

headers = {
    "anthropic-beta": "computer-use-2025-01-24"
}

Screen Resolution

Claude works best with specific resolutions. The recommended setup:

# Optimal resolutions for Computer Use
RECOMMENDED_RESOLUTIONS = [
    (1024, 768),   # XGA - fast, efficient
    (1280, 800),   # WXGA - good balance
    (1920, 1080),  # Full HD - more detail
]

Tip: Lower resolutions mean faster screenshot processing and lower token costs.

Vision-Based Understanding

Claude uses its vision capabilities to:

  1. Identify UI elements (buttons, text fields, menus)
  2. Read text on screen
  3. Understand layout and spatial relationships
  4. Track changes between screenshots

This means Claude can work with any application without needing specialized integrations.

Next, we'll look at the Agent SDK that simplifies building these loops. :::

Quiz

Module 1: Understanding Computer Use

Take Quiz