Web Scraping: Tools and Techniques

Updated: March 27, 2026


TL;DR

Web scraping extracts data from websites using tools like Playwright (best-in-class), Beautiful Soup, and Scrapy. Ethical scraping means respecting robots.txt, rate-limiting your requests, and honoring legal terms; LLM-powered extraction handles unstructured data; and structured data (JSON-LD, microdata) is always preferable when available.

The web is full of data. Job listings, product prices, real estate listings, research papers — all trapped in HTML, waiting to be analyzed. Web scraping is the art of extracting that data programmatically.

But web scraping exists in a gray zone. Some sites welcome scrapers. Others aggressively block them. Copyright and terms-of-service concerns lurk. Legal liability is real.

In 2026, the scraping landscape has evolved. Anti-scraping tools have become more sophisticated. But so have scraping libraries. Playwright has emerged as the dominant browser automation tool. LLMs have opened new possibilities for extracting data from unstructured pages. And increasingly, responsible developers are asking: "Should I scrape, or should I use the public API or structured data instead?"

This guide covers ethical scraping techniques, modern tools, and when not to scrape.

Ethical Scraping: The Foundation

Before writing a single line of code, understand the ethical and legal landscape.

Respecting robots.txt

Most websites publish a /robots.txt file. It specifies which parts of the site may be crawled and how fast crawlers should go.

# amazon.com/robots.txt
User-agent: *
Disallow: /s
Disallow: /gp/
Disallow: /*/dp/
Disallow: /cart/

# Googlebot is allowed everywhere; others are restricted
User-agent: Googlebot
Disallow:

This says: general scrapers may not crawl /s (search), /gp/, or /*/dp/ (product detail pages); Googlebot may crawl everything.

Fetch robots.txt, parse it, and respect its rules:

from urllib.robotparser import RobotFileParser

rp = RobotFileParser('https://example.com/robots.txt')
rp.read()

if rp.can_fetch('*', 'https://example.com/page'):
    print('OK to scrape')
else:
    print('Not allowed by robots.txt')

# Honor an explicit Crawl-delay directive, if the site sets one
delay = rp.crawl_delay('*')  # None when no Crawl-delay is given

Rate Limiting

Even if scraping is allowed, don't hammer the site. Add delays between requests.

import time
import requests

urls = ['https://example.com/page1', 'https://example.com/page2']

for url in urls:
    response = requests.get(url)
    process_response(response)  # your own handling logic
    time.sleep(2)  # 2-second delay between requests

A well-behaved scraper:

  • waits 2-5 seconds between requests
  • sends a User-Agent header that identifies it
  • backs off when it receives 429 (Too Many Requests) responses
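The back-off behavior can be sketched as a small helper; `fetch` here is a hypothetical callable returning a (status, body) pair, standing in for whatever HTTP client you use:

```python
import time

def fetch_with_backoff(fetch, url, max_retries=4, base_delay=2.0):
    """Call fetch(url); on a 429 status, back off exponentially and retry."""
    for attempt in range(max_retries):
        status, body = fetch(url)
        if status != 429:
            return status, body
        # Double the wait on every consecutive 429 response
        time.sleep(base_delay * (2 ** attempt))
    return status, body
```

After the final retry, the last 429 is returned as-is, so the caller can decide whether to give up or queue the URL for later.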

Legal Considerations

Scraping isn't inherently illegal, but several factors determine legality:

  1. Terms of Service: Does the site explicitly forbid scraping? If so, violating ToS has legal risk.
  2. Copyright: You can scrape public information, but republishing copyrighted content without permission is infringement.
  3. Computer fraud: Don't bypass authentication or CAPTCHA. Don't use scrapers to commit fraud.
  4. Data protection: If the data involves personal information (email addresses, phone numbers), GDPR and similar laws apply.

Safe practices:

  • Check the site's robots.txt and ToS
  • Scrape only what you need
  • Don't republish copyrighted content directly
  • Use scraped data for analysis, not commercial redistribution
  • Ask permission when in doubt

Playwright: The Modern Standard

Playwright is the best browser automation tool in 2026. It drives headless browsers (Chromium, Firefox, WebKit) and handles JavaScript-heavy sites.

Basic Usage

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto('https://example.com')
    title = page.title()
    print(title)
    browser.close()

Playwright renders JavaScript, handles cookies, and manages browser sessions. It's perfect for single-page apps and dynamic content.

Scraping Dynamic Content

Modern sites load data via JavaScript. Traditional HTTP requests only get the initial HTML. Playwright waits for JavaScript to render, then captures the populated DOM.

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()

    # Navigate and wait for data to load
    page.goto('https://example.com/products')
    page.wait_for_selector('.product-item')  # Wait for products to load

    # Extract product names and prices
    products = page.query_selector_all('.product-item')
    data = []
    for product in products:
        name = product.query_selector('.name').text_content()
        price = product.query_selector('.price').text_content()
        data.append({'name': name, 'price': price})

    print(data)
    browser.close()

The wait_for_selector() call is crucial: it blocks until the elements you need have rendered, so you never scrape a half-loaded page.

Handling Anti-Bot Measures

Websites block scrapers to protect their servers. Playwright helps you work around this:

# Add realistic delays
page.wait_for_timeout(1000)  # 1 second delay

# Use a realistic User-Agent
context = browser.new_context(
    user_agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
)

# Rotate through proxy servers (if heavily blocked)
context = browser.new_context(proxy={'server': 'http://proxy.example.com:8080'})

# Handle cookies and session state
context.add_cookies([{'name': 'session', 'value': 'abc123'}])

But beware: if a site actively blocks scrapers with CAPTCHAs and IP bans, bypassing those measures may violate computer fraud laws. Respect the site's intent to block you.

Beautiful Soup: Parsing Static HTML

For sites that serve complete HTML in the initial response (no JavaScript), Beautiful Soup is lightweight and powerful.

from bs4 import BeautifulSoup
import requests

response = requests.get('https://example.com/products')
soup = BeautifulSoup(response.content, 'html.parser')

# Extract product elements
products = soup.find_all('div', class_='product')

for product in products:
    name = product.find('h2').text
    price = product.find('span', class_='price').text
    print(f'{name}: {price}')

Beautiful Soup is simpler, faster, and lighter on memory than Playwright. Use it for static sites; save Playwright for dynamic ones.

Scrapy: Enterprise-Scale Scraping

Scrapy is a full-featured framework for large-scale scraping projects. It handles:

  • Concurrent requests (multiple sites simultaneously)
  • Request scheduling and prioritization
  • Middleware for handling cookies, proxies, and user agents
  • Built-in robots.txt and rate limiting support
  • Exporting data to CSV, JSON, or databases

A minimal spider:

import scrapy

class ProductSpider(scrapy.Spider):
    name = 'products'
    start_urls = ['https://example.com/products']

    def parse(self, response):
        for product in response.css('div.product'):
            yield {
                'name': product.css('h2::text').get(),
                'price': product.css('.price::text').get(),
            }

        # Follow pagination links
        next_page = response.css('a.next::attr(href)').get()
        if next_page:
            yield response.follow(next_page, self.parse)

Run it:

scrapy crawl products -o output.json

Scrapy is overkill for one-off projects but essential for scraping large sites with millions of pages.

LLM-Powered Data Extraction

Large language models can extract structured data from unstructured HTML, dramatically simplifying scraping code.

import anthropic

html = """
<div class="product">
  <h2>MacBook Pro 16-inch</h2>
  <span class="price">$1,999</span>
  <p>Powerful laptop for developers and creators</p>
</div>
"""

client = anthropic.Anthropic()
message = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    messages=[
        {
            "role": "user",
            "content": f"Extract product name, price, and description from this HTML:\n\n{html}"
        }
    ]
)

print(message.content[0].text)
# Example output (wording varies between runs):
# Product Name: MacBook Pro 16-inch
# Price: $1,999
# Description: Powerful laptop for developers and creators

LLMs handle messy HTML, variations in structure, and natural-language descriptions. They're slower and more expensive than selector- or regex-based parsing, but far more tolerant of layout changes.
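One practical wrinkle: if you ask a model to reply in JSON, the reply sometimes arrives wrapped in Markdown code fences. A small helper for that (`parse_llm_json` is my own name, not part of the anthropic SDK):

```python
import json

def parse_llm_json(reply):
    """Parse a JSON object from an LLM reply, tolerating Markdown code fences."""
    text = reply.strip()
    if text.startswith('```'):
        # Drop the opening fence line, then everything from the closing fence on
        text = text.split('\n', 1)[1].rsplit('```', 1)[0]
    return json.loads(text)
```

In the example above you would call it as parse_llm_json(message.content[0].text), after prompting the model to respond with JSON only.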

Structured Data: The Best Alternative

Before scraping, check if the site provides structured data. Many sites embed JSON-LD (Linked Data) in the page.

<script type="application/ld+json">
{
  "@context": "https://schema.org/",
  "@type": "Product",
  "name": "MacBook Pro",
  "offers": {
    "@type": "Offer",
    "price": "1999"
  }
}
</script>

Extract it:

import json
from bs4 import BeautifulSoup

# html is the page source you fetched earlier
soup = BeautifulSoup(html, 'html.parser')
script = soup.find('script', {'type': 'application/ld+json'})
data = json.loads(script.string)
print(data['name'])  # MacBook Pro

JSON-LD is cleaner, faster, and legally safer than parsing HTML. Use it whenever available.
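Pages often carry several JSON-LD blocks (breadcrumbs, organization info, products). A sketch of selecting the one you need, given the raw script bodies as strings (`find_ld_type` is a name of my own choosing):

```python
import json

def find_ld_type(script_bodies, wanted_type):
    """Return the first JSON-LD object whose @type matches wanted_type."""
    for raw in script_bodies:
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # Skip malformed blocks instead of crashing
        # A single script tag can hold one object or a list of objects
        items = data if isinstance(data, list) else [data]
        for item in items:
            if isinstance(item, dict) and item.get('@type') == wanted_type:
                return item
    return None
```

With Beautiful Soup, you would feed it [s.string for s in soup.find_all('script', {'type': 'application/ld+json'})].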

Checklist for Responsible Scraping

  • Check robots.txt — respect crawl delays and disallowed paths
  • Read the site's ToS — is scraping explicitly forbidden?
  • Add delays between requests (2-5 seconds minimum)
  • Identify your scraper with a User-Agent header
  • Handle rate limiting gracefully (check for 429 responses)
  • Respect copyright — don't republish full content
  • Consider using public APIs or structured data instead
  • Avoid scraping personal data; if necessary, follow GDPR rules
  • Stop if blocked — don't try to bypass CAPTCHAs or IP bans
  • Log your scraping activity for auditing

Conclusion

Web scraping is a powerful tool for extracting data from the web. But power comes with responsibility. Always start by checking robots.txt and the site's terms of service. Scrape respectfully — add delays, identify yourself, and back off if blocked.

Choose your tool wisely: Playwright for dynamic sites, Beautiful Soup for static HTML, Scrapy for enterprise-scale projects, and LLMs for unstructured data. And before writing any scraper, ask yourself: "Is there a public API or structured data I could use instead?"

The line between respectful scraping and abuse is thin. Stay on the right side of it.

