Web Scraping: Tools and Techniques
Updated: March 27, 2026
TL;DR
Web scraping extracts data from websites using tools like Playwright (best-in-class), Beautiful Soup, and Scrapy. Ethical scraping respects robots.txt, rate limiting, and legal terms; LLM-powered extraction handles unstructured data; and structured data (JSON-LD, microdata) is always preferable when available.
The web is full of data. Job listings, product prices, real estate listings, research papers — all trapped in HTML, waiting to be analyzed. Web scraping is the art of extracting that data programmatically.
But web scraping exists in a gray zone. Some sites welcome scrapers. Others aggressively block them. Copyright and terms-of-service concerns lurk. Legal liability is real.
In 2026, the scraping landscape has evolved. Anti-scraping tools have become more sophisticated. But so have scraping libraries. Playwright has emerged as the dominant browser automation tool. LLMs have opened new possibilities for extracting data from unstructured pages. And increasingly, responsible developers are asking: "Should I scrape, or should I use the public API or structured data instead?"
This guide covers ethical scraping techniques, modern tools, and when not to scrape.
Ethical Scraping: The Foundation
Before writing a single line of code, understand the ethical and legal landscape.
Respecting robots.txt
Most websites publish a /robots.txt file. It specifies which parts of the site crawlers may access and how fast they should crawl.
# amazon.com/robots.txt
User-agent: *
Disallow: /s
Disallow: /gp/
Disallow: /*/dp/
Disallow: /cart/
# Googlebot is allowed everywhere; others are restricted
User-agent: Googlebot
Disallow:
This says: "General scrapers, don't scrape /s (search), /gp/ (pages), or /*/dp/ (product detail). Googlebot, you're fine."
Fetch robots.txt, parse it, and respect its rules:
from urllib.robotparser import RobotFileParser

rp = RobotFileParser('https://example.com/robots.txt')
rp.read()

if rp.can_fetch('*', 'https://example.com/page'):
    print('OK to scrape')
else:
    print('Not allowed by robots.txt')
Rate Limiting
Even if scraping is allowed, don't hammer the site. Add delays between requests.
import time
import requests

urls = ['https://example.com/page1', 'https://example.com/page2']

for url in urls:
    response = requests.get(url)
    process_response(response)  # your own handling function
    time.sleep(2)  # 2-second delay between requests
A well-behaved scraper adds:
- 2-5 second delays between requests
- A User-Agent header identifying your scraper
- Backoff when it receives 429 (Too Many Requests) responses
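The backoff point deserves a sketch. This is one minimal way to do it, assuming the requests library; the function name, User-Agent string, and retry parameters are illustrative, not a standard recipe:

```python
import time
import requests

def fetch_with_backoff(url, max_retries=3, base_delay=2):
    """Fetch a URL, backing off exponentially on 429 responses."""
    for attempt in range(max_retries):
        response = requests.get(
            url,
            # Identify your scraper honestly (hypothetical example identity)
            headers={'User-Agent': 'my-research-bot/1.0 (contact@example.com)'},
        )
        if response.status_code != 429:
            return response
        # Honor Retry-After if the server sends it; otherwise back off exponentially
        wait = int(response.headers.get('Retry-After', base_delay * 2 ** attempt))
        time.sleep(wait)
    raise RuntimeError(f'Still rate-limited after {max_retries} attempts: {url}')
```

If the server keeps answering 429 after a few attempts, the polite move is to stop entirely rather than retry forever.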
Legal Considerations
Scraping isn't inherently illegal, but several factors determine legality:
- Terms of Service: Does the site explicitly forbid scraping? If so, violating ToS has legal risk.
- Copyright: You can scrape public information, but republishing copyrighted content without permission is infringement.
- Computer fraud: Don't bypass authentication or CAPTCHA. Don't use scrapers to commit fraud.
- Data protection: If the data involves personal information (email addresses, phone numbers), GDPR and similar laws apply.
Safe practices:
- Check the site's robots.txt and ToS
- Scrape only what you need
- Don't republish copyrighted content directly
- Use scraped data for analysis, not commercial redistribution
- Ask permission when in doubt
Playwright: The Modern Standard
Playwright is the leading browser automation tool in 2026. It drives headless browsers (Chromium, Firefox, WebKit) and handles JavaScript-heavy sites.
Basic Usage
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto('https://example.com')
    title = page.title()
    print(title)
    browser.close()
Playwright renders JavaScript, handles cookies, and manages browser sessions. It's perfect for single-page apps and dynamic content.
Scraping Dynamic Content
Modern sites load data via JavaScript. Traditional HTTP requests only get the initial HTML. Playwright waits for JavaScript to render, then captures the populated DOM.
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()

    # Navigate and wait for data to load
    page.goto('https://example.com/products')
    page.wait_for_selector('.product-item')  # Wait for products to load

    # Extract product names and prices
    products = page.query_selector_all('.product-item')
    data = []
    for product in products:
        name = product.query_selector('.name').text_content()
        price = product.query_selector('.price').text_content()
        data.append({'name': name, 'price': price})

    print(data)
    browser.close()
The wait_for_selector() call is crucial: it ensures the target elements have actually rendered before you extract data.
Handling Anti-Bot Measures
Websites block scrapers to protect their servers. Playwright helps you work around this:
# Add realistic delays
page.wait_for_timeout(1000)  # 1-second delay

# Use a realistic User-Agent
context = browser.new_context(
    user_agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
)

# Route traffic through a proxy server (if heavily blocked)
context = browser.new_context(proxy={'server': 'http://proxy.example.com:8080'})

# Set cookies and session state (Playwright requires a url or domain/path per cookie)
context.add_cookies([{'name': 'session', 'value': 'abc123', 'url': 'https://example.com'}])
But beware: if a site actively blocks scrapers with CAPTCHAs and IP bans, bypassing those measures may violate computer fraud laws. Respect the site's intent to block you.
Beautiful Soup: Parsing Static HTML
For sites that serve complete HTML in the initial response (no JavaScript), Beautiful Soup is lightweight and powerful.
from bs4 import BeautifulSoup
import requests

response = requests.get('https://example.com/products')
soup = BeautifulSoup(response.content, 'html.parser')

# Extract product elements
products = soup.find_all('div', class_='product')
for product in products:
    name = product.find('h2').text
    price = product.find('span', class_='price').text
    print(f'{name}: {price}')
Beautiful Soup is simpler than Playwright, faster, and uses less memory. Use it for static sites; save Playwright for dynamic ones.
Scrapy: Enterprise-Scale Scraping
Scrapy is a full-featured framework for large-scale scraping projects. It handles:
- Concurrent requests (multiple sites simultaneously)
- Request scheduling and prioritization
- Middleware for handling cookies, proxies, and user agents
- Built-in robots.txt and rate limiting support
- Exporting data to CSV, JSON, or databases
import scrapy

class ProductSpider(scrapy.Spider):
    name = 'products'
    start_urls = ['https://example.com/products']

    def parse(self, response):
        for product in response.css('div.product'):
            yield {
                'name': product.css('h2::text').get(),
                'price': product.css('.price::text').get(),
            }

        # Follow pagination links
        next_page = response.css('a.next::attr(href)').get()
        if next_page:
            yield response.follow(next_page, self.parse)
Run it:
scrapy crawl products -o output.json
Scrapy is overkill for one-off projects but essential for scraping large sites with millions of pages.
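Scrapy's built-in politeness features live in the project's settings.py. A minimal sketch: the setting names are real Scrapy settings, but the values and bot identity here are illustrative, not recommendations for any particular site:

```python
# settings.py -- politeness configuration for a Scrapy project
BOT_NAME = 'products'

# Identify your scraper (hypothetical identity and contact URL)
USER_AGENT = 'products-bot/1.0 (+https://example.com/bot-info)'

ROBOTSTXT_OBEY = True                # skip URLs disallowed by robots.txt
DOWNLOAD_DELAY = 2                   # seconds between requests to the same domain
CONCURRENT_REQUESTS_PER_DOMAIN = 1   # one request at a time per site
AUTOTHROTTLE_ENABLED = True          # slow down automatically when the server does
RETRY_HTTP_CODES = [429, 503]        # retry (with Scrapy's backoff) on rate limits
```

With ROBOTSTXT_OBEY and AUTOTHROTTLE_ENABLED set, Scrapy enforces much of the earlier checklist for you.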
LLM-Powered Data Extraction
Large language models can extract structured data from unstructured HTML, dramatically simplifying scraping code.
import anthropic

html = """
<div class="product">
    <h2>MacBook Pro 16-inch</h2>
    <span class="price">$1,999</span>
    <p>Powerful laptop for developers and creators</p>
</div>
"""

client = anthropic.Anthropic()
message = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    messages=[
        {
            "role": "user",
            "content": f"Extract product name, price, and description from this HTML:\n\n{html}"
        }
    ]
)
print(message.content[0].text)

# Output:
# Product Name: MacBook Pro 16-inch
# Price: $1,999
# Description: Powerful laptop for developers and creators
LLMs handle messy HTML, variations in structure, and natural language descriptions. They're slower and more expensive than regex parsing but far more reliable.
Structured Data: The Best Alternative
Before scraping, check if the site provides structured data. Many sites embed JSON-LD (Linked Data) in the page.
<script type="application/ld+json">
{
    "@context": "https://schema.org/",
    "@type": "Product",
    "name": "MacBook Pro",
    "offers": {
        "@type": "Offer",
        "price": "1999"
    }
}
</script>
Extract it:
import json
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')  # html is the page source you fetched
script = soup.find('script', {'type': 'application/ld+json'})
data = json.loads(script.string)
print(data['name'])  # MacBook Pro
JSON-LD is cleaner, faster, and legally safer than parsing HTML. Use it whenever available.
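One wrinkle: real pages often carry several ld+json blocks (breadcrumbs, organization info, the product itself), and some are malformed. A hedged sketch of more robust extraction; the sample HTML here is invented for illustration:

```python
import json
from bs4 import BeautifulSoup

# Illustrative page with two JSON-LD blocks; only one is a Product
html = """
<script type="application/ld+json">{"@type": "BreadcrumbList"}</script>
<script type="application/ld+json">
{"@context": "https://schema.org/", "@type": "Product",
 "name": "MacBook Pro", "offers": {"@type": "Offer", "price": "1999"}}
</script>
"""

soup = BeautifulSoup(html, 'html.parser')
products = []
for script in soup.find_all('script', {'type': 'application/ld+json'}):
    try:
        data = json.loads(script.string)
    except (json.JSONDecodeError, TypeError):
        continue  # malformed or empty blocks are common in the wild
    # Keep only Product objects (some blocks hold lists or other @types)
    if isinstance(data, dict) and data.get('@type') == 'Product':
        products.append(data)

print(products[0]['name'])  # MacBook Pro
```

Filtering on @type means new JSON-LD blocks added to the page won't silently break your extraction.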
Checklist for Responsible Scraping
- Check robots.txt — respect crawl delays and disallowed paths
- Read the site's ToS — is scraping explicitly forbidden?
- Add delays between requests (2-5 seconds minimum)
- Identify your scraper with a User-Agent header
- Handle rate limiting gracefully (check for 429 responses)
- Respect copyright — don't republish full content
- Consider using public APIs or structured data instead
- Avoid scraping personal data; if necessary, follow GDPR rules
- Stop if blocked — don't try to bypass CAPTCHAs or IP bans
- Log your scraping activity for auditing
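The last checklist item is easy to wire up with the standard library. A minimal sketch; the logger name, file name, and log format are arbitrary choices, not a convention:

```python
import logging
import time

# Dedicated audit logger writing one line per request to scraper.log
audit = logging.getLogger('scraper.audit')
audit.setLevel(logging.INFO)
handler = logging.FileHandler('scraper.log')
handler.setFormatter(logging.Formatter('%(asctime)s %(message)s'))
audit.addHandler(handler)

def log_request(url, status_code, elapsed):
    """Record each request so you can audit what was fetched and when."""
    audit.info('GET %s -> %d (%.2fs)', url, status_code, elapsed)

start = time.monotonic()
# ... response = requests.get(url) would go here ...
log_request('https://example.com/page1', 200, time.monotonic() - start)
```

If a site owner ever asks what you fetched, the log answers the question precisely.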
Conclusion
Web scraping is a powerful tool for extracting data from the web. But power comes with responsibility. Always start by checking robots.txt and the site's terms of service. Scrape respectfully — add delays, identify yourself, and back off if blocked.
Choose your tool wisely: Playwright for dynamic sites, Beautiful Soup for static HTML, Scrapy for enterprise-scale projects, and LLMs for unstructured data. And before writing any scraper, ask yourself: "Is there a public API or structured data I could use instead?"
The line between respectful scraping and abuse is thin. Stay on the right side of it.