Web Scraping: Tools and Techniques
Updated: March 27, 2026
TL;DR
Web scraping extracts data from websites using tools like Playwright (best-in-class), Beautiful Soup, and Scrapy. Ethical scraping respects robots.txt, rate limiting, and legal terms; LLM-powered extraction handles unstructured data; and structured data (JSON-LD, microdata) is always preferable when available.
The web is full of data. Job listings, product prices, real estate listings, research papers — all trapped in HTML, waiting to be analyzed. Web scraping is the art of extracting that data programmatically.
But web scraping exists in a gray zone. Some sites welcome scrapers. Others aggressively block them. Copyright and terms-of-service concerns lurk. Legal liability is real.
In 2026, the scraping landscape has evolved. Anti-scraping tools have become more sophisticated. But so have scraping libraries. Playwright has emerged as the dominant browser automation tool. LLMs have opened new possibilities for extracting data from unstructured pages. And increasingly, responsible developers are asking: "Should I scrape, or should I use the public API or structured data instead?"
This guide covers ethical scraping techniques, modern tools, and when not to scrape.
Ethical Scraping: The Foundation
Before writing a single line of code, understand the ethical and legal landscape.
Respecting robots.txt
Most websites publish a /robots.txt file. It specifies which parts of the site crawlers may access and how fast they should crawl.
# amazon.com/robots.txt
User-agent: *
Disallow: /s
Disallow: /gp/
Disallow: /*/dp/
Disallow: /cart/
# Googlebot is allowed everywhere; others are restricted
User-agent: Googlebot
Disallow:
This says: "General scrapers, don't scrape /s (search), /gp/ (pages), or /*/dp/ (product detail). Googlebot, you're fine."
Fetch robots.txt, parse it, and respect its rules:
from urllib.robotparser import RobotFileParser

rp = RobotFileParser('https://example.com/robots.txt')
rp.read()

if rp.can_fetch('*', 'https://example.com/page'):
    print('OK to scrape')
else:
    print('Not allowed by robots.txt')
Rate Limiting
Even if scraping is allowed, don't hammer the site. Add delays between requests.
import time
import requests

urls = ['https://example.com/page1', 'https://example.com/page2']

for url in urls:
    response = requests.get(url)
    process_response(response)  # your own handling function
    time.sleep(2)  # 2-second delay between requests
A well-behaved scraper adds:
- 2-5 second delays between requests
- A User-Agent header identifying your scraper
- Backoff when it receives 429 (Too Many Requests) responses
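The backoff point deserves a sketch. This is one minimal way to do it, assuming the requests library; the function name, User-Agent string, and retry parameters are illustrative, not a standard recipe:

```python
import time
import requests

def fetch_with_backoff(url, max_retries=3, base_delay=2):
    """Fetch a URL, backing off exponentially on 429 responses."""
    for attempt in range(max_retries):
        response = requests.get(
            url,
            # Identify your scraper honestly (hypothetical example identity)
            headers={'User-Agent': 'my-research-bot/1.0 (contact@example.com)'},
        )
        if response.status_code != 429:
            return response
        # Honor Retry-After if the server sends it; otherwise back off exponentially
        wait = int(response.headers.get('Retry-After', base_delay * 2 ** attempt))
        time.sleep(wait)
    raise RuntimeError(f'Still rate-limited after {max_retries} attempts: {url}')
```

If the server keeps answering 429 after a few attempts, the polite move is to stop entirely rather than retry forever.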
Legal Considerations
Scraping isn't inherently illegal, but several factors determine legality:
- Terms of Service: Does the site explicitly forbid scraping? If so, violating ToS has legal risk.
- Copyright: You can scrape public information, but republishing copyrighted content without permission is infringement.
- Computer fraud: Don't bypass authentication or CAPTCHA. Don't use scrapers to commit fraud.
- Data protection: If the data involves personal information (email addresses, phone numbers), GDPR and similar laws apply.
Safe practices:
- Check the site's robots.txt and ToS
- Scrape only what you need
- Don't republish copyrighted content directly
- Use scraped data for analysis, not commercial redistribution
- Ask permission when in doubt
Playwright: The Modern Standard
Playwright is the leading browser automation tool in 2026. It drives headless browsers (Chromium, Firefox, WebKit) and handles JavaScript-heavy sites.
Basic Usage
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto('https://example.com')
    title = page.title()
    print(title)
    browser.close()
Playwright renders JavaScript, handles cookies, and manages browser sessions. It's perfect for single-page apps and dynamic content.
Scraping Dynamic Content
Modern sites load data via JavaScript. Traditional HTTP requests only get the initial HTML. Playwright waits for JavaScript to render, then captures the populated DOM.
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()

    # Navigate and wait for data to load
    page.goto('https://example.com/products')
    page.wait_for_selector('.product-item')  # Wait for products to load

    # Extract product names and prices
    products = page.query_selector_all('.product-item')
    data = []
    for product in products:
        name = product.query_selector('.name').text_content()
        price = product.query_selector('.price').text_content()
        data.append({'name': name, 'price': price})

    print(data)
    browser.close()
The wait_for_selector() call is crucial: it ensures the target elements have actually rendered before you extract data.
Handling Anti-Bot Measures
Websites block scrapers to protect their servers. Playwright helps you work around this:
# Add realistic delays
page.wait_for_timeout(1000)  # 1-second delay

# Use a realistic User-Agent
context = browser.new_context(
    user_agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
)

# Route traffic through a proxy server (if heavily blocked)
context = browser.new_context(proxy={'server': 'http://proxy.example.com:8080'})

# Set cookies and session state (Playwright requires a url or domain/path per cookie)
context.add_cookies([{'name': 'session', 'value': 'abc123', 'url': 'https://example.com'}])
But beware: if a site actively blocks scrapers with CAPTCHAs and IP bans, bypassing those measures may violate computer fraud laws. Respect the site's intent to block you.
Beautiful Soup: Parsing Static HTML
For sites that serve complete HTML in the initial response (no JavaScript), Beautiful Soup is lightweight and powerful.
from bs4 import BeautifulSoup
import requests

response = requests.get('https://example.com/products')
soup = BeautifulSoup(response.content, 'html.parser')

# Extract product elements
products = soup.find_all('div', class_='product')
for product in products:
    name = product.find('h2').text
    price = product.find('span', class_='price').text
    print(f'{name}: {price}')
Beautiful Soup is simpler than Playwright, faster, and uses less memory. Use it for static sites; save Playwright for dynamic ones.
Scrapy: Enterprise-Scale Scraping
Scrapy is a full-featured framework for large-scale scraping projects. It handles:
- Concurrent requests (multiple sites simultaneously)
- Request scheduling and prioritization
- Middleware for handling cookies, proxies, and user agents
- Built-in robots.txt and rate limiting support
- Exporting data to CSV, JSON, or databases
import scrapy

class ProductSpider(scrapy.Spider):
    name = 'products'
    start_urls = ['https://example.com/products']

    def parse(self, response):
        for product in response.css('div.product'):
            yield {
                'name': product.css('h2::text').get(),
                'price': product.css('.price::text').get(),
            }

        # Follow pagination links
        next_page = response.css('a.next::attr(href)').get()
        if next_page:
            yield response.follow(next_page, self.parse)
Run it:
scrapy crawl products -o output.json
Scrapy is overkill for one-off projects but essential for scraping large sites with millions of pages.
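Scrapy's built-in politeness features live in the project's settings.py. A minimal sketch: the setting names are real Scrapy settings, but the values and bot identity here are illustrative, not recommendations for any particular site:

```python
# settings.py -- politeness configuration for a Scrapy project
BOT_NAME = 'products'

# Identify your scraper (hypothetical identity and contact URL)
USER_AGENT = 'products-bot/1.0 (+https://example.com/bot-info)'

ROBOTSTXT_OBEY = True                # skip URLs disallowed by robots.txt
DOWNLOAD_DELAY = 2                   # seconds between requests to the same domain
CONCURRENT_REQUESTS_PER_DOMAIN = 1   # one request at a time per site
AUTOTHROTTLE_ENABLED = True          # slow down automatically when the server does
RETRY_HTTP_CODES = [429, 503]        # retry (with Scrapy's backoff) on rate limits
```

With ROBOTSTXT_OBEY and AUTOTHROTTLE_ENABLED set, Scrapy enforces much of the earlier checklist for you.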
LLM-Powered Data Extraction
Large language models can extract structured data from unstructured HTML, dramatically simplifying scraping code.
import anthropic

html = """
<div class="product">
    <h2>MacBook Pro 16-inch</h2>
    <span class="price">$1,999</span>
    <p>Powerful laptop for developers and creators</p>
</div>
"""

client = anthropic.Anthropic()
message = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    messages=[
        {
            "role": "user",
            "content": f"Extract product name, price, and description from this HTML:\n\n{html}"
        }
    ]
)
print(message.content[0].text)

# Output:
# Product Name: MacBook Pro 16-inch
# Price: $1,999
# Description: Powerful laptop for developers and creators
LLMs handle messy HTML, variations in structure, and natural language descriptions. They're slower and more expensive than regex parsing but far more reliable.
Structured Data: The Best Alternative
Before scraping, check if the site provides structured data. Many sites embed JSON-LD (Linked Data) in the page.
<script type="application/ld+json">
{
    "@context": "https://schema.org/",
    "@type": "Product",
    "name": "MacBook Pro",
    "offers": {
        "@type": "Offer",
        "price": "1999"
    }
}
</script>
Extract it:
import json
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')  # html is the page source you fetched
script = soup.find('script', {'type': 'application/ld+json'})
data = json.loads(script.string)
print(data['name'])  # MacBook Pro
JSON-LD is cleaner, faster, and legally safer than parsing HTML. Use it whenever available.
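One wrinkle: real pages often carry several ld+json blocks (breadcrumbs, organization info, the product itself), and some are malformed. A hedged sketch of more robust extraction; the sample HTML here is invented for illustration:

```python
import json
from bs4 import BeautifulSoup

# Illustrative page with two JSON-LD blocks; only one is a Product
html = """
<script type="application/ld+json">{"@type": "BreadcrumbList"}</script>
<script type="application/ld+json">
{"@context": "https://schema.org/", "@type": "Product",
 "name": "MacBook Pro", "offers": {"@type": "Offer", "price": "1999"}}
</script>
"""

soup = BeautifulSoup(html, 'html.parser')
products = []
for script in soup.find_all('script', {'type': 'application/ld+json'}):
    try:
        data = json.loads(script.string)
    except (json.JSONDecodeError, TypeError):
        continue  # malformed or empty blocks are common in the wild
    # Keep only Product objects (some blocks hold lists or other @types)
    if isinstance(data, dict) and data.get('@type') == 'Product':
        products.append(data)

print(products[0]['name'])  # MacBook Pro
```

Filtering on @type means new JSON-LD blocks added to the page won't silently break your extraction.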
Checklist for Responsible Scraping
- Check robots.txt — respect crawl delays and disallowed paths
- Read the site's ToS — is scraping explicitly forbidden?
- Add delays between requests (2-5 seconds minimum)
- Identify your scraper with a User-Agent header
- Handle rate limiting gracefully (check for 429 responses)
- Respect copyright — don't republish full content
- Consider using public APIs or structured data instead
- Avoid scraping personal data; if necessary, follow GDPR rules
- Stop if blocked — don't try to bypass CAPTCHAs or IP bans
- Log your scraping activity for auditing
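The last checklist item is easy to wire up with the standard library. A minimal sketch; the logger name, file name, and log format are arbitrary choices, not a convention:

```python
import logging
import time

# Dedicated audit logger writing one line per request to scraper.log
audit = logging.getLogger('scraper.audit')
audit.setLevel(logging.INFO)
handler = logging.FileHandler('scraper.log')
handler.setFormatter(logging.Formatter('%(asctime)s %(message)s'))
audit.addHandler(handler)

def log_request(url, status_code, elapsed):
    """Record each request so you can audit what was fetched and when."""
    audit.info('GET %s -> %d (%.2fs)', url, status_code, elapsed)

start = time.monotonic()
# ... response = requests.get(url) would go here ...
log_request('https://example.com/page1', 200, time.monotonic() - start)
```

If a site owner ever asks what you fetched, the log answers the question precisely.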
Conclusion
Web scraping is a powerful tool for extracting data from the web. But power comes with responsibility. Always start by checking robots.txt and the site's terms of service. Scrape respectfully — add delays, identify yourself, and back off if blocked.
Choose your tool wisely: Playwright for dynamic sites, Beautiful Soup for static HTML, Scrapy for enterprise-scale projects, and LLMs for unstructured data. And before writing any scraper, ask yourself: "Is there a public API or structured data I could use instead?"
The line between respectful scraping and abuse is thin. Stay on the right side of it.