Web scraping in Python with BeautifulSoup
BeautifulSoup is a Python library that parses HTML and XML into a navigable tree so you can pull out exactly the data you need — links, prices, headlines, tables — without writing brittle regular expressions. It pairs with an HTTP client such as requests (or httpx) that fetches the page, while BeautifulSoup does the parsing and searching.
Reach for BeautifulSoup when you need to extract data from static, server-rendered HTML: blog archives, documentation, product listings, public datasets, or one-off pages. It is forgiving of broken markup, has a tiny learning curve, and is ideal for small-to-medium jobs and prototypes.
This guide uses Python 3.12/3.13, BeautifulSoup 4, and requests. You will install the tools, fetch a page responsibly, navigate the parse tree, find elements with find/find_all and CSS selectors, handle pagination, save results to CSV and JSON, and — just as important — learn when BeautifulSoup is the wrong tool and how to scrape ethically and legally.
1. Installation
Create a virtual environment, then install beautifulsoup4, a fast parser (lxml), and an HTTP client. BeautifulSoup itself does not fetch pages — you bring your own client.
- beautifulsoup4: the parsing/searching library (the import name is bs4).
- lxml: a fast, lenient parser. The built-in html.parser works with no extra install but is slower; lxml is the usual choice for real work.
- requests: the classic, synchronous HTTP client. httpx is a modern alternative with the same feel plus async and HTTP/2 support.
python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install beautifulsoup4 lxml requests
# Optional modern HTTP client (async + HTTP/2):
# pip install httpx

2. Fetch a page responsibly
Before parsing, fetch the HTML. A responsible request sets a descriptive User-Agent, uses a timeout so the script never hangs, and checks the response status with raise_for_status(). Always be polite: identify yourself and add a small delay between requests so you do not hammer the server.
import requests

HEADERS = {
    # Identify your bot honestly; include a contact URL or email.
    "User-Agent": "MyScraper/1.0 (+https://example.com/bot-info)",
}

def fetch(url: str) -> str:
    """Fetch a URL and return its HTML, raising on HTTP errors."""
    response = requests.get(url, headers=HEADERS, timeout=10)
    response.raise_for_status()  # raise for 4xx/5xx responses
    return response.text

if __name__ == "__main__":
    html = fetch("https://quotes.toscrape.com/")
    print(len(html), "characters fetched")

We use quotes.toscrape.com as the target: a site built specifically for practising scraping, so the examples stay stable and you do not risk violating anyone's Terms of Service while learning.
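The advice to add a small delay between requests can be sketched as a tiny throttle object. This is a minimal illustration, not part of requests or BeautifulSoup: the `Throttle` class and the fake clock used in the demo are hypothetical names introduced here so the example runs instantly and offline.

```python
import time


class Throttle:
    """Enforce a minimum gap between consecutive calls to wait()."""

    def __init__(self, min_interval: float, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock    # injectable so the demo does not really sleep
        self._sleep = sleep
        self._last = None      # time of the previous call, None before the first

    def wait(self) -> float:
        """Sleep if the previous call was too recent; return the seconds slept."""
        slept = 0.0
        if self._last is not None:
            remaining = self.min_interval - (self._clock() - self._last)
            if remaining > 0:
                self._sleep(remaining)
                slept = remaining
        self._last = self._clock()
        return slept


# Demo with a fake clock (an assumption for illustration); real code would
# simply write Throttle(1.0) and call throttle.wait() before each request.
fake_now = [0.0]

def fake_clock() -> float:
    return fake_now[0]

def fake_sleep(seconds: float) -> None:
    fake_now[0] += seconds

throttle = Throttle(1.0, clock=fake_clock, sleep=fake_sleep)
pauses = []
for _ in range(3):
    pauses.append(throttle.wait())  # call before each request
    fake_now[0] += 0.25             # pretend the request took 0.25 s
print(pauses)  # [0.0, 0.75, 0.75]
```

Because the throttle measures the time since the previous request, a slow response already counts toward the pause, so the script waits only as long as it needs to.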
3. Parse and navigate the tree
Pass the HTML string and a parser name to BeautifulSoup. The result is a tree of Tag and NavigableString objects you can walk by tag name, attribute, or relationship (parent, children, siblings). prettify() is handy for inspecting structure while you develop.
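The relationship attributes mentioned above (parent, children, siblings) are easiest to see on a tiny inline document. The following is a minimal sketch, assuming a made-up author list and the built-in html.parser so it runs without lxml installed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only to illustrate tree relationships.
html = """
<ul id="authors">
  <li class="first">Ada</li>
  <li>Grace</li>
  <li>Linus</li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")
first = soup.find("li", class_="first")

# Parent: the tag that directly contains this one.
parent_name = first.parent.name

# Children: recursive=False limits the search to direct child tags and
# skips the whitespace strings that sit between them.
items = [li.get_text() for li in first.parent.find_all("li", recursive=False)]

# Siblings: find_next_sibling jumps to the next matching tag at the same level
# and returns None when there is none left.
second = first.find_next_sibling("li")
last = second.find_next_sibling("li")

print(parent_name)                    # ul
print(items)                          # ['Ada', 'Grace', 'Linus']
print(second.get_text())              # Grace
print(last.find_next_sibling("li"))   # None
```

Plain tag and attribute access, the more common starting point, is what the listing that follows covers.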
from bs4 import BeautifulSoup
html = """
<html>
<body>
<h1 class="title">Quotes</h1>
<div class="quote">
<span class="text">The world is what you think of it.</span>
<a class="author" href="/author/anon/">Anon</a>
</div>
</body>
</html>
"""
soup = BeautifulSoup(html, "lxml") # or "html.parser" with no extra deps
print(soup.title) # None here (no <title>), shows attribute access
print(soup.h1.get_text()) # "Quotes" — tag access returns the first match
print(soup.h1["class"]) # ['title'] — attributes act like a dict
print(soup.a["href"]) # "/author/anon/"
# print(soup.prettify())  # uncomment to see the indented tree

4. Find elements: find, find_all, and CSS selectors
There are two main ways to locate elements. The find / find_all API searches by tag name and attributes, while select / select_one use familiar CSS selectors. Pick whichever reads more clearly for the page at hand — CSS selectors are often the most concise.
| Goal | find_all style | CSS selector style |
|---|---|---|
| First match | soup.find('a') | soup.select_one('a') |
| All matches | soup.find_all('a') | soup.select('a') |
| By class | soup.find_all('div', class_='quote') | soup.select('div.quote') |
| By id | soup.find(id='main') | soup.select_one('#main') |
| Links only | soup.find_all('a', href=True) | soup.select('a[href]') |
| Nested | (none) | soup.select('div.quote span.text') |
Note class_ (with a trailing underscore) in the find_all API, because class is a reserved word in Python.
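To make the table rows concrete, here is a small offline sketch that runs the same queries both ways on an inline snippet. The markup is invented for illustration, and html.parser is used so nothing beyond beautifulsoup4 is needed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only.
html = """
<div id="main">
  <div class="quote featured"><span class="text">First</span></div>
  <div class="quote"><span class="text">Second</span></div>
  <a href="/about">About</a>
  <a name="anchor-only">No href</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# class_ matches any one of an element's classes, so "quote featured" counts.
by_kwarg = soup.find_all("div", class_="quote")
# The attrs dict is an equivalent spelling that avoids the underscore.
by_attrs = soup.find_all("div", attrs={"class": "quote"})
# The CSS selector form of the same query.
by_css = soup.select("div.quote")

texts = [d.get_text(strip=True) for d in by_kwarg]
links = [a["href"] for a in soup.find_all("a", href=True)]   # skips the <a> without href
nested = [s.get_text() for s in soup.select("div.quote span.text")]

print(len(by_kwarg), len(by_attrs), len(by_css))  # 2 2 2
print(texts)                                      # ['First', 'Second']
print(links)                                      # ['/about']
print(nested)                                     # ['First', 'Second']
```

The same queries run against the live practice site in the listing that follows.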
from bs4 import BeautifulSoup
import requests
HEADERS = {"User-Agent": "MyScraper/1.0 (+https://example.com/bot-info)"}
response = requests.get("https://quotes.toscrape.com/", headers=HEADERS, timeout=10)
response.raise_for_status()
soup = BeautifulSoup(response.text, "lxml")

# find_all style
for quote in soup.find_all("div", class_="quote"):
    text = quote.find("span", class_="text").get_text(strip=True)
    author = quote.find("small", class_="author").get_text(strip=True)
    print(f"{author}: {text}")

# Equivalent CSS-selector style
for quote in soup.select("div.quote"):
    text = quote.select_one("span.text").get_text(strip=True)
    author = quote.select_one("small.author").get_text(strip=True)
    tags = [t.get_text() for t in quote.select("div.tags a.tag")]
    print(text, "--", author, "--", tags)

5. Extract text and attributes safely
Use get_text(strip=True) to pull clean text and .get("attr") to read attributes without crashing when they are missing. Reading tag["href"] raises a KeyError if the attribute is absent, whereas tag.get("href") returns None — prefer the latter in loops over messy real-world HTML. To turn a site-relative link into an absolute URL, join it against the page URL with urljoin.
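Missing attributes are only half of the problem: find and select_one return None when no element matches, so calling get_text on the result raises an AttributeError. One way to guard against that is a small helper; `text_or_default` below is a hypothetical name and a sketch under that assumption, not part of the BeautifulSoup API.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag


def text_or_default(parent: Tag, selector: str, default: str = "") -> str:
    """Return the stripped text of the first match, or default if nothing matches."""
    node = parent.select_one(selector)
    return node.get_text(strip=True) if node is not None else default


# Hypothetical markup: the second quote has no author element.
html = """
<div class="quote"><span class="text">Complete</span><small class="author">Ada</small></div>
<div class="quote"><span class="text">No author here</span></div>
"""
soup = BeautifulSoup(html, "html.parser")

rows = [
    {
        "text": text_or_default(q, "span.text"),
        "author": text_or_default(q, "small.author", default="(unknown)"),
    }
    for q in soup.select("div.quote")
]
print(rows)
# [{'text': 'Complete', 'author': 'Ada'}, {'text': 'No author here', 'author': '(unknown)'}]
```

Attribute reads and URL joining, the other two techniques from the paragraph above, appear in the listing that follows.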
from urllib.parse import urljoin
BASE = "https://quotes.toscrape.com/"

for link in soup.select("a"):
    href = link.get("href")  # None instead of KeyError if missing
    if not href:
        continue
    absolute = urljoin(BASE, href)  # "/login" -> "https://quotes.toscrape.com/login"
    label = link.get_text(strip=True)
    print(label or "(no text)", "->", absolute)

6. Handle pagination
Most listings span many pages. The robust pattern is a loop that follows the site's own "Next" link until it disappears, with a polite delay between requests. This avoids guessing page numbers and stops automatically at the last page.
import time
from urllib.parse import urljoin
import requests
from bs4 import BeautifulSoup
BASE = "https://quotes.toscrape.com/"
HEADERS = {"User-Agent": "MyScraper/1.0 (+https://example.com/bot-info)"}

def scrape_all_quotes() -> list[dict]:
    """Follow the Next link until there are no more pages."""
    results: list[dict] = []
    url = BASE
    while url:
        response = requests.get(url, headers=HEADERS, timeout=10)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, "lxml")
        for quote in soup.select("div.quote"):
            results.append({
                "text": quote.select_one("span.text").get_text(strip=True),
                "author": quote.select_one("small.author").get_text(strip=True),
                "tags": [t.get_text() for t in quote.select("div.tags a.tag")],
            })
        next_link = soup.select_one("li.next a")
        url = urljoin(BASE, next_link["href"]) if next_link else None
        time.sleep(1)  # be polite: pause between requests
    return results

if __name__ == "__main__":
    data = scrape_all_quotes()
    print(f"Scraped {len(data)} quotes across all pages")

7. Save results to CSV and JSON
Once you have a list of dictionaries, exporting is a few lines with the standard library — no extra dependencies. Use csv.DictWriter for spreadsheets and json.dump for structured output. Pass ensure_ascii=False so non-English characters are written as readable UTF-8.
import csv
import json

def save_csv(rows: list[dict], path: str) -> None:
    if not rows:
        return
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=rows[0].keys())
        writer.writeheader()
        for row in rows:
            # Flatten the list of tags into a single cell for CSV.
            writer.writerow({**row, "tags": ", ".join(row.get("tags", []))})

def save_json(rows: list[dict], path: str) -> None:
    with open(path, "w", encoding="utf-8") as f:
        json.dump(rows, f, indent=2, ensure_ascii=False)

# save_csv(data, "quotes.csv")
# save_json(data, "quotes.json")

8. Robustness: timeouts, retries, and error handling
Networks fail, servers return 500s, and sites occasionally rate-limit you. Production scrapers wrap requests with a timeout, retries with backoff, and targeted exception handling. The cleanest approach is a requests.Session with an HTTPAdapter configured to retry on transient errors so you do not reinvent backoff logic by hand.
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def build_session() -> requests.Session:
    """A session that retries transient failures with exponential backoff."""
    session = requests.Session()
    session.headers.update({"User-Agent": "MyScraper/1.0 (+https://example.com/bot-info)"})
    retry = Retry(
        total=3,  # at most three retries after the first attempt
        backoff_factor=1,  # growing pauses between retries (roughly 0s, 2s, 4s; exact timing depends on the urllib3 version)
        status_forcelist=[429, 500, 502, 503, 504],
        allowed_methods=["GET"],
    )
    adapter = HTTPAdapter(max_retries=retry)
    session.mount("https://", adapter)
    session.mount("http://", adapter)
    return session

session = build_session()
try:
    response = session.get("https://quotes.toscrape.com/", timeout=10)
    response.raise_for_status()
except requests.exceptions.Timeout:
    print("Request timed out")
except requests.exceptions.HTTPError as exc:
    print(f"HTTP error: {exc}")
except requests.exceptions.RequestException as exc:
    # Also covers the error raised once the retries above are exhausted.
    print(f"Network error: {exc}")

When BeautifulSoup is the wrong tool
BeautifulSoup only sees the HTML the server sends. It does not run JavaScript, so it is not always the right choice in 2026, when many sites render content client-side. Match the tool to the job:
| Tool | Best for | JavaScript-rendered pages | Speed | Learning curve |
|---|---|---|---|---|
| BeautifulSoup + requests | Static HTML, small/medium jobs, prototypes | No | Fast | Easy |
| Scrapy | Large, structured crawls; many pages; built-in pipelines & throttling | No (needs a plugin) | Very fast | Steeper |
| Playwright / Selenium | JavaScript-heavy sites, logins, infinite scroll, SPAs | Yes (real browser) | Slow | Medium |
Rules of thumb:
- Content missing from response.text? It is probably rendered by JavaScript: use a headless browser like Playwright or Selenium, or look for the underlying JSON/XHR endpoint the page calls.
- Crawling thousands of pages with queues, retries, and export pipelines? Reach for Scrapy.
- A public API exists? Use it. An API is faster, more stable, and friendlier than scraping HTML (see our guide to calling APIs with Python requests). For XML feeds and sitemaps, parsing XML in Python is a cleaner path than scraping rendered markup.
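The first rule of thumb can be turned into a quick check: parse the raw HTML the server sent and see whether the element you care about is already there. The sketch below uses two invented HTML strings in place of real responses, and `has_static_content` is a hypothetical helper name.

```python
from bs4 import BeautifulSoup


def has_static_content(html: str, selector: str) -> bool:
    """True if the server-sent HTML already contains an element matching selector."""
    return BeautifulSoup(html, "html.parser").select_one(selector) is not None


# Stand-ins for response.text from two kinds of site (illustrative only).
server_rendered = '<div id="app"><div class="quote">Hello</div></div>'
js_shell = '<div id="app"></div><script src="/bundle.js"></script>'

print(has_static_content(server_rendered, "div.quote"))  # True
print(has_static_content(js_shell, "div.quote"))         # False
```

A False result on a page that clearly shows the data in a browser suggests the content arrives via JavaScript, which is the cue to look for a JSON endpoint or switch to a browser-based tool.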
Scrape ethically and legally
Scraping is a powerful tool, and with it comes responsibility. Before you scrape anything:
- Check robots.txt (e.g. https://example.com/robots.txt) and honour its disallow rules. Python's standard library has urllib.robotparser to read it programmatically.
- Read the Terms of Service. Some sites explicitly forbid automated access; respect that.
- Rate-limit yourself. Add delays, scrape during off-peak hours, and cache responses so you never re-fetch the same page unnecessarily.
- Prefer official APIs and data feeds when they exist — they are the sanctioned, stable interface.
- Do not collect personal data without a lawful basis. Regulations such as the GDPR and CCPA apply to scraped personal information just as they do to any other source.
- Identify your bot with an honest User-Agent and a contact URL so site owners can reach you.
Good scraping is invisible to the site you are scraping. If your script would degrade someone's service or breach their terms, stop and find a better source.
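The caching advice in the list above can be sketched with the standard library alone. This is one possible approach under stated assumptions: `cached_fetch` is a hypothetical helper, the cache is a directory of files named by a hash of the URL, and a stand-in fetcher replaces the real HTTP call so the example runs offline. It has no expiry logic, which a real scraper would likely want.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Callable


def cached_fetch(url: str, fetch: Callable[[str], str], cache_dir: Path) -> str:
    """Return cached HTML for url if present; otherwise call fetch and store the result."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    key = hashlib.sha256(url.encode("utf-8")).hexdigest()
    path = cache_dir / f"{key}.html"
    if path.exists():
        return path.read_text(encoding="utf-8")
    html = fetch(url)
    path.write_text(html, encoding="utf-8")
    return html


# Stand-in fetcher that records how often it is called (illustration only).
calls = []

def fake_fetch(url: str) -> str:
    calls.append(url)
    return f"<html><body>{url}</body></html>"


with tempfile.TemporaryDirectory() as tmp:
    first = cached_fetch("https://example.com/a", fake_fetch, Path(tmp))
    second = cached_fetch("https://example.com/a", fake_fetch, Path(tmp))

print(first == second, len(calls))  # True 1
```

In a real script the `fetch` argument would be a function that performs the HTTP request, so repeated development runs hit the disk instead of the site.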
from urllib.robotparser import RobotFileParser

def may_fetch(url: str, user_agent: str = "MyScraper") -> bool:
    """Check robots.txt before fetching a URL."""
    rp = RobotFileParser()
    rp.set_url("https://quotes.toscrape.com/robots.txt")
    rp.read()
    return rp.can_fetch(user_agent, url)

print(may_fetch("https://quotes.toscrape.com/page/2/"))  # True if allowed

Where this fits in real projects
At MicroPyramid we have spent 12+ years building Python data and automation systems across 50+ projects — from scheduled scrapers feeding analytics pipelines to ingestion jobs that clean and structure web content for RAG and AI knowledge systems. When a scraper needs to serve data over an HTTP API, we wrap it in a FastAPI service with proper validation, retries, and observability. BeautifulSoup is usually where a project starts; the engineering around it — reliability, scheduling, compliance, and storage — is what makes it production-ready.
Frequently Asked Questions
What is BeautifulSoup used for?
BeautifulSoup is a Python library for parsing HTML and XML. It turns messy markup into a navigable tree so you can search and extract data — links, text, tables, attributes — using tag names, attributes, or CSS selectors. It is the go-to choice for scraping static, server-rendered pages and for prototypes.
Is BeautifulSoup better than Scrapy?
They solve different problems. BeautifulSoup is a parsing library that is quick to learn and ideal for small-to-medium jobs. Scrapy is a full crawling framework with built-in request scheduling, throttling, and export pipelines, which suits large, structured crawls across many pages. Many teams use BeautifulSoup inside Scrapy spiders to get the best of both.
Can BeautifulSoup scrape JavaScript-rendered pages?
No. BeautifulSoup only sees the HTML the server returns; it does not execute JavaScript. If the data you want is missing from response.text, the page renders it client-side. Use a headless browser such as Playwright or Selenium, or call the underlying JSON/XHR API the page uses.
Which parser should I use with BeautifulSoup?
Use lxml for speed and lenient parsing in most projects (pip install lxml). The built-in html.parser needs no extra install and is fine for small scripts but is slower. For strict XML, use the xml parser, which also requires lxml.
How do I find elements with BeautifulSoup?
Use find and find_all to search by tag name and attributes (for example soup.find_all('div', class_='quote')), or use select and select_one with CSS selectors (for example soup.select('div.quote span.text')). CSS selectors are often the most concise way to target nested elements.
Is web scraping legal?
Scraping publicly available data is generally permissible, but it depends on the site's Terms of Service, the data involved, and your jurisdiction. Always check robots.txt, respect rate limits, avoid collecting personal data without a lawful basis (GDPR/CCPA still apply), and prefer official APIs where they exist. When in doubt, get legal advice for your specific use case.