January 20, 2025

Web Crawling with Python: The Ultimate Guide

What is Web Crawling?
Why Use Python for Web Crawling?
What’s the Difference Between Crawling and Scraping?
How a Web Crawler Works
Best Tools for Web Crawling in Python
Additional Tool: How re Helps in Web Crawling
How to Build a Simple Web Crawler in Python
Better Tools and Improvements for Faster and More Efficient Crawling
Simple Ways to Store Your Data
Creating a Crawler with Scrapy
Dynamic Web Crawling Made Easy
Conclusion

Web crawling is a helpful way to collect data from the internet, often used for indexing websites, tracking changes, or gathering large amounts of information. In this article, we’ll break down the basics of web crawling, introduce you to useful tools and libraries in Python, and walk you through simple examples to help you get started!

What is Web Crawling?

Web crawling is the process of automatically navigating the internet to gather information from websites. It involves exploring multiple pages on a single site (or even across many sites) to collect vast amounts of data. Large-scale crawlers are used by search engines and other companies to index websites and gather data for various purposes.

For example, Googlebot visits billions of pages every day, following links between pages and websites to keep Google’s search results up to date. It starts from a few key URLs and then follows the links on those pages to discover new ones. It uses algorithms to decide which pages to crawl and how often, so that search results stay relevant for users.

Why Use Python for Web Crawling?

Python is a great choice for web crawling because it’s simple to learn and has many helpful libraries. Tools like Scrapy, BeautifulSoup, and Selenium make it easy to crawl websites and collect data, no matter how simple or complex the task is.

What’s the Difference Between Crawling and Scraping?

Web crawling and web scraping are closely related, but they’re not the same thing.

Web Crawling is like a spider that moves from page to page across a website (or even multiple websites) to collect data. It’s more about exploring and indexing large amounts of information, usually by following links between pages.

Web Scraping, on the other hand, focuses on extracting specific pieces of data from a webpage. It’s like zooming in on the details, such as gathering product prices, contact info, or text from a single page or a set of pages.

So, crawling is the process of discovering and collecting data across many pages, while scraping is the act of pulling out specific information from those pages.
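
A minimal sketch of that split, using only Python's standard library. The HTML snippet, the `price` class name, and both parser classes are invented for illustration; a real site will use different markup.

```python
from html.parser import HTMLParser

# Hypothetical product page used only for this illustration.
PAGE = """
<html><body>
  <h1>Blue Widget</h1>
  <span class="price">19.99</span>
  <a href="/widgets/red">Red Widget</a>
  <a href="/widgets/green">Green Widget</a>
</body></html>
"""


class LinkCollector(HTMLParser):
    """Crawling: discover where to go next by collecting every href."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


class PriceExtractor(HTMLParser):
    """Scraping: pull one specific value out of the page."""

    def __init__(self):
        super().__init__()
        self._in_price = False
        self.price = None

    def handle_starttag(self, tag, attrs):
        if tag == "span" and dict(attrs).get("class") == "price":
            self._in_price = True

    def handle_data(self, data):
        if self._in_price:
            self.price = data.strip()
            self._in_price = False


def collect_links(html):
    parser = LinkCollector()
    parser.feed(html)
    return parser.links


def extract_price(html):
    parser = PriceExtractor()
    parser.feed(html)
    return parser.price


print(collect_links(PAGE))  # the links a crawler would follow
print(extract_price(PAGE))  # the value a scraper would keep
```

The same page feeds both functions: the crawler cares about where the links lead, and the scraper cares about one field on the page.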

How a Web Crawler Works

Let’s walk through an example of how a basic web crawler gathers information from a website.

Starting with Seed URLs

Imagine you want to collect information about blog posts on a website. Your seed URL could be the homepage of the blog, such as https://example.com.

Requesting the Web Page

The crawler sends an HTTP request to https://example.com, asking the server to send back the HTML content of the homepage. The server responds with the HTML of the page.
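
As a rough sketch of this step, the function below fetches a page with the standard library's `urllib.request`. The function name `fetch_html`, the 10-second timeout, and the `example-crawler/0.1` User-Agent string are assumptions made for the example, not values required by any site.

```python
import urllib.request


def fetch_html(url, timeout=10):
    """Return the response body as text, or None if the request fails."""
    try:
        request = urllib.request.Request(
            url,
            headers={"User-Agent": "example-crawler/0.1"},  # placeholder name
        )
        with urllib.request.urlopen(request, timeout=timeout) as response:
            # Fall back to UTF-8 when the server does not declare a charset.
            charset = response.headers.get_content_charset() or "utf-8"
            return response.read().decode(charset, errors="replace")
    except (OSError, ValueError) as error:
        # URLError (including HTTP errors) and timeouts are OSError
        # subclasses; a malformed URL raises ValueError.
        print(f"Failed to fetch {url}: {error}")
        return None
```

Calling `fetch_html("https://example.com")` would return the homepage HTML as a string, or `None` if the server could not be reached.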

Parsing the HTML Content

The crawler then parses the HTML of the homepage. It looks for specific elements, such as links to blog posts (which are usually contained in <a> tags) and other useful information like page titles or metadata.
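
Here is one way the parsing step might look with the standard library's `html.parser`. The `PageParser` class, the `parse_page` helper, and the sample HTML are illustrative; libraries such as BeautifulSoup do the same job with less code.

```python
from html.parser import HTMLParser


class PageParser(HTMLParser):
    """Collect the <title> text and the href of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:  # skip anchors that have no href attribute
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def parse_page(html):
    parser = PageParser()
    parser.feed(html)
    return parser.title.strip(), parser.links


sample = """
<html><head><title>Example Blog</title></head>
<body>
  <a href="/blog/post1">Post 1</a>
  <a href="/blog/post2">Post 2</a>
  <a name="top">no href here</a>
</body></html>
"""
title, links = parse_page(sample)
print(title)
print(links)
```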

Extracting Links

From the homepage, the crawler finds links to other pages. Suppose it finds the following links:

https://example.com/blog/post1
https://example.com/blog/post2
https://example.com/about

The crawler adds these links to its list of pages to visit.
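
Links in HTML are often relative, so a crawler usually resolves them against the page URL before queueing them. The sketch below assumes a `deque` as the list of pages to visit; the `resolve_links` helper and the sample `hrefs` are made up for the example.

```python
from collections import deque
from urllib.parse import urljoin


def resolve_links(page_url, hrefs):
    """Turn relative hrefs into absolute URLs based on the page they came from."""
    return [urljoin(page_url, href) for href in hrefs]


page_url = "https://example.com/"
hrefs = ["/blog/post1", "blog/post2", "https://example.com/about"]

to_visit = deque()  # pages waiting to be crawled, oldest first
to_visit.extend(resolve_links(page_url, hrefs))

for url in to_visit:
    print(url)
```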

Following Links

The crawler now requests the first blog post, https://example.com/blog/post1. It sends another HTTP request and retrieves the HTML content for that page.

Parsing the Blog Post

On the blog post page, the crawler looks for additional links (e.g., links to other blog posts, categories, or tags) and data (e.g., the blog post title, author, and publication date). The data is extracted and stored.
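
How the title, author, and date are found depends entirely on the site's markup. The sketch below assumes a hypothetical post that puts the title in `<h1>`, the author in a `<meta name="author">` tag, and the date in a `<time datetime="...">` tag; other sites will need different rules.

```python
from html.parser import HTMLParser


class PostParser(HTMLParser):
    """Extract title, author, and date from the assumed markup."""

    def __init__(self):
        super().__init__()
        self.record = {"title": None, "author": None, "date": None}
        self._in_h1 = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "h1":
            self._in_h1 = True
        elif tag == "meta" and attrs.get("name") == "author":
            self.record["author"] = attrs.get("content")
        elif tag == "time" and "datetime" in attrs:
            self.record["date"] = attrs["datetime"]

    def handle_endtag(self, tag):
        if tag == "h1":
            self._in_h1 = False

    def handle_data(self, data):
        # Keep only the first <h1> text as the post title.
        if self._in_h1 and self.record["title"] is None:
            self.record["title"] = data.strip()


def parse_post(html):
    parser = PostParser()
    parser.feed(html)
    return parser.record


post_html = """
<html><head><meta name="author" content="Jane Doe"></head>
<body><article>
  <h1>First Post</h1>
  <time datetime="2025-01-20">January 20, 2025</time>
  <p>Hello, world.</p>
</article></body></html>
"""
print(parse_post(post_html))
```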

Extracting More Links

From https://example.com/blog/post1, the crawler finds links to other posts:

https://example.com/blog/post3
https://example.com/blog/post4

These new links are added to the list of URLs to crawl.

Storing Data

The crawler collects the blog post title, author, date, and content from https://example.com/blog/post1 and stores it in a structured format, like a database or CSV file.
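
As an illustration of a structured format, the sketch below writes post records as CSV with `csv.DictWriter`. The field names and the sample record are assumptions, and the output goes to an in-memory buffer so the example runs without touching the disk.

```python
import csv
import io

FIELDS = ["url", "title", "author", "date"]


def write_records(records, file):
    """Write a header row followed by one CSV row per record."""
    writer = csv.DictWriter(file, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(records)


records = [
    {
        "url": "https://example.com/blog/post1",
        "title": "First Post",
        "author": "Jane Doe",
        "date": "2025-01-20",
    },
]

buffer = io.StringIO()
write_records(records, buffer)
print(buffer.getvalue())
```

To write a real file instead, pass `open("posts.csv", "w", newline="", encoding="utf-8")` in place of the buffer.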

Avoiding Redundancy

The crawler keeps track of URLs it has already visited. If it encounters https://example.com/blog/post1 again, it will skip it to avoid crawling the same page multiple times.
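
A set makes this check cheap. The sketch below also strips `#fragment` suffixes before comparing, on the assumption that `post1` and `post1#comments` are the same page; that normalization is an extra, not something every crawler does.

```python
from urllib.parse import urldefrag


def normalize(url):
    """Drop the #fragment so in-page anchors map to the same page."""
    return urldefrag(url).url


def filter_unvisited(urls, visited):
    """Return URLs not seen before and record them in the visited set."""
    fresh = []
    for url in urls:
        key = normalize(url)
        if key not in visited:
            visited.add(key)
            fresh.append(key)
    return fresh


visited = {"https://example.com/blog/post1"}
candidates = [
    "https://example.com/blog/post1#comments",  # already visited
    "https://example.com/blog/post3",
    "https://example.com/blog/post3",  # duplicate within the same batch
]
print(filter_unvisited(candidates, visited))
```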

Before starting the crawl, the crawler checks the robots.txt file at https://example.com/robots.txt to ensure it’s allowed to crawl the site. If the file disallows crawling certain sections of the website (like an admin panel), the crawler will avoid those areas.
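
Python's standard library includes `urllib.robotparser` for this check. In the sketch below the robots.txt content is a hard-coded sample and `example-crawler` is a placeholder user agent, so nothing is fetched over the network.

```python
from urllib.robotparser import RobotFileParser

# Sample rules standing in for https://example.com/robots.txt
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
"""

robots = RobotFileParser()
robots.parse(ROBOTS_TXT.splitlines())


def is_allowed(url, user_agent="example-crawler"):
    """Return True if the parsed rules permit this user agent to fetch the URL."""
    return robots.can_fetch(user_agent, url)


print(is_allowed("https://example.com/blog/post1"))
print(is_allowed("https://example.com/admin/users"))
```

Against a live site, calling `robots.set_url("https://example.com/robots.txt")` followed by `robots.read()` would download and parse the real file instead of the sample string.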

The crawler continues this process, visiting pages, extracting links, and collecting data until it has crawled all the pages or reached its limit.
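
The loop as a whole can be sketched without any network access by replacing the website with a dictionary. `SITE`, `get_links`, and the `max_pages` parameter are all stand-ins invented for this example; in a real crawler, `get_links` would fetch and parse each page.

```python
from collections import deque

# Hypothetical site: each URL maps to the links found on that page.
SITE = {
    "https://example.com": [
        "https://example.com/blog/post1",
        "https://example.com/blog/post2",
        "https://example.com/about",
    ],
    "https://example.com/blog/post1": [
        "https://example.com/blog/post3",
        "https://example.com/blog/post4",
        "https://example.com",
    ],
    "https://example.com/blog/post2": ["https://example.com/blog/post1"],
}


def get_links(url):
    """Stand-in for fetching a page and extracting its links."""
    return SITE.get(url, [])


def crawl(seed, get_links, max_pages=100):
    """Visit pages breadth-first until the queue is empty or the limit is hit."""
    to_visit = deque([seed])
    visited = set()
    order = []
    while to_visit and len(order) < max_pages:
        url = to_visit.popleft()
        if url in visited:
            continue
        visited.add(url)
        order.append(url)
        for link in get_links(url):
            if link not in visited:
                to_visit.append(link)
    return order


for url in crawl("https://example.com", get_links):
    print(url)
```

Passing a smaller `max_pages` stops the crawl early, which is one simple way to implement the limit mentioned above.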

This basic workflow lets the crawler collect large amounts of data from across a website by following links and extracting the desired content automatically.

How to Build a Simple Web Crawler in Python

def fetch_page(self, url):
    """Download the page content."""
    try:
        # A timeout keeps one unresponsive server from stalling the crawl.
        response = requests.get(url, timeout=10)
        response.raise_for_status()
        return response.text
    except requests.exceptions.RequestException as e:
        logging.error(f"Failed to fetch {url}: {e}")
        return None

def extract_links(self, url, html):
    """Extract and yield all linked URLs from the page."""
    soup = BeautifulSoup(html, 'html.parser')
    for anchor in soup.find_all('a', href=True):
        link = anchor['href']
        # Resolve relative links against the URL of the current page.
        full_url = urljoin(url, link)
        yield full_url

def add_to_queue(self, url):
    """Add a URL to the list of URLs to visit if it's not already visited or queued."""
    if url not in self.visited and url not in self.to_visit:
        self.to_visit.append(url)

def process_page(self, url):
    """Process a single page and collect links."""
    logging.info(f'Processing: {url}')
    html = self.fetch_page(url)
    if html:
        for link in self.extract_links(url, html):
            self.add_to_queue(link)

def crawl(self):
    """Crawl the web starting from the initial URLs."""
    while self.to_visit:
        url = self.to_visit.pop(0)
        if url not in self.visited:
            self.process_page(url)
            self.visited.add(url)

import logging
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

logging.basicConfig(
    format='%(asctime)s %(levelname)s: %(message)s',
    level=logging.INFO
)


class SimpleCrawler:
    def __init__(self, start_urls=None):
        self.visited = set()
        # Copy the input so the caller's list is never mutated and no
        # default list is shared between instances.
        self.to_visit = list(start_urls or [])

    def fetch_page(self, url):
        """Download the page content."""
        try:
            # A timeout keeps one unresponsive server from stalling the crawl.
            response = requests.get(url, timeout=10)
            response.raise_for_status()
            return response.text
        except requests.exceptions.RequestException as e:
            logging.error(f"Failed to fetch {url}: {e}")
            return None

    def extract_links(self, url, html):
        """Extract and yield all linked URLs from the page."""
        soup = BeautifulSoup(html, 'html.parser')
        for anchor in soup.find_all('a', href=True):
            link = anchor['href']
            full_url = urljoin(url, link)
            yield full_url

    def add_to_queue(self, url):
        """Add a URL to the list of URLs to visit if it's not already visited or queued."""
        if url not in self.visited and url not in self.to_visit:
            self.to_visit.append(url)

    def process_page(self, url):
        """Process a single page and collect links."""
        logging.info(f'Processing: {url}')
        html = self.fetch_page(url)
        if html:
            for link in self.extract_links(url, html):
                self.add_to_queue(link)

    def crawl(self):
        """Crawl the web starting from the initial URLs."""
        # Nothing limits the crawl, so it runs until the queue is empty
        # or the script is interrupted.
        while self.to_visit:
            url = self.to_visit.pop(0)
            if url not in self.visited:
                self.process_page(url)
                self.visited.add(url)


if __name__ == '__main__':
    start_urls = ['https://www.wikipedia.org/']
    crawler = SimpleCrawler(start_urls)
    crawler.crawl()
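
The crawler above queues every link it finds, including links to other websites. One possible refinement is to keep the crawl on the starting host. The sketch below is an assumption about how that could be done, not part of the original class; the `in_scope` helper is hypothetical and would be called inside `add_to_queue` before appending a URL.

```python
from urllib.parse import urlparse


def in_scope(url, start_url):
    """Return True for http(s) URLs on the same host as the start URL."""
    target = urlparse(url)
    start = urlparse(start_url)
    return target.scheme in ("http", "https") and target.netloc == start.netloc


start = "https://www.wikipedia.org/"
print(in_scope("https://www.wikipedia.org/wiki/Python", start))
print(in_scope("https://en.wikipedia.org/", start))
print(in_scope("mailto:someone@wikipedia.org", start))
```

Comparing `netloc` values treats subdomains as separate sites, which keeps the example simple; a crawler that should cover subdomains would need a looser comparison.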

Better Tools and Improvements for Faster and More Efficient Crawling

import logging
from urllib.parse import urljoin
from concurrent.futures import ThreadPoolExecutor

import requests
from bs4 import BeautifulSoup

logging.basicConfig(
    format='%(asctime)s %(levelname)s: %(message)s',
    level=logging.INFO
)


class SimpleCrawler:
    def __init__(self, start_urls=None):
        self.visited = set()
        # Copy the input so the caller's list is never mutated.
        self.to_visit = list(start_urls or [])
        self.found_links = []

    def fetch_page(self, url):
        """Download the page content."""
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()
            return response.text
        except requests.exceptions.RequestException as e:
            logging.error(f"Failed to fetch {url}: {e}")
            return None

    def extract_links(self, url, html):
        """Collect newly discovered URLs from the page into found_links."""
        soup = BeautifulSoup(html, 'html.parser')
        for anchor in soup.find_all('a', href=True):
            link = anchor['href']
            full_url = urljoin(url, link)
            if full_url not in self.visited and full_url not in self.to_visit:
                self.found_links.append(full_url)

    def process_page(self, url):
        """Process a single page and collect links."""
        html = self.fetch_page(url)
        if html:
            self.extract_links(url, html)
        self.visited.add(url)

    def crawl(self):
        """Fetch the initial URLs in parallel and return the links found on them."""
        with ThreadPoolExecutor(max_workers=10) as executor:
            # Submit each queued URL to the thread pool
            while self.to_visit:
                url = self.to_visit.pop(0)
                if url not in self.visited:
                    executor.submit(self.process_page, url)
            # Leaving the with block waits for all submitted pages to finish
        return self.found_links


if __name__ == '__main__':
    start_urls = ['https://www.wikipedia.org/']
    crawler = SimpleCrawler(start_urls)
    links = crawler.crawl()
    print("Found links:")
    for link in links:
        print(link)
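
In the threaded version above, every worker appends to one shared list. An alternative sketch, shown here as one option rather than a required change, has each worker return its links and merges them in the main thread with `executor.map`. The `PAGES` dictionary and `fake_extract_links` are stand-ins for real fetching and parsing so the example runs offline.

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in for the web: each URL maps to the links on that page.
PAGES = {
    "https://example.com/a": ["https://example.com/b", "https://example.com/c"],
    "https://example.com/b": ["https://example.com/c"],
}


def fake_extract_links(url):
    """Placeholder for 'fetch the page, then parse its links'."""
    return PAGES.get(url, [])


def crawl_batch(urls, max_workers=4):
    """Process URLs in parallel and return the unique links they contain."""
    found = []
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # map() yields results in the same order as the input URLs.
        for links in executor.map(fake_extract_links, urls):
            found.extend(links)
    # dict.fromkeys removes duplicates while keeping first-seen order.
    return list(dict.fromkeys(found))


print(crawl_batch(["https://example.com/a", "https://example.com/b"]))
```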

Simple Ways to Store Your Data

import json
import logging
from urllib.parse import urljoin
from concurrent.futures import ThreadPoolExecutor

import requests
from bs4 import BeautifulSoup

logging.basicConfig(
    format='%(asctime)s %(levelname)s: %(message)s',
    level=logging.INFO
)


class SimpleCrawler:
    def __init__(self, start_urls=None):
        self.visited = set()
        # Copy the input so the caller's list is never mutated.
        self.to_visit = list(start_urls or [])
        self.found_links = []

    def fetch_page(self, url):
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()
            return response.text
        except requests.exceptions.RequestException as e:
            logging.error(f"Failed to fetch {url}: {e}")
            return None

    def extract_links(self, url, html):
        soup = BeautifulSoup(html, 'html.parser')
        for anchor in soup.find_all('a', href=True):
            link = anchor['href']
            full_url = urljoin(url, link)
            if full_url not in self.visited and full_url not in self.to_visit:
                self.found_links.append(full_url)

    def process_page(self, url):
        html = self.fetch_page(url)
        if html:
            self.extract_links(url, html)
        self.visited.add(url)

    def crawl(self):
        with ThreadPoolExecutor(max_workers=10) as executor:
            while self.to_visit:
                url = self.to_visit.pop(0)
                if url not in self.visited:
                    executor.submit(self.process_page, url)
            # Leaving the with block waits for all submitted pages to finish
        return self.found_links

    def save_to_json(self, filename):
        """Save the found links to a JSON file."""
        with open(filename, 'w', encoding='utf-8') as file:
            json.dump(self.found_links, file, ensure_ascii=False, indent=4)
        logging.info(f"Found links saved to {filename}")


if __name__ == '__main__':
    start_urls = ['https://www.wikipedia.org/']
    crawler = SimpleCrawler(start_urls)
    links = crawler.crawl()

    # Save the links to a JSON file
    crawler.save_to_json('found_links.json')
    print("Found links have been saved to 'found_links.json'")

import csv


class SimpleCrawler:
    # Previous code...

    def save_to_csv(self, filename):
        """Save the found links to a CSV file."""
        with open(filename, 'w', newline='', encoding='utf-8') as file:
            writer = csv.writer(file)
            writer.writerow(['URL'])  # Column header
            for link in self.found_links:
                writer.writerow([link])
        logging.info(f"Found links saved to {filename}")


if __name__ == '__main__':
    start_urls = ['https://www.wikipedia.org/']
    crawler = SimpleCrawler(start_urls)
    links = crawler.crawl()

    # Save the links to a CSV file
    crawler.save_to_csv('found_links.csv')
    print("Found links have been saved to 'found_links.csv'")

import sqlite3


class SimpleCrawler:
    # Previous code...

    def save_to_database(self, db_name):
        """Save the found links to a SQLite database."""
        conn = sqlite3.connect(db_name)
        cursor = conn.cursor()

        # Create a table if it doesn't exist
        cursor.execute('''
            CREATE TABLE IF NOT EXISTS links (
                id INTEGER PRIMARY KEY AUTOINCREMENT,
                url TEXT UNIQUE
            )
        ''')

        # Insert each link into the database; duplicates are ignored
        for link in self.found_links:
            cursor.execute('INSERT OR IGNORE INTO links (url) VALUES (?)', (link,))

        conn.commit()
        conn.close()
        logging.info(f"Found links saved to {db_name}")


if __name__ == '__main__':
    start_urls = ['https://www.wikipedia.org/']
    crawler = SimpleCrawler(start_urls)
    links = crawler.crawl()

    # Save the links to a SQLite database
    crawler.save_to_database('found_links.db')
    print("Found links have been saved to 'found_links.db'")

from openpyxl import Workbook


class SimpleCrawler:
    # Previous code...

    def save_to_excel(self, filename):
        """Save the found links to an Excel file."""
        workbook = Workbook()
        sheet = workbook.active
        sheet.append(['URL'])  # Column header
        for link in self.found_links:
            sheet.append([link])
        workbook.save(filename)
        logging.info(f"Found links saved to {filename}")


if __name__ == '__main__':
    start_urls = ['https://www.wikipedia.org/']
    crawler = SimpleCrawler(start_urls)
    links = crawler.crawl()

    # Save the links to an Excel file
    crawler.save_to_excel('found_links.xlsx')
    print("Found links have been saved to 'found_links.xlsx'")

Creating a Crawler with Scrapy

import scrapy


class AmazonSpider(scrapy.Spider):
    name = "ecommerce"

    # Starting URL
    start_urls = [
        'https://www.amazon.com/s?k=phones'
    ]

    def parse(self, response):
        # Add a log for debugging
        self.log(f"Parsing page: {response.url}")

        # Extract product details
        for product in response.css('div.s-main-slot div.s-result-item'):
            # Extract product name
            name = product.css('a.a-link-normal.s-line-clamp-2.a-text-normal span::text').get()

            if name:  # Check if we found a product name
                self.log(f"Found product: {name}")

                # Extract price (may be missing on some products)
                price = product.css('span.a-price-whole::text').get()

                # Extract rating (may not be present on all products)
                rating = product.css('i.a-icon-star-small span.a-icon-alt::text').get()

                # Extract number of reviews
                reviews = product.css('span.a-size-base.s-underline-text::text').get()

                # Extract product URL
                product_url = product.css('a.a-link-normal.s-line-clamp-2.s-link-style::attr(href)').get()

                yield {
                    'name': name,
                    'price': price,
                    'rating': rating,
                    'reviews': reviews,
                    # Without this guard, a missing link would resolve to the search page URL
                    'url': response.urljoin(product_url) if product_url else None,
                }

        # Follow the next page link
        next_page = response.css('li.a-last a::attr(href)').get()
        if next_page:
            self.log(f"Following next page: {next_page}")
            yield response.follow(next_page, self.parse)

Dynamic Web Crawling Made Easy

from playwright.sync_api import sync_playwright


def dynamic_crawler():
    with sync_playwright() as p:
        # Launch a browser
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()

        # Go to the website
        url = "https://news.ycombinator.com/"
        page.goto(url)

        # Wait for the news items to load
        page.wait_for_selector(".athing")

        # Extract data
        articles = []
        for item in page.query_selector_all(".athing"):
            title_link = item.query_selector(".titleline a")
            title = title_link.inner_text()
            link = title_link.get_attribute("href")
            rank_element = item.query_selector(".rank")
            rank = rank_element.inner_text() if rank_element else None
            articles.append({"rank": rank, "title": title, "link": link})

        browser.close()

    # Print the results
    for article in articles:
        print(article)


# Run the crawler
dynamic_crawler()

{'rank': '1.', 'title': 'A great open-source project', 'link': 'https://example.com'}
{'rank': '2.', 'title': 'How to learn Python', 'link': 'https://news.ycombinator.com/item?id=123456'}
{'rank': '3.', 'title': 'Show HN: My new tool', 'link': 'https://example.com/tool'}