Monday, 13 January 2025

8. Web Scraping at Scale

Lecture Notes: Web Scraping at Scale

1. Distributed Scraping Using Tools

1.1 Scrapy

Scrapy is a powerful web scraping framework built on an asynchronous engine, which makes it highly efficient for large-scale crawls. On its own it runs on a single machine; paired with extensions such as scrapy-redis (a shared Redis request queue) or a deployment service like Scrapyd, it can be scaled out into a distributed setup.

Key Features

  • Built-in support for asynchronous requests.
  • Middleware for request and response processing.
  • Extensible architecture.

Code Example: A Basic Scrapy Spider

# Install Scrapy: pip install scrapy

# Example Scrapy spider
import scrapy

class ProductSpider(scrapy.Spider):
    name = "products"
    start_urls = ["https://example.com/products"]

    def parse(self, response):
        for product in response.css(".product"):
            yield {
                'name': product.css(".name::text").get(),
                'price': product.css(".price::text").get(),
            }

# Run inside a Scrapy project with: scrapy crawl products
# Or run this file standalone with: scrapy runspider this_spider.py -o products.json
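
The spider above runs on a single machine. To make the crawl genuinely distributed, a common approach is the scrapy-redis extension, which shares the request queue and duplicate filter through Redis so that several identical workers can cooperate on one crawl. The sketch below assumes scrapy-redis is installed and Redis is reachable at localhost:6379; the spider name and redis_key value are illustrative.

Code Example: Distributed Crawling with scrapy-redis (Sketch)

# Install the extension: pip install scrapy-redis
from scrapy_redis.spiders import RedisSpider

class DistributedProductSpider(RedisSpider):
    name = "distributed_products"
    # Workers pull start URLs from this Redis list instead of start_urls
    redis_key = "products:start_urls"

    custom_settings = {
        # Share the request queue and duplicate filter through Redis
        "SCHEDULER": "scrapy_redis.scheduler.Scheduler",
        "DUPEFILTER_CLASS": "scrapy_redis.dupefilter.RFPDupeFilter",
        "SCHEDULER_PERSIST": True,  # keep the queue between runs
        "REDIS_URL": "redis://localhost:6379/0",
    }

    def parse(self, response):
        for product in response.css(".product"):
            yield {
                'name': product.css(".name::text").get(),
                'price': product.css(".price::text").get(),
            }

# Start the same spider on several machines, then seed the shared queue:
#   redis-cli lpush products:start_urls https://example.com/products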

1.2 Celery

Celery is a distributed task queue that runs tasks in parallel across one or more worker processes or machines, which makes it well suited to scaling out scraping jobs.

Code Example: Using Celery for Scraping

# Install Celery: pip install celery
from celery import Celery
import requests

# Redis acts as the message broker that hands tasks to workers
app = Celery('tasks', broker='redis://localhost:6379/0')

@app.task
def scrape_page(url):
    response = requests.get(url, timeout=10)
    return response.text

# Start a worker in a separate terminal (assuming this file is saved as tasks.py):
#   celery -A tasks worker --loglevel=info

# Schedule scraping tasks (each call is queued and picked up by a worker)
urls = ["https://example.com/page1", "https://example.com/page2"]
for url in urls:
    scrape_page.delay(url)
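
Without a result backend, the return value of scrape_page is discarded, which is fine when workers write results to a database or file themselves. If the caller needs the scraped HTML back, a result backend can be configured. A minimal sketch, assuming Redis is also used as the backend:

Code Example: Retrieving Task Results with a Result Backend (Sketch)

from celery import Celery
import requests

# Same Redis broker as above, plus a result backend so return values are stored
app = Celery('tasks',
             broker='redis://localhost:6379/0',
             backend='redis://localhost:6379/1')

@app.task
def scrape_page(url):
    return requests.get(url, timeout=10).text

# Queue the task, then block until a worker has fetched the page
async_result = scrape_page.delay("https://example.com/page1")
html = async_result.get(timeout=30)
print(f"Fetched {len(html)} characters")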

2. Managing Concurrency and Parallelism

2.1 Threading

Threading runs multiple threads concurrently within a single process. Because scraping spends most of its time waiting on network I/O (during which Python's GIL is released), threading works well for I/O-bound tasks like web scraping.

Code Example: Using Threading

import threading
import requests

def scrape(url):
    response = requests.get(url)
    print(f"Scraped {url}: {len(response.content)} bytes")

urls = ["https://example.com/page1", "https://example.com/page2"]
threads = []

for url in urls:
    thread = threading.Thread(target=scrape, args=(url,))
    threads.append(thread)
    thread.start()

for thread in threads:
    thread.join()
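
Creating one thread per URL does not bound concurrency, which matters once the URL list grows. A common alternative is a fixed-size thread pool from the standard library's concurrent.futures module; the sketch below uses an illustrative pool size of 8.

Code Example: Thread Pool with concurrent.futures (Sketch)

from concurrent.futures import ThreadPoolExecutor, as_completed
import requests

def scrape(url):
    response = requests.get(url, timeout=10)
    return f"Scraped {url}: {len(response.content)} bytes"

urls = ["https://example.com/page1", "https://example.com/page2"]

# A fixed-size pool caps how many requests run at the same time
with ThreadPoolExecutor(max_workers=8) as executor:
    futures = [executor.submit(scrape, url) for url in urls]
    for future in as_completed(futures):
        print(future.result())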

2.2 Multiprocessing

Multiprocessing runs work in separate processes, each with its own interpreter, so it bypasses the GIL and is ideal for CPU-bound tasks such as heavy parsing or post-processing of scraped pages.

Code Example: Using Multiprocessing

from multiprocessing import Pool
import requests

def scrape(url):
    response = requests.get(url)
    return f"Scraped {url}: {len(response.content)} bytes"

urls = ["https://example.com/page1", "https://example.com/page2"]

# The __main__ guard is required on platforms that spawn worker processes
if __name__ == "__main__":
    with Pool(4) as pool:
        results = pool.map(scrape, urls)
        for result in results:
            print(result)

3. Caching Strategies to Reduce Redundant Requests

3.1 Using HTTP Caching

HTTP caching stores responses locally so that repeated requests for the same URL are served from the cache instead of hitting the site again.

Code Example: Using Requests Cache

# Install requests-cache: pip install requests-cache
import requests
import requests_cache

# Transparently cache all requests.get() calls in a local SQLite database
requests_cache.install_cache('scraping_cache', expire_after=3600)

response = requests.get("https://example.com/data")
print(f"Cache hit: {response.from_cache}")  # True when the response came from the cache

3.2 Custom Caching

When you need finer control, such as caching only certain endpoints or storing parsed results, you can implement a simple file-based cache yourself.

Code Example: Custom Caching

import os
import hashlib
import requests

def fetch_with_cache(url, cache_dir="cache"):
    os.makedirs(cache_dir, exist_ok=True)

    # Use an MD5 hash of the URL as a filesystem-safe cache key
    cache_file = os.path.join(cache_dir, hashlib.md5(url.encode()).hexdigest())

    if os.path.exists(cache_file):
        with open(cache_file, "r", encoding="utf-8") as f:
            return f.read()

    response = requests.get(url, timeout=10)
    response.raise_for_status()  # do not cache error pages
    with open(cache_file, "w", encoding="utf-8") as f:
        f.write(response.text)

    return response.text

html = fetch_with_cache("https://example.com/data")
print(html[:100])
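
The cache above never expires, so stale pages are served forever. One possible extension, assuming a file's modification time is an acceptable freshness signal, is to add a max_age parameter (the 3600-second default is illustrative):

Code Example: Custom Caching with Expiry (Sketch)

import os
import time
import hashlib
import requests

def fetch_with_cache(url, cache_dir="cache", max_age=3600):
    """Return cached HTML only if the cached copy is younger than max_age seconds."""
    os.makedirs(cache_dir, exist_ok=True)
    cache_file = os.path.join(cache_dir, hashlib.md5(url.encode()).hexdigest())

    if os.path.exists(cache_file):
        age = time.time() - os.path.getmtime(cache_file)
        if age < max_age:
            with open(cache_file, "r", encoding="utf-8") as f:
                return f.read()

    response = requests.get(url, timeout=10)
    response.raise_for_status()  # do not cache error pages
    with open(cache_file, "w", encoding="utf-8") as f:
        f.write(response.text)
    return response.text

html = fetch_with_cache("https://example.com/data", max_age=1800)
print(html[:100])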

4. Monitoring and Maintaining Scraping Jobs

4.1 Logging

Logging helps you monitor scraping activity, track progress, and debug failures in long-running jobs.

Code Example: Basic Logging

import logging

logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(message)s')
logger = logging.getLogger()

logger.info("Starting scraping job")
# Simulated scraping job
logger.info("Scraping page 1")
logger.info("Scraping page 2")
logger.info("Job completed")

4.2 Job Scheduling

Use a scheduler such as cron, or a Python library like schedule, to run scraping jobs automatically at fixed times.

Code Example: Using Schedule Library

# Install schedule: pip install schedule
import schedule
import time


def scrape_job():
    print("Running scraping job...")
    # Simulate scraping
    time.sleep(2)
    print("Job completed")

schedule.every().day.at("10:00").do(scrape_job)

# Keep the process alive, checking once per second for jobs that are due
while True:
    schedule.run_pending()
    time.sleep(1)

Quiz

Objective: Test understanding of large-scale web scraping concepts.

Questions

  1. Which tool is best suited for distributed web scraping?

    • a) Scrapy
    • b) BeautifulSoup
    • c) Requests
    • d) Selenium
  2. What is the main advantage of using Celery in web scraping?

    • a) Asynchronous task execution
    • b) Database management
    • c) Rendering JavaScript
    • d) HTML parsing
  3. Which Python library supports HTTP caching for web scraping?

    • a) pandas
    • b) requests-cache
    • c) scrapy
    • d) schedule
  4. What is the primary purpose of logging in scraping jobs?

    • a) Improve scraping speed
    • b) Track and debug scraping activity
    • c) Parse data
    • d) Handle proxies
  5. What is a benefit of using multiprocessing for web scraping?

    • a) Handles I/O-bound tasks efficiently
    • b) Improves performance for CPU-bound tasks
    • c) Reduces memory usage
    • d) Automates scheduling

Answers

  1. a) Scrapy
  2. a) Asynchronous task execution
  3. b) requests-cache
  4. b) Track and debug scraping activity
  5. b) Improves performance for CPU-bound tasks
