Monday, 13 January 2025

8. Web Scraping at Scale

Lecture Notes: Web Scraping at Scale

1. Distributed Scraping Using Tools

1.1 Scrapy

Scrapy is a powerful web scraping framework built on an asynchronous engine, which makes it highly efficient for large-scale crawls. On its own it runs on a single machine; paired with extensions such as scrapy-redis (a shared Redis request queue) or a deployment service like Scrapyd, it can be scaled out into a distributed setup.

Key Features

  • Built-in support for asynchronous requests.
  • Middleware for request and response processing.
  • Extensible architecture.

Code Example: A Basic Scrapy Spider

# Install Scrapy: pip install scrapy

# Example Scrapy spider
import scrapy

class ProductSpider(scrapy.Spider):
    name = "products"
    start_urls = ["https://example.com/products"]

    def parse(self, response):
        for product in response.css(".product"):
            yield {
                'name': product.css(".name::text").get(),
                'price': product.css(".price::text").get(),
            }

# Run inside a Scrapy project with: scrapy crawl products
# Or run this file standalone with: scrapy runspider this_spider.py -o products.json
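
The spider above runs on a single machine. To make the crawl genuinely distributed, a common approach is the scrapy-redis extension, which shares the request queue and duplicate filter through Redis so that several identical workers can cooperate on one crawl. The sketch below assumes scrapy-redis is installed and Redis is reachable at localhost:6379; the spider name and redis_key value are illustrative.

Code Example: Distributed Crawling with scrapy-redis (Sketch)

# Install the extension: pip install scrapy-redis
from scrapy_redis.spiders import RedisSpider

class DistributedProductSpider(RedisSpider):
    name = "distributed_products"
    # Workers pull start URLs from this Redis list instead of start_urls
    redis_key = "products:start_urls"

    custom_settings = {
        # Share the request queue and duplicate filter through Redis
        "SCHEDULER": "scrapy_redis.scheduler.Scheduler",
        "DUPEFILTER_CLASS": "scrapy_redis.dupefilter.RFPDupeFilter",
        "SCHEDULER_PERSIST": True,  # keep the queue between runs
        "REDIS_URL": "redis://localhost:6379/0",
    }

    def parse(self, response):
        for product in response.css(".product"):
            yield {
                'name': product.css(".name::text").get(),
                'price': product.css(".price::text").get(),
            }

# Start the same spider on several machines, then seed the shared queue:
#   redis-cli lpush products:start_urls https://example.com/products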

1.2 Celery

Celery is a distributed task queue that runs tasks in parallel across one or more worker processes or machines, which makes it well suited to scaling out scraping jobs.

Code Example: Using Celery for Scraping

# Install Celery: pip install celery
from celery import Celery
import requests

# Redis acts as the message broker that hands tasks to workers
app = Celery('tasks', broker='redis://localhost:6379/0')

@app.task
def scrape_page(url):
    response = requests.get(url, timeout=10)
    return response.text

# Start a worker in a separate terminal (assuming this file is saved as tasks.py):
#   celery -A tasks worker --loglevel=info

# Schedule scraping tasks (each call is queued and picked up by a worker)
urls = ["https://example.com/page1", "https://example.com/page2"]
for url in urls:
    scrape_page.delay(url)
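
Without a result backend, the return value of scrape_page is discarded, which is fine when workers write results to a database or file themselves. If the caller needs the scraped HTML back, a result backend can be configured. A minimal sketch, assuming Redis is also used as the backend:

Code Example: Retrieving Task Results with a Result Backend (Sketch)

from celery import Celery
import requests

# Same Redis broker as above, plus a result backend so return values are stored
app = Celery('tasks',
             broker='redis://localhost:6379/0',
             backend='redis://localhost:6379/1')

@app.task
def scrape_page(url):
    return requests.get(url, timeout=10).text

# Queue the task, then block until a worker has fetched the page
async_result = scrape_page.delay("https://example.com/page1")
html = async_result.get(timeout=30)
print(f"Fetched {len(html)} characters")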

2. Managing Concurrency and Parallelism

2.1 Threading

Threading runs multiple threads concurrently within a single process. Because scraping spends most of its time waiting on network I/O (during which Python's GIL is released), threading works well for I/O-bound tasks like web scraping.

Code Example: Using Threading

import threading
import requests

def scrape(url):
    response = requests.get(url)
    print(f"Scraped {url}: {len(response.content)} bytes")

urls = ["https://example.com/page1", "https://example.com/page2"]
threads = []

for url in urls:
    thread = threading.Thread(target=scrape, args=(url,))
    threads.append(thread)
    thread.start()

for thread in threads:
    thread.join()
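
Creating one thread per URL does not bound concurrency, which matters once the URL list grows. A common alternative is a fixed-size thread pool from the standard library's concurrent.futures module; the sketch below uses an illustrative pool size of 8.

Code Example: Thread Pool with concurrent.futures (Sketch)

from concurrent.futures import ThreadPoolExecutor, as_completed
import requests

def scrape(url):
    response = requests.get(url, timeout=10)
    return f"Scraped {url}: {len(response.content)} bytes"

urls = ["https://example.com/page1", "https://example.com/page2"]

# A fixed-size pool caps how many requests run at the same time
with ThreadPoolExecutor(max_workers=8) as executor:
    futures = [executor.submit(scrape, url) for url in urls]
    for future in as_completed(futures):
        print(future.result())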

2.2 Multiprocessing

Multiprocessing runs work in separate processes, each with its own interpreter, so it bypasses the GIL and is ideal for CPU-bound tasks such as heavy parsing or post-processing of scraped pages.

Code Example: Using Multiprocessing

from multiprocessing import Pool
import requests

def scrape(url):
    response = requests.get(url)
    return f"Scraped {url}: {len(response.content)} bytes"

urls = ["https://example.com/page1", "https://example.com/page2"]

# The __main__ guard is required on platforms that spawn worker processes
if __name__ == "__main__":
    with Pool(4) as pool:
        results = pool.map(scrape, urls)
        for result in results:
            print(result)

3. Caching Strategies to Reduce Redundant Requests

3.1 Using HTTP Caching

HTTP caching stores responses locally so that repeated requests for the same URL are served from the cache instead of hitting the site again.

Code Example: Using Requests Cache

# Install requests-cache: pip install requests-cache
import requests
import requests_cache

# Transparently cache all requests.get() calls in a local SQLite database
requests_cache.install_cache('scraping_cache', expire_after=3600)

response = requests.get("https://example.com/data")
print(f"Cache hit: {response.from_cache}")  # True when the response came from the cache

3.2 Custom Caching

When you need finer control, such as caching only certain endpoints or storing parsed results, you can implement a simple file-based cache yourself.

Code Example: Custom Caching

import os
import hashlib
import requests

def fetch_with_cache(url, cache_dir="cache"):
    os.makedirs(cache_dir, exist_ok=True)

    # Use an MD5 hash of the URL as a filesystem-safe cache key
    cache_file = os.path.join(cache_dir, hashlib.md5(url.encode()).hexdigest())

    if os.path.exists(cache_file):
        with open(cache_file, "r", encoding="utf-8") as f:
            return f.read()

    response = requests.get(url, timeout=10)
    response.raise_for_status()  # do not cache error pages
    with open(cache_file, "w", encoding="utf-8") as f:
        f.write(response.text)

    return response.text

html = fetch_with_cache("https://example.com/data")
print(html[:100])
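
The cache above never expires, so stale pages are served forever. One possible extension, assuming a file's modification time is an acceptable freshness signal, is to add a max_age parameter (the 3600-second default is illustrative):

Code Example: Custom Caching with Expiry (Sketch)

import os
import time
import hashlib
import requests

def fetch_with_cache(url, cache_dir="cache", max_age=3600):
    """Return cached HTML only if the cached copy is younger than max_age seconds."""
    os.makedirs(cache_dir, exist_ok=True)
    cache_file = os.path.join(cache_dir, hashlib.md5(url.encode()).hexdigest())

    if os.path.exists(cache_file):
        age = time.time() - os.path.getmtime(cache_file)
        if age < max_age:
            with open(cache_file, "r", encoding="utf-8") as f:
                return f.read()

    response = requests.get(url, timeout=10)
    response.raise_for_status()  # do not cache error pages
    with open(cache_file, "w", encoding="utf-8") as f:
        f.write(response.text)
    return response.text

html = fetch_with_cache("https://example.com/data", max_age=1800)
print(html[:100])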

4. Monitoring and Maintaining Scraping Jobs

4.1 Logging

Logging helps you monitor scraping activity, track progress, and debug failures in long-running jobs.

Code Example: Basic Logging

import logging

logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(message)s')
logger = logging.getLogger()

logger.info("Starting scraping job")
# Simulated scraping job
logger.info("Scraping page 1")
logger.info("Scraping page 2")
logger.info("Job completed")

4.2 Job Scheduling

Use a scheduler such as cron, or a Python library like schedule, to run scraping jobs automatically at fixed times.

Code Example: Using Schedule Library

# Install schedule: pip install schedule
import schedule
import time


def scrape_job():
    print("Running scraping job...")
    # Simulate scraping
    time.sleep(2)
    print("Job completed")

schedule.every().day.at("10:00").do(scrape_job)

# Keep the process alive, checking once per second for jobs that are due
while True:
    schedule.run_pending()
    time.sleep(1)

Quiz

Objective: Test understanding of large-scale web scraping concepts.

Questions

  1. Which tool is best suited for distributed web scraping?

    • a) Scrapy
    • b) BeautifulSoup
    • c) Requests
    • d) Selenium
  2. What is the main advantage of using Celery in web scraping?

    • a) Asynchronous task execution
    • b) Database management
    • c) Rendering JavaScript
    • d) HTML parsing
  3. Which Python library supports HTTP caching for web scraping?

    • a) pandas
    • b) requests-cache
    • c) scrapy
    • d) schedule
  4. What is the primary purpose of logging in scraping jobs?

    • a) Improve scraping speed
    • b) Track and debug scraping activity
    • c) Parse data
    • d) Handle proxies
  5. What is a benefit of using multiprocessing for web scraping?

    • a) Handles I/O-bound tasks efficiently
    • b) Improves performance for CPU-bound tasks
    • c) Reduces memory usage
    • d) Automates scheduling

Answers

  1. a) Scrapy
  2. a) Asynchronous task execution
  3. b) requests-cache
  4. b) Track and debug scraping activity
  5. b) Improves performance for CPU-bound tasks
