Lecture Notes: Introduction to Advanced Web Scraping
Definition of Web Scraping
Web scraping is the process of programmatically extracting data from websites. While basic web scraping involves retrieving data from static HTML pages, advanced web scraping deals with dynamic content, large-scale scraping, and overcoming anti-scraping mechanisms.
Goals of Advanced Web Scraping
Scraping dynamic content rendered by JavaScript.
Efficiently handling large-scale scraping projects (see the sketch after this list).
Bypassing common anti-scraping techniques like CAPTCHA and IP bans.
Ensuring compliance with ethical and legal standards.
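To make the large-scale goal concrete, the sketch below reuses a single requests.Session so connections are pooled, and retries transient failures (including HTTP 429 rate-limit responses) with exponential backoff. The URLs are hypothetical placeholders and the retry settings are illustrative, not a recommended configuration.
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Reuse one session so TCP connections are pooled across many requests
session = requests.Session()

# Retry transient failures and rate-limit responses with exponential backoff
retries = Retry(total=3, backoff_factor=1, status_forcelist=[429, 500, 502, 503])
session.mount("https://", HTTPAdapter(max_retries=retries))

# Hypothetical list of pages for a larger crawl
urls = [f"https://example-news-site.com/page/{n}" for n in range(1, 4)]
for url in urls:
    response = session.get(url, timeout=10)
    print(url, response.status_code)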
Applications of Advanced Web Scraping
E-commerce: Price comparison and product monitoring.
News Aggregation: Collecting data from multiple news outlets.
Market Research: Gathering competitor data and trends.
Sentiment Analysis: Scraping social media for public opinions.
2. Ethical and Legal Considerations
1. Ethics of Web Scraping
Always check the website's robots.txt file for scraping permissions (see the sketch after this list).
Avoid overloading a website's server with frequent requests.
Respect user privacy and do not scrape sensitive or personal information.
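A minimal sketch of the first two points, using Python's standard urllib.robotparser to honour robots.txt and a fixed delay between requests. The base URL, paths, and bot name are hypothetical placeholders.
import time
import urllib.robotparser

import requests

BASE_URL = "https://example-news-site.com"  # hypothetical site

# Read the site's robots.txt and check whether each path may be fetched
parser = urllib.robotparser.RobotFileParser()
parser.set_url(f"{BASE_URL}/robots.txt")
parser.read()

for path in ["/news", "/archive"]:
    if parser.can_fetch("MyScraperBot", f"{BASE_URL}{path}"):
        response = requests.get(f"{BASE_URL}{path}", timeout=10)
        print(path, response.status_code)
    else:
        print(f"robots.txt disallows fetching {path}")
    time.sleep(2)  # pause between requests to avoid overloading the server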
2. Legal Considerations
Adhere to the website's terms of service.
Be cautious of copyright and intellectual property laws.
Obtain explicit permission when necessary.
3. Tools and Libraries for Advanced Scraping
Python Libraries
BeautifulSoup: For parsing HTML and XML documents.
Requests: For making HTTP requests.
Scrapy: A robust framework for large-scale scraping.
Selenium: For interacting with JavaScript-heavy websites using a browser.
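As an illustration of the Selenium entry above, the sketch below drives headless Chrome so JavaScript-rendered headlines can be read from the live DOM. It assumes Selenium 4 with Chrome installed locally; the URL and CSS selector are hypothetical.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

# Run Chrome without a visible window
options = Options()
options.add_argument("--headless=new")

driver = webdriver.Chrome(options=options)
try:
    # Hypothetical page whose headlines are rendered by JavaScript
    driver.get("https://example-news-site.com/latest")
    # Query the DOM after the page's scripts have run
    for headline in driver.find_elements(By.CSS_SELECTOR, "h2.headline"):
        print(headline.text)
finally:
    driver.quit()
Requests and Scrapy handle static pages with far less overhead; driving a real browser is only worth it when the content is absent from the initial HTML.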
Non-Python Tools
Puppeteer: A headless browser automation tool.
Playwright: For automating browser interactions across multiple browsers.
Proxy Tools: Services like ScraperAPI or Bright Data to bypass IP bans.
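For the proxy entry above, a minimal sketch of routing a request through a proxy with the requests library. The proxy address and credentials are hypothetical placeholders; commercial services supply their own endpoints.
import requests

# Hypothetical proxy endpoint; replace with one supplied by a proxy service
proxies = {
    "http": "http://user:password@proxy.example.com:8000",
    "https": "http://user:password@proxy.example.com:8000",
}

# A browser-like User-Agent header can also reduce trivial blocks
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}

response = requests.get(
    "https://example-ecommerce-site.com/products",
    proxies=proxies,
    headers=headers,
    timeout=10,
)
print(response.status_code)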
Sample Code: Web Scraping Using Python
Scenario: Scrape the latest news headlines from a static website.
import requests
from bs4 import BeautifulSoup
# URL of the target website
url = "https://example-news-site.com"
# Send an HTTP GET request
response = requests.get(url)
response.raise_for_status() # Check for HTTP request errors
# Parse the HTML content
soup = BeautifulSoup(response.text, 'html.parser')
# Extract headlines (assuming they are in <h2> tags with class 'headline')
headlines = soup.find_all('h2', class_='headline')
# Print extracted headlines
print("Latest News Headlines:")
for idx, headline in enumerate(headlines, 1):
print(f"{idx}. {headline.text.strip()}")
Assignment
Objective
Scrape product names and prices from an e-commerce website’s product listing page.
Requirements
Extract the product names and their corresponding prices.
Save the data to a CSV file.
Handle HTTP request errors gracefully.
Solution
import requests
from bs4 import BeautifulSoup
import csv
# Target URL
url = "https://example-ecommerce-site.com/products"
try:
    # Send an HTTP GET request
    response = requests.get(url)
    response.raise_for_status()
    # Parse HTML content
    soup = BeautifulSoup(response.text, 'html.parser')
    # Find products and prices
    products = soup.find_all('div', class_='product-item')
    # Prepare CSV file
    with open('products.csv', 'w', newline='', encoding='utf-8') as file:
        writer = csv.writer(file)
        writer.writerow(["Product Name", "Price"])
        # Extract and write product data
        for product in products:
            name = product.find('h3', class_='product-name').text.strip()
            price = product.find('span', class_='product-price').text.strip()
            writer.writerow([name, price])
    print("Data saved to products.csv")
except requests.exceptions.RequestException as e:
    print(f"Error during request: {e}")
Quiz
Objective: Assess understanding of advanced web scraping concepts.
Questions
What is the purpose of the robots.txt file in web scraping?
a) To provide legal protection for web scraping
b) To define the structure of a website
c) To specify which parts of the website can be crawled
d) To encrypt sensitive data
Which Python library is best suited for interacting with JavaScript-rendered content?
a) BeautifulSoup
b) Requests
c) Selenium
d) NumPy
What does the response.raise_for_status() method do?
a) Parses HTML content
b) Checks for HTTP request errors
c) Rotates proxies
d) Generates a browser simulation
Which of the following is NOT a common anti-scraping mechanism?
a) CAPTCHA
b) IP blocking
c) Dynamic HTML
d) User-agent strings
What is the primary purpose of using proxies in web scraping?
a) To enhance scraping speed
b) To bypass rate limits and IP bans
c) To parse JavaScript
d) To store scraped data
Answers
c) To specify which parts of the website can be crawled
c) Selenium
b) Checks for HTTP request errors
d) User-agent strings
b) To bypass rate limits and IP bans