Lecture Notes: Handling Anti-Scraping Mechanisms
1. Introduction to Anti-Scraping Mechanisms
What are Anti-Scraping Mechanisms?
Websites implement anti-scraping mechanisms to protect their content, ensure fair usage, and prevent server overload. These techniques are designed to identify and block bots and automated tools.
Common Anti-Scraping Mechanisms
- CAPTCHAs: Used to verify that a real human is accessing the site.
- IP Blocking: Blocking requests from specific IPs suspected of being bots.
- Rate Limiting: Limiting the number of requests within a specified time frame.
- User-Agent Validation: Detecting and blocking bots based on unusual or default user-agent strings.
- Dynamic Content Loading: Using JavaScript to generate content dynamically to prevent direct scraping.
- Honeypot traps (honeytraps): Hidden links or form fields, invisible to human visitors, designed to catch bots that follow or fill them.
2. Strategies to Overcome Anti-Scraping Mechanisms
1. Using Proxies
Proxies hide the scraper's real IP address, and rotating through a pool of them spreads requests across many IPs so that no single address gets blocked. A quick way to verify a proxy is in use is sketched after the list below.
- Rotating Proxies: Services like ScraperAPI and Bright Data provide rotating proxy pools.
- Residential Proxies: Appear as legitimate IPs, reducing the likelihood of detection.
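As a quick sanity check (the proxy address below is a placeholder), you can confirm that traffic really leaves through the proxy by requesting an IP-echo endpoint such as https://httpbin.org/ip, which simply reports the IP it saw:
import requests

# Placeholder proxy -- replace with a real endpoint from your proxy provider
proxy = {"http": "http://proxy1:port", "https": "https://proxy1:port"}

# https://httpbin.org/ip echoes the caller's IP, so the output should show
# the proxy's address rather than your own machine's.
response = requests.get("https://httpbin.org/ip", proxies=proxy, timeout=10)
print(response.json())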
2. User-Agent Rotation
Rotate user-agent headers to mimic different devices and browsers.
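A minimal sketch of user-agent rotation with requests (the user-agent strings and target URL are illustrative; sending matching Accept and Accept-Language headers alongside the User-Agent makes the request look more like real browser traffic):
import random
import requests

# Example user-agent strings -- keep these in sync with current browser releases
user_agents = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
    "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0"
]

headers = {
    "User-Agent": random.choice(user_agents),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9"
}
response = requests.get("https://example.com", headers=headers, timeout=10)
print(response.status_code)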
3. Handling CAPTCHAs
- Use CAPTCHA-solving services (e.g., 2Captcha, Anti-Captcha); a rough sketch of this approach follows this list.
- Employ machine learning models for CAPTCHA recognition (for advanced users).
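The sketch below shows how a solving service is typically wired in. It assumes 2Captcha's classic in.php/res.php HTTP API and a reCAPTCHA site key taken from the target page; check the service's current documentation, since endpoints and parameters may differ:
import time
import requests

API_KEY = "YOUR_2CAPTCHA_KEY"   # your 2Captcha account key (placeholder)
SITE_KEY = "TARGET_SITE_KEY"    # reCAPTCHA site key found in the page source (placeholder)
PAGE_URL = "https://example-anti-scraping-site.com/login"

# Submit the CAPTCHA job; the legacy endpoint replies with "OK|<job id>"
submit = requests.get("http://2captcha.com/in.php", params={
    "key": API_KEY, "method": "userrecaptcha",
    "googlekey": SITE_KEY, "pageurl": PAGE_URL
}, timeout=10)
job_id = submit.text.split("|")[1]

# Poll for the solved token; solving usually takes 15-60 seconds
token = None
while token is None:
    time.sleep(10)
    result = requests.get("http://2captcha.com/res.php", params={
        "key": API_KEY, "action": "get", "id": job_id
    }, timeout=10)
    if result.text.startswith("OK|"):
        token = result.text.split("|")[1]
    # any other reply (e.g. "CAPCHA_NOT_READY") means keep waiting

# The token is then submitted to the target site in its g-recaptcha-response field
print(token)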
4. Delays and Randomization
- Add random delays between requests to mimic human behavior (a backoff sketch follows this list).
- Randomize request patterns and navigation paths.
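Beyond fixed pauses, it helps to back off exponentially (with jitter) when the server signals rate limiting, typically via an HTTP 429 response. A minimal sketch against a hypothetical URL:
import random
import time
import requests

url = "https://example-anti-scraping-site.com/data"   # hypothetical target

for attempt in range(5):
    response = requests.get(url, timeout=10)
    if response.status_code != 429:
        break   # not rate limited -- continue with parsing
    # Exponential backoff with jitter: roughly 1-2s, 2-4s, 4-8s, ...
    wait = (2 ** attempt) + random.uniform(0, 2 ** attempt)
    print(f"Rate limited; retrying in {wait:.1f} seconds")
    time.sleep(wait)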
5. Avoiding Detection
- Use browser automation tools such as Puppeteer or Selenium (often in headless mode) judiciously.
- Mask automation signals such as the navigator.webdriver flag when using Selenium (see the sketch after this list).
- Ensure proper session management using cookies and headers.
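Below is a sketch of a Selenium setup that removes some of the most obvious automation signals. These Chrome options are commonly used for this purpose, but their effectiveness varies with the site and browser version, so treat this as a starting point rather than a guarantee:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless=new")   # run without a visible browser window
# Hide the navigator.webdriver flag and the "controlled by automated software" banner
options.add_argument("--disable-blink-features=AutomationControlled")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option("useAutomationExtension", False)
# Use a realistic user agent instead of the headless default
options.add_argument(
    "user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
)

driver = webdriver.Chrome(options=options)
driver.get("https://example-anti-scraping-site.com")
print(driver.title)
driver.quit()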
3. Sample Code: Bypassing Basic Anti-Scraping Mechanisms
Scenario: Scraping a website that applies IP blocking. The script rotates proxies and user agents and retries with random delays; CAPTCHA-protected pages would additionally require a solving service, as covered above.
import requests
from bs4 import BeautifulSoup
import random
import time

# List of proxies (placeholders -- substitute working proxy endpoints)
proxies = [
    {"http": "http://proxy1:port", "https": "https://proxy1:port"},
    {"http": "http://proxy2:port", "https": "https://proxy2:port"},
    {"http": "http://proxy3:port", "https": "https://proxy3:port"}
]

# List of user agents
user_agents = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
    "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0"
]

# Target URL
url = "https://example-anti-scraping-site.com"

# Attempt to scrape, retrying up to 5 times with a fresh proxy/user agent each time
for attempt in range(5):
    try:
        # Randomly select a proxy and user agent
        proxy = random.choice(proxies)
        user_agent = random.choice(user_agents)

        # Set headers
        headers = {"User-Agent": user_agent}

        # Send GET request
        response = requests.get(url, headers=headers, proxies=proxy, timeout=10)
        response.raise_for_status()

        # Parse the HTML content
        soup = BeautifulSoup(response.text, 'html.parser')
        print(soup.title.text if soup.title else "No <title> element found")
        break  # success -- stop retrying
    except requests.exceptions.RequestException as e:
        print(f"Request failed: {e}")
        # Add a random delay before the next attempt
        time.sleep(random.uniform(2, 5))
Assignment
Objective
Scrape product reviews from a website with CAPTCHA and rate-limiting mechanisms.
Requirements
- Rotate proxies and user agents.
- Implement delays to avoid detection.
- Skip CAPTCHA-protected pages.
Solution
import requests
from bs4 import BeautifulSoup
import random
import time

# Proxies and user agents (placeholders -- substitute working values)
proxies = [
    {"http": "http://proxy1:port", "https": "https://proxy1:port"},
    {"http": "http://proxy2:port", "https": "https://proxy2:port"}
]
user_agents = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)"
]

# Target URL
url = "https://example-reviews-site.com"

# Scrape reviews from pages 1-5
reviews = []
for page in range(1, 6):
    try:
        # Rotate proxy and user agent
        proxy = random.choice(proxies)
        user_agent = random.choice(user_agents)
        headers = {"User-Agent": user_agent}

        # Send request
        response = requests.get(f"{url}/page/{page}", headers=headers, proxies=proxy, timeout=10)
        response.raise_for_status()

        # Skip CAPTCHA-protected pages (simple heuristic: the word "captcha" appears in the HTML)
        if "captcha" in response.text.lower():
            print(f"Skipping page {page}: CAPTCHA detected")
            continue

        # Parse content
        soup = BeautifulSoup(response.text, 'html.parser')
        review_elements = soup.find_all('div', class_='review')
        for review in review_elements:
            paragraph = review.find('p')
            if paragraph:
                reviews.append(paragraph.text.strip())

        # Random delay to stay under the rate limit
        time.sleep(random.uniform(3, 7))
    except requests.exceptions.RequestException as e:
        print(f"Error on page {page}: {e}")

# Print reviews
for idx, review in enumerate(reviews, 1):
    print(f"{idx}. {review}")
Quiz
Objective: Test understanding of anti-scraping mechanisms and solutions.
Questions
1. Which of the following is a common anti-scraping technique?
- a) User-Agent validation
- b) Infinite scroll detection
- c) Proxy list generation
- d) HTML parsing
2. What is the purpose of rotating proxies?
- a) To prevent rate-limiting
- b) To mimic multiple IPs
- c) To bypass IP bans
- d) All of the above
3. Which of the following is commonly used to solve CAPTCHAs?
- a) Selenium
- b) BeautifulSoup
- c) 2Captcha
- d) NumPy
4. How can random delays between requests help in scraping?
- a) They speed up the scraping process.
- b) They mimic human-like browsing behavior.
- c) They bypass CAPTCHA directly.
- d) They change the HTML structure.
5. What header should be rotated to avoid detection?
- a) Accept-Encoding
- b) User-Agent
- c) Content-Type
- d) Referrer-Policy
Answers
1. a) User-Agent validation
2. d) All of the above
3. c) 2Captcha
4. b) They mimic human-like browsing behavior
5. b) User-Agent