Monday, 13 January 2025

12. Real-World Scraping Project

Lecture Notes: Real-World Scraping Project on https://mca.gov.in

1. Introduction

Scraping real-world websites like MCA (Ministry of Corporate Affairs) involves dealing with various challenges such as dynamic content, anti-scraping mechanisms, and complex data structures. This lecture demonstrates a step-by-step approach to scraping this site using Python, focusing on deployment and execution on Google Cloud Platform (GCP).


2. Challenges in Scraping MCA Website

Key Challenges

  1. Dynamic Content: The site uses JavaScript to load data, requiring a browser automation tool like Selenium.
  2. Anti-Scraping Mechanisms: CAPTCHA, rate limiting, and bot detection.
  3. Complex Data Structures: Nested tables and pagination for structured data.
  4. Legal and Ethical Considerations: Adhering to the site's terms of service and responsible scraping practices.

3. Setting Up GCP for the Project

Objective

Deploy the scraping bot on GCP to handle large-scale scraping tasks efficiently.

Steps

  1. Create a GCP Project:

    • Go to the GCP Console.
    • Create a new project and enable billing.
  2. Set Up a VM Instance:

    • Navigate to Compute Engine → VM Instances → Create Instance.
    • Select an appropriate machine type (e.g., e2-medium).
    • Install Docker or Python on the VM.
  3. Enable Required APIs:

    • Enable APIs like Cloud Logging and Cloud Storage.
  4. Install Required Libraries:

    • Use SSH to connect to the VM.
    • Install libraries like Selenium, BeautifulSoup, and gcloud SDK.
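
With the Cloud Storage API enabled (step 3), scraped results can be uploaded to a bucket directly from the VM. A minimal sketch using the google-cloud-storage client library; the bucket and object names below are placeholders:

from google.cloud import storage

def upload_results(bucket_name, source_file, destination_blob):
    # Uses the VM's default service account credentials
    client = storage.Client()
    bucket = client.bucket(bucket_name)
    blob = bucket.blob(destination_blob)
    blob.upload_from_filename(source_file)
    print(f"Uploaded {source_file} to gs://{bucket_name}/{destination_blob}")

# Example usage (names are placeholders)
upload_results("my-scraper-bucket", "results.csv", "mca/results.csv")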

4. Code for Scraping MCA Website

Objective

Extract company details such as name, CIN (Corporate Identity Number), and registration date from MCA's search page.

Code Example

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
import time

# Initialize Selenium WebDriver (headless mode for GCP)
options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')

# Start Chrome (Selenium Manager resolves a matching ChromeDriver automatically)
driver = webdriver.Chrome(options=options)

try:
    # Open MCA search page
    driver.get("https://mca.gov.in/mcafoportal/viewCompanyMasterData.do")

    # Wait for the page to load
    time.sleep(5)

    # Locate search input field and enter query (e.g., company name)
    search_input = driver.find_element(By.ID, 'companyName')
    search_input.send_keys("Tata")
    search_input.send_keys(Keys.RETURN)

    # Wait for results to load
    time.sleep(10)

    # Extract data from results table
    rows = driver.find_elements(By.CSS_SELECTOR, 'table tbody tr')
    for row in rows:
        columns = row.find_elements(By.TAG_NAME, 'td')
        if len(columns) > 0:
            print("Name:", columns[0].text)
            print("CIN:", columns[1].text)
            print("Registration Date:", columns[2].text)
            print("---")
finally:
    driver.quit()

5. Deploying the Scraper to GCP

Steps

  1. Create a Dockerfile:

    FROM python:3.9
    
    # Install Python dependencies (a Selenium-based scraper also needs Chromium and a matching driver in the image)
    WORKDIR /app
    COPY requirements.txt .
    RUN pip install -r requirements.txt
    
    # Copy scraper code
    COPY . .
    
    # Command to run the scraper
    CMD ["python", "scraper.py"]
    
  2. Build and Push Docker Image:

    docker build -t mca-scraper .
    docker tag mca-scraper gcr.io/<project-id>/mca-scraper
    docker push gcr.io/<project-id>/mca-scraper
    
  3. Deploy on GCP:

    • Use Cloud Run or Kubernetes Engine to deploy the container.
    • Configure environment variables for dynamic inputs.
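
For example, inputs such as the search term can be passed to the container as environment variables instead of being hard-coded. A minimal sketch of reading them in the scraper; the variable names are illustrative:

import os

# Dynamic inputs supplied by Cloud Run / GKE (variable names are illustrative)
SEARCH_TERM = os.environ.get("SEARCH_TERM", "Tata")
RESULT_LIMIT = int(os.environ.get("RESULT_LIMIT", "100"))

print(f"Scraping companies matching '{SEARCH_TERM}' (limit {RESULT_LIMIT})")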

6. Monitoring and Logging

Using GCP Tools

  1. Cloud Logging: Capture logs for scraper activities.

    import logging
    from google.cloud import logging as gcp_logging
    
    # Initialize GCP logging client
    client = gcp_logging.Client()
    logger = client.logger("mca_scraper")
    
    # Log messages
    logger.log_text("Scraping started")
    logger.log_text("Scraping completed successfully")
    
  2. Cloud Monitoring: Set up alerts for failures or anomalies.


7. Debugging Common Issues

Challenges

  1. CAPTCHA Handling:

    • Use third-party CAPTCHA-solving services.
  2. Timeout Errors:

    • Implement retry mechanisms with exponential backoff.

Code Example: Retry Logic

import time
import requests
from requests.exceptions import RequestException

# Retry function with exponential backoff
def fetch_with_retry(url, retries=3):
    for attempt in range(retries):
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()
            return response
        except RequestException as e:
            print(f"Attempt {attempt + 1} failed: {e}")
            if attempt == retries - 1:
                raise
            time.sleep(2 ** attempt)  # back off 1s, 2s, 4s, ...

# Example usage
url = "https://mca.gov.in"
data = fetch_with_retry(url)
print(data.text)

8. Conclusion

Scraping real-world websites like MCA involves overcoming challenges through robust coding practices, cloud deployment, and efficient monitoring. GCP provides an excellent platform for scalable and reliable scraping solutions.


Assignments and Quiz

Assignment:

  1. Implement the scraper to extract detailed information about a specific company from the MCA website.
  2. Deploy the scraper to GCP and configure logging.
  3. Handle CAPTCHA challenges using a third-party service.

Quiz Questions:

  1. What challenges are specific to scraping dynamic websites like MCA?
  2. Why is GCP suitable for deploying large-scale scraping bots?
  3. What tool can be used for monitoring logs in GCP?
  4. How can retry logic improve the reliability of a scraper?
  5. Provide an example of a Docker command for deploying a scraper.

11. Deploying and Monitoring Scraping Bots

Lecture Notes: Deploying and Monitoring Scraping Bots

1. Introduction to Deploying and Monitoring Scraping Bots

Deploying and monitoring scraping bots ensures that web scraping tasks are executed efficiently, consistently, and at scale. This lecture explores deploying bots to cloud platforms, automating workflows with CI/CD pipelines, monitoring activities, and debugging common issues.


2. Deploying Scraping Bots to Cloud Platforms (AWS, GCP, Azure)

Objective

Host and run scraping bots on cloud platforms to ensure scalability, reliability, and global accessibility.

Challenges

  1. Configuring cloud infrastructure.
  2. Ensuring resource efficiency and cost optimization.
  3. Managing security and data compliance.

Solution

Deploy bots using containerization (Docker) and orchestration tools.

Code Example: Deploying to AWS with Docker

# Step 1: Create a Dockerfile
FROM python:3.9

# Install dependencies
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt

# Copy bot code
COPY . .

# Command to run the bot
CMD ["python", "bot.py"]

# Step 2: Build and push Docker image
# Build the Docker image
docker build -t scraping-bot .

# Tag and push the image to AWS Elastic Container Registry (ECR)
aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin <aws_account_id>.dkr.ecr.us-east-1.amazonaws.com

docker tag scraping-bot:latest <aws_account_id>.dkr.ecr.us-east-1.amazonaws.com/scraping-bot:latest

docker push <aws_account_id>.dkr.ecr.us-east-1.amazonaws.com/scraping-bot:latest

# Step 3: Deploy to AWS ECS
# Use AWS CLI or ECS Console to set up a task definition and service to run the Docker container.

3. Automating Scraping with CI/CD Pipelines

Objective

Implement continuous integration and deployment workflows to automate scraping tasks.

Challenges

  1. Managing frequent code updates.
  2. Integrating testing and deployment steps.

Solution

Use tools like GitHub Actions or Jenkins for CI/CD pipelines.

Code Example: GitHub Actions Workflow

name: Deploy Scraping Bot

on:
  push:
    branches:
      - main

jobs:
  build-and-deploy:
    runs-on: ubuntu-latest

    steps:
      - name: Checkout code
        uses: actions/checkout@v2

      - name: Set up Python
        uses: actions/setup-python@v2
        with:
          python-version: '3.9'

      - name: Install dependencies
        run: |
          pip install -r requirements.txt

      - name: Run tests
        run: pytest

      - name: Deploy to AWS ECS
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        run: |
          aws ecs update-service --cluster scraping-cluster --service scraping-service --force-new-deployment

4. Monitoring and Logging Scraping Activities

Objective

Track scraping activities in real-time to ensure reliability and troubleshoot issues.

Challenges

  1. Capturing logs from distributed systems.
  2. Analyzing large volumes of data.

Solution

Use logging frameworks and monitoring dashboards (e.g., ELK Stack, CloudWatch).

Code Example: Using Python Logging

import logging

# Configure logging
logging.basicConfig(
    filename='scraping.log',
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s'
)

# Log activities
logging.info("Scraping started")
try:
    # Simulate scraping task
    raise ValueError("Example error")
except Exception as e:
    logging.error(f"Error occurred: {e}")
logging.info("Scraping completed")

5. Debugging Common Errors and Failures

Objective

Identify and resolve issues such as rate limiting, data structure changes, and network errors.

Challenges

  1. Dynamic website changes.
  2. Unexpected exceptions during scraping.

Solution

Use error handling and retry mechanisms.

Code Example: Retry on Failure

import requests
from requests.exceptions import RequestException

# Retry logic
def fetch_with_retry(url, retries=3):
    for attempt in range(retries):
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()
            return response
        except RequestException as e:
            print(f"Attempt {attempt + 1} failed: {e}")
            if attempt == retries - 1:
                raise

# Example usage
url = "https://example.com"
data = fetch_with_retry(url)
print(data.text)

6. Conclusion

Deploying and monitoring scraping bots involves managing cloud deployments, automating workflows, and ensuring robust monitoring. These practices improve reliability and scalability for web scraping projects.


Assignments and Quiz

Assignment:

  1. Deploy a scraping bot to a cloud platform of your choice (AWS, GCP, or Azure).
  2. Set up a CI/CD pipeline to automate deployment.
  3. Configure logging to capture errors and activities.

Quiz Questions:

  1. Name two cloud platforms suitable for deploying scraping bots.
  2. What tool can you use to automate CI/CD workflows for a scraping bot?
  3. How can you monitor scraping activities in real time?
  4. What is the purpose of retry mechanisms in scraping bots?
  5. Provide an example of a logging framework in Python.

10. Machine Learning in Web Scraping

Lecture Notes: Introduction to Machine Learning in Web Scraping

1. Introduction to Machine Learning in Web Scraping

Machine learning (ML) enhances web scraping by automating complex tasks, improving data accuracy, and enabling advanced analysis. This lecture introduces ML concepts applied to web scraping, focusing on Natural Language Processing (NLP), data classification, and AI-driven scraping tools.


2. Using NLP for Data Extraction (e.g., Named Entity Recognition)

Objective

Extract meaningful entities such as names, dates, locations, and organizations from unstructured web data using NLP techniques like Named Entity Recognition (NER).

Challenges

  1. Unstructured and noisy text data.
  2. Contextual understanding of extracted entities.

Solution

Use Python libraries such as spaCy or nltk to implement NER.

Code Example

import spacy

# Load spaCy's English NLP model
nlp = spacy.load("en_core_web_sm")

# Sample text
data = "OpenAI was founded in San Francisco in 2015. It specializes in AI research."

# Process the text
doc = nlp(data)

# Extract named entities
print("Entities detected:")
for ent in doc.ents:
    print(f"{ent.text} ({ent.label_})")

Output

Entities detected:
OpenAI (ORG)
San Francisco (GPE)
2015 (DATE)

3. Automating Data Classification and Categorization

Objective

Classify scraped data into predefined categories or labels using supervised ML models.

Challenges

  1. Labeling training data for supervised learning.
  2. Balancing accuracy and computational efficiency.

Solution

Train a model using scikit-learn to classify scraped data into categories.

Code Example

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split

# Sample data
data = [
    "Buy the latest iPhone at a discounted price",
    "New Samsung Galaxy released this month",
    "Breaking news: AI beats humans at chess",
    "Sports update: Local team wins championship"
]
labels = ["ecommerce", "ecommerce", "news", "sports"]

# Split data
X_train, X_test, y_train, y_test = train_test_split(data, labels, test_size=0.25, random_state=42)

# Build a classification pipeline
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(X_train, y_train)

# Test the model
predictions = model.predict(X_test)
print("Predictions:", predictions)

4. Introduction to AI-Driven Scraping Tools

Objective

Leverage AI-powered scraping tools that can dynamically adapt to website changes and bypass anti-scraping mechanisms.

Examples of AI-Driven Tools

  1. Diffbot: Extracts structured data from web pages using AI.
  2. Scrapy with AI Plugins: Combines traditional scraping with ML capabilities.
  3. Apify AI Tools: Provides intelligent automation for complex scraping tasks.

Code Example: Using Diffbot API

import requests

# Diffbot API endpoint and token
API_TOKEN = "your_diffbot_api_token"
url = "https://example.com/article"
api_endpoint = f"https://api.diffbot.com/v3/article?token={API_TOKEN}&url={url}"

# Send request
response = requests.get(api_endpoint)

# Parse response
if response.status_code == 200:
    article_data = response.json()
    print("Title:", article_data['objects'][0]['title'])
    print("Author:", article_data['objects'][0]['author'])
    print("Text:", article_data['objects'][0]['text'])
else:
    print("Error:", response.status_code)

5. Conclusion

Machine learning significantly enhances web scraping capabilities. Techniques like NLP, classification, and AI-driven tools allow for more intelligent and automated data extraction, making them invaluable for large-scale and complex projects.


Assignments and Quiz

Assignment: Implement an NER model to extract names, organizations, and dates from a sample news article. Use spaCy or a similar NLP library. Save the results in JSON format.

Quiz Questions:

  1. What is Named Entity Recognition (NER) used for in web scraping?
  2. Name one library used for implementing NER in Python.
  3. What are the main challenges in automating data classification?
  4. What is the advantage of AI-driven scraping tools?
  5. Provide an example of an AI-driven scraping tool.

9. Advanced Use Cases in Web Scraping

Lecture Notes: Advanced Use Cases in Web Scraping

1. Introduction to Advanced Use Cases in Web Scraping

Web scraping is not limited to simple data extraction. Advanced use cases allow organizations and developers to extract, analyze, and utilize large-scale data for specific applications. This lecture covers key use cases, including scraping e-commerce data, gathering social media insights, aggregating news, and building custom web crawlers for large-scale data collection.


2. Scraping E-Commerce Sites for Product Data

Objective

Extract product information, including names, prices, ratings, and reviews, from e-commerce platforms for price comparison, market research, or inventory monitoring.

Challenges

  1. Dynamic content (JavaScript rendering).
  2. Anti-scraping mechanisms (CAPTCHA, rate limiting).
  3. Data volume and variability.

Solution

Use a combination of Selenium (for dynamic content) and BeautifulSoup (for parsing HTML).

Code Example

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from bs4 import BeautifulSoup
import pandas as pd

# Configure Selenium WebDriver
service = Service('path_to_chromedriver')
driver = webdriver.Chrome(service=service)

# Open e-commerce site
ecommerce_url = "https://example-ecommerce.com"
driver.get(ecommerce_url)

# Parse the page source with BeautifulSoup
soup = BeautifulSoup(driver.page_source, 'html.parser')

# Extract product data
products = []
product_elements = soup.find_all('div', class_='product-item')
for item in product_elements:
    name = item.find('h2', class_='product-name').text.strip()
    price = item.find('span', class_='price').text.strip()
    rating = item.find('span', class_='rating').text.strip()
    products.append({"Name": name, "Price": price, "Rating": rating})

# Save to CSV
driver.quit()
products_df = pd.DataFrame(products)
products_df.to_csv('ecommerce_products.csv', index=False)
print("Product data saved to ecommerce_products.csv")

3. Gathering Social Media Data (Twitter, Instagram, LinkedIn)

Objective

Collect user posts, hashtags, follower counts, and other relevant data for social media analytics.

Challenges

  1. API rate limits and authentication requirements.
  2. Strict platform policies against scraping.
  3. Dynamic and frequently changing DOM structures.

Solution

Use official APIs where available or headless browsers for scraping.

Code Example: Twitter API

import tweepy

# Twitter API credentials
api_key = 'your_api_key'
api_secret_key = 'your_api_secret_key'
access_token = 'your_access_token'
access_token_secret = 'your_access_token_secret'

# Authenticate
auth = tweepy.OAuthHandler(api_key, api_secret_key)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth)

# Search tweets
query = "#WebScraping"
tweets = api.search_tweets(q=query, lang='en', count=100)

# Print tweets
for tweet in tweets:
    print(f"User: {tweet.user.screen_name}, Tweet: {tweet.text}")

4. News Aggregation and Sentiment Analysis

Objective

Scrape news articles, extract key content, and analyze sentiment for trend detection or reporting.

Challenges

  1. Handling different website structures.
  2. Extracting meaningful content from HTML noise.
  3. Large-scale text analysis.

Solution

Combine web scraping with Natural Language Processing (NLP) libraries.

Code Example

from bs4 import BeautifulSoup
import requests
from textblob import TextBlob

# Target news website
news_url = "https://example-news-site.com"
response = requests.get(news_url)
response.raise_for_status()

# Parse with BeautifulSoup
soup = BeautifulSoup(response.text, 'html.parser')
articles = []
article_elements = soup.find_all('div', class_='news-article')

for article in article_elements:
    title = article.find('h2').text.strip()
    summary = article.find('p', class_='summary').text.strip()
    sentiment = TextBlob(summary).sentiment.polarity
    articles.append({"Title": title, "Summary": summary, "Sentiment": sentiment})

# Print results
for article in articles:
    print(f"Title: {article['Title']}, Sentiment: {article['Sentiment']}")

5. Building Custom Web Crawlers for Large-Scale Data Collection

Objective

Crawl websites systematically to collect data across multiple pages and domains.

Challenges

  1. Managing crawling depth and breadth.
  2. Handling duplicate data and avoiding traps.
  3. Efficient storage of large datasets.

Solution

Use Scrapy for building efficient and scalable web crawlers.

Code Example

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["http://quotes.toscrape.com"]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }

        # Follow pagination links
        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)

To run the spider:

scrapy runspider quotes_spider.py -o quotes.json

6. Conclusion

Advanced web scraping use cases open up opportunities for data-driven decision-making in e-commerce, social media analytics, news aggregation, and more. By leveraging tools like Selenium, BeautifulSoup, APIs, and Scrapy, developers can tackle complex scenarios efficiently.


Assignments and Quiz

Assignment: Create a custom crawler using Scrapy to collect data from a website of your choice. Save the data in JSON format.

Quiz Questions:

  1. What are the key challenges in scraping social media platforms?
  2. How does sentiment analysis help in news aggregation?
  3. What is the advantage of using Scrapy for large-scale web crawling?
  4. Which library is commonly used for sentiment analysis in Python?
  5. Name one way to handle dynamic content on e-commerce sites.

8. Web Scraping at Scale

Lecture Notes: Web Scraping at Scale

1. Distributed Scraping Using Tools

1.1 Scrapy

Scrapy is a powerful web scraping framework that enables distributed scraping with minimal setup. It supports asynchronous requests and is highly efficient for large-scale scraping.

Key Features

  • Built-in support for asynchronous requests.
  • Middleware for request and response processing.
  • Extensible architecture.

Code Example: Scrapy Distributed Scraping

# Install Scrapy: pip install scrapy

# Example Scrapy spider
import scrapy

class ProductSpider(scrapy.Spider):
    name = "products"
    start_urls = ["https://example.com/products"]

    def parse(self, response):
        for product in response.css(".product"):
            yield {
                'name': product.css(".name::text").get(),
                'price': product.css(".price::text").get(),
            }

# Run spider: scrapy crawl products

1.2 Celery

Celery is a distributed task queue that enables parallel execution of tasks, making it suitable for scaling scraping jobs.

Code Example: Using Celery for Scraping

# Install Celery: pip install celery
from celery import Celery
import requests

app = Celery('tasks', broker='redis://localhost:6379/0')

@app.task
def scrape_page(url):
    response = requests.get(url)
    return response.text

# Schedule scraping tasks
urls = ["https://example.com/page1", "https://example.com/page2"]
for url in urls:
    scrape_page.delay(url)

2. Managing Concurrency and Parallelism

2.1 Threading

Threading allows multiple threads to run concurrently in a program, making it useful for I/O-bound tasks like web scraping.

Code Example: Using Threading

import threading
import requests

def scrape(url):
    response = requests.get(url)
    print(f"Scraped {url}: {len(response.content)} bytes")

urls = ["https://example.com/page1", "https://example.com/page2"]
threads = []

for url in urls:
    thread = threading.Thread(target=scrape, args=(url,))
    threads.append(thread)
    thread.start()

for thread in threads:
    thread.join()

2.2 Multiprocessing

Multiprocessing uses multiple processes, making it ideal for CPU-bound tasks.

Code Example: Using Multiprocessing

from multiprocessing import Pool
import requests

def scrape(url):
    response = requests.get(url)
    return f"Scraped {url}: {len(response.content)} bytes"

urls = ["https://example.com/page1", "https://example.com/page2"]

with Pool(4) as pool:
    results = pool.map(scrape, urls)
    for result in results:
        print(result)

3. Caching Strategies to Reduce Redundant Requests

3.1 Using HTTP Caching

HTTP caching stores responses to reduce redundant requests.

Code Example: Using Requests Cache

# Install requests-cache: pip install requests-cache
import requests
import requests_cache

requests_cache.install_cache('scraping_cache', expire_after=3600)

response = requests.get("https://example.com/data")
print(f"Cache hit: {response.from_cache}")

3.2 Custom Caching

Implement custom caching for specific data.

Code Example: Custom Caching

import os
import hashlib
import requests

def fetch_with_cache(url, cache_dir="cache"):
    if not os.path.exists(cache_dir):
        os.makedirs(cache_dir)

    cache_file = os.path.join(cache_dir, hashlib.md5(url.encode()).hexdigest())

    if os.path.exists(cache_file):
        with open(cache_file, "r") as f:
            return f.read()

    response = requests.get(url)
    with open(cache_file, "w") as f:
        f.write(response.text)

    return response.text

html = fetch_with_cache("https://example.com/data")
print(html[:100])

4. Monitoring and Maintaining Scraping Jobs

4.1 Logging

Logging helps monitor scraping activity and debug issues.

Code Example: Basic Logging

import logging

logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(message)s')
logger = logging.getLogger()

logger.info("Starting scraping job")
# Simulated scraping job
logger.info("Scraping page 1")
logger.info("Scraping page 2")
logger.info("Job completed")

4.2 Job Scheduling

Use schedulers like cron or Python libraries to automate scraping.

Code Example: Using Schedule Library

# Install schedule: pip install schedule
import schedule
import time


def scrape_job():
    print("Running scraping job...")
    # Simulate scraping
    time.sleep(2)
    print("Job completed")

schedule.every().day.at("10:00").do(scrape_job)

while True:
    schedule.run_pending()
    time.sleep(1)

Quiz

Objective: Test understanding of large-scale web scraping concepts.

Questions

  1. Which tool is best suited for distributed web scraping?

    • a) Scrapy
    • b) BeautifulSoup
    • c) Requests
    • d) Selenium
  2. What is the main advantage of using Celery in web scraping?

    • a) Asynchronous task execution
    • b) Database management
    • c) Rendering JavaScript
    • d) HTML parsing
  3. Which Python library supports HTTP caching for web scraping?

    • a) pandas
    • b) requests-cache
    • c) scrapy
    • d) schedule
  4. What is the primary purpose of logging in scraping jobs?

    • a) Improve scraping speed
    • b) Track and debug scraping activity
    • c) Parse data
    • d) Handle proxies
  5. What is a benefit of using multiprocessing for web scraping?

    • a) Handles I/O-bound tasks efficiently
    • b) Improves performance for CPU-bound tasks
    • c) Reduces memory usage
    • d) Automates scheduling

Answers

  1. a) Scrapy
  2. a) Asynchronous task execution
  3. b) requests-cache
  4. b) Track and debug scraping activity
  5. b) Improves performance for CPU-bound tasks

7. Data Storage and Post-Processing in Web Scraping

Lecture Notes: Data Storage and Post-Processing

1. Storing Scraped Data in Databases

1.1 SQL Databases

SQL databases like MySQL and PostgreSQL are used for structured data storage. They ensure data integrity and support complex queries.

Key Concepts:

  • Tables with predefined schema.
  • SQL queries for data manipulation.

Code Example: Storing Data in MySQL

import mysql.connector

# Database connection
connection = mysql.connector.connect(
    host="localhost",
    user="your_username",
    password="your_password",
    database="scraped_data_db"
)
cursor = connection.cursor()

# Create table
cursor.execute('''CREATE TABLE IF NOT EXISTS products (
    id INT AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(255),
    price DECIMAL(10, 2),
    rating DECIMAL(3, 2)
)''')

# Insert data
data = [
    ("Product A", 19.99, 4.5),
    ("Product B", 29.99, 4.8)
]
cursor.executemany("INSERT INTO products (name, price, rating) VALUES (%s, %s, %s)", data)
connection.commit()

# Close connection
cursor.close()
connection.close()

1.2 NoSQL Databases

NoSQL databases like MongoDB are used for unstructured or semi-structured data. They are flexible and scalable.

Key Concepts:

  • Collection-based structure.
  • JSON-like documents.

Code Example: Storing Data in MongoDB

from pymongo import MongoClient

# Connect to MongoDB
client = MongoClient("mongodb://localhost:27017/")
db = client["scraped_data"]
collection = db["products"]

# Insert data
data = [
    {"name": "Product A", "price": 19.99, "rating": 4.5},
    {"name": "Product B", "price": 29.99, "rating": 4.8}
]
collection.insert_many(data)

# Close connection
client.close()

2. Exporting Data to Formats

2.1 Exporting to JSON

JSON is a lightweight format for data exchange.

Code Example: Writing Data to JSON

import json

# Sample data
data = [
    {"name": "Product A", "price": 19.99, "rating": 4.5},
    {"name": "Product B", "price": 29.99, "rating": 4.8}
]

# Write to JSON file
with open("products.json", "w") as json_file:
    json.dump(data, json_file, indent=4)

2.2 Exporting to CSV

CSV files are commonly used for tabular data.

Code Example: Writing Data to CSV

import csv

# Sample data
data = [
    {"name": "Product A", "price": 19.99, "rating": 4.5},
    {"name": "Product B", "price": 29.99, "rating": 4.8}
]

# Write to CSV
with open("products.csv", "w", newline="", encoding="utf-8") as csv_file:
    writer = csv.DictWriter(csv_file, fieldnames=["name", "price", "rating"])
    writer.writeheader()
    writer.writerows(data)

2.3 Exporting to Excel

Excel is used for data analysis and visualization.

Code Example: Writing Data to Excel

import pandas as pd

# Sample data
data = [
    {"name": "Product A", "price": 19.99, "rating": 4.5},
    {"name": "Product B", "price": 29.99, "rating": 4.8}
]

# Write to Excel
df = pd.DataFrame(data)
df.to_excel("products.xlsx", index=False)

3. Cleaning and Transforming Data with Pandas

3.1 Cleaning Data

Cleaning involves removing duplicates, handling missing values, and correcting formats.

Code Example: Cleaning Data

import pandas as pd

# Sample data
data = {
    "name": ["Product A", "Product B", None, "Product D"],
    "price": [19.99, 29.99, None, 39.99],
    "rating": [4.5, 4.8, 4.0, None]
}

# Create DataFrame
df = pd.DataFrame(data)

# Fill missing ratings with a default value
df["rating"] = df["rating"].fillna(3.0)

# Drop rows that are still missing values (e.g., name or price)
df = df.dropna()

print(df)

3.2 Transforming Data

Transforming data involves reshaping, aggregating, and applying operations.

Code Example: Transforming Data

# Add a new column for discounted price
df["discounted_price"] = df["price"] * 0.9

# Group by and calculate mean rating
mean_rating = df.groupby("name")["rating"].mean()

print(df)
print(mean_rating)

4. Integration with ETL Pipelines

ETL (Extract, Transform, Load) pipelines automate data workflows, including scraping, cleaning, and storing data.

4.1 Building an ETL Pipeline

An ETL pipeline integrates data extraction, processing, and loading steps.

Code Example: Simple ETL Pipeline

import pandas as pd
import requests

# Extract: Fetch data
url = "https://example-ecommerce-site.com/products"
response = requests.get(url)
data = response.json()

# Transform: Clean and process data
df = pd.DataFrame(data["products"])
df["price"] = df["price"].astype(float)

# Load: Store in CSV
df.to_csv("products_etl.csv", index=False)
print("ETL process completed. Data saved to products_etl.csv.")

Quiz

Objective: Test understanding of data storage and post-processing.

Questions

  1. Which SQL command is used to insert data into a table?

    • a) INSERT INTO
    • b) CREATE TABLE
    • c) SELECT
    • d) UPDATE
  2. What is the primary advantage of using NoSQL databases?

    • a) Predefined schema
    • b) Flexibility with unstructured data
    • c) Complex SQL queries
    • d) Low latency
  3. Which Python library is commonly used for exporting data to Excel?

    • a) csv
    • b) json
    • c) pandas
    • d) openpyxl
  4. What does the dropna() method in Pandas do?

    • a) Drops duplicate rows
    • b) Fills missing values
    • c) Drops rows with missing values
    • d) Normalizes data
  5. What does ETL stand for?

    • a) Extract, Transfer, Load
    • b) Extract, Transform, Load
    • c) Evaluate, Transform, Load
    • d) Extract, Transform, Link

Answers

  1. a) INSERT INTO
  2. b) Flexibility with unstructured data
  3. c) pandas
  4. c) Drops rows with missing values
  5. b) Extract, Transform, Load

6. Working with APIs

Lecture Notes: Working with Web Scraping APIs

1. Introduction to Web Scraping APIs

What are Web Scraping APIs?

Web scraping APIs are tools that simplify the process of extracting data from websites. They handle various challenges like dynamic content, anti-scraping mechanisms, and large-scale data extraction.

Why Use Web Scraping APIs?

  1. Ease of Use: Simplifies handling complex scraping scenarios.
  2. Efficiency: Reduces development time and resources.
  3. Anti-Scraping Measures: Built-in mechanisms for bypassing blocks.
  4. Scalability: Handles large volumes of requests effectively.

Examples of Popular Web Scraping APIs

  1. ScraperAPI
  2. Bright Data (formerly Luminati)
  3. Scrapy Cloud
  4. Apify
  5. Octoparse API

2. Features of Web Scraping APIs

1. Proxy Management

Automatically rotates proxies and provides residential, data center, or mobile IPs.

2. Headless Browser Support

Renders JavaScript-heavy pages using headless browsers like Puppeteer or Selenium.

3. CAPTCHA Solving

Integrates CAPTCHA-solving services to bypass human verification challenges.

4. Data Formatting

Delivers data in structured formats like JSON, CSV, or XML.

5. Rate Limiting

Manages request limits to avoid IP bans.
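
These features are usually enabled per request through query parameters or headers. A minimal sketch of such a request against a generic scraping API; the endpoint and parameter names are illustrative and vary by provider:

import requests

API_KEY = "your_api_key"
target_url = "https://example.com/products"

# Illustrative parameters -- check your provider's documentation for the real names
params = {
    "api_key": API_KEY,
    "url": target_url,
    "render": "true",       # have the service execute JavaScript before returning HTML
    "country_code": "us",   # route the request through proxies in a specific region
}

response = requests.get("https://api.scraping-provider.example.com/scrape", params=params, timeout=60)
print(response.status_code, len(response.text))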


3. Using Web Scraping APIs

1. Understanding API Endpoints

APIs provide specific endpoints for data extraction tasks. For example:

  • /scrape: To extract data from a URL.
  • /status: To check API usage and limits.

2. Authentication

Most APIs require API keys or tokens for access. These credentials ensure secure and authorized use.

3. Sending Requests

Use Python libraries like requests or httpx to interact with APIs.

4. Parsing API Responses

Extract and process structured data from JSON or other response formats.


4. Sample Code: Using ScraperAPI

Scenario: Extract product data from an e-commerce site.

import requests

# API key
API_KEY = "your_scraperapi_key"

# Target URL
url = "https://example-ecommerce-site.com/products"

# API endpoint
api_endpoint = f"http://api.scraperapi.com?api_key={API_KEY}&url={url}"

# Send request
response = requests.get(api_endpoint)

# Check response (assumes the target URL returns JSON;
# for an HTML page, parse response.text with BeautifulSoup instead)
if response.status_code == 200:
    data = response.json()
    for product in data.get("products", []):
        print(f"Name: {product['name']}, Price: {product['price']}")
else:
    print(f"Error: {response.status_code}")

5. Assignment

Objective

Use a Web Scraping API to extract weather data from a weather forecasting website.

Requirements

  1. Authenticate using an API key.
  2. Retrieve data for a specific city.
  3. Parse and display the temperature, humidity, and weather conditions.

Solution

import requests

# API key and endpoint
API_KEY = "your_weatherapi_key"
city = "New York"
api_endpoint = f"https://api.weatherapi.com/v1/current.json?key={API_KEY}&q={city}"

# Send request
response = requests.get(api_endpoint)

# Check response
if response.status_code == 200:
    weather_data = response.json()
    location = weather_data["location"]["name"]
    temp_c = weather_data["current"]["temp_c"]
    humidity = weather_data["current"]["humidity"]
    condition = weather_data["current"]["condition"]["text"]

    print(f"City: {location}\nTemperature: {temp_c} °C\nHumidity: {humidity}%\nCondition: {condition}")
else:
    print(f"Failed to fetch weather data. HTTP Status Code: {response.status_code}")

6. Quiz

Objective: Test understanding of Web Scraping APIs.

Questions

  1. Which of the following is NOT a feature of most web scraping APIs?

    • a) CAPTCHA solving
    • b) Proxy management
    • c) Image processing
    • d) Rate limiting
  2. What is the purpose of an API key in web scraping?

    • a) To bypass CAPTCHA
    • b) To identify and authenticate users
    • c) To manage proxies
    • d) To scrape data directly
  3. Which Python library is commonly used to send API requests?

    • a) NumPy
    • b) BeautifulSoup
    • c) Requests
    • d) Pandas
  4. What type of response format do most APIs return?

    • a) JSON
    • b) HTML
    • c) Plain text
    • d) CSV
  5. What is the advantage of using web scraping APIs?

    • a) Simplifies handling of dynamic content
    • b) Increases manual effort
    • c) Eliminates the need for coding
    • d) None of the above

Answers

  1. c) Image processing
  2. b) To identify and authenticate users
  3. c) Requests
  4. a) JSON
  5. a) Simplifies handling of dynamic content

5. Advanced Parsing Techniques in Web Scraping

Lecture Notes: Advanced Parsing Techniques in Web Scraping

1. Introduction to Advanced Parsing Techniques

What is Parsing?

Parsing in web scraping refers to the process of extracting structured data from unstructured HTML or XML content retrieved from websites.

Why Use Advanced Parsing Techniques?

  • Handle complex, nested structures.
  • Extract dynamic or deeply embedded data.
  • Improve scraping efficiency and accuracy.

2. Tools for Advanced Parsing

1. BeautifulSoup

  • A Python library for parsing HTML and XML documents.
  • Provides methods to navigate and search the parse tree.

2. lxml

  • Fast and memory-efficient XML and HTML parser.
  • Supports XPath queries and XSLT transformations.

3. XPath

  • A powerful query language for navigating XML/HTML documents.
  • Provides precise extraction using structured queries.

4. Regular Expressions

  • Useful for pattern-based extraction from raw text or HTML attributes.

3. Parsing Techniques

1. Navigating Nested Structures

  • Use BeautifulSoup’s methods like .find(), .find_all(), .select() to locate elements.
  • Use recursion to handle deeply nested structures.

2. Parsing with XPath

  • Identify elements using unique paths in the DOM tree.
  • Example XPath queries:
    • //div[@class='product']: Selects all <div> elements with the class "product".
    • //a[@href]: Selects all <a> elements with an href attribute.

3. Using Regular Expressions

  • Extract data based on patterns.
  • Example:
    • Extract all email addresses: r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'
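
A short sketch applying the email pattern above with Python's re module:

import re

text = "Contact support@example.com or sales@example.org for details."
emails = re.findall(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}', text)
print(emails)  # ['support@example.com', 'sales@example.org']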

4. Parsing JSON Data Embedded in HTML

  • Extract and decode JSON objects from <script> tags.
  • Use Python’s json module to process JSON.
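
A minimal sketch of decoding JSON embedded in a <script> tag; the tag id and data layout below are assumptions for illustration:

import json
from bs4 import BeautifulSoup

html = '''
<html><body>
<script id="product-data" type="application/json">
{"products": [{"name": "Product A", "price": 19.99}]}
</script>
</body></html>
'''

soup = BeautifulSoup(html, 'html.parser')
script_tag = soup.find('script', id='product-data')
data = json.loads(script_tag.string)

for product in data["products"]:
    print(product["name"], product["price"])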

5. Handling Edge Cases

  • Missing data: Use conditional checks (see the sketch after this list).
  • Dynamic attributes: Use wildcard patterns (e.g., contains in XPath).
  • Multi-language content: Ensure proper encoding and decoding.
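
For the missing-data case noted above, guard element lookups with conditional checks so an absent tag yields None instead of raising AttributeError; a brief sketch:

from bs4 import BeautifulSoup

html = '<div class="product"><h2 class="product-name">Product A</h2></div>'  # no price element
product = BeautifulSoup(html, 'html.parser').find('div', class_='product')

name_tag = product.find('h2', class_='product-name')
price_tag = product.find('span', class_='product-price')

name = name_tag.text.strip() if name_tag else None
price = price_tag.text.strip() if price_tag else None
print({"name": name, "price": price})  # {'name': 'Product A', 'price': None}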

4. Sample Code: Parsing Nested and Complex Data

Scenario: Extract product details from a nested HTML structure.

from bs4 import BeautifulSoup
import requests

# Fetch HTML content
url = "https://example-ecommerce-site.com/products"
response = requests.get(url)
response.raise_for_status()

# Parse with BeautifulSoup
soup = BeautifulSoup(response.text, 'html.parser')

# Extract products
products = []
product_elements = soup.find_all('div', class_='product')

for product in product_elements:
    name = product.find('h2', class_='product-name').text.strip()
    price = product.find('span', class_='product-price').text.strip()
    rating = product.find('div', class_='product-rating').text.strip()
    products.append({"name": name, "price": price, "rating": rating})

# Print the extracted data
for product in products:
    print(product)

Using XPath with lxml

from lxml import html
import requests

# Fetch HTML content
url = "https://example-ecommerce-site.com/products"
response = requests.get(url)
response.raise_for_status()

# Parse with lxml
page = html.fromstring(response.content)

# Extract products using XPath
names = page.xpath("//h2[@class='product-name']/text()")
prices = page.xpath("//span[@class='product-price']/text()")
ratings = page.xpath("//div[@class='product-rating']/text()")

# Combine data
products = [
    {"name": name, "price": price, "rating": rating}
    for name, price, rating in zip(names, prices, ratings)
]

# Print the extracted data
for product in products:
    print(product)

5. Assignment

Objective

Parse and extract data from a news website’s HTML, which contains complex nested structures.

Requirements

  1. Extract article titles, publication dates, and author names.
  2. Use BeautifulSoup and/or lxml with XPath.
  3. Save the extracted data to a CSV file.

Solution

import csv
from bs4 import BeautifulSoup
import requests

# Fetch HTML content
url = "https://example-news-site.com"
response = requests.get(url)
response.raise_for_status()

# Parse with BeautifulSoup
soup = BeautifulSoup(response.text, 'html.parser')

# Extract articles
articles = []
article_elements = soup.find_all('article')

for article in article_elements:
    title = article.find('h1', class_='article-title').text.strip()
    date = article.find('time', class_='publication-date')['datetime']
    author = article.find('span', class_='author-name').text.strip()
    articles.append({"title": title, "date": date, "author": author})

# Save to CSV
with open('articles.csv', 'w', newline='', encoding='utf-8') as file:
    writer = csv.DictWriter(file, fieldnames=["title", "date", "author"])
    writer.writeheader()
    writer.writerows(articles)

print("Data saved to articles.csv")

6. Quiz

Objective: Test understanding of parsing techniques.

Questions

  1. What is the primary purpose of XPath?

    • a) To scrape dynamic content
    • b) To query and navigate XML/HTML documents
    • c) To parse JSON data
    • d) To rotate proxies
  2. Which library is best for handling deeply nested HTML structures?

    • a) NumPy
    • b) BeautifulSoup
    • c) Pandas
    • d) PyTorch
  3. What does the following XPath query select? //div[@class='item']/span/text()

    • a) All <div> elements with the class "item"
    • b) All <span> elements inside <div> elements with the class "item"
    • c) Text inside <span> elements within <div> elements with the class "item"
    • d) Attributes of <span> elements
  4. What Python library is typically used to parse JSON embedded in HTML?

    • a) lxml
    • b) json
    • c) BeautifulSoup
    • d) Selenium
  5. Which of these is NOT an advanced parsing technique?

    • a) Regular Expressions
    • b) User-Agent Rotation
    • c) JSON Parsing
    • d) XPath Queries

Answers

  1. b) To query and navigate XML/HTML documents
  2. b) BeautifulSoup
  3. c) Text inside <span> elements within <div> elements with the class "item"
  4. b) json
  5. b) User-Agent Rotation

4. Handling Anti-Scraping Mechanisms

Lecture Notes: Handling Anti-Scraping Mechanisms

1. Introduction to Anti-Scraping Mechanisms

What are Anti-Scraping Mechanisms?

Websites implement anti-scraping mechanisms to protect their content, ensure fair usage, and prevent server overload. These techniques are designed to identify and block bots and automated tools.

Common Anti-Scraping Mechanisms

  1. CAPTCHAs: Used to verify that a real human is accessing the site.
  2. IP Blocking: Blocking requests from specific IPs suspected of being bots.
  3. Rate Limiting: Limiting the number of requests within a specified time frame.
  4. User-Agent Validation: Detecting and blocking bots based on unusual or default user-agent strings.
  5. Dynamic Content Loading: Using JavaScript to generate content dynamically to prevent direct scraping.
  6. Honeypot Traps: Hidden links or form fields, invisible to human visitors, designed to catch bots.

2. Strategies to Overcome Anti-Scraping Mechanisms

1. Using Proxies

Proxies help hide the real IP address of the scraper, enabling rotation to prevent IP blocking.

  • Rotating Proxies: Services like ScraperAPI and Bright Data provide rotating proxy pools.
  • Residential Proxies: Appear as legitimate IPs, reducing the likelihood of detection.

2. User-Agent Rotation

Rotate user-agent headers to mimic different devices and browsers.

3. Handling CAPTCHAs

  • Use CAPTCHA-solving services (e.g., 2Captcha, Anti-Captcha).
  • Employ machine learning models for CAPTCHA recognition (for advanced users).
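
The flow with such a service is usually: submit the CAPTCHA parameters, poll for a solution token, then attach the token to the blocked request. The sketch below is generic; the endpoint, fields, and polling protocol are placeholders rather than any specific provider's API:

import time
import requests

SOLVER_URL = "https://captcha-solver.example.com"   # placeholder endpoint
API_KEY = "your_solver_api_key"

def solve_captcha(site_key, page_url, timeout=120):
    # Submit the CAPTCHA task to the solving service
    task = requests.post(f"{SOLVER_URL}/tasks",
                         json={"key": API_KEY, "sitekey": site_key, "url": page_url}).json()
    task_id = task["id"]

    # Poll until the service returns a solution token or we give up
    deadline = time.time() + timeout
    while time.time() < deadline:
        result = requests.get(f"{SOLVER_URL}/tasks/{task_id}", params={"key": API_KEY}).json()
        if result.get("status") == "ready":
            return result["token"]
        time.sleep(5)
    raise TimeoutError("CAPTCHA was not solved in time")

# The returned token is then submitted with the form or request that triggered the CAPTCHA.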

4. Delays and Randomization

  • Add random delays between requests to mimic human behavior.
  • Randomize request patterns and navigation paths.

5. Avoiding Detection

  • Use headless browsers like Puppeteer or Selenium judiciously.
  • Disable features like WebDriver flag detection in Selenium.
  • Ensure proper session management using cookies and headers.
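
A brief sketch of Chrome options commonly used with Selenium to reduce obvious automation signals; these are examples rather than a guaranteed bypass, and should be combined with the proxy and delay strategies above:

from selenium import webdriver

options = webdriver.ChromeOptions()
# Hide the "controlled by automated test software" infobar
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option("useAutomationExtension", False)
# Reduce navigator.webdriver-based detection
options.add_argument("--disable-blink-features=AutomationControlled")
# Present a realistic user agent
options.add_argument(
    "user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36"
)

driver = webdriver.Chrome(options=options)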

3. Sample Code: Bypassing Basic Anti-Scraping Mechanisms

Scenario: Scraping a website that blocks repeated requests from a single IP, by rotating proxies and user agents and adding random delays.

import requests
from bs4 import BeautifulSoup
import random
import time

# List of proxies
proxies = [
    {"http": "http://proxy1:port", "https": "https://proxy1:port"},
    {"http": "http://proxy2:port", "https": "https://proxy2:port"},
    {"http": "http://proxy3:port", "https": "https://proxy3:port"}
]

# List of user agents
user_agents = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
    "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0"
]

# Target URL
url = "https://example-anti-scraping-site.com"

# Attempt to scrape
for _ in range(5):
    try:
        # Randomly select a proxy and user agent
        proxy = random.choice(proxies)
        user_agent = random.choice(user_agents)

        # Set headers
        headers = {"User-Agent": user_agent}

        # Send GET request
        response = requests.get(url, headers=headers, proxies=proxy, timeout=10)
        response.raise_for_status()

        # Parse the HTML content
        soup = BeautifulSoup(response.text, 'html.parser')
        print(soup.title.text)

        # Add a random delay
        time.sleep(random.uniform(2, 5))

    except requests.exceptions.RequestException as e:
        print(f"Request failed: {e}")

Assignment

Objective

Scrape product reviews from a website with CAPTCHA and rate-limiting mechanisms.

Requirements

  1. Rotate proxies and user agents.
  2. Implement delays to avoid detection.
  3. Skip CAPTCHA-protected pages.

Solution

import requests
from bs4 import BeautifulSoup
import random
import time

# Proxies and user agents
proxies = [
    {"http": "http://proxy1:port", "https": "https://proxy1:port"},
    {"http": "http://proxy2:port", "https": "https://proxy2:port"}
]
user_agents = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)"
]

# Target URL
url = "https://example-reviews-site.com"

# Scrape reviews
reviews = []

for page in range(1, 6):
    try:
        # Rotate proxy and user agent
        proxy = random.choice(proxies)
        user_agent = random.choice(user_agents)
        headers = {"User-Agent": user_agent}

        # Send request
        response = requests.get(f"{url}/page/{page}", headers=headers, proxies=proxy)
        response.raise_for_status()

        # Parse content
        soup = BeautifulSoup(response.text, 'html.parser')
        review_elements = soup.find_all('div', class_='review')

        for review in review_elements:
            text = review.find('p').text.strip()
            reviews.append(text)

        # Random delay
        time.sleep(random.uniform(3, 7))

    except requests.exceptions.RequestException as e:
        print(f"Error on page {page}: {e}")

# Print reviews
for idx, review in enumerate(reviews, 1):
    print(f"{idx}. {review}")

Quiz

Objective: Test understanding of anti-scraping mechanisms and solutions.

Questions

  1. Which of the following is a common anti-scraping technique?

    • a) User-Agent validation
    • b) Infinite scroll detection
    • c) Proxy list generation
    • d) HTML parsing
  2. What is the purpose of rotating proxies?

    • a) To prevent rate-limiting
    • b) To mimic multiple IPs
    • c) To bypass IP bans
    • d) All of the above
  3. Which Python library is commonly used for CAPTCHA solving?

    • a) Selenium
    • b) BeautifulSoup
    • c) 2Captcha
    • d) NumPy
  4. How can random delays between requests help in scraping?

    • a) They speed up the scraping process.
    • b) They mimic human-like browsing behavior.
    • c) They bypass CAPTCHA directly.
    • d) They change the HTML structure.
  5. What header should be rotated to avoid detection?

    • a) Accept-Encoding
    • b) User-Agent
    • c) Content-Type
    • d) Referrer-Policy

Answers

  1. a) User-Agent validation
  2. d) All of the above
  3. c) 2Captcha
  4. b) They mimic human-like browsing behavior
  5. b) User-Agent

3. Introduction to Dynamic Website Scraping

Lecture Notes: Dynamic Website Scraping

1. Introduction to Dynamic Website Scraping

Definition

Dynamic websites generate content on the client side using JavaScript. Unlike static websites, where content is embedded in the HTML, dynamic content requires rendering to access the data.

Challenges in Scraping Dynamic Websites

  • JavaScript-rendered content: Content isn't available in the initial HTML response.
  • Infinite scrolling: Requires loading additional data dynamically.
  • Anti-scraping mechanisms: Websites may block automated access.

2. Tools for Dynamic Website Scraping

1. Selenium

A popular Python library for browser automation that can render JavaScript.

  • Advantages: Handles complex interactions, dynamic content.
  • Limitations: Slower compared to HTTP-based scraping.

2. Playwright or Puppeteer

Tools for browser automation, similar to Selenium, but optimized for performance.

3. Network Request Monitoring

  • Use browser developer tools to inspect network requests and directly scrape data from APIs.

3. Techniques for Scraping Dynamic Websites

1. Using Selenium for JavaScript Rendering

Selenium can automate a browser to load JavaScript content, interact with elements, and extract data.

2. Handling Infinite Scrolling

Simulate scrolling actions to load additional content dynamically.

3. Extracting Data from APIs

Reverse-engineer network requests to identify and call APIs directly.
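
Once the underlying endpoint is identified in the browser's Network tab, it can often be called directly with requests, skipping browser rendering entirely. A minimal sketch; the endpoint, parameters, and response shape are hypothetical:

import requests

# Hypothetical JSON endpoint discovered via the browser's Network tab
api_url = "https://example-dynamic-site.com/api/products"
params = {"page": 1, "page_size": 50}
headers = {"User-Agent": "Mozilla/5.0", "Accept": "application/json"}

response = requests.get(api_url, params=params, headers=headers, timeout=10)
response.raise_for_status()

for product in response.json().get("products", []):
    print(product.get("name"), product.get("price"))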

4. Bypassing Anti-Scraping Measures

  • Rotate proxies and user-agents.
  • Add delays between requests.

Sample Code: Scraping Dynamic Content Using Selenium

Scenario: Scrape product data from a JavaScript-rendered e-commerce site.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.keys import Keys
import time

# Setup Selenium WebDriver
service = Service('path/to/chromedriver')  # Update with the correct path
options = webdriver.ChromeOptions()
options.add_argument('--headless')  # Run in headless mode for efficiency
options.add_argument('--disable-gpu')

# Start the browser
driver = webdriver.Chrome(service=service, options=options)

try:
    # Open the target URL
    driver.get('https://example-dynamic-site.com/products')

    # Wait for JavaScript to render the page
    time.sleep(5)  # Adjust based on website's loading time

    # Find product elements
    products = driver.find_elements(By.CLASS_NAME, 'product-item')

    # Extract product details
    for product in products:
        name = product.find_element(By.CLASS_NAME, 'product-name').text
        price = product.find_element(By.CLASS_NAME, 'product-price').text
        print(f"Product: {name}, Price: {price}")

finally:
    driver.quit()

Assignment

Objective

Scrape a dynamic website with infinite scrolling to extract data.

Requirements

  1. Scrape article titles from a news website with infinite scrolling.
  2. Use Selenium to simulate scrolling.
  3. Save the data to a CSV file.

Solution

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
import time
import csv

# Setup Selenium WebDriver
service = Service('path/to/chromedriver')  # Update with the correct path
options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--disable-gpu')

# Start the browser
driver = webdriver.Chrome(service=service, options=options)

try:
    # Open the target website
    driver.get('https://example-news-site.com')

    # Infinite scrolling logic
    scroll_pause = 2
    last_height = driver.execute_script("return document.body.scrollHeight")

    titles = []

    while True:
        # Scroll down to the bottom of the page
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(scroll_pause)

        # Extract article titles
        articles = driver.find_elements(By.CLASS_NAME, 'article-title')
        for article in articles:
            title = article.text
            if title not in titles:
                titles.append(title)

        # Check if we've reached the end of the page
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break
        last_height = new_height

    # Save to CSV
    with open('articles.csv', 'w', newline='', encoding='utf-8') as file:
        writer = csv.writer(file)
        writer.writerow(['Article Title'])
        for title in titles:
            writer.writerow([title])

    print("Data saved to articles.csv")

finally:
    driver.quit()

Quiz

Objective: Assess understanding of dynamic website scraping.

Questions

  1. What makes dynamic websites different from static websites?

    • a) They have static content.
    • b) Content is rendered server-side.
    • c) Content is rendered client-side using JavaScript.
    • d) They do not use HTML.
  2. Which library is best suited for rendering JavaScript in Python?

    • a) BeautifulSoup
    • b) Requests
    • c) Selenium
    • d) NumPy
  3. What method can be used in Selenium to simulate scrolling?

    • a) driver.render_page()
    • b) driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    • c) driver.scroll_page()
    • d) driver.load_all_content()
  4. How can infinite scrolling be handled in web scraping?

    • a) Using a larger user-agent
    • b) By loading all pages at once
    • c) By simulating scroll actions until no new content loads
    • d) By avoiding JavaScript altogether
  5. What is a key advantage of extracting data directly from APIs compared to scraping rendered content?

    • a) It’s slower.
    • b) It’s harder to understand.
    • c) It provides structured data more efficiently.
    • d) It requires more computing resources.

Answers

  1. c) Content is rendered client-side using JavaScript
  2. c) Selenium
  3. b) driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
  4. c) By simulating scroll actions until no new content loads
  5. c) It provides structured data more efficiently

2. Overview of Advanced Web Scraping

Lecture Notes: Introduction to Advanced Web Scraping

1. Introduction to Advanced Web Scraping

Definition of Web Scraping

Web scraping is the process of programmatically extracting data from websites. While basic web scraping involves retrieving data from static HTML pages, advanced web scraping deals with dynamic content, large-scale scraping, and overcoming anti-scraping mechanisms.


Goals of Advanced Web Scraping

  • Scraping dynamic content rendered by JavaScript.

  • Efficiently handling large-scale scraping projects.

  • Bypassing common anti-scraping techniques like CAPTCHA and IP bans.

  • Ensuring compliance with ethical and legal standards.


Applications of Advanced Web Scraping

  • E-commerce: Price comparison and product monitoring.

  • News Aggregation: Collecting data from multiple news outlets.

  • Market Research: Gathering competitor data and trends.

  • Sentiment Analysis: Scraping social media for public opinions.


2. Ethical and Legal Considerations

1. Ethics of Web Scraping

  • Always check the website's robots.txt file for scraping permissions (see the sketch after this list).

  • Avoid overloading a website’s server with frequent requests.

  • Respect user privacy and do not scrape sensitive or personal information.
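
The robots.txt check mentioned above can be automated with the standard library's urllib.robotparser; a brief sketch:

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://example-news-site.com/robots.txt")
rp.read()

# Check whether our crawler may fetch a specific path
print(rp.can_fetch("MyScraperBot", "https://example-news-site.com/articles"))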

2. Legal Considerations

  • Adhere to the website's terms of service.

  • Be cautious of copyright and intellectual property laws.

  • Obtain explicit permission when necessary.


3. Tools and Libraries for Advanced Scraping

Python Libraries

  • BeautifulSoup: For parsing HTML and XML documents.

  • Requests: For making HTTP requests.

  • Scrapy: A robust framework for large-scale scraping.

  • Selenium: For interacting with JavaScript-heavy websites using a browser.

Non-Python Tools

  • Puppeteer: A headless browser automation tool.

  • Playwright: For automating browser interactions across multiple browsers.

  • Proxy Tools: Services like ScraperAPI or Bright Data to bypass IP bans.


Sample Code: Web Scraping Using Python

Scenario: Scrape the latest news headlines from a static website.

import requests
from bs4 import BeautifulSoup

# URL of the target website
url = "https://example-news-site.com"

# Send an HTTP GET request
response = requests.get(url, timeout=10)  # Time out instead of hanging on a slow server
response.raise_for_status()  # Check for HTTP request errors

# Parse the HTML content
soup = BeautifulSoup(response.text, 'html.parser')

# Extract headlines (assuming they are in <h2> tags with class 'headline')
headlines = soup.find_all('h2', class_='headline')

# Print extracted headlines
print("Latest News Headlines:")
for idx, headline in enumerate(headlines, 1):
    print(f"{idx}. {headline.text.strip()}")

Assignment

Objective

Scrape product names and prices from an e-commerce website’s product listing page.

Requirements

  1. Extract the product names and their corresponding prices.

  2. Save the data to a CSV file.

  3. Handle HTTP request errors gracefully.

Solution

import requests
from bs4 import BeautifulSoup
import csv

# Target URL
url = "https://example-ecommerce-site.com/products"

try:
    # Send an HTTP GET request
    response = requests.get(url, timeout=10)  # 10-second timeout so a stalled request fails fast
    response.raise_for_status()

    # Parse HTML content
    soup = BeautifulSoup(response.text, 'html.parser')

    # Find products and prices
    products = soup.find_all('div', class_='product-item')

    # Prepare CSV file
    with open('products.csv', 'w', newline='', encoding='utf-8') as file:
        writer = csv.writer(file)
        writer.writerow(["Product Name", "Price"])

        # Extract and write product data, skipping items missing expected fields
        for product in products:
            name_tag = product.find('h3', class_='product-name')
            price_tag = product.find('span', class_='product-price')
            if name_tag is None or price_tag is None:
                continue  # Malformed listing: skip it rather than crash
            writer.writerow([name_tag.text.strip(), price_tag.text.strip()])

    print("Data saved to products.csv")
except requests.exceptions.RequestException as e:
    print(f"Error during request: {e}")

Quiz

Objective: Assess understanding of advanced web scraping concepts.

Questions

  1. What is the purpose of the robots.txt file in web scraping?

    • a) To provide legal protection for web scraping

    • b) To define the structure of a website

    • c) To specify which parts of the website can be crawled

    • d) To encrypt sensitive data

  2. Which Python library is best suited for interacting with JavaScript-rendered content?

    • a) BeautifulSoup

    • b) Requests

    • c) Selenium

    • d) NumPy

  3. What does the response.raise_for_status() method do?

    • a) Parses HTML content

    • b) Checks for HTTP request errors

    • c) Rotates proxies

    • d) Generates a browser simulation

  4. Which of the following is NOT a common anti-scraping mechanism?

    • a) CAPTCHA

    • b) IP blocking

    • c) Dynamic HTML

    • d) User-agent strings

  5. What is the primary purpose of using proxies in web scraping?

    • a) To enhance scraping speed

    • b) To bypass rate limits and IP bans

    • c) To parse JavaScript

    • d) To store scraped data

Answers

  1. c) To specify which parts of the website can be crawled

  2. c) Selenium

  3. b) Checks for HTTP request errors

  4. d) User-agent strings

  5. b) To bypass rate limits and IP bans

1. Web Scraping Intro

 Here’s a detailed curriculum and design for an Advanced Web Scraping course. The focus is on teaching advanced techniques, best practices, and ethical considerations.


Curriculum Outline: Advanced Web Scraping

1. Introduction to Advanced Web Scraping

  • Overview of advanced concepts in web scraping
  • Difference between basic and advanced web scraping
  • Ethical and legal considerations
  • Tools and libraries for advanced scraping (e.g., Scrapy, Selenium, BeautifulSoup, Requests, Puppeteer)

2. Dynamic Websites and JavaScript Rendering

  • Scraping websites with dynamic content
  • Introduction to headless browsers (e.g., Selenium, Puppeteer)
  • Handling single-page applications (SPAs) and infinite scrolling
  • Extracting data from Shadow DOM and iframes (see the sketch below)
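
As a preview of the iframe and Shadow DOM topic, here is a minimal Selenium sketch. The URL, frame id, and custom element name are hypothetical, and the shadow_root property requires Selenium 4 with a Chromium-based browser.

from selenium import webdriver
from selenium.webdriver.common.by import By

options = webdriver.ChromeOptions()
options.add_argument('--headless')
driver = webdriver.Chrome(options=options)

try:
    driver.get("https://example.com/embedded-content")  # hypothetical page

    # An <iframe> hosts a separate document: switch into it before locating elements
    frame = driver.find_element(By.CSS_SELECTOR, 'iframe#content-frame')  # hypothetical id
    driver.switch_to.frame(frame)
    print("Inside iframe:", driver.find_element(By.TAG_NAME, 'h1').text)
    driver.switch_to.default_content()  # return to the top-level document

    # Shadow DOM (Selenium 4): get the host element's shadow root, then search inside it
    host = driver.find_element(By.CSS_SELECTOR, 'my-widget')  # hypothetical custom element
    shadow = host.shadow_root
    print("Inside shadow DOM:", shadow.find_element(By.CSS_SELECTOR, 'span.value').text)
finally:
    driver.quit()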

3. Handling Anti-Scraping Mechanisms

  • Understanding anti-bot measures (e.g., CAPTCHA, rate-limiting, IP blocking)
  • Using proxies and rotating IPs (e.g., ScraperAPI, Bright Data)
  • Implementing user-agent rotation and session persistence (illustrated after this list)
  • Strategies to bypass CAPTCHAs (e.g., third-party solvers, machine learning techniques)
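
To ground the user-agent rotation and proxy points, here is a minimal sketch using requests. The user-agent strings are example values and the proxy endpoint is hypothetical; a commercial proxy service would supply real credentials.

import random
import requests

# Small pool of user-agent strings to rotate through (example values)
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
]

# Hypothetical proxy endpoint -- a proxy provider supplies the real host and credentials
PROXIES = {
    "http": "http://user:pass@proxy.example.com:8000",
    "https": "http://user:pass@proxy.example.com:8000",
}

session = requests.Session()  # Session persistence keeps cookies across requests

def fetch(url):
    # Pick a different user agent for each request
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    response = session.get(url, headers=headers, proxies=PROXIES, timeout=10)
    response.raise_for_status()
    return response.text

html = fetch("https://example-ecommerce-site.com/products")
print(len(html), "bytes downloaded")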

4. Advanced Parsing Techniques

  • Parsing complex HTML and nested structures
  • Extracting data using XPath and CSS selectors (see the example after this list)
  • Handling large datasets efficiently
  • Managing encoding issues (e.g., Unicode, non-UTF8 pages)
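
A quick illustration of the two selector styles: the same element extracted with an lxml XPath expression and with BeautifulSoup's CSS-based select(). The HTML snippet is an inline example.

from lxml import html
from bs4 import BeautifulSoup

page = """
<html><body>
  <div class="product-item">
    <h3 class="product-name">Widget</h3>
    <span class="product-price">$9.99</span>
  </div>
</body></html>
"""

# XPath with lxml: address elements by structure and attributes
tree = html.fromstring(page)
names_xpath = tree.xpath('//div[@class="product-item"]/h3[@class="product-name"]/text()')
print("XPath:", names_xpath)

# CSS selectors with BeautifulSoup's select()
soup = BeautifulSoup(page, 'html.parser')
names_css = [tag.get_text(strip=True) for tag in soup.select('div.product-item h3.product-name')]
print("CSS:", names_css)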

5. Working with APIs

  • Reverse-engineering APIs using browser developer tools (a sketch follows this list)
  • Authentication techniques (e.g., OAuth, tokens, cookies)
  • Using GraphQL and REST APIs for scraping
  • Handling rate limits and API errors gracefully
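
A minimal sketch of calling a reverse-engineered JSON endpoint with basic rate-limit handling. The endpoint URL, bearer token, and response fields are assumptions made for illustration.

import time
import requests

# Hypothetical JSON endpoint discovered in the browser's Network tab
api_url = "https://example-ecommerce-site.com/api/v1/products"
headers = {"Authorization": "Bearer <your-token>"}  # token obtained after login

def get_with_retry(url, params=None, max_retries=3):
    for attempt in range(max_retries):
        response = requests.get(url, headers=headers, params=params, timeout=10)
        if response.status_code == 429:
            # Rate limited: honour Retry-After if present, otherwise back off exponentially
            wait = int(response.headers.get("Retry-After", 2 ** attempt))
            time.sleep(wait)
            continue
        response.raise_for_status()
        return response.json()
    raise RuntimeError("Rate limit retries exhausted")

data = get_with_retry(api_url, params={"page": 1, "per_page": 50})
for item in data.get("products", []):
    print(item.get("name"), item.get("price"))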

6. Data Storage and Post-Processing

  • Storing scraped data in databases (SQL, NoSQL)
  • Exporting data to formats like JSON, CSV, Excel
  • Cleaning and transforming data with Pandas (see the sketch after this list)
  • Integration with ETL pipelines
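
A short post-processing sketch: clean scraped records with Pandas, export them to CSV and JSON, and store them in SQLite. The example records are made up.

import sqlite3
import pandas as pd

# Example scraped records (in practice these come from the scraper)
records = [
    {"name": "Widget", "price": "$9.99"},
    {"name": "Gadget", "price": "$19.50"},
]

df = pd.DataFrame(records)

# Clean: strip the currency symbol and convert prices to numbers
df["price"] = df["price"].str.replace("$", "", regex=False).astype(float)

# Export to common formats
df.to_csv("products_clean.csv", index=False)
df.to_json("products_clean.json", orient="records")

# Store in a SQLite database for later querying
with sqlite3.connect("scraped.db") as conn:
    df.to_sql("products", conn, if_exists="replace", index=False)

print(df)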

7. Scraping at Scale

  • Distributed scraping using tools like Scrapy and Celery
  • Managing concurrency and parallelism (illustrated below)
  • Caching strategies to reduce redundant requests
  • Monitoring and maintaining scraping jobs
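
A minimal concurrency sketch using Python's concurrent.futures thread pool. The URLs are placeholders; keep the pool small so the target server is not overloaded.

import concurrent.futures
import requests

# Hypothetical list of pages to fetch
urls = [f"https://example-news-site.com/page/{i}" for i in range(1, 11)]

def fetch(url):
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return url, len(response.text)

# A small thread pool fetches several pages in parallel
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as pool:
    for url, size in pool.map(fetch, urls):
        print(f"{url}: {size} bytes")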

8. Advanced Use Cases

  • Scraping e-commerce sites for product data
  • Gathering social media data (Twitter, Instagram, LinkedIn)
  • News aggregation and sentiment analysis
  • Building custom web crawlers for large-scale data collection

9. Introduction to Machine Learning in Web Scraping

  • Using NLP for data extraction, e.g., named entity recognition (see the sketch below)
  • Automating data classification and categorization
  • Introduction to AI-driven scraping tools
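
A small named-entity-recognition sketch with spaCy, assuming the library and its en_core_web_sm model are installed. The sample sentence is purely illustrative.

import spacy

# Requires: pip install spacy && python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

# Example of scraped text from which we want to pull named entities
text = "Tata Consultancy Services was incorporated in Mumbai in 1995."

doc = nlp(text)
for ent in doc.ents:
    print(ent.text, "->", ent.label_)  # e.g., ORG, GPE, DATE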

10. Deploying and Monitoring Scraping Bots

  • Deploying scraping bots to cloud platforms (AWS, GCP, Azure)
  • Automating scraping with CI/CD pipelines
  • Monitoring and logging scraping activities (a logging sketch follows this list)
  • Debugging common errors and failures
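
A minimal logging sketch for scraping jobs using Python's standard logging module; the file name and URL are placeholders. On a cloud platform, logs written this way (or to stdout) can be shipped to the provider's logging service.

import logging

# Basic logging setup so scraping runs leave an auditable trail
logging.basicConfig(
    filename="scraper.log",
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)
logger = logging.getLogger("scraper")

def scrape_page(url):
    logger.info("Fetching %s", url)
    try:
        # ... actual scraping logic would go here ...
        logger.info("Finished %s", url)
    except Exception:
        # exception() records the full traceback for later debugging
        logger.exception("Failed while scraping %s", url)

scrape_page("https://example-news-site.com")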

Design: Course Delivery

1. Duration

  • Total time: 6–8 weeks
  • Weekly sessions: 2–3 hours each
  • Hands-on projects at the end of each module

2. Teaching Methods

  • Lecture: Theoretical concepts and walkthroughs
  • Live Coding: Practical demonstrations during class
  • Assignments: Mini-projects to practice concepts
  • Case Studies: Real-world scraping challenges
  • Capstone Project: Build a fully functional scraper for a complex website

3. Tools & Resources

  • Software: Python, Scrapy, Selenium, Puppeteer
  • Platforms: Jupyter Notebook, Google Colab, GitHub
  • Libraries: BeautifulSoup, Requests, Pandas, SQLAlchemy
  • Cloud Services: AWS Lambda, Google Cloud Functions, Docker

4. Assessment

  • Weekly quizzes
  • Graded assignments
  • Final capstone project evaluation
  • Peer reviews

Basics

Web scraping is the process of automatically extracting data and specific information from websites using software or scripts.

Web scraping is often confused with web crawling. To understand the difference between the two terms, consider the comparison below:

Web Crawling vs. Web Scraping

  • Scope: Web crawling refers to downloading and storing the contents of a large number of websites; web scraping refers to extracting individual data elements from a website using its site-specific structure.
  • Scale: Crawling is mostly done at a large scale; scraping can be implemented at any scale.
  • Output: Crawling yields generic information; scraping yields specific information.
  • Typical use: Crawling is used by major search engines such as Google, Bing, and Yahoo (Googlebot is an example of a web crawler); data extracted by scraping, such as names, addresses, and prices, is typically used for data analysis or to populate another site.
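
To make the distinction concrete, the sketch below first gathers every link on a page (a crawl-style step) and then extracts one specific field from the same page (a scrape-style step). The URL and selectors are placeholders.

import requests
from bs4 import BeautifulSoup

url = "https://example-news-site.com"  # hypothetical site
soup = BeautifulSoup(requests.get(url, timeout=10).text, 'html.parser')

# Crawling-style step: discover links that a crawler would follow
links = [a["href"] for a in soup.find_all("a", href=True)]
print(f"Discovered {len(links)} links to crawl")

# Scraping-style step: extract specific data elements from this page
headlines = [h2.get_text(strip=True) for h2 in soup.find_all("h2", class_="headline")]
print("Headlines on this page:", headlines)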
