Monday, 13 January 2025

12. Real-World Scraping Project

Lecture Notes: Real-World Scraping Project on https://mca.gov.in

1. Introduction

Scraping real-world websites like MCA (Ministry of Corporate Affairs) involves dealing with various challenges such as dynamic content, anti-scraping mechanisms, and complex data structures. This lecture demonstrates a step-by-step approach to scraping this site using Python, focusing on deployment and execution on Google Cloud Platform (GCP).


2. Challenges in Scraping MCA Website

Key Challenges

  1. Dynamic Content: The site uses JavaScript to load data, requiring a browser automation tool like Selenium.
  2. Anti-Scraping Mechanisms: CAPTCHA, rate limiting, and bot detection.
  3. Complex Data Structures: Nested tables and pagination for structured data.
  4. Legal and Ethical Considerations: Adhering to the site's terms of service and responsible scraping practices.

3. Setting Up GCP for the Project

Objective

Deploy the scraping bot on GCP to handle large-scale scraping tasks efficiently.

Steps

  1. Create a GCP Project:

    • Go to the GCP Console.
    • Create a new project and enable billing.
  2. Set Up a VM Instance:

    • Navigate to Compute Engine → VM Instances → Create Instance.
    • Select an appropriate machine type (e.g., e2-medium).
    • Install Docker or Python on the VM.
  3. Enable Required APIs:

    • Enable APIs like Cloud Logging and Cloud Storage.
  4. Install Required Libraries:

    • Use SSH to connect to the VM.
    • Install libraries like Selenium, BeautifulSoup, and the gcloud SDK (a quick import check is sketched below).
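
A quick way to confirm the environment is ready is a small import check; a minimal sketch, assuming selenium, beautifulsoup4, and google-cloud-logging were installed in the previous step:

# check_setup.py - verify that the scraping dependencies are importable on the VM
import selenium
import bs4
from google.cloud import logging as gcp_logging

print("Selenium version:", selenium.__version__)
print("BeautifulSoup version:", bs4.__version__)
print("Cloud Logging client available:", hasattr(gcp_logging, "Client"))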

4. Code for Scraping MCA Website

Objective

Extract company details such as name, CIN (Corporate Identity Number), and registration date from MCA's search page.

Code Example

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
import time

# Initialize Selenium WebDriver (headless mode for GCP)
options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')

# Initialize the Chrome driver (Selenium 4.6+ locates ChromeDriver automatically via Selenium Manager)
driver = webdriver.Chrome(options=options)

try:
    # Open MCA search page
    driver.get("https://mca.gov.in/mcafoportal/viewCompanyMasterData.do")

    # Wait for the page to load
    time.sleep(5)

    # Locate search input field and enter query (e.g., company name)
    search_input = driver.find_element(By.ID, 'companyName')
    search_input.send_keys("Tata")
    search_input.send_keys(Keys.RETURN)

    # Wait for results to load
    time.sleep(10)

    # Extract data from results table
    rows = driver.find_elements(By.CSS_SELECTOR, 'table tbody tr')
    for row in rows:
        columns = row.find_elements(By.TAG_NAME, 'td')
        if len(columns) > 0:
            print("Name:", columns[0].text)
            print("CIN:", columns[1].text)
            print("Registration Date:", columns[2].text)
            print("---")
finally:
    driver.quit()
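
Note: the fixed time.sleep() calls above are brittle on slow networks. Explicit waits are more reliable; below is a minimal sketch of what the waiting portion of the try block could look like instead, reusing the driver, By, and Keys objects and the same element IDs and selectors (which are assumptions about the MCA page):

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 20 seconds for the search field instead of sleeping a fixed amount
wait = WebDriverWait(driver, 20)
search_input = wait.until(EC.presence_of_element_located((By.ID, 'companyName')))
search_input.send_keys("Tata")
search_input.send_keys(Keys.RETURN)

# Wait until at least one result row has rendered before extracting data
rows = wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, 'table tbody tr')))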

5. Deploying the Scraper to GCP

Steps

  1. Create a Dockerfile:

    FROM python:3.9
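    # Note: headless Selenium also needs a browser inside the image
    # (e.g., apt-get install chromium chromium-driver, or connect to a remote selenium/standalone-chrome container)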
    
    # Install dependencies
    WORKDIR /app
    COPY requirements.txt .
    RUN pip install -r requirements.txt
    
    # Copy scraper code
    COPY . .
    
    # Command to run the scraper
    CMD ["python", "scraper.py"]
    
  2. Build and Push Docker Image:

    docker build -t mca-scraper .
    docker tag mca-scraper gcr.io/<project-id>/mca-scraper
    docker push gcr.io/<project-id>/mca-scraper
    
  3. Deploy on GCP:

    • Use Cloud Run or Kubernetes Engine to deploy the container.
    • Configure environment variables for dynamic inputs (see the sketch below).
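
Inside the container, the scraper can read these dynamic inputs at startup; a minimal sketch (the variable names COMPANY_QUERY and OUTPUT_BUCKET are illustrative, not a GCP convention):

import os

# Configuration injected by Cloud Run / GKE at deploy time
company_query = os.environ.get("COMPANY_QUERY", "Tata")
output_bucket = os.environ.get("OUTPUT_BUCKET", "")

print(f"Searching for: {company_query}")
if output_bucket:
    print(f"Results will be uploaded to: {output_bucket}")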

6. Monitoring and Logging

Using GCP Tools

  1. Cloud Logging: Capture logs for scraper activities.

    from google.cloud import logging as gcp_logging
    
    # Initialize GCP logging client
    client = gcp_logging.Client()
    logger = client.logger("mca_scraper")
    
    # Log messages
    logger.log_text("Scraping started")
    logger.log_text("Scraping completed successfully")
    
  2. Cloud Monitoring: Set up alerts for failures or anomalies.


7. Debugging Common Issues

Challenges

  1. CAPTCHA Handling:

    • Use third-party CAPTCHA-solving services.
  2. Timeout Errors:

    • Implement retry mechanisms with exponential backoff.

Code Example: Retry Logic

import requests
import time
from requests.exceptions import RequestException

# Retry function with exponential backoff
def fetch_with_retry(url, retries=3, backoff=2):
    for attempt in range(retries):
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()
            return response
        except RequestException as e:
            print(f"Attempt {attempt + 1} failed: {e}")
            if attempt == retries - 1:
                raise
            # Back off for 2, 4, 8, ... seconds before the next attempt
            time.sleep(backoff ** (attempt + 1))

# Example usage
url = "https://mca.gov.in"
data = fetch_with_retry(url)
print(data.text)

8. Conclusion

Scraping real-world websites like MCA involves overcoming challenges through robust coding practices, cloud deployment, and efficient monitoring. GCP provides an excellent platform for scalable and reliable scraping solutions.


Assignments and Quiz

Assignment:

  1. Implement the scraper to extract detailed information about a specific company from the MCA website.
  2. Deploy the scraper to GCP and configure logging.
  3. Handle CAPTCHA challenges using a third-party service.

Quiz Questions:

  1. What challenges are specific to scraping dynamic websites like MCA?
  2. Why is GCP suitable for deploying large-scale scraping bots?
  3. What tool can be used for monitoring logs in GCP?
  4. How can retry logic improve the reliability of a scraper?
  5. Provide an example of a Docker command for deploying a scraper.

11. Deploying and Monitoring Scraping Bots

Lecture Notes: Deploying and Monitoring Scraping Bots

1. Introduction to Deploying and Monitoring Scraping Bots

Deploying and monitoring scraping bots ensures that web scraping tasks are executed efficiently, consistently, and at scale. This lecture explores deploying bots to cloud platforms, automating workflows with CI/CD pipelines, monitoring activities, and debugging common issues.


2. Deploying Scraping Bots to Cloud Platforms (AWS, GCP, Azure)

Objective

Host and run scraping bots on cloud platforms to ensure scalability, reliability, and global accessibility.

Challenges

  1. Configuring cloud infrastructure.
  2. Ensuring resource efficiency and cost optimization.
  3. Managing security and data compliance.

Solution

Deploy bots using containerization (Docker) and orchestration tools.

Code Example: Deploying to AWS with Docker

# Step 1: Create a Dockerfile
FROM python:3.9

# Install dependencies
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt

# Copy bot code
COPY . .

# Command to run the bot
CMD ["python", "bot.py"]

# Step 2: Build and push Docker image
# Build the Docker image
docker build -t scraping-bot .

# Tag and push the image to AWS Elastic Container Registry (ECR)
aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin <aws_account_id>.dkr.ecr.us-east-1.amazonaws.com

docker tag scraping-bot:latest <aws_account_id>.dkr.ecr.us-east-1.amazonaws.com/scraping-bot:latest

docker push <aws_account_id>.dkr.ecr.us-east-1.amazonaws.com/scraping-bot:latest

# Step 3: Deploy to AWS ECS
# Use AWS CLI or ECS Console to set up a task definition and service to run the Docker container.

3. Automating Scraping with CI/CD Pipelines

Objective

Implement continuous integration and deployment workflows to automate scraping tasks.

Challenges

  1. Managing frequent code updates.
  2. Integrating testing and deployment steps.

Solution

Use tools like GitHub Actions or Jenkins for CI/CD pipelines.

Code Example: GitHub Actions Workflow

name: Deploy Scraping Bot

on:
  push:
    branches:
      - main

jobs:
  build-and-deploy:
    runs-on: ubuntu-latest

    steps:
      - name: Checkout code
        uses: actions/checkout@v2

      - name: Set up Python
        uses: actions/setup-python@v2
        with:
          python-version: '3.9'

      - name: Install dependencies
        run: |
          pip install -r requirements.txt

      - name: Run tests
        run: pytest

      - name: Deploy to AWS ECS
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        run: |
          aws ecs update-service --cluster scraping-cluster --service scraping-service --force-new-deployment

4. Monitoring and Logging Scraping Activities

Objective

Track scraping activities in real-time to ensure reliability and troubleshoot issues.

Challenges

  1. Capturing logs from distributed systems.
  2. Analyzing large volumes of data.

Solution

Use logging frameworks and monitoring dashboards (e.g., ELK Stack, CloudWatch).

Code Example: Using Python Logging

import logging

# Configure logging
logging.basicConfig(
    filename='scraping.log',
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s'
)

# Log activities
logging.info("Scraping started")
try:
    # Simulate scraping task
    raise ValueError("Example error")
except Exception as e:
    logging.error(f"Error occurred: {e}")
logging.info("Scraping completed")
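
For long-running or distributed bots, rotating the log file keeps volume manageable; a minimal sketch using only the standard library (file name and size limits are arbitrary choices):

import logging
from logging.handlers import RotatingFileHandler

# Rotate at roughly 5 MB, keeping the last 3 log files
handler = RotatingFileHandler("scraping.log", maxBytes=5_000_000, backupCount=3)
handler.setFormatter(logging.Formatter('%(asctime)s - %(levelname)s - %(message)s'))

logger = logging.getLogger("scraping_bot")
logger.setLevel(logging.INFO)
logger.addHandler(handler)

logger.info("Scraping started")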

5. Debugging Common Errors and Failures

Objective

Identify and resolve issues such as rate limiting, data structure changes, and network errors.

Challenges

  1. Dynamic website changes.
  2. Unexpected exceptions during scraping.

Solution

Use error handling and retry mechanisms.

Code Example: Retry on Failure

import requests
from requests.exceptions import RequestException

# Retry logic
def fetch_with_retry(url, retries=3):
    for attempt in range(retries):
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()
            return response
        except RequestException as e:
            print(f"Attempt {attempt + 1} failed: {e}")
            if attempt == retries - 1:
                raise

# Example usage
url = "https://example.com"
data = fetch_with_retry(url)
print(data.text)

6. Conclusion

Deploying and monitoring scraping bots involves managing cloud deployments, automating workflows, and ensuring robust monitoring. These practices improve reliability and scalability for web scraping projects.


Assignments and Quiz

Assignment:

  1. Deploy a scraping bot to a cloud platform of your choice (AWS, GCP, or Azure).
  2. Set up a CI/CD pipeline to automate deployment.
  3. Configure logging to capture errors and activities.

Quiz Questions:

  1. Name two cloud platforms suitable for deploying scraping bots.
  2. What tool can you use to automate CI/CD workflows for a scraping bot?
  3. How can you monitor scraping activities in real time?
  4. What is the purpose of retry mechanisms in scraping bots?
  5. Provide an example of a logging framework in Python.

10. Machine Learning in Web Scraping

Lecture Notes: Introduction to Machine Learning in Web Scraping

1. Introduction to Machine Learning in Web Scraping

Machine learning (ML) enhances web scraping by automating complex tasks, improving data accuracy, and enabling advanced analysis. This lecture introduces ML concepts applied to web scraping, focusing on Natural Language Processing (NLP), data classification, and AI-driven scraping tools.


2. Using NLP for Data Extraction (e.g., Named Entity Recognition)

Objective

Extract meaningful entities such as names, dates, locations, and organizations from unstructured web data using NLP techniques like Named Entity Recognition (NER).

Challenges

  1. Unstructured and noisy text data.
  2. Contextual understanding of extracted entities.

Solution

Use Python libraries such as spaCy or nltk to implement NER.

Code Example

import spacy

# Load spaCy's English NLP model
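# Requires a one-time download: python -m spacy download en_core_web_sm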
nlp = spacy.load("en_core_web_sm")

# Sample text
data = "OpenAI was founded in San Francisco in 2015. It specializes in AI research."

# Process the text
doc = nlp(data)

# Extract named entities
print("Entities detected:")
for ent in doc.ents:
    print(f"{ent.text} ({ent.label_})")

Output

Entities detected:
OpenAI (ORG)
San Francisco (GPE)
2015 (DATE)
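
The detected entities can also be persisted for later analysis; a minimal sketch that groups them by label and writes JSON, reusing the doc object from the example above (the output file name is arbitrary):

import json
from collections import defaultdict

# Group entities by label, e.g. {"ORG": ["OpenAI"], "GPE": ["San Francisco"], "DATE": ["2015"]}
entities = defaultdict(list)
for ent in doc.ents:
    entities[ent.label_].append(ent.text)

with open("entities.json", "w") as f:
    json.dump(entities, f, indent=4)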

3. Automating Data Classification and Categorization

Objective

Classify scraped data into predefined categories or labels using supervised ML models.

Challenges

  1. Labeling training data for supervised learning.
  2. Balancing accuracy and computational efficiency.

Solution

Train a model using scikit-learn to classify scraped data into categories.

Code Example

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split

# Sample data
data = [
    "Buy the latest iPhone at a discounted price",
    "New Samsung Galaxy released this month",
    "Breaking news: AI beats humans at chess",
    "Sports update: Local team wins championship"
]
labels = ["ecommerce", "ecommerce", "news", "sports"]

# Split data
X_train, X_test, y_train, y_test = train_test_split(data, labels, test_size=0.25, random_state=42)

# Build a classification pipeline
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(X_train, y_train)

# Test the model
predictions = model.predict(X_test)
print("Predictions:", predictions)

4. Introduction to AI-Driven Scraping Tools

Objective

Leverage AI-powered scraping tools that can dynamically adapt to website changes and bypass anti-scraping mechanisms.

Examples of AI-Driven Tools

  1. Diffbot: Extracts structured data from web pages using AI.
  2. Scrapy with AI Plugins: Combines traditional scraping with ML capabilities.
  3. Apify AI Tools: Provides intelligent automation for complex scraping tasks.

Code Example: Using Diffbot API

import requests

# Diffbot API endpoint and token
API_TOKEN = "your_diffbot_api_token"
url = "https://example.com/article"
api_endpoint = f"https://api.diffbot.com/v3/article?token={API_TOKEN}&url={url}"

# Send request
response = requests.get(api_endpoint)

# Parse response
if response.status_code == 200:
    article_data = response.json()
    print("Title:", article_data['objects'][0]['title'])
    print("Author:", article_data['objects'][0]['author'])
    print("Text:", article_data['objects'][0]['text'])
else:
    print("Error:", response.status_code)

5. Conclusion

Machine learning significantly enhances web scraping capabilities. Techniques like NLP, classification, and AI-driven tools allow for more intelligent and automated data extraction, making them invaluable for large-scale and complex projects.


Assignments and Quiz

Assignment: Implement an NER model to extract names, organizations, and dates from a sample news article. Use spaCy or a similar NLP library. Save the results in JSON format.

Quiz Questions:

  1. What is Named Entity Recognition (NER) used for in web scraping?
  2. Name one library used for implementing NER in Python.
  3. What are the main challenges in automating data classification?
  4. What is the advantage of AI-driven scraping tools?
  5. Provide an example of an AI-driven scraping tool.

9. Advanced Use Cases in Web Scraping

Lecture Notes: Advanced Use Cases in Web Scraping

1. Introduction to Advanced Use Cases in Web Scraping

Web scraping is not limited to simple data extraction. Advanced use cases allow organizations and developers to extract, analyze, and utilize large-scale data for specific applications. This lecture covers key use cases, including scraping e-commerce data, gathering social media insights, aggregating news, and building custom web crawlers for large-scale data collection.


2. Scraping E-Commerce Sites for Product Data

Objective

Extract product information, including names, prices, ratings, and reviews, from e-commerce platforms for price comparison, market research, or inventory monitoring.

Challenges

  1. Dynamic content (JavaScript rendering).
  2. Anti-scraping mechanisms (CAPTCHA, rate limiting).
  3. Data volume and variability.

Solution

Use a combination of Selenium (for dynamic content) and BeautifulSoup (for parsing HTML).

Code Example

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from bs4 import BeautifulSoup
import pandas as pd

# Configure Selenium WebDriver
service = Service('path_to_chromedriver')
driver = webdriver.Chrome(service=service)

# Open e-commerce site
ecommerce_url = "https://example-ecommerce.com"
driver.get(ecommerce_url)

# Parse the page source with BeautifulSoup
soup = BeautifulSoup(driver.page_source, 'html.parser')

# Extract product data
products = []
product_elements = soup.find_all('div', class_='product-item')
for item in product_elements:
    name = item.find('h2', class_='product-name').text.strip()
    price = item.find('span', class_='price').text.strip()
    rating = item.find('span', class_='rating').text.strip()
    products.append({"Name": name, "Price": price, "Rating": rating})

# Save to CSV
driver.quit()
products_df = pd.DataFrame(products)
products_df.to_csv('ecommerce_products.csv', index=False)
print("Product data saved to ecommerce_products.csv")

3. Gathering Social Media Data (Twitter, Instagram, LinkedIn)

Objective

Collect user posts, hashtags, follower counts, and other relevant data for social media analytics.

Challenges

  1. API rate limits and authentication requirements.
  2. Strict platform policies against scraping.
  3. Dynamic and frequently changing DOM structures.

Solution

Use official APIs where available or headless browsers for scraping.

Code Example: Twitter API

import tweepy

# Twitter API credentials
api_key = 'your_api_key'
api_secret_key = 'your_api_secret_key'
access_token = 'your_access_token'
access_token_secret = 'your_access_token_secret'

# Authenticate
auth = tweepy.OAuthHandler(api_key, api_secret_key)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth)

# Search tweets
query = "#WebScraping"
tweets = api.search_tweets(q=query, lang='en', count=100)

# Print tweets
for tweet in tweets:
    print(f"User: {tweet.user.screen_name}, Tweet: {tweet.text}")

4. News Aggregation and Sentiment Analysis

Objective

Scrape news articles, extract key content, and analyze sentiment for trend detection or reporting.

Challenges

  1. Handling different website structures.
  2. Extracting meaningful content from HTML noise.
  3. Large-scale text analysis.

Solution

Combine web scraping with Natural Language Processing (NLP) libraries.

Code Example

from bs4 import BeautifulSoup
import requests
from textblob import TextBlob

# Target news website
news_url = "https://example-news-site.com"
response = requests.get(news_url)
response.raise_for_status()

# Parse with BeautifulSoup
soup = BeautifulSoup(response.text, 'html.parser')
articles = []
article_elements = soup.find_all('div', class_='news-article')

for article in article_elements:
    title = article.find('h2').text.strip()
    summary = article.find('p', class_='summary').text.strip()
    sentiment = TextBlob(summary).sentiment.polarity
    articles.append({"Title": title, "Summary": summary, "Sentiment": sentiment})

# Print results
for article in articles:
    print(f"Title: {article['Title']}, Sentiment: {article['Sentiment']}")

5. Building Custom Web Crawlers for Large-Scale Data Collection

Objective

Crawl websites systematically to collect data across multiple pages and domains.

Challenges

  1. Managing crawling depth and breadth.
  2. Handling duplicate data and avoiding traps.
  3. Efficient storage of large datasets.

Solution

Use Scrapy for building efficient and scalable web crawlers.

Code Example

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["http://quotes.toscrape.com"]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }

        # Follow pagination links
        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)

To run the spider:

scrapy runspider quotes_spider.py -o quotes.json

6. Conclusion

Advanced web scraping use cases open up opportunities for data-driven decision-making in e-commerce, social media analytics, news aggregation, and more. By leveraging tools like Selenium, BeautifulSoup, APIs, and Scrapy, developers can tackle complex scenarios efficiently.


Assignments and Quiz

Assignment: Create a custom crawler using Scrapy to collect data from a website of your choice. Save the data in JSON format.

Quiz Questions:

  1. What are the key challenges in scraping social media platforms?
  2. How does sentiment analysis help in news aggregation?
  3. What is the advantage of using Scrapy for large-scale web crawling?
  4. Which library is commonly used for sentiment analysis in Python?
  5. Name one way to handle dynamic content on e-commerce sites.

8. Web Scraping at Scale

Lecture Notes: Web Scraping at Scale

1. Distributed Scraping Using Tools

1.1 Scrapy

Scrapy is a powerful web scraping framework built around asynchronous requests, which makes it highly efficient for large-scale scraping. With extensions such as scrapyd or scrapy-redis, it can also be run in a distributed setup with relatively little effort.

Key Features

  • Built-in support for asynchronous requests.
  • Middleware for request and response processing.
  • Extensible architecture.

Code Example: Scrapy Distributed Scraping

# Install Scrapy: pip install scrapy

# Example Scrapy spider
import scrapy

class ProductSpider(scrapy.Spider):
    name = "products"
    start_urls = ["https://example.com/products"]

    def parse(self, response):
        for product in response.css(".product"):
            yield {
                'name': product.css(".name::text").get(),
                'price': product.css(".price::text").get(),
            }

# Run inside a Scrapy project: scrapy crawl products
# Or run this file standalone: scrapy runspider <spider_file>.py

1.2 Celery

Celery is a distributed task queue that enables parallel execution of tasks, making it suitable for scaling scraping jobs.

Code Example: Using Celery for Scraping

# Install Celery: pip install celery
from celery import Celery
import requests

app = Celery('tasks', broker='redis://localhost:6379/0')

@app.task
def scrape_page(url):
    response = requests.get(url)
    return response.text

# Schedule scraping tasks
urls = ["https://example.com/page1", "https://example.com/page2"]
for url in urls:
    scrape_page.delay(url)
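
Note: the delayed tasks only run once a Celery worker is active; with the snippet saved as tasks.py and a local Redis broker running, a worker can be started with: celery -A tasks worker --loglevel=info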

2. Managing Concurrency and Parallelism

2.1 Threading

Threading allows multiple threads to run concurrently in a program, making it useful for I/O-bound tasks like web scraping.

Code Example: Using Threading

import threading
import requests

def scrape(url):
    response = requests.get(url)
    print(f"Scraped {url}: {len(response.content)} bytes")

urls = ["https://example.com/page1", "https://example.com/page2"]
threads = []

for url in urls:
    thread = threading.Thread(target=scrape, args=(url,))
    threads.append(thread)
    thread.start()

for thread in threads:
    thread.join()
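
The same pattern can be written more compactly with concurrent.futures from the standard library; a minimal sketch:

from concurrent.futures import ThreadPoolExecutor
import requests

def scrape(url):
    response = requests.get(url)
    return f"Scraped {url}: {len(response.content)} bytes"

urls = ["https://example.com/page1", "https://example.com/page2"]

# map() runs scrape() across a pool of worker threads and yields results in input order
with ThreadPoolExecutor(max_workers=4) as executor:
    for result in executor.map(scrape, urls):
        print(result)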

2.2 Multiprocessing

Multiprocessing uses multiple processes, making it ideal for CPU-bound tasks.

Code Example: Using Multiprocessing

from multiprocessing import Pool
import requests

def scrape(url):
    response = requests.get(url)
    return f"Scraped {url}: {len(response.content)} bytes"

urls = ["https://example.com/page1", "https://example.com/page2"]

# The __main__ guard lets worker processes import this module safely (required on Windows/macOS)
if __name__ == "__main__":
    with Pool(4) as pool:
        results = pool.map(scrape, urls)
        for result in results:
            print(result)

3. Caching Strategies to Reduce Redundant Requests

3.1 Using HTTP Caching

HTTP caching stores responses to reduce redundant requests.

Code Example: Using Requests Cache

# Install requests-cache: pip install requests-cache
import requests
import requests_cache

# Cache GET responses on disk for one hour
requests_cache.install_cache('scraping_cache', expire_after=3600)

response = requests.get("https://example.com/data")
print(f"Cache hit: {response.from_cache}")

3.2 Custom Caching

Implement custom caching for specific data.

Code Example: Custom Caching

import os
import hashlib
import requests

def fetch_with_cache(url, cache_dir="cache"):
    if not os.path.exists(cache_dir):
        os.makedirs(cache_dir)

    cache_file = os.path.join(cache_dir, hashlib.md5(url.encode()).hexdigest())

    if os.path.exists(cache_file):
        with open(cache_file, "r") as f:
            return f.read()

    response = requests.get(url)
    with open(cache_file, "w") as f:
        f.write(response.text)

    return response.text

html = fetch_with_cache("https://example.com/data")
print(html[:100])

4. Monitoring and Maintaining Scraping Jobs

4.1 Logging

Logging helps monitor scraping activity and debug issues.

Code Example: Basic Logging

import logging

logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(message)s')
logger = logging.getLogger()

logger.info("Starting scraping job")
# Simulated scraping job
logger.info("Scraping page 1")
logger.info("Scraping page 2")
logger.info("Job completed")

4.2 Job Scheduling

Use schedulers like cron or Python libraries to automate scraping.

Code Example: Using Schedule Library

# Install schedule: pip install schedule
import schedule
import time


def scrape_job():
    print("Running scraping job...")
    # Simulate scraping
    time.sleep(2)
    print("Job completed")

schedule.every().day.at("10:00").do(scrape_job)

while True:
    schedule.run_pending()
    time.sleep(1)

Quiz

Objective: Test understanding of large-scale web scraping concepts.

Questions

  1. Which tool is best suited for distributed web scraping?

    • a) Scrapy
    • b) BeautifulSoup
    • c) Requests
    • d) Selenium
  2. What is the main advantage of using Celery in web scraping?

    • a) Asynchronous task execution
    • b) Database management
    • c) Rendering JavaScript
    • d) HTML parsing
  3. Which Python library supports HTTP caching for web scraping?

    • a) pandas
    • b) requests-cache
    • c) scrapy
    • d) schedule
  4. What is the primary purpose of logging in scraping jobs?

    • a) Improve scraping speed
    • b) Track and debug scraping activity
    • c) Parse data
    • d) Handle proxies
  5. What is a benefit of using multiprocessing for web scraping?

    • a) Handles I/O-bound tasks efficiently
    • b) Improves performance for CPU-bound tasks
    • c) Reduces memory usage
    • d) Automates scheduling

Answers

  1. a) Scrapy
  2. a) Asynchronous task execution
  3. b) requests-cache
  4. b) Track and debug scraping activity
  5. b) Improves performance for CPU-bound tasks

7. Data Storage and Post-Processing in Web Scraping

Lecture Notes: Data Storage and Post-Processing

1. Storing Scraped Data in Databases

1.1 SQL Databases

SQL databases like MySQL and PostgreSQL are used for structured data storage. They ensure data integrity and support complex queries.

Key Concepts:

  • Tables with predefined schema.
  • SQL queries for data manipulation.

Code Example: Storing Data in MySQL

import mysql.connector

# Database connection
connection = mysql.connector.connect(
    host="localhost",
    user="your_username",
    password="your_password",
    database="scraped_data_db"
)
cursor = connection.cursor()

# Create table
cursor.execute('''CREATE TABLE IF NOT EXISTS products (
    id INT AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(255),
    price DECIMAL(10, 2),
    rating DECIMAL(3, 2)
)''')

# Insert data
data = [
    ("Product A", 19.99, 4.5),
    ("Product B", 29.99, 4.8)
]
cursor.executemany("INSERT INTO products (name, price, rating) VALUES (%s, %s, %s)", data)
connection.commit()

# Close connection
cursor.close()
connection.close()

1.2 NoSQL Databases

NoSQL databases like MongoDB are used for unstructured or semi-structured data. They are flexible and scalable.

Key Concepts:

  • Collection-based structure.
  • JSON-like documents.

Code Example: Storing Data in MongoDB

from pymongo import MongoClient

# Connect to MongoDB
client = MongoClient("mongodb://localhost:27017/")
db = client["scraped_data"]
collection = db["products"]

# Insert data
data = [
    {"name": "Product A", "price": 19.99, "rating": 4.5},
    {"name": "Product B", "price": 29.99, "rating": 4.8}
]
collection.insert_many(data)

# Close connection
client.close()

2. Exporting Data to Common Formats (JSON, CSV, Excel)

2.1 Exporting to JSON

JSON is a lightweight format for data exchange.

Code Example: Writing Data to JSON

import json

# Sample data
data = [
    {"name": "Product A", "price": 19.99, "rating": 4.5},
    {"name": "Product B", "price": 29.99, "rating": 4.8}
]

# Write to JSON file
with open("products.json", "w") as json_file:
    json.dump(data, json_file, indent=4)

2.2 Exporting to CSV

CSV files are commonly used for tabular data.

Code Example: Writing Data to CSV

import csv

# Sample data
data = [
    {"name": "Product A", "price": 19.99, "rating": 4.5},
    {"name": "Product B", "price": 29.99, "rating": 4.8}
]

# Write to CSV
with open("products.csv", "w", newline="", encoding="utf-8") as csv_file:
    writer = csv.DictWriter(csv_file, fieldnames=["name", "price", "rating"])
    writer.writeheader()
    writer.writerows(data)

2.3 Exporting to Excel

Excel is used for data analysis and visualization.

Code Example: Writing Data to Excel

import pandas as pd
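# Note: DataFrame.to_excel requires an Excel engine such as openpyxl (pip install openpyxl)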

# Sample data
data = [
    {"name": "Product A", "price": 19.99, "rating": 4.5},
    {"name": "Product B", "price": 29.99, "rating": 4.8}
]

# Write to Excel
df = pd.DataFrame(data)
df.to_excel("products.xlsx", index=False)

3. Cleaning and Transforming Data with Pandas

3.1 Cleaning Data

Cleaning involves removing duplicates, handling missing values, and correcting formats.

Code Example: Cleaning Data

import pandas as pd

# Sample data
data = {
    "name": ["Product A", "Product B", None, "Product D"],
    "price": [19.99, 29.99, None, 39.99],
    "rating": [4.5, 4.8, 4.0, None]
}

# Create DataFrame
df = pd.DataFrame(data)

# Fill missing ratings with a default value first
df["rating"] = df["rating"].fillna(3.0)

# Then drop rows that still have missing values (name or price)
df = df.dropna()

print(df)

3.2 Transforming Data

Transforming data involves reshaping, aggregating, and applying operations.

Code Example: Transforming Data

# Add a new column for discounted price
df["discounted_price"] = df["price"] * 0.9

# Group by and calculate mean rating
mean_rating = df.groupby("name")["rating"].mean()

print(df)
print(mean_rating)

4. Integration with ETL Pipelines

ETL (Extract, Transform, Load) pipelines automate data workflows, including scraping, cleaning, and storing data.

4.1 Building an ETL Pipeline

An ETL pipeline integrates data extraction, processing, and loading steps.

Code Example: Simple ETL Pipeline

import pandas as pd
import requests

# Extract: Fetch data
url = "https://example-ecommerce-site.com/products"
response = requests.get(url)
data = response.json()

# Transform: Clean and process data
df = pd.DataFrame(data["products"])
df["price"] = df["price"].astype(float)

# Load: Store in CSV
df.to_csv("products_etl.csv", index=False)
print("ETL process completed. Data saved to products_etl.csv.")

Quiz

Objective: Test understanding of data storage and post-processing.

Questions

  1. Which SQL command is used to insert data into a table?

    • a) INSERT INTO
    • b) CREATE TABLE
    • c) SELECT
    • d) UPDATE
  2. What is the primary advantage of using NoSQL databases?

    • a) Predefined schema
    • b) Flexibility with unstructured data
    • c) Complex SQL queries
    • d) Low latency
  3. Which Python library is commonly used for exporting data to Excel?

    • a) csv
    • b) json
    • c) pandas
    • d) openpyxl
  4. What does the dropna() method in Pandas do?

    • a) Drops duplicate rows
    • b) Fills missing values
    • c) Drops rows with missing values
    • d) Normalizes data
  5. What does ETL stand for?

    • a) Extract, Transfer, Load
    • b) Extract, Transform, Load
    • c) Evaluate, Transform, Load
    • d) Extract, Transform, Link

Answers

  1. a) INSERT INTO
  2. b) Flexibility with unstructured data
  3. c) pandas
  4. c) Drops rows with missing values
  5. b) Extract, Transform, Load

6. Working with APIs

Lecture Notes: Working with Web Scraping APIs

1. Introduction to Web Scraping APIs

What are Web Scraping APIs?

Web scraping APIs are tools that simplify the process of extracting data from websites. They handle various challenges like dynamic content, anti-scraping mechanisms, and large-scale data extraction.

Why Use Web Scraping APIs?

  1. Ease of Use: Simplifies handling complex scraping scenarios.
  2. Efficiency: Reduces development time and resources.
  3. Anti-Scraping Measures: Built-in mechanisms for bypassing blocks.
  4. Scalability: Handles large volumes of requests effectively.

Examples of Popular Web Scraping APIs

  1. ScraperAPI
  2. Bright Data (formerly Luminati)
  3. Scrapy Cloud
  4. Apify
  5. Octoparse API

2. Features of Web Scraping APIs

1. Proxy Management

Automatically rotates proxies and provides residential, data center, or mobile IPs.

2. Headless Browser Support

Renders JavaScript-heavy pages using headless browsers like Puppeteer or Selenium.

3. CAPTCHA Solving

Integrates CAPTCHA-solving services to bypass human verification challenges.

4. Data Formatting

Delivers data in structured formats like JSON, CSV, or XML.

5. Rate Limiting

Manages request limits to avoid IP bans.


3. Using Web Scraping APIs

1. Understanding API Endpoints

APIs provide specific endpoints for data extraction tasks. For example:

  • /scrape: To extract data from a URL.
  • /status: To check API usage and limits.

2. Authentication

Most APIs require API keys or tokens for access. These credentials ensure secure and authorized use.

3. Sending Requests

Use Python libraries like requests or httpx to interact with APIs.

4. Parsing API Responses

Extract and process structured data from JSON or other response formats.


4. Sample Code: Using ScraperAPI

Scenario: Extract product data from an e-commerce site.

import requests

# API key
API_KEY = "your_scraperapi_key"

# Target URL
url = "https://example-ecommerce-site.com/products"

# API endpoint
api_endpoint = f"http://api.scraperapi.com?api_key={API_KEY}&url={url}"

# Send request
response = requests.get(api_endpoint)

# Check response
if response.status_code == 200:
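    # Note: assumes the target URL returns JSON; for an HTML page, parse response.text (e.g., with BeautifulSoup) instead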
    data = response.json()
    for product in data.get("products", []):
        print(f"Name: {product['name']}, Price: {product['price']}")
else:
    print(f"Error: {response.status_code}")

5. Assignment

Objective

Use a Web Scraping API to extract weather data from a weather forecasting website.

Requirements

  1. Authenticate using an API key.
  2. Retrieve data for a specific city.
  3. Parse and display the temperature, humidity, and weather conditions.

Solution

import requests

# API key and endpoint
API_KEY = "your_weatherapi_key"
city = "New York"
api_endpoint = f"https://api.weatherapi.com/v1/current.json?key={API_KEY}&q={city}"

# Send request
response = requests.get(api_endpoint)

# Check response
if response.status_code == 200:
    weather_data = response.json()
    location = weather_data["location"]["name"]
    temp_c = weather_data["current"]["temp_c"]
    humidity = weather_data["current"]["humidity"]
    condition = weather_data["current"]["condition"]["text"]

    print(f"City: {location}\nTemperature: {temp_c} °C\nHumidity: {humidity}%\nCondition: {condition}")
else:
    print(f"Failed to fetch weather data. HTTP Status Code: {response.status_code}")

6. Quiz

Objective: Test understanding of Web Scraping APIs.

Questions

  1. Which of the following is NOT a feature of most web scraping APIs?

    • a) CAPTCHA solving
    • b) Proxy management
    • c) Image processing
    • d) Rate limiting
  2. What is the purpose of an API key in web scraping?

    • a) To bypass CAPTCHA
    • b) To identify and authenticate users
    • c) To manage proxies
    • d) To scrape data directly
  3. Which Python library is commonly used to send API requests?

    • a) NumPy
    • b) BeautifulSoup
    • c) Requests
    • d) Pandas
  4. What type of response format do most APIs return?

    • a) JSON
    • b) HTML
    • c) Plain text
    • d) CSV
  5. What is the advantage of using web scraping APIs?

    • a) Simplifies handling of dynamic content
    • b) Increases manual effort
    • c) Eliminates the need for coding
    • d) None of the above

Answers

  1. c) Image processing
  2. b) To identify and authenticate users
  3. c) Requests
  4. a) JSON
  5. a) Simplifies handling of dynamic content
