Monday, 13 January 2025

12. Real-World Scraping Project

Lecture Notes: Real-World Scraping Project on https://mca.gov.in

1. Introduction

Scraping real-world websites like MCA (Ministry of Corporate Affairs) involves dealing with various challenges such as dynamic content, anti-scraping mechanisms, and complex data structures. This lecture demonstrates a step-by-step approach to scraping this site using Python, focusing on deployment and execution on Google Cloud Platform (GCP).


2. Challenges in Scraping MCA Website

Key Challenges

  1. Dynamic Content: The site uses JavaScript to load data, requiring a browser automation tool like Selenium.
  2. Anti-Scraping Mechanisms: CAPTCHA, rate limiting, and bot detection.
  3. Complex Data Structures: Nested tables and pagination for structured data.
  4. Legal and Ethical Considerations: Adhering to the site's terms of service and responsible scraping practices (a minimal politeness sketch follows this list).
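
The last two points deserve code as much as prose. Below is a minimal politeness sketch using only the Python standard library: it checks robots.txt before fetching and spaces requests out. The robots.txt path and the delay are illustrative assumptions, not values confirmed for the MCA site.

import time
import urllib.robotparser

# Consult robots.txt before crawling (the URL below is illustrative)
rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://mca.gov.in/robots.txt")
rp.read()

url = "https://mca.gov.in/mcafoportal/viewCompanyMasterData.do"
if rp.can_fetch("*", url):
    # Space requests out so the site is not hammered; 2 s is an assumed delay
    time.sleep(2)
    print("Allowed to fetch:", url)
else:
    print("robots.txt disallows fetching:", url)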

3. Setting Up GCP for the Project

Objective

Deploy the scraping bot on GCP to handle large-scale scraping tasks efficiently.

Steps

  1. Create a GCP Project:

    • Go to the GCP Console.
    • Create a new project and enable billing.
  2. Set Up a VM Instance:

    • Navigate to Compute Engine → VM Instances → Create Instance.
    • Select an appropriate machine type (e.g., e2-medium).
    • Install Docker or Python on the VM.
  3. Enable Required APIs:

    • Enable APIs like Cloud Logging and Cloud Storage.
  4. Install Required Libraries:

    • Use SSH to connect to the VM.
    • Install libraries such as Selenium and BeautifulSoup (the gcloud CLI ships preinstalled on GCP's default VM images); a scripted version of these steps follows below.
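
For reference, the console steps above can also be scripted end to end with the gcloud CLI. This is a hedged sketch: the instance name, zone, image, and package list are assumptions to adapt, not fixed requirements.

# Create the VM (name, zone, and image are placeholder choices)
gcloud compute instances create mca-scraper-vm \
    --machine-type=e2-medium \
    --zone=us-central1-a \
    --image-family=debian-12 \
    --image-project=debian-cloud

# Enable the APIs used later in this lecture
gcloud services enable logging.googleapis.com storage.googleapis.com

# SSH in, then install the Python dependencies on the VM
gcloud compute ssh mca-scraper-vm --zone=us-central1-a
pip install selenium beautifulsoup4 google-cloud-logging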

4. Code for Scraping MCA Website

Objective

Extract company details such as name, CIN (Corporate Identity Number), and registration date from MCA's search page.

Code Example

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Configure Chrome for headless execution on a GCP VM
options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')

# Selenium 4.6+ resolves a matching ChromeDriver automatically (Selenium Manager)
driver = webdriver.Chrome(options=options)

try:
    # Open MCA search page
    driver.get("https://mca.gov.in/mcafoportal/viewCompanyMasterData.do")

    # Wait for the search field to appear instead of sleeping a fixed time
    wait = WebDriverWait(driver, 15)
    search_input = wait.until(EC.presence_of_element_located((By.ID, 'companyName')))

    # Enter the query (e.g., a company name) and submit
    search_input.send_keys("Tata")
    search_input.send_keys(Keys.RETURN)

    # Wait for at least one result row to load
    wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, 'table tbody tr')))

    # Extract data from the results table
    rows = driver.find_elements(By.CSS_SELECTOR, 'table tbody tr')
    for row in rows:
        columns = row.find_elements(By.TAG_NAME, 'td')
        if len(columns) >= 3:  # guard against header or malformed rows
            print("Name:", columns[0].text)
            print("CIN:", columns[1].text)
            print("Registration Date:", columns[2].text)
            print("---")
finally:
    driver.quit()
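
Printing to stdout is fine for a demo, but a scraper running unattended on GCP should persist its results. The sketch below writes the extracted rows to Cloud Storage with the google-cloud-storage client; the bucket and object names are placeholders you would create and choose yourself.

import csv
import io

from google.cloud import storage

def upload_rows_to_gcs(rows, bucket_name="mca-scraper-results"):
    # Serialize the scraped rows to CSV in memory
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["name", "cin", "registration_date"])
    writer.writerows(rows)

    # Upload to a pre-created bucket (the name here is a placeholder)
    client = storage.Client()
    bucket = client.bucket(bucket_name)
    blob = bucket.blob("results/companies.csv")
    blob.upload_from_string(buffer.getvalue(), content_type="text/csv")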

5. Deploying the Scraper to GCP

Steps

  1. Create a Dockerfile:

    FROM python:3.9
    
    # Install Chromium and a matching driver so Selenium can run headless
    RUN apt-get update && apt-get install -y --no-install-recommends \
            chromium chromium-driver \
        && rm -rf /var/lib/apt/lists/*
    
    # Install Python dependencies (requirements.txt is assumed to list
    # selenium, beautifulsoup4, and google-cloud-logging)
    WORKDIR /app
    COPY requirements.txt .
    RUN pip install --no-cache-dir -r requirements.txt
    
    # Copy scraper code
    COPY . .
    
    # Command to run the scraper
    CMD ["python", "scraper.py"]
    
  2. Build and Push Docker Image:

    docker build -t mca-scraper .
    docker tag mca-scraper gcr.io/<project-id>/mca-scraper
    docker push gcr.io/<project-id>/mca-scraper
    
  3. Deploy on GCP:

    • Use Cloud Run or Kubernetes Engine to deploy the container.
    • Configure environment variables for dynamic inputs (a deployment sketch follows this list).
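
As a hedged sketch of the deployment step, a batch scraper fits Cloud Run Jobs better than a request-serving Cloud Run service; the job name, region, and variable name below are assumptions.

gcloud run jobs create mca-scraper-job \
    --image gcr.io/<project-id>/mca-scraper \
    --region us-central1 \
    --set-env-vars SEARCH_TERM=Tata

gcloud run jobs execute mca-scraper-job --region us-central1

Inside the container, the scraper reads the variable with os.environ.get("SEARCH_TERM", "Tata"), so the search term can change without rebuilding the image.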

6. Monitoring and Logging

Using GCP Tools

  1. Cloud Logging: Capture logs for scraper activities.

    from google.cloud import logging as gcp_logging
    
    # Initialize the Cloud Logging client and a named logger
    client = gcp_logging.Client()
    logger = client.logger("mca_scraper")
    
    # Log messages; they appear in the Logs Explorer under the "mca_scraper" log
    logger.log_text("Scraping started")
    logger.log_text("Scraping completed successfully")
    
    # Alternatively, client.setup_logging() routes Python's standard logging
    # module to Cloud Logging, so existing logging.info() calls are captured
    
  2. Cloud Monitoring: Set up alerts for failures or anomalies.


7. Debugging Common Issues

Challenges

  1. CAPTCHA Handling:

    • Use third-party CAPTCHA-solving services, or at least detect the challenge and fail loudly (a detection sketch follows this list).
  2. Timeout Errors:

    • Implement retry mechanisms with exponential backoff.
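
Code Example: CAPTCHA Detection

Before wiring in a paid solver, the scraper should at least notice when a CAPTCHA appears so it fails loudly instead of silently parsing a challenge page. This is a heuristic sketch; the marker string is an assumption about MCA's markup, not a confirmed selector.

def captcha_present(driver):
    # Heuristic: look for a CAPTCHA marker in the rendered page source.
    # "captcha" is an assumed marker; inspect the real challenge page to refine it.
    return "captcha" in driver.page_source.lower()

# Usage inside the section-4 script, right after driver.get(...):
# if captcha_present(driver):
#     driver.save_screenshot("captcha.png")  # keep evidence for debugging
#     raise RuntimeError("CAPTCHA detected; solve manually or via a solving service")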

Code Example: Retry Logic

import time

import requests
from requests.exceptions import RequestException

# Retry function with exponential backoff between attempts
def fetch_with_retry(url, retries=3):
    for attempt in range(retries):
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()
            return response
        except RequestException as e:
            print(f"Attempt {attempt + 1} failed: {e}")
            if attempt == retries - 1:
                raise
            time.sleep(2 ** attempt)  # back off: 1 s, 2 s, 4 s, ...

# Example usage
url = "https://mca.gov.in"
data = fetch_with_retry(url)
print(data.text)
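
The time.sleep(2 ** attempt) line makes the wait double after each failure (1 s, then 2 s, then 4 s), which gives a rate-limited or briefly overloaded server room to recover instead of being hit again immediately.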

8. Conclusion

Scraping real-world websites like MCA involves overcoming challenges through robust coding practices, cloud deployment, and efficient monitoring. GCP provides an excellent platform for scalable and reliable scraping solutions.


Assignments and Quiz

Assignment:

  1. Implement the scraper to extract detailed information about a specific company from the MCA website.
  2. Deploy the scraper to GCP and configure logging.
  3. Handle CAPTCHA challenges using a third-party service.

Quiz Questions:

  1. What challenges are specific to scraping dynamic websites like MCA?
  2. Why is GCP suitable for deploying large-scale scraping bots?
  3. What tool can be used for monitoring logs in GCP?
  4. How can retry logic improve the reliability of a scraper?
  5. Provide an example of a Docker command for deploying a scraper.
