Monday, 13 January 2025

12. Real-World Scraping Project

Lecture Notes: Real-World Scraping Project on https://mca.gov.in

1. Introduction

Scraping real-world websites like MCA (Ministry of Corporate Affairs) involves dealing with various challenges such as dynamic content, anti-scraping mechanisms, and complex data structures. This lecture demonstrates a step-by-step approach to scraping this site using Python, focusing on deployment and execution on Google Cloud Platform (GCP).


2. Challenges in Scraping MCA Website

Key Challenges

  1. Dynamic Content: The site uses JavaScript to load data, requiring a browser automation tool like Selenium.
  2. Anti-Scraping Mechanisms: CAPTCHA, rate limiting, and bot detection.
  3. Complex Data Structures: Nested tables and pagination for structured data.
  4. Legal and Ethical Considerations: Adhering to the site's terms of service and responsible scraping practices (a minimal politeness sketch follows this list).
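
The last two points deserve code as much as prose. Below is a minimal politeness sketch using only the Python standard library: it checks robots.txt before fetching and spaces requests out. The robots.txt path and the delay are illustrative assumptions, not values confirmed for the MCA site.

import time
import urllib.robotparser

# Consult robots.txt before crawling (the URL below is illustrative)
rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://mca.gov.in/robots.txt")
rp.read()

url = "https://mca.gov.in/mcafoportal/viewCompanyMasterData.do"
if rp.can_fetch("*", url):
    # Space requests out so the site is not hammered; 2 s is an assumed delay
    time.sleep(2)
    print("Allowed to fetch:", url)
else:
    print("robots.txt disallows fetching:", url)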

3. Setting Up GCP for the Project

Objective

Deploy the scraping bot on GCP to handle large-scale scraping tasks efficiently.

Steps

  1. Create a GCP Project:

    • Go to the GCP Console.
    • Create a new project and enable billing.
  2. Set Up a VM Instance:

    • Navigate to Compute Engine → VM Instances → Create Instance.
    • Select an appropriate machine type (e.g., e2-medium).
    • Install Docker or Python on the VM.
  3. Enable Required APIs:

    • Enable APIs like Cloud Logging and Cloud Storage.
  4. Install Required Libraries:

    • Use SSH to connect to the VM.
    • Install libraries such as Selenium and BeautifulSoup (the gcloud CLI ships preinstalled on GCP's default VM images); a scripted version of these steps follows below.
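
For reference, the console steps above can also be scripted end to end with the gcloud CLI. This is a hedged sketch: the instance name, zone, image, and package list are assumptions to adapt, not fixed requirements.

# Create the VM (name, zone, and image are placeholder choices)
gcloud compute instances create mca-scraper-vm \
    --machine-type=e2-medium \
    --zone=us-central1-a \
    --image-family=debian-12 \
    --image-project=debian-cloud

# Enable the APIs used later in this lecture
gcloud services enable logging.googleapis.com storage.googleapis.com

# SSH in, then install the Python dependencies on the VM
gcloud compute ssh mca-scraper-vm --zone=us-central1-a
pip install selenium beautifulsoup4 google-cloud-logging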

4. Code for Scraping MCA Website

Objective

Extract company details such as name, CIN (Corporate Identity Number), and registration date from MCA's search page.

Code Example

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Configure Chrome for headless execution on a GCP VM
options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')

# Selenium 4.6+ resolves a matching ChromeDriver automatically (Selenium Manager)
driver = webdriver.Chrome(options=options)

try:
    # Open MCA search page
    driver.get("https://mca.gov.in/mcafoportal/viewCompanyMasterData.do")

    # Wait for the search field to appear instead of sleeping a fixed time
    wait = WebDriverWait(driver, 15)
    search_input = wait.until(EC.presence_of_element_located((By.ID, 'companyName')))

    # Enter the query (e.g., a company name) and submit
    search_input.send_keys("Tata")
    search_input.send_keys(Keys.RETURN)

    # Wait for at least one result row to load
    wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, 'table tbody tr')))

    # Extract data from the results table
    rows = driver.find_elements(By.CSS_SELECTOR, 'table tbody tr')
    for row in rows:
        columns = row.find_elements(By.TAG_NAME, 'td')
        if len(columns) >= 3:  # guard against header or malformed rows
            print("Name:", columns[0].text)
            print("CIN:", columns[1].text)
            print("Registration Date:", columns[2].text)
            print("---")
finally:
    driver.quit()
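
Printing to stdout is fine for a demo, but a scraper running unattended on GCP should persist its results. The sketch below writes the extracted rows to Cloud Storage with the google-cloud-storage client; the bucket and object names are placeholders you would create and choose yourself.

import csv
import io

from google.cloud import storage

def upload_rows_to_gcs(rows, bucket_name="mca-scraper-results"):
    # Serialize the scraped rows to CSV in memory
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["name", "cin", "registration_date"])
    writer.writerows(rows)

    # Upload to a pre-created bucket (the name here is a placeholder)
    client = storage.Client()
    bucket = client.bucket(bucket_name)
    blob = bucket.blob("results/companies.csv")
    blob.upload_from_string(buffer.getvalue(), content_type="text/csv")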

5. Deploying the Scraper to GCP

Steps

  1. Create a Dockerfile:

    FROM python:3.9
    
    # Install Chromium and a matching driver so Selenium can run headless
    RUN apt-get update && apt-get install -y --no-install-recommends \
            chromium chromium-driver \
        && rm -rf /var/lib/apt/lists/*
    
    # Install Python dependencies (requirements.txt is assumed to list
    # selenium, beautifulsoup4, and google-cloud-logging)
    WORKDIR /app
    COPY requirements.txt .
    RUN pip install --no-cache-dir -r requirements.txt
    
    # Copy scraper code
    COPY . .
    
    # Command to run the scraper
    CMD ["python", "scraper.py"]
    
  2. Build and Push Docker Image:

    docker build -t mca-scraper .
    docker tag mca-scraper gcr.io/<project-id>/mca-scraper
    docker push gcr.io/<project-id>/mca-scraper
    
  3. Deploy on GCP:

    • Use Cloud Run or Kubernetes Engine to deploy the container.
    • Configure environment variables for dynamic inputs (a deployment sketch follows this list).
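
As a hedged sketch of the deployment step, a batch scraper fits Cloud Run Jobs better than a request-serving Cloud Run service; the job name, region, and variable name below are assumptions.

gcloud run jobs create mca-scraper-job \
    --image gcr.io/<project-id>/mca-scraper \
    --region us-central1 \
    --set-env-vars SEARCH_TERM=Tata

gcloud run jobs execute mca-scraper-job --region us-central1

Inside the container, the scraper reads the variable with os.environ.get("SEARCH_TERM", "Tata"), so the search term can change without rebuilding the image.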

6. Monitoring and Logging

Using GCP Tools

  1. Cloud Logging: Capture logs for scraper activities.

    from google.cloud import logging as gcp_logging
    
    # Initialize the Cloud Logging client and a named logger
    client = gcp_logging.Client()
    logger = client.logger("mca_scraper")
    
    # Log messages; they appear in the Logs Explorer under the "mca_scraper" log
    logger.log_text("Scraping started")
    logger.log_text("Scraping completed successfully")
    
    # Alternatively, client.setup_logging() routes Python's standard logging
    # module to Cloud Logging, so existing logging.info() calls are captured
    
  2. Cloud Monitoring: Set up alerts for failures or anomalies.


7. Debugging Common Issues

Challenges

  1. CAPTCHA Handling:

    • Use third-party CAPTCHA-solving services, or at least detect the challenge and fail loudly (a detection sketch follows this list).
  2. Timeout Errors:

    • Implement retry mechanisms with exponential backoff.
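
Code Example: CAPTCHA Detection

Before wiring in a paid solver, the scraper should at least notice when a CAPTCHA appears so it fails loudly instead of silently parsing a challenge page. This is a heuristic sketch; the marker string is an assumption about MCA's markup, not a confirmed selector.

def captcha_present(driver):
    # Heuristic: look for a CAPTCHA marker in the rendered page source.
    # "captcha" is an assumed marker; inspect the real challenge page to refine it.
    return "captcha" in driver.page_source.lower()

# Usage inside the section-4 script, right after driver.get(...):
# if captcha_present(driver):
#     driver.save_screenshot("captcha.png")  # keep evidence for debugging
#     raise RuntimeError("CAPTCHA detected; solve manually or via a solving service")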

Code Example: Retry Logic

import time

import requests
from requests.exceptions import RequestException

# Retry function with exponential backoff between attempts
def fetch_with_retry(url, retries=3):
    for attempt in range(retries):
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()
            return response
        except RequestException as e:
            print(f"Attempt {attempt + 1} failed: {e}")
            if attempt == retries - 1:
                raise
            time.sleep(2 ** attempt)  # back off: 1 s, 2 s, 4 s, ...

# Example usage
url = "https://mca.gov.in"
data = fetch_with_retry(url)
print(data.text)
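
The time.sleep(2 ** attempt) line makes the wait double after each failure (1 s, then 2 s, then 4 s), which gives a rate-limited or briefly overloaded server room to recover instead of being hit again immediately.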

8. Conclusion

Scraping real-world websites like MCA involves overcoming challenges through robust coding practices, cloud deployment, and efficient monitoring. GCP provides an excellent platform for scalable and reliable scraping solutions.


Assignments and Quiz

Assignment:

  1. Implement the scraper to extract detailed information about a specific company from the MCA website.
  2. Deploy the scraper to GCP and configure logging.
  3. Handle CAPTCHA challenges using a third-party service.

Quiz Questions:

  1. What challenges are specific to scraping dynamic websites like MCA?
  2. Why is GCP suitable for deploying large-scale scraping bots?
  3. What tool can be used for monitoring logs in GCP?
  4. How can retry logic improve the reliability of a scraper?
  5. Provide an example of a Docker command for deploying a scraper.
