Monday, 13 January 2025

3. Introduction to Dynamic Website Scraping

Lecture Notes: Dynamic Website Scraping

1. Introduction to Dynamic Website Scraping

Definition

Dynamic websites generate content on the client side using JavaScript. Unlike static websites, where the content is embedded directly in the initial HTML, dynamic content only appears in the DOM after the page's JavaScript has executed, so a scraper must render the page (or find the underlying data source) to access it.

Challenges in Scraping Dynamic Websites

  • JavaScript-rendered content: Content isn't available in the initial HTML response.
  • Infinite scrolling: Requires loading additional data dynamically.
  • Anti-scraping mechanisms: Websites may block automated access.

2. Tools for Dynamic Website Scraping

1. Selenium

A popular browser-automation framework with Python bindings. It drives a real browser, so JavaScript executes and the rendered page can be scraped.

  • Advantages: Handles complex interactions, dynamic content.
  • Limitations: Slower compared to HTTP-based scraping.

2. Playwright or Puppeteer

Modern browser-automation tools similar to Selenium: Playwright offers Python (and other) bindings, while Puppeteer targets Node.js. Both are generally faster than Selenium and automatically wait for elements to appear.
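For comparison with the Selenium sample later in these notes, here is a minimal sketch of the same product-scraping flow using Playwright's sync API. The URL and the CSS selectors (`.product-item`) are placeholders, not a real site; the import is done lazily inside the function so the helper below works even where Playwright is not installed.

```python
def clean_text(raw):
    """Normalize whitespace in scraped text."""
    return " ".join(raw.split())

def scrape_products(url):
    # Lazy import: the module stays usable without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url)
        # Playwright waits for the selector to appear -- no fixed sleep needed.
        page.wait_for_selector(".product-item")
        items = page.query_selector_all(".product-item")
        results = [clean_text(item.inner_text()) for item in items]
        browser.close()
        return results
```

Note how the explicit `wait_for_selector` call replaces the fixed `time.sleep()` delays often seen in Selenium scripts.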

3. Network Request Monitoring

  • Use browser developer tools to inspect network requests and directly scrape data from APIs.

3. Techniques for Scraping Dynamic Websites

1. Using Selenium for JavaScript Rendering

Selenium can automate a browser to load JavaScript content, interact with elements, and extract data.

2. Handling Infinite Scrolling

Simulate scrolling actions to load additional content dynamically.

3. Extracting Data from APIs

Reverse-engineer network requests to identify and call APIs directly.
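Once a JSON endpoint has been spotted in the browser's Network tab, it can often be called directly, skipping browser rendering entirely. A sketch using only the standard library follows; the endpoint URL and the response fields (`items`, `name`, `price`) are assumptions for illustration, not a real API.

```python
import json
from urllib.request import Request, urlopen

# Hypothetical JSON endpoint discovered via browser developer tools.
API_URL = "https://example-dynamic-site.com/api/products?page=1"

def parse_products(payload):
    """Flatten an API response dict into (name, price) tuples."""
    return [(item["name"], item["price"]) for item in payload.get("items", [])]

def fetch_products(url=API_URL):
    # A browser-like User-Agent header avoids trivial bot filtering.
    req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(req) as resp:
        return parse_products(json.load(resp))

if __name__ == "__main__":
    # Offline demonstration of the parsing step on a sample payload.
    sample = {"items": [{"name": "Widget", "price": "9.99"}]}
    print(parse_products(sample))
```

Keeping the parsing in a separate pure function (`parse_products`) makes it easy to test against saved sample responses without hitting the network.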

4. Bypassing Anti-Scraping Measures

  • Rotate proxies and user-agents.
  • Add delays between requests.
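The two measures above can be sketched with the standard library alone: cycle through a pool of User-Agent strings and sleep a random interval between requests. The User-Agent strings here are abbreviated examples; a proxy pool would plug into the same pattern with a second `itertools.cycle`.

```python
import itertools
import random
import time

# Example pool of User-Agent strings (abbreviated for illustration).
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
    "Mozilla/5.0 (X11; Linux x86_64)",
]
ua_cycle = itertools.cycle(USER_AGENTS)

def next_headers():
    """Headers for the next request, with a rotated User-Agent."""
    return {"User-Agent": next(ua_cycle)}

def polite_delay(min_s=1.0, max_s=3.0):
    """Sleep a random interval between requests to mimic human pacing."""
    time.sleep(random.uniform(min_s, max_s))
```

In a scraping loop you would call `next_headers()` before each request and `polite_delay()` after it, so consecutive requests neither share a fingerprint nor arrive at machine-regular intervals.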

Sample Code: Scraping Dynamic Content Using Selenium

Scenario: Scrape product data from a JavaScript-rendered e-commerce site.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Setup Selenium WebDriver
service = Service('path/to/chromedriver')  # Update with the correct path
options = webdriver.ChromeOptions()
options.add_argument('--headless')  # Run in headless mode for efficiency
options.add_argument('--disable-gpu')

# Start the browser
driver = webdriver.Chrome(service=service, options=options)

try:
    # Open the target URL
    driver.get('https://example-dynamic-site.com/products')

    # Wait (up to 10 seconds) until JavaScript has rendered the product list
    WebDriverWait(driver, 10).until(
        EC.presence_of_all_elements_located((By.CLASS_NAME, 'product-item'))
    )

    # Find product elements
    products = driver.find_elements(By.CLASS_NAME, 'product-item')

    # Extract product details
    for product in products:
        name = product.find_element(By.CLASS_NAME, 'product-name').text
        price = product.find_element(By.CLASS_NAME, 'product-price').text
        print(f"Product: {name}, Price: {price}")

finally:
    driver.quit()

Assignment

Objective

Scrape a dynamic website with infinite scrolling to extract data.

Requirements

  1. Scrape article titles from a news website with infinite scrolling.
  2. Use Selenium to simulate scrolling.
  3. Save the data to a CSV file.

Solution

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
import time
import csv

# Setup Selenium WebDriver
service = Service('path/to/chromedriver')  # Update with the correct path
options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--disable-gpu')

# Start the browser
driver = webdriver.Chrome(service=service, options=options)

try:
    # Open the target website
    driver.get('https://example-news-site.com')

    # Infinite scrolling logic
    scroll_pause = 2
    last_height = driver.execute_script("return document.body.scrollHeight")

    titles = []

    while True:
        # Scroll down to the bottom of the page
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(scroll_pause)

        # Extract article titles
        articles = driver.find_elements(By.CLASS_NAME, 'article-title')
        for article in articles:
            title = article.text
            if title not in titles:
                titles.append(title)

        # Check if we've reached the end of the page
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break
        last_height = new_height

    # Save to CSV
    with open('articles.csv', 'w', newline='', encoding='utf-8') as file:
        writer = csv.writer(file)
        writer.writerow(['Article Title'])
        for title in titles:
            writer.writerow([title])

    print("Data saved to articles.csv")

finally:
    driver.quit()

Quiz

Objective: Assess understanding of dynamic website scraping.

Questions

  1. What makes dynamic websites different from static websites?

    • a) They have static content.
    • b) Content is rendered server-side.
    • c) Content is rendered client-side using JavaScript.
    • d) They do not use HTML.
  2. Which library is best suited for rendering JavaScript in Python?

    • a) BeautifulSoup
    • b) Requests
    • c) Selenium
    • d) NumPy
  3. What method can be used in Selenium to simulate scrolling?

    • a) driver.render_page()
    • b) driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    • c) driver.scroll_page()
    • d) driver.load_all_content()
  4. How can infinite scrolling be handled in web scraping?

    • a) Using a larger user-agent
    • b) By loading all pages at once
    • c) By simulating scroll actions until no new content loads
    • d) By avoiding JavaScript altogether
  5. What is a key advantage of extracting data directly from APIs compared to scraping rendered content?

    • a) It’s slower.
    • b) It’s harder to understand.
    • c) It provides structured data more efficiently.
    • d) It requires more computing resources.

Answers

  1. c) Content is rendered client-side using JavaScript
  2. c) Selenium
  3. b) driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
  4. c) By simulating scroll actions until no new content loads
  5. c) It provides structured data more efficiently
