Monday, 13 January 2025

5. Advanced Parsing Techniques in Web Scraping

Lecture Notes: Advanced Parsing Techniques in Web Scraping

1. Introduction to Advanced Parsing Techniques

What is Parsing?

Parsing in web scraping is the process of extracting structured data from the raw HTML or XML content retrieved from websites.

Why Use Advanced Parsing Techniques?

  • Handle complex, nested structures.
  • Extract dynamic or deeply embedded data.
  • Improve scraping efficiency and accuracy.

2. Tools for Advanced Parsing

1. BeautifulSoup

  • A Python library for parsing HTML and XML documents.
  • Provides methods to navigate and search the parse tree.

2. lxml

  • Fast and memory-efficient XML and HTML parser.
  • Supports XPath expressions and XSLT transformations.

3. XPath

  • A powerful query language for navigating XML/HTML documents.
  • Provides precise extraction using structured queries.

4. Regular Expressions

  • Useful for pattern-based extraction from raw text or HTML attributes.

3. Parsing Techniques

1. Navigating Nested Structures

  • Use BeautifulSoup’s .find(), .find_all(), and .select() (CSS selectors) methods to locate elements.
  • Use recursion to walk deeply nested structures; a short sketch follows this list.
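Below is a minimal sketch of recursive traversal. The HTML snippet and class names (menu, submenu) are fabricated purely for illustration.

from bs4 import BeautifulSoup

html_doc = """
<div class="menu">
  <span>Home</span>
  <div class="submenu">
    <span>Products</span>
    <div class="submenu"><span>Laptops</span></div>
  </div>
</div>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

def collect_text(node, depth=0):
    # Walk direct child tags only, recursing one level at a time
    for child in node.find_all(recursive=False):
        if child.name == 'span':
            print('  ' * depth + child.get_text(strip=True))
        collect_text(child, depth + 1)

collect_text(soup.find('div', class_='menu'))
# Home
#   Products
#     Laptops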

2. Parsing with XPath

  • Identify elements using unique paths in the DOM tree.
  • Example XPath queries:
    • //div[@class='product']: Selects all <div> elements with the class "product".
    • //a[@href]: Selects all <a> elements with an href attribute.

3. Using Regular Expressions

  • Extract data based on patterns (see the sketch below).
  • Example:
    • Extract all email addresses: r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'
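A short sketch applying this pattern with Python’s re module; the sample text is made up:

import re

text = "Contact sales@example.com or support@example.org for details."
pattern = r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'

emails = re.findall(pattern, text)
print(emails)  # ['sales@example.com', 'support@example.org']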

4. Parsing JSON Data Embedded in HTML

  • Extract and decode JSON objects embedded in <script> tags (see the sketch below).
  • Use Python’s json module to parse the extracted string.
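A minimal sketch, assuming the page embeds a JSON-LD block; the <script> content here is fabricated:

import json
from bs4 import BeautifulSoup

html_doc = """
<script type="application/ld+json">
{"name": "Sample Product", "offers": {"price": "19.99", "priceCurrency": "USD"}}
</script>
"""

soup = BeautifulSoup(html_doc, 'html.parser')
script = soup.find('script', type='application/ld+json')

# json.loads tolerates the whitespace surrounding the script body
data = json.loads(script.string)
print(data['name'], data['offers']['price'])  # Sample Product 19.99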

5. Handling Edge Cases

  • Missing data: use conditional checks before accessing text or attributes (see the sketch below).
  • Dynamic attributes: match partial values with wildcard patterns (e.g., XPath’s contains(), also shown below).
  • Multi-language content: ensure proper encoding and decoding (e.g., UTF-8).
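A brief sketch of both guards; the snippet and class names (product-card, price-usd) are made up for illustration:

from bs4 import BeautifulSoup
from lxml import html

snippet = """
<div class="products">
  <div class="product-card">
    <h2>Widget</h2>
  </div>
  <div class="product-card featured">
    <h2>Gadget</h2>
    <span class="price-usd">9.99</span>
  </div>
</div>
"""

# Conditional checks: guard against missing elements before reading text
soup = BeautifulSoup(snippet, 'html.parser')
for card in soup.find_all('div', class_='product-card'):
    price_el = card.find('span')
    price = price_el.get_text(strip=True) if price_el else 'N/A'
    print(card.find('h2').get_text(strip=True), price)

# Wildcard matching: contains() catches class variants like price-usd, price-eur
tree = html.fromstring(snippet)
print(tree.xpath("//span[contains(@class, 'price-')]/text()"))  # ['9.99']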

4. Sample Code: Parsing Nested and Complex Data

Scenario: Extract product details from a nested HTML structure.

from bs4 import BeautifulSoup
import requests

# Fetch HTML content
url = "https://example-ecommerce-site.com/products"
response = requests.get(url, timeout=10)
response.raise_for_status()

# Parse with BeautifulSoup
soup = BeautifulSoup(response.text, 'html.parser')

# Extract products
products = []
product_elements = soup.find_all('div', class_='product')

for product in product_elements:
    name = product.find('h2', class_='product-name').text.strip()
    price = product.find('span', class_='product-price').text.strip()
    rating = product.find('div', class_='product-rating').text.strip()
    products.append({"name": name, "price": price, "rating": rating})

# Print the extracted data
for product in products:
    print(product)

Using XPath with lxml

from lxml import html
import requests

# Fetch HTML content
url = "https://example-ecommerce-site.com/products"
response = requests.get(url, timeout=10)
response.raise_for_status()

# Parse with lxml
page = html.fromstring(response.content)

# Extract products using XPath
names = page.xpath("//h2[@class='product-name']/text()")
prices = page.xpath("//span[@class='product-price']/text()")
ratings = page.xpath("//div[@class='product-rating']/text()")

# Combine data
products = [
    {"name": name, "price": price, "rating": rating}
    for name, price, rating in zip(names, prices, ratings)
]

# Print the extracted data
for product in products:
    print(product)
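Note that the zip() approach above can silently misalign records when a product is missing one of the fields. A more defensive variant (reusing the page object and the same hypothetical class names) selects each product container first and runs relative XPath queries inside it:

# Query each product container, then extract fields relative to it
for product_el in page.xpath("//div[@class='product']"):
    name = product_el.xpath(".//h2[@class='product-name']/text()")
    price = product_el.xpath(".//span[@class='product-price']/text()")
    rating = product_el.xpath(".//div[@class='product-rating']/text()")
    print({
        "name": name[0].strip() if name else None,
        "price": price[0].strip() if price else None,
        "rating": rating[0].strip() if rating else None,
    })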

5. Assignment

Objective

Parse and extract data from a news website’s HTML, which contains complex nested structures.

Requirements

  1. Extract article titles, publication dates, and author names.
  2. Use BeautifulSoup and/or lxml with XPath.
  3. Save the extracted data to a CSV file.

Solution

import csv
from bs4 import BeautifulSoup
import requests

# Fetch HTML content
url = "https://example-news-site.com"
response = requests.get(url, timeout=10)
response.raise_for_status()

# Parse with BeautifulSoup
soup = BeautifulSoup(response.text, 'html.parser')

# Extract articles
articles = []
article_elements = soup.find_all('article')

for article in article_elements:
    # Guard each lookup: real pages often omit one of these fields
    title_el = article.find('h1', class_='article-title')
    date_el = article.find('time', class_='publication-date')
    author_el = article.find('span', class_='author-name')
    articles.append({
        "title": title_el.get_text(strip=True) if title_el else None,
        "date": date_el.get('datetime') if date_el else None,
        "author": author_el.get_text(strip=True) if author_el else None,
    })

# Save to CSV
with open('articles.csv', 'w', newline='', encoding='utf-8') as file:
    writer = csv.DictWriter(file, fieldnames=["title", "date", "author"])
    writer.writeheader()
    writer.writerows(articles)

print("Data saved to articles.csv")

6. Quiz

Objective: Test understanding of parsing techniques.

Questions

  1. What is the primary purpose of XPath?

    • a) To scrape dynamic content
    • b) To query and navigate XML/HTML documents
    • c) To parse JSON data
    • d) To rotate proxies
  2. Which library is best for handling deeply nested HTML structures?

    • a) NumPy
    • b) BeautifulSoup
    • c) Pandas
    • d) PyTorch
  3. What does the following XPath query select? //div[@class='item']/span/text()

    • a) All <div> elements with the class "item"
    • b) All <span> elements inside <div> elements with the class "item"
    • c) Text inside <span> elements within <div> elements with the class "item"
    • d) Attributes of <span> elements
  4. What Python library is typically used to parse JSON embedded in HTML?

    • a) lxml
    • b) json
    • c) BeautifulSoup
    • d) Selenium
  5. Which of these is NOT an advanced parsing technique?

    • a) Regular Expressions
    • b) User-Agent Rotation
    • c) JSON Parsing
    • d) XPath Queries

Answers

  1. b) To query and navigate XML/HTML documents
  2. b) BeautifulSoup
  3. c) Text inside <span> elements within <div> elements with the class "item"
  4. b) json
  5. b) User-Agent Rotation
