Lecture Notes: Advanced Parsing Techniques in Web Scraping
1. Introduction to Advanced Parsing Techniques
What is Parsing?
Parsing in web scraping refers to the process of extracting structured data from unstructured HTML or XML content retrieved from websites.
Why Use Advanced Parsing Techniques?
- Handle complex, nested structures.
- Extract dynamic or deeply embedded data.
- Improve scraping efficiency and accuracy.
2. Tools for Advanced Parsing
1. BeautifulSoup
- A Python library for parsing HTML and XML documents.
- Provides methods to navigate and search the parse tree.
2. lxml
- Fast and memory-efficient XML and HTML parser.
- Supports XPath expressions and XSLT transformations.
3. XPath
- A powerful query language for navigating XML/HTML documents.
- Provides precise extraction using structured queries.
4. Regular Expressions
- Useful for pattern-based extraction from raw text or HTML attributes.
3. Parsing Techniques
1. Navigating Nested Structures
- Use BeautifulSoup’s methods such as .find(), .find_all(), and .select() to locate elements.
- Use recursion to handle deeply nested structures, as in the sketch below.
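A minimal sketch of the recursive approach, using an invented HTML fragment (the class name "comment" is purely illustrative):

from bs4 import BeautifulSoup

html_doc = """
<div class="comment">top
  <div class="replies">
    <div class="comment">reply
      <div class="replies"><div class="comment">nested reply</div></div>
    </div>
  </div>
</div>
"""

def collect_by_class(element, class_name, results=None):
    # Walk direct children only, then recurse, so arbitrarily deep
    # nesting is handled one level at a time.
    if results is None:
        results = []
    for child in element.find_all(recursive=False):
        if class_name in child.get("class", []):
            results.append(child)
        collect_by_class(child, class_name, results)
    return results

soup = BeautifulSoup(html_doc, "html.parser")
for node in collect_by_class(soup, "comment"):
    print(node.get_text(" ", strip=True))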
2. Parsing with XPath
- Identify elements using unique paths in the DOM tree.
- Example XPath queries (demonstrated below):
- //div[@class='product']: Selects all <div> elements with the class "product".
- //a[@href]: Selects all <a> elements with an href attribute.
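These queries can be tried directly with lxml; a small self-contained sketch against an invented fragment:

from lxml import html

fragment = """
<div class="product"><a href="/item/1">Widget</a></div>
<div class="product"><a href="/item/2">Gadget</a></div>
"""
tree = html.fromstring(fragment)

print(tree.xpath("//div[@class='product']"))  # two <div> element objects
print(tree.xpath("//a[@href]/@href"))         # ['/item/1', '/item/2']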
3. Using Regular Expressions
- Extract data based on patterns.
- Example: extract all email addresses with r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}' (see the sketch below).
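A minimal sketch applying that pattern with Python’s re module; the sample text is invented:

import re

text = "Contact sales@example.com or support@example.co.uk for help."
pattern = r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'

# findall returns every non-overlapping match of the full pattern.
print(re.findall(pattern, text))  # ['sales@example.com', 'support@example.co.uk']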
4. Parsing JSON Data Embedded in HTML
- Extract and decode JSON objects from <script> tags.
- Use Python’s json module to process JSON (see the sketch below).
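A minimal sketch: many sites embed structured data as JSON-LD inside a <script> tag; the markup below is invented for illustration.

import json
from bs4 import BeautifulSoup

html_doc = """
<html><head>
<script type="application/ld+json">
{"name": "Widget", "price": "19.99"}
</script>
</head></html>
"""

soup = BeautifulSoup(html_doc, "html.parser")
script = soup.find("script", type="application/ld+json")
# script.string is the raw text inside the tag; json.loads decodes it.
data = json.loads(script.string)
print(data["name"], data["price"])  # Widget 19.99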
5. Handling Edge Cases
- Missing data: Use conditional checks.
- Dynamic attributes: Use wildcard patterns (e.g., contains() in XPath; see the sketch after this list).
- Multi-language content: Ensure proper encoding and decoding.
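A minimal sketch of contains() for dynamic attribute values, such as framework-generated class suffixes; the markup is invented:

from lxml import html

fragment = """
<div class="product product-3f9a">A</div>
<div class="product product-7c21">B</div>
"""
tree = html.fromstring(fragment)

# contains() matches any class attribute that includes the substring 'product'.
print(tree.xpath("//div[contains(@class, 'product')]/text()"))  # ['A', 'B']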
4. Sample Code: Parsing Nested and Complex Data
Scenario: Extract product details from a nested HTML structure.
Using BeautifulSoup
from bs4 import BeautifulSoup
import requests
# Fetch HTML content
url = "https://example-ecommerce-site.com/products"
response = requests.get(url)
response.raise_for_status()
# Parse with BeautifulSoup
soup = BeautifulSoup(response.text, 'html.parser')
# Extract products
products = []
product_elements = soup.find_all('div', class_='product')
for product in product_elements:
    name = product.find('h2', class_='product-name').text.strip()
    price = product.find('span', class_='product-price').text.strip()
    rating = product.find('div', class_='product-rating').text.strip()
    products.append({"name": name, "price": price, "rating": rating})

# Print the extracted data
for product in products:
    print(product)
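Note that the loop above assumes every product card contains all three child elements; if one is missing, .find() returns None and .text raises AttributeError. A hedged variant with the conditional checks recommended in Section 3:

for product in product_elements:
    name_el = product.find('h2', class_='product-name')
    price_el = product.find('span', class_='product-price')
    rating_el = product.find('div', class_='product-rating')
    # Fall back to None instead of crashing when an element is absent.
    products.append({
        "name": name_el.text.strip() if name_el else None,
        "price": price_el.text.strip() if price_el else None,
        "rating": rating_el.text.strip() if rating_el else None,
    })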
Using XPath with lxml
from lxml import html
import requests
# Fetch HTML content
url = "https://example-ecommerce-site.com/products"
response = requests.get(url)
response.raise_for_status()
# Parse with lxml
page = html.fromstring(response.content)
# Extract products using XPath
names = page.xpath("//h2[@class='product-name']/text()")
prices = page.xpath("//span[@class='product-price']/text()")
ratings = page.xpath("//div[@class='product-rating']/text()")
# Combine data
products = [
    {"name": name, "price": price, "rating": rating}
    for name, price, rating in zip(names, prices, ratings)
]
# Print the extracted data
for product in products:
print(product)
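One caveat: querying each field document-wide and zipping the lists will silently misalign records if any product is missing a field. A sketch of a per-product variant, continuing from the page object above:

# Query each product node, then search relative to it with './/'.
products = []
for node in page.xpath("//div[@class='product']"):
    name = node.xpath(".//h2[@class='product-name']/text()")
    price = node.xpath(".//span[@class='product-price']/text()")
    rating = node.xpath(".//div[@class='product-rating']/text()")
    products.append({
        "name": name[0].strip() if name else None,
        "price": price[0].strip() if price else None,
        "rating": rating[0].strip() if rating else None,
    })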
5. Assignment
Objective
Parse and extract data from a news website’s HTML, which contains complex nested structures.
Requirements
- Extract article titles, publication dates, and author names.
- Use BeautifulSoup and/or lxml with XPath.
- Save the extracted data to a CSV file.
Solution
import csv
from bs4 import BeautifulSoup
import requests
# Fetch HTML content
url = "https://example-news-site.com"
response = requests.get(url)
response.raise_for_status()
# Parse with BeautifulSoup
soup = BeautifulSoup(response.text, 'html.parser')
# Extract articles
articles = []
article_elements = soup.find_all('article')
for article in article_elements:
    title = article.find('h1', class_='article-title').text.strip()
    date = article.find('time', class_='publication-date')['datetime']
    author = article.find('span', class_='author-name').text.strip()
    articles.append({"title": title, "date": date, "author": author})
# Save to CSV
with open('articles.csv', 'w', newline='', encoding='utf-8') as file:
    writer = csv.DictWriter(file, fieldnames=["title", "date", "author"])
    writer.writeheader()
    writer.writerows(articles)
print("Data saved to articles.csv")
6. Quiz
Objective: Test understanding of parsing techniques.
Questions
1. What is the primary purpose of XPath?
- a) To scrape dynamic content
- b) To query and navigate XML/HTML documents
- c) To parse JSON data
- d) To rotate proxies
2. Which library is best for handling deeply nested HTML structures?
- a) NumPy
- b) BeautifulSoup
- c) Pandas
- d) PyTorch
3. What does the following XPath query select? //div[@class='item']/span/text()
- a) All <div> elements with the class "item"
- b) All <span> elements inside <div> elements with the class "item"
- c) Text inside <span> elements within <div> elements with the class "item"
- d) Attributes of <span> elements
4. What Python library is typically used to parse JSON embedded in HTML?
- a) lxml
- b) json
- c) BeautifulSoup
- d) Selenium
5. Which of these is NOT an advanced parsing technique?
- a) Regular Expressions
- b) User-Agent Rotation
- c) JSON Parsing
- d) XPath Queries
Answers
- b) To query and navigate XML/HTML documents
- b) BeautifulSoup
- c) Text inside <span> elements within <div> elements with the class "item"
- b) json
- b) User-Agent Rotation