Lecture Notes: Advanced Parsing Techniques in Web Scraping
1. Introduction to Advanced Parsing Techniques
What is Parsing?
Parsing in web scraping refers to the process of extracting structured data from unstructured HTML or XML content retrieved from websites.
Why Use Advanced Parsing Techniques?
- Handle complex, nested structures.
- Extract dynamic or deeply embedded data.
- Improve scraping efficiency and accuracy.
2. Tools for Advanced Parsing
1. BeautifulSoup
- A Python library for parsing HTML and XML documents.
- Provides methods to navigate and search the parse tree.
2. lxml
- Fast and memory-efficient XML and HTML parser.
- Supports XPath expressions and XSLT transformations.
3. XPath
- A powerful query language for navigating XML/HTML documents.
- Provides precise extraction using structured queries.
4. Regular Expressions
- Useful for pattern-based extraction from raw text or HTML attributes.
3. Parsing Techniques
1. Navigating Nested Structures
- Use BeautifulSoup’s methods such as .find(), .find_all(), and .select() to locate elements.
- Use recursion to handle deeply nested structures, as in the sketch below.
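A minimal sketch of the recursive approach, using an invented HTML fragment (the class name "comment" is purely illustrative):

from bs4 import BeautifulSoup

html_doc = """
<div class="comment">top
  <div class="replies">
    <div class="comment">reply
      <div class="replies"><div class="comment">nested reply</div></div>
    </div>
  </div>
</div>
"""

def collect_by_class(element, class_name, results=None):
    # Walk direct children only, then recurse, so arbitrarily deep
    # nesting is handled one level at a time.
    if results is None:
        results = []
    for child in element.find_all(recursive=False):
        if class_name in child.get("class", []):
            results.append(child)
        collect_by_class(child, class_name, results)
    return results

soup = BeautifulSoup(html_doc, "html.parser")
for node in collect_by_class(soup, "comment"):
    print(node.get_text(" ", strip=True))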
2. Parsing with XPath
- Identify elements using unique paths in the DOM tree.
- Example XPath queries (demonstrated below):
- //div[@class='product']: Selects all <div> elements with the class "product".
- //a[@href]: Selects all <a> elements with an href attribute.
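These queries can be tried directly with lxml; a small self-contained sketch against an invented fragment:

from lxml import html

fragment = """
<div class="product"><a href="/item/1">Widget</a></div>
<div class="product"><a href="/item/2">Gadget</a></div>
"""
tree = html.fromstring(fragment)

print(tree.xpath("//div[@class='product']"))  # two <div> element objects
print(tree.xpath("//a[@href]/@href"))         # ['/item/1', '/item/2']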
3. Using Regular Expressions
- Extract data based on patterns.
- Example: extract all email addresses with r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}' (see the sketch below).
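A minimal sketch applying that pattern with Python’s re module; the sample text is invented:

import re

text = "Contact sales@example.com or support@example.co.uk for help."
pattern = r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'

# findall returns every non-overlapping match of the full pattern.
print(re.findall(pattern, text))  # ['sales@example.com', 'support@example.co.uk']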
4. Parsing JSON Data Embedded in HTML
- Extract and decode JSON objects from <script> tags.
- Use Python’s json module to process JSON (see the sketch below).
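A minimal sketch: many sites embed structured data as JSON-LD inside a <script> tag; the markup below is invented for illustration.

import json
from bs4 import BeautifulSoup

html_doc = """
<html><head>
<script type="application/ld+json">
{"name": "Widget", "price": "19.99"}
</script>
</head></html>
"""

soup = BeautifulSoup(html_doc, "html.parser")
script = soup.find("script", type="application/ld+json")
# script.string is the raw text inside the tag; json.loads decodes it.
data = json.loads(script.string)
print(data["name"], data["price"])  # Widget 19.99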
5. Handling Edge Cases
- Missing data: Use conditional checks.
- Dynamic attributes: Use wildcard patterns (e.g., contains() in XPath; see the sketch after this list).
- Multi-language content: Ensure proper encoding and decoding.
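A minimal sketch of contains() for dynamic attribute values, such as framework-generated class suffixes; the markup is invented:

from lxml import html

fragment = """
<div class="product product-3f9a">A</div>
<div class="product product-7c21">B</div>
"""
tree = html.fromstring(fragment)

# contains() matches any class attribute that includes the substring 'product'.
print(tree.xpath("//div[contains(@class, 'product')]/text()"))  # ['A', 'B']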
4. Sample Code: Parsing Nested and Complex Data
Scenario: Extract product details from a nested HTML structure.
Using BeautifulSoup
from bs4 import BeautifulSoup
import requests
# Fetch HTML content
url = "https://example-ecommerce-site.com/products"
response = requests.get(url)
response.raise_for_status()
# Parse with BeautifulSoup
soup = BeautifulSoup(response.text, 'html.parser')
# Extract products
products = []
product_elements = soup.find_all('div', class_='product')
for product in product_elements:
    name = product.find('h2', class_='product-name').text.strip()
    price = product.find('span', class_='product-price').text.strip()
    rating = product.find('div', class_='product-rating').text.strip()
    products.append({"name": name, "price": price, "rating": rating})

# Print the extracted data
for product in products:
    print(product)
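Note that the loop above assumes every product card contains all three child elements; if one is missing, .find() returns None and .text raises AttributeError. A hedged variant with the conditional checks recommended in Section 3:

for product in product_elements:
    name_el = product.find('h2', class_='product-name')
    price_el = product.find('span', class_='product-price')
    rating_el = product.find('div', class_='product-rating')
    # Fall back to None instead of crashing when an element is absent.
    products.append({
        "name": name_el.text.strip() if name_el else None,
        "price": price_el.text.strip() if price_el else None,
        "rating": rating_el.text.strip() if rating_el else None,
    })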
Using XPath with lxml
from lxml import html
import requests
# Fetch HTML content
url = "https://example-ecommerce-site.com/products"
response = requests.get(url)
response.raise_for_status()
# Parse with lxml
page = html.fromstring(response.content)
# Extract products using XPath
names = page.xpath("//h2[@class='product-name']/text()")
prices = page.xpath("//span[@class='product-price']/text()")
ratings = page.xpath("//div[@class='product-rating']/text()")
# Combine data
products = [
    {"name": name, "price": price, "rating": rating}
    for name, price, rating in zip(names, prices, ratings)
]
# Print the extracted data
for product in products:
print(product)
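One caveat: querying each field document-wide and zipping the lists will silently misalign records if any product is missing a field. A sketch of a per-product variant, continuing from the page object above:

# Query each product node, then search relative to it with './/'.
products = []
for node in page.xpath("//div[@class='product']"):
    name = node.xpath(".//h2[@class='product-name']/text()")
    price = node.xpath(".//span[@class='product-price']/text()")
    rating = node.xpath(".//div[@class='product-rating']/text()")
    products.append({
        "name": name[0].strip() if name else None,
        "price": price[0].strip() if price else None,
        "rating": rating[0].strip() if rating else None,
    })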
5. Assignment
Objective
Parse and extract data from a news website’s HTML, which contains complex nested structures.
Requirements
- Extract article titles, publication dates, and author names.
- Use BeautifulSoup and/or lxml with XPath.
- Save the extracted data to a CSV file.
Solution
import csv
from bs4 import BeautifulSoup
import requests
# Fetch HTML content
url = "https://example-news-site.com"
response = requests.get(url)
response.raise_for_status()
# Parse with BeautifulSoup
soup = BeautifulSoup(response.text, 'html.parser')
# Extract articles
articles = []
article_elements = soup.find_all('article')
for article in article_elements:
    title = article.find('h1', class_='article-title').text.strip()
    date = article.find('time', class_='publication-date')['datetime']
    author = article.find('span', class_='author-name').text.strip()
    articles.append({"title": title, "date": date, "author": author})
# Save to CSV
with open('articles.csv', 'w', newline='', encoding='utf-8') as file:
    writer = csv.DictWriter(file, fieldnames=["title", "date", "author"])
    writer.writeheader()
    writer.writerows(articles)
print("Data saved to articles.csv")
6. Quiz
Objective: Test understanding of parsing techniques.
Questions
1. What is the primary purpose of XPath?
- a) To scrape dynamic content
- b) To query and navigate XML/HTML documents
- c) To parse JSON data
- d) To rotate proxies
2. Which library is best for handling deeply nested HTML structures?
- a) NumPy
- b) BeautifulSoup
- c) Pandas
- d) PyTorch
3. What does the following XPath query select? //div[@class='item']/span/text()
- a) All <div> elements with the class "item"
- b) All <span> elements inside <div> elements with the class "item"
- c) Text inside <span> elements within <div> elements with the class "item"
- d) Attributes of <span> elements
4. What Python library is typically used to parse JSON embedded in HTML?
- a) lxml
- b) json
- c) BeautifulSoup
- d) Selenium
5. Which of these is NOT an advanced parsing technique?
- a) Regular Expressions
- b) User-Agent Rotation
- c) JSON Parsing
- d) XPath Queries
Answers
- b) To query and navigate XML/HTML documents
- b) BeautifulSoup
- c) Text inside <span> elements within <div> elements with the class "item"
- b) json
- b) User-Agent Rotation