Monday, 13 January 2025

7. Data Storage and Post-Processing in Web Scraping

Lecture Notes: Data Storage and Post-Processing

1. Storing Scraped Data in Databases

1.1 SQL Databases

SQL databases such as MySQL and PostgreSQL store structured data in tables with a fixed schema. They enforce data integrity through constraints and support complex queries and joins.

Key Concepts:

  • Tables with predefined schema.
  • SQL queries for data manipulation.

Code Example: Storing Data in MySQL

import mysql.connector

# Database connection
connection = mysql.connector.connect(
    host="localhost",
    user="your_username",
    password="your_password",
    database="scraped_data_db"
)
cursor = connection.cursor()

# Create table
cursor.execute('''CREATE TABLE IF NOT EXISTS products (
    id INT AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(255),
    price DECIMAL(10, 2),
    rating DECIMAL(3, 2)
)''')

# Insert data
data = [
    ("Product A", 19.99, 4.5),
    ("Product B", 29.99, 4.8)
]
cursor.executemany("INSERT INTO products (name, price, rating) VALUES (%s, %s, %s)", data)
connection.commit()

# Close connection
cursor.close()
connection.close()
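For quick local experiments that do not require a running MySQL server, Python's built-in sqlite3 module follows the same connect/execute/commit pattern. A minimal sketch, mirroring the table and columns from the MySQL example (note that SQLite uses "?" placeholders rather than MySQL's "%s"):

```python
import sqlite3

# In-memory database; pass a filename such as "scraped.db" to persist
connection = sqlite3.connect(":memory:")
cursor = connection.cursor()

# Same schema as the MySQL example, in SQLite's type names
cursor.execute("""CREATE TABLE IF NOT EXISTS products (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    name TEXT,
    price REAL,
    rating REAL
)""")

data = [
    ("Product A", 19.99, 4.5),
    ("Product B", 29.99, 4.8),
]
# SQLite placeholder style is "?" instead of "%s"
cursor.executemany(
    "INSERT INTO products (name, price, rating) VALUES (?, ?, ?)", data
)
connection.commit()

rows = cursor.execute("SELECT name, price FROM products ORDER BY id").fetchall()
print(rows)  # [('Product A', 19.99), ('Product B', 29.99)]

connection.close()
```

Because sqlite3 ships with Python, this is a convenient way to prototype the storage step before pointing the scraper at a production database.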

1.2 NoSQL Databases

NoSQL databases like MongoDB are used for unstructured or semi-structured data. They offer flexible, schema-free documents and scale horizontally, which suits scraped data whose fields vary from page to page.

Key Concepts:

  • Collection-based structure.
  • JSON-like documents.

Code Example: Storing Data in MongoDB

from pymongo import MongoClient

# Connect to MongoDB
client = MongoClient("mongodb://localhost:27017/")
db = client["scraped_data"]
collection = db["products"]

# Insert data
data = [
    {"name": "Product A", "price": 19.99, "rating": 4.5},
    {"name": "Product B", "price": 29.99, "rating": 4.8}
]
collection.insert_many(data)

# Close connection
client.close()

2. Exporting Data to Common File Formats

2.1 Exporting to JSON

JSON is a lightweight format for data exchange.

Code Example: Writing Data to JSON

import json

# Sample data
data = [
    {"name": "Product A", "price": 19.99, "rating": 4.5},
    {"name": "Product B", "price": 29.99, "rating": 4.8}
]

# Write to JSON file
with open("products.json", "w") as json_file:
    json.dump(data, json_file, indent=4)
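A quick sanity check on the export is to round-trip the file with json.load and compare against the original list. This sketch rewrites the sample data so it runs standalone, using a temporary file path to avoid leaving artifacts behind:

```python
import json
import os
import tempfile

data = [
    {"name": "Product A", "price": 19.99, "rating": 4.5},
    {"name": "Product B", "price": 29.99, "rating": 4.8},
]

# Write to a temporary location instead of the working directory
path = os.path.join(tempfile.gettempdir(), "products_check.json")
with open(path, "w") as json_file:
    json.dump(data, json_file, indent=4)

# Round-trip: load the file back and compare with the original list
with open(path) as json_file:
    loaded = json.load(json_file)

print(loaded == data)  # True
```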

2.2 Exporting to CSV

CSV files are commonly used for tabular data.

Code Example: Writing Data to CSV

import csv

# Sample data
data = [
    {"name": "Product A", "price": 19.99, "rating": 4.5},
    {"name": "Product B", "price": 29.99, "rating": 4.8}
]

# Write to CSV
with open("products.csv", "w", newline="", encoding="utf-8") as csv_file:
    writer = csv.DictWriter(csv_file, fieldnames=["name", "price", "rating"])
    writer.writeheader()
    writer.writerows(data)

2.3 Exporting to Excel

Excel files are convenient for sharing data with analysts. Note that pandas writes .xlsx files through an engine such as openpyxl, which must be installed separately (pip install openpyxl).

Code Example: Writing Data to Excel

import pandas as pd

# Sample data
data = [
    {"name": "Product A", "price": 19.99, "rating": 4.5},
    {"name": "Product B", "price": 29.99, "rating": 4.8}
]

# Write to Excel
df = pd.DataFrame(data)
df.to_excel("products.xlsx", index=False)

3. Cleaning and Transforming Data with Pandas

3.1 Cleaning Data

Cleaning involves removing duplicates, handling missing values, and correcting formats.

Code Example: Cleaning Data

import pandas as pd

# Sample data
data = {
    "name": ["Product A", "Product B", None, "Product D"],
    "price": [19.99, 29.99, None, 39.99],
    "rating": [4.5, 4.8, 4.0, None]
}

# Create DataFrame
df = pd.DataFrame(data)

# Fill missing ratings with a default value first
df["rating"] = df["rating"].fillna(3.0)

# Then drop rows still missing a name or price
# (calling dropna() before fillna() would discard those rows
# and make the fillna() a no-op)
df = df.dropna(subset=["name", "price"])

print(df)
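The introduction to this section also mentions removing duplicates, which the example above does not cover. A short sketch with drop_duplicates, using hypothetical sample rows:

```python
import pandas as pd

# Sample data containing one exact duplicate row
df = pd.DataFrame({
    "name": ["Product A", "Product B", "Product A"],
    "price": [19.99, 29.99, 19.99],
})

# Drop exact duplicate rows, keeping the first occurrence
deduped = df.drop_duplicates()
print(len(deduped))  # 2

# Deduplicate on a subset of columns, e.g. one row per product name,
# keeping the most recently scraped row
by_name = df.drop_duplicates(subset=["name"], keep="last")
print(len(by_name))  # 2
```

The subset form is the one most often needed in scraping, where the same item is collected repeatedly across runs with slightly different values.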

3.2 Transforming Data

Transforming data involves reshaping (e.g. pivot/melt), aggregating (e.g. groupby), and applying column-wise operations.

Code Example: Transforming Data

# Add a new column for discounted price
df["discounted_price"] = df["price"] * 0.9

# Group by and calculate mean rating
mean_rating = df.groupby("name")["rating"].mean()

print(df)
print(mean_rating)
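The example above covers aggregation and a derived column; reshaping can be sketched with melt and pivot, which convert between wide (one column per attribute) and long (one row per attribute) layouts. A self-contained sketch on the same sample products:

```python
import pandas as pd

wide = pd.DataFrame({
    "name": ["Product A", "Product B"],
    "price": [19.99, 29.99],
    "rating": [4.5, 4.8],
})

# Wide -> long: one row per (product, attribute) pair
long = wide.melt(id_vars="name", var_name="attribute", value_name="value")
print(long.shape)  # (4, 3)

# Long -> wide again with pivot
restored = long.pivot(index="name", columns="attribute", values="value")
print(restored.loc["Product A", "price"])  # 19.99
```

Long format is often easier to aggregate and plot; wide format is usually what gets exported to CSV or Excel.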

4. Integration with ETL Pipelines

ETL (Extract, Transform, Load) pipelines automate data workflows, including scraping, cleaning, and storing data.

4.1 Building an ETL Pipeline

An ETL pipeline integrates data extraction, processing, and loading steps.

Code Example: Simple ETL Pipeline

import pandas as pd
import requests

# Extract: fetch data (placeholder URL; replace with a real endpoint)
url = "https://example-ecommerce-site.com/products"
response = requests.get(url, timeout=10)
response.raise_for_status()  # fail fast on HTTP errors
data = response.json()

# Transform: Clean and process data
df = pd.DataFrame(data["products"])
df["price"] = df["price"].astype(float)

# Load: Store in CSV
df.to_csv("products_etl.csv", index=False)
print("ETL process completed. Data saved to products_etl.csv.")
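For anything beyond a one-off script, the three stages are usually wrapped in separate functions so each can be tested in isolation and the extract step can be swapped out (live HTTP vs. a cached file). A minimal sketch, using an in-memory sample in place of the live request:

```python
import pandas as pd

def extract():
    # Stand-in for requests.get(url).json(); returns raw records
    return {"products": [
        {"name": "Product A", "price": "19.99"},
        {"name": "Product B", "price": "29.99"},
    ]}

def transform(raw):
    # Normalize types: scraped prices often arrive as strings
    df = pd.DataFrame(raw["products"])
    df["price"] = df["price"].astype(float)
    return df

def load(df, path="products_etl.csv"):
    df.to_csv(path, index=False)
    return path

# Run extract and transform; call load(df) to write the CSV
df = transform(extract())
print(round(df["price"].sum(), 2))  # 49.98
```

Separating the stages this way also makes it straightforward to add retries around extract or validation checks between transform and load.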

Quiz

Objective: Test understanding of data storage and post-processing.

Questions

  1. Which SQL command is used to insert data into a table?

    • a) INSERT INTO
    • b) CREATE TABLE
    • c) SELECT
    • d) UPDATE
  2. What is the primary advantage of using NoSQL databases?

    • a) Predefined schema
    • b) Flexibility with unstructured data
    • c) Complex SQL queries
    • d) Low latency
  3. Which Python library is commonly used for exporting data to Excel?

    • a) csv
    • b) json
    • c) pandas
    • d) openpyxl
  4. What does the dropna() method in Pandas do?

    • a) Drops duplicate rows
    • b) Fills missing values
    • c) Drops rows with missing values
    • d) Normalizes data
  5. What does ETL stand for?

    • a) Extract, Transfer, Load
    • b) Extract, Transform, Load
    • c) Evaluate, Transform, Load
    • d) Extract, Transform, Link

Answers

  1. a) INSERT INTO
  2. b) Flexibility with unstructured data
  3. c) pandas
  4. c) Drops rows with missing values
  5. b) Extract, Transform, Load
