Lecture Notes: Data Storage and Post Processing
1. Storing Scraped Data in Databases
1.1 SQL Databases
SQL databases like MySQL and PostgreSQL store structured data in tables with a fixed schema. They enforce data integrity through constraints and support complex queries such as joins and aggregations.
Key Concepts:
- Tables with predefined schema.
- SQL queries for data manipulation.
Code Example: Storing Data in MySQL
import mysql.connector
# Database connection
connection = mysql.connector.connect(
    host="localhost",
    user="your_username",
    password="your_password",
    database="scraped_data_db"
)
cursor = connection.cursor()
# Create table
cursor.execute('''CREATE TABLE IF NOT EXISTS products (
    id INT AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(255),
    price DECIMAL(10, 2),
    rating DECIMAL(3, 2)
)''')
# Insert data
data = [
    ("Product A", 19.99, 4.5),
    ("Product B", 29.99, 4.8)
]
cursor.executemany("INSERT INTO products (name, price, rating) VALUES (%s, %s, %s)", data)
connection.commit()
# Close connection
cursor.close()
connection.close()
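Relational storage is only useful if the data can be queried back. As a minimal sketch, here is how the stored rows could be read with a parameterized SELECT (the connection placeholders match the example above; the price threshold is illustrative):
import mysql.connector
connection = mysql.connector.connect(
    host="localhost",
    user="your_username",
    password="your_password",
    database="scraped_data_db"
)
cursor = connection.cursor()
# Parameterized query: fetch only products above an illustrative price threshold
cursor.execute("SELECT name, price, rating FROM products WHERE price > %s", (20.00,))
for name, price, rating in cursor.fetchall():
    print(name, price, rating)
cursor.close()
connection.close()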
1.2 NoSQL Databases
NoSQL databases like MongoDB store unstructured or semi-structured data as schema-flexible documents, which suits scraped data whose fields vary from page to page. They are flexible and scale horizontally.
Key Concepts:
- Collection-based structure.
- JSON-like documents.
Code Example: Storing Data in MongoDB
from pymongo import MongoClient
# Connect to MongoDB
client = MongoClient("mongodb://localhost:27017/")
db = client["scraped_data"]
collection = db["products"]
# Insert data
data = [
    {"name": "Product A", "price": 19.99, "rating": 4.5},
    {"name": "Product B", "price": 29.99, "rating": 4.8}
]
collection.insert_many(data)
# Close connection
client.close()
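Document stores are queried with filter documents rather than SQL. A minimal sketch using pymongo's find() with the $gte comparison operator (the rating threshold is illustrative):
from pymongo import MongoClient
client = MongoClient("mongodb://localhost:27017/")
collection = client["scraped_data"]["products"]
# Filter documents whose rating is at least 4.6
for doc in collection.find({"rating": {"$gte": 4.6}}):
    print(doc["name"], doc["rating"])
client.close()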
2. Exporting Data to File Formats
2.1 Exporting to JSON
JSON is a lightweight, human-readable format for data exchange that maps naturally onto Python lists and dictionaries.
Code Example: Writing Data to JSON
import json
# Sample data
data = [
    {"name": "Product A", "price": 19.99, "rating": 4.5},
    {"name": "Product B", "price": 29.99, "rating": 4.8}
]
# Write to JSON file
with open("products.json", "w") as json_file:
json.dump(data, json_file, indent=4)
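Reading the file back with json.load restores the original list of dictionaries, which makes JSON a convenient round-trip format for scraped records:
import json
# Load the JSON file back into Python objects
with open("products.json", "r") as json_file:
    loaded = json.load(json_file)
print(loaded[0]["name"])  # "Product A"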
2.2 Exporting to CSV
CSV files are commonly used for tabular data.
Code Example: Writing Data to CSV
import csv
# Sample data
data = [
    {"name": "Product A", "price": 19.99, "rating": 4.5},
    {"name": "Product B", "price": 29.99, "rating": 4.8}
]
# Write to CSV
with open("products.csv", "w", newline="", encoding="utf-8") as csv_file:
writer = csv.DictWriter(csv_file, fieldnames=["name", "price", "rating"])
writer.writeheader()
writer.writerows(data)
2.3 Exporting to Excel
Excel files are convenient for sharing results with analysts and non-programmers. Note that pandas writes .xlsx files through an engine such as openpyxl, which must be installed.
Code Example: Writing Data to Excel
import pandas as pd
# Sample data
data = [
    {"name": "Product A", "price": 19.99, "rating": 4.5},
    {"name": "Product B", "price": 29.99, "rating": 4.8}
]
# Write to Excel
df = pd.DataFrame(data)
df.to_excel("products.xlsx", index=False)
3. Cleaning and Transforming Data with Pandas
3.1 Cleaning Data
Cleaning involves removing duplicates, handling missing values, and correcting formats.
Code Example: Cleaning Data
import pandas as pd
# Sample data
data = {
    "name": ["Product A", "Product B", None, "Product D"],
    "price": [19.99, 29.99, None, 39.99],
    "rating": [4.5, 4.8, 4.0, None]
}
# Create DataFrame
df = pd.DataFrame(data)
# Fill missing ratings with a neutral default first
df["rating"] = df["rating"].fillna(3.0)
# Then drop rows that still have missing values (here, the row with no name and no price)
df = df.dropna()
print(df)
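The definition above also mentions duplicates and format correction, which the example does not cover. Here is a minimal sketch of both; the raw string prices are illustrative of how scraped values often arrive:
import pandas as pd
# Scraped prices often arrive as strings with currency symbols
raw = pd.DataFrame({
    "name": ["Product A", "Product A", "Product B"],
    "price": ["$19.99", "$19.99", "$29.99"]
})
# Remove exact duplicate rows
raw = raw.drop_duplicates()
# Correct the format: strip the currency symbol and convert to float
raw["price"] = raw["price"].str.replace("$", "", regex=False).astype(float)
print(raw)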
3.2 Transforming Data
Transforming data involves reshaping, aggregating, and applying operations.
Code Example: Transforming Data
# Add a new column for discounted price
df["discounted_price"] = df["price"] * 0.9
# Group by and calculate mean rating
mean_rating = df.groupby("name")["rating"].mean()
print(df)
print(mean_rating)
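Reshaping is mentioned above but not demonstrated. A minimal sketch using melt() to convert the table from wide to long format (the column names follow the running example):
import pandas as pd
df = pd.DataFrame({
    "name": ["Product A", "Product B"],
    "price": [19.99, 29.99],
    "rating": [4.5, 4.8]
})
# Wide to long: one row per (product, attribute) pair
long_df = df.melt(id_vars=["name"], var_name="attribute", value_name="value")
print(long_df)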
4. Integration with ETL Pipelines
ETL (Extract, Transform, Load) pipelines automate data workflows, including scraping, cleaning, and storing data.
4.1 Building an ETL Pipeline
An ETL pipeline integrates data extraction, processing, and loading steps.
Code Example: Simple ETL Pipeline
import pandas as pd
import requests
# Extract: Fetch data
url = "https://example-ecommerce-site.com/products"
response = requests.get(url, timeout=10)
response.raise_for_status()
data = response.json()
# Transform: Clean and process data
df = pd.DataFrame(data["products"])
df["price"] = df["price"].astype(float)
# Load: Store in CSV
df.to_csv("products_etl.csv", index=False)
print("ETL process completed. Data saved to products_etl.csv.")
Quiz
Objective: Test understanding of data storage and post-processing.
Questions
1. Which SQL command is used to insert data into a table?
   - a) INSERT INTO
   - b) CREATE TABLE
   - c) SELECT
   - d) UPDATE
2. What is the primary advantage of using NoSQL databases?
   - a) Predefined schema
   - b) Flexibility with unstructured data
   - c) Complex SQL queries
   - d) Low latency
3. Which Python library is commonly used for exporting data to Excel?
   - a) csv
   - b) json
   - c) pandas
   - d) openpyxl
4. What does the dropna() method in Pandas do?
   - a) Drops duplicate rows
   - b) Fills missing values
   - c) Drops rows with missing values
   - d) Normalizes data
5. What does ETL stand for?
   - a) Extract, Transfer, Load
   - b) Extract, Transform, Load
   - c) Evaluate, Transform, Load
   - d) Extract, Transform, Link
Answers
1. a) INSERT INTO
2. b) Flexibility with unstructured data
3. c) pandas
4. c) Drops rows with missing values
5. b) Extract, Transform, Load