Data Cleaning & Manipulation in Python
Introduction
In Data Science, data cleaning and manipulation are essential steps before any analysis or modeling.
Real-world data is often messy: it may contain missing values, duplicates, or inconsistent formats.
Using Python's pandas library, we can efficiently load, clean, and prepare datasets for further processing.
Loading CSV Files with pandas
The pandas library provides an easy way to read CSV (Comma-Separated Values) files into a DataFrame, which is a tabular data structure similar to an Excel spreadsheet.
Example: Reading a CSV file
import pandas as pd
# Load a CSV file
df = pd.read_csv("data.csv")
# Display the first 5 rows
print(df.head())
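After loading, it is usually worth a quick look at the DataFrame's size, column types, and basic statistics before cleaning. A minimal sketch, assuming df was loaded as above:
# Show the number of rows and columns
print(df.shape)
# Show column names, data types, and non-null counts
df.info()
# Show summary statistics for the numeric columns
print(df.describe())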
Useful Parameters
| Parameter | Description |
|---|---|
| sep | Change the delimiter (e.g., sep=';') |
| header | Define which row to use as column names |
| index_col | Use a column as the DataFrame index |
| usecols | Select specific columns to load |
Example with options:
df = pd.read_csv("data.csv", sep=";", usecols=["Name", "Age"], index_col="Name")
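read_csv accepts many other options as well. The sketch below combines a few common ones (parse_dates, dtype, na_values); the Date column and the specific na_values strings are only illustrative assumptions about the file:
df = pd.read_csv(
    "data.csv",
    sep=";",
    parse_dates=["Date"],     # parse the 'Date' column as datetime (assumed column)
    dtype={"Age": "Int64"},   # load 'Age' as a nullable integer type
    na_values=["N/A", "-"],   # treat these strings as missing values
)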
Handling Missing Data
Missing data can cause incorrect analyses or model errors, so it's important to handle it properly.
1. Detecting Missing Data
# Check for missing values
print(df.isnull().sum())
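Raw counts are easier to interpret as proportions when the dataset is large. A small sketch building on the same idea:
# Fraction of missing values per column (True/False averaged as 1/0)
print(df.isnull().mean())
# Columns that contain at least one missing value
print(df.columns[df.isnull().any()])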
2. Removing Missing Data
# Drop rows with missing values
df_clean = df.dropna()
# Drop columns with missing values
df_clean_cols = df.dropna(axis=1)
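dropna() also takes subset and thresh parameters for finer control; the column names below are assumptions:
# Drop rows only when specific columns are missing
df_subset = df.dropna(subset=["Age", "City"])
# Keep rows that have at least 3 non-missing values
df_thresh = df.dropna(thresh=3)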
3. Filling Missing Data
# Fill missing values with a specific value
df_filled = df.fillna(0)
# Fill missing values with the column mean
df["Age"] = df["Age"].fillna(df["Age"].mean())
Common strategies:
- Replace missing values with mean, median, or mode
- Remove rows or columns with too many missing entries
- Use interpolation for numerical data (see the sketch below)
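The sketch below illustrates the strategies from the list above; the Salary, City, and Temperature column names are only assumptions:
# Fill a numeric column with its median (more robust to outliers than the mean)
df["Salary"] = df["Salary"].fillna(df["Salary"].median())
# Fill a categorical column with its mode (most frequent value)
df["City"] = df["City"].fillna(df["City"].mode()[0])
# Linearly interpolate a numeric column
df["Temperature"] = df["Temperature"].interpolate()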
Filtering Data
Filtering allows you to extract specific rows based on conditions.
Examples:
# Select rows where Age > 30
adults = df[df["Age"] > 30]
# Select rows where City == 'London'
london_data = df[df["City"] == "London"]
# Combine multiple conditions
filtered = df[(df["Age"] > 25) & (df["City"] == "Paris")]
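Two other common patterns are matching against a list of values with isin() and matching substrings with str.contains(); the Name column here is an assumption:
# Keep rows whose City is in a given list
selected_cities = df[df["City"].isin(["London", "Paris"])]
# Keep rows whose Name contains a substring (na=False skips missing values)
smiths = df[df["Name"].str.contains("Smith", na=False)]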
Using the query() Method
filtered = df.query("Age > 25 and City == 'Paris'")
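query() can also reference Python variables with the @ prefix, which keeps thresholds out of the query string:
# Reference a local variable inside the query expression
min_age = 25
filtered = df.query("Age > @min_age and City == 'Paris'")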
Grouping Data
Grouping allows you to summarize or aggregate data based on one or more columns.
Example: Group by a single column
grouped = df.groupby("City")["Age"].mean()
print(grouped)
Group by multiple columns
grouped_multi = df.groupby(["City", "Gender"])["Salary"].sum()
print(grouped_multi)
Aggregation Functions
| Function | Description |
|---|---|
| mean() | Average value |
| sum() | Sum of values |
| count() | Number of values |
| min() / max() | Minimum / Maximum value |
Example:
df.groupby("Department")["Salary"].agg(["mean", "max", "min"])
Merging and Joining Data
When working with multiple datasets, you often need to combine them.
pandas provides powerful tools for merging, joining, and concatenating data.
1. Merging DataFrames (like SQL JOIN)
df1 = pd.DataFrame({"ID": [1, 2, 3], "Name": ["Alice", "Bob", "Charlie"]})
df2 = pd.DataFrame({"ID": [1, 2, 4], "Age": [25, 30, 22]})
merged = pd.merge(df1, df2, on="ID", how="inner")
print(merged)
Join types:
| Type | Description |
|---|---|
| inner | Keep only matching rows |
| left | Keep all rows from the left DataFrame |
| right | Keep all rows from the right DataFrame |
| outer | Keep all rows from both DataFrames |
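Using the df1 and df2 defined above, the sketch below shows how the join type changes the result: a left join keeps all three names and fills the missing age with NaN, while an outer join keeps all four IDs:
# Left join: keep every row of df1
left_merged = pd.merge(df1, df2, on="ID", how="left")
# Outer join: keep all IDs from both DataFrames
outer_merged = pd.merge(df1, df2, on="ID", how="outer")
print(outer_merged)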
2. Concatenating DataFrames
# Stack DataFrames vertically (columns that exist in only one DataFrame are filled with NaN)
df_combined = pd.concat([df1, df2], axis=0)
# Combine DataFrames side by side, aligned on the row index
df_side_by_side = pd.concat([df1, df2], axis=1)
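Two optional parameters are often useful here: ignore_index rebuilds a clean row index after stacking, and keys labels each source DataFrame:
# Reset the row index after stacking (avoids duplicate index labels)
df_stacked = pd.concat([df1, df2], axis=0, ignore_index=True)
# Label each source with a key, producing a hierarchical index
df_labeled = pd.concat([df1, df2], keys=["first", "second"])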
Summary
| Concept | Description |
|---|---|
| Loading CSV | Use pd.read_csv() to load tabular data |
| Handling Missing Data | Detect with isnull(), clean with dropna() or fillna() |
| Filtering Data | Select rows using conditions or query() |
| Grouping Data | Summarize using groupby() and aggregation functions |
| Merging Data | Combine multiple datasets with merge() or concat() |
By mastering these pandas tools, you can transform raw data into clean, organized, and analysis-ready datasets, a critical skill in every Data Science project!