Data Cleaning & Manipulation in Python
Introduction
In Data Science, data cleaning and manipulation are essential steps before any analysis or modeling.
Real-world data is often messy: it may contain missing values, duplicates, or inconsistent formats.
Using Python's pandas library, we can efficiently load, clean, and prepare datasets for further processing.
Loading CSV Files with pandas
The pandas library provides an easy way to read CSV (Comma-Separated Values) files into a DataFrame, which is a tabular data structure similar to an Excel spreadsheet.
Example: Reading a CSV file
import pandas as pd
# Load a CSV file
df = pd.read_csv("data.csv")
# Display the first 5 rows
print(df.head())
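After loading, it is usually worth a quick look at the DataFrame's size, column types, and basic statistics before cleaning. A minimal sketch, assuming df was loaded as above:
# Show the number of rows and columns
print(df.shape)
# Show column names, data types, and non-null counts
df.info()
# Show summary statistics for the numeric columns
print(df.describe())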
Useful Parameters
| Parameter | Description |
|---|---|
| sep | Change the delimiter (e.g., sep=';') |
| header | Define which row to use as column names |
| index_col | Use a column as the DataFrame index |
| usecols | Select specific columns to load |
Example with options:
df = pd.read_csv("data.csv", sep=";", usecols=["Name", "Age"], index_col="Name")
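read_csv accepts many other options as well. The sketch below combines a few common ones (parse_dates, dtype, na_values); the Date column and the specific na_values strings are only illustrative assumptions about the file:
df = pd.read_csv(
    "data.csv",
    sep=";",
    parse_dates=["Date"],     # parse the 'Date' column as datetime (assumed column)
    dtype={"Age": "Int64"},   # load 'Age' as a nullable integer type
    na_values=["N/A", "-"],   # treat these strings as missing values
)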
Handling Missing Data
Missing data can cause incorrect analyses or model errors, so it's important to handle it properly.
1. Detecting Missing Data
# Check for missing values
print(df.isnull().sum())
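Raw counts are easier to interpret as proportions when the dataset is large. A small sketch building on the same idea:
# Fraction of missing values per column (True/False averaged as 1/0)
print(df.isnull().mean())
# Columns that contain at least one missing value
print(df.columns[df.isnull().any()])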
2. Removing Missing Data
# Drop rows with missing values
df_clean = df.dropna()
# Drop columns with missing values
df_clean_cols = df.dropna(axis=1)
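dropna() also takes subset and thresh parameters for finer control; the column names below are assumptions:
# Drop rows only when specific columns are missing
df_subset = df.dropna(subset=["Age", "City"])
# Keep rows that have at least 3 non-missing values
df_thresh = df.dropna(thresh=3)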
3. Filling Missing Data
# Fill missing values with a specific value
df_filled = df.fillna(0)
# Fill missing values with the column mean
df["Age"] = df["Age"].fillna(df["Age"].mean())
Common strategies:
- Replace missing values with mean, median, or mode
- Remove rows or columns with too many missing entries
- Use interpolation for numerical data (see the sketch below)
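The sketch below illustrates the strategies from the list above; the Salary, City, and Temperature column names are only assumptions:
# Fill a numeric column with its median (more robust to outliers than the mean)
df["Salary"] = df["Salary"].fillna(df["Salary"].median())
# Fill a categorical column with its mode (most frequent value)
df["City"] = df["City"].fillna(df["City"].mode()[0])
# Linearly interpolate a numeric column
df["Temperature"] = df["Temperature"].interpolate()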
Filtering Data
Filtering allows you to extract specific rows based on conditions.
Examples:
# Select rows where Age > 30
adults = df[df["Age"] > 30]
# Select rows where City == 'London'
london_data = df[df["City"] == "London"]
# Combine multiple conditions
filtered = df[(df["Age"] > 25) & (df["City"] == "Paris")]
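Two other common patterns are matching against a list of values with isin() and matching substrings with str.contains(); the Name column here is an assumption:
# Keep rows whose City is in a given list
selected_cities = df[df["City"].isin(["London", "Paris"])]
# Keep rows whose Name contains a substring (na=False skips missing values)
smiths = df[df["Name"].str.contains("Smith", na=False)]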
Using the query() Method
filtered = df.query("Age > 25 and City == 'Paris'")
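query() can also reference Python variables with the @ prefix, which keeps thresholds out of the query string:
# Reference a local variable inside the query expression
min_age = 25
filtered = df.query("Age > @min_age and City == 'Paris'")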
Grouping Data
Grouping allows you to summarize or aggregate data based on one or more columns.
Example: Group by a single column
grouped = df.groupby("City")["Age"].mean()
print(grouped)
Group by multiple columns
grouped_multi = df.groupby(["City", "Gender"])["Salary"].sum()
print(grouped_multi)
Aggregation Functions
| Function | Description |
|---|---|
| mean() | Average value |
| sum() | Sum of values |
| count() | Number of values |
| min() / max() | Minimum / Maximum value |
Example:
df.groupby("Department")["Salary"].agg(["mean", "max", "min"])
Merging and Joining Data
When working with multiple datasets, you often need to combine them.
pandas provides powerful tools for merging, joining, and concatenating data.
1. Merging DataFrames (like SQL JOIN)
df1 = pd.DataFrame({"ID": [1, 2, 3], "Name": ["Alice", "Bob", "Charlie"]})
df2 = pd.DataFrame({"ID": [1, 2, 4], "Age": [25, 30, 22]})
merged = pd.merge(df1, df2, on="ID", how="inner")
print(merged)
Join types:
| Type | Description |
|---|---|
| inner | Keep only matching rows |
| left | Keep all rows from the left DataFrame |
| right | Keep all rows from the right DataFrame |
| outer | Keep all rows from both DataFrames |
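Using the df1 and df2 defined above, the sketch below shows how the join type changes the result: a left join keeps all three names and fills the missing age with NaN, while an outer join keeps all four IDs:
# Left join: keep every row of df1
left_merged = pd.merge(df1, df2, on="ID", how="left")
# Outer join: keep all IDs from both DataFrames
outer_merged = pd.merge(df1, df2, on="ID", how="outer")
print(outer_merged)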
2. Concatenating DataFrames
# Stack DataFrames vertically (columns that exist in only one DataFrame are filled with NaN)
df_combined = pd.concat([df1, df2], axis=0)
# Combine DataFrames side by side, aligned on the row index
df_side_by_side = pd.concat([df1, df2], axis=1)
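Two optional parameters are often useful here: ignore_index rebuilds a clean row index after stacking, and keys labels each source DataFrame:
# Reset the row index after stacking (avoids duplicate index labels)
df_stacked = pd.concat([df1, df2], axis=0, ignore_index=True)
# Label each source with a key, producing a hierarchical index
df_labeled = pd.concat([df1, df2], keys=["first", "second"])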
Summary
| Concept | Description |
|---|---|
| Loading CSV | Use pd.read_csv() to load tabular data |
| Handling Missing Data | Detect with isnull(), clean with dropna() or fillna() |
| Filtering Data | Select rows using conditions or query() |
| Grouping Data | Summarize using groupby() and aggregation functions |
| Merging Data | Combine multiple datasets with merge() or concat() |
By mastering these pandas tools, you can transform raw data into clean, organized, and analysis-ready datasets, a critical skill in every Data Science project!