Gridscript

🧹 Data Cleaning & Manipulation in Python

📘 Introduction

In Data Science, data cleaning and manipulation are essential steps before any analysis or modeling.
Real-world data is often messy: it may contain missing values, duplicates, or inconsistent formats.
Using Python's pandas library, we can efficiently load, clean, and prepare datasets for further processing.

📂 Loading CSV Files with pandas

The pandas library provides an easy way to read CSV (Comma-Separated Values) files into a DataFrame, which is a tabular data structure similar to an Excel spreadsheet.

Example: Reading a CSV file

import pandas as pd

# Load a CSV file
df = pd.read_csv("data.csv")

# Display the first 5 rows
print(df.head())

Useful Parameters

Parameter   Description
sep         Change the delimiter (e.g., sep=';')
header      Define which row to use as column names
index_col   Use a column as the DataFrame index
usecols     Select specific columns to load

Example with options:

df = pd.read_csv("data.csv", sep=";", usecols=["Name", "Age"], index_col="Name")
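To try these options without a file on disk, here is a minimal sketch that reads from an in-memory string instead. The CSV contents and the names csv_text and df_opts are made up for illustration; only the read_csv parameters come from the table above.

```python
import io

import pandas as pd

# Hypothetical semicolon-delimited data standing in for "data.csv"
csv_text = "Name;Age;City\nAlice;25;London\nBob;30;Paris\n"

df_opts = pd.read_csv(
    io.StringIO(csv_text),    # any file-like object works in place of a path
    sep=";",                  # fields are separated by semicolons
    header=0,                 # the first row holds the column names
    usecols=["Name", "Age"],  # skip the City column while loading
    index_col="Name",         # use Name as the row index
)

print(df_opts)
```

Because City is excluded by usecols and Name becomes the index, the resulting DataFrame has a single Age column.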

🩹 Handling Missing Data

Missing data can cause incorrect analyses or model errors, so it's important to handle it properly.

1. Detecting Missing Data

# Check for missing values
print(df.isnull().sum())
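As a small self-contained illustration, the sketch below counts missing values per column and then pulls out the affected rows. The people DataFrame and its values are invented for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with a few gaps
people = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, np.nan, 35],
    "City": ["London", None, None],
})

# Number of missing values in each column
missing_per_column = people.isnull().sum()
print(missing_per_column)

# Rows that contain at least one missing value
rows_with_missing = people[people.isnull().any(axis=1)]
print(rows_with_missing)
```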

2. Removing Missing Data

# Drop rows with missing values
df_clean = df.dropna()

# Drop columns with missing values
df_clean_cols = df.dropna(axis=1)
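Dropping every row or column that has any gap can discard a lot of data. One possible refinement, sketched here on invented data, is the subset parameter of dropna(), which only checks the columns you name.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data
people = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, np.nan, 35],
    "City": ["London", "Paris", None],
})

# Drop a row only when Age is missing; gaps in other columns are tolerated
rows_with_age = people.dropna(subset=["Age"])

# For comparison: the stricter variants shown above
complete_rows = people.dropna()        # only Alice has no gaps
complete_cols = people.dropna(axis=1)  # only Name has no gaps

print(rows_with_age)
```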

3. Filling Missing Data

# Fill missing values with a specific value
df_filled = df.fillna(0)

# Fill missing values with the column mean
df["Age"] = df["Age"].fillna(df["Age"].mean())

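Common strategies: drop the affected rows or columns, fill with a fixed value, or fill with a column statistic such as the mean.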
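Which fill value is appropriate depends on the data. As a rough sketch on an invented Age column, the example below compares the mean with two alternatives that are assumptions beyond the examples above: the median, which is less sensitive to outliers, and a forward fill, which repeats the last observed value.

```python
import numpy as np
import pandas as pd

# Hypothetical column with two gaps and one large value
ages = pd.Series([20.0, np.nan, 40.0, np.nan, 90.0], name="Age")

filled_mean = ages.fillna(ages.mean())      # gaps become 50.0
filled_median = ages.fillna(ages.median())  # gaps become 40.0
filled_ffill = ages.ffill()                 # gaps copy the previous value

print(pd.DataFrame({
    "mean": filled_mean,
    "median": filled_median,
    "ffill": filled_ffill,
}))
```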

🔍 Filtering Data

Filtering allows you to extract specific rows based on conditions.

Examples:

# Select rows where Age > 30
adults = df[df["Age"] > 30]

# Select rows where City == 'London'
london_data = df[df["City"] == "London"]

# Combine multiple conditions
filtered = df[(df["Age"] > 25) & (df["City"] == "Paris")]

Using Query Method

filtered = df.query("Age > 25 and City == 'Paris'")
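The sketch below, on invented data, checks that a boolean mask and query() select the same rows. It also shows two things not covered above: referencing a Python variable inside query() with @, and combining conditions with | (or) and isin().

```python
import pandas as pd

# Hypothetical sample data
people = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "Dana"],
    "Age": [24, 31, 28, 45],
    "City": ["Paris", "London", "Paris", "Berlin"],
})

min_age = 25

# Boolean mask: each condition needs its own parentheses
mask_result = people[(people["Age"] > min_age) & (people["City"] == "Paris")]

# Same filter with query(); @min_age refers to the Python variable
query_result = people.query("Age > @min_age and City == 'Paris'")

# "Or" condition: older than 40, or living in one of the listed cities
either = people[(people["Age"] > 40) | people["City"].isin(["London"])]

print(mask_result)
print(either)
```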

📊 Grouping Data

Grouping allows you to summarize or aggregate data based on one or more columns.

Example: Group by a single column

grouped = df.groupby("City")["Age"].mean()
print(grouped)

Group by multiple columns

grouped_multi = df.groupby(["City", "Gender"])["Salary"].sum()
print(grouped_multi)

Aggregation Functions

Function        Description
mean()          Average value
sum()           Sum of values
count()         Number of values
min() / max()   Minimum / Maximum value

Example:

print(df.groupby("Department")["Salary"].agg(["mean", "max", "min"]))
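When the result needs descriptive column names, pandas also supports named aggregation. The following is a minimal sketch on invented data; the staff DataFrame and the output column names are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical sample data
staff = pd.DataFrame({
    "Department": ["Eng", "Eng", "Sales", "Sales"],
    "Salary": [100, 120, 80, 90],
})

# Each keyword becomes an output column: name=(source column, function)
summary = staff.groupby("Department").agg(
    avg_salary=("Salary", "mean"),
    max_salary=("Salary", "max"),
    headcount=("Salary", "count"),
)

print(summary)
```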

🔗 Merging and Joining Data

When working with multiple datasets, you often need to combine them.
pandas provides powerful tools for merging, joining, and concatenating data.

1. Merging DataFrames (like SQL JOIN)

df1 = pd.DataFrame({"ID": [1, 2, 3], "Name": ["Alice", "Bob", "Charlie"]})
df2 = pd.DataFrame({"ID": [1, 2, 4], "Age": [25, 30, 22]})

merged = pd.merge(df1, df2, on="ID", how="inner")
print(merged)

Join types:

Type    Description
inner   Keep only matching rows
left    Keep all rows from the left DataFrame
right   Keep all rows from the right DataFrame
outer   Keep all rows from both DataFrames
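To see how the join types differ in practice, this sketch reuses the two small DataFrames from the merge example and counts the rows each type returns. The indicator=True option is an addition not shown above; it adds a _merge column recording where each row came from.

```python
import pandas as pd

df1 = pd.DataFrame({"ID": [1, 2, 3], "Name": ["Alice", "Bob", "Charlie"]})
df2 = pd.DataFrame({"ID": [1, 2, 4], "Age": [25, 30, 22]})

# Number of rows produced by each join type
row_counts = {
    how: len(pd.merge(df1, df2, on="ID", how=how))
    for how in ["inner", "left", "right", "outer"]
}
print(row_counts)

# Outer join with a _merge column: "both", "left_only" or "right_only"
outer = pd.merge(df1, df2, on="ID", how="outer", indicator=True)
print(outer)
```

IDs 1 and 2 appear in both frames, ID 3 only on the left, and ID 4 only on the right, which is why the inner join returns fewer rows than the others.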

2. Concatenating DataFrames

# Stack DataFrames vertically (columns missing from one frame become NaN)
df_combined = pd.concat([df1, df2], axis=0)

# Combine DataFrames side by side, aligned on the row index
df_side_by_side = pd.concat([df1, df2], axis=1)
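Vertical concatenation keeps each frame's original index, so labels can repeat. A minimal sketch, assuming two invented monthly tables with identical columns, shows the effect and how ignore_index=True renumbers the rows.

```python
import pandas as pd

# Hypothetical monthly tables with the same columns
jan = pd.DataFrame({"ID": [1, 2], "Sales": [100, 150]})
feb = pd.DataFrame({"ID": [3, 4], "Sales": [120, 130]})

# Original index labels are kept, so 0 and 1 each appear twice
stacked = pd.concat([jan, feb], axis=0)

# ignore_index=True assigns a fresh 0..n-1 index
stacked_reset = pd.concat([jan, feb], axis=0, ignore_index=True)

print(stacked)
print(stacked_reset)
```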

🧠 Summary

Concept                 Description
Loading CSV             Use pd.read_csv() to load tabular data
Handling Missing Data   Detect with isnull(), clean with dropna() or fillna()
Filtering Data          Select rows using conditions or query()
Grouping Data           Summarize using groupby() and aggregation functions
Merging Data            Combine multiple datasets with merge() or concat()

By mastering these pandas tools, you can transform raw data into clean, organized, and analysis-ready datasets, a critical skill in every Data Science project!