Feature Engineering
Introduction
Feature Engineering is the process of preparing and transforming raw data into meaningful features that improve the performance of machine learning models.
The quality and representation of your features can be more important than the choice of algorithm itself.
Goal: Help models learn patterns better by refining, scaling, or encoding input features.
Scaling Data
Concept
Different features can have different ranges (e.g., age from 0–100, income from 1,000–100,000).
Some algorithms (like SVMs, KNN, and gradient descent-based models) perform poorly if features are on different scales.
Scaling puts all features on a comparable range so that no single feature dominates simply because of its units.
Common Scaling Methods
| Method | Description | Example |
|---|---|---|
| Standardization (Z-score) | Transforms data to have mean = 0 and std = 1 | x' = (x - μ) / σ |
| Min-Max Scaling | Scales values to a fixed range, usually [0,1] | x' = (x - min) / (max - min) |
| Robust Scaling | Reduces the effect of outliers by using the median and IQR; useful for skewed data | x' = (x - median) / IQR |
Example in Python
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
import numpy as np
data = np.array([[10], [20], [30], [40], [50]])
# Standard scaling
scaler = StandardScaler()
standard_scaled = scaler.fit_transform(data)
# Min-max scaling
minmax_scaled = MinMaxScaler().fit_transform(data)
# Robust scaling
robust_scaled = RobustScaler().fit_transform(data)
print("Standard Scaled:\n", standard_scaled)
print("Min-Max Scaled:\n", minmax_scaled)
print("Robust Scaled:\n", robust_scaled)
Encoding Categorical Data
Concept
Machine learning models typically work with numerical data.
Encoding converts categorical (text) data into numerical format.
Types of Encoding
| Encoding Type | Description | Example |
|---|---|---|
| Label Encoding | Assigns each category a numeric label | Red=0, Green=1, Blue=2 |
| One-Hot Encoding | Creates binary columns for each category | Red=[1,0,0], Green=[0,1,0], Blue=[0,0,1] |
| Ordinal Encoding | Encodes categories with an inherent order | Low=1, Medium=2, High=3 |
Example in Python
import pandas as pd
from sklearn.preprocessing import LabelEncoder
# Example data
df = pd.DataFrame({'Color': ['Red', 'Green', 'Blue', 'Green']})
# Label Encoding
label_encoder = LabelEncoder()
df['Color_Label'] = label_encoder.fit_transform(df['Color'])
# One-Hot Encoding
onehot_df = pd.get_dummies(df['Color'], prefix='Color')
print(df)
print(onehot_df)
Tip:
- Use Ordinal Encoding for ordinal features (where order matters), supplying the category order explicitly; plain label encoding assigns integers alphabetically, which may not match the intended ranking (see the sketch below).
- Use One-Hot Encoding for nominal features (where order doesn't matter).
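A minimal sketch of ordinal encoding with scikit-learn's OrdinalEncoder, using a hypothetical 'Size' column (the column name and category order are assumptions for illustration):
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder
df = pd.DataFrame({'Size': ['Low', 'High', 'Medium', 'Low']})
# Pass the categories explicitly so Low < Medium < High maps to 0 < 1 < 2
ordinal_encoder = OrdinalEncoder(categories=[['Low', 'Medium', 'High']])
df['Size_Ordinal'] = ordinal_encoder.fit_transform(df[['Size']])
print(df)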
Transforming Data
Concept
Transformations help fix skewed data distributions, stabilize variance, or make data more Gaussian-like.
Some models (like linear regression) assume normally distributed residuals, and many models benefit when features are not heavily skewed.
Common Transformations
| Method | Description |
|---|---|
| Log Transform | Reduces right-skewed data |
| Square Root Transform | Reduces moderate skew |
| Box-Cox / Yeo-Johnson | Box-Cox requires strictly positive values; Yeo-Johnson also handles zero and negative values |
Example in Python
import numpy as np
from sklearn.preprocessing import PowerTransformer
data = np.array([[1], [2], [3], [10], [100]])
# Log transform
log_data = np.log1p(data)
# Power transform (Yeo-Johnson)
pt = PowerTransformer(method='yeo-johnson')
transformed = pt.fit_transform(data)
print("Log Transformed:\n", log_data)
print("Yeo-Johnson Transformed:\n", transformed)
Selecting Important Features
Concept
Feature selection helps reduce dimensionality, improve model performance, and prevent overfitting by keeping only the most relevant features.
Why It Matters
- Removes redundant or irrelevant data
- Reduces overfitting risk
- Speeds up model training and inference
Common Feature Selection Methods
| Method | Description | Example Tools |
|---|---|---|
| Filter Methods | Use statistical tests to rank features | SelectKBest, correlation analysis |
| Wrapper Methods | Use model performance to select subsets | RFE (Recursive Feature Elimination) |
| Embedded Methods | Feature selection occurs during model training | Lasso Regression, Random Forest Importance |
Example: Using SelectKBest
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif
# Load dataset
X, y = load_iris(return_X_y=True)
# Select top 2 features
selector = SelectKBest(score_func=f_classif, k=2)
X_new = selector.fit_transform(X, y)
print("Selected features shape:", X_new.shape)
Example: Feature Importance with Random Forest
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
import pandas as pd
X, y = load_iris(return_X_y=True)
model = RandomForestClassifier()
model.fit(X, y)
# Get feature importance
importances = model.feature_importances_
features = load_iris().feature_names
importance_df = pd.DataFrame({'Feature': features, 'Importance': importances})
print(importance_df.sort_values(by='Importance', ascending=False))
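The methods table above also lists wrapper methods such as RFE (Recursive Feature Elimination). A minimal sketch that keeps two of the four Iris features, using logistic regression as the underlying estimator (the estimator choice is an assumption for illustration):
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
X, y = load_iris(return_X_y=True)
# RFE repeatedly fits the estimator and drops the weakest feature until the requested number remain
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=2)
rfe.fit(X, y)
print("Selected feature mask:", rfe.support_)
print("Feature ranking:", rfe.ranking_)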
Summary
| Concept | Description | Example |
|---|---|---|
| Scaling | Normalize feature ranges for consistency | StandardScaler, MinMaxScaler |
| Encoding | Convert categorical data into numbers | LabelEncoder, OneHotEncoder |
| Transforming | Adjust data distributions | Log, Box-Cox, Yeo-Johnson |
| Feature Selection | Keep only important features | SelectKBest, RFE, Feature Importance |
Feature Engineering is one of the most critical steps in any Machine Learning pipeline.
By scaling, encoding, transforming, and selecting the right features, you enable your model to learn patterns more effectively and perform better on unseen data.