
βš™οΈ Feature Engineering

πŸ“˜ Introduction

Feature Engineering is the process of preparing and transforming raw data into meaningful features that improve the performance of machine learning models.
The quality and representation of your features can be more important than the choice of algorithm itself.

Goal: Help models learn patterns better by refining, scaling, or encoding input features.

πŸ”§ Scaling Data

Concept

Different features can have different ranges (e.g., age from 0–100, income from 1,000–100,000).
Some algorithms (like SVMs, KNN, and gradient descent-based models) can perform poorly if features are on very different scales.
Scaling puts all features on comparable ranges so that no single feature dominates simply because of its units.

Common Scaling Methods

| Method | Description | Example |
| --- | --- | --- |
| Standardization (Z-score) | Transforms data to have mean = 0 and std = 1 | x' = (x - ΞΌ) / Οƒ |
| Min-Max Scaling | Scales values to a fixed range, usually [0, 1] | x' = (x - min) / (max - min) |
| Robust Scaling | Reduces the effect of outliers by using the median and IQR | Useful for skewed data |

Example in Python

from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
import numpy as np

data = np.array([[10], [20], [30], [40], [50]])

# Standard scaling
scaler = StandardScaler()
standard_scaled = scaler.fit_transform(data)

# Min-max scaling
minmax_scaled = MinMaxScaler().fit_transform(data)

# Robust scaling
robust_scaled = RobustScaler().fit_transform(data)

print("Standard Scaled:\n", standard_scaled)
print("Min-Max Scaled:\n", minmax_scaled)
print("Robust Scaled:\n", robust_scaled)

πŸ”€ Encoding Categorical Data

Concept

Machine learning models typically work with numerical data.
Encoding converts categorical (text) data into numerical format.

Types of Encoding

| Encoding Type | Description | Example |
| --- | --- | --- |
| Label Encoding | Assigns each category a numeric label | Red=0, Green=1, Blue=2 |
| One-Hot Encoding | Creates binary columns for each category | Red=[1,0,0], Green=[0,1,0], Blue=[0,0,1] |
| Ordinal Encoding | Encodes categories with an inherent order | Low=1, Medium=2, High=3 |

Example in Python

import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Example data
df = pd.DataFrame({'Color': ['Red', 'Green', 'Blue', 'Green']})

# Label Encoding
label_encoder = LabelEncoder()
df['Color_Label'] = label_encoder.fit_transform(df['Color'])

# One-Hot Encoding
onehot_df = pd.get_dummies(df['Color'], prefix='Color')

print(df)
print(onehot_df)

Tip: Use One-Hot Encoding for nominal categories with no natural order. Label Encoding assigns arbitrary numbers that some models may read as a ranking, so reserve numeric labels for genuinely ordered categories (see the Ordinal Encoding sketch below) or for tree-based models.
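
For ordinal categories, scikit-learn's OrdinalEncoder lets you pass the category order explicitly. A minimal sketch follows; the 'Size' column and its ordering are made-up illustration data.

import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df_sizes = pd.DataFrame({'Size': ['Low', 'High', 'Medium', 'Low']})

# Pass the desired order explicitly so that Low < Medium < High
ordinal_encoder = OrdinalEncoder(categories=[['Low', 'Medium', 'High']])
df_sizes['Size_Ordinal'] = ordinal_encoder.fit_transform(df_sizes[['Size']]).ravel()

print(df_sizes)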

πŸ”„ Transforming Data

Concept

Transformations help fix skewed data distributions, stabilize variance, or make data more Gaussian-like.
Some models work better with roughly Gaussian inputs; linear regression, for example, assumes normally distributed errors (residuals), so transforming heavily skewed features often helps.

Common Transformations

| Method | Description |
| --- | --- |
| Log Transform | Reduces right-skewed data |
| Square Root Transform | Reduces moderate skew |
| Box-Cox / Yeo-Johnson | Box-Cox requires strictly positive values; Yeo-Johnson handles any values |

Example in Python

import numpy as np
from sklearn.preprocessing import PowerTransformer

data = np.array([[1], [2], [3], [10], [100]])

# Log transform
log_data = np.log1p(data)

# Power transform (Yeo-Johnson)
pt = PowerTransformer(method='yeo-johnson')
transformed = pt.fit_transform(data)

print("Log Transformed:\n", log_data)
print("Yeo-Johnson Transformed:\n", transformed)

🧠 Selecting Important Features

Concept

Feature selection helps reduce dimensionality, improve model performance, and prevent overfitting by keeping only the most relevant features.

Why It Matters

Fewer, more informative features mean faster training, a lower risk of overfitting, and models that are easier to interpret. Irrelevant or redundant features mostly add noise that can hurt performance on unseen data.

Common Feature Selection Methods

| Method | Description | Example Tools |
| --- | --- | --- |
| Filter Methods | Use statistical tests to rank features | SelectKBest, correlation analysis |
| Wrapper Methods | Use model performance to select subsets | RFE (Recursive Feature Elimination) |
| Embedded Methods | Feature selection occurs during model training | Lasso Regression, Random Forest Importance |

Example: Using SelectKBest

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

# Load dataset
X, y = load_iris(return_X_y=True)

# Select top 2 features
selector = SelectKBest(score_func=f_classif, k=2)
X_new = selector.fit_transform(X, y)

print("Selected features shape:", X_new.shape)

Example: Feature Importance with Random Forest

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
import pandas as pd

X, y = load_iris(return_X_y=True)
model = RandomForestClassifier()
model.fit(X, y)

# Get feature importance
importances = model.feature_importances_
features = load_iris().feature_names
importance_df = pd.DataFrame({'Feature': features, 'Importance': importances})

print(importance_df.sort_values(by='Importance', ascending=False))
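
Example: Embedded Selection with SelectFromModel

Embedded methods from the table above can be sketched with L1 regularization (the idea behind Lasso) driving the selection through SelectFromModel. Because iris is a classification task, an L1-penalized logistic regression stands in for Lasso here, and the C=0.1 strength is an illustrative assumption.

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# L1 regularization shrinks some coefficients to zero; those features are dropped
l1_model = LogisticRegression(penalty='l1', solver='liblinear', C=0.1)
selector = SelectFromModel(l1_model)
X_selected = selector.fit_transform(X, y)

print("Kept features shape:", X_selected.shape)
print("Selected feature mask:", selector.get_support())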

🧩 Summary

| Concept | Description | Example |
| --- | --- | --- |
| Scaling | Normalize feature ranges for consistency | StandardScaler, MinMaxScaler |
| Encoding | Convert categorical data into numbers | LabelEncoder, OneHotEncoder |
| Transforming | Adjust data distributions | Log, Box-Cox, Yeo-Johnson |
| Feature Selection | Keep only important features | SelectKBest, RFE, Feature Importance |

Feature Engineering is one of the most critical steps in any Machine Learning pipeline.
By scaling, encoding, transforming, and selecting the right features, you enable your model to learn patterns more effectively and perform better on unseen data.