🤖 Introduction to Machine Learning

📘 What Is Machine Learning?

Machine Learning (ML) is a branch of Artificial Intelligence (AI) that enables computers to learn from data and make predictions or decisions without being explicitly programmed.
Instead of following fixed rules, ML systems identify patterns in data and improve automatically with experience.

Key idea: Provide data ➜ let the computer learn ➜ make predictions on new data.

Examples of Machine Learning Applications

Predicting house prices 🏠
Spam email detection 📧
Recommending movies 🎬
Image recognition 🖼️
Fraud detection 💳

🧠 Supervised vs. Unsupervised Learning

This introduction covers the two most common categories of Machine Learning algorithms: supervised and unsupervised learning.

1. Supervised Learning

Supervised learning uses labeled data — the input data comes with the correct answers (targets).
The goal is to learn a mapping from inputs (X) to outputs (y).

Examples:

Predicting house prices based on area and location
Classifying emails as “spam” or “not spam”

Types of supervised learning:

Type	Description	Example
Regression	Predicts continuous values	Predicting prices, temperatures
Classification	Predicts categories or labels	Spam detection, sentiment analysis

Example using scikit-learn:

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Example data
X = [[1000], [1500], [2000], [2500]]
y = [200000, 250000, 300000, 350000]

# Split data into train/test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Train model
model = LinearRegression()
model.fit(X_train, y_train)

# Predict
predictions = model.predict(X_test)
print(predictions)
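
The example above is a regression task. For comparison, a classification task can be sketched in the same way. This is a minimal illustration, not a recommended setup: the toy data (hours studied versus pass/fail) is made up, and LogisticRegression is just one of several classifiers scikit-learn provides.

```python
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: hours studied -> passed the exam (1) or not (0)
X = [[1], [2], [3], [4], [6], [7], [8], [9]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

# Train a classifier on the labeled examples
clf = LogisticRegression()
clf.fit(X, y)

# Predict class labels for two new inputs
labels = clf.predict([[2], [8]])
print(labels)

# predict_proba returns one probability per class for each input
probabilities = clf.predict_proba([[2], [8]])
print(probabilities)
```

The output of predict is a category label rather than a continuous number, which is the practical difference between classification and regression.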

2. Unsupervised Learning

Unsupervised learning uses unlabeled data — there are no predefined outputs.
The goal is to find patterns or structure within the data.

Examples:

Grouping customers by purchase behavior (clustering)
Reducing dataset size while keeping essential information (dimensionality reduction)

Types of unsupervised learning:

Type	Description	Example
Clustering	Groups similar data points	Customer segmentation
Dimensionality Reduction	Simplifies data by reducing variables	PCA (Principal Component Analysis)

Example using KMeans clustering:

from sklearn.cluster import KMeans
import numpy as np

data = np.array([[1, 2], [1, 4], [1, 0],
                 [10, 2], [10, 4], [10, 0]])

# n_init is set explicitly because its default differs between scikit-learn versions
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
kmeans.fit(data)

print("Cluster centers:", kmeans.cluster_centers_)
print("Labels:", kmeans.labels_)
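
Dimensionality reduction, the second type listed in the table above, can be sketched with PCA. The data here is hypothetical and built so that the third feature is the sum of the first two, which means two components should be enough to keep essentially all of the variance.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical data: 6 samples with 3 features; the third feature is
# the sum of the first two, so it carries no new information
data = np.array([[1.0, 2.0, 3.0],
                 [2.0, 1.0, 3.0],
                 [3.0, 4.0, 7.0],
                 [4.0, 3.0, 7.0],
                 [5.0, 6.0, 11.0],
                 [6.0, 5.0, 11.0]])

# Project the 3 features down to 2 components
pca = PCA(n_components=2)
reduced = pca.fit_transform(data)

print("Original shape:", data.shape)
print("Reduced shape:", reduced.shape)

# Fraction of the total variance captured by each component
print("Explained variance ratio:", pca.explained_variance_ratio_)
```

On real data the features are rarely this redundant, so the explained variance ratio is a useful check on how much information the reduction discards.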

🧩 Train-Test Split

To evaluate how well a model performs, we divide our dataset into two parts:

Training set: Used to teach the model (usually 70–80% of the data).
Test set: Used to evaluate how the model performs on unseen data.

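Evaluating on held-out data shows whether the model generalizes to new examples or has merely memorized the training data. The split itself does not make the model better; it gives an honest estimate of its performance.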

Example:

from sklearn.model_selection import train_test_split
import numpy as np

X = np.random.rand(100, 5)
y = np.random.rand(100)

# Split 80% for training, 20% for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("Training set size:", len(X_train))
print("Test set size:", len(X_test))
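
To see why the test set matters, the sketch below trains a model that can memorize its training data and compares its score on both sets. The choices here are assumptions made for illustration: the targets are random noise with no relationship to the features, and DecisionTreeRegressor stands in for any model flexible enough to memorize.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Hypothetical data: the targets are pure noise, unrelated to the features
rng = np.random.default_rng(0)
X = rng.random((100, 5))
y = rng.random(100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# An unconstrained decision tree can memorize every training sample
model = DecisionTreeRegressor(random_state=0)
model.fit(X_train, y_train)

# score() returns R^2: 1.0 is a perfect fit, values near or below 0 are poor
train_score = model.score(X_train, y_train)
test_score = model.score(X_test, y_test)
print("Train R^2:", train_score)
print("Test R^2:", test_score)
```

The training score looks perfect even though there is nothing to learn, while the test score reveals that the model does not generalize. Judging the model on training data alone would hide this.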

📏 Evaluation Metrics

Once a model is trained, we need to measure how well it performs.
The choice of evaluation metric depends on the type of problem (classification or regression). The metrics below apply to classification problems.

1. Accuracy

The fraction of predictions that match the true labels, often reported as a percentage.

Accuracy = (Number of correct predictions) / (Total predictions)

Example: If a model correctly predicts 90 out of 100 test samples,
Accuracy = 90 / 100 = 0.9 (or 90%)

In scikit-learn:

from sklearn.metrics import accuracy_score

y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 1, 0, 0]
print("Accuracy:", accuracy_score(y_true, y_pred))

2. Precision

Precision measures how many of the positive predictions were actually correct.

Precision = True Positives / (True Positives + False Positives)

Use case: Important when false positives are costly (e.g., spam detection).

3. Recall

Recall measures how many of the actual positives were correctly predicted.

Recall = True Positives / (True Positives + False Negatives)

Use case: Important when missing a positive case is costly (e.g., detecting diseases).
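
The three formulas above can be checked by hand. The sketch below counts true/false positives and negatives directly from two label lists and applies each formula; the labels are made up for illustration.

```python
# Hypothetical true labels and predictions (1 = positive, 0 = negative)
y_true = [1, 0, 1, 1, 0, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1, 0, 1]

pairs = list(zip(y_true, y_pred))

# Count each outcome type
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

accuracy = (tp + tn) / len(pairs)
precision = tp / (tp + fp)
recall = tp / (tp + fn)

print("TP, FP, FN, TN:", tp, fp, fn, tn)
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
```

The three metrics differ on the same predictions because each one penalizes a different kind of mistake: precision drops with false positives, recall drops with false negatives, and accuracy counts both equally.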

4. Combined Example in Python

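from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1]

# Here TP = 3, FP = 0, FN = 1
print("Precision:", precision_score(y_true, y_pred))  # 3 / (3 + 0) = 1.0
print("Recall:", recall_score(y_true, y_pred))        # 3 / (3 + 1) = 0.75

# F1 is the harmonic mean of precision and recall:
# F1 = 2 * (Precision * Recall) / (Precision + Recall)
print("F1 Score:", f1_score(y_true, y_pred))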

🧠 Summary

Concept	Description
Machine Learning	Enables systems to learn from data and make predictions
Supervised Learning	Learns from labeled data (e.g., regression, classification)
Unsupervised Learning	Finds hidden patterns in unlabeled data (e.g., clustering)
Train-Test Split	Divides data into training and testing sets to evaluate models
Accuracy	Measures overall correctness
Precision	How many predicted positives were correct
Recall	How many actual positives were found

By understanding these foundational ML concepts, you can start building, training, and evaluating models that learn from data — the heart of modern Artificial Intelligence!