Introduction to Machine Learning
What Is Machine Learning?
Machine Learning (ML) is a branch of Artificial Intelligence (AI) that enables computers to learn from data and make predictions or decisions without being explicitly programmed.
Instead of following fixed rules, ML systems identify patterns in data and improve automatically with experience.
Key idea: Provide data → let the computer learn → make predictions on new data.
Examples of Machine Learning Applications
- Predicting house prices
- Spam email detection
- Recommending movies
- Image recognition
- Fraud detection
Supervised vs. Unsupervised Learning
Machine Learning algorithms are generally classified into two main categories: supervised and unsupervised learning.
1. Supervised Learning
Supervised learning uses labeled data: each input comes with the correct answer (the target).
The goal is to learn a mapping from inputs (X) to outputs (y).
Examples:
- Predicting house prices based on area and location
- Classifying emails as "spam" or "not spam"
Types of supervised learning:
| Type | Description | Example |
|---|---|---|
| Regression | Predicts continuous values | Predicting prices, temperatures |
| Classification | Predicts categories or labels | Spam detection, sentiment analysis |
Example using scikit-learn:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
# Example data
X = [[1000], [1500], [2000], [2500]]
y = [200000, 250000, 300000, 350000]
# Split data into train/test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
# Train model
model = LinearRegression()
model.fit(X_train, y_train)
# Predict
predictions = model.predict(X_test)
print(predictions)
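The regression example above predicts a continuous value. As a counterpart for the classification row of the table, here is a minimal sketch using scikit-learn's `LogisticRegression`. The toy data (a single feature counting suspicious words per email) is hypothetical and invented purely for illustration.

```python
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: one feature per email (count of suspicious words)
X = [[0], [1], [2], [3], [7], [8], [9], [10]]
y = [0, 0, 0, 0, 1, 1, 1, 1]  # 0 = not spam, 1 = spam

# Train a classifier on the labeled examples
clf = LogisticRegression()
clf.fit(X, y)

# Predict class labels for two unseen emails
preds = clf.predict([[1], [9]])
print("Predicted labels:", preds)

# predict_proba returns one column of probabilities per class
probs = clf.predict_proba([[1], [9]])
print("Class probabilities:", probs)
```

The workflow mirrors the regression case (`fit`, then `predict`); only the kind of target changes, from numbers to labels.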
2. Unsupervised Learning
Unsupervised learning uses unlabeled data: there are no predefined outputs.
The goal is to find patterns or structure within the data.
Examples:
- Grouping customers by purchase behavior (clustering)
- Reducing the number of features while keeping essential information (dimensionality reduction)
Types of unsupervised learning:
| Type | Description | Example |
|---|---|---|
| Clustering | Groups similar data points | Customer segmentation |
| Dimensionality Reduction | Simplifies data by reducing variables | PCA (Principal Component Analysis) |
Example using KMeans clustering:
from sklearn.cluster import KMeans
import numpy as np
data = np.array([[1, 2], [1, 4], [1, 0],
[10, 2], [10, 4], [10, 0]])
# n_init is set explicitly so the result does not depend on the library's default
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
kmeans.fit(data)
print("Cluster centers:", kmeans.cluster_centers_)
print("Labels:", kmeans.labels_)
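The table above also lists dimensionality reduction with PCA. The following is a minimal sketch of that idea, assuming synthetic data in which the third feature is almost a copy of the first, so one feature is redundant. The data and variable names are made up for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)

# Hypothetical data: 100 samples, 3 features; x3 nearly duplicates x1
x1 = rng.normal(size=100)
x2 = rng.normal(size=100)
x3 = x1 + 0.01 * rng.normal(size=100)
data = np.column_stack([x1, x2, x3])

# Project the 3 features down to 2 principal components
pca = PCA(n_components=2)
reduced = pca.fit_transform(data)

print("Original shape:", data.shape)
print("Reduced shape:", reduced.shape)
# Fraction of the total variance captured by each kept component
print("Explained variance ratio:", pca.explained_variance_ratio_)
```

Because one feature is redundant here, two components retain almost all of the variance. Real datasets usually lose more information when reduced.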
Train-Test Split
To evaluate how well a model performs, we divide our dataset into two parts:
- Training set: Used to teach the model (usually 70-80% of the data).
- Test set: Used to evaluate how the model performs on unseen data.
This lets us check whether the model generalizes to new data rather than just memorizing the training data.
Example:
from sklearn.model_selection import train_test_split
import numpy as np
X = np.random.rand(100, 5)
y = np.random.rand(100)
# Split 80% for training, 20% for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print("Training set size:", len(X_train))
print("Test set size:", len(X_test))
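To see why the held-out test set matters, here is a rough sketch of a model that memorizes its training data. The choice of `DecisionTreeRegressor` is an assumption made for illustration. The targets are pure random noise, so there is nothing real to learn.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.random((100, 5))
y = rng.random(100)  # pure noise: no real relationship to X

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# An unrestricted tree can grow until it fits every training sample
tree = DecisionTreeRegressor(random_state=0)
tree.fit(X_train, y_train)

# score() returns R^2; compare seen data against unseen data
train_r2 = tree.score(X_train, y_train)
test_r2 = tree.score(X_test, y_test)
print("R^2 on training set:", train_r2)
print("R^2 on test set:", test_r2)
```

The training score looks perfect, while the test score is typically far lower. Only the test score reveals that the model has memorized rather than learned.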
Evaluation Metrics
Once a model is trained, we need to measure how well it performs.
The choice of evaluation metric depends on the type of problem (classification or regression). The metrics in this section apply to classification.
1. Accuracy
The percentage of correctly predicted labels.
Accuracy = (Number of correct predictions) / (Total predictions)
Example:
If a model correctly predicts 90 out of 100 test samples,
Accuracy = 90 / 100 = 0.9 (or 90%)
In scikit-learn:
from sklearn.metrics import accuracy_score
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 1, 0, 0]
print("Accuracy:", accuracy_score(y_true, y_pred))
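The same number can be reproduced directly from the accuracy formula, without the library helper. This small sketch reuses the labels from the example above.

```python
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 1, 0, 0]

# Count positions where the prediction matches the true label
correct = sum(t == p for t, p in zip(y_true, y_pred))

# Accuracy = correct predictions / total predictions
accuracy = correct / len(y_true)

print("Correct predictions:", correct)  # 4
print("Accuracy:", accuracy)            # 0.8
```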
2. Precision
Precision measures how many of the positive predictions were actually correct.
Precision = True Positives / (True Positives + False Positives)
Use case: Important when false positives are costly (e.g., spam detection).
3. Recall
Recall measures how many of the actual positives were correctly predicted.
Recall = True Positives / (True Positives + False Negatives)
Use case: Important when missing a positive case is costly (e.g., detecting diseases).
4. Combined Example in Python
from sklearn.metrics import precision_score, recall_score, f1_score
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1]
print("Precision:", precision_score(y_true, y_pred))
print("Recall:", recall_score(y_true, y_pred))
# F1 is the harmonic mean of precision and recall
print("F1 Score:", f1_score(y_true, y_pred))
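To connect these scores back to the formulas, the sketch below derives them from the raw counts of true and false positives and negatives. It uses the same labels as the combined example and obtains the counts with scikit-learn's `confusion_matrix`. The F1 formula, 2 · precision · recall / (precision + recall), is the standard definition and is not spelled out in the text above.

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1]

# For binary 0/1 labels, ravel() yields counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("TP:", tp, "FP:", fp, "FN:", fn, "TN:", tn)

# Precision = TP / (TP + FP)
precision = tp / (tp + fp)
# Recall = TP / (TP + FN)
recall = tp / (tp + fn)
# F1 = harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

print("Precision:", precision)  # 1.0
print("Recall:", recall)        # 0.75
print("F1 Score:", f1)
```

The model never raises a false alarm here (precision 1.0) but misses one of the four actual positives (recall 0.75), which is why the two metrics are worth reporting separately.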
Summary
| Concept | Description |
|---|---|
| Machine Learning | Enables systems to learn from data and make predictions |
| Supervised Learning | Learns from labeled data (e.g., regression, classification) |
| Unsupervised Learning | Finds hidden patterns in unlabeled data (e.g., clustering) |
| Train-Test Split | Divides data into training and testing sets to evaluate models |
| Accuracy | Measures overall correctness |
| Precision | How many predicted positives were correct |
| Recall | How many actual positives were found |
By understanding these foundational ML concepts, you can start building, training, and evaluating models that learn from data, the heart of modern Artificial Intelligence!