📊 Introduction to Statistics & Probability
📘 What Is Statistics?
Statistics is the branch of mathematics that deals with collecting, analyzing, interpreting, and presenting data.
It helps us make sense of large datasets and draw conclusions or make predictions.
There are two main types of statistics:
- Descriptive Statistics — summarize and describe data (e.g., mean, median, variance).
- Inferential Statistics — use samples to make generalizations about a population.
🎯 What Is Probability?
Probability measures the likelihood of an event occurring.
It ranges from 0 to 1, where:
- 0 means the event cannot happen.
- 1 means the event will definitely happen.
Formula (valid when all outcomes are equally likely):
P(Event) = (Number of favorable outcomes) / (Total number of outcomes)
Example: If you roll a fair 6-sided die, the probability of getting a “3” is:
P(3) = 1 / 6 ≈ 0.1667
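The counting formula above can be sketched in a few lines of Python. This is a minimal illustration that assumes a fair die, so every face is equally likely; the helper name `probability` is made up for this example.

```python
from fractions import Fraction

# Sample space for one fair six-sided die (assumption: faces are equally likely).
outcomes = [1, 2, 3, 4, 5, 6]

def probability(event, sample_space):
    """Return favorable / total as an exact fraction (hypothetical helper)."""
    favorable = [o for o in sample_space if event(o)]
    return Fraction(len(favorable), len(sample_space))

p_three = probability(lambda o: o == 3, outcomes)
p_even = probability(lambda o: o % 2 == 0, outcomes)

print(p_three)  # 1/6
print(p_even)   # 1/2
```

Using `Fraction` keeps the result exact, which makes it easy to compare against the hand calculation.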
📈 Mean, Median, Mode, and Variance
1. Mean (Average)
The mean is the sum of all values divided by the number of values.
data = [10, 20, 30, 40]
mean = sum(data) / len(data)
print(mean) # 25.0
Formula:
Mean = (x₁ + x₂ + ... + xₙ) / n
2. Median
The median is the middle value when the data is sorted.
If there’s an even number of values, it’s the average of the two middle values.
import numpy as np
data = [5, 8, 12, 20, 25]
median = np.median(data)
print(median) # 12.0
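The even-count rule described above is easiest to see when the median is computed by hand. The function below is a sketch for illustration (the name `manual_median` is hypothetical); in practice `np.median` or `statistics.median` does the same job.

```python
def manual_median(values):
    """Median of a non-empty list: the middle value, or the mean of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                        # odd count: single middle value
    return (ordered[mid - 1] + ordered[mid]) / 2   # even count: average the two middle values

print(manual_median([5, 8, 12, 20, 25]))  # 12
print(manual_median([5, 8, 12, 20]))      # 10.0
```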
3. Mode
The mode is the most frequent value in a dataset.
from statistics import mode
data = [1, 2, 2, 3, 4]
print(mode(data)) # 2
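A dataset can have more than one most frequent value. As a small sketch, the standard library's `multimode` returns all of them, while `mode` returns only the first one it encounters (behavior of Python 3.8 and later). The list names here are illustrative.

```python
from statistics import mode, multimode

single_peak = [1, 2, 2, 3, 4]
two_peaks = [1, 1, 2, 2, 3]

print(multimode(single_peak))  # [2]
print(multimode(two_peaks))    # [1, 2]
print(mode(two_peaks))         # 1 (first mode encountered, Python 3.8+)
```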
4. Variance and Standard Deviation
Variance measures how spread out the data is from the mean.
Standard deviation is the square root of variance.
import numpy as np
data = [10, 12, 23, 23, 16, 23, 21, 16]
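variance = np.var(data) # population variance: divides by n (ddof=0 is NumPy's default)
std_dev = np.std(data) # square root of the population variance
print("Variance:", variance) # 24.0
print("Standard Deviation:", std_dev) # about 4.899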
Formulas:
Variance (σ²) = Σ(xᵢ - μ)² / n
Standard Deviation (σ) = √Variance
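The formula above divides by n, which gives the population variance. When the data is a sample from a larger population, dividing by n - 1 (Bessel's correction) is the usual choice. The sketch below computes both by hand for the same dataset; the variable names are illustrative.

```python
import math
import statistics

data = [10, 12, 23, 23, 16, 23, 21, 16]
n = len(data)
mu = sum(data) / n
squared_deviations = sum((x - mu) ** 2 for x in data)

population_variance = squared_deviations / n    # divides by n, as in the formula above
sample_variance = squared_deviations / (n - 1)  # divides by n - 1 (Bessel's correction)
population_std = math.sqrt(population_variance)

print(population_variance)        # 24.0
print(round(sample_variance, 3))  # 27.429
print(round(population_std, 3))   # 4.899
```

With NumPy, the sample versions are available through `np.var(data, ddof=1)` and `np.std(data, ddof=1)`.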
📊 Probability Basics
1. Independent Events
Two events are independent if the outcome of one does not affect the other.
Example: Rolling two dice — the result of one die doesn’t influence the other.
P(A and B) = P(A) × P(B)
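The product rule can be checked by listing all 36 outcomes of two dice. This is a sketch assuming fair dice; the events chosen (first die shows 3, second die is even) and the helper `prob` are just for illustration.

```python
from fractions import Fraction
from itertools import product

faces = range(1, 7)
rolls = list(product(faces, faces))  # 36 equally likely (first, second) pairs

def prob(event):
    """Fraction of rolls for which the event holds (hypothetical helper)."""
    return Fraction(sum(1 for roll in rolls if event(roll)), len(rolls))

p_a = prob(lambda roll: roll[0] == 3)       # A: first die shows 3
p_b = prob(lambda roll: roll[1] % 2 == 0)   # B: second die is even
p_a_and_b = prob(lambda roll: roll[0] == 3 and roll[1] % 2 == 0)

print(p_a, p_b, p_a_and_b)     # 1/6 1/2 1/12
print(p_a_and_b == p_a * p_b)  # True
```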
2. Dependent Events
Two events are dependent if the outcome of one changes the probability of the other.
Example: Drawing cards from a deck without replacement.
P(A and B) = P(A) × P(B|A)
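As a worked instance of this rule, consider drawing two aces in a row from a standard 52-card deck without replacement. The sketch below applies the formula and then cross-checks it by enumerating every ordered pair of cards; the simplified deck representation is an assumption made for brevity.

```python
from fractions import Fraction
from itertools import permutations

# A 52-card deck reduced to what matters here: 4 aces and 48 other cards.
deck = ["ace"] * 4 + ["other"] * 48

# Multiplication rule for dependent events: P(A and B) = P(A) * P(B|A).
p_first_ace = Fraction(4, 52)
p_second_ace_given_first = Fraction(3, 51)  # one ace has already left the deck
p_two_aces = p_first_ace * p_second_ace_given_first

# Cross-check by enumerating every ordered pair of distinct cards.
draws = list(permutations(range(len(deck)), 2))
favorable = sum(1 for i, j in draws if deck[i] == "ace" and deck[j] == "ace")
p_enumerated = Fraction(favorable, len(draws))

print(p_two_aces, p_enumerated)  # 1/221 1/221
```

With replacement the draws would be independent, and the probability would be (4/52) × (4/52) = 1/169 instead.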
3. Mutually Exclusive Events
Two events are mutually exclusive if they cannot happen at the same time.
Example: Getting “Heads” or “Tails” on a single coin flip.
P(A or B) = P(A) + P(B)
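The simple addition rule only holds when the events cannot overlap. The sketch below, assuming one roll of a fair die, shows the rule working for exclusive events and overcounting for overlapping ones, where the shared outcomes must be subtracted once. The sets and the helper `prob` are illustrative.

```python
from fractions import Fraction

faces = {1, 2, 3, 4, 5, 6}

def prob(event):
    """Probability of a set of faces on one fair roll (hypothetical helper)."""
    return Fraction(len(event), len(faces))

# Mutually exclusive: a single roll cannot be both 1 and 2.
a, b = {1}, {2}
print(prob(a | b) == prob(a) + prob(b))  # True

# Not mutually exclusive: 4 and 6 are both even and greater than 3.
even, high = {2, 4, 6}, {4, 5, 6}
print(prob(even | high))                            # 2/3
print(prob(even) + prob(high))                      # 1 (overcounts the overlap)
print(prob(even) + prob(high) - prob(even & high))  # 2/3
```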
🔔 Normal Distribution
The Normal Distribution (or Gaussian Distribution) is a continuous probability distribution that is symmetric around the mean.
Many real-world measurements (like height or test scores) are approximately normally distributed, though plenty of data is skewed or heavy-tailed.
Key Properties:
- Bell-shaped curve
- Mean = Median = Mode
- About 68% of data falls within 1 standard deviation of the mean
- About 95% within 2 standard deviations
- About 99.7% within 3 standard deviations
import numpy as np
import matplotlib.pyplot as plt
mu, sigma = 0, 1 # mean and standard deviation
data = np.random.normal(mu, sigma, 1000)
plt.hist(data, bins=30, density=True, alpha=0.6)
plt.title("Normal Distribution")
plt.xlabel("Value")
plt.ylabel("Density") # density=True normalizes the histogram, so the y-axis is a density, not a count
plt.show()
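The 68 / 95 / 99.7 figures listed under Key Properties can be checked directly from the normal cumulative distribution function. This is a small sketch using the standard library's `statistics.NormalDist` (Python 3.8+); the dictionary name `within` is illustrative.

```python
from statistics import NormalDist

# Standard normal distribution (mean 0, standard deviation 1).
standard_normal = NormalDist(mu=0, sigma=1)

# Share of the distribution within k standard deviations of the mean: CDF(k) - CDF(-k).
within = {k: standard_normal.cdf(k) - standard_normal.cdf(-k) for k in (1, 2, 3)}

for k, share in within.items():
    print(f"within {k} sd: {share:.4f}")
```

The printed shares come out at roughly 0.68, 0.95, and 0.997. Because the shares are measured in standard deviations, the same values hold for any mean and standard deviation.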
🔗 Correlation vs. Causation
Correlation
Correlation measures the strength and direction of the linear relationship between two variables.
It does not mean that one variable causes the other.
import pandas as pd
data = {
"Hours_Studied": [2, 4, 6, 8, 10],
"Exam_Score": [50, 55, 65, 70, 85]
}
df = pd.DataFrame(data)
print(df.corr()) # Pearson correlation matrix; the off-diagonal value is r, about 0.98 here
Interpretation of correlation coefficient (r):
| r value | Interpretation |
|---|---|
| +1 | Perfect positive correlation |
| 0 | No linear correlation |
| -1 | Perfect negative correlation |
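To see where r comes from, the sketch below computes Pearson's correlation coefficient by hand for the study-hours data: the sum of the products of deviations divided by the square root of the product of the summed squared deviations. The function name `pearson_r` is hypothetical.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    co_deviation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return co_deviation / sqrt(spread_x * spread_y)

hours_studied = [2, 4, 6, 8, 10]
exam_score = [50, 55, 65, 70, 85]

r = pearson_r(hours_studied, exam_score)
print(round(r, 2))  # 0.98
```

Here the sum of products is 170 and the summed squared deviations are 40 and 750, so r = 170 / √30000 ≈ 0.98, a strong positive linear relationship.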
Causation
Causation means one variable directly affects another.
For example, we can say that increasing study time causes better exam performance only if a controlled experiment supports it.
⚠️ Remember: Correlation does not imply causation!
Just because two variables move together doesn’t mean one causes the other.
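One common reason is a hidden third factor that drives both variables. The simulation below is a sketch on entirely synthetic data: a made-up temperature series influences both ice cream sales and sunburn cases, which then correlate strongly even though neither causes the other. All names and numbers are assumptions chosen for illustration.

```python
import random
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    co_deviation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return co_deviation / sqrt(spread_x * spread_y)

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical confounder: daily temperature drives both series below.
temperature = [rng.uniform(15, 35) for _ in range(1000)]
ice_cream_sales = [t + rng.gauss(0, 2) for t in temperature]  # temperature plus noise
sunburn_cases = [t + rng.gauss(0, 2) for t in temperature]    # temperature plus independent noise

r = pearson_r(ice_cream_sales, sunburn_cases)
print(f"r = {r:.2f}")
```

The two series never interact in the code, yet r comes out strongly positive because both depend on temperature.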
🧠 Summary
| Concept | Description |
|---|---|
| Mean / Median / Mode | Measures of central tendency |
| Variance / Std Dev | Measure how data spreads around the mean |
| Probability | Likelihood of an event happening |
| Normal Distribution | Bell-shaped curve describing natural variations |
| Correlation vs. Causation | Correlation shows relationship; causation shows cause-effect |
By understanding these core statistical and probability concepts, you’ll have a strong foundation for data analysis, hypothesis testing, and machine learning!