📊 Introduction to Statistics & Probability

📘 What Is Statistics?

Statistics is the branch of mathematics that deals with collecting, analyzing, interpreting, and presenting data.
It helps us make sense of large datasets and draw conclusions or make predictions.

There are two main types of statistics:

Descriptive Statistics — summarize and describe data (e.g., mean, median, variance).
Inferential Statistics — use samples to make generalizations about a population.

🎯 What Is Probability?

Probability measures the likelihood of an event occurring.
It ranges from 0 to 1, where:

0 means the event cannot happen.
1 means the event will definitely happen.

Formula (valid when all outcomes are equally likely):

P(Event) = (Number of favorable outcomes) / (Total number of outcomes)

Example: If you roll a 6-sided die, the probability of getting a “3” is:

P(3) = 1 / 6 ≈ 0.1667
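The formula above can be sketched in a few lines of Python by counting favorable outcomes in a list of equally likely outcomes. The helper name `probability` is made up for this illustration, and `Fraction` is used only to keep the result exact.

```python
from fractions import Fraction

# All equally likely outcomes of one roll of a fair 6-sided die
outcomes = [1, 2, 3, 4, 5, 6]

def probability(event, sample_space):
    """Favorable outcomes divided by total outcomes (equally likely case only)."""
    favorable = [o for o in sample_space if event(o)]
    return Fraction(len(favorable), len(sample_space))

p_three = probability(lambda o: o == 3, outcomes)
p_even = probability(lambda o: o % 2 == 0, outcomes)

print(p_three, round(float(p_three), 4))  # 1/6 0.1667
print(p_even)                             # 1/2
```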

📈 Mean, Median, Mode, and Variance

1. Mean (Average)

The mean is the sum of all values divided by the number of values.

data = [10, 20, 30, 40]
mean = sum(data) / len(data)
print(mean)  # 25.0

Formula:

Mean = (x₁ + x₂ + ... + xₙ) / n

2. Median

The median is the middle value when the data is sorted.
If there’s an even number of values, it’s the average of the two middle values.

import numpy as np

data = [5, 8, 12, 20, 25]
median = np.median(data)
print(median)  # 12.0
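To make the even-length rule concrete, here is a minimal hand-written sketch of the median. The function name `median_of` is an illustrative choice, not a library function; in practice `np.median` or `statistics.median` does the same job.

```python
def median_of(values):
    """Middle value of the sorted data; average of the two middle values if n is even."""
    if not values:
        raise ValueError("median of an empty dataset is undefined")
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median_of([5, 8, 12, 20, 25]))  # 12
print(median_of([5, 8, 12, 20]))      # 10.0
```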

3. Mode

The mode is the most frequent value in a dataset. A dataset can have more than one mode if several values tie for the highest count.

from statistics import mode

data = [1, 2, 2, 3, 4]
print(mode(data))  # 2
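When several values tie for most frequent, a single mode can be misleading. As a small sketch, `statistics.multimode` (available in Python 3.8 and later) returns every tied value; the sample lists here are invented for illustration.

```python
from statistics import multimode

one_mode = multimode([1, 2, 2, 3, 4])      # a single most frequent value
two_modes = multimode([1, 2, 2, 3, 3, 4])  # 2 and 3 both appear twice

print(one_mode)   # [2]
print(two_modes)  # [2, 3]
```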

4. Variance and Standard Deviation

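Variance measures how far the data spreads from the mean: it is the average of the squared deviations from the mean.
Standard deviation is the square root of the variance, so it is expressed in the same units as the data.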

import numpy as np

data = [10, 12, 23, 23, 16, 23, 21, 16]
variance = np.var(data)  # population variance (ddof=0 by default)
std_dev = np.std(data)   # population standard deviation
print("Variance:", variance)           # 24.0
print("Standard Deviation:", std_dev)  # ≈ 4.899

Formulas:

Variance (σ²) = Σ(xᵢ - μ)² / n
Standard Deviation (σ) = √Variance
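The formulas above divide by n, which is the population variance. When the data is a sample from a larger population, the usual estimate divides by n - 1 instead. The sketch below shows both using the standard library; with NumPy the sample version would be `np.var(data, ddof=1)`.

```python
from statistics import pvariance, variance

data = [10, 12, 23, 23, 16, 23, 21, 16]

population_var = pvariance(data)  # divides by n
sample_var = variance(data)       # divides by n - 1

print(population_var)        # 24
print(round(sample_var, 3))  # 27.429
```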

📊 Probability Basics

1. Independent Events

Two events are independent if the outcome of one does not change the probability of the other.
Example: Rolling two dice — the result of one die doesn’t influence the other.

P(A and B) = P(A) × P(B)
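One way to check the product rule is to list all 36 equally likely results of rolling two dice and count. This is a minimal sketch; the helper `prob` and the chosen events (each die showing a 6) are just for illustration.

```python
from fractions import Fraction
from itertools import product

# Every (first die, second die) pair: 36 equally likely outcomes
rolls = list(product(range(1, 7), repeat=2))

def prob(event):
    """Fraction of the 36 outcomes for which the event holds."""
    return Fraction(sum(1 for r in rolls if event(r)), len(rolls))

p_a = prob(lambda r: r[0] == 6)                    # first die shows 6
p_b = prob(lambda r: r[1] == 6)                    # second die shows 6
p_both = prob(lambda r: r[0] == 6 and r[1] == 6)   # both show 6

print(p_a, p_b, p_both)     # 1/6 1/6 1/36
print(p_both == p_a * p_b)  # True
```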

2. Dependent Events

Two events are dependent if the outcome of one changes the probability of the other.
Example: Drawing cards from a deck without replacement.

P(A and B) = P(A) × P(B|A)
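As a worked example, take the probability of drawing two aces in a row from a standard 52-card deck without replacement. The sketch below applies the formula and then cross-checks it by counting every ordered pair of distinct cards; the deck encoding (`"A"` for aces, `"x"` for everything else) is a simplification made for this illustration.

```python
from fractions import Fraction
from itertools import permutations

# Formula: P(A and B) = P(A) * P(B|A)
p_first_ace = Fraction(4, 52)               # 4 aces among 52 cards
p_second_ace_given_first = Fraction(3, 51)  # 3 aces left among 51 cards
p_two_aces = p_first_ace * p_second_ace_given_first
print(p_two_aces)  # 1/221

# Cross-check by brute force over all ordered pairs of distinct cards
deck = ["A"] * 4 + ["x"] * 48
draws = list(permutations(range(52), 2))
both_aces = sum(1 for i, j in draws if deck[i] == "A" and deck[j] == "A")
p_counted = Fraction(both_aces, len(draws))
print(p_counted)   # 1/221
```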

3. Mutually Exclusive Events

Two events are mutually exclusive if they cannot happen at the same time.
Example: Getting “Heads” or “Tails” on a single coin flip.

P(A or B) = P(A) + P(B)
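The addition rule can be checked on a single die roll. The sketch below also contrasts it with two events that overlap, where simply adding the probabilities would double-count the shared outcomes and the general rule P(A or B) = P(A) + P(B) - P(A and B) is needed. The events chosen are illustrative.

```python
from fractions import Fraction

outcomes = {1, 2, 3, 4, 5, 6}

def prob(event):
    """Probability of a set of outcomes on one fair die roll."""
    return Fraction(len(event & outcomes), len(outcomes))

# Mutually exclusive: rolling a 1 and rolling a 2 share no outcome
one, two = {1}, {2}
p_exclusive = prob(one | two)
print(p_exclusive == prob(one) + prob(two))  # True (1/3)

# Not mutually exclusive: "even" and "greater than 4" share the outcome 6
even, high = {2, 4, 6}, {5, 6}
p_overlap = prob(even | high)
print(p_overlap)                                               # 2/3
print(p_overlap == prob(even) + prob(high) - prob(even & high))  # True
```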

🔔 Normal Distribution

The Normal Distribution (or Gaussian Distribution) is a continuous probability distribution that is symmetric around the mean.
Many real-world measurements (such as adult height or test scores) are approximately normally distributed, although plenty of data is skewed and does not follow this pattern.

Key Properties:

Bell-shaped curve
Mean = Median = Mode
About 68% of data lies within 1 standard deviation of the mean
About 95% lies within 2 standard deviations
About 99.7% lies within 3 standard deviations
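The three percentages in the list above are rounded values. For a normal distribution, the probability of falling within k standard deviations of the mean is erf(k / √2), which can be evaluated with the standard library. The function name `prob_within` is chosen for this sketch.

```python
import math

def prob_within(k):
    """P(|X - mu| < k * sigma) for a normally distributed X."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(k, round(prob_within(k), 4))
# 1 0.6827
# 2 0.9545
# 3 0.9973
```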

import numpy as np
import matplotlib.pyplot as plt

mu, sigma = 0, 1  # mean and standard deviation
data = np.random.normal(mu, sigma, 1000)

plt.hist(data, bins=30, density=True, alpha=0.6)
plt.title("Normal Distribution")
plt.xlabel("Value")
plt.ylabel("Density")  # density=True normalizes the histogram to unit area
plt.show()

🔗 Correlation vs. Causation

Correlation

Correlation measures the strength and direction of a relationship between two variables.
It does not mean that one variable causes the other.

import pandas as pd

data = {
    "Hours_Studied": [2, 4, 6, 8, 10],
    "Exam_Score": [50, 55, 65, 70, 85]
}
df = pd.DataFrame(data)

print(df.corr())  # pairwise Pearson correlation matrix (the default method)

Interpretation of correlation coefficient (r):

r value	Interpretation
+1	Perfect positive correlation
0	No linear correlation
-1	Perfect negative correlation
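To show where r comes from, the Pearson coefficient for the study-hours data can be computed by hand: the sum of the products of deviations, divided by the square root of the product of the summed squared deviations. This is a minimal sketch; `pearson_r` is an illustrative helper, not a library function.

```python
import math

hours = [2, 4, 6, 8, 10]
scores = [50, 55, 65, 70, 85]

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

r = pearson_r(hours, scores)
print(round(r, 4))  # 0.9815
```

A value this close to +1 indicates a strong positive linear relationship in this small dataset.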

Causation

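Causation means that a change in one variable directly produces a change in another.
For example, longer study time may cause better exam performance, but establishing this requires a controlled experiment rather than an observed correlation alone.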

⚠️ Remember: Correlation does not imply causation!
Just because two variables move together doesn’t mean one causes the other.
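A common reason for correlation without causation is a third variable that drives both. The simulation below is a sketch on purely synthetic data: a made-up `temperature` variable influences two invented quantities, `ice_cream` and `sunburns`, which never influence each other. All names and coefficients are assumptions chosen for illustration.

```python
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def residuals(ys, xs):
    """What is left of ys after removing its straight-line dependence on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return [(y - my) - slope * (x - mx) for x, y in zip(xs, ys)]

random.seed(0)  # fixed seed so the run is repeatable
temperature = [random.uniform(10, 35) for _ in range(2000)]
ice_cream = [2.0 * t + random.gauss(0, 5) for t in temperature]  # depends only on temperature
sunburns = [0.5 * t + random.gauss(0, 2) for t in temperature]   # depends only on temperature

r_raw = pearson_r(ice_cream, sunburns)
r_adjusted = pearson_r(residuals(ice_cream, temperature),
                       residuals(sunburns, temperature))

print("raw correlation:", round(r_raw, 2))
print("after removing temperature:", round(r_adjusted, 2))
```

The raw correlation is strong even though neither quantity affects the other, and it largely disappears once the shared driver is accounted for.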

🧠 Summary

Concept	Description
Mean / Median / Mode	Measures of central tendency
Variance / Std Dev	Measure how data spreads around the mean
Probability	Likelihood of an event happening
Normal Distribution	Symmetric, bell-shaped distribution defined by its mean and standard deviation
Correlation vs. Causation	Correlation shows relationship; causation shows cause-effect

By understanding these core statistical and probability concepts, you’ll have a strong foundation for data analysis, hypothesis testing, and machine learning!