💬 Natural Language Processing (NLP)
📘 Introduction
Natural Language Processing (NLP) is a branch of Artificial Intelligence that enables computers to understand, interpret, and generate human language.
It combines linguistics, computer science, and machine learning to process textual data and extract meaning.
Applications of NLP:
- Sentiment analysis 🧠
- Chatbots 💬
- Machine translation 🌐
- Text summarization 📰
- Spam detection 📧
✂️ Tokenization
Concept
Tokenization is the process of breaking down text into smaller units called tokens, typically words, subwords, or sentences.
Tokens are the building blocks for further text analysis and modeling.
Example in Python
from nltk.tokenize import word_tokenize, sent_tokenize
# First use requires the Punkt models: import nltk; nltk.download("punkt")
text = "Natural Language Processing is fascinating. It allows computers to understand text!"
# Sentence tokenization
sentences = sent_tokenize(text)
print("Sentences:", sentences)
# Word tokenization
words = word_tokenize(text)
print("Words:", words)
Tools/Libraries:
- NLTK (Natural Language Toolkit)
- spaCy (fast, modern NLP library)
- Hugging Face Transformers (for deep learning-based NLP)
Example with spaCy
import spacy
# Requires the small English model: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
doc = nlp("I love data science and natural language processing!")
# Print each token with its part-of-speech tag
for token in doc:
    print(token.text, token.pos_)
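Example with Hugging Face Transformers
Transformer models usually split rare words into subwords rather than keeping whole words. A minimal sketch (assumes the transformers package is installed; the bert-base-uncased tokenizer is downloaded on first run):
from transformers import AutoTokenizer
# Load a pre-trained WordPiece tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# Out-of-vocabulary words are split into subword pieces marked with ##
print(tokenizer.tokenize("Tokenization is fascinating!"))
# e.g. ['token', '##ization', 'is', 'fascinating', '!']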
🧱 Word Embeddings
Concept
Word embeddings represent words as numerical vectors that capture semantic meaning; similar words have similar vectors.
This helps machine learning models understand context and relationships between words.
Popular Embedding Techniques
| Method | Description |
|---|---|
| Bag of Words (BoW) | Represents text as word occurrence counts |
| TF-IDF (Term Frequency-Inverse Document Frequency) | Weighs words by importance |
| Word2Vec | Learns dense vector representations using context |
| GloVe | Learns vectors from global word co-occurrence statistics; widely used as pre-trained embeddings |
| BERT / Transformer embeddings | Contextual embeddings from deep learning models |
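Example: Bag of Words and TF-IDF
The first two rows of the table are count-based rather than learned. A minimal sketch of both with scikit-learn (the vocabulary and weights here depend only on the two toy documents):
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
corpus = ["data science is fun", "machine learning is powerful"]
# Bag of Words: raw occurrence counts per document
bow = CountVectorizer()
print(bow.fit_transform(corpus).toarray())
print(bow.get_feature_names_out())
# TF-IDF: counts reweighted so that words shared by every document count less
tfidf = TfidfVectorizer()
print(tfidf.fit_transform(corpus).toarray().round(2))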
Example: Using Word2Vec
from gensim.models import Word2Vec
# Toy corpus: each sentence is a list of tokens
sentences = [["data", "science", "is", "fun"], ["machine", "learning", "is", "powerful"]]
# sg=1 trains a skip-gram model (sg=0 would use CBOW)
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=1)
# Get the 50-dimensional word vector
print(model.wv["science"])
# Find the most similar words by cosine similarity
print(model.wv.most_similar("science"))
Key Idea: Words appearing in similar contexts have similar meanings.
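To see the key idea in numbers, compare two vectors from the model trained above; a minimal sketch computing cosine similarity by hand with NumPy, then with gensim's built-in helper:
import numpy as np
# Vectors for two words from the Word2Vec model above
v1, v2 = model.wv["data"], model.wv["science"]
# Cosine similarity: values near 1.0 mean the vectors point in similar directions
print(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))
# gensim computes the same quantity directly
print(model.wv.similarity("data", "science"))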
❤️ Sentiment Analysis
Concept
Sentiment Analysis determines the emotional tone (positive, negative, or neutral) of a text.
It's commonly used for analyzing customer reviews, social media posts, and feedback.
Example with TextBlob
from textblob import TextBlob
text = "I really love the new phone! The battery life is amazing."
blob = TextBlob(text)
# Polarity ranges from -1.0 (most negative) to 1.0 (most positive)
print("Sentiment polarity:", blob.sentiment.polarity)
if blob.sentiment.polarity > 0:
    print("Positive sentiment 😊")
elif blob.sentiment.polarity < 0:
    print("Negative sentiment 😞")
else:
    print("Neutral sentiment 😐")
Example with Hugging Face Transformers
from transformers import pipeline
# Pre-trained sentiment analysis model (a default model is downloaded on first run)
classifier = pipeline("sentiment-analysis")
result = classifier("I absolutely love this movie!")[0]
print(result)  # e.g. {'label': 'POSITIVE', 'score': 0.99...}
Applications:
- Product review analysis
- Brand monitoring
- Customer feedback understanding
🗂️ Text Classification
Concept
Text Classification assigns predefined categories (labels) to text documents.
Examples include spam detection, topic categorization, and intent recognition.
Steps:
- Tokenize and clean the text
- Convert text into numerical vectors (e.g., TF-IDF, embeddings)
- Train a classifier (e.g., Naive Bayes, Logistic Regression, or Neural Network)
Example with Scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
# Example data
texts = ["I love this product", "This is terrible", "Excellent experience", "Worst purchase ever"]
labels = ["positive", "negative", "positive", "negative"]
# Create pipeline
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(texts, labels)
# Prediction
print(model.predict(["I really like this!"]))
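Four sentences are only enough to demonstrate the API. With a realistically sized labeled dataset, you would hold out a test set and measure accuracy; a minimal sketch reusing the imports above, where texts and labels stand in for the full dataset:
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# Hold out 25% of the labeled data for evaluation
X_train, X_test, y_train, y_test = train_test_split(texts, labels, test_size=0.25, random_state=42)
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))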
Deep Learning Alternative: Use transformer-based models like BERT, DistilBERT, or RoBERTa for more accurate classification.
Example using Transformers
from transformers import pipeline
# Uses a default pre-trained classification model unless one is specified
classifier = pipeline("text-classification")
result = classifier("The food was great, but the service was slow.")[0]
print(result)
🧠 Summary
| Concept | Description | Tools/Libraries |
|---|---|---|
| Tokenization | Splitting text into words or sentences | NLTK, spaCy |
| Word Embeddings | Representing words as vectors | Word2Vec, GloVe, BERT |
| Sentiment Analysis | Detecting emotional tone | TextBlob, Transformers |
| Text Classification | Assigning labels to text | Scikit-learn, BERT |
NLP enables machines to understand human language, from tokenizing and embedding words to analyzing sentiment and classifying text.
Mastering NLP techniques opens doors to applications like chatbots, search engines, and intelligent assistants.