💬 Natural Language Processing (NLP)
📘 Introduction
Natural Language Processing (NLP) is a branch of Artificial Intelligence that enables computers to understand, interpret, and generate human language.
It combines linguistics, computer science, and machine learning to process textual data and extract meaning.
Applications of NLP:
- Sentiment analysis 🧠
- Chatbots 💬
- Machine translation 🌐
- Text summarization 📰
- Spam detection 📧
✂️ Tokenization
Concept
Tokenization is the process of breaking down text into smaller units called tokens, typically words, subwords, or sentences.
Tokens are the building blocks for further text analysis and modeling.
Example in Python
from nltk.tokenize import word_tokenize, sent_tokenize
# First use requires the Punkt models: import nltk; nltk.download("punkt")
text = "Natural Language Processing is fascinating. It allows computers to understand text!"
# Sentence tokenization
sentences = sent_tokenize(text)
print("Sentences:", sentences)
# Word tokenization
words = word_tokenize(text)
print("Words:", words)
Tools/Libraries:
- NLTK (Natural Language Toolkit)
- spaCy (fast, modern NLP library)
- Hugging Face Transformers (for deep learning-based NLP)
Example with spaCy
import spacy
# Requires the small English model: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
doc = nlp("I love data science and natural language processing!")
# Print each token with its part-of-speech tag
for token in doc:
    print(token.text, token.pos_)
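Example with Hugging Face Transformers
Transformer models usually split rare words into subwords rather than keeping whole words. A minimal sketch (assumes the transformers package is installed; the bert-base-uncased tokenizer is downloaded on first run):
from transformers import AutoTokenizer
# Load a pre-trained WordPiece tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# Out-of-vocabulary words are split into subword pieces marked with ##
print(tokenizer.tokenize("Tokenization is fascinating!"))
# e.g. ['token', '##ization', 'is', 'fascinating', '!']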
🧱 Word Embeddings
Concept
Word embeddings represent words as numerical vectors that capture semantic meaning; similar words have similar vectors.
This helps machine learning models understand context and relationships between words.
Popular Embedding Techniques
| Method | Description |
|---|---|
| Bag of Words (BoW) | Represents text as word occurrence counts |
| TF-IDF (Term Frequency-Inverse Document Frequency) | Weighs words by importance |
| Word2Vec | Learns dense vector representations using context |
| GloVe | Learns vectors from global word co-occurrence statistics; widely used as pre-trained embeddings |
| BERT / Transformer embeddings | Contextual embeddings from deep learning models |
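Example: Bag of Words and TF-IDF
The first two rows of the table are count-based rather than learned. A minimal sketch of both with scikit-learn (the vocabulary and weights here depend only on the two toy documents):
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
corpus = ["data science is fun", "machine learning is powerful"]
# Bag of Words: raw occurrence counts per document
bow = CountVectorizer()
print(bow.fit_transform(corpus).toarray())
print(bow.get_feature_names_out())
# TF-IDF: counts reweighted so that words shared by every document count less
tfidf = TfidfVectorizer()
print(tfidf.fit_transform(corpus).toarray().round(2))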
Example: Using Word2Vec
from gensim.models import Word2Vec
# Toy corpus: each sentence is a list of tokens
sentences = [["data", "science", "is", "fun"], ["machine", "learning", "is", "powerful"]]
# sg=1 trains a skip-gram model (sg=0 would use CBOW)
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=1)
# Get the 50-dimensional word vector
print(model.wv["science"])
# Find the most similar words by cosine similarity
print(model.wv.most_similar("science"))
Key Idea: Words appearing in similar contexts have similar meanings.
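To see the key idea in numbers, compare two vectors from the model trained above; a minimal sketch computing cosine similarity by hand with NumPy, then with gensim's built-in helper:
import numpy as np
# Vectors for two words from the Word2Vec model above
v1, v2 = model.wv["data"], model.wv["science"]
# Cosine similarity: values near 1.0 mean the vectors point in similar directions
print(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))
# gensim computes the same quantity directly
print(model.wv.similarity("data", "science"))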
❤️ Sentiment Analysis
Concept
Sentiment Analysis determines the emotional tone (positive, negative, or neutral) of a text.
It's commonly used for analyzing customer reviews, social media posts, and feedback.
Example with TextBlob
from textblob import TextBlob
text = "I really love the new phone! The battery life is amazing."
blob = TextBlob(text)
# Polarity ranges from -1.0 (most negative) to 1.0 (most positive)
print("Sentiment polarity:", blob.sentiment.polarity)
if blob.sentiment.polarity > 0:
    print("Positive sentiment 😊")
elif blob.sentiment.polarity < 0:
    print("Negative sentiment 😞")
else:
    print("Neutral sentiment 😐")
Example with Hugging Face Transformers
from transformers import pipeline
# Pre-trained sentiment analysis model (a default model is downloaded on first run)
classifier = pipeline("sentiment-analysis")
result = classifier("I absolutely love this movie!")[0]
print(result)  # e.g. {'label': 'POSITIVE', 'score': 0.99...}
Applications:
- Product review analysis
- Brand monitoring
- Customer feedback understanding
🗂️ Text Classification
Concept
Text Classification assigns predefined categories (labels) to text documents.
Examples include spam detection, topic categorization, and intent recognition.
Steps:
- Tokenize and clean the text
- Convert text into numerical vectors (e.g., TF-IDF, embeddings)
- Train a classifier (e.g., Naive Bayes, Logistic Regression, or Neural Network)
Example with Scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
# Example data
texts = ["I love this product", "This is terrible", "Excellent experience", "Worst purchase ever"]
labels = ["positive", "negative", "positive", "negative"]
# Create pipeline
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(texts, labels)
# Prediction
print(model.predict(["I really like this!"]))
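Four sentences are only enough to demonstrate the API. With a realistically sized labeled dataset, you would hold out a test set and measure accuracy; a minimal sketch reusing the imports above, where texts and labels stand in for the full dataset:
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# Hold out 25% of the labeled data for evaluation
X_train, X_test, y_train, y_test = train_test_split(texts, labels, test_size=0.25, random_state=42)
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))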
Deep Learning Alternative: Use transformer-based models like BERT, DistilBERT, or RoBERTa for more accurate classification.
Example using Transformers
from transformers import pipeline
# Uses a default pre-trained classification model unless one is specified
classifier = pipeline("text-classification")
result = classifier("The food was great, but the service was slow.")[0]
print(result)
🧠 Summary
| Concept | Description | Tools/Libraries |
|---|---|---|
| Tokenization | Splitting text into words or sentences | NLTK, spaCy |
| Word Embeddings | Representing words as vectors | Word2Vec, GloVe, BERT |
| Sentiment Analysis | Detecting emotional tone | TextBlob, Transformers |
| Text Classification | Assigning labels to text | Scikit-learn, BERT |
NLP enables machines to understand human language, from tokenizing and embedding words to analyzing sentiment and classifying text.
Mastering NLP techniques opens doors to applications like chatbots, search engines, and intelligent assistants.