Gridscript

💬 Natural Language Processing (NLP)

📘 Introduction

Natural Language Processing (NLP) is a branch of Artificial Intelligence that enables computers to understand, interpret, and generate human language.
It combines linguistics, computer science, and machine learning to process textual data and extract meaning.

Applications of NLP:

- Chatbots and intelligent assistants
- Search engines
- Sentiment analysis of reviews and social media posts
- Spam detection and text classification

✂️ Tokenization

Concept

Tokenization is the process of breaking down text into smaller units called tokens: typically words, subwords, or sentences.
Tokens are the building blocks for further text analysis and modeling.

Example in Python

import nltk
from nltk.tokenize import word_tokenize, sent_tokenize

# First run only: download the tokenizer models used by sent_tokenize/word_tokenize
nltk.download("punkt")

text = "Natural Language Processing is fascinating. It allows computers to understand text!"

# Sentence tokenization
sentences = sent_tokenize(text)
print("Sentences:", sentences)

# Word tokenization
words = word_tokenize(text)
print("Words:", words)

Tools/Libraries:

- NLTK (word_tokenize, sent_tokenize)
- spaCy (tokenization plus part-of-speech tags and other annotations)

Example with spaCy

import spacy

# Requires the small English model: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
doc = nlp("I love data science and natural language processing!")

# Print each token with its part-of-speech tag
for token in doc:
    print(token.text, token.pos_)

🧱 Word Embeddings

Concept

Word embeddings represent words as numerical vectors that capture semantic meaning: similar words have similar vectors.
This helps machine learning models understand context and relationships between words.

Popular Embedding Techniques

| Method | Description |
| --- | --- |
| Bag of Words (BoW) | Represents text as word occurrence counts |
| TF-IDF (Term Frequency-Inverse Document Frequency) | Weighs each word by its importance across documents |
| Word2Vec | Learns dense vector representations from surrounding context |
| GloVe | Pre-trained embeddings built from large corpora |
| BERT / Transformer embeddings | Contextual embeddings from deep learning models |
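
The first two rows are count-based. Here is a minimal sketch of Bag of Words and TF-IDF with scikit-learn; the two example sentences are made up purely for illustration:

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = ["data science is fun", "machine learning is powerful"]

# Bag of Words: raw word counts per document
bow = CountVectorizer()
print(bow.fit_transform(corpus).toarray())
print(bow.get_feature_names_out())

# TF-IDF: counts reweighted so words shared by every document count less
tfidf = TfidfVectorizer()
print(tfidf.fit_transform(corpus).toarray())

Each row is a document and each column a vocabulary word; TF-IDF downweights words like "is" that appear in every document.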

Example: Using Word2Vec

from gensim.models import Word2Vec

sentences = [["data", "science", "is", "fun"], ["machine", "learning", "is", "powerful"]]
# Train a small skip-gram model (sg=1); vector_size sets the embedding dimension
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=1)

# Get word vector
print(model.wv["science"])

# Find similar words
print(model.wv.most_similar("science"))

Key Idea: Words appearing in similar contexts have similar meanings.
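
Word2Vec assigns one static vector per word. As a minimal sketch of the "BERT / Transformer embeddings" row above, contextual vectors can be extracted with Hugging Face Transformers; distilbert-base-uncased is just one convenient checkpoint, and PyTorch is assumed to be installed:

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModel.from_pretrained("distilbert-base-uncased")

inputs = tokenizer("NLP models learn language from data.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One vector per token: shape (batch_size, num_tokens, hidden_size)
print(outputs.last_hidden_state.shape)

Because every token gets its own vector, the same word can receive different embeddings in different sentences, which is what "contextual" means here.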

❤️ Sentiment Analysis

Concept

Sentiment Analysis determines the emotional tone (positive, negative, or neutral) of a text.
It's commonly used for analyzing customer reviews, social media posts, and feedback.

Example with TextBlob

from textblob import TextBlob

text = "I really love the new phone! The battery life is amazing."
blob = TextBlob(text)

print("Sentiment polarity:", blob.sentiment.polarity)
if blob.sentiment.polarity > 0:
    print("Positive sentiment πŸ™‚")
elif blob.sentiment.polarity < 0:
    print("Negative sentiment πŸ™")
else:
    print("Neutral sentiment 😐")

Example with Hugging Face Transformers

from transformers import pipeline

# Pre-trained sentiment analysis model
classifier = pipeline("sentiment-analysis")

# The pipeline returns a list of dicts with 'label' and 'score'
result = classifier("I absolutely love this movie!")[0]
print(result)

Applications:

- Customer review analysis
- Social media monitoring
- Product and service feedback analysis

🗂️ Text Classification

Concept

Text Classification assigns predefined categories (labels) to text documents.
Examples include spam detection, topic categorization, and intent recognition.

Steps:

  1. Tokenize and clean the text (a minimal cleaning sketch follows this list)
  2. Convert text into numerical vectors (e.g., TF-IDF, embeddings)
  3. Train a classifier (e.g., Naive Bayes, Logistic Regression, or Neural Network)
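
A minimal sketch of step 1, using plain-Python regex cleaning plus NLTK's English stopword list (one common choice, not the only way to clean text):

import re
import nltk
from nltk.corpus import stopwords

# First run only: download the stopword list
nltk.download("stopwords")

def clean_text(text):
    # Lowercase, strip punctuation and symbols, then drop common stopwords
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    stop_words = set(stopwords.words("english"))
    return [t for t in text.split() if t not in stop_words]

print(clean_text("I LOVED this product!!! It's great."))

The cleaned tokens can be joined back into strings before being passed to a vectorizer such as TfidfVectorizer in the example below.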

Example with Scikit-learn

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Example data
texts = ["I love this product", "This is terrible", "Excellent experience", "Worst purchase ever"]
labels = ["positive", "negative", "positive", "negative"]

# Pipeline: TF-IDF features followed by a Multinomial Naive Bayes classifier
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(texts, labels)

# Prediction
print(model.predict(["I really like this!"]))

Deep Learning Alternative: Use transformer-based models like BERT, DistilBERT, or RoBERTa for more accurate classification.

Example using Transformers

from transformers import pipeline

# Uses the pipeline's default model; pass model="..." to pick a specific checkpoint
classifier = pipeline("text-classification")
result = classifier("The food was great, but the service was slow.")[0]
print(result)

🧠 Summary

| Concept | Description | Tools/Libraries |
| --- | --- | --- |
| Tokenization | Splitting text into words or sentences | NLTK, spaCy |
| Word Embeddings | Representing words as vectors | Word2Vec, GloVe, BERT |
| Sentiment Analysis | Detecting emotional tone | TextBlob, Transformers |
| Text Classification | Assigning labels to text | Scikit-learn, BERT |

NLP enables machines to understand human language, from tokenizing and embedding words to analyzing sentiment and classifying text.
Mastering NLP techniques opens doors to applications like chatbots, search engines, and intelligent assistants.