🔍 Tokenization¶
📖 Click to Expand
What is Tokenization?¶
Tokenization is the process of splitting a longer piece of text into smaller units — either sentences or words — so that we can analyze and process them more easily.
Types of Tokenization¶
- Sentence Tokenization → Splits paragraphs into individual sentences.
- Word Tokenization → Splits sentences into individual words.
This step helps us prepare raw text for downstream NLP tasks like vectorization, sentiment analysis, topic modeling, etc.
Example:¶
Input:
"This app is awesome. I use it daily!"
Sentence Tokens:
["This app is awesome.", "I use it daily!"]
Word Tokens:
["This", "app", "is", "awesome", "I", "use", "it", "daily"]
🧱 Word Tokenization¶
📖 Click to Expand
What is Word Tokenization?¶
This process splits a sentence into individual words. It typically discards punctuation and splits on whitespace or regex patterns. Most NLP models rely on word-level tokens as a base unit.
Example:¶
Input:
"This app is awesome!"
Output:
["This", "app", "is", "awesome"]
import pandas as pd
# Load the dataset
df = pd.read_csv("datasets/amazon.csv")
# Preview the data
print(df.shape)
df.head()
(20000, 2)
|   | text | sentiment |
|---|------|-----------|
| 0 | This is a one of the best apps acording to a b... | 1 |
| 1 | This is a pretty good version of the game for ... | 1 |
| 2 | this is a really cool game. there are a bunch ... | 1 |
| 3 | This is a silly game and can be frustrating, b... | 1 |
| 4 | This is a terrific game on any pad. Hrs of fun... | 1 |
import re
sample_review = df['text'].iloc[0]
print("Original Review:\n", sample_review)
# Word tokenization using regex
word_tokens = re.findall(r'\b\w+\b', sample_review.lower())
print("\nTokenized Words:")
print(word_tokens)
Original Review:
This is a one of the best apps acording to a bunch of people and I agree it has bombs eggs pigs TNT king pigs and realustic stuff
Tokenized Words:
['this', 'is', 'a', 'one', 'of', 'the', 'best', 'apps', 'acording', 'to', 'a', 'bunch', 'of', 'people', 'and', 'i', 'agree', 'it', 'has', 'bombs', 'eggs', 'pigs', 'tnt', 'king', 'pigs', 'and', 'realustic', 'stuff']
✂️ Sentence Tokenization¶
📖 Click to Expand
What is Sentence Tokenization?¶
This breaks long text into individual sentences using punctuation marks like ".", "?", and "!" as delimiters. Useful when the meaning or structure varies sentence by sentence.
Example:¶
Input:
"This app is awesome. I use it every day!"
Output:
["This app is awesome.", "I use it every day!"]
# Sentence tokenization using regex
sentence_tokens = re.split(r'(?<=[.!?]) +', sample_review)
print("Tokenized Sentences:")
for i, sent in enumerate(sentence_tokens):
    print(f"{i+1}. {sent}")
Tokenized Sentences:
1. This is a one of the best apps acording to a bunch of people and I agree it has bombs eggs pigs TNT king pigs and realustic stuff
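The sample review contains no sentence-ending punctuation, so the regex returns it as a single chunk. A quick sketch on a toy multi-sentence string (the string is ours, not from the dataset) shows the split in action:
import re
demo = "This app is awesome. I use it every day!"
print(re.split(r'(?<=[.!?]) +', demo))
# ['This app is awesome.', 'I use it every day!']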
🧼 Text Cleaning¶
📖 Click to Expand
What is Text Cleaning?¶
Text data is messy — it contains typos, inconsistent casing, punctuation, symbols, and other noise. Cleaning makes the text more uniform and easier to analyze or model.
Common Steps:¶
- Lowercasing (standardization)
- Removing punctuation and special characters
- Regex-based normalization (e.g., removing URLs, extra whitespace)
Example:¶
"This app is AWESOME!! 😍😍 www.example.com"
→ "this app is awesome"
🔤 Lowercasing and Normalization¶
📖 Click to Expand
What is Lowercasing?¶
Converts all characters in the text to lowercase to avoid treating "Good" and "good" as different words.
Example:¶
"This App is GREAT!"
→ "this app is great!"
This step also sometimes includes basic normalization like removing extra whitespace or fixing encoding artifacts.
# Lowercasing text column
df['text_lower'] = df['text'].str.lower()
# Show before and after for a single row
print("Original:", df['text'].iloc[0])
print("Lowercased:", df['text_lower'].iloc[0])
Original: This is a one of the best apps acording to a bunch of people and I agree it has bombs eggs pigs TNT king pigs and realustic stuff
Lowercased: this is a one of the best apps acording to a bunch of people and i agree it has bombs eggs pigs tnt king pigs and realustic stuff
✨ Punctuation & Special Character Removal¶
📖 Click to Expand
Why Remove Punctuation?¶
Punctuation doesn’t usually add value in basic NLP tasks like sentiment analysis or topic modeling. Removing it simplifies the vocabulary.
Example:¶
"Wow!!! This app is amazing :)"
→ "Wow This app is amazing "
Emoji and symbols can also be removed unless they are part of your feature set.
import string
# Remove punctuation
df['text_no_punct'] = df['text_lower'].str.replace(f"[{re.escape(string.punctuation)}]", "", regex=True)
print("Before:", df['text_lower'].iloc[0])
print("After:", df['text_no_punct'].iloc[0])
Before: this is a one of the best apps acording to a bunch of people and i agree it has bombs eggs pigs tnt king pigs and realustic stuff
After: this is a one of the best apps acording to a bunch of people and i agree it has bombs eggs pigs tnt king pigs and realustic stuff
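The sample review happens to contain no punctuation, which is why the before/after look identical. The filter above also leaves emoji and other symbols in place; a hedged sketch for stripping anything non-ASCII (the toy string is ours):
import re
emoji_example = "love this game 😍🔥"
print(re.sub(r"[^\x00-\x7f]", "", emoji_example).strip())   # "love this game"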
🧪 Regex Based Cleaning¶
# Regex-based cleaning: remove URLs and excess whitespace
df['text_cleaned'] = df['text_no_punct'].str.replace(r"http\S+|www\S+|https\S+", "", regex=True)
df['text_cleaned'] = df['text_cleaned'].str.replace(r"\s+", " ", regex=True).str.strip()
print("Before:", df['text_no_punct'].iloc[0])
print("After:", df['text_cleaned'].iloc[0])
Before: this is a one of the best apps acording to a bunch of people and i agree it has bombs eggs pigs tnt king pigs and realustic stuff
After: this is a one of the best apps acording to a bunch of people and i agree it has bombs eggs pigs tnt king pigs and realustic stuff
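Again the before/after look identical because the sample review contains no URL. Applying the same patterns to a toy string (ours, not from the dataset) shows the effect:
import re
noisy = "great app!!   see www.example.com or http://foo.bar/baz for details"
cleaned = re.sub(r"http\S+|www\S+|https\S+", "", noisy)
cleaned = re.sub(r"\s+", " ", cleaned).strip()
print(cleaned)   # "great app!! see or for details"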
🛑 Stopwords Removal¶
📖 Click to Expand
What are Stopwords?¶
Stopwords are common words like “the”, “is”, “and”, “in” — they appear frequently but carry little semantic value.
Removing them reduces noise, shrinks vocabulary size, and improves signal for tasks like classification or topic modeling.
Example:¶
"this is the best app in the world"
→ "best app world"
# !pip3 install nltk
import nltk
from nltk.corpus import stopwords
# Load NLTK stopwords
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
# Remove stopwords from cleaned text
df['text_no_stopwords'] = df['text_cleaned'].apply(
    lambda x: ' '.join([word for word in x.split() if word not in stop_words])
)
print("Before:", df['text_cleaned'].iloc[0])
print("After:", df['text_no_stopwords'].iloc[0])
Before: this is a one of the best apps acording to a bunch of people and i agree it has bombs eggs pigs tnt king pigs and realustic stuff
After: one best apps acording bunch people agree bombs eggs pigs tnt king pigs realustic stuff
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/ashrithreddy/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
🧰 Custom Stopword Handling¶
📖 Click to Expand
Why use custom stopwords?¶
Depending on your dataset, some words (like “app”, “product”, “phone”) may occur frequently but not help with classification.
You can define a custom list based on domain knowledge or frequency analysis.
Example:¶
"this app is very helpful and awesome"
→ "helpful awesome"
(if “app”, “very”, “this” are in custom list)
# Example: Custom stopwords for Amazon reviews
custom_stopwords = set(["app", "product", "amazon", "device", "really", "very"])
# Remove custom stopwords
df['text_no_custom_stopwords'] = df['text_no_stopwords'].apply(
    lambda x: ' '.join([word for word in x.split() if word not in custom_stopwords])
)
print("Before:", df['text_no_stopwords'].iloc[0])
print("After:", df['text_no_custom_stopwords'].iloc[0])
Before: one best apps acording bunch people agree bombs eggs pigs tnt king pigs realustic stuff
After: one best apps acording bunch people agree bombs eggs pigs tnt king pigs realustic stuff
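The sample review contains none of these custom stopwords, so nothing changes here. To build such a list via frequency analysis (mentioned above), a minimal sketch using collections.Counter; which frequent words to promote to stopwords remains a judgment call:
from collections import Counter
# Count tokens across the whole (stopword-free) corpus and inspect the top terms
token_counts = Counter(
    word
    for review in df['text_no_stopwords']
    for word in review.split()
)
print(token_counts.most_common(20))   # hand-pick frequent, low-signal words from this list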
🔁 Lemmatization vs Stemming¶
📖 Click to Expand
What are Lemmatization and Stemming?¶
Both reduce words to a base/root form. This helps unify variants like “running”, “ran”, “runs” → “run”.
- Stemming → crude chopping of suffixes (fast, less accurate)
- Lemmatization → uses linguistic context and vocab to find valid dictionary form (slower, more accurate)
Example:¶
"running", "ran", "runs"
→ Stemmed: "run", "ran", "run"
→ Lemmatized: "run", "run", "run"
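A quick sketch comparing the two on the example words; note the WordNet lemmatizer defaults to treating words as nouns, so the pos='v' hint is needed to map "ran" → "run":
import nltk
nltk.download('wordnet')
from nltk.stem import PorterStemmer, WordNetLemmatizer
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
words = ["running", "ran", "runs"]
print([stemmer.stem(w) for w in words])                   # ['run', 'ran', 'run']
print([lemmatizer.lemmatize(w, pos='v') for w in words])  # ['run', 'run', 'run']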
🌱 Stemming (Porter, Snowball)¶
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
# Apply stemming
df['text_stemmed'] = df['text_no_custom_stopwords'].apply(
    lambda x: ' '.join([stemmer.stem(word) for word in x.split()])
)
print("Before:", df['text_no_custom_stopwords'].iloc[0])
print("After:", df['text_stemmed'].iloc[0])
Before: one best apps acording bunch people agree bombs eggs pigs tnt king pigs realustic stuff
After: one best app acord bunch peopl agre bomb egg pig tnt king pig realust stuff
🍃 Lemmatization (WordNet, spaCy)¶
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
nltk.download('wordnet')
nltk.download('omw-1.4')
lemmatizer = WordNetLemmatizer()
# Apply lemmatization
df['text_lemmatized'] = df['text_no_custom_stopwords'].apply(
    lambda x: ' '.join([lemmatizer.lemmatize(word) for word in x.split()])
)
print("Before:", df['text_no_custom_stopwords'].iloc[0])
print("After:", df['text_lemmatized'].iloc[0])
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/ashrithreddy/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     /Users/ashrithreddy/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
Before: one best apps acording bunch people agree bombs eggs pigs tnt king pigs realustic stuff
After: one best apps acording bunch people agree bomb egg pig tnt king pig realustic stuff
🧠 POS Tagging & Named Entity Recognition¶
📖 Click to Expand
What are POS & NER?¶
These are advanced parsing techniques that extract grammatical roles and real-world entities from text.
- POS Tagging (Part-of-Speech) → assigns a tag like noun, verb, adjective to each word
- NER (Named Entity Recognition) → identifies names, places, orgs, dates, etc.
Why it matters:¶
- Helps with filtering nouns, extracting subjects, building features
- Useful in recommendation systems, summarization, question answering, etc.
Example:¶
Text: "Amazon was founded in Seattle in 1994"
→ POS: ("Amazon", NNP), ("founded", VBD), ...
→ NER: ORG = Amazon, GPE = Seattle, DATE = 1994
🏷️ POS Tagging¶
📖 Click to Expand
What is POS Tagging?¶
Part-of-Speech tagging assigns grammatical labels to each word in a sentence — like noun, verb, adjective, etc.
It's useful for extracting only verbs or nouns, or for filtering tokens by their syntactic role.
Example:¶
"this app works beautifully"
→ [("this", DT), ("app", NN), ("works", VBZ), ("beautifully", RB)]
import nltk
nltk.data.path.append('/Users/ashrithreddy/nltk_data')
# nltk.data.find('taggers/averaged_perceptron_tagger') # Will throw if missing
# nltk.download('popular', download_dir='/Users/ashrithreddy/nltk_data')
# nltk.download('averaged_perceptron_tagger_eng')
# POS tagging on lemmatized text
df['pos_tags'] = df['text_lemmatized'].apply(lambda x: nltk.pos_tag(x.split()))
print("Sample Tagged Sentence:")
print(df['pos_tags'].iloc[0])
Sample Tagged Sentence:
[('one', 'CD'), ('best', 'JJS'), ('apps', 'NN'), ('acording', 'VBG'), ('bunch', 'JJ'), ('people', 'NNS'), ('agree', 'VBP'), ('bomb', 'NN'), ('egg', 'NN'), ('pig', 'NN'), ('tnt', 'NN'), ('king', 'VBG'), ('pig', 'JJ'), ('realustic', 'JJ'), ('stuff', 'NN')]
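As noted above, POS tags make it easy to keep only certain word classes. A small sketch filtering noun tokens (tags starting with "NN"); the nouns column name is illustrative:
# Keep only noun-tagged tokens, e.g. as keyword candidates
df['nouns'] = df['pos_tags'].apply(
    lambda tags: [word for word, tag in tags if tag.startswith('NN')]
)
print(df['nouns'].iloc[0])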
🧾 Named Entity Recognition (NER)¶
📖 Click to Expand
What is NER?¶
NER identifies named entities in text like organizations, people, countries, dates, etc.
It builds structure from unstructured text, enabling high-level features like “brand mentioned”, “location detected”, etc.
Example:¶
"Jeff Bezos founded Amazon in 1994"
→ PERSON = Jeff Bezos, ORG = Amazon, DATE = 1994
import nltk
import pickle
import io
from nltk.chunk import ChunkParserI
from nltk import pos_tag
from nltk.tokenize import TreebankWordTokenizer
# Force NLTK to look in your custom path
nltk.data.path.clear()
nltk.data.path.append('/Users/ashrithreddy/nltk_data')
# Load the chunker manually using correct encoding
chunker_path = '/Users/ashrithreddy/nltk_data/chunkers/maxent_ne_chunker/english_ace_multiclass.pickle'
with open(chunker_path, 'rb') as f:
    chunker: ChunkParserI = pickle.load(io.BufferedReader(f), encoding='latin1')
# Tokenize and tag
text = df['text_lemmatized'].iloc[0]
tokens = TreebankWordTokenizer().tokenize(text)
pos_tags = pos_tag(tokens)
# Run NER
ner_tree = chunker.parse(pos_tags)
# Display entities
print("Named Entities:")
for subtree in ner_tree:
    if hasattr(subtree, 'label'):
        entity = " ".join([token for token, pos in subtree])
        print(f"{entity} → {subtree.label()}")
Named Entities:
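No entities are found here, largely because the input was lowercased and stripped of stopwords, and the NLTK chunker relies on cues like capitalization. A hedged sketch on original-cased text using nltk.ne_chunk (the sample sentence is the example from above; the resources are NLTK's standard downloads):
import nltk
# nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')
# nltk.download('maxent_ne_chunker'); nltk.download('words')
sample = "Jeff Bezos founded Amazon in Seattle in 1994."
tree = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sample)))
for subtree in tree:
    if hasattr(subtree, 'label'):
        print(" ".join(tok for tok, _ in subtree), "→", subtree.label())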
🔗 Dependency Parsing & Chunking¶
📖 Click to Expand
What is Dependency Parsing?¶
Dependency parsing maps the grammatical structure of a sentence by identifying relationships between words — like subject, object, modifier, etc.
Phrase chunking (aka shallow parsing) extracts useful noun phrases, verb phrases, etc., without building a full parse tree.
Example:¶
"The quick brown fox jumps over the lazy dog"
→ fox (subject of jumps)
→ over the lazy dog (prepositional modifier)
These structures help with tasks like question answering, relation extraction, and knowledge graph construction.
🕸️ Dependency Parsing¶
📖 Click to Expand
What is Dependency Parsing?¶
Each word is linked to another as either a head or dependent, forming a tree of grammatical relationships.
spaCy returns the token, its POS, and its dependency tag (like nsubj, dobj, prep, etc.).
Example:¶
"Amazon ships products quickly"
→ Amazon → nsubj (subject of "ships")
→ products → dobj (object of "ships")
try:
    import spacy
    nlp = spacy.load("en_core_web_sm")
    doc = nlp(df['text_lemmatized'].iloc[0])
    print("Dependency Parsing Output:\n")
    for token in doc:
        print(f"{token.text:15} → {token.dep_:10} → head: {token.head.text}")
except Exception:
    print("spaCy model not found or not available. Dependency parsing skipped.")
spaCy model not found or not available. Dependency parsing skipped.
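If the parse above was skipped, the small English model is likely not installed. spaCy's standard download command is shown below; after installing, re-running the cell above should print one dependency line per token (the toy sentence here is the example from earlier):
# One-time setup (run in a shell, or prefix with "!" in a notebook cell):
#   python -m spacy download en_core_web_sm
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("Amazon ships products quickly")
for token in doc:
    print(f"{token.text:10} → {token.dep_:8} → head: {token.head.text}")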
🧱 Phrase Chunking (Noun Phrases etc.)¶
📖 Click to Expand
What is Chunking?¶
Chunking extracts phrases from text — like noun phrases ("the big brown dog"), without deep parsing.
Useful for:
- Keyword extraction
- Headline generation
- Phrase-level sentiment
spaCy provides .noun_chunks to extract base noun phrases.
Example:¶
"The big brown fox jumps"
→ ["The big brown fox"]
try:
    doc = nlp(df['text_lemmatized'].iloc[0])
    print("Noun Phrases:")
    for chunk in doc.noun_chunks:
        print("-", chunk.text)
except Exception:
    print("spaCy model not found or not available. Chunking skipped.")
spaCy model not found or not available. Chunking skipped.
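When the spaCy model is unavailable, a rough fallback is NLTK's RegexpParser with a hand-written chunk grammar. This is a sketch under an assumed NP pattern (optional determiner, adjectives, then nouns), not the notebook's method:
import nltk
# nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')
grammar = "NP: {<DT>?<JJ>*<NN.*>+}"   # assumed noun-phrase pattern
parser = nltk.RegexpParser(grammar)
tags = nltk.pos_tag(nltk.word_tokenize("The big brown fox jumps"))
tree = parser.parse(tags)
print("Noun Phrases:")
for subtree in tree.subtrees(filter=lambda t: t.label() == "NP"):
    print("-", " ".join(tok for tok, _ in subtree.leaves()))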