
🧾 Text Cleaning & Parsing¶

  • 🔍 Tokenization
  • 🧼 Text Cleaning
  • 🛑 Stopwords Removal
  • 🔁 Lemmatization vs Stemming
  • 🧠 POS Tagging & Named Entity Recognition
  • 🔗 Dependency Parsing & Chunking
  • 🧪 Edge Cases & Text Cleaning Tips

🔍 Tokenization¶

What is Tokenization?¶

Tokenization is the process of splitting a longer piece of text into smaller units — either sentences or words — so that we can analyze and process them more easily.

Types of Tokenization¶
  • Sentence Tokenization → Splits paragraphs into individual sentences.
  • Word Tokenization → Splits sentences into individual words.

This step helps us prepare raw text for downstream NLP tasks like vectorization, sentiment analysis, topic modeling, etc.

Example:¶

Input:
"This app is awesome. I use it daily!"

Sentence Tokens:
["This app is awesome.", "I use it daily!"]

Word Tokens:
["This", "app", "is", "awesome", "I", "use", "it", "daily"]

🧱 Word Tokenization¶

What is Word Tokenization?¶

Word tokenization splits a sentence into individual words, typically using whitespace or regex rules, and usually discards punctuation. Many classic NLP pipelines rely on word-level tokens as their base unit.

Example:¶

Input:
"This app is awesome!"

Output:
["This", "app", "is", "awesome"]

In [39]:
import pandas as pd
# Load the dataset
df = pd.read_csv("datasets/amazon.csv")

# Preview the data
print(df.shape)
df.head()
(20000, 2)
Out[39]:
text sentiment
0 This is a one of the best apps acording to a b... 1
1 This is a pretty good version of the game for ... 1
2 this is a really cool game. there are a bunch ... 1
3 This is a silly game and can be frustrating, b... 1
4 This is a terrific game on any pad. Hrs of fun... 1
In [40]:
import re

sample_review = df['text'].iloc[0]
print("Original Review:\n", sample_review)

# Word tokenization using regex
word_tokens = re.findall(r'\b\w+\b', sample_review.lower())

print("\nTokenized Words:")
print(word_tokens)
Original Review:
 This is a one of the best apps acording to a bunch of people and I agree it has bombs eggs pigs TNT king pigs and realustic stuff

Tokenized Words:
['this', 'is', 'a', 'one', 'of', 'the', 'best', 'apps', 'acording', 'to', 'a', 'bunch', 'of', 'people', 'and', 'i', 'agree', 'it', 'has', 'bombs', 'eggs', 'pigs', 'tnt', 'king', 'pigs', 'and', 'realustic', 'stuff']

✂️ Sentence Tokenization¶

What is Sentence Tokenization?¶

This breaks long text into individual sentences using punctuation like ., ?, and ! as delimiters. Useful when the meaning or structure varies sentence by sentence.

Example:¶

Input:
"This app is awesome. I use it every day!"

Output:
["This app is awesome.", "I use it every day!"]

In [41]:
# Sentence tokenization using regex
sentence_tokens = re.split(r'(?<=[.!?]) +', sample_review)

print("Tokenized Sentences:")
for i, sent in enumerate(sentence_tokens):
    print(f"{i+1}. {sent}")
Tokenized Sentences:
1. This is a one of the best apps acording to a bunch of people and I agree it has bombs eggs pigs TNT king pigs and realustic stuff
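Only one sentence comes back because the sample review has no sentence-ending punctuation, so there is nothing for the regex to split on. On a multi-sentence string (made up here), the same pattern behaves as expected:

In [ ]:
demo = "This app is awesome. I use it every day! Would you recommend it?"
print(re.split(r'(?<=[.!?]) +', demo))
# ['This app is awesome.', 'I use it every day!', 'Would you recommend it?']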

Back to the top


🧼 Text Cleaning¶

What is Text Cleaning?¶

Text data is messy — it contains typos, inconsistent casing, punctuation, symbols, and other noise. Cleaning makes the text more uniform and easier to analyze or model.

Common Steps:¶
  • Lowercasing (standardization)
  • Removing punctuation and special characters
  • Regex-based normalization (e.g., removing URLs, extra whitespace)
Example:¶

"This app is AWESOME!! 😍😍 www.example.com" → "this app is awesome"

🔤 Lowercasing and Normalization¶

What is Lowercasing?¶

Converts all characters in the text to lowercase to avoid treating "Good" and "good" as different words.

Example:¶

"This App is GREAT!" → "this app is great!"

This step also sometimes includes basic normalization like removing extra whitespace or fixing encoding artifacts.

In [42]:
# Lowercasing text column
df['text_lower'] = df['text'].str.lower()

# Show before and after for a single row
print("Original:", df['text'].iloc[0])
print("Lowercased:", df['text_lower'].iloc[0])
Original: This is a one of the best apps acording to a bunch of people and I agree it has bombs eggs pigs TNT king pigs and realustic stuff
Lowercased: this is a one of the best apps acording to a bunch of people and i agree it has bombs eggs pigs tnt king pigs and realustic stuff

✨ Punctuation & Special Character Removal¶

Why Remove Punctuation?¶

Punctuation doesn’t usually add value in basic NLP tasks like sentiment analysis or topic modeling. Removing it simplifies the vocabulary.

Example:¶

"Wow!!! This app is amazing :)" → "Wow This app is amazing "

Emoji and symbols can also be removed unless they are part of your feature set.

In [43]:
import string

# Remove punctuation
df['text_no_punct'] = df['text_lower'].str.replace(f"[{re.escape(string.punctuation)}]", "", regex=True)

print("Before:", df['text_lower'].iloc[0])
print("After:", df['text_no_punct'].iloc[0])
Before: this is a one of the best apps acording to a bunch of people and i agree it has bombs eggs pigs tnt king pigs and realustic stuff
After: this is a one of the best apps acording to a bunch of people and i agree it has bombs eggs pigs tnt king pigs and realustic stuff
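As noted above, emoji and other symbols can also be stripped when they aren't part of the feature set; string.punctuation only covers ASCII punctuation, so they survive the step above. A rough sketch using a Unicode-range regex (the ranges are an assumption covering common emoji blocks and may need tuning for your data):

In [ ]:
# Ranges below cover common emoji/pictograph blocks (assumed; extend as needed)
emoji_pattern = re.compile(
    "[\U0001F300-\U0001FAFF\U00002600-\U000027BF\U0001F1E6-\U0001F1FF]+"
)

sample = "this app is awesome 😍😍"
print(emoji_pattern.sub("", sample).strip())  # -> "this app is awesome"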

🧪 Regex-Based Cleaning¶

What is Regex Cleaning?¶

Regular expressions (regex) let us target and clean specific patterns like:

  • URLs
  • HTML tags
  • Extra whitespace
  • Repeated characters
Example:¶

"Check this out! https://amazon.com 😍 " → "check this out"

In [44]:
# Regex-based cleaning: remove URLs and excess whitespace
df['text_cleaned'] = df['text_no_punct'].str.replace(r"http\S+|www\S+|https\S+", "", regex=True)
df['text_cleaned'] = df['text_cleaned'].str.replace(r"\s+", " ", regex=True).str.strip()

print("Before:", df['text_no_punct'].iloc[0])
print("After:", df['text_cleaned'].iloc[0])
Before: this is a one of the best apps acording to a bunch of people and i agree it has bombs eggs pigs tnt king pigs and realustic stuff
After: this is a one of the best apps acording to a bunch of people and i agree it has bombs eggs pigs tnt king pigs and realustic stuff
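The list above also mentions HTML tags and repeated characters, which this cell doesn't touch. A hedged sketch of those two extra patterns on a made-up string (note that in practice URL removal is usually done before punctuation stripping, since stripping punctuation first mangles the URLs this regex looks for):

In [ ]:
html_pattern = re.compile(r"<[^>]+>")      # HTML tags such as <br> or <p>
repeat_pattern = re.compile(r"(.)\1{2,}")  # runs of 3+ identical characters

demo = "<p>sooooo goooood</p>"
demo = html_pattern.sub(" ", demo)         # drop the tags
demo = repeat_pattern.sub(r"\1\1", demo)   # squash long runs down to two chars
print(demo.strip())                        # -> "soo good"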

Back to the top


🛑 Stopwords Removal¶

What are Stopwords?¶

Stopwords are common words like “the”, “is”, “and”, “in” — they appear frequently but carry little semantic value.

Removing them reduces noise, shrinks vocabulary size, and improves signal for tasks like classification or topic modeling.

Example:¶

"this is the best app in the world"
→ "best app world"

📚 Using NLTK/spaCy Stopword Lists¶

Why use built-in stopwords?¶

NLTK and spaCy both offer curated stopword lists that can be applied directly to cleaned tokens. These help quickly eliminate unimportant filler words.

Example:¶

"this app is one of the best"
→ "app one best"

In [45]:
# !pip3 install nltk
import nltk
from nltk.corpus import stopwords

# Load NLTK stopwords
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

# Remove stopwords from cleaned text
df['text_no_stopwords'] = df['text_cleaned'].apply(
    lambda x: ' '.join([word for word in x.split() if word not in stop_words])
)

print("Before:", df['text_cleaned'].iloc[0])
print("After:", df['text_no_stopwords'].iloc[0])
Before: this is a one of the best apps acording to a bunch of people and i agree it has bombs eggs pigs tnt king pigs and realustic stuff
After: one best apps acording bunch people agree bombs eggs pigs tnt king pigs realustic stuff
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/ashrithreddy/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
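spaCy's stopword list, mentioned above, works the same way; a minimal sketch, assuming spaCy is installed (no language model is needed just for the stopword set, and its list is somewhat larger than NLTK's, so results can differ slightly):

In [ ]:
# !pip3 install spacy
from spacy.lang.en.stop_words import STOP_WORDS as spacy_stopwords

df['text_no_stopwords_spacy'] = df['text_cleaned'].apply(
    lambda x: ' '.join([word for word in x.split() if word not in spacy_stopwords])
)

print("NLTK :", df['text_no_stopwords'].iloc[0])
print("spaCy:", df['text_no_stopwords_spacy'].iloc[0])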

🧰 Custom Stopword Handling¶

Why use custom stopwords?¶

Depending on your dataset, some words (like “app”, “product”, “phone”) may occur frequently but not help with classification.

You can define a custom list based on domain knowledge or frequency analysis.

Example:¶

"this app is very helpful and awesome" → "helpful awesome"
(if “app”, “very”, “this” are in custom list)

In [46]:
# Example: Custom stopwords for Amazon reviews
custom_stopwords = set(["app", "product", "amazon", "device", "really", "very"])

# Remove custom stopwords
df['text_no_custom_stopwords'] = df['text_no_stopwords'].apply(
    lambda x: ' '.join([word for word in x.split() if word not in custom_stopwords])
)

print("Before:", df['text_no_stopwords'].iloc[0])
print("After:", df['text_no_custom_stopwords'].iloc[0])
Before: one best apps acording bunch people agree bombs eggs pigs tnt king pigs realustic stuff
After: one best apps acording bunch people agree bombs eggs pigs tnt king pigs realustic stuff
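The custom list above is hand-picked (and doesn't happen to hit this particular review). As mentioned, it can also be driven by frequency analysis; a rough sketch that surfaces the most common tokens in the corpus as candidates for the custom list:

In [ ]:
from collections import Counter

# Count tokens across the already stopword-filtered column
token_counts = Counter(
    word for review in df['text_no_stopwords'] for word in review.split()
)

# Inspect the most frequent tokens and promote domain filler words to the custom list
print(token_counts.most_common(20))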

Back to the top


🔁 Lemmatization vs Stemming¶

What are Lemmatization and Stemming?¶

Both reduce words to a base/root form. This helps unify variants like “running”, “ran”, “runs” → “run”.

  • Stemming → crude chopping of suffixes (fast, less accurate)
  • Lemmatization → uses linguistic context and vocab to find valid dictionary form (slower, more accurate)
Example:¶

"running", "ran", "runs"
→ Stemmed: "run", "ran", "run"
→ Lemmatized: "run", "run", "run"

🌱 Stemming (Porter, Snowball)¶

What is Stemming?¶

Stemming strips suffixes without understanding meaning. It's fast and simple but may produce non-words.

Popular algorithms: Porter, Snowball, Lancaster

Example:¶

"connection", "connected", "connecting"
→ "connect", "connect", "connect"

In [47]:
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Apply stemming
df['text_stemmed'] = df['text_no_custom_stopwords'].apply(
    lambda x: ' '.join([stemmer.stem(word) for word in x.split()])
)

print("Before:", df['text_no_custom_stopwords'].iloc[0])
print("After:", df['text_stemmed'].iloc[0])
Before: one best apps acording bunch people agree bombs eggs pigs tnt king pigs realustic stuff
After: one best app acord bunch peopl agre bomb egg pig tnt king pig realust stuff
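Porter is used above; Snowball and Lancaster (also mentioned above) are available in NLTK too. A small side-by-side sketch on a few sample words to compare how aggressively each algorithm chops:

In [ ]:
from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer

sample_words = ["connection", "connected", "connecting", "running", "apps"]
for name, st in [("Porter", PorterStemmer()),
                 ("Snowball", SnowballStemmer("english")),
                 ("Lancaster", LancasterStemmer())]:
    print(f"{name:10}", [st.stem(w) for w in sample_words])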

🍃 Lemmatization (WordNet, spaCy)¶

What is Lemmatization?¶

Lemmatization reduces words to their base dictionary form (lemma) using vocabulary and POS tags.

Tools like NLTK’s WordNetLemmatizer or spaCy do this using linguistic knowledge.

Example:¶

"am", "are", "is" → "be", "be", "be"

"better" → "good"

In [48]:
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
nltk.download('wordnet')
nltk.download('omw-1.4')

lemmatizer = WordNetLemmatizer()

# Apply lemmatization
df['text_lemmatized'] = df['text_no_custom_stopwords'].apply(
    lambda x: ' '.join([lemmatizer.lemmatize(word) for word in x.split()])
)

print("Before:", df['text_no_custom_stopwords'].iloc[0])
print("After:", df['text_lemmatized'].iloc[0])
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/ashrithreddy/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     /Users/ashrithreddy/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
Before: one best apps acording bunch people agree bombs eggs pigs tnt king pigs realustic stuff
After: one best apps acording bunch people agree bomb egg pig tnt king pig realustic stuff
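By default, WordNetLemmatizer treats every word as a noun, which is why the "better" → "good" example above only happens when an adjective POS is supplied. A hedged sketch of POS-aware lemmatization that maps Penn Treebank tags to WordNet tags (assumes the POS tagger resource is downloaded):

In [ ]:
nltk.download('averaged_perceptron_tagger')  # or 'averaged_perceptron_tagger_eng' on newer NLTK

def to_wordnet_pos(tag):
    """Map a Penn Treebank tag to one of WordNet's four POS classes."""
    if tag.startswith('J'):
        return wordnet.ADJ
    if tag.startswith('V'):
        return wordnet.VERB
    if tag.startswith('R'):
        return wordnet.ADV
    return wordnet.NOUN

def lemmatize_with_pos(text):
    tagged = nltk.pos_tag(text.split())
    return ' '.join(lemmatizer.lemmatize(word, to_wordnet_pos(tag)) for word, tag in tagged)

print(lemmatizer.lemmatize("better"))           # 'better' (treated as a noun)
print(lemmatizer.lemmatize("better", pos="a"))  # 'good'
print(lemmatize_with_pos(df['text_no_custom_stopwords'].iloc[0]))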

Back to the top


🧠 POS Tagging & Named Entity Recognition¶

What are POS & NER?¶

These are advanced parsing techniques that extract grammatical roles and real-world entities from text.

  • POS Tagging (Part-of-Speech) → assigns a tag like noun, verb, adjective to each word
  • NER (Named Entity Recognition) → identifies names, places, orgs, dates, etc.
Why it matters:¶
  • Helps with filtering nouns, extracting subjects, building features
  • Useful in recommendation systems, summarization, question answering, etc.
Example:¶

Text: "Amazon was founded in Seattle in 1994"
→ POS: ("Amazon", NNP), ("founded", VBD), ...
→ NER: ORG = Amazon, GPE = Seattle, DATE = 1994

🏷️ POS Tagging¶

What is POS Tagging?¶

Part-of-Speech tagging assigns grammatical labels to each word in a sentence — like noun, verb, adjective, etc.

It's useful for extracting only the verbs or nouns, or for filtering words by their syntactic role.

Example:¶

"this app works beautifully"
→ [("this", DT), ("app", NN), ("works", VBZ), ("beautifully", RB)]

In [49]:
import nltk
nltk.data.path.append('/Users/ashrithreddy/nltk_data')  # point NLTK at the local data directory

# If the tagger resource is missing, uncomment one of the downloads below:
# nltk.data.find('taggers/averaged_perceptron_tagger')  # raises LookupError if missing
# nltk.download('popular', download_dir='/Users/ashrithreddy/nltk_data')
# nltk.download('averaged_perceptron_tagger_eng')       # resource name on newer NLTK versions

# POS tagging on lemmatized text
df['pos_tags'] = df['text_lemmatized'].apply(lambda x: nltk.pos_tag(x.split()))

print("Sample Tagged Sentence:")
print(df['pos_tags'].iloc[0])
Sample Tagged Sentence:
[('one', 'CD'), ('best', 'JJS'), ('apps', 'NN'), ('acording', 'VBG'), ('bunch', 'JJ'), ('people', 'NNS'), ('agree', 'VBP'), ('bomb', 'NN'), ('egg', 'NN'), ('pig', 'NN'), ('tnt', 'NN'), ('king', 'VBG'), ('pig', 'JJ'), ('realustic', 'JJ'), ('stuff', 'NN')]
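As noted above, a common use of POS tags is to keep only certain word classes. A small sketch that keeps just the nouns (all Penn Treebank noun tags start with 'NN'):

In [ ]:
# Keep only noun tokens (NN, NNS, NNP, NNPS) from each tagged review
df['nouns_only'] = df['pos_tags'].apply(
    lambda tags: [word for word, tag in tags if tag.startswith('NN')]
)

print(df['nouns_only'].iloc[0])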

🧾 Named Entity Recognition (NER)¶

What is NER?¶

NER identifies named entities in text like organizations, people, countries, dates, etc.

It builds structure from unstructured text, enabling high-level features like “brand mentioned”, “location detected”, etc.

Example:¶

"Jeff Bezos founded Amazon in 1994"
→ PERSON = Jeff Bezos, ORG = Amazon, DATE = 1994

In [57]:
import nltk
import pickle
from nltk.chunk import ChunkParserI
from nltk import pos_tag
from nltk.tokenize import TreebankWordTokenizer

# Force NLTK to look in your custom path
nltk.data.path.clear()
nltk.data.path.append('/Users/ashrithreddy/nltk_data')

# Load the chunker manually using correct encoding
chunker_path = '/Users/ashrithreddy/nltk_data/chunkers/maxent_ne_chunker/english_ace_multiclass.pickle'
with open(chunker_path, 'rb') as f:
    chunker: ChunkParserI = pickle.load(f, encoding='latin1')  # 'rb' already returns a buffered reader

# Tokenize and tag
text = df['text_lemmatized'].iloc[0]
tokens = TreebankWordTokenizer().tokenize(text)
pos_tags = pos_tag(tokens)

# Run NER
ner_tree = chunker.parse(pos_tags)

# Display entities
print("Named Entities:")
for subtree in ner_tree:
    if hasattr(subtree, 'label'):
        entity = " ".join([token for token, pos in subtree])
        print(f"{entity} → {subtree.label()}")
Named Entities:
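No entities come back here, which is expected: the text was lowercased and stripped of punctuation, and this chunker relies heavily on capitalization cues. A hedged alternative sketch using the standard nltk.ne_chunk helper on the original, uncleaned review text (assumes the 'maxent_ne_chunker' and 'words' resources are downloaded; newer NLTK versions name the chunker 'maxent_ne_chunker_tab'):

In [ ]:
nltk.download('maxent_ne_chunker')
nltk.download('words')

raw_text = df['text'].iloc[0]  # original casing preserved
raw_tree = nltk.ne_chunk(pos_tag(TreebankWordTokenizer().tokenize(raw_text)))

print("Named Entities:")
for subtree in raw_tree:
    if hasattr(subtree, 'label'):
        print(" ".join(token for token, pos in subtree), "→", subtree.label())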

Back to the top


🔗 Dependency Parsing & Chunking¶

What is Dependency Parsing?¶

Dependency parsing maps the grammatical structure of a sentence by identifying relationships between words — like subject, object, modifier, etc.

Phrase chunking (aka shallow parsing) extracts useful noun phrases, verb phrases, etc., without building a full parse tree.

Example:¶

"The quick brown fox jumps over the lazy dog"
→ fox (subject of jumps)
→ over the lazy dog (prepositional modifier)

These structures help with tasks like question answering, relation extraction, and knowledge graph construction.

🕸️ Dependency Parsing¶

What is Dependency Parsing?¶

Each word is linked to another as either a head or dependent, forming a tree of grammatical relationships.

spaCy returns each token together with its POS and its dependency tag (like nsubj, dobj, prep, etc.).

Example:¶

"Amazon ships products quickly"
→ Amazon → nsubj (subject of "ships")
→ products → dobj (object of "ships")

In [58]:
try:
    import spacy
    nlp = spacy.load("en_core_web_sm")
    
    doc = nlp(df['text_lemmatized'].iloc[0])
    print("Dependency Parsing Output:\n")
    for token in doc:
        print(f"{token.text:15} → {token.dep_:10} → head: {token.head.text}")
except (ImportError, OSError):  # spaCy not installed or model missing
    print("spaCy model not found or not available. Dependency parsing skipped.")
spaCy model not found or not available. Dependency parsing skipped.
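The fallback fired because the en_core_web_sm model isn't installed; it can usually be added with python -m spacy download en_core_web_sm. Once the model is available, a minimal sketch on the example sentence from above (expected roughly: Amazon → nsubj, products → dobj):

In [ ]:
# !python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Amazon ships products quickly")

for token in doc:
    print(f"{token.text:10} → {token.dep_:8} → head: {token.head.text}")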

🧱 Phrase Chunking (Noun Phrases etc.)¶

What is Chunking?¶

Chunking extracts phrases from text — like noun phrases ("the big brown dog"), without deep parsing.

Useful for:

  • Keyword extraction
  • Headline generation
  • Phrase-level sentiment

spaCy provides .noun_chunks to extract base noun phrases.

Example:¶

"The big brown fox jumps"
→ ["The big brown fox"]

In [59]:
try:
    doc = nlp(df['text_lemmatized'].iloc[0])
    print("Noun Phrases:")
    for chunk in doc.noun_chunks:
        print("-", chunk.text)
except (NameError, OSError):  # nlp not defined (spaCy unavailable) or model missing
    print("spaCy model not found or not available. Chunking skipped.")
spaCy model not found or not available. Chunking skipped.
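When the spaCy model isn't available, NLTK's RegexpParser offers a lightweight alternative: it chunks on POS-tag patterns rather than a full parse. A hedged sketch that pulls base noun phrases (optional determiner, any adjectives, one or more nouns) from the POS tags computed earlier:

In [ ]:
from nltk import RegexpParser

# Grammar: optional determiner, any number of adjectives, then one or more nouns
np_parser = RegexpParser("NP: {<DT>?<JJ>*<NN.*>+}")

chunk_tree = np_parser.parse(df['pos_tags'].iloc[0])

print("Noun Phrases:")
for subtree in chunk_tree.subtrees(filter=lambda t: t.label() == "NP"):
    print("-", " ".join(word for word, tag in subtree.leaves()))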

Back to the top