📄 Introduction¶
In Natural Language Processing (NLP), machines must convert raw text into a format they can understand — numbers. This process is called vectorization, where we represent words, phrases, or documents as numerical vectors.
This notebook explores two foundational vectorization techniques:
- Bag of Words (BoW) – represents text using raw word counts.
- TF-IDF (Term Frequency–Inverse Document Frequency) – adjusts raw counts by penalizing common words and rewarding unique ones.
These techniques are:
- Simple yet powerful
- Widely used in baseline models
- Often used as inputs to traditional ML algorithms like Logistic Regression or Naive Bayes
We’ll cover:
- How these methods work internally
- How to implement them with `sklearn`
- Practical considerations like vocabulary pruning, stopwords, and n-gram handling
This is your first step from “text” to “features” in any NLP pipeline.
📦 Bag of Words (BoW)¶
🏗️ CountVectorizer Mechanics¶
The `CountVectorizer` in `sklearn` transforms a corpus into a document-term matrix using raw word counts.
- Each row = a document
- Each column = a unique token
- Each cell = how often the token appears in that document
No weighting, no normalization — just raw frequency counts.
Common parameters:
- `stop_words='english'`: removes common English stopwords
- `max_features=1000`: restricts the vocabulary to the top 1,000 terms by frequency
- `min_df=2`: ignores words that appear in only one document
- `ngram_range=(1, 2)`: includes unigrams and bigrams
This forms the backbone of the BoW model, which is simple, interpretable, and fast — ideal for quick baselines.
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
# Sample corpus
corpus = [
"The cat sat on the mat.",
"The dog chased the cat.",
"The cat climbed the tree."
]
# Initialize CountVectorizer
vectorizer = CountVectorizer(stop_words='english')
# Fit-transform
X = vectorizer.fit_transform(corpus)
# View as DataFrame
pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
|   | cat | chased | climbed | dog | mat | sat | tree |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 |
| 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 |
| 2 | 1 | 0 | 1 | 0 | 0 | 0 | 1 |
🔢 N-grams¶
N-grams are contiguous sequences of n items (usually words) from text.
Instead of representing only individual words (unigrams), we can also capture bigrams, trigrams, etc.
Why use them?
- Capture word context: "New York" ≠ "New" + "York"
- Improve model accuracy, especially for short texts
Tradeoffs:
- Sharply increase dimensionality (many more unique token combinations)
- Need to manage sparsity and overfitting
Use `ngram_range=(1, 2)` to include unigrams and bigrams.
# Bigrams + unigrams
vectorizer_ngram = CountVectorizer(ngram_range=(1, 2), stop_words='english')
X_ngram = vectorizer_ngram.fit_transform(corpus)
# View result
pd.DataFrame(X_ngram.toarray(), columns=vectorizer_ngram.get_feature_names_out())
|   | cat | cat climbed | cat sat | chased | chased cat | climbed | climbed tree | dog | dog chased | mat | sat | sat mat | tree |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 |
| 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 |
| 2 | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
🔠 Vocabulary Size & Pruning¶
Real-world corpora often contain:
- Rare words (typos, unique names)
- Overly common words (generic fillers)
To manage these:
- `min_df=2`: removes words that appear in fewer than 2 documents
- `max_df=0.9`: removes words that appear in more than 90% of documents
- `max_features=1000`: keeps only the top 1,000 most frequent terms
This reduces noise, speeds up training, and improves generalization.
# Loosen pruning thresholds for toy data
vectorizer_pruned = CountVectorizer(
stop_words='english',
min_df=1, # allow terms in at least 1 doc
max_df=1.0 # allow all terms
)
X_pruned = vectorizer_pruned.fit_transform(corpus)
# View result
pd.DataFrame(X_pruned.toarray(), columns=vectorizer_pruned.get_feature_names_out())
|   | cat | chased | climbed | dog | mat | sat | tree |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 |
| 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 |
| 2 | 1 | 0 | 1 | 0 | 0 | 0 | 1 |
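For contrast, here is a minimal sketch of what the stricter thresholds described above would do on this tiny corpus (the variable name `vectorizer_strict` is just illustrative): with `min_df=2`, only "cat" appears in at least two documents, so every other term is pruned.
# Stricter pruning on the toy corpus: min_df=2 keeps only terms
# that appear in at least 2 of the 3 documents -- here, just "cat"
vectorizer_strict = CountVectorizer(stop_words='english', min_df=2)
X_strict = vectorizer_strict.fit_transform(corpus)
print(vectorizer_strict.get_feature_names_out())  # ['cat']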
📈 TF-IDF¶
📚 What is TF-IDF?¶
TF-IDF stands for Term Frequency–Inverse Document Frequency. It improves upon raw word counts by down-weighting frequent terms and up-weighting rare but informative ones.
Formula:
- $\text{TF}(t, d)$: frequency of term $t$ in document $d$
- $\text{IDF}(t) = \log(N / \text{df}_t)$: inverse document frequency, where $N$ = total number of documents and $\text{df}_t$ = number of documents containing term $t$

Final score: $ \text{TF-IDF}(t, d) = \text{TF}(t, d) \times \text{IDF}(t) $
Why use it?
- Common words like "the", "is" appear everywhere → low IDF → lower weight
- Rare but specific words get boosted → better features for classification
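To make the formula concrete, here is a minimal sketch that applies the textbook definition above to the stopword-filtered tokens of the toy corpus (the token lists are written out by hand). Note that sklearn's `TfidfVectorizer` uses a slightly different variant by default (smoothed IDF and L2 normalization), so its numbers in the next section will not match these exactly.
import math

# Stopword-filtered tokens of the toy corpus, written out by hand
docs_tokens = [
    ["cat", "sat", "mat"],
    ["dog", "chased", "cat"],
    ["cat", "climbed", "tree"],
]
N = len(docs_tokens)

def tf(term, doc):
    return doc.count(term)

def idf(term):
    df_t = sum(term in doc for doc in docs_tokens)
    return math.log(N / df_t)

# "cat" appears in every document -> IDF = log(3/3) = 0 -> weight 0
# "mat" appears in one document  -> IDF = log(3/1) ≈ 1.10 -> weight boosted
print(tf("cat", docs_tokens[0]) * idf("cat"))
print(tf("mat", docs_tokens[0]) * idf("mat"))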
🔧 Tfidf Vectorizer Implementation¶
`TfidfVectorizer` in `sklearn` combines:
- Tokenization and preprocessing (like CountVectorizer)
- TF-IDF weight calculation
By default:
- `use_idf=True`: enables IDF scaling
- `smooth_idf=True`: adds 1 to document frequencies to prevent divide-by-zero
- `sublinear_tf=False`: uses raw term counts instead of log-scaling
It’s the most common feature generator for classic ML pipelines in NLP.
from sklearn.feature_extraction.text import TfidfVectorizer
# TF-IDF with defaults
tfidf_vectorizer = TfidfVectorizer(stop_words='english')
X_tfidf = tfidf_vectorizer.fit_transform(corpus)
# View as DataFrame
pd.DataFrame(X_tfidf.toarray(), columns=tfidf_vectorizer.get_feature_names_out())
|   | cat | chased | climbed | dog | mat | sat | tree |
|---|---|---|---|---|---|---|---|
| 0 | 0.385372 | 0.000000 | 0.000000 | 0.000000 | 0.652491 | 0.652491 | 0.000000 |
| 1 | 0.385372 | 0.652491 | 0.000000 | 0.652491 | 0.000000 | 0.000000 | 0.000000 |
| 2 | 0.385372 | 0.000000 | 0.652491 | 0.000000 | 0.000000 | 0.000000 | 0.652491 |
🎛️ Adjusting TF/IDF Weights¶
You can customize how TF and IDF are calculated in `TfidfVectorizer`:
- `sublinear_tf=True`: applies log(1 + tf) scaling
- `smooth_idf=False`: disables IDF smoothing
- `norm=None`: disables vector normalization
- `max_df`, `min_df`: prune common or rare terms
- `ngram_range`: applies TF-IDF over bigrams/trigrams
These adjustments can drastically change model performance. For sparse or noisy datasets, tuning `min_df`, `max_df`, and `sublinear_tf` is often crucial.
# Custom TF-IDF settings
tfidf_custom = TfidfVectorizer(
stop_words='english',
sublinear_tf=True,
smooth_idf=False,
norm=None,
min_df=1,
max_df=1.0
)
X_tfidf_custom = tfidf_custom.fit_transform(corpus)
# View results
pd.DataFrame(X_tfidf_custom.toarray(), columns=tfidf_custom.get_feature_names_out())
|   | cat | chased | climbed | dog | mat | sat | tree |
|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 0.000000 | 0.000000 | 0.000000 | 2.098612 | 2.098612 | 0.000000 |
| 1 | 1.0 | 2.098612 | 0.000000 | 2.098612 | 0.000000 | 0.000000 | 0.000000 |
| 2 | 1.0 | 0.000000 | 2.098612 | 0.000000 | 0.000000 | 0.000000 | 2.098612 |
⚙️ Vectorizer Options & Preprocessing¶
🔤 Lowercasing, Strip Accents, Token Patterns¶
Both `CountVectorizer` and `TfidfVectorizer` offer options to control how raw text is preprocessed before vectorization.
Key arguments:
- `lowercase=True`: converts all text to lowercase (default)
- `strip_accents='unicode'`: removes accents, e.g. "é" → "e"
- `token_pattern=r'\b\w\w+\b'`: regex pattern to match tokens (e.g., words with 2+ characters)
These help normalize text and reduce vocabulary noise.
vectorizer_clean = CountVectorizer(
lowercase=True,
strip_accents='unicode',
token_pattern=r'\b\w\w+\b', # removes single-letter tokens
stop_words='english'
)
X_clean = vectorizer_clean.fit_transform(corpus)
pd.DataFrame(X_clean.toarray(), columns=vectorizer_clean.get_feature_names_out())
|   | cat | chased | climbed | dog | mat | sat | tree |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 |
| 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 |
| 2 | 1 | 0 | 1 | 0 | 0 | 0 | 1 |
🧰 Custom Tokenizer / Analyzer¶
You can override `CountVectorizer`'s default tokenizer with a custom function.
- `tokenizer=...`: custom token-splitting logic
- `analyzer=...`: full custom analyzer pipeline (tokenize + process)
- `preprocessor=...`: raw string manipulation before tokenization
Useful for:
- Lemmatization or stemming
- Regex-based token splitting
- Handling emojis, hashtags, etc.
Set `token_pattern=None` when using a custom tokenizer; otherwise sklearn warns that the pattern will be ignored.
import re
from sklearn.feature_extraction.text import CountVectorizer
# Custom tokenizer: split by words and keep hashtags
def custom_tokenizer(text):
return re.findall(r'#?\b\w\w+\b', text.lower())
custom_vec = CountVectorizer(
tokenizer=custom_tokenizer,
token_pattern=None, # must be None when using custom tokenizer
stop_words='english'
)
X_custom = custom_vec.fit_transform(corpus)
pd.DataFrame(X_custom.toarray(), columns=custom_vec.get_feature_names_out())
|   | cat | chased | climbed | dog | mat | sat | tree |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 |
| 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 |
| 2 | 1 | 0 | 1 | 0 | 0 | 0 | 1 |
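The bullet list above also mentions stemming; here is a minimal sketch of a stemming tokenizer, assuming `nltk` is installed (the function name `stemming_tokenizer` is just illustrative).
from nltk.stem import PorterStemmer  # assumes nltk is installed

stemmer = PorterStemmer()

def stemming_tokenizer(text):
    # Regex-split into words of 2+ characters, then stem each token
    return [stemmer.stem(tok) for tok in re.findall(r'\b\w\w+\b', text.lower())]

stem_vec = CountVectorizer(tokenizer=stemming_tokenizer, token_pattern=None)
X_stem = stem_vec.fit_transform(corpus)
print(stem_vec.get_feature_names_out())  # e.g. "chased" -> "chase", "climbed" -> "climb"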
🪪 Sparse Matrix Handling¶
🧮 Memory vs Interpretability¶
Text vectorization often results in very large sparse matrices, especially with BoW or TF-IDF on real-world corpora.
- Thousands of features
- 95%+ zero entries
🧠 Tradeoff:
- Sparse matrices = memory-efficient, but hard to interpret directly
- Dense matrices = intuitive but expensive
Scikit-learn returns a `scipy.sparse.csr_matrix` by default. You can:
- Use `.shape` and `.nnz` for size/stats
- Use `.toarray()` to convert to dense (careful with large data)
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(stop_words='english')
X_sparse = vectorizer.fit_transform(corpus)
print("Shape:", X_sparse.shape)
print("Non-zero entries:", X_sparse.nnz)
print("Sparsity: {:.2f}%".format(100 * (1 - X_sparse.nnz / (X_sparse.shape[0] * X_sparse.shape[1]))))
Shape: (3, 7) Non-zero entries: 9 Sparsity: 57.14%
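To put a number on the memory tradeoff, a quick sketch comparing the CSR storage to its dense equivalent (on this toy corpus the difference is negligible; on real corpora the dense version is typically orders of magnitude larger):
# CSR stores only the non-zero values plus their index arrays
sparse_bytes = X_sparse.data.nbytes + X_sparse.indices.nbytes + X_sparse.indptr.nbytes
dense_bytes = X_sparse.toarray().nbytes
print("Sparse storage (bytes):", sparse_bytes)
print("Dense storage (bytes): ", dense_bytes)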
🧹 Dimensionality Reduction (optional preview)¶
Sparse matrices from BoW/TF-IDF are often:
- High-dimensional
- Noisy
- Hard for downstream models to interpret
You can reduce dimensionality with:
- TruncatedSVD: great for sparse input (LSA-style)
- PCA: works only with dense input
- UMAP / t-SNE: for visualization, not modeling
This step is optional, but useful when:
- You want to visualize text clusters
- You want to feed reduced features into ML models
from sklearn.decomposition import TruncatedSVD
# Apply SVD to TF-IDF vectors
svd = TruncatedSVD(n_components=2, random_state=42)
X_reduced = svd.fit_transform(X_tfidf)
# View as DataFrame
pd.DataFrame(X_reduced, columns=['svd_1', 'svd_2'])
|   | svd_1 | svd_2 |
|---|---|---|
| 0 | 0.657526 | -1.687260e-17 |
| 1 | 0.657526 | -6.524909e-01 |
| 2 | 0.657526 | 6.524909e-01 |
🧠 Feature Interpretation¶
🔍 Top Words by Class¶
In classification tasks, analyzing the most frequent or distinctive words per class helps uncover patterns in the data. Common techniques:
- Use `CountVectorizer` or `TfidfVectorizer` with `fit_transform(X_train)`
- Split the data by class
- Sum the word counts within each class and sort
This gives insight into:
- Class-specific jargon
- Topic-related words
- Potential bias or leakage
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np
# Fake labeled data for demo
docs = [
"great movie and fantastic acting",
"boring plot and bad acting",
"what a wonderful film",
"awful and dull",
"loved it, great performance",
"hated it, terrible movie"
]
labels = [1, 0, 1, 0, 1, 0] # 1 = positive, 0 = negative
vectorizer = CountVectorizer(stop_words='english')
X = vectorizer.fit_transform(docs)
features = np.array(vectorizer.get_feature_names_out())
# Separate by class
pos_rows = X[np.array(labels) == 1]
neg_rows = X[np.array(labels) == 0]
# Sum term frequencies by class
pos_freq = np.asarray(pos_rows.sum(axis=0)).flatten()
neg_freq = np.asarray(neg_rows.sum(axis=0)).flatten()
# Top 5 per class
top_pos = features[np.argsort(pos_freq)[-5:]]
top_neg = features[np.argsort(neg_freq)[-5:]]
print("Top words in positive class:", top_pos[::-1])
print("Top words in negative class:", top_neg[::-1])
Top words in positive class: ['great' 'wonderful' 'performance' 'movie' 'loved'] Top words in negative class: ['terrible' 'plot' 'movie' 'hated' 'dull']
🎯 Important Features in Classification¶
After training a linear model like Logistic Regression, we can inspect feature weights to understand which words push predictions toward a class. High-weighted positive features = strong predictors for that class.
This is useful for:
- Model explainability
- Debugging data leakage
- Extracting interpretable rules
Works best with:
- TF-IDF inputs
- Linear models (LogisticRegression, SGDClassifier)
from sklearn.linear_model import LogisticRegression
# Use TF-IDF for modeling
tfidf = TfidfVectorizer(stop_words='english')
X_tfidf = tfidf.fit_transform(docs)
model = LogisticRegression()
model.fit(X_tfidf, labels)
# Get feature weights
feature_names = np.array(tfidf.get_feature_names_out())
coefs = model.coef_[0]
top_pos_coef = feature_names[np.argsort(coefs)[-5:]]
top_neg_coef = feature_names[np.argsort(coefs)[:5]]
print("Most positive influence:", top_pos_coef[::-1])
print("Most negative influence:", top_neg_coef)
Most positive influence: ['great' 'wonderful' 'film' 'fantastic' 'performance'] Most negative influence: ['awful' 'dull' 'hated' 'terrible' 'bad']
🧪 Edge Cases & Practical Tips¶
📉 Extremely Sparse Inputs¶
When the vocabulary is large and most documents are short, vectorized matrices become extremely sparse — mostly zeros. Pitfalls:
- Models may overfit noise
- Similarity scores (e.g. cosine) become unstable
- Distance-based models like KNN perform poorly
Tips:
- Use `max_features`, `min_df` to limit vocabulary
- Try `TruncatedSVD` or `SelectKBest` for dimensionality reduction (see the `SelectKBest` sketch below)
- Prefer models that handle sparsity well (e.g., Logistic Regression, Naive Bayes)
print("Shape:", X_tfidf.shape)
print("Sparsity: {:.2f}%".format(100 * (1 - X_tfidf.nnz / (X_tfidf.shape[0] * X_tfidf.shape[1]))))
Shape: (6, 15) Sparsity: 80.00%
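As referenced in the tips above, here is a minimal `SelectKBest` sketch, assuming the small sentiment demo (`docs`, `labels`, `tfidf`, `X_tfidf`) from the Feature Interpretation section is still in scope.
from sklearn.feature_selection import SelectKBest, chi2

# Keep the 5 terms most associated with the labels (chi-squared test on non-negative TF-IDF)
selector = SelectKBest(chi2, k=5)
X_selected = selector.fit_transform(X_tfidf, labels)
kept_terms = np.array(tfidf.get_feature_names_out())[selector.get_support()]
print("Shape after selection:", X_selected.shape)
print("Kept terms:", kept_terms)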