
🧾 Vectorization¶

  • 📄 Introduction
  • 📦 Bag of Words (BoW)
  • 📈 TF-IDF
  • ⚙️ Vectorizer Options & Preprocessing
  • 🪪 Sparse Matrix Handling
  • 🧠 Feature Interpretation
  • 🧪 Edge Cases & Practical Tips

📄 Introduction¶

📖 Click to Expand

In Natural Language Processing (NLP), machines must convert raw text into a format they can understand — numbers. This process is called vectorization, where we represent words, phrases, or documents as numerical vectors.

This notebook explores two foundational vectorization techniques:

  1. Bag of Words (BoW) – represents text using raw word counts.
  2. TF-IDF (Term Frequency–Inverse Document Frequency) – adjusts raw counts by penalizing common words and rewarding unique ones.

These techniques are:

  • Simple yet powerful
  • Widely used in baseline models
  • Often used as inputs to traditional ML algorithms like Logistic Regression or Naive Bayes

We’ll cover:

  • How these methods work internally
  • How to implement them with sklearn
  • Practical considerations like vocabulary pruning, stopwords, and n-gram handling

This is your first step from “text” to “features” in any NLP pipeline.

Back to the top


📦 Bag of Words (BoW)¶

🏗️ CountVectorizer Mechanics¶

📖 Click to Expand

The CountVectorizer in sklearn transforms a corpus into a document-term matrix using raw word counts.

  • Each row = a document
  • Each column = a unique token
  • Each cell = how often the token appears in that document

No weighting, no normalization — just raw frequency counts.

Common parameters:

  • stop_words='english': removes common English stopwords
  • max_features=1000: restricts vocabulary to top 1,000 terms by frequency
  • min_df=2: ignores words that appear in only 1 doc
  • ngram_range=(1,2): includes unigrams and bigrams

This forms the backbone of the BoW model, which is simple, interpretable, and fast — ideal for quick baselines.

In [17]:
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

# Sample corpus
corpus = [
    "The cat sat on the mat.",
    "The dog chased the cat.",
    "The cat climbed the tree."
]

# Initialize CountVectorizer
vectorizer = CountVectorizer(stop_words='english')

# Fit-transform
X = vectorizer.fit_transform(corpus)

# View as DataFrame
pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
Out[17]:
cat chased climbed dog mat sat tree
0 1 0 0 0 1 1 0
1 1 1 0 1 0 0 0
2 1 0 1 0 0 0 1
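
To see what CountVectorizer is doing internally, here is a minimal hand-rolled sketch using collections.Counter (the manual_bow helper is illustrative only; note it does not remove stop words, so "the" shows up too):

from collections import Counter
import re

def manual_bow(docs):
    # Tokenize each document the same way as the default token_pattern (words of 2+ chars)
    tokenized = [re.findall(r'\b\w\w+\b', d.lower()) for d in docs]
    # Vocabulary = sorted set of all tokens across documents
    vocab = sorted({tok for doc in tokenized for tok in doc})
    # One row of raw counts per document
    counts = [[Counter(doc)[term] for term in vocab] for doc in tokenized]
    return counts, vocab

counts, vocab = manual_bow(corpus)
print(vocab)   # includes 'the', since there is no stop-word removal here
print(counts)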

🔢 N-grams¶

📖 Click to Expand

N-grams are contiguous sequences of n items (usually words) from text.
Instead of representing only individual words (unigrams), we can also capture bigrams, trigrams, etc.

Why use them?

  • Capture word context: "New York" ≠ "New" + "York"
  • Improve model accuracy, especially for short texts

Tradeoffs:

  • Sharply increase dimensionality (the number of distinct n-grams grows rapidly with n)
  • Need to manage sparsity and overfitting

Use ngram_range=(1,2) to include unigrams and bigrams.

In [18]:
# Bigrams + unigrams
vectorizer_ngram = CountVectorizer(ngram_range=(1, 2), stop_words='english')
X_ngram = vectorizer_ngram.fit_transform(corpus)

# View result
pd.DataFrame(X_ngram.toarray(), columns=vectorizer_ngram.get_feature_names_out())
Out[18]:
   cat  cat climbed  cat sat  chased  chased cat  climbed  climbed tree  dog  dog chased  mat  sat  sat mat  tree
0    1            0        1       0           0        0             0    0           0    1    1        1     0
1    1            0        0       1           1        0             0    1           1    0    0        0     0
2    1            1        0       0           0        1             1    0           0    0    0        0     1
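
To inspect exactly which unigrams and bigrams are extracted from a single document, you can call the vectorizer's build_analyzer() method; a quick illustrative check on one sentence from the corpus:

# Tokens and bigrams produced for one document (stop words are removed first)
analyzer = vectorizer_ngram.build_analyzer()
print(analyzer("The dog chased the cat."))
# -> ['dog', 'chased', 'cat', 'dog chased', 'chased cat']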

🔠 Vocabulary Size & Pruning¶

📖 Click to Expand

Real-world corpora often contain:

  • Rare words (typos, unique names)
  • Overly common words (generic fillers)

To manage these:

  • min_df=2: removes words that appear in fewer than 2 documents
  • max_df=0.9: removes words that appear in more than 90% of documents
  • max_features=1000: keeps only the top 1,000 frequent terms

This reduces noise, speeds up training, and improves generalization.

In [19]:
# Loosen pruning thresholds for toy data
vectorizer_pruned = CountVectorizer(
    stop_words='english',
    min_df=1,       # allow terms in at least 1 doc
    max_df=1.0      # allow all terms
)

X_pruned = vectorizer_pruned.fit_transform(corpus)

# View result
pd.DataFrame(X_pruned.toarray(), columns=vectorizer_pruned.get_feature_names_out())
Out[19]:
cat chased climbed dog mat sat tree
0 1 0 0 0 1 1 0
1 1 1 0 1 0 0 0
2 1 0 1 0 0 0 1
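
For contrast, a quick sketch of what stricter pruning does on this toy corpus: with min_df=2, only terms appearing in at least 2 of the 3 documents survive, which here leaves only "cat" (vectorizer_strict is an illustrative name):

# Strict pruning: keep only terms that appear in at least 2 documents
vectorizer_strict = CountVectorizer(stop_words='english', min_df=2)
X_strict = vectorizer_strict.fit_transform(corpus)
print(vectorizer_strict.get_feature_names_out())  # only 'cat' survives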

Back to the top


📈 TF-IDF¶

📚 What is TF-IDF?¶

📖 Click to Expand

TF-IDF stands for Term Frequency–Inverse Document Frequency. It improves upon raw word counts by down-weighting frequent terms and up-weighting rare but informative ones.

Formula:

  • $\text{TF}(t, d)$: frequency of term $t$ in document $d$
  • $\text{IDF}(t) = \log(N / \text{df}_t)$, where $N$ = total number of documents and $\text{df}_t$ = number of documents containing term $t$

Final score: $ \text{TF-IDF}(t, d) = \text{TF}(t, d) \times \text{IDF}(t) $

Why use it?

  • Common words like "the", "is" appear everywhere → low IDF → lower weight
  • Rare but specific words get boosted → better features for classification
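
A quick worked example on the toy corpus above: "cat" appears in all 3 documents while "mat" appears in only 1, so $\text{IDF}(\text{cat}) = \log(3/3) = 0$ and $\text{IDF}(\text{mat}) = \log(3/1) \approx 1.10$ (natural log). The ubiquitous term is suppressed and the distinctive one is boosted. (Note that sklearn's implementation adds 1 to each IDF value, so terms occurring in every document are down-weighted but not zeroed out, as the outputs below show.)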

🔧 TfidfVectorizer Implementation¶

📖 Click to Expand

TfidfVectorizer in sklearn combines:

  1. Tokenization and preprocessing (like CountVectorizer)
  2. TF-IDF weight calculation

By default:

  • use_idf=True: enable IDF weighting
  • smooth_idf=True: adds 1 to document frequencies (as if one extra document contained every term), preventing zero divisions
  • sublinear_tf=False: raw term counts are used instead of 1 + log(tf) scaling
  • norm='l2': each document vector is scaled to unit length

It’s the most common feature generator for classic ML pipelines in NLP.

In [20]:
from sklearn.feature_extraction.text import TfidfVectorizer

# TF-IDF with defaults
tfidf_vectorizer = TfidfVectorizer(stop_words='english')
X_tfidf = tfidf_vectorizer.fit_transform(corpus)

# View as DataFrame
pd.DataFrame(X_tfidf.toarray(), columns=tfidf_vectorizer.get_feature_names_out())
Out[20]:
cat chased climbed dog mat sat tree
0 0.385372 0.000000 0.000000 0.000000 0.652491 0.652491 0.000000
1 0.385372 0.652491 0.000000 0.652491 0.000000 0.000000 0.000000
2 0.385372 0.000000 0.652491 0.000000 0.000000 0.000000 0.652491
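
The numbers above can be reproduced by hand. A small sanity-check sketch, assuming sklearn's default settings (smooth_idf=True, so idf = ln((1+N)/(1+df)) + 1, followed by L2 normalization of each row); it recomputes row 0 ("cat sat mat"):

import numpy as np

N = 3                                              # number of documents
idf = lambda df: np.log((1 + N) / (1 + df)) + 1    # smoothed IDF, as sklearn computes it
row0 = np.array([1 * idf(3),   # 'cat' appears in all 3 docs
                 1 * idf(1),   # 'mat' appears in 1 doc
                 1 * idf(1)])  # 'sat' appears in 1 doc
print(row0 / np.linalg.norm(row0))  # ≈ [0.3854, 0.6525, 0.6525]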

🎛️ Adjusting TF/IDF Weights¶

📖 Click to Expand

You can customize how TF and IDF are calculated in TfidfVectorizer:

  • sublinear_tf=True: replace raw term frequency with 1 + log(tf)
  • smooth_idf=False: disables IDF smoothing
  • norm=None: disables vector normalization
  • max_df, min_df: prune common or rare terms
  • ngram_range: apply TF-IDF over bigrams/trigrams

These adjustments can drastically change model performance.

For sparse or noisy datasets, tuning min_df, max_df, and sublinear_tf is often crucial.

In [21]:
# Custom TF-IDF settings
tfidf_custom = TfidfVectorizer(
    stop_words='english',
    sublinear_tf=True,
    smooth_idf=False,
    norm=None,
    min_df=1,
    max_df=1.0
)

X_tfidf_custom = tfidf_custom.fit_transform(corpus)

# View results
pd.DataFrame(X_tfidf_custom.toarray(), columns=tfidf_custom.get_feature_names_out())
Out[21]:
cat chased climbed dog mat sat tree
0 1.0 0.000000 0.000000 0.000000 2.098612 2.098612 0.000000
1 1.0 2.098612 0.000000 2.098612 0.000000 0.000000 0.000000
2 1.0 0.000000 2.098612 0.000000 0.000000 0.000000 2.098612
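
The repeated value 2.098612 above is no accident: with smooth_idf=False sklearn's IDF is ln(N/df) + 1, sublinear TF gives 1 + ln(1) = 1 for a single occurrence, and norm=None skips normalization (the constant "cat" column is 1.0 because ln(3/3) + 1 = 1). A one-line check:

import math
print(1 * (math.log(3 / 1) + 1))  # ≈ 2.0986, matching every non-'cat' entry above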

Back to the top


⚙️ Vectorizer Options & Preprocessing¶

🔤 Lowercasing, Strip Accents, Token Patterns¶

📖 Click to Expand

Both CountVectorizer and TfidfVectorizer offer options to control how raw text is preprocessed before vectorization.

Key arguments:

  • lowercase=True: convert all text to lowercase (default)
  • strip_accents='unicode': remove accents like "é" → "e"
  • token_pattern=r'\b\w\w+\b': regex pattern to match tokens (e.g., words with 2+ chars)

These help normalize text and reduce vocabulary noise.

In [22]:
vectorizer_clean = CountVectorizer(
    lowercase=True,
    strip_accents='unicode',
    token_pattern=r'\b\w\w+\b',  # removes single-letter tokens
    stop_words='english'
)

X_clean = vectorizer_clean.fit_transform(corpus)

pd.DataFrame(X_clean.toarray(), columns=vectorizer_clean.get_feature_names_out())
Out[22]:
cat chased climbed dog mat sat tree
0 1 0 0 0 1 1 0
1 1 1 0 1 0 0 0
2 1 0 1 0 0 0 1
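
The toy corpus has no accents, so the settings above don't visibly change anything here. A small illustrative check on a made-up accented string (demo_vec and the sample text are hypothetical):

# Accent stripping + lowercasing on an accented sample
demo_vec = CountVectorizer(lowercase=True, strip_accents='unicode', stop_words='english')
demo_vec.fit(["Héllo from the Café, naïve visitor!"])
print(demo_vec.get_feature_names_out())  # ['cafe' 'hello' 'naive' 'visitor']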

🧰 Custom Tokenizer / Analyzer¶

📖 Click to Expand

You can override CountVectorizer’s default tokenizer with a custom function.

  • tokenizer=...: custom token splitting logic
  • analyzer=...: full custom analyzer pipeline (tokenize + process)
  • preprocessor=...: raw string manipulation before tokenization

Useful for:

  • Lemmatization or stemming
  • Regex-based token splitting
  • Handling emojis, hashtags, etc.

Set token_pattern=None when supplying a custom tokenizer; otherwise sklearn warns that the pattern will be ignored.

In [23]:
import re
from sklearn.feature_extraction.text import CountVectorizer

# Custom tokenizer: split by words and keep hashtags
def custom_tokenizer(text):
    return re.findall(r'#?\b\w\w+\b', text.lower())

custom_vec = CountVectorizer(
    tokenizer=custom_tokenizer,
    token_pattern=None,       # must be None when using custom tokenizer
    stop_words='english'
)

X_custom = custom_vec.fit_transform(corpus)

pd.DataFrame(X_custom.toarray(), columns=custom_vec.get_feature_names_out())
Out[23]:
cat chased climbed dog mat sat tree
0 1 0 0 0 1 1 0
1 1 1 0 1 0 0 0
2 1 0 1 0 0 0 1
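
The toy corpus contains no hashtags, so the custom tokenizer behaves like the default here. Calling it directly on a made-up, tweet-like string shows the difference:

# The hashtag-aware tokenizer applied to a hypothetical string
print(custom_tokenizer("Loved the movie! #oscars #must_watch"))
# -> ['loved', 'the', 'movie', '#oscars', '#must_watch']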

Back to the top


🪪 Sparse Matrix Handling¶

🧮 Memory vs Interpretability¶

📖 Click to Expand

Text vectorization often results in very large sparse matrices, especially with BoW or TF-IDF on real-world corpora.

  • Thousands of features
  • 95%+ zero entries

🧠 Tradeoff:

  • Sparse matrices = memory-efficient, but hard to interpret directly
  • Dense matrices = intuitive but expensive

Scikit-learn returns scipy.sparse.csr_matrix by default. You can:

  • Use .shape, .nnz for size/stats
  • Use .toarray() to convert to dense (careful with large data)
In [24]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(stop_words='english')
X_sparse = vectorizer.fit_transform(corpus)

print("Shape:", X_sparse.shape)
print("Non-zero entries:", X_sparse.nnz)
print("Sparsity: {:.2f}%".format(100 * (1 - X_sparse.nnz / (X_sparse.shape[0] * X_sparse.shape[1]))))
Shape: (3, 7)
Non-zero entries: 9
Sparsity: 57.14%
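
To make the memory tradeoff concrete, you can compare the bytes held by the CSR representation with its dense equivalent; a rough sketch (on this tiny corpus the numbers are trivial, but the gap widens quickly on real corpora):

dense = X_sparse.toarray()
csr_bytes = X_sparse.data.nbytes + X_sparse.indices.nbytes + X_sparse.indptr.nbytes
print("CSR bytes:  ", csr_bytes)
print("Dense bytes:", dense.nbytes)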

🧹 Dimensionality Reduction (optional preview)¶

📖 Click to Expand

Sparse matrices from BoW/TF-IDF are often:

  • High-dimensional
  • Noisy
  • Hard for downstream models to interpret

You can reduce dimensionality with:

  • TruncatedSVD: great for sparse input (LSA-style)
  • PCA: works only with dense input
  • UMAP / t-SNE: for visualization, not modeling

This step is optional, but useful when:

  • You want to visualize text clusters
  • You want to feed reduced features into ML models
In [25]:
from sklearn.decomposition import TruncatedSVD

# Apply SVD to TF-IDF vectors
svd = TruncatedSVD(n_components=2, random_state=42)
X_reduced = svd.fit_transform(X_tfidf)

# View as DataFrame
pd.DataFrame(X_reduced, columns=['svd_1', 'svd_2'])
Out[25]:
svd_1 svd_2
0 0.657526 -1.687260e-17
1 0.657526 -6.524909e-01
2 0.657526 6.524909e-01
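
TruncatedSVD also reports how much variance each component captures, which is a quick way to judge whether two components are enough; for example:

# Variance explained by each SVD component
print(svd.explained_variance_ratio_)
print("Total explained:", svd.explained_variance_ratio_.sum())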

Back to the top


🧠 Feature Interpretation¶

🔍 Top Words by Class¶

📖 Click to Expand

In classification tasks, analyzing the most frequent or distinctive words per class helps uncover patterns in the data.

Common techniques:

  • Use CountVectorizer or TfidfVectorizer with fit_transform(X_train)
  • Split the data by class
  • Sum the word counts within each class and sort

This gives insight into:

  • Class-specific jargon
  • Topic-related words
  • Potential bias or leakage
In [26]:
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np

# Fake labeled data for demo
docs = [
    "great movie and fantastic acting",
    "boring plot and bad acting",
    "what a wonderful film",
    "awful and dull",
    "loved it, great performance",
    "hated it, terrible movie"
]
labels = [1, 0, 1, 0, 1, 0]  # 1 = positive, 0 = negative

vectorizer = CountVectorizer(stop_words='english')
X = vectorizer.fit_transform(docs)
features = np.array(vectorizer.get_feature_names_out())

# Separate by class
pos_rows = X[np.array(labels) == 1]
neg_rows = X[np.array(labels) == 0]

# Sum term frequencies by class
pos_freq = np.asarray(pos_rows.sum(axis=0)).flatten()
neg_freq = np.asarray(neg_rows.sum(axis=0)).flatten()

# Top 5 per class
top_pos = features[np.argsort(pos_freq)[-5:]]
top_neg = features[np.argsort(neg_freq)[-5:]]

print("Top words in positive class:", top_pos[::-1])
print("Top words in negative class:", top_neg[::-1])
Top words in positive class: ['great' 'wonderful' 'performance' 'movie' 'loved']
Top words in negative class: ['terrible' 'plot' 'movie' 'hated' 'dull']

🎯 Important Features in Classification¶

📖 Click to Expand

After training a linear model like Logistic Regression, we can inspect feature weights to understand which words push predictions toward a class.

High-weighted positive features = strong predictors for that class.
This is useful for:

  • Model explainability
  • Debugging data leakage
  • Extracting interpretable rules

Works best with:

  • TF-IDF inputs
  • Linear models (LogisticRegression, SGDClassifier)
In [27]:
from sklearn.linear_model import LogisticRegression

# Use TF-IDF for modeling
tfidf = TfidfVectorizer(stop_words='english')
X_tfidf = tfidf.fit_transform(docs)

model = LogisticRegression()
model.fit(X_tfidf, labels)

# Get feature weights
feature_names = np.array(tfidf.get_feature_names_out())
coefs = model.coef_[0]

top_pos_coef = feature_names[np.argsort(coefs)[-5:]]
top_neg_coef = feature_names[np.argsort(coefs)[:5]]

print("Most positive influence:", top_pos_coef[::-1])
print("Most negative influence:", top_neg_coef)
Most positive influence: ['great' 'wonderful' 'film' 'fantastic' 'performance']
Most negative influence: ['awful' 'dull' 'hated' 'terrible' 'bad']

Back to the top


🧪 Edge Cases & Practical Tips¶

📉 Extremely Sparse Inputs¶

📖 Click to Expand

When the vocabulary is large and most documents are short, vectorized matrices become extremely sparse — mostly zeros.

Pitfalls:

  • Models may overfit noise
  • Similarity scores (e.g. cosine) become unstable
  • Distance-based models like KNN perform poorly

Tips:

  • Use max_features, min_df to limit vocabulary
  • Try TruncatedSVD or SelectKBest for dimensionality reduction
  • Prefer models that handle sparsity well (e.g., Logistic Regression, Naive Bayes)
In [28]:
print("Shape:", X_tfidf.shape)
print("Sparsity: {:.2f}%".format(100 * (1 - X_tfidf.nnz / (X_tfidf.shape[0] * X_tfidf.shape[1]))))
Shape: (6, 15)
Sparsity: 80.00%
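
As a follow-up to the tips above, here is a minimal sketch of feature selection with SelectKBest and the chi-squared test, reusing the TF-IDF matrix and labels from the classification demo (k=5 is an arbitrary choice):

from sklearn.feature_selection import SelectKBest, chi2

# Keep the 5 terms most associated with the class labels
selector = SelectKBest(chi2, k=5)
X_selected = selector.fit_transform(X_tfidf, labels)
print(feature_names[selector.get_support()])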

Back to the top