🔢 Label Encoding
Label encoding assigns each category a unique integer. scikit-learn's `LabelEncoder` assigns codes in alphabetical order, so the numbers carry no meaningful rank.
- If the categories have a true order, prefer ordinal encoding with an explicit mapping (covered below)
- Can mislead linear models, which read the arbitrary integers as magnitudes
- Useful as a quick baseline for tree-based models
In [2]:
# Dummy dataset
import pandas as pd
df = pd.DataFrame({
    'Size': ['Small', 'Medium', 'Large', 'Medium', 'Small'],
    'Color': ['Red', 'Blue', 'Green', 'Blue', 'Red'],
    'City': ['NY', 'LA', 'SF', 'NY', 'SF'],
    'Target': [0, 1, 0, 1, 0]
})
df
Out[2]:
|   | Size   | Color | City | Target |
|---|--------|-------|------|--------|
| 0 | Small  | Red   | NY   | 0 |
| 1 | Medium | Blue  | LA   | 1 |
| 2 | Large  | Green | SF   | 0 |
| 3 | Medium | Blue  | NY   | 1 |
| 4 | Small  | Red   | SF   | 0 |
In [ ]:
from sklearn.preprocessing import LabelEncoder

# LabelEncoder assigns codes alphabetically: Large=0, Medium=1, Small=2
le = LabelEncoder()
df['Size_LabelEncoded'] = le.fit_transform(df['Size'])
df[['Size', 'Size_LabelEncoded']].sort_values(by='Size')
Out[ ]:
|   | Size   | Size_LabelEncoded |
|---|--------|-------------------|
| 2 | Large  | 0 |
| 1 | Medium | 1 |
| 3 | Medium | 1 |
| 0 | Small  | 2 |
| 4 | Small  | 2 |
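As a quick sanity check (our addition, using standard `LabelEncoder` attributes), the fitted encoder can map the integer codes back to the original labels:
In [ ]:
# The fitted encoder stores its classes in alphabetical order
print(le.classes_)                      # ['Large' 'Medium' 'Small']
print(le.inverse_transform([0, 1, 2]))  # ['Large' 'Medium' 'Small']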
🟦 One-Hot Encoding
One-hot encoding creates a binary column for each category, indicating its presence.
- Ideal for nominal variables with low cardinality
- Can cause the curse of dimensionality with high-cardinality features
- Works well with linear and tree-based models
In [25]:
df_onehot = pd.get_dummies(df, columns=['Color'], prefix='Color')
df_onehot
Out[25]:
|   | Size | City | Target | Size_LabelEncoded | Size_OrdinalEncoded | City_CountEncoded | City_TargetEncoded | Color_Blue | Color_Green | Color_Red |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Small | NY | 0 | 2 | 1 | 2 | 0.5 | 0 | 0 | 1 |
| 1 | Medium | LA | 1 | 1 | 2 | 1 | 1.0 | 1 | 0 | 0 |
| 2 | Large | SF | 0 | 0 | 3 | 2 | 0.0 | 0 | 1 | 0 |
| 3 | Medium | NY | 1 | 1 | 2 | 2 | 0.5 | 1 | 0 | 0 |
| 4 | Small | SF | 0 | 2 | 1 | 2 | 0.0 | 0 | 0 | 1 |
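`pd.get_dummies` only knows the categories present in the frame it is given, so a new category at inference time silently produces mismatched columns. A minimal sketch of the scikit-learn alternative, which is fitted once and reused (`handle_unknown='ignore'` turns unseen categories into all-zero rows; `sparse_output` requires scikit-learn >= 1.2, older versions use `sparse=False`):
In [ ]:
from sklearn.preprocessing import OneHotEncoder

# Fit on training data; unseen categories at transform time become all-zero rows
ohe = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
color_onehot = ohe.fit_transform(df[['Color']])
pd.DataFrame(color_onehot, columns=ohe.get_feature_names_out())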
Dummy encoding is a variant of one-hot encoding where one category is dropped to serve as the baseline (reference level).
- Prevents perfect multicollinearity (the dummy variable trap) in linear models
- Creates only k-1 columns for k categories
- The drop is arbitrary (pandas drops the alphabetically first category) unless explicitly defined; see the sketch after the output below
In [21]:
# Dummy encoding: One-hot encoding with drop_first=True
df_dummy = pd.get_dummies(df, columns=['Color'], prefix='Color', drop_first=True)
df_dummy
Out[21]:
|   | Size | City | Target | Size_LabelEncoded | Size_OrdinalEncoded | City_CountEncoded | City_TargetEncoded | Color_Green | Color_Red |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Small | NY | 0 | 2 | 1 | 2 | 0.5 | 0 | 1 |
| 1 | Medium | LA | 1 | 1 | 2 | 1 | 1.0 | 0 | 0 |
| 2 | Large | SF | 0 | 0 | 3 | 2 | 0.0 | 1 | 0 |
| 3 | Medium | NY | 1 | 1 | 2 | 2 | 0.5 | 0 | 0 |
| 4 | Small | SF | 0 | 2 | 1 | 2 | 0.0 | 0 | 1 |
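If the reference level matters for interpretation (regression coefficients are then read as contrasts against it), scikit-learn's `OneHotEncoder` lets you name the dropped category instead of relying on alphabetical order. A minimal sketch, assuming `'Red'` is the desired baseline:
In [ ]:
from sklearn.preprocessing import OneHotEncoder

# drop takes one category per encoded column; 'Red' becomes the reference level
ohe_ref = OneHotEncoder(drop=['Red'], sparse_output=False)
encoded = ohe_ref.fit_transform(df[['Color']])
pd.DataFrame(encoded, columns=ohe_ref.get_feature_names_out())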
🧮 Ordinal Encoding
Ordinal encoding maps categories to integers based on their meaningful rank.
- Use when categories have a clear order (e.g., small < medium < large)
- Manual ordering is critical to avoid misleading signals
- Works well with models that can leverage numeric relationships
In [ ]:
# Manually map Size: Small < Medium < Large
size_mapping = {'Small': 1, 'Medium': 2, 'Large': 3}
df['Size_OrdinalEncoded'] = df['Size'].map(size_mapping)
df[['Size', 'Size_OrdinalEncoded']]
Out[ ]:
|   | Size   | Size_OrdinalEncoded |
|---|--------|---------------------|
| 0 | Small  | 1 |
| 1 | Medium | 2 |
| 2 | Large  | 3 |
| 3 | Medium | 2 |
| 4 | Small  | 1 |
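The same mapping is available through scikit-learn for use inside pipelines. A minimal sketch; note that `OrdinalEncoder` needs the order passed explicitly via `categories` (the default is alphabetical, which would recreate the label-encoding problem) and that it starts codes at 0 rather than 1:
In [ ]:
from sklearn.preprocessing import OrdinalEncoder

# Explicit order Small < Medium < Large -> codes 0, 1, 2
oe = OrdinalEncoder(categories=[['Small', 'Medium', 'Large']])
oe.fit_transform(df[['Size']])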
📊 Frequency / Count Encoding
Replaces each category with its count (or relative frequency) in the dataset.
- Simple and efficient for high-cardinality features
- Works well with tree-based models
- Distinct categories with equal counts collapse to the same value, and linear models read the counts as magnitudes, which can bias them
In [ ]:
# Map each city to its row count
count_map = df['City'].value_counts().to_dict()
df['City_CountEncoded'] = df['City'].map(count_map)
df[['City', 'City_CountEncoded']]
Out[ ]:
|   | City | City_CountEncoded |
|---|------|-------------------|
| 0 | NY   | 2 |
| 1 | LA   | 1 |
| 2 | SF   | 2 |
| 3 | NY   | 2 |
| 4 | SF   | 2 |
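The frequency (proportion) variant of the same idea is a one-argument change: `value_counts(normalize=True)` returns shares instead of raw counts, which keeps the scale comparable across datasets of different sizes (the `City_FreqEncoded` column name is our own):
In [ ]:
# Relative frequency instead of raw count
freq_map = df['City'].value_counts(normalize=True).to_dict()
df['City_FreqEncoded'] = df['City'].map(freq_map)
df[['City', 'City_FreqEncoded']]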
🎯 Target Encoding
Each category is replaced with the mean of the target variable for that category.
- Effective for high-cardinality categorical features
- Prone to target leakage: the naive version lets each row see its own label
- Use smoothing or out-of-fold (cross-validated) encoding for safe use; see the sketch after the output below
In [8]:
# Mean target for each City
target_mean = df.groupby('City')['Target'].mean().to_dict()
df['City_TargetEncoded'] = df['City'].map(target_mean)
df[['City', 'City_TargetEncoded']]
Out[8]:
|   | City | City_TargetEncoded |
|---|------|--------------------|
| 0 | NY   | 0.5 |
| 1 | LA   | 1.0 |
| 2 | SF   | 0.0 |
| 3 | NY   | 0.5 |
| 4 | SF   | 0.0 |
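The version above computes each category's mean over the full dataset, so every row's encoding includes its own target value. A minimal sketch of smoothed target encoding, where `m` is a hypothetical smoothing weight that pulls rare categories toward the global mean; for real use you would additionally compute the encoding out-of-fold (e.g. with `sklearn.model_selection.KFold`):
In [ ]:
# Blend each category mean with the global mean, weighted by category size
m = 2  # hypothetical smoothing weight; larger m pulls harder toward the global mean
global_mean = df['Target'].mean()
stats = df.groupby('City')['Target'].agg(['mean', 'count'])
smoothed = (stats['count'] * stats['mean'] + m * global_mean) / (stats['count'] + m)
df['City_TargetSmoothed'] = df['City'].map(smoothed)
df[['City', 'City_TargetEncoded', 'City_TargetSmoothed']]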
#️⃣ Binary Encoding
Encodes categories as binary numbers, then splits the digits into columns.
- Reduces dimensionality vs. one-hot
- Good for medium-cardinality data
- Less interpretable than one-hot or label encoding
In [18]:
def binary_encode(series):
    # Map categories to integer codes (alphabetical order), then write each code in binary
    categories = series.astype('category').cat.codes
    max_len = int(categories.max()).bit_length()  # bits needed for the largest code
    binary_cols = categories.apply(lambda x: list(map(int, bin(int(x))[2:].zfill(max_len))))
    return pd.DataFrame(binary_cols.tolist(), columns=[f"{series.name}_bin_{i}" for i in range(max_len)])
df_binary = binary_encode(df['City'])
df_binary.head()
Out[18]:
|   | City_bin_0 | City_bin_1 |
|---|------------|------------|
| 0 | 0 | 1 |
| 1 | 0 | 0 |
| 2 | 1 | 0 |
| 3 | 0 | 1 |
| 4 | 1 | 0 |
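A quick usage note (our addition): the encoded block concatenates straight back onto the original frame, and the column count grows only as ceil(log2(k)) for k categories, which is the saving over one-hot:
In [ ]:
# 3 cities -> codes 0..2 -> 2 binary columns (vs. 3 one-hot columns)
pd.concat([df[['City']], df_binary], axis=1)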
💠 Hashing Encoding
Applies a hash function to map categories to a fixed number of columns.
- Useful for extremely high-cardinality data
- Prone to hash collisions, which can blur the signal
- Non-invertible: the original categories can't be recovered
In [26]:
import hashlib

def hash_encode(series, n_components=4):
    # Hash each category string and keep its last n_components bits as 0/1 columns
    def hash_string(val):
        h = int(hashlib.md5(val.encode()).hexdigest(), 16)
        return [int(b) for b in bin(h)[2:].zfill(n_components)[-n_components:]]
    hashed = series.astype(str).apply(hash_string)
    return pd.DataFrame(hashed.tolist(), columns=[f"{series.name}_hash_{i}" for i in range(n_components)])
hash_encode(df['City'], n_components=4)
Out[26]:
|   | City_hash_0 | City_hash_1 | City_hash_2 | City_hash_3 |
|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 1 |
| 2 | 1 | 1 | 0 | 1 |
| 3 | 1 | 0 | 0 | 0 |
| 4 | 1 | 1 | 0 | 1 |
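scikit-learn ships a production version of this idea. A minimal sketch using `FeatureHasher`, which applies signed hashing (entries can be -1, 0, or 1, so collisions partially cancel rather than stack); `n_features=4` only mirrors the toy example above, and real uses take much larger values (the `City_fh_{i}` column names are our own):
In [ ]:
from sklearn.feature_extraction import FeatureHasher

# input_type='string': each sample is a list of raw string features
hasher = FeatureHasher(n_features=4, input_type='string')
hashed = hasher.transform([[city] for city in df['City']])
pd.DataFrame(hashed.toarray(), columns=[f'City_fh_{i}' for i in range(4)])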