
📖 Categorical Features¶

  • 🔢 Label Encoding
  • 🟦 One-Hot Encoding
  • 🧱 Dummy Encoding
  • 🧮 Ordinal Encoding
  • 📊 Frequency / Count Encoding
  • 🎯 Target Encoding
  • #️⃣ Binary Encoding
  • 💠 Hashing Encoding

🔢 Label Encoding¶


Label encoding assigns each category a unique integer, preserving no explicit order.

  • Integers are assigned in alphabetical order, so the implied ranking is arbitrary
  • Can mislead linear models if used with nominal data
  • Works as a quick baseline for tree-based models, which are largely insensitive to the arbitrary ordering
In [2]:
# Dummy dataset
import pandas as pd

df = pd.DataFrame({
    'Size': ['Small', 'Medium', 'Large', 'Medium', 'Small'],
    'Color': ['Red', 'Blue', 'Green', 'Blue', 'Red'],
    'City': ['NY', 'LA', 'SF', 'NY', 'SF'],
    'Target': [0, 1, 0, 1, 0]
})
df
Out[2]:
Size Color City Target
0 Small Red NY 0
1 Medium Blue LA 1
2 Large Green SF 0
3 Medium Blue NY 1
4 Small Red SF 0
In [ ]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
df['Size_LabelEncoded'] = le.fit_transform(df['Size'])
df[['Size', 'Size_LabelEncoded']].sort_values(by='Size')
Out[ ]:
Size Size_LabelEncoded
2 Large 0
1 Medium 1
3 Medium 1
0 Small 2
4 Small 2
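
Note that the codes follow alphabetical order (Large=0, Medium=1, Small=2), which inverts the natural size ordering. A minimal check of what the fitted encoder from the cell above actually learned:

In [ ]:
# classes_ shows the alphabetical order behind the integer codes;
# inverse_transform maps the integers back to the original labels
print(le.classes_)
print(le.inverse_transform(df['Size_LabelEncoded']))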



🟦 One-Hot Encoding¶


One-hot encoding creates a binary column for each category, indicating its presence.

  • Ideal for nominal variables with low cardinality
  • Can blow up the feature space when cardinality is high
  • Works well with linear and tree-based models
In [25]:
# One-hot encode Color; the frame already carries columns added by other cells in this notebook session
df_onehot = pd.get_dummies(df, columns=['Color'], prefix='Color')
df_onehot
Out[25]:
Size City Target Size_LabelEncoded Size_OrdinalEncoded City_CountEncoded City_TargetEncoded Color_Blue Color_Green Color_Red
0 Small NY 0 2 1 2 0.5 0 0 1
1 Medium LA 1 1 2 1 1.0 1 0 0
2 Large SF 0 0 3 2 0.0 0 1 0
3 Medium NY 1 1 2 2 0.5 1 0 0
4 Small SF 0 2 1 2 0.0 0 0 1
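
pd.get_dummies has no notion of fit vs. transform, so categories unseen at inference time silently change the column set. An alternative is scikit-learn's OneHotEncoder, sketched below (sparse_output requires scikit-learn 1.2+; older releases use sparse=False):

In [ ]:
from sklearn.preprocessing import OneHotEncoder

# handle_unknown='ignore' maps categories unseen during fit to an all-zero row
ohe = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
encoded = ohe.fit_transform(df[['Color']])
pd.DataFrame(encoded, columns=ohe.get_feature_names_out(['Color']))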



🧱 Dummy Encoding¶


Dummy encoding is a variant of one-hot encoding where one category is dropped to serve as a baseline (reference level).

  • Prevents multicollinearity in linear models
  • Only k-1 columns created for k categories
  • Drop is arbitrary unless explicitly defined
In [21]:
# Dummy encoding: One-hot encoding with drop_first=True
df_dummy = pd.get_dummies(df, columns=['Color'], prefix='Color', drop_first=True)
df_dummy
Out[21]:
Size City Target Size_LabelEncoded Size_OrdinalEncoded City_CountEncoded City_TargetEncoded Color_Green Color_Red
0 Small NY 0 2 1 2 0.5 0 1
1 Medium LA 1 1 2 1 1.0 0 0
2 Large SF 0 0 3 2 0.0 1 0
3 Medium NY 1 1 2 2 0.5 0 0
4 Small SF 0 2 1 2 0.0 0 1
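
With drop_first=True the dropped level is simply the first category in sorted order (here Blue). To pick the baseline deliberately, one option is to cast the column to a Categorical whose first category is the intended reference level; a minimal sketch, assuming Red is the desired baseline:

In [ ]:
# Casting to Categorical fixes the column order, so drop_first removes the chosen baseline (Red)
df_ref = df.copy()
df_ref['Color'] = pd.Categorical(df_ref['Color'], categories=['Red', 'Blue', 'Green'])
pd.get_dummies(df_ref, columns=['Color'], prefix='Color', drop_first=True)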



🧮 Ordinal Encoding¶


Ordinal encoding maps categories to integers based on their meaningful rank.

  • Use when categories have a clear order (e.g., small, medium, large)
  • Manual ordering is critical to avoid misleading signals
  • Works well with models that can leverage numeric relationships
In [ ]:
# Manually map Size: Small < Medium < Large
size_mapping = {'Small': 1, 'Medium': 2, 'Large': 3}
df['Size_OrdinalEncoded'] = df['Size'].map(size_mapping)
df[['Size', 'Size_OrdinalEncoded']]
Out[ ]:
Size Size_OrdinalEncoded
0 Small 1
1 Medium 2
2 Large 3
3 Medium 2
4 Small 1
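
The same mapping can be expressed with scikit-learn's OrdinalEncoder by passing the category order explicitly, which also gives a reusable fitted transformer for new data. A minimal sketch (Size_OrdinalEncoded_sk is an illustrative column name; codes start at 0 rather than 1):

In [ ]:
from sklearn.preprocessing import OrdinalEncoder

# Explicit categories= prevents the default alphabetical ordering
oe = OrdinalEncoder(categories=[['Small', 'Medium', 'Large']])
df['Size_OrdinalEncoded_sk'] = oe.fit_transform(df[['Size']]).ravel()
df[['Size', 'Size_OrdinalEncoded', 'Size_OrdinalEncoded_sk']]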



📊 Frequency / Count Encoding¶


Replaces each category with its count or frequency in the dataset.

  • Simple and efficient for high-cardinality features
  • Works well with tree-based models
  • May introduce bias in linear models
In [ ]:
count_map = df['City'].value_counts().to_dict()
df['City_CountEncoded'] = df['City'].map(count_map)
df[['City', 'City_CountEncoded']]
Out[ ]:
City City_CountEncoded
0 NY 2
1 LA 1
2 SF 2
3 NY 2
4 SF 2
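
A common variant replaces the raw count with the relative frequency, which keeps the encoding on a 0-1 scale regardless of dataset size. A minimal sketch (City_FreqEncoded is an illustrative column name):

In [ ]:
# Normalized variant: each city becomes its share of rows instead of its raw count
freq_map = df['City'].value_counts(normalize=True).to_dict()
df['City_FreqEncoded'] = df['City'].map(freq_map)
df[['City', 'City_CountEncoded', 'City_FreqEncoded']]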



🎯 Target Encoding¶


Each category is replaced with the mean of the target variable for that category.

  • Effective for high-cardinality categorical features
  • Prone to data leakage if not handled carefully
  • Use regularization or cross-validation for safe use
In [8]:
# Mean target for each City
target_mean = df.groupby('City')['Target'].mean().to_dict()
df['City_TargetEncoded'] = df['City'].map(target_mean)
df[['City', 'City_TargetEncoded']]
Out[8]:
City City_TargetEncoded
0 NY 0.5
1 LA 1.0
2 SF 0.0
3 NY 0.5
4 SF 0.0
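
The raw per-category mean above is exactly the kind of encoding that leaks the target, especially for rare categories. One simple regularization is to shrink each category mean toward the global mean; a minimal sketch, where the prior weight m is an illustrative tuning knob (an out-of-fold scheme would go further):

In [ ]:
# Smoothed target mean: shrink small-group means toward the global mean
global_mean = df['Target'].mean()
stats = df.groupby('City')['Target'].agg(['mean', 'count'])
m = 2  # prior weight: larger m = stronger shrinkage (illustrative value)
smoothed = (stats['count'] * stats['mean'] + m * global_mean) / (stats['count'] + m)
df['City_TargetSmoothed'] = df['City'].map(smoothed)
df[['City', 'City_TargetEncoded', 'City_TargetSmoothed']]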



#️⃣ Binary Encoding¶


Encodes categories as binary numbers, then splits digits into columns.

  • Reduces dimensionality vs. one-hot
  • Good for medium cardinality data
  • Less interpretable than one-hot or label
In [18]:
def binary_encode(series):
    # Integer-code the categories, then write each code in fixed-width binary, one column per bit
    categories = series.astype('category').cat.codes
    max_len = int(categories.max()).bit_length()  # bits needed for the largest code (cast to Python int)
    binary_cols = categories.apply(lambda x: list(map(int, bin(int(x))[2:].zfill(max_len))))
    return pd.DataFrame(binary_cols.tolist(), columns=[f"{series.name}_bin_{i}" for i in range(max_len)])

df_binary = binary_encode(df['City'])
df_binary.head()
Out[18]:
City_bin_0 City_bin_1
0 0 1
1 0 0
2 1 0
3 0 1
4 1 0
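
The hand-rolled version above works, but in practice this is usually delegated to the optional category_encoders package, whose BinaryEncoder handles fitting, unseen categories, and column naming (its integer codes typically start at 1, so the exact bit patterns may differ from the output above). A minimal sketch, assuming the package is installed:

In [ ]:
# Library alternative: pip install category_encoders
import category_encoders as ce

be = ce.BinaryEncoder(cols=['City'])
be.fit_transform(df[['City']])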



💠 Hashing Encoding¶


Applies a hash function to map each category to a fixed number of columns.

  • Useful for extremely high-cardinality data
  • Prone to hash collisions, which may reduce signal
  • Non-invertible: can't trace back original categories
In [26]:
import hashlib

def hash_encode(series, n_components=4):
    # MD5-hash each category string and keep the last n_components bits as 0/1 columns
    def hash_string(val):
        h = int(hashlib.md5(val.encode()).hexdigest(), 16)
        return [int(b) for b in bin(h)[2:].zfill(n_components)[-n_components:]]

    hashed = series.astype(str).apply(hash_string)
    return pd.DataFrame(hashed.tolist(), columns=[f"{series.name}_hash_{i}" for i in range(n_components)])

hash_encode(df['City'], n_components=4)
Out[26]:
City_hash_0 City_hash_1 City_hash_2 City_hash_3
0 1 0 0 0
1 0 0 0 1
2 1 1 0 1
3 1 0 0 0
4 1 1 0 1
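
scikit-learn ships a standard implementation of the hashing trick, FeatureHasher, which buckets each category into one of n_features columns rather than exposing raw hash bits; entries can be +1 or -1 because signed hashing is on by default. A minimal sketch:

In [ ]:
from sklearn.feature_extraction import FeatureHasher

# Each sample must be an iterable of strings when input_type='string'
fh = FeatureHasher(n_features=4, input_type='string')
hashed = fh.transform(df['City'].apply(lambda v: [v]))
pd.DataFrame(hashed.toarray(), columns=[f'City_hash_sk_{i}' for i in range(4)])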
