Status: Complete Python Coverage License

📖 Outlier Detection & Treatment¶

🔍 Statistical Detection Methods

  • 📈 Z-Score
  • 🧮 Modified Z-Score (MAD)
  • 📊 Interquartile Range (IQR)
  • 🔍 Grubbs' Test
  • 📏 Chi-Square & Mahalanobis Distance

🤖 ML-Based Detection Methods

  • 🌲 Isolation Forest
  • 🔎 Local Outlier Factor (LOF)

📍 Proximity & Clustering-Based Detection

  • 🌀 DBSCAN
  • 📍 K-Means Distance

📐 Probabilistic Detection Methods

  • 📊 Gaussian Mixture Models (GMM)
  • 📉 Extreme Value Theory (EVT)
  • 🧠 Bayesian Methods

🛠️ Outlier Treatment Strategies

  • ❌ Deletion
  • 🔁 Capping / Winsorizing
  • 🧮 Imputation
  • 📊 Binning

🔍 Statistical Detection Methods¶

In [1]:
import numpy as np
import pandas as pd

# Create a dummy dataset with multivariate numeric features
np.random.seed(42)

# Generate base normal data
data = {
    'feature_1': np.random.normal(50, 10, 100),
    'feature_2': np.random.normal(30, 5, 100),
    'feature_3': np.random.normal(100, 20, 100)
}

# Inject outliers
data['feature_1'][[5, 15]] = [150, -30]
data['feature_2'][[25]] = [100]
data['feature_3'][[70, 90]] = [250, -80]

df = pd.DataFrame(data)
df.head()
Out[1]:
feature_1 feature_2 feature_3
0 54.967142 22.923146 107.155747
1 48.617357 27.896773 111.215691
2 56.476885 28.286427 121.661025
3 65.230299 25.988614 121.076041
4 47.658466 29.193571 72.446613

📈 Z-Score¶

📖 Click to Expand
📈 What is Z-Score?¶

Z-Score measures how many standard deviations a data point is from the mean of the distribution.
It assumes the data follows a normal (Gaussian) distribution.

⚙️ How It Works¶
  • Calculate mean (μ) and standard deviation (σ) of the feature
  • For each value ( x ), compute ( z = \frac{x - \mu}{\sigma} )
  • Points with |z| > threshold (typically 3) are flagged as outliers
🕵️‍♂️ When to Use¶
  • When the variable is normally distributed
  • Works well for univariate outlier detection
✅ Pros¶
  • Simple and interpretable
  • Requires no parameter tuning beyond threshold
⚠️ Limitations¶
  • Sensitive to mean and standard deviation — affected by extreme values
  • Fails on skewed or non-Gaussian data
  • Can miss multivariate outliers
In [ ]:
from scipy.stats import median_abs_deviation

def detect_outliers_zscore_all(df, column, threshold=3.0):
    """
    Detect outliers in a single numeric column of a DataFrame using Z-Score.

    Parameters:
        df (pd.DataFrame): Input DataFrame.
        column (str): Name of the numeric column.
        threshold (float): Z-score threshold (default is 3.0).

    Returns:
        pd.Series: Boolean mask where True indicates an outlier in the specified column.
    """
    if column not in df.columns or not np.issubdtype(df[column].dtype, np.number):
        raise ValueError(f"Column '{column}' is either missing or not numeric.")

    z_scores = zscore(df[column])
    mask = np.abs(z_scores) > threshold

    if mask.sum() > 0:
        print(f"\033[91m⚠️ [{column}] - Detected {mask.sum()} outliers using Z-Score (threshold = {threshold})\033[0m")
    else:
        print(f"\033[92m✅ [{column}] - No outliers detected using Z-Score (threshold = {threshold})\033[0m")

    return mask

# Run Z-Score detection across all numeric columns
zscore_results = {}
for col in df.select_dtypes(include='number').columns:
    zscore_results[col] = detect_outliers_zscore_all(df, col)
⚠️ [feature_1] - Detected 2 outliers using Z-Score (threshold = 3.0)
⚠️ [feature_2] - Detected 1 outliers using Z-Score (threshold = 3.0)
⚠️ [feature_3] - Detected 2 outliers using Z-Score (threshold = 3.0)

🧮 Modified Z-Score (MAD)¶

📖 Click to Expand
🧮 What is Modified Z-Score?¶

Modified Z-Score replaces the mean and standard deviation in the Z-Score formula with the median and median absolute deviation (MAD), making it robust to extreme values.

⚙️ How It Works¶
  • Compute the median (M) of the data
  • Calculate MAD = median(|xᵢ - M|)
  • For each point ( x ), compute:
    ( \text{Modified Z} = 0.6745 \cdot \frac{x - M}{\text{MAD}} )
  • Flag points where |Modified Z| > threshold (typically 3.5)
🕵️‍♂️ When to Use¶
  • When the distribution is skewed or contains outliers
  • Works well for robust univariate detection
✅ Pros¶
  • Resistant to influence from extreme values
  • More reliable on small or noisy datasets
⚠️ Limitations¶
  • Assumes unimodal structure; may struggle with multimodal data
  • Still a univariate method — doesn't capture contextual or multivariate outliers
In [4]:
def detect_outliers_modified_zscore(df, column, threshold=3.5):
    """
    Detect outliers using the Modified Z-Score method based on MAD.

    Parameters:
        df (pd.DataFrame): Input DataFrame.
        column (str): Name of the numeric column.
        threshold (float): Modified Z-Score threshold (default is 3.5).

    Returns:
        pd.Series: Boolean mask where True indicates an outlier.
    """
    if column not in df.columns or not np.issubdtype(df[column].dtype, np.number):
        raise ValueError(f"Column '{column}' is either missing or not numeric.")

    series = df[column]
    median = series.median()
    mad = median_abs_deviation(series, scale='normal')  # default scale approximates std
    if mad == 0:
        mask = pd.Series([False] * len(series), index=series.index)
        print(f"\033[93m⚠️ [{column}] - MAD is zero. Skipping detection.\033[0m")
        return mask

    modified_z = 0.6745 * (series - median) / mad
    mask = np.abs(modified_z) > threshold

    if mask.sum() > 0:
        print(f"\033[91m⚠️ [{column}] - Detected {mask.sum()} outliers using Modified Z-Score (threshold = {threshold})\033[0m")
    else:
        print(f"\033[92m✅ [{column}] - No outliers detected using Modified Z-Score (threshold = {threshold})\033[0m")

    return mask

# Run Modified Z-Score detection across all numeric columns
mad_results = {}
for col in df.select_dtypes(include='number').columns:
    mad_results[col] = detect_outliers_modified_zscore(df, col)
⚠️ [feature_1] - Detected 2 outliers using Modified Z-Score (threshold = 3.5)
⚠️ [feature_2] - Detected 1 outliers using Modified Z-Score (threshold = 3.5)
⚠️ [feature_3] - Detected 2 outliers using Modified Z-Score (threshold = 3.5)

📊 Interquartile Range (IQR)¶

📖 Click to Expand
📊 What is IQR?¶

The Interquartile Range (IQR) is a rule-based method that defines outliers as points lying far outside the central 50% of the data.
It does not assume any specific distribution.

⚙️ How It Works¶
  • Compute Q1 (25th percentile) and Q3 (75th percentile)
  • Calculate IQR = Q3 - Q1
  • Define lower bound = Q1 - 1.5 × IQR
    Define upper bound = Q3 + 1.5 × IQR
  • Flag points outside this range as outliers
🕵️‍♂️ When to Use¶
  • For non-parametric, univariate outlier detection
  • Effective when data is not normally distributed
✅ Pros¶
  • Simple and interpretable
  • Not influenced by extreme values
  • No distributional assumptions
⚠️ Limitations¶
  • Doesn't work well with multimodal distributions
  • Only considers 1 feature at a time (univariate)
  • Threshold (1.5×IQR) is arbitrary and may require tuning
In [5]:
def detect_outliers_iqr(df, column, factor=1.5):
    """
    Detect outliers using the IQR method.

    Parameters:
        df (pd.DataFrame): Input DataFrame.
        column (str): Name of the numeric column.
        factor (float): Multiplier for IQR (default is 1.5).

    Returns:
        pd.Series: Boolean mask where True indicates an outlier.
    """
    if column not in df.columns or not np.issubdtype(df[column].dtype, np.number):
        raise ValueError(f"Column '{column}' is either missing or not numeric.")

    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower = Q1 - factor * IQR
    upper = Q3 + factor * IQR

    mask = (df[column] < lower) | (df[column] > upper)

    if mask.sum() > 0:
        print(f"\033[91m⚠️ [{column}] - Detected {mask.sum()} outliers using IQR (factor = {factor})\033[0m")
    else:
        print(f"\033[92m✅ [{column}] - No outliers detected using IQR (factor = {factor})\033[0m")

    return mask

# Run IQR detection across all numeric columns
iqr_results = {}
for col in df.select_dtypes(include='number').columns:
    iqr_results[col] = detect_outliers_iqr(df, col)
⚠️ [feature_1] - Detected 3 outliers using IQR (factor = 1.5)
⚠️ [feature_2] - Detected 2 outliers using IQR (factor = 1.5)
⚠️ [feature_3] - Detected 4 outliers using IQR (factor = 1.5)

🔍 Grubbs' Test¶

📖 Click to Expand
🔍 What is Grubbs' Test?¶

Grubbs' Test is a statistical test used to detect a single outlier in a normally distributed univariate dataset.
It evaluates whether the extreme value is statistically different from the rest of the data.

⚙️ How It Works¶
  • Assumes data follows a normal distribution
  • Calculates a test statistic based on the most extreme value: ( G = \frac{|\text{extreme} - \bar{x}|}{s} )
  • Compares ( G ) to a critical value from the t-distribution
  • If ( G ) exceeds the threshold, the point is flagged as an outlier
🕵️‍♂️ When to Use¶
  • On small, normally distributed datasets
  • When verifying if a specific extreme point is statistically unjustified
✅ Pros¶
  • Provides a statistical test (p-value) rather than just a threshold rule
  • Good for auditing one suspect value at a time
⚠️ Limitations¶
  • Can only detect one outlier at a time
  • Assumes normality — breaks down if data is skewed or multimodal
  • Not scalable to large or multivariate datasets
In [6]:
from scipy.stats import t, norm

def detect_outliers_grubbs(df, column, alpha=0.05):
    """
    Detect a single outlier in a numeric column using Grubbs' Test.

    Parameters:
        df (pd.DataFrame): Input DataFrame.
        column (str): Name of the numeric column.
        alpha (float): Significance level (default is 0.05).

    Returns:
        pd.Series: Boolean mask where True indicates an outlier.
    """
    if column not in df.columns or not np.issubdtype(df[column].dtype, np.number):
        raise ValueError(f"Column '{column}' is either missing or not numeric.")

    x = df[column].dropna().copy()
    n = len(x)
    mean_x = x.mean()
    std_x = x.std()
    
    # Grubbs statistic
    abs_devs = abs(x - mean_x)
    G = abs_devs.max() / std_x

    # Critical value
    t_crit = t.ppf(1 - alpha / (2 * n), df=n - 2)
    critical_value = ((n - 1) / np.sqrt(n)) * np.sqrt(t_crit**2 / (n - 2 + t_crit**2))

    mask = abs_devs == abs_devs.max()
    if G > critical_value:
        print(f"\033[91m⚠️ [{column}] - Grubbs' Test: 1 outlier detected (G = {G:.2f}, critical = {critical_value:.2f})\033[0m")
    else:
        print(f"\033[92m✅ [{column}] - Grubbs' Test: No outlier detected (G = {G:.2f}, critical = {critical_value:.2f})\033[0m")
        mask[:] = False

    return mask

# Run Grubbs' Test across all numeric columns
grubbs_results = {}
for col in df.select_dtypes(include='number').columns:
    grubbs_results[col] = detect_outliers_grubbs(df, col)
⚠️ [feature_1] - Grubbs' Test: 1 outlier detected (G = 6.39, critical = 3.38)
⚠️ [feature_2] - Grubbs' Test: 1 outlier detected (G = 8.25, critical = 3.38)
⚠️ [feature_3] - Grubbs' Test: 1 outlier detected (G = 5.67, critical = 3.38)

📏 Chi-Square & Mahalanobis Distance¶

📖 Click to Expand
📏 What Are Chi-Square & Mahalanobis Distance?¶

Both are multivariate statistical techniques for detecting outliers by measuring how far a point is from the expected distribution in a multidimensional space.

  • Mahalanobis Distance accounts for correlation between features
  • Chi-Square Test uses Mahalanobis distance under the assumption of multivariate normality
⚙️ How It Works¶
  • Compute the mean vector and covariance matrix of the dataset
  • For each point ( x ), compute Mahalanobis distance: ( D^2 = (x - \mu)^T \Sigma^{-1} (x - \mu) )
  • Compare ( D^2 ) to a Chi-Square threshold with df = number of features
🕵️‍♂️ When to Use¶
  • Detecting outliers in multivariate numeric data
  • When feature correlation is important (e.g., financial indicators)
✅ Pros¶
  • Captures multivariate outliers missed by univariate methods
  • Incorporates feature relationships via covariance
⚠️ Limitations¶
  • Assumes multivariate normal distribution
  • Sensitive to outliers in the covariance matrix
  • Requires numerical, continuous variables
In [7]:
from scipy.stats import chi2

def detect_outliers_mahalanobis(df, columns, alpha=0.01):
    """
    Detect multivariate outliers using Mahalanobis distance and Chi-Square threshold.

    Parameters:
        df (pd.DataFrame): Input DataFrame.
        columns (list): List of numeric columns to include.
        alpha (float): Significance level for Chi-Square test (default 0.01).

    Returns:
        pd.Series: Boolean mask where True indicates an outlier.
    """
    data = df[columns].dropna()
    x = data.values
    mean_vec = np.mean(x, axis=0)
    cov_matrix = np.cov(x, rowvar=False)
    inv_covmat = np.linalg.inv(cov_matrix)

    diff = x - mean_vec
    left = np.dot(diff, inv_covmat)
    mahal_sq = np.sum(left * diff, axis=1)

    threshold = chi2.ppf(1 - alpha, df=len(columns))
    mask = pd.Series(mahal_sq > threshold, index=data.index)

    if mask.sum() > 0:
        print(f"\033[91m⚠️ [Mahalanobis] - Detected {mask.sum()} multivariate outliers (alpha = {alpha})\033[0m")
    else:
        print(f"\033[92m✅ [Mahalanobis] - No multivariate outliers detected (alpha = {alpha})\033[0m")

    return mask

# Apply on all numeric columns
mahalanobis_mask = detect_outliers_mahalanobis(df, df.select_dtypes(include='number').columns.tolist())
⚠️ [Mahalanobis] - Detected 5 multivariate outliers (alpha = 0.01)

Back to the top


🤖 ML-Based Detection Methods¶

🌲 Isolation Forest¶

📖 Click to Expand
🌲 What is Isolation Forest?¶

Isolation Forest is an ensemble-based anomaly detection method that works by isolating observations using random splits.
Outliers are easier to isolate and require fewer splits.

⚙️ How It Works¶
  • Builds multiple trees by randomly selecting features and split values
  • Each sample's average path length across trees is computed
  • Outliers have shorter path lengths (isolated faster)
  • Scores are assigned to rank outlier likelihood
🕵️‍♂️ When to Use¶
  • On high-dimensional or large datasets
  • When a model-based, non-parametric detector is preferred
  • Suitable for both univariate and multivariate outlier detection
✅ Pros¶
  • Scales well to large datasets
  • No distributional assumptions
  • Handles multivariate relationships
⚠️ Limitations¶
  • Output scores are relative, not probabilistic
  • Performance can vary with small datasets
  • May struggle with highly imbalanced feature importance
In [8]:
from sklearn.ensemble import IsolationForest

def detect_outliers_isolation_forest(df, columns, contamination=0.05, random_state=42):
    """
    Detect outliers using Isolation Forest.

    Parameters:
        df (pd.DataFrame): Input DataFrame.
        columns (list): List of columns to apply the method on.
        contamination (float): Expected proportion of outliers.
        random_state (int): Seed for reproducibility.

    Returns:
        pd.Series: Boolean mask where True indicates an outlier.
    """
    data = df[columns].dropna()
    iso = IsolationForest(contamination=contamination, random_state=random_state)
    preds = iso.fit_predict(data)

    mask = pd.Series(preds == -1, index=data.index)

    if mask.sum() > 0:
        print(f"\033[91m⚠️ [Isolation Forest] - Detected {mask.sum()} outliers (contamination = {contamination})\033[0m")
    else:
        print(f"\033[92m✅ [Isolation Forest] - No outliers detected (contamination = {contamination})\033[0m")

    return mask

# Apply Isolation Forest on all numeric columns
isoforest_mask = detect_outliers_isolation_forest(df, df.select_dtypes(include='number').columns.tolist())
⚠️ [Isolation Forest] - Detected 5 outliers (contamination = 0.05)
/Users/ashrithreddy/anaconda3/lib/python3.11/site-packages/sklearn/base.py:439: UserWarning: X does not have valid feature names, but IsolationForest was fitted with feature names
  warnings.warn(

🔎 Local Outlier Factor (LOF)¶

📖 Click to Expand
🔎 What is LOF?¶

Local Outlier Factor (LOF) identifies outliers by comparing the local density of each point to that of its neighbors.
Outliers have significantly lower density than surrounding points.

⚙️ How It Works¶
  • For each point, compute its k-nearest neighbors
  • Estimate local density based on average distance to those neighbors
  • Compute the LOF score by comparing a point’s density to that of its neighbors
  • Higher LOF score = more likely to be an outlier
🕵️‍♂️ When to Use¶
  • Detecting local anomalies that deviate within a neighborhood
  • Datasets with clusters of varying density
✅ Pros¶
  • Captures local context, unlike global methods
  • Good for complex, non-linear distributions
⚠️ Limitations¶
  • Sensitive to the choice of k (number of neighbors)
  • Struggles with high-dimensional data
  • Scores are relative — not probability-based
In [9]:
from sklearn.neighbors import LocalOutlierFactor

def detect_outliers_lof(df, columns, n_neighbors=20, contamination=0.05):
    """
    Detect outliers using Local Outlier Factor (LOF).

    Parameters:
        df (pd.DataFrame): Input DataFrame.
        columns (list): List of columns to apply the method on.
        n_neighbors (int): Number of neighbors to use for LOF.
        contamination (float): Proportion of expected outliers.

    Returns:
        pd.Series: Boolean mask where True indicates an outlier.
    """
    data = df[columns].dropna()
    lof = LocalOutlierFactor(n_neighbors=n_neighbors, contamination=contamination)
    preds = lof.fit_predict(data)

    mask = pd.Series(preds == -1, index=data.index)

    if mask.sum() > 0:
        print(f"\033[91m⚠️ [LOF] - Detected {mask.sum()} outliers (n_neighbors = {n_neighbors}, contamination = {contamination})\033[0m")
    else:
        print(f"\033[92m✅ [LOF] - No outliers detected (n_neighbors = {n_neighbors})\033[0m")

    return mask

# Apply LOF on all numeric columns
lof_mask = detect_outliers_lof(df, df.select_dtypes(include='number').columns.tolist())
⚠️ [LOF] - Detected 5 outliers (n_neighbors = 20, contamination = 0.05)

Back to the top


📍 Proximity & Clustering-Based Detection¶

🌀 DBSCAN¶

📖 Click to Expand
🌀 What is DBSCAN?¶

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a clustering algorithm that identifies outliers as points that do not belong to any dense region.

⚙️ How It Works¶
  • Groups points into clusters based on density (minPts within a radius ε)
  • Points not reachable from any cluster are labeled as noise
  • These noise points are treated as outliers
🕵️‍♂️ When to Use¶
  • When the data has irregular shapes or density-based clusters
  • Works well for unsupervised anomaly detection
✅ Pros¶
  • No need to specify number of clusters
  • Can detect outliers as a byproduct of clustering
  • Handles arbitrarily shaped clusters
⚠️ Limitations¶
  • Requires tuning of ε and minPts
  • Sensitive to scale of features — needs normalization
  • Can fail with varying density clusters
In [10]:
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

def detect_outliers_dbscan(df, columns, eps=0.5, min_samples=5):
    """
    Detect outliers using DBSCAN clustering (label -1 as noise).

    Parameters:
        df (pd.DataFrame): Input DataFrame.
        columns (list): List of columns to apply DBSCAN on.
        eps (float): Maximum distance between samples to be considered neighbors.
        min_samples (int): Minimum samples required to form a dense region.

    Returns:
        pd.Series: Boolean mask where True indicates an outlier.
    """
    data = df[columns].dropna()
    scaled = StandardScaler().fit_transform(data)

    db = DBSCAN(eps=eps, min_samples=min_samples)
    labels = db.fit_predict(scaled)

    mask = pd.Series(labels == -1, index=data.index)

    if mask.sum() > 0:
        print(f"\033[91m⚠️ [DBSCAN] - Detected {mask.sum()} outliers (eps = {eps}, min_samples = {min_samples})\033[0m")
    else:
        print(f"\033[92m✅ [DBSCAN] - No outliers detected (eps = {eps}, min_samples = {min_samples})\033[0m")

    return mask

# Apply DBSCAN on all numeric columns
dbscan_mask = detect_outliers_dbscan(df, df.select_dtypes(include='number').columns.tolist())
⚠️ [DBSCAN] - Detected 32 outliers (eps = 0.5, min_samples = 5)

📍 K-Means Distance¶

📖 Click to Expand
📍 What is K-Means Distance for Outlier Detection?¶

In K-Means, outliers are often identified as points that are far from their assigned cluster centroids — i.e., they have high intra-cluster distances.

⚙️ How It Works¶
  • Run K-Means clustering on the dataset
  • For each point, compute the distance to its assigned cluster centroid
  • Rank points by distance; outliers lie in the tail of the distance distribution
🕵️‍♂️ When to Use¶
  • As a quick unsupervised heuristic
  • When K-Means is already used for segmentation and you want to flag edge cases
✅ Pros¶
  • Simple to implement using existing clustering output
  • Scales well to large datasets
⚠️ Limitations¶
  • Assumes spherical clusters — poor performance on non-globular data
  • Requires setting k (number of clusters)
  • Sensitive to initial centroid selection and feature scaling
In [11]:
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin_min

def detect_outliers_kmeans(df, columns, n_clusters=3, threshold_quantile=0.99, random_state=42):
    """
    Detect outliers based on distance to KMeans cluster centroids.

    Parameters:
        df (pd.DataFrame): Input DataFrame.
        columns (list): List of columns to apply KMeans on.
        n_clusters (int): Number of clusters.
        threshold_quantile (float): Quantile of distance above which to flag as outlier.
        random_state (int): Seed for reproducibility.

    Returns:
        pd.Series: Boolean mask where True indicates an outlier.
    """
    data = df[columns].dropna()
    scaled = StandardScaler().fit_transform(data)

    kmeans = KMeans(n_clusters=n_clusters, random_state=random_state)
    labels = kmeans.fit_predict(scaled)

    _, distances = pairwise_distances_argmin_min(scaled, kmeans.cluster_centers_)
    threshold = np.quantile(distances, threshold_quantile)
    mask = pd.Series(distances > threshold, index=data.index)

    if mask.sum() > 0:
        print(f"\033[91m⚠️ [KMeans Distance] - Detected {mask.sum()} outliers (above {int(threshold_quantile * 100)}th percentile distance)\033[0m")
    else:
        print(f"\033[92m✅ [KMeans Distance] - No outliers detected (above {int(threshold_quantile * 100)}th percentile distance)\033[0m")

    return mask

# Apply KMeans distance-based detection
kmeans_mask = detect_outliers_kmeans(df, df.select_dtypes(include='number').columns.tolist())
⚠️ [KMeans Distance] - Detected 1 outliers (above 99th percentile distance)
/Users/ashrithreddy/anaconda3/lib/python3.11/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
  warnings.warn(

Back to the top


📐 Probabilistic Detection Methods¶

📊 Gaussian Mixture Models (GMM)¶

📖 Click to Expand
📊 What is GMM for Outlier Detection?¶

GMM assumes the data is generated from a mixture of several Gaussian distributions.
Points with low likelihood under the fitted model are flagged as outliers.

⚙️ How It Works¶
  • Fit a Gaussian mixture model using Expectation-Maximization (EM)
  • Compute the log-likelihood of each point under the model
  • Flag points with likelihood below a threshold as outliers
🕵️‍♂️ When to Use¶
  • For continuous multivariate data
  • When data clusters resemble Gaussian blobs
✅ Pros¶
  • Soft clustering: gives probabilistic interpretation
  • Handles complex, multimodal distributions
⚠️ Limitations¶
  • Assumes Gaussian components — may fail on heavy-tailed or skewed data
  • Requires selecting the number of components (can be tuned via BIC/AIC)
  • Sensitive to outliers during model fitting
In [12]:
from sklearn.mixture import GaussianMixture

def detect_outliers_gmm(df, columns, n_components=3, threshold_quantile=0.01, random_state=42):
    """
    Detect outliers using Gaussian Mixture Model log-likelihood scores.

    Parameters:
        df (pd.DataFrame): Input DataFrame.
        columns (list): List of columns to apply GMM on.
        n_components (int): Number of Gaussian components.
        threshold_quantile (float): Lower quantile of log-likelihood to flag as outlier.
        random_state (int): Seed for reproducibility.

    Returns:
        pd.Series: Boolean mask where True indicates an outlier.
    """
    data = df[columns].dropna()
    scaled = StandardScaler().fit_transform(data)

    gmm = GaussianMixture(n_components=n_components, random_state=random_state)
    gmm.fit(scaled)
    log_probs = gmm.score_samples(scaled)

    threshold = np.quantile(log_probs, threshold_quantile)
    mask = pd.Series(log_probs < threshold, index=data.index)

    if mask.sum() > 0:
        print(f"\033[91m⚠️ [GMM] - Detected {mask.sum()} outliers (below {int(threshold_quantile * 100)}th percentile log-likelihood)\033[0m")
    else:
        print(f"\033[92m✅ [GMM] - No outliers detected (below {int(threshold_quantile * 100)}th percentile log-likelihood)\033[0m")

    return mask

# Apply GMM outlier detection
gmm_mask = detect_outliers_gmm(df, df.select_dtypes(include='number').columns.tolist())
⚠️ [GMM] - Detected 1 outliers (below 1th percentile log-likelihood)

📉 Extreme Value Theory (EVT)¶

📖 Click to Expand
📉 What is Extreme Value Theory (EVT)?¶

Extreme Value Theory models the tail behavior of a distribution to identify rare, extreme events.
It’s often used to detect outliers that lie far in the distribution's tail.

⚙️ How It Works¶
  • Focuses on modeling the maximum (or minimum) values in data
  • Fits a distribution (e.g., Generalized Pareto) to the tail
  • Outliers are points with very low probability under this tail distribution
🕵️‍♂️ When to Use¶
  • In fields where extreme risks or rare events are important (e.g., finance, insurance, climate)
  • When the tail of the distribution carries meaningful signal
✅ Pros¶
  • Tail-focused — captures true "extremes"
  • Statistically principled for outlier modeling
⚠️ Limitations¶
  • Requires sufficient tail data to model
  • Only effective for univariate distributions
  • Requires careful threshold selection for tail modeling
In [14]:
def detect_outliers_evt(df, column, threshold_quantile=0.95):
    """
    Detect outliers using Extreme Value Theory (GPD tail modeling).

    Parameters:
        df (pd.DataFrame): Input DataFrame.
        column (str): Column to apply EVT on.
        threshold_quantile (float): Quantile to define tail threshold.

    Returns:
        pd.Series: Boolean mask where True indicates an outlier.
    """
    if column not in df.columns or not np.issubdtype(df[column].dtype, np.number):
        raise ValueError(f"Column '{column}' is either missing or not numeric.")
    
    x = df[column].dropna()
    threshold = x.quantile(threshold_quantile)
    tail_excess = x[x > threshold] - threshold

    if len(tail_excess) < 5:
        print(f"\033[93m⚠️ [{column}] - Not enough data in tail for EVT modeling.\033[0m")
        return pd.Series([False] * len(x), index=x.index)

    # Fit GPD to the excess over threshold
    c, loc, scale = genpareto.fit(tail_excess)
    probs = genpareto.sf(tail_excess, c, loc=loc, scale=scale)

    # Build full-length boolean mask
    mask = pd.Series(False, index=x.index)
    mask.loc[tail_excess.index] = probs < 0.01

    if mask.sum() > 0:
        print(f"\033[91m⚠️ [{column}] - Detected {mask.sum()} outliers using EVT (quantile = {threshold_quantile})\033[0m")
    else:
        print(f"\033[92m✅ [{column}] - No outliers detected using EVT (quantile = {threshold_quantile})\033[0m")

    return mask

# Retry EVT detection
evt_results = {}
for col in df.select_dtypes(include='number').columns:
    evt_results[col] = detect_outliers_evt(df, col)
✅ [feature_1] - No outliers detected using EVT (quantile = 0.95)
✅ [feature_2] - No outliers detected using EVT (quantile = 0.95)
✅ [feature_3] - No outliers detected using EVT (quantile = 0.95)

🧠 Bayesian Methods¶

📖 Click to Expand
🧠 What Are Bayesian Methods for Outlier Detection?¶

Bayesian methods model uncertainty in the data and incorporate prior beliefs to estimate the posterior probability of a point being an outlier.

⚙️ How It Works¶
  • Define a generative model with priors over parameters
  • Use observed data to compute posterior distributions
  • Outliers are identified as points with low posterior probability under the model
🕵️‍♂️ When to Use¶
  • When uncertainty modeling is critical
  • For probabilistic anomaly detection in structured or time-series data
  • When expert knowledge can inform priors
✅ Pros¶
  • Explicitly models uncertainty
  • Flexible and extensible to many domains
  • Can incorporate prior knowledge
⚠️ Limitations¶
  • Computationally intensive (requires sampling or variational inference)
  • Requires strong modeling assumptions
  • Interpretation may be non-trivial for high-dimensional data
In [15]:
from sklearn.linear_model import BayesianRidge

def detect_outliers_bayesian_residuals(df, column, features=None, threshold=3.0):
    """
    Detect outliers based on residuals from a Bayesian Ridge regression model.

    Parameters:
        df (pd.DataFrame): Input DataFrame.
        column (str): Target column to model and test for outliers.
        features (list): Feature columns to use for modeling (default: all others).
        threshold (float): Z-score threshold on residuals.

    Returns:
        pd.Series: Boolean mask where True indicates an outlier.
    """
    if column not in df.columns or not np.issubdtype(df[column].dtype, np.number):
        raise ValueError(f"Target column '{column}' is either missing or not numeric.")
    
    features = features or [col for col in df.select_dtypes(include='number').columns if col != column]
    data = df[[column] + features].dropna()

    X = data[features]
    y = data[column]

    model = BayesianRidge()
    model.fit(X, y)
    preds = model.predict(X)
    residuals = y - preds

    z_resid = zscore(residuals)
    mask = pd.Series(np.abs(z_resid) > threshold, index=data.index)

    if mask.sum() > 0:
        print(f"\033[91m⚠️ [{column}] - Detected {mask.sum()} residual outliers using Bayesian Regression (threshold = {threshold})\033[0m")
    else:
        print(f"\033[92m✅ [{column}] - No residual outliers detected using Bayesian Regression (threshold = {threshold})\033[0m")

    return mask

# Run Bayesian residual outlier detection per column using others as predictors
bayes_results = {}
numeric_cols = df.select_dtypes(include='number').columns.tolist()
for target_col in numeric_cols:
    bayes_results[target_col] = detect_outliers_bayesian_residuals(df, target_col)
⚠️ [feature_1] - Detected 2 residual outliers using Bayesian Regression (threshold = 3.0)
⚠️ [feature_2] - Detected 1 residual outliers using Bayesian Regression (threshold = 3.0)
⚠️ [feature_3] - Detected 2 residual outliers using Bayesian Regression (threshold = 3.0)

Back to the top


🛠️ Outlier Treatment Strategies¶

❌ Deletion¶

📖 Click to Expand
❌ What is Deletion?¶

Deletion involves simply removing identified outliers from the dataset before training or analysis.

⚙️ How It Works¶
  • Detect outliers using a chosen method (e.g., IQR, Z-Score)
  • Drop those rows from the dataset using filtering or indexing
🕵️‍♂️ When to Use¶
  • When dataset is large and removing a few rows won’t affect learning
  • If outliers are clearly data entry errors or irrelevant edge cases
✅ Pros¶
  • Simple and fast to apply
  • Removes potential noise from modeling
⚠️ Limitations¶
  • Risks losing valuable signal, especially in small datasets
  • Can bias model if outliers carry meaningful variation
  • Not reversible — original data is discarded
In [16]:
def treat_outliers_deletion(df, mask_dict):
    """
    Remove rows from the DataFrame where any of the provided masks are True.

    Parameters:
        df (pd.DataFrame): Original DataFrame.
        mask_dict (dict): Dictionary of {column: boolean mask Series}.

    Returns:
        pd.DataFrame: Cleaned DataFrame with outliers removed.
    """
    combined_mask = pd.Series(False, index=df.index)
    for col, mask in mask_dict.items():
        combined_mask = combined_mask | mask

    treated_count = combined_mask.sum()
    cleaned_df = df[~combined_mask]

    print(f"\033[94m🔧 Deleted {treated_count} rows containing outliers across any specified column\033[0m")
    return cleaned_df

# Example: remove all rows detected as outliers by IQR
df_deleted = treat_outliers_deletion(df, iqr_results)
🔧 Deleted 9 rows containing outliers across any specified column

🔁 Capping / Winsorizing¶

📖 Click to Expand
🔁 What is Capping / Winsorizing?¶

Capping (also known as Winsorizing) involves limiting extreme values by replacing them with specified percentile thresholds, rather than removing them.

⚙️ How It Works¶
  • Identify upper and lower percentile cutoffs (e.g., 1st and 99th percentiles)
  • Replace values above the upper bound with the upper threshold
  • Replace values below the lower bound with the lower threshold
🕵️‍♂️ When to Use¶
  • When outliers are legitimate but too influential
  • In models sensitive to scale (e.g., linear regression)
✅ Pros¶
  • Preserves dataset size and structure
  • Reduces the influence of extreme values without deletion
⚠️ Limitations¶
  • Thresholds are arbitrary — may require tuning
  • Can distort distribution if used aggressively
  • Doesn’t address multivariate outliers
In [17]:
def treat_outliers_capping(df, columns, lower_quantile=0.01, upper_quantile=0.99):
    """
    Cap outliers in specified columns based on percentile thresholds.

    Parameters:
        df (pd.DataFrame): Input DataFrame.
        columns (list): Columns to apply capping.
        lower_quantile (float): Lower bound percentile.
        upper_quantile (float): Upper bound percentile.

    Returns:
        pd.DataFrame: DataFrame with capped values.
    """
    df_capped = df.copy()
    total_treated = 0

    for col in columns:
        if not np.issubdtype(df[col].dtype, np.number):
            continue

        lower = df[col].quantile(lower_quantile)
        upper = df[col].quantile(upper_quantile)

        before = df[col]
        capped = before.clip(lower, upper)
        treated = (before != capped).sum()
        total_treated += treated

        df_capped[col] = capped

    print(f"\033[94m🔧 Capped {total_treated} outlier values using {int(lower_quantile*100)}–{int(upper_quantile*100)} percentile thresholds\033[0m")
    return df_capped

# Example: apply capping to all numeric columns
df_capped = treat_outliers_capping(df, df.select_dtypes(include='number').columns.tolist())
🔧 Capped 6 outlier values using 1–99 percentile thresholds

🧮 Imputation¶

📖 Click to Expand
🧮 What is Imputation for Outliers?¶

Imputation replaces outlier values with a more reasonable estimate (e.g., mean, median, or model prediction), treating them similarly to missing data.

⚙️ How It Works¶
  • Identify outliers using a chosen method
  • Replace them with:
    • Central tendency (mean, median)
    • Value from a predictive model (regression, KNN, etc.)
    • Group-specific statistics (e.g., median by segment)
🕵️‍♂️ When to Use¶
  • When outliers are suspected to be corrupted or extreme noise
  • When data integrity is important and deletion isn't an option
  • Especially useful for time series, healthcare, or small datasets
✅ Pros¶
  • Retains dataset size and row context
  • Can preserve statistical properties when done carefully
⚠️ Limitations¶
  • Imputed values may hide uncertainty
  • Risk of biasing the dataset if imputation method is naive
  • Not suitable when outliers are meaningful or intentional signals
In [18]:
def treat_outliers_imputation(df, mask_dict, strategy="median"):
    """
    Impute outliers in specified columns using central tendency.

    Parameters:
        df (pd.DataFrame): Input DataFrame.
        mask_dict (dict): Dictionary of {column: boolean mask Series}.
        strategy (str): Imputation method: 'mean' or 'median'.

    Returns:
        pd.DataFrame: DataFrame with outliers imputed.
    """
    df_imputed = df.copy()
    total_imputed = 0

    for col, mask in mask_dict.items():
        if not np.issubdtype(df[col].dtype, np.number):
            continue

        if strategy == "mean":
            value = df[col].mean()
        elif strategy == "median":
            value = df[col].median()
        else:
            raise ValueError("Strategy must be 'mean' or 'median'.")

        df_imputed.loc[mask, col] = value
        total_imputed += mask.sum()

    print(f"\033[94m🔧 Imputed {total_imputed} outliers using {strategy} strategy\033[0m")
    return df_imputed

# Example: Impute IQR-detected outliers using median
df_imputed = treat_outliers_imputation(df, iqr_results, strategy="median")
🔧 Imputed 9 outliers using median strategy

📊 Binning¶

📖 Click to Expand
📊 What is Binning?¶

Binning transforms continuous variables into discrete categories (bins), which can help smooth out the influence of outliers.

⚙️ How It Works¶
  • Define bin edges manually or using quantiles (equal-width or equal-frequency)
  • Replace raw values with corresponding bin labels or codes
  • Outliers naturally fall into edge bins, limiting their impact
🕵️‍♂️ When to Use¶
  • When interpretability is more important than precision
  • As a preprocessing step for tree-based models or rule-based systems
  • To mitigate extreme values in skewed distributions
✅ Pros¶
  • Reduces influence of outliers
  • Can simplify feature relationships
  • Useful for feature engineering
⚠️ Limitations¶
  • Can lead to information loss
  • Bin choice is arbitrary — poor binning can hurt performance
  • Not suitable for models that rely on continuous features (e.g., linear regression)
In [19]:
def treat_outliers_binning(df, columns, bins=5, strategy="quantile"):
    """
    Discretize numeric columns into bins to reduce outlier impact.

    Parameters:
        df (pd.DataFrame): Input DataFrame.
        columns (list): List of numeric columns to bin.
        bins (int): Number of bins.
        strategy (str): 'quantile' or 'uniform'.

    Returns:
        pd.DataFrame: DataFrame with binned columns.
    """
    df_binned = df.copy()
    total_binned = 0

    for col in columns:
        if not np.issubdtype(df[col].dtype, np.number):
            continue

        if strategy == "quantile":
            df_binned[col], bin_edges = pd.qcut(df[col], q=bins, labels=False, retbins=True, duplicates='drop')
        elif strategy == "uniform":
            df_binned[col], bin_edges = pd.cut(df[col], bins=bins, labels=False, retbins=True)
        else:
            raise ValueError("Strategy must be 'quantile' or 'uniform'.")

        total_binned += df[col].notna().sum()

    print(f"\033[94m🔧 Binned {total_binned} values across {len(columns)} column(s) using '{strategy}' strategy with {bins} bins\033[0m")
    return df_binned

# Example: Apply quantile-based binning to all numeric columns
df_binned = treat_outliers_binning(df, df.select_dtypes(include='number').columns.tolist(), bins=4, strategy="quantile")
🔧 Binned 300 values across 3 column(s) using 'quantile' strategy with 4 bins

Back to the top