🔍 Statistical Detection Methods¶
import numpy as np
import pandas as pd
# Create a dummy dataset with multivariate numeric features
np.random.seed(42)
# Generate base normal data
data = {
'feature_1': np.random.normal(50, 10, 100),
'feature_2': np.random.normal(30, 5, 100),
'feature_3': np.random.normal(100, 20, 100)
}
# Inject outliers
data['feature_1'][[5, 15]] = [150, -30]
data['feature_2'][[25]] = [100]
data['feature_3'][[70, 90]] = [250, -80]
df = pd.DataFrame(data)
df.head()
|   | feature_1 | feature_2 | feature_3 |
|---|---|---|---|
| 0 | 54.967142 | 22.923146 | 107.155747 |
| 1 | 48.617357 | 27.896773 | 111.215691 |
| 2 | 56.476885 | 28.286427 | 121.661025 |
| 3 | 65.230299 | 25.988614 | 121.076041 |
| 4 | 47.658466 | 29.193571 | 72.446613 |
📈 Z-Score¶
📖 Click to Expand
📈 What is Z-Score?¶
Z-Score measures how many standard deviations a data point is from the mean of the distribution.
It assumes the data follows a normal (Gaussian) distribution.
⚙️ How It Works¶
- Calculate mean (μ) and standard deviation (σ) of the feature
- For each value ( x ), compute ( z = \frac{x - \mu}{\sigma} )
- Points with |z| > threshold (typically 3) are flagged as outliers
🕵️♂️ When to Use¶
- When the variable is normally distributed
- Works well for univariate outlier detection
✅ Pros¶
- Simple and interpretable
- Requires no parameter tuning beyond threshold
⚠️ Limitations¶
- Sensitive to mean and standard deviation — affected by extreme values
- Fails on skewed or non-Gaussian data
- Can miss multivariate outliers
from scipy.stats import zscore, median_abs_deviation
def detect_outliers_zscore(df, column, threshold=3.0):
"""
Detect outliers in a single numeric column of a DataFrame using Z-Score.
Parameters:
df (pd.DataFrame): Input DataFrame.
column (str): Name of the numeric column.
threshold (float): Z-score threshold (default is 3.0).
Returns:
pd.Series: Boolean mask where True indicates an outlier in the specified column.
"""
if column not in df.columns or not np.issubdtype(df[column].dtype, np.number):
raise ValueError(f"Column '{column}' is either missing or not numeric.")
    z_scores = zscore(df[column])
    mask = pd.Series(np.abs(z_scores) > threshold, index=df.index)
if mask.sum() > 0:
print(f"\033[91m⚠️ [{column}] - Detected {mask.sum()} outliers using Z-Score (threshold = {threshold})\033[0m")
else:
print(f"\033[92m✅ [{column}] - No outliers detected using Z-Score (threshold = {threshold})\033[0m")
return mask
# Run Z-Score detection across all numeric columns
zscore_results = {}
for col in df.select_dtypes(include='number').columns:
    zscore_results[col] = detect_outliers_zscore(df, col)
⚠️ [feature_1] - Detected 2 outliers using Z-Score (threshold = 3.0)
⚠️ [feature_2] - Detected 1 outliers using Z-Score (threshold = 3.0)
⚠️ [feature_3] - Detected 2 outliers using Z-Score (threshold = 3.0)
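As a quick sanity check, the rows flagged for feature_1 should line up with the positions where outliers were injected above (rows 5 and 15). A minimal sketch, reusing the masks returned by the loop:

# Row labels flagged by Z-Score for feature_1 — expected to match the injected positions 5 and 15
flagged = df.index[zscore_results['feature_1']]
print(flagged.tolist())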
🧮 Modified Z-Score (MAD)¶
📖 Click to Expand
🧮 What is Modified Z-Score?¶
Modified Z-Score replaces the mean and standard deviation in the Z-Score formula with the median and median absolute deviation (MAD), making it robust to extreme values.
⚙️ How It Works¶
- Compute the median (M) of the data
- Calculate MAD = median(|xᵢ - M|)
- For each point ( x ), compute:
( \text{Modified Z} = 0.6745 \cdot \frac{x - M}{\text{MAD}} )
- Flag points where |Modified Z| > threshold (typically 3.5)
🕵️♂️ When to Use¶
- When the distribution is skewed or contains outliers
- Works well for robust univariate detection
✅ Pros¶
- Resistant to influence from extreme values
- More reliable on small or noisy datasets
⚠️ Limitations¶
- Assumes unimodal structure; may struggle with multimodal data
- Still a univariate method — doesn't capture contextual or multivariate outliers
def detect_outliers_modified_zscore(df, column, threshold=3.5):
"""
Detect outliers using the Modified Z-Score method based on MAD.
Parameters:
df (pd.DataFrame): Input DataFrame.
column (str): Name of the numeric column.
threshold (float): Modified Z-Score threshold (default is 3.5).
Returns:
pd.Series: Boolean mask where True indicates an outlier.
"""
if column not in df.columns or not np.issubdtype(df[column].dtype, np.number):
raise ValueError(f"Column '{column}' is either missing or not numeric.")
series = df[column]
median = series.median()
    mad = median_abs_deviation(series)  # raw MAD (scale=1); the 0.6745 constant below converts it to a std-like scale
if mad == 0:
mask = pd.Series([False] * len(series), index=series.index)
print(f"\033[93m⚠️ [{column}] - MAD is zero. Skipping detection.\033[0m")
return mask
modified_z = 0.6745 * (series - median) / mad
mask = np.abs(modified_z) > threshold
if mask.sum() > 0:
print(f"\033[91m⚠️ [{column}] - Detected {mask.sum()} outliers using Modified Z-Score (threshold = {threshold})\033[0m")
else:
print(f"\033[92m✅ [{column}] - No outliers detected using Modified Z-Score (threshold = {threshold})\033[0m")
return mask
# Run Modified Z-Score detection across all numeric columns
mad_results = {}
for col in df.select_dtypes(include='number').columns:
mad_results[col] = detect_outliers_modified_zscore(df, col)
⚠️ [feature_1] - Detected 2 outliers using Modified Z-Score (threshold = 3.5)
⚠️ [feature_2] - Detected 1 outliers using Modified Z-Score (threshold = 3.5)
⚠️ [feature_3] - Detected 2 outliers using Modified Z-Score (threshold = 3.5)
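A note on the scale argument: scipy's median_abs_deviation returns the raw MAD by default, while scale='normal' rescales it by roughly 1/0.6745 so it estimates the standard deviation under normality. The two formulations of the modified Z-score are therefore interchangeable; a small sketch to make the relationship explicit:

x = df['feature_1']
raw_mad = median_abs_deviation(x)                     # raw MAD (scale=1)
scaled_mad = median_abs_deviation(x, scale='normal')  # ≈ raw MAD / 0.6745, a robust std estimate
mz_raw = 0.6745 * (x - x.median()) / raw_mad          # textbook form with the 0.6745 constant
mz_scaled = (x - x.median()) / scaled_mad             # same values, constant folded into the scale
print(np.abs(mz_raw - mz_scaled).max())               # near zero; 0.6745 is a rounded constant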
📊 Interquartile Range (IQR)¶
📖 Click to Expand
📊 What is IQR?¶
The Interquartile Range (IQR) is a rule-based method that defines outliers as points lying far outside the central 50% of the data.
It does not assume any specific distribution.
⚙️ How It Works¶
- Compute Q1 (25th percentile) and Q3 (75th percentile)
- Calculate IQR = Q3 - Q1
- Define lower bound = Q1 - 1.5 × IQR
- Define upper bound = Q3 + 1.5 × IQR
- Flag points outside this range as outliers
🕵️♂️ When to Use¶
- For non-parametric, univariate outlier detection
- Effective when data is not normally distributed
✅ Pros¶
- Simple and interpretable
- Not influenced by extreme values
- No distributional assumptions
⚠️ Limitations¶
- Doesn't work well with multimodal distributions
- Only considers 1 feature at a time (univariate)
- Threshold (1.5×IQR) is arbitrary and may require tuning
def detect_outliers_iqr(df, column, factor=1.5):
"""
Detect outliers using the IQR method.
Parameters:
df (pd.DataFrame): Input DataFrame.
column (str): Name of the numeric column.
factor (float): Multiplier for IQR (default is 1.5).
Returns:
pd.Series: Boolean mask where True indicates an outlier.
"""
if column not in df.columns or not np.issubdtype(df[column].dtype, np.number):
raise ValueError(f"Column '{column}' is either missing or not numeric.")
Q1 = df[column].quantile(0.25)
Q3 = df[column].quantile(0.75)
IQR = Q3 - Q1
lower = Q1 - factor * IQR
upper = Q3 + factor * IQR
mask = (df[column] < lower) | (df[column] > upper)
if mask.sum() > 0:
print(f"\033[91m⚠️ [{column}] - Detected {mask.sum()} outliers using IQR (factor = {factor})\033[0m")
else:
print(f"\033[92m✅ [{column}] - No outliers detected using IQR (factor = {factor})\033[0m")
return mask
# Run IQR detection across all numeric columns
iqr_results = {}
for col in df.select_dtypes(include='number').columns:
iqr_results[col] = detect_outliers_iqr(df, col)
⚠️ [feature_1] - Detected 3 outliers using IQR (factor = 1.5)
⚠️ [feature_2] - Detected 2 outliers using IQR (factor = 1.5)
⚠️ [feature_3] - Detected 4 outliers using IQR (factor = 1.5)
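For intuition, the fences for a single column can be computed by hand; a minimal sketch for feature_1 (the exact numbers depend on the sampled data):

q1, q3 = df['feature_1'].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print(f"Q1={q1:.2f}, Q3={q3:.2f}, IQR={iqr:.2f}, fences=({lower:.2f}, {upper:.2f})")
# The injected values 150 and -30 fall well outside these fences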
🔍 Grubbs' Test¶
📖 Click to Expand
🔍 What is Grubbs' Test?¶
Grubbs' Test is a statistical test used to detect a single outlier in a normally distributed univariate dataset.
It evaluates whether the extreme value is statistically different from the rest of the data.
⚙️ How It Works¶
- Assumes data follows a normal distribution
- Calculates a test statistic based on the most extreme value: ( G = \frac{|\text{extreme} - \bar{x}|}{s} )
- Compares ( G ) to a critical value from the t-distribution
- If ( G ) exceeds the threshold, the point is flagged as an outlier
🕵️♂️ When to Use¶
- On small, normally distributed datasets
- When verifying if a specific extreme point is statistically unjustified
✅ Pros¶
- Provides a statistical test (p-value) rather than just a threshold rule
- Good for auditing one suspect value at a time
⚠️ Limitations¶
- Can only detect one outlier at a time
- Assumes normality — breaks down if data is skewed or multimodal
- Not scalable to large or multivariate datasets
from scipy.stats import t, norm
def detect_outliers_grubbs(df, column, alpha=0.05):
"""
Detect a single outlier in a numeric column using Grubbs' Test.
Parameters:
df (pd.DataFrame): Input DataFrame.
column (str): Name of the numeric column.
alpha (float): Significance level (default is 0.05).
Returns:
pd.Series: Boolean mask where True indicates an outlier.
"""
if column not in df.columns or not np.issubdtype(df[column].dtype, np.number):
raise ValueError(f"Column '{column}' is either missing or not numeric.")
x = df[column].dropna().copy()
n = len(x)
mean_x = x.mean()
std_x = x.std()
# Grubbs statistic
abs_devs = abs(x - mean_x)
G = abs_devs.max() / std_x
# Critical value
t_crit = t.ppf(1 - alpha / (2 * n), df=n - 2)
critical_value = ((n - 1) / np.sqrt(n)) * np.sqrt(t_crit**2 / (n - 2 + t_crit**2))
mask = abs_devs == abs_devs.max()
if G > critical_value:
print(f"\033[91m⚠️ [{column}] - Grubbs' Test: 1 outlier detected (G = {G:.2f}, critical = {critical_value:.2f})\033[0m")
else:
print(f"\033[92m✅ [{column}] - Grubbs' Test: No outlier detected (G = {G:.2f}, critical = {critical_value:.2f})\033[0m")
mask[:] = False
return mask
# Run Grubbs' Test across all numeric columns
grubbs_results = {}
for col in df.select_dtypes(include='number').columns:
grubbs_results[col] = detect_outliers_grubbs(df, col)
⚠️ [feature_1] - Grubbs' Test: 1 outlier detected (G = 6.39, critical = 3.38)
⚠️ [feature_2] - Grubbs' Test: 1 outlier detected (G = 8.25, critical = 3.38)
⚠️ [feature_3] - Grubbs' Test: 1 outlier detected (G = 5.67, critical = 3.38)
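Because Grubbs' Test flags at most one point per pass, a common workaround is to apply it repeatedly, removing the flagged point each round until the test no longer rejects. A minimal sketch reusing detect_outliers_grubbs above (sequential testing inflates the nominal alpha slightly; a generalized ESD test is the more rigorous alternative):

def detect_outliers_grubbs_iterative(df, column, alpha=0.05, max_iter=10):
    """Repeatedly apply Grubbs' Test, dropping the flagged point each round."""
    remaining = df[[column]].dropna().copy()
    outlier_idx = []
    for _ in range(max_iter):
        mask = detect_outliers_grubbs(remaining, column, alpha=alpha)
        if not mask.any():
            break
        outlier_idx.extend(remaining.index[mask].tolist())
        remaining = remaining.loc[~mask]
    return pd.Series(df.index.isin(outlier_idx), index=df.index)

grubbs_iter_mask = detect_outliers_grubbs_iterative(df, 'feature_1')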
📏 Chi-Square & Mahalanobis Distance¶
📖 Click to Expand
📏 What Are Chi-Square & Mahalanobis Distance?¶
Both are multivariate statistical techniques for detecting outliers by measuring how far a point is from the expected distribution in a multidimensional space.
- Mahalanobis Distance accounts for correlation between features
- Chi-Square Test uses Mahalanobis distance under the assumption of multivariate normality
⚙️ How It Works¶
- Compute the mean vector and covariance matrix of the dataset
- For each point ( x ), compute Mahalanobis distance: ( D^2 = (x - \mu)^T \Sigma^{-1} (x - \mu) )
- Compare ( D^2 ) to a Chi-Square critical value with degrees of freedom equal to the number of features
🕵️♂️ When to Use¶
- Detecting outliers in multivariate numeric data
- When feature correlation is important (e.g., financial indicators)
✅ Pros¶
- Captures multivariate outliers missed by univariate methods
- Incorporates feature relationships via covariance
⚠️ Limitations¶
- Assumes multivariate normal distribution
- Sensitive to outliers in the covariance matrix
- Requires numerical, continuous variables
from scipy.stats import chi2
def detect_outliers_mahalanobis(df, columns, alpha=0.01):
"""
Detect multivariate outliers using Mahalanobis distance and Chi-Square threshold.
Parameters:
df (pd.DataFrame): Input DataFrame.
columns (list): List of numeric columns to include.
alpha (float): Significance level for Chi-Square test (default 0.01).
Returns:
pd.Series: Boolean mask where True indicates an outlier.
"""
data = df[columns].dropna()
x = data.values
mean_vec = np.mean(x, axis=0)
cov_matrix = np.cov(x, rowvar=False)
inv_covmat = np.linalg.inv(cov_matrix)
diff = x - mean_vec
left = np.dot(diff, inv_covmat)
mahal_sq = np.sum(left * diff, axis=1)
threshold = chi2.ppf(1 - alpha, df=len(columns))
mask = pd.Series(mahal_sq > threshold, index=data.index)
if mask.sum() > 0:
print(f"\033[91m⚠️ [Mahalanobis] - Detected {mask.sum()} multivariate outliers (alpha = {alpha})\033[0m")
else:
print(f"\033[92m✅ [Mahalanobis] - No multivariate outliers detected (alpha = {alpha})\033[0m")
return mask
# Apply on all numeric columns
mahalanobis_mask = detect_outliers_mahalanobis(df, df.select_dtypes(include='number').columns.tolist())
⚠️ [Mahalanobis] - Detected 5 multivariate outliers (alpha = 0.01)
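One way to address the covariance-contamination issue noted above is to estimate location and scatter robustly before computing distances. A sketch using scikit-learn's MinCovDet (Minimum Covariance Determinant) in place of the sample mean and covariance, keeping the same Chi-Square cutoff:

from sklearn.covariance import MinCovDet

def detect_outliers_mahalanobis_robust(df, columns, alpha=0.01, random_state=42):
    """Mahalanobis distances measured from a robust (MCD) location/covariance estimate."""
    data = df[columns].dropna()
    mcd = MinCovDet(random_state=random_state).fit(data.values)
    mahal_sq = mcd.mahalanobis(data.values)          # squared distances to the robust center
    threshold = chi2.ppf(1 - alpha, df=len(columns))
    mask = pd.Series(mahal_sq > threshold, index=data.index)
    print(f"[Robust Mahalanobis] - Detected {mask.sum()} outliers (alpha = {alpha})")
    return mask

robust_mahal_mask = detect_outliers_mahalanobis_robust(df, df.select_dtypes(include='number').columns.tolist())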
🤖 ML-Based Detection Methods¶
🌲 Isolation Forest¶
📖 Click to Expand
🌲 What is Isolation Forest?¶
Isolation Forest is an ensemble-based anomaly detection method that works by isolating observations using random splits.
Outliers are easier to isolate and require fewer splits.
⚙️ How It Works¶
- Builds multiple trees by randomly selecting features and split values
- Each sample's average path length across trees is computed
- Outliers have shorter path lengths (isolated faster)
- Scores are assigned to rank outlier likelihood
🕵️♂️ When to Use¶
- On high-dimensional or large datasets
- When a model-based, non-parametric detector is preferred
- Suitable for both univariate and multivariate outlier detection
✅ Pros¶
- Scales well to large datasets
- No distributional assumptions
- Handles multivariate relationships
⚠️ Limitations¶
- Output scores are relative, not probabilistic
- Performance can vary with small datasets
- May struggle with highly imbalanced feature importance
from sklearn.ensemble import IsolationForest
def detect_outliers_isolation_forest(df, columns, contamination=0.05, random_state=42):
"""
Detect outliers using Isolation Forest.
Parameters:
df (pd.DataFrame): Input DataFrame.
columns (list): List of columns to apply the method on.
contamination (float): Expected proportion of outliers.
random_state (int): Seed for reproducibility.
Returns:
pd.Series: Boolean mask where True indicates an outlier.
"""
data = df[columns].dropna()
iso = IsolationForest(contamination=contamination, random_state=random_state)
preds = iso.fit_predict(data)
mask = pd.Series(preds == -1, index=data.index)
if mask.sum() > 0:
print(f"\033[91m⚠️ [Isolation Forest] - Detected {mask.sum()} outliers (contamination = {contamination})\033[0m")
else:
print(f"\033[92m✅ [Isolation Forest] - No outliers detected (contamination = {contamination})\033[0m")
return mask
# Apply Isolation Forest on all numeric columns
isoforest_mask = detect_outliers_isolation_forest(df, df.select_dtypes(include='number').columns.tolist())
⚠️ [Isolation Forest] - Detected 5 outliers (contamination = 0.05)
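The -1/+1 labels above hide the underlying ranking. When a ranked view is useful, score_samples exposes a per-row anomaly score (lower means more anomalous). A small sketch that refits the same configuration, since the helper above does not return the fitted estimator:

numeric_cols = df.select_dtypes(include='number').columns.tolist()
iso = IsolationForest(contamination=0.05, random_state=42).fit(df[numeric_cols])
scores = pd.Series(iso.score_samples(df[numeric_cols]), index=df.index)
print(scores.nsmallest(5))  # the five most anomalous rows by score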
🔎 Local Outlier Factor (LOF)¶
📖 Click to Expand
🔎 What is LOF?¶
Local Outlier Factor (LOF) identifies outliers by comparing the local density of each point to that of its neighbors.
Outliers have significantly lower density than surrounding points.
⚙️ How It Works¶
- For each point, compute its k-nearest neighbors
- Estimate local density based on average distance to those neighbors
- Compute the LOF score by comparing a point’s density to that of its neighbors
- Higher LOF score = more likely to be an outlier
🕵️♂️ When to Use¶
- Detecting local anomalies that deviate within a neighborhood
- Datasets with clusters of varying density
✅ Pros¶
- Captures local context, unlike global methods
- Good for complex, non-linear distributions
⚠️ Limitations¶
- Sensitive to the choice of k (number of neighbors)
- Struggles with high-dimensional data
- Scores are relative — not probability-based
from sklearn.neighbors import LocalOutlierFactor
def detect_outliers_lof(df, columns, n_neighbors=20, contamination=0.05):
"""
Detect outliers using Local Outlier Factor (LOF).
Parameters:
df (pd.DataFrame): Input DataFrame.
columns (list): List of columns to apply the method on.
n_neighbors (int): Number of neighbors to use for LOF.
contamination (float): Proportion of expected outliers.
Returns:
pd.Series: Boolean mask where True indicates an outlier.
"""
data = df[columns].dropna()
lof = LocalOutlierFactor(n_neighbors=n_neighbors, contamination=contamination)
preds = lof.fit_predict(data)
mask = pd.Series(preds == -1, index=data.index)
if mask.sum() > 0:
print(f"\033[91m⚠️ [LOF] - Detected {mask.sum()} outliers (n_neighbors = {n_neighbors}, contamination = {contamination})\033[0m")
else:
print(f"\033[92m✅ [LOF] - No outliers detected (n_neighbors = {n_neighbors})\033[0m")
return mask
# Apply LOF on all numeric columns
lof_mask = detect_outliers_lof(df, df.select_dtypes(include='number').columns.tolist())
⚠️ [LOF] - Detected 5 outliers (n_neighbors = 20, contamination = 0.05)
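LOF likewise exposes raw scores via the fitted estimator's negative_outlier_factor_ attribute (more negative means more anomalous), which helps when a ranking rather than a hard mask is needed. A brief sketch under the same settings, refitting because the helper above does not return the estimator:

numeric_cols = df.select_dtypes(include='number').columns.tolist()
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.05)
lof.fit_predict(df[numeric_cols])
lof_scores = pd.Series(lof.negative_outlier_factor_, index=df.index)
print(lof_scores.nsmallest(5))  # rows with the lowest (most anomalous) LOF scores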
📍 Proximity & Clustering-Based Detection¶
🌀 DBSCAN¶
📖 Click to Expand
🌀 What is DBSCAN?¶
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a clustering algorithm that identifies outliers as points that do not belong to any dense region.
⚙️ How It Works¶
- Groups points into clusters based on density (minPts within a radius ε)
- Points not reachable from any cluster are labeled as noise
- These noise points are treated as outliers
🕵️♂️ When to Use¶
- When the data has irregular shapes or density-based clusters
- Works well for unsupervised anomaly detection
✅ Pros¶
- No need to specify number of clusters
- Can detect outliers as a byproduct of clustering
- Handles arbitrarily shaped clusters
⚠️ Limitations¶
- Requires tuning of ε and minPts
- Sensitive to scale of features — needs normalization
- Can fail with varying density clusters
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler
def detect_outliers_dbscan(df, columns, eps=0.5, min_samples=5):
"""
Detect outliers using DBSCAN clustering (label -1 as noise).
Parameters:
df (pd.DataFrame): Input DataFrame.
columns (list): List of columns to apply DBSCAN on.
eps (float): Maximum distance between samples to be considered neighbors.
min_samples (int): Minimum samples required to form a dense region.
Returns:
pd.Series: Boolean mask where True indicates an outlier.
"""
data = df[columns].dropna()
scaled = StandardScaler().fit_transform(data)
db = DBSCAN(eps=eps, min_samples=min_samples)
labels = db.fit_predict(scaled)
mask = pd.Series(labels == -1, index=data.index)
if mask.sum() > 0:
print(f"\033[91m⚠️ [DBSCAN] - Detected {mask.sum()} outliers (eps = {eps}, min_samples = {min_samples})\033[0m")
else:
print(f"\033[92m✅ [DBSCAN] - No outliers detected (eps = {eps}, min_samples = {min_samples})\033[0m")
return mask
# Apply DBSCAN on all numeric columns
dbscan_mask = detect_outliers_dbscan(df, df.select_dtypes(include='number').columns.tolist())
⚠️ [DBSCAN] - Detected 32 outliers (eps = 0.5, min_samples = 5)
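A common heuristic for choosing eps is to inspect each point's distance to its k-th nearest neighbor (with k = min_samples) and pick a value near the knee of the sorted curve. A rough sketch on the standardized features; the 90th-percentile readout is only an illustrative starting point, not a rule:

from sklearn.neighbors import NearestNeighbors

numeric_cols = df.select_dtypes(include='number').columns.tolist()
scaled = StandardScaler().fit_transform(df[numeric_cols])
nn = NearestNeighbors(n_neighbors=5).fit(scaled)          # 5 neighbors to mirror min_samples=5
kth_dist = np.sort(nn.kneighbors(scaled)[0][:, -1])       # sorted k-distance curve (useful to plot)
print(f"k-distance at the 90th percentile: {np.quantile(kth_dist, 0.90):.2f}")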
📍 K-Means Distance¶
📖 Click to Expand
📍 What is K-Means Distance for Outlier Detection?¶
In K-Means, outliers are often identified as points that are far from their assigned cluster centroids — i.e., they have high intra-cluster distances.
⚙️ How It Works¶
- Run K-Means clustering on the dataset
- For each point, compute the distance to its assigned cluster centroid
- Rank points by distance; outliers lie in the tail of the distance distribution
🕵️♂️ When to Use¶
- As a quick unsupervised heuristic
- When K-Means is already used for segmentation and you want to flag edge cases
✅ Pros¶
- Simple to implement using existing clustering output
- Scales well to large datasets
⚠️ Limitations¶
- Assumes spherical clusters — poor performance on non-globular data
- Requires setting k (number of clusters)
- Sensitive to initial centroid selection and feature scaling
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin_min
def detect_outliers_kmeans(df, columns, n_clusters=3, threshold_quantile=0.99, random_state=42):
"""
Detect outliers based on distance to KMeans cluster centroids.
Parameters:
df (pd.DataFrame): Input DataFrame.
columns (list): List of columns to apply KMeans on.
n_clusters (int): Number of clusters.
threshold_quantile (float): Quantile of distance above which to flag as outlier.
random_state (int): Seed for reproducibility.
Returns:
pd.Series: Boolean mask where True indicates an outlier.
"""
data = df[columns].dropna()
scaled = StandardScaler().fit_transform(data)
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state)
labels = kmeans.fit_predict(scaled)
_, distances = pairwise_distances_argmin_min(scaled, kmeans.cluster_centers_)
threshold = np.quantile(distances, threshold_quantile)
mask = pd.Series(distances > threshold, index=data.index)
if mask.sum() > 0:
print(f"\033[91m⚠️ [KMeans Distance] - Detected {mask.sum()} outliers (above {int(threshold_quantile * 100)}th percentile distance)\033[0m")
else:
print(f"\033[92m✅ [KMeans Distance] - No outliers detected (above {int(threshold_quantile * 100)}th percentile distance)\033[0m")
return mask
# Apply KMeans distance-based detection
kmeans_mask = detect_outliers_kmeans(df, df.select_dtypes(include='number').columns.tolist())
⚠️ [KMeans Distance] - Detected 1 outliers (above 99th percentile distance)
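Choosing k is the main tuning decision called out above; the silhouette score gives a quick, if imperfect, comparison across candidate values. A rough sketch on the standardized features (higher is better):

from sklearn.metrics import silhouette_score

numeric_cols = df.select_dtypes(include='number').columns.tolist()
scaled = StandardScaler().fit_transform(df[numeric_cols])
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(scaled)
    print(f"k={k}: silhouette = {silhouette_score(scaled, labels):.3f}")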
📐 Probabilistic Detection Methods¶
📊 Gaussian Mixture Models (GMM)¶
📖 Click to Expand
📊 What is GMM for Outlier Detection?¶
GMM assumes the data is generated from a mixture of several Gaussian distributions.
Points with low likelihood under the fitted model are flagged as outliers.
⚙️ How It Works¶
- Fit a Gaussian mixture model using Expectation-Maximization (EM)
- Compute the log-likelihood of each point under the model
- Flag points with likelihood below a threshold as outliers
🕵️♂️ When to Use¶
- For continuous multivariate data
- When data clusters resemble Gaussian blobs
✅ Pros¶
- Soft clustering: gives probabilistic interpretation
- Handles complex, multimodal distributions
⚠️ Limitations¶
- Assumes Gaussian components — may fail on heavy-tailed or skewed data
- Requires selecting the number of components (can be tuned via BIC/AIC)
- Sensitive to outliers during model fitting
from sklearn.mixture import GaussianMixture
def detect_outliers_gmm(df, columns, n_components=3, threshold_quantile=0.01, random_state=42):
"""
Detect outliers using Gaussian Mixture Model log-likelihood scores.
Parameters:
df (pd.DataFrame): Input DataFrame.
columns (list): List of columns to apply GMM on.
n_components (int): Number of Gaussian components.
threshold_quantile (float): Lower quantile of log-likelihood to flag as outlier.
random_state (int): Seed for reproducibility.
Returns:
pd.Series: Boolean mask where True indicates an outlier.
"""
data = df[columns].dropna()
scaled = StandardScaler().fit_transform(data)
gmm = GaussianMixture(n_components=n_components, random_state=random_state)
gmm.fit(scaled)
log_probs = gmm.score_samples(scaled)
threshold = np.quantile(log_probs, threshold_quantile)
mask = pd.Series(log_probs < threshold, index=data.index)
if mask.sum() > 0:
print(f"\033[91m⚠️ [GMM] - Detected {mask.sum()} outliers (below {int(threshold_quantile * 100)}th percentile log-likelihood)\033[0m")
else:
print(f"\033[92m✅ [GMM] - No outliers detected (below {int(threshold_quantile * 100)}th percentile log-likelihood)\033[0m")
return mask
# Apply GMM outlier detection
gmm_mask = detect_outliers_gmm(df, df.select_dtypes(include='number').columns.tolist())
⚠️ [GMM] - Detected 1 outliers (below 1th percentile log-likelihood)
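Since the number of components is rarely known in advance, the limitation above suggests tuning it via BIC/AIC. A small sketch comparing BIC across candidate component counts (lower is better):

numeric_cols = df.select_dtypes(include='number').columns.tolist()
scaled = StandardScaler().fit_transform(df[numeric_cols])
for k in range(1, 6):
    gmm = GaussianMixture(n_components=k, random_state=42).fit(scaled)
    print(f"n_components={k}: BIC = {gmm.bic(scaled):.1f}")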
📉 Extreme Value Theory (EVT)¶
📖 Click to Expand
📉 What is Extreme Value Theory (EVT)?¶
Extreme Value Theory models the tail behavior of a distribution to identify rare, extreme events.
It’s often used to detect outliers that lie far in the distribution's tail.
⚙️ How It Works¶
- Focuses on modeling the maximum (or minimum) values in data
- Fits a distribution (e.g., Generalized Pareto) to the tail
- Outliers are points with very low probability under this tail distribution
🕵️♂️ When to Use¶
- In fields where extreme risks or rare events are important (e.g., finance, insurance, climate)
- When the tail of the distribution carries meaningful signal
✅ Pros¶
- Tail-focused — captures true "extremes"
- Statistically principled for outlier modeling
⚠️ Limitations¶
- Requires sufficient tail data to model
- Only effective for univariate distributions
- Requires careful threshold selection for tail modeling
from scipy.stats import genpareto

def detect_outliers_evt(df, column, threshold_quantile=0.95):
"""
Detect outliers using Extreme Value Theory (GPD tail modeling).
Parameters:
df (pd.DataFrame): Input DataFrame.
column (str): Column to apply EVT on.
threshold_quantile (float): Quantile to define tail threshold.
Returns:
pd.Series: Boolean mask where True indicates an outlier.
"""
if column not in df.columns or not np.issubdtype(df[column].dtype, np.number):
raise ValueError(f"Column '{column}' is either missing or not numeric.")
x = df[column].dropna()
threshold = x.quantile(threshold_quantile)
tail_excess = x[x > threshold] - threshold
if len(tail_excess) < 5:
print(f"\033[93m⚠️ [{column}] - Not enough data in tail for EVT modeling.\033[0m")
return pd.Series([False] * len(x), index=x.index)
# Fit GPD to the excess over threshold
c, loc, scale = genpareto.fit(tail_excess)
probs = genpareto.sf(tail_excess, c, loc=loc, scale=scale)
# Build full-length boolean mask
mask = pd.Series(False, index=x.index)
mask.loc[tail_excess.index] = probs < 0.01
if mask.sum() > 0:
print(f"\033[91m⚠️ [{column}] - Detected {mask.sum()} outliers using EVT (quantile = {threshold_quantile})\033[0m")
else:
print(f"\033[92m✅ [{column}] - No outliers detected using EVT (quantile = {threshold_quantile})\033[0m")
return mask
# Retry EVT detection
evt_results = {}
for col in df.select_dtypes(include='number').columns:
evt_results[col] = detect_outliers_evt(df, col)
✅ [feature_1] - No outliers detected using EVT (quantile = 0.95)
✅ [feature_2] - No outliers detected using EVT (quantile = 0.95)
✅ [feature_3] - No outliers detected using EVT (quantile = 0.95)
🧠 Bayesian Methods¶
📖 Click to Expand
🧠 What Are Bayesian Methods for Outlier Detection?¶
Bayesian methods model uncertainty in the data and incorporate prior beliefs to estimate the posterior probability of a point being an outlier.
⚙️ How It Works¶
- Define a generative model with priors over parameters
- Use observed data to compute posterior distributions
- Outliers are identified as points with low posterior probability under the model
🕵️♂️ When to Use¶
- When uncertainty modeling is critical
- For probabilistic anomaly detection in structured or time-series data
- When expert knowledge can inform priors
✅ Pros¶
- Explicitly models uncertainty
- Flexible and extensible to many domains
- Can incorporate prior knowledge
⚠️ Limitations¶
- Computationally intensive (requires sampling or variational inference)
- Requires strong modeling assumptions
- Interpretation may be non-trivial for high-dimensional data
from sklearn.linear_model import BayesianRidge
def detect_outliers_bayesian_residuals(df, column, features=None, threshold=3.0):
"""
Detect outliers based on residuals from a Bayesian Ridge regression model.
Parameters:
df (pd.DataFrame): Input DataFrame.
column (str): Target column to model and test for outliers.
features (list): Feature columns to use for modeling (default: all others).
threshold (float): Z-score threshold on residuals.
Returns:
pd.Series: Boolean mask where True indicates an outlier.
"""
if column not in df.columns or not np.issubdtype(df[column].dtype, np.number):
raise ValueError(f"Target column '{column}' is either missing or not numeric.")
features = features or [col for col in df.select_dtypes(include='number').columns if col != column]
data = df[[column] + features].dropna()
X = data[features]
y = data[column]
model = BayesianRidge()
model.fit(X, y)
preds = model.predict(X)
residuals = y - preds
z_resid = zscore(residuals)
mask = pd.Series(np.abs(z_resid) > threshold, index=data.index)
if mask.sum() > 0:
print(f"\033[91m⚠️ [{column}] - Detected {mask.sum()} residual outliers using Bayesian Regression (threshold = {threshold})\033[0m")
else:
print(f"\033[92m✅ [{column}] - No residual outliers detected using Bayesian Regression (threshold = {threshold})\033[0m")
return mask
# Run Bayesian residual outlier detection per column using others as predictors
bayes_results = {}
numeric_cols = df.select_dtypes(include='number').columns.tolist()
for target_col in numeric_cols:
bayes_results[target_col] = detect_outliers_bayesian_residuals(df, target_col)
⚠️ [feature_1] - Detected 2 residual outliers using Bayesian Regression (threshold = 3.0)
⚠️ [feature_2] - Detected 1 residual outliers using Bayesian Regression (threshold = 3.0)
⚠️ [feature_3] - Detected 2 residual outliers using Bayesian Regression (threshold = 3.0)
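BayesianRidge also returns a posterior predictive standard deviation, so residuals can be standardized per point rather than with a global z-score, which ties back to the uncertainty-modeling point above. A small sketch, assuming feature_1 as the target and the other two columns as predictors:

features = ['feature_2', 'feature_3']                       # illustrative choice of predictors
X, y = df[features], df['feature_1']
model = BayesianRidge().fit(X, y)
pred_mean, pred_std = model.predict(X, return_std=True)     # posterior predictive mean and std per row
standardized_resid = (y - pred_mean) / pred_std
print((np.abs(standardized_resid) > 3).sum(), "points exceed 3 predictive standard deviations")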
🛠️ Outlier Treatment Strategies¶
❌ Deletion¶
📖 Click to Expand
❌ What is Deletion?¶
Deletion involves simply removing identified outliers from the dataset before training or analysis.
⚙️ How It Works¶
- Detect outliers using a chosen method (e.g., IQR, Z-Score)
- Drop those rows from the dataset using filtering or indexing
🕵️♂️ When to Use¶
- When dataset is large and removing a few rows won’t affect learning
- If outliers are clearly data entry errors or irrelevant edge cases
✅ Pros¶
- Simple and fast to apply
- Removes potential noise from modeling
⚠️ Limitations¶
- Risks losing valuable signal, especially in small datasets
- Can bias model if outliers carry meaningful variation
- Not reversible — original data is discarded
def treat_outliers_deletion(df, mask_dict):
"""
Remove rows from the DataFrame where any of the provided masks are True.
Parameters:
df (pd.DataFrame): Original DataFrame.
mask_dict (dict): Dictionary of {column: boolean mask Series}.
Returns:
pd.DataFrame: Cleaned DataFrame with outliers removed.
"""
combined_mask = pd.Series(False, index=df.index)
for col, mask in mask_dict.items():
combined_mask = combined_mask | mask
treated_count = combined_mask.sum()
cleaned_df = df[~combined_mask]
print(f"\033[94m🔧 Deleted {treated_count} rows containing outliers across any specified column\033[0m")
return cleaned_df
# Example: remove all rows detected as outliers by IQR
df_deleted = treat_outliers_deletion(df, iqr_results)
🔧 Deleted 9 rows containing outliers across any specified column
🔁 Capping / Winsorizing¶
📖 Click to Expand
🔁 What is Capping / Winsorizing?¶
Capping (also known as Winsorizing) involves limiting extreme values by replacing them with specified percentile thresholds, rather than removing them.
⚙️ How It Works¶
- Identify upper and lower percentile cutoffs (e.g., 1st and 99th percentiles)
- Replace values above the upper bound with the upper threshold
- Replace values below the lower bound with the lower threshold
🕵️♂️ When to Use¶
- When outliers are legitimate but too influential
- In models sensitive to scale (e.g., linear regression)
✅ Pros¶
- Preserves dataset size and structure
- Reduces the influence of extreme values without deletion
⚠️ Limitations¶
- Thresholds are arbitrary — may require tuning
- Can distort distribution if used aggressively
- Doesn’t address multivariate outliers
def treat_outliers_capping(df, columns, lower_quantile=0.01, upper_quantile=0.99):
"""
Cap outliers in specified columns based on percentile thresholds.
Parameters:
df (pd.DataFrame): Input DataFrame.
columns (list): Columns to apply capping.
lower_quantile (float): Lower bound percentile.
upper_quantile (float): Upper bound percentile.
Returns:
pd.DataFrame: DataFrame with capped values.
"""
df_capped = df.copy()
total_treated = 0
for col in columns:
if not np.issubdtype(df[col].dtype, np.number):
continue
lower = df[col].quantile(lower_quantile)
upper = df[col].quantile(upper_quantile)
before = df[col]
capped = before.clip(lower, upper)
treated = (before != capped).sum()
total_treated += treated
df_capped[col] = capped
print(f"\033[94m🔧 Capped {total_treated} outlier values using {int(lower_quantile*100)}–{int(upper_quantile*100)} percentile thresholds\033[0m")
return df_capped
# Example: apply capping to all numeric columns
df_capped = treat_outliers_capping(df, df.select_dtypes(include='number').columns.tolist())
🔧 Capped 6 outlier values using 1–99 percentile thresholds
🧮 Imputation¶
📖 Click to Expand
🧮 What is Imputation for Outliers?¶
Imputation replaces outlier values with a more reasonable estimate (e.g., mean, median, or model prediction), treating them similarly to missing data.
⚙️ How It Works¶
- Identify outliers using a chosen method
- Replace them with:
- Central tendency (mean, median)
- Value from a predictive model (regression, KNN, etc.)
- Group-specific statistics (e.g., median by segment)
🕵️♂️ When to Use¶
- When outliers are suspected to be corrupted or extreme noise
- When data integrity is important and deletion isn't an option
- Especially useful for time series, healthcare, or small datasets
✅ Pros¶
- Retains dataset size and row context
- Can preserve statistical properties when done carefully
⚠️ Limitations¶
- Imputed values may hide uncertainty
- Risk of biasing the dataset if imputation method is naive
- Not suitable when outliers are meaningful or intentional signals
def treat_outliers_imputation(df, mask_dict, strategy="median"):
"""
Impute outliers in specified columns using central tendency.
Parameters:
df (pd.DataFrame): Input DataFrame.
mask_dict (dict): Dictionary of {column: boolean mask Series}.
strategy (str): Imputation method: 'mean' or 'median'.
Returns:
pd.DataFrame: DataFrame with outliers imputed.
"""
df_imputed = df.copy()
total_imputed = 0
for col, mask in mask_dict.items():
if not np.issubdtype(df[col].dtype, np.number):
continue
if strategy == "mean":
value = df[col].mean()
elif strategy == "median":
value = df[col].median()
else:
raise ValueError("Strategy must be 'mean' or 'median'.")
df_imputed.loc[mask, col] = value
total_imputed += mask.sum()
print(f"\033[94m🔧 Imputed {total_imputed} outliers using {strategy} strategy\033[0m")
return df_imputed
# Example: Impute IQR-detected outliers using median
df_imputed = treat_outliers_imputation(df, iqr_results, strategy="median")
🔧 Imputed 9 outliers using median strategy
📊 Binning¶
📖 Click to Expand
📊 What is Binning?¶
Binning transforms continuous variables into discrete categories (bins), which can help smooth out the influence of outliers.
⚙️ How It Works¶
- Define bin edges manually or using quantiles (equal-width or equal-frequency)
- Replace raw values with corresponding bin labels or codes
- Outliers naturally fall into edge bins, limiting their impact
🕵️♂️ When to Use¶
- When interpretability is more important than precision
- As a preprocessing step for tree-based models or rule-based systems
- To mitigate extreme values in skewed distributions
✅ Pros¶
- Reduces influence of outliers
- Can simplify feature relationships
- Useful for feature engineering
⚠️ Limitations¶
- Can lead to information loss
- Bin choice is arbitrary — poor binning can hurt performance
- Not suitable for models that rely on continuous features (e.g., linear regression)
def treat_outliers_binning(df, columns, bins=5, strategy="quantile"):
"""
Discretize numeric columns into bins to reduce outlier impact.
Parameters:
df (pd.DataFrame): Input DataFrame.
columns (list): List of numeric columns to bin.
bins (int): Number of bins.
strategy (str): 'quantile' or 'uniform'.
Returns:
pd.DataFrame: DataFrame with binned columns.
"""
df_binned = df.copy()
total_binned = 0
for col in columns:
if not np.issubdtype(df[col].dtype, np.number):
continue
if strategy == "quantile":
df_binned[col], bin_edges = pd.qcut(df[col], q=bins, labels=False, retbins=True, duplicates='drop')
elif strategy == "uniform":
df_binned[col], bin_edges = pd.cut(df[col], bins=bins, labels=False, retbins=True)
else:
raise ValueError("Strategy must be 'quantile' or 'uniform'.")
total_binned += df[col].notna().sum()
print(f"\033[94m🔧 Binned {total_binned} values across {len(columns)} column(s) using '{strategy}' strategy with {bins} bins\033[0m")
return df_binned
# Example: Apply quantile-based binning to all numeric columns
df_binned = treat_outliers_binning(df, df.select_dtypes(include='number').columns.tolist(), bins=4, strategy="quantile")
🔧 Binned 300 values across 3 column(s) using 'quantile' strategy with 4 bins
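As a closing check, comparing summary statistics before and after treatment makes the effect of each strategy visible; a minimal sketch using the frames produced above:

summary = pd.concat({
    'original': df.describe().loc[['mean', 'std', 'min', 'max']],
    'deleted': df_deleted.describe().loc[['mean', 'std', 'min', 'max']],
    'capped': df_capped.describe().loc[['mean', 'std', 'min', 'max']],
    'imputed': df_imputed.describe().loc[['mean', 'std', 'min', 'max']],
}, axis=0)
print(summary)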