
📖 Classification¶

🧭 Problem Statement

  • 📌 What is Classification?

📂 Data Setup

  • 📥 Load Dataset
  • 📊 Data Characteristics Dictionary
  • 🔎 EDA
  • 🛠️ Feature Engineering
  • 🧹 Preprocessing

🧪 Baseline Classifier Model

  • 📊 Model Evaluation
  • 📉 Confusion Matrix
  • 📈 ROC Curve / AUC
  • 🧮 Update Best Model Info

🔍 Algorithms

  • 📊 Logistic Regression
  • 🧮 Naive Bayes
  • 🌳 Decision Tree
  • 🌲 Random Forest
  • 🎯 KNN (K-Nearest Neighbors)
  • 📈 SVM (Support Vector Machines)
  • 🚀 XGBoost
  • 🧠 Neural Network

๐Ÿ“Š Model Selection

  • 🧠 Recommend Models
  • 📈 Model Comparison
  • 📊 Feature Importance
  • 🧬 SHAP Values

🛠️ Fine-Tune

  • 🧪 Feature Selection – RFE
  • 🧪 Feature Selection – RFE + SHAP
  • 🔎 Grid Search
  • 🎲 Randomized Search

๐Ÿ”€ Ensemble Methods

  • 🗳️ Voting Classifier
  • 🧬 Stacking Classifier
  • 🪵 Bagging
  • 🚀 Boosting

๐Ÿ“ฆ Export & Deployment

  • 🧊 Pickling
  • 📊 Monitoring Hooks

🧭 Problem Statement¶

📌 What is Classification?

📖 Click to Expand

Classification is a type of supervised machine learning where the goal is to predict a categorical label for an observation. Given a set of features (input data), the model tries to assign the observation to one of several predefined classes.

Common examples of classification problems include:

  • Spam detection: Classifying emails as spam or not.
  • Customer churn prediction: Classifying customers as likely to leave (churn) or stay based on their activity.
  • Image recognition: Classifying images into categories, like identifying animals, vehicles, etc.

In classification, the output is discrete (e.g., 'spam' vs 'not spam', 'churn' vs 'no churn'). This contrasts with regression, where the output is continuous (e.g., predicting a house price).

Key Points
  • Supervised learning approach.
  • Used for predicting categories.
  • Output is discrete (binary or multiclass).
  • Examples: email classification, disease diagnosis, fraud detection.

Back to the top


📂 Data Setup¶

📥 Load Dataset¶

📖 Click to Expand

In this section, we will begin by preparing the dataset. For simplicity, we'll use a simulated classification dataset generated using the make_classification function from sklearn. This allows us to create a synthetic dataset that is suitable for practicing classification tasks.

We will simulate a dataset with the following properties:

  • 1000 samples (observations)
  • 10 features (predictors)
  • 2 informative features (ones that help in prediction)
  • 2 classes (binary classification problem)

Let's generate and take a look at the data.

In [27]:
# Data handling and manipulation
import pandas as pd
import numpy as np

# Machine Learning and Model Evaluation
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, TimeSeriesSplit, KFold
from sklearn.decomposition import PCA
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Statistical and Other Utilities
from scipy.stats import zscore
from termcolor import colored

# Visualization
import matplotlib.pyplot as plt
In [28]:
from sklearn.datasets import make_classification
import numpy as np
import pandas as pd

# Simulate base classification dataset
X, y = make_classification(
    n_samples=1000,
    n_features=10,
    n_informative=2,
    n_redundant=2,
    n_repeated=0,
    n_classes=2,
    weights=[0.7, 0.3],  # simulate class imbalance
    flip_y=0.01,         # 1% label noise
    class_sep=0.8,       # less separation = harder task
    random_state=42
)

# Create DataFrame
df = pd.DataFrame(X, columns=[f"Feature_{i}" for i in range(1, 11)])
target_col = "Target" 
df[target_col] = y

# Inject missing values randomly (e.g., 1% of cells)
# mask = np.random.rand(*df.shape) < 0.01
# df[mask] = np.nan

# Display preview
df.head()
Out[28]:
Feature_1 Feature_2 Feature_3 Feature_4 Feature_5 Feature_6 Feature_7 Feature_8 Feature_9 Feature_10 Target
0 0.959085 -0.066449 0.918572 -0.358079 0.997266 1.181890 -1.415679 -1.210161 -0.828077 1.227274 0
1 -0.910796 -0.566395 -0.940419 0.831617 -1.176962 1.820544 1.552375 -0.984534 0.563896 0.209470 1
2 -0.103769 -0.432774 -0.389454 0.793818 -0.268646 -1.836360 1.039086 -0.246383 -0.858145 -0.297376 1
3 1.580930 2.023606 1.542262 0.006800 -1.607661 0.184741 -2.419427 -0.357445 -1.273127 -0.190039 0
4 -0.006898 -0.711303 0.139918 0.117124 1.536061 0.597538 -0.437329 -0.939156 0.484698 0.236224 0

📊 Data Characteristics Dictionary¶

📖 Click to Expand

This section initializes the data characteristics dictionary, which will store various metadata about the dataset, including details about the target variable, features, data size, and linear separability.

The dictionary contains the following key sections:

  1. 🎯 Target Variable:
    • Type: Specifies whether the target variable is binary or multiclass.
    • Imbalance: Indicates whether the target variable has class imbalance.
    • Class Imbalance Severity: Specifies the severity of the imbalance (e.g., high, low).
  2. 🔧 Features:
    • Type: Describes the type of features in the dataset (e.g., categorical, continuous, or mixed).
    • Correlation: Indicates the correlation between features (e.g., low, medium, high).
    • Outliers: Flag to indicate whether outliers are detected in the features.
    • Missing Data: Tracks the percentage of missing data or flags missing values.
  3. 📈 Data Size:
    • Size: Contains the number of samples (rows) and number of features (columns).
  4. 🔍 Linear Separability:
    • Linear Separability: States whether the classes are linearly separable (True or False).

This dictionary will be updated dynamically as we analyze the dataset in subsequent steps. It serves as a summary of key dataset properties to help guide further analysis and modeling decisions.

In [29]:
# Initialize the data characteristics dictionary
data_characteristics = {
    "target_variable": {
        "type": None,  # "binary", "multiclass"
        "imbalance": None,  # True if imbalanced, False otherwise
        "class_imbalance_severity": None  # e.g., "high", "low"
    },
    "features": {
        "type": None,  # "categorical", "continuous", "mixed"
        "correlation": None,  # "low", "medium", "high"
        "outliers": None,  # True if outliers detected, False otherwise
        "missing_data": None  # Percentage of missing data or boolean
    },
    "data_size": None,  # Size of dataset (samples, features)
    "linear_separability": None  # True if classes are linearly separable
}

🔎 EDA¶

In [30]:
import numpy as np
import pandas as pd
from scipy.stats import pearsonr

# If needed, convert X and y to DataFrame and Series
if isinstance(X, np.ndarray):
    X_df = pd.DataFrame(X, columns=[f"Feature_{i}" for i in range(X.shape[1])])
else:
    X_df = X

if isinstance(y, np.ndarray):
    y_series = pd.Series(y, name="Target")
else:
    y_series = y

# Target-related
target_type = "binary" if y_series.nunique() == 2 else "multiclass"
imbalance_ratio = y_series.value_counts(normalize=True).min()
imbalance_flag = imbalance_ratio < 0.4
imbalance_severity = "high" if imbalance_ratio < 0.2 else "low" if imbalance_ratio < 0.4 else "balanced"

# Feature-related
num_cols = X_df.select_dtypes(include=["number"]).shape[1]
cat_cols = X_df.select_dtypes(exclude=["number"]).shape[1]
feature_type = "continuous" if cat_cols == 0 else "categorical" if num_cols == 0 else "mixed"

missing_pct = X_df.isna().mean().mean()
outlier_flag = any(X_df.apply(lambda col: (col > col.mean() + 3 * col.std()) | (col < col.mean() - 3 * col.std())).sum() > 0)

# Correlation level — only if continuous
if feature_type == "continuous":
    corr_matrix = X_df.corr().abs()
    upper_tri = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
    avg_corr = upper_tri.stack().mean()
    corr_level = "high" if avg_corr > 0.7 else "medium" if avg_corr > 0.3 else "low"
else:
    corr_level = "N/A"

# Final update
data_characteristics.update({
    "target_variable": {
        "type": target_type,
        "imbalance": imbalance_flag,
        "class_imbalance_severity": imbalance_severity
    },
    "features": {
        "type": feature_type,
        "correlation": corr_level,
        "outliers": outlier_flag,
        "missing_data": f"{missing_pct:.2%}"
    },
    "data_size": X_df.shape,
    "linear_separability": None
})
In [31]:
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

# Infer positive class
positive_class = y_series.unique()[1] if len(y_series.unique()) == 2 else 1

# PCA to 2D
pca = PCA(n_components=2, random_state=42)
X_pca = pca.fit_transform(X_df)

# Train/test split
X_pca_train, X_pca_test, y_pca_train, y_pca_test = train_test_split(
    X_pca, y_series, test_size=0.2, random_state=42, stratify=y_series
)

# Fit linear model
clf = LogisticRegression()
clf.fit(X_pca_train, y_pca_train)
y_pred_pca = clf.predict(X_pca_test)

# F1-based separability score
f1_pca = f1_score(y_pca_test, y_pred_pca, pos_label=positive_class, zero_division=0)
data_characteristics["linear_separability"] = f1_pca > 0.75  # Adjustable threshold

# Optional print
print(f"โœ… Linear separability (2D PCA, Logistic F1): {f1_pca:.2f}")
print(f"โ†ช Updated: linear_separability = {data_characteristics['linear_separability']}")
if f1_pca > 0.85:
    interpretation = "Strong linear separability in 2D โ€” linear models likely to perform well."
elif f1_pca > 0.7:
    interpretation = "Moderate linear separability โ€” linear models may work with tuning."
else:
    interpretation = "Poor linear separability โ€” expect better results with non-linear models."

print(f"๐Ÿ“Œ Interpretation: {interpretation}")
โœ… Linear separability (2D PCA, Logistic F1): 0.74
โ†ช Updated: linear_separability = False
๐Ÿ“Œ Interpretation: Moderate linear separability โ€” linear models may work with tuning.
In [32]:
from pprint import pprint
pprint(data_characteristics)
{'data_size': (1000, 10),
 'features': {'correlation': 'low',
              'missing_data': '0.00%',
              'outliers': True,
              'type': 'continuous'},
 'linear_separability': False,
 'target_variable': {'class_imbalance_severity': 'low',
                     'imbalance': True,
                     'type': 'binary'}}

🛠️ Feature Engineering¶

  • Omitted here

🧹 Preprocessing¶

In [33]:
from sklearn.model_selection import train_test_split

# Define features and target
X = df.drop(columns=target_col)
y = df[target_col]

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print("โœ… Data split complete:")
print(f"Train size: {len(X_train)}, Test size: {len(X_test)}")
โœ… Data split complete:
Train size: 800, Test size: 200

Back to the top


🧪 Baseline Classifier Model¶

📖 Click to Expand

In this section, we define the baseline model for the classification task. The baseline model is typically a dummy model that more sophisticated models can be compared against. Here, we use the DummyClassifier, which predicts the majority class, to set a baseline performance.

The baseline model will help us assess whether more advanced models (e.g., Random Forest, SVM) are making meaningful improvements over a simple strategy.

🧠 Why Track best_model_info?

In real-world pipelines, it's critical to:

  • Compare models not just by accuracy, but by a full suite of metrics.
  • Store the actual model object, hyperparameters, and diagnostics in one place.
  • Ensure only the best-performing model (based on a chosen metric like F1 or AUC) is promoted forward.
In [34]:
# Initialize central tracker dictionary to record best-model details across iterations
best_model_info = {
    "name": None,
    "model": None,
    "metrics": {
        "train": {
            "accuracy": -np.inf,
            "precision": -np.inf,
            "recall": -np.inf,
            "f1": -np.inf,
            "roc_auc": -np.inf
            # Note: confusion_matrix and classification_report omitted for train
            # because they're redundant and cluttered for internal training fit
        },
        "test": {
            "accuracy": -np.inf,
            "precision": -np.inf,
            "recall": -np.inf,
            "f1": -np.inf,
            "roc_auc": -np.inf,
            "confusion_matrix": None,
            "classification_report": None
        }
    },
    "hyperparameters": None
}

# Dictionary to store all model performance results for comparison
model_results = {}
In [35]:
# Metric to decide which model is "best"
# Common choices (ranked by practical usage):
# 1. "f1"        โ†’ balanced precision/recall (default choice, esp. with class imbalance)
# 2. "roc_auc"   โ†’ good for imbalanced classes, uses probability scores
# 3. "accuracy"  โ†’ only when classes are balanced and all errors are equal
# 4. "precision" โ†’ when false positives are costly (e.g., spam detection)
# 5. "recall"    โ†’ when false negatives are costly (e.g., fraud, cancer)

# Success metric used to select the best model
success_metric = "f1"  # or "roc_auc", depending on use case
# success_split = "test"  # "train" or "test"
In [36]:
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report

# Fit a dummy classifier as a baseline
dummy_clf = DummyClassifier(strategy="most_frequent")  # or try "stratified", "uniform"
dummy_clf.fit(X_train, y_train)
Out[36]:
DummyClassifier(strategy='most_frequent')
In [37]:
# Predict on both train and test
y_train_pred = dummy_clf.predict(X_train)
y_test_pred = dummy_clf.predict(X_test)

📊 Model Evaluation¶

📖 Click to Expand
  • Accuracy: Overall correctness. Misleading when classes are imbalanced.
  • Precision: Of predicted positives, how many are truly positive? Important when false positives are costly.
  • Recall: Of actual positives, how many did we catch? Crucial when missing positives is expensive.
  • F1 Score: Harmonic mean of precision and recall. Useful when you care about balance.
  • ROC AUC: Probability a random positive ranks above a random negative. Good for probability-based classifiers.
📖 Click to Expand

Precision, Recall, and F1 Score are classification metrics that help us understand model performance beyond just accuracy:

  • Precision: Of all predicted positives, how many were actually correct? (Low precision = many false alarms)
  • Recall: Of all actual positives, how many did we catch? (Low recall = missed positives)
  • F1 Score: Harmonic mean of precision and recall — useful when classes are imbalanced.

Business Perspective:

  • If false positives are costly (e.g., spam filters, fraud flags), precision matters more.
  • If missing positives is risky (e.g., cancer detection), recall is critical.
  • F1 balances both and gives a single, interpretable metric.

These metrics are vital when accuracy is misleading — especially in skewed datasets.

In [38]:
from sklearn.metrics import classification_report, accuracy_score, precision_score, recall_score, f1_score

# Technical output
print("๐Ÿ“‰ Classification Report\n")
print(classification_report(y_test, y_test_pred))
๐Ÿ“‰ Classification Report

              precision    recall  f1-score   support

           0       0.70      1.00      0.82       140
           1       0.00      0.00      0.00        60

    accuracy                           0.70       200
   macro avg       0.35      0.50      0.41       200
weighted avg       0.49      0.70      0.58       200

/Users/ashrithreddy/anaconda3/lib/python3.11/site-packages/sklearn/metrics/_classification.py:1344: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))
/Users/ashrithreddy/anaconda3/lib/python3.11/site-packages/sklearn/metrics/_classification.py:1344: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))
/Users/ashrithreddy/anaconda3/lib/python3.11/site-packages/sklearn/metrics/_classification.py:1344: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))
In [39]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Determine positive class once
positive_class = y_train.unique()[1] if len(y_train.unique()) == 2 else 1

def evaluate_model(y_true, y_pred, label="Model"):
    acc  = accuracy_score(y_true, y_pred)
    prec = precision_score(y_true, y_pred, pos_label=positive_class, average='binary', zero_division=0)
    rec  = recall_score(y_true, y_pred, pos_label=positive_class, average='binary', zero_division=0)
    f1   = f1_score(y_true, y_pred, pos_label=positive_class, average='binary', zero_division=0)

    # Aligned core metrics
    print(f"\n๐Ÿ“Š {label} โ€” Performance Summary:")
    print(f"- Accuracy  : {acc :>7.2%} โ†’ Overall correctness.")
    print(f"- Precision : {prec:>7.2%} โ†’ Of predicted '{positive_class}', how many were right.")
    print(f"- Recall    : {rec :>7.2%} โ†’ Of actual '{positive_class}', how many we caught.")
    print(f"- F1 Score  : {f1  :>7.2%} โ†’ Balance of precision & recall.")

    # Business interpretation
    print("\n๐Ÿ“Œ Interpretation:")
    if prec < 0.6:
        print("- High false positives โ†’ risky if false alarms are costly.")
    else:
        print("- Precision looks acceptable; false positives under control.")

    if rec < 0.6:
        print("- High false negatives โ†’ risky if missing positives is costly.")
    else:
        print("- Recall is strong; model is catching true cases well.")

    print(f"- F1 Score shows overall tradeoff quality: {f1:.2f}")

# Example usage
evaluate_model(y_test, y_test_pred, label="Baseline Classifier")
๐Ÿ“Š Baseline Classifier โ€” Performance Summary:
- Accuracy  :  70.00% โ†’ Overall correctness.
- Precision :   0.00% โ†’ Of predicted '1', how many were right.
- Recall    :   0.00% โ†’ Of actual '1', how many we caught.
- F1 Score  :   0.00% โ†’ Balance of precision & recall.

๐Ÿ“Œ Interpretation:
- High false positives โ†’ risky if false alarms are costly.
- High false negatives โ†’ risky if missing positives is costly.
- F1 Score shows overall tradeoff quality: 0.00

📉 Confusion Matrix¶

📖 Click to Expand

The confusion matrix is an N×N table that helps us visualize the performance of a classification model.

📖 Confusion Matrix Terminology:

  • True Positive (TP): Predicted = Positive, Actual = Positive
  • False Positive (FP): Predicted = Positive, Actual = Negative
  • True Negative (TN): Predicted = Negative, Actual = Negative
  • False Negative (FN): Predicted = Negative, Actual = Positive
                  Predicted
                ┌─────────┬─────┐
                │    0    │  1  │
         ┌──────┼─────────┼─────┤
Actual   │  0   │   TN    │  FP │  ← Specificity = TN / (TN + FP) = True Negative Rate (TNR)
         │  1   │   FN    │  TP │  ← Recall = TP / (TP + FN) = Sensitivity, TPR, Hit Rate
         └──────┴─────────┴─────┘
                              ↑
                              └─ Precision = TP / (TP + FP) = Positive Predictive Value
In [40]:
import numpy as np
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns

def plot_confusion(y_true, y_pred, model_name="Model"):
    """
    Plot a confusion matrix with count and percentage annotations.
    Warns if y_pred contains unseen labels not present in y_true.
    """
    # Robust label set
    labels = np.unique(np.concatenate([y_true, y_pred]))
    
    # Check for potential leakage or mismatch
    unseen_preds = set(y_pred) - set(y_true)
    if unseen_preds:
        print(f"\033[91mโš ๏ธ Warning: y_pred contains unseen class labels: {unseen_preds} โ€” "
              f"this may indicate leakage or label mismatch.\033[0m")

    # Compute confusion matrix and percentages
    cm = confusion_matrix(y_true, y_pred, labels=labels)
    cm_sum = np.sum(cm)
    cm_perc = cm / cm_sum * 100

    # Annotate with count and %
    annot = np.empty_like(cm).astype(str)
    for i in range(cm.shape[0]):
        for j in range(cm.shape[1]):
            c = cm[i, j]
            p = cm_perc[i, j]
            annot[i, j] = f"{c}\n({p:.1f}%)"

    # Plot
    plt.figure(figsize=(3, 2))
    sns.heatmap(cm, annot=annot, fmt="", cmap="Blues", cbar=True,
                xticklabels=labels, yticklabels=labels)
    plt.xlabel("Predicted label")
    plt.ylabel("True label")
    plt.title(f"Confusion Matrix ({model_name})")
    plt.tight_layout()
    plt.show()

plot_confusion(y_test, y_test_pred, model_name="Baseline Classifier")

📈 ROC Curve / AUC¶

📖 Click to Expand

ROC Curve (Receiver Operating Characteristic) plots the True Positive Rate (TPR) vs False Positive Rate (FPR) across different threshold values.

  • A model that randomly guesses would fall along the diagonal (AUC = 0.5)
  • A perfect model hugs the top-left corner (AUC = 1.0)

AUC (Area Under the Curve) quantifies overall separability between the two classes:

  • Technical Insight: Higher AUC means better discrimination between positive and negative cases.
  • Business Relevance: Especially useful when false positives and false negatives have different costs — like fraud detection, churn prediction, etc.

This plot lets stakeholders quickly gauge how good the model is — regardless of classification threshold.

In [41]:
from sklearn.metrics import roc_curve, roc_auc_score
import matplotlib.pyplot as plt

def plot_roc_auc(model, X_test, y_test, model_name="Model"):
    """
    Plot ROC curve, print AUC score, and give business-facing interpretation.
    """
    if hasattr(model, "predict_proba"):
        y_scores = model.predict_proba(X_test)[:, 1]
    elif hasattr(model, "decision_function"):
        y_scores = model.decision_function(X_test)
    else:
        raise ValueError("Model does not support probability estimates or decision function.")
    
    fpr, tpr, _ = roc_curve(y_test, y_scores)
    auc_score = roc_auc_score(y_test, y_scores)
    
    # Plot
    plt.figure(figsize=(6, 4))
    plt.plot(fpr, tpr, label=f"AUC = {auc_score:.2f}")
    plt.plot([0, 1], [0, 1], "k--", label="Random Guess")
    plt.xlabel("False Positive Rate (1 - Specificity)")
    plt.ylabel("True Positive Rate (Recall / Sensitivity)")
    plt.title(f"ROC Curve ({model_name})")
    plt.legend()
    plt.tight_layout()
    plt.show()
    
    # Output
    print(f"๐Ÿ”น ROC AUC Score for {model_name}: {auc_score:.4f}")
    if auc_score <= 0.55:
        print("๐Ÿ“Œ Interpretation: Model performs at or near random. It cannot meaningfully separate classes.")
    elif auc_score < 0.7:
        print("๐Ÿ“Œ Interpretation: Some separability, but not reliable yet. Needs improvement.")
    else:
        print("๐Ÿ“Œ Interpretation: Model is doing a good job distinguishing between classes.")

plot_roc_auc(dummy_clf, X_test, y_test, model_name="Baseline Classifier")
๐Ÿ”น ROC AUC Score for Baseline Classifier: 0.5000
๐Ÿ“Œ Interpretation: Model performs at or near random. It cannot meaningfully separate classes.

🧮 Update Best Model Info¶

In [42]:
from termcolor import colored
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    roc_auc_score, confusion_matrix, classification_report
)

def update_best_model(model_name, model_obj, y_train, y_test, y_train_pred, y_test_pred, hyperparameters=None):
    """
    Computes metrics internally, updates best_model_info if model outperforms current best.
    Also logs all model results.
    """
    # Evaluate performance
    metrics = {
        "train": {
            "accuracy": accuracy_score(y_train, y_train_pred),
            "precision": precision_score(y_train, y_train_pred, pos_label=positive_class, zero_division=0),
            "recall": recall_score(y_train, y_train_pred, pos_label=positive_class, zero_division=0),
            "f1": f1_score(y_train, y_train_pred, pos_label=positive_class, zero_division=0),
            "roc_auc": roc_auc_score(y_train, model_obj.predict_proba(X_train)[:, 1])
        },
        "test": {
            "accuracy": accuracy_score(y_test, y_test_pred),
            "precision": precision_score(y_test, y_test_pred, pos_label=positive_class, zero_division=0),
            "recall": recall_score(y_test, y_test_pred, pos_label=positive_class, zero_division=0),
            "f1": f1_score(y_test, y_test_pred, pos_label=positive_class, zero_division=0),
            "roc_auc": roc_auc_score(y_test, model_obj.predict_proba(X_test)[:, 1]),
            "confusion_matrix": confusion_matrix(y_test, y_test_pred),
            "classification_report": classification_report(y_test, y_test_pred, output_dict=True)
        }
    }

    # Compare with current best
    current_score = metrics["test"][success_metric]
    best_score = best_model_info["metrics"]["test"].get(success_metric, -1)
    previous_best = best_model_info["name"] or "None"

    if current_score > best_score:
        best_model_info.update({
            "name": model_name,
            "model": model_obj,
            "metrics": metrics,
            "hyperparameters": hyperparameters or {}
        })
        print(colored(
            f"✅ {model_name} just beat previous best ({previous_best}) → "
            f"{success_metric}: {best_score:.4f} → {current_score:.4f}", "green"))
        # print(f"📊 Current Test Performance:")
        # for metric in ["accuracy", "precision", "recall", "f1", "roc_auc"]:
        #     val = metrics["test"][metric]
        #     print(f"- {metric.capitalize():<9}: {val:.4f}")

    # Log all model results
    model_results[model_name] = {
        "model": model_obj,
        "metrics": metrics,
        "hyperparameters": hyperparameters or {}
    }
In [43]:
update_best_model(
    model_name="DummyClassifier",
    model_obj=dummy_clf,
    y_train=y_train,
    y_test=y_test,
    y_train_pred=y_train_pred,
    y_test_pred=y_test_pred,
    hyperparameters={"strategy": "most_frequent"}
)
✅ DummyClassifier just beat previous best (None) → f1: -inf → 0.0000
/Users/ashrithreddy/anaconda3/lib/python3.11/site-packages/sklearn/metrics/_classification.py:1344: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))
/Users/ashrithreddy/anaconda3/lib/python3.11/site-packages/sklearn/metrics/_classification.py:1344: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))
/Users/ashrithreddy/anaconda3/lib/python3.11/site-packages/sklearn/metrics/_classification.py:1344: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))
In [44]:
# from pprint import pprint
# pprint(best_model_info)
# pprint(model_results)

# import json
# print(json.dumps(best_model_info, indent=2, default=str))
# print(json.dumps(model_results, indent=2, default=str))

Back to the top


🔍 Algorithms¶

📖 Click to Expand - Suitability Checklist (All Models)
Criterion | 📊 Logistic Regression | 🧮 Naive Bayes | 🌳 Decision Tree | 🌲 Random Forest | 🎯 KNN (K-Nearest Neighbors) | 📈 SVM (Support Vector Machines) | 🚀 XGBoost | 🧠 Neural Network
Interpretability | ✅ Excellent – coefficients are directly interpretable | ✅ Good – conditional probabilities are intuitive | ✅ Very good – rules and splits are easily visualized and explained | ❌ Low – individual trees interpretable, but ensemble is a black box | ⚠️ Moderate – intuitive idea, but no explicit model or coefficients | ⚠️ Moderate – support vectors and margins can be visualized in 2D, but overall less intuitive | ❌ Low – complex ensemble; partial plots or SHAP values needed for insight | ❌ Very low – acts as a black box unless aided by techniques like SHAP, LIME
Linearity Expectation | ⚠️ Yes – assumes linear relationship between features and log-odds | ⚠️ Assumes feature independence → not truly linear or interactive | ✅ No – naturally captures non-linear relationships | ✅ No – captures complex non-linear relationships | ✅ No – captures non-linear patterns based on local neighborhoods | ⚠️ Depends – linear SVM assumes linear separability; kernel SVM handles non-linear | ✅ No – naturally models complex non-linear relationships | ✅ No – inherently models non-linear and complex interactions
High dimensionality | ✅ Good – performs well with many features (with regularization) | ✅ Very good – handles high-dimensional sparse data well | ⚠️ Moderate – can overfit with too many features unless pruned | ✅ Good – handles many features well via feature bagging | ❌ Poor – suffers from the curse of dimensionality; distances become less meaningful | ✅ Very good – especially effective in high-dimensional spaces (e.g., text data) | ✅ Excellent – handles many features via regularization and feature importance | ✅ Excellent – scales well with many features and large data
Handling of multicollinearity | ❌ Needs treatment – regularization helps, but still sensitive | ❌ Poor – assumes feature independence; correlated features hurt performance | ✅ Handles it – but may create instability in splits | ✅ Handles it – less sensitive due to random feature selection per tree | ❌ Problematic – redundant features distort distance metrics | ⚠️ Can be sensitive – especially in linear SVM; use regularization | ✅ Handles well – trees split on the most useful among correlated features | ✅ Handles – internal weights adjust during training, but correlated inputs may slow convergence
Handling of categorical features | ❌ Needs preprocessing – requires one-hot or ordinal encoding | ⚠️ Needs preprocessing – requires encoding, but categorical NB variants exist | ⚠️ Partial – numerical encoding needed; some implementations support direct handling | ⚠️ Requires encoding – label or one-hot encoding typically needed | ❌ Not natively supported – requires careful encoding and distance handling | ❌ Not supported – requires one-hot or other encoding | ⚠️ Needs encoding – label encoding typically used; native support improving | ❌ Not native – requires one-hot encoding or embeddings
Handling of outliers | ❌ Sensitive – can distort coefficients significantly | ❌ Very sensitive – assumes Gaussian or other strict distributional forms | ✅ Robust – splits are based on thresholds, not sensitive to extreme values | ✅ Robust – not sensitive due to median-based splits and ensembling | ❌ Sensitive – local distance-based voting easily skewed by outliers | ❌ Sensitive – margin-based optimization gets distorted by outliers | ✅ Robust – trees are insensitive to extreme values | ❌ Sensitive – can destabilize training; often mitigated with preprocessing
Handling of missing values | ❌ Not supported – requires imputation | ❌ Not supported – requires imputation | ⚠️ Limited – some implementations handle missing splits, others need imputation | ⚠️ Some support – not native in all implementations; imputation often needed | ❌ Not supported – requires complete-case or imputation preprocessing | ❌ Not supported – must impute before training | ✅ Built-in – learns optimal path for missing values during tree construction | ❌ Not supported – must be imputed before training
Scaling of features needed | ⚠️ Yes – especially important when using regularization | ⚠️ Sometimes – required if Gaussian NB is used (assumes normal distribution) | ✅ Not needed – uses raw feature values for splitting | ✅ Not needed – tree splits are scale-invariant | ✅ Yes – essential, as distance calculations are affected by feature magnitudes | ✅ Yes – essential due to reliance on distance and dot products | ✅ Not needed – tree-based, scale-invariant | ✅ Yes – critical for stable and fast convergence (e.g., standardization or normalization)
Class Imbalance problem | ⚠️ Needs adjustment – use class_weight='balanced' or resampling | ⚠️ Needs adjustment – priors can be tuned or class weights applied manually | ❌ Poor – biased toward majority class unless adjusted with class_weight or sampling | ⚠️ Needs adjustment – use class_weight='balanced' or stratified sampling | ❌ Poor – biased toward majority class due to majority voting | ⚠️ Needs adjustment – use class_weight='balanced' or tune C and margins | ✅ Handled – use scale_pos_weight, custom loss, or sampling | ⚠️ Needs care – custom loss functions, class weights, or resampling required
Handling of sparseness in data | ✅ Works fine – especially with L1 regularization for feature selection | ✅ Excellent – especially performant in text classification or bag-of-words models | ⚠️ Depends – not ideal for extremely sparse datasets (e.g., text data) | ⚠️ Moderate – not ideal for extreme sparsity (e.g., NLP bag-of-words) | ❌ Weak – sparse vectors make distance metrics ineffective | ✅ Good – works well in high-dimensional sparse spaces (esp. linear SVM) | ✅ Excellent – designed to handle sparse matrices natively | ⚠️ Depends – not ideal unless using sparse-aware architectures or embedding layers
Accuracy | Moderate – often outperformed by tree-based models for complex, non-linear patterns. | Surprisingly strong baseline for some problems (e.g., NLP); weak when feature independence assumption breaks. | Prone to overfitting if unpruned; weak alone but powerful as base learners in ensembles. | ✅ Strong – robust out-of-the-box performance with low overfitting risk. | ⚠️ Highly data-dependent – can perform well with clean, balanced, low-dimensional data. | ✅ High – strong performance on well-separated data, especially with good kernel choice. | ✅ Top-tier – one of the most accurate out-of-the-box models for tabular data. | ✅ High – can outperform other models with enough data and tuning, especially on non-tabular data.
Training speed | ✅ Fast – very efficient even on large datasets. | ✅ Extremely fast – almost instantaneous to train. | ✅ Fast – quick to train on moderate-sized datasets. | ⚠️ Slower than single models – parallelizable but can be compute-heavy. | ✅ Fast training, ❌ Slow inference – lazy learner, evaluates at prediction time. | ❌ Slow – especially on large datasets or with complex kernels. | ⚠️ Slower – faster than many ensembles, but heavier than single models; GPU support helps. | ❌ Slow – resource-intensive; requires tuning and hardware for best performance.

📊 Logistic Regression¶

📖 Click to Expand
🔍 What is Logistic Regression?

Despite the name, Logistic Regression is used for classification — not regression.
It predicts the probability that an observation belongs to a certain class (e.g., 0 or 1).
Under the hood, it fits a weighted formula to the input features, applies a sigmoid function, and outputs a value between 0 and 1.

Example:
A model might say there's a 78% chance this customer will churn.
If that crosses a certain threshold (say, 50%), we classify it as "Yes."

✅ Pros vs ❌ Cons
Pros | Cons
Fast and efficient | Assumes linear relationship (log-odds)
Easy to interpret (feature weights) | Doesn't handle complex patterns well
Works well with small datasets | Sensitive to multicollinearity
Outputs probabilities | May underperform on nonlinear data
🧠 When to Use
  • You want a quick baseline with interpretable output
  • You care about probabilities, not just labels
  • Your data is fairly linearly separable
  • The number of features is small to medium
⚠️ Pitfalls & Hacks
  • Pitfall: If features are highly correlated (multicollinearity), the model may become unstable. Use regularization (e.g., L2 penalty).
  • Hack: For imbalanced datasets, adjust the threshold or use class_weight='balanced' to avoid bias toward the majority class.
  • Tip: Standardize features before training, especially if using regularization.
📖 Click to Expand - Inner Workings
🧮 Logistic Regression – Internal Workflow

Logistic Regression builds a model to predict probabilities using a sigmoid transformation over a linear combination of features.
Even though the dataset doesn't contain any column called z (the logit), the model constructs it using weights it learns through training.

📊 Toy Dataset
Row | Hours Studied (x) | Pass? (y)
1 | 1 | 0
2 | 2 | 0
3 | 3 | 0
4 | 4 | 1
5 | 5 | 1
6 | 6 | 1
🔍 Internal Steps (Training Loop)
  1. Initialize weights: Start with arbitrary w and b (e.g., 0)
  2. Compute logit: z = w * x + b
  3. Apply sigmoid: ŷ = 1 / (1 + exp(-z)) → gives predicted probability
  4. Compute log loss: L = - [ y log(ŷ) + (1 - y) log(1 - ŷ) ]
  5. Compute gradients: Derivatives of loss w.r.t. w and b
  6. Update weights: Adjust w and b using gradient descent
  7. Repeat: Loop steps 2–6 until convergence (loss stops improving)
✅ Convergence Goal

The model searches for the final coefficients (w*, b*) that minimize total log loss across all rows. This is the training objective.
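
The loop above can be written out directly in NumPy. The following is a minimal sketch of batch gradient descent on the toy dataset, purely for illustration; it is not what sklearn's LogisticRegression does internally (that uses optimized solvers such as lbfgs), and the learning rate and iteration count are arbitrary choices.

import numpy as np

# Toy dataset from the table above: hours studied (x) and pass/fail label (y)
x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1], dtype=float)

w, b = 0.0, 0.0   # Step 1: arbitrary starting weights
lr = 0.1          # learning rate (illustrative choice)

for _ in range(3000):
    z = w * x + b                                                      # Step 2: logit
    y_hat = 1 / (1 + np.exp(-z))                                       # Step 3: sigmoid → predicted probability
    loss = -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))   # Step 4: mean log loss
    grad_w = np.mean((y_hat - y) * x)                                  # Step 5: gradient w.r.t. w
    grad_b = np.mean(y_hat - y)                                        # Step 5: gradient w.r.t. b
    w -= lr * grad_w                                                   # Step 6: gradient descent update
    b -= lr * grad_b                                                   # Step 7: repeat until loss stops improving

print(f"w* = {w:.3f}, b* = {b:.3f}, final log loss = {loss:.4f}")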

📖 Click to Expand - Suitability Checklist
Criterion | Comment
Interpretability | ✅ Excellent – coefficients are directly interpretable
Linearity Expectation | ⚠️ Yes – assumes linear relationship between features and log-odds
High dimensionality | ✅ Good – performs well with many features (with regularization)
Handling of multicollinearity | ❌ Needs treatment – regularization helps, but still sensitive
Handling of categorical features | ❌ Needs preprocessing – requires one-hot or ordinal encoding
Handling of outliers | ❌ Sensitive – can distort coefficients significantly
Handling of missing values | ❌ Not supported – requires imputation
Scaling of features needed | ⚠️ Yes – especially important when using regularization
Class Imbalance problem | ⚠️ Needs adjustment – use class_weight='balanced' or resampling
Handling of sparseness in data | ✅ Works fine – especially with L1 regularization for feature selection

General comment on accuracy: Moderate – often outperformed by tree-based models for complex, non-linear patterns.

General comment on training speed: ✅ Fast – very efficient even on large datasets.

In [45]:
# 1. Train model
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X_train, y_train)
Out[45]:
LogisticRegression()
In [46]:
# 2. Show the learned equation
w = model.coef_[0]        # learned feature weights
b = model.intercept_[0]   # learned bias term

print("🧠 Learned Logistic Equation:\n")

terms = [f"({w[i]:+.4f})·{col}" for i, col in enumerate(X_train.columns)]
equation = " +\n    ".join(terms)
print("z =\n    " + equation)
print(f"  + ({b:+.4f})  ← bias\n")
🧠 Learned Logistic Equation:

z =
    (-0.3652)·Feature_1 +
    (+0.2297)·Feature_2 +
    (-0.5858)·Feature_3 +
    (+0.0253)·Feature_4 +
    (+0.0175)·Feature_5 +
    (+0.0402)·Feature_6 +
    (+1.2599)·Feature_7 +
    (-0.1435)·Feature_8 +
    (-0.4557)·Feature_9 +
    (-0.0027)·Feature_10
  + (-0.7569)  ← bias

In [47]:
# 3. Manually compute z, sigmoid, and prediction
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

X_sample = X_test.iloc[:10].copy()
z_vals = np.dot(X_sample, w) + b
probs = sigmoid(z_vals)
preds = (probs >= 0.5).astype(int)
In [48]:
# 4. Print table showing internals
diagnostics = X_sample.copy()
diagnostics["z = wยทx + b"] = np.round(z_vals, 4)
diagnostics["sigmoid(z) = prob"] = np.round(probs, 4)
diagnostics["prediction"] = preds
diagnostics["true_label"] = y_test.iloc[:10].values
diagnostics["log_loss_row"] = -(
    diagnostics["true_label"] * np.log(diagnostics["sigmoid(z) = prob"]) +
    (1 - diagnostics["true_label"]) * np.log(1 - diagnostics["sigmoid(z) = prob"])
).round(4)

print("\n๐Ÿ“Š Internal Breakdown (First 10 Test Rows):")
display(diagnostics)
๐Ÿ“Š Internal Breakdown (First 10 Test Rows):
Feature_1 Feature_2 Feature_3 Feature_4 Feature_5 Feature_6 Feature_7 Feature_8 Feature_9 Feature_10 z = wยทx + b sigmoid(z) = prob prediction true_label log_loss_row
595 0.178986 1.033881 0.488455 -0.407460 -0.574101 0.536414 -1.232457 0.123480 0.881295 0.907962 -2.8442 0.0550 0 0 0.0566
868 0.994921 0.202329 0.936990 1.631857 -1.269330 1.702515 -1.419999 1.818062 -0.910983 -0.733033 -3.1678 0.0404 0 0 0.0412
406 0.030370 0.967794 0.407555 0.398992 -1.271549 -1.153084 -1.200730 0.324658 1.210347 1.050049 -2.9566 0.0494 0 0 0.0507
815 1.248285 -1.087246 1.198098 -1.085825 -0.675708 0.034152 -1.850321 -1.148794 -1.069471 0.679373 -3.8832 0.0202 0 0 0.0204
762 0.070351 -0.115110 0.300255 -0.345919 -1.391958 1.704102 -0.815085 -1.121751 0.700136 0.889154 -2.1370 0.1056 0 0 0.1116
229 0.049136 0.862543 0.306325 -1.694973 -1.426827 0.069337 -0.864365 0.817306 0.804673 0.820232 -2.3964 0.0835 0 0 0.0872
445 -0.300191 0.372648 -0.264597 0.166493 -1.285599 -1.615846 0.373120 0.243475 0.334053 1.542736 -0.2111 0.4474 0 1 0.8043
691 0.370534 0.184971 0.508500 0.452756 0.576451 -1.508556 -1.016108 -0.770819 0.181994 1.707330 -2.4438 0.0799 0 0 0.0833
625 0.263054 -0.335138 0.132288 0.342338 1.987061 -0.530971 -0.022842 0.853976 -0.618068 1.554160 -0.8592 0.2975 0 0 0.3531
697 0.843541 0.978422 0.785313 0.522143 0.975312 0.515628 -1.176116 -0.330789 -0.802144 -1.103670 -2.3149 0.0899 0 0 0.0942
In [49]:
# 5. Calculate and print total log loss
from sklearn.metrics import log_loss
total_loss = log_loss(y_test, model.predict_proba(X_test)[:, 1])
print(f"\n๐Ÿ“‰ Total Log Loss: {total_loss:.4f}")
๐Ÿ“‰ Total Log Loss: 0.3407

🧮 Naive Bayes¶

📖 Click to Expand
🔍 What is Naive Bayes?

Naive Bayes is a family of probabilistic classifiers based on Bayes' Theorem.
It assumes that all features are independent of each other — which is rarely true in practice, but the model still performs surprisingly well.

It calculates the probability of each class given the input features and picks the class with the highest likelihood.

Example:
"Given these symptoms, what's the most probable disease?" — Naive Bayes is widely used in text classification, spam detection, and medical diagnosis.

✅ Pros vs ❌ Cons
Pros | Cons
Very fast and scalable | Assumes feature independence (naive)
Handles high-dimensional data well | May underperform with correlated inputs
Simple and interpretable | Struggles with numeric feature scaling
Works well with text data | Outputs are often overconfident
🧠 When to Use
  • You're working with text (e.g., spam filters, sentiment)
  • You want a fast baseline
  • You're dealing with high-dimensional, sparse features (like TF-IDF)
  • You have clean categorical or binary features
⚠️ Pitfalls & Hacks
  • Pitfall: Doesn't handle continuous features naturally — convert them to bins or use GaussianNB.
  • Hack: Apply Laplace smoothing to handle zero probabilities in unseen combinations.
  • Tip: Don't expect high accuracy on raw numeric data — it shines in text-like scenarios.
📖 Click to Expand - Suitability Checklist
Criterion | Comment
Interpretability | ✅ Good – conditional probabilities are intuitive
Linearity Expectation | ⚠️ Assumes feature independence → not truly linear or interactive
High dimensionality | ✅ Very good – handles high-dimensional sparse data well
Handling of multicollinearity | ❌ Poor – assumes feature independence; correlated features hurt performance
Handling of categorical features | ⚠️ Needs preprocessing – requires encoding, but categorical NB variants exist
Handling of outliers | ❌ Very sensitive – assumes Gaussian or other strict distributional forms
Handling of missing values | ❌ Not supported – requires imputation
Scaling of features needed | ⚠️ Sometimes – required if Gaussian NB is used (assumes normal distribution)
Class Imbalance problem | ⚠️ Needs adjustment – priors can be tuned or class weights applied manually
Handling of sparseness in data | ✅ Excellent – especially performant in text classification or bag-of-words models

General comment on accuracy: Surprisingly strong baseline for some problems (e.g., NLP); weak when feature independence assumption breaks.

General comment on training speed: ✅ Extremely fast – almost instantaneous to train.
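
Unlike the baseline and Logistic Regression sections, no training cell follows here, so below is a minimal sketch (assuming the X_train/X_test split and the evaluate_model / update_best_model helpers defined earlier) of fitting GaussianNB on the continuous features; the variable name nb_clf is illustrative.

from sklearn.naive_bayes import GaussianNB

# Fit a Gaussian Naive Bayes model on the continuous features
nb_clf = GaussianNB()
nb_clf.fit(X_train, y_train)

# Reuse the shared evaluation and tracking helpers defined earlier
y_train_pred_nb = nb_clf.predict(X_train)
y_test_pred_nb = nb_clf.predict(X_test)
evaluate_model(y_test, y_test_pred_nb, label="Naive Bayes")
update_best_model(
    model_name="GaussianNB",
    model_obj=nb_clf,
    y_train=y_train, y_test=y_test,
    y_train_pred=y_train_pred_nb, y_test_pred=y_test_pred_nb,
    hyperparameters=nb_clf.get_params()
)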

🌳 Decision Tree¶

📖 Click to Expand
🔍 What is a Decision Tree?

A Decision Tree splits data into branches based on feature values, creating a flowchart-like structure.
Each split is chosen to maximize class separation (typically using Gini impurity or entropy).
The result is a set of human-readable rules — like:
"If age < 30 and income > 50K → likely to churn."

It's intuitive and easy to explain, even to non-technical stakeholders.

✅ Pros vs ❌ Cons
Pros | Cons
Easy to visualize and interpret | Prone to overfitting on noisy data
No need for feature scaling | Can create unstable splits
Captures non-linear relationships | Doesn't generalize well on small data
Works for both numeric and categorical | Can be biased toward dominant features
🧠 When to Use
  • You need a model that's explainable (e.g., in regulated domains)
  • Your data has mixed types (numeric + categorical)
  • You want to prototype quickly and understand feature importance
  • You're okay with less predictive power in favor of interpretability
⚠️ Pitfalls & Hacks
  • Pitfall: Deep trees can memorize the training data — always prune or set max_depth.
  • Hack: Use as a weak learner inside ensembles (like Random Forest or XGBoost) to improve performance.
  • Tip: Use feature importance from trees to guide feature selection for other models.
📖 Click to Expand - Suitability Checklist
Criterion | Comment
Interpretability | ✅ Very good – rules and splits are easily visualized and explained
Linearity Expectation | ✅ No – naturally captures non-linear relationships
High dimensionality | ⚠️ Moderate – can overfit with too many features unless pruned
Handling of multicollinearity | ✅ Handles it – but may create instability in splits
Handling of categorical features | ⚠️ Partial – numerical encoding needed; some implementations support direct handling
Handling of outliers | ✅ Robust – splits are based on thresholds, not sensitive to extreme values
Handling of missing values | ⚠️ Limited – some implementations handle missing splits, others need imputation
Scaling of features needed | ✅ Not needed – uses raw feature values for splitting
Class Imbalance problem | ❌ Poor – biased toward majority class unless adjusted with class_weight or sampling
Handling of sparseness in data | ⚠️ Depends – not ideal for extremely sparse datasets (e.g., text data)

General comment on accuracy: Prone to overfitting if unpruned; weak alone but powerful as base learners in ensembles.

General comment on training speed: ✅ Fast – quick to train on moderate-sized datasets.
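
A minimal training sketch under the same assumptions as the earlier sections (existing split and helper functions); max_depth=4 and class_weight='balanced' are illustrative choices for the pruning and imbalance adjustments mentioned above, not tuned values.

from sklearn.tree import DecisionTreeClassifier

# Shallow tree with class weighting to counter the simulated imbalance
dt_clf = DecisionTreeClassifier(max_depth=4, class_weight="balanced", random_state=42)
dt_clf.fit(X_train, y_train)

evaluate_model(y_test, dt_clf.predict(X_test), label="Decision Tree")
update_best_model(
    model_name="DecisionTree",
    model_obj=dt_clf,
    y_train=y_train, y_test=y_test,
    y_train_pred=dt_clf.predict(X_train), y_test_pred=dt_clf.predict(X_test),
    hyperparameters={"max_depth": 4, "class_weight": "balanced"}
)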

🌲 Random Forest¶

📖 Click to Expand
🔍 What is a Random Forest?

Random Forest is an ensemble method that builds many decision trees and combines their outputs.
Each tree sees a random subset of the data and features, making the forest diverse and robust.

It works by aggregating the predictions of multiple trees (majority vote for classification), reducing the overfitting risk of a single decision tree.

Think of it as a crowd of weak models working together to make better predictions.

✅ Pros vs ❌ Cons
Pros | Cons
Strong performance out of the box | Less interpretable than a single tree
Handles non-linearities and interactions | Slower for real-time predictions
Resistant to overfitting | May require tuning to perform well
Works well with large feature spaces | Not ideal when interpretability is key
🧠 When to Use
  • You need a reliable general-purpose model with minimal tuning
  • You want to improve stability over a single decision tree
  • Your data is tabular and structured
  • You care more about performance than full interpretability
⚠️ Pitfalls & Hacks
  • Pitfall: May become large and slow — tune n_estimators and max_depth if needed
  • Hack: Use feature_importances_ to find influential variables
  • Tip: Avoid one-hot encoding with high-cardinality features — use label encoding instead
📖 Click to Expand - Suitability Checklist
Criterion | Comment
Interpretability | ❌ Low – individual trees interpretable, but ensemble is a black box
Linearity Expectation | ✅ No – captures complex non-linear relationships
High dimensionality | ✅ Good – handles many features well via feature bagging
Handling of multicollinearity | ✅ Handles it – less sensitive due to random feature selection per tree
Handling of categorical features | ⚠️ Requires encoding – label or one-hot encoding typically needed
Handling of outliers | ✅ Robust – not sensitive due to median-based splits and ensembling
Handling of missing values | ⚠️ Some support – not native in all implementations; imputation often needed
Scaling of features needed | ✅ Not needed – tree splits are scale-invariant
Class Imbalance problem | ⚠️ Needs adjustment – use class_weight='balanced' or stratified sampling
Handling of sparseness in data | ⚠️ Moderate – not ideal for extreme sparsity (e.g., NLP bag-of-words)

General comment on accuracy: ✅ Strong – robust out-of-the-box performance with low overfitting risk.

General comment on training speed: ⚠️ Slower than single models – parallelizable but can be compute-heavy.
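
A minimal training sketch under the same assumptions as the earlier sections; n_estimators=200 and class_weight='balanced' are illustrative defaults, not tuned values.

from sklearn.ensemble import RandomForestClassifier

# Bagged trees with class weighting for the simulated imbalance
rf_clf = RandomForestClassifier(n_estimators=200, class_weight="balanced",
                                random_state=42, n_jobs=-1)
rf_clf.fit(X_train, y_train)

evaluate_model(y_test, rf_clf.predict(X_test), label="Random Forest")
update_best_model(
    model_name="RandomForest",
    model_obj=rf_clf,
    y_train=y_train, y_test=y_test,
    y_train_pred=rf_clf.predict(X_train), y_test_pred=rf_clf.predict(X_test),
    hyperparameters={"n_estimators": 200, "class_weight": "balanced"}
)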

🎯 KNN (K-Nearest Neighbors)¶

📖 Click to Expand
🔍 What is K-Nearest Neighbors?

KNN is a non-parametric, instance-based learning method.
It doesn't learn a model during training — instead, it stores the data.
At prediction time, it looks at the K most similar observations (neighbors) and assigns the class based on majority vote.

Similarity is usually measured using Euclidean distance (or other distance metrics for different data types).

Example:
"To predict a label for this point, look at its 5 closest data points and choose the most common class."

✅ Pros vs ❌ Cons
Pros | Cons
Simple and intuitive | Slow at prediction time (no training step)
No training required | Struggles with high-dimensional data
Captures local patterns | Requires feature scaling
Flexible distance metrics | Memory-intensive with large datasets
🧠 When to Use
  • You have low-dimensional, clean data
  • You want to prototype quickly with minimal assumptions
  • You care about local behavior rather than global rules
  • Interpretability is less important than flexibility
⚠️ Pitfalls & Hacks
  • Pitfall: Distance metrics break down in high-dimensional space (curse of dimensionality)
  • Hack: Use StandardScaler or MinMaxScaler to normalize features before fitting
  • Tip: Tune k using cross-validation; odd numbers help avoid ties in binary classification
📖 Click to Expand - Suitability Checklist
Criterion | Comment
Interpretability | ⚠️ Moderate – intuitive idea, but no explicit model or coefficients
Linearity Expectation | ✅ No – captures non-linear patterns based on local neighborhoods
High dimensionality | ❌ Poor – suffers from the curse of dimensionality; distances become less meaningful
Handling of multicollinearity | ❌ Problematic – redundant features distort distance metrics
Handling of categorical features | ❌ Not natively supported – requires careful encoding and distance handling
Handling of outliers | ❌ Sensitive – local distance-based voting easily skewed by outliers
Handling of missing values | ❌ Not supported – requires complete-case or imputation preprocessing
Scaling of features needed | ✅ Yes – essential, as distance calculations are affected by feature magnitudes
Class Imbalance problem | ❌ Poor – biased toward majority class due to majority voting
Handling of sparseness in data | ❌ Weak – sparse vectors make distance metrics ineffective

General comment on accuracy: ⚠️ Highly data-dependent – can perform well with clean, balanced, low-dimensional data.

General comment on training speed: ✅ Fast training, ❌ Slow inference – lazy learner, evaluates at prediction time.
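
Because KNN depends on feature scale, a minimal sketch wraps StandardScaler and KNeighborsClassifier in a Pipeline on the existing split; n_neighbors=5 is simply the sklearn default, not a tuned value.

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# Scaling happens inside the pipeline, so the test set is transformed with train-set statistics
knn_clf = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
])
knn_clf.fit(X_train, y_train)

evaluate_model(y_test, knn_clf.predict(X_test), label="KNN")
update_best_model(
    model_name="KNN",
    model_obj=knn_clf,
    y_train=y_train, y_test=y_test,
    y_train_pred=knn_clf.predict(X_train), y_test_pred=knn_clf.predict(X_test),
    hyperparameters={"n_neighbors": 5}
)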

📈 SVM (Support Vector Machines)¶

📖 Click to Expand
🔍 What is SVM?

Support Vector Machines (SVM) are margin-based classifiers that try to find the best boundary (hyperplane) that separates classes.
SVM focuses on support vectors — the critical data points closest to the boundary — to maximize the margin between classes.

It can handle non-linear patterns using kernel tricks (e.g., RBF kernel), making it flexible for complex data.

Think of it as drawing the widest possible gap between two classes while avoiding overlap.

✅ Pros vs ❌ Cons
Pros | Cons
Works well in high-dimensional spaces | Slow on large datasets
Effective for non-linear boundaries | Requires careful parameter tuning
Robust to overfitting (with regularization) | Not intuitive to interpret
Supports different kernels | Doesn't scale well with noisy data
🧠 When to Use
  • Your data is high-dimensional, but you want a non-linear model
  • You need a strong classifier and have time to tune hyperparameters
  • Dataset is moderate in size and reasonably clean
  • You care about maximizing margin of separation
⚠️ Pitfalls & Hacks
  • Pitfall: Doesn't output probabilities by default — use probability=True in SVC if needed
  • Hack: Use RBF kernel as a good starting point for non-linear problems
  • Tip: Always standardize features — SVM is sensitive to feature scale
📖 Click to Expand - Suitability Checklist
Criterion | Comment
Interpretability | ⚠️ Moderate – support vectors and margins can be visualized in 2D, but overall less intuitive
Linearity Expectation | ⚠️ Depends – linear SVM assumes linear separability; kernel SVM handles non-linear
High dimensionality | ✅ Very good – especially effective in high-dimensional spaces (e.g., text data)
Handling of multicollinearity | ⚠️ Can be sensitive – especially in linear SVM; use regularization
Handling of categorical features | ❌ Not supported – requires one-hot or other encoding
Handling of outliers | ❌ Sensitive – margin-based optimization gets distorted by outliers
Handling of missing values | ❌ Not supported – must impute before training
Scaling of features needed | ✅ Yes – essential due to reliance on distance and dot products
Class Imbalance problem | ⚠️ Needs adjustment – use class_weight='balanced' or tune C and margins
Handling of sparseness in data | ✅ Good – works well in high-dimensional sparse spaces (esp. linear SVM)

General comment on accuracy: ✅ High – strong performance on well-separated data, especially with good kernel choice.

General comment on training speed: ❌ Slow – especially on large datasets or with complex kernels.
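
A minimal sketch following the tips above: standardize the features and fit an RBF-kernel SVC with probability=True so the notebook's predict_proba-based helpers keep working; C and gamma are left at sklearn defaults rather than tuned.

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# probability=True is required for predict_proba (used by the ROC and tracking helpers)
svm_clf = Pipeline([
    ("scaler", StandardScaler()),
    ("svc", SVC(kernel="rbf", class_weight="balanced", probability=True, random_state=42)),
])
svm_clf.fit(X_train, y_train)

evaluate_model(y_test, svm_clf.predict(X_test), label="SVM (RBF)")
update_best_model(
    model_name="SVM_RBF",
    model_obj=svm_clf,
    y_train=y_train, y_test=y_test,
    y_train_pred=svm_clf.predict(X_train), y_test_pred=svm_clf.predict(X_test),
    hyperparameters={"kernel": "rbf", "class_weight": "balanced", "probability": True}
)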

🚀 XGBoost¶

📖 Click to Expand
🔍 What is XGBoost?

XGBoost (Extreme Gradient Boosting) is a powerful boosted tree ensemble method.
Unlike Random Forest (which builds trees in parallel), XGBoost builds trees sequentially — each new tree tries to fix the errors of the previous one.

It uses gradient descent to minimize loss, with regularization to prevent overfitting.
XGBoost is known for its speed, accuracy, and efficiency, making it a go-to model in many Kaggle competitions and production systems.

✅ Pros vs ❌ Cons

| Pros | Cons |
|---|---|
| High predictive accuracy | Harder to interpret |
| Built-in regularization (less overfitting) | More complex than basic tree models |
| Fast and scalable | Requires tuning for best performance |
| Handles missing data automatically | May overfit small/noisy datasets |

🧠 When to Use
  • You need top-tier performance on structured/tabular data
  • You're working with noisy or complex relationships
  • You're okay with a black-box model in exchange for results
  • You want built-in tools for feature importance, early stopping, etc.
⚠️ Pitfalls & Hacks
  • Pitfall: Easy to overfit if n_estimators is too high — always monitor with validation
  • Hack: Use early_stopping_rounds during training to auto-pick the optimal iteration
  • Tip: Start with basic settings and use GridSearchCV or Optuna for tuning
📖 Click to Expand - Suitability Checklist

| Criterion | Comment |
|---|---|
| Interpretability | ❌ Low – complex ensemble; partial plots or SHAP values needed for insight |
| Linearity Expectation | ✅ No – naturally models complex non-linear relationships |
| High dimensionality | ✅ Excellent – handles many features via regularization and feature importance |
| Handling of multicollinearity | ✅ Handles well – trees split on the most useful among correlated features |
| Handling of categorical features | ⚠️ Needs encoding – label encoding typically used; native support improving |
| Handling of outliers | ✅ Robust – trees are insensitive to extreme values |
| Handling of missing values | ✅ Built-in – learns optimal path for missing values during tree construction |
| Scaling of features needed | ✅ Not needed – tree-based, scale-invariant |
| Class Imbalance problem | ✅ Handled – use scale_pos_weight, custom loss, or sampling |
| Handling of sparseness in data | ✅ Excellent – designed to handle sparse matrices natively |

General comment on accuracy: ✅ Top-tier – one of the most accurate out-of-the-box models for tabular data.

General comment on training speed: ⚠️ Slower – faster than many ensembles, but heavier than single models; GPU support helps.
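
The early-stopping hack above, as a minimal sketch (assuming the X_train / y_train split from earlier; the validation slice and parameter values are illustrative). Note that depending on your xgboost version, early_stopping_rounds is a constructor argument (recent releases) or a fit() argument (older ones).

from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# Hold out a validation slice from the training data so early stopping has something to monitor.
X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_train, test_size=0.2, stratify=y_train, random_state=42)

xgb_clf = XGBClassifier(
    n_estimators=1000,           # generous upper bound; early stopping picks the effective count
    learning_rate=0.05,
    max_depth=4,
    eval_metric="logloss",
    early_stopping_rounds=30,    # stop once validation logloss hasn't improved for 30 rounds
)
xgb_clf.fit(X_tr, y_tr, eval_set=[(X_val, y_val)], verbose=False)
print("Best iteration:", xgb_clf.best_iteration)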

🧠 Neural Network¶

📖 Click to Expand
🔍 What is a Neural Network?

A Neural Network is a layered structure of interconnected "neurons" inspired by the human brain.
Each neuron applies a weighted transformation followed by a non-linear activation, allowing the model to learn complex, non-linear patterns in the data.

Even a basic feedforward neural network (also called Multi-Layer Perceptron or MLP) can approximate intricate decision boundaries — making it powerful but harder to interpret.

Think of it as a flexible function builder that learns patterns layer by layer.

✅ Pros vs ❌ Cons

| Pros | Cons |
|---|---|
| Can model complex, non-linear relationships | Requires lots of data and tuning |
| Works well on both tabular and image/text data | Not interpretable out of the box |
| Scales with data and compute | Can overfit if not regularized |
| Highly customizable architectures | Slower to train, harder to debug |

🧠 When to Use
  • You have enough data and want to model complex interactions
  • You're comfortable with longer training and tuning
  • You care more about predictive power than explainability
  • You're building pipelines that could benefit from deep learning extensions later
⚠️ Pitfalls & Hacks
  • Pitfall: Prone to overfitting — always use dropout, regularization, or early stopping
  • Hack: Use a simple architecture (1–2 hidden layers) for structured/tabular data
  • Tip: Standardize inputs and tune the learning rate; training can otherwise stall or explode
📖 Click to Expand - Suitability Checklist

| Criterion | Comment |
|---|---|
| Interpretability | ❌ Very low – acts as a black box unless aided by techniques like SHAP, LIME |
| Linearity Expectation | ✅ No – inherently models non-linear and complex interactions |
| High dimensionality | ✅ Excellent – scales well with many features and large data |
| Handling of multicollinearity | ✅ Handles – internal weights adjust during training, but correlated inputs may slow convergence |
| Handling of categorical features | ❌ Not native – requires one-hot encoding or embeddings |
| Handling of outliers | ❌ Sensitive – can destabilize training; often mitigated with preprocessing |
| Handling of missing values | ❌ Not supported – must be imputed before training |
| Scaling of features needed | ✅ Yes – critical for stable and fast convergence (e.g., standardization or normalization) |
| Class Imbalance problem | ⚠️ Needs care – custom loss functions, class weights, or resampling required |
| Handling of sparseness in data | ⚠️ Depends – not ideal unless using sparse-aware architectures or embedding layers |

General comment on accuracy: ✅ High – can outperform other models with enough data and tuning, especially on non-tabular data.

General comment on training speed: ❌ Slow – resource-intensive; requires tuning and hardware for best performance.
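
A minimal sketch of the "keep it small and standardized" advice above (assuming the X_train / X_test split from earlier; the layer size and regularization strength are illustrative):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier

# One small hidden layer is usually enough for tabular data; scaling keeps training stable.
# early_stopping=True holds out part of the training data and stops when the validation score plateaus.
mlp_pipe = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(64,), alpha=1e-3, learning_rate_init=1e-3,
                  early_stopping=True, max_iter=500, random_state=42)
)
mlp_pipe.fit(X_train, y_train)
print("Neural Network test accuracy:", mlp_pipe.score(X_test, y_test))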

Back to the top


📊 Model Selection¶

📖 Click to Expand
🧠 Model Selection Table (Based on Data Characteristics)

| Target Type | Linearly Separable | Correlation | Imbalance | Recommended Models | Notes |
|---|---|---|---|---|---|
| Binary | ✅ True | Low | ✅ True | XGBoost > Random Forest | Use tree-based models with class weights or resampling. |
| Binary | ✅ True | Low | ❌ False | Logistic Regression > SVM | Start with simple linear models. Use as benchmark. |
| Binary | ✅ True | High | ✅ True | XGBoost > Random Forest | Use tree-based models with class weights or resampling. |
| Binary | ✅ True | High | ❌ False | Logistic Regression > SVM | Start with simple linear models. Use as benchmark. |
| Binary | ❌ False | Low | ✅ True | XGBoost > Random Forest | Boosting or RF with class weights to handle imbalance + complexity. |
| Binary | ❌ False | Low | ❌ False | Random Forest > Decision Tree | Simple non-linear trees likely sufficient. Avoid tuning-heavy models. |
| Binary | ❌ False | High | ✅ True | XGBoost > Random Forest | Boosting or RF with class weights to handle imbalance + complexity. |
| Binary | ❌ False | High | ❌ False | Random Forest > Decision Tree | Simple non-linear trees likely sufficient. Avoid tuning-heavy models. |
| Multiclass | ✅ True | Low | ✅ True | XGBoost > Logistic Regression | Use OvR strategy with LR/XGB. Watch for class separation. |
| Multiclass | ✅ True | Low | ❌ False | XGBoost > Logistic Regression | Use OvR strategy with LR/XGB. Watch for class separation. |
| Multiclass | ✅ True | High | ✅ True | XGBoost > Logistic Regression | Use OvR strategy with LR/XGB. Watch for class separation. |
| Multiclass | ✅ True | High | ❌ False | XGBoost > Logistic Regression | Use OvR strategy with LR/XGB. Watch for class separation. |
| Multiclass | ❌ False | Low | ✅ True | Neural Network > KNN | Use Neural Net or KNN. Prioritize decision boundary complexity. |
| Multiclass | ❌ False | Low | ❌ False | Neural Network > KNN | Use Neural Net or KNN. Prioritize decision boundary complexity. |
| Multiclass | ❌ False | High | ✅ True | XGBoost > Random Forest | Tree-based models preferred. Skip preprocessing of collinear features. |
| Multiclass | ❌ False | High | ❌ False | XGBoost > Random Forest | Tree-based models preferred. Skip preprocessing of collinear features. |
📖 Click to Expand 🧠 **Model Selection Flowchart (Based on Data Characteristics)**
🎯 Target Type (Binary / Multiclass) and 📈 Linearly Separable? (Yes / No) branch first; each branch then applies the same checks and recommendations:
  ├── 🧬 Feature Type = Categorical? ✅ → 🌲 Random Forest > 🚀 XGBoost
  ├── 📉 Correlation = High? ✅ → 🚀 XGBoost > 🌲 Random Forest
  ├── ⚠️ Missing Data? ✅ → 🚀 XGBoost > 🧠 Neural Network
  ├── ⚠️ Outliers Present?
  │     ├── ✅ and ⚖️ Imbalanced → 🚀 XGBoost > 🌲 Random Forest
  │     └── ✅ but balanced → 🌲 Random Forest > 🌳 Decision Tree
  └── ❌ None of the above → 🌲 Random Forest > 🌳 Decision Tree
📖 Click to Expand
Start
  │
  ├── 🎯 Target Type (Binary / Multiclass) → 📈 Linearly Separable? (Yes / No) → each branch then applies the same checks:
  │     ├── 🧬 Feature Type = Categorical? ✅ → 🌲 Random Forest / 🚀 XGBoost
  │     ├── 📉 Correlation = High? ✅ → ❌ Avoid NB / LR
  │     ├── ⚠️ Missing Data? ✅ → 🚀 XGBoost / CatBoost
  │     ├── ⚠️ Outliers? ✅ → 🚀 XGBoost / 🌲 Random Forest
  │     └── ❌ Clean numeric data (none of the above):
  │           ├── Linearly separable → 🤖 Logistic Regression (OvR for multiclass) / 🧭 SVM
  │           └── Not linearly separable → 🧠 Neural Network / 🎯 KNN
  └── 🧪 Evaluate Top 3 Recommended Models

🧠 Recommend Models¶

In [50]:
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
import xgboost as xgb

model_registry = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(),
    "KNN": KNeighborsClassifier(),
    "SVM": SVC(probability=True),  # needed for ROC AUC
    "XGBoost": xgb.XGBClassifier(use_label_encoder=False, eval_metric='logloss'),
    "Neural Network": MLPClassifier(max_iter=1000)
}
In [51]:
def recommend_models(data_characteristics, verbose=True):
    """
    Scores and ranks models based on data characteristics.
    Prints the recommended order and rationales.
    """
    from termcolor import colored

    score = {}
    rationale = {}
    
    # Extract characteristics
    target_info = data_characteristics.get("target_variable", {})
    feature_info = data_characteristics.get("features", {})
    is_linear = data_characteristics.get("linear_separability", False)

    imbalance = target_info.get("imbalance", False)
    imbalance_severity = target_info.get("class_imbalance_severity", "")
    feature_type = feature_info.get("type", "")
    corr = feature_info.get("correlation", "")
    outliers = feature_info.get("outliers", False)

    # --- Logistic Regression ---
    score["Logistic Regression"] = 2
    rationale["Logistic Regression"] = ["Good for linearly separable, numeric data"]
    if is_linear:
        score["Logistic Regression"] += 3
        rationale["Logistic Regression"].append("Linear separability is True")
    if feature_type == "continuous":
        score["Logistic Regression"] += 1
        rationale["Logistic Regression"].append("Features are continuous")
    if outliers:
        score["Logistic Regression"] -= 1
        rationale["Logistic Regression"].append("Sensitive to outliers")

    # --- Naive Bayes ---
    score["Naive Bayes"] = 1
    rationale["Naive Bayes"] = ["Good for categorical, independent features"]
    if feature_type == "categorical":
        score["Naive Bayes"] += 2
        rationale["Naive Bayes"].append("Features are categorical")
    if corr == "low":
        score["Naive Bayes"] += 1
        rationale["Naive Bayes"].append("Feature correlation is low")
    if corr == "high":
        score["Naive Bayes"] -= 2
        rationale["Naive Bayes"].append("Correlation is high โ†’ violates independence")

    # --- Decision Tree ---
    score["Decision Tree"] = 2
    rationale["Decision Tree"] = ["Fast, flexible, handles most data types"]
    if corr == "high":
        score["Decision Tree"] += 1
        rationale["Decision Tree"].append("Can exploit feature redundancy")
    if outliers:
        score["Decision Tree"] += 1
        rationale["Decision Tree"].append("Robust to outliers")

    # --- Random Forest ---
    score["Random Forest"] = 3
    rationale["Random Forest"] = ["Strong baseline, handles imbalance + outliers"]
    if outliers:
        score["Random Forest"] += 1
        rationale["Random Forest"].append("Handles outliers well")
    if imbalance:
        score["Random Forest"] += 1
        rationale["Random Forest"].append("Bootstrap helps with imbalance")

    # --- KNN ---
    score["KNN"] = 1
    rationale["KNN"] = ["Simple, distance-based"]
    if feature_type == "continuous":
        score["KNN"] += 1
        rationale["KNN"].append("Distance-based โ†’ works better on continuous features")
    if outliers:
        score["KNN"] -= 2
        rationale["KNN"].append("Very sensitive to outliers")
    if imbalance:
        score["KNN"] -= 1
        rationale["KNN"].append("Imbalance skews neighbors")

    # --- SVM ---
    score["SVM"] = 2
    rationale["SVM"] = ["Margin-based model"]
    if is_linear:
        score["SVM"] += 2
        rationale["SVM"].append("Linear separability is True")
    if imbalance:
        score["SVM"] -= 1
        rationale["SVM"].append("Needs tuning to handle imbalance")
    if feature_type == "continuous":
        score["SVM"] += 1
        rationale["SVM"].append("Requires numeric features")

    # --- Neural Network ---
    score["Neural Network"] = 2
    rationale["Neural Network"] = ["Flexible but sensitive"]
    if imbalance_severity == "high":
        score["Neural Network"] += 1
        rationale["Neural Network"].append("Can learn from imbalance if tuned")
    if outliers:
        score["Neural Network"] -= 1
        rationale["Neural Network"].append("Can be unstable with outliers")

    # --- XGBoost ---
    score["XGBoost"] = 4
    rationale["XGBoost"] = ["Strong general-purpose model"]
    if outliers:
        score["XGBoost"] += 1
        rationale["XGBoost"].append("Robust to outliers")
    if imbalance:
        score["XGBoost"] += 1
        rationale["XGBoost"].append("scale_pos_weight helps with imbalance")
    if corr == "high":
        score["XGBoost"] += 1
        rationale["XGBoost"].append("Handles redundant features well")

    # Sort by descending score
    ranked_models = sorted(score.items(), key=lambda x: x[1], reverse=True)
    ranked_model_names = [model for model, _ in ranked_models]

    # Filter and reorder model_registry
    ranked_registry = {name: model_registry[name] for name in ranked_model_names if name in model_registry}

    if verbose:
        print("๐Ÿง  Recommended Model Evaluation Order:\n")
        for i, name in enumerate(ranked_model_names, 1):
            if name in model_registry:
                prefix = colored(f"{i}. {name} (Score: {score[name]})", "green") if i <= 3 else f"{i}. {name} (Score: {score[name]})"
                print(prefix)
                for reason in rationale[name]:
                    print(f"   โ†ช {reason}")
        print()

    return ranked_model_names, ranked_registry
In [52]:
_, model_registry = recommend_models(data_characteristics)

# model_registry
๐Ÿง  Recommended Model Evaluation Order:

1. XGBoost (Score: 6)
   โ†ช Strong general-purpose model
   โ†ช Robust to outliers
   โ†ช scale_pos_weight helps with imbalance
2. Random Forest (Score: 5)
   โ†ช Strong baseline, handles imbalance + outliers
   โ†ช Handles outliers well
   โ†ช Bootstrap helps with imbalance
3. Decision Tree (Score: 3)
   โ†ช Fast, flexible, handles most data types
   โ†ช Robust to outliers
4. Logistic Regression (Score: 2)
   โ†ช Good for linearly separable, numeric data
   โ†ช Features are continuous
   โ†ช Sensitive to outliers
5. Naive Bayes (Score: 2)
   โ†ช Good for categorical, independent features
   โ†ช Feature correlation is low
6. SVM (Score: 2)
   โ†ช Margin-based model
   โ†ช Needs tuning to handle imbalance
   โ†ช Requires numeric features
7. Neural Network (Score: 1)
   โ†ช Flexible but sensitive
   โ†ช Can be unstable with outliers
8. KNN (Score: -1)
   โ†ช Simple, distance-based
   โ†ช Distance-based โ†’ works better on continuous features
   โ†ช Very sensitive to outliers
   โ†ช Imbalance skews neighbors

📈 Model Comparison¶

In [53]:
from sklearn.metrics import (
    precision_score, recall_score, f1_score,
    accuracy_score, roc_auc_score, confusion_matrix, log_loss
)

top_k = 3
for name in list(model_registry.keys())[:top_k]:
    # We evaluate only the top 3 recommended models (ranked earlier) for focused comparison.
    print(f"\n๐Ÿ”ง Training: {name}")

    # Fit and predict
    model = model_registry[name]
    model.fit(X_train, y_train)
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)

    # Evaluation summary
    evaluate_model(y_test, y_test_pred, label=name)
    plot_confusion(y_test, y_test_pred, model_name=name)
    plot_roc_auc(model, X_test, y_test, model_name=name)

    # Track and log best model
    update_best_model(
        model_name=name,
        model_obj=model,
        y_train=y_train,
        y_test=y_test,
        y_train_pred=y_train_pred,
        y_test_pred=y_test_pred
    )

    print("โ€”" * 80)  # horizontal line
๐Ÿ”ง Training: XGBoost

๐Ÿ“Š XGBoost โ€” Performance Summary:
- Accuracy  :  93.00% โ†’ Overall correctness.
- Precision :  89.66% โ†’ Of predicted '1', how many were right.
- Recall    :  86.67% โ†’ Of actual '1', how many we caught.
- F1 Score  :  88.14% โ†’ Balance of precision & recall.

๐Ÿ“Œ Interpretation:
- Precision looks acceptable; false positives under control.
- Recall is strong; model is catching true cases well.
- F1 Score shows overall tradeoff quality: 0.88
๐Ÿ”น ROC AUC Score for XGBoost: 0.9554
๐Ÿ“Œ Interpretation: Model is doing a good job distinguishing between classes.
โœ… XGBoost just beat previous best (DummyClassifier) โ†’ f1: 0.0000 โ†’ 0.8814
โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”

๐Ÿ”ง Training: Random Forest

๐Ÿ“Š Random Forest โ€” Performance Summary:
- Accuracy  :  93.00% โ†’ Overall correctness.
- Precision :  92.59% โ†’ Of predicted '1', how many were right.
- Recall    :  83.33% โ†’ Of actual '1', how many we caught.
- F1 Score  :  87.72% โ†’ Balance of precision & recall.

๐Ÿ“Œ Interpretation:
- Precision looks acceptable; false positives under control.
- Recall is strong; model is catching true cases well.
- F1 Score shows overall tradeoff quality: 0.88
๐Ÿ”น ROC AUC Score for Random Forest: 0.9605
๐Ÿ“Œ Interpretation: Model is doing a good job distinguishing between classes.
โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”

๐Ÿ”ง Training: Decision Tree

๐Ÿ“Š Decision Tree โ€” Performance Summary:
- Accuracy  :  86.00% โ†’ Overall correctness.
- Precision :  70.51% โ†’ Of predicted '1', how many were right.
- Recall    :  91.67% โ†’ Of actual '1', how many we caught.
- F1 Score  :  79.71% โ†’ Balance of precision & recall.

๐Ÿ“Œ Interpretation:
- Precision looks acceptable; false positives under control.
- Recall is strong; model is catching true cases well.
- F1 Score shows overall tradeoff quality: 0.80
๐Ÿ”น ROC AUC Score for Decision Tree: 0.8762
๐Ÿ“Œ Interpretation: Model is doing a good job distinguishing between classes.
โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”
In [54]:
# from pprint import pprint
# pprint(best_model_info)
# pprint(model_results)
In [55]:
# Print current best model based on success_metric
print(f"\n🏆 Best model so far: {best_model_info['name']} "
      f"({success_metric.upper()} = {best_model_info['metrics']['test'][success_metric]:.4f})")

print(f"\n📊 Model Ranking by {success_metric.upper()}:\n")
ranked = sorted(
    model_results.items(),
    key=lambda x: x[1]["metrics"]["test"][success_metric],
    reverse=True
)

for i, (name, result) in enumerate(ranked, 1):
    score = result["metrics"]["test"][success_metric]
    print(f"{i}. {name:<20} {success_metric}: {score:.4f}")
๐Ÿ† Best model so far: XGBoost (F1 = 0.8814)

๐Ÿ“Š Model Ranking by F1:

1. XGBoost              f1: 0.8814
2. Random Forest        f1: 0.8772
3. Decision Tree        f1: 0.7971
4. DummyClassifier      f1: 0.0000
In [56]:
import plotly.graph_objects as go
import plotly.subplots as sp
import pandas as pd

# Extract test metrics
df_results = pd.DataFrame({
    model_name: data["metrics"]["test"]
    for model_name, data in model_results.items()
}).T

# Original metrics you'd like to plot
desired_metrics = ['accuracy', 'precision', 'recall', 'f1', 'roc_auc', 'specificity']

# Filter only those that exist in df_results
metrics = [m for m in desired_metrics if m in df_results.columns]

# Create subplot layout
rows = (len(metrics) + 1) // 2
fig = sp.make_subplots(rows=rows, cols=2, subplot_titles=[m.upper() for m in metrics])

# Plot each available metric
for i, metric in enumerate(metrics):
    row, col = divmod(i, 2)
    fig.add_trace(
        go.Bar(
            x=df_results.index,
            y=df_results[metric],
            name=metric,
            text=pd.to_numeric(df_results[metric], errors="coerce").round(3),
            textposition="auto"
        ),
        row=row+1, col=col+1
    )

fig.update_layout(
    height=300 * rows,
    width=1000,
    title_text="Model Comparison by Metric",
    showlegend=False
)

fig.show()

📊 Feature Importance¶

📖 Click to Expand

Feature importance tells us which variables the model relied on most to make predictions.
It's like asking, "What factors influenced the decision the most?"

In tree-based models like Random Forest or XGBoost, it's calculated based on how often and how effectively a feature was used to split the data.

This is useful for:

  • Understanding the model's decision logic
  • Identifying key business drivers
  • Eliminating irrelevant features
In [57]:
# best_model_info
In [58]:
import pandas as pd
import matplotlib.pyplot as plt

def plot_feature_importance(model=None, feature_names=None, top_n=10, model_name=None):
    """
    Plots top N feature importances.
    Defaults to best_model_info['model'] unless overridden.
    Optionally takes a model_name for the plot title.
    """
    if model is None:
        model = best_model_info["model"]
        model_name = best_model_info.get("name", "Best Model") if model_name is None else model_name
    else:
        model_name = model_name or "Selected Model"

    if feature_names is None:
        feature_names = X_train.columns

    if not hasattr(model, "feature_importances_"):
        raise ValueError("Model does not support feature_importances_")

    importance_df = pd.DataFrame({
        "Feature": feature_names,
        "Importance": model.feature_importances_
    }).sort_values(by="Importance", ascending=False).head(top_n)

    plt.figure(figsize=(8, 5))
    plt.barh(importance_df["Feature"][::-1], importance_df["Importance"][::-1])
    for i, (feature, importance) in enumerate(zip(importance_df["Feature"][::-1], importance_df["Importance"][::-1])):
        plt.text(importance + 0.005, i, f"{importance:.3f}", va='center')
    plt.title(f"Top Feature Importances ({model_name})")
    plt.xlabel("Importance Score")
    plt.tight_layout()
    plt.show()

    return list(importance_df["Feature"])

# โœ… Default: plot for best model
imp_ranked = plot_feature_importance()

# ๐Ÿ› ๏ธ Optional: override model + title
# alt_model = model_results["Random Forest"]["model"]
# imp_ranked = plot_feature_importance(model=alt_model, model_name="Random Forest")

🧬 SHAP Values¶

📖 Click to Expand

SHAP (SHapley Additive exPlanations) values explain how much each feature contributed to a specific prediction — positively or negatively.

It's like breaking down a credit score:
"Age added +12 points, income removed -5 points…"

SHAP is model-agnostic and gives local explanations (for individual predictions) and global insights (feature impact across all predictions).

Useful for:

  • Auditing high-stakes predictions
  • Building trust with stakeholders
  • Diagnosing model behavior case-by-case
In [59]:
import shap

def plot_shap_summary_tree(model=None, X=None, model_name=None):
    """
    Plot SHAP summary for tree-based models (RandomForest, XGBoost).
    Defaults to best_model_info['model'] and X_test.
    """
    if model is None:
        model = best_model_info["model"]
        model_name = model_name or best_model_info.get("name", "Best Model")
    else:
        model_name = model_name or "Selected Model"

    if X is None:
        X = X_test

    explainer = shap.TreeExplainer(model)
    shap_values = explainer.shap_values(X)

    # For binary classification, use shap_values[1]
    if isinstance(shap_values, list) and len(shap_values) == 2:
        shap_values = shap_values[1]

    shap.summary_plot(shap_values, X)

    print(f"\n📌 SHAP Summary for {model_name}:")
    print("- Each bar shows how much that feature influences the model's decision.")
    print("- Features at the top are the most impactful across all predictions.")
    print("- Blue/red indicate direction: does the feature push prediction up or down?")
    print("- Helps us understand *why* the model is confident — not just *what* it predicts.")

    shap_df = pd.DataFrame(np.abs(shap_values), columns=X.columns).mean().sort_values(ascending=False)
    return list(shap_df.index)

# โœ… Default: SHAP for best model
shap_ranked = plot_shap_summary_tree()

# ๐Ÿ› ๏ธ Optional: SHAP for any other model
# alt_model = model_results["Random Forest"]["model"]
# shap_ranked = plot_shap_summary_tree(model=alt_model, model_name="Random Forest")
[23:15:09] WARNING: /Users/runner/work/xgboost/xgboost/src/c_api/c_api.cc:1240: Saving into deprecated binary model format, please consider using `json` or `ubj`. Model format will default to JSON in XGBoost 2.2 if not specified.
📌 SHAP Summary for XGBoost:
- Each bar shows how much that feature influences the model's decision.
- Features at the top are the most impactful across all predictions.
- Blue/red indicate direction: does the feature push prediction up or down?
- Helps us understand *why* the model is confident — not just *what* it predicts.
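
The summary above is a global view. Because SHAP also gives local explanations, the same TreeExplainer output can be sliced row by row to see which features pushed one specific prediction up or down. A minimal sketch (assuming best_model_info and X_test from earlier; row 0 is an arbitrary example):

import numpy as np
import pandas as pd
import shap

explainer = shap.TreeExplainer(best_model_info["model"])
shap_values = explainer.shap_values(X_test)
if isinstance(shap_values, list) and len(shap_values) == 2:
    shap_values = shap_values[1]  # binary case: contributions toward class 1

i = 0  # arbitrary row to explain
local = pd.Series(shap_values[i], index=X_test.columns, name="shap_value")
print(f"Top drivers for row {i} (positive values push toward class 1):")
print(local.sort_values(key=np.abs, ascending=False).head(5))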

Back to the top


🛠️ Fine-Tune¶

📖 Click to Expand
  • Fine-tuning helps unlock the model's full potential by finding better hyperparameter values.
  • It improves accuracy, recall, and other metrics without changing the model type.
  • We typically tune the best-performing model from the baseline round (XGBoost in our case).
  • Two common methods: Grid Search (exhaustive) and Randomized Search (faster, approximate).

🧪 Feature Selection – RFE¶

In [60]:
if False:
    from sklearn.feature_selection import RFE

    # Use full X_train
    X_full = X_train.copy()
    model = best_model_info["model"]

    # Choose how many features to keep (optional: all, top 50%, or fixed)
    n_to_select = max(1, X_full.shape[1] // 2)  # or change to any value

    # Run RFE
    selector = RFE(estimator=model, n_features_to_select=n_to_select, step=1)
    selector.fit(X_full, y_train)

    # Final selected features
    selected_features = list(X_full.columns[selector.support_])
    print(f"โœ… RFE selected features (no filtering): {selected_features}")

🧪 Feature Selection – RFE + SHAP¶

In [61]:
from sklearn.base import clone
from sklearn.metrics import f1_score, accuracy_score, precision_score, recall_score, roc_auc_score
import shap
import numpy as np

def shap_guided_backward_elimination(
    model, X, y,
    shap_ranked=None,
    metric_name=None,
    drop_threshold=0.005,
    min_features=1,
    verbose=True
):
    """
    SHAP-guided backward elimination with early stopping.
    Drops features by SHAP rank until performance drops significantly or hits min_features.
    """
    model_base = clone(model)
    X_curr = X.copy()

    # Determine metric
    metric_name = metric_name or success_metric
    metric_func = {
        "f1": f1_score,
        "accuracy": accuracy_score,
        "precision": precision_score,
        "recall": recall_score,
        "roc_auc": roc_auc_score
    }.get(metric_name)

    if metric_func is None:
        raise ValueError(f"Unsupported metric: {metric_name}")

    # Get SHAP-ranked features if not provided
    if shap_ranked is None:
        explainer = shap.TreeExplainer(model_base.fit(X_curr, y))
        shap_values = explainer.shap_values(X_curr)
        if isinstance(shap_values, list) and len(shap_values) == 2:
            shap_values = shap_values[1]
        shap_importance = np.abs(shap_values).mean(axis=0)
        shap_ranked = X_curr.columns[np.argsort(shap_importance)[::-1]].tolist()
    else:
        shap_ranked = shap_ranked.copy()

    # Initialize tracking
    score_history = []
    previous_score = None

    while len(shap_ranked) >= min_features:
        model = clone(model_base)
        model.fit(X_curr[shap_ranked], y)
        y_pred = model.predict(X_curr[shap_ranked])
        # zero_division only applies to precision/recall/f1; accuracy and roc_auc don't accept it
        if metric_name in ("f1", "precision", "recall"):
            score = metric_func(y, y_pred, zero_division=0)
        else:
            score = metric_func(y, y_pred)
        score_history.append((len(shap_ranked), score, shap_ranked.copy()))

        if verbose:
            feat_list = ", ".join(shap_ranked)
            print(f"✅ {len(shap_ranked)} features → {metric_name}: {score:.4f} → [{feat_list}]")

        # Early stop if score drops significantly
        if previous_score is not None and (previous_score - score) > drop_threshold:
            if verbose:
                print(f"🛑 Stopping early: {metric_name} dropped from {previous_score:.4f} to {score:.4f}")
            break

        previous_score = score
        shap_ranked.pop()  # Drop lowest-ranked SHAP feature

    if not score_history:
        raise ValueError("No elimination steps executed โ€” shap_ranked too short or invalid inputs.")

    # Best configuration
    tolerance = 0.01  # Accept within 1% drop of best score
    best_score = max(score_history, key=lambda x: x[1])[1]
    # Keep all configs that are within tolerance
    candidates = [cfg for cfg in score_history if (best_score - cfg[1]) <= tolerance]
    # Pick one with the fewest features
    best_config = min(candidates, key=lambda x: x[0])
    print(f"\n🎯 Best config: {len(best_config[2])} features → {metric_name}: {best_config[1]:.4f}")
    return best_config[2], score_history
In [62]:
final_features, history = shap_guided_backward_elimination(
    model=best_model_info["model"],
    X=X_train,
    y=y_train,
    shap_ranked=shap_ranked
)
final_features

X_train_full = X_train.copy() # retaining copies for future reference
X_test_full = X_test.copy() # retaining copies for future reference
X_train = X_train[final_features]
X_test  = X_test[final_features]
✅ 10 features → f1: 1.0000 → [Feature_7, Feature_9, Feature_6, Feature_2, Feature_5, Feature_8, Feature_10, Feature_1, Feature_3, Feature_4]
✅ 9 features → f1: 1.0000 → [Feature_7, Feature_9, Feature_6, Feature_2, Feature_5, Feature_8, Feature_10, Feature_1, Feature_3]
✅ 8 features → f1: 1.0000 → [Feature_7, Feature_9, Feature_6, Feature_2, Feature_5, Feature_8, Feature_10, Feature_1]
✅ 7 features → f1: 1.0000 → [Feature_7, Feature_9, Feature_6, Feature_2, Feature_5, Feature_8, Feature_10]
✅ 6 features → f1: 1.0000 → [Feature_7, Feature_9, Feature_6, Feature_2, Feature_5, Feature_8]
✅ 5 features → f1: 1.0000 → [Feature_7, Feature_9, Feature_6, Feature_2, Feature_5]
✅ 4 features → f1: 1.0000 → [Feature_7, Feature_9, Feature_6, Feature_2]
✅ 3 features → f1: 1.0000 → [Feature_7, Feature_9, Feature_6]
✅ 2 features → f1: 0.9896 → [Feature_7, Feature_9]
🛑 Stopping early: f1 dropped from 1.0000 to 0.9896

🎯 Best config: 3 features → f1: 1.0000

🔎 Grid Search¶

📖 Click to Expand
🔍 What is Grid Search?

Grid Search tests all possible combinations of hyperparameters across a fixed grid.
It's exhaustive, simple, and works best when the number of hyperparameters is small.

  • Pros: Comprehensive, easy to understand
  • Cons: Very slow when the search space is large
In [63]:
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

# ๐Ÿ”ง Complete and default-aware param grid
param_grids = {
    "RandomForestClassifier": {
        "n_estimators": [100, 200],              # default: 100
        "max_depth": [None, 5, 10],              # default: None
        "min_samples_split": [2, 5],             # default: 2
        "min_samples_leaf": [1, 2],              # default: 1
        "max_features": ["sqrt", "log2"]         # default: "sqrt"
    },
    "DecisionTreeClassifier": {
        "max_depth": [None, 5, 10],              # default: None
        "min_samples_split": [2, 5],             # default: 2
        "min_samples_leaf": [1, 2],              # default: 1
        "criterion": ["gini", "entropy"]         # default: "gini"
    },
    "GaussianNB": {
        # Note: Naive Bayes (GaussianNB) has limited tunable parameters โ€” only var_smoothing is exposed
        "var_smoothing": [1e-9, 1e-8, 1e-7]      # default: 1e-9
    },
    "LogisticRegression": {
        "C": [0.01, 0.1, 1, 10],                 # default: 1
        "penalty": ["l2"],                       # default: "l2"
        "solver": ["lbfgs"],                     # default: "lbfgs"
        "max_iter": [100, 500]                   # default: 100
    },
    "SVC": {
        "C": [0.1, 1, 10],                       # default: 1
        "kernel": ["linear", "rbf"],             # default: "rbf"
        "gamma": ["scale", "auto"],              # default: "scale"
        "probability": [True]                    # default: False (forced True for AUC)
    },
    "KNeighborsClassifier": {
        "n_neighbors": [3, 5, 7],                # default: 5
        "weights": ["uniform", "distance"],      # default: "uniform"
        "metric": ["euclidean", "manhattan", "minkowski"]  # default: "minkowski"
    },
    "MLPClassifier": {
        "hidden_layer_sizes": [(50,), (100,)],  # default: (100,)
        "activation": ["relu", "tanh"],          # default: "relu"
        "alpha": [0.0001, 0.001],                # default: 0.0001
        "learning_rate": ["constant", "adaptive"],  # default: "constant"
        "max_iter": [200, 500]                   # default: 200
    },
    "XGBClassifier": {
        "n_estimators": [100, 200],
        "max_depth": [3, 5, 7],
        "learning_rate": [0.01, 0.1],
        "subsample": [0.8, 1.0],
        "colsample_bytree": [0.8, 1.0],
        "scale_pos_weight": [1, 2]  # useful for class imbalance
    }
}
In [64]:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import (
    precision_score, recall_score, f1_score, accuracy_score,
    roc_auc_score, confusion_matrix, log_loss
)

# โš™๏ธ Resolve model name and corresponding grid
model_name = best_model_info["model"].__class__.__name__  # โœ… fixed here
param_grid = param_grids.get(model_name)

if param_grid is None:
    raise ValueError(f"No param grid defined for model: {model_name}")

print(f"\n๐Ÿ”ง Running Grid Search for: {model_name}")

# ๐Ÿงช Run Grid Search
model_instance = best_model_info["model"].__class__()

grid_search = GridSearchCV(
    estimator=model_instance,
    param_grid=param_grid,
    scoring="f1",
    cv=5,
    n_jobs=-1,
    verbose=1
)

grid_search.fit(X_train, y_train)
best_tuned_model = grid_search.best_estimator_

print("โœ… Best Parameters Found:")
print(grid_search.best_params_)

# ๐Ÿ“ˆ Evaluate tuned model
y_test_pred = best_tuned_model.predict(X_test)

if hasattr(best_tuned_model, "predict_proba"):
    y_scores = best_tuned_model.predict_proba(X_test)[:, 1]
elif hasattr(best_tuned_model, "decision_function"):
    y_scores = best_tuned_model.decision_function(X_test)
else:
    y_scores = y_test_pred

cm = confusion_matrix(y_test, y_test_pred)
tn, fp, fn, tp = cm.ravel()

# Metrics
precision = precision_score(y_test, y_test_pred, zero_division=0)
recall = recall_score(y_test, y_test_pred, zero_division=0)
f1 = f1_score(y_test, y_test_pred, zero_division=0)
accuracy = accuracy_score(y_test, y_test_pred)
auc = roc_auc_score(y_test, y_scores)
specificity = tn / (tn + fp)
logloss = log_loss(y_test, y_scores)

# Add to model_results under a new key, keeping the nested metrics schema used by the earlier entries
model_results[f"{model_name} (Tuned)"] = {
    "model": best_tuned_model,
    "metrics": {
        "test": {
            "accuracy": accuracy,
            "precision": precision,
            "recall": recall,
            "f1": f1,
            "roc_auc": auc,
            "specificity": specificity,
            "log_loss": logloss
        }
    }
}

# Evaluation summary
evaluate_model(y_test, y_test_pred, label=f"{model_name} (Tuned)")
plot_confusion(y_test, y_test_pred, model_name=f"{model_name} (Tuned)")
plot_roc_auc(best_tuned_model, X_test, y_test, model_name=f"{model_name} (Tuned)")
🔧 Running Grid Search for: XGBClassifier
Fitting 5 folds for each of 96 candidates, totalling 480 fits
✅ Best Parameters Found:
{'colsample_bytree': 1.0, 'learning_rate': 0.01, 'max_depth': 5, 'n_estimators': 200, 'scale_pos_weight': 1, 'subsample': 0.8}

📊 XGBClassifier (Tuned) — Performance Summary:
- Accuracy  :  93.00% → Overall correctness.
- Precision :  94.23% → Of predicted '1', how many were right.
- Recall    :  81.67% → Of actual '1', how many we caught.
- F1 Score  :  87.50% → Balance of precision & recall.

📌 Interpretation:
- Precision looks acceptable; false positives under control.
- Recall is strong; model is catching true cases well.
- F1 Score shows overall tradeoff quality: 0.87
🔹 ROC AUC Score for XGBClassifier (Tuned): 0.9700
📌 Interpretation: Model is doing a good job distinguishing between classes.
In [65]:
# best_model_info

🎲 Randomized Search¶

📖 Click to Expand
🔍 What is Randomized Search?

Randomized Search selects a random subset of combinations to test, rather than all of them.
It's faster and often just as effective — especially when only a few hyperparameters really matter.

  • Pros: Much faster than grid search, good for large spaces
  • Cons: May miss the optimal combo if unlucky
In [66]:
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint, uniform
from sklearn.metrics import (
    precision_score, recall_score, f1_score, accuracy_score,
    roc_auc_score, confusion_matrix, log_loss
)

# ๐Ÿ” Use same param grid as defined earlier
model_name = best_model_info["model"].__class__.__name__
param_dist = param_grids.get(model_name)

if param_dist is None:
    raise ValueError(f"No param distribution defined for model: {model_name}")

print(f"\n๐ŸŽฒ Running Randomized Search for: {model_name}")

# Create a new instance of the model
model_instance = best_model_info["model"].__class__()

# ๐Ÿ” Run randomized search
random_search = RandomizedSearchCV(
    estimator=model_instance,
    param_distributions=param_dist,
    n_iter=15,
    scoring="f1",
    cv=5,
    n_jobs=-1,
    verbose=1,
    random_state=42
)

random_search.fit(X_train, y_train)
best_random_model = random_search.best_estimator_

print("โœ… Best Parameters Found:")
print(random_search.best_params_)

# ๐Ÿ”Ž Evaluate tuned model
y_test_pred = best_random_model.predict(X_test)

if hasattr(best_random_model, "predict_proba"):
    y_scores = best_random_model.predict_proba(X_test)[:, 1]
elif hasattr(best_random_model, "decision_function"):
    y_scores = best_random_model.decision_function(X_test)
else:
    y_scores = y_test_pred

cm = confusion_matrix(y_test, y_test_pred)
tn, fp, fn, tp = cm.ravel()

# Metrics
precision = precision_score(y_test, y_test_pred, zero_division=0)
recall = recall_score(y_test, y_test_pred, zero_division=0)
f1 = f1_score(y_test, y_test_pred, zero_division=0)
accuracy = accuracy_score(y_test, y_test_pred)
auc = roc_auc_score(y_test, y_scores)
specificity = tn / (tn + fp)
logloss = log_loss(y_test, y_scores)

# Store results, keeping the nested metrics schema used by the earlier entries
model_results[f"{model_name} (RandomSearch)"] = {
    "model": best_random_model,
    "metrics": {
        "test": {
            "accuracy": accuracy,
            "precision": precision,
            "recall": recall,
            "f1": f1,
            "roc_auc": auc,
            "specificity": specificity,
            "log_loss": logloss
        }
    }
}

# Visual eval
evaluate_model(y_test, y_test_pred, label=f"{model_name} (RandomSearch)")
plot_confusion(y_test, y_test_pred, model_name=f"{model_name} (RandomSearch)")
plot_roc_auc(best_random_model, X_test, y_test, model_name=f"{model_name} (RandomSearch)")
🎲 Running Randomized Search for: XGBClassifier
Fitting 5 folds for each of 15 candidates, totalling 75 fits
✅ Best Parameters Found:
{'subsample': 0.8, 'scale_pos_weight': 1, 'n_estimators': 100, 'max_depth': 5, 'learning_rate': 0.1, 'colsample_bytree': 1.0}

📊 XGBClassifier (RandomSearch) — Performance Summary:
- Accuracy  :  92.00% → Overall correctness.
- Precision :  89.29% → Of predicted '1', how many were right.
- Recall    :  83.33% → Of actual '1', how many we caught.
- F1 Score  :  86.21% → Balance of precision & recall.

📌 Interpretation:
- Precision looks acceptable; false positives under control.
- Recall is strong; model is catching true cases well.
- F1 Score shows overall tradeoff quality: 0.86
🔹 ROC AUC Score for XGBClassifier (RandomSearch): 0.9594
📌 Interpretation: Model is doing a good job distinguishing between classes.
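
The cell above reuses the discrete grid, which RandomizedSearchCV simply samples from. The randint / uniform imports hint at the other option: continuous distributions, so the search can land between grid points. A minimal sketch (assuming X_train / y_train from earlier; the ranges are illustrative, not tuned recommendations):

from scipy.stats import randint, uniform
from sklearn.model_selection import RandomizedSearchCV
from xgboost import XGBClassifier

param_dist_continuous = {
    "n_estimators": randint(100, 500),        # integers sampled from [100, 500)
    "max_depth": randint(3, 9),
    "learning_rate": uniform(0.01, 0.19),     # floats sampled from [0.01, 0.20)
    "subsample": uniform(0.7, 0.3),           # floats sampled from [0.7, 1.0)
    "colsample_bytree": uniform(0.7, 0.3),
}

random_search_cont = RandomizedSearchCV(
    estimator=XGBClassifier(eval_metric="logloss"),
    param_distributions=param_dist_continuous,
    n_iter=30, scoring="f1", cv=5, n_jobs=-1, random_state=42,
)
random_search_cont.fit(X_train, y_train)
print("Best params:", random_search_cont.best_params_)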

Back to the top


🔀 Ensemble Methods (Templates)¶

📖 Click to Expand
🔀 When Should You Use Ensemble Methods?

Ensembles are useful when:

  • Single models plateau and can't capture all patterns
  • You want to boost performance by combining strengths of multiple models
  • You observe inconsistent results across base models (e.g., one is good at recall, another at precision)
  • You need more robust and stable predictions across different datasets

Use ensembles after benchmarking individual models — they add complexity but often yield better generalization.

🗳️ Voting Classifier¶

📖 Click to Expand
🗳️ What is a Voting Classifier?

A Voting Classifier combines predictions from multiple different models and makes a final decision based on majority vote (for classification) or average prediction (for regression).

There are two main types:

  • Hard Voting: Chooses the class predicted by the most models.
  • Soft Voting: Averages predicted probabilities and chooses the most likely class.

It's like consulting multiple doctors and going with the consensus.

In [67]:
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB

# Define voting type: 'hard' or 'soft'
voting_type = 'hard'  # change to 'soft' for probability-averaged voting (also enables the ROC plot below)

# Define the ensemble
voting_clf = VotingClassifier(
    estimators=[
        ('lr', LogisticRegression(max_iter=1000)),
        ('dt', DecisionTreeClassifier()),
        ('nb', GaussianNB())
    ],
    voting=voting_type
)

# Train the ensemble
print(f"🔧 Training: Voting Classifier ({voting_type})")
voting_clf.fit(X_train, y_train)

# Predict labels
y_pred_voting = voting_clf.predict(X_test)

# Evaluate
plot_confusion(y_test, y_pred_voting, model_name=f"Voting Classifier ({voting_type})")

# Only plot ROC if model supports probability estimates
if voting_type == 'soft':
    plot_roc_auc(voting_clf, X_test, y_test, model_name=f"Voting Classifier ({voting_type})")

evaluate_model(y_test, y_pred_voting, label=f"Voting Classifier ({voting_type})")
🔧 Training: Voting Classifier (hard)
📊 Voting Classifier (hard) - Performance Summary:
- Accuracy  :  88.00% → Overall correctness.
- Precision :  90.91% → Of predicted '1', how many were right.
- Recall    :  66.67% → Of actual '1', how many we caught.
- F1 Score  :  76.92% → Balance of precision & recall.

📌 Interpretation:
- Precision looks acceptable; false positives under control.
- Recall is comparatively weak here; the hard-voting ensemble misses about a third of the true positives.
- F1 Score shows overall tradeoff quality: 0.77
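The run above used hard voting, so no ROC curve was plotted. Below is a minimal soft-voting sketch, assuming the same train/test split and the plot/evaluate helpers defined earlier; the weights are illustrative, not tuned.

In [ ]:
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB

# Soft voting averages predict_proba across base models, so ROC/AUC is available.
# The weights just tilt the average toward the linear model (illustrative only).
soft_voting_clf = VotingClassifier(
    estimators=[
        ('lr', LogisticRegression(max_iter=1000)),
        ('dt', DecisionTreeClassifier()),
        ('nb', GaussianNB())
    ],
    voting='soft',
    weights=[2, 1, 1]
)

soft_voting_clf.fit(X_train, y_train)
y_pred_soft = soft_voting_clf.predict(X_test)

plot_confusion(y_test, y_pred_soft, model_name="Voting Classifier (soft)")
plot_roc_auc(soft_voting_clf, X_test, y_test, model_name="Voting Classifier (soft)")
evaluate_model(y_test, y_pred_soft, label="Voting Classifier (soft)")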

🧬 Stacking Classifier¶

📖 Click to Expand
🧬 What is Stacking?

Stacking involves training multiple models (called base models), and then using a meta-model to learn how to best combine their outputs.

Example:

  • Base models: logistic regression, decision tree, SVM
  • Meta-model: another model that learns which base model to trust more for each kind of input

It's like having specialists give their opinions, and then a generalist makes the final call based on their inputs.

In [68]:
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB

# Define base models
base_estimators = [
    ('lr', LogisticRegression(max_iter=1000)),
    ('dt', DecisionTreeClassifier()),
    ('nb', GaussianNB())
]

# Define meta-model (final estimator)
meta_model = LogisticRegression()

# Build stacking classifier
stacking_clf = StackingClassifier(
    estimators=base_estimators,
    final_estimator=meta_model,
    passthrough=False,  # set to True if you want raw features included in meta-model input
    cv=5                # internal cross-validation
)

# Train the ensemble
print("๐Ÿ”ง Training: Stacking Classifier")
stacking_clf.fit(X_train, y_train)

# Predict labels
y_pred_stack = stacking_clf.predict(X_test)

# Evaluate
plot_confusion(y_test, y_pred_stack, model_name="Stacking Classifier")
plot_roc_auc(stacking_clf, X_test, y_test, model_name="Stacking Classifier")
evaluate_model(y_test, y_pred_stack, label="Stacking Classifier")
🔧 Training: Stacking Classifier
🔹 ROC AUC Score for Stacking Classifier: 0.9429
📌 Interpretation: Model is doing a good job distinguishing between classes.

📊 Stacking Classifier - Performance Summary:
- Accuracy  :  93.50% → Overall correctness.
- Precision :  88.52% → Of predicted '1', how many were right.
- Recall    :  90.00% → Of actual '1', how many we caught.
- F1 Score  :  89.26% → Balance of precision & recall.

📌 Interpretation:
- Precision looks acceptable; false positives under control.
- Recall is strong; model is catching true cases well.
- F1 Score shows overall tradeoff quality: 0.89
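To see which base model the meta-model leans on, you can inspect the fitted final estimator. A small sketch, assuming stacking_clf from the cell above has been fit; with a binary target each base model contributes one probability column, so there is one coefficient per base estimator.

In [ ]:
# Peek at the logistic-regression meta-model's coefficients (one per base model here).
meta = stacking_clf.final_estimator_
for (name, _), coef in zip(base_estimators, meta.coef_[0]):
    print(f"{name}: meta-model coefficient = {coef:.3f}")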

🪵 Bagging¶

📖 Click to Expand
🪵 What is Bagging?

Bagging (Bootstrap Aggregating) builds multiple versions of the same model (e.g., decision trees), each trained on a different random sample of the data.

Then it combines their outputs (usually by voting or averaging) to reduce overfitting and variance.

Random Forest is a popular example of bagging.

It's like asking the same expert multiple times under different conditions and averaging their answers.

In [69]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier            # fast, default
from sklearn.linear_model import LogisticRegression        # works well with linear patterns
from sklearn.neighbors import KNeighborsClassifier         # unstable, benefits a lot from bagging
from sklearn.svm import SVC                                # slow with bagging, use carefully
from sklearn.naive_bayes import GaussianNB                 # rare with bagging (already stable)
from sklearn.ensemble import RandomForestClassifier        # not recommended; it's already bagged

# Example usage:
# base_estimator = LogisticRegression(max_iter=1000)
# base_estimator = KNeighborsClassifier()
# base_estimator = SVC(probability=True)
# base_estimator = GaussianNB()

# Define bagging classifier with decision trees
bagging_clf = BaggingClassifier(
    estimator=DecisionTreeClassifier(),   # 'base_estimator' was renamed to 'estimator' in scikit-learn 1.2
    n_estimators=50,              # number of trees
    max_samples=0.8,              # bootstrap sample size
    max_features=1.0,             # use all features
    random_state=42,
    n_jobs=-1                     # parallel processing
)

# Train the ensemble
print("๐Ÿ”ง Training: Bagging Classifier")
bagging_clf.fit(X_train, y_train)

# Predict
y_pred_bag = bagging_clf.predict(X_test)

# Evaluate
plot_confusion(y_test, y_pred_bag, model_name="Bagging Classifier")
plot_roc_auc(bagging_clf, X_test, y_test, model_name="Bagging Classifier")
evaluate_model(y_test, y_pred_bag, label="Bagging Classifier")
🔧 Training: Bagging Classifier
🔹 ROC AUC Score for Bagging Classifier: 0.9685
📌 Interpretation: Model is doing a good job distinguishing between classes.

📊 Bagging Classifier - Performance Summary:
- Accuracy  :  92.00% → Overall correctness.
- Precision :  87.93% → Of predicted '1', how many were right.
- Recall    :  85.00% → Of actual '1', how many we caught.
- F1 Score  :  86.44% → Balance of precision & recall.

📌 Interpretation:
- Precision looks acceptable; false positives under control.
- Recall is strong; model is catching true cases well.
- F1 Score shows overall tradeoff quality: 0.86
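Because each tree only sees a bootstrap sample, the rows it never saw can act as a free validation set. A minimal out-of-bag (OOB) sketch is below, assuming scikit-learn 1.2+ (which uses estimator= rather than base_estimator=) and the same X_train/y_train as above.

In [ ]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Score each tree on the rows left out of its bootstrap draw,
# giving a validation-like accuracy without a separate holdout set.
bagging_oob = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=50,
    max_samples=0.8,
    oob_score=True,            # requires bootstrap=True (the default)
    random_state=42,
    n_jobs=-1
)
bagging_oob.fit(X_train, y_train)
print(f"OOB accuracy estimate: {bagging_oob.oob_score_:.4f}")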

🚀 Boosting¶

📖 Click to Expand
🚀 What is Boosting?

Boosting trains models sequentially: each new model focuses on correcting the mistakes of the previous one.

It gives more weight to errors and slowly builds a strong overall model by combining many weak ones.

Popular examples: XGBoost, AdaBoost, Gradient Boosting

Think of it as building knowledge step by step, learning from past failures to get better over time.

In [70]:
from sklearn.ensemble import GradientBoostingClassifier     # used below; ships with scikit-learn
from xgboost import XGBClassifier                           # optional alternative (requires xgboost)
from lightgbm import LGBMClassifier                         # optional alternative (requires lightgbm)
from catboost import CatBoostClassifier                     # optional alternative (requires catboost)

# Define the boosting classifier
boosting_clf = GradientBoostingClassifier(
    n_estimators=100,        # number of boosting rounds
    learning_rate=0.1,       # step size shrinkage
    max_depth=3,             # depth of each weak learner
    subsample=1.0,           # can be <1.0 for stochastic gradient boosting
    random_state=42
)

# Train the ensemble
print("๐Ÿ”ง Training: Boosting Classifier")
boosting_clf.fit(X_train, y_train)

# Predict
y_pred_boost = boosting_clf.predict(X_test)

# Evaluate
plot_confusion(y_test, y_pred_boost, model_name="Boosting Classifier")
plot_roc_auc(boosting_clf, X_test, y_test, model_name="Boosting Classifier")
evaluate_model(y_test, y_pred_boost, label="Boosting Classifier")
🔧 Training: Boosting Classifier
🔹 ROC AUC Score for Boosting Classifier: 0.9738
📌 Interpretation: Model is doing a good job distinguishing between classes.

📊 Boosting Classifier - Performance Summary:
- Accuracy  :  95.00% → Overall correctness.
- Precision :  93.10% → Of predicted '1', how many were right.
- Recall    :  90.00% → Of actual '1', how many we caught.
- F1 Score  :  91.53% → Balance of precision & recall.

📌 Interpretation:
- Precision looks acceptable; false positives under control.
- Recall is strong; model is catching true cases well.
- F1 Score shows overall tradeoff quality: 0.92
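The "step by step" nature of boosting is easy to see by scoring the model after each round. A small sketch, assuming boosting_clf from the cell above has been fit with n_estimators=100 and that X_test/y_test are the same split used earlier.

In [ ]:
from sklearn.metrics import accuracy_score

# staged_predict yields the ensemble's predictions after each boosting round,
# so we can trace how test accuracy improves as rounds accumulate.
staged_acc = [
    accuracy_score(y_test, y_stage)
    for y_stage in boosting_clf.staged_predict(X_test)
]
print(f"Accuracy after  10 rounds: {staged_acc[9]:.4f}")
print(f"Accuracy after  50 rounds: {staged_acc[49]:.4f}")
print(f"Accuracy after 100 rounds: {staged_acc[-1]:.4f}")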

Back to the top


📦 Export & Deployment (Optional)¶

📖 Click to Expand
  • Save the final trained model to disk (e.g., .pkl, .joblib)
  • Export final evaluation metrics (e.g., to .json or .csv)
  • Package preprocessing steps if applicable (e.g., scalers, encoders)
  • Useful for handing off, sharing, or production integration

🧊 Pickling (Model Export)¶

In [71]:
import joblib
import json
import os

export = False  # set to True to write the model and metrics to disk
if export:
    # 📦 Create export folder if it doesn't exist
    os.makedirs("export", exist_ok=True)

    # 💾 Save the best model
    joblib.dump(best_model, "export/best_model.joblib")

    # 🧮 Prepare and save the evaluation metrics (exclude the model object)
    metrics_copy = {k: v for k, v in model_results[best_model_name].items() if k != "model"}
    with open("export/metrics.json", "w") as f:
        json.dump(metrics_copy, f, indent=2)

    print("✅ Model and metrics exported to /export/")

📊 Monitoring Hooks (Production Logging)¶

📖 Click to Expand

In real-world deployments, it's crucial to track how your model behaves once it's live.

What to log in production:

  • ✅ Number of predictions served
  • ✅ Confidence scores / prediction probabilities
  • ✅ Class distribution over time
  • ✅ Drift in input features
  • ✅ Model response latency
  • ❌ Ground truth (usually delayed or unavailable)

You can integrate this with tools like:

  • Prometheus + Grafana
  • AWS CloudWatch
  • Datadog
  • MLflow, EvidentlyAI, or WhyLabs
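As a concrete starting point, here is a minimal sketch of a prediction wrapper that logs the first few items from the list above (volume, confidence, class mix, latency). It assumes a fitted scikit-learn-style model with predict_proba; in practice the log line would be routed to whichever monitoring backend you use.

In [ ]:
import logging
import time
import numpy as np

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("model_monitor")

def predict_with_logging(model, X_batch):
    """Serve predictions while logging volume, confidence, class mix, and latency."""
    start = time.perf_counter()
    proba = model.predict_proba(X_batch)[:, 1]          # positive-class probabilities
    preds = (proba >= 0.5).astype(int)
    latency_ms = (time.perf_counter() - start) * 1000

    logger.info(
        "served=%d | mean_confidence=%.3f | positive_rate=%.3f | latency_ms=%.1f",
        len(preds),
        float(np.mean(np.maximum(proba, 1 - proba))),   # confidence in the predicted class
        float(np.mean(preds)),                          # class-distribution drift signal
        latency_ms,
    )
    return preds

# Example call (assumes best_model and X_test from earlier cells):
# _ = predict_with_logging(best_model, X_test[:100])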

Back to the top