Baseline Classifier Model
Classification is a type of supervised machine learning where the goal is to predict a categorical label for an observation. Given a set of features (input data), the model tries to assign the observation to one of several predefined classes.
Common examples of classification problems include:
- Spam detection (spam vs. not spam)
- Customer churn prediction (churn vs. no churn)
- Fraud detection
- Medical diagnosis (disease present vs. absent)
In classification, the output is discrete (e.g., 'spam' vs 'not spam', 'churn' vs 'no churn'). This contrasts with regression, where the output is continuous (e.g., predicting a house price).
In this section, we will begin by preparing the dataset. For simplicity, we'll use a simulated classification dataset generated using the make_classification function from sklearn. This allows us to create a synthetic dataset that is suitable for practicing classification tasks.
We will simulate a dataset with the following properties:
- 1,000 samples and 10 features, of which 2 are informative and 2 are redundant
- A binary target with a 70/30 class split to simulate class imbalance
- 1% label noise (flip_y=0.01) and reduced class separation (class_sep=0.8) to make the task harder
Let's generate and take a look at the data.
# Data handling and manipulation
import pandas as pd
import numpy as np
# Machine Learning and Model Evaluation
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, TimeSeriesSplit, KFold
from sklearn.decomposition import PCA
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
# Statistical and Other Utilities
from scipy.stats import zscore
from termcolor import colored
# Visualization
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
import numpy as np
import pandas as pd
# Simulate base classification dataset
X, y = make_classification(
n_samples=1000,
n_features=10,
n_informative=2,
n_redundant=2,
n_repeated=0,
n_classes=2,
weights=[0.7, 0.3], # simulate class imbalance
flip_y=0.01, # 1% label noise
class_sep=0.8, # less separation = harder task
random_state=42
)
# Create DataFrame
df = pd.DataFrame(X, columns=[f"Feature_{i}" for i in range(1, 11)])
target_col = "Target"
df[target_col] = y
# Inject missing values randomly (e.g., 1% of cells)
# mask = np.random.rand(*df.shape) < 0.01
# df[mask] = np.nan
# Display preview
df.head()
Feature_1 | Feature_2 | Feature_3 | Feature_4 | Feature_5 | Feature_6 | Feature_7 | Feature_8 | Feature_9 | Feature_10 | Target | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.959085 | -0.066449 | 0.918572 | -0.358079 | 0.997266 | 1.181890 | -1.415679 | -1.210161 | -0.828077 | 1.227274 | 0 |
1 | -0.910796 | -0.566395 | -0.940419 | 0.831617 | -1.176962 | 1.820544 | 1.552375 | -0.984534 | 0.563896 | 0.209470 | 1 |
2 | -0.103769 | -0.432774 | -0.389454 | 0.793818 | -0.268646 | -1.836360 | 1.039086 | -0.246383 | -0.858145 | -0.297376 | 1 |
3 | 1.580930 | 2.023606 | 1.542262 | 0.006800 | -1.607661 | 0.184741 | -2.419427 | -0.357445 | -1.273127 | -0.190039 | 0 |
4 | -0.006898 | -0.711303 | 0.139918 | 0.117124 | 1.536061 | 0.597538 | -0.437329 | -0.939156 | 0.484698 | 0.236224 | 0 |
This section initializes the data characteristics dictionary, which will store various metadata about the dataset, including details about the target variable, features, data size, and linear separability.
The dictionary contains the following key sections:
- target_variable: type (binary/multiclass), an imbalance flag, and the imbalance severity
- features: feature type, correlation level, an outlier flag, and the percentage of missing data
- data_size: the (samples, features) shape of the dataset
- linear_separability: whether the classes appear linearly separable
This dictionary will be updated dynamically as we analyze the dataset in subsequent steps. It serves as a summary of key dataset properties to help guide further analysis and modeling decisions.
# Initialize the data characteristics dictionary
data_characteristics = {
"target_variable": {
"type": None, # "binary", "multiclass"
"imbalance": None, # True if imbalanced, False otherwise
"class_imbalance_severity": None # e.g., "high", "low"
},
"features": {
"type": None, # "categorical", "continuous", "mixed"
"correlation": None, # "low", "medium", "high"
"outliers": None, # True if outliers detected, False otherwise
"missing_data": None # Percentage of missing data or boolean
},
"data_size": None, # Size of dataset (samples, features)
"linear_separability": None # True if classes are linearly separable
}
import numpy as np
import pandas as pd
from scipy.stats import pearsonr
# If needed, convert X and y to DataFrame and Series
if isinstance(X, np.ndarray):
X_df = pd.DataFrame(X, columns=[f"Feature_{i}" for i in range(X.shape[1])])
else:
X_df = X
if isinstance(y, np.ndarray):
y_series = pd.Series(y, name="Target")
else:
y_series = y
# Target-related
target_type = "binary" if y_series.nunique() == 2 else "multiclass"
imbalance_ratio = y_series.value_counts(normalize=True).min()
imbalance_flag = imbalance_ratio < 0.4
imbalance_severity = "high" if imbalance_ratio < 0.2 else "low" if imbalance_ratio < 0.4 else "balanced"
# Feature-related
num_cols = X_df.select_dtypes(include=["number"]).shape[1]
cat_cols = X_df.select_dtypes(exclude=["number"]).shape[1]
feature_type = "continuous" if cat_cols == 0 else "categorical" if num_cols == 0 else "mixed"
missing_pct = X_df.isna().mean().mean()
outlier_flag = any(X_df.apply(lambda col: (col > col.mean() + 3 * col.std()) | (col < col.mean() - 3 * col.std())).sum() > 0)
# Correlation level - only if continuous
if feature_type == "continuous":
corr_matrix = X_df.corr().abs()
upper_tri = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
avg_corr = upper_tri.stack().mean()
corr_level = "high" if avg_corr > 0.7 else "medium" if avg_corr > 0.3 else "low"
else:
corr_level = "N/A"
# Final update
data_characteristics.update({
"target_variable": {
"type": target_type,
"imbalance": imbalance_flag,
"class_imbalance_severity": imbalance_severity
},
"features": {
"type": feature_type,
"correlation": corr_level,
"outliers": outlier_flag,
"missing_data": f"{missing_pct:.2%}"
},
"data_size": X_df.shape,
"linear_separability": None
})
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
# Infer positive class
positive_class = y_series.unique()[1] if len(y_series.unique()) == 2 else 1
# PCA to 2D
pca = PCA(n_components=2, random_state=42)
X_pca = pca.fit_transform(X_df)
# Train/test split
X_pca_train, X_pca_test, y_pca_train, y_pca_test = train_test_split(
X_pca, y_series, test_size=0.2, random_state=42, stratify=y_series
)
# Fit linear model
clf = LogisticRegression()
clf.fit(X_pca_train, y_pca_train)
y_pred_pca = clf.predict(X_pca_test)
# F1-based separability score
f1_pca = f1_score(y_pca_test, y_pred_pca, pos_label=positive_class, zero_division=0)
data_characteristics["linear_separability"] = f1_pca > 0.75 # Adjustable threshold
# Optional print
print(f"โ
Linear separability (2D PCA, Logistic F1): {f1_pca:.2f}")
print(f"โช Updated: linear_separability = {data_characteristics['linear_separability']}")
if f1_pca > 0.85:
    interpretation = "Strong linear separability in 2D - linear models likely to perform well."
elif f1_pca > 0.7:
    interpretation = "Moderate linear separability - linear models may work with tuning."
else:
    interpretation = "Poor linear separability - expect better results with non-linear models."
print(f"Interpretation: {interpretation}")
Linear separability (2D PCA, Logistic F1): 0.74
Updated: linear_separability = False
Interpretation: Moderate linear separability - linear models may work with tuning.
from pprint import pprint
pprint(data_characteristics)
{'data_size': (1000, 10), 'features': {'correlation': 'low', 'missing_data': '0.00%', 'outliers': True, 'type': 'continuous'}, 'linear_separability': False, 'target_variable': {'class_imbalance_severity': 'low', 'imbalance': True, 'type': 'binary'}}
from sklearn.model_selection import train_test_split
# Define features and target
X = df.drop(columns=target_col)
y = df[target_col]
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, stratify=y, random_state=42
)
print("โ
Data split complete:")
print(f"Train size: {len(X_train)}, Test size: {len(X_test)}")
โ Data split complete: Train size: 800, Test size: 200
In this section, we define the baseline model for the classification task. The baseline model is typically a dummy model that can be used to compare against more sophisticated models. Here, we use the DummyClassifier, which predicts the majority class, to set a baseline performance.
The baseline model will help us assess if more advanced models (e.g., Random Forest, SVM) are making meaningful improvements over a simple strategy.
Why Track best_model_info?
In real-world pipelines, it's critical to:
- Keep a single source of truth for which model is currently best and why
- Record the metrics and hyperparameters that produced that result
- Compare every new candidate against the current best using one consistent success metric
# Initialize Central tracker dictionary to track best model details upon iterations
best_model_info = {
"name": None,
"model": None,
"metrics": {
"train": {
"accuracy": -np.inf,
"precision": -np.inf,
"recall": -np.inf,
"f1": -np.inf,
"roc_auc": -np.inf
# Note: confusion_matrix and classification_report omitted for train
# because they're redundant and cluttered for internal training fit
},
"test": {
"accuracy": -np.inf,
"precision": -np.inf,
"recall": -np.inf,
"f1": -np.inf,
"roc_auc": -np.inf,
"confusion_matrix": None,
"classification_report": None
}
},
"hyperparameters": None
}
# Dictionary to store all model performance results for comparison
model_results = {}
# Metric to decide which model is "best"
# Common choices (ranked by practical usage):
# 1. "f1" โ balanced precision/recall (default choice, esp. with class imbalance)
# 2. "roc_auc" โ good for imbalanced classes, uses probability scores
# 3. "accuracy" โ only when classes are balanced and all errors are equal
# 4. "precision" โ when false positives are costly (e.g., spam detection)
# 5. "recall" โ when false negatives are costly (e.g., fraud, cancer)
# Success metric used to select the best model
success_metric = "f1" # or "roc_auc", depending on use case
# success_split = "test" # "train" or "test"
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report
# Fit a dummy classifier as a baseline
dummy_clf = DummyClassifier(strategy="most_frequent") # or try "stratified", "uniform"
dummy_clf.fit(X_train, y_train)
DummyClassifier(strategy='most_frequent')
# Predict on both train and test
y_train_pred = dummy_clf.predict(X_train)
y_test_pred = dummy_clf.predict(X_test)
Precision, Recall, and F1 Score are classification metrics that help us understand model performance beyond just accuracy:
- Precision: of all observations predicted as positive, how many actually were positive (penalizes false positives).
- Recall: of all actual positives, how many the model caught (penalizes false negatives).
- F1 Score: the harmonic mean of precision and recall, balancing the two.

Business Perspective: these metrics are vital when accuracy is misleading - especially in skewed datasets.
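To make these definitions concrete, here is a small hand computation on made-up labels (illustrative only, not drawn from our dataset; the y_true_demo / y_pred_demo names are new here), cross-checked against scikit-learn.

import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score

# Illustrative labels (1 = positive class); counts chosen by hand
y_true_demo = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred_demo = np.array([1, 1, 0, 0, 1, 0, 0, 0, 0, 0])

tp = np.sum((y_true_demo == 1) & (y_pred_demo == 1))   # 2 true positives
fp = np.sum((y_true_demo == 0) & (y_pred_demo == 1))   # 1 false positive
fn = np.sum((y_true_demo == 1) & (y_pred_demo == 0))   # 2 false negatives

precision = tp / (tp + fp)                              # 2/3 ~ 0.67
recall = tp / (tp + fn)                                 # 2/4 = 0.50
f1 = 2 * precision * recall / (precision + recall)      # ~ 0.57

print(f"Manual : precision={precision:.2f}, recall={recall:.2f}, f1={f1:.2f}")
print(f"Sklearn: precision={precision_score(y_true_demo, y_pred_demo):.2f}, "
      f"recall={recall_score(y_true_demo, y_pred_demo):.2f}, "
      f"f1={f1_score(y_true_demo, y_pred_demo):.2f}")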
from sklearn.metrics import classification_report, accuracy_score, precision_score, recall_score, f1_score
# Technical output
print("๐ Classification Report\n")
print(classification_report(y_test, y_test_pred))
๐ Classification Report precision recall f1-score support 0 0.70 1.00 0.82 140 1 0.00 0.00 0.00 60 accuracy 0.70 200 macro avg 0.35 0.50 0.41 200 weighted avg 0.49 0.70 0.58 200
/Users/ashrithreddy/anaconda3/lib/python3.11/site-packages/sklearn/metrics/_classification.py:1344: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior. _warn_prf(average, modifier, msg_start, len(result))
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
# Determine positive class once
positive_class = y_train.unique()[1] if len(y_train.unique()) == 2 else 1
def evaluate_model(y_true, y_pred, label="Model"):
    acc  = accuracy_score(y_true, y_pred)
    prec = precision_score(y_true, y_pred, pos_label=positive_class, average='binary', zero_division=0)
    rec  = recall_score(y_true, y_pred, pos_label=positive_class, average='binary', zero_division=0)
    f1   = f1_score(y_true, y_pred, pos_label=positive_class, average='binary', zero_division=0)

    # Aligned core metrics
    print(f"\n{label} - Performance Summary:")
    print(f"- Accuracy  : {acc:>7.2%} -> Overall correctness.")
    print(f"- Precision : {prec:>7.2%} -> Of predicted '{positive_class}', how many were right.")
    print(f"- Recall    : {rec:>7.2%} -> Of actual '{positive_class}', how many we caught.")
    print(f"- F1 Score  : {f1:>7.2%} -> Balance of precision & recall.")

    # Business interpretation
    print("\nInterpretation:")
    if prec < 0.6:
        print("- High false positives - risky if false alarms are costly.")
    else:
        print("- Precision looks acceptable; false positives under control.")
    if rec < 0.6:
        print("- High false negatives - risky if missing positives is costly.")
    else:
        print("- Recall is strong; model is catching true cases well.")
    print(f"- F1 Score shows overall tradeoff quality: {f1:.2f}")

# Example usage
evaluate_model(y_test, y_test_pred, label="Baseline Classifier")
Baseline Classifier - Performance Summary:
- Accuracy  :  70.00% -> Overall correctness.
- Precision :   0.00% -> Of predicted '1', how many were right.
- Recall    :   0.00% -> Of actual '1', how many we caught.
- F1 Score  :   0.00% -> Balance of precision & recall.

Interpretation:
- High false positives - risky if false alarms are costly.
- High false negatives - risky if missing positives is costly.
- F1 Score shows overall tradeoff quality: 0.00
The confusion matrix is an N×N table that helps us visualize the performance of a classification model.

Confusion Matrix Terminology:

 | Predicted 0 | Predicted 1 |
---|---|---|
Actual 0 | TN | FP |
Actual 1 | FN | TP |

- Specificity = TN / (TN + FP) = True Negative Rate (TNR)
- Recall = TP / (TP + FN) = Sensitivity, TPR, Hit Rate
- Precision = TP / (TP + FP) = Positive Predictive Value
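As a quick sanity check of these formulas, here is a minimal sketch on made-up binary labels (with 1 as the positive class) that unpacks the four cells from scikit-learn's confusion_matrix and recomputes the rates above.

import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels; with labels=[0, 1] the matrix layout is [[TN, FP], [FN, TP]]
y_true_demo = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
y_pred_demo = np.array([0, 0, 1, 0, 1, 0, 1, 0, 1, 0])

tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo, labels=[0, 1]).ravel()

specificity = tn / (tn + fp)   # True Negative Rate
recall = tp / (tp + fn)        # Sensitivity / TPR
precision = tp / (tp + fp)     # Positive Predictive Value

print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
print(f"Specificity={specificity:.2f}, Recall={recall:.2f}, Precision={precision:.2f}")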
import numpy as np
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns
def plot_confusion(y_true, y_pred, model_name="Model"):
"""
Plot a confusion matrix with count and percentage annotations.
Warns if y_pred contains unseen labels not present in y_true.
"""
# Robust label set
labels = np.unique(np.concatenate([y_true, y_pred]))
# Check for potential leakage or mismatch
unseen_preds = set(y_pred) - set(y_true)
if unseen_preds:
print(f"\033[91mโ ๏ธ Warning: y_pred contains unseen class labels: {unseen_preds} โ "
f"this may indicate leakage or label mismatch.\033[0m")
# Compute confusion matrix and percentages
cm = confusion_matrix(y_true, y_pred, labels=labels)
cm_sum = np.sum(cm)
cm_perc = cm / cm_sum * 100
# Annotate with count and %
annot = np.empty_like(cm).astype(str)
for i in range(cm.shape[0]):
for j in range(cm.shape[1]):
c = cm[i, j]
p = cm_perc[i, j]
annot[i, j] = f"{c}\n({p:.1f}%)"
# Plot
plt.figure(figsize=(3, 2))
sns.heatmap(cm, annot=annot, fmt="", cmap="Blues", cbar=True,
xticklabels=labels, yticklabels=labels)
plt.xlabel("Predicted label")
plt.ylabel("True label")
plt.title(f"Confusion Matrix ({model_name})")
plt.tight_layout()
plt.show()
plot_confusion(y_test, y_test_pred, model_name="Baseline Classifier")
ROC Curve (Receiver Operating Characteristic) plots the True Positive Rate (TPR) vs False Positive Rate (FPR) across different threshold values.
AUC (Area Under the Curve) quantifies overall separability between the two classes:
- AUC around 0.5: no better than random guessing
- AUC between 0.5 and 0.7: some separability, but not yet reliable
- AUC of 0.7 or higher: the model distinguishes the classes well (1.0 would be perfect separation)

This plot lets stakeholders quickly gauge how good the model is - regardless of classification threshold.
from sklearn.metrics import roc_curve, roc_auc_score
import matplotlib.pyplot as plt
def plot_roc_auc(model, X_test, y_test, model_name="Model"):
"""
Plot ROC curve, print AUC score, and give business-facing interpretation.
"""
if hasattr(model, "predict_proba"):
y_scores = model.predict_proba(X_test)[:, 1]
elif hasattr(model, "decision_function"):
y_scores = model.decision_function(X_test)
else:
raise ValueError("Model does not support probability estimates or decision function.")
fpr, tpr, _ = roc_curve(y_test, y_scores)
auc_score = roc_auc_score(y_test, y_scores)
# Plot
plt.figure(figsize=(6, 4))
plt.plot(fpr, tpr, label=f"AUC = {auc_score:.2f}")
plt.plot([0, 1], [0, 1], "k--", label="Random Guess")
plt.xlabel("False Positive Rate (1 - Specificity)")
plt.ylabel("True Positive Rate (Recall / Sensitivity)")
plt.title(f"ROC Curve ({model_name})")
plt.legend()
plt.tight_layout()
plt.show()
    # Output
    print(f"ROC AUC Score for {model_name}: {auc_score:.4f}")
    if auc_score <= 0.55:
        print("Interpretation: Model performs at or near random. It cannot meaningfully separate classes.")
    elif auc_score < 0.7:
        print("Interpretation: Some separability, but not reliable yet. Needs improvement.")
    else:
        print("Interpretation: Model is doing a good job distinguishing between classes.")
plot_roc_auc(dummy_clf, X_test, y_test, model_name="Baseline Classifier")
ROC AUC Score for Baseline Classifier: 0.5000
Interpretation: Model performs at or near random. It cannot meaningfully separate classes.
from termcolor import colored
from sklearn.metrics import (
accuracy_score, precision_score, recall_score, f1_score,
roc_auc_score, confusion_matrix, classification_report
)
def update_best_model(model_name, model_obj, y_train, y_test, y_train_pred, y_test_pred, hyperparameters=None):
"""
Computes metrics internally, updates best_model_info if model outperforms current best.
Also logs all model results.
"""
# Evaluate performance
metrics = {
"train": {
"accuracy": accuracy_score(y_train, y_train_pred),
"precision": precision_score(y_train, y_train_pred, pos_label=positive_class, zero_division=0),
"recall": recall_score(y_train, y_train_pred, pos_label=positive_class, zero_division=0),
"f1": f1_score(y_train, y_train_pred, pos_label=positive_class, zero_division=0),
"roc_auc": roc_auc_score(y_train, model_obj.predict_proba(X_train)[:, 1])
},
"test": {
"accuracy": accuracy_score(y_test, y_test_pred),
"precision": precision_score(y_test, y_test_pred, pos_label=positive_class, zero_division=0),
"recall": recall_score(y_test, y_test_pred, pos_label=positive_class, zero_division=0),
"f1": f1_score(y_test, y_test_pred, pos_label=positive_class, zero_division=0),
"roc_auc": roc_auc_score(y_test, model_obj.predict_proba(X_test)[:, 1]),
"confusion_matrix": confusion_matrix(y_test, y_test_pred),
"classification_report": classification_report(y_test, y_test_pred, output_dict=True)
}
}
# Compare with current best
current_score = metrics["test"][success_metric]
best_score = best_model_info["metrics"]["test"].get(success_metric, -1)
previous_best = best_model_info["name"] or "None"
if current_score > best_score:
best_model_info.update({
"name": model_name,
"model": model_obj,
"metrics": metrics,
"hyperparameters": hyperparameters or {}
})
        print(colored(
            f"{model_name} just beat previous best ({previous_best}) - "
            f"{success_metric}: {best_score:.4f} -> {current_score:.4f}", "green"))
# print(f"๐ Current Test Performance:")
# for metric in ["accuracy", "precision", "recall", "f1", "roc_auc"]:
# val = metrics["test"][metric]
# print(f"- {metric.capitalize():<9}: {val:.4f}")
# Log all model results
model_results[model_name] = {
"model": model_obj,
"metrics": metrics,
"hyperparameters": hyperparameters or {}
}
update_best_model(
model_name="DummyClassifier",
model_obj=dummy_clf,
y_train=y_train,
y_test=y_test,
y_train_pred=y_train_pred,
y_test_pred=y_test_pred,
hyperparameters={"strategy": "most_frequent"}
)
DummyClassifier just beat previous best (None) - f1: -inf -> 0.0000
/Users/ashrithreddy/anaconda3/lib/python3.11/site-packages/sklearn/metrics/_classification.py:1344: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior. _warn_prf(average, modifier, msg_start, len(result))
# from pprint import pprint
# pprint(best_model_info)
# pprint(model_results)
# import json
# print(json.dumps(best_model_info, indent=2, default=str))
# print(json.dumps(model_results, indent=2, default=str))
Criterion | ๐ Logistic Regression | ๐งฎ Naive Bayes | ๐ณ Decision Tree | ๐ฒ Random Forest | ๐ฏ KNN (K-Nearest Neighbors) | ๐ SVM (Support Vector Machines) | ๐ XGBoost | ๐ง Neural Network |
---|---|---|---|---|---|---|---|---|
Interpretability | โ Excellent โ coefficients are directly interpretable | โ Good โ conditional probabilities are intuitive | โ Very good โ rules and splits are easily visualized and explained | โ Low โ individual trees interpretable, but ensemble is a black box | โ ๏ธ Moderate โ intuitive idea, but no explicit model or coefficients | โ ๏ธ Moderate โ support vectors and margins can be visualized in 2D, but overall less intuitive | โ Low โ complex ensemble; partial plots or SHAP values needed for insight | โ Very low โ acts as a black box unless aided by techniques like SHAP, LIME |
Linearity Expectation | โ ๏ธ Yes โ assumes linear relationship between features and log-odds | โ ๏ธ Assumes feature independence โ not truly linear or interactive | โ No โ naturally captures non-linear relationships | โ No โ captures complex non-linear relationships | โ No โ captures non-linear patterns based on local neighborhoods | โ ๏ธ Depends โ linear SVM assumes linear separability; kernel SVM handles non-linear | โ No โ naturally models complex non-linear relationships | โ No โ inherently models non-linear and complex interactions |
High dimensionality | โ Good โ performs well with many features (with regularization) | โ Very good โ handles high-dimensional sparse data well | โ ๏ธ Moderate โ can overfit with too many features unless pruned | โ Good โ handles many features well via feature bagging | โ Poor โ suffers from the curse of dimensionality; distances become less meaningful | โ Very good โ especially effective in high-dimensional spaces (e.g., text data) | โ Excellent โ handles many features via regularization and feature importance | โ Excellent โ scales well with many features and large data |
Handling of multicollinearity | โ Needs treatment โ regularization helps, but still sensitive | โ Poor โ assumes feature independence; correlated features hurt performance | โ Handles it โ but may create instability in splits | โ Handles it โ less sensitive due to random feature selection per tree | โ Problematic โ redundant features distort distance metrics | โ ๏ธ Can be sensitive โ especially in linear SVM; use regularization | โ Handles well โ trees split on the most useful among correlated features | โ Handles โ internal weights adjust during training, but correlated inputs may slow convergence |
Handling of categorical features | โ Needs preprocessing โ requires one-hot or ordinal encoding | โ ๏ธ Needs preprocessing โ requires encoding, but categorical NB variants exist | โ ๏ธ Partial โ numerical encoding needed; some implementations support direct handling | โ ๏ธ Requires encoding โ label or one-hot encoding typically needed | โ Not natively supported โ requires careful encoding and distance handling | โ Not supported โ requires one-hot or other encoding | โ ๏ธ Needs encoding โ label encoding typically used; native support improving | โ Not native โ requires one-hot encoding or embeddings |
Handling of outliers | โ Sensitive โ can distort coefficients significantly | โ Very sensitive โ assumes Gaussian or other strict distributional forms | โ Robust โ splits are based on thresholds, not sensitive to extreme values | โ Robust โ not sensitive due to median-based splits and ensembling | โ Sensitive โ local distance-based voting easily skewed by outliers | โ Sensitive โ margin-based optimization gets distorted by outliers | โ Robust โ trees are insensitive to extreme values | โ Sensitive โ can destabilize training; often mitigated with preprocessing |
Handling of missing values | โ Not supported โ requires imputation | โ Not supported โ requires imputation | โ ๏ธ Limited โ some implementations handle missing splits, others need imputation | โ ๏ธ Some support โ not native in all implementations; imputation often needed | โ Not supported โ requires complete-case or imputation preprocessing | โ Not supported โ must impute before training | โ Built-in โ learns optimal path for missing values during tree construction | โ Not supported โ must be imputed before training |
Scaling of features needed | โ ๏ธ Yes โ especially important when using regularization | โ ๏ธ Sometimes โ required if Gaussian NB is used (assumes normal distribution) | โ Not needed โ uses raw feature values for splitting | โ Not needed โ tree splits are scale-invariant | โ Yes โ essential, as distance calculations are affected by feature magnitudes | โ Yes โ essential due to reliance on distance and dot products | โ Not needed โ tree-based, scale-invariant | โ Yes โ critical for stable and fast convergence (e.g., standardization or normalization) |
Class Imbalance problem | โ ๏ธ Needs adjustment โ use class_weight='balanced' or resampling | โ ๏ธ Needs adjustment โ priors can be tuned or class weights applied manually | โ Poor โ biased toward majority class unless adjusted with class_weight or sampling | โ ๏ธ Needs adjustment โ use class_weight='balanced' or stratified sampling | โ Poor โ biased toward majority class due to majority voting | โ ๏ธ Needs adjustment โ use class_weight='balanced' or tune C and margins | โ Handled โ use scale_pos_weight, custom loss, or sampling | โ ๏ธ Needs care โ custom loss functions, class weights, or resampling required |
Handling of sparseness in data | โ Works fine โ especially with L1 regularization for feature selection | โ Excellent โ especially performant in text classification or bag-of-words models | โ ๏ธ Depends โ not ideal for extremely sparse datasets (e.g., text data) | โ ๏ธ Moderate โ not ideal for extreme sparsity (e.g., NLP bag-of-words) | โ Weak โ sparse vectors make distance metrics ineffective | โ Good โ works well in high-dimensional sparse spaces (esp. linear SVM) | โ Excellent โ designed to handle sparse matrices natively | โ ๏ธ Depends โ not ideal unless using sparse-aware architectures or embedding layers |
Accuracy | Moderate โ often outperformed by tree-based models for complex, non-linear patterns. | Surprisingly strong baseline for some problems (e.g., NLP); weak when feature independence assumption breaks. | Prone to overfitting if unpruned; weak alone but powerful as base learners in ensembles. | โ Strong โ robust out-of-the-box performance with low overfitting risk. | โ ๏ธ Highly data-dependent โ can perform well with clean, balanced, low-dimensional data. | โ High โ strong performance on well-separated data, especially with good kernel choice. | โ Top-tier โ one of the most accurate out-of-the-box models for tabular data. | โ High โ can outperform other models with enough data and tuning, especially on non-tabular data. |
Training speed | โ Fast โ very efficient even on large datasets. | โ Extremely fast โ almost instantaneous to train. | โ Fast โ quick to train on moderate-sized datasets. | โ ๏ธ Slower than single models โ parallelizable but can be compute-heavy. | โ Fast training, โ Slow inference โ lazy learner, evaluates at prediction time. | โ Slow โ especially on large datasets or with complex kernels. | โ ๏ธ Slower โ faster than many ensembles, but heavier than single models; GPU support helps. | โ Slow โ resource-intensive; requires tuning and hardware for best performance. |
Despite the name, Logistic Regression is used for classification โ not regression.
It predicts the probability that an observation belongs to a certain class (e.g., 0 or 1).
Under the hood, it fits a weighted formula to the input features, applies a sigmoid function, and outputs a value between 0 and 1.
Example:
A model might say there's a 78% chance this customer will churn.
If that crosses a certain threshold (say, 50%), we classify it as "Yes."
Pros | Cons |
---|---|
Fast and efficient | Assumes linear relationship (log-odds) |
Easy to interpret (feature weights) | Doesn't handle complex patterns well |
Works well with small datasets | Sensitive to multicollinearity |
Outputs probabilities | May underperform on nonlinear data |
Tip: with imbalanced classes, use class_weight='balanced' to avoid bias toward the majority class.

Logistic Regression builds a model to predict probabilities using a sigmoid transformation over a linear combination of features. Even though the dataset doesn't contain any column called z (the logit), the model constructs it using weights it learns through training.
Row | Hours Studied (x) | Pass? (y) |
---|---|---|
1 | 1 | 0 |
2 | 2 | 0 |
3 | 3 | 0 |
4 | 4 | 1 |
5 | 5 | 1 |
6 | 6 | 1 |
Training proceeds roughly as follows:

1. Initialize the weight w and bias b (e.g., to 0).
2. Compute the logit for each row: z = w * x + b.
3. Apply the sigmoid: ŷ = 1 / (1 + exp(-z)), which gives the predicted probability.
4. Compute the per-row log loss: L = - [ y log(ŷ) + (1 - y) log(1 - ŷ) ].
5. Compute the gradients of the loss with respect to w and b.
6. Update w and b using gradient descent, and repeat until the loss stops improving.

The model searches for the final coefficients (w*, b*) that minimize total log loss across all rows. This is the training objective. A toy implementation of this loop is sketched below.
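Below is a minimal NumPy sketch of that loop on the toy hours-studied table, with an arbitrary learning rate and iteration count; scikit-learn's LogisticRegression uses more sophisticated solvers, so treat this purely as an illustration of the mechanics. The _toy-suffixed names are new here to avoid clobbering notebook variables.

import numpy as np

# Toy data from the table above: hours studied -> pass/fail
x_toy = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y_toy = np.array([0, 0, 0, 1, 1, 1], dtype=float)

w_toy, b_toy = 0.0, 0.0      # step 1: initialize parameters
lr = 0.1                     # learning rate (arbitrary choice)

for _ in range(2000):
    z = w_toy * x_toy + b_toy                    # step 2: logit
    y_hat = 1 / (1 + np.exp(-z))                 # step 3: sigmoid -> probability
    loss = -np.mean(y_toy * np.log(y_hat) + (1 - y_toy) * np.log(1 - y_hat))  # step 4
    grad_w = np.mean((y_hat - y_toy) * x_toy)    # step 5: gradient w.r.t. w
    grad_b = np.mean(y_hat - y_toy)              #         gradient w.r.t. b
    w_toy -= lr * grad_w                         # step 6: gradient descent update
    b_toy -= lr * grad_b

print(f"w* = {w_toy:.3f}, b* = {b_toy:.3f}, final log loss = {loss:.4f}")
print("P(pass | hours) =", np.round(1 / (1 + np.exp(-(w_toy * x_toy + b_toy))), 3))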
Criterion | Comment |
---|---|
Interpretability | โ Excellent โ coefficients are directly interpretable |
Linearity Expectation | โ ๏ธ Yes โ assumes linear relationship between features and log-odds |
High dimensionality | โ Good โ performs well with many features (with regularization) |
Handling of multicollinearity | โ Needs treatment โ regularization helps, but still sensitive |
Handling of categorical features | โ Needs preprocessing โ requires one-hot or ordinal encoding |
Handling of outliers | โ Sensitive โ can distort coefficients significantly |
Handling of missing values | โ Not supported โ requires imputation |
Scaling of features needed | โ ๏ธ Yes โ especially important when using regularization |
Class Imbalance problem | โ ๏ธ Needs adjustment โ use class_weight='balanced' or resampling |
Handling of sparseness in data | โ Works fine โ especially with L1 regularization for feature selection |
General comment on accuracy: Moderate - often outperformed by tree-based models for complex, non-linear patterns.
General comment on training speed: Fast - very efficient even on large datasets.
# 1. Train model
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X_train, y_train)
LogisticRegression()
# 2. Extract the learned weights and bias, then show the learned equation
w = model.coef_[0]
b = model.intercept_[0]

print("Learned Logistic Equation:\n")
terms = [f"({w[i]:+.4f})·{col}" for i, col in enumerate(X_train.columns)]
equation = " +\n  ".join(terms)
print("z =\n  " + equation)
print(f"  + ({b:+.4f}) -> bias\n")
Learned Logistic Equation:

z =
  (-0.3652)·Feature_1 +
  (+0.2297)·Feature_2 +
  (-0.5858)·Feature_3 +
  (+0.0253)·Feature_4 +
  (+0.0175)·Feature_5 +
  (+0.0402)·Feature_6 +
  (+1.2599)·Feature_7 +
  (-0.1435)·Feature_8 +
  (-0.4557)·Feature_9 +
  (-0.0027)·Feature_10
  + (-0.7569) -> bias
# 3. Manually compute z, sigmoid, and prediction
def sigmoid(z):
return 1 / (1 + np.exp(-z))
X_sample = X_test.iloc[:10].copy()
z_vals = np.dot(X_sample, w) + b
probs = sigmoid(z_vals)
preds = (probs >= 0.5).astype(int)
# 4. Print table showing internals
diagnostics = X_sample.copy()
diagnostics["z = wยทx + b"] = np.round(z_vals, 4)
diagnostics["sigmoid(z) = prob"] = np.round(probs, 4)
diagnostics["prediction"] = preds
diagnostics["true_label"] = y_test.iloc[:10].values
diagnostics["log_loss_row"] = -(
diagnostics["true_label"] * np.log(diagnostics["sigmoid(z) = prob"]) +
(1 - diagnostics["true_label"]) * np.log(1 - diagnostics["sigmoid(z) = prob"])
).round(4)
print("\n๐ Internal Breakdown (First 10 Test Rows):")
display(diagnostics)
๐ Internal Breakdown (First 10 Test Rows):
Feature_1 | Feature_2 | Feature_3 | Feature_4 | Feature_5 | Feature_6 | Feature_7 | Feature_8 | Feature_9 | Feature_10 | z = w·x + b | sigmoid(z) = prob | prediction | true_label | log_loss_row | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
595 | 0.178986 | 1.033881 | 0.488455 | -0.407460 | -0.574101 | 0.536414 | -1.232457 | 0.123480 | 0.881295 | 0.907962 | -2.8442 | 0.0550 | 0 | 0 | 0.0566 |
868 | 0.994921 | 0.202329 | 0.936990 | 1.631857 | -1.269330 | 1.702515 | -1.419999 | 1.818062 | -0.910983 | -0.733033 | -3.1678 | 0.0404 | 0 | 0 | 0.0412 |
406 | 0.030370 | 0.967794 | 0.407555 | 0.398992 | -1.271549 | -1.153084 | -1.200730 | 0.324658 | 1.210347 | 1.050049 | -2.9566 | 0.0494 | 0 | 0 | 0.0507 |
815 | 1.248285 | -1.087246 | 1.198098 | -1.085825 | -0.675708 | 0.034152 | -1.850321 | -1.148794 | -1.069471 | 0.679373 | -3.8832 | 0.0202 | 0 | 0 | 0.0204 |
762 | 0.070351 | -0.115110 | 0.300255 | -0.345919 | -1.391958 | 1.704102 | -0.815085 | -1.121751 | 0.700136 | 0.889154 | -2.1370 | 0.1056 | 0 | 0 | 0.1116 |
229 | 0.049136 | 0.862543 | 0.306325 | -1.694973 | -1.426827 | 0.069337 | -0.864365 | 0.817306 | 0.804673 | 0.820232 | -2.3964 | 0.0835 | 0 | 0 | 0.0872 |
445 | -0.300191 | 0.372648 | -0.264597 | 0.166493 | -1.285599 | -1.615846 | 0.373120 | 0.243475 | 0.334053 | 1.542736 | -0.2111 | 0.4474 | 0 | 1 | 0.8043 |
691 | 0.370534 | 0.184971 | 0.508500 | 0.452756 | 0.576451 | -1.508556 | -1.016108 | -0.770819 | 0.181994 | 1.707330 | -2.4438 | 0.0799 | 0 | 0 | 0.0833 |
625 | 0.263054 | -0.335138 | 0.132288 | 0.342338 | 1.987061 | -0.530971 | -0.022842 | 0.853976 | -0.618068 | 1.554160 | -0.8592 | 0.2975 | 0 | 0 | 0.3531 |
697 | 0.843541 | 0.978422 | 0.785313 | 0.522143 | 0.975312 | 0.515628 | -1.176116 | -0.330789 | -0.802144 | -1.103670 | -2.3149 | 0.0899 | 0 | 0 | 0.0942 |
# 5. Calculate and print total log loss
from sklearn.metrics import log_loss
total_loss = log_loss(y_test, model.predict_proba(X_test)[:, 1])
print(f"\n๐ Total Log Loss: {total_loss:.4f}")
๐ Total Log Loss: 0.3407
Naive Bayes is a family of probabilistic classifiers based on Bayesโ Theorem.
It assumes that all features are independent of each other - which is rarely true in practice, but the model still performs surprisingly well.
It calculates the probability of each class given the input features and picks the class with the highest likelihood.
Example:
"Given these symptoms, what's the most probable disease?" - Naive Bayes is widely used in text classification, spam detection, and medical diagnosis.
Pros | Cons |
---|---|
Very fast and scalable | Assumes feature independence (naive) |
Handles high-dimensional data well | May underperform with correlated inputs |
Simple and interpretable | Struggles with numeric feature scaling |
Works well with text data | Outputs are often overconfident |
Criterion | Comment |
---|---|
Interpretability | โ Good โ conditional probabilities are intuitive |
Linearity Expectation | โ ๏ธ Assumes feature independence โ not truly linear or interactive |
High dimensionality | โ Very good โ handles high-dimensional sparse data well |
Handling of multicollinearity | โ Poor โ assumes feature independence; correlated features hurt performance |
Handling of categorical features | โ ๏ธ Needs preprocessing โ requires encoding, but categorical NB variants exist |
Handling of outliers | โ Very sensitive โ assumes Gaussian or other strict distributional forms |
Handling of missing values | โ Not supported โ requires imputation |
Scaling of features needed | โ ๏ธ Sometimes โ required if Gaussian NB is used (assumes normal distribution) |
Class Imbalance problem | โ ๏ธ Needs adjustment โ priors can be tuned or class weights applied manually |
Handling of sparseness in data | โ Excellent โ especially performant in text classification or bag-of-words models |
General comment on accuracy: Surprisingly strong baseline for some problems (e.g., NLP); weak when feature independence assumption breaks.
General comment on training speed: Extremely fast - almost instantaneous to train.
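Naive Bayes is not fitted anywhere else in this notebook outside the model registry at the end, so here is a minimal sketch of how a GaussianNB model could be run through the same evaluation and tracking flow defined above. It assumes the X_train/X_test split and the evaluate_model / update_best_model helpers from the earlier cells.

from sklearn.naive_bayes import GaussianNB

# Fit Gaussian Naive Bayes on the same train/test split used earlier
nb_clf = GaussianNB()
nb_clf.fit(X_train, y_train)

# Reuse the evaluation helpers defined above
evaluate_model(y_test, nb_clf.predict(X_test), label="Naive Bayes")
update_best_model(
    model_name="Naive Bayes",
    model_obj=nb_clf,
    y_train=y_train,
    y_test=y_test,
    y_train_pred=nb_clf.predict(X_train),
    y_test_pred=nb_clf.predict(X_test),
    hyperparameters={}
)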
A Decision Tree splits data into branches based on feature values, creating a flowchart-like structure.
Each split is chosen to maximize class separation (typically using Gini impurity or entropy).
The result is a set of human-readable rules, like:
"If age < 30 and income > 50K -> likely to churn."
It's intuitive and easy to explain, even to non-technical stakeholders.
Pros | Cons |
---|---|
Easy to visualize and interpret | Prone to overfitting on noisy data |
No need for feature scaling | Can create unstable splits |
Captures non-linear relationships | Doesn't generalize well on small data |
Works for both numeric and categorical | Can be biased toward dominant features |
Tip: control overfitting by pruning or limiting tree depth via max_depth.

Criterion | Comment |
---|---|
Interpretability | โ Very good โ rules and splits are easily visualized and explained |
Linearity Expectation | โ No โ naturally captures non-linear relationships |
High dimensionality | โ ๏ธ Moderate โ can overfit with too many features unless pruned |
Handling of multicollinearity | โ Handles it โ but may create instability in splits |
Handling of categorical features | โ ๏ธ Partial โ numerical encoding needed; some implementations support direct handling |
Handling of outliers | โ Robust โ splits are based on thresholds, not sensitive to extreme values |
Handling of missing values | โ ๏ธ Limited โ some implementations handle missing splits, others need imputation |
Scaling of features needed | โ Not needed โ uses raw feature values for splitting |
Class Imbalance problem | โ Poor โ biased toward majority class unless adjusted with class_weight or sampling |
Handling of sparseness in data | โ ๏ธ Depends โ not ideal for extremely sparse datasets (e.g., text data) |
General comment on accuracy: Prone to overfitting if unpruned; weak alone but powerful as base learners in ensembles.
General comment on training speed: Fast - quick to train on moderate-sized datasets.
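To illustrate the "human-readable rules" point, here is a small sketch that fits a shallow tree on our training split and prints its learned rules; max_depth=3 and class_weight='balanced' are arbitrary choices to keep the printout short and to counter the simulated imbalance.

from sklearn.tree import DecisionTreeClassifier, export_text

# Shallow tree: depth capped (arbitrarily) at 3 so the printed rules stay readable
tree_clf = DecisionTreeClassifier(max_depth=3, class_weight="balanced", random_state=42)
tree_clf.fit(X_train, y_train)

# Human-readable if/else rules learned from the data
print(export_text(tree_clf, feature_names=list(X_train.columns)))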
Random Forest is an ensemble method that builds many decision trees and combines their outputs.
Each tree sees a random subset of the data and features, making the forest diverse and robust.
It works by aggregating the predictions of multiple trees (majority vote for classification), reducing the overfitting risk of a single decision tree.
Think of it as a crowd of weak models working together to make better predictions.
Pros | Cons |
---|---|
Strong performance out of the box | Less interpretable than a single tree |
Handles non-linearities and interactions | Slower for real-time predictions |
Resistant to overfitting | May require tuning to perform well |
Works well with large feature spaces | Not ideal when interpretability is key |
Tips: tune n_estimators and max_depth if needed; use feature_importances_ to find influential variables.

Criterion | Comment |
---|---|
Interpretability | โ Low โ individual trees interpretable, but ensemble is a black box |
Linearity Expectation | โ No โ captures complex non-linear relationships |
High dimensionality | โ Good โ handles many features well via feature bagging |
Handling of multicollinearity | โ Handles it โ less sensitive due to random feature selection per tree |
Handling of categorical features | โ ๏ธ Requires encoding โ label or one-hot encoding typically needed |
Handling of outliers | โ Robust โ not sensitive due to median-based splits and ensembling |
Handling of missing values | โ ๏ธ Some support โ not native in all implementations; imputation often needed |
Scaling of features needed | โ Not needed โ tree splits are scale-invariant |
Class Imbalance problem | โ ๏ธ Needs adjustment โ use class_weight='balanced' or stratified sampling |
Handling of sparseness in data | โ ๏ธ Moderate โ not ideal for extreme sparsity (e.g., NLP bag-of-words) |
General comment on accuracy: Strong - robust out-of-the-box performance with low overfitting risk.
General comment on training speed: Slower than single models - parallelizable but can be compute-heavy.
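A minimal sketch of the tips above: fit a forest with class weights (to offset the simulated 70/30 imbalance) and inspect feature_importances_; n_estimators=200 is an arbitrary choice, not a tuned value.

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rf_clf = RandomForestClassifier(
    n_estimators=200,            # arbitrary; tune together with max_depth if needed
    class_weight="balanced",     # counteract the simulated class imbalance
    random_state=42,
    n_jobs=-1,
)
rf_clf.fit(X_train, y_train)

# Rank features by impurity-based importance
importances = pd.Series(rf_clf.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False).head())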
KNN is a non-parametric, instance-based learning method.
It doesn't learn a model during training - instead, it stores the data.
At prediction time, it looks at the K most similar observations (neighbors) and assigns the class based on majority vote.
Similarity is usually measured using Euclidean distance (or other distance metrics for different data types).
Example:
"To predict a label for this point, look at its 5 closest data points and choose the most common class."
Pros | Cons |
---|---|
Simple and intuitive | Slow at prediction time (no training step) |
No training required | Struggles with high-dimensional data |
Captures local patterns | Requires feature scaling |
Flexible distance metrics | Memory-intensive with large datasets |
Tips: use StandardScaler or MinMaxScaler to normalize features before fitting; choose k using cross-validation - odd numbers help avoid ties in binary classification.

Criterion | Comment |
---|---|
Interpretability | โ ๏ธ Moderate โ intuitive idea, but no explicit model or coefficients |
Linearity Expectation | โ No โ captures non-linear patterns based on local neighborhoods |
High dimensionality | โ Poor โ suffers from the curse of dimensionality; distances become less meaningful |
Handling of multicollinearity | โ Problematic โ redundant features distort distance metrics |
Handling of categorical features | โ Not natively supported โ requires careful encoding and distance handling |
Handling of outliers | โ Sensitive โ local distance-based voting easily skewed by outliers |
Handling of missing values | โ Not supported โ requires complete-case or imputation preprocessing |
Scaling of features needed | โ Yes โ essential, as distance calculations are affected by feature magnitudes |
Class Imbalance problem | โ Poor โ biased toward majority class due to majority voting |
Handling of sparseness in data | โ Weak โ sparse vectors make distance metrics ineffective |
General comment on accuracy: Highly data-dependent - can perform well with clean, balanced, low-dimensional data.
General comment on training speed: Fast training, slow inference - a lazy learner that evaluates at prediction time.
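The sketch below combines both tips: scale features inside a Pipeline (so the scaler is fit only on training folds) and pick k by cross-validated F1 over a few odd candidate values; the candidate list is arbitrary.

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

best_k, best_f1_cv = None, -1.0
for k in [3, 5, 7, 9, 11]:                      # odd values of k avoid ties
    knn_pipe = Pipeline([
        ("scaler", StandardScaler()),
        ("knn", KNeighborsClassifier(n_neighbors=k)),
    ])
    f1_cv = cross_val_score(knn_pipe, X_train, y_train, cv=5, scoring="f1").mean()
    if f1_cv > best_f1_cv:
        best_k, best_f1_cv = k, f1_cv

print(f"Best k = {best_k} (mean CV F1 = {best_f1_cv:.3f})")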
Support Vector Machines (SVM) are margin-based classifiers that try to find the best boundary (hyperplane) that separates classes.
SVM focuses on support vectors - the critical data points closest to the boundary - to maximize the margin between classes.
It can handle non-linear patterns using kernel tricks (e.g., RBF kernel), making it flexible for complex data.
Think of it as drawing the widest possible gap between two classes while avoiding overlap.
Pros | Cons |
---|---|
Works well in high-dimensional spaces | Slow on large datasets |
Effective for non-linear boundaries | Requires careful parameter tuning |
Robust to overfitting (with regularization) | Not intuitive to interpret |
Supports different kernels | Doesn't scale well with noisy data |
Tip: set probability=True in SVC if probability estimates (e.g., for ROC AUC) are needed.

Criterion | Comment |
---|---|
Interpretability | โ ๏ธ Moderate โ support vectors and margins can be visualized in 2D, but overall less intuitive |
Linearity Expectation | โ ๏ธ Depends โ linear SVM assumes linear separability; kernel SVM handles non-linear |
High dimensionality | โ Very good โ especially effective in high-dimensional spaces (e.g., text data) |
Handling of multicollinearity | โ ๏ธ Can be sensitive โ especially in linear SVM; use regularization |
Handling of categorical features | โ Not supported โ requires one-hot or other encoding |
Handling of outliers | โ Sensitive โ margin-based optimization gets distorted by outliers |
Handling of missing values | โ Not supported โ must impute before training |
Scaling of features needed | โ Yes โ essential due to reliance on distance and dot products |
Class Imbalance problem | โ ๏ธ Needs adjustment โ use class_weight='balanced' or tune C and margins |
Handling of sparseness in data | โ Good โ works well in high-dimensional sparse spaces (esp. linear SVM) |
General comment on accuracy: High - strong performance on well-separated data, especially with good kernel choice.
General comment on training speed: Slow - especially on large datasets or with complex kernels.
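A minimal sketch under the same assumptions as the tips above: scale features in a Pipeline, set class_weight='balanced' for the imbalance, and enable probability=True so the plot_roc_auc helper defined earlier can use predict_proba; the RBF kernel here is the default choice, not a tuned one.

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Scaling is essential for SVM; probability=True enables predict_proba (used for ROC AUC)
svm_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("svm", SVC(kernel="rbf", class_weight="balanced", probability=True, random_state=42)),
])
svm_pipe.fit(X_train, y_train)

# The ROC helper defined earlier works on any estimator exposing predict_proba
plot_roc_auc(svm_pipe, X_test, y_test, model_name="SVM (RBF)")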
XGBoost (Extreme Gradient Boosting) is a powerful boosted tree ensemble method.
Unlike Random Forest (which builds trees in parallel), XGBoost builds trees sequentially - each new tree tries to fix the errors of the previous one.
It uses gradient descent to minimize loss, with regularization to prevent overfitting.
XGBoost is known for its speed, accuracy, and efficiency, making it a go-to model in many Kaggle competitions and production systems.
Pros | Cons |
---|---|
High predictive accuracy | Harder to interpret |
Built-in regularization (less overfitting) | More complex than basic tree models |
Fast and scalable | Requires tuning for best performance |
Handles missing data automatically | May overfit small/noisy datasets |
Tips: watch for overfitting if n_estimators is too high - always monitor with validation; use early_stopping_rounds during training to auto-pick the optimal iteration; use GridSearchCV or Optuna for hyperparameter tuning.

Criterion | Comment |
---|---|
Interpretability | โ Low โ complex ensemble; partial plots or SHAP values needed for insight |
Linearity Expectation | โ No โ naturally models complex non-linear relationships |
High dimensionality | โ Excellent โ handles many features via regularization and feature importance |
Handling of multicollinearity | โ Handles well โ trees split on the most useful among correlated features |
Handling of categorical features | โ ๏ธ Needs encoding โ label encoding typically used; native support improving |
Handling of outliers | โ Robust โ trees are insensitive to extreme values |
Handling of missing values | โ Built-in โ learns optimal path for missing values during tree construction |
Scaling of features needed | โ Not needed โ tree-based, scale-invariant |
Class Imbalance problem | Handled - use scale_pos_weight, custom loss, or sampling |
Handling of sparseness in data | โ Excellent โ designed to handle sparse matrices natively |
General comment on accuracy: Top-tier - one of the most accurate out-of-the-box models for tabular data.
General comment on training speed: Slower - faster than many ensembles, but heavier than single models; GPU support helps.
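A short sketch of the class-imbalance tip: set scale_pos_weight to the negative/positive ratio observed in the training split; all other hyperparameters are left at their defaults.

import xgboost as xgb

# Ratio of negative to positive samples in the training split (~0.7/0.3 here)
neg, pos = (y_train == 0).sum(), (y_train == 1).sum()

xgb_clf = xgb.XGBClassifier(
    scale_pos_weight=neg / pos,   # up-weight the minority (positive) class
    eval_metric="logloss",
    random_state=42,
)
xgb_clf.fit(X_train, y_train)

# Reuse the evaluation helper from earlier
evaluate_model(y_test, xgb_clf.predict(X_test), label="XGBoost (weighted)")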
A Neural Network is a layered structure of interconnected "neurons" inspired by the human brain.
Each neuron applies a weighted transformation followed by a non-linear activation, allowing the model to learn complex, non-linear patterns in the data.
Even a basic feedforward neural network (also called Multi-Layer Perceptron or MLP) can approximate intricate decision boundaries - making it powerful but harder to interpret.
Think of it as a flexible function builder that learns patterns layer by layer.
Pros | Cons |
---|---|
Can model complex, non-linear relationships | Requires lots of data and tuning |
Works well on both tabular and image/text data | Not interpretable out of the box |
Scales with data and compute | Can overfit if not regularized |
Highly customizable architectures | Slower to train, harder to debug |
Criterion | Comment |
---|---|
Interpretability | โ Very low โ acts as a black box unless aided by techniques like SHAP, LIME |
Linearity Expectation | โ No โ inherently models non-linear and complex interactions |
High dimensionality | โ Excellent โ scales well with many features and large data |
Handling of multicollinearity | โ Handles โ internal weights adjust during training, but correlated inputs may slow convergence |
Handling of categorical features | โ Not native โ requires one-hot encoding or embeddings |
Handling of outliers | โ Sensitive โ can destabilize training; often mitigated with preprocessing |
Handling of missing values | โ Not supported โ must be imputed before training |
Scaling of features needed | โ Yes โ critical for stable and fast convergence (e.g., standardization or normalization) |
Class Imbalance problem | โ ๏ธ Needs care โ custom loss functions, class weights, or resampling required |
Handling of sparseness in data | โ ๏ธ Depends โ not ideal unless using sparse-aware architectures or embedding layers |
General comment on accuracy: High - can outperform other models with enough data and tuning, especially on non-tabular data.
General comment on training speed: Slow - resource-intensive; requires tuning and hardware for best performance.
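As a minimal sketch, an MLP wrapped in a Pipeline with standardization (per the scaling note above); the small (32, 16) architecture is an arbitrary starting point, not a tuned one.

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier

# Standardize inputs first - MLP training is unstable on unscaled features
mlp_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("mlp", MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=1000, random_state=42)),
])
mlp_pipe.fit(X_train, y_train)

evaluate_model(y_test, mlp_pipe.predict(X_test), label="Neural Network (MLP)")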
Target Type | Linearly Separable | Correlation | Imbalance | Recommended Models | Notes |
---|---|---|---|---|---|
Binary | True | Low | True | XGBoost > Random Forest | Use tree-based models with class weights or resampling. |
Binary | True | Low | False | Logistic Regression > SVM | Start with simple linear models. Use as benchmark. |
Binary | True | High | True | XGBoost > Random Forest | Use tree-based models with class weights or resampling. |
Binary | True | High | False | Logistic Regression > SVM | Start with simple linear models. Use as benchmark. |
Binary | False | Low | True | XGBoost > Random Forest | Boosting or RF with class weights to handle imbalance + complexity. |
Binary | False | Low | False | Random Forest > Decision Tree | Simple non-linear trees likely sufficient. Avoid tuning-heavy models. |
Binary | False | High | True | XGBoost > Random Forest | Boosting or RF with class weights to handle imbalance + complexity. |
Binary | False | High | False | Random Forest > Decision Tree | Simple non-linear trees likely sufficient. Avoid tuning-heavy models. |
Multiclass | True | Low | True | XGBoost > Logistic Regression | Use OvR strategy with LR/XGB. Watch for class separation. |
Multiclass | True | Low | False | XGBoost > Logistic Regression | Use OvR strategy with LR/XGB. Watch for class separation. |
Multiclass | True | High | True | XGBoost > Logistic Regression | Use OvR strategy with LR/XGB. Watch for class separation. |
Multiclass | True | High | False | XGBoost > Logistic Regression | Use OvR strategy with LR/XGB. Watch for class separation. |
Multiclass | False | Low | True | Neural Network > KNN | Use Neural Net or KNN. Prioritize decision boundary complexity. |
Multiclass | False | Low | False | Neural Network > KNN | Use Neural Net or KNN. Prioritize decision boundary complexity. |
Multiclass | False | High | True | XGBoost > Random Forest | Tree-based models preferred. Skip preprocessing of collinear features. |
Multiclass | False | High | False | XGBoost > Random Forest | Tree-based models preferred. Skip preprocessing of collinear features. |
Decision flowchart (ASCII, mirroring the table above): branch on Target Type (binary vs. multiclass), then Linearly Separable?, Feature Type (categorical leads to Random Forest > XGBoost), Correlation (high leads to XGBoost > Random Forest), Missing Data (yes leads to XGBoost > Neural Network), Outliers, and Imbalance (imbalanced leads to XGBoost > Random Forest, otherwise Random Forest > Decision Tree).
Target Type | Linearly Separable | Correlation | Imbalance | Recommended Models | Notes |
---|---|---|---|---|---|
Binary | โ True | Low | โ True | XGBoost > Random Forest | Use tree-based models with class weights or resampling. |
Binary | โ True | Low | โ False | Logistic Regression > SVM | Start with simple linear models. Use as benchmark. |
Binary | โ True | High | โ True | XGBoost > Random Forest | Use tree-based models with class weights or resampling. |
Binary | โ True | High | โ False | Logistic Regression > SVM | Start with simple linear models. Use as benchmark. |
Binary | โ False | Low | โ True | XGBoost > Random Forest | Boosting or RF with class weights to handle imbalance + complexity. |
Binary | โ False | Low | โ False | Random Forest > Decision Tree | Simple non-linear trees likely sufficient. Avoid tuning-heavy models. |
Binary | โ False | High | โ True | XGBoost > Random Forest | Boosting or RF with class weights to handle imbalance + complexity. |
Binary | โ False | High | โ False | Random Forest > Decision Tree | Simple non-linear trees likely sufficient. Avoid tuning-heavy models. |
Multiclass | โ True | Low | โ True | XGBoost > Logistic Regression | Use OvR strategy with LR/XGB. Watch for class separation. |
Multiclass | โ True | Low | โ False | XGBoost > Logistic Regression | Use OvR strategy with LR/XGB. Watch for class separation. |
Multiclass | โ True | High | โ True | XGBoost > Logistic Regression | Use OvR strategy with LR/XGB. Watch for class separation. |
Multiclass | โ True | High | โ False | XGBoost > Logistic Regression | Use OvR strategy with LR/XGB. Watch for class separation. |
Multiclass | โ False | Low | โ True | Neural Network > KNN | Use Neural Net or KNN. Prioritize decision boundary complexity. |
Multiclass | โ False | Low | โ False | Neural Network > KNN | Use Neural Net or KNN. Prioritize decision boundary complexity. |
Multiclass | โ False | High | โ True | XGBoost > Random Forest | Tree-based models preferred. Skip preprocessing of collinear features. |
Multiclass | โ False | High | โ False | XGBoost > Random Forest | Tree-based models preferred. Skip preprocessing of collinear features. |
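To make the table above actionable in code, here is a minimal sketch (not part of the original pipeline) that encodes its binary-target rows as a plain dictionary lookup; the key layout and the recommendation strings simply mirror the table and are illustrative.
# Hypothetical lookup encoding the binary-target rows of the decision matrix above.
# Keys: (target_type, linearly_separable, correlation, imbalanced)
MODEL_LOOKUP = {
    ("binary", True, "low", True): "XGBoost > Random Forest",
    ("binary", True, "low", False): "Logistic Regression > SVM",
    ("binary", True, "high", True): "XGBoost > Random Forest",
    ("binary", True, "high", False): "Logistic Regression > SVM",
    ("binary", False, "low", True): "XGBoost > Random Forest",
    ("binary", False, "low", False): "Random Forest > Decision Tree",
    ("binary", False, "high", True): "XGBoost > Random Forest",
    ("binary", False, "high", False): "Random Forest > Decision Tree",
}
def lookup_recommendation(target_type, linearly_separable, correlation, imbalanced):
    """Return the recommended model ordering for one combination of data characteristics."""
    return MODEL_LOOKUP.get((target_type, linearly_separable, correlation, imbalanced), "No rule defined")
# Example: binary target, not linearly separable, low correlation, imbalanced
print(lookup_recommendation("binary", False, "low", True))  # -> XGBoost > Random Forest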
(A second, more detailed flowchart was flattened here. Starting from the same Target Type and Linearly Separable? checks, it branches on Feature Type = Categorical? (favoring Random Forest / XGBoost), Correlation = High? (avoid Naive Bayes / Logistic Regression), Missing Data? (XGBoost / CatBoost), and Outliers? (XGBoost / Random Forest versus Logistic Regression / SVM, or Neural Network / KNN for non-separable data), and ends with "Evaluate Top 3 Recommended Models".)
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
import xgboost as xgb
model_registry = {
"Logistic Regression": LogisticRegression(max_iter=1000),
"Naive Bayes": GaussianNB(),
"Decision Tree": DecisionTreeClassifier(),
"Random Forest": RandomForestClassifier(),
"KNN": KNeighborsClassifier(),
"SVM": SVC(probability=True), # needed for ROC AUC
"XGBoost": xgb.XGBClassifier(use_label_encoder=False, eval_metric='logloss'),
"Neural Network": MLPClassifier(max_iter=1000)
}
def recommend_models(data_characteristics, model_registry, verbose=True):
"""
Scores and ranks models based on data characteristics.
Prints the recommended order and rationales.
"""
from termcolor import colored
score = {}
rationale = {}
# Extract characteristics
target_info = data_characteristics.get("target_variable", {})
feature_info = data_characteristics.get("features", {})
is_linear = data_characteristics.get("linear_separability", False)
imbalance = target_info.get("imbalance", False)
imbalance_severity = target_info.get("class_imbalance_severity", "")
feature_type = feature_info.get("type", "")
corr = feature_info.get("correlation", "")
outliers = feature_info.get("outliers", False)
# --- Logistic Regression ---
score["Logistic Regression"] = 2
rationale["Logistic Regression"] = ["Good for linearly separable, numeric data"]
if is_linear:
score["Logistic Regression"] += 3
rationale["Logistic Regression"].append("Linear separability is True")
if feature_type == "continuous":
score["Logistic Regression"] += 1
rationale["Logistic Regression"].append("Features are continuous")
if outliers:
score["Logistic Regression"] -= 1
rationale["Logistic Regression"].append("Sensitive to outliers")
# --- Naive Bayes ---
score["Naive Bayes"] = 1
rationale["Naive Bayes"] = ["Good for categorical, independent features"]
if feature_type == "categorical":
score["Naive Bayes"] += 2
rationale["Naive Bayes"].append("Features are categorical")
if corr == "low":
score["Naive Bayes"] += 1
rationale["Naive Bayes"].append("Feature correlation is low")
if corr == "high":
score["Naive Bayes"] -= 2
rationale["Naive Bayes"].append("Correlation is high โ violates independence")
# --- Decision Tree ---
score["Decision Tree"] = 2
rationale["Decision Tree"] = ["Fast, flexible, handles most data types"]
if corr == "high":
score["Decision Tree"] += 1
rationale["Decision Tree"].append("Can exploit feature redundancy")
if outliers:
score["Decision Tree"] += 1
rationale["Decision Tree"].append("Robust to outliers")
# --- Random Forest ---
score["Random Forest"] = 3
rationale["Random Forest"] = ["Strong baseline, handles imbalance + outliers"]
if outliers:
score["Random Forest"] += 1
rationale["Random Forest"].append("Handles outliers well")
if imbalance:
score["Random Forest"] += 1
rationale["Random Forest"].append("Bootstrap helps with imbalance")
# --- KNN ---
score["KNN"] = 1
rationale["KNN"] = ["Simple, distance-based"]
if feature_type == "continuous":
score["KNN"] += 1
rationale["KNN"].append("Distance-based โ works better on continuous features")
if outliers:
score["KNN"] -= 2
rationale["KNN"].append("Very sensitive to outliers")
if imbalance:
score["KNN"] -= 1
rationale["KNN"].append("Imbalance skews neighbors")
# --- SVM ---
score["SVM"] = 2
rationale["SVM"] = ["Margin-based model"]
if is_linear:
score["SVM"] += 2
rationale["SVM"].append("Linear separability is True")
if imbalance:
score["SVM"] -= 1
rationale["SVM"].append("Needs tuning to handle imbalance")
if feature_type == "continuous":
score["SVM"] += 1
rationale["SVM"].append("Requires numeric features")
# --- Neural Network ---
score["Neural Network"] = 2
rationale["Neural Network"] = ["Flexible but sensitive"]
if imbalance_severity == "high":
score["Neural Network"] += 1
rationale["Neural Network"].append("Can learn from imbalance if tuned")
if outliers:
score["Neural Network"] -= 1
rationale["Neural Network"].append("Can be unstable with outliers")
# --- XGBoost ---
score["XGBoost"] = 4
rationale["XGBoost"] = ["Strong general-purpose model"]
if outliers:
score["XGBoost"] += 1
rationale["XGBoost"].append("Robust to outliers")
if imbalance:
score["XGBoost"] += 1
rationale["XGBoost"].append("scale_pos_weight helps with imbalance")
if corr == "high":
score["XGBoost"] += 1
rationale["XGBoost"].append("Handles redundant features well")
# Sort by descending score
ranked_models = sorted(score.items(), key=lambda x: x[1], reverse=True)
ranked_model_names = [model for model, _ in ranked_models]
# Filter and reorder model_registry
ranked_registry = {name: model_registry[name] for name in ranked_model_names if name in model_registry}
if verbose:
print("๐ง Recommended Model Evaluation Order:\n")
for i, name in enumerate(ranked_model_names, 1):
if name in model_registry:
prefix = colored(f"{i}. {name} (Score: {score[name]})", "green") if i <= 3 else f"{i}. {name} (Score: {score[name]})"
print(prefix)
for reason in rationale[name]:
print(f" โช {reason}")
print()
return ranked_model_names, ranked_registry
_, model_registry = recommend_models(data_characteristics, model_registry)
# model_registry
๐ง Recommended Model Evaluation Order:

1. XGBoost (Score: 6)
   โช Strong general-purpose model
   โช Robust to outliers
   โช scale_pos_weight helps with imbalance
2. Random Forest (Score: 5)
   โช Strong baseline, handles imbalance + outliers
   โช Handles outliers well
   โช Bootstrap helps with imbalance
3. Decision Tree (Score: 3)
   โช Fast, flexible, handles most data types
   โช Robust to outliers
4. Logistic Regression (Score: 2)
   โช Good for linearly separable, numeric data
   โช Features are continuous
   โช Sensitive to outliers
5. Naive Bayes (Score: 2)
   โช Good for categorical, independent features
   โช Feature correlation is low
6. SVM (Score: 2)
   โช Margin-based model
   โช Needs tuning to handle imbalance
   โช Requires numeric features
7. Neural Network (Score: 1)
   โช Flexible but sensitive
   โช Can be unstable with outliers
8. KNN (Score: -1)
   โช Simple, distance-based
   โช Distance-based โ works better on continuous features
   โช Very sensitive to outliers
   โช Imbalance skews neighbors
from sklearn.metrics import (
precision_score, recall_score, f1_score,
accuracy_score, roc_auc_score, confusion_matrix, log_loss
)
top_k = 3
for name in list(model_registry.keys())[:top_k]:
# We evaluate only the top 3 recommended models (ranked earlier) for focused comparison.
print(f"\n๐ง Training: {name}")
# Fit and predict
model = model_registry[name]
model.fit(X_train, y_train)
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)
# Evaluation summary
evaluate_model(y_test, y_test_pred, label=name)
plot_confusion(y_test, y_test_pred, model_name=name)
plot_roc_auc(model, X_test, y_test, model_name=name)
# Track and log best model
update_best_model(
model_name=name,
model_obj=model,
y_train=y_train,
y_test=y_test,
y_train_pred=y_train_pred,
y_test_pred=y_test_pred
)
print("โ" * 80) # horizontal line
๐ง Training: XGBoost
๐ XGBoost โ Performance Summary:
- Accuracy : 93.00% โ Overall correctness.
- Precision : 89.66% โ Of predicted '1', how many were right.
- Recall : 86.67% โ Of actual '1', how many we caught.
- F1 Score : 88.14% โ Balance of precision & recall.
๐ Interpretation:
- Precision looks acceptable; false positives under control.
- Recall is strong; model is catching true cases well.
- F1 Score shows overall tradeoff quality: 0.88
๐น ROC AUC Score for XGBoost: 0.9554
๐ Interpretation: Model is doing a good job distinguishing between classes.
โ XGBoost just beat previous best (DummyClassifier) โ f1: 0.0000 โ 0.8814
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
๐ง Training: Random Forest
๐ Random Forest โ Performance Summary:
- Accuracy : 93.00% โ Overall correctness.
- Precision : 92.59% โ Of predicted '1', how many were right.
- Recall : 83.33% โ Of actual '1', how many we caught.
- F1 Score : 87.72% โ Balance of precision & recall.
๐ Interpretation:
- Precision looks acceptable; false positives under control.
- Recall is strong; model is catching true cases well.
- F1 Score shows overall tradeoff quality: 0.88
๐น ROC AUC Score for Random Forest: 0.9605
๐ Interpretation: Model is doing a good job distinguishing between classes.
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
๐ง Training: Decision Tree
๐ Decision Tree โ Performance Summary:
- Accuracy : 86.00% โ Overall correctness.
- Precision : 70.51% โ Of predicted '1', how many were right.
- Recall : 91.67% โ Of actual '1', how many we caught.
- F1 Score : 79.71% โ Balance of precision & recall.
๐ Interpretation:
- Precision looks acceptable; false positives under control.
- Recall is strong; model is catching true cases well.
- F1 Score shows overall tradeoff quality: 0.80
๐น ROC AUC Score for Decision Tree: 0.8762
๐ Interpretation: Model is doing a good job distinguishing between classes.
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
# from pprint import pprint
# pprint(best_model_info)
# pprint(model_results)
# Print current best model based on success_metric
print(f"\n๐ Best model so far: {best_model_info['name']} "
f"({success_metric.upper()} = {best_model_info['metrics']['test'][success_metric]:.4f})")
print(f"\n๐ Model Ranking by {success_metric.upper()}:\n")
ranked = sorted(
model_results.items(),
key=lambda x: x[1]["metrics"]["test"][success_metric],
reverse=True
)
for i, (name, result) in enumerate(ranked, 1):
score = result["metrics"]["test"][success_metric]
print(f"{i}. {name:<20} {success_metric}: {score:.4f}")
๐ Best model so far: XGBoost (F1 = 0.8814)

๐ Model Ranking by F1:

1. XGBoost              f1: 0.8814
2. Random Forest        f1: 0.8772
3. Decision Tree        f1: 0.7971
4. DummyClassifier      f1: 0.0000
import plotly.graph_objects as go
import plotly.subplots as sp
import pandas as pd
# Extract test metrics
df_results = pd.DataFrame({
model_name: data["metrics"]["test"]
for model_name, data in model_results.items()
}).T
# Original metrics you'd like to plot
desired_metrics = ['accuracy', 'precision', 'recall', 'f1', 'roc_auc', 'specificity']
# Filter only those that exist in df_results
metrics = [m for m in desired_metrics if m in df_results.columns]
# Create subplot layout
rows = (len(metrics) + 1) // 2
fig = sp.make_subplots(rows=rows, cols=2, subplot_titles=[m.upper() for m in metrics])
# Plot each available metric
for i, metric in enumerate(metrics):
row, col = divmod(i, 2)
fig.add_trace(
go.Bar(
x=df_results.index,
y=df_results[metric],
name=metric,
text=pd.to_numeric(df_results[metric], errors="coerce").round(3),
textposition="auto"
),
row=row+1, col=col+1
)
fig.update_layout(
height=300 * rows,
width=1000,
title_text="Model Comparison by Metric",
showlegend=False
)
fig.show()
Feature importance tells us which variables the model relied on most to make predictions.
Itโs like asking, โWhat factors influenced the decision the most?โ
In tree-based models like Random Forest or XGBoost, itโs calculated based on how often and how effectively a feature was used to split the data.
This is useful for:
- understanding which features drive the model's decisions
- feature selection (dropping uninformative features)
- sanity-checking model behavior with domain experts
# best_model_info
import pandas as pd
import matplotlib.pyplot as plt
def plot_feature_importance(model=None, feature_names=None, top_n=10, model_name=None):
"""
Plots top N feature importances.
Defaults to best_model_info['model'] unless overridden.
Optionally takes a model_name for the plot title.
"""
if model is None:
model = best_model_info["model"]
model_name = best_model_info.get("name", "Best Model") if model_name is None else model_name
else:
model_name = model_name or "Selected Model"
if feature_names is None:
feature_names = X_train.columns
if not hasattr(model, "feature_importances_"):
raise ValueError("Model does not support feature_importances_")
importance_df = pd.DataFrame({
"Feature": feature_names,
"Importance": model.feature_importances_
}).sort_values(by="Importance", ascending=False).head(top_n)
plt.figure(figsize=(8, 5))
plt.barh(importance_df["Feature"][::-1], importance_df["Importance"][::-1])
for i, (feature, importance) in enumerate(zip(importance_df["Feature"][::-1], importance_df["Importance"][::-1])):
plt.text(importance + 0.005, i, f"{importance:.3f}", va='center')
plt.title(f"Top Feature Importances ({model_name})")
plt.xlabel("Importance Score")
plt.tight_layout()
plt.show()
return list(importance_df["Feature"])
# โ Default: plot for best model
imp_ranked = plot_feature_importance()
# ๐ ๏ธ Optional: override model + title
# alt_model = model_results["Random Forest"]["model"]
# imp_ranked = plot_feature_importance(model=alt_model, model_name="Random Forest")
SHAP (SHapley Additive exPlanations) values explain how much each feature contributed to a specific prediction โ positively or negatively.
Itโs like breaking down a credit score:
โAge added +12 points, income removed -5 pointsโฆโ
SHAP is model-agnostic and gives local explanations (for individual predictions) and global insights (feature impact across all predictions).
Useful for:
- debugging and explaining individual predictions
- communicating model behavior to stakeholders or auditors
- comparing global feature impact against the model's built-in importances
import shap
def plot_shap_summary_tree(model=None, X=None, model_name=None):
"""
Plot SHAP summary for tree-based models (RandomForest, XGBoost).
Defaults to best_model_info['model'] and X_test.
"""
if model is None:
model = best_model_info["model"]
model_name = model_name or best_model_info.get("name", "Best Model")
else:
model_name = model_name or "Selected Model"
if X is None:
X = X_test
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)
# For binary classification, use shap_values[1]
if isinstance(shap_values, list) and len(shap_values) == 2:
shap_values = shap_values[1]
shap.summary_plot(shap_values, X)
print(f"\n๐ SHAP Summary for {model_name}:")
print("- Each bar shows how much that feature influences the modelโs decision.")
print("- Features at the top are the most impactful across all predictions.")
print("- Blue/red indicate direction: does the feature push prediction up or down?")
print("- Helps us understand *why* the model is confident โ not just *what* it predicts.")
shap_df = pd.DataFrame(np.abs(shap_values), columns=X.columns).mean().sort_values(ascending=False)
return list(shap_df.index)
# โ Default: SHAP for best model
shap_ranked = plot_shap_summary_tree()
# ๐ ๏ธ Optional: SHAP for any other model
# alt_model = model_results["Random Forest"]["model"]
# shap_ranked = plot_shap_summary_tree(model=alt_model, model_name="Random Forest")
๐ SHAP Summary for XGBoost:
- Each bar shows how much that feature influences the modelโs decision.
- Features at the top are the most impactful across all predictions.
- Blue/red indicate direction: does the feature push prediction up or down?
- Helps us understand *why* the model is confident โ not just *what* it predicts.
if False:
from sklearn.feature_selection import RFE
# Use full X_train
X_full = X_train.copy()
model = best_model_info["model"]
# Choose how many features to keep (optional: all, top 50%, or fixed)
n_to_select = max(1, X_full.shape[1] // 2) # or change to any value
# Run RFE
selector = RFE(estimator=model, n_features_to_select=n_to_select, step=1)
selector.fit(X_full, y_train)
# Final selected features
selected_features = list(X_full.columns[selector.support_])
print(f"โ
RFE selected features (no filtering): {selected_features}")
from sklearn.base import clone
from sklearn.metrics import f1_score, accuracy_score, precision_score, recall_score, roc_auc_score
import shap
import numpy as np
def shap_guided_backward_elimination(
model, X, y,
shap_ranked=None,
metric_name=None,
drop_threshold=0.005,
min_features=1,
verbose=True
):
"""
SHAP-guided backward elimination with early stopping.
Drops features by SHAP rank until performance drops significantly or hits min_features.
"""
model_base = clone(model)
X_curr = X.copy()
# Determine metric
metric_name = metric_name or success_metric
metric_func = {
"f1": f1_score,
"accuracy": accuracy_score,
"precision": precision_score,
"recall": recall_score,
"roc_auc": roc_auc_score
}.get(metric_name)
if metric_func is None:
raise ValueError(f"Unsupported metric: {metric_name}")
# Get SHAP-ranked features if not provided
if shap_ranked is None:
explainer = shap.TreeExplainer(model_base.fit(X_curr, y))
shap_values = explainer.shap_values(X_curr)
if isinstance(shap_values, list) and len(shap_values) == 2:
shap_values = shap_values[1]
shap_importance = np.abs(shap_values).mean(axis=0)
shap_ranked = X_curr.columns[np.argsort(shap_importance)[::-1]].tolist()
else:
shap_ranked = shap_ranked.copy()
# Initialize tracking
score_history = []
previous_score = None
while len(shap_ranked) >= min_features:
model = clone(model_base)
model.fit(X_curr[shap_ranked], y)
y_pred = model.predict(X_curr[shap_ranked])
score = metric_func(y, y_pred, zero_division=0) if metric_name in ("precision", "recall", "f1") else metric_func(y, y_pred)  # zero_division only applies to precision/recall/f1
score_history.append((len(shap_ranked), score, shap_ranked.copy()))
if verbose:
feat_list = ", ".join(shap_ranked)
print(f"โ
{len(shap_ranked)} features โ {metric_name}: {score:.4f} โ [{feat_list}]")
# Early stop if score drops significantly
if previous_score is not None and (previous_score - score) > drop_threshold:
if verbose:
print(f"๐ Stopping early: {metric_name} dropped from {previous_score:.4f} to {score:.4f}")
break
previous_score = score
shap_ranked.pop() # Drop lowest-ranked SHAP feature
if not score_history:
raise ValueError("No elimination steps executed โ shap_ranked too short or invalid inputs.")
# Best configuration
tolerance = 0.01 # Accept within 1% drop of best score
best_score = max(score_history, key=lambda x: x[1])[1]
# Keep all configs that are within tolerance
candidates = [cfg for cfg in score_history if (best_score - cfg[1]) <= tolerance]
# Pick one with the fewest features
best_config = min(candidates, key=lambda x: x[0])
print(f"\n๐ฏ Best config: {len(best_config[2])} features โ {metric_name}: {best_config[1]:.4f}")
return best_config[2], score_history
final_features, history = shap_guided_backward_elimination(
model=best_model_info["model"],
X=X_train,
y=y_train,
shap_ranked=shap_ranked
)
final_features
X_train_full = X_train.copy() # retaining copies for future reference
X_test_full = X_test.copy() # retaining copies for future reference
X_train = X_train[final_features]
X_test = X_test[final_features]
โ 10 features โ f1: 1.0000 โ [Feature_7, Feature_9, Feature_6, Feature_2, Feature_5, Feature_8, Feature_10, Feature_1, Feature_3, Feature_4]
โ 9 features โ f1: 1.0000 โ [Feature_7, Feature_9, Feature_6, Feature_2, Feature_5, Feature_8, Feature_10, Feature_1, Feature_3]
โ 8 features โ f1: 1.0000 โ [Feature_7, Feature_9, Feature_6, Feature_2, Feature_5, Feature_8, Feature_10, Feature_1]
โ 7 features โ f1: 1.0000 โ [Feature_7, Feature_9, Feature_6, Feature_2, Feature_5, Feature_8, Feature_10]
โ 6 features โ f1: 1.0000 โ [Feature_7, Feature_9, Feature_6, Feature_2, Feature_5, Feature_8]
โ 5 features โ f1: 1.0000 โ [Feature_7, Feature_9, Feature_6, Feature_2, Feature_5]
โ 4 features โ f1: 1.0000 โ [Feature_7, Feature_9, Feature_6, Feature_2]
โ 3 features โ f1: 1.0000 โ [Feature_7, Feature_9, Feature_6]
โ 2 features โ f1: 0.9896 โ [Feature_7, Feature_9]
๐ Stopping early: f1 dropped from 1.0000 to 0.9896
๐ฏ Best config: 3 features โ f1: 1.0000
Grid Search tests all possible combinations of hyperparameters across a fixed grid.
Itโs exhaustive, simple, and works best when the number of hyperparameters is small.
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier
# ๐ง Complete and default-aware param grid
param_grids = {
"RandomForestClassifier": {
"n_estimators": [100, 200], # default: 100
"max_depth": [None, 5, 10], # default: None
"min_samples_split": [2, 5], # default: 2
"min_samples_leaf": [1, 2], # default: 1
"max_features": ["sqrt", "log2"] # default: "sqrt"
},
"DecisionTreeClassifier": {
"max_depth": [None, 5, 10], # default: None
"min_samples_split": [2, 5], # default: 2
"min_samples_leaf": [1, 2], # default: 1
"criterion": ["gini", "entropy"] # default: "gini"
},
"GaussianNB": {
# Note: Naive Bayes (GaussianNB) has limited tunable parameters โ only var_smoothing is exposed
"var_smoothing": [1e-9, 1e-8, 1e-7] # default: 1e-9
},
"LogisticRegression": {
"C": [0.01, 0.1, 1, 10], # default: 1
"penalty": ["l2"], # default: "l2"
"solver": ["lbfgs"], # default: "lbfgs"
"max_iter": [100, 500] # default: 100
},
"SVC": {
"C": [0.1, 1, 10], # default: 1
"kernel": ["linear", "rbf"], # default: "rbf"
"gamma": ["scale", "auto"], # default: "scale"
"probability": [True] # default: False (forced True for AUC)
},
"KNeighborsClassifier": {
"n_neighbors": [3, 5, 7], # default: 5
"weights": ["uniform", "distance"], # default: "uniform"
"metric": ["euclidean", "manhattan", "minkowski"] # default: "minkowski"
},
"MLPClassifier": {
"hidden_layer_sizes": [(50,), (100,)], # default: (100,)
"activation": ["relu", "tanh"], # default: "relu"
"alpha": [0.0001, 0.001], # default: 0.0001
"learning_rate": ["constant", "adaptive"], # default: "constant"
"max_iter": [200, 500] # default: 200
},
"XGBClassifier": {
"n_estimators": [100, 200],
"max_depth": [3, 5, 7],
"learning_rate": [0.01, 0.1],
"subsample": [0.8, 1.0],
"colsample_bytree": [0.8, 1.0],
"scale_pos_weight": [1, 2] # useful for class imbalance
}
}
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import (
precision_score, recall_score, f1_score, accuracy_score,
roc_auc_score, confusion_matrix, log_loss
)
# โ๏ธ Resolve model name and corresponding grid
model_name = best_model_info["model"].__class__.__name__  # class name of the current best model
param_grid = param_grids.get(model_name)
if param_grid is None:
raise ValueError(f"No param grid defined for model: {model_name}")
print(f"\n๐ง Running Grid Search for: {model_name}")
# ๐งช Run Grid Search
model_instance = best_model_info["model"].__class__()
grid_search = GridSearchCV(
estimator=model_instance,
param_grid=param_grid,
scoring="f1",
cv=5,
n_jobs=-1,
verbose=1
)
grid_search.fit(X_train, y_train)
best_tuned_model = grid_search.best_estimator_
print("โ
Best Parameters Found:")
print(grid_search.best_params_)
# ๐ Evaluate tuned model
y_test_pred = best_tuned_model.predict(X_test)
if hasattr(best_tuned_model, "predict_proba"):
y_scores = best_tuned_model.predict_proba(X_test)[:, 1]
elif hasattr(best_tuned_model, "decision_function"):
y_scores = best_tuned_model.decision_function(X_test)
else:
y_scores = y_test_pred
cm = confusion_matrix(y_test, y_test_pred)
tn, fp, fn, tp = cm.ravel()
# Metrics
precision = precision_score(y_test, y_test_pred, zero_division=0)
recall = recall_score(y_test, y_test_pred, zero_division=0)
f1 = f1_score(y_test, y_test_pred, zero_division=0)
accuracy = accuracy_score(y_test, y_test_pred)
auc = roc_auc_score(y_test, y_scores)
specificity = tn / (tn + fp)
logloss = log_loss(y_test, y_scores)
# Add to model_results using the same nested structure as earlier entries
model_results[f"{model_name} (Tuned)"] = {
    "model": best_tuned_model,
    "metrics": {
        "test": {
            "accuracy": accuracy,
            "precision": precision,
            "recall": recall,
            "f1": f1,
            "roc_auc": auc,
            "specificity": specificity,
            "log_loss": logloss
        }
    }
}
# Evaluation summary
evaluate_model(y_test, y_test_pred, label=f"{model_name} (Tuned)")  # label the tuned model, not the leftover `name` from the earlier loop
plot_confusion(y_test, y_test_pred, model_name=f"{model_name} (Tuned)")
plot_roc_auc(best_tuned_model, X_test, y_test, model_name=f"{model_name} (Tuned)")
๐ง Running Grid Search for: XGBClassifier
Fitting 5 folds for each of 96 candidates, totalling 480 fits
โ Best Parameters Found:
{'colsample_bytree': 1.0, 'learning_rate': 0.01, 'max_depth': 5, 'n_estimators': 200, 'scale_pos_weight': 1, 'subsample': 0.8}
๐ XGBClassifier (Tuned) โ Performance Summary:
- Accuracy : 93.00% โ Overall correctness.
- Precision : 94.23% โ Of predicted '1', how many were right.
- Recall : 81.67% โ Of actual '1', how many we caught.
- F1 Score : 87.50% โ Balance of precision & recall.
๐ Interpretation:
- Precision looks acceptable; false positives under control.
- Recall is strong; model is catching true cases well.
- F1 Score shows overall tradeoff quality: 0.87
๐น ROC AUC Score for XGBClassifier (Tuned): 0.9700
๐ Interpretation: Model is doing a good job distinguishing between classes.
# best_model_info
Randomized Search selects a random subset of combinations to test, rather than all of them.
Itโs faster and often just as effective โ especially when only a few hyperparameters really matter.
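The cell below reuses the discrete param_grids defined for Grid Search, which is perfectly valid. A common refinement, sketched here purely as an illustration (the parameter ranges are assumptions, not values from the original run), is to pass scipy.stats distributions so that RandomizedSearchCV samples from continuous ranges:
# Sketch: sampling XGBoost hyperparameters from distributions instead of a fixed grid.
from scipy.stats import randint, uniform
from sklearn.model_selection import RandomizedSearchCV
from xgboost import XGBClassifier

xgb_param_dist = {
    "n_estimators": randint(100, 400),      # integers in [100, 400)
    "max_depth": randint(3, 8),             # integers in [3, 8)
    "learning_rate": uniform(0.01, 0.19),   # floats in [0.01, 0.20)
    "subsample": uniform(0.7, 0.3),         # floats in [0.7, 1.0)
    "colsample_bytree": uniform(0.7, 0.3),  # floats in [0.7, 1.0)
}
sketch_search = RandomizedSearchCV(
    estimator=XGBClassifier(eval_metric="logloss"),
    param_distributions=xgb_param_dist,
    n_iter=15, scoring="f1", cv=5, n_jobs=-1, random_state=42
)
# sketch_search.fit(X_train, y_train)  # uncomment to run on the same training split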
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint, uniform
from sklearn.metrics import (
precision_score, recall_score, f1_score, accuracy_score,
roc_auc_score, confusion_matrix, log_loss
)
# ๐ Use same param grid as defined earlier
model_name = best_model_info["model"].__class__.__name__
param_dist = param_grids.get(model_name)
if param_dist is None:
raise ValueError(f"No param distribution defined for model: {model_name}")
print(f"\n๐ฒ Running Randomized Search for: {model_name}")
# Create a new instance of the model
model_instance = best_model_info["model"].__class__()
# ๐ Run randomized search
random_search = RandomizedSearchCV(
estimator=model_instance,
param_distributions=param_dist,
n_iter=15,
scoring="f1",
cv=5,
n_jobs=-1,
verbose=1,
random_state=42
)
random_search.fit(X_train, y_train)
best_random_model = random_search.best_estimator_
print("โ
Best Parameters Found:")
print(random_search.best_params_)
# ๐ Evaluate tuned model
y_test_pred = best_random_model.predict(X_test)
if hasattr(best_random_model, "predict_proba"):
y_scores = best_random_model.predict_proba(X_test)[:, 1]
elif hasattr(best_random_model, "decision_function"):
y_scores = best_random_model.decision_function(X_test)
else:
y_scores = y_test_pred
cm = confusion_matrix(y_test, y_test_pred)
tn, fp, fn, tp = cm.ravel()
# Metrics
precision = precision_score(y_test, y_test_pred, zero_division=0)
recall = recall_score(y_test, y_test_pred, zero_division=0)
f1 = f1_score(y_test, y_test_pred, zero_division=0)
accuracy = accuracy_score(y_test, y_test_pred)
auc = roc_auc_score(y_test, y_scores)
specificity = tn / (tn + fp)
logloss = log_loss(y_test, y_scores)
# Store results using the same nested structure as earlier entries
model_results[f"{model_name} (RandomSearch)"] = {
    "model": best_random_model,
    "metrics": {
        "test": {
            "accuracy": accuracy,
            "precision": precision,
            "recall": recall,
            "f1": f1,
            "roc_auc": auc,
            "specificity": specificity,
            "log_loss": logloss
        }
    }
}
# Visual eval
evaluate_model(y_test, y_test_pred, label=f"{model_name} (RandomSearch)")
plot_confusion(y_test, y_test_pred, model_name=f"{model_name} (RandomSearch)")
plot_roc_auc(best_random_model, X_test, y_test, model_name=f"{model_name} (RandomSearch)")
๐ฒ Running Randomized Search for: XGBClassifier
Fitting 5 folds for each of 15 candidates, totalling 75 fits
โ Best Parameters Found:
{'subsample': 0.8, 'scale_pos_weight': 1, 'n_estimators': 100, 'max_depth': 5, 'learning_rate': 0.1, 'colsample_bytree': 1.0}
๐ XGBClassifier (RandomSearch) โ Performance Summary:
- Accuracy : 92.00% โ Overall correctness.
- Precision : 89.29% โ Of predicted '1', how many were right.
- Recall : 83.33% โ Of actual '1', how many we caught.
- F1 Score : 86.21% โ Balance of precision & recall.
๐ Interpretation:
- Precision looks acceptable; false positives under control.
- Recall is strong; model is catching true cases well.
- F1 Score shows overall tradeoff quality: 0.86
๐น ROC AUC Score for XGBClassifier (RandomSearch): 0.9594
๐ Interpretation: Model is doing a good job distinguishing between classes.
Ensembles are useful when:
- a single model's performance has plateaued
- different models make different kinds of errors, so combining them cancels some mistakes out
- you can afford the extra training and inference cost
Use ensembles after benchmarking individual models โ they add complexity but often yield better generalization.
A Voting Classifier combines predictions from multiple different models and makes a final decision based on majority vote (for classification) or average prediction (for regression).
There are two main types:
- Hard voting: each model votes for a class, and the majority class wins.
- Soft voting: the models' predicted probabilities are averaged, and the class with the highest average probability wins (requires models that support predict_proba).
Itโs like consulting multiple doctors and going with the consensus.
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
# Define voting type: 'hard' or 'soft'
voting_type = 'hard' # use 'soft' to average predicted probabilities instead of majority voting
# Define the ensemble
voting_clf = VotingClassifier(
estimators=[
('lr', LogisticRegression(max_iter=1000)),
('dt', DecisionTreeClassifier()),
('nb', GaussianNB())
],
voting=voting_type
)
# Train the ensemble
print(f"๐ง Training: Voting Classifier ({voting_type})")
voting_clf.fit(X_train, y_train)
# Predict labels
y_pred_voting = voting_clf.predict(X_test)
# Evaluate
plot_confusion(y_test, y_pred_voting, model_name=f"Voting Classifier ({voting_type})")
# Only plot ROC if model supports probability estimates
if voting_type == 'soft':
plot_roc_auc(voting_clf, X_test, y_test, model_name=f"Voting Classifier ({voting_type})")
evaluate_model(y_test, y_pred_voting, label=f"Voting Classifier ({voting_type})")
๐ง Training: Voting Classifier (hard)
๐ Voting Classifier (hard) โ Performance Summary:
- Accuracy : 88.00% โ Overall correctness.
- Precision : 90.91% โ Of predicted '1', how many were right.
- Recall : 66.67% โ Of actual '1', how many we caught.
- F1 Score : 76.92% โ Balance of precision & recall.
๐ Interpretation:
- Precision looks acceptable; false positives under control.
- Recall is strong; model is catching true cases well.
- F1 Score shows overall tradeoff quality: 0.77
Stacking involves training multiple models (called base models), and then using a meta-model to learn how to best combine their outputs.
Example:
- Base models such as Logistic Regression, a Decision Tree, and Naive Bayes each produce predictions.
- A meta-model (here, another Logistic Regression) is trained on those predictions to make the final call.
Itโs like having specialists give their opinions, and then a generalist makes the final call based on their inputs.
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
# Define base models
base_estimators = [
('lr', LogisticRegression(max_iter=1000)),
('dt', DecisionTreeClassifier()),
('nb', GaussianNB())
]
# Define meta-model (final estimator)
meta_model = LogisticRegression()
# Build stacking classifier
stacking_clf = StackingClassifier(
estimators=base_estimators,
final_estimator=meta_model,
passthrough=False, # set to True if you want raw features included in meta-model input
cv=5 # internal cross-validation
)
# Train the ensemble
print("๐ง Training: Stacking Classifier")
stacking_clf.fit(X_train, y_train)
# Predict labels
y_pred_stack = stacking_clf.predict(X_test)
# Evaluate
plot_confusion(y_test, y_pred_stack, model_name="Stacking Classifier")
plot_roc_auc(stacking_clf, X_test, y_test, model_name="Stacking Classifier")
evaluate_model(y_test, y_pred_stack, label="Stacking Classifier")
๐ง Training: Stacking Classifier
๐น ROC AUC Score for Stacking Classifier: 0.9429
๐ Interpretation: Model is doing a good job distinguishing between classes.
๐ Stacking Classifier โ Performance Summary:
- Accuracy : 93.50% โ Overall correctness.
- Precision : 88.52% โ Of predicted '1', how many were right.
- Recall : 90.00% โ Of actual '1', how many we caught.
- F1 Score : 89.26% โ Balance of precision & recall.
๐ Interpretation:
- Precision looks acceptable; false positives under control.
- Recall is strong; model is catching true cases well.
- F1 Score shows overall tradeoff quality: 0.89
Bagging (Bootstrap Aggregating) builds multiple versions of the same model (e.g., decision trees), each trained on a different random sample of the data.
Then it combines their outputs (usually by voting or averaging) to reduce overfitting and variance.
Random Forest is a popular example of bagging.
Itโs like asking the same expert multiple times under different conditions and averaging their answers.
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier # fast, default
from sklearn.linear_model import LogisticRegression # works well with linear patterns
from sklearn.neighbors import KNeighborsClassifier # unstable, benefits a lot from bagging
from sklearn.svm import SVC # slow with bagging, use carefully
from sklearn.naive_bayes import GaussianNB # rare with bagging (already stable)
from sklearn.ensemble import RandomForestClassifier # not recommended โ it's already bagged
# Example usage:
# base_estimator = LogisticRegression(max_iter=1000)
# base_estimator = KNeighborsClassifier()
# base_estimator = SVC(probability=True)
# base_estimator = GaussianNB()
# Define bagging classifier with decision trees
bagging_clf = BaggingClassifier(
estimator=DecisionTreeClassifier(), # renamed from 'base_estimator' in scikit-learn 1.2
n_estimators=50, # number of trees
max_samples=0.8, # bootstrap sample size
max_features=1.0, # use all features
random_state=42,
n_jobs=-1 # parallel processing
)
# Train the ensemble
print("๐ง Training: Bagging Classifier")
bagging_clf.fit(X_train, y_train)
# Predict
y_pred_bag = bagging_clf.predict(X_test)
# Evaluate
plot_confusion(y_test, y_pred_bag, model_name="Bagging Classifier")
plot_roc_auc(bagging_clf, X_test, y_test, model_name="Bagging Classifier")
evaluate_model(y_test, y_pred_bag, label="Bagging Classifier")
๐ง Training: Bagging Classifier
๐น ROC AUC Score for Bagging Classifier: 0.9685
๐ Interpretation: Model is doing a good job distinguishing between classes.
๐ Bagging Classifier โ Performance Summary:
- Accuracy : 92.00% โ Overall correctness.
- Precision : 87.93% โ Of predicted '1', how many were right.
- Recall : 85.00% โ Of actual '1', how many we caught.
- F1 Score : 86.44% โ Balance of precision & recall.
๐ Interpretation:
- Precision looks acceptable; false positives under control.
- Recall is strong; model is catching true cases well.
- F1 Score shows overall tradeoff quality: 0.86
Boosting trains models sequentially โ each new model focuses on correcting the mistakes of the previous one.
It gives more weight to errors and slowly builds a strong overall model by combining many weak ones.
Popular examples: XGBoost, AdaBoost, Gradient Boosting
Think of it as building knowledge step by step, learning from past failures to get better over time.
from sklearn.ensemble import GradientBoostingClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier
# Define the boosting classifier
boosting_clf = GradientBoostingClassifier(
n_estimators=100, # number of boosting rounds
learning_rate=0.1, # step size shrinkage
max_depth=3, # depth of each weak learner
subsample=1.0, # can be <1.0 for stochastic gradient boosting
random_state=42
)
# Train the ensemble
print("๐ง Training: Boosting Classifier")
boosting_clf.fit(X_train, y_train)
# Predict
y_pred_boost = boosting_clf.predict(X_test)
# Evaluate
plot_confusion(y_test, y_pred_boost, model_name="Boosting Classifier")
plot_roc_auc(boosting_clf, X_test, y_test, model_name="Boosting Classifier")
evaluate_model(y_test, y_pred_boost, label="Boosting Classifier")
๐ง Training: Boosting Classifier
๐น ROC AUC Score for Boosting Classifier: 0.9738
๐ Interpretation: Model is doing a good job distinguishing between classes.
๐ Boosting Classifier โ Performance Summary:
- Accuracy : 95.00% โ Overall correctness.
- Precision : 93.10% โ Of predicted '1', how many were right.
- Recall : 90.00% โ Of actual '1', how many we caught.
- F1 Score : 91.53% โ Balance of precision & recall.
๐ Interpretation:
- Precision looks acceptable; false positives under control.
- Recall is strong; model is catching true cases well.
- F1 Score shows overall tradeoff quality: 0.92
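AdaBoost is mentioned above but not shown in code. As a minimal, illustrative sketch (the settings are assumptions, and it simply reuses the same train/test split), it could be run like this:
# Sketch: AdaBoost on the same split; the default weak learner is a depth-1 decision tree.
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import f1_score

ada_clf = AdaBoostClassifier(n_estimators=100, learning_rate=0.1, random_state=42)
ada_clf.fit(X_train, y_train)
print(f"AdaBoost test F1: {f1_score(y_test, ada_clf.predict(X_test)):.4f}")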
Typical artifacts to export:
- the trained model object (.pkl or .joblib)
- the evaluation metrics (.json or .csv)
import joblib
import json
import os
export=False
if export:
# ๐ฆ Create export folder if it doesn't exist
os.makedirs("export", exist_ok=True)
# ๐พ Save the best model
joblib.dump(best_model, "export/best_model.joblib")
# ๐งฎ Prepare and save the evaluation metrics (exclude the model object)
metrics_copy = {k: v for k, v in model_results[best_model_name].items() if k != "model"}
with open("export/metrics.json", "w") as f:
json.dump(metrics_copy, f, indent=2)
print("โ
Model and metrics exported to /export/")
In real-world deployments, itโs crucial to track how your model behaves once itโs live.
What to log in production:
- incoming feature distributions, to catch data drift
- predicted labels and probabilities
- actual outcomes (when they arrive), to track live performance
- model version, timestamps, and prediction latency
You can integrate this with tools like MLflow, Prometheus/Grafana dashboards, or managed model-monitoring services.
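As a deliberately minimal illustration of the idea (the file name, version tag, and logged fields are assumptions, not part of the original pipeline), each prediction could be appended to a local CSV log like this; in practice you would swap this for your monitoring stack:
# Sketch: append one prediction record (inputs, output, score, timestamp) to a CSV log.
import csv
from datetime import datetime, timezone
from pathlib import Path

LOG_PATH = Path("export/prediction_log.csv")   # hypothetical log location
MODEL_VERSION = "xgboost-tuned-v1"             # hypothetical version tag

def log_prediction(features: dict, prediction, probability=None):
    """Append a single prediction record to the CSV log, writing a header on first use."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model_version": MODEL_VERSION,
        "prediction": prediction,
        "probability": probability,
        **features,
    }
    LOG_PATH.parent.mkdir(parents=True, exist_ok=True)
    write_header = not LOG_PATH.exists()
    with LOG_PATH.open("a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(record.keys()))
        if write_header:
            writer.writeheader()
        writer.writerow(record)

# Example usage (assumes X_test is a DataFrame and a fitted model is available):
# row = X_test.iloc[[0]]
# log_prediction(row.iloc[0].to_dict(), prediction=int(best_tuned_model.predict(row)[0]))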