
🧭 Causal Inference

  • 🎯 Introduction to Causal Inference

    • 🎓 What is Causal Inference?
    • 📌 Why go beyond Correlation?
    • 🧭 Real-world problems that need causality
  • 🧠 Core Concepts & Notation

    • 🧮 Treatment, Outcome, Units
    • 📐 Potential Outcomes (Rubin Causal Model)
    • 🧵 Fundamental Problem of Causal Inference
    • 🧠 Assumptions (SUTVA, Ignorability, Overlap)
  • 🧪 Simulated Dataset Setup

    • 🧬 Define treatment assignment logic
    • 🔬 Inject confounding intentionally
    • 🧊 Simulate potential outcomes + observed data
  • 🚫 Naive Estimation

    • ❌ Simple difference in means
    • ⚠️ Bias due to confounding
  • 🕸️ Causal Diagrams (DAGs)

    • 🧿 Quick primer on DAGs
    • 🕷️ Confounding vs. colliders vs. mediators
    • 🔗 What can/can't be estimated just from data
  • 🔁 Backdoor Adjustment Methods

    • 🧾 Conditioning on confounders
    • 🕵️‍♂️ Stratification / Subgroup analysis
    • 📊 Regression Adjustment
    • 📌 Propensity Score Matching (PSM)
  • 🎯 Instrumental Variables (IV)

    • 🪁 When backdoor paths can't be blocked
    • 🎯 Valid instrument conditions
    • 🧩 2-Stage Least Squares (2SLS)
  • 🧰 Double Machine Learning (DML)

    • 🪛 Use ML models for nuisance functions
    • 🧱 Residualization + orthogonalization logic
    • 🧲 When to prefer over traditional regression
  • 🌈 Heterogeneous Treatment Effects

    • 🎨 ATE vs. CATE vs. ITE
    • 🌟 Uplift models and use cases
    • 🧩 Tree-based methods (Causal Trees, Causal Forests)
  • 🧪 Placebo Tests & Robustness Checks

    • 🧻 Randomized placebo treatments
    • ⚗️ Sensitivity to unobserved confounding
  • 🧬 Counterfactual Thinking

    • 🤖 Predicting what would've happened
    • 🔁 Usage in recommendation & personalization
  • 📌 Closing Notes

    • 📝 Summary table of methods
    • 📋 When to use what
    • 📎 Causal vs Predictive mindset

🎯 Introduction to Causal Inference

🧠 Why this Notebook

Causal inference gives us the tools to answer "what if" questions, not just "what is." In product, policy, medicine, and science, we often need to act, and actions require understanding their consequences.

This field helps us:

  • Understand how and why outcomes change.
  • Move from data descriptions to data-driven interventions.
  • Avoid the trap of chasing noisy correlations.

This notebook builds up from first principles to practical methods, with enough grounding to reason clearly about experiments, models, and their assumptions.

🎓 What is Causal Inference?

🎓 Causal inference is the process of estimating the effect of one variable (the treatment) on another (the outcome), holding all else constant.

The core idea is to estimate:

What would the outcome have been if the treatment had (or had not) occurred?

Unlike correlation or predictive modeling:

  • It asks counterfactual questions: what would have happened under different scenarios.
  • It requires assumptions, design, and often randomization or clever statistical tricks.

At its heart, causal inference is about:

  • Designing better interventions
  • Estimating treatment effects
  • Avoiding misleading associational patterns

📌 Why go beyond Correlation?

📌 Correlation can be dangerous when used as a proxy for causation.

Example: Ice cream sales are correlated with shark attacks. Should we ban dessert? Clearly not: they're both caused by heatwaves (a confounder).

Correlation fails because it:

  • Ignores confounders (common causes of both variables)
  • Misses directionality (what affects what)
  • Can be driven by reverse causation or coincidence

Causal inference gives tools to:

  • Identify confounding
  • Design better studies (randomized or observational)
  • Interpret results in terms of actionable causes

🧭 Real-world problems that need causality

🧭 Correlation might be fine for dashboards. But when making decisions, causality is non-negotiable.

Examples:

  • Product: Did that new button placement increase checkout, or was it a seasonal effect?
  • Marketing: Did the email nudge lead to purchases, or did loyal users open it anyway?
  • Policy: Did a tax cut help the economy, or was it already improving?
  • Health: Does a drug reduce disease, or do healthier people tend to take it?

These questions involve interventions, and only causal methods can tell us what would've happened under a different choice.


🧠 Core Concepts & Notation

🧮 Treatment, Outcome, Units

  • Treatment (T): The intervention or condition being tested (e.g., new design, drug, policy).
  • Outcome (Y): The result or metric affected by the treatment (e.g., click, recovery, score).
  • Units (i): The entities receiving treatment and producing outcomes (e.g., users, patients, schools).

Each unit can receive a treatment or control, and we observe only one outcome, not both.

This framing is universal and applies whether you're testing emails, ads, or vaccines.

๐Ÿ“ Potential Outcomes (Rubin Causal Model)ยถ

๐Ÿ“‰ Click to Expand

The Potential Outcomes framework (aka Rubin Causal Model) imagines two parallel worlds for each unit:

  • Y(1): Outcome if treated
  • Y(0): Outcome if not treated

We define Individual Treatment Effect (ITE) as:

ITE = Y(1) - Y(0)

Key idea:

Each unit has both potential outcomes, but we can only observe one. The other is counterfactual.

This framework allows us to define:

  • ATE (Average Treatment Effect)
  • CATE (Conditional ATE, for subgroups)

And formalizes why causal inference is hard: we never see both outcomes.
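
To make ATE, CATE, and ITE concrete, here is a toy example with made-up numbers in which, unlike real data, both potential outcomes are visible:

In [ ]:
import pandas as pd

# Hypothetical potential-outcomes table; in real data only one of Y1/Y0 is observed
toy = pd.DataFrame({
    'Y1':    [12, 7, 9],          # outcome if treated
    'Y0':    [10, 7, 5],          # outcome if untreated
    'young': [True, True, False],
})

toy['ITE'] = toy['Y1'] - toy['Y0']           # per-unit effects: 2, 0, 4
print(toy['ITE'].mean())                     # ATE = 2.0
print(toy.loc[toy['young'], 'ITE'].mean())   # CATE for young units = 1.0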

🧵 Fundamental Problem of Causal Inference

The Fundamental Problem of Causal Inference:
For any individual, we can observe only one potential outcome, never both.

Example:

  • A user sees version A → you observe Y(0)
  • You'll never know what Y(1) would have been for that exact user

This creates a missing data problem: the counterfactual is unobservable.

To solve this, we rely on:

  • Randomization
  • Modeling + assumptions
  • Matching or weighting approaches

All causal methods are, in some way, trying to approximate the missing counterfactual.

🧠 Assumptions (SUTVA, Ignorability, Overlap)

Causal inference relies heavily on assumptions, even when you don't randomize.

Three core ones:

  • SUTVA (Stable Unit Treatment Value Assumption)
    → Your treatment doesn't affect someone else's outcome.
    → No interference across units.
  • Ignorability (a.k.a. Unconfoundedness)
    → Given the observed covariates, treatment assignment is as good as random.
    → This lets you use observed data for estimation.
  • Overlap (a.k.a. Positivity)
    → Every unit has a non-zero probability of receiving either treatment.
    → You can't learn effects where there's no variation.

Without these, causal estimates can be biased or undefined. Always question whether they hold before trusting results.
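
Of the three, overlap is the easiest to probe empirically: fit a propensity model and check where the scores pile up. A minimal self-contained sketch (assuming scikit-learn; all variable names here are illustrative):

In [ ]:
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))
T = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))  # treatment driven by X[:, 0]

# Estimated P(T=1 | X); scores stuck near 0 or 1 in either group signal poor overlap
ps = LogisticRegression().fit(X, T).predict_proba(X)[:, 1]
for t in (0, 1):
    print(f"T={t}: propensity range [{ps[T == t].min():.2f}, {ps[T == t].max():.2f}]")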


🧪 Simulated Dataset Setup

Simulating data is the best way to control the truth when learning causal inference.

Hereโ€™s why we simulate:

  • You get full knowledge of ground-truth treatment effects.
  • You can deliberately create confounding, bias, non-randomness.
  • You can practice recovering the true causal effect using different methods.

In real-world observational data, the "truth" is hidden. Simulating lets you debug your causal intuition safely before dealing with messy production datasets.

In [30]:
# Simulated Dataset Setup
import numpy as np
import pandas as pd

np.random.seed(42)  # for reproducibility

# We'll define features, treatment assignment, and outcomes step-by-step later

🧬 Define treatment assignment logic

To simulate treatment realistically:

  • Treatment should depend on observed features.
  • Treatment should not be random; otherwise there's no confounding to deal with.

For example:

  • Wealthier users might be more likely to receive a premium offer.
  • Healthier patients might be less likely to receive intensive care.

We'll simulate a non-random treatment assignment based on a few covariates to mimic real-world biases.

In [31]:
# Define covariates (features)
n = 5000

age = np.random.normal(40, 12, n)       # Age
income = np.random.normal(60000, 15000, n)  # Annual income
prior_engagement = np.random.beta(2, 5, n)  # Past engagement score [0,1]

# Treatment assignment probability based on features
treatment_prob = (
    0.3 * (income > 70000).astype(float) +
    0.2 * (prior_engagement > 0.5).astype(float) +
    0.1 * (age < 30).astype(float) +
    np.random.normal(0, 0.05, n)  # small noise
)
treatment_prob = np.clip(treatment_prob, 0, 1)

# Assign treatment
T = np.random.binomial(1, treatment_prob)

🔬 Inject confounding intentionally

In real-world datasets, treatment assignment is not random; it's confounded by covariates.

We deliberately inject confounding so that:

  • Covariates (age, income, engagement) affect both treatment and outcome.
  • If we naively compare treated vs untreated, we'll get biased results.

Confounding creates the need for adjustment, which will be a major theme later.

In [32]:
# Let's define a "true" baseline outcome based on the same covariates
base_outcome = (
    50 + 
    0.02 * income +
    5 * prior_engagement -
    0.3 * age +
    np.random.normal(0, 5, n)  # random noise
)

🧊 Simulate potential outcomes + observed data

In the Rubin Potential Outcomes framework, each unit has two outcomes:

  • Y(1) → If treated
  • Y(0) → If not treated

We can simulate this by:

  • Applying a true treatment effect to Y(1)
  • Leaving Y(0) as the base outcome

Important: We observe only one of Y(1) or Y(0), depending on treatment assignment (T).

In [33]:
# Define a true treatment effect (could vary by subgroup later)
true_treatment_effect = 10  # a flat +10 effect for everyone

# Simulate potential outcomes
Y_0 = base_outcome
Y_1 = base_outcome + true_treatment_effect

# Observed outcome based on treatment assignment
Y_obs = T * Y_1 + (1 - T) * Y_0

# Assemble into a dataframe
df = pd.DataFrame({
    'age': age,
    'income': income,
    'prior_engagement': prior_engagement,
    'T': T,
    'Y_obs': Y_obs,
    'Y_0': Y_0,
    'Y_1': Y_1,
})

df.head()
Out[33]:
         age        income  prior_engagement  T        Y_obs          Y_0          Y_1
0  45.960570  53643.604770          0.188077  0  1114.502287  1114.502287  1124.502287
1  38.340828  53198.788374          0.170389  0  1099.893323  1099.893323  1109.893323
2  47.772262  33065.352411          0.511379  0   704.133035   704.133035   714.133035
3  58.276358  55048.647124          0.318793  0  1139.203509  1139.203509  1149.203509
4  37.190160  70992.436227          0.384439  0  1464.308234  1464.308234  1474.308234


🚫 Naive Estimation

Many causal questions are first attacked by simply comparing the treated vs. untreated groups.

Naive Approach:

Average outcome of treated - Average outcome of untreated.

This looks simple, but in observational data:

  • Treated and untreated units are not comparable.
  • Treatment assignment was not randomized.
  • Differences in baseline characteristics confound the simple difference.

Naive estimation almost always gives biased results unless you have perfect randomization.

In this section, we'll see how bad the naive approach can get even on a simple synthetic dataset.

In [34]:
# Quick naive estimation
treated_mean = df.loc[df['T'] == 1, 'Y_obs'].mean()
control_mean = df.loc[df['T'] == 0, 'Y_obs'].mean()

naive_diff = treated_mean - control_mean

print(f"Naive difference in means: {naive_diff:.2f}")
print(f"True treatment effect (ground truth): {true_treatment_effect}")
Naive difference in means: 250.53
True treatment effect (ground truth): 10

โŒ Simple difference in meansยถ

๐Ÿ“‰ Click to Expand

A simple difference in means is mathematically:

E[Y | T=1] - E[Y | T=0]

If treatment assignment were random:

  • The two groups would be exchangeable.
  • Baseline covariates would balance on average.
  • The simple difference would be an unbiased estimator of ATE.

But if treatment is confounded, then:

  • T=1 units may systematically differ from T=0 units.
  • The naive estimator picks up both causal effect and selection bias.

We'll soon quantify how large this bias can be.

In [35]:
# Let's quickly visualize how the treated vs control groups differ in covariates
df.groupby('T')[['age', 'income', 'prior_engagement']].mean()
Out[35]:
         age        income  prior_engagement
T                                           
0  40.389926  58304.201745          0.278891
1  37.892227  70283.224987          0.330779

โš ๏ธ Bias due to confoundingยถ

๐Ÿ“‰ Click to Expand

Bias from confounding happens when:

  • The treated group has systematically different baseline outcomes than the control group.

Mathematically:

Observed difference = True treatment effect + Bias from baseline differences

In our simulation:

  • Higher income users are more likely to be treated.
  • Income also directly influences outcome.
  • Therefore, the observed difference overstates the real effect.

This is why adjusting for confounders is critical: naive methods can easily mislead interventions and business decisions.

In [36]:
# Let's calculate the *average baseline outcome* (Y_0) in treated vs untreated groups
treated_baseline = df.loc[df['T'] == 1, 'Y_0'].mean()
control_baseline = df.loc[df['T'] == 0, 'Y_0'].mean()

baseline_diff = treated_baseline - control_baseline

print(f"Baseline (Y_0) difference between treated and control: {baseline_diff:.2f}")
print("This baseline imbalance creates bias in naive estimation.")
Baseline (Y_0) difference between treated and control: 240.53
This baseline imbalance creates bias in naive estimation.


🕸️ Causal Diagrams (DAGs)

Directed Acyclic Graphs (DAGs) are a compact way to represent assumptions about the data generating process.

  • Nodes = variables
  • Edges (arrows) = direct causal influence

DAGs are not learned from data. They are drawn from domain knowledge to help reason about:

  • Confounders
  • Biases
  • Valid adjustment strategies

Almost every causal inference method implicitly or explicitly assumes a DAG about the world.

🧿 Quick primer on DAGs

A DAG (Directed Acyclic Graph) encodes assumptions about how variables causally relate.

  • Directed: Arrows have direction (cause → effect).
  • Acyclic: No feedback loops allowed (you can't return to a node).

Example:

Age → Income → Health

Means:

  • Age affects income.
  • Income affects health.
  • No reverse paths.

DAGs help identify:

  • Which paths are confounded
  • Which variables to control for
  • Whether effects are identifiable

They act like a map, letting you plan causal estimation strategies intelligently.
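
As a quick sketch (assuming the networkx package, which this notebook doesn't otherwise use), the example DAG can be written down and sanity-checked in code:

In [ ]:
import networkx as nx

# Encode Age → Income → Health as a directed graph
G = nx.DiGraph([("Age", "Income"), ("Income", "Health")])

print(nx.is_directed_acyclic_graph(G))                 # True: no feedback loops
print(list(nx.all_simple_paths(G, "Age", "Health")))   # [['Age', 'Income', 'Health']]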

🕷️ Confounding vs. colliders vs. mediators

Confounders:

  • Variables that influence both treatment and outcome.
  • Must be adjusted for to block bias.
  • Example: Age confounds the relationship between Exercise (T) and Health (Y).

Colliders:

  • Variables caused by two other variables.
  • Must NOT adjust for colliders; doing so opens spurious associations.
  • Example: Adjusting for "hospitalization" might introduce bias when studying Smoking → Lung Disease.

Mediators:

  • Variables on the causal pathway between treatment and outcome.
  • Adjusting for them blocks part of the causal effect you want to measure.
  • Example: Exercise → Fitness → Health (fitness is a mediator).

👉 Correct adjustment requires distinguishing among these roles.
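
Collider bias is easy to demonstrate numerically. In this illustrative simulation (made-up variables), two independent causes become correlated once we condition on their common effect:

In [ ]:
import numpy as np

rng = np.random.default_rng(0)
smoking = rng.normal(size=100_000)
genetics = rng.normal(size=100_000)        # independent of smoking by construction
hospital = smoking + genetics > 1.0        # collider: caused by both

print(np.corrcoef(smoking, genetics)[0, 1])                      # ~0 overall
print(np.corrcoef(smoking[hospital], genetics[hospital])[0, 1])  # clearly negative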

🔗 What can/can't be estimated just from data

Not everything is identifiable from data alone; assumptions are unavoidable.

What can be estimated:

  • Associations (correlations, patterns)
  • Conditional independence structures
  • Causal effects if the right covariates are controlled (based on DAG structure)

What cannot be estimated:

  • Whether a relationship is causal (without assumptions)
  • The full structure of a DAG (unless randomized experiments are used)

Data + assumptions → Causal conclusions.
Data alone → Only correlational findings.

DAGs clarify where you need domain knowledge vs where data suffices.


๐Ÿ” Backdoor Adjustment Methodsยถ

๐Ÿ“‰ Click to Expand

Backdoor adjustment methods aim to block backdoor paths โ€” non-causal paths that create bias between treatment and outcome.

Core idea:

  • Identify variables (confounders) that open backdoor paths.
  • Condition on them, either by stratifying, modeling, or matching.

Backdoor adjustment simulates what would happen if treatment assignment were random within levels of the confounders.

It's the foundational idea behind:

  • Regression
  • Matching
  • Stratification
  • Propensity scores

If you can block all backdoor paths, you can estimate causal effects from observational data reliably.

🧾 Conditioning on confounders

Conditioning means holding confounders constant when comparing treated vs untreated units.

Examples:

  • Comparing treated vs untreated users within each income band.
  • Comparing recovery rates within each age group.

By conditioning, you eliminate the variation due to confounders, isolating the causal effect.

Important:
You should only condition on true confounders, not colliders or mediators.

Conditioning can be implemented via:

  • Subgrouping
  • Regression
  • Matching
  • Weighting

In [37]:
# Simple conditioning by subgroup (example: income > 70k vs <= 70k)
df['high_income'] = (df['income'] > 70000).astype(int)

grouped = df.groupby(['high_income', 'T'])['Y_obs'].mean().unstack()
grouped['diff'] = grouped[1] - grouped[0]

print("Simple Conditional Difference by Income Group:")
print(grouped[['diff']])
Simple Conditional Difference by Income Group:
T                 diff
high_income           
0            16.867043
1            13.563878

🕵️‍♂️ Stratification / Subgroup analysis

Stratification means breaking the dataset into buckets based on confounders and comparing treatment effects within each bucket.

Typical steps:

  1. Divide data based on a confounder (e.g., low vs high engagement).
  2. Within each stratum, compute treated vs control differences.
  3. Aggregate across strata (weighted average; see the sketch after the output below).

When useful:

  • When confounders are categorical or easily discretized.
  • When interpretability is important.

Limits:

  • Doesn't scale well with many confounders (curse of dimensionality).

In [38]:
# Stratify based on prior_engagement (simple high vs low)
df['high_engagement'] = (df['prior_engagement'] > 0.5).astype(int)

strat_grouped = df.groupby(['high_engagement', 'T'])['Y_obs'].mean().unstack()
strat_grouped['diff'] = strat_grouped[1] - strat_grouped[0]

print("Conditional Difference by Engagement Group:")
print(strat_grouped[['diff']])
Conditional Difference by Engagement Group:
T                      diff
high_engagement            
0                290.043114
1                145.885304
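
The steps above stop at per-stratum differences. The final aggregation step, a stratum-size-weighted average, can be sketched with the strat_grouped table already computed (note the estimate stays far from the truth here because income, the main confounder, is not in the strata):

In [ ]:
# Weight each stratum's treated-vs-control difference by its share of the data
weights = df['high_engagement'].value_counts(normalize=True)
stratified_estimate = (strat_grouped['diff'] * weights).sum()
print(f"Stratified (weighted) estimate: {stratified_estimate:.2f}")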

📊 Regression Adjustment

Regression adjustment estimates causal effects by controlling for confounders via regression.

Simple linear model:

Y = β₀ + β₁·T + β₂·(confounder1) + β₃·(confounder2) + ... + ε

  • β₁ captures the adjusted effect of treatment, controlling for confounders.
  • It removes bias from observable confounders (under correct model specification).

Advantages:

  • Easy to use.
  • Scales to many covariates.

Risks:

  • Sensitive to model misspecification.
  • Wrong functional forms (nonlinearities, interactions) can bias estimates.

In [39]:
import statsmodels.api as sm

X = df[['T', 'age', 'income', 'prior_engagement']]
X = sm.add_constant(X)
y = df['Y_obs']

reg_model = sm.OLS(y, X).fit()

print(reg_model.summary())

print(f"\nEstimated treatment effect (ฮฒโ‚) after adjustment: {reg_model.params['T']:.2f}")
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  Y_obs   R-squared:                       1.000
Model:                            OLS   Adj. R-squared:                  1.000
Method:                 Least Squares   F-statistic:                 4.652e+06
Date:                Sat, 26 Apr 2025   Prob (F-statistic):               0.00
Time:                        17:50:40   Log-Likelihood:                -15122.
No. Observations:                5000   AIC:                         3.025e+04
Df Residuals:                    4995   BIC:                         3.029e+04
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
====================================================================================
                       coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------
const               50.3660      0.398    126.398      0.000      49.585      51.147
T                   10.0641      0.220     45.761      0.000       9.633      10.495
age                 -0.2942      0.006    -49.784      0.000      -0.306      -0.283
income               0.0200   4.83e-06   4142.593      0.000       0.020       0.020
prior_engagement     4.3802      0.452      9.681      0.000       3.493       5.267
==============================================================================
Omnibus:                        5.657   Durbin-Watson:                   1.967
Prob(Omnibus):                  0.059   Jarque-Bera (JB):                5.136
Skew:                          -0.029   Prob(JB):                       0.0767
Kurtosis:                       2.854   Cond. No.                     4.38e+05
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 4.38e+05. This might indicate that there are
strong multicollinearity or other numerical problems.

Estimated treatment effect (β₁) after adjustment: 10.06

📌 Propensity Score Matching (PSM)

Propensity Score Matching (PSM) is a two-step procedure:

  1. Model the probability of receiving treatment (P(T=1 | X)) using observed covariates.
  2. Match treated and control units with similar propensity scores.

Why PSM?

  • Instead of adjusting for many covariates separately, you balance treated and control groups on a single dimension (the propensity score).

When useful:

  • When covariate space is high-dimensional.
  • When you want a matched sample that resembles randomized data.

Limitations:

  • Requires good overlap (common support).
  • Still relies on the unconfoundedness assumption.

In [40]:
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

# Step 1: Estimate propensity scores
ps_model = LogisticRegression()
ps_model.fit(df[['age', 'income', 'prior_engagement']], df['T'])
df['propensity_score'] = ps_model.predict_proba(df[['age', 'income', 'prior_engagement']])[:,1]

# Step 2: Nearest neighbor matching
treated = df[df['T'] == 1]
control = df[df['T'] == 0]

nn = NearestNeighbors(n_neighbors=1)
nn.fit(control[['propensity_score']])

distances, indices = nn.kneighbors(treated[['propensity_score']])
matched_control = control.iloc[indices.flatten()]

# Calculate matched difference
matched_diff = (treated['Y_obs'].values - matched_control['Y_obs'].values).mean()

print(f"Propensity Score Matched Estimate of Treatment Effect: {matched_diff:.2f}")
Propensity Score Matched Estimate of Treatment Effect: 197.50
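
Note that the matched estimate above lands far from the true effect of 10. One plausible reason is that the logistic model was fit on raw, unscaled covariates, and every treated unit was matched regardless of match quality. A hedged refinement (reusing df and the imports from the cell above; the caliper value is a hypothetical tolerance, not a recommendation) standardizes features and drops poor matches:

In [ ]:
from sklearn.preprocessing import StandardScaler

# Refit the propensity model on standardized covariates
X_ps = StandardScaler().fit_transform(df[['age', 'income', 'prior_engagement']])
df['ps_scaled'] = LogisticRegression().fit(X_ps, df['T']).predict_proba(X_ps)[:, 1]

treated = df[df['T'] == 1]
control = df[df['T'] == 0]
nn = NearestNeighbors(n_neighbors=1).fit(control[['ps_scaled']])
dist, idx = nn.kneighbors(treated[['ps_scaled']])

caliper = 0.01  # hypothetical: discard pairs whose scores differ by more than this
keep = dist.flatten() < caliper
matched_diff_cal = (treated['Y_obs'].values[keep]
                    - control['Y_obs'].values[idx.flatten()[keep]]).mean()
print(f"Caliper-matched estimate ({keep.sum()} pairs kept): {matched_diff_cal:.2f}")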


🎯 Instrumental Variables (IV)

Instrumental Variables (IV) methods are used when simple adjustment for confounders is impossible or not credible.

When treatment is endogenous (affected by unobserved factors also affecting outcome), traditional methods like regression fail.

IV solves this by:

  • Using a "proxy" (instrument) that affects treatment but is otherwise unrelated to the outcome except through treatment.
  • "Re-randomizing" variation in treatment based on the instrument.

You create quasi-randomization even in observational data.

Classic examples:

  • Distance to hospital → instrument for getting surgery.
  • Random assignment of judges → instrument for sentencing severity.

๐Ÿช When backdoor paths canโ€™t be blockedยถ

๐Ÿ“‰ Click to Expand

You need IV methods when:

  • There are unobserved confounders you can't measure.
  • No set of observed covariates satisfies ignorability.
  • Standard backdoor adjustment will be biased.

Example:

  • Studying the effect of education on income: natural intelligence is a hidden confounder (affects both education and income).
  • You can't just regress income ~ education; bias remains.

Key realization:
If backdoor paths exist through unobserved variables, IV becomes necessary.

🎯 Valid instrument conditions

For an instrument (Z) to be valid, it must satisfy:

  1. Relevance:
    Z must affect treatment T.
    (There must be a first-stage effect.)
  2. Exclusion Restriction:
    Z must affect the outcome Y only through T.
    (No direct path from Z to Y.)
  3. Independence (As-if Randomness):
    Z must be independent of unobserved confounders affecting Y.

If any of these fail:

  • IV estimates are biased or meaningless.
  • You can't fix bad instruments with bigger sample sizes.

Choosing or arguing a valid instrument is 90% of the IV battle.

🧩 2-Stage Least Squares (2SLS)

2SLS (Two-Stage Least Squares) is the classic estimation procedure for IV:

  • Stage 1:
    Regress treatment T on instrument Z (and any controls)
    → get predicted treatment (T̂)
  • Stage 2:
    Regress outcome Y on predicted treatment (T̂)

The second-stage coefficient gives the causal effect of treatment on outcome, isolating variation driven by the instrument.

Warning:

  • Standard regression software doesn't correct standard errors properly when doing 2SLS manually.
  • Packages like linearmodels automate this (see the sketch after the results below).

In [41]:
from statsmodels.api import OLS, add_constant

# Simulate an instrument Z (let's assume it's random and satisfies conditions)
np.random.seed(42)
df['Z'] = np.random.binomial(1, 0.5, size=len(df))

# Make treatment depend partly on Z
df['T_iv'] = (0.5 * df['Z'] + 0.5 * df['prior_engagement'] + np.random.normal(0, 0.1, len(df))) > 0.5
df['T_iv'] = df['T_iv'].astype(int)

# Stage 1: Predict treatment from instrument
X_stage1 = add_constant(df[['Z', 'age', 'income', 'prior_engagement']])
stage1_model = OLS(df['T_iv'], X_stage1).fit()
df['T_hat'] = stage1_model.predict(X_stage1)

# Stage 2: Predict outcome from predicted treatment
X_stage2 = add_constant(df[['T_hat', 'age', 'income', 'prior_engagement']])
stage2_model = OLS(df['Y_obs'], X_stage2).fit()

print(stage2_model.summary())

print(f"\nEstimated causal effect (via 2SLS): {stage2_model.params['T_hat']:.2f}")
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  Y_obs   R-squared:                       1.000
Model:                            OLS   Adj. R-squared:                  1.000
Method:                 Least Squares   F-statistic:                 3.278e+06
Date:                Sat, 26 Apr 2025   Prob (F-statistic):               0.00
Time:                        17:50:40   Log-Likelihood:                -15996.
No. Observations:                5000   AIC:                         3.200e+04
Df Residuals:                    4995   BIC:                         3.203e+04
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
====================================================================================
                       coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------
const               48.1296      0.474    101.498      0.000      47.200      49.059
T_hat                0.2548      0.197      1.296      0.195      -0.131       0.640
age                 -0.3133      0.007    -44.628      0.000      -0.327      -0.300
income               0.0201   5.54e-06   3619.407      0.000       0.020       0.020
prior_engagement     6.6954      0.540     12.391      0.000       5.636       7.755
==============================================================================
Omnibus:                       71.333   Durbin-Watson:                   1.970
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               74.381
Skew:                           0.287   Prob(JB):                     7.05e-17
Kurtosis:                       3.169   Cond. No.                     4.33e+05
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 4.33e+05. This might indicate that there are
strong multicollinearity or other numerical problems.

Estimated causal effect (via 2SLS): 0.25
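
A note on this output: Y_obs was generated from the original treatment T, not from T_iv, so the true effect of T_iv on Y_obs is zero, and the small, statistically insignificant coefficient (0.25, p ≈ 0.20) is exactly what an honest 2SLS should report here. For properly corrected standard errors, a one-call version via the linearmodels package (assumed installed; not used elsewhere in this notebook) would look like:

In [ ]:
from linearmodels.iv import IV2SLS

# Exogenous controls enter directly; T_iv is instrumented by Z
iv_res = IV2SLS(
    dependent=df['Y_obs'],
    exog=add_constant(df[['age', 'income', 'prior_engagement']]),
    endog=df['T_iv'],
    instruments=df['Z'],
).fit()
print(iv_res.params['T_iv'], iv_res.std_errors['T_iv'])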


🧰 Double Machine Learning (DML)

Double Machine Learning (DML) is a modern causal estimation technique that:

  • Separates the modeling of treatment and outcome.
  • Uses flexible machine learning models to control for complex confounders.
  • Debiases the final treatment effect estimation by orthogonalization.

Why DML matters:

  • Traditional linear regression forces linearity.
  • DML allows for nonlinear, high-dimensional adjustment without overfitting causal estimates.

It builds robust treatment effect estimators even when you use ML methods like Random Forests, XGBoost, or Neural Nets for intermediate steps.

🪛 Use ML models for nuisance functions

In DML, you model two "nuisance functions" first:

  1. Outcome model: Y ~ X
  2. Treatment model: T ~ X

You can use any ML model (linear regression, random forest, gradient boosting, etc.) for these.

Key point:
The goal is accurate prediction, not causal interpretation, at this stage.

Later, DML uses the residuals from these models to isolate the causal effect of T on Y.

This two-step process protects the final estimate from overfitting to noisy high-dimensional features.

In [42]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Define features
features = ['age', 'income', 'prior_engagement']

# Split into train/test for honest estimation
# (same test_size and random_state in both calls, so the T split stays row-aligned with X and y)
X_train, X_test, y_train, y_test = train_test_split(df[features], df['Y_obs'], test_size=0.3, random_state=42)
T_train, T_test = train_test_split(df['T'], test_size=0.3, random_state=42)

# Outcome model: Y ~ X
y_model = RandomForestRegressor()
y_model.fit(X_train, y_train)
df['y_hat'] = y_model.predict(df[features])

# Treatment model: T ~ X
t_model = RandomForestRegressor()
t_model.fit(X_train, T_train)
df['t_hat'] = t_model.predict(df[features])

🧱 Residualization + orthogonalization logic

After fitting nuisance models:

  • Calculate residuals:
    • Residual_Y = Y - Ŷ
    • Residual_T = T - T̂
  • Then regress Residual_Y ~ Residual_T.

Why?

  • This removes the part of Y and T that is predictable from X.
  • What remains captures the "clean" causal variation of T on Y, orthogonal to confounders.

This two-stage process is called orthogonalization; it minimizes bias from overfitting nuisance functions.

It's a key innovation that separates DML from naive ML-based adjustment.

In [43]:
# Calculate residuals
df['residual_Y'] = df['Y_obs'] - df['y_hat']
df['residual_T'] = df['T'] - df['t_hat']

# Final stage: regress residual_Y ~ residual_T
X_resid = sm.add_constant(df['residual_T'])
y_resid = df['residual_Y']

residual_model = sm.OLS(y_resid, X_resid).fit()

print(residual_model.summary())

print(f"\nEstimated causal effect via DML: {residual_model.params['residual_T']:.2f}")
                            OLS Regression Results                            
==============================================================================
Dep. Variable:             residual_Y   R-squared:                       0.127
Model:                            OLS   Adj. R-squared:                  0.127
Method:                 Least Squares   F-statistic:                     726.5
Date:                Sat, 26 Apr 2025   Prob (F-statistic):          1.62e-149
Time:                        17:50:41   Log-Likelihood:                -14928.
No. Observations:                5000   AIC:                         2.986e+04
Df Residuals:                    4998   BIC:                         2.987e+04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.0842      0.068      1.243      0.214      -0.049       0.217
residual_T     8.9188      0.331     26.953      0.000       8.270       9.567
==============================================================================
Omnibus:                     7090.850   Durbin-Watson:                   1.998
Prob(Omnibus):                  0.000   Jarque-Bera (JB):          9991352.741
Skew:                           7.694   Prob(JB):                         0.00
Kurtosis:                     221.453   Cond. No.                         4.88
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Estimated causal effect via DML: 8.92
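
One caveat: the nuisance models above were fit on a train split but then used to predict every row, including their own training data. Full DML uses cross-fitting, where each observation gets out-of-fold nuisance predictions. A hedged sketch with scikit-learn's cross_val_predict (reusing features, df, and sm from earlier cells):

In [ ]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_predict

# Out-of-fold predictions: each row is scored by models that never saw it
y_hat_cf = cross_val_predict(RandomForestRegressor(), df[features], df['Y_obs'], cv=5)
t_hat_cf = cross_val_predict(RandomForestRegressor(), df[features], df['T'], cv=5)

res_Y = df['Y_obs'] - y_hat_cf
res_T = df['T'] - t_hat_cf
cf_model = sm.OLS(res_Y, sm.add_constant(res_T)).fit()
print(f"Cross-fitted DML estimate: {cf_model.params.iloc[1]:.2f}")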

🧲 When to prefer over traditional regression

You should prefer DML over traditional regression when:

  • High-dimensional confounders (lots of features) exist.
  • Nonlinear relationships are likely between covariates and treatment/outcome.
  • Flexible modeling is important (tree-based, neural nets, etc.)
  • You are concerned about model misspecification in simple linear regression.

Traditional regression assumes:

  • Linear relationships
  • No complex interactions unless explicitly modeled

DML frees you from strict parametric forms, allowing modern ML models while still aiming for valid causal estimates.

✅ DML shines in modern settings (tech products, healthcare, online platforms) where datasets are messy, rich, and big.


🌈 Heterogeneous Treatment Effects

Until now, we've talked about the average effect of treatment across the entire population (ATE).

But in reality:

  • Different users respond differently.
  • Treatment effects vary by user characteristics.

Heterogeneous Treatment Effects (HTE) study how effects vary:

  • Across groups (e.g., high engagement vs low engagement)
  • Across individuals (personalized effects)

Estimating HTE is critical for:

  • Personalized recommendations
  • Smart targeting (marketing, healthcare, product launches)

🎨 ATE vs. CATE vs. ITE

Different layers of treatment effect granularity:

  • ATE (Average Treatment Effect):
    Average effect across everyone.
  • CATE (Conditional Average Treatment Effect):
    Average effect given some subgroup (e.g., CATE for users <30 years old).
  • ITE (Individual Treatment Effect):
    Effect for a specific user.

In practice:

  • ATE is easiest to estimate.
  • CATEs are often actionable (targeted marketing).
  • ITEs are the hardest: noisy and high-variance.

Good causal inference methods can recover CATEs/ITEs if enough data and signal exist.

In [44]:
# Calculate ATE (simple diff from simulation ground truth)
ate = (df['Y_1'] - df['Y_0']).mean()
print(f"True ATE: {ate:.2f}")

# Calculate CATE for high engagement group
cate_high_engagement = (df[df['prior_engagement'] > 0.5]['Y_1'] - df[df['prior_engagement'] > 0.5]['Y_0']).mean()
print(f"CATE for high engagement users: {cate_high_engagement:.2f}")

# Show a few ITEs
df['ITE_true'] = df['Y_1'] - df['Y_0']
print("\nSample ITEs:")
print(df[['age', 'income', 'prior_engagement', 'ITE_true']].head())
True ATE: 10.00
CATE for high engagement users: 10.00

Sample ITEs:
         age        income  prior_engagement  ITE_true
0  45.960570  53643.604770          0.188077      10.0
1  38.340828  53198.788374          0.170389      10.0
2  47.772262  33065.352411          0.511379      10.0
3  58.276358  55048.647124          0.318793      10.0
4  37.190160  70992.436227          0.384439      10.0

🌟 Uplift models and use cases

Uplift modeling directly models the difference in probability of a positive outcome between treated and untreated users.

Instead of modeling outcome probabilities separately, uplift models focus on:

  • Who is most persuadable?
  • Who would change behavior because of treatment?

Where uplift models shine:

  • Marketing campaigns (maximize conversions per dollar)
  • Customer retention (target save offers only to those who would churn)
  • Medical interventions (target high-risk patients)

Typical techniques:

  • Uplift Decision Trees
  • Two-model approach (predict Y|T=1 and Y|T=0 separately, then subtract; see the sketch after this list)
  • Causal Forests
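
A minimal two-model (T-learner) sketch on the simulated df, reusing RandomForestRegressor from earlier; the uplift scores are illustrative, not tuned:

In [ ]:
from sklearn.ensemble import RandomForestRegressor

feats = ['age', 'income', 'prior_engagement']

# Fit separate outcome models on treated and control rows
m1 = RandomForestRegressor(random_state=0).fit(df.loc[df['T'] == 1, feats],
                                               df.loc[df['T'] == 1, 'Y_obs'])
m0 = RandomForestRegressor(random_state=0).fit(df.loc[df['T'] == 0, feats],
                                               df.loc[df['T'] == 0, 'Y_obs'])

# Uplift = predicted outcome if treated minus predicted outcome if untreated
df['uplift_two_model'] = m1.predict(df[feats]) - m0.predict(df[feats])
print(df['uplift_two_model'].describe())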

🧩 Tree-based methods (Causal Trees, Causal Forests)

Tree-based methods are powerful for discovering treatment effect heterogeneity:

  • Causal Trees:
    • Split data to maximize treatment effect differences between branches.
    • One tree trained specifically for causal splits.
  • Causal Forests:
    • Ensemble of causal trees.
    • Averages treatment effect estimates across trees.
    • Reduces variance compared to a single tree.

They can estimate CATEs reliably across different subgroups without manually specifying interactions.


When useful:

  • You expect heterogeneity but don't know in advance how to segment.
  • You want flexible, interpretable treatment effect estimation.

In [45]:
# !pip install econml

In [46]:
# Causal Forest: Full Correct Code

from econml.grf import CausalForest
from sklearn.model_selection import train_test_split

# Prepare features and outcome
X = df[['age', 'income', 'prior_engagement']].values  # Features (2D)
T = df['T'].values  # Treatment (1D)
Y = df['Y_obs'].values  # Observed outcome (1D)

# Fit causal forest
forest = CausalForest(n_estimators=100, random_state=42)
forest.fit(X, T, Y)  # Correct order: X, T, Y

# Predict treatment effects (CATEs)
cate_preds = forest.predict(X)

# Store predictions
df['CATE_predicted'] = cate_preds

# Display sample predictions
print("\nSample predicted CATEs:")
print(df[['age', 'income', 'prior_engagement', 'CATE_predicted']].head())
Sample predicted CATEs:
         age        income  prior_engagement  CATE_predicted
0  45.960570  53643.604770          0.188077       15.767683
1  38.340828  53198.788374          0.170389       18.062330
2  47.772262  33065.352411          0.511379       -4.625701
3  58.276358  55048.647124          0.318793       14.400047
4  37.190160  70992.436227          0.384439       11.047467


🧪 Placebo Tests & Robustness Checks

Even after careful causal estimation, you must ask:

  • Was it a real effect?
  • Could hidden bias still exist?

Robustness checks build confidence that your findings are not artifacts of modeling choices, random noise, or hidden confounders.

Placebo tests simulate situations where you expect no effect; if you detect an effect there, something's wrong.

Robust causal analysis is not just about point estimates; it's about proving to yourself that you aren't fooling yourself.

🧻 Randomized placebo treatments

Placebo tests inject "fake" treatments to validate your method.

Idea:

  • Randomly assign a placebo treatment.
  • Re-estimate the treatment effect.
  • Expect no significant effect if your method is honest.

If your model finds strong effects even when treatment is randomized, your pipeline is leaking bias or overfitting.


Placebo Tests Are Critical:

  • They detect specification errors.
  • They detect uncontrolled confounding.
  • They expose overfitting to noise.

Placebo tests are a basic but powerful check, and always worth doing.

In [47]:
# Create a random placebo treatment
np.random.seed(123)
df['placebo_T'] = np.random.binomial(1, 0.5, size=len(df))

# Estimate naive difference for placebo treatment
placebo_treated_mean = df.loc[df['placebo_T'] == 1, 'Y_obs'].mean()
placebo_control_mean = df.loc[df['placebo_T'] == 0, 'Y_obs'].mean()

placebo_naive_diff = placebo_treated_mean - placebo_control_mean

print(f"Placebo test: naive difference in means = {placebo_naive_diff:.2f}")

# Ideally close to zero if model is unbiased
Placebo test: naive difference in means = 7.79
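
The naive placebo gap of 7.79 is small relative to the outcome's spread (the outcome standard deviation is roughly 300 in this simulation), but a sharper check runs the placebo through the same adjusted pipeline. A sketch reusing the regression-adjustment setup from earlier:

In [ ]:
# Re-run the regression adjustment with the fake treatment:
# an honest pipeline should return a coefficient near zero.
X_placebo = sm.add_constant(df[['placebo_T', 'age', 'income', 'prior_engagement']])
placebo_model = sm.OLS(df['Y_obs'], X_placebo).fit()
print(f"Placebo coefficient: {placebo_model.params['placebo_T']:.2f} "
      f"(p = {placebo_model.pvalues['placebo_T']:.3f})")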

⚗️ Sensitivity to unobserved confounding

Even after adjusting for observed confounders, unobserved variables can still bias causal estimates.

Sensitivity analysis asks:

  • How strong would hidden confounding have to be to overturn my results?

Typical approaches:

  • Simulate hidden confounders and see effect size shifts.
  • Use formulas (like Rosenbaum bounds) to quantify robustness.

In practical data science:

  • Simulate scenarios with added fake bias.
  • Stress test conclusions under "worst plausible" hidden biases.

If your conclusions survive plausible levels of unobserved bias, they are more credible.

Important mindset:
No analysis is perfect; the goal is to understand limits, not pretend away uncertainty.

In [48]:
# Simulate a hidden variable and add it to the outcome
# (note: it is drawn independently of T here, so it mostly adds noise rather than bias)
np.random.seed(42)
df['hidden_confounder'] = np.random.normal(0, 1, size=len(df))

# Make the outcome depend slightly on this hidden confounder
df['Y_obs_biased'] = df['Y_obs'] + 2 * df['hidden_confounder']

# Re-run naive difference with biased outcome
treated_mean_biased = df.loc[df['T'] == 1, 'Y_obs_biased'].mean()
control_mean_biased = df.loc[df['T'] == 0, 'Y_obs_biased'].mean()

biased_naive_diff = treated_mean_biased - control_mean_biased

print(f"Naive difference in means (with hidden confounding): {biased_naive_diff:.2f}")

# See how much bias was introduced
bias_inflation = biased_naive_diff - naive_diff
print(f"Inflation in estimate due to hidden confounder: {bias_inflation:.2f}")
Naive difference in means (with hidden confounding): 250.11
Inflation in estimate due to hidden confounder: -0.42
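
As the printout shows, a hidden variable that is independent of treatment barely moves the estimate. A sharper stress test mixes the hidden variable into both treatment and outcome and sweeps its strength; this sketch uses made-up strength values purely for illustration:

In [ ]:
# Sweep confounder strength: bias in the naive difference should grow with it
for strength in [0.0, 0.5, 1.0, 2.0]:
    u = np.random.normal(0, 1, len(df))
    t_star = np.random.binomial(1, np.clip(0.3 + 0.2 * strength * u, 0.01, 0.99))
    y_star = df['Y_0'] + 10 * t_star + strength * 20 * u
    diff = y_star[t_star == 1].mean() - y_star[t_star == 0].mean()
    print(f"strength={strength}: naive diff = {diff:.1f} (true effect = 10)")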


🧬 Counterfactual Thinking

Counterfactual thinking is the backbone of causal inference.

Instead of asking:

"What happened?"

We ask:

"What would have happened if things were different?"

In causal inference:

  • Each unit (user, patient, item) has two potential outcomes.
  • Only one is observed.
  • The other, the counterfactual, must be predicted or estimated.

Counterfactual reasoning enables:

  • Simulating user behavior under alternate scenarios.
  • Personalizing interventions based on predicted outcomes.

Without counterfactuals, causal inference is blind.

🤖 Predicting what would've happened

Predicting counterfactuals means estimating:

  • Y(1) for untreated units.
  • Y(0) for treated units.

We use:

  • Machine learning models trained on observed data.
  • Causal forests, meta-learners, and other counterfactual predictors.

Goal:

  • Recover the missing potential outcome.
  • Estimate individual treatment effects (ITEs).

This enables granular interventions, not just average effects across a population.


Important to remember:
Predicted counterfactuals are estimates, not direct observations; uncertainty always exists.

In [49]:
# Predict counterfactual outcomes using the causal forest trained earlier

# Predict Y(1) and Y(0) separately
cate_preds = df['CATE_predicted'].values
baseline_preds = df['y_hat'].values  # From earlier outcome model

# Predict counterfactual outcomes
df['Y_cf_T1'] = baseline_preds + cate_preds  # Predicted outcome if treated
df['Y_cf_T0'] = baseline_preds  # Predicted outcome if untreated (baseline)

# Now simulate what would happen if treatment status flipped
df['counterfactual_outcome'] = np.where(
    df['T'] == 1,
    df['Y_cf_T0'],  # If treated, counterfactual is untreated
    df['Y_cf_T1']   # If untreated, counterfactual is treated
)

print("\nSample Counterfactual Predictions:")
print(df[['T', 'Y_obs', 'counterfactual_outcome']].head())
Sample Counterfactual Predictions:
   T        Y_obs  counterfactual_outcome
0  0  1114.502287             1127.868133
1  0  1099.893323             1119.847270
2  0   704.133035              697.394845
3  0  1139.203509             1151.640784
4  0  1464.308234             1475.218807

๐Ÿ” Usage in recommendation & personalizationยถ

๐Ÿ“‰ Click to Expand

Counterfactual predictions unlock personalization:

Instead of treating everyone the same, you can:

  • Target users where treatment has highest predicted uplift.
  • De-prioritize users who won't respond.

Examples:

  • Marketing: Show ads only to users likely to convert if nudged.
  • Healthcare: Prioritize interventions for patients who benefit most.
  • Products: Recommend features or promotions to maximize lift per user.

Strategic mindshift:
Focus on marginally persuadable users, not just overall averages.

Real-world use cases often combine:

  • Causal effect estimation (CATE/ITE)
  • Ranking users by expected benefit
  • Action prioritization based on counterfactuals

In [50]:
# Rank users by predicted CATE
df['priority_score'] = df['CATE_predicted']

# Top 5 users we should prioritize for treatment
top_users = df.sort_values('priority_score', ascending=False).head(5)

print("\nTop users to prioritize based on CATE:")
print(top_users[['age', 'income', 'prior_engagement', 'CATE_predicted']])
Top users to prioritize based on CATE:
            age        income  prior_engagement  CATE_predicted
2841  21.800289  53281.835326          0.350447       20.543173
724   42.050385  53280.579923          0.403388       20.443580
3205  31.876058  53300.354298          0.420334       20.411161
4614  40.670367  53244.477131          0.632127       20.275141
3711  24.210210  53209.647772          0.335915       20.259483


📌 Closing Notes

You now have a practical understanding of core causal inference techniques.

You should be able to:

  • Simulate data with confounding
  • Estimate naive effects and detect bias
  • Adjust using regression, matching, stratification
  • Apply modern tools like DML and Causal Forests
  • Think in terms of counterfactuals, not just correlations

Remember:
Causal thinking is not just a technique; it's a lens to see decision-making clearly.

๐Ÿ“ Summary table of methodsยถ

๐Ÿ“‰ Click to Expand
MethodWhen UsefulStrengthsWeaknesses
Simple Diff-in-MeansRandomized experimentsEasy, unbiasedUseless with confounding
Regression AdjustmentObservational data with measured confoundersEasy to implementModel misspecification risk
StratificationSmall number of discrete confoundersTransparentBreaks down in high dimensions
Propensity Score Matching (PSM)Observational data with many confoundersBalances groupsSensitive to model of treatment
Instrumental Variables (IV)Unobserved confounders existBypasses confoundingHard to find good instruments
Double Machine Learning (DML)High-dimensional nonlinear confoundersML flexibility + debiasingNeeds lots of data, honest splits
Causal ForestsHeterogeneous treatment effectsFlexible CATE estimationComplex, less interpretable

📋 When to use what

Choosing a method depends on:

  • Randomization present?
    • Yes → Simple difference in means is fine.
    • No → Need adjustment.
  • Are confounders observed?
    • Yes → Regression, PSM, DML are options.
    • No → Need IV or natural experiments.
  • Do you expect heterogeneous effects?
    • Yes → Causal Trees, Causal Forests, Meta-learners.
  • Is high-dimensional data involved?
    • Yes → Prefer DML over simple regression.

Choosing the right method = matching the method to the bias and complexity in your data.

📎 Causal vs Predictive mindset

Predictive modeling mindset:

  • Focuses on fitting observed outcomes.
  • Good for forecasts, risk scores, recommendation engines.
  • Does not care about interventions.

Causal inference mindset:

  • Focuses on what would happen if we intervened.
  • Good for making decisions (policies, treatments, products).
  • Requires stronger assumptions, careful design.

Key difference:
Predictive models can be accurate yet useless for interventions.
Causal models are harder but necessary to make confident decisions.


Quote to remember:

"All models are wrong. Some models are useful. Only causal models are useful for actions."
