Causal inference gives us the tools to answer "what if" questions — not just "what is." In product, policy, medicine, and science, we often need to act, and actions require understanding their consequences.
This field helps us predict the consequences of our actions before we take them.
This notebook is a build-up from first principles to practical methods — with enough grounding to reason about experiments, models, and their assumptions clearly.
The core idea is to estimate:
What would the outcome have been if the treatment had (or had not) occurred?
Unlike correlation or predictive modeling, causal inference is, at its heart, about reasoning over interventions: estimating what changes when we act, not just what tends to co-occur.
Example: Ice cream sales are correlated with shark attacks. Should we ban dessert? Clearly not — they’re both caused by heatwaves (a confounder).
Correlation fails because it cannot distinguish a shared cause from a real effect. Causal inference gives tools to separate the two and to answer intervention questions.

Examples: Would a new design increase clicks? Would this drug improve recovery? Would this policy reduce costs?
These questions involve interventions, and only causal methods can tell us what would’ve happened under a different choice.
Every causal question has three ingredients:

- Treatment (`T`): The intervention or condition being tested (e.g., new design, drug, policy).
- Outcome (`Y`): The result or metric affected by the treatment (e.g., click, recovery, score).
- Units (`i`): The entities receiving treatment and producing outcomes (e.g., users, patients, schools).

Each unit can receive a treatment or control, and we observe only one outcome — not both.
This framing is universal and applies whether you're testing emails, ads, or vaccines.
The Potential Outcomes framework (aka Rubin Causal Model) imagines two parallel worlds for each unit:
- `Y(1)`: Outcome if treated
- `Y(0)`: Outcome if not treated

We define the Individual Treatment Effect (ITE) as:

ITE = Y(1) - Y(0)
Key idea:
Each unit has both potential outcomes — but we can only observe one. The other is counterfactual.
This framework allows us to define estimands such as the ATE (average treatment effect), CATE (conditional average treatment effect), and ITE.

And it formalizes why causal inference is hard: we never see both outcomes.
For any individual, we can observe only one potential outcome — never both.
Example: a user shown the new design clicks, so we observe `Y(1)`. We will never know what `Y(0)` would have been for that exact user.

This creates a missing data problem: the counterfactual is unobservable.
To solve this, we rely on randomization, assumptions, and statistical adjustment.
All causal methods are, in some way, trying to approximate the missing counterfactual.
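To make the missing-data structure concrete, here is a tiny illustrative table (hypothetical numbers, not from our dataset):

import numpy as np
import pandas as pd

# Hypothetical illustration: each unit reveals exactly one potential
# outcome; NaN marks the counterfactual we never observe.
toy = pd.DataFrame({
    'unit': [1, 2, 3],
    'T':    [1, 0, 1],
    'Y1':   [12.0, np.nan, 9.0],   # observed only when T=1
    'Y0':   [np.nan, 7.0, np.nan], # observed only when T=0
})
print(toy)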
Causal inference relies heavily on assumptions — even when you don’t randomize. The standard ones are ignorability (no unmeasured confounding), positivity (every unit has a nonzero chance of either treatment), and consistency/SUTVA (well-defined treatments, no interference between units).
Without these, causal estimates can be biased or undefined. Always question whether they hold before trusting results.
Simulating data is the best way to control the truth when learning causal inference.
Here’s why we simulate: the true treatment effect is known by construction, so we can check whether each estimator actually recovers it.
In real-world observational data, the "truth" is hidden. Simulating lets you debug your causal intuition safely before dealing with messy production datasets.
# Simulated Dataset Setup
import numpy as np
import pandas as pd
np.random.seed(42) # for reproducibility
# We'll define features, treatment assignment, and outcomes step-by-step later
To simulate treatment realistically, assignment should depend on covariates, just as it does in observational data. For example, higher-income or more engaged users might be more likely to receive the treatment.
We’ll simulate a non-random treatment assignment based on a few covariates to mimic real-world biases.
# Define covariates (features)
n = 5000
age = np.random.normal(40, 12, n) # Age
income = np.random.normal(60000, 15000, n) # Annual income
prior_engagement = np.random.beta(2, 5, n) # Past engagement score [0,1]
# Treatment assignment probability based on features
treatment_prob = (
0.3 * (income > 70000).astype(float) +
0.2 * (prior_engagement > 0.5).astype(float) +
0.1 * (age < 30).astype(float) +
np.random.normal(0, 0.05, n) # small noise
)
treatment_prob = np.clip(treatment_prob, 0, 1)
# Assign treatment
T = np.random.binomial(1, treatment_prob)
In real-world datasets, treatment assignment is not random — it’s confounded by covariates.
We deliberately inject confounding so that treated and untreated units differ at baseline, making naive comparisons biased, just as in real observational data.
Confounding creates the need for adjustment, which will be a major theme later.
# Let's define a "true" baseline outcome based on the same covariates
base_outcome = (
50 +
0.02 * income +
5 * prior_engagement -
0.3 * age +
np.random.normal(0, 5, n) # random noise
)
In the Rubin Potential Outcomes framework, each unit has two outcomes:
- `Y(1)` → if treated
- `Y(0)` → if not treated

We can simulate this by:

- defining `Y(1)` as the base outcome plus the treatment effect
- defining `Y(0)` as the base outcome

Important: We observe only one of `Y(1)` or `Y(0)`, depending on treatment assignment (`T`).
# Define a true treatment effect (could vary by subgroup later)
true_treatment_effect = 10 # a flat +10 effect for everyone
# Simulate potential outcomes
Y_0 = base_outcome
Y_1 = base_outcome + true_treatment_effect
# Observed outcome based on treatment assignment
Y_obs = T * Y_1 + (1 - T) * Y_0
# Assemble into a dataframe
df = pd.DataFrame({
'age': age,
'income': income,
'prior_engagement': prior_engagement,
'T': T,
'Y_obs': Y_obs,
'Y_0': Y_0,
'Y_1': Y_1,
})
df.head()
| | age | income | prior_engagement | T | Y_obs | Y_0 | Y_1 |
|---|---|---|---|---|---|---|---|
| 0 | 45.960570 | 53643.604770 | 0.188077 | 0 | 1114.502287 | 1114.502287 | 1124.502287 |
| 1 | 38.340828 | 53198.788374 | 0.170389 | 0 | 1099.893323 | 1099.893323 | 1109.893323 |
| 2 | 47.772262 | 33065.352411 | 0.511379 | 0 | 704.133035 | 704.133035 | 714.133035 |
| 3 | 58.276358 | 55048.647124 | 0.318793 | 0 | 1139.203509 | 1139.203509 | 1149.203509 |
| 4 | 37.190160 | 70992.436227 | 0.384439 | 0 | 1464.308234 | 1464.308234 | 1474.308234 |
Many causal questions are first attacked by simply comparing the treated vs. untreated groups.
Naive Approach:
Average outcome of treated - Average outcome of untreated.
This looks simple, but in observational data it hides a trap:
Naive estimation almost always gives biased results unless you have perfect randomization.
In this section, we'll see how bad the naive approach can get even on a simple synthetic dataset.
# Quick naive estimation
treated_mean = df.loc[df['T'] == 1, 'Y_obs'].mean()
control_mean = df.loc[df['T'] == 0, 'Y_obs'].mean()
naive_diff = treated_mean - control_mean
print(f"Naive difference in means: {naive_diff:.2f}")
print(f"True treatment effect (ground truth): {true_treatment_effect}")
Naive difference in means: 250.53
True treatment effect (ground truth): 10
A simple difference in means is mathematically:

E[Y | T=1] - E[Y | T=0]

If treatment assignment were random, this quantity would equal the average treatment effect. But if treatment is confounded, it decomposes as:

E[Y | T=1] - E[Y | T=0] = E[Y(1) - Y(0) | T=1] + ( E[Y(0) | T=1] - E[Y(0) | T=0] )

The second term is selection bias: `T=1` units may systematically differ from `T=0` units even before treatment. We’ll soon quantify how large this bias can be.
# Let's quickly visualize how the treated vs control groups differ in covariates
df.groupby('T')[['age', 'income', 'prior_engagement']].mean()
| T | age | income | prior_engagement |
|---|---|---|---|
| 0 | 40.389926 | 58304.201745 | 0.278891 |
| 1 | 37.892227 | 70283.224987 | 0.330779 |
Bias from confounding happens when the same variables influence both who gets treated and what their outcome would have been anyway.

Mathematically:

Observed difference = True treatment effect + Bias from baseline differences

In our simulation, treatment is concentrated among high-income, high-engagement users whose baseline outcomes are already much higher, so the naive difference (≈250) dwarfs the true effect (10).
This is why adjusting for confounders is critical — naive methods can easily mislead interventions and business decisions.
# Let's calculate the *average baseline outcome* (Y_0) in treated vs untreated groups
treated_baseline = df.loc[df['T'] == 1, 'Y_0'].mean()
control_baseline = df.loc[df['T'] == 0, 'Y_0'].mean()
baseline_diff = treated_baseline - control_baseline
print(f"Baseline (Y_0) difference between treated and control: {baseline_diff:.2f}")
print("This baseline imbalance creates bias in naive estimation.")
Baseline (Y_0) difference between treated and control: 240.53
This baseline imbalance creates bias in naive estimation.
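Because the simulated effect is a constant +10 for everyone, the decomposition above holds exactly; a quick check:

# Sanity check: naive difference = true effect + baseline imbalance
print(f"True effect + baseline diff = {true_treatment_effect + baseline_diff:.2f}")
print(f"Naive difference in means   = {naive_diff:.2f}")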
Directed Acyclic Graphs (DAGs) are a compact way to represent assumptions about the data generating process.
DAGs are not learned from data. They are drawn from domain knowledge to help reason about which variables to adjust for and which to leave alone.
Almost every causal inference method implicitly or explicitly assumes a DAG about the world.
A DAG (Directed Acyclic Graph) encodes assumptions about how variables causally relate.
Example:

Age → Income → Health

This means: Age causally influences Income, and Income causally influences Health; Age affects Health only through Income.

DAGs help identify which variables are confounders to adjust for, colliders to avoid, and mediators to leave alone (distinguished below).
They act like a map — letting you plan causal estimation strategies intelligently.
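As a small sketch of how you might encode and query a DAG in code (assuming `networkx` is available; this is illustrative, not a causal-discovery tool):

import networkx as nx

# Encode the example chain Age -> Income -> Health
dag = nx.DiGraph([('Age', 'Income'), ('Income', 'Health')])
print(nx.is_directed_acyclic_graph(dag))                # True: a valid DAG
print(list(nx.all_simple_paths(dag, 'Age', 'Health')))  # the single causal chain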
- Confounders: variables that affect both the Treatment (`T`) and the outcome, e.g., Health (`Y`). Adjust for these.
- Colliders: variables caused by both treatment and outcome. Conditioning on a collider *introduces* bias.
- Mediators: variables on the causal path from treatment to outcome. Adjusting for them blocks part of the very effect you want to measure.
👉 Correct adjustment requires distinguishing among these roles.
Not everything is identifiable from data alone — assumptions are unavoidable.
Data + assumptions → Causal conclusions.
Data alone → Only correlational findings.
DAGs clarify where you need domain knowledge vs where data suffices.
Backdoor adjustment methods aim to block backdoor paths — non-causal paths that create bias between treatment and outcome.
Core idea:
Backdoor adjustment simulates what would happen if treatment assignment were random within levels of the confounders.
It’s the foundational idea behind the methods that follow: conditioning, stratification, regression adjustment, and propensity score matching.
If you can block all backdoor paths, you can estimate causal effects from observational data reliably.
Conditioning means holding confounders constant when comparing treated vs untreated units.
Examples: compare treated vs. untreated users within the same income bracket, or within the same engagement tier.
By conditioning, you eliminate the variation due to confounders, isolating the causal effect.
Important:
You should only condition on true confounders — not colliders or mediators.
Conditioning can be implemented via subgroup comparisons, stratification, regression, or matching; a simple subgroup version follows.
# Simple conditioning by subgroup (example: income > 70k vs <= 70k)
df['high_income'] = (df['income'] > 70000).astype(int)
grouped = df.groupby(['high_income', 'T'])['Y_obs'].mean().unstack()
grouped['diff'] = grouped[1] - grouped[0]
print("Simple Conditional Difference by Income Group:")
print(grouped[['diff']])
Simple Conditional Difference by Income Group:
T                 diff
high_income
0            16.867043
1            13.563878
Stratification means breaking the dataset into buckets based on confounders and comparing treatment effects within each bucket.
Typical steps: bucket units by confounder values, estimate the treated-vs-untreated difference within each bucket, then average the per-bucket differences (weighted by bucket size).

When useful: a small number of discrete confounders.

Limits: it breaks down as the number of confounders (or their cardinality) grows, because buckets become sparse or empty.
# Stratify based on prior_engagement (simple high vs low)
df['high_engagement'] = (df['prior_engagement'] > 0.5).astype(int)
strat_grouped = df.groupby(['high_engagement', 'T'])['Y_obs'].mean().unstack()
strat_grouped['diff'] = strat_grouped[1] - strat_grouped[0]
print("Conditional Difference by Engagement Group:")
print(strat_grouped[['diff']])
Conditional Difference by Engagement Group:
T                      diff
high_engagement
0                290.043114
1                145.885304
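Two buckets are coarse. A sketch of finer stratification over income quintiles (illustrative, reusing the same dataframe):

# Stratify income into quintiles, estimate within-bucket differences,
# then combine using bucket-size weights
df['income_bin'] = pd.qcut(df['income'], q=5, labels=False)
per_bucket = df.groupby(['income_bin', 'T'])['Y_obs'].mean().unstack()
per_bucket['diff'] = per_bucket[1] - per_bucket[0]
weights = df['income_bin'].value_counts(normalize=True).sort_index()
print(f"Weighted stratified estimate: {(per_bucket['diff'] * weights).sum():.2f}")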
Regression adjustment estimates causal effects by controlling for confounders via regression.
Simple linear model:
Y = β₀ + β₁·T + β₂·(confounder1) + β₃·(confounder2) + ... + ε
`β₁` captures the adjusted effect of treatment, controlling for confounders.

Advantages: easy to implement, and it adjusts for many confounders at once.

Risks: model misspecification; if the functional form is wrong, the estimate is biased.
import statsmodels.api as sm
X = df[['T', 'age', 'income', 'prior_engagement']]
X = sm.add_constant(X)
y = df['Y_obs']
reg_model = sm.OLS(y, X).fit()
print(reg_model.summary())
print(f"\nEstimated treatment effect (β₁) after adjustment: {reg_model.params['T']:.2f}")
                            OLS Regression Results
==============================================================================
Dep. Variable:                  Y_obs   R-squared:                       1.000
Model:                            OLS   Adj. R-squared:                  1.000
Method:                 Least Squares   F-statistic:                 4.652e+06
Date:                Sat, 26 Apr 2025   Prob (F-statistic):               0.00
Time:                        17:50:40   Log-Likelihood:                -15122.
No. Observations:                5000   AIC:                         3.025e+04
Df Residuals:                    4995   BIC:                         3.029e+04
Df Model:                           4
Covariance Type:            nonrobust
====================================================================================
                       coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------
const               50.3660      0.398    126.398      0.000      49.585      51.147
T                   10.0641      0.220     45.761      0.000       9.633      10.495
age                 -0.2942      0.006    -49.784      0.000      -0.306      -0.283
income               0.0200   4.83e-06   4142.593      0.000       0.020       0.020
prior_engagement     4.3802      0.452      9.681      0.000       3.493       5.267
==============================================================================
Omnibus:                        5.657   Durbin-Watson:                   1.967
Prob(Omnibus):                  0.059   Jarque-Bera (JB):                5.136
Skew:                          -0.029   Prob(JB):                       0.0767
Kurtosis:                       2.854   Cond. No.                     4.38e+05
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 4.38e+05. This might indicate that there are strong multicollinearity or other numerical problems.

Estimated treatment effect (β₁) after adjustment: 10.06
Propensity Score Matching (PSM) is a two-step procedure:

1. Estimate each unit's propensity score (`P(T=1 | X)`) using observed covariates.
2. Match treated units to untreated units with similar scores, then compare their outcomes.

Why PSM? Matching on a single balancing score can equalize many covariates at once, mimicking a randomized comparison within matched pairs.

When useful: observational data with many confounders and decent overlap between groups.

Limitations: it only balances observed covariates, and it is sensitive to how the propensity model is specified.
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors
# Step 1: Estimate propensity scores
ps_model = LogisticRegression()
ps_model.fit(df[['age', 'income', 'prior_engagement']], df['T'])
df['propensity_score'] = ps_model.predict_proba(df[['age', 'income', 'prior_engagement']])[:,1]
# Step 2: Nearest neighbor matching
treated = df[df['T'] == 1]
control = df[df['T'] == 0]
nn = NearestNeighbors(n_neighbors=1)
nn.fit(control[['propensity_score']])
distances, indices = nn.kneighbors(treated[['propensity_score']])
matched_control = control.iloc[indices.flatten()]
# Calculate matched difference
matched_diff = (treated['Y_obs'].values - matched_control['Y_obs'].values).mean()
print(f"Propensity Score Matched Estimate of Treatment Effect: {matched_diff:.2f}")
Propensity Score Matched Estimate of Treatment Effect: 197.50
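The matched estimate (≈197.5) is nowhere near the true effect of 10. A plausible diagnosis: income has a huge effect on the outcome (0.02 per dollar, so a $10k gap is 200 outcome units), and matching on a one-dimensional score from an unscaled logistic regression can leave residual income imbalance within matched pairs. A quick balance check makes this visible:

# Diagnostic sketch: covariate means for treated units vs. their matched controls.
# Residual imbalance on income translates directly into bias here.
balance = pd.DataFrame({
    'treated': treated[['age', 'income', 'prior_engagement']].mean(),
    'matched_control': matched_control[['age', 'income', 'prior_engagement']].mean(),
})
print(balance)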
Instrumental Variables (IV) methods are used when simple adjustment for confounders is impossible or not credible.
When treatment is endogenous (affected by unobserved factors also affecting outcome), traditional methods like regression fail.
IV solves this by finding an instrument: a variable that shifts treatment but affects the outcome only through treatment.
You create quasi-randomization even in observational data.
Classic examples: distance to college as an instrument for education, or the draft lottery as an instrument for military service.

You need IV methods when key confounders are unobserved and treatment is self-selected.

Example: estimating the effect of education on earnings, where unobserved ability influences both.
Key realization:
If backdoor paths exist through unobserved variables, IV becomes necessary.
For an instrument (`Z`) to be valid, it must satisfy:

1. Relevance: `Z` must affect treatment `T`.
2. Exclusion restriction: `Z` must affect the outcome `Y` only through `T`. (No direct path from `Z` to `Y`.)
3. Independence: `Z` must be independent of unobserved confounders affecting `Y`.

If any of these fail, the IV estimate is biased, sometimes worse than the naive one.
Choosing or arguing a valid instrument is 90% of the IV battle.
2SLS (Two-Stage Least Squares) is the classic estimation procedure for IV:

1. Stage 1: Regress treatment `T` on instrument `Z` (and any controls) to get predicted treatment (`T̂`).
2. Stage 2: Regress outcome `Y` on predicted treatment (`T̂`) and the same controls.

The second-stage coefficient gives the causal effect of treatment on outcome, isolating variation driven by the instrument.

Warning: running the two stages manually (as below) gives incorrect standard errors; packages like `linearmodels` automate this properly.

from statsmodels.api import OLS, add_constant
# Simulate an instrument Z (let's assume it's random and satisfies conditions)
np.random.seed(42)
df['Z'] = np.random.binomial(1, 0.5, size=len(df))
# Make treatment depend partly on Z
df['T_iv'] = (0.5 * df['Z'] + 0.5 * df['prior_engagement'] + np.random.normal(0, 0.1, len(df))) > 0.5
df['T_iv'] = df['T_iv'].astype(int)
# Stage 1: Predict treatment from instrument
X_stage1 = add_constant(df[['Z', 'age', 'income', 'prior_engagement']])
stage1_model = OLS(df['T_iv'], X_stage1).fit()
df['T_hat'] = stage1_model.predict(X_stage1)
# Stage 2: Predict outcome from predicted treatment
X_stage2 = add_constant(df[['T_hat', 'age', 'income', 'prior_engagement']])
stage2_model = OLS(df['Y_obs'], X_stage2).fit()
print(stage2_model.summary())
print(f"\nEstimated causal effect (via 2SLS): {stage2_model.params['T_hat']:.2f}")
                            OLS Regression Results
==============================================================================
Dep. Variable:                  Y_obs   R-squared:                       1.000
Model:                            OLS   Adj. R-squared:                  1.000
Method:                 Least Squares   F-statistic:                 3.278e+06
Date:                Sat, 26 Apr 2025   Prob (F-statistic):               0.00
Time:                        17:50:40   Log-Likelihood:                -15996.
No. Observations:                5000   AIC:                         3.200e+04
Df Residuals:                    4995   BIC:                         3.203e+04
Df Model:                           4
Covariance Type:            nonrobust
====================================================================================
                       coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------
const               48.1296      0.474    101.498      0.000      47.200      49.059
T_hat                0.2548      0.197      1.296      0.195      -0.131       0.640
age                 -0.3133      0.007    -44.628      0.000      -0.327      -0.300
income               0.0201   5.54e-06   3619.407      0.000       0.020       0.020
prior_engagement     6.6954      0.540     12.391      0.000       5.636       7.755
==============================================================================
Omnibus:                       71.333   Durbin-Watson:                   1.970
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               74.381
Skew:                           0.287   Prob(JB):                     7.05e-17
Kurtosis:                       3.169   Cond. No.                     4.33e+05
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 4.33e+05. This might indicate that there are strong multicollinearity or other numerical problems.

Estimated causal effect (via 2SLS): 0.25
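A note on the near-zero estimate: in this simulation `Y_obs` was generated from the original treatment `T`, not from `T_iv`, so the true effect of `T_iv` on `Y_obs` is zero and 2SLS correctly finds nothing (0.25, p ≈ 0.2). To watch 2SLS recover a real effect, you would re-simulate the outcome from `T_iv`. Also, as warned above, manual two-stage OLS understates uncertainty; a sketch of the same model with `linearmodels` (assumed installed), which reports proper IV standard errors:

# Hedged sketch: same 2SLS via linearmodels (pip install linearmodels),
# which computes correct IV standard errors unlike manual two-stage OLS
from linearmodels.iv import IV2SLS
iv_res = IV2SLS(
    dependent=df['Y_obs'],
    exog=sm.add_constant(df[['age', 'income', 'prior_engagement']]),
    endog=df['T_iv'],
    instruments=df['Z'],
).fit()
print(iv_res.params['T_iv'])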
Double Machine Learning (DML) is a modern causal estimation technique that combines flexible ML models for the nuisance parts with a debiased final stage for the causal part.

Why DML matters: it builds robust treatment effect estimators even when you use ML methods like Random Forests, XGBoost, or Neural Nets for intermediate steps.
In DML, you model two "nuisance functions" first:

- Outcome model: `Y ~ X`
- Treatment model: `T ~ X`

You can use any ML model (linear regression, random forest, gradient boosting, etc.) for these.

Key point: the goal is accurate prediction, not causal interpretation, at this stage.
Later, DML uses the residuals from these models to isolate the causal effect of `T` on `Y`.
This two-step process protects the final estimate from overfitting to noisy high-dimensional features.
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
# Define features
features = ['age', 'income', 'prior_engagement']
# Split into train/test for honest estimation (one call keeps X, Y, and T row-aligned)
X_train, X_test, y_train, y_test, T_train, T_test = train_test_split(
    df[features], df['Y_obs'], df['T'], test_size=0.3, random_state=42)
# Outcome model: Y ~ X
# (Predictions below cover the full dataset, so they are partly in-sample;
# a fully cross-fitted version appears at the end of this DML discussion.)
y_model = RandomForestRegressor()
y_model.fit(X_train, y_train)
df['y_hat'] = y_model.predict(df[features])
# Treatment model: T ~ X
t_model = RandomForestRegressor()
t_model.fit(X_train, T_train)
df['t_hat'] = t_model.predict(df[features])
After fitting nuisance models, we compute residuals:

Residual_Y = Y - Ŷ
Residual_T = T - T̂

Why? Residualizing removes the variation in `Y` and `T` that is predictable from `X`. What remains isolates the effect of `T` on `Y`, orthogonal to confounders.

This two-stage process is called orthogonalization — it minimizes bias from overfitting nuisance functions.

It’s a key innovation that separates DML from naive ML-based adjustment.
# Calculate residuals
df['residual_Y'] = df['Y_obs'] - df['y_hat']
df['residual_T'] = df['T'] - df['t_hat']
# Final stage: regress residual_Y ~ residual_T
X_resid = sm.add_constant(df['residual_T'])
y_resid = df['residual_Y']
residual_model = sm.OLS(y_resid, X_resid).fit()
print(residual_model.summary())
print(f"\nEstimated causal effect via DML: {residual_model.params['residual_T']:.2f}")
                            OLS Regression Results
==============================================================================
Dep. Variable:             residual_Y   R-squared:                       0.127
Model:                            OLS   Adj. R-squared:                  0.127
Method:                 Least Squares   F-statistic:                     726.5
Date:                Sat, 26 Apr 2025   Prob (F-statistic):          1.62e-149
Time:                        17:50:41   Log-Likelihood:                -14928.
No. Observations:                5000   AIC:                         2.986e+04
Df Residuals:                    4998   BIC:                         2.987e+04
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.0842      0.068      1.243      0.214      -0.049       0.217
residual_T     8.9188      0.331     26.953      0.000       8.270       9.567
==============================================================================
Omnibus:                     7090.850   Durbin-Watson:                  1.998
Prob(Omnibus):                  0.000   Jarque-Bera (JB):         9991352.741
Skew:                           7.694   Prob(JB):                        0.00
Kurtosis:                     221.453   Cond. No.                        4.88
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Estimated causal effect via DML: 8.92
You should prefer DML over traditional regression when you have many covariates, nonlinear relationships, or interactions you cannot specify by hand.

Traditional regression assumes a correctly specified (usually linear) functional form; get it wrong and the estimate is biased.
DML frees you from strict parametric forms, allowing modern ML models while still aiming for valid causal estimates.
✅ DML shines in modern settings: tech products, healthcare, online platforms — where datasets are messy, rich, and big.
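For completeness, a cross-fitted variant of the estimate above (a sketch; out-of-fold predictions are the standard DML recipe and avoid the partly in-sample fit used earlier):

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_predict

# Out-of-fold nuisance predictions (5-fold cross-fitting)
y_hat_cf = cross_val_predict(RandomForestRegressor(random_state=0), df[features], df['Y_obs'], cv=5)
t_hat_cf = cross_val_predict(RandomForestRegressor(random_state=0), df[features], df['T'], cv=5)
# Final stage on cross-fitted residuals
res_y = df['Y_obs'] - y_hat_cf
res_t = df['T'] - t_hat_cf
theta_cf = (res_t * res_y).sum() / (res_t ** 2).sum()
print(f"Cross-fitted DML estimate: {theta_cf:.2f}")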
Until now, we've talked about the average effect of treatment across the entire population (ATE).
But in reality, treatment effects often differ across people.

Heterogeneous Treatment Effects (HTE) study how effects vary across subgroups, covariates, and individuals.

Estimating HTE is critical for personalization, targeting, and efficient allocation of treatments.

Different layers of treatment effect granularity:

- ATE: the population-wide average effect
- CATE: the average effect within a subgroup defined by covariates
- ITE: the effect for a single unit

In practice, ITEs are never directly observed. Good causal inference methods can recover CATEs/ITEs if enough data and signal exist.
# Calculate ATE (simple diff from simulation ground truth)
ate = (df['Y_1'] - df['Y_0']).mean()
print(f"True ATE: {ate:.2f}")
# Calculate CATE for high engagement group
cate_high_engagement = (df[df['prior_engagement'] > 0.5]['Y_1'] - df[df['prior_engagement'] > 0.5]['Y_0']).mean()
print(f"CATE for high engagement users: {cate_high_engagement:.2f}")
# Show a few ITEs
df['ITE_true'] = df['Y_1'] - df['Y_0']
print("\nSample ITEs:")
print(df[['age', 'income', 'prior_engagement', 'ITE_true']].head())
True ATE: 10.00
CATE for high engagement users: 10.00

Sample ITEs:
         age        income  prior_engagement  ITE_true
0  45.960570  53643.604770          0.188077      10.0
1  38.340828  53198.788374          0.170389      10.0
2  47.772262  33065.352411          0.511379      10.0
3  58.276358  55048.647124          0.318793      10.0
4  37.190160  70992.436227          0.384439      10.0
Uplift modeling directly models the difference in probability of a positive outcome between treated and untreated users.
Instead of modeling outcome probabilities separately, uplift models focus on the incremental impact of treatment: P(Y=1 | T=1, X) - P(Y=1 | T=0, X).

Where uplift models shine: marketing campaigns, retention offers, and any setting where the goal is to treat only the persuadable users.

Typical techniques: two-model (T-learner) approaches, class-transformation methods, and uplift trees. A minimal T-learner sketch follows.
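A minimal two-model (T-learner) sketch on our simulated data (illustrative; the model choice here is arbitrary):

from sklearn.ensemble import GradientBoostingRegressor

# Fit separate outcome models on treated and control rows,
# then score uplift as the gap between their predictions
features = ['age', 'income', 'prior_engagement']
m1 = GradientBoostingRegressor(random_state=0).fit(df.loc[df['T'] == 1, features], df.loc[df['T'] == 1, 'Y_obs'])
m0 = GradientBoostingRegressor(random_state=0).fit(df.loc[df['T'] == 0, features], df.loc[df['T'] == 0, 'Y_obs'])
df['uplift_score'] = m1.predict(df[features]) - m0.predict(df[features])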
Tree-based methods are powerful for discovering treatment effect heterogeneity: they can estimate CATEs reliably across different subgroups without manually specifying interactions.

When useful: you suspect effects vary across units and you have enough data to detect it.
# !pip install econml
# Causal Forest: Full Correct Code
from econml.grf import CausalForest
from sklearn.model_selection import train_test_split
# Prepare features and outcome
X = df[['age', 'income', 'prior_engagement']].values # Features (2D)
T = df['T'].values # Treatment (1D)
Y = df['Y_obs'].values # Observed outcome (1D)
# Fit causal forest
forest = CausalForest(n_estimators=100, random_state=42)
forest.fit(X, T, Y) # Correct order: X, T, Y
# Predict treatment effects (CATEs)
cate_preds = forest.predict(X)
# Store predictions
df['CATE_predicted'] = cate_preds
# Display sample predictions
print("\nSample predicted CATEs:")
print(df[['age', 'income', 'prior_engagement', 'CATE_predicted']].head())
Sample predicted CATEs:
         age        income  prior_engagement  CATE_predicted
0  45.960570  53643.604770          0.188077       15.767683
1  38.340828  53198.788374          0.170389       18.062330
2  47.772262  33065.352411          0.511379       -4.625701
3  58.276358  55048.647124          0.318793       14.400047
4  37.190160  70992.436227          0.384439       11.047467
Even after careful causal estimation, you must ask: would this result survive a different specification, a fake treatment, or a hidden confounder?
Robustness checks build confidence that your findings are not artifacts of modeling choices, random noise, or hidden confounders.
Placebo tests simulate situations where you expect no effect — if you detect an effect there, something's wrong.
Robust causal analysis is not just about point estimates — it’s about proving to yourself that you aren't fooling yourself.
Placebo tests inject "fake" treatments to validate your method.
Idea: assign a fake, random treatment that cannot have caused the outcome; a sound pipeline should estimate an effect near zero.
If your model finds strong effects even when treatment is randomized, your pipeline is leaking bias or overfitting.
Placebo tests are critical because they catch leakage, overfitting, and pipeline bugs cheaply. They are a basic but powerful check, always worth doing.
# Create a random placebo treatment
np.random.seed(123)
df['placebo_T'] = np.random.binomial(1, 0.5, size=len(df))
# Estimate naive difference for placebo treatment
placebo_treated_mean = df.loc[df['placebo_T'] == 1, 'Y_obs'].mean()
placebo_control_mean = df.loc[df['placebo_T'] == 0, 'Y_obs'].mean()
placebo_naive_diff = placebo_treated_mean - placebo_control_mean
print(f"Placebo test: naive difference in means = {placebo_naive_diff:.2f}")
# Ideally close to zero if model is unbiased
Placebo test: naive difference in means = 7.79
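The placebo difference (7.79) is plausibly within noise here, since the outcome's spread is dominated by income. It is also worth pushing the placebo through the adjusted pipeline:

# Placebo through the regression-adjustment pipeline: the coefficient
# on placebo_T should be statistically indistinguishable from zero
X_placebo = sm.add_constant(df[['placebo_T', 'age', 'income', 'prior_engagement']])
placebo_fit = sm.OLS(df['Y_obs'], X_placebo).fit()
print(f"Placebo coefficient: {placebo_fit.params['placebo_T']:.2f} "
      f"(p = {placebo_fit.pvalues['placebo_T']:.3f})")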
Even after adjusting for observed confounders, unobserved variables can still bias causal estimates.
Sensitivity analysis asks: how strong would unobserved confounding have to be to overturn your conclusion?

Typical approaches include Rosenbaum bounds, E-values, and directly simulating hidden confounders of varying strength (as below).

In practical data science, if your conclusions survive plausible levels of unobserved bias, they are more credible.
Important mindset:
No analysis is perfect — the goal is to understand limits, not pretend away uncertainty.
# Simulate a hidden variable that shifts the outcome
# (note: as drawn below it is independent of treatment; see the discussion after the output)
np.random.seed(42)
df['hidden_confounder'] = np.random.normal(0, 1, size=len(df))
# Make the outcome depend slightly on this hidden confounder
df['Y_obs_biased'] = df['Y_obs'] + 2 * df['hidden_confounder']
# Re-run naive difference with biased outcome
treated_mean_biased = df.loc[df['T'] == 1, 'Y_obs_biased'].mean()
control_mean_biased = df.loc[df['T'] == 0, 'Y_obs_biased'].mean()
biased_naive_diff = treated_mean_biased - control_mean_biased
print(f"Naive difference in means (with hidden confounding): {biased_naive_diff:.2f}")
# See how much bias was introduced
bias_inflation = biased_naive_diff - naive_diff
print(f"Inflation in estimate due to hidden confounder: {bias_inflation:.2f}")
Naive difference in means (with hidden confounding): 250.11
Inflation in estimate due to hidden confounder: -0.42
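Note why the estimate barely moved: as simulated, `hidden_confounder` is drawn independently of `T`, so it adds noise but no systematic bias. A true hidden confounder must be correlated with treatment. A sketch that makes the variable depend on `T` shows the estimate drifting with confounder strength `k`:

# A genuine hidden confounder: correlated with treatment by construction.
# E[u | T=1] - E[u | T=0] = 0.5, so the naive diff shifts by roughly 0.5*k.
np.random.seed(0)
u = 0.5 * df['T'] + np.random.normal(0, 1, len(df))
for k in [0, 5, 20]:
    y_sens = df['Y_obs'] + k * u
    diff = y_sens[df['T'] == 1].mean() - y_sens[df['T'] == 0].mean()
    print(f"confounder strength k={k}: naive diff = {diff:.2f}")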
Counterfactual thinking is the backbone of causal inference.
Instead of asking:
"What happened?"
We ask:
"What would have happened if things were different?"
In causal inference, every effect estimate is implicitly a comparison between what happened and a counterfactual.

Counterfactual reasoning enables effect estimation, policy evaluation, and personalized decision-making.
Without counterfactuals, causal inference is blind.
Predicting counterfactuals means estimating:

- `Y(1)` for untreated units.
- `Y(0)` for treated units.

We use models fit on observed data (here, the outcome model and the causal forest trained earlier) to fill in the missing outcome for each unit.
Goal: predict both potential outcomes for every unit.
This enables granular interventions — not just average effects across a population.
Important to remember:
Predicted counterfactuals are estimates, not direct observations — uncertainty always exists.
# Predict counterfactual outcomes using Causal Forest
# (Already trained earlier, we use forest)
# Predict Y(1) and Y(0) separately
cate_preds = df['CATE_predicted'].values
baseline_preds = df['y_hat'].values # From earlier outcome model (caveat: fit on observed outcomes, so only a rough stand-in for Y(0))
# Predict counterfactual outcomes
df['Y_cf_T1'] = baseline_preds + cate_preds # Predicted outcome if treated
df['Y_cf_T0'] = baseline_preds # Predicted outcome if untreated (baseline)
# Now simulate what would happen if treatment status flipped
df['counterfactual_outcome'] = np.where(
df['T'] == 1,
df['Y_cf_T0'], # If treated, counterfactual is untreated
df['Y_cf_T1'] # If untreated, counterfactual is treated
)
print("\nSample Counterfactual Predictions:")
print(df[['T', 'Y_obs', 'counterfactual_outcome']].head())
Sample Counterfactual Predictions:
   T        Y_obs  counterfactual_outcome
0  0  1114.502287             1127.868133
1  0  1099.893323             1119.847270
2  0   704.133035              697.394845
3  0  1139.203509             1151.640784
4  0  1464.308234             1475.218807
Counterfactual predictions unlock personalization:
Instead of treating everyone the same, you can target treatment where the predicted uplift is highest.

Examples: send a discount only to users whose predicted CATE is positive, or prioritize outreach to patients predicted to benefit most.
Strategic mindshift:
Focus on marginally persuadable users, not just overall averages.
Real-world use cases often combine predicted CATEs with practical constraints such as budget, capacity, and eligibility.
# Rank users by predicted CATE
df['priority_score'] = df['CATE_predicted']
# Top 5 users we should prioritize for treatment
top_users = df.sort_values('priority_score', ascending=False).head(5)
print("\nTop users to prioritize based on CATE:")
print(top_users[['age', 'income', 'prior_engagement', 'CATE_predicted']])
Top users to prioritize based on CATE:
            age        income  prior_engagement  CATE_predicted
2841  21.800289  53281.835326          0.350447       20.543173
724   42.050385  53280.579923          0.403388       20.443580
3205  31.876058  53300.354298          0.420334       20.411161
4614  40.670367  53244.477131          0.632127       20.275141
3711  24.210210  53209.647772          0.335915       20.259483
You now have a practical understanding of core causal inference techniques.
You should be able to:

- simulate data where the causal ground truth is known
- recognize why naive comparisons fail under confounding
- apply adjustment methods: conditioning, stratification, regression, and propensity score matching
- reach for IV, DML, and causal forests when the assumptions call for them
- stress-test results with placebo tests and sensitivity analysis
Remember:
Causal thinking is not just a technique — it’s a lens to see decision-making clearly.
| Method | When Useful | Strengths | Weaknesses |
|---|---|---|---|
| Simple Diff-in-Means | Randomized experiments | Easy, unbiased | Useless with confounding |
| Regression Adjustment | Observational data with measured confounders | Easy to implement | Model misspecification risk |
| Stratification | Small number of discrete confounders | Transparent | Breaks down in high dimensions |
| Propensity Score Matching (PSM) | Observational data with many confounders | Balances groups | Sensitive to model of treatment |
| Instrumental Variables (IV) | Unobserved confounders exist | Bypasses confounding | Hard to find good instruments |
| Double Machine Learning (DML) | High-dimensional nonlinear confounders | ML flexibility + debiasing | Needs lots of data, honest splits |
| Causal Forests | Heterogeneous treatment effects | Flexible CATE estimation | Complex, less interpretable |
Choosing a method depends on how treatment was assigned (randomized or not), which confounders you can observe, how much data you have, and whether you need average or heterogeneous effects.
Choosing the right method = matching the method to the bias and complexity in your data.
Predictive modeling mindset: given the world as it is, what will `Y` be?

Causal inference mindset: if we intervene and change `T`, how will `Y` change?
Key difference:
Predictive models can be accurate yet useless for interventions.
Causal models are harder but necessary to make confident decisions.
Quote to remember:
"All models are wrong. Some models are useful. Only causal models are useful for actions."