
🧭 Causal Inference

  • 🎯 Introduction to Causal Inference

    • 🎓 What is Causal Inference?
    • 📌 Why go beyond Correlation?
    • 🧭 Real-world problems that need causality
  • 🧠 Core Concepts & Notation

    • 🧮 Treatment, Outcome, Units
    • 📐 Potential Outcomes (Rubin Causal Model)
    • 🧵 Fundamental Problem of Causal Inference
    • 🧠 Assumptions (SUTVA, Ignorability, Overlap)
  • 🧪 Simulated Dataset Setup

    • 🧬 Define treatment assignment logic
    • 🔬 Inject confounding intentionally
    • 🧊 Simulate potential outcomes + observed data
  • 🚫 Naive Estimation

    • ❌ Simple difference in means
    • ⚠️ Bias due to confounding
  • 🕸️ Causal Diagrams (DAGs)

    • 🧿 Quick primer on DAGs
    • 🕷️ Confounding vs. colliders vs. mediators
    • 🔗 What can/can't be estimated just from data
  • 🔁 Backdoor Adjustment Methods

    • 🧾 Conditioning on confounders
    • 🕵️‍♂️ Stratification / Subgroup analysis
    • 📊 Regression Adjustment
    • 📌 Propensity Score Matching (PSM)
  • 🎯 Instrumental Variables (IV)

    • 🪁 When backdoor paths can't be blocked
    • 🎯 Valid instrument conditions
    • 🧩 2-Stage Least Squares (2SLS)
  • 🧰 Double Machine Learning (DML)

    • 🪛 Use ML models for nuisance functions
    • 🧱 Residualization + orthogonalization logic
    • 🧲 When to prefer over traditional regression
  • 🌈 Heterogeneous Treatment Effects

    • 🎨 ATE vs. CATE vs. ITE
    • 🌟 Uplift models and use cases
    • 🧩 Tree-based methods (Causal Trees, Causal Forests)
  • 🧪 Placebo Tests & Robustness Checks

    • 🧻 Randomized placebo treatments
    • ⚗️ Sensitivity to unobserved confounding
  • 🧬 Counterfactual Thinking

    • 🤖 Predicting what would've happened
    • 🔁 Usage in recommendation & personalization
  • 📌 Closing Notes

    • 📝 Summary table of methods
    • 📋 When to use what
    • 📎 Causal vs Predictive mindset

🎯 Introduction to Causal Inference

🧠 Why this Notebook

Causal inference gives us the tools to answer "what if" questions, not just "what is." In product, policy, medicine, and science, we often need to act, and actions require understanding their consequences.

This field helps us:

  • Understand how and why outcomes change.
  • Move from data descriptions to data-driven interventions.
  • Avoid the trap of chasing noisy correlations.

This notebook builds up from first principles to practical methods, with enough grounding to reason clearly about experiments, models, and their assumptions.

🎓 What is Causal Inference?

🎓 Causal inference is the process of estimating the effect of one variable (the treatment) on another (the outcome), holding all else constant.

The core idea is to estimate:

What would the outcome have been if the treatment had (or had not) occurred?

Unlike correlation or predictive modeling:

  • It asks counterfactual questions: what would have happened under different scenarios.
  • It requires assumptions, design, and often randomization or clever statistical tricks.

At its heart, causal inference is about:

  • Designing better interventions
  • Estimating treatment effects
  • Avoiding misleading associational patterns

📌 Why go beyond Correlation?

📌 Correlation can be dangerous when used as a proxy for causation.

Example: Ice cream sales are correlated with shark attacks. Should we ban dessert? Clearly not: they're both caused by heatwaves (a confounder).

Correlation fails because it:

  • Ignores confounders (common causes of both variables)
  • Misses directionality (what affects what)
  • Can be driven by reverse causation or coincidence

Causal inference gives tools to:

  • Identify confounding
  • Design better studies (randomized or observational)
  • Interpret results in terms of actionable causes

🧭 Real-world problems that need causality

🧭 Correlation might be fine for dashboards. But when making decisions, causality is non-negotiable.

Examples:

  • Product: Did that new button placement increase checkout, or was it a seasonal effect?
  • Marketing: Did the email nudge lead to purchases, or did loyal users open it anyway?
  • Policy: Did a tax cut help the economy, or was it already improving?
  • Health: Does a drug reduce disease, or do healthier people tend to take it?

These questions involve interventions, and only causal methods can tell us what would've happened under a different choice.


🧠 Core Concepts & Notation

🧮 Treatment, Outcome, Units

  • Treatment (T): The intervention or condition being tested (e.g., new design, drug, policy).
  • Outcome (Y): The result or metric affected by the treatment (e.g., click, recovery, score).
  • Units (i): The entities receiving treatment and producing outcomes (e.g., users, patients, schools).

Each unit can receive a treatment or control, and we observe only one outcome, not both.

This framing is universal and applies whether you're testing emails, ads, or vaccines.

๐Ÿ“ Potential Outcomes (Rubin Causal Model)ยถ

๐Ÿ“‰ Click to Expand

The Potential Outcomes framework (aka Rubin Causal Model) imagines two parallel worlds for each unit:

  • Y(1): Outcome if treated
  • Y(0): Outcome if not treated

We define Individual Treatment Effect (ITE) as:

ITE = Y(1) - Y(0)

Key idea:

Each unit has both potential outcomes, but we can only observe one. The other is counterfactual.

This framework allows us to define:

  • ATE (Average Treatment Effect)
  • CATE (Conditional ATE, for subgroups)

And formalizes why causal inference is hard: we never see both outcomes.
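
To make ATE, CATE, and ITE concrete, here is a toy example with made-up numbers in which, unlike real data, both potential outcomes are visible:

In [ ]:
import pandas as pd

# Hypothetical potential-outcomes table; in real data only one of Y1/Y0 is observed
toy = pd.DataFrame({
    'Y1':    [12, 7, 9],          # outcome if treated
    'Y0':    [10, 7, 5],          # outcome if untreated
    'young': [True, True, False],
})

toy['ITE'] = toy['Y1'] - toy['Y0']           # per-unit effects: 2, 0, 4
print(toy['ITE'].mean())                     # ATE = 2.0
print(toy.loc[toy['young'], 'ITE'].mean())   # CATE for young units = 1.0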

🧵 Fundamental Problem of Causal Inference

The Fundamental Problem of Causal Inference:
For any individual, we can observe only one potential outcome, never both.

Example:

  • A user sees version A → you observe Y(0)
  • You'll never know what Y(1) would have been for that exact user

This creates a missing data problem: the counterfactual is unobservable.

To solve this, we rely on:

  • Randomization
  • Modeling + assumptions
  • Matching or weighting approaches

All causal methods are, in some way, trying to approximate the missing counterfactual.

🧠 Assumptions (SUTVA, Ignorability, Overlap)

Causal inference relies heavily on assumptions, even when you don't randomize.

Three core ones:

  • SUTVA (Stable Unit Treatment Value Assumption)
    → Your treatment doesn't affect someone else's outcome.
    → No interference across units.
  • Ignorability (a.k.a. Unconfoundedness)
    → Given the observed covariates, treatment assignment is as good as random.
    → This lets you use observed data for estimation.
  • Overlap (a.k.a. Positivity)
    → Every unit has a non-zero probability of receiving either treatment.
    → You can't learn effects where there's no variation.

Without these, causal estimates can be biased or undefined. Always question whether they hold before trusting results.
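
Of the three, overlap is the easiest to probe empirically: fit a propensity model and check where the scores pile up. A minimal self-contained sketch (assuming scikit-learn; all variable names here are illustrative):

In [ ]:
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))
T = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))  # treatment driven by X[:, 0]

# Estimated P(T=1 | X); scores stuck near 0 or 1 in either group signal poor overlap
ps = LogisticRegression().fit(X, T).predict_proba(X)[:, 1]
for t in (0, 1):
    print(f"T={t}: propensity range [{ps[T == t].min():.2f}, {ps[T == t].max():.2f}]")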


🧪 Simulated Dataset Setup

Simulating data is the best way to control the truth when learning causal inference.

Hereโ€™s why we simulate:

  • You get full knowledge of ground-truth treatment effects.
  • You can deliberately create confounding, bias, non-randomness.
  • You can practice recovering the true causal effect using different methods.

In real-world observational data, the "truth" is hidden. Simulating lets you debug your causal intuition safely before dealing with messy production datasets.

In [30]:
# Simulated Dataset Setup
import numpy as np
import pandas as pd

np.random.seed(42)  # for reproducibility

# We'll define features, treatment assignment, and outcomes step-by-step later

🧬 Define treatment assignment logic

To simulate treatment realistically:

  • Treatment should depend on observed features.
  • Treatment should not be random; otherwise there's no confounding to deal with.

For example:

  • Wealthier users might be more likely to receive a premium offer.
  • Healthier patients might be less likely to receive intensive care.

We'll simulate a non-random treatment assignment based on a few covariates to mimic real-world biases.

In [31]:
# Define covariates (features)
n = 5000

age = np.random.normal(40, 12, n)       # Age
income = np.random.normal(60000, 15000, n)  # Annual income
prior_engagement = np.random.beta(2, 5, n)  # Past engagement score [0,1]

# Treatment assignment probability based on features
treatment_prob = (
    0.3 * (income > 70000).astype(float) +
    0.2 * (prior_engagement > 0.5).astype(float) +
    0.1 * (age < 30).astype(float) +
    np.random.normal(0, 0.05, n)  # small noise
)
treatment_prob = np.clip(treatment_prob, 0, 1)

# Assign treatment
T = np.random.binomial(1, treatment_prob)

🔬 Inject confounding intentionally

In real-world datasets, treatment assignment is not random; it's confounded by covariates.

We deliberately inject confounding so that:

  • Covariates (age, income, engagement) affect both treatment and outcome.
  • If we naively compare treated vs untreated, we'll get biased results.

Confounding creates the need for adjustment, which will be a major theme later.

In [32]:
# Let's define a "true" baseline outcome based on the same covariates
base_outcome = (
    50 + 
    0.02 * income +
    5 * prior_engagement -
    0.3 * age +
    np.random.normal(0, 5, n)  # random noise
)

🧊 Simulate potential outcomes + observed data

In the Rubin Potential Outcomes framework, each unit has two outcomes:

  • Y(1) → If treated
  • Y(0) → If not treated

We can simulate this by:

  • Applying a true treatment effect to Y(1)
  • Leaving Y(0) as the base outcome

Important: We observe only one of Y(1) or Y(0), depending on treatment assignment (T).

In [33]:
# Define a true treatment effect (could vary by subgroup later)
true_treatment_effect = 10  # a flat +10 effect for everyone

# Simulate potential outcomes
Y_0 = base_outcome
Y_1 = base_outcome + true_treatment_effect

# Observed outcome based on treatment assignment
Y_obs = T * Y_1 + (1 - T) * Y_0

# Assemble into a dataframe
df = pd.DataFrame({
    'age': age,
    'income': income,
    'prior_engagement': prior_engagement,
    'T': T,
    'Y_obs': Y_obs,
    'Y_0': Y_0,
    'Y_1': Y_1,
})

df.head()
Out[33]:
         age        income  prior_engagement  T        Y_obs          Y_0          Y_1
0  45.960570  53643.604770          0.188077  0  1114.502287  1114.502287  1124.502287
1  38.340828  53198.788374          0.170389  0  1099.893323  1099.893323  1109.893323
2  47.772262  33065.352411          0.511379  0   704.133035   704.133035   714.133035
3  58.276358  55048.647124          0.318793  0  1139.203509  1139.203509  1149.203509
4  37.190160  70992.436227          0.384439  0  1464.308234  1464.308234  1474.308234


🚫 Naive Estimation

Many causal questions are first attacked by simply comparing the treated vs. untreated groups.

Naive Approach:

Average outcome of treated - Average outcome of untreated.

This looks simple, but in observational data:

  • Treated and untreated units are not comparable.
  • Treatment assignment was not randomized.
  • Differences in baseline characteristics confound the simple difference.

Naive estimation almost always gives biased results unless you have perfect randomization.

In this section, we'll see how bad the naive approach can get even on a simple synthetic dataset.

In [34]:
# Quick naive estimation
treated_mean = df.loc[df['T'] == 1, 'Y_obs'].mean()
control_mean = df.loc[df['T'] == 0, 'Y_obs'].mean()

naive_diff = treated_mean - control_mean

print(f"Naive difference in means: {naive_diff:.2f}")
print(f"True treatment effect (ground truth): {true_treatment_effect}")
Naive difference in means: 250.53
True treatment effect (ground truth): 10

โŒ Simple difference in meansยถ

๐Ÿ“‰ Click to Expand

A simple difference in means is mathematically:

E[Y | T=1] - E[Y | T=0]

If treatment assignment were random:

  • The two groups would be exchangeable.
  • Baseline covariates would balance on average.
  • The simple difference would be an unbiased estimator of ATE.

But if treatment is confounded, then:

  • T=1 units may systematically differ from T=0 units.
  • The naive estimator picks up both causal effect and selection bias.

We'll soon quantify how large this bias can be.

In [35]:
# Let's quickly visualize how the treated vs control groups differ in covariates
df.groupby('T')[['age', 'income', 'prior_engagement']].mean()
Out[35]:
         age        income  prior_engagement
T                                           
0  40.389926  58304.201745          0.278891
1  37.892227  70283.224987          0.330779

โš ๏ธ Bias due to confoundingยถ

๐Ÿ“‰ Click to Expand

Bias from confounding happens when:

  • The treated group has systematically different baseline outcomes than the control group.

Mathematically:

Observed difference = True treatment effect + Bias from baseline differences

In our simulation:

  • Higher income users are more likely to be treated.
  • Income also directly influences outcome.
  • Therefore, the observed difference overstates the real effect.

This is why adjusting for confounders is critical: naive methods can easily mislead interventions and business decisions.

In [36]:
# Let's calculate the *average baseline outcome* (Y_0) in treated vs untreated groups
treated_baseline = df.loc[df['T'] == 1, 'Y_0'].mean()
control_baseline = df.loc[df['T'] == 0, 'Y_0'].mean()

baseline_diff = treated_baseline - control_baseline

print(f"Baseline (Y_0) difference between treated and control: {baseline_diff:.2f}")
print("This baseline imbalance creates bias in naive estimation.")
Baseline (Y_0) difference between treated and control: 240.53
This baseline imbalance creates bias in naive estimation.


🕸️ Causal Diagrams (DAGs)

Directed Acyclic Graphs (DAGs) are a compact way to represent assumptions about the data generating process.

  • Nodes = variables
  • Edges (arrows) = direct causal influence

DAGs are not learned from data. They are drawn from domain knowledge to help reason about:

  • Confounders
  • Biases
  • Valid adjustment strategies

Almost every causal inference method implicitly or explicitly assumes a DAG about the world.

🧿 Quick primer on DAGs

A DAG (Directed Acyclic Graph) encodes assumptions about how variables causally relate.

  • Directed: Arrows have direction (cause → effect).
  • Acyclic: No feedback loops allowed (you can't return to a node).

Example:

Age → Income → Health

Means:

  • Age affects income.
  • Income affects health.
  • No reverse paths.

DAGs help identify:

  • Which paths are confounded
  • Which variables to control for
  • Whether effects are identifiable

They act like a map, letting you plan causal estimation strategies intelligently.
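
As a quick sketch (assuming the networkx package, which this notebook doesn't otherwise use), the example DAG can be written down and sanity-checked in code:

In [ ]:
import networkx as nx

# Encode Age → Income → Health as a directed graph
G = nx.DiGraph([("Age", "Income"), ("Income", "Health")])

print(nx.is_directed_acyclic_graph(G))                 # True: no feedback loops
print(list(nx.all_simple_paths(G, "Age", "Health")))   # [['Age', 'Income', 'Health']]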

🕷️ Confounding vs. colliders vs. mediators

Confounders:

  • Variables that influence both treatment and outcome.
  • Must be adjusted for to block bias.
  • Example: Age confounds the relationship between Exercise (T) and Health (Y).

Colliders:

  • Variables caused by two other variables.
  • Must NOT adjust for colliders; doing so opens spurious associations.
  • Example: Adjusting for "hospitalization" might introduce bias when studying Smoking → Lung Disease.

Mediators:

  • Variables on the causal pathway between treatment and outcome.
  • Adjusting for them blocks part of the causal effect you want to measure.
  • Example: Exercise → Fitness → Health (fitness is a mediator).

👉 Correct adjustment requires distinguishing among these roles.
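
Collider bias is easy to demonstrate numerically. In this illustrative simulation (made-up variables), two independent causes become correlated once we condition on their common effect:

In [ ]:
import numpy as np

rng = np.random.default_rng(0)
smoking = rng.normal(size=100_000)
genetics = rng.normal(size=100_000)        # independent of smoking by construction
hospital = smoking + genetics > 1.0        # collider: caused by both

print(np.corrcoef(smoking, genetics)[0, 1])                      # ~0 overall
print(np.corrcoef(smoking[hospital], genetics[hospital])[0, 1])  # clearly negative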

🔗 What can/can't be estimated just from data

Not everything is identifiable from data alone; assumptions are unavoidable.

What can be estimated:

  • Associations (correlations, patterns)
  • Conditional independence structures
  • Causal effects if the right covariates are controlled (based on DAG structure)

What cannot be estimated:

  • Whether a relationship is causal (without assumptions)
  • The full structure of a DAG (unless randomized experiments are used)

Data + assumptions → Causal conclusions.
Data alone → Only correlational findings.

DAGs clarify where you need domain knowledge vs where data suffices.


๐Ÿ” Backdoor Adjustment Methodsยถ

๐Ÿ“‰ Click to Expand

Backdoor adjustment methods aim to block backdoor paths โ€” non-causal paths that create bias between treatment and outcome.

Core idea:

  • Identify variables (confounders) that open backdoor paths.
  • Condition on them, either by stratifying, modeling, or matching.

Backdoor adjustment simulates what would happen if treatment assignment were random within levels of the confounders.

It's the foundational idea behind:

  • Regression
  • Matching
  • Stratification
  • Propensity scores

If you can block all backdoor paths, you can estimate causal effects from observational data reliably.

🧾 Conditioning on confounders

Conditioning means holding confounders constant when comparing treated vs untreated units.

Examples:

  • Comparing treated vs untreated users within each income band.
  • Comparing recovery rates within each age group.

By conditioning, you eliminate the variation due to confounders, isolating the causal effect.

Important:
You should only condition on true confounders, not colliders or mediators.

Conditioning can be implemented via:

  • Subgrouping
  • Regression
  • Matching
  • Weighting

In [37]:
# Simple conditioning by subgroup (example: income > 70k vs <= 70k)
df['high_income'] = (df['income'] > 70000).astype(int)

grouped = df.groupby(['high_income', 'T'])['Y_obs'].mean().unstack()
grouped['diff'] = grouped[1] - grouped[0]

print("Simple Conditional Difference by Income Group:")
print(grouped[['diff']])
Simple Conditional Difference by Income Group:
T                 diff
high_income           
0            16.867043
1            13.563878

🕵️‍♂️ Stratification / Subgroup analysis

Stratification means breaking the dataset into buckets based on confounders and comparing treatment effects within each bucket.

Typical steps:

  1. Divide data based on a confounder (e.g., low vs high engagement).
  2. Within each stratum, compute treated vs control differences.
  3. Aggregate across strata (weighted average; see the sketch after the output below).

When useful:

  • When confounders are categorical or easily discretized.
  • When interpretability is important.

Limits:

  • Doesn't scale well with many confounders (curse of dimensionality).

In [38]:
# Stratify based on prior_engagement (simple high vs low)
df['high_engagement'] = (df['prior_engagement'] > 0.5).astype(int)

strat_grouped = df.groupby(['high_engagement', 'T'])['Y_obs'].mean().unstack()
strat_grouped['diff'] = strat_grouped[1] - strat_grouped[0]

print("Conditional Difference by Engagement Group:")
print(strat_grouped[['diff']])
Conditional Difference by Engagement Group:
T                      diff
high_engagement            
0                290.043114
1                145.885304
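
The steps above stop at per-stratum differences. The final aggregation step, a stratum-size-weighted average, can be sketched with the strat_grouped table already computed (note the estimate stays far from the truth here because income, the main confounder, is not in the strata):

In [ ]:
# Weight each stratum's treated-vs-control difference by its share of the data
weights = df['high_engagement'].value_counts(normalize=True)
stratified_estimate = (strat_grouped['diff'] * weights).sum()
print(f"Stratified (weighted) estimate: {stratified_estimate:.2f}")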

📊 Regression Adjustment

Regression adjustment estimates causal effects by controlling for confounders via regression.

Simple linear model:

Y = β₀ + β₁·T + β₂·(confounder1) + β₃·(confounder2) + ... + ε

  • β₁ captures the adjusted effect of treatment, controlling for confounders.
  • It removes bias from observable confounders (under correct model specification).

Advantages:

  • Easy to use.
  • Scales to many covariates.

Risks:

  • Sensitive to model misspecification.
  • Wrong functional forms (nonlinearities, interactions) can bias estimates.

In [39]:
import statsmodels.api as sm

X = df[['T', 'age', 'income', 'prior_engagement']]
X = sm.add_constant(X)
y = df['Y_obs']

reg_model = sm.OLS(y, X).fit()

print(reg_model.summary())

print(f"\nEstimated treatment effect (ฮฒโ‚) after adjustment: {reg_model.params['T']:.2f}")
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  Y_obs   R-squared:                       1.000
Model:                            OLS   Adj. R-squared:                  1.000
Method:                 Least Squares   F-statistic:                 4.652e+06
Date:                Sat, 26 Apr 2025   Prob (F-statistic):               0.00
Time:                        17:50:40   Log-Likelihood:                -15122.
No. Observations:                5000   AIC:                         3.025e+04
Df Residuals:                    4995   BIC:                         3.029e+04
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
====================================================================================
                       coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------
const               50.3660      0.398    126.398      0.000      49.585      51.147
T                   10.0641      0.220     45.761      0.000       9.633      10.495
age                 -0.2942      0.006    -49.784      0.000      -0.306      -0.283
income               0.0200   4.83e-06   4142.593      0.000       0.020       0.020
prior_engagement     4.3802      0.452      9.681      0.000       3.493       5.267
==============================================================================
Omnibus:                        5.657   Durbin-Watson:                   1.967
Prob(Omnibus):                  0.059   Jarque-Bera (JB):                5.136
Skew:                          -0.029   Prob(JB):                       0.0767
Kurtosis:                       2.854   Cond. No.                     4.38e+05
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 4.38e+05. This might indicate that there are
strong multicollinearity or other numerical problems.

Estimated treatment effect (β₁) after adjustment: 10.06

📌 Propensity Score Matching (PSM)

Propensity Score Matching (PSM) is a two-step procedure:

  1. Model the probability of receiving treatment (P(T=1 | X)) using observed covariates.
  2. Match treated and control units with similar propensity scores.

Why PSM?

  • Instead of adjusting for many covariates separately, you balance treated and control groups on a single dimension (the propensity score).

When useful:

  • When covariate space is high-dimensional.
  • When you want a matched sample that resembles randomized data.

Limitations:

  • Requires good overlap (common support).
  • Still relies on the unconfoundedness assumption.

In [40]:
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

# Step 1: Estimate propensity scores
ps_model = LogisticRegression()
ps_model.fit(df[['age', 'income', 'prior_engagement']], df['T'])
df['propensity_score'] = ps_model.predict_proba(df[['age', 'income', 'prior_engagement']])[:,1]

# Step 2: Nearest neighbor matching
treated = df[df['T'] == 1]
control = df[df['T'] == 0]

nn = NearestNeighbors(n_neighbors=1)
nn.fit(control[['propensity_score']])

distances, indices = nn.kneighbors(treated[['propensity_score']])
matched_control = control.iloc[indices.flatten()]

# Calculate matched difference
matched_diff = (treated['Y_obs'].values - matched_control['Y_obs'].values).mean()

print(f"Propensity Score Matched Estimate of Treatment Effect: {matched_diff:.2f}")
Propensity Score Matched Estimate of Treatment Effect: 197.50
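
Note that the matched estimate above lands far from the true effect of 10. One plausible reason is that the logistic model was fit on raw, unscaled covariates, and every treated unit was matched regardless of match quality. A hedged refinement (reusing df and the imports from the cell above; the caliper value is a hypothetical tolerance, not a recommendation) standardizes features and drops poor matches:

In [ ]:
from sklearn.preprocessing import StandardScaler

# Refit the propensity model on standardized covariates
X_ps = StandardScaler().fit_transform(df[['age', 'income', 'prior_engagement']])
df['ps_scaled'] = LogisticRegression().fit(X_ps, df['T']).predict_proba(X_ps)[:, 1]

treated = df[df['T'] == 1]
control = df[df['T'] == 0]
nn = NearestNeighbors(n_neighbors=1).fit(control[['ps_scaled']])
dist, idx = nn.kneighbors(treated[['ps_scaled']])

caliper = 0.01  # hypothetical: discard pairs whose scores differ by more than this
keep = dist.flatten() < caliper
matched_diff_cal = (treated['Y_obs'].values[keep]
                    - control['Y_obs'].values[idx.flatten()[keep]]).mean()
print(f"Caliper-matched estimate ({keep.sum()} pairs kept): {matched_diff_cal:.2f}")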


🎯 Instrumental Variables (IV)

Instrumental Variables (IV) methods are used when simple adjustment for confounders is impossible or not credible.

When treatment is endogenous (affected by unobserved factors also affecting outcome), traditional methods like regression fail.

IV solves this by:

  • Using a "proxy" (instrument) that affects treatment but is otherwise unrelated to the outcome except through treatment.
  • "Re-randomizing" variation in treatment based on the instrument.

You create quasi-randomization even in observational data.

Classic examples:

  • Distance to hospital → instrument for getting surgery.
  • Random assignment of judges → instrument for sentencing severity.

๐Ÿช When backdoor paths canโ€™t be blockedยถ

๐Ÿ“‰ Click to Expand

You need IV methods when:

  • There are unobserved confounders you can't measure.
  • No set of observed covariates satisfies ignorability.
  • Standard backdoor adjustment will be biased.

Example:

  • Studying the effect of education on income: natural intelligence is a hidden confounder (affects both education and income).
  • You can't just regress income ~ education; bias remains.

Key realization:
If backdoor paths exist through unobserved variables, IV becomes necessary.

🎯 Valid instrument conditions

For an instrument (Z) to be valid, it must satisfy:

  1. Relevance:
    Z must affect treatment T.
    (There must be a first-stage effect.)
  2. Exclusion Restriction:
    Z must affect the outcome Y only through T.
    (No direct path from Z to Y.)
  3. Independence (As-if Randomness):
    Z must be independent of unobserved confounders affecting Y.

If any of these fail:

  • IV estimates are biased or meaningless.
  • You can't fix bad instruments with bigger sample sizes.

Choosing or arguing a valid instrument is 90% of the IV battle.

🧩 2-Stage Least Squares (2SLS)

2SLS (Two-Stage Least Squares) is the classic estimation procedure for IV:

  • Stage 1:
    Regress treatment T on instrument Z (and any controls)
    → get predicted treatment (T̂)
  • Stage 2:
    Regress outcome Y on predicted treatment (T̂)

The second-stage coefficient gives the causal effect of treatment on outcome, isolating variation driven by the instrument.

Warning:

  • Standard regression software doesn't correct standard errors properly when doing 2SLS manually.
  • Packages like linearmodels automate this (see the sketch after the results below).

In [41]:
from statsmodels.api import OLS, add_constant

# Simulate an instrument Z (let's assume it's random and satisfies conditions)
np.random.seed(42)
df['Z'] = np.random.binomial(1, 0.5, size=len(df))

# Make treatment depend partly on Z
df['T_iv'] = (0.5 * df['Z'] + 0.5 * df['prior_engagement'] + np.random.normal(0, 0.1, len(df))) > 0.5
df['T_iv'] = df['T_iv'].astype(int)

# Stage 1: Predict treatment from instrument
X_stage1 = add_constant(df[['Z', 'age', 'income', 'prior_engagement']])
stage1_model = OLS(df['T_iv'], X_stage1).fit()
df['T_hat'] = stage1_model.predict(X_stage1)

# Stage 2: Predict outcome from predicted treatment
X_stage2 = add_constant(df[['T_hat', 'age', 'income', 'prior_engagement']])
stage2_model = OLS(df['Y_obs'], X_stage2).fit()

print(stage2_model.summary())

print(f"\nEstimated causal effect (via 2SLS): {stage2_model.params['T_hat']:.2f}")
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  Y_obs   R-squared:                       1.000
Model:                            OLS   Adj. R-squared:                  1.000
Method:                 Least Squares   F-statistic:                 3.278e+06
Date:                Sat, 26 Apr 2025   Prob (F-statistic):               0.00
Time:                        17:50:40   Log-Likelihood:                -15996.
No. Observations:                5000   AIC:                         3.200e+04
Df Residuals:                    4995   BIC:                         3.203e+04
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
====================================================================================
                       coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------
const               48.1296      0.474    101.498      0.000      47.200      49.059
T_hat                0.2548      0.197      1.296      0.195      -0.131       0.640
age                 -0.3133      0.007    -44.628      0.000      -0.327      -0.300
income               0.0201   5.54e-06   3619.407      0.000       0.020       0.020
prior_engagement     6.6954      0.540     12.391      0.000       5.636       7.755
==============================================================================
Omnibus:                       71.333   Durbin-Watson:                   1.970
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               74.381
Skew:                           0.287   Prob(JB):                     7.05e-17
Kurtosis:                       3.169   Cond. No.                     4.33e+05
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 4.33e+05. This might indicate that there are
strong multicollinearity or other numerical problems.

Estimated causal effect (via 2SLS): 0.25
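
A note on this output: Y_obs was generated from the original treatment T, not from T_iv, so the true effect of T_iv on Y_obs is zero, and the small, statistically insignificant coefficient (0.25, p ≈ 0.20) is exactly what an honest 2SLS should report here. For properly corrected standard errors, a one-call version via the linearmodels package (assumed installed; not used elsewhere in this notebook) would look like:

In [ ]:
from linearmodels.iv import IV2SLS

# Exogenous controls enter directly; T_iv is instrumented by Z
iv_res = IV2SLS(
    dependent=df['Y_obs'],
    exog=add_constant(df[['age', 'income', 'prior_engagement']]),
    endog=df['T_iv'],
    instruments=df['Z'],
).fit()
print(iv_res.params['T_iv'], iv_res.std_errors['T_iv'])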


🧰 Double Machine Learning (DML)

Double Machine Learning (DML) is a modern causal estimation technique that:

  • Separates the modeling of treatment and outcome.
  • Uses flexible machine learning models to control for complex confounders.
  • Debiases the final treatment effect estimation by orthogonalization.

Why DML matters:

  • Traditional linear regression forces linearity.
  • DML allows for nonlinear, high-dimensional adjustment without overfitting causal estimates.

It builds robust treatment effect estimators even when you use ML methods like Random Forests, XGBoost, or Neural Nets for intermediate steps.

🪛 Use ML models for nuisance functions

In DML, you model two "nuisance functions" first:

  1. Outcome model: Y ~ X
  2. Treatment model: T ~ X

You can use any ML model (linear regression, random forest, gradient boosting, etc.) for these.

Key point:
The goal is accurate prediction, not causal interpretation, at this stage.

Later, DML uses the residuals from these models to isolate the causal effect of T on Y.

This two-step process protects the final estimate from overfitting to noisy high-dimensional features.

In [42]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Define features
features = ['age', 'income', 'prior_engagement']

# Split into train/test for honest estimation
# (same test_size and random_state in both calls, so the T split stays row-aligned with X and y)
X_train, X_test, y_train, y_test = train_test_split(df[features], df['Y_obs'], test_size=0.3, random_state=42)
T_train, T_test = train_test_split(df['T'], test_size=0.3, random_state=42)

# Outcome model: Y ~ X
y_model = RandomForestRegressor()
y_model.fit(X_train, y_train)
df['y_hat'] = y_model.predict(df[features])

# Treatment model: T ~ X
t_model = RandomForestRegressor()
t_model.fit(X_train, T_train)
df['t_hat'] = t_model.predict(df[features])

🧱 Residualization + orthogonalization logic

After fitting nuisance models:

  • Calculate residuals:
    • Residual_Y = Y - Ŷ
    • Residual_T = T - T̂
  • Then regress Residual_Y ~ Residual_T.

Why?

  • This removes the part of Y and T that is predictable from X.
  • What remains captures the "clean" causal variation of T on Y, orthogonal to confounders.

This two-stage process is called orthogonalization; it minimizes bias from overfitting nuisance functions.

It's a key innovation that separates DML from naive ML-based adjustment.

In [43]:
# Calculate residuals
df['residual_Y'] = df['Y_obs'] - df['y_hat']
df['residual_T'] = df['T'] - df['t_hat']

# Final stage: regress residual_Y ~ residual_T
X_resid = sm.add_constant(df['residual_T'])
y_resid = df['residual_Y']

residual_model = sm.OLS(y_resid, X_resid).fit()

print(residual_model.summary())

print(f"\nEstimated causal effect via DML: {residual_model.params['residual_T']:.2f}")
                            OLS Regression Results                            
==============================================================================
Dep. Variable:             residual_Y   R-squared:                       0.127
Model:                            OLS   Adj. R-squared:                  0.127
Method:                 Least Squares   F-statistic:                     726.5
Date:                Sat, 26 Apr 2025   Prob (F-statistic):          1.62e-149
Time:                        17:50:41   Log-Likelihood:                -14928.
No. Observations:                5000   AIC:                         2.986e+04
Df Residuals:                    4998   BIC:                         2.987e+04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.0842      0.068      1.243      0.214      -0.049       0.217
residual_T     8.9188      0.331     26.953      0.000       8.270       9.567
==============================================================================
Omnibus:                     7090.850   Durbin-Watson:                   1.998
Prob(Omnibus):                  0.000   Jarque-Bera (JB):          9991352.741
Skew:                           7.694   Prob(JB):                         0.00
Kurtosis:                     221.453   Cond. No.                         4.88
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Estimated causal effect via DML: 8.92
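
One caveat: the nuisance models above were fit on a train split but then used to predict every row, including their own training data. Full DML uses cross-fitting, where each observation gets out-of-fold nuisance predictions. A hedged sketch with scikit-learn's cross_val_predict (reusing features, df, and sm from earlier cells):

In [ ]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_predict

# Out-of-fold predictions: each row is scored by models that never saw it
y_hat_cf = cross_val_predict(RandomForestRegressor(), df[features], df['Y_obs'], cv=5)
t_hat_cf = cross_val_predict(RandomForestRegressor(), df[features], df['T'], cv=5)

res_Y = df['Y_obs'] - y_hat_cf
res_T = df['T'] - t_hat_cf
cf_model = sm.OLS(res_Y, sm.add_constant(res_T)).fit()
print(f"Cross-fitted DML estimate: {cf_model.params.iloc[1]:.2f}")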

🧲 When to prefer over traditional regression

You should prefer DML over traditional regression when:

  • High-dimensional confounders (lots of features) exist.
  • Nonlinear relationships are likely between covariates and treatment/outcome.
  • Flexible modeling is important (tree-based, neural nets, etc.)
  • You are concerned about model misspecification in simple linear regression.

Traditional regression assumes:

  • Linear relationships
  • No complex interactions unless explicitly modeled

DML frees you from strict parametric forms, allowing modern ML models while still aiming for valid causal estimates.

✅ DML shines in modern settings (tech products, healthcare, online platforms) where datasets are messy, rich, and big.


🌈 Heterogeneous Treatment Effects

Until now, we've talked about the average effect of treatment across the entire population (ATE).

But in reality:

  • Different users respond differently.
  • Treatment effects vary by user characteristics.

Heterogeneous Treatment Effects (HTE) study how effects vary:

  • Across groups (e.g., high engagement vs low engagement)
  • Across individuals (personalized effects)

Estimating HTE is critical for:

  • Personalized recommendations
  • Smart targeting (marketing, healthcare, product launches)

🎨 ATE vs. CATE vs. ITE

Different layers of treatment effect granularity:

  • ATE (Average Treatment Effect):
    Average effect across everyone.
  • CATE (Conditional Average Treatment Effect):
    Average effect given some subgroup (e.g., CATE for users <30 years old).
  • ITE (Individual Treatment Effect):
    Effect for a specific user.

In practice:

  • ATE is easiest to estimate.
  • CATEs are often actionable (targeted marketing).
  • ITEs are the hardest: noisy and high-variance.

Good causal inference methods can recover CATEs/ITEs if enough data and signal exist.

In [44]:
# Calculate ATE (simple diff from simulation ground truth)
ate = (df['Y_1'] - df['Y_0']).mean()
print(f"True ATE: {ate:.2f}")

# Calculate CATE for high engagement group
cate_high_engagement = (df[df['prior_engagement'] > 0.5]['Y_1'] - df[df['prior_engagement'] > 0.5]['Y_0']).mean()
print(f"CATE for high engagement users: {cate_high_engagement:.2f}")

# Show a few ITEs
df['ITE_true'] = df['Y_1'] - df['Y_0']
print("\nSample ITEs:")
print(df[['age', 'income', 'prior_engagement', 'ITE_true']].head())
True ATE: 10.00
CATE for high engagement users: 10.00

Sample ITEs:
         age        income  prior_engagement  ITE_true
0  45.960570  53643.604770          0.188077      10.0
1  38.340828  53198.788374          0.170389      10.0
2  47.772262  33065.352411          0.511379      10.0
3  58.276358  55048.647124          0.318793      10.0
4  37.190160  70992.436227          0.384439      10.0

🌟 Uplift models and use cases

Uplift modeling directly models the difference in probability of a positive outcome between treated and untreated users.

Instead of modeling outcome probabilities separately, uplift models focus on:

  • Who is most persuadable?
  • Who would change behavior because of treatment?

Where uplift models shine:

  • Marketing campaigns (maximize conversions per dollar)
  • Customer retention (target save offers only to those who would churn)
  • Medical interventions (target high-risk patients)

Typical techniques:

  • Uplift Decision Trees
  • Two-model approach (predict Y|T=1 and Y|T=0 separately, then subtract; see the sketch after this list)
  • Causal Forests
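
A minimal two-model (T-learner) sketch on the simulated df, reusing RandomForestRegressor from earlier; the uplift scores are illustrative, not tuned:

In [ ]:
from sklearn.ensemble import RandomForestRegressor

feats = ['age', 'income', 'prior_engagement']

# Fit separate outcome models on treated and control rows
m1 = RandomForestRegressor(random_state=0).fit(df.loc[df['T'] == 1, feats],
                                               df.loc[df['T'] == 1, 'Y_obs'])
m0 = RandomForestRegressor(random_state=0).fit(df.loc[df['T'] == 0, feats],
                                               df.loc[df['T'] == 0, 'Y_obs'])

# Uplift = predicted outcome if treated minus predicted outcome if untreated
df['uplift_two_model'] = m1.predict(df[feats]) - m0.predict(df[feats])
print(df['uplift_two_model'].describe())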

🧩 Tree-based methods (Causal Trees, Causal Forests)

Tree-based methods are powerful for discovering treatment effect heterogeneity:

  • Causal Trees:
    • Split data to maximize treatment effect differences between branches.
    • One tree trained specifically for causal splits.
  • Causal Forests:
    • Ensemble of causal trees.
    • Averages treatment effect estimates across trees.
    • Reduces variance compared to a single tree.

They can estimate CATEs reliably across different subgroups without manually specifying interactions.


When useful:

  • You expect heterogeneity but don't know in advance how to segment.
  • You want flexible, interpretable treatment effect estimation.

In [45]:
# !pip install econml

In [46]:
# Causal Forest: Full Correct Code

from econml.grf import CausalForest
from sklearn.model_selection import train_test_split

# Prepare features and outcome
X = df[['age', 'income', 'prior_engagement']].values  # Features (2D)
T = df['T'].values  # Treatment (1D)
Y = df['Y_obs'].values  # Observed outcome (1D)

# Fit causal forest
forest = CausalForest(n_estimators=100, random_state=42)
forest.fit(X, T, Y)  # Correct order: X, T, Y

# Predict treatment effects (CATEs)
cate_preds = forest.predict(X)

# Store predictions
df['CATE_predicted'] = cate_preds

# Display sample predictions
print("\nSample predicted CATEs:")
print(df[['age', 'income', 'prior_engagement', 'CATE_predicted']].head())
Sample predicted CATEs:
         age        income  prior_engagement  CATE_predicted
0  45.960570  53643.604770          0.188077       15.767683
1  38.340828  53198.788374          0.170389       18.062330
2  47.772262  33065.352411          0.511379       -4.625701
3  58.276358  55048.647124          0.318793       14.400047
4  37.190160  70992.436227          0.384439       11.047467


🧪 Placebo Tests & Robustness Checks

Even after careful causal estimation, you must ask:

  • Was it a real effect?
  • Could hidden bias still exist?

Robustness checks build confidence that your findings are not artifacts of modeling choices, random noise, or hidden confounders.

Placebo tests simulate situations where you expect no effect; if you detect an effect there, something's wrong.

Robust causal analysis is not just about point estimates; it's about proving to yourself that you aren't fooling yourself.

🧻 Randomized placebo treatments

Placebo tests inject "fake" treatments to validate your method.

Idea:

  • Randomly assign a placebo treatment.
  • Re-estimate the treatment effect.
  • Expect no significant effect if your method is honest.

If your model finds strong effects even when treatment is randomized, your pipeline is leaking bias or overfitting.


Placebo Tests Are Critical:

  • They detect specification errors.
  • They detect uncontrolled confounding.
  • They expose overfitting to noise.

Placebo tests are a basic but powerful check, and always worth doing.

In [47]:
# Create a random placebo treatment
np.random.seed(123)
df['placebo_T'] = np.random.binomial(1, 0.5, size=len(df))

# Estimate naive difference for placebo treatment
placebo_treated_mean = df.loc[df['placebo_T'] == 1, 'Y_obs'].mean()
placebo_control_mean = df.loc[df['placebo_T'] == 0, 'Y_obs'].mean()

placebo_naive_diff = placebo_treated_mean - placebo_control_mean

print(f"Placebo test: naive difference in means = {placebo_naive_diff:.2f}")

# Ideally close to zero if model is unbiased
Placebo test: naive difference in means = 7.79
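
The naive placebo gap of 7.79 is small relative to the outcome's spread (the outcome standard deviation is roughly 300 in this simulation), but a sharper check runs the placebo through the same adjusted pipeline. A sketch reusing the regression-adjustment setup from earlier:

In [ ]:
# Re-run the regression adjustment with the fake treatment:
# an honest pipeline should return a coefficient near zero.
X_placebo = sm.add_constant(df[['placebo_T', 'age', 'income', 'prior_engagement']])
placebo_model = sm.OLS(df['Y_obs'], X_placebo).fit()
print(f"Placebo coefficient: {placebo_model.params['placebo_T']:.2f} "
      f"(p = {placebo_model.pvalues['placebo_T']:.3f})")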

⚗️ Sensitivity to unobserved confounding

Even after adjusting for observed confounders, unobserved variables can still bias causal estimates.

Sensitivity analysis asks:

  • How strong would hidden confounding have to be to overturn my results?

Typical approaches:

  • Simulate hidden confounders and see effect size shifts.
  • Use formulas (like Rosenbaum bounds) to quantify robustness.

In practical data science:

  • Simulate scenarios with added fake bias.
  • Stress test conclusions under "worst plausible" hidden biases.

If your conclusions survive plausible levels of unobserved bias, they are more credible.

Important mindset:
No analysis is perfect; the goal is to understand limits, not pretend away uncertainty.

In [48]:
# Simulate a hidden variable and add it to the outcome
# (note: it is drawn independently of T here, so it mostly adds noise rather than bias)
np.random.seed(42)
df['hidden_confounder'] = np.random.normal(0, 1, size=len(df))

# Make the outcome depend slightly on this hidden confounder
df['Y_obs_biased'] = df['Y_obs'] + 2 * df['hidden_confounder']

# Re-run naive difference with biased outcome
treated_mean_biased = df.loc[df['T'] == 1, 'Y_obs_biased'].mean()
control_mean_biased = df.loc[df['T'] == 0, 'Y_obs_biased'].mean()

biased_naive_diff = treated_mean_biased - control_mean_biased

print(f"Naive difference in means (with hidden confounding): {biased_naive_diff:.2f}")

# See how much bias was introduced
bias_inflation = biased_naive_diff - naive_diff
print(f"Inflation in estimate due to hidden confounder: {bias_inflation:.2f}")
Naive difference in means (with hidden confounding): 250.11
Inflation in estimate due to hidden confounder: -0.42
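
As the printout shows, a hidden variable that is independent of treatment barely moves the estimate. A sharper stress test mixes the hidden variable into both treatment and outcome and sweeps its strength; this sketch uses made-up strength values purely for illustration:

In [ ]:
# Sweep confounder strength: bias in the naive difference should grow with it
for strength in [0.0, 0.5, 1.0, 2.0]:
    u = np.random.normal(0, 1, len(df))
    t_star = np.random.binomial(1, np.clip(0.3 + 0.2 * strength * u, 0.01, 0.99))
    y_star = df['Y_0'] + 10 * t_star + strength * 20 * u
    diff = y_star[t_star == 1].mean() - y_star[t_star == 0].mean()
    print(f"strength={strength}: naive diff = {diff:.1f} (true effect = 10)")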


🧬 Counterfactual Thinking

Counterfactual thinking is the backbone of causal inference.

Instead of asking:

"What happened?"

We ask:

"What would have happened if things were different?"

In causal inference:

  • Each unit (user, patient, item) has two potential outcomes.
  • Only one is observed.
  • The other, the counterfactual, must be predicted or estimated.

Counterfactual reasoning enables:

  • Simulating user behavior under alternate scenarios.
  • Personalizing interventions based on predicted outcomes.

Without counterfactuals, causal inference is blind.

🤖 Predicting what would've happened

Predicting counterfactuals means estimating:

  • Y(1) for untreated units.
  • Y(0) for treated units.

We use:

  • Machine learning models trained on observed data.
  • Causal forests, meta-learners, and other counterfactual predictors.

Goal:

  • Recover the missing potential outcome.
  • Estimate individual treatment effects (ITEs).

This enables granular interventions, not just average effects across a population.


Important to remember:
Predicted counterfactuals are estimates, not direct observations; uncertainty always exists.

In [49]:
# Predict counterfactual outcomes using the causal forest trained earlier

# Predict Y(1) and Y(0) separately
cate_preds = df['CATE_predicted'].values
baseline_preds = df['y_hat'].values  # From earlier outcome model

# Predict counterfactual outcomes
df['Y_cf_T1'] = baseline_preds + cate_preds  # Predicted outcome if treated
df['Y_cf_T0'] = baseline_preds  # Predicted outcome if untreated (baseline)

# Now simulate what would happen if treatment status flipped
df['counterfactual_outcome'] = np.where(
    df['T'] == 1,
    df['Y_cf_T0'],  # If treated, counterfactual is untreated
    df['Y_cf_T1']   # If untreated, counterfactual is treated
)

print("\nSample Counterfactual Predictions:")
print(df[['T', 'Y_obs', 'counterfactual_outcome']].head())
Sample Counterfactual Predictions:
   T        Y_obs  counterfactual_outcome
0  0  1114.502287             1127.868133
1  0  1099.893323             1119.847270
2  0   704.133035              697.394845
3  0  1139.203509             1151.640784
4  0  1464.308234             1475.218807

๐Ÿ” Usage in recommendation & personalizationยถ

๐Ÿ“‰ Click to Expand

Counterfactual predictions unlock personalization:

Instead of treating everyone the same, you can:

  • Target users where treatment has highest predicted uplift.
  • De-prioritize users who won't respond.

Examples:

  • Marketing: Show ads only to users likely to convert if nudged.
  • Healthcare: Prioritize interventions for patients who benefit most.
  • Products: Recommend features or promotions to maximize lift per user.

Strategic mindshift:
Focus on marginally persuadable users, not just overall averages.

Real-world use cases often combine:

  • Causal effect estimation (CATE/ITE)
  • Ranking users by expected benefit
  • Action prioritization based on counterfactuals

In [50]:
# Rank users by predicted CATE
df['priority_score'] = df['CATE_predicted']

# Top 5 users we should prioritize for treatment
top_users = df.sort_values('priority_score', ascending=False).head(5)

print("\nTop users to prioritize based on CATE:")
print(top_users[['age', 'income', 'prior_engagement', 'CATE_predicted']])
Top users to prioritize based on CATE:
            age        income  prior_engagement  CATE_predicted
2841  21.800289  53281.835326          0.350447       20.543173
724   42.050385  53280.579923          0.403388       20.443580
3205  31.876058  53300.354298          0.420334       20.411161
4614  40.670367  53244.477131          0.632127       20.275141
3711  24.210210  53209.647772          0.335915       20.259483


📌 Closing Notes

You now have a practical understanding of core causal inference techniques.

You should be able to:

  • Simulate data with confounding
  • Estimate naive effects and detect bias
  • Adjust using regression, matching, stratification
  • Apply modern tools like DML and Causal Forests
  • Think in terms of counterfactuals, not just correlations

Remember:
Causal thinking is not just a technique; it's a lens to see decision-making clearly.

๐Ÿ“ Summary table of methodsยถ

๐Ÿ“‰ Click to Expand
MethodWhen UsefulStrengthsWeaknesses
Simple Diff-in-MeansRandomized experimentsEasy, unbiasedUseless with confounding
Regression AdjustmentObservational data with measured confoundersEasy to implementModel misspecification risk
StratificationSmall number of discrete confoundersTransparentBreaks down in high dimensions
Propensity Score Matching (PSM)Observational data with many confoundersBalances groupsSensitive to model of treatment
Instrumental Variables (IV)Unobserved confounders existBypasses confoundingHard to find good instruments
Double Machine Learning (DML)High-dimensional nonlinear confoundersML flexibility + debiasingNeeds lots of data, honest splits
Causal ForestsHeterogeneous treatment effectsFlexible CATE estimationComplex, less interpretable

📋 When to use what

Choosing a method depends on:

  • Randomization present?
    • Yes → Simple difference in means is fine.
    • No → Need adjustment.
  • Are confounders observed?
    • Yes → Regression, PSM, DML are options.
    • No → Need IV or natural experiments.
  • Do you expect heterogeneous effects?
    • Yes → Causal Trees, Causal Forests, Meta-learners.
  • Is high-dimensional data involved?
    • Yes → Prefer DML over simple regression.

Choosing the right method = matching the method to the bias and complexity in your data.

📎 Causal vs Predictive mindset

Predictive modeling mindset:

  • Focuses on fitting observed outcomes.
  • Good for forecasts, risk scores, recommendation engines.
  • Does not care about interventions.

Causal inference mindset:

  • Focuses on what would happen if we intervened.
  • Good for making decisions (policies, treatments, products).
  • Requires stronger assumptions, careful design.

Key difference:
Predictive models can be accurate yet useless for interventions.
Causal models are harder but necessary to make confident decisions.


Quote to remember:

"All models are wrong. Some models are useful. Only causal models are useful for actions."
