21  Potential Outcomes, Identification, and Estimation Under Unconfoundedness

import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from sklearn.linear_model import LogisticRegression
from scipy import stats

plt.style.use('assets/book.mplstyle')
rng = np.random.default_rng(42)

21.0.1 Notation for this chapter

Symbol Meaning
\(Y_i(1), Y_i(0)\) Potential outcomes (treated, control)
\(D_i \in \{0, 1\}\) Treatment indicator
\(Y_i = D_i Y_i(1) + (1-D_i) Y_i(0)\) Observed outcome
\(\tau = \mathbb{E}[Y(1) - Y(0)]\) Average treatment effect (ATE)
\(e(\mathbf{x}) = \Pr(D=1 \mid \mathbf{x})\) Propensity score

21.1 Motivation

Regression (Chapters 10-14) tells you about associations: “people who take the drug have better outcomes.” Causal inference asks the harder question: “does the drug cause better outcomes, or do healthier people choose to take it?”

The difference matters. If healthier people self-select into treatment, the naive comparison of treated vs untreated outcomes is biased — it conflates the treatment effect with the selection effect. This chapter develops the tools to separate the two.

The practical setting: You have observational data (not a randomized experiment). Some units received treatment (\(D = 1\)), others didn’t (\(D = 0\)). You observe covariates \(\mathbf{X}\) (age, income, health status) and an outcome \(Y\) (recovery time, earnings, test score). You want to estimate the causal effect of \(D\) on \(Y\).

Four estimators, all built from scipy/statsmodels/sklearn components:

  1. Regression adjustment — fit separate outcome models for treated and control, predict counterfactuals
  2. Matching — pair treated and control units with similar covariates, compare outcomes within pairs
  3. IPW (Inverse Probability Weighting) — reweight observations to create a pseudo-population where treatment is random
  4. AIPW (Augmented IPW) — combines regression and weighting for “double robustness”

21.2 Mathematical Foundation

21.2.1 The fundamental problem of causal inference

For each unit \(i\), there are two potential outcomes:

  • \(Y_i(1)\): what would happen if unit \(i\) receives treatment
  • \(Y_i(0)\): what would happen if unit \(i\) does not receive treatment

The causal effect for unit \(i\) is \(\tau_i = Y_i(1) - Y_i(0)\).

The problem: We observe only one of the two potential outcomes. If \(D_i = 1\) (treated), we see \(Y_i(1)\) but not \(Y_i(0)\). If \(D_i = 0\) (control), we see \(Y_i(0)\) but not \(Y_i(1)\). The other potential outcome is counterfactual — it is what would have happened under the alternative treatment.

The Average Treatment Effect (ATE) averages over all units: \[ \tau = \mathbb{E}[Y(1) - Y(0)]. \]

We cannot compute this directly because we never observe both potential outcomes for the same unit. This is the fundamental problem.
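A tiny simulation makes the missing-data structure concrete. The numbers below are purely illustrative (not the chapter's later simulation): each unit has both potential outcomes by construction, but masking by \(D\) leaves one of them forever unobserved.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
Y0 = rng.normal(size=n).round(2)   # potential outcome under control
Y1 = (Y0 + 2.0).round(2)           # constant unit-level effect of 2
D = rng.binomial(1, 0.5, size=n)   # treatment assignment

Y = np.where(D == 1, Y1, Y0)       # observing Y masks the other outcome
for i in range(n):
    obs1 = Y1[i] if D[i] == 1 else "?"
    obs0 = Y0[i] if D[i] == 0 else "?"
    print(f"unit {i}: D={D[i]}  Y(1)={obs1}  Y(0)={obs0}  tau_i=?")

# Only the simulator knows both columns; here every tau_i is exactly 2,
# but no analyst looking at (D, Y) alone could compute any tau_i.
print("true ATE (simulation-only knowledge):", (Y1 - Y0).mean())
```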

21.2.2 Why naive comparison fails

The naive estimator \(\hat{\tau}_{\text{naive}} = \bar{Y}_{\text{treated}} - \bar{Y}_{\text{control}}\) is biased whenever treatment assignment depends on characteristics that also affect the outcome. For example:

  • People with worse health are more likely to take the drug → naive estimate is biased downward (treated group looks sicker)
  • Richer students are more likely to attend tutoring → naive estimate is biased upward (tutored students were already advantaged)

This is confounding: a variable that affects both treatment assignment and the outcome.

21.2.3 Identification under unconfoundedness

To estimate causal effects from observational data, we need three assumptions:

1. SUTVA (Stable Unit Treatment Value Assumption): One unit’s treatment doesn’t affect another unit’s outcome. (No “spillover” effects.)

2. Unconfoundedness (ignorability): Treatment assignment is independent of potential outcomes, conditional on covariates: \((Y(1), Y(0)) \perp D \mid \mathbf{X}\). This means: once we condition on \(\mathbf{X}\), the remaining variation in \(D\) is essentially random.

3. Positivity (overlap): Every unit has a positive probability of receiving either treatment status: \(0 < e(\mathbf{x}) < 1\) for all \(\mathbf{x}\) in the support. If some units always or never receive treatment, we cannot learn about the treatment effect for them.

Under these three assumptions, the ATE is identified: \[ \tau = \mathbb{E}\!\left[\frac{D_i Y_i}{e(\mathbf{X}_i)} - \frac{(1-D_i)Y_i}{1-e(\mathbf{X}_i)}\right] \tag{21.1}\]
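Equation 21.1 can be checked directly in simulation when the true propensity is known by construction — a sanity check on the identification result, not an estimator (all names below are local to this sketch):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
x = rng.standard_normal(n)
e = 1 / (1 + np.exp(-x))          # true propensity, known by construction
D = rng.binomial(1, e)
Y0 = x + rng.standard_normal(n)
Y1 = Y0 + 2.0                     # true ATE = 2
Y = np.where(D == 1, Y1, Y0)

# Sample analogue of Equation 21.1 with the TRUE propensity score
tau_hat = np.mean(D * Y / e - (1 - D) * Y / (1 - e))
print(f"IPW identity estimate: {tau_hat:.3f} (true: 2.0)")
```

With the true \(e(\mathbf{x})\) plugged in, the sample average converges to \(\tau\); estimation error in \(\hat{e}\) is the only thing an actual analysis adds on top of this identity.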

21.2.4 The propensity score

The propensity score \(e(\mathbf{x}) = \Pr(D=1 \mid \mathbf{X}=\mathbf{x})\) is the probability of receiving treatment given covariates. It is a scalar summary of all confounders.

The Rosenbaum-Rubin theorem (1983): If unconfoundedness holds given \(\mathbf{X}\), it also holds given \(e(\mathbf{X})\) alone. This is powerful: instead of adjusting for many covariates simultaneously (which is hard in high dimensions), you can adjust for a single number — the propensity score.

In practice, we estimate \(e(\mathbf{x})\) using logistic regression or other classifiers (sklearn’s LogisticRegression).
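One direct use of the theorem is subclassification: stratify units on quintiles of \(\hat{e}(\mathbf{x})\) and average the within-stratum mean differences. A minimal sketch on synthetic data (the design and all names here are illustrative, not the chapter's main simulation):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
n = 20_000
x = rng.standard_normal((n, 2))
e = 1 / (1 + np.exp(-0.8 * x[:, 0]))   # treatment depends on x[:, 0]
D = rng.binomial(1, e)
Y = 1 + x[:, 0] + 0.5 * x[:, 1] + 2.0 * D + rng.standard_normal(n)

e_hat = LogisticRegression().fit(x, D).predict_proba(x)[:, 1]

# Subclassify on propensity-score quintiles; average the within-stratum
# treated-minus-control differences, weighted by stratum size.
edges = np.quantile(e_hat, [0, 0.2, 0.4, 0.6, 0.8, 1.0])
stratum = np.clip(np.searchsorted(edges, e_hat, side="right") - 1, 0, 4)

tau_strat = 0.0
for s in range(5):
    m = stratum == s
    diff = Y[m & (D == 1)].mean() - Y[m & (D == 0)].mean()
    tau_strat += m.mean() * diff
print(f"Subclassification ATE: {tau_strat:.3f} (true: 2.0)")
```

Within a quintile, units share (approximately) the same propensity, so the simple difference in means is (approximately) unconfounded there — exactly what the Rosenbaum-Rubin theorem licenses. Five strata leave some residual within-stratum confounding, which is why the estimate is close to but not exactly 2.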

21.2.4.1 Proof of the Rosenbaum-Rubin theorem

The theorem is worth proving because the proof is the design principle: it shows exactly why propensity-score methods work and what they require.

Claim: If \((Y(1), Y(0)) \perp D \mid \mathbf{X}\), then \((Y(1), Y(0)) \perp D \mid e(\mathbf{X})\).

Proof. We need to show that \(\Pr(D=1 \mid Y(1), Y(0), e(\mathbf{X})) = \Pr(D=1 \mid e(\mathbf{X}))\). Start from the right-hand side. By the law of iterated expectations, conditioning on \(e(\mathbf{X})\):

\[ \Pr(D=1 \mid e(\mathbf{X})) = \mathbb{E}[\Pr(D=1 \mid \mathbf{X}) \mid e(\mathbf{X})] = \mathbb{E}[e(\mathbf{X}) \mid e(\mathbf{X})] = e(\mathbf{X}). \]

The first equality uses the tower property. The second uses the definition of the propensity score. The third holds because \(e(\mathbf{X})\) is trivially measurable with respect to itself.

Now for the left-hand side. Since unconfoundedness holds given \(\mathbf{X}\), and \(e(\mathbf{X})\) is a function of \(\mathbf{X}\):

\[ \Pr(D=1 \mid Y(1), Y(0), e(\mathbf{X})) = \mathbb{E}[\Pr(D=1 \mid Y(1), Y(0), \mathbf{X}) \mid Y(1), Y(0), e(\mathbf{X})] \]

By unconfoundedness, \(\Pr(D=1 \mid Y(1), Y(0), \mathbf{X}) = \Pr(D=1 \mid \mathbf{X}) = e(\mathbf{X})\). Substituting:

\[ = \mathbb{E}[e(\mathbf{X}) \mid Y(1), Y(0), e(\mathbf{X})] = e(\mathbf{X}). \]

Both sides equal \(e(\mathbf{X})\), so \(D\) is conditionally independent of the potential outcomes given the propensity score. \(\square\)

Why this matters for code: The proof tells you that after you estimate \(\hat{e}(\mathbf{x})\), you can stratify, match, or reweight on this single number instead of the full covariate vector. This is why IPW works — it reweights by \(1/\hat{e}\) rather than requiring nonparametric adjustment in the full \(\mathbf{X}\) space.

21.2.4.2 Positivity violations

Positivity requires \(0 < e(\mathbf{x}) < 1\) for all \(\mathbf{x}\). When this fails — some covariate strata have \(e(\mathbf{x}) \approx 0\) or \(\approx 1\) — the causal effect is not identified for those strata, and estimation becomes unstable.

Practical consequences:

  • IPW weights explode. If \(e(\mathbf{x}_i) = 0.01\), the weight \(1/e(\mathbf{x}_i) = 100\). A single observation with near-zero propensity can dominate the entire estimate.
  • Effective sample size drops. The ESS formula \((\sum w_i)^2 / \sum w_i^2\) captures how many independent observations the weighted sample is worth. Extreme weights mean few observations drive the estimate.
  • Extrapolation. If no treated units exist in some covariate region, the outcome model must extrapolate the counterfactual \(\mu_1(\mathbf{x})\) from distant data — pure model dependence.

Diagnostic: The propensity score histogram (shown in Section 21.6) reveals positivity violations as regions where one group’s distribution has mass but the other’s does not.

Practical responses:

  1. Trimming: Discard units with \(\hat{e}(\mathbf{x}) < c\) or \(> 1 - c\) (common choices: \(c = 0.01\) or \(c = 0.05\)). This changes the estimand from the population ATE to the ATE among the trimmed population.
  2. Stabilized weights: Replace \(1/e\) with \(e_{\text{marg}} / e\) where \(e_{\text{marg}} = \Pr(D=1)\) is the marginal treatment probability. This preserves consistency while reducing variance.
  3. AIPW: Less sensitive to extreme weights because the outcome model handles the regions where \(e(\mathbf{x})\) is extreme.
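The effect of trimming on the effective sample size is easy to see in a sketch with deliberately poor overlap (the steep logistic slope and the trimming threshold below are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5_000
x = rng.standard_normal(n)
e = 1 / (1 + np.exp(-3.0 * x))   # steep logistic -> many e near 0 or 1
D = rng.binomial(1, e)

w = np.where(D == 1, 1 / e, 1 / (1 - e))   # IPW weights

def ess(w):
    """Effective sample size: (sum w)^2 / sum w^2."""
    return w.sum() ** 2 / (w @ w)

print(f"ESS, untrimmed: {ess(w):.0f} / {n}")

# Trim units with extreme propensities (c = 0.05)
keep = (e > 0.05) & (e < 0.95)
print(f"ESS, trimmed:   {ess(w[keep]):.0f} / {keep.sum()}")
print(f"max weight untrimmed: {w.max():.0f}, trimmed: {w[keep].max():.1f}")
```

Trimming caps the largest weight at \(1/c\) and raises the ESS-to-sample-size ratio — at the cost of changing the estimand to the ATE on the retained (overlapping) population.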

21.2.5 The four estimators

21.2.5.1 Matching

The oldest approach: for each treated unit, find one or more control units with similar covariates and compare their outcomes. The ATE estimate averages the within-match differences.

Distance metrics:

  • Exact matching: Require \(\mathbf{X}_i = \mathbf{X}_j\). Only feasible with few discrete covariates.
  • Mahalanobis distance: \(d(\mathbf{x}_i, \mathbf{x}_j) = \sqrt{(\mathbf{x}_i - \mathbf{x}_j)^\top \mathbf{S}^{-1} (\mathbf{x}_i - \mathbf{x}_j)}\) where \(\mathbf{S}\) is the pooled sample covariance. Scale-invariant and accounts for correlations between covariates.
  • Propensity-score distance: \(d(\mathbf{x}_i, \mathbf{x}_j) = |e(\mathbf{x}_i) - e(\mathbf{x}_j)|\). Reduces matching to a one-dimensional problem by the Rosenbaum-Rubin theorem.
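The metrics can be compared on a toy pair of covariate vectors. Everything below is illustrative: the correlated sample exists only to supply a pooled covariance, and the logistic propensity model is hypothetical.

```python
import numpy as np

# Toy sample with correlated covariates, used only for the pooled covariance
rng = np.random.default_rng(4)
Xs = rng.standard_normal((500, 2)) @ np.array([[1.0, 0.6], [0.0, 0.8]])
S_inv = np.linalg.inv(np.cov(Xs, rowvar=False))

xi = np.array([0.5, 1.0])
xj = np.array([-0.2, 0.4])
delta = xi - xj

d_euclid = np.linalg.norm(delta)              # ignores scale and correlation
d_mahal = np.sqrt(delta @ S_inv @ delta)      # accounts for both

# Propensity-score distance collapses the comparison to one dimension
e = lambda v: 1 / (1 + np.exp(-(0.5 * v[0] - 0.3 * v[1])))  # hypothetical model
d_ps = abs(e(xi) - e(xj))

print(f"Euclidean:   {d_euclid:.3f}")
print(f"Mahalanobis: {d_mahal:.3f}")
print(f"Propensity:  {d_ps:.3f}")
```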

Limitation: Matching discards unmatched units, which changes the estimand (the population you’re averaging over). With replacement, the same control unit may be matched multiple times, inflating its influence. Matching quality degrades in high dimensions (the “curse of dimensionality”).

21.2.5.2 Regression adjustment

Fit separate outcome models for treated and control units: \(\hat{\mu}_1(\mathbf{x}) = \hat{\mathbb{E}}[Y \mid \mathbf{X}=\mathbf{x}, D=1]\) and \(\hat{\mu}_0(\mathbf{x}) = \hat{\mathbb{E}}[Y \mid \mathbf{X}=\mathbf{x}, D=0]\).

The ATE estimate is the average predicted difference: \(\hat{\tau}_{\text{reg}} = \frac{1}{n}\sum_i [\hat{\mu}_1(\mathbf{X}_i) - \hat{\mu}_0(\mathbf{X}_i)]\).

Limitation: If the outcome model is wrong, the estimate is biased.

21.2.5.3 Inverse Probability Weighting (IPW)

Reweight each observation by the inverse of its propensity score. Treated units with low \(e(\mathbf{x})\) (unlikely to be treated) get high weight; control units with high \(e(\mathbf{x})\) (likely to be treated) get high weight. This creates a pseudo-population where treatment is independent of covariates.

\(\hat{\tau}_{\text{IPW}} = \frac{1}{n}\sum_i \left[\frac{D_i Y_i}{e(\mathbf{X}_i)} - \frac{(1-D_i)Y_i}{1-e(\mathbf{X}_i)}\right]\)

Limitation: If the propensity model is wrong, or if some \(e(\mathbf{x})\) are near 0 or 1 (poor overlap), the weights become extreme and the estimate is unstable.

IPW asymptotic variance. The IPW estimator has influence function \(\psi_i^{\text{IPW}} = D_i Y_i / e(\mathbf{X}_i) - (1-D_i) Y_i / (1-e(\mathbf{X}_i)) - \tau\), so its asymptotic variance is:

\[ V_{\text{IPW}} = \mathbb{E}\!\left[ \frac{D_i Y_i^2}{e(\mathbf{X}_i)^2} + \frac{(1-D_i) Y_i^2}{(1-e(\mathbf{X}_i))^2} \right] - \tau^2 \]

This grows as \(e(\mathbf{x}) \to 0\) or \(e(\mathbf{x}) \to 1\), confirming the weight-instability intuition. In code, we estimate this via the sample variance of the individual IPW contributions, connecting to the M-estimation framework from Chapter 12.
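In code, that plug-in variance estimate is just the sample variance of the per-observation IPW contributions. A self-contained sketch on synthetic data (names local to this example; the true propensity is used for clarity):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 10_000
x = rng.standard_normal(n)
e = 1 / (1 + np.exp(-0.8 * x))
D = rng.binomial(1, e)
Y = 1 + x + 2.0 * D + rng.standard_normal(n)

# Per-observation IPW contribution (the influence function plus tau)
psi = D * Y / e - (1 - D) * Y / (1 - e)
tau_hat = psi.mean()
se_hat = psi.std(ddof=1) / np.sqrt(n)   # influence-function standard error

print(f"IPW ATE: {tau_hat:.3f} +/- {1.96 * se_hat:.3f}")
```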

Stabilized (Hajek) IPW. The normalized variant divides treated and control weighted means by their respective weight sums:

\(\hat{\tau}_{\text{Hajek}} = \frac{\sum_i D_i Y_i / e_i}{\sum_i D_i / e_i} - \frac{\sum_i (1-D_i) Y_i / (1-e_i)}{\sum_i (1-D_i) / (1-e_i)}\)

This is invariant to rescaling of the weights and to location shifts of the outcome; it typically has lower variance than the unnormalized Horvitz-Thompson estimator, especially when the outcome mean is far from zero. It is slightly biased in finite samples, but the bias is \(O(1/n)\).

21.2.5.4 AIPW (Augmented IPW): double robustness

AIPW combines both approaches:

\[ \hat{\tau}_{\text{AIPW}} = \frac{1}{n}\sum_i \left[\hat{\mu}_1(\mathbf{X}_i) - \hat{\mu}_0(\mathbf{X}_i) + \frac{D_i(Y_i - \hat{\mu}_1(\mathbf{X}_i))}{\hat{e}(\mathbf{X}_i)} - \frac{(1-D_i)(Y_i - \hat{\mu}_0(\mathbf{X}_i))}{1-\hat{e}(\mathbf{X}_i)}\right] \tag{21.2}\]

Double robustness: AIPW is consistent if either the outcome model or the propensity model is correct (but not necessarily both). This is the key advantage: you get two chances to get it right.

Intuition: The first line (\(\hat{\mu}_1 - \hat{\mu}_0\)) is the regression adjustment. The second and third lines are corrections that use the IPW weights to fix any bias in the regression. If the regression is perfect (\(\hat{\mu}_d = \mu_d\)), the corrections have mean zero and don’t hurt. If the regression is wrong but the propensity is right, the corrections fix the bias.

21.2.5.5 Proof of double robustness

This proof is worth giving in full because it reveals the mechanism behind the estimator’s resilience — and shows exactly what “double robustness” means mathematically.

Claim. Let \(\hat{\mu}_d(\mathbf{x})\) and \(\hat{e}(\mathbf{x})\) be working models for the outcome and propensity, respectively. The AIPW estimator is consistent for \(\tau\) if either (a) \(\hat{\mu}_d(\mathbf{x}) \xrightarrow{p} \mu_d(\mathbf{x})\) for \(d \in \{0, 1\}\), or (b) \(\hat{e}(\mathbf{x}) \xrightarrow{p} e(\mathbf{x})\).

Proof. Write the AIPW score for a single observation:

\[ \phi_i = \hat{\mu}_1(\mathbf{X}_i) - \hat{\mu}_0(\mathbf{X}_i) + \frac{D_i(Y_i - \hat{\mu}_1(\mathbf{X}_i))}{\hat{e}(\mathbf{X}_i)} - \frac{(1-D_i)(Y_i - \hat{\mu}_0(\mathbf{X}_i))}{1-\hat{e}(\mathbf{X}_i)} \]

Take expectations conditional on \(\mathbf{X}_i\), treating the working models as fixed (they converge to their probability limits):

\[ \mathbb{E}[\phi_i \mid \mathbf{X}_i] = \hat{\mu}_1(\mathbf{X}_i) - \hat{\mu}_0(\mathbf{X}_i) + \frac{e(\mathbf{X}_i)(\mu_1(\mathbf{X}_i) - \hat{\mu}_1(\mathbf{X}_i))}{\hat{e}(\mathbf{X}_i)} - \frac{(1-e(\mathbf{X}_i))(\mu_0(\mathbf{X}_i) - \hat{\mu}_0(\mathbf{X}_i))}{1-\hat{e}(\mathbf{X}_i)} \]

where we used \(\mathbb{E}[D_i \mid \mathbf{X}_i] = e(\mathbf{X}_i)\) and \(\mathbb{E}[Y_i \mid D_i=d, \mathbf{X}_i] = \mu_d(\mathbf{X}_i)\).

Case (a): outcome model correct (\(\hat{\mu}_d = \mu_d\)). The residuals \(\mu_d - \hat{\mu}_d = 0\), so the correction terms vanish regardless of \(\hat{e}\):

\[ \mathbb{E}[\phi_i \mid \mathbf{X}_i] = \mu_1(\mathbf{X}_i) - \mu_0(\mathbf{X}_i) = \tau(\mathbf{X}_i). \]

Taking the outer expectation gives \(\mathbb{E}[\phi_i] = \tau\).

Case (b): propensity model correct (\(\hat{e} = e\)). The correction terms become:

\[ (\mu_1(\mathbf{X}_i) - \hat{\mu}_1(\mathbf{X}_i)) - (\mu_0(\mathbf{X}_i) - \hat{\mu}_0(\mathbf{X}_i)) \]

which, combined with the regression terms, yields:

\[ \mathbb{E}[\phi_i \mid \mathbf{X}_i] = \mu_1(\mathbf{X}_i) - \mu_0(\mathbf{X}_i) = \tau(\mathbf{X}_i). \]

Again, \(\mathbb{E}[\phi_i] = \tau\). \(\square\)

The key insight is that the AIPW score decomposes into a regression component plus an IPW-based correction. The correction removes any first-order bias from the regression, but only needs one of the two models to be correct to do so. This is the same Neyman-orthogonality property exploited by DML in Chapter 23.
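The claim can be stress-tested in simulation: deliberately misspecify the outcome model (intercept-only group means, which reduce regression adjustment to the naive difference) while keeping the propensity model correct. AIPW should stay near the truth while regression adjustment drifts. A sketch with illustrative model choices:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(6)
n = 50_000
x = rng.standard_normal((n, 1))
e = 1 / (1 + np.exp(-1.2 * x[:, 0]))
D = rng.binomial(1, e)
Y = x[:, 0] + 2.0 * D + rng.standard_normal(n)

# Badly misspecified outcome models: intercept-only group means
mu1 = np.full(n, Y[D == 1].mean())
mu0 = np.full(n, Y[D == 0].mean())

# Correctly specified propensity model
e_hat = LogisticRegression().fit(x, D).predict_proba(x)[:, 1]
e_hat = np.clip(e_hat, 0.01, 0.99)

tau_reg = (mu1 - mu0).mean()   # equals the (biased) naive difference
tau_aipw = np.mean(
    mu1 - mu0
    + D * (Y - mu1) / e_hat
    - (1 - D) * (Y - mu0) / (1 - e_hat)
)
print(f"Regression (misspecified outcome): {tau_reg:.3f}")
print(f"AIPW (propensity correct):         {tau_aipw:.3f} (true: 2.0)")
```

This is Case (b) of the proof in action: the IPW corrections remove the first-order bias left by the broken outcome model.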

21.3 The Algorithm

Algorithm 21.1: AIPW Estimator

Input: Data \((\mathbf{X}, D, Y)\)

Output: \(\hat{\tau}\), SE, 95% CI

  1. Estimate propensity score: Fit logistic regression of \(D\) on \(\mathbf{X}\). Get \(\hat{e}(\mathbf{X}_i)\) for each observation. Clip to \([0.01, 0.99]\) to avoid division by zero.

  2. Estimate outcome models: Fit OLS of \(Y\) on \(\mathbf{X}\) separately for treated (\(D=1\)) and control (\(D=0\)) subsamples. Predict \(\hat{\mu}_1(\mathbf{X}_i)\) and \(\hat{\mu}_0(\mathbf{X}_i)\) for all observations (including counterfactual predictions).

  3. Compute AIPW scores: For each observation, compute the individual AIPW contribution using Equation 21.2.

  4. Estimate ATE: \(\hat{\tau} = \bar{\text{AIPW}}\) (sample mean of AIPW scores).

  5. Standard error: \(\hat{\text{se}} = \text{sd}(\text{AIPW}_i) / \sqrt{n}\). This uses the influence function representation of the estimator (connecting to M-estimation theory from Chapter 12).

Algorithm 21.2: Nearest-Neighbor Matching

Input: Data \((\mathbf{X}, D, Y)\), distance metric \(d\), number of matches \(M\)

Output: \(\hat{\tau}_{\text{match}}\)

  1. Compute pooled covariance \(\mathbf{S}\) from \(\mathbf{X}\).

  2. For each treated unit \(i\): Find the \(M\) nearest control units by \(d(\mathbf{X}_i, \mathbf{X}_j)\) where \(D_j = 0\). Compute \(\hat{Y}_i(0) = \frac{1}{M} \sum_{j \in \mathcal{M}(i)} Y_j\).

  3. For each control unit \(j\): Find the \(M\) nearest treated units by \(d(\mathbf{X}_j, \mathbf{X}_k)\) where \(D_k = 1\). Compute \(\hat{Y}_j(1) = \frac{1}{M} \sum_{k \in \mathcal{M}(j)} Y_k\).

  4. Estimate ATE: \(\hat{\tau} = \frac{1}{n} \sum_{i=1}^n [D_i(Y_i - \hat{Y}_i(0)) + (1-D_i)(\hat{Y}_i(1) - Y_i)]\)

21.4 Statistical Properties

Regression adjustment is consistent when the outcome model is correctly specified. It is the most efficient estimator when the model is right, but can be badly biased when the model is wrong.

IPW is consistent when the propensity model is correctly specified. Its variance depends on the variability of the weights: extreme weights (near-violations of positivity) inflate the variance.

AIPW is consistent when either model is correct. When both are correct, it achieves the semiparametric efficiency bound (Hahn, 1998) — it has the smallest possible variance among all regular estimators of the ATE under unconfoundedness.

The efficiency bound: \(\text{Var}(\hat{\tau}_{\text{eff}}) = \mathbb{E}\!\left[\frac{\text{Var}(Y(1) \mid \mathbf{X})}{e(\mathbf{X})} + \frac{\text{Var}(Y(0) \mid \mathbf{X})}{1 - e(\mathbf{X})} + (\tau(\mathbf{X}) - \tau)^2\right]\)

The first two terms reflect the cost of not observing both potential outcomes; the third reflects treatment effect heterogeneity.
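For the homoskedastic, constant-effect design used later in this chapter (unit error variance in both arms, constant \(\tau\)), the heterogeneity term vanishes and the bound can be evaluated by Monte Carlo. The design constants below mirror Section 21.5 but are re-declared locally so the block is self-contained:

```python
import numpy as np

rng = np.random.default_rng(7)
m = 1_000_000
x1 = rng.standard_normal(m)
x2 = rng.standard_normal(m)
e = 1 / (1 + np.exp(-(0.5 * x1 - 0.3 * x2)))

# Var(Y(d) | X) = 1 in both arms; tau(X) is constant, so the
# heterogeneity term (tau(X) - tau)^2 is zero
V_eff = np.mean(1 / e + 1 / (1 - e))

n = 1000
print(f"Efficiency bound V_eff: {V_eff:.2f}")
print(f"Implied SE at n={n}: {np.sqrt(V_eff / n):.3f}")
```

The implied standard error at \(n = 1000\) is close to the AIPW standard error reported in Section 21.5.4 — consistent with AIPW approaching the bound when both nuisance models are correct.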

21.5 Implementation

21.5.1 Simulating observational data

To understand these estimators, we simulate data where we control everything and know the true ATE. This lets us see exactly how and why each estimator succeeds or fails.

n = 1000

# Covariates: x1 and x2 each enter BOTH the treatment model and the
# outcome model below, so both confound (x1 is the stronger confounder)
x1 = rng.standard_normal(n)
x2 = rng.standard_normal(n)
X = np.column_stack([x1, x2])

# Treatment probability rises with x1 and falls with x2 (confounding!)
e_true = 1 / (1 + np.exp(-(0.5*x1 - 0.3*x2)))
D = rng.binomial(1, e_true)

# Potential outcomes: both depend on x1 and x2
# Y(0) = 1 + x1 + 0.5*x2 + noise
# Y(1) = Y(0) + 2  (constant treatment effect of 2)
tau_true = 2.0
Y0 = 1 + x1 + 0.5*x2 + rng.standard_normal(n)
Y1 = Y0 + tau_true

# Observed outcome: Y = D*Y(1) + (1-D)*Y(0)
Y = D*Y1 + (1-D)*Y0

print(f"True ATE: {tau_true}")
print(f"Treatment rate: {D.mean():.0%}")
print(f"Naive diff in means: {Y[D==1].mean() - Y[D==0].mean():.3f}")
print(f"(Biased because x1 confounds: treated units have higher x1)")
True ATE: 2.0
Treatment rate: 51%
Naive diff in means: 2.233
(Biased because x1 confounds: treated units have higher x1)

The naive estimate is biased because \(x_1\) affects both who gets treated and the outcome. People with higher \(x_1\) are both more likely to be treated and have higher \(Y(0)\). The naive comparison picks up both the treatment effect and the selection effect.
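In simulation the bias can be decomposed exactly, because both potential outcomes are known: with a constant effect, naive \(= \tau + \big(\mathbb{E}[Y(0) \mid D=1] - \mathbb{E}[Y(0) \mid D=0]\big)\), where the second term is the selection bias. Re-declaring the design locally so the block is self-contained:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1000
x1 = rng.standard_normal(n)
x2 = rng.standard_normal(n)
e_true = 1 / (1 + np.exp(-(0.5 * x1 - 0.3 * x2)))
D = rng.binomial(1, e_true)
Y0 = 1 + x1 + 0.5 * x2 + rng.standard_normal(n)
Y1 = Y0 + 2.0
Y = D * Y1 + (1 - D) * Y0

naive = Y[D == 1].mean() - Y[D == 0].mean()
# Selection bias: visible only in simulation, where Y0 is known for everyone
selection = Y0[D == 1].mean() - Y0[D == 0].mean()
print(f"naive = {naive:.3f} = tau (2.000) + selection bias ({selection:.3f})")
```

The identity is exact here because \(Y(1) = Y(0) + 2\) for every unit; with heterogeneous effects the decomposition also picks up the gap between the ATT and the ATE.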

21.5.2 Step-by-step: regression adjustment

# Step 1: Fit separate outcome models for treated and control
X_sm = sm.add_constant(X)

# Model for treated units: E[Y | X, D=1]
ols_treated = sm.OLS(Y[D==1], X_sm[D==1]).fit()
# Model for control units: E[Y | X, D=0]
ols_control = sm.OLS(Y[D==0], X_sm[D==0]).fit()

# Step 2: Predict BOTH potential outcomes for ALL observations
# mu1_hat[i] = predicted Y(1) for unit i (even if they were control)
# mu0_hat[i] = predicted Y(0) for unit i (even if they were treated)
mu1_hat = ols_treated.predict(X_sm)  # counterfactual for controls
mu0_hat = ols_control.predict(X_sm)  # counterfactual for treated

# Step 3: Average the predicted treatment effects
ate_reg = (mu1_hat - mu0_hat).mean()
print(f"Regression adjustment ATE: {ate_reg:.3f} (true: {tau_true})")
Regression adjustment ATE: 1.969 (true: 2.0)

What happened: We used the outcome model to impute the missing potential outcomes. For a treated unit, we observe \(Y(1)\) and predict \(Y(0)\). For a control unit, we observe \(Y(0)\) and predict \(Y(1)\). The estimated ATE is the average of these predicted differences.

21.5.3 Step-by-step: IPW

# Step 1: Estimate the propensity score
# Use logistic regression: P(D=1 | X) = logistic(X @ beta)
lr = LogisticRegression(C=1e10, max_iter=1000)
lr.fit(X, D)
e_hat = lr.predict_proba(X)[:, 1]  # predicted P(D=1 | X)

# Step 2: Clip extreme probabilities to avoid huge weights
# Without clipping, 1/e can be enormous for small e
e_hat = np.clip(e_hat, 0.01, 0.99)

# Step 3: Compute IPW estimate
# Treated: weight by 1/e (upweight treated units unlikely to be treated)
# Control: weight by 1/(1-e) (upweight controls likely to be treated)
ate_ipw = np.mean(D*Y/e_hat - (1-D)*Y/(1-e_hat))
print(f"IPW ATE: {ate_ipw:.3f} (true: {tau_true})")

# Check the weights
weights = np.where(D, 1/e_hat, 1/(1-e_hat))
print(f"Weight range: [{weights.min():.1f}, {weights.max():.1f}]")
print(f"Effective sample size: {(weights.sum())**2 / (weights**2).sum():.0f} / {n}")
IPW ATE: 1.988 (true: 2.0)
Weight range: [1.1, 6.0]
Effective sample size: 902 / 1000

What happened: We reweighted the sample so that treatment assignment is “as if random.” The weights balance the covariates between treated and control groups.

21.5.4 Step-by-step: AIPW

# Combine regression adjustment and IPW
# Each observation gets an AIPW "score" — the individual-level
# contribution to the ATE estimate

aipw_scores = (
    # Regression adjustment part:
    (mu1_hat - mu0_hat)
    # + IPW correction for treated:
    + D * (Y - mu1_hat) / e_hat
    # - IPW correction for control:
    - (1-D) * (Y - mu0_hat) / (1-e_hat)
)

ate_aipw = aipw_scores.mean()
se_aipw = aipw_scores.std() / np.sqrt(n)
ci_lo = ate_aipw - 1.96 * se_aipw
ci_hi = ate_aipw + 1.96 * se_aipw

print(f"AIPW ATE: {ate_aipw:.3f} (SE: {se_aipw:.3f})")
print(f"95% CI: ({ci_lo:.3f}, {ci_hi:.3f})")
print(f"True ATE: {tau_true}{' — in CI ✓' if ci_lo < tau_true < ci_hi else ' — NOT in CI ✗'}")
AIPW ATE: 1.967 (SE: 0.066)
95% CI: (1.838, 2.096)
True ATE: 2.0 — in CI ✓

What happened: AIPW starts with the regression adjustment estimate, then adds IPW-based corrections. The corrections account for any residual bias from the outcome model. If the outcome model is perfect, the corrections are noise around zero. If the outcome model is wrong but the propensity is right, the corrections fix the bias.

21.5.5 Step-by-step: stabilized (Hajek) IPW

# Stabilized weights: divide by the sum within each group
# This is the Hajek estimator — lower variance than HT
w_treated = 1 / e_hat[D == 1]
w_control = 1 / (1 - e_hat[D == 0])

# Weighted means with normalized weights
mu1_hajek = (
    np.sum(Y[D == 1] * w_treated) / np.sum(w_treated)
)
mu0_hajek = (
    np.sum(Y[D == 0] * w_control) / np.sum(w_control)
)
ate_hajek = mu1_hajek - mu0_hajek

# Compare with unnormalized IPW
print(f"Horvitz-Thompson IPW: {ate_ipw:.3f}")
print(f"Hajek (stabilized):   {ate_hajek:.3f}")
print(f"True ATE:             {tau_true}")
print(f"\nHajek is often closer — normalization reduces "
      f"sensitivity to extreme weights.")
Horvitz-Thompson IPW: 1.988
Hajek (stabilized):   1.964
True ATE:             2.0

Hajek is often closer — normalization reduces sensitivity to extreme weights.

What happened: The Hajek estimator normalizes by the sum of weights within each group. This is equivalent to using stabilized weights \(w_i^{\text{stab}} = D_i \cdot \bar{D} / e_i + (1-D_i) \cdot (1 - \bar{D}) / (1-e_i)\) where \(\bar{D}\) is the marginal treatment rate.
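That equivalence is easy to verify numerically: after within-group normalization, the marginal rate in the stabilized weights cancels, so both forms give the same point estimate. A sketch on synthetic data (all names local to this example):

```python
import numpy as np

rng = np.random.default_rng(8)
n = 5_000
x = rng.standard_normal(n)
e = 1 / (1 + np.exp(-x))
D = rng.binomial(1, e)
Y = x + 2.0 * D + rng.standard_normal(n)

# Hajek: normalize plain inverse-probability weights within each group
w1, w0 = 1 / e[D == 1], 1 / (1 - e[D == 0])
tau_hajek = (Y[D == 1] * w1).sum() / w1.sum() \
    - (Y[D == 0] * w0).sum() / w0.sum()

# Stabilized weights: scale by the marginal treatment rate, then normalize
p = D.mean()
s1, s0 = p / e[D == 1], (1 - p) / (1 - e[D == 0])
tau_stab = (Y[D == 1] * s1).sum() / s1.sum() \
    - (Y[D == 0] * s0).sum() / s0.sum()

print(f"Hajek:      {tau_hajek:.6f}")
print(f"Stabilized: {tau_stab:.6f}  (identical after normalization)")
```

The constants \(p\) and \(1 - p\) cancel in the ratios, so the two estimators differ only by floating-point rounding; stabilization matters for the unnormalized (Horvitz-Thompson) form, where it tames the weight scale.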

21.5.6 Step-by-step: matching with Mahalanobis distance

def match_ate(X, D, Y, n_matches=1):
    """Nearest-neighbor matching using Mahalanobis distance.

    Implements Algorithm 21.2. For each unit, finds
    the closest unit(s) in the opposite treatment group
    and imputes the missing potential outcome.

    Parameters
    ----------
    X : ndarray of shape (n, p)
        Covariates.
    D : ndarray of shape (n,)
        Treatment indicator (0 or 1).
    Y : ndarray of shape (n,)
        Observed outcomes.
    n_matches : int
        Number of nearest neighbors to match.

    Returns
    -------
    ate : float
        Estimated ATE.
    matched_diffs : ndarray
        Per-unit treatment effect estimates.
    """
    n = len(Y)
    # Pooled covariance for Mahalanobis distance
    S = np.cov(X, rowvar=False)
    S_inv = np.linalg.inv(S)

    treated_idx = np.where(D == 1)[0]
    control_idx = np.where(D == 0)[0]

    matched_diffs = np.zeros(n)

    for i in range(n):
        if D[i] == 1:
            # Find nearest control(s)
            pool = control_idx
        else:
            # Find nearest treated unit(s)
            pool = treated_idx

        # Mahalanobis distance to all units in pool
        diff = X[pool] - X[i]
        dists = np.sqrt(
            np.sum(diff @ S_inv * diff, axis=1)
        )

        # Indices of n_matches nearest neighbors
        nn = pool[np.argsort(dists)[:n_matches]]
        y_imputed = Y[nn].mean()

        if D[i] == 1:
            # Observed Y(1), imputed Y(0)
            matched_diffs[i] = Y[i] - y_imputed
        else:
            # Imputed Y(1), observed Y(0)
            matched_diffs[i] = y_imputed - Y[i]

    return matched_diffs.mean(), matched_diffs


ate_match, match_diffs = match_ate(X, D, Y, n_matches=1)
print(f"Matching ATE (M=1): {ate_match:.3f} "
      f"(true: {tau_true})")

# Try M=3 for smoother estimate
ate_match3, _ = match_ate(X, D, Y, n_matches=3)
print(f"Matching ATE (M=3): {ate_match3:.3f}")
Matching ATE (M=1): 2.049 (true: 2.0)
Matching ATE (M=3): 2.001

What happened: For each unit, we found the closest unit in the other treatment group (by Mahalanobis distance) and used its outcome as the counterfactual. The \(O(n^2)\) pairwise-distance loop is slow but transparent — production implementations use \(k\)-d trees. The single-match estimator is noisier than \(M=3\) because each counterfactual depends on one neighbor.

21.5.7 Comparison of all estimators

naive_est = Y[D == 1].mean() - Y[D == 0].mean()
se_match = match_diffs.std() / np.sqrt(n)

print(f"{'Method':<20} {'ATE':>8} {'SE':>8} {'Bias':>8}")
print("-" * 48)
print(f"{'True':<20} {tau_true:>8.3f}")
print(f"{'Naive':<20} {naive_est:>8.3f} "
      f"{'':>8} {naive_est - tau_true:>8.3f}")
print(f"{'Regression':<20} {ate_reg:>8.3f} "
      f"{'':>8} {ate_reg - tau_true:>8.3f}")
print(f"{'Matching (M=1)':<20} {ate_match:>8.3f} "
      f"{se_match:>8.3f} {ate_match - tau_true:>8.3f}")
print(f"{'IPW (HT)':<20} {ate_ipw:>8.3f} "
      f"{'':>8} {ate_ipw - tau_true:>8.3f}")
print(f"{'IPW (Hajek)':<20} {ate_hajek:>8.3f} "
      f"{'':>8} {ate_hajek - tau_true:>8.3f}")
print(f"{'AIPW':<20} {ate_aipw:>8.3f} "
      f"{se_aipw:>8.3f} {ate_aipw - tau_true:>8.3f}")
Method                    ATE       SE     Bias
------------------------------------------------
True                    2.000
Naive                   2.233             0.233
Regression              1.969            -0.031
Matching (M=1)          2.049    0.043    0.049
IPW (HT)                1.988            -0.012
IPW (Hajek)             1.964            -0.036
AIPW                    1.967    0.066   -0.033

21.6 Diagnostics

21.6.1 Balance diagnostics

The most important diagnostic: after weighting or matching, are the treated and control groups balanced on covariates? The standardized mean difference (SMD) measures this.

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Propensity score overlap
ax1.hist(e_hat[D==1], bins=30, alpha=0.5, density=True,
         label="Treated", color="C0")
ax1.hist(e_hat[D==0], bins=30, alpha=0.5, density=True,
         label="Control", color="C1")
ax1.set_xlabel("Propensity score")
ax1.set_ylabel("Density")
ax1.legend()
ax1.set_title("Overlap check")

# SMD before and after weighting
smd_raw = []
smd_weighted = []
for j in range(2):
    # Raw SMD
    smd = (X[D==1, j].mean() - X[D==0, j].mean()) / X[:, j].std()
    smd_raw.append(abs(smd))
    # Weighted SMD
    w1 = 1/e_hat[D==1]; w0 = 1/(1-e_hat[D==0])
    wm1 = (X[D==1, j] * w1).sum() / w1.sum()
    wm0 = (X[D==0, j] * w0).sum() / w0.sum()
    smd_weighted.append(abs((wm1 - wm0) / X[:, j].std()))

x_pos = np.arange(2)
ax2.barh(x_pos - 0.15, smd_raw, 0.3, label="Before weighting", color="C0")
ax2.barh(x_pos + 0.15, smd_weighted, 0.3, label="After IPW", color="C1")
ax2.set_yticks(x_pos)
ax2.set_yticklabels(['x1 (confounder)', 'x2'])
ax2.set_xlabel("Absolute SMD")
ax2.axvline(0.1, color="gray", linestyle="--", alpha=0.5, label="SMD < 0.1 threshold")
ax2.legend(fontsize=8)
ax2.set_title("Covariate balance")

plt.tight_layout()
plt.show()
Figure 21.1: Left: Propensity score distributions by treatment group. Good overlap (both distributions cover the same range) is necessary. Right: Standardized mean differences before and after IPW weighting — weighting reduces the imbalance.

Rule of thumb: After weighting, all SMDs should be < 0.1. If any covariate remains imbalanced, the treatment effect estimate may still be confounded.

21.6.2 Sensitivity analysis: Rosenbaum bounds

Unconfoundedness is untestable — it assumes no unobserved confounders. Sensitivity analysis asks: “How strong would an unobserved confounder need to be to explain away the estimated effect?” If only an implausibly strong confounder could do it, the result is robust.

Rosenbaum bounds formalize this. Under unconfoundedness, any two units with the same covariates have the same odds of treatment. Rosenbaum’s framework allows an unobserved confounder \(U\) that shifts the treatment odds by a factor of at most \(\Gamma\):

\[ \frac{1}{\Gamma} \leq \frac{\Pr(D_i=1 \mid \mathbf{X}, U) / \Pr(D_i=0 \mid \mathbf{X}, U)} {\Pr(D_j=1 \mid \mathbf{X}, U) / \Pr(D_j=0 \mid \mathbf{X}, U)} \leq \Gamma \]

for matched pairs \((i, j)\) with the same \(\mathbf{X}\). When \(\Gamma = 1\), there is no unobserved confounding. As \(\Gamma\) increases, we allow stronger confounders. For each \(\Gamma\), we compute bounds on the \(p\)-value: if the treatment effect remains significant even at large \(\Gamma\), the result is robust.

# Simplified Rosenbaum bounds for matched-pair data
# For each matched pair, compute the signed rank statistic
# under varying degrees of hidden bias (Gamma)

def rosenbaum_bounds(diffs, gammas):
    """Compute Rosenbaum sensitivity bounds.

    For a set of matched-pair differences, computes
    the worst-case p-value under hidden bias of
    strength Gamma.

    Parameters
    ----------
    diffs : ndarray
        Matched-pair outcome differences (treated -
        control).
    gammas : array-like
        Values of Gamma to evaluate.

    Returns
    -------
    results : list of (gamma, p_upper)
        Worst-case p-values for each Gamma.
    """
    abs_diffs = np.abs(diffs)
    # Rank the absolute differences
    ranks = stats.rankdata(abs_diffs)
    signs = np.sign(diffs)

    results = []
    for gamma in gammas:
        # Under hidden bias Gamma, the probability that
        # the treated unit in pair i has the larger
        # outcome ranges from 1/(1+Gamma) to Gamma/(1+Gamma)
        p_plus = gamma / (1 + gamma)

        # The signed-rank statistic T+ = sum of ranks
        # where diff > 0
        t_plus = np.sum(ranks[signs > 0])

        # Under the worst case, each positive pair has
        # probability p_plus of being positive
        n_pairs = len(diffs)
        # Expected value and variance of T+ under
        # worst-case assignment
        e_t = p_plus * n_pairs * (n_pairs + 1) / 2
        var_t = (
            p_plus * (1 - p_plus)
            * n_pairs * (n_pairs + 1)
            * (2 * n_pairs + 1) / 6
        )

        # Normal approximation for the upper p-value
        if var_t > 0:
            z = (t_plus - e_t) / np.sqrt(var_t)
            p_val = stats.norm.sf(z)  # survival function: 1 - cdf
        else:
            p_val = 0.5

        results.append((gamma, p_val))

    return results


# Use matched-pair differences from our matching estimator
# Get matched pairs for treated units
S_cov = np.cov(X, rowvar=False)
S_inv = np.linalg.inv(S_cov)
treated_idx = np.where(D == 1)[0]
control_idx = np.where(D == 0)[0]

pair_diffs = []
for i in treated_idx:
    diff = X[control_idx] - X[i]
    dists = np.sqrt(
        np.sum(diff @ S_inv * diff, axis=1)
    )
    nn = control_idx[np.argmin(dists)]
    pair_diffs.append(Y[i] - Y[nn])

pair_diffs = np.array(pair_diffs)

gammas = [1.0, 1.2, 1.5, 2.0, 2.5, 3.0]
bounds = rosenbaum_bounds(pair_diffs, gammas)

print("Rosenbaum sensitivity analysis")
print(f"{'Gamma':>6} {'p-value (upper)':>16}")
print("-" * 24)
for gamma, p in bounds:
    sig = "***" if p < 0.001 else (
        "**" if p < 0.01 else (
        "*" if p < 0.05 else ""))
    print(f"{gamma:>6.1f} {p:>16.4f} {sig}")
print("\nResult robust up to Gamma where p > 0.05.")
print("Higher Gamma = more robust to hidden bias.")
Rosenbaum sensitivity analysis
 Gamma  p-value (upper)
------------------------
   1.0           0.0000 ***
   1.2           0.0000 ***
   1.5           0.0000 ***
   2.0           0.0000 ***
   2.5           0.0000 ***
   3.0           0.0000 ***

Result robust up to Gamma where p > 0.05.
Higher Gamma = more robust to hidden bias.

The table tells you: even if an unobserved confounder tripled the odds of treatment (\(\Gamma = 3\)), the effect would still be statistically significant. This is a strong result: you would need an implausibly powerful hidden confounder to explain away the effect.

A complementary check: how far does the estimate move when we drop the covariate adjustment entirely? The gap between the adjusted and unadjusted estimates benchmarks how much a hidden confounder comparable in strength to the observed covariates could shift the result.

# Omission-based sensitivity: drop the covariate adjustment
# entirely and see how far the estimate moves
ate_unadj = Y[D == 1].mean() - Y[D == 0].mean()
print(f"ATE, fully adjusted (AIPW): {ate_aipw:.3f}")
print(f"ATE, unadjusted (naive):    {ate_unadj:.3f}")
print(f"Bias removed by adjustment: "
      f"{ate_unadj - ate_aipw:.3f}")
print("\nA hidden confounder as strong as the observed")
print("covariates would shift the estimate by about "
      f"{abs(ate_unadj - ate_aipw):.3f}.")
ATE, fully adjusted (AIPW): 1.967
ATE, unadjusted (naive):    2.233
Bias removed by adjustment: 0.267

A hidden confounder as strong as the observed
covariates would shift the estimate by about 0.267.

21.7 Computational Considerations

Method Cost Memory Notes
Regression \(O(np^2)\) × 2 (two OLS) \(O(np)\) Fast, efficient if model right
Matching \(O(n^2 p)\) brute-force \(O(n^2)\) dist. matrix Slow; \(k\)-d tree gives \(O(n \log n)\)
IPW \(O(np)\) (one logistic) \(O(n)\) weights Simple, unstable with extreme weights
AIPW \(O(np^2)\) × 2 + \(O(np)\) \(O(n)\) scores Best overall: doubly robust + SE from scores

All four estimators are fast for typical observational studies (\(n < 10^5\), \(p < 100\)). For larger problems, consider doubly robust estimators with machine learning nuisance models (Chapter 23, DML).
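The k-d tree speed-up mentioned in the table can be sketched with scipy.spatial.cKDTree. This is an illustrative addition on simulated stand-in data, not the chapter's match_ate implementation: whitening the covariates with the Cholesky factor of their covariance makes Euclidean distance in the transformed space equal Mahalanobis distance, so one vectorized tree query replaces the \(O(n^2)\) distance loop.

```python
import numpy as np
from scipy.spatial import cKDTree

rng_m = np.random.default_rng(0)
X_m = rng_m.standard_normal((5000, 3))   # stand-in covariates
D_m = rng_m.binomial(1, 0.3, 5000)       # stand-in treatment

# Whiten: with S = L L^T (Cholesky), ||L^{-1}(x_i - x_j)|| equals
# the Mahalanobis distance between x_i and x_j
L_chol = np.linalg.cholesky(np.cov(X_m, rowvar=False))
Xw_m = np.linalg.solve(L_chol, X_m.T).T

tree = cKDTree(Xw_m[D_m == 0])                # index control units once
dist_m, nn = tree.query(Xw_m[D_m == 1], k=1)  # 1-NN per treated unit
matched_controls = np.where(D_m == 0)[0][nn]
```

Building the tree costs \(O(n \log n)\) and each query roughly \(O(\log n)\) in low dimensions; for \(M\)-nearest-neighbor matching, pass `k=M` to `query`.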

21.8 Worked Example

Estimating the effect of job training on earnings.

# Simulate a job training program
n_w = 2000

# Covariates
education = rng.normal(12, 3, n_w).clip(6, 20)  # years of education
experience = rng.exponential(10, n_w).clip(0, 40)
prior_earnings = rng.lognormal(10, 0.5, n_w)

X_w = np.column_stack([
    (education - 12) / 3,
    (experience - 10) / 10,
    np.log(prior_earnings / 50000),
])

# Selection into training: higher education → more likely to enroll
e_w = 1 / (1 + np.exp(-(0.5*X_w[:, 0] - 0.3*X_w[:, 2])))
D_w = rng.binomial(1, e_w)

# Outcome: earnings after training
# True effect: training increases earnings by $5,000
tau_w = 5000
Y_w = (40000 + 3000*X_w[:, 0] + 2000*X_w[:, 1]
       + 5000*X_w[:, 2] + tau_w * D_w + 8000*rng.standard_normal(n_w))

# Naive estimate (biased by selection)
naive_w = Y_w[D_w==1].mean() - Y_w[D_w==0].mean()

# AIPW estimate
X_w_sm = sm.add_constant(X_w)
m1 = sm.OLS(Y_w[D_w==1], X_w_sm[D_w==1]).fit()
m0 = sm.OLS(Y_w[D_w==0], X_w_sm[D_w==0]).fit()
lr_w = LogisticRegression(C=1e10, max_iter=1000).fit(X_w, D_w)
e_w_hat = np.clip(lr_w.predict_proba(X_w)[:, 1], 0.01, 0.99)

mu1_w = m1.predict(X_w_sm)
mu0_w = m0.predict(X_w_sm)
aipw_w = ((mu1_w - mu0_w) + D_w*(Y_w - mu1_w)/e_w_hat
           - (1-D_w)*(Y_w - mu0_w)/(1-e_w_hat))
ate_w = aipw_w.mean()
se_w = aipw_w.std() / np.sqrt(n_w)

print(f"True training effect:  ${tau_w:,.0f}")
print(f"Naive estimate:        ${naive_w:,.0f}")
print(f"AIPW estimate:         ${ate_w:,.0f} (SE: ${se_w:,.0f})")
print(f"95% CI: (${ate_w - 1.96*se_w:,.0f}, "
      f"${ate_w + 1.96*se_w:,.0f})")
print(f"\nNaive is biased upward because educated workers")
print(f"self-select into training AND earn more.")
True training effect:  $5,000
Naive estimate:        $5,945
AIPW estimate:         $4,876 (SE: $376)
95% CI: ($4,140, $5,613)

Naive is biased upward because educated workers
self-select into training AND earn more.

Now run the full set of estimators (naive, regression adjustment, matching, both IPW variants, and AIPW) on the same data and compare.

# --- Regression adjustment ---
mu1_w_hat = m1.predict(X_w_sm)
mu0_w_hat = m0.predict(X_w_sm)
ate_reg_w = (mu1_w_hat - mu0_w_hat).mean()

# --- IPW (Horvitz-Thompson) ---
ate_ipw_w = np.mean(
    D_w * Y_w / e_w_hat
    - (1 - D_w) * Y_w / (1 - e_w_hat)
)

# --- Hajek (stabilized) IPW ---
w1_w = 1 / e_w_hat[D_w == 1]
w0_w = 1 / (1 - e_w_hat[D_w == 0])
mu1_hajek_w = (
    np.sum(Y_w[D_w == 1] * w1_w) / np.sum(w1_w)
)
mu0_hajek_w = (
    np.sum(Y_w[D_w == 0] * w0_w) / np.sum(w0_w)
)
ate_hajek_w = mu1_hajek_w - mu0_hajek_w

# --- Matching ---
ate_match_w, _ = match_ate(X_w, D_w, Y_w, n_matches=3)

# --- Comparison ---
naive_w = Y_w[D_w == 1].mean() - Y_w[D_w == 0].mean()
print(f"{'Method':<20} {'ATE':>10} {'Bias':>10}")
print("-" * 42)
print(f"{'True':<20} {tau_w:>10,.0f}")
print(f"{'Naive':<20} {naive_w:>10,.0f} "
      f"{naive_w - tau_w:>10,.0f}")
print(f"{'Regression':<20} {ate_reg_w:>10,.0f} "
      f"{ate_reg_w - tau_w:>10,.0f}")
print(f"{'Matching (M=3)':<20} {ate_match_w:>10,.0f} "
      f"{ate_match_w - tau_w:>10,.0f}")
print(f"{'IPW (HT)':<20} {ate_ipw_w:>10,.0f} "
      f"{ate_ipw_w - tau_w:>10,.0f}")
print(f"{'IPW (Hajek)':<20} {ate_hajek_w:>10,.0f} "
      f"{ate_hajek_w - tau_w:>10,.0f}")
print(f"{'AIPW':<20} {ate_w:>10,.0f} "
      f"{ate_w - tau_w:>10,.0f}")
Method                      ATE       Bias
------------------------------------------
True                      5,000
Naive                     5,945        945
Regression                4,896       -104
Matching (M=3)            5,111        111
IPW (HT)                  5,039         39
IPW (Hajek)               4,934        -66
AIPW                      4,876       -124

21.8.1 Diagnostic suite for the worked example

The diagnostic suite answers three questions: (1) is the propensity model adequate? (2) do the treated and control groups overlap? (3) do the weights or matching produce covariate balance?

fig, axes = plt.subplots(2, 2, figsize=(10, 8))

# --- Panel 1: Propensity score overlap ---
ax = axes[0, 0]
ax.hist(
    e_w_hat[D_w == 1], bins=30, alpha=0.5,
    density=True, label="Treated", color="C0"
)
ax.hist(
    e_w_hat[D_w == 0], bins=30, alpha=0.5,
    density=True, label="Control", color="C1"
)
ax.set_xlabel("Propensity score")
ax.set_ylabel("Density")
ax.legend(fontsize=8)
ax.set_title("Overlap check")

# --- Panel 2: Covariate balance (love plot) ---
ax = axes[0, 1]
cov_names = ["Education", "Experience", "Log earnings"]
smd_raw_w = []
smd_ipw_w = []
for j in range(3):
    # Raw SMD (denominator: overall SD, a simple stand-in
    # for the usual pooled within-group SD)
    pool_sd = X_w[:, j].std()
    raw = (
        X_w[D_w == 1, j].mean()
        - X_w[D_w == 0, j].mean()
    ) / pool_sd
    smd_raw_w.append(abs(raw))

    # Weighted SMD
    w1 = 1 / e_w_hat[D_w == 1]
    w0 = 1 / (1 - e_w_hat[D_w == 0])
    wm1 = (
        (X_w[D_w == 1, j] * w1).sum() / w1.sum()
    )
    wm0 = (
        (X_w[D_w == 0, j] * w0).sum() / w0.sum()
    )
    smd_ipw_w.append(abs((wm1 - wm0) / pool_sd))

y_pos = np.arange(3)
ax.barh(
    y_pos - 0.15, smd_raw_w, 0.3,
    label="Before IPW", color="C0"
)
ax.barh(
    y_pos + 0.15, smd_ipw_w, 0.3,
    label="After IPW", color="C1"
)
ax.set_yticks(y_pos)
ax.set_yticklabels(cov_names)
ax.set_xlabel("Absolute SMD")
ax.axvline(
    0.1, color="gray", linestyle="--", alpha=0.5,
    label="< 0.1 threshold"
)
ax.legend(fontsize=7)
ax.set_title("Covariate balance")

# --- Panel 3: Weight distribution ---
ax = axes[1, 0]
ipw_weights = np.where(
    D_w, 1 / e_w_hat, 1 / (1 - e_w_hat)
)
ax.hist(
    ipw_weights, bins=50, color="C2",
    edgecolor="white", linewidth=0.5
)
ax.set_xlabel("IPW weight")
ax.set_ylabel("Count")
ax.set_title("Weight distribution")
ess = (
    ipw_weights.sum() ** 2
    / (ipw_weights ** 2).sum()
)
ax.text(
    0.95, 0.95,
    f"ESS: {ess:.0f} / {n_w}",
    transform=ax.transAxes,
    ha="right", va="top", fontsize=9
)

# --- Panel 4: Propensity calibration ---
ax = axes[1, 1]
# Bin propensity scores, compare predicted vs actual
n_bins = 10
bins = np.linspace(0, 1, n_bins + 1)
bin_centers = []
bin_means = []
for b in range(n_bins):
    mask = (
        (e_w_hat >= bins[b])
        & (e_w_hat < bins[b + 1])
    )
    if mask.sum() > 0:
        bin_centers.append(
            (bins[b] + bins[b + 1]) / 2
        )
        bin_means.append(D_w[mask].mean())

ax.plot(
    bin_centers, bin_means, "o-",
    color="C0", label="Observed"
)
ax.plot([0, 1], [0, 1], "--", color="gray",
        label="Perfect calibration")
ax.set_xlabel("Predicted P(D=1)")
ax.set_ylabel("Observed P(D=1)")
ax.legend(fontsize=8)
ax.set_title("Propensity calibration")

plt.tight_layout()
plt.show()
Figure 21.2: Full diagnostic suite for the job training example. Top-left: propensity score distributions — good overlap is essential. Top-right: covariate balance before and after IPW. Bottom-left: IPW weight distribution — long tails signal positivity concerns. Bottom-right: propensity model calibration.

Reading the diagnostics:

  • Overlap (top-left): If the two histograms don’t overlap, some units have no comparable counterparts in the other group — positivity is violated and no estimator can fix this.
  • Balance (top-right): After IPW, all SMDs should be below 0.1. Remaining imbalance means confounding may persist.
  • Weights (bottom-left): A long right tail signals near-violations of positivity. The ESS tells you the effective information content — if ESS is much smaller than \(n\), the estimate is driven by a few heavily weighted units.
  • Calibration (bottom-right): If the propensity model is well calibrated, points follow the diagonal. Systematic deviation means the propensity model is misspecified — IPW will be biased.
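When the weight histogram shows a long right tail, one common remedy (an addition here, sketched on simulated stand-in scores rather than the worked example's variables) is to trim units with extreme estimated propensities; the rule of thumb from Crump et al. (2009) keeps units with \(\hat{e} \in [0.1, 0.9]\). Trimming raises the effective sample size per unit retained, but note that it changes the estimand to the ATE on the trimmed subpopulation.

```python
import numpy as np

def kish_ess(w):
    """Kish effective sample size: (sum w)^2 / sum(w^2)."""
    return w.sum() ** 2 / (w ** 2).sum()

rng_t = np.random.default_rng(0)
# Stand-in propensity scores that pile up near 0 and 1 (poor overlap)
e_demo = np.clip(rng_t.beta(0.6, 0.6, 2000), 0.005, 0.995)
d_demo = rng_t.binomial(1, e_demo)
w_demo = np.where(d_demo == 1, 1 / e_demo, 1 / (1 - e_demo))

keep = (e_demo >= 0.1) & (e_demo <= 0.9)   # Crump et al. (2009) rule
print(f"ESS, all units: {kish_ess(w_demo):.0f} of {len(w_demo)}")
print(f"ESS, trimmed:   {kish_ess(w_demo[keep]):.0f} of {keep.sum()}")
```

The per-unit ESS of the trimmed sample is higher because the few enormous weights, which dominate \(\sum w^2\), have been removed.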

21.9 Exercises

Exercise 21.1 (\(\star\star\), diagnostic failure). Simulate data with a positivity violation: \(e(\mathbf{x}) \approx 0\) for some covariate values. Show that IPW becomes unstable (extreme weights) while AIPW remains more stable. What does the propensity score overlap plot look like?

%%

# Strong confounding → near-violation of positivity
x_strong = rng.standard_normal(500)
e_strong = 1/(1+np.exp(-(3*x_strong)))  # steep: e spans nearly (0, 1)
D_s = rng.binomial(1, e_strong)
Y_s = x_strong + 2*D_s + rng.standard_normal(500)

lr_s = LogisticRegression(C=1e10).fit(x_strong.reshape(-1,1), D_s)
e_s = np.clip(lr_s.predict_proba(x_strong.reshape(-1,1))[:,1], 0.01, 0.99)

# IPW
ipw_s = np.mean(D_s*Y_s/e_s - (1-D_s)*Y_s/(1-e_s))
weights_s = np.where(D_s, 1/e_s, 1/(1-e_s))

# AIPW
X_s = sm.add_constant(x_strong)
m1_s = sm.OLS(Y_s[D_s==1], X_s[D_s==1]).fit()
m0_s = sm.OLS(Y_s[D_s==0], X_s[D_s==0]).fit()
mu1_s, mu0_s = m1_s.predict(X_s), m0_s.predict(X_s)
aipw_s = ((mu1_s - mu0_s) + D_s*(Y_s-mu1_s)/e_s
           - (1-D_s)*(Y_s-mu0_s)/(1-e_s))

print(f"True ATE: 2.0")
print(f"IPW:  {ipw_s:.3f}  (max weight: {weights_s.max():.1f})")
print(f"AIPW: {aipw_s.mean():.3f}")
print(f"\nIPW is pulled by extreme weights.")
print(f"AIPW uses the outcome model to stabilize.")
True ATE: 2.0
IPW:  2.765  (max weight: 100.0)
AIPW: 2.437

IPW is pulled by extreme weights.
AIPW uses the outcome model to stabilize.

%%

Exercise 21.2 (\(\star\star\), implementation). Verify AIPW’s double robustness: (a) fit AIPW with a wrong outcome model (constant prediction) but correct propensity → should be consistent; (b) fit with correct outcome but wrong propensity (constant 0.5) → should also be consistent.

%%

# (a) Wrong outcome, correct propensity
mu1_bad = np.full(n, Y[D==1].mean())
mu0_bad = np.full(n, Y[D==0].mean())
aipw_a = ((mu1_bad - mu0_bad) + D*(Y-mu1_bad)/e_hat
           - (1-D)*(Y-mu0_bad)/(1-e_hat))
print(f"(a) Wrong outcome, right propensity: {aipw_a.mean():.3f}")

# (b) Correct outcome, wrong propensity (constant 0.5)
e_wrong = np.full(n, 0.5)
aipw_b = ((mu1_hat - mu0_hat) + D*(Y-mu1_hat)/e_wrong
           - (1-D)*(Y-mu0_hat)/(1-e_wrong))
print(f"(b) Right outcome, wrong propensity: {aipw_b.mean():.3f}")
print(f"True ATE: {tau_true}")
print(f"Both are close — that's double robustness.")
(a) Wrong outcome, right propensity: 1.964
(b) Right outcome, wrong propensity: 1.969
True ATE: 2.0
Both are close — that's double robustness.

%%

Exercise 21.3 (\(\star\star\), comparison). Run all four estimators (naive, regression, IPW, AIPW) across 200 simulations. Report the mean bias and RMSE. Which has the best bias-variance tradeoff?

%%

results = {m: [] for m in ['Naive', 'Reg', 'IPW', 'AIPW']}
for rep in range(200):
    r = np.random.default_rng(rep)
    x = r.standard_normal((300,2))
    e = 1/(1+np.exp(-(0.5*x[:,0]-0.3*x[:,1])))
    d = r.binomial(1,e)
    y0 = 1+x[:,0]+0.5*x[:,1]+r.standard_normal(300)
    y = y0 + 2*d
    xs = sm.add_constant(x)

    results['Naive'].append(y[d==1].mean()-y[d==0].mean())

    m1 = sm.OLS(y[d == 1], xs[d == 1]).fit()
    m0 = sm.OLS(y[d == 0], xs[d == 0]).fit()
    mu1_r, mu0_r = m1.predict(xs), m0.predict(xs)
    results['Reg'].append((mu1_r - mu0_r).mean())

    lr2 = LogisticRegression(C=1e10).fit(x, d)
    eh = np.clip(lr2.predict_proba(x)[:, 1], 0.01, 0.99)
    results['IPW'].append(np.mean(d*y/eh - (1-d)*y/(1-eh)))

    aipw_rep = (mu1_r - mu0_r
                + d*(y - mu1_r)/eh - (1-d)*(y - mu0_r)/(1-eh))
    results['AIPW'].append(aipw_rep.mean())

print(f"{'Method':<8} {'Mean':>8} {'Bias':>8} {'RMSE':>8}")
for m in results:
    vals = np.array(results[m])
    print(f"{m:<8} {vals.mean():>8.3f} {vals.mean()-2:>8.3f} "
          f"{np.sqrt(np.mean((vals-2)**2)):>8.3f}")
Method       Mean     Bias     RMSE
Naive       2.312    0.312    0.355
Reg         1.994   -0.006    0.120
IPW         1.992   -0.008    0.131
AIPW        1.993   -0.007    0.121

In this simulation both nuisance models are correctly specified, so regression adjustment, IPW, and AIPW all perform about equally well; only the naive estimator is badly biased. AIPW's advantage appears when one of the two models is misspecified: it remains consistent as long as either the outcome model or the propensity model is right, whereas regression adjustment and IPW each depend on their single model being correct.
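A follow-up check, not part of the exercise, makes the misspecification case concrete: fit the outcome model on only the first covariate while using the true propensity (known in the simulation). Regression adjustment inherits the omitted-variable bias; AIPW, fed the same wrong outcome model, stays close to the truth.

```python
import numpy as np

reg_est, aipw_est = [], []
for rep in range(200):
    r = np.random.default_rng(rep)
    x = r.standard_normal((1000, 2))
    e = 1 / (1 + np.exp(-(0.5 * x[:, 0] - 0.3 * x[:, 1])))
    d = r.binomial(1, e)
    y = 1 + x[:, 0] + 0.5 * x[:, 1] + 2 * d + r.standard_normal(1000)

    # Misspecified outcome model: uses x[:, 0] only
    Z = np.column_stack([np.ones(1000), x[:, 0]])
    b1, *_ = np.linalg.lstsq(Z[d == 1], y[d == 1], rcond=None)
    b0, *_ = np.linalg.lstsq(Z[d == 0], y[d == 0], rcond=None)
    mu1, mu0 = Z @ b1, Z @ b0
    reg_est.append((mu1 - mu0).mean())

    # AIPW with the same wrong outcome model but the true propensity
    psi = ((mu1 - mu0) + d * (y - mu1) / e
           - (1 - d) * (y - mu0) / (1 - e))
    aipw_est.append(psi.mean())

print(f"Regression (wrong outcome model): {np.mean(reg_est):.3f}")
print(f"AIPW (same wrong outcome model):  {np.mean(aipw_est):.3f}")
print("True ATE: 2.0")
```

Using the true propensity isolates the double-robustness mechanism; in practice the propensity would itself be estimated, as in Exercise 21.2.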

%%

Exercise 21.4 (\(\star\star\star\), conceptual). Explain why AIPW is “doubly robust.” Under what conditions does it achieve the semiparametric efficiency bound?

%%

AIPW adds augmentation terms to the regression adjustment: \(D(Y-\hat{\mu}_1)/\hat{e} - (1-D)(Y-\hat{\mu}_0)/(1-\hat{e})\).

If the propensity is correct (\(\hat{e} = e\)): conditional on \(\mathbf{X}\), the first augmentation term has mean \(\mu_1(\mathbf{X}) - \hat{\mu}_1(\mathbf{X})\), which exactly cancels the outcome model's error in \(\hat{\mu}_1\); the second term does the same for \(\hat{\mu}_0\). So \(\hat{\tau}\) is consistent no matter how wrong \(\hat{\mu}\) is.

If the outcome model is correct (\(\hat{\mu}_d = \mu_d\)): under unconfoundedness the residuals \(Y - \mu_d\) have conditional mean zero given \(\mathbf{X}\) and \(D\), so both augmentation terms have mean zero regardless of \(\hat{e}\). In expectation the estimator reduces to the regression adjustment \(\hat{\mu}_1 - \hat{\mu}_0\), which is consistent.

AIPW achieves the semiparametric efficiency bound (Hahn, 1998) when both models are correct. The bound is the smallest possible asymptotic variance for any regular estimator of the ATE under unconfoundedness:

\(V_{\text{eff}} = \mathbb{E}\!\left[\frac{\sigma_1^2(\mathbf{X})}{e(\mathbf{X})} + \frac{\sigma_0^2(\mathbf{X})}{1-e(\mathbf{X})} + (\tau(\mathbf{X}) - \tau)^2\right]\)

where \(\sigma_d^2(\mathbf{x}) = \text{Var}(Y(d) \mid \mathbf{X}=\mathbf{x})\).
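A numerical sanity check (an addition, under the simplifying assumptions of unit error variance and a constant effect, so the \((\tau(\mathbf{X}) - \tau)^2\) term vanishes): the variance of the efficient influence function, evaluated at the true nuisances, should match the bound.

```python
import numpy as np

rng_eff = np.random.default_rng(0)
n_eff = 200_000
x_eff = rng_eff.standard_normal(n_eff)
e_eff = 1 / (1 + np.exp(-0.8 * x_eff))       # true propensity
d_eff = rng_eff.binomial(1, e_eff)
y_eff = x_eff + 2 * d_eff + rng_eff.standard_normal(n_eff)

# With sigma_1^2 = sigma_0^2 = 1 and tau(x) constant, the bound is
#   V_eff = E[1/e(X) + 1/(1 - e(X))]
v_eff = np.mean(1 / e_eff + 1 / (1 - e_eff))

# Efficient influence function at the true nuisances
mu1_eff, mu0_eff = x_eff + 2, x_eff
psi_eff = ((mu1_eff - mu0_eff)
           + d_eff * (y_eff - mu1_eff) / e_eff
           - (1 - d_eff) * (y_eff - mu0_eff) / (1 - e_eff))

print(f"V_eff (formula):       {v_eff:.3f}")
print(f"Var(psi) (simulation): {psi_eff.var():.3f}")
```

The two numbers agree up to Monte Carlo error: \(n \cdot \mathrm{Var}(\hat{\tau}_{\text{AIPW}})\) converges to \(V_{\text{eff}}\), which is Hahn's bound specialized to this design.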

%%

21.10 Bibliographic Notes

Potential outcomes: Neyman (1923) for randomized experiments, Rubin (1974) for the extension to observational studies. The framework is sometimes called the “Rubin Causal Model”; Holland (1986) provides a clear exposition.

Propensity score: Rosenbaum and Rubin (1983). The balancing property (propensity score suffices for unconfoundedness) is Theorem 3 of that paper.

IPW: Horvitz and Thompson (1952) for the survey sampling origin. The application to causal inference: Robins (1986). The stabilized (Hajek) estimator: Hajek (1971).

Matching: Rubin (1973) for bias reduction via matching; Abadie and Imbens (2006) for large-sample properties of matching estimators, including the result that matching with replacement is \(\sqrt{n}\)-consistent but has a bias term that requires correction when the number of matches is fixed. Mahalanobis distance matching: Rubin (1980).

AIPW and double robustness: Robins, Rotnitzky, and Zhao (1994). The semiparametric efficiency bound for the ATE: Hahn (1998). The double-robustness proof in Section 21.2.5.5 follows the argument in Lunceford and Davidian (2004), which gives a clear finite-sample decomposition.

Sensitivity analysis: Rosenbaum (2002, Ch. 4) for the theory of sensitivity bounds. Cinelli and Hazlett (2020) for the omitted variable bias framework as a modern alternative.

For textbooks: Imbens and Rubin (2015) for the potential-outcomes approach; Hernán and Robins (2020) for a modern epidemiological perspective (freely available online). Pearl (2009) for the structural causal model (SCM) approach, which this chapter does not cover but which is complementary.