import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import statsmodels.api as sm
from scipy import stats
plt.style.use('assets/book.mplstyle')
rng = np.random.default_rng(42)

22 Quasi-Experimental Designs: RDD, DiD, and Synthetic Control
22.1 Motivation
Chapter 21 required unconfoundedness: treatment is as good as randomly assigned conditional on observables. When unconfoundedness fails (unobserved confounders exist), we need designs that create quasi-randomization:
- RDD: Treatment assigned by whether a running variable crosses a cutoff (test score → scholarship, age → eligibility)
- DiD: Compare treated and control groups before and after an intervention, assuming parallel trends
- Synthetic control: Construct a weighted combination of control units that matches the treated unit’s pre-treatment trajectory
All three exploit specific design features rather than assuming away confounders.
22.2 Mathematical Foundation
22.2.1 RDD: why the cutoff creates quasi-randomization
In a sharp RDD, treatment is assigned by whether a running variable \(X_i\) crosses a cutoff \(c\): \(D_i = \mathbb{1}(X_i \geq c)\). For example, students with test scores above 80 get a scholarship; those below don’t.
The key insight: Units just above and just below the cutoff are nearly identical — they differ by a tiny amount of the running variable. A student with a score of 80.1 is essentially the same as a student with 79.9, but one gets the scholarship and the other doesn’t. This creates a quasi-random experiment at the cutoff.
The causal effect at the cutoff is: \[ \tau_{\text{RDD}} = \lim_{x \downarrow c} \mathbb{E}[Y \mid X=x] - \lim_{x \uparrow c} \mathbb{E}[Y \mid X=x]. \tag{22.1}\]
What RDD does and doesn’t identify: RDD identifies the local treatment effect at the cutoff — for people whose running variable is right at \(c\). It does NOT identify the effect for people far from the cutoff. A scholarship might help marginal students (near the cutoff) differently from top students (far above it).
How to estimate it: Fit separate regressions on each side of the cutoff (local linear regression) and measure the jump. The bandwidth determines how far from the cutoff you include data: wider bandwidth = more data but more bias (the relationship may be nonlinear); narrower bandwidth = less bias but more variance.
22.2.1.1 Identification proof sketch
The RDD estimand relies on a continuity assumption. Suppose the potential outcomes \(\mathbb{E}[Y(0) \mid X=x]\) and \(\mathbb{E}[Y(1) \mid X=x]\) are both continuous functions of \(x\) at the cutoff \(c\). Then:
\[ \lim_{x \downarrow c} \mathbb{E}[Y \mid X=x] = \lim_{x \downarrow c} \mathbb{E}[Y(1) \mid X=x] = \mathbb{E}[Y(1) \mid X=c], \]
because everyone just above \(c\) is treated, and similarly \(\lim_{x \uparrow c} \mathbb{E}[Y \mid X=x] = \mathbb{E}[Y(0) \mid X=c]\). The jump at the cutoff is therefore:
\[ \tau_{\text{RDD}} = \mathbb{E}[Y(1) - Y(0) \mid X = c]. \]
This is a local estimand: it tells you the treatment effect only for units at the cutoff. Extrapolating to units far from the cutoff requires additional assumptions (constant effects, parametric models) that the design alone does not justify. The from-scratch implementation below makes this concrete: the local linear regression literally restricts to a narrow window around \(c\).
22.2.1.2 Fuzzy RDD as instrumental variables
In a sharp RDD, crossing the cutoff perfectly determines treatment: \(\Pr(D=1 \mid X=c+\epsilon) = 1\) and \(\Pr(D=1 \mid X=c-\epsilon) = 0\). But in many applications, the cutoff creates a jump in the probability of treatment without making it deterministic. Students above a scholarship cutoff are eligible but may not all enroll; patients above a risk threshold are recommended surgery but some decline. This is a fuzzy RDD.
In a fuzzy RDD, the cutoff indicator \(Z_i = \mathbb{1}(X_i \geq c)\) serves as an instrument for the endogenous treatment \(D_i\). The fuzzy RDD estimand is:
\[ \tau_{\text{fuzzy}} = \frac{\lim_{x \downarrow c} \mathbb{E}[Y \mid X=x] - \lim_{x \uparrow c} \mathbb{E}[Y \mid X=x]} {\lim_{x \downarrow c} \mathbb{E}[D \mid X=x] - \lim_{x \uparrow c} \mathbb{E}[D \mid X=x]}. \tag{22.2}\]
This is a Wald estimator — the ratio of the jump in the outcome to the jump in the treatment probability — which is exactly the IV estimand from Chapter 20. The implementation uses 2SLS restricted to observations near the cutoff.
22.2.1.3 Optimal bandwidth selection
The bandwidth \(h\) controls the bias-variance tradeoff. Too narrow and variance explodes (few observations); too wide and bias dominates (nonlinear relationships get flattened). Imbens and Kalyanaraman (2012) derived the MSE-optimal bandwidth for a local linear estimator:
\[ h_{\text{IK}}^* = C_{\text{IK}} \cdot n^{-1/5}, \]
where \(C_{\text{IK}}\) depends on the curvature of the conditional expectation function on each side of the cutoff and the density of \(X\) at \(c\). The \(n^{-1/5}\) rate reflects the fact that local linear regression is a nonparametric estimator — the bandwidth shrinks with sample size, but slowly enough to keep variance in check.
In practice, the IK bandwidth is computed as:
\[ h_{\text{IK}}^* = C_K \left(\frac{\hat{\sigma}^2(c)}{n \cdot \hat{f}(c) \cdot (\hat{m}''_+(c)^2 + \hat{m}''_-(c)^2)}\right)^{1/5}, \]
where \(\hat{f}(c)\) is the estimated density at the cutoff, \(\hat{m}''_{\pm}(c)\) are estimated second derivatives on each side, \(\hat{\sigma}^2(c)\) is the conditional variance at \(c\), and \(C_K\) is a kernel-specific constant. We implement a simplified version from scratch below.
22.2.2 DiD: why parallel trends removes confounding
Difference-in-differences compares changes rather than levels:
\[ \tau_{\text{DiD}} = \underbrace{(\bar{Y}_{\text{treated,post}} - \bar{Y}_{\text{treated,pre}})}_{\text{change for treated}} - \underbrace{(\bar{Y}_{\text{control,post}} - \bar{Y}_{\text{control,pre}})}_{\text{change for control}}. \tag{22.3}\]
Why it works: Suppose treated and control groups differ in levels (treated cities are richer) but would have followed the same trend absent treatment (both would have grown at the same rate). The first difference (post - pre) removes the level difference. The second difference (treated change - control change) removes the common trend. What’s left is the treatment effect.
The parallel trends assumption: The control group’s change is what would have happened to the treated group without treatment. This is untestable in the post-period, but you can check for pre-trends: if the two groups moved in parallel before treatment, it is more plausible that they would have continued to do so.
The TWFE problem (Goodman-Bacon, 2021): The standard two-way fixed effects regression \(Y_{it} = \alpha_i + \gamma_t + \tau D_{it} + \varepsilon_{it}\) works perfectly for the simple 2×2 case (two groups, two periods). But when treatment is staggered (different units treated at different times) and effects are heterogeneous (early vs late adopters have different effects), TWFE computes a weighted average of all possible 2×2 comparisons. Some of these comparisons use already-treated units as controls, and the resulting weights can be negative. This can produce an estimate with the wrong sign even when every individual treatment effect is positive.
The fix: Use estimators designed for staggered adoption: Callaway and Sant’Anna (2021) or Sun and Abraham (2021), which only use not-yet-treated units as controls.
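The negative-weighting pathology is easy to reproduce in simulation. Below is a minimal sketch (an illustrative design of our own, not the decomposition from Goodman-Bacon’s paper; names like rng_gb and y_gb are ours): two cohorts adopt at different times, effects grow with time since treatment, and a plain TWFE regression is compared with the average of the true effects.
# Staggered adoption with growing effects: TWFE vs. the truth
rng_gb = np.random.default_rng(0)  # separate generator for this sketch
n_u, n_t = 40, 10
g = np.where(np.arange(n_u) < 20, 2, 6)   # first-treated period per unit
uid = np.repeat(np.arange(n_u), n_t)
tid = np.tile(np.arange(n_t), n_u)
rel = tid - g[uid]                        # periods since treatment
d = (rel >= 0).astype(float)
true_eff = np.where(rel >= 0, 1.0 + 0.5 * rel, 0.0)  # grows each period
y_gb = true_eff + rng_gb.standard_normal(n_u * n_t) * 0.1
# TWFE: unit dummies + time dummies + treatment indicator
X_gb = pd.get_dummies(
    pd.DataFrame({'u': uid, 't': tid}).astype(str),
    drop_first=True,
).astype(float)
X_gb['d'] = d
twfe = sm.OLS(y_gb, sm.add_constant(X_gb)).fit()
print(f"TWFE estimate:              {twfe.params['d']:.3f}")
print(f"Mean true effect (treated): {true_eff[d == 1].mean():.3f}")
In this design every unit is eventually treated and effects grow over time, so late-period comparisons use already-treated early adopters as controls and the TWFE coefficient falls well below the average true effect.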
22.2.2.1 Event study specification
An event study estimates treatment effects at each time period relative to treatment onset. The regression model is:
\[ Y_{it} = \alpha_i + \gamma_t + \sum_{k \neq -1} \delta_k \cdot \mathbb{1}(t - G_i = k) + \varepsilon_{it}, \tag{22.4}\]
where \(G_i\) is the period when unit \(i\) is first treated, \(k\) indexes periods relative to treatment (\(k < 0\) is pre-treatment, \(k \geq 0\) is post-treatment), and \(k = -1\) is the omitted reference period. The coefficients \(\delta_k\) have two roles:
- Pre-treatment (\(k < 0\)): These should be approximately zero. If they trend away from zero, parallel trends is violated.
- Post-treatment (\(k \geq 0\)): These estimate the dynamic treatment effect at each post-treatment horizon.
22.2.2.2 Pre-trends testing pitfalls (Roth 2022)
Testing for pre-trends sounds straightforward: if \(\delta_{-2}, \delta_{-3}, \ldots\) are insignificant, parallel trends holds. But Roth (2022) showed this creates pre-test bias: conditioning on passing the pre-trends test distorts the post-treatment estimates. The intuition is selection: when a genuine violation exists, the samples that pass the pre-test are precisely those where noise happened to mask the violation, and because the pre- and post-treatment coefficients share estimation noise, the same selection distorts the post-treatment estimates. The practical takeaway: a non-significant pre-trends test does not validate the design. Report the pre-treatment coefficients honestly; do not condition on them.
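A small Monte Carlo sketch makes the selection mechanism concrete (toy assumptions of our own: two groups, three periods, known noise variance, a true linear violation of parallel trends, and a zero treatment effect). The pre- and post-treatment coefficients share the reference-period noise, so conditioning on an insignificant pre-trend shifts the surviving post-treatment estimates.
# Pre-test bias: condition on an insignificant pre-trend and
# compare the surviving post-treatment estimates
rng_roth = np.random.default_rng(1)
delta, sigma, n_sims = 0.5, 0.3, 5000    # per-period drift, noise sd
post_all, post_passed = [], []
for _ in range(n_sims):
    # Group means at t = -2, -1, 0; the treated group drifts by delta
    # per period even though the true treatment effect is zero
    y_tr = delta * np.array([0.0, 1.0, 2.0]) + rng_roth.normal(0, sigma, 3)
    y_co = rng_roth.normal(0, sigma, 3)
    pre = (y_tr[0] - y_tr[1]) - (y_co[0] - y_co[1])    # k = -2 coef
    post = (y_tr[2] - y_tr[1]) - (y_co[2] - y_co[1])   # k = 0 coef
    post_all.append(post)
    if abs(pre / (2 * sigma)) < 1.96:    # pre-trends test "passes"
        post_passed.append(post)
print(f"Mean post coef, all draws:     {np.mean(post_all):.3f}")
print(f"Mean post coef, passing draws: {np.mean(post_passed):.3f}")
print("True effect is 0: both are biased by the violation,")
print("and conditioning on passing shifts the bias further.")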
22.2.2.3 Callaway-Sant’Anna group-time ATTs
The Callaway-Sant’Anna estimator defines group-time average treatment effects \(ATT(g, t)\): the effect at time \(t\) for the group first treated at time \(g\). Each \(ATT(g, t)\) is a simple DiD using only not-yet-treated (or never-treated) units as controls:
\[ ATT(g, t) = \mathbb{E}[Y_t - Y_{g-1} \mid G=g] - \mathbb{E}[Y_t - Y_{g-1} \mid C=1], \]
where \(C=1\) indicates the control group. These group-time effects are then aggregated into summary measures (overall ATT, dynamic effects by event time, group-specific effects) using explicit weighting. The key insight is that no already-treated unit ever serves as a control, avoiding the negative-weight problem entirely.
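Here is a minimal from-scratch sketch of the group-time building block, under simplifying assumptions (balanced panel, never-treated units as the comparison group, no covariates; the simulated design and names like att_gt are ours):
# Staggered panel: groups first treated at t=3 and t=5, plus never-treated
rng_cs = np.random.default_rng(2)
n_u_cs, n_t_cs = 60, 8
G = np.repeat([3, 5, 0], 20)              # first-treated period; 0 = never
Y_cs = (np.linspace(0, 2, n_t_cs)         # common time trend
        + rng_cs.standard_normal((n_u_cs, n_t_cs)) * 0.3)
for i in range(n_u_cs):
    if G[i] > 0:
        rel = np.arange(n_t_cs) - G[i]
        Y_cs[i] += np.where(rel >= 0, 1.0 + 0.3 * rel, 0.0)  # true ATT(g,t)
def att_gt(Y, G, g, t):
    """ATT(g, t): DiD of (Y_t - Y_{g-1}), group g vs. never-treated."""
    tr, co = G == g, G == 0
    return ((Y[tr, t] - Y[tr, g - 1]).mean()
            - (Y[co, t] - Y[co, g - 1]).mean())
for g in (3, 5):
    for t in range(g, n_t_cs):
        print(f"ATT(g={g}, t={t}) = {att_gt(Y_cs, G, g, t):+.2f} "
              f"(true = {1.0 + 0.3 * (t - g):+.2f})")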
22.2.3 Synthetic control: the counterfactual as an optimization problem
When you have a single treated unit (one state that passed a law, one company that merged), you cannot compute a simple average of control outcomes. Synthetic control constructs the best possible counterfactual by weighting control units to match the treated unit’s pre-treatment trajectory:
\[ \hat{\mathbf{w}} = \arg\min_\mathbf{w} \|\mathbf{Y}_1^{\text{pre}} - \mathbf{Y}_0^{\text{pre}}\mathbf{w}\|^2 \quad \text{s.t.} \quad \mathbf{w} \geq 0, \quad \sum w_j = 1. \]
The weights \(\hat{w}_j\) tell you: “the counterfactual is 30% of California + 20% of Texas + …” The treatment effect is the gap between the actual outcome and the synthetic control in the post-period.
Why the constraints: Non-negative weights (\(w_j \geq 0\)) and summing to 1 (\(\sum w_j = 1\)) prevent extrapolation. The synthetic control is a convex combination of real units, so it stays within the range of observed data. This is more defensible than regression, which can extrapolate far outside the data.
Connection to optimization (Ch. 3): This is a constrained least-squares problem, solved by scipy.optimize.minimize with LinearConstraint. The optimization theory from Chapter 3 applies directly.
22.2.3.1 Permutation inference for synthetic control
Standard inference (confidence intervals, \(p\)-values) does not apply to synthetic control — there is only one treated unit, so there is no sampling distribution. Instead, we use permutation inference (Abadie, Diamond, Hainmueller 2010):
- Apply the synthetic control method to each control unit \(j\), pretending it was the treated unit (a “placebo” test).
- For each placebo unit \(j\), compute the ratio of post-treatment RMSPE to pre-treatment RMSPE:
\[ r_j = \frac{\text{RMSPE}_j^{\text{post}}}{\text{RMSPE}_j^{\text{pre}}}. \]
- The \(p\)-value is the fraction of all units (the treated unit plus the placebos) with a ratio at least as large as the treated unit’s ratio:
\[ p = \frac{1}{J+1} \sum_{j=0}^{J} \mathbb{1}(r_j \geq r_0), \]
where \(r_0\) is the treated unit’s ratio and \(J\) is the number of control units.
Why the ratio: Using the ratio rather than the raw post-treatment RMSPE accounts for pre-treatment fit quality. A control unit with poor pre-fit will have large post-treatment gaps even without any treatment — dividing by pre-RMSPE penalizes these spurious “effects.”
22.3 Statsmodels / From-Scratch Implementation
22.3.1 RDD from scratch
In a sharp RDD, treatment is deterministic: everyone above the cutoff gets treated, everyone below does not. We estimate the treatment effect by comparing outcomes just above and just below the cutoff using local linear regression (OLS restricted to observations near the cutoff).
# Simulate sharp RDD: treatment assigned by whether x >= 0
n = 500
x_run = rng.uniform(-3, 3, n) # running variable (e.g., test score)
cutoff = 0
D_rdd = (x_run >= cutoff).astype(int)
# Outcome: continuous at cutoff, treatment adds 2
y_rdd = 1 + 0.5*x_run + 2*D_rdd + rng.standard_normal(n)
# Local linear regression near cutoff
bw = 1.0 # bandwidth
mask = np.abs(x_run - cutoff) <= bw
X_rdd = sm.add_constant(np.column_stack([x_run[mask], D_rdd[mask],
x_run[mask]*D_rdd[mask]]))
rdd_fit = sm.OLS(y_rdd[mask], X_rdd).fit(cov_type='HC1')
print(f"RDD estimate: {rdd_fit.params[2]:.3f} "
f"(SE={rdd_fit.bse[2]:.3f})")
print(f"True effect: 2.0")RDD estimate: 2.016 (SE=0.291)
True effect: 2.0
fig, ax = plt.subplots()
ax.scatter(x_run, y_rdd, s=5, alpha=0.3, color="C0")
# Fit separate lines on each side
for side, color in [(x_run < cutoff, "C1"), (x_run >= cutoff, "C2")]:
mask_s = side & (np.abs(x_run) < 2)
fit = np.polyfit(x_run[mask_s], y_rdd[mask_s], 1)
x_grid = np.linspace(x_run[mask_s].min(), x_run[mask_s].max(), 50)
ax.plot(x_grid, np.polyval(fit, x_grid), color=color, linewidth=2)
ax.axvline(cutoff, color="gray", linestyle="--", alpha=0.5)
ax.set_xlabel("Running variable")
ax.set_ylabel("Outcome")
plt.show()
22.3.2 Local polynomial RDD from scratch
The RDD estimator above used OLS on observations near the cutoff. A proper local polynomial estimator uses kernel weighting — giving more weight to observations closer to the cutoff. This is weighted least squares with a triangular or Epanechnikov kernel.
def local_linear_rdd(y, x, cutoff, bandwidth, kernel='triangular'):
"""Local linear RDD estimator with kernel weighting.
Returns the estimated discontinuity and its HC1 standard error.
"""
mask = np.abs(x - cutoff) <= bandwidth
x_loc = x[mask] - cutoff # center at cutoff
y_loc = y[mask]
d_loc = (x_loc >= 0).astype(float)
# Kernel weights
u = np.abs(x_loc) / bandwidth
if kernel == 'triangular':
w = np.maximum(1 - u, 0)
elif kernel == 'epanechnikov':
w = np.maximum(0.75 * (1 - u**2), 0)
else: # uniform
w = np.ones_like(u)
# WLS: y ~ 1 + x + D + D*x, weighted by kernel
X_mat = np.column_stack([
np.ones(len(x_loc)), x_loc,
d_loc, d_loc * x_loc,
])
W = np.diag(w)
# Weighted OLS: (X'WX)^{-1} X'Wy
XtW = X_mat.T @ W
beta = np.linalg.solve(XtW @ X_mat, XtW @ y_loc)
# HC1 standard errors
resid = y_loc - X_mat @ beta
n_eff = mask.sum()
meat = X_mat.T @ np.diag(w**2 * resid**2) @ X_mat
bread = np.linalg.inv(XtW @ X_mat)
V = bread @ meat @ bread * n_eff / (n_eff - 4)
se = np.sqrt(np.diag(V))
return beta[2], se[2] # treatment coef and SE
tau_hat, se_hat = local_linear_rdd(
y_rdd, x_run, cutoff=0, bandwidth=1.0,
)
print(f"Local linear RDD: {tau_hat:.3f} (SE={se_hat:.3f})")
print(f"True effect: 2.0")Local linear RDD: 2.001 (SE=0.313)
True effect: 2.0
22.3.3 IK bandwidth selector from scratch
The Imbens-Kalyanaraman bandwidth requires estimates of the density at the cutoff, the conditional variance, and the second derivatives of the regression function on each side. We implement a simplified version that captures the key ideas.
def ik_bandwidth(y, x, cutoff, kernel='triangular'):
"""Simplified Imbens-Kalyanaraman optimal bandwidth.
Uses a pilot bandwidth to estimate curvature, then
computes the MSE-optimal bandwidth for local linear
regression.
"""
n = len(x)
# Step 1: pilot bandwidth (Silverman rule-of-thumb)
h_pilot = 1.06 * np.std(x) * n**(-0.2)
# Step 2: estimate density at cutoff
f_hat = np.mean(np.abs(x - cutoff) <= h_pilot) / (2 * h_pilot)
# Step 3: estimate conditional variance at cutoff
near = np.abs(x - cutoff) <= h_pilot
if near.sum() < 10:
return h_pilot # fallback
x_near = x[near] - cutoff
y_near = y[near]
X_pilot = sm.add_constant(x_near)
resid_pilot = sm.OLS(y_near, X_pilot).fit().resid
sigma2_hat = np.mean(resid_pilot**2)
# Step 4: estimate curvature (second derivative) on
# each side using a quadratic fit with wider bandwidth
h_curv = 2 * h_pilot
m2_plus, m2_minus = 0.0, 0.0
for sign, store in [(1, 'plus'), (-1, 'minus')]:
if sign == 1:
side = (x >= cutoff) & (x <= cutoff + h_curv)
else:
side = (x < cutoff) & (x >= cutoff - h_curv)
if side.sum() < 5:
continue
xs = x[side] - cutoff
ys = y[side]
Xq = np.column_stack([
np.ones(side.sum()), xs, xs**2,
])
beta_q = np.linalg.lstsq(Xq, ys, rcond=None)[0]
if store == 'plus':
m2_plus = 2 * beta_q[2] # 2nd derivative
else:
m2_minus = 2 * beta_q[2]
# Step 5: IK formula (triangular kernel constant)
C_K = 3.4375 # for triangular kernel
curv = m2_plus**2 + m2_minus**2
if curv < 1e-12:
return h_pilot # curvature too small
h_opt = C_K * (sigma2_hat / (n * f_hat * curv))**0.2
return h_opt
h_ik = ik_bandwidth(y_rdd, x_run, cutoff=0)
print(f"IK optimal bandwidth: {h_ik:.3f}")
# Compare estimates across bandwidths
for h in [0.5, h_ik, 1.0, 1.5, 2.0]:
tau_h, se_h = local_linear_rdd(
y_rdd, x_run, cutoff=0, bandwidth=h,
)
label = " (IK)" if np.isclose(h, h_ik, atol=0.01) else ""
print(f" h={h:.2f}{label}: tau={tau_h:.3f} (SE={se_h:.3f})")IK optimal bandwidth: 1.131
h=0.50: tau=2.112 (SE=0.453)
h=1.13 (IK): tau=2.028 (SE=0.293)
h=1.00: tau=2.001 (SE=0.313)
h=1.50: tau=1.980 (SE=0.259)
h=2.00: tau=1.958 (SE=0.228)
22.3.4 Fuzzy RDD from scratch
When the cutoff creates a jump in the probability of treatment rather than a sharp switch, we have a fuzzy RDD. The estimator is 2SLS restricted to observations near the cutoff — the same IV machinery from Chapter 20, applied locally.
# Simulate fuzzy RDD: cutoff increases treatment probability
# but doesn't determine it perfectly
n_fuzz = 1000
x_fuzz = rng.uniform(-3, 3, n_fuzz)
# Treatment probability jumps from 0.2 to 0.8 at cutoff
p_treat = np.where(x_fuzz >= 0, 0.8, 0.2)
D_fuzz = rng.binomial(1, p_treat)
# Outcome: treatment effect of 3
y_fuzz = (1 + 0.5 * x_fuzz + 3 * D_fuzz
+ rng.standard_normal(n_fuzz))
# Fuzzy RDD via local 2SLS
bw_fuzz = 1.0
mask_f = np.abs(x_fuzz) <= bw_fuzz
x_loc = x_fuzz[mask_f]
y_loc = y_fuzz[mask_f]
D_loc = D_fuzz[mask_f]
Z_loc = (x_loc >= 0).astype(float)
# Stage 1: D ~ Z + x (instrument is cutoff indicator)
X_s1 = sm.add_constant(np.column_stack([Z_loc, x_loc]))
stage1 = sm.OLS(D_loc, X_s1).fit()
D_hat = stage1.fittedvalues
# Stage 2: y ~ D_hat + x
X_s2 = sm.add_constant(np.column_stack([D_hat, x_loc]))
stage2 = sm.OLS(y_loc, X_s2).fit()
# Wald estimator (ratio of jumps)
jump_y = (y_loc[Z_loc == 1].mean()
- y_loc[Z_loc == 0].mean())
jump_D = (D_loc[Z_loc == 1].mean()
- D_loc[Z_loc == 0].mean())
wald = jump_y / jump_D
print(f"Fuzzy RDD (2SLS): {stage2.params[1]:.3f}")
print(f"Fuzzy RDD (Wald): {wald:.3f}")
print(f"First stage jump: {jump_D:.3f}")
print(f"True effect: 3.0")Fuzzy RDD (2SLS): 2.760
Fuzzy RDD (Wald): 3.783
First stage jump: 0.629
True effect: 3.0
22.3.5 DiD from scratch
Difference-in-differences compares the change in outcomes for the treated group to the change for the control group. The identifying assumption is that both groups would have followed the same trend absent treatment (parallel trends). The treatment effect is the difference between the actual and counterfactual changes.
# Simulate a simple 2x2 DiD: 2 groups × 2 time periods
n_did = 200
treated = np.repeat([0, 1], n_did // 2)
post = np.tile([0, 1], n_did // 2)
# Parallel trends + treatment effect
y_did = (5 + 2*treated + 3*post + 1.5*treated*post
+ rng.standard_normal(n_did))
# DiD estimate
y00 = y_did[(treated==0) & (post==0)].mean()
y01 = y_did[(treated==0) & (post==1)].mean()
y10 = y_did[(treated==1) & (post==0)].mean()
y11 = y_did[(treated==1) & (post==1)].mean()
did_est = (y11 - y10) - (y01 - y00)
print(f"DiD estimate: {did_est:.3f} (true=1.5)")
print(f"Means: Y00={y00:.2f}, Y01={y01:.2f}, Y10={y10:.2f}, Y11={y11:.2f}")
# Equivalent regression
X_did = sm.add_constant(np.column_stack([treated, post, treated*post]))
did_reg = sm.OLS(y_did, X_did).fit(cov_type='HC1')
print(f"Regression DiD: {did_reg.params[3]:.3f} (SE={did_reg.bse[3]:.3f})")DiD estimate: 1.538 (true=1.5)
Means: Y00=4.93, Y01=7.93, Y10=7.00, Y11=11.53
Regression DiD: 1.538 (SE=0.265)
fig, ax = plt.subplots()
ax.plot([0, 1], [y00, y01], 'o-', color="C0", markersize=8, label="Control")
ax.plot([0, 1], [y10, y11], 's-', color="C1", markersize=8, label="Treated (actual)")
ax.plot([0, 1], [y10, y10 + (y01-y00)], 's--', color="C1", alpha=0.5,
markersize=8, label="Treated (counterfactual)")
ax.set_xticks([0, 1]); ax.set_xticklabels(["Pre", "Post"])
ax.set_ylabel("Outcome"); ax.legend()
plt.show()
22.3.6 Event study DiD from scratch
An event study specification estimates treatment effects at each period relative to the treatment date. This is the workhorse diagnostic for DiD: pre-treatment coefficients should be zero (validating parallel trends), and post-treatment coefficients trace out the dynamic effect.
# Simulate panel: 50 units, 10 periods, treatment at t=5
n_units, n_periods = 50, 10
treat_period = 5
n_treated = 25
# Unit and time fixed effects
alpha_i = rng.standard_normal(n_units) * 2
gamma_t = np.linspace(0, 3, n_periods)
# Build panel
unit_id = np.repeat(np.arange(n_units), n_periods)
time_id = np.tile(np.arange(n_periods), n_units)
treated_unit = (unit_id < n_treated).astype(float)
post = (time_id >= treat_period).astype(float)
# Dynamic treatment effects: grow over post-treatment periods
true_effects = np.zeros(n_periods)
for t in range(treat_period, n_periods):
true_effects[t] = 1.0 * (t - treat_period + 1)
# Outcome
y_panel = (alpha_i[unit_id] + gamma_t[time_id]
+ true_effects[time_id] * treated_unit
+ rng.standard_normal(n_units * n_periods) * 0.5)
# Event study regression: create dummies for each relative
# period k, omitting k=-1 as reference
event_time = time_id - treat_period # relative to treatment
ref_period = -1
# Build design matrix with unit FE, time FE, event dummies
# Only event dummies for treated units
event_periods = sorted(set(event_time))
event_periods = [k for k in event_periods if k != ref_period]
X_cols = {}
# Unit dummies (omit first)
for i in range(1, n_units):
X_cols[f'unit_{i}'] = (unit_id == i).astype(float)
# Time dummies (omit first)
for t in range(1, n_periods):
X_cols[f'time_{t}'] = (time_id == t).astype(float)
# Event-time dummies for treated units
for k in event_periods:
X_cols[f'event_{k}'] = (
(event_time == k) & (treated_unit == 1)
).astype(float)
X_es = sm.add_constant(
pd.DataFrame(X_cols)
)
es_fit = sm.OLS(y_panel, X_es).fit(cov_type='HC1')
# Extract event-study coefficients
es_coefs = []
es_ses = []
for k in event_periods:
idx = list(X_es.columns).index(f'event_{k}')
es_coefs.append(es_fit.params.iloc[idx])
es_ses.append(es_fit.bse.iloc[idx])
es_coefs = np.array(es_coefs)
es_ses = np.array(es_ses)

fig, ax = plt.subplots()
# Insert the reference period (coef=0, se=0) for plotting
plot_periods = sorted(event_periods + [ref_period])
plot_coefs = []
plot_ses = []
for k in plot_periods:
if k == ref_period:
plot_coefs.append(0.0)
plot_ses.append(0.0)
else:
i = event_periods.index(k)
plot_coefs.append(es_coefs[i])
plot_ses.append(es_ses[i])
plot_coefs = np.array(plot_coefs)
plot_ses = np.array(plot_ses)
ax.errorbar(
plot_periods, plot_coefs,
yerr=1.96 * plot_ses,
fmt='o-', color='C0', capsize=3, markersize=5,
label='Estimated',
)
ax.plot(
plot_periods,
[true_effects[treat_period + k] if k >= 0 else 0
for k in plot_periods],
's--', color='C1', markersize=4, alpha=0.7,
label='True',
)
ax.axvline(-0.5, color='gray', linestyle='--', alpha=0.5)
ax.axhline(0, color='gray', linestyle='-', alpha=0.3)
ax.set_xlabel("Periods relative to treatment")
ax.set_ylabel("Estimated effect")
ax.legend()
plt.show()
22.3.7 Synthetic control from scratch
from scipy.optimize import minimize
# 1 treated unit, 5 control units, 20 pre + 10 post periods
n_pre, n_post = 20, 10
n_control = 5
# Pre-treatment: treated and controls have similar trends
Y_control_pre = np.cumsum(rng.standard_normal((n_pre, n_control)), axis=0) + 50
Y_treated_pre = (Y_control_pre @ np.array([0.3, 0.25, 0.2, 0.15, 0.1])
+ rng.standard_normal(n_pre))
# Post-treatment: treated diverges
Y_control_post = (Y_control_pre[-1:] +
np.cumsum(rng.standard_normal((n_post, n_control)), axis=0))
Y_treated_post = (Y_control_post @ np.array([0.3, 0.25, 0.2, 0.15, 0.1])
+ 5 # treatment effect
+ rng.standard_normal(n_post))
# Find synthetic control weights
def synth_objective(w, Y_treated, Y_control):
return np.sum((Y_treated - Y_control @ w)**2)
from scipy.optimize import LinearConstraint
constraints = [
LinearConstraint(np.ones((1, n_control)), 1, 1), # sum = 1
]
bounds = [(0, 1)] * n_control
w0 = np.ones(n_control) / n_control
res = minimize(synth_objective, w0, args=(Y_treated_pre, Y_control_pre),
bounds=bounds, constraints=constraints, method='SLSQP')
w_synth = res.x
print(f"Synthetic weights: {w_synth.round(3)}")
print(f"Pre-treatment RMSE: {np.sqrt(res.fun/n_pre):.3f}")
# Treatment effect
Y_synth_post = Y_control_post @ w_synth
tau_synth = Y_treated_post - Y_synth_post
print(f"Avg treatment effect (post): {tau_synth.mean():.3f} (true≈5)")Synthetic weights: [0.205 0.214 0.331 0.044 0.207]
Pre-treatment RMSE: 0.936
Avg treatment effect (post): 5.316 (true≈5)
Y_all_treated = np.concatenate([Y_treated_pre, Y_treated_post])
Y_all_synth = np.concatenate([Y_control_pre @ w_synth, Y_synth_post])
fig, ax = plt.subplots()
t = range(n_pre + n_post)
ax.plot(t, Y_all_treated, color="C0", linewidth=1.5, label="Treated")
ax.plot(t, Y_all_synth, '--', color="C1", linewidth=1.5, label="Synthetic control")
ax.axvline(n_pre, color="gray", linestyle="--", alpha=0.5)
ax.fill_between(range(n_pre, n_pre+n_post),
Y_treated_post, Y_synth_post, alpha=0.2, color="C2",
label="Treatment effect")
ax.set_xlabel("Time period"); ax.legend()
plt.show()
22.3.8 Synthetic control placebo tests
To assess whether the treated unit’s effect is “significant,” we apply the synthetic control method to each control unit (pretending it is treated) and compare the resulting gaps.
def fit_synth(Y_treated_pre, Y_control_pre, Y_control_post):
"""Fit synthetic control and return post-treatment gaps."""
n_c = Y_control_pre.shape[1]
w0 = np.ones(n_c) / n_c
constraints = [
LinearConstraint(np.ones((1, n_c)), 1, 1),
]
bounds = [(0, 1)] * n_c
res = minimize(
synth_objective, w0,
args=(Y_treated_pre, Y_control_pre),
bounds=bounds, constraints=constraints,
method='SLSQP',
)
w = res.x
pre_rmspe = np.sqrt(res.fun / len(Y_treated_pre))
    Y_synth_post = Y_control_post @ w  # synthetic post-treatment path
    return w, pre_rmspe, Y_synth_post
# Compute RMSPE ratio for the treated unit
pre_rmspe_treated = np.sqrt(
np.mean((Y_treated_pre - Y_control_pre @ w_synth)**2)
)
post_rmspe_treated = np.sqrt(
np.mean((Y_treated_post - Y_control_post @ w_synth)**2)
)
ratio_treated = post_rmspe_treated / pre_rmspe_treated
# Placebo: apply synth control to each control unit
ratios = [ratio_treated]
placebo_gaps = {}
for j in range(n_control):
# Treat control unit j as "treated"
Y_j_pre = Y_control_pre[:, j]
# Remaining controls (exclude unit j)
cols = [c for c in range(n_control) if c != j]
Y_donors_pre = Y_control_pre[:, cols]
Y_donors_post = Y_control_post[:, cols]
Y_j_post = Y_control_post[:, j]
n_d = len(cols)
w0_p = np.ones(n_d) / n_d
constraints_p = [
LinearConstraint(np.ones((1, n_d)), 1, 1),
]
bounds_p = [(0, 1)] * n_d
res_p = minimize(
synth_objective, w0_p,
args=(Y_j_pre, Y_donors_pre),
bounds=bounds_p, constraints=constraints_p,
method='SLSQP',
)
w_p = res_p.x
pre_rmspe_j = np.sqrt(
np.mean((Y_j_pre - Y_donors_pre @ w_p)**2)
)
post_rmspe_j = np.sqrt(
np.mean((Y_j_post - Y_donors_post @ w_p)**2)
)
if pre_rmspe_j > 1e-6:
ratios.append(post_rmspe_j / pre_rmspe_j)
# Store gaps for plotting
placebo_gaps[j] = np.concatenate([
Y_j_pre - Y_donors_pre @ w_p,
Y_j_post - Y_donors_post @ w_p,
])
# p-value: fraction of ratios >= treated ratio
p_val = np.mean(np.array(ratios) >= ratio_treated)
print(f"Treated RMSPE ratio: {ratio_treated:.2f}")
print(f"Placebo ratios: "
f"{np.array(ratios[1:]).round(2)}")
print(f"Permutation p-value: {p_val:.3f}")Treated RMSPE ratio: 5.78
Placebo ratios: [3.13 2.47 1.74 1.04 4.62]
Permutation p-value: 0.167
fig, ax = plt.subplots()
t = range(n_pre + n_post)
# Plot placebo gaps in gray
for j, gaps in placebo_gaps.items():
ax.plot(t, gaps, color='gray', alpha=0.3, linewidth=0.8)
# Plot treated unit gap
treated_gaps = np.concatenate([
Y_treated_pre - Y_control_pre @ w_synth,
Y_treated_post - Y_control_post @ w_synth,
])
ax.plot(t, treated_gaps, color='C0', linewidth=2,
label='Treated unit')
ax.axvline(n_pre, color='gray', linestyle='--', alpha=0.5)
ax.axhline(0, color='gray', linestyle='-', alpha=0.3)
ax.set_xlabel("Time period")
ax.set_ylabel("Gap (actual − synthetic)")
ax.legend()
plt.show()
22.4 Variations, Diagnostics, and Pitfalls
22.4.1 RDD diagnostics
Placebo tests: Run the RDD analysis at fake cutoffs where there is no treatment. The estimated “effect” should be zero. If you find significant effects at fake cutoffs, your specification is wrong.
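As a quick sketch, reusing local_linear_rdd and the simulated data from Section 22.3.2, with -1.5 as an arbitrary fake cutoff (Exercise 22.1 repeats this across many fake cutoffs):
# Placebo: estimate a "discontinuity" where no treatment changes
tau_p, se_p = local_linear_rdd(y_rdd, x_run, cutoff=-1.5, bandwidth=1.0)
print(f"Placebo jump at fake cutoff -1.5: {tau_p:.3f} (SE={se_p:.3f})")
print("Consistent with zero" if abs(tau_p / se_p) < 1.96
      else "Specification concern")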
Density test (McCrary test): Check whether observations bunch at the cutoff. If people can manipulate their running variable to be just above (or below) the cutoff, the quasi-random assignment breaks down. A discontinuity in the density of the running variable at the cutoff is evidence of manipulation.
def mccrary_density_test(x, cutoff, n_bins=20,
bandwidth=None):
"""Simplified McCrary (2008) density test.
Estimates density on each side of the cutoff using
local linear regression on binned counts. Returns the
log-difference in density and a Wald test statistic.
"""
if bandwidth is None:
bandwidth = 1.5 * np.std(x) * len(x)**(-0.2)
# Bin the data
near = np.abs(x - cutoff) <= bandwidth
x_near = x[near]
bin_edges = np.linspace(
cutoff - bandwidth, cutoff + bandwidth,
n_bins + 1,
)
counts, _ = np.histogram(x_near, bins=bin_edges)
midpoints = 0.5 * (bin_edges[:-1] + bin_edges[1:])
# Separate left and right of cutoff
left = midpoints < cutoff
right = midpoints >= cutoff
# Local linear on each side to estimate density at cutoff
def fit_side(mask):
xm = midpoints[mask] - cutoff
ym = counts[mask].astype(float)
X_m = sm.add_constant(xm)
return sm.OLS(ym, X_m).fit()
fit_l = fit_side(left)
fit_r = fit_side(right)
# Density estimates at cutoff (intercepts)
f_left = fit_l.params[0]
f_right = fit_r.params[0]
se_diff = np.sqrt(fit_l.bse[0]**2 + fit_r.bse[0]**2)
t_stat = (f_right - f_left) / se_diff
return {
'f_left': f_left, 'f_right': f_right,
't_stat': t_stat,
'p_value': 2 * (1 - stats.norm.cdf(np.abs(t_stat))),
}
mc = mccrary_density_test(x_run, cutoff=0)
print(f"Density left of cutoff: {mc['f_left']:.1f}")
print(f"Density right of cutoff: {mc['f_right']:.1f}")
print(f"McCrary t-stat: {mc['t_stat']:.2f} "
f"(p={mc['p_value']:.3f})")
print("No manipulation" if mc['p_value'] > 0.05
else "Evidence of manipulation")Density left of cutoff: 6.9
Density right of cutoff: 5.6
McCrary t-stat: -0.59 (p=0.552)
No manipulation
Bandwidth sensitivity: The RDD estimate depends on the bandwidth. Report results for multiple bandwidths; if the estimate changes substantially, the result is fragile.
bandwidths = np.linspace(0.3, 2.5, 20)
taus = []
ses = []
for h in bandwidths:
tau_h, se_h = local_linear_rdd(
y_rdd, x_run, cutoff=0, bandwidth=h,
)
taus.append(tau_h)
ses.append(se_h)
taus = np.array(taus)
ses = np.array(ses)
fig, ax = plt.subplots()
ax.plot(bandwidths, taus, color='C0', linewidth=1.5)
ax.fill_between(
bandwidths, taus - 1.96*ses, taus + 1.96*ses,
alpha=0.2, color='C0',
)
ax.axhline(2.0, color='C1', linestyle='--',
alpha=0.7, label='True effect')
ax.axvline(h_ik, color='gray', linestyle=':',
alpha=0.7, label=f'IK bandwidth ({h_ik:.2f})')
ax.set_xlabel("Bandwidth")
ax.set_ylabel("RDD estimate")
ax.legend()
plt.show()
Covariate balance near the cutoff: If the RDD design is valid, covariates should be balanced just above and just below the cutoff. A significant jump in any pre-treatment covariate is a red flag.
# Simulate a pre-treatment covariate (should be smooth at cutoff)
covariate = 0.3 * x_run + rng.standard_normal(n) * 0.5
# Test for discontinuity in the covariate
tau_cov, se_cov = local_linear_rdd(
covariate, x_run, cutoff=0, bandwidth=1.0,
)
print(f"Covariate jump at cutoff: {tau_cov:.3f} "
f"(SE={se_cov:.3f})")
print(f"p-value: "
f"{2*(1 - stats.norm.cdf(abs(tau_cov/se_cov))):.3f}")
print("Balanced" if abs(tau_cov/se_cov) < 1.96
else "Imbalanced — RDD validity concern")Covariate jump at cutoff: -0.139 (SE=0.125)
p-value: 0.265
Balanced
22.4.2 DiD diagnostics
Pre-trends test: Plot the outcome for treated and control groups in the pre-treatment period. Parallel pre-trends support (but do not prove) the parallel trends assumption. Diverging pre-trends are a red flag.
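A sketch of that plot, reusing the simulated panel from Section 22.3.6 (y_panel, treated_unit, time_id, n_periods, treat_period); in that simulation the pre-treatment lines are parallel by construction:
# Raw group means by period: parallel pre-treatment lines support
# (but do not prove) parallel trends
fig, ax = plt.subplots()
for label, flag in [("Treated", 1.0), ("Control", 0.0)]:
    means = [y_panel[(treated_unit == flag) & (time_id == t)].mean()
             for t in range(n_periods)]
    ax.plot(range(n_periods), means, 'o-', markersize=4, label=label)
ax.axvline(treat_period - 0.5, color='gray', linestyle='--', alpha=0.5)
ax.set_xlabel("Period"); ax.set_ylabel("Mean outcome"); ax.legend()
plt.show()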
Event study plot: The event study from the implementation section already serves as the primary DiD diagnostic — pre-treatment coefficients near zero validate the design. Below, we examine the pre-treatment coefficients more formally.
# Test: are pre-treatment event-study coefficients
# jointly zero?
pre_coefs = []
pre_ses = []
for k in event_periods:
if k < 0: # pre-treatment only
idx = list(X_es.columns).index(f'event_{k}')
pre_coefs.append(es_fit.params.iloc[idx])
pre_ses.append(es_fit.bse.iloc[idx])
pre_coefs = np.array(pre_coefs)
pre_ses = np.array(pre_ses)
print("Pre-treatment event study coefficients:")
for i, k in enumerate([k for k in event_periods if k < 0]):
sig = "*" if abs(pre_coefs[i]/pre_ses[i]) > 1.96 else ""
print(f" k={k:+d}: {pre_coefs[i]:+.3f} "
f"(SE={pre_ses[i]:.3f}) {sig}")
print("\nJoint F-test on pre-treatment coefficients:")
f_stat = np.mean((pre_coefs / pre_ses)**2)
print(f" Mean squared t-stat: {f_stat:.2f} "
f"(should be ≈1 under H0)")Pre-treatment event study coefficients:
k=-5: +0.323 (SE=0.200)
k=-4: +0.133 (SE=0.199)
k=-3: +0.297 (SE=0.216)
k=-2: +0.299 (SE=0.211)
Joint F-test on pre-treatment coefficients:
Mean squared t-stat: 1.74 (should be ≈1 under H0)
22.4.3 Synthetic control diagnostics
Pre-treatment fit: The synthetic control should closely match the treated unit in the pre-treatment period. A poor pre-treatment fit (large gap) means the synthetic control is not a credible counterfactual.
Placebo test (permutation inference): The placebo tests above give us the distribution of RMSPE ratios. We can formalize this.
fig, ax = plt.subplots()
ax.hist(
ratios[1:], bins=8, color='C0', alpha=0.5,
edgecolor='black', label='Placebo units',
)
ax.axvline(
ratio_treated, color='C1', linewidth=2,
linestyle='--', label=f'Treated (ratio={ratio_treated:.1f})',
)
ax.set_xlabel("Post/Pre RMSPE ratio")
ax.set_ylabel("Count")
ax.legend()
plt.show()
Predictor balance table: Report how well the synthetic control matches the treated unit on pre-treatment outcomes and covariates.
# Pre-treatment fit quality
pre_fit_gaps = Y_treated_pre - Y_control_pre @ w_synth
print("Synthetic control pre-treatment fit:")
print(f" Mean absolute gap: "
f"{np.mean(np.abs(pre_fit_gaps)):.3f}")
print(f" Max absolute gap: "
f"{np.max(np.abs(pre_fit_gaps)):.3f}")
print(f" Pre-RMSPE: {pre_rmspe_treated:.3f}")
print(f"\nWeights on donor units:")
for j in range(n_control):
if w_synth[j] > 0.01:
print(f" Unit {j}: {w_synth[j]:.3f}")Synthetic control pre-treatment fit:
Mean absolute gap: 0.633
Max absolute gap: 2.272
Pre-RMSPE: 0.936
Weights on donor units:
Unit 0: 0.205
Unit 1: 0.214
Unit 2: 0.331
Unit 3: 0.044
Unit 4: 0.207
22.4.4 Choosing between methods
| Method | Requires | Identifies | Limitation |
|---|---|---|---|
| RDD | Cutoff in running variable | Local effect at cutoff | Not for people far from cutoff |
| DiD | Pre/post comparison group | Average effect of policy | Parallel trends assumption |
| Synth control | Many control units, 1 treated | Effect for treated unit | Can’t do standard inference |
Rule of thumb: Use RDD when there is a clear eligibility cutoff; DiD when there is a clear before/after treatment date with a comparison group; synthetic control when there is a single treated unit and many potential controls.
22.5 Computational Considerations
| Method | Cost | Notes |
|---|---|---|
| RDD (local linear) | \(O(n p^2)\) in the bandwidth window | Standard OLS |
| DiD (TWFE) | \(O(n p^2)\) | Standard panel regression |
| Synthetic control | \(O(T_{\text{pre}} \cdot J)\) optimization | \(J\) = number of control units |
All methods are fast for typical quasi-experimental data. The challenge is not computation but design: finding the right comparison group or cutoff.
22.6 Exercises
Exercise 22.1 (\(\star\star\), diagnostic failure). Run a “placebo” RDD at a fake cutoff (where there is no treatment). Does the test reject? Repeat at 10 different fake cutoffs and report the rejection rate. It should be ≈5%.
rejections = 0
for fake_c in np.linspace(-2, -0.5, 10):
mask = np.abs(x_run - fake_c) <= 1.0
D_fake = (x_run[mask] >= fake_c).astype(int)
X_fake = sm.add_constant(np.column_stack([x_run[mask], D_fake]))
res = sm.OLS(y_rdd[mask], X_fake).fit(cov_type='HC1')
if res.pvalues[2] < 0.05:
rejections += 1
print(f"Placebo rejections: {rejections}/10 (should be ≈0-1)")Placebo rejections: 1/10 (should be ≈0-1)
Exercise 22.2 (\(\star\star\), comparison). Compare the DiD estimate using the 2×2 formula vs. TWFE regression. Show they give the same answer in the 2×2 case.
print(f"2x2 formula: {did_est:.4f}")
print(f"TWFE regression: {did_reg.params[3]:.4f}")
print(f"Match: {np.isclose(did_est, did_reg.params[3], atol=0.01)}")2x2 formula: 1.5378
TWFE regression: 1.5378
Match: True
Exercise 22.3 (\(\star\star\), implementation). Implement synthetic control from scratch using scipy.optimize.minimize with constraints \(w_j \geq 0\), \(\sum w_j = 1\). Verify on the simulated data.
# Already implemented above
print(f"Weights sum: {w_synth.sum():.4f}")
print(f"All non-negative: {(w_synth >= -1e-10).all()}")
print(f"Treatment effect: {tau_synth.mean():.3f}")Weights sum: 1.0000
All non-negative: True
Treatment effect: 5.316
Exercise 22.4 (\(\star\star\star\), conceptual). Explain the Goodman-Bacon (2021) decomposition: why does TWFE with staggered adoption produce negative weights, and why can this give a wrong-sign estimate even when all group-specific effects are positive?
TWFE with staggered adoption computes \(\hat{\tau}\) as a weighted average of all \(2 \times 2\) DiD comparisons between groups treated at different times. Some comparisons use already-treated units as controls for newly-treated units. When treatment effects are heterogeneous (growing over time), the already-treated group’s outcome is elevated by their own treatment effect, making them a poor control. The resulting weights on these comparisons can be negative.
Example: if early adopters have a small effect that grows over time, and late adopters have a large initial effect, the TWFE estimate can be negative even though every individual effect is positive. The negative weight on the early-vs-late comparison (where the “control” is already treated) pulls the estimate below zero.
Fix: Callaway–Sant’Anna (2021) or Sun–Abraham (2021) estimators that only use not-yet-treated units as controls.
22.7 Bibliographic Notes
RDD: Thistlethwaite and Campbell (1960) introduced the idea; Hahn, Todd, and van der Klaauw (2001) formalized the identification argument. Imbens and Kalyanaraman (2012) derived the MSE-optimal bandwidth. Cattaneo, Idrobo, and Titiunik (2020) provide a modern practical guide. McCrary (2008) introduced the density test for manipulation. Lee and Lemieux (2010) is the standard reference for RDD practice. For fuzzy RDD, see Imbens and Angrist (1994) on the LATE interpretation and the connection to IV.
DiD: Ashenfelter and Card (1985) for the canonical application. Goodman-Bacon (2021) demonstrated the decomposition problem with staggered adoption. Callaway and Sant’Anna (2021) and Sun and Abraham (2021) provide heterogeneity-robust estimators. Roth (2022) exposed the pre-test bias problem: conditioning on passing a pre-trends test biases post-treatment estimates. De Chaisemartin and d’Haultfoeuille (2020) provide an alternative robust estimator. Roth, Sant’Anna, Bilinski, and Poe (2023) survey the recent methodological advances.
Synthetic control: Abadie and Gardeazabal (2003) for the original Basque Country application; Abadie, Diamond, and Hainmueller (2010) for the formal framework; Abadie, Diamond, and Hainmueller (2015) for permutation inference. Doudchenko and Imbens (2016) relax the non-negativity and sum-to-one constraints. Arkhangelsky et al. (2021) introduce the synthetic difference-in-differences estimator that combines synthetic control with DiD.