Difference-in-Differences
using Mixed-Integer Programming Matching Approach

Magdalena Bennett
McCombs School of Business, The University of Texas at Austin

AEFP 50th Conference, Washington DC
March 13th, 2025

Diff-in-Diff as an identification strategy

Parallel trend assumption (PTA)

Estimate Average Treatment Effect on the Treated (ATT)

But what if the PTA doesn't hold?

We can potentially remove [part of] the bias by matching on $X^{s}_{it} = X_{i}$

This paper

Identify contexts in which matching can recover causal estimates under certain violations of the parallel trend assumption.
- Overall bias reduction and increase in robustness for sensitivity analysis.

Use mixed-integer programming matching (MIP) to balance covariates directly.

Simulations:
Different DGP scenarios

Application:
School segregation & vouchers

Let's set up the problem

DD Setup

Let $Y_{i t} (z)$ be the potential outcome for unit $i$ in period $t$ under treatment $z$ .
Intervention implemented in $T_{0}$ $\to$ No units are treated in $t \leq T_{0}$

Difference-in-Differences (DD) focuses on ATT for $t > T_{0}$ :

$A T T (t) = E [Y_{i t} (1) - Y_{i t} (0) | Z = 1]$

Under the PTA:

$\begin{aligned} \hat{\tau}^{DD} = {} & \overbrace{E [Y_{i 1} | Z = 1] - E [Y_{i 1} | Z = 0]}^{\Delta_{post}} \\ & - \underbrace{\left(E [Y_{i 0} | Z = 1] - E [Y_{i 0} | Z = 0]\right)}_{\Delta_{pre}} \end{aligned}$
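A minimal sketch of the two-period DD estimator above, computed from group-by-period means. The records and the function name `dd_estimate` are made up for illustration; period 0 is pre-intervention and period 1 is post-intervention, as in the formula.

```python
from statistics import mean

def dd_estimate(rows):
    """2x2 difference-in-differences from (z, period, y) records.

    z is 1 for treated and 0 for control; period is 0 (pre) or 1 (post).
    """
    def cell_mean(z, period):
        return mean(y for zi, p, y in rows if zi == z and p == period)

    delta_post = cell_mean(1, 1) - cell_mean(0, 1)  # post-period gap
    delta_pre = cell_mean(1, 0) - cell_mean(0, 0)   # pre-period gap
    return delta_post - delta_pre

# Made-up records: (z, period, y)
rows = [
    (1, 0, 10.0), (1, 0, 12.0), (1, 1, 15.0), (1, 1, 17.0),
    (0, 0, 8.0), (0, 0, 10.0), (0, 1, 10.0), (0, 1, 12.0),
]
tau_hat = dd_estimate(rows)
print(tau_hat)  # post gap (16 - 11) minus pre gap (11 - 9)
```

The estimate equals the ATT only if the pre-period gap would have stayed constant without treatment, which is exactly the PTA.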

Bias in a DD setting

Bias can be introduced to DD in different ways:

1) Time-invariant covariates with time-varying effects: Obs. Bias

e.g. Effect of gender on salaries.

2) Differential time-varying effects: Obs. Diff. Bias

e.g. The effect of race on salaries evolves differently over time by group.

3) Observed or unobserved time-varying covariates: Unobs. Bias

e.g. Test scores

If the PTA holds...

$\begin{aligned} & \overbrace{\left({\bar{γ}}_{1} (X^{1}, t^{'}) - {\bar{γ}}_{1} (X^{0}, t^{'})\right) - \left({\bar{γ}}_{1} (X^{1}, t) - {\bar{γ}}_{1} (X^{0}, t)\right)}^{\text{Obs. Bias}} \\ & + \underbrace{\left({\bar{γ}}_{2} (X^{1}, t^{'}) - {\bar{γ}}_{2} (X^{1}, t)\right)}_{\text{Obs. Diff. Bias}} + \underbrace{(λ_{t^{'} 1} - λ_{t^{'} 0}) - (λ_{t 1} - λ_{t 0})}_{\text{Unobs. Bias}} = 0 \end{aligned}$

If the PTA holds...

One of the following two conditions needs to hold:

1) No effect or constant effect of $X$ on $Y$ over time: $E [γ_{1} (X, t)] = E [γ_{1} (X)]$

2) Equal distribution of observed covariates between groups: $X_{i} | Z = 1 \overset{d}{=} X_{i} | Z = 0$

in addition to:

3) No differential time effect of $X$ on $Y$ by treatment group: $E [γ_{2} (X, t)] = 0$

4) No unobserved time-varying effects: $λ_{t 1} = λ_{t 0}$

Cond. 2 can hold through matching

Cond. 3 and 4 can be tested with sensitivity analysis
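To see why either Cond. 1 or Cond. 2 is enough to remove the observed bias, here is a small population-level sketch. It assumes a hypothetical model with one binary covariate $X$ whose coefficient changes between the pre- and post-period, and it assumes Cond. 3 and 4 hold (no differential effect, no unobserved time-varying term). All numbers and the function name `expected_dd` are illustrative.

```python
def expected_dd(p_x_treated, p_x_control, beta_pre, beta_post, tau):
    """Population 2x2 DD when a binary X has a time-varying effect.

    Hypothetical model:
      E[Y_t | Z] = alpha_Z + beta_t * P(X=1 | Z) + tau * Z * 1[t = post]
    The group intercepts alpha_Z cancel in the double difference.
    """
    delta_post = tau + beta_post * (p_x_treated - p_x_control)
    delta_pre = beta_pre * (p_x_treated - p_x_control)
    return delta_post - delta_pre

tau = 0.5  # true ATT in this toy example

# Unbalanced X and a time-varying effect of X: DD is biased.
dd_unbalanced = expected_dd(0.7, 0.3, beta_pre=1.0, beta_post=2.0, tau=tau)

# Cond. 1: the effect of X is constant over time, so imbalance is harmless.
dd_constant_effect = expected_dd(0.7, 0.3, beta_pre=1.0, beta_post=1.0, tau=tau)

# Cond. 2: X has the same distribution in both groups (e.g. after matching).
dd_balanced = expected_dd(0.7, 0.7, beta_pre=1.0, beta_post=2.0, tau=tau)

print(dd_unbalanced, dd_constant_effect, dd_balanced)
```

In this toy setup the bias is (beta_post - beta_pre) times the covariate imbalance, so it vanishes when either factor is zero.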

Sensitivity analysis for Diff-in-Diff

In an event study $\to$ null effects prior to the intervention:

Honest approach to test pretrends

One main issue with the previous test $\to$ Underpowered
Rambachan & Roth (2023) propose sensitivity bounds to allow pre-trends violations:
- E.g. Violations in the post-intervention period can be at most $M$ times the max violation in the pre-intervention period.
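A simplified sketch of the relative-magnitudes idea for the first post-intervention period. It treats the event-study coefficients as known and ignores sampling uncertainty, so it is not the confidence-set procedure of Rambachan & Roth (2023); it only shows how $M$ scales the allowed violation. The coefficients and function names are made up.

```python
def max_pre_violation(pre_coefs):
    """Largest change between consecutive pre-period event-study coefficients.

    pre_coefs is in time order and ends with the reference period (set to 0).
    """
    return max(abs(b - a) for a, b in zip(pre_coefs, pre_coefs[1:]))

def relative_magnitude_bounds(pre_coefs, post_coef, m):
    """Bounds on the first post-period effect if the post-period violation
    is at most m times the largest pre-period violation."""
    half_width = m * max_pre_violation(pre_coefs)
    return post_coef - half_width, post_coef + half_width

def breakdown_m(pre_coefs, post_coef):
    """Smallest m at which the bounds include zero."""
    worst = max_pre_violation(pre_coefs)
    return float("inf") if worst == 0 else abs(post_coef) / worst

# Hypothetical event-study coefficients (last pre-period is the reference).
pre_coefs = [0.10, -0.05, 0.02, 0.0]
post_coef = 0.30

lower, upper = relative_magnitude_bounds(pre_coefs, post_coef, m=1.0)
m_star = breakdown_m(pre_coefs, post_coef)
print(lower, upper, m_star)
```

A larger breakdown value means the conclusion survives larger violations of parallel trends. Note that the bounds are centered on the point estimate, so a biased estimate shifts them as well.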

Simulations

Different scenarios

For linear and quadratic functions:

S1: No interaction between X and t

S2: Equal interaction between X and t

S3: Differential interaction between X and t

S4: S3 + Bias cancellation

For all scenarios, differential distribution of covariates $X$ between groups

Parameters:

Parameter	Value
Number of obs (N)	1,000
Pr(Z=1)	0.5
Time periods (T)	8
Last pre-intervention period (T_0)	4
Matching PS	Nearest neighbor (using calipers)
MIP Matching tolerance	0.01 SD
Number of simulations	1,000

Estimate compared to sample ATT (can be different for matching)
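A rough sketch of what one simulated panel for S1 to S3 could look like, using the parameter values listed above (N = 1,000, Pr(Z=1) = 0.5, T = 8, T_0 = 4). The functional form and every coefficient below are assumptions for illustration, not the DGP behind the reported results, and the sketch draws a single panel rather than 1,000 replications.

```python
import random
from statistics import mean

def simulate_event_study(scenario, n=1000, periods=8, t0=4, tau=1.0, seed=0):
    """Simulate one panel and return unmatched event-study estimates wrt t0.

    Hypothetical linear DGP:
      S1: Y = a_i + 0.5*t + X_i + tau*Z_i*1[t > t0] + e_it
      S2: S1 + 0.2*X_i*t        (same X-by-t interaction in both groups)
      S3: S2 + 0.1*X_i*t*Z_i    (extra interaction for treated units)
    X is shifted by +1 for treated units, so covariates are unbalanced.
    """
    rng = random.Random(seed)
    units = []
    for _ in range(n):
        z = 1 if rng.random() < 0.5 else 0
        x = rng.gauss(1.0 if z else 0.0, 1.0)  # time-invariant covariate
        a = rng.gauss(0.0, 1.0)                # unit fixed effect
        units.append((z, x, a))

    gap = {}
    for t in range(1, periods + 1):
        y_treated, y_control = [], []
        for z, x, a in units:
            y = a + 0.5 * t + x + tau * z * (t > t0) + rng.gauss(0.0, 0.5)
            if scenario in ("S2", "S3"):
                y += 0.2 * x * t
            if scenario == "S3":
                y += 0.1 * x * t * z
            (y_treated if z else y_control).append(y)
        gap[t] = mean(y_treated) - mean(y_control)

    # Event-study estimate: treated-control gap relative to the gap at t0.
    return {t: gap[t] - gap[t0] for t in gap}

s1 = simulate_event_study("S1")
s2 = simulate_event_study("S2")
s3 = simulate_event_study("S3")
print(round(s1[1], 2), round(s2[1], 2), round(s3[1], 2))
```

Under these assumptions the pre-period estimates are close to zero in S1 but not in S2 or S3, because the unbalanced covariate has an effect that changes over time.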

S1 - No interaction between X and t

Event study estimates by time period (wrt T=4) for no interaction between X and t

S2 - Equal interaction between X and t by treatment

Event study estimates by time period (wrt T=4) for equal interaction between X and t

S3 - Differential interaction between X and t by treatment

Event study estimates by time period (wrt T=4) for differential interaction between X and t

Why is this bias reduction important?

Example of S2 (Quadratic) with no true effect:

Figure: Sensitivity bounds on relative magnitudes for Scenario 2 (quadratic), no true effect

Why is this bias reduction important?

Even under modest bias, we would incorrectly reject the null 20% of the time.

Figure: Rejection rate of the null hypothesis for different values of $\beta_{xt}$
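A back-of-the-envelope sketch of how bias inflates the rejection rate, assuming a normally distributed estimator and a two-sided 5% z-test. This is a textbook approximation, not the simulation that produced the figure; `bias_in_se` (bias measured in standard errors) is a name introduced here.

```python
from statistics import NormalDist

def rejection_rate(bias_in_se, alpha=0.05):
    """Probability that a two-sided z-test rejects H0: tau = 0 when the true
    effect is zero but the estimator is shifted by `bias_in_se` standard errors."""
    std_normal = NormalDist()
    crit = std_normal.inv_cdf(1 - alpha / 2)
    return std_normal.cdf(-crit + bias_in_se) + std_normal.cdf(-crit - bias_in_se)

no_bias = rejection_rate(0.0)      # nominal level
modest_bias = rejection_rate(1.1)  # bias of about 1.1 standard errors
print(round(no_bias, 3), round(modest_bias, 3))
```

Under this approximation, a bias of roughly one standard error is already enough to push the false rejection rate from 5% to about 20%.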

Why is this bias reduction important?

Sensitivity analysis results are skewed by the magnitude of the bias.

S4: Bias cancellation

Application

Preferential Voucher Scheme in Chile

Universal flat voucher scheme $\overset{2008}{⟶}$ Universal + preferential voucher scheme
Preferential voucher scheme:
- Targeted to bottom 40% of vulnerable students
- Additional 50% of voucher per student
- Additional money for concentration of SEP students.

Students:
- Verify SEP status
- Attend a SEP school

Schools:
- Opt into the policy
- No selection, no fees
- Resources ~ performance

Before matching: Household income

Before matching: Average SIMCE

Matching + DD

Prior to matching: No parallel pre-trend
Different types of schools:
- Schools that charge high co-payment fees.
- Schools with a low number of SEP students enrolled.
MIP Matching using constant or "sticky" covariates:
- Mean balance (0.025 SD): Enrollment, average yearly subsidy, number of voucher schools in county, charges add-on fees
- Exact balance: Geographic province
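The two balance requirements above can be checked after matching along these lines. This is a sketch with made-up enrollment figures and province labels; standardizing by the treated-group SD is one common convention and is an assumption here.

```python
from collections import Counter
from statistics import mean, stdev

def standardized_mean_difference(treated, control):
    """Difference in means divided by the treated-group SD (assumed convention)."""
    return (mean(treated) - mean(control)) / stdev(treated)

def is_mean_balanced(treated, control, tolerance=0.025):
    """Mean balance: standardized difference within `tolerance` SDs."""
    return abs(standardized_mean_difference(treated, control)) <= tolerance

def is_exactly_balanced(treated_labels, control_labels):
    """Exact balance: every category has the same count in both groups."""
    return Counter(treated_labels) == Counter(control_labels)

# Hypothetical matched sample: enrollment and province for four schools per group.
enrollment_treated = [300, 420, 510, 640]
enrollment_control = [305, 415, 505, 646]
province_treated = ["A", "A", "B", "C"]
province_control = ["A", "B", "A", "C"]

mean_ok = is_mean_balanced(enrollment_treated, enrollment_control)
exact_ok = is_exactly_balanced(province_treated, province_control)
print(mean_ok, exact_ok)
```

Exact balance on province only requires equal counts per province in the two groups; it does not require that each matched pair share a province.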

Groups are balanced in specific characteristics

Matching in 16 out of 53 provinces

After matching: Household income

After matching: Average SIMCE

Results

Matched schools:
- More vulnerable and lower test scores than the population mean.

9pp increase in the income gap between SEP and non-SEP schools in matched DD:
- SEP schools attracted even more vulnerable students.
- Non-SEP schools increased their average family income.

No evidence of increase in SIMCE score:
- Could be a longer-term outcome.

Findings in segregation are moderately robust to hidden bias (Keele et al., 2019):
- $Γ_{c} = 1.76$ $\to$ Unobserved confounder would have to change the probability of assignment from 50% vs 50% to 32.7% vs 67.3%.
- Allows up to 70% of the maximum deviation in the pre-intervention period (M = 0.7) vs 50% without matching (Rambachan & Roth, 2023)

Potential reasons?

The probability of becoming a SEP school in 2009 jumps discontinuously at 60% SEP student concentration in 2008 (4.7 pp; SE = 0.024)

Let's wrap it up

Conclusions and Next Steps

Matching can be an important tool to address violations in PTA.

Bias reduction is very important for sensitivity analysis.

Serial correlation also plays an important role: Don't match on random noise.

Next steps: Partial identification using time-varying covariates

Honest approach to test pretrends

One drawback of the previous method is that it can overstate (or understate) the robustness of findings if the point estimate is biased.
- Honest CIs depend on the magnitude of the point estimate as well as the pre-trend violations.

Matching can reduce the overall bias of the point estimate

How do we match?

Match on covariates or outcomes? Levels or trends?
Propensity score matching? Optimal matching? etc.

This paper:

Match on time-invariant covariates that could make groups behave differently.
- Use distribution of covariates to match on a template.
Use of Mixed-Integer Programming (MIP) Matching (Zubizarreta, 2015; Bennett, Zubizarreta, & Vielma, 2020):
- Balance covariates directly
- Yield largest matched sample under balancing constraints (cardinality matching)
- Works fast with large samples
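A toy sketch of the optimization problem behind cardinality matching: pick the largest equal-size subsets of treated and control units whose covariate means differ by at most a tolerance. Here the problem is solved by exhaustive search, which only works for a handful of units; a real application would pass the same objective and constraints to a MIP solver. The data, tolerance, and function name are made up, and the single-covariate mean-balance constraint is a simplification.

```python
from itertools import combinations
from statistics import mean, stdev

def cardinality_match(x_treated, x_control, tolerance_sd=0.1):
    """Largest equal-size subsets of treated and control units whose covariate
    means differ by at most `tolerance_sd` treated-group SDs.

    Exhaustive search over subsets; feasible only for toy inputs.
    """
    limit = tolerance_sd * stdev(x_treated)
    max_size = min(len(x_treated), len(x_control))
    for size in range(max_size, 0, -1):  # try the largest samples first
        for t_idx in combinations(range(len(x_treated)), size):
            t_mean = mean(x_treated[i] for i in t_idx)
            for c_idx in combinations(range(len(x_control)), size):
                c_mean = mean(x_control[j] for j in c_idx)
                if abs(t_mean - c_mean) <= limit:
                    return t_idx, c_idx  # first feasible solution of this size
    return (), ()

# Hypothetical covariate values; the treated unit at 9.0 has no comparable controls.
x_treated = [2.0, 3.0, 4.0, 9.0]
x_control = [1.0, 2.0, 3.0, 3.5, 4.0, 5.0]

t_idx, c_idx = cardinality_match(x_treated, x_control)
print(t_idx, c_idx)
```

In this example no balanced sample keeps all four treated units, so the search settles for three and drops the outlying treated unit. Dropping treated units changes the estimand to the ATT of the matched sample.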
