Note: all replication materials are posted in this GitHub repository.
The Synthetic Control Method (SCM for short) was introduced by Abadie, Diamond and Hainmueller, in a 2010 paper that was published in JASA. I’ll first provide some basic intuition for the method, before I get into the application.
As a researcher (or more commonly, as a policy analyst), we sometimes encounter situations in which we have 1 (or very few) treated panels, and the available control panels are not immediately useful. The treated panel is just ‘too different’. This might be apparent, for example, from an inspection of pre-trends in the outcome variable.
If all of the available control panels are too different, matching can’t resolve this. A related, alternative approach would be re-weighting (e.g., IPW). This, essentially, is what SCM implements. It identifies a linear, weighted combination that maps control unit outcomes to outcomes in the treated unit, when the treatment is not active (i.e., in the pre-treatment period).
What is this exercise really doing then? We can think about this “mapping” exercise as a prediction problem (note: viewed this way, we might think of difference-in-differences as an inflexible approach to capturing / representing the mapping between treatment and control panels, i.e., it’s a shitty prediction algorithm, and a parallel trend violation amounts to prediction error). What SCM does is allow for greater flexibility in the mapping between treatment and control. It accomplishes this by allowing for a fractional representation of individual panels when it learns the mapping, via re-weighting.
Assuming we can learn how to use the control panels’ outcomes to predict the treatment panel outcome before treatment turns on, we can apply that learned prediction into the post period, to speculate about what would have happened to the treated panel had the treatment never turned on. Of course, there are some nuances here that I am glossing over. For example, the weights we learn under SCM are constrained to be non-negative, selected controls need to lie in the convex hull of the treated panel, i.e., control panels cannot be ‘too different’ from the treated panel. As I’ve stated previously, I encourage you to go read the original paper, as well as subsequent innovations, e.g., synthetic difference in differences, generalized synthetic control and synthetic control using LASSO, or SCUL.
I am going to provide a quick demonstration of the SCM, via a replication of some recently published work in PNAS. At the same time, I’ll demonstrate SCM using LASSO as well, given the number of available controls is quite large in this case.
Mitze et al. (2020, PNAS) examine a case study in Germany around the mandatory imposition of face masks in Jena, Germany, on April 6th of 2020. They employed this policy event as a natural experiment, to estimate the effect of mandatory face mask policies on COVID-19 case rates. Interestingly, this setup is also something of an ideal use case for SCM; we have a single treated location (the unit of analysis here is a district), and many controls (hundreds of other, untreated districts, in fact). Below is the main result of the study, a figure, which depicts the actual cumulative covid case series in Jena (solid line), and the estimated synthetic counterfactual (dashed line). This is the result we are going to try to reproduce.
Now, I originally sought to replicate this study because I needed a teaching example for my graduate-level causal inference course. The study seemed like a wonderful candidate for replication, because PNAS requires authors to disclose data and analysis scripts as supplementary appendices. Unfortunately, in this case, I was unable to locate anything approximating useful replication materials. The supplementary appendix did not include any analysis scripts (despite the authors’ claim to the contrary in their paper), and the data that were provided amounted to a single spreadsheet, with a handful of time series (certainly not the entirety of raw data that were used to construct a synthetic control). Rather than give up, I decided to locate the datasets the authors had claimed to have used in their study. After some digging, I was able to locate most of the relevant data.
Data on daily COVID-19 case reports for each German district and day have been published by the Robert Koch Institute since January of 2020. They have been made available publicly, via API. Fortunately, someone has been collecting that data and sharing it on a GitHub repository! See here.
I was then able to locate data for most of the control covariates the authors mentioned in their paper, by scouring different German Census Data portals (not a fun exercise when you cannot speak German!). I located data on the number of physicians per capita in each district, number of pharmacies, number of hospital beds, as well as data on resident volumes by age category, gender and so on.
Many, if not most of the covariates are available from DEStatis, including population counts by age range, as well as by gender. Further, data on a region’s local healthcare infrastructure, including physicians per capita, hospital beds, and number of pharmacies are available on an annual basis in the INKAR database, maintained by the German Federal Institute for Research on Building, Urban Affairs and Spatial Development (be prepared to use Google Translate if you don’t speak German!). I obtained the most recently available published values for each district, but I want to note that, using SCUL, which is more flexible than standard SCM, we will not need these additional covariates to do a good job.)
I am going to import and integrate the COVID-19 daily cumulative case volumes, district names (for labels / interpretation), and a few of the district characteristics: physicians per capita, pharmacy volumes, and count of individuals by age range. I will ultimately use the latter in my SCM implementation to arrive at a result that looks rather similar to that published in the paper. However, using SCUL, as you will see, we can do a good job without those features.
Let’s import and integrate the data…
library(tidyr)
library(broom)
library(dplyr)
library(ggplot2)
library(glmnet) # I'm going to use this for LASSO to select control panels.
library(janitor) # I will use this to clear some missing values.
library(Synth) # This is the original SCM package implemented by Diamond.
library(ggthemes)
library(patchwork)
library(tmap)
### Let's import Covid-19 case volume data from german districts, by day..
setwd("/Users/gburtch/Google Drive/Teaching/Courses/MSBA 6440/2021/(9) Synthetic Control /Covid-19 Facemasks/")
covid <- read.csv("Covid_DE_cases_by_district.csv")
dist_names <- read.csv("Covid_DE_district_IDs.csv")
covid <- covid %>% merge(dist_names, by.x = "dist_id", by.y = "id")
# Let's convert date column into date format.
covid$date <- as.Date(covid$date)
covid <- covid[order(covid$dist_id,covid$date),]
# We can pull in other features, though these are not actually required.
# E.g., we can pull in physicians per capita in each district.
setwd("/Users/gburtch/Google Drive/Teaching/Courses/MSBA 6440/2021/(9) Synthetic Control /Covid-19 Facemasks/district_demographics/")
dist_physicians <- read.csv("dist_phys_pcap.csv",sep=";") %>% rename("dist_id" = "Kennziffer", "physicians" = "Ärzte.je..Einwohner") %>% filter(!is.na(dist_id)) %>% select(c(dist_id,physicians))
dist_physicians$physicians <- as.numeric(gsub(",",".",dist_physicians$physicians))
covid <- covid %>% merge(dist_physicians, by="dist_id", all.x=TRUE)
# We now pull in pharmacy information.
dist_pharma <- read.csv("dist_pharmas_2017.csv",sep=";") %>% filter(!is.na(Kennziffer)) %>% rename("dist_id" = "Kennziffer","pharmacies" = "Apotheken")
dist_pharma$pharmacies <- as.numeric(gsub(",",".",dist_pharma$pharmacies))
dist_pharma <- dist_pharma %>% select(c(dist_id,pharmacies))
covid <- covid %>% merge(dist_pharma, by="dist_id",all.x=TRUE)
# We can pull in population information by age, and so on now as well.
dist_pop <- read.csv("dist_pop_age.csv", sep=";") %>% select(-c(dist_name))
dist_pop$year <- substr(dist_pop$year,1,4)
# Some missing values here.
dist_pop <- dist_pop %>% lapply(as.integer) %>% as.data.frame()
# We have yearly values for 3 years - let's just use the most recent set of values in 2019.
dist_pop <- dist_pop %>% filter(year==2019) %>% select(-c(year))
covid <- covid %>% merge(dist_pop, by="dist_id", all.x=TRUE)
# Jena is district id 16053.
covid$treat <- covid$dist_id==16053
# Let's trim to a reasonable window around the event date.
covid <- subset(covid, date > as.numeric(as.Date("2020-03-01")) & date <= as.numeric(as.Date("2020-05-21")))
Now that we’ve imported and integrated the data, we can visualize the cumulative COVID case series for all the districts, to see what was going on in early 2020. I’m going to plot the log of each series. Note that the case volumes are very right-skewed, so if we plot the levels over time we have a very hard time seeing what was going on in most of the (smaller) districts, as the big districts expand the range of the y-axis and compress the small locations together at the bottom of the plot.
# Hard to see much here in the descriptive plot of the time series.
# Jena is there at the bottom, though hard to see.
ggplot(data=covid,aes(x=date,y=log(cum_cases+1),color=factor(treat),group=dist_id,alpha=treat)) +
geom_line() +
geom_vline(xintercept=as.numeric(as.Date("2020-04-06")),color="red") +
xlab(expression(bold(paste("Date (2020)")))) +
ylab(expression(bold(paste("Logarithm of Cases")))) +
scale_alpha_manual(guide=FALSE,values=c(0.25,1))+
scale_color_manual(name="District",labels=c("Others", "Jena"),values=c("gray","blue"))+
ggtitle("Cumulative COVID-19 Cases Over Time") +
theme_economist() +
#Comment the below line out if you don't have Economica fonts installed.
theme(text = element_text(family = "Economica", size = 10), axis.title.y = element_text(margin = margin(t = 0, r = 10, b = 0, l = 0)),axis.title.x = element_text(margin = margin(t = 10, r = 0, b = 0, l = 0)))+
NULL
We can see the treated district, Jena, in blue, right in the middle.
We can also visualize these locations, so we have a sense of where Jena is, and its spatial neighbors, which we may ultimately be worried about excluding, to avoid SUTVA violations.
# We can quickly plot where Jena is located within its state in Germany.
DE_Districts <- readRDS(url("https://biogeo.ucdavis.edu/data/gadm3.6/Rsf/gadm36_DEU_3_sf.rds"))
DE_Districts$isJena <- ifelse(DE_Districts$NAME_2=="Jena", "blue", "lightblue")
ggplot(data=DE_Districts %>% filter(NAME_1=="Thüringen")) + geom_sf(aes(fill = isJena),alpha=0.5) + scale_fill_manual(guide=FALSE,values=c("blue","lightblue"))
I’m going to start off implementing a synthetic control estimate employing a cross-validated LASSO to identify the best set of control districts and their associated weights that minimize prediction error for the treated panel’s covid numbers in the pre-treatment period. After I do this, I’ll pivot to the traditional SCM using the Synth package (hold tight).
The first thing we need to do is to pivot the panel from long format to wide format. This is because we need the control panels’ covid case time series to appear as columns (features) that can be used to predict the time series in Jena.
# I'm demonstrating syntax here for the pivot_wider function. After I transform the data like this, observations are uniquely identified by the date variable (rather than district-date).
covid.wide <- covid %>% pivot_wider(id_cols=c("date"),names_from=c("district"),values_from=c("cum_cases","under.3.years","yr.3.to.under.6.years"))
# I'm also going to then pull out pre-treatment observations, which we will use to 'train' the model that will yield the synthetic control for Jena.
covid.wide.train <- subset(covid.wide,date<as.numeric(as.Date("2020-04-06")))
Now, I’m going to use a 5-fold cross validated LASSO to identify the control district time series that are ‘useful’ for predicting the treatment outcome. LASSO is useful here because we have something of a curse of dimensionality problem. This is because we have ~400 predictors, but just 35 observations!
# We have many more predictors than time periods now, so we let's do some feature selection.
# I'm going ot use LASSO to pick the controls I'll work with.
covid.wide.train.lasso <- remove_empty(covid.wide.train, which=c("rows","cols"))
covid.wide.train_mm <- model.matrix(`cum_cases_ SK Jena`~.,covid.wide.train.lasso)
lasso <- cv.glmnet(covid.wide.train_mm, covid.wide.train$`cum_cases_ SK Jena`, standardize=TRUE,alpha=1,nfolds=5)
ests <- as.matrix(coef(lasso,lasso$lambda.1se))
# Here are the non-zero control panels that lasso selected.
names(ests[ests!=0,])
## [1] "(Intercept)" "`cum_cases_ LK Lüchow-Dannenberg`"
## [3] "`cum_cases_ LK Warendorf`" "`cum_cases_ LK Alzey-Worms`"
## [5] "`cum_cases_ LK Oder-Spree`"
Of course, we could run predictions right off the resulting LASSO model, but I’m going to just quickly use the selected controls in a simple linear model to generate these predictions (it’s faster / easier syntax wise).
# Okay, let's build our 'synthetic control'.
fml.rhs <- paste(c(names(ests[ests!=0,]))[2:length(names(ests[ests!=0,]))],collapse="+")
fml <- as.formula(paste("`cum_cases_ SK Jena`~",fml.rhs))
synth <- lm(data=covid.wide.train,formula=fml)
# Now we can synthesize the control series in the original data, into the post period.
covid.wide$synth <- predict(synth,newdata = covid.wide)
# Now we can visualize the result.
OLS_plot <- ggplot(data=covid.wide,aes(y=synth,x=date,linetype="dashed")) + geom_line() +
geom_line(aes(y=`cum_cases_ SK Jena`,x=date,linetype="solid")) +
geom_vline(xintercept=as.numeric(as.Date("2020-04-06")),color="red") +
xlab(expression(bold(paste("Date (2020)")))) +
ylab(expression(bold(paste("Cumulative COVID-19 Cases")))) +
scale_linetype_manual(name="Series",values=c("dashed","solid"),labels=c("Synth","Jena, DE"))+
ggtitle("Effect of Masks on COVID-19 (LASSO -> OLS Synth)") +
theme_economist() +
#Comment the below line out if you don't have Economica fonts installed.
theme(text = element_text(family = "Economica", size = 10), axis.title.y = element_text(margin = margin(t = 0, r = 10, b = 0, l = 0)),axis.title.x = element_text(margin = margin(t = 10, r = 0, b = 0, l = 0)))+
xlim(as.Date("2020-03-27"),as.Date("2020-04-26")) +
ylim(0,225)+
NULL
# Let's see how we did.
OLS_plot
If we compare the resulting plot against the original from the manuscript, we see that the two are virtually identical! Okay, let’s do this again, using the Synth package.
With the Synth package, we don’t need to manually pivot the data and implement the prediction step. Instead, we employ the provided data preparation function, Synth::dataprep(). There are a few things we need to do here for the Synth package to run; it seems the outcome variable name creates problems, so we rename it to `Y’. We also need to provide a numeric date variable.
# That was quick and dirty, so let's do this properly now.
# We can implement SCM properly using the Synth package.
# This package will implement appropriate non-negativity constraints on the control weights.
# This package will also minimize MSPE (whereas a simple OLS minimizes SSE)
# The outcome variable renamed to Y as Synth isn't liking the original variable name.
covid <- covid %>% rename(Y = cum_cases)
# Let's also eliminate missing values.
covid <- subset(covid,complete.cases(covid))
# Synth requires numeric date and unit ID variables.
covid$date2 <- as.numeric(covid$date)
# Synth requires a list of control ID's - let's pull them out here.
# Note: 16053 is Jena.
# The authors also omitted a few neighboring districts, and districts that instituted their own mask policies soon after Jena.
# We thus also exclude 16071; Weimarer Land, 16062; Nordhausen, and 8325; Rottweil.
# The authors also omitted Saale-Holzland, but we lose it due to missing data on district features anyway.
dist_ids <- unique(covid$dist_id)
control_ids <- dist_ids[dist_ids != 16053 & dist_ids != 16071 & dist_ids != 16062 & dist_ids != 8325]
# Now we use Synth's data preparation package.
dataprep.out=
dataprep(foo = covid,
dependent = "Y",
unit.variable = "dist_id",
time.variable = "date2",
# The authors used a lot of seemingly irrelevant predictors
# For example, average female age? Average male age? Why is gender important?
# I am going to keep things simple here: pharmacies, physicians and elderly.
predictors = c("pharmacies","physicians","yr.75.years.and.over"),
predictors.op = "mean",
# We can also predict using case volumes day before treatment and week before treatment.
special.predictors = list(list("Y", 18356, "mean"),list("Y", 18350, "mean")),
#which panel is treated?
treatment.identifier = 16053,
#which panels are we using to construct the synthetic control?
# Controls here will be every other district.
controls.identifier = control_ids,
#what is the pre-treatment time period?
#these numeric values correspond to 34 days before treatment.
#the paper only uses the 14 days before treatment for some reason?
time.predictors.prior = c(18323:18357),
time.optimize.ssr = c(18323:18357),
#name of panel units
unit.names.variable = "district",
#time period to generate the plot for.
#paper only goes 20 days post treatment because other treatments started.
#We will just see what this looks like, however.
time.plot = 18343:18403)
# Note that this will take a few minutes to run.
synth.out = synth(dataprep.out)
##
## X1, X0, Z1, Z0 all come directly from dataprep object.
##
##
## ****************
## searching for synthetic control unit
##
##
## ****************
## ****************
## ****************
##
## MSPE (LOSS V): 472.0547
##
## solution.v:
## 0.0529831 0.001509989 0.005108959 0.415707 0.5246909
##
## solution.w:
## 0.006564324 0.0004055885 0.002139966 0.0005085972 0.0004717398 0.0001726535 0.001062503 0.0003717301 0.0001553077 0.0006183037 0.0003629104 0.0004292388 0.000214448 0.0005860168 0.0002270336 2.79219e-05 0.0002271632 0.000628013 0.0004829527 0.000390823 0.0002124507 0.0004434707 0.0005673657 0.0002652501 0.0003682191 0.0001827385 0.0001242331 0.0002416202 0.0002879616 0.0005218277 0.0003477228 0.0004412265 0.0003997039 0.0004270034 0.0007675313 0.000144265 0.001345292 0.0003719313 0.001306577 0.0007560607 0.0004403459 0.000663049 0.0003067929 0.001126626 0.05237426 0.0003507623 0.0005205719 0.0008072831 0.0005942938 0.0004740093 0.000620547 0.0003305482 0.0005693599 0.0003246393 0.0004914106 0.0003196319 0.0001490839 0.0002105214 0.00180959 0.00115103 0.0002978122 0.002039214 0.0001240423 0.0002045301 0.001574565 0.0002616505 0.0002325824 0.0003221188 0.0002445388 0.0004001476 0.0004201843 0.0001471107 0.0004923532 0.000188074 0.0001523593 0.0004416121 0.0002053172 0.0001631632 0.0003219535 9.58203e-05 0.000234878 0.0002050765 0.06544548 0.0004305961 0.0001964088 0.001757817 0.0001405515 0.0003435191 0.0001139985 0.0001188078 0.002006019 0.0003078525 0.001077941 0.0002023739 0.0001605391 0.0002187687 0.0002118577 0.0001120851 0.0005017758 0.0001682759 0.0004192505 0.0002007967 0.0009861359 0.000207707 0.0002424023 0.0005260073 0.0001512545 0.0005050657 0.0001253503 0.001209401 0.000357899 0.0002991036 0.0002980169 0.0004344054 0.0002979694 0.0004105386 0.0005220736 0.0005506335 0.0009082208 0.0002931952 0.0001741712 0.0004410594 0.0003583363 0.001469201 0.0006399747 0.0009927966 0.000267859 0.0005731917 0.001610907 0.0007549825 0.0007665145 0.0005780913 0.0005685147 0.000346942 0.0004736354 0.0007802962 0.000480617 0.0002575291 0.0002197481 0.0004033966 0.0005396109 0.01702708 0.0006910843 0.0002728648 0.0004891146 0.0009247413 0.0004451911 0.0003947985 0.0012924 0.0006579095 0.0003279909 0.0002251532 0.0003035838 0.0006816194 0.0001100815 0.000110077 0.0001957902 8.92393e-05 0.000161345 0.0003177445 0.0001173775 0.0001525583 0.0001868206 0.0003078109 9.68291e-05 0.0004031858 0.0004374914 0.0003373717 0.0002773775 0.0002698433 0.0002364188 0.000133005 9.85326e-05 0.0001511048 0.0001293801 0.0001198902 0.0001279613 0.0001923618 7.40931e-05 0.0002496367 0.0003414894 0.0001834988 0.0009165521 0.0001394759 0.0001281896 4.3254e-06 0.0002836983 0.0001253415 0.000200505 0.0001666862 0.0001124552 0.0001868573 0.0002221662 0.0002759306 0.0001302242 0.0001187129 0.0002096688 0.0002030344 0.0002388523 0.0001376266 0.0001015472 0.0003670412 0.0001559687 6.56068e-05 0.0002574874 9.62221e-05 0.0002260809 0.0002909219 0.0001239602 0.0001139023 0.0008568023 0.0008437727 0.0001246845 0.000159693 0.0001179923 0.000247116 0.0001071825 0.0001161343 0.0004820284 4.244e-07 0.0001889062 1.127e-07 0.0001614196 0.000234749 0.0002434387 9.18673e-05 0.0001718004 0.0001230177 7.74223e-05 0.0001725594 0.0001950936 1.2617e-06 0.0001457934 0.000180458 0.0002065278 0.0004320233 0.0002392633 0.0001771677 0.000711023 0.0003375406 0.0002776858 9.96867e-05 7.72069e-05 0.0002154276 0.0001955858 9.43256e-05 0.0008514579 0.0001518622 0.0002032977 0.0001075794 0.000104161 0.0003480851 0.000156387 0.0002204248 0.0001045169 3.20147e-05 0.000328717 0.0001293496 0.0002312601 0.000312998 0.0003495092 0.000322791 0.0002070512 0.0003287026 0.0001711766 0.0002519045 0.0003092016 0.0001423332 0.0003898202 6.67261e-05 0.0002612041 0.0002425694 0.0002583305 0.0003137275 0.000161908 0.0002678846 0.0001149333 0.0002722593 0.0001508385 0.0005065323 7.58301e-05 0.000201842 0.0002509582 0.0001001168 0.0003467085 4.703e-07 0.219204 0.05040197 0.0001100537 0.0003030347 0.0002735372 0.0004007101 0.0002149661 0.000498744 0.0003815238 0.0006121857 0.1031712 0.0004561474 0.0001271576 0.0009067293 2.2528e-06 0.0003776801 0.0004829234 0.008468471 0.002414929 0.000341158 0.0004588724 0.0003994225 0.0003864519 0.0004557052 0.0005491755 0.0003457466 0.0002201051 0.000166056 0.0002779092 0.0006580341 0.0001340395 0.0002304415 0.0001902925 0.000209378 0.0003712971 0.0002737388 0.0006609773 0.0002704255 0.001077984 0.0006974887 0.0005978265 0.0003584995 0.0006985959 0.0004581608 0.001427668 0.0004159547 0.00164145 0.0004776158 0.3425815 0.0006896883 0.000220863 0.0007110676 0.001627077 0.0005692535 0.001382997 0.001282586 0.0003697402 0.0005982611 0.0003684731 0.0008136127
We can then use the native plotting functions from Synth to visualize the result, as follows. These plots rely on the base plotting tools in R. The result is essentially the same, except that now we get a weaker treatment effect. I should note here that when I omitted the additional predictor covariates in my first pass at this, and the resulting synthetic control looked quite poor in the pre period. I then explored the inclusion of additional covariate predictors in something of a trial-and-error manner to arrive at this result.
# Path.plot() plots the synthetic against the actual treated unit data.
path.plot(dataprep.res = dataprep.out, synth.res = synth.out,Xlab="Date",Ylab="Cumulative COVID-19 Cases",Main="Comparison of Synth vs. Actual Cum. COVID-19 Cases in Jena, Germany")
# And we can add a vertical line where the treatment occurred.
abline(v=18358,lty=2,col="red")
Alternatively, we can pull out the results and generate our own, prettier plots, as follows.
# Let's pull out the data from the result, to make our own nicer plots in ggplot of course
synth_data_out = data.frame(dataprep.out$Y0plot%*%synth.out$solution.w)
date = as.numeric(row.names(synth_data_out))
plot.df = data.frame(y=covid$Y[covid$dist_id==16053 & covid$date2 %in% date])
plot.df$synth = synth_data_out$w.weight
plot.df$date <- covid$date[covid$dist_id==16053 & covid$date2 %in% date]
SCM_plot <- ggplot(plot.df,aes(y=y,x=date,linetype="solid")) + geom_line() +
geom_line(aes(y=synth,x=date,linetype="dashed")) +
geom_vline(xintercept=18358,color="red") +
xlab(expression(bold(paste("Date (2020)")))) +
ylab(expression(bold(paste("Cumulative COVID-19 Cases")))) +
scale_linetype_manual(name="Series",values=c("dashed","solid"),labels=c("Synth","Jena, DE"))+
ggtitle("Effect of Masks on COVID-19 (Proper Synth)") +
theme_economist() +
#Comment the below line out if you don't have Economica fonts installed.
theme(text = element_text(family = "Economica", size = 10), axis.title.y = element_text(margin = margin(t = 0, r = 10, b = 0, l = 0)),axis.title.x = element_text(margin = margin(t = 10, r = 0, b = 0, l = 0)))+
ylim(0,225)+
NULL
# Let's put the two plots side-by-side.
SCM_plot
… et voila!
]]>The incidental parameters problem (henceforth IPP) is a well-known problem in econometrics, dating back more than 80 years. The seminal paper introducing the problem was published by Jerzy Neyman and Betty Scott in Econometrica in 1948. I won’t rehash the details of the issue here, but I’ll give you the punch line.
With linear regression, we generally address time invariant confounds in panel data via fixed effects (either panel dummies, or the mathematically equivalent within-transformation).
Unfortunately, those methods do not ‘work’ in non-linear regression, because the fixed effects enter the specification multiplicatively (the only exception here is Poisson regression, where the math still works out as a matter of convenience).
Fortunately, Gary Chamberlain derived conditional logit, wherein the panel fixed effects are instead addressed via a transformation to the likelihood function.
While I encourage anyone to go and read these papers, my goal here is to demonstrate concretely, via simulation, just how poorly a dummy fixed effect estimator performs in the case of panel logistic regression. I will first simulate a simple scenario of static confounding, wherein a standard logistic regression model fails to recover the parameter of interest. I will then compare its performance against alternative correction procedures, namely a dummy fixed effect estimator and the Chamberlain conditional logit estimator. Notably a more recent literature pursues bias correction methods for general non-linear panel data models, e.g., Fernandez-Val (2009) in the Journal of Econometrics. I will thus also explore the efficacy of Fernandez-Val’s bias correction following dummy fixed effect logit.
The simulation setup is as follows. I begin with 100 panel units and simulate a time-invariant characteristic that serves as the confound. I then combine the units into a data frame.
# I begin by setting some simulation parameters.
set.seed(2040)
n <- 100
sims <- 100
panel_max <- 10
# Lastly, for each panel unit
confound <- rnorm(n, mean = 0, sd=1)
id <- seq(n)
treatment_grp <- rbinom(n=n, size=1, prob=confound > median(confound))
df <- data.frame(confound=confound,id = id, treatment_grp = treatment_grp)
In each run of the simulation, I create a new panel data set, extending off the basic setup. I expand the original data-frame into repeated observations of each panel, to varying lengths, T = 2..10. With each new panel dataset, I turn the treatment ‘on’ for the treatment group at the panel midpoint (rounded down).
After I create a dataset, I estimate four alternative logistic regression models:
# I create an empty data frame to store all the simulation results, based on the simulation paramters
sim_result <- expand.grid(periods = seq(from=2,to=panel_max,by=1),run = seq(sims))
# Simulate each panel length 100 times and re-run all models.
for (r in 1:sims){
for (p in seq(from=2,to=panel_max,by=1)){
# Simulate a panel of p periods.
# Make the data with whole panel length
df_tmp <- expandRows(df, count=p, count.is.col=FALSE)
# Treatment begins midway through the panel for treated units.
df_tmp <- df_tmp %>% group_by(id) %>% mutate(period = seq(from=1,to=p,by=1)) %>% mutate(Treat = (period > floor((p+1)/2))*treatment_grp) %>% arrange(treatment_grp,id,period) %>% select(-c(treatment_grp))
df_tmp$Treat_center <- df_tmp$Treat - mean(df_tmp$Treat)
df_tmp$xb <- 2*df_tmp$Treat_center + 4*df_tmp$confound
df_tmp$p <- inv.logit(df_tmp$xb)
df_tmp$Y <- rbinom(nrow(df_tmp),size=1, prob=df_tmp$p)
# Estimate our four logit models.
logit_mod_control <- glm(data=df_tmp,Y ~ Treat_center + confound, family = "binomial")
logit_mod_omit <- glm(data=df_tmp,Y ~ Treat_center, family = "binomial")
logit_mod_fe <- clogit(data=df_tmp,Y ~ Treat_center + strata(id))
logit_mod_ipp <- glm(data=df_tmp,Y ~ Treat_center + factor(id), family = "binomial")
# Sometimes the analytical correction fails, and it doesn't handle it well.
tryCatch(
{
# Let's set a placeholder result to NA in case the estimation fails.
logit_mod_bife <- data.frame(coefficients=NA)
suppressWarnings(logit_mod_bife <- bias_corr(bife(Y~Treat_center | id,data=df_tmp)))
},error=function(cond){
# BIFE crapped out.
return(logit_mod_bife)
},warning=function(cond){
# BIFE generated a warning.
return(logit_mod_bife)
}
)
sim_result_row <- data.frame(cbind(periods=p,run=r,b_control=logit_mod_control$coefficients[2],b_omit=logit_mod_omit$coefficients[2],b_fe=logit_mod_fe$coefficients[1],b_ipp=logit_mod_ipp$coefficients[2],b_bife=logit_mod_bife$coefficients[1]))
sim_result <- sim_result %>% left_join(sim_result_row,by=c("periods","run"))
sim_result$b_control.y[is.na(sim_result$b_control.y) ] <- sim_result$b_control.x[ is.na(sim_result$b_control.y)]
sim_result$b_omit.y[is.na(sim_result$b_omit.y) ] <- sim_result$b_omit.x[ is.na(sim_result$b_omit.y)]
sim_result$b_fe.y[is.na(sim_result$b_fe.y) ] <- sim_result$b_fe.x[ is.na(sim_result$b_fe.y)]
sim_result$b_ipp.y[is.na(sim_result$b_ipp.y) ] <- sim_result$b_ipp.x[ is.na(sim_result$b_ipp.y)]
sim_result$b_bife.y[is.na(sim_result$b_bife.y) ] <- sim_result$b_bife.x[ is.na(sim_result$b_bife.y)]
if("b_control.y" %in% colnames(sim_result)){
sim_result <- sim_result %>% select(-c(b_control.x, b_omit.x, b_fe.x, b_ipp.x,b_bife.x)) %>% rename(b_control=b_control.y, b_omit = b_omit.y, b_fe = b_fe.y, b_ipp = b_ipp.y, b_bife = b_bife.y)
}
}
}
Let’s take a quick look at the output from the final set of regressions, in the last iteration. Note that I am omitting the intercept and dummy coefficients in the output.
screenreg(list(logit_mod_omit,logit_mod_ipp,logit_mod_bife,logit_mod_fe,logit_mod_control),
digits=3,omit.coef=c("factor|Intercept|confound"),
custom.model.names=c("logit","logit + dummies","bife","clogit","logit + control"))
##
## ==========================================================================================
## logit logit + dummies bife clogit logit + control
## ------------------------------------------------------------------------------------------
## Treat_center 3.481 *** 1.551 *** 1.391 *** 1.394 *** 1.618 ***
## (0.277) (0.363) (0.354) (0.343) (0.323)
## ------------------------------------------------------------------------------------------
## AIC 1063.720 594.164 296.280 471.803
## BIC 1073.536 1089.847 486.526
## Log Likelihood -529.860 -196.082 -196.181 -232.901
## Deviance 1059.720 392.164 392.362 465.803
## Num. obs. 1000 1000 440 1000 1000
## R^2 0.019
## Max. R^2 0.269
## Num. events 479
## Missings 0
## ==========================================================================================
## *** p < 0.001; ** p < 0.01; * p < 0.05
Okay, but how consistently did each alternative estimator perform, across the 100 runs? Let’s visualize this. One thing to note here is that sometimes the estimators fail to converge, and thus spit out woefully inaccurate estimates (off by orders of magnitude). So, we will first look at the general plots, so you can see how often these happen across the different estimators. Then, we will zoom in on the “valid” estimates!
Now we will zoom in around 2, to the estimates that were generated by models that converged. What is pretty apparent here is that the Naive estimator is inconsistent; it is converging to an incorrect value, regardless of panel length. All of the other estimators converge to the correct value, given enough data and sufficiently long panels. However, they clearly do so at different rates.
Most notably, the dummy fixed effect estimator only manages to converge to the correct value when panels get sufficiently long. In short panels, this estimator performs quite poorly. The analytically corrected dummy fixed effect estimator (estimated using the ‘bife’ package) also does poorly in short panels, but it appears to recover more quickly as panel lengths increase. The conditional logit estimator, on the other hand, does well right off the bat, even in short panels!
A couple of final thoughts.
A natural question you might have is “what constitutes a sufficiently long panel?” Well, that is never truly knowable, because it will depend on your sample and context. Accordingly, as a general rule, you should not use dummy fixed effects.
Be careful about convergence; you will notice that all of these estimators frequently fail to converge. When that happens, in many cases, you still get an estimate, but it is meaningless. Be careful to ensure the model converged before you take an estimate and run with it!