Title: | Synthetic Difference-in-Difference Estimation |
---|---|
Description: | Estimate average treatment effects in panel data. Currently provides methods only for the case that all treated units adopt treatment at the same time. |
Authors: | Dmitry Arkhangelsky [aut] Susan Athey [aut] David A. Hirshberg [aut, cre] Guido W. Imbens [aut] Stefan Wager [aut] |
Maintainer: | David A. Hirshberg <[email protected]> |
License: | GPL (>= 2) | BSD_3_clause + file LICENSE |
Version: | 0.0.9 |
Built: | 2025-01-20 04:41:06 UTC |
Source: | https://github.com/skranz/synthdid |
compute the correlation matrix of a time series generated by an ar2 model
ar2_correlation_matrix(ar_coef, T)
ar2_correlation_matrix(ar_coef, T)
ar_coef |
the coefficients of the ar2 model: c(lag-1-coefficient, lag-2-coefficient) |
T |
the length of the time series |
the correlation matrix
A dataset containing per-capita cigarette consumption (in packs).
In year 1989 California imposed a Tobacco tax. The column treated
is 1 from then on for California.
data(california_prop99)
data(california_prop99)
A data frame with 1209 rows and 4 variables:
US state name, character string
Year, integer
per-capita cigarette consumption, numeric
the treatmed indicator 0: control, 1: treated, numeric
Abadie, Alberto, Alexis Diamond, and Jens Hainmueller. "Synthetic control methods for comparative case studies: Estimating the effect of California’s tobacco control program." Journal of the American statistical Association 105, no. 490 (2010): 493-505.
# Load tobacco sales in long panel format. data("california_prop99") # Transform to N*T matrix format required for synthdid, # where N is the number of units and T the time periods. setup <- panel.matrices(california_prop99)
# Load tobacco sales in long panel format. data("california_prop99") # Transform to N*T matrix format required for synthdid, # where N is the number of units and T the time periods. setup <- panel.matrices(california_prop99)
CPS
data(CPS)
data(CPS)
A data frame with 2000 rows and 8 variables.
state
year
log_wage
hours
urate
min_wage
open_carry
abort_ban
Decompose Y into components F, M, and E as described in Section 3.1.1 also computes a set of 'unit factors': the first rank left singular vectors from SVD(Y)
decompose_Y(Y, rank)
decompose_Y(Y, rank)
Y |
the outcomes |
rank |
the assumed rank of the signal component L=F+M |
a list with elements F, M, E, and unit_factors
synthdid_estimate for diff-in-diff estimates. Takes all the same parameters, but by default, passes options to use the diff-in-diff estimator
did_estimate(Y, N0, T0, ...)
did_estimate(Y, N0, T0, ...)
Y |
the observation matrix. |
N0 |
the number of control units. Rows 1-N0 of Y correspond to the control units. |
T0 |
the number of pre-treatment time steps. Columns 1-T0 of Y correspond to pre-treatment time steps. |
... |
additional options for synthdid_estimate |
an object like that returned by synthdid_estimate
Estimates the DGP parameters used in the placebo studies in Sections 3 and 5 of the synthetic difference in differences paper. Described there in Section 3.1.1.
estimate_dgp(Y, assignment_vector, rank)
estimate_dgp(Y, assignment_vector, rank)
Y |
an NxT matrix of outcomes |
assignment_vector |
an Nx1 vector of treatment assignments |
rank |
the rank of the estimated signal component L |
a list with elements F, M, Sigma, pi as described in Section 3.1.1 and an element ar_coef with the AR(2) model coefficients underlying the covariance Sigma
Estimate ar2 coefficients from iid time series
fit_ar2(E)
fit_ar2(E)
E |
a matrix with those time series as rows |
a vector of ar2 coefficients: c(lag-1-coefficient, lag-2-coefficient)
Format a synthdid object
## S3 method for class 'synthdid_estimate' format(x, ...)
## S3 method for class 'synthdid_estimate' format(x, ...)
x |
The object to format |
... |
Additional arguments (currently ignored). |
Computes a density estimator by smoothing a histogram using Poisson regression. Implementation of "Lindsey's method", as descrbied in Chapter 10 of "Computer age statistical inference: algorithms, evidence, and data science' by Bradley Efron and Trevor Hastie (2016).
lindsey_density_estimate(x, K, deg)
lindsey_density_estimate(x, K, deg)
x |
|
K |
|
deg |
|
a list with 2 fields, centers and density, which are K-dimensional vectors containing the bin centers and estimated density within each bin respectively.
Converts a data set in panel form to matrix format required by synthdid estimators. A typical long panel date set looks like [unit, time, outcome, treatment]. Synthdid requires a balanced panel with simultaneous adoption of treatment: each unit must be observed at all times, and all treated units must begin treatment simultaneosly. This function creates num.units x num.time.periods matrices Y and W of outcomes and treatment indicators. In these matrices, columns are sorted by time, and by default (when treated.last=TRUE), rows for control units appear before those of treated units.
panel.matrices( panel, unit = 1, time = 2, outcome = 3, treatment = 4, treated.last = TRUE )
panel.matrices( panel, unit = 1, time = 2, outcome = 3, treatment = 4, treated.last = TRUE )
panel |
A data.frame with columns consisting of units, time, outcome, and treatment indicator. |
unit |
The column number/name corresponding to the unit identifier. Default is 1. |
time |
The column number/name corresponding to the time identifier. Default is 2. |
outcome |
The column number/name corresponding to the outcome identifier. Default is 3. |
treatment |
The column number/name corresponding to the treatment status. Default is 4. |
treated.last |
Should we sort the rows of Y and W so treated units are last. If FALSE, sort by unit number/name. Default is TRUE. |
A list with entries Y
: the data matrix, N0
: the number of control units, T0
:
the number of time periods before treatment, W
: the matrix of treatment indicators.
# Load tobacco sales in long panel format. data("california_prop99") # Transform to N*T matrix format required for synthdid, # where N is the number of units and T the time periods. setup <- panel.matrices(california_prop99, unit = 1, time = 2, outcome = 3, treatment = 4) # Compute synthdid estimate synthdid_estimate(setup$Y, setup$N0, setup$T0)
# Load tobacco sales in long panel format. data("california_prop99") # Transform to N*T matrix format required for synthdid, # where N is the number of units and T the time periods. setup <- panel.matrices(california_prop99, unit = 1, time = 2, outcome = 3, treatment = 4) # Compute synthdid estimate synthdid_estimate(setup$Y, setup$N0, setup$T0)
PENN
data(PENN)
data(PENN)
A data frame with 3219 rows and 5 variables.
country
year
log_gdp
dem
educ
Plot a synthdid object
## S3 method for class 'synthdid_estimate' plot(x, ...)
## S3 method for class 'synthdid_estimate' plot(x, ...)
x |
The object to plot |
... |
Additional arguments (currently ignored). |
Print a synthdid object
## S3 method for class 'synthdid_estimate' print(x, ...)
## S3 method for class 'synthdid_estimate' print(x, ...)
x |
The object to print |
... |
Additional arguments (currently ignored). |
randomize treatment to n units with probability pi then if the number of treated units is zero, assign treatment to one unit uniformly at random and if the number of treated units exceeds a cap, remove treatment uniformly at random so it is exactly that cap
randomize_treatment(pi, N, N1)
randomize_treatment(pi, N, N1)
pi |
the randomization probabilities |
N |
the number of units |
N1 |
the cap on the number of treated units |
a binary vector of length N, with ones indicating assignment to treatment
synthdid_estimate for synthetic control estimates. Takes all the same parameters, but by default, passes options to use the synthetic control estimator By default, this uses only 'infinitesimal' ridge regularization when estimating the weights.
sc_estimate(Y, N0, T0, eta.omega = 1e-06, ...)
sc_estimate(Y, N0, T0, eta.omega = 1e-06, ...)
Y |
the observation matrix. |
N0 |
the number of control units. Rows 1-N0 of Y correspond to the control units. |
T0 |
the number of pre-treatment time steps. Columns 1-T0 of Y correspond to pre-treatment time steps. |
eta.omega |
determines the level of ridge regularization, zeta.omega = eta.omega * noise.level, as in synthdid_estimate. |
... |
additional options for synthdid_estimate |
an object like that returned by synthdid_estimate
Simulates data from DGPs used in the placebo studies in Sections 3 and 5 of the synthetic difference in differences paper. Described there in Section 3.1.1.
simulate_dgp(parameters, N1, T1)
simulate_dgp(parameters, N1, T1)
parameters |
a list of dgp parameters (F,M,Sigma,pi) as output by estimate.dgp |
N1 |
a cap on the number of treated units, |
T1 |
the number of treated periods. |
a list with 3 elements: the outcome matrix Y, the number of control units N0, and the number of control periods T0. The first N0 rows of Y are for units assigned to control, the remaining rows are for units assigned to treatment.
A function mapping a numeric vector to a (presumably sparser) numeric vector of the same shape to be passed onto synthdid_estimate.
sparsify_function(v)
sparsify_function(v)
v |
a vector |
Summarize a synthdid object
## S3 method for class 'synthdid_estimate' summary(object, weight.digits = 3, fast = FALSE, ...)
## S3 method for class 'synthdid_estimate' summary(object, weight.digits = 3, fast = FALSE, ...)
object |
The object to summarize |
weight.digits |
The number of digits to use when displaying weights (omega, lambda) |
fast |
Be fast but less accurate, e.g. jackknife instead of bootstrap se estimate |
... |
Additional arguments (currently ignored). |
Outputs a table of important synthetic controls and their corresponding weights, sorted by weight. The table is truncated to exclude synthetic controls that do not matter for any estimate — for each estimate, the truncated controls may have total weight no larger that 1-mass.
synthdid_controls(estimates, sort.by = 1, mass = 0.9, weight.type = "omega")
synthdid_controls(estimates, sort.by = 1, mass = 0.9, weight.type = "omega")
estimates |
a list of estimates output by synthdid_estimate. Or a single estimate. |
sort.by |
the index of the estimate to sort by. Defaults to 1. |
mass |
which controls the length of the table. Defaults to 0.9. |
weight.type |
'omega' for units, 'lambda' for time periods |
Outputs the effect curve that was averaged to produce our estimate
synthdid_effect_curve(estimate)
synthdid_effect_curve(estimate)
estimate |
as output by synthdid_estimate |
See 'Synthetic Difference in Differences' by Arkhangelsky et al. This implements Algorithm 1.
synthdid_estimate( Y, N0, T0, X = array(dim = c(dim(Y), 0)), noise.level = sd(apply(Y[1:N0, 1:T0], 1, diff)), eta.omega = ((nrow(Y) - N0) * (ncol(Y) - T0))^(1/4), eta.lambda = 1e-06, zeta.omega = eta.omega * noise.level, zeta.lambda = eta.lambda * noise.level, omega.intercept = TRUE, lambda.intercept = TRUE, weights = list(omega = NULL, lambda = NULL), update.omega = is.null(weights$omega), update.lambda = is.null(weights$lambda), min.decrease = 1e-05 * noise.level, max.iter = 10000, sparsify = sparsify_function, max.iter.pre.sparsify = 100 )
synthdid_estimate( Y, N0, T0, X = array(dim = c(dim(Y), 0)), noise.level = sd(apply(Y[1:N0, 1:T0], 1, diff)), eta.omega = ((nrow(Y) - N0) * (ncol(Y) - T0))^(1/4), eta.lambda = 1e-06, zeta.omega = eta.omega * noise.level, zeta.lambda = eta.lambda * noise.level, omega.intercept = TRUE, lambda.intercept = TRUE, weights = list(omega = NULL, lambda = NULL), update.omega = is.null(weights$omega), update.lambda = is.null(weights$lambda), min.decrease = 1e-05 * noise.level, max.iter = 10000, sparsify = sparsify_function, max.iter.pre.sparsify = 100 )
Y |
the observation matrix. |
N0 |
the number of control units (N_co in the paper). Rows 1-N0 of Y correspond to the control units. |
T0 |
the number of pre-treatment time steps (T_pre in the paper). Columns 1-T0 of Y correspond to pre-treatment time steps. |
X |
an optional 3-D array of time-varying covariates. Shape should be N X T X C for C covariates. |
noise.level |
an estimate of the noise standard deviation sigma. Defaults to the standard deviation of first differences of Y. |
eta.omega |
determines the tuning parameter zeta.omega = eta.omega * noise.level. Defaults to the value (N_tr T_post)^(1/4). |
eta.lambda |
analogous for lambda. Defaults to an 'infinitesimal' value 1e-6. |
zeta.omega |
if passed, overrides the default zeta.omega = eta.omega * noise.level. Deprecated. |
zeta.lambda |
analogous for lambda. |
omega.intercept |
Binary. Use an intercept when estimating omega. |
lambda.intercept |
Binary. Use an intercept when estimating lambda. |
weights |
a list with fields lambda and omega. If non-null weights$lambda is passed, we use them instead of estimating lambda weights. Same for weights$omega. |
update.omega |
If true, solve for omega using the passed value of weights$omega only as an initialization. If false, use it exactly as passed. Defaults to false if a non-null value of weights$omega is passed. |
update.lambda |
Analogous. |
min.decrease |
Tunes a stopping criterion for our weight estimator. Stop after an iteration results in a decrease in penalized MSE smaller than min.decrease^2. |
max.iter |
A fallback stopping criterion for our weight estimator. Stop after this number of iterations. |
sparsify |
A function mapping a numeric vector to a (presumably sparser) numeric vector of the same shape, which must sum to one. If not null, we try to estimate sparse weights via a second round of Frank-Wolfe optimization initialized at sparsify( the solution to the first round ). |
max.iter.pre.sparsify |
Analogous to max.iter, but for the pre-sparsification first-round of optimization. Not used if sparsify=NULL. |
An average treatment effect estimate with 'weights' and 'setup' attached as attributes. 'weights' contains the estimated weights lambda and omega and corresponding intercepts, as well as regression coefficients beta if X is passed. 'setup' is a list describing the problem passed in: Y, N0, T0, X.
Computes a placebo variant of our estimator using pre-treatment data only
synthdid_placebo(estimate, treated.fraction = NULL)
synthdid_placebo(estimate, treated.fraction = NULL)
estimate |
as output by synthdid_estimate |
treated.fraction |
the fraction of pre-treatment data to use as a placebo treatment period Defaults to NULL, which indicates that it should be the fraction of post-treatment to pre-treatment data |
For our estimator and a placebo, plots treated and synthetic control trajectories and overlays a 2x2 diff-in-diff diagram. Requires ggplot2
synthdid_placebo_plot(estimate, overlay = FALSE, treated.fraction = NULL)
synthdid_placebo_plot(estimate, overlay = FALSE, treated.fraction = NULL)
estimate |
as output by synthdid_estimate. |
overlay |
binary, indicates whether plots should be overlaid or shown in different facets. Defaults to FALSE. |
treated.fraction |
as in synthdid_placebo |
For SC estimates, i.e., if lambda is a vector of zeros, plots the trajectories and SC estimate of the effect, but no diagram.
synthdid_plot( estimates, treated.name = "treated", control.name = "synthetic control", spaghetti.units = c(), spaghetti.matrices = NULL, facet = NULL, facet.vertical = TRUE, lambda.comparable = !is.null(facet), overlay = 0, lambda.plot.scale = 3, trajectory.linetype = 1, effect.curvature = 0.3, line.width = 0.5, guide.linetype = 2, point.size = 1, trajectory.alpha = 0.5, diagram.alpha = 0.95, effect.alpha = 0.95, onset.alpha = 0.3, ci.alpha = 0.3, spaghetti.line.width = 0.2, spaghetti.label.size = 2, spaghetti.line.alpha = 0.3, spaghetti.label.alpha = 0.5, se.method = "jackknife", alpha.multiplier = NULL )
synthdid_plot( estimates, treated.name = "treated", control.name = "synthetic control", spaghetti.units = c(), spaghetti.matrices = NULL, facet = NULL, facet.vertical = TRUE, lambda.comparable = !is.null(facet), overlay = 0, lambda.plot.scale = 3, trajectory.linetype = 1, effect.curvature = 0.3, line.width = 0.5, guide.linetype = 2, point.size = 1, trajectory.alpha = 0.5, diagram.alpha = 0.95, effect.alpha = 0.95, onset.alpha = 0.3, ci.alpha = 0.3, spaghetti.line.width = 0.2, spaghetti.label.size = 2, spaghetti.line.alpha = 0.3, spaghetti.label.alpha = 0.5, se.method = "jackknife", alpha.multiplier = NULL )
estimates |
a list of estimates output by synthdid_estimate. Or a single estimate. |
treated.name |
the name of the treated curve that appears in the legend. Defaults to 'treated' |
control.name |
the name of the control curve that appears in the legend. Defaults to 'synthetic control' |
spaghetti.units |
a list of units to plot individually. spaghetti.unit %in% rownames(Y) must work. Defaults to the empty list. |
spaghetti.matrices |
a list of matrices — one for each element of estimates — of trajectories to plot individually. The rows of these matrices should be the same length as the trajectories in Y and they must be named — set rownames(spaghetti.trajectories[[i]]) — so trajectories can be labeled in the plot. |
facet |
a list of the same length as estimates indicating the facet in which to plot each estimate. The values of the elements of the list are used to label the facets. If NULL, plot each estimate in a different facet. Defaults to NULL. |
facet.vertical |
TRUE if facets should be stacked vertically. Defaults to FALSE (horizonal). |
lambda.comparable |
TRUE if the weights lambda should be plotted in such a way that the ribbons have the same mass from plot to plot, assuming the treated curve is the same. Useful for side-by-side or overlaid plots. Defaults to FALSE if facet is not passed, TRUE if passed. |
overlay |
a number in [0,1] defaulting to 0, can be used to shift the control trajectory toward the treated trajectory. When a nonzero value is passed, we plot the control after subtracting that fraction of the diff-in-diff style adjustment for the difference between lambda-weighted pre-treatment averages of the treated and control. With intercept of almost one, this makes it easier to assess parallel-ness by making trajectories closer With intercept of one, this essentially overlays the curves, and plotting a diagram is suppressed as in the case of a SC estimate. To use different values for different plots, pass these values as an attribute 'overlay' of each estimate. If a vector is passed, plots at different intercept levels indicated by the 'frame' aesthetic. ggplotly will interpret this as an animation. |
lambda.plot.scale |
determines the scale of the plot of the weights lambda. |
trajectory.linetype |
the linetype of the treated and synthetic control trajectories |
effect.curvature |
the curvature of the arrows indicating the treatment effect. Defaults to zero. Nonzero values help avoid overplotting when plotting multiple estimates in one facet. |
line.width |
the line width. |
guide.linetype |
determines the (ggplot) linetype of the vertical segments of the parallelogram |
point.size |
determines the size of the points of the parallelogram |
trajectory.alpha |
determines transparency of trajectories |
diagram.alpha |
determines transparency of diff-in-diff diagram |
effect.alpha |
determines transparency of effect arrows |
onset.alpha |
determines transparency of vertical lines indicating onset of treatment |
ci.alpha |
determines transparency of the arrows illustrating upper and lower bounds of a 95% confidence interval for the effect |
spaghetti.line.width |
determines the width of spaghetti trajectories |
spaghetti.label.size |
determines the size of spaghetti labels |
spaghetti.line.alpha |
determines transparency of spaghetti trajectories |
spaghetti.label.alpha |
determines transparency of spaghetti trajectory labels |
se.method |
determines the method used to calculate the standard error used for this confidence interval. if 'none', don't show the interval |
alpha.multiplier |
a vector of the same length as estimates, is useful for comparing multiple estimates in one facet but highlighting one or several. All plot elements associated with the estimate are displayed with alpha multiplied by the corresponding element of alpha.multiplier. Defaults to a vector of ones. |
Requires ggplot2 Due to differences between ggplot and ggplotly, this will warn about an unknown aesthetic frame.
A diagnostic plot for sc.weight.fw.covariates. Plots the objective function, regularized RMSE, as a function of the number of Frank-Wolfe / Gradient steps taken. Requires ggplot2
synthdid_rmse_plot(estimates)
synthdid_rmse_plot(estimates)
estimates |
a list of estimates output by synthdid_estimate. Or a single estimate. |
Calculate the standard error of a synthetic diff in diff estimate. Deprecated. Use vcov.synthdid_estimate.
synthdid_se(...)
synthdid_se(...)
... |
Any valid arguments for vcov.synthdid_estimate |
Plots unit by unit difference-in-differences. Dot size indicates the weights omega_i used in the average that yields our treatment effect estimate. This estimate and endpoints of a 95% CI are plotted as horizontal lines. Requires ggplot2
synthdid_units_plot( estimates, negligible.threshold = 0.001, negligible.alpha = 0.3, se.method = "jackknife", units = NULL )
synthdid_units_plot( estimates, negligible.threshold = 0.001, negligible.alpha = 0.3, se.method = "jackknife", units = NULL )
estimates |
as output by synthdid_estimate. Can be a single one or a list of them. |
negligible.threshold |
Unit weight threshold below which units are plotted as small, transparent xs instead of circles. Defaults to .001. |
negligible.alpha |
Determines transparency of those xs. |
se.method |
the method used to calculate standard errors for the CI. See vcov.synthdid_estimate. Defaults to 'jackknife' for speed. If 'none', don't plot a CI. |
units |
a list of control units — elements of rownames(Y) — to plot differences for. Defaults to NULL, meaning all of them. |
timesteps are stored as colnames(Y), but column names cannot be Date objects. Instead, we use strings. If they are strings convertible to dates, return that
timesteps(Y)
timesteps(Y)
Y |
a matrix |
its column names interpreted as Dates if possible
Provides variance estimates based on the following three options
The bootstrap, Algorithm 2 in Arkhangelsky et al.
The jackknife, Algorithm 3 in Arkhangelsky et al.
Placebo, Algorithm 4 in Arkhangelsky et al.
## S3 method for class 'synthdid_estimate' vcov( object, method = c("bootstrap", "jackknife", "placebo"), replications = 200, ... )
## S3 method for class 'synthdid_estimate' vcov( object, method = c("bootstrap", "jackknife", "placebo"), replications = 200, ... )
object |
A synthdid model |
method |
the CI method. The default is bootstrap (warning: this may be slow on large data sets, the jackknife option is the fastest, with the caveat that it is not recommended for SC). |
replications |
the number of bootstrap replications |
... |
Additional arguments (currently ignored). |
The jackknife is not recommended for SC, see section 5 in Arkhangelsky et al. "placebo" is the only option that works for only one treated unit.
Dmitry Arkhangelsky, Susan Athey, David A. Hirshberg, Guido W. Imbens, and Stefan Wager. "Synthetic Difference in Differences". arXiv preprint arXiv:1812.09970, 2019.