0.5 Necessary libraries

library(tidyverse)
library(haven)
library(Synth)
library(devtools)
library(SCtools)
library(sjlabelled)
library(rlist)

1. Importing and cleaning data

Importing the data

smoking <- read_dta("smoking.dta") %>% 
  as.data.frame(.)

Just to get an idea of the data, we can run a summary of smoking.dta

smokingsum <- summary(smoking)
knitr::kable(smokingsum)

state	year	cigsale	lnincome	beer	age15to24	retprice
Min. : 1	Min. :1970	Min. : 40.7	Min. : 9.397	Min. : 2.50	Min. :0.1294	Min. : 27.3
1st Qu.:10	1st Qu.:1977	1st Qu.:100.9	1st Qu.: 9.739	1st Qu.:20.90	1st Qu.:0.1658	1st Qu.: 50.0
Median :20	Median :1985	Median :116.3	Median : 9.861	Median :23.30	Median :0.1781	Median : 95.5
Mean :20	Mean :1985	Mean :118.9	Mean : 9.862	Mean :23.43	Mean :0.1755	Mean :108.3
3rd Qu.:30	3rd Qu.:1993	3rd Qu.:130.5	3rd Qu.: 9.973	3rd Qu.:25.10	3rd Qu.:0.1867	3rd Qu.:158.4
Max. :39	Max. :2000	Max. :296.2	Max. :10.487	Max. :40.40	Max. :0.2037	Max. :351.2
NA	NA	NA	NA’s :195	NA’s :663	NA’s :390	NA

To prepare our data, we need to 1) create a new column, statelabels, where the state names are listed corresponding to their state number; 2) remove the labels from columns statelabels and state; 3) lastly, to ensure compatibility (with the function dataprep() from Synth package), set state class as numeric and statelabel class as character.

#1
smoking$statelabels <- as_label(smoking$state)

#2
remove_label(smoking, statelabels)
remove_label(smoking, state)

#3
smoking$statelabels <- as.character(smoking$statelabels)
smoking$state <- as.numeric(smoking$state)

The data frame now is prepped and looks like the following:

smoking

2. Creating synthetic California

Here we set our indicator variables with: lnincome, beer, age15to24, and retprice. Additionally, we create lagged indicator variables from cigarette pack prices for the following years: 1988, 1980, and 1975. The dependent variable is cigsale and the treatment year is 1988.

dataprep_out <- dataprep(
  foo = smoking,
  predictors = c("lnincome", "beer", "age15to24", "retprice"),
  predictors.op = "mean",
  time.predictors.prior = 1970:1988,
    special.predictors = list(
      list("cigsale", 1988, "mean"),
      list("cigsale", 1980, "mean"),
      list("cigsale", 1975, "mean")),
  dependent = "cigsale",
  unit.variable = "state",
  unit.names.variable = "statelabels",
  time.variable = "year",
  treatment.identifier = 3,
  controls.identifier = c(1,2,4:39),
  time.optimize.ssr = 1970:1988,
  time.plot = 1970:2000
)
synth_out <- synth(data.prep.obj = dataprep_out)

Running the following two lines of code gives us the following two graphs:

path.plot(synth_out, dataprep_out, Ylab = c("per-capita cigarette sales (in packs)"), Xlab = c("Year"), Main = c("California v Synthetic California"), tr.intake = 1988, Legend = c("Calfornia", "Synthetic California"))

gaps.plot(synth_out, dataprep_out, Ylab = c("gap in per-capita cigarette sales (in packs)"), Xlab = c("Year"), Main = c("Gap in Per-capital Cigarette Sales Between California and Synthetic"), tr.intake = 1988)

The fit and the gap look identical to the ones generated by Adbodie et al.

Indicator means

To generate the table to look our what donors contribute to our synthetic California, we can run the following code:

synth_table <- synth.tab(synth_out, dataprep_out, round.digit = 3)
knitr::kable(
  list(synth_table$tab.pred, synth_table$tab.v),
  booktabs = TRUE, valign = 't'
)

	Treated	Synthetic	Sample Mean
lnincome	10.032	9.840	9.792
beer	24.280	24.133	23.655
age15to24	0.179	0.180	0.178
retprice	66.637	66.109	64.505
special.cigsale.1988	90.100	92.449	113.824
special.cigsale.1980	120.200	120.154	138.089
special.cigsale.1975	127.100	126.822	136.932

	v.weights
lnincome	0.001
beer	0.011
age15to24	0.001
retprice	0.01
special.cigsale.1988	0.072
special.cigsale.1980	0.324
special.cigsale.1975	0.581

California and synthetic California means match up very well together. The same could be said about the sample mean, except for the lagged cigarette sales. We see that the synthetic control method put heavy weights on the lagged cigarette sales for years 1975 and 1980.

knitr::kable(synth_table$tab.w)

	w.weights	unit.names	unit.numbers
1	0.000	Alabama	1
2	0.000	Arkansas	2
4	0.094	Colorado	4
5	0.111	Connecticut	5
6	0.000	Delaware	6
7	0.000	Georgia	7
8	0.000	Idaho	8
9	0.000	Illinois	9
10	0.000	Indiana	10
11	0.000	Iowa	11
12	0.000	Kansas	12
13	0.000	Kentucky	13
14	0.000	Louisiana	14
15	0.000	Maine	15
16	0.000	Minnesota	16
17	0.000	Mississippi	17
18	0.000	Missouri	18
19	0.205	Montana	19
20	0.000	Nebraska	20
21	0.249	Nevada	21
22	0.000	New Hampshire	22
23	0.001	New Mexico	23
24	0.000	North Carolina	24
25	0.000	North Dakota	25
26	0.000	Ohio	26
27	0.000	Oklahoma	27
28	0.000	Pennsylvania	28
29	0.000	Rhode Island	29
30	0.000	South Carolina	30
31	0.000	South Dakota	31
32	0.000	Tennessee	32
33	0.000	Texas	33
34	0.341	Utah	34
35	0.000	Vermont	35
36	0.000	Virginia	36
37	0.000	West Virginia	37
38	0.000	Wisconsin	38
39	0.000	Wyoming	39

We see that our synthetic California is made up of 0.341 Utah, 0.249 Nevada, 0.205 Montana, 0.111 Connecticut, 0.094 Colorado, and 0.001 New Mexico. This composition and the predictor means are extremely close to the ones obtained by Abodie et al.

Placebos, pseudo p-values, and Proposition 99

Now we can generate our placebos and the corresponding p-values:

placebos <- generate.placebos(dataprep_out, synth_out, Sigf.ipop = 4, strategy = "multiprocess")

plot_placebos(placebos)

mspe.plot(placebos, discard.extreme = FALSE, mspe.limit = 1, plot.hist = TRUE)

# calculate and print p-value
ratio <- mspe.test(placebos, discard.extreme = F)
ratio$p.val

## [1] 0.02564103

California has the highest post/pre MSPE ratio and the The probability of obtaining a post/pre MSPE ratio as high as California is 1/39, or 0.02564, which is the same result as obtained by Abodie et al. The average treatment effect from Proposition 99 is estimated to be around a decrease of 24 packs per year from year 1988 to 2000.

3. Dropping Utah

To drop Utah from the set of controls, we can just rerun the code from part 2, dropping 34 (Utah’s numeric label) from our donor pool (controls.identifier), and appending NU (no Utah) to our renamed dataframes (dataprep_out_NU and synth_out_NU).

dataprep_out_NU <- dataprep(
  foo = smoking,
  predictors = c("lnincome", "beer", "age15to24", "retprice"),
  predictors.op = "mean",
  time.predictors.prior = 1970:1988,
    special.predictors = list(
      list("cigsale", 1988, "mean"),
      list("cigsale", 1980, "mean"),
      list("cigsale", 1975, "mean")),
    # list("retprice", 1970:1988, "mean")),
  dependent = "cigsale",
  unit.variable = "state",
  unit.names.variable = "statelabels",
  time.variable = "year",
  treatment.identifier = 3,
  controls.identifier = c(1,2,4:33,35:39),
  time.optimize.ssr = 1970:1988,
  time.plot = 1970:2000
)
synth_out_NU <- synth(data.prep.obj = dataprep_out_NU)

path.plot(synth_out_NU, dataprep_out_NU, Ylab = c("per-capita cigarette sales (in packs)"), Xlab = c("Year"), Main = c("California v Synthetic California"), tr.intake = 1988, Legend = c("Calfornia", "Synthetic California"))

gaps.plot(synth_out_NU, dataprep_out_NU, Ylab = c("gap in per-capita cigarette sales (in packs)"), Xlab = c("Year"), Main = c("Gap in Per-capital Cigarette Sales Between California and Synthetic"), tr.intake = 1988 )

Our fit and gap look pretty good here. The average treatment effect looks similar to the result above at around a decrease of 20 packs per year.

synth_table_NU <- synth.tab(synth_out_NU, dataprep_out_NU, round.digit = 3)
knitr::kable(
  list(synth_table_NU$tab.pred, synth_table_NU$tab.v),
  booktabs = TRUE, valign = 't'
)

	Treated	Synthetic	Sample Mean
lnincome	10.032	9.800	9.796
beer	24.280	28.401	23.934
age15to24	0.179	0.178	0.178
retprice	66.637	66.293	64.513
special.cigsale.1988	90.100	93.783	115.414
special.cigsale.1980	120.200	121.753	139.800
special.cigsale.1975	127.100	126.735	138.584

	v.weights
lnincome	0.001
beer	0
age15to24	0.003
retprice	0.021
special.cigsale.1988	0.043
special.cigsale.1980	0.065
special.cigsale.1975	0.865

Indicator means

Our indicator means, like in the first part, matches quite well, except for beer. Dropping Utah results in a much heavier weighting towards lagged cigarette sale in 1975.

knitr::kable(synth_table_NU$tab.w)

	w.weights	unit.names	unit.numbers
1	0.000	Alabama	1
2	0.000	Arkansas	2
4	0.139	Colorado	4
5	0.069	Connecticut	5
6	0.000	Delaware	6
7	0.000	Georgia	7
8	0.048	Idaho	8
9	0.000	Illinois	9
10	0.000	Indiana	10
11	0.000	Iowa	11
12	0.000	Kansas	12
13	0.000	Kentucky	13
14	0.000	Louisiana	14
15	0.000	Maine	15
16	0.000	Minnesota	16
17	0.000	Mississippi	17
18	0.000	Missouri	18
19	0.000	Montana	19
20	0.000	Nebraska	20
21	0.179	Nevada	21
22	0.000	New Hampshire	22
23	0.563	New Mexico	23
24	0.000	North Carolina	24
25	0.000	North Dakota	25
26	0.000	Ohio	26
27	0.000	Oklahoma	27
28	0.000	Pennsylvania	28
29	0.000	Rhode Island	29
30	0.000	South Carolina	30
31	0.000	South Dakota	31
32	0.000	Tennessee	32
33	0.000	Texas	33
35	0.000	Vermont	35
36	0.000	Virginia	36
37	0.000	West Virginia	37
38	0.000	Wisconsin	38
39	0.000	Wyoming	39

Placebos and p-value

In this instance, after dropping Utah, synthetic California becomes 0.564 New Mexico, 0.179 Nevada, 0.130 Colorado, 0.069 Connecticut, and 0.048 Idaho.

placebos_NU <- generate.placebos(dataprep_out_NU, synth_out_NU, Sigf.ipop = 4, strategy = "multiprocess")

plot_placebos(placebos_NU)

mspe.plot(placebos_NU, discard.extreme = FALSE, mspe.limit = 1, plot.hist = TRUE)

# calculate and print p-value
ratio_NU <- mspe.test(placebos_NU, discard.extreme = F)
ratio_NU$p.val

## [1] 0.1052632

After dropping Utah, California no longer has the highest post/pre MSPE ratio. The probability of obtaining a post/pre MSPE ratio as high as California is 4/39 or 0.10256. California’s MSPE ratio is also not as high as in part 2. Although the pre-treatment fit is still good, the pseudo p-value does not indicate significance.

4. With Utah, Without Lagged Cigarette Pack Sales

Synthesizing our California with Utah and without the lagged cigarette pack sales, we can take out the special indicators and put 34 back into our donor pool with the following code, appending NS (no special) to our data frames:

dataprep_out_NS <- dataprep(
  foo = smoking,
  predictors = c("lnincome", "beer", "age15to24", "retprice"),
  predictors.op = "mean",
  time.predictors.prior = 1970:1988,
  dependent = "cigsale",
  unit.variable = "state",
  unit.names.variable = "statelabels",
  time.variable = "year",
  treatment.identifier = 3,
  controls.identifier = c(1,2,4:39),
  time.optimize.ssr = 1970:1988,
  time.plot = 1970:2000
)
synth_out_NS <- synth(data.prep.obj = dataprep_out_NS)

path.plot(synth_out_NS, dataprep_out_NS, Ylab = c("per-capita cigarette sales (in packs)"), Xlab = c("Year"), Main = c("California v Synthetic California"), tr.intake = 1988, Legend = c("Calfornia", "Synthetic California"))

gaps.plot(synth_out_NS, dataprep_out_NS, Ylab = c("gap in per-capita cigarette sales (in packs)"), Xlab = c("Year"), Main = c("Gap in Per-capital Cigarette Sales Between California and Synthetic"), tr.intake = 1988 )

Without the lagged cigarette pack sales, the pre-treatment fit between California and the synthetic California is worse and consequently, the gap between the two is also worse. Our estimated treated effect looks to be a larger decrease, near 30 packs per year, however, the years leading up to the treatment year are badly fitted.

Indicator means

synth_table_NS <- synth.tab(synth_out_NS, dataprep_out_NS, round.digit = 3)
knitr::kable(
  list(synth_table_NS$tab.pred, synth_table_NS$tab.v),
  booktabs = TRUE, valign = 't'
)

	Treated	Synthetic	Sample Mean
lnincome	10.032	10.013	9.792
beer	24.280	22.982	23.655
age15to24	0.179	0.176	0.178
retprice	66.637	68.526	64.505

	v.weights
lnincome	0.799
beer	0.004
age15to24	0.192
retprice	0.006

The means look pretty much the same, except beer and retprice are a little off. However, they are still more closely matched with California compared to the sample. In this case, a large proportion of the variable weight is on lnincome.

knitr::kable(synth_table_NS$tab.w)

	w.weights	unit.names	unit.numbers
1	0.000	Alabama	1
2	0.000	Arkansas	2
4	0.514	Colorado	4
5	0.477	Connecticut	5
6	0.000	Delaware	6
7	0.000	Georgia	7
8	0.000	Idaho	8
9	0.000	Illinois	9
10	0.000	Indiana	10
11	0.000	Iowa	11
12	0.000	Kansas	12
13	0.000	Kentucky	13
14	0.000	Louisiana	14
15	0.000	Maine	15
16	0.000	Minnesota	16
17	0.000	Mississippi	17
18	0.000	Missouri	18
19	0.000	Montana	19
20	0.000	Nebraska	20
21	0.000	Nevada	21
22	0.000	New Hampshire	22
23	0.000	New Mexico	23
24	0.000	North Carolina	24
25	0.000	North Dakota	25
26	0.000	Ohio	26
27	0.000	Oklahoma	27
28	0.000	Pennsylvania	28
29	0.000	Rhode Island	29
30	0.000	South Carolina	30
31	0.000	South Dakota	31
32	0.000	Tennessee	32
33	0.000	Texas	33
34	0.000	Utah	34
35	0.000	Vermont	35
36	0.001	Virginia	36
37	0.000	West Virginia	37
38	0.000	Wisconsin	38
39	0.000	Wyoming	39

We see that in this case, our synthetic California is comprised of 0.514 Colorado, 0.477 Connecticut, and 0.001 Virginia. This is a much more different composition and weighting compared to the previous two.

Placebos and p-values

placebos_NS <- generate.placebos(dataprep_out_NS, synth_out_NS, Sigf.ipop = 4, strategy = "multiprocess")

plot_placebos(placebos_NS)

mspe.plot(placebos_NS, discard.extreme = FALSE, mspe.limit = 1, plot.hist = TRUE)

# calculate and print p-value
ratio_NS <- mspe.test(placebos_NS, discard.extreme = F)
ratio_NS$p.val

## [1] 0.1025641

After removing the lagged cigarette pack sales, California, like in part 3, no longer has the highest post/pre MSPE ratio. The probability of obtaining a post/pre MSPE ratio as high as California is the same as in part 3: 4/39 or 0.10256. The MSPE ratio is also less than compared to part 3.

5. In-time Placebo: Year 1981

dataprep_out_81 <- dataprep(
  foo = smoking,
  predictors = c("lnincome", "age15to24", "retprice"), 
  predictors.op = "mean",
  time.predictors.prior = 1970:1981,
    special.predictors = list(
      list("cigsale", 1981, "mean"),
      list("cigsale", 1980, "mean"),
      list("cigsale", 1975, "mean")),
  dependent = "cigsale",
  unit.variable = "state",
  unit.names.variable = "statelabels",
  time.variable = "year",
  treatment.identifier = 3,
  controls.identifier = c(1,2,4:39),
  time.optimize.ssr = 1970:1981,
  time.plot = 1970:1988
)
synth_out_81 <- synth(data.prep.obj = dataprep_out_81)

I received an error stating that the indicator variable “beer” is missing data for all time periods before the treatment at 1981. Therefore, in order to run the code, I had to drop beer from the predictors. I also changed the lagged cigarette pack sales from year 1988 to 1981. Everything else remains unchanged.

path.plot(synth_out_81, dataprep_out_81, Ylab = c("per-capita cigarette sales (in packs)"), Xlab = c("Year"), Main = c("California v Synthetic California"), tr.intake = 1981, Legend = c("Calfornia", "Synthetic California"))

gaps.plot(synth_out_81, dataprep_out_81, Ylab = c("gap in per-capita cigarette sales (in packs)"), Xlab = c("Year"), Main = c("Gap in Per-capital Cigarette Sales Between California and Synthetic"), tr.intake = 1981)

Indicator means

synth_table_81 <- synth.tab(synth_out_81, dataprep_out_81, round.digit = 3)
knitr::kable(
  list(synth_table_81$tab.pred, synth_table_81$tab.v),
  booktabs = TRUE, valign = 't'
)

	Treated	Synthetic	Sample Mean
lnincome	9.992	9.877	9.757
age15to24	0.184	0.183	0.184
retprice	49.150	49.928	46.718
special.cigsale.1981	118.600	119.387	137.987
special.cigsale.1980	120.200	120.669	138.089
special.cigsale.1975	127.100	126.821	136.932

	v.weights
lnincome	0.003
age15to24	0.005
retprice	0.002
special.cigsale.1981	0.15
special.cigsale.1980	0.163
special.cigsale.1975	0.677

Our means here fit extremely well and the lagged cigarette sales fit much better than the sample mean. Like in the first case, the variable weighting is on mostly lagged cigarette sales on 1975 and 1980.

knitr::kable(synth_table_81$tab.w)

	w.weights	unit.names	unit.numbers
1	0.000	Alabama	1
2	0.000	Arkansas	2
4	0.001	Colorado	4
5	0.331	Connecticut	5
6	0.000	Delaware	6
7	0.000	Georgia	7
8	0.000	Idaho	8
9	0.001	Illinois	9
10	0.000	Indiana	10
11	0.000	Iowa	11
12	0.000	Kansas	12
13	0.000	Kentucky	13
14	0.000	Louisiana	14
15	0.000	Maine	15
16	0.000	Minnesota	16
17	0.000	Mississippi	17
18	0.000	Missouri	18
19	0.000	Montana	19
20	0.000	Nebraska	20
21	0.305	Nevada	21
22	0.000	New Hampshire	22
23	0.000	New Mexico	23
24	0.000	North Carolina	24
25	0.000	North Dakota	25
26	0.000	Ohio	26
27	0.000	Oklahoma	27
28	0.000	Pennsylvania	28
29	0.000	Rhode Island	29
30	0.000	South Carolina	30
31	0.000	South Dakota	31
32	0.000	Tennessee	32
33	0.000	Texas	33
34	0.361	Utah	34
35	0.000	Vermont	35
36	0.000	Virginia	36
37	0.000	West Virginia	37
38	0.000	Wisconsin	38
39	0.000	Wyoming	39

In this case, the synthetic California is made up of 0.361 Utah, 0.305 Nevada, 0.331 Connecticut, and 0.001 Illinois. The indicator variables match up extremely well and the pre-treatment fit looks almost perfect. However, the real California and synthetic California continue to be well fitted a few years past the placebo treatment date.

Placebos and p-value

placebos_81 <- generate.placebos(dataprep_out_81, synth_out_81, Sigf.ipop = 4, strategy = "multiprocess")

plot_placebos(placebos_81)

mspe.plot(placebos_81, discard.extreme = FALSE, mspe.limit = 1, plot.hist = TRUE)

# calculate and print p-value
ratio_81 <- mspe.test(placebos_81, discard.extreme = F)
ratio_81$p.val

## [1] 0.1538462

In this case, the probability of obtaining a post/pre MSPE ratio as large as California’s is 6/39 = 0.153846. The average treatment effect in this in-time placebo case would be about a 8-10 pack decrease per year from year 1981 to 1988, although the pseudo p-value is not significant. This result is an even lower estimated effect compared to the one study by Fichtenberg and Glantz (an estimated decrease of 14 packs per year) that Abodie et al. mentions.

6. Pseudo p-values

I found and graphed the post and pre MSPE for California and the control units and the post/pre MSPE ratio at the end of each parts 2-5. Summarizing the pseudo p-values and the ATET here:

	Normal	No Utah	No Lagged	In-time Placebo
Pseudo p-values	0.02564	0.10256	0.10256	0.15385
ATET (packs per year)	~-24	~-20	~-30	~-9

We see that only the pseudo p-values from the first synthetic control is significant.

POLS 5385 - Homework 3 Synthetic Control

Don Lim

10/19/2021

0.5 Necessary libraries

1. Importing and cleaning data

2. Creating synthetic California

Indicator means

Placebos, pseudo p-values, and Proposition 99

3. Dropping Utah

Indicator means

Placebos and p-value

4. With Utah, Without Lagged Cigarette Pack Sales

Indicator means

Placebos and p-values

5. In-time Placebo: Year 1981

Indicator means

Placebos and p-value

6. Pseudo p-values