0.5 Necessary libraries

library(tidyverse)
library(haven)
library(Synth)
library(devtools)
library(SCtools)
library(sjlabelled)
library(rlist)

1. Importing and cleaning data

Importing the data

smoking <- read_dta("smoking.dta") %>% 
  as.data.frame(.)

Just to get an idea of the data, we can run a summary of smoking.dta

smokingsum <- summary(smoking)
knitr::kable(smokingsum)
state year cigsale lnincome beer age15to24 retprice
Min. : 1 Min. :1970 Min. : 40.7 Min. : 9.397 Min. : 2.50 Min. :0.1294 Min. : 27.3
1st Qu.:10 1st Qu.:1977 1st Qu.:100.9 1st Qu.: 9.739 1st Qu.:20.90 1st Qu.:0.1658 1st Qu.: 50.0
Median :20 Median :1985 Median :116.3 Median : 9.861 Median :23.30 Median :0.1781 Median : 95.5
Mean :20 Mean :1985 Mean :118.9 Mean : 9.862 Mean :23.43 Mean :0.1755 Mean :108.3
3rd Qu.:30 3rd Qu.:1993 3rd Qu.:130.5 3rd Qu.: 9.973 3rd Qu.:25.10 3rd Qu.:0.1867 3rd Qu.:158.4
Max. :39 Max. :2000 Max. :296.2 Max. :10.487 Max. :40.40 Max. :0.2037 Max. :351.2
NA NA NA NA’s :195 NA’s :663 NA’s :390 NA

To prepare our data, we need to 1) create a new column, statelabels, where the state names are listed corresponding to their state number; 2) remove the labels from columns statelabels and state; 3) lastly, to ensure compatibility (with the function dataprep() from Synth package), set state class as numeric and statelabel class as character.

#1
smoking$statelabels <- as_label(smoking$state)

#2
remove_label(smoking, statelabels)
remove_label(smoking, state)

#3
smoking$statelabels <- as.character(smoking$statelabels)
smoking$state <- as.numeric(smoking$state)

The data frame now is prepped and looks like the following:

smoking

2. Creating synthetic California

Here we set our indicator variables with: lnincome, beer, age15to24, and retprice. Additionally, we create lagged indicator variables from cigarette pack prices for the following years: 1988, 1980, and 1975. The dependent variable is cigsale and the treatment year is 1988.

dataprep_out <- dataprep(
  foo = smoking,
  predictors = c("lnincome", "beer", "age15to24", "retprice"),
  predictors.op = "mean",
  time.predictors.prior = 1970:1988,
    special.predictors = list(
      list("cigsale", 1988, "mean"),
      list("cigsale", 1980, "mean"),
      list("cigsale", 1975, "mean")),
  dependent = "cigsale",
  unit.variable = "state",
  unit.names.variable = "statelabels",
  time.variable = "year",
  treatment.identifier = 3,
  controls.identifier = c(1,2,4:39),
  time.optimize.ssr = 1970:1988,
  time.plot = 1970:2000
)
synth_out <- synth(data.prep.obj = dataprep_out)

Running the following two lines of code gives us the following two graphs:

path.plot(synth_out, dataprep_out, Ylab = c("per-capita cigarette sales (in packs)"), Xlab = c("Year"), Main = c("California v Synthetic California"), tr.intake = 1988, Legend = c("Calfornia", "Synthetic California"))

gaps.plot(synth_out, dataprep_out, Ylab = c("gap in per-capita cigarette sales (in packs)"), Xlab = c("Year"), Main = c("Gap in Per-capital Cigarette Sales Between California and Synthetic"), tr.intake = 1988)

The fit and the gap look identical to the ones generated by Adbodie et al.

Indicator means

To generate the table to look our what donors contribute to our synthetic California, we can run the following code:

synth_table <- synth.tab(synth_out, dataprep_out, round.digit = 3)
knitr::kable(
  list(synth_table$tab.pred, synth_table$tab.v),
  booktabs = TRUE, valign = 't'
)
Treated Synthetic Sample Mean
lnincome 10.032 9.840 9.792
beer 24.280 24.133 23.655
age15to24 0.179 0.180 0.178
retprice 66.637 66.109 64.505
special.cigsale.1988 90.100 92.449 113.824
special.cigsale.1980 120.200 120.154 138.089
special.cigsale.1975 127.100 126.822 136.932
v.weights
lnincome 0.001
beer 0.011
age15to24 0.001
retprice 0.01
special.cigsale.1988 0.072
special.cigsale.1980 0.324
special.cigsale.1975 0.581

California and synthetic California means match up very well together. The same could be said about the sample mean, except for the lagged cigarette sales. We see that the synthetic control method put heavy weights on the lagged cigarette sales for years 1975 and 1980.

knitr::kable(synth_table$tab.w)
w.weights unit.names unit.numbers
1 0.000 Alabama 1
2 0.000 Arkansas 2
4 0.094 Colorado 4
5 0.111 Connecticut 5
6 0.000 Delaware 6
7 0.000 Georgia 7
8 0.000 Idaho 8
9 0.000 Illinois 9
10 0.000 Indiana 10
11 0.000 Iowa 11
12 0.000 Kansas 12
13 0.000 Kentucky 13
14 0.000 Louisiana 14
15 0.000 Maine 15
16 0.000 Minnesota 16
17 0.000 Mississippi 17
18 0.000 Missouri 18
19 0.205 Montana 19
20 0.000 Nebraska 20
21 0.249 Nevada 21
22 0.000 New Hampshire 22
23 0.001 New Mexico 23
24 0.000 North Carolina 24
25 0.000 North Dakota 25
26 0.000 Ohio 26
27 0.000 Oklahoma 27
28 0.000 Pennsylvania 28
29 0.000 Rhode Island 29
30 0.000 South Carolina 30
31 0.000 South Dakota 31
32 0.000 Tennessee 32
33 0.000 Texas 33
34 0.341 Utah 34
35 0.000 Vermont 35
36 0.000 Virginia 36
37 0.000 West Virginia 37
38 0.000 Wisconsin 38
39 0.000 Wyoming 39

We see that our synthetic California is made up of 0.341 Utah, 0.249 Nevada, 0.205 Montana, 0.111 Connecticut, 0.094 Colorado, and 0.001 New Mexico. This composition and the predictor means are extremely close to the ones obtained by Abodie et al.

Placebos, pseudo p-values, and Proposition 99

Now we can generate our placebos and the corresponding p-values:

placebos <- generate.placebos(dataprep_out, synth_out, Sigf.ipop = 4, strategy = "multiprocess")
plot_placebos(placebos)

mspe.plot(placebos, discard.extreme = FALSE, mspe.limit = 1, plot.hist = TRUE)

# calculate and print p-value
ratio <- mspe.test(placebos, discard.extreme = F)
ratio$p.val
## [1] 0.02564103

California has the highest post/pre MSPE ratio and the The probability of obtaining a post/pre MSPE ratio as high as California is 1/39, or 0.02564, which is the same result as obtained by Abodie et al. The average treatment effect from Proposition 99 is estimated to be around a decrease of 24 packs per year from year 1988 to 2000.

3. Dropping Utah

To drop Utah from the set of controls, we can just rerun the code from part 2, dropping 34 (Utah’s numeric label) from our donor pool (controls.identifier), and appending NU (no Utah) to our renamed dataframes (dataprep_out_NU and synth_out_NU).

dataprep_out_NU <- dataprep(
  foo = smoking,
  predictors = c("lnincome", "beer", "age15to24", "retprice"),
  predictors.op = "mean",
  time.predictors.prior = 1970:1988,
    special.predictors = list(
      list("cigsale", 1988, "mean"),
      list("cigsale", 1980, "mean"),
      list("cigsale", 1975, "mean")),
    # list("retprice", 1970:1988, "mean")),
  dependent = "cigsale",
  unit.variable = "state",
  unit.names.variable = "statelabels",
  time.variable = "year",
  treatment.identifier = 3,
  controls.identifier = c(1,2,4:33,35:39),
  time.optimize.ssr = 1970:1988,
  time.plot = 1970:2000
)
synth_out_NU <- synth(data.prep.obj = dataprep_out_NU)
path.plot(synth_out_NU, dataprep_out_NU, Ylab = c("per-capita cigarette sales (in packs)"), Xlab = c("Year"), Main = c("California v Synthetic California"), tr.intake = 1988, Legend = c("Calfornia", "Synthetic California"))

gaps.plot(synth_out_NU, dataprep_out_NU, Ylab = c("gap in per-capita cigarette sales (in packs)"), Xlab = c("Year"), Main = c("Gap in Per-capital Cigarette Sales Between California and Synthetic"), tr.intake = 1988 )

Our fit and gap look pretty good here. The average treatment effect looks similar to the result above at around a decrease of 20 packs per year.

synth_table_NU <- synth.tab(synth_out_NU, dataprep_out_NU, round.digit = 3)
knitr::kable(
  list(synth_table_NU$tab.pred, synth_table_NU$tab.v),
  booktabs = TRUE, valign = 't'
)
Treated Synthetic Sample Mean
lnincome 10.032 9.800 9.796
beer 24.280 28.401 23.934
age15to24 0.179 0.178 0.178
retprice 66.637 66.293 64.513
special.cigsale.1988 90.100 93.783 115.414
special.cigsale.1980 120.200 121.753 139.800
special.cigsale.1975 127.100 126.735 138.584
v.weights
lnincome 0.001
beer 0
age15to24 0.003
retprice 0.021
special.cigsale.1988 0.043
special.cigsale.1980 0.065
special.cigsale.1975 0.865

Indicator means

Our indicator means, like in the first part, matches quite well, except for beer. Dropping Utah results in a much heavier weighting towards lagged cigarette sale in 1975.

knitr::kable(synth_table_NU$tab.w)
w.weights unit.names unit.numbers
1 0.000 Alabama 1
2 0.000 Arkansas 2
4 0.139 Colorado 4
5 0.069 Connecticut 5
6 0.000 Delaware 6
7 0.000 Georgia 7
8 0.048 Idaho 8
9 0.000 Illinois 9
10 0.000 Indiana 10
11 0.000 Iowa 11
12 0.000 Kansas 12
13 0.000 Kentucky 13
14 0.000 Louisiana 14
15 0.000 Maine 15
16 0.000 Minnesota 16
17 0.000 Mississippi 17
18 0.000 Missouri 18
19 0.000 Montana 19
20 0.000 Nebraska 20
21 0.179 Nevada 21
22 0.000 New Hampshire 22
23 0.563 New Mexico 23
24 0.000 North Carolina 24
25 0.000 North Dakota 25
26 0.000 Ohio 26
27 0.000 Oklahoma 27
28 0.000 Pennsylvania 28
29 0.000 Rhode Island 29
30 0.000 South Carolina 30
31 0.000 South Dakota 31
32 0.000 Tennessee 32
33 0.000 Texas 33
35 0.000 Vermont 35
36 0.000 Virginia 36
37 0.000 West Virginia 37
38 0.000 Wisconsin 38
39 0.000 Wyoming 39

Placebos and p-value

In this instance, after dropping Utah, synthetic California becomes 0.564 New Mexico, 0.179 Nevada, 0.130 Colorado, 0.069 Connecticut, and 0.048 Idaho.

placebos_NU <- generate.placebos(dataprep_out_NU, synth_out_NU, Sigf.ipop = 4, strategy = "multiprocess")
plot_placebos(placebos_NU)

mspe.plot(placebos_NU, discard.extreme = FALSE, mspe.limit = 1, plot.hist = TRUE)

# calculate and print p-value
ratio_NU <- mspe.test(placebos_NU, discard.extreme = F)
ratio_NU$p.val
## [1] 0.1052632

After dropping Utah, California no longer has the highest post/pre MSPE ratio. The probability of obtaining a post/pre MSPE ratio as high as California is 4/39 or 0.10256. California’s MSPE ratio is also not as high as in part 2. Although the pre-treatment fit is still good, the pseudo p-value does not indicate significance.

4. With Utah, Without Lagged Cigarette Pack Sales

Synthesizing our California with Utah and without the lagged cigarette pack sales, we can take out the special indicators and put 34 back into our donor pool with the following code, appending NS (no special) to our data frames:

dataprep_out_NS <- dataprep(
  foo = smoking,
  predictors = c("lnincome", "beer", "age15to24", "retprice"),
  predictors.op = "mean",
  time.predictors.prior = 1970:1988,
  dependent = "cigsale",
  unit.variable = "state",
  unit.names.variable = "statelabels",
  time.variable = "year",
  treatment.identifier = 3,
  controls.identifier = c(1,2,4:39),
  time.optimize.ssr = 1970:1988,
  time.plot = 1970:2000
)
synth_out_NS <- synth(data.prep.obj = dataprep_out_NS)
path.plot(synth_out_NS, dataprep_out_NS, Ylab = c("per-capita cigarette sales (in packs)"), Xlab = c("Year"), Main = c("California v Synthetic California"), tr.intake = 1988, Legend = c("Calfornia", "Synthetic California"))

gaps.plot(synth_out_NS, dataprep_out_NS, Ylab = c("gap in per-capita cigarette sales (in packs)"), Xlab = c("Year"), Main = c("Gap in Per-capital Cigarette Sales Between California and Synthetic"), tr.intake = 1988 )

Without the lagged cigarette pack sales, the pre-treatment fit between California and the synthetic California is worse and consequently, the gap between the two is also worse. Our estimated treated effect looks to be a larger decrease, near 30 packs per year, however, the years leading up to the treatment year are badly fitted.

Indicator means

synth_table_NS <- synth.tab(synth_out_NS, dataprep_out_NS, round.digit = 3)
knitr::kable(
  list(synth_table_NS$tab.pred, synth_table_NS$tab.v),
  booktabs = TRUE, valign = 't'
)
Treated Synthetic Sample Mean
lnincome 10.032 10.013 9.792
beer 24.280 22.982 23.655
age15to24 0.179 0.176 0.178
retprice 66.637 68.526 64.505
v.weights
lnincome 0.799
beer 0.004
age15to24 0.192
retprice 0.006

The means look pretty much the same, except beer and retprice are a little off. However, they are still more closely matched with California compared to the sample. In this case, a large proportion of the variable weight is on lnincome.

knitr::kable(synth_table_NS$tab.w)
w.weights unit.names unit.numbers
1 0.000 Alabama 1
2 0.000 Arkansas 2
4 0.514 Colorado 4
5 0.477 Connecticut 5
6 0.000 Delaware 6
7 0.000 Georgia 7
8 0.000 Idaho 8
9 0.000 Illinois 9
10 0.000 Indiana 10
11 0.000 Iowa 11
12 0.000 Kansas 12
13 0.000 Kentucky 13
14 0.000 Louisiana 14
15 0.000 Maine 15
16 0.000 Minnesota 16
17 0.000 Mississippi 17
18 0.000 Missouri 18
19 0.000 Montana 19
20 0.000 Nebraska 20
21 0.000 Nevada 21
22 0.000 New Hampshire 22
23 0.000 New Mexico 23
24 0.000 North Carolina 24
25 0.000 North Dakota 25
26 0.000 Ohio 26
27 0.000 Oklahoma 27
28 0.000 Pennsylvania 28
29 0.000 Rhode Island 29
30 0.000 South Carolina 30
31 0.000 South Dakota 31
32 0.000 Tennessee 32
33 0.000 Texas 33
34 0.000 Utah 34
35 0.000 Vermont 35
36 0.001 Virginia 36
37 0.000 West Virginia 37
38 0.000 Wisconsin 38
39 0.000 Wyoming 39

We see that in this case, our synthetic California is comprised of 0.514 Colorado, 0.477 Connecticut, and 0.001 Virginia. This is a much more different composition and weighting compared to the previous two.

Placebos and p-values

placebos_NS <- generate.placebos(dataprep_out_NS, synth_out_NS, Sigf.ipop = 4, strategy = "multiprocess")
plot_placebos(placebos_NS)

mspe.plot(placebos_NS, discard.extreme = FALSE, mspe.limit = 1, plot.hist = TRUE)

# calculate and print p-value
ratio_NS <- mspe.test(placebos_NS, discard.extreme = F)
ratio_NS$p.val
## [1] 0.1025641

After removing the lagged cigarette pack sales, California, like in part 3, no longer has the highest post/pre MSPE ratio. The probability of obtaining a post/pre MSPE ratio as high as California is the same as in part 3: 4/39 or 0.10256. The MSPE ratio is also less than compared to part 3.

5. In-time Placebo: Year 1981

dataprep_out_81 <- dataprep(
  foo = smoking,
  predictors = c("lnincome", "age15to24", "retprice"), 
  predictors.op = "mean",
  time.predictors.prior = 1970:1981,
    special.predictors = list(
      list("cigsale", 1981, "mean"),
      list("cigsale", 1980, "mean"),
      list("cigsale", 1975, "mean")),
  dependent = "cigsale",
  unit.variable = "state",
  unit.names.variable = "statelabels",
  time.variable = "year",
  treatment.identifier = 3,
  controls.identifier = c(1,2,4:39),
  time.optimize.ssr = 1970:1981,
  time.plot = 1970:1988
)
synth_out_81 <- synth(data.prep.obj = dataprep_out_81)

I received an error stating that the indicator variable “beer” is missing data for all time periods before the treatment at 1981. Therefore, in order to run the code, I had to drop beer from the predictors. I also changed the lagged cigarette pack sales from year 1988 to 1981. Everything else remains unchanged.

path.plot(synth_out_81, dataprep_out_81, Ylab = c("per-capita cigarette sales (in packs)"), Xlab = c("Year"), Main = c("California v Synthetic California"), tr.intake = 1981, Legend = c("Calfornia", "Synthetic California"))

gaps.plot(synth_out_81, dataprep_out_81, Ylab = c("gap in per-capita cigarette sales (in packs)"), Xlab = c("Year"), Main = c("Gap in Per-capital Cigarette Sales Between California and Synthetic"), tr.intake = 1981)

Indicator means

synth_table_81 <- synth.tab(synth_out_81, dataprep_out_81, round.digit = 3)
knitr::kable(
  list(synth_table_81$tab.pred, synth_table_81$tab.v),
  booktabs = TRUE, valign = 't'
)
Treated Synthetic Sample Mean
lnincome 9.992 9.877 9.757
age15to24 0.184 0.183 0.184
retprice 49.150 49.928 46.718
special.cigsale.1981 118.600 119.387 137.987
special.cigsale.1980 120.200 120.669 138.089
special.cigsale.1975 127.100 126.821 136.932
v.weights
lnincome 0.003
age15to24 0.005
retprice 0.002
special.cigsale.1981 0.15
special.cigsale.1980 0.163
special.cigsale.1975 0.677

Our means here fit extremely well and the lagged cigarette sales fit much better than the sample mean. Like in the first case, the variable weighting is on mostly lagged cigarette sales on 1975 and 1980.

knitr::kable(synth_table_81$tab.w)
w.weights unit.names unit.numbers
1 0.000 Alabama 1
2 0.000 Arkansas 2
4 0.001 Colorado 4
5 0.331 Connecticut 5
6 0.000 Delaware 6
7 0.000 Georgia 7
8 0.000 Idaho 8
9 0.001 Illinois 9
10 0.000 Indiana 10
11 0.000 Iowa 11
12 0.000 Kansas 12
13 0.000 Kentucky 13
14 0.000 Louisiana 14
15 0.000 Maine 15
16 0.000 Minnesota 16
17 0.000 Mississippi 17
18 0.000 Missouri 18
19 0.000 Montana 19
20 0.000 Nebraska 20
21 0.305 Nevada 21
22 0.000 New Hampshire 22
23 0.000 New Mexico 23
24 0.000 North Carolina 24
25 0.000 North Dakota 25
26 0.000 Ohio 26
27 0.000 Oklahoma 27
28 0.000 Pennsylvania 28
29 0.000 Rhode Island 29
30 0.000 South Carolina 30
31 0.000 South Dakota 31
32 0.000 Tennessee 32
33 0.000 Texas 33
34 0.361 Utah 34
35 0.000 Vermont 35
36 0.000 Virginia 36
37 0.000 West Virginia 37
38 0.000 Wisconsin 38
39 0.000 Wyoming 39

In this case, the synthetic California is made up of 0.361 Utah, 0.305 Nevada, 0.331 Connecticut, and 0.001 Illinois. The indicator variables match up extremely well and the pre-treatment fit looks almost perfect. However, the real California and synthetic California continue to be well fitted a few years past the placebo treatment date.

Placebos and p-value

placebos_81 <- generate.placebos(dataprep_out_81, synth_out_81, Sigf.ipop = 4, strategy = "multiprocess")
plot_placebos(placebos_81)

mspe.plot(placebos_81, discard.extreme = FALSE, mspe.limit = 1, plot.hist = TRUE)

# calculate and print p-value
ratio_81 <- mspe.test(placebos_81, discard.extreme = F)
ratio_81$p.val
## [1] 0.1538462

In this case, the probability of obtaining a post/pre MSPE ratio as large as California’s is 6/39 = 0.153846. The average treatment effect in this in-time placebo case would be about a 8-10 pack decrease per year from year 1981 to 1988, although the pseudo p-value is not significant. This result is an even lower estimated effect compared to the one study by Fichtenberg and Glantz (an estimated decrease of 14 packs per year) that Abodie et al. mentions.

6. Pseudo p-values

I found and graphed the post and pre MSPE for California and the control units and the post/pre MSPE ratio at the end of each parts 2-5. Summarizing the pseudo p-values and the ATET here:

Normal No Utah No Lagged In-time Placebo
Pseudo p-values 0.02564 0.10256 0.10256 0.15385
ATET (packs per year) ~-24 ~-20 ~-30 ~-9

We see that only the pseudo p-values from the first synthetic control is significant.