Tidysynth error -- Please specify only one treated unit - r

I am trying to calculate a Synthetic control using the tidysynth package. I'm fairly new to the package and the data but here is my code:
#Import data
synth <- read.csv("https://raw.githubusercontent.com/FDobkin/coal_paper/main/synth_data.csv")
#Convert year to date
synth$year <- strptime(synth$year, format = "%Y")
synth_out <- synth %>%
synthetic_control(
outcome=saleprice,
unit=fips,
time=year,
i_unit=47145,
i_time=2009-10-19,
generate_placebos=T
) %>%
generate_predictor(
time_window=2000-10-19:2009-10-19,
population = pop,
white = white_p,
age = age65p_p,
rucc = rucc_code,
income = median_income,
unemploy = unemprate,
laborforce = lfrate
) %>%
generate_weights(optimization_window = 2000-10-19:2009-10-19, # time to use in the optimization task
margin_ipop = .02,sigf_ipop = 7,bound_ipop = 6 # optimizer options
)
%>%
generate_control()
The error is:
Error in synth_method(treatment_unit_covariates = treatment_unit_covariates, :
Please specify only one treated unit.
The error seems to becoming from the generate_weights() statement. I am specifying the specific county that is receiving the treatment in the synthetic_control() statement. What is the error noting is wrong?

Related

Consistent error message while running grouping analysis in 'plspm' package

I am looking for some help in resolving an error using the partial least squares path modeling package ('plspm').
I can get results running a basic PLS-PM analysis but run into issues when using the grouping function, receiving the error message:
Error in if (w_dif < specs$tol || iter == specs$maxiter) break : missing value where TRUE/FALSE needed
I have no missing values and all variables have the proper classification. Elsewhere I read that there is a problem with processing observations with the exact same values across all variables, I have deleted those and still face this issue. I seem to be facing the issue only when I run the groups using the "bootstrap" method as well.
farmwood = read.csv("farmwood_groups(distance).csv", header = TRUE) %>%
slice(-c(119:123))
Control = c(0,0,0,0,0,0)
Normative = c(0,0,0,0,0,0)
B_beliefs = c(0,0,0,0,0,0)
P_control = c(1,0,0,0,0,0)
S_norm = c(0,1,0,0,0,0)
Behavior = c(0,0,1,1,1,0)
farmwood_path = rbind(Control, Normative, B_beliefs, P_control, S_norm, Behavior)
colnames(farmwood_path) = rownames(farmwood_path)
farmwood_blocks = list(14:18,20:23,8:13,24:27,19,4:7)
farmwood_modes = rep("A", 6)
farmwood_pls = plspm(farmwood, farmwood_path, farmwood_blocks, modes = farmwood_modes)
ames(farmwood)[names(farmwood) == "QB3"] <- "Distance"
farmwood$Distance <- as.factor(farmwood$Distance)
distance_boot = plspm.groups(farmwood_pls, farmwood$Distance, method = "bootstrap")
distance_perm = plspm.groups(farmwood_pls, farmwood$Distance, method = "permutation")
The data is contained here:
https://www.dropbox.com/s/8vewuupywpi1jkt/farmwood_groups%28distance%29.csv?dl=0
Any help would be appreciated. Thank you in advance

R package {gtsummary} - Can't convert .f (NULL) to function using custom stat function

I'm using gtsummary::tbl_custom_summary to present data analyzed through a test function I've written.
I'm getting this error:
Error in `mutate()`:
! Problem while computing `df_stats = pmap(...)`.
Caused by error in `as_group_map_function()`:
! Can't convert `.f`, NULL, to a function.
Backtrace:
1. gtsummary::tbl_custom_summary(...)
18. dplyr:::group_modify.grouped_df(...)
19. dplyr:::as_group_map_function(.f)
20. rlang::as_function(.f)
I haven't been able to tell if it has to do with data masking/quotations (which I tried to implement but, honestly, haven't understood much of…), or if it something related to gtsummary.
This is the code—please note the custom function's argument are modeled after tbl_custom_summary's requirements and have convenience defaults at the moment.
library(tidyverse)
library(infer)
library(gtsummary)
# Custom function supposedly uses infer package functions to determine the difference
# in medians and its bootstrapped CIs, returning a DF with values specified and
# 'glued' in the tbl_custom_summary call (statistic argument).
testdiffCI <- function(group_data,
full_data,
variable,
by,
type,
stat_display,
...,
point_stat = "median",
# This default is in order to use mtcars data
order = c("0", "1"),
ci_type = "bias-corrected"
) {
# change the function arg into one compatible with infer syntax, i.e. string
stat <- str_glue("diff in {point_stat}s") |> toString()
# These variables are passed as strings (names). Convert them to symbols.
variable <- sym(variable)
by <- sym(by)
# Calculate point estimate:
point <- full_data |>
# Trying to use shorthand defuse|>inject operator {{
# I'm really not sure if this is correct
specify(response={{variable}}, explanatory = {{by}}) |>
calculate(stat = stat, order = order) |> suppressWarnings()
# Bootstrap (population) object
boot <- full_data |>
specify(response = {{variable}}, explanatory = {{by}}) |>
generate(reps = 1000, type = "bootstrap") |>
calculate(stat = stat, order = order)
ci <- get_confidence_interval(boot,
type = ci_type,
point_estimate = point
)
# Return a tibble; column names are the same as those appearing in the statistic
# argument in tbl_custom_summary call
tibble(
diff = point$stat,
ci_lo = ci$lower_ci,
ci_up = ci$upper_ci
)
}
This is the call to the summary function:
mtcars |>
mutate(vs = as.factor(vs)) |>
tbl_custom_summary(
by = vs,
stat_fns = list(all_continuous() ~ testdiffCI),
statistic = ~ "{diff} ({ci_lo} – {ci_hi})"
)
Thanks for any insights you guys may share!

R for loop going wrong when applied to function

I am trying to work on a for loop to make running a function I've developed more efficient.
However, when I put it in a for loop, it is overwriting columns that it should not be and returning incorrect results.
Edit: The error is that in the resulting dataframe MiSeq_Bord_Outliers_table0, the resulting columns containing label Outlier_type is returning incorrect outputs.
As per the Outlier_Hunter function, when Avg_Trim_Cov and S2_Total_Read_Pairs_Processed are below their
respective Q1 Thresholds their respective Outlier_type columns should read "Lower_Outlier", if between Q1 & Q3 Threshold, "Normal" and if above Q3 Threshold then "Upper_outlier". But when the for loop is executed, only "Upper_outlier" is shown in the Outlier_type columns.
Edit: The inputs have been simplified and tested on the different computer with a clean console. If there were any artifacts there before, they should have been eliminated now, and there should be no errors here now. It is important to run the outlier_results_1var part first. If you test run this code and get errors, please let me know which part failed.
Edit: MiSeq_Bord_Outliers_table0_error is the error that is being reproduced. This is the error result, not an input.
Can someone please tell me why is it returning these incorrect results and what I can do to fix it? I will upload the relevant code below. Or is there another way to do this without a for loop?
#libraries used
library(tidyverse)
library(datapasta)
library(data.table)
library(janitor)
library(ggpubr)
library(labeling)
#2.) Outlier_Hunter Function
#Function to Generate the Outlier table
#Outlier Hunter function takes 4 arguments: the dataset, column/variable of interest,
#Q1 and Q3. Q1 and Q3 are stored in the results of Quartile_Hunter.
#Input ex: MiSeq_Bord_final_report0, Avg_Trim_Cov, MiSeq_Bord_Quartiles_ATC$First_Quartile[1], MiSeq_Bord_Quartiles_ATC$Third_Quartile[1]
#Usage ex: Outlier_Hunter(MiSeq_Bord_final_report0, Avg_Trim_Cov,
#MiSeq_Bord_Quartiles_ATC$First_Quartile[1], MiSeq_Bord_Quartiles_ATC$Third_Quartile[1])
#Here is the Function to get the Outlier Table
Outlier_Hunter <- function(Platform_Genus_final_report0, my_col, Q1, Q3) {
#set up and generalize the variable name you want to work with
varname <- enquo(my_col)
#print(varname) #just to see what variable the function is working with
#get the outliers
Platform_Genus_Variable_Outliers <- Platform_Genus_final_report0 %>%
select(ReadID, Platform, Genus, !!varname) %>%
#Tell if it is an outlier, and if so, what kind of outlier
mutate(
Q1_Threshold = Q1,
Q3_Threshold = Q3,
Outlier_type =
case_when(
!!varname < Q1_Threshold ~ "Lower_Outlier",
!!varname >= Q1_Threshold & !!varname <= Q3_Threshold ~ "Normal",
!!varname > Q3_Threshold ~ "Upper_Outlier"
)
)
}
#MiSeq_Bord_Quartiles entries
MiSeq_Bord_Quartiles <- data.frame(
stringsAsFactors = FALSE,
row.names = c("Avg_Trim_Cov", "S2_Total_Read_Pairs_Processed"),
Platform = c("MiSeq", "MiSeq"),
Genus = c("Bord", "Bord"),
Min = c(0.03, 295),
First_Quartile = c(80.08, 687613.25),
Median = c(97.085, 818806.5),
Third_Quartile = c(121.5625, 988173.75),
Max = c(327.76, 2836438)
)
#Remove the hashtag below to test if what you have is correct
#datapasta::df_paste(head(MiSeq_Bord_Quartiles, 5))
#dataset entry
MiSeq_Bord_final_report0 <- data.frame(
stringsAsFactors = FALSE,
ReadID = c("A005_20160223_S11_L001","A050_20210122_S6_L001",
"A073_20210122_S7_L001",
"A076_20210426_S11_L001",
"A080_20210426_S12_L001"),
Platform = c("MiSeq","MiSeq",
"MiSeq","MiSeq","MiSeq"),
Genus = c("Bordetella",
"Bordetella","Bordetella",
"Bordetella","Bordetella"),
Avg_Raw_Read_bp = c(232.85,241.09,
248.54,246.99,248.35),
Avg_Trimmed_Read_bp = c(204.32,232.6,
238.56,242.54,244.91),
Avg_Trim_Cov = c(72.04,101.05,
92.81,41.77,54.83),
Genome_Size_Mb = c(4.1, 4.1, 4.1, 4.1, 4.1),
S1_Input_reads = c(1450010L,
1786206L,1601542L,710792L,925462L),
S1_Contaminant_reads = c(12220L,6974L,
7606L,1076L,1782L),
S1_Total_reads_removed = c(12220L,6974L,
7606L,1076L,1782L),
S1_Result_reads = c(1437790L,
1779232L,1593936L,709716L,923680L),
S2_Read_Pairs_Written = c(712776L,882301L,
790675L,352508L,459215L),
S2_Total_Read_Pairs_Processed = c(718895L,889616L,
796968L,354858L,461840L)
)
MiSeq_Bord_final_report0
#Execution for 1 variable
outlier_results_1var <- Outlier_Hunter(MiSeq_Bord_final_report0, Avg_Trim_Cov,
MiSeq_Bord_Quartiles$First_Quartile[1], MiSeq_Bord_Quartiles$Third_Quartile[1])
#Now do it with a for loop
col_var_outliers <- row.names(MiSeq_Bord_Quartiles)
#col_var_outliers <- c("Avg_Trim_Cov", "S2_Total_Read_Pairs_Processed")
#change line above to change input of variables few into Outlier Hunter Function
outlier_list_MiSeq_Bord <- list()
for (y in col_var_outliers) {
outlier_results0 <- Outlier_Hunter(MiSeq_Bord_final_report0, y, MiSeq_Bord_Quartiles[y, "First_Quartile"], MiSeq_Bord_Quartiles[y, "Third_Quartile"])
outlier_results1 <- outlier_results0
colnames(outlier_results1)[5:7] <- paste0(y, "_", colnames(outlier_results1[, c(5:7)]), sep = "")
outlier_list_MiSeq_Bord[[y]] <- outlier_results1
}
MiSeq_Bord_Outliers_table0 <- reduce(outlier_list_MiSeq_Bord, left_join, by = c("ReadID", "Platform", "Genus"))
#the columns containing label Outlier_type is where the code goes wrong.
#When Avg_Trim_Cov and S2_Total_Read_Pairs_Processed are below their
#respective Q1 Thresholds their respective Outlier_type columns should read
#"Lower_Outlier", if between Q1 & Q3 Threshold, "Normal" and if above Q3
#Threshold then "Upper_outlier". But when the for loop is executed, only
"Upper_outlier" is shown in the Outlier_type columns.
datapasta::df_paste(head(MiSeq_Bord_Outliers_table0, 5))
MiSeq_Bord_Outliers_table0_error <- data.frame(
stringsAsFactors = FALSE,
ReadID = c("A005_20160223_S11_L001",
"A050_20210122_S6_L001",
"A073_20210122_S7_L001","A076_20210426_S11_L001",
"A080_20210426_S12_L001"),
Platform = c("MiSeq",
"MiSeq","MiSeq","MiSeq",
"MiSeq"),
Genus = c("Bordetella","Bordetella","Bordetella",
"Bordetella","Bordetella"),
Avg_Trim_Cov = c(72.04,
101.05,92.81,41.77,54.83),
Avg_Trim_Cov_Q1_Threshold = c(80.08,
80.08,80.08,80.08,80.08),
Avg_Trim_Cov_Q3_Threshold = c(121.5625,
121.5625,121.5625,121.5625,
121.5625),
Avg_Trim_Cov_Outlier_type = c("Upper_Outlier","Upper_Outlier",
"Upper_Outlier","Upper_Outlier",
"Upper_Outlier"),
S2_Total_Read_Pairs_Processed = c(718895L,
889616L,796968L,354858L,
461840L),
S2_Total_Read_Pairs_Processed_Q1_Threshold = c(687613.25,
687613.25,687613.25,
687613.25,687613.25),
S2_Total_Read_Pairs_Processed_Q3_Threshold = c(988173.75,
988173.75,988173.75,
988173.75,988173.75),
S2_Total_Read_Pairs_Processed_Outlier_type = c("Upper_Outlier","Upper_Outlier",
"Upper_Outlier","Upper_Outlier",
"Upper_Outlier")
)
For use in a loop like you do, it would be more useful to write your Outlier_Hunter() function to take the target column as a character string rather than an expression.
To do that, try replacing all instances of !!varname in your function with .data[[my_col]], and remove the enquo() line altogether.
Note that with these changes, you also need to change how you call the function when you don't have the column name in a variable. For example, your single execution would become:
Outlier_Hunter(
MiSeq_Bord_final_report0,
"Avg_Trim_Cov",
MiSeq_Bord_Quartiles$First_Quartile[1],
MiSeq_Bord_Quartiles$Third_Quartile[1]
)
For more info about programming with tidy evaluation functions, you may find this rlang vignette useful.

tsibble -- how do you get around implicit gaps when there are none

I am new to the tsibble package. I have monthly data that I coerced to a tsibble to use the fable package. A few issues I am having
It appears the index variable (from my testing) is not of class date even though I applied
lubridate's ymd function to it.
has_gaps function returns FALSE but when I model on the data, I get the error that ".data contains
implicit gaps in time"
library(dplyr)
library(fable)
library(lubridate)
library(tsibble)
test <- data.frame(
YearMonth = c(20160101, 20160201, 20160301, 20160401, 20160501, 20160601,
20160701, 20160801, 20160901, 20161001, 20161101, 20161201),
Claims = c(13032647, 1668005, 24473616, 13640769, 17891432, 11596556,
23176360, 7885872, 11948461, 16194792, 4971310, 18032363),
Revenue = c(12603367, 18733242, 5862766, 3861877, 15407158, 24534258,
15633646, 13720258, 24944078, 13375742, 4537475, 22988443)
)
test_ts <- test %>%
mutate(YearMonth = ymd(YearMonth)) %>%
as_tsibble(
index = YearMonth,
regular = FALSE #because it picks up gaps when I set it to TRUE
)
# Are there any gaps?
has_gaps(test_ts, .full = T)
model_new <- test_ts %>%
model(
snaive = SNAIVE(Claims))
Warning messages:
1: 1 error encountered for snaive
[1] .data contains implicit gaps in time. You should check your data and convert implicit gaps into explicit missing values using `tsibble::fill_gaps()` if required.
Any help will appreciated.
You have a daily index, but you want a monthly index. The simplest way is to use the tsibble::yearmonth() function, but you will need to convert the date to character first.
library(dplyr)
library(tsibble)
test <- data.frame(
YearMonth = c(20160101, 20160201, 20160301, 20160401, 20160501, 20160601,
20160701, 20160801, 20160901, 20161001, 20161101, 20161201),
Claims = c(13032647, 1668005, 24473616, 13640769, 17891432, 11596556,
23176360, 7885872, 11948461, 16194792, 4971310, 18032363),
Revenue = c(12603367, 18733242, 5862766, 3861877, 15407158, 24534258,
15633646, 13720258, 24944078, 13375742, 4537475, 22988443)
)
test_ts <- test %>%
mutate(YearMonth = yearmonth(as.character(YearMonth))) %>%
as_tsibble(index = YearMonth)
Looks like as_tsibble isn't able to recognize the interval properly in the YearMonth column because it is a Date class object. It's hidden in the 'Index' section of help page that that might be problem:
For a tbl_ts of regular interval, a choice of index representation has to be made. For example, a monthly data should correspond to time index created by yearmonth or zoo::yearmon, instead of Date or POSIXct.
Like that excerpt suggests you can get around the problem with yearmonth(). But that requires a little string manipulation first to get it into a format that will parse properly.
test_ts <- test %>%
mutate(YearMonth = gsub("(.{2})01$", "-\\1", YearMonth) %>%
yearmonth()
) %>%
as_tsibble(
index = YearMonth
)
Now the model should run error free! Not sure why the has_gaps() test is saying everything is okay in your example...

R HTS package: combinef and aggts not working with gts object

I'm trying to apply the combinef and aggts functions from the R hts package to a time series matrix in order to obtain an optimized set of forecasts across a hierarchy. I've run the same code every month without issue, and am now seeing errors after upgrading to hts package v4.5.
Reproducible example (I can share data file offline if needed)
#Read in forecast data for all levels of hierarchy#
fcast<-read.csv("SampleHierarchyForecast.csv", header = TRUE, check.names = FALSE)
#Convert to time series#
fcast<-ts(fcast, start = as.numeric(2010.25) + (64)/12, end = as.numeric(2010.25) + (75)/12, f= 12)
#Create time series of only the bottom level of the hierarchy#
index<-c()
fcastBottom<-fcast
for (i in 1:length(fcastBottom [1,]))
{
if(nchar(colnames(fcastBottom)[i])!=28)
index[i]<-i
else
index[i]<-0
}
fcastBottom<-fcastBottom[,-index]
#Create grouped time series from the bottom level forecast #
GtsForecast <- gts(fcastBottom, characters = list(c(12,12), c(4)), gnames = c("Category", "Item", "Customer", "Category-Customer"))
#Use combinef function to optimally combine the full hierarchy forecast using the groups from the full hierarchy gts#
combo <- combinef(fcast, groups = GtsForecast$groups)
*Warning message:
In mapply(rep, as.list(gnames), times, SIMPLIFY = FALSE) :
longer argument not a multiple of length of shorter*
traceback()
2: stop("Argument fcasts requires all the forecasts.")
1: combinef(fcast, groups = GtsForecast$groups)
There's a little bug when comebinef() function calls gts(). Now I've fixed it on github. So you can run your own code above without any trouble after updating the development version.
Alternatively, you need to tweak your code a bit if you don't want to install the newest version.
combo <- combinef(fcast, groups = GtsForecast$groups, keep = "bottom")
combo <- ts(combo, start = as.numeric(2010.25) + (64)/12,
end = as.numeric(2010.25) + (75)/12, f = 12)
colnames(combo) <- colnames(fcastBottom)
newGtsForecast <- gts(combo, characters = list(c(12,12), c(4)),
gnames = c("Category", "Item", "Customer",
"Category-Customer"))
Aggregate <- aggts(newGtsForecast)
Hope it helps.

Resources