Background: at half-year follow-up visits over 4 years, patients may switch to a different medication group. To account for this, I've converted the survival data into counting-process form. I want to compare survival curves for medication groups A, B, and C. I am using an extended Cox model, but I want to do pairwise comparisons of each hazard function or run stratified log-rank tests. pairwise_survdiff throws an error, I think because of the form of my data.
Example data:
x <- data.frame(tstart = rep(seq(0, 18, 6), 3),
                tstop = rep(seq(6, 24, 6), 3),
                rx = rep(c("A", "B", "C"), 4),
                death = c(rep(0, 11), 1))
x
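which prints one row per at-risk interval:
   tstart tstop rx death
1       0     6  A     0
2       6    12  B     0
3      12    18  C     0
4      18    24  A     0
5       0     6  B     0
6       6    12  C     0
7      12    18  A     0
8      18    24  B     0
9       0     6  C     0
10      6    12  A     0
11     12    18  B     0
12     18    24  C     1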
Problem:
When using survdiff in the survival package,
survdiff(Surv(tstart,tstop,death) ~ rx, data = x)
I get the error:
Error in survdiff(Surv(tstart, tstop, death) ~ rx, data = x) :
Right censored data only
I think this stems from the counting process form, since I can't find an example online that compares survival curves for time-varying covariates.
Question: is there a quick fix for this problem? Or is there an alternative package/function with the same versatility for comparing survival curves, namely supporting different methods? How can I implement stratified log-rank tests using survdiff on counting-process data?
NOTE: this was marked as a known issue in the survminer package (see the GitHub issue here), but updating survminer did not solve my issue, and collapsing to a single time interval, tstop - tstart, wouldn't be correct, since that would leave, e.g., multiple entries at 6 months rather than entries spanning the actual intervals at risk.
So, here is an example of fitting the model and making the multiple comparisons using the multcomp package. Note that this implicitly assumes that the administration of treatments A-C is random. Depending on the assumptions about the process, it might be better to fit a multistate model with transitions between treatments and the outcome.
library(purrr)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(survival)
library(multcomp)
#> Loading required package: mvtnorm
#> Loading required package: TH.data
#> Loading required package: MASS
#>
#> Attaching package: 'MASS'
#> The following object is masked from 'package:dplyr':
#>
#> select
#>
#> Attaching package: 'TH.data'
#> The following object is masked from 'package:MASS':
#>
#> geyser
# simulate survival data
set.seed(123)
n <- 200
df <- data.frame(
  id = rep(1:n, each = 8),
  start = rep(seq(0, 42, by = 6), times = n),
  stop = rep(seq(6, 48, by = 6), times = n),
  rx = sample(LETTERS[1:3], n * 8, replace = TRUE))
df$hazard <- exp(-3.5 - 1 * (df$rx == "A") + .5 * (df$rx == "B") +
                   .5 * (df$rx == "C"))
df_surv <- data.frame(id = 1:n)
df_surv$time <- split(df, f = df$id) %>%
  map_dbl(~ msm::rpexp(n = 1, rate = .x$hazard, t = .x$start))
df <- df %>% left_join(df_surv)
#> Joining, by = "id"
df <- df %>%
  mutate(status = 1L * (time <= stop)) %>%
  filter(start <= time)
df %>% head()
#> id start stop rx hazard time status
#> 1 1 0 6 A 0.01110900 13.78217 0
#> 2 1 6 12 C 0.04978707 13.78217 0
#> 3 1 12 18 B 0.04978707 13.78217 1
#> 4 2 0 6 B 0.04978707 22.37251 0
#> 5 2 6 12 B 0.04978707 22.37251 0
#> 6 2 12 18 C 0.04978707 22.37251 0
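A quick sanity check on the simulated data (a sketch added here, not part of the original reprex): after the filter, each subject should contribute at most one event row.
# tabulate event rows per subject; expect only counts of 0 or 1
df %>%
  group_by(id) %>%
  summarise(events = sum(status)) %>%
  count(events)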
# fit the model
model <- coxph(Surv(start, stop, status)~rx, data = df)
# define pairwise comparison
glht_rx <- multcomp::glht(model, linfct=multcomp::mcp(rx="Tukey"))
glht_rx
#>
#> General Linear Hypotheses
#>
#> Multiple Comparisons of Means: Tukey Contrasts
#>
#>
#> Linear Hypotheses:
#> Estimate
#> B - A == 0 1.68722
#> C - A == 0 1.60902
#> C - B == 0 -0.07819
# perform multiple comparisons
# (adjusts for multiple comparisons and takes into account the correlation of coefficients -> more power than e.g. Bonferroni)
smry_rx <- summary(glht_rx)
smry_rx # -> B and C different to A, but not from each other
#>
#> Simultaneous Tests for General Linear Hypotheses
#>
#> Multiple Comparisons of Means: Tukey Contrasts
#>
#>
#> Fit: coxph(formula = Surv(start, stop, status) ~ rx, data = df)
#>
#> Linear Hypotheses:
#> Estimate Std. Error z value Pr(>|z|)
#> B - A == 0 1.68722 0.28315 5.959 <1e-05 ***
#> C - A == 0 1.60902 0.28405 5.665 <1e-05 ***
#> C - B == 0 -0.07819 0.16509 -0.474 0.88
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> (Adjusted p values reported -- single-step method)
# confidence intervals
plot(smry_rx)
Created on 2019-04-01 by the reprex package (v0.2.1)
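An addendum on the stratified log-rank part of the question (a hedged sketch, not part of the reprex above): for right-censored data the score test from a Cox fit coincides with the log-rank test, and coxph() accepts counting-process data, so pairwise log-rank-style comparisons can be approximated by fitting coxph() on each pair of groups and reading off the score test. The p-values here are unadjusted; apply p.adjust() as needed.
# sketch: pairwise Cox score tests on the counting-process data df from above
grp_pairs <- combn(sort(unique(df$rx)), 2, simplify = FALSE)
score_tests <- lapply(grp_pairs, function(p) {
  fit <- coxph(Surv(start, stop, status) ~ rx, data = subset(df, rx %in% p))
  summary(fit)$sctest  # named vector: test statistic, df, p-value
})
names(score_tests) <- sapply(grp_pairs, paste, collapse = " vs ")
score_tests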
I have been trying to train a BART model using the tidymodels framework, but I am running into some problems.
I can declare the model, the recipe, and the workflow alright, but once I fit the workflow, two unwanted things happen:
The original model object (bart_mod below), initially correctly stored, becomes "call: NULL", even though I don't touch the model object directly (I assign nothing to the same object name).
I am not able to retrieve any information about the fitted model. The bart_fit contains nothing, and there seems to be no tidy method associated with it. All this is true even though I am able to predict values using the fitted model! (See the last line of code in the reprex.)
This may very well come from a misunderstanding on my end of how all this works; I am fairly new to tidymodels.
I would appreciate any help! Thank you.
library(tidyverse)
library(tidymodels)
set.seed(2022)
# Parameters --------------------------------------------------------------
n <- 5000
coef_x_var_1 <- 1
coef_x_var_2 <- 2
coef_x_var_3 <- 3
gen_y_1 <- function(data = dataset) {
  return(data$y_0 +
           data$x_var_1 * coef_x_var_1 +
           data$x_var_2 * coef_x_var_2 +
           data$x_var_3 * coef_x_var_3 +
           rnorm(n = nrow(data), mean = 0, sd = 3))
}
# Data generation ---------------------------------------------------------
dataset <- matrix(NA, nrow = n, ncol = 3)
# Generate the unit-level moderators
dataset[,1] <- rnorm(mean = rnorm(n = 1), n = n)
dataset[,2] <- rnorm(mean = rnorm(n = 1), n = n)
dataset[,3] <- rnorm(mean = rnorm(n = 1), n = n)
# Change into dataframe
colnames(dataset) <- c("x_var_1", "x_var_2", "x_var_3")
dataset <- as_tibble(dataset)
# Make sure the variable format is numeric (except for the identifiers)
dataset$x_var_1 <- as.numeric(dataset$x_var_1)
dataset$x_var_2 <- as.numeric(dataset$x_var_2)
dataset$x_var_3 <- as.numeric(dataset$x_var_3)
# Generate the untreated potential outcomes
P0_coefs <- rdunif(n = 6, 1, 15)
dataset$y_0 <-
  dataset$x_var_1 * P0_coefs[4] +
  dataset$x_var_2 * P0_coefs[5] +
  dataset$x_var_3 * P0_coefs[6] +
  rnorm(n = nrow(dataset), mean = 0, sd = 3)
dataset$y_1 <- gen_y_1(data = dataset)
# Create a variable to indicate treatment
treatment_group <- sample(1:nrow(dataset), size = nrow(dataset)/2)
# Indicate which potential outcome you observe
obs_dataset <- dataset |>
  mutate(treated = ifelse(row_number() %in% treatment_group, 1, 0),
         obs_y = ifelse(treated, y_1, y_0))
y1_obs_dataset <- obs_dataset |> filter(treated == 1)
y0_obs_dataset <- obs_dataset |> filter(treated == 0)
# Analysis ----------------------------------------------------------------
covariates <- c("x_var_1", "x_var_2", "x_var_3")
bart_formula <- as.formula(paste0("obs_y ~ ", paste(covariates, collapse = " + ")))
# Create the workflow
bart_mod <- bart() |>
  set_engine("dbarts") |>
  set_mode("regression")
bart_recipe <- recipe(bart_formula, data = obs_dataset) |>
  step_zv(all_predictors())
bart_workflow <-
  workflow() |>
  add_model(bart_mod) |>
  add_recipe(bart_recipe)
# The workflow first looks right
bart_workflow
#> ══ Workflow ════════════════════════════════════════════════════════════════════
#> Preprocessor: Recipe
#> Model: bart()
#>
#> ── Preprocessor ────────────────────────────────────────────────────────────────
#> 1 Recipe Step
#>
#> • step_zv()
#>
#> ── Model ───────────────────────────────────────────────────────────────────────
#> BART Model Specification (regression)
#>
#> Computational engine: dbarts
# Once I fit it though, the model part becomes call: NULL
bart_fit <- bart_workflow |>
fit(y1_obs_dataset)
# Nothing is stored in the fit
bart_fit
#> ══ Workflow [trained] ══════════════════════════════════════════════════════════
#> Preprocessor: Recipe
#> Model: bart()
#>
#> ── Preprocessor ────────────────────────────────────────────────────────────────
#> 1 Recipe Step
#>
#> • step_zv()
#>
#> ── Model ───────────────────────────────────────────────────────────────────────
#>
#> Call:
#> `NULL`()
# The content of this object has changed!
bart_workflow
#> ══ Workflow ════════════════════════════════════════════════════════════════════
#> Preprocessor: Recipe
#> Model: bart()
#>
#> ── Preprocessor ────────────────────────────────────────────────────────────────
#> 1 Recipe Step
#>
#> • step_zv()
#>
#> ── Model ───────────────────────────────────────────────────────────────────────
#>
#> Call:
#> NULL
bart_fit |>
  extract_fit_parsnip()
#> parsnip model object
#>
#>
#> Call:
#> `NULL`()
# And yet, I am able to run a prediction using the fit!
predict(bart_fit, y0_obs_dataset)
#> # A tibble: 2,500 × 1
#> .pred
#> <dbl>
#> 1 -4.67
#> 2 -6.23
#> 3 6.35
#> 4 10.7
#> 5 4.90
#> 6 -13.8
#> 7 4.70
#> 8 19.6
#> 9 -0.907
#> 10 5.38
#> # … with 2,490 more rows
Created on 2022-12-24 with reprex v2.0.2
First stripping Martin's code down to a smaller script:
library(tidyverse)
library(tidymodels)
set.seed(2022)
obs_dataset <- structure(list(x_var_1 = c(-0.273203786163623, 0.0026566250757164,
-0.544359413888551, 0.569128408034224, -2.00048700105319, -0.159113741655834
), obs_y = c(-8.14952415680873, 1.91364235165124, -7.68391811408719,
-9.01497463720505, -18.5017189874949, -13.505685812581)), row.names = c(NA,
-6L), class = c("tbl_df", "tbl", "data.frame"))
bart_formula <- as.formula("obs_y ~ x_var_1")
# Create the workflow
bart_mod <- bart() |>
  set_engine("dbarts") |>
  set_mode("regression")
bart_recipe <- recipe(bart_formula, data = obs_dataset)
bart_workflow <-
  workflow() |>
  add_model(bart_mod) |>
  add_recipe(bart_recipe)
The workflow at first looks right
bart_workflow
> ══ Workflow ════════════════════════════════════════════════════════════════
> Preprocessor: Recipe
> Model: bart()
>
> ── Preprocessor ─────────────────────────────────────────────────────────────
> 0 Recipe Steps
>
> ── Model ────────────────────────────────────────────────────────────────────
> BART Model Specification (regression)
>
> Computational engine: dbarts
but this changes after fitting:
bart_fit <- bart_workflow |>
fit(obs_dataset)
bart_fit
The workflow now displays NULL for the call, as does the model object.
bart_workflow
bart_mod
══ Workflow [trained] ══════════════════════════════════════════════════════
Preprocessor: Recipe
Model: bart()
── Preprocessor ─────────────────────────────────
0 Recipe Steps
── Model ────────────────────────────────────────────────
Call:
`NULL`()
All these display values:
required_pkgs(bart_mod)
print_model_spec(bart_mod)
bart_mod[["engine"]]
bart_mod[["mode"]]
extract_recipe(bart_fit)
extract_preprocessor(bart_fit)
extract_mold(bart_fit)
bart_fit[["fit"]][["fit"]][["spec"]][["engine"]]
bart_fit[["fit"]][["fit"]][["spec"]][["mode"]]
These display NULL:
print(bart_mod)
print(bart_workflow)
print(bart_fit)
extract_fit_engine(bart_fit)
extract_fit_parsnip(bart_fit)
extract_model(bart_fit)
So it seems that the model data is still in the objects and is usable, but the print calls do not display it, and neither do the extract functions.
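One hedged workaround in the meantime: the model data can be pulled out of the trained workflow by indexing the slots directly (the slot path below is my assumption about how workflows wrap parsnip fits; verify against your installed versions).
# sketch: workflow -> parsnip model_fit -> raw dbarts engine object
raw_fit <- bart_fit$fit$fit$fit
class(raw_fit)              # expected: "bart", the dbarts fit
str(raw_fit, max.level = 1) # the components should all still be there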
I have a data frame with response ratios for multiple locations and each location is assigned to a group (region). I want to generate a regression for each group (region) that uses Response Ratio (RR) as the response, location as the unit of replication, and each soil type as a predictor. I would like to use bootstrap resampling to generate confidence intervals around the coefficients for each soil type but I am not sure how to generate this.
#sample data
df <- data.frame(
  group = rep(c('region1', 'region2'), 100),
  subgroup = rep(c('location1', 'location2',
                   'location2', 'location1'), 25),
  predictor = rep(c('soil1', 'soil2', 'soil3', 'soil4'), 25),
  RR = rnorm(200)
)
Adding the script from @Rui below. I actually have a multiple regression, so I added an additional predictor. It is still unclear to me how to extract the coefficient CIs for both soil type and temperature.
library(boot)
bootfun <- function(data, i) {
  d <- data[i, ]
  fit <- lm(RR ~ soil_type + temperature, data = d)
  coef(fit)
}
set.seed(123)
df <- data.frame(
  group = rep(c('region1', 'region2'), 100),
  subgroup = rep(c('location1', 'location2',
                   'location2', 'location1'), 25),
  soil_type = rep(c('soil1', 'soil2', 'soil3', 'soil4'), 25),
  temperature = abs(rnorm(100, 2, 1.75)),
  RR = rnorm(200),
  stringsAsFactors = TRUE
)
R <- 1000
b_list <- by(df, df$group, \(X) {
  boot(X, bootfun, R, strata = X$subgroup)
})
b_list$region1
The boot() function in package boot has a strata argument. Split the data by group and apply a boot function within each group, stratifying by location, for instance.
library(boot)
bootfun <- function(data, i) {
  d <- data[i, ]
  fit <- lm(RR ~ predictor, data = d)
  coef(fit)
}
set.seed(2022)
df <- data.frame(
  group = rep(c('region1', 'region2'), 100),
  subgroup = rep(c('location1', 'location2',
                   'location2', 'location1'), 25),
  predictor = rep(c('soil1', 'soil2', 'soil3', 'soil4'), 25),
  RR = rnorm(200),
  stringsAsFactors = TRUE
)
R <- 1000
b_list <- by(df, df$group, \(X) {
  boot(X, bootfun, R, strata = X$subgroup)
})
b_list$region1
#>
#> STRATIFIED BOOTSTRAP
#>
#>
#> Call:
#> boot(data = X, statistic = bootfun, R = R, strata = X$subgroup)
#>
#>
#> Bootstrap Statistics :
#> original bias std. error
#> t1* -0.2608885 0.000469295 0.1541482
#> t2* 0.3502007 -0.004239248 0.2083503
b_list$region2
#>
#> STRATIFIED BOOTSTRAP
#>
#>
#> Call:
#> boot(data = X, statistic = bootfun, R = R, strata = X$subgroup)
#>
#>
#> Bootstrap Statistics :
#> original bias std. error
#> t1* -0.03727332 -0.0001557172 0.1422502
#> t2* 0.11987005 0.0016393125 0.1952310
lapply(b_list, boot.ci)
#> Warning in sqrt(tv[, 2L]): NaNs produced
#> Warning in sqrt(tv[, 2L]): NaNs produced
#> $region1
#> BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
#> Based on 1000 bootstrap replicates
#>
#> CALL :
#> FUN(boot.out = X[[i]])
#>
#> Intervals :
#> Level Normal Basic Studentized
#> 95% (-0.5635, 0.0408 ) (-0.5611, 0.0545 ) (-0.8227, -0.0225 )
#>
#> Level Percentile BCa
#> 95% (-0.5762, 0.0393 ) (-0.5733, 0.0446 )
#> Calculations and Intervals on Original Scale
#>
#> $region2
#> BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
#> Based on 1000 bootstrap replicates
#>
#> CALL :
#> FUN(boot.out = X[[i]])
#>
#> Intervals :
#> Level Normal Basic Studentized
#> 95% (-0.3159, 0.2417 ) (-0.3260, 0.2460 ) (-0.3493, 0.1757 )
#>
#> Level Percentile BCa
#> 95% (-0.3206, 0.2514 ) (-0.3321, 0.2352 )
#> Calculations and Intervals on Original Scale
Created on 2022-10-25 with reprex v2.0.2
Edit
To get the bootstrapped confidence intervals for each coefficient, the code below uses two nested loops. The outer loop is over regions, following the original data partition. The inner loop is over index, meaning over the columns of the matrix t returned by boot; see help("boot"), section Value. The indices are the column numbers in either of
b_list$region1$t
b_list$region2$t
each of which has 3 columns.
library(boot)
npars <- ncol(b_list$region1$t)
ci_list <- lapply(b_list, \(region) {
  ci <- lapply(seq.int(npars), \(index) {
    boot.ci(region, index = index, type = c("norm", "basic", "perc", "bca"))
  })
  setNames(ci, c("Intercept", "soil", "temperature"))
})
ci_list$region1$Intercept
#> BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
#> Based on 1000 bootstrap replicates
#>
#> CALL :
#> boot.ci(boot.out = region, type = c("norm", "basic", "perc",
#> "bca"), index = index)
#>
#> Intervals :
#> Level Normal Basic
#> 95% (-0.2517, 0.6059 ) (-0.2423, 0.6043 )
#>
#> Level Percentile BCa
#> 95% (-0.2410, 0.6056 ) (-0.2414, 0.6048 )
#> Calculations and Intervals on Original Scale
ci_list$region2$temperature
#> BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
#> Based on 1000 bootstrap replicates
#>
#> CALL :
#> boot.ci(boot.out = region, type = c("norm", "basic", "perc",
#> "bca"), index = index)
#>
#> Intervals :
#> Level Normal Basic
#> 95% (-0.2317, 0.0420 ) (-0.2416, 0.0404 )
#>
#> Level Percentile BCa
#> 95% (-0.2278, 0.0542 ) (-0.2265, 0.0570 )
#> Calculations and Intervals on Original Scale
Created on 2022-10-25 with reprex v2.0.2
Edit 2
As I say in a comment below, in the new data the soil type uniquely identifies region/location pairs; unique(df[1:3]) shows it. So it becomes pointless to split by group and stratify within groups.
bootfun2 <- function(data, i) {
  d <- data[i, ]
  fit <- lm(RR ~ temperature + soil_type, data = d)
  coef(fit)
}
unique(df[1:3]) # soil type uniquely identifies region/location
#> group subgroup soil_type
#> 1 region1 location1 soil1
#> 2 region2 location2 soil2
#> 3 region1 location2 soil3
#> 4 region2 location1 soil4
fit <- lm(RR ~ temperature + soil_type, data = df)
coef(fit)
#> (Intercept) temperature soil_typesoil2 soil_typesoil3 soil_typesoil4
#> 0.25928498 -0.06352205 -0.17739104 -0.05243836 -0.20408527
set.seed(2022)
R <- 1000
b_3 <- boot(df, bootfun2, R)
b_3
#>
#> ORDINARY NONPARAMETRIC BOOTSTRAP
#>
#>
#> Call:
#> boot(data = df, statistic = bootfun2, R = R)
#>
#>
#> Bootstrap Statistics :
#> original bias std. error
#> t1* 0.25928498 0.005724634 0.18033509
#> t2* -0.06352205 -0.002910677 0.05161868
#> t3* -0.17739104 0.004932486 0.18665594
#> t4* -0.05243836 0.005796168 0.19602658
#> t5* -0.20408527 0.004914674 0.20355549
btype <- c("norm","basic", "perc", "bca")
ci_list3 <- lapply(seq_len(ncol(b_3$t)), \(index) {
  boot.ci(b_3, type = btype, index = index)
})
names(ci_list3) <- names(coef(fit))
ci_list3
#> $`(Intercept)`
#> BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
#> Based on 1000 bootstrap replicates
#>
#> CALL :
#> boot.ci(boot.out = b_3, type = btype, index = index)
#>
#> Intervals :
#> Level Normal Basic
#> 95% (-0.0999, 0.6070 ) (-0.0868, 0.6172 )
#>
#> Level Percentile BCa
#> 95% (-0.0986, 0.6054 ) (-0.0992, 0.6034 )
#> Calculations and Intervals on Original Scale
#>
#> $temperature
#> BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
#> Based on 1000 bootstrap replicates
#>
#> CALL :
#> boot.ci(boot.out = b_3, type = btype, index = index)
#>
#> Intervals :
#> Level Normal Basic
#> 95% (-0.1618, 0.0406 ) (-0.1631, 0.0401 )
#>
#> Level Percentile BCa
#> 95% (-0.1672, 0.0360 ) (-0.1552, 0.0503 )
#> Calculations and Intervals on Original Scale
#>
#> $soil_typesoil2
#> BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
#> Based on 1000 bootstrap replicates
#>
#> CALL :
#> boot.ci(boot.out = b_3, type = btype, index = index)
#>
#> Intervals :
#> Level Normal Basic
#> 95% (-0.5482, 0.1835 ) (-0.5541, 0.1955 )
#>
#> Level Percentile BCa
#> 95% (-0.5503, 0.1994 ) (-0.5542, 0.1927 )
#> Calculations and Intervals on Original Scale
#>
#> $soil_typesoil3
#> BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
#> Based on 1000 bootstrap replicates
#>
#> CALL :
#> boot.ci(boot.out = b_3, type = btype, index = index)
#>
#> Intervals :
#> Level Normal Basic
#> 95% (-0.4424, 0.3260 ) (-0.4399, 0.3068 )
#>
#> Level Percentile BCa
#> 95% (-0.4117, 0.3350 ) (-0.4116, 0.3350 )
#> Calculations and Intervals on Original Scale
#>
#> $soil_typesoil4
#> BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
#> Based on 1000 bootstrap replicates
#>
#> CALL :
#> boot.ci(boot.out = b_3, type = btype, index = index)
#>
#> Intervals :
#> Level Normal Basic
#> 95% (-0.6080, 0.1900 ) (-0.6116, 0.2127 )
#>
#> Level Percentile BCa
#> 95% (-0.6208, 0.2035 ) (-0.6284, 0.1801 )
#> Calculations and Intervals on Original Scale
Created on 2022-10-25 with reprex v2.0.2
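A small convenience sketch, assuming ci_list3 from above: collect the 95% percentile limits from each boot.ci object into one data frame (boot.ci stores the percentile interval in the last two entries of the $percent component).
# gather the percentile CIs into a tidy table
ci_tbl <- do.call(rbind, lapply(names(ci_list3), function(nm) {
  pc <- ci_list3[[nm]]$percent
  data.frame(term = nm, lower = pc[4], upper = pc[5])
}))
ci_tbl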
REWRITING ORIGINAL QUESTION
I want a function that takes a vector and a pre-defined exponential smoothing model (in this example, simple exponential smoothing with alpha = 0.5) and does one-step-ahead forecasting on the input vector. The code below, though clunky, does what I want. Is there a better way to do this, or a built-in function / package function that does it?
The second part to this question is: If I fit an ETS or ARIMA model with, say, the {fable} package, is there a function that can take the fitted ETS/ARIMA model and do the same one-step-ahead forecasting on a vector, like in the first part?
# Ref: # https://stats.stackexchange.com/questions/44984/how-do-you-use-simple-exponential-smoothing-in-r
# NOT fully tested, just in this case. Gives one-step-ahead forecasts for an exponential smoothing model
ses <- function(x, alpha, beta = FALSE, gamma = FALSE) {
  ## Populate this vector with return values
  result_vec <- numeric(length(x))
  result_vec[1] <- NA
  for (i in 2:length(x)) {
    mod <- HoltWinters(x[seq(i)], alpha = alpha, beta = beta, gamma = gamma)
    result_vec[i] <- predict(mod, n.ahead = 1)
  }
  result_vec
}
new_wt <- c(1, 0, 0, 0, 0, 1, 1, 1)
ses(new_wt, 0.5)
#> [1] NA 0.5000000 0.2500000 0.1250000 0.0625000 0.5312500 0.7656250
#> [8] 0.8828125
Created on 2020-10-22 by the reprex package (v0.3.0)
These are the question and post I was looking for; they solved it for me:
https://stats.stackexchange.com/questions/265115/one-step-ahead-forecast
https://robjhyndman.com/hyndsight/out-of-sample-one-step-forecasts/
I was then able to leverage the forecast::ets() model builder to build what I want for a filter: re-call ets() with the new data I want one-step-ahead forecasts for (plus the original model), then call fitted() on that new model to get the one-step-ahead forecasts. It matches what I want a filter to do closely enough (with some off-by-one differences in the output that I can deal with easily).
It mostly worked with the {fable} package's ETS() call as well. However, I don't know how to specify the starting values to make the results match exactly. It is pretty close overall, though, and over time the two should roughly agree.
Below is sample code with what I had and what the ultimate solution was in the end.
# Ref: # https://stats.stackexchange.com/questions/44984/how-do-you-use-simple-exponential-smoothing-in-r
# Hand-rolled solution
# NOT fully tested, just in this case. Gives one-step-ahead forecasts for an exponential smoothing model
ses <- function(x, alpha, beta = FALSE, gamma = FALSE) {
  ## Populate this vector with return values
  result_vec <- numeric(length(x))
  result_vec[1] <- NA
  for (i in 2:length(x)) {
    ## One method
    mod <- HoltWinters(x[seq(i)], alpha = alpha, beta = beta, gamma = gamma)
    result_vec[i] <- predict(mod, n.ahead = 1)
    ## A similar method
    # result_vec[i] <- forecast::ses(x[seq(i)], alpha = alpha, beta = beta, gamma = gamma, h = 1)$mean
  }
  result_vec
}
new_wt <- c(1, 0, 0, 0, 0, 1, 1, 1)
ses(new_wt, 0.5)
#> [1] NA 0.5000000 0.2500000 0.1250000 0.0625000 0.5312500 0.7656250
#> [8] 0.8828125
## From: https://robjhyndman.com/hyndsight/out-of-sample-one-step-forecasts/
## I can do one-step-ahead forecasts with the forecast package
library(forecast)
#> Registered S3 method overwritten by 'quantmod':
#> method from
#> as.zoo.data.frame zoo
#>
#> Attaching package: 'forecast'
#> The following object is masked _by_ '.GlobalEnv':
#>
#> ses
## Values to do one-step-ahead forecasts on
new_wt <- c(1, 0, 0, 0, 0, 1, 1, 1)
## Fit a model with hand-picked parameters, in this case, and walk the forecast
## ahead a step at a time.
initial_value <- 0
fit <- forecast::ets(initial_value, model = "ANN", alpha = 0.5, use.initial.values = TRUE)
fit2 <- forecast::ets(new_wt, model = fit, use.initial.values = TRUE)
fitted(fit2) %>% as.vector()
#> [1] 0.0000000 0.5000000 0.2500000 0.1250000 0.0625000 0.0312500 0.5156250
#> [8] 0.7578125
## With fable, I can't seem to make it work:
library(fable)
#> Loading required package: fabletools
#>
#> Attaching package: 'fabletools'
#> The following objects are masked from 'package:forecast':
#>
#> accuracy, forecast
library(tibble)
new_wt_ts <-
  tibble(value = new_wt, idx = seq(length(new_wt))) %>%
  as_tsibble(index = idx)
myfable <-
  model(new_wt_ts,
        ets = ETS(value ~ error("A") + trend("N", alpha = 0.5) + season("N")))
## This is close to the others, and goes in the right direction, not sure why it doesn't match?
fitted(myfable, new_wt_ts)
#> # A tsibble: 8 x 3 [1]
#> # Key: .model [1]
#> .model idx .fitted
#> <chr> <int> <dbl>
#> 1 ets 1 0.531
#> 2 ets 2 0.765
#> 3 ets 3 0.383
#> 4 ets 4 0.191
#> 5 ets 5 0.0957
#> 6 ets 6 0.0478
#> 7 ets 7 0.524
#> 8 ets 8 0.762
Created on 2020-10-22 by the reprex package (v0.3.0)
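On the open {fable} question, a hedged pointer I have not fully verified: fabletools provides refit(), which applies an already-estimated model to new data, and with reuse = TRUE it keeps the estimated parameters; that is the same idea as forecast::ets(new_wt, model = fit). Something along these lines (new_wt2_ts is a made-up new series here; the initial state may still differ from the hand-rolled version, so check the first few values):
## sketch: one-step-ahead fits on new data by reusing the fitted fable model
new_wt2_ts <-
  tibble(value = c(0, 1, 1, 0, 1, 0, 0, 1), idx = 1:8) %>%
  as_tsibble(index = idx)
myfable %>%
  refit(new_wt2_ts, reuse = TRUE) %>%
  fitted()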
I am trying to run a post hoc analysis on an unbalanced two-way ANOVA using the anova_test function in the rstatix package. I need to run this post hoc test iteratively, as I have ~26 response (y) variables. My first step is to create models of all my y variables in relation to group and treatment. I have successfully managed to do this, creating a single list with 26 models:
models <- map(data[, y1:y26], ~ lm(.x ~ data$group * data$treatment))
Now comes the part I'm stuck on: referring to these models iteratively. I would like to run the following code for every y variable I have:
data %>%
  group_by(group) %>%
  anova_test(y ~ treatment, error = models[[y]], type = 3)
where y changes each time and, as it does, the "model" (referred to in the error = term) is updated accordingly. I'm struggling with this bit since the first set of models I make is used to inform the second set of models.
However, if I run just one y variable through this whole bit of code at one time, I get the appropriate results.
model <- lm(y ~ group*treatment, data = data)
data %>%
group_by(group) %>%
anova_test(y ~ treatment, error = model, type = 3)
I have tried creating a for loop as well as using the map function from the purrr package, but I have been unsuccessful. I am new to for loops and purrr, so I am sure it's a simple fix I just can't see.
Basically I want a way to run
data %>%
group_by(group) %>%
anova_test(y ~ treatment, error = model, type = 3)
iteratively for different y variables (y1, y2, ..., y26) while also referring to the appropriate model (models$y1, models$y2, ..., models$y26).
Thanks for your help!
Well, you didn't give any data, so let's use ToothGrowth. You seem to like the model format, so let's build a list of models. You could do this in an automated fashion, but to make it clear let's do it by hand. Then call purrr::map with the anova_test function. You'll get a list back. Since you're in charge of naming the list elements, go to town.
Updated answer May 18th. Now using map2, since you want two different models passed: build a list for each.
library(rstatix)
library(purrr)
ToothGrowth$len2 <- ToothGrowth$len^2 # for variety
models <- list(model1 = lm(len ~ supp*dose, ToothGrowth),
model2 = lm(len ~ dose*supp, ToothGrowth),
model3 = lm(len2 ~ dose*supp, ToothGrowth),
model4 = lm(len2 ~ supp*dose, ToothGrowth))
models2 <- list(model1 = lm(len ~ supp, ToothGrowth),
model2 = lm(len ~ dose, ToothGrowth),
model3 = lm(len2 ~ dose, ToothGrowth),
model4 = lm(len2 ~ supp, ToothGrowth))
# one model
purrr::map(models, ~ anova_test(.x, type = 3))
# now with model for error term
purrr::map2(models, models2, ~ anova_test(.x, error = .y, type = 3))
#> Coefficient covariances computed by hccm()
#> Coefficient covariances computed by hccm()
#> Coefficient covariances computed by hccm()
#> Coefficient covariances computed by hccm()
#> $model1
#> ANOVA Table (type III tests)
#>
#> Effect DFn DFd F p p<.05 ges
#> 1 supp 1 58 4.058 0.049000 * 0.065
#> 2 dose 1 58 12.717 0.000734 * 0.180
#> 3 supp:dose 1 58 1.588 0.213000 0.027
#>
#> $model2
#> ANOVA Table (type III tests)
#>
#> Effect DFn DFd F p p<.05 ges
#> 1 dose 1 58 33.626 2.92e-07 * 0.367
#> 2 supp 1 58 10.729 2.00e-03 * 0.156
#> 3 dose:supp 1 58 4.200 4.50e-02 * 0.068
#>
#> $model3
#> ANOVA Table (type III tests)
#>
#> Effect DFn DFd F p p<.05 ges
#> 1 dose 1 58 36.028 1.35e-07 * 0.383
#> 2 supp 1 58 7.128 1.00e-02 * 0.109
#> 3 dose:supp 1 58 2.709 1.05e-01 0.045
#>
#> $model4
#> ANOVA Table (type III tests)
#>
#> Effect DFn DFd F p p<.05 ges
#> 1 supp 1 58 2.684 0.107000 0.044
#> 2 dose 1 58 13.566 0.000508 * 0.190
#> 3 supp:dose 1 58 1.020 0.317000 0.017
Thanks to Nirgrahamuk from the RStudio Community forum for this answer:
map(names(models_1),
    ~ anova_test(data = group_by(df, edge),
                 formula = as.formula(paste0(.x, " ~ trt")),
                 error = models_1[[.x]],
                 type = 3))
(see their full answer at: https://community.rstudio.com/t/trouble-using-group-by-and-map2-together/66730/8?u=mvula)
Created on 2020-05-20 by the reprex package (v0.3.0)
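To tie that pattern back to the ToothGrowth objects above, a hedged adaptation (the responses lookup vector is mine, not from the forum answer; it keeps each formula's response in sync with its error model):
library(dplyr)
# sketch: iterate over the error-model names, building each formula from a
# named vector of responses
responses <- c(model1 = "len", model2 = "len", model3 = "len2", model4 = "len2")
map(names(models2),
    ~ anova_test(data = group_by(ToothGrowth, supp),
                 formula = as.formula(paste0(responses[[.x]], " ~ dose")),
                 error = models2[[.x]],
                 type = 3))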
Disclaimer: This question is closely related to this one I asked two days ago, but now it concerns getting between and overall R2 into the stargazer() output rather than into summary() as before.
Is there a way to get plm() to calculate between R2 and overall R2 for me and include them in the stargazer() output?
To clarify what I mean with between, overall, and within R2 see this answer on StackExchange.
My understanding is that plm only calculates within R2.
I am running a Twoways effects Within Model.
library(plm)
library(stargazer)
# Create some random data
set.seed(1)
x=rnorm(100); fe=rep(rnorm(10),each=10); id=rep(1:10,each=10); ti=rep(1:10,10); e=rnorm(100)
y=x+fe+e
data=data.frame(y,x,id,ti)
# Get plm within R2
reg=plm(y~x,model="within",index=c("id","ti"), effect = "twoways", data=data)
stargazer(reg)
I now also want to include between and overall R2 in the stargazer() output. How can I do that?
To make it explicit what I mean with between and overall R2:
# Pooled Version (overall R2)
reg1=lm(y~x)
summary(reg1)$r.squared
# Between R2
y.means=tapply(y,id,mean)[id]
x.means=tapply(x,id,mean)[id]
reg2=lm(y.means~x.means)
summary(reg2)$r.squared
To do this in stargazer, you can use the add.lines argument. However, this adds the lines to the beginning of the summary statistics section, and there is no way to alter that without messing with the source code, which is beastly. I much prefer huxtable, which provides a grammar of table building and is much more extensible and customizable.
library(tidyverse)
library(plm)
library(huxtable)
# Create some random data
set.seed(1)
x=rnorm(100); fe=rep(rnorm(10),each=10); id=rep(1:10,each=10); ti=rep(1:10,10); e=rnorm(100)
y=x+fe+e
data=data.frame(y,x,id,ti)
# Get plm within R2
reg=plm(y~x,model="within",index=c("id","ti"), effect = "twoways", data=data)
stargazer(reg, type = "text",
add.lines = list(c("Overall R2", round(r.squared(reg, model = "pooled"), 3)),
c("Between R2", round(r.squared(update(reg, effect = "individual", model = "between")), 3))))
#>
#> ========================================
#> Dependent variable:
#> ---------------------------
#> y
#> ----------------------------------------
#> x 1.128***
#> (0.113)
#>
#> ----------------------------------------
#> Overall R2 0.337
#> Between R2 0.174
#> Observations 100
#> R2 0.554
#> Adjusted R2 0.448
#> F Statistic 99.483*** (df = 1; 80)
#> ========================================
#> Note: *p<0.1; **p<0.05; ***p<0.01
# I prefer huxreg, which is much more customizable!
# Create a data frame of the R2 values
r2s <- tibble(
  name = c("Overall R2", "Between R2"),
  value = c(r.squared(reg, model = "pooled"),
            r.squared(update(reg, effect = "individual", model = "between"))))
tab <- huxreg(reg) %>%
  # Add new R2 values
  add_rows(hux(r2s), after = 4)
# Rename R2
tab[7, 1] <- "Within R2"
tab %>% huxtable::print_screen()
#> ─────────────────────────────────────────────────
#> (1)
#> ─────────────────────────
#> x 1.128 ***
#> (0.113)
#> ─────────────────────────
#> N 100
#> Overall R2 0.337
#> Between R2 0.174
#> Within R2 0.554
#> ─────────────────────────────────────────────────
#> *** p < 0.001; ** p < 0.01; * p < 0.05.
#>
#> Column names: names, model1
Created on 2020-04-08 by the reprex package (v0.3.0)
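As a quick cross-check (a sketch appended here, not part of the original reprex): plm can also fit the pooling and between models directly, and r.squared() on those fits should reproduce the manual lm() numbers from the question.
# overall (pooled) R2
r.squared(plm(y ~ x, data = data, index = c("id", "ti"), model = "pooling"))
# between R2
r.squared(plm(y ~ x, data = data, index = c("id", "ti"), model = "between"))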