Train time series models in caret by group - r

I have a data set like the following:
library(data.table)
library(caret)
set.seed(503)
foo <- data.table(group = rep(LETTERS[1:6], 150),
y = rnorm(n = 6 * 150, mean = 5, sd = 2),
x1 = rnorm(n = 6 * 150, mean = 5, sd = 10),
x2 = rnorm(n = 6 * 150, mean = 25, sd = 10),
x3 = rnorm(n = 6 * 150, mean = 50, sd = 10),
x4 = rnorm(n = 6 * 150, mean = 0.5, sd = 10),
x5 = sample(c(1, 0), size = 6 * 150, replace = T))
foo[, period := 1:.N, by = group]
Problem: I want to forecast y one step ahead, for each group, using variables x1, ..., x5
I want to run a few models in caret to decide which I will use.
As of now, I am running it in a loop using timeslice
window.length <- 115
timecontrol <- trainControl(method = 'timeslice',
initialWindow = window.length,
horizon = 1,
selectionFunction = "best",
fixedWindow = TRUE,
savePredictions = 'final')
model_list <- list()
for(g in unique(foo$group)){
for(model in c("xgbTree", "earth", "cubist")){
dat <- foo[group == g][, c('group', 'period') := NULL]
model_list[[g]][[model]] <- train(y ~ . - 1,
data = dat,
method = model,
trControl = timecontrol)
}
}
However, I would like to run all groups at the same time, using dummy variables to identify each one, like
dat <- cbind(foo, model.matrix(~ group- 1, foo))
y x1 x2 x3 x4 x5 period groupA groupB groupC groupD groupE groupF
1: 5.710250 11.9615460 22.62916 31.04790 -4.821331e-04 1 1 1 0 0 0 0 0
2: 3.442213 8.6558983 32.41881 45.70801 3.255423e-01 1 1 0 1 0 0 0 0
3: 3.485286 7.7295448 21.99022 56.42133 8.668391e+00 1 1 0 0 1 0 0 0
4: 9.659601 0.9166456 30.34609 55.72661 -7.666063e+00 1 1 0 0 0 1 0 0
5: 5.567950 3.0306864 22.07813 52.21099 5.377153e-01 1 1 0 0 0 0 1 0
But still running the time series with the correct time ordering using timeslice.
Is there a way to declare the time variable in trainControl, so that each one-step-ahead forecast adds six more observations (one per group) per round and drops the first six?
I can do it by ordering the data and adjusting the horizon argument (given n groups, order by the time variable and set horizon = n), but this has to change whenever the number of groups changes. And initialWindow has to be time * n_groups:
timecontrol <- trainControl(method = 'timeslice',
initialWindow = window.length * length(unique(foo$group)),
horizon = length(unique(foo$group)),
selectionFunction = "best",
fixedWindow = TRUE,
savePredictions = 'final')
Is there any other way?

I think the answer you are looking for is actually quite simple. You can use the skip argument of trainControl() to thin the resamples so that the window moves forward one whole period at a time. That way each group-period is predicted only once, a period is never split between the training set and the test set, and there is no information leakage.
Using the example you provided, with the rows ordered by period, set horizon = 6 (the number of groups), initialWindow = 115 * 6 and skip = 5. The window moves forward by skip + 1 rows between resamples, so skip has to be the number of groups minus one. The first test set then includes all groups for period 116, the next test set includes all groups for period 117, and so on.
library(caret)
library(tidyverse)
set.seed(503)
foo <- tibble(group = rep(LETTERS[1:6], 150),
y = rnorm(n = 6 * 150, mean = 5, sd = 2),
x1 = rnorm(n = 6 * 150, mean = 5, sd = 10),
x2 = rnorm(n = 6 * 150, mean = 25, sd = 10),
x3 = rnorm(n = 6 * 150, mean = 50, sd = 10),
x4 = rnorm(n = 6 * 150, mean = 0.5, sd = 10),
x5 = sample(c(1, 0), size = 6 * 150, replace = TRUE)) %>%
group_by(group) %>%
mutate(period = row_number()) %>%
ungroup()
dat <- cbind(foo, model.matrix(~ group- 1, foo)) %>%
select(-group)
window.length <- 115
n_groups <- length(unique(foo$group))
timecontrol <- trainControl(
method = 'timeslice',
initialWindow = window.length * n_groups,
horizon = n_groups,
# the window advances by skip + 1 rows, so this moves it one period at a time
skip = n_groups - 1,
selectionFunction = "best",
fixedWindow = TRUE,
savePredictions = 'final'
)
model_names <- c("xgbTree", "earth", "cubist")
fits <- map(model_names,
~ train(
y ~ . - 1,
data = dat,
method = .x,
trControl = timecontrol
)) %>%
set_names(model_names)
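To check which rows end up in each test set before fitting anything, the slices can be built by hand. The following is a minimal base R sketch, not caret's own code: the helper time_slices() is hypothetical, and it assumes the origin moves forward by skip + 1 rows per resample, which is how I read caret::createTimeSlices(). Calling createTimeSlices() with the same arguments should confirm the behaviour on your installed version.

```r
# Base R sketch of rolling-origin slices over rows stacked by period.
# Assumption: the origin advances by `skip + 1` rows per resample.
time_slices <- function(n, initial_window, horizon, skip) {
  stops <- seq(initial_window, n - horizon, by = skip + 1)
  list(
    train = lapply(stops, function(s) seq(s - initial_window + 1, s)),
    test  = lapply(stops, function(s) seq(s + 1, s + horizon))
  )
}

n_groups  <- 6
n_periods <- 150
# Rows are ordered by period, one row per group within each period.
period <- rep(seq_len(n_periods), each = n_groups)

# Number of distinct periods touched by each test set for a given skip.
periods_per_test <- function(skip) {
  slices <- time_slices(
    n = length(period),
    initial_window = 115 * n_groups,
    horizon = n_groups,
    skip = skip
  )
  vapply(slices$test, function(idx) length(unique(period[idx])), integer(1))
}

# With skip = n_groups - 1 every test set covers exactly one period.
table(periods_per_test(skip = n_groups - 1))
# With skip = n_groups most test sets straddle two periods.
table(periods_per_test(skip = n_groups))
```

The same check works for any number of groups, since everything is derived from n_groups.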

I would use tidyr::nest() to nest groups and then iterate over the data with purrr::map(). This approach is much more flexible because it can accommodate different group sizes, different numbers of groups, and variable models or other arguments passed to caret::train(). Also, you can easily run everything in parallel using furrr.
Load packages and create data
I use tibble instead of data.table. I also reduce the size of the data.
library(caret)
library(tidyverse)
set.seed(503)
foo <- tibble(
group = rep(LETTERS[1:6], 10),
y = rnorm(n = 6 * 10, mean = 5, sd = 2),
x1 = rnorm(n = 6 * 10, mean = 5, sd = 10),
x2 = rnorm(n = 6 * 10, mean = 25, sd = 10),
x3 = rnorm(n = 6 * 10, mean = 50, sd = 10),
x4 = rnorm(n = 6 * 10, mean = 0.5, sd = 10),
x5 = sample(c(1, 0), size = 6 * 10, replace = T)
) %>%
group_by(group) %>%
mutate(period = row_number()) %>%
ungroup()
Reduce initialWindow size
window.length <- 9
timecontrol <- trainControl(
method = 'timeslice',
initialWindow = window.length,
horizon = 1,
selectionFunction = "best",
fixedWindow = TRUE,
savePredictions = 'final'
)
Create a function that will return a list of fit model objects
# To fit each model in model_list to data and return model fits as a list.
fit_models <- function(data, model_list, timecontrol) {
map(model_list,
~ train(
y ~ . - 1,
data = data,
method = .x,
trControl = timecontrol
)) %>%
set_names(model_list)
}
Fit models
model_list <- c("xgbTree", "earth", "cubist")
mods <- foo %>%
nest(data = -group)
mods <- mods %>%
mutate(fits = map(
data,
~ fit_models(
data = .x,
model_list = model_list,
timecontrol = timecontrol
)
))
If you want to view the results for a particular group / model you can do:
mods[which(mods$group == "A"), ]$fits[[1]]$xgbTree
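The nest-and-map pattern can also be tried in miniature without caret or the tidyverse. The sketch below is only an illustration of the structure: lm() stands in for caret::train(), and the toy data, the specs list and the helper fit_specs() are all hypothetical names that do not appear in the answer above.

```r
# Base R analogue of nest() + map(): one list element per group,
# one fitted model per specification. lm() stands in for caret::train().
set.seed(503)
toy <- data.frame(
  group = rep(LETTERS[1:3], times = 20),
  x1 = rnorm(60),
  x2 = rnorm(60)
)
toy$y <- 1 + 2 * toy$x1 + rnorm(60)

# Hypothetical candidate specifications, playing the role of model_list.
specs <- list(small = y ~ x1, full = y ~ x1 + x2)

# Fit every specification to one group's data and return a named list.
fit_specs <- function(data, specs) {
  lapply(specs, function(f) lm(f, data = data))
}

# split() plays the role of nest(); lapply() plays the role of map().
by_group <- split(toy[, c("y", "x1", "x2")], toy$group)
fits <- lapply(by_group, fit_specs, specs = specs)

# Access one group / model combination by name.
coef(fits$A$full)
```

Because each group is an independent list element, groups of different sizes need no special handling.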
Use furrr for parallel processing
Just initialize workers with plan(multisession) and change map to future_map (plan(multiprocess) is deprecated in recent versions of the future package). Note you might want to change the number of workers to something less than 6 if your computer has fewer than 6 processing cores.
library(furrr)
plan(multisession, workers = 6)
mods <- foo %>%
nest(data = -group)
mods <- mods %>%
mutate(fits = future_map(
data,
~ fit_models(
data = .x,
model_list = model_list,
timecontrol = timecontrol
)
))

Related

How to provide group-wise boundaries for parameters in modelling using R nls_multstart?

I am new to using the purrr package in R and I am struggling with trying to pass a further argument to a function inside nls_multstart.
I have a nested data frame that contains data for different combinations of grouping variables.
I want to fit the same model to the data of each combination of groups in the nested data frame.
So far, I was able to fit the model to each data set.
# packages
library(tidyverse)
library(nls.multstart)
# model
my_model <- function(ymax, k, t) {
ymax * (1 - exp(-k*t))
}
# data
t = seq(from = 1, to = 100, by = 1)
y1 = unlist(lapply(t, my_model, ymax = 500, k = 0.04))
y2 = unlist(lapply(t, my_model, ymax = 800, k = 0.06))
y = c(y1, y2)
a <- rep(x = "a", times = 100)
b <- rep(x = "b", times = 100)
groups <- c(a, b)
df <- data.frame(groups, t, y)
nested <- df %>%
group_by(groups) %>%
nest() %>%
rowwise() %>%
ungroup() %>%
mutate(maximum = map_dbl(map(data, "y"), max))
# set starting values
l <- c(ymax = 100 , k = 0.02)
u <- c(ymax = 300, k = 0.03)
# works, but without group-specific lower and upper boundaries
# fit the model
fit <- nested %>%
mutate(res = map(.x = data,
~ nls_multstart(y ~ my_model(ymax, k, t = t),
data = .x,
iter = 20,
start_lower = l,
start_upper = u,
supp_errors = 'N',
na.action = na.omit)))
However, when trying to use the value in column maximum as a group-specific boundary, R throws the following error:
# using group-specific boundary does not work
# fit the model
fit2 <- nested %>%
mutate(res = map(.x = data,
~ nls_multstart(y ~ my_model(ymax, k, t = t),
data = .x,
iter = 20,
start_lower = l,
start_upper = u,
supp_errors = 'N',
na.action = na.omit,
lower = c(maximum, 0),
upper = c(maximum*1.2, 1))))
Error in nls.lm(par = start, fn = FCT, jac = jac, control = control, lower = lower, :
length(lower) must be equal to length(par)
Can anybody give a hint how to improve on that?
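The error message is consistent with maximum being picked up as the whole column rather than one value per group: the map() call iterates over data only, so c(maximum, 0) has one element per group plus one, while the model has two parameters. A minimal base R sketch of that length mismatch, and of iterating over data and maximum in parallel, follows. The names data_list and bounds are hypothetical; in purrr the parallel iteration would typically be written with map2(), which I have not run against nls_multstart here.

```r
# Same model and data as in the question, kept as a plain list per group.
my_model <- function(ymax, k, t) ymax * (1 - exp(-k * t))
t <- 1:100
data_list <- list(
  a = data.frame(t = t, y = my_model(500, 0.04, t)),
  b = data.frame(t = t, y = my_model(800, 0.06, t))
)
maximum <- vapply(data_list, function(d) max(d$y), numeric(1))

# Iterating over the data only, `maximum` is the full vector (one per group),
# so the bound has 3 elements for a 2-parameter model.
length(c(maximum, 0))

# Iterating over data and maximum together yields one bound pair per group.
bounds <- Map(
  function(d, m) {
    list(
      lower = c(ymax = m, k = 0),
      upper = c(ymax = m * 1.2, k = 1)
    )
  },
  data_list, maximum
)
bounds$a
```

Each element of bounds then has lower and upper vectors of length two, matching the parameters ymax and k.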

Fitting data frame probability distributions with different lengths - EnvStat - looping in R

I'm trying to fit probability distributions in R using the EnvStats package, looping over the columns to process several at once.
The columns have different lengths and the code fails with an error. The data does not stay in numeric format.
Error message: 'x' must be a numeric vector
I couldn't identify the cause. Could anyone help?
Many thanks
The code follows:
x = runif(n = 50, min = 1, max = 12)
y = runif(n = 70, min = 5, max = 15)
z = runif(n = 35, min = 1, max = 10)
m = runif(n = 80, min = 6, max = 18)
length(x) = length(m)
length(y) = length(m)
length(z) = length(m)
df = data.frame(x=x,y=y,z=z,m=m)
df
library(EnvStats)
nproc = 4
cont = 1
dfr = data.frame(variavel = character(nproc),
locationevd= (nproc), scaleevd= (nproc),
stringsAsFactors = F)
# i = 2
for (i in 1:4) {
print(i)
nome.var=colnames(df)
df = df[,c(i)]
df = na.omit(df)
variavela = nome.var[i]
dfr$variavel[cont] = variavela
evd = eevd(df);evd
locationevd = evd$parameters[[1]]
dfr$locationevd[cont] = locationevd
scaleevd = evd$parameters[[2]]
dfr$scaleevd[cont] = scaleevd
cont = cont + 1
}
writexl::write_xlsx(dfr, path = "Results.xls")
Two major changes to your code:
First, use a list instead of a dataframe (so you can accommodate unequal vector lengths):
x = runif(n = 50, min = 1, max = 12)
y = runif(n = 70, min = 5, max = 15)
z = runif(n = 35, min = 1, max = 10)
m = runif(n = 80, min = 6, max = 18)
vl = list(x=x,y=y,z=z,m=m)
vl
if (!require(EnvStats)) { install.packages('EnvStats'); library(EnvStats) }
nproc = 4
# cont = 1 Not used
dfr = data.frame(variavel = character(nproc),
locationevd = numeric(nproc), scaleevd = numeric(nproc),
stringsAsFactors = F)
Second: use a single loop index and drop the "cont" counter:
for ( i in 1:length(vl) ) {
# print(i) Not needed
nome.var=names(vl) # probably should have been done before loop
var = vl[[i]]
variavela = nome.var[i]
dfr$variavel[i] = variavela # all those could have been one step
evd = eevd( vl[[i]] ) # ;evd
locationevd = evd$parameters[[1]]
dfr$locationevd[i] = locationevd
scaleevd = evd$parameters[[2]]
dfr$scaleevd[i] = scaleevd
}
Which gets you the desired structure:
dfr
variavel locationevd scaleevd
1 x 5.469831 2.861025
2 y 7.931819 2.506236
3 z 3.519528 2.040744
4 m 10.591660 3.223352
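A plausible reason the original loop reported 'x' must be a numeric vector, offered as an assumption rather than something confirmed against the EnvStats source: na.omit() attaches extra attributes to its result, and base R's is.vector() returns FALSE for objects that carry attributes other than names. If eevd() validates its input with is.vector(), a padded column cleaned with na.omit() would be rejected. The base R sketch below shows only the attribute behaviour, using made-up values.

```r
# A numeric column padded with NA, as happens after `length(x) = length(m)`.
x <- c(2.5, 7.1, NA, 4.3)

# na.omit() drops the NA but records what it removed as attributes.
cleaned <- na.omit(x)
names(attributes(cleaned))

# is.vector() is FALSE when attributes other than names are present.
is.vector(cleaned, mode = "numeric")

# Plain subsetting keeps a bare numeric vector.
plain <- x[!is.na(x)]
is.vector(plain, mode = "numeric")
```

Storing the series in a list, as in the answer above, avoids the NA padding altogether.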

Using `ordinal::clmm` model to make predictions on new data

I have some repeated measures, ordinal response data:
dat <- data.frame(
id = factor(sample(letters[1:5], 50, replace = T)),
response = factor(sample(1:7, 50, replace = T), ordered = T),
x1 = runif(n = 50, min = 1, max = 10),
x2 = runif(n = 50, min = 100, max = 1000)
)
I have built the following model:
library(ordinal)
model <- clmm(response ~ x1 + x2 + (1|id), data = dat)
I have some new data:
new_dat <- data.frame(
id = factor(sample(letters[1:5], 5, replace = T)),
x1 = runif(n = 5, min = 1, max = 10),
x2 = runif(n = 5, min = 100, max = 1000)
)
I want to be able to use the model to predict the probability of each level of dat$response occurring for new_dat, whilst still also accounting for id.
Unfortunately predict() does not work for clmm objects. predict() does work for clmm2 objects but it ignores any random effects included.
What I want to achieve is something similar to what has been done in Figure 3 of the following using this code:
library(ordinal)
fm2 <- clmm2(rating ~ temp + contact, random=judge, data=wine, Hess=TRUE, nAGQ=10)
pred <- function(eta, theta, cat = 1:(length(theta)+1), inv.link = plogis){
Theta <- c(-1e3, theta, 1e3)
sapply(cat, function(j)
inv.link(Theta[j+1] - eta) - inv.link(Theta[j] - eta))
}
mat <- expand.grid(judge = qnorm(0.95) * c(-1, 0, 1) * fm2$stDev,
contact = c(0, fm2$beta[2]),
temp = c(0, fm2$beta[1]))
pred.mat <- pred(eta=rowSums(mat), theta=fm2$Theta)
lab <- paste("contact=", rep(levels(wine$contact), 2), ", ", "temp=", rep(levels(wine$temp), each=2), sep="")
par(mfrow=c(2, 2))
for(k in c(1, 4, 7, 10)) {
plot(1:5, pred.mat[k,], lty=2, type = "l", ylim=c(0,1),
xlab="Bitterness rating scale", axes=FALSE,
ylab="Probability", main=lab[ceiling(k/3)], las=1)
axis(1); axis(2)
lines(1:5, pred.mat[k+1, ], lty=1)
lines(1:5, pred.mat[k+2, ], lty=3)
legend("topright",
c("avg. judge", "5th %-tile judge", "95th %-tile judge"),
lty=1:3, bty="n")
}
Except, my model contains multiple continuous covariates (as opposed to binary factors).
How can I use the model data to predict the probability of each level of dat$response occurring for new_dat, whilst still also accounting for id?
Many thanks.
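For reference, the pred() helper quoted above computes category probabilities as differences of the inverse link at adjacent thresholds, evaluated at a linear predictor eta. With continuous covariates, eta is the coefficients times the covariate values plus the random intercept for the id. The sketch below applies that idea with made-up numbers: the thresholds, coefficients and random intercepts are hypothetical and are not taken from a fitted clmm, so this illustrates the arithmetic only and is not a replacement for a proper predict method.

```r
# Category probabilities for a cumulative link model (helper from the question).
pred <- function(eta, theta, cat = 1:(length(theta) + 1), inv.link = plogis) {
  Theta <- c(-1e3, theta, 1e3)
  sapply(cat, function(j) inv.link(Theta[j + 1] - eta) - inv.link(Theta[j] - eta))
}

# Hypothetical estimates: 6 thresholds for a 7-level response,
# two slopes, and one random intercept per id.
theta <- c(-2, -1, 0, 1, 2, 3)
beta  <- c(x1 = 0.2, x2 = -0.001)
u     <- c(a = -0.5, b = 0.3)

new_dat <- data.frame(
  id = c("a", "b", "a"),
  x1 = c(2, 5, 9),
  x2 = c(150, 400, 900)
)

# Linear predictor: fixed part plus the id-specific intercept.
eta <- beta[["x1"]] * new_dat$x1 +
  beta[["x2"]] * new_dat$x2 +
  unname(u[as.character(new_dat$id)])

# One row per new observation, one column per response level.
probs <- pred(eta = eta, theta = theta)
round(probs, 3)
rowSums(probs)
```

For an id that was not in the training data there is no estimated intercept; using 0 (the average id) is one common choice.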

In R & dabestr, how do I get grouped differences correctly?

Using the dabestr package I'm trying to get the differences between two sets of control & test data. Slightly modifying the example from the help file, I tried:
library(dabestr)
N <- 70
c1 <- rnorm(N, mean = 50, sd = 20)
t1 <- rnorm(N, mean = 200, sd = 20)
ID <- seq(1:N)
long.data <- tibble::tibble(ID = ID, Control1 = c1, Test1 = t1)
meandiff1 <- long.data %>%
tidyr::gather(key = Group, value = Measurement, Control1:Test1)
ID <- seq(1:N) + N
c2 <- rnorm(N, mean = 100, sd = 70)
t2 <- rnorm(N, mean = 100, sd = 70)
long.data <- tibble::tibble(ID = ID, Control2 = c2, Test2 = t2)
meandiff2 <- long.data %>%
tidyr::gather(key = Group, value = Measurement, Control2:Test2)
meandiff <- dplyr::bind_rows(meandiff1, meandiff2)
paired_mean_diff <-
dabest(meandiff, x = Group, y = Measurement,
idx = c("Control1", "Test1", "Control2", "Test2"),
paired = TRUE,
id.col = ID)
plot(paired_mean_diff)
I get these results:
So not only is everything compared to Control1 but also the paired = TRUE option seems to have no effect. I was hoping to get something similar to examples from the package page:
Any pointers on how to achieve that?
For a paired plot, you want to nest the idx keyword option as such:
paired_mean_diff <-
dabest(meandiff, x = Group, y = Measurement,
idx = list(c("Control1", "Test1"),
c("Control2", "Test2")),
paired = TRUE,
id.col = ID)
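As I understand the nested idx, each inner vector defines one control/test pair, and differences are taken within a pair rather than against the first group. The base R sketch below mimics only that grouping logic on simulated data. It does not call dabestr, it does no bootstrapping, and the names pairs, values and paired_diffs are hypothetical.

```r
set.seed(1)
N <- 70
values <- list(
  Control1 = rnorm(N, mean = 50, sd = 20),
  Test1    = rnorm(N, mean = 200, sd = 20),
  Control2 = rnorm(N, mean = 100, sd = 70),
  Test2    = rnorm(N, mean = 100, sd = 70)
)

# One inner vector per comparison, mirroring the nested idx.
pairs <- list(
  c(control = "Control1", test = "Test1"),
  c(control = "Control2", test = "Test2")
)

# Within-pair differences, matched by position (one row per subject ID).
paired_diffs <- lapply(pairs, function(p) {
  values[[p[["test"]]]] - values[[p[["control"]]]]
})
names(paired_diffs) <- vapply(
  pairs,
  function(p) paste(p[["test"]], "minus", p[["control"]]),
  character(1)
)

# Mean paired difference for each comparison.
sapply(paired_diffs, mean)
```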

Creating imputation list for use with svyglm

Using the survey package, I am having issues creating an imputationList that svydesign will accept. Here is a reproducible example:
library(tibble)
library(survey)
library(mitools)
# Data set 1
# Note that I am excluding the "income" variable from the "df"s and creating
# it separately so that it varies between the data sets. This simulates the
# variation with multiple imputation. Since I am using the same seed
# (i.e., 123), all the other variables will be the same, the only one that
# will vary will be "income."
set.seed(123)
df1 <- tibble(id = seq(1, 100, by = 1),
gender = as.factor(rbinom(n = 100, size = 1, prob = 0.50)),
working = as.factor(rbinom(n = 100, size = 1, prob = 0.40)),
pweight = sample(50:500, 100, replace = TRUE))
# Data set 2
set.seed(123)
df2 <- tibble(id = seq(1, 100, by = 1),
gender = as.factor(rbinom(n = 100, size = 1, prob = 0.50)),
working = as.factor(rbinom(n = 100, size = 1, prob = 0.40)),
pweight = sample(50:500, 100, replace = TRUE))
# Data set 3
set.seed(123)
df3 <- tibble(id = seq(1, 100, by = 1),
gender = as.factor(rbinom(n = 100, size = 1, prob = 0.50)),
working = as.factor(rbinom(n = 100, size = 1, prob = 0.40)),
pweight = sample(50:500, 100, replace = TRUE))
# Create list of imputed data sets
impList <- imputationList(df1,
df2,
df3)
# Apply NHIS weights
weights <- svydesign(id = ~id,
weight = ~pweight,
data = impList)
I get the following error:
Error in eval(predvars, data, env) :
numeric 'envir' arg not of length one
To get it to work, I needed to pass the data sets to imputationList as a single list, directly inside svydesign, as follows:
weights <- svydesign(id = ~id,
weight = ~pweight,
data = imputationList(list(df1,
df2,
df3)))
the step by step instructions available at http://asdfree.com/national-health-interview-survey-nhis.html walk through exactly how to create a multiply-imputed nhis design, and the analysis examples below that include svyglm calls. avoid using library(data.table) and library(dplyr) with library(survey)
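For the svyglm step itself, a minimal sketch under stated assumptions: the income variable is simulated here (the example above never creates it), plain data frames are used instead of tibbles, and make_imputed() is a hypothetical helper. The calls to with() on the design and to MIcombine() follow the usage I know from the survey and mitools documentation.

```r
library(survey)
library(mitools)

# Hypothetical imputed data sets: identical except for a simulated `income`.
make_imputed <- function(seed) {
  set.seed(123)
  d <- data.frame(
    id      = 1:100,
    gender  = factor(rbinom(100, size = 1, prob = 0.50)),
    working = factor(rbinom(100, size = 1, prob = 0.40)),
    pweight = sample(50:500, 100, replace = TRUE)
  )
  set.seed(seed)  # only income varies between imputations
  d$income <- rnorm(100, mean = 50000, sd = 10000)
  d
}
imputed <- lapply(1:3, make_imputed)

# imputationList() takes a single list of data frames.
design <- svydesign(
  ids = ~id,
  weights = ~pweight,
  data = imputationList(imputed)
)

# Fit the model on each imputed design, then pool the estimates.
fits   <- with(design, svyglm(income ~ gender + working))
pooled <- MIcombine(fits)
summary(pooled)
```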
