Is it possible to update a BSTS model? - r

I am working with time-series data which updates daily. Therefore, I thought that the Bayesian framework would fit perfectly because theoretically, it is possible to update the model as new data comes in. Thus, I proceed to model my data with the bsts package in R, which give me quite promising results. Now, I have two questions:
1 - How can I save my current model and then update it with the new data?
Alternatively,
2 - Is it possible to extract the coefficient distribution so that I can plug-in as a prior of a new model based on new data?
Here is a minimal example:
library(bsts)
library(magrittr)
# Use data included in the bsts package for reproducible example
data(iclaims)
plot(initial.claims)
# Let's assume I only have the data until 2011
t0 <- window(initial.claims, end = "2011-01-01")
t0
# Model data
ss <- AddLocalLinearTrend(list(), t0$iclaimsNSA)
ss <- AddSeasonal(ss, t0$iclaimsNSA, nseasons = 52)
model1 <- bsts(iclaimsNSA ~ .,
state.specification = ss,
data = t0,
niter = 1000)
predict(model1, horizon = 30) %>%
plot(.)
# What I would like to do next
saveRDS(model1, file = "model_t0")
# Next month/year
load("model_t0")
NEW_DATA <- window(initial.claims, start = "2011-01-01")
# Option 1, get the priors manually
priors <- get_priors(model_t0) # imaginary function of what I want to do
updated_model <- bsts(iclaimsNSA ~ .,
state.specification = ss,
data = NEW_DATA,
niter = 1000,
prior = priors)
# Option 2, update the model directly
updated_model <- bsts(iclaimsNSA ~ .,
state.specification = ss,
data = NEW_DATA,
niter = 1000,
old_model = model_t0)
I have seen this kind of approach with the brms package, but unfortunately, that package cannot deal with time-series related data.
Kind regards

Related

Refit an arima model with new training data in fable package

I have a function which takes a fitted model and then refits that model to new training data (this is for step-ahead cross validation). For lm models it works like this:
#create data
training_data <-
data.frame(date = seq.Date(
from = as.Date("2020-01-01"),
by = 1, length.out = 365
), x = 1:365, y = 1:365 + rnorm(n = 365))
# specify and fit model
lm_formula <- as.formula(y ~ x)
my_lm <- lm(lm_formula, data = training_data)
# refit on new training data
update(my_lm, data = new_training_data)
Is there a way to do the same thing for arima models fitted with the fable package? I'm creating the models like this
library(fable)
library(forecast)
arima_formula <- as.formula(y ~ x + PDQ(0, 0, 0))
my_arima <- as_tsibble(training_data) %>% model(ARIMA(arima_formula))
But I can't figure out a way to take the my_arima model that I've already fitted and pass it new_training_data, either using update or by extracting the formula and refitting as a new model. Note that although I've included the model formula in the reprex above, my function takes a fitted model rather than a formula. So just fitting a new model using arima_formula is not an option.
Thank you.

Cannot generate predictions in mgcv when using discretization (discrete=T)

I am fitting a model using a random site-level effect using a generalized additive model, implemented in the mgcv package for R. I had been doing this using the function gam() however, to speed things up I need to shift to the bam() framework, which is basically the same as gam(), but faster. I further sped up fitting by passing the options bam(nthreads = N, discrete=T), where nthreads is the number of cores on my machine. However, when I use the discretization option, and then try to make predictions with my model on new data, while ignoring the random effect, I consistent get an error.
Here is code to generate example data and reproduce the error.
library(mgcv)
#generate data.
N <- 10000
x <- runif(N,0,1)
y <- (0.5*x / (x + 0.2)) + rnorm(N)*0.1 #non-linear relationship between x and y.
#uninformative random effect.
random.x <- as.factor(do.call(paste0, replicate(2, sample(LETTERS, N, TRUE), FALSE)))
#fit models.
fit1 <- gam(y ~ s(x) + s(random.x, bs = 're')) #this one takes ~1 minute to fit, rest faster.
fit2 <- bam(y ~ s(x) + s(random.x, bs = 're'))
fit3 <- bam(y ~ s(x) + s(random.x, bs = 're'), discrete = T, nthreads = 2)
#make predictions on new data.
newdat <- data.frame(runif(200, 0, 1))
colnames(newdat) <- 'x'
test1 <- predict(fit1, newdata=newdat, exclude = c("s(random.x)"), newdata.guaranteed = T)
test2 <- predict(fit2, newdata=newdat, exclude = c("s(random.x)"), newdata.guaranteed = T)
test3 <- predict(fit3, newdata=newdat, exclude = c("s(random.x)"), newdata.guaranteed = T)
Making predictions with the third model which uses discretization throws this error (which the other two do not):
Error in model.frame.default(object$dinfo$gp$fake.formula[-2], newdata) :
variable lengths differ (found for 'random.x')
In addition: Warning message:
'newdata' had 200 rows but variables found have 10000 rows
How can I go about making predictions for a new dataset using the model fit with discretization?
newdata.gauranteed doesn't seem to be working for bam() models with discrete = TRUE. You could email the author and maintainer of mgcv and send him the reproducible example so he can take a look. See ?bug.reports.mgcv.
You probably want
names(newdat) <- "x"
as data frames have names.
But the workaround is just to pass in something for random.x
newdat <- data.frame(x = runif(200, 0, 1), random.x = random.x[[1]])
and then do your call to generate test3 and it will work.
The warning message and error are the result of you not specifying random.x in the newdata and then mgcv looking for random.x and finding it in the global environment. You should really gather that variables into a data frame and use the data argument when you are fitting your models, and try not to leave similarly named objects lying around in your global environment.

Retrain best model on full dataset in R

I have two models to select from and using some criteria I choose one of the two. (The below is just an example, I know it doesn't make much sense)
library(forecast)
set.seed(4)
sample_dat= sample(1:nrow(cars), 5)
train = cars[-sample_dat, ]
test = cars[sample_dat, ]
models = list(lm(dist ~ speed, train), glm(dist ~ speed, train, family = "poisson"))
test_res = sapply(models, function(x) accuracy(predict(x, test, type = "response"), test$dist)[2]) #Getting the RMSE for each model
best_model = models[which.min(test_res)]
How can I retrain the best model using the full dataset (train + test)? I checked the update and update.formula functions but these don't seem to be updating the data part.
update(best_model[[1]],data = rbind(train,test))
You do not want to change the formula since that is the best model but rather update the data
Base R using your own logic, first creating a list mirroring the models list:
set.seed(4)
sample_dat= sample(1:nrow(cars), 5)
train = cars[-sample_dat, ]
test = cars[sample_dat, ]
models = list(lm(dist ~ speed, train), glm(dist ~ speed, train, family = "poisson"))
model_application = list(as.expression("lm(dist ~ speed, cars)$call"),
as.expression("glm(dist ~ speed, cars, family = 'poisson'))$call"))
test_res = sapply(models,
function(x){
# Store a function to caclulate the RMSE: rmse => function
rmse <- function(actual_vec, pred_vec){sqrt(mean((pred_vec - actual_vec)**2))}
# Getting the RMSE for each model: numeric scalar => .GlobalEnv
rmse(test$dist, predict(x, data = test, type = "response"))
}
)
best_model = models[[which.min(test_res)]]
applied_model <- eval(eval(as.expression(parse(text = model_application[[which.min(test_res)]]))))

How to stack machine learning models in R

I am new to machine learning and R.
I know that there is an R package called caretEnsemble, which could conveniently stack the models in R. However, this package looks has some problems when deals with multi-classes classification tasks.
Temporarily, I wrote some codes to try to stack the models manually and here is the example I worked on:
library(caret)
set.seed(123)
library(AppliedPredictiveModeling)
data(AlzheimerDisease)
adData = data.frame(diagnosis, predictors)
inTrain = createDataPartition(adData$diagnosis, p = 3 / 4)[[1]]
training = adData[inTrain,]
testing = adData[-inTrain,]
set.seed(62433)
modelFitRF <- train(diagnosis ~ ., data = training, method = "rf")
modelFitGBM <- train(diagnosis ~ ., data = training, method = "gbm",verbose=F)
modelFitLDA <- train(diagnosis ~ ., data = training, method = "lda")
predRF <- predict(modelFitRF,newdata=testing)
predGBM <- predict(modelFitGBM, newdata = testing)
prefLDA <- predict(modelFitLDA, newdata = testing)
confusionMatrix(predRF, testing$diagnosis)$overall[1]
#Accuracy
#0.7682927
confusionMatrix(predGBM, testing$diagnosis)$overall[1]
#Accuracy
#0.7926829
confusionMatrix(prefLDA, testing$diagnosis)$overall[1]
#Accuracy
#0.7682927
Now I've got three models: modelFitRF, modelFitGBM and modelFitLDA, and three predicted vectors corresponding to such three models based on the test set.
Then I will create a data frame to contain these predicted vectors and the original dependent variable in the test set:
predDF <- data.frame(predRF, predGBM, prefLDA, diagnosis = testing$diagnosis, stringsAsFactors = F)
And then, I just used such data frame as a new train set to create a stacked model:
modelStack <- train(diagnosis ~ ., data = predDF, method = "rf")
combPred <- predict(modelStack, predDF)
confusionMatrix(combPred, testing$diagnosis)$overall[1]
#Accuracy
#0.804878
Considering that stacking models usually should improve the accuracy of the predictions, I'de like to believe this might be a right to stack the models. However, I also doubt that here I used the predDF which is created by the predictions from three models with the test set.
I am not sure whether I should use the results from the test set and then apply them back to the test set to get final predictions?
(I am referring to this block below:)
predDF <- data.frame(predRF, predGBM, prefLDA, diagnosis = testing$diagnosis, stringsAsFactors = F)
modelStack <- train(diagnosis ~ ., data = predDF, method = "rf")
combPred <- predict(modelStack, predDF)
confusionMatrix(combPred, testing$diagnosis)$overall[1]

Run several GLM models using for loop in R

I'm trying to do some experiment and I want to run several GLMs model in R using the same variables but different training samples.
Here is some simulated data:
resp <- sample(0:1,100,TRUE)
x1 <- c(rep(5,20),rep(0,15), rep(2.5,40),rep(17,25))
x2 <- c(rep(23,10),rep(5,10), rep(15,40),rep(1,25), rep(2, 15))
dat <- data.frame(resp,x1, x2)
This is the loop I'm trying to use:
n <- 5
for (i in 1:n)
{
### Create training and testing data
## 80% of the sample size
# Note that I didn't use seed so that random split is performed every iteration.
smp_sizelogis <- floor(0.8 * nrow(dat))
train_indlogis <- sample(seq_len(nrow(dat)), size = smp_sizelogis)
trainlogis <- dat[train_indlogis, ]
testlogis <- dat[-train_indlogis, ]
InitLOogModel[i] <- glm(resp ~ ., data =trainlogis, family=binomial)
}
But unfortunately, I'm getting this error:
Error in InitLOogModel[i] <- glm(resp ~ ., data = trainlogis, family = binomial) :
object 'InitLOogModel' not found
Any thoughts.
I'd suggest using caret for what you're trying to do. It takes some time to learn, but incorporates many 'best practices'. Once you've learned the basics you'll be able to quickly try models other than a glm, and easily compare the models to each other. Here's modified code from your example to get you started.
## caret
library(caret)
# your data
resp <- sample(0:1,100,TRUE)
x1 <- c(rep(5,20),rep(0,15), rep(2.5,40),rep(17,25))
x2 <- c(rep(23,10),rep(5,10), rep(15,40),rep(1,25), rep(2, 15))
dat <- data.frame(resp,x1, x2)
# so caret knows you're trying to do classification, otherwise will give you an error at the train step
dat$resp <- as.factor(dat$resp)
# create a hold-out set to use after your model fitting
# not really necessary for your example, but showing for completeness
train_index <- createDataPartition(dat$resp, p = 0.8,
list = FALSE,
times = 1)
# create your train and test data
train_dat <- dat[train_index, ]
test_dat <- dat[-train_index, ]
# repeated cross validation, repeated 5 times
# this is like your 5 loops, taking 80% of the data each time
fitControl <- trainControl(method = "repeatedcv",
number = 5,
repeats = 5)
# fit the glm!
glm_fit <- train(resp ~ ., data = train_dat,
method = "glm",
family = "binomial",
trControl = fitControl)
# summary
glm_fit
# best model
glm_fit$finalModel

Resources