Predicting with a bsts model and updated olddata in R

I've built a bsts model using 2 years of weekly historical data. I'm able to predict using the model with the existing training data. In order to mimic the process that would occur with the model in production, I've created an xts object that moves the 2 years of data forward by one week. When I try to predict using this dataset (populating the olddata parameter in predict.bsts), I receive the following error:
Error in terms.default(object) : no terms component nor attribute
I realize I'm probably doing something dumb here, but I haven't been able to find any examples of how olddata is used when predicting. Appreciate any help you can provide.
Thanks
library(bsts)
library(xts)

# Build an xts series from the raw data and restrict to the training window
dat = xts(fcastdat$SumScan_units, order.by = fcastdat$enddate)
traindat = window(dat, start = as.Date("2015-01-03"), end = as.Date("2016-12-26"))

# State specification: local level + weekly seasonality + named holidays
ss = AddLocalLevel(list(), traindat)
ss = AddSeasonal(ss, traindat, nseasons = 52)
holidays = c("EasterSunday", "USMothersDay", "IndependenceDay", "MemorialDay",
             "LaborDay", "Thanksgiving", "Christmas")
ss = AddNamedHolidays(ss, named.holidays = holidays, traindat)

model_loclev_seas_hol = bsts(traindat, state.specification = ss, niter = 500,
                             ping = 50, seed = 1289)
burn = SuggestBurn(0.1, model_loclev_seas_hol)

# Predicting from the training data works
pred_len = 5
pred = predict.bsts(model_loclev_seas_hol, horizon = pred_len, burn = burn,
                    quantiles = c(.025, .975))

# Shift the window forward one week and predict with olddata -- this errors
begdt = index(traindat[1]) + 7
enddt = index(traindat[length(traindat)]) + 7
predseries = window(dat, start = as.Date(begdt), end = as.Date(enddt))
pred2 = predict.bsts(model_loclev_seas_hol, horizon = pred_len, burn = burn,
                     olddata = predseries, quantiles = c(.025, .975))

Related

Dynamic time series using TSA::arima and stats::arima

I am looking for more information to understand the difference between TSA::arimax and stats::arima when used for dynamic time series. I am interested in exploring the interplay between drinking and smoking rates in young people, treating smoking as the outcome variable.
Using the two commands (code below) produces the same results in my data - is this because I only have one IV and/or because I am not specifying any p or q values for the transfer variable?
I have seen online that TSA::arimax fits a transfer-function model rather than an ARIMAX model, but I am not sure how they differ.
library(TSA)

alcohol.ts = ts(data = data$alcohol, frequency = 4,
                start = c(data[1, "Year"], data[1, "Quarter"]))
iv = data.frame(alcohol = alcohol.ts)
dv = ts(data = data$smoke, frequency = 4,
        start = c(data[1, "Year"], data[1, "Quarter"]))

(model1 = stats::arima(dv, order = c(2, 1, 0),
                       seasonal = list(order = c(0, 0, 0), period = 4),
                       xreg = iv[, 1],
                       transform.pars = FALSE,
                       optim.control = list(maxit = 1000),
                       method = 'ML'))

(model2 = TSA::arimax(dv, order = c(2, 1, 0),
                      seasonal = list(order = c(0, 0, 0), period = 4),
                      xtransf = iv[, 1], transfer = list(c(0, 0)),
                      transform.pars = FALSE,
                      optim.control = list(maxit = 1000),
                      method = 'ML'))
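For what it's worth, a transfer function of order (0, 0) contributes only a single contemporaneous coefficient on the input series, which is algebraically the same term that an xreg regressor adds; that would explain the identical results here. The xreg mechanism can be sketched with base R alone (the LakeHuron series and linear-trend regressor below are illustrative stand-ins, not the alcohol/smoking data):

```r
# Hedged sketch of stats::arima's xreg interface on a built-in dataset.
# LakeHuron and the trend regressor are assumptions for illustration only.
y <- as.numeric(LakeHuron)   # annual lake levels, shipped with base R
x <- seq_along(y)            # a simple external regressor (linear trend)

m <- stats::arima(y, order = c(1, 0, 0), xreg = x, method = "ML")

# Three coefficients: ar1, intercept, and one regression coefficient for x --
# exactly what a (0,0) transfer function would also estimate.
coef(m)
```

A transfer function only behaves differently from xreg once you give it nonzero orders, i.e. lagged or decaying responses to the input.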

How to perform ANOVA on the result of different polynomial regression models with different levels of degree using a for loop

I'm quite new to R and I think my problem is quite simple, but I cannot seem to work it out. I've looked at similar problems on here, but I can't get a solution to work for my specific problem.
I'm using the Wage data set that comes with the ISLR package to try to model wage as a function of age with differing polynomial degrees.
library(ISLR)
attach(Wage)
I'm regressing wage on age up to degree 10, and then I want to apply anova to each model and investigate the results. The closest I have got is this:
for (i in 1:10) {
fit[[i]] <- lm(wage~poly(age, i) , data = Wage)
result[[i]] <- aov(as.formula(paste(fit[i], "~ wage")))
}
which results in this error:
Error in model.frame.default(formula = as.formula(paste(fit[i], "~ wage")), :
invalid type (list) for variable 'list(coefficients = c((Intercept) = 111.703608201744, poly(age, i) = 447.067852758315), residuals = c(...
Any help would be greatly appreciated, and apologies for being such an R noob.
Thanks!!
To answer your question, you can do:
library(ISLR)
attach(Wage)
fit <- vector("list",10)
result <- vector("list",10)
for (i in 1:10) {
thisForm <- paste0("wage~poly(age, ",i,")")
fit[[i]] <- lm(thisForm , data = Wage)
result[[i]] <- aov(fit[[i]])
}
Above, the formula is built as a string so that the actual degree appears in the fit/aov objects instead of the literal i.
Note that the lm fit is already contained in the aov object, so you can extract coefficients, predict, and do anything else you would do with lm directly on the aov object; you don't need to store the lm fit separately:
coefficients(result[[1]]) ; coefficients(fit[[1]])
 (Intercept) poly(age, 1)
    111.7036     447.0679
 (Intercept) poly(age, 1)
    111.7036     447.0679
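To then compare the nested fits with sequential F-tests (the usual next step with these polynomial models), the whole list can be passed to anova in one call. The sketch below uses the built-in mtcars data in place of ISLR's Wage so it stays self-contained; mpg and hp are stand-ins for wage and age, but the pattern is identical:

```r
# Fit polynomial models of increasing degree. Passing the degree as a
# function argument (rather than a loop variable) records the correct
# value of i in each model call.
fits <- lapply(1:3, function(i) lm(mpg ~ poly(hp, i), data = mtcars))

# Sequential F-tests: each model is compared against the previous one
cmp <- do.call(anova, fits)
cmp
```

A significant F-statistic on row k suggests the degree-k term improves on the degree-(k-1) model.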

How to fix "variable length differ" error in cv.zipath?

Trying to run a cross-validation of a zero-inflated Poisson model using cv.zipath from the mpath package.
Fitting the LASSO
fit.lasso = zipath(estimation_sample_nomiss ~ .| .,
data = missings,
nlambda = 100,
family = "poisson",
link = "logit")
Cross validation
n <- dim(docvisits)[1]
K <- 10
set.seed(197)
foldid <- split(sample(1:n), rep(1:K, length = n))
fitcv <- cv.zipath(F_time_unemployed~ . | .,
data = estimation_sample_nomiss, family = "poisson",
nlambda = 100, lambda.count = fit.lasso$lambda.count[1:30],
lambda.zero = fit.lasso$lambda.zero[1:30], maxit.em = 300,
maxit.theta = 1, theta.fixed = FALSE, penalty = "enet",
rescale = FALSE, foldid = foldid)
I encounter the following error:
Error in model.frame.default(formula = F_time_unemployed ~ . + ., data = list(: variable lengths differ (found for '(weights)')
I have cleaned the sample of all NA's but still encounter the error message.
The solution turns out to be that the cv.zipath() command does not accept tibbles - at least in this instance (no guarantee as to how far this statement generalises). Having used dplyr commands, one needs to convert back to a plain data frame, so the fix is as simple as as.data.frame().

Format of newx in Lasso regression gives error in R

I am trying to implement lasso linear regression. I train my model, but when I try to make predictions on unknown data it gives me the following error:
Error in cbind2(1, newx) %*% nbeta :
invalid class 'NA' to dup_mMatrix_as_dgeMatrix
I want to predict the unknown percent_gc. I initially train my model using the data for which percent_gc is known:
library(tibble)
library(dplyr)
library(glmnet)

set.seed(1)
### training data
data.all <- tibble(description = c('Xylanimonas cellulosilytica XIL07, DSM 15894','Teredinibacter turnerae T7901',
'Desulfotignum phosphitoxidans FiPS-3, DSM 13687','Brucella melitensis bv. 1 16M'),
phylum = c('Actinobacteria','Proteobacteria','Proteobacteria','Bacteroidetes'),
genus = c('Acaryochloris','Acetohalobium','Acidimicrobium','Acidithiobacillus'),
Latitude = c('63.93','69.372','3.493.11','44.393.704'),
Longitude = c('-22.1','88.235','134.082.527','-0.130781'),
genome_size = c(8361599,2469596,2158157,3207552),
percent_gc = c(34,24,55,44),
percent_psuedo = c(0.0032987747,0.0291222313,0.0353728489,0.0590663703),
percent_signalpeptide = c(0.02987198,0.040607055,0.048757170,0.061606859))
###data for prediction
data.prediction <- tibble(description = c('Liberibacter crescens BT-1','Saprospira grandis Lewin',
'Sinorhizobium meliloti AK83','Bifidobacterium asteroides ATCC 25910'),
phylum = c('Actinobacteria','Proteobacteria','Proteobacteria','Bacteroidetes'),
genus = c('Acaryochloris','Acetohalobium','Acidimicrobium','Acidithiobacillus'),
Latitude = c('39.53','69.372','5.493.12','44.393.704'),
Longitude = c('20.1','-88.235','134.082.527','-0.130781'),
genome_size = c(474832,2469837,2158157,3207552),
percent_gc = c(NA,NA,NA,NA),
percent_psuedo = c(0.0074639239,0.0291222313,0.0353728489,0.0590663703),
percent_signalpeptide = c(0.02987198,0.040607055,0.048757170,0.061606859))
x = model.matrix(percent_gc ~ ., data.all)
y = data.all$percent_gc
cv.out <- cv.glmnet(x, y, alpha = 1, family = "gaussian")
best.lambda <- cv.out$lambda.min
fit <- glmnet(x, y, alpha = 1)
I then want to make predictions for the rows where percent_gc is not known.
newX = matrix(data = data.prediction %>% select(-percent_gc))
data.prediction$percent_gc <-
predict(object = fit ,type="response", s=best.lambda, newx=newX)
And this generates the error I mentioned above.
I don't understand which format newX should be in to get rid of this error. Insights would be appreciated.
I could not really figure out how to construct an appropriate matrix, but the glmnetUtils package provides functionality to fit a formula directly on a data frame and predict. With this I got it to predict values:
library(glmnetUtils)
fit <- glmnet(percent_gc ~ ., data.all, alpha = 1)
cv.out <- cv.glmnet(percent_gc ~ ., data.all, alpha = 1, family = "gaussian")
best.lambda <- cv.out$lambda.min
predict(object = fit, data.prediction, s = best.lambda)
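If you would rather stay with plain glmnet, the likely root cause is that matrix() wrapped around a data frame produces a one-column list matrix, not a numeric matrix. newx generally needs to be built with model.matrix() using the same formula as the training matrix so the dummy-coded columns line up. A minimal base-R sketch of that construction (the toy columns below are illustrative, not the poster's full data):

```r
# Toy frame standing in for the prediction data (assumption for illustration)
df <- data.frame(phylum = c("Actinobacteria", "Proteobacteria", "Proteobacteria"),
                 genome_size = c(8361599, 2469596, 2158157))

# model.matrix dummy-codes the character/factor column and returns a
# purely numeric matrix, which is the shape predict.glmnet expects for newx
X <- model.matrix(~ ., df)
colnames(X)   # intercept, a dummy for phylum, and genome_size
```

Keeping the factor levels identical between training and prediction data is also needed, otherwise the dummy columns will not align.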

R: Holt-Winters with daily data (forecast package)

In the following example, I am trying to use Holt-Winters smoothing on daily data, but I run into a couple of issues:
# generate some dummy daily data
mData = cbind(seq.Date(from = as.Date('2011-12-01'),
to = as.Date('2013-11-30'), by = 'day'), rnorm(731))
# convert to a zoo object
zooData = as.zoo(mData[, 2, drop = FALSE],
order.by = as.Date(mData[, 1, drop = FALSE], format = '%Y-%m-%d'),
frequency = 7)
# attempt Holt-Winters smoothing
hw(x = zooData, h = 10, seasonal = 'additive', damped = FALSE,
initial = 'optimal', exponential = FALSE, fan = FALSE)
# no missing values in the data
sum(is.na(zooData))
This leads to the following error:
Error in ets(x, "AAA", alpha = alpha, beta = beta, gamma = gamma,
damped = damped, : You've got to be joking. I need more data! In
addition: Warning message: In ets(x, "AAA", alpha = alpha, beta =
beta, gamma = gamma, damped = damped, : Missing values encountered.
Using longest contiguous portion of time series
Emphasis mine.
Couple of questions:
1. Where are the missing values coming from?
2. I am assuming that the "need more data" arises from attempting to estimate 365 seasonal parameters?
Update 1:
Based on Gabor's suggestion, I have recreated a fractional index for the data where whole numbers are weeks.
I have a couple of questions:
1. Is this an appropriate way of handling daily data when the periodicity is assumed to be weekly?
2. Is there a more elegant way of handling the dates when working with daily data?
library(zoo)
library(forecast)
# generate some dummy daily data
mData = cbind(seq.Date(from = as.Date('2011-12-01'),
to = as.Date('2013-11-30'), by = 'day'), rnorm(731))
# convert to a zoo object with weekly frequency
zooDataWeekly = as.zoo(mData[, 2, drop = FALSE],
order.by = seq(from = 0, by = 1/7, length.out = 731))
# attempt Holt-Winters smoothing
hwData = hw(x = zooDataWeekly, h = 10, seasonal = 'additive', damped = FALSE,
initial = 'optimal', exponential = FALSE, fan = FALSE)
plot(zooDataWeekly, col = 'red')
lines(fitted(hwData))
hw requires a ts object, not a zoo object. Use:
zooDataWeekly <- ts(mData[,2], frequency=7)
Unless there is a good reason for specifying the model exactly, it is usually better to let R select the best model for you:
fit <- ets(zooDataWeekly)
fc <- forecast(fit)
plot(fc)
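The conversion the answer recommends can be checked with base R alone; the sketch below uses rnorm data as a stand-in for mData and omits the forecast calls so it stays self-contained:

```r
set.seed(42)
# Daily data with weekly seasonality: frequency = 7 observations per cycle
zooDataWeekly <- ts(rnorm(731), frequency = 7)

frequency(zooDataWeekly)   # 7, so hw()/ets() fit 7 seasonal terms, not 365
is.ts(zooDataWeekly)       # TRUE: the class that hw() expects
```

Setting frequency = 7 is what avoids the "I need more data!" error: the model then estimates a handful of weekly seasonal parameters instead of hundreds of daily ones.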
