Getting the error "'xreg' and 'newxreg' have different numbers of columns" in R

I just started with R and time series forecasting.
I am forecasting one variable (consumption) using one exogenous variable (income); the data are quarterly.
I fit the model with the following R code:
train_exp <- exp_trial[, 1][1:150]
train_inc <- exp_trial[, 2][1:150]
model_train_exp <- arima(train_exp, order = c(0, 2, 6),
                         seasonal = list(order = c(0, 1, 1), period = 4),
                         xreg = train_inc)
This model runs without errors, but when I forecast from it,
forecasted_arima <- forecast.Arima(model_train_exp, h = 14)
I get the error "'xreg' and 'newxreg' have different numbers of columns".
There are many arguments to forecast.Arima and I am not familiar with them. Can someone please tell me what the code should be?

The model used train_inc as a regressor, so it needs future train_inc values in order to produce the forecast. Think of it this way: you built a model of the form train_exp[t] = b1 + b2*train_exp[t-1] + b3*train_inc[t]. With that model in hand, if someone provides a value for train_exp[t-1] (yesterday's consumption) and one for train_inc[t] (today's income), the model will return train_exp[t] (today's consumption). You need to provide it with future train_inc values to get forecasts out.
Example
train_exp <- rnorm(20)
train_inc <- 1 + rnorm(20)
fit <- arima(train_exp, xreg = train_inc)
predict(fit, n.ahead = 14)
# Error in predict.Arima(fit, n.ahead = 14) :
#   'xreg' and 'newxreg' have different numbers of columns
We get the same error that you got. But when we supply new values for train_inc it works!
new_train_inc <- rnorm(14)
predict(fit, newxreg=new_train_inc)
# $pred
# Time Series:
# Start = 21
# End = 34
# Frequency = 1
# [1] -0.2444872 -0.1583624 -0.2042488 -0.2143231 -0.1992276 -0.2047153 -0.2431517 -0.1887002 -0.2480745 -0.2118920
# [11] -0.1281492 -0.2067001 -0.2202669 -0.2166019
#
# $se
# Time Series:
# Start = 21
# End = 21
# Frequency = 1
# [1] 1.153433
If it still doesn't make sense, remember that you are predicting train_exp, not train_inc.
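To fix the original forecast.Arima call, pass the future income values through its xreg argument (in current versions of the forecast package this is done through the plain forecast() generic). A minimal sketch, assuming the next 14 quarters of income exist in exp_trial; the index range 151:164 is a hypothetical illustration:
future_inc <- exp_trial[, 2][151:164]  # hypothetical: the next 14 income values
forecasted_arima <- forecast.Arima(model_train_exp, h = 14, xreg = future_inc)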
If you would like a more formal discussion, see Cross Validated.


Fit double logistic function to a time series

For the following time series data:
# 1. Dates at 15-day frequency:
dates <- seq(as.Date("2016-09-01"), as.Date("2020-07-30"), by = 15) # 96 observations
# 2. Water content in crops corresponding to the dates above:
water <- c(0.5702722, 0.5631781, 0.5560839, 0.5555985, 0.5519783, 0.5463459,
0.5511598, 0.546652, 0.5361545, 0.530012, 0.5360571, 0.5396569,
0.5683526, 0.6031535, 0.6417821, 0.671358, 0.7015542, 0.7177007,
0.7103561, 0.7036985, 0.6958607, 0.6775161, 0.6545367, 0.6380155,
0.6113306, 0.5846186, 0.5561815, 0.5251135, 0.5085149, 0.495352,
0.485819, 0.4730029, 0.4686458, 0.4616468, 0.4613918, 0.4615532,
0.4827496, 0.5149105, 0.5447824, 0.5776764, 0.6090217, 0.6297454,
0.6399422, 0.6428941, 0.6586344, 0.6507473, 0.6290631, 0.6011123,
0.5744375, 0.5313527, 0.5008027, 0.4770338, 0.4564025, 0.4464508,
0.4309046, 0.4351668, 0.4490393, 0.4701232, 0.4911582, 0.5162941,
0.5490387, 0.5737573, 0.6031149, 0.6400073, 0.6770058, 0.7048311,
0.7255012, 0.739107, 0.7338938, 0.7265202, 0.6940718, 0.6757214,
0.6460862, 0.6163091, 0.5743775, 0.5450822, 0.5057753, 0.4715266,
0.4469859, 0.4303232, 0.4187793, 0.4119401, 0.4201316, 0.426369,
0.4419331, 0.4757525, 0.5070846, 0.5248457, 0.5607567, 0.5859825,
0.6107531, 0.6201754, 0.6356589, 0.6336177, 0.6275579, 0.6214981)
I want to fit a double-logistic function curve to the data.
I found some examples and packages that may help:
https://greenbrown.r-forge.r-project.org/man/FitDoubleLogElmore.html
and an example here - Indexes overlap error when using dplyr to run a function.
However, the examples given only consider annual time series.
I have tried to fit the function as:
library(greenbrown) # provides FitDoubleLogBeck
x <- ts(water, start = c(2016, 17), end = c(2020, 16), frequency = 24)
smooth.water <- FitDoubleLogBeck(x, weighting = TRUE, hessian = FALSE, plot = TRUE, ninit = 10)
plot(water)
plot(smooth.water$predicted)
plot(water - smooth.water$predicted)
However, this function does not seem to fit the entire time series. How can I run it so that it fits the whole series? Also, I noticed the output differs from run to run, and I am not sure what causes that.
FitDoubleLogBeck can deal only with one year of data, so you need to analyze the data year by year: take a one-year window and fit each year separately.
As for the different results on different runs: the algorithm randomly chooses the initial parameters. The graph of a double logistic curve is bell-shaped, but you are applying the algorithm to "sine"-like multi-year data when it expects a single "bell". It then treats the water data as a cloud of points, so the results are meaningless and very sensitive to the initial parameter settings.
Code:
library(greenbrown)
set.seed(123)
par(mfrow = c(1, 3))
# water vector taken from the question above
x <- ts(water, start = c(2016, 17), end = c(2020, 16), frequency = 24)
res <- sapply(2017:2019, function(year) {
  x2 <- as.vector(window(x, start = c(year, 1), end = c(year, 24)))
  smooth.water2 <- FitDoubleLogBeck(x2, weighting = TRUE, hessian = FALSE, plot = TRUE, ninit = 10)
  title(main = year)
  c(year = year, smooth.water2$params)
})
t(res)
Output:
year mn mx sos rsp eos rau
[1,] 2017 -0.7709318 0.17234293 16.324163 -0.6133117 6.750885 -0.7618376
[2,] 2018 -0.8900971 0.09398673 7.529345 0.6701200 17.319465 0.8277409
[3,] 2019 -4.7669470 -0.34648434 15.930455 -0.2570877 10.690043 -0.2267284

SVM with rolling window

I have code that predicts the change in the sign of future returns.
library(quantmod)
library(PerformanceAnalytics)
library(forecast)
library(e1071)
library(caret)
library(kernlab)
library(dplyr)
library(roll)
# get data from Yahoo Finance
getSymbols("^GSPC", from = "1990-01-01", to = "2017-12-01")
# take log returns
rnull <- CalculateReturns(prices = GSPC$GSPC.Adjusted, method = "log")
# lags 1, 2, 3, 4, 5 as features
feat <- merge(na.trim(lag(rnull, 1)), na.trim(lag(rnull, 2)), na.trim(lag(rnull, 3)),
              na.trim(lag(rnull, 4)), na.trim(lag(rnull, 5)), all = FALSE)
# create the dataset: the 6th column is the actual return, the others are lagged
dataset <- merge(feat, rnull, all = FALSE)
# set column names
colnames(dataset) <- c("lag.1", "lag.2", "lag.3", "lag.4", "lag.5", "TARGET")
# get signs and make a data.frame
x <- sign(dataset) %>% as.data.frame
# exclude the 0 sign and assume those values are positive
x[x == 0] <- 1
# for svm purposes we need the dependent variable as a factor, with named levels
x$TARGET <- as.factor(as.character(x$TARGET))
levels(x$TARGET) <- list(positive = "1", negative = "-1")
# divide the sample into training and test subsamples
# (test starts at 5793 so the two sets do not overlap)
trainindex <- x[1:5792, ]
testindex <- x[5793:7030, ]
# run svm
svmFit <- ksvm(TARGET ~ ., data = trainindex, type = "C-svc", kernel = "rbfdot")
# prediction
predsvm <- predict(svmFit, newdata = testindex)
# results
confusionMatrix(predsvm, testindex$TARGET)
The next thing I want to do is add a rolling window (one-step-ahead forecast) to my model.
However, basic methods such as rollapply do not work with data frames, and the common one-step-ahead forecasting methods for time series are also not valid for the data.frame used by the e1071 package.
I wrote the following function:
svm_next_day_prediction <- function(x) {
  # NB: the model is fit on the full window, including the row being predicted
  svmFit <- svm(TARGET ~ ., data = x)
  prediction <- predict(object = svmFit, newdata = tail(x, 1))
  return(prediction)
}
apl <- rollapplyr(data = x, width = 180, FUN = svm_next_day_prediction, by.column = TRUE)
but received an error, because rollapply does not understand data frames:
Error in terms.formula(formula, data = data) : '.' in formula and
no 'data' argument
Can you please explain how to apply a rolling window to an SVM classification model with a data frame?
A few points:
rollapply works with data frames that can be coerced to a matrix, so be sure that your input is entirely numeric, not a mix of numeric and factor columns. For example, this works using the built-in data frame BOD, which has two numeric columns. Note that the x passed to pred is a matrix here.
pred <- function(x) predict(svm(demand ~ Time, x))
rollapplyr(BOD, 3, FUN = pred, by.column = FALSE)
giving
## 1 2 3
## [1,] 8.868888 10.86889 17.25474
## [2,] 11.661666 17.24870 16.00000
## [3,] 18.328435 16.18583 15.78583
## [4,] 16.230474 15.83247 19.56886
I can't reproduce the error you get; I get a different error.
The code in the question has by.column = TRUE (which is the default anyway), but that passes only a single column at a time to the function, which is not what you want. You want by.column = FALSE.
Try this:
x0 <- data.matrix(x)
rollapplyr(data = x0, width = 180, FUN = svm_next_day_prediction, by.column = FALSE)
You can create a list with the individual data frames and then apply your function to each one. I rename x to df to avoid confusion:
df <- x
rowwindow <- 179
dfList <- lapply(1:(nrow(df) - rowwindow), function(i) df[i:(rowwindow + i), ])
result <- sapply(dfList, svm_next_day_prediction)
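To score these rolling predictions against the realized signs, a small sketch (this assumes sapply has simplified the length-one factor predictions into a single factor, which unlist does when every element is a factor, and that each window's prediction targets its final row):
# rows targeted by the predictions are rowwindow+1 .. nrow(df)
actual <- df$TARGET[(rowwindow + 1):nrow(df)]
confusionMatrix(factor(result, levels = levels(actual)), actual)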

Representing Parametric Survival Model in 'Counting Process' form in JAGS

I'm trying to build a survival model in JAGS that allows for time-varying covariates. I'd like it to be a parametric model — for example, assuming survival follows the Weibull distribution (but I'd like to allow the hazard to vary, so exponential is too simple). So, this is essentially a Bayesian version of what can be done in the flexsurv package, which allows for time-varying covariates in parametric models.
Therefore, I want to be able to enter the data in 'counting-process' form, where each subject has multiple rows, each corresponding to a time interval in which their covariates remained constant (as described in this pdf or here). This is the (start, stop] formulation that the survival and flexsurv packages allow.
Unfortunately, every explanation of how to perform survival analysis in JAGS seems to assume one row per subject.
I attempted to take this simpler approach and extend it to the counting process format, but the model does not correctly estimate the distribution.
A Failed Attempt:
Here's an example. First we generate some data:
library('dplyr')
library('survival')
## Make the Data: -----
set.seed(3)
n_sub <- 1000
current_date <- 365 * 2
true_shape <- 2
true_scale <- 365
dat <- data_frame(
  person = 1:n_sub,
  true_duration = rweibull(n = n_sub, shape = true_shape, scale = true_scale),
  person_start_time = runif(n_sub, min = 0, max = true_scale * 2),
  person_censored = (person_start_time + true_duration) > current_date,
  person_duration = ifelse(person_censored, current_date - person_start_time, true_duration)
)
person person_start_time person_censored person_duration
(int) (dbl) (lgl) (dbl)
1 1 11.81416 FALSE 487.4553
2 2 114.20900 FALSE 168.7674
3 3 75.34220 FALSE 356.6298
4 4 339.98225 FALSE 385.5119
5 5 389.23357 FALSE 259.9791
6 6 253.71067 FALSE 259.0032
7 7 419.52305 TRUE 310.4770
Then we split the data into two observations per subject. I'm just splitting each subject at time = 300 (unless they didn't make it to time = 300, in which case they get just one observation).
## Split into multiple observations per person: --------
cens_point <- 300 # <----- try changing to 0 for no split; if so, the model estimates correctly
dat_split <- dat %>%
  group_by(person) %>%
  do(data_frame(
    split = ifelse(.$person_duration > cens_point, cens_point, .$person_duration),
    START = c(0, split[1]),
    END = c(split[1], .$person_duration),
    TINTERVAL = c(split[1], .$person_duration - split[1]),
    CENS = c(ifelse(.$person_duration > cens_point, 1, .$person_censored), .$person_censored), # <- edited original post here due to bug; but problem still present when fixing bug
    TINTERVAL_CENS = ifelse(CENS, NA, TINTERVAL),
    END_CENS = ifelse(CENS, NA, END)
  )) %>%
  filter(TINTERVAL != 0)
person split START END TINTERVAL CENS TINTERVAL_CENS
(int) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl)
1 1 300.0000 0 300.0000 300.00000 1 NA
2 1 300.0000 300 487.4553 187.45530 0 187.45530
3 2 168.7674 0 168.7674 168.76738 1 NA
4 3 300.0000 0 300.0000 300.00000 1 NA
5 3 300.0000 300 356.6298 56.62979 0 56.62979
6 4 300.0000 0 300.0000 300.00000 1 NA
Now we can set up the JAGS model.
## Set-Up JAGS Model -------
dat_jags <- as.list(dat_split)
dat_jags$N <- length(dat_jags$TINTERVAL)
inits <- replicate(n = 2, simplify = FALSE, expr = {
  list(TINTERVAL_CENS = with(dat_jags, ifelse(CENS, TINTERVAL + 1, NA)),
       END_CENS = with(dat_jags, ifelse(CENS, END + 1, NA)))
})
model_string <-
"
model {
  # set priors on the reparameterized version, as suggested here:
  # https://sourceforge.net/p/mcmc-jags/discussion/610036/thread/d5249e71/?limit=25#8c3b
  log_a ~ dnorm(0, .001)
  log(a) <- log_a
  log_b ~ dnorm(0, .001)
  log(b) <- log_b
  # JAGS's dweibull(nu, lambda) corresponds to R's shape a and scale b via:
  nu <- a
  lambda <- (1 / b)^a
  for (i in 1:N) {
    # estimate subject durations, with censoring handled via dinterval
    CENS[i] ~ dinterval(TINTERVAL_CENS[i], TINTERVAL[i])
    TINTERVAL_CENS[i] ~ dweibull(nu, lambda)
  }
}
"
library('runjags')
param_monitors <- c('a', 'b', 'nu', 'lambda')
fit_jags <- run.jags(model = model_string,
                     burnin = 1000, sample = 1000,
                     monitor = param_monitors,
                     n.chains = 2, data = dat_jags, inits = inits)
# estimates:
fit_jags
# actual:
c(a = true_shape, b = true_scale)
Depending on where the split point is, the model estimates very different parameters for the underlying distribution. It only gets the parameters right if the data isn't split into the counting process form. It seems like this is not the way to format the data for this kind of problem.
If I am missing an assumption and my problem is less related to JAGS and more related to how I'm formulating the problem, suggestions are very welcome. I might be despairing that time-varying covariates can't be used in parametric survival models (and can only be used in models like the Cox model, which assumes proportional hazards and doesn't actually estimate the underlying distribution). However, as I mentioned above, the flexsurvreg function in the flexsurv package does accommodate the (start, stop] formulation in parametric models.
If anyone knows how to build a model like this in another language (e.g. STAN instead of JAGS) that would be appreciated too.
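For reference, here is a minimal frequentist sketch of that (start, stop] fit in flexsurv, using the dat_split data built above; this leans on flexsurvreg's documented support for counting-process Surv objects, and note that Surv() wants an event indicator, i.e. the complement of the CENS coding used for dinterval:
library(flexsurv)
# Surv(start, stop, event) is the counting-process form; event = 1 - CENS
# because CENS above codes 1 = censored
fit_fs <- flexsurvreg(Surv(START, END, 1 - CENS) ~ 1,
                      data = dat_split, dist = "weibull")
fit_fs # the shape and scale estimates should be close to true_shape and true_scale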
Edit:
Chris Jackson provides some helpful advice via email:
I think the T() construct for truncation in JAGS is needed here. Essentially for each period (t[i], t[i+1]) where a person is alive but the covariate is constant, the survival time is left-truncated at the start of the period, and possibly also right-censored at the end. So you'd write something like y[i] ~ dweib(shape, scale[i])T(t[i], )
I tried implementing this suggestion as follows:
model {
  # same as before
  log_a ~ dnorm(0, .01)
  log(a) <- log_a
  log_b ~ dnorm(0, .01)
  log(b) <- log_b
  nu <- a
  lambda <- (1 / b)^a
  for (i in 1:N) {
    # modified to include left-truncation via T()
    CENS[i] ~ dinterval(END_CENS[i], END[i])
    END_CENS[i] ~ dweibull(nu, lambda) T(START[i], )
  }
}
Unfortunately this doesn't quite do the trick. With the old code, the model was mostly getting the scale parameter right but doing a very bad job on the shape parameter. With this new code, it gets very close to the correct shape parameter but consistently over-estimates the scale parameter. I have noticed that the degree of over-estimation is correlated with how late the split point comes: if the split point is early (cens_point = 50), there isn't really any over-estimation; if it's late (cens_point = 350), there is a lot.
I thought maybe the problem could be related to 'double-counting' the observations: if we see a censored observation at t = 300 and then, from the same person, an uncensored observation at t = 400, it seems intuitive that this person contributes two data points to our inference about the Weibull parameters when really they should contribute just one. I therefore tried incorporating a random effect for each person; however, this completely failed, with huge estimates (in the 50-90 range) for the nu parameter. I'm not sure why, but perhaps that's a question for a separate post. Since I'm not sure whether the problems are related, you can find the code for this whole post, including the JAGS code for that model, here.
You can use the rstanarm package, which is a wrapper around Stan. It allows you to use standard R formula notation to describe survival models. The stan_surv function accepts arguments in 'counting process' form, and different baseline hazard functions, including the Weibull, can be used to fit the model.
The survival part of rstanarm (the stan_surv function) is still not available on CRAN, so you should install the package directly from mc-stan.org:
install.packages("rstanarm", repos = c("https://mc-stan.org/r-packages/", getOption("repos")))
Please see the code below:
library(dplyr)
library(survival)
library(rstanarm)
## Make the Data: -----
set.seed(3)
n_sub <- 1000
current_date <- 365*2
true_shape <- 2
true_scale <- 365
dat <- data_frame(
  person = 1:n_sub,
  true_duration = rweibull(n = n_sub, shape = true_shape, scale = true_scale),
  person_start_time = runif(n_sub, min = 0, max = true_scale * 2),
  person_censored = (person_start_time + true_duration) > current_date,
  person_duration = ifelse(person_censored, current_date - person_start_time, true_duration)
)
## Split into multiple observations per person: --------
cens_point <- 300 # <----- try changing to 0 for no split; if so, model correctly estimates
dat_split <- dat %>%
  group_by(person) %>%
  do(data_frame(
    split = ifelse(.$person_duration > cens_point, cens_point, .$person_duration),
    START = c(0, split[1]),
    END = c(split[1], .$person_duration),
    TINTERVAL = c(split[1], .$person_duration - split[1]),
    CENS = c(ifelse(.$person_duration > cens_point, 1, .$person_censored), .$person_censored),
    TINTERVAL_CENS = ifelse(CENS, NA, TINTERVAL),
    END_CENS = ifelse(CENS, NA, END)
  )) %>%
  filter(TINTERVAL != 0)
# flip the coding: Surv() wants 1 = event, while CENS above used 1 = censored
dat_split$CENS <- as.integer(!(dat_split$CENS))
# Fit STAN survival model
mod_tvc <- stan_surv(
formula = Surv(START, END, CENS) ~ 1,
data = dat_split,
iter = 1000,
chains = 2,
basehaz = "weibull-aft")
# Print fit coefficients
mod_tvc$coefficients[2]
unname(exp(mod_tvc$coefficients[1]))
Output, which is consistent with true values (true_shape <- 2; true_scale <- 365):
> mod_tvc$coefficients[2]
weibull-shape
1.943157
> unname(exp(mod_tvc$coefficients[1]))
[1] 360.6058
You can also look at the underlying Stan source using rstan::get_stanmodel(mod_tvc$stanfit) to compare it with the attempts you made in JAGS.

MCMCglmm multinomial model in R

I'm trying to create a model using the MCMCglmm package in R.
The data are structured as follows, where dyad, focal, and other are all random effects, r and present are predictor variables, and resp1-resp5 are outcome variables that capture the number of observed behaviors of different subtypes:
dyad focal other r present village resp1 resp2 resp3 resp4 resp5
1 10101 14302 0.5 3 1 0 0 4 0 5
2 10405 11301 0.0 5 0 0 0 1 0 1
…
So a model with only one outcome (teaching) is as follows:
prior_overdisp_i <- list(
  R = list(V = diag(2), nu = 0.08, fix = 2),
  G = list(G1 = list(V = 1, nu = 0.08), G2 = list(V = 1, nu = 0.08),
           G3 = list(V = 1, nu = 0.08), G4 = list(V = 1, nu = 0.08)))
m1 <- MCMCglmm(teaching ~ trait - 1 + at.level(trait, 1):r + at.level(trait, 1):present,
               random = ~ idh(at.level(trait, 1)):focal + idh(at.level(trait, 1)):other +
                 idh(at.level(trait, 1)):X + idh(at.level(trait, 1)):village,
               rcov = ~ idh(trait):units, family = "zipoisson", prior = prior_overdisp_i,
               data = data, nitt = nitt.1, thin = 50, burnin = 15000,
               pr = TRUE, pl = TRUE, verbose = TRUE, DIC = TRUE)
Hadfield's course notes (Ch 5) give an example of a multinomial model that uses only a single outcome variable with 3 levels (sheep horns of 3 types). Similar treatment can be found here: http://hlplab.wordpress.com/2009/05/07/multinomial-random-effects-models-in-r/ This is not quite right for what I'm doing, but contains helpful background info.
Another reference (Hadfield 2010) gives an example of a multi-response MCMCglmm that follows the same format but uses cbind() to predict a vector of responses, rather than a single outcome. The same model with multiple responses would look like this:
m1 <- MCMCglmm(cbind(resp1, resp2, resp3, resp4, resp5) ~ trait - 1 +
                 at.level(trait, 1):r + at.level(trait, 1):present,
               random = ~ idh(at.level(trait, 1)):focal + idh(at.level(trait, 1)):other +
                 idh(at.level(trait, 1)):X + idh(at.level(trait, 1)):village,
               rcov = ~ idh(trait):units,
               family = cbind("zipoisson", "zipoisson", "zipoisson", "zipoisson", "zipoisson"),
               prior = prior_overdisp_i,
               data = data, nitt = nitt.1, thin = 50, burnin = 15000,
               pr = TRUE, pl = TRUE, verbose = TRUE, DIC = TRUE)
I have two programming questions here:
How do I specify a prior for this model? I've looked at the materials mentioned in this post but just can't figure it out.
I've run a similar version with only two response variables, but I only get one slope, where I thought I should get a different slope for each response variable. Where am I going wrong, or have I misunderstood the model?
Answer to my first question, based on the HLP post and some help from a colleague/stats consultant:
# values for the prior
k <- 5 # number of categorical outcomes (the HLP post uses length(levels(dative$SemanticClass)))
I <- diag(k - 1) # (k-1) x (k-1) identity matrix
J <- matrix(rep(1, (k - 1)^2), c(k - 1, k - 1)) # (k-1) x (k-1) matrix of ones
And for my model, using the multinomial5 family and 5 outcome variables, the prior is:
prior = list(
  R = list(fix = 1, V = 0.5 * (I + J), n = 4),
  G = list(
    G1 = list(V = diag(4), n = 4)))
For my second question, I need to add an interaction term to the fixed effects in this model:
m <- MCMCglmm(cbind(Resp1, Resp2...) ~ -1 + trait*predictorvariable,
...
The result gives both main effects for the Response variables and posterior estimates for the Response/Predictor interaction (the effect of the predictor variable on each response variable).

Fast way of evaluating a formula?

I'm using either dyn or dynlm to predict time series using lagged variables.
However, the predict function in either case only evaluates one time step at a time, taking a constant 24 milliseconds per step on my computer, or about 1.8 hours for my dataset. That is far too long, given that the entire regression takes about 10 seconds.
So, I'm thinking that perhaps the fastest thing might be just to evaluate the formula by hand?
So, is there some way of evaluating a formula given values in a data.frame or the current environment or similar?
I'm thinking of something along the lines of:
evalMagic( load ~ temperature + time, data.frame( temperature = 10, time = 4 ) )
I suppose, as I write this, that we need to handle the coefficients somehow, something like:
evalMagic( load ~ temperature + time, data.frame( temperature = 10, time = 4 ), model$coefficients )
.... so this raises the questions of:
isn't this what predict is supposed to do?
why is predict so slow?
what options do I have to make the prediction a bit faster? After all, it's not inverting any matrices or something, it's just a bit of arithmetic!
I wrote my own lag implementation in the end. It's hacky and not beautiful, but it's a lot faster. It can process 1000 rows in 4 seconds on my crappy laptop.
# lags is a data.frame, eg:
#   var amount
#   y   1
#   y   2
addLags <- function(dataset, lags) {
  N <- nrow(dataset)
  print(lags)
  if (nrow(lags) > 0) {
    print(lags)
    for (j in 1:nrow(lags)) {
      sourcename <- as.character(lags[j, "var"])
      k <- lags[j, "amount"]
      cat("k", k, "sourcename", sourcename, "\n")
      lagcolname <- sprintf("%s_%d", sourcename, k)
      dataset[, lagcolname] <- c(rep(0, k), dataset[1:(N - k), sourcename])
    }
  }
  dataset
}
lmLagged <- function(formula, train, lags) {
  # find the largest lag, and skip that many rows at the start
  N <- nrow(train)
  skip <- 0
  for (j in 1:nrow(lags)) {
    k <- lags[j, "amount"]
    skip <- max(k, skip)
  }
  print(train)
  train <- addLags(train, lags)
  print(train)
  lm(formula, train[(skip + 1):N, ])
}
# pass in training data and test data;
# it will step through the test rows one by one.
# need to give the dependent variable's name.
# lags is a data.frame, eg:
#   var amount
#   y   1
#   y   2
predictLagged <- function(model, train, test, dependentvarname, lags) {
  Ntrain <- nrow(train)
  Ntest <- nrow(test)
  test[, dependentvarname] <- NA
  testtraindata <- rbind(train, test)
  testtraindata <- addLags(testtraindata, lags)
  for (i in 1:Ntest) {
    thistestdata <- testtraindata[Ntrain + i, ]
    result <- predict(model, newdata = thistestdata)
    for (j in 1:nrow(lags)) {
      sourcename <- as.character(lags[j, "var"])
      k <- lags[j, "amount"]
      lagcolname <- sprintf("%s_%d", sourcename, k)
      # feed the prediction forward into the lag columns of later rows
      testtraindata[Ntrain + i + k, lagcolname] <- result
    }
    testtraindata[Ntrain + i, dependentvarname] <- result
  }
  return(testtraindata[(Ntrain + 1):(Ntrain + Ntest), dependentvarname])
}
library("RUnit")
# size of training data
N <- 6
predictN <- 50
# create training data, which we can get exact fit on
set.seed(1)
x = sample( 100, N )
traindata <- numeric()
traindata[1] <- 1 + 1.1 * x[1]
traindata[2] <- 2 + 1.1 * x[2]
for( i in 3:N ) {
traindata[i] <- 0.5 + 0.3 * traindata[i-2] - 0.8 * traindata[i-1] + 1.1 * x[i]
}
train <- data.frame(x = x, y = traindata, foo = 1)
#train$x <- NULL
# create testing data, bunch of NAs
test <- data.frame( x = sample(100,predictN), y = rep(NA,predictN), foo = 1)
# specify which lags we need to handle
# one row per lag, with name of variable we are lagging, and the distance
# we can then use these in the formula, eg y_1, and y_2
# are y lagged by 1 and 2 respectively
# It's hacky but it kind of works...
lags <- data.frame( var = c("y","y"), amount = c(1,2) )
# fit a model
model <- lmLagged( y ~ x + y_1 + y_2, train, lags )
# look at the model, it's a perfect fit. Nice!
print(model)
print(system.time( test <- predictLagged( model, train, test, "y", lags ) ))
#checkEqualsNumeric( 69.10228, test[56-6], tolerance = 0.0001 )
#checkEquals( 2972.159, test$y[106-6] )
print(test)
# nice plot
plot(test, type='l')
Output:
> source("test/test.regressionlagged.r",echo=F)
Call:
lm(formula = formula, data = train[(skip + 1):N, ])
Coefficients:
(Intercept) x y_1 y_2
0.5 1.1 -0.8 0.3
user system elapsed
0.204 0.000 0.204
[1] -19.108620 131.494916 -42.228519 80.331290 -54.433588 86.846257
[7] -13.807082 77.199543 12.698241 64.101270 56.428457 72.487616
[13] -3.161555 99.575529 8.991110 44.079771 28.433517 3.077118
[19] 30.768361 12.008447 2.323751 36.343533 67.822299 -13.154779
[25] 72.070513 -11.602844 115.003429 -79.583596 164.667906 -102.309403
[31] 193.347894 -176.071136 254.361277 -225.010363 349.216673 -299.076448
[37] 400.626160 -371.223862 453.966938 -420.140709 560.802649 -542.284332
[43] 701.568260 -679.439907 839.222404 -773.509895 897.474637 -935.232679
[49] 1022.328534 -991.232631
There's about 12 hours work in those 91 lines of code. Ok, I confess I played Plants and Zombies for a bit. So, 10 hours. Plus lunch and dinner. Still, quite a lot of work anyway.
If we change predictN to 1000, I get about 4.1 seconds from the system.time call.
I think it's faster because:
we don't use time-series objects; I suspect that speeds things up
we don't use the dynamic lm libraries, just plain lm; I guess that's slightly faster
we only pass a single row of data into predict for each prediction, which I think is significantly faster; e.g. using dyn$lm or dynlm, if one has a lag of 30, one would need to pass 31 rows of data into predict, AFAIK
there is a lot less data.frame/matrix copying, since we just update the lag values in place on each iteration
Edit: corrected minor buggette where predictLagged returned a multi-column data-frame instead of just a numeric vector
Edit2: corrected less minor bug where you couldn't add more than one variable. Also reconciled the comments and code for lags, and changed the lags structure to "var" and "amount" in place of "name" and "lags". Also, updated the test code to add a second variable.
Edit: there are tons of bugs in this version, which I know, because I've unit-tested it a bit more and fixed them, but copying and pasting is very time-consuming, so I will update this post in a few days, once my deadline is over.
Maybe you're looking for this:
fastlinpred <- function(formula, newdata, coefs) {
  X <- model.matrix(formula, data = newdata)
  X %*% coefs
}
coefs <- c(1, 2, 3)
dd <- data.frame(temperature = 10, time = 4)
fastlinpred(~ temperature + time, dd, coefs)
This assumes that the formula has only an RHS (you can get rid of the LHS of a formula by doing form[-2]).
This certainly gets rid of a lot of the overhead of predict.lm, but I don't know if it is as fast as you want. model.matrix has a lot of internal machinery too.
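For instance, a quick sketch of the form[-2] trick, reusing dd and coefs from above:
form <- load ~ temperature + time
form[-2] # drops the LHS, leaving ~ temperature + time
fastlinpred(form[-2], dd, coefs)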
