I have been told that, while using R's forecast package, one can reuse a model. That is, after the code x <- c(1,2,3,4); mod <- ets(x); f <- forecast(mod,h=1) one could append a new value (e.g. x <- append(x, 5)) and predict the next value without re-estimating the model. How does one do that? (As I understand it, with simple exponential smoothing one would only need to know alpha, right?)
Is it like forecast(x, model=mod)? If that is the case, I should mention that I am using Java and calling the forecast code programmatically (for many time series), so I don't think I could keep the model object in the R environment all the time. Would there be an easy way to keep the model object in Java and load it into the R environment when needed?
You have two questions here:
A) Can the forecast package "grow" its datasets? I can't speak in great detail to this package and you will have to look at its documentation. However, R models in general follow the pattern
fit <- someModel(formula, data)
estfit <- predict(fit, newdata=someDataFrame)
i.e. you supply new data to an existing fit object.
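For instance, a minimal illustration of that pattern with lm() (the data frame here is made up for the example):
fit <- lm(y ~ x, data = data.frame(x = 1:10, y = 2 * (1:10) + rnorm(10)))
estfit <- predict(fit, newdata = data.frame(x = 11:12))  # predictions for new x values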
B) Can I serialize a model back and forth to Java? Yes, you can. Rserve is one option; you can also use basic serialize() to a raw vector (or character), or even just save(fit, file="someFile.RData").
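For example, a minimal sketch of getting a fitted model out of and back into R (the file name is just illustrative):
raw_mod <- serialize(mod, connection = NULL)  # raw vector you could hold on the Java side
mod2 <- unserialize(raw_mod)                  # restores an identical model object
save(mod, file = "someFile.RData")            # or write it to disk ...
load("someFile.RData")                        # ... and restore object 'mod' in a later session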
Regarding your first question:
x <- 1:4
mod <- ets(x)
f1 <- forecast(mod, h=1)
x <- append(x, 5)
mod <- ets(x, model=mod) # Reuses old mod without re-estimating parameters.
f2 <- forecast(mod, h=1)
As part of my data analysis (on time series), I am checking for correlation between log-returns and realized volatility.
My data consists of time series spanning several years for around hundred different companies (large zoo object, ~2 MB filesize). To check for the above-mentioned correlation, I have used the following code to calculate several rolling variances (a.k.a. realized volatility):
rollvar5 <- sapply(returns, rollVar, n=5, na.rm=TRUE)
rollvar10 <- sapply(returns, rollVar, n=10, na.rm=TRUE)
using the simple fTrading function rollVar. I then converted the rolling variances to zoo objects and added the date index (by exporting the results to CSV files, manually adding the dates, and then using read.zoo - not very sophisticated, but it works just fine).
Now I wish to create around 100 linear regression models, each linking the log-returns of a company to the realized volatility of that company. On an individual basis, this would look like the following:
lm_rollvar5 <- lm(returns[5:1000,1] ~ rollvar5[5:1000,1])
lm_rollvar10 <- lm(returns[10:1000,1] ~ rollvar10[10:1000,1])
This works without problems.
Now I wish to extend this to automatically create the linear regression models for all 100 companies. What I've tried was a simple for-loop:
NC <- ncol(returns)
for(i in 1:NC){
lm_rollvar5 <- lm(returns[5:1000,i] ~ rollvar5[5:1000,i])
summary(lm_rollvar5)
lm_rollvar10 <- lm(returns[10:1000,i] ~ rollvar10[10:1000,i])
summary(lm_rollvar10)
}
Is there any way I could optimize my approach (i.e. how could I save all the regression results in a simple way)? Right now the for-loop just outputs hundreds of regression results, which makes them very hard to analyse.
I also tried to use the apply family of functions, but I am unsure how to use them in this case, since there are several time series objects (the returns and the rolling variances are stored in different objects, as you can see).
As to your question of how you could save all regression results in a simple way: this is a bit difficult to answer given that we don't know exactly what you need to do and what you consider "simple". However, you could define a list outside the loop and store each regression model in it, so that you can access the models later without refitting them. Try e.g.
NC <- ncol(returns)
lm_rollvar5 <- vector(mode="list", length=NC)
lm_rollvar10 <- vector(mode="list", length=NC)
for(i in 1:NC){
lm_rollvar5[[i]] <- lm(returns[5:1000,i] ~ rollvar5[5:1000,i])
lm_rollvar10[[i]] <- lm(returns[10:1000,i] ~ rollvar10[10:1000,i])
}
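If you prefer the apply-style approach you mentioned, an equivalent sketch with lapply (same models, no explicit loop) would be:
lm_rollvar5  <- lapply(1:NC, function(i) lm(returns[5:1000, i]  ~ rollvar5[5:1000, i]))
lm_rollvar10 <- lapply(1:NC, function(i) lm(returns[10:1000, i] ~ rollvar10[10:1000, i]))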
This gives you the fitted model for firm i at the i-th position in the list. In the same manner, you can also save the output of summary. Or you can do something like
my.summaries_5 <- lapply(lm_rollvar5, summary)
which gives you a list of summaries.
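From such lists you can then extract whatever you need in one pass, e.g. (assuming each model has a single regressor, so the slope is the second coefficient):
slopes_5 <- sapply(lm_rollvar5, function(m) coef(m)[2])      # slope per firm
rsq_5    <- sapply(my.summaries_5, function(s) s$r.squared)  # R-squared per firm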
I do not have a very clear idea of how to use functions like lm() that ask for a formula and a data.frame.
On the web I have read about different approaches, but sometimes R gives warnings and other issues.
Suppose, for example, a linear model where the output vector y is explained by the matrix X.
I have read that the best way is to use a data.frame (especially if we are going to use the predict function later).
In a situation where X is a matrix, is this the best way to use lm?
n=100
p=20
n_new=50
X=matrix(rnorm(n*p),n,p)
Y=rnorm(n)
data=list("x"=X,"y"=Y)
l=lm(y~x,data)
X_new=matrix(rnorm(n_new*p),n_new,p)
pred=predict(l,as.data.frame(X_new))
How about:
l <- lm(y~.,data=data.frame(X,y=Y))
pred <- predict(l,data.frame(X_new))
In this case R constructs the column names (X1 ... X20) automatically, but when you use the y~. syntax you don't need to know them.
Alternatively, if you are always going to fit linear regressions based on a matrix, you can use lm.fit() and compute the predictions yourself using matrix multiplication: you have to use cbind(1,.) to add an intercept column.
fit <- lm.fit(cbind(1,X),Y)
all(coef(l)==fit$coefficients) ## TRUE
pred <- cbind(1,X_new) %*% fit$coefficients
(You could also use cbind(1,X_new) %*% coef(l).) This is efficient, but it skips a lot of the error-checking steps, so use it with caution ...
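As a quick sanity check that the two routes give the same predictions (a sketch using the objects defined above):
pred_lm  <- predict(l, data.frame(X_new))             # formula interface
pred_fit <- cbind(1, X_new) %*% fit$coefficients      # matrix route
all.equal(as.numeric(pred_lm), as.numeric(pred_fit))  # should be TRUE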
In a situation like the one you describe, you have no reason not to turn your matrix into a data frame. Try:
myData <- as.data.frame(cbind(Y, X))
l <- lm(Y~., data=myData)
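To predict with that fit, the new matrix just needs the same column names as the training frame (whatever as.data.frame happened to assign); a minimal sketch:
newData <- as.data.frame(X_new)
names(newData) <- names(myData)[-1]   # align names with the training frame, dropping the response
pred <- predict(l, newdata = newData)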
I am not sure whether I am denormalizing data correctly. I have one output variable and several input variables. I am normalizing them using the RSNNS package. Suppose x is an input matrix (N x M) where each of the N rows is an object with M features, and y is a vector of length N with the corresponding target values.
nx <- normalizeData(x, type='0_1')
After that, some of the data are used to fit the model and some for prediction. Suppose pred.ny are the predicted values; these values are normalized.
pred.y <- denormalizeData(pred.ny, getNormParameters(nx))
Is this correct? How does it work? For a single input it is clear that it can use the min and max values previously used for normalization. But how does it work if each input was normalized separately, using its own min and max values?
Update
Here is a toy example where '0_1' looks better than 'norm': with 'norm' the training error is huge and the prediction is almost constant.
x <- runif(1020, 1, 5000)
y <- sqrt(x)
nx <- normalizeData(x, type='0_1')
ny <- normalizeData(y, type='0_1')
model <- mlp(nx[1:1000], ny[1:1000], size = 1)
plotIterativeError(model)
npy <- predict(model, matrix(nx[1001:1020], ncol=1))
py <- denormalizeData(npy, getNormParameters(ny))
print(cbind(y[1001:1020], py))
There are two things going on here:
Training your model, i.e. setting the internal coefficients in the neural network. You use both the inputs and output for this.
Using the model, i.e. getting predictions with fixed internal coefficients.
For part 1, you've decided to normalize your data. So the neural net works on normalized data. So you have trained the neural net
on the inputs fX(X) rather than X, where fX is the transform you applied to the matrix of original inputs to produce the normalized inputs.
on the outputs fy(y) rather than y, where fy is the transform you applied to the vector of outputs to get the normalized outputs.
In terms of your original inputs and outputs your trained machine now looks like this:
Apply the normalization function fX to inputs to get normalized inputs fX(X).
Run neural net with normalized inputs to produce normalized outputs fy(y).
Apply the denormalization function fy^-1 to the normalized outputs fy(y) to get y.
Note that fX and fy, and hence fy^-1, are defined on the training set.
So in R you might write something like this to take the training data (say the first 100 rows) and normalize it:
tx <- x[1:100,]
ntx <- normalizeData(tx, type='0_1')
ty <- y[1:100]
nty <- normalizeData(ty, type='0_1')
and something like this to denormalize the predicted results
pred.y <- denormalizeData(pred.ny, getNormParameters(nty))
# nty (or ny) not nx here
The slight concern I have is that I'd prefer to normalize the features used in prediction using the same transform fX that I used for training, but looking at the RSNNS documentation this facility doesn't appear to be present (it would, however, be easy to write yourself). It would probably be OK to normalize the prediction features using the whole X matrix, i.e. including the training data. (I can also see that it might be preferable to use the default z-score normalization ('norm') that RSNNS provides rather than the '0_1' choice you have used.)
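For reference, a small from-scratch sketch of "writing it yourself": re-use the training-set minima and ranges to put new prediction inputs on the training scale (the row range is just illustrative, and it deliberately avoids relying on RSNNS's internal parameter layout):
tr_min <- apply(tx, 2, min)                       # per-feature minima from the training block
tr_rng <- apply(tx, 2, max) - tr_min              # per-feature ranges from the training block
norm01_with_train <- function(newx) sweep(sweep(newx, 2, tr_min, "-"), 2, tr_rng, "/")
npx <- norm01_with_train(x[101:120, ])            # prediction rows on the training scale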
I'm working with data that is not normally distributed. I have applied the common methods, logs and square roots, to transform the data so that I can then fit an ARIMA model and make a forecast.
What I have tried is:
set.seed(123)
y<-rexp(200)
yl<-log(y+1)
shapiro.test(yl)
trans<-(y-mean(y))/sd(y)
shapiro.test(trans)
These methods are failing the test of normality. I would like to ask whether there are other options in R to transform the data so that it is (approximately) normal.
You can try the forecast package, which has the BoxCox.lambda function to handle Box-Cox transformations. The scaling and re-scaling are done automatically. Example:
require(forecast)
y <- ts(rnorm(120,0,3) + 20*sin(2*pi*(1:120)/12), frequency=12) + runif(120)
lambda <- BoxCox.lambda(y) # should check if the transformation is necessary
model <- auto.arima(y, lambda = lambda)
plot(forecast(model))
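If you want to inspect the transformed series yourself rather than letting auto.arima handle it, the same package also provides BoxCox and InvBoxCox (a minimal sketch):
ty   <- BoxCox(y, lambda)      # series on the transformed scale
plot(ty)                       # inspect how much more Gaussian it looks
back <- InvBoxCox(ty, lambda)  # map values on the transformed scale back to the original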
Assume that I have sources of data X and Y that are indexable, say matrices, and I want to run a set of independent regressions and store the results. My initial approach would be
results <- matrix(nrow=nrow(X), ncol=2)
for(i in 1:nrow(X)) {
  results[i,] <- coefficients(lm(Y[i,] ~ X[i,]))
}
But, loops are bad, so I could do it with lapply as
out <- lapply(1:nrow(X), function(i) { coefficients(lm(Y[i,] ~ X[i,])) } )
Is there a better way to do this?
You are certainly over-optimizing here. The overhead of a loop is negligible compared to the model fitting itself, so the simple answer is: use whichever way you find most understandable. I'd go for the for-loop, but lapply is fine too.
I do this type of thing with plyr, but I agree that it's not a processing-efficiency issue so much as a question of what you are comfortable reading and writing.
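For example, a sketch of a plyr version (ldply row-binds the per-regression coefficient vectors into a data frame; names are illustrative):
library(plyr)
coefs <- ldply(1:nrow(X), function(i) coefficients(lm(Y[i,] ~ X[i,])))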
If you just want to perform straightforward multiple linear regression, then I would recommend not using lm(). There is lsfit(), but I'm not sure it would offer that much of a speed-up (I have never performed a formal comparison). Instead I would recommend computing the least-squares solution (X'X)^{-1} X'y via qr() and qr.coef(). This also lets you perform multivariate multiple linear regression, i.e. treating the response as a matrix instead of a vector and fitting a separate regression to each column of the response matrix with a single QR decomposition.
Z # design matrix
Y # matrix of observations (each row is a vector of observations)
## Estimation via multivariate multiple linear regression
beta <- qr.coef(qr(Z), Y)
## Fitted values
Yhat <- Z %*% beta
## Residuals
u <- Y - Yhat
In your example, is there a different design matrix per vector of observations? If so, you may be able to modify Z in order to still accommodate this.
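A self-contained toy version of the above, just to make the shapes concrete (all names and sizes are made up for the illustration):
set.seed(1)
n <- 50; p <- 3; m <- 10                    # observations, predictors, response columns
Z <- cbind(1, matrix(rnorm(n * p), n, p))   # design matrix with an intercept column
Y <- Z %*% matrix(rnorm((p + 1) * m), p + 1, m) + matrix(rnorm(n * m, sd = 0.1), n, m)
beta <- qr.coef(qr(Z), Y)                   # one QR factorization, m regressions at once
Yhat <- Z %*% beta                          # fitted values for all m responses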