Accuracy Function - r

I am doing a TS Analysis. What is the difference between these two accuracies:
fit<-auto.arima(tsdata)
fcast<-forecast(fit,6)
accuracy(fcast) #### First Accuracy
fit<-auto.arima(tsdata)
fcast<-forecast(fit,6)
accuracy(fcast, actual_values) #### Second Accuracy
How does the accuracy function work when I don't specify the actual values, as in the first case?
Secondly, what is the right approach to calculate accuracy?

In this answer I'm assuming you are using the function from the forecast package.
The answer lies within accuracy's description:
Returns range of summary measures of the forecast accuracy. If x is provided, the function measures out-of-sample (test set) forecast accuracy based on x-f. If x is not provided, the function only produces in-sample (training set) accuracy measures of the forecasts based on f["x"]-fitted(f). All measures are defined and discussed in Hyndman and Koehler (2006).
Here x is the second argument of the function. In short, accuracy(fcast) estimates the prediction error from the training set only.
For example, let's assume you have 12 months of data and are predicting 6 months ahead. Then accuracy(fcast) gives you the error of the model over those 12 months only.
Now let's assume x is the real demand for the 6 months you are forecasting, and that you did not use this data to build the ARIMA model. In this case accuracy(fcast, x) gives you the test set error, which is a better measure of what you will get in the future using this model (compared to the training set error).
The best practice is to use the test set error, because it is less prone to optimistic bias: you will most likely get "better" results on the training set than on a held-out test set, but those results partly reflect overfitting. If you have a test set, pass it as the second argument.
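To make the two calls concrete, here is a minimal sketch. It assumes the forecast package is installed and uses the built-in AirPassengers series as a stand-in for tsdata; the six-month holdout is chosen only for illustration.

```r
library(forecast)

# Stand-in data: hold out the last 6 months as the test set.
tsdata <- AirPassengers
train <- window(tsdata, end = c(1960, 6))
actual_values <- window(tsdata, start = c(1960, 7))

# Fit on the training part only, then forecast the held-out horizon.
fit <- auto.arima(train)
fcast <- forecast(fit, h = length(actual_values))

in_sample <- accuracy(fcast)                  # training set measures only
out_sample <- accuracy(fcast, actual_values)  # training and test set measures

print(in_sample)
print(out_sample)
```

The second table has an extra "Test set" row, which is the one to look at when judging how the model is likely to perform on new data.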

Related

Forecast ARIMA using a different training set

I estimate an ARIMA model on a training dataset using the auto.arima function in R. Afterwards I use the forecast function to make, say, 50 predictions and calculate accuracy measures such as RMSE and MAE.
If I use the forecast function, it uses only the observations in the training set, and makes the prediction at each time t using the value predicted at time t-1. What I am trying to do is make one prediction at a time, adding an observed value to the training set at each time t, without re-estimating the ARIMA model. So instead of the values predicted at time t-1, I would use the real values. If the ARIMA model has been estimated on a training dataset of 100 observations, the first forecast uses the training set of length 100, the second forecast uses a training set of length 101, the third a training set of length 102, and so on.
The auto.arima output contains the series "x", which is the training set used to estimate the model, and the series "fitted", which contains the fitted values. It also has the element "nobs", which is the length of "x". I tried replacing auto.arima$x with a new training series whose last observations are the true values, added one at a time, and I also updated "nobs" to the length of the new "x". But the one-step-ahead forecast always uses the old training set. For instance, I added one observed value at a time to the training set and made the one-step-ahead prediction 50 times, but all the predictions are equal to the first one. It is as if the forecast function ignores the fact that I replaced the "x" series inside the auto.arima output. Replacing the "fitted" values gives the same result.
Does someone know exactly how the forecast function uses the training set to make its predictions? What should I modify inside the auto.arima output at each time t to get one-step-ahead predictions based on the real values at previous times instead of the estimated ones? Or is there a way to tell the forecast function to use a different training dataset?
I don't want to refit the ARIMA model on the new training dataset (using the Arima function) and re-estimate the residual variance; it takes literally forever...
Any suggestion would be helpful
Thank you in advance
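
One approach that matches the question above, offered as a sketch rather than a definitive answer: the forecast package's Arima() accepts a model argument, which applies an already fitted model to a new series without re-estimating its coefficients. The fitted values of that second object are then one-step-ahead predictions based on the real past observations. The AirPassengers series and the 100-observation training split are assumptions made only for illustration.

```r
library(forecast)

y <- AirPassengers
n_train <- 100

# Training series: the first 100 observations, keeping the ts attributes.
train <- ts(y[1:n_train], start = start(y), frequency = frequency(y))
fit <- auto.arima(train)

# Apply the fitted model to the full series; coefficients are NOT re-estimated.
refit <- Arima(y, model = fit)

# Fitted values are one-step-ahead predictions that use the observed history.
one_step <- as.numeric(fitted(refit))
test_idx <- (n_train + 1):length(y)
one_step_test <- one_step[test_idx]
y_test <- as.numeric(y)[test_idx]

rmse <- sqrt(mean((y_test - one_step_test)^2))
mae <- mean(abs(y_test - one_step_test))
print(c(RMSE = rmse, MAE = mae))
```

Because nothing is re-estimated, this costs a single filtering pass over the data rather than one model fit per forecast origin.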

How to assess the model and prediction of random forest when doing regression analysis?

I know that when random forest (RF) is used for classification, the AUC is normally used to assess the quality of the classification on test data. However, I have no idea which measure to use to assess the quality of regression with RF. I now want to use RF for regression, e.g. using a matrix with several hundred samples and features to predict the (numerical) concentration of chemicals.
The first step is to run randomForest to build the regression model, with y as a continuous numeric. How can I know whether the model is good or not, based on the Mean of squared residuals and % Var explained? Sometimes my % Var explained is negative.
Afterwards, if the model is fine and I apply it to test data, I get the predicted values. How can I assess whether the predicted values are good or not? I read online that some people calculate an accuracy (formula: 1-abs(predicted-actual)/actual), which makes sense to me. However, I have many zero values in my actual dataset. Are there any other ways to assess the accuracy of the predicted values?
Looking forward to any suggestions and thanks in advance.
The randomForest R package comes with an importance function, which reports how much each predictor contributes to the accuracy of the model. From the documentation:
importance(x, type=NULL, class=NULL, scale=TRUE, ...), where x is the output from your initial call to randomForest.
There are two types of importance measurement. One permutes the out-of-bag (OOB) data to test the accuracy of the model. The other uses the decrease in node impurity (the Gini index for classification). Again, from the documentation:
Here are the definitions of the variable importance measures. The first measure is computed from permuting OOB data: For each tree, the prediction error on the out-of-bag portion of the data is recorded (error rate for classification, MSE for regression). Then the same is done after permuting each predictor variable. The difference between the two are then averaged over all trees, and normalized by the standard deviation of the differences. If the standard deviation of the differences is equal to 0 for a variable, the division is not done (but the average is almost always equal to 0 in that case).
The second measure is the total decrease in node impurities from splitting on the variable, averaged over all trees. For classification, the node impurity is measured by the Gini index. For regression, it is measured by residual sum of squares.
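As an illustration of the two measures described above, here is a small sketch. It assumes the randomForest package is installed and uses the built-in mtcars data as a stand-in regression problem; the seed is arbitrary.

```r
library(randomForest)

set.seed(42)
# importance = TRUE is needed so that the permutation measure is computed.
rf <- randomForest(mpg ~ ., data = mtcars, importance = TRUE)

perm_imp <- importance(rf, type = 1)  # permutation measure (%IncMSE)
node_imp <- importance(rf, type = 2)  # node impurity measure (IncNodePurity)

# Predictors ordered by the permutation measure, most important first.
print(perm_imp[order(perm_imp[, 1], decreasing = TRUE), , drop = FALSE])
print(node_imp)
```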
One more simple check you may do, really more of a sanity check than anything else, is to compare against the best constant model. The best constant model has a constant output, which is the mean of all responses in the test data set, and can be assumed to be the crudest model possible. You may compare the average performance of your random forest model against the best constant model for a given set of test data. If the random forest does not outperform the best constant model by at least a factor of, say, 3-5, then your RF model is not very good.
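
The comparison against a constant model can be sketched as follows. The mtcars data, the 22/10 train/test split, and the use of RMSE and MAE as the performance measures are assumptions for illustration. Both measures stay well defined when the actual values contain zeros, unlike the relative-error formula quoted in the question.

```r
library(randomForest)

set.seed(1)
idx <- sample(nrow(mtcars), 22)
train <- mtcars[idx, ]
test <- mtcars[-idx, ]

rf <- randomForest(mpg ~ ., data = train)
pred_rf <- predict(rf, newdata = test)

# Best constant model as described above: the mean of the test responses.
pred_const <- rep(mean(test$mpg), nrow(test))

rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
mae <- function(actual, predicted) mean(abs(actual - predicted))

rmse_rf <- rmse(test$mpg, pred_rf)
rmse_const <- rmse(test$mpg, pred_const)
mae_rf <- mae(test$mpg, pred_rf)

# How many times smaller the forest's error is than the constant model's.
improvement <- rmse_const / rmse_rf
print(c(RMSE_rf = rmse_rf, RMSE_const = rmse_const, MAE_rf = mae_rf,
        improvement = improvement))
```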

Evaluate forecasts in terms of p-value and Pearson correlation

I am using R to evaluate two different forecasting models. The basic idea of the evaluation is to compare the Pearson correlation and its corresponding p-value using cor.test(). The graph below shows the final result of the correlation coefficient and its p-value.
We suggest that the model with the lower correlation coefficient and a correspondingly lower p-value (less than 0.05) is better (or, alternatively, the one with a higher correlation coefficient but a fairly high p-value).
So, in this case, overall, we would say that model1 is better than model2.
But the question is: is there any other specific statistical method to quantify the comparison?
Thanks a lot !!!
I'm assuming you're working with time series data, since you called it a "forecast". I think what you're really looking for is backtesting of your forecast model. Ruey S. Tsay's "An Introduction to Analysis of Financial Data with R" comes with a backtest.R function that you might want to take a look at.
backtest(m1,rt,orig,h,xre=NULL,fixed=NULL,inc.mean=TRUE)
# m1: is a time-series model object
# orig: is the starting forecast origin
# rt: the time series
# xre: the independent variables
# h: forecast horizon
# fixed: parameter constraint
# inc.mean: flag for constant term of the model.
Backtesting allows you to see how well your models perform on past data, and Tsay's backtest.R reports RMSE and mean absolute error, which give you another perspective beyond correlation. A caution: depending on the size of your data and the complexity of your model, this can be a very slow test to run.
To compare models you'll normally look at RMSE, which is essentially the standard deviation of your model's errors. The RMSEs of two models evaluated on the same data are directly comparable, and smaller is better.
An even better alternative is to set up training, testing, and validation sets before you build your models. If you train two models on the same training / test data you can compare them against your validation set (which has never been seen by your models) to get a more accurate measurement of your model's performance measures.
One final alternative, if you have a "cost" associated with an inaccurate forecast, apply those costs to your predictions and add them up. If one model performs poorly on a more expensive segment of data, you may want to avoid using it.
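The RMSE/MAE comparison and the cost-weighting idea above can be sketched in base R. Everything here is simulated: actual, pred1, pred2, and the cost vector are hypothetical stand-ins for held-out observations, two models' forecasts, and a per-observation cost of being wrong.

```r
set.seed(123)

# Hypothetical held-out data and forecasts from two models.
actual <- cumsum(rnorm(50))
pred1 <- actual + rnorm(50, sd = 0.5)
pred2 <- actual + rnorm(50, sd = 1.5)

rmse <- function(a, p) sqrt(mean((a - p)^2))
mae <- function(a, p) mean(abs(a - p))

# Hypothetical costs: errors on the last 10 observations are 5x as expensive.
cost <- ifelse(seq_along(actual) > 40, 5, 1)
total_cost <- function(a, p) sum(cost * abs(a - p))

results <- data.frame(
  model = c("model1", "model2"),
  RMSE = c(rmse(actual, pred1), rmse(actual, pred2)),
  MAE = c(mae(actual, pred1), mae(actual, pred2)),
  cost = c(total_cost(actual, pred1), total_cost(actual, pred2)),
  r = c(cor(actual, pred1), cor(actual, pred2))
)
print(results)
```

Unlike the correlation coefficient, RMSE, MAE, and the summed cost all penalize a forecast that is systematically offset from the actual values.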
As a side note, your interpretation of a p-value as "lower is better" is not quite right.
A p-value addresses only one question: how likely are your data, assuming the null hypothesis is true? It does not measure support for the alternative hypothesis.

Forecast future values for a time series using support vector machine

I am using support vector regression in R to forecast future values of a univariate time series. After splitting the historical data into training and test sets, I fit a model to the training data with the svm function and then use the predict() command on the test data to predict values for the test set. We can then compute the prediction errors. I wonder what happens next. We have a model, and by checking it on the test data we see that the model performs well. How can I use this model to predict future values beyond the historical data? Generally speaking, we use a predict function in R and give it a forecast horizon (h=12) to predict 12 future values. From what I have seen, the predict() command for SVM does not have such an argument and needs a data set of inputs. How should I build a data set for predicting future values that are not in our historical data set?
Thanks
Just a stab in the dark... SVM is not for prediction but for classification, specifically supervised. I am guessing you are trying to predict stock values, no? How about classify your existing data, using some size of your choice say 100 values at a time, for noise (N), up (U), big up (UU), down (D), and big down (DD). In this way as your data comes in you slide your classification frame and get it to tell you if the upcoming trend is N, U, UU, D, DD.
What you can do is to build a data frame with columns representing the actual stock price and its n lagged values. And use it as a train set/test set (the actual value is the output and the previous values the explanatory variables). With this method you can do a 1-day (or whatever the granularity is) into the future forecast and then you can use your prediction to make another one and so on.
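The lagged data frame and the recursive forecast described above can be sketched as follows. It assumes the e1071 package is installed. The built-in lynx series (log-transformed), the choice of three lags, and the 12-step horizon are stand-ins for illustration, and the SVM is left at its default settings.

```r
library(e1071)

y <- as.numeric(log10(lynx))
n_lags <- 3
h <- 12

# embed() puts y_t in column 1 and y_{t-1}, ..., y_{t-n_lags} in the rest.
lagged <- as.data.frame(embed(y, n_lags + 1))
names(lagged) <- c("y", paste0("lag", seq_len(n_lags)))

# A numeric response makes svm() fit a regression (eps-regression) model.
model <- svm(y ~ ., data = lagged)

# Recursive forecast: each prediction becomes an input for the next step.
history <- y
preds <- numeric(h)
for (i in seq_len(h)) {
  newest_first <- rev(tail(history, n_lags))   # lag1 is the latest value
  newdata <- as.data.frame(t(newest_first))
  names(newdata) <- paste0("lag", seq_len(n_lags))
  preds[i] <- predict(model, newdata = newdata)
  history <- c(history, preds[i])
}
print(preds)
```

Errors accumulate as predictions are fed back in, so the later steps of the horizon are usually less reliable than the first.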

From Auto.arima to forecast in R

I don't quite understand the syntax of how forecast() applies external regressors in the library(forecast) in R.
My fit looks like this:
fit <- auto.arima(Y,xreg=factors)
where Y is a timeSeries object 100 x 1 and factors is a timeSeries object 100 x 5.
When I go to forecast, I apply...
forecast(fit, h=horizon)
And I get an error:
Error in forecast.Arima(fit, h = horizon) : No regressors provided
Does it want me to add back the xregressors from the fit? I thought these were included in the fit object as fit$xreg. Does that mean it's asking for future values of the xregressors, or that I should repeat the same values I used in the fit set? The documentation doesn't cover the meaning of xreg in the forecast step.
I believe all this means I should use
forecast(fit, h=horizon,xreg=factors)
or
forecast(fit, h=horizon,xreg=fit$xreg)
Which gives the same results. But I'm not sure whether the forecast step is interpreting the factors as future values, or appropriately as previous ones. So,
Is this doing a forecast out of purely past values, as I expect?
Why do I have to specify the xreg values twice? It doesn't run if I exclude them, so it doesn't behave like an option.
Correct me if I am wrong, but I think you may not completely understand how the ARIMA model with regressors works.
When you forecast with a simple ARIMA model (without regressors), it simply uses past values of your time series to predict future values. In such a model, you could simply specify your horizon, and it would give you a forecast until that horizon.
When you use regressors to build an ARIMA model, you need to include future values of the regressors to forecast. For example, if you used temperature as a regressor, and you were predicting disease incidence, then you would need future values of temperature to predict disease incidence.
In fact, the documentation does talk about xreg specifically. Look up ?forecast.Arima and look at both the arguments h and xreg. You will see that if xreg is used, then h is ignored. Why? Because if your model uses xreg, then it needs future values of the regressors for forecasting, and the number of rows supplied determines the horizon.
So, in your code, h was simply ignored when you included xreg. Since you passed the same values that you used to fit the model, it gave you predictions for that same set of regressors as if they were future values.
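A minimal sketch of forecasting with future regressor values, assuming the forecast package is installed. The data are simulated, and the names x1, x2, and future_x are hypothetical. In practice the future regressors must be known in advance or forecast separately.

```r
library(forecast)

set.seed(1)
n <- 100
horizon <- 10

# Simulated regressors covering both the fitting period and the future.
factors <- matrix(rnorm((n + horizon) * 2), ncol = 2,
                  dimnames = list(NULL, c("x1", "x2")))
noise <- arima.sim(list(ar = 0.5), n = n)
Y <- ts(5 + 2 * factors[1:n, "x1"] - factors[1:n, "x2"] + as.numeric(noise))

# Fit with the regressors observed during the fitting period.
fit <- auto.arima(Y, xreg = factors[1:n, ])

# Forecast with the FUTURE regressor values; h is taken from nrow(future_x).
future_x <- factors[(n + 1):(n + horizon), ]
fc <- forecast(fit, xreg = future_x)
print(fc$mean)
```

Passing the original 100 rows again would instead return 100 "forecasts", treating the historical regressor values as if they were future ones.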
Related:
https://stats.stackexchange.com/questions/110589/arima-with-xreg-rebuilding-the-fitted-values-by-hand
I read that arima in R has some known issues.
See Issues 3 and 4:
https://www.stat.pitt.edu/stoffer/tsa4/Rissues.htm
Using xreg was suggested there as a way to derive the proper intercept.
I'm using Real Statistics for Excel to figure out what the actual constant is. I had a professor tell me you need to have a constant.
The two fits below produce the same forecasts. So it appears you can use xreg to get some descriptive information, but you would have to follow the Stack Exchange link above to derive the values manually.
f <- auto.arima(lacondos[, 1])
f$coef
# Refit with the same (p, d, q) order, no constant, and a linear time trend as xreg
g <- Arima(lacondos[, 1], order = arimaorder(f)[1:3], include.constant = FALSE, xreg = seq_along(lacondos[, 1]))
g$coef

Resources