Forecasting using support vector regression in R

I want to forecast future energy consumption using support vector regression in R. I have this code, but I'm not sure whether it is correct or not.
#gathering the data
data<-read.csv("C:\\2003_smd_hourly.csv",header=TRUE) #these are the values which are used to train the given model#
data
#data1<-read.csv("C:\\pr.csv",header=TRUE) #this file/data is used for checking the accuracy of the prediction#
#data1
#y1<-data1[,15]
#x0<-data1[,2]
y<-data[,15] #sysload
x1<-data[,2] #houroftheday
x2<-data[,13] #drybulb temp(actualtemp)
x3<-data[,14] #dewpnttemp
#train<-sample(744,447)
#train
library(e1071)
model<-svm(y~x1+x2+x3,data=data[1:48,],cost=2.52*10^11,epsilon=0.0150,gamma=1)
model
#pr<-data[-train,]
#pr
predict1<-predict(model,newdata=data[49:72,])
predict1
par(mfrow=c(2,2))
plot(x1,y,col="red",pch=4)
#par(new=TRUE)
plot(x1,predict1,col="blue",pch=5) #plotting the values that have been predicted
#par(new=TRUE)
plot(x0,y1,col="black",pch=1)
error=y1-predict1
error
mae <- function(error)
{
mean(abs(error))
}
mae(error)
error <- y1 - predict1
error
rmse <- function(error)
{
sqrt(mean(error^2))
}
svrPredictionRMSE <- rmse(error)
svrPredictionRMSE
max(error)
min(error)
mape <- function(y1,predict1)
{
mean(abs((y1 - predict1)/y1))*100
}
mape(y1,predict1)
E.g., the data can be found here: http://pastebin.com/MUfWFCPM

Use the newdata parameter for prediction (your newdata for the test set should have the same set of features as the training data), e.g., with the mtcars dataset:
library(e1071)
model<-svm(mpg~wt+disp+qsec,data=mtcars[1:24,],cost=512,epsilon=0.01)
model
predict1<-predict(model,newdata=mtcars[25:32,])
predict1 # prediction for the new 8 data points
Pontiac Firebird Fiat X1-9 Porsche 914-2 Lotus Europa Ford Pantera L Ferrari Dino Maserati Bora Volvo 142E
28.514002 31.184527 23.022863 22.603601 6.228431 30.482475 6.801507 22.939945

If you want to predict what happens in the next two days, you have to train a model to predict two days ahead. Let's pick a simple example, then I'll move to an SVR. Suppose we use a linear AR-direct forecasting model and, through some method we determined that two lags are enough. So we have this model:
y_{t+h} = alpha + phi_1 y_{t} + phi_2 y_{t-1} + e_{t+h}
The literature in economics calls this an AR-direct forecast because it directly outputs y_{t+h}, as opposed to indirectly producing y_{t+h} by providing a recursive relationship across forecasts. Say that 'y' is the temperature in degrees Celsius, and you want to forecast the temperature in two days using temperature data up until -- and including -- today. Suppose we use daily temperatures of the last month.
We know that ordinary least squares is a consistent estimator of alpha, phi_1 and phi_2, so we can form a matrix, X, containing a column of ones, one column of temperatures lagged h times and a column of temperatures lagged h + 1 times. Then, compute a linear projection of our temperature vector, y, on X like so: estimated [alpha, phi_1, phi_2] = (X'X)^(-1) X'y.
Now, we have estimated parameters for the whole sample. If I want to know y_{t+h}, I need a constant (we arbitrarily picked '1' to estimate the model, so we'll use '1'), the temperature today and the temperature yesterday. Suppose h=2 here:
predicted temperature in two days = alpha + phi_1 x temperature today + phi_2 x temperature yesterday
You see, the difference between training the model and applying the model lies in a simple shift: y_{t} = alpha + phi_1 y_{t-h} + phi_2 y_{t-h-1} + e_{t} is what we fitted in the training sample. The last in-sample prediction we made with this model is the temperature today, using the temperatures 3 and 4 days ago, respectively. We also produced least squares forecasts for all other observed temperatures, except the first three observations -- to forecast with this model, we need two observations plus a two-day gap.
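As a rough sketch of that training-and-forecasting step in R (the series temp, the horizon h and the variable names below are illustrative, not data or code from the question):
# Illustrative daily temperature series for the last month
set.seed(1)
temp <- as.numeric(20 + arima.sim(model = list(ar = 0.7), n = 30))
h <- 2                            # forecast horizon: two days ahead
n <- length(temp)
# AR-direct training set: y_t regressed on y_{t-h} and y_{t-h-1}
y_t    <- temp[(h + 2):n]         # targets
lag_h  <- temp[2:(n - h)]         # value h days before each target
lag_h1 <- temp[1:(n - h - 1)]     # value h + 1 days before each target
fit <- lm(y_t ~ lag_h + lag_h1)   # OLS estimates of alpha, phi_1 and phi_2
# Forecast two days ahead from today's and yesterday's temperatures
as.numeric(coef(fit) %*% c(1, temp[n], temp[n - 1]))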
Now, with SVMs and SVRs, the point is very similar. Your predicted output is a real-valued label in the case of a regression problem. Suppose we also want to forecast the temperature, also two days in advance, using the same data and using the same regressors. Then, the input space of our SVR is defined by two vectors -- the same two lagged vectors of temperatures we used.
When we train the SVR on the whole dataset, we produce forecasts for each observation in the dataset -- again, except for the first three observations.
For e-insensitive SVR, let K() be the kernel we use, x_i is a support vector (it's one point in the y_{t}, y_{t-1} space) and n_sv is the number of support vectors:
y_{t+h} = sum_{i=1}^{n_sv} (alpha_i - alpha_i*) K(x_i, x)
Forecasting y_{t+h} is like asking what the real-valued label of x is: you input the last p (in this case, p=2) observations into the trained decision rule of the SVR and it gives you a label. If it were a support vector machine for classification, the training would result in a separating hyperplane and you would decide on the label of any point with coordinates in the input space by asking 'on what side of the plane is it?'... It's the exact same thing here, except you are looking for a real value.
So, programming-wise, you just need to provide a vector with the right dimension to 'predict': predict(best_model_you_picked, newdata=appropriate_input_space_vector)
Note that if you trained your model on the 'whole sample', but some of the variables you used are lagged variables, the model is not fitted on the last few observations of the non-lagged variables... just like the AR model estimated by OLS does not use the last h observations to forecast in-sample.
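To make this concrete, here is a minimal sketch of the same two-day-ahead setup with an SVR, using e1071::svm as in the question (the lagged vectors are the illustrative ones built in the lm sketch above, not variables from the question's data):
library(e1071)
# Reuse the illustrative lagged vectors from the lm sketch above
train_df <- data.frame(y_t = y_t, lag_h = lag_h, lag_h1 = lag_h1)
svr_fit <- svm(y_t ~ lag_h + lag_h1, data = train_df,
               type = "eps-regression", kernel = "radial")
# To forecast y_{t+h}, feed the last p = 2 observations into the trained rule
new_x <- data.frame(lag_h = temp[n], lag_h1 = temp[n - 1])
predict(svr_fit, newdata = new_x)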

Related

GAM smooths interaction differences - calculate p value using mgcv and gratia 0.6

I am using the useful gratia package by Gavin Simpson to extract the difference in two smooths for two different levels of a factor variable. The smooths are generated by the wonderful mgcv package. For example
library(mgcv)
library(gratia)
m1 <- gam(outcome ~ s(dep_var, by = fact_var) + fact_var, data = my.data)
diff1 <- difference_smooths(m1, smooth = "s(dep_var)")
draw(diff1)
This gives me a graph of the difference between the two smooths for each level of the "by" variable in the gam() call. The graph has a shaded 95% credible interval (CI) for the difference.
Statistical significance, or areas of statistical significance at the 0.05 level, is assessed by whether or where the y = 0 line crosses the CI, where the y axis represents the difference between the smooths.
Here is an example from Gavin's site where the "by" factor variable had 3 levels.
The differences are clearly statistically significant (at 0.05) over nearly all of the graphs.
Here is another example I have generated using a "by" variable with 2 levels.
The difference in my example is clearly not statistically significant anywhere.
In the mgcv package, an approximate p value is output for each smooth fit, testing the null hypothesis that its coefficients are all = 0, based on a chi-square test.
My question is, can anyone suggest a way of calculating a p value that similarly assesses the difference between the two smooths instead of solely relying on graphical evidence?
The output from difference_smooths() is a data frame with differences between the smooth functions at 100 points in the range of the smoothed variable, the standard error for the difference and the upper and lower limits of the CI.
The release notes for gratia 0.4 explain the difference_smooths() function, but gratia is now at version 0.6.
Thanks in advance for taking the time to consider this.
Don
One way of getting a p value for the interaction between the by factor variables is to manipulate the difference_smooths() function by activating the ci_level option. Default is 0.95. The ci_level can be manipulated to find a level where the y = 0 is no longer within the CI bands. If for example this occurred when ci_level = my_level, the p value for testing the hypothesis that the difference is zero everywhere would be 1 - my_level.
This is not totally satisfactory. For example, it would take a little manual experimentation and it may be difficult to discern accurately when zero drops out of the CI. However, a function could be written to search the data frame that is output by difference_smooths() as the ci_level is varied (a rough sketch of such a function is given below). This is not totally satisfactory either, because the detection of a non-zero CI would depend on the 100 points chosen by difference_smooths() to assess the difference between the two curves. Then again, the standard errors are approximate for a GAM using mgcv, so that shouldn't be too much of a problem.
Here is a graph where the zero first drops out of the CI.
Zero dropped out at ci_level = 0.88 and was still in the interval at ci_level = 0.89. So an approximate p value would be 1 - 0.88 = 0.12.
Can anyone think of a better way?
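For reference, here is a rough sketch of the ci_level search described above, assuming the fitted model m1 from the question. Note that the exact column names of the difference_smooths() output (e.g. lower/upper versus .lower_ci/.upper_ci) depend on the gratia version, so adjust accordingly:
library(gratia)
# Search downward from high confidence levels until y = 0 first falls outside
# the CI band at some evaluation point; 1 minus that level is the approximate
# p value described above.
approx_p_from_ci <- function(model, smooth = "s(dep_var)",
                             levels = seq(0.99, 0.50, by = -0.01)) {
  for (lev in levels) {
    d <- difference_smooths(model, smooth = smooth, ci_level = lev)
    if (any(d$lower > 0 | d$upper < 0)) {
      return(1 - lev)   # approximate p value
    }
  }
  NA_real_              # zero never left the band over the levels searched
}
approx_p_from_ci(m1)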
Reply to Gavin Simpson's comments Feb 19
Thanks very much Gavin for taking the time to make your comments.
I am not sure if using the criterion >= 0 (for negative diffs) is a good way to go. Because of the draws from the posterior, there are likely to be many diffs that meet this criterion. I am interpreting your criterion as: sample the posterior distribution, count how many differences meet the criterion, and take that percentage as the p value. Correct me if I have misunderstood. Using this approach, I consistently got p values of around 0.45 - 0.5 for different gam models, even when it was clear the difference in the smooths should be statistically significant, at least at p = 0.05, because the confidence band around the smooth did not contain zero at a number of points.
Instead, I was thinking perhaps it would be better to compare the means of the posterior distribution of each of the diffs. For example
# get coefficients for the by smooths
coeff.level1 <- coef(gam.model1)[31:38]
coeff.level0 <- coef(gam.model1)[23:30]
# these indices are specific to my multi-variable gam.model1
# in my case 8 coefficients per smooth
# get posterior coefficients variances for the by smooths' coefficients
vp_level1 <- gam.model1$Vp[31:38, 31:38]
vp_level0 <- gam.model1$Vp[23:30, 23:30]
#run the simulation to get the distribution of each
#difference coefficient using the joint variance
library(MASS)
no.draws = 1000
sim <- mvrnorm(n = no.draws, (coeff.level1 - coeff.level0),
               (vp_level1 + vp_level0))
# sim is a no.draws X no. of coefficients (8 in my case) matrix
# put the results into a data.frame.
y.group <- data.frame(y = as.vector(sim),
                      group = rep(1:8, each = no.draws))
# y has the differences sampled from their posterior distributions.
# group is just a grouping name for the 8 sets of differences,
# (one set for each difference in coefficients)
# compare means with a linear regression
lm.test <- lm(y ~ as.factor(group), data = y.group)
summary(lm.test)
# The p value for the F statistic tells you how
# compatible the data are with the null hypothesis that
# all the group means are equal to each other.
# Same F statistic and p value from
anova(lm.test)
One could argue that if the coefficients are not all equal to each other then they can't all be equal to zero, but that isn't what we want here.
The basis of the smooth tests of fit given by summary(gam.model1) in mgcv is a joint test of all coefficients == 0. This would come from a type of likelihood ratio test in which model fit with and without a term is compared.
I would appreciate some ideas how to do this with the difference between two smooths.
Now that I got this far, I had a rethink of your original suggestion of using the criterion >= 0 (for negative diffs). I reinterpreted this as meaning: for each simulated coefficient difference distribution (in my case 8), count how often this occurs, and make a table where each row (in my case, 8 rows) is one of these distributions, with two columns holding this count and (number of simulation draws minus count); then run a chi-square test on this table (a sketch of this is given below). When I did this, I got a very low p value when I believe I shouldn't have, as 0 was well within the smooth difference CI across almost all the levels of the exposure. Maybe I am still misunderstanding your suggestion.
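For reference, a sketch of that counting approach, reusing the sim matrix and no.draws from the code above (this is my reading of the suggestion, not a quote of it):
# For each of the 8 simulated difference distributions, count the draws >= 0
count_ge0 <- colSums(sim >= 0)
count_tab <- cbind(ge0 = count_ge0, lt0 = no.draws - count_ge0)
# Chi-square test on the 8 x 2 table of counts
chisq.test(count_tab)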
Follow up thought Feb 24
In a follow up thought, we could create a variable that represents the interaction between the by factor and continuous variable
library(dplyr)
my.dat <- my.dat %>% mutate(interact.var =
ifelse(factor.2levels == "yes", 1, 0)*cont.var)
Here I am assuming that factor.2levels has the levels ("no", "yes"), and "no" is the reference level. The ifelse function creates a dummy variable which is multiplied by the continuous variable to generate the interactive variable.
Then we place this interactive variable in the GAM and get the usual statistical test for fit, that is, testing all the coefficients == 0.
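For concreteness, a minimal sketch of what that model call might look like (the smooth terms here are my assumption about the original model, and interact.var is the variable created above):
library(mgcv)
m2 <- gam(outcome ~ s(cont.var) + s(interact.var) + factor.2levels,
          data = my.dat)
# The approximate test of the s(interact.var) coefficients == 0 in summary(m2)
# is then read as the test for a difference between the two smooths.
summary(m2)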
@GavinSimpson actually posted a method for getting the difference between two smooths and assessing its statistical significance here in 2017. Thanks to Matteo Fasiolo for pointing me in that direction.
In that approach, the by variable is converted to an ordered categorical variable which causes mgcv::gam to produce difference smooths in comparison to the reference level. Statistical significance for the difference smooths is then tested in the usual way with the summary command for the gam model.
However, and correct me if I have misunderstood, the ordered factor approach causes the smooth for the main effect to now be the smooth for the reference level of the ordered factor.
The approach I suggested, see the main post under the heading, Follow up thought Feb 24, where the interaction variable is created, gives an almost identical result for the p value for the difference smooth but does not change the smooth for the main effect. It also does not change the intercept and the linear term for the by categorical variable which also both changed with the ordered variable approach.

Is it possible to fit Non-Stationary GEV to a series of data in R fixing one of the distribution parameters?

Good afternoon,
I have a series of annual maxima data (say "AMdata") I'd like to model through a non-stationary GEV distribution. In particular, I want the location to vary linearly in time, i.e.:
mu = mu0 + mu1*t.
To this end, I am using the ismev package in R, computing the parameters as follows:
require(ismev)
ydat = cbind(1:length(AMdata)) ### Co-variates - years from 1 to number of annual maxima in the data
GEV_fit_1_loc = gev.fit(xdat=AMdata,ydat=ydat,mul=1)
In such a way, I obtain 4 parameters, namely mu0,mu1,shape and scale.
My question is: can I apply the gev.fit function while fixing the value of mu1, not as a starting value for the successive iterations, but as a given parameter (thus estimating the three remaining parameters)?
Any tip would be really appreciated!
Francesco

predict.coxph() and survC1::Est.Cval -- type for predict() output

Given a coxph() model, I want to use predict() to predict hazards and then use survC1::Est.Cval( . . . nofit=TRUE) to get a c-value for the model.
The Est.Cval() documentation is rather terse, but says that "nofit=TRUE: If TRUE, the 3rd column of mydata is used as the risk score directly in calculation of C."
Say, for simplicity, that I want to predict on the same data I built the model on. For
coxModel a Cox regression model from coxph();
time a vector of times (positive reals), the same times that coxModel was built on; and
event a 0/1 vector, the same length, of event/censor indicators, the same events that coxModel was built on --
does this indicate that I want
predictions <- predict(coxModel, type="risk")
dd <- cbind(time, event, predictions)
Est.Cval(mydata=dd, tau=tau, nofit=TRUE)
or should that first line be
predictions <- predict(coxModel, type="lp")
?
Thanks for any help,
The answer is that it doesn't matter.
Basically, the concordance value is testing, for all comparable pairs of times (events and censors), how probable it is that the later time has the lower risk (for a really good model, almost always). But since e^u is a monotonic function of real u, and the c-value is only testing comparisons, it doesn't matter whether you provide the hazard ratio, e^(sum{\beta_i x_i}), or the linear predictor, sum{\beta_i x_i}.
Since #42 motivated me to come up with a minimal working example, we can test this. We'll compare the values that Est.Cval() provides using one input versus using the other; and we can compare both to the value we get from coxph().
(That last value won't match exactly, because Est.Cval() uses the method of Uno et al. 2011 (Uno, H., Cai, T., Pencina, M. J., D’Agostino, R. B. & Wei, L. J. On the C-statistics for evaluating overall adequacy of risk prediction procedures with censored survival data. Statist. Med. 30, 1105–1117 (2011), https://onlinelibrary.wiley.com/doi/full/10.1002/sim.4154) but it can serve as a sanity check, since the values should be close.)
The following is based on the example worked through in Survival Analysis with R, 2017-09-25, by Joseph Rickert, https://rviews.rstudio.com/2017/09/25/survival-analysis-with-r/.
library("survival")
library("survC1")
# Load dataset included with survival package
data("veteran")
# The variable `time` records survival time; `status` indicates whether the
# patient’s death was observed (status=1) or that survival time was censored
# (status = 0).
# The model they build in the example:
coxModel <- coxph(Surv(time, status) ~ trt + celltype + karno + diagtime +
age + prior, data=veteran)
# The results
summary(coxModel)
Note the c-score it gives us:
Concordance= 0.736 (se = 0.021 )
Now, we calculate the c-score given by Est.Cval() on the two types of values:
# The value from Est.Cval(), using a risk input
cvalByRisk <- Est.Cval(mydata=cbind(time=veteran$time, event=veteran$status,
predictions=predict(object=coxModel, newdata=veteran, type="risk")),
tau=2000, nofit=TRUE)
# The value from Est.Cval(), using a linear predictor input
cvalByLp <- Est.Cval(mydata=cbind(time=veteran$time, event=veteran$status,
predictions=predict(object=coxModel, newdata=veteran, type="lp")),
tau=2000, nofit=TRUE)
And we compare the results:
cvalByRisk$Dhat
[1] 0.7282348
cvalByLp$Dhat
[1] 0.7282348
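As a quick extra check on the monotonicity argument (my addition, not part of Est.Cval()), the two sets of predictions have identical rank orderings, so any purely rank-based measure must agree:
# exp() preserves ordering, so the rank correlation should be exactly 1
cor(predict(coxModel, newdata = veteran, type = "lp"),
    predict(coxModel, newdata = veteran, type = "risk"),
    method = "spearman")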

Find the nearest neighbor using caret

I'm fitting a k-nearest neighbor model using R's caret package.
library(caret)
set.seed(0)
y = rnorm(20, 100, 15)
predictors = matrix(rnorm(80, 10, 5), ncol=4)
data = data.frame(cbind(y, predictors))
colnames(data)=c('Price', 'Distance', 'Cost', 'Tax', 'Transport')
I left one observation as the test data and fit the model using the training data.
id = sample(nrow(data)-1)
train = data[id, ]
test = data[-id,]
knn.model = train(Price~., method='knn', train)
predict(knn.model, test)
When I display knn.model, it tells me it uses k=9. I would love to know which 9 observations are actually the "nearest" to the test observation. Besides manually calculating the distances, is there an easier way to display the nearest neighbors?
Thanks!
When you use knn you are grouping points that are near each other based on the independent variables. Normally this is done with train(Price~., method='knn', train), so that the model chooses the best prediction based on some criterion (taking the dependent variable into account as well). Since I have not checked whether the R object stores the predicted price for each of the training observations, I just used the trained model to predict the expected price for them (i.e., where each expected price is located in that space).
In the end, the dependent variable is just a representation of all the other variables in a common space, where the associated price is assumed to be similar since you cluster based on proximity.
As a summary of steps, you need to do the following:
Get the prediction for each of the training data points; this is done by predicting over them.
Calculate the distance between the trained predictions and the prediction for your observation of interest (in absolute value, since you do not care about the sign, only the distance).
Take the indices of the N smallest ones (e.g. N = 9); these give you the observations with the smallest distances.
TestPred<-predict(knn.model, newdata = test)
TrainPred<-predict(knn.model, train)
Nearest9neighbors<-order(abs(TestPred-TrainPred))[1:9]
train[Nearest9neighbors,]
Price Distance Cost Tax Transport
15 95.51177 13.633754 9.725613 13.320678 12.981295
7 86.07149 15.428847 2.181090 2.874508 14.984934
19 106.53525 16.191521 -1.119501 5.439658 11.145098
2 95.10650 11.886978 12.803730 9.944773 16.270416
4 119.08644 14.020948 5.839784 9.420873 8.902422
9 99.91349 3.577003 14.160236 11.242063 16.280094
18 86.62118 7.852434 9.136882 9.411232 17.279942
11 111.45390 8.821467 11.330687 10.095782 16.496562
17 103.78335 14.960802 13.091216 10.718857 8.589131

Auto.arima is not showing any order

I am trying to fit arima model using auto.arima function in R. The result is showing order (0,0,0) even though the data is non-stationary.
auto.arima(x,approximation=TRUE)
ARIMA(0,0,0) with non-zero mean
Can someone advise why such results occur? By the way, I am running this function on only 10 data points.
10 data points is a very low number of observations for estimating an ARIMA model. I doubt that you can make any sensible estimation based on this. Moreover, the estimated model may depend strongly on the part of the time series you looked at, and adding only a few observations can change the characteristics of the estimated model significantly. For example:
When I take a time series with only 10 observations, I also get an ARIMA(0,0,0) model:
library(forecast)
vec1 <- ts(c(10.26063, 10.60462, 10.37365, 11.03608, 11.19136, 11.13591, 10.84063, 10.66458, 11.06324, 10.75535), frequency = 12)
fit1 <- auto.arima(vec1)
summary(fit1)
However, if I use about 30 observations, an ARIMA(1,0,0) model is estimated:
vec2 <- ts(c(10.260626, 10.604616, 10.373652, 11.036079, 11.191359, 11.135914, 10.840628, 10.664575, 11.063239, 10.755350,
10.158032, 10.653669, 10.659231, 10.483478, 10.739133, 10.400146, 10.205993, 10.827950, 11.018257, 11.633930,
11.287756, 11.202727, 11.244572, 11.452180, 11.199706, 10.970823, 10.386131, 10.184201, 10.209338, 9.544736), frequency = 12)
fit1 <- auto.arima(vec2)
summary(fit1)
If I use the whole time series (413 observations), the auto.arima function estimates a "ARIMA(2,1,4)(0,0,1)[12] with drift".
Thus, I would think that 10 observations are indeed not enough information for fitting a model.
