I have used forecast() on the first 1526 data points in my data series VIX to estimate the final 300 data points. I want to measure the goodness of fit as the variance of the difference between the actual historical data and the forecasted result. Is there an easy way of doing this in R?
The code is currently:
r_vix_3b=diff(log(VIX[,"Close"]))
num_train=1526
h=300
plot_start=1300
plot_labels=126 # interval between x-axis major tick marks
data_fcst_pts=num_train:(num_train+h)
fit_1step=auto.arima(r_vix_3b[1:num_train])
forecast_1step = forecast(fit_1step, h=h)
plot(forecast_1step, xaxt="n", xlim=c(plot_start, num_train+h), ylim=c(-0.3, 0.3)) #ylim=range(r_vix)
points(data_fcst_pts, r_vix_3b[data_fcst_pts],col="blue", type="l", pch=16)
axis(1, at=seq(0,length(r_vix_3b)+h-1,plot_labels), labels=VIX$Date[seq(2, length(r_vix_3b)+h,plot_labels)] )
diff_1_step = r_vix_3b[(num_train + 1):(num_train + h)] - forecast_1step$mean # forecast errors: hold-out actuals minus the point forecasts
Please check the ?accuracy function from the forecast package.
I guess in your case it would be something like:
acc <- accuracy(forecast_1step, r_vix_3b[(num_train + 1):(num_train + h)]) # second argument is the actual hold-out data
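If it is specifically the variance of the forecast errors you are after, a minimal sketch (reusing the objects defined above and assuming r_vix_3b is an ordinary numeric vector) would be:
actuals <- r_vix_3b[(num_train + 1):(num_train + h)]  # the 300 hold-out points
errors  <- actuals - as.numeric(forecast_1step$mean)  # actual minus point forecast
var(errors)                                           # variance of the differences
accuracy(forecast_1step, actuals)                     # RMSE, MAE, etc. for the same period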
I have been able to use an lm poly-model to model and predict some time series data. However, when I change to using a holt model, I get an error in the R console.
Here is what I am trying to do:
library(ggplot2)
library(matrixStats)
library(forecast)
df_input <- read.csv("postprocessed.csv")
x <- df_input$time
y <- df_input$value
df <- data.frame(x, y)
#poly4model <- lm(y~poly(x, degree=4), data=df)
holtmodel <- holt(df$y) # might need df$value here ?
v <- seq(1, 44)
v2 <- seq(44, 55)
pdf("postprocessed_holts.pdf")
plot(df, xlim=c(0, 55))
##lines(v, predict(poly4model, data.frame(x=v)), col="blue", pch=20, lwd=3)
##lines(v2, predict(poly4model, data.frame(x=v2)), col="red", pch=20, lwd=3)
lines(v, predict(holtmodel, data.frame(x=v)), col="blue", pch=20, lwd=3)
lines(v2, predict(holtmodel, data.frame(x=v2)), col="red", pch=20, lwd=3)
dev.off()
This is the error which shows up
Error in xy.coords(x, y) : 'x' and 'y' lengths differ
I am a bit confused as to what x and y refer to here. The objects x and y which are in the Environment (R Studio Environment) both have length 44.
Both of the lines that start with lines() appear to trigger the error.
Here's a copy of the input data...
"","time","value"
"1",1,2.61066016308988
"2",2,3.41246054742996
"3",3,3.8608767964033
"4",4,4.28686048552237
"5",5,4.4923132964825
"6",6,4.50557049744317
"7",7,4.50944447661246
"8",8,4.51097373134893
"9",9,4.48788748823809
"10",10,4.34603985656981
"11",11,4.28677073671406
"12",12,4.20065901625172
"13",13,4.02514194962519
"14",14,3.91360194972916
"15",15,3.85865748409081
"16",16,3.81318053258601
"17",17,3.70380706527433
"18",18,3.61552922363713
"19",19,3.61405310598722
"20",20,3.64591327503384
"21",21,3.70234435835577
"22",22,3.73503970503372
"23",23,3.81003078640584
"24",24,3.88201196162666
"25",25,3.89872518158949
"26",26,3.97432743542362
"27",27,4.2523675144599
"28",28,4.34654855854847
"29",29,4.49276038902684
"30",30,4.67830892029687
"31",31,4.91896819673664
"32",32,5.04350767355202
"33",33,5.09073406942046
"34",34,5.18510849382162
"35",35,5.18353176529036
"36",36,5.2210776270173
"37",37,5.22643491929207
"38",38,5.11137006553725
"39",39,5.01052467981257
"40",40,5.0361056705898
"41",41,5.18149486951409
"42",42,5.36334869132276
"43",43,5.43053620818444
"44",44,5.60001072279525
Edit
I tried an alternative method as well. I noticed that the object holtmodel contains two components which might be useful: fitted and mean. As far as I can tell, these are the fitted time series and the point forecasts for the next 10 steps.
I tried plotting these objects with
lines(holtmodel$fitted, col="orange", lwd=2)
lines(holtmodel$mean, col="blue", lwd=2)
However, the second of these fails to plot anything, despite no error being produced in the console. The first line plots an orange time series as expected.
Your issue
The objects you are trying to add as lines don't have the same length:
length(predict(holtmodel, data.frame(x=v)))
# 10
length(v)
# 44
length(predict(holtmodel, data.frame(x=v2)))
# 10
length(v2)
# 12
This means you can't add them as new lines.
Also, you can't really predict the way you would with a linear regression by passing, say, older data points to the model. Exponential smoothing methods use the historical data points to build future data points; you can't ask them to display predictions for past events.
Also, you are not specifying h, the parameter for the number of periods you are trying to predict; I'll let you refer to the documentation for the holt function. The output is already a prediction of future events, so using predict() on it doesn't change the result:
holt_predict <- predict(holtmodel)
length(setdiff(holt_predict, holtmodel))
# 0 which means they are the same objects
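For example, a minimal sketch (the name holt_11 and the choice h = 11, which extends the forecast out to x = 55, are just illustrative):
library(forecast)
# ask Holt's method explicitly for 11 periods ahead (covers x = 45 ... 55)
holt_11 <- holt(df$y, h = 11)
holt_11$mean  # the 11 point forecasts, a ts starting at time 45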
Solution
What you can do instead is plot holtmodel$fitted and holtmodel$mean directly with lines(), since they are time series objects, and expand the plotting area with xlim and ylim so the predicted values are visible:
plot(df, xlim=c(0, 60), ylim=c(2.5, 10))
lines(holtmodel$fitted, col="blue", pch=20, lwd=3)
lines(holtmodel$mean, col="red", pch=20, lwd=3)
And the result:
Easy alternative
To save you the hassle of going through this kind of solution, there are easier methods. Have you tried the autoplot function included in the forecast package? It is based on ggplot2, is very straightforward, and will probably yield results close to what you want directly (unless you don't want the confidence intervals):
autoplot(holtmodel)
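If you also want the in-sample fitted values on the same chart, something like the following should work; treat it as a sketch (autolayer() is, as far as I know, available in recent versions of forecast):
library(forecast)
library(ggplot2)
# forecast fan chart plus the fitted values as an extra layer
autoplot(holtmodel) +
  autolayer(fitted(holtmodel), series = "Fitted")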
I'm trying to perform a simple linear regression on a dataset that uses a date range as the independent variable. I want to plot the data along with the regression line and equation. The dataset is not a time series. When I create the plot, it looks like this.
The plot looks fine, however the slope is clearly not zero and the intercept should not be 0.945 as the equation states. This article gives a good explanation of how the programming language shifts the origin of the date range to a pre-specified zero starting point, which I believe is 1/1/1970 in R. I think this is what is happening in my case. While the article explains the problem well, its solution doesn't detail how to fix the problem in R. So my question is: how do I shift the origin of my independent variable from 1/1/1970 to the first date in my dataset when performing the linear regression in R?
I've tried converting the date range to both numeric and factor, neither of which was the solution. I suspect that I'm not using the right search terms when searching for solutions on the web. Most of what comes up describes how to force the regression line through the origin, which is not what I want to do here. Thanks for the help.
EDIT: My code to produce the plot is below. 'agcy' is a sample of the data. The actual dataset has more than a thousand points.
agcy <- data.frame(as.Date(c('2010-01-01', '2011-02-01', '2012-11-18', '2016-08-30', '2017-04-21')), c(-0.3, -0.1, -0.1, -0.2, -0.4))
colnames(agcy) <- c('Date', 'Diff')
png('C:\\Desktop\\file.png', width = 720, height = 480)
samps <- 0.05
MA <- movAvg(agcy, agcy$Diff, samps) #movAvg() is a user-defined function that computes the moving average of the data series, "samps" is the proportion of data points to use in the moving average calculation
model <- lm(MA ~ agcy$Date)
intercept <- round(coef(model)[1],3)
slope <- round(coef(model)[2],3)
r2 <- round(summary(model)$r.squared,3)
eqn <- sapply(c(bquote(italic(y) == .(slope)*italic(x) + .(intercept)),bquote(r^2 == .(r2))), as.expression)
plot(agcy$Date, MA, xlab='Date',ylab='Difference',col = 'green', type = 'p', pch=18, ylim = c(-0.5,0.5))
abline(model)
text(par('usr')[2],c(0.5,0.45),eqn,pos = 2)
abline(h=0, col = 'black')
dev.off()
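For what it's worth, here is a minimal sketch of the "shift the origin" idea (the column name Days is made up for the example, and it assumes movAvg() returns one value per row of agcy, as the plot() call above implies). The slope per day does not change; only the intercept now refers to the first date in the data:
# days elapsed since the first observation (0 at the earliest date)
agcy$Days <- as.numeric(agcy$Date - min(agcy$Date))
model2 <- lm(MA ~ agcy$Days)
coef(model2)  # intercept is now the fitted value at the first date;
              # the per-day slope is unchanged (and may still round to 0.000)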
I have a single series of values (i.e. one column of data), and I would like to create a plot with the range of data values on the x-axis and the frequency that each value appears in the data set on the y-axis.
What I would like is very close to a Kernel Density Plot:
# Kernel Density Plot
d <- density(mtcars$mpg) # returns the density data
plot(d) # plots the results
and to the existing question Frequency distribution in R on Stack Overflow.
However, I would like frequency (as opposed to density) on the y-axis.
Specifically, I'm working with network degree distributions, and would like a double-log scale with open, circular points, i.e. this image.
I've done research into related resources and questions, but haven't found what I wanted:
Cookbook for R's Plotting distributions is close to what I want, but not precisely. I'd like to replace the y-axis in its density curve example with "count" as it is defined in the histogram examples.
The ecdf() function in R (i.e. this question) may be what I want, but I'd like the observed frequency, and not a normalized value between 0 and 1, on the y-axis.
This question is related to frequency distributions, but I'd like points, not bars.
EDIT:
The data is a standard power-law distribution, i.e.
dat <- c(rep(1, 1000), rep(10, 100), rep(100, 10), 100)
The integral of a density is approximately 1 so multiplying the density$y estimate by the number of values should give you something on the scale of a frequency. If you want a "true" frequency then you should use a histogram:
d <- density(mtcars$mpg)
d$y <- d$y * length(mtcars$mpg) ; plot(d)
This is a histogram with breaks that are 1 unit each:
hist(mtcars$mpg,
breaks=trunc(min(mtcars$mpg)):(1+trunc(max(mtcars$mpg))), add=TRUE)
So this is the superposed comparison:
d <- density(mtcars$mpg)
d$y <- d$y * length(mtcars$mpg) ; plot(d, ylim=c(0,4) )
hist(mtcars$mpg, breaks=trunc(min(mtcars$mpg)):(1+trunc(max(mtcars$mpg))), add=TRUE)
You'll want to look at the ?density help page, where the default bandwidth choice is criticized and alternatives are offered. If you use the adjust parameter you might see a closer (smoothed) correspondence to the histogram.
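For example (adjust = 0.5 is just an illustrative value; smaller values give a narrower bandwidth and a bumpier curve that hugs the histogram more closely):
d2 <- density(mtcars$mpg, adjust = 0.5)   # narrower bandwidth than the default
d2$y <- d2$y * length(mtcars$mpg)         # rescale to the frequency scale again
plot(d2, ylim = c(0, 4))
hist(mtcars$mpg,
     breaks = trunc(min(mtcars$mpg)):(1 + trunc(max(mtcars$mpg))), add = TRUE)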
If you have discrete values for observations and want to make a plot with points on the log scale, then
dat <- c(rep(1, 1000), rep(10, 100), rep(100, 10), 100)
dd <- aggregate(rep.int(1, length(dat)) ~ dat, FUN = sum)
names(dd) <- c("val", "freq")
plot(freq ~ val, dd, log = "xy")
might be what you are after.
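A variant of the same idea, for comparison, is to let table() do the counting and then plot the counts as points (the default plotting symbol is an open circle, which matches the look described in the question):
# count how often each distinct value occurs, then plot counts on log-log axes
tab <- table(dat)
plot(as.numeric(names(tab)), as.vector(tab),
     log = "xy", xlab = "val", ylab = "freq")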
I am now using the "vars" package in R to examine the interrelationship between two time series. Specifically, our data has 66 time points. I divided it into a test sample (observations 1-60) and a hold-out sample (observations 61-66). I want to plot the predicted values for all 66 observations together with the raw scores for all 66 observations on the same scale (from 1 to 66) in the same plot, to compare the model fit. But I have failed to do so with the par and layout functions. It would be highly appreciated if you could kindly give me some instructions.
Below is my R code:
library("vars")
setwd("c:$temp")
filename<-"data.txt"
full<-read.table(filename,header=TRUE,sep="\t")
env<-full[1:60,]
varlag1<-VAR(env,p = 2,type = "const");
summary(varlag1)
plot(varlag1)
predict<-predict(varlag1,n.ahead=6,ci=0.95)
list(predict)
raw_v1<-full[1:66,1]
plot(predict,names="v1",lwd=3)
par(new=TRUE)
plot(as.ts(raw_v1),lwd=1)
raw_v2<-full[1:66,2]
plot(predict,names="v2",lwd=3)
par(new=TRUE)
plot(as.ts(raw_v2),lwd=1)
It doesn't look as pretty as you might want, but I guess something like this is what you are looking for?
pred1 <- c(env[, "v1"], predict$fcst$v1[, 1])
pred2 <- c(env[, "v2"], predict$fcst$v2[, 1])
pred3 <- c(env[, "v3"], predict$fcst$v3[, 1])
raw_v3 <- full[1:66, 3]  # not defined in your code above; assumes the third column of the data is v3
ts.plot(cbind(pred1, raw_v1), col = 1:2, lwd = 2)
ts.plot(cbind(pred2, raw_v2), col = 1:2, lwd = 2)
ts.plot(cbind(pred3, raw_v3), col = 1:2, lwd = 2)
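If you want all of the series visible at once rather than one at a time, wrapping these calls in par(mfrow = ...) should give one panel per variable; a sketch, reusing the objects defined above:
op <- par(mfrow = c(3, 1))  # three stacked panels, one per variable
ts.plot(cbind(pred1, raw_v1), col = 1:2, lwd = 2, ylab = "v1")
ts.plot(cbind(pred2, raw_v2), col = 1:2, lwd = 2, ylab = "v2")
ts.plot(cbind(pred3, raw_v3), col = 1:2, lwd = 2, ylab = "v3")
par(op)                     # restore the previous layout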
I am trying to plot the inverse of a survival function, as the data I'm working with is actually an increase in the proportion of an event over time. I can produce Kaplan-Meier survival plots, but I want to produce the 'opposite' of these. I can sort of get what I want using the following fun="cloglog" call:
plot(survfit(Surv(Days_until_workers,Workers)~Queen_Number+Treatment,data=xdata),
fun="cloglog", lty=c(1:4), lwd=2, ylab="Colonies with Workers",
xlab="Days", las=1, font.lab=2, bty="n")
But I don't quite understand what this has done to the time axis (it doesn't start at 0 and the tick spacing decreases?), or why the survival lines extend above the y-axis.
Would really appreciate some help with this!
Cheers
Use fun="event" to get the desired output
fit <- survfit(Surv(time, status) ~ x, data = aml)
par(mfrow=1:2, las=1)
plot(fit, col=2:3)
plot(fit, col=2:3, fun="event")
The reason fun="cloglog" messes up the axes is that it does not plot a fraction at all. Instead, according to ?plot.survfit, it is plotting this:
"cloglog" creates a complimentary log-log survival plot (f(y) = log(-log(y)) along with log scale for the x-axis)
Moreover, the fun argument is not limited to predefined functions like "event" or "cloglog", so you can easily give it your own custom function.
plot(fit, col=2:3, fun=function(y) 3*sqrt(1-y))