Why do I see a jump in my time series forecast? - r

I am using auto.arima() from the forecast package and am running into some strange results with the prediction.
library(forecast)
x <- structure(c(1.92, 2.1, 1.73, 1.35, 1.29, 1.35, 1.42, 1.46, 1.6,
1.67, 1.98, 1.78, 1.77, 2.35, 1.93, 1.43, 1.29, 1.26, 1.93, 2.33,
2.22, 2.19, 2.15, 2.25, 3.12, 3.32, 2.72, 2.28, 2.28, 2.16, 2.81,
3.12, 2.85, 2.98, 3.3, 3.06, 3.56, 3.81, 3.48, 2.64, 2.91, 3.35,
3.73, 3.58, 4, 3.94, 3.79, 3.85), .Tsp = c(2012, 2015.91666666667,
12), class = "ts")
fit <- auto.arima(x)
plot(forecast(fit, 12)) #forecast and actual data
f2 <- fitted.values(fit)
lines(f2, col="red") #add predicted values during training
I don't understand how the fitted values (red line) can be so close to the observed values (black) while there is such a big jump at the first forecast.
Any ideas why we see this jump? I've seen other posts on Stack Exchange where the xreg option was involved, but this code doesn't use xreg, so I haven't been able to track down a similar post.

Generally I tend to believe that auto.arima slightly overfits the data. Some quick exploratory analysis with the ACF shows that (0,1,2)(0,1,0)[12] is already a decent model. I will use arima0 from base R (the stats package) to fit this model:
fit0 <- arima0(x, order = c(0,1,2), seasonal = c(0,1,0))
Prediction / forecasting is done with predict.arima0:
pred <- predict(fit0, n.ahead = 12, se.fit = FALSE)
Let's plot observed series and forecast together:
ts.plot(x, pred, col = 1:2)
There is still a jump. But the variation is fairly reasonable, compared with the variability of the series.
Nothing is wrong. When we forecast x[49] from x[1:48], it will generally differ from x[48]. A (0,1,2)(0,1,0)[12] model implies a linear trend plus a seasonal effect, so it helps to visualize your time series and the prediction season by season:
ts.plot(window(x, 2012, 2012 + 11/12),
        window(x, 2013, 2013 + 11/12),
        window(x, 2014, 2014 + 11/12),
        window(x, 2015, 2015 + 11/12),
        pred, col = 1:5)
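A quick numerical check can also help (a small sketch of my own, not part of the original answer): because of the seasonal difference in (0,1,2)(0,1,0)[12], the first forecast is tied to the observation twelve months earlier plus the recent drift, not to the last observation, so comparing those values makes the jump less surprising.
# Hedged check, reusing the pred object from predict.arima0 above
tail(x, 1)          # last observation (Dec 2015)
window(x, 2015)[1]  # Jan 2015, the seasonal counterpart of the first forecast
pred[1]             # first forecast (Jan 2016)
diff(x, lag = 12)   # year-over-year changes, which set the size of that first step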

Related

plotting threshold/piecewise/change point models with 95% confidence intervals in R

I would like to plot a threshold model with smooth 95% confidence interval lines between line segments. You would think this would be on the simple side but I have not been able to find an answer!
My threshold/breakpoints are known; it would be great if there were a way to visualize this data. I have tried the segmented package, which produces the following plot:
The plot shows a threshold model with a breakpoint at 5.4. However, the confidence intervals are not smooth between regression lines.
If anyone knows of a way to produce smooth CI lines (i.e. without the jump between line segments) between segmented regression lines, ideally in ggplot, that would be amazing. Thank you so much.
I have included sample data and the code I have tried below:
x <- c(2.26, 1.95, 1.59, 1.81, 2.01, 1.63, 1.62, 1.19, 1.41, 1.35, 1.32, 1.52, 1.10, 1.12, 1.11, 1.14, 1.23, 1.05, 0.95, 1.30, 0.79,
0.81, 1.15, 1.10, 1.29, 0.97, 1.05, 1.05, 0.84, 0.64, 0.80, 0.81, 0.61, 0.71, 0.75, 0.30, 0.30, 0.49, 1.13, 0.55, 0.77, 0.51,
0.67, 0.43, 1.11, 0.29, 0.36, 0.57, 0.02, 0.22, 3.18, 3.79, 2.49, 2.44, 2.12, 2.45, 3.22, 3.44, 3.86, 3.53, 3.13)
y <- c(22.37, 18.93, 16.99, 15.65, 14.62, 13.79, 13.09, 12.49, 11.95, 11.48, 11.05, 10.66, 10.30, 9.96, 9.65, 9.35, 9.07, 8.81,
8.56, 8.32, 8.09, 7.87, 7.65, 7.45, 7.25, 7.05, 6.86, 6.68, 6.50, 6.32, 6.15, 5.97, 5.80, 5.63, 5.47, 5.30,
5.13, 4.96, 4.80, 4.63, 4.45, 4.28, 4.09, 3.90, 3.71, 3.50, 3.27, 3.01, 2.70, 2.28, 22.37, 16.99, 11.05, 8.81,
8.56, 8.32, 7.25, 7.05, 6.50, 6.15, 5.63)
library(segmented)
lin.mod <- lm(y ~ x)
segmented.mod <- segmented(lin.mod, seg.Z = ~x, psi=2)
plot(x, y)
plot(segmented.mod, add=TRUE, conf.level = 0.95)
which produces the following plot (and associated jumps in 95% confidence intervals):
segmented plot
Background: The non-smoothness in existing change point packages is due to the fact that frequentist packages operate with a fixed change point value. But as with all inferred parameters, treating the change point as fixed is misleading, because there is real uncertainty about its location.
Solution: AFAIK, only Bayesian methods can quantify that uncertainty, and the mcp package fills this space.
library(mcp)
model = list(
  y ~ 1 + x,  # Segment 1: intercept and slope
  ~ 0 + x     # Segment 2: joined slope (no intercept change)
)
fit = mcp(model, data = data.frame(x, y))
Default plot (plot.mcpfit() returns a ggplot object):
plot(fit) + ggtitle("Default plot")
Each line represents a possible model that generated the data. The posterior for the change point is shown as a blue density. You can add a credible interval on top using plot(fit, q_fit = TRUE) or plot it alone:
plot(fit, lines = 0, q_fit = c(0.025, 0.975), cp_dens = FALSE) + ggtitle("Credible interval only")
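As a side note not taken from the original answer: the fitted mcpfit object also has a summary() method, which should report posterior means and intervals for the parameters, including the change point cp_1.
# Side note (standard mcp usage, assumed rather than shown in the answer):
# numerical posterior summaries for the fit created above
summary(fit)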
If your change point is indeed known and if you want to model different residual scales for each segment (i.e., quasi-emulate segmented), you can do:
model2 = list(
  y ~ 1 + x,
  ~ 0 + x + sigma(1)  # Add intercept change in residual scale
)
fit = mcp(model2, data = data.frame(x, y), prior = list(cp_1 = 1.9))  # Note: the prior is a fixed value - not a distribution.
plot(fit, q_fit = TRUE, cp_dens = FALSE)
Notice that the CI does not "jump" around the change point as in segmented. I believe that this is the correct behavior. Disclosure: I am the author of mcp.

superimpose a prediction curve on ggplot

I know this has been asked multiple times, but I could not find an answer that solves the problem I am encountering.
I would like to generate a prediction curve and superimpose it on a ggplot. The model is a quadratic plateau nonlinear function.
Data as below
dd_ <- data.frame(yield = c(2.07, 1.58, 2.01, 2.27, 3.28,
2.31, 2.49, 2.41, 3.90, 3.26,
3.37, 3.83, 4.06, 3.54, 3.75,
3.48, 4.51, 3.39, 4.09, 3.87,
4.31, 4.36, 4.66, 3.79, 4.17,
4.63, 3.99, 3.88, 4.73),
n_trt = c(0,0,0,0,25,25,25,25,
50,50,50,50,75,75,75,75,
100,100,100,100,125,125,125,125,
150,150,150,175,175))
The quadratic-plateau function is
quadratic.plateau <- function(alpha, beta, gamma, D, x){
  # quadratic up to the join point D, constant (plateau) beyond it
  ifelse(x < D, alpha + beta*x + gamma*x^2, alpha + beta*D + gamma*D^2)
}
I use the minpack.lm package as it gives a better fit than nls.
library(minpack.lm)
library(ggiraphExtra)
q_model <- nlsLM(yield~quadratic.plateau(A,B,C, D, n_trt),
data = dd_, start=list(A=2.005904,
B=0.03158664,
C=-0.0001082836,
D = 145.8515 ))
ggPredict(q_model,interactive=TRUE,colorn=100,jitter=FALSE)
By doing this, I receive an error
Error: $ operator is invalid for atomic vectors
I also used
ggplot(dd_, aes(n_trt, yield)) +
geom_point(size = 0.5) +
geom_smooth(method = "quadratic.plateau", data = dd_)
but no prediction curve was generated.
I appreciate your help. Thanks!
Almost identical to this question: the main point is that you have to set se=FALSE because predict.nls() doesn't return standard errors ...
ggplot(dd_, aes(n_trt, yield)) +
  geom_point(size = 0.5) +
  geom_smooth(method = "nlsLM",
              se = FALSE,
              formula = y ~ quadratic.plateau(A, B, C, D, x),
              method.args = list(start = list(A = 2.005904,
                                              B = 0.03158664,
                                              C = -0.0001082836,
                                              D = 145.8515)))
After a few attempts, this solves my problem.
cf <- coef(q_model)
eq1 <- function(x){
  ifelse(x < cf[4],
         cf[1] + cf[2]*x + cf[3]*x^2,
         cf[1] + cf[2]*cf[4] + cf[3]*cf[4]^2)
}
ggplot(dd_, aes(n_trt, yield)) +
  geom_point(size = 0.5) +
  stat_function(fun = eq1, geom = "line", color = scales::hue_pal()(2)[1])
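Another option (a sketch of my own, not from the thread): since q_model is already fitted, you can compute predictions over a grid of n_trt values with predict() and overlay them with geom_line(), which avoids re-fitting inside geom_smooth(). The newdat grid below is a helper introduced here.
# Hedged alternative sketch: predict from the fitted nlsLM model over a grid
newdat <- data.frame(n_trt = seq(min(dd_$n_trt), max(dd_$n_trt), length.out = 200))
newdat$yield <- predict(q_model, newdata = newdat)
ggplot(dd_, aes(n_trt, yield)) +
  geom_point(size = 0.5) +
  geom_line(data = newdat, colour = "blue")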

Creating a 2D-grid or raster in R comparing all respondents with all variables

A reproducible example of my data:
df_1 <- data.frame(cbind("Thriving" = c(2.33, 4.21, 6.37, 5.28, 4.87, 3.92, 4.16, 5.53),
                         "Satisfaction" = c(3.45, 4.53, 6.01, 3.87, 2.92, 4.50, 5.89, 4.72),
                         "Wellbeing" = c(2.82, 3.45, 5.23, 3.93, 6.18, 4.22, 3.68, 4.74),
                         "id" = c(1:8)))
As you can see, it includes three variables of psychological measures and one identifier with an id for each respondent.
Now, my aim is to create a 2D grid that gives a nice overview of all the values for all respondents on each of the variables. On the x-axis I would have the ids of all the respondents and on the y-axis all the variables, with the colour of each field depending on the value: 1 to 3 in red, 3 to 5 in yellow and 5 to 7 in green. The style of the grid should be like this image.
All I have achieved so far is the following code, which gathers all the variables/items into one column so they can be shown together on the y-axis; the id is kept in its own column, as are the values:
library(dplyr)
library(tidyr)
df_1 %>%
  select("Thr" = Thriving, "Stf" = Satisfaction, "Wb" = Wellbeing, "id" = id) %>%
  na.omit() %>%
  gather(key = "variable", value = "value", -id)
I am looking for a solution that works without storing the data in a new frame.
Also, I am looking for a solution that scales to 100 or more respondents and up to about 40 variables. It would not matter if each rectangle then became very small; I just want a colour display that gives a sense of where an organisation is scoring low or high, and how it is doing in general.
Thanks for reading, very grateful for any help!
There is probably a better graphics-oriented approach, but you can do this with base plot by treating your data as a raster:
library(raster)
df_1 <- cbind("Thriving" = c(2.33, 4.21, 6.37, 5.28, 4.87, 3.92, 4.16, 5.53),
              "Satisfaction" = c(3.45, 4.53, 6.01, 3.87, 2.92, 4.50, 5.89, 4.72),
              "Wellbeing" = c(2.82, 3.45, 5.23, 3.93, 6.18, 4.22, 3.68, 4.74),
              "id" = c(1:8))
r <- raster(ncol=nrow(df_1), nrow=3, xmn=0, xmx=8, ymn=0, ymx=3)
values(r) <- as.vector(as.matrix(df_1[,1:3]))
plot(r, axes=F, box=F, asp=NA)
axis(1, at=seq(-0.5, 8.5, 1), 0:9)
axis(2, at=seq(-0.5, 3.5, 1), c("", colnames(df_1)), las=1)
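For completeness, here is a minimal ggplot sketch (my own addition, not part of the original answer) that builds on the gather() step from the question and pipes straight into ggplot, so the reshaped data is never stored in a new object; the red/yellow/green banding from the question is only approximated with a gradient.
# Hedged ggplot sketch: one coloured tile per respondent/variable combination;
# df_1 from either snippet above works (it is coerced to a data frame first)
library(dplyr)
library(tidyr)
library(ggplot2)
df_1 %>%
  as.data.frame() %>%
  gather(key = "variable", value = "value", -id) %>%
  ggplot(aes(x = factor(id), y = variable, fill = value)) +
  geom_tile(colour = "white") +
  scale_fill_gradientn(colours = c("red", "yellow", "green"), limits = c(1, 7)) +
  labs(x = "id", y = NULL, fill = "score")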

Plotting longitudinal data using loess smoother

I need to plot a set of smoothed trajectories for individuals in a longitudinal (person-period) dataset. I can plot the individual trajectories across days using OLS regression but I would like to know how to plot the trajectories using a non-parametric smoother.
Sample data below. Same outcome variable measured five times for each individual at ages eleven, twelve, thirteen, fourteen and fifteen. Exposure is a predictor variable but we are not interested in it for this exercise.
id <- c(9, 45, 268, 314, 442, 514, 569, 624, 723, 918, 949, 978, 1105, 1542, 1552, 1653)
eleven <- c(2.23, 1.12, 1.45, 1.22, 1.45, 1.34, 1.79, 1.12, 1.22, 1.00, 1.99, 1.22, 1.34, 1.22, 1.00, 1.11)
twelve <- c(2.23, 1.12, 1.45, 1.22, 1.45, 1.34, 1.79, 1.12, 1.22, 1.00, 1.99, 1.22, 1.34, 1.22, 1.00, 1.11)
thirteen <- c(1.90, 1.45, 1.99, 1.55, 1.45, 2.23, 1.90, 1.22, 1.12, 1.22, 1.12, 2.12, 1.99, 1.99, 2.23, 1.34)
fourteen <- c(2.12, 1.45, 1.79, 1.12, 1.67, 2.12, 1.99, 1.12, 1.00, 1.99, 1.45, 3.46, 1.90, 1.79, 1.55, 1.55)
fifteen <- c(2.66, 1.99, 1.34, 1.12, 1.90, 2.44, 1.99, 1.22, 1.12, 1.22, 1.55, 3.32, 2.12, 2.12, 1.55, 2.12)
exposure <- c(1.54, 1.16, 0.90, 0.81, 1.13, 0.90, 1.99, 0.98, 0.81, 1.21, 0.93, 1.59, 1.38, 1.44, 1.04, 1.25)
df <- data.frame(id, eleven, twelve, thirteen, fourteen, fifteen, exposure)
Now we convert the person-level dataset to a person-period (i.e. long) dataframe and add a time variable
library(reshape2)
library(plyr)
dfPP <- melt(df, measure.vars = c("eleven", "twelve", "thirteen", "fourteen", "fifteen"),
             variable.name = "age", value.name = "score")
dfPP <- dfPP[order(dfPP$id), ]
dfPP$time <- rep(0:4, 16)
We can plot the raw data using an interaction plot
interaction.plot(dfPP$age, dfPP$id, dfPP$score)
but what we really want are lines of best fit for each individual
fit <- by(dfPP, dfPP$id, function (bydata) fitted.values(lm(score ~ time, data = bydata)))
fit <- unlist(fit)
interaction.plot(dfPP$age, dfPP$id, fit, xlab = "age", ylab = "score")
It would also be good to plot the average change trajectory across subjects
ints <- by(dfPP, dfPP$id, function(bydata) coefficients(lm(score ~ time, data = bydata))[[1]])
ints1 <- unlist(ints)
slopes <- by(dfPP, dfPP$id, function(bydata) coefficients(lm(score ~ time, data = bydata))[[2]])
slopes1 <- unlist(slopes)
To plot the mean trajectory we add an abline, superimposed on the existing graph:
abline(a = mean(ints1), b = mean(slopes1), lwd = 2, col = "red")
Now this is all fine but does anyone know how to achieve a similar result except using a non-parametric smoother? Also with an average trajectory line superimposed. Using some kind of loess function perhaps?
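One possibility (a hedged sketch of my own; this is not an answer from the thread): reuse the same by()/interaction.plot pattern, but smooth each person's five points with lowess() instead of lm(), and overlay an average trajectory from a loess fit to the pooled data. With only five observations per person, the smoothers need to use all points (f = 1, span = 1).
# Hedged sketch: non-parametric individual trajectories plus an average curve
fit.np <- unlist(by(dfPP, dfPP$id,
                    function(bydata) lowess(bydata$time, bydata$score, f = 1)$y))
interaction.plot(dfPP$age, dfPP$id, fit.np, xlab = "age", ylab = "score")
# average trajectory: loess on the pooled data, evaluated at the five ages
# (x positions 1:5 on the interaction plot correspond to time 0:4)
avg <- loess(score ~ time, data = dfPP, span = 1)
lines(1:5, predict(avg, data.frame(time = 0:4)), lwd = 2, col = "red")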

Negative exponential fit: curve looks too high

I am trying to fit a negative exponential to some data in R, but the fitted line looks too high compared to the data, whereas the fit I get using Excel's built-in power fit looks more believable. Can someone tell me why? I've tried using the nls() function and also optim() and get similar parameters from both of those methods, but the fits for both look high.
x <- c(5.96, 12.86, 8.40, 2.03, 12.84, 21.44, 21.45, 19.97, 8.92, 25.00, 19.90, 20.00, 20.70, 16.68, 14.90, 26.00, 22.00, 22.00, 10.00, 5.70, 5.40, 3.20, 7.60, 0.59, 0.14, 0.85, 9.20, 0.79, 1.40, 2.68, 1.91)
y <- c(5.35, 2.38, 1.77, 1.87, 1.47, 3.27, 2.01, 0.52, 2.72, 0.85, 1.60, 1.37, 1.48, 0.39, 2.39, 1.83, 0.71, 1.24, 3.14, 2.16, 2.22, 11.50, 8.32, 38.98, 16.78, 32.66, 3.89, 1.89, 8.71, 9.74, 23.14)
xy.frame <- data.frame(x,y)
nl.fit <- nls(formula=(y ~ a * x^b), data=xy.frame, start = c(a=10, b=-0.7))
a.est <- coef(nl.fit)[1]
b.est <- coef(nl.fit)[2]
plot(x=xy.frame$x,y=xy.frame$y)
# curve looks too high
curve(a.est * x^b.est , add=T)
# these parameters from Excel seem to fit better
curve(10.495 * x^-0.655, add=T)
# alternatively use optim()
theta.init <- c(1000, -0.5, 50)
exp.nll <- function(theta, data){
  a <- theta[1]
  b <- theta[2]
  sigma <- theta[3]
  obs.y <- data$y
  x <- data$x
  pred.y <- a*x^b
  nll <- -sum(dnorm(x=obs.y, mean=pred.y, sd=sigma, log=T))
  nll
}
fit.optim <- optim(par=theta.init, fn=exp.nll, method="BFGS", data=xy.frame)
plot(x=xy.frame$x, y=xy.frame$y)
# still looks too high
curve(fit.optim$par[1] * x^fit.optim$par[2], add=T)
The reason you're seeing the unexpected behavior is that the curves that look "too high" actually have much lower sums of squared errors than the curve from Excel:
# Fit from nls
sum((y - a.est*x^b.est)^2)
# [1] 1588.313
# Fit from Excel
sum((y - 10.495*x^-0.655)^2)
# [1] 1981.561
The reason nls favors the higher curve is that it is working to avoid huge errors at small x values at the cost of slightly larger errors with large x values. One way to address this might be to apply a log-log transformation:
mod <- lm(log(y)~log(x))
(a.est2 <- exp(coef(mod)["(Intercept)"]))
# (Intercept)
# 10.45614
(b.est2 <- coef(mod)["log(x)"])
# log(x)
# -0.6529741
These are quite close to the coefficients from Excel, and they yield a more visually appealing fit (despite the worse performance on the sum-of-squared-errors metric):
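To see the two curves together (a small sketch of my own; the original answer ended with a plot at this point), overlay both fits on the scatter plot:
# Hedged sketch: compare the nls fit and the log-log (lm) fit on the same plot
plot(x=xy.frame$x, y=xy.frame$y)
curve(a.est * x^b.est, add=T)                 # nls fit (the higher curve)
curve(a.est2 * x^b.est2, add=T, col="blue")   # log-log fit, close to Excel's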
