auto.arima fits well except for a single spike (R)

I'm an engineering grad student, and as a small part of my thesis I'm trying to analyze some groundwater data in R using the auto.arima function. The fitted values match my data well except for one spike, and I cannot figure out for the life of me why they go off the rails there. There are no oddities or missing values in the data. The data is the elevation of the groundwater, with one recorded point per day.
My raw unfitted data looks like this:
#load libraries
library(tseries)
library(forecast)  # needed for auto.arima() below
# RESERVOIR ONLY ANALYSIS #
#Daily piezometric data from PS13-01
PS1301 <- read.csv("PS13-01.csv", header = TRUE, sep = ",")
#impute missing data in the data set
PS1301 <- imputeTS::na_interpolation(PS1301)
#Create time series (daily observations, starting on day 116 of 2013)
PS1301 <- ts(PS1301[, 2], frequency = 365, start = c(2013, 116))
plot(PS1301, xlab = 'Time', ylab = 'Piezometric Head')
And then after running auto.arima it produces this fit:
#auto.arima of the piezometer series only
#PS1301
AAPS1301 <- auto.arima(PS1301)
AAPS1301
summary(AAPS1301)
## Series: PS1301
## ARIMA(2,1,0)(0,1,0)[365]
##
## Coefficients:
## ar1 ar2
## 0.3362 0.5722
## s.e. 0.0643 0.0625
##
## sigma^2 estimated as 0.02372: log likelihood=2779.3
## AIC=-5552.61 AICc=-5552.59 BIC=-5536.39
plot(PS1301,col="red")
lines(fitted(AAPS1301),col="blue")
Any help would be appreciated; I'm pretty unsure what to do from here. I feel like this has to be an error because of how well the fit looks (visually) for the rest of the time series. I'm also more than happy to provide the raw data, but I'm not sure how to include it in this post other than as a Dropbox link: https://www.dropbox.com/sh/563nu3daeid0agb/AAB6NSddVUKgBCCbQtuqXPsZa?dl=0

The problem here is that the seasonal period is very long (365), and R is trying to fit a diffuse prior to the corresponding state space model, which becomes increasingly difficult with very long periods. There appears to be some numerical instability as a result, giving inaccurate fitted values at the 366th and 367th observations.
I am not convinced that using a seasonal ARIMA with such a long period makes any sense, but if you want to do it, use conditional sum of squares (CSS) estimation instead of the full likelihood:
fit_css <- auto.arima(PS1301, method='CSS')
It is also much faster.
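A minimal sketch of the comparison, assuming the PS1301 series built in the question: refit with CSS and overlay the fitted values to check whether the spike at observations 366-367 disappears.
library(forecast)
# refit using conditional sum of squares instead of full maximum likelihood
fit_css <- auto.arima(PS1301, method = 'CSS')
# overlay fitted values on the raw series to inspect the former spike
plot(PS1301, col = 'red')
lines(fitted(fit_css), col = 'blue')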

Related

Time Series Forecasting using Support Vector Machine (SVM) in R

I've tried searching but couldn't find a specific answer to this question. So far I've realized that time series forecasting is possible using SVM. I've gone through a few papers/articles that performed the same but didn't include any code, instead explaining the algorithm (which I didn't quite understand). And some have done it using Python.
My problem is this: I have company data (say univariate) of sales from 2010 to 2017, and I need to forecast the sales value for 2018 using SVM in R.
Would you be kind enough to simply present and explain the R code to perform the same using a small example?
I really do appreciate your inputs and efforts!
Thanks!!!
Let's assume you have monthly data, for example derived from the AirPassengers data set. You don't need time-series-type data, just a data frame containing time steps and values; let's name them x and y. Next you train an SVM model and specify the time steps you need to forecast, then use the predict function to compute the forecast for those time steps. That's it.

However, a support vector machine is not commonly regarded as the best method for time series forecasting, especially for long series. It can perform well for a few observations ahead, but I wouldn't expect good results for forecasting, e.g., daily data for a whole next year (though that obviously depends on the data). Simple R code for an SVM-based forecast:
# svm() comes from the e1071 package
library(e1071)
# prepare sample data in the form of a data frame with cols of timesteps (x) and values (y)
data(AirPassengers)
monthly_data <- unclass(AirPassengers)
months <- 1:144
DF <- data.frame(x = months, y = monthly_data)
# train an svm model; consider further tuning parameters for lower MSE
svmodel <- svm(y ~ x, data = DF, type = "eps-regression",
               kernel = "radial", cost = 10000, gamma = 10)
# specify timesteps for the forecast, e.g. the whole series + 12 months ahead
nd <- 1:156
# compute the forecast for all 156 months
prognoza <- predict(svmodel, newdata = data.frame(x = nd))
# plot the results: actual series in blue, SVM fit and forecast in red
ylim <- c(min(DF$y), max(DF$y))
xlim <- c(min(nd), max(nd))
plot(DF$y, col = "blue", ylim = ylim, xlim = xlim, type = "l")
par(new = TRUE)
plot(prognoza, col = "red", ylim = ylim, xlim = xlim)
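Since the cost and gamma values above are fairly arbitrary, here is a minimal tuning sketch using e1071's built-in grid search (the DF data frame is from the code above; the grid values are illustrative, not recommendations):
# grid-search gamma and cost via cross-validation (e1071::tune.svm)
tuned <- tune.svm(y ~ x, data = DF,
                  gamma = c(0.1, 1, 10),
                  cost = c(100, 1000, 10000))
summary(tuned)
# best model found on the grid, usable with predict() as before
best_svm <- tuned$best.model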

Auto.arima() function does not result in white noise. How else should I go about modeling the data?

Here is the plot of the initial data (after performing a log transformation).
It is evident there is both a linear trend and a seasonal pattern. I can address both of these by taking the first and twelfth (seasonal) difference: diff(diff(data), 12). After doing so, here is the plot of the resulting data.
This data does not look great. While the mean is constant, we see a funneling effect as time progresses. Here are the ACF/PACF:
Any suggestions for possible fits to try? I used the auto.arima() function, which suggested an ARIMA(2,0,2)(1,0,2)[12] model. However, once I took the residuals from the fit, it was clear there was still some sort of structure in them. Here is the plot of the residuals from the fit, as well as the ACF/PACF of the residuals.
There does not appear to be a seasonal pattern in which lags have spikes in the ACF/PACF of the residuals. However, something is still not captured by the previous steps. What do you suggest I do? How could I go about building a better model with better diagnostics (which at this point just means a better-looking ACF and PACF)?
Here is my simplified code thus far:
library(TSA)
library(forecast)
beer <- read.csv('beer.csv', header = TRUE)
beer <- ts(beer$Production, start = c(1956, 1), frequency = 12)
# transform data
boxcox <- BoxCox.ar(beer)  # 0 in confidence interval, so log-transform
beer.log <- log(beer)
# first and seasonal (lag-12) differences remove the linear and seasonal trends
firstDifference <- diff(diff(beer.log), 12)
acf(firstDifference)
pacf(firstDifference)
eacf(firstDifference)
plot(armasubsets(firstDifference, nar = 12, nma = 12))
# fitting the model
auto.arima(firstDifference, ic = 'bic')  # from the forecast package
modelFit <- arima(firstDifference, order = c(1, 0, 0),
                  seasonal = list(order = c(2, 0, 0), period = 12))
# assessing the model
resid <- modelFit$residuals
acf(resid, lag.max = 15)
pacf(resid, lag.max = 15)
Here is the data, if interested (I think you can use an html to csv converter if you would like): https://docs.google.com/spreadsheets/d/1S8BbNBdQFpQAiCA4J18bf7PITb8kfThorMENW-FRvW4/pubhtml
Jane,
There are a few things going on here.
Instead of logs, we used the Tsay variance test, which shows that the variance increased after period 118; weighted least squares deals with this.
March becomes higher beginning at period 111. An alternative to an AR(12) or seasonal differencing is to identify seasonal dummies. We found that 7 of the 12 months were unusual, with a couple of level shifts and an AR(2) with 2 outliers.
(Plots of the fit and forecasts, the residuals, and the ACF of the residuals are omitted here.)
Note: I am a developer of the software Autobox. All models are wrong. Some are useful.
Here is Tsay's paper
http://onlinelibrary.wiley.com/doi/10.1002/for.3980070102/abstract
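The Autobox workflow above is proprietary, but a rough open-source analogue of the seasonal-dummy idea can be sketched with the forecast package (assuming the beer.log series from the question; this illustrates the idea only, not the Autobox model):
library(forecast)
# monthly dummy variables as regressors (forecast::seasonaldummy)
dummies <- seasonaldummy(beer.log)
# regression with ARMA errors; the dummies stand in for seasonal differencing
fit <- auto.arima(beer.log, xreg = dummies, seasonal = FALSE)
summary(fit)
checkresiduals(fit)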

Generated non-stationary Data and auto.arima in R

I generated my own fictional Sales Data in order to execute a time series analysis.
It is supposed to represent a growing company, and therefore I worked with a trend. However, I read through some tutorials and often came across the statement that non-stationary time series should not be fed to the auto.arima function.
But I receive results that make sense, and if I difference the data (which I did as well), the output doesn't make much sense.
So here comes my question: can I use the auto.arima function with my data, which obviously has a trend?
Best regards and thanks in advance,
Francisco
library(forecast)  # provides rwf(), auto.arima(), forecast()
eps <- rnorm(100, 30, 20)
trend <- 3 * seq(1, 100, 1)
Sales <- trend + eps
timeframe <- seq(as.Date("2008/9/1"), by = "month", length.out = 100)
Data <- data.frame(Sales, timeframe)
plot(Data$timeframe, Data$Sales)
# (an object named 'ts' is legal, but easy to confuse with the ts() function)
ts <- ts(t(Data[, 1]))
plot(ts[1, ], type = 'o', col = "black")
md <- rwf(ts[1, ], h = 12, drift = TRUE, level = c(80, 95))
auto.arima(ts[1, ])
Using the forecast function allows us to plot the expected sales for the next year:
plot(forecast(auto.arima(ts[1, ]), h = 12))
Using the forecast function with our automated ARIMA can also help us plan for the next quarter:
forecast(auto.arima(ts[1, ]), h = 4)
plot(forecast(auto.arima(ts[1, ])))
Another way would be to use the autoplot function:
fc <- forecast(ts[1, ])
autoplot(fc)
The next step is to analyze our time series. I run the ADF test, which has the null hypothesis that the data is non-stationary.
So with the 5% default threshold, the p-value would have to be greater than 0.05 for us to fail to reject the null, i.e. for the data to be considered non-stationary.
library(tseries)
adf <- adf.test(ts[1, ])
adf
The output suggests that the data is non-stationary.
Next, the ACF:
Acf(ts[1, ])  # Acf() comes from the forecast package
The autocorrelation decreases almost steadily, which also points to non-stationary data. Doing a kpss.test should confirm that our data is non-stationary, since its null hypothesis is the opposite of the ADF test's. Here we expect a p-value smaller than 0.05.
kpss <- kpss.test(ts[1, ])
kpss
We receive a p-value of 0.01, further evidence that the data has a trend.
ndiffs(ts[1, ])
diff.data <- diff(ts[1, ])
auto.arima(diff.data)
plot(forecast(diff.data))
To answer your question: yes, you can use the auto.arima() function from the forecast package on non-stationary data.
If you look at the help file for auto.arima() (by typing ?auto.arima), you will see that you can choose to specify the d parameter, the order of differencing: first order means the data is differenced once, second order means it is differenced twice, and so on. You can also leave this parameter unspecified, in which case auto.arima() determines the appropriate order of differencing using the KPSS test. There are other unit root tests, such as the Augmented Dickey-Fuller test, which you can select in auto.arima by setting test = "adf". It really depends on your preference.
You can refer to page 11 and subsequent pages of the forecast package manual for more information on the auto.arima function:
https://cran.r-project.org/web/packages/forecast/forecast.pdf
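A minimal sketch of those options, reusing the ts[1, ] series generated in the question (the variable names come from that code):
library(forecast)
# let auto.arima() pick d via the default KPSS unit root test
fit_kpss <- auto.arima(ts[1, ])
# use the Augmented Dickey-Fuller test to choose d instead
fit_adf <- auto.arima(ts[1, ], test = "adf")
# or force first-order differencing explicitly
fit_d1 <- auto.arima(ts[1, ], d = 1)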

Auto.arima is not showing any order

I am trying to fit an ARIMA model using the auto.arima function in R. The result shows order (0,0,0) even though the data is non-stationary.
auto.arima(x,approximation=TRUE)
ARIMA(0,0,0) with non-zero mean
Can someone advise why such results are coming out? By the way, I am running this function on only 10 data points.
10 data points is a very low number of observations for estimating an ARIMA model, and I doubt that you can make any sensible estimation based on this. Moreover, the estimated model may depend strongly on which part of a time series you looked at, and adding only a few observations can change the characteristics of the estimated model significantly. For example:
When I take a time series with only 10 observations, I also get an ARIMA(0,0,0) model:
library(forecast)
vec1 <- ts(c(10.26063, 10.60462, 10.37365, 11.03608, 11.19136, 11.13591, 10.84063, 10.66458, 11.06324, 10.75535), frequency = 12)
fit1 <- auto.arima(vec1)
summary(fit1)
However, if I use about 30 observations, an ARIMA(1,0,0) model is estimated:
vec2 <- ts(c(10.260626, 10.604616, 10.373652, 11.036079, 11.191359, 11.135914, 10.840628, 10.664575, 11.063239, 10.755350,
10.158032, 10.653669, 10.659231, 10.483478, 10.739133, 10.400146, 10.205993, 10.827950, 11.018257, 11.633930,
11.287756, 11.202727, 11.244572, 11.452180, 11.199706, 10.970823, 10.386131, 10.184201, 10.209338, 9.544736), frequency = 12)
fit2 <- auto.arima(vec2)
summary(fit2)
If I use the whole time series (413 observations), the auto.arima function estimates an "ARIMA(2,1,4)(0,0,1)[12] with drift".
Thus, I would say that 10 observations is indeed not enough information for fitting a model.

Using survfit object's formula in survdiff call

I'm doing some survival analysis in R, and looking to tidy up/simplify my code.
At the moment I'm doing several steps in my data analysis:
make a Surv object (time variable with indication as to whether each observation was censored);
fit this Surv object according to a categorical predictor, for plotting/estimation of median survival time processes; and
calculate a log-rank test to ask whether there is evidence of "significant" differences in survival between the groups.
As an example, here is a mock-up using the lung dataset from the survival package in R. The following code is similar enough to what I want to do, but much simplified in terms of the predictor set (which is why I want to simplify the code, so I don't make inconsistent calls across models).
library(survival)
# Step 1: Make a survival object with time-to-event and censoring indicator.
# Following works with defaults as status = 2 = dead in this dataset.
# Create survival object
lung.Surv <- with(lung, Surv(time=time, event=status))
# Step 2: Fit survival curves to object based on patient sex, plot this.
lung.survfit <- survfit(lung.Surv ~ lung$sex)
print(lung.survfit)
plot(lung.survfit)
# Step 3: Calculate log-rank test for difference in survival objects
lung.survdiff <- survdiff(lung.Surv ~ lung$sex)
print(lung.survdiff)
Now this is all fine and dandy, and I can live with this but would like to do better.
So my question is about step 3. What I would like to do is use the information in the formula from the lung.survfit object to feed into the calculation of the differences in survival curves, i.e. in the call to survdiff. And this is where my domitable [sic] programming skills hit a wall. Below is my current attempt; I'd appreciate any help you can give! Once I get this sorted out, I should be able to wrap the solution up in a function.
lung.survdiff <- survdiff(parse(text=(lung.survfit$call$formula)))
## Which returns following:
# Error in survdiff(parse(text = (lung.survfit$call$formula))) :
# The 'formula' argument is not a formula
As I commented above, I actually sorted out the answer to this shortly after having written this question.
So step 3 above could be replaced by:
lung.survdiff <- survdiff(formula(lung.survfit$call$formula))
But as Ben Barnes points out in the comment to the question, the formula from the survfit object can be more directly extracted with
lung.survdiff <- survdiff(formula(lung.survfit))
Which is exactly what I wanted and hoped would be available -- thanks Ben!
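Since the goal was to wrap this up in a function, here is a minimal sketch of such a helper (the name survdiff_from_fit is my own, hypothetical; it just packages the formula() extraction shown above, and works when the formula's variables are visible, as in the code above):
library(survival)
# hypothetical helper: reuse a survfit object's formula in the log-rank test
survdiff_from_fit <- function(fit) survdiff(formula(fit))
lung.Surv <- with(lung, Surv(time = time, event = status))
lung.survfit <- survfit(lung.Surv ~ lung$sex)
print(survdiff_from_fit(lung.survfit))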
