Imputing missing values using ARIMA model - r

I am trying to impute missing values in a time series with an ARIMA model in R. I tried the code below, without success.
x <- AirPassengers
x[90:100] <- NA
fit <- auto.arima(x)
fitted(fit)[90:100] ## this is giving me NAs
plot(x)
lines(fitted(fit), col="red")
The fitted model is not imputing the missing values. Any idea on how this is done?

fitted gives in-sample one-step forecasts. The "right" way to do what you want is via a Kalman smoother. A rough approximation good enough for most purposes is obtained using the average of the forward and backward forecasts for the missing section. Like this:
library(forecast)
x <- AirPassengers
x[90:100] <- NA
fit <- auto.arima(x)
# 11 values are missing (90:100), so forecast 11 steps from each side
fit1 <- forecast(Arima(AirPassengers[1:89], model=fit), h=11)
fit2 <- forecast(Arima(rev(AirPassengers[101:144]), model=fit), h=11)
plot(x)
lines(ts(0.5*c(fit1$mean+rev(fit2$mean)),
  start=time(AirPassengers)[90], frequency=12), col="red")

As Rob said, using a Kalman smoother is usually the "better" solution.
This can be done, for example, with the imputeTS package (disclaimer: I maintain the package): https://cran.r-project.org/web/packages/imputeTS/index.html
library("imputeTS")
x <- AirPassengers
x[90:100] <- NA
# In imputeTS 3.0 and later this function is named na_kalman()
x <- na.kalman(x, model = "auto.arima")
Internally, the imputeTS package performs Kalman smoothing on the state-space representation of the ARIMA model obtained by auto.arima.
Even if the theoretical background is not easy to understand, it usually gives very good results :)
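To make the Kalman-smoothing idea more concrete, here is a minimal base R sketch of the same kind of computation. It is not the imputeTS implementation: the seasonal "airline" model on the log scale is an assumed choice made for illustration (auto.arima may select something else), and the variable names are hypothetical.

```r
# Sketch only: the ARIMA(0,1,1)(0,1,1)[12] model on log data is an assumption,
# not the model that auto.arima would necessarily choose.
x <- AirPassengers
miss <- 90:100
x[miss] <- NA

# stats::arima() accepts NAs; method = "ML" uses the exact Kalman-filter likelihood
fit <- arima(log(x), order = c(0, 1, 1),
             seasonal = list(order = c(0, 1, 1), period = 12),
             method = "ML")

# Smooth the state vector, then map the states back to the observation scale
sm <- KalmanSmooth(log(x), fit$model)
smoothed <- as.numeric(sm$smooth %*% fit$model$Z)

# Fill only the gaps, back-transforming from the log scale
imputed <- x
imputed[miss] <- exp(smoothed[miss])

plot(imputed, col = "red")
lines(x)
```

Unlike the forward/backward forecast average, the smoother uses the observations on both sides of the gap in a single pass.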

Related

I am having trouble with plotting this logistic regression model

Please help me with plotting this model. I tried just using the plot function, but I'm not sure how to incorporate the testing dataset. Thank you.
TravelInsurance <- read.csv(file="TravelInsurancePrediction.csv",header=TRUE)
set.seed(2022)
Training <- sample(c(1:1987),1500,replace=FALSE)
Test <- c(1:1987)[-Training]
TrainData <- TravelInsurance[Training,]
TestData <- TravelInsurance[Test,]
TravIns=as.factor(TravelInsurance$TravelInsurance)
years= TravelInsurance$Age
EMPTY=as.factor(TravelInsurance$Employment.Type)
Grad=as.factor(TravelInsurance$GraduateOrNot)
Income=TravelInsurance$AnnualIncome
Fam=TravelInsurance$FamilyMembers
CD=as.factor(TravelInsurance$ChronicDiseases)
FF=as.factor(TravelInsurance$FrequentFlyer)
logreg = glm(TravIns~ EMPTY+years+Grad+Income+Fam+CD+FF,family = binomial)
Too long for a comment.
A couple of things here:
You divide your dataset into train and test, but then build the model using the full dataset?
Passing vectors is not a good way to use glm(...), or any of the R modeling functions. Better to pass the data frame and reference the columns in the formula.
So, with your dataset (the formula must refer to columns of the data frame; if it names vectors created in the workspace, those are used instead and data=TrainData is silently ignored):
logreg <- glm(factor(TravelInsurance) ~ Employment.Type + Age + GraduateOrNot + AnnualIncome + FamilyMembers + factor(ChronicDiseases) + FrequentFlyer, family = binomial, data = TrainData)
pred <- predict(logreg, newdata=TestData, type='response')
As this is a logistic regression, the responses are probabilities (that someone buys travel insurance?). There are several ways to assess goodness-of-fit. One visualization uses receiver operating characteristic (ROC) curves.
library(pROC)
roc(TestData$TravelInsurance, pred, plot=TRUE)
The area under the ROC curve (the "AUC") is a measure of goodness of fit; 1.0 is perfect, 0.5 is no better than random chance. See the docs: ?roc and ?auc
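For intuition about what the AUC measures, the following base R sketch computes it from ranks (the Mann-Whitney formulation) on simulated data. The helper auc_rank and the simulated variables are hypothetical and only for illustration; pROC remains the more convenient tool for real work.

```r
# Hypothetical helper: AUC via the Mann-Whitney rank statistic.
# labels must be coded 0/1, scores are predicted probabilities.
auc_rank <- function(labels, scores) {
  pos <- labels == 1
  n_pos <- sum(pos)
  n_neg <- sum(!pos)
  r <- rank(scores)  # average ranks, so ties count as half
  (sum(r[pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Simulated data, not the travel insurance file
set.seed(2022)
n <- 500
z <- rnorm(n)
d <- data.frame(y = rbinom(n, 1, plogis(-0.5 + 1.2 * z)), z = z)

# Fit on the training rows only, then score the held-out rows
train <- sample(n, 350)
m <- glm(y ~ z, family = binomial, data = d[train, ])
p <- predict(m, newdata = d[-train, ], type = "response")
auc_test <- auc_rank(d$y[-train], p)
```

The value is the probability that a randomly chosen positive case receives a higher predicted probability than a randomly chosen negative case.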

R: Displaying ARIMA forecast as extension of past data after log transformation

My goal: I want to understand a time series, a strongly auto-regressive one (ACF and PACF output told me that) and make a forecast.
So what I did was I first transformed my data into a ts, then decomposed the time series, checked its stationarity (the series wasn't stationary). Then I conducted a log transformation and found an Arima model that fits the data best - I checked the accuracy with accuracy(x) - I selected the model with the accuracy output closest to 0.
Was this the correct procedure? I'm new to statistics and R and would appreciate some criticism if that wasn't correct.
When building the Arima model I used the following code:
ARIMA <- Arima(log(mydata2), order=c(2,1,2), list(order=c(0,1,1), period=12))
The result I received was a log function and the data from the past (the data I used to build the model) wasn't displayed in the diagram. So then to transform the log into the original scale I used the following code:
ARIMA_FORECAST <- forecast(ARIMA, h=24, lambda=0)
Is that correct? I found it somewhere on the web and don't really understand it.
Now my main question: How can I plot the original data and the ARIMA_FORECAST in one diagram? I mean displaying it the way the forecasts are displayed if no log transformation is undertaken - the forecast should be displayed as the extension of the data from the past, confidence intervals should be there too.
The simplest approach is to set the Box-Cox transformation parameter $\lambda=0$ within the modelling function, rather than take explicit logarithms (see https://otexts.org/fpp2/transformations.html). Then the transformation will be automatically reversed when the forecasts are produced. This is simpler than the approach described by @markus. For example:
library(forecast)
# estimate an ARIMA model to log data
ARIMA <- auto.arima(AirPassengers, lambda=0)
# make a forecast
ARIMA_forecast <- forecast(ARIMA)
# Plot forecasts and data
plot(ARIMA_forecast)
Or if you prefer ggplot graphics:
library(ggplot2)
autoplot(ARIMA_forecast)
The package forecast provides the functions autolayer and geom_forecast that might help you to draw the desired plot. Here is an example using the AirPassengers data. I use the function auto.arima to estimate the model.
library(ggplot2)
library(forecast)
# log-transform data
dat <- log(AirPassengers)
# estimate an ARIMA model
ARIMA <- auto.arima(dat)
# make a forecast
ARIMA_forecast <- forecast(ARIMA, h = 24, lambda = 0)
Since your data is of class ts you can use the autoplot function from ggplot2 to plot your original data and add the forecast with the autolayer function from forecast.
autoplot(AirPassengers) + forecast::autolayer(ARIMA_forecast)
The result is shown below.

R: Prediction using glm() gamma family

I am using the glm() function in R with link = log to fit my model. I read on various websites that fitted() returns values that can be compared directly with the original data, whereas predict() does not.
I am facing some problem while fitting the model.
data<-read.csv("training.csv")
data$X2 <- as.Date(data$X2, format="%m/%d/%Y")
data$X3 <- as.Date(data$X3, format="%m/%d/%Y")
data_subset <- subset(...)
attach(data_subset)
#define variable
Y<-cbind(Y)
X<-cbind(X1,X2,X3,X4,X5,X6,X7,X8,X9,X10,X11,X12,X14)
# correlation among variables
cor(Y,X)
model <- glm(Y ~ X , data_subset,family=Gamma(link="log"))
summary(model)
detach(data_subset)
validation_data<-read.csv("validation.csv")
validation_data$X2 <- as.Date(validation_data$X2, format="%m/%d/%Y")
validation_data$X3 <- as.Date(validation_data$X3, format="%m/%d/%Y")
attach(validation_data)
predicted_valid<-predict(model, newdata=validation_data)
I am not sure how predict works with a gamma family and log link. I want to transform the predicted values so that they can be compared with the original data. Can someone please help me?
Add type="response" to your predict call, to get predictions on the response scale. See ?predict.glm.
predict(model, newdata=validation_data, type="response")
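A small sketch on simulated data shows how the two scales relate for a Gamma GLM with a log link. The data and variable names here are made up for illustration.

```r
# Simulated data: Gamma response whose mean is exp(1 + 2 * x1)
set.seed(1)
n <- 200
d <- data.frame(x1 = runif(n))
mu <- exp(1 + 2 * d$x1)
d$y <- rgamma(n, shape = 5, rate = 5 / mu)

m <- glm(y ~ x1, family = Gamma(link = "log"), data = d)
new <- data.frame(x1 = c(0.1, 0.5, 0.9))

eta <- predict(m, newdata = new)                       # default: link scale, log of the mean
resp <- predict(m, newdata = new, type = "response")   # scale of the original data

all.equal(exp(eta), resp)                              # inverse link recovers the response scale
all.equal(fitted(m), predict(m, type = "response"))    # fitted() is the response scale, training rows only
```

So predict() without type returns the linear predictor, and applying exp() to it gives the same numbers as type="response".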
It looks to me like fitted doesn't work the way you seem to think it does.
You probably want to use predict there, since you seem to want to pass it data. See ?fitted vs ?predict.

Back Test ARIMA model with Exogenous Regressors

Is there a way to create a holdout/back-test sample in the following ARIMA model with exogenous regressors? Let's say I want to estimate the following model using the first 50 observations and then evaluate model performance on the remaining 20 observations, where the x-variables are pre-populated for all 70 observations. What I really want at the end is a graph that plots actual and fitted values in the development period and in the validation/holdout period (also known as back testing in time series).
library(TSA)
xreg <- cbind(GNP, Time_Scaled_CO) # two time series objects
fit_A <- arima(Charge_Off,order=c(1,1,0),xreg) # Charge_Off is another TS object
plot(Charge_Off,col="red")
lines(predict(fit_A, Data),col="green") #Data contains Charge_Off, GNP, Time_Scaled_CO
You don't seem to be using the TSA package at all, and you don't need to for this problem. Here is some code that should do what you want.
library(forecast)
xreg <- cbind(GNP, Time_Scaled_CO)
training <- window(Charge_Off, end=50)
test <- window(Charge_Off, start=51)
fit_A <- Arima(training,order=c(1,1,0),xreg=xreg[1:50,])
fc <- forecast(fit_A, h=20, xreg=xreg[51:70,])
plot(fc)
lines(test, col="red")
accuracy(fc, test)
See http://otexts.com/fpp/9/1 for an intro to using R with these models.
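If you want to see the mechanics without the forecast package, the same holdout idea can be sketched with stats::arima and predict. This is a dependency-free variant, not the code from the answer above, and the series are simulated stand-ins for Charge_Off, GNP and Time_Scaled_CO.

```r
# Simulated stand-ins for the real series (assumption, for illustration only)
set.seed(42)
n <- 70
gnp <- cumsum(rnorm(n, mean = 0.5))
tco <- rnorm(n)
noise <- cumsum(as.numeric(arima.sim(list(ar = 0.5), n = n)))
y <- ts(5 + 0.8 * gnp - 0.5 * tco + noise)
xreg <- cbind(gnp = gnp, tco = tco)

train_idx <- 1:50
test_idx <- 51:70

# Estimate on the first 50 observations only
fit <- arima(y[train_idx], order = c(1, 1, 0), xreg = xreg[train_idx, ])

# Forecast the holdout period using the known regressor values
pred <- predict(fit, n.ahead = length(test_idx), newxreg = xreg[test_idx, ])

# Holdout error measures
err <- y[test_idx] - as.numeric(pred$pred)
rmse <- sqrt(mean(err^2))
mae <- mean(abs(err))

# Actuals over the full period with holdout forecasts overlaid
plot(y, col = "grey40")
lines(ts(as.numeric(pred$pred), start = 51), col = "red")
```

The key point is that the model never sees observations 51 to 70 of the response; only the regressors for that period are supplied.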

un-log a times series while using the package forecast

Hello, I use the forecast package to do time series forecasting. I would like to know how to un-log a series on the final forecast plot; I don't know how to do this with the forecast package. Here is an example:
library(forecast)
data <- AirPassengers
data <- log(data) # not necessary for the AirPassengers data, but it is for my private data because of some high peaks
ARIMA <- arima(data, order = c(1, 0, 1), list(order = c(12,0, 12), period = 1)) # just a fake ARIMA in this case
plot(forecast(ARIMA, h=24)) # my question: how do I get a forecast plot on the scale of the non-logged AirPassengers data?
So the plot is on the log scale. I want to keep the same ARIMA model but show the forecast on the scale of the non-logged data.
It is not necessary to use the hack proposed by @ndoogan. forecast.Arima has built-in facilities for undoing transformations. The following code will do what is required:
fc <- forecast(ARIMA, h=24, lambda=0)
Better still, build the transformation into the model itself, fitting it to the original (non-logged) series:
ARIMA <- Arima(AirPassengers, order=c(1,0,1), list(order=c(1,0,1),period=12), lambda=0)
fc <- forecast(ARIMA, h=24)
Note that you need to use the Arima function from the forecast package to do this, not the arima function from the stats package.
@Hemmo is correct that this back-transformation will not give the mean of the forecast distribution, and so is not the optimal MSE forecast. However, it will give the median of the forecast distribution, and so will give the optimal MAE forecast.
Finally, the fake model used by @Swiss12000 makes little sense as the seasonal part has frequency 1, and so is confounded with the non-seasonal part. I think you probably meant the model I've used in the code above.
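The median-versus-mean point can be checked numerically. In this sketch, mu and sigma are hypothetical values standing in for a forecast mean and standard deviation on the log scale, assuming normal forecast errors there.

```r
# Hypothetical forecast distribution on the log scale: N(mu, sigma^2)
mu <- 5
sigma <- 0.4

set.seed(1)
draws <- exp(rnorm(1e5, mean = mu, sd = sigma))  # the same distribution on the original scale

# exp(mu) matches the median of the back-transformed distribution
c(simulated_median = median(draws), back_transformed = exp(mu))

# the mean is larger by the factor exp(sigma^2 / 2)
c(simulated_mean = mean(draws), adjusted = exp(mu + sigma^2 / 2))
```

Which one you want depends on the loss function: the median minimises expected absolute error, the mean minimises expected squared error.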
The problem with @ndoogan's answer is that the logarithm is not a linear transformation, which means that E[exp(y)] != exp(E[y]). In fact, Jensen's inequality gives E[exp(y)] >= exp(E[y]). Here's a simple demonstration:
set.seed(1)
x<-rnorm(1000)
mean(exp(x))
[1] 1.685356
exp(mean(x))
[1] 0.9884194
Here's a case concerning the prediction:
# Simulate AR(1) process
set.seed(1)
y<-10+arima.sim(model=list(ar=0.9),n=100)
# Fit on logarithmic scale
fit<-arima(log(y),c(1,0,0))
#Simulate one step ahead
set.seed(123)
y_101_log <- fit$coef[2]*(1-fit$coef[1]) +
fit$coef[1]*log(y[100]) + rnorm(n=1000,sd=sqrt(fit$sigma2))
y_101<-exp(y_101_log) #transform to natural scale
exp(mean(y_101_log)) # This is exp(E(log(y_101)))
[1] 5.86717 # Same as exp(predict(fit,n.ahead=1)$pred)
# differs bit because simulation
mean(y_101) # This is E(exp(log(y_101)))=E(y_101)
[1] 5.904633
# 95% Prediction intervals:
#Naive way:
pred<-predict(fit,n.ahead=1)
c(exp(pred$pred-1.96*pred$se),exp(pred$pred+1.96*pred$se))
pred$pred pred$pred
4.762880 7.268523
# Correct ones:
quantile(y_101,probs=c(0.025,0.975))
2.5% 97.5%
4.772363 7.329826
This also provides a general solution to your problem:
1. Fit your model.
2. Simulate multiple samples from that model (for example, one-step-ahead predictions as above).
3. For each simulated sample, apply the inverse transformation to get values on the original scale.
4. From these simulated samples, compute the expected value as an ordinary mean or, if you need prediction intervals, compute empirical quantiles.
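The steps above can be sketched for a multi-step forecast as follows. This is a minimal illustration that reuses the simulated AR(1) series from earlier in this answer; the horizon h and the number of simulations nsim are arbitrary choices, and the sketch treats the estimated parameters as known (it ignores estimation uncertainty).

```r
# Step 1: fit the model on the log scale
set.seed(1)
y <- 10 + arima.sim(model = list(ar = 0.9), n = 100)
fit <- arima(log(y), order = c(1, 0, 0))
phi <- fit$coef[["ar1"]]
mu <- fit$coef[["intercept"]]   # for stats::arima this is the process mean

# Step 2: simulate future sample paths on the log scale
h <- 12       # forecast horizon (arbitrary)
nsim <- 2000  # number of simulated paths (arbitrary)
paths <- matrix(NA_real_, nrow = nsim, ncol = h)
last <- rep(log(y[100]), nsim)
set.seed(123)
for (k in seq_len(h)) {
  last <- mu + phi * (last - mu) + rnorm(nsim, sd = sqrt(fit$sigma2))
  paths[, k] <- last
}

# Step 3: back-transform every simulated value
orig <- exp(paths)

# Step 4: summarise on the original scale
fc_mean <- colMeans(orig)
fc_int <- apply(orig, 2, quantile, probs = c(0.025, 0.975))
```

Here fc_mean estimates the forecast mean on the original scale, and the rows of fc_int give the lower and upper limits of a 95% prediction interval for each horizon.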
This is a bit of a hack, but it seems to do what you want. Based on your fitted model ARIMA:
fc<-forecast(ARIMA,h=24)
fc$mean<-exp(fc$mean)
fc$upper<-exp(fc$upper)
fc$lower<-exp(fc$lower)
fc$x<-exp(fc$x)
Now plot it
plot(fc)
