Only out-of-sample forecast plot using auto.arima and xreg - r

This is my first post, so sorry if this is clunky or not formatted well.
period texas_u3 national_u3
1976 5.758333333 7.716666667
1977 5.333333333 7.066666667
1978 4.825 6.066666667
1979 4.308333333 5.833333333
1980 5.141666667 7.141666667
1981 5.291666667 7.6
1982 6.875 9.708333333
1983 7.916666667 9.616666667
1984 6.125 7.525
1985 7.033333333 7.191666667
1986 8.75 6.991666667
1987 8.441666667 6.191666667
1988 7.358333333 5.491666667
1989 6.658333333 5.266666667
1990 6.333333333 5.616666667
1991 6.908333333 6.816666667
1992 7.633333333 7.508333333
1993 7.158333333 6.9
1994 6.491666667 6.083333333
1995 6.066666667 5.608333333
1996 5.708333333 5.416666667
1997 5.308333333 4.95
1998 4.883333333 4.508333333
1999 4.666666667 4.216666667
2000 4.291666667 3.991666667
2001 4.941666667 4.733333333
2002 6.341666667 5.775
2003 6.683333333 5.991666667
2004 5.941666667 5.533333333
2005 5.408333333 5.066666667
2006 4.891666667 4.616666667
2007 4.291666667 4.616666667
2008 4.808333333 5.775
2009 7.558333333 9.266666667
2010 8.15 9.616666667
2011 7.758333333 8.95
2012 6.725 8.066666667
2013 6.283333333 7.375
2014 5.1 6.166666667
2015 4.45 5.291666667
2016 4.633333333 4.866666667
2017 4.258333333 4.35
2018 3.858333333 3.9
2019 ____ 3.5114
2020 ____ 3.477
2021 ____ 3.7921
2022 ____ 4.0433
2023 ____ 4.1339
2024 ____ 4.2269
2025 ____ 4.2738
How can one use auto.arima in R with an external regressor to make a forecast but plot only the out-of-sample values? I believe the forecast values are correct, but the years do not match up. I have annual data for 1976-2018 and want to forecast the dependent variable (column 2) through 2025, yet the plot shows a "forecast" for 2019-2068. Weirdly enough, the figures match up well with the sample data (the "forecast" for 2019 seems to be the model prediction for 1980 and so on, all the way through 2068 matching 2025).
I would like to be able to eliminate that and have it so "2062-2068" results are instead 2019-2025. I'll try and include a picture of the plot so it might be easier to visualize my plight.
Below is the R script:
#Download the CSV file: the dependent variable is in the second column, xreg in the third, and years in the first. All columns have headers.
library(forecast)
library(DataCombine)
library(tseries)
library(MASS)
library(TSA)
ts(TXB102[,2], frequency = 1, start = c(1976, 1),end = c(2018, 1)) -> TXB102ts
ts(TXB102[,3], frequency = 1, start = c(1976, 1), end = c(2018,1)) -> TXB102xregtest
ts(TXB102[,3], frequency = 1, start = c(1976, 1), end = c(2025,1)) -> TXB102xreg
as.vector(t(TXB102ts)) -> y
as.vector(t(TXB102xregtest)) -> xregtest
as.vector(t(TXB102xreg)) -> xreg
y <- ts(y,frequency = 1, start = c(1976,1),end = c(2018,1))
xregtest <- ts(xregtest, frequency = 1, start = c(1976,1), end=c(2018,1))
xreg <- ts(xreg, frequency = 1, start = c(1976,1), end=c(2025,1))
summary(y)
plot(y)
ndiffs(y)
ARIMA <- auto.arima(y, trace = TRUE, stepwise = FALSE, approximation = FALSE, xreg=xregtest)
ARIMA
forecast(ARIMA,xreg=xreg)
plot(forecast(ARIMA,xreg=xreg))
The following is a plot of what I get after running the script.
[Image: plot]
TLDR: How do I get the real out-of-sample forecast to plot for 2019-2025, as opposed to the in-sample model fit it is passing along as 2019-2068?
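One likely cause, sketched here under assumptions: the forecast horizon is taken from the number of rows in the future regressor, so passing the full 1976-2025 regressor yields 50 steps, which get labelled 2019-2068. Passing only the 2019-2025 values should give seven steps. The sketch below uses base R's arima() and predict() with a fixed AR(1) order instead of auto.arima(), purely so it runs without extra packages. The order, the rounding of the data to three decimals, and the names texas, national, xreg_train and xreg_future are illustrative assumptions, not part of the original script.

```r
# Data from the table above, rounded to 3 decimals (assumption for brevity).
# Texas U3, 1976-2018 (43 values).
texas <- ts(c(5.758, 5.333, 4.825, 4.308, 5.142, 5.292, 6.875, 7.917,
              6.125, 7.033, 8.750, 8.442, 7.358, 6.658, 6.333, 6.908,
              7.633, 7.158, 6.492, 6.067, 5.708, 5.308, 4.883, 4.667,
              4.292, 4.942, 6.342, 6.683, 5.942, 5.408, 4.892, 4.292,
              4.808, 7.558, 8.150, 7.758, 6.725, 6.283, 5.100, 4.450,
              4.633, 4.258, 3.858), start = 1976)

# National U3, 1976-2025 (50 values, the last 7 being projections).
national <- ts(c(7.717, 7.067, 6.067, 5.833, 7.142, 7.600, 9.708, 9.617,
                 7.525, 7.192, 6.992, 6.192, 5.492, 5.267, 5.617, 6.817,
                 7.508, 6.900, 6.083, 5.608, 5.417, 4.950, 4.508, 4.217,
                 3.992, 4.733, 5.775, 5.992, 5.533, 5.067, 4.617, 4.617,
                 5.775, 9.267, 9.617, 8.950, 8.067, 7.375, 6.167, 5.292,
                 4.867, 4.350, 3.900, 3.511, 3.477, 3.792, 4.043, 4.134,
                 4.227, 4.274), start = 1976)

# Split the regressor: in-sample part for fitting, future part for forecasting.
xreg_train  <- window(national, end = 2018)    # 1976-2018
xreg_future <- window(national, start = 2019)  # 2019-2025 only

# Fixed AR(1) with regressor (an assumed stand-in for the auto.arima() choice).
fit <- arima(texas, order = c(1, 0, 0), xreg = xreg_train, method = "ML")

# One forecast per future regressor value, labelled from 2019 onward.
fc <- predict(fit, n.ahead = length(xreg_future), newxreg = xreg_future)
fc$pred
```

With the forecast package, the analogous change would presumably be forecast(ARIMA, xreg = window(xreg, start = 2019)), and plotting the $mean component of the result should show only the out-of-sample values.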

Related

Create time series in R with weekly measurements for 30 years period

I have a set of weekly data for 30 years (1991 - 2020). The data was collected weekly between 5 May and 10 October every year. This gives me 23 weeks of data every year for 30 years.
I want to create a time series in R with this data. How do I do that, please? There should be just 690 entries in the output, but it is generating 1531 entries. See my code and data below:
I saw a similar question HERE, but mine repeats for 30 years.
myts <- ts(df$Kc_Kamble, start = c(1991, 1), end = c(2020, 23), frequency = 52)
Output in R:
Time Series:
Start = c(1991, 1)
End = c(2020, 23)
Frequency = 52
Sample data:
Year Week Kc_Kamble
1991 1 0.357445197
1991 2 0.36902168
1991 3 0.383675947
1991 4 0.400703221
1991 5 0.418901921
1991 6 0.437049406
1991 7 0.453742803
1991 8 0.467291036
1991 9 0.475942834
1991 10 0.476898402
1991 11 0.464632341
1991 12 0.436298927
1991 13 0.396338825
1991 14 0.352731819
1991 15 0.313539638
1991 16 0.283932169
1991 17 0.2627343
1991 18 0.247373874
1991 19 0.235647483
1991 20 0.225655859
1991 21 0.216663659
1991 22 0.208550065
1991 23 0.203605036
1992 1 0.336754943
1992 2 0.334735193
1992 3 0.342654691
1992 4 0.363520428
1992 5 0.397733301
1992 6 0.4399758
1992 7 0.483592219
1992 8 0.521920773
1992 9 0.548597061
1992 10 0.560150059
1992 11 0.557210705
1992 12 0.542114151
1992 13 0.5173071
1992 14 0.485236257
1992 15 0.448348321
1992 16 0.409089999
1992 17 0.369907993
1992 18 0.333162073
1992 19 0.300014261
1992 20 0.270225988
1992 21 0.243406301
1992 22 0.219247646
1992 23 0.204966601
Let me suggest the following steps to set up and start analyzing your time series.
Initialize your time series by creating a 'dates' sequence and 'data' (set to NA). Use the library xts to create the time series.
library(xts)
dates <- seq(as.Date("1991-01-01"), as.Date("2020-01-01"), by = "weeks")
data <- rep(NA, length(dates))
myxts <- xts(x = data, order.by = dates)
str(myxts); head(myxts); tail(myxts)
Collect your data.
Data is collected weekly between 5 May and 10 October every year.
Let's read the data and work with Weekly Total Precipitation for year 2014.
ts_data <- read.table("https://www.dropbox.com/s/k2cxpja3cpsyoyc/ts_data.txt?dl=1", header =TRUE, sep="\t")
year.2014 <- ts_data[which(ts_data$Year == 2014),]
year.2014 # 23 rows of data for 2014.
start <- as.Date("2014-5-5"); end <- as.Date("2014-10-10")
collect <- which ( index(myxts) >= start & index(myxts) <= end )
myxts[collect] <- year.2014$PRPtot
# year.2014 and collect must have the same number of rows
Verify the collected data. You should see data inside each time window, and NA outside the time windows.
myxts2 <- window(myxts, start=start-50, end=end+50)
str(myxts2); myxts2
Visualize the collected data. You could view the complete time series (i.e. myxts). Note that autoplot drops all NAs.
library(ggplot2)
autoplot(myxts2, geom = "point")
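If a plain ts object is enough, a lighter alternative is to treat each 23-week season as one cycle. The 1531 entries come from frequency = 52: from c(1991, 1) to c(2020, 23) that is 29 * 52 + 23 = 1531 slots, and ts() recycles the 690 values to fill them. The sketch below uses a stand-in vector kc in place of df$Kc_Kamble (an assumption, since the full data is not shown).

```r
# Stand-in for df$Kc_Kamble: 30 seasons x 23 weekly readings = 690 values.
kc <- seq_len(30 * 23)

# frequency = 52 with end = c(2020, 23) asks for 29 * 52 + 23 slots,
# so the 690 values are recycled to fill the extra positions.
wrong <- ts(kc, start = c(1991, 1), end = c(2020, 23), frequency = 52)
length(wrong)

# frequency = 23 gives exactly one slot per observation.
myts <- ts(kc, start = c(1991, 1), frequency = 23)
length(myts)
end(myts)
```

Note that with frequency = 23 the time axis counts seasons rather than calendar weeks, so the gap between October and the following May is ignored.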

How to calculate the average year

I have a 20-year monthly XTS time series
Jan 1990 12.3
Feb 1990 45.6
Mar 1990 78.9
..
Jan 1991 34.5
..
Dec 2009 89.0
I would like to get the average (12-month) year, or
Jan xx
Feb yy
...
Dec kk
where xx is the average of every January, yy of every February, and so on.
I have tried apply.yearly and lapply, but these return 1 value, which is the 20-year total average.
Would you have any suggestions? I appreciate it.
The lubridate package could be useful for you. I would use the functions year() and month() in conjunction with aggregate():
library(xts)
library(lubridate)
#set up some sample data
dates = seq(as.Date('2000/01/01'), as.Date('2005/01/01'), by="month")
df = data.frame(rand1 = runif(length(dates)), rand2 = runif(length(dates)))
my_xts = xts(df, dates)
#get the mean by year
aggregate(my_xts$rand1, by=year(index(my_xts)), FUN=mean)
This outputs something like:
2000 0.5947939
2001 0.4968154
2002 0.4941752
2003 0.5291211
2004 0.6631564
To find the mean for each month you can do:
#get the mean by month
aggregate(my_xts$rand1, by=month(index(my_xts)), FUN=mean)
which will output something like
1 0.5560279
2 0.6352220
3 0.3308571
4 0.6709439
5 0.6698147
6 0.7483192
7 0.5147294
8 0.3724472
9 0.3266859
10 0.5331233
11 0.5490693
12 0.4642588
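The same grouping idea can be sketched in base R without lubridate, using format() on the dates as the grouping key. The data here is a made-up two-year series (values 1 to 24) chosen so the monthly means can be checked by hand; dates, values and monthly_mean are illustrative names.

```r
# Two years of monthly values 1..24, so each month's mean is easy to verify.
dates  <- seq(as.Date("2000-01-01"), by = "month", length.out = 24)
values <- 1:24

# Group by two-digit month ("01".."12") and average across the years.
monthly_mean <- tapply(values, format(dates, "%m"), mean)

monthly_mean[["01"]]  # January:  (1 + 13) / 2 = 7
monthly_mean[["12"]]  # December: (12 + 24) / 2 = 18
```

For an xts object, the same call should work with coredata(my_xts$rand1) as the values and index(my_xts) as the dates.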

Boxplot not plotting all data

I'm trying to plot a boxplot for a time series (e.g. http://www.r-graph-gallery.com/146-boxplot-for-time-series/) and can get every other example to work, bar my last one. I have averages per month for six years (2011 to 2016) and have data for 2014 and 2015 (albeit in small quantities), but for some reason, boxes aren't being shown for the 2014 and 2015 data.
My input data has three columns: year, month and residency index (a value between 0 and 1). There are multiple individuals (in this example, 37) each with an average residency index per month per year (including 2014 and 2015).
For example:
year month RI
2015 1 NA
2015 2 NA
2015 3 NA
2015 4 NA
2015 5 NA
2015 6 NA
2015 7 0.387096774
2015 8 0.580645161
2015 9 0.3
2015 10 0.225806452
2015 11 0.3
2015 12 0.161290323
2016 1 0.096774194
2016 2 0.103448276
2016 3 0.161290323
2016 4 0.366666667
2016 5 0.258064516
2016 6 0.266666667
2016 7 0.387096774
2016 8 0.129032258
2016 9 0.133333333
2016 10 0.032258065
2016 11 0.133333333
2016 12 0.129032258
which is repeated for each individual fish.
My code:
#make boxplot
boxplot(RI$RI~RI$month+RI$year,
xaxt="n",xlab="",col=my_colours,pch=20,cex=0.3,ylab="Residency Index (RI)", ylim=c(0,1))
abline(v=seq(0,12*6,12)+0.5,col="grey")
axis(1,labels=unique(RI$year),at=seq(6,12*6,12))
The average trend line works as per the other examples.
a=aggregate(RI$RI,by=list(RI$month,RI$year),mean, na.rm=TRUE)
lines(a[,3],type="l",col="red",lwd=2)
Any help on this matter would be greatly appreciated.
Your problem seems to be the presence of missing values (NA) in your data; the other values are plotted correctly. I've simplified your code a bit.
boxplot(RI$RI ~ RI$month + RI$year,
ylab="Residency Index (RI)")
a <- aggregate(RI ~ month + year, data = RI, FUN = mean, na.rm = TRUE)
lines(c(rep(NA, 6), a[,3]), type="l", col="red", lwd=2)
Also, I believe that maybe a boxplot is not the best way to depict your data. You only have one value per year/month, when a boxplot would require more. Maybe a simple scatter plot will do better.
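To see why the trend line needs padding with rep(NA, 6), it may help to compare the two aggregate() interfaces on the sample rows from the question. This is a small sketch using only those 24 rows (one individual), so the counts below apply to this sample, not the full dataset.

```r
# Sample rows from the question: 2015 has NA for months 1-6.
RI <- data.frame(
  year  = rep(c(2015, 2016), each = 12),
  month = rep(1:12, times = 2),
  RI    = c(rep(NA, 6),
            0.387096774, 0.580645161, 0.3, 0.225806452, 0.3, 0.161290323,
            0.096774194, 0.103448276, 0.161290323, 0.366666667,
            0.258064516, 0.266666667, 0.387096774, 0.129032258,
            0.133333333, 0.032258065, 0.133333333, 0.129032258)
)

# Formula interface: rows with NA are dropped before grouping,
# so the all-NA month/year groups disappear from the result.
a_formula <- aggregate(RI ~ month + year, data = RI, FUN = mean)
nrow(a_formula)

# 'by' interface: every month/year group is kept; all-NA groups give NaN.
a_by <- aggregate(RI$RI, by = list(month = RI$month, year = RI$year),
                  FUN = mean, na.rm = TRUE)
nrow(a_by)
```

Because the formula result is shorter than the number of boxes, plotting it with lines() shifts the trend to the left unless the missing positions are filled back in.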

decompose() for yearly time series in R

I'm trying to perform analysis on a time series data of inflation rates from the year 1960 to 2015. The dataset is a yearly time series over 56 years with 1 real value per each year, which is the following:
Year Inflation percentage
1960 1.783264746
1961 1.752021563
1962 3.57615894
1963 2.941176471
1964 13.35403727
1965 9.479452055
1966 10.81081081
1967 13.0532972
1968 2.996404315
1969 0.574712644
1970 5.095238095
1971 3.081105573
1972 6.461538462
1973 16.92815855
1974 28.60169492
1975 5.738605162
1976 -7.63438068
1977 8.321619342
1978 2.517518817
1979 6.253164557
1980 11.3652609
1981 13.11510484
1982 7.887270664
1983 11.86886396
1984 8.32157969
1985 5.555555556
1986 8.730811404
1987 8.798689021
1988 9.384775808
1989 3.26256011
1990 8.971233545
1991 13.87024609
1992 11.78781925
1993 6.362038664
1994 10.21150033
1995 10.22488756
1996 8.977149075
1997 7.16425362
1998 13.2308409
1999 4.669821024
2000 4.009433962
2001 3.684807256
2002 4.392199745
2003 3.805865922
2004 3.76723848
2005 4.246353323
2006 6.145522388
2007 6.369996746
2008 8.351816444
2009 10.87739112
2010 11.99229692
2011 8.857845297
2012 9.312445605
2013 10.90764331
2014 6.353194544
2015 5.872426595
'stock1' contains my data where the first column stands for Year, and the second for 'Inflation.percentage', as follows:
stock1<-read.csv("India-Inflation time series.csv", header=TRUE, stringsAsFactors=FALSE, as.is=TRUE)
The following is my code for creating the time series object:
stock <- ts(stock1$Inflation.percentage,start=(1960), end=(2015),frequency=1)
Following this, I am trying to decompose the time series object 'stock' using the following line of code:
decom_add <- (decompose(stock, type ="additive"))
Here I get an error:
Error in decompose(stock, type = "additive") : time series has no
or less than 2 periods
Why is this so? I initially thought it has something to do with frequency, but since the data is annual, the frequency has to be 1 right? If it is 1, then aren't there definitely more than 2 periods in the data?
Why isn't decompose() working? What am I doing wrong?
Thanks a lot in advance!
Try frequency=2, because decompose() needs a frequency greater than 1. Since this changes your model, though, I think the better way is to load data that also contains a month column, so that the frequency can be 12.
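A short sketch of what happens with annual data: with frequency = 1 there is no seasonal cycle for decompose() to estimate, so it stops with an error. If only a trend is wanted, one option is a centred moving average via stats::filter(). The 3-year window and the use of only the first ten years of the series are assumptions made to keep the example small.

```r
# First ten years of the inflation series from the question (1960-1969).
stock <- ts(c(1.783264746, 1.752021563, 3.57615894, 2.941176471,
              13.35403727, 9.479452055, 10.81081081, 13.0532972,
              2.996404315, 0.574712644),
            start = 1960, frequency = 1)

# decompose() rejects any series whose frequency is 1; capture the message.
msg <- tryCatch({
  decompose(stock, type = "additive")
  "no error"
}, error = function(e) conditionMessage(e))
msg

# No seasonal component exists, so estimate a rough trend directly
# with a centred 3-year moving average (window length is an assumption).
trend <- stats::filter(stock, rep(1 / 3, 3), sides = 2)
remainder <- stock - trend
```

The first and last values of trend are NA because the centred window runs off the ends of the series.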

plot data with different dates

I have some trouble with the plots of my dataset.
This is an extract of my dataset.
Date Month Year Value 1
30/05/96 May 1996 1835
06/12/96 December 1996 1770
18/03/97 March 1997 1640
27/06/97 June 1997 1379
30/09/97 September 1997 1195
24/11/97 November 1997 1335
13/03/98 March 1998 1790
07/05/98 May 1998 349
14/07/98 July 1998 1179
27/10/98 October 1998 665
What I would like to do is a plot with Value 1 (y) against the month (x) for every year. In other words, a plot with 3 lines that show the variation of Value 1 every month in the different years.
I do the following:
plot(x[Year==1996,4], xaxt="n")
par(new=T)
plot(x[Year==1997,4], xaxt="n")
axis(1, at=1:length(x$Month), labels=x$Month)
The problem is that the first value of 1996 refers to may, and the first value of 1997 refers to march. Due to that, the values plotted are mixed and don't correspond to their month anymore.
Is there a way to plot all these values in the same graph keeping the original correspondence of the data?
df <- read.table(text="Date Month Year Value1
30/05/96 May 1996 1835
06/12/96 December 1996 1770
18/03/97 March 1997 1640
27/06/97 June 1997 1379
30/09/97 September 1997 1195
24/11/97 November 1997 1335
13/03/98 March 1998 1790
07/05/98 May 1998 349
14/07/98 July 1998 1179
27/10/98 October 1998 665", header=T, as.is=T)
df$Month <- factor(df$Month, levels=month.name, ordered=T)
library(ggplot2)
ggplot(df) + geom_line(aes(Month, Value1, group=Year)) +
facet_grid(Year~.)
And a lattice alternative using @Michele's df. I show here the 2 alternatives (with and without faceting).
library(lattice)
library(gridExtra)
p1 <- xyplot(Value1~Month,groups=Year,data=df,
type=c('p','l'),auto.key=list(columns=3,lines=TRUE))
p2 <- xyplot(Value1~Month|Year,groups=Year,data=df,layout= c(1,3),
type=c('p','l'),auto.key=list(columns=3,lines=TRUE))
grid.arrange(p1,p2)
Create a numeric value for your months:
x$MonthNum <- sapply(x$Month, function(x) which(x==month.name))
Then plot using those numeric values, but label the axes with words.
plot(NA, xaxt="n", xlab="Month", xlim=c(0,13),
ylim=c(.96*min(x$Value1),1.04*max(x$Value1)), type="l")
z <- sapply(1996:1998, function(y) with(x[x$Year==y,], lines(MonthNum, Value1)))
axis(1, at=1:12, labels=month.name)
And some labels, if you want to identify years:
xlabpos <- tapply(x$MonthNum, x$Year, max)
ylabpos <- mapply(function(mon, year) x$Value1[x$MonthNum==mon & x$Year==year],
xlabpos, dimnames(xlabpos)[[1]])
text(x=xlabpos+.5, y=ylabpos, labels=dimnames(xlabpos)[[1]])
One could also obtain something similar to the ggplot example using layout:
par(mar=c(2,4,1,1))
layout(matrix(1:3))
z <- sapply(1996:1998, function(y) {
with(x[x$Year==y,], plot(Value1 ~ MonthNum, xaxt="n", xlab="Month", ylab=y,
xlim=c(0,13), ylim=c(.96*min(x$Value1),1.04*max(x$Value1)), type="l"))
axis(1, at=1:12, labels=month.name)
})
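As a side note, the month numbers used in the base-graphics answer can also be obtained in one vectorised step with match() against month.name, and split() then gives one small data frame per year. This is a minimal sketch; the data frame is rebuilt from the question's extract without the Date column, and by_year is an illustrative name.

```r
# Extract from the question (Date column omitted for brevity).
df <- data.frame(
  Month  = c("May", "December", "March", "June", "September",
             "November", "March", "May", "July", "October"),
  Year   = c(1996, 1996, 1997, 1997, 1997, 1997, 1998, 1998, 1998, 1998),
  Value1 = c(1835, 1770, 1640, 1379, 1195, 1335, 1790, 349, 1179, 665)
)

# Position of each month name in month.name is its month number (1-12).
df$MonthNum <- match(df$Month, month.name)

# One data frame per year, each keeping the month/value correspondence.
by_year <- split(df[, c("MonthNum", "Value1")], df$Year)
names(by_year)
```

Each element of by_year can then be passed to lines(), so every year is drawn against its true month positions.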
