Time series forecast cross-validation - r

The project I'm working on is based on an hourly time series of the Belgian electricity price for the period 2010-01-04 to 2016-10-30. My datasets are shown below:
Train Set :
> dput(head(data1))
structure(list(datetime_utc = c("2010-01-04 00:00:00", "2010-01-04 01:00:00",
"2010-01-04 02:00:00", "2010-01-04 03:00:00", "2010-01-04 04:00:00",
"2010-01-04 05:00:00"), Generation_BE = c(13143.7, 13143.7, 13143.7,
13143.7, 13143.7, 13143.7), Generation_FR = c(63599, 62212, 62918,
62613, 62432, 63411), Prices.BE = c(37.15, 33.47, 28, 21.29,
16.92, 28), holidaysBE = c(0L, 0L, 0L, 0L, 0L, 0L)), row.names = c(NA,
6L), class = "data.frame")
Test Set :
> dput(head(data2))
structure(list(datetime_utc = c("2016-10-24 00:00:00", "2016-10-24 01:00:00",
"2016-10-24 02:00:00", "2016-10-24 03:00:00", "2016-10-24 04:00:00",
"2016-10-24 05:00:00"), Generation_BE = c(9615.7075, 9626.865,
9648.0025, 9668.42, 9681.805, 9688.425), Generation_FR = c(45605L,
44629L, 44073L, 44359L, 44056L, 44799L), Prices.BE = c(44.6,
40.92, 37.39, 36.4, 33.01, 37.89), holidaysBE = c(0L, 0L, 0L,
0L, 0L, 0L)), row.names = c(NA, 6L), class = "data.frame")
I need to use a time series forecasting method to forecast the Belgian prices (Prices.BE) over a horizon of 168 (hourly forecasts for the next 7 days). The model must be built using the Train Set data.
I have made the time series :
library(dplyr)
library(zoo)
colSums(is.na(data1))
newdata1 <- data1 %>%
mutate(across(where(~ is.numeric(.) && anyNA(.)),
na.aggregate, FUN = median))
colSums(is.na(newdata1))
# Extract Belgium prices time series from data
# ts() expects a numeric start, not a Date; 2010-01-04 00:00 is hour 3 * 24 + 1 of 2010
belgiump_ts <- ts(newdata1$Prices.BE, start = c(2010, 3 * 24 + 1), frequency = 365*24)
So I decided to use an ARIMA model to forecast the time series.
The problem I'm facing is that I can't figure out how to cross-validate the ARIMA forecast.
I'm quite confused about this and don't know how to start building it.
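One common way to approach this is rolling-origin evaluation: fit the model on an initial window, forecast 168 steps ahead, compare with the held-out values, then move the origin forward and repeat. Below is a minimal sketch using only base R. The toy series, the ARIMA order c(1, 0, 0), the initial window and the step size are all assumptions made for illustration, not choices taken from the question; with the real data, belgiump_ts would take the place of y.

```r
# Rolling-origin evaluation sketch with stats::arima (base R only).
# Toy AR(1) series standing in for the hourly price series (assumption).
set.seed(1)
y <- 50 + arima.sim(model = list(ar = 0.7), n = 1200)

h       <- 168       # forecast horizon: 7 days of hourly values
initial <- 24 * 21   # size of the first training window (assumption)
step    <- 168       # how far the origin moves between folds (assumption)

# Last training index of each fold; every fold keeps h values for testing
origins <- seq(initial, length(y) - h, by = step)

# One column of forecast errors per fold, one row per horizon step
cv_errors <- sapply(origins, function(end_idx) {
  train  <- ts(y[seq_len(end_idx)], frequency = 24)
  fit    <- arima(train, order = c(1, 0, 0))   # placeholder order
  fc     <- predict(fit, n.ahead = h)$pred
  actual <- y[(end_idx + 1):(end_idx + h)]
  as.numeric(actual) - as.numeric(fc)
})

rmse_by_fold <- sqrt(colMeans(cv_errors^2))
rmse_by_fold
```

Averaging rmse_by_fold gives a single cross-validated error that can be compared across candidate ARIMA orders.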

Related

Using breaks in Scale_x_datetime

I am trying to plot a graph with x-axis limits from 2022-05-29 07:00:00 to 2022-05-29 07:00:00 PM, with 1-hour breaks within these limits. I am using scale_x_datetime so that the axis is treated as a continuous variable instead of a discrete one. I created a vector with all the breaks, called axisorder.
c("2022-05-29 07:00:00", "2022-05-29 08:00:00", "2022-05-29 09:00:00",
"2022-05-29 10:00:00", "2022-05-29 11:00:00", "2022-05-29 12:00:00",
"2022-05-29 01:00:00", "2022-05-29 02:00:00", "2022-05-29 03:00:00",
"2022-05-29 04:00:00", "2022-05-29 05:00:00", "2022-05-29 06:00:00",
"2022-05-29 07:00:00", "2022-05-29 08:00:00")
Here is my data:
structure(list(Id = c("user_6", "user_6", "user_6", "user_6",
"user_6", "user_6", "user_6", "user_6", "user_6", "user_6", "user_6",
"user_6", "user_6", "user_6", "user_6", "user_6", "user_6", "user_6",
"user_6", "user_6", "user_6", "user_6", "user_6", "user_6"),
ActivityHour = structure(c(1653825600, 1653786000, 1653789600,
1653793200, 1653796800, 1653800400, 1653804000, 1653807600,
1653811200, 1653814800, 1653818400, 1653822000, 1653825600,
1653786000, 1653789600, 1653793200, 1653796800, 1653800400,
1653804000, 1653807600, 1653811200, 1653814800, 1653818400,
1653822000), class = c("POSIXct", "POSIXt"), tzone = ""),
TotalIntensity = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 28L, 13L,
13L, 143L, 20L, 11L, 19L, 54L, 9L, 19L, 24L, 19L, 9L, 0L,
0L, 0L, 0L), AverageIntensity = c(0, 0, 0, 0, 0, 0, 0, 0.466667,
0.216667, 0.216667, 2.383333, 0.333333, 0.183333, 0.316667,
0.9, 0.15, 0.316667, 0.4, 0.316667, 0.15, 0, 0, 0, 0)), row.names = c(NA,
-24L), class = "data.frame")
And last but not least, my graph code:
ggplot(data=df2, aes(x = ActivityHour, y = AverageIntensity)) +
geom_point() +
geom_line() +
theme(axis.text.x = element_text(angle = 45)) +
scale_x_datetime(breaks = "axisorder"), limits = c(as.POSIXct("2022-05-29 07:00:00"),as.POSIXct("2022-05-29 07:00:00")) +
ggtitle("Average Intensity user_6",
subtitle = "4-12-2016")
I just don't understand how to set up scale_x_datetime to have those limits and breaks. My x-axis data is already POSIXct.
Any help would be appreciated!
UPDATE -- Here is where I am at..
I have converted my df2$ActivityHour to POSIXct
1653800400, 1653804000, 1653807600, 1653811200, 1653814800, 1653818400,
1653822000, 1653825600, 1653786000, 1653789600, 1653793200, 1653796800,
1653800400, 1653804000, 1653807600, 1653811200, 1653814800, 1653818400,
1653822000), class = c("POSIXct", "POSIXt"), tzone = "")
I then set date_breaks = "hour" and changed my limits to limits = c(as.POSIXct("2022-05-29 07:00:00"), as.POSIXct("....))
My new plot code looks like this:
ggplot(data=df2, aes(x = ActivityHour, y = AverageIntensity)) +
geom_point() +
geom_line() +
theme(axis.text.x = element_text(angle = 45)) +
scale_x_datetime(date_breaks = "hour", limits = c(as.POSIXct("2022-05-29 07:00:00"),as.POSIXct("2022-05-29 19:00:00"))) +
ggtitle("Average Intensity user_6",
subtitle = "4-12-2016")
My plot comes out wrong, as if df2$AverageIntensity is not lined up with the df2$ActivityHour values. For instance, 13:00:00 no longer corresponds to an AverageIntensity of 0.31; instead there is no data. I understand this is because the df2$ActivityHour values after 12:00:00 should be 13:00:00, 14:00:00, 15:00:00, etc., not wrap back to 01:00:00. What is the quickest way to fix this?
UPDATE2 -- I fixed this with the replace function
df2$ActivityHour <- replace(df2$ActivityHour,1,"2022-05-29 00:00:00")
and then replaced each remaining value by position.
Syntax: replace(x, list, values), i.e. the vector, the position(s) to replace, and the replacement value(s)
You have a couple of issues.
Firstly, your axisorder is in character format rather than the required POSIXct. Secondly, you have put "axisorder" in quotation marks, so ggplot thinks you want a single break at a location called "axisorder". You don't actually need axisorder anyway, since, as Stefan points out, you can use date_breaks = "hour"
Thirdly, and most importantly, you need to use 24-hour time formatting, so your second limit should be "2022-05-29 19:00:00". The same is also true for all the times in your data frame. The times after midday require 12 hours added on to them (and the initial value at midnight needs 12 hours removed):
df2$ActivityHour[14:24] <- df2$ActivityHour[14:24] + 12 * 3600
df2$ActivityHour[1] <- df2$ActivityHour[1] - 12 * 3600
ggplot(data=df2, aes(x = ActivityHour, y = AverageIntensity)) +
geom_point() +
geom_line() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
scale_x_datetime(date_breaks = "hour",
limits = c(as.POSIXct("2022-05-29 07:00:00"),
as.POSIXct("2022-05-29 19:00:00"))) +
ggtitle("Average Intensity user_6", subtitle = "4-12-2016")
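If explicit breaks are preferred over date_breaks, they have to be POSIXct values rather than character strings. Here is a small base R sketch of building them and of the 12-hour shift; the tz = "UTC" setting and the variable names are assumptions for illustration only.

```r
# Build hourly POSIXct breaks between the two limits (24-hour clock).
tz   <- "UTC"   # assumed time zone, adjust to match the data
lims <- as.POSIXct(c("2022-05-29 07:00:00", "2022-05-29 19:00:00"), tz = tz)
hourly_breaks <- seq(lims[1], lims[2], by = "hour")

# A PM reading that was stored as 01:00 is shifted by 12 hours (in seconds)
pm_reading <- as.POSIXct("2022-05-29 01:00:00", tz = tz)
fixed <- pm_reading + 12 * 3600
format(fixed, "%H:%M")
```

These could then be supplied as scale_x_datetime(breaks = hourly_breaks, limits = lims), without quotation marks around the object name.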

Duplicate dates in xts object lead to wrong indexing

I have data with users and duplicated dates e.g. users accessing a web site.
Example:
require(zoo)
require(xts)
test <- structure(list(timestamp = c("2013-03-06 01:00:00", "2014-07-06 21:00:00",
"2014-07-31 23:00:00", "2014-08-09 17:00:00", "2014-08-14 20:00:00",
"2014-08-14 22:00:00", "2014-08-16 15:00:00", "2014-08-19 02:00:00",
"2014-12-28 18:00:00", "2015-01-17 17:00:00"), user = c(1, 2,
2, 3, 3, 3, 3, 3, 4, 4)), .Names = c("timestamp", "user"), row.names = c("220667",
"331481", "422653", "629430", "378111", "646137", "558638", "151641",
"599370", "482750"), class = "data.frame")
If I create an xts object and then index it with its own index, I get different data. What am I doing wrong here?
testXts <- xts(x=test,order.by = as.Date(test$timestamp))
testXts[index(testXts)]#Different (wrong) data. Why ?

R calculate daily mean data from irregular hourly data

I have a large data set that I am trying to reformat, but I cannot get it to aggregate to daily values.
dataset1_updated<- structure(list(X = 1:5, Time = structure(c(1L, 1L, 2L, 3L, 3L), .Label = c("7/29/11 10:29", "7/29/11 10:30", "7/29/11 10:32"
), class = "factor"), O3 = c(32.032608222367, 32.032608222367,
32.032608222367, 32.032608222367, 32.032608222367), SO2 = c(2.611,
2.605, 2.744, 2.767, 2.778), NO = c(0.081, 0.081, 0.081, 0.081,
0.081), NO2 = c(1.938, 1.912, 1.912, 1.896, 1.863), NOx = c(2.019,
1.993, 1.993, 1.977, 1.944)), .Names = c("X", "Time", "O3",
"SO2", "NO", "NO2", "NOx"), row.names = c(NA, 5L), class = "data.frame")
I convert the data set to an xts object and apply a daily mean function, but the results are "NA". Could you please tell me what is missing?
x <- as.xts(as.POSIXct(dataset1_updated$Time, format="%m/%d/%Y %H:%M"))
x_up<- apply.daily(x, colMeans)
write.csv(as.data.frame(as.matrix(x_up)), file="test")
thank you,
We need to change the xts call, because as.xts is being applied only to the datetime vector and not to the entire dataset. Note also that the two-digit year is parsed with %y rather than %Y:
xt1 <- xts(dataset1_updated[-(1:2)], order.by = as.POSIXct(dataset1_updated$Time,
format = "%m/%d/%y %H:%M"))
x_up <- apply.daily(xt1, colMeans)
x_up
# O3 SO2 NO NO2 NOx
#2011-07-29 10:32:00 32.03261 2.701 0.081 1.9042 1.9852
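For comparison, the same daily mean can be sketched without xts by grouping on the calendar date. This uses only two of the columns from the question's sample, and the tz = "UTC" setting is an assumption.

```r
# Daily means in base R: parse the timestamps, then aggregate by date.
df <- data.frame(
  Time = c("7/29/11 10:29", "7/29/11 10:29", "7/29/11 10:30",
           "7/29/11 10:32", "7/29/11 10:32"),
  SO2  = c(2.611, 2.605, 2.744, 2.767, 2.778),
  NO2  = c(1.938, 1.912, 1.912, 1.896, 1.863)
)

# %y reads the two-digit year "11" as 2011
stamp <- as.POSIXct(df$Time, format = "%m/%d/%y %H:%M", tz = "UTC")

# One row per calendar day, holding the mean of each numeric column
daily <- aggregate(df[c("SO2", "NO2")],
                   by = list(day = as.Date(stamp)),
                   FUN = mean)
daily
```

The SO2 and NO2 means agree with the apply.daily output above (2.701 and 1.9042).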

convert List to zoo and use rollapply on the List

I would like to convert a list object to zoo and then apply rollapply on the zoo object. Short example reproduced below (I have 90,000 such files to process, using UNIX:)). Assume my list has two dataframes.
1) I would like to convert the date in each of the dataframes to this format:
dates <- as.Date(paste0(mylist$year, "-", mylist$month, "-", mylist$day), format="%Y-%m-%d")
z <- zoo(mylist, order.by=mylist[,1])
I understand lapply can do this but I tried without success.
Once I get my zoo object, I would like to use rollapply:
library(hydroTSM)#for daily2annual function but aggregate can do
x.3max <- rollapply(data=zooobject, width=3, FUN=sum, fill=NA, partial= TRUE,
align="center")
# Maximum value per year of 3-day total rainfall for each one of the simulations
z.3max.annual <- daily2annual(x.3max, FUN=max, na.rm=TRUE) # dates=1
What the code above does is center a 3-day window on each column of the dataframes in zooobject and sum the values. Then, the max per year of the 3-day total is extracted.
mylist<- list(a,a)
mylist<-lapply(mylist, function(x) x[x[["Month"]] %in% c(12,1,2),])# extract data for DJF for individual sites
library(zoo)
a= structure(list(Year = c(1975L, 1975L, 1975L, 1975L, 1975L, 1975L
), Month = c(1L, 1L, 1L, 1L, 1L, 1L), Site = structure(c(1L,
1L, 1L, 1L, 1L, 1L), .Label = "G100", class = "factor"), Day = 1:6,
sim01 = c(28.49, 29.04, 27.62, 28.43, 28.69, 29.16), sim02 = c(29.49,
30.04, 28.62, 29.43, 29.69, 30.16), sim03 = c(30.49, 31.04,
29.62, 30.43, 30.69, 31.16), sim04 = c(31.49, 32.04, 30.62,
31.43, 31.69, 32.16), sim05 = c(32.49, 33.04, 31.62, 32.43,
32.69, 33.16), sim06 = c(33.49, 34.04, 32.62, 33.43, 33.69,
34.16), sim07 = c(34.49, 35.04, 33.62, 34.43, 34.69, 35.16
), sim08 = c(35.49, 36.04, 34.62, 35.43, 35.69, 36.16), sim09 = c(36.49,
37.04, 35.62, 36.43, 36.69, 37.16), sim10 = c(37.49, 38.04,
36.62, 37.43, 37.69, 38.16), sim11 = c(38.49, 39.04, 37.62,
38.43, 38.69, 39.16), sim12 = c(39.49, 40.04, 38.62, 39.43,
39.69, 40.16), sim13 = c(40.49, 41.04, 39.62, 40.43, 40.69,
41.16), sim14 = c(41.49, 42.04, 40.62, 41.43, 41.69, 42.16
), sim15 = c(42.49, 43.04, 41.62, 42.43, 42.69, 43.16), sim16 = c(43.49,
44.04, 42.62, 43.43, 43.69, 44.16), sim17 = c(44.49, 45.04,
43.62, 44.43, 44.69, 45.16), sim18 = c(45.49, 46.04, 44.62,
45.43, 45.69, 46.16), sim19 = c(46.49, 47.04, 45.62, 46.43,
46.69, 47.16), sim20 = c(47.49, 48.04, 46.62, 47.43, 47.69,
48.16)), .Names = c("Year", "Month", "Site", "Day", "sim01",
"sim02", "sim03", "sim04", "sim05", "sim06", "sim07", "sim08",
"sim09", "sim10", "sim11", "sim12", "sim13", "sim14", "sim15",
"sim16", "sim17", "sim18", "sim19", "sim20"), row.names = c(NA,
6L), class = "data.frame")
Output should be similar to:
Year Site Sim01...
1975 G100 ...
1976 G100 ...
1977 G100 ...
Only the values in the months c(12,1,2) are needed.
This produces a list of zoo objects, Lz, and then performs rollapply on each component of the list giving L2. Finally L3 aggregates over year taking the max of each column.
library(zoo)
mylist <- list(a, a) # a is given at bottom of question
Lz <- lapply(mylist, read.zoo, index = 1:3, format = "%Y %m %d")
L2 <- lapply(Lz, rollapply, 3, sum, partial = TRUE)
L3 <- lapply(L2, function(z) aggregate(z, as.numeric(format(time(z), "%Y")), max))
giving:
> L3
[[1]]
sim01 sim02 sim03 sim04 sim05 sim06 sim07 sim08 sim09 sim10 sim11
1975 86.28 89.28 92.28 95.28 98.28 101.28 104.28 107.28 110.28 113.28 116.28
sim12 sim13 sim14 sim15 sim16 sim17 sim18 sim19 sim20
1975 119.28 122.28 125.28 128.28 131.28 134.28 137.28 140.28 143.28
[[2]]
sim01 sim02 sim03 sim04 sim05 sim06 sim07 sim08 sim09 sim10 sim11
1975 86.28 89.28 92.28 95.28 98.28 101.28 104.28 107.28 110.28 113.28 116.28
sim12 sim13 sim14 sim15 sim16 sim17 sim18 sim19 sim20
1975 119.28 122.28 125.28 128.28 131.28 134.28 137.28 140.28 143.28
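To see where the 86.28 for sim01 comes from, here is a base R sketch of the same two steps (centered 3-day sum with partial windows at the ends, then the yearly maximum). The helper roll_sum_partial is hypothetical and written only for this illustration; it assumes an odd window width.

```r
# Centered rolling sum that shrinks the window at the ends of the series,
# mirroring rollapply(..., partial = TRUE). Hypothetical helper.
roll_sum_partial <- function(x, width = 3) {
  half <- (width - 1) %/% 2
  n <- length(x)
  vapply(seq_len(n), function(i) {
    sum(x[max(1, i - half):min(n, i + half)])
  }, numeric(1))
}

sim01 <- c(28.49, 29.04, 27.62, 28.43, 28.69, 29.16)  # from the question
year  <- rep(1975L, length(sim01))

rolled     <- roll_sum_partial(sim01)    # 3-day totals
annual_max <- tapply(rolled, year, max)  # maximum 3-day total per year
annual_max
```

The largest 3-day total is 28.43 + 28.69 + 29.16 = 86.28, centered on day 5.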
Solved
lst1 <- lapply(list.files(pattern=".csv"),function(x) read.table(x,header=TRUE,sep="")) # read all files
lst2<-lapply(lst1, function(x) x[x[["Month"]] %in% c(6,7,8),]) # c(6,7,8) extracts June-August (JJA) data for individual sites
names(lst2)<-list.files(pattern=".csv")
lapply(lst2,tail,4)
lst3<-lapply(lst2, function(x) x[!(names(x) %in% c("Site"))])
Lz <- lapply(lst3, read.zoo, index = 1:3, format = "%Y %m %d")
L2 <- lapply(Lz, rollapply, 3, sum, partial = TRUE)
L3 <- lapply(L2, function(z) aggregate(z, as.numeric(format(time(z), "%Y")), max))
mapply(
write.table,
x=L3, file=paste(names(L3), "csv", sep="."),
MoreArgs=list(row.names=FALSE, sep=",")
) # write files to folder keeping the list names as file names

To merge list's members with differing number of rows [duplicate]

Here is my list that you can run in your console (please, tell me if it's too long for example purposes, I can amend it):
my_list = list(structure(list(PX_LAST = c(0.398, 0.457, 0.4, 0.159, 0.126,
0.108, 0.26, 0.239, 0.222, 0.191, 0.184)), .Names = "PX_LAST", row.names = c("2014-04-28 00:00:00",
"2014-04-29 00:00:00", "2014-04-30 00:00:00", "2014-05-02 00:00:00",
"2014-05-05 00:00:00", "2014-05-06 00:00:00", "2014-05-07 00:00:00",
"2014-05-08 00:00:00", "2014-05-09 00:00:00", "2014-05-12 00:00:00",
"2014-05-13 00:00:00"), class = "data.frame"), structure(list(
PX_LAST = c(1.731, 1.706, 1.7095, 1.69, 1.713, 1.711, 1.724,
1.699, 1.702, 1.705, 1.649, 1.611)), .Names = "PX_LAST", row.names = c("2014-04-29 00:00:00",
"2014-04-30 00:00:00", "2014-05-01 00:00:00", "2014-05-02 00:00:00",
"2014-05-05 00:00:00", "2014-05-06 00:00:00", "2014-05-07 00:00:00",
"2014-05-08 00:00:00", "2014-05-09 00:00:00", "2014-05-12 00:00:00",
"2014-05-13 00:00:00", "2014-05-14 00:00:00"), class = "data.frame"),
structure(list(PX_LAST = c(0.481, 0.456, 0.448, 0.439, 0.436,
0.448, 0.458, 0.466, 0.432, 0.437, 0.441, 0.417, 0.4035)), .Names = "PX_LAST", row.names = c("2014-04-28 00:00:00",
"2014-04-29 00:00:00", "2014-04-30 00:00:00", "2014-05-01 00:00:00",
"2014-05-02 00:00:00", "2014-05-05 00:00:00", "2014-05-06 00:00:00",
"2014-05-07 00:00:00", "2014-05-08 00:00:00", "2014-05-09 00:00:00",
"2014-05-12 00:00:00", "2014-05-13 00:00:00", "2014-05-14 00:00:00"
), class = "data.frame"), structure(list(PX_LAST = c(1.65,
1.65, 1.64, 1.65, 1.662, 1.6595, 1.665, 1.6595, 1.6625, 1.652,
1.645, 1.6245, 1.627, 1.633)), .Names = "PX_LAST", row.names = c("2014-04-25 00:00:00",
"2014-04-28 00:00:00", "2014-04-29 00:00:00", "2014-04-30 00:00:00",
"2014-05-01 00:00:00", "2014-05-02 00:00:00", "2014-05-05 00:00:00",
"2014-05-06 00:00:00", "2014-05-07 00:00:00", "2014-05-08 00:00:00",
"2014-05-09 00:00:00", "2014-05-12 00:00:00", "2014-05-13 00:00:00",
"2014-05-14 00:00:00"), class = "data.frame"))
My question is: how can I use do.call() on that list to merge all the data according to their date?
Note that both merge and cbind return errors that I am not able to resolve:
> do.call(what = merge, args = my_list)
Error in fix.by(by.x, x) :
'by' must specify column(s) as numbers, names or logical
> do.call(what = cbind, args = my_list)
Error in data.frame(..., check.names = FALSE) :
arguments imply differing number of rows: 11, 12, 13, 14
I would like to get a single data matrix (whose possibly missing/not matching data are replaced by NAs) equal to the one I would get using merge() on the elements of my_list.
This would be a bit easier if you were not merging by row names, but you could do this with the Reduce function, which will sequentially apply a function along a list of values (in this case, data.frames). Try
Reduce(function(x,y) {
dd<-merge(x,y,by=0); rownames(dd)<-dd$Row.names; dd[-1]
}, my_list)
This will merge all matching rows. You can add all=T to the merge call if you like, or customize it as you would a regular merge().
You will get a warning about column names because each of your columns has an identical name, so when you merge into multiple columns, merge doesn't know what to name them. You could rename them with something like
my_new_list <- Map(
function(x,n) {
names(x)<-n; x
},
my_list,
paste("PX_LAST",1:length(my_list), sep="_")
)
then
Reduce(function(x,y) {
dd<-merge(x,y,by=0); rownames(dd)<-dd$Row.names; dd[-1]
}, my_new_list)
won't complain.
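Since the question asks for non-matching dates to be filled with NA, here is the same Reduce pattern with all = TRUE on a small made-up list. The three toy data frames and their values are assumptions for illustration, not the question's data.

```r
# Full outer merge of a list of data frames by row names (toy data).
dfs <- list(
  data.frame(PX_LAST = c(1, 2),   row.names = c("2014-04-28", "2014-04-29")),
  data.frame(PX_LAST = c(10, 20), row.names = c("2014-04-29", "2014-04-30")),
  data.frame(PX_LAST = 100,       row.names = "2014-04-28")
)

# Give each column a unique name so merge() has nothing to disambiguate
for (i in seq_along(dfs)) names(dfs[[i]]) <- paste0("PX_LAST_", i)

merged <- Reduce(function(x, y) {
  dd <- merge(x, y, by = 0, all = TRUE)  # by = 0 merges on row names
  rownames(dd) <- dd$Row.names           # restore dates as row names
  dd[-1]                                 # drop the helper Row.names column
}, dfs)
merged
```

Dates missing from one of the inputs show up as NA in that input's column.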
Here is a solution using data.table and reshape2:
# Load libraries
library(data.table)
library(reshape2)
# Setup new list object
my_list.2 <- vector(length(my_list), mode="list")
# Add time stamps as variable and add ID variable
for(i in 1:length(my_list)){
my_list.2[[i]] <- cbind(time=rownames(my_list[[i]]), my_list[[i]], id=rep(paste0("list_",i), times=nrow(my_list[[i]])))
}
# Collapse all lists in one data table
d.temp <- rbindlist(my_list.2)
# Transform the data
d.final <- dcast(time~id, value.var="PX_LAST", data=d.temp)
# > d.final
# time list_1 list_2 list_3 list_4
# 1 2014-04-28 00:00:00 0.398 NA 0.4810 1.6500
# 2 2014-04-29 00:00:00 0.457 1.7310 0.4560 1.6400
# 3 2014-04-30 00:00:00 0.400 1.7060 0.4480 1.6500
# 4 2014-05-02 00:00:00 0.159 1.6900 0.4360 1.6595
# 5 2014-05-05 00:00:00 0.126 1.7130 0.4480 1.6650
# 6 2014-05-06 00:00:00 0.108 1.7110 0.4580 1.6595
# 7 2014-05-07 00:00:00 0.260 1.7240 0.4660 1.6625
# 8 2014-05-08 00:00:00 0.239 1.6990 0.4320 1.6520
# 9 2014-05-09 00:00:00 0.222 1.7020 0.4370 1.6450
# 10 2014-05-12 00:00:00 0.191 1.7050 0.4410 1.6245
# 11 2014-05-13 00:00:00 0.184 1.6490 0.4170 1.6270
# 12 2014-05-01 00:00:00 NA 1.7095 0.4390 1.6620
# 13 2014-05-14 00:00:00 NA 1.6110 0.4035 1.6330
# 14 2014-04-25 00:00:00 NA NA NA 1.6500
