Format historical data for forecasting with calendar variables - r

I have hourly time series data for the year 2015. This data corresponds to power consumption of a big commercial building. I want to use this data to predict the usage for the year 2016. To develop a forecasting model, I need to format this data in a suitable format.
I am planning to use the following features to predict the 2016 usage: (1) day of week, (2) time of day, (3) temperature, (4) year 2015 usage.
I am able to create the first 3 features but the fourth one seems tricky.
How should I arrange the 2015 data so that, for a particular day of 2016, I can use the data from the corresponding day of 2015? My concerns are:
(1) I should not use weekend data from 2015 to predict the usage on a working day.
(2) On some days in 2015 the data is missing for the entire day. How should I account for these missing readings on the corresponding days in 2016?
Here, I have created dummy data corresponding to the year 2015 and 2016.
library(xts)
set.seed(123)
seq1 <- seq(as.POSIXct("2015-01-01"),as.POSIXct("2015-12-31"), by = "hour")
data1 <- xts(rnorm(length(seq1),150,5),seq1)
seq2 <- seq(as.POSIXct("2016-01-01"),as.POSIXct("2016-09-30"), by = "hour")
data2 <- xts(rnorm(length(seq2),140,5),seq2)
Let me give an example to clarify my problem:
Suppose the model is: lm(output ~ dayofweek + timeofday + temperature + lastyearusage, data = xxx)
Now suppose I want to predict the usage on 2 Oct 2016 (dayY), using the lastyearusage on 2 Oct 2015 (dayX). The issue in this step is: how should I ensure that dayX is not a weekend day if dayY is a working day? I am sure that if I use dayX to predict dayY without checking the day type, the output will get messy.

There might already be a function in a package that does this, but I post here a custom function that adds all these kinds of calendar variables (including the weekend info) to a data.frame containing a date/hour column. Fake data:
df <- data.frame(datetime=seq(as.POSIXlt("2013/01/01 00:00:00"), as.POSIXlt("2013/12/31 23:00:00"), by="hour"), variable=rnorm(8760))
#### datetime variable
#### 1 2013-01-01 00:00:00 1.68959052
#### 2 2013-01-01 01:00:00 0.02023722
#### 3 2013-01-01 02:00:00 -0.42080942
The code for the function:
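CreateCalendarVariables <- function(df, id_column = NULL) {
  # lubridate provides year(), wday(), etc.; magrittr provides the %>% pipe used below
  library(lubridate)
  library(magrittr)
  df <- data.frame(df)
  if (is.null(id_column)) stop("Id column for the datetime variable is a mandatory argument")
  temp <- df[, id_column]
  # accept Date and date-time (POSIXct / POSIXlt) columns only
  if (!inherits(temp, c("Date", "POSIXt"))) {
    stop("the indicated datetime variable doesn't have the suitable format")
  }
  df['year'] <- year(temp)
  df['.quarter'] <- quarter(temp)
  df['.month'] <- month(temp)
  df['.week'] <- week(temp)
  # as.Date() converts a POSIXct in UTC by default, so a local-midnight
  # timestamp can land on the previous day; pass tz = to keep local dates
  df['.DMY'] <- as.Date(temp)
  df['.dayinyear'] <- yday(temp)
  df['.dayinmonth'] <- mday(temp)
  # reorder the weekday levels so that the week starts on Monday
  df['.weekday'] <- wday(temp, label = TRUE, abbr = FALSE) %>% factor(., levels = levels(.)[c(2, 3, 4, 5, 6, 7, 1)])
  # weekend flag (relies on English weekday labels)
  df['.is_we'] <- df$.weekday %in% c("Saturday", "Sunday")
  # the hour only exists for date-time input
  if (!inherits(temp, "Date")) {
    df['.hour'] <- factor(hour(temp))
  }
  return(df)
}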
Then you just have to specify the index of the column containing the datetime variable. If your model needs these variables as factors, feel free to adapt the code:
CreateCalendarVariables(df, 2)
#### Error in CreateCalendarVariables(df, 2) :
#### the indicated datetime variable doesn't have the suitable format
CreateCalendarVariables(df, 1)
#### datetime variable year .quarter .month .week .DMY .dayinyear .dayinmonth .weekday .is_we .hour
#### 1 2013-01-01 00:00:00 1.68959052 2013 1 1 1 2012-12-31 1 1 Tuesday FALSE 0
#### 2 2013-01-01 01:00:00 0.02023722 2013 1 1 1 2013-01-01 1 1 Tuesday FALSE 1
To answer your last question: if an entire level is missing from the calibration dataset (e.g. one whole week, while you are using .week as a predictor), you'll need to impute the data first.
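For the lastyearusage feature itself, one possible approach (not part of the answer above) is to look up the reading taken exactly 52 weeks (364 days) earlier, which always falls on the same weekday and hour. The sketch below is only an illustration in base R: the names t2015, t2016, usage2015 and features are made up, the timestamps are assumed to be in UTC to sidestep daylight-saving shifts, and public holidays are not handled. A day that is missing in 2015 simply shows up as NA in the new column and can then be imputed.

```r
# Hourly timestamps, assumed to be in UTC so that every day has 24 hours
t2015 <- seq(as.POSIXct("2015-01-01 00:00", tz = "UTC"),
             as.POSIXct("2015-12-31 23:00", tz = "UTC"), by = "hour")
t2016 <- seq(as.POSIXct("2016-01-01 00:00", tz = "UTC"),
             as.POSIXct("2016-09-30 23:00", tz = "UTC"), by = "hour")

set.seed(123)
usage2015 <- rnorm(length(t2015), 150, 5)
# pretend that one whole day of 2015 is missing
usage2015[format(t2015, "%Y-%m-%d") == "2015-03-10"] <- NA

# 52 weeks back: same weekday, same hour
ref_time <- t2016 - 364 * 24 * 3600
lastyearusage <- usage2015[match(as.numeric(ref_time), as.numeric(t2015))]

features <- data.frame(
  datetime      = t2016,
  dayofweek     = format(t2016, "%u"),  # 1 = Monday ... 7 = Sunday
  timeofday     = format(t2016, "%H"),
  lastyearusage = lastyearusage
)
```

Rows where lastyearusage is NA correspond to the missing 2015 day (here 2016-03-08), so they are easy to find before fitting the model.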

Related

I need help writing a function to count the number of holidays within a time period using lubridate in R

I am attempting to write a function that counts the number of holidays a person worked in my organization between their start and term date in the year 2017. My organization recognized 6 holidays that year-
New Years Day- 2017-01-02
Memorial Day- 2017-05-29
Independence Day - 2017-07-04
Labor Day - 2017-09-04
Thanksgiving Day- 2017-11-23
Christmas day - 2017-12-25
I used lubridate and dplyr to combine my year, month and day columns into complete dates like so:
dates <- data %>% mutate("Term Date" = make_date(year = `Term Year`,
                                                 month = `Term Month`,
                                                 day = `Term Day`),
                         "Start Date" = make_date(year = `Start Year`,
                                                  month = `Start Month`,
                                                  day = `Start Day`))
I then went on to attempt to write my function.
holidays <- function(x){
z<- 0
if( ymd("2017-01-01") %within% interval(dates$`Start Date`, dates$`Term Date`)){
z <- z + 1
}
print(z)
}
This was only my first step. My goal was to first make my function work for New Year's Day and then build in the other holidays step by step using if statements. I was unable to get the apply function to work correctly and am unsure whether my function even works. I attempted to apply the function like so:
apply(dates,2,holidays)
But I got an error.
Does anyone have any advice?
Putting the holidays in a vector:
holidays <- as.Date(c('2017-01-02', '2017-05-29', '2017-07-04', '2017-09-04', '2017-11-23', '2017-12-25'))
Extracting the day of the year (to make the comparison independent of the year); "%j" stands for day of year:
holidays <- format(as.Date(holidays), "%j")
Generating some random data to test (1000 uniformly distributed work entries in 2017, 5 employees):
d <- data.frame(
'date' = as.Date(as.integer(runif(1000, 17167, 17531)), origin = '1970-01-01'),
'emp' = sample(LETTERS[1:5], 1000, replace = T)
)
Filtering out the holidays:
h <- d[format(d$date, "%j") %in% holidays, ]
Counting number of holidays worked per employee using aggregate():
aggregate(h$date, list(h$emp), length)
# Group.1 x
#1 A 3
#2 B 4
#3 C 2
#4 D 5
#5 E 1
NB: this works for 2017, but day-of-year matching breaks across leap years, because every date after February 29 shifts by one day (one workaround that doesn't involve altering the code too much is to change the year in the holiday vector manually).
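If the data holds one start date and one term date per person, as in the question, the holidays can also be counted per row by comparing the holiday vector against each interval. This is a minimal base R sketch with a made-up staff data frame (the column names start, term and n_holidays are assumptions); it compares full dates, so the leap-year caveat does not apply.

```r
holidays <- as.Date(c("2017-01-02", "2017-05-29", "2017-07-04",
                      "2017-09-04", "2017-11-23", "2017-12-25"))

# hypothetical employees with a start date and a term date
staff <- data.frame(
  start = as.Date(c("2017-01-01", "2017-06-01", "2017-11-24")),
  term  = as.Date(c("2017-12-31", "2017-09-04", "2017-12-24"))
)

# for each row, count the holidays that fall inside [start, term]
staff$n_holidays <- vapply(
  seq_len(nrow(staff)),
  function(i) sum(holidays >= staff$start[i] & holidays <= staff$term[i]),
  integer(1)
)

staff$n_holidays
# 6 2 0
```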

R filtering/selecting data by POSIXct time and a condition

I have measured temperature at a high time resolution of 10 minutes on different urban tree species, whose reactions should be compared. I am therefore especially interested in periods of heat. The task I fail to accomplish on my dataset is to select complete days based on a maximum value, e.g. days with at least one measurement above 30 °C should be subsetted from my data frame as a whole.
Below you find a reproducible example that should illustrate my problem:
In my Measurings data frame I have calculated a column indicating whether the individual measurement is above or below 30 °C. I wanted to use that column to tell other functions whether they should pick a day or not to produce a new data frame. If the value is above 30 °C at any time of the day, I want to include that date from 00:00 to 23:59 in the new data frame for further analyses.
start <- as.POSIXct("2018-05-18 00:00", tz = "CET")
tseq <- seq(from = start, length.out = 1000, by = "hours")
Measurings <- data.frame(
Time = tseq,
Temp = sample(20:35,1000, replace = TRUE),
Variable1 = sample(1:200,1000, replace = TRUE),
Variable2 = sample(300:800,1000, replace = TRUE)
)
Measurings$heat30 <- ifelse(Measurings$Temp > 30,"heat", "normal")
Measurings$otheroption30 <- ifelse(Measurings$Temp > 30,"1", "0")
The example is yielding a Dataframe analog to the structure of my Data:
head(Measurings)
Time Temp Variable1 Variable2 heat30 otheroption30
1 2018-05-18 00:00:00 28 56 377 normal 0
2 2018-05-18 01:00:00 23 65 408 normal 0
3 2018-05-18 02:00:00 29 78 324 normal 0
4 2018-05-18 03:00:00 24 157 432 normal 0
5 2018-05-18 04:00:00 32 129 794 heat 1
6 2018-05-18 05:00:00 25 27 574 normal 0
So how do I subset to get a new data frame containing all the days where at least one entry is marked as "heat"?
I know that, for example, dplyr::filter could filter the individual entries (row 5 in the head of the example). But how could I tell it to take the whole day 2018-05-18?
I am quite new to analyzing data with R, so I would appreciate any suggestions on a working solution to my problem. dplyr is what I have been using for quite a few tasks, but I am open to whatever works.
Thanks a lot, Konrad
Create a variable that specifies the day (dropping hours, minutes, etc.). Then iterate over the unique dates and keep only the subsets whose heat30 column contains "heat" at least once:
library(dplyr)
Measurings <- Measurings %>% mutate(Time2 = format(Time, "%Y-%m-%d"))
res <- NULL
newdf <- lapply(unique(Measurings$Time2), function(x){
ss <- Measurings %>% filter(Time2 == x) %>% select(heat30) %>% pull(heat30) # take heat30 vector
rr <- Measurings %>% filter(Time2 == x) # select date x
# check if heat30 vector contains heat value at least once, if so bind that subset
if(any(ss == "heat")){
res <- rbind(res, rr)
}
return(res)
}) %>% bind_rows()
Below is one possible solution using the dataset provided in the question. Please note that this is not a great example as all days will probably include at least one observation marked as over 30 °C (i.e. there will be no days to filter out in this dataset but the code should do the job with the actual one).
# import packages
library(dplyr)
library(stringr)
# break the time stamp into Day and Hour
time_df <- as_data_frame(str_split(Measurings$Time, " ", simplify = T))
# name the columns
names(time_df) <- c("Day", "Hour")
# create a new measurement data frame with separate Day and Hour columns
new_measurings_df <- bind_cols(time_df, Measurings[-1])
# form the new data frame by filtering the days marked as heat
new_df <- new_measurings_df %>%
filter(Day %in% new_measurings_df$Day[new_measurings_df$heat30 == "heat"])
To be more precise, you are creating a random sample of 1000 observations varying between 20 to 35 for temperature across 40 days. As a result, it is very likely that every single day will have at least one observation marked as over 30 °C in your example. Additionally, it is always a good practice to set seed to ensure reproducibility.
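The same idea also works without extra packages. The sketch below is only an illustration under a few assumptions: the Day column and the names hot_days and heat_df are made up, and the temperature range is narrowed to 20:31 so that some days are likely to stay below the threshold. It derives the calendar day in the CET time zone, collects the days with at least one reading above 30 °C, and keeps every row of those days.

```r
set.seed(42)
start <- as.POSIXct("2018-05-18 00:00", tz = "CET")
Measurings <- data.frame(
  Time = seq(from = start, length.out = 240, by = "hours"),
  Temp = sample(20:31, 240, replace = TRUE)
)

# calendar day in the time zone of the measurements
Measurings$Day <- as.Date(Measurings$Time, tz = "CET")

# days with at least one reading above 30 degrees
hot_days <- unique(Measurings$Day[Measurings$Temp > 30])

# keep all rows (00:00 to 23:00) of those days
heat_df <- Measurings[Measurings$Day %in% hot_days, ]
```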

convert irregular 6hourly data to daily accumulated using R

I have the following data:
Date,Rain
1979_8_9_0,8.775
1979_8_9_6,8.775
1979_8_9_12,8.775
1979_8_9_18,8.775
1979_8_10_0,0
1979_8_10_6,0
1979_8_10_12,0
1979_8_10_18,0
1979_8_11_0,8.025
1979_8_12_12,0
1979_8_12_18,0
1979_8_13_0,8.025
The data is six-hourly, but some dates have incomplete data. For example, August 11 1979 has only one value, at 00H. I would like to get the daily accumulated rainfall from this kind of data using R. Any suggestions on how to do this easily?
I'll appreciate any help.
You can transform your data to dates very easily with:
dat$Date <- as.Date(strptime(dat$Date, '%Y_%m_%d_%H'))
After that you should aggregate with:
aggregate(Rain ~ Date, dat, sum)
The result:
Date Rain
1 1979-08-09 35.100
2 1979-08-10 0.000
3 1979-08-11 8.025
4 1979-08-12 0.000
5 1979-08-13 8.025
Based on the comment of Henrik, you can also transform to dates with:
dat$Date <- as.Date(dat$Date, '%Y_%m_%d')
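# split the "Date" variable into its components (year, month, day, hour)
splitDate <- stringr::str_split_fixed(string = dat$Date, pattern = "_", n = 4)
# build the day key from year, month and day, so that equal day numbers from different months are not merged
dat$Day <- paste(splitDate[, 1], splitDate[, 2], splitDate[, 3], sep = "-")
# split the rain values by day and sum each group
sapply(split(dat$Rain, dat$Day), sum)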
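Since some days are incomplete, it may help to keep the number of readings next to each daily total. The following is a small self-contained sketch using the sample data from the question; the column names Day, n_obs and complete are assumptions, and a complete day is taken to have four readings.

```r
dat <- read.csv(text = "Date,Rain
1979_8_9_0,8.775
1979_8_9_6,8.775
1979_8_9_12,8.775
1979_8_9_18,8.775
1979_8_10_0,0
1979_8_10_6,0
1979_8_10_12,0
1979_8_10_18,0
1979_8_11_0,8.025
1979_8_12_12,0
1979_8_12_18,0
1979_8_13_0,8.025", stringsAsFactors = FALSE)

# the trailing hour field is ignored when parsing with this format
dat$Day <- as.Date(dat$Date, format = "%Y_%m_%d")

total <- tapply(dat$Rain, dat$Day, sum)     # daily accumulated rain
n_obs <- tapply(dat$Rain, dat$Day, length)  # readings available per day

daily <- data.frame(
  Day      = as.Date(names(total)),
  Rain     = as.vector(total),
  n_obs    = as.vector(n_obs),
  complete = as.vector(n_obs) == 4          # four 6-hourly readings expected
)
```

Days flagged as incomplete can then be dropped, rescaled or imputed, depending on what the accumulated value is used for.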

Set day of week to be used by to.weekly

I am trying to convert a time series of daily data (only business days) contained in an xts object into a time series of weekly data. Specifically, I want the resulting time series to contain the end of week entries (meaning last business day of a week) of the original data. I've been trying to achieve this using the function to.weekly of the xts package.
In the discussion regarding another question (Wrong week-ending date using 'to.weekly' function in 'xts' package) the below example code achieved exactly what I need. However, when I run the code, to.weekly uses Mondays as a representative for the weekly data.
I am wondering which global setting might allow me to force to.weekly to use Friday as a week's representative.
Example code:
library(lubridate); library(xts)
test.dates <- seq(as.Date("2000-01-01"),as.Date("2011-10-01"),by='days')
test.dates <- test.dates[wday(test.dates)!=1 & wday(test.dates)!=7] #Remove weekends
test.data <- rnorm(length(test.dates),mean=1,sd=2)
test.xts <- xts(x=test.data,order.by=test.dates)
test.weekly <- to.weekly(test.xts)
test.weekly[wday(test.weekly, label = TRUE, abbr = TRUE) != "Fri"]
test.dates <- test.dates[wday(test.dates)==6]
tail(wday(test.dates, label = TRUE, abbr = TRUE))
#[1] Fri Fri Fri Fri Fri Fri
#Levels: Sun < Mon < Tues < Wed < Thurs < Fri < Sat
OK. With the unstated requirements added to the problem:
require(timeDate)
require(lubridate)
startDate <- as.Date("2000-01-03")
endDate <- as.Date("2011-10-01")
AllDays <- as.timeDate(seq(startDate, endDate, by="day"))
is.wrk <- isBizday(AllDays, holidays = holidayNYSE(), wday = 1:5)
is.wrkdt <- as.Date(names(is.wrk)[is.wrk])
endweeks <- tapply(is.wrkdt, paste(year(is.wrkdt),week(is.wrkdt), sep = ""), max)
head(as.Date(endweeks, origin="1970-01-01"))
# 1 2 3 4 5 6
#"2011-01-06" "2011-01-13" "2011-01-20" "2011-01-27" "2011-02-03" "2011-02-10"
So you want:
as.Date(endweeks, origin="1970-01-01")
I had the same problem and I found a two-lines solution.
You first need to retain only weekdays (the filter below drops Saturdays and Sundays, not public holidays):
test.dates <- test.dates[ wday(test.dates) %in% c(2:6) ]
Then you have two alternatives. First, you can use to.weekly() which retains the most recent business day, i.e. not necessarily constrained to wday(test.dates)==6
test.weekly <- to.weekly(test.xts)
Or you can use the function endpoints() which works on multi-columns xts objects and deals much better with NA's because it does not remove missing data (preventing the warning "missing values removed from data")
test.weekly <- test.xts[endpoints(test.xts,on='weeks')[-1],]
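For comparison, the last available business day of each week can also be found in base R, without xts. This is a rough sketch under stated assumptions: weeks are ISO weeks (Monday to Sunday), public holidays are ignored, and the names biz, week_id and last_biz are made up for the illustration.

```r
dates <- seq(as.Date("2000-01-03"), as.Date("2000-02-29"), by = "day")

# keep Monday (1) to Friday (5); %u gives the ISO weekday number
biz <- dates[!format(dates, "%u") %in% c("6", "7")]

# ISO year and week number identify each week
week_id <- format(biz, "%G-%V")

# latest business day within each week
last_biz <- as.Date(as.vector(tapply(as.numeric(biz), week_id, max)),
                    origin = "1970-01-01")
```

Every complete week ends on its Friday here, while the final, partial week ends on the last date present in the data (Tuesday 2000-02-29).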

Split date data (m/d/y) into 3 separate columns

I need to convert dates (m/d/y format) into 3 separate columns on which I hope to run an algorithm (I'm trying to convert my dates into Julian Day Numbers). I saw this suggestion for another user for separating data out into multiple columns using Oracle. I'm using R and am thoroughly stuck on how to code this appropriately. Would A1, A2... represent my new column headings, and what would the format difference be with the "update set" section?
update <tablename> set A1 = substr(ORIG, 1, 4),
A2 = substr(ORIG, 5, 6),
A3 = substr(ORIG, 11, 6),
A4 = substr(ORIG, 17, 5);
I'm trying hard to improve my skills in R but cannot figure this one...any help is much appreciated. Thanks in advance... :)
I use the format() method for Date objects to pull apart dates in R. Using Dirk's datetext, here is how I would go about breaking up a date into its constituent parts:
datetxt <- c("2010-01-02", "2010-02-03", "2010-09-10")
datetxt <- as.Date(datetxt)
df <- data.frame(date = datetxt,
year = as.numeric(format(datetxt, format = "%Y")),
month = as.numeric(format(datetxt, format = "%m")),
day = as.numeric(format(datetxt, format = "%d")))
Which gives:
> df
date year month day
1 2010-01-02 2010 1 2
2 2010-02-03 2010 2 3
3 2010-09-10 2010 9 10
Note what several others have said; you can get the Julian dates without splitting out the various date components. I added this answer to show how you could do the breaking apart if you needed it for something else.
Given a text variable x, like this:
> x
[1] "10/3/2001"
then:
> as.Date(x,"%m/%d/%Y")
[1] "2001-10-03"
converts it to a date object. Then, if you need it:
> julian(as.Date(x,"%m/%d/%Y"))
[1] 11598
attr(,"origin")
[1] "1970-01-01"
gives you a Julian date (relative to 1970-01-01).
Don't try the substring thing...
See help(as.Date) for more.
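Putting the two pieces together for m/d/Y text, a short base R sketch could look like this (the second date and the parts data frame are made up for the illustration):

```r
x <- c("10/3/2001", "1/15/2010")

# parse the text once, then derive everything from the Date object
d <- as.Date(x, format = "%m/%d/%Y")

parts <- data.frame(
  date   = d,
  month  = as.integer(format(d, "%m")),
  day    = as.integer(format(d, "%d")),
  year   = as.integer(format(d, "%Y")),
  julian = as.numeric(julian(d))  # days since 1970-01-01
)
```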
Quick ones:
Julian date converters already exist in base R, see eg help(julian).
One approach may be to parse the date as a POSIXlt and to then read off the components. Other date / time classes and packages will work too but there is something to be said for base R.
Parsing dates as string is almost always a bad approach.
Here is an example:
datetxt <- c("2010-01-02", "2010-02-03", "2010-09-10")
dates <- as.Date(datetxt) ## you could examine these as well
plt <- as.POSIXlt(dates) ## now as POSIXlt types
plt[["year"]] + 1900 ## years are with offset 1900
#[1] 2010 2010 2010
plt[["mon"]] + 1 ## and months are on the 0 .. 11 interval
#[1] 1 2 9
plt[["mday"]]
#[1] 2 3 10
df <- data.frame(year=plt[["year"]] + 1900,
month=plt[["mon"]] + 1, day=plt[["mday"]])
df
# year month day
#1 2010 1 2
#2 2010 2 3
#3 2010 9 10
And of course
julian(dates)
#[1] 14611 14643 14862
#attr(,"origin")
#[1] "1970-01-01"
To convert dates (m/d/y format) into 3 separate columns, consider the df:
df <- data.frame(date = c("01-02-18", "02-20-18", "03-23-18"))
df
date
1 01-02-18
2 02-20-18
3 03-23-18
Convert to date format
df$date <- as.Date(df$date, format="%m-%d-%y")
df
date
1 2018-01-02
2 2018-02-20
3 2018-03-23
To get three separate columns with year, month and day:
library(lubridate)
df$year <- year(ymd(df$date))
df$month <- month(ymd(df$date))
df$day <- day(ymd(df$date))
df
date year month day
1 2018-01-02 2018 1 2
2 2018-02-20 2018 2 20
3 2018-03-23 2018 3 23
Hope this helps.
Hi Gavin: another way [using your idea] is:
The data-frame we will use is oilstocks which contains a variety of variables related to the changes over time of the oil and gas stocks.
The variables are:
colnames(stocks)
"bpV" "bpO" "bpC" "bpMN" "bpMX" "emdate" "emV" "emO" "emC"
"emMN" "emMN.1" "chdate" "chV" "cbO" "chC" "chMN" "chMX"
One of the first things to do is change the emdate field, which is an integer vector, into a date vector.
realdate<-as.Date(emdate,format="%m/%d/%Y")
Next we want to split emdate column into three separate columns representing month, day and year using the idea supplied by you.
dfdate <- data.frame(date=realdate)
year=as.numeric (format(realdate,"%Y"))
month=as.numeric (format(realdate,"%m"))
day=as.numeric (format(realdate,"%d"))
ls() will include the individual vectors, day, month, year and dfdate.
Now merge the dfdate, day, month, year into the original data-frame [stocks].
ostocks<-cbind(dfdate,day,month,year,stocks)
colnames(ostocks)
"date" "day" "month" "year" "bpV" "bpO" "bpC" "bpMN" "bpMX" "emdate" "emV" "emO" "emC" "emMN" "emMX" "chdate" "chV"
"cbO" "chC" "chMN" "chMX"
Similar results and I also have date, day, month, year as separate vectors outside of the df.

Resources