Boxplot time series by monthly average in R

I have a time series with hourly data on energy consumption in the form of a zoo object. For 16 of the indices (in the range [1:143206]) the Date is NA. Here is a sample of the data:
Date PJMW_MW
1 2002-04-01 01:00:00 4374
...
8709 2003-03-29 23:00:00 4827
8710 2003-03-30 00:00:00 4611
8711 2003-03-30 01:00:00 4421
8712 NA 4285
8713 2003-03-30 03:00:00 4212
8714 2003-03-30 04:00:00 4321
...
143206 2018-08-03 00:00:00 5489
The data above is a data.frame object called dat but I have it in a zoo object called hourly_ts:
1 4374
...
7709 6135
7710 6324
7711 6626
7712 6866
7713 6987
7714 7028
7715 7026
...
143206 5265
I would like to compare consumption by month, i.e. to see in which months consumption is generally higher. I saw that there is a simple formula for this: boxplot(hourly_ts ~ cycle(hourly_ts))
But it fails with the error: Error in cycle.zoo(hourly_ts) : ‘x’ is not regular
The weird thing is that hourly_ts has a specified frequency (24 hours per day) and start time (April 1st 2002 01:00:00), so from that there shouldn't be any missing values in the time.
Supposing the missing values are what's causing the irregularity, is there a way I can add the values myself?
I would also like to use the aggregate function but have no idea what the by parameter should be.
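One possible approach, as a sketch under assumptions rather than a definitive answer: the NA at 2003-03-30 02:00 is consistent with a daylight-saving gap (that hour does not exist in time zones that move clocks forward on that date), so parsing the timestamps with tz = "UTC" may avoid the NA dates. For aggregate() on a zoo object, by can be a function applied to the index (such as as.yearmon) or a vector of group labels. The data below is simulated, and the names month_of_year, mean_per_month and mean_by_calmonth are only illustrative.

```r
library(zoo)

# Hypothetical stand-in for the question's data: two years of hourly values.
# tz = "UTC" has no daylight-saving transitions, so no timestamp becomes NA.
idx <- seq(as.POSIXct("2002-04-01 01:00:00", tz = "UTC"),
           by = "hour", length.out = 24 * 730)
set.seed(1)
hourly_ts <- zoo(5000 + rnorm(length(idx), sd = 500), order.by = idx)

# Month of the year (1-12) for every observation
month_of_year <- as.integer(format(index(hourly_ts), "%m"))

# One box per calendar month, pooling all years
boxplot(coredata(hourly_ts) ~ month_of_year,
        xlab = "Month", ylab = "PJMW_MW")

# `by` as a function of the index: one mean per year-month
mean_per_month <- aggregate(hourly_ts, as.yearmon, mean)

# `by` as a grouping vector: one mean per calendar month (1-12)
mean_by_calmonth <- aggregate(hourly_ts, month_of_year, mean)
```

Grouping by month_of_year instead of cycle() does not require the series to be strictly regular, which sidesteps the error from cycle.zoo.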

Related

Minute time series in R. How to insert missing values in order to have the same steps in time?

I have a dataset where column 1 is date-time and column 2 is the price at a specific point in time. This data is downloaded to Excel with bloomberg excel add-in. Then I used read_excel function to import this file to R.
This is what the data looks like in R.
Question: the data is supposed to be with 1 min intervals, but it is not always the case. Sometimes the time in the next row is more than 1 min later. So, how can I insert extra rows for the missing minutes? So, for each date I would like to have the following sequence:
08:00
08:01
08:02
...
16:58
16:59
17:00
For these points in time, I would like to keep the price from the dataset. If the price is not there, it should be marked as missing. For example if we have:
...
12:31 100
12:32 102
12:35 101
...
then I would like to have:
...
12:31 100
12:32 102
12:33 missing
12:34 missing
12:35 101
...
What is the easiest way to do this? Thank you!
You can create an xts with the prices you have and merge it with a sequence that has a higher frequency (e.g. every minute).
library(xts)
library(lubridate)
# Example prices observed every 2 minutes
set.seed(123)
prices <- 100 + rnorm(16)
timeindex <- seq(ymd_hm('2020-05-28 08:45'),
                 ymd_hm('2020-05-28 09:15'),
                 by = '2 mins')
prices_xts <- xts(prices, order.by = timeindex)
head(prices_xts)
#                          [,1]
# 2020-05-28 08:45:00  99.43952
# 2020-05-28 08:47:00  99.76982
# 2020-05-28 08:49:00 101.55871
# 2020-05-28 08:51:00 100.07051
# 2020-05-28 08:53:00 100.12929
# 2020-05-28 08:55:00 101.71506
# Full 1-minute sequence over the same period
timeindex2 <- seq(ymd_hm('2020-05-28 08:45'),
                  ymd_hm('2020-05-28 09:15'),
                  by = '1 mins')
# Merging adds the missing minutes as NA rows
prices_with_gaps_xts <- merge.xts(prices_xts,
                                  timeindex2)
head(prices_with_gaps_xts)
#                     prices_xts
# 2020-05-28 08:45:00   99.43952
# 2020-05-28 08:46:00         NA
# 2020-05-28 08:47:00   99.76982
# 2020-05-28 08:48:00         NA
# 2020-05-28 08:49:00  101.55871
# 2020-05-28 08:50:00         NA
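A variant of the same idea, as a minimal sketch: merge the prices with a zero-width xts object built from the complete minute grid. The timestamps and prices below are hypothetical and simply mirror the 12:31 / 12:32 / 12:35 example from the question; grid and filled are illustrative names.

```r
library(xts)

# Hypothetical irregular prices, mirroring the example in the question
times  <- as.POSIXct(c("2020-05-28 12:31", "2020-05-28 12:32",
                       "2020-05-28 12:35"), tz = "UTC")
prices <- xts(c(100, 102, 101), order.by = times)

# Complete one-minute grid over the observed span
grid <- seq(start(prices), end(prices), by = "1 min")

# A zero-width xts carries only an index; merging with it inserts
# the missing minutes as rows with NA prices
filled <- merge(prices, xts(, grid))
```

To cover a fixed trading window such as 08:00 to 17:00, the grid could instead be built from those two endpoints for each date.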

Time to failure variable based off start and end timestamps in R

I have two data sets. Data set 1 contains time stamps of 15 minute intervals starting at 2009-08-18 18:15:00 and ending 2012-11-09 22:30:00 with measurements taken at those times. Data set 2 has start and end time stamps for faults occurring in a factory. There are 6 faults and these faults' start and end times are also 15 min intervals, yet can last longer than 1 interval. They also all fall somewhere between 2009-08-18 18:15:00 and 2012-11-09 22:30:00 as well. I am trying to create a time to failure variable for the faults, where -i would indicate the next fault is i intervals (which are 15 mins) away and i would indicate the fault started i intervals ago. For example,
DataSet1
Timestamp Sensor 1
2009-09-04 10:00:00 30
2009-09-04 10:30:00 40
2009-09-04 10:45:00 33
2009-09-04 11:00:00 23
2009-09-04 11:15:00 24
2009-09-04 11:30:00 42
DataSet 2
Start Time End Time Fault Type
09/04/09 10:45 9/4/2009 11:15 1
09/04/09 21:45 9/4/2009 22:00 1
09/04/09 23:00 9/4/2009 23:15 1
09/05/09 10:45 9/5/2009 11:15 1
09/05/09 21:30 9/5/2009 23:15 1
09/08/09 10:45 9/8/2009 12:30 1
So what I want to end up with is the following time to failure variable (TTF1) and then repeat the process for faults 2-6
Timestamp Sensor 1 TTF1
2009-09-04 10:00:00 30 -3
2009-09-04 10:30:00 40 -1
2009-09-04 10:45:00 33 0
2009-09-04 11:00:00 23 1
2009-09-04 11:15:00 24 2
2009-09-04 11:30:00 42 -41
I know I can use the sqldf function to separate out each fault type, but I have no clue where to begin to even create counting the time to fault variable. I'm very stuck, any help would be greatly appreciated!
You can use the difftime() function from base R to get the time difference between these 2 timestamps:
(z <- Sys.time() - 3600)
Sys.time() - z # just over 3600 seconds.
as.difftime(c("0:3:20", "11:23:15"))
as.difftime(c("3:20", "23:15", "2:"), format = "%H:%M") # 3rd gives NA
(z <- as.difftime(c(0,30,60), units = "mins"))
as.numeric(z, units = "secs")
as.numeric(z, units = "hours")
format(z)
I would recommend setting units = "mins". You can turn the result into a plain number with as.numeric(z, units = "mins") (or, more laboriously, convert it to character, strip out any non-numeric text with gsub, and then call as.numeric). Finally, divide by 15 to get the 15-minute time units you want. You can use floor to round the result if needed.
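To make this concrete, here is a hedged sketch of how the TTF1 column from the question could be computed with difftime in base R. It assumes the fault start and end times have already been parsed to POSIXct; stamp, fault_start, fault_end and ttf are illustrative names, and only the first two faults from the question are included.

```r
# Timestamps from DataSet1 and the first two type-1 faults from DataSet 2
stamp <- as.POSIXct(c("2009-09-04 10:00", "2009-09-04 10:30",
                      "2009-09-04 10:45", "2009-09-04 11:00",
                      "2009-09-04 11:15", "2009-09-04 11:30"), tz = "UTC")
fault_start <- as.POSIXct(c("2009-09-04 10:45", "2009-09-04 21:45"), tz = "UTC")
fault_end   <- as.POSIXct(c("2009-09-04 11:15", "2009-09-04 22:00"), tz = "UTC")

ttf <- vapply(seq_along(stamp), function(k) {
  now <- stamp[k]
  active <- which(fault_start <= now & now <= fault_end)
  if (length(active) > 0) {
    # Inside a fault: 15-minute intervals elapsed since it started
    return(as.numeric(difftime(now, fault_start[active[1]], units = "mins")) / 15)
  }
  upcoming <- fault_start[fault_start > now]
  if (length(upcoming) == 0) return(NA_real_)  # no later fault in the data
  # Before a fault: negative number of intervals until the next start
  -as.numeric(difftime(min(upcoming), now, units = "mins")) / 15
}, numeric(1))

data.frame(Timestamp = stamp, TTF1 = ttf)
```

For the six timestamps above this gives -3, -1, 0, 1, 2, -41, matching the table in the question. The same function could be run once per fault type.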

Calendaring Monthly Usages for each Date

I have a data set with a start date, an end date, and the usage for that period. I have calculated the number of days between the two dates and, from that, the daily usage. (I am okay with one flat usage value per day for now.)
Now, what I want to achieve is the sum of the daily usage for each day in June across those time frames. For example, the first case contributes just its DAILY_USAGE to June 1st:
START_DATE END_DATE x DAYS DAILY_USAGE
1 2015-05-01 2015-06-01 261605.00 32 8175.156250
For the second case, I want to add the usage 3905 to June 1st and also to June 2nd, because its time frame spans both June 1st and June 2nd.
2015-05-04 2015-06-02 117159.00 30 3905.3000000
I want to continue doing this for all 387 rows and at the end get the sum of usages for each day, but I do not know how to do this for hundreds of records.
This is what my data set looks like right now:
str(YYY)
'data.frame': 387 obs. of 5 variables:
$ START_DATE : Date, format: "2015-05-01" "2015-05-04" "2015-05-11" "2015-05-13" ...
$ END_DATE : Date, format: "2015-06-01" "2015-06-01" "2015-06-01" "2015-06-01" ...
$ x : num 261605 1380796 183 103 489 ...
$ DAYS : num 32 29 22 20 19 12 1 34 30 29 ...
$ DAILY_USAGE: num 8175.16 47613.66 8.32 5.13 25.74 ...
Also, the header.
START_DATE END_DATE x DAYS DAILY_USAGE
1 2015-05-01 2015-06-01 261605.00 32 8175.1562500
2 2015-05-04 2015-06-01 1380796.00 29 47613.6551724
6 2015-05-21 2015-06-01 1392.00 12 116.0000000
7 2015-06-01 2015-06-01 2503.00 1 2503.0000000
8 2015-04-30 2015-06-02 0.00 34 0.0000000
9 2015-05-04 2015-06-02 117159.00 30 3905.3000000
10 2015-05-05 2015-06-02 193334.00 29 6666.6896552
13 2015-05-04 2015-06-03 630.00 31 20.3225806
and so on........
Example of data sets and Results
I will call this data set EXAMPLE1 (mocked-up data for 3 days):
START_DATE END_DATE x DAYS DAILY_USAGE
5/1/2015 6/1/2015 261605 32 8175.15625
5/4/2015 6/1/2015 1380796 29 47613.65517
5/11/2015 6/1/2015 183 22 8.318181818
4/30/2015 6/2/2015 0 34 0
5/20/2015 6/2/2015 70 14 5
6/1/2015 6/2/2015 569 2 284.5
6/1/2015 6/3/2015 582 3 194
6/2/2015 6/3/2015 6 2 3
For the above example, the answer should look like this:
DAY USAGE
6/1/2015 56280.6296
6/2/2015 486.5
6/3/2015 197
HOW?
In EXAMPLE1, for June 1st, I have added the usages of all rows except the last one, because the last row's time frame does not include 06/01. It starts on 06/02 and ends on 06/03.
To get June 2nd, I have added all the usages from rows 4 to 8, because June 2nd lies between all of those start and end dates.
For June 3rd, I have only added the last two rows to get 197.
So which rows to sum depends on the time frame between START_DATE and END_DATE.
Hope this helps!
There might be an easier trick to do this than writing 400 lines of if/else statements.
Thank you again for your time!!
-Gyve
library(lubridate)
indx <- lapply(unique(mdy(df[,2])), '%within%', interval(mdy(df[,1]), mdy(df[,2])))
cbind.data.frame(DAY=unique(df$END_DATE),
USAGE=unlist(lapply(indx, function(x) sum(df$DAILY_USAGE[x]))))
# DAY USAGE
# 1 6/1/2015 56280.63
# 2 6/2/2015 486.50
# 3 6/3/2015 197.00
Explanation
We can expand it to explain what is happening:
indx <- lapply(unique(mdy(df[,2])), '%within%', interval(mdy(df[,1]), mdy(df[,2])))
Each unique end date is tested for whether it falls within the range spanned by the dates in the first and second columns. mdy is a quick way to parse month/day/year strings into Date objects with lubridate. The operator %within% tests a date against an interval, and the intervals are created with interval(start, end). This gives a list of logical indices that we can subset the data by.
In our final data frame,
cbind.data.frame(DAY=unique(df$END_DATE),
creates the first column of dates.
And,
USAGE=unlist(lapply(indx, function(x) sum(df$DAILY_USAGE[x])))
takes the sum of df$DAILY_USAGE by the index that we created.
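For comparison, the same logic can be sketched in base R on the EXAMPLE1 data, without lubridate. This is an illustrative equivalent, not the answer's own code; it assumes the date columns are already of class Date, and days and usage are names introduced here.

```r
# EXAMPLE1 from the question, with dates already parsed
df <- data.frame(
  START_DATE  = as.Date(c("2015-05-01", "2015-05-04", "2015-05-11", "2015-04-30",
                          "2015-05-20", "2015-06-01", "2015-06-01", "2015-06-02")),
  END_DATE    = as.Date(c("2015-06-01", "2015-06-01", "2015-06-01", "2015-06-02",
                          "2015-06-02", "2015-06-02", "2015-06-03", "2015-06-03")),
  DAILY_USAGE = c(8175.15625, 47613.65517, 8.318181818, 0, 5, 284.5, 194, 3)
)

# Days for which a total is wanted
days <- seq(as.Date("2015-06-01"), as.Date("2015-06-03"), by = "day")

# For each day, sum DAILY_USAGE over the rows whose time frame contains it
usage <- vapply(seq_along(days), function(i) {
  d <- days[i]
  sum(df$DAILY_USAGE[df$START_DATE <= d & d <= df$END_DATE])
}, numeric(1))

data.frame(DAY = days, USAGE = usage)
```

Because days is an arbitrary sequence, this also covers days that never appear as an END_DATE, for example every day of June.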

How can I filter specifically for certain months if the days are not the same in each year?

This is probably a very simple question that has been asked already but..
I have a data frame that I have constructed from a CSV file generated in excel. The observations are not homogeneously sampled, i.e they are for "On Peak" times of electricity usage. That means they exclude different days each year. I have 20 years of data (1993-2012) and am running both non Robust and Robust LOESS to extract seasonal and linear trends.
After the decomposition has been done, I want to focus only on the observations from June through September.
How can I create a new data frame of just those results?
Sorry about the formatting, too.
Date MaxLoad TMAX
1 1993-01-02 2321 118.6667
2 1993-01-04 2692 148.0000
3 1993-01-05 2539 176.0000
4 1993-01-06 2545 172.3333
5 1993-01-07 2517 177.6667
6 1993-01-08 2438 157.3333
7 1993-01-09 2302 152.0000
8 1993-01-11 2553 144.3333
9 1993-01-12 2666 146.3333
10 1993-01-13 2472 177.6667
As Joran notes, you don't need anything other than base R:
## Reproducible data
df <-
data.frame(Date = seq(as.Date("2009-03-15"), as.Date("2011-03-15"), by="month"),
MaxLoad = floor(runif(25,2000,3000)), TMAX=runif(25,100,200))
## One option
df[months(df$Date) %in% month.name[6:9],]
# Date MaxLoad TMAX
# 4 2009-06-15 2160 188.4607
# 5 2009-07-15 2151 164.3946
# 6 2009-08-15 2694 110.4399
# 7 2009-09-15 2460 150.4076
# 16 2010-06-15 2638 178.8341
# 17 2010-07-15 2246 131.3283
# 18 2010-08-15 2483 112.2635
# 19 2010-09-15 2174 160.9724
## Another option: strftime() will be more _generally_ useful than months()
df[as.numeric(strftime(df$Date, "%m")) %in% 6:9,]
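One caveat worth sketching: months() returns month names in the current locale, so comparing against month.name assumes an English locale. A locale-independent alternative, shown here on made-up data, is the numeric month from as.POSIXlt (where $mon runs from 0 to 11). The data frame and the name summer are illustrative only.

```r
# Made-up monthly data, similar in shape to the example above
dates <- seq(as.Date("2009-03-15"), as.Date("2011-03-15"), by = "month")
df <- data.frame(Date = dates, MaxLoad = seq_along(dates))

# $mon is zero-based, so add 1; the parentheses matter because
# %in% binds more tightly than +
summer <- df[(as.POSIXlt(df$Date)$mon + 1) %in% 6:9, ]
```

The result keeps only the June to September rows of every year, regardless of which days were sampled.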

Aggregating daily data using quantmod 'to.weekly' function creates weekly data ending on Monday not Friday

I am trying to aggregate daily share price data (close only) to weekly share price data using the "to.weekly" function in quantmod. The xts object foo holds daily share price data for a stock starting from Monday 3 January 2011 and ending on Monday 20 September 2011. To aggregate this daily data I used:
tmp <- to.weekly(foo)
The above approach succeeds in that tmp now holds a series of weekly OHLC data points, as per the quantmod docs. The problem is that the series begins on Monday 3 January 2011 and each subsequent week also begins on Monday e.g. Monday 10 January, Monday 17 January and so on. I had expected the week to default to ending on Friday so that the weekly series started on Friday 7 January and ended on Friday 16 September.
I have experimented with adjusting the start and end of the data and using 'endof' or 'startof' together with the indexAt parameter but I cannot get it to return a week ending in Friday.
I am grateful for any insights received.
(Sorry, I could not find any way to attach dput file so data appears below)
foo:
2011-01-03 2802
2011-01-04 2841
2011-01-05 2883
2011-01-06 2948
2011-01-07 2993
2011-01-10 2993
2011-01-11 3000
2011-01-12 3000
2011-01-13 3025
2011-01-14 2970
2011-01-17 2954
2011-01-18 2976
2011-01-19 2992
2011-01-20 2966
2011-01-21 2940
2011-01-24 2969
2011-01-25 2996
2011-01-26 2982
2011-01-27 3035
2011-01-28 3075
2011-01-31 3020
tmp:
foo.Open foo.High foo.Low foo.Close
2011-01-03 2802 2802 2802 2802
2011-01-10 2841 2993 2841 2993
2011-01-17 3000 3025 2954 2954
2011-01-24 2976 2992 2940 2969
2011-01-31 2996 3075 2982 3020
I've come up with something yielding only Close values, perhaps it can be hacked further to return OHLC series.
Assuming that foo is an xts object, first we create a logical vector marking the Fridays:
fridays = as.POSIXlt(time(foo))$wday == 5
Then we prepend it with 0:
indx <- c(0, which(fridays))
And use period.apply:
period.apply(foo, INDEX=indx, FUN=last)
Result:
[,1]
2011-01-07 2993
2011-01-14 2970
2011-01-21 2940
2011-01-28 3075
For Fridays (with occasional Thursdays due to market closures), use:
tmp = to.weekly(foo, indexAt = "endof")
For Mondays (with occasional Tuesdays due to market closures), use:
tmp = to.weekly(foo, indexAt = "startof")
Or you can create a custom vector of Dates that contains the date to be associated with each week. For instance, to force every week to be associated with Friday regardless of market closures:
customIdx = seq(from = as.Date("2011-01-07"), by = 7, length.out = nrow(tmp))
index(tmp) = customIdx

Resources