Calculate maximum values for three consecutive days of a tibble in R

I am doing time series analysis. Part of my data is as follows:
# A tibble: 6 x 3
time DOY Value
<dttm> <dbl> <dbl>
1 2015-01-08 12:30:00 8 0.664
2 2015-01-08 13:00:00 8 0.647
3 2015-01-11 14:00:00 11 0.669
4 2015-01-11 15:00:00 11 0.644
5 2015-02-04 12:30:00 35 0.664
6 2015-02-04 13:00:00 35 0.647
I would like to calculate the maximum values of 7 consecutive days of the data. For example:
# A tibble: 6 x 4
time DOY Value Max
<dttm> <dbl> <dbl> <dbl>
1 2015-01-08 12:30:00 8 0.664 11.669
2 2015-01-08 13:00:00 8 0.647 11.669
3 2015-01-11 14:00:00 11 0.669 11.669
4 2015-01-11 15:00:00 11 0.644 11.669
5 2015-02-04 12:30:00 35 0.664 35.664
6 2015-02-04 13:00:00 35 0.647 35.664

Welcome to R and Stack Overflow. As mentioned above, you will find many friends here if you provide a reproducible example and explain what you have done and/or where things go wrong for you. This helps others to help you.
Based on your data fragment, I show some basic operations that I think might help you. You may still need to adapt the principles to your problem case.
data
I turned your example into a tibble. Please note: when you work with dates, times, or date-times, I recommend you use the respective variable type. This will give you access to helpful functions, etc.
Please also note you mentioned a 6 x 3 tibble above. In fact, your data prints as four columns: Date, time, DOY, and Value!
library(dplyr) # basic dataframe/tibble operations
library(lubridate) # for datetime handling
df <- tribble(
~Date, ~time, ~DOY, ~Value
,"2015-01-08", "12:30:00", 8, 0.664
,"2015-01-08", "13:00:00", 8, 0.647
,"2015-01-11", "14:00:00", 11, 0.669
,"2015-01-11", "15:00:00", 11, 0.644
,"2015-02-04", "12:30:00", 35, 0.664
,"2015-02-04", "13:00:00", 35, 0.647
)
df <- df %>%
mutate(timestamp = ymd_hms(paste(Date, time)))
This yields:
df
# A tibble: 6 x 5
Date time DOY Value timestamp
<chr> <chr> <dbl> <dbl> <dttm>
1 2015-01-08 12:30:00 8 0.664 2015-01-08 12:30:00
2 2015-01-08 13:00:00 8 0.647 2015-01-08 13:00:00
3 2015-01-11 14:00:00 11 0.669 2015-01-11 14:00:00
4 2015-01-11 15:00:00 11 0.644 2015-01-11 15:00:00
5 2015-02-04 12:30:00 35 0.664 2015-02-04 12:30:00
6 2015-02-04 13:00:00 35 0.647 2015-02-04 13:00:00
Note: timestamp is now a datetime object (dttm).
binning of data
It is not fully clear what your consecutive 7 days are and/or how you "group" them.
I assume you want to pick 7 days of a week.
As datetime is dttm, we can use the power of {lubridate} and extract the week from the datetime.
Note: you may want to bin/group your data differently. Think about what you want to achieve here and adapt this accordingly.
df <- df %>% mutate(bin = week(timestamp))
df
# A tibble: 6 x 6
Date time DOY Value timestamp bin
<chr> <chr> <dbl> <dbl> <dttm> <dbl>
1 2015-01-08 12:30:00 8 0.664 2015-01-08 12:30:00 2
2 2015-01-08 13:00:00 8 0.647 2015-01-08 13:00:00 2
3 2015-01-11 14:00:00 11 0.669 2015-01-11 14:00:00 2
4 2015-01-11 15:00:00 11 0.644 2015-01-11 15:00:00 2
5 2015-02-04 12:30:00 35 0.664 2015-02-04 12:30:00 5
6 2015-02-04 13:00:00 35 0.647 2015-02-04 13:00:00 5
If you want to work on "7 consecutive days" you will need to identify the groups of 7 days. Again, there are different ways to do this; check what the integer-division (%/%) and modulo (%%) operators do and how to apply them to your DOY (see the clarification section below).
operating on your groups/bins
You describe looking for the maximum per bin (7 days ~ week).
{dplyr} offers grouped operations for such problems. Read up on them:
df %>%
group_by(bin) %>%
summarise(MaxValue = max(Value)) # create a new variable and assign the max of each group to it
# A tibble: 2 x 2
bin MaxValue
<dbl> <dbl>
1 2 0.669
2 5 0.664
Obviously, you can perform many operations (summaries of your bins/groups).
Note: You can create bins on multiple variables. Read up on group_by() and summarise(..., .groups = "drop"), if you want to use this interim tibble for further calculations.
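For example, a minimal sketch (using the df and bin from above) that groups on two variables at once:
df %>%
group_by(bin, Date) %>% # group on the week bin and the calendar date together
summarise(MaxValue = max(Value), .groups = "drop") # drop grouping for further calculations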
Hope this gets you started.
clarification on grouping by 7 days
If you have a sequence of (integer) numbers, there is a neat way to group it into n-element bins: integer division.
In your case the data already comes with a day-of-year DOY variable. For completeness: with lubridate you can pull the DOY from a timestamp with the function yday(), i.e. df %>% mutate(DOY = yday(timestamp)).
# let's use integer division to group our DOYs into groups of 7
##--------- does not look at date or day
##--------- DOY 0-6 -> bin 0, DOY 7-13 -> bin 1, ..., DOY 35-41 -> bin 5
df <- df %>%
mutate(bin = DOY %/% 7)
This yields:
# A tibble: 6 x 6
Date time DOY Value timestamp bin
<chr> <chr> <dbl> <dbl> <dttm> <dbl>
1 2015-01-08 12:30:00 8 0.664 2015-01-08 12:30:00 1
2 2015-01-08 13:00:00 8 0.647 2015-01-08 13:00:00 1
3 2015-01-11 14:00:00 11 0.669 2015-01-11 14:00:00 1
4 2015-01-11 15:00:00 11 0.644 2015-01-11 15:00:00 1
5 2015-02-04 12:30:00 35 0.664 2015-02-04 12:30:00 5
6 2015-02-04 13:00:00 35 0.647 2015-02-04 13:00:00 5
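Side note: if you want the bins to run exactly over DOY 1-7, 8-14, ..., a small variant shifts the DOY by one before dividing:
df <- df %>%
mutate(bin = (DOY - 1) %/% 7) # DOY 1-7 -> 0, 8-14 -> 1, ..., 29-35 -> 4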
And then build your max summary as before on the (new) grouping variable:
df %>%
group_by(bin) %>%
summarise(MaxValue = max(Value))
# A tibble: 2 x 2
bin MaxValue
<dbl> <dbl>
1 1 0.669
2 5 0.664
For the example data given, the result is identical. However, with your full dataset there is an offset between "weeks" (which have a defined start day) and cutting your DOYs into bins of 7 consecutive days, so you will get a different summary (unless the first day of the week (*) coincides with DOY 1).
(*): in lubridate you can set weeks to start on Monday or Sunday via a parameter (in case you ever need this); a sketch follows.
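A minimal sketch of that, assuming the df from above; floor_date() takes a week_start argument (1 = Monday, 7 = Sunday):
df <- df %>%
mutate(bin = floor_date(timestamp, unit = "week", week_start = 1)) # weekly bins starting on Monday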

Related

Need code for "gap filling" a time series

I have an hourly time series as provided in the example below. I need to fill the NA values, but only for the morning hours from 6:00 AM to 9:00 AM; the gap filling should be done by averaging the remaining hours of the same day, and likewise for the morning of each other day.
set.seed(3)
df <- data.frame( timestamp = seq(as.POSIXct('2022-01-01', tz='utc'),as.POSIXct('2022-01-10 23:00', tz='utc'), by = '1 hour') ,
value = runif(240))
df$value[runif(nrow(df)) < 0.3] <- NA
If I understand you correctly, this is one way to solve the task in dplyr:
df %>%
dplyr::mutate(after = ifelse(lubridate::hour(timestamp) > 10, value, NA),
day = format(timestamp, format = '%Y-%m-%d')) %>%
dplyr::group_by(day) %>%
dplyr::mutate(value = ifelse(lubridate::hour(timestamp) < 10 & is.na(value), mean(after, na.rm = TRUE), value)) %>%
dplyr::ungroup() %>%
dplyr::select(-after, -day)
# A tibble: 240 x 2
timestamp value
<dttm> <dbl>
1 2022-01-01 00:00:00 0.427
2 2022-01-01 01:00:00 0.808
3 2022-01-01 02:00:00 0.385
4 2022-01-01 03:00:00 0.427
5 2022-01-01 04:00:00 0.602
6 2022-01-01 05:00:00 0.604
7 2022-01-01 06:00:00 0.125
8 2022-01-01 07:00:00 0.295
9 2022-01-01 08:00:00 0.578
10 2022-01-01 09:00:00 0.631
# ... with 230 more rows
# i Use `print(n = ...)` to see more rows
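If you need the exact window from the question (06:00 to 09:00), a variant of the same idea (a sketch, assuming the df above) fills those hours from the mean of the remaining hours of the same day:
df %>%
dplyr::group_by(day = as.Date(timestamp)) %>%
dplyr::mutate(morning = dplyr::between(lubridate::hour(timestamp), 6, 9), # flag 06:00-09:00
value = ifelse(morning & is.na(value),
mean(value[!morning], na.rm = TRUE), # mean of the rest of the day
value)) %>%
dplyr::ungroup() %>%
dplyr::select(-day, -morning)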

Match two tables based on a time difference criterion

I have a data table (lv_timest) with time stamps every 3 hours for each date:
# A tibble: 6 × 5
LV0_mean LV1_mean LV2_mean Date_time Date
<dbl> <dbl> <dbl> <S3:POSIXct> <date>
1 0.778 -4.12 0.736 2016-12-28 00:00:00 2016-12-28
2 0.376 -0.234 0.388 2016-12-28 03:00:00 2016-12-28
3 0.409 1.46 0.241 2016-12-28 06:00:00 2016-12-28
4 0.760 2.07 0.460 2016-12-28 09:00:00 2016-12-28
5 0.759 2.91 0.735 2016-12-28 12:00:00 2016-12-28
6 0.857 3.00 0.803 2016-12-28 15:00:00 2016-12-28
from which I would like to extract the time stamps that match as closely as possible those of another table (event_timest):
# A tibble: 6 × 4
Event_number Date_time Date Date_time_new
<int> <S3: POSIXct> <date> <S3: POSIXct>
1 75 2016-12-28 08:00:00 2016-12-28 2016-12-28 08:00:00
2 123 2016-12-30 14:02:00 2016-12-30 2016-12-30 14:00:00
3 264 2017-01-07 06:12:00 2017-01-07 2017-01-07 06:00:00
4 317 2017-01-09 10:59:00 2017-01-09 2017-01-09 11:00:00
5 318 2017-01-09 13:31:00 2017-01-09 2017-01-09 14:00:00
6 369 2017-01-11 07:24:00 2017-01-11 2017-01-11 07:00:00
For example, for row 1 in table event_timest, I would extract row 4 from table lv_timest:
Event_number Date_time.x Date.x Date_time_new LV0_mean LV1_mean LV2_mean Date_time.y Date.y
<int> <S3: POSIXct> <date> <S3: POSIXct> <dbl> <dbl> <dbl> <S3: POSIXct> <date>
75 2016-12-28 08:00:00 2016-12-28 2016-12-28 08:00:00 0.760 2.07 0.460 2016-12-28 09:00:00 2016-12-28
In fact, the time difference should not be over one hour. I thought of using the fuzzyjoin package for this and writing a function that computes the time difference between the timestamps of the two tables, in hours. However, fuzzy_inner_join replicates rows in the second table and takes several timestamps from the first table to match them.
require(lubridate)
require(fuzzyjoin)
diff_timest <- function(x, y) { abs(x %--% y %/% hours(1)) <= 1 } # time interval as hours ≤ 1 hour
match_timest <- fuzzy_inner_join(event_timest, lv_timest,
by = c("Date" = "Date",
"Date_time_new" = "Date_time"),
match_fun = list(`==`, diff_timest))
head(match_timest)
# A tibble: 6 × 9
Event_number Date_time.x Date.x Date_time_new LV0_mean LV1_mean LV2_mean Date_time.y Date.y
<int> <dttm> <date> <dttm> <dbl> <dbl> <dbl> <dttm> <date>
1 75 2016-12-28 08:00:00 2016-12-28 2016-12-28 08:00:00 0.760 2.07 0.460 2016-12-28 09:00:00 2016-12-28
2 123 2016-12-30 14:02:00 2016-12-30 2016-12-30 14:00:00 1.24 1.83 2.05 2016-12-30 15:00:00 2016-12-30
3 264 2017-01-07 06:12:00 2017-01-07 2017-01-07 06:00:00 -0.128 -5.43 2.72 2017-01-07 06:00:00 2017-01-07
4 317 2017-01-09 10:59:00 2017-01-09 2017-01-09 11:00:00 -0.0751 0.171 2.56 2017-01-09 09:00:00 2017-01-09
5 317 2017-01-09 10:59:00 2017-01-09 2017-01-09 11:00:00 -0.204 -0.797 2.28 2017-01-09 12:00:00 2017-01-09
6 318 2017-01-09 13:31:00 2017-01-09 2017-01-09 14:00:00 -0.204 -0.797 2.28 2017-01-09 12:00:00 2017-01-09
Would there be another way to do this?
Joining is always a procedure of first building all combinations of rows and then filtering. We can do this manually:
library(tidyverse)
library(lubridate)
#>
#> Attaching package: 'lubridate'
#> The following objects are masked from 'package:base':
#>
#> date, intersect, setdiff, union
datetimes_a <- tibble(
id = seq(3),
group = "A",
datetime = c("2016-12-28 00:00:00", "2016-12-28 03:00:00", "2016-12-28 23:59:59") %>% as.POSIXct()
)
datetimes_b <- tibble(
id = seq(3),
group = "B",
datetime = c("2016-12-28 00:00:10", "2016-12-28 03:20:00", "2016-12-29 00:00:02") %>% as.POSIXct()
)
datetimes_a %>%
# start with cross product of all possible pairs
expand_grid(datetimes_b %>% rename_all(~ paste0(.x, "_b"))) %>%
mutate(diff = abs(datetime - datetime_b)) %>%
# get shortest time difference
group_by(id) %>%
arrange(diff) %>%
slice(1) %>%
# time diff must be less than 1hr
filter(diff < hours(1))
#> # A tibble: 3 x 7
#> # Groups: id [3]
#> id group datetime id_b group_b datetime_b diff
#> <int> <chr> <dttm> <int> <chr> <dttm> <drtn>
#> 1 1 A 2016-12-28 00:00:00 1 B 2016-12-28 00:00:10 10 secs
#> 2 2 A 2016-12-28 03:00:00 2 B 2016-12-28 03:20:00 1200 secs
#> 3 3 A 2016-12-28 23:59:59 3 B 2016-12-29 00:00:02 3 secs
Created on 2022-02-08 by the reprex package (v2.0.1)
This also works if the nearest timepoint is on another date, e.g. right before and after midnight.
I would suggest a standard join, followed by a grouped filter to the closest instance of each timestamp:
library(tidyverse)
library(lubridate)
match_timest <- event_timest %>%
inner_join(lv_timest, by = "Date") %>%
mutate(diff = abs(as.numeric(Date_time.x - Date_time.y, units = "hours"))) %>%
group_by(Date_time.y) %>%
filter(diff <= 1 & diff == min(diff)) %>%
ungroup() %>%
select(!diff)
Note:
this will still match multiple rows if more than one row has exactly the same shortest difference from the index timestamp.
this won't match timestamps from different dates, e.g. 23:59:59 on 1/1/22 won't be matched with 00:00:00 on 1/2/22. If you'd like to do that, you can use a full Cartesian join (full_join(lv_timest, by = character())) rather than the inner_join() above, as sketched below.

How to set up the start date in a time series with hourly data

I have a time series that is hourly electricity load in Ireland. This is the data frame:
> head(load2)
# A tibble: 6 x 4
datetime fecha hora load_MWH
<dttm> <date> <dbl> <dbl>
1 2018-01-01 00:00:00 2018-01-01 0 7594
2 2018-01-01 01:00:00 2018-01-01 1 7091
3 2018-01-01 02:00:00 2018-01-01 2 6652
4 2018-01-01 03:00:00 2018-01-01 3 6308
5 2018-01-01 04:00:00 2018-01-01 4 5972
6 2018-01-01 05:00:00 2018-01-01 5 5810
I want to create a time series object with daily seasonality and start date 2018-01-01 00:00:00, with one year's worth of data; however, I am unable to get the date axis right. I tried the following:
my_ts = ts(load2[,4], frequency = 24, start= c(as.numeric(as.Date("2018-01-01")),1))
which seems to work:
head(my_ts)
Time Series:
Start = c(17532, 1)
End = c(17532, 6)
Frequency = 24
load_MWH
[1,] 7594
[2,] 7091
[3,] 6652
[4,] 6308
[5,] 5972
[6,] 5810
But the time axis of the time series is a number (i.e. the number of days since the 1970-01-01 origin), not a date format, and therefore I cannot make any operations with dates, and autoplot shows the number but not the month / year:
> autoplot(energyireland2018_ts2[,1]) + scale_x_date(date_labels = "%b/%d")
Scale for 'x' is already present. Adding another scale for 'x', which will replace the
existing scale.
Error: Invalid input: date_trans works with objects of class Date only
Similarly, I cannot use any of the forecast package functions that manipulate dates.
So the question is: how can I convert the time axis of this time series into a Date object? (still using the forecast package). Many thanks!
The ts class and the forecast package do not work well with hourly data. I suggest you use the newer tsibble class, for which there is an autoplot function in the feasts package and forecasting functions in the fable package. Here is an example using half-hourly electricity demand data.
library(feasts)
library(tsibbledata)
vic_elec
#> # A tsibble: 52,608 x 5 [30m] <Australia/Melbourne>
#> Time Demand Temperature Date Holiday
#> <dttm> <dbl> <dbl> <date> <lgl>
#> 1 2012-01-01 00:00:00 4383. 21.4 2012-01-01 TRUE
#> 2 2012-01-01 00:30:00 4263. 21.0 2012-01-01 TRUE
#> 3 2012-01-01 01:00:00 4049. 20.7 2012-01-01 TRUE
#> 4 2012-01-01 01:30:00 3878. 20.6 2012-01-01 TRUE
#> 5 2012-01-01 02:00:00 4036. 20.4 2012-01-01 TRUE
#> 6 2012-01-01 02:30:00 3866. 20.2 2012-01-01 TRUE
#> 7 2012-01-01 03:00:00 3694. 20.1 2012-01-01 TRUE
#> 8 2012-01-01 03:30:00 3562. 19.6 2012-01-01 TRUE
#> 9 2012-01-01 04:00:00 3433. 19.1 2012-01-01 TRUE
#> 10 2012-01-01 04:30:00 3359. 19.0 2012-01-01 TRUE
#> # … with 52,598 more rows
autoplot(vic_elec, Demand)
Created on 2020-09-21 by the reprex package (v0.3.0)
For information about the packages that handle these objects, see http://tidyverts.org.
For a textbook introduction to forecasting models and associated syntax, see http://OTexts.com/fpp3
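For the asker's hourly data, a minimal conversion sketch (assuming load2 as printed above) would be:
library(tsibble)
library(feasts)
load_ts <- as_tsibble(load2, index = datetime) # hourly interval inferred from the dttm column
autoplot(load_ts, load_MWH) # date-aware time axis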

Group data by group of days within months in R

I am trying to summarise this daily time series of rainfall into groups of 10-day periods within each month and calculate the accumulated rainfall.
library(tidyverse)
(dat <- tibble(
date = seq(as.Date("2016-01-01"), as.Date("2016-12-31"), by=1),
rainfall = rgamma(length(date), shape=2, scale=2)))
Therefore, I will obtain variability in the third group along the year; for instance, in January the third period has 11 days, in February 9 days, and so on. This is my try:
library(lubridate)
dat %>%
group_by(decade=floor_date(date, "10 days")) %>%
summarize(acum_rainfall=sum(rainfall),
days = n())
This is the resulting output:
# A tibble: 43 x 3
decade acum_rainfall days
<date> <dbl> <int>
1 2016-01-01 48.5 10
2 2016-01-11 39.9 10
3 2016-01-21 36.1 10
4 2016-01-31 1.87 1
5 2016-02-01 50.6 10
6 2016-02-11 32.1 10
7 2016-02-21 22.1 9
8 2016-03-01 45.9 10
9 2016-03-11 30.0 10
10 2016-03-21 42.4 10
# ... with 33 more rows
Can someone help me sum the residual periods into the third one, to always obtain 3 periods within each month? This would be the desired output (pay attention to row 3):
decade acum_rainfall days
<date> <dbl> <int>
1 2016-01-01 48.5 10
2 2016-01-11 39.9 10
3 2016-01-21 37.97 11
4 2016-02-01 50.6 10
5 2016-02-11 32.1 10
6 2016-02-21 22.1 9
One way to do this is to use if_else to apply floor_date with different arguments depending on the day value of date. If day(date) is < 30, use the normal way; if it's >= 30, use '20 days' to ensure it gets floored to day 21:
dat %>%
group_by(decade=if_else(day(date) >= 30,
floor_date(date, "20 days"),
floor_date(date, "10 days"))) %>%
summarize(acum_rainfall=sum(rainfall),
days = n())
# A tibble: 36 x 3
decade acum_rainfall days
<date> <dbl> <int>
1 2016-01-01 38.8 10
2 2016-01-11 38.4 10
3 2016-01-21 43.4 11
4 2016-02-01 34.4 10
5 2016-02-11 34.8 10
6 2016-02-21 25.3 9
7 2016-03-01 39.6 10
8 2016-03-11 53.9 10
9 2016-03-21 38.1 11
10 2016-04-01 36.6 10
# … with 26 more rows
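An equivalent sketch without the if_else, capping the 10-day index at the third period (assuming dat from above):
dat %>%
group_by(decade = floor_date(date, "month") + days(pmin((day(date) - 1) %/% 10, 2) * 10)) %>%
summarize(acum_rainfall = sum(rainfall),
days = n())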

Daily select one hour of the day for analysis

I have a dataset of one year of hourly records. For analysis, I need to extract each hour of the day separately for each month of the year: January 00h, January 01h, January 02h, January 03h, ..., ..., March 21h, March 22h, March 23h.
Thanks in advance for any useful help!
I want to select observations for specified hours of the day during a period, with datetime, filter, subset, ...
The code below (filter, month(time)) generates a fatal error:
Error: unexpected ')' in "at<-subset(groenenborgerno, timestamp=hour(time) == 01))"
groenenborgerno$timestamp <- as.POSIXct(groenenborgerno$date, format="%Y-%m-%d %H:%M:%S")
library(lubridate)
january01<-filter(atimeframe,
(month(time) == 01 & hour(time) == 01) )
Since no data is provided, I will try to answer your question with sample data:
require(lubridate)
require(tidyverse)
## Create some sample data:
time_index <- seq(from = as.POSIXct("2017-01-01 07:00"),
to = as.POSIXct("2018-01-01 18:00"), by = "hour")
value <- rnorm(n = length(time_index))
data <- data.frame(time_index,value)
data <- data %>% mutate (hour = hour(time_index),
month = month(time_index)) %>%
group_by(month,hour)
data
# A tibble: 8,772 x 4
# Groups: month, hour [288]
time_index value hour month
<dttm> <dbl> <int> <dbl>
1 2017-01-01 07:00:00 -0.626 7 1
2 2017-01-01 08:00:00 0.184 8 1
3 2017-01-01 09:00:00 -0.836 9 1
4 2017-01-01 10:00:00 1.60 10 1
5 2017-01-01 11:00:00 0.330 11 1
6 2017-01-01 12:00:00 -0.820 12 1
7 2017-01-01 13:00:00 0.487 13 1
8 2017-01-01 14:00:00 0.738 14 1
9 2017-01-01 15:00:00 0.576 15 1
10 2017-01-01 16:00:00 -0.305 16 1
# ... with 8,762 more rows
and then just filter() the hour / month combination you want, like so:
data %>% filter(hour > 12 & month == 1)
# A tibble: 347 x 4
# Groups: month, hour [11]
time_index value hour month
<dttm> <dbl> <int> <dbl>
1 2017-01-01 13:00:00 0.487 13 1
2 2017-01-01 14:00:00 0.738 14 1
3 2017-01-01 15:00:00 0.576 15 1
4 2017-01-01 16:00:00 -0.305 16 1
5 2017-01-01 17:00:00 1.51 17 1
6 2017-01-01 18:00:00 0.390 18 1
7 2017-01-01 19:00:00 -0.621 19 1
8 2017-01-01 20:00:00 -2.21 20 1
9 2017-01-01 21:00:00 1.12 21 1
10 2017-01-01 22:00:00 -0.0449 22 1
# ... with 337 more rows
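If you need every month / hour combination as its own subset (rather than one filtered slice), group_split() on the grouped data from above returns a list with one tibble per combination; a sketch:
by_month_hour <- data %>%
group_split() # one tibble per (month, hour) group; 288 groups as shown above
length(by_month_hour)
# [1] 288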
