Sum over two-week intervals - r

Suppose I have a daily rain data.frame like this:
df.meteoro = data.frame(
  Dates = seq(as.Date("2017/1/19"), as.Date("2018/1/18"), "days"),
  rain = rnorm(length(seq(as.Date("2017/1/19"), as.Date("2018/1/18"), "days")))
)
I'm trying to sum the accumulated rain over 14-day intervals with this code:
library(tidyverse)
library(lubridate)
df.rain <- df.meteoro %>%
  mutate(TwoWeeks = round_date(Dates, "14 days")) %>%
  group_by(TwoWeeks) %>%
  summarise(sum_rain = sum(rain))
The problem is that it isn't starting on 2017-01-19 but on 2017-01-15, and I was expecting my output dates to be:
"2017-02-02" "2017-02-16" "2017-03-02" "2017-03-16" "2017-03-30" "2017-04-13"
"2017-04-27" "2017-05-11" "2017-05-25" "2017-06-08" "2017-06-22" "2017-07-06" "2017-07-20"
"2017-08-03" "2017-08-17" "2017-08-31" "2017-09-14" "2017-09-28" "2017-10-12" "2017-10-26"
"2017-11-09" "2017-11-23" "2017-12-07" "2017-12-21" "2018-01-04" "2018-01-18"
TL;DR I have a year-long daily rain data.frame and want to sum the accumulated rain for the dates above.
Please help.

Using round_date in the way you have shown will not give you the 14-day periods you expect. I have taken a different approach in this solution: generate a sequence of dates between your first and last dates, group them into 14-day periods, then join the dates to your observations.
startdate = min(df.meteoro$Dates)
enddate = max(df.meteoro$Dates)

# one row per calendar day, labelled with the start of its 14-day period
dateseq =
  data.frame(Dates = seq.Date(startdate, enddate, by = 1)) %>%
  mutate(group = as.numeric(Dates - startdate) %/% 14) %>%
  group_by(group) %>%
  mutate(starts = min(Dates))

df.rain <- df.meteoro %>%
  right_join(dateseq) %>%
  group_by(starts) %>%
  summarise(sum_rain = sum(rain))
head(df.rain)
# A tibble: 6 x 2
  starts     sum_rain
  <date>        <dbl>
1 2017-01-19     6.09
2 2017-02-02     5.55
3 2017-02-16    -3.40
4 2017-03-02     2.55
5 2017-03-16    -0.12
6 2017-03-30     8.95
The right join to the date sequence ensures that if missing observation days spanned a complete time period, that period would still be listed in the result (though in your case you have a complete year of dates anyway).
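For instance, here is a minimal sketch (df.gappy is a made-up name) that drops one full fortnight of observations; thanks to the right join, the period starting 2017-02-02 still appears, with an NA sum:

# drop the entire second fortnight (2017-02-02 through 2017-02-15)
df.gappy <- df.meteoro %>%
  filter(!between(Dates, as.Date("2017-02-02"), as.Date("2017-02-15")))
df.gappy %>%
  right_join(dateseq, by = "Dates") %>%
  group_by(starts) %>%
  summarise(sum_rain = sum(rain))
# the row for starts == 2017-02-02 is still present, with sum_rain = NA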

round_date rounds to the nearest multiple of the unit (here, 14 days), but the multiples are counted from a fixed origin rather than from your first date; with day units lubridate appears to restart the count at the beginning of each month, which is why 2017-01-19 snaps to 2017-01-15 (2017-01-01 + 14 days). Either way, it doesn't line up with your purpose.
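A quick illustration with the first date in your data, which lands on the 2017-01-15 boundary the question reports:

round_date(as.Date("2017-01-19"), "14 days")
#> [1] "2017-01-15"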
To get what you want, you can do the following:
df.rain = df.meteoro %>%
  mutate(days_since_start = as.numeric(Dates - as.Date("2017/1/18")),
         TwoWeeks = as.Date("2017/1/18") + 14 * ceiling(days_since_start / 14)) %>%
  group_by(TwoWeeks) %>%
  summarise(sum_rain = sum(rain))
This computes days_since_start as the number of days since 2017/1/18 and then manually rounds up to the next multiple of two weeks.
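To see the arithmetic, days 1 through 14 after the origin all round up to the first break date and days 15 through 28 to the second:

d <- c(1, 14, 15, 28)
as.Date("2017/1/18") + 14 * ceiling(d / 14)
#> [1] "2017-02-01" "2017-02-01" "2017-02-15" "2017-02-15"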

Assuming you want to round to the closest of the dates you have specified, I think the following will work:
library(plyr) # for ddply()
targetDates <- seq(ymd("2017-02-02"), ymd("2018-01-18"), by = '14 days')
df.meteoro$Dates <- targetDates[sapply(df.meteoro$Dates, function(x) which.min(abs(interval(targetDates, x))))]
sum_rain <- ddply(df.meteoro, .(Dates), summarize, sum_rain = sum(rain, na.rm = TRUE))
As you can see, not all dates end up with the same number of observations. Date "2017-02-02", for instance, gets all the records from "2017-01-19" through "2017-02-09", which is 22 records. From "2017-02-10" onward, dates are rounded to "2017-02-16", and so on.
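A quick way to inspect the group sizes after the snapping step:

table(df.meteoro$Dates)
# 2017-02-02 gets 22 observations (2017-01-19 through 2017-02-09);
# interior target dates get roughly 14 each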

This may be a cheat, but assuming each row/observation is a separate day, why not just group by every 14 rows and sum?
# Assign interval groups, each 14 rows
df.meteoro$my_group <- rep(1:100, each = 14, length.out = nrow(df.meteoro))

# Grab interval names
my_interval_names <- df.meteoro %>%
  select(-rain) %>%
  group_by(my_group) %>%
  slice(1)

# Summarise
df.meteoro %>%
  group_by(my_group) %>%
  summarise(rain = sum(rain)) %>%
  left_join(., my_interval_names)
#> Joining, by = "my_group"
#> # A tibble: 27 x 3
#>    my_group   rain Dates     
#>       <int>  <dbl> <date>    
#>  1        1  3.86  2017-01-19
#>  2        2 -0.581 2017-02-02
#>  3        3 -0.876 2017-02-16
#>  4        4  1.80  2017-03-02
#>  5        5  3.79  2017-03-16
#>  6        6 -3.50  2017-03-30
#>  7        7  5.31  2017-04-13
#>  8        8  2.57  2017-04-27
#>  9        9 -1.33  2017-05-11
#> 10       10  5.41  2017-05-25
#> # ... with 17 more rows
Created on 2018-03-01 by the reprex package (v0.2.0).

Related

R function for selecting x observations per hour/day that are evenly separated

I am trying to create a function that selects a defined number of observations per defined time frame.
I have managed to create a function that subsets for one observation per hour:
#create example df
timestamp <- seq(ISOdate(2022,05,20), ISOdate(2022,05,22), "min")
Animal_ID <- c(rep("Avi",length(timestamp)), rep("David",length(timestamp)))
timestamp <- as.character(c(timestamp, timestamp))
df <- as.data.frame(cbind(Animal_ID,timestamp))
#function (requires dplyr and lubridate)
library(dplyr)
library(lubridate)
OnePerHour <- function(df){
  dataOnePerHour <- df %>%
    group_by(Animal_ID, hour(timestamp), as.Date(timestamp)) %>%
    filter(row_number(Animal_ID) == 1)
  return(dataOnePerHour)
}
However, I can't work out how to extend this so I can select more observations per hour that are evenly distributed.
In this example there is an observation every minute, but in the "real dataset" there might be only three or four observations per hour, 15 minutes apart, for one animal, and an observation every second for another. So, let's say I am looking for 3 obs/hr and the observation frequency is 1/min: minutes 1, 21, and 41 are exactly what I am looking for. If there are only three observations (15 minutes apart), I would like to include all of them.
Any help will be much appreciated.
Idan
Here's a solution that creates times_per_hour equally spaced intervals within every hour for every Animal_ID, then chooses the first observation within each interval. If there aren't any observations within an interval, however, no observation will be chosen. So if you want 3 per hour and you have observations at 12:01, 12:02, and 12:03, you're only going to get the first one, because there were no observations between 12:20-12:40 or 12:40-13:00.
library(dplyr)
library(tidyr)
library(lubridate)
#create example df
timestamp <- seq(ISOdate(2022,05,20), ISOdate(2022,05,22), "min")
Animal_ID <- c(rep("Avi",length(timestamp)), rep("David",length(timestamp)))
timestamp <- as.character(c(timestamp, timestamp))
df <- as.data.frame(cbind(Animal_ID,timestamp))
get_observations <- function(df, times_per_hour, min_date_time, max_date_time) {
  # make a dataframe with all possible minutes between min and max times
  timespan <- expand_grid(Animal_ID = unique(df$Animal_ID),
                          # replace with min and max datetimes of the data
                          timestamp = seq(min_date_time, max_date_time, "min"))
  ideal_times <- timespan %>%
    group_by(Animal_ID, hour = hour(timestamp), date = as.Date(timestamp)) %>%
    # select the beginning of the interval from which you want an observation
    slice(seq(1, n(), by = 60/times_per_hour)) %>%
    mutate(time_interval = interval(timestamp,
                                    lead(timestamp, default = max_date_time))) %>%
    select(-timestamp)
  df %>%
    mutate(hour = hour(timestamp), date = as.Date(timestamp)) %>%
    # join so every time interval is matched with all the obs in that hour
    right_join(ideal_times, by = c("Animal_ID", "hour", "date")) %>%
    # then remove all the obs that aren't in the exact interval
    filter(as_datetime(timestamp) %within% time_interval) %>%
    group_by(Animal_ID, time_interval) %>%
    # then take the first observation
    slice(1) %>%
    ungroup() %>%
    select(-time_interval)
}
# choose 10% so that observations are not equally spaced
sample_df <- slice_sample(df, prop = .1)
get_observations(sample_df, times_per_hour = 3,
                 min_date_time = ISOdate(2022,05,20), max_date_time = ISOdate(2022,05,22))
#> # A tibble: 259 × 4
#>    Animal_ID timestamp            hour date      
#>    <chr>     <chr>               <int> <date>    
#>  1 Avi       2022-05-20 12:00:00    12 2022-05-20
#>  2 Avi       2022-05-20 12:32:00    12 2022-05-20
#>  3 Avi       2022-05-20 12:48:00    12 2022-05-20
#>  4 Avi       2022-05-20 13:15:00    13 2022-05-20
#>  5 Avi       2022-05-20 13:35:00    13 2022-05-20
#>  6 Avi       2022-05-20 13:52:00    13 2022-05-20
#>  7 Avi       2022-05-20 14:17:00    14 2022-05-20
#>  8 Avi       2022-05-20 14:28:00    14 2022-05-20
#>  9 Avi       2022-05-20 14:48:00    14 2022-05-20
#> 10 Avi       2022-05-20 15:16:00    15 2022-05-20
#> # … with 249 more rows
Created on 2022-05-23 by the reprex package (v2.0.1)
If I understand correctly, I might do something like this. It is maybe a bit long: I first add the date, hour, and minute, and calculate the time difference from the previous time point per animal ID.
Then I calculate the number of observations in each hour and create a logical filter column based on your description.
df <- df %>%
  mutate(dt = as.Date(timestamp),
         hr = hour(timestamp),
         m = minute(timestamp)) %>%
  group_by(Animal_ID) %>%
  mutate(time_diff = m - lag(m))

df <- df %>%
  group_by(Animal_ID, dt, hr) %>%
  mutate(num_in_hour = n(),
         filterlogic = (num_in_hour == 60 & m %in% c(1, 21, 41)) |
           (num_in_hour %in% c(3, 4) & time_diff == 15)) %>%
  filter(filterlogic == TRUE)

How to group by a time window in R?

I want to find the highest average departure delay in time windows of one week in the flights dataset of the nycflights13 package.
I've used
seq(min(flights$time_hour), max(flights$time_hour), by = "week")
to find dates one week apart. But I don't know how to group by these dates to find the average departure delay of each period. How can I do this using the tidyverse?
Thank you for your help in advance.
We can use {lubridate} to round each date to the nearest week. Two wrinkles to think about:
To count weeks beginning with Jan 1, you'll need to specify the week_start arg. Otherwise lubridate will count from the previous Sunday, which in this case is 2012-12-30.
You also need to deal with incomplete weeks. In this case, the last week of the year only contains one day. I chose to drop weeks with < 7 days for this demo.
library(tidyverse)
library(lubridate)
library(nycflights13)
data(flights)
# what weekday was the first of the year?
weekdays(min(flights$time_hour))
#> [1] "Tuesday"
# Tuesday = day #2 so we'll pass `2` to `week_start`
flights %>%
  group_by(week = floor_date(time_hour, unit = "week", week_start = 2)) %>%
  filter(n_distinct(day) == 7) %>% # drop incomplete weeks
  summarize(dep_delay_avg = mean(dep_delay, na.rm = TRUE)) %>%
  arrange(desc(dep_delay_avg))
#> # A tibble: 52 x 2
#>    week                dep_delay_avg
#>    <dttm>                      <dbl>
#>  1 2013-06-25 00:00:00          40.6 # week of June 25 had longest delays
#>  2 2013-07-09 00:00:00          24.4
#>  3 2013-12-17 00:00:00          24.0
#>  4 2013-07-23 00:00:00          21.8
#>  5 2013-03-05 00:00:00          21.7
#>  6 2013-04-16 00:00:00          21.6
#>  7 2013-07-16 00:00:00          20.4
#>  8 2013-07-02 00:00:00          20.1
#>  9 2013-12-03 00:00:00          19.9
#> 10 2013-05-21 00:00:00          19.2
#> # ... with 42 more rows
Created on 2022-03-06 by the reprex package (v2.0.1)
Edit: as requested by OP, here is a solution using only core {tidyverse} packages, without {lubridate}:
library(tidyverse)
library(nycflights13)
data(flights)
flights %>%
  group_by(week = as.POSIXlt(time_hour)$yday %/% 7) %>%
  filter(n_distinct(day) == 7) %>%
  summarize(
    week = as.Date(min(time_hour)),
    dep_delay_avg = mean(dep_delay, na.rm = TRUE)
  ) %>%
  arrange(desc(dep_delay_avg))
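For reference, the base-R week index works because $yday is the 0-based day of the year, so integer division by 7 puts Jan 1-7 in week 0, Jan 8-14 in week 1, and so on:

as.POSIXlt(as.Date(c("2013-01-01", "2013-01-07", "2013-01-08")))$yday %/% 7
#> [1] 0 0 1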

mutate(percentage = n / sum(n)) - not correctly calculating percentage

I have been working on the code below to calculate the percentages per hour (time column, d h) for each behaviour; however, it is mixing up the order of the time column and incorrectly calculating the percentages. I have attached a sample of the output and some of the data. Any help is greatly appreciated!
S06Behav <- S06 %>%
  group_by(Time, PredictedBehaviorFull, Context) %>%
  summarise(count = n())

S06Proportions <- S06Behav %>%
  group_by(Time, PredictedBehaviorFull, Context) %>%
  summarise(n = sum(count)) %>%
  mutate(percentage = n / sum(n))
A sample of my data is https://pastebin.com/KE0xEzk7
Thank you
I think the reason the percentages are not being calculated as expected is that, according to the code, you are computing n / sum(n) within groups that each contain only a single value, hence a proportion of 1.0.
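Here is a minimal sketch of the pitfall with made-up numbers (the column names are illustrative only):

library(dplyr)
counts <- tibble(Time = c("19h", "19h", "20h"),
                 Behaviour = c("Bait", "Boat", "Bait"),
                 n = c(10, 5, 7))

# grouping by the same columns that define the counts: every group is a
# single row, so each percentage comes out as 1
counts %>%
  group_by(Time, Behaviour) %>%
  mutate(percentage = n / sum(n))

# grouping by Time alone: sum(n) is the hourly total, giving real proportions
counts %>%
  group_by(Time) %>%
  mutate(percentage = n / sum(n))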
I'm not fully sure from your question, but if by "mixing up the order of the time column" you mean the whole Time column is incorrect, then you might be better off using the lubridate package to build your Time column.
library(dplyr)
library(stringr) # for str_c()
library(lubridate)

S06 %>%
  # first we convert the Timestamp column into datetime format
  mutate(
    Timestamp = ymd_hms(Timestamp)
  ) %>%
  # then, we can extract the components from the Timestamp
  mutate(
    date = date(Timestamp),
    hour = lubridate::hour(Timestamp),
    timestamp_hour = ymd_h(str_c(date, ' ', hour))
  ) %>%
  {. ->> S06_a} # this saves the data as 'S06_a' to use next
Then if I understand correctly you want to determine the percentage of observations of each behaviour type, per hour.
S06_a %>%
  # work out the number of observations per hour, context and behaviour
  group_by(timestamp_hour, Context, PredictedBehaviorFull) %>%
  summarise(
    behav_total = n()
  ) %>%
  # calculate the total number of observations per hour
  group_by(timestamp_hour) %>%
  mutate(
    hour_total = sum(behav_total),
    percentage = behav_total / hour_total
  )
Which produces the following output:
# A tibble: 7 x 6
# Groups:   timestamp_hour [3]
  timestamp_hour      Context PredictedBehaviorFull behav_total hour_total percentage
  <dttm>              <chr>   <chr>                       <int>      <int>      <dbl>
1 2020-05-23 19:00:00 Present Bait                         1971       2184    0.902
2 2020-05-23 19:00:00 Present Boat                           96       2184    0.0440
3 2020-05-23 19:00:00 Present No_OP                         117       2184    0.0536
4 2020-05-24 10:00:00 Absent  Bait                            9       1202    0.00749
5 2020-05-24 10:00:00 Absent  No_OP                        1193       1202    0.993
6 2020-05-24 11:00:00 Absent  Bait                            5        129    0.0388
7 2020-05-24 11:00:00 Absent  No_OP                         124        129    0.961
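If you save that result (say as S06_percentages, a name made up here for illustration), a quick check confirms the percentages within each hour sum to 1:

S06_percentages %>%
  group_by(timestamp_hour) %>%
  summarise(total = sum(percentage))
# each total should be 1 (the percentages printed above are rounded)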

Calculate number of pending tasks at given time points (ideally with dplyr)

I have a database containing a list of events. Each event has an associated start date, and a date when the event ended or was completed, e.g.:
dataset <- tibble(
  eventid = sample(1:100, 25, replace = TRUE),
  start_date = sample(seq(as.Date('2011/01/01'), as.Date('2012/01/01'), by = "day"), 25),
  completed_date = sample(seq(as.Date('2012/01/01'), as.Date('2014/01/01'), by = "day"), 25)
)
> dataset
# A tibble: 25 x 3
   eventid start_date completed_date
     <int> <date>     <date>        
 1      57 2011-01-14 2013-01-07
 2      97 2011-01-21 2011-03-03
 3      58 2011-01-26 2011-02-05
 4      25 2011-03-22 2013-07-20
 5       8 2011-04-20 2012-07-16
 6      81 2011-04-26 2013-03-04
 7      42 2011-05-02 2012-01-16
 8      77 2011-05-03 2012-08-14
 9      78 2011-05-21 2013-09-26
10      49 2011-05-22 2013-01-04
# ... with 15 more rows
I am trying to produce a rolling "snapshot" of how many tasks were pending at different points in time, e.g. month by month. Expected result:
# A tibble: 25 x 2
  month      count
  <date>     <int>
1 2011-01-01     0
2 2011-02-01     3
3 2011-03-01     2
4 2011-04-01     2
5 2011-05-01     4
6 2011-06-01     8
I have attempted to group my variables using group_by(period=floor_date(start_date,"month")), but I'm a bit stuck and would appreciate a pointer in the right direction!
I would prefer a solution using dplyr if possible.
Thanks!
You can expand the rows for each month included in the range of dates with map2 from purrr. map2 iterates over multiple inputs simultaneously; in this case, it iterates through the start and end dates at the same time.
In each iteration, it will create a monthly sequence using seq (or seq.Date) from the start month to the end month (determined with floor_date). The result is nested for each row of data (since one row can have multiple months in the sequence), so unnest is needed afterwards.
The transmute will add a new variable called month_year (and drop the old ones), using substr to extract only the year and month (no day), that is, the first through seventh characters of the date.
Then, you can group_by the month-year and count up the number of pending projects for each month_year.
I included set.seed so the data below is reproducible.
library(dplyr)
library(tidyr)
library(purrr)
library(lubridate)
dataset %>%
  mutate(month = map2(floor_date(start_date, "month"),
                      floor_date(completed_date, "month"),
                      seq.Date,
                      by = "month")) %>%
  unnest(month) %>%
  transmute(month_year = substr(month, 1, 7)) %>%
  group_by(month_year) %>%
  summarise(count = n())
Output
   month_year count
   <chr>      <int>
 1 2011-01        1
 2 2011-02        3
 3 2011-03        9
 4 2011-04       10
 5 2011-05       13
 6 2011-06       15
 7 2011-07       16
 8 2011-08       18
 9 2011-09       19
10 2011-10       20
# … with 22 more rows
If you want to exclude the completed month (except when the start month and completed month are the same, if that can exist), you can subtract a day from the floor of the completed month so that the sequence stops one month earlier. Using pmax ensures that when the start and end months are the same, the month is still counted.
Here is the modified mutate with map2:
mutate(month = map2(floor_date(start_date, "month"),
                    pmax(floor_date(completed_date, "month") - 1,
                         floor_date(start_date, "month")),
                    seq.Date,
                    by = "month"))
Data
set.seed(123)
dataset <- tibble(
eventid = sample(1:100, 25, replace=TRUE),
start_date = sample(seq(as.Date('2011/01/01'), as.Date('2012/01/01'), by="day"), 25),
completed_date = sample(seq(as.Date('2012/01/01'), as.Date('2014/01/01'), by="day"), 25)
)

Missing data in R - How to skip grouping of days with missing information?

I have hourly temperature measurements and I wish to calculate the average per day, but only for complete days (i.e. with 24 measurements). Incomplete days should then be summarized as NA.
I have grouped the values per year, month and day and called summarize().
I have three months of data missing, which appear as a gap in my ggplot output, and that is what I want to achieve for the rest. The problem is that when I call summarize() to calculate the mean of my values, days with only 1 or 2 measurements also get averaged. Only days where all 24 values are missing appear as NA.
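(This is mean() itself at work: with na.rm = TRUE it averages however few values remain, and only returns NaN when every value is missing, e.g.:

mean(c(1.2, rep(NA_real_, 23)), na.rm = TRUE)
#> [1] 1.2
mean(rep(NA_real_, 24), na.rm = TRUE)
#> [1] NaN
)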
Date TempUrb TempRur UHI
1 2011-03-21 22:00:00 10.1 11.67000 -1.570000
2 2011-03-21 23:00:00 9.9 11.67000 -1.770000
3 2011-03-22 00:00:00 10.9 11.11000 -0.210000
4 2011-03-22 01:00:00 10.7 10.56000 0.140000
5 2011-03-22 02:00:00 9.7 10.00000 -0.300000
6 2011-03-22 03:00:00 9.5 10.00000 -0.500000
7 2011-03-22 04:00:00 9.4 8.89000 0.510000
8 2011-03-22 05:00:00 8.4 8.33500 0.065000
9 2011-03-22 06:00:00 8.2 7.50000 0.700000
AvgUHI <- UHI %>%
  group_by(year(Date), add = TRUE) %>%
  group_by(month(Date), add = TRUE) %>%
  group_by(day(Date), add = TRUE, .drop = TRUE) %>%
  summarize(AvgUHI = mean(UHI, na.rm = TRUE))
# A tibble: 2,844 x 4
# Groups:   year(Date), month(Date) [95]
     `year(Date)` `month(Date)` `day(Date)`      AvgUHI
            <int>         <int>       <int>       <dbl>
1476         2015             4           4  0.96625000
1477         2015             4           5 -0.11909722
1478         2015             4           6 -0.60416667
1479         2015             4           7 -0.92916667
1480         2015             4           8          NA
1481         2015             4           9          NA
AvgUHI <- AvgUHI %>%
  group_by(`year(Date)`, add = TRUE) %>%
  group_by(`month(Date)`, add = TRUE) %>%
  summarize(AvgUHI = mean(AvgUHI, na.rm = TRUE))
# A tibble: 95 x 3
# Groups:   year(Date) [9]
   `year(Date)` `month(Date)`      AvgUHI
          <int>         <int>       <dbl>
50         2015             4 0.580887346
51         2015             5 0.453815051
52         2015             6 0.008479618
As you can see in the final table above, I have an average for 2015-04, even though I am missing data in that month (2015-04-08 to 2015-04-09 in this example, shown in the second table).
The same happens when I calculate AvgUHI and I'm missing hourly data.
I simply would like the AvgUHI for 2015-04 in the last table to be NA.
The following will give a dataframe aggregated by day, where only the complete days, with 24 observations, are not NA. Then you can group by month to get the final dataframe.
UHI %>%
  mutate(Day = as.Date(Date)) %>%
  group_by(Day) %>%
  mutate(n = n(), tmpUHI = if_else(n == 24, UHI, NA_real_)) %>%
  summarize(AvgUHI = mean(tmpUHI)) %>%
  full_join(data.frame(Day = seq(min(.$Day), max(.$Day), by = "day"))) %>%
  arrange(Day) -> AvgUHI
For hours look at Rui Barradas' answer. For months the following code worked:
AvgUHI %>%
  group_by(year(Day), add = TRUE) %>%
  group_by(month(Day), add = TRUE) %>%
  mutate(sum = sum(is.na(AvgUHI)),
         tmpUHI = if_else(sum <= 10, AvgUHI, NA_real_)) %>%
  summarise(AvgUHI = mean(tmpUHI, na.rm = TRUE)) -> AvgUHI
