Filter data based on subgroups in R

In reality it's much more complex, but let's say my data looks like this:
df <- data.frame(
id = c(1,1,1,2,2,2,2,3,3,3),
event = c(0,0,0,1,1,1,1,0,0,0),
day = c(1,3,3,1,6,6,7,1,4,6),
time = c("2016-10-25 14:00:00", "2016-10-27 12:00:15", "2016-10-27 15:30:00",
"2016-10-23 11:00:00", "2016-10-28 08:00:15", "2016-10-28 23:00:00", "2016-10-29 12:00:00",
"2016-10-24 15:00:00", "2016-10-27 15:00:15", "2016-10-29 16:00:00"))
df$time <- as.POSIXct(df$time)
Output:
id event day time
1 1 0 1 2016-10-25 14:00:00
2 1 0 3 2016-10-27 12:00:15
3 1 0 3 2016-10-27 15:30:00
4 2 1 1 2016-10-23 11:00:00
5 2 1 6 2016-10-28 08:00:15
6 2 1 6 2016-10-28 23:00:00
7 2 1 7 2016-10-29 12:00:00
8 3 0 1 2016-10-24 15:00:00
9 3 0 4 2016-10-27 15:00:15
10 3 0 6 2016-10-29 16:00:00
What I need to do:
If event is 0, I want to keep only the last 24 hours per id.
If event is 1, I want to keep the 6th day.
I know how to keep the last 24 hours in general:
library(dplyr)
library(lubridate)
last_twentyfour_hours <- df %>%
group_by(id) %>%
filter(time > last(time) - hours(24))
But how do I filter differently for each group?
Thank you very much in advance!

Group by 'id' and 'event', then filter with an if/else: if 0 is in 'event', apply the OP's 24-hour condition; otherwise keep the rows where 'day' is 6.
library(dplyr)
library(lubridate)
df %>%
group_by(id, event) %>%
filter(if(0 %in% event) time > last(time) - hours(24) else
day == 6) %>%
ungroup
-output
# A tibble: 5 × 4
id event day time
<dbl> <dbl> <dbl> <dttm>
1 1 0 3 2016-10-27 12:00:15
2 1 0 3 2016-10-27 15:30:00
3 2 1 6 2016-10-28 08:00:15
4 2 1 6 2016-10-28 23:00:00
5 3 0 6 2016-10-29 16:00:00

We could combine the two conditions with the & and | operators (& binds more tightly than |, so no extra parentheses are needed):
df %>%
group_by(id) %>%
filter(event == 0 & time > last(time) - hours(24) |
event == 1 & day==6)
id event day time
<dbl> <dbl> <dbl> <dttm>
1 1 0 3 2016-10-27 12:00:15
2 1 0 3 2016-10-27 15:30:00
3 2 1 6 2016-10-28 08:00:15
4 2 1 6 2016-10-28 23:00:00
5 3 0 6 2016-10-29 16:00:00
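As a rough cross-check of the two answers above, here is a minimal sketch that writes the parentheses out explicitly and uses max(time) instead of last(time), so the result does not depend on the rows being sorted by time. Parsing the timestamps as UTC and writing 24 hours as a plain number of seconds are assumptions made only to keep the example deterministic; they are not part of the original answers.

```r
library(dplyr)

df <- data.frame(
  id    = c(1, 1, 1, 2, 2, 2, 2, 3, 3, 3),
  event = c(0, 0, 0, 1, 1, 1, 1, 0, 0, 0),
  day   = c(1, 3, 3, 1, 6, 6, 7, 1, 4, 6),
  time  = as.POSIXct(
    c("2016-10-25 14:00:00", "2016-10-27 12:00:15", "2016-10-27 15:30:00",
      "2016-10-23 11:00:00", "2016-10-28 08:00:15", "2016-10-28 23:00:00",
      "2016-10-29 12:00:00", "2016-10-24 15:00:00", "2016-10-27 15:00:15",
      "2016-10-29 16:00:00"),
    tz = "UTC"  # assumption: fixed time zone so the example is reproducible
  )
)

res <- df %>%
  group_by(id) %>%
  filter(
    # event 0: keep rows within 24 hours (in seconds) of the latest time per id
    (event == 0 & time > max(time) - 24 * 60 * 60) |
    # event 1: keep only day 6
    (event == 1 & day == 6)
  ) %>%
  ungroup()

res
```

With the sample data this should keep the same five rows as the outputs shown above.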

Related

How can I create a day number variable in R based on dates?

I want to create a variable with the number of the day a participant took a survey (first day, second day, third day, etc.)
The issue is that there are participants that took the survey after midnight.
For example, this is what it looks like:
Id date
1 08/03/2020 08:17
1 08/03/2020 12:01
1 08/04/2020 15:08
1 08/04/2020 22:16
2 07/03/2020 08:10
2 07/03/2020 12:03
2 07/04/2020 15:07
2 07/05/2020 00:16
3 08/22/2020 09:17
3 08/23/2020 11:04
3 08/24/2020 00:01
4 10/03/2020 08:37
4 10/03/2020 11:13
4 10/04/2020 15:20
4 10/04/2020 23:05
This is what I want:
Id date day
1 08/03/2020 08:17 1
1 08/03/2020 12:01 1
1 08/04/2020 15:08 2
1 08/04/2020 22:16 2
2 07/03/2020 08:10 1
2 07/03/2020 12:03 1
2 07/04/2020 15:07 2
2 07/05/2020 00:16 2
3 08/22/2020 09:17 1
3 08/23/2020 11:04 2
3 08/24/2020 00:01 2
4 10/03/2020 08:37 1
4 10/03/2020 11:13 1
4 10/04/2020 15:20 2
4 10/04/2020 23:05 2
How can I create the day variable so that participants who took the survey shortly after midnight are still counted as belonging to the previous day?
I tried the code here, but I have issues with participants taking surveys after midnight.
Please check the below code
code
library(dplyr)
library(tidyr)  # for fill()
data2 <- data %>%
mutate(date2 = as.Date(date, format = "%m/%d/%Y %H:%M")) %>%
group_by(id) %>%
mutate(row = row_number(),
date3 = as.Date(ifelse(row == 1, date2, NA), origin = "1970-01-01")) %>%
fill(date3) %>%
ungroup() %>%
mutate(diff = as.numeric(date2 - date3 + 1)) %>%
select(-date2, -date3, -row)
output
#> id date diff
#> 1 1 08/03/2020 08:17 1
#> 2 1 08/03/2020 12:01 1
#> 3 1 08/04/2020 15:08 2
#> 4 1 08/04/2020 22:16 2
#> 5 2 07/03/2020 08:10 1
#> 6 2 07/03/2020 12:03 1
#> 7 2 07/04/2020 15:07 2
#> 8 2 07/05/2020 00:16 3
Here is one approach that explicitly shows the dates being considered. First, make sure your date is in POSIXct format, as suggested in the comments (if not done already). Then, if the hour is less than 2 (midnight to 2 AM), subtract 1 from the date so that survey_date reflects the day before; otherwise just keep the date. The tz argument is set to "" (the current time zone) in both the parsing and the date conversion, so the two steps use the same zone. Finally, after grouping by Id, subtract the first survey_date from each survey_date and add 1 to get the day number counted from the first survey. You can wrap this in as.numeric if you want a plain numeric column.
Note: if you only want to count the distinct days on which the survey was taken (ignoring gaps between survey days), you can replace the last line with:
mutate(day = cumsum(survey_date != lag(survey_date, default = first(survey_date))) + 1)
This increases day by 1 each time a new survey_date appears for a given Id.
library(tidyverse)
library(lubridate)
df %>%
mutate(date = as.POSIXct(date, format = "%m/%d/%Y %H:%M", tz = "")) %>%
mutate(survey_date = if_else(hour(date) < 2,
as.Date(date, format = "%Y-%m-%d", tz = "") - 1,
as.Date(date, format = "%Y-%m-%d", tz = ""))) %>%
group_by(Id) %>%
mutate(day = survey_date - first(survey_date) + 1)
Output
Id date survey_date day
<int> <dttm> <date> <drtn>
1 1 2020-08-03 08:17:00 2020-08-03 1 days
2 1 2020-08-03 12:01:00 2020-08-03 1 days
3 1 2020-08-04 15:08:00 2020-08-04 2 days
4 1 2020-08-04 22:16:00 2020-08-04 2 days
5 2 2020-07-03 08:10:00 2020-07-03 1 days
6 2 2020-07-03 12:03:00 2020-07-03 1 days
7 2 2020-07-04 15:07:00 2020-07-04 2 days
8 2 2020-07-05 00:16:00 2020-07-04 2 days
9 3 2020-08-22 09:17:00 2020-08-22 1 days
10 3 2020-08-23 11:04:00 2020-08-23 2 days
11 3 2020-08-24 00:01:00 2020-08-23 2 days
12 4 2020-10-03 08:37:00 2020-10-03 1 days
13 4 2020-10-03 11:13:00 2020-10-03 1 days
14 4 2020-10-04 15:20:00 2020-10-04 2 days
15 4 2020-10-04 23:05:00 2020-10-04 2 days
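The same idea can also be sketched in base R by shifting every timestamp back by the cutoff before taking the date, which folds the if/else into a single subtraction. The 2-hour cutoff mirrors the answer above; the name cutoff_hours, the use of UTC, and the two-participant subset of the question's data are assumptions for illustration only.

```r
# Subset of the question's data (Ids 2 and 3, which contain after-midnight rows)
surveys <- data.frame(
  Id   = c(2, 2, 2, 2, 3, 3, 3),
  date = c("07/03/2020 08:10", "07/03/2020 12:03", "07/04/2020 15:07",
           "07/05/2020 00:16", "08/22/2020 09:17", "08/23/2020 11:04",
           "08/24/2020 00:01")
)

cutoff_hours <- 2  # hypothetical parameter: surveys before 02:00 count as the previous day

# Parse the timestamps; UTC is an assumption to keep the example deterministic
ts <- as.POSIXct(surveys$date, format = "%m/%d/%Y %H:%M", tz = "UTC")

# Shift back by the cutoff, then truncate to a calendar date
surveys$survey_date <- as.Date(ts - cutoff_hours * 3600, tz = "UTC")

# Day number per Id, counted from that Id's first survey date
surveys$day <- ave(as.numeric(surveys$survey_date), surveys$Id,
                   FUN = function(d) d - min(d) + 1)

surveys
```

For these rows the day column should come out as 1, 1, 2, 2 for Id 2 and 1, 2, 2 for Id 3, matching the desired output in the question.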

Get daily average with R [duplicate]

I have a data.frame with some prices per day. I would like to get the average daily price in another column (avg_price). How can I do that?
date price avg_price
1 2017-01-01 01:00:00 10 18.75
2 2017-01-01 01:00:00 10 18.75
3 2017-01-01 05:00:00 25 18.75
4 2017-01-01 04:00:00 30 18.75
5 2017-01-02 08:00:00 10 20
6 2017-01-02 08:00:00 30 20
7 2017-01-02 07:00:00 20 20
library(lubridate)
library(tidyverse)
df %>%
group_by(day = day(date)) %>%
summarise(avg_price = mean(price))
# A tibble: 2 x 2
day avg_price
<int> <dbl>
1 1 18.8
2 2 20
df %>%
group_by(day = day(date)) %>%
mutate(avg_price = mean(price))
# A tibble: 7 x 4
# Groups: day [2]
date price avg_price day
<dttm> <dbl> <dbl> <int>
1 2017-01-01 01:00:00 10 18.8 1
2 2017-01-01 01:00:00 10 18.8 1
3 2017-01-01 05:00:00 25 18.8 1
4 2017-01-01 04:00:00 30 18.8 1
5 2017-01-02 08:00:00 10 20 2
6 2017-01-02 08:00:00 30 20 2
7 2017-01-02 07:00:00 20 20 2
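One caveat with grouping on day(date): it returns the day of the month, so rows from the 1st of January and the 1st of February would land in the same group. A minimal sketch of a variant that groups on the full calendar date instead is below. The extra February row and the UTC time zone are assumptions added only to show the difference; they are not in the question's data.

```r
library(dplyr)

prices <- data.frame(
  date = as.POSIXct(
    c("2017-01-01 01:00:00", "2017-01-01 01:00:00", "2017-01-01 05:00:00",
      "2017-01-01 04:00:00", "2017-01-02 08:00:00", "2017-01-02 08:00:00",
      "2017-01-02 07:00:00",
      "2017-02-01 09:00:00"),  # hypothetical extra row in another month
    tz = "UTC"
  ),
  price = c(10, 10, 25, 30, 10, 30, 20, 100)
)

out <- prices %>%
  # group on the calendar date, not on the day-of-month number
  group_by(day = as.Date(date, tz = "UTC")) %>%
  mutate(avg_price = mean(price)) %>%
  ungroup()

out
```

Here the January 1st rows keep an average of 18.75, whereas grouping on the day-of-month number would pool them with the February row and report 35.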

calculate number of frost change days (number of days) from the weather hourly data in r

I have to calculate the number of frost change days (NFCD) on a weekly basis.
That means the number of days in which minimum temperature and maximum temperature cross 0°C.
Let's say I work with years 1957-1980 with hourly temp.
Example data (couple of rows look like):
Date Time (UTC) temperature
1957-07-01 00:00:00 5
1957-07-01 03:00:00 6.2
1957-07-01 05:00:00 9
1957-07-01 06:00:00 10
1957-07-01 07:00:00 10
1957-07-01 08:00:00 14
1957-07-01 09:00:00 13.2
1957-07-01 10:00:00 15
1957-07-01 11:00:00 15
1957-07-01 12:00:00 16.3
1957-07-01 13:00:00 15.8
Expected data:
year month week NFCD
1957 7 1 1
1957 7 2 5
dat <- data.frame(date=c(rep("A",5),rep("B",5)), time=rep(1:5, times=2), temp=c(1:5,-2,1:4))
dat
# date time temp
# 1 A 1 1
# 2 A 2 2
# 3 A 3 3
# 4 A 4 4
# 5 A 5 5
# 6 B 1 -2
# 7 B 2 1
# 8 B 3 2
# 9 B 4 3
# 10 B 5 4
aggregate(temp ~ date, data = dat, FUN = function(z) min(z) <= 0 && max(z) > 0)
# date temp
# 1 A FALSE
# 2 B TRUE
(then rename temp to NFCD)
Using the data from r2evans's answer you can also use tidyverse logic:
library(tidyverse)
dat %>%
group_by(date) %>%
summarize(NFCD = min(temp) < 0 & max(temp) > 0)
which gives:
# A tibble: 2 x 2
date NFCD
<chr> <lgl>
1 A FALSE
2 B TRUE
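Both answers stop at a per-day flag, while the question asks for weekly counts. A rough base R sketch of the remaining step follows. It rests on several assumptions: "week" is taken to mean week of the month (days 1-7 are week 1, and so on), the strict comparison (min < 0 and max > 0) is used, and the temperature values are invented toy data, not real observations.

```r
# Toy hourly data (hypothetical values, two readings per day)
hourly <- data.frame(
  time = as.POSIXct(
    c("1957-07-01 00:00:00", "1957-07-01 12:00:00",
      "1957-07-02 00:00:00", "1957-07-02 12:00:00",
      "1957-07-03 00:00:00", "1957-07-03 12:00:00",
      "1957-07-08 00:00:00", "1957-07-08 12:00:00"),
    tz = "UTC"
  ),
  temperature = c(5, 16.3, -1, 4, -2, 1, -3, 2)
)

# Step 1: one row per day, TRUE if the temperature crossed 0 degrees C that day
hourly$date <- format(hourly$time, "%Y-%m-%d")
daily <- aggregate(temperature ~ date, data = hourly,
                   FUN = function(z) min(z) < 0 && max(z) > 0)
names(daily)[2] <- "frost_change"

# Step 2: derive year, month and week-of-month (assumed definition of "week")
daily$year  <- as.integer(substr(daily$date, 1, 4))
daily$month <- as.integer(substr(daily$date, 6, 7))
daily$week  <- (as.integer(substr(daily$date, 9, 10)) - 1L) %/% 7L + 1L

# Step 3: count frost change days per week
weekly <- aggregate(frost_change ~ year + month + week, data = daily, FUN = sum)
names(weekly)[4] <- "NFCD"

weekly
```

If ISO calendar weeks are wanted instead, the week column could be derived with format(as.Date(daily$date), "%V"), at the cost of weeks that span two months.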

Time-interval overlap match by group

Suppose I have the following DF:
id flag time
1 1 2017-01-01 UTC--2017-01-07 UTC
1 0 2018-01-01 UTC--2019-01-01 UTC
1 0 2017-01-03 UTC--2017-01-09 UTC
2 1 2017-01-01 UTC--2017-01-15 UTC
2 1 2018-07-01 UTC--2018-09-01 UTC
2 1 2018-10-12 UTC--2018-10-20 UTC
2 0 2017-01-12 UTC--2017-01-16 UTC
2 0 2017-03-01 UTC--2017-03-15 UTC
2 0 2017-12-01 UTC--2017-12-31 UTC
2 0 2018-08-15 UTC--2018-09-19 UTC
2 0 2018-10-01 UTC--2018-10-21 UTC
Created with the following code:
df <- data.frame(id=c(1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2),
flag=c(1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0),
time=c(interval(ymd(20170101), ymd(20170107)),
interval(ymd(20180101), ymd(20190101)),
interval(ymd(20170103), ymd(20170109)),
# Cases
interval(ymd(20170101), ymd(20170115)),
interval(ymd(20180701), ymd(20180901)),
interval(ymd(20181012), ymd(20181020)),
# Controls
interval(ymd(20170112), ymd(20170116)),
interval(ymd(20170301), ymd(20170315)),
interval(ymd(20171201), ymd(20171231)),
interval(ymd(20180815), ymd(20180919)),
interval(ymd(20181001), ymd(20181021))))
And I want to obtain this result
id flag time value
1 1 2017-01-01 UTC--2017-01-07 UTC NA
1 0 2018-01-01 UTC--2019-01-01 UTC 0
1 0 2017-01-03 UTC--2017-01-09 UTC 1
2 1 2017-01-01 UTC--2017-01-15 UTC NA
2 1 2018-07-01 UTC--2018-09-01 UTC NA
2 1 2018-10-12 UTC--2018-10-20 UTC NA
2 0 2017-01-12 UTC--2017-01-16 UTC 1
2 0 2017-03-01 UTC--2017-03-15 UTC 0
2 0 2017-12-01 UTC--2017-12-31 UTC 0
2 0 2018-08-15 UTC--2018-09-19 UTC 1
2 0 2018-10-01 UTC--2018-10-21 UTC 1
That is, within each group I want to compare the time interval of every flag = 0 row against all flag = 1 intervals, to see whether it overlaps with at least one of them.
For this purpose I have tried lubridate's int_overlaps function.
I tried the following code, but it does not work:
result <- df %>%
group_by(id) %>%
mutate(value = ifelse(flag == 0 & int_overlaps(time, any(time[flag == 1])), 1, 0))
I have found a very similar approach:
R: Determine if each date interval overlaps with all other date intervals in a dataframe
You can use map_int from purrr to see if any intervals overlap within each id:
library(tidyverse)
library(lubridate)
df %>%
group_by(id) %>%
mutate(value = ifelse(flag == 0, map_int(time, ~ any(int_overlaps(.x, time[flag == 1]))), NA))
Output
# A tibble: 11 x 4
# Groups: id [2]
id flag time value
<dbl> <dbl> <Interval> <int>
1 1 1 2017-01-01 UTC--2017-01-07 UTC NA
2 1 0 2018-01-01 UTC--2019-01-01 UTC 0
3 1 0 2017-01-03 UTC--2017-01-09 UTC 1
4 2 1 2017-01-01 UTC--2017-01-15 UTC NA
5 2 1 2018-07-01 UTC--2018-09-01 UTC NA
6 2 1 2018-10-12 UTC--2018-10-20 UTC NA
7 2 0 2017-01-12 UTC--2017-01-16 UTC 1
8 2 0 2017-03-01 UTC--2017-03-15 UTC 0
9 2 0 2017-12-01 UTC--2017-12-31 UTC 0
10 2 0 2018-08-15 UTC--2018-09-19 UTC 1
11 2 0 2018-10-01 UTC--2018-10-21 UTC 1
I add another answer extracted from here:
R: Determine if each date interval overlaps with all other date intervals in a dataframe
result <- df %>% group_by(id) %>%
mutate(value = map(seq_along(time), function(x){
y = setdiff(seq_along(time[flag == 1]), x)
return(any(int_overlaps(time[x], time[y])))
}))
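Note that the snippet above builds y from seq_along(time[flag == 1]), i.e. the first positions in the group, which coincides with the flag = 1 rows only when those rows come first, as they do in this data. A minimal sketch that looks up the flag = 1 rows by position explicitly is shown below, using a plain loop instead of dplyr. It uses only the id = 1 rows from the question; the variable names are illustrative.

```r
library(lubridate)

# The three id = 1 rows from the question
id   <- c(1, 1, 1)
flag <- c(1, 0, 0)
time <- c(interval(ymd("2017-01-01"), ymd("2017-01-07")),
          interval(ymd("2018-01-01"), ymd("2019-01-01")),
          interval(ymd("2017-01-03"), ymd("2017-01-09")))

# NA for flag = 1 rows; 1/0 for flag = 0 rows depending on overlap
value <- rep(NA_integer_, length(time))
for (i in which(flag == 0)) {
  # positions of the flag = 1 intervals belonging to the same id
  cases <- which(flag == 1 & id == id[i])
  value[i] <- as.integer(any(int_overlaps(time[i], time[cases])))
}

value
```

For these three rows the result should be NA, 0, 1, in line with the expected output in the question.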

Min and max value based on another column and combine those in r

So I have a while-loop function that writes 1s into "algorithm_column" for the rows with the highest values in the "percent" column, until a certain total percentage is reached (90% or so). The remaining rows get a 0 in "algorithm_column" (see: Create while loop function that takes next largest value untill condition is met).
Based on what the loop found, I want to get the min and max times of the "timeinterval" column (the min is where the 1s start, the max is the last row with a 1; the 0s are out of scope), and then create a time interval from them.
So for the data below, I want another column, let's say "total_time", holding the span from the min time 09:00 (where the 1s start in algorithm_column) to 11:15, i.e. a time interval of 02:15 hours.
algorithm
# pc4 timeinterval stops percent idgroup algorithm_column
#1 5464 08:45:00 1 1.3889 1 0
#2 5464 09:00:00 5 6.9444 2 1
#3 5464 09:15:00 8 11.1111 3 1
#4 5464 09:30:00 7 9.7222 4 1
#5 5464 09:45:00 5 6.9444 5 1
#6 5464 10:00:00 10 13.8889 6 1
#7 5464 10:15:00 6 8.3333 7 1
#8 5464 10:30:00 4 5.5556 8 1
#9 5464 10:45:00 7 9.7222 9 1
#10 5464 11:00:00 6 8.3333 10 1
#11 5464 11:15:00 5 6.9444 11 1
#12 5464 11:30:00 8 11.1111 12 0
I have multiple pc4 groups, so it should look at every group and calculate a total_time for each group respectively.
I got this function, but I'm a bit stuck if this is what I need.
test <- function(x) {
ind <- x[["algorithm$algorithm_column"]] == 0
Mx <- max(x[["timeinterval"]][ind], na.rm = TRUE);
ind <- x[["algorithm$algorithm_column"]] == 1
Mn <- min(x[["timeinterval"]][ind], na.rm = TRUE);
list(Mn, Mx) ## or return(list(Mn, Mx))
}
test(algorithm)
Here is a dplyr solution.
library(dplyr)
algorithm %>%
mutate(tmp = cumsum(c(0, diff(algorithm_column) != 0))) %>%
filter(algorithm_column == 1) %>%
group_by(pc4, tmp) %>%
summarise(first = first(timeinterval),
last = last(timeinterval)) %>%
select(-tmp)
## A tibble: 1 x 3
## Groups: pc4 [1]
# pc4 first last
# <int> <fct> <fct>
#1 5464 09:00:00 11:15:00
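The solution above returns the first and last times but not the total_time the question asks for. A minimal base R sketch of that last step is below, assuming timeinterval is stored as "HH:MM:SS" text (or a factor of such strings) and that there is a single run of 1s per pc4. The helper hms_to_mins and the shortened toy data are hypothetical, introduced only for illustration.

```r
# Shortened toy version of the question's data
algorithm <- data.frame(
  pc4              = c(5464, 5464, 5464, 5464, 5464),
  timeinterval     = c("08:45:00", "09:00:00", "09:15:00", "11:15:00", "11:30:00"),
  algorithm_column = c(0, 1, 1, 1, 0)
)

# Hypothetical helper: convert "HH:MM:SS" strings (or factors) to minutes since midnight
hms_to_mins <- function(x) {
  parts <- strsplit(as.character(x), ":", fixed = TRUE)
  vapply(parts, function(p) sum(as.numeric(p) * c(60, 1, 1 / 60)), numeric(1))
}

# Keep only the rows selected by the algorithm and convert their times
ones <- algorithm[algorithm$algorithm_column == 1, ]
ones$mins <- hms_to_mins(ones$timeinterval)

# Span between the first and last selected time, per pc4
res <- aggregate(mins ~ pc4, data = ones, FUN = function(m) max(m) - min(m))
names(res)[2] <- "total_mins"

# Format the span as HH:MM
total <- as.integer(round(res$total_mins))
res$total_time <- sprintf("%02d:%02d", total %/% 60L, total %% 60L)

res
```

For pc4 5464 this should give 135 minutes, i.e. "02:15", matching the span from 09:00 to 11:15 described in the question.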
Data.
algorithm <- read.table(text = "
pc4 timeinterval stops percent idgroup algorithm_column
1 5464 08:45:00 1 1.3889 1 0
2 5464 09:00:00 5 6.9444 2 1
3 5464 09:15:00 8 11.1111 3 1
4 5464 09:30:00 7 9.7222 4 1
5 5464 09:45:00 5 6.9444 5 1
6 5464 10:00:00 10 13.8889 6 1
7 5464 10:15:00 6 8.3333 7 1
8 5464 10:30:00 4 5.5556 8 1
9 5464 10:45:00 7 9.7222 9 1
10 5464 11:00:00 6 8.3333 10 1
11 5464 11:15:00 5 6.9444 11 1
12 5464 11:30:00 8 11.1111 12 0
", header = TRUE)

Resources