do() superseded! What is the alternative using across(), nest_by(), and summarise()?

I'm doing something quite simple. Given a dataframe of start dates and end dates for specific periods, I want to expand each period into a full weekly sequence (keeping the period factor on each row), then output the result as a single large dataframe.
For instance:
library(tidyverse)
library(lubridate)
# Dataset
start_dates = ymd_hms(c("2019-05-08 00:00:00",
                        "2020-01-17 00:00:00",
                        "2020-03-03 00:00:00",
                        "2020-05-28 00:00:00",
                        "2020-12-10 00:00:00",
                        "2021-05-07 00:00:00",
                        "2022-01-04 00:00:00"), tz = "UTC")
end_dates = ymd_hms(c("2019-10-24 00:00:00",
                      "2020-03-03 00:00:00",
                      "2020-05-28 00:00:00",
                      "2020-12-10 00:00:00",
                      "2021-05-07 00:00:00",
                      "2022-01-04 00:00:00",
                      "2022-01-19 00:00:00"), tz = "UTC")
df1 = data.frame(studying = paste0("period", 1:7), start_dates, end_dates)
It was suggested to me to use do(), which currently works fine, but I hate relying on superseded functions. I also have a way of doing it using map2(). But the documentation (https://dplyr.tidyverse.org/reference/do.html) suggests you can use nest_by(), across(), and summarise() to do the same job as do(). How would I go about getting the same result? I've tried a lot of things but I just can't seem to get it.
# do() way to do it
df1 %>%
  group_by(studying) %>%
  do(data.frame(week = seq(.$start_dates, .$end_dates, by = "1 week")))
# transmute() way to do it
df1 %>%
  transmute(weeks = map2(start_dates, end_dates, seq, by = "1 week"), studying) %>%
  unnest(cols = c(weeks))

As the documentation of ?do suggests, we can now use summarise and replace the . with across():
library(tidyverse)
library(lubridate)
df1 %>%
  group_by(studying) %>%
  summarise(week = seq(across()$start_dates,
                       across()$end_dates,
                       by = "1 week"))
#> `summarise()` has grouped output by 'studying'. You can override using the
#> `.groups` argument.
#> # A tibble: 134 x 2
#> # Groups: studying [7]
#> studying week
#> <chr> <dttm>
#> 1 period1 2019-05-08 00:00:00
#> 2 period1 2019-05-15 00:00:00
#> 3 period1 2019-05-22 00:00:00
#> 4 period1 2019-05-29 00:00:00
#> 5 period1 2019-06-05 00:00:00
#> 6 period1 2019-06-12 00:00:00
#> 7 period1 2019-06-19 00:00:00
#> 8 period1 2019-06-26 00:00:00
#> 9 period1 2019-07-03 00:00:00
#> 10 period1 2019-07-10 00:00:00
#> # … with 124 more rows
Created on 2022-01-19 by the reprex package (v0.3.0)
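The do() documentation also mentions nest_by(). nest_by() returns a rowwise frame with each group's rows stored in a list-column named data, so the per-group columns can be reached through that column. A sketch of that route (same output, up to grouping metadata):
df1 %>%
  nest_by(studying) %>%
  summarise(week = seq(data$start_dates, data$end_dates, by = "1 week"))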

You can also use tidyr::complete:
df1 %>%
  group_by(studying) %>%
  complete(start_dates = seq(from = start_dates, to = end_dates, by = "1 week")) %>%
  select(-end_dates, weeks = start_dates)
# A tibble: 134 x 2
# Groups: studying [7]
studying weeks
<chr> <dttm>
1 period1 2019-05-08 00:00:00
2 period1 2019-05-15 00:00:00
3 period1 2019-05-22 00:00:00
4 period1 2019-05-29 00:00:00
5 period1 2019-06-05 00:00:00
6 period1 2019-06-12 00:00:00
7 period1 2019-06-19 00:00:00
8 period1 2019-06-26 00:00:00
9 period1 2019-07-03 00:00:00
10 period1 2019-07-10 00:00:00
# ... with 124 more rows

Although marked Experimental, the help file for group_modify() does say that
‘group_modify()’ is an evolution of ‘do()’
and, in fact, the code for the example in the question using group_modify() is nearly the same as with do().
# with group_modify
df2 <- df1 %>%
  group_by(studying) %>%
  group_modify(~ data.frame(week = seq(.$start_dates, .$end_dates, by = "1 week")))
# with do
df0 <- df1 %>%
  group_by(studying) %>%
  do(data.frame(week = seq(.$start_dates, .$end_dates, by = "1 week")))
identical(df2, df0)
## [1] TRUE

Not sure if this is exactly what you are looking for, but here is my attempt with rowwise() and unnest():
df1 %>%
  rowwise() %>%
  mutate(week = list(seq(start_dates, end_dates, by = "1 week"))) %>%
  select(studying, week) %>%
  unnest(cols = c(week))

Another approach:
library(tidyverse)
df1 %>%
  group_by(studying) %>%
  summarise(df = tibble(weeks = seq(start_dates, end_dates, by = 'week'))) %>%
  unnest(df)
#> `summarise()` has grouped output by 'studying'. You can override using the `.groups` argument.
#> # A tibble: 134 × 2
#> # Groups: studying [7]
#> studying weeks
#> <chr> <dttm>
#> 1 period1 2019-05-08 00:00:00
#> 2 period1 2019-05-15 00:00:00
#> 3 period1 2019-05-22 00:00:00
#> 4 period1 2019-05-29 00:00:00
#> 5 period1 2019-06-05 00:00:00
#> 6 period1 2019-06-12 00:00:00
#> 7 period1 2019-06-19 00:00:00
#> 8 period1 2019-06-26 00:00:00
#> 9 period1 2019-07-03 00:00:00
#> 10 period1 2019-07-10 00:00:00
#> # … with 124 more rows
Created on 2022-01-20 by the reprex package (v2.0.1)
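A side note: from dplyr 1.1.0 on, returning multiple rows per group from summarise() is deprecated in favour of reframe(). Since each group here holds a single row, a sketch on current dplyr needs neither across() nor a list-column:
df1 %>%
  group_by(studying) %>%
  reframe(week = seq(start_dates, end_dates, by = "1 week"))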

Related

Conditionally mutate column across list of dataframes in R

I am working with a large list of dataframes that use inconsistent date formats. I would like to conditionally mutate across the list so that any dataframe that contains a given string uses one date format, and those that do not contain the string use another format. In other words, I want to distinguish between dataframes launched in 2019 (which use mdy) and those launched in all other years (which use dmy).
The following code will conditionally mutate rows within a dataframe, but I am unsure how to conditionally mutate across the entire column.
dataframes %>%
  map(~ .x %>%
        mutate(date_time = if_else(str_detect(date_time, "/19 "),
                                   mdy_hms(date_time), dmy_hms(date_time))))
Thank you!
edit
Data and code example. There are dataframes that contain a mixture of years.
library(tidyverse)
library(lubridate)
dataframes <- list(
  tibble(date_time = c("07/06/19 01:00:00 PM", "07/06/20 01:00:00 PM"), num = 1:2), # July 6th
  tibble(date_time = c("06/07/20 01:00:00 PM", "06/07/21 01:00:00 PM"), num = 1:2)  # July 6th
)
dataframes %>%
  map(~ .x %>%
        mutate(date_time = if_else(str_detect(date_time, "/19 "),
                                   mdy_hms(date_time), dmy_hms(date_time)),
               date = date(date_time),
               month = month(date_time),
               doy = yday(date_time)))
[[1]]
# A tibble: 2 × 5
date_time num date month doy
<dttm> <int> <date> <dbl> <dbl>
1 2019-07-06 13:00:00 1 2019-07-06 7 187
2 2020-06-07 13:00:00 2 2020-06-07 6 159
[[2]]
# A tibble: 2 × 5
date_time num date month doy
<dttm> <int> <date> <dbl> <dbl>
1 2020-07-06 13:00:00 1 2020-07-06 7 188
2 2021-07-06 13:00:00 2 2021-07-06 7 187
If you are trying to determine the date format for the whole data frame based on the presence of any date from 2019, then a small tweak of your code should work.
Instead of evaluating each record individually for the presence of "/19 ", set the condition of the if_else() to any(str_detect(...)), which returns TRUE if any of the values are TRUE. However, the result of any() is always of length 1, so you then need to rep() the result to match the number of rows of the data frame using dplyr::n().
library(tidyverse)
library(lubridate)
dataframes <- list(
  tibble(date_time = c("07/06/19 01:00:00 PM", "07/06/20 01:00:00 PM"), num = 1:2), # July 6th
  tibble(date_time = c("06/07/20 01:00:00 PM", "06/07/21 01:00:00 PM"), num = 1:2)  # July 6th
)
dataframes %>%
  map(~ .x %>%
        mutate(
          date_time = if_else(str_detect(date_time, "/19 ") %>%
                                any() %>%
                                rep(n()),
                              mdy_hms(date_time),
                              dmy_hms(date_time)),
          date = date(date_time),
          month = month(date_time),
          doy = yday(date_time)
        ))
#> [[1]]
#> # A tibble: 2 × 5
#> date_time num date month doy
#> <dttm> <int> <date> <dbl> <dbl>
#> 1 2019-07-06 13:00:00 1 2019-07-06 7 187
#> 2 2020-07-06 13:00:00 2 2020-07-06 7 188
#>
#> [[2]]
#> # A tibble: 2 × 5
#> date_time num date month doy
#> <dttm> <int> <date> <dbl> <dbl>
#> 1 2020-07-06 13:00:00 1 2020-07-06 7 188
#> 2 2021-07-06 13:00:00 2 2021-07-06 7 187
Created on 2022-07-20 by the reprex package (v2.0.1)
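Since the condition is constant within each data frame, a sketch of an alternative that branches once with a plain if rather than building a vectorised condition (this skips the any() %>% rep(n()) step and gives the same result):
dataframes %>%
  map(~ .x %>%
        mutate(date_time = if (any(str_detect(date_time, "/19 ")))
                 mdy_hms(date_time)
               else
                 dmy_hms(date_time)))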

Using datetime look-back periods to record whether an event occurred in R

I would like to transform this dataframe:
library(tidyverse)
library(lubridate)
EVENT <- c("HR", "RR", "HR, "CREATITINE", "HR")
DATETIME <- c("2019-08-01 00:00:01", "2019-08-01 00:30:00", "2019-08-01 01:30:00", "2019-08-01 02:00:00", "2019-08-01 05:00:00")
LOOK_BACK_HOURS <- c(1,1,1,3,1)
df <- data.frame(EVENT, DATETIME, LOOK_BACK_HOURS)
df <- df %>%
mutate(DATETIME = ymd_hms(DATETIME))
EVENT       DATETIME             LOOK_BACK_HOURS
HR          2019-08-01 00:00:01  1
RR          2019-08-01 00:35:00  1
HR          2019-08-01 01:30:00  1
CREATININE  2019-08-01 02:00:00  3
HR          2019-08-01 05:00:00  1
Into a dataframe that looks like this:
EVENT       DATETIME             LOOK_BACK_HOURS  HR_FLAG  RR_FLAG  CREATININE_FLAG
HR          2019-08-01 00:00:01  1                1        0        0
RR          2019-08-01 00:35:00  1                1        1        0
HR          2019-08-01 01:30:00  1                1        1        0
CREATININE  2019-08-01 02:00:00  3                1        1        1
HR          2019-08-01 05:00:00  1                1        0        0
The flag should indicate whether an event of that type occurred during the look-back period. There is no requirement to count how many times a particular type of event occurred, only whether it happened or not.
Ideally I would like a dplyr solution, but given that my actual dataset will contain 9 event types and a couple of million rows of medical observations, a performant solution will greatly speed up running this across new datasets as they arise.
Thanks
You can create a function like this and use purrr::pmap to get the required data frame.
To define the look-back window I create a new column from LOOK_BACK_HOURS; with this, I check whether the recorded times for each event type fall within that window and, if so, set a flag.
library(tidyverse)
library(lubridate)
#>
#> Attaching package: 'lubridate'
#> The following objects are masked from 'package:base':
#>
#> date, intersect, setdiff, union
EVENT <- c("HR", "RR", "HR", "CREATITINE", "HR")
DATETIME <- c("2019-08-01 00:00:01", "2019-08-01 00:30:00", "2019-08-01 01:30:00", "2019-08-01 02:00:00", "2019-08-01 05:00:00")
LOOK_BACK_HOURS <- c(1,1,1,3,1)
df <- data.frame(EVENT, DATETIME, LOOK_BACK_HOURS)
df <- df %>%
mutate(DATETIME = ymd_hms(DATETIME))
flag <- function(start, end, event, dff = df) {
  # times at which the given event type occurred
  times <- dff %>%
    filter(EVENT == event) %>%
    pull(DATETIME)
  # count how many of those times fall inside the look-back window [start, end]
  flag_count <- sum(ifelse(times >= start & times <= end, 1, 0))
  # return 1 if at least one event fell in the window, 0 otherwise
  ifelse(flag_count > 0, 1, 0)
}
df %>%
  mutate(look_back_time = DATETIME - (LOOK_BACK_HOURS * 60 * 60),
         HR_FLAG = pmap_dbl(list(look_back_time, DATETIME, "HR"), flag),
         RR_FLAG = pmap_dbl(list(look_back_time, DATETIME, "RR"), flag),
         CREATITINE_FLAG = pmap_dbl(list(look_back_time, DATETIME, "CREATITINE"), flag)) %>%
  select(-look_back_time)
#> EVENT DATETIME LOOK_BACK_HOURS HR_FLAG RR_FLAG
#> 1 HR 2019-08-01 00:00:01 1 1 0
#> 2 RR 2019-08-01 00:30:00 1 1 1
#> 3 HR 2019-08-01 01:30:00 1 1 1
#> 4 CREATITINE 2019-08-01 02:00:00 3 1 1
#> 5 HR 2019-08-01 05:00:00 1 1 0
#> CREATITINE_FLAG
#> 1 0
#> 2 0
#> 3 0
#> 4 1
#> 5 0
Created on 2021-02-18 by the reprex package (v0.3.0)
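Given the couple-of-million-row target, calling the flag function once per row and event type will be slow. A sketch of a set-based alternative using a data.table non-equi join (the helper names row_id, win_start, and FLAG_NAME, and the result layout, are my own, not from the question):
library(data.table)
dt <- as.data.table(df)
dt[, row_id := .I]
dt[, win_start := DATETIME - LOOK_BACK_HOURS * 60 * 60]

# every (observation window, event) pair where the event time falls
# inside [win_start, DATETIME]
ev <- dt[, .(EVENT, etime = DATETIME)]
hits <- ev[dt, on = .(etime >= win_start, etime <= DATETIME),
           .(row_id = i.row_id, FLAG_NAME = paste0(x.EVENT, "_FLAG")),
           nomatch = NULL, allow.cartesian = TRUE]

# one 0/1 column per event type, joined back onto the original rows
flags <- dcast(hits, row_id ~ FLAG_NAME,
               fun.aggregate = function(v) as.integer(length(v) > 0),
               value.var = "FLAG_NAME")
result <- flags[dt, on = "row_id"][, c("row_id", "win_start") := NULL]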

Converting dates to hours in R

I have a start date and an end date for individuals, and I need to determine whether the time elapsed from start to end is within 2 days or is 3 or more days. These dates are assigned to record IDs. How can I filter the records that ended within 2 days (from the start date) and the ones that ended 3 days or later?
Record_id <- c("2245","6728","5122","9287")
Start <- c("2021-01-13 CST" ,"2021-01-21 CST" ,"2021-01-17 CST","2021-01-13 CST")
End <- c("2021-01-21 18:00:00 CST", "2021-01-22 16:00:00 CST", "2021-01-22 13:00:00 CST","2021-01-25 15:00:00 CST")
I tried using
elapsed.time <- DF$Start %--% DF$End
time.duration <- as.duration(elapsed.time)
but I am getting an error because the End date contains an hour. Thank you.
Here's a dplyr pipe that will include both constraints (2 and 3 days):
df %>%
  mutate(across(Start:End, as.POSIXct)) %>%
  mutate(d = difftime(End, Start, units = "days")) %>%
  filter(!between(difftime(End, Start, units = "days"), 2, 3))
# # A tibble: 4 x 4
# Record_id Start End d
# <chr> <dttm> <dttm> <drtn>
# 1 2245 2021-01-13 00:00:00 2021-01-21 18:00:00 8.750000 days
# 2 6728 2021-01-21 00:00:00 2021-01-22 16:00:00 1.666667 days
# 3 5122 2021-01-17 00:00:00 2021-01-22 13:00:00 5.541667 days
# 4 9287 2021-01-13 00:00:00 2021-01-25 15:00:00 12.625000 days
I included mutate(d = ...) so that we can see what the actual differences are. If you were instead looking to keep only the rows between 2 and 3 days, use filter(between(...)) (no !).
In the case of the data you provided, all observations are less than 2 or more than 3 days apart, so nothing is removed. I'll widen the range so that we can see the filter in effect:
df %>%
  mutate(across(Start:End, as.POSIXct)) %>%
  mutate(d = difftime(End, Start, units = "days")) %>%
  filter(!between(difftime(End, Start, units = "days"), 1, 6))
# # A tibble: 2 x 4
# Record_id Start End d
# <chr> <dttm> <dttm> <drtn>
# 1 2245 2021-01-13 00:00:00 2021-01-21 18:00:00 8.750 days
# 2 9287 2021-01-13 00:00:00 2021-01-25 15:00:00 12.625 days
Data
df <- structure(list(Record_id = c("2245", "6728", "5122", "9287"), Start = c("2021-01-13 CST", "2021-01-21 CST", "2021-01-17 CST", "2021-01-13 CST"), End = c("2021-01-21 18:00:00 CST", "2021-01-22 16:00:00 CST", "2021-01-22 13:00:00 CST", "2021-01-25 15:00:00 CST")), row.names = c(NA, -4L), class = c("tbl_df", "tbl", "data.frame"))
I just converted the character to a date time with lubridate and then subtracted the dates. What you'll get back are days. I then filter for dates that are within 2 days.
Record_id<- c("2245","6728","5122","9287")
Start<-c("2021-01-13 CST" ,"2021-01-21 CST" ,"2021-01-17 CST","2021-01-13 CST")
End<-c("2021-01-21 18:00:00 CST", "2021-01-22 16:00:00 CST", "2021-01-22 13:00:00 CST","2021-01-25 15:00:00 CST")
df <- dplyr::tibble(x = Record_id, y = Start, z = End)
df %>%
  dplyr::mutate_at(vars(y:z), ~ lubridate::as_datetime(.)) %>%
  dplyr::mutate(diff = as.numeric(z - y)) %>%
  dplyr::filter(diff <= 2)
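For completeness: the %--% interval operator from the question does work here; the error comes from the unparsed character columns rather than from End containing an hour. A sketch once both columns are converted to date-times:
library(dplyr)
library(lubridate)
df %>%
  mutate(across(Start:End, as.POSIXct),
         days = as.duration(Start %--% End) / ddays(1)) %>%
  filter(days <= 2 | days >= 3)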

Set up data in order to use Prophet() in R

I want to use the Prophet() function in R, but I cannot transform my "YearWeek" column into an as.Date() column.
The "YearWeek" column stores values from 201401 up to 201937, i.e. from 2014 week 1 up to 2019 week 37.
I don't know how to declare this column as a date in the yyyy-ww form needed to use the Prophet() function.
Does anyone know how to do this?
Thank you in advance.
One solution could be to append a 01 to the end of your yyyy-ww formatted dates.
Data:
library(tidyverse)
df <- cross2(2014:2019, str_pad(1:52, width = 2, pad = "0")) %>%
  map_df(set_names, c("year", "week")) %>%
  transmute(date = paste(year, week, sep = "")) %>%
  arrange(date)
head(df)
#> # A tibble: 6 x 1
#> date
#> <chr>
#> 1 201401
#> 2 201402
#> 3 201403
#> 4 201404
#> 5 201405
#> 6 201406
Now let's append the 01 and convert to date:
df %>%
  mutate(date = paste(date, "01", sep = ""),
         new_date = as.Date(date, "%Y%U%w"))
#> # A tibble: 312 x 2
#> date new_date
#> <chr> <date>
#> 1 20140101 2014-01-05
#> 2 20140201 2014-01-12
#> 3 20140301 2014-01-19
#> 4 20140401 2014-01-26
#> 5 20140501 2014-02-02
#> 6 20140601 2014-02-09
#> 7 20140701 2014-02-16
#> 8 20140801 2014-02-23
#> 9 20140901 2014-03-02
#> 10 20141001 2014-03-09
#> # ... with 302 more rows
Created on 2019-10-10 by the reprex package (v0.3.0)
More info about the numeric week-of-year format codes (%U, %W, %V) can be found in the strptime documentation (?strptime).
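One caveat: %U uses US-style week numbering (weeks start on Sunday, counted from the first Sunday of the year). If your YearWeek values are ISO week numbers, the ISOweek package converts them directly; a sketch, assuming that package is installed:
library(ISOweek)
df %>%
  mutate(iso = paste0(substr(date, 1, 4), "-W", substr(date, 5, 6), "-1"),
         new_date = ISOweek2date(iso))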

Find the corresponding time for min and max for a day

Below is what my data looks like. I need to find the max and min temp for each day as well as the corresponding time.
Temp date time
280.9876771 01-01-79 03:00:00
291.9695498 01-01-79 06:00:00
294.9583426 01-01-79 09:00:00
290.2357847 01-01-79 12:00:00
286.2944531 01-01-79 15:00:00
282.9282138 01-01-79 18:00:00
280.326689 01-01-79 21:00:00
279.2551605 02-01-79 00:00:00
281.3981824 02-01-79 03:00:00
293.076125 02-01-79 06:00:00
295.8072204 02-01-79 09:00:00
This is the code I tried for the daily min and max temp:
library(xts)
hourly1 <- read.csv("hourly1.csv", header = TRUE)
temp_date1 <- xts(hourly1$Temp, as.Date(hourly1$date))
mintemp1_date <- apply.daily(temp_date1, min)
maxtemp1_date <- apply.daily(temp_date1, max)
I need help with finding the time of day of the min and max temp.
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
dataset <- read.table(text = 'Temp date time
280.9876771 01-01-79 03:00:00
291.9695498 01-01-79 06:00:00
294.9583426 01-01-79 09:00:00
290.2357847 01-01-79 12:00:00
286.2944531 01-01-79 15:00:00
282.9282138 01-01-79 18:00:00
280.326689 01-01-79 21:00:00
279.2551605 02-01-79 00:00:00
281.3981824 02-01-79 03:00:00
293.076125 02-01-79 06:00:00
295.8072204 02-01-79 09:00:00',
header = TRUE,
stringsAsFactors = FALSE)
dataset %>%
  group_by(date) %>%
  summarise(min_temp = min(Temp),
            min_temp_time = time[which.min(x = Temp)],
            max_temp = max(Temp),
            max_temp_time = time[which.max(x = Temp)])
#> # A tibble: 2 x 5
#> date min_temp min_temp_time max_temp max_temp_time
#> <chr> <dbl> <chr> <dbl> <chr>
#> 1 01-01-79 280. 21:00:00 295. 09:00:00
#> 2 02-01-79 279. 00:00:00 296. 09:00:00
Created on 2019-06-15 by the reprex package (v0.3.0)
Hope this helps.
Try the dplyr package.
df <- structure(list(Temp = c(280.9876771, 291.9695498, 294.9583426,
290.2357847, 286.2944531, 282.9282138, 280.326689, 279.2551605,
281.3981824, 293.076125, 295.8072204),
date = c("01-01-79", "01-01-79",
"01-01-79", "01-01-79", "01-01-79", "01-01-79", "01-01-79", "02-01-79",
"02-01-79", "02-01-79", "02-01-79"),
time = c("03:00:00", "06:00:00", "09:00:00", "12:00:00", "15:00:00", "18:00:00", "21:00:00", "00:00:00",
"03:00:00", "06:00:00", "09:00:00")),
row.names = c(NA, -11L),
class = c("data.frame"))
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
df %>%
  group_by(date) %>%
  slice(which.max(Temp))
#> # A tibble: 2 x 3
#> # Groups: date [2]
#> Temp date time
#> <dbl> <chr> <chr>
#> 1 295. 01-01-79 09:00:00
#> 2 296. 02-01-79 09:00:00
df %>%
  group_by(date) %>%
  slice(which.min(Temp))
#> # A tibble: 2 x 3
#> # Groups: date [2]
#> Temp date time
#> <dbl> <chr> <chr>
#> 1 280. 01-01-79 21:00:00
#> 2 279. 02-01-79 00:00:00
Created on 2019-06-15 by the reprex package (v0.3.0)
a data.table + lubridate solution
# load libraries
library(data.table)
library(lubridate)
# load data
dt <- fread(" Temp date time
280.9876771 01-01-79 03:00:00
291.9695498 01-01-79 06:00:00
294.9583426 01-01-79 09:00:00
290.2357847 01-01-79 12:00:00
286.2944531 01-01-79 15:00:00
282.9282138 01-01-79 18:00:00
280.326689 01-01-79 21:00:00
279.2551605 02-01-79 00:00:00
281.3981824 02-01-79 03:00:00
293.076125 02-01-79 06:00:00
295.8072204 02-01-79 09:00:00")
# Convert date - time values to real dates:
dt[, date2 := dmy_hms(paste(date, time, sep = " "))]
# find the date - time for max temp:
dt[, date2[which(Temp == max(Temp))], by = floor_date(date2, "days")]
# find the date - time for min temp:
dt[, date2[which(Temp == min(Temp))], by = floor_date(date2, "days")]
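The two queries can also be combined into a single grouped pass that returns both extremes and their times:
dt[, .(min_temp = min(Temp),
       min_time = date2[which.min(Temp)],
       max_temp = max(Temp),
       max_time = date2[which.max(Temp)]),
   by = .(day = floor_date(date2, "days"))]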
Thank you guys for the help. But I have 116,881 entries, so I tried the index() command in R, which fetched me the corresponding IDs:
index(hourly1)[hourly1$Temp %in% maxtemp1_date] -> max_id
index(hourly1)[hourly1$Temp %in% mintemp1_date] -> min_id
Then I used the VLOOKUP function in Excel to get the desired result.
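The Excel step can stay in R, though. A base-R sketch that recovers the time of each daily maximum by merging the aggregated values back onto the hourly data (assuming hourly1 has Temp, date, and time columns):
daily_max <- aggregate(Temp ~ date, data = hourly1, FUN = max)
merge(daily_max, hourly1, by = c("date", "Temp"))  # date, Temp, and time of each daily max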
In data.table, a rolling minimum over the previous two values within each ID can be computed with frollapply:
dt[, x.value.min := frollapply(x = x, n = 2, min, fill = NA, align = "right", na.rm =TRUE), by = ID]
