Combine data by several conditions in R

I want to merge two data frames according to two conditions:
match rows by ID (only the IDs in the first data frame are retained)
if date_mid (from dat2) falls between date_begin and date_end (both from dat1), paste the result (from dat2); if not, record it as NA
Also, I want to drop a row if the same ID in the combined data already has a result (either healthy or sick). In the example below, I want to drop the 3rd and 12th rows.
First data (dat1):
dat1 <- tibble(ID = c(paste0(rep("A"), 1:10), "A2", "A10"),
date_begin = seq(as.Date("2020/1/1"), by = "month", length.out = 12),
date_end = date_begin + 365)
dat1
# A tibble: 12 x 3
ID date_begin date_end
<chr> <date> <date>
1 A1 2020-01-01 2020-12-31
2 A2 2020-02-01 2021-01-31
3 A3 2020-03-01 2021-03-01
4 A4 2020-04-01 2021-04-01
5 A5 2020-05-01 2021-05-01
6 A6 2020-06-01 2021-06-01
7 A7 2020-07-01 2021-07-01
8 A8 2020-08-01 2021-08-01
9 A9 2020-09-01 2021-09-01
10 A10 2020-10-01 2021-10-01
11 A2 2020-11-01 2021-11-01
12 A10 2020-12-01 2021-12-01
Second data (dat2):
dat2 <- tibble(ID = c(paste0(rep("A"), 1:4), paste0(rep("A"), 9:15), "A2"),
date_mid = seq(as.Date("2020/1/1"), by = "month", length.out = 12) + 100,
result = rep(c("healthy", "sick"), length = 12))
dat2
# A tibble: 12 x 3
ID date_mid result
<chr> <date> <chr>
1 A1 2020-04-10 healthy
2 A2 2020-05-11 sick
3 A3 2020-06-09 healthy
4 A4 2020-07-10 sick
5 A9 2020-08-09 healthy
6 A10 2020-09-09 sick
7 A11 2020-10-09 healthy
8 A12 2020-11-09 sick
9 A13 2020-12-10 healthy
10 A14 2021-01-09 sick
11 A15 2021-02-09 healthy
12 A2 2021-03-11 sick
I have tried left_join as below:
left_join(dat1, dat2, by = "ID") %>%
mutate(result = ifelse(date_mid %within% interval(date_begin, date_end), result, NA))
# A tibble: 14 x 5
ID date_begin date_end date_mid result
<chr> <date> <date> <date> <chr>
1 A1 2020-01-01 2020-12-31 2020-04-10 healthy
2 A2 2020-02-01 2021-01-31 2020-05-11 sick
3 A2 2020-02-01 2021-01-31 2021-03-11 NA
4 A3 2020-03-01 2021-03-01 2020-06-09 healthy
5 A4 2020-04-01 2021-04-01 2020-07-10 sick
6 A5 2020-05-01 2021-05-01 NA NA
7 A6 2020-06-01 2021-06-01 NA NA
8 A7 2020-07-01 2021-07-01 NA NA
9 A8 2020-08-01 2021-08-01 NA NA
10 A9 2020-09-01 2021-09-01 2020-08-09 NA
11 A10 2020-10-01 2021-10-01 2020-09-09 NA
12 A2 2020-11-01 2021-11-01 2020-05-11 NA
13 A2 2020-11-01 2021-11-01 2021-03-11 sick
14 A10 2020-12-01 2021-12-01 2020-09-09 NA
As I mentioned, I want to drop the 3rd and 12th rows (both ID A2), since A2 already has a result (healthy or sick) in the 2nd and 13th rows.
The exact result that I want is something like this (only 2 rows of A2):
# A tibble: 12 x 5
ID date_begin date_end date_mid result
<chr> <date> <date> <date> <chr>
1 A1 2020-01-01 2020-12-31 2020-04-10 healthy
2 A2 2020-02-01 2021-01-31 2020-05-11 sick
3 A3 2020-03-01 2021-03-01 2020-06-09 healthy
4 A4 2020-04-01 2021-04-01 2020-07-10 sick
5 A5 2020-05-01 2021-05-01 NA NA
6 A6 2020-06-01 2021-06-01 NA NA
7 A7 2020-07-01 2021-07-01 NA NA
8 A8 2020-08-01 2021-08-01 NA NA
9 A9 2020-09-01 2021-09-01 2020-08-09 NA
10 A10 2020-10-01 2021-10-01 2020-09-09 NA
11 A2 2020-11-01 2021-11-01 2021-03-11 sick
12 A10 2020-12-01 2021-12-01 2020-09-09 NA
Any pointer is appreciated, thanks.

If a row of dat1 (identified by ID, date_begin and date_end) ends up with more than one row after the join, keep only its non-NA rows. This can be written in dplyr as:
library(dplyr)
library(lubridate)
left_join(dat1, dat2, by = "ID") %>%
mutate(result = ifelse(date_mid %within% interval(date_begin, date_end), result, NA)) %>%
group_by(ID, date_begin, date_end) %>%
filter(if(n() > 1) !is.na(result) else TRUE) %>%
ungroup
# ID date_begin date_end date_mid result
# <chr> <date> <date> <date> <chr>
# 1 A1 2020-01-01 2020-12-31 2020-04-10 healthy
# 2 A2 2020-02-01 2021-01-31 2020-05-11 sick
# 3 A3 2020-03-01 2021-03-01 2020-06-09 healthy
# 4 A4 2020-04-01 2021-04-01 2020-07-10 sick
# 5 A5 2020-05-01 2021-05-01 NA NA
# 6 A6 2020-06-01 2021-06-01 NA NA
# 7 A7 2020-07-01 2021-07-01 NA NA
# 8 A8 2020-08-01 2021-08-01 NA NA
# 9 A9 2020-09-01 2021-09-01 2020-08-09 NA
#10 A10 2020-10-01 2021-10-01 2020-09-09 NA
#11 A2 2020-11-01 2021-11-01 2021-03-11 sick
#12 A10 2020-12-01 2021-12-01 2020-09-09 NA
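One edge case may be worth checking: if a dat1 row is joined to several dat2 rows and none of the dates falls inside the interval, the filter above drops every row of that group. Below is a minimal sketch of a variant that keeps one NA row in that situation. The toy data are made up for illustration (the B1 rows are hypothetical and not part of the question), and the interval test is written with plain comparisons instead of lubridate.

```r
library(dplyr)

# Toy data (hypothetical): B1 has two candidate dates, neither inside its interval
dat1 <- tibble(ID = c("A2", "B1"),
               date_begin = as.Date(c("2020-02-01", "2020-03-01")),
               date_end   = as.Date(c("2021-01-31", "2021-03-01")))
dat2 <- tibble(ID = c("A2", "A2", "B1", "B1"),
               date_mid = as.Date(c("2020-05-11", "2021-03-11",
                                    "2019-01-01", "2022-01-01")),
               result = c("sick", "sick", "healthy", "sick"))

out <- left_join(dat1, dat2, by = "ID") %>%
  # result survives only when date_mid lies inside [date_begin, date_end]
  mutate(result = if_else(date_mid >= date_begin & date_mid <= date_end,
                          result, NA_character_)) %>%
  group_by(ID, date_begin, date_end) %>%
  # keep matched rows; if nothing matched, keep just the first row of the group
  filter(!is.na(result) | (all(is.na(result)) & row_number() == 1)) %>%
  ungroup()

out
```

With these toy inputs, A2 keeps its single matching row and B1 keeps one row with an NA result instead of disappearing.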

Related

R: Create new variable based on date in other variable

I have a data frame that looks somewhat like this:
a = c(seq(as.Date("2020-08-01"), as.Date("2020-11-01"), by="months"), seq(as.Date("2021-08-01"), as.Date("2021-11-01"), by="months"),
seq(as.Date("2022-08-01"), as.Date("2022-11-01"), by="months"))
b = rep(LETTERS[1:3], each = 4)
df = tibble::tibble(ID = b, Date = a) # data_frame() is deprecated in favour of tibble()
> df
ID Date
<chr> <date>
1 A 2020-08-01
2 A 2020-09-01
3 A 2020-10-01
4 A 2020-11-01
5 B 2021-08-01
6 B 2021-09-01
7 B 2021-10-01
8 B 2021-11-01
9 C 2022-08-01
10 C 2022-09-01
11 C 2022-10-01
12 C 2022-11-01
I want to create a new variable holding the smallest value of Date for each ID. The resulting data frame should look like this:
c = c(rep(as.Date("2020-08-01"), each = 4), rep(as.Date("2021-08-01"), each = 4), rep(as.Date("2022-08-01"), each = 4))
df$NewDate = c
> df
# A tibble: 12 × 3
ID Date NewDate
<chr> <date> <date>
1 A 2020-08-01 2020-08-01
2 A 2020-09-01 2020-08-01
3 A 2020-10-01 2020-08-01
4 A 2020-11-01 2020-08-01
5 B 2021-08-01 2021-08-01
6 B 2021-09-01 2021-08-01
7 B 2021-10-01 2021-08-01
8 B 2021-11-01 2021-08-01
9 C 2022-08-01 2022-08-01
10 C 2022-09-01 2022-08-01
11 C 2022-10-01 2022-08-01
12 C 2022-11-01 2022-08-01
Can someone please help me do it? Thank you very much in advance.
First group, then mutate & min:
library(dplyr)
df %>%
group_by(ID) %>%
mutate(NewDate = min(Date)) %>%
ungroup()
#> # A tibble: 12 × 3
#> ID Date NewDate
#> <chr> <date> <date>
#> 1 A 2020-08-01 2020-08-01
#> 2 A 2020-09-01 2020-08-01
#> 3 A 2020-10-01 2020-08-01
#> 4 A 2020-11-01 2020-08-01
#> 5 B 2021-08-01 2021-08-01
#> 6 B 2021-09-01 2021-08-01
#> 7 B 2021-10-01 2021-08-01
#> 8 B 2021-11-01 2021-08-01
#> 9 C 2022-08-01 2022-08-01
#> 10 C 2022-09-01 2022-08-01
#> 11 C 2022-10-01 2022-08-01
#> 12 C 2022-11-01 2022-08-01
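For readers who prefer to avoid a package dependency, the same group-wise minimum can be sketched in base R with ave(). The small data frame below is made up for illustration; the dates are converted to their numeric day counts before calling ave() and converted back afterwards, which is a cautious choice rather than a requirement stated in the answer.

```r
# Toy data (hypothetical), deliberately unsorted within ID "A"
df <- data.frame(ID = rep(c("A", "B"), each = 2),
                 Date = as.Date(c("2020-09-01", "2020-08-01",
                                  "2021-08-01", "2021-10-01")))

# ave() applies min() within each ID and returns a vector as long as the input
df$NewDate <- as.Date(ave(as.numeric(df$Date), df$ID, FUN = min),
                      origin = "1970-01-01")

df
```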

Turn a loop based code into a vectorised one in R?

I've got this dataset and want to perform some calculations based on certain conditions:
library(tidyverse)
library(lubridate)
filas <- structure(list(Año = c(rep(2020,4),rep(2021,4),2022),
Mes = c(2:5,3:4,9,11,1),
Id = c(rep(1,7),2,2)),
row.names = c(NA, -9L),
class = c("tbl_df", "tbl", "data.frame")) %>%
mutate(fecha = make_date(Año,Mes,1),
meses_imp = make_date(2999,1,1))
Año  Mes Id fecha      meses_imp
2020 2   1  2020-02-01 2999-01-01
2020 3   1  2020-03-01 2999-01-01
2020 4   1  2020-04-01 2999-01-01
2020 5   1  2020-05-01 2999-01-01
2021 3   1  2021-03-01 2999-01-01
2021 4   1  2021-04-01 2999-01-01
2021 9   1  2021-09-01 2999-01-01
2021 11  2  2021-11-01 2999-01-01
2022 1   2  2022-01-01 2999-01-01
I need to add rows for each "Id" wherever there is a gap (a "hole") between two consecutive rows, and later count those added rows. I've achieved this using a "while" loop:
i <- 2
while(!is.na(filas[i,]$Id)) {
if (as.double(difftime(filas[i,]$fecha,filas[i-1,]$fecha)) > 31 &
filas[i,]$Id == filas[i-1,]$Id) {
filas <- add_row(filas,
Id = filas[i,]$Id,
fecha = filas[i-1,]$fecha + months(1),
meses_imp = pmin(filas[i-1,]$fecha,
filas[i-1,]$meses_imp),
.after = i-1)}
i=i+1}
filas2 <- filas %>%
group_by(Id,meses_imp) %>%
summarise(cant_meses_imp = n()) %>%
ungroup() %>%
filter(meses_imp != "2999-01-01")
filas <- left_join(filas,
filas2,
by=c("Id","meses_imp"))
Año  Mes Id fecha      meses_imp  cant_meses_imp
2020 2   1  2020-02-01 2999-01-01 NA
2020 3   1  2020-03-01 2999-01-01 NA
2020 4   1  2020-04-01 2999-01-01 NA
2020 5   1  2020-05-01 2999-01-01 NA
NA   NA  1  2020-06-01 2020-05-01 9
NA   NA  1  2020-07-01 2020-05-01 9
NA   NA  1  2020-08-01 2020-05-01 9
NA   NA  1  2020-09-01 2020-05-01 9
NA   NA  1  2020-10-01 2020-05-01 9
NA   NA  1  2020-11-01 2020-05-01 9
NA   NA  1  2020-12-01 2020-05-01 9
NA   NA  1  2021-01-01 2020-05-01 9
NA   NA  1  2021-02-01 2020-05-01 9
2021 3   1  2021-03-01 2999-01-01 NA
2021 4   1  2021-04-01 2999-01-01 NA
NA   NA  1  2021-05-01 2021-04-01 4
NA   NA  1  2021-06-01 2021-04-01 4
NA   NA  1  2021-07-01 2021-04-01 4
NA   NA  1  2021-08-01 2021-04-01 4
2021 9   1  2021-09-01 2999-01-01 NA
2021 11  2  2021-11-01 2999-01-01 NA
NA   NA  2  2021-12-01 2021-11-01 1
2022 1   2  2022-01-01 2999-01-01 NA
Since I'd like to apply this to a much larger dataset (~300k rows), how could I rewrite it in a vectorised way so that it's more efficient (and maybe more elegant)?
Thanks!
You can use the following code, which relies on the padr and zoo packages.
The idea is to:
1. Add the missing dates with the padr::pad() function.
2. Remove unwanted rows (those with non-integer Id values).
3. Create the na and grp columns to identify the rows added in step 1.
4. Group by grp and create a column cant_meses_imp that counts the number of consecutive NA rows in each group.
5. Select only the desired columns.
library(dplyr)
library(padr)
library(zoo)
filas %>%
pad(by = "fecha") %>% # add missing dates
mutate(Id = na.approx(Id)) %>% # interpolate NA values in Id column
subset(Id%%1 == 0) %>% # keep only rows with an integer Id
# This part is for generating the cant_meses_imp column
mutate(na = ifelse(is.na(Mes), 1, 0),
grp = rle(na)$lengths %>% {rep(seq(length(.)), .)}) %>%
group_by(grp) %>%
mutate(cant_meses_imp = ifelse(na == 0, NA, n())) %>%
ungroup() %>%
select(-c(na, grp))
The code does not exactly reproduce the fecha column, as there are no guidelines for its values.
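The gap-counting idea can also be sketched in base R without padr or zoo. The version below is only an illustration under simplifying assumptions: it uses a small made-up data frame with just Id and fecha, builds a full monthly grid per Id, flags the months that were missing, and counts the length of each run of added months. It does not fill meses_imp, and the helper columns added and run are names chosen here, not part of the question.

```r
# Toy data (hypothetical): Id 1 misses April and May 2020, Id 2 misses December 2021
filas <- data.frame(Id = c(1, 1, 1, 2, 2),
                    fecha = as.Date(c("2020-02-01", "2020-03-01", "2020-06-01",
                                      "2021-11-01", "2022-01-01")))

# Full monthly grid between the first and last fecha of every Id
grid <- do.call(rbind, lapply(split(filas, filas$Id), function(d) {
  data.frame(Id = d$Id[1],
             fecha = seq(min(d$fecha), max(d$fecha), by = "month"))
}))

# TRUE for months that were not present in the original data
grid$added <- !(paste(grid$Id, grid$fecha) %in% paste(filas$Id, filas$fecha))

# A new run starts whenever 'added' flips or the Id changes
grid$run <- cumsum(c(TRUE, diff(grid$added) != 0 | diff(grid$Id) != 0))

# Length of each run of added months; NA for original rows
grid$cant_meses_imp <- ifelse(grid$added,
                              ave(as.integer(grid$added), grid$run, FUN = sum),
                              NA)

grid[, c("Id", "fecha", "cant_meses_imp")]
```

Everything here is vectorised apart from the per-Id seq() call, so it should scale better than growing the data frame row by row, although that has not been benchmarked on data of the size mentioned in the question.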

Include in data frame 1 averaged values from several other data frames and based on varying time intervals

I have a data frame with several variables, and whose first columns look like this:
Place <- c(rep("PlaceA",14),rep("PlaceB",15))
Group_Id <- c(rep("A1",5),rep("A1",6),rep("A2",3),rep("B1",6),rep("B2",4),rep("B2",5))
Time <- as.Date(c("2018-01-15","2018-02-03","2018-02-27","2018-03-10","2018-03-18","2019-02-02","2019-03-01","2019-03-15","2019-03-28","2019-04-05","2019-04-12","2018-02-01",
"2018-03-01","2018-04-07","2018-01-17","2018-01-27","2018-02-17","2018-03-03","2018-04-02","2018-04-25","2018-03-03","2018-03-18","2018-04-08","2018-04-20",
"2019-01-23","2019-02-09","2019-02-27","2019-03-12","2019-03-30"))
FollowUp <- c("start",paste("week",week(ymd(Time[2:5]))),"start",paste("week",week(ymd(Time[7:11]))),"start",paste("week",week(ymd(Time[13:14]))),"start",paste("week",week(ymd(Time[16:20]))),"start",paste("week",week(ymd(Time[22:24]))),"start",paste("week",week(ymd(Time[26:29]))))
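exprmt <- c(rep(1,5),rep(2,6),rep(3,3),rep(4,6),rep(5,4),rep(6,5))
# Assemble the vectors into df1 (week() and ymd() above need library(lubridate))
df1 <- data.frame(Place, Group_Id, Time, exprmt, FollowUp)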
> df1
Place Group_Id Time exprmt FollowUp
1 PlaceA A1 2018-01-15 1 start
2 PlaceA A1 2018-02-03 1 week 5
3 PlaceA A1 2018-02-27 1 week 9
4 PlaceA A1 2018-03-10 1 week 10
5 PlaceA A1 2018-03-18 1 week 11
6 PlaceA A1 2019-02-02 2 start
7 PlaceA A1 2019-03-01 2 week 9
8 PlaceA A1 2019-03-15 2 week 11
9 PlaceA A1 2019-03-28 2 week 13
10 PlaceA A1 2019-04-05 2 week 14
11 PlaceA A1 2019-04-12 2 week 15
12 PlaceA A2 2018-02-01 3 start
13 PlaceA A2 2018-03-01 3 week 9
14 PlaceA A2 2018-04-07 3 week 14
15 PlaceB B1 2018-01-17 4 start
16 PlaceB B1 2018-01-27 4 week 4
17 PlaceB B1 2018-02-17 4 week 7
18 PlaceB B1 2018-03-03 4 week 9
19 PlaceB B1 2018-04-02 4 week 14
20 PlaceB B1 2018-04-25 4 week 17
21 PlaceB B2 2018-03-03 5 start
22 PlaceB B2 2018-03-18 5 week 11
23 PlaceB B2 2018-04-08 5 week 14
24 PlaceB B2 2018-04-20 5 week 16
25 PlaceB B2 2019-01-23 6 start
26 PlaceB B2 2019-02-09 6 week 6
27 PlaceB B2 2019-02-27 6 week 9
28 PlaceB B2 2019-03-12 6 week 11
29 PlaceB B2 2019-03-30 6 week 13
For each Place (more than 2 in my actual data), I have a separate data frame with temperature records by hours. For example:
set.seed(1032)
t <- c(seq.POSIXt(from = ISOdate(2018,01,01),to = ISOdate(2018,06,01), by = "hour"),seq.POSIXt(from = ISOdate(2019,01,01),to = ISOdate(2019,06,01), by = "hour"))
temp_A <- runif(length(t),min = 5, max = 25)
temp_B <- runif(length(t),min = 3, max = 32)
data_A <- data.frame(t,temp_A)
data_B <- data.frame(t,temp_B)
> head(data_A)
t temp_A
1 2018-01-01 12:00:00 14.24961
2 2018-01-01 13:00:00 21.64925
3 2018-01-01 14:00:00 21.77058
4 2018-01-01 15:00:00 13.31673
5 2018-01-01 16:00:00 16.10350
6 2018-01-01 17:00:00 17.64567
I need to add a column to df1 with the average temperature for each time interval, by Place, Group_Id and exprmt: the first row of each group should be NaN, and every later row should hold the average over the interval since the previous row. Note that the temperature data for each Place are in a separate data frame.
I tried something like this, but it is not working:
df1 <- df1 %>% group_by(Place,Group_Id,exprmt) %>% mutate(
temp = case_when(FollowUp == "start" & Place == "PlaceA" ~ NA,
FollowUp == FollowUp[c(2:n())] & Place == "PlaceA" ~ mean(temp_A[c(which(date(temp_A$t))==lag(Time,1):which(date(temp_A$t))==Time),2]),
)
)
I found information on how to calculate averages over multiple data frames (e.g. this or this), but that is not what I am looking for. I would like to do it without a loop. My expected result is shown below ("etc" stands for "and so on"):
> df1
Place Group_Id Time exprmt FollowUp expected
1 PlaceA A1 2018-01-15 1 start NaN
2 PlaceA A1 2018-02-03 1 week 5 mean temp_A between 2018-01-15 and 2018-02-03
3 PlaceA A1 2018-02-27 1 week 9 mean temp_A between 2018-02-03 and 2018-02-27
4 PlaceA A1 2018-03-10 1 week 10 mean temp_A between 2018-02-27 and 2018-03-10
5 PlaceA A1 2018-03-18 1 week 11 mean temp_A between 2018-03-10 and 2018-03-18
6 PlaceA A1 2019-02-02 2 start NaN
7 PlaceA A1 2019-03-01 2 week 9 mean temp_A between 2019-02-02 and 2019-03-01
8 PlaceA A1 2019-03-15 2 week 11 etc
9 PlaceA A1 2019-03-28 2 week 13 etc
10 PlaceA A1 2019-04-05 2 week 14 etc
11 PlaceA A1 2019-04-12 2 week 15 etc
12 PlaceA A2 2018-02-01 3 start etc
13 PlaceA A2 2018-03-01 3 week 9 etc
14 PlaceA A2 2018-04-07 3 week 14 etc
15 PlaceB B1 2018-01-17 4 start NaN
16 PlaceB B1 2018-01-27 4 week 4 mean temp_B between 2018-01-17 and 2018-01-27
17 PlaceB B1 2018-02-17 4 week 7 etc
18 PlaceB B1 2018-03-03 4 week 9 etc
19 PlaceB B1 2018-04-02 4 week 14 etc
20 PlaceB B1 2018-04-25 4 week 17 etc
21 PlaceB B2 2018-03-03 5 start etc
22 PlaceB B2 2018-03-18 5 week 11 etc
23 PlaceB B2 2018-04-08 5 week 14 etc
24 PlaceB B2 2018-04-20 5 week 16 etc
25 PlaceB B2 2019-01-23 6 start etc
26 PlaceB B2 2019-02-09 6 week 6 etc
27 PlaceB B2 2019-02-27 6 week 9 etc
28 PlaceB B2 2019-03-12 6 week 11 etc
29 PlaceB B2 2019-03-30 6 week 13 etc
Any help will be appreciated!
I suggest a detailed step-by-step solution (using data.table, lubridate and gtools libraries) which tries not to lose the reader. So, please find below a reprex.
Reprex
1. DATA PREPARATION
library(data.table)
library(lubridate)
library(gtools)
# Convert the dataframe 'df1' into a data.table and add the dummy variable 'StartTime'
setDT(df1)[, StartTime := shift(Time,1), by = .(Place, Group_Id, exprmt)][]
setcolorder(df1, c("Place", "Group_Id", "FollowUp", "exprmt", "StartTime", "Time"))
# Convert the 'StartTime' and 'Time' columns to class 'POSIXct' in ymd_hms format
# with the function 'ymd_TO_ymd_hms'
ymd_TO_ymd_hms <- function(x,y) as_datetime(as.double(as.POSIXct(x)+3600), tz = y)
sel_cols <- c("StartTime", "Time")
df1[, (sel_cols) := lapply(.SD, ymd_TO_ymd_hms, "GMT"), .SDcols = sel_cols][, Time := Time - 3600]
# Here is what 'df1' looks like:
df1
#> Place Group_Id FollowUp exprmt StartTime Time
#> 1: PlaceA A1 start 1 <NA> 2018-01-14 23:00:00
#> 2: PlaceA A1 week 5 1 2018-01-15 00:00:00 2018-02-02 23:00:00
#> 3: PlaceA A1 week 9 1 2018-02-03 00:00:00 2018-02-26 23:00:00
#> 4: PlaceA A1 week 10 1 2018-02-27 00:00:00 2018-03-09 23:00:00
#> 5: PlaceA A1 week 11 1 2018-03-10 00:00:00 2018-03-17 23:00:00
#> 6: PlaceA A1 start 2 <NA> 2019-02-01 23:00:00
#> 7: PlaceA A1 week 9 2 2019-02-02 00:00:00 2019-02-28 23:00:00
#> 8: PlaceA A1 week 11 2 2019-03-01 00:00:00 2019-03-14 23:00:00
#> 9: PlaceA A1 week 13 2 2019-03-15 00:00:00 2019-03-27 23:00:00
#> 10: PlaceA A1 ...
# Convert the dataframes 'data_A' and 'data_B' into data.tables
setDT(data_A)
setDT(data_B)
2. EXPAND ROWS OF 'df1' BY DATE RANGE USING 'StartTime' and 'Time'
df1_time_seq <- df1[!is.na(StartTime) # remove rows where StartTime = NA
][ ,.(Place = Place, Group_Id = Group_Id, FollowUp = FollowUp, exprmt = exprmt, Time_seq = seq(from = StartTime, to = Time, by = "hour")), by = 1:nrow(df1[!is.na(StartTime)])]
df1_time_seq
#> nrow Place Group_Id FollowUp exprmt Time_seq
#> 1: 1 PlaceA A1 week 5 1 2018-01-15 00:00:00
#> 2: 1 PlaceA A1 week 5 1 2018-01-15 01:00:00
#> 3: 1 PlaceA A1 week 5 1 2018-01-15 02:00:00
#> 4: 1 PlaceA A1 week 5 1 2018-01-15 03:00:00
#> 5: 1 PlaceA A1 week 5 1 2018-01-15 04:00:00
#> ---
#> 9784: 23 PlaceB B2 week 13 6 2019-03-29 19:00:00
#> 9785: 23 PlaceB B2 week 13 6 2019-03-29 20:00:00
#> 9786: 23 PlaceB B2 week 13 6 2019-03-29 21:00:00
#> 9787: 23 PlaceB B2 week 13 6 2019-03-29 22:00:00
#> 9788: 23 PlaceB B2 week 13 6 2019-03-29 23:00:00
3. JOINS
# Merge 'data_A' and 'data_B' on 't'
data_merge <- merge(data_A, data_B, by = 't')
# Merge 'df1_time_seq' and 'data_merge' on 'Time_seq' = 't' and add a column 'temp' filled with 'temp_A' values when 'Place == PlaceA' and 'temp_B' values when 'Place == PlaceB'
df1_time_seq_merge <- merge(df1_time_seq, data_merge, by.x = "Time_seq", by.y = "t")[, temp := fcase(Place == "PlaceA", temp_A,
Place == "PlaceB", temp_B)
][, `:=` (temp_A = NULL, temp_B = NULL)
][]
df1_time_seq_merge
#> Time_seq nrow Place Group_Id FollowUp exprmt temp
#> 1: 2018-01-15 00:00:00 1 PlaceA A1 week 5 1 10.618465
#> 2: 2018-01-15 01:00:00 1 PlaceA A1 week 5 1 16.156850
#> 3: 2018-01-15 02:00:00 1 PlaceA A1 week 5 1 6.806842
#> 4: 2018-01-15 03:00:00 1 PlaceA A1 week 5 1 21.036855
#> 5: 2018-01-15 04:00:00 1 PlaceA A1 week 5 1 21.578569
#> ---
#> 9784: 2019-04-11 18:00:00 9 PlaceA A1 week 15 2 16.646570
#> 9785: 2019-04-11 19:00:00 9 PlaceA A1 week 15 2 12.362436
#> 9786: 2019-04-11 20:00:00 9 PlaceA A1 week 15 2 24.853746
#> 9787: 2019-04-11 21:00:00 9 PlaceA A1 week 15 2 22.553074
#> 9788: 2019-04-11 22:00:00 9 PlaceA A1 week 15 2 21.020600
4. SUMMARIZE 'df1_time_seq_merge'
# Summarize df1_time_seq_merge to get the mean of 'temp' by group in the 'expected' variable
df1_mean <- df1_time_seq_merge[, .(expected = mean(temp)), by = .(Place, Group_Id, exprmt, FollowUp)]
df1_mean
#> Place Group_Id exprmt FollowUp expected
#> 1: PlaceA A1 1 week 5 15.17243
#> 2: PlaceB B1 4 week 4 19.26662
#> 3: PlaceB B1 4 week 7 17.32940
#> 4: PlaceA A2 3 week 9 14.92409
#> 5: PlaceA A1 1 week 9 14.86734
#> 6: PlaceB B1 4 week 9 18.36255
#> 7: PlaceA A1 1 week 10 14.75482
#> 8: PlaceA A2 3 week 14 14.86063
#> 9: PlaceB B1 4 week 14 17.35101
#> 10: PlaceB B2 5 week 11 17.93565
#> 11: PlaceA A1 1 week 11 14.86273
#> 12: PlaceB B2 5 week 14 16.77532
#> 13: PlaceB B1 4 week 17 18.00866
#> 14: PlaceB B2 5 week 16 18.15545
#> 15: PlaceB B2 6 week 6 17.95428
#> 16: PlaceA A1 2 week 9 14.96347
#> 17: PlaceB B2 6 week 9 16.85704
#> 18: PlaceB B2 6 week 11 17.23744
#> 19: PlaceA A1 2 week 11 15.22046
#> 20: PlaceB B2 6 week 13 17.33922
#> 21: PlaceA A1 2 week 13 14.58677
#> 22: PlaceA A1 2 week 14 15.24341
#> 23: PlaceA A1 2 week 15 15.87080
#> Place Group_Id exprmt FollowUp expected
5. FINAL JOIN BETWEEN 'df1' AND 'df1_MEAN'
DF_Results <- merge(df1, df1_mean, by = c("Place", "Group_Id", "exprmt", "FollowUp"), all.x = TRUE)[, Time := Time + 3600][]
6. CLEANING 'DF_Results' TO GET THE DESIRED OUTPUT
ymd_hms_TO_ymd <- function(x) as_date(as.POSIXct(x))
DF_Results[, `:=` (StartTime = NULL, Time = lapply(Time, ymd_hms_TO_ymd))]
setcolorder(DF_Results, c("Place", "Group_Id", "exprmt", "Time", "FollowUp", "expected"))
DF_Results <- DF_Results[gtools::mixedorder(FollowUp, decreasing = FALSE)]
setorder(DF_Results, Place, Group_Id, exprmt)
DF_Results
#> Place Group_Id exprmt Time FollowUp expected
#> 1: PlaceA A1 1 2018-01-15 start NA
#> 2: PlaceA A1 1 2018-02-03 week 5 15.17243
#> 3: PlaceA A1 1 2018-02-27 week 9 14.86734
#> 4: PlaceA A1 1 2018-03-10 week 10 14.75482
#> 5: PlaceA A1 1 2018-03-18 week 11 14.86273
#> 6: PlaceA A1 2 2019-02-02 start NA
#> 7: PlaceA A1 2 2019-03-01 week 9 14.96347
#> 8: PlaceA A1 2 2019-03-15 week 11 15.22046
#> 9: PlaceA A1 2 2019-03-28 week 13 14.58677
#> 10: PlaceA A1 2 2019-04-04 week 14 15.24341
#> 11: PlaceA A1 2 2019-04-11 week 15 15.87080
#> 12: PlaceA A2 3 2018-02-01 start NA
#> 13: PlaceA A2 3 2018-03-01 week 9 14.92409
#> 14: PlaceA A2 3 2018-04-06 week 14 14.86063
#> 15: PlaceB B1 4 2018-01-17 start NA
#> 16: PlaceB B1 4 2018-01-27 week 4 19.26662
#> 17: PlaceB B1 4 2018-02-17 week 7 17.32940
#> 18: PlaceB B1 4 2018-03-03 week 9 18.36255
#> 19: PlaceB B1 4 2018-04-01 week 14 17.35101
#> 20: PlaceB B1 4 2018-04-24 week 17 18.00866
#> 21: PlaceB B2 5 2018-03-03 start NA
#> 22: PlaceB B2 5 2018-03-18 week 11 17.93565
#> 23: PlaceB B2 5 2018-04-07 week 14 16.77532
#> 24: PlaceB B2 5 2018-04-19 week 16 18.15545
#> 25: PlaceB B2 6 2019-01-23 start NA
#> 26: PlaceB B2 6 2019-02-09 week 6 17.95428
#> 27: PlaceB B2 6 2019-02-27 week 9 16.85704
#> 28: PlaceB B2 6 2019-03-12 week 11 17.23744
#> 29: PlaceB B2 6 2019-03-30 week 13 17.33922
#> Place Group_Id exprmt Time FollowUp expected
Created on 2021-11-24 by the reprex package (v2.0.1)
Here is a solution using the temperature data for the two places. It can be generalized either by joining everything into a single data object (if there are only a few places) or by using an ifelse statement.
library(data.table)
setDT(df1)
setDT(data_A) # converting to data.table
setDT(data_B) # converting to data.table
Merge the temperature data into a single data object
data_AB <- merge(data_A, data_B, by = 't')
Create a lagged column of the Time variable within each Place, Group_Id and exprmt
df1[,':='(LAG_DATE = shift(Time, type = 'lag')), by = .(Place, Group_Id, exprmt)]
Use apply with a user-defined function to subset the temperature data for each pair of consecutive time points, and use data.table together with lapply to get the mean of each subset.
Here I have assumed that the Place column can be mapped to the matching temperature column by some rule.
In the example shared, the column name temp_A/temp_B can be formed by concatenating 'temp_' and the 6th character of the Place column.
df1[,':='(EXPECTED = apply(cbind(LAG_DATE, Time, Place), 1, function(x) {
x1 <- as.Date(as.numeric(x[1]), origin = '1970-01-01')
x2 <- as.Date(as.numeric(x[2]), origin = '1970-01-01')
Place <- as.character(x[3])
Mean_Value <- ifelse(is.na(x1), NaN, data_AB[as.Date(t) >= x1 &
as.Date(t) <= x2, lapply(.SD, mean), .SDcols = paste('temp_', substr(Place, 6,
6), sep = '')])
return(as.numeric(Mean_Value))
}
))]
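The core of both answers is the same: pair each visit with the previous visit in its group, then average the hourly readings that fall between the two dates. Stripped of the data.table machinery, the idea can be sketched in base R as follows. The data are made up and deterministic so the result is easy to check by hand, the helper column start is a name chosen here, and both interval ends are treated as inclusive, as in the second answer.

```r
# Toy hourly temperatures (hypothetical): four days at 10, 20, 30 and 40 degrees
temp <- data.frame(
  t = seq(as.POSIXct("2018-01-01 00:00:00", tz = "UTC"), by = "hour", length.out = 96),
  temp_A = rep(c(10, 20, 30, 40), each = 24)
)
day <- as.Date(temp$t, tz = "UTC")   # calendar day of every reading

# Toy visits for a single group; 'start' is the previous visit (NA for the first)
visits <- data.frame(Time = as.Date(c("2018-01-01", "2018-01-03", "2018-01-04")))
visits$start <- c(as.Date(NA), head(visits$Time, -1))

# Mean temperature between the previous visit and the current one (inclusive)
visits$expected <- vapply(seq_len(nrow(visits)), function(i) {
  from <- visits$start[i]
  to <- visits$Time[i]
  if (is.na(from)) return(NaN)       # first visit of the group has no interval
  mean(temp$temp_A[day >= from & day <= to])
}, numeric(1))

visits
```

For real data the lag would be computed within each Place, Group_Id and exprmt group, and the temperature column would be chosen according to Place, as the answers above do.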

R's padr package claiming the "datetime variable does not vary" when it does vary

library(tidyverse)
library(lubridate)
library(padr)
df
#> # A tibble: 828 x 5
#> Scar_Id Code Type Value YrMo
#> <chr> <chr> <chr> <date> <date>
#> 1 0070-179 AA Start_Date 2020-04-22 2020-04-01
#> 2 0070-179 AA Closure_Date 2020-05-23 2020-05-01
#> 3 1139-179 AA Start_Date 2020-04-23 2020-04-01
#> 4 1139-179 AA Closure_Date 2020-05-23 2020-05-01
#> 5 262-179 AA Start_Date 2019-08-29 2019-08-01
#> 6 262-179 AA Closure_Date 2020-05-23 2020-05-01
#> 7 270-179 AA Start_Date 2019-08-29 2019-08-01
#> 8 270-179 AA Closure_Date 2020-05-23 2020-05-01
#> 9 476-179 BB Start_Date 2019-09-04 2019-09-01
#> 10 476-179 BB Closure_Date 2019-11-04 2019-11-01
#> # ... with 818 more rows
I have an R data frame named df shown above. I want to concentrate on row numbers 5 and 6. I can usually use the package padr to pad the months in between rows 5 and 6. The pad() function of the padr will basically add rows at intervals the user specifies, best shown as the added rows "X" below.
#> 1 0070-179 AA Start_Date 2020-04-22 2020-04-01
#> 2 0070-179 AA Closure_Date 2020-05-23 2020-05-01
#> 3 1139-179 AA Start_Date 2020-04-23 2020-04-01
#> 4 1139-179 AA Closure_Date 2020-05-23 2020-05-01
#> 5 262-179 AA Start_Date 2019-08-29 2019-08-01
#> X 262-179 NA NA NA 2019-09-01
#> X 262-179 NA NA NA 2019-10-01
#> X 262-179 NA NA NA 2019-11-01
#> X 262-179 NA NA NA 2019-12-01
#> X 262-179 NA NA NA 2020-01-01
#> X 262-179 NA NA NA 2020-02-01
#> X 262-179 NA NA NA 2020-03-01
#> X 262-179 NA NA NA 2020-04-01
#> 6 262-179 AA Closure_Date 2020-05-23 2020-05-01
#> 7 270-179 AA Start_Date 2019-08-29 2019-08-01
#> 8 270-179 AA Closure_Date 2020-05-23 2020-05-01
#> 9 476-179 BB Start_Date 2019-09-04 2019-09-01
#> 10 476-179 BB Closure_Date 2019-11-04 2019-11-01
To get there I usually issue a command, such as is shown below, and it works fine in padr. But it doesn't work in my specific example, and instead yields the warning shown below.
df %>% pad(group = "Scar_Id", by = "YrMo", interval = "month")
#> # A tibble: 828 x 5
#> Scar_Id Code Type Value YrMo
#> <chr> <chr> <chr> <date> <date>
#> 1 0070-179 AA Start_Date 2020-04-22 2020-04-01
#> 2 0070-179 AA Closure_Date 2020-05-23 2020-05-01
#> 3 1139-179 AA Start_Date 2020-04-23 2020-04-01
#> 4 1139-179 AA Closure_Date 2020-05-23 2020-05-01
#> 5 262-179 AA Start_Date 2019-08-29 2019-08-01
#> 6 262-179 AA Closure_Date 2020-05-23 2020-05-01
#> 7 270-179 AA Start_Date 2019-08-29 2019-08-01
#> 8 270-179 AA Closure_Date 2020-05-23 2020-05-01
#> 9 476-179 BB Start_Date 2019-09-04 2019-09-01
#> 10 476-179 BB Closure_Date 2019-11-04 2019-11-01
#> # ... with 818 more rows
#> Warning message:
#> datetime variable does not vary for 537 of the groups, no padding applied on this / these group(s)
Why does it claim that "the datetime variable does not vary" for rows 5 and 6, when the datetime does indeed vary? For row 5 the value of YrMo is "2019-08-01" and for row 6 it is "2020-05-01", and "2019-08-01" clearly differs from "2020-05-01".
Any ideas what went wrong? I tried to create a reproducible example and could not. The basic examples I created all work as expected (as I describe). Hopefully these clues can help somebody determine what is going on.
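The warning counts groups rather than rows, so one possible diagnostic (not an explanation of the behaviour described above) is to count the distinct YrMo values per Scar_Id and list the groups that have fewer than two. A minimal base R sketch on made-up data, where the Scar_Id "999-000" is hypothetical:

```r
# Toy data (hypothetical): "262-179" varies, "999-000" has the same month twice
df <- data.frame(Scar_Id = c("262-179", "262-179", "999-000", "999-000"),
                 YrMo = as.Date(c("2019-08-01", "2020-05-01",
                                  "2020-04-01", "2020-04-01")))

# Number of distinct YrMo values in each group
n_distinct_dates <- tapply(df$YrMo, df$Scar_Id, function(x) length(unique(x)))

# Groups with a single distinct date have nothing to pad in between
no_variation <- names(n_distinct_dates)[n_distinct_dates < 2]
no_variation
```

Running the same count on the real data would show whether the group containing rows 5 and 6 is among the 537 reported by the warning.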

Ascending group by date

I am not able to sort the rows within each group by date in ascending order. Please help!
df <- data.frame(A = c('a1','a1','b1','b1','b1','c2','d2','d2'),
B = c("2017-02-20","2018-02-14","2017-02-06","2018-02-27","2017-02-29","2017-02-28","2017-02-09","2017-02-10"))
Code:
df %>% group_by(A) %>% arrange(A,(as.Date(B)))
I am getting the wrong result, as the b1 group is not sorted:
A B
<fctr> <fctr>
1 a1 2017-02-20
2 a1 2018-02-14
3 b1 2017-02-06
4 b1 2018-02-27
5 b1 2017-02-29
6 c2 2017-02-28
7 d2 2017-02-09
8 d2 2017-02-10
You can see that 2017-02-29 is not a real date: February 2017 had only 28 days. So, when you convert column B to Date, that value becomes NA. Fix that entry and your code should work.
Also, you probably do not need to group_by A
library(dplyr)
#>
df <- data.frame(A = c('a1','a1','b1','b1','b1','c2','d2','d2'),
B = c("2017-02-20","2018-02-14","2017-02-06","2018-02-27","2017-02-29","2017-02-28","2017-02-09","2017-02-10"))
as.Date(df$B)
#> [1] "2017-02-20" "2018-02-14" "2017-02-06" "2018-02-27" NA
#> [6] "2017-02-28" "2017-02-09" "2017-02-10"
df%>%arrange(A, as.Date(B))
#> A B
#> 1 a1 2017-02-20
#> 2 a1 2018-02-14
#> 3 b1 2017-02-06
#> 4 b1 2018-02-27
#> 5 b1 2017-02-29
#> 6 c2 2017-02-28
#> 7 d2 2017-02-09
#> 8 d2 2017-02-10
Created on 2019-09-16 by the reprex package (v0.2.1)
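To find such entries before sorting, one option is to parse the column with an explicit format and look at which values come back as NA. A minimal base R sketch on a shortened, made-up version of the data (the names parsed, bad and sorted are chosen here for illustration):

```r
# Shortened toy version of the data, including the impossible date
df <- data.frame(A = c("a1", "b1", "b1", "b1"),
                 B = c("2017-02-20", "2018-02-27", "2017-02-29", "2017-02-06"),
                 stringsAsFactors = FALSE)

# Parse with an explicit format; impossible dates become NA
parsed <- as.Date(df$B, format = "%Y-%m-%d")

# Original strings that failed to parse
bad <- df$B[is.na(parsed)]
bad

# Sort by group, then by date; NA dates end up last within their group
sorted <- df[order(df$A, parsed), ]
sorted
```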
