I want to group by person and the date portion of date1, and fill in missing date2 and indicator values by person and day if the person's next observation occurs on the same day.
For instance, person 1 is missing date2 and indicator values for the second and third observations. As shown below, I want to replace these missing values with the next non-NA observation on the same day for this person: date2 == 2018-02-02 15:04:00 and indicator == 1.
Note that for person 2, the last NA does not have a next observation on the same day, so it needs to remain NA.
Here is the data frame I have:
person date1 date2 indicator
1 1 2018-02-02 12:00:00 2018-02-02 12:05:00 1
2 1 2018-02-02 13:00:00 <NA> NA
3 1 2018-02-02 14:00:00 <NA> NA
4 1 2018-02-02 15:00:00 2018-02-02 15:04:00 1
5 2 2018-02-01 12:00:00 <NA> NA
6 2 2018-02-01 13:00:00 2018-02-01 13:06:00 1
7 2 2018-02-02 12:00:00 2018-02-02 12:03:00 1
8 2 2018-02-03 12:00:00 <NA> NA
Here is the data frame I want:
person date1 date2 indicator
1 1 2018-02-02 12:00:00 2018-02-02 12:05:00 1
2 1 2018-02-02 13:00:00 2018-02-02 15:04:00 1
3 1 2018-02-02 14:00:00 2018-02-02 15:04:00 1
4 1 2018-02-02 15:00:00 2018-02-02 15:04:00 1
5 2 2018-02-01 12:00:00 2018-02-01 13:06:00 1
6 2 2018-02-01 13:00:00 2018-02-01 13:06:00 1
7 2 2018-02-02 12:00:00 2018-02-02 12:03:00 1
8 2 2018-02-03 12:00:00 <NA> NA
Example:
library(tidyverse)
library(lubridate) # for ymd_hms() and date()
df.have <- data.frame(person=c(1, 1, 1, 1, 2, 2, 2, 2),
date1=ymd_hms(c("2018-02-02 12:00:00",
"2018-02-02 13:00:00",
"2018-02-02 14:00:00",
"2018-02-02 15:00:00",
"2018-02-01 12:00:00",
"2018-02-01 13:00:00",
"2018-02-02 12:00:00",
"2018-02-03 12:00:00")),
date2=ymd_hms(c("2018-02-02 12:05:00",
NA,
NA,
"2018-02-02 15:04:00",
NA,
"2018-02-01 13:06:00",
"2018-02-02 12:03:00",
NA)),
indicator=c(1, NA, NA, 1,
NA, 1, 1, NA))
df.want <- data.frame(person=c(1, 1, 1, 1, 2, 2, 2, 2),
date1=ymd_hms(c("2018-02-02 12:00:00",
"2018-02-02 13:00:00",
"2018-02-02 14:00:00",
"2018-02-02 15:00:00",
"2018-02-01 12:00:00",
"2018-02-01 13:00:00",
"2018-02-02 12:00:00",
"2018-02-03 12:00:00")),
date2=ymd_hms(c("2018-02-02 12:05:00",
"2018-02-02 15:04:00",
"2018-02-02 15:04:00",
"2018-02-02 15:04:00",
"2018-02-01 13:06:00",
"2018-02-01 13:06:00",
"2018-02-02 12:03:00",
NA)),
indicator=c(1, 1, 1, 1,
1, 1, 1, NA))
I can filter down to some of the replacement values, but I'm still a good bit from where I want to be.
df.have %>%
group_by(person, date(date1)) %>%
arrange(person, date1) %>%
filter(row_number() %in% c(n()))
You can do it like this (note that you need lubridate in addition to the tidyverse packages)...
df.want <- df.have %>%
  mutate(day = date(date1)) %>% # add a date variable for grouping
  group_by(day, person) %>%
  fill(date2, indicator, .direction = "up") %>% # use tidyr's fill() to fill NAs upward
  ungroup() %>%
  select(-day) %>% # remove the grouping variable
  arrange(person, date1) # restore the original order
df.want
# A tibble: 8 x 4
person date1 date2 indicator
<dbl> <dttm> <dttm> <dbl>
1 1 2018-02-02 12:00:00 2018-02-02 12:05:00 1
2 1 2018-02-02 13:00:00 2018-02-02 15:04:00 1
3 1 2018-02-02 14:00:00 2018-02-02 15:04:00 1
4 1 2018-02-02 15:00:00 2018-02-02 15:04:00 1
5 2 2018-02-01 12:00:00 2018-02-01 13:06:00 1
6 2 2018-02-01 13:00:00 2018-02-01 13:06:00 1
7 2 2018-02-02 12:00:00 2018-02-02 12:03:00 1
8 2 2018-02-03 12:00:00 NA NA
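As a minor variation on the same idea, the grouping day can be created inline in group_by(), which saves the separate mutate() step (a sketch assuming the same df.have):
df.have %>%
  group_by(person, day = date(date1)) %>% # grouping variable created inline
  fill(date2, indicator, .direction = "up") %>%
  ungroup() %>%
  select(-day) %>%
  arrange(person, date1)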
This question already has answers here: Select the row with the maximum value in each group.
Date_Time wind_cardinal_direction_set_1d weather_condition_set_1d n
<dttm> <chr> <chr> <int>
1 2015-01-01 01:00:00 N Fog 1
2 2015-01-01 01:00:00 N Mist 2
3 2015-01-01 02:00:00 N Fog 2
4 2015-01-01 02:00:00 N Mist 1
5 2015-01-01 03:00:00 N Fog 3
6 2015-01-01 04:00:00 N Mist 3
7 2015-01-01 05:00:00 N Mist 3
8 2015-01-01 06:00:00 N Mist 3
9 2015-01-01 07:00:00 N Fog 2
10 2015-01-01 07:00:00 N Mist 1
# ... with 6,798 more rows
For each date-time combination, I want to keep the row with the max value of n. This attempt doesn't work:
df_cat %>% filter(df_cat$n > df_cat$n)
Welcome to SO! It seems you like dplyr, so here's a solution:
library(dplyr)
df_cat %>%
  group_by(Date_Time) %>% # group by date-time
  summarise(n = max(n)) %>% # get the max value of n per group
  left_join(df_cat) %>% # fetch the other columns
  # restore the column order
  select(Date_Time, wind_cardinal_direction_set_1d, weather_condition_set_1d, n)
Joining, by = c("Date_Time", "n")
# A tibble: 7 x 4
Date_Time wind_cardinal_direction_set_1d weather_condition_set_1d n
<fct> <fct> <fct> <int>
1 2015-01-01 01:00:00 N Mist 2
2 2015-01-01 02:00:00 N Fog 2
3 2015-01-01 03:00:00 N Fog 3
4 2015-01-01 04:00:00 N Mist 3
5 2015-01-01 05:00:00 N Mist 3
6 2015-01-01 06:00:00 N Mist 3
7 2015-01-01 07:00:00 N Fog 2
Or, thanks to Ronak Shah, you can do it this way:
df_cat %>% group_by(Date_Time) %>% slice(which.max(n))
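Note that slice(which.max(n)) keeps only the first row when several rows tie for the maximum. If you want to keep all tied rows instead, a filter() variant works:
df_cat %>%
  group_by(Date_Time) %>%
  filter(n == max(n)) %>% # retains every row that attains the group maximum
  ungroup()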
With data:
df_cat <- read.table(text ='Date_Time wind_cardinal_direction_set_1d weather_condition_set_1d n
1 "2015-01-01 01:00:00" N Fog 1
2 "2015-01-01 01:00:00" N Mist 2
3 "2015-01-01 02:00:00" N Fog 2
4 "2015-01-01 02:00:00" N Mist 1
5 "2015-01-01 03:00:00" N Fog 3
6 "2015-01-01 04:00:00" N Mist 3
7 "2015-01-01 05:00:00" N Mist 3
8 "2015-01-01 06:00:00" N Mist 3
9 "2015-01-01 07:00:00" N Fog 2
10 "2015-01-01 07:00:00" N Mist 1', header = T)
I'm having some trouble with the logic I need to produce df$val_most_recent. If there's a value for both a_val and b_val, val_most_recent should be the value with the more recent time (a_val corresponds to a_dtm, b_val corresponds to b_dtm). If the times are identical, I'd like a_val to be val_most_recent. If just one of the two values is reported (with the other being NA), it should simply be that one.
library(tidyverse)
library(lubridate)
location <- c("a", "b", "c", "d")
a_dtm <- ymd_hm(c(NA, "2019-06-05 10:30", "2019-06-05 10:45", "2019-06-05 10:50"))
b_dtm <- ymd_hm(c("2019-06-05 10:30", NA, "2019-06-05 10:48", "2019-06-05 10:50"))
a_val <- c(NA, 6, 4, 2)
b_val <- c(5, NA, 3, 2)
df <- data.frame(location, a_dtm, b_dtm, a_val, b_val)
as_tibble(df)
# A tibble: 4 x 5
#location a_dtm b_dtm a_val b_val
#<fct> <dttm> <dttm> <dbl> <dbl>
#1 a NA 2019-06-05 10:30:00 NA 5
#2 b 2019-06-05 10:30:00 NA 6 NA
#3 c 2019-06-05 10:45:00 2019-06-05 10:48:00 4 3
#4 d 2019-06-05 10:50:00 2019-06-05 10:50:00 2 2
val_most_recent <- c(5,6,3,2)
desired_df <- cbind(df, val_most_recent)
as_tibble(desired_df)
#location a_dtm b_dtm a_val b_val val_most_recent
#<fct> <dttm> <dttm> <dbl> <dbl> <dbl>
#1 a NA 2019-06-05 10:30:00 NA 5 5
#2 b 2019-06-05 10:30:00 NA 6 NA 6
#3 c 2019-06-05 10:45:00 2019-06-05 10:48:00 4 3 3
#4 d 2019-06-05 10:50:00 2019-06-05 10:50:00 2 2 2
Here is the logic from your text coded into a case_when statement:
df %>%
  mutate(
    val_most_recent = case_when(
      is.na(a_val) | is.na(b_val) ~ coalesce(a_val, b_val),
      a_dtm >= b_dtm ~ a_val,
      TRUE ~ b_val
    )
  )
# location a_dtm b_dtm a_val b_val val_most_recent
# 1 a <NA> 2019-06-05 10:30:00 NA 5 5
# 2 b 2019-06-05 10:30:00 <NA> 6 NA 6
# 3 c 2019-06-05 10:45:00 2019-06-05 10:48:00 4 3 3
# 4 d 2019-06-05 10:50:00 2019-06-05 10:50:00 2 2 2
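The ordering matters here: case_when() returns the value for the first condition that matches, so rows with an NA on either side are resolved by coalesce() before the date comparison (which would evaluate to NA for those rows) is reached. A quick illustration of coalesce(), which picks the first non-NA value elementwise:
library(dplyr)
coalesce(c(NA, 6), c(5, NA))
#> [1] 5 6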
Here is one option in base R: convert the dates to numeric, replace the NAs with 0, get the column index of the max value in each row (with ties broken by "first", so a_val wins when the times are equal), cbind that with the row index, and use matrix indexing to extract the corresponding values from the a_val/b_val columns.
m1 <- sapply(df[2:3], as.numeric) # date-times as numeric seconds (NA stays NA)
df$val_most_recent <- df[4:5][cbind(seq_len(nrow(m1)),
    max.col(replace(m1, is.na(m1), 0), "first"))]
df$val_most_recent
#[1] 5 6 3 2
I've got a fairly straightforward problem, but I'm struggling to find a solution that doesn't require a wall of code and complicated loops.
I've got a summary table, df, for an hourly timeseries dataset where each observation belongs to a group.
I want to merge some of those groups, based on a boolean column in the summary table.
The boolean column, merge_with_next, indicates whether a given group should be merged with the next group (one row down).
The merging effectively occurs by updating the end value and removing rows:
library(dplyr)
# Demo data
df <- tibble(
group = 1:12,
start = seq.POSIXt(as.POSIXct("2019-01-01 00:00"), as.POSIXct("2019-01-12 00:00"), by = "1 day"),
end = seq.POSIXt(as.POSIXct("2019-01-01 23:59"), as.POSIXct("2019-01-12 23:59"), by = "1 day"),
merge_with_next = rep(c(TRUE, TRUE, FALSE), 4)
)
df
#> # A tibble: 12 x 4
#> group start end merge_with_next
#> <int> <dttm> <dttm> <lgl>
#> 1 1 2019-01-01 00:00:00 2019-01-01 23:59:00 TRUE
#> 2 2 2019-01-02 00:00:00 2019-01-02 23:59:00 TRUE
#> 3 3 2019-01-03 00:00:00 2019-01-03 23:59:00 FALSE
#> 4 4 2019-01-04 00:00:00 2019-01-04 23:59:00 TRUE
#> 5 5 2019-01-05 00:00:00 2019-01-05 23:59:00 TRUE
#> 6 6 2019-01-06 00:00:00 2019-01-06 23:59:00 FALSE
#> 7 7 2019-01-07 00:00:00 2019-01-07 23:59:00 TRUE
#> 8 8 2019-01-08 00:00:00 2019-01-08 23:59:00 TRUE
#> 9 9 2019-01-09 00:00:00 2019-01-09 23:59:00 FALSE
#> 10 10 2019-01-10 00:00:00 2019-01-10 23:59:00 TRUE
#> 11 11 2019-01-11 00:00:00 2019-01-11 23:59:00 TRUE
#> 12 12 2019-01-12 00:00:00 2019-01-12 23:59:00 FALSE
# Desired result
desired <- tibble(
group = c(1, 4, 7, 9),
start = c("2019-01-01 00:00", "2019-01-04 00:00", "2019-01-07 00:00", "2019-01-10 00:00"),
end = c("2019-01-03 23:59", "2019-01-06 23:59", "2019-01-09 23:59", "2019-01-12 23:59")
)
desired
#> # A tibble: 4 x 3
#> group start end
#> <dbl> <chr> <chr>
#> 1 1 2019-01-01 00:00 2019-01-03 23:59
#> 2 4 2019-01-04 00:00 2019-01-06 23:59
#> 3 7 2019-01-07 00:00 2019-01-09 23:59
#> 4 9 2019-01-10 00:00 2019-01-12 23:59
Created on 2019-03-22 by the reprex package (v0.2.1)
I'm looking for a short and clear solution that doesn't involve a myriad of helper tables and loops. The final value in the group column is not significant; I only care about the start and end columns of the result.
We can use dplyr and start a new group each time the previous row's merge_with_next value is FALSE, then take the first value of group and start and the last value of end within each group.
library(dplyr)
df %>%
group_by(temp = cumsum(!lag(merge_with_next, default = TRUE))) %>%
summarise(group = first(group),
start = first(start),
end = last(end)) %>%
ungroup() %>%
select(-temp)
# group start end
# <int> <dttm> <dttm>
#1 1 2019-01-01 00:00:00 2019-01-03 23:59:00
#2 4 2019-01-04 00:00:00 2019-01-06 23:59:00
#3 7 2019-01-07 00:00:00 2019-01-09 23:59:00
#4 10 2019-01-10 00:00:00 2019-01-12 23:59:00
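To see why this works, it helps to look at the grouping index on its own: !lag(merge_with_next, default = TRUE) is TRUE exactly on rows whose previous row closes a merge run, so the cumulative sum ticks up at the start of each new run:
df %>%
  mutate(temp = cumsum(!lag(merge_with_next, default = TRUE))) %>%
  select(group, merge_with_next, temp)
#> temp comes out as 0 0 0 1 1 1 2 2 2 3 3 3, so rows 1-3, 4-6, 7-9 and
#> 10-12 collapse into the four desired groups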
I have a date-time column with non-consecutive date-times (all on the hour), like this:
dat <- data.frame(dt = as.POSIXct(c("2018-01-01 12:00:00",
"2018-01-13 01:00:00",
"2018-02-01 11:00:00")))
# Output:
# dt
#1 2018-01-01 12:00:00
#2 2018-01-13 01:00:00
#3 2018-02-01 11:00:00
I'd like to expand the rows of column dt so that every hour between the minimum and maximum date-times is present, looking like:
# Desired output:
# dt
#1 2018-01-01 12:00:00
#2 2018-01-01 13:00:00
#3 2018-01-01 14:00:00
#4 .
#5 .
And so on. tidyverse-based solutions are preferred.
@DavidArenburg's comment is the way to go for a vector. However, if you want to expand dt inside a data frame with other columns that you would like to keep, you might be interested in tidyr::complete combined with tidyr::full_seq:
dat <- data.frame(dt = as.POSIXct(c("2018-01-01 12:00:00",
"2018-01-13 01:00:00",
"2018-02-01 11:00:00")))
dat$a <- letters[1:3]
dat
#> dt a
#> 1 2018-01-01 12:00:00 a
#> 2 2018-01-13 01:00:00 b
#> 3 2018-02-01 11:00:00 c
library(tidyr)
res <- complete(dat, dt = full_seq(dt, 60 * 60)) # period of one hour, in seconds
print(res, n = 5)
#> # A tibble: 744 x 2
#> dt a
#> <dttm> <chr>
#> 1 2018-01-01 12:00:00 a
#> 2 2018-01-01 13:00:00 <NA>
#> 3 2018-01-01 14:00:00 <NA>
#> 4 2018-01-01 15:00:00 <NA>
#> 5 2018-01-01 16:00:00 <NA>
#> # ... with 739 more rows
Created on 2018-03-12 by the reprex package (v0.2.0).
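For completeness, the vector-only approach from that comment is not quoted above; it is presumably a plain seq() call along these lines (an assumed reconstruction, not the comment's exact code):
# hourly sequence spanning the range of dt (base R)
seq(min(dat$dt), max(dat$dt), by = "hour")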
I am working with electronic health records data and would like to create an indicator variable called "episode" that joins antibiotic medications that occur within 7 days of each other. Below is a mock dataset and the output that I would like. I program in R.
df2=data.frame(
id = c(01,01,01,01,01,02,02,03,04),
date = c("2015-01-01 11:00",
"2015-01-06 13:29",
"2015-01-10 12:46",
"2015-01-25 14:45",
"2015-02-15 13:30",
"2015-01-01 10:00",
"2015-05-05 15:20",
"2015-01-01 15:19",
"2015-08-01 13:15"),
abx = c("AMPICILLIN",
"ERYTHROMYCIN",
"NEOMYCIN",
"AMPICILLIN",
"VANCOMYCIN",
"VANCOMYCIN",
"NEOMYCIN",
"PENICILLIN",
"ERYTHROMYCIN"));
df2
Output desired
id date abx episode
1 2015-01-01 11:00 AMPICILLIN 1
1 2015-01-06 13:29 ERYTHROMYCIN 1
1 2015-01-10 12:46 NEOMYCIN 1
1 2015-01-25 14:45 AMPICILLIN 2
1 2015-02-15 13:30 VANCOMYCIN 3
2 2015-01-01 10:00 VANCOMYCIN 1
2 2015-05-05 15:20 NEOMYCIN 2
3 2015-01-01 15:19 PENICILLIN 1
4 2015-08-01 13:15 ERYTHROMYCIN 1
Use ave like this:
grpno <- function(x) cumsum(c(TRUE, diff(x) >= 7)) # new episode when the gap to the previous date is 7+ days
transform(df2, episode = ave(as.numeric(as.Date(date)), id, FUN = grpno))
giving:
id date abx episode
1 1 2015-01-01 11:00 AMPICILLIN 1
2 1 2015-01-06 13:29 ERYTHROMYCIN 1
3 1 2015-01-10 12:46 NEOMYCIN 1
4 1 2015-01-25 14:45 AMPICILLIN 2
5 1 2015-02-15 13:30 VANCOMYCIN 3
6 2 2015-01-01 10:00 VANCOMYCIN 1
7 2 2015-05-05 15:20 NEOMYCIN 2
8 3 2015-01-01 15:19 PENICILLIN 1
9 4 2015-08-01 13:15 ERYTHROMYCIN 1
or with dplyr and grpno from above:
df2 %>%
group_by(id) %>%
mutate(episode = date %>% as.Date %>% as.numeric %>% grpno) %>%
ungroup
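To make the episode rule concrete, here is grpno() applied by hand to id 1's dates; a gap of 7 or more days starts a new episode:
x <- as.numeric(as.Date(c("2015-01-01", "2015-01-06", "2015-01-10",
                          "2015-01-25", "2015-02-15")))
diff(x)  # day gaps: 5 4 15 21
grpno(x) # episodes: 1 1 1 2 3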