I'm having some trouble with logic. I need to produce df$val_most_recent. If there's a value for both a_val and b_val, val_most_recent should be the value with the most recent time (a_val corresponds to a_dtm, b_val corresponds to b_dtm). If the times are identical, I'd like a_val to be val_most_recent. If just one value is reported for the two (with the other being NA), it should simply be that one.
library(tidyverse)
library(lubridate)
location <- c("a", "b", "c", "d")
a_dtm <- ymd_hm(c(NA, "2019-06-05 10:30", "2019-06-05 10:45", "2019-06-05 10:50"))
b_dtm <- ymd_hm(c("2019-06-05 10:30", NA, "2019-06-05 10:48", "2019-06-05 10:50"))
a_val <- c(NA, 6, 4, 2)
b_val <- c(5, NA, 3, 2)
df <- data.frame(location, a_dtm, b_dtm, a_val, b_val)
as_tibble(df)
# A tibble: 4 x 5
#location a_dtm b_dtm a_val b_val
#<fct> <dttm> <dttm> <dbl> <dbl>
#1 a NA 2019-06-05 10:30:00 NA 5
#2 b 2019-06-05 10:30:00 NA 6 NA
#3 c 2019-06-05 10:45:00 2019-06-05 10:48:00 4 3
#4 d 2019-06-05 10:50:00 2019-06-05 10:50:00 2 2
val_most_recent <- c(5,6,3,2)
desired_df <- cbind(df, val_most_recent)
as_tibble(desired_df)
#location a_dtm b_dtm a_val b_val val_most_recent
#<fct> <dttm> <dttm> <dbl> <dbl> <dbl>
#1 a NA 2019-06-05 10:30:00 NA 5 5
#2 b 2019-06-05 10:30:00 NA 6 NA 6
#3 c 2019-06-05 10:45:00 2019-06-05 10:48:00 4 3 3
#4 d 2019-06-05 10:50:00 2019-06-05 10:50:00 2 2 2
Here is the logic from your text coded into a case_when statement (the >= comparison makes a_val win ties):
df %>%
  mutate(
    val_most_recent = case_when(
      is.na(a_val) | is.na(b_val) ~ coalesce(a_val, b_val),
      a_dtm >= b_dtm ~ a_val,
      TRUE ~ b_val
    )
  )
# location a_dtm b_dtm a_val b_val val_most_recent
# 1 a <NA> 2019-06-05 10:30:00 NA 5 5
# 2 b 2019-06-05 10:30:00 <NA> 6 NA 6
# 3 c 2019-06-05 10:45:00 2019-06-05 10:48:00 4 3 3
# 4 d 2019-06-05 10:50:00 2019-06-05 10:50:00 2 2 2
Here is one option in base R: convert the dates to numeric, replace the NAs with 0, get the column index of the max value in each row, cbind that with the row index, and extract the corresponding values from the 'a_val'/'b_val' columns.
# Convert the two date-time columns to a numeric matrix
m1 <- sapply(df[2:3], as.numeric)
# Replace NAs with 0, find which column holds the later time in each row
# ("first" breaks ties in favour of a_val), then matrix-index the value columns
df$val_most_recent <- df[4:5][cbind(seq_len(nrow(m1)),
                                    max.col(replace(m1, is.na(m1), 0), "first"))]
df$val_most_recent
#[1] 5 6 3 2
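For anyone unfamiliar with the final step: indexing a data frame with a two-column matrix of (row, column) positions returns one value per row, which is what lets a_val or b_val be picked row by row. A tiny illustration with made-up numbers:
m <- data.frame(x = c(10, 20), y = c(30, 40))
m[cbind(1:2, c(2, 1))] # row 1 from column 2, row 2 from column 1
#[1] 30 20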
I have data in long format, where I need to create a variable containing the last date for each id.
Example data
id <- c(1, 1, 1, 2, 2, 2, 2, 3, 3, 3)
event_date <- c("2000.01.11",
"2000.02.11",
"2000.03.08", #id_max for id 1
"2018.06.15",
"2018.07.07",
"2018.08.10", #id_max for id 1
"2018.07.15",
"2020.01.19",
"2020.02.19", #id_max for id 1
"2020.01.15")
df <- tibble::tibble(id, event_date)
# This is what I want:
df$id_max <- c("2000.03.08","2000.03.08","2000.03.08",
"2018.08.10","2018.08.10","2018.08.10","2018.08.10",
"2020.02.19","2020.02.19","2020.02.19")
Thanks in advance
Convert event_date to a Date column and get the max date for each id.
library(dplyr)
df %>%
  mutate(event_date = as.Date(event_date, '%Y.%m.%d')) %>%
  group_by(id) %>%
  mutate(id_max = max(event_date))
# id event_date id_max
# <dbl> <date> <date>
# 1 1 2000-01-11 2000-03-08
# 2 1 2000-02-11 2000-03-08
# 3 1 2000-03-08 2000-03-08
# 4 2 2018-06-15 2018-08-10
# 5 2 2018-07-07 2018-08-10
# 6 2 2018-08-10 2018-08-10
# 7 2 2018-07-15 2018-08-10
# 8 3 2020-01-19 2020-02-19
# 9 3 2020-02-19 2020-02-19
#10 3 2020-01-15 2020-02-19
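If you want a single row per id instead of repeating id_max on every row, the same pipeline works with summarise in place of the final mutate:
df %>%
  mutate(event_date = as.Date(event_date, '%Y.%m.%d')) %>%
  group_by(id) %>%
  summarise(id_max = max(event_date))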
I don't use dplyr or tibble much; I prefer data.table. But here is a base R solution (reusing the data above).
# Get the max date for each id (the "YYYY.MM.DD" strings sort correctly as text)
md <- aggregate(df$event_date, by = list(id), max)
####OPTIONAL####
# Make the date strings parseable by as.Date() by substituting '.' with '-'
df["good_date"] <- gsub("\\.", "-", event_date)
################
# Finally, create the column with the max date
merge(df, md, by.x = 1, by.y = 1)
# id event_date x
#1 1 2000.01.11 2000.03.08
#2 1 2000.02.11 2000.03.08
#3 1 2000.03.08 2000.03.08
#4 2 2018.06.15 2018.08.10
#5 2 2018.07.07 2018.08.10
#6 2 2018.08.10 2018.08.10
#7 2 2018.07.15 2018.08.10
#8 3 2020.01.19 2020.02.19
#9 3 2020.02.19 2020.02.19
#10 3 2020.01.15 2020.02.19
# Or, if you created the good_date column above
md <- aggregate(df$good_date, by = list(id), max)
merge(df, md, by.x = 1, by.y = 1)
# id event_date good_date x
#1 1 2000.01.11 2000-01-11 2000-03-08
#2 1 2000.02.11 2000-02-11 2000-03-08
#3 1 2000.03.08 2000-03-08 2000-03-08
#4 2 2018.06.15 2018-06-15 2018-08-10
#5 2 2018.07.07 2018-07-07 2018-08-10
#6 2 2018.08.10 2018-08-10 2018-08-10
#7 2 2018.07.15 2018-07-15 2018-08-10
#8 3 2020.01.19 2020-01-19 2020-02-19
#9 3 2020.02.19 2020-02-19 2020-02-19
#10 3 2020.01.15 2020-01-15 2020-02-19
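Since the answer above mentions preferring data.table, here is a sketch of the same grouped max in that package (it converts df and adds the column by reference):
library(data.table)
setDT(df)[, id_max := max(as.Date(event_date, '%Y.%m.%d')), by = id]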
I have two independent datasets; one contains an event date. Each ID has only one "Eventdate". As follows:
data1 <- data.frame("ID" = c(1,2,3,4,5,6), "Eventdate" = c("2019-01-01", "2019-02-01", "2019-03-01", "2019-04-01", "2019-05-01", "2019-06-01"))
data1
ID Eventdate
1 1 2019-01-01
2 2 2019-02-01
3 3 2019-03-01
4 4 2019-04-01
5 5 2019-05-01
6 6 2019-06-01
In the other dataset, one ID can have multiple event names (Eventcode), each with its own event date (Eventdate). As follows:
data2 <- data.frame("ID" = c(1,1,2,3,3,3,4,4,7), "Eventcode"=c(201,202,201,204,205,206,209,208,203),"Eventdate" = c("2019-01-01", "2019-01-01", "2019-02-11", "2019-02-15", "2019-03-01", "2019-03-15", "2019-03-10", "2019-03-20", "2019-06-02"))
data2
ID Eventcode Eventdate
1 1 201 2019-01-01
2 1 202 2019-01-01
3 2 201 2019-02-11
4 3 204 2019-02-15
5 3 205 2019-03-01
6 3 206 2019-03-15
7 4 209 2019-03-10
8 4 208 2019-03-20
9 7 203 2019-06-02
The two datasets are linked by ID, but the IDs in the two datasets do not fully overlap.
I would like to select cases in data2 with these conditions:
Match by ID
Eventdate in data2 >= Eventdate in data1.
If one ID has multiple Eventdates in data2, select the earliest one.
If one ID has multiple Eventcodes at one Eventdate in data2, just randomly select one.
Then merge the selected data2 into data1.
Expected results as follows:
data1
ID Eventdate Eventdate.data2 Eventcode
1 1 2019-01-01 2019-01-01 201
2 2 2019-02-01 2019-02-11 201
3 3 2019-03-01 2019-03-01 205
4 4 2019-04-01
5 5 2019-05-01
6 6 2019-06-01
or
data1
ID Eventdate Eventdate.data2 Eventcode
1 1 2019-01-01 2019-01-01 202
2 2 2019-02-01 2019-02-11 201
3 3 2019-03-01 2019-03-01 205
4 4 2019-04-01
5 5 2019-05-01
6 6 2019-06-01
Thank you very very much!
You can try this approach:
library(dplyr)
left_join(data1, data2, by = 'ID') %>%
  group_by(ID, Eventdate.x) %>%
  summarise(Eventdate = Eventdate.y[Eventdate.y >= Eventdate.x][1],
            Eventcode = {
              inds <- Eventdate.y >= Eventdate.x
              val <- sum(inds, na.rm = TRUE)
              if (val == 1) Eventcode[inds]
              else if (val > 1) sample(Eventcode[inds], 1)
              else NA_real_
            })
# ID Eventdate.x Eventdate Eventcode
# <dbl> <chr> <chr> <dbl>
#1 1 2019-01-01 2019-01-01 201
#2 2 2019-02-01 2019-02-11 201
#3 3 2019-03-01 2019-03-01 205
#4 4 2019-04-01 NA NA
#5 5 2019-05-01 NA NA
#6 6 2019-06-01 NA NA
The complicated logic for Eventcode is there for the randomness; if you are OK selecting the first value, as with Eventdate, you can simplify it to:
left_join(data1, data2, by = 'ID') %>%
  group_by(ID, Eventdate.x) %>%
  summarise(Eventdate = Eventdate.y[Eventdate.y >= Eventdate.x][1],
            Eventcode = Eventcode[Eventdate.y >= Eventdate.x][1])
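Note that both versions take the first qualifying row, which assumes data2 is sorted by Eventdate within each ID (true for the example data). If that is not guaranteed, sorting first keeps the same logic; a sketch:
left_join(data1, data2, by = 'ID') %>%
  arrange(ID, Eventdate.y) %>%
  group_by(ID, Eventdate.x) %>%
  summarise(Eventdate = Eventdate.y[Eventdate.y >= Eventdate.x][1],
            Eventcode = Eventcode[Eventdate.y >= Eventdate.x][1])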
Does this work:
library(dplyr)
data1 %>%
  rename(Eventdate_dat1 = Eventdate) %>%
  left_join(data2, by = 'ID') %>%
  group_by(ID) %>%
  filter(Eventdate >= Eventdate_dat1) %>%
  mutate(Eventdate = case_when(length(unique(Eventdate)) > 1 ~ min(Eventdate),
                               TRUE ~ Eventdate),
         Eventcode = case_when(length(unique(Eventcode)) > 1 ~ min(Eventcode),
                               TRUE ~ Eventcode)) %>%
  distinct() %>%
  right_join(data1, by = 'ID') %>%
  select(ID, 'Eventdate' = Eventdate.y, 'Eventdate.data2' = Eventdate.x, Eventcode)
# A tibble: 6 x 4
# Groups: ID [6]
ID Eventdate Eventdate.data2 Eventcode
<dbl> <chr> <chr> <dbl>
1 1 2019-01-01 2019-01-01 201
2 2 2019-02-01 2019-02-11 201
3 3 2019-03-01 2019-03-01 205
4 4 2019-04-01 NA NA
5 5 2019-05-01 NA NA
6 6 2019-06-01 NA NA
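A more literal reading of "earliest date, then a random code among ties" is also possible with slice_sample() (a sketch, assuming dplyr >= 1.0):
data2 %>%
  inner_join(data1, by = 'ID', suffix = c('.data2', '.data1')) %>%
  filter(Eventdate.data2 >= Eventdate.data1) %>%
  group_by(ID) %>%
  filter(Eventdate.data2 == min(Eventdate.data2)) %>%
  slice_sample(n = 1) %>% # random pick among codes tied on the earliest date
  ungroup() %>%
  select(ID, Eventdate.data2, Eventcode) %>%
  right_join(data1, by = 'ID') %>%
  select(ID, Eventdate, Eventdate.data2, Eventcode)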
Here is my toy dataset:
df <- tibble::tribble(
~date, ~value,
"2007-01-31", 25,
"2007-05-31", 31,
"2007-12-31", 26
)
I am creating a month-end date series using the following code.
df %>%
  mutate(date = as.Date(date)) %>%
  complete(date = seq(as.Date("2007-01-31"), as.Date("2019-12-31"), by = "month"))
However, I am not getting the correct month-end dates.
date value
<date> <dbl>
1 2007-01-31 25
2 2007-03-03 NA
3 2007-03-31 NA
4 2007-05-01 NA
5 2007-05-31 31
6 2007-07-01 NA
7 2007-07-31 NA
8 2007-08-31 NA
9 2007-10-01 NA
10 2007-10-31 NA
11 2007-12-01 NA
12 2007-12-31 26
What am I missing here? I am okay using other functions from any other package.
You don't need the complete function just to generate the dates; the sequence can be built in base R.
Since the last day of the month differs from month to month, we can create a sequence of month-start dates and subtract one day.
seq(as.Date("2007-02-01"), as.Date("2008-01-01"), by="month") - 1
#[1] "2007-01-31" "2007-02-28" "2007-03-31" "2007-04-30" "2007-05-31" "2007-06-30"
# "2007-07-31" "2007-08-31" "2007-09-30" "2007-10-31" "2007-11-30" "2007-12-31"
Using the same logic inside complete() on the updated dataframe, we can do:
library(dplyr)
df %>%
mutate(date = as.Date(date)) %>%
tidyr::complete(date = seq(min(date) + 1, max(date) + 1, by="month") - 1)
# date value
# <date> <dbl>
# 1 2007-01-31 25
# 2 2007-02-28 NA
# 3 2007-03-31 NA
# 4 2007-04-30 NA
# 5 2007-05-31 31
# 6 2007-06-30 NA
# 7 2007-07-31 NA
# 8 2007-08-31 NA
# 9 2007-09-30 NA
#10 2007-10-31 NA
#11 2007-11-30 NA
#12 2007-12-31 26
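If you would rather compute the month ends explicitly, lubridate (used elsewhere on this page) can round each month start up to the next boundary; a sketch of the same completion:
library(lubridate)
df %>%
  mutate(date = as.Date(date)) %>%
  tidyr::complete(date = ceiling_date(seq(floor_date(min(date), 'month'),
                                          max(date), by = 'month'),
                                      'month', change_on_boundary = TRUE) - 1)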
I've seen lots of questions like this but can't figure this simple problem out. I don't want to collapse the dataset. Say I have this dataset:
library(tidyverse)
library(lubridate)
df <- data.frame(group = c("a", "a", "a", "a", "a", "b", "b", "b"),
starts = c("2011-09-18", NA, "2014-08-08", "2016-09-18", NA, "2013-08-08", "2015-08-08", NA),
ends = c(NA, "2013-03-06", "2015-08-08", NA, "2017-03-06", "2014-08-08", NA, "2016-08-08"))
df$starts <- parse_date_time(df$starts, "ymd")
df$ends <- parse_date_time(df$ends, "ymd")
df
group starts ends
1 a 2011-09-18 <NA>
2 a <NA> 2013-03-06
3 a 2014-08-08 2015-08-08
4 a 2016-09-18 <NA>
5 a <NA> 2017-03-06
6 b 2013-08-08 2014-08-08
7 b 2015-08-08 <NA>
8 b <NA> 2016-08-08
Desired output is:
group starts ends epi
1 a 2011-09-18 <NA> 1
2 a <NA> 2013-03-06 1
3 a 2014-08-08 2015-08-08 2
4 a 2016-09-18 <NA> 3
5 a <NA> 2017-03-06 3
6 b 2013-08-08 2014-08-08 1
7 b 2015-08-08 <NA> 2
8 b <NA> 2016-08-08 2
I was thinking something like this, but it obviously doesn't account for episodes where there is no NA:
df <- df %>%
group_by(group) %>%
mutate(epi = cumsum(is.na(ends)))
df
I'm not sure how to combine cumsum(is.na(...)) with an if_else condition. Maybe I'm going about it the wrong way?
Any suggestions would be great.
A solution using dplyr, assuming your data frame is well structured, i.e., each episode begins with a non-NA start and each start has an associated end record.
df2 <- df %>%
group_by(group) %>%
mutate(epi = cumsum(!is.na(starts))) %>%
ungroup()
df2
# # A tibble: 8 x 4
# group starts ends epi
# <fct> <dttm> <dttm> <int>
# 1 a 2011-09-18 00:00:00 NA 1
# 2 a NA 2013-03-06 00:00:00 1
# 3 a 2014-08-08 00:00:00 2015-08-08 00:00:00 2
# 4 a 2016-09-18 00:00:00 NA 3
# 5 a NA 2017-03-06 00:00:00 3
# 6 b 2013-08-08 00:00:00 2014-08-08 00:00:00 1
# 7 b 2015-08-08 00:00:00 NA 2
# 8 b NA 2016-08-08 00:00:00 2
Another option is to count the NA elements per row across the 'starts' and 'ends' columns with rowSums, then, grouped by 'group', take the rleid of that count to create 'epi':
library(dplyr)
library(data.table)
df %>%
  mutate(epi = rowSums(is.na(.[c("starts", "ends")]))) %>%
  group_by(group) %>%
  mutate(epi = rleid(epi))
# A tibble: 8 x 4
# Groups: group [2]
# group starts ends epi
# <fct> <dttm> <dttm> <int>
#1 a 2011-09-18 00:00:00 NA 1
#2 a NA 2013-03-06 00:00:00 1
#3 a 2014-08-08 00:00:00 2015-08-08 00:00:00 2
#4 a 2016-09-18 00:00:00 NA 3
#5 a NA 2017-03-06 00:00:00 3
#6 b 2013-08-08 00:00:00 2014-08-08 00:00:00 1
#7 b 2015-08-08 00:00:00 NA 2
#8 b NA 2016-08-08 00:00:00 2
If there are only two columns to consider
df %>%
group_by(group) %>%
mutate(epi = rleid(is.na(starts) + is.na(ends)))
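For completeness, the grouped cumulative-sum idea from the first answer can also be written in base R with ave():
df$epi <- ave(!is.na(df$starts), df$group, FUN = cumsum)
df$epi
#[1] 1 1 2 3 3 1 2 2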
I have a date-time column with non-consecutive date-times (all on the hour), like this:
dat <- data.frame(dt = as.POSIXct(c("2018-01-01 12:00:00",
"2018-01-13 01:00:00",
"2018-02-01 11:00:00")))
# Output:
# dt
#1 2018-01-01 12:00:00
#2 2018-01-13 01:00:00
#3 2018-02-01 11:00:00
I'd like to expand the rows of column dt so that every hour in between the very minimum and maximum date-times is present, looking like:
# Desired output:
# dt
#1 2018-01-01 12:00:00
#2 2018-01-01 13:00:00
#3 2018-01-01 14:00:00
#4 .
#5 .
And so on. tidyverse-based solutions are most preferred.
@DavidArenburg's comment is the way to go for a vector. However, if you want to expand dt inside a data frame with other columns that you would like to keep, you might be interested in tidyr::complete combined with tidyr::full_seq:
dat <- data.frame(dt = as.POSIXct(c("2018-01-01 12:00:00",
"2018-01-13 01:00:00",
"2018-02-01 11:00:00")))
dat$a <- letters[1:3]
dat
#> dt a
#> 1 2018-01-01 12:00:00 a
#> 2 2018-01-13 01:00:00 b
#> 3 2018-02-01 11:00:00 c
library(tidyr)
res <- complete(dat, dt = full_seq(dt, 60 ** 2))
print(res, n = 5)
#> # A tibble: 744 x 2
#> dt a
#> <dttm> <chr>
#> 1 2018-01-01 12:00:00 a
#> 2 2018-01-01 13:00:00 <NA>
#> 3 2018-01-01 14:00:00 <NA>
#> 4 2018-01-01 15:00:00 <NA>
#> 5 2018-01-01 16:00:00 <NA>
#> # ... with 739 more rows
Created on 2018-03-12 by the reprex package (v0.2.0).
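For a bare vector of hours between the endpoints (the comment referenced above), base seq() on the POSIXct range is presumably all that is needed:
seq(min(dat$dt), max(dat$dt), by = "hour")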