How to combine two data.tables based on multiple criteria in R?

I have two data.tables, which I want to combine based on whether a date in one table falls within a given time range in the other table. In dt1 I have exit dates, and I want to check in dt2 which values were valid at the exit date for each ID.
dt1 <- data.table(ID = 1:10,
                  exit = c("31/12/2010", "01/01/2021", "30/09/2010", "31/12/2015", "30/09/2010",
                           "31/10/2018", "01/02/2016", "01/05/2015", "01/09/2013", "01/01/2016"))
dt2 <- data.table(ID = c(1,2,2,2,3,5,6,6,7,8,8,9,10),
                  valid_from = c("01/01/2010", "01/01/2012", "01/01/2013", "01/12/2017", "01/05/2010",
                                 "01/04/2010", "01/05/2014", "01/11/2016", "01/01/2016", "15/04/2013",
                                 "01/01/2015", "15/02/2010", "01/04/2012"),
                  valid_until = c("01/01/2021", "31/12/2012", "30/11/2017", "01/01/2021", "01/01/2021",
                                  "01/01/2021", "31/10/2016", "01/01/2021", "01/01/2021", "31/12/2014",
                                  "01/05/2015", "01/01/2013", "01/01/2021"),
                  text1 = c("a", "a", "b", "c", "b", "b", "c", "a", "a", "b", "a", "c", "a"),
                  text2 = c("I", "I", "II", "I", "III", "I", "II", "III", "I", "II", "II", "I", "III"))
ID exit
1: 1 31/12/2010
2: 2 01/01/2021
3: 3 30/09/2010
4: 4 31/12/2015
5: 5 30/09/2010
6: 6 31/10/2018
7: 7 01/02/2016
8: 8 01/05/2015
9: 9 01/09/2013
10: 10 01/01/2016
ID valid_from valid_until text1 text2
1: 1 01/01/2010 01/01/2021 a I
2: 2 01/01/2012 31/12/2012 a I
3: 2 01/01/2013 30/11/2017 b II
4: 2 01/12/2017 01/01/2021 c I
5: 3 01/05/2010 01/01/2021 b III
6: 5 01/04/2010 01/01/2021 b I
7: 6 01/05/2014 31/10/2016 c II
8: 6 01/11/2016 01/01/2021 a III
9: 7 01/01/2016 01/01/2021 a I
10: 8 15/04/2013 31/12/2014 b II
11: 8 01/01/2015 01/05/2015 a II
12: 9 15/02/2010 01/01/2013 c I
13: 10 01/04/2012 01/01/2021 a III
As a result I would like to return in dt1 the values that were valid at each exit date.
If an ID is not found in dt2 (which is the case for ID 4 in the sample data), it should return NA.
ID exit text1 text2
1: 1 31/12/2010 a I
2: 2 01/01/2021 c I
3: 3 30/09/2010 b III
4: 4 31/12/2015 <NA> <NA>
5: 5 30/09/2010 b I
6: 6 31/10/2018 a III
7: 7 01/02/2016 a I
8: 8 01/05/2015 a II
9: 9 01/09/2013 c I
10: 10 01/01/2016 a III
Could anyone help me solve this?

As the input is a data.table, consider using data.table methods, which are fast:
library(data.table)
# convert the date columns to Date class
dt1[, exit := as.IDate(exit, '%d/%m/%Y')]
dt2[, c('valid_from', 'valid_until') := .(as.IDate(valid_from, '%d/%m/%Y'),
                                          as.IDate(valid_until, '%d/%m/%Y'))]
# do a non-equi update join, assigning text1/text2 into dt1 by reference
dt1[dt2, c('text1', 'text2') := .(i.text1, i.text2),
    on = .(ID, exit >= valid_from, exit <= valid_until)]
Output:
> dt1
ID exit text1 text2
1: 1 2010-12-31 a I
2: 2 2021-01-01 c I
3: 3 2010-09-30 b III
4: 4 2015-12-31 <NA> <NA>
5: 5 2010-09-30 b I
6: 6 2018-10-31 a III
7: 7 2016-02-01 a I
8: 8 2015-05-01 a II
9: 9 2013-09-01 <NA> <NA>
10: 10 2016-01-01 a III
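If you prefer not to modify dt1 by reference, the same result can come from a non-equi right join that builds a new table. This is only a sketch under the same setup (dates already converted as above); res is just an illustrative name:
# non-equi right join returning a new table; nomatch defaults to NA,
# so ID 4 gets <NA> in text1/text2
res <- dt2[dt1,
           .(ID = i.ID, exit = i.exit, text1 = x.text1, text2 = x.text2),
           on = .(ID, valid_from <= exit, valid_until >= exit)]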

Here is a dplyr solution, created with the help of @akrun (see dates: Not yet implemented NAbounds=TRUE for this non-numeric and non-character type).
library(dplyr)
library(lubridate)
df1 <- left_join(dt1, dt2, by = "ID") %>%
  mutate(across(c(exit, valid_from, valid_until), dmy)) %>%
  rowwise() %>%
  mutate(match = +(dplyr::between(exit, valid_from, valid_until))) %>%
  group_by(ID) %>%
  filter(match == max(match) | is.na(match)) %>%
  select(ID, exit, text1, text2) %>%
  ungroup()
Output:
ID exit text1 text2
<dbl> <date> <chr> <chr>
1 1 2010-12-31 a I
2 2 2021-01-01 c I
3 3 2010-09-30 b III
4 4 2015-12-31 NA NA
5 5 2010-09-30 b I
6 6 2018-10-31 a III
7 7 2016-02-01 a I
8 8 2015-05-01 a II
9 9 2013-09-01 c I
10 10 2016-01-01 a III
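A side note, assuming a reasonably recent dplyr (>= 1.1.0): the non-equi condition can be expressed directly with join_by(), which avoids the rowwise()/filter() step entirely. A sketch, assuming exit, valid_from, and valid_until have already been converted to Date class (e.g. with lubridate::dmy()):
left_join(dt1, dt2,
          by = join_by(ID, exit >= valid_from, exit <= valid_until)) %>%
  select(ID, exit, text1, text2)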

You may use fuzzyjoin after changing the dates to Date class.
library(fuzzyjoin)
library(dplyr)
dt1 %>%
  mutate(exit = as.Date(exit, '%d/%m/%Y')) %>%
  fuzzy_left_join(dt2 %>%
                    mutate(across(starts_with('valid'), as.Date, '%d/%m/%Y')),
                  by = c('ID', 'exit' = 'valid_from', 'exit' = 'valid_until'),
                  match_fun = c(`==`, `>=`, `<=`)) %>%
  select(ID = ID.x, exit, text1, text2)
# ID exit text1 text2
#1 1 2010-12-31 a I
#2 2 2021-01-01 c I
#3 3 2010-09-30 b III
#4 4 2015-12-31 <NA> <NA>
#5 5 2010-09-30 b I
#6 6 2018-10-31 a III
#7 7 2016-02-01 a I
#8 8 2015-05-01 a II
#9 9 2013-09-01 <NA> <NA>
#10 10 2016-01-01 a III

Related

Create a summarizing variable for multiple columns in data.table in R

I have the following data.table
dt <- data.table(id = c(1,2,2,2,3,3,4),
                 date = c("2019-09-13", "2018-12-06", "2017-12-14", "2018-02-08",
                          "2015-12-06", "2012-12-14", "2011-02-08"),
                 variable_1 = c("a","b",NA,NA,"b","c",NA),
                 variable_2 = c(NA,NA,"a",NA,"a","c",NA),
                 variable_3 = c(NA,NA,NA,"b","c","c",NA))
dt
id date variable_1 variable_2 variable_3
1: 1 2019-09-13 a <NA> <NA>
2: 2 2018-12-06 b <NA> <NA>
3: 2 2017-12-14 <NA> a <NA>
4: 2 2018-02-08 <NA> <NA> b
5: 3 2015-12-06 b a c
6: 3 2012-12-14 c c c
7: 4 2011-02-08 <NA> <NA> <NA>
I want to create a variable y that summarizes all the columns: every row with at least one non-NA value among the variables should get 0, and every row where all the variables are NA should get 1. Like this:
id date variable_1 variable_2 variable_3 y
1: 1 2019-09-13 a <NA> <NA> 0
2: 2 2018-12-06 b <NA> <NA> 0
3: 2 2017-12-14 <NA> a <NA> 0
4: 2 2018-02-08 <NA> <NA> b 0
5: 3 2015-12-06 b a c 0
6: 3 2012-12-14 c c c 0
7: 4 2011-02-08 <NA> <NA> <NA> 1
In the original data.table I have 22 variables of interest (variable_1 to variable_22) among 830 total variables, so I would prefer not to address each variable separately.
Is there a way to do this in data.table?
dt[, y := +(rowSums(!is.na(.SD)) == 0L), .SDcols = patterns("^variable_")]
# id date variable_1 variable_2 variable_3 y
# 1: 1 2019-09-13 a <NA> <NA> 0
# 2: 2 2018-12-06 b <NA> <NA> 0
# 3: 2 2017-12-14 <NA> a <NA> 0
# 4: 2 2018-02-08 <NA> <NA> b 0
# 5: 3 2015-12-06 b a c 0
# 6: 3 2012-12-14 c c c 0
# 7: 4 2011-02-08 <NA> <NA> <NA> 1
Walk-through:
.SDcols = patterns(...) defines the columns to be processed as .SD in the j component. This doesn't remove or select columns for the output, it just determines which ones are referenced internally.
!is.na(.SD) returns a logical matrix with the same dims as .SD, indicating for each value whether it is non-NA.
rowSums(...) returns the count of non-NA values in each row.
By using the inverted logic of "count the number of non-NA values in a row", we don't need to know how many columns are being processed; this is what allows the comparison == 0L.
+(...) is a shorthand trick for converting logical to 0/1.
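For contrast, a sketch of the non-inverted approach, which has to know how many columns are being checked; that count is exactly the bookkeeping the == 0L version avoids:
vars <- grep("^variable_", names(dt), value = TRUE)
dt[, y := +(rowSums(is.na(.SD)) == length(vars)), .SDcols = vars]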

How to calculate date difference with R [duplicate]

This question already has answers here:
subtract value from previous row by group
(3 answers)
Closed 2 years ago.
I would like to calculate the date difference between each entry. The data looks like this
> dt <- data.table(id = c("A", "A", "A", "B", "B", "B", "C", "C", "C"), date = as.Date(c("2017-01-01", "2017-02-01", "2017-05-01", "2017-01-01", "2017-05-01", "2017-10-01", "2017-01-01", "2017-02-01", "2017-02-15")))
> dt
id date
1: A 2017-01-01
2: A 2017-02-01
3: A 2017-05-01
4: B 2017-01-01
5: B 2017-05-01
6: B 2017-10-01
7: C 2017-01-01
8: C 2017-02-01
9: C 2017-02-15
What I want is a variable Diff holding the date difference between consecutive entries within each id. How should I construct it?
Update:
I tried to solve this by using following codes:
> dt <- data.table(id = c("A", "A", "A", "B", "B", "B", "C", "C", "C"), date = as.Date(c("2017-01-01", "2017-02-01", "2017-05-01", "2017-01-01", "2017-05-01", "2017-10-01", "2017-01-01", "2017-02-01", "2017-02-15")))
> dt %>%
+ group_by(id) %>%
+ mutate(diff = date - lag(date))
# A tibble: 9 x 3
# Groups: id [3]
id date diff
<chr> <date> <drtn>
1 A 2017-01-01 NA days
2 A 2017-02-01 31 days
3 A 2017-05-01 89 days
4 B 2017-01-01 -120 days
5 B 2017-05-01 120 days
6 B 2017-10-01 153 days
7 C 2017-01-01 -273 days
8 C 2017-02-01 31 days
9 C 2017-02-15 14 days
I am not sure what I did wrong. Any idea?
How about this:
dt$diff <- unlist(lapply(split(dt, dt$id), function(x) c(0, diff(x$date))))
Output:
> dt
id date diff
1: A 2017-01-01 0
2: A 2017-02-01 31
3: A 2017-05-01 89
4: B 2017-01-01 0
5: B 2017-05-01 120
6: B 2017-10-01 153
7: C 2017-01-01 0
8: C 2017-02-01 31
9: C 2017-02-15 14
Maybe not elegant but perhaps performant: use diff on the whole column and overwrite the values at the boundaries between ids.
dt$diff <- c(NA, diff(dt$date))
dt$diff[c(TRUE, dt$id[-1] != dt$id[-nrow(dt)])] <- NA
dt
# id date diff
#1 A 2017-01-01 NA
#2 A 2017-02-01 31
#3 A 2017-05-01 89
#4 B 2017-01-01 NA
#5 B 2017-05-01 120
#6 B 2017-10-01 153
#7 C 2017-01-01 NA
#8 C 2017-02-01 31
#9 C 2017-02-15 14
Tidyverse solution:
library(tidyverse)
dt %>%
group_by(id) %>%
mutate(diff = date - lag(date))
id date diff
<chr> <date> <drtn>
1 A 2017-01-01 NA days
2 A 2017-02-01 31 days
3 A 2017-05-01 89 days
4 B 2017-01-01 NA days
5 B 2017-05-01 120 days
6 B 2017-10-01 153 days
7 C 2017-01-01 NA days
8 C 2017-02-01 31 days
9 C 2017-02-15 14 days
An option with data.table
library(data.table)
setDT(dt)[, diff := date - shift(date), id]
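If a plain integer day count is preferred over a difftime, a small extension of the same idea:
setDT(dt)[, diff := as.integer(date - shift(date)), by = id]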

cumsum with NAs and another condition in R

I've seen lots of questions like this but can't figure this simple problem out. I don't want to collapse the dataset. Say I have this dataset:
library(tidyverse)
library(lubridate)
df <- data.frame(group = c("a", "a", "a", "a", "a", "b", "b", "b"),
                 starts = c("2011-09-18", NA, "2014-08-08", "2016-09-18", NA,
                            "2013-08-08", "2015-08-08", NA),
                 ends = c(NA, "2013-03-06", "2015-08-08", NA, "2017-03-06",
                          "2014-08-08", NA, "2016-08-08"))
df$starts <- parse_date_time(df$starts, "ymd")
df$ends <- parse_date_time(df$ends, "ymd")
df
group starts ends
1 a 2011-09-18 <NA>
2 a <NA> 2013-03-06
3 a 2014-08-08 2015-08-08
4 a 2016-09-18 <NA>
5 a <NA> 2017-03-06
6 b 2013-08-08 2014-08-08
7 b 2015-08-08 <NA>
8 b <NA> 2016-08-08
Desired output is:
group starts ends epi
1 a 2011-09-18 <NA> 1
2 a <NA> 2013-03-06 1
3 a 2014-08-08 2015-08-08 2
4 a 2016-09-18 <NA> 3
5 a <NA> 2017-03-06 3
6 b 2013-08-08 2014-08-08 1
7 b 2015-08-08 <NA> 2
8 b <NA> 2016-08-08 2
I was thinking of something like this, but it obviously doesn't account for episodes where there is no NA:
df <- df %>%
  group_by(group) %>%
  mutate(epi = cumsum(is.na(ends)))
df
I'm not sure how to combine cumsum(is.na(...)) with an if_else condition. Maybe I'm going about it the wrong way?
Any suggestions would be great.
A solution using dplyr, assuming your data frame is well structured so that each start always has an associated end record.
df2 <- df %>%
  group_by(group) %>%
  mutate(epi = cumsum(!is.na(starts))) %>%
  ungroup()
df2
df2
# # A tibble: 8 x 4
# group starts ends epi
# <fct> <dttm> <dttm> <int>
# 1 a 2011-09-18 00:00:00 NA 1
# 2 a NA 2013-03-06 00:00:00 1
# 3 a 2014-08-08 00:00:00 2015-08-08 00:00:00 2
# 4 a 2016-09-18 00:00:00 NA 3
# 5 a NA 2017-03-06 00:00:00 3
# 6 b 2013-08-08 00:00:00 2014-08-08 00:00:00 1
# 7 b 2015-08-08 00:00:00 NA 2
# 8 b NA 2016-08-08 00:00:00 2
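The same grouping logic translates directly to data.table syntax; a minimal sketch, assuming df is built as above and has the same start/end structure:
library(data.table)
setDT(df)[, epi := cumsum(!is.na(starts)), by = group][]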
An option is to get the rowSums of the NA elements in the 'starts' and 'ends' columns and then, grouped by 'group', take the rleid of that result as 'epi':
library(dplyr)
library(data.table)
df %>%
  mutate(epi = rowSums(is.na(.[c("starts", "ends")]))) %>%
  group_by(group) %>%
  mutate(epi = rleid(epi))
# A tibble: 8 x 4
# Groups: group [2]
# group starts ends epi
# <fct> <dttm> <dttm> <int>
#1 a 2011-09-18 00:00:00 NA 1
#2 a NA 2013-03-06 00:00:00 1
#3 a 2014-08-08 00:00:00 2015-08-08 00:00:00 2
#4 a 2016-09-18 00:00:00 NA 3
#5 a NA 2017-03-06 00:00:00 3
#6 b 2013-08-08 00:00:00 2014-08-08 00:00:00 1
#7 b 2015-08-08 00:00:00 NA 2
#8 b NA 2016-08-08 00:00:00 2
If there are only two columns to consider
df %>%
  group_by(group) %>%
  mutate(epi = rleid(is.na(starts) + is.na(ends)))
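For reference, rleid() simply numbers consecutive runs of equal values, which is what turns the NA-count pattern into episode numbers, e.g.:
rleid(c(1, 1, 0, 0, 1))
# [1] 1 1 2 2 3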

Melt or Replicate rows in a data table a certain number of times and include counter in R

I would like to "expand" a dataframe, duplicating the information in some columns the number of times indicated by a fifth column.
What would be the most efficient way to achieve this task in R? (Open to data.table, dplyr, or reshape solutions.)
Original Dataframe/DataTable:
f_1 f_2 d_1 d_2 i_1
1: 1 A 2016-01-01 <NA> NA
2: 2 A 2016-01-02 <NA> NA
3: 2 B 2016-01-03 2016-01-01 2
4: 3 C 2016-01-04 <NA> NA
5: 4 D 2016-01-05 2016-01-02 5
Desired Dataframe/DataTable
f_1 f_2 d_1 d_2 i_1
1: 1 A 2016-01-01 <NA> NA
2: 2 A 2016-01-02 <NA> NA
3: 2 B 2016-01-03 2016-01-01 1
4: 2 B 2016-01-03 2016-01-01 2
5: 3 C 2016-01-04 <NA> NA
6: 4 D 2016-01-05 2016-01-02 1
7: 4 D 2016-01-05 2016-01-02 2
8: 4 D 2016-01-05 2016-01-02 3
9: 4 D 2016-01-05 2016-01-02 4
10: 4 D 2016-01-05 2016-01-02 5
Reproducible data:
DT <- data.table(
  f_1 = factor(c(1,2,2,3,4)),
  f_2 = factor(c("A", "A", "B", "C", "D")),
  d_1 = as.Date(c("2016-01-01","2016-01-02","2016-01-03","2016-01-04","2016-01-05")),
  d_2 = as.Date(c(NA,NA,"2016-01-01",NA,"2016-01-02")),
  i_1 = as.integer(c(NA,NA,2,NA,5)))
Thanks, and sorry if this is a duplicate; I am struggling with this kind of reshaping exercise.
Here is a data.table solution. Basically, group by the columns that you want to duplicate and generate a sequence of integers using the number in i_1:
DT[, .(i_1 = if (!is.na(i_1)) seq_len(i_1) else i_1),
   by = c(names(DT)[-ncol(DT)])]
Output:
f_1 f_2 d_1 d_2 i_1
1: 1 A 2016-01-01 <NA> NA
2: 2 A 2016-01-02 <NA> NA
3: 2 B 2016-01-03 2016-01-01 1
4: 2 B 2016-01-03 2016-01-01 2
5: 3 C 2016-01-04 <NA> NA
6: 4 D 2016-01-05 2016-01-02 1
7: 4 D 2016-01-05 2016-01-02 2
8: 4 D 2016-01-05 2016-01-02 3
9: 4 D 2016-01-05 2016-01-02 4
10: 4 D 2016-01-05 2016-01-02 5
Or another way using data.table. For each row, create a sequence of numbers using i_1, add the original data to that sequence with c(.SD[, -"i_1"], ...), and finally remove the by column:
DT[, c(.SD[, -"i_1"], .(i_1 = if (!is.na(i_1)) seq_len(i_1) else i_1)),
   by = seq_len(DT[, .N])][, -1L]
Are you OK replacing i_1 with 1 when it's NA? If so, the following would be slightly more readable:
First, repeat the rows the specified number of times (ad hoc accounting for the missing values of i_1, using replace courtesy of @Frank):
DT_out = DT[rep(1:.N, replace(i_1, is.na(i_1), 1L))]
This could be just DT[rep(1:.N, i_1)] if we've already replaced DT[is.na(i_1), i_1 := 1L].
All that's left is to update the values of i_1. There are simpler versions of this, depending on your data's particulars. Here I think is the more general version:
DT_out[!is.na(i_1), i_1 := rowidv(.SD), .SDcols = !'i_1'][]
# f_1 f_2 d_1 d_2 i_1
# 1: 1 A 2016-01-01 <NA> NA
# 2: 2 A 2016-01-02 <NA> NA
# 3: 2 B 2016-01-03 2016-01-01 1
# 4: 2 B 2016-01-03 2016-01-01 2
# 5: 3 C 2016-01-04 <NA> NA
# 6: 4 D 2016-01-05 2016-01-02 1
# 7: 4 D 2016-01-05 2016-01-02 2
# 8: 4 D 2016-01-05 2016-01-02 3
# 9: 4 D 2016-01-05 2016-01-02 4
# 10: 4 D 2016-01-05 2016-01-02 5
rowid and rowidv give the row number within the groups defined by the variables they're passed. You can compare rowid(f_2), rowid(f_1), and rowid(f_1, f_2) to get an idea of what I mean. rowidv(.SD) is shorthand for rowid(f_1, f_2, d_1, d_2), since we exclude i_1 from the columns in .SD.
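To illustrate, a quick sketch using the DT defined above (the result column names are just for display):
DT[, .(f_1, f_2,
       by_f1 = rowid(f_1),
       by_f2 = rowid(f_2),
       by_both = rowid(f_1, f_2))]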

For each column, sum scores by group over prior window of time

I have a large panel dataset (10,000,000 x 53) with about 50 columns of scores. I have aggregated each score by group (there are about 15,000) and date.
Now I want to calculate a rolling sum of three values including the prior two dates' and the current date's scores, creating a new corresponding sum column.
The sums should be calculated for each score column by date and group.
For the 1st and 2nd dates within a group, fewer than 3 values are allowed.
GROUP DATE LAGGED SCORE1 SUM1 SCORE2 SUM2 ... SCORE50 SUM50
#1 A 2017-04-01 2017-03-30 1 1|1 2 2|2 4 4|4
#2 A 2017-04-02 2017-03-31 1 1+1|2 3 3+2|5 3 3+4|7
#3 A 2017-04-04 2017-04-02 2 2+1+1|4 4 4+3+2|9 2 2+3+4|9
#5 B 2017-04-02 2017-03-31 2 2|2 3 3|3 1 1|1
#6 B 2017-04-05 2017-04-03 2 2+2|4 2 2+3|5 1 1+1|2
#7 B 2017-04-08 2017-04-06 3 3+2+2|7 1 1+2+3|6 3 3+1+1|5
#8 C 2017-04-02 2017-03-31 3 3|3 1 1|1 1 1|1
#9 C 2017-04-03 2017-04-01 2 2+3|5 3 3+1|4 2 2+1|3
: : : : : : : : : :
#10M XX 2018-03-30 2018-03-28 2 2 1 1 ... 1 1
David's answer from this post covered most of my questions on summing rolling windows by groups, but I'm still missing a couple of pieces.
library(data.table) #v1.10.4
## Convert to a proper date class, and add another column
## in order to define the range
setDT(input)[, c("Date", "Date2") := {
Date = as.IDate(Date)
Date2 = Date - 2L
.(Date, Date2)
}]
## Run a non-equi join against the unique Date/Group combination in input
## Sum the Scores on the fly
## You can ignore the second Date column
input[unique(input, by = c("Date", "Group")), ## This removes the dupes
on = .(Group, Date <= Date, Date >= Date2), ## The join condition
.(Score = sum(Score)), ## sum the scores
keyby = .EACHI] ## Run the sum by each row in
## unique(input, by = c("Date", "Group"))
My question has two parts:
What code should replace "Score" to calculate time window sums for each column in a range of columns?
Is the provided solution the most efficient version for fast calculation on a large dataset?
A possible solution:
cols <- grep('^SCORE', names(input), value = TRUE)
input[, gsub('SCORE', 'SUM', cols) := lapply(.SD, cumsum)
      , by = GROUP
      , .SDcols = cols][]
which gives:
GROUP DATE LAGGED SCORE1 SCORE2 SUM1 SUM2
1: A 2017-04-01 2017-03-30 1 2 1 2
2: A 2017-04-02 2017-03-31 1 3 2 5
3: A 2017-04-04 2017-04-02 2 4 4 9
4: B 2017-04-02 2017-03-31 2 3 2 3
5: B 2017-04-05 2017-04-03 2 2 4 5
6: B 2017-04-08 2017-04-06 3 1 7 6
7: C 2017-04-02 2017-03-31 3 1 3 1
8: C 2017-04-03 2017-04-01 2 3 5 4
When you want to take a time window into account as well, you could do (assuming LAGGED is the start of the time-window):
input[input[input[, .(GROUP, DATE, LAGGED)]
            , on = .(GROUP, DATE >= LAGGED, DATE <= DATE)
            ][, setNames(lapply(.SD, sum), gsub('SCORE', 'SUM', cols))
              , by = .(GROUP, DATE = DATE.1)
              , .SDcols = cols]
      , on = .(GROUP, DATE)]
which gives:
GROUP DATE LAGGED SCORE1 SCORE2 SUM1 SUM2
1: A 2017-04-01 2017-03-30 1 2 1 2
2: A 2017-04-02 2017-03-31 1 3 2 5
3: A 2017-04-04 2017-04-02 2 4 3 7
4: B 2017-04-02 2017-03-31 2 3 2 3
5: B 2017-04-05 2017-04-03 2 2 2 2
6: B 2017-04-08 2017-04-06 3 1 3 1
7: C 2017-04-02 2017-03-31 3 1 3 1
8: C 2017-04-03 2017-04-01 2 3 5 4
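If the window were strictly row-based (the current row plus the two previous rows within the group), an adaptive frollsum() would be a compact alternative. This is a sketch assuming data.table >= 1.12.0; note it matches the date-based window above only when every group has rows on consecutive dates:
input[, gsub('SCORE', 'SUM', cols) := lapply(.SD, function(x)
          frollsum(x, n = pmin(seq_along(x), 3L), adaptive = TRUE))
      , by = GROUP, .SDcols = cols]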
Used data:
input <- fread(' GROUP DATE LAGGED SCORE1 SCORE2
A 2017-04-01 2017-03-30 1 2
A 2017-04-02 2017-03-31 1 3
A 2017-04-04 2017-04-02 2 4
B 2017-04-02 2017-03-31 2 3
B 2017-04-05 2017-04-03 2 2
B 2017-04-08 2017-04-06 3 1
C 2017-04-02 2017-03-31 3 1
C 2017-04-03 2017-04-01 2 3')
