Constructing a rolling aggregation of data.table list-columns

My question is essentially a generalisation of this SO post but with a rolling component.
I have a dataset of people, jobs and the dates on which they held said jobs (whilst the specifics aren't important, they make the prose easier). Each person can hold multiple jobs on a given date, or they can have no job, which shows up as a missing person-date-job row.
I want to create a summarised table with only one row for each person-date combination, which requires creating a list-column that holds the IDs of the jobs held by that person on that date. I have managed to do this for contemporaneous person-job-dates following the SO post linked above. The complicating factor is that I want to look backwards by 3 periods, i.e. the job_id list-column for date t needs to contain all the jobs held by a person on dates t, t-1 and t-2. For example, person 2 below holds job B throughout and picks up job C at date 2, so the rows for dates 2 and 3 should both list c("B", "C").
Below is some code to produce a toy input table and the desired output.
library(data.table)
# Input data
data <- data.table(
  ind_id = c(rep(1, 3), rep(2, 4), rep(3, 2), rep(4, 5)),
  date = c(1, 2, 3, 1, 2, 2, 3, 1, 3, 1, 1, 2, 2, 3),
  job_id = c("A", "A", "A", "B", "B", "C", "B", "D", "E", "F", "G", "F", "G", "G")
)
# Desired output
output <- data.table(
  ind_id = c(rep(1, 3), rep(2, 3), rep(3, 3), rep(4, 3)),
  date = rep(1:3, 4),
  job_id = list("A", "A", "A", "B", c("B", "C"), c("B", "C"), "D", "D", c("D", "E"), c("F", "G"), c("F", "G"), c("F", "G"))
)
And here is the code that works to make a table of contemporaneous person-job-date rows.
data_contemp <- data[, .(job_id = list(job_id)), by = .(date, ind_id)]
Something I tried was frollapply, but unfortunately it doesn't work when the output is not numeric: data[, all_jobs := frollapply(job_id, 3, list), by = ind_id]
Appreciate everyone's help on this!
EDIT: I should add that a data.table solution is highly preferred because the actual dataset is 607 million rows, data.table is faster and more memory efficient, and the syntax is better.
EDIT 2: Added some code to generate an arbitrarily large input table.
n <- 600e6
n <- round(n / 15)
t1 <- data.table(ind_id = rep(1, 3), date = 1:3, job_id = rep("A", 3))
t2 <- data.table(ind_id = rep(2, 3), date = 1:3, job_id = c("A", "B", "B"))
t3 <- data.table(ind_id = rep(3, 5), date = c(1, 2, 2, 3, 3), job_id = c("A", "A", "B", "A", "B"))
t4 <- data.table(ind_id = rep(4, 2), date = c(1, 3), job_id = c("A", "B"))
t5 <- data.table(ind_id = rep(5, 4), date = c(1, 1, 2, 3), job_id = c("A", "B", "A", "A"))
data <- rbind(t1, t2, t3, t4, t5)
data <- data[rep(seq_len(nrow(data)), n)]
data[, ind_id := rleid(ind_id)]
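(The rleid() call renumbers consecutive runs of ind_id, so each replicated block becomes a distinct set of individuals rather than extra rows for the same five people.)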

You could use a self non-equi join:
data[, start := date - 2]
# self-join: for each person-date (x.date), collect jobs from rows dated within [x.date - 2, x.date]
data[data, .(ind_id, date = x.date, job_id = i.job_id),
     on = .(ind_id, start <= date, date >= date)][
       , .(job_id = list(unique(job_id))), .(ind_id, date)]
    ind_id  date job_id
     <num> <num> <list>
 1:      1     1      A
 2:      1     2      A
 3:      1     3      A
 4:      2     1      B
 5:      2     2    B,C
 6:      2     3    B,C
 7:      3     1      D
 8:      3     3    D,E
 9:      4     1    F,G
10:      4     2    F,G
11:      4     3    F,G
A slight difference compared to your expected output: date = 2 isn't present for ind_id = 3 because it isn't present in the initial data.
   ind_id  date job_id
    <num> <int> <list>
 8:      3     2      D
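If those missing person-date rows are wanted, a minimal sketch of one way to add them back, assuming the pipeline above has been assigned to a table called result (this mirrors the CJ() step used in make_data() in the answer below):
all_rows <- CJ(ind_id = unique(data$ind_id), date = unique(data$date))
all_rows[result, job_id := i.job_id, on = .(ind_id, date)]  # update join; person-dates absent from result keep NULL job_id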

Thanks Waldi for your solution. I actually managed to figure out my own solution with a combination of helper columns and mapply, so I've included my method as an answer below and benchmarked both methods. Waldi's non-equi-join solution is about 20 per cent faster than my mapply method, but uses about 40 per cent more memory. This differential looks like it remains constant as the number of rows scales.
Given memory is cheap and time is not, Waldi's solution works best here.
Thanks everyone for contributing!
library(data.table)
library(collapse)
## Input data
# Create five types of people with different employment histories:
# Type 1: same job over time
# Type 2: changes to a new job in t2
# Type 3: picks up a new job in t2
# Type 4: employed in t1, unemployed in t2, employed in t3
# Type 5: loses a second job in t2
make_data <- function(n) {
  n <- round(n / 15)
  t1 <- data.table(ind_id = rep(1, 3), date = 1:3, job_id = rep("A", 3))
  t2 <- data.table(ind_id = rep(2, 3), date = 1:3, job_id = c("A", "B", "B"))
  t3 <- data.table(ind_id = rep(3, 5), date = c(1, 2, 2, 3, 3), job_id = c("A", "A", "B", "A", "B"))
  t4 <- data.table(ind_id = rep(4, 2), date = c(1, 3), job_id = c("A", "B"))
  t5 <- data.table(ind_id = rep(5, 4), date = c(1, 1, 2, 3), job_id = c("A", "B", "A", "A"))
  data <- rbind(t1, t2, t3, t4, t5)
  data <- data[rep(seq_len(nrow(data)), n)]
  data[, ind_id := rleid(ind_id)]
  data <- data[, .(job_id = list(job_id)), by = .(date, ind_id)]
  # Add back missing person-date rows to create a balanced panel
  date_person_rows <- CJ(ind_id = unique(data$ind_id), date = unique(data$date))
  data <- date_person_rows[data, job_id := i.job_id, on = .(date, ind_id)]
  return(data)
}
method_1 <- function(data) {
  # Create lagged copies of the job list-column (t, t-1, t-2) per person
  data[, paste0("jobs_", 0:2) := shift(.(job_id), 0:2), by = ind_id]
  # Union the three windows row-wise, dropping NAs
  data[, job_id := mapply(jobs_0, jobs_1, jobs_2, FUN = function(a, b, c) sort(na_rm(unique(c(a, b, c)))))]
  data[, c("jobs_0", "jobs_1", "jobs_2") := NULL]
  setkey(data, NULL) # For some reason this dt has a key set but the method-2 one doesn't
  return(data)
}
method_2 <- function(data) {
  data[, start := date - 2]
  # Self non-equi join: gather jobs dated within [date - 2, date] for each person-date
  data <-
    data[data, .(ind_id, date = x.date, job_id = i.job_id),
         on = .(ind_id, start <= date, date >= date)][
           , .(job_id = list(unique(job_id))), .(ind_id, date)]
  # Each element is a list of vectors after the join, so unlist and deduplicate
  data[, job_id := lapply(job_id, function(x) unique(unlist(x)))]
}
# Benchmark
bench::mark(
  method_1(make_data(10e4)),
  method_2(make_data(10e4)),
  iterations = 1L
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 2 × 6
#>   expression                      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                 <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 method_1(make_data(1e+05))     2.7s     2.7s     0.370    27.4MB     17.8
#> 2 method_2(make_data(1e+05))    2.08s    2.08s     0.481      43MB     14.9
Created on 2022-10-11 with reprex v2.0.2

Related

How to filter a data frame to only min and max values of different columns in R?

Let's say I have the following data frame:
df <- data.frame(id = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3),
                 col1 = c("a", "a", "b", "c", "d", "e", "f", "g", "h", "g"),
                 start_day = c(NA, 1, 15, NA, 4, 22, 5, 11, 14, 18),
                 end_day = c(NA, 2, 15, NA, 6, 22, 6, 12, 16, 21))
I want to create a data frame with the following columns: id, start_day, end_day,
such that for each unique id I keep only the minimum of the start_day column and the maximum of the end_day column. The final data frame should look as follows:
  id start_day end_day
1  1         1      15
2  2         4      22
3  3         5      21
To get this new data frame I wrote the following code:
df <- df[!is.na(df$start_day), ]
dt <- data.frame(matrix(ncol = 3, nrow = length(unique(df$id))))
colnames(dt) <- c("id", "start_day", "end_day")
dt$id <- unique(df$id)
st_day <- vector()
en_day <- vector()
for (elm in dt$id) {
  d <- df[df$id == elm, ]
  minimum <- min(d$start_day)
  maximum <- max(d$end_day)
  st_day <- c(st_day, minimum)
  en_day <- c(en_day, maximum)
}
dt$start_day <- st_day
dt$end_day <- en_day
df <- dt
My code is creating what I am looking for, but I am not happy with it. I would love to learn a better and cleaner way to do the same thing. Any idea is very much appreciated.
You can try data.table like below
> library(data.table)
> na.omit(setDT(df))[, .(start_day = min(start_day), end_day = max(end_day)), id]
   id start_day end_day
1:  1         1      15
2:  2         4      22
3:  3         5      21
This should do (with dplyr):
library(dplyr)
df %>%
  group_by(id) %>%
  summarise(start_day = min(start_day, na.rm = TRUE),
            end_day = max(end_day, na.rm = TRUE))
Output:
     id start_day end_day
  <dbl>     <dbl>   <dbl>
1     1         1      15
2     2         4      22
3     3         5      21
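One small difference between the two answers worth noting: na.omit() drops the all-NA rows before aggregating, whereas min()/max() with na.rm = TRUE would return Inf/-Inf with a warning for any id whose values were all NA. On this data both give the same result.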

How to filter my data.table by condition and by group?

Problem
I work on a data.table where each row is a medical observation. The problem is there are some errors in my data, and I need to correct them before pursuing my analysis. For example, a male patient can have an observation where he is coded as female.
Solution
My solution is to select the mode (most frequent value) of a variable by patient. If a patient has 10 observations as a male and one as a female, it is safe to assume that he is male.
I found this clever way to do it with data.table.
DATA[j = .N,
     by = .(ID, SEX)][i = base::order(-N),
                      j = .(SEX = SEX[1L]),
                      keyby = ID]
The problem is that when a patient has multiple modes, it just keeps one. So a patient who is 50% male and 50% female will be counted as male, which will introduce a bias in the end. I would like to code them as NA.
The only way I found to correct this is with dplyr:
DATA[j = .N,
     by = .(ID, SEX)] %>%
  group_by(ID) %>%
  filter(N == max(N))
and then replacing the SEX value with NA if duplicated. But it takes far longer than data.table, is not very optimised, and I have a big data set with a lot of variables that would need to be corrected as well.
Summary
How do I take the mode of a variable by patient and replace it with NA when it is not unique?
Example
ID <- c(rep(x = "1", 6), rep(x = "2", 6))
SEX <- c("M","M","M","M","F","M","M","F","M","F","F","M")
require(data.table)
DATA <- data.table(ID, SEX)
# First method (doesn't work)
DATA[j = .N,
     by = .(ID, SEX)][i = base::order(-N),
                      j = .(SEX = SEX[1L]),
                      keyby = ID]
# Second method (works with dplyr)
require(dplyr)
DATA[j = .N,
     by = .(ID, SEX)] %>%
  group_by(ID) %>%
  filter(N == max(N)) %>%
  mutate(SEX = if_else(condition = duplicated(ID) == TRUE,
                       true = "NA",
                       false = SEX)) %>%
  filter(row_number() == n())
# Applied to my data it took 84.288 seconds
Update
Solution proposed by @Cole based on an idea of @Sindri_baldur:
DATA <- data.table(
  ID = c(rep(x = "1", 6), rep(x = "2", 6)),
  SEX = c("M", "M", "M", "M", "F", "M", "M", "F", "M", "F", "F", NA),
  V1 = c("a", NA, "a", "a", "b", "a", "b", "b", "b", "c", "b", "c")
)
our_mode_fac <- function(x) {
  freq <- tabulate(x)
  if (length(freq) == 0 || sum(freq == max(freq)) > 1) {
    NA
  } else {
    levels(x)[which.max(freq)]
  }
}
vars <- c("SEX", "V1")
DATA[j = paste0(vars) := lapply(.SD, as.factor),
     .SDcols = vars][j = (vars) := lapply(.SD, our_mode_fac),
                     .SDcols = vars,
                     by = ID]
It works perfectly fine. It takes the mode, even when NAs outnumber the other values, and replaces the value with NA when there is more than one mode.
Now it is also very fast: 11 seconds for 3M+ observations and 1M+ patients (117 seconds with @Sindri_baldur's answer). Thanks a lot to both of you, I'm very grateful!
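For what it's worth, the speed difference is plausible: converting each column to a factor once lets tabulate() count the underlying integer codes directly, whereas our_mode() below calls table(), which builds a named contingency table for every patient group.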
our_mode <- function(x) {
  freq <- table(x)
  if (length(freq) == 0 || sum(freq == max(freq)) > 1) {
    NA
  } else {
    names(freq)[which.max(freq)]
  }
}
vars <- c("SEX", "V1")
DATA[, paste0(vars, "_corrected") := lapply(.SD, our_mode), .SDcols = vars, by = ID]
    ID  SEX   V1 SEX_corrected V1_corrected
 1:  1    M    a             M            a
 2:  1    M <NA>             M            a
 3:  1    M    a             M            a
 4:  1    M    a             M            a
 5:  1    F    b             M            a
 6:  1    M    a             M            a
 7:  2    M    b             F            b
 8:  2    F    b             F            b
 9:  2    M    b             F            b
10:  2    F    c             F            b
11:  2    F    b             F            b
12:  2 <NA>    c             F            b
Reproducible data
DATA <- data.table(
  ID = c(rep(x = "1", 6), rep(x = "2", 6)),
  SEX = c("M", "M", "M", "M", "F", "M", "M", "F", "M", "F", "F", NA),
  V1 = c("a", NA, "a", "a", "b", "a", "b", "b", "b", "c", "b", "c")
)
Note that our_mode() is not optimised for speed; see the suggestions by Cole in the comments (and the factor-based version in the question's update above) for speed improvements.

Inner_join with two conditions and interval within interval condition

Trying to join 2 dataframes according to multiple conditions and an interval-within-interval condition, as in the following example:
# two sample dataframes with time intervals
library(dplyr)
library(lubridate)
df1 <- data.frame(key1 = c("a", "b", "c", "d", "e"),
                  key2 = c(1:5),
                  time1 = as.POSIXct(hms::as.hms(c("00:00:15", "00:15:15", "00:30:15", "00:40:15", "01:10:15"))),
                  time2 = as.POSIXct(hms::as.hms(c("00:05:15", "00:20:15", "00:35:15", "00:45:15", "01:15:15")))) %>%
  mutate(t1 = interval(time1, time2)) %>%
  select(key1, key2, t1)
df2 <- data.frame(key1 = c("b", "c", "a", "e", "d"),
                  key2 = c(2, 6, 1, 8, 5),
                  sam1 = as.POSIXct(hms::as.hms(c("00:21:15", "00:31:15", "00:03:15", "01:20:15", "00:43:15"))),
                  sam2 = as.POSIXct(hms::as.hms(c("00:23:15", "00:34:15", "00:04:15", "01:25:15", "00:44:15")))) %>%
  mutate(t2 = interval(sam1, sam2)) %>%
  select(key1, key2, t2)
The first thing that needs to match is the pair of columns key1 and key2, which can be done with the following (though here it produces an error):
df <- inner_join(df1, df2, by = c("key1", "key2"))
But there is one more condition that needs to be checked when joining, namely whether the interval t2 is within t1. I can check this manually like this:
df$t2 %within% df$t1
I guess the error comes from joining dataframes that contain interval columns, so this might not be the right way to do it.
# desired dataframe
df <- data.frame(key1 = c("a", "b"), key2 = c(1,2), time_condition = c(TRUE, FALSE))
If t1 runs from "00:00:15" to "00:05:15", then the corresponding t2, which runs from "00:03:15" to "00:04:15", is within the interval t1. This yields the time_condition column, which is TRUE if t2 is within t1 and FALSE otherwise.
Using data.table, you can perform operations while joining. Here is an example
library(data.table)
df2[df1,                                              # left join
    .(time_condition = sam1 > time1 & sam2 < time2),  # condition while joining
    on = .(key1, key2),                               # keys
    by = .EACHI,                                      # check condition per join
    nomatch = 0L]                                     # make it an inner join
#    key1 key2 time_condition
# 1:    a    1           TRUE
# 2:    b    2          FALSE
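A short note on the semantics here: by = .EACHI evaluates the condition once per matched row of df1, and nomatch = 0L drops key pairs with no match. Also, lubridate's %within% treats interval endpoints as inclusive, so using >= and <= instead of the strict > and < above should reproduce it exactly (worth checking against your data).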
# your data generated using data.table
df1 <- data.table(key1 = c("a", "b", "c", "d", "e"),
                  key2 = c(1:5),
                  time1 = as.ITime(c("00:00:15", "00:15:15", "00:30:15", "00:40:15", "01:10:15")),
                  time2 = as.ITime(c("00:05:15", "00:20:15", "00:35:15", "00:45:15", "01:15:15")))
df2 <- data.table(key1 = c("b", "c", "a", "e", "d"),
                  key2 = c(2, 6, 1, 8, 5),
                  sam1 = as.ITime(c("00:21:15", "00:31:15", "00:03:15", "01:20:15", "00:43:15")),
                  sam2 = as.ITime(c("00:23:15", "00:34:15", "00:04:15", "01:25:15", "00:44:15")))
How about this?
library(dplyr)
library(lubridate)
df1 %>%
  inner_join(df2, by = c("key1", "key2")) %>%
  filter(sam1 >= time1 & sam1 <= time2 & sam2 >= time1 & sam2 <= time2) %>%
  mutate(t1 = interval(time1, time2),
         t2 = interval(sam1, sam2)) %>%
  select(key1, key2, t1, t2)
Output is:
key1 key2 t1 t2
1 a 1 1970-01-01 00:00:15 UTC--1970-01-01 00:05:15 UTC 1970-01-01 00:03:15 UTC--1970-01-01 00:04:15 UTC
Sample data:
df1 <- data.frame(key1 = c("a", "b", "c", "d", "e"),
                  key2 = c(1:5),
                  time1 = as.POSIXct(hms::as.hms(c("00:00:15", "00:15:15", "00:30:15", "00:40:15", "01:10:15"))),
                  time2 = as.POSIXct(hms::as.hms(c("00:05:15", "00:20:15", "00:35:15", "00:45:15", "01:15:15"))))
df2 <- data.frame(key1 = c("b", "c", "a", "e", "d"),
                  key2 = c(2, 6, 1, 8, 5),
                  sam1 = as.POSIXct(hms::as.hms(c("00:21:15", "00:31:15", "00:03:15", "01:20:15", "00:43:15"))),
                  sam2 = as.POSIXct(hms::as.hms(c("00:23:15", "00:34:15", "00:04:15", "01:25:15", "00:44:15"))))
You can use the built-in function merge() for joins (note that %within% needs lubridate, and df1/df2 must still carry the interval columns t1 and t2):
library(lubridate)
df <- merge(df1, df2, by = c("key1", "key2"))
df <- data.frame(df[, c("key1", "key2")], time_condition = df$t2 %within% df$t1)
df
#   key1 key2 time_condition
# 1    a    1           TRUE
# 2    b    2          FALSE
Thank You

How to concatenate different values in different columns in a new column?

Suppose I have something like this:
dat <- data.frame(ID = c("A", "B", "C"),
                  value = c(1, 2, 3))
I would like to add an extra column in which each value looks like this:
[A,1]
That is, each value is the concatenation of "[" + the value in the first column (A) + "," + the value in the second column (1) + "]".
How can I do it? I tried with paste but I am doing something wrong.
Here's an approach that will work consistently with any number of columns:
dat$conc <- paste0("[",apply(dat,1,paste,collapse=","),"]")
Using your example:
dat <- data.frame(ID = c("A", "B", "C"), value = c(1, 2, 3))
dat$conc <- paste0("[",apply(dat,1,paste,collapse=","),"]")
Gives:
  ID value  conc
1  A     1 [A,1]
2  B     2 [B,2]
3  C     3 [C,3]
Or if we have a dataframe with more columns:
dat <- data.frame(ID = c("A", "B", "C"), value = c(1, 2, 3), value2 = c(4, 5, 6))
dat$conc <- paste0("[",apply(dat,1,paste,collapse=","),"]")
Gives:
  ID value value2    conc
1  A     1      4 [A,1,4]
2  B     2      5 [B,2,5]
3  C     3      6 [C,3,6]
Assuming this is your data
dat <- data.frame(ID = c("A", "B", "C"), value = c(1, 2, 3))
This would work
dat$concat <- paste0("[", dat$ID, ", ", dat$value, "]")
  ID value concat
1  A     1 [A, 1]
2  B     2 [B, 2]
3  C     3 [C, 3]
Or if you did not want the space after the comma:
dat$concat <- paste0("[", dat$ID, ",", dat$value, "]")
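Another base R option, for what it's worth: sprintf() handles the whole bracket template in one vectorised call.
dat$conc <- sprintf("[%s,%s]", dat$ID, dat$value)  # same result as the paste0() version above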

lookup using two columns with unique rows in R data.table

I am wondering if it is possible to use two columns to do a lookup in R data.table. Here is a little experiment that failed:
x <- data.table(A = c("a", "a", "b", "b", "c", "c"),
                D = c(1, 2, 1, 2, 4, 5))
DT <- data.table(A = c("a", "a", "b", "b"),
                 D = c(1, 2, 1, 2))
setkey(DT, A, D)
DT[J(x$A, x$D), ] # same as below; thanks to ilir, I previously thought this returned an error
DT[J(x$A, x$D), , allow.cartesian=TRUE]
# Return:
#    A D
# 1: a 1
# 2: a 2
# 3: b 1
# 4: b 2
# 5: c 4  # <- ideally (NA NA) or (c NA)
# 6: c 5  # <- ideally (NA NA) or (c NA)
In this experiment, the rows in DT are unique; however, both columns have duplicated entries. When calling DT[J(x$A, x$D), ], what I want is to look up table DT, so I would expect the result to contain only entries present in DT; however, this is clearly not the case.
Is there an effective way to do a lookup with two columns as keys?
Thanks to ilir, the following code works:
x <- data.table(A = c("a", "a", "b", "b", "c", "c"),
                D = c(1, 2, 1, 2, 4, 5))
DT <- data.table(A = c("a", "a", "b", "b"),
                 D = c(1, 2, 1, 2))
DT[, aux := 1L]  # helper column: it comes back as NA for lookups that miss
setkey(DT, A, D)
DT[J(x$A, x$D), ]
inx <- !is.na(DT[J(x$A, x$D), ]$aux)  # TRUE where the (A, D) pair of x exists in DT
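For reference, a sketch of the same lookup without the helper column, assuming DT as first defined (before aux is added): supplying nomatch = 0L drops rows of the lookup input that have no match in DT, so the result contains only entries present in DT.
setkey(DT, A, D)
DT[J(x$A, x$D), nomatch = 0L]  # only the rows of x whose (A, D) pair exists in DT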
