I am trying to join two dataframes on multiple conditions, including a time-interval condition, as in the following example:
library(dplyr)
library(lubridate)

# two sample dataframes with time intervals
df1 <- data.frame(key1 = c("a", "b", "c", "d", "e"),
                  key2 = 1:5,
                  time1 = as.POSIXct(hms::as.hms(c("00:00:15", "00:15:15", "00:30:15", "00:40:15", "01:10:15"))),
                  time2 = as.POSIXct(hms::as.hms(c("00:05:15", "00:20:15", "00:35:15", "00:45:15", "01:15:15")))) %>%
  mutate(t1 = interval(time1, time2)) %>%
  select(key1, key2, t1)
df2 <- data.frame(key1 = c("b", "c", "a", "e", "d"),
                  key2 = c(2, 6, 1, 8, 5),
                  sam1 = as.POSIXct(hms::as.hms(c("00:21:15", "00:31:15", "00:03:15", "01:20:15", "00:43:15"))),
                  sam2 = as.POSIXct(hms::as.hms(c("00:23:15", "00:34:15", "00:04:15", "01:25:15", "00:44:15")))) %>%
  mutate(t2 = interval(sam1, sam2)) %>%
  select(key1, key2, t2)
The first requirement is that columns key1 and key2 match, which should be possible with the following (but it produces an error):
df <- inner_join(df1, df2, by = c("key1", "key2"))
But there is one more condition to check when joining: whether the interval t2 lies within t1. I can test this manually like so:
df$t2 %within% df$t1
I suspect the error comes from joining dataframes that contain interval columns, so this may not be the right way to do it.
# desired dataframe
df <- data.frame(key1 = c("a", "b"), key2 = c(1,2), time_condition = c(TRUE, FALSE))
If t1 runs from "00:00:15" to "00:05:15", then the corresponding t2, from "00:03:15" to "00:04:15", lies within t1. The result should have a time_condition column that is TRUE if t2 is within t1 and FALSE otherwise.
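In other words, this is the computation I am after (a sketch, assuming df1 and df2 still carry the raw POSIXct columns from before the mutate()/select() steps above):
# join first, then build the intervals and the flag afterwards
df <- df1 %>%
  inner_join(df2, by = c("key1", "key2")) %>%
  mutate(time_condition = interval(sam1, sam2) %within% interval(time1, time2)) %>%
  select(key1, key2, time_condition)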
Using data.table, you can perform operations while joining. Here is an example:
library(data.table)
df2[df1,                                             # left join
    .(time_condition = sam1 > time1 & sam2 < time2), # condition evaluated while joining
    on = .(key1, key2),                              # join keys
    by = .EACHI,                                     # evaluate j for each matched group
    nomatch = 0L]                                    # drop non-matches (inner join)
# key1 key2 time_condition
# 1: a 1 TRUE
# 2: b 2 FALSE
# your data generated using data.table
df1 <- data.table(key1 = c("a", "b", "c", "d", "e"),
                  key2 = 1:5,
                  time1 = as.ITime(c("00:00:15", "00:15:15", "00:30:15", "00:40:15", "01:10:15")),
                  time2 = as.ITime(c("00:05:15", "00:20:15", "00:35:15", "00:45:15", "01:15:15")))
df2 <- data.table(key1 = c("b", "c", "a", "e", "d"),
                  key2 = c(2, 6, 1, 8, 5),
                  sam1 = as.ITime(c("00:21:15", "00:31:15", "00:03:15", "01:20:15", "00:43:15")),
                  sam2 = as.ITime(c("00:23:15", "00:34:15", "00:04:15", "01:25:15", "00:44:15")))
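Note the strict inequalities above; lubridate's %within% treats the endpoints as inclusive, so for identical semantics you could use >= and <= (a minor variant of the same join):
df2[df1,
    .(time_condition = sam1 >= time1 & sam2 <= time2), # inclusive bounds, like %within%
    on = .(key1, key2),
    by = .EACHI,
    nomatch = 0L]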
How about this?
library(dplyr)
library(lubridate)

df1 %>%
  inner_join(df2, by = c("key1", "key2")) %>%
  filter(sam1 >= time1 & sam1 <= time2 & sam2 >= time1 & sam2 <= time2) %>%
  mutate(t1 = interval(time1, time2),
         t2 = interval(sam1, sam2)) %>%
  select(key1, key2, t1, t2)
Output is:
key1 key2 t1 t2
1 a 1 1970-01-01 00:00:15 UTC--1970-01-01 00:05:15 UTC 1970-01-01 00:03:15 UTC--1970-01-01 00:04:15 UTC
Sample data:
df1 <- data.frame(key1 = c("a", "b", "c", "d", "e"),
                  key2 = 1:5,
                  time1 = as.POSIXct(hms::as.hms(c("00:00:15", "00:15:15", "00:30:15", "00:40:15", "01:10:15"))),
                  time2 = as.POSIXct(hms::as.hms(c("00:05:15", "00:20:15", "00:35:15", "00:45:15", "01:15:15"))))
df2 <- data.frame(key1 = c("b", "c", "a", "e", "d"),
                  key2 = c(2, 6, 1, 8, 5),
                  sam1 = as.POSIXct(hms::as.hms(c("00:21:15", "00:31:15", "00:03:15", "01:20:15", "00:43:15"))),
                  sam2 = as.POSIXct(hms::as.hms(c("00:23:15", "00:34:15", "00:04:15", "01:25:15", "00:44:15"))))
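Note that filter() drops the non-matching rows entirely. If you instead want the TRUE/FALSE time_condition column from the question, a small variant of the same join (mutate() instead of filter()) would be:
df1 %>%
  inner_join(df2, by = c("key1", "key2")) %>%
  mutate(time_condition = sam1 >= time1 & sam2 <= time2) %>%
  select(key1, key2, time_condition)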
You can use the built-in merge() function for joins.
df = merge(df1, df2, by = c("key1", "key2"))
df = data.frame(df[,c("key1", "key2")], time_condition = df$t2 %within% df$t1)
df
# key1 key2 time_condition
#1 a 1 TRUE
#2 b 2 FALSE
I am trying to create a new df, call it df3, out of two other datasets:
df1 = data.frame("String" = c("a", "b", "c"), "Title" = c("A", "B", "C"), "Date" = c("2020-01-01", "2020-01-02", "2020-01-03"))
and:
df2 = data.frame("String" = c("a", "x", "y"), "Title" = c("ABCDEF", "XYZ", "YZ"), "Date" = c("2020-01-03", "2020-01-20", "2020-01-30"))
The conditions for the observations that should be matched, and form a new dataset, are:
df1$String %in% df2$String
grepl(df1$Title, df2$Title) == TRUE
df1$Date < df2$Date
What is the best way to do this kind of merging? I have tried to create an indicator along the lines of :
df1$indicator = ifelse(df1$String %in% df2$String & grepl(df1$Title, df2$Title) & df1$Date < df2$Date, 1, 0)
or
df1$indicator = ifelse(df1$String %in% df2$String & grepl(df1$Title, df2$Title[df1$String %in% df2$String]) & df1$Date < df2$Date[df1$String %in% df2$String], 1, 0)
to then use for merging, but I've been getting "longer object length is not a multiple of shorter object length" and "argument 'pattern' has length > 1 and only the first element will be used" warnings.
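For reference, a small demo of where those warnings come from: element-wise operators recycle vectors and warn when the lengths aren't multiples of each other, and grepl() accepts only a single pattern:
grepl(c("A", "B"), "ABCDEF") # warning: only the first pattern "A" is used
c(1, 2, 3) < c(1, 2)         # warning: longer object length is not a multiple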
One way: use a cross join, then filter the result.
Note that grepl() is not vectorized over its pattern argument, so I use mapply().
df1 = data.frame("String" = c("a", "b", "c"), "Title" = c("A", "B", "C"), "Date" = c("2020-01-01", "2020-01-02", "2020-01-03"))
df2 = data.frame("String" = c("a", "x", "y"), "Title" = c("ABCDEF", "XYZ", "YZ"), "Date" = c("2020-01-03", "2020-01-20", "2020-01-30"))
merge(df1, df2, by = NULL, suffixes = c(".x", ".y")) |>
  subset(String.x %in% String.y &
         mapply(grepl, Title.x, Title.y) &
         Date.x < Date.y)
#> String.x Title.x Date.x String.y Title.y Date.y
#> 1 a A 2020-01-01 a ABCDEF 2020-01-03
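One caveat: the Date columns here are character vectors. ISO-8601 strings happen to sort correctly under <, but converting them up front makes the comparison explicit:
# safer: compare real dates rather than character strings
df1$Date <- as.Date(df1$Date)
df2$Date <- as.Date(df2$Date)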
My question is essentially a generalisation of this SO post but with a rolling component.
I have a dataset of people, jobs and the dates on which they held said jobs (whilst the specifics aren't important, they make the prose easier). Each person can hold multiple jobs on a given date, or they can have no job, which shows up as a missing person-date-job row.
I want to create a summarised table with only one row per person-date combination, which requires a list-column holding the IDs of the jobs held by that person on that date. I have managed to do this for contemporaneous person-job-dates following the linked SO post. The complicating factor is that I want to look backwards by 3 periods, i.e. the job_id list-column for date t must contain all jobs held by the person at dates t, t-1 and t-2.
Below is some code to produce a toy input table and the desired output.
library(data.table)
# Input data
data <- data.table(
  ind_id = c(rep(1, 3), rep(2, 4), rep(3, 2), rep(4, 5)),
  date = c(1, 2, 3, 1, 2, 2, 3, 1, 3, 1, 1, 2, 2, 3),
  job_id = c("A", "A", "A", "B", "B", "C", "B", "D", "E", "F", "G", "F", "G", "G")
)
# Desired output
output <- data.table(
  ind_id = c(rep(1, 3), rep(2, 3), rep(3, 3), rep(4, 3)),
  date = rep(1:3, 4),
  job_id = list("A", "A", "A", "B", c("B", "C"), c("B", "C"), "D", "D", c("D", "E"), c("F", "G"), c("F", "G"), c("F", "G"))
)
And here is the code that works to make a table of contemporaneous person-job-date rows.
data_contemp <- data[, .(job_id = list(job_id)), by = .(date, ind_id)]
I tried frollapply(), but unfortunately it doesn't work when the output is not numeric: data[, all_jobs := frollapply(job_id, 3, list), by = ind_id]
Appreciate everyone's help on this!
EDIT: I should add that a data.table solution is highly preferred because the actual dataset is 607 million rows, data.table is faster and more memory efficient, and the syntax is better.
EDIT 2: Added some code to generate an arbitrarily large input table.
n <- 600e6
n <- round(n / 15)
t1 <- data.table(ind_id = rep(1, 3), date = 1:3, job_id = rep("A", 3))
t2 <- data.table(ind_id = rep(2, 3), date = 1:3, job_id = c("A", "B", "B"))
t3 <- data.table(ind_id = rep(3, 5), date = c(1, 2, 2, 3, 3), job_id = c("A", "A", "B", "A", "B"))
t4 <- data.table(ind_id = rep(4, 2), date = c(1, 3), job_id = c("A", "B"))
t5 <- data.table(ind_id = rep(5, 4), date = c(1, 1, 2, 3), job_id = c("A", "B", "A", "A"))
data <- rbind(t1, t2, t3, t4, t5)
data <- data[rep(seq_len(nrow(data)), n)]
data[, ind_id := rleid(ind_id)]
You could use a self non-equi join:
data[, start := date - 2]
data[data, .(ind_id, date = x.date, job_id = i.job_id),
     on = .(ind_id, start <= date, date >= date)][
  , .(job_id = list(unique(job_id))), by = .(ind_id, date)]
ind_id date job_id
<num> <num> <list>
1: 1 1 A
2: 1 2 A
3: 1 3 A
4: 2 1 B
5: 2 2 B,C
6: 2 3 B,C
7: 3 1 D
8: 3 3 D,E
9: 4 1 F,G
10: 4 2 F,G
11: 4 3 F,G
A slight difference compared to your expected output: date = 2 isn't present for ind_id = 3 because it isn't present in the initial data.
ind_id date job_id
<num> <int> <list>
8: 3 2 D
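If you need those missing combinations as well, one option is to expand to a balanced panel before the rolling join (a sketch using CJ, with data as defined above):
# right-join onto every ind_id/date pair first so no person-date row is lost
grid <- CJ(ind_id = unique(data$ind_id), date = unique(data$date))
data_full <- data[grid, on = .(ind_id, date)]
# then run the self non-equi join above on data_full instead of data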
Thanks Waldi for your solution. I actually managed to figure out my own solution with a combination of helper columns and mapply, so I've included my method as an answer and benchmarked both methods. Waldi's solution with non-equi joins is about 20 per cent faster than my method with mapply, but uses about 40 per cent more memory. This differential appears to remain constant as the number of rows scales.
Given memory is cheap and time is not, Waldi's solution works best here.
Thanks everyone for contributing!
library(data.table)
library(collapse)
## Input data
# Create three types of people with different employment histories:
# Type 1: same job over time
# Type 2: changes to a new job in t2
# Type 3: picks up a new job in t2
# Type 4: employed in t1, unemployed in t2, employed in t3
# Type 5: loses a second job in t2
make_data <- function(n) {
  n <- round(n / 15)
  t1 <- data.table(ind_id = rep(1, 3), date = 1:3, job_id = rep("A", 3))
  t2 <- data.table(ind_id = rep(2, 3), date = 1:3, job_id = c("A", "B", "B"))
  t3 <- data.table(ind_id = rep(3, 5), date = c(1, 2, 2, 3, 3), job_id = c("A", "A", "B", "A", "B"))
  t4 <- data.table(ind_id = rep(4, 2), date = c(1, 3), job_id = c("A", "B"))
  t5 <- data.table(ind_id = rep(5, 4), date = c(1, 1, 2, 3), job_id = c("A", "B", "A", "A"))
  data <- rbind(t1, t2, t3, t4, t5)
  data <- data[rep(seq_len(nrow(data)), n)]
  data[, ind_id := rleid(ind_id)]
  data <- data[, .(job_id = list(job_id)), by = .(date, ind_id)]
  # Add back missing person-date rows to create a balanced panel
  date_person_rows <- CJ(ind_id = unique(data$ind_id), date = unique(data$date))
  data <- date_person_rows[data, job_id := i.job_id, on = .(date, ind_id)]
  return(data)
}
method_1 <- function(data) {
  data[, paste0("jobs_", 0:2) := shift(.(job_id), 0:2), by = ind_id]
  data[, job_id := mapply(jobs_0, jobs_1, jobs_2, FUN = function(a, b, c) sort(na_rm(unique(c(a, b, c)))))]
  data[, c("jobs_0", "jobs_1", "jobs_2") := NULL]
  setkey(data, NULL) # For some reason this dt has a key set, but the method 2 one doesn't
  return(data)
}
method_2 <- function(data) {
  data[, start := date - 2]
  data <- data[data, .(ind_id, date = x.date, job_id = i.job_id),
               on = .(ind_id, start <= date, date >= date)][
    , .(job_id = list(unique(job_id))), by = .(ind_id, date)]
  data[, job_id := lapply(job_id, function(x) unique(unlist(x)))]
}
# Benchmark
bench::mark(
  method_1(make_data(10e4)),
  method_2(make_data(10e4)),
  iterations = 1L
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 2 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 method_1(make_data(1e+05)) 2.7s 2.7s 0.370 27.4MB 17.8
#> 2 method_2(make_data(1e+05)) 2.08s 2.08s 0.481 43MB 14.9
Created on 2022-10-11 with reprex v2.0.2
I have data
test = data.table(
  a = c(1, 1, 3, 4, 5, 6),
  b = c("a", "be", "a", "c", "d", "c"),
  c = rep(1, 6)
)
I wish to take the unique values of column a, store them in another data.table, and then fill in the remaining columns with the most prevalent values of those columns, so that my resulting data.table would be:
test2 = data.table(a = c(1,3,4,5,6), b = "a", c = 1)
Column b has equal counts of "a" and "c"; it doesn't matter which is chosen in such cases.
Attempt so far:
test2 = unique(test, by = "a")
test2[, c("b", "c") := lapply(.SD, FUN = function(x){test2[, .N, by = x][order(-N)][1,1]}), .SDcols = c("b", "c")]
EDIT: I would prefer a generic solution, compatible with a function where I specify the column to be "uniqued" and the remaining columns are filled with their single most prevalent value. Hence my use of lapply and .SD =)
EDIT2: as @MichaelChirico points out, how do we keep the column classes? With the following data.table some of the solutions do not work, although @chinsoon12's solution does:
test = data.table(a = c(1, 1, 3, 4, 5, 6),
                  b = c("a", "be", "a", "c", "d", "c"),
                  c = rep(1, 6),
                  d = as.Date("2019-01-01"))
Another option:
dtmode <- function(x) x[which.max(rowid(x))]
test[, .(A=unique(A), B=dtmode(B), C=dtmode(C))]
data:
test = data.table(
  A = c(1, 1, 3, 4, 5, 6),
  B = c("a", "be", "a", "c", "d", "c"),
  C = rep(1, 6)
)
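For intuition on how dtmode() works (a small demo): rowid() numbers repeated values cumulatively, so which.max() lands on an occurrence of the most frequent value:
x <- c("a", "be", "a", "c", "d", "c")
rowid(x)                # 1 1 2 1 1 2: the second "a" and second "c" reach count 2
x[which.max(rowid(x))]  # "a" (which.max returns the first maximum)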
Not a clean way to do this but it works.
test = data.frame(a = c(1,1,3,4,5,6), b = c("a", "be", "a", "c", "d", "c"), c = rep(1, 6))
a = unique(test$a)
b = tail(names(sort(table(test$b))), 1)
c = tail(names(sort(table(test$c))), 1)
test2 = cbind(a,b,c)
Output is like this:
> test2
a b c
[1,] "1" "c" "1"
[2,] "3" "c" "1"
[3,] "4" "c" "1"
[4,] "5" "c" "1"
[5,] "6" "c" "1"
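A small note on this approach: cbind() on mixed types coerces everything to a character matrix, which is why the output above is all quoted. Wrapping the pieces in data.frame() instead keeps a numeric (the length-one b and c are recycled):
test2 <- data.frame(a, b, c)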
@EmreKiratli is very close to what I would do:
test[ , c(
  list(a = unique(a)),
  lapply(.SD, function(x) as(tail(names(sort(table(x))), 1L), class(x)))
), .SDcols = !'a']
The as(., class(x)) part is because names in R are always character, so we have to convert back to the original class of x.
You might like this better in magrittr form since it's many nested functions:
library(magrittr)
test[ , c(
  list(a = unique(a)),
  lapply(.SD, function(x) {
    table(x) %>% sort %>% names %>% tail(1L) %>% as(class(x))
  })
), .SDcols = !'a']
I was able to make an OK solution, but if somebody can do it more elegantly, for example by avoiding the step of storing a list in refLevel below, please let me know! I'm very interested in learning data.table properly!
#solution:
test = data.table(a = c(1,1,3,4,5,6), b = c("a", "be", "a", "c", "d", "c"), c = rep(1, 6))
test2 = unique(test, by="a")
funPrev = function(x){unlist(as.data.table(x)[, .N, by=x][order(-N)][1,1], use.names = F)}
refLevel = lapply(test[, c("b", "c")], funPrev)
test2[, c("b", "c") := refLevel]
...and using a function (if anybody sees any unnecessary steps, please let me know):
genData = function(dt, var_unique, vars_prev){
  data = copy(dt)
  data = unique(data, by = var_unique)
  funPrev = function(x){unlist(as.data.table(x)[, .N, by=x][order(-N)][1,1], use.names = F)}
  refLevel = lapply(dt[, .SD, .SDcols = vars_prev], funPrev)
  data[, (vars_prev) := refLevel]
  return(data)
}
test2 = genData(test, "a", c("b", "c"))
Here's another variant which one might find less sophisticated, yet more readable. It's essentially chinsoon12's rowid approach generalized to any number of columns. The classes are also kept.
test = data.table(a = c(1, 1, 3, 4, 5, 6),
                  b = c("a", "be", "a", "c", "d", "c"),
                  c = rep(1, 6),
                  d = as.Date("2019-01-01"))
test2 = unique(test, by = "a")
for (col in setdiff(names(test2), "a")) test2[[col]] = test2[[col]][which.max(rowid(test2[[col]]))]
Problem
I work with a data.table where each row is a medical observation. The problem is that there are some errors in my data, and I need to correct them before pursuing my analysis. For example, a male patient can have an observation where he is coded as female.
Solution
My solution is to take the mode (most frequent value) of a variable by patient. If a patient has 10 observations as male and one as female, it is safe to assume that he is male.
I have found this clever way to do it with data.table:
DATA[j = .N,
     by = .(ID, SEX)][i = base::order(-N),
                      j = .(SEX = SEX[1L]),
                      keyby = ID]
The problem is that when a patient has multiple modes, it just keeps one. So a patient who is 50% male and 50% female will be counted as male, which will bias the result. I would like to code such cases as NA.
The only way I have found to correct this is by using dplyr:
DATA[j = .N,
     by = .(ID, SEX)] %>%
  group_by(ID) %>%
  filter(N == max(N))
and then replacing the SEX value with NA if duplicated. But it takes far longer than data.table, it is not very optimized, and I have a big data set with many variables that would need to be corrected as well.
Summary
How do I take the mode of a variable by patient and replace it with NA when it is not unique?
Example
ID <- c(rep(x = "1", 6), rep(x = "2", 6))
SEX <- c("M","M","M","M","F","M","M","F","M","F","F","M")
require(data.table)
DATA <- data.table(ID, SEX)
# First method (doesn't work)
DATA[j = .N,
     by = .(ID, SEX)][i = base::order(-N),
                      j = .(SEX = SEX[1L]),
                      keyby = ID]
# Second method (works with dplyr)
require(dplyr)
DATA[j = .N,
     by = .(ID, SEX)] %>%
  group_by(ID) %>%
  filter(N == max(N)) %>%
  mutate(SEX = if_else(condition = duplicated(ID),
                       true = NA_character_,
                       false = SEX)) %>%
  filter(row_number() == n())
# Applied to my data it took 84.288 seconds
Update
Solution proposed by @Cole, based on an idea by @Sindri_baldur:
DATA <- data.table(
  ID = c(rep(x = "1", 6), rep(x = "2", 6)),
  SEX = c("M", "M", "M", "M", "F", "M", "M", "F", "M", "F", "F", NA),
  V1 = c("a", NA, "a", "a", "b", "a", "b", "b", "b", "c", "b", "c")
)
our_mode_fac <- function(x) {
  freq <- tabulate(x)
  if (length(freq) == 0 || sum(freq == max(freq)) > 1) {
    NA
  } else {
    levels(x)[which.max(freq)]
  }
}
vars <- c("SEX", "V1")
DATA[j = (vars) := lapply(.SD, as.factor),
     .SDcols = vars][j = (vars) := lapply(.SD, our_mode_fac),
                     .SDcols = vars,
                     by = ID]
It works perfectly fine. It takes the mode even when there are more NAs than non-missing values, and replaces the value with NA when there is more than one mode.
It is also very fast now: 11 seconds for 3M+ observations and 1M+ patients (117 seconds with @Sindri_baldur's answer). Thanks a lot to both of you, I'm very grateful!
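For anyone wondering why the factor conversion is needed: tabulate() counts positive integers, so the character columns are first encoded as factors, and the winning level is mapped back to its label (a small demo):
f <- factor(c("M", "M", "F"))
tabulate(f)                        # 1 2 (counts per level: F = 1, M = 2)
levels(f)[which.max(tabulate(f))]  # "M"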
our_mode <- function(x) {
  freq <- table(x)
  if (length(freq) == 0 || sum(freq == max(freq)) > 1) {
    NA
  } else {
    names(freq)[which.max(freq)]
  }
}
vars <- c("SEX", "V1")
DATA[, paste0(vars, "_corrected") := lapply(.SD, our_mode), .SDcols = vars, by = ID]
ID SEX V1 SEX_corrected V1_corrected
1: 1 M a M a
2: 1 M <NA> M a
3: 1 M a M a
4: 1 M a M a
5: 1 F b M a
6: 1 M a M a
7: 2 M b F b
8: 2 F b F b
9: 2 M b F b
10: 2 F c F b
11: 2 F b F b
12: 2 <NA> c F b
Reproducible data
DATA <- data.table(
  ID = c(rep(x = "1", 6), rep(x = "2", 6)),
  SEX = c("M", "M", "M", "M", "F", "M", "M", "F", "M", "F", "F", NA),
  V1 = c("a", NA, "a", "a", "b", "a", "b", "b", "b", "c", "b", "c")
)
Note that our_mode() is not optimised for speed. See Cole's suggestions in the comments for speed improvements.
I am wondering if it is possible to use two columns to do a lookup in R data.table. Here is a little experiment that failed:
x <- data.table(A = c("a", "a", "b", "b", "c", "c"),
                D = c(1, 2, 1, 2, 4, 5))
DT <- data.table(A = c("a", "a", "b", "b"),
                 D = c(1, 2, 1, 2))
setkey(DT, A, D)
DT[J(x$A, x$D), ] # same as below; thanks to ilir (I previously thought this returned an error)
DT[J(x$A, x$D), , allow.cartesian=TRUE]
# Return:
# A D
# 1: a 1
# 2: a 2
# 3: b 1
# 4: b 2
# 5: c 4 # <- ideally (NA NA) or (c NA)
# 6: c 5 # <- ideally (NA NA) or (c NA)
In this experiment, the rows in DT are unique, but both columns have duplicated entries. When calling DT[J(x$A, x$D), ], I want to look up table DT, so I would expect the result to contain only entries present in DT; however, this is clearly not the case.
Is there an effective way to do a lookup with two columns as keys?
Thanks to ilir, the following code works:
x <- data.table(A = c("a", "a", "b", "b", "c", "c"),
                D = c(1, 2, 1, 2, 4, 5))
DT <- data.table(A = c("a", "a", "b", "b"),
                 D = c(1, 2, 1, 2))
DT[, aux := 1L]
setkey(DT, A, D)
DT[J(x$A, x$D), ]
inx <- !is.na(DT[J(x$A, x$D), ]$aux)
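For completeness, a sketch of how to use this (continuing the aux approach above): inx flags the rows of x that exist in DT, and nomatch = 0L gives the inner-join behaviour directly:
x_matched <- x[inx]           # rows of x that have a match in DT
DT[J(x$A, x$D), nomatch = 0L] # or skip the aux column: drop non-matching rows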