Conditional data.table match for a subset of a data.table

This post is related to the previous post here: match rows of two data.tables to fill subset of a data.table
I'm not sure how to integrate the two. In my situation, besides colB of DT1 being NA, a couple more conditions have to hold for the merge, and I can't get it to work.
> DT1 <- data.table(colA = c(1,1, 2,2,2,3,3), colB = c('A', NA, 'AA', 'B', NA, 'A', 'C'), timeA = c(2,4,3,4,6,1,4))
> DT1
colA colB timeA
1: 1 A 2
2: 1 <NA> 4
3: 2 AA 3
4: 2 B 4
5: 2 <NA> 6
6: 3 A 1
7: 3 C 4
> DT2 <- data.table(colC = c(1,1,1,2,2,3), timeB1 = c(1,3,6, 2,4, 1), timeB2 = c(2,5,7,3,5,4), colD = c('Z', 'YY', 'AB', 'JJ', 'F', 'RR'))
> DT2
colC timeB1 timeB2 colD
1: 1 1 2 Z
2: 1 3 5 YY
3: 1 6 7 AB
4: 2 2 3 JJ
5: 2 4 5 F
6: 3 1 4 RR
Using the same guideline as mentioned above, I'd like to merge colD of DT2 into colB of DT1 only for the NA values of colB in DT1, AND use only the values of colD for which timeA in DT1 is between timeB1 and timeB2 in DT2. I tried the following, but the merge doesn't happen:
> output <- DT1[DT2, on = .(colA = colC), colB := ifelse(is.na(x.colB) & i.timeB1 <= x.timeA & x.timeA <= i.timeB2, i.colD, x.colB)]
> output
colA colB timeA
1: 1 A 2
2: 1 <NA> 4
3: 2 AA 3
4: 2 B 4
5: 2 <NA> 6
6: 3 A 1
7: 3 C 4
Nothing changes in the output. This is my desired output:
> desired_output
colA colB timeA
1: 1 A 2
2: 1 YY 4 --> should find a match
3: 2 AA 3
4: 2 B 4
5: 2 <NA> 6 --> shouldn't find a match
6: 3 A 1
7: 3 C 4
Why doesn't this work?
I'd like to use only data.table operations, without additional packages.

An in-place update of colB in DT1 works as follows:
DT1[is.na(colB),
    colB := DT2[DT1[is.na(colB)],
                on = .(colC = colA, timeB1 <= timeA, timeB2 >= timeA),
                colD]]
print(DT1)
colA colB timeA
1: 1 A 2
2: 1 YY 4
3: 2 AA 3
4: 2 B 4
5: 2 <NA> 6
6: 3 A 1
7: 3 C 4
This subsets the rows where colB is NA and, after a join on the conditions defined in on = ..., replaces the missing values with the matching values found in colD. (As for why the original attempt fails: the interval conditions sit in j rather than in on =, so every DT2 row with a matching colC assigns to the same DT1 row and the last match wins; for row 2, the "YY" written by DT2's [3, 5] interval is immediately overwritten with NA by the [6, 7] row.)
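For intuition, here is what the inner lookup returns on its own (deterministic given the sample data above):
DT2[DT1[is.na(colB)],
    on = .(colC = colA, timeB1 <= timeA, timeB2 >= timeA),
    colD]
# [1] "YY" NA    # row (colA = 2, timeA = 6) falls in no interval, so it stays NA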

Possibly not the shortest answer, but it gets the job done. I'm no data.table expert, so I welcome improvements/suggestions.
DT1[is.na(colB),
    colB := DT1[is.na(colB)][DT2, colB := i.colD,
                             on = c("colA == colC", "timeA >= timeB1", "timeA <= timeB2")]$colB]
What it does:
first, subset DT1 to all rows where is.na(colB) is TRUE;
then, update colB in those rows with the colB vector taken from a non-equi join of that same subset on DT2.
A bonus is that DT1 is changed by reference, so it should be fast and memory-efficient on large data.
colA colB timeA
1: 1 A 2
2: 1 YY 4
3: 2 AA 3
4: 2 B 4
5: 2 <NA> 6
6: 3 A 1
7: 3 C 4

Related

Left join adding all rows from right table

I noticed that, when updating by reference, I was losing rows from my right table whenever there was more than one row per join key.
No matter how much I browse the forum, I can't find how to do it. Am I missing something?
Even with mult = it doesn't seem to work.
Due to performance and data-volume constraints, I would like to keep updating by reference.
In my reprex, I expect two rows for a = 2:
A <- data.table(a = 1:4, b = 12:15)
B <- data.table(a = c(2,2,5), b = 23:25)
A[B, on = 'a', newvar := i.b, mult = 'all']
Thanks!
One option is to create a list column in 'B' and then do the join and assign (:=), since := cannot expand the rows of the original data.
A[B[, .(b = .(b)), a], on = .(a), newvar := i.b]
Output:
> A
a b newvar
1: 1 12
2: 2 13 23,24
3: 3 14
4: 4 15
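For reference, the aggregated lookup built inside the join looks like this; the list column packs all of B's b values for each a:
B[, .(b = .(b)), a]
#    a     b
# 1: 2 23,24
# 2: 5    25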
Once we have the list column, it is easy to unnest it with tidyr:
library(tidyr)
A[, unnest(.SD, newvar, keep_empty = TRUE)]
# A tibble: 5 x 3
a b newvar
<int> <int> <int>
1 1 12 NA
2 2 13 23
3 2 13 24
4 3 14 NA
5 4 15 NA
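If you'd rather stay within data.table, an expansion along these lines should also work (a sketch, assuming unmatched rows hold NULL in the list column after the update join):
# expand the list column row by row; NULL (no match) becomes NA,
# mimicking unnest(..., keep_empty = TRUE)
A[, .(newvar = if (is.null(newvar[[1]])) NA_integer_ else unlist(newvar)),
  by = .(a, b)]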
Or use a left join with merge.data.table:
merge(A, B, by = 'a', all.x = TRUE)
a b.x b.y
1: 1 12 NA
2: 2 13 23
3: 2 13 24
4: 3 14 NA
5: 4 15 NA

Filling in missing values in a data.table by reference

Suppose I have a data.table with missing values and a reference data.table:
dt <- data.table(id = 1:5, value = c(1, 2, NA, NA, 5))
id value
1: 1 1
2: 2 2
3: 3 NA
4: 4 NA
5: 5 5
ref <- data.table(id = 1:4, value = c(1, 2, 98, 99))
id value
1: 1 1
2: 2 2
3: 3 98
4: 4 99
How would I fill the value column of dt using the matching id in the two data.tables, so that I get the following data.table?
id value
1: 1 1
2: 2 2
3: 3 98
4: 4 99
5: 5 5
We can use a join on 'id' and assign (:=) the value column from 'ref' (i.value) to the one in 'dt':
library(data.table)
dt[ref, value := i.value, on = .(id)]
dt
# id value
#1: 1 1
#2: 2 2
#3: 3 98
#4: 4 99
#5: 5 5
If we don't want to replace the original non-NA elements in the 'value' column:
dt[ref, value := fcoalesce(value, i.value), on = .(id)]
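To see the difference, consider a hypothetical ref whose values conflict with dt's existing non-NA rows; fcoalesce keeps the originals and fills only the NAs:
dt  <- data.table(id = 1:5, value = c(1, 2, NA, NA, 5))
ref <- data.table(id = 1:4, value = c(100, 200, 98, 99))  # ids 1-2 now conflict
dt[ref, value := fcoalesce(value, i.value), on = .(id)]
dt$value
# [1]  1  2 98 99  5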

data table subset last row by group (retain order)

Have:
> aDT <- data.table(ID = c(3,3,2,2,2,3), colA = c(5,5,4,4,4,5), colC = c(1:6))
> aDT
ID colA colC
1: 3 5 1
2: 3 5 2
3: 2 4 3
4: 2 4 4
5: 2 4 5
6: 3 5 6
Need:
> aDT <- data.table(ID = c(3,2,3), colA = c(5,4,5), colC = c(2,5,6))
> aDT
ID colA colC
1: 3 5 2
2: 2 4 5
3: 3 5 6
Tried:
> aDT[, .SD[.N], by = list(ID,colA)]
ID colA colC
1: 3 5 6
2: 2 4 5
As you can see, the result is not really what I need. How can I fix it?
(By the way, I would like to retain the original row order.)
You are not really grouping by ID and colA but by consecutive runs of them, which is exactly what rleid is for:
aDT[aDT[, .I[.N], rleid(ID, colA)]$V1]
# ID colA colC
#1: 3 5 2
#2: 2 4 5
#3: 3 5 6
.I[.N] extracts the global row number of the last row for each group:
aDT[, .I[.N], rleid(ID, colA)]
# rleid V1
#1: 1 2
#2: 2 5
#3: 3 6
There are three groups in total, and the row numbers of their last rows are 2, 5, and 6. These row numbers are then used to subset the original data.table.
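To see what rleid does, look at the run ids it assigns; each run of consecutive identical (ID, colA) pairs gets its own id:
aDT[, rleid(ID, colA)]
# [1] 1 1 2 2 2 3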

data.table conditional inequality join

There are two sample datasets:
> aDT
col1 col2 ExtractDate
1: 1 A 2017-01-01
2: 1 A 2016-01-01
3: 2 B 2015-01-01
4: 2 B 2014-01-01
> bDT
col1 col2 date_pol Value
1: 1 A 2017-05-20 1
2: 1 A 2016-05-20 2
3: 1 A 2015-05-20 3
4: 2 B 2014-05-20 4
And I need:
> cDT
col1 col2 ExtractDate date_pol Value
1: 1 A 2017-01-01 2016-05-20 2
2: 1 A 2016-01-01 2015-05-20 3
3: 2 B 2015-01-01 2014-05-20 4
4: 2 B 2014-01-01 NA NA
Basically: left join aDT to bDT on col1, col2 and ExtractDate >= date_pol, keeping only the first match (i.e. the highest date_pol). A Cartesian join is not allowed due to memory limits.
Note: to generate the sample datasets:
aDT <- data.table(col1 = c(1,1,2,2), col2 = c("A","A","B","B"),
                  ExtractDate = c("2017-01-01","2016-01-01","2015-01-01","2014-01-01"))
bDT <- data.table(col1 = c(1,1,1,2), col2 = c("A","A","A","B"),
                  date_pol = c("2017-05-20","2016-05-20","2015-05-20","2014-05-20"),
                  Value = c(1,2,3,4))
cDT <- data.table(col1 = c(1,1,2,2), col2 = c("A","A","B","B"),
                  ExtractDate = c("2017-01-01","2016-01-01","2015-01-01","2014-01-01"),
                  date_pol = c("2016-05-20","2015-05-20","2014-05-20",NA), Value = c(2,3,4,NA))
library(lubridate)  # needed for ymd()
aDT[, ExtractDate := ymd(ExtractDate)]
bDT[, date_pol := ymd(date_pol)]
aDT[order(-ExtractDate)]
bDT[order(-date_pol)]
I have tried:
aDT[, c("date_pol", "Value") :=
        bDT[aDT, .(date_pol, Value),
            on = .(date_pol <= ExtractDate, col1 = col1, col2 = col2),
            mult = "first"]]
But results are a bit weird:
> aDT
col1 col2 ExtractDate date_pol Value ##date_pol values not right
1: 1 A 2017-01-01 2017-01-01 2
2: 1 A 2016-01-01 2016-01-01 3
3: 2 B 2015-01-01 2015-01-01 4
4: 2 B 2014-01-01 2014-01-01 NA
From the data.table documentation: when i is a data.table, the columns of i can be referred to in j by using the prefix i., e.g., X[Y, .(val, i.val)]. Here val refers to X's column and i.val to Y's. Columns of x can now be referred to using the prefix x., which is particularly useful during a join to refer to x's join columns, as they are otherwise masked by i's. For example, X[Y, .(x.a - i.a, b), on = "a"].
bDT[aDT, .(col1, col2, i.ExtractDate, x.date_pol, Value),
on = .(date_pol <= ExtractDate, col1 = col1, col2 = col2),
mult = "first"]
Output:
col1 col2 i.ExtractDate x.date_pol Value
1: 1 A 2017-01-01 2016-05-20 2
2: 1 A 2016-01-01 2015-05-20 3
3: 2 B 2015-01-01 2014-05-20 4
4: 2 B 2014-01-01 <NA> NA
I like the approach you tried yourself: updating by reference without explicitly listing the columns on the left side of the join. This is very helpful when there are many columns on the left side, since you don't have to specify them all.
The only thing you need to add is the x. prefix:
aDT[, c("date_pol", "Value") := bDT[aDT, on = .(date_pol <= ExtractDate, col1, col2),
mult = "first", .(x.date_pol, x.Value)]]
Output:
col1 col2 ExtractDate date_pol Value
1: 1 A 2017-01-01 2016-05-20 2
2: 1 A 2016-01-01 2015-05-20 3
3: 2 B 2015-01-01 2014-05-20 4
4: 2 B 2014-01-01 <NA> NA
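As a side note, since the goal is "the highest date_pol not exceeding ExtractDate", a rolling join should give the same result (a sketch, not from the answers above; it assumes the dates have already been converted with ymd() as in the setup):
# roll = Inf carries the most recent date_pol forward onto each ExtractDate;
# the x. prefix recovers bDT's own date rather than i's join value
bDT[aDT,
    .(col1, col2, ExtractDate = i.ExtractDate, date_pol = x.date_pol, Value),
    on = .(col1, col2, date_pol = ExtractDate),
    roll = Inf]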

Editing a dataframe to create a paired sample; removing records without a matching date in another group

I have done a bunch of searching for a solution to this and either can't find one or don't know it when I see it. I've seen some topics that are close to this but deal with matching between two different dataframes, whereas this is dealing with a single dataframe.
I have a dataframe with two groups (factor, col1), a sampling date (date, col2), and a measurement (numeric, col3). I would like to eventually run a statistical test on a paired sample between groups A and B, so to create the paired sample I want to keep only the records that have a measurement taken on the same day in both groups. In other words, remove the records in group A that do not have a corresponding measurement taken on the same day in group B, and vice versa. In the sample data below, that would result in rows 4 and 8 being removed.
Put another way: how do I find and remove records whose date occurs only once?
Sample data:
# note: col3 = sample(8) is random, so the col3 values in the outputs below
# differ between runs
my.df <- data.frame(col1 = as.factor(c(rep("A", 4), rep("B", 4))),
                    col2 = as.Date(c("2001-01-01", "2001-01-02", "2001-01-03",
                                     "2001-01-04", "2001-01-01", "2001-01-02",
                                     "2001-01-03", "2001-02-03")),
                    col3 = sample(8))
Here are a few alternatives:
1) ave
> subset(my.df, ave(col3, col2, FUN = length) > 1)
col1 col2 col3
1 A 2001-01-01 3
2 A 2001-01-02 2
3 A 2001-01-03 6
5 B 2001-01-01 7
6 B 2001-01-02 4
7 B 2001-01-03 1
2) split / Filter / do.call
> do.call("rbind", Filter(function(x) nrow(x) > 1, split(my.df, my.df$col2)))
col1 col2 col3
2001-01-01.1 A 2001-01-01 3
2001-01-01.5 B 2001-01-01 7
2001-01-02.2 A 2001-01-02 2
2001-01-02.6 B 2001-01-02 4
2001-01-03.3 A 2001-01-03 6
2001-01-03.7 B 2001-01-03 1
3) dplyr. Solution (2) translates nearly directly into a dplyr solution:
> library(dplyr)
> my.df %>% group_by(col2) %>% filter(n() > 1)
Source: local data frame [6 x 3]
Groups: col2
col1 col2 col3
1 A 2001-01-01 5
2 A 2001-01-02 1
3 A 2001-01-03 7
4 B 2001-01-01 2
5 B 2001-01-02 4
6 B 2001-01-03 6
4) data.table. The last two solutions can also be translated to data.table:
> data.table(my.df)[, if (.N > 1) .SD, by = col2]
col2 col1 col3
1: 2001-01-01 A 5
2: 2001-01-01 B 2
3: 2001-01-02 A 1
4: 2001-01-02 B 4
5: 2001-01-03 A 7
6: 2001-01-03 B 6
5) tapply
> na.omit(tapply(my.df$col3, my.df[c('col2', 'col1')], identity))
col1
col2 A B
2001-01-01 3 7
2001-01-02 2 4
2001-01-03 6 1
attr(,"na.action")
2001-02-03 2001-01-04
5 4
6) merge
> merge(subset(my.df, col1 == 'A'), subset(my.df, col1 == 'B'), by = 2)
col2 col1.x col3.x col1.y col3.y
1 2001-01-01 A 3 B 7
2 2001-01-02 A 2 B 4
3 2001-01-03 A 6 B 1
7) sqldf. Solution (6) is similar to this sqldf query:
> sqldf("select * from `my.df` A join `my.df` B
+ on A.col2 = B.col2 and A.col1 = 'A' and B.col1 = 'B'")
col1 col2 col3 col1 col2 col3
1 A 2001-01-01 5 B 2001-01-01 2
2 A 2001-01-02 1 B 2001-01-02 4
3 A 2001-01-03 7 B 2001-01-03 6
