I am trying to update a data.table dt1 using another data.table dt2, joining on a common id, for one specific column. That column should be updated only where its value in dt1 is empty (NA).
For instance, this updates everything:
dt1 <- data.table(
"id" = c(1,2,3,4),
"animal" = c("cat","cat","dog",NA)
)
dt2 <- data.table(
"id" = c(3,4),
"animal" = c("human being","duck")
)
dt1 <- dt1[dt2, on = c("id"), "animal" := .(i.animal)]
Which gives, for dt1:
id animal
1: 1 cat
2: 2 cat
3: 3 human being
4: 4 duck
While I need:
id animal
1: 1 cat
2: 2 cat
3: 3 dog
4: 4 duck
We could use fcoalesce:
library(data.table)
dt1[dt2, animal := fcoalesce(animal, i.animal), on = .(id)]
Output:
> dt1
id animal
<num> <char>
1: 1 cat
2: 2 cat
3: 3 dog
4: 4 duck
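The same NA-only update can also be spelled out with data.table's fifelse(); a minimal sketch, not from the answer above:
# keep dt1's value unless it is NA, in which case take dt2's value
dt1[dt2, on = .(id), animal := fifelse(is.na(animal), i.animal, animal)]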
Related
I have two data.tables. Each has a column called 'firstName' and another called 'lastName', which contain some values that will match each other and some that won't. Some values in both data sets might be duplicated.
I want to add a new column to the second data.table, in which I will store the indices of matches from the first data set for each element of 'firstName' within the second data set. I will then repeat the whole matching process with the 'lastName' column and take the intersection of the index matches for 'firstName' and 'lastName'. I will then use those intersected indices to fetch the case ID (cid) from the first data set and append it to the second data set.
Because there might be more than one match per element, I will store the matches as lists within my data.table. I cannot use the base::match function because it only returns the first match for each element, but I do need the answer to be vectorised in just the same way as match.
I've tried different combinations of which(d1$x %in% y), but this does not work either, because it matches against all of y at once instead of one element at a time. I am using data.table because in my real-world use case the data set to match on could be hundreds of thousands of records, so speed is important.
I have found a related question here, but I can't quite figure out how to efficiently convert this to data.table syntax.
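To make the behaviour I'm after concrete, a naive base R version (too slow for my real data; the helper name is made up) would collect all matching indices per element:
# all indices of x matching each element of y, one list entry per element of y
match_all <- function(y, x) lapply(y, function(el) which(x == el))
match_all(c("Jim", "Maria"), c("Jim", "Joe", "Anne", "Jim"))
# returns list(c(1L, 4L), integer(0))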
Here is some example data:
# Load library
library(data.table)
# First data set (lookup table):
dt1 <- data.table(cid = c("c1", "c2", "c3", "c4", "c5"),
firstName = c("Jim", "Joe", "Anne", "Jim", "Anne"),
lastName = c("Gracea", "Ali", "Mcfee", "Dutto", "Crest"))
# Second data set (data to match with case IDs from lookup table):
dt2 <- data.table(lid = c(1, 2, 3, 4),
firstName = c("Maria", "Jim", "Jack", "Anne"),
lastName = c("Antonis", "Dutto", "Blogs", "Mcfee"),
result = c("pos", "neg", "neg", "pos"))
My desired output would look like this:
# Output:
> dt2
lid firstName lastName result fn_match ln_match casematch caseid
1: 1 Maria Antonis pos NA NA NA <NA>
2: 2 Jim Dutto neg 1,4 4 4 c4
3: 3 Jack Blogs neg NA NA NA <NA>
4: 4 Anne Mcfee pos 3,5 3 3 c3
A possible solution:
dt1[, id := seq_along(cid)]
dt1[dt2, .(lid, id, firstName = i.firstName), on = .(firstName)][
  , .(casematch = .(id)), by = .(lid, firstName)]
lid firstName casematch
<num> <char> <list>
1: 1 Maria NA
2: 2 Jim 1,4
3: 3 Jack NA
4: 4 Anne 3,5
We could use
library(data.table)
dt1[dt2, .(casematch = toString(cid), lid), on = .(firstName), by = .EACHI]
Output:
firstName casematch lid
<char> <char> <num>
1: Maria NA 1
2: Jim c1, c4 2
3: Jack NA 3
4: Anne c3, c5 4
Or with the row index (na_if() is from dplyr; since toString() returns a character value, we compare against "0"):
dt1[dt2, .(casematch = na_if(toString(.I), "0"), lid), on = .(firstName), by = .EACHI]
firstName casematch lid
<char> <char> <num>
1: Maria <NA> 1
2: Jim 1, 4 2
3: Jack <NA> 3
4: Anne 3, 5 4
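If you would rather not depend on dplyr for na_if(), the same "0" to NA mapping can be done inline; a sketch, relying as the answer above does on .I being 0 when there is no match:
dt1[dt2, .(casematch = {s <- toString(.I); if (s == "0") NA_character_ else s}, lid),
    on = .(firstName), by = .EACHI]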
Using .EACHI and adding the resulting list column by reference (note that this also adds a helper column i to dt1 by reference):
dt2[, res := dt1[, i := .I][.SD, on = .(firstName), .(.(i)), by = .EACHI]$V1]
# lid firstName res
# 1: 1 Maria NA
# 2: 2 Jim 1,4
# 3: 3 Jack NA
# 4: 4 Anne 3,5
Another data.table option:
> dt1[, .(cid = toString(cid)), firstName][dt2, on = .(firstName)]
firstName cid lid
1: Maria <NA> 1
2: Jim c1, c4 2
3: Jack <NA> 3
4: Anne c3, c5 4
In my real-life scenario, I need to retrieve the indices of matches on more than one column. I found a way to do this in one step by combining some of the other solutions, so I figured it would be useful to share it here, along with an explanation of how it works.
The code below adds a new column caseid to dt2, which gets its values from the column cid in dt1 for the row indices that matched on both firstName and lastName.
Putting dt1 inside the square brackets and specifying on = .(...) is equivalent to merging dt1 with dt2 on firstName and lastName, but instead of merging all columns from both datasets, one new column called caseid is created.
The lower case i. prefix on cid indicates that cid is a column from the data.table inside the square brackets (dt1).
The upper case .I inside the square brackets after i.cid retrieves the row indices of the matches on firstName and lastName, so i.cid[.I] returns the cid values for the matched rows.
# Get case IDs from dt1 for matches of firstName and lastName in one step:
dt2[dt1, caseid := i.cid[.I], on = .(firstName, lastName)]
# Output:
> dt2
lid firstName lastName result caseid
1: 1 Maria Antonis pos <NA>
2: 2 Jim Dutto neg c4
3: 3 Jack Blogs neg <NA>
4: 4 Anne Mcfee pos c3
I need to incrementally update a SQL database and therefore receive the following data.table as input.
library(data.table)
dt1 <- data.table(Category = letters[1:4]
, Max.Date = rep(as.Date("2018-01-01"),4))
I would now like to filter the data.table in R to select all Categories that have a later date in my data.table dt2. The filtering should of course be done at the level of the individual category, so that dates are compared per category and not across the whole data.table.
dt2 <- data.table(Category = letters[1:8]
, Max.Date = rep(as.Date("2019-01-01"),8))
The desired output should select the rows of dt2 whose Max.Date is later than the corresponding date in dt1:
dt.desired
Category Max.Date
1: a 2019-01-01
2: b 2019-01-01
3: c 2019-01-01
4: d 2019-01-01
So we are selecting from dt2 where the date per category is later than in dt1.
The non-equi join below should work:
dt2[ dt1,
.( Category, Max.Date = x.Max.Date ),
on = .( Category, Max.Date > Max.Date ) ][]
resulting in:
# Category Max.Date
# 1: a 2019-01-01
# 2: b 2019-01-01
# 3: c 2019-01-01
# 4: d 2019-01-01
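For comparison, the same result can be produced with an ordinary equi-join followed by a filter; a sketch, assuming dt1 has exactly one row per Category:
dt2[dt1, on = .(Category), nomatch = 0L][Max.Date > i.Max.Date, .(Category, Max.Date)]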
It's late and I can't figure this out. I'm using lubridate and dplyr.
My data is as follows:
table1 = 1 observation per subject, with a date
table2 = 1 or more observations per subject, with associated dates
When I left join, I actually add observations, because multiple records in table2 match the key. How can I make this a conditional join, so that only the one matching record from table2 whose date is closest to the date in table1 is joined?
Sorry if this was verbose.
Use the data.table package to join, with roll = "nearest" to get the nearest match:
library(data.table)
dt1 <- data.table( id = 1:10, date = 1:10, stringsAsFactors = FALSE )
dt2 <- data.table( date = 6:15, letter = letters[1:10], stringsAsFactors = FALSE )
dt1[, letter := dt2[dt1, letter, on = "date", roll = "nearest"] ][]
# id date letter
# 1: 1 1 a
# 2: 2 2 a
# 3: 3 3 a
# 4: 4 4 a
# 5: 5 5 a
# 6: 6 6 a
# 7: 7 7 b
# 8: 8 8 c
# 9: 9 9 d
# 10: 10 10 e
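Translated to the question's setting with real Date columns, it might look like this (the table and column names are assumptions, not from the post):
table1 <- data.table(subject = c(1, 2),
                     visit_date = as.Date(c("2018-01-05", "2018-03-01")))
table2 <- data.table(subject = c(1, 1, 2),
                     obs_date = as.Date(c("2018-01-01", "2018-02-01", "2018-03-10")),
                     value = c("a", "b", "c"))
# for each table1 row: match on subject, roll obs_date to the nearest visit_date
table1[, value := table2[table1, value,
                         on = .(subject, obs_date = visit_date),
                         roll = "nearest"]][]
#    subject visit_date value
# 1:       1 2018-01-05     a
# 2:       2 2018-03-01     c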
I have a data.table 'DT' with a column ('col2') that is a list of data frames:
require(data.table)
DT <- data.table(col1 = c('A','A','B'),
col2 = list(data.frame(colA = c(1,3,54, 23),
colB = c("aa", "bb", "cc", "hh")),
data.frame(colA =c(23, 1),
colB = c("hh", "aa")),
data.frame(colA = 1,
colB = "aa")))
> DT
col1 col2
1: A <data.frame>
2: A <data.frame>
3: B <data.frame>
> DT$col2
[[1]]
colA colB
1 1 aa
2 3 bb
3 54 cc
4 23 hh
[[2]]
colA colB
1 23 hh
2 1 aa
[[3]]
colA colB
1 1 aa
Each data.frame in col2 has two columns colA and colB.
I'd like to have a data.table output that binds each unique row of those data.frames based on col1 of DT.
I guess it's like using rbindlist in an aggregate function of the data.table.
This is the desired output:
> #desired output
> output
colA colB col1
1: 1 aa A
2: 3 bb A
3: 54 cc A
4: 23 hh A
5: 1 aa B
The dataframe of the second row of DT (DT[2, col2]) has duplicate entries, and only unique entries are desired for each unique col1.
I tried the following, and I get an error:
desired_output <- DT[, lapply(col2, function(x) unique(rbindlist(x))), by = col1]
# Error in rbindlist(x) :
# Item 1 of list input is not a data.frame, data.table or list
This 'works', though it is not the desired output:
unique(rbindlist(DT$col2))
colA colB
1: 1 aa
2: 3 bb
3: 54 cc
4: 23 hh
Is there any way to use rbindlist in an aggregate function of a data.table?
Group by 'col1', run rbindlist on 'col2':
unique(DT[, rbindlist(col2), by = col1]) # trimmed thanks to @snoram
# col1 colA colB
# 1: A 1 aa
# 2: A 3 bb
# 3: A 54 cc
# 4: A 23 hh
# 5: B 1 aa
Regarding "only unique entries are desired for each unique col1": because col1 is included as a column in the result, the unconditional unique() above already gives the unique entries within each col1.
Henrik's answer is one way to keep col1. Another is:
unique(DT[, rbindlist(setNames(col2, col1), idcol = "col1")])
I guess this should be more efficient than
bycols = "col1"
unique(DT[, rbindlist(col2), by=bycols]) # Henrik's
though the extension to either (1) col1 not being a character column (hence suitable for setNames) or (2) having multiple by= columns is not so obvious. For either of these cases, I would make an .id column equal to row numbers of DT then copy them over:
bycols = "col1"
res = unique(DT[, rbindlist(col2, idcol = "DT_row")])
res[, (bycols) := DT[DT_row, ..bycols]]
To put those columns first/leftmost, I think setcolorder(res, bycols) should work, but am on too old a data.table version to see it do so.
There's also an open issue for a tidyr::unnest-like function.
This works:
DT1 <- apply(DT, 1, function(x) cbind(col1 = x$col1, x$col2))
unique(rbindlist(DT1))
# col1 colA colB
#1: A 1 aa
#2: A 3 bb
#3: A 54 cc
#4: A 23 hh
#5: B 1 aa
You could do something hackish like this:
nDT <- cbind(rbindlist(DT[[2]]), col1 = rep(DT[[1]], sapply(DT[[2]], nrow)))
nDT[!duplicated(nDT)]
colA colB col1
1: 1 aa A
2: 3 bb A
3: 54 cc A
4: 23 hh A
5: 1 aa B
Or using tidyr (inspired by PKumar's comment):
unique(tidyr::unnest(DT))
Or more generalisable base R:
names(DT[[2]]) <- DT[[1]]
ndf <- do.call(rbind, DT[[2]])
# row names look like "A.1", "A.2", ..., "B"; substr(..., 1, 1) keeps the
# first character, which assumes col1 values are single characters
ndf$col1 <- substr(row.names(ndf), 1, 1)
unique(ndf)
What is an efficient and elegant data.table syntax for finding the most common category for each id? I keep a boolean vector indicating NA positions (for other purposes).
dt = data.table(id=rep(1:2,7), category=c("x","y",NA))
print(dt)
In this toy example, ignoring NA, x is the most common category for id == 1 and y for id == 2.
If you want to ignore NAs, you have to exclude them first with !is.na(category), group by id and category (by = .(id, category)), and create a frequency variable with .N:
dt[!is.na(category), .N, by = .(id, category)]
which gives:
id category N
1: 1 x 3
2: 2 y 3
3: 2 x 2
4: 1 y 2
Ordering this by id will give you a clearer picture:
dt[!is.na(category), .N, by = .(id, category)][order(id)]
which results in:
id category N
1: 1 x 3
2: 1 y 2
3: 2 y 3
4: 2 x 2
If you just want the rows which indicate the top results:
dt[!is.na(category), .N, by = .(id, category)][order(id, -N), head(.SD, 1), by = id]
or:
dt[!is.na(category), .N, by = .(id, category)][, .SD[which.max(N)], by = id]
which both give:
id category N
1: 1 x 3
2: 2 y 3
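A compact alternative that returns only the modal category per id uses base table() and which.max() inside j; a sketch (on ties it keeps the first mode):
dt[!is.na(category), .(category = names(which.max(table(category)))), by = id]
#    id category
# 1:  1        x
# 2:  2        y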