Use of match within i of data.table

The %in% operator is a wrapper for the match function returning "a vector of the same length as x". For instance:
> match(c("a", "b", "c"), c("a", "a"), nomatch = 0) > 0
## [1] TRUE FALSE FALSE
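Equivalently, using %in% itself:
> c("a", "b", "c") %in% c("a", "a")
## [1] TRUE FALSE FALSE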
When used within i of data.table, however
(dt1 <- data.table(v1 = c("a", "b", "c"), v2 = "dt1"))
v1 v2
1: a dt1
2: b dt1
3: c dt1
(dt2 <- data.table(v1 = c("a", "a"), v2 = "dt2"))
v1 v2
1: a dt2
2: a dt2
dt1[v1 %in% dt2$v1]
v1 v2
1: a dt1
2: a dt1
duplicates are obtained. Shouldn't %in% within i of a data.table give the same result as
dt1[dt1$v1 %in% dt2$v1]
v1 v2
1: a dt1
i.e. without duplicates?

This was a bug in the automatic indexing of data.table versions < 1.9.5; it was fixed in versions >= 1.9.5.
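In current versions you can inspect the index that automatic indexing builds; a minimal sketch, assuming the default options(datatable.auto.index = TRUE) and a data.table version that exports indices() (>= 1.9.8):
library(data.table)
dt1 <- data.table(v1 = c("a", "b", "c"), v2 = "dt1")
indices(dt1)                 # NULL: no index yet
invisible(dt1[v1 %in% "a"])  # the first %in% subset in i auto-builds an index
indices(dt1)                 # "v1": the index involved in the old bug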
I can think of 3 possible workarounds:
Disable the auto indexing and use base R's %in%, as in:
options(datatable.auto.index = FALSE)
dt1[v1 %in% dt2$v1]
## v1 v2
## 1: a dt1
Use the built-in %chin% operator, which is both more efficient and unaffected by this bug (it works only for character vector comparisons):
dt1[v1 %chin% dt2$v1]
## v1 v2
## 1: a dt1
Install the development version from GitHub (close all your R sessions first and reopen just one):
library(devtools)
install_github("Rdatatable/data.table", build_vignettes = FALSE)
library(data.table)
dt1 <- data.table(v1 = c("a", "b", "c"), v2 = "dt1")
dt2 <- data.table(v1 = c("a", "a"), v2 = "dt2")
dt1[v1 %in% dt2$v1]
## v1 v2
## 1: a dt1

Related

Check which rows in a data.table are identical

I need to find which rows of a data.table are identical, but I can't come up with a clever approach (one without a bunch of complex loops). I would prefer a data.table solution.
What I want is a list of the row numbers that have identical entries.
An example:
library(data.table)
Data <- data.table(A = c("a", "a", "c"),
                   B = c("A", "A", "B"))
The first and the second line are identical.
My desired output:
[[1]]
[1] 1 2
[[2]]
[1] 3
Here is something quick and dirty:
# row numbers (.I) and a group counter (.GRP) per (A, B) group, then split the row numbers by group
Data[, .(.I, .GRP), by = .(A, B)][, list(split(I, GRP))]$V1
Could be simplified to:
# collect each group's row numbers directly into a list column
Data[, .(list(.I)), by = .(A, B)]$V1
That was my solution until sindri_baldur came up with a better solution:
library(magrittr)  # for the %>% pipe
library(purrr)     # for map()

Data.unique <- unique(Data)
Data.unique[, G := .I]  # one id per unique (A, B) combination
Data[, I := .I]         # original row numbers
Data.full <- merge(Data, Data.unique, by = c("A", "B"))

Data.full %>%
  split(by = "G") %>%
  map(~ .x[, I])
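If you'd rather not pull in magrittr and purrr, the same join-then-split idea works with base split() (a small sketch, continuing with the Data.full object built above):
split(Data.full$I, Data.full$G)
# $`1`
# [1] 1 2
#
# $`2`
# [1] 3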

Remove rows from data.table that meet condition

I have a data table
DT <- data.table(col1      = c("a", "b", "c", "c", "a"),
                 col2      = c("b", "a", "c", "a", "b"),
                 condition = c(TRUE, FALSE, FALSE, TRUE, FALSE))
col1 col2 condition
1: a b TRUE
2: b a FALSE
3: c c FALSE
4: c a TRUE
5: a b FALSE
and would like to remove rows on the following conditions:
each row for which condition==TRUE (rows 1 and 4)
each row that has the same values for col1 and col2 as a row with condition==TRUE (that is row 5: col1=a, col2=b)
finally, each row that matches a condition==TRUE row with col1 and col2 switched (that is row 2: col1=b, col2=a)
So only row 3 should stay.
I'm doing this by making a new data.table DTcond with all rows meeting the condition, looping over its values for col1 and col2, and collecting the indices of the DT rows to be removed.
DTcond <- DT[condition == TRUE, ]
indices <- c()
for (i in 1:nrow(DTcond)) {
  n1 <- DTcond[i, col1]
  n2 <- DTcond[i, col2]
  indices <- c(indices,
               DT[(col1 == n1 & col2 == n2) | (col1 == n2 & col2 == n1), which = TRUE])
}
DT[!indices, ]
col1 col2 condition
1: c c FALSE
This works but is terribly slow for large datasets, and I guess there must be other ways in data.table to do this without loops or apply. Any suggestions on how I could improve this (I'm new to data.table)?
You can do an anti join:
# mDT holds the pairs to drop: the condition == TRUE rows plus their column-swapped versions
mDT = DT[(condition), !"condition"][, rbind(.SD, rev(.SD), use.names = FALSE)]
DT[!mDT, on = names(mDT)]  # anti join: keep only DT rows matching none of mDT's pairs
# col1 col2 condition
# 1: c c FALSE
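For reference, the intermediate mDT looks like this: rev(.SD) reverses the column order of the condition == TRUE pairs, and rbind(..., use.names = FALSE) stacks the swapped pairs under the originals:
mDT
#    col1 col2
# 1:    a    b
# 2:    c    a
# 3:    b    a
# 4:    a    c
The anti join then keeps only the rows of DT whose (col1, col2) combination appears nowhere in mDT.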

Check frequency of data.table value in other data.table

library(data.table)
DT1 <- data.table(num = 1:6, group = c("A", "B", "B", "B", "A", "C"))
DT2 <- data.table(group = c("A", "B", "C"))
I want to add a column popular to DT2 with value TRUE whenever DT2$group is contained in DT1$group at least twice. So, in the example above, DT2 should be
group popular
1: A TRUE
2: B TRUE
3: C FALSE
What would be an efficient way to get to this?
Update: DT2 may actually contain more groups than DT1, so here's an updated example:
DT1 <- data.table(num = 1:6, group = c("A", "B", "B", "B", "A", "C"))
DT2 <- data.table(group = c("A", "B", "C", "D"))
And the desired output would be
group popular
1: A TRUE
2: B TRUE
3: C FALSE
4: D FALSE
I'd just do it this way:
## 1.9.4+
setkey(DT1, group)
DT1[J(DT2$group), list(popular = .N >= 2L), by = .EACHI]
# group popular
# 1: A TRUE
# 2: B TRUE
# 3: C FALSE
# 4: D FALSE ## on the updated example
data.table's join syntax is quite powerful: while joining, you can also aggregate / select / update columns in j. Here we perform a join, and for each row of DT2$group we evaluate the j-expression .N >= 2L on the corresponding matching rows of DT1; specifying by = .EACHI (please check the 1.9.4 NEWS) evaluates the j-expression once for each row of i.
In 1.9.4, .() was introduced as an alias in all of i, j and by. So you could also do:
DT1[.(DT2$group), .(popular = .N >= 2L), by = .EACHI]
When you're joining by a single character column, you can drop the .() / J() syntax altogether (for convenience), so this can also be written as:
DT1[DT2$group, .(popular = .N >= 2L), by = .EACHI]
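Since data.table 1.9.6 you can also pass the join column ad hoc via the on argument, so no prior setkey() is needed; a minimal sketch on the updated example:
DT1[DT2, .(popular = .N >= 2L), on = "group", by = .EACHI]
#    group popular
# 1:     A    TRUE
# 2:     B    TRUE
# 3:     C   FALSE
# 4:     D   FALSE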
This is how I would do it: first count the number of times each group appears in DT1, then simply join DT2 and DT1.
require(data.table)
DT1 <- data.table(num = 1:6, group = c("A", "B", "B", "B", "A", "C"))
DT2 <- data.table(group = c("A", "B", "C"))
#solution:
DT1[, num_counts := .N, by = group]  # the number of entries in this group: just count via the other column
setkey(DT1, group)
setkey(DT2, group)
DT2 = DT1[DT2, mult = "last"][, list(group, popular = (num_counts >= 2))]
#> DT2
# group popular
#1: A TRUE
#2: B TRUE
#3: C FALSE

Is there a %in% operator across multiple columns

Imagine you have two data frames
df1 <- data.frame(V1 = c(1, 2, 3), v2 = c("a", "b", "c"))
df2 <- data.frame(V1 = c(1, 2, 2), v2 = c("b", "b", "c"))
Here's what they look like, side by side:
> cbind(df1, df2)
V1 v2 V1 v2
1 1 a 1 b
2 2 b 2 b
3 3 c 2 c
You want to know which observations are duplicates, across all variables.
This can be done by pasting the columns together and then using %in%:
df1Vec <- apply(df1, 1, paste, collapse = "")
df2Vec <- apply(df2, 1, paste, collapse = "")
df2Vec %in% df1Vec
[1] FALSE TRUE FALSE
The second observation is thus the only one in df2 that is also in df1.
Is there no faster way of generating this output - something like %IN%, which is %in% across multiple variables, or should we just be content with the apply(paste) solution?
I would go with
interaction(df2) %in% interaction(df1)
# [1] FALSE TRUE FALSE
You can wrap it in a binary operator:
"%IN%" <- function(x, y) interaction(x) %in% interaction(y)
Then
df2 %IN% df1
# [1] FALSE TRUE FALSE
rbind(df2, df2) %IN% df1
# [1] FALSE TRUE FALSE FALSE TRUE FALSE
Disclaimer: I have somewhat modified my answer from a previous one that was using do.call(paste, ...) instead of interaction(...). Consult the history if you like. I think that Arun's claims about "terrible inefficiency" (a bit extreme IMHO) still hold but if you like a concise solution that uses base R only and is fast-ish with small-ish data that's probably it.
Calling duplicated on a data.frame, or using paste, coerces all columns to character type, which is terribly inefficient as the data size grows. The duplicated.data.table method does not coerce them to character and is therefore quite efficient and scales well.
Here's one way using data.table:
library(data.table)

`%dtIN%` <- function(y, x) {
  tmp = rbindlist(list(x, y))  # stack x's rows first, then y's
  len_ = nrow(x)
  # grouping on all columns: a group "matches" iff it contains at least
  # one row from x (some .I <= len_) and more than one row in total
  tmp[, idx := any(.I <= len_) & .N > 1L, by = names(tmp)]
  tail(tmp$idx, nrow(y))       # return the flags for y's rows only
}
# example:
df1 <- data.frame(V1 = c(1, 2, 3), v2 = c("a", "b", "c"))
df2 <- data.frame(V1 = c(1, 2, 1, 2, 1), v2 = c("b", "b", "b", "c", "b"))
df2 %dtIN% df1
# [1] FALSE TRUE FALSE FALSE FALSE
Benchmarks:
flodel's (earlier) benchmark is nice (see the edit history), but it doesn't really showcase the true effects of this unnecessary coercion, because the entire data size is:
print(object.size(df1), units="Kb") # 783.8 Kb
less than 1 MB. Let's construct a little bigger data set to see the effect.
First benchmark:
set.seed(45L)
df1 <- data.frame(x = sample(paste0("V", 1:1000), 1e7, TRUE),
                  y = sample(1e2, 1e7, TRUE), stringsAsFactors = FALSE)
df2 <- data.frame(x = sample(paste0("V", 1:700), 1e6, TRUE),
                  y = sample(1e2, 1e6, TRUE), stringsAsFactors = FALSE)
print(object.size(df1), units="Mb") # 114.5Mb
system.time(ans1 <- df2 %dtIN% df1)
# user system elapsed
# 1.896 0.296 2.265
system.time(ans2 <- df2 %IN% df1)
# user system elapsed
# 13.014 0.510 14.417
identical(ans1, ans2) # [1] TRUE
Flodel's solution is ~6.3x slower here.
Second benchmark:
Here's another example to try to convince you that it really is terribly inefficient ;):
set.seed(1L)
DF1 <- data.frame(x=rnorm(1e7), y=sample(letters, 1e7, TRUE))
DF2 <- data.frame(x=sample(DF1$x, 1e5, TRUE), y=sample(letters, 1e5, TRUE))
require(data.table)
system.time(ans1 <- DF2 %dtIN% DF1)
# user system elapsed
# 35.024 0.884 37.225
system.time(ans2 <- DF2 %IN% DF1) ## flodel's earlier answer
# user system elapsed
# 312.931 2.591 319.652
That's half a minute vs five minutes with only one numeric column, ~8.6x slower. Now who wants to add another numeric column to it and try again :)?
IIUC, flodel's new solution using interaction shouldn't be much different, because it still stores the combinations as factors, whose levels have to be characters.
But this one actually started swapping...
system.time(ans3 <- interaction(DF2) %in% interaction(DF1))
## Had to stop after ~3 min because it took 5.5GB and started to SWAP.

Access a single cell / subsetted column of a data.table

How can I access just a single cell of a data.table in the way I could for a data.frame:
mdf <- data.frame(a = c("A", "B", "C"), b = rnorm(3), c = 1:3)
mdf[ mdf$a == "B", "c" ]
[1] 2
Doing the analogous operation on a data.table returns a data.table, including the key column(s):
mdt <- data.table( mdf, key = "a" )
mdt[ "B", c ]
a c
1: B 2
mdt[ "B", c ][ , c]
[1] 2
Did I miss a parameter, or does it have to be done as in the last line?
Either of these avoids repeating the c, but they are less efficient since they compute the first [] as well as the final answer:
> mdt[ "B", ][["c"]]
[1] 2
> mdt[ "B", ][, c]
[1] 2
Recent versions of data.table make this easier:
mdt[ "B", c]
# [1] 2
The original answer returned a data.table, like:
mdt['B', 'c']
# c
# 1: 2
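With ad hoc joins (the on argument, available since data.table 1.9.6), the key isn't needed either; a short sketch assuming the same mdf as above:
mdt <- data.table(mdf)  # no key required
mdt["B", c, on = "a"]
# [1] 2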
