Check frequency of data.table value in other data.table - r

library(data.table)
DT1 <- data.table(num = 1:6, group = c("A", "B", "B", "B", "A", "C"))
DT2 <- data.table(group = c("A", "B", "C"))
I want to add a column popular to DT2 with value TRUE whenever DT2$group is contained in DT1$group at least twice. So, in the example above, DT2 should be
group popular
1: A TRUE
2: B TRUE
3: C FALSE
What would be an efficient way to get to this?
Updated example: DT2 may actually contain more groups than DT1, so here's an updated example:
DT1 <- data.table(num = 1:6, group = c("A", "B", "B", "B", "A", "C"))
DT2 <- data.table(group = c("A", "B", "C", "D"))
And the desired output would be
group popular
1: A TRUE
2: B TRUE
3: C FALSE
4: D FALSE

I'd just do it this way:
## 1.9.4+
setkey(DT1, group)
DT1[J(DT2$group), list(popular = .N >= 2L), by = .EACHI]
# group popular
# 1: A TRUE
# 2: B TRUE
# 3: C FALSE
# 4: D FALSE ## on the updated example
data.table's join syntax is quite powerful: while joining, you can also aggregate / select / update columns in j. Here we perform a join. For each row of DT2$group, we find the matching rows in DT1 and compute the j-expression .N >= 2L on them; by specifying by = .EACHI (please check the 1.9.4 NEWS), the j-expression is evaluated once for each row of i.
In 1.9.4, .() has been introduced as an alias in all i, j and by. So you could also do:
DT1[.(DT2$group), .(popular = .N >= 2L), by = .EACHI]
When you're joining by a single character column, you can drop the .() / J() syntax altogether (for convenience). So this can be also written as:
DT1[DT2$group, .(popular = .N >= 2L), by = .EACHI]
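On data.table 1.9.6 or later, the same join also works without setting a key, by passing the on= argument. A minimal sketch (the on= form and assigning the result straight into DT2 are the only changes from the code above):
DT2[, popular := DT1[DT2, .N >= 2L, on = "group", by = .EACHI]$V1]
DT2
#    group popular
# 1:     A    TRUE
# 2:     B    TRUE
# 3:     C   FALSE
# 4:     D   FALSE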

This is how I would do it: first count the number of times each group appears in DT1, then simply join DT2 and DT1.
require(data.table)
DT1 <- data.table(num = 1:6, group = c("A", "B", "B", "B", "A", "C"))
DT2 <- data.table(group = c("A", "B", "C"))
#solution:
DT1[, num_counts := .N, by = group]  # number of entries in each group
setkey(DT1, group)
setkey(DT2, group)
DT2 = DT1[DT2, mult = "last"][, list(group, popular = (num_counts >= 2))]
#> DT2
# group popular
#1: A TRUE
#2: B TRUE
#3: C FALSE
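One caveat with this approach: on the updated example, a group present in DT2 but absent from DT1 (like "D") gets num_counts = NA from the join, so popular comes out NA rather than FALSE. A sketch of one way to handle that, wrapping the comparison in an explicit NA check (the counts / res names are mine, and on= assumes data.table 1.9.6+):
counts <- DT1[, .(num_counts = .N), by = group]   # per-group counts in DT1
DT2 <- data.table(group = c("A", "B", "C", "D"))  # updated example, "D" absent from DT1
res <- counts[DT2, on = "group"]                  # right join; "D" gets num_counts = NA
res[, popular := !is.na(num_counts) & num_counts >= 2][, num_counts := NULL]
res
#    group popular
# 1:     A    TRUE
# 2:     B    TRUE
# 3:     C   FALSE
# 4:     D   FALSE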

Related

Check which rows in a data.table are identical

I need to find which rows of a data.table are identical, but I can't come up with a clever way to do it (one without a bunch of complex loops). I would prefer a data.table solution.
What I want is a list of the row numbers that share identical entries.
An example:
library(data.table)
Data <- data.table(A = c("a", "a", "c"),
                   B = c("A", "A", "B"))
The first and the second line are identical.
My desired output:
[[1]]
[1] 1 2
[[2]]
[1] 3
Here is something quick and dirty:
Data[, .(.I, .GRP), by = .(A, B)][, list(split(I, GRP))]$V1
Could be simplified to:
Data[, .(list(.I)), by = .(A, B)]$V1
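Running that simplified one-liner on the example data reproduces the desired output exactly:
Data[, .(list(.I)), by = .(A, B)]$V1
# [[1]]
# [1] 1 2
#
# [[2]]
# [1] 3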
For reference, the following was my solution until sindri_baldur came up with the better approach above:
library(purrr)  # for %>% and map()
Data.unique <- unique(Data)
Data.unique[, G := .I]  # group id for each unique row
Data[, I := .I]         # original row number
Data.full <- merge(Data, Data.unique, by = c("A", "B"))
Data.full %>%
  split(by = "G") %>%
  map(~ .x[, I])

Remove rows from data.table that meet condition

I have a data table
DT <- data.table(col1=c("a", "b", "c", "c", "a"), col2=c("b", "a", "c", "a", "b"), condition=c(TRUE, FALSE, FALSE, TRUE, FALSE))
col1 col2 condition
1: a b TRUE
2: b a FALSE
3: c c FALSE
4: c a TRUE
5: a b FALSE
and would like to remove rows on the following conditions:
each row for which condition==TRUE (rows 1 and 4)
each row that has the same values for col1 and col2 as a row for which the condition==TRUE (that is row 5, col1=a, col2=b)
finally each row that has the same values for col1 and col2 for which condition==TRUE, but with col1 and col2 switched (that is row 2, col1=b and col2=a)
So only row 3 should stay.
I'm doing this by making a new data.table DTcond with all rows meeting the condition, looping over the values of col1 and col2, and collecting the indices of the rows in DT to be removed.
DTcond <- DT[condition == TRUE, ]
indices <- c()
for (i in 1:nrow(DTcond)) {
  n1 <- DTcond[i, col1]
  n2 <- DTcond[i, col2]
  indices <- c(indices, DT[(col1 == n1 & col2 == n2) | (col1 == n2 & col2 == n1), which = TRUE])
}
DT[!indices, ]
col1 col2 condition
1: c c FALSE
This works, but it is terribly slow for large datasets, and I guess there must be other ways in data.table to do this without loops or apply. Any suggestions on how I could improve this (I'm new to data.table)?
You can do an anti join:
mDT = DT[(condition), !"condition"][, rbind(.SD, rev(.SD), use.names = FALSE)]
DT[!mDT, on=names(mDT)]
# col1 col2 condition
# 1: c c FALSE
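To unpack why this works: DT[(condition), !"condition"] keeps the flagged rows and drops the condition column; rev(.SD) reverses the column order, so rbind(.SD, rev(.SD), use.names = FALSE) stacks each flagged (col1, col2) pair together with its swapped version; and the ! in DT[!mDT, on = names(mDT)] performs the anti join, keeping only the rows of DT that match none of those pairs. Inspecting the intermediate table on the example data makes this concrete:
mDT
#    col1 col2
# 1:    a    b
# 2:    c    a
# 3:    b    a
# 4:    a    c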

Use of match within i of data.table

The %in% operator is a wrapper for the match function returning "a vector of the same length as x". For instance:
> match(c("a", "b", "c"), c("a", "a"), nomatch = 0) > 0
## [1] TRUE FALSE FALSE
When used within i of data.table, however
(dt1 <- data.table(v1 = c("a", "b", "c"), v2 = "dt1"))
v1 v2
1: a dt1
2: b dt1
3: c dt1
(dt2 <- data.table(v1 = c("a", "a"), v2 = "dt2"))
v1 v2
1: a dt2
2: a dt2
dt1[v1 %in% dt2$v1]
v1 v2
1: a dt1
2: a dt1
duplicates are obtained. Should the expected behaviour of %in% within i of data.table not give the same result as
dt1[dt1$v1 %in% dt2$v1]
v1 v2
1: a dt1
i.e. without duplicates?
This was a bug in the automatic indexing of data.table versions < 1.9.5 that was fixed in versions >= 1.9.5.
I can think of 3 possible workarounds:
Disable the auto indexing and use base R %in% as in
options(datatable.auto.index = FALSE)
dt1[v1 %in% dt2$v1]
## v1 v2
## 1: a dt1
Use the built-in %chin% operator, which is both more efficient and doesn't have this bug (it works only for character vector comparisons; a quick benchmark sketch follows after the workarounds):
dt1[v1 %chin% dt2$v1]
## v1 v2
## 1: a dt1
Install the development version from GitHub (close all your R sessions first and reopen just one):
library(devtools)
install_github("Rdatatable/data.table", build_vignettes = FALSE)
library(data.table)
dt1 <- data.table(v1 = c("a", "b", "c"), v2 = "dt1")
dt2 <- data.table(v1 = c("a", "a"), v2 = "dt2")
dt1[v1 %in% dt2$v1]
## v1 v2
## 1: a dt1
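As for the efficiency claim in workaround 2, here is a rough way to check it yourself. This is only a sketch, assuming the microbenchmark package is installed:
library(data.table)
library(microbenchmark)
x <- sample(letters, 1e6, replace = TRUE)
tbl <- c("a", "b")
microbenchmark(base = x %in% tbl, chin = x %chin% tbl)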

Grouping factor levels in a data.table

I'm trying to combine factor levels in a data.table & wondering if there's a data.table-y way to do so.
Example:
DT = data.table(id = 1:20, ind = as.factor(sample(8, 20, replace = TRUE)))
I want to say types 1,3,8 are in group A; 2 and 4 are in group B; and 5,6,7 are in group C.
Here's what I've been doing, which has been quite slow in the full version of the problem:
DT[ind %in% c(1, 3, 8), grp := as.factor("A")]
DT[ind %in% c(2, 4), grp := as.factor("B")]
DT[ind %in% c(5, 6, 7), grp := as.factor("C")]
Another approach, suggested by this related question, would I guess translate like so:
DT[ , grp := ind]
levels(DT$grp) = c("A", "B", "A", "B", "C", "C", "C", "A")
Or perhaps (given I've got 65 underlying groups and 18 aggregated groups, this feels a little neater)
DT[ , grp := ind]
lev <- letters[1:8]
lev[c(1, 3, 8)] <- "A"
lev[c(2, 4)] <- "B"
lev[5:7] <- "C"
levels(DT$grp) <- lev
Both of these seem unwieldy; does this seem like the appropriate way to do this in data.table?
For reference, I timed a beefed up version of this with 10,000,000 observations and some more subgroup/supergroup levels. My original approach is slowest (having to run all those logic checks is costly), the second the fastest, and the third a close second. But I like the readability of that approach better.
(Keying DT before searching speeds things up, but it only halves the gap vis-a-vis the latter two methods)
Update:
I recently learned of a much simpler way to re-associate factor levels from this question and a closer reading of ?levels. No merges, correspondence tables, etc. are necessary; just pass a named list to levels:
levels(DT$ind) = list(A = c(1, 3, 8), B = c(2, 4), C = 5:7)
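A self-contained check (the set.seed call is my addition, so the sample is reproducible; it is not part of the original example):
library(data.table)
set.seed(1)
DT <- data.table(id = 1:20, ind = as.factor(sample(8, 20, replace = TRUE)))
levels(DT$ind) <- list(A = c(1, 3, 8), B = c(2, 4), C = 5:7)
levels(DT$ind)
# [1] "A" "B" "C"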
Original Answer:
As suggested by #Arun we have the option of creating the correspondence as a separate data.table, then joining it to the original:
match_dt = data.table(ind = as.factor(1:12),
                      grp = as.factor(c("A", "B", "A", "B", "C", "C",
                                        "C", "A", "D", "E", "F", "D")))
setkey(DT, ind)
setkey(match_dt, ind)
DT = match_dt[DT]
We can also do this in (what I consider to be) the more readable fashion like so (with marginal speed costs):
levels <- letters[1:12]
levels[c(1, 3, 8)] <- "A"
levels[c(2, 4)] <- "B"
levels[5:7] <- "C"
levels[c(9, 12)] <- "D"
levels[10] <- "E"
levels[11] <- "F"
match_dt <- data.table(ind = as.factor(1:12),
                       grp = as.factor(levels))
setkey(DT, ind)
setkey(match_dt, ind)
DT = match_dt[DT]
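On more recent versions (1.9.6+ for on=), the correspondence table can also be applied as an update join, which modifies DT by reference without setting keys or reassigning it; a sketch using the match_dt defined above:
DT[match_dt, grp := i.grp, on = "ind"]  # i.grp refers to the grp column of match_dt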

Access a single cell / subsetted column of a data.table

How can I access just a single cell of a data.table, the way I can for a data.frame:
mdf <- data.frame(a = c("A", "B", "C"), b = rnorm(3), c = 1:3)
mdf[ mdf$a == "B", "c" ]
[1] 2
Doing the analogous operation on a data.table returns a data.table, including the key column(s):
mdt <- data.table( mdf, key = "a" )
mdt[ "B", c ]
a c
1: B 2
mdt[ "B", c ][ , c]
[1] 2
Did I miss a parameter, or does it have to be done as in the last line?
Either of these will avoid repeating the c, but neither is as efficient, since each computes the intermediate [] result as well as the final answer:
> mdt[ "B", ][["c"]]
[1] 2
> mdt[ "B", ][, c]
[1] 2
Recent versions of data.table make this easier:
mdt[ "B", c]
# [1] 2
The original version of this answer returned a one-column data.table instead:
mdt['B', 'c']
# c
# 1: 2
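For completeness, a couple of other idioms that also return the bare value (a sketch; the first relies on the key set on mdt above, the second does not):
mdt["B"]$c       # subset by key, then extract the column with $
# [1] 2
mdt[a == "B", c] # ordinary logical subset; j as a bare symbol returns the vector on recent versions
# [1] 2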
