Check which rows in a data.table are identical

I need a solution that shows me which rows are identical but I can't find a clever solution (a solution without a bunch of complex loops). I would prefer a data.table solution.
What I want to have is a list with line numbers that have the identical entries.
An example:
library(data.table)
Data <- data.table(A = c("a", "a", "c"),
                   B = c("A", "A", "B"))
The first and the second line are identical.
My desired output:
[[1]]
[1] 1 2
[[2]]
[1] 3

Here is something quick and dirty:
Data[, .(.I, .GRP), by = .(A, B)][, list(split(I, GRP))]$V1
Could be simplified to:
Data[, .(list(.I)), by = .(A, B)]$V1
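Run on the example data, this one-liner reproduces the desired output; each list element holds the row numbers of one set of identical rows:
## [[1]]
## [1] 1 2
##
## [[2]]
## [1] 3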

That was my solution until sindri_baldur came up with a better solution:
library(purrr)  # for map() and the %>% pipe

Data.unique <- unique(Data)
Data.unique[, G := .I]
Data[, I := .I]
Data.full <- merge(Data, Data.unique, by = c("A", "B"))
Data.full %>%
  split(by = "G") %>%
  map(~ .x[, I])
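This yields the same groups of row numbers, named by the group id G that split() used:
## $`1`
## [1] 1 2
##
## $`2`
## [1] 3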

Related

Replace strings in variable using lookup vector

I have a dataframe df with a character variable, along with lookup vectors fromvec and tovec:
library(tibble)
df <- tibble(var = c("A", "B", "C", "a", "E", "D", "b"))
fromvec <- c("A", "B", "C")
tovec <- c("X", "Y", "Z")
I want to look up each string of df in fromvec and replace it with the corresponding string in tovec, so that "A" in df becomes "X", "B" becomes "Y", and so on, giving the desired_df.
desired_df <- tibble(var = c("X", "Y", "Z", "X", "E", "D", "Y"))
I tried the following, but it doesn't give the desired result:
library(dplyr)
library(stringr)
from_vec <- paste(fromvec, collapse = "|")
to_vec <- paste(tovec, collapse = "|")
undesired_df <- df %>%
  mutate(var = str_replace(str_to_upper(var), from_vec, to_vec))
It instead gives this:
tibble(var = c("X|Y|Z", "X|Y|Z", "X|Y|Z", "X|Y|Z", "E", "D", "X|Y|Z"))
How can I get the desired_df?
You could use chartr, which translates character by character (it works here because every lookup string is a single character):
df$var <- chartr(paste(fromvec, collapse = ""),
                 paste(tovec, collapse = ""),
                 toupper(df$var))
# # A tibble: 7 x 1
# var
# <chr>
# 1 X
# 2 Y
# 3 Z
# 4 X
# 5 E
# 6 D
# 7 Y
Or we can use recode, splicing in a named vector with !!! (the names are the values to match, the values are the replacements):
library(dplyr)
df$var <- recode(toupper(df$var), !!!setNames(tovec, fromvec))
If you really want to use str_replace you could do:
library(purrr)
library(stringr)
df$var <- reduce2(fromvec, tovec, str_replace, .init = toupper(df$var))
Here reduce2 threads the result through successive str_replace calls, one per (from, to) pair.
The correct way to do this with stringr is with str_replace_all, which accepts a named vector of replacements:
mutate(df, var = str_replace_all(str_to_upper(var), setNames(tovec, fromvec)))
(thanks, @Moody_Mudskipper!)
We can use base R:
with(df, ifelse(toupper(var) %in% fromvec,
                setNames(tovec, fromvec)[toupper(var)], var))
#[1] "X" "Y" "Z" "X" "E" "D" "Y"
which can also be written in two lines by first creating a logical index:
i1 <- toupper(df$var) %in% fromvec
df$var[i1] <- setNames(tovec, fromvec)[toupper(df$var)[i1]]
Or using data.table:
library(data.table)
setDT(df)[toupper(var) %in% fromvec, var := setNames(tovec, fromvec)[toupper(var)]]
It's not clear that the result should be case-insensitive.
In my opinion, replacement (update) operations involving an indeterminate number of changes are best accomplished with joins. A join also cements the good practice of tracking your changes in a separate dataframe.
Unfortunately, the tidyverse has no "update dataframe" function, a glaring omission, so tidyverse users must fall back on a work-around: coalesce.
# JOIN operation
tibble(fromvec, tovec) %>%                        #< dataframe of changes
  right_join(df, by = c("fromvec" = "var")) %>%   #< join operation
  transmute(var = coalesce(tovec, fromvec))       #< coalesce work-around
# A tibble: 7 x 1
var
<chr>
1 X
2 Y
3 Z
4 a
5 E
6 D
7 b
If a case insensitive operation is preferred, consider inserting str_to_upper in the pipeline:
tibble(fromvec, tovec) %>%
  right_join(df %>% mutate(var = str_to_upper(var)),  #< modify case
             by = c("fromvec" = "var")) %>%
  transmute(var = coalesce(tovec, fromvec))
# A tibble: 7 x 1
var
<chr>
1 X
2 Y
3 Z
4 X
5 E
6 D
7 Y

Use of match within i of data.table

The %in% operator is a wrapper around the match function and returns "a vector of the same length as x". For instance:
> match(c("a", "b", "c"), c("a", "a"), nomatch = 0) > 0
## [1] TRUE FALSE FALSE
When used within i of data.table, however
(dt1 <- data.table(v1 = c("a", "b", "c"), v2 = "dt1"))
v1 v2
1: a dt1
2: b dt1
3: c dt1
(dt2 <- data.table(v1 = c("a", "a"), v2 = "dt2"))
v1 v2
1: a dt2
2: a dt2
dt1[v1 %in% dt2$v1]
v1 v2
1: a dt1
2: a dt1
duplicates are returned. Shouldn't %in% within i of data.table give the same result as
dt1[dt1$v1 %in% dt2$v1]
v1 v2
1: a dt1
i.e. without duplicates?
This was a bug in data.table's automatic indexing in versions < 1.9.5; it was fixed in versions >= 1.9.5.
I can think of 3 possible workarounds:
Disable the auto indexing, so that base R %in% is used:
options(datatable.auto.index = FALSE)
dt1[v1 %in% dt2$v1]
## v1 v2
## 1: a dt1
Use the built-in %chin% operator, which is both more efficient and unaffected by this bug (it works only for character vector comparisons):
dt1[v1 %chin% dt2$v1]
## v1 v2
## 1: a dt1
Install the development version from GitHub (close all your R sessions first and reopen just one):
library(devtools)
install_github("Rdatatable/data.table", build_vignettes = FALSE)
library(data.table)
dt1 <- data.table(v1 = c("a", "b", "c"), v2 = "dt1")
dt2 <- data.table(v1 = c("a", "a"), v2 = "dt2")
dt1[v1 %in% dt2$v1]
## v1 v2
## 1: a dt1

Grouping factor levels in a data.table

I'm trying to combine factor levels in a data.table & wondering if there's a data.table-y way to do so.
Example:
DT = data.table(id = 1:20, ind = as.factor(sample(8, 20, replace = TRUE)))
I want to say types 1,3,8 are in group A; 2 and 4 are in group B; and 5,6,7 are in group C.
Here's what I've been doing, which has been quite slow in the full version of the problem:
DT[ind %in% c(1, 3, 8), grp := as.factor("A")]
DT[ind %in% c(2, 4), grp := as.factor("B")]
DT[ind %in% c(5, 6, 7), grp := as.factor("C")]
Another approach, suggested by this related question, would I guess translate like so:
DT[ , grp := ind]
levels(DT$grp) = c("A", "B", "A", "B", "C", "C", "C", "A")
Or perhaps (given I've got 65 underlying groups and 18 aggregated groups, this feels a little neater)
DT[ , grp := ind]
lev <- letters[1:8]
lev[c(1, 3, 8)] <- "A"
lev[c(2, 4)] <- "B"
lev[5:7] <- "C"
levels(DT$grp) <- lev
Both of these seem unwieldy; does this seem like the appropriate way to do this in data.table?
For reference, I timed a beefed up version of this with 10,000,000 observations and some more subgroup/supergroup levels. My original approach is slowest (having to run all those logic checks is costly), the second the fastest, and the third a close second. But I like the readability of that approach better.
(Keying DT before searching speeds things up, but it only halves the gap vis-a-vis the latter two methods)
Update:
I recently learned of a much simpler way to re-associate factor levels, from this question and a closer reading of ?levels. No merge or correspondence table is necessary; just pass a named list to levels:
levels(DT$ind) = list(A = c(1, 3, 8), B = c(2, 4), C = 5:7)
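A minimal check of how the named-list form behaves (the list names become the new levels, and each element's values are matched against the existing levels):
f <- factor(c(1, 2, 3, 5, 8))
levels(f) <- list(A = c(1, 3, 8), B = c(2, 4), C = 5:7)
f
## [1] A B A C A
## Levels: A B C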
Original Answer:
As suggested by #Arun we have the option of creating the correspondence as a separate data.table, then joining it to the original:
match_dt = data.table(ind = as.factor(1:12),
                      grp = as.factor(c("A", "B", "A", "B", "C", "C",
                                        "C", "A", "D", "E", "F", "D")))
setkey(DT, ind)
setkey(match_dt, ind)
DT = match_dt[DT]
We can also do this in (what I consider) a more readable fashion, at a marginal speed cost:
levels <- letters[1:12]
levels[c(1, 3, 8)] <- "A"
levels[c(2, 4)] <- "B"
levels[5:7] <- "C"
levels[c(9, 12)] <- "D"
levels[10] <- "E"
levels[11] <- "F"
match_dt <- data.table(ind = as.factor(1:12),
                       grp = as.factor(levels))
setkey(DT, ind)
setkey(match_dt, ind)
DT = match_dt[DT]

Check frequency of data.table value in other data.table

library(data.table)
DT1 <- data.table(num = 1:6, group = c("A", "B", "B", "B", "A", "C"))
DT2 <- data.table(group = c("A", "B", "C"))
I want to add a column popular to DT2 with value TRUE whenever DT2$group is contained in DT1$group at least twice. So, in the example above, DT2 should be
group popular
1: A TRUE
2: B TRUE
3: C FALSE
What would be an efficient way to get to this?
Update: DT2 may actually contain more groups than DT1, so here's an updated example:
DT1 <- data.table(num = 1:6, group = c("A", "B", "B", "B", "A", "C"))
DT2 <- data.table(group = c("A", "B", "C", "D"))
And the desired output would be
group popular
1: A TRUE
2: B TRUE
3: C FALSE
4: D FALSE
I'd just do it this way:
## 1.9.4+
setkey(DT1, group)
DT1[J(DT2$group), list(popular = .N >= 2L), by = .EACHI]
# group popular
# 1: A TRUE
# 2: B TRUE
# 3: C FALSE
# 4: D FALSE ## on the updated example
data.table's join syntax is quite powerful: while joining, you can also aggregate / select / update columns in j. Here we perform a join, and for each row in DT2$group we compute the j-expression .N >= 2L on the corresponding matching rows in DT1; specifying by = .EACHI (please check the 1.9.4 NEWS) evaluates the j-expression once for each row of i.
In 1.9.4, .() was introduced as an alias throughout i, j and by. So you could also do:
DT1[.(DT2$group), .(popular = .N >= 2L), by = .EACHI]
When you're joining by a single character column, you can drop the .() / J() syntax altogether (for convenience). So this can also be written as:
DT1[DT2$group, .(popular = .N >= 2L), by = .EACHI]
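As a side note, when an i value has no match in DT1 (such as "D" in the updated example), .N is 0 within by = .EACHI, which is why "D" yields FALSE rather than NA. A quick check (assuming DT1 is keyed by group as above):
DT1[.("D"), .N, by = .EACHI]
##    group N
## 1:     D 0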
This is how I would do it: first count the number of times each group appears in DT1, then simply join DT2 and DT1.
require(data.table)
DT1 <- data.table(num = 1:6, group = c("A", "B", "B", "B", "A", "C"))
DT2 <- data.table(group = c("A", "B", "C"))
#solution:
DT1[, num_counts := .N, by = group]  # the number of entries in this group; just count the other column
setkey(DT1, group)
setkey(DT2, group)
DT2 = DT1[DT2, mult = "last"][, list(group, popular = (num_counts >= 2))]
#> DT2
# group popular
#1: A TRUE
#2: B TRUE
#3: C FALSE
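One caveat with this approach: for a group present in DT2 but absent from DT1 (like "D" in the updated example), num_counts is NA after the join, so popular comes out NA rather than FALSE. A small follow-up fix is to replace those NAs:
DT2[is.na(popular), popular := FALSE]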

Is there a %in% operator across multiple columns

Imagine you have two data frames
df1 <- data.frame(V1 = c(1, 2, 3), v2 = c("a", "b", "c"))
df2 <- data.frame(V1 = c(1, 2, 2), v2 = c("b", "b", "c"))
Here's what they look like, side by side:
> cbind(df1, df2)
V1 v2 V1 v2
1 1 a 1 b
2 2 b 2 b
3 3 c 2 c
You want to know which observations are duplicates, across all variables.
This can be done by pasting the columns together and then using %in%:
df1Vec <- apply(df1, 1, paste, collapse = "")
df2Vec <- apply(df2, 1, paste, collapse = "")
df2Vec %in% df1Vec
[1] FALSE TRUE FALSE
The second observation is thus the only row of df2 that also appears in df1.
Is there a faster way of generating this output (something like %IN%, an %in% that works across multiple variables), or should we just be content with the apply(paste) solution?
I would go with
interaction(df2) %in% interaction(df1)
# [1] FALSE TRUE FALSE
You can wrap it in a binary operator:
"%IN%" <- function(x, y) interaction(x) %in% interaction(y)
Then
df2 %IN% df1
# [1] FALSE TRUE FALSE
rbind(df2, df2) %IN% df1
# [1] FALSE TRUE FALSE FALSE TRUE FALSE
Disclaimer: I have modified my answer from a previous version that used do.call(paste, ...) instead of interaction(...); consult the edit history if you like. I think Arun's claims of "terrible inefficiency" (a bit extreme, IMHO) still hold, but if you want a concise solution that uses only base R and is fast-ish on small-ish data, this is probably it.
Calling duplicated on a data.frame, or using paste, coerces all columns to character type, which is terribly inefficient as the data size grows. The duplicated.data.table method does not coerce columns to character and is therefore quite efficient and scales well.
Here's one way using data.table:
`%dtIN%` <- function(y, x) {
  # stack x on top of y, then flag each group of identical rows that
  # (a) contains at least one row from x and (b) has more than one row
  tmp = rbindlist(list(x, y))
  len_ = nrow(x)
  tmp[, idx := any(.I <= len_) & .N > 1L, by = names(tmp)]
  # the last nrow(y) flags correspond to the rows of y
  tail(tmp$idx, nrow(y))
}
# example:
df1 <- data.frame(V1 = c(1, 2, 3), v2 = c("a", "b", "c"))
df2 <- data.frame(V1 = c(1, 2, 1, 2, 1), v2 = c("b", "b", "b", "c", "b"))
df2 %dtIN% df1
# [1] FALSE TRUE FALSE FALSE FALSE
Benchmarks:
@flodel's (earlier) benchmark is nice (see the history), but it doesn't really showcase the true effect of this unnecessary coercion, because the entire data size is:
print(object.size(df1), units="Kb") # 783.8 Kb
less than 1 MB. Let's construct a little bigger data set to see the effect.
First benchmark:
set.seed(45L)
df1 <- data.frame(x = sample(paste0("V", 1:1000), 1e7, TRUE),
                  y = sample(1e2, 1e7, TRUE), stringsAsFactors = FALSE)
df2 <- data.frame(x = sample(paste0("V", 1:700), 1e6, TRUE),
                  y = sample(1e2, 1e6, TRUE), stringsAsFactors = FALSE)
print(object.size(df1), units="Mb") # 114.5Mb
system.time(ans1 <- df2 %dtIN% df1)
# user system elapsed
# 1.896 0.296 2.265
system.time(ans2 <- df2 %IN% df1)
# user system elapsed
# 13.014 0.510 14.417
identical(ans1, ans2) # [1] TRUE
@flodel's solution is ~6.3x slower here.
Second benchmark:
Here's another example to try to convince you that it really is terribly inefficient ;):
set.seed(1L)
DF1 <- data.frame(x = rnorm(1e7), y = sample(letters, 1e7, TRUE))
DF2 <- data.frame(x = sample(DF1$x, 1e5, TRUE), y = sample(letters, 1e5, TRUE))
require(data.table)
system.time(ans1 <- DF2 %dtIN% DF1)
# user system elapsed
# 35.024 0.884 37.225
system.time(ans2 <- DF2 %IN% DF1) ## flodel's earlier answer
# user system elapsed
# 312.931 2.591 319.652
That's half a minute vs. 5 minutes with only one numeric column, ~8.6x. Now, who wants to add another numeric column and try again :)?
IIUC, @flodel's new solution using interaction shouldn't be much different, because it still stores the combinations as factors, whose levels have to be characters.
But this one actually started swapping...
system.time(ans3 <- interaction(DF2) %in% interaction(DF1))
## Had to stop after ~3 min because it took 5.5GB and started to SWAP.
