Is there a %in% operator across multiple columns - r

Imagine you have two data frames
df1 <- data.frame(V1 = c(1, 2, 3), v2 = c("a", "b", "c"))
df2 <- data.frame(V1 = c(1, 2, 2), v2 = c("b", "b", "c"))
Here's what they look like, side by side:
> cbind(df1, df2)
  V1 v2 V1 v2
1  1  a  1  b
2  2  b  2  b
3  3  c  2  c
You want to know which observations are duplicates, across all variables.
This can be done by pasting the columns together and then using %in%:
df1Vec <- apply(df1, 1, paste, collapse= "")
df2Vec <- apply(df2, 1, paste, collapse= "")
df2Vec %in% df1Vec
[1] FALSE TRUE FALSE
The second observation of df2 is thus the only one that also appears in df1.
Is there no faster way of generating this output - something like %IN%, which is %in% across multiple variables, or should we just be content with the apply(paste) solution?

I would go with
interaction(df2) %in% interaction(df1)
# [1] FALSE TRUE FALSE
You can wrap it in a binary operator:
"%IN%" <- function(x, y) interaction(x) %in% interaction(y)
Then
df2 %IN% df1
# [1] FALSE TRUE FALSE
rbind(df2, df2) %IN% df1
# [1] FALSE TRUE FALSE FALSE TRUE FALSE
Disclaimer: I have somewhat modified my answer from a previous version that used do.call(paste, ...) instead of interaction(...); consult the edit history if you like. I think Arun's claims of "terrible inefficiency" (a bit extreme, IMHO) still hold, but if you want a concise solution that uses only base R and is reasonably fast on small-ish data, this is probably it.
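For reference, the earlier paste-based version mentioned in the disclaimer was roughly along these lines (a reconstruction, not the exact code; see the answer's edit history):
"%IN2%" <- function(x, y) do.call(paste, x) %in% do.call(paste, y)
df2 %IN2% df1
# [1] FALSE  TRUE FALSE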

Calling duplicated on a data.frame, or using paste, coerces all columns to character type, which becomes terribly inefficient as the data grows. The duplicated.data.table method does not coerce columns to character and is therefore quite efficient and scales well.
Here's one way using data.table:
`%dtIN%` <- function(y, x) {
    tmp = rbindlist(list(x, y))    # stack x on top of y
    len_ = nrow(x)
    # flag a group of identical rows when it has more than one member
    # and at least one of those members comes from x (row index <= len_)
    tmp[, idx := any(.I <= len_) & .N > 1L, by = names(tmp)]
    tail(tmp$idx, nrow(y))         # return the flags for the rows of y
}
# example:
df1 <- data.frame(V1 = c(1, 2, 3), v2 = c("a", "b", "c"))
df2 <- data.frame(V1 = c(1, 2, 1, 2, 1), v2 = c("b", "b", "b", "c", "b"))
df2 %dtIN% df1
# [1] FALSE TRUE FALSE FALSE FALSE
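As a side note, the same membership test can also be written as a data.table join (a sketch added for reference, not part of the original answer): which = TRUE together with mult = "first" returns, for each row of y, the index of the first matching row in x, or NA when there is no match.
# join-based sketch; assumes the columns of x and y have the same names and compatible types
library(data.table)
"%joinIN%" <- function(y, x) {
  x <- as.data.table(x); y <- as.data.table(y)
  !is.na(x[y, on = names(y), which = TRUE, mult = "first"])
}
df2 %joinIN% df1
# [1] FALSE  TRUE FALSE FALSE FALSE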
Benchmarks:
@flodel's (earlier) benchmark is nice (see the edit history), but it doesn't really showcase the true effect of this unnecessary coercion, because the entire data size is:
print(object.size(df1), units="Kb") # 783.8 Kb
less than 1 MB. Let's construct a little bigger data set to see the effect.
First benchmark:
set.seed(45L)
df1 <- data.frame(x = sample(paste0("V", 1:1000), 1e7, TRUE),
                  y = sample(1e2, 1e7, TRUE), stringsAsFactors = FALSE)
df2 <- data.frame(x = sample(paste0("V", 1:700), 1e6, TRUE),
                  y = sample(1e2, 1e6, TRUE), stringsAsFactors = FALSE)
print(object.size(df1), units="Mb") # 114.5Mb
system.time(ans1 <- df2 %dtIN% df1)
# user system elapsed
# 1.896 0.296 2.265
system.time(ans2 <- df2 %IN% df1)
# user system elapsed
# 13.014 0.510 14.417
identical(ans1, ans2) # [1] TRUE
Flodel's solution is ~6.3x slower here.
Second benchmark:
Here's another example to try to convince you that it really is terribly inefficient ;):
set.seed(1L)
DF1 <- data.frame(x=rnorm(1e7), y=sample(letters, 1e7, TRUE))
DF2 <- data.frame(x=sample(DF1$x, 1e5, TRUE), y=sample(letters, 1e5, TRUE))
require(data.table)
system.time(ans1 <- DF2 %dtIN% DF1)
# user system elapsed
# 35.024 0.884 37.225
system.time(ans2 <- DF2 %IN% DF1) ## flodel's earlier answer
# user system elapsed
# 312.931 2.591 319.652
That's about half a minute vs. over five minutes with only one numeric column, ~8.6x. Now who wants to add another numeric column to it and try again :)?
IIUC, @flodel's new solution using interaction shouldn't be much different, because it still stores the values as factors, where the factor levels have to be characters.
But this one actually started swapping...
system.time(ans3 <- interaction(DF2) %in% interaction(DF1))
## Had to stop after ~3 min because it took 5.5GB and started to SWAP.

Related

Speeding up recoding of a character column in R

I have some data where each data point is associated with a character vector of varying length. For example, it might be generated by the following function:
library(tidyverse)
set.seed(27)
generate_keyset <- function(...) {
  sample(LETTERS[1:5], size = rpois(n = 1, lambda = 10), replace = TRUE)
}
generate_keyset()
#> [1] "A" "C" "A" "A" "A" "A" "A" "E" "C" "C" "A" "D" "A" "D" "C" "A"
I would like to summarize this keyset by converting it to a single number score. The way this works is straightforward: each key in the keyset has a value, and to get the value of the entire keyset I sum over the values. The key-value map is a tibble with several hundred entries, but you can imagine it looks like:
key_value_map <- tribble(
  ~key, ~value,
  "A",   1,
  "B",  -2,
  "C",   8,
  "D",  -4,
  "E",   0
)
Currently I am scoring keysets with the following function:
score_keyset <- function(keyset) {
  merged_keysets_to_map <- data.frame(
    key = keyset,
    stringsAsFactors = FALSE
  ) %>%
    left_join(key_value_map, by = "key")
  sum(merged_keysets_to_map$value)
}
score_keyset(LETTERS[1:4])
#> [1] 3
This works fine, except it is very slow, and I need to do this operation about a million times. For example, I would like the following to be much faster:
n <- 1e4 # in practice I have n = 1e6
fake_data <- tibble(
  keyset = map(1:n, generate_keyset)
)
library(tictoc)
tic()
scored_data <- fake_data %>%
  mutate(
    value = map_dbl(keyset, score_keyset)
  )
toc()
I am sure there is some much better way to do this with indexing, but it is escaping me at the moment. Help speeding this up is much appreciated.
Instead of doing a join and then a sum, it is more efficient to use a named vector for the lookup:
library(tibble)
sum(deframe(key_value_map)[generate_keyset()])
Checking the timings, the OP's tic/toc run above showed 45.728 sec elapsed. With the named-vector lookup:
tic()
v1 <- deframe(key_value_map)
scored_data2 <- fake_data %>%
  mutate(
    value = map_dbl(keyset, ~ sum(v1[.x]))
  )
toc()
#0.952 sec elapsed
identical(scored_data, scored_data2)
#[1] TRUE
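If the map_dbl loop itself ever becomes the bottleneck, a further vectorized sketch (an addition for reference, not part of the answer) is to unlist all keysets at once and sum per keyset with rowsum():
# fully vectorized sketch; assumes no keyset is empty
# (with rpois(lambda = 10) an empty keyset is rare but possible)
v1   <- deframe(key_value_map)
lens <- lengths(fake_data$keyset)        # number of keys in each keyset
vals <- v1[unlist(fake_data$keyset)]     # one lookup for every key at once
scored_data3 <- fake_data %>%
  mutate(value = as.vector(rowsum(vals, group = rep(seq_len(n), lens))))
identical(scored_data2$value, scored_data3$value)  # should be TRUE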

Check which rows in a data.table are identical

I need a way to find which rows of a data.table are identical, ideally without a bunch of complex loops. I would prefer a data.table solution.
What I want is a list of row numbers, grouped by identical rows.
An example:
library(data.table)
Data <- data.table(A = c("a", "a", "c"),
                   B = c("A", "A", "B"))
The first and the second line are identical.
My desired output:
[[1]]
[1] 1 2
[[2]]
[1] 3
Here is something quick and dirty:
Data[, .(.I, .GRP), by = .(A, B)][, list(split(I, GRP))]$V1
Could be simplified to:
Data[, .(list(.I)), by = .(A, B)]$V1
That was my solution until sindri_baldur came up with a better one:
Data.unique <- unique(Data)
Data.unique[, G := .I]
Data[, I := .I]
Data.full <- merge(Data,
                   Data.unique,
                   by = c("A", "B"))

library(purrr)  # for map() and the %>% pipe
Data.full %>%
  split(by = "G") %>%
  map(~ .x[, I])
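For comparison, a base R sketch (not from the original answers): split the row numbers by the interaction of all columns. Note that the groups come out in factor-level order, which may differ from the order of first appearance.
unname(split(seq_len(nrow(Data)), interaction(Data, drop = TRUE)))
# [[1]]
# [1] 1 2
#
# [[2]]
# [1] 3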

Replace strings in variable using lookup vector

I have a dataframe df with a character variable and the fromvec and tovec.
df <- tibble(var = c("A", "B", "C", "a", "E", "D", "b"))
fromvec <- c("A", "B", "C")
tovec <- c("X", "Y", "Z")
Use the strings in fromvec, find them in df, and replace them with the corresponding strings in tovec, so that "A" in df gets replaced with "X", "B" with "Y", and so on, to get the desired_df.
desired_df <- tibble(var = c("X", "Y", "Z", "X", "E", "D", "Y"))
I tried following, but not getting the desired result!
from_vec <- paste(fromvec, collapse="|")
to_vec <- paste(tovec, collapse="|")
undesired_df <- df %>%
  mutate(var = str_replace(str_to_upper(var), from_vec, to_vec))
i.e. it produces this instead:
tibble(var = c("X|Y|Z", "X|Y|Z", "X|Y|Z", "X|Y|Z", "E", "D", "X|Y|Z"))
How can I get the desired_df?
You could use chartr:
df$var <- chartr(paste(fromvec, collapse=""),
                 paste(tovec, collapse=""),
                 toupper(df$var))
# # A tibble: 7 x 1
# var
# <chr>
# 1 X
# 2 Y
# 3 Z
# 4 X
# 5 E
# 6 D
# 7 Y
Or we can use recode:
library(dplyr)
df$var <- recode(toupper(df$var), !!!setNames(tovec,fromvec))
If you really want to use str_replace you could do:
library(purrr)
library(stringr)
df$var <- reduce2(fromvec, tovec, str_replace, .init=toupper(df$var))
The correct way to do this with stringr is with str_replace_all:
mutate(df, var = str_replace_all(str_to_upper(var), setNames(tovec, fromvec)))
(thanks, #Moody_Mudskipper!)
We can use base R:
with(df, ifelse(toupper(var) %in% fromvec,
                setNames(tovec, fromvec)[toupper(var)], var))
#[1] "X" "Y" "Z" "X" "E" "D" "Y"
which can also be written in two lines by first creating a logical condition:
i1 <- toupper(df$var) %in% fromvec
df$var[i1] <- setNames(tovec, fromvec)[toupper(df$var)[i1]]
Or using data.table
library(data.table)
setDT(df)[toupper(var) %in% fromvec, var := setNames(tovec, fromvec)[toupper(var)]]
It's not clear whether the result should be case-insensitive.
In my opinion, replacement (update) operations that involve an indeterminate number of changes are best accomplished with joins. In this case, it also cements the good practice of tracking your changes in a separate dataframe.
Unfortunately, the tidyverse has no "update dataframe" function, which is a glaring omission. That means tidyverse users must use a workaround: coalesce.
# JOIN operation
tibble(fromvec, tovec) %>%                        #< dataframe of changes
  right_join(df, by = c("fromvec" = "var")) %>%   #< join operation
  transmute(var = coalesce(tovec, fromvec))       #< coalesce workaround
# A tibble: 7 x 1
var
<chr>
1 X
2 Y
3 Z
4 a
5 E
6 D
7 b
If a case insensitive operation is preferred, consider inserting str_to_upper in the pipeline:
tibble(fromvec, tovec) %>%
  right_join(df %>% mutate(var = str_to_upper(var)),   #< modify case
             by = c("fromvec" = "var")) %>%
  transmute(var = coalesce(tovec, fromvec))
# A tibble: 7 x 1
var
<chr>
1 X
2 Y
3 Z
4 X
5 E
6 D
7 Y

Create unique identifier from the interchangeable combination of two variables

I need to create a unique identifier from the combination of two variables in a data frame. Consider the following data frame:
df <- data.frame(col1 = c("a", "a", "b", "c"), col2 = c("c", "b", "c", "a"), id = c(1,2,3,1))
The variable "id" is not in the data set; that's the one I would like to create. Essentially, I want every combination of the variables col1 and col2 to be treated interchangeably, e.g. the combination of c("a", "c") is the same as c("c", "a").
You could do:
labels <- apply(df[, c("col1", "col2")], 1, sort)
df$id <- as.numeric(factor(apply(labels, 2, function(x) paste(x, collapse=""))))
A more complicated version, but quicker to run than looping over each row:
sel <- c("col1","col2")
df[sel] <- lapply(df[sel], as.character)
as.numeric(factor(apply(df[1:2], 1, function(x) toString(sort(x)) )))
#[1] 2 1 3 2
as.numeric(interaction(list(do.call(pmin,df[1:2]),do.call(pmax,df[1:2])),drop=TRUE))
#[1] 2 1 3 2
Benchmarking on 1M rows:
df2 <- df[rep(1:4, each=2.5e5),]
system.time(as.numeric(factor(apply(df2[1:2], 1, function(x) toString(sort(x)) ))))
# user system elapsed
# 69.21 0.08 69.41
system.time(as.numeric(interaction(list(do.call(pmin,df2[1:2]),do.call(pmax,df2[1:2])),drop=TRUE)))
# user system elapsed
# 0.88 0.03 0.91
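A related sketch (a variant for reference, not one of the answers above): the same pmin/pmax idea combined with paste() and match(), which numbers the pairs in order of first appearance and so reproduces the id column exactly as written in the question. It assumes col1 and col2 are character vectors (convert with as.character() first if they are factors).
lo  <- pmin(df$col1, df$col2)   # element-wise smaller value of each pair
hi  <- pmax(df$col1, df$col2)   # element-wise larger value of each pair
key <- paste(lo, hi)            # order-independent key: ("a","c") and ("c","a") both give "a c"
match(key, unique(key))
# [1] 1 2 3 1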

Merge and paste duplicate columns in R

Suppose I have two data frames with some common variable x:
df1 <- data.frame(
  x = c(1, 2, 3, 4),
  y = c("a", "b", "c", "d")
)
df2 <- data.frame(
  x = c(1, 1, 2, 2, 3, 4, 5),
  z = c("A", "B", "C", "D", "E", "F", "G")
)
We can assume that each entry of the variable we're merging over, x, appears exactly once in df1; however, it may appear an arbitrary number of times in df2.
I want to merge df2 'into' df1, while preserving df1. Is there a fast way of merging these two data frames such that the merged output would be of the form (for example):
df_merged <- data.frame(
  x = c(1, 2, 3, 4),
  y = c("a", "b", "c", "d"),
  z = c("A B", "C D", "E", "F")
)
Essentially, I want df_merged to consist of the original df1, plus any variables from df2 collapsed so they match the one-row-per-x format of df1. The various incantations of merge append new rows to the merged output, which I want to avoid.
Speed is also a priority since I'll be merging fairly large data frames.
merge(df1,
      aggregate(df2$z, df2[1], FUN=paste, collapse=" ", sep=""),
      by.x="x", by.y=1)
x y x
1 1 a A B
2 2 b C D
3 3 c E
4 4 d F
Warning message:
In merge.data.frame(df1, aggregate(df2$z, df2[1], FUN = paste, collapse = " ", :
column name ‘x’ is duplicated in the result
> M1 <- .Last.value
> names(M1)[3] <- "z"
> M1
x y z
1 1 a A B
2 2 b C D
3 3 c E
4 4 d F
Another option:
df2.z <- with(df2, tapply(z, x, paste, collapse=' '))
transform(df1, z=df2.z[match(x, names(df2.z))])
# x y z
# 1 1 a A B
# 2 2 b C D
# 3 3 c E
# 4 4 d F
If df1$x is in order, then use df2.z[names(df2.z) %in% x] in the transform statement.
I'm submitting this question with my own potential answer, but it is fairly slow and I'm curious what other methods might be available.
by <- "x"
df2_processed <- as.data.frame(
  sapply(names(df2), function(x) {
    tapply(df2[[x]], df2[[by]], function(xx) {
      if (x == by) {
        return(xx[1])
      } else {
        paste(xx, collapse=" ")
      }
    })
  }), optional=TRUE, stringsAsFactors=FALSE)
merge( df1, df2_processed, all.x=TRUE )
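For completeness, a dplyr sketch of the same collapse-then-join idea (an addition for reference, not one of the original answers):
library(dplyr)
df2_collapsed <- df2 %>%
  group_by(x) %>%
  summarise(z = paste(z, collapse = " "))   # collapse duplicate x rows into one
left_join(df1, df2_collapsed, by = "x")     # keep exactly the rows of df1
#   x y   z
# 1 1 a A B
# 2 2 b C D
# 3 3 c   E
# 4 4 d   F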
