compare dataframes and extract matching rows [duplicate] - r

I have two character vectors of IDs.
I would like to compare the two character vectors, in particular I am interested in the following figures:
How many IDs are both in A and B
How many IDs are in A but not in B
How many IDs are in B but not in A
I would also love to draw a Venn diagram.

Here are some basics to try out:
> A = c("Dog", "Cat", "Mouse")
> B = c("Tiger","Lion","Cat")
> A %in% B
[1] FALSE TRUE FALSE
> intersect(A,B)
[1] "Cat"
> setdiff(A,B)
[1] "Dog" "Mouse"
> setdiff(B,A)
[1] "Tiger" "Lion"
Similarly, you could get counts simply as:
> length(intersect(A,B))
[1] 1
> length(setdiff(A,B))
[1] 2
> length(setdiff(B,A))
[1] 2
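The three counts asked for in the question can also be collected in one call. This is a minimal sketch; `compare_ids` is a hypothetical helper name, not a base R function. It counts unique values only, because `intersect` and `setdiff` drop duplicates.

```r
# compare_ids() is a hypothetical helper, not part of base R.
# It returns the three counts from the question as a named vector.
compare_ids <- function(A, B) {
  c(both   = length(intersect(A, B)),  # IDs in A and in B
    only_A = length(setdiff(A, B)),    # IDs in A but not in B
    only_B = length(setdiff(B, A)))    # IDs in B but not in A
}

A <- c("Dog", "Cat", "Mouse")
B <- c("Tiger", "Lion", "Cat")
compare_ids(A, B)
#>   both only_A only_B
#>      1      2      2
```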

I'm usually dealing with large-ish sets, so I use a table instead of a Venn diagram:
xtab_set <- function(A,B){
both <- union(A,B)
inA <- both %in% A
inB <- both %in% B
return(table(inA,inB))
}
set.seed(1)
A <- sample(letters[1:20],10,replace=TRUE)
B <- sample(letters[1:20],10,replace=TRUE)
xtab_set(A,B)
#        inB
# inA     FALSE TRUE
#   FALSE     0    5
#   TRUE      6    3

Yet another way, using %in% and logical vectors of common elements instead of intersect and setdiff. I take it you actually want to compare two vectors, not two lists: a list is an R class that may contain elements of any type, while atomic vectors always contain elements of a single type, which makes it easier to decide what is truly equal. Here the elements are coerced to character strings, as that is the most general element type present.
first <- c(1:3, letters[1:6], "foo", "bar")
second <- c(2:4, letters[5:8], "bar", "asd")
both <- first[first %in% second] # in both, same as call: intersect(first, second)
onlyfirst <- first[!first %in% second] # only in 'first', same as: setdiff(first, second)
onlysecond <- second[!second %in% first] # only in 'second', same as: setdiff(second, first)
length(both)
length(onlyfirst)
length(onlysecond)
#> both
#[1] "2" "3" "e" "f" "bar"
#> onlyfirst
#[1] "1" "a" "b" "c" "d" "foo"
#> onlysecond
#[1] "4" "g" "h" "asd"
#> length(both)
#[1] 5
#> length(onlyfirst)
#[1] 6
#> length(onlysecond)
#[1] 4
# If you don't have the 'gplots' package, type: install.packages("gplots")
require("gplots")
venn(list(first.vector = first, second.vector = second))
As mentioned, there are several options for plotting Venn diagrams in R; the call above uses venn() from gplots.
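The coercion described at the start of this answer can be checked directly. The sketch below uses a made-up vector `mixed`; it only illustrates that combining numbers and strings with `c()` yields a character vector, and that `%in%` then compares values as strings.

```r
# 'mixed' is an illustrative name; c() coerces everything to character here
mixed <- c(1:3, "foo")
class(mixed)     # "character"
mixed            # "1" "2" "3" "foo"

# %in% coerces its arguments to a common type, so the number 2 matches "2"
2 %in% mixed     # TRUE
```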

With sqldf: Slower but very suitable for data frames with mixed types:
library(sqldf)
t1 <- as.data.frame(1:10)
t2 <- as.data.frame(5:15)
sqldf1 <- sqldf('SELECT * FROM t1 EXCEPT SELECT * FROM t2') # subset from t1 not in t2
sqldf2 <- sqldf('SELECT * FROM t2 EXCEPT SELECT * FROM t1') # subset from t2 not in t1
sqldf3 <- sqldf('SELECT * FROM t1 UNION SELECT * FROM t2') # UNION t1 and t2
sqldf1
# X1_10: 1 2 3 4
sqldf2
# X5_15: 11 12 13 14 15
sqldf3
# X1_10: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
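If sqldf is not available, a similar row-wise EXCEPT can be approximated in base R. This is a rough sketch under stated assumptions: `row_key` and `rows_except` are hypothetical helpers, both data frames are assumed to have the same columns in the same order, and rows are compared through a pasted text key, so values containing the separator character could collide.

```r
# row_key() and rows_except() are hypothetical helpers, not sqldf functions.
# Build one text key per row by pasting all columns together.
row_key <- function(d) do.call(paste, c(unname(as.list(d)), sep = "\r"))

# Rows of 'a' that do not occur in 'b' (duplicates removed, like SQL EXCEPT)
rows_except <- function(a, b) {
  a <- unique(a)
  a[!row_key(a) %in% row_key(b), , drop = FALSE]
}

t1 <- data.frame(n = 1:10)
t2 <- data.frame(n = 5:15)
rows_except(t1, t2)$n   # 1 2 3 4
rows_except(t2, t1)$n   # 11 12 13 14 15
```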

Using the same example data as one of the answers above.
A = c("Dog", "Cat", "Mouse")
B = c("Tiger","Lion","Cat")
match(A,B)
[1] NA 3 NA
The match function returns a vector with the location in B of all values in A. So, cat, the second element in A, is the third element in B. There are no other matches.
To get the matching values in A and B, you can do:
m <- match(A,B)
A[!is.na(m)]
"Cat"
B[m[!is.na(m)]]
"Cat"
To get the non-matching values in A and B:
A[is.na(m)]
"Dog" "Mouse"
B[is.na(match(B, A))] # m is indexed by A, so B needs its own lookup
"Tiger" "Lion"
Further, you can use length() to get the total number of matching and non-matching values.
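As a small illustration of how match relates to %in%: testing the match result for NA gives the same logical vector as %in%, and the direction of the lookup matters when you want the unmatched values of B. The variable name `in_B_via_match` is made up for this sketch.

```r
A <- c("Dog", "Cat", "Mouse")
B <- c("Tiger", "Lion", "Cat")

# A position that is not NA means the element of A was found in B
in_B_via_match <- !is.na(match(A, B))
identical(in_B_via_match, A %in% B)   # TRUE

# For values of B missing from A, look B up in A (not the other way round)
B[is.na(match(B, A))]                 # "Tiger" "Lion"
```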

If A is a data.table with a list column a whose entries are atomic vectors, e.g. created as follows
library(data.table)
A<-data.table(a=c(list(c("abc","def","123")),list(c("ghi","zyx"))),d=c(9,8))
and B is a list with vector of primitive entries, e.g. created as follows
B<-list(c("ghi","zyx"))
and you're attempting to find which (if any) element of A$a matches B
A[sapply(a,identical,unlist(B))]
if you just want the entry in a
A[sapply(a,identical,unlist(B)),a]
if you want the matching indices of a
A[,which(sapply(a,identical,unlist(B)))]
if instead B is itself a data.table with the same structure as A, e.g.
B<-data.table(b=c(list(c("zyx","ghi")),list(c("abc","def",123))),z=c(5,7))
and you're looking for the intersection of the two lists by one column, where you require the same order of vector elements.
# give the entry in A for in which A$a matches B$b
A[,`:=`(res=unlist(sapply(list(a),function(x,y){
x %in% unlist(lapply(y,as.vector,mode="character"))
},list(B[,b]),simplify=FALSE)))
][res==TRUE
][,res:=NULL][]
# get T/F for each index of A
A[,sapply(list(a),function(x,y){
x %in% unlist(lapply(y,as.vector,mode="character"))
},list(B[,b]),simplify=FALSE)]
Note that you can't do something as easy as
setkey(A,a)
setkey(B,b)
A[B]
to join A and B, because you cannot key on a list column (as of data.table 1.12.2)
similarly, you cannot ask
A[a==B[,b]]
even if A and B are identical, as the == operator hasn't been implemented in R for type list
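Outside data.table, the same element-by-element comparison of two lists of vectors can be sketched in base R. The names `a`, `b`, `in_b_exact` and `in_b_set` are illustrative only. `identical` requires the same elements in the same order, while `setequal` ignores order.

```r
a <- list(c("abc", "def", "123"), c("ghi", "zyx"))
b <- list(c("zyx", "ghi"), c("abc", "def", "123"))

# For each element of a: is there an element of b with the same values
# in the same order?
in_b_exact <- vapply(a, function(x) any(vapply(b, identical, logical(1), x)),
                     logical(1))
in_b_exact   # TRUE FALSE

# Same question, but ignoring the order of values inside each vector
in_b_set <- vapply(a, function(x) any(vapply(b, setequal, logical(1), x)),
                   logical(1))
in_b_set     # TRUE TRUE
```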

Related

A more elegant way to combine two vectors as separate columns (or dataframes), match the rows, and have NA where they do not match

I have two vectors of the same 'thing' that I want to combine into a dataframe. Each vector will become its own column; rows should line up where the values are the same, with NA in one column where a value does not appear in the other vector. Since the data starts as just two vectors, there are no common id values or anything to match on other than the vector values.
I got this to work in a toy data test using a simple and straightforward approach, but would like to know if there is a more direct and elegant way to do this.
My current approach requires assigning a unique value by which I can then merge the two vectors, but I am curious if I can do this without it and rely instead on the vector values. My other attempts tried to not adopt a new id value, exploring functions like merge and join, cbind, rbind, bind_rows, bind_cols, intersect and union. Perhaps I wasn't using them as well as I could. I found some other useful posts on SO (like this one), but they all already start with a unique identifier.
Here is my toy data test with a final output how I want it to look. It does not matter to me if the final output has an id column or not. Note, my actual data will be character, hence my use of letters here.
# create toy data
x <- letters[1:5]
y <- letters[2:6]
# combine into dataframe, keep only unique values & assign id
xy <- data.frame(xy=unique(c(x,y))); xy
xy$id <- 1:length(xy$xy); xy
# match id back to original toy data as dataframes
x <- data.frame(x)
x$id <- match(x$x, xy$xy)
y <- data.frame(y)
y$id <- match(y$y, xy$xy)
# merge using id
xy2 <- merge(x, y, by="id", all=TRUE)
xy2
# results in
  id    x    y
1  1    a <NA>
2  2    b    b
3  3    c    c
4  4    d    d
5  5    e    e
6  6 <NA>    f
Using tidyverse you can try using full_join and create keys based on your 2 vectors:
library(tidyverse)
full_join(data.frame(key=x, x),
data.frame(key=y, y), by="key") %>%
select(-key)
Alternatively, you can just use merge in base R:
merge(data.frame('key'=x, x), data.frame('key'=y, y), by='key', all=TRUE)[-1]
Output
x y
1 a <NA>
2 b b
3 c c
4 d d
5 e e
6 <NA> f
Here's an alternative one-liner in base R:
cbind(x[match(unique(c(x, y)), x)], y[match(unique(c(x, y)), y)])
#> [,1] [,2]
#> [1,] "a" NA
#> [2,] "b" "b"
#> [3,] "c" "c"
#> [4,] "d" "d"
#> [5,] "e" "e"
#> [6,] NA "f"
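The one-liner can be wrapped in a small function that returns a data frame instead of a matrix. This is a sketch; `align_vectors` is a hypothetical name. It assumes the values within each vector are unique, since `match` only returns the first position of a repeated value.

```r
# align_vectors() is a hypothetical helper built on the match() idea above.
align_vectors <- function(x, y) {
  keys <- unique(c(x, y))        # every value that occurs in either vector
  data.frame(
    x = x[match(keys, x)],       # NA where a key is missing from x
    y = y[match(keys, y)],       # NA where a key is missing from y
    stringsAsFactors = FALSE
  )
}

x <- letters[1:5]
y <- letters[2:6]
res <- align_vectors(x, y)
res$x   # "a" "b" "c" "d" "e" NA
res$y   # NA  "b" "c" "d" "e" "f"
```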

Why does the multiple option in the vunion function of the vecsets package not work for character vectors?

When I run the code:
library(vecsets)
p <- c("a","b")
q <- c( "a")
vunion(p,q, multiple = TRUE)
I get the result:
[1] "a" "b"
But I expect the result to be
vunion(p,q, multiple = TRUE)
[1] "a" "b" "a"
I also do not understand the result provided in the example of the vecsets package. The example shows:
x <- c(1:5,3,3,3,2,NA,NA)
y <- c(2:5,4,3,NA)
vunion(x,y,multiple=TRUE)
[1] 2 3 3 4 5 NA 1 3 3 2 NA 4
But if we check
length(x)+length(y); length(vunion(x,y))
[1] 18
[1] 12
we get different lengths, but I think they should be the same. Note, for example, 5 appears only once.
What's going on here? Can someone explain?
I think the vecsets package documentation (link) describes this behavior quite well:
The base::union function removes duplicates per algebraic set theory. vunion does not, and so returns as many duplicate elements as are in either input vector (not the sum of their inputs.) In short, vunion is the same as vintersect(x,y) + vsetdiff(x,y) + vsetdiff(y,x).
You do have to read it carefully, though; the important part is "as are in either input vector (not the sum of their inputs)". The issue is not character versus numeric vectors, but whether elements are repeated within the same vector. Consider p1 versus p2 in the following example. The result of vunion has as many "a"s as the input that contains more of them, so we expect one "a" in the first call and two in the second; both times we expect only one "b":
library(vecsets)
q <- c("a", "b")
p1 <- c("a", "b")
vunion(p1, q, multiple = TRUE)
[1] "a" "b"
p2 <- c("a", "a", "b")
vunion(p2, q, multiple = TRUE)
[1] "a" "b" "a"
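The counting rule quoted from the documentation (each value appears as often as in whichever input has more copies) can be reproduced in base R. This is only a sketch of that rule, not the vecsets implementation; `multiset_union` is a hypothetical name, and the order of its output differs from vunion because equal values are grouped together.

```r
# multiset_union() is a hypothetical base R sketch of the "max count" rule;
# it is NOT how vecsets::vunion is implemented.
multiset_union <- function(x, y) {
  vals <- unique(c(x, y))
  # Count how often each value occurs; match() treats NA as a regular value
  count_in <- function(v) tabulate(match(v, vals), nbins = length(vals))
  n <- pmax(count_in(x), count_in(y))   # keep the larger count per value
  rep(vals, times = n)
}

multiset_union(c("a", "b"), c("a", "b"))        # "a" "b"
multiset_union(c("a", "a", "b"), c("a", "b"))   # "a" "a" "b"

x <- c(1:5, 3, 3, 3, 2, NA, NA)
y <- c(2:5, 4, 3, NA)
length(multiset_union(x, y))                    # 12, not 18
```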

Can't match character value to multiple columns

I'm trying to match a character value, "C", to multiple columns in a dataframe. Here's part of the frame:
X1 X2
1 F F
2 C C
3 D D
4 A# A#
Here's what happens when I try to match the value "C":
> "C" %in% frame[, 1]
[1] TRUE
> "C" %in% frame[, 1:2]
[1] FALSE
Considering that "C" is in both columns, I can't figure out why it's returning false. Is there a function or operator that can test to see if a value is present in multiple columns? My goal is to create a function that can sum the number of times a character value like "C" is found in specified columns.
Try:
apply(frame, 2, function(u) "C" %in% u)
You can also use is.element:
apply(frame, 2, function(u) is.element("C",u))
You probably want to use grepl here, which returns a logical vector. Then you can count the number of occurrences with sum.
> frame
X1 X2
1 F F
2 C C
3 D D
4 A# A#
> grepl('C', frame$X1)
[1] FALSE TRUE FALSE FALSE
> sum(grepl('C', frame$X1))
[1] 1
and to count the total number of Cs in every column you can use lapply (note: apply is better suited for matrices, not data frames, which are lists):
> sum(unlist(lapply(frame, function(col) grepl('C', col))))
[1] 2
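Note that grepl does partial matching, so it would also count values such as "C#". For exact matches, a comparison with == per column may be closer to what the question asks for. The sketch below rebuilds the example frame and defines `count_value`, a hypothetical helper. It also shows that flattening the columns with unlist makes %in% behave as expected, whereas %in% on the data frame itself compares against whole columns.

```r
frame <- data.frame(X1 = c("F", "C", "D", "A#"),
                    X2 = c("F", "C", "D", "A#"),
                    stringsAsFactors = FALSE)

# count_value() is a hypothetical helper: exact matches of 'value'
# summed over the chosen columns
count_value <- function(df, value, cols = names(df)) {
  sum(vapply(df[cols], function(col) sum(col == value, na.rm = TRUE),
             integer(1)))
}

count_value(frame, "C")                 # 2
count_value(frame, "C", cols = "X1")    # 1

# Flatten the columns into one vector before testing membership
"C" %in% unlist(frame[, 1:2])           # TRUE
```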

R - preserve order when using matching operators (%in%)

I am using matching operators to grab values that appear in a matrix from a separate data frame. However, the resulting matrix has the values in the order they appear in the data frame, not in the original matrix. Is there any way to preserve the order of the original matrix using the matching operator?
Here is a quick example:
vec=c("b","a","c"); vec
df=data.frame(row.names=letters[1:5],values=1:5); df
df[rownames(df) %in% vec,1]
This produces [1] 1 2 3, which is the order in which "a", "b" and "c" appear in the data frame. However, I would like to get [1] 2 1 3, which is the order in which they appear in the original vector.
Thanks!
Use match.
df[match(vec, rownames(df)), ]
# [1] 2 1 3
Be aware that if you have duplicate values in either vec or rownames(df), match may not behave as expected.
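To make the caveat about duplicates concrete, here is a short sketch with a made-up vector `keys`: match reports only the first position of a repeated value, while %in% combined with which reports all of them.

```r
# 'keys' is an illustrative vector containing a repeated value
keys <- c("a", "b", "a", "c")

match("a", keys)           # 1: only the first position is returned
which(keys %in% "a")       # 1 3: every position
match(c("a", "a"), keys)   # 1 1: repeated lookups repeat the same index
```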
Edit:
I just realized that row name indexing will solve your issue a bit more simply and elegantly:
df[vec, ]
# [1] 2 1 3
Use match (and get rid of the NA values for elements in either vector for those that don't match in the other):
Filter(function(x) !is.na(x), match(rownames(df), vec))
Since row name indexing also works on vectors, we can take this one step further and define:
'%ino%' <- function(x, table) {
xSeq <- seq_along(x)
names(xSeq) <- x
Out <- xSeq[as.character(table)]
Out[!is.na(Out)]
}
We now have the desired result:
df[rownames(df) %ino% vec, 1]
[1] 2 1 3
Inside the function, names() does an auto convert to character and table is changed with as.character(), so this also works correctly when the inputs to %ino% are numbers:
LETTERS[1:26 %in% 4:1]
[1] "A" "B" "C" "D"
LETTERS[1:26 %ino% 4:1]
[1] "D" "C" "B" "A"
Following %in%, missing values are removed:
LETTERS[1:26 %in% 3:-5]
[1] "A" "B" "C"
LETTERS[1:26 %ino% 3:-5]
[1] "C" "B" "A"
With %in%, the logical vector is recycled along the dimension of the object being subsetted; this is not the case with %ino%:
data.frame(letters, LETTERS)[1:5 %in% 3:-5,]
letters LETTERS
1 a A
2 b B
3 c C
6 f F
7 g G
8 h H
11 k K
12 l L
13 m M
16 p P
17 q Q
18 r R
21 u U
22 v V
23 w W
26 z Z
data.frame(letters, LETTERS)[1:5 %ino% 3:-5,]
letters LETTERS
3 c C
2 b B
1 a A
