Compare 2 dataframes for equality in R

I have 2 dataframes with the same 2 columns. I want to check if the datasets are identical. The original datasets have some 700K records, but I'm trying to figure out a way to do it using dummy datasets first.
I tried using compare, identical, all, all_equal etc. None of them returns TRUE.
The dummy datasets are -
a <- data.frame(x = 1:10, b = 20:11)
c <- data.frame(x = 10:1, b = 11:20)
all(a==c)
[1] FALSE
compare(a,c)
FALSE [FALSE, FALSE]
identical(a,c)
[1] FALSE
all.equal(a,c)
[1] "Component “x”: Mean relative difference: 0.9090909" "Component “b”: Mean relative difference: 0.3225806"
The datasets are exactly the same, except for the order of the records. If these functions only work when the rows are in the same order, then I must try something else. In that case, can someone help me get TRUE for these 2 (unordered) datasets?

dplyr's setdiff works on data frames, I would suggest
library(dplyr)
nrow(setdiff(a, c)) == 0 & nrow(setdiff(c, a)) == 0
# [1] TRUE
Note that this will not account for number of duplicate rows. (i.e., if a has multiple copies of a row, and c has only one copy of that row, it will still return TRUE). Not sure how you want duplicate rows handled...
If you do care about having the same number of duplicates, then I would suggest two possibilities: (a) adding an ID column to differentiate the duplicates and using the approach above, or (b) sorting, resetting the row names (annoyingly), and using identical.
(a) adding an ID column
library(dplyr)
a_id = group_by_all(a) %>% mutate(id = row_number())
c_id = group_by_all(c) %>% mutate(id = row_number())
nrow(setdiff(a_id, c_id)) == 0 & nrow(setdiff(c_id, a_id)) == 0
# [1] TRUE
(b) sorting
a_sort = a[do.call(order, a), ]
row.names(a_sort) = NULL
c_sort = c[do.call(order, c), ]
row.names(c_sort) = NULL
identical(a_sort, c_sort)
# [1] TRUE
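The sorting approach in (b) can be wrapped in a small helper. This is only a sketch: same_rows is a made-up name, it assumes both data frames have the same column names in the same order, and it uses all.equal so that tiny floating-point differences are tolerated.

```r
# Hypothetical helper: TRUE if x and y contain the same rows (with the same
# number of duplicates), regardless of row order.
same_rows <- function(x, y) {
  if (!identical(dim(x), dim(y)) || !identical(names(x), names(y))) return(FALSE)
  sort_rows <- function(d) {
    # unname() so that column names are never mistaken for arguments of order()
    d <- d[do.call(order, unname(as.list(d))), , drop = FALSE]
    row.names(d) <- NULL
    d
  }
  isTRUE(all.equal(sort_rows(x), sort_rows(y)))
}

a <- data.frame(x = 1:10, b = 20:11)
c <- data.frame(x = 10:1, b = 11:20)
res_same <- same_rows(a, c)                        # same rows, different order
res_dupe <- same_rows(a, rbind(c[-1, ], c[2, ]))   # one row replaced by a duplicate
```

Here res_same is TRUE and res_dupe is FALSE, because the second comparison swaps one row for a copy of another.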

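Maybe a function that sorts each column before comparison is what you need. Note that it sorts every column independently, so it checks that the columns hold the same values, not that the same rows are present. It will also be slow on large dataframes.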
unordered_equal <- function(X, Y, exact = FALSE){
X[] <- lapply(X, sort)
Y[] <- lapply(Y, sort)
if(exact) identical(X, Y) else all.equal(X, Y)
}
unordered_equal(a, c)
#[1] TRUE
unordered_equal(a, c, TRUE)
#[1] TRUE
a$x <- a$x + .Machine$double.eps
unordered_equal(a, c)
#[1] TRUE
unordered_equal(a, c, TRUE)
#[1] FALSE
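One thing to keep in mind with a column-wise sort is that it compares each column's values on their own, not whole rows. A small sketch of the difference, using two made-up data frames p and q that share the same column values but pair them differently:

```r
unordered_equal <- function(X, Y, exact = FALSE){
  X[] <- lapply(X, sort)
  Y[] <- lapply(Y, sort)
  if(exact) identical(X, Y) else all.equal(X, Y)
}

# Illustrative data: the rows are (1, "a"), (2, "b") versus (1, "b"), (2, "a")
p <- data.frame(x = 1:2, y = c("a", "b"), stringsAsFactors = FALSE)
q <- data.frame(x = 1:2, y = c("b", "a"), stringsAsFactors = FALSE)

col_wise <- unordered_equal(p, q, TRUE)
# Row-wise check: order whole rows, then compare
row_wise <- identical(p[do.call(order, p), ], q[do.call(order, q), ])
```

The column-wise check reports equality (col_wise is TRUE) while the row-wise check does not (row_wise is FALSE), so which one is appropriate depends on whether rows must match as units.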

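Basically, what you want may be to compare the underlying matrices after ordering the rows. Note that the rows are ordered by the first column only, so ties in that column are not resolved.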
all.equal(matrix(unlist(a[order(a[1]), ]), dim(a)),
matrix(unlist(c[order(c[1]), ]), dim(c)))
# [1] TRUE
identical(matrix(unlist(a[order(a[1]), ]), dim(a)),
matrix(unlist(c[order(c[1]), ]), dim(c)))
# [1] TRUE
You could wrap this into a function for more convenience:
om <- function(d) matrix(unlist(d[order(d[1]), ]), dim(d))
all.equal(om(a), om(c))
# [1] TRUE

You can use the new package called waldo
library(waldo)
a <- data.frame(x = 1:10, b = 20:11)
c <- data.frame(x = 10:1, b = 11:20)
compare(a,c)
And you get:
`old$x`: 1 2 3 4 5 6 7 8 9 10 and 9 more...
`new$x`: 10 ...
`old$b`: 20 19 18 17 16 15 14 13 12 11 and 9 more...
`new$b`:

Related

R: Test for overlap of name values in dataframe

I have a dataframe filled with names.
For a given row in the dataframe, I'd like to compare that row to every row above it in the df and determine if the number of matching names is less than or equal to 4 for every row.
Toy Example where row 3 is the row of interest
"Jim","Dwight","Michael","Andy","Stanley","Creed"
"Jim","Dwight","Angela","Pam","Ryan","Jan"
"Jim","Dwight","Angela","Pam","Creed","Ryan" <--- row of interest
So first we'd compare row 3 to row 1 and see that the name overlap is 3, which meets the <= 4 criteria.
Then we'd compare row 3 to row 2 and see that the name overlap is 5 which fails the <= 4 criteria, ultimately returning a failed condition for being <=4 for every row above it.
Right now I am doing this operation using a for loop but the speed is much too slow for the dataframe size I am working with.
Example data
df <- as.data.frame(rbind(
c("Jim","Dwight","Michael","Andy","Stanley","Creed"),
c("Jim","Dwight","Angela","Pam","Ryan","Jan"),
c("Jim","Dwight","Angela","Pam","Creed","Ryan")
), stringsAsFactors = FALSE)
df
# V1 V2 V3 V4 V5 V6
# 1 Jim Dwight Michael Andy Stanley Creed
# 2 Jim Dwight Angela Pam Ryan Jan
# 3 Jim Dwight Angela Pam Creed Ryan
Operation and output (sapply over columns with %in% and take rowSums)
out_lgl <- rowSums(sapply(df, '%in%', unlist(df[3,]))) <= 4
out_lgl
# [1] TRUE FALSE FALSE
which(out_lgl)
# [1] 1
Explanation:
For each column, each element is compared to the third row (the vector unlist(df[3,])). The output is a matrix of logical values with the same dimensions as df, TRUE if there is a match.
sapply(df, '%in%', unlist(df[3,]))
# V1 V2 V3 V4 V5 V6
# [1,] TRUE TRUE FALSE FALSE FALSE TRUE
# [2,] TRUE TRUE TRUE TRUE TRUE FALSE
# [3,] TRUE TRUE TRUE TRUE TRUE TRUE
Then we can sum the TRUEs to see the number of matches for each row
rowSums(sapply(df, '%in%', unlist(df[3,])))
# [1] 3 5 6
Edit:
I have added the stringsAsFactors = FALSE option to the creation of df above. However, as far as I can tell the output of %in% is the same whether comparing factors with different levels or characters, so I don't believe this could change the results in any way. See example below
x <- c('b', 'c', 'z')
y <- c('a', 'b', 'g')
all.equal(x %in% y, factor(x) %in% factor(y))
# [1] TRUE
Similar solution as IceCreamToucan, but for any row.
For the data.frame:
df <- as.data.frame(rbind(
c("Jim","Dwight","Michael","Andy","Stanley","Creed"),
c("Jim","Dwight","Angela","Pam","Ryan","Jan"),
c("Jim","Dwight","Angela","Pam","Creed","Ryan")
), stringsAsFactors = FALSE)
For any row number i:
f <- function(i) {
if(i == 1) return(TRUE)
r <- vapply(df[1:(i-1),], '%in%', unlist(df[i,]), FUN.VALUE = logical(i-1))
# for i == 2 vapply returns a plain vector, so force an (i-1)-row matrix
out_lgl <- rowSums(matrix(r, nrow = i - 1)) <= 4
return(all(out_lgl))
}
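To check every row at once, a function like this can be applied over all row indices. The following is a sketch rather than the answer's own code: overlap_ok and its max_overlap argument are made-up names, and the data frame is assumed to hold character columns.

```r
df <- as.data.frame(rbind(
  c("Jim","Dwight","Michael","Andy","Stanley","Creed"),
  c("Jim","Dwight","Angela","Pam","Ryan","Jan"),
  c("Jim","Dwight","Angela","Pam","Creed","Ryan")
), stringsAsFactors = FALSE)

# Hypothetical helper: TRUE if row i shares at most max_overlap names
# with every row above it
overlap_ok <- function(i, d = df, max_overlap = 4) {
  if (i == 1) return(TRUE)
  cur  <- unlist(d[i, ])
  prev <- as.matrix(d[seq_len(i - 1), , drop = FALSE])
  # %in% returns a flat vector; reshape it so each row of prev is one matrix row
  counts <- rowSums(matrix(prev %in% cur, nrow = i - 1))
  all(counts <= max_overlap)
}

res <- vapply(seq_len(nrow(df)), overlap_ok, logical(1))
res
```

For the toy data, rows 1 and 2 pass and row 3 fails, since it shares five names with row 2.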

R: make 2 subset vectors so that values are different index-wise

I want to make 2 vectors subsetting from the same data, with replace=TRUE.
Even if both vectors can contain the same values, they cannot be the same at the same index position.
For example:
> set.seed(1)
> a <- sample(15, 10, replace=T)
> b <- sample(15, 10, replace=T)
> a
[1] 4 6 9 14 4 14 15 10 10 1
> b
[1] 4 3 11 6 12 8 11 15 6 12
> a==b
[1] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
In this case, vectors a and b contain the same value at index 1 (value==4), which is wrong for my purposes.
Is there an easy way to correct this?
And can it be done on the subset step?
Or should I go through a loop checking element by element and if the values are identical, make another selection for b[i] and check again if it's not identical ad infinitum?
many thanks!
My idea is, instead of getting 2 samples of length 10 with replacement, get 10 samples of length 2 without replacement
library(purrr)
l <- rerun(10,sample(15,2,replace=FALSE))
Each element in l is a vector of integers of length two. Those two integers are guaranteed to be different because we specified replace=FALSE in sample
# extract the first element of each pair in l; this is a
a <- map_int(l,`[[`,1)
# extract the second element of each pair in l; this is b
b <- map_int(l,`[[`,2)
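The same idea can be sketched in base R without purrr, assuming replicate() is an acceptable substitute for rerun(); the matrix name m is just for illustration.

```r
set.seed(1)
# 2 x 10 matrix: each column is one draw of two distinct values from 1:15
m <- replicate(10, sample(15, 2, replace = FALSE))
a <- m[1, ]
b <- m[2, ]
any(a == b)  # FALSE: the two values in each column always differ
```

Values may still repeat along a or along b, because each column is drawn independently.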
How about a two-stage sampling process
set.seed(1)
x <- 1:15
a <- sample(x, 10, replace = TRUE)
b <- sapply(a, function(v) sample(x[x != v], 1))
a != b
#[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
We first draw samples a; then for every sample from a, we draw a new sample from the set of values x excluding the current sample from a. Since we're doing this one-sample-at-a-time, we automatically allow for sampling with replacement.
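If the sapply call becomes a bottleneck for long vectors, one possible vectorised variant is to shift each value of a by a random non-zero offset, wrapping around at the end of the range. This is a different technique from the one above, shown only as a sketch; n_values and shift are illustrative names, and it assumes the values are the integers 1 to n_values.

```r
set.seed(1)
n_values <- 15
a <- sample(n_values, 10, replace = TRUE)
# A shift of 1..(n_values - 1) positions can never land back on the same value
shift <- sample(n_values - 1, 10, replace = TRUE)
b <- (a - 1 + shift) %% n_values + 1
all(a != b)  # TRUE
```

Each b[i] is drawn uniformly from the 14 values other than a[i], which matches the two-stage approach.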

Matching all elements in nested lists (irrespective of position) and returning matches with indexes

I have two lists x and y created from
x1 = list(c(1,2,3,4))
x2 = list(c(seq(1, 10, by = 2)))
x<- list(x1,x2)
x
[[1]]
[[1]][[1]]
[1] 1 2 3 4
[[2]]
[[2]][[1]]
[1] 1 3 5 7 9
and y,
y1 = list(c(5, 6, 7, 8))
y2 = list(c(9, 7, 5, 3, 1))
y <- list(y1, y2)
y
[[1]]
[[1]][[1]]
[1] 5 6 7 8
[[2]]
[[2]][[1]]
[1] 9 7 5 3 1
So basically, I want to find which elements of x match elements of y, so only '1 3 5 7 9' should come back as a match. I also need the indexes.
I want to match the values of each x[[ ]] with each y[[ ]] irrespective of position. I have tried:
Matches <- x[x %in% y]
IDX <- which(x %in% y)
This does not work....
I would like something that can return matches of the same elements irrespective of positions per each list. This would be a rough idea of what I need...
matches
[1] False
[1] 1 3 5 7 9
Thanks in advance, appreciate all the help.
Here is what you can do:
You have made a list of lists, which is quite confusing to work with. You could have avoided the extra nesting by using x <- c(x1, x2) to get a list of vectors, which is much easier to work with.
But since you provided with list of lists, I will work with that.
Now back to solving your question:
flags <- lapply(Map(`%in%`, unlist(x, recursive = F), unlist(y, recursive=F)),all)
k <- lapply(1:length(x), function(i)ifelse(unlist(flags)[i] == TRUE,
list(unlist(x, recursive=F)[[i]]),
unlist(flags[i])))
unlist(k, recursive = F) #Final Output
Logic:
Map applies %in% to each pair of corresponding elements of x and y, and all() collapses each result to a single TRUE or FALSE: TRUE if every value of the x element is present in the y element (this checks containment, not that the two sets are equal). In your case it returns FALSE and TRUE respectively.
Next we iterate over x, using flags as a filter to build another list k: where the flag from the earlier step is TRUE, the contents of x are copied over; where it is FALSE, the entry stays FALSE.
As the final step, unlist k with recursive = F to convert it into a list of vectors.
Output:
# [[1]]
# [1] FALSE
# [[2]]
# [1] 1 3 5 7 9
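If the goal is for the two vectors in each pair to hold exactly the same values regardless of position, setequal may be a simpler building block. A minimal sketch, assuming x[[i]] is only compared with y[[i]] (pairwise by position); xv, yv, is_match and idx are illustrative names.

```r
x <- list(list(c(1, 2, 3, 4)), list(seq(1, 10, by = 2)))
y <- list(list(c(5, 6, 7, 8)), list(c(9, 7, 5, 3, 1)))

# Drop one level of nesting to get plain lists of vectors
xv <- unlist(x, recursive = FALSE)
yv <- unlist(y, recursive = FALSE)

# TRUE where the two vectors contain the same set of values
is_match <- mapply(setequal, xv, yv)
idx      <- which(is_match)   # positions of the matching pairs
matches  <- xv[is_match]      # the matching vectors themselves
```

Here idx is 2 and matches holds the single vector 1 3 5 7 9. Comparing every x element against every y element would need a nested loop or outer-style call instead.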

Compare two character vectors in R

I have two character vectors of IDs.
I would like to compare the two character vectors, in particular I am interested in the following figures:
How many IDs are both in A and B
How many IDs are in A but not in B
How many IDs are in B but not in A
I would also love to draw a Venn diagram.
Here are some basics to try out:
> A = c("Dog", "Cat", "Mouse")
> B = c("Tiger","Lion","Cat")
> A %in% B
[1] FALSE TRUE FALSE
> intersect(A,B)
[1] "Cat"
> setdiff(A,B)
[1] "Dog" "Mouse"
> setdiff(B,A)
[1] "Tiger" "Lion"
Similarly, you could get counts simply as:
> length(intersect(A,B))
[1] 1
> length(setdiff(A,B))
[1] 2
> length(setdiff(B,A))
[1] 2
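The three counts can also be collected in one call. A small sketch with a made-up helper name, compare_ids; because it relies on the set functions, duplicated IDs are counted once.

```r
# Hypothetical helper: counts of unique IDs shared by, or exclusive to, A and B
compare_ids <- function(A, B) {
  c(both   = length(intersect(A, B)),
    only_A = length(setdiff(A, B)),
    only_B = length(setdiff(B, A)))
}

A <- c("Dog", "Cat", "Mouse")
B <- c("Tiger", "Lion", "Cat")
counts <- compare_ids(A, B)
counts
```

For these vectors the result is both = 1, only_A = 2, only_B = 2.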
I'm usually dealing with large-ish sets, so I use a table instead of a Venn diagram:
xtab_set <- function(A,B){
both <- union(A,B)
inA <- both %in% A
inB <- both %in% B
return(table(inA,inB))
}
set.seed(1)
A <- sample(letters[1:20],10,replace=TRUE)
B <- sample(letters[1:20],10,replace=TRUE)
xtab_set(A,B)
# inB
# inA FALSE TRUE
# FALSE 0 5
# TRUE 6 3
Yet another way, using %in% and logical vectors of common elements instead of intersect and setdiff. I take it you actually want to compare two vectors, not two lists: a list is an R class that may contain any type of element, while vectors always contain elements of just one type, which makes it easier to decide what is truly equal. Here the elements are coerced to character strings, because character is the only type that can represent every element present.
first <- c(1:3, letters[1:6], "foo", "bar")
second <- c(2:4, letters[5:8], "bar", "asd")
both <- first[first %in% second] # in both, same as call: intersect(first, second)
onlyfirst <- first[!first %in% second] # only in 'first', same as: setdiff(first, second)
onlysecond <- second[!second %in% first] # only in 'second', same as: setdiff(second, first)
length(both)
length(onlyfirst)
length(onlysecond)
#> both
#[1] "2" "3" "e" "f" "bar"
#> onlyfirst
#[1] "1" "a" "b" "c" "d" "foo"
#> onlysecond
#[1] "4" "g" "h" "asd"
#> length(both)
#[1] 5
#> length(onlyfirst)
#[1] 6
#> length(onlysecond)
#[1] 4
# If you don't have the 'gplots' package, type: install.packages("gplots")
require("gplots")
venn(list(first.vector = first, second.vector = second))
As mentioned, there are multiple options for plotting Venn diagrams in R; the call above uses gplots.
With sqldf: Slower but very suitable for data frames with mixed types:
library(sqldf)
t1 <- as.data.frame(1:10)
t2 <- as.data.frame(5:15)
sqldf1 <- sqldf('SELECT * FROM t1 EXCEPT SELECT * FROM t2') # subset from t1 not in t2
sqldf2 <- sqldf('SELECT * FROM t2 EXCEPT SELECT * FROM t1') # subset from t2 not in t1
sqldf3 <- sqldf('SELECT * FROM t1 UNION SELECT * FROM t2') # UNION t1 and t2
# sqldf1 (column X1_10): 1 2 3 4
# sqldf2 (column X5_15): 11 12 13 14 15
# sqldf3 (column X1_10): 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Using the same example data as one of the answers above.
A = c("Dog", "Cat", "Mouse")
B = c("Tiger","Lion","Cat")
match(A,B)
[1] NA 3 NA
The match function returns a vector with the positions in B of the values in A, with NA where there is no match. So "Cat", the second element of A, is the third element of B. There are no other matches.
To get the matching values in A and B, you can do:
m <- match(A,B)
A[!is.na(m)]
"Cat"
B[m[!is.na(m)]]
"Cat"
To get the non-matching values in A and B:
A[is.na(m)]
"Dog" "Mouse"
# m indexes A, so match in the other direction to find the values of B not in A
B[is.na(match(B, A))]
"Tiger" "Lion"
Further, you can use length() to get the total number of matching and non-matching values.
If A is a data.table with field a of type list, with entries themselves as vectors of a primitive type, e.g. created as follows
A<-data.table(a=c(list(c("abc","def","123")),list(c("ghi","zyx"))),d=c(9,8))
and B is a list with vector of primitive entries, e.g. created as follows
B<-list(c("ghi","zyx"))
and you're attempting to find which (if any) element of A$a matches B
A[sapply(a,identical,unlist(B))]
if you just want the entry in a
A[sapply(a,identical,unlist(B)),a]
if you want the matching indices of a
A[,which(sapply(a,identical,unlist(B)))]
if instead B is itself a data.table with the same structure as A, e.g.
B<-data.table(b=c(list(c("zyx","ghi")),list(c("abc","def",123))),z=c(5,7))
and you're looking for the intersection of the two lists by one column, where you require the same order of vector elements.
# give the entry in A for in which A$a matches B$b
A[,`:=`(res=unlist(sapply(list(a),function(x,y){
x %in% unlist(lapply(y,as.vector,mode="character"))
},list(B[,b]),simplify=FALSE)))
][res==TRUE
][,res:=NULL][]
# get T/F for each index of A
A[,sapply(list(a),function(x,y){
x %in% unlist(lapply(y,as.vector,mode="character"))
},list(B[,b]),simplify=FALSE)]
Note that you can't do something as easy as
setkey(A,a)
setkey(B,b)
A[B]
to join A&B because you cannot key on a field of type list in data.table 1.12.2
similarly, you cannot ask
A[a==B[,b]]
even if A and B are identical, as the == operator hasn't been implemented in R for type list

compare two characters based on subset

I have a simple dataframe with three columns:
df <- data.frame(x = c(1,1,2,2,3),
y = c(rep(1:2,2),1),
target = c('a','a','a','b','a'))
I would like to compare the strings in the target column (find out whether they are equal or not, i.e., TRUE or FALSE) within every level of x (same number for x).
First I would like to compare lines 1 and 2, then 3 and 4 ...
My problem is that I am missing some comparisons, for example, line 5 has only one case instead of two - so it should turn out to be FALSE.
Variable y indicates the first and second case within x.
I played around with ddply doing something like:
ddply(df, .(x), summarise,
ifelse(as.character(df[df$y == '1',]$target),
as.character(df[df$y == '2',]$target),0,1))
which is ugly ...
and does not work ...
Any insights how I could achieve this comparison?
Thanks
ddply(df, .(x), function(d) NROW(d) == 2 & d$target[1] == d$target[2])
This assumes you want the value to be TRUE only if there are exactly 2 rows with that 'x' value. If it is possible for there to be 3 or more, and you want it to be TRUE if all target values are identical, you could do:
ddply(df, .(x), function(d) NROW(d) > 1 & length(unique(d$target)) == 1)
Here is a base R solution, assuming I have followed what you wanted correctly. foo() is a function that compares the two target values in each subset; we split() the data on df$x and then lapply() or sapply() foo() over the subsets.
foo <- function(x) {
with(x, {if(length(target) < 2) {
FALSE
} else {
isTRUE(all.equal(target[1], target[2]))
}})
}
lapply(split(df, df$x), foo)
sapply(split(df, df$x), foo)
Which produces this output
> lapply(split(df, df$x), foo)
$`1`
[1] TRUE
$`2`
[1] FALSE
$`3`
[1] FALSE
>
> sapply(split(df, df$x), foo)
1 2 3
TRUE FALSE FALSE
ave(as.character(df$target), df$x,
FUN=function(z) if ( length(z)==2 && length(unique(z))==1){TRUE} else{ FALSE })
[1] "TRUE" "TRUE" "FALSE" "FALSE" "FALSE"
Or ... if you only want the results by group ...., use aggregate:
> aggregate(as.character(df$target), list(df$x),
+ FUN=function(z) if ( length(z)==2 && length(unique(z))==1){TRUE} else{ FALSE })
Group.1 x
1 1 TRUE
2 2 FALSE
3 3 FALSE

Resources