Is there an efficient way of comparing two data frames - r

I have two data frames with different numbers of rows but the same number of columns. In the example below, data frame 1 is 4 x 2 and data frame 2 is 3 x 2. I need a 4 x 3 logical matrix where TRUE indicates that the corresponding rows of the two data frames match in every column. This example works, but it takes a very long time to run with larger data frames (I'm trying two data frames with about 5,000 rows, but still just two columns). Is there a more efficient way of doing this?
> df1 <- data.frame(row.names=1:4, var1=c(TRUE, TRUE, FALSE, FALSE), var2=c(1,2,3,4))
> df2 <- data.frame(row.names=5:7, var1=c(FALSE, TRUE, FALSE), var2=c(5,2,3))
>
> m1 <- t(as.matrix(df1))
> m2 <- as.matrix(df2)
>
> apply(m2, 1, FUN=function(x) { apply(m1, 2, FUN=function(y) { all(x==y) } ) })
5 6 7
1 FALSE FALSE FALSE
2 FALSE TRUE FALSE
3 FALSE FALSE TRUE
4 FALSE FALSE FALSE
Thanks in advance for any help.

I was drawn here by your post on R-bloggers: http://jason.bryer.org/posts/2013-01-24/Comparing_Two_Data_Frames.html
If, like you say, your data has no numeric vectors, then I think I can suggest a faster approach. It consists of:
turning your two data.frames into two matrices of integers
computing the Euclidean distance between the rows of the two matrices
Quick example using your data:
mat1 <- as.matrix(sapply(df1, as.integer))
mat2 <- as.matrix(sapply(df2, as.integer))
library(fields)
rdist(mat1, mat2) < 1e-9
# [,1] [,2] [,3]
# [1,] FALSE FALSE FALSE
# [2,] FALSE TRUE FALSE
# [3,] FALSE FALSE TRUE
# [4,] FALSE FALSE FALSE
A few comments:
if your data contained vectors of characters, you would have to convert them into factors and make sure that they share the same factor levels.
I used the fields package to compute the Euclidean distance. It uses a Fortran implementation and is as far as I know the fastest R package around for the task (and I have tested many, trust me.)
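The first comment can be sketched as follows. This is only a minimal illustration with made-up character columns (df_a, df_b and the column grp are hypothetical names, not from the question); the point is that both columns are coded against one shared set of levels before they are compared.

```r
# Hypothetical character data; all names here are for illustration only
df_a <- data.frame(grp = c("x", "y", "z"), stringsAsFactors = FALSE)
df_b <- data.frame(grp = c("z", "w"), stringsAsFactors = FALSE)

# Build one set of levels from both columns so equal strings get equal codes
lev <- union(df_a$grp, df_b$grp)
code_a <- as.integer(factor(df_a$grp, levels = lev))
code_b <- as.integer(factor(df_b$grp, levels = lev))

# Equal integer codes now mean equal strings across the two data frames
same <- outer(code_a, code_b, `==`)
```

Converting each column to a factor on its own would give "z" the code 3 in df_a but the code 2 in df_b (levels are sorted per column), so the resulting integer matrices would not be comparable.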

I'm honestly not sure if this will be faster, but you might try:
foo <- Vectorize(function(x,y) {all(df1[x,] == df2[y,])})
> outer(1:4,1:3,FUN = foo)
[,1] [,2] [,3]
[1,] FALSE FALSE FALSE
[2,] FALSE TRUE FALSE
[3,] FALSE FALSE TRUE
[4,] FALSE FALSE FALSE
I feel compelled to at least mention the danger in using == for comparisons as opposed to all.equal or identical. I'm presuming that you're comfortable enough with the data types involved that this won't be a problem.
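To illustrate the kind of problem meant here, a small sketch with floating-point values (the variable names are just for this example): == and identical() compare exactly, while all.equal() compares within a tolerance.

```r
x <- 0.1 + 0.2
y <- 0.3

exact    <- x == y                    # exact comparison of doubles
strict   <- identical(x, y)           # also exact
tolerant <- isTRUE(all.equal(x, y))   # comparison within a small tolerance
```

Here exact and strict are FALSE while tolerant is TRUE, because 0.1 + 0.2 is not represented by exactly the same double as 0.3. With logical and integer columns, as in the question, == is safe.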

I suspect that the optimal solution depends on how many unique rows and how many total rows you have.
For the example on your blog, where there are 1000-1500 rows but only 20 unique values (for the seed you set there), I think it's faster to do this:
assign ids to each unique row and then
run outer on the vector of ids seen in each data.frame.
Here's the performance I got. @flodel's approach does about the same on my computer; it's the third one below. Disclaimer: I don't know much about running these kinds of tests.
> set.seed(2112)
> df1 <- data.frame(row.names=1:1000,
+ var1=sample(c(TRUE,FALSE), 1000, replace=TRUE),
+ var2=sample(1:10, 1000, replace=TRUE) )
> df2 <- data.frame(row.names=1001:2500,
+ var1=sample(c(TRUE,FALSE), 1500, replace=TRUE),
+ var2=sample(1:10, 1500, replace=TRUE))
>
> # candidate method on blog
> system.time({
+ df1$var3 <- apply(df1, 1, paste, collapse='.')
+ df2$var3 <- apply(df2, 1, paste, collapse='.')
+ df6 <- sapply(df2$var3, FUN=function(x) { x == df1$var3 })
+ dimnames(df6) <- list(row.names(df1), row.names(df2))
+ })
user system elapsed
1.13 0.00 1.14
>
> rownames(df1) <- NULL # in case something weird happens to rownames on merge
> rownames(df2) <- NULL
> # id method
> system.time({
+ df12 <- unique(rbind(df1,df2))
+ df12$id <- rownames(df12)
+
+ id1 <- merge(df12,df1)$id
+ id2 <- merge(df12,df2)$id
+
+ x <- outer(id1,id2,`==`)
+ })
user system elapsed
0.11 0.02 0.13
>
> library(fields)
> # rdist from fields method
> system.time({
+ mat1 <- as.matrix(sapply(df1, as.integer))
+ mat2 <- as.matrix(sapply(df2, as.integer))
+ rdist(mat1, mat2) < 1e-9
+ })
user system elapsed
0.15 0.00 0.16
I guess the rbind and the merges would make this solution relatively more costly with different data.
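One caveat with the id method as written: merge() sorts its result on the key columns by default, so id1 and id2 may not come back in the original row order. A hedged sketch of the same id idea that keeps row order, using match() on pasted row keys instead of merge(). It assumes the column values survive conversion to character unambiguously (fine for logicals and integers, less so for arbitrary doubles); key1, key2 and keys are names introduced here.

```r
df1 <- data.frame(var1 = c(TRUE, TRUE, FALSE, FALSE), var2 = c(1, 2, 3, 4))
df2 <- data.frame(var1 = c(FALSE, TRUE, FALSE), var2 = c(5, 2, 3))

# One string key per row; "\r" is an arbitrary separator unlikely to occur in data
key1 <- do.call(paste, c(df1, sep = "\r"))
key2 <- do.call(paste, c(df2, sep = "\r"))

# Assign an integer id to each distinct row, preserving the original row order
keys <- unique(c(key1, key2))
id1 <- match(key1, keys)
id2 <- match(key2, keys)

# Rows match exactly when their ids are equal
res <- outer(id1, id2, `==`)
dimnames(res) <- list(rownames(df1), rownames(df2))
```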

Related

Automate group generation according to intervals in string

I have data on covariates for several units. Additionally, I have access to a scoring rule that ranks my observations according to a score.
I decided to divide my training sample X according to the quantiles of score, which I achieved by using the quantile_group function from the GenericML package.
## Generate data.
set.seed(1986)
n <- 1000
n_val <- 10000
k <- 3
X <- matrix(rnorm(n * k), ncol = k)
X_val <- matrix(rnorm(n_val * k), ncol = k)
score <- rexp(n)
score_val <- rexp(n_val)
## Quantiles of score.
library(GenericML)
groups <- quantile_group(score)
head(groups)
#> [-Inf, 0.277) [0.277, 0.678) [0.678, 1.34) [1.34, Inf]
#> [1,] TRUE FALSE FALSE FALSE
#> [2,] FALSE FALSE FALSE TRUE
#> [3,] FALSE FALSE TRUE FALSE
#> [4,] FALSE TRUE FALSE FALSE
#> [5,] FALSE TRUE FALSE FALSE
#> [6,] FALSE FALSE TRUE FALSE
The g-th column of groups consists of TRUEs and FALSEs denoting membership to the g-th quantile of score. My next step is to divide units in the validation sample X_val using the same partition of groups. To clarify, I want to divide score_val in four groups defined by the intervals given by colnames(groups):
colnames(groups)
#> [1] "[-Inf, 0.277)" "[0.277, 0.678)" "[0.678, 1.34)" "[1.34, Inf]"
I need to automate this.
I think this approach gets what you are looking for. I don't use the GenericML package because, if I understood correctly, you only want to divide X_val into subsets.
# Load library
library(dplyr)
# Generate data
set.seed(1986)
n <- 1000
n_val <- 10000
k <- 3
X <- matrix(rnorm(n * k), ncol = k)
# Here I use "as.data.frame.matrix" in order to add the group (according to the interval)
X_val <- as.data.frame.matrix(matrix(rnorm(n_val * k), ncol = k))
score <- rexp(n)
score_val <- rexp(n_val)
# Get the quantiles of score
q.score <- quantile(score)
# Divide score_val according to the quantiles of q.score
group.var <- cut(score_val, breaks = c(-Inf, q.score[2:4], Inf))
# Add "group.var" to X_val matrix
X_val$group.var <- group.var
# Divide the information according to "group.var"
new_X_val <- X_val %>%
group_split(group.var)
At the end, what you get is new_X_val, a list with 4 elements, one for each quantile.
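If a logical membership matrix like groups is wanted for the validation sample, a base R sketch could look like this. It assumes the intervals are left-closed, as the column names of groups suggest, hence right = FALSE (cut() uses right-closed intervals by default). The names breaks, grp and groups_val are introduced here for illustration.

```r
set.seed(1986)
score <- rexp(1000)
score_val <- rexp(10000)

# Quartile cut points taken from the training scores
breaks <- c(-Inf, unname(quantile(score, probs = c(0.25, 0.5, 0.75))), Inf)

# Assign each validation score to a left-closed interval [a, b)
grp <- cut(score_val, breaks = breaks, right = FALSE)

# One logical column per interval, analogous to the layout of `groups`
groups_val <- sapply(levels(grp), function(l) grp == l)
```

Each row of groups_val has exactly one TRUE, and rows of X_val can then be selected with, for example, X_val[groups_val[, 1], ].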

identify if matrices have NAs in the same position (logical)

I want to identify whether two matrices have NAs in the same spots.
Setup:
We have three matrices. I want a function that tells me that mat1 and mat2 have NAs in identical spots, and that mat1 (or mat2) and mat3 have NAs in different spots.
mat1 <- matrix(nrow=2, ncol =2, data =c(NA,0,0,NA))
mat2 <- matrix(nrow=2, ncol =2, data=c(NA,0,0,NA))
mat3 <- matrix(nrow=2, ncol=2, data = c(NA,0,0,0))
Compare the NA status of all elements:
> all(is.na(mat1) == is.na(mat2))
[1] TRUE
> all(is.na(mat1) == is.na(mat3))
[1] FALSE
In a function I'd do this:
> nanana = function(m1, m2){!any(is.na(m1) != is.na(m2))}
I've inverted the logic so that any can stop checking if it finds any difference. If you use all it has to go over every element. I'm not sure if this kind of short-circuiting is in R but it might save you a millisecond or two.
> nanana(mat1, mat2)
[1] TRUE
> nanana(mat1, mat3)
[1] FALSE
We can write a function which compares the positions of NA elements in two matrices
identical_NA_matrix <- function(m1, m2) {
identical(which(is.na(m1), arr.ind = TRUE), which(is.na(m2), arr.ind = TRUE))
}
identical_NA_matrix(mat1,mat3)
#[1] FALSE
identical_NA_matrix(mat1,mat2)
#[1] TRUE
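Another variant, as a hedged sketch: compare the logical NA masks directly with identical(). same_na is a hypothetical helper name. Unlike the elementwise == comparison, this returns FALSE instead of raising an error when the two matrices have different dimensions; note that it also treats differing dimnames as a difference.

```r
same_na <- function(m1, m2) identical(is.na(m1), is.na(m2))

mat1 <- matrix(nrow = 2, ncol = 2, data = c(NA, 0, 0, NA))
mat2 <- matrix(nrow = 2, ncol = 2, data = c(NA, 0, 0, NA))
mat3 <- matrix(nrow = 2, ncol = 2, data = c(NA, 0, 0, 0))

r12 <- same_na(mat1, mat2)                      # same NA pattern
r13 <- same_na(mat1, mat3)                      # different NA pattern
r_dim <- same_na(mat1, matrix(NA, 3, 3))        # different dimensions, no error
```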

Check if all rows are the same in a matrix

I'm looking to see if sample(..., replace=T) results in sampling the same row n times. I see the duplicated function checks if elements are repeated by returning a logical vector for each index, but I need to see if one element is repeated n times (single boolean value). What's the best way to go about this?
Here's just an example. Some function on this matrix should return TRUE
t(matrix(c(rep(c(rep(4,1),rep(5,1)),8)),nrow=2,ncol=8))
[,1] [,2]
[1,] 4 5
[2,] 4 5
[3,] 4 5
[4,] 4 5
[5,] 4 5
[6,] 4 5
[7,] 4 5
[8,] 4 5
Here is one solution that works to produce the true/false result you are looking for:
m <- t(matrix(c(rep(c(rep(4,1),rep(5,1)),8)),nrow=2,ncol=8))
apply(m, 2, function(x) length(unique(x)) == 1)
[1] TRUE TRUE
m <- rbind(m, c(4, 6))
apply(m, 2, function(x) length(unique(x)) == 1)
[1] TRUE FALSE
If you want a single boolean value saying whether every column holds just one distinct value, you can do:
all(apply(m, 2, function(x) length(unique(x)) == 1) == TRUE)
[1] FALSE
A little cleaner looking (and easier to tell what the code is doing):
m <- t(matrix(c(rep(c(rep(4,1),rep(5,1)),8)),nrow=2,ncol=8))
apply(m, 2, function(x) all(x==x[1]))
[1] TRUE TRUE
Think I've got my solution.
B <- t(matrix(c(rep(c(rep(4,1),rep(5,1)),8)),nrow=2,ncol=8))
length(table(B)) == ncol(B)
[1] TRUE
B <- rbind(B,c(4,6)) # different sample
length(table(B)) == ncol(B)
[1] FALSE
We could also replicate the first row, compare with the original matrix, get the colSums and check whether it is equal to nrow of 'm'
colSums(m[1,][col(m)]==m)==nrow(m)
[1] TRUE TRUE
Or another option would be to check the variance
!apply(m, 2, var)
#[1] TRUE TRUE
For a matrix you can apply unique directly to your matrix, no need for apply:
nrow(unique(m)) == 1L
[1] TRUE
nrow(unique(rbind(m, c(6,7)))) == 1L
[1] FALSE
From the documentation ?unique:
The array method calculates for each element of the dimension specified by MARGIN if the remaining dimensions are identical to those for an earlier element (in row-major order). This would most commonly be used for matrices to find unique rows (the default) or columns (with MARGIN = 2).
Alternatively you can transpose your matrix and leverage vectorized comparison:
all(m[1,] == t(m))
[1] TRUE
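The transpose in the last line matters because R stores matrices column by column and recycles the shorter operand in that order: against t(m), the vector m[1,] lines up with each original row in turn. An equivalent but more explicit sketch, assuming m is a plain matrix without NAs, repeats the first row and compares elementwise (rows_all_equal is a name made up for this example):

```r
rows_all_equal <- function(m) {
  # Matrix with the same shape as m in which every row is a copy of row 1
  first <- m[rep(1, nrow(m)), , drop = FALSE]
  all(m == first)
}

m <- matrix(c(4, 5), nrow = 8, ncol = 2, byrow = TRUE)
res_same <- rows_all_equal(m)                    # all rows are (4, 5)
res_diff <- rows_all_equal(rbind(m, c(4, 6)))    # last row differs
```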

Check that a vector is contained in a matrix in R

I can't believe this is taking me this long to figure out, and I still can't figure it out.
I need to keep a collection of vectors, and later check that a certain vector is in that collection. I tried lists combined with %in% but that doesn't appear to work properly.
My next idea was to create a matrix and rbind vectors to it, but now I don't know how to check if a vector is contained in a matrix. %in% appears to compare sets and not exact rows. The same appears to apply to intersect.
Help much appreciated!
Do you mean like this:
wantVec <- c(3,1,2)
myList <- list(A = c(1:3), B = c(3,1,2), C = c(2,3,1))
sapply(myList, function(x, want) isTRUE(all.equal(x, want)), wantVec)
## or, is the vector in the set?
any(sapply(myList, function(x, want) isTRUE(all.equal(x, want)), wantVec))
We can do a similar thing with a matrix:
myMat <- matrix(unlist(myList), ncol = 3, byrow = TRUE)
## As the vectors are now in the rows, we use apply over the rows
apply(myMat, 1, function(x, want) isTRUE(all.equal(x, want)), wantVec)
## or
any(apply(myMat, 1, function(x, want) isTRUE(all.equal(x, want)), wantVec))
Or by columns:
myMat2 <- matrix(unlist(myList), ncol = 3)
## As the vectors are now in the cols, we use apply over the cols
apply(myMat2, 2, function(x, want) isTRUE(all.equal(x, want)), wantVec)
## or
any(apply(myMat2, 2, function(x, want) isTRUE(all.equal(x, want)), wantVec))
If you need to do this a lot, write your own function
vecMatch <- function(x, want) {
isTRUE(all.equal(x, want))
}
And then use it, e.g. on the list myList:
> sapply(myList, vecMatch, wantVec)
A B C
FALSE TRUE FALSE
> any(sapply(myList, vecMatch, wantVec))
[1] TRUE
Or even wrap the whole thing:
vecMatch <- function(x, want) {
out <- sapply(x, function(x, want) isTRUE(all.equal(x, want)), want)
any(out)
}
> vecMatch(myList, wantVec)
[1] TRUE
> vecMatch(myList, 5:3)
[1] FALSE
EDIT: Quick comment on why I used isTRUE() wrapped around the all.equal() calls. This is due to the fact that where the two arguments are not equal, all.equal() doesn't return a logical value (FALSE):
> all.equal(1:3, c(3,2,1))
[1] "Mean relative difference: 1"
isTRUE() is useful here because it returns TRUE iff its argument is TRUE, whilst it returns FALSE if it is anything else.
> M
[,1] [,2] [,3]
[1,] 1 4 7
[2,] 2 5 8
[3,] 3 6 9
v <- c(2, 5, 8)
check each column:
c1 <- which(M[, 1] == v[1])
c2 <- which(M[, 2] == v[2])
c3 <- which(M[, 3] == v[3])
Here is a way to still use intersect() on more than 2 elements
> intersect(intersect(c1, c2), c3)
[1] 2
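For a matrix with an arbitrary number of columns, the nested intersect() calls can be folded with Reduce(). This is a sketch under the assumption that M is the 3 x 3 matrix printed above (built here with matrix(1:9, nrow = 3)); hits and rows are names introduced for the example.

```r
M <- matrix(1:9, nrow = 3)
v <- c(2, 5, 8)

# For each column j, the row indices where M[, j] equals v[j]
hits <- lapply(seq_along(v), function(j) which(M[, j] == v[j]))

# Rows that match in every column
rows <- Reduce(intersect, hits)
```

length(rows) > 0 then answers whether v occurs as a row of M.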

Insert elements into a vector at given indexes

I have a logical vector, for which I wish to insert new elements at particular indexes. I've come up with a clumsy solution below, but is there a neater way?
probes <- rep(TRUE, 15)
ind <- c(5, 10)
probes.2 <- logical(length(probes)+length(ind))
probes.ind <- ind + 1:length(ind)
probes.original <- (1:length(probes.2))[-probes.ind]
probes.2[probes.ind] <- FALSE
probes.2[probes.original] <- probes
print(probes)
gives
[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
and
print(probes.2)
gives
[1] TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE FALSE
[13] TRUE TRUE TRUE TRUE TRUE
So it works but is ugly looking - any suggestions?
These are all very creative approaches. I think working with indexes is definitely the way to go (Marek's solution is very nice).
I would just mention that there is a function to do roughly that: append().
probes <- rep(TRUE, 15)
probes <- append(probes, FALSE, after=5)
probes <- append(probes, FALSE, after=11)
Or you could do this in a loop over your indexes (the "after" value has to grow on each iteration):
probes <- rep(TRUE, 15)
ind <- c(5, 10)
for(i in 0:(length(ind)-1))
probes <- append(probes, FALSE, after=(ind[i+1]+i))
Incidentally, this question was also previously asked on R-Help. As Barry says:
"Actually I'd say there were no ways of doing this, since I dont think you can actually insert into a vector - you have to create a new vector that produces the illusion of insertion!"
You can do some magic with indexes:
First create vector with output values:
probs <- rep(TRUE, 15)
ind <- c(5, 10)
val <- c( probs, rep(FALSE,length(ind)) )
# > val
# [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
# [13] TRUE TRUE TRUE FALSE FALSE
Now the trick: each old element gets its rank, and each new element gets a half-rank.
id <- c( seq_along(probs), ind+0.5 )
# > id
# [1] 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 10.0 11.0 12.0 13.0 14.0 15.0
# [16] 5.5 10.5
Then use order to sort in proper order:
val[order(id)]
# [1] TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE FALSE
# [13] TRUE TRUE TRUE TRUE TRUE
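The half-rank trick can be wrapped in a small helper. This is a sketch, and insert_after is a made-up name rather than a base R function; it assumes ind holds positions in the original vector after which the new values go.

```r
insert_after <- function(x, ind, values) {
  # Append the new values, recycling them to one per insertion point
  vals <- c(x, rep(values, length.out = length(ind)))
  # Old elements keep their rank; new elements sort just after position ind
  id <- c(seq_along(x), ind + 0.5)
  vals[order(id)]
}

res <- insert_after(rep(TRUE, 15), c(5, 10), FALSE)
```

Because the positions are expressed relative to the original vector, ind does not need to be sorted or shifted by hand.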
probes <- rep(TRUE, 1000000)
ind <- c(50:100)
val <- rep(FALSE,length(ind))
new.probes <- vector(mode="logical",length(probes)+length(val))
new.probes[-ind] <- probes
new.probes[ind] <- val
Some timings:
My method
user system elapsed
0.03 0.00 0.03
Marek method
user system elapsed
0.18 0.00 0.18
R append with for loop
user system elapsed
1.61 0.48 2.10
How about this:
> probes <- rep(TRUE, 15)
> ind <- c(5, 10)
> probes.ind <- rep(NA, length(probes))
> probes.ind[ind] <- FALSE
> new.probes <- as.vector(rbind(probes, probes.ind))
> new.probes <- new.probes[!is.na(new.probes)]
> new.probes
[1] TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE FALSE
[13] TRUE TRUE TRUE TRUE TRUE
That is sorta tricky. Here's one way. It iterates over the list, inserting each time, so it's not too efficient.
probes <- rep(TRUE, 15)
probes.ind <- ind + 0:(length(ind)-1)
for (i in probes.ind) {
probes <- c(probes[1:i], FALSE, probes[(i+1):length(probes)])
}
> probes
[1] TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE FALSE
[13] TRUE TRUE TRUE TRUE TRUE
This should even work if ind has repeated elements, although ind does need to be sorted for the probes.ind construction to work.
Or you can do it using the insertRow function from the miscTools package.
probes <- rep(TRUE, 15)
ind <- c(5,10)
for (i in ind){
probes <- as.vector(insertRow(as.matrix(probes), i, FALSE))
}
I came up with a good answer that's easy to understand and fairly fast to run, building off Wojciech's answer above. I'll adapt the method for the example here, but it can be easily generalized to pretty much any data type for an arbitrary pattern of missing points (shown below).
probes <- rep(TRUE, 15)
ind <- c(5,10)
probes.final <- rep(FALSE, length(probes)+length(ind))
probes.final[-ind] <- probes
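Note that with negative indexing, ind refers to positions in the final vector, so the snippet above places FALSE at positions 5 and 10. A sketch of how the same idea could reproduce the question's output (FALSE inserted after the 5th and 10th original elements), assuming ind is sorted; pos is a name introduced here:

```r
probes <- rep(TRUE, 15)
ind <- c(5, 10)

# Shift each index by the number of insertions up to and including it
pos <- ind + seq_along(ind)

probes.final <- rep(FALSE, length(probes) + length(ind))
probes.final[-pos] <- probes   # original values fill every slot except pos
```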
The data I needed this for is sampled at a regular interval, but many samples are thrown out, and the resulting data file only includes the timestamps and measurements for those retained. I needed to produce a vector containing all the timestamps and a data vector with NAs inserted for timestamps that were tossed. I used the "not in" function stolen from here to make it a bit simpler.
`%notin%` <- Negate(`%in%`)
dat <- rnorm(50000) # Data given
times <- seq(from=554.3, by=0.1, length.out=70000) # "Original" time stamps
times <- times[-sample(2:69999, 20000)] # "Given" times with arbitrary points missing from interior
times.final <- seq(from=times[1], to=times[length(times)], by=0.1)
na.ind <- which(times.final %notin% times)
dat.final <- rep(NA, length(times.final))
dat.final[-na.ind] <- dat
Um, hi, I had the same question, but I couldn't understand what people had answered because I'm still learning the language. So I tried to make my own solution, and I suppose it works! I created a vector and wanted to insert the value 100 after the 3rd, 5th and 6th indexes. This is what I wrote.
vector <- c(0:9)
indexes <- c(6, 3, 5)
indexes <- indexes[order(indexes)]
i <- 1
j <- 0
while(i <= length(indexes)){
vector <- append(vector, 100, after = indexes[i] + j)
i <-i + 1
j <- j + 1
}
vector
The vector "indexes" must be in ascending order for this to work. This is why I put them in order at the third line.
The variable "j" is necessary because at each iteration, the length of the new vector increases and the original values are moved.
If you wish to insert several new values next to each other, simply repeat the index. For instance, by assigning indexes <- c(3, 5, 5, 5, 6), you should get vector == 0 1 2 100 3 4 100 100 100 5 100 6 7 8 9
