Row-wise comparison between a vector and a matrix in R

I have two datasets from 10 people. One is a vector, and the other is a matrix. I want to check whether the first element of the vector is contained in the first row of the matrix, whether the second element of the vector is contained in the second row of the matrix, and so on.
So I changed the vector into a matrix and used apply to compare them row-wise, but the result was not correct.
Here are the datasets.
df1<-matrix(c(rep(0,10),2,4,7,6,5,7,4,2,2,2),ncol=2)
df1
# [,1] [,2]
# [1,] 0 2
# [2,] 0 4
# [3,] 0 7
# [4,] 0 6
# [5,] 0 5
# [6,] 0 7
# [7,] 0 4
# [8,] 0 2
# [9,] 0 2
#[10,] 0 2
df2<-c(1,3,6,4,1,3,3,2,2,5)
df2<-as.matrix(df2)
apply(df2, 1, function(x) any(x==df1))
# [1] FALSE FALSE TRUE TRUE FALSE FALSE FALSE TRUE TRUE TRUE
However, the result should be FALSE everywhere except the 8th and 9th elements.
Can anyone correct the function? Thanks!

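This vectorized code should be very efficient. It assumes df2 is still the original vector; after df2<-as.matrix(df2), comparing the 10x2 matrix with the 10x1 matrix fails with a "non-conformable arrays" error: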
> as.logical( rowSums(df1==df2))
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE FALSE

Here are a few approaches you could take.
Two calls to apply:
# first by column to check whether the values are equal,
# then by row to see whether any row contains TRUE
apply(apply(df1,2,`==`,df2),1,any)
Use sapply and seq_along:
sapply(seq_along(df2), function(x, y, i) y[i] %in% x[i, ], y = df2 ,x = df1)
Repeat df2 to the same length as df1 and then compare:
rowSums(df1==rep(df2, length = length(df1))) > 0
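A minimal sketch of why the vectorized comparison lines up row by row, assuming df2 is kept as a plain vector (called v here, a name not in the original). R recycles v down each column of df1, so v[i] is compared with every entry of row i. The helper name row_contains is hypothetical.

```r
df1 <- matrix(c(rep(0, 10), 2, 4, 7, 6, 5, 7, 4, 2, 2, 2), ncol = 2)
v   <- c(1, 3, 6, 4, 1, 3, 3, 2, 2, 5)   # plain vector, not as.matrix()

# Hypothetical helper: TRUE where v[i] occurs somewhere in row i of m.
row_contains <- function(m, v) {
  stopifnot(length(v) == nrow(m))   # one vector element per matrix row
  rowSums(m == v) > 0               # v is recycled down each column
}

in_row <- row_contains(df1, v)
which(in_row)
# [1] 8 9
```

The original apply call returned TRUE for elements 3, 4 and 10 because any(x==df1) looks for each value anywhere in the whole matrix rather than in its own row.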

Related

Compare each matrix row to a vector

I have a 3-column matrix, and I want to compare its rows with a vector. I know that the easiest way is apply(table==vector,1,sum)>(length(vector)-1), but since I kept getting the wrong rows marked, I started to dig into the partial results. Below are my code and the mistake R seems to be making.
library(gtools) # permutations() is assumed to come from the gtools package
transition_matrix<-cbind(permutations(n=7,r=4,v=c(0,1,2,3,4,5,6),repeats.allowed=T),prob=0,n=1)
vector<-c(1,0,1)
table <- transition_matrix[,c(1:3)]
table[59,]
# 0 1 1
(table==vector)[59,]
# TRUE TRUE TRUE
So I am just puzzled staring at my code and honestly not understanding why it does not work. I might be missing something because if I compare directly row number 59 with the vector I get the right result.
As Chi Pak notes, table == vector is operating down columns, but you want to compare each row to the vector.
One way to get the behavior you want is to transpose the matrix before comparing to the vector, and then re-transposing afterwards.
Sample data:
(table <- matrix(rep(0:2, 4), 4))
# [,1] [,2] [,3]
# [1,] 0 1 2
# [2,] 1 2 0
# [3,] 2 0 1
# [4,] 0 1 2
(vector <- c(0, 0, 1))
# [1] 0 0 1
Calculation:
t(t(table) == vector)
# [,1] [,2] [,3]
# [1,] TRUE FALSE FALSE
# [2,] FALSE FALSE FALSE
# [3,] FALSE TRUE TRUE
# [4,] TRUE FALSE FALSE
One advantage of this compared to an approach using apply is that all the operations are vectorized, which means this will be a good deal more efficient on large matrices. To see this, let's look at a matrix with one million rows:
set.seed(144)
table <- matrix(sample(0:1, 3e6, replace=TRUE), 1e6)
system.time(t(t(table) == vector))
# user system elapsed
# 0.066 0.013 0.078
system.time(t(apply(table,1,function(x) x==vector)))
# user system elapsed
# 2.508 0.057 2.576
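To finish the original task (flagging the rows that equal the vector in every column), the transposed comparison can be reduced with colSums. The sketch below uses a hypothetical target c(0, 1, 2), chosen so that some rows of the sample matrix match, and the names m and target to avoid masking base::table and base::vector. sweep() is shown as an alternative that avoids the explicit transposes.

```r
m      <- matrix(rep(0:2, 4), 4)   # same sample matrix as above
target <- c(0, 1, 2)               # hypothetical vector to look for

# Each column of t(m) is one row of m, so target lines up with the row entries.
row_matches <- colSums(t(m) == target) == ncol(m)
which(row_matches)
# [1] 1 4

# Alternative: sweep() applies `==` across the columns (MARGIN = 2).
row_matches_sweep <- rowSums(sweep(m, 2, target, "==")) == ncol(m)
identical(row_matches, row_matches_sweep)
# [1] TRUE
```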
Explanation
When you compare a vector to a matrix, R recycles the vector down the columns, so the comparison runs column-wise.
See the following reproducible example:
table <- matrix(c(rep(0,60),rep(1,60),rep(1,60)),ncol=3)
vector <- c(1,0,1)
head(table==vector)
[,1] [,2] [,3]
[1,] FALSE TRUE TRUE
[2,] TRUE FALSE FALSE
[3,] FALSE TRUE TRUE
[4,] FALSE TRUE TRUE
[5,] TRUE FALSE FALSE
[6,] FALSE TRUE TRUE
1,1 is FALSE because vector[1]==1 and table[1,1]==0. 2,1 is TRUE because vector[2]==0 and table[2,1]==0, etc.
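The recycling rule can be checked directly. This small sketch (the names m, v and flat are illustrative, not from the original) flattens a matrix in its column-major storage order and compares it with the recycled vector, which reproduces m == v exactly.

```r
m <- matrix(1:6, nrow = 2)   # rows are (1, 3, 5) and (2, 4, 6)
v <- c(1, 2, 3)

# R stores m column by column: 1 2 3 4 5 6. v is recycled along that order.
flat <- as.vector(m) == rep(v, length.out = length(m))
identical(m == v, matrix(flat, nrow = nrow(m)))
# [1] TRUE

m == v
#      [,1]  [,2]  [,3]
# [1,] TRUE  TRUE FALSE
# [2,] TRUE FALSE FALSE
```

Row 1 of m is (1, 3, 5), yet its second entry is reported as TRUE: it was compared with v[3], not v[2].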
Solution
head(t(apply(table,1,function(x) x==vector)))
[,1] [,2] [,3]
[1,] FALSE FALSE TRUE
[2,] FALSE FALSE TRUE
[3,] FALSE FALSE TRUE
[4,] FALSE FALSE TRUE
[5,] FALSE FALSE TRUE
[6,] FALSE FALSE TRUE

Per-row index of a matrix in R (including 0-rows)

Assume we have the following logical matrix in R:
A <- matrix(as.logical(c(0,0,0,1,0,1,0,0,1,0,0,0)), nrow=4)
# [,1] [,2] [,3]
# [1,] FALSE FALSE TRUE
# [2,] FALSE TRUE FALSE
# [3,] FALSE FALSE FALSE
# [4,] TRUE FALSE FALSE
I want to convert this matrix into a column-wise index using
B <- column_wise_index(A)
where column_wise_index returns a vector containing the same number of elements as the number of rows in A (4), and each element contains the column of A that has a logical value TRUE. For A above, B should resemble
B <- c(3,2,0,1)
# [1] 3 2 0 1
where 0 indicates a row that has no TRUE value.
The closest I've come is applying which by row:
unlist(apply(A, 1, function(x) which(x)))
# [1] 3 2 1
However, the result skips 0, and I'm not sure how efficient this is for large matrices (say ~100K x 100 entries).
Here is a solution that is more in the spirit of how you started, but you have to admire @rawr's clever solution.
A <- matrix(as.logical(c(0,0,0,1,0,1,0,0,1,0,0,0)), nrow=4)
TrueSpots = apply(A, 1, which)
TrueSpots[!sapply(TrueSpots, length)] = 0
unlist(TrueSpots)
[1] 3 2 0 1
Update including @akrun's suggestion:
TrueSpots = apply(A, 1, which)
TrueSpots[!lengths(TrueSpots)] = 0
unlist(TrueSpots)
[1] 3 2 0 1
max.col(A) identifies the column index where the maximum entry occurs within each row. Ties are broken at random by default. rowSums(A) on a logical matrix counts the TRUE values in each row.
Assuming each row has at most one TRUE value, rowSums(A) is a 0/1 vector, so multiplying the two vectors elementwise zeroes out the rows of A that contain no TRUE.
> A <- matrix(as.logical(c(0,0,0,1,0,1,0,0,1,0,0,0)), nrow=4)
> max.col(A)*rowSums(A)
[1] 3 2 0 1
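Because max.col breaks ties at random, a deterministic variant may be easier to reason about. The following is one possible sketch of the column_wise_index function named in the question, built on which(arr.ind = TRUE). It assumes each row has at most one TRUE; if a row had several, the one in the highest column would win.

```r
A <- matrix(as.logical(c(0,0,0,1,0,1,0,0,1,0,0,0)), nrow = 4)

# Sketch only: one column index per row, 0 for rows without any TRUE.
column_wise_index <- function(A) {
  out  <- integer(nrow(A))            # starts as all zeros
  hits <- which(A, arr.ind = TRUE)    # matrix with columns "row" and "col"
  out[hits[, "row"]] <- hits[, "col"]
  out
}

column_wise_index(A)
# [1] 3 2 0 1
```

This touches only the TRUE cells, so it avoids a per-row function call on a matrix of roughly 100K x 100 entries.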

Subset data frame in R based on matching multiple ranges for multiple variables

I have a problem that seems kind of similar to some previously asked questions on SO, but different enough that I can't quite figure out an elegant solution.
I have a set of real data that I need to match to a database of theoretical values. I'd like to filter based on multiple sets of multiple conditions. For example, if I have the following data.frame of theoretical values,
df <- data.frame(x=c(10,13,16,22,28,30), y=c(1:6))
> df
x y
1 10 1
2 13 2
3 16 3
4 22 4
5 28 5
6 30 6
and I have the following real data,
realdata <- data.frame(x=c(10.05, 13.06, 22.01),y=c(1.02, 1.99, 3.96))
> realdata
x y
1 10.05 1.02
2 13.06 1.99
3 22.01 3.96
I can easily search for which theoretical rows correspond to rows in my real data one at a time with something like this:
tolerance <- .10
subset(df, x>(realdata[1,1]-tolerance) & x<(realdata[1,1]+tolerance) &
  y>(realdata[1,2]-tolerance) & y<(realdata[1,2]+tolerance))
subset(df, x>(realdata[2,1]-tolerance) & x<(realdata[2,1]+tolerance) &
  y>(realdata[2,2]-tolerance) & y<(realdata[2,2]+tolerance))
#...etc for each row of real data
But is there any way to do this for all the rows in my real data without writing a loop? Basically, I want to find all the theoretical rows that correspond to any one of the rows in my real data, within a given tolerance. In reality, my theoretical and real tables have hundreds of thousands of observations, and this is something I do quite a bit, so speed will matter, I think.
Also, if anyone knows a way of determining whether a value is within a range using a single expression that works inside subset(), that would be icing on the cake. Maybe subset is the wrong function to be using, though, in which case never mind.
You can use outer to calculate all pairwise differences between df and realdata and check whether both the x and y differences are below the tolerance:
tolerance <- .10
# x
xx <- abs(outer(df$x, realdata$x, "-")) < tolerance
# y
yy <- abs(outer(df$y, realdata$y, "-")) < tolerance
# if both are within the tolerance the sum of xx and yy will be 2
(mat <- xx + yy > 1)
# [,1] [,2] [,3]
#[1,] TRUE FALSE FALSE
#[2,] FALSE TRUE FALSE
#[3,] FALSE FALSE FALSE
#[4,] FALSE FALSE TRUE
#[5,] FALSE FALSE FALSE
#[6,] FALSE FALSE FALSE
So the first column of mat shows which rows of df are within the tolerance of the first row of realdata (in this case, the first row of df).
Rather inelegantly, return the matching rows of df in the order of the rows of realdata:
lapply(1:ncol(mat), function(i) df[mat[,i], ])
# return all matched data
df[row(mat)[mat], ]
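The outer approach can be condensed into a small helper. This is a sketch under the question's sample data; within_tol and matched are hypothetical names. It also touches on the single-expression range test asked about: abs(a - b) < tol replaces the paired > and < comparisons.

```r
df <- data.frame(x = c(10, 13, 16, 22, 28, 30), y = 1:6)
realdata <- data.frame(x = c(10.05, 13.06, 22.01), y = c(1.02, 1.99, 3.96))
tolerance <- 0.10

# Hypothetical helper: logical matrix, rows index a, columns index b.
within_tol <- function(a, b, tol) abs(outer(a, b, "-")) < tol

mat <- within_tol(df$x, realdata$x, tolerance) &
       within_tol(df$y, realdata$y, tolerance)

# Rows of df that match at least one row of realdata.
matched <- which(rowSums(mat) > 0)
matched
# [1] 1 2 4
df[matched, ]
```

One caveat: outer builds a full nrow(df) by nrow(realdata) matrix, so with hundreds of thousands of rows on both sides it may not fit in memory, and processing realdata in chunks would be one way around that.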
This is an implicit loop with a vectorized test:
apply( realdata, 1,
function(x) abs( x[1] - df[,1] ) < tolerance &
abs( x[2] - df[,2]) <tolerance )
#------------------------
[,1] [,2] [,3]
[1,] TRUE FALSE FALSE
[2,] FALSE TRUE FALSE
[3,] FALSE FALSE FALSE
[4,] FALSE FALSE TRUE
[5,] FALSE FALSE FALSE
[6,] FALSE FALSE FALSE
This does it with no apply functions:
> kronecker( as.matrix(df), as.matrix(realdata), function(x,y) { abs(x -y) <tolerance} )[,c(1,4)]
[,1] [,2]
[1,] TRUE TRUE
[2,] FALSE FALSE
[3,] FALSE FALSE
[4,] FALSE FALSE
[5,] TRUE TRUE
[6,] FALSE FALSE
[7,] FALSE FALSE
[8,] FALSE FALSE
[9,] FALSE FALSE
[10,] FALSE FALSE
[11,] FALSE FALSE
[12,] TRUE TRUE
[13,] FALSE FALSE
[14,] FALSE FALSE
[15,] FALSE FALSE
[16,] FALSE FALSE
[17,] FALSE FALSE
[18,] FALSE FALSE
You can consolidate the two columns with rowSums(.) == 2, which is TRUE only where both x and y match.
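A sketch of that consolidation, assuming the sample df and realdata from the question. The names k, hit and pairs are illustrative. Since the Kronecker result stacks every df row against every realdata row, integer division recovers which pair each matching row refers to.

```r
df <- data.frame(x = c(10, 13, 16, 22, 28, 30), y = 1:6)
realdata <- data.frame(x = c(10.05, 13.06, 22.01), y = c(1.02, 1.99, 3.96))
tolerance <- 0.10

k <- kronecker(as.matrix(df), as.matrix(realdata),
               function(x, y) abs(x - y) < tolerance)[, c(1, 4)]
hit <- which(rowSums(k) == 2)   # both x and y within tolerance

# Row r of the Kronecker result pairs df row i with realdata row j,
# where r = (i - 1) * nrow(realdata) + j.
pairs <- cbind(df_row   = (hit - 1) %/% nrow(realdata) + 1,
               real_row = (hit - 1) %%  nrow(realdata) + 1)
pairs   # df rows 1, 2, 4 paired with realdata rows 1, 2, 3
```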

Retrieve specific entries of a matrix based on values from a data frame

I have a data frame of the form:
my.df = data.frame(ID=c(1,2,3,4,5,6,7), STRAND=c('+','+','+','-','+','-','+'), COLLAPSE=c(0,0,1,0,1,0,0))
and another matrix of dimensions nrow(my.df) by nrow(my.df). It is a correlation matrix, but that's not important for the discussion.
For example:
mat = matrix(rnorm(n=nrow(my.df)*nrow(my.df),mean=1,sd=1), nrow = nrow(my.df), ncol=nrow(my.df))
The question is how to retrieve only the upper-triangle elements of matrix mat for pairs of rows in my.df that both have COLLAPSE == 0 and are of the same strand.
In this specific example, I'd be interested in retrieving the following entries from matrix mat in a vector:
mat[1,2]
mat[1,7]
mat[2,7]
mat[4,6]
The logic is as follows: 1,2 are of the same strand and both have a collapse value of zero, so mat[1,2] should be retrieved; 3 would never be combined with any other row because it has collapse value = 1; 1,7 are of the same strand and both have collapse value = 0, so mat[1,7] should also be retrieved; and so on.
I could write a for loop but I am looking for a more crantastic way to achieve such results...
Here's one way to do it using outer:
First, find indices with identical STRAND values and where COLLAPSE == 0:
idx <- with(my.df, outer(STRAND, STRAND, "==") &
outer(COLLAPSE, COLLAPSE, Vectorize(function(x, y) !any(x, y))))
# [,1] [,2] [,3] [,4] [,5] [,6] [,7]
# [1,] TRUE TRUE FALSE FALSE FALSE FALSE TRUE
# [2,] TRUE TRUE FALSE FALSE FALSE FALSE TRUE
# [3,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
# [4,] FALSE FALSE FALSE TRUE FALSE TRUE FALSE
# [5,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
# [6,] FALSE FALSE FALSE TRUE FALSE TRUE FALSE
# [7,] TRUE TRUE FALSE FALSE FALSE FALSE TRUE
Second, set values in lower triangle and on the diagonal to FALSE. Create a numeric index:
idx2 <- which(idx & upper.tri(idx), arr.ind = TRUE)
# row col
# [1,] 1 2
# [2,] 4 6
# [3,] 1 7
# [4,] 2 7
Extract values:
mat[idx2]
# [1] 1.72165093 0.05645659 0.74163428 3.83420241
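As a variation on the same idea, the COLLAPSE condition can be written without Vectorize by first turning it into a logical vector. This is a sketch rather than the answer's own code; keep and same_strand are hypothetical names, and stringsAsFactors = FALSE is set only to make the example behave the same across R versions.

```r
my.df <- data.frame(ID = 1:7,
                    STRAND = c('+','+','+','-','+','-','+'),
                    COLLAPSE = c(0,0,1,0,1,0,0),
                    stringsAsFactors = FALSE)

keep        <- my.df$COLLAPSE == 0                      # rows allowed in a pair
same_strand <- outer(my.df$STRAND, my.df$STRAND, "==")  # pairwise strand match
idx         <- same_strand & outer(keep, keep, "&")     # both rows must be kept

# Restrict to the upper triangle and convert to (row, col) indices.
idx2 <- which(idx & upper.tri(idx), arr.ind = TRUE)
idx2   # pairs (1,2), (4,6), (1,7), (2,7), in column-major order
```

Comparing COLLAPSE to 0 up front also avoids the coercion warning that any() gives when it is called on double values.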
Here's one way to do it.
# select only the 0 collapse records
sel <- my.df$COLLAPSE==0
# split the data frame by strand
groups <- split(my.df$ID[sel], my.df$STRAND[sel])
# generate all possible pairs of IDs within the same strand
pairs <- lapply(groups, combn, 2)
# subset the entries from the matrix
lapply(pairs, function(ij) mat[t(ij)])
Another option, building the index with combn for each strand:
df <- my.df[my.df$COLLAPSE == 0, ]
strand <- c("+", "-")
idx <- do.call(rbind, lapply(strand, function(strand){
t(combn(x = df$ID[df$STRAND == strand], m = 2))
}))
idx
# [,1] [,2]
# [1,] 1 2
# [2,] 1 7
# [3,] 2 7
# [4,] 4 6
mat[idx]
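One edge case is worth guarding against in the combn-based approaches: when x is a single positive integer, combn(x, 2) treats it as seq_len(x), so a strand with only one non-collapsed ID would silently produce pairs of unrelated rows. The sketch below wraps combn in a hypothetical helper, safe_pairs, that returns an empty index for such groups. The seed and the mat values are illustrative only.

```r
my.df <- data.frame(ID = c(1,2,3,4,5,6,7),
                    STRAND = c('+','+','+','-','+','-','+'),
                    COLLAPSE = c(0,0,1,0,1,0,0))
set.seed(1)                                 # arbitrary seed for the example
mat <- matrix(rnorm(49, mean = 1, sd = 1), nrow = 7, ncol = 7)

# Hypothetical helper: all ID pairs as a two-column matrix, or zero rows
# when fewer than two IDs are available.
safe_pairs <- function(ids) {
  if (length(ids) < 2) return(matrix(integer(0), nrow = 0, ncol = 2))
  t(combn(ids, 2))
}

sel    <- my.df$COLLAPSE == 0
groups <- split(my.df$ID[sel], my.df$STRAND[sel])
idx    <- do.call(rbind, unname(lapply(groups, safe_pairs)))
mat[idx]   # four values, one per (row, col) pair in idx
```

The order of the rows in idx follows the order of the strand groups returned by split, so it may differ from the order shown above.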

Convert a vector to data.frame, one column for each unique value

We are given a vector, like this:
x <- c(1,2,1,5,2,1,2,5,1)
What we need is a data.frame, say y, with number of rows equal to length(x) and number of columns equal to length(unique(x)), that is, one column per unique item in x, such that y[i,j]==TRUE if and only if the ith element of x is the jth unique item of x (assigned to column j):
y <- data.frame("1"=x==1, "2"=x==2, "5"=x==5, check.names=FALSE)
A simple way to perform this is:
y <- setNames(data.frame(sapply(unique(x), function(i) x==i)), unique(x))
Do you have a better idea (i.e. a particular function)?
If you can live with a binary representation instead of a logical representation of your data, I would just use table:
y <- table(seq_along(x), x)
To get a data.frame, use as.data.frame.matrix:
as.data.frame.matrix(y)
# 1 2 5
# 1 1 0 0
# 2 0 1 0
# 3 1 0 0
# 4 0 0 1
# 5 0 1 0
# 6 1 0 0
# 7 0 1 0
# 8 0 0 1
# 9 1 0 0
How about using outer?
outer( x , unique(x) , `==` )
# [,1] [,2] [,3]
# [1,] TRUE FALSE FALSE
# [2,] FALSE TRUE FALSE
# [3,] TRUE FALSE FALSE
# [4,] FALSE FALSE TRUE
# [5,] FALSE TRUE FALSE
# [6,] TRUE FALSE FALSE
# [7,] FALSE TRUE FALSE
# [8,] FALSE FALSE TRUE
# [9,] TRUE FALSE FALSE
Obviously finishing it all off would be wrapping it like so...
setNames( data.frame( outer( x , unique(x) , `==` ) ) , unique( x ) )
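The table and outer answers can be reconciled with a short check. This sketch (y_outer and y_tab are illustrative names) converts the table counts to logical and compares the two results by column name, since outer orders columns by first appearance in x while table sorts the values; for this x the two orders happen to coincide.

```r
x <- c(1, 2, 1, 5, 2, 1, 2, 5, 1)

# outer(): columns follow unique(x), i.e. order of first appearance.
y_outer <- setNames(data.frame(outer(x, unique(x), `==`)), unique(x))

# table(): columns follow the sorted values; turn the 0/1 counts into logicals.
y_tab <- as.data.frame.matrix(table(seq_along(x), x))
y_tab[] <- lapply(y_tab, as.logical)

# Compare by column name so a different column order does not matter.
all(as.matrix(y_tab[names(y_outer)]) == as.matrix(y_outer))
# [1] TRUE
```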
