Compare each matrix row to a vector - r

I have a 3-column matrix, and I want to compare its rows with a vector. I know that the easiest way is apply(table==vector,1,sum)>(length(vector)-1), but since the wrong rows kept getting marked, I started digging into the partial results. Below are my code and the mistake R seems to be making.
library(gtools) # provides permutations()
transition_matrix<-cbind(permutations(n=7,r=4,v=c(0,1,2,3,4,5,6),repeats.allowed=T),prob=0,n=1)
vector<-c(1,0,1)
table <- transition_matrix[,c(1:3)]
table[59,]
>0 1 1
(table==vector)[59,]
>TRUE TRUE TRUE
So I am puzzled, staring at my code and honestly not understanding why it does not work. I might be missing something, because if I compare row 59 directly with the vector I get the right result.

As Chi Pak notes, table == vector is operating down columns, but you want to compare each row to the vector.
One way to get the behavior you want is to transpose the matrix before comparing to the vector, and then re-transposing afterwards.
Sample data:
(table <- matrix(rep(0:2, 4), 4))
# [,1] [,2] [,3]
# [1,] 0 1 2
# [2,] 1 2 0
# [3,] 2 0 1
# [4,] 0 1 2
(vector <- c(0, 0, 1))
# [1] 0 0 1
Calculation:
t(t(table) == vector)
# [,1] [,2] [,3]
# [1,] TRUE FALSE FALSE
# [2,] FALSE FALSE FALSE
# [3,] FALSE TRUE TRUE
# [4,] TRUE FALSE FALSE
One advantage of this compared to an approach using apply is that all the operations are vectorized, which means this will be a good deal more efficient on large matrices. To see this, let's look at a matrix with one million rows:
set.seed(144)
table <- matrix(sample(0:1, 3e6, replace=TRUE), 1e6)
system.time(t(t(table) == vector))
# user system elapsed
# 0.066 0.013 0.078
system.time(t(apply(table,1,function(x) x==vector)))
# user system elapsed
# 2.508 0.057 2.576
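To get back to the original goal of flagging rows that match the vector in every position, the element-wise result can be collapsed per row. The following is a minimal sketch on a small matrix; the names `m`, `v`, and `row_matches` are illustrative and not from the question. `sweep()` is shown as a base R alternative to the double transpose.

```r
# Small sample matrix and target vector (illustrative names)
m <- matrix(rep(0:2, 4), 4)
v <- c(0, 1, 2)

# Row-wise comparison via the double transpose, then require every column to match
row_matches <- rowSums(t(t(m) == v)) == length(v)
row_matches
# [1]  TRUE FALSE FALSE  TRUE

# sweep() applies `==` across the columns (MARGIN = 2) and should give the same logical matrix
same_as_sweep <- identical(t(t(m) == v), sweep(m, 2, v, "=="))
same_as_sweep
```

Using names other than `table` and `vector` also avoids masking the base R functions of the same name.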

Explanation
When you compare a matrix to a vector, R recycles the vector down the columns (column-major order), not across each row.
See the following reproducible example:
table <- matrix(c(rep(0,60),rep(1,60),rep(1,60)),ncol=3)
vector <- c(1,0,1)
head(table==vector)
[,1] [,2] [,3]
[1,] FALSE TRUE TRUE
[2,] TRUE FALSE FALSE
[3,] FALSE TRUE TRUE
[4,] FALSE TRUE TRUE
[5,] TRUE FALSE FALSE
[6,] FALSE TRUE TRUE
Element [1,1] is FALSE because vector[1]==1 and table[1,1]==0. Element [2,1] is TRUE because vector[2]==0 and table[2,1]==0, and so on down each column.
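The recycling can be made visible by building the recycled matrix by hand. This is only an illustrative sketch; `tab`, `vec`, `recycled`, and `by_row` are names chosen here, not part of the answer.

```r
tab <- matrix(c(rep(0, 60), rep(1, 60), rep(1, 60)), ncol = 3)
vec <- c(1, 0, 1)

# R stores matrices column by column, so the vector is repeated down each column
recycled <- matrix(rep(vec, length.out = length(tab)), nrow = nrow(tab))
head(recycled, 3)

# Comparing against the hand-built matrix gives the same result as tab == vec
identical(tab == vec, tab == recycled)

# Laying the vector out by row gives the comparison that was actually intended
by_row <- matrix(vec, nrow = nrow(tab), ncol = ncol(tab), byrow = TRUE)
head(tab == by_row, 3)
```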
Solution
head(t(apply(table,1,function(x) x==vector)))
[,1] [,2] [,3]
[1,] FALSE FALSE TRUE
[2,] FALSE FALSE TRUE
[3,] FALSE FALSE TRUE
[4,] FALSE FALSE TRUE
[5,] FALSE FALSE TRUE
[6,] FALSE FALSE TRUE

Related

How to merge logical vectors into a new column

I have many columns of logical vectors and would like to merge two or more of them into one, so that the merged column is TRUE whenever any of the columns in that row is TRUE.
Here is an example of 2 columns and the various combinations
X <- c(T,F,T,F,F,T,F,T,T,F,F,F)
Y <- matrix(X,nrow = 6, ncol = 2)
Y
[,1] [,2]
[1,] TRUE FALSE
[2,] FALSE TRUE
[3,] TRUE TRUE
[4,] FALSE FALSE
[5,] FALSE FALSE
[6,] TRUE FALSE
How can I create a third column that is TRUE if either column is TRUE and FALSE if both are FALSE? Would this also work if there were three or more columns to combine?
If you have logical vectors in all the columns, you can use rowSums
cbind(Y, rowSums(Y) > 0)
# [,1] [,2] [,3]
#[1,] TRUE FALSE TRUE
#[2,] FALSE TRUE TRUE
#[3,] TRUE TRUE TRUE
#[4,] FALSE FALSE FALSE
#[5,] FALSE FALSE FALSE
#[6,] TRUE FALSE TRUE
This will return TRUE if there is at least one TRUE in a row and FALSE otherwise. It also works for any number of columns.
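As a quick check of the "any number of columns" claim, here is a small sketch with three columns. The matrix `Y3` is made up for illustration, and `apply(Y3, 1, any)` is shown only as an equivalent, slower formulation.

```r
# Three logical columns (made-up data)
Y3 <- cbind(c(TRUE, FALSE, FALSE),
            c(FALSE, FALSE, TRUE),
            c(FALSE, FALSE, TRUE))

# TRUE counts as 1, so a row sum above zero means at least one TRUE
any_true <- rowSums(Y3) > 0
any_true
# [1]  TRUE FALSE  TRUE

# Equivalent row-wise test with any()
any_true_apply <- apply(Y3, 1, any)
```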
Alternatively, in base R with a data frame:
X <- c(T,F,T,F,F,T,F,T,T,F,F,F)
Y <- as.data.frame(matrix(X,nrow = 6, ncol = 2))
# The columns are already logical, so combine them directly with |
# (comparing against the string "TRUE" would return character values instead)
Y$condition <- Y$V1 | Y$V2
Here is a possible solution using apply() and logical operator | that will work for any number of columns of Y.
result = cbind(Y, apply(Y, 1, FUN = function (x) Reduce(f="|", x)))
result
# [,1] [,2] [,3]
# [1,] TRUE FALSE TRUE
# [2,] FALSE TRUE TRUE
# [3,] TRUE TRUE TRUE
# [4,] FALSE FALSE FALSE
# [5,] FALSE FALSE FALSE
# [6,] TRUE FALSE TRUE

How to add matrices generated through manipulations on three lists in R?

I currently have three lists List.1, List.2, and List.3 which each contains 500 matrices, each of which has dimensions 100 x 100. Hence, List.1[[1]] is a matrix of dimensions 100 x 100.
What I would like to do is find which elements of a given matrix in List.2 lie between the corresponding elements of the matching matrices in List.1 and List.3. For one matrix from each of the three lists, the manipulation is:
+(List.2[[1]] < List.3[[1]] & List.2[[1]] > List.1[[1]])
which returns a matrix of 1's and 0's, with 1 where the condition above is satisfied and 0 where it isn't.
I would like to then do this over all 500 matrices in the list, without having to resort to loops. Is there a way to do this with the Reduce or lapply function, or both?
So far what I have is:
zero.one.mat <- List.1[[1]]-List.1[[1]] # Create empty zero matrix
for(i in 1:500){
zero.one.mat <- zero.one.mat + +(List.2[[i]] < List.3[[i]] & List.2[[i]] > List.1[[i]])
}
which obviously isn't the ideal way to do it. Any thoughts would be appreciated. Thanks!
list.1 <- list()
list.2 <- list()
list.3 <- list()
N=5
list.1[[1]] <- matrix(1,nrow=N, ncol=N)
list.2[[1]] <- matrix(2,nrow=N, ncol=N)
list.3[[1]] <- matrix(3,nrow=N, ncol=N)
list.1[[2]] <- matrix(-1,nrow=N, ncol=N)
list.2[[2]] <- matrix(-2,nrow=N, ncol=N)
list.3[[2]] <- matrix(-3,nrow=N, ncol=N)
list.1[[3]] <- matrix(rnorm(N*N),nrow=N, ncol=N)
list.2[[3]] <- matrix(rnorm(N*N),nrow=N, ncol=N)
list.3[[3]] <- matrix(rnorm(N*N),nrow=N, ncol=N)
list.result <- lapply(1:length(list.1), FUN=function(i){list.2[[i]] < list.3[[i]] & list.2[[i]] > list.1[[i]]})
# [[1]]
# [,1] [,2] [,3] [,4] [,5]
# [1,] TRUE TRUE TRUE TRUE TRUE
# [2,] TRUE TRUE TRUE TRUE TRUE
# [3,] TRUE TRUE TRUE TRUE TRUE
# [4,] TRUE TRUE TRUE TRUE TRUE
# [5,] TRUE TRUE TRUE TRUE TRUE
#
# [[2]]
# [,1] [,2] [,3] [,4] [,5]
# [1,] FALSE FALSE FALSE FALSE FALSE
# [2,] FALSE FALSE FALSE FALSE FALSE
# [3,] FALSE FALSE FALSE FALSE FALSE
# [4,] FALSE FALSE FALSE FALSE FALSE
# [5,] FALSE FALSE FALSE FALSE FALSE
#
# [[3]]
# [,1] [,2] [,3] [,4] [,5]
# [1,] FALSE FALSE FALSE TRUE FALSE
# [2,] FALSE FALSE FALSE FALSE FALSE
# [3,] FALSE FALSE FALSE FALSE FALSE
# [4,] FALSE TRUE FALSE FALSE FALSE
# [5,] FALSE FALSE FALSE TRUE TRUE
# If you need to find the sum of all of them, then you can add Reduce:
Reduce("+",lapply(1:length(list.1),
FUN=function(i){
list.2[[i]] < list.3[[i]] &
list.2[[i]] > list.1[[i]]}))
# [,1] [,2] [,3] [,4] [,5]
# [1,] 1 1 1 2 1
# [2,] 1 1 1 1 1
# [3,] 1 1 1 1 1
# [4,] 1 2 1 1 1
# [5,] 1 1 1 2 2
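The same idea can also be written with `Map()`, which walks the three lists in parallel without indexing. This is a sketch under the same toy setup; the seed and the argument names `lo`, `mid`, and `hi` are assumptions made here for illustration.

```r
set.seed(1)  # arbitrary seed so the sketch is repeatable
N <- 5
list.1 <- list(matrix(1, N, N), matrix(-1, N, N), matrix(rnorm(N * N), N, N))
list.2 <- list(matrix(2, N, N), matrix(-2, N, N), matrix(rnorm(N * N), N, N))
list.3 <- list(matrix(3, N, N), matrix(-3, N, N), matrix(rnorm(N * N), N, N))

# One 0/1 matrix per triple: 1 where the middle value lies strictly between the other two
between <- Map(function(lo, mid, hi) +(mid > lo & mid < hi),
               list.1, list.2, list.3)

# Element-wise sum over all of the 0/1 matrices
total <- Reduce(`+`, between)

# Cross-check against the explicit loop from the question
check <- matrix(0, N, N)
for (i in seq_along(list.1)) {
  check <- check + (list.2[[i]] < list.3[[i]] & list.2[[i]] > list.1[[i]])
}
all(total == check)
```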

How to check if columns in dataframe are identical in R [produce matrix]

I have a large data frame (ncol = 220). I want to compare the columns to see whether any of them are identical, and produce a matrix for ease of identification.
So what I have is
x y z
1 dog dog cat
2 dog dog dog
3 cat cat cat
What I want
x y z
x - True False
y True - False
z False False -
Is there a way to do this using identical() in R?
To complement @Cath's comment about stringdist, it is as easy as:
library(stringdist)
stringdistmatrix(df, df) == 0
# [,1] [,2] [,3]
#[1,] TRUE TRUE FALSE
#[2,] TRUE TRUE FALSE
#[3,] FALSE FALSE TRUE
Probably not very efficient but you can try:
seq_col <- seq_len(ncol(df))
sapply(seq_col, function(i) sapply(seq_col, function(j) identical(df[, i], df[, j])))
# [,1] [,2] [,3]
# [1,] TRUE TRUE FALSE
# [2,] TRUE TRUE FALSE
# [3,] FALSE FALSE TRUE
It gives you what you want (except for the diagonal, which is all TRUE here), but there must be a package with a function to create a distance matrix based on character vectors. Maybe something with stringdist?
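For a result that looks like the matrix in the question, with column names on both axes and a blank diagonal, one possible base R sketch uses `outer()` with a vectorized wrapper around `identical()`. The sample `df` is rebuilt from the question's table, and `same` is an illustrative name.

```r
# Sample data rebuilt from the question
df <- data.frame(x = c("dog", "dog", "cat"),
                 y = c("dog", "dog", "cat"),
                 z = c("cat", "dog", "cat"))

# outer() needs a vectorized function, so wrap identical() with Vectorize()
same <- outer(names(df), names(df),
              Vectorize(function(a, b) identical(df[[a]], df[[b]])))
dimnames(same) <- list(names(df), names(df))

# A column is trivially identical to itself, so blank out the diagonal
diag(same) <- NA
same
```

This still performs one `identical()` call per pair of columns, so for 220 columns it does about 48,000 comparisons.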

row wise comparison between a vector and a matrix in r

I have two datasets from 10 people. One is a vector, and the other is a matrix. I want to check whether the first element of the vector appears in the first row of the matrix, whether the second element appears in the second row, and so on.
So I changed the vector into a matrix and used apply to compare them row-wise, but the result was not correct.
Here are the datasets.
df1<-matrix(c(rep(0,10),2,4,7,6,5,7,4,2,2,2),ncol=2)
df1
# [,1] [,2]
# [1,] 0 2
# [2,] 0 4
# [3,] 0 7
# [4,] 0 6
# [5,] 0 5
# [6,] 0 7
# [7,] 0 4
# [8,] 0 2
# [9,] 0 2
#[10,] 0 2
df2<-c(1,3,6,4,1,3,3,2,2,5)
df2<-as.matrix(df2)
apply(df2, 1, function(x) any(x==df1))
# [1] FALSE FALSE TRUE TRUE FALSE FALSE FALSE TRUE TRUE TRUE
However, the result should be all FALSE except for the 8th and 9th elements.
Can anyone correct the function? Thanks!
This vectorized code should be very efficient:
# df2 must be a plain vector here; c() drops the dimensions added by as.matrix()
as.logical(rowSums(df1 == c(df2)))
# [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE FALSE
Here are a few approaches you could take
Two calls to apply
#
# 1 by column to check if the values are equal
# then by row to see if any rows contain TRUE
apply(apply(df1,2,`==`,df2),1,any)
Use sapply and seq_along
sapply(seq_along(df2), function(x, y, i) y[i] %in% x[i, ], y = df2 ,x = df1)
repeat df2 to the same length as df1 and then compare
rowSums(df1==rep(df2, length = length(df1))) > 0
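To see why the original `apply()` call gives extra TRUE values, it helps to look at a single element: `any(x == df1)` tests membership in the whole matrix rather than in the matching row. A small sketch, with `df2` kept as a plain vector and with illustrative variable names:

```r
df1 <- matrix(c(rep(0, 10), 2, 4, 7, 6, 5, 7, 4, 2, 2, 2), ncol = 2)
df2 <- c(1, 3, 6, 4, 1, 3, 3, 2, 2, 5)

# Element 3 of df2 is 6, which occurs in row 4 of df1 but not in row 3
in_whole_matrix <- any(df2[3] == df1)   # what the original call tests: TRUE
in_own_row <- df2[3] %in% df1[3, ]      # what the question asks for: FALSE

# Row-by-row membership for every element
row_hits <- vapply(seq_along(df2), function(i) df2[i] %in% df1[i, ], logical(1))
which(row_hits)
# [1] 8 9
```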

Subset data frame in R based on matching multiple ranges for multiple variables

I have a problem that seems kind of similar to some previously asked questions on SO, but different enough that I can't quite figure out an elegant solution.
I have a set of real data that I need to match to a database of theoretical values. I'd like to filter based on multiple sets of multiple conditions. For example, if I have the following data.frame of theoretical values,
df <- data.frame(x=c(10,13,16,22,28,30), y=c(1:6))
> df
x y
1 10 1
2 13 2
3 16 3
4 22 4
5 28 5
6 30 6
and I have the following real data,
realdata <- data.frame(x=c(10.05, 13.06, 22.01),y=c(1.02, 1.99, 3.96))
> realdata
x y
1 10.05 1.02
2 13.06 1.99
3 22.01 3.96
I can easily search for which theoretical rows correspond to rows in my real data one at a time with something like this:
tolerance <- .10
subset(df, x>(realdata[1,1]-tolerance) & x<(realdata[1,1]+tolerance) &
  y>(realdata[1,2]-tolerance) & y<(realdata[1,2]+tolerance))
subset(df, x>(realdata[2,1]-tolerance) & x<(realdata[2,1]+tolerance) &
  y>(realdata[2,2]-tolerance) & y<(realdata[2,2]+tolerance))
#...etc for each row of real data
But is there any way to do this for all the rows in my real data without writing a loop? Basically, I want to find all the theoretical rows that correspond to any one of the rows in my real data, within a given tolerance. In reality, my theoretical and real tables have hundreds of thousands of observations, and this is something I do quite a bit, so speed will matter, I think.
Also, if anyone knows a way of determining whether a value is within a range using a single expression that works inside subset(), that would be icing on the cake. Maybe subset is the wrong function to be using, though, in which case never mind.
You can use outer to calculate all pairwise differences between df and realdata and check whether both the x and y differences are within the tolerance:
tolerance <- .10
# x
xx <- abs(outer(df$x, realdata$x, "-")) < tolerance
# y
yy <- abs(outer(df$y, realdata$y, "-")) < tolerance
# if both are within the tolerance the sum of xx and yy will be 2
(mat <- xx + yy > 1)
# [,1] [,2] [,3]
#[1,] TRUE FALSE FALSE
#[2,] FALSE TRUE FALSE
#[3,] FALSE FALSE FALSE
#[4,] FALSE FALSE TRUE
#[5,] FALSE FALSE FALSE
#[6,] FALSE FALSE FALSE
So the first column of mat shows which rows of df are within the tolerance of the first row of realdata (in this case, the first row of df).
To return, rather inelegantly, the matching rows of df in the order of the rows of realdata:
lapply(1:ncol(mat), function(i) df[mat[,i], ])
# return all matched data
df[row(mat)[mat], ]
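If the goal is simply every theoretical row that matches at least one real row, each listed once, the match matrix can be collapsed with `rowSums()`. The sketch below also wraps the range test in a small helper, which touches on the question about a single "within range" expression. `within_tol` is a hypothetical helper name introduced here, not an existing function.

```r
df <- data.frame(x = c(10, 13, 16, 22, 28, 30), y = 1:6)
realdata <- data.frame(x = c(10.05, 13.06, 22.01), y = c(1.02, 1.99, 3.96))
tolerance <- 0.10

# Hypothetical helper: TRUE where a theoretical value lies within `tol` of a real value
within_tol <- function(theory, real, tol) abs(outer(theory, real, "-")) < tol

# Rows index df, columns index realdata
mat <- within_tol(df$x, realdata$x, tolerance) &
  within_tol(df$y, realdata$y, tolerance)

# Theoretical rows that match at least one row of realdata
matched <- df[rowSums(mat) > 0, ]
matched
```

Note that `outer()` builds a full nrow(df) by nrow(realdata) matrix, so with hundreds of thousands of rows on both sides it may need to be applied in chunks.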
This is an implicit loop with a vectorized test:
apply( realdata, 1,
function(x) abs( x[1] - df[,1] ) < tolerance &
abs( x[2] - df[,2]) <tolerance )
#------------------------
[,1] [,2] [,3]
[1,] TRUE FALSE FALSE
[2,] FALSE TRUE FALSE
[3,] FALSE FALSE FALSE
[4,] FALSE FALSE TRUE
[5,] FALSE FALSE FALSE
[6,] FALSE FALSE FALSE
This does it with no apply functions:
> kronecker( as.matrix(df), as.matrix(realdata), function(x,y) { abs(x -y) <tolerance} )[,c(1,4)]
[,1] [,2]
[1,] TRUE TRUE
[2,] FALSE FALSE
[3,] FALSE FALSE
[4,] FALSE FALSE
[5,] TRUE TRUE
[6,] FALSE FALSE
[7,] FALSE FALSE
[8,] FALSE FALSE
[9,] FALSE FALSE
[10,] FALSE FALSE
[11,] FALSE FALSE
[12,] TRUE TRUE
[13,] FALSE FALSE
[14,] FALSE FALSE
[15,] FALSE FALSE
[16,] FALSE FALSE
[17,] FALSE FALSE
[18,] FALSE FALSE
You can consolidate it with rowSums(.) == 2
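A sketch of that consolidation step follows, assuming the row order shown in the output above (the three realdata rows cycle fastest within each df row). The names `k`, `both`, and `mat` are illustrative.

```r
df <- data.frame(x = c(10, 13, 16, 22, 28, 30), y = 1:6)
realdata <- data.frame(x = c(10.05, 13.06, 22.01), y = c(1.02, 1.99, 3.96))
tolerance <- 0.10

# Columns 1 and 4 hold the x-vs-x and y-vs-y comparisons
k <- kronecker(as.matrix(df), as.matrix(realdata),
               function(x, y) abs(x - y) < tolerance)[, c(1, 4)]

# TRUE only where both x and y are within the tolerance
both <- rowSums(k) == 2

# Reshape to one row per df row and one column per realdata row
mat <- matrix(both, nrow = nrow(df), byrow = TRUE)
which(mat, arr.ind = TRUE)
```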
