Check whether matrix rows equal a vector in R, vectorized

I'm very surprised this question has not been asked, maybe the answer will clear up why. I want to compare rows of a matrix to a vector and return whether the row == the vector everywhere. See the example below. I want a vectorized solution, no apply functions because the matrix is too large for slow looping. Suppose there are many rows as well, so I would like to avoid repping the vector.
set.seed(1)
M = matrix(rpois(50,5),5,10)
v = c(3 , 2 , 7 , 7 , 4 , 4 , 7 , 4 , 5, 6)
M
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 4 8 3 5 9 4 5 6 7 7
[2,] 4 9 3 6 3 1 5 7 6 1
[3,] 5 6 6 11 6 4 5 2 7 5
[4,] 8 6 4 4 3 8 3 6 5 6
[5,] 3 2 7 7 4 4 7 4 5 6
Output should be
FALSE FALSE FALSE FALSE TRUE

One possibility is
rowSums(M == v[col(M)]) == ncol(M)
## [1] FALSE FALSE FALSE FALSE TRUE
Or similarly
rowSums(M == rep(v, each = nrow(M))) == ncol(M)
## [1] FALSE FALSE FALSE FALSE TRUE
Or
colSums(t(M) == v) == ncol(M)
## [1] FALSE FALSE FALSE FALSE TRUE
v[col(M)] is just a shorter version of rep(v, each = nrow(M)): both create a vector the same length as M (a matrix is just a vector with a dim attribute, try c(M)), which == then compares element by element against M. Because == keeps the dim attribute of its matrix operand (see is.array(M == v[col(M)])), the result is a logical matrix, so we can run rowSums (or colSums) on it and check that the number of matches in each row equals ncol(M). In the colSums variant, t(M) puts each row of M into a column, and v is recycled down every column.
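To make the expansion step concrete, here is a minimal sketch on a small hand-made matrix (the 3x3 values and the names expanded, eq and row_matches are illustrative assumptions, not the data from the question). It checks that both ways of expanding v agree and that the comparison keeps the matrix shape.

```r
# Small illustrative matrix: rows 1 and 3 equal v, row 2 does not.
M <- matrix(c(1, 2, 3,
              4, 5, 6,
              1, 2, 3), nrow = 3, byrow = TRUE)
v <- c(1, 2, 3)

# col(M) holds the column number of every cell, so v[col(M)] lines v up with M.
expanded <- v[col(M)]
identical(expanded, rep(v, each = nrow(M)))  # TRUE

# The comparison keeps the dimensions of M, so rowSums() can count matches per row.
eq <- M == expanded
is.matrix(eq)                                # TRUE
row_matches <- rowSums(eq) == ncol(M)
row_matches                                  # TRUE FALSE TRUE
```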

Using De Morgan's law (not all = some not, so all equal = not (some not equal)), we also have
!colSums(t(M) != v)
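As a quick sanity check of that equivalence, the sketch below compares the "count the matches" form with the negated "count the mismatches" form on a small made-up matrix (the values and variable names are assumptions for illustration). The ! works because a mismatch count of 0 is coerced to FALSE and then negated to TRUE.

```r
M <- matrix(c(1, 2, 3,
              4, 5, 6,
              1, 2, 3), nrow = 3, byrow = TRUE)
v <- c(1, 2, 3)

# t(M) stores each row of M as a column, so v is recycled down every column.
all_equal   <- colSums(t(M) == v) == ncol(M)
none_differ <- !colSums(t(M) != v)   # 0 mismatches -> !0 -> TRUE

identical(all_equal, none_differ)    # TRUE
```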

The package prodlim has a function called row.match, which is easy to use and ideal for your problem. First install the package (install.packages("prodlim")) and load it with library(prodlim). In our example, row.match will return 5 because the 5th row in M is equal to v. We can then convert this into a logical vector.
m <- row.match(v, M)
m == 1:NROW(M)
## [1] FALSE FALSE FALSE FALSE TRUE

Related

Scope of Aggregation Functions when nesting apply(within())

Edited original post to clarify question
Background
I'm learning R and saw this scenario and don't understand how R handles (what I'll call) implied context transitions. The script I am trying to understand simply iterates through each row of a matrix and prints the index of the column(s) within that row that contain the minimum value of that row. What I don't understand is how R handles the context transition as different functions are applied to the dependent variable x:
x (when defined as an argument to function(x)) is an atomic vector because of the apply() function with a MARGIN = 1 argument
The which() function then iterates over the individual elements within the atomic vector x to see which ones == min(x)
This is the part that truly confuses me: Despite the fact which() is iterating over elements of atomic vector x, you can call min(x) within the which() function and R somehow switches x to be defined as the entire atomic vector again for calculating the min() across the vector vs. within the scope of a single element
Example Data Matrix
a <- matrix (c(5, 2, 7, 1, 2, 8, 4, 5, 6), 3, 3)
[,1] [,2] [,3]
[1,] 5 1 4
[2,] 2 2 5
[3,] 7 8 6
This is the script that returns the column indexes that I am struggling to understand
apply (a, 1, function(x) which(x == min(x)))
My question:
Within the which() function, why does min(x) return the minimum of the atomic vector (as is desired) and not the minimum within the scope of an individual element within that vector, since which() is iterating over each individual element within the atomic vector x?
Edit: discussion about which and x:
the first comment on your question is incorrect:
x is anonymous function, lambda
x is just a variable, nothing fancy. function(x) declares it as the first (and only) argument of the anonymous function, and then every reference to x after that is referencing what is passed to this anonymous function;
the code uses an anonymous function; normally, almost everything you do in R is using named functions (e.g., mean, min). In some cases (e.g., in apply and related functions), it makes sense to define a whole function as an argument and not name it, as in
## anonymous (unnamed) function
apply(a, 1, function(x) which(x == min(x)))
## equivalently, with a named function
myfunc <- function(x) which(x == min(x))
apply(a, 1, myfunc)
In the first case, function(x) which(x == min(x)) is not named, so it is "anonymous". The results of the two apply calls are identical.
Given that context, x is the first argument to the function (myfunc or the anonymous function in your case). With the rest of the apply/MARGIN discussion below,
x (in this case) contains the whole row (when MARGIN=1);
min(x) returns the lowest value within x (it is always length 1); and
which(x == min(x)) returns the index of that lowest value within x; in this case, it will always be length 1 or more, because at least one element always equals the minimum of its own vector. In general, however, there is no guarantee that which will find any matches, so the length of which(...)'s return value can be anywhere between 0 and the length of the input. Examples:
which(11:15 == 13)
# [1] 3
which(11:15 == 1:5)
# integer(0)
which(11:15 == 11:15)
# [1] 1 2 3 4 5
which(11:15 %in% c(12, 14))
# [1] 2 4
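The evaluation order is the key point for the original question, so here is a small sketch that unpacks which(x == min(x)) into its separate steps for one row (the intermediate names lowest and is_lowest are made up for illustration). Nothing iterates element by element: min(x) is evaluated once on the whole vector, and == then compares every element against that single value.

```r
x <- c(2, 2, 5)            # one row of the matrix, as passed in by apply(a, 1, ...)

lowest    <- min(x)        # evaluated first, on the whole vector: 2
is_lowest <- x == lowest   # vectorized comparison: TRUE TRUE FALSE
which(is_lowest)           # positions of the TRUE values: 1 2
```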
apply works one or more dimensions at a time. For now, I'll stick with a 2d matrix, in which case MARGIN= selects rows or columns. (There is a caveat, see below.)
I'm going to use a step-by-step verbose function to show each step. I'll name it anonfunc, but in your mind replace apply(a, 1, anonfunc) with apply(a, 1, function(x) { ... }) and you will see what I'm intending to do. Also, I have a dematrix function to help show what's being used in the anonfunc.
dematrix <- function(m, label = "") {
if (!is.matrix(m)) m <- matrix(m, nrow = 1)
out <- capture.output(print(m))[-1]
out <- gsub("^[][,0-9]+", "", out)
paste(paste0(c(label, rep(strrep(" ", nchar(label)), length(out) - 1)), out),
collapse = "\n")
}
anonfunc <- function(x) {
message(dematrix(x, "Input: "))
step1 <- x == min(x)
message(dematrix(step1, "Step1: "))
step2 <- which(step1)
message("Step2: ", paste(step2, collapse = ","), "\n#\n")
step2
}
2d arrays
I'm going to modify your sample data a little by adding a column. This helps visualize how many function calls there are and how big the function's input is.
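The code for that modification is not shown. Judging from the Input: lines in the output below (each row ends in 11, 12 or 13), it was presumably something like the following; treat the exact call as an assumption.

```r
a <- matrix(c(5, 2, 7, 1, 2, 8, 4, 5, 6), 3, 3)

# Assumed modification: append a fourth column holding 11, 12, 13.
a <- cbind(a, 11:13)
dim(a)   # 3 4
```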
apply(a, 1, anonfunc)
# Input: 5 1 4 11
# Step1: FALSE TRUE FALSE FALSE
# Step2: 2
# #
# Input: 2 2 5 12
# Step1: TRUE TRUE FALSE FALSE
# Step2: 1,2
# #
# Input: 7 8 6 13
# Step1: FALSE FALSE TRUE FALSE
# Step2: 3
# #
# [[1]]
# [1] 2
# [[2]]
# [1] 1 2
# [[3]]
# [1] 3
Our anonymous function is called three times, once for each row. In each call, it is passed a vector of length 4, which is the size of one row in the matrix.
Note that we get a list in return. Normally apply returns a vector or matrix. The dimensions of the return value are the length of each call's return value, followed by the dimensions of the MARGIN= axes. That is, a has dims 3x4; if the return value from each call to the anon-func is length 1, then R drops that length-1 dimension and returns a vector of length 3 (this might be construed as inconsistent mathematically, I don't disagree); if the return value from each anon-func call is length 10, then the output would be a 10x3 matrix, with one column per call.
However, when any of the anon-func returns has a different length from the others, apply will return a list. (This is the same behavior as sapply, and it can be frustrating if it changes when you are not expecting it. Since R 4.1.0, apply(..., simplify=FALSE) forces a list.)
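The three possible return shapes can be seen side by side in a short sketch. It assumes the 3x4 version of a used in this answer; the names len1, len2 and ragged are made up for illustration. Note that per-call results of length 2 end up as columns, so the result is 2x3 rather than 3x2.

```r
a <- cbind(matrix(c(5, 2, 7, 1, 2, 8, 4, 5, 6), 3, 3), 11:13)

len1   <- apply(a, 1, function(x) min(x))              # length-1 results -> plain vector
len2   <- apply(a, 1, function(x) range(x))            # length-2 results -> 2 x 3 matrix
ragged <- apply(a, 1, function(x) which(x == min(x)))  # unequal lengths  -> list

dim(len1)       # NULL
dim(len2)       # 2 3
is.list(ragged) # TRUE
```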
If we instead use MARGIN=2, we'll be operating on columns:
apply(a, 2, anonfunc)
# Input: 5 2 7
# Step1: FALSE TRUE FALSE
# Step2: 2
# #
# Input: 1 2 8
# Step1: TRUE FALSE FALSE
# Step2: 1
# #
# Input: 4 5 6
# Step1: TRUE FALSE FALSE
# Step2: 1
# #
# Input: 11 12 13
# Step1: TRUE FALSE FALSE
# Step2: 1
# #
# [1] 2 1 1 1
Now, one call for each column (4 calls) and x is a vector of length 3 (number of rows in the source matrix).
It is possible to operate on more than one axis at a time; while it seems meaningless to do it with a matrix (2d array), it makes more sense with larger-dimensioned arrays.
apply(a, 1:2, anonfunc)
# Input: 5
# Step1: TRUE
# Step2: 1
# #
# Input: 2
# Step1: TRUE
# Step2: 1
# #
# Input: 7
# Step1: TRUE
# Step2: 1
# #
# ...truncated... total of 12 calls to `anonfunc`
# #
# [,1] [,2] [,3] [,4]
# [1,] 1 1 1 1
# [2,] 1 1 1 1
# [3,] 1 1 1 1
From the discussion of output dimensions, the MARGIN=1:2 means the output dimension will be the dimensions of the margin -- 3x4 -- with the dimension/length of the output. Since the output here is always length 1, then that is technically 3x4x1, which in R-speak is a matrix of dim 3x4.
[Images: what each margin uses from a matrix]
3d array
Let's go slightly larger to see some of the "plane" operations.
a3 <- array(1:24, dim = c(3,4,2))
a3
# , , 1
# [,1] [,2] [,3] [,4]
# [1,] 1 4 7 10
# [2,] 2 5 8 11
# [3,] 3 6 9 12
# , , 2
# [,1] [,2] [,3] [,4]
# [1,] 13 16 19 22
# [2,] 14 17 20 23
# [3,] 15 18 21 24
Starting with MARGIN=1. While you have both arrays visible, look at the first Input: and see which "plane" is being used from the original a3 array. It appears transposed, sure ...
For the sake of brevity (too late!), I'll abbreviate the third and subsequent iterations of anonfunc to show just the first line (inner-matrix row) of the verbose output.
apply(a3, 1, anonfunc)
# Input: 1 13
# 4 16
# 7 19
# 10 22
# Step1: TRUE FALSE
# FALSE FALSE
# FALSE FALSE
# FALSE FALSE
# Step2: 1
# #
# Input: 2 14
# 5 17
# 8 20
# 11 23
# Step1: TRUE FALSE
# FALSE FALSE
# FALSE FALSE
# FALSE FALSE
# Step2: 1
# #
# Input: 3 15 ...
# #
# [1] 1 1 1
Similarly, MARGIN=2. I'll show a3 again so you can see which "plane" is being used:
a3
# , , 1
# [,1] [,2] [,3] [,4]
# [1,] 1 4 7 10
# [2,] 2 5 8 11
# [3,] 3 6 9 12
# , , 2
# [,1] [,2] [,3] [,4]
# [1,] 13 16 19 22
# [2,] 14 17 20 23
# [3,] 15 18 21 24
apply(a3, 2, anonfunc)
# Input: 1 13
# 2 14
# 3 15
# Step1: TRUE FALSE
# FALSE FALSE
# FALSE FALSE
# Step2: 1
# #
# Input: 4 16
# 5 17
# 6 18
# Step1: TRUE FALSE
# FALSE FALSE
# FALSE FALSE
# Step2: 1
# #
# Input: 7 19 ...
# Input: 10 22 ...
# #
# [1] 1 1 1 1
MARGIN=3 is not very exciting: anonfunc is only called twice, one for each of the front-facing "planes" (no abbreviation necessary here):
apply(a3, 3, anonfunc)
# Input: 1 4 7 10
# 2 5 8 11
# 3 6 9 12
# Step1: TRUE FALSE FALSE FALSE
# FALSE FALSE FALSE FALSE
# FALSE FALSE FALSE FALSE
# Step2: 1
# #
# Input: 13 16 19 22
# 14 17 20 23
# 15 18 21 24
# Step1: TRUE FALSE FALSE FALSE
# FALSE FALSE FALSE FALSE
# FALSE FALSE FALSE FALSE
# Step2: 1
# #
# [1] 1 1
One can use multiple dimensions here as well, and this is where I think the Input: string becomes a little clarifying:
a3
# , , 1
# [,1] [,2] [,3] [,4]
# [1,] 1 4 7 10
# [2,] 2 5 8 11
# [3,] 3 6 9 12
# , , 2
# [,1] [,2] [,3] [,4]
# [1,] 13 16 19 22
# [2,] 14 17 20 23
# [3,] 15 18 21 24
apply(a3, 2:3, anonfunc)
# Input: 1 2 3
# Step1: TRUE FALSE FALSE
# Step2: 1
# #
# Input: 4 5 6
# Step1: TRUE FALSE FALSE
# Step2: 1
# #
# Input: 7 8 9 ...
# Input: 10 11 12 ...
# Input: 13 14 15 ...
# Input: 16 17 18 ...
# Input: 19 20 21 ...
# Input: 22 23 24 ...
# #
# [,1] [,2]
# [1,] 1 1
# [2,] 1 1
# [3,] 1 1
# [4,] 1 1
And since the dimensions of a3 are 3,4,2, and we're looking at margins 2:3, and each call to anonfunc returns length 1, our returned matrix is 4x2x1 (where the x1 is silently dropped by R).
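The same dimension bookkeeping can be checked without the verbose helper. This is a minimal sketch using min as the per-slice function (a stand-in chosen here because it always returns length 1; res is a made-up name).

```r
a3 <- array(1:24, dim = c(3, 4, 2))

# One call per (column, plane) pair; each call sees the 3 values along dimension 1.
res <- apply(a3, 2:3, min)

dim(res)    # 4 2
res[1, 1]   # 1   (minimum of 1, 2, 3)
res[4, 2]   # 22  (minimum of 22, 23, 24)
```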
[Images: what each call of MARGIN= actually uses]
"Lexical scoping looks up symbol values based on how functions were nested when they were created, not how they are nested when they are called. With lexical scoping, you don’t need to know how the function is called to figure out where the value of a variable will be looked up. You just need to look at the function’s definition."
Source: http://adv-r.had.co.nz/Functions.html#lexical-scoping
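A small sketch of what that quote means in practice (all function and variable names here are hypothetical): the inner function finds x where it was defined, not where it is called from.

```r
x <- "global"

make_reader <- function() {
  x <- "enclosing"
  function() x          # x is looked up in the environment where this function was created
}
reader <- make_reader()

call_elsewhere <- function() {
  x <- "caller"         # ignored by reader(): the calling environment is not searched
  reader()
}

call_elsewhere()        # "enclosing"
```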

Select a column from a list of matrices using a list of vectors

I want to select columns from a list of matrices of different dimension using a list of Boolean vectors that represent the columns of each matrix.
I have tried different map and (s|l)apply combinations, even classic for loops, but I can't manage to select the columns.
With this code, you can generate a list of matrices and Boolean vectors to experiment:
matrices <- list(matrix(c(7,8,9,10), nrow=1), matrix(c(7,8,7,9,7,10,8,9,8,10,9,10), nrow=2), matrix(c(7,8,9,7,8,10,7,9,10,8,9,10), nrow=3))
listOfColumns <- list(c(FALSE,FALSE,FALSE,FALSE), c(FALSE,FALSE,FALSE,FALSE,FALSE,FALSE), c(TRUE,FALSE,FALSE,FALSE))
As an example, having the list of matrices created with the above code:
[[1]]
[,1] [,2] [,3] [,4]
[1,] 7 8 9 10
[[2]]
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 7 7 7 8 8 9
[2,] 8 9 10 9 10 10
[[3]]
[,1] [,2] [,3] [,4]
[1,] 7 7 7 8
[2,] 8 8 9 9
[3,] 9 10 10 10
and the list of Boolean vectors:
[[1]]
[1] FALSE FALSE FALSE FALSE
[[2]]
[1] FALSE FALSE FALSE FALSE FALSE FALSE
[[3]]
[1] TRUE FALSE FALSE FALSE
The result should be a list with a single element:
[[1]]
[1] 7 8 9
Because your list contains matrices that you don't want to keep at all we can first create an index vector that selects only list elements of listOfColumns for which there is at least one TRUE
(idx <- sapply(listOfColumns, any))
# [1] FALSE FALSE TRUE
Next we use Map to subset the remaining matrices
Map(function(x, cols) x[, cols], x = matrices[idx], cols = listOfColumns[idx])
#[[1]]
#[1] 7 8 9
Thanks to @thelatemail for the helpful comment.
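One thing to be aware of: x[, cols] drops a single selected column down to a plain vector, which is what the expected output asks for. If the selections should stay matrices instead, a possible variation (an assumption about what might be wanted, not part of the original answer) is to add drop = FALSE:

```r
matrices <- list(matrix(c(7, 8, 9, 10), nrow = 1),
                 matrix(c(7, 8, 7, 9, 7, 10, 8, 9, 8, 10, 9, 10), nrow = 2),
                 matrix(c(7, 8, 9, 7, 8, 10, 7, 9, 10, 8, 9, 10), nrow = 3))
listOfColumns <- list(c(FALSE, FALSE, FALSE, FALSE),
                      c(FALSE, FALSE, FALSE, FALSE, FALSE, FALSE),
                      c(TRUE, FALSE, FALSE, FALSE))

# Keep only the list elements with at least one selected column.
idx <- sapply(listOfColumns, any)

# drop = FALSE keeps each result as a matrix, even when one column is selected.
kept <- Map(function(x, cols) x[, cols, drop = FALSE],
            matrices[idx], listOfColumns[idx])
dim(kept[[1]])   # 3 1
```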

How do I generate a list of all possible permutations of a vector of numbers (N) taken (n) at a time in R with additional constraints?

For example, Suppose I want to generate all possible permutations in the series 1:10 taken 3 at a time. But, the 3 numbers chosen have to be in ascending order. Hence, 3,4,5 is acceptable but not 5,4,3. The second condition is that they can't have jumps, they have to be consecutive in order. Hence, 1,2,4 is unacceptable. How to get this in R?
We can create the combinations of numbers using combn, then keep only the columns in which both successive differences between rows equal 1 (a logical index built with diff and colSums), and transpose the output
m1 <- combn(1:10, 3)
t(m1[,colSums(diff(m1)==1)==2])
# [,1] [,2] [,3]
#[1,] 1 2 3
#[2,] 2 3 4
#[3,] 3 4 5
#[4,] 4 5 6
#[5,] 5 6 7
#[6,] 6 7 8
#[7,] 7 8 9
#[8,] 8 9 10
These consist of the sequences 1:3, 2:4, ..., 8:10. In general, to obtain all such subsequences of length k among 1:n, you can start with the smallest 1:k and keep adding 1 to its elements:
subseq <- function(n,k) if (1 <= k && k <= n) outer(1:k, 0:(n-k), "+")
The sequences are in the columns, already in lexicographic order. Since no sorting is actually done, this is an O(kn) algorithm, which is asymptotically optimal.
Example: subseq(10,3) produces
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
[1,] 1 2 3 4 5 6 7 8
[2,] 2 3 4 5 6 7 8 9
[3,] 3 4 5 6 7 8 9 10
A slightly faster R implementation might avoid outer like this:
subseq <- function(n=10, k=3) if (1 <= k && k <= n) matrix(rep(0:(n-k), each=k), k) + 1:k
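As a quick cross-check, the sketch below confirms that both subseq variants produce the same columns as the combn-based filter for n = 10, k = 3. The names subseq_outer, subseq_rep and via_combn are introduced here only to keep the versions apart.

```r
subseq_outer <- function(n, k) if (1 <= k && k <= n) outer(1:k, 0:(n - k), "+")
subseq_rep   <- function(n = 10, k = 3) {
  if (1 <= k && k <= n) matrix(rep(0:(n - k), each = k), k) + 1:k
}

# combn-based filter: keep columns whose two successive differences are both 1.
m1 <- combn(1:10, 3)
via_combn <- m1[, colSums(diff(m1) == 1) == 2]

all(subseq_outer(10, 3) == via_combn)   # TRUE
all(subseq_rep(10, 3) == via_combn)     # TRUE
```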

Preserve structure, when indexing a matrix with another matrix in R

Dear StackOverflowers,
I have an integer matrix in R and I would like to subset it so that I remove 1 specified cell in each column. So that, for instance, a 4x3 matrix becomes a 3x3 matrix. I have tried doing it by creating the second logical matrix of the same dimensions.
(subject.matrix <- matrix(1:12, nrow = 4))
[,1] [,2] [,3]
[1,] 1 5 9
[2,] 2 6 10
[3,] 3 7 11
[4,] 4 8 12
(query.matrix <- matrix(c(T, T, F, T, T, F, T, T, T, T, T, F), nrow = 4))
[,1] [,2] [,3]
[1,] TRUE TRUE TRUE
[2,] TRUE FALSE TRUE
[3,] FALSE TRUE TRUE
[4,] TRUE TRUE FALSE
The problem is that, when I index the first matrix by the second one, it is simplified to an integer vector.
subject.matrix[query.matrix]
[1] 1 2 4 5 7 8 9 10 11
I've tried adding drop=F, but to no avail. I know, I can just wrap the resulting vector into a 3x3 matrix. So the expected outcome would be:
matrix(subject.matrix[query.matrix], nrow = 3)
[,1] [,2] [,3]
[1,] 1 5 9
[2,] 2 7 10
[3,] 4 8 11
But I wonder if there's a more elegant/direct solution. I'm also not attached to using a logical matrix as the index, if that means a simpler solution. Perhaps, I could subset it with a vector of indices for the rows to be removed in each column, which in this case would translate into c(3, 2, 4).
Many thanks!
Edit based on @LyzandeR's suggestion: My final goal was to take column sums of the resulting matrix. So replacing the redundant values with NAs seems to be the best way to go.
I think the only way to preserve the matrix structure is a more general version of the approach in your question, i.e.:
matrix(subject.matrix[query.matrix], ncol = ncol(subject.matrix))
You could even convert it into a function if you plan on using it multiple times:
subset.mat <- function(mat, index, cols=ncol(mat)) {
matrix(mat[index], ncol = cols)
}
Output:
> subset.mat(subject.matrix, query.matrix)
[,1] [,2] [,3]
[1,] 1 5 9
[2,] 2 7 10
[3,] 4 8 11
Also (sorry just read your updated comment) you might consider using NAs in the matrix instead of subsetting them out, which will allow you to calculate the column sums as you say:
subject.matrix[!query.matrix] <- NA
subject.matrix
# [,1] [,2] [,3]
#[1,] 1 5 9
#[2,] 2 NA 10
#[3,] NA 7 11
#[4,] 4 8 NA
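With the NA version, the column sums mentioned in the question's edit need na.rm = TRUE, otherwise every column containing an NA sums to NA. A minimal sketch (it works on a copy called masked, a name made up here, so the original matrix stays intact):

```r
subject.matrix <- matrix(1:12, nrow = 4)
query.matrix <- matrix(c(TRUE, TRUE, FALSE, TRUE,
                         TRUE, FALSE, TRUE, TRUE,
                         TRUE, TRUE, TRUE, FALSE), nrow = 4)

masked <- subject.matrix
masked[!query.matrix] <- NA

col_totals <- colSums(masked, na.rm = TRUE)   # skip the masked cells
col_totals                                    # 7 20 30
```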
This is a little brute-forceish, but I think you'll be able to extrapolate it into something more general:
new.matrix = matrix(ncol = ncol(subject.matrix), nrow = nrow(subject.matrix) - 1)
for(i in 1:ncol(subject.matrix)){
new.matrix[,i] = subject.matrix[,i][query.matrix[,i] == TRUE]
}
new.matrix
[,1] [,2] [,3]
[1,] 1 5 9
[2,] 2 7 10
[3,] 4 8 11
Essentially, I just initialized an empty matrix, and then iterated through each column of subject.matrix taking only the TRUE values for query.matrix.
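The question also mentions specifying the rows to remove as a vector such as c(3, 2, 4). One possible loop-free sketch of that idea, using two-column matrix indexing to build the logical mask (drop_rows, keep and result are hypothetical names, and this assumes exactly one row is removed per column):

```r
subject.matrix <- matrix(1:12, nrow = 4)
drop_rows <- c(3, 2, 4)   # row to remove in columns 1, 2, 3

# Start with everything kept, then switch off one (row, column) cell per column.
keep <- matrix(TRUE, nrow(subject.matrix), ncol(subject.matrix))
keep[cbind(drop_rows, seq_len(ncol(subject.matrix)))] <- FALSE

# Column-major extraction keeps each column's survivors together.
result <- matrix(subject.matrix[keep], ncol = ncol(subject.matrix))
result
```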

Is there a way to extract continuous feature in an 2D array

Say I have an array of numbers
a <- c(1,2,3,6,7,8,9,10,20)
Is there a way to tell R to output just the range of each continuous sequence in "a"?
e.g., the continuous sequences in "a" are the following
1,3
6,10
20
Thanks a lot!
Derek
I don't think there is a direct way, but you could create two logical vectors telling you whether each element starts a run (the previous element is not exactly 1 less) or ends a run (the next element is not exactly 1 greater). E.g.:
data.frame(
a,
is_first = c(TRUE,diff(a)!=1),
is_last = c(diff(a)!=1,TRUE)
)
# Gives you:
a is_first is_last
1 1 TRUE FALSE
2 2 FALSE FALSE
3 3 FALSE TRUE
4 6 TRUE FALSE
5 7 FALSE FALSE
6 8 FALSE FALSE
7 9 FALSE FALSE
8 10 FALSE TRUE
9 20 TRUE TRUE
So ranges are:
cbind(a[c(TRUE,diff(a)!=1)], a[c(diff(a)!=1,TRUE)])
[1,] 1 3
[2,] 6 10
[3,] 20 20
I did this (not so elegant I admit) in case you want all the numbers of each sequence in a list
a <- c(1,2,3,6,7,8,9,10,20)
z <- c(1,which(c(1,diff(a))!=1))
g <- lapply(seq_along(z), function(i) {
if (i < length(z)) a[z[i] : (z[i+1] - 1)]
else a[z[i] : length(a)]
})
[[1]]
[1] 1 2 3
[[2]]
[1] 6 7 8 9 10
[[3]]
[1] 20
Then you can get a 2D array with something like this
sapply(g,function(x) c(x[1],x[length(x)]))
[,1] [,2] [,3]
[1,] 1 6 20
[2,] 3 10 20
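A more compact route to the same list and ranges, offered as an alternative sketch rather than a replacement for the answers above, is to label each run with cumsum and then split on that label (run_id, runs and ranges are names chosen for illustration):

```r
a <- c(1, 2, 3, 6, 7, 8, 9, 10, 20)

# A new run starts wherever the step from the previous element is not 1.
run_id <- cumsum(c(TRUE, diff(a) != 1))   # 1 1 1 2 2 2 2 2 3

runs   <- split(a, run_id)                # list of the three runs
ranges <- t(sapply(runs, range))          # one row per run: first and last value
ranges
```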
> a <- c(1,2,3,6,7,8,9,10,20)
> N<-length(a)
> k<-2:(N-1)
> z<-(a[k-1]+1)!=a[k] | (a[k+1]-1)!=a[k]
> c(a[1],a[k][z],a[N])
[1] 1 3 6 10 20

Resources