Efficiently combine (AND) groups of columns in a logical matrix - r

I am looking for an efficient way to combine selected columns in a logical matrix by "ANDing" them together and ending up with a new matrix. An example of what I am looking for:
matrixData <- rep(c(TRUE, TRUE, FALSE), 8)
exampleMatrix <- matrix(matrixData, nrow=6, ncol=4, byrow=TRUE)
exampleMatrix
[,1] [,2] [,3] [,4]
[1,] TRUE TRUE FALSE TRUE
[2,] TRUE FALSE TRUE TRUE
[3,] FALSE TRUE TRUE FALSE
[4,] TRUE TRUE FALSE TRUE
[5,] TRUE FALSE TRUE TRUE
[6,] FALSE TRUE TRUE FALSE
The columns to be ANDed together are specified in a numeric vector of length ncol(exampleMatrix), where columns to be grouped (ANDed) together share the same value (a value from 1 to n, where n <= ncol(exampleMatrix) and every value in 1:n is used at least once). The resulting matrix should have its columns in order 1:n. For example, if the vector that specifies the column groups is
colGroups <- c(3, 2, 2, 1)
Then the resulting matrix would be
[,1] [,2] [,3]
[1,] TRUE FALSE TRUE
[2,] TRUE FALSE TRUE
[3,] FALSE TRUE FALSE
[4,] TRUE FALSE TRUE
[5,] TRUE FALSE TRUE
[6,] FALSE TRUE FALSE
Where in the resulting matrix
[,1] = exampleMatrix[,4]
[,2] = exampleMatrix[,2] & exampleMatrix[,3]
[,3] = exampleMatrix[,1]
My current way of doing this looks basically like this:
finalMatrix <- matrix(TRUE, nrow=nrow(exampleMatrix), ncol=3)
for (i in 1:3){
  selectedColumns <- exampleMatrix[, colGroups==i, drop=FALSE]
  finalMatrix[, i] <- rowSums(selectedColumns) == ncol(selectedColumns)
}
Where rowSums(selectedColumns)==ncol(selectedColumns) is an efficient way to AND all of the columns of a matrix together.
My problem is that I am doing this on very big matrices (millions of rows) and I am looking for any way to make this quicker. My first instinct would be to use apply in some way, but I can't see how that would improve efficiency: the loop body runs only a few times; it is the operation inside the loop that is slow.
In addition, any tips to reduce memory allocation would be very useful, as I currently have to run gc() within the loop frequently to avoid running out of memory completely, and it is a very expensive operation that significantly slows everything down as well. Thanks!
For a more representative example, this is a much larger exampleMatrix:
matrixData <- rep(c(TRUE, TRUE, FALSE), 8e7)
exampleMatrix <- matrix(matrixData, nrow=6e7, ncol=4, byrow=TRUE)

From your example, I understand that there are very few columns and very many rows. In this case, it'll be efficient to just do a simple loop over colGroups (30% improvement over your suggestion):
# finalMatrix pre-allocated as all TRUE, with one column per group (as in the question)
finalMatrix <- matrix(TRUE, nrow=nrow(exampleMatrix), ncol=max(colGroups))
for (jj in seq_along(colGroups))
  finalMatrix[, colGroups[jj]] <-
    finalMatrix[, colGroups[jj]] & exampleMatrix[, jj]
I think it will be hard to beat this without parallelizing. This loop is parallelizable if there are more columns (though the parallelization will have to be done a bit carefully (in batches)).
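One hedged way to parallelize is over the output groups rather than over this loop, since each group's output column is independent (a sketch, not part of the original answer; the helper name groupAnd is made up, mclapply is in the base parallel package, and mc.cores > 1 is not supported on Windows, where parLapply would be needed instead):
library(parallel)
groupAnd <- function(i) {
  sel <- exampleMatrix[, colGroups == i, drop = FALSE]
  rowSums(sel) == ncol(sel)
}
finalMatrix <- do.call(cbind, mclapply(seq_len(max(colGroups)), groupAnd, mc.cores = 2))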

As far as I can tell, this is an aggregation across columns using the all function. So if you transpose to rows, then use colGroups as the grouping factor to apply all, then transpose back to columns, you should get the intended result:
t(aggregate(t(exampleMatrix), list(colGroups), FUN=all)[-1])
# [,1] [,2] [,3]
#V1 TRUE FALSE TRUE
#V2 TRUE FALSE TRUE
#V3 FALSE TRUE FALSE
#V4 TRUE FALSE TRUE
#V5 TRUE FALSE TRUE
#V6 FALSE TRUE FALSE
The [-1] just drops the group-identifier variable which you don't require in the final output.
If you're working with stupid big data, the by-group aggregation could be done in data.table as well:
library(data.table)
t(as.data.table(t(exampleMatrix))[, lapply(.SD,all), by=colGroups][,-1])

Related

How to create permutations of a logical vector?

Is there a function that will help me output all 2^n permutations of a boolean vector of length n? For instance, if I have a boolean vector of length n=2, c(FALSE,FALSE), I should obtain 2^2=4 permutations.
As such, I need a function that will generalize this output for an array of length n,
that means if n=3, the output should have 2^3 rows, and so on...
I have already tried permutations from the gtools package, but it seems to be incorrect, or at best provides only a partial answer. This approach does not generalize well and has also given me errors for n>2.
> permutations(2,2,c(TRUE,FALSE))
[,1] [,2]
[1,] FALSE TRUE
[2,] TRUE FALSE
Output should be:
FALSE, FALSE,
TRUE, TRUE,
FALSE, TRUE,
TRUE, FALSE
You were missing repeats.allowed=T:
gtools::permutations(2,2, c(T,F), repeats.allowed = T)
[,1] [,2]
[1,] FALSE FALSE
[2,] FALSE TRUE
[3,] TRUE FALSE
[4,] TRUE TRUE
You can make your custom function around permutations:
my_permute <- function(vect, n, repeats = TRUE) {
  gtools::permutations(length(vect), n, vect, repeats.allowed = repeats)
}
my_permute(vect=c(T,F), n=2)
Example with more elements:
my_permute(letters[1:3], n=3)
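With repeats allowed, this should produce 3^3 = 27 rows, one per arrangement:
nrow(my_permute(letters[1:3], n = 3))
# [1] 27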
You can use expand.grid,
expand.grid(c(TRUE, FALSE), c(TRUE, FALSE))
# Var1 Var2
#1 TRUE TRUE
#2 FALSE TRUE
#3 TRUE FALSE
#4 FALSE FALSE
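To generalize this to an arbitrary n, one option (a sketch along the same lines) is to pass expand.grid a list of n copies of the candidate values:
n <- 3
expand.grid(rep(list(c(TRUE, FALSE)), n))  # 2^3 = 8 rows, one per combination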
You can use gtools package and the function permutations:
This is the source code:
library(gtools)
x <- c(TRUE, FALSE)
permutations(n=length(x),r=2,v=x,repeats.allowed=T)

counting N occurrences within a ceiling range of a matrix by-row

I would like to tally each time a value lies within a given range in a matrix by-row, and then sum these logical outcomes to derive a "measure of consistency" for each row.
Reproducible example:
m1 <- matrix(c(1,2,1,6,3,7,4,2,6,8,11,15), ncol=4, byrow = TRUE)
# expected outcome, given a range of +/-1 either side
exp.outcome <- matrix(c(TRUE, TRUE, TRUE, FALSE,
                        TRUE, FALSE, TRUE, TRUE,
                        FALSE, FALSE, FALSE, FALSE),
                      ncol = 4, byrow = TRUE)
Above I've indicated the expected outcome, in the case where each value lies within a +/- 1 range of any other value within that row.
Within the first row of m1 the first value (1) is within +/-1 of any other value in that row hence equals TRUE, and so on.
By contrast, none of the values in the last row of m1 (6, 8, 11, 15) are within +/- 1 of each other, and hence each is assigned FALSE.
Any pointers would be much appreciated.
Update:
Thanks to the help provided I can now count the unique pairs of values which meet the ceiling criteria for any arbitrarily large matrix (using the binomial coefficient, k draws from n, without replacement).
Before progressing with the answer I just wanted to clarify that in your question you have said:
Within the first row of m1 the first value (1) is within +/-1 of any
other value in that row hence equals TRUE, and so on.
However,
> m1[1,4]
[1] 6
6 is not within +/- 1 of 1, and accordingly your expected answer has FALSE as the correct result in that position.
Solution
This solution should get you to the desired results:
res <- t(apply(
  X = m1,
  # Take each row from the matrix
  MARGIN = 1,
  FUN = function(x) {
    sapply(
      X = x,
      # Now go through each element of that row
      FUN = function(y) {
        # Your conditions
        y %in% c(x - 1) | y %in% c(x + 1)
      }
    )
  }
))
res
Results
[,1] [,2] [,3] [,4]
[1,] TRUE TRUE TRUE FALSE
[2,] TRUE FALSE TRUE TRUE
[3,] FALSE FALSE FALSE FALSE
Check
For results stored as res:
> identical(res, exp.outcome)
[1] TRUE
Here is a kind of neat base R method that uses an array:
The first two lines are setup: they store a three-dimensional array of acceptable values and a matrix that will hold the desired output. The structure of the array is as follows: each column holds the acceptable values (value - 1, value, value + 1) for the matrix element in the same column, and the third dimension corresponds to the rows of the matrix.
Pre-allocation in this way should cut down on repeated computations.
# construct array of all +1/-1 values
valueArray <- sapply(1:nrow(m1),
                     function(i) rbind(m1[i, ] - 1, m1[i, ], m1[i, ] + 1),
                     simplify = "array")
# get logical matrix of correct dimensions
exp.outcome <- matrix(TRUE, nrow(m1), ncol(m1))
# get desired values
for(i in 1:nrow(m1)) {
  exp.outcome[i, ] <- sapply(1:ncol(m1),
                             function(j) m1[i, j] %in% c(valueArray[, -j, i]))
}
Which returns
exp.outcome
[,1] [,2] [,3] [,4]
[1,] TRUE TRUE TRUE FALSE
[2,] TRUE FALSE TRUE TRUE
[3,] FALSE FALSE FALSE FALSE
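As a small follow-up sketch for the per-row "measure of consistency" mentioned in the question, the logical outcomes can then simply be summed row-wise:
rowSums(exp.outcome)
# [1] 3 3 0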

Vectorize R command (part 2)

Yesterday I asked a very simple vectorization question and got some great answers. Today the question is a bit more complex and I'm wondering if R has a function to speed up the runtime of this loop through vectorization.
The loop is
for(j in 1:N) {
A[j,1] = B[max(which(C[j]>=D))];
}
I tried
A[,1] = B[max(which(C>=D))];
and this dropped the runtime considerably ... but the answer was wrong. Is there a "correct" way to do this in R?
EDIT1:
Thanks for the questions regarding data. I will give sizes of the arrays here:
We are looping over 1:N
A is N x 1
B is length M
C is length N
D is length M
If it matters in terms of speed, in this example, N = 844, M = 2500.
Edit2:
And here are some values for a smaller simulated dataset:
B <- c(1.0000000, 1.0000000, 1.0000000, 0.9565217, 0.9565217, 0.9565217, 0.9565217,
0.9565217, 0.9565217, 0.9565217, 0.8967391, 0.8369565, 0.7771739, 0.7173913,
0.7173913, 0.7173913, 0.7173913, 0.7173913, 0.6277174, 0.6277174, 0.5230978,
0.5230978, 0.3923234, 0.3923234, 0.3923234)
C <- c(0.10607, 0.14705, 0.43607, 0.56587, 0.76203, 0.95657, 1.03524, 1.22956, 1.39074, 2.36452)
D <- c(0.10607, 0.13980, 0.14571, 0.14705, 0.29412, 0.33693, 0.43607, 0.53968, 0.56587,
0.58848, 0.64189, 0.65475, 0.75518, 0.76203, 0.95657, 1.03524, 1.05454, 1.18164,
1.22956, 1.23760, 1.39074, 1.87604, 2.36452, 2.89497, 4.42393)
The result should be:
> A
[,1]
[1,] 1.0000000
[2,] 0.9565217
[3,] 0.9565217
[4,] 0.9565217
[5,] 0.7173913
[6,] 0.7173913
[7,] 0.7173913
[8,] 0.6277174
[9,] 0.5230978
[10,] 0.3923234
If you are eager to get the answer immediately, jump to Conclusion. I offer you a single line R code, with maximum efficiency. For details/ideas, read through the following.
Code re-shaping and problem re-definition
When OP asks a vectorization of the following loop:
for(j in 1:N) A[j, 1] <- B[max(which(C[j] >= D))]
The first thing I do is to transform it into a nice version:
## stage 1: index computation (needs vectorization)
id <- integer(N); for(j in 1:N) id[j] <- max(which(D <= C[j]))
## stage 2: shuffling (readily vectorized)
A[, 1] <- B[id]
Now we see that only stage 1 needs to be vectorized. This stage essentially does the following:
D[1] D[2] D[3] ... D[M]
C[1]
C[2]
C[3]
.
.
C[N]
For each row j, find the cut off location k(j) in D, such that D[k(j) + 1], D[k(j) + 2], ..., D[M] > C[j].
Efficient algorithm based on sorting
There is actually an efficient algorithm to do this:
sort C in ascending order, into CC (record ordering index iC, such that C[iC] == CC)
sort D in ascending order, into DD (record ordering index iD, such that D[iD] == DD)
By sorting, we substantially reduce the work complexity.
If the data are unsorted, we have to explicitly scan all elements D[1], D[2], ..., D[M] in order to decide on k(j). So there is an O(M) cost for each row, and an O(MN) cost in total.
However, if the data are sorted, then we only need to do the following:
j = 1: search `D[1], D[2], ..., D[k(1)]`, till `D[k(1) + 1] > C[1]`;
j = 2: search `D[k(1) + 1], D[k(1)+2], ..., D[k(2)]`, till `D[k(2) + 1] > C[2]`;
...
For each row, only a partial search is applied, and the overall complexity is only O(M), i.e., the D vector is only touched once, rather than N times as in the trivial implementation. As a result, after sorting, the algorithm is N times faster!! For large M and N, this is a huge difference! As you said in another comment, this code will be called millions of times, so we definitely want the O(M) algorithm instead of the O(MN) one.
Also note that the memory cost of this approach is O(M + N), i.e., we only concatenate two vectors together rather than expanding them into an M-by-N matrix, so the storage saving is noticeable as well.
In fact, we can take this one step further by converting the comparison problem into a matching problem, which is easier to vectorize in R.
## version 1:
CCDD <- c(CC, DD) ## combine CC and DD
CCDD <- sort(CCDD, decreasing = TRUE) ## sort into descending order
id0 <- M + N - match(CC, CCDD) + 1
id <- id0 - 1:N
To understand why this works, consider an alternative representation:
## version 2:
CCDD <- c(CC, DD) ## combine CC and DD
CCDD <- sort(CCDD) ## sort into ascending order
id0 <- match(CC, CCDD)
id <- id0 - 1:N
Now the following diagram illustrates what CCDD vector looks like:
CCDD: D[1] D[2] C[1] D[3] C[2] C[3] D[4] D[5] D[6] C[4] .....
id0: 3 5 6 10 .....
id : 2 3 3 6 .....
So DD[id] gives D[2], D[3], D[3], D[6], ...: exactly the last elements of DD that are no greater than C[1], C[2], C[3], C[4], .... Therefore, id is just the index vector we want!
Then people may wonder why I suggest doing "version 1" rather than "version 2". The reason is that when there are tied values in CCDD, "version 2" will give a wrong result, because match() will take the first element that matches, ignoring later matches. So instead of matching from left to right (in ascending index), we have to match from right to left (in descending index).
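A tiny illustration of that tie behaviour (an aside): match() reports only the first matching position, so later duplicates are never seen:
match(2, c(1, 2, 2, 3))
# [1] 2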
Using OP's data
With this in mind, I start looking at OP's data. Now amazingly, OP's data are already sorted:
C <- c(0.10607, 0.14705, 0.43607, 0.56587, 0.76203, 0.95657, 1.03524, 1.22956, 1.39074, 2.36452)
D <- c(0.10607, 0.13980, 0.14571, 0.14705, 0.29412, 0.33693, 0.43607, 0.53968, 0.56587, 0.58848,
0.64189, 0.65475, 0.75518, 0.76203, 0.95657, 1.03524, 1.05454, 1.18164, 1.22956, 1.23760,
1.39074, 1.87604, 2.36452, 2.89497, 4.42393)
M <- length(D); N <- length(C)
is.unsorted(C)
# FALSE
is.unsorted(D)
#FALSE
Furthermore, OP has already combined C and D:
all(C %in% D)
# TRUE
It seems that OP and I have the same idea on efficiency in mind. Presumably OP once had a shorter D vector, while the D vector he supplied is really the CCDD vector I mentioned above!
Now, in this situation, things are all the way simple: we just do a single line:
id <- M - match(C, rev(D)) + 1
Note that I put rev() in because OP has sorted D in ascending order, so I need to reverse it. This single line may look very different from the "version 1" code, but nothing is wrong here. Remember, the D used here is really the CCDD in the "version 1" code, and the M here is really the M + N there. Also, there is no need to subtract 1:N from id, due to our different definition of D.
Checking result
Now, the trivial R-loop gives:
id <- integer(N); for(j in 1:N) id[j] <- max(which(D <= C[j]))
id
# [1] 1 4 7 9 14 15 16 19 21 23
Well, our single line, vectorized code gives:
id <- M - match(C, rev(D)) + 1
id
# [1] 1 4 7 9 14 15 16 19 21 23
Perfect match, hence we are doing the right thing.
Conclusion
So, Laurbert, this is the answer you want:
A[, 1] <- B[M - match(C, rev(D)) + 1]
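As a hedged aside (not part of the original answer): because D is sorted in ascending order here, base R's findInterval() computes the same index directly, assuming every value of C is at least min(D) as in this data (otherwise it returns 0 and the subsetting below would misbehave):
id <- findInterval(C, D)  # for each C[j], the largest i with D[i] <= C[j]
A[, 1] <- B[id]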
You can use outer for this.
Your code:
A1 <- matrix(NA_real_, ncol = 1, nrow = length(C))
for(j in seq_along(C)) {
A1[j,1] = B[max(which(C[j]>=D))];
}
Test whether the elements of C are larger than or equal to the elements of D with outer:
test <- outer(C, D, FUN = ">=")
# [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22] [,23] [,24] [,25]
# [1,] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
# [2,] TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
# [3,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
# [4,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
# [5,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
# [6,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
# [7,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
# [8,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
# [9,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE
#[10,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE
Note that this can use a lot of memory for large vectors.
Then find the last TRUE value in each row:
ind <- max.col(test, ties.method = "last") * (rowSums(test) > 0)
rowSums(test) > 0 tests if there are any TRUE values and makes the corresponding element of ind 0 otherwise. It's undefined what you'd want to happen in this case. (A 0 index is ignored during subsetting. Possibly, you'd want NA instead in your final result?)
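A sketch of that NA variant, assuming NA really is the desired result for such rows (ind2 is just a hypothetical name):
ind2 <- max.col(test, ties.method = "last")
ind2[rowSums(test) == 0] <- NA_integer_  # rows where no element of D is <= C[j]
B[ind2]                                  # yields NA for those rows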
Now subset:
A2 <- as.matrix(B[ind], ncol = 1)
# [,1]
# [1,] 1.0000000
# [2,] 0.9565217
# [3,] 0.9565217
# [4,] 0.9565217
# [5,] 0.7173913
# [6,] 0.7173913
# [7,] 0.7173913
# [8,] 0.6277174
# [9,] 0.5230978
#[10,] 0.3923234
Are the results identical?
identical(A2, A1)
#[1] TRUE
The data (please use dput next time to provide example data):
B <- c(1.0000000, 1.0000000, 1.0000000, 0.9565217, 0.9565217, 0.9565217, 0.9565217,
0.9565217, 0.9565217, 0.9565217, 0.8967391, 0.8369565, 0.7771739, 0.7173913,
0.7173913, 0.7173913, 0.7173913, 0.7173913, 0.6277174, 0.6277174, 0.5230978,
0.5230978, 0.3923234, 0.3923234, 0.3923234)
C <- c(0.10607, 0.14705, 0.43607, 0.56587, 0.76203, 0.95657, 1.03524, 1.22956, 1.39074,
2.36452)
D <- c(0.10607, 0.13980, 0.14571, 0.14705, 0.29412, 0.33693, 0.43607, 0.53968, 0.56587,
0.58848, 0.64189, 0.65475, 0.75518, 0.76203, 0.95657, 1.03524, 1.05454, 1.18164,
1.22956, 1.23760, 1.39074, 1.87604, 2.36452, 2.89497, 4.42393)
After seeing #Roland's answer, I think I understand better what you are asking. To double check: you want to compare each value of C (individually) against all values of D, and get the largest index of D (let's call it k) that holds a value no larger than C[j]. You then want to use it to assign the corresponding value of B to A, thus A[j]=B[k]. Is this correct?
I don't have an answer regarding how to vectorize what you want to do, but do have some suggestions on how to speed it up. Before that, let me ask whether it's actually worth going through the effort. For the larger example you mentioned (N~1000, M~2500), your loop still runs in well under a second on my laptop. Unless this calculation is done many times over inside another loop, it seems like unnecessary optimization...
Also, like #Roland pointed out, it's not clear what should happen if there is a value in C that's smaller than all values in D. These functions (including your original loop) will not work if that happens and would need some slight tweaking.
Anyway, these are my suggestions:
First, let me wrap your loop into a function for convenience.
f_loop <- function(B, C, D){
  N <- length(C)
  A <- matrix(0, ncol = 1, nrow = N)
  for(j in 1:N) {
    A[j, 1] <- B[max(which(C[j] >= D))]
  }
  return(A)
}
If you want it to look a bit more "R-like" you can replace the loop with one of the *apply functions. In this case, it also runs slightly faster than the loop.
vapply(C, function(x) B[max(which(x >= D))], 0)
## Wrapped into a function for easier reference
f_vapply <- function(B, C, D){
  vapply(C, function(x) B[max(which(x >= D))], 0)
}
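As noted above, none of these handle a value of C that is smaller than every element of D; one possible NA-returning tweak of the vapply version (a sketch, with a hypothetical name):
f_vapply_na <- function(B, C, D){
  vapply(C, function(x) {
    k <- which(x >= D)
    if (length(k)) B[max(k)] else NA_real_
  }, numeric(1))
}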
My other suggestion is uglier (and not really "R-like"), but can help speed things up a lot (if that's the end goal here). I used the inline package to create a compiled version of your loop (note that depending on your OS and R setup, you may need to download additional tools or packages to be able to compile code).
## Translate loop into Fortran
loopcode <-
" integer i, j, k
do i = 1, n
k = 0
do j = 1, m
if (C(i) >= D(j)) k = j
end do
A(i) = B(k)
end do
"
## Compile into function
library(inline)
loopfun <- cfunction(sig = signature(A="numeric", B="numeric", C="numeric", D="numeric", n="integer", m="integer"), dim=c("(n)", "(m)", "(n)", "(m)", "", ""), loopcode, language="F95")
## Wrap into function for easier reference
f_compiled <- function(B, C, D){
  A <- C
  n <- length(A)
  m <- length(B)
  out <- loopfun(A, B, C, D, n, m)
  return(as.matrix(out$A, ncol = 1))
}
Let's check that the results all match:
cbind(A, f_loop(B, C, D), f_vapply(B, C, D), f_compiled(B, C, D))
## [,1] [,2] [,3] [,4]
## [1,] 1.0000000 1.0000000 1.0000000 1.0000000
## [2,] 0.9565217 0.9565217 0.9565217 0.9565217
## [3,] 0.9565217 0.9565217 0.9565217 0.9565217
## [4,] 0.9565217 0.9565217 0.9565217 0.9565217
## [5,] 0.7173913 0.7173913 0.7173913 0.7173913
## [6,] 0.7173913 0.7173913 0.7173913 0.7173913
## [7,] 0.7173913 0.7173913 0.7173913 0.7173913
## [8,] 0.6277174 0.6277174 0.6277174 0.6277174
## [9,] 0.5230978 0.5230978 0.5230978 0.5230978
## [10,] 0.3923234 0.3923234 0.3923234 0.3923234
And check the speed:
microbenchmark(f_loop(B, C, D), f_vapply(B, C, D), f_compiled(B, C, D))
## Unit: microseconds
## expr min lq mean median uq max neval cld
## f_loop(B, C, D) 52.804 54.8075 57.34588 56.5420 58.4615 83.843 100 c
## f_vapply(B, C, D) 38.677 41.5055 43.21231 42.8825 44.1525 65.355 100 b
## f_compiled(B, C, D) 17.095 18.2775 20.55372 20.1770 21.4710 66.407 100 a
We can also try it with vectors of similar size to the larger ones you mentioned (note the change in units for the results):
## Make the vector larger for benchmark
B <- rep(B, 100) # M = 2500
C <- rep(C, 100) # N = 1000
D <- rep(D, 100) # M = 2500
microbenchmark(f_loop(B, C, D), f_vapply(B, C, D), f_compiled(B, C, D))
## Unit: milliseconds
## expr min lq mean median uq max neval cld
## f_loop(B, C, D) 24.380069 24.85061 25.99855 25.839282 25.952433 62.75721 100 b
## f_vapply(B, C, D) 23.543749 24.18427 25.34881 25.015859 25.179924 62.60746 100 b
## f_compiled(B, C, D) 1.976611 2.01403 2.06750 2.032864 2.057594 3.13658 100 a
EDIT:
I realized that if you always want the largest index of D for which C[j]>=D holds, of course it makes much more sense to loop through D starting from the end of the array, and exiting as soon as the first instance is found (instead of looping through the full array).
This is a small tweak to the Fortran code I wrote above that takes advantage of that.
loopcode <-
" integer i, j, k
do j = 1, n
k = 0
do i = m, 1, -1
if (C(j) >= D(i)) then
k = i
exit
end if
end do
A(j) = B(k)
end do
"
I won't include it in the benchmarks, because it'll be much more dependent on the actual data points. But it is obvious that worst case behavior is the same as the previous loop (e.g. if the index of interest occurs at the beginning, D is looped through in full) and the best case behavior almost completely eliminates looping through D (e.g. if the condition holds at the end of the array).

Why does this vectorized matrix comparison fail?

I am trying to compare 1st row of a matrix with all rows of the same matrix. But the vectorized comparison is not returning correct results. Any reason why this may be happening?
m <- matrix(c(1,2,3,1,2,4), nrow=2, ncol=3, byrow=TRUE)
> m
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 1 2 4
> # Why does the first row not have 3 TRUE values?
> m[1,] == m
[,1] [,2] [,3]
[1,] TRUE FALSE FALSE
[2,] FALSE FALSE FALSE
> m[1,] == m[1,]
[1] TRUE TRUE TRUE
> m[1,] == m[2,]
[1] TRUE TRUE FALSE
Follow-up. In my actual data I have a large number of rows (at least 10 million), so both time and memory add up. Any additional suggestions on the approaches suggested below by others?
m <- matrix(rep(c(1,2,3), 1000000), ncol=3, byrow=TRUE)
> #by #alexis_laz
> m1 <- matrix(m[1,], nrow = nrow(m), ncol = ncol(m), byrow = T)
> system.time(m == m1)
user system elapsed
0.21 0.03 0.31
> object.size(m1)
24000112 bytes
> #by #PaulHiemstra
> system.time( t(apply(m, 1, function(x) x == m[1,])) )
user system elapsed
35.18 0.08 36.04
Follow-up 2. #alexis_laz you are correct. I want to compare every row with each other and have posted a follow-up question on that (How to vectorize comparing each row of matrix with all other rows).
In the comparison m[1,] == m, the first term m[1,] is recycled (once) to equal the length of m. The comparison is then done column-wise.
You're comparing c(1,2,3) with c(1,1,2,2,3,4), i.e. (after recycling) c(1,2,3,1,2,3) with c(1,1,2,2,3,4), so you get one TRUE followed by five FALSE (packaged as a matrix to match the dimensions of m).
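The same comparison spelled out element-wise (a small illustrative aside; m is stored column-major):
as.vector(m)                     # 1 1 2 2 3 4
rep(m[1, ], length.out = 6)      # 1 2 3 1 2 3
as.vector(m) == rep(m[1, ], length.out = 6)
# [1]  TRUE FALSE FALSE FALSE FALSE FALSE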
As #MatthewLundberg pointed out, the recycling rules of R do not behave as you expected. In my opinion it is always better to state explicitly what to compare and not rely on R's assumptions. One way to make the correct comparison:
t(apply(m, 1, function(x) x == m[1,]))
[,1] [,2] [,3]
[1,] TRUE TRUE TRUE
[2,] TRUE TRUE FALSE
or:
m == rbind(m[1,], m[1,])
[,1] [,2] [,3]
[1,] TRUE TRUE TRUE
[2,] TRUE TRUE FALSE
or by making R's recycling work in your favor (thanks to #Arun):
t(t(m) == m[1,])
[,1] [,2] [,3]
[1,] TRUE TRUE TRUE
[2,] TRUE TRUE FALSE

Select an element from each row of a matrix in R

The question is the same as here, but in R. I have a matrix and a vector such that
length(vec) == nrow(mat)
How do I get a vector v such that
v[i] == mat[i, vec[i]]
I tried to achieve this by using a logical matrix:
> a = matrix(runif(12), 4, 3)
> a
[,1] [,2] [,3]
[1,] 0.6077585 0.5354680 0.2802681
[2,] 0.2596180 0.6358106 0.9336301
[3,] 0.5317069 0.4981082 0.8668405
[4,] 0.6150885 0.5164009 0.5797668
> sel = col(a) == c(1,3,2,1)
> sel
[,1] [,2] [,3]
[1,] TRUE FALSE FALSE
[2,] FALSE FALSE TRUE
[3,] FALSE TRUE FALSE
[4,] TRUE FALSE FALSE
> a[sel]
[1] 0.6077585 0.6150885 0.4981082 0.9336301
It selects the right elements but messes up the order. I also thought of using mapply, but I don't know how to make it iterate through rows, like apply does.
Update: #gsk3 suggested using as.list(as.data.frame(t(a))), and this works. But I would still like to know if there is a more vectorized way, without lists.
I am not 100% sure I understand your question, but it seems like this may be close?
> b=c(1,3,2,1)
> i=cbind(1:nrow(a),b)
> a[i]
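With the example matrix above, the two-column index matrix i picks a[1,1], a[2,3], a[3,2] and a[4,1], so the values come back in row order; based on the matrix printed earlier, that should be:
[1] 0.6077585 0.9336301 0.4981082 0.6150885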
