Efficient way to perform cell-wise calculation in a large matrix - r

I am trying to extract 2 bits from each 8-bit value in NASA's cloud mask intermediate product.
The matrix has a dimension of 3200 x 3248. I have to do this conversion for thousands of data sets.
Here is one of the datasets that I want to convert: dataset
Here is my code:
library(binaryLogic)
test = as.logical(c(0,0))
#n_row <- nrow(cmask_1)
n_row <- 100
cmask_2bits <- matrix(nrow=n_row, ncol=ncol(cmask_1))
t1 <- Sys.time()
for(i in 1:n_row){
cmask_2bits[i,] <- sapply(cmask_1[i,], function (x) ifelse(identical(as.logical(as.binary(x, n=8)[5:6]), test), 0, 1))
}
t2 <- Sys.time()
time <- difftime(t2, t1)
t1_mthd2 <- Sys.time()
cmask_2bits_mthd2 <- matrix(nrow=n_row, ncol=ncol(cmask_1))
cmask_2bits_mthd2 <- mapply(function (x)
ifelse(identical(as.logical(as.binary(x, n=8)[5:6]), test), 0, 1), cmask_1[1:n_row,])
cmask_2bits_mthd2 <- matrix(cmask_2bits_mthd2, nrow=n_row, ncol=ncol(cmask_1))
t2_mthd2 <- Sys.time()
time_mthd2 <- difftime(t2_mthd2, t1_mthd2)
time_mthd2 - time
I have tried these two approaches: sapply inside a for loop, and mapply. I am wondering whether the ifelse call can also be improved for speed.
My second question is whether this job (processing thousands of matrices) should be done on a Hadoop platform or not.
My result should be like this for the first 10 rows and 10 columns:
Any suggestion would be appreciated.
Edit: As an example, as.binary(15, n=8) gives me the result as
0 0 0 0 1 1 1 1 as an 8-bit unsigned character. This binary value is read from the right, so that my 2 bits of interest are 3rd and 4th bits, which are 1 1. Since as.binary(15, n=8) gives me a "binary" "logical" vector, I can get these bits by requesting 5th and 6th values of this code result.

For the broader question: when the operation is identical for every cell, the most efficient way to do cell-wise operations on a large matrix is to use the built-in vectorized operations. A matrix in R is really just a vector with some metadata about dimensions. For your specific question, in addition to the lack of vectorization, it looks like as.binary from binaryLogic is not computationally efficient. For your simple case of bits 5 and 6 being zero in an 8-bit integer, just do it with integer math:
(((cmask_1 %% 128) %% 64) < 16) + 0
The modulos clear out bits 7 and 8, and from there all values with bits 5 and 6 equal to zero will be less than 16. Adding 0 to the result converts from a logical vector to 0/1.
Edit: looking back at your example it looks like you want the result to be zero when bits 5 and 6 are both zero. That would be:
(((cmask_1 %% 128) %% 64) > 15) + 0

I think the fastest method is to use bitwise logical operators. If you want to extract bits 3 and 4 from an integer X you can use "X AND 12" (4 + 8 = 12). As a result you get "4" if only the 3rd bit is set, "8" if only the 4th bit is set, and "12" if both the 3rd and 4th bits are set.
In R there is the package "bitops", which supports the operations you need:
library(bitops)
mat_cmask = as.matrix(df_cmask)
v = as.vector(mat_cmask, mode="integer")
v1 = bitAnd(v, 12) # there are still values 4, 8 and 12
v2 = as.integer(v1>0)
result = matrix(v2, nrow=nrow(mat_cmask), ncol=ncol(mat_cmask))
result[1:10, 1:10]
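The same mask can also be sketched with base R's bitwAnd, without an extra package. This is a minimal sketch on a small made-up matrix (the values in cmask_1 below are illustrative, not taken from the NASA product); it also checks the result against an arithmetic form, (x %/% 4) %% 4, which isolates the same two bits.

```r
# Toy 2 x 4 matrix of 8-bit values (hypothetical data for illustration)
cmask_1 <- matrix(c(0L, 3L, 4L, 8L, 12L, 15L, 16L, 255L), nrow = 2)

# Keep only the bits worth 4 and 8 (mask 12), then flag any nonzero result
bits <- bitwAnd(as.integer(cmask_1), 12L)
result <- matrix(as.integer(bits > 0),
                 nrow = nrow(cmask_1), ncol = ncol(cmask_1))

# Arithmetic equivalent: drop the two lowest bits, keep the next two
result_arith <- ((cmask_1 %/% 4L) %% 4L > 0) + 0L

print(result)
```

Both forms operate on the whole matrix at once, so no loop over cells is needed.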
Best, Stefan

Thanks to @W. Murphy for this simple and clear answer. Following the same integer arithmetic, the correct expression for my case should be
(((((((cmask_1 %% 256) %% 128) %% 64) %% 32) %% 16) < 16) & (3 < (((((cmask_1 %% 256) %% 128) %% 64) %% 32) %% 16))) + 0
where I restrict the remainder to values greater than 3 and less than 16, so that any number falling in this interval has the bit worth 4, the bit worth 8, or both set.
Thanks again.

Related

Randomly select values from a given number list to add to a certain value in r

If I have a set of values such as
c(1,2,5,6,7,15,19,20)
and I want to randomly select 2 values whose sum equals 20. From the above list, possible samples that I would like to see would be
[19,1], [15,5]
How do I do this in R? Any help would be greatly appreciated.
This computes all possible pairs from your input vector, so it may become a problem if the vector is very long.
getVal <- function(vec,val) {
  comb = combn(vec, 2)
  idx = colSums(comb) == val
  if (sum(idx)) {
    # drop = FALSE keeps a matrix even when only one pair matches
    return(comb[, idx, drop = FALSE][, sample(sum(idx), 1)])
  }
  return(FALSE)
}
vec = (c(1,4,6,9))
val = 10
getVal(vec,val)
# [1] 1 9   (or 4 6; the pair is chosen at random)
val = 11
getVal(vec,val)
# [1] FALSE
For a small vector of values you can do an exhaustive search by working out all the combinations of pairs in the values. Example:
> values = c(1,2,5,6,7,15,19,20)
> pairs = matrix(values[t(combn(length(values),2))],ncol=2)
That is a 2-column matrix of all pairs from values. Now sum the rows and look for the target value of 20:
> targets = apply(pairs,1,sum)==20
> pairs[targets,]
     [,1] [,2]
[1,]    1   19
[2,]    5   15
The size of pairs increases such that if you have 100 values then pairs will have nearly 5000 rows.
You can do this with the sample() function and a while loop. It isn't the prettiest solution, but it is certainly simple to implement.
First you sample two values from the vector and store them in an object, like:
values <- c(1, 2, 5, 6, 7, 15, 19, 20)
randomTwo <- sample(values, 2)
Then you start your while loop. This loop checks whether the sum of the two sampled values modulo 10 equals 0 (I assumed you meant modulo from the examples in your question; see https://en.wikipedia.org/wiki/Modulo_operation to see what it does). If the result is not 0, the loop samples two new values, and it keeps doing so until the result is 0, at which point you have your two values.
Here's what it looks like:
while (sum(randomTwo) %% 10 != 0) {
randomTwo <- sample(values, 2)
}
Now this might take more iterations than checking all combinations, or it might take fewer, depending on chance. If you have just this small vector, then it's a nice solution. Good luck!
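If the goal is an exact sum of 20 rather than a multiple of 10, the same rejection-sampling idea can be sketched as follows. The names target, max_tries and found, as well as the retry cap, are assumptions added for illustration; the cap guards against an endless loop when no pair reaches the target.

```r
values <- c(1, 2, 5, 6, 7, 15, 19, 20)
target <- 20          # assumed exact target sum
max_tries <- 10000    # assumed safety cap on the number of draws

set.seed(42)
randomTwo <- sample(values, 2)
tries <- 1
# Redraw until the pair sums to the target or the cap is reached
while (sum(randomTwo) != target && tries < max_tries) {
  randomTwo <- sample(values, 2)
  tries <- tries + 1
}
found <- sum(randomTwo) == target
randomTwo
```

Checking found afterwards distinguishes a genuine hit from a loop that simply ran out of tries.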
Here is a way that doesn't need to compute an immense matrix (and is much faster):
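findpairs=function(a,sum,num){
  list=list()
  aux=1
  for (i in 1:length(a)){
    # positions whose value adds up with a[i] to the target sum
    n=which((a+a[i])==sum)
    if (length(n)){
      for (j in n){
        if (j!=i){
          list[[aux]]=c(a[i],a[j])
          aux=aux+1
        }
      }
    }
  }
  # a is sorted, so the first half of the list holds each pair once
  return(sample(list[1:(length(list)/2)],num))
}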
a=c(1,2,5,6,19,7,15,20)
a=a[order(a)]
sum=20
findpairs(a,sum,2)
[[1]]
[1] 5 15
[[2]]
[1] 1 19
The issue was that it gave repeated pairs.
Edit
Solved. Just take the first half of the list, as the other half contains the same pairs in reverse order.

R - How to get row & column subscripts of matched elements from a distance matrix

I have an integer vector vec1 and I am generating a distance matrix using the dist function. I want to get the coordinates (row and column) of the elements with a certain value in the distance matrix. Essentially I would like to get the pairs of elements that are d apart. For example:
vec1 <- c(2,3,6,12,17)
distMatrix <- dist(vec1)
#    1  2  3  4
# 2  1
# 3  4  3
# 4 10  9  6
# 5 15 14 11  5
Say I am interested in pairs of elements in the vector that are 5 units apart. I want to get coord1, the rows, and coord2, the columns, of the matching entries in the distance matrix. In this toy example, I would expect
coord1
# [1] 5
coord2
# [1] 4
I am wondering if there is an efficient way to get these values that doesn't involve converting the dist object to a matrix or looping through the matrix?
A distance matrix is a lower triangular matrix in packed format, where the lower triangular is stored as a 1D vector by column. You can check this via
str(distMatrix)
# Class 'dist' atomic [1:10] 1 4 10 15 3 9 14 6 11 5
# ...
Even if we call dist(vec1, diag = TRUE, upper = TRUE), the vector is still the same; only the printing style changes. That is, no matter how you call dist, you always get a vector.
This answer focuses on how to transform between 1D and 2D indices, so that you can work with a "dist" object without first making it a complete matrix using as.matrix. If you do want to make it a matrix, use the dist2mat function defined in "as.matrix on a distance object is extremely slow; how to make it faster?".
R functions
It is easy to write vectorized R functions for those index transforms. We only need some care in dealing with "out-of-bound" indices, for which NA should be returned.
## 2D index to 1D index
f <- function (i, j, dist_obj) {
  if (!inherits(dist_obj, "dist")) stop("please provide a 'dist' object")
  n <- attr(dist_obj, "Size")
  valid <- (i >= 1) & (j >= 1) & (i > j) & (i <= n) & (j <= n)
  k <- (2 * n - j) * (j - 1) / 2 + (i - j)
  k[!valid] <- NA_real_
  k
}
## 1D index to 2D index
finv <- function (k, dist_obj) {
  if (!inherits(dist_obj, "dist")) stop("please provide a 'dist' object")
  n <- attr(dist_obj, "Size")
  valid <- (k >= 1) & (k <= n * (n - 1) / 2)
  k_valid <- k[valid]
  j <- rep.int(NA_real_, length(k))
  j[valid] <- floor(((2 * n + 1) - sqrt((2 * n - 1) ^ 2 - 8 * (k_valid - 1))) / 2)
  i <- j + k - (2 * n - j) * (j - 1) / 2
  cbind(i, j)
}
These functions are extremely cheap in memory usage, as they work with indices instead of matrices.
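The 2D-to-1D formula can be sanity-checked against as.matrix on a small example. The sketch below is an illustration only (the variable names are assumptions, and as.matrix is used purely for verification, not as part of the method): it computes k for every lower-triangular position and confirms that the packed vector holds the same values.

```r
vec1 <- c(2, 3, 6, 12, 17)
d <- dist(vec1)
n <- attr(d, "Size")

# Full matrix, built only to verify the index formula
m <- as.matrix(d)
ij <- which(lower.tri(m), arr.ind = TRUE)
i <- ij[, "row"]
j <- ij[, "col"]

# 2D index (i, j) to 1D index k, as in f() above
k <- (2 * n - j) * (j - 1) / 2 + (i - j)

# The packed vector and the full matrix should agree element by element
all_match <- all(as.vector(d)[k] == m[ij])
all_match
```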
Applying finv to your question
You can use
vec1 <- c(2,3,6,12,17)
distMatrix <- dist(vec1)
finv(which(distMatrix == 5), distMatrix)
# i j
#[1,] 5 4
Generally speaking, a distance matrix contains floating point numbers. It is risky to use == to judge whether two floating point numbers are equal. Read Why are these numbers not equal? for more and possible strategies.
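One possible strategy is to match within a tolerance instead of using ==. The sketch below assumes an absolute tolerance of 1e-8 (the name tol and its value are assumptions; choose a value suited to the scale of your distances).

```r
vec1 <- c(2, 3, 6, 12, 17)
distMatrix <- dist(vec1)

tol <- 1e-8  # assumed absolute tolerance
# 1D positions in the packed vector whose distance is within tol of 5
k <- which(abs(as.vector(distMatrix) - 5) < tol)
k
```

The resulting k can be passed to finv(k, distMatrix) in place of which(distMatrix == 5).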
Alternative with dist2mat
Using the dist2mat function given in as.matrix on a distance object is extremely slow; how to make it faster?, we may use which(, arr.ind = TRUE).
library(Rcpp)
sourceCpp("dist2mat.cpp")
mat <- dist2mat(distMatrix, 128)
which(mat == 5, arr.ind = TRUE)
# row col
#5 5 4
#4 4 5
Appendix: Markdown (needs MathJax support) for the picture
## 2D index to 1D index
The lower triangular looks like this: $$\begin{pmatrix} 0 & 0 & \cdots & 0\\ \times & 0 & \cdots & 0\\ \times & \times & \cdots & 0\\ \vdots & \vdots & \ddots & 0\\ \times & \times & \cdots & 0\end{pmatrix}$$ If the matrix is $n \times n$, then there are $(n - 1)$ elements ("$\times$") in the 1st column, and $(n - j)$ elements in the $j$<sup>th</sup> column. Thus, for element $(i,\ j)$ (with $i > j$, $j < n$) in the lower triangular, there are $$(n - 1) + \cdots + (n - (j - 1)) = \frac{(2n - j)(j - 1)}{2}$$ "$\times$" in the previous $(j - 1)$ columns, and it is the $(i - j)$<sup>th</sup> "$\times$" in the $j$<sup>th</sup> column. So it is the $$\left\{\frac{(2n - j)(j - 1)}{2} + (i - j)\right\}^{\textit{th}}$$ "$\times$" in the lower triangular.
----
## 1D index to 2D index
Now for the $k$<sup>th</sup> "$\times$" in the lower triangular, how can we find its matrix index $(i,\ j)$? We take two steps: 1> find $j$; 2> obtain $i$ from $k$ and $j$.
The first "$\times$" of the $j$<sup>th</sup> column, i.e., $(j + 1,\ j)$, is the $\left\{\frac{(2n - j)(j - 1)}{2} + 1\right\}^{\textit{th}}$ "$\times$" of the lower triangular, thus $j$ is the maximum value such that $\frac{(2n - j)(j - 1)}{2} + 1 \leq k$. This is equivalent to finding the max $j$ so that $$j^2 - (2n + 1)j + 2(k + n - 1) \geq 0.$$ The LHS is a quadratic polynomial, and it is easy to see that the solution is the integer no larger than its first root (i.e., the root on the left side): $$j = \left\lfloor\frac{(2n + 1) - \sqrt{(2n-1)^2 - 8(k-1)}}{2}\right\rfloor.$$ Then $i$ can be obtained from $$i = j + k - \left\{\frac{(2n - j)(j - 1)}{2}\right\}.$$
If the vector is not too large, the best way is probably to wrap the output of dist into as.matrix and to use which with the option arr.ind=TRUE. The only disadvantage of this standard method to retrieve the index numbers within a dist matrix is an increase of memory usage, which may become important in the case of very large vectors passed to dist. This is because the conversion of the lower triangular matrix returned by dist into a regular, dense matrix effectively doubles the amount of stored data.
An alternative consists in converting the dist object into a list, such that each column in the lower triangular matrix of dist represents one member of the list. The index number of the list members and the position of the elements within the list members can then be mapped to the column and row number of the dense N x N matrix, without generating the matrix.
Here is one possible implementation of this list-based approach:
distToList <- function(x) {
idx <- sum(seq(length(x) - 1)) - rev(cumsum(seq(length(x) - 1))) + 1
listDist <- unname(split(dist(x), cumsum(seq_along(dist(x)) %in% idx)))
# http://stackoverflow.com/a/16358095/4770166
}
findDistPairs <- function(vec, theDist) {
listDist <- distToList(vec)
inList <- lapply(listDist, is.element, theDist)
matchedCols <- which(sapply(inList, sum) > 0)
if (length(matchedCols) > 0) found <- TRUE else found <- FALSE
if (found) {
matchedRows <- lapply(matchedCols, function(x) which(inList[[x]]) + x )
} else {matchedRows <- integer(length = 0)}
matches <- cbind(col=rep(matchedCols, sapply(matchedRows,length)),
row=unlist(matchedRows))
return(matches)
}
vec1 <- c(2, 3, 6, 12, 17)
findDistPairs(vec1, 5)
# col row
#[1,] 4 5
The parts of the code that might be somewhat unclear concern the mapping of the position of an entry within the list to a column / row value of the N x N matrix. While not trivial, these transformations are straightforward.
In a comment within the code I have pointed out an answer on StackOverflow which has been used here to split a vector into a list. The loops (sapply, lapply) should be unproblematic in terms of performance since their range is of order O(N). The memory usage of this code is largely determined by the storage of the list. This amount of memory should be similar to that of the dist object since both objects contain the same data.
The dist object is calculated and transformed into a list in the function distToList(). Because of the dist calculation, which is required in any case, this function could be time-consuming in the case of large vectors. If the goal is to find several pairs with different distance values, then it may be better to calculate listDist only once for a given vector and to store the resulting list, e.g., in the global environment.
Long story short
The usual way to treat such problems is simple and fast:
distMatrix <- as.matrix(dist(vec1)) * lower.tri(diag(vec1))
which(distMatrix == 5, arr.ind = TRUE)
# row col
#5 5 4
I suggest using this method by default. More complicated solutions may become necessary in situations where memory limits are reached, i.e., in the case of very large vectors vec1. The list-based approach described above could then provide a remedy.

Is there a general algorithm to identify a numeric series?

I am looking for a general purpose algorithm to identify short numeric series from lists with a max length of a few hundred numbers. This will be used to identify series of masses from mass spectrometry (ms1) data.
For instance, given the following list, I would like to identify that 3 of these numbers fit the series N, N + 1, N + 2, etc.
426.24 <= N
427.24 <= N + 1/x
371.10
428.24 <= N + 2/x
851.47
451.16
The series are all of the format: N, N+1/x, N+2/x, N+3/x, N+4/x, etc, where x is an integer (in the example x=1). I think this constraint makes the problem very tractable. Any suggestions for a quick/efficient way to tackle this in R?
This routine generates series using x from 1 to 10 (you could increase the range) and checks how many values of each series are contained in the original list of numbers.
N = c(426.24,427.24,371.1,428.24,851.47,451.16)
N0 = N[1]
x = list(1,2,3,4,5,6,7,8,9,10)
L = 20
Series = lapply(x, function(x){seq(from = N0, by = 1/x,length.out = L)})
countCoincidences = lapply(Series, function(x){sum(x %in% N)})
Result:
unlist(countCoincidences)
[1] 3 3 3 3 3 3 3 3 3 2
As you can see, using x = 1 will have 3 coincidences. The same goes for all x until x=9. Here you have to decide which x is the one you want.
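Because the masses are floating-point numbers, exact matching with %in% can miss values that differ only by rounding error. A hedged variant of the routine above compares within a tolerance instead; the tolerance tol = 1e-6 and the helper name count_hits are assumptions for illustration, and the data are the six masses from the question.

```r
N <- c(426.24, 427.24, 371.10, 428.24, 851.47, 451.16)
N0 <- N[1]
tol <- 1e-6  # assumed matching tolerance

# Count how many observed masses lie within tol of a series term
count_hits <- function(x, L = 20) {
  series <- N0 + (0:(L - 1)) / x
  sum(sapply(N, function(v) any(abs(series - v) < tol)))
}

hits <- sapply(1:10, count_hits)
hits
```

As with the exact version, several values of x reach the same count, so a further criterion is still needed to pick one.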
Since you're looking for an arithmetic sequence, the difference k is constant. Thus, you can loop over the vector and subtract each value from the sequence. If you have a sequence, subtracting the second term from the vector will result in values of -k, 0, and k, so you can find the sequence by looking for matches between vector - value and its opposite, value - vector:
x <- c(426.24, 427.24, 371.1, 428.24, 851.47, 451.16)
unique(lapply(x, function(y){
s <- (x - y) %in% (y - x);
if(sum(s) > 1){x[s]}
}))
# [[1]]
# NULL
#
# [[2]]
# [1] 426.24 427.24 428.24

How many values of a vector are divisible by 2? Use R

I have an exercise where I have to see how many values of a vector are divisible by 2. I have this random sample:
set.seed(1)
y <- sample(c(0:99, NA), 400, replace=TRUE)
I created a new variable d to see which of the values are or aren't divisible by 2:
d <- y/2 ; d
What I want to do is to create a logical vector where all whole numbers give TRUE and the rest give FALSE (e.g. 22.0 -> TRUE and 24.5 -> FALSE).
I used this command, but I believe that the answer is wrong, since it only counts the values of d that also appear in the sample:
sum(d %in% y, na.rm=T)
I also tried this (I found on the internet, but I don't really understand it)
is.wholenumber <- function(x, tol = .Machine$double.eps^0.5) abs(x - round(x)) < tol
sum(is.wholenumber(d),na.rm = T)
Are there other ways that I could use the operator "%%"?
You can sum over the mod operator like so: sum(1 - y %% 2, na.rm = TRUE) or sum(y %% 2 == 0, na.rm = TRUE). Note that y %% 2 is the remainder after dividing by two, which is why this solution works; na.rm = TRUE is needed because y contains NA values.
Here are three different ways:
length(y[!is.na(y) & y %% 2 == 0])  # exclude NA, which would otherwise be counted
length(subset(y, y %% 2 == 0))
length(Filter(function(x) x %% 2 == 0, y))
Since we're talking about a division by 2, I would actually take it to the bit level and check if the last bit of the number is a 0 or a 1 (a 0 means it would be divisible by 2).
Going out on a limb here (I am not sure how the interpreter handles this division by 2), but I think that would likely be more optimized than a division, which is typically fairly expensive.
To do this at the bit level, you can just do an AND operation between the number itself and 1; if the result is 1, the number is not divisible by 2:
bitwAnd(y, 1L)  # 0 where y is even, 1 where y is odd
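To turn the bit test into a count, the result can be compared with 0 and summed. The sketch below uses the sample from the question, drops the NA values first, and checks that the bitwise count agrees with the %% count; the variable names y_ok, n_even_mod and n_even_bit are illustrative.

```r
set.seed(1)
y <- sample(c(0:99, NA), 400, replace = TRUE)

# Drop missing values before counting
y_ok <- y[!is.na(y)]

# Count even values with the remainder operator
n_even_mod <- sum(y_ok %% 2 == 0)

# Count even values by testing the lowest bit (bitwAnd needs integer input)
n_even_bit <- sum(bitwAnd(y_ok, 1L) == 0L)

n_even_mod == n_even_bit
```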

fill up a matrix one random cell at a time

I am filling a 10x10 matrix (mat) randomly until sum(mat) == 100.
I wrote the following (i starts at 2 for another reason not specified here, but I kept it at 2 to be consistent with my actual code):
mat <- matrix(rep(0, 100), nrow = 10)
mat[1,] <- c(0,0,0,0,0,0,0,0,0,1)
mat[2,] <- c(0,0,0,0,0,0,0,0,1,0)
mat[3,] <- c(0,0,0,0,0,0,0,1,0,0)
mat[4,] <- c(0,0,0,0,0,0,1,0,0,0)
mat[5,] <- c(0,0,0,0,0,1,0,0,0,0)
mat[6,] <- c(0,0,0,0,1,0,0,0,0,0)
mat[7,] <- c(0,0,0,1,0,0,0,0,0,0)
mat[8,] <- c(0,0,1,0,0,0,0,0,0,0)
mat[9,] <- c(0,1,0,0,0,0,0,0,0,0)
mat[10,] <- c(1,0,0,0,0,0,0,0,0,0)
i <- 2
set.seed(129)
while( sum(mat) < 100 ) {
# pick random cell
rnum <- sample( which(mat < 1), 1 )
mat[rnum] <- 1
##
print(paste0("i =", i))
print(paste0("rnum =", rnum))
print(sum(mat))
i = i + 1
}
For some reason, when sum(mat) == 99 there are several extra steps. I would assume that once i = 91 the while loop would stop, but it continues past this. Can someone explain what I have done wrong?
If I change the while condition to
while( sum(mat) < 100 & length(which(mat < 1)) > 0 )
the issue remains..
Your problem is equivalent to randomly ordering the indices of a matrix that are equal to 0. You can do this in one line with sample(which(mat < 1)). I suppose if you wanted to get exactly the same sort of output, you might try something like:
set.seed(144)
idx <- sample(which(mat < 1))
for (i in seq_along(idx)) {
print(paste0("i =", i))
print(paste0("rnum =", idx[i]))
print(sum(mat)+i)
}
# [1] "i =1"
# [1] "rnum =5"
# [1] 11
# [1] "i =2"
# [1] "rnum =70"
# [1] 12
# ...
See ?sample
Arguments:
x: Either a vector of one or more elements from which to choose,
or a positive integer. See ‘Details.’
...
If ‘x’ has length 1, is numeric (in the sense of ‘is.numeric’) and
‘x >= 1’, sampling _via_ ‘sample’ takes place from ‘1:x’. _Note_
that this convenience feature may lead to undesired behaviour when
‘x’ is of varying length in calls such as ‘sample(x)’. See the
examples.
In other words, if x in sample(x) is of length 1, sample returns a random number from 1:x. This happens towards the end of your loop, where there is just one 0 left in your matrix and one index is returned by which(mat < 1).
The iteration repeats at a sum of 99 because sample() behaves very differently when its first argument is a vector of length 1 than when it has length greater than 1. When it has length 1 (and is a number >= 1), sample assumes you want a random number from 1 to that number. When it has length > 1, you get a random element from that vector.
Compare
sample(c(99,100),1)
and
sample(c(100),1)
Of course, this is an inefficient way of filling your matrix. As @josilber pointed out, a single call to sample could do everything you need.
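One way to keep the original loop while avoiding this pitfall is the resample helper suggested in the examples of ?sample, which always samples from the elements of x, even when x has length 1. A minimal sketch, rebuilding the anti-diagonal starting matrix from the question (the counter steps is added for illustration):

```r
# Helper from the examples in ?sample: index into x instead of sampling x itself
resample <- function(x, ...) x[sample.int(length(x), ...)]

# Same starting matrix as in the question: ones on the anti-diagonal
mat <- matrix(0, nrow = 10, ncol = 10)
mat[cbind(1:10, 10:1)] <- 1

set.seed(129)
steps <- 0
while (sum(mat) < 100) {
  rnum <- resample(which(mat < 1), 1)  # safe even when one zero remains
  mat[rnum] <- 1
  steps <- steps + 1
}
steps
```

Since every draw now hits a cell that is still 0, the loop fills the 90 remaining cells in exactly 90 steps.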
The issue comes from how sample and which do the sampling when you have only a single '0' value left.
For example, do this:
mat <- matrix(rep(1, 100), nrow = 10)
Now you have a matrix of all 1's. Now let's make two of the entries 0:
mat[15]<-0
mat[18]<-0
and then sample
sample(which(mat<1))
[1] 18 15
By adding a size=1 argument you get one or the other.
Now let's try this:
mat[18]<-1
sample(which(mat<1))
[1] 3 13 8 2 4 14 11 9 10 5 15 7 1 12 6
Oops, you did not get [1] 15. Instead, what happens is that only a single integer (15 in this case) is passed to sample. When you do sample(x) and x is a single integer, it gives you the integers 1:x in random order.

Resources