Multiplying the row minima of one matrix by the row maxima of another - r

I have two matrices as follows in R:
M<-matrix(c(1,4,1,3,1,4,2,3,1,2,1,2),3)
1 3 2 2
4 1 3 1
1 4 1 2
N<-matrix(c(1,1,2,2,3,4,-2,2,1,4,3,-1),3)
1 2 -2 4
1 3 2 3
2 4 1 -1
I want to find a vector (a 1*3 matrix) whose elements are the products of the minimum of each row of M and the maximum of the corresponding row of N. For example, the first element of the vector is the minimum of the first row of M, which is 1, multiplied by the maximum of the first row of N, which is 4, so the first element is 1*4 = 4.
The final answer is: (1*4, 1*3, 1*4) = (4, 3, 4)
To find this vector (or matrix) I have written the below code:
c(min(M[1,])*max(N[1,]),min(M[2,])*max(N[2,]),min(M[3,])*max(N[3,]))
But it is rather long. Could anyone suggest shorter (or simpler, or easier) code?

apply(M, 1, min) * apply(N, 1, max)

The most straightforward way to tackle this, and maybe also the most readable, is to use apply (as already suggested by @Jan):
apply(M, 1, min) * apply(N, 1, max)
However, if you have a lot of data, the apply approach, which loops over the rows one at a time, can be slow. A faster way is to use built-in vectorized functions that operate on all the rows at once.
The R max.col(m) function returns the index of the column with the highest value in each row of a matrix m. There isn't a min.col(m) function, but you can get the same result by using max.col(-m).
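For illustration, on the matrices from the question (max.col breaks ties at random, so the column index it returns may vary between runs, but the value extracted with cbind() indexing is the same either way):
N[cbind(seq_len(nrow(N)), max.col(N))]   # row maxima of N: 4 3 4
M[cbind(seq_len(nrow(M)), max.col(-M))]  # row minima of M: 1 1 1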
So, the vectorized approach is:
M_min_of_each_row=M[cbind(seq_len(nrow(M)),max.col(-M))]
N_max_of_each_row=N[cbind(seq_len(nrow(N)),max.col(N))]
answer=M_min_of_each_row*N_max_of_each_row
How much faster is this for a big matrix? We can use microbenchmark to test:
using_apply=function(M,N) apply(M,1,min)*apply(N,1,max)
using_maxcol=function(M,N) M[cbind(seq_len(nrow(M)),max.col(-M))]*N[cbind(seq_len(nrow(N)),max.col(N))]
library(microbenchmark)
M=matrix(sample(1:100,40000,replace=T),ncol=4);N=matrix(sample(1:100,40000,replace=T),ncol=4)
microbenchmark(using_apply(M,N),using_maxcol(M,N))
# Unit: milliseconds
# expr min lq mean median uq max neval
# using_apply(M, N) 25.319694 28.411979 31.762766 30.829093 33.789692 71.893174 100
# using_maxcol(M, N) 1.608357 1.876968 2.117926 2.042053 2.270023 4.858531 100
# check that the results are the same:
all(using_apply(M,N)==using_maxcol(M,N))
# [1] TRUE
So: the vectorized approach is about 15x faster. But, of course, you might consider that the apply approach is good enough, and that it's more concise and (arguably) more readable...

Related

Extraction speed in Matrix package is very slow compared to regular matrix class

This is an example comparing row extraction from large matrices, sparse and dense, using the Matrix package versus the regular R base matrix class.
For dense matrices, row extraction is almost 395 times faster with the base matrix class:
library(Matrix)
library(microbenchmark)
## row extraction in dense matrices
D1<-matrix(rnorm(2000^2), 2000, 2000)
D2<-Matrix(D1)
> microbenchmark(D1[1,], D2[1,])
Unit: microseconds
expr min lq mean median uq max neval
D1[1, ] 14.437 15.9205 31.72903 31.4835 46.907 75.101 100
D2[1, ] 5730.730 5744.0130 5905.11338 5777.3570 5851.083 7447.118 100
For sparse matrices the base matrix class is again faster, by a factor of almost 63.
## row extraction in sparse matrices
S1<-matrix(1*(runif(2000^2)<0.1), 2000, 2000)
S2<-Matrix(S1, sparse = TRUE)
microbenchmark(S1[1,], S2[1,])
Unit: microseconds
expr min lq mean median uq max neval
S1[1, ] 15.225 16.417 28.15698 17.7655 42.9905 45.692 100
S2[1, ] 1652.362 1670.507 1771.51695 1774.1180 1787.0410 5241.863 100
Why the speed discrepancy, and is there a way to speed up extraction in the Matrix package?
I don't know exactly what the trouble is, possibly S4 dispatch (which could potentially be a big piece of a small call like this). I was able to get performance equivalent to matrix (which has a pretty easy job, indexing + accessing a contiguous chunk of memory) by (1) switching to a row-major format and (2) writing my own special-purpose accessor function. I don't know exactly what you want to do or if it will be worth the trouble ...
Set up example:
set.seed(101)
S1 <- matrix(1*(runif(2000^2)<0.1), 2000, 2000)
Convert to column-major (dgCMatrix) and row-major (dgRMatrix) forms:
library(Matrix)
S2C <- Matrix(S1, sparse = TRUE)
S2R <- as(S1,"dgRMatrix")
Custom accessor:
my_row_extract <- function(m, i = 1) {
  r <- numeric(ncol(m))  ## set up zero vector for results
  ## suggested by @OttToomet, handles empty rows
  inds <- seq(from = m@p[i] + 1,
              to = m@p[i + 1], length.out = max(0, m@p[i + 1] - m@p[i]))
  r[m@j[inds] + 1] <- m@x[inds]  ## set values
  return(r)
}
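To see what the accessor relies on, here is a minimal illustration of the dgRMatrix slots, assuming the same coercion to "dgRMatrix" as used above: @p holds the cumulative count of stored values per row, @j the 0-based column indices, and @x the stored values themselves.
m_small <- as(matrix(c(0, 2, 0,
                       3, 0, 4), nrow = 2, byrow = TRUE), "dgRMatrix")
m_small@p                    # 0 1 3: row 1 owns entry 1 of @x, row 2 owns entries 2..3
m_small@j                    # 1 0 2: 0-based column indices of the stored values
m_small@x                    # 2 3 4: the stored values
my_row_extract(m_small, 2)   # 3 0 4, matching m_small[2, ]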
Check equality of results across methods (all TRUE):
all.equal(S2C[1,],S1[1,])
all.equal(S2C[1,],S2R[1,])
all.equal(my_row_extract(S2R,1),S2R[1,])
all.equal(my_row_extract(S2R,17),S2R[17,])
Benchmark:
library(rbenchmark)
benchmark(S1[1,], S2C[1,], S2R[1,], my_row_extract(S2R,1),
          columns=c("test","elapsed","relative"))
## test elapsed relative
## 4 my_row_extract(S2R, 1) 0.015 1.154
## 1 S1[1, ] 0.013 1.000
## 2 S2C[1, ] 0.563 43.308
## 3 S2R[1, ] 4.113 316.385
The special-purpose extractor is competitive with base matrices. S2R is super-slow, even for row extraction (surprisingly); however, ?"dgRMatrix-class" does say
Note: The column-oriented sparse classes, e.g., ‘dgCMatrix’, are preferred and better supported in the ‘Matrix’ package.

how to find consecutive composite numbers in R

I want the first 'n' consecutive composite numbers.
I searched for a command to find consecutive composite numbers, but I only found results proving the related theorem. I didn't find any command for that. Please help me solve this problem in R.
Here is another option:
n_composite <- function(n) {
  s <- 4L
  i <- 1L
  vec <- numeric(n)
  while(i <= n) {
    if(any(s %% 2:(s-1) == 0L)) {
      vec[i] <- s
      i <- i + 1L
    }
    s <- s + 1L
  }
  vec
}
It uses basic control flow to cycle through the positive integers, collecting the composites as it goes.
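For example, the first ten composites it returns should be:
n_composite(10)
# [1]  4  6  8  9 10 12 14 15 16 18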
Benchmark:
all.equal(find_N_composites(1e4), n_composite(1e4))
[1] TRUE
library(microbenchmark)
microbenchmark(
Mak = find_N_composites(1e4),
plafort = n_composite(1e4),
times=5
)
Unit: milliseconds
    expr       min        lq      mean    median        uq       max neval cld
     Mak 2304.8671 2347.9768 2397.0620 2376.4306 2475.2368 2480.7988     5     b
 plafort  508.8132  509.3055  522.1436  509.3608  530.4311  552.8076     5    a
The code of @Pierre Lafortune is neat and not too slow, but I'd like to propose another approach which is substantially faster.
Tackling the problem from another perspective, finding the first n composite numbers in R can be translated to "get the first n+k integers and remove the primes". This is fast because generating the sequence 1:(n+k) takes almost no time and there are very sophisticated algorithms to find primes available, one implementation being numbers::Primes().
The sequence needs to end with n+k because within the first n integers there will be some number (k1) of primes that need to be replaced. Note that the range (n+1):(n+k1) might in turn contain k2 primes, which also need to be replaced, and so on. This calls for a recursive structure.
Pierre's answer basically does something similar: it iteratively checks whether an integer is a composite number (non-prime) and continues until enough composites are found. However, this has one drawback: the algorithm used to find (non-)primes is rather naive compared to other prime-finding algorithms (no offense intended). On the other hand, that solution doesn't run into the recursive problem of possible primes in any range of integers mentioned above.
The recursive solution I'd like to suggest is the following:
library(numbers)
n_composite2 <- function(n, from = 2) {
  endRange <- from + n - 1
  numbers <- seq(from = from, to = endRange)
  primes <- Primes(n1 = from, n2 = endRange)
  composites <- numbers[!(numbers %in% primes)]
  nPrimes <- length(primes)
  if (nPrimes >= 1) return(c(composites, n_composite2(nPrimes, from = endRange + 1)))
  return(composites)
}
This generates a sequence of integers (potential composites), then uses numbers::Primes() to find the primes in that range and removes them from the sequence. If some numbers have been removed, the function calls itself again, this time computing [number of primes in previous step] composites and starting the sequence from where the previous step stopped.
If there are doubts whether this actually works, here the check against Pierre's solution (n_composite()):
> all(n_composite(1e4) == n_composite2(1e4))
[1] TRUE
Comparing both functions, n_composite2() is approximately 19 times faster:
library(microbenchmark)
microbenchmark(
"n_composite2" = n_composite2(1e4),
"n_composite" = n_composite(1e4),
times=5
)
Unit: milliseconds
expr min lq mean median uq max neval
n_composite2 34.44039 34.51352 35.10659 34.71281 35.21145 36.65476 5
n_composite 642.34106 661.15725 666.02819 662.99657 671.52093 692.12512 5
As a final remark: there are many solutions "between" Pierre's approach and the solution presented here. One could use numbers::Primes() in a while loop, very similar to what happens in n_composite(). One could also start with a "sufficiently long" sequence of integers, remove the primes, and then take the first n remaining numbers. To be efficient, this approach requires a good approximation of the number of primes in a given range, which is also not trivial (for low numbers).
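As a rough sketch of that last idea (the over-allocation factor of 1.5 is only a guess; the while loop grows the range if the guess turns out to be too small):
n_composite3 <- function(n) {
  upper <- max(10, ceiling(1.5 * n))                        # crude over-allocation
  candidates <- setdiff(2:upper, Primes(n1 = 2, n2 = upper))
  while (length(candidates) < n) {                          # grow the range if needed
    upper <- 2 * upper
    candidates <- setdiff(2:upper, Primes(n1 = 2, n2 = upper))
  }
  candidates[seq_len(n)]
}
all(n_composite3(1e4) == n_composite2(1e4))                 # should be TRUE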
That is indeed a lazy way of asking a question, but nevertheless, this should do it:
is_composite <- function(x){
  sapply(x, function(y) if(y < 3){FALSE} else {any(y %% (2:(y-1)) == 0)})
}
which(is_composite(1:100))

find_N_composites <- function(N){
  which(is_composite(1:(2*N+2)))[1:N]
}
find_N_composites(10)

system.time({
  x <- find_N_composites(1e+04)
})
The idea is to check, for each number in turn, whether it has any divisors other than 1 and itself. The function I provided finds the first 10,000 composite numbers in about 2 seconds. If you want greater speed on large numbers, it will be better to optimize it, for example by looking for divisors only among prime numbers.
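A different but equally simple optimization is to test divisors only up to sqrt(y); a minimal sketch (is_composite_fast is just an illustrative name):
is_composite_fast <- function(x){
  sapply(x, function(y) if(y < 4){FALSE} else {any(y %% (2:floor(sqrt(y))) == 0)})
}
identical(which(is_composite_fast(1:100)), which(is_composite(1:100)))
# should be TRUE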

row-wise differences between two large matrices in R

I would like to ask an opinion on how to speed up the following operation.
I have two matrices, say A and B, each with n rows and 3 columns. For every row vector of A I want to compute its difference with every row vector of B, i.e. the pairwise differences between all row vectors of the two matrices; the resulting matrix is then n*n. I then want to apply a function to each of its elements, the biharm() function in the example below. For small matrices this is no problem, but I need to apply the operation to very large matrices, such as 1000*3. In the sigm() function, which I wrote to do this, I first initialize the result matrix and then fill it with two nested for loops. However, this is slow for large matrices. Does anyone have an idea how to speed this up? I think apply() could be used, but I cannot figure out the correct way. Below is a fully reproducible example. Thanks in advance for any advice. Best, Paolo.
biharm <- function(vec1, vec2){
  reso <- norm(as.matrix(vec1) - as.matrix(vec2), type = "F")^2 * log(norm((as.matrix(vec1) - as.matrix(vec2)), type = "F"))
  reso
}

sigm <- function(mat1, mat2 = NULL){
  tt <- mat1
  if(is.null(mat2)){yy <- mat1} else {yy <- mat2}
  k <- nrow(yy)
  m <- ncol(yy)
  SGMr <- matrix(rep(0, k^2), ncol = k)
  for(i in 1:k){
    for(j in 1:k){
      SGMr[i, j] <- biharm(yy[i, ], tt[j, ])
    }
  }
  SGMr <- replace(SGMr, which(SGMr == "NaN", arr.ind = T), 0)
  return(SGMr)
}
### small matrices example:
A<-matrix(rnorm(30),ncol=3)
B<-matrix(rnorm(30),ncol=3)
sigm(A,B)
### large matrices example:
A<-matrix(rnorm(900),ncol=3)
B<-matrix(rnorm(900),ncol=3)
sigm(A,B)
This is about 8 times faster on my system.
biharm.new <- function(vec1, vec2){
  n <- sqrt(sum((vec1 - vec2)^2))
  n^2 * log(n)
}

sigm.new <- function(mat1, mat2 = NULL){
  tt <- mat1
  if(is.null(mat2)){yy <- mat1} else {yy <- mat2}
  SGMr <- apply(tt, 1, function(t) apply(yy, 1, biharm.new, t))
  replace(SGMr, which(SGMr == "NaN", arr.ind = T), 0)
}
### large matrices example:
set.seed(1)
A<-matrix(rnorm(900),ncol=3)
B<-matrix(rnorm(900),ncol=3)
system.time(result.1<-sigm(A,B))
# user system elapsed
# 6.13 0.00 6.13
system.time(result.2<-sigm.new(A,B))
# user system elapsed
# 0.81 0.00 0.81
all.equal(result.1,result.2)
# [1] TRUE
The use of apply(...) results in about a 3-fold improvement. The rest comes from optimizing biharm(...) - since you are calling it 90,000 times in this example, it pays to make it as efficient as possible.
Note that the Frobenius norm is just the Euclidean norm, so if that is what you really want, use sqrt(sum(x^2)) rather than converting to matrices and using norm(...). The former is much faster.
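A quick check of that equivalence:
x <- c(1, -2, 3)
norm(as.matrix(x), type = "F")   # 3.741657
sqrt(sum(x^2))                   # 3.741657, i.e. sqrt(14)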
How about this:
set.seed(1)
foo<-matrix(runif(30),nc=3)
bar<-matrix(runif(30),nc=3)
sapply(1:10,function(j) sapply(1:10,function(k) biharm(bar[k,],foo[j,])) )
EDIT -- basically same as jhoward's "sigm.new" without the error checking. Clearly biharm.new is a winner.
microbenchmark(carl(foo,bar),jhoward(foo,bar),times=3)
Unit: milliseconds
expr min lq median uq max neval
carl(foo, bar) 5846.8273 6071.364 6295.8999 6322.425 6348.951 3
jhoward(foo, bar) 891.5734 934.550 977.5267 1008.388 1039.248 3

Repeating or looping an argument

I am quite familiar with R as I've been using it for a few years now. Unfortunately, I am not very well versed in creating functions that involve looping or repetition of an equation. The problem goes as follows:
I have a vector containing over 1000 values. I would like to calculate the absolute difference between the means of two adjacent blocks of equal size taken from that vector.
Here is an an example.
I have the vector (vec) of length 8
[1] 0.12472963 1.15341289 -1.09662288 -0.73241639 0.06437658 -0.13647136 -1.52592048 1.46450084
I would like to calculate the mean of the first 2 values (0.12472963, 1.15341289) and obtain the absolute difference from the mean of the following 2 values (-1.09662288, -0.73241639), and thereafter work my way down the vector.
In this case, I can easily use the following equation:
abs(mean(vec[1:2])-mean(vec[3:4]))
and incrementally increase each index by 1 so as to work my way down manually to the end of the vector. I would obtain the following vector:
[1] 1.553591 0.3624149 0.8784722 0.497176 0.005337574
What I would like, however, is an automated routine that lets me do this over long vectors and change the number of values from which the means are calculated.
It appears to me that it should be relatively simple, but I do not know where to start.
Use filter:
c(abs(filter(vec, c(0.5, 0.5, -0.5, -0.5), sides=1)[-(1:3)]))
#[1] 1.55359090 0.36241491 0.87847224 0.49717601 0.00533757
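Since the question also asks to vary the number of values per mean, the filter approach generalizes to a window of n values; a minimal sketch (abs_diff_means is just an illustrative helper name):
abs_diff_means <- function(vec, n = 2) {
  ## +1/n weights for the most recent n values, -1/n for the n values before them
  w <- c(rep(1/n, n), rep(-1/n, n))
  res <- stats::filter(vec, w, sides = 1)
  abs(as.numeric(res[-(1:(2*n - 1))]))   # drop the leading NAs
}
abs_diff_means(vec, 2)   # matches the five values shown above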
Using rollapply from zoo
library(zoo)
n <- 2
n1 <- length(vec)
abs(rollapply(vec[1:(n1-n)], 2, mean)-rollapply(vec[(n+1):n1], 2,mean))
#[1] 1.55359090 0.36241491 0.87847224 0.49717601 0.00533757
Also, other variations of the above code are (from comments by @G. Grothendieck, one of the authors of the zoo package):
abs(rollmean(vec[1:(n1-n)], 2) - rollmean(vec[(n+1):n1], 2)) # using rollmean instead of rollapply
or
rollapply(vec, 4, function(x) abs(mean(x[1:2]) - mean(x[3:4])))
or
abs(rollapply(vec, 4, "%*%", c(1, 1, -1, -1)/2))
As always, I chime in with:
vec<-rep(c( 0.12472963 , 1.15341289, -1.09662288, -0.73241639 , 0.06437658, -0.13647136 ,-1.52592048 , 1.46450084 ),100)
microbenchmark(roland(vec),akrun(vec),times=3)
Unit: microseconds
        expr      min        lq      mean   median       uq      max neval
 roland(vec)  564.128  565.2275  647.3353  566.327  688.939  811.551     3
  akrun(vec) 3717.410 3982.1535 4218.3057 4246.897 4468.753 4690.610     3

R: smallest distance between an element of vector a and an element of vector b

a and b are two vectors of real numbers.
They do not necessarily have the same length.
The distance between the ith element of a and the jth element of b is defined as abs(a[i] - b[j])
How would you compute the smallest distance between any element of a and any element of b without explicit loops?
Here is what I did: min(sapply(X=1:length(b), FUN=function(x) abs(a - b[x]))).
However, I have the feeling there is a better way to do this...
I'd use the dist function to create a distance matrix, and then find the minimum distance in that. This is probably much faster than an explicit loop in R (including sapply).
a = runif(23)
b = runif(10)
d_matrix = as.matrix(dist(cbind(a,b)))
d_matrix[d_matrix == 0] <- NA
sqrt(min(d_matrix, na.rm = TRUE))
Note that cbind recycles the smaller vector. So this function is probably not optimal, but for vectors that do not differ that much in size it is still much faster than an explicit loop.
And to find which pair of elements had this distance (although the recycling introduces some challenges here):
which(d_matrix == min(d_matrix, na.rm = TRUE), arr.ind = TRUE)
Here's an attempt:
a <- c(9,5,6); b <- c(6,9)
# a
#[1] 9 5 6
# b
#[1] 6 9
combos <- sapply(b,function(x) abs(x-a))
# or an alternative
combos <- abs(outer(a,b,FUN="-"))
You could then get the minimum distance with:
min(combos)
If you wanted to get the respective indexes of the minimum values you could do:
which(combos==min(combos),arr.ind=TRUE)
# each matrix row has the 2 indexes for the minimums
# first column is 'a' index, second is 'b' index
# row col
# [1,] 3 1
# [2,] 1 2
One-liner should work here: min(abs(outer(a, b, "-")))
