How can I replace a nested loop using lapply in R?

Good afternoon,
I have developed this R function that hashes data into buckets:
# The used packages
library("pacman")
pacman::p_load(dplyr, tidyr, devtools, MASS, pracma, mvtnorm, interval, intervals)
pacman::p_load(sprof, RDocumentation, helpRFunctions, foreach, philentropy, Rcpp, RcppAlgos)
hash <- function(v, p) {
  if (dot(v, p) > 0) return(1) else return(0)
}

LSH_Band <- function(data, K) {
  # We retrieve the numerical columns of data
  t <- list.df.var.types(data)
  df.r <- as.matrix(data[c(t$numeric, t$Intervals)])
  n <- nrow(df.r)
  # we create a K*K matrix using the normal law
  rn <- array(rnorm(K * K, 0, 1), c(K, K))
  # we create a K*K matrix of integers using the uniform law; integers are unique in each column
  rd <- unique.array(array(unique(ceiling(runif(K * K, 0, ncol(df.r)))), c(K, K)))
  buckets <- array(NA, c(K, n))
  for (i in 1:K) {
    for (j in 1:n) {
      buckets[i, j] <- hash(df.r[j, ][rd[, i]], rn[, i])
    }
  }
  return(buckets)
}
> df.r
age height salaire.1 salaire.2
1 27 180 0 5000
2 26 178 0 5000
3 30 190 7000 10000
4 31 185 7000 10000
5 31 187 7000 10000
6 38 160 10000 15000
7 39 158 10000 15000
> LSH_Band(df.r, 3 )
[,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,] 1 1 1 1 1 1 1
[2,] 1 1 0 0 0 0 0
[3,] 0 0 0 0 0 0 0
The dot function is the scalar product of two vectors.
My LSH function takes a row of my data, then takes part of that row using df.r[j,][rd[,i]]. df.r[j,] is the j-th row of the data.
rd[,i]: rd is a K*K matrix of integers between 1 and ncol(df.r); each column of the matrix contains only unique integers.
rn[,i]: rn is a K*K matrix that contains draws from the N(0,1) law.
In the resulting table, observations are represented in columns, and I will have K rows. For the last row, I compute the scalar product between df.r[j,][rd[,K]] and rn[,K], and I obtain 1 if the scalar product is positive. rd[,K] and rn[,K] are used only for the last row of the resulting table, for all observations in that row.
My question:
Is it possible to replace the loops over i and j with an lapply function?
My real data will be large, which is why I'm asking this question.
Thank you!

The following is a bit too long as a comment, so here are some pointers/issues/remarks:
First off, I have to say I struggle to understand what LSH_Band does. Perhaps some context would help here.
I don't understand the purpose of certain functions like helpRFunctions::list.df.var.types, which simply seems to return the column names of data in a list. Note also that t$Intervals returns NULL based on the sample data you give. So I'm not sure what's going on there.
I don't see the point of function pracma::dot either. The dot product between two vectors can be calculated in base R using %*%. There's really no need for an additional package.
Function hash can be written more compactly as
hash <- function(v, p) +(as.numeric(v %*% p) > 0)
This avoids the if conditional, which is slow.
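For example, on a couple of toy vectors (made up here just to illustrate the 0/1 output):
hash(c(1, 2, 3), c(0.5, -1, 2))  # 0.5 - 2 + 6 = 4.5 > 0, so returns 1
hash(c(1, 0, 0), c(-2, 1, 1))    # dot product is -2, so returns 0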
Notwithstanding my lack of understanding what it is you're trying to do, here are some tweaks to your code
hash <- function(v, p) +(as.numeric(v %*% p) > 0)

LSH_Band <- function(data, K, seed = NULL) {
  # We retrieve the numerical columns of data
  data <- as.matrix(data[sapply(data, is.numeric)])
  # we create a K*K matrix using the normal law
  if (!is.null(seed)) set.seed(seed)
  rn <- matrix(rnorm(K * K, 0, 1), nrow = K, ncol = K)
  # we create a K*K matrix of integers using the uniform law; integers are unique in each column
  rd <- sapply(seq_len(K), function(col) sample.int(ncol(data), K))
  buckets <- matrix(NA, nrow = K, ncol = nrow(data))
  for (i in 1:K) {
    buckets[i, ] <- apply(data, 1, function(row) hash(row[rd[, i]], rn[, i]))
  }
  buckets
}
Always add an option to use a reproducible seed when working with random numbers. That will make debugging a lot easier.
You can replace at least one for loop with apply (which, when using MARGIN = 1, iterates through the rows of a matrix or array).
I've removed all the unnecessary package dependencies, and replaced the functionality with base R functions.
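Since hash is just the sign of a dot product, you could even drop the inner apply as well: for each band i, all n dot products can be computed in a single matrix-vector product. A minimal sketch along the lines of the tweaked function above (same assumptions about the numeric columns; LSH_Band2 is just an illustrative name):
LSH_Band2 <- function(data, K, seed = NULL) {
  data <- as.matrix(data[sapply(data, is.numeric)])
  if (!is.null(seed)) set.seed(seed)
  rn <- matrix(rnorm(K * K), nrow = K, ncol = K)
  rd <- sapply(seq_len(K), function(col) sample.int(ncol(data), K))
  # data[, rd[, i]] %*% rn[, i] gives the dot products for all n observations at once
  t(sapply(seq_len(K), function(i) +(as.numeric(data[, rd[, i]] %*% rn[, i]) > 0)))
}
This returns the same K x n bucket matrix (up to the random draws) without any explicit loop over observations.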

Related

Efficiently change individual elements in matrix/array in R

I am running a simulation in R, which I am trying to make more efficient.
A little bit of background: this is an abstract simulation to test the effects of mutation on a population. The population has N individuals, and each individual has a genotype of M letters; each letter can be one of the twenty amino acids (which I denote as 0:19).
One of the most (computationally) expensive tasks involves taking a matrix "mat" with M rows and N columns, which initially starts as a matrix of all zeroes,
mat <- matrix(rep(0,M*N),nrow=M)
And then changing (mutating) at least one letter in the genotype of each individual. The reason I say "at least" is that I would ideally like to set a mutation rate (mutrate) such that, if I set it to 2 in my overall simulation function, it causes 2 mutations per individual in the matrix.
I found two rather computationally expensive ways to do so. As you can see below, only the second method incorporates the mutation rate parameter mutrate (I could not easily think of how to incorporate it into the first).
# method 1
for (i in 1:N) {
  position <- floor(runif(N, min = 0, max = M))
  letter <- floor(runif(N, min = 0, max = 19))
  mat[position[i], i] = letter[i]
}
# method 2, somewhat faster and incorporates mutation rate
mat <- apply(mat, 2, function(x) (x + sample(c(rep(0, M - mutrate), sample(0:19, size = mutrate)) %% 20)))
The second method incorporates a modulus because genotype values have to be between 0 and 19 as I mentioned.
A few additional notes for clarity:
I don't strictly need every individual to get exactly the same number of mutations. That being said, the distribution should be narrow enough that, if mutrate = 2, most individuals get two mutations, some one, and some maybe three; I don't want one individual getting a huge number of mutations while many individuals get none. Notably, some mutations will change a letter into the same letter, so for a large population size N the expected average number of mutations is slightly less than the assigned mutrate.
I believe the answer has something to do with the ability to use the square-bracket subsetting method to obtain one random element from every column of the matrix mat. However, I could not find any information about how to use the syntax to isolate one random element from every column of a matrix. mat[sample(1:M),sample(1:N)] obviously gives you the whole matrix... perhaps I am missing something stupidly clear here.
Any help is greatly appreciated!
To answer your last question first: you can access a single cell in a matrix with mat[row, column], or multiple scattered cells by their sequential cell IDs. Cell (1,1) is the first cell, followed by (2,1), (3,1), etc.:
mat <- matrix(rep(0, 5*5), nrow=5)
mat[c(1,3,5,7,9)] = c(1,2,3,4,5)
mat
[,1] [,2] [,3] [,4] [,5]
[1,] 1 0 0 0 0
[2,] 0 4 0 0 0
[3,] 2 0 0 0 0
[4,] 0 5 0 0 0
[5,] 3 0 0 0 0
Accessing / overwriting individual cells is fast, however. The fastest way I could think of to perform your task is to first create vectors for the values we want: a vector of all column indices (every column repeated mutrate times), a vector of row indices (random), and a vector of new values for these column/row combinations (random).
cols = rep(seq_len(N), mutrate)
rows = sample(M, N*mutrate, replace = T)
values = sample(genotypes, N*mutrate, replace = T) - 1 # -1 offset since genotypes are 0-indexed
for(i in seq_len(N*mutrate)) {
  mat[rows[i], cols[i]] = values[i]
}
Instead of that for-loop to update the matrix, we can also calculate the cell-IDs so we can update all matrix cells in one go:
cols = rep(seq_len(N), mutrate)
rows = sample(M, N*mutrate, replace = T)
cellid = rows + (cols-1)*M
values = sample(genotypes, N*mutrate, replace = T) - 1 # -1 offset since genotypes are 0-indexed
mat[cellid] = values
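Base R also accepts a two-column matrix of (row, column) indices on the left-hand side, which avoids computing cellid by hand; a small sketch using the same rows, cols and values as above:
mat[cbind(rows, cols)] <- values
This is equivalent to the cellid assignment and is often the more readable form.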
Benchmarking the methods on a 6000x10000 matrix shows how fast each one is:
N = 6000 # individuals
M = 10000 # genotype length
genotypes = 20
mutrate = 2
method1 <- function() {
  mat <- matrix(rep(0, M * N), nrow = M)
  for (i in 1:(N * mutrate)) {
    position <- sample(M, 1)
    letter <- sample(genotypes, 1) - 1
    mat[position, (i - 1) %% N + 1] = letter  # (i - 1) %% N + 1 cycles through all N columns (i %% N would skip column N)
  }
  return(mat)
}
method2 <- function() {
  mat <- matrix(rep(0, M * N), nrow = M)
  mat <- apply(mat, 2, function(x) (x + sample(c(rep(0, M - mutrate), sample(0:19, size = mutrate)) %% 20)))
}
method3 <- function() {
  mat <- matrix(rep(0, M * N), nrow = M)
  cols = rep(seq_len(N), mutrate)
  rows = sample(M, N * mutrate, replace = T)
  values = sample(genotypes, N * mutrate, replace = T) - 1  # -1 offset since genotypes are 0-indexed
  for (i in seq_len(N * mutrate)) {
    mat[rows[i], cols[i]] = values[i]
  }
  return(mat)
}
method4 <- function() {
  mat <- matrix(rep(0, M * N), nrow = M)
  cols = rep(seq_len(N), mutrate)
  rows = sample(M, N * mutrate, replace = T)
  cellid = rows + (cols - 1) * M
  values = sample(genotypes, N * mutrate, replace = T) - 1  # -1 offset since genotypes are 0-indexed
  mat[cellid] = values
  return(mat)
}
benchmark <- function(func, times = 10) {
  begin <- as.numeric(Sys.time())
  for (i in seq_len(times))
    retval <- eval(parse(text = func))
  end <- as.numeric(Sys.time())
  cat(func, 'took', (end - begin) / times, 'seconds\n')
  return(retval)
}
ret1 <- benchmark('method1()')
ret2 <- benchmark('method2()')
ret3 <- benchmark('method3()')
ret4 <- benchmark('method4()')
I've modified your first method to speed it up and to apply mutrate mutations per individual.
method1() took 0.8936087 seconds
method2() took 8.767686 seconds
method3() took 0.7008878 seconds
method4() took 0.6548331 seconds
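If you want more robust timings than a hand-rolled wrapper around Sys.time(), the microbenchmark package (not used above, so treat this as a suggestion) runs each expression several times and reports the distribution of timings:
library(microbenchmark)
microbenchmark(method3(), method4(), times = 10)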

How to generate matrices where A) each row has a single value of one; B) rows sum to one

This is a two-part problem: the first is to create an NxN square matrix in which exactly one random element in each row is 1 and the other elements are zero (i.e. the sum of the elements in each row is 1).
The second is to create an NxN square matrix in which the sum of the elements in each row is 1, but each element follows a distribution, e.g. the normal distribution.
Related questions include (Create a matrix with conditional sum in each row -R)
Matlab seems to do what I want automatically (Why this thing happens with random matrix such that all rows sum up to 1?), but I am looking for a solution in R.
Here is what I tried:
# PART 1
N <- 50
x <- matrix(0,N,N)
lapply(1:N, function(y){
  x[y, sample(N, 1)] <- 1
})
(I get zeroes still)
# PART 2
N <- 50
x <- matrix(0,N,N)
lapply(1:N, function(y){
  x[y, ] <- rnorm(N)
})
(It needs scaling)
Here's another loop-less solution that uses two-column index addressing via the "[<-" function. It creates a two-column index matrix whose first column is simply an ascending series that assigns the row locations, and whose second column (the one responsible for picking the column positions) is a random integer value. (It's a vectorized version of Matthew's "easiest method", and I suspect it would be faster since there is only one call to sample.)
M <- matrix(0,N,N)
M[ cbind(1:N, sample(1:N, N, rep=TRUE))] <- 1
> rowSums(M)
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
If you didn't specify rep=TRUE, then colSums(M) would all have been ones as well, but that was not what you requested. Sampling with replacement does mean the rank of the resulting matrix may be less than N; if you left out rep=TRUE, the matrix would be of full rank.
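As a quick illustration of that last point, dropping rep=TRUE turns the second index column into a permutation, so you get a permutation matrix (one 1 per row and per column, hence full rank); a small sketch:
P <- matrix(0, N, N)
P[cbind(1:N, sample(1:N, N))] <- 1
all(rowSums(P) == 1) && all(colSums(P) == 1)  # TRUE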
Here you see why lapply doesn't always replace a loop. You're trying to iterate through the rows of x and modify the matrix, but what you're modifying is a copy of the x from the global environment.
The easiest fix is to use a for loop:
for (y in 1:N) {
  x[y, sample(N, 1)] <- 1
}
The apply family should be used for its return value, rather than for programming functions with side effects.
One way to do this is to return the rows and then rbind them into a matrix. The second example is shown here, as it more closely resembles an apply:
do.call(rbind, lapply((1:N), function(i) rnorm(N)))
However, this is more readable:
matrix(rnorm(N*N), N, N)
Now to scale this so the row sums equal 1. You can use the fact that a matrix is stored column-wise and that vectors are recycled, which means you can divide a matrix M by rowSums(M). Using a more manageable N=5:
m <- matrix(rnorm(N*N), N, N)
m/rowSums(m)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.1788692 0.5398464 0.24980924 -0.01282655 0.04430168
## [2,] 0.4176512 0.2564463 0.11553143 0.35432975 -0.14395871
## [3,] 0.3480568 0.7634421 -0.38433940 0.34175983 -0.06891932
## [4,] 1.1807180 -0.0192272 0.16500179 -0.31201400 -0.01447859
## [5,] 1.1601173 -0.1279919 -0.07447043 0.20865963 -0.16631458
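You can check that the rows now sum to 1, and sweep() is a more explicit (base R) way to express the same row-wise division, in case the recycling argument feels too implicit:
rowSums(m / rowSums(m))             # all (approximately) 1
m2 <- sweep(m, 1, rowSums(m), "/")  # divide each row by its sum
all.equal(m2, m / rowSums(m))       # TRUE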
No-loop solution :)
n <- 5
# on which column in each row insert 1s
s <- sample(n,n,TRUE)
# indexes for each row
w <- seq(1,n*n,by=n)-1
index <- s+w
# vector of 0s
vec <- integer(n*n)
# put 1s
vec[index] <- 1
# voila :)
matrix(vec,n,byrow = T)
[,1] [,2] [,3] [,4] [,5]
[1,] 1 0 0 0 0
[2,] 0 0 0 1 0
[3,] 0 0 0 0 1
[4,] 1 0 0 0 0
[5,] 1 0 0 0 0

Efficient implementation of summed area table/integral image in R

I am trying to construct a summed area table or integral image given an image matrix. For those of you who don't know what it is, from Wikipedia:
A summed area table (also known as an integral image) is a data structure and algorithm for quickly and efficiently generating the sum of values in a rectangular subset of a grid
In other words, it's used to sum up the values of any rectangular region of the image/matrix in constant time.
I am trying to implement this in R. However, my code seems to take too long to run.
Here is the pseudocode from this link. in is the input matrix or image and intImg is what's returned:
for i = 0 to w do
    sum ← 0
    for j = 0 to h do
        sum ← sum + in[i, j]
        if i = 0 then
            intImg[i, j] ← sum
        else
            intImg[i, j] ← intImg[i − 1, j] + sum
        end if
    end for
end for
And here is my implementation
w = ncol(im)
h = nrow(im)
intImg = c(NA)
length(intImg) = w * h
for (i in 1:w) {  # x
  sum = 0
  for (j in 1:h) {  # y
    ind = ((j - 1) * w) + (i - 1) + 1  # index
    sum = sum + im[ind]
    if (i == 1) {
      intImg[ind] = sum
    } else {
      intImg[ind] = intImg[ind - 1] + sum
    }
  }
}
intImg = matrix(intImg, h, w, byrow = T)
Example of input and output matrix: (image omitted)
However, on a 480x640 matrix, this takes ~4 seconds. In the paper they describe it as taking on the order of milliseconds for those dimensions.
Am I doing something inefficient in my loops or indexing?
I considered writing it in C++ and wrapping it in R, but I am not very familiar with C++.
Thank you
You could try to use apply (it isn't inherently faster than your for-loops if you pre-allocate the memory):
areaTable <- function(x) {
  return(apply(apply(x, 1, cumsum), 1, cumsum))
}
areaTable(m)
# [,1] [,2] [,3] [,4]
# [1,] 4 5 7 9
# [2,] 4 9 12 17
# [3,] 7 13 16 25
# [4,] 9 16 22 33
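Once the table is built, the sum over any rectangle of rows r1..r2 and columns c1..c2 follows from four lookups via the usual inclusion-exclusion identity. A small sketch on top of areaTable and the example matrix m (treating indices outside the table as 0):
rectSum <- function(I, r1, r2, c1, c2) {
  at <- function(r, c) if (r < 1 || c < 1) 0 else I[r, c]
  at(r2, c2) - at(r1 - 1, c2) - at(r2, c1 - 1) + at(r1 - 1, c1 - 1)
}
I <- areaTable(m)
rectSum(I, 2, 3, 2, 4)  # equals sum(m[2:3, 2:4])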

How to write linearly dependent column in a matrix in terms of linearly independent columns?

I have a large mxn matrix, and I have identified the linearly dependent columns. However, I want to know if there's a way in R to write the linearly dependent columns in terms of the linearly independent ones. Since it's a large matrix, it's not possible to do based on inspection.
Here's a toy example of the type of matrix I have.
> mat <- matrix(c(1,1,0,1,0,1,1,0,0,1,1,0,1,1,0,1,0,1,0,1), byrow=TRUE, ncol=5, nrow=4)
> mat
[,1] [,2] [,3] [,4] [,5]
[1,] 1 1 0 1 0
[2,] 1 1 0 0 1
[3,] 1 0 1 1 0
[4,] 1 0 1 0 1
Here it's obvious that x3 = x1-x2, x5=x1-x4. I want to know if there's an automated way to get that for a larger matrix.
Thanks!
I'm sure there is a better way, but I felt like playing around with this. I basically do a check at the beginning to see if the input matrix is of full column rank, to avoid unnecessary computation in case it is. After that I start with the first two columns and check whether that submatrix is of full column rank; if it is, I check the first three columns, and so on. Once we find a submatrix that isn't of full column rank, I regress its last column on the previous ones, which tells us how to construct a linear combination of the earlier columns to get that last column.
My function isn't very clean right now and could do some additional checking but at least it's a start.
mat <- matrix(c(1,1,0,1,0,1,1,0,0,1,1,0,1,1,0,1,0,1,0,1), byrow=TRUE, ncol=5, nrow=4)
linfinder <- function(mat){
  # If the matrix is full rank then we're done
  if(qr(mat)$rank == ncol(mat)){
    print("Matrix is of full rank")
    return(invisible(seq(ncol(mat))))
  }
  m <- ncol(mat)
  # cols keeps track of which columns are linearly independent
  cols <- 1
  for(i in seq(2, m)){
    ids <- c(cols, i)
    mymat <- mat[, ids]
    if(qr(mymat)$rank != length(ids)){
      # Regress the column of interest on the previous
      # columns to figure out the relationship
      o <- lm(mat[,i] ~ mat[,cols] + 0)
      # Construct the output message
      start <- paste0("Column_", i, " = ")
      # Which coefs are nonzero
      nz <- !(abs(coef(o)) <= .Machine$double.eps^0.5)
      tmp <- paste("Column", cols[nz], sep = "_")
      vals <- paste(coef(o)[nz], tmp, sep = "*", collapse = " + ")
      message <- paste0(start, vals)
      print(message)
    }else{
      # If the matrix subset was of full rank
      # then the newest column is linearly independent
      # so add it to the cols list
      cols <- ids
    }
  }
  return(invisible(cols))
}
linfinder(mat)
which gives
> linfinder(mat)
[1] "Column_3 = 1*Column_1 + -1*Column_2"
[1] "Column_5 = 1*Column_1 + -1*Column_4"

R equivalent to diag(x,k) in matlab

I guess I have a two-part question referring to diag in R and Matlab.
1) I was wondering whether there is already a way to access different diagonals of matrices in R, similar to the way it is done in Matlab (see http://www.mathworks.com/help/techdoc/ref/diag.html).
2) If there is not already such a function, how can my code be improved so that it is similar to R's diag, where
diag(x = 1, nrow, ncol) # returns the values of the diagonal
diag(x) <- value # inserts values on the diagonal
Presently my code returns the elements on the diagonal for a given k, but how can it be written so that, when it is called in the second way (above), it allows me to insert values on that diagonal? At the moment, to do this, I use diag.ind to give me the indices and then use those indices to insert the values on the k-th diagonal.
Here is the code:
library(plyr)  # rbind.fill.matrix comes from plyr
'diag.ind' <- function(x, k = 0){
  if(k == '') k = 0
  x <- as.matrix(x)
  if(dim(x)[2] == dim(x)[1]){
    stp_pt_r <- dim(x)[1]
    stp_pt_c <- dim(x)[2]
  }
  if(ncol(x) > dim(x)[1]){
    stp_pt_r <- dim(x)[1]
    stp_pt_c <- stp_pt_r + 1
  }
  if(ncol(x) < dim(x)[1]){
    stp_pt_c <- dim(x)[2]
    stp_pt_r <- stp_pt_c + 1
  }
  if(k == 0){
    r <- as.matrix(seq(1, stp_pt_r, by = 1))
    c <- as.matrix(seq(1, stp_pt_c, by = 1))
    ind.r <- cbind(r, c)
  }
  if(k > 0){
    r <- t(as.matrix(seq(1, stp_pt_r, by = 1)))
    c <- t(as.matrix(seq((1 + k), stp_pt_c, by = 1)))
    ind <- t(rbind.fill.matrix(r, c))
    ind.r <- ind[!is.na(ind[, 2]), ]
  }
  if(k < 0){
    k <- abs(k)
    r <- t(as.matrix(seq((1 + k), stp_pt_r, by = 1)))
    c <- t(as.matrix(seq(1, stp_pt_c, by = 1)))
    ind <- t(rbind.fill.matrix(r, c))
    ind.r <- ind[!is.na(ind[, 1]), ]
  }
  diag.x <- x[ind.r]
  output <- list(diag.x = diag.x, diag.ind = ind.r)
  return(output)
}
This is kind of clunky and I feel like I must be reinventing the wheel. Thanks in advance for any insight!
After your reply to Andrie this may satisfy:
exdiag <- function(mat, off) {mat[row(mat)+off == col(mat)]}
x <- matrix(1:16, ncol=4)
exdiag(x,1)
#[1] 5 10 15
I was thinking you wanted a function that can assign or return a diagonal, sub-diagonal, or super-diagonal matrix. This is the constructor function:
subdiag <- function(vec, size, offset = 0){
  M <- matrix(0, size, size)
  M[row(M) - offset == col(M)] <- vec
  return(M)
}
> subdiag(1, 5, 1)
[,1] [,2] [,3] [,4] [,5]
[1,] 0 0 0 0 0
[2,] 1 0 0 0 0
[3,] 0 1 0 0 0
[4,] 0 0 1 0 0
[5,] 0 0 0 1 0
Called with only two arguments, you would get a diagonal matrix. You can construct super-diagonal matrices with negative offsets. If this is what you wanted for the constructor, then it should not be too hard to construct a similar subdiag<- replacement function to go along with it.
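For completeness, one possible subdiag<- companion along the same lines (a sketch only, using the same row/col logic as the constructor):
`subdiag<-` <- function(M, offset = 0, value) {
  M[row(M) - offset == col(M)] <- value
  M
}
x <- matrix(0, 4, 4)
subdiag(x, 1) <- 7  # fills the first sub-diagonal of x with 7s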
In MATLAB, to assign the values x to the diagonal of A:
n = size(A,1);
A(1:n+1:end) = x
Look up linear indexing.
Although, that might not be what you asked.
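For what it's worth, the same linear-indexing trick carries over to R, since matrices are stored column-major; a small sketch with made-up A and x:
A <- matrix(0, 4, 4)
x <- 1:4
n <- nrow(A)
A[seq(1, n * n, by = n + 1)] <- x  # same effect as MATLAB's A(1:n+1:end) = x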
