Most efficient way to randomize a matrix in R or in Python - r

I'm working with a numeric matrix M in R which is quite big (11000 rows by 20 columns). On this matrix, I'm performing a lot of correlation tests with cor.test(M[i,], M[j,], method='spearman'), where i and j are two rows of the matrix (all possible pairs are tested).
The problem is that I'm running so many tests that the p-values returned by the individual tests are no longer reliable on their own.
My strategy to overcome this limitation is to generate a new probability distribution by bootstrapping my matrix M: I would like to generate 100 random matrices from M, run the same correlations on them, and choose a p-value cut-off that gives an FDR of 5%.
My question is:
What is the most efficient way to randomize my matrix?
Since this is presumably quite time consuming, it would be interesting if the solution could be parallelized.
Thanks in advance for any useful answers.

In Python there is a function random.sample() in the random module. If you store M as a list of rows, randomly sampling n rows from M without replacement looks like this:
import random
M_sample = random.sample(M, n)
However, for bootstrapping you want sampling with replacement. numpy.random.choice() only samples from 1-D arrays, so with a NumPy array you sample row indices and then index the matrix:
import numpy
indices = numpy.random.choice(M.shape[0], n, replace=True)
M_sample = M[indices, :]
In R, we use sample() to randomly decide the row indices to take, and then use row access to take the rows from the matrices. Randomly sampling n rows from matrix M without replacement is done as follows:
indices = sample(nrow(M), n, replace = FALSE)
M_sample = M[indices, ]
And for randomly sampling with replacement, replace the first line with this:
indices = sample(nrow(M), n, replace = TRUE)
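Since the question also asks about parallelization, here is a minimal sketch (not part of the original answer) of building the 100 resampled matrices with the parallel package. It assumes M is the 11000 x 20 matrix; on Windows you would use parLapply() instead of the fork-based mclapply():
library(parallel)
n.boot <- 100
boot.matrices <- mclapply(seq_len(n.boot), function(b) {
  idx <- sample(nrow(M), nrow(M), replace = TRUE)   # resample rows with replacement
  M[idx, ]
}, mc.cores = detectCores() - 1)
Each element of boot.matrices is one resampled copy of M on which the correlation tests can be rerun.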

Related

How to get a single number result for a big matrix using var function?

Using the var function,
(a) find the sample variance of your row averages from above;
(b) find the sample variance for your XYZmat as a whole; <- this is the part I'm asking about
(c) Divide the sample variance of the XYZmat by the sample variance of the row averages. The statistical theory says that ratio will on average be close to the row sample size, which is n, here.
(d) Do your results agree with theory? (That is a non-trivial question.) Show your work.
So this is what was asked in the question. I could not get a single-number result, so I just used the sd function and squared the result. I keep wondering whether there is still a way to get a single-number result using the var function. In my case n is 30; I got it from the previous part of the homework. This is the first R class I am taking and this is the first homework assigned, so the answer should be pretty simple.
I tried the as.vector() function and I still got a set of numbers as a result. I played around with the var function, with no change.
Unfortunately, I deleted all the code I had, since the matrix is so big that my laptop started lagging.
I did not get any error messages, but I kept getting a set of numbers as the answer...
set.seed(123)
XYZmat <- matrix(runif(10000), nrow=100, ncol=100) # make a matrix
varmat <- var(as.vector(XYZmat)) # var() on a matrix returns a covariance matrix, so flatten it first to get a single number
n <- nrow(XYZmat) # number of rows
n
#> [1] 100
rowmeans <- rowMeans(XYZmat) # row means
varmat/var(rowmeans) # should be near n
#> [1] 100.6907
Created on 2019-07-17 by the reprex package (v0.3.0)
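A short note on why the ratio should be near n (added for clarity, not part of the original answer): if the matrix entries are independent with variance sigma^2, each row mean averages n of them, so
var(row mean) = sigma^2 / n, and therefore var(entries) / var(row means) is approximately n.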

Generate permutations in sequential order - R

I previously asked the following question
Permutation of n bernoulli random variables in R
The answer to that question works great as long as n is relatively small (< 30); otherwise the following error occurs: Error: cannot allocate vector of size 4.0 Gb. I can get the code to run with somewhat larger values on my desktop at work, but eventually the same error occurs. Even for values my computer can handle, say n = 25, the code is extremely slow.
The purpose of this code is to calculate the difference between the CDF of an exact distribution (hence the permutations) and a normal approximation. I randomly generate some data and calculate the test statistic; then I need to determine the CDF by counting the permutations that result in a smaller test statistic value and dividing by the total number of permutations.
My thought is to generate the permutations one at a time, note whether each one gives a smaller value than my observed one, and move on to the next, i.e. loop over all possible permutations. But I can't just build a data frame of all the permutations to loop over, because that would cause exactly the same size and speed issues.
Long story short: I need to generate all possible permutations of 1's and 0's for n Bernoulli trials, one at a time, so that every permutation is generated exactly once, for arbitrary n. For n = 3 there are 2^3 = 8 permutations, and I would first generate
000
calculate if my test statistic was greater (1 or 0) then generate
001
calculate again, then generate
010
calculate, then generate
100
calculate, then generate
011
etc until 111
I'm fine with this being a loop over 2^n that outputs the permutation at each step but doesn't save them all somewhere. I also don't care what order they are generated in; the above is just how I would list them out by hand.
In addition, if there is any way to speed up the previous code, that would also be helpful.
A good solution for your problem is iterators. There is a package called arrangements that is able to generate permutations in an iterative fashion. Observe:
library(arrangements)
# initialize iterator
iperm <- ipermutations(0:1, 3, replace = T)
for (i in 1:(2^3)) {
  print(iperm$getnext())
}
[1] 0 0 0
[1] 0 0 1
.
.
.
[1] 1 1 1
It is written in C and is very efficient. You can also generate m permutations at a time like so:
iperm$getnext(m)
This allows for better performance because the next permutations are being generated by a for loop in C as opposed to a for loop in R.
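For the use case described in the question (counting how many permutations give a test statistic at most the observed value), the iterator lets you process the permutations in chunks without ever holding all 2^n of them in memory. A minimal sketch, where stat_fn() and obs are hypothetical placeholders for the test statistic and its observed value:
library(arrangements)
n <- 20
total <- 2^n
iperm <- ipermutations(0:1, n, replace = TRUE)
chunk <- 10000L
n_smaller <- 0
done <- 0
# stat_fn() and obs are placeholders for your test statistic and its observed value
while (done < total) {
  m <- min(chunk, total - done)
  perms <- iperm$getnext(m)                              # one permutation per row
  n_smaller <- n_smaller + sum(apply(perms, 1, stat_fn) <= obs)
  done <- done + m
}
exact_cdf <- n_smaller / total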
If you really need to ramp up performance, you can use the parallel package.
iperm <- ipermutations(0:1, 40, replace = T)
parallel::mclapply(1:100, function(x) {
  myPerms <- iperm$getnext(10000)
  # do something
}, mc.cores = parallel::detectCores() - 1)
Note: All code is untested.

prcomp( .. ,retx=TRUE), do I get the new data to train over?

I am having some issues interpreting the results from prcomp().
Say I have a centered and scaled data.table called dat, with N columns and M rows. Every column represents a feature and every row a record. I also have an M-dimensional vector of outcomes Y.
I wanted to see what PCA says about this system, so I simply ran:
dat.pca = prcomp(dat, retx = TRUE)
By the elbow method I decided to retain 5 PCA modes, accounting for 90% of the variance. I then built the following data.table:
dat.pcadata = as.data.table(dat.pca$x)
dat.pcadata has M rows and N columns, and each column corresponds to a PCA mode.
My question is: do I understand correctly that my model should now be trained to forecast the outcomes Y using the first 5 columns of dat.pcadata as features?
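If that understanding is correct, a minimal sketch of the workflow would look like this (the 5-component cut-off is taken from the question; the use of lm() is just an illustrative assumption):
library(data.table)
dat.pca <- prcomp(dat, retx = TRUE)            # dat is already centered and scaled
summary(dat.pca)                               # cumulative proportion of variance per PC
scores <- as.data.table(dat.pca$x[, 1:5])      # first 5 PC scores: M rows, 5 features
train <- cbind(scores, Y = Y)
fit <- lm(Y ~ ., data = train)                 # any model of Y on the retained scores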

compute matrix distance using dynamic programming

I have a matrix containing the values 0, 1, and 2; 99% of the values are 0. The matrix has 1 million rows and 700 columns, and every row contains at least one non-zero value.
I need to compute the distance between each pair of columns. For columns x and y the distance is
D = sum(|x_i - y_i|, i = 1..L) / (2L), where L = 1,000,000 is the number of rows.
I wrote a piece of R code, but it is taking too long to compute. Is it possible to use dynamic programming to do it faster? Here is my code:
#mac is the matrix
nCols = ncol(mac)
nRows = nrow(mac)
#the pairwise distance matrix
distMat = matrix(data = -1, nrow = nCols, ncol = nCols)
abs.dist = function(x) { return(abs(x[1] - x[2])) }
for (i in 1:(nCols - 1)) {
  for (j in (i + 1):nCols) {
    d1 = apply(mac[, c(i, j)], 1, abs.dist)
    k = sum(d1) / (2 * nRows)
    distMat[i, j] = k
    distMat[j, i] = k
  }
}
for (i in 1:nCols) distMat[i, i] = 0
Thanks a lot for any help!
I will just summarize what is in the comments already:
#mac is the matrix
nCols = ncol(mac)
nRows = nrow(mac)
#the pairwise distance matrix
distMat = matrix(data = -1, nrow = nCols, ncol = nCols)
for (i in 1:(nCols - 1)) {
  for (j in (i + 1):nCols) {
    d1 = abs(mac[, i] - mac[, j])
    k = sum(d1) / (2 * nRows)
    distMat[i, j] = k
    distMat[j, i] = k
  }
}
diag(distMat) <- 0
This is approximately 100 times faster for a 2000x500 matrix.
It took about half a minute for a 1e6x700 matrix.
Computing a full distance matrix requires (n^2 - n)/2 pairwise computations, so I'm not surprised it is taking a while.
Since all pairs have to be computed independently, dynamic programming will not help: DP pays off when you can build the solution from overlapping smaller subproblems, and there are none here (as far as I know).
You said most entries are 0. Try looking at a sparse matrix library. This blog post may give you some ideas for doing this in R.
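To make the sparse-matrix suggestion concrete, here is a sketch (my own addition, not from the answers above) that removes the pairwise loop entirely. Because the entries are only 0, 1 or 2, |x - y| = x + y - 2*min(x, y), and min(x, y) = 1[x>=1]*1[y>=1] + 1[x>=2]*1[y>=2], so all the pairwise sums of |x_i - y_i| can be obtained from two sparse cross-products:
library(Matrix)
A1 <- Matrix(mac >= 1, sparse = TRUE) * 1   # sparse indicator of entries >= 1
A2 <- Matrix(mac >= 2, sparse = TRUE) * 1   # sparse indicator of entries >= 2
S  <- colSums(mac)                          # per-column sums
# sum_i |x_i - y_i| = sum(x) + sum(y) - 2 * sum_i min(x_i, y_i)
absSums <- outer(S, S, "+") - 2 * (as.matrix(crossprod(A1)) + as.matrix(crossprod(A2)))
distMat <- absSums / (2 * nrow(mac))
With 99% zeros, the cross-products only touch the non-zero entries, so this should be much faster than looping over all 700*699/2 column pairs. (If the dense logical intermediates mac >= 1 and mac >= 2 are too large to build in one go, the indicators can be assembled column by column.)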

Multiple Matrix Operations in R with loop based on matrix name

I'm a novice R user who's learning to use this language to deal with data problems in research. I am trying to understand how knowledge evolves within an industry by looking at patenting in subclasses. So far I have managed to get the following:
kn.matrices <- with(patents, table(Class, year, firm))
kn.ind <- with(patents, table(Class, year))
patents is my datafile, with Subclass, app.yr, and short.name as three of the 14 columns
for (k in 1:37)
  kn.firms = assign(paste("firm", k, sep = ''), kn.matrices[, , k])
There are 37 different firms (in the real dataset, here only 5)
This has given 37 firm-specific and 1 industry-specific 2635 by 29 matrices (in the real dataset). All firm-specific matrices are called firmk, with k going from 1 to 37.
I would like to perform many operations on each of the firm-specific matrices (e.g. compare the numbers in app.yr t with the average of the 3 previous years, across all rows), so I am looking for a way to loop the operations over every matrix named firm1, firm2, ..., firm37 and to generate new matrices with consistent naming, e.g. firm1.3yearcomparison.
Hopefully I framed this question in an appropriate way. Any help would be greatly appreciated.
Following the comments, I'm adding a minimal reproducible example:
year<-c(1990,1991,1989,1992,1993,1991,1990,1990,1989,1993,1991,1992,1991,1991,1991,1990,1989,1991,1992,1992,1991,1993)
firm<-(c("a","a","a","b","b","c","d","d","e","a","b","c","c","e","a","b","b","e","e","e","d","e"))
class<-c(1900,2000,3000,7710,18000,19000,36000,115000,212000,215000,253600,383000,471000,594000)
These three vectors thus represent columns in a spreadsheet that forms the "patents" matrix mentioned before.
It looks like you already have a 3-dimensional array with all your data. You can basically view this as your firm matrices all piled one on top of the other. You don't want to split it into separate matrices and use loops; instead, use R's apply() and extraction functions. The help topic for the apply() family shows how to do what you want. Here are a few basic examples:
# returns the sums of all columns for all matrices
apply(kn.matrices, 3, colSums)
# extract the 5th row of all matrices
kn.matrices[5, , ]
# extract the 5th column of all matrices
kn.matrices[, 5, ]
# extract the 5th matrix
kn.matrices[, , 5]
# mean of 5th column for all matrices
colMeans(kn.matrices[, 5, ])
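As a sketch of the specific per-firm operation mentioned in the question (comparing each year with the average of the 3 previous years): the helper below is my own illustration, not part of the answer, and assumes kn.matrices is the Class x year x firm array built above.
three.year.comparison <- function(m) {
  # m is the Class x year count matrix for one firm
  yrs <- 4:ncol(m)                                     # years that have 3 predecessors
  prev3 <- sapply(yrs, function(t) rowMeans(m[, (t - 3):(t - 1), drop = FALSE]))
  m[, yrs, drop = FALSE] - prev3                       # difference from the 3-year average
}
# one result matrix per firm, with consistent naming
firm.comparisons <- lapply(seq_len(dim(kn.matrices)[3]),
                           function(k) three.year.comparison(kn.matrices[, , k]))
names(firm.comparisons) <- paste0("firm", seq_along(firm.comparisons), ".3yearcomparison")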
