I have a matrix composed of the values 0, 1, and 2; 99% of the values are 0. The matrix has 1 million rows and 700 columns, and there is at least one non-zero value in each row.
I need to compute the distance between each pair of columns using this formula for distance between column x and y:
D = Sum(|xi - yi|) / (2L), for i from 1 to L, where L = 1 million, i.e. the number of rows.
I wrote a piece of R code, but it takes too long to run. Is it possible to use dynamic programming to make it faster? Here is my code:
#mac is the matrix
nCols=ncol(mac)
nRows=nrow(mac)
#the pairwise distance matrix
distMat=matrix(data=-1,nrow=nCols,ncol=nCols)
abs.dist=function(x){return(abs(x[1]-x[2]))}
for(i in 1:(nCols-1)){
for(j in (i+1):nCols){
d1=apply(mac[,c(i,j)],1,abs.dist)
k=sum(d1)/(2*nRows)
distMat[i,j]=k
distMat[j,i]=k
}
}
for(i in 1:nCols) distMat[i,i]=0
Thanks a lot for any help.
I will just summarize what is in the comments already:
#mac is the matrix
nCols=ncol(mac)
nRows=nrow(mac)
#the pairwise distance matrix
distMat=matrix(data=-1,nrow=nCols,ncol=nCols)
for(i in 1:(nCols-1)){
for(j in (i+1):nCols){
d1=abs(mac[,i]-mac[,j])
k=sum(d1)/(2*nRows)
distMat[i,j]=k
distMat[j,i]=k
}
}
diag(distMat) <- 0
This is approximately 100 times faster for a 2000x500 matrix.
It took about half a minute for a 1e6x700 matrix.
Computing a distance matrix for n columns means (n^2-n)/2 column pairs (244,650 pairs for n = 700), and each pair requires a pass over all the rows. I'm not surprised it is taking a while.
Since you need all pairs, these calculations have to be done independently. Dynamic programming helps when you can build the solution from smaller, overlapping parts; here every pair is independent, so DP won't help (as far as I know).
You said most entries are 0. Try looking at a sparse matrix library. This blog post may give you some ideas for doing this in R.
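Because the values are only 0, 1 and 2, the sum of absolute differences can also be rewritten with matrix products, which is the kind of operation sparse matrix libraries are good at. The sketch below is one possible approach, not the method from the answers above: it splits each column into two 0/1 indicators (value >= 1 and value >= 2) and uses the fact that for 0/1 values sum(|a - b|) = sum(a) + sum(b) - 2 * sum(a * b). The function name pairwise_dist and the small test matrix are made up for illustration, and the code uses base R dense matrices only. Using a sparse class instead (for example from the Matrix package) is an assumption that is not tested here.

```r
# Pairwise column distance D = sum(|x_i - y_i|) / (2L) for a matrix of 0/1/2 values.
# pairwise_dist is a hypothetical helper name, not part of any package.
pairwise_dist <- function(mac) {
  L <- nrow(mac)
  total <- matrix(0, ncol(mac), ncol(mac))
  for (thr in 1:2) {
    ind  <- (mac >= thr) * 1     # 0/1 indicator: does the value reach this threshold?
    cs   <- colSums(ind)         # number of ones in each column
    both <- crossprod(ind)       # both[a, b] = rows where columns a and b are both 1
    # for 0/1 values: sum|a - b| = sum(a) + sum(b) - 2 * sum(a * b)
    total <- total + outer(cs, cs, "+") - 2 * both
  }
  total / (2 * L)
}

# Small made-up matrix to compare against the direct definition
set.seed(42)
mac_small <- matrix(sample(0:2, 200 * 6, replace = TRUE, prob = c(0.9, 0.05, 0.05)),
                    nrow = 200)
fast   <- pairwise_dist(mac_small)
direct <- as.matrix(dist(t(mac_small), method = "manhattan")) / (2 * nrow(mac_small))
```

Here fast and direct should agree up to floating-point error. Whether this beats the vectorized double loop on the real 1e6 x 700 data depends on memory and on how the indicator matrices are stored, so it is worth timing on a subset first.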
I have a question for an assignment I'm doing.
Q:
"Set the seed at 1, then using a for-loop take a random sample of 5 mice 1,000 times. Save these averages.
What proportion of these 1,000 averages are more than 1 gram away from the average of x ?"
I understand that, basically, I need to write code that answers: what percentage of the values in "nulls" are more than 1 gram above or below the average of "x"? I'm not certain how to write that, since the course asks us to do it without having covered how. Any help on how to do so?
url <- "https://raw.githubusercontent.com/genomicsclass/dagdata/master/inst/extdata/femaleControlsPopulation.csv"
filename <- basename(url)
library(downloader) # provides download()
download(url, destfile=filename)
x <- unlist( read.csv(filename) )
set.seed(1)
n <- 1000
nulls<-vector("numeric", n)
for(i in 1:n){
control <- sample(x, 5)
nulls[i] <-mean(control)
##I know my last line for this should be something like this
## mean(nulls "+ or - 1")> or < mean(x)
## not certain if they're asking for abs() to be involved.
## is the question asking only for those that are 1 gram MORE than the avg of x?
}
Thanks for any help.
I do think that the absolute distance is what they're after here.
Vectors in R are nice in that you can just perform arithmetic operations between a vector and a scalar and it will apply it element-wise, so computing the absolute value of nulls - mean(x) is easy. The abs function also takes vectors as arguments.
Logical operators (such as < and >) can also be used in the same way, making it equally simple to compare the result with 1. This will yield a vector of booleans (TRUE/FALSE) where TRUE means the value at that index was indeed greater than 1, but booleans are really just numbers (1 or 0), so you can just sum that vector to find the number of TRUE elements.
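To illustrate those two ideas without touching the assignment data, here is a small sketch on made-up numbers. The names vals, target, diffs and far are hypothetical and chosen only for this example.

```r
# Made-up values, not the mouse data from the assignment
vals   <- c(22.9, 24.1, 23.0, 26.2, 21.4)
target <- 23.5

diffs <- abs(vals - target)  # vector minus scalar is applied element-wise
far   <- diffs > 1           # logical vector: TRUE where the difference exceeds 1

n_far    <- sum(far)         # TRUE counts as 1, FALSE as 0
prop_far <- mean(far)        # the mean of a logical vector is the proportion of TRUE
```

With these numbers only the last two values are more than 1 away from target, so n_far is 2 out of 5.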
I don't know what programming level you are on, but I hope this helps without giving the solution away completely (since you said it's for an assignment).
Preface: I am fairly novice at using R, I've used SAS my entire adult life and am not used to working with matrices either.
I am currently working on a project for an evolutionary biology class that requires running the rbinom function through a nested loop over a matrix.
The probability in the first row is set to 0.1 but then the value in subsequent rows must use the probability from the previous row. I cannot figure out how to reference the value in the previous row. My code is below, if anybody knows the syntax for this I would greatly appreciate it! Currently I have it set to i-1 but I know that's not right.
#equation = rbinom(1,2*N,p) / (2*N)
p<-0.1
N<-10
T<-5 #number generations
L<-3 #number independent SNP's
alleles<-matrix(nrow=T,ncol=L) #initialize a matrix of allele frequencies each generation
alleles[1,]<-p #initialize first row to equal p
for (j in 1:ncol(alleles)) {
for (i in 2:nrow(alleles)) {
alleles[i,j]<-(rbinom(1,(2*N),(i-1))/(2*N))
}
}
alleles
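One way to refer to the value in the previous row is to index the matrix itself with alleles[i-1, j]. The sketch below assumes that each SNP (column) evolves independently and that the frequency drawn in one generation is the probability for the next. The names n_gen and n_snp are my own and replace T and L, mainly so that T is not masked as shorthand for TRUE.

```r
set.seed(1)   # only to make the sketch reproducible
p     <- 0.1
N     <- 10
n_gen <- 5    # number of generations (T in the question)
n_snp <- 3    # number of independent SNPs (L in the question)

alleles <- matrix(NA_real_, nrow = n_gen, ncol = n_snp)
alleles[1, ] <- p                       # first generation starts at p

for (j in seq_len(n_snp)) {
  for (i in 2:n_gen) {
    prev <- alleles[i - 1, j]           # frequency from the previous generation
    alleles[i, j] <- rbinom(1, 2 * N, prev) / (2 * N)
  }
}
alleles
```

Once a column reaches 0 or 1 it stays there, because rbinom() with probability 0 or 1 always returns 0 or 2*N.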
I'm working with a numeric matrix M in R which is quite big (11000 rows by 20 columns). On this matrix, I'm performing a lot of correlation tests
=> the function cor.test(M[i,], M[j,], method='spearman') where i and j are two rows from the matrix (all possible combinations are tested).
The problem, as you know, is that I'm doing too many tests for the p-value returned by each test to be reliable on its own.
My strategy to overcome this limitation would be to generate a new probability distribution by bootstrapping my matrix M: I would like to generate 100 random matrices from M, run the multiple correlations on these matrices, and choose the right cut-off for the p-value to get an FDR of 5%.
My question is:
What is the most efficient way to randomize my matrix?
Since it's quite time-consuming (I suppose), it would be interesting if the solution could be parallelized.
Thank you in advance for any useful answers.
In Python, there is a function random.sample() in the module random. If you store M as a list of rows, randomly sampling n rows from M without replacement looks like this:
import random
M_sample = random.sample(M,n)
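However, for bootstrapping, you might want to do random sampling with replacement. numpy.random.choice() only accepts a 1-D array (or an integer), so use it to sample row indices and then take those rows:
import numpy
M = numpy.asarray(M)
indices = numpy.random.choice(len(M), n, replace=True)
M_sample = M[indices]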
In R, we use sample() to randomly decide the row indices to take, and then use row access to take the rows from the matrices. Randomly sampling n rows from matrix M without replacement is done as follows:
indices = sample(nrow(M), n,replace=FALSE)
M_sample = M[indices, ]
And for randomly sampling with replacement, replace the first line with this:
indices = sample(nrow(M), n,replace=TRUE)
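To get the 100 resampled matrices mentioned in the question, the same two lines can be wrapped in lapply(). This is a minimal sketch on a small stand-in matrix. The names n_boot and boot_mats are made up, and it assumes each resample should keep the original number of rows.

```r
set.seed(1)
M <- matrix(rnorm(50 * 20), nrow = 50)   # small stand-in for the real 11000 x 20 matrix
n_boot <- 100                            # number of resampled matrices

boot_mats <- lapply(seq_len(n_boot), function(b) {
  indices <- sample(nrow(M), nrow(M), replace = TRUE)  # row indices, with replacement
  M[indices, , drop = FALSE]
})
```

Each element of boot_mats is one resampled matrix. On Unix-like systems, lapply could be swapped for parallel::mclapply with an mc.cores argument to spread the work over several cores. Whether that pays off depends on how expensive the downstream correlation tests are.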
I'm trying to fill a 10 x 1500 matrix with a loop.
I have to fill that matrix with 150 small 10 x 10 matrices. I have tried to implement this with a double loop, without success. My problem is that each 10 x 10 matrix is the result of a scalar product.
At the beginning it seemed easy, but then I realized I couldn't figure out how to line up the 150 small 10 x 10 matrices inside the 10 x 1500 matrix.
Here is what I did:
es_var is a 1 x 150 matrix, which I converted to a vector to simplify the scalar product (at least in my opinion).
diax is a 10 x 10 matrix.
I want to multiply each value of the es_var vector by the whole 10 x 10 diax matrix.
I am having trouble because I can't get R to fill 10 columns at a time. In the end I get a 10 x 1500 matrix, but it is the same 10 x 10 matrix repeated 150 times.
Here is my code
es_var1 = as.vector(es_var)
v = matrix(0, 10, 10*N)
for (i in 1:N){
v[,] = es_var1[i] * diax
}
Can somebody help me figure this out, please? I spent the whole day trying. I need to do this without using built-in functions, since this is a small part of a big math demonstration I have to implement.
If I understand your requirement correctly, you can accomplish this with the following line:
v <- matrix(diax,10,1500)*rep(es_var1,each=100)
This constructs a 10x1500 matrix with the 10x10 diax matrix as the initial values, cycled sufficiently to cover the complete 10x1500 size. Then, to apply the es_var1 multiplication, you can replicate each of its elements 100 times, such that they will naturally align with each consecutive 10x10 small matrix during vectorized multiplication.
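If the loop form is preferred, the missing piece in the question's code is the column range on the left-hand side: block i occupies columns (i-1)*10+1 to i*10. The sketch below uses random stand-in values for es_var1 and diax (the real ones are not given) and compares the loop with the vectorized line. The names v_loop, v_vec and cols are made up.

```r
set.seed(1)
N <- 150
es_var1 <- runif(N)                      # stand-in for the real 150 scalars
diax    <- matrix(rnorm(100), 10, 10)    # stand-in for the real 10 x 10 matrix

# Loop version: write each scaled 10 x 10 block into its own 10 columns
v_loop <- matrix(0, 10, 10 * N)
for (i in 1:N) {
  cols <- ((i - 1) * 10 + 1):(i * 10)    # columns belonging to block i
  v_loop[, cols] <- es_var1[i] * diax
}

# Vectorized version from the answer
v_vec <- matrix(diax, 10, 10 * N) * rep(es_var1, each = 100)
```

Both versions should give the same 10 x 1500 matrix. The vectorized one works because R stores matrices column by column, so every run of 100 consecutive elements is exactly one 10 x 10 block.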
I have a matrix with 100 rows and 10 columns. I want to compare the diversity between rows, sort them, and then select the 10 most dissimilar rows. Which method can I use?
set.seed(123)
mat <- matrix(runif(100 * 10), nrow = 100, ncol = 10)
My initial idea is to calculate the similarity (e.g. the Tanimoto coefficient or others: http://en.wikipedia.org/wiki/Jaccard_index ) between each pair of rows, take dissimilarity = 1 - similarity, and then compare the dissimilarity values. Finally, I would sort all dissimilarity values and select the 10 largest. But the result is a 100 * 100 matrix, so a more efficient method may be needed if there are a large number of rows. This is just my thought and may not be right, so I need help.
[update]
After looking through some literature, I found a definition of the maximum dissimilarity method.
Maximum dissimilarity method: It begins by randomly choosing a data record as the first cluster center. The record maximally distant from the first point is selected as the next cluster center. The record maximally distant from both current points is selected after that. The process repeats itself until there is a sufficient number of cluster centers.
Here in my question, the sufficient number should be 10.
Thanks.
First of all, the Jaccard index is not right for you. From the Wikipedia page:
The Jaccard coefficient measures similarity between finite sample sets...
Your matrix has samples of floats, so you have a different problem (note that the Index in question is defined in terms of intersections; that should be a red flag right there :-).
So, you have to decide what you mean by dissimilarity. One natural interpretation would be to say row A is more dissimilar from the data set than row B if it has a greater Euclidean distance to the center of mass of the data set. You can think of the center of mass of the data set as the vector you get by taking the mean of each of the columns and putting them together (apply(mat, 2, mean)).
With this, you can take the distance of each row to that central vector, and then get an ordering on those distances. From that you can work back to the rows you desire from the original matrix.
All together:
center <- apply(mat, 2, mean)
# not quite the distances, actually, but their squares. That works fine for us, since the order
# is still the same
dists <- apply(mat, 1, function(row) sum((row - center)^2))
# this gives us the row indices in order of least to greatest dissimilarity
dist.order <- order(dists)
# Now we just grab the 10 most dissimilar of those
most.dissimilar.ids <- dist.order[91:100]
# and use them to get the corresponding rows of the matrix
most.dissimilar <- mat[most.dissimilar.ids,]
If I was actually writing this, I probably would have compressed the last three lines as most.dissimilar <- mat[order(dists)[91:100],], but hopefully having it broken up like this makes it a little easier to see what's going on.
Of course, if distance from the center of mass doesn't make sense as the best way of thinking of "dissimilarity" in your context, then you'll have to amend with something that does.
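For instance, the maximum dissimilarity method quoted in the question's update could be sketched as follows. This is one reading of that description, under two assumptions that are mine rather than the question's: distance is Euclidean, and "maximally distant from the current points" means the row whose distance to its nearest already-selected row is largest. The names d, k, selected, min_d and picked are made up.

```r
set.seed(123)
mat <- matrix(runif(100 * 10), nrow = 100, ncol = 10)

d <- as.matrix(dist(mat))          # Euclidean distances between all pairs of rows
k <- 10                            # number of rows to select

selected <- sample(nrow(mat), 1)   # random first pick
while (length(selected) < k) {
  # distance from every row to its nearest already-selected row
  min_d <- apply(d[, selected, drop = FALSE], 1, min)
  min_d[selected] <- -Inf          # never pick a row twice
  selected <- c(selected, unname(which.max(min_d)))
}

picked <- mat[selected, ]          # the 10 selected rows
```

This still builds the full 100 x 100 distance matrix, which is fine at this size. For many more rows, the distances to the selected rows could be computed on the fly instead.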