In R: sort the maximum dissimilarity between rows in a matrix

I have a matrix with 100 rows and 10 columns. I want to compare the diversity between rows and sort them, and then select the 10 most dissimilar rows. Which method can I use?
set.seed(123)
mat <- matrix(runif(100 * 10), nrow = 100, ncol = 10)
My initial idea is to calculate the similarity between each pair of rows (e.g. the Tanimoto coefficient or something similar: http://en.wikipedia.org/wiki/Jaccard_index ), take dissimilarity = 1 - similarity, and then compare the dissimilarity values. Finally I would sort all the dissimilarity values and select the 10 largest. But the result is a 100 * 100 matrix, so a more efficient method may be needed when there are many rows. This is just my own thinking and may not be right, so I need help.
[update]
After looking through the literature, I found a definition of the maximum dissimilarity method.
Maximum dissimilarity method: It begins by randomly choosing a data record as the first cluster center. The record maximally distant from the first point is selected as the next cluster center. The record maximally distant from both current points is selected after that. The process repeats until there is a sufficient number of cluster centers.
Here in my question, the sufficient number should be 10.
Thanks.

First of all, the Jaccard index is not right for you. From the Wikipedia page:
The Jaccard coefficient measures similarity between finite sample sets...
Your matrix has samples of floats, so you have a different problem (note that the Index in question is defined in terms of intersections; that should be a red flag right there :-).
So, you have to decide what you mean by dissimilarity. One natural interpretation would be to say row A is more dissimilar from the data set than row B if it has a greater Euclidean distance to the center of mass of the data set. You can think of the center of mass of the data set as the vector you get by taking the mean of each of the columns and putting them together (apply(mat, 2, mean)).
With this, you can take the distance of each row to that central vector, and then get an ordering on those distances. From that you can work back to the rows you desire from the original matrix.
All together:
center <- apply(mat, 2, mean)
# not quite the distances, actually, but their squares. That will work fine for us though, since the order
# will still be the same
dists <- apply(mat, 1, function(row) sum((row - center)^2))
# this gives us the row indices in order of least to greatest dissimilarity
dist.order <- order(dists)
# Now we just grab the 10 most dissimilar of those
most.dissimilar.ids <- dist.order[91:100]
# and use them to get the corresponding rows of the matrix
most.dissimilar <- mat[most.dissimilar.ids,]
If I were actually writing this, I probably would have compressed the last three lines into most.dissimilar <- mat[order(dists)[91:100],], but hopefully having it broken up like this makes it a little easier to see what's going on.
Of course, if distance from the center of mass doesn't make sense as the best way of thinking of "dissimilarity" in your context, then you'll have to amend with something that does.
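The maximum dissimilarity method quoted in the question's update can also be sketched directly. The version below is only one reading of that description, not the approach of the answer above: it assumes Euclidean distance and interprets "maximally distant from the current points" as maximising the distance to the nearest already-selected row. The function name max_dissim_select and its arguments are made up for illustration.

```r
# Greedy max-min selection: a sketch under the assumptions stated above.
max_dissim_select <- function(mat, k, first = sample(nrow(mat), 1)) {
  selected <- integer(k)
  selected[1] <- first
  # min_dist[i] = distance from row i to its nearest already-selected row
  min_dist <- sqrt(rowSums(sweep(mat, 2, mat[first, ])^2))
  for (s in seq_len(k)[-1]) {
    nxt <- which.max(min_dist)          # row farthest from the current selection
    selected[s] <- nxt
    d_new <- sqrt(rowSums(sweep(mat, 2, mat[nxt, ])^2))
    min_dist <- pmin(min_dist, d_new)   # update nearest-selected distances
  }
  selected
}

set.seed(123)
mat <- matrix(runif(100 * 10), nrow = 100, ncol = 10)
picked <- max_dissim_select(mat, k = 10)
most.dissimilar <- mat[picked, ]
```

Only one vector of length nrow(mat) is kept between steps, so the full 100 * 100 distance matrix is never built.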

Related

How to get a single number result for a big matrix using var function?

Using the var function,
(a) find the sample variance of your row averages from above;
(b) find the sample variance for your XYZmat as a whole; <-this
(c) Divide the sample variance of the XYZmat by the sample variance of the row averages. The statistical theory says that ratio will on average be close to the row sample size, which is n, here.
(d) Do your results agree with theory? (That is a non-trivial question.) Show your work.
This is what the question asks for. I could not get a single-number result, so I just used the sd function and squared the result. I keep wondering whether there is a way to get a single number using the var function. In my case n is 30, which I got from the previous part of the homework. This is the first R class I am taking and the first homework assigned, so the answer should be pretty simple.
I tried the as.vector() function and still got a set of numbers as a result. I played around with the var function, with no change.
Unfortunately, I deleted all the code I had since the matrix is so big that my laptop started lagging.
I did not have any error messages, but I kept getting a set of numbers for the answer...
set.seed(123)
XYZmat <- matrix(runif(10000), nrow=100, ncol=100) # make a matrix
varmat <- var(as.vector(XYZmat)) # variance of whole matrix
n <- ncol(XYZmat) # row sample size: number of values averaged in each row
n
#> [1] 100
rowmeans <- rowMeans(XYZmat) # row means
varmat/var(rowmeans) # should be near n
#> [1] 100.6907
Created on 2019-07-17 by the reprex package (v0.3.0)
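A likely reason for getting "a set of numbers" is that var() applied to a matrix returns the covariance matrix of its columns rather than a single value. The sketch below illustrates this on a deliberately non-square matrix (the sizes 2000 and 30 are arbitrary choices for illustration), where the ratio should land near the number of values per row.

```r
set.seed(123)
XYZmat <- matrix(runif(2000 * 30), nrow = 2000, ncol = 30)

cov_mat <- var(XYZmat)                 # matrix input: covariance between columns
dim(cov_mat)                           # 30 x 30, not a single number

var_all <- var(as.vector(XYZmat))      # flatten first to get one number
var_rowmeans <- var(rowMeans(XYZmat))  # variance of the 2000 row averages
ratio <- var_all / var_rowmeans        # expected to be roughly 30 (values per row)
ratio
```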

How to use pointDistance with a very large vector

I've got a big problem.
I've got a large raster (rows=180, columns=480, number of cells=86400)
At first I binarized it (so that there are only 1's and 0's) and then I labelled the clusters.(Cells that are 1 and connected to each other got the same label.)
Now I need to calculate all the distances between the cells, that are NOT 0.
There are quite a lot of them, and that's my big problem.
I did this to get the coordinates of the cells I'm interested in (get the positions (i.e. cell numbers) of the cells, that are not 0):
V=getValues(label)
Vu=c(1:max(V))
pos=which(V %in% Vu)
XY=xyFromCell(label,pos)
This works very well. So XY is a matrix, which contains all the coordinates (of cells that are not 0). But now I'm struggling. I need to calculate the distances between ALL of these coordinates. Then I have to put each one of them in one of 43 bins of distances. It's kind of like this (just an example):
0<x<0.2 bin 1
0.2<x<0.4 bin2
When I use this:
pD=pointDistance(XY,lonlat=FALSE)
R says it's not possible to allocate vector of this size. It's getting too large.
Then I thought I could do this (create an empty data frame df or something like that and let the function pointDistance run over every single value of XY):
# df is assumed to have been created beforehand, e.g. df = NULL
for (i in 1:nrow(XY)) {
  pD = pointDistance(XY, XY[i,], lonlat=FALSE)
  pDbin = as.matrix(table(cut(pD, breaks=seq(0,8.6,by=0.2), labels=1:43)))
  df = cbind(df, pDbin)
  df = apply(df, 1, FUN=function(x) sum(x))
}
It is working when I try this with e.g. the first 50 values of XY.
But when I use it on the whole XY matrix it takes too much time. (Sometimes the XY matrix contains 10000 xy-coordinates.)
Does anyone have an idea how to do it faster?
I don't know whether this will be fast enough, but I recommend you try this:
Let's say you have a data frame with the value 0 or 1 in each cell. To find the coordinates, all you have to do is write the code below:
cord_matrix <- which(dataframe == 1, arr.ind = TRUE)
Now you have the coordinate matrix with row and column indices.
To find the Euclidean distances, use the dist() function. Have a look at its documentation. It will look like this:
dist_vector <- dist(cord_matrix)
It returns the lower triangle of the distance matrix, which can be converted into a vector or a symmetric matrix. Now all you have to do is compute the bins according to your requirements.
Let me know if this fits within your available memory.
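If even dist() is too large to allocate, the binning can be done while the distances are computed, so that only the 43 counts are kept. This is a sketch in base R under the assumption of planar (non-lon/lat) coordinates, as in pointDistance(..., lonlat=FALSE). The function name bin_pairwise_dist is hypothetical, and the random XY below merely stands in for the real coordinate matrix.

```r
# Accumulate a histogram of pairwise distances without storing the distances.
bin_pairwise_dist <- function(XY, breaks) {
  n <- nrow(XY)
  nbins <- length(breaks) - 1
  counts <- numeric(nbins)
  for (i in seq_len(n - 1)) {
    j <- (i + 1):n                      # each pair is visited once
    d <- sqrt((XY[j, 1] - XY[i, 1])^2 + (XY[j, 2] - XY[i, 2])^2)
    # left.open = TRUE gives bins (a, b]; out-of-range indices are ignored by tabulate()
    counts <- counts + tabulate(findInterval(d, breaks, left.open = TRUE), nbins = nbins)
  }
  counts
}

set.seed(1)
XY <- cbind(x = runif(200, 0, 6), y = runif(200, 0, 6))  # stand-in coordinates
breaks <- seq(0, 8.6, by = 0.2)                          # 43 bins, as in the question
counts <- bin_pairwise_dist(XY, breaks)
```

The bins are of the form (a, b], matching the default of cut(), so zero distances between duplicate points are not counted in the first bin.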

compute matrix distance using dynamic programming

I have a matrix consisting of the values 0, 1, and 2. 99% of the values are 0. The matrix has 1 million rows and 700 columns. There is at least one non-zero value in each row.
I need to compute the distance between each pair of columns, using this formula for the distance between columns x and y:
D = (sum of |xi - yi| for i from 1 to L) / (2L), where L = 1 million is the number of rows.
I wrote a piece of R code, but it takes too long to run. Is it possible to use dynamic programming to make it faster? Here is my code:
#mac is the matrix
nCols=ncol(mac)
nRows=nrow(mac)
#the pairwise distance matrix
distMat=matrix(data=-1,nrow=nCols,ncol=nCols)
abs.dist=function(x){return(abs(x[1]-x[2]))}
for(i in 1:(nCols-1)){
  for(j in (i+1):nCols){
    d1=apply(mac[,c(i,j)],1,abs.dist)
    k=sum(d1)/(2*nRows)
    distMat[i,j]=k
    distMat[j,i]=k
  }
}
for(i in 1:nCols) distMat[i,i]=0
Thanks a lot for any help.
I will just summarize what is in the comments already:
#mac is the matrix
nCols=ncol(mac)
nRows=nrow(mac)
#the pairwise distance matrix
distMat=matrix(data=-1,nrow=nCols,ncol=nCols)
for(i in 1:(nCols-1)){
  for(j in (i+1):nCols){
    d1=abs(mac[,i]-mac[,j])
    k=sum(d1)/(2*nRows)
    distMat[i,j]=k
    distMat[j,i]=k
  }
}
diag(distMat) <- 0
This is approximately 100 times faster for a 2000x500 matrix.
It took about half a minute for a 1e6x700 matrix.
Computing a distance matrix over n columns requires (n^2-n)/2 pairwise distances, each of which scans every row. I'm not surprised it is taking a while.
Since you need all pairs, these calculations have to be done independently. Dynamic programming helps when a solution is built up from smaller, overlapping parts; everything here is independent, so it won't help (as far as I know).
You said most entries are 0. Try looking at a sparse matrix library. This blog post may give you some ideas for doing this in R.
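Because the matrix only contains 0, 1 and 2, the sum of absolute differences can also be rewritten in terms of matrix products, which avoids the double loop. The sketch below relies on two identities: for integer values, |x - y| equals the sum over thresholds thr = 1, 2 of |1[x >= thr] - 1[y >= thr]|, and for 0/1 values |a - b| = a + b - 2ab. The function name abs_dist_012 and the small random matrix are made up for illustration, and the approach has not been timed on the 1e6 x 700 case.

```r
# Pairwise column distances for a matrix with values in {0, 1, 2}: a sketch.
abs_dist_012 <- function(mac) {
  L <- nrow(mac)
  total <- matrix(0, ncol(mac), ncol(mac))
  for (thr in 1:2) {
    ind <- (mac >= thr) * 1     # 0/1 indicator matrix for this threshold
    cs <- colSums(ind)          # number of ones per column
    # sum_i |a_i - b_i| = sum(a) + sum(b) - 2 * sum(a * b) for 0/1 vectors
    total <- total + outer(cs, cs, "+") - 2 * crossprod(ind)
  }
  total / (2 * L)
}

set.seed(42)
mac <- matrix(sample(0:2, 50 * 6, replace = TRUE, prob = c(0.9, 0.05, 0.05)), nrow = 50)
distMat <- abs_dist_012(mac)
```

Note that the dense indicator matrix needs about as much memory as mac itself. Since the indicators are as sparse as the data, the same crossprod() step could presumably be carried out on sparse matrices from the Matrix package; that variant is not shown here.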

Generating two sets of numbers where the sum of each set and the sum of their dot product is N

In this question Getting N random numbers that the sum is M, the object was to generate a set of random numbers that sums to a specific number N. After reading this question, I started playing around with the idea of generating sets of numbers that satisfy this condition
sum(A) == sum(B) && sum(B) == sum(A * B)
An example of this would be
A <- c(5, 5, -10, 6, 6, -12)
B <- c(5, -5, 0, 6, -6, 0)
In this case, the three sums equal zero. Obviously, those sets aren't random, but they satisfy the condition. Is there a way to generate 'random' sets of data that satisfy the above condition? (As opposed to using a little algorithm as in the above example.)
(Note: I tagged this as an R question, but the language really doesn't matter to me.)
You'd need to define the first vector in N-dimensional space, and the 2nd one will have N-2 degrees of freedom (i.e. random numbers), since the sum and one angle are already determined.
The 2nd vector would need to be transformed into N-dimensional space. There are infinitely many transforms that could work, so if you don't care about the probability distribution of the resulting vectors, just choose the one that's most intuitive to you.
There's a nice geometrical interpretation of the first constraint: it constrains the 2nd vector to a (hyper-)plane in N-dimensional space; the 2nd constraint doesn't have a simple geometric interpretation.
Check out hyperspherical coordinates.
You can generate one set completely at random, and then generate all the numbers in set B at random except for two. Since you have two equations, you should be able to solve for those two numbers.
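The last suggestion can be sketched as follows: draw A and all but two entries of B at random, then solve the remaining 2 x 2 linear system for the last two entries of B. The function name random_pair and the use of rnorm() are arbitrary choices for illustration, and no claim is made about the resulting distribution of B.

```r
# Generate A and B with sum(A) == sum(B) == sum(A * B), up to rounding: a sketch.
random_pair <- function(n) {
  A <- rnorm(n)
  B <- c(rnorm(n - 2), 0, 0)          # last two entries are placeholders
  S <- sum(A)
  k <- c(n - 1, n)                    # positions to solve for
  # Unknowns b1, b2 must satisfy:
  #   b1 + b2             = S - sum(B[-k])
  #   A[n-1]*b1 + A[n]*b2 = S - sum(A[-k] * B[-k])
  rhs <- c(S - sum(B[-k]), S - sum(A[-k] * B[-k]))
  coef <- rbind(c(1, 1), A[k])        # singular only if A[n-1] == A[n]
  B[k] <- solve(coef, rhs)
  list(A = A, B = B)
}

set.seed(1)
res <- random_pair(6)
c(sum(res$A), sum(res$B), sum(res$A * res$B))
```

Because of floating-point rounding the three sums agree only up to a small tolerance, so all.equal() is a safer check than ==.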

R: How to get a count for a certain value in a matrix row in R?

Ok I have the following problem:
I have several ranks in a matrix in R. (I got this by ranking asset returns. Ranks >= 3 get an NA, ranks < 3 get the rank number. If some assets share a rank, there are fewer NAs in a row.) Here are two example rows of ranks and two example rows of a matrix with returns.
ranks<-matrix(c(1,1,2,NA,NA, 1,2,NA,NA,NA),nrow=2,ncol=5,byrow=TRUE)
returns<-matrix(c(0.3,0.1,-0.5,-0.7,0.2,0.1,0.4,0.05,-0.7,-0.3),nrow=2,ncol=5,byrow=TRUE)
Now if all assets are equally bought for our portfolio, I can calculate the average return with:
Mat.Ret<-returns*ranks
Mean.Ret<-rowMeans(Mat.Ret,na.rm=TRUE)
However I want to have the option of giving a vector of weights for the two ranks and these weights say how big of a percentage this particular rank should have in my portfolio. As an example we have a vector of
weights<-c(0.7,0.3)
Now how would I use this in my code? Basically I want to calculate ranks*returns*weights. If there is only ONE rank 1 and ONE rank 2 in a row, the code works. But how would I make this flexible? One solution would be to count, for each rank, how many times it occurs in a particular row and then divide the weight by this count. Then I would multiply this "net weight" * rank * returns.
But I have no clue how to do this in code. Any help?
UPDATE AFTER FIRST COMMENTS
OK, I want to keep it flexible and adjust the weights depending on how many times a certain rank is given. A user can choose the top 5 ranked assets, so none or several assets may share ranks. So the distribution of weights must be very flexible. I've written a function which doesn't work yet, I guess because I'm not yet experienced enough with matrix and vector selection syntax. This is what I have so far:
distributeWeightsPerMatrixRow<-function(MatrixRow,Weights){
  if(length(Weights)==length(MatrixRow[!is.na(MatrixRow)])){
    MatrixRow <- Weights[MatrixRow]
  } else {
    for(i in 1:length(MatrixRow)){
      if(!is.na(MatrixRow[i])){
        EqWeights<-length(MatrixRow[MatrixRow==MatrixRow[i]])
        MatrixRow[i]<-sum(Weights[MatrixRow[i]:(MatrixRow[i]+EqWeights-1)])/EqWeights
      }
    }
  }
  return(MatrixRow)
}
ranks<-apply(ranks,1,function(x)distributeWeightsPerMatrixRow(x,weights))
EDIT2:
The function seems to work; however, the resulting ranks object is now the transposed version of the original matrix, without the column names.
Since your ranks are integers above zero, you can use the ranks matrix to index the vector weights:
mat.weights <- weights[ranks]
mat.weighted.ret <- returns * ranks * mat.weights
Update based on comment.
I suppose you're looking for something like this:
if (length(unique(na.omit(as.vector(ranks)))) == 1) {
  mat.weights <- (!is.na(ranks)) * 0.5
} else {
  mat.weights <- weights[ranks]
}
mat.weighted.ret <- returns * ranks * mat.weights
If there is only one rank, all weights become 0.5.
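The idea from the question, dividing each rank's weight by the number of times that rank occurs in a row, can be sketched as below. This is one possible reading and differs from the asker's distributeWeightsPerMatrixRow, which pools the weights of tied ranks. The function name net_weights is hypothetical, and the example assumes the ranks matrix is filled row by row.

```r
# Split each rank's weight equally among the assets sharing that rank in a row.
net_weights <- function(ranks, weights) {
  # apply() over rows returns the result column-wise, so transpose it back
  t(apply(ranks, 1, function(r) {
    counts <- tabulate(r[!is.na(r)], nbins = length(weights))  # occurrences per rank
    weights[r] / counts[r]                                     # NA ranks stay NA
  }))
}

ranks <- matrix(c(1, 1, 2, NA, NA,
                  1, 2, NA, NA, NA), nrow = 2, ncol = 5, byrow = TRUE)
weights <- c(0.7, 0.3)
mat.weights <- net_weights(ranks, weights)
mat.weights
```

Wrapping the apply() call in t() also addresses the transposition noted in EDIT2, because apply() over rows stacks its results as columns.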
