Matching function between two matrices in R

Sorry if this has been posted before. I looked for the answer both on Google and Stack Overflow and couldn't find a solution.
Right now I have two matrices of data in R. I am trying to loop through each row in one matrix and find the row in the other matrix that is most similar by some distance metric (for now, least squares). I figured out one method, but it is O(n^2), which is prohibitive for my data.
I think this might be similar to some dictionary learning techniques but I couldn't find anything.
Thanks!
Both matrices are just 30 by n matrices with a number at each entry.
distance.fun <- function(mat1, mat2) {
  match <- c()
  for (i in 1:nrow(mat1)) {
    # skip rows of mat1 that are entirely NA
    if (!all(is.na(mat1[i, ]))) {
      dist <- c()
      for (j in 1:nrow(mat2)) {
        dist[j] <- sum((mat1[i, ] - mat2[j, ])^2)
      }
      # index of the closest mat2 row; which.min(dist) replaces the original
      # which(min(dist) %in% dist), which always evaluated to 1, and it now
      # runs once per row of mat1 instead of inside the inner loop
      match[i] <- which.min(dist)
    }
  }
  return(match)
}

A better strategy would be to compute the distance matrix all at once first, then extract the minima. Here's an example using simulated data:
set.seed(15)
mat1<-matrix(runif(2*25), ncol=2)
mat2<-matrix(runif(2*25), ncol=2)
And here's a helper function that calculates the distances between the rows of one matrix and the rows of another. It uses the built-in dist function; that does perform unnecessary within-group comparisons that we eventually have to filter out, but it may still perform better overall.
distab <- function(m1, m2) {
  stopifnot(ncol(m1) == ncol(m2))
  # compute all pairwise distances, then keep only the between-group block
  m <- as.matrix(dist(rbind(m1, m2)))[1:nrow(m1), -(1:nrow(m1))]
  rownames(m) <- rownames(m1)
  colnames(m) <- rownames(m2)
  m
}
mydist<-distab(mat1, mat2)
Now that we have the between-group distances, we just need to find, for each row of mat1, the row of mat2 at minimum distance.
best <- apply(mydist, 1, which.min)
rr <- cbind(m1.row = seq.int(nrow(mat1)), best.m2.row = best)
head(rr) #just print a few
# m1.row best.m2.row
# [1,] 1 1
# [2,] 2 14
# [3,] 3 7
# [4,] 4 3
# [5,] 5 23
# [6,] 6 15
Note that with a strategy like this (as well as with your original implementation) it is possible for multiple rows from mat1 to match the same row in mat2, and for some rows in mat2 to be unmatched to mat1.
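You can check both properties on the simulated data; a quick sketch (exact results depend on the seed):
# mat2 rows claimed by more than one row of mat1 (if any)
rr[duplicated(rr[, "best.m2.row"]), , drop = FALSE]
# mat2 rows that no row of mat1 matched
setdiff(seq_len(nrow(mat2)), rr[, "best.m2.row"])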

Related

Avoiding loops when filling a sparse matrix

I am dealing with quite a large sparse matrix (1,000,000 x 1,000,000) for my master's thesis. I use R, and more precisely the 'Matrix' package dedicated to this issue.
Basically, my question is: are there existing techniques to fill a sparse matrix without looping over the whole pre-existing data frame I build the sparse matrix upon? Is there a more efficient way to extract the indices where my sparse matrix will not take the value 0?
I'm not the first to ask this question, but most threads concern MATLAB or Python, and I didn't find a clear answer.
Let's take an example to shed some light on the issue.
Suppose I have a data frame Df of this form:
list1 <- list("a","b","c")
list2 <- list("b,c","c", NA)
Df <- data.frame(cbind(list1, list2))
Df
list1 list2
1 a b,c
2 b c
3 c NA
I want to build a sparse matrix S that assigns S[i,j]=1 if Df[j,1] appears in Df[i,2]. Thus the matrix S would be of this form:
S=matrix(c(0,0,0,1,0,0,1,1,0), nrow=3)
S
[,1] [,2] [,3]
[1,] 0 1 1
[2,] 0 0 1
[3,] 0 0 0
Is it possible to fill this matrix without writing a loop that scans through each row and column of Df ?
Indeed, at the moment, my code would look like something like this:
row_vector <- vector()
col_vector <- vector()
val_vector <- vector()
count <- 0
for (i in 1:3) {
  for (j in 1:3) {
    if (grepl(Df[i, 1], Df[j, 2], fixed = TRUE)) {
      count <- count + 1
      col_vector[count] <- i
      row_vector[count] <- j
      val_vector[count] <- 1
    }
  }
}
This is fine when I have a small matrix, but might get quite tedious with big matrices.
I hope I was clear enough. I would be grateful for any help or advice.
Regards,
Germain
You should read the documentation before asking such a question: see ?sparseMatrix.
You need to provide three vectors to the constructor: row indices, column indices, and values. Now you need to think about how to get the indices and values from your data.
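For the example above, a minimal sketch of that idea (assuming list1 and list2 are held as plain character vectors rather than lists) builds the three vectors without any loop, using strsplit and match:
library(Matrix)
keys <- c("a", "b", "c")                       # Df$list1 as a character vector
vals <- c("b,c", "c", NA)                      # Df$list2 as a character vector
tokens <- strsplit(vals, ",", fixed = TRUE)    # split each list2 entry on commas
i <- rep(seq_along(tokens), lengths(tokens))   # a row index for every token
j <- match(unlist(tokens), keys)               # column index by exact lookup in list1
keep <- !is.na(j)                              # drop the NA entry
S <- sparseMatrix(i = i[keep], j = j[keep], x = 1,
                  dims = c(length(vals), length(keys)))
S
# 3 x 3 sparse Matrix of class "dgCMatrix"
# [1,] . 1 1
# [2,] . . 1
# [3,] . . .
Using match rather than grepl here also avoids accidental substring matches (e.g. "a" matching inside "ab").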

Repeated subsetting of the same matrix using apply in R

Motivation: I am currently trying to rethink my coding so as to exclude for-loops where possible. The problem below can easily be solved with conventional for-loops, but I was wondering if R offers a possibility to utilize the apply family to make it easier.
Problem: I have a matrix, say X (n x k matrix) and two matrices of start and stop indices, called index.starts and index.stops, respectively. They are of size n x B and it holds that index.stops = index.starts + m for some integer m. Each pair index.starts[i,j] and index.stops[i,j] are needed to subset X as X[ (index.starts[i,j]:index.stops[i,j]),]. I.e., they should select all the rows of X in their index range.
Can I solve this problem using one of the apply functions?
Application: (Not necessarily important for understanding my problem.) In case you are interested, this is needed for a block-bootstrapping application in a time-series setting. X represents the original sample. index.starts is sampled as replicate(repetitionNumber, sample.int((n-r), ceiling(n/r), replace=TRUE)) and index.stops is obtained as index.stops = index.starts + m. What I want in the end is a collection of rows of X. In particular, I want to resample, repetitionNumber times, ceiling(n/r) blocks of length r from X.
Example:
#generate data
n<-100 #the size of your sample
B<-5 #the number of columns for index.starts and index.stops
#and equivalently the number of block bootstraps to sample
k<-2 #the number of variables in X
X<-matrix(rnorm(n*k), nrow=n, ncol = k)
#take a random sample of the indices 1:100 to get index.starts
r<-10 #this is the block length
#get a sample of the indices 1:(n-r), and get ceiling(n/r) of these
#(for n=100 and r=10, ceiling(n/r) = n/r = 10). Replicate this B times
index.starts<-replicate(B, sample.int((n-r), ceiling(n/r), replace=TRUE))
index.stops<-index.starts + r
#Now can I use apply-functions to extract the r subsequent rows that are
#paired in index.starts[i,j] and index.stops[i,j] for i = 1,2,...,10 = ceiling(n/r) and
#j=1,2,3,4,5=B ?
It's probably more complicated than what you want/need, but here is a first approach. Just comment if it helps you in any way; I'm happy to help further.
My approach uses (multiple) *apply functions. The first lapply "loops" over the 1:B cases; for each it first calculates the start and end points, which are combined into take.rows (the subsetting indices). Next, the initial matrix is subsetted by take.rows (and returned in a list). As a last step, the standard deviation is taken for each column of the subsetted matrices (as a dummy function).
The code (with heavy commenting) looks like this:
# you can use lapply in parallel mode if you want to speed up the code...
lapply(1:B, function(i) {
  starts <- sample.int((n - r), ceiling(n / r), replace = TRUE)
  # [1] 64 22 84 26 40  7 66 12 25 15
  ends <- starts + r
  take.rows <- Map(":", starts, ends)
  # [[1]]
  # [1] 64 65 66 67 68 69 70 71 72 73 74
  # ...
  res <- lapply(take.rows, function(subs) X[subs, ])
  # res is now a list of 10 with the ten subsets
  # [[1]]
  #            [,1]        [,2]
  # [1,]  0.2658915 -0.18265235
  # [2,]  1.7397478  0.66315385
  # ...
  # say you want to compute something (sd in this case); you can do the
  # following, but better to do the computing directly in the former
  # lapply(take.rows, ...)
  res2 <- t(sapply(res, function(tmp) {
    apply(tmp, 2, sd)  # column-wise standard deviation per block
  }))  # simplify into a matrix
  #           [,1]      [,2]
  # [1,] 1.2345833 1.0927203
  # [2,] 1.1838110 1.0767433
  # [3,] 0.9808146 1.0522117
  # ...
  return(res2)
})
Does that point you in the right direction or give you the answer?
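For the question as originally posed, with the pre-sampled index.starts and index.stops from the example, a minimal sketch using mapply (which walks the two index matrices element-wise, in column order) would be:
blocks <- mapply(function(from, to) X[from:to, , drop = FALSE],
                 index.starts, index.stops, SIMPLIFY = FALSE)
length(blocks)     # ceiling(n/r) * B = 50 blocks
dim(blocks[[1]])   # (r + 1) x k; use index.starts + r - 1 for blocks of exactly r rows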

compare clusters' objects in R

I have two clustering results for the same variables but with different values each time. Let us create them with the following code:
set.seed(11)
a<-matrix(rnorm(10000),ncol=100)
colnames(a)<-(c(1:100))
set.seed(31)
b<-matrix(rnorm(10000),ncol=100)
colnames(b)<-colnames(a)
c.a<-hclust(dist(t(a)))
c.b<-hclust(dist(t(b)))
# clusters
groups.a<-cutree(c.a, k=15)
# take groups names
clus.a=list()
for (i in 1:15) clus.a[[i]] <- colnames(a)[groups.a==i]
# see the clusters
clus.a
groups.b<-cutree(c.b, k=15)
clus.b=list()
for (i in 1:15) clus.b[[i]] <- colnames(b)[groups.b==i]
# see the clusters
clus.b
What I get from that is two lists, clus.a and clus.b, with the names (here just the numbers 1 to 100) of each cluster's variables.
Is there any way to examine if, and which of, the variables are clustered together in both clusterings? Meaning, how can I see whether I have variables (in groups of 2, 3, 4, etc.) in the same clusters for both clus.a and clus.b (it doesn't have to be the same cluster number)?
If I understand your question correctly, you want to know if there are any clusters in a which have exactly the same membership as any of the clusters in b. Here's one way to do that.
Note: AFAICT in your example there are no matching clusters in a and b, so we create a few artificially to demo the solution.
# create artificial matches
clus.b[[3]] <- clus.a[[2]]
clus.b[[10]] <- clus.a[[8]]
clus.b[[15]] <- clus.a[[11]]
f <- function(a,b) (length(a)==length(b) & length(intersect(a,b))==length(a))
result <- sapply(clus.b,function(x)sapply(clus.a,f,b=x))
which(result, arr.ind=TRUE)
# row col
# [1,] 2 3
# [2,] 8 10
# [3,] 11 15
So this loops through all the clusters in b (sapply(clus.b,...)) and, for each, loops through all the clusters in a looking for an exact match (in arbitrary order). For there to be a match, both clusters must have the same length, and the intersection of the two must contain all the elements of either, hence have the same length. This process produces a logical matrix with rows representing a and columns representing b.
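Since the cluster membership lists contain no duplicates, the same test can be written with base R's setequal(), which states the intent more directly; a sketch, equivalent under that no-duplicates assumption:
f2 <- function(a, b) setequal(a, b)
result2 <- sapply(clus.b, function(x) sapply(clus.a, f2, b = x))
identical(result2, result)  # expected TRUE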
Edit: To reflect the fact that OP is changing the question.
To detect clusters with two or more common elements, use:
f <- function(a,b) length(intersect(a,b))>1
result <- sapply(clus.b,function(x)sapply(clus.a,f,b=x))
matched <- which(result, arr.ind=TRUE)
matched
# row col
# [1,] 4 1
# [2,] 8 1
# [3,] 11 1
# [4,] 3 2
# ...
To identify which elements were present in both:
apply(matched,1,function(r) intersect(clus.a[[r[1]]],clus.b[[r[2]]]))

Split matrix into submatrices

I found a solution to a similar question that breaks up a matrix into a set of non-overlapping sub-matrices when the matrix may not be square (as it may not be in my situation): Function to split a matrix into sub-matrices in R. However, in my case I want the matrices to be overlapping.
M <- matrix(1:20, 5)
# [,1] [,2] [,3] [,4]
#[1,] 1 6 11 16
#[2,] 2 7 12 17
#[3,] 3 8 13 18
#[4,] 4 9 14 19
#[5,] 5 10 15 20
If I choose the number of rows and columns of the sub-matrices to be 4 and 4, respectively, then the sub-matrices returned should be every possible overlapping 4x4 sub-matrix that fits completely within the boundary of the original matrix (this is another place where the answer may need to differ from the question I referenced). In my example there would be only two sub-matrices to return: M[1:4,1:4] and M[2:5,1:4]. I should be able to choose any arbitrary size of sub-matrix. I can't seem to find any examples of extracting overlapping sub-matrices, but I may be over-thinking this one. Does anyone have any ideas on the best way to go about this?
Perhaps a strategy like this could work:
submat <- function(m, nrow, ncol) {
  stopifnot(nrow(m) >= nrow, ncol(m) >= ncol)
  rowstarts <- 1:(nrow(m) - nrow + 1)
  colstarts <- 1:(ncol(m) - ncol + 1)
  # extract the submatrix whose top-left corner is at (r, c)
  ss <- function(r, c) {
    m[r:(r + nrow - 1), c:(c + ncol - 1), drop = FALSE]
  }
  with(expand.grid(r = rowstarts, c = colstarts), mapply(ss, r, c, SIMPLIFY = FALSE))
}
submat(M, 4, 4)
We determine the possible start indices for the rows and columns, use expand.grid() to generate all combinations of those starting values, and then use mapply to extract every possible submatrix at those starting positions.
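As a quick check against the example in the question, the 5x4 matrix M should yield exactly the two 4x4 sub-matrices identified there:
subs <- submat(M, 4, 4)
length(subs)                       # 2
identical(subs[[1]], M[1:4, 1:4])  # TRUE
identical(subs[[2]], M[2:5, 1:4])  # TRUE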

Need a more efficient threshold matching with function for R

Not sure how best to ask this question, so feel free to edit the question title if there is a more standard vocabulary to use here.
I have two 2-column data tables in R. The first is a list of unique 2-variable values (u), and so is much shorter than the second, which is a raw list of similar values (d). I need a function that will, for every 2-variable set of values in u, find all the 2-variable sets of values in d for which both variables are within a given threshold.
Here's a minimal example. Actual data is much larger (see below, as this is the problem) and (obviously) not created randomly as in the example. In the actual data, u would have about 600,000 to 1,000,000 values (rows) and d would have upwards of 10,000,000 rows.
# First create the table of unique variable pairs (no 2-column duplicates)
u <- data.frame(PC1=c(-1.10,-1.01,-1.13,-1.18,-1.12,-0.82),
                PC2=c(-1.63,-1.63,-1.81,-1.86,-1.86,-1.77))
# Now create the set of raw 2-variable pairs, which may include duplicates
d <- data.frame(PC1=sample(u$PC1,100,replace=T)*sample(90:100,100,replace=T)/100,
                PC2=sample(u$PC2,100,replace=T)*sample(90:100,100,replace=T)/100)
# Set the threshold that defines a 'close-enough' match between u and d values
b <- 0.1
So, my first attempt to do this was with a for loop for all values of u. This works nicely, but is computationally intensive and takes quite a while to process the actual data.
# Make a list to hold the within-threshold rows
m <- list()
# Loop to find all values of d within a threshold b of each value of u.
# The output list will have as many items as there are rows of u.
# For each list item, there may be up to several thousand matching rows in d.
# Note that there's a timing command (system.time) in here to keep track of performance.
system.time({
  for (i in 1:nrow(u)) {
    m <- c(m, list(which(abs(d$PC1 - u$PC1[i]) < b & abs(d$PC2 - u$PC2[i]) < b)))
  }
})
m
That works. But I thought using a function with apply() would be more efficient. Which it is...
# Make the user-defined function for the threshold matching
# (note: this shadows base R's match(); a different name would be safer)
match <- function(x, ...) {
  which(abs(d$PC1 - x[1]) < b & abs(d$PC2 - x[2]) < b)
}
# Run the function with the apply() command
system.time({
  m <- apply(u, 1, match)
})
Again, this apply function works and is slightly faster than the for loop, but only marginally. This may simply be a big-data problem for which I need a bit more computing power (or more time!). But I thought others might have thoughts on a sneaky command or function syntax that would dramatically speed this up. Outside-the-box approaches to finding these matching rows are also welcome.
Somewhat sneaky:
library(IRanges)
ur <- with(u*100L, IRanges(PC2, PC1))
dr <- with(d*100L, IRanges(PC2, PC1))
hits <- findOverlaps(ur, dr + b*100L)
Should be fast once the number of rows is sufficiently large. We multiply by 100 to get into integer space. Reversing the order of the arguments to findOverlaps could improve performance.
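To read the result, the Hits object pairs row indices of u with row indices of d; a sketch of extracting them (note the interval overlap is a necessary but not sufficient condition for the 2-D threshold test, so an exact filter on these candidate pairs may still be needed):
pairs <- cbind(u.row = queryHits(hits), d.row = subjectHits(hits))
head(pairs)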
Alas, this seems only slightly faster than the for loop
unlist(Map(function(x, y) {
  which(abs(d$PC1 - x) < b & abs(d$PC2 - y) < b)
}, u$PC1, u$PC2))
but at least it's something.
I have a cunning plan :-). How about just doing calculations:
> set.seed(10)
> bar<-matrix(runif(10),nc=2)
> bar
[,1] [,2]
[1,] 0.50747820 0.2254366
[2,] 0.30676851 0.2745305
[3,] 0.42690767 0.2723051
[4,] 0.69310208 0.6158293
[5,] 0.08513597 0.4296715
> foo<-c(.3,.7)
> thresh<-foo-bar
> sign(thresh)
[,1] [,2]
[1,] -1 1
[2,] 1 1
[3,] -1 1
[4,] 1 -1
[5,] 1 1
Now all you have to do is select the rows of that last matrix which are c(-1, 1), using which, and you can easily extract the desired rows from your bar matrix. Repeat for each row in foo.
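A minimal sketch of that selection step for the data shown above (sel is a hypothetical name):
sel <- which(sign(thresh)[, 1] == -1 & sign(thresh)[, 2] == 1)
sel          # rows 1 and 3 for this seed
bar[sel, ]   # the corresponding rows of bar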
