Avoiding loops when filling a sparse matrix - r

I am dealing with a quite large sparse matrix (1,000,000 x 1,000,000) for my master's thesis. I use R, and more precisely the 'Matrix' package, which is dedicated to this kind of problem.
Basically, my question is: are there existing techniques to fill a sparse matrix without looping over the whole pre-existing data frame that the sparse matrix is built from? Is there a more efficient way to extract the indices where my sparse matrix will be non-zero?
I'm not the first to ask this question, but most threads concern Matlab or Python, and I didn't find a clear answer.
Let's take an example to shed some light on the issue.
Suppose I have a data frame Df of this form:
list1 <- list("a","b","c")
list2 <- list("b,c","c", NA)
Df <- data.frame(cbind(list1, list2))
Df
list1 list2
1 a b,c
2 b c
3 c NA
I want to build a sparse matrix S that assigns S[i,j]=1 if Df[j,1] appears in Df[i,2]. Thus the matrix S would be of this form:
S=matrix(c(0,0,0,1,0,0,1,1,0), nrow=3)
S
[,1] [,2] [,3]
[1,] 0 1 1
[2,] 0 0 1
[3,] 0 0 0
Is it possible to fill this matrix without writing a loop that scans through each row and column of Df?
At the moment, my code looks something like this:
row_vector <- vector()
col_vector <- vector()
val_vector <- vector()
count = 0
for (i in 1:3){
for (j in 1:3){
if (grepl(Df[i,1], Df[j,2], fixed=TRUE)==TRUE){
count=count+1
col_vector[count]=i
row_vector[count]=j
val_vector[count]=1
}
}
}
This is fine when I have a small matrix, but might get quite tedious with big matrices.
I hope I was clear enough. I would be grateful for any help or advice.
Regards,
Germain

You should read the documentation before asking such a question: see ?sparseMatrix.
You need to provide 3 vectors to the constructor - row indices, column indices and values. Now you need to think how to get indices and values from your data.
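A minimal sketch of that idea, assuming each list2 entry is a comma-separated set of labels that should match an entry of list1 exactly. Note that this swaps the substring test (grepl) from the question for exact matching with match(), and the names tokens, i, j and keep are illustrative only:

```r
library(Matrix)

# Same data as in the question, stored as plain character columns
Df <- data.frame(list1 = c("a", "b", "c"),
                 list2 = c("b,c", "c", NA),
                 stringsAsFactors = FALSE)

# Split every list2 entry into its labels (an NA entry stays a single NA)
tokens <- strsplit(Df$list2, ",", fixed = TRUE)

# Row index: the row of Df each label came from
i <- rep(seq_along(tokens), lengths(tokens))
# Column index: position of each label in list1 (NA when there is no match)
j <- match(unlist(tokens), Df$list1)

# Drop NA entries and labels that do not occur in list1
keep <- !is.na(j)
S <- sparseMatrix(i = i[keep], j = j[keep], x = rep(1, sum(keep)),
                  dims = c(nrow(Df), nrow(Df)))
S
```

Only the non-zero positions are ever visited, so the work grows with the total number of labels rather than with nrow(Df)^2.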

Related

How to sum a result of the repeated function in R

I have a very complicated function. I need to apply this function several times and sum the results. This is easy. However, I need to sum them at the same time. Since my function is too complicated to show here, I provide a very simple example just to explain my idea. Please note (following the helpful questions in the comments) that my function needs to be applied pairwise. Also, my matrices all have the same dimensions. Finally, the result should not be a list; I need to assign the result to a new variable. That is,
Res <- myfunc(x[i,j],y[i,j])+myfunc(z[i,j],t[i,j])+..+..
Also, my function must loop over the elements of the matrices, x[i,j].
My matrices are stored in a list.
Mymatrices <- list(x,y,z,t)
For example,
x <- matrix(5,5,5)
x[upper.tri(x,diag=T)] <- 0
y <- matrix(4,5,5)
y[upper.tri(y,diag=T)] <- 0
z <- matrix(3,5,5)
z[upper.tri(z,diag=T)] <- 0
t <- matrix(2,5,5)
t[upper.tri(t,diag=T)] <- 0
myfunc <- function(x,y){
sum(x,y)
}
I would like it like this:
Res <- myfunc(x[i,j],y[i,j])+myfunc(z[i,j],t[i,j])+..+..
Suppose I have 10 matrices and would like to have the sum as shown above. It is hard to do this manually, so I would like to do it automatically. The lapply function returns a list, and I do not want the result as a list.
Any help, please?
I can't tell either whether you need a matrix or a single value in the end. But since you used i,j, I presume you need a matrix:
Reduce("+",list(x,y,z,t))
[,1] [,2] [,3] [,4] [,5]
[1,] 0 0 0 0 0
[2,] 14 0 0 0 0
[3,] 14 14 0 0 0
[4,] 14 14 14 0 0
[5,] 14 14 14 14 0
or do you need:
Reduce(sum,list(x,y,z,t))
[1] 140
Let's say your matrices are in a list, paired in the way you want:
input = list(list(x, y), list(z, t))
For convenience, we'll make a version of your function that takes a list as input (we could use do.call or an anonymous function instead, but this is very clear):
myfunc_list = function(x) {
myfunc(x[[1]], x[[2]])
}
We can then sapply the list function to your input list, and sum:
sum(sapply(input, myfunc_list))
# [1] 140
Glad to have helped. To be honest, I'm still not completely sure what you are asking for though - no one thinks your final answer will be a list, just an intermediate step in order to do the summation effectively. Looking at the answers, I think the Reduce function suggested by Onyambu is what you need - where x, y, z, and t are the results from your function (called pairwise on different matrices).
Is the summation really where you need help, or is it efficiently calling your function pairwise on all those matrices? That is a very different question. If that's the case, check out the map2 function in the purrr package. It takes two lists (of the same length) as inputs, computes a function on each element, and returns a list (which can be fed into Reduce).
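If the pairwise call is the part that needs automating, base R's Map() does the same job as purrr's map2. A small sketch, where pairwise_fun is a hypothetical stand-in for the real function and is assumed to return a matrix with the same dimensions as its inputs:

```r
# Example matrices from the question: constant below the diagonal, zero elsewhere
make_lower <- function(value) {
  m <- matrix(value, 5, 5)
  m[upper.tri(m, diag = TRUE)] <- 0
  m
}
x <- make_lower(5)
y <- make_lower(4)
z <- make_lower(3)
t <- make_lower(2)

# Hypothetical stand-in for the real (complicated) pairwise function
pairwise_fun <- function(a, b) a + b

# Two lists of equal length: the k-th elements form the k-th pair
firsts  <- list(x, z)
seconds <- list(y, t)

# Apply the function to each pair, then add the resulting matrices together
results <- Map(pairwise_fun, firsts, seconds)
Res <- Reduce(`+`, results)
Res
```

Res is an ordinary matrix here, not a list; the list only exists as an intermediate step.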

R Matching function between two matrices

Sorry if this has been posted before. I looked for the answer both on Google and Stackoverflow and couldn't find a solution.
Right now I have two matrices of data in R. I am trying to loop through each row of one matrix and find the row in the other matrix that is most similar by some distance metric (for now, least squares). I figured out one method, but it is O(n^2), which is prohibitive for my data.
I think this might be similar to some dictionary learning techniques but I couldn't find anything.
Thanks!
Both matrices are just 30 by n matrices with a number at each entry.
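distance.fun=function(mat1,mat2){
# one entry per row of mat1; stays NA for rows that are entirely NA
match=rep(NA_integer_,nrow(mat1))
for (i in 1:nrow(mat1)){
if (!all(is.na(mat1[i,]))){
dist=numeric(nrow(mat2))
for (j in 1:nrow(mat2)){
# squared Euclidean distance between row i of mat1 and row j of mat2
dist[j]=sum((mat1[i,]-mat2[j,])^2)
}
# index of the closest row of mat2, taken once all distances are known
match[i]=which.min(dist)
}
}
return(match)
}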
A better strategy would be to compute the distance matrix all at once first, then extract the minima. Here's an example using simulated data:
set.seed(15)
mat1<-matrix(runif(2*25), ncol=2)
mat2<-matrix(runif(2*25), ncol=2)
Here's a helper function that calculates the distances from the rows of one matrix to the rows of another. It uses the built-in dist function, which also performs unnecessary within-group comparisons that we then have to filter out; still, it may perform better overall.
distab<-function(m1, m2) {
stopifnot(ncol(m1)==ncol(m2))
m<-as.matrix(dist(rbind(m1, m2)))[1:nrow(m1), -(1:nrow(m1))]
rownames(m)<-rownames(m1)
colnames(m)<-rownames(m2)
m
}
mydist<-distab(mat1, mat2)
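Now that we have the between-group distances, we just need to pick the closest match for each row.
# margin 2 runs over the columns of mydist, i.e. the rows of mat2, so this gives
# the closest row of mat1 for each row of mat2 (use margin 1 for the reverse direction)
best <- apply(mydist, 2, which.min)
rr <- cbind(m2.row=seq.int(nrow(mat2)), best.m1.row = best)
head(rr) #just print a few
# m2.row best.m1.row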
# [1,] 1 1
# [2,] 2 14
# [3,] 3 7
# [4,] 4 3
# [5,] 5 23
# [6,] 6 15
Note that with a strategy like this (as well as with your original implementation), it is possible for multiple rows of one matrix to match the same row of the other, and for some rows to remain unmatched.
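The within-group comparisons can be avoided altogether. The following is a sketch rather than part of the answer above: it computes only the between-group squared distances using the identity |a - b|^2 = |a|^2 + |b|^2 - 2 a.b. The function name cross_sqdist is made up for illustration.

```r
set.seed(15)
mat1 <- matrix(runif(2 * 25), ncol = 2)
mat2 <- matrix(runif(2 * 25), ncol = 2)

# Squared Euclidean distance between every row of m1 and every row of m2.
# Rows of the result correspond to rows of m1, columns to rows of m2.
cross_sqdist <- function(m1, m2) {
  stopifnot(ncol(m1) == ncol(m2))
  outer(rowSums(m1^2), rowSums(m2^2), "+") - 2 * tcrossprod(m1, m2)
}

d2 <- cross_sqdist(mat1, mat2)

# For each row of mat1 (margin 1), the index of the closest row of mat2
best_for_mat1 <- apply(d2, 1, which.min)
head(best_for_mat1)
```

This still builds a full nrow(mat1) by nrow(mat2) matrix, so it trades memory for speed; for very large inputs it would have to be applied to blocks of rows.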

Need a more efficient threshold matching with function for R

Not sure how best to ask this question, so feel free to edit the question title if there is a more standard vocabulary to use here.
I have two 2-column data tables in R, the first is a list of unique 2-variable values (u), so much shorter than the second, which is a raw list of similar values (d). I need a function that will, for every 2-variable set of values in u, find all the 2-variable sets of values in d for which both variables are within a given threshold.
Here's a minimal example. Actual data is much larger (see below, as this is the problem) and (obviously) not created randomly as in the example. In the actual data, u would have about 600,000 to 1,000,000 values (rows) and d would have upwards of 10,000,000 rows.
# First create the table of unique variable pairs (no 2-column duplicates)
u <- data.frame(PC1=c(-1.10,-1.01,-1.13,-1.18,-1.12,-0.82),
PC2=c(-1.63,-1.63,-1.81,-1.86,-1.86,-1.77))
# Now, create the set of raw 2-variable pairs, which may include duplicates
d <- data.frame(PC1=sample(u$PC1,100,replace=T)*sample(90:100,100,replace=T)/100,
PC2=sample(u$PC2,100,replace=T)*sample(90:100,100,replace=T)/100)
# Set the threshold that defined a 'close-enough' match between u and d values
b <- 0.1
So, my first attempt to do this was with a for loop for all values of u. This works nicely, but is computationally intensive and takes quite a while to process the actual data.
# Make a list to output the list of within-threshold rows
m <- list()
# Loop to find all values of d within a threshold b of each value of u
# The output list will have as many items as values of u
# For each list item, there may be up to several thousand matching rows in d
# Note that there's a timing command (system.time) in here to keep track of performance
system.time({
for(i in 1:nrow(u)){
m <- c(m, list(which(abs(d$PC1-u$PC1[i])<b & abs(d$PC2-u$PC2[i])<b)))
}
})
m
That works. But I thought using a function with apply() would be more efficient. Which it is...
# Make the user-defined function for the threshold matching
match <- function(x,...){
which(abs(d$PC1-x[1])<b & abs(d$PC2-x[2])<b)
}
# Run the function with the apply() command.
system.time({
m <- apply(u,1,match)
})
Again, this apply function works and is slightly faster than the for loop, but only marginally. This may simply be a big data problem for which I need a bit more computing power (or more time!). But I thought others might have thoughts on a sneaky command or function syntax that would dramatically speed this up. Outside the box approaches to finding these matching rows also welcome.
Somewhat sneaky:
library(IRanges)
ur <- with(u*100L, IRanges(PC2, PC1))
dr <- with(d*100L, IRanges(PC2, PC1))
hits <- findOverlaps(ur, dr + b*100L)
Should be fast once the number of rows is sufficiently large. We multiply by 100 to get into integer space. Reversing the order of the arguments to findOverlaps could improve performance.
Alas, this seems only slightly faster than the for loop
unlist(Map(function(x,y) {
which(abs(d$PC1-x)<b & abs(d$PC2-y)<b)
}, u$PC1, u$PC2))
but at least it's something.
I have a cunning plan :-) . How about just doing calculations:
> set.seed(10)
> bar<-matrix(runif(10),nc=2)
> bar
[,1] [,2]
[1,] 0.50747820 0.2254366
[2,] 0.30676851 0.2745305
[3,] 0.42690767 0.2723051
[4,] 0.69310208 0.6158293
[5,] 0.08513597 0.4296715
> foo<-c(.3,.7)
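> # transpose so that foo is subtracted from every row of bar
> # (a plain foo-bar would recycle foo down the columns instead)
> thresh<-t(foo-t(bar))
> sign(thresh)
[,1] [,2]
[1,] -1 1
[2,] -1 1
[3,] -1 1
[4,] -1 1
[5,] 1 1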
Now all you have to do is select the rows of that last matrix which are c(-1,1), using which, and you can easily extract the desired rows from your bar matrix. Repeat for each row in foo.
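Another way to cut the work per value of u, offered here as a sketch and not taken from any of the answers above: sort d by PC1 once, then use findInterval() to restrict each search to the rows whose PC1 can possibly be within the threshold. The names ord, pc1_sorted and eps are illustrative.

```r
set.seed(1)
u <- data.frame(PC1 = c(-1.10, -1.01, -1.13, -1.18, -1.12, -0.82),
                PC2 = c(-1.63, -1.63, -1.81, -1.86, -1.86, -1.77))
d <- data.frame(PC1 = sample(u$PC1, 100, replace = TRUE) * sample(90:100, 100, replace = TRUE) / 100,
                PC2 = sample(u$PC2, 100, replace = TRUE) * sample(90:100, 100, replace = TRUE) / 100)
b <- 0.1

# Sort d by PC1 once; ord maps sorted positions back to original row numbers
ord <- order(d$PC1)
pc1_sorted <- d$PC1[ord]
eps <- 1e-9  # small slack so rounding at the window edge cannot drop a candidate

m <- lapply(seq_len(nrow(u)), function(i) {
  # Positions in the sorted vector whose PC1 lies within b (plus slack) of u$PC1[i]
  lo <- findInterval(u$PC1[i] - b - eps, pc1_sorted) + 1L
  hi <- findInterval(u$PC1[i] + b + eps, pc1_sorted)
  if (lo > hi) return(integer(0))
  cand <- ord[lo:hi]
  # Exact test, applied to the candidate rows only
  ok <- abs(d$PC1[cand] - u$PC1[i]) < b & abs(d$PC2[cand] - u$PC2[i]) < b
  sort(cand[ok])
})
```

The exact comparison is unchanged; it simply runs on a narrow window of d instead of all of it, which should matter most when b is small relative to the spread of PC1.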

How to efficiently loop through a triangular matrix

I have a triangular matrix and would like to loop through all the elements efficiently. Is there a smart way that I simply don't see?
So here's a small example of my matrix
[,1] [,2] [,3] [,4]
[1,] 1 0 0 0
[2,] 2 1 0 0
[3,] 3 7 1 0
[4,] 4 4 10 1
What I want to do is perform some function f() on the elements in this matrix that are under (over) the main diagonal. (background: I know that I have a symmetric matrix and would like to perform some time consuming data manipulations and I'd like to 'mirror' my matrix)
As shown below, lower.tri() and upper.tri() provide an expressive (and fast) means of extracting and replacing elements in the relevant sectors of a matrix. Since the function you're applying to the elements is presumably slow compared to indexing operations, there's probably no point in searching for faster indexing options (or in trying to avoid the single call to t()).
## Example data
m <- matrix(c(1,2,3,4,0,1,7,4,0,0,1,10,0,0,0,1), ncol=4)
## Example of a slow function
slowFun <- function(x) sapply(x, function(x) {Sys.sleep(0.1); x^2})
## Proposed strategy
m[lower.tri(m)] <- slowFun(m[lower.tri(m)])
m[upper.tri(m)] <- t(m)[upper.tri(m)]

R: a for statement wanted that allows for the use of values from each row

I'm pretty new to R..
I'm reading in a file that looks like this:
1 2 1
1 4 2
1 6 4
and storing it in a matrix:
matrix <- read.delim("filename",...)
Does anyone know how to make a for statement that adds up the first and last numbers of one row per iteration ?
So the output would be:
2
3
5
Many thanks!
Edit: My bad, I should have made this more clear...
I'm actually more interested in an actual for-loop where I can use multiple values from any column on that specific row in each iteration. The adding up numbers was just an example. I'm actually planning on doing much more with those values (for more than 2 columns), and there are many rows.
So something in the lines of:
for (i in matrix_i) #where i means each row
{
#do something with column j and column x from row i, for example add them up
}
If you want to get a vector out of this, it is simpler (and marginally computationally faster) to use apply rather than a for statement. In this case,
sums = apply(m, 1, function(x) x[1] + x[3])
Also, you shouldn't call your variables "matrix" since that is the name of a built in function.
ETA: There is an even easier and computationally faster way. R lets you pull out columns and add them together (since they are vectors, they will get added elementwise):
sums = m[, 1] + m[, 3]
m[, 1] means the first column of the data.
Something along these lines should work rather efficiently (i.e. this is a vectorised approach):
m <- matrix(c(1,1,1,2,4,6,1,2,4), 3, 3)
# [,1] [,2] [,3]
# [1,] 1 2 1
# [2,] 1 4 2
# [3,] 1 6 4
v <- m[,1] + m[,3]
# [1] 2 3 5
You can probably use an apply function or a vectorized approach, and if you can, you really should. But you asked how to do it in a for loop, so here's how to do that. (Let's call your matrix m.)
results <- numeric(nrow(m))
for (row in seq_len(nrow(m))) {
results[row] <- m[row, 1] + m[row, 3]
}
This is probably one of those 100 ways to skin a cat questions. You are perhaps looking for the rowSums function, although you might also find many answers using the apply function.
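For completeness, a short sketch of the rowSums route, assuming the same example matrix m used in the other answers: select the columns of interest first, then sum across each row.

```r
m <- matrix(c(1, 1, 1, 2, 4, 6, 1, 2, 4), 3, 3)

# Keep only columns 1 and 3, then add them up row by row
sums <- rowSums(m[, c(1, 3)])
sums
```

This generalises to any number of columns by changing the index vector.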

Resources