I have a very complicated function. I need to call this function several times and sum the results; each call is easy on its own, but I need to sum them all in one step. Since my function is too involved to show here, I provide a very simple example just to explain the idea. Please note (based on the helpful questions in the comments): my function must be applied pairwise, my matrices all have the same dimensions, and the result should not be a list. I need to assign the result to a new variable. That is,
Res <- myfunc(x[i,j], y[i,j]) + myfunc(z[i,j], t[i,j]) + ...
Also, my function must loop over the elements of the matrices, e.g. x[i,j].
My matrices are stored in a list.
Mymatrices <- list(x, y, z, t)
For example,
x <- matrix(5,5,5)
x[upper.tri(x,diag=T)] <- 0
y <- matrix(4,5,5)
y[upper.tri(y,diag=T)] <- 0
z <- matrix(3,5,5)
z[upper.tri(z,diag=T)] <- 0
t <- matrix(2,5,5)
t[upper.tri(t,diag=T)] <- 0
myfunc <- function(x,y){
sum(x,y)
}
I would like it like this:
Res <- myfunc(x[i,j], y[i,j]) + myfunc(z[i,j], t[i,j]) + ...
Suppose I have 10 matrices and would like the sum shown above. It is tedious to write out manually, so I would like to do it automatically. The lapply function takes a list, but I do not want the result as a list.
Any help, please?
I can't tell whether you need a matrix or a single value in the end. But since you used i,j I presume you need a matrix:
Reduce("+",list(x,y,z,t))
     [,1] [,2] [,3] [,4] [,5]
[1,]    0    0    0    0    0
[2,]   14    0    0    0    0
[3,]   14   14    0    0    0
[4,]   14   14   14    0    0
[5,]   14   14   14   14    0
or do you need:
Reduce(sum,list(x,y,z,t))
[1] 140
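If the pairing matters (x with y, then z with t), here is a minimal sketch along the same lines, assuming the matrices pair up in the order they appear in the list:
mats <- list(x, y, z, t)
firsts  <- mats[c(TRUE, FALSE)]   # 1st, 3rd, ... elements: x, z
seconds <- mats[c(FALSE, TRUE)]   # 2nd, 4th, ... elements: y, t
# Apply myfunc to each pair, then sum the pairwise results
Res <- Reduce(`+`, Map(myfunc, firsts, seconds))
Res
# [1] 140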
Let's say your matrices are in a list, paired in the way you want:
input = list(list(x, y), list(z, t))
For convenience, we'll make a version of your function that takes a list as input (we could use do.call or an anonymous function instead, but this is very clear):
myfunc_list = function(x) {
  myfunc(x[[1]], x[[2]])
}
We can then sapply the list function to your input list, and sum:
sum(sapply(input, myfunc_list))
# [1] 140
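For reference, the anonymous-function alternative mentioned above is a one-liner with the same result, just without the named helper:
sum(sapply(input, function(p) myfunc(p[[1]], p[[2]])))
# [1] 140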
Glad to have helped. To be honest, I'm still not completely sure what you are asking for, though: no one expects your final answer to be a list; a list is just an intermediate step that makes the summation easy. Looking at the answers, I think the Reduce function suggested by Onyambu is what you need, where x, y, z, and t are the results from your function (called pairwise on different matrices).
Is the summation really where you need help, or is it efficiently calling your function pairwise on all those matrices? That is a very different question. If it's the latter, check out the map2 function in the purrr package. It takes two lists of the same length as inputs, computes a function on corresponding elements, and returns a list (which can be fed into Reduce).
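For example, a minimal sketch of that purrr route, assuming the first matrix of each pair sits in one list and the second in another:
library(purrr)
firsts  <- list(x, z)   # first matrix of each pair
seconds <- list(y, t)   # second matrix of each pair
# map2() calls myfunc on corresponding elements and returns a list
Reduce(`+`, map2(firsts, seconds, myfunc))
# [1] 140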
I am dealing with a quite large sparse matrix (1,000,000 x 1,000,000) for my master's thesis. I use R, and more precisely the 'Matrix' package dedicated to this kind of problem.
Basically, my question is: are there existing techniques to fill a sparse matrix without looping over the whole pre-existing data frame I build the sparse matrix from? Is there a more efficient way to extract the indices at which my sparse matrix will be nonzero?
I'm not the first to ask this question, but most threads concern Matlab or Python, and I didn't find a clear answer.
Let's take an example to shed some light on the issue.
Suppose I have a data frame Df of this form:
list1 <- list("a","b","c")
list2 <- list("b,c","c", NA)
Df <- data.frame(cbind(list1, list2))
Df
  list1 list2
1     a   b,c
2     b     c
3     c    NA
I want to build a sparse matrix S that assigns S[i,j]=1 if Df[j,1] appears in Df[i,2]. Thus the matrix S would be of this form:
S=matrix(c(0,0,0,1,0,0,1,1,0), nrow=3)
S
     [,1] [,2] [,3]
[1,]    0    1    1
[2,]    0    0    1
[3,]    0    0    0
Is it possible to fill this matrix without writing a loop that scans through each row and column of Df ?
Indeed, at the moment, my code looks something like this:
row_vector <- vector()
col_vector <- vector()
val_vector <- vector()
count <- 0
for (i in 1:3) {
  for (j in 1:3) {
    if (grepl(Df[i,1], Df[j,2], fixed=TRUE)) {
      count <- count + 1
      col_vector[count] <- i
      row_vector[count] <- j
      val_vector[count] <- 1
    }
  }
}
This is fine when I have a small matrix, but might get quite tedious with big matrices.
I hope I was clear enough. I would be grateful for any help or advice.
Regards,
Germain
You should read the documentation before asking this kind of question: see ?sparseMatrix.
You need to provide three vectors to the constructor: row indices, column indices, and values. Now you need to think about how to get those indices and values from your data.
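For the example above, a minimal sketch of that construction (this assumes strsplit is used to break up the comma-separated strings in list2, so no cell-by-cell loop is needed):
library(Matrix)
labels  <- unlist(Df$list1)                            # "a" "b" "c"
targets <- strsplit(as.character(unlist(Df$list2)), ",")
# Row index = the row of Df each split entry came from;
# column index = position of the matched label in list1
i <- rep(seq_along(targets), lengths(targets))
j <- match(unlist(targets), labels)
keep <- !is.na(j)                                      # drop the NA entry
S <- sparseMatrix(i = i[keep], j = j[keep], x = 1,
                  dims = c(nrow(Df), nrow(Df)))
This reproduces the S shown in the question while only ever touching the nonzero entries. Note that it matches whole comma-separated tokens rather than substrings (which is what the example needs), unlike the grepl-based loop.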
This might be the most stupid question, but I do not seem to grasp the function sapply. Here is my issue.
Example:
d = matrix(1:10, 5,2)
d[3] = NA
#      [,1] [,2]
# [1,]    1    6
# [2,]    2    7
# [3,]   NA    8
# [4,]    4    9
# [5,]    5   10
If I would like to calculate the row means using sapply function I would use something like this:
sapply(d,mean)
#[1] 1 2 NA 4 5 6 7 8 9 10
Should it not give me the means? It just spits out the elements of my matrix rather than the mean.
When I use apply, I get the right answer:
apply(d,1,mean, na.rm=T)
[1] 3.5 4.5 8.0 6.5 7.5
Can anyone bother giving me a very dummy explanation? Highly appreciated.
I used the following links before asking the question: link 1, Link 2, Link 3.
Have you read ?sapply? The function takes either a vector or an expression object; it does not handle a matrix as a matrix. So what happens in your example is that the matrix d is treated as a plain vector of numerics:
sapply(as.numeric(d),mean)
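To see the coercion in action:
v <- as.numeric(d)   # what sapply() effectively iterates over
sapply(v, mean)      # the mean of a single number is that number
# [1]  1  2 NA  4  5  6  7  8  9 10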
sapply (and its friends, like lapply) require a list (or a data.frame, which is really a special kind of list) as input. But even if you had turned your matrix into a data frame, it wouldn't have given you row means, it would have given you column means. If you want to understand how these functions work, it might help to look at this function (copied from here: http://adv-r.had.co.nz/Functionals.html), which shows the essence of lapply using base R. (sapply works in the same way, it just tries to simplify the output to a vector rather than always returning a list.)
lapply2 <- function(x, f, ...) {
  out <- vector("list", length(x))
  for (i in seq_along(x)) {
    out[[i]] <- f(x[[i]], ...)
  }
  out
}
By the way, if the point of your question is to find the best way of calculating row means, rather than understanding sapply, then there is a dedicated function rowMeans that is typically the fastest way of doing it -- faster than using apply, for example.
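For example, on the matrix above:
rowMeans(d, na.rm = TRUE)
# [1] 3.5 4.5 8.0 6.5 7.5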
I have 4 lists
a <- list(1,2,3,4)
b <- list(5,6,7,8)
c <- list(7,9,0)
d <- list(12,14)
I would like to know which of the lists have elements in common. In this example, lists b and c have the element 7 in common.
A brute force approach would be to take every combination of lists and find the intersection. Is there a more efficient way to do this in R?
Another approach would be to combine all the lists into a single one and find the duplicates. Then maybe we could have a mapping function to indicate which original lists these duplicates come from. But I am not so sure how to do that. I came across this post:
Find indices of duplicated rows
I was thinking if we could modify this to find out the actual lists which have duplicates.
I have to repeat this process for many groups of lists.
Any suggestions/ideas are greatly appreciated!
Thanks in advance
What about using this double sapply?
l <- list(a,b,c,d)
sapply(seq_len(length(l)), function(x)
  sapply(seq_len(length(l)), function(y) length(intersect(unlist(l[x]), unlist(l[y])))))
     [,1] [,2] [,3] [,4]
[1,]    4    0    0    0
[2,]    0    4    1    0
[3,]    0    1    3    0
[4,]    0    0    0    2
Interpretation: e.g. element [1,2] of the matrix shows how many elements the first element of the list l (in this case the sublist a) has in common with the second list element (i.e. the sublist b).
Or alternatively just to see the indices of the sublists which have a common value with some other sublist:
which(sapply(seq_len(length(l)), function(x) length(intersect(l[[x]], unlist(l[-x])))) >= 1)
[1] 2 3
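Since the matrix above is symmetric, a small variation that computes each intersection only once is to loop over unordered index pairs with combn. A sketch:
# Each column of combn(length(l), 2) is one unordered pair of indices
pairs <- combn(length(l), 2)
overlaps <- apply(pairs, 2, function(p)
  length(intersect(unlist(l[p[1]]), unlist(l[p[2]]))))
# Keep only the pairs that share at least one element
pairs[, overlaps > 0, drop = FALSE]
#      [,1]
# [1,]    2
# [2,]    3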
Not sure how best to ask this question, so feel free to edit the question title if there is a more standard vocabulary to use here.
I have two 2-column data tables in R, the first is a list of unique 2-variable values (u), so much shorter than the second, which is a raw list of similar values (d). I need a function that will, for every 2-variable set of values in u, find all the 2-variable sets of values in d for which both variables are within a given threshold.
Here's a minimal example. Actual data is much larger (see below, as this is the problem) and (obviously) not created randomly as in the example. In the actual data, u would have about 600,000 to 1,000,000 values (rows) and d would have upwards of 10,000,000 rows.
# First create the table of unique variable pairs (no 2-column duplicates)
u <- data.frame(PC1=c(-1.10,-1.01,-1.13,-1.18,-1.12,-0.82),
                PC2=c(-1.63,-1.63,-1.81,-1.86,-1.86,-1.77))
# Now, create the set of raw 2-variable pairs, which may include duplicates
d <- data.frame(PC1=sample(u$PC1,100,replace=T)*sample(90:100,100,replace=T)/100,
                PC2=sample(u$PC2,100,replace=T)*sample(90:100,100,replace=T)/100)
# Set the threshold that defines a 'close-enough' match between u and d values
b <- 0.1
So, my first attempt was a for loop over all values of u. This works nicely, but it is computationally intensive and takes quite a while on the actual data.
# Make a list to output the list of within-threshold rows
m <- list()
# Loop to find all values of d within a threshold b of each value of u
# The output list will have as many items as values of u
# For each list item, there may be up to several thousand matching rows in d
# Note that there's a timing command (system.time) in here to keep track of performance
system.time({
  for (i in 1:nrow(u)) {
    m <- c(m, list(which(abs(d$PC1-u$PC1[i])<b & abs(d$PC2-u$PC2[i])<b)))
  }
})
m
That works. But I thought using a function with apply() would be more efficient. Which it is...
# Make the user-defined function for the threshold matching
# (note that this masks base R's match() for the session)
match <- function(x, ...) {
  which(abs(d$PC1-x[1])<b & abs(d$PC2-x[2])<b)
}
# Run the function with the apply() command.
system.time({
  m <- apply(u, 1, match)
})
Again, this apply version works and is faster than the for loop, but only marginally. This may simply be a big-data problem for which I need more computing power (or more time!). But I thought others might know a clever command or function that would dramatically speed this up. Outside-the-box approaches to finding these matching rows are also welcome.
Somewhat sneaky:
library(IRanges)
ur <- with(u*100L, IRanges(PC2, PC1))
dr <- with(d*100L, IRanges(PC2, PC1))
hits <- findOverlaps(ur, dr + b*100L)
Should be fast once the number of rows is sufficiently large. We multiply by 100 to get into integer space. Reversing the order of the arguments to findOverlaps could improve performance.
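The query/subject pairs can then be read off the Hits object. For instance, a sketch of recovering the same list structure as m (queryHits and subjectHits come from S4Vectors, which IRanges loads):
# For each row of u (the query), the matching rows of d (the subject)
m2 <- split(subjectHits(hits), queryHits(hits))
Note that rows of u with no matches are simply absent from m2, unlike in the loop version, where they appear as empty entries.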
Alas, this seems only slightly faster than the for loop
unlist(Map(function(x, y) {
  which(abs(d$PC1-x)<b & abs(d$PC2-y)<b)
}, u$PC1, u$PC2))
but at least it's something.
I have a cunning plan :-) . How about just doing calculations:
> set.seed(10)
> bar<-matrix(runif(10),nc=2)
> bar
           [,1]      [,2]
[1,] 0.50747820 0.2254366
[2,] 0.30676851 0.2745305
[3,] 0.42690767 0.2723051
[4,] 0.69310208 0.6158293
[5,] 0.08513597 0.4296715
> foo<-c(.3,.7)
> # Subtract each row of bar from foo; note that a plain foo-bar
> # would recycle foo down the columns, which is not what we want
> thresh<-t(foo-t(bar))
> sign(thresh)
     [,1] [,2]
[1,]   -1    1
[2,]   -1    1
[3,]   -1    1
[4,]   -1    1
[5,]    1    1
Now all you have to do is select the rows of that last matrix which are c(-1,1), using which, and you can easily extract the desired rows from your bar matrix. Repeat for each row in foo.
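In code, that selection step might look like this (a sketch, assuming c(-1,1) is the pattern of interest as above):
s <- sign(thresh)
# Rows of bar whose sign pattern is exactly c(-1, 1)
keep <- s[, 1] == -1 & s[, 2] == 1
which(keep)
# [1] 1 2 3 4
bar[keep, ]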
I have a triangular matrix and would like to loop through all the elements efficiently. Is there a smart way that I simply don't see?
So here's a small example of my matrix
     [,1] [,2] [,3] [,4]
[1,]    1    0    0    0
[2,]    2    1    0    0
[3,]    3    7    1    0
[4,]    4    4   10    1
What I want to do is perform some function f() on the elements in this matrix that are under (over) the main diagonal. (Background: I know that I have a symmetric matrix and would like to perform some time-consuming data manipulations, so I'd like to 'mirror' my matrix.)
As shown below, lower.tri() and upper.tri() provide an expressive (and fast) means of extracting and replacing elements in the relevant sectors of a matrix. Since the function you're applying to the elements is presumably slow compared to indexing operations, there's probably no point in searching for faster indexing options (or in trying to avoid the single call to t()).
## Example data
m <- matrix(c(1,2,3,4,0,1,7,4,0,0,1,10,0,0,0,1), ncol=4)
## Example of a slow function
slowFun <- function(x) sapply(x, function(x) {Sys.sleep(0.1); x^2})
## Proposed strategy
m[lower.tri(m)] <- slowFun(m[lower.tri(m)])
m[upper.tri(m)] <- t(m)[upper.tri(m)]
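For what it's worth, after those two steps m holds the transformed values mirrored across the diagonal, which is easy to verify:
m
#      [,1] [,2] [,3] [,4]
# [1,]    1    4    9   16
# [2,]    4    1   49   16
# [3,]    9   49    1  100
# [4,]   16   16  100    1
isSymmetric(m)
# [1] TRUE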