I have 60 matrices in R named as mat1, mat2....mat60 and I would like to combine them into a big matrix using rbind. I know that I could write something like
matList <- list(mat1, mat2, ... mat60)
rbind(matList)
but it seems a very dumb solution. Any idea how I can simplify the process?
I supposed they all have same number of column (and same colnames). If so, try this:
do.call("rbind", matlist)
otherwise:
matlist <- lapply(matlist, as.data.frame)
library(plyr)
rbind.fill(matlist)
EDIT:
adding some timings:
lst <- list()
for(i in 1:1000)
lst[[i]] <- matrix(rnorm(10000, 100))
f1 <- function()
do.call("rbind", lst)
f2 <- function(){
lst <- lapply(lst, as.data.table)
rbindlist(lst)
}
library(data.table)
library(microbenchmark)
> microbenchmark(f1(), f2())
Unit: milliseconds
expr min lq median uq max neval
f1() 53.78661 55.22728 63.43546 66.08829 103.1996 100
f2() 210.46232 215.32043 217.93846 221.35012 333.2758 100
If the OP has got his data in matrices I thought that including lst <- lapply(lst, as.data.table) was the correct way of comparison. Otherwise it would be:
> lst.dt <- lapply(lst, as.data.table)
> f2 <- function(){
+ rbindlist(lst.dt)
+ }
> microbenchmark(f1(), f2())
Unit: milliseconds
expr min lq median uq max neval
f1() 49.00308 50.28515 54.98947 60.71945 87.66487 100
f2() 24.23454 28.57692 31.79278 32.75494 63.78825 100
I think this question really relates to the OP having to type out the names of the matrices manually. You can use mget to return the matrices in a list and then use do.call and rbind as posed by #Michele like this (assuming the matrices are located in the .GlobalEnv ):
matList <- mget(paste0("mat",1:60),env=globalenv())
bigm <- do.call("rbind" , matList)
This should be faster:
library(data.table)
rbindlist(matList)
EDIT
The above solution will work for list of data.frame or list, If you have a list of matrix you should convert them before:
rbindlist(lapply(ll,as.data.frame))
Related
I have an array a with some matrices in it. Now i need to efficiently check how many different matrices I have and what indices (in ascending order) they have in the array. My approach is the following: Paste the columns of the matrixes as character vectors and have a look at the frequency table like this:
n <- 10 #observations
a <- array(round(rnorm(2*2*n),1),
c(2,2,n))
paste_a <- apply(a, c(3), paste, collapse=" ") #paste by column
names(paste_a) <- 1:n
freq <- as.numeric( table(paste_a) ) # frequencies of different matrices (in ascending order)
indizes <- as.numeric(names(sort(paste_a[!duplicated(paste_a)])))
nr <- length(freq) #number of different matrices
However, as you increase n to large numbers, this gets very inefficient (it's mainly paste() that's getting slower and slower). Does anyone have a better solution?
Here is a "real" dataset with 100 observations where some matrices are actual duplicates (as opposed to my example above): https://pastebin.com/aLKaSQyF
Thank you very much.
Since your actual data is made up of the integers 0,1,2,3, why not take advantage of base 4? Integers are much faster to compare than entire matrix objects. (All occurrences of a below are of the data found in the real data set from the link.)
Base4Approach <- function() {
toBase4 <- sapply(1:dim(a)[3], function(x) {
v <- as.vector(a[,,x])
pows <- which(v > 0)
coefs <- v[pows]
sum(coefs*(4^pows))
})
myDupes <- which(duplicated(toBase4))
a[,,-(myDupes)]
}
And since the question is about efficiency, let's benchmark:
MartinApproach <- function() {
### commented this out for comparison reasons
# dimnames(a) <- list(1:dim(a)[1], 1:dim(a)[2], 1:dim(a)[3])
a <- a[,,!duplicated(a, MARGIN = 3)]
nr <- dim(a)[3]
a
}
identical(MartinApproach(), Base4Approach())
[1] TRUE
microbenchmark(Base4Approach(), MartinApproach())
Unit: microseconds
expr min lq mean median uq max neval
Base4Approach() 291.658 303.525 339.2712 325.4475 352.981 636.361 100
MartinApproach() 983.855 1000.958 1160.4955 1071.9545 1187.321 3545.495 100
The approach by #d.b. doesn't really do the same thing as the previous two approaches (it simply identifies and doesn't remove duplicates).
DBApproach <- function() {
a[, , 9] = a[, , 1]
#Convert to list
mylist = lapply(1:dim(a)[3], function(i) a[1:dim(a)[1], 1:dim(a)[2], i])
temp = sapply(mylist, function(x) sapply(mylist, function(y) identical(x, y)))
temp2 = unique(apply(temp, 1, function(x) sort(which(x))))
#The indices in 'a' where the matrices are same
temp2[lengths(temp2) > 1]
}
However, Base4Approach still dominates:
microbenchmark(Base4Approach(), MartinApproach(), DBApproach())
Unit: microseconds
expr min lq mean median uq max neval
Base4Approach() 298.764 324.0555 348.8534 338.899 356.0985 476.475 100
MartinApproach() 1012.601 1087.9450 1204.1150 1110.662 1162.9985 3224.299 100
DBApproach() 9312.902 10339.4075 11616.1644 11438.967 12413.8915 17065.494 100
Update courtesy of #alexis_laz
As mentioned in the comments by #alexis_laz, we can do much better.
AlexisBase4Approach <- function() {
toBase4 <- colSums(a * (4 ^ (0:(prod(dim(a)[1:2]) - 1))), dims = 2)
myDupes <- which(duplicated(toBase4))
a[,,-(myDupes)]
}
microbenchmark(Base4Approach(), MartinApproach(), DBApproach(), AlexisBase4Approach(), unit = "relative")
Unit: relative
expr min lq mean median uq max neval
Base4Approach() 11.67992 10.55563 8.177654 8.537209 7.128652 5.288112 100
MartinApproach() 39.60408 34.60546 27.930725 27.870019 23.836163 22.488989 100
DBApproach() 378.91510 342.85570 262.396843 279.190793 231.647905 108.841199 100
AlexisBase4Approach() 1.00000 1.00000 1.000000 1.000000 1.000000 1.000000 100
## Still gives accurate results
identical(MartinApproach(), AlexisBase4Approach())
[1] TRUE
My first attempt was actually really slow. So here is slightly changed version of yours:
dimnames(a) <- list(1:dim(a)[1], 1:dim(a)[2], 1:dim(a)[3])
a <- a[,,!duplicated(a, MARGIN = 3)]
nr <- dim(a)[3] #number of different matrices
idx <- dimnames(a)[[3]] # indices of left over matrices
I don't know if this is exactly what you want but here is a way you can extract indices where the matrices are same. More processing may be necessary to get what you want
#DATA
n <- 10
a <- array(round(rnorm(2*2*n),1), c(2,2,n))
a[, , 9] = a[, , 1]
temp = unique(apply(X = sapply(1:dim(a)[3], function(i)
sapply(1:dim(a)[3], function(j) identical(a[, , i], a[, , j]))),
MARGIN = 1,
FUN = function(x) sort(which(x))))
temp[lengths(temp) > 1]
#[[1]]
#[1] 1 9
I'm trying to find whether vector of characters maps to another another, and looking for a fast way of doing it in R.
Specifically, my character alphabet is amino acids:
aa.LETTERS <- c('G','P','A','V','L','I','M','C','F','Y','W','H','K','R','Q','N','E','D','S','T')
I have a vector of peptide and protein sequences:
set.seed(1)
peptides.vec <- sapply(1:100,function(p) paste(aa.LETTERS[sample(20,ceiling(runif(1,8,12)),replace=T)],collapse=""))
proteins.vec <- sapply(1:1000,function(p) paste(aa.LETTERS[sample(20,ceiling(runif(1,200,400)),replace=T)],collapse=""))
I want to try and see if for each peptide sequence in peptides.vec if it exists in any sequence in proteins.vec.
This is one of the obvious ways of doing it:
mapping.mat <- do.call(rbind,lapply(peptides.vec,function(p){
grepl(p,proteins.vec)
}))
Another one is using the Biostrings Bioconductor package:
require(Biostrings)
peptides.set <- AAStringSet(x=peptides.vec)
proteins.set <- AAStringSet(x=proteins.vec)
mapping.mat <- vcountPDict(peptides.set,proteins.set)
Both are slow for the dimensions I'm working with:
> microbenchmark(do.call(rbind,lapply(peptides.vec,function(p){
grepl(p,proteins.vec)
})),times=100)
Unit: milliseconds
expr min lq mean median uq max neval
do.call(rbind, lapply(peptides.vec, function(p) { grepl(p, proteins.vec) })) 477.2509 478.8714 482.8937 480.4398 484.3076 509.8098 100
> microbenchmark(vcountPDict(peptides.set,proteins.set),times=100)
Unit: milliseconds
expr min lq mean median uq max neval
vcountPDict(peptides.set, proteins.set) 283.32 284.3334 285.0205 284.7867 285.2467 290.6725 100
Any idea how to get this done faster?
As mentioned in my comment, adding fixed = TRUE will lead to some performance improvement, and "stringi" is likely to give a good boost too.
Here are some tests:
N <- as.integer(length(proteins.vec))
funOP <- function() {
do.call(rbind, lapply(peptides.vec, function(p) grepl(p, proteins.vec)))
}
funBASE_1 <- function() {
# Just adds "fixed = TRUE"
do.call(rbind, lapply(peptides.vec, function(p) grepl(p, proteins.vec, fixed = TRUE)))
}
funBASE_2 <- function() {
# Does away with the `do.call` but probably won't improve performance
vapply(peptides.vec, function(x) grepl(x, proteins.vec, fixed = TRUE), logical(N))
}
library(stringi)
funSTRINGI <- function() {
# Should be considerably faster
vapply(peptides.vec, function(x) stri_detect_fixed(proteins.vec, x), logical(N))
}
library(microbenchmark)
microbenchmark(funOP(), funBASE_1(), funBASE_2(), funSTRINGI())
# Unit: milliseconds
# expr min lq mean median uq max neval
# funOP() 344.500600 348.562879 352.94847 351.585206 356.508197 371.99683 100
# funBASE_1() 128.724523 129.763464 132.55028 132.198112 135.277821 139.65782 100
# funBASE_2() 128.564914 129.831660 132.33836 131.607216 134.380077 140.46987 100
# funSTRINGI() 8.629728 8.825296 9.22318 9.038496 9.444376 11.28491 100
Go "stringi"!
I have a list of large matrices. All these matrices have the same number of rows and I want to "unlist" them and bind all their columns together. Below is a piece of code that I wrote, but I am not sure if this is the best I can achieve in terms of computational efficiency.
# simulate
n <- 10
nr <- 24
nc <- 8000
test <- list()
set.seed(1234)
for (i in 1:n){
test[[i]] <- matrix(rnorm(nr*nc),nr,nc)
}
> system.time( res <- matrix( as.numeric( unlist(test) ) ,nr,nc*n) )
user system elapsed
0.114 0.006 0.120
To work on a list and call a function on all objects, do.call is my usual first idea, along with cbind here to bind by column all objects.
For n=100 (with others answers for sake of completeness):
n <- 10
nr <- 24
nc <- 8000
test <- list()
set.seed(1234)
for (i in 1:n){
test[[i]] <- matrix(rnorm(nr*nc),nr,nc)
}
require(data.table)
ori <- function() { matrix( as.numeric( unlist(test) ) ,nr,nc*n) }
Tensibai <- function() { do.call(cbind,test) }
BrodieG <- function() { `attr<-`(do.call(c, test), "dim", c(nr, nc * n)) }
nicola <- function() { setattr(unlist(test),"dim",c(nr,nc*n)) }
library(microbenchmark)
microbenchmark(r1 <- ori(),
r2 <- Tensibai(),
r3 <- BrodieG(),
r4 <- nicola(), times=10)
Results:
Unit: milliseconds
expr min lq mean median uq max neval cld
r1 <- ori() 23.834673 24.287391 39.49451 27.066844 29.737964 93.74249 10 a
r2 <- Tensibai() 17.416232 17.706165 18.18665 17.873083 18.192238 21.29512 10 a
r3 <- BrodieG() 6.009344 6.145045 21.63073 8.690869 10.323845 77.95325 10 a
r4 <- nicola() 5.912984 6.106273 13.52697 6.273904 6.678156 75.40914 10 a
As for the why (in comments), #nicola did give the answer about it, there's less copy than original method.
All methods gives the same result:
> identical(r1,r2,r3,r4)
[1] TRUE
It seems that do.call beats the other method due to a copy made during the matrix call. What is interesting is that you can avoid that copy using the data.table::setattr function which allows to set attributes by reference, avoiding any copy. I omitted also the as.numeric part, since it is not necessary (unlist(test) is already numeric). So:
require(microbenchmark)
require(data.table)
f1<-function() setattr(unlist(test),"dim",c(nr,nc*n))
f2<-function() do.call(cbind,test)
microbenchmark(res <-f1(),res2 <- f2(),times=10)
#Unit: milliseconds
# expr min lq mean median uq max neval
# res <- f1() 4.088455 4.183504 7.540913 4.44109 4.988605 35.05378 10
#res2 <- f2() 18.325302 18.379328 18.776834 18.66857 19.100681 19.47415 10
identical(res,res2)
#[1] TRUE
I think I have a better one. We can avoid some of the overhead from cbind since we know these all have the same number of rows and columns. Instead, we use c knowing that the underlying vector nature of the matrices will allow us to re-wrap them into the correct dimensions:
microbenchmark(
x <- `attr<-`(do.call(c, test), "dim", c(nr, nc * n)),
y <- do.call(cbind, test)
)
# Unit: milliseconds
# expr min lq
# x <- `attr<-`(do.call(c, test), "dim", c(nr, nc * n)) 4.435943 4.699006
# y <- do.call(cbind, test) 19.339477 19.567063
# mean median uq max neval cld
# 12.76214 5.209938 9.095001 379.77856 100 a
# 21.64878 20.000279 24.210848 26.02499 100 b
identical(x, y)
# [1] TRUE
If you have varying number of columns you can probably still do this with some care in computing the total number of columns.
I'm trying to find a vectorised trick to calculate the products between row i and col i of two matrices, without wasting resources on the other products (row i and col j, i!=j).
A <- matrix(rnorm(4*5), nrow=4)
B <- matrix(rnorm(5*4), ncol=4)
diag(A %*% B)
Is there a name for this product, a base R function, or a reshaping strategy that avoids a for loop?
for (ii in seq.int(nrow(A)))
print(crossprod(A[ii,], B[,ii]))
rowSums(A * t(B)) seems to be quite fast:
A <- matrix(rnorm(400*500), nrow=400)
B <- matrix(rnorm(500*400), ncol=400)
bF <- function() diag(A %*% B)
jF <- function() rowSums(A * t(B))
vF <- function() mapply(crossprod, as.data.frame(t(A)), as.data.frame(B))
lF <- function() {
vec <- numeric(nrow(A))
for (ii in seq.int(nrow(A)))
vec[ii] <- crossprod(A[ii,], B[,ii])
vec
}
library(microbenchmark)
microbenchmark(bF(), jF(), vF(), lF(), times = 100)
# Unit: milliseconds
# expr min lq median uq max neval
# bF() 137.828993 183.320782 185.823658 200.747130 207.67997 100
# jF() 4.434627 5.300882 5.341477 5.475393 46.96347 100
# vF() 39.110948 51.071936 54.147338 55.127911 102.17793 100
# lF() 14.029454 18.667055 18.931154 22.166137 65.40562 100
How about this?
mapply(crossprod, as.data.frame(t(A)), as.data.frame(B))
I have 60 matrices in R named as mat1, mat2....mat60 and I would like to combine them into a big matrix using rbind. I know that I could write something like
matList <- list(mat1, mat2, ... mat60)
rbind(matList)
but it seems a very dumb solution. Any idea how I can simplify the process?
I supposed they all have same number of column (and same colnames). If so, try this:
do.call("rbind", matlist)
otherwise:
matlist <- lapply(matlist, as.data.frame)
library(plyr)
rbind.fill(matlist)
EDIT:
adding some timings:
lst <- list()
for(i in 1:1000)
lst[[i]] <- matrix(rnorm(10000, 100))
f1 <- function()
do.call("rbind", lst)
f2 <- function(){
lst <- lapply(lst, as.data.table)
rbindlist(lst)
}
library(data.table)
library(microbenchmark)
> microbenchmark(f1(), f2())
Unit: milliseconds
expr min lq median uq max neval
f1() 53.78661 55.22728 63.43546 66.08829 103.1996 100
f2() 210.46232 215.32043 217.93846 221.35012 333.2758 100
If the OP has got his data in matrices I thought that including lst <- lapply(lst, as.data.table) was the correct way of comparison. Otherwise it would be:
> lst.dt <- lapply(lst, as.data.table)
> f2 <- function(){
+ rbindlist(lst.dt)
+ }
> microbenchmark(f1(), f2())
Unit: milliseconds
expr min lq median uq max neval
f1() 49.00308 50.28515 54.98947 60.71945 87.66487 100
f2() 24.23454 28.57692 31.79278 32.75494 63.78825 100
I think this question really relates to the OP having to type out the names of the matrices manually. You can use mget to return the matrices in a list and then use do.call and rbind as posed by #Michele like this (assuming the matrices are located in the .GlobalEnv ):
matList <- mget(paste0("mat",1:60),env=globalenv())
bigm <- do.call("rbind" , matList)
This should be faster:
library(data.table)
rbindlist(matList)
EDIT
The above solution will work for list of data.frame or list, If you have a list of matrix you should convert them before:
rbindlist(lapply(ll,as.data.frame))