I have 2 large matrices (typically of dimensions 5000 x 40 and 20000 x 40). I am trying to create a correlation matrix, where I would like to calculate the correlation of each row of the first matrix with every row of the second matrix. I have the following minimal code, but it takes an extremely long time. Any recommendations to speed it up or parallelize it? Thanks
-Jaison
nprots <- 50 #usually ca. 5000
ngenes <- 1000 #usually ca. 20000
a_mat <- matrix( runif(40*nprots, 120, 116000), ncol=40)
b_mat <- matrix( runif(40*ngenes, 0.1, 1000), ncol=40)
system.time(apply( a_mat, 1, function(xx)
apply(b_mat, 1, cor, y = xx, use = "pairwise.complete.obs")) -> cor_mat)
Thanks Richard, I did see the post you referred to on SO, but passed it over without thinking too much about it because it did not seem to pertain to my problem: I have 2 matrices across which I perform correlations. A little more thought, and it occurred to me that I can simply extend the solution you pointed out to suit my question: rbind the two matrices, follow the solution you referred to, and finally extract the relevant corner of the correlation matrix. This runs faster than my original code using apply and cor. I double-checked the answers and all seem OK. Below is my current solution.
fast_cor <- function(a, b) {
  # correlate every row of 'a' with every row of 'b'
  mat <- rbind(a, b)
  # centre and scale each row so that tcrossprod() yields Pearson correlations
  mat <- mat - rowMeans(mat)
  mat <- mat / sqrt(rowSums(mat^2))
  cr <- tcrossprod(mat)
  # keep only the block of correlations between rows of 'a' and rows of 'b'
  edge <- dim(a)[1]
  cr <- t(cr[1:edge, -c(1:edge)])
  return(cr)
}
nprots <- 50 #usually ca. 5000
ngenes <- 10000 #usually ca. 20000
a_mat <- matrix( runif(40*nprots, 120, 116000), ncol=40)
b_mat <- matrix( runif(40*ngenes, 0.1, 1000), ncol=40)
system.time(apply( a_mat, 1, function(xx)
apply(b_mat, 1, cor, y = xx, use = "pairwise.complete.obs")) -> c_1)
user system elapsed
20.48 0.00 20.48
system.time(c_2 <- fast_cor(a_mat, b_mat))
user system elapsed
1.97 0.11 2.08
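As a quick sanity check (an addition, not part of the original post), the two results can be compared directly; they should agree up to floating-point tolerance:
# c_1 comes from the apply/cor loop, c_2 from fast_cor(); both are ngenes x nprots
all.equal(c_1, c_2)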
Assume I have a numeric matrix with 10,000 rows and 8 columns. I want to obtain the k nearest neighbors for each row (skipping the first 1,000 rows) using Euclidean distance, but the catch is that for each row I am only interested in the previous rows (e.g. for the 2001st row, I only search the first 2,000 rows).
Changing the reference data for each row is too slow. The fastest function I could write used RANN to get the 2k (or 5k) closest neighbors and then filtered out the future observations.
A slow example: (200 rows, 5 columns, 3 nearest neighbors)
data = matrix(rnorm(1000), nrow = 200, ncol = 5)
result <- list()
for (i in c(101:200)) {
distances <- apply(data[1:(i-1),], 1, function(x) {
dist(rbind(x, data[i, ]))
})
neighbors <- sort(distances, index.return = TRUE)$ix[1:3]
result[[i - 100]] <- neighbors
}
The fastest approach was using RANN::nn2(data = data[1:200,], query = data[101:200,], k = 2*k), then filtering out the future ones (and hoping at least k values remain).
The filtering step and the unnecessary computation of extra neighbors increase the run time significantly.
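For concreteness, here is a rough sketch (my own, not from the original post) of that filter-afterwards idea for the 200-row example above, assuming k = 3; it only matches the loop above when at least k previous rows survive the filter:
library(RANN)
k <- 3
knn <- nn2(data = data[1:200, ], query = data[101:200, ], k = 2 * k)
result_fast <- lapply(seq_len(nrow(knn$nn.idx)), function(j) {
  idx <- knn$nn.idx[j, ]       # candidate neighbours, sorted by distance
  head(idx[idx < j + 100], k)  # keep only rows before query row j + 100
})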
I would be glad to hear any suggestions.
A solution using RANN::nn2. The idea is to break the data into chunks and process each separately before combining the results. It processes a 10k-by-8 matrix in a fraction of a second.
library(RANN) # for the nn2 function
library(Rfast) # for the rowOrder function
library(data.table) # for the rbindlist function
nr <- 1e4L
start <- 1001L
nc <- 8L
k <- 3L
data <- matrix(rnorm(nr*nc), nr, nc)
system.time({
n <- 450L # chunk size
n1 <- n - 1L
nn1 <- rbindlist(
lapply(
# split the data into chunks
as.data.frame(matrix(start:nr, n)),
function(i) {
# initialise a matrix with Inf so that unavailable pairs can never be selected
d <- matrix(Inf, n1, n1)
# fill the lower triangle of d with the within-chunk pairwise distances
d[sequence(n1:1, seq(1, n1^2, n))] <- dist(data[i,])
# k nearest neighbors among all rows before this chunk
nn <- nn2(data[1:(i[1] - 1L),], data[i,], k)
# combine previous-row and within-chunk distances and rank them row-wise
out <- rowOrder(cbind(nn$nn.dists, rbind(Inf, d)))[,1:k]
# which neighbors are from a previous chunk?
iPrev <- which(out <= k)
# indices of nearest neighbors from a previous chunk
out[iPrev] <- nn$nn.idx[cbind(((iPrev - 1L) %% n) + 1L, out[iPrev])]
# indices of nearest neighbors from current chunk
out[-iPrev] <- out[-iPrev] - k + i[1] - 1L
# convert to a data.table in order to use rbindlist
as.data.table(out)
}
), FALSE
)
})
#> user system elapsed
#> 0.46 0.00 0.47
Compare the result to a filtering approach. Note that the code below is not guaranteed to find the nearest previous neighbor, but it could be modified to iteratively increase k for rows that fail to do so.
system.time({
nn2 <- with(
# get the 100 nearest neighbors
nn2(data, data[start:nr,], 100),
# find the nearest from a previous row
matrix(nn.idx[cbind(1:nrow(nn.idx), c(rowOrder(1 - (nn.idx < start:nr), TRUE)[,1:k]))], ncol = k)
)
})
#> user system elapsed
#> 1.31 0.00 1.31
identical(unlist(nn1, 0, 0), c(nn2))
#> [1] TRUE
I have a few moderately large data frames and need to do a calculation across different columns in the data; for example, I want to compare column i in one data frame with column i - 1 in another. I currently use a for loop. The calculation involves element-wise comparison of each pair of values, so it is somewhat slow: e.g. I take each column of data, turn it into a matrix, and compare it with the transpose of itself (with some additional complications). In my application (in which the data have about 100 columns and 3000 rows) this currently takes about 95 seconds. I am looking for ways to make this more efficient. If I were comparing the SAME column of each data frame I would try mapply, but because I need to make comparisons across different columns I don't see how that could work. The current code is something like this:
d1 <- as.data.frame(matrix(rnorm(100000), nrow=1000))
d2 <- as.data.frame(matrix(rnorm(100000), nrow=1000))
r <- list()
ptm2 <- proc.time()
for(i in 2:100){
t <- matrix(0 + d1[,i] > 0,1000,1000)
u <- matrix(d1[,i],1000,1000)*t(matrix(d2[,i-1],1000,1000))
r[[i]] <- t * u
}
proc.time() - ptm2
This takes about 3 seconds on my computer; as mentioned, the actual calculation is a bit more complicated than this MWE suggests. Obviously one could also improve the efficiency of the calculation itself, but I am looking for a solution to the 'compare column i to column i-1' issue.
Based on your example, if you align the d1 and d2 matrices ahead of time according to which columns you are comparing, then here is how you could use mapply. It appears to be only marginally faster, so parallel computing would be a better way to achieve speed gains (a rough parallel sketch follows the mapply code below).
d1 <- as.data.frame(matrix(rnorm(100000), nrow=1000))
d2 <- as.data.frame(matrix(rnorm(100000), nrow=1000))
r <- list()
ptm2 <- proc.time()
for(i in 2:100){
t <- matrix(0 + d1[,i] > 0,1000,1000)
u <- matrix(d1[,i],1000,1000)*t(matrix(d2[,i-1],1000,1000))
r[[i]] <- t * u
}
proc.time() - ptm2
#user system elapsed
#0.90 0.87 1.79
#select last 99 columns of d1 and first 99 columns of d2 based on your calcs
d1_99 <- as.data.frame(d1[,2:100]) #have to convert to data.frame for mapply to loop across columns; a data.frame is simply a list of vectors of equal length
d2_99 <- as.data.frame(d2[,1:99])
ptm3 <- proc.time()
r_test <- mapply(function(x, y) {
t <- matrix(x > 0, 1000, 1000) #didn't understand why you were adding 0 in your example
u <- matrix(x,1000,1000)*t(matrix(y,1000,1000))
t * u
}, x=d1_99, y=d2_99, SIMPLIFY = FALSE)
proc.time() - ptm3
#user system elapsed
#0.91 0.83 1.75
class(r_test)
#[1] "list"
length(r_test)
#[1] 99
#test for equality
all.equal(r[[2]], r_test[[1]])
#[1] TRUE
all.equal(r[[100]], r_test[[99]])
#[1] TRUE
I have a RasterStack s1 consisting of 400 layers with data from an island. The extent of the raster is cropped to the extent of the island, but due to its irregular shape, only around 20% of the pixels are actually land area and have data values; the other 80% are water and NA.
I also have a land-water-mask lwm (RasterLayer), where land is coded as 1 and water as NA.
I would like to do different kinds of cell-based calculations on s1, but noticed that these take a long time to finish. To speed things up, the calculations should only be carried out for cells that are land area, whereas water areas should always be NA. In pseudo-code:
for each cell:
    if cell is land:
        do calculation
    if cell is water:
        return(NA)
An additional requirement is memory-safety.
Here is some sample data to illustrate the problem:
library(raster)
# generate data
lwm <- raster(nrow = 5, ncol = 5)
lwm[] <- c(rep(NA, 10), rep(1, 5), rep(NA, 10))
r1 <- raster(nrow = 5, ncol = 5)
r1[] <- runif(ncell(r1)) * 10
r2 <- raster(nrow = 5, ncol = 5)
r2[] <- runif(ncell(r2)) * 10
s1 <- stack(r1, r2)
s1 <- mask(s1, lwm)
# this works, but all NA-values on water are also unnecessarily evaluated
calc(s1, function(x) {sum(!is.na(x))})
After some playing around, I finally found a solution that works very well in my case and shaves a nice amount off the overall processing time:
library(raster)
# generate data
lwm <- raster(nrow = 50, ncol = 50)
lwm[] <- 1
# replace 80% with NA values
lwm[sample(1:ncell(lwm), round(0.8 * ncell(lwm)))] <- NA
r1 <- raster(lwm)
r1[] <- runif(ncell(r1))
r1_list <- replicate(400 , r1)
s1 <- stack(r1_list)
s1 <- mask(s1, lwm)
# this works, but all NA-values on water are also unnecessarily evaluated
system.time(r_sum1 <- calc(s1, function(x) {sum(x)}))
#user system elapsed
#0.14 0.00 0.14
## new approach:
# stack land-water-mask with RasterStack
s1_lwm <- stack(lwm, s1)
# function to check if first element of vector is NA; if yes, return NA; if no, do calculation
fun1 <- function(y) {
if (!is.na(y[1])) {
y = y[-1]
return(sum(y))
} else {
return(NA)
}
}
system.time(
r_sum2 <- calc(s1_lwm, fun = fun1)
)
# user system elapsed
# 0.4 0.0 0.4
# results are identical
identical(r_sum1[], r_sum2[])
That's a tricky one and unfortunately I don't have a straightforward solution for you.
You can either do multiple crops of the island (e.g. 2-3) to minimise NA values, do the calculations separately on each cropped raster, and mosaic the results.
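A rough sketch of that crop-and-mosaic idea, using the example stack s1 from above (the two extents are hypothetical; in practice you would pick 2-3 rectangles that together cover the island while excluding as much open water as possible):
e_west <- extent(-180, 0, -90, 90)  # hypothetical western block
e_east <- extent(0, 180, -90, 90)   # hypothetical eastern block
sum_west <- calc(crop(s1, e_west), fun = sum)
sum_east <- calc(crop(s1, e_east), fun = sum)
r_sum_mosaic <- mosaic(sum_west, sum_east, fun = mean)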
Another option is to do the calculation in parallel, which will speed up the process significantly:
#initialize cluster
#number of cores to use for clusterR function (max recommended: ncores - 1)
beginCluster(3)
#calculation
result <- clusterR(s1, calc, args=list(fun=function(x) {sum(!is.na(x))}))
#end cluster
endCluster()
Since you asked for a memory-safe solution, you should check how much RAM is being allocated when you use only a single core, and then estimate how many cores you can run your calculation on without running out of memory.
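As a very rough back-of-the-envelope sketch (my own assumption, not a precise rule): one full in-memory copy of the stack needs about ncell * nlayers * 8 bytes, which gives a conservative upper bound on the worker count:
copy_gb <- ncell(s1) * nlayers(s1) * 8 / 2^30  # one in-memory copy of the stack, in GiB
free_gb <- 8                                   # hypothetical amount of available RAM
floor(free_gb / copy_gb)                       # conservative cap on the number of workers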
Good luck! Hope it helps.
I'm the maintainer of the seqMeta package and am looking for ideas on how to speed up the bottleneck of splitting a large matrix into smaller pieces a large number of times.
Background
The seqMeta package is used to analyze genetic data. So you have a group of subjects (n_subjects) and a number of genetic markers (n_snps). This leads to an n_subjects x n_snps matrix (Z). There is also a data frame that tells you which snps get grouped together for analysis (typically which snps comprise a given gene).
While Z may be large, it is quite sparse. Typically less than 10%, and sometimes around 2%, of the values are non-zero. A sparse matrix representation seems like an obvious choice to save space.
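A quick back-of-the-envelope illustration (my own, not from the original post): at ~5% density, a dense 1e3 x 1e5 double matrix like the simulated one below needs 1e3 * 1e5 * 8 bytes, i.e. roughly 800 MB, whereas a dgCMatrix stores about 12 bytes per non-zero value (an 8-byte double plus a 4-byte row index) plus one integer per column, i.e. on the order of 60 MB:
library(Matrix)
# an all-zero sparse matrix of the same shape only carries the column pointers
print(object.size(Matrix(0, nrow = 1e3, ncol = 1e5, sparse = TRUE)), units = "auto")
1e3 * 1e5 * 8 / 2^20  # dense storage requirement in MiB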
A current project has n_subjects ~15,000 and n_snps ~2 million, with over 200,000 splits.
As the data size continues to grow, I've found the limiting factor tends to be the number of groupings, not the actual size of the data. (See the example below: runtime is a linear function of n_splits for the same data.)
Simplified Example
library(Matrix)
set.seed(1)
n_subjects <- 1e3
n_snps <- 1e5
sparsity <- 0.05
n <- floor(n_subjects * n_snps * sparsity)
# create our simulated data matrix
Z <- Matrix(0, nrow = n_subjects, ncol = n_snps, sparse = TRUE)
pos <- sample(1:(n_subjects*n_snps), size = n, replace = FALSE)
vals <- rnorm(n)
Z[pos] <- vals
# create the data frame describing how to split
# in the real data set the grouping size is between 1 and ~1500
n_splits <- 500
sizes <- sample(2:20, size = n_splits, replace = TRUE)
lkup <- data.frame(gene_name=rep(paste0("g", 1:n_splits), times = sizes),
snps = sample(n_snps, size = sum(sizes)))
# simple function that gets called on the split
# the real function creates a cols x cols dense upper triangular matrix
# similar to a covariance matrix
simple_fun <- function(Z, cols) {sum(Z[ , cols])}
# split our matrix based on the look-up table
system.time(
res <- tapply(lkup[ , "snps"], lkup[ , "gene_name"], FUN=simple_fun, Z=Z, simplify = FALSE)
)
## user system elapsed
## 3.21 0.00 3.21
n_splits <- 1000
sizes <- sample(2:20, size = n_splits, replace = TRUE)
lkup <- data.frame(gene_name=rep(paste0("g", 1:n_splits), times = sizes),
snps = sample(n_snps, size = sum(sizes)))
# split our matrix based on the look-up table
system.time(
res <- tapply(lkup[ , "snps"], lkup[ , "gene_name"], FUN=simple_fun, Z=Z, simplify = FALSE)
)
## user system elapsed
## 6.38 0.00 6.38
n_splits <- 5000
sizes <- sample(2:20, size = n_splits, replace = TRUE)
lkup <- data.frame(gene_name=rep(paste0("g", 1:n_splits), times = sizes),
snps = sample(n_snps, size = sum(sizes)))
# split our matrix based on the look-up table
system.time(
res <- tapply(lkup[ , "snps"], lkup[ , "gene_name"], FUN=simple_fun, Z=Z, simplify = FALSE)
)
## user system elapsed
## 31.65 0.00 31.66
Question: Is there a faster way to subset a Matrix than '['? Or other approaches I'm missing?
I am trying to develop a function to "synchronise" NAs among layers of a raster stack, i.e. to make sure that for any given pixel of the stack, if one layer has an NA, then all layers are set to NA for that pixel.
This is particularly useful when combining rasters from varying sources for species distribution modelling, because some models do not handle NAs properly.
I have found two ways to do this, but I find neither of them satisfactory. One requires the function getValues and is thus not usable for very large stacks or computers with low RAM. The other is more memory-safe but much slower. I am therefore asking here whether anyone has ideas to improve my attempts.
Here are the two possibilities:
Using getValues()
syncNA1 <- function (x)
{
val <- getValues(x)
NA.pos <- unique(which(is.na(val), arr.ind = T)[, 1])
val[NA.pos, ] <- NA
x <- setValues(x, val)
return(x)
}
Using calc()
syncNA2 <- function(y)
{
calc(y, na.rm = T, fun = function(x, na.rm = na.rm)
{
if(any(is.na(x)))
{
rep(NA, length(x))
} else
{
x
}
})
}
Now a demonstration of their respective computing times for the same stack:
> system.time(
+ b1 <- syncNA1(a1)
+ )
user system elapsed
3.04 0.15 3.20
> system.time(
+ b2 <- syncNA2(a1)
+ )
user system elapsed
5.89 0.19 6.08
Many thanks for your help,
Boris
With a stack named "s", I would first use calc(s, fun = sum) to compute a mask layer that records the location of all cells with an NA value in at least one of the stack's layers. mask() will then allow you to apply that mask to every layer in the stack.
Here's an example:
library(raster)
## Construct reproducible data! (Here a raster stack with NA values in each layer)
m <- raster(ncol=10, nrow=10)
n <- raster(ncol=10, nrow=10)
m[] <- runif(ncell(m))
n[] <- runif(ncell(n)) * 10
m[m < 0.5] <- NA
n[n < 5] <- NA
s <- stack(m,n)
## Synchronize the NA values
s2 <- mask(s, calc(s,fun = sum))
## Check that it worked
plot(s2)
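Besides the visual check, here is a quick programmatic check (my addition, not part of the original answer) that the NA pattern is now identical across layers:
v <- getValues(s2)
# every cell should be NA either in all layers or in none of them
all(rowSums(is.na(v)) %in% c(0, nlayers(s2)))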
I don't know about speed, but you might try converting to an array, filling in the NAs, and converting back. Pseudocode:
xarray <- as.array(xstack)
ind.na <- which(is.na(xarray), arr.ind = TRUE)
for (j in seq_len(nrow(ind.na))) {
  xarray[ind.na[j, 1], ind.na[j, 2], ] <- NA
}
nastack <- stack(brick(xarray))
I haven't verified the correct choice of indices there, nor have I verified I converted back to raster stack correctly, but I hope you get the idea.
EDIT: I ran a time test, with rasters 1000x1000 but otherwise as Josh created.
microbenchmark(josh(s),syncNA1(s),syncNA2(s),times=5)
Unit: milliseconds
       expr       min        lq    median        uq        max neval
    josh(s)  774.2363  789.1653  800.2511  806.5364   809.9087     5
 syncNA1(s)  652.3928  659.8327  692.3578  695.8057   743.9123     5
 syncNA2(s) 7951.3918 8291.7917 8604.2226 8606.3432 10254.4739     5
I ended up building a hybrid function combining syncNA1 and Josh's solution.
The function falls back to the memory-safe approach when the computer does not have enough RAM, and uses the faster in-memory approach when it does:
synchroniseNA <- function(x)
{
if(canProcessInMemory(x, n = 2))
{
val <- getValues(x)
NA.pos <- unique(which(is.na(val), arr.ind = T)[, 1])
val[NA.pos, ] <- NA
x <- setValues(x, val)
return(x)
} else
{
x <- mask(x, calc(x, fun = sum))
return(x)
}
}
However, I empirically determined that the amount of RAM used by the in-memory values is about twice the size of the raster file (hence the n = 2 argument to canProcessInMemory()), but I am not entirely sure this is right.