I'm attempting to apply the sweep function to a sparse matrix (dgCMatrix). Unfortunately, when I do that I get a memory error. It seems that sweep is expanding my sparse matrix to a full dense matrix.
Is there an easy way to perform this operation without it blowing up my memory?
This is what I'm trying to do.
sparse_matrix <- sweep(sparse_matrix, 1, vector_to_multiply, '*')
I'm working with a big and very sparse dgTMatrix (200k rows and 10k columns) in an NLP problem. After hours of thinking about a good solution, I created an alternative sweep function for sparse matrices. It is very fast and memory efficient: it took just 1 second and less than 1 GB of memory to multiply all matrix rows by an array of weights. For margin = 1 it works for both dgCMatrix and dgTMatrix.
Here it is:
sweep_sparse <- function(x, margin, stats, fun = "*") {
  f <- match.fun(fun)
  if (margin == 1) {
    idx <- x@i + 1  # 0-based row indices of the non-zero entries, hence the + 1
  } else {
    idx <- x@j + 1  # 0-based column indices (only present in triplet-form dgTMatrix)
  }
  # apply fun to the non-zero values only, so the matrix is never densified
  x@x <- f(x@x, stats[idx])
  return(x)
}
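A minimal usage sketch (the toy matrix and weight vector below are made up; for margin = 1 the weight vector just needs one entry per row):
library(Matrix)
set.seed(1)
sparse_matrix <- rsparsematrix(1000, 500, density = 0.01)   # toy dgCMatrix
vector_to_multiply <- runif(nrow(sparse_matrix))            # one weight per row
scaled <- sweep_sparse(sparse_matrix, margin = 1, stats = vector_to_multiply)
# spot-check a small block against base sweep() on the densified matrix
all.equal(as.matrix(scaled)[1:5, 1:5],
          sweep(as.matrix(sparse_matrix), 1, vector_to_multiply, '*')[1:5, 1:5])
#[1] TRUE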
I second @user20650's recommendation to use direct multiplication of the form mat * vec, which multiplies every column of your matrix mat element-wise with your vector vec by implicitly recycling vec.
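For a sparse matrix this only touches the stored non-zero values. A quick sketch on toy data (dimensions made up here) confirming that the two forms agree:
library(Matrix)
set.seed(1)
mat_demo <- rsparsematrix(100, 50, density = 0.05)
vec_demo <- runif(nrow(mat_demo))
# recycling runs down each column, so row i is scaled by vec_demo[i]
all.equal(as.matrix(mat_demo * vec_demo),
          sweep(as.matrix(mat_demo), 1, vec_demo, '*'))
#[1] TRUE
class(mat_demo * vec_demo)  # still a sparse Matrix class, no densification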
Processing time profiling
I understand that your main requirement here is memory, but it's interesting to perform a microbenchmark comparison of the sweep and direct-multiplication methods for both a dense and a sparse matrix:
# Sample data
library(Matrix)
set.seed(2018)
mat <- matrix(sample(c(0, 1), 10^6, replace = T), nrow = 10^3)
mat_sparse <- Matrix(mat, sparse = T)
vec <- 1:dim(mat)[1]
library(microbenchmark)
res <- microbenchmark(
  sweep_dense = sweep(mat, 1, vec, '*'),
  sweep_sparse = sweep(mat_sparse, 1, vec, '*'),
  mult_dense = mat * vec,
  mult_sparse = mat_sparse * vec
)
res
Unit: milliseconds
expr min lq mean median uq max
sweep_dense 8.639459 10.038711 14.857274 13.064084 18.07434 32.2172
sweep_sparse 116.649865 128.111162 162.736864 135.932811 155.63415 369.3997
mult_dense 2.030882 3.193082 7.744076 4.033918 7.10471 184.9396
mult_sparse 12.998628 15.020373 20.760181 16.894000 22.95510 201.5509
library(ggplot2)
autoplot(res)
On average the operations involving a sparse matrix are actually slower than their dense counterparts. Note, however, that direct multiplication is faster than sweep in both cases.
Memory profiling
We can use profmem to profile the memory usage of the different approaches.
library(profmem)
mem <- list(
  sweep_dense = profmem(sweep(mat, 1, vec, '*')),
  sweep_sparse = profmem(sweep(mat_sparse, 1, vec, '*')),
  mult_dense = profmem(mat * vec),
  mult_sparse = profmem(mat_sparse * vec))
lapply(mem, function(x) utils:::format.object_size(sum(x$bytes), units = "Mb"))
#$sweep_dense
#[1] "15.3 Mb"
#
#$sweep_sparse
#[1] "103.1 Mb"
#
#$mult_dense
#[1] "7.6 Mb"
#
#$mult_sparse
#[1] "13.4 Mb"
To be honest, I'm surprised that the memory footprint of the direct multiplication with a sparse matrix is not smaller than that involving a dense matrix. Perhaps the sample data are too simplistic. It might be worth exploring this with your actual data (or a representative subset thereof).
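One way to check is to rerun the memory profiling on data closer to the scale described in the question; here is a sketch with assumed dimensions (200k x 10k) and an assumed density of 0.1%:
library(Matrix)
library(profmem)
set.seed(2018)
big_sparse <- rsparsematrix(2e5, 1e4, density = 0.001)  # roughly 2 million non-zeros
weights <- runif(nrow(big_sparse))
mem_big <- profmem(big_sparse * weights)
utils:::format.object_size(sum(mem_big$bytes), units = "Mb")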
I have a list which contains 4438 dataframes of different sizes. I am not sure how to make a reproducible example, but the way I obtained the list is by using the expand.grid function to get a dataframe with all possible combinations of elements:
library(magrittr)  # for %>%
library(purrr)     # for map_depth
citation <- citation %>%
  map_depth(., 1, expand.grid)
List before applying expand.grid
List after applying expand.grid
What I want to achieve is, for each dataframe, to count the number of unique values per row and then find the minimum of those row counts within the dataframe.
First, I wrote the function below:
fun1 <- function(res){
  min(apply(res, 1, function(x) length(unique(x))))
}
Then, apply the function to each dataframe:
library(furrr)
plan(multisession, workers = 4)
min_set <- c()
min_set <- citation %>% future_map_dbl(fun1)
However, the calculation is super slow, taking almost 30 minutes to complete. I would like to find another way to speed up the performance. Looking forward to hearing your solutions. Thank you in advance.
To speed up the current approach of enumerating the combinations, use rowTabulate from the Rfast package (or rowTabulates from the matrixStats package).
However, it will be much faster to get the desired results with the setcover function in the adagio package, which solves the set cover problem directly (i.e., without the use of expand.grid) via integer linear programming with lp from the lpSolve package.
library(Rfast) # for the rowTabulate function
library(adagio) # for the setcover function
# reproducible example data
set.seed(1141593349)
citation1 <- list(
  lapply(c(5, 2, 8, 12, 6, 38), function(size) sample(50, size)),
  lapply(c(5, 2, 8, 12, 7), function(size) sample(50, size))
)
# get all combinations of the indices of the unique values for each list in citation1
citation2 <- lapply(citation1, function(x) expand.grid(lapply(x, match, table = unique(unlist(x)))))
# original solution
fun1 <- function(res) min(apply(res, 1, function(x) length(unique(x))))
# faster version of the original solution
fun2 <- function(res) min(rowsums(rowTabulate(as.matrix(res)) > 0L))
# linear programming solution (uses citation1 rather than citation2)
fun3 <- function(res) {
  v <- unlist(res)
  # 0/1 incidence matrix: m[v, j] = 1 if value v occurs in the j-th vector
  m <- matrix(0L, max(v), length(res))
  m[cbind(v, rep.int(seq_along(res), lengths(res)))] <- 1L
  # minimum number of values needed to cover (hit) every vector
  setcover(m)$objective
}
microbenchmark::microbenchmark(fun1 = sapply(citation2, fun1),
                               fun2 = as.integer(sapply(citation2, fun2)),
                               fun3 = as.integer(sapply(citation1, fun3)),
                               times = 10,
                               check = "identical")
#> Unit: milliseconds
#> expr min lq mean median uq max
#> fun1 1110.4976 1162.003601 1217.049501 1204.608151 1281.121601 1331.057001
#> fun2 101.5173 113.123501 142.265371 145.964502 165.788700 187.196301
#> fun3 1.4038 1.461101 1.734781 1.850701 1.870801 1.888702
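Since fun3 consumes the lists of vectors directly, it can be applied to the list from the question before expand.grid is mapped over it. A sketch, with an extra match() step added here so the incidence matrix stays small even if the raw values are large IDs rather than small integers:
# `citation_raw` is a hypothetical stand-in for the question's citation list
# *before* map_depth(., 1, expand.grid) was applied
min_set <- sapply(citation_raw, function(x) {
  fun3(lapply(x, match, table = unique(unlist(x))))
})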
For the base R matrix class we have the rowsum function, which is very fast for computing column sums across groups of rows.
Is there an equivalent function or approach implemented in the Matrix-package?
I'm particularly interested in a fast alternative to rowsum for large dgCMatrix-objects (i.e. millions of rows, but roughly 95% sparse).
I know this is an old question, but Matrix::rowSums might be the function you are looking for.
The DelayedArray BioConductor package now has a rowsum function that accepts sparse matrices that has been very fast when I tried it.
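A minimal sketch of that route, assuming DelayedArray's rowsum follows the base rowsum(x, group) interface; depending on the version you may be able to pass the dgCMatrix directly instead of wrapping it:
# Bioconductor package; install with BiocManager::install("DelayedArray")
library(Matrix)
library(DelayedArray)
set.seed(1)
m_demo <- rsparsematrix(1e5, 100, density = 0.05)  # toy stand-in for the large dgCMatrix
grp <- sample(paste0("g", 1:10), nrow(m_demo), replace = TRUE)
gs <- rowsum(DelayedArray(m_demo), grp)            # group-wise sums across rows
dim(gs)                                            # expected: 10 groups x 100 columns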
Here is an approach using matrix multiplication, based on an example in https://slowkow.com/notes/sparse-matrix/. First, let's create a sparse matrix to play with,
library(magrittr)
library(forcats)
library(stringr)
library(Matrix)
set.seed(42)
m <- sparseMatrix(
  i = sample(x = 1e4, size = 1e4),
  j = sample(x = 1e4, size = 1e4),
  x = rnorm(n = 1e4)
)
colnames(m) <- str_c("col", seq(ncol(m)))
rownames(m) <- str_c("row", seq(nrow(m)))
and a grouping vector defining which rows to sum,
group <- sample(1:10, nrow(m), replace = TRUE) %>%
  paste0("new_row", .) %>%
  fct_inorder
Whether group is a factor and its level order will affect the final row order in the merged matrix. I made group a factor with levels ordered by first appearance in group to make the row order resemble that from the rowsum() operation with reorder = FALSE.
Next, we create a (sparse) matrix that we can left-multiply by m to get a version of m whose rows have been summed based on group,
group_mat <- sparse.model.matrix(~ 0 + group) %>% t
# Adjust row names to get the correct final row names
rownames(group_mat) <- rownames(group_mat) %>% str_extract("(?<=^group).+")
msum <- group_mat %*% m
The result matches base::rowsum() on the dense version of the matrix,
d <- as.matrix(m)
dsum <- rowsum(d, group, reorder = FALSE)
all.equal(as.matrix(msum), dsum)
#> [1] TRUE
but the sparse-matrix multiplication method is much faster,
bench::mark( msum <- group_mat %*% m )$median
#> [1] 344µs
bench::mark( dsum <- rowsum(d, group) )$median
#> [1] 146ms
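For reuse, the steps above can be wrapped in a small helper. This is just a sketch: the name rowsum_sparse and the use of fct_inorder to keep first-appearance row order are choices made here, not part of the original answer.
rowsum_sparse <- function(m, group) {
  group <- forcats::fct_inorder(as.character(group))  # rows come out in first-appearance order
  group_mat <- t(Matrix::sparse.model.matrix(~ 0 + group))
  rownames(group_mat) <- levels(group)                # strip the "group" prefix from the names
  group_mat %*% m
}
all.equal(as.matrix(rowsum_sparse(m, group)), dsum, check.attributes = FALSE)
#> [1] TRUE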
I have a piece of R code I want to optimise for speed when working with larger datasets. It currently depends on sapply cycling through a vector of numbers (which correspond to rows of a sparse matrix). The reproducible example below gets at the nub of the problem; it is the three-line function expensive() that chews up the time, and it's obvious why (lots of matching of big vectors to each other, and two nested paste statements for each cycle of the loop). Before I give up and start struggling with doing this bit of the work in C++, is there something I'm missing? Is there a way to vectorize the sapply call that will make it an order of magnitude or three faster?
library(microbenchmark)
# create an example object like a simple_triple_matrix
# number of rows and columns in sparse matrix:
n <- 2000 # real number is about 300,000
ncols <- 1000 # real number is about 80,000
# number of non-zero values, about 10 per row:
nonzerovalues <- n * 10
stm <- data.frame(
  i = sample(1:n, nonzerovalues, replace = TRUE),
  j = sample(1:ncols, nonzerovalues, replace = TRUE),
  v = sample(rpois(nonzerovalues, 5), replace = TRUE)
)
# It seems to save about 3% of time to have i, j and v as objects in their own right
i <- stm$i
j <- stm$j
v <- stm$v
expensive <- function(){
  sapply(1:n, function(k){
    # microbenchmarking suggests quicker to have which() rather than a vector of TRUE and FALSE:
    whichi <- which(i == k)
    paste(paste(j[whichi], v[whichi], sep = ":"), collapse = " ")
  })
}
microbenchmark(expensive())
The output of expensive is a character vector, of n elements, that looks like this:
[1] "344:5 309:3 880:7 539:6 338:1 898:5 40:1"
[2] "307:3 945:2 949:1 130:4 779:5 173:4 974:7 566:8 337:5 630:6 567:5 750:5 426:5 672:3 248:6 300:7"
[3] "407:5 649:8 507:5 629:5 37:3 601:5 992:3 377:8"
For what it's worth, the motivation is to efficiently write data from a sparse matrix format - either from slam or Matrix, but starting with slam - into libsvm format (which is the format above, but with each row beginning with a number representing a target variable for a support vector machine - omitted in this example as it's not part of the speed problem). Trying to improve on the answers to this question. I forked one of the repositories referred to from there and adapted its approach to work with sparse matrices with these functions. The tests show that it works fine, but it doesn't scale up.
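For illustration only: the finished libsvm lines would simply prepend a target value to each row string, e.g. with a hypothetical 0/1 target vector:
target <- sample(c(0, 1), n, replace = TRUE)   # hypothetical targets, one per row
libsvm_lines <- paste(target, expensive())
# writeLines(libsvm_lines, "train.libsvm")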
Use package data.table. Its by argument, combined with its fast sorting, saves you from finding the indices of equal i values.
res1 <- expensive()
library(data.table)
cheaper <- function() {
  setDT(stm)
  res <- stm[, .(i, jv = paste(j, v, sep = ":"))
             ][, .(res = paste(jv, collapse = " ")), keyby = i][["res"]]
  setDF(stm) # clean-up, which might not be necessary
  res
}
res2 <- cheaper()
all.equal(res1, res2)
#[1] TRUE
microbenchmark(expensive(),
cheaper())
#Unit: milliseconds
# expr min lq mean median uq max neval cld
# expensive() 127.63343 135.33921 152.98288 136.13957 138.87969 222.36417 100 b
# cheaper() 15.31835 15.66584 16.16267 15.98363 16.33637 18.35359 100 a
I hope someone will be able to help me with this problem. I have a list object that contains 48 vectors, each with 2,000,000 observations. Here is some code that creates the same structure with only 100,000 items per vector:
mtx_sim <- matrix(data = runif(48 * 100000), ncol = 48, nrow = 100000)
mtx_list <- as.list(data.frame(mtx_sim))
I want to cumulatively sum each row of the vectors in the list. However, there is one stipulation: each sum should include at most the last thirty vectors. For example, the 35th vector in the list should be summed with the 29 vectors preceding it, whereas the fourth vector should be summed with the three preceding vectors (vectors three, two, and one). Here is my code example, which relies on the lapply function in combination with rowSums and is relatively slow:
start <- c(rep(1, times = 30), seq(2, 19, 1))
end <- seq(1,48,1)
system.time(xxx <- lapply(1:48, function(x)
  rowSums(
    matrix(
      unlist(mtx_list[start[x]:end[x]]),
      ncol = (end[x] - start[x] + 1)))
))
user system elapsed
62.19 0.56 63.04
Does anyone have an idea to optimize the code?
You are doing two expensive things in an otherwise reasonable algorithm:
You are recreating a matrix from your list for every iteration; this is likely slow
You are recomputing the entire row sums repeatedly, when in reality you just need to calculate the marginal changes
Here is an alternative. We reconstitute the original matrix once, and then just add the marginal columns.
fun_brodie <- function(mtx_list) {
  mtx <- do.call(cbind, mtx_list)
  base <- mtx[, 1]
  res <- list(base)
  # add each new column to the running total; once past column 30, also subtract
  # the column that drops out of the 30-wide rolling window
  for (i in seq(ncol(mtx))[-1])
    res[[i]] <- res[[i - 1]] + mtx[, i] - if (i > 30) mtx[, i - 30] else 0
  res
}
res <- fun_brodie(mtx_list)
Confirm equal:
all.equal(res, xxx)
# [1] TRUE
Benchmarks:
library(microbenchmark)
microbenchmark(times=3, fun_marat(mtx_list), fun_brodie(mtx_list), fun_op(mtx_list))
Produces:
Unit: milliseconds
expr min lq mean
fun_marat(mtx_list) 1661.9135 1763.418 1800.3530
fun_brodie(mtx_list) 115.7877 116.061 153.6794
fun_op(mtx_list) 58059.7803 60388.303 62060.5557
Thanks to Marat for pointing out an interpretation error on my part. Also, note that in order to make fun_marat compatible I added a step of cbinding the list to a data frame.
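That wrapper isn't shown in the benchmark code; a plausible sketch of it (reconstructed here, not the original) binds the list into a matrix and applies the cumulative-sum approach shown in the answer below:
fun_marat <- function(mtx_list) {
  mtx <- do.call(cbind, mtx_list)
  # row-wise cumulative sums, then subtract whatever falls outside the 30-column window
  M <- t(apply(mtx, 1, cumsum))
  if (ncol(M) > 30) {
    i <- 31:ncol(M)
    M[, i] <- M[, i] - M[, i - 30]
  }
  # return one vector per input vector so the output is comparable with xxx and fun_brodie
  lapply(seq_len(ncol(M)), function(k) M[, k])
}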
You can use this solution:
# cumulative row sums, then subtract the part that falls outside the 30-column window
M <- t(apply(mtx_sim, 1, cumsum))
if (ncol(M) > 30) {
  i <- 31:ncol(M)
  M[, i] <- M[, i] - M[, i - 30]
}
M
I am attempting to build a large (~200 MM line) dataframe in R. Each entry in the dataframe will consist of approximately 10 digits (e.g. 1234.12345). The code is designed to walk through a list, subtracting the item in position [i] from every item after [i], but not from the items before [i] (if I were putting the output into a matrix it would be a triangular matrix). The code is simple and works fine on smaller lists, but I am wondering if there is a faster or more efficient way to do this. I assume the first part of the answer is going to entail "don't use a nested for loop," but I am not sure what the alternatives are.
The idea is that this will be an "edge list" for a social network analysis graph. Once I have outlist I will reduce the number of edges based on some criteria (<, >, ==) so the final list (and graph) won't be quite so ponderous.
#Fake data of same approximate dimensions as real data
dlist<-sample(1:20,20, replace=FALSE)
#purge the output list before running the loop
rm(outlist)
outlist<-data.frame()
for (i in 1:(length(dlist) - 1)) {
  for (j in (i + 1):length(dlist)) {
    outlist <- rbind(outlist, c(dlist[i], dlist[j], dlist[j] - dlist[i]))
  }
}
IIUC your final dataset will be ~200 million rows by 3 columns, all of type numeric, which takes a total space of:
200e6 (rows) * 3 (cols) * 8 (bytes) / (1024 ^ 3)
# ~ 4.5GB
That's quite a lot of data, so it's essential to avoid copies wherever possible.
Here's a method that uses the data.table package's unexported (internal) vecseq function (written in C; fast and memory efficient) and its assignment-by-reference operator := to avoid copies.
fn1 <- function(x) {
  require(data.table) ## 1.9.2
  lx = length(x)
  vx = as.integer(lx * (lx - 1) / 2)
  # R v3.1.0 doesn't copy on doing list(.) - so it should be even faster there
  ans = setDT(list(v1 = rep.int(head(x, -1L), (lx - 1L):1L),
                   v2 = x[data.table:::vecseq(2:lx, (lx - 1L):1, vx)]))
  ans[, v3 := v2 - v1]
}
Benchmarking:
I'll benchmark with functions from other answers on your data dimensions. Note that my benchmark is on R v3.0.2, but fn1() should give better performance (both speed and memory) on R v3.1.0 because list(.) doesn't result in copy anymore.
fn2 <- function(x) {
  diffmat <- outer(x, x, "-")
  ss <- which(upper.tri(diffmat), arr.ind = TRUE)
  data.frame(v1 = x[ss[, 1]], v2 = x[ss[, 2]], v3 = diffmat[ss])
}
fn3 <- function(x) {
  idx <- combn(seq_along(x), 2)
  out2 <- data.frame(v1 = x[idx[1, ]], v2 = x[idx[2, ]])
  out2$v3 <- out2$v2 - out2$v1
  out2
}
set.seed(45L)
x = runif(20e3L)
system.time(ans1 <- fn1(x)) ## 18 seconds + ~8GB (peak) memory usage
system.time(ans2 <- fn2(x)) ## 158 seconds + ~19GB (peak) memory usage
system.time(ans3 <- fn3(x)) ## 809 seconds + ~12GB (peak) memory usage
Note that fn2() due to use of outer requires quite a lot of memory (peak memory usage was >=19GB) and is slower than fn1(). fn3() is just very very slow (due to combn, and unnecessary copy).
Another way to create that data is
#Sample Data
N <- 20
set.seed(15) #for reproducibility
dlist <- sample(1:N,N, replace=FALSE)
we could do
idx <- combn(1:N,2)
out2 <- data.frame(i=dlist[idx[1, ]], j=dlist[idx[2, ]])
out2$dist <- out2$j-out2$i
This uses combn to create all pairs of indices in the dataset rather than using loops. This allows us to build the data.frame all at once rather than adding one row at a time.
We compare that to
out1 <- data.frame()
for (i in 1:(length(dlist) - 1)) {
  for (j in (i + 1):length(dlist)) {
    out1 <- rbind(out1, c(dlist[i], dlist[j], dlist[j] - dlist[i]))
  }
}
we see that
all(out1==out2)
# [1] TRUE
Plus, if we compare with microbenchmark we see that
# loops() and combdata() are assumed to wrap the two approaches above in functions
microbenchmark(loops(), combdata())
# Unit: microseconds
# expr min lq median uq max neval
# loops() 30888.403 32230.107 33764.7170 34821.2850 82891.166 100
# combdata() 684.316 800.384 873.5015 940.9215 4285.627 100
The method that doesn't use loops is much faster.
You can always start with a triangular matrix and then make your dataframe directly from that:
vec <- 1:10
diffmat <- outer(vec,vec,"-")
ss <- which(upper.tri(diffmat),arr.ind = TRUE)
data.frame(one = vec[ss[,1]],
two = vec[ss[,2]],
diff = diffmat[ss])
You need to preallocate outlist; this will significantly increase the speed of your code. By preallocating I mean creating an output structure that already has the desired size, but filled with, for example, NAs.
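A minimal sketch of that idea: allocate a numeric matrix with one row per pair up front, fill it in place, and convert to a data frame once at the end.
n_pairs <- choose(length(dlist), 2)
out <- matrix(NA_real_, nrow = n_pairs, ncol = 3)
k <- 0L
for (i in 1:(length(dlist) - 1)) {
  for (j in (i + 1):length(dlist)) {
    k <- k + 1L
    out[k, ] <- c(dlist[i], dlist[j], dlist[j] - dlist[i])
  }
}
outlist <- as.data.frame(out)   # filling a preallocated matrix avoids copying outlist on every rbind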