Writing to a large matrix much slower than normal - r

Suppose I do this:
m <- matrix(0, nrow = 20, ncol = 3)
system.time(m[1, 1:3] <- c(1,1,1))
That takes 0 seconds.
Now I do this:
m <- matrix(0, nrow = 10000000, ncol = 3)
system.time(m[1, 1:3] <- c(1,1,1))
This takes about 0.47 seconds on my system.
I need to fill in a matrix of around 8.5 million rows, so at 0.47 seconds per assignment this is not an option. Is there any way around this, other than creating many smaller sub-matrices and rbinding them later?
Thanks!

After starting a new R session:
m <- matrix(0, nrow = 10000000, ncol = 3)
system.time(m[1, 1:3] <- c(1,1,1))
#  user  system elapsed
#     0       0       0
n <- m
system.time(m[1, 1:3] <- c(1,1,1))
#  user  system elapsed
# 0.074   0.061   0.135
The first time m is modified in place. The second time a copy is made since m is referred to by n.
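A quick way to watch this copy-on-modify behaviour directly is tracemem(), assuming your R build has memory profiling enabled (the default for the CRAN binaries):
m <- matrix(0, nrow = 1e7, ncol = 3)
tracemem(m)               # start tracking copies of m
m[1, 1:3] <- c(1, 1, 1)   # modified in place, no copy reported
n <- m                    # second reference to the same data
m[1, 1:3] <- c(1, 1, 1)   # tracemem now reports a copy
untracemem(m)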
This question might be of interest. However, if you are doing a rolling regression, you should first check whether it is already implemented in some package. If you want to do this in Rcpp, you should do the whole loop in Rcpp rather than assigning to m 8.5 million times.
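For completeness, here is a minimal sketch of the "do the whole loop in Rcpp" idea; the fill rule inside the loop is a made-up placeholder, not the actual computation from the question:
library(Rcpp)
cppFunction('
NumericMatrix fill_matrix(int nrow) {
  NumericMatrix m(nrow, 3);
  for (int i = 0; i < nrow; ++i) {
    // placeholder: compute the three values for row i here
    m(i, 0) = 1; m(i, 1) = 1; m(i, 2) = 1;
  }
  return m;
}')
m <- fill_matrix(8500000L)  # one compiled loop, no repeated copies on the R side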

Related

Efficient subsetting of each column of a matrix with a column of indices of another matrix

I am currently facing this problem while working on R.
I have a big matrix, say A, of size (M x K), and a smaller matrix, say B, of size (N x K), with of course N < M. So B has the same number of columns as A, but fewer rows.
A possible example can be:
A <- replicate(K, rnorm(M))
B <- replicate(K, sample(M, N))
My goal is to subset each column of A with the corresponding column of B in an efficient way, since the dimensions of the matrices are huge.
A possible numerical example is:
A <- [1 2 3     B <- [1 1 3     --->  [1 2 9
      4 5 6           3 2 2]           7 5 6]
      7 8 9]
This means I am looking for something better than going column by column with a simple for loop:
for (i in 1:K) {
  A[B[, i], i]
}
Thank you for helping me solve this apparently trivial, but for me challenging, problem.
See this question and ?'[':
When indexing arrays by [ a single argument i can be a matrix with as
many columns as there are dimensions of x; the result is then a vector
with elements corresponding to the sets of indices in each row of i.
For your example, it would be:
A <- matrix(1:9, 3, byrow = TRUE)
B <- matrix(c(1, 1, 3, 3, 2, 2), 2, byrow = TRUE)
C <- matrix(A[cbind(c(B), rep(1:ncol(B), each = nrow(B)))], nrow(B))
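On the small example above, this reproduces the expected result:
C
#      [,1] [,2] [,3]
# [1,]    1    2    9
# [2,]    7    5    6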
UPDATE: For efficiency, Mossa is right. There's not going to be much of a difference in performance in this case between using a matrix of indices and looping. Almost all of your time is in generating A. Generating B even takes longer than generating C:
M <- K <- 1e4; N <- 1e3
> system.time(A <- replicate(K, rnorm(M)))
user system elapsed
7.110 0.922 8.030
> system.time(B <- replicate(K, sample(M, N)))
user system elapsed
0.727 0.033 0.760
> system.time(C <- matrix(A[cbind(c(B), rep(1:ncol(B), each = nrow(B)))], nrow(B)))
user system elapsed
0.474 0.186 0.659
You can speed up the generation of A slightly by sampling all in one go, but you don't have much room for improvement in any of the A, B, C operations:
> system.time(A <- matrix(rnorm(M*K), M))
user system elapsed
6.82 0.74 7.56
I believe that you've arrived at a pretty reasonable method:
C <- matrix(NA, nrow = N, ncol = K)
for (i in seq_len(K)) {
  C[, i] <- A[B[, i], i]
}
C

Optimization of apply

I have existing code that calculates concordance value for a dataframe/matrix. It's basically the number of rows where all the values are the same over the total number of rows.
...
concordance <- new[complete.cases(new), ]   # removes rows with NAs
TF <- apply(concordance, 1, function(x) if (length(unique(x)) > 1) FALSE else TRUE)
# outputs a logical vector: TRUE if the row is concordant
numF <- table(TF)["TRUE"]      # gets number of TRUEs
concValue <- numF / NROW(TF)   # true/total
...
Above is what I have now. It runs ok but I was wondering if there was any way to make it faster.
Edit: The dimensions of the object are variable, but the number of columns is typically 2-6 and there are typically 1,000,000+ rows. This is part of a package I'm developing, so the input data is variable.
Because the number of rows is much larger than the number of columns, it makes sense to loop over columns instead, dropping rows that contain more than one distinct value as we go:
propIdentical <- function(Mat) {
  nrowInit <- nrow(Mat)
  for (i in 1:(ncol(Mat) - 1)) {
    if (!nrow(Mat)) break   # stop if the matrix has no rows left
    else {
      # check which elements of column i and column i + 1 are equal:
      equals <- Mat[, i] == Mat[, i + 1]
      # remove all other rows from the matrix
      Mat <- Mat[equals, , drop = FALSE]
    }
  }
  return(nrow(Mat) / nrowInit)
}
Some tests:
set.seed(1)
# normal case
dat <- matrix(sample(1:10, rep = T, size = 3*10^6), nrow = 10^6)
system.time(prI <- propIdentical(dat)) ; prI
user system elapsed
0.053 0.017 0.070
[1] 0.009898
# normal case on my pc for comparison:
system.time(app <- mean(apply(dat, 1, function(x) length(unique(x))) == 1L)); app
user system elapsed
12.176 0.036 12.231
[1] 0.009898
# worst case
dat <- matrix(1L, nrow = 10^6, ncol = 6)
> system.time(prI <- propIdentical(dat)) ; prI
user system elapsed
0.302 0.044 0.348
[1] 1
# worst case on my pc for comparison
system.time(mean(apply(dat, 1, function(x) length(unique(x))) == 1L))
user system elapsed
12.562 0.001 12.578
# testing drop = F and if(!nrow(Mat)) break
dat <- matrix(1:2, ncol = 2)
> system.time(prI <- propIdentical(dat)) ; prI
user system elapsed
0 0 0
[1] 0
Note: if you run this on a data.frame make sure to turn it into a matrix first.
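For example, with the data frame from the question (the hypothetical object new, after dropping incomplete rows as in the original code), that would be:
concValue <- propIdentical(as.matrix(new[complete.cases(new), ]))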

set missing values to constant in R, computational speed

In R, I have a reasonably large data frame (d) which is 10500 by 6000. All values are numeric.
It has many NA values in both its rows and columns, and I am looking to replace these values with zero. I have used:
d[is.na(d)] <- 0
but this is rather slow. Is there a better way to do this in R?
I am open to using other R packages.
I would prefer it if the discussion focused on computational speed rather than, for example, "why would you replace NAs with zeros?". And, while I realize a similar question has been asked (How do I replace NA values with zeros in an R dataframe?), the focus there was not on computational speed for a large data frame with many missing values.
Thanks!
Edited Solution:
As helpfully suggested, changing d to a matrix before applying is.na sped up the computation by an order of magnitude.
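In code, that matrix round trip looks roughly like this (it assumes every column of d is numeric, so nothing is lost in the conversion):
m <- as.matrix(d)
m[is.na(m)] <- 0
d <- as.data.frame(m)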
You can get a considerable performance increase using the data.table package.
It is much faster, in general, with a lot of manipulations and transformations.
The downside is the learning curve of the syntax.
However, if you are looking for a speed performance boost, the investment could be worth it.
Generate fake data
r <- 10500
c <- 6000
x <- sample(c(NA, 1:5), r * c, replace = TRUE)
df <- data.frame(matrix(x, nrow = r, ncol = c))
Base R
df1 <- df
system.time(df1[is.na(df1)] <- 0)
user system elapsed
4.74 0.00 4.78
tidyr - replace_na()
dfReplaceNA <- function(df) {
  require(tidyr)
  l <- setNames(lapply(vector("list", ncol(df)), function(x) x <- 0), names(df))
  replace_na(df, l)
}
system.time(df2 <- dfReplaceNA(df))
user system elapsed
4.27 0.00 4.28
data.table - set()
dtReplaceNA <- function(df) {
  require(data.table)
  dt <- data.table(df)
  for (j in 1:ncol(dt)) { set(dt, which(is.na(dt[[j]])), j, 0) }
  setDF(dt)  # return a data.frame object
}
system.time(df3 <- dtReplaceNA(df))
user system elapsed
0.80 0.31 1.11
Compare data frames
all.equal(df1, df2)
[1] TRUE
all.equal(df1, df3)
[1] TRUE
I guess that all columns must be numeric or assigning 0s to NAs wouldn't be sensible.
I get the following timings, with approximately 10,000 NAs:
> M <- matrix(0, 10500, 6000)
> set.seed(54321)
> r <- sample(1:10500, 10000, replace=TRUE)
> c <- sample(1:6000, 10000, replace=TRUE)
> M[cbind(r, c)] <- NA
> D <- data.frame(M)
> sum(is.na(M)) # check
[1] 9999
> sum(is.na(D)) # check
[1] 9999
> system.time(M[is.na(M)] <- 0)
user system elapsed
0.19 0.12 0.31
> system.time(D[is.na(D)] <- 0)
user system elapsed
3.87 0.06 3.95
So, with this number of NAs, I get about an order of magnitude speedup by using a matrix. (With fewer NAs, the difference is smaller.) But the time using a data frame is just 4 seconds on my modest laptop, much less time than it took to answer the question. If the problem really is of this magnitude, is that really too slow?
I hope this helps.

Fast extraction of rows in R

I have many binary matrices from which I want to extract every possible combination of three rows into a list. I then want to sum the columns of each of the extracted row combinations.
My current method is below, but it is extremely slow.
set.seed(123)
x <- matrix(sample(0:1, 110 * 609, replace = TRUE), 110, 609)
row.combinations <- t(combn(nrow(x),3))
extracted.row.combns <- lapply(1:nrow(row.combinations), FUN = function(y) x[c(row.combinations[y,1],row.combinations[y,2],row.combinations[y,3]),])
summed.rows <- lapply(extracted.row.combns, colSums)
How could this be sped up?
Using combn with an inline function passed as its FUN argument, I can run this analysis in under 5 seconds on my current machine:
combn(nrow(x), 3, FUN=function(r) colSums(x[r,]), simplify=FALSE)
We can make this faster with combnPrim from gRbase.
library(gRbase)
lapply(combnPrim(nrow(x), 3, simplify = FALSE), function(r) colSums(x[r,]))
Benchmarks
system.time(x1 <- combn(nrow(x), 3, FUN=function(r) colSums(x[r,]), simplify=FALSE))
# user system elapsed
# 6.46 0.21 6.67
system.time(x2 <- lapply(combnPrim(nrow(x), 3, simplify = FALSE),
function(r) colSums(x[r,])))
# user system elapsed
# 4.61 0.22 4.83
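Either way, the result is a list with one vector of column sums per 3-row combination:
length(x1)       # choose(110, 3) = 215820 combinations
length(x1[[1]])  # 609 column sums each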

How to convert a huge list-of-vector to a matrix more efficiently?

I have a list of length 130,000 where each element is a character vector of length 110. I would like to convert this list to a matrix with dimension 1,430,000 x 10. How can I do it more efficiently?
My code is:
output <- NULL
for (i in 1:length(z)) {
  output <- rbind(output, matrix(z[[i]], ncol = 10, byrow = TRUE))
}
This should be equivalent to your current code, only a lot faster:
output <- matrix(unlist(z), ncol = 10, byrow = TRUE)
I think you want
output <- do.call(rbind,lapply(z,matrix,ncol=10,byrow=TRUE))
i.e. combining #BlueMagister's use of do.call(rbind,...) with an lapply statement to convert the individual list elements into 11*10 matrices ...
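A quick sanity check on a small mock list (hypothetical data: three character vectors of length 110) confirms the two approaches give the same matrix:
z_small <- replicate(3, as.character(1:110), simplify = FALSE)
identical(matrix(unlist(z_small), ncol = 10, byrow = TRUE),
          do.call(rbind, lapply(z_small, matrix, ncol = 10, byrow = TRUE)))
# [1] TRUE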
Benchmarks (showing #flodel's unlist solution is 5x faster than mine, and 230x faster than the original approach ...)
n <- 1000
z <- replicate(n,matrix(1:110,ncol=10,byrow=TRUE),simplify=FALSE)
library(rbenchmark)
origfn <- function(z) {
  output <- NULL
  for (i in 1:length(z))
    output <- rbind(output, matrix(z[[i]], ncol = 10, byrow = TRUE))
  output
}
rbindfn <- function(z) do.call(rbind,lapply(z,matrix,ncol=10,byrow=TRUE))
unlistfn <- function(z) matrix(unlist(z), ncol = 10, byrow = TRUE)
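The table below was presumably produced by a call along these lines (the exact columns argument is an assumption on my part):
benchmark(origfn(z), rbindfn(z), unlistfn(z), replications = 100,
          columns = c("test", "replications", "elapsed", "relative", "user.self", "sys.self"))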
##          test replications elapsed relative user.self sys.self
## 1   origfn(z)          100  36.467  230.804    34.834    1.540
## 2   rbindfn(z)         100   0.713    4.513     0.708    0.012
## 3  unlistfn(z)         100   0.158    1.000     0.144    0.008
If this scales appropriately (i.e. you don't run into memory problems), the full problem would take about 130*0.2 seconds = 26 seconds on a comparable machine (I did this on a 2-year-old MacBook Pro).
It would help to have sample information about your output. Recursively using rbind on bigger and bigger things is not recommended. My first guess at something that would help you:
z <- list(1:3,4:6,7:9)
do.call(rbind,z)
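For this small example the result is:
#      [,1] [,2] [,3]
# [1,]    1    2    3
# [2,]    4    5    6
# [3,]    7    8    9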
See a related question for more efficiency, if needed.
You can also use:
output <- as.matrix(as.data.frame(z))
The memory usage is very similar to
output <- matrix(unlist(z), ncol = 10, byrow = TRUE)
This can be verified with mem_change() from library(pryr).
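A sketch of how that check might look (assuming pryr is installed; the output names out1 and out2 are just for illustration):
library(pryr)
mem_change(out1 <- as.matrix(as.data.frame(z)))
mem_change(out2 <- matrix(unlist(z), ncol = 10, byrow = TRUE))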
You can use as.matrix as below:
output <- as.matrix(z)
