Fast way to get topN elements for each matrix column - r

This is a code snippet from the recommenderlab package that takes a matrix of ratings and returns the top 5 elements for each user:
reclist <- apply(ratings, MARGIN=2, FUN=function(x)
head(order(x, decreasing=TRUE, na.last=NA), 5))
For a large matrix (>10K columns) it takes too long to run. Is there any way to rewrite it to be more efficient, maybe using the dplyr or data.table package? Writing C++ code is not an option for me.

An answer with data.table and base R
# 10000 column dummy matrix
cols <- 10000
mat <- matrix(rnorm(100*cols), ncol=cols)
With data.table:
library(data.table)
dt1 <- data.table(mat)
# sort every column, return first 5 rows
dt1[, lapply(.SD, sort, decreasing=T)][1:5]
system.time(dt1[, lapply(.SD, sort, decreasing=T)][1:5])
result:
user system elapsed
2.904 0.013 2.916
In plain old base R, it's actually faster! (Thanks for the comment, Arun.)
system.time(head(apply(mat, 2, sort, decreasing=T), 5))
user system elapsed
0.473 0.002 0.475
However, both are faster than the original code from the question, according to system.time():
system.time(
apply(mat, MARGIN=2, FUN=function(x) {
head(order(x, decreasing=TRUE, na.last=NA), 5)
}))
user system elapsed
3.063 0.031 3.094
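Note that the sort()-based timings above return the top 5 values, whereas the order()-based code in the question returns row indices. If only the values are needed, a partial sort can avoid fully ordering each column; this is a sketch of one further option (not taken from the answer above), reusing the dummy mat and the 5-element cutoff:
# partial sort: only the first 5 positions are guaranteed to be in their final
# sorted order, so each column is never completely sorted
top5 <- apply(mat, 2, function(x) -sort(-x, partial = 1:5)[1:5])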

Related

Sum every n rows of a matrix within a list

I am trying to create a matrix where each row consists of the sum of every three rows in another matrix. There are actually a bunch of these matrices in a list and I am performing the same operation on each of the elements in that list. Based on this post I was able to generate the code below. It works but it takes forever for my more complicated data set.
test<-lapply(1:1000, function(x) matrix(1:300, nrow=60))
testCons<-lapply(test, function(x) apply(x, 2, function(y) tapply(y, ceiling(seq_along(y)/3), sum)))
Does anybody have an idea of how to speed that up or simplify it?
rowsum gives an easy speed-up: it sums rows according to a grouping variable, here an index that marks every block of three rows.
test <- lapply(1:1000, function(x) matrix(1:300, nrow=60))
system.time(
testCons <- lapply(test, function(x) apply(x, 2, function(y) tapply(y, ceiling(seq_along(y)/3), sum)))
)
# user system elapsed
# 1.672 0.004 1.678
system.time(
testCons2 <- lapply(test, function(x) rowsum(x, rep(seq_len(nrow(x) / 3), each=3)))
)
# user system elapsed
# 0.08 0.00 0.08
all.equal(testCons, testCons2)
#[1] TRUE
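As a tiny illustration (not part of the benchmark above) of what the grouping variable passed to rowsum() looks like:
mm <- matrix(1:12, nrow = 6)                 # toy 6 x 2 matrix
grp <- rep(seq_len(nrow(mm) / 3), each = 3)  # 1 1 1 2 2 2
rowsum(mm, grp)                              # sums rows 1-3 and rows 4-6, per column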

set missing values to constant in R, computational speed

In R, I have a reasonably large data frame (d) which is 10500 by 6000. All values are numeric.
It has many NA elements in both its rows and columns, and I am looking to replace these values with zero. I have used:
d[is.na(d)] <- 0
but this is rather slow. Is there a better way to do this in R?
I am open to using other R packages.
I would prefer it if the discussion focused on computational speed rather than, for example, "why would you replace NAs with zeros". And, while I realize a similar question has been asked (How do I replace NA values with zeros in an R dataframe?), the focus there was not on computational speed for a large data frame with many missing values.
Thanks!
Edited Solution:
As helpfully suggested, converting d to a matrix before applying is.na() sped up the computation by an order of magnitude.
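A minimal sketch of that matrix round-trip, assuming d is the all-numeric data frame from the question:
m <- as.matrix(d)       # cheap coercion when every column is numeric
m[is.na(m)] <- 0        # matrix indexing skips the data.frame method's per-column overhead
d <- as.data.frame(m)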
You can get a considerable performance increase using the data.table package.
It is much faster, in general, with a lot of manipulations and transformations.
The downside is the learning curve of the syntax.
However, if you are looking for a performance boost, the investment could be worth it.
Generate fake data
r <- 10500
c <- 6000
x <- sample(c(NA, 1:5), r * c, replace = TRUE)
df <- data.frame(matrix(x, nrow = r, ncol = c))
Base R
df1 <- df
system.time(df1[is.na(df1)] <- 0)
user system elapsed
4.74 0.00 4.78
tidyr - replace_na()
dfReplaceNA <- function (df) {
require(tidyr)
l <- setNames(lapply(vector("list", ncol(df)), function(x) x <- 0), names(df))
replace_na(df, l)
}
system.time(df2 <- dfReplaceNA(df))
user system elapsed
4.27 0.00 4.28
data.table - set()
dtReplaceNA <- function (df) {
require(data.table)
dt <- data.table(df)
for (j in 1:ncol(dt)) {set(dt, which(is.na(dt[[j]])), j, 0)}
setDF(dt) # Return back a data.frame object
}
system.time(df3 <- dtReplaceNA(df))
user system elapsed
0.80 0.31 1.11
Compare data frames
all.equal(df1, df2)
[1] TRUE
all.equal(df1, df3)
[1] TRUE
I guess that all columns must be numeric or assigning 0s to NAs wouldn't be sensible.
I get the following timings, with approximately 10,000 NAs:
> M <- matrix(0, 10500, 6000)
> set.seed(54321)
> r <- sample(1:10500, 10000, replace=TRUE)
> c <- sample(1:6000, 10000, replace=TRUE)
> M[cbind(r, c)] <- NA
> D <- data.frame(M)
> sum(is.na(M)) # check
[1] 9999
> sum(is.na(D)) # check
[1] 9999
> system.time(M[is.na(M)] <- 0)
user system elapsed
0.19 0.12 0.31
> system.time(D[is.na(D)] <- 0)
user system elapsed
3.87 0.06 3.95
So, with this number of NAs, I get about an order of magnitude speedup by using a matrix. (With fewer NAs, the difference is smaller.) But the time using a data frame is just 4 seconds on my modest laptop -- much less time than it took to answer the question. If the problem really is of this magnitude, why is that slow?
I hope this helps.

Row maximum in data table

I have a dataset of 8,000,000 rows with 100 columns in a data.table where each column is a count. I need to find the maximum count in each row and which column this maximum is in.
I can quickly get which column has the maximum value for each row using
dt <- dt[, maxCol := which.max(.SD), by=pmxid]
but trying to get the actual maximum value using
dt <- dt[, nmax := max(.SD), by=pmxid]
is incredibly slow. I ran it for nearly 20 mins and only 200,000 row maximums had been calculated. Finding the max column took approx. 2 mins for all 8,000,000 rows.
How come finding the maximum takes so long? Shouldn't it take the same time as which.max() or less?
Though you are seeking a data.table solution, here is a base R approach that should be fast enough for your dataset.
indx <- max.col(df, ties.method='first')
df[cbind(1:nrow(df), indx)]
On a slightly bigger dataset, system.time comparisons show:
system.time({
indx <- max.col(df1, ties.method='first')
res <- df1[cbind(1:nrow(df1), indx)]
})
# user system elapsed
# 2.180 0.163 2.345
df1$pmxid <- 1:nrow(df1)
dt <- as.data.table(df1)
system.time(dt[, nmax:= max(.SD), by= pmxid])
# user system elapsed
#1265.792 2.305 1267.836
The base R method comes out faster than the data.table method in the post.
data
set.seed(24)
df <- as.data.frame(matrix(sample(c(NA,0:20), 20*10,
replace=TRUE), ncol=10))
#if there are NAs, change it to lowest number
df[is.na(df)] <- -999
set.seed(585)
df1 <- as.data.frame(matrix(sample(c(NA,0:20), 100*1e6,
replace=TRUE), ncol=100))
df1[is.na(df1)] <- -999
For the maximum over columns in a data.table,
dt[, max:= do.call(pmax, .SD)]
is much faster than dt[, nmax:= max(.SD), by= 1:nrow(dt)], and faster than the above base R solution:
library(data.table)
ncols=100
nrows=8000000
dfi <- as.data.frame(matrix(runif(ncols*nrows), ncol = ncols, nrow = nrows))
df=dfi
system.time({
indx <- max.col(df, ties.method='first')
df$max <- df[cbind(1:nrow(df), indx)]
})
# user system elapsed
# 8.89 1.37 10.45
dt <- as.data.table(dfi)
system.time({
dt[, max:= do.call(pmax, .SD)]
})
# user system elapsed
# 3.31 0.01 3.33
Once you have calculated the Colmax index, use the index to retrieve the maximum in each row
dt[Colmax == <value>]
or,
dt[J(<values>), on = 'Colmax']
Also, there is a problem with the syntax in
dt[, nmax := max(.SD), by = pmxid]
max(.SD) collapses all of .SD into a single vector of length nrow(.SD) * ncol(.SD) and returns one value per group (see the Note in the Description of ?max).
Instead try:
dt[, nmax := apply(.SD, 1, max), by = pmxid]
or, the parallel max, which needs the columns passed as separate arguments (hence do.call()):
dt[, nmax := do.call(pmax, .SD)]
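Putting the pieces together, here is a sketch that adds both the per-row maximum and the column it comes from; the column names and sizes are made up for illustration, and it assumes every column of the table is a count:
library(data.table)
dt <- as.data.table(matrix(sample(0:20, 1e5 * 10, replace = TRUE), ncol = 10))
counts <- names(dt)  # remember the count columns before adding new ones
dt[, nmax := do.call(pmax, .SD), .SDcols = counts]                     # per-row maximum value
dt[, maxCol := max.col(.SD, ties.method = "first"), .SDcols = counts]  # column holding that maximum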

replacement for na.locf.xts (extremely slow when used with a multicolumn xts)

The R function
xts:::na.locf.xts
is extremely slow when used with a multicolumn xts of more than a few columns.
There is indeed a loop over the columns in the code of na.locf.xts
I am trying to find a way to avoid this loop.
Any idea?
The loop in na.locf.xts is slow because it creates a copy of the entire object for each column in the object. The loop itself isn't slow; the copies created by [.xts are slow.
There's an experimental (and therefore unexported) version of na.locf.xts on R-Forge that moves the loop over columns to C, which avoids copying the object. It's quite a bit faster for very large objects.
set.seed(21)
m <- replicate(20, rnorm(1e6))
is.na(m) <- sample(length(m), 1e5)
x <- xts(m, Sys.time()-1e6:1)
y <- x[1:1e5,1:3]
> # smaller objects
> system.time(a <- na.locf(y))
user system elapsed
0.008 0.000 0.008
> system.time(b <- xts:::.na.locf.xts(y))
user system elapsed
0.000 0.000 0.003
> identical(a,b)
[1] TRUE
> # larger objects
> system.time(a <- na.locf(x))
user system elapsed
1.620 1.420 3.064
> system.time(b <- xts:::.na.locf.xts(x))
user system elapsed
0.124 0.092 0.220
> identical(a,b)
[1] TRUE
Another workaround is to fill column-by-column on the underlying matrix and then rebuild the xts object:
timeIndex <- index(x)
x <- apply(x, 2, na.locf)
x <- as.xts(x, order.by = timeIndex)
This avoids the column-by-column data copying. Without it, when filling the nth column you make a copy of columns 1 : (n - 1) and append the nth column to it, which becomes prohibitively slow when n is large.
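A sketch of the same workaround wrapped as a helper; na.rm = FALSE is added here (an assumption about the desired behaviour) so that leading NAs keep every column the same length and apply() can return a matrix:
library(xts)  # loads zoo, which provides na.locf
na_locf_wide <- function(x) {
  filled <- apply(x, 2, na.locf, na.rm = FALSE)  # fill each column on the plain matrix
  xts(filled, order.by = index(x))               # rebuild the xts object with the original index
}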

Transposing a data.table

What would be a good way to efficiently transpose a data.table after the data computation is over?
nrow=500e3
ncol=2000
m <- matrix(rnorm(nrow*ncol),nrow=nrow)
colnames(m) <- c('foo',seq(ncol-1))
dt <- data.table(m)
df <- as.data.frame(m)
dt <- t(dt) # takes a long time and converts the data.table to a matrix
Compute times:
1. To transpose the matrix:
system.time(mt <- t(m))
user system elapsed
20.005 0.016 20.024
2. To transpose the data.table:
system.time(dt <- t(dt))
user system elapsed
32.722 15.129 47.855
3. To transpose the data.frame:
system.time(df <- t(df))
user system elapsed
32.414 15.357 47.775
This is quite an old question, and since then data.table has added/exported transpose() for transposing lists. Performance-wise, it outperforms t() except on matrices (I think this is to be expected):
system.time(t(m))
# user system elapsed
# 23.990 23.416 85.722
system.time(t(dt))
# user system elapsed
# 31.223 53.197 195.221
system.time(t(df))
# user system elapsed
# 30.609 45.404 148.323
system.time(setDT(transpose(dt)))
# user system elapsed
# 42.135 38.478 116.599
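For reference, a tiny example of what data.table::transpose() returns; by default column names are dropped and new V1, V2, ... names are generated:
library(data.table)
small <- data.table(a = 1:3, b = 4:6)
transpose(small)  # 2 rows x 3 cols: row 1 is 1 2 3, row 2 is 4 5 6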
