First and foremost, I am looking for a faster way of subsetting/indexing a matrix many, many times over:
for (i in 1:99000) {
  subset.data <- data[index[, i], ]
}
Background:
I'm implementing a sequential testing procedure involving the bootstrap in R. Wanting to replicate some simulation results, I came upon
this bottleneck where lots of indexing needs to be done. To implement the block bootstrap, I created an index matrix with which I subset
the original data matrix to draw resamples of the data.
# The basic setup
B <- 1000 # no. of bootstrap replications
n <- 250 # no. of observations
m <- 100 # no. of models/data series
# Create index matrix with B columns and n rows.
# Each column represents a resampling of the data.
# (actually block resamples, but doesn't matter here).
boot.index <- matrix(sample(1:n, n * B, replace=T), nrow=n, ncol=B)
# Make matrix with m data series of length n.
sample.data <- matrix(rnorm(n * m), nrow=n, ncol=m)
subsetMatrix <- function(data, index) { # fn definition for timing
  subset.data <- data[index, ]
  return(subset.data)
}
# check how long it takes.
Rprof("subsetMatrix.out")
for (i in 1:(m - 1)) {
  for (b in 1:B) { # B * (m - 1) = 1000 * 99 = 99000
    boot.data <- subsetMatrix(sample.data, boot.index[, b])
    # do some other stuff
  }
  # do some more stuff
}
Rprof()
summaryRprof("subsetMatrix.out")
# > summaryRprof("subsetMatrix.out")
# $by.self
# self.time self.pct total.time total.pct
# subsetMatrix 9.96 100 9.96 100
# In the actual application:
#########
# > summaryRprof("seq_testing.out")
# $by.self
# self.time self.pct total.time total.pct
# subsetMatrix 6.78 53.98 6.78 53.98
# colMeans 1.98 15.76 2.20 17.52
# makeIndex 1.08 8.60 2.12 16.88
# makeStats 0.66 5.25 9.66 76.91
# runif 0.60 4.78 0.72 5.73
# apply 0.30 2.39 0.42 3.34
# is.data.frame 0.22 1.75 0.22 1.75
# ceiling 0.18 1.43 0.18 1.43
# aperm.default 0.14 1.11 0.14 1.11
# array 0.12 0.96 0.12 0.96
# estimateMCS 0.10 0.80 12.56 100.00
# as.vector 0.10 0.80 0.10 0.80
# matrix 0.08 0.64 0.08 0.64
# lapply 0.06 0.48 0.06 0.48
# / 0.04 0.32 0.04 0.32
# : 0.04 0.32 0.04 0.32
# rowSums 0.04 0.32 0.04 0.32
# - 0.02 0.16 0.02 0.16
# > 0.02 0.16 0.02 0.16
#
# $by.total
# total.time total.pct self.time self.pct
# estimateMCS 12.56 100.00 0.10 0.80
# makeStats 9.66 76.91 0.66 5.25
# subsetMatrix 6.78 53.98 6.78 53.98
# colMeans 2.20 17.52 1.98 15.76
# makeIndex 2.12 16.88 1.08 8.60
# runif 0.72 5.73 0.60 4.78
# doTest 0.68 5.41 0.00 0.00
# apply 0.42 3.34 0.30 2.39
# aperm 0.26 2.07 0.00 0.00
# is.data.frame 0.22 1.75 0.22 1.75
# sweep 0.20 1.59 0.00 0.00
# ceiling 0.18 1.43 0.18 1.43
# aperm.default 0.14 1.11 0.14 1.11
# array 0.12 0.96 0.12 0.96
# as.vector 0.10 0.80 0.10 0.80
# matrix 0.08 0.64 0.08 0.64
# lapply 0.06 0.48 0.06 0.48
# unlist 0.06 0.48 0.00 0.00
# / 0.04 0.32 0.04 0.32
# : 0.04 0.32 0.04 0.32
# rowSums 0.04 0.32 0.04 0.32
# - 0.02 0.16 0.02 0.16
# > 0.02 0.16 0.02 0.16
# mean 0.02 0.16 0.00 0.00
#
# $sample.interval
# [1] 0.02
#
# $sampling.time
# [1] 12.56
Doing the sequential testing procedure once takes about 10 seconds. Using this in simulations with 2500 replications and several
parameter constellations, it would take something like 40 days. With parallel processing and more CPU power it can be done faster, but
it is still not very satisfying :/
Is there a better way to resample the data / get rid of the loop?
Can apply, Vectorize, replicate etc. come in anywhere?
Would it make sense to implement the subsetting in C (e.g. manipulate some pointers)?
Even though every single step is already done incredibly fast by R, it's just not quite fast enough.
I'd be very glad indeed for any kind of response/help/advice!
related Qs:
- Fast matrix subsetting via '[': by rows, by columns or doesn't matter?
- fast function for generating bootstrap samples in matrix forms in R
- random sampling - matrix
From there, I tried:
mapply(function(row) return(sample.data[row,]), row = boot.index)
replicate(B, apply(sample.data, 2, sample, replace = TRUE))
didn't really do it for me.
I rewrote makeStats and makeIndex as they were two of the biggest bottlenecks:
makeStats <- function(data, index) {
  data.mean <- colMeans(data)
  m <- nrow(data)   # number of observations
  n <- ncol(index)  # number of bootstrap replications
  # Count how often each observation appears in each resample ...
  tabs <- lapply(1L:n, function(j) tabulate(index[, j], nbins = m))
  # ... and turn the counts into resampling weights (each column sums to 1).
  weights <- matrix(unlist(tabs), m, n) * (1 / nrow(index))
  # Bootstrap column means (centered at the original means) via one matrix
  # product instead of B explicit subsets.
  boot.data.mean <- t(data) %*% weights - data.mean
  return(list(data.mean = data.mean,
              boot.data.mean = boot.data.mean))
}
makeIndex <- function(B, blocks){
  n <- ncol(blocks)   # number of observations
  l <- nrow(blocks)   # block length
  z <- ceiling(n/l)   # blocks needed to cover n observations
  # Draw z starting blocks per replication, all at once.
  start.points <- sample.int(n, z * B, replace = TRUE)
  index <- blocks[, start.points]
  # The recycled logical keeps the first n indices of each replication
  # and drops the overhang beyond n.
  keep <- c(rep(TRUE, n), rep(FALSE, z*l - n))
  boot.index <- matrix(as.vector(index)[keep],
                       nrow = n, ncol = B)
  return(boot.index)
}
This brought the computation time down from 28 to 6 seconds on my machine. I bet there are other parts of the code that can be improved (including my use of lapply/tabulate above).
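As a quick sanity check on the weights-matrix trick above (a minimal sketch using the objects already defined: sample.data, boot.index and makeStats), the b-th column of boot.data.mean should match colMeans() of the explicitly subsetted data, up to the centering by data.mean:
stats <- makeStats(sample.data, boot.index)
b <- 1
explicit <- colMeans(sample.data[boot.index[, b], ]) - stats$data.mean
stopifnot(isTRUE(all.equal(explicit, stats$boot.data.mean[, b])))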
Related
I have the following function to reorder letters in a character vector.
reorder_letter <- function(x){
  sapply(strsplit(x, split = ""), function(x) paste(sort(toupper(x)), collapse = ""))
}
reorder_letter(c("trErty","Bca","def"))
#> [1] "ERRTTY" "ABC" "DEF"
Created on 2020-04-29 by the reprex package (v0.3.0)
Basically, I want to return the same letters of each string, but upper-cased and in sorted order.
Currently it takes around 1 minute to run on a vector of 1.5 million strings.
EDIT: I also tried parallelizing with the future.apply package, which is about 3x faster than the base R solution (and only a small change to the current code):
library(future.apply)
# plan(multisession)  # a parallel backend must be set for any speed-up
reorder_letter <- function(x){
  future_sapply(strsplit(x, split = ""), function(x) paste(sort(toupper(x)), collapse = ""))
}
I just wonder:
- How can I achieve my purpose efficiently?
- What is the best approach to find the bottleneck of a function? For example, now that I have this function finished, what is the next step?
Maybe utf8ToInt and intToUtf8 are faster than strsplit and paste.
x <- c("trErty","Bca","def")
unlist(lapply(x, function(y) {intToUtf8(sort(utf8ToInt(toupper(y))))}))
#[1] "ERRTTY" "ABC" "DEF"
Times: (It is not faster ... sorry)
But stringi is faster, and writing a C++ function is faster still (it can probably be improved further, but it is already about 10 times faster).
FrankZhang <- function(x) {
  unlist(lapply(strsplit(toupper(x), NULL), function(x) paste(sort(x), collapse = "")))
}
GKi <- function(x) {
  unlist(lapply(toupper(x), function(y) {intToUtf8(sort(utf8ToInt(y)))}))
}
library(stringi)
stringi <- function(y) {
  vapply(stri_split_boundaries(toupper(y), type = "character"),
         function(x) stri_c(x[stri_order(x)], collapse = ""), "")
}
library(Rcpp)
cppFunction("std::string GKiC(std::string &str) {
  std::sort(str.begin(), str.end());
  return(str);}")
GKi2 <- function(x) {unlist(lapply(toupper(x), GKiC))}
x <- apply(expand.grid(letters, LETTERS), 1, paste, collapse = "")
microbenchmark::microbenchmark(FrankZhang(x), GKi(x), stringi(x), GKi2(x), control=list(order="block"))
#Unit: milliseconds
# expr min lq mean median uq max neval cld
# FrankZhang(x) 17.533428 18.686879 20.380002 19.719311 21.014381 33.836692 100 d
# GKi(x) 16.551358 17.665436 18.656223 18.271688 19.343088 23.225199 100 c
# stringi(x) 4.644196 4.844622 5.082298 5.011344 5.237714 7.355251 100 b
# GKi2(x) 1.527124 1.624337 1.997725 1.691099 2.242797 5.593543 100 a
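A quick correctness check across the four implementations (a sketch; it assumes ASCII-only input, so the locale collation used by sort()/stri_order() agrees with the byte-wise std::sort in the C++ version):
x_small <- c("trErty", "Bca", "def")
stopifnot(identical(FrankZhang(x_small), GKi(x_small)),
          identical(FrankZhang(x_small), stringi(x_small)),
          identical(FrankZhang(x_small), GKi2(x_small)))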
To find out what uses most of the computation time, you can use Rprof, e.g.:
reorder_letter <- function(x) { # Function
  sapply(strsplit(x, split = ""), function(x) paste(sort(toupper(x)), collapse = ""))
}
x <- apply(expand.grid(letters, LETTERS, letters), 1, paste, collapse = "") #Data
Rprof()
y <- reorder_letter(x)
Rprof(NULL)
summaryRprof()
#$by.self
# self.time self.pct total.time total.pct
#"FUN" 0.12 20.69 0.54 93.10
#"sort.int" 0.10 17.24 0.22 37.93
#"paste" 0.08 13.79 0.42 72.41
#"sort" 0.06 10.34 0.34 58.62
#"sort.default" 0.06 10.34 0.28 48.28
#"match.arg" 0.04 6.90 0.10 17.24
#"eval" 0.04 6.90 0.04 6.90
#"sapply" 0.02 3.45 0.58 100.00
#"lapply" 0.02 3.45 0.56 96.55
#".doSortWrap" 0.02 3.45 0.02 3.45
#"formals" 0.02 3.45 0.02 3.45
#
#$by.total
# total.time total.pct self.time self.pct
#"sapply" 0.58 100.00 0.02 3.45
#"reorder_letter" 0.58 100.00 0.00 0.00
#"lapply" 0.56 96.55 0.02 3.45
#"FUN" 0.54 93.10 0.12 20.69
#"paste" 0.42 72.41 0.08 13.79
#"sort" 0.34 58.62 0.06 10.34
#"sort.default" 0.28 48.28 0.06 10.34
#"sort.int" 0.22 37.93 0.10 17.24
#"match.arg" 0.10 17.24 0.04 6.90
#"eval" 0.04 6.90 0.04 6.90
#".doSortWrap" 0.02 3.45 0.02 3.45
#"formals" 0.02 3.45 0.02 3.45
#
#$sample.interval
#[1] 0.02
#
#$sampling.time
#[1] 0.58
I need to get the results of the following function
getScore <- function(history, similarities) {
  nh <- ifelse(similarities < 0, 6 - history, history)
  x <- nh * abs(similarities)
  contados <- !is.na(history)
  x2 <- sum(x, na.rm = TRUE) / sum(abs(similarities[contados]), na.rm = TRUE)
  x2
}
For example for the following vectors:
notes <- c(1:5, NA)
history <- sample(notes, 1000000, replace=T)
similarities <- runif(1000000, -1,1)
These vectors change inside a loop. This takes:
ptm <- proc.time()
for (i in (1:10)) getScore(history, similarities)
proc.time() - ptm
user system elapsed
3.71 1.11 4.67
Initially I suspected that the problem was the for loop, but the profiling result points to ifelse().
Rprof("foo.out")
for (i in (1:10)) getScore(history, similarities)
Rprof(NULL)
summaryRprof("foo.out")
$by.self
self.time self.pct total.time total.pct
"ifelse" 2.96 65.78 3.48 77.33
"-" 0.24 5.33 0.24 5.33
"getScore" 0.22 4.89 4.50 100.00
"<" 0.22 4.89 0.22 4.89
"*" 0.22 4.89 0.22 4.89
"abs" 0.22 4.89 0.22 4.89
"sum" 0.22 4.89 0.22 4.89
"is.na" 0.12 2.67 0.12 2.67
"!" 0.08 1.78 0.08 1.78
$by.total
total.time total.pct self.time self.pct
"getScore" 4.50 100.00 0.22 4.89
"ifelse" 3.48 77.33 2.96 65.78
"-" 0.24 5.33 0.24 5.33
"<" 0.22 4.89 0.22 4.89
"*" 0.22 4.89 0.22 4.89
"abs" 0.22 4.89 0.22 4.89
"sum" 0.22 4.89 0.22 4.89
"is.na" 0.12 2.67 0.12 2.67
"!" 0.08 1.78 0.08 1.78
$sample.interval
[1] 0.02
$sampling.time
[1] 4.5
ifelse() is my performance bottleneck. Unless there is a way in R to speed up ifelse(), there is unlikely to be a great performance boost.
However, ifelse() is already a vectorized approach. It seems to me that the only option left is to use C/C++. But is there a way to avoid using compiled code?
You can use logical multiplication for this task to achieve the same effect:
s <- similarities < 0
nh <- s*(6-history) + (!s)*history
Benchmark on i7-3740QM:
f1 <- function(history, similarities) {
  s <- similarities < 0
  s*(6-history) + (!s)*history
}
f2 <- function(history, similarities) ifelse(similarities < 0, 6 - history, history)
f3 <- function(history, similarities) {
  nh <- history
  ind <- similarities < 0
  nh[ind] <- 6 - nh[ind]
  nh
}
library(microbenchmark)
microbenchmark(f1(history, similarities),
               f2(history, similarities),
               f3(history, similarities))
## Unit: milliseconds
## expr min lq mean median uq max neval cld
## f1(history, similarities) 22.830260 24.6167695 28.31384860 24.8986995 25.651655 81.043713 100 a
## f2(history, similarities) 364.514460 412.7117810 408.37156626 415.1011490 417.345748 437.977256 100 c
## f3(history, similarities) 84.220279 86.2894795 92.64614571 87.1801655 89.616522 149.243051 100 b
On E5-2680 v2:
## Unit: milliseconds
## expr min lq mean median uq max neval cld
## f1(history, similarities) 20.03963 20.10954 21.41055 20.68597 21.25920 50.95278 100 a
## f2(history, similarities) 314.54913 315.96621 324.91486 319.50290 325.93168 378.26016 100 c
## f3(history, similarities) 73.81413 73.92162 76.10418 74.79893 75.84634 105.98770 100 b
On T5600 (Core2 Duo Mobile):
## Unit: milliseconds
## expr min lq mean median uq max neval cld
## f1(history, similarities) 147.2953 152.9307 171.0870 155.5632 167.0998 344.7524 100 b
## f2(history, similarities) 408.5728 493.3886 517.0573 501.6993 525.8573 797.9624 100 c
## f3(history, similarities) 102.9621 110.6003 131.1826 112.9961 125.3906 303.1170 100 a
Aha! My approach is slower on the Core 2 architecture.
I have encountered this before. We don't have to use ifelse() all the time. If you have a look at how ifelse is written, by typing ifelse in your R console, you can see that this function is written in R, and it does various checks that are really inefficient.
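For example (a sketch; the exact source varies by R version), you can inspect it directly and see the type checks and the full-length temporary vectors it builds before the result is filled in:
ifelse        # prints the R-level definition
body(ifelse)  # just the function body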
Instead of using ifelse(), we can do this:
getScore <- function(history, similarities) {
  ######## old code #######
  # nh <- ifelse(similarities < 0, 6 - history, history)
  ######## old code #######
  ######## new code #######
  nh <- history
  ind <- similarities < 0
  nh[ind] <- 6 - nh[ind]
  ######## new code #######
  x <- nh * abs(similarities)
  contados <- !is.na(history)
  sum(x, na.rm=TRUE) / sum(abs(similarities[contados]), na.rm = TRUE)
}
And then let's check the profiling result again:
Rprof("foo.out")
for (i in (1:10)) getScore(history, similarities)
Rprof(NULL)
summaryRprof("foo.out")
# $by.total
# total.time total.pct self.time self.pct
# "getScore" 2.10 100.00 0.88 41.90
# "abs" 0.32 15.24 0.32 15.24
# "*" 0.26 12.38 0.26 12.38
# "sum" 0.26 12.38 0.26 12.38
# "<" 0.14 6.67 0.14 6.67
# "-" 0.14 6.67 0.14 6.67
# "!" 0.06 2.86 0.06 2.86
# "is.na" 0.04 1.90 0.04 1.90
# $sample.interval
# [1] 0.02
# $sampling.time
# [1] 2.1
We have a 2+ times boost in performance. Furthermore, the profile is more like a flat profile, without any single part dominating execution time.
In R, vector indexing / reading / writing happens at the speed of C code, so whenever we can, we should work directly with vectors.
Testing @Matthew's answer
mat_getScore <- function(history, similarities) {
  ######## old code #######
  # nh <- ifelse(similarities < 0, 6 - history, history)
  ######## old code #######
  ######## new code #######
  ind <- similarities < 0
  nh <- ind*(6-history) + (!ind)*history
  ######## new code #######
  x <- nh * abs(similarities)
  contados <- !is.na(history)
  sum(x, na.rm=TRUE) / sum(abs(similarities[contados]), na.rm = TRUE)
}
Rprof("foo.out")
for (i in (1:10)) mat_getScore(history, similarities)
Rprof(NULL)
summaryRprof("foo.out")
# $by.total
# total.time total.pct self.time self.pct
# "mat_getScore" 2.60 100.00 0.24 9.23
# "*" 0.76 29.23 0.76 29.23
# "!" 0.40 15.38 0.40 15.38
# "-" 0.34 13.08 0.34 13.08
# "+" 0.26 10.00 0.26 10.00
# "abs" 0.20 7.69 0.20 7.69
# "sum" 0.18 6.92 0.18 6.92
# "<" 0.16 6.15 0.16 6.15
# "is.na" 0.06 2.31 0.06 2.31
# $sample.interval
# [1] 0.02
# $sampling.time
# [1] 2.6
Ah? Slower?
The full profiling result shows that this approach spends more time on floating-point multiplication "*", and the logical negation "!" also seems pretty expensive, whereas my approach requires only floating-point addition/subtraction.
Well, the result might also be architecture-dependent. I am testing on Intel Nehalem (Intel Core 2 Duo) at the moment, so benchmarks of the two approaches on various platforms are welcome.
Remark
All profiling above uses the OP's data from the question.
Here is a faster ifelse; though it isn't faster than the above answers, it maintains the ifelse structure.
ifelse_sign <- function(b, x, y){
  x[!b] <- 0
  y[b] <- 0
  x + y + b * 0
}
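A minimal usage sketch with the OP's data (this assumes the condition vector contains no NAs, which holds here since similarities comes from runif(); with NAs in the condition, base ifelse() and ifelse_sign() behave differently):
nh <- ifelse_sign(similarities < 0, 6 - history, history)
all.equal(nh, ifelse(similarities < 0, 6 - history, history))  # should be TRUE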
Although I've figured this out before, I still find myself searching for (and unable to find) this syntax on Stack Overflow, so...
I want to do row-wise operations on a subset of a data.table's columns, using .SD and .SDcols. I can never remember if the operations need an sapply, an lapply, or if they belong inside the brackets of .SD.
As an example, say you have data for 10 students over two quarters. In both quarters they have two exams and a final exam. How would you take a straight average of the columns starting with q1?
Since overly trivial examples are annoying, I'd also like to calculate a weighted average for the columns starting with q2 (weights = 25%, 25%, and 50%).
library(data.table)
set.seed(10)
dt <- data.table(id = paste0("student_", sprintf("%02.f", 1:10)),
                 q1_exam1 = round(rnorm(10, .78, .05), 2),
                 q1_exam2 = round(rnorm(10, .68, .02), 2),
                 q1_final = round(rnorm(10, .88, .08), 2),
                 q2_exam1 = round(rnorm(10, .78, .05), 2),
                 q2_exam2 = round(rnorm(10, .68, .10), 2),
                 q2_final = round(rnorm(10, .88, .04), 2))
dt
# > dt
# id q1_exam1 q1_exam2 q1_final q2_exam1 q2_exam2 q2_final
# 1: student_01 0.78 0.70 0.83 0.69 0.79 0.86
# 2: student_02 0.77 0.70 0.71 0.78 0.60 0.87
# 3: student_03 0.71 0.68 0.83 0.83 0.60 0.93
# 4: student_04 0.75 0.70 0.71 0.79 0.76 0.97
# 5: student_05 0.79 0.69 0.78 0.71 0.58 0.90
# 6: student_06 0.80 0.68 0.85 0.71 0.68 0.91
# 7: student_07 0.72 0.66 0.82 0.80 0.70 0.84
# 8: student_08 0.76 0.68 0.81 0.69 0.65 0.90
# 9: student_09 0.70 0.70 0.87 0.76 0.61 0.85
# 10: student_10 0.77 0.69 0.86 0.75 0.75 0.89
Here are a few thoughts on your options, largely gathered from the comments:
apply along rows
The OP's approach uses apply(., 1, .) for the by-row operation, but this is discouraged because it unnecessarily coerces the data.table into a matrix. lapply/sapply are also not suitable, since they are designed to work on each column separately, not to combine them.
rowMeans and similarly-named functions also coerce to a matrix.
Split by rows
As @Jaap said, you can use by = 1:nrow(dt) for any row-wise operation, but it may be slow.
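A minimal sketch of that pattern for the q1 average (the column name q1_AVG_rowwise is just for illustration; it gives the right numbers here but creates one group per row, which is what makes it slow on big tables):
dt[, q1_AVG_rowwise := mean(unlist(.SD)), by = 1:nrow(dt),
   .SDcols = grep("^q1", names(dt))]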
Efficiently create new columns
This approach taken from eddi is probably the most efficient if you must keep your data in wide format:
jwts = list(
  q1_AVG  = c(q1_exam1 = 1,   q1_exam2 = 1,   q1_final = 1)/3,
  q2_WAVG = c(q2_exam1 = 1/4, q2_exam2 = 1/4, q2_final = 1/2)
)
for (newj in names(jwts)){
  w = jwts[[newj]]
  dt[, (newj) := Reduce("+", lapply(names(w), function(x) dt[[x]] * w[x]))]
}
This avoids coercion to a matrix and allows for different weighting rules (unlike rowMeans).
Go long
As @alexis_laz suggested, you might gain clarity and efficiency with a different structure, like
# reshape
m = melt(dt, id.vars="id", value.name="score")[,
  c("quarter","exam") := tstrsplit(variable, "_")][, variable := NULL]
# input your weighting rules
w = unique(m[,c("quarter","exam")])
w[quarter=="q1" , wt := 1/.N]
w[quarter=="q2" & exam=="final", wt := .5]
w[quarter=="q2" & exam!="final", wt := (1-.5)/.N]
# merge and compute
m[w, on=c("quarter","exam")][, sum(score*wt), by=.(id,quarter)]
This is what I would do.
In any case, you should have your weighting rules stored somewhere explicitly rather than entered on the fly if you want to scale up the number of quarters.
In this case it is possible to use the apply function in base R, but that doesn't take advantage of the data.table framework. Also, it doesn't generalize well to cases that require more conditional checking.
apply(dt[ , .SD, .SDcols = grep("^q1", colnames(dt))], 1, mean)
# > apply(dt[ , .SD, .SDcols = grep("^q1", colnames(dt))], 1, mean)
# [1] 0.7700000 0.7266667 0.7400000 0.7200000 0.7533333 0.7766667 0.7333333 0.7500000 0.7566667 0.7733333
In this case, again it's possible to put apply into the j argument of the data.table, and use it on the .SD columns:
dt[i = TRUE,
   q1_AVG := round(apply(.SD, 1, mean), 2),
   .SDcols = grep("^q1", colnames(dt))]
dt
# > dt
# id q1_exam1 q1_exam2 q1_final q2_exam1 q2_exam2 q2_final q1_AVG
# 1: student_01 0.78 0.70 0.83 0.69 0.79 0.86 0.77
# 2: student_02 0.77 0.70 0.71 0.78 0.60 0.87 0.73
# 3: student_03 0.71 0.68 0.83 0.83 0.60 0.93 0.74
# 4: student_04 0.75 0.70 0.71 0.79 0.76 0.97 0.72
# 5: student_05 0.79 0.69 0.78 0.71 0.58 0.90 0.75
# 6: student_06 0.80 0.68 0.85 0.71 0.68 0.91 0.78
# 7: student_07 0.72 0.66 0.82 0.80 0.70 0.84 0.73
# 8: student_08 0.76 0.68 0.81 0.69 0.65 0.90 0.75
# 9: student_09 0.70 0.70 0.87 0.76 0.61 0.85 0.76
# 10: student_10 0.77 0.69 0.86 0.75 0.75 0.89 0.77
The weighted average can be calculated using matrix multiplication:
dt[i = TRUE,
   q2_WAVG := round(as.matrix(.SD) %*% c(.25, .25, .50), 2),
   .SDcols = grep("^q2", colnames(dt))]
dt
# > dt
# id q1_exam1 q1_exam2 q1_final q2_exam1 q2_exam2 q2_final q1_AVG q2_WAVG
# 1: student_01 0.78 0.70 0.83 0.69 0.79 0.86 0.77 0.80
# 2: student_02 0.77 0.70 0.71 0.78 0.60 0.87 0.73 0.78
# 3: student_03 0.71 0.68 0.83 0.83 0.60 0.93 0.74 0.82
# 4: student_04 0.75 0.70 0.71 0.79 0.76 0.97 0.72 0.87
# 5: student_05 0.79 0.69 0.78 0.71 0.58 0.90 0.75 0.77
# 6: student_06 0.80 0.68 0.85 0.71 0.68 0.91 0.78 0.80
# 7: student_07 0.72 0.66 0.82 0.80 0.70 0.84 0.73 0.80
# 8: student_08 0.76 0.68 0.81 0.69 0.65 0.90 0.75 0.78
# 9: student_09 0.70 0.70 0.87 0.76 0.61 0.85 0.76 0.77
# 10: student_10 0.77 0.69 0.86 0.75 0.75 0.89 0.77 0.82
I have a large data frame with a factor column that I need to divide into three factor columns by splitting up the factor names by a delimiter. Here is my current approach, which is very slow with a large data frame (sometimes several million rows):
data <- readRDS("data.rds")
data.df <- reshape2:::melt.array(data)
head(data.df)
## Time Location Class Replicate Population
##1 1 1 LIDE.1.S 1 0.03859605
##2 2 1 LIDE.1.S 1 0.03852957
##3 3 1 LIDE.1.S 1 0.03846853
##4 4 1 LIDE.1.S 1 0.03841260
##5 5 1 LIDE.1.S 1 0.03836147
##6 6 1 LIDE.1.S 1 0.03831485
Rprof("str.out")
cl <- which(names(data.df)=="Class")
Classes <- do.call(rbind, strsplit(as.character(data.df$Class), "\\."))
colnames(Classes) <- c("Species", "SizeClass", "Infected")
data.df <- cbind(data.df[,1:(cl-1)],Classes,data.df[(cl+1):(ncol(data.df))])
Rprof(NULL)
head(data.df)
## Time Location Species SizeClass Infected Replicate Population
##1 1 1 LIDE 1 S 1 0.03859605
##2 2 1 LIDE 1 S 1 0.03852957
##3 3 1 LIDE 1 S 1 0.03846853
##4 4 1 LIDE 1 S 1 0.03841260
##5 5 1 LIDE 1 S 1 0.03836147
##6 6 1 LIDE 1 S 1 0.03831485
summaryRprof("str.out")
$by.self
self.time self.pct total.time total.pct
"strsplit" 1.34 50.00 1.34 50.00
"<Anonymous>" 1.16 43.28 1.16 43.28
"do.call" 0.04 1.49 2.54 94.78
"unique.default" 0.04 1.49 0.04 1.49
"data.frame" 0.02 0.75 0.12 4.48
"is.factor" 0.02 0.75 0.02 0.75
"match" 0.02 0.75 0.02 0.75
"structure" 0.02 0.75 0.02 0.75
"unlist" 0.02 0.75 0.02 0.75
$by.total
total.time total.pct self.time self.pct
"do.call" 2.54 94.78 0.04 1.49
"strsplit" 1.34 50.00 1.34 50.00
"<Anonymous>" 1.16 43.28 1.16 43.28
"cbind" 0.14 5.22 0.00 0.00
"data.frame" 0.12 4.48 0.02 0.75
"as.data.frame.matrix" 0.08 2.99 0.00 0.00
"as.data.frame" 0.08 2.99 0.00 0.00
"as.factor" 0.08 2.99 0.00 0.00
"factor" 0.06 2.24 0.00 0.00
"unique.default" 0.04 1.49 0.04 1.49
"unique" 0.04 1.49 0.00 0.00
"is.factor" 0.02 0.75 0.02 0.75
"match" 0.02 0.75 0.02 0.75
"structure" 0.02 0.75 0.02 0.75
"unlist" 0.02 0.75 0.02 0.75
"[.data.frame" 0.02 0.75 0.00 0.00
"[" 0.02 0.75 0.00 0.00
$sample.interval
[1] 0.02
$sampling.time
[1] 2.68
Is there any way to speed up this operation? I note that there is a small number (<5) of each of the categories "Species", "SizeClass", and "Infected", and I know what these are in advance.
Notes:
stringr::str_split_fixed performs this task, but not any faster
The data frame is actually initially generated by calling reshape::melt on an array in which Class and its associated levels are a dimension. If there's a faster way to get from there to here, great.
data.rds at http://dl.getdropbox.com/u/3356641/data.rds
This should probably offer quite an increase:
library(data.table)
DT <- data.table(data.df)
DT[, c("Species", "SizeClass", "Infected")
     := as.list(strsplit(as.character(Class), "\\.")[[1]]), by=Class ]
The reasons for the increase:
- data.table pre-allocates memory for columns
- every column assignment in a data.frame reassigns the entirety of the data (data.table, in contrast, does not)
- the by statement lets you run the strsplit task once per unique value.
Here is a nice quick method for the whole process.
# Save the new col names as a character vector
newCols <- c("Species", "SizeClass", "Infected")
# split the string, then convert the new cols to factors
DT[, c(newCols) := as.list(strsplit(as.character(Class), "\\.")[[1]]), by=Class ]
DT[, c(newCols) := lapply(.SD, factor), .SDcols=newCols]
# remove the old column. This is instantaneous.
DT[, Class := NULL]
## Have a look:
DT[, lapply(.SD, class)]
# Time Location Replicate Population Species SizeClass Infected
# 1: integer integer integer numeric factor factor factor
DT
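A side note (assuming data.table >= 1.9.6, which introduced tstrsplit): the same split can be written as a single assignment without the by=Class trick, starting again from a fresh DT:
DT <- data.table(data.df)
DT[, c(newCols) := tstrsplit(as.character(Class), ".", fixed = TRUE)]
DT[, c(newCols) := lapply(.SD, factor), .SDcols = newCols]
DT[, Class := NULL]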
You could get a decent increase in speed by just extracting the parts of the string you need using gsub instead of splitting everything up and trying to put it back together:
data <- readRDS("~/Downloads/data.rds")
data.df <- reshape2:::melt.array(data)
# using `strsplit`
system.time({
  cl <- which(names(data.df)=="Class")
  Classes <- do.call(rbind, strsplit(as.character(data.df$Class), "\\."))
  colnames(Classes) <- c("Species", "SizeClass", "Infected")
  data.df <- cbind(data.df[,1:(cl-1)],Classes,data.df[(cl+1):(ncol(data.df))])
})
user system elapsed
3.349 0.062 3.411
#using `gsub`
system.time({
  data.df$Class <- as.character(data.df$Class)
  data.df$SizeClass <- gsub("(\\w+)\\.(\\d+)\\.(\\w+)", "\\2", data.df$Class,
                            perl = TRUE)
  data.df$Infected <- gsub("(\\w+)\\.(\\d+)\\.(\\w+)", "\\3", data.df$Class,
                           perl = TRUE)
  data.df$Class <- gsub("(\\w+)\\.(\\d+)\\.(\\w+)", "\\1", data.df$Class,
                        perl = TRUE)
})
user system elapsed
0.812 0.037 0.848
Looks like you have a factor, so work on the levels and then map back. Use fixed = TRUE in strsplit (with split = ".") to avoid the regex machinery:
Classes <- do.call(rbind, strsplit(levels(data.df$Class), ".", fixed=TRUE))
colnames(Classes) <- c("Species", "SizeClass", "Infected")
df0 <- as.data.frame(Classes[data.df$Class,], row.names=NA)
cbind(data.df, df0)
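Why this works (a sketch): subsetting the Classes matrix with the factor uses its underlying integer codes, which are in the same order as levels(), so every row is mapped back to the split pieces of its own level. A quick check, assuming every level has exactly three dot-separated fields as in the example:
stopifnot(identical(
  paste(Classes[data.df$Class, 1], Classes[data.df$Class, 2],
        Classes[data.df$Class, 3], sep = "."),
  as.character(data.df$Class)
))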
I have a data.table object similar to this one
library(data.table)
c <- data.table(CO = c(10000,10000,10000,20000,20000,20000,20000),
                SH = c(1427,1333,1333,1000,1000,300,350),
                PRC = c(6.5,6.125,6.2,0.75,0.5,3,3.5),
                DAT = c(0.5,-0.5,0,-0.1,NA_real_,0.2,0.5),
                MM = c("A","A","A","A","A","B","B"))
and I am trying to perform calculations using nested grouping, passing an expression as an argument. Here is a simplified version of what I have:
setkey(c, MM)
mycalc <- quote({nobscc <- length(DAT[complete.cases(DAT)]);
                 list(MKTCAP = tail(SH,n=1)*tail(PRC,n=1),
                      SQSUM = ifelse(nobscc>=2, sum(DAT^2,na.rm=TRUE), NA_real_),
                      COVCOMP = ifelse(nobscc >= 2, head(DAT,n=1), NA_real_),
                      NOBS = nobscc)})
myresults <- c[,.SD[,{setkey=CO; eval(mycalc)},by=CO],by=MM]
which produces
MM CO MKTCAP SQSUM COVCOMP NOBS
[1,] A 10000 8264.6 0.50 0.5 3
[2,] A 20000 500.0 NA NA 1
[3,] B 20000 1225.0 0.29 0.2 2
In the example above I have two elements of the list which use the ifelse construct (in the actual code there are 3), all doing the same test : if the number of observations is greater than 2, then a certain calculation (which is different for each element of the list, and each could be written as a function) is to be performed, otherwise I want the value of the these elements to be NA. Another thing these elements have in common is that they use one and the same column of my data.table: the one called DAT.
So my question is: is there any way I can do the ifelse test only once, and if it is FALSE, pass the value NA to the respective elements of the list, and if TRUE, evaluate a different expression for each of the elements of the list?
NOTE: My goal is to reduce the run time (system and elapsed). If this modification would not reduce the time or the amount of computation, bearing in mind I have 72 million observations, then that is an acceptable answer too. I also welcome suggestions to change other parts of the code.
EDIT: Results of summaryRprof()
$by.total
total.time total.pct self.time self.pct
"system.time" 18.94 99.79 0.00 0.00
".Call" 18.92 99.68 0.10 0.53
"[" 18.92 99.68 0.04 0.21
"[.data.table" 18.92 99.68 0.02 0.11
"eval" 18.80 99.05 0.24 1.26
"ifelse" 18.30 96.42 0.46 2.42
"lm" 17.70 93.26 0.58 3.06
"sapply" 8.06 42.47 0.36 1.90
"model.frame" 7.74 40.78 0.16 0.84
"model.frame.default" 7.58 39.94 0.98 5.16
"lapply" 6.62 34.88 0.70 3.69
"FUN" 4.24 22.34 1.10 5.80
"model.matrix" 4.04 21.29 0.02 0.11
"model.matrix.default" 4.02 21.18 0.26 1.37
"match" 3.66 19.28 0.86 4.53
".getXlevels" 3.12 16.44 0.12 0.63
"na.omit" 2.40 12.64 0.24 1.26
"%in%" 2.30 12.12 0.34 1.79
"simplify2array" 2.24 11.80 0.12 0.63
"na.omit.data.frame" 2.16 11.38 0.14 0.74
"[.data.frame" 2.12 11.17 1.18 6.22
"deparse" 1.80 9.48 0.66 3.48
"unique" 1.80 9.48 0.54 2.85
"[[" 1.52 8.01 0.12 0.63
"[[.data.frame" 1.40 7.38 0.54 2.85
".deparseOpts" 1.34 7.06 0.96 5.06
"paste" 1.32 6.95 0.16 0.84
"lm.fit" 1.20 6.32 0.64 3.37
"mode" 1.14 6.01 0.14 0.74
"unlist" 1.12 5.90 0.56 2.95
Instead of forming and operating on data subsets like this:
setkey(c,MM)
myresults <- c[, .SD[,{setkey=CO; eval(mycalc)},by=CO], by=MM]
You could try doing this:
setkeyv(c, c("MM", "CO"))
myresults <- c[, eval(mycalc), by=key(c)]
This should speed up your code, since it avoids all of the nested subsetting of .SD objects, each of which requires its own call to [.data.table.
On your original question, I doubt the ifelse evaluations are taking much time, but if you want to avoid them, you could take them out of mycalc and use := to overwrite the desired values with NA:
mycalc <- quote(list(MKTCAP = tail(SH,n=1)*tail(PRC,n=1),
                     SQSUM = sum(DAT^2,na.rm=TRUE),
                     COVCOMP = head(DAT,n=1),
                     NOBS = length(DAT[complete.cases(DAT)])))
setkeyv(c, c("MM", "CO"))
myresults <- c[, eval(mycalc), by=key(c)]
myresults[NOBS<2, c("SQSUM", "COVCOMP") := NA_real_]
## Or, alternatively
# myresults[NOBS<2, SQSUM := NA_real_]
# myresults[NOBS<2, COVCOMP := NA_real_]