Fastest by column sort in R - r

I have a data frame full from which I want to take the last column and a column v. I then want to sort both columns on v in the fastest way possible. full is read in from a csv but this can be used for testing (included some NAs for realism):
n <- 200000
full <- data.frame(A = runif(n, 1, 10000), B = floor(runif(n, 0, 1.9)))
full[sample(n, 10000), 'A'] <- NA
v <- 1
I have v as one here, but in reality it could change, and full has many columns.
I have tried sorting data frames, data tables and matrices each with order and sort.list (some ideas taken from this thread). The code for all these:
# DATA FRAME
ord_df <- function() {
a <- full[c(v, length(full))]
a[with(a, order(a[1])), ]
}
sl_df <- function() {
a <- full[c(v, length(full))]
a[sort.list(a[[1]]), ]
}
# DATA TABLE
require(data.table)
ord_dt <- function() {
a <- as.data.table(full[c(v, length(full))])
colnames(a)[1] <- 'values'
a[order(values)]
}
sl_dt <- function() {
a <- as.data.table(full[c(v, length(full))])
colnames(a)[1] <- 'values'
a[sort.list(values)]
}
# MATRIX
ord_mat <- function() {
a <- as.matrix(full[c(v, length(full))])
a[order(a[, 1]), ]
}
sl_mat <- function() {
a <- as.matrix(full[c(v, length(full))])
a[sort.list(a[, 1]), ]
}
Time results:
ord_df sl_df ord_dt sl_dt ord_mat sl_mat
Min. 0.230 0.1500 0.1300 0.120 0.140 0.1400
Median 0.250 0.1600 0.1400 0.140 0.140 0.1400
Mean 0.244 0.1610 0.1430 0.136 0.142 0.1450
Max. 0.250 0.1700 0.1600 0.140 0.160 0.1600
Or using microbenchmark (results are in milliseconds):
min lq median uq max
1 ord_df() 243.0647 248.2768 254.0544 265.2589 352.3984
2 ord_dt() 133.8159 140.0111 143.8202 148.4957 181.2647
3 ord_mat() 140.5198 146.8131 149.9876 154.6649 191.6897
4 sl_df() 152.6985 161.5591 166.5147 171.2891 194.7155
5 sl_dt() 132.1414 139.7655 144.1281 149.6844 188.8592
6 sl_mat() 139.2420 146.8578 151.6760 156.6174 186.5416
Seems like ordering the data table wins. There isn't all that much difference between order and sort.list except when using data frames where sort.list is much faster.
In the data table versions I also tried setting v as the key (since it is then sorted according to the documentation) but I couldn't get it work since the contents of v are not integer.
I would ideally like to speed this up as much as possible since I have to do it many times for different v values. Does anyone know how I might be able to speed this process up even further? Also might it be worth trying an Rcpp implementation? Thanks.
Here's the code I used for timing if it's useful to anyone:
sortMethods <- list(ord_df, sl_df, ord_dt, sl_dt, ord_mat, sl_mat)
require(plyr)
timings <- raply(10, sapply(sortMethods, function(x) system.time(x())[[3]]))
colnames(timings) <- c('ord_df', 'sl_df', 'ord_dt', 'sl_dt', 'ord_mat', 'sl_mat')
apply(timings, 2, summary)
require(microbenchmark)
mb <- microbenchmark(ord_df(), sl_df(), ord_dt(), sl_dt(), ord_mat(), sl_mat())
plot(mb)

I don't know if it's better to put this sort of thing in as an edit but it seems more like answer so here will do. Updated test functions:
n <- 1e7
full <- data.frame(A = runif(n, 1, 10000), B = floor(runif(n, 0, 1.9)))
full[sample(n, 100000), 'A'] <- NA
fdf <- full
fma <- as.matrix(full)
fdt <- as.data.table(full)
setnames(fdt, colnames(fdt)[1], 'values')
# DATA FRAME
ord_df <- function() { fdf[order(fdf[1]), ] }
sl_df <- function() { fdf[sort.list(fdf[[1]]), ] }
# DATA TABLE
require(data.table)
ord_dt <- function() { fdt[order(values)] }
key_dt <- function() {
setkey(fdt, values)
fdt
}
# MATRIX
ord_mat <- function() { fma[order(fma[, 1]), ] }
sl_mat <- function() { fma[sort.list(fma[, 1]), ] }
Results (using a different computer, R 2.13.1 and data.table 1.8.2):
ord_df sl_df ord_dt key_dt ord_mat sl_mat
Min. 37.56 20.86 2.946 2.249 20.22 20.21
1st Qu. 37.73 21.15 2.962 2.255 20.54 20.59
Median 38.43 21.74 3.002 2.280 21.05 20.82
Mean 38.76 21.75 3.074 2.395 21.09 20.95
3rd Qu. 39.85 22.18 3.151 2.445 21.48 21.42
Max. 40.36 23.08 3.330 2.797 22.41 21.84
So data.table is the clear winner. Using a key is faster than ordering, and has a nicer syntax as well I'd argue. Thanks for the help everyone.

Related

Speed up Tidyverse Calculation of Multiple Quantiles

I have this great little function summarise_posterior (given below) as part of my package driver (available here).
The function is great and super useful. The one problem is that I have been working on larger and larger data and it can be very slow. In short, my question is: Is there a tidyverse-esque way of speeding this up while still retaining the key flexibility of this function (see examples in documentation).
At least one key speed up could come from figuring out how to put the calculation of the quantiles in a single call rather than calling the quantile function over and over. The latter option which is currently implemented is probably re-sorting same vectors over and over.
#' Shortcut for summarize variable with quantiles and mean
#'
#' #param data tidy data frame
#' #param var variable name (unquoted) to be summarised
#' #param ... other expressions to pass to summarise
#'
#' #return data.frame
#' #export
#' #details Notation: \code{pX} refers to the \code{X}\% quantile
#' #import dplyr
#' #importFrom stats quantile
#' #importFrom rlang quos quo UQ
#' #examples
#' d <- data.frame("a"=sample(1:10, 50, TRUE),
#' "b"=rnorm(50))
#'
#' # Summarize posterior for b over grouping of a and also calcuate
#' # minmum of b (in addition to normal statistics returned)
#' d <- dplyr::group_by(d, a)
#' summarise_posterior(d, b, mean.b = mean(b), min=min(b))
summarise_posterior <- function(data, var, ...){
qvar <- enquo(var)
qs <- quos(...)
data %>%
summarise(p2.5 = quantile(!!qvar, prob=0.025),
p25 = quantile(!!qvar, prob=0.25),
p50 = quantile(!!qvar, prob=0.5),
mean = mean(!!qvar),
p75 = quantile(!!qvar, prob=0.75),
p97.5 = quantile(!!qvar, prob=0.975),
!!!qs)
}
Rcpp back-end options are also more than welcome.
Thanks!
Here's a solution that makes use of nesting to avoid calling quantile multiple times. Any time you need to store a vector of results inside summarize, simply wrap it inside list. Afterwards, you can unnest these results, pair them up against their names, and use spread to put them in separate columns:
summarise_posterior2 <- function(data, var, ...){
qvar <- ensym(var)
vq <- c(0.025, 0.25, 0.5, 0.75, 0.975)
summarise( data, .qq = list(quantile(!!qvar, vq, names=FALSE)),
.nms = list(str_c("p", vq*100)), mean = mean(!!qvar), ... ) %>%
unnest %>% spread( .nms, .qq )
}
This doesn't give you nearly the same speed up as #jay.sf's solution
d <- data.frame("a"=sample(1:10, 5e5, TRUE), "b"=rnorm(5e5))
microbenchmark::microbenchmark( f1 = summarise_posterior(d, b, mean.b = mean(b), min=min(b)),
f2 = summarise_posterior2(d, b, mean.b = mean(b), min=min(b)) )
# Unit: milliseconds
# expr min lq mean median uq max neval
# f1 49.06697 50.81422 60.75100 52.43030 54.17242 200.2961 100
# f2 29.05209 29.66022 32.32508 30.84492 32.56364 138.9579 100
but it will work correctly with group_by and inside nested functions (whereas substitute-based solutions will break when nested).
r1 <- d %>% dplyr::group_by(a) %>% summarise_posterior(b, mean.b = mean(b), min=min(b))
r2 <- d %>% dplyr::group_by(a) %>% summarise_posterior2(b, mean.b = mean(b), min=min(b))
all_equal( r1, r2 ) # TRUE
If you profile the code, you can see where the major hang ups are
Rprof()
for( i in 1:100 )
d %>% dplyr::group_by(a) %>% summarise_posterior2(b, mean.b = mean(b), min=min(b))
Rprof(NULL)
summaryRprof()$by.self %>% head
# self.time self.pct total.time total.pct
# ".Call" 1.84 49.73 3.18 85.95
# "sort.int" 0.94 25.41 1.12 30.27
# "eval" 0.08 2.16 3.64 98.38
# "tryCatch" 0.08 2.16 1.44 38.92
# "anyNA" 0.08 2.16 0.08 2.16
# "structure" 0.04 1.08 0.08 2.16
The .Call corresponds mainly to the C++ backend of dplyr, while sort.int is the worker behind quantile(). #jay.sf's solution gains a major speedup by decoupling from dplyr, but it also loses the associated flexibility (e.g., integration with group_by). Ultimately, it's up to you to decide which is more important.
Why not something like this?
summarise_posterior2 <- function(data, x, ...){
x <- deparse(substitute(x))
nm <- deparse(substitute(...))
M <- matrix(unlist(data[, x]), ncol=length(data[, x]))
qs <- t(sapply(list(...), do.call, list(M)))
'rownames<-'(cbind(p2.5 = quantile(M, prob=0.025),
p25 = quantile(M, prob=0.25),
p50 = quantile(M, prob=0.5),
mean = mean(M),
p75 = quantile(M, prob=0.75),
p97.5 = quantile(M, prob=0.975), qs), NULL
)
}
> summarise_posterior2(df1, X4, mean=mean, mean=mean, min=min)
p2.5 p25 p50 mean p75 p97.5 mean mean min
[1,] 28.2 30 32 32 34 35.8 32 32 28
# > summarise_posterior(df1, X4, mean.b = mean(X4), min=min(X4))
# p2.5 p25 p50 mean p75 p97.5 mean.b min
# 1 28.2 30 32 32 34 35.8 32 28
Runs six times faster:
> microbenchmark::microbenchmark(orig.fun=summarise_posterior(df1, X4, max(X4), min(X4)),
+ new.fun=summarise_posterior2(df1, X4, max=max, min=min))
Unit: microseconds
expr min lq mean median uq max neval
orig.fun 4289.541 4324.490 4514.1634 4362.500 4411.225 8928.316 100
new.fun 716.071 734.694 802.9949 755.867 778.317 4759.439 100
Data
df1 <- data.frame(matrix(1:144, 9, 16))

Using dplyr to make sample from data frame

I have a very large data frame (150.000.000 rows) with a format like this:
df = data.frame(pnr = rep(500+2*(1:15),each=3), x = runif(3*15))
pnr is person id and x is some data. I would like to sample 10% of the persons. Is there a fast way to do this in dplyr?
The following is a solution, but it is slow because of the merge-statement
prns = as.data.frame(unique(df$prn))
names(prns)[1] = "prn"
prns$s = rbinom(nrow(prns),1,0.1)
df = merge(df,prns)
df2 = df[df$s==1,]
I would actually suggest the "data.table" package over "dplyr" for this. Here's an example with some big-ish sample data (not much smaller than your own 15 million rows).
I'll also show some right and wrong ways to do things :-)
Here's the sample data.
library(data.table)
library(dplyr)
library(microbenchmark)
set.seed(1)
mydf <- DT <- data.frame(person = sample(10000, 1e7, TRUE),
value = runif(1e7))
We'll also create a "data.table" and set the key to "person". Creating the "data.table" takes no significant time, but setting the key can.
system.time(setDT(DT))
# user system elapsed
# 0.001 0.000 0.001
## Setting the key takes some time, but is worth it
system.time(setkey(DT, person))
# user system elapsed
# 0.620 0.025 0.646
I can't think of a more efficient way to select your "person" values than the following, so I've removed these from the benchmarks--they are common to all approaches.
## Common to all tests...
A <- unique(mydf$person)
B <- sample(A, ceiling(.1 * length(A)), FALSE)
For convenience, the different tests are presented as functions...
## Base R #1
fun1a <- function() {
mydf[mydf$person %in% B, ]
}
## Base R #2--sometimes using `which` makes things quicker
fun1b <- function() {
mydf[which(mydf$person %in% B), ]
}
## `filter` from "dplyr"
fun2 <- function() {
filter(mydf, person %in% B)
}
## The "wrong" way to do this with "data.table"
fun3a <- function() {
DT[which(person %in% B)]
}
## The "right" (I think) way to do this with "data.table"
fun3b <- function() {
DT[J(B)]
}
Now, we can benchmark:
## The benchmarking
microbenchmark(fun1a(), fun1b(), fun2(), fun3a(), fun3b(), times = 20)
# Unit: milliseconds
# expr min lq median uq max neval
# fun1a() 382.37534 394.27968 396.76076 406.92431 494.32220 20
# fun1b() 401.91530 413.04710 416.38470 425.90150 503.83169 20
# fun2() 381.78909 394.16716 395.49341 399.01202 417.79044 20
# fun3a() 387.35363 397.02220 399.18113 406.23515 413.56128 20
# fun3b() 28.77801 28.91648 29.01535 29.37596 42.34043 20
Look at the performance we get from using "data.table" the right way! All the other approaches are impressively fast though.
summary shows the results to be the same. (The row order for the "data.table" solution would be different since it has been sorted.)
summary(fun1a())
# person value
# Min. : 16 Min. :0.000002
# 1st Qu.:2424 1st Qu.:0.250988
# Median :5075 Median :0.500259
# Mean :4958 Mean :0.500349
# 3rd Qu.:7434 3rd Qu.:0.749601
# Max. :9973 Max. :1.000000
summary(fun2())
# person value
# Min. : 16 Min. :0.000002
# 1st Qu.:2424 1st Qu.:0.250988
# Median :5075 Median :0.500259
# Mean :4958 Mean :0.500349
# 3rd Qu.:7434 3rd Qu.:0.749601
# Max. :9973 Max. :1.000000
summary(fun3b())
# person value
# Min. : 16 Min. :0.000002
# 1st Qu.:2424 1st Qu.:0.250988
# Median :5075 Median :0.500259
# Mean :4958 Mean :0.500349
# 3rd Qu.:7434 3rd Qu.:0.749601
# Max. :9973 Max. :1.000000
In base R, to sample 10% of the rows, rounding up to the next row
> df[sample(nrow(df), ceiling(0.1*nrow(df)), FALSE), ]
## pnr x
## 16 512 0.9781232
## 21 514 0.5279925
## 33 522 0.8332834
## 14 510 0.7989481
## 4 504 0.7825318
or rounding down to the next row
> df[sample(nrow(df), floor(0.1*nrow(df)), FALSE), ]
## pnr x
## 43 530 0.449985180
## 35 524 0.996350657
## 2 502 0.499871966
## 25 518 0.005199058
or sample 10% of the pnr column, rounding up
> sample(df$pnr, ceiling(0.1*length(df$pnr)), FALSE)
## [1] 530 516 526 518 514
ADD:
If you're looking to sample 10% of the people (unique pnr ID), and return those people and their respective data, I think you want
> S <- sample(unique(df$pnr), ceiling(0.1*length(unique(df$pnr))), FALSE)
> df[df$pnr %in% S, ]
## pnr x
## 1 502 0.7630667
## 2 502 0.4998720
## 3 502 0.4839460
## 22 516 0.8248153
## 23 516 0.5795991
## 24 516 0.1572472
PS: I would wait for a dplyr answer. It will likely be quicker on 15mil rows.
If you don't necessarily want a thoroughly random sample, then you could do
filter(df, pnr %% 10 ==0).
Which would take every 10th person (you could get 10 different samples by changing to ==1,...). You could make this random by re-allocating IDs randomly - fairly trivial to do this using sample(15)[(df$pnr-500)/2] for your toy example - reversing the mapping of pnr onto a set that's suitable for sample might be less easy for the real-world case.

R - streamline subsampling procedure

I have a dataset composed of values obtained from studies and experiments. Experiments are nested within studies. I want to subsample the dataset so that only 1 experiment is represented for each study. I want to repeat this procedure 10,000 times, randomly drawing the 1 experiment each time, and then calculate some summary statistics for the values. Here is an example dataset:
df=data.frame(study=c(1,1,2,2,2,3,4,4),expt=c(1,2,1,2,3,1,1,2),value=runif(8))
I wrote the following function to do the above, but it is taking forever. Does anyone have any suggestions for streamlining this code? Thanks!
subsample=function(x,A) {
subsample.list=sapply(1:A,function(m) {
idx=ddply(x,c("study"),function(i) sample(1:nrow(i),1)) #Sample one experiment from each study
x[paste(x$study,x$expt,sep="-") %in% paste(idx$study,idx$V1,sep="-"),"value"] } ) #Match the study-experiment combinations and retrieve values
means.list=ldply(subsample.list,mean) #Calculate the mean of 'values' for each iteration
c(quantile(means.list$V1,0.025),mean(means.list$V1),upper=quantile(means.list$V1,0.975)) } #Calculate overall means and 95% CIs
You can vectorise this way more (even using plyr), and go much much faster:
function=yoursummary(x)c(quantile(x,0.025),mean(x),upper=quantile(x,0.975))
subsampleX=function(x,M)
yoursummary(
aaply(
daply(.drop_o=F,df,.(study),
function(x)sample(x$value,M,replace=T)
),1,mean
)
)
The trick here is to do all the sampling up front. If we want to sample M times, why not do all that while you have access to the study.
Original code:
> system.time(subsample(df,20000))
user system elapsed
123.23 0.06 124.74
New vectorised code:
> system.time(subsampleX(df,20000))
user system elapsed
0.24 0.00 0.25
That's about 500x faster.
Here's a base R solution which avoids ddply for speed reasons:
df=data.frame(study=c(1,1,2,2,2,3,4,4),expt=c(1,2,1,2,3,1,1,2),value=runif(8))
sample.experiments <- function(df) {
r <- rle(df$study)
samp <- sapply( r$lengths , function(x) sample(seq(x),1) )
start.idx <- c(0,cumsum(r$lengths)[1:(length(r$lengths)-1)] )
df[samp + start.idx,]
}
> sample.experiments(df)
study expt value
1 1 1 0.6113196
4 2 2 0.5026527
6 3 1 0.2803080
7 4 1 0.9824377
Benchmarks
> m <- microbenchmark(
+ ddply(df,.(study),function(i) i[sample(1:nrow(i),1),]) ,
+ sample.experiments(df)
+ )
> m
Unit: microseconds
expr min lq median uq max
1 ddply(df, .(study), function(i) i[sample(1:nrow(i), 1), ]) 3808.652 3883.632 3936.805 4022.725 6530.506
2 sample.experiments(df) 337.327 350.734 357.644 365.915 580.097

quick/elegant way to construct mean/variance summary table

I can achieve this task, but I feel like there must be a "best" (slickest, most compact, clearest-code, fastest?) way of doing it and have not figured it out so far ...
For a specified set of categorical factors I want to construct a table of means and variances by group.
generate data:
set.seed(1001)
d <- expand.grid(f1=LETTERS[1:3],f2=letters[1:3],
f3=factor(as.character(as.roman(1:3))),rep=1:4)
d$y <- runif(nrow(d))
d$z <- rnorm(nrow(d))
desired output:
f1 f2 f3 y.mean y.var
1 A a I 0.6502307 0.09537958
2 A a II 0.4876630 0.11079670
3 A a III 0.3102926 0.20280568
4 A b I 0.3914084 0.05869310
5 A b II 0.5257355 0.21863126
6 A b III 0.3356860 0.07943314
... etc. ...
using aggregate/merge:
library(reshape)
m1 <- aggregate(y~f1*f2*f3,data=d,FUN=mean)
m2 <- aggregate(y~f1*f2*f3,data=d,FUN=var)
mvtab <- merge(rename(m1,c(y="y.mean")),
rename(m2,c(y="y.var")))
using ddply/summarise (possibly best but haven't been able to make it work):
mvtab2 <- ddply(subset(d,select=-c(z,rep)),
.(f1,f2,f3),
summarise,numcolwise(mean),numcolwise(var))
results in
Error in output[[var]][rng] <- df[[var]] :
incompatible types (from closure to logical) in subassignment type fix
using melt/cast (maybe best?)
mvtab3 <- cast(melt(subset(d,select=-c(z,rep)),
id.vars=1:3),
...~.,fun.aggregate=c(mean,var))
## now have to drop "variable"
mvtab3 <- subset(mvtab3,select=-variable)
## also should rename response variables
Won't (?) work in reshape2. Explaining ...~. to someone could be tricky!
Here is a solution using data.table
library(data.table)
d2 = data.table(d)
ans = d2[,list(avg_y = mean(y), var_y = var(y)), 'f1, f2, f3']
I'm a bit puzzled. Does this not work:
mvtab2 <- ddply(d,.(f1,f2,f3),
summarise,y.mean = mean(y),y.var = var(y))
This give me something like this:
f1 f2 f3 y.mean y.var
1 A a I 0.6502307 0.095379578
2 A a II 0.4876630 0.110796695
3 A a III 0.3102926 0.202805677
4 A b I 0.3914084 0.058693103
5 A b II 0.5257355 0.218631264
Which is in the right form, but it looks like the values are different that what you specified.
Edit
Here's how to make your version with numcolwise work:
mvtab2 <- ddply(subset(d,select=-c(z,rep)),.(f1,f2,f3),summarise,
y.mean = numcolwise(mean)(piece),
y.var = numcolwise(var)(piece))
You forgot to pass the actual data to numcolwise. And then there's the little ddply trick that each piece is called piece internally. (Which Hadley points out in the comments shouldn't be relied upon as it may change in future versions of plyr.)
(I voted for Joshua's.) Here's an Hmisc::summary.formula solution. The advantage of this for me is that it is well integrated with the Hmisc::latex output "channel".
summary(y ~ interaction(f3,f2,f1), data=d, method="response",
fun=function(y) c(mean.y=mean(y) ,var.y=var(y) ))
#-----output----------
y N=108
+-----------------------+-------+---+---------+-----------+
| | |N |mean.y |var.y |
+-----------------------+-------+---+---------+-----------+
|interaction(f3, f2, f1)|I.a.A | 4|0.6502307|0.095379578|
| |II.a.A | 4|0.4876630|0.110796695|
snipped output to show the latex -> PDF -> png output:
#joran is spot-on with the ddply answer. Here's how I would do it with aggregate. Note that I avoid the formula interface (it is slower).
aggregate(d$y, d[,c("f1","f2","f3")], FUN=function(x) c(mean=mean(x),var=var(x)))
I'm slightly addicted to speed comparisons even though they're largely irrelevant for me in this situation ...
joran_ddply <- function(d) ddply(d,.(f1,f2,f3),
summarise,y.mean = mean(y),y.var = var(y))
joshulrich_aggregate <- function(d) {
aggregate(d$y, d[,c("f1","f2","f3")],
FUN=function(x) c(mean=mean(x),var=var(x)))
}
formula_aggregate <- function(d) {
aggregate(y~f1*f2*f3,data=d,
FUN=function(x) c(mean=mean(x),var=var(x)))
}
library(data.table)
d2 <- data.table(d)
ramnath_datatable <- function(d) {
d[,list(avg_y = mean(y), var_y = var(y)), 'f1, f2, f3']
}
library(Hmisc)
dwin_hmisc <- function(d) {summary(y ~ interaction(f3,f2,f1),
data=d, method="response",
fun=function(y) c(mean.y=mean(y) ,var.y=var(y) ))
}
library(rbenchmark)
benchmark(joran_ddply(d),
joshulrich_aggregate(d),
ramnath_datatable(d2),
formula_aggregate(d),
dwin_hmisc(d))
aggregate is fastest (even faster than data.table, which is a surprise to me, although things might be different with a bigger table to aggregate), even using the formula interface ...)
test replications elapsed relative user.self sys.self
5 dwin_hmisc(d) 100 1.235 2.125645 1.168 0.044
4 formula_aggregate(d) 100 0.703 1.209983 0.656 0.036
1 joran_ddply(d) 100 3.345 5.757315 3.152 0.144
2 joshulrich_aggregate(d) 100 0.581 1.000000 0.596 0.000
3 ramnath_datatable(d2) 100 0.750 1.290878 0.708 0.000
(Now I just need Dirk to step up and post an Rcpp solution that is 1000 times faster than anything else ...)
I find the doBy package has some very convenient functions for things like this. For example, the function ?summaryBy is quite handy. Consider:
> summaryBy(y~f1+f2+f3, data=d, FUN=c(mean, var))
f1 f2 f3 y.mean y.var
1 A a I 0.6502307 0.095379578
2 A a II 0.4876630 0.110796695
3 A a III 0.3102926 0.202805677
4 A b I 0.3914084 0.058693103
5 A b II 0.5257355 0.218631264
6 A b III 0.3356860 0.079433136
7 A c I 0.3367841 0.079487973
8 A c II 0.6273320 0.041373836
9 A c III 0.4532720 0.022779672
10 B a I 0.6688221 0.044184575
11 B a II 0.5514724 0.020359289
12 B a III 0.6389354 0.104056229
13 B b I 0.5052346 0.138379070
14 B b II 0.3933283 0.050261804
15 B b III 0.5953874 0.161943989
16 B c I 0.3490460 0.079286849
17 B c II 0.5534569 0.207381592
18 B c III 0.4652424 0.187463143
19 C a I 0.3340988 0.004994589
20 C a II 0.3970315 0.126967554
21 C a III 0.3580250 0.066769484
22 C b I 0.7676858 0.124945402
23 C b II 0.3613772 0.182689385
24 C b III 0.4175562 0.095933470
25 C c I 0.3592491 0.039832864
26 C c II 0.7882591 0.084271963
27 C c III 0.3936949 0.085758343
So the function call is simple, easy to use, and I would say, elegant.
Now, if your primary concern is speed, it seems that it would be reasonable--at least with smaller sized tasks (note that I couldn't get the ramnath_datatable function to work for whatever reason):
test replications elapsed relative user.self
4 dwin_hmisc(d) 100 0.50 2.778 0.50
3 formula_aggregate(d) 100 0.23 1.278 0.24
5 gung_summaryBy(d) 100 0.34 1.889 0.35
1 joran_ddply(d) 100 1.34 7.444 1.32
2 joshulrich_aggregate(d) 100 0.18 1.000 0.19
I've came accross with this question and found the benchmarks are done with small tables, so it's hard to tell which method is better with 100 rows.
I've also modified the data a bit also to make it "unsorted", this would be a more common case, for example as the data is in a DB.
I've added a few more data.table trials to see if setting a key is faster beforehand. It seems here, setting the key beforehand doesn't improve much the performance, so ramnath solution seems to be the fastest.
set.seed(1001)
d <- data.frame(f1 = sample(LETTERS[1:3], 30e5, replace = T), f2 = sample(letters[1:3], 30e5, replace = T),
f3 = sample(factor(as.character(as.roman(1:3))), 30e5, replace = T), rep = sample(1:4, replace = T))
d$y <- runif(nrow(d))
d$z <- rnorm(nrow(d))
str(d)
require(Hmisc)
require(plyr)
require(data.table)
d2 = data.table(d)
d3 = data.table(d)
# Set key of d3 to compare how fast it is if the DT is already keyded
setkey(d3,f1,f2,f3)
joran_ddply <- function(d) ddply(d,.(f1,f2,f3),
summarise,y.mean = mean(y),y.var = var(y))
formula_aggregate <- function(d) {
aggregate(y~f1*f2*f3,data=d,
FUN=function(x) c(mean=mean(x),var=var(x)))
}
ramnath_datatable <- function(d) {
d[,list(avg_y = mean(y), var_y = var(y)), 'f1,f2,f3']
}
key_agg_datatable <- function(d) {
setkey(d2,f1,f2,f3)
d[,list(avg_y = mean(y), var_y = var(y)), 'f1,f2,f3']
}
one_key_datatable <- function(d) {
setkey(d2,f1)
d[,list(avg_y = mean(y), var_y = var(y)), 'f1,f2,f3']
}
including_3key_datatable <- function(d) {
d[,list(avg_y = mean(y), var_y = var(y)), 'f1,f2,f3']
}
dwin_hmisc <- function(d) {summary(y ~ interaction(f3,f2,f1),
data=d, method="response",
fun=function(y) c(mean.y=mean(y) ,var.y=var(y) ))
}
require(rbenchmark)
benchmark(joran_ddply(d),
joshulrich_aggregate(d),
ramnath_datatable(d2),
including_3key_datatable(d3),
one_key_datatable(d2),
key_agg_datatable(d2),
formula_aggregate(d),
dwin_hmisc(d)
)
# test replications elapsed relative user.self sys.self
# dwin_hmisc(d) 100 1757.28 252.121 1590.89 165.65
# formula_aggregate(d) 100 433.56 62.204 390.83 42.50
# including_3key_datatable(d3) 100 7.00 1.004 6.02 0.98
# joran_ddply(d) 100 173.39 24.877 119.35 53.95
# joshulrich_aggregate(d) 100 328.51 47.132 307.14 21.22
# key_agg_datatable(d2) 100 24.62 3.532 19.13 5.50
# one_key_datatable(d2) 100 29.66 4.255 22.28 7.34
# ramnath_datatable(d2) 100 6.97 1.000 5.96 1.01
And here is a solution using Hadley Wickham's new dplyr library.
library(dplyr)
d %>% group_by(f1, f2, f3) %>%
summarise(y.mean = mean(y), z.mean = mean(z))

Equivalent to rowMeans() for min()

I have seen this question being asked multiple times on the R mailing list, but still could not find a satisfactory answer.
Suppose I a matrix m
m <- matrix(rnorm(10000000), ncol=10)
I can get the mean of each row by:
system.time(rowMeans(m))
user system elapsed
0.100 0.000 0.097
But obtaining the minimum value of each row by
system.time(apply(m,1,min))
user system elapsed
16.157 0.400 17.029
takes more than 100 times as long, is there a way to speed this up?
You could use pmin, but you would have to get each column of your matrix into a separate vector. One way to do that is to convert it to a data.frame then call pmin via do.call (since data.frames are lists).
system.time(do.call(pmin, as.data.frame(m)))
# user system elapsed
# 0.940 0.000 0.949
system.time(apply(m,1,min))
# user system elapsed
# 16.84 0.00 16.95
Quite late to the party, but as the author of matrixStats and in case someone spots this, please note that matrixStats::rowMins() is very fast these days, e.g.
library(microbenchmark)
library(Biobase) # rowMin()
library(matrixStats) # rowMins()
options(digits=3)
m <- matrix(rnorm(10000000), ncol=10)
stats <- microbenchmark(
rowMeans(m), ## A benchmark by OP
rowMins(m),
rowMin(m),
do.call(pmin, as.data.frame(m)),
apply(m, MARGIN=1L, FUN=min),
times=10
)
> stats
Unit: milliseconds
expr min lq mean median uq max
rowMeans(m) 77.7 82.7 85.7 84.4 90.3 98.2
rowMins(m) 72.9 74.1 88.0 79.0 90.2 147.4
rowMin(m) 341.1 347.1 395.9 383.4 395.1 607.7
do.call(pmin, as.data.frame(m)) 326.4 357.0 435.4 401.0 437.6 657.9
apply(m, MARGIN = 1L, FUN = min) 3761.9 3963.8 4120.6 4109.8 4198.7 4567.4
If you want to stick to CRAN packages, then both the matrixStats and the fBasics packages have the function rowMins [note the s which is not in the Biobase function] and a variety of other row and column statistics.
library("sos")
findFn("rowMin")
gets a hit in the Biobase package, from Bioconductor ...
source("http://bioconductor.org/biocLite.R")
biocLite("Biobase")
m <- matrix(rnorm(10000000), ncol=10)
system.time(rowMeans(m))
## user system elapsed
## 0.132 0.148 0.279
system.time(apply(m,1,min))
## user system elapsed
## 11.825 1.688 13.603
library(Biobase)
system.time(rowMin(m))
## user system elapsed
## 0.688 0.172 0.864
Not as fast as rowMeans, but a lot faster than apply(...,1,min)
I've been meaning to try out the new compiler package in R 2.13.0. This essentially follows the post outlined by Dirk here.
library(compiler)
library(rbenchmark)
rowMin <- function(x, ind) apply(x, ind, min)
crowMin <- cmpfun(rowMin)
benchmark(
rowMin(m,1)
, crowMin(m,1)
, columns=c("test", "replications","elapsed","relative")
, order="relative"
, replications=10)
)
And the results:
test replications elapsed relative
2 crowMin(m, 1) 10 120.091 1.0000
1 rowMin(m, 1) 10 122.745 1.0221
Anticlimatic to say the least, though looks like you've gotten some other good options.
Not particularly R-idiosyncratic, but surely the fastest method is just to use pmin and loop over columns:
x <- m[,1]
for (i in 2:ncol(m)) x <- pmin(x, m[,i])
On my machine that takes just 3 times longer than rowMeans for the 1e+07x10 matrix, and is slightly faster than the do.call method via data.frame.

Resources