Using dplyr to sample from a data frame - r

I have a very large data frame (15,000,000 rows) with a format like this:
df = data.frame(pnr = rep(500+2*(1:15),each=3), x = runif(3*15))
pnr is person id and x is some data. I would like to sample 10% of the persons. Is there a fast way to do this in dplyr?
The following is a solution, but it is slow because of the merge step:
prns = as.data.frame(unique(df$pnr))
names(prns)[1] = "pnr"
prns$s = rbinom(nrow(prns),1,0.1)
df = merge(df,prns)
df2 = df[df$s==1,]

I would actually suggest the "data.table" package over "dplyr" for this. Here's an example with some big-ish sample data (not much smaller than your own 15 million rows).
I'll also show some right and wrong ways to do things :-)
Here's the sample data.
library(data.table)
library(dplyr)
library(microbenchmark)
set.seed(1)
mydf <- DT <- data.frame(person = sample(10000, 1e7, TRUE),
                         value = runif(1e7))
We'll also create a "data.table" and set the key to "person". Creating the "data.table" takes no significant time, but setting the key can.
system.time(setDT(DT))
# user system elapsed
# 0.001 0.000 0.001
## Setting the key takes some time, but is worth it
system.time(setkey(DT, person))
# user system elapsed
# 0.620 0.025 0.646
I can't think of a more efficient way to select your "person" values than the following, so I've removed these from the benchmarks--they are common to all approaches.
## Common to all tests...
A <- unique(mydf$person)
B <- sample(A, ceiling(.1 * length(A)), FALSE)
For convenience, the different tests are presented as functions...
## Base R #1
fun1a <- function() {
  mydf[mydf$person %in% B, ]
}
## Base R #2--sometimes using `which` makes things quicker
fun1b <- function() {
  mydf[which(mydf$person %in% B), ]
}
## `filter` from "dplyr"
fun2 <- function() {
  filter(mydf, person %in% B)
}
## The "wrong" way to do this with "data.table"
fun3a <- function() {
  DT[which(person %in% B)]
}
## The "right" (I think) way to do this with "data.table"
fun3b <- function() {
  DT[J(B)]
}
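A side note on fun3b: with the key set, J(B) performs a keyed binary-search join rather than a vector scan, which is where the speedup comes from. In current data.table versions, .(B) is an equivalent spelling (a sketch only; not benchmarked below):
fun3c <- function() {
  DT[.(B)]  # same keyed binary-search join as DT[J(B)]
}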
Now, we can benchmark:
## The benchmarking
microbenchmark(fun1a(), fun1b(), fun2(), fun3a(), fun3b(), times = 20)
# Unit: milliseconds
#    expr       min        lq    median        uq       max neval
# fun1a() 382.37534 394.27968 396.76076 406.92431 494.32220    20
# fun1b() 401.91530 413.04710 416.38470 425.90150 503.83169    20
#  fun2() 381.78909 394.16716 395.49341 399.01202 417.79044    20
# fun3a() 387.35363 397.02220 399.18113 406.23515 413.56128    20
# fun3b()  28.77801  28.91648  29.01535  29.37596  42.34043    20
Look at the performance we get from using "data.table" the right way! All the other approaches are impressively fast though.
summary shows the results to be the same. (The row order for the "data.table" solution would be different since it has been sorted.)
summary(fun1a())
# person value
# Min. : 16 Min. :0.000002
# 1st Qu.:2424 1st Qu.:0.250988
# Median :5075 Median :0.500259
# Mean :4958 Mean :0.500349
# 3rd Qu.:7434 3rd Qu.:0.749601
# Max. :9973 Max. :1.000000
summary(fun2())
# person value
# Min. : 16 Min. :0.000002
# 1st Qu.:2424 1st Qu.:0.250988
# Median :5075 Median :0.500259
# Mean :4958 Mean :0.500349
# 3rd Qu.:7434 3rd Qu.:0.749601
# Max. :9973 Max. :1.000000
summary(fun3b())
# person value
# Min. : 16 Min. :0.000002
# 1st Qu.:2424 1st Qu.:0.250988
# Median :5075 Median :0.500259
# Mean :4958 Mean :0.500349
# 3rd Qu.:7434 3rd Qu.:0.749601
# Max. :9973 Max. :1.000000
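If you want a stricter check than summary, one sketch (my addition here, not part of the benchmarks) is to sort both results and compare them:
o1 <- fun1a()
o1 <- o1[order(o1$person, o1$value), ]
o3 <- as.data.frame(fun3b())
o3 <- o3[order(o3$person, o3$value), ]
all.equal(o1, o3, check.attributes = FALSE)  # should be TRUE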

In base R, to sample 10% of the rows, rounding up to the next row
> df[sample(nrow(df), ceiling(0.1*nrow(df)), FALSE), ]
## pnr x
## 16 512 0.9781232
## 21 514 0.5279925
## 33 522 0.8332834
## 14 510 0.7989481
## 4 504 0.7825318
or rounding down to the next row
> df[sample(nrow(df), floor(0.1*nrow(df)), FALSE), ]
## pnr x
## 43 530 0.449985180
## 35 524 0.996350657
## 2 502 0.499871966
## 25 518 0.005199058
or sample 10% of the pnr column, rounding up
> sample(df$pnr, ceiling(0.1*length(df$pnr)), FALSE)
## [1] 530 516 526 518 514
ADD:
If you're looking to sample 10% of the people (unique pnr ID), and return those people and their respective data, I think you want
> S <- sample(unique(df$pnr), ceiling(0.1*length(unique(df$pnr))), FALSE)
> df[df$pnr %in% S, ]
## pnr x
## 1 502 0.7630667
## 2 502 0.4998720
## 3 502 0.4839460
## 22 516 0.8248153
## 23 516 0.5795991
## 24 516 0.1572472
PS: I would wait for a dplyr answer. It will likely be quicker on 15 million rows.
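For reference, one minimal dplyr sketch of the same person-level idea (ids is a helper name introduced here):
library(dplyr)
ids <- unique(df$pnr)
df %>% filter(pnr %in% sample(ids, ceiling(0.1 * length(ids))))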

If you don't necessarily want a thoroughly random sample, then you could do
filter(df, pnr %% 10 == 0)
which would take every 10th person, assuming consecutive IDs (you could get 10 different samples by changing to == 1, ...). You could make this random by re-allocating IDs randomly - fairly trivial to do using sample(15)[(df$pnr-500)/2] for your toy example - though reversing the mapping of pnr onto a set that's suitable for sample might be less easy in the real-world case.
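To make the re-allocation concrete, a sketch for the toy data (newid is a helper introduced here):
set.seed(1)
newid <- sample(15)[(df$pnr - 500) / 2]  # a random 1..15 id for each person
df2 <- df[newid <= ceiling(0.1 * 15), ]  # keep the persons mapped to the lowest ids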

Related

Is there a way to pass a subset of time series data with the map or lapply command in r?

To improve my R programming and simplify my code, I'm trying to replace a 'for' loop with 'lapply', 'map', or a similar variant. I want to perform a function using 2 minute time intervals from my time series data. When I try to pass just a subset of the data using one of the functionals, 'map' or 'lapply', I get a 'subscript out of bounds error'. Any ideas?
library(lubridate)
library(purrr)
library(stats)
library(xts)
# Set up the time series data
t <- ymd_hms("2020-01-01 08:00:00","2020-01-01 08:01:00","2020-01-01 08:02:00", "2020-01-01 08:03:00")
tsData <- as.xts(1:4,order.by=(t))
# Set up the summary time periods; in this case every 2 minutes
timeSlots <- seq(from=t[1],to=t[length(t)],by=120)
# Make sure it also summarizes the last period
lastTime <- stats::time(tsData[nrow(tsData)])
# The next 4 lines iterate through the time series and print a summary for each 2 minute time period;
# This is the loop I want to replace with 'map'
for (i in 1:length(timeSlots)) {
  if (i < length(timeSlots)) {
    print(summary(tsData[paste(timeSlots[i], '/', (timeSlots[i+1]-1), sep = '')]))
  }
  # This makes sure the last subset includes the last observation
  else print(summary(tsData[paste(timeSlots[i], '/', lastTime, sep = '')]))
}
# This next statement gets a subscript out of bounds error
lapply (timeSlots, function(x) summary(tsData[x:x+1]))
# This next statement gets a subscript out of bounds error
map (timeSlots,function(x) summary(tsData[x:x+1]))
We can loop over the sequence and paste as in the for loop. (The subscript error in your attempts comes from operator precedence: tsData[x:x+1] is parsed as tsData[(x:x) + 1], and : coerces the POSIXct timestamp to its numeric epoch value, so you end up indexing by a number in the billions.)
library(xts)
lapply(seq_along(timeSlots), function(i)
  if (i < length(timeSlots)) {
    summary(tsData[paste(timeSlots[i], timeSlots[i+1] - 1, sep = "/")])
  } else {
    summary(tsData[paste(timeSlots[i], '/', lastTime, sep = '')])
  }
)
#[[1]]
# Index tsData[paste(timeSlots[i], timeSlots[i + 1] - 1, sep = "/")]
# Min. :2020-01-01 08:00:00 Min. :1.00
# 1st Qu.:2020-01-01 08:00:15 1st Qu.:1.25
# Median :2020-01-01 08:00:30 Median :1.50
# Mean :2020-01-01 08:00:30 Mean :1.50
# 3rd Qu.:2020-01-01 08:00:45 3rd Qu.:1.75
# Max. :2020-01-01 08:01:00 Max. :2.00
#[[2]]
# Index tsData[paste(timeSlots[i], "/", lastTime, sep = "")]
# Min. :2020-01-01 08:02:00 Min. :3.00
# 1st Qu.:2020-01-01 08:02:15 1st Qu.:3.25
# Median :2020-01-01 08:02:30 Median :3.50
# Mean :2020-01-01 08:02:30 Mean :3.50
# 3rd Qu.:2020-01-01 08:02:45 3rd Qu.:3.75
# Max. :2020-01-01 08:03:00 Max. :4.00
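The same pattern works with purrr, since map is a drop-in replacement for lapply here (a sketch equivalent to the call above):
library(purrr)
map(seq_along(timeSlots), function(i)
  if (i < length(timeSlots)) {
    summary(tsData[paste(timeSlots[i], timeSlots[i + 1] - 1, sep = "/")])
  } else {
    summary(tsData[paste(timeSlots[i], lastTime, sep = "/")])
  })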

Group by State and Year using Panel data [duplicate]

Whenever I want to do something "map"py in R, I usually try to use a function in the apply family.
However, I've never quite understood the differences between them -- how {sapply, lapply, etc.} apply the function to the input/grouped input, what the output will look like, or even what the input can be -- so I often just go through them all until I get what I want.
Can someone explain how to use which one when?
My current (probably incorrect/incomplete) understanding is...
sapply(vec, f): input is a vector. output is a vector/matrix, where element i is f(vec[i]), giving you a matrix if f has a multi-element output
lapply(vec, f): same as sapply, but output is a list?
apply(matrix, 1/2, f): input is a matrix. output is a vector, where element i is f(row/col i of the matrix)
tapply(vector, grouping, f): output is a matrix/array, where an element in the matrix/array is the value of f at a grouping g of the vector, and g gets pushed to the row/col names
by(dataframe, grouping, f): let g be a grouping. apply f to each column of the group/dataframe. pretty print the grouping and the value of f at each column.
aggregate(matrix, grouping, f): similar to by, but instead of pretty printing the output, aggregate sticks everything into a dataframe.
Side question: I still haven't learned plyr or reshape -- would plyr or reshape replace all of these entirely?
R has many *apply functions which are ably described in the help files (e.g. ?apply). There are enough of them, though, that beginning useRs may have difficulty deciding which one is appropriate for their situation or even remembering them all. They may have a general sense that "I should be using an *apply function here", but it can be tough to keep them all straight at first.
Despite the fact (noted in other answers) that much of the functionality of the *apply family is covered by the extremely popular plyr package, the base functions remain useful and worth knowing.
This answer is intended to act as a sort of signpost for new useRs to help direct them to the correct *apply function for their particular problem. Note, this is not intended to simply regurgitate or replace the R documentation! The hope is that this answer helps you to decide which *apply function suits your situation and then it is up to you to research it further. With one exception, performance differences will not be addressed.
apply - When you want to apply a function to the rows or columns
of a matrix (and higher-dimensional analogues); not generally advisable for data frames as it will coerce to a matrix first.
# Two dimensional matrix
M <- matrix(seq(1,16), 4, 4)
# apply min to rows
apply(M, 1, min)
[1] 1 2 3 4
# apply max to columns
apply(M, 2, max)
[1] 4 8 12 16
# 3 dimensional array
M <- array( seq(32), dim = c(4,4,2))
# Apply sum across each M[*, , ] - i.e. sum across the 2nd and 3rd dimensions
apply(M, 1, sum)
# Result is one-dimensional
[1] 120 128 136 144
# Apply sum across each M[*, *, ] - i.e. sum across the 3rd dimension
apply(M, c(1,2), sum)
# Result is two-dimensional
[,1] [,2] [,3] [,4]
[1,] 18 26 34 42
[2,] 20 28 36 44
[3,] 22 30 38 46
[4,] 24 32 40 48
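As an aside on the data frame caveat above, a quick sketch (df0 is a toy frame introduced here) of why apply can surprise you: the matrix coercion forces every column to a common type.
df0 <- data.frame(x = 1:2, y = c("a", "b"))
apply(df0, 2, class)  # both columns arrive as character after coercion
#           x           y
# "character" "character"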
If you want row/column means or sums for a 2D matrix, be sure to
investigate the highly optimized, lightning-quick colMeans,
rowMeans, colSums, rowSums.
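For instance (a minimal sketch):
M2 <- matrix(seq(1, 16), 4, 4)
rowSums(M2)  # same values as apply(M2, 1, sum), computed in optimized C code
# [1] 28 32 36 40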
lapply - When you want to apply a function to each element of a
list in turn and get a list back.
This is the workhorse of many of the other *apply functions. Peel
back their code and you will often find lapply underneath.
x <- list(a = 1, b = 1:3, c = 10:100)
lapply(x, FUN = length)
$a
[1] 1
$b
[1] 3
$c
[1] 91
lapply(x, FUN = sum)
$a
[1] 1
$b
[1] 6
$c
[1] 5005
sapply - When you want to apply a function to each element of a
list in turn, but you want a vector back, rather than a list.
If you find yourself typing unlist(lapply(...)), stop and consider
sapply.
x <- list(a = 1, b = 1:3, c = 10:100)
# Compare with above; a named vector, not a list
sapply(x, FUN = length)
a b c
1 3 91
sapply(x, FUN = sum)
a b c
1 6 5005
In more advanced uses of sapply it will attempt to coerce the
result to a multi-dimensional array, if appropriate. For example, if our function returns vectors of the same length, sapply will use them as columns of a matrix:
sapply(1:5,function(x) rnorm(3,x))
If our function returns a 2 dimensional matrix, sapply will do essentially the same thing, treating each returned matrix as a single long vector:
sapply(1:5,function(x) matrix(x,2,2))
Unless we specify simplify = "array", in which case it will use the individual matrices to build a multi-dimensional array:
sapply(1:5,function(x) matrix(x,2,2), simplify = "array")
Each of these behaviors is of course contingent on our function returning vectors or matrices of the same length or dimension.
vapply - When you want to use sapply but perhaps need to
squeeze some more speed out of your code or want more type safety.
For vapply, you basically give R an example of what sort of thing
your function will return, which can save some time coercing returned
values to fit in a single atomic vector.
x <- list(a = 1, b = 1:3, c = 10:100)
#Note that since the advantage here is mainly speed, this
# example is only for illustration. We're telling R that
# everything returned by length() should be an integer of
# length 1.
vapply(x, FUN = length, FUN.VALUE = 0L)
a b c
1 3 91
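To see the type safety in action, here is a sketch where the template is violated (the exact error wording may vary across R versions):
# length() returns an integer, which fits a numeric template, but not a character one:
vapply(x, FUN = length, FUN.VALUE = "")
# Error in vapply(x, FUN = length, FUN.VALUE = "") :
#   values must be type 'character',
#   but FUN(X[[1]]) result is type 'integer'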
mapply - For when you have several data structures (e.g.
vectors, lists) and you want to apply a function to the 1st elements
of each, and then the 2nd elements of each, etc., coercing the result
to a vector/array as in sapply.
This is multivariate in the sense that your function must accept
multiple arguments.
#Sums the 1st elements, the 2nd elements, etc.
mapply(sum, 1:5, 1:5, 1:5)
[1] 3 6 9 12 15
#To do rep(1,4), rep(2,3), etc.
mapply(rep, 1:4, 4:1)
[[1]]
[1] 1 1 1 1
[[2]]
[1] 2 2 2
[[3]]
[1] 3 3
[[4]]
[1] 4
Map - A wrapper to mapply with SIMPLIFY = FALSE, so it is guaranteed to return a list.
Map(sum, 1:5, 1:5, 1:5)
[[1]]
[1] 3
[[2]]
[1] 6
[[3]]
[1] 9
[[4]]
[1] 12
[[5]]
[1] 15
rapply - For when you want to apply a function to each element of a nested list structure, recursively.
To give you some idea of how uncommon rapply is, I forgot about it when first posting this answer! Obviously, I'm sure many people use it, but YMMV. rapply is best illustrated with a user-defined function to apply:
# Append ! to string, otherwise increment
myFun <- function(x){
  if(is.character(x)){
    return(paste(x, "!", sep = ""))
  } else {
    return(x + 1)
  }
}
#A nested list structure
l <- list(a = list(a1 = "Boo", b1 = 2, c1 = "Eeek"),
b = 3, c = "Yikes",
d = list(a2 = 1, b2 = list(a3 = "Hey", b3 = 5)))
# Result is named vector, coerced to character
rapply(l, myFun)
# Result is a nested list like l, with values altered
rapply(l, myFun, how="replace")
tapply - For when you want to apply a function to subsets of a
vector and the subsets are defined by some other vector, usually a
factor.
The black sheep of the *apply family, of sorts. The help file's use of
the phrase "ragged array" can be a bit confusing, but it is actually
quite simple.
A vector:
x <- 1:20
A factor (of the same length!) defining groups:
y <- factor(rep(letters[1:5], each = 4))
Add up the values in x within each subgroup defined by y:
tapply(x, y, sum)
a b c d e
10 26 42 58 74
More complex examples can be handled where the subgroups are defined
by the unique combinations of a list of several factors. tapply is
similar in spirit to the split-apply-combine functions that are
common in R (aggregate, by, ave, ddply, etc.) Hence its
black sheep status.
On a side note, here is how the various plyr functions correspond to the base *apply functions (from the Intro to plyr document on the plyr webpage http://had.co.nz/plyr/):
Base function   Input   Output   plyr function
----------------------------------------------
aggregate       d       d        ddply + colwise
apply           a       a/l      aaply / alply
by              d       l        dlply
lapply          l       l        llply
mapply          a       a/l      maply / mlply
replicate       r       a/l      raply / rlply
sapply          l       a        laply
One of the goals of plyr is to provide consistent naming conventions for each of the functions, encoding the input and output data types in the function name. It also provides consistency in output, in that output from dlply() is easily passable to ldply() to produce useful output, etc.
Conceptually, learning plyr is no more difficult than understanding the base *apply functions.
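A sketch of that dlply-to-ldply round trip (assuming plyr is installed):
library(plyr)
# Split mtcars by cyl, fit one model per group, then collapse the
# coefficients back into a data frame:
models <- dlply(mtcars, "cyl", function(d) lm(mpg ~ wt, data = d))
ldply(models, coef)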
plyr and reshape functions have replaced almost all of these functions in my every day use. But, also from the Intro to Plyr document:
Related functions tapply and sweep have no corresponding function in plyr, and remain useful. merge is useful for combining summaries with the original data.
From slide 21 of http://www.slideshare.net/hadley/plyr-one-data-analytic-strategy: [image of the base-to-plyr correspondence grid, with input types on the left and output types on the top]. Hopefully it's clear that apply corresponds to Hadley's aaply and aggregate corresponds to Hadley's ddply, etc. Slide 20 of the same slideshare will clarify if you don't get it from the grid.
First start with Joran's excellent answer -- doubtful anything can better that.
Then the following mnemonics may help to remember the distinctions between each. Whilst some are obvious, others may be less so --- for these you'll find justification in Joran's discussions.
Mnemonics
lapply is a list apply which acts on a list or vector and returns a list.
sapply is a simple lapply (function defaults to returning a vector or matrix when possible)
vapply is a verified apply (allows the return object type to be prespecified)
rapply is a recursive apply for nested lists, i.e. lists within lists
tapply is a tagged apply where the tags identify the subsets
apply is generic: applies a function to a matrix's rows or columns (or, more generally, to dimensions of an array)
Building the Right Background
If using the apply family still feels a bit alien to you, then it might be that you're missing a key point of view.
These two articles can help. They provide the necessary background to motivate the functional programming techniques that are being provided by the apply family of functions.
Users of Lisp will recognise the paradigm immediately. If you're not familiar with Lisp, once you get your head around FP, you'll have gained a powerful point of view for use in R -- and apply will make a lot more sense.
Advanced R: Functional Programming, by Hadley Wickham
Simple Functional Programming in R, by Michael Barton
Since the (very excellent) answers of this post lack explanations of by and aggregate, here is my contribution.
BY
The by function, as stated in the documentation, can be thought of as a "wrapper" for tapply. The power of by arises when we want to compute a task that tapply can't handle. One example is this code:
ct <- tapply(iris$Sepal.Width , iris$Species , summary )
cb <- by(iris$Sepal.Width , iris$Species , summary )
cb
iris$Species: setosa
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.300 3.200 3.400 3.428 3.675 4.400
--------------------------------------------------------------
iris$Species: versicolor
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.000 2.525 2.800 2.770 3.000 3.400
--------------------------------------------------------------
iris$Species: virginica
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.200 2.800 3.000 2.974 3.175 3.800
ct
$setosa
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.300 3.200 3.400 3.428 3.675 4.400
$versicolor
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.000 2.525 2.800 2.770 3.000 3.400
$virginica
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.200 2.800 3.000 2.974 3.175 3.800
If we print these two objects, ct and cb, we "essentially" have the same results; the only differences are in how they are shown and in their class attributes, by for cb and array for ct.
As I've said, the power of by arises when we can't use tapply; the following code is one example:
tapply(iris, iris$Species, summary )
Error in tapply(iris, iris$Species, summary) :
arguments must have same length
R says that the arguments must have the same length. We want to calculate the summary of all the variables in iris along the factor Species, but tapply just can't do that, because it does not know how to handle a data frame.
With the by function, R dispatches a specific method for the data frame class and then lets the summary function work even though the length (and the type) of the first argument is different.
bywork <- by(iris, iris$Species, summary )
bywork
iris$Species: setosa
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
Min. :4.300 Min. :2.300 Min. :1.000 Min. :0.100 setosa :50
1st Qu.:4.800 1st Qu.:3.200 1st Qu.:1.400 1st Qu.:0.200 versicolor: 0
Median :5.000 Median :3.400 Median :1.500 Median :0.200 virginica : 0
Mean :5.006 Mean :3.428 Mean :1.462 Mean :0.246
3rd Qu.:5.200 3rd Qu.:3.675 3rd Qu.:1.575 3rd Qu.:0.300
Max. :5.800 Max. :4.400 Max. :1.900 Max. :0.600
--------------------------------------------------------------
iris$Species: versicolor
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
Min. :4.900 Min. :2.000 Min. :3.00 Min. :1.000 setosa : 0
1st Qu.:5.600 1st Qu.:2.525 1st Qu.:4.00 1st Qu.:1.200 versicolor:50
Median :5.900 Median :2.800 Median :4.35 Median :1.300 virginica : 0
Mean :5.936 Mean :2.770 Mean :4.26 Mean :1.326
3rd Qu.:6.300 3rd Qu.:3.000 3rd Qu.:4.60 3rd Qu.:1.500
Max. :7.000 Max. :3.400 Max. :5.10 Max. :1.800
--------------------------------------------------------------
iris$Species: virginica
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
Min. :4.900 Min. :2.200 Min. :4.500 Min. :1.400 setosa : 0
1st Qu.:6.225 1st Qu.:2.800 1st Qu.:5.100 1st Qu.:1.800 versicolor: 0
Median :6.500 Median :3.000 Median :5.550 Median :2.000 virginica :50
Mean :6.588 Mean :2.974 Mean :5.552 Mean :2.026
3rd Qu.:6.900 3rd Qu.:3.175 3rd Qu.:5.875 3rd Qu.:2.300
Max. :7.900 Max. :3.800 Max. :6.900 Max. :2.500
It works indeed, and the result is very surprising: we get an object of class by that, along Species (that is, for each of them), computes the summary of each variable.
Note that if the first argument is a data frame, the dispatched function must have a method for that class of objects. For example, if we use this code with the mean function, we get output that makes no sense at all:
by(iris, iris$Species, mean)
iris$Species: setosa
[1] NA
-------------------------------------------
iris$Species: versicolor
[1] NA
-------------------------------------------
iris$Species: virginica
[1] NA
Warning messages:
1: In mean.default(data[x, , drop = FALSE], ...) :
argument is not numeric or logical: returning NA
2: In mean.default(data[x, , drop = FALSE], ...) :
argument is not numeric or logical: returning NA
3: In mean.default(data[x, , drop = FALSE], ...) :
argument is not numeric or logical: returning NA
AGGREGATE
aggregate can be seen as a different way of using tapply, if we use it in such a way.
at <- tapply(iris$Sepal.Length , iris$Species , mean)
ag <- aggregate(iris$Sepal.Length , list(iris$Species), mean)
at
setosa versicolor virginica
5.006 5.936 6.588
ag
Group.1 x
1 setosa 5.006
2 versicolor 5.936
3 virginica 6.588
The two immediate differences are that the second argument of aggregate must be a list, while tapply's grouping argument can (but need not) be a list, and that the output of aggregate is a data frame while that of tapply is an array.
The power of aggregate is that it can easily handle subsets of the data with the subset argument, and that it has methods for ts objects and for formulas as well.
These elements make aggregate easier to work with than tapply in some situations; the subset argument, for instance, is sketched below.
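A quick sketch of the subset argument (restricting the ToothGrowth means to one supplement):
# Means of len by dose, using only the OJ supplement rows:
aggregate(len ~ dose, data = ToothGrowth, FUN = mean, subset = supp == "OJ")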
Here are some examples (available in documentation):
ag <- aggregate(len ~ ., data = ToothGrowth, mean)
ag
supp dose len
1 OJ 0.5 13.23
2 VC 0.5 7.98
3 OJ 1.0 22.70
4 VC 1.0 16.77
5 OJ 2.0 26.06
6 VC 2.0 26.14
We can achieve the same with tapply but the syntax is slightly harder and the output (in some circumstances) less readable:
att <- tapply(ToothGrowth$len, list(ToothGrowth$dose, ToothGrowth$supp), mean)
att
OJ VC
0.5 13.23 7.98
1 22.70 16.77
2 26.06 26.14
There are other times when we can't use by or tapply and we have to use aggregate.
ag1 <- aggregate(cbind(Ozone, Temp) ~ Month, data = airquality, mean)
ag1
Month Ozone Temp
1 5 23.61538 66.73077
2 6 29.44444 78.22222
3 7 59.11538 83.88462
4 8 59.96154 83.96154
5 9 31.44828 76.89655
We cannot obtain the previous result with tapply in one call; we have to calculate the mean along Month for each element and then combine them (also note that we have to pass na.rm = TRUE, because the formula method of the aggregate function defaults to na.action = na.omit):
ta1 <- tapply(airquality$Ozone, airquality$Month, mean, na.rm = TRUE)
ta2 <- tapply(airquality$Temp, airquality$Month, mean, na.rm = TRUE)
cbind(ta1, ta2)
ta1 ta2
5 23.61538 65.54839
6 29.44444 79.10000
7 59.11538 83.90323
8 59.96154 83.96774
9 31.44828 76.90000
while with by we just can't achieve that; in fact, the following function call returns an error (though most likely it is related to the supplied function, mean):
by(airquality[c("Ozone", "Temp")], airquality$Month, mean, na.rm = TRUE)
Other times the results are the same and the differences are just in the class of the object (and hence in how it is shown/printed, and in how it can be subset):
byagg <- by(airquality[c("Ozone", "Temp")], airquality$Month, summary)
aggagg <- aggregate(cbind(Ozone, Temp) ~ Month, data = airquality, summary)
The previous code achieves the same goal and results; at some point, which tool to use is just a matter of personal taste and needs, though the two objects above behave very differently when it comes to subsetting.
There are lots of great answers which discuss differences in the use cases for each function. None of the answers discusses the differences in performance. That is reasonable, because the various functions expect different input and produce different output, yet most of them share a common objective: to evaluate by series/groups. My answer is going to focus on performance. Because of the above, the input creation from the vectors is included in the timing, and the apply function is not measured.
I have tested two functions, sum and length, at once. The volume tested is 50M rows on input and 500K groups on output. I have also included two currently popular packages which were not widely used at the time the question was asked, data.table and dplyr. Both are definitely worth a look if you are aiming for good performance.
library(dplyr)
library(data.table)
set.seed(123)
n = 5e7
k = 5e5
x = runif(n)
grp = sample(k, n, TRUE)
timing = list()
# sapply
timing[["sapply"]] = system.time({
lt = split(x, grp)
r.sapply = sapply(lt, function(x) list(sum(x), length(x)), simplify = FALSE)
})
# lapply
timing[["lapply"]] = system.time({
lt = split(x, grp)
r.lapply = lapply(lt, function(x) list(sum(x), length(x)))
})
# tapply
timing[["tapply"]] = system.time(
r.tapply <- tapply(x, list(grp), function(x) list(sum(x), length(x)))
)
# by
timing[["by"]] = system.time(
r.by <- by(x, list(grp), function(x) list(sum(x), length(x)), simplify = FALSE)
)
# aggregate
timing[["aggregate"]] = system.time(
r.aggregate <- aggregate(x, list(grp), function(x) list(sum(x), length(x)), simplify = FALSE)
)
# dplyr
timing[["dplyr"]] = system.time({
df = data_frame(x, grp)
r.dplyr = summarise(group_by(df, grp), sum(x), n())
})
# data.table
timing[["data.table"]] = system.time({
dt = setnames(setDT(list(x, grp)), c("x","grp"))
r.data.table = dt[, .(sum(x), .N), grp]
})
# all output size match to group count
sapply(list(sapply=r.sapply, lapply=r.lapply, tapply=r.tapply, by=r.by, aggregate=r.aggregate, dplyr=r.dplyr, data.table=r.data.table),
function(x) (if(is.data.frame(x)) nrow else length)(x)==k)
# sapply lapply tapply by aggregate dplyr data.table
# TRUE TRUE TRUE TRUE TRUE TRUE TRUE
# print timings
as.data.table(sapply(timing, `[[`, "elapsed"), keep.rownames = TRUE
)[,.(fun = V1, elapsed = V2)
][order(-elapsed)]
# fun elapsed
#1: aggregate 109.139
#2: by 25.738
#3: dplyr 18.978
#4: tapply 17.006
#5: lapply 11.524
#6: sapply 11.326
#7: data.table 2.686
Despite all the great answers here, there are two more base functions that deserve to be mentioned: the useful outer function and the obscure eapply function.
outer
outer is a very useful function disguised as a more mundane one. If you read the help for outer, its description says:
The outer product of the arrays X and Y is the array A with dimension
c(dim(X), dim(Y)) where element A[c(arrayindex.x, arrayindex.y)] =
FUN(X[arrayindex.x], Y[arrayindex.y], ...).
which makes it seem like it is only useful for linear-algebra-type things. However, it can be used much like mapply to apply a function to two vectors of inputs. The difference is that mapply will apply the function to the first elements of each vector, then the second elements, etc., whereas outer will apply the function to every combination of one element from the first vector and one from the second. For example:
A<-c(1,3,5,7,9)
B<-c(0,3,6,9,12)
mapply(FUN=pmax, A, B)
[1]  1  3  6  9 12
outer(A,B, pmax)
[,1] [,2] [,3] [,4] [,5]
[1,] 1 3 6 9 12
[2,] 3 3 6 9 12
[3,] 5 5 6 9 12
[4,] 7 7 7 9 12
[5,] 9 9 9 9 12
I have personally used this when I have a vector of values and a vector of conditions and wish to see which values meet which conditions.
eapply
eapply is like lapply except that rather than applying a function to every element in a list, it applies a function to every element in an environment. For example if you want to find a list of user defined functions in the global environment:
A<-c(1,3,5,7,9)
B<-c(0,3,6,9,12)
C<-list(x=1, y=2)
D<-function(x){x+1}
eapply(.GlobalEnv, is.function)
$A
[1] FALSE
$B
[1] FALSE
$C
[1] FALSE
$D
[1] TRUE
Frankly, I don't use this very much, but if you are building a lot of packages or creating a lot of environments it may come in handy.
It is maybe worth mentioning ave. ave is tapply's friendly cousin. It returns results in a form that you can plug straight back into your data frame.
dfr <- data.frame(a=1:20, f=rep(LETTERS[1:5], each=4))
means <- tapply(dfr$a, dfr$f, mean)
## A B C D E
## 2.5 6.5 10.5 14.5 18.5
## great, but putting it back in the data frame is another line:
dfr$m <- means[dfr$f]
dfr$m2 <- ave(dfr$a, dfr$f, FUN=mean) # NB argument name FUN is needed!
dfr
##   a f   m  m2
## 1 1 A 2.5 2.5
## 2 2 A 2.5 2.5
## 3 3 A 2.5 2.5
## 4 4 A 2.5 2.5
## 5 5 B 6.5 6.5
## 6 6 B 6.5 6.5
## 7 7 B 6.5 6.5
## ...
There is nothing in the base package that works like ave for whole data frames (as by is like tapply for data frames). But you can fudge it:
dfr$foo <- ave(1:nrow(dfr), dfr$f, FUN=function(x) {
  x <- dfr[x,]
  sum(x$m*x$m2)
})
dfr
## a f m m2 foo
## 1 1 A 2.5 2.5 25
## 2 2 A 2.5 2.5 25
## 3 3 A 2.5 2.5 25
## ...
I recently discovered the rather useful sweep function and add it here for the sake of completeness:
sweep
The basic idea is to sweep through an array row- or column-wise and return a modified array. An example will make this clear (source: datacamp):
Let's say you have a matrix and want to standardize it column-wise:
dataPoints <- matrix(4:15, nrow = 4)
# Find means per column with `apply()`
dataPoints_means <- apply(dataPoints, 2, mean)
# Find standard deviation with `apply()`
dataPoints_sdev <- apply(dataPoints, 2, sd)
# Center the points
dataPoints_Trans1 <- sweep(dataPoints, 2, dataPoints_means,"-")
# Return the result
dataPoints_Trans1
## [,1] [,2] [,3]
## [1,] -1.5 -1.5 -1.5
## [2,] -0.5 -0.5 -0.5
## [3,] 0.5 0.5 0.5
## [4,] 1.5 1.5 1.5
# Normalize
dataPoints_Trans2 <- sweep(dataPoints_Trans1, 2, dataPoints_sdev, "/")
# Return the result
dataPoints_Trans2
## [,1] [,2] [,3]
## [1,] -1.1618950 -1.1618950 -1.1618950
## [2,] -0.3872983 -0.3872983 -0.3872983
## [3,] 0.3872983 0.3872983 0.3872983
## [4,] 1.1618950 1.1618950 1.1618950
NB: for this simple example the same result can of course be achieved more easily by apply(dataPoints, 2, scale)
In the collapse package recently released on CRAN, I have attempted to compress most of the common apply functionality into just 2 functions:
dapply (Data-Apply) applies functions to rows or (default) columns of matrices and data.frames and (by default) returns an object of the same type and with the same attributes (unless the result of each computation is atomic and drop = TRUE). The performance is comparable to lapply for data.frame columns, and about 2x faster than apply for matrix rows or columns. Parallelism is available via mclapply (not on Windows, since mclapply relies on forking).
Syntax:
dapply(X, FUN, ..., MARGIN = 2, parallel = FALSE, mc.cores = 1L,
return = c("same", "matrix", "data.frame"), drop = TRUE)
Examples:
# Apply to columns:
dapply(mtcars, log)
dapply(mtcars, sum)
dapply(mtcars, quantile)
# Apply to rows:
dapply(mtcars, sum, MARGIN = 1)
dapply(mtcars, quantile, MARGIN = 1)
# Return as matrix:
dapply(mtcars, quantile, return = "matrix")
dapply(mtcars, quantile, MARGIN = 1, return = "matrix")
# Same for matrices ...
BY is an S3 generic for split-apply-combine computing with vector, matrix and data.frame methods. It is significantly faster than tapply, by and aggregate (and also faster than plyr; on large data dplyr is faster, though).
Syntax:
BY(X, g, FUN, ..., use.g.names = TRUE, sort = TRUE,
expand.wide = FALSE, parallel = FALSE, mc.cores = 1L,
return = c("same", "matrix", "data.frame", "list"))
Examples:
# Vectors:
BY(iris$Sepal.Length, iris$Species, sum)
BY(iris$Sepal.Length, iris$Species, quantile)
BY(iris$Sepal.Length, iris$Species, quantile, expand.wide = TRUE) # This returns a matrix
# Data.frames
BY(iris[-5], iris$Species, sum)
BY(iris[-5], iris$Species, quantile)
BY(iris[-5], iris$Species, quantile, expand.wide = TRUE) # This returns a wider data.frame
BY(iris[-5], iris$Species, quantile, return = "matrix") # This returns a matrix
# Same for matrices ...
Lists of grouping variables can also be supplied to g.
Talking about performance: A main goal of collapse is to foster high-performance programming in R and to move beyond split-apply-combine altogether. For this purpose the package has a full set of C++ based fast generic functions: fmean, fmedian, fmode, fsum, fprod, fsd, fvar, fmin, fmax, ffirst, flast, fNobs, fNdistinct, fscale, fbetween, fwithin, fHDbetween, fHDwithin, flag, fdiff and fgrowth. They perform grouped computations in a single pass through the data (i.e. no splitting and recombining).
Syntax:
fFUN(x, g = NULL, [w = NULL,] TRA = NULL, [na.rm = TRUE,] use.g.names = TRUE, drop = TRUE)
Examples:
v <- iris$Sepal.Length
f <- iris$Species
# Vectors
fmean(v) # mean
fmean(v, f) # grouped mean
fsd(v, f) # grouped standard deviation
fsd(v, f, TRA = "/") # grouped scaling
fscale(v, f) # grouped standardizing (scaling and centering)
fwithin(v, f) # grouped demeaning
w <- abs(rnorm(nrow(iris)))
fmean(v, w = w) # Weighted mean
fmean(v, f, w) # Weighted grouped mean
fsd(v, f, w) # Weighted grouped standard-deviation
fsd(v, f, w, "/") # Weighted grouped scaling
fscale(v, f, w) # Weighted grouped standardizing
fwithin(v, f, w) # Weighted grouped demeaning
# Same using data.frames...
fmean(iris[-5], f) # grouped mean
fscale(iris[-5], f) # grouped standardizing
fwithin(iris[-5], f) # grouped demeaning
# Same with matrices ...
In the package vignettes I provide benchmarks. Programming with the fast functions is significantly faster than programming with dplyr or data.table, especially on smaller data, but also on large data.

Creating a new dataframe column containing summed values from a different dataframe that corrispond to specific criteria [duplicate]

Whenever I want to do something "map"py in R, I usually try to use a function in the apply family.
However, I've never quite understood the differences between them -- how {sapply, lapply, etc.} apply the function to the input/grouped input, what the output will look like, or even what the input can be -- so I often just go through them all until I get what I want.
Can someone explain how to use which one when?
My current (probably incorrect/incomplete) understanding is...
sapply(vec, f): input is a vector. output is a vector/matrix, where element i is f(vec[i]), giving you a matrix if f has a multi-element output
lapply(vec, f): same as sapply, but output is a list?
apply(matrix, 1/2, f): input is a matrix. output is a vector, where element i is f(row/col i of the matrix)
tapply(vector, grouping, f): output is a matrix/array, where an element in the matrix/array is the value of f at a grouping g of the vector, and g gets pushed to the row/col names
by(dataframe, grouping, f): let g be a grouping. apply f to each column of the group/dataframe. pretty print the grouping and the value of f at each column.
aggregate(matrix, grouping, f): similar to by, but instead of pretty printing the output, aggregate sticks everything into a dataframe.
Side question: I still haven't learned plyr or reshape -- would plyr or reshape replace all of these entirely?
R has many *apply functions which are ably described in the help files (e.g. ?apply). There are enough of them, though, that beginning useRs may have difficulty deciding which one is appropriate for their situation or even remembering them all. They may have a general sense that "I should be using an *apply function here", but it can be tough to keep them all straight at first.
Despite the fact (noted in other answers) that much of the functionality of the *apply family is covered by the extremely popular plyr package, the base functions remain useful and worth knowing.
This answer is intended to act as a sort of signpost for new useRs to help direct them to the correct *apply function for their particular problem. Note, this is not intended to simply regurgitate or replace the R documentation! The hope is that this answer helps you to decide which *apply function suits your situation and then it is up to you to research it further. With one exception, performance differences will not be addressed.
apply - When you want to apply a function to the rows or columns
of a matrix (and higher-dimensional analogues); not generally advisable for data frames as it will coerce to a matrix first.
# Two dimensional matrix
M <- matrix(seq(1,16), 4, 4)
# apply min to rows
apply(M, 1, min)
[1] 1 2 3 4
# apply max to columns
apply(M, 2, max)
[1] 4 8 12 16
# 3 dimensional array
M <- array( seq(32), dim = c(4,4,2))
# Apply sum across each M[*, , ] - i.e Sum across 2nd and 3rd dimension
apply(M, 1, sum)
# Result is one-dimensional
[1] 120 128 136 144
# Apply sum across each M[*, *, ] - i.e Sum across 3rd dimension
apply(M, c(1,2), sum)
# Result is two-dimensional
[,1] [,2] [,3] [,4]
[1,] 18 26 34 42
[2,] 20 28 36 44
[3,] 22 30 38 46
[4,] 24 32 40 48
If you want row/column means or sums for a 2D matrix, be sure to
investigate the highly optimized, lightning-quick colMeans,
rowMeans, colSums, rowSums.
lapply - When you want to apply a function to each element of a
list in turn and get a list back.
This is the workhorse of many of the other *apply functions. Peel
back their code and you will often find lapply underneath.
x <- list(a = 1, b = 1:3, c = 10:100)
lapply(x, FUN = length)
$a
[1] 1
$b
[1] 3
$c
[1] 91
lapply(x, FUN = sum)
$a
[1] 1
$b
[1] 6
$c
[1] 5005
sapply - When you want to apply a function to each element of a
list in turn, but you want a vector back, rather than a list.
If you find yourself typing unlist(lapply(...)), stop and consider
sapply.
x <- list(a = 1, b = 1:3, c = 10:100)
# Compare with above; a named vector, not a list
sapply(x, FUN = length)
a b c
1 3 91
sapply(x, FUN = sum)
a b c
1 6 5005
In more advanced uses of sapply it will attempt to coerce the
result to a multi-dimensional array, if appropriate. For example, if our function returns vectors of the same length, sapply will use them as columns of a matrix:
sapply(1:5,function(x) rnorm(3,x))
If our function returns a 2 dimensional matrix, sapply will do essentially the same thing, treating each returned matrix as a single long vector:
sapply(1:5,function(x) matrix(x,2,2))
Unless we specify simplify = "array", in which case it will use the individual matrices to build a multi-dimensional array:
sapply(1:5,function(x) matrix(x,2,2), simplify = "array")
Each of these behaviors is of course contingent on our function returning vectors or matrices of the same length or dimension.
vapply - When you want to use sapply but perhaps need to
squeeze some more speed out of your code or want more type safety.
For vapply, you basically give R an example of what sort of thing
your function will return, which can save some time coercing returned
values to fit in a single atomic vector.
x <- list(a = 1, b = 1:3, c = 10:100)
#Note that since the advantage here is mainly speed, this
# example is only for illustration. We're telling R that
# everything returned by length() should be an integer of
# length 1.
vapply(x, FUN = length, FUN.VALUE = 0L)
a b c
1 3 91
mapply - For when you have several data structures (e.g.
vectors, lists) and you want to apply a function to the 1st elements
of each, and then the 2nd elements of each, etc., coercing the result
to a vector/array as in sapply.
This is multivariate in the sense that your function must accept
multiple arguments.
#Sums the 1st elements, the 2nd elements, etc.
mapply(sum, 1:5, 1:5, 1:5)
[1] 3 6 9 12 15
#To do rep(1,4), rep(2,3), etc.
mapply(rep, 1:4, 4:1)
[[1]]
[1] 1 1 1 1
[[2]]
[1] 2 2 2
[[3]]
[1] 3 3
[[4]]
[1] 4
Map - A wrapper to mapply with SIMPLIFY = FALSE, so it is guaranteed to return a list.
Map(sum, 1:5, 1:5, 1:5)
[[1]]
[1] 3
[[2]]
[1] 6
[[3]]
[1] 9
[[4]]
[1] 12
[[5]]
[1] 15
rapply - For when you want to apply a function to each element of a nested list structure, recursively.
To give you some idea of how uncommon rapply is, I forgot about it when first posting this answer! Obviously, I'm sure many people use it, but YMMV. rapply is best illustrated with a user-defined function to apply:
# Append ! to string, otherwise increment
myFun <- function(x){
if(is.character(x)){
return(paste(x,"!",sep=""))
}
else{
return(x + 1)
}
}
#A nested list structure
l <- list(a = list(a1 = "Boo", b1 = 2, c1 = "Eeek"),
b = 3, c = "Yikes",
d = list(a2 = 1, b2 = list(a3 = "Hey", b3 = 5)))
# Result is named vector, coerced to character
rapply(l, myFun)
# Result is a nested list like l, with values altered
rapply(l, myFun, how="replace")
tapply - For when you want to apply a function to subsets of a
vector and the subsets are defined by some other vector, usually a
factor.
The black sheep of the *apply family, of sorts. The help file's use of
the phrase "ragged array" can be a bit confusing, but it is actually
quite simple.
A vector:
x <- 1:20
A factor (of the same length!) defining groups:
y <- factor(rep(letters[1:5], each = 4))
Add up the values in x within each subgroup defined by y:
tapply(x, y, sum)
a b c d e
10 26 42 58 74
More complex examples can be handled where the subgroups are defined
by the unique combinations of a list of several factors. tapply is
similar in spirit to the split-apply-combine functions that are
common in R (aggregate, by, ave, ddply, etc.) Hence its
black sheep status.
On the side note, here is how the various plyr functions correspond to the base *apply functions (from the intro to plyr document from the plyr webpage http://had.co.nz/plyr/)
Base function Input Output plyr function
---------------------------------------
aggregate d d ddply + colwise
apply a a/l aaply / alply
by d l dlply
lapply l l llply
mapply a a/l maply / mlply
replicate r a/l raply / rlply
sapply l a laply
One of the goals of plyr is to provide consistent naming conventions for each of the functions, encoding the input and output data types in the function name. It also provides consistency in output, in that output from dlply() is easily passable to ldply() to produce useful output, etc.
Conceptually, learning plyr is no more difficult than understanding the base *apply functions.
plyr and reshape functions have replaced almost all of these functions in my every day use. But, also from the Intro to Plyr document:
Related functions tapply and sweep have no corresponding function in plyr, and remain useful. merge is useful for combining summaries with the original data.
From slide 21 of http://www.slideshare.net/hadley/plyr-one-data-analytic-strategy:
(Hopefully it's clear that apply corresponds to #Hadley's aaply and aggregate corresponds to #Hadley's ddply etc. Slide 20 of the same slideshare will clarify if you don't get it from this image.)
(on the left is input, on the top is output)
First start with Joran's excellent answer -- doubtful anything can better that.
Then the following mnemonics may help to remember the distinctions between each. Whilst some are obvious, others may be less so --- for these you'll find justification in Joran's discussions.
Mnemonics
lapply is a list apply which acts on a list or vector and returns a list.
sapply is a simple lapply (function defaults to returning a vector or matrix when possible)
vapply is a verified apply (allows the return object type to be prespecified)
rapply is a recursive apply for nested lists, i.e. lists within lists
tapply is a tagged apply where the tags identify the subsets
apply is generic: applies a function to a matrix's rows or columns (or, more generally, to dimensions of an array)
Building the Right Background
If using the apply family still feels a bit alien to you, then it might be that you're missing a key point of view.
These two articles can help. They provide the necessary background to motivate the functional programming techniques that are being provided by the apply family of functions.
Users of Lisp will recognise the paradigm immediately. If you're not familiar with Lisp, once you get your head around FP, you'll have gained a powerful point of view for use in R -- and apply will make a lot more sense.
Advanced R: Functional Programming, by Hadley Wickham
Simple Functional Programming in R, by Michael Barton
Since I realized that (the very excellent) answers of this post lack of by and aggregate explanations. Here is my contribution.
BY
The by function, as stated in the documentation can be though, as a "wrapper" for tapply. The power of by arises when we want to compute a task that tapply can't handle. One example is this code:
ct <- tapply(iris$Sepal.Width , iris$Species , summary )
cb <- by(iris$Sepal.Width , iris$Species , summary )
cb
iris$Species: setosa
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.300 3.200 3.400 3.428 3.675 4.400
--------------------------------------------------------------
iris$Species: versicolor
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.000 2.525 2.800 2.770 3.000 3.400
--------------------------------------------------------------
iris$Species: virginica
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.200 2.800 3.000 2.974 3.175 3.800
ct
$setosa
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.300 3.200 3.400 3.428 3.675 4.400
$versicolor
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.000 2.525 2.800 2.770 3.000 3.400
$virginica
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.200 2.800 3.000 2.974 3.175 3.800
If we print these two objects, ct and cb, we "essentially" have the same results and the only differences are in how they are shown and the different class attributes, respectively by for cb and array for ct.
As I've said, the power of by arises when we can't use tapply; the following code is one example:
tapply(iris, iris$Species, summary )
Error in tapply(iris, iris$Species, summary) :
arguments must have same length
R says that arguments must have the same lengths, say "we want to calculate the summary of all variable in iris along the factor Species": but R just can't do that because it does not know how to handle.
With the by function R dispatch a specific method for data frame class and then let the summary function works even if the length of the first argument (and the type too) are different.
bywork <- by(iris, iris$Species, summary )
bywork
iris$Species: setosa
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
Min. :4.300 Min. :2.300 Min. :1.000 Min. :0.100 setosa :50
1st Qu.:4.800 1st Qu.:3.200 1st Qu.:1.400 1st Qu.:0.200 versicolor: 0
Median :5.000 Median :3.400 Median :1.500 Median :0.200 virginica : 0
Mean :5.006 Mean :3.428 Mean :1.462 Mean :0.246
3rd Qu.:5.200 3rd Qu.:3.675 3rd Qu.:1.575 3rd Qu.:0.300
Max. :5.800 Max. :4.400 Max. :1.900 Max. :0.600
--------------------------------------------------------------
iris$Species: versicolor
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
Min. :4.900 Min. :2.000 Min. :3.00 Min. :1.000 setosa : 0
1st Qu.:5.600 1st Qu.:2.525 1st Qu.:4.00 1st Qu.:1.200 versicolor:50
Median :5.900 Median :2.800 Median :4.35 Median :1.300 virginica : 0
Mean :5.936 Mean :2.770 Mean :4.26 Mean :1.326
3rd Qu.:6.300 3rd Qu.:3.000 3rd Qu.:4.60 3rd Qu.:1.500
Max. :7.000 Max. :3.400 Max. :5.10 Max. :1.800
--------------------------------------------------------------
iris$Species: virginica
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
Min. :4.900 Min. :2.200 Min. :4.500 Min. :1.400 setosa : 0
1st Qu.:6.225 1st Qu.:2.800 1st Qu.:5.100 1st Qu.:1.800 versicolor: 0
Median :6.500 Median :3.000 Median :5.550 Median :2.000 virginica :50
Mean :6.588 Mean :2.974 Mean :5.552 Mean :2.026
3rd Qu.:6.900 3rd Qu.:3.175 3rd Qu.:5.875 3rd Qu.:2.300
Max. :7.900 Max. :3.800 Max. :6.900 Max. :2.500
it works indeed and the result is very surprising. It is an object of class by that along Species (say, for each of them) computes the summary of each variable.
Note that if the first argument is a data frame, the dispatched function must have a method for that class of objects. For example is we use this code with the mean function we will have this code that has no sense at all:
by(iris, iris$Species, mean)
iris$Species: setosa
[1] NA
-------------------------------------------
iris$Species: versicolor
[1] NA
-------------------------------------------
iris$Species: virginica
[1] NA
Warning messages:
1: In mean.default(data[x, , drop = FALSE], ...) :
argument is not numeric or logical: returning NA
2: In mean.default(data[x, , drop = FALSE], ...) :
argument is not numeric or logical: returning NA
3: In mean.default(data[x, , drop = FALSE], ...) :
argument is not numeric or logical: returning NA
AGGREGATE
aggregate can be seen as another a different way of use tapply if we use it in such a way.
at <- tapply(iris$Sepal.Length , iris$Species , mean)
ag <- aggregate(iris$Sepal.Length , list(iris$Species), mean)
at
setosa versicolor virginica
5.006 5.936 6.588
ag
Group.1 x
1 setosa 5.006
2 versicolor 5.936
3 virginica 6.588
The two immediate differences are that the second argument of aggregate must be a list while tapply can (not mandatory) be a list and that the output of aggregate is a data frame while the one of tapply is an array.
The power of aggregate is that it can handle easily subsets of the data with subset argument and that it has methods for ts objects and formula as well.
These elements make aggregate easier to work with that tapply in some situations.
Here are some examples (available in documentation):
ag <- aggregate(len ~ ., data = ToothGrowth, mean)
ag
supp dose len
1 OJ 0.5 13.23
2 VC 0.5 7.98
3 OJ 1.0 22.70
4 VC 1.0 16.77
5 OJ 2.0 26.06
6 VC 2.0 26.14
We can achieve the same with tapply but the syntax is slightly harder and the output (in some circumstances) less readable:
att <- tapply(ToothGrowth$len, list(ToothGrowth$dose, ToothGrowth$supp), mean)
att
OJ VC
0.5 13.23 7.98
1 22.70 16.77
2 26.06 26.14
There are other times when we can't use by or tapply and we have to use aggregate.
ag1 <- aggregate(cbind(Ozone, Temp) ~ Month, data = airquality, mean)
ag1
Month Ozone Temp
1 5 23.61538 66.73077
2 6 29.44444 78.22222
3 7 59.11538 83.88462
4 8 59.96154 83.96154
5 9 31.44828 76.89655
We cannot obtain the previous result with tapply in one call but we have to calculate the mean along Month for each elements and then combine them (also note that we have to call the na.rm = TRUE, because the formula methods of the aggregate function has by default the na.action = na.omit):
ta1 <- tapply(airquality$Ozone, airquality$Month, mean, na.rm = TRUE)
ta2 <- tapply(airquality$Temp, airquality$Month, mean, na.rm = TRUE)
cbind(ta1, ta2)
ta1 ta2
5 23.61538 65.54839
6 29.44444 79.10000
7 59.11538 83.90323
8 59.96154 83.96774
9 31.44828 76.90000
while with by we just can't achieve that in fact the following function call returns an error (but most likely it is related to the supplied function, mean):
by(airquality[c("Ozone", "Temp")], airquality$Month, mean, na.rm = TRUE)
Other times the results are the same and the differences are just in the class (and then how it is shown/printed and not only -- example, how to subset it) object:
byagg <- by(airquality[c("Ozone", "Temp")], airquality$Month, summary)
aggagg <- aggregate(cbind(Ozone, Temp) ~ Month, data = airquality, summary)
The previous code achieve the same goal and results, at some points what tool to use is just a matter of personal tastes and needs; the previous two objects have very different needs in terms of subsetting.
There are lots of great answers which discuss differences in the use cases for each function. None of the answer discuss the differences in performance. That is reasonable cause various functions expects various input and produces various output, yet most of them have a general common objective to evaluate by series/groups. My answer is going to focus on performance. Due to above the input creation from the vectors is included in the timing, also the apply function is not measured.
I have tested two different functions sum and length at once. Volume tested is 50M on input and 50K on output. I have also included two currently popular packages which were not widely used at the time when question was asked, data.table and dplyr. Both are definitely worth to look if you are aiming for good performance.
library(dplyr)
library(data.table)
set.seed(123)
n = 5e7
k = 5e5
x = runif(n)
grp = sample(k, n, TRUE)
timing = list()
# sapply
timing[["sapply"]] = system.time({
lt = split(x, grp)
r.sapply = sapply(lt, function(x) list(sum(x), length(x)), simplify = FALSE)
})
# lapply
timing[["lapply"]] = system.time({
lt = split(x, grp)
r.lapply = lapply(lt, function(x) list(sum(x), length(x)))
})
# tapply
timing[["tapply"]] = system.time(
r.tapply <- tapply(x, list(grp), function(x) list(sum(x), length(x)))
)
# by
timing[["by"]] = system.time(
r.by <- by(x, list(grp), function(x) list(sum(x), length(x)), simplify = FALSE)
)
# aggregate
timing[["aggregate"]] = system.time(
r.aggregate <- aggregate(x, list(grp), function(x) list(sum(x), length(x)), simplify = FALSE)
)
# dplyr
timing[["dplyr"]] = system.time({
df = data_frame(x, grp)
r.dplyr = summarise(group_by(df, grp), sum(x), n())
})
# data.table
timing[["data.table"]] = system.time({
dt = setnames(setDT(list(x, grp)), c("x","grp"))
r.data.table = dt[, .(sum(x), .N), grp]
})
# all output size match to group count
sapply(list(sapply=r.sapply, lapply=r.lapply, tapply=r.tapply, by=r.by, aggregate=r.aggregate, dplyr=r.dplyr, data.table=r.data.table),
function(x) (if(is.data.frame(x)) nrow else length)(x)==k)
# sapply lapply tapply by aggregate dplyr data.table
# TRUE TRUE TRUE TRUE TRUE TRUE TRUE
# print timings
as.data.table(sapply(timing, `[[`, "elapsed"), keep.rownames = TRUE
)[,.(fun = V1, elapsed = V2)
][order(-elapsed)]
# fun elapsed
#1: aggregate 109.139
#2: by 25.738
#3: dplyr 18.978
#4: tapply 17.006
#5: lapply 11.524
#6: sapply 11.326
#7: data.table 2.686
Despite all the great answers here, there are 2 more base functions that deserve to be mentioned, the useful outer function and the obscure eapply function
outer
outer is a very useful function hidden as a more mundane one. If you read the help for outer its description says:
The outer product of the arrays X and Y is the array A with dimension
c(dim(X), dim(Y)) where element A[c(arrayindex.x, arrayindex.y)] =
FUN(X[arrayindex.x], Y[arrayindex.y], ...).
which makes it seem like this is only useful for linear algebra type things. However, it can be used much like mapply to apply a function to two vectors of inputs. The difference is that mapply will apply the function to the first two elements and then the second two etc, whereas outer will apply the function to every combination of one element from the first vector and one from the second. For example:
A<-c(1,3,5,7,9)
B<-c(0,3,6,9,12)
mapply(FUN=pmax, A, B)
> mapply(FUN=pmax, A, B)
[1] 1 3 6 9 12
outer(A,B, pmax)
> outer(A,B, pmax)
[,1] [,2] [,3] [,4] [,5]
[1,] 1 3 6 9 12
[2,] 3 3 6 9 12
[3,] 5 5 6 9 12
[4,] 7 7 7 9 12
[5,] 9 9 9 9 12
I have personally used this when I have a vector of values and a vector of conditions and wish to see which values meet which conditions.
eapply
eapply is like lapply except that rather than applying a function to every element in a list, it applies a function to every element in an environment. For example if you want to find a list of user defined functions in the global environment:
A<-c(1,3,5,7,9)
B<-c(0,3,6,9,12)
C<-list(x=1, y=2)
D<-function(x){x+1}
> eapply(.GlobalEnv, is.function)
$A
[1] FALSE
$B
[1] FALSE
$C
[1] FALSE
$D
[1] TRUE
Frankly I don't use this very much but if you are building a lot of packages or create a lot of environments it may come in handy.
It is maybe worth mentioning ave. ave is tapply's friendly cousin. It returns results in a form that you can plug straight back into your data frame.
dfr <- data.frame(a=1:20, f=rep(LETTERS[1:5], each=4))
means <- tapply(dfr$a, dfr$f, mean)
## A B C D E
## 2.5 6.5 10.5 14.5 18.5
## great, but putting it back in the data frame is another line:
dfr$m <- means[dfr$f]
dfr$m2 <- ave(dfr$a, dfr$f, FUN=mean) # NB argument name FUN is needed!
dfr
## a f m m2
## 1  1 A 2.5 2.5
## 2  2 A 2.5 2.5
## 3  3 A 2.5 2.5
## 4  4 A 2.5 2.5
## 5  5 B 6.5 6.5
## 6  6 B 6.5 6.5
## 7  7 B 6.5 6.5
## ...
There is nothing in the base package that works like ave for whole data frames (as by is like tapply for data frames). But you can fudge it:
## Pass row indices as the "values"; inside FUN they index back into
## the full data frame, one group at a time:
dfr$foo <- ave(seq_len(nrow(dfr)), dfr$f, FUN = function(x) {
x <- dfr[x, ] # x is now the sub-data-frame for this group
sum(x$m * x$m2)
})
dfr
## a f m m2 foo
## 1 1 A 2.5 2.5 25
## 2 2 A 2.5 2.5 25
## 3 3 A 2.5 2.5 25
## ...
I recently discovered the rather useful sweep function and add it here for the sake of completeness:
sweep
The basic idea is to sweep through an array row- or column-wise and return a modified array. An example will make this clear (source: datacamp):
Let's say you have a matrix and want to standardize it column-wise:
dataPoints <- matrix(4:15, nrow = 4)
# Find means per column with `apply()`
dataPoints_means <- apply(dataPoints, 2, mean)
# Find standard deviation with `apply()`
dataPoints_sdev <- apply(dataPoints, 2, sd)
# Center the points
dataPoints_Trans1 <- sweep(dataPoints, 2, dataPoints_means,"-")
# Return the result
dataPoints_Trans1
## [,1] [,2] [,3]
## [1,] -1.5 -1.5 -1.5
## [2,] -0.5 -0.5 -0.5
## [3,] 0.5 0.5 0.5
## [4,] 1.5 1.5 1.5
# Normalize
dataPoints_Trans2 <- sweep(dataPoints_Trans1, 2, dataPoints_sdev, "/")
# Return the result
dataPoints_Trans2
## [,1] [,2] [,3]
## [1,] -1.1618950 -1.1618950 -1.1618950
## [2,] -0.3872983 -0.3872983 -0.3872983
## [3,] 0.3872983 0.3872983 0.3872983
## [4,] 1.1618950 1.1618950 1.1618950
NB: for this simple example the same result can of course be achieved more easily by apply(dataPoints, 2, scale)
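A quick sanity check, which I've added here, that the two routes agree (scale() centers and then divides by the column standard deviations, exactly as the two sweep() calls do):
all.equal(apply(dataPoints, 2, scale), dataPoints_Trans2, check.attributes = FALSE)
## [1] TRUE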
In the collapse package recently released on CRAN, I have attempted to compress most of the common apply functionality into just 2 functions:
dapply (Data-Apply) applies functions to rows or (default) columns of matrices and data.frames and (by default) returns an object of the same type and with the same attributes (unless the result of each computation is atomic and drop = TRUE). The performance is comparable to lapply for data.frame columns, and about 2x faster than apply for matrix rows or columns. Parallelism is available via mclapply (which relies on forking, so it is not available on Windows).
Syntax:
dapply(X, FUN, ..., MARGIN = 2, parallel = FALSE, mc.cores = 1L,
return = c("same", "matrix", "data.frame"), drop = TRUE)
Examples:
# Apply to columns:
dapply(mtcars, log)
dapply(mtcars, sum)
dapply(mtcars, quantile)
# Apply to rows:
dapply(mtcars, sum, MARGIN = 1)
dapply(mtcars, quantile, MARGIN = 1)
# Return as matrix:
dapply(mtcars, quantile, return = "matrix")
dapply(mtcars, quantile, MARGIN = 1, return = "matrix")
# Same for matrices ...
BY is an S3 generic for split-apply-combine computing, with vector, matrix and data.frame methods. It is significantly faster than tapply, by and aggregate (and also faster than plyr; on large data dplyr is faster, though).
Syntax:
BY(X, g, FUN, ..., use.g.names = TRUE, sort = TRUE,
expand.wide = FALSE, parallel = FALSE, mc.cores = 1L,
return = c("same", "matrix", "data.frame", "list"))
Examples:
# Vectors:
BY(iris$Sepal.Length, iris$Species, sum)
BY(iris$Sepal.Length, iris$Species, quantile)
BY(iris$Sepal.Length, iris$Species, quantile, expand.wide = TRUE) # This returns a matrix
# Data.frames
BY(iris[-5], iris$Species, sum)
BY(iris[-5], iris$Species, quantile)
BY(iris[-5], iris$Species, quantile, expand.wide = TRUE) # This returns a wider data.frame
BY(iris[-5], iris$Species, quantile, return = "matrix") # This returns a matrix
# Same for matrices ...
Lists of grouping variables can also be supplied to g.
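For instance, a minimal sketch with two grouping variables (using mtcars purely for illustration):
BY(mtcars$mpg, list(cyl = mtcars$cyl, am = mtcars$am), mean)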
Talking about performance: a main goal of collapse is to foster high-performance programming in R and to move beyond split-apply-combine altogether. For this purpose the package has a full set of C++-based fast generic functions: fmean, fmedian, fmode, fsum, fprod, fsd, fvar, fmin, fmax, ffirst, flast, fNobs, fNdistinct, fscale, fbetween, fwithin, fHDbetween, fHDwithin, flag, fdiff and fgrowth. They perform grouped computations in a single pass through the data (i.e. no splitting and recombining).
Syntax:
fFUN(x, g = NULL, [w = NULL,] TRA = NULL, [na.rm = TRUE,] use.g.names = TRUE, drop = TRUE)
Examples:
v <- iris$Sepal.Length
f <- iris$Species
# Vectors
fmean(v) # mean
fmean(v, f) # grouped mean
fsd(v, f) # grouped standard deviation
fsd(v, f, TRA = "/") # grouped scaling
fscale(v, f) # grouped standardizing (scaling and centering)
fwithin(v, f) # grouped demeaning
w <- abs(rnorm(nrow(iris)))
fmean(v, w = w) # Weighted mean
fmean(v, f, w) # Weighted grouped mean
fsd(v, f, w) # Weighted grouped standard-deviation
fsd(v, f, w, "/") # Weighted grouped scaling
fscale(v, f, w) # Weighted grouped standardizing
fwithin(v, f, w) # Weighted grouped demeaning
# Same using data.frames...
fmean(iris[-5], f) # grouped mean
fscale(iris[-5], f) # grouped standardizing
fwithin(iris[-5], f) # grouped demeaning
# Same with matrices ...
In the package vignettes I provide benchmarks. Programming with the fast functions is significantly faster than programming with dplyr or data.table, especially on smaller data, but also on large data.

Splitting Dataframe into Confirmatory and Exploratory Samples

I have a very large data frame (N = 107,251) that I wish to split into two roughly equal halves (~53,625 each). However, I would like the split to be done such that three variables are kept in equal proportion in the two sets (pertaining to Gender, an Age category with 6 levels, and a Region with 5 levels).
I can generate the proportions for the variables independently (e.g., via prop.table(xtabs(~dat$Gender))) or in combination (e.g., via prop.table(xtabs(~dat$Gender + dat$Region + dat$Age))), but I'm not sure how to utilise this information to actually do the sampling.
Sample dataset:
set.seed(42)
Gender <- sample(c("M", "F"), 1000, replace = TRUE)
Region <- sample(c("1","2","3","4","5"), 1000, replace = TRUE)
Age <- sample(c("1","2","3","4","5","6"), 1000, replace = TRUE)
X1 <- rnorm(1000)
dat <- data.frame(Gender, Region, Age, X1)
Probabilities:
round(prop.table(xtabs(~dat$Gender)), 3) # 48.5% Female; 51.5% Male
round(prop.table(xtabs(~dat$Age)), 3) # 16.8, 18.2, ..., 16.0%
round(prop.table(xtabs(~dat$Region)), 3) # 21.5%, 17.7, ..., 21.9%
# Multidimensional probabilities:
round(prop.table(xtabs(~dat$Gender + dat$Age + dat$Region)), 3)
The end goal for this dummy example would be two data frames with ~500 observations each (completely independent, with no participant appearing in both) that are approximately equivalent in terms of the gender/region/age splits. In the real analysis there is more disparity between the age and region weights, so a single random split-half isn't appropriate. In real-world applications I'm not sure whether every observation needs to be used, or whether it is better to make the splits more even.
I have been reading over the documentation from package:sampling but I'm not sure it is designed to do exactly what I require.
You can check out my stratified function, which you should be able to use like this:
set.seed(1) ## just so you can reproduce this
## Take your first group
sample1 <- stratified(dat, c("Gender", "Region", "Age"), .5)
## Then select the remainder
sample2 <- dat[!rownames(dat) %in% rownames(sample1), ]
summary(sample1)
# Gender Region Age X1
# F:235 1:112 1:84 Min. :-2.82847
# M:259 2: 90 2:78 1st Qu.:-0.69711
# 3: 94 3:82 Median :-0.03200
# 4: 97 4:80 Mean :-0.01401
# 5:101 5:90 3rd Qu.: 0.63844
# 6:80 Max. : 2.90422
summary(sample2)
# Gender Region Age X1
# F:238 1:114 1:85 Min. :-2.76808
# M:268 2: 92 2:81 1st Qu.:-0.55173
# 3: 97 3:83 Median : 0.02559
# 4: 99 4:83 Mean : 0.05789
# 5:104 5:91 3rd Qu.: 0.74102
# 6:83 Max. : 3.58466
Compare the following and see if they are within your expectations.
x1 <- round(prop.table(
xtabs(~dat$Gender + dat$Age + dat$Region)), 3)
x2 <- round(prop.table(
xtabs(~sample1$Gender + sample1$Age + sample1$Region)), 3)
x3 <- round(prop.table(
xtabs(~sample2$Gender + sample2$Age + sample2$Region)), 3)
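One quick way to compare them (my addition, not part of the original answer) is to look at the largest absolute deviation from the full-data proportions:
max(abs(x2 - x1)) # sample1 vs the full data
max(abs(x3 - x1)) # sample2 vs the full data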
It should be able to work fine with data of the size you describe, but a "data.table" version is in the works that promises to be much more efficient.
Update:
stratified now has a new logical argument "bothSets" which lets you keep both sets of samples as a list.
set.seed(1)
Samples <- stratified(dat, c("Gender", "Region", "Age"), .5, bothSets = TRUE)
lapply(Samples, summary)
# $SET1
# Gender Region Age X1
# F:235 1:112 1:84 Min. :-2.82847
# M:259 2: 90 2:78 1st Qu.:-0.69711
# 3: 94 3:82 Median :-0.03200
# 4: 97 4:80 Mean :-0.01401
# 5:101 5:90 3rd Qu.: 0.63844
# 6:80 Max. : 2.90422
#
# $SET2
# Gender Region Age X1
# F:238 1:114 1:85 Min. :-2.76808
# M:268 2: 92 2:81 1st Qu.:-0.55173
# 3: 97 3:83 Median : 0.02559
# 4: 99 4:83 Mean : 0.05789
# 5:104 5:91 3rd Qu.: 0.74102
# 6:83 Max. : 3.58466
The following code basically creates a key based on group membership, then loops through each group, sampling half to one set and (roughly) half to the other. If you compare the resulting probabilities they are within 0.001 of each other. The downside is that it is biased toward a larger sample size for the second group, because of how the rounding of odd-sized groups is handled: in this case the first sample ends up with 488 observations and the second with 512. You can probably throw in some logic to account for that and even it out better.
EDIT: Added that logic and it split it up evenly.
set.seed(42)
Gender <- sample(c("M", "F"), 1000, replace = TRUE)
Region <- sample(c("1","2","3","4","5"), 1000, replace = TRUE)
Age <- sample(c("1","2","3","4","5","6"), 1000, replace = TRUE)
X1 <- rnorm(1000)
dat <- data.frame(Gender, Region, Age, X1)
dat$group <- with(dat, paste(Gender, Region, Age))
groups <- unique(dat$group)
setA <- dat[NULL,]
setB <- dat[NULL,]
for (i in seq_along(groups)){
temp <- dat[dat$group == groups[i], ]
## Give the extra row of an odd-sized group to whichever set is currently
## smaller; this is the logic that evens out the split. (seq_len/setdiff
## also avoid the 1:0 indexing quirk that would put a single-row group
## into both sets.)
half <- if (nrow(setA) > nrow(setB)) floor(nrow(temp)/2) else ceiling(nrow(temp)/2)
tempA <- temp[seq_len(half), ]
tempB <- temp[setdiff(seq_len(nrow(temp)), seq_len(half)), ]
setA <- rbind(setA, tempA)
setB <- rbind(setB, tempB)
}
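A couple of quick sanity checks on the result (these lines are my addition, for illustration only):
c(nrow(setA), nrow(setB)) # should be roughly 500 each after the evening-out logic
round(prop.table(table(setA$Gender)), 3)
round(prop.table(table(setB$Gender)), 3)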

Fastest by column sort in R

I have a data frame full from which I want to take the last column and the column at index v. I then want to sort both columns on v in the fastest way possible. full is read in from a csv, but this can be used for testing (I've included some NAs for realism):
n <- 200000
full <- data.frame(A = runif(n, 1, 10000), B = floor(runif(n, 0, 1.9)))
full[sample(n, 10000), 'A'] <- NA
v <- 1
I have v as 1 here, but in reality it could change, and full has many columns.
I have tried sorting data frames, data tables and matrices each with order and sort.list (some ideas taken from this thread). The code for all these:
# DATA FRAME
ord_df <- function() {
a <- full[c(v, length(full))]
a[with(a, order(a[1])), ]
}
sl_df <- function() {
a <- full[c(v, length(full))]
a[sort.list(a[[1]]), ]
}
# DATA TABLE
require(data.table)
ord_dt <- function() {
a <- as.data.table(full[c(v, length(full))])
colnames(a)[1] <- 'values'
a[order(values)]
}
sl_dt <- function() {
a <- as.data.table(full[c(v, length(full))])
colnames(a)[1] <- 'values'
a[sort.list(values)]
}
# MATRIX
ord_mat <- function() {
a <- as.matrix(full[c(v, length(full))])
a[order(a[, 1]), ]
}
sl_mat <- function() {
a <- as.matrix(full[c(v, length(full))])
a[sort.list(a[, 1]), ]
}
Time results:
ord_df sl_df ord_dt sl_dt ord_mat sl_mat
Min. 0.230 0.1500 0.1300 0.120 0.140 0.1400
Median 0.250 0.1600 0.1400 0.140 0.140 0.1400
Mean 0.244 0.1610 0.1430 0.136 0.142 0.1450
Max. 0.250 0.1700 0.1600 0.140 0.160 0.1600
Or using microbenchmark (results are in milliseconds):
       expr      min       lq   median       uq      max
1 ord_df() 243.0647 248.2768 254.0544 265.2589 352.3984
2 ord_dt() 133.8159 140.0111 143.8202 148.4957 181.2647
3 ord_mat() 140.5198 146.8131 149.9876 154.6649 191.6897
4 sl_df() 152.6985 161.5591 166.5147 171.2891 194.7155
5 sl_dt() 132.1414 139.7655 144.1281 149.6844 188.8592
6 sl_mat() 139.2420 146.8578 151.6760 156.6174 186.5416
Seems like ordering the data table wins. There isn't all that much difference between order and sort.list except when using data frames where sort.list is much faster.
In the data table versions I also tried setting v as the key (since the table is then sorted, according to the documentation), but I couldn't get it to work since the contents of v are not integer.
I would ideally like to speed this up as much as possible since I have to do it many times for different v values. Does anyone know how I might be able to speed this process up even further? Also might it be worth trying an Rcpp implementation? Thanks.
Here's the code I used for timing if it's useful to anyone:
sortMethods <- list(ord_df, sl_df, ord_dt, sl_dt, ord_mat, sl_mat)
require(plyr)
timings <- raply(10, sapply(sortMethods, function(x) system.time(x())[[3]]))
colnames(timings) <- c('ord_df', 'sl_df', 'ord_dt', 'sl_dt', 'ord_mat', 'sl_mat')
apply(timings, 2, summary)
require(microbenchmark)
mb <- microbenchmark(ord_df(), sl_df(), ord_dt(), sl_dt(), ord_mat(), sl_mat())
plot(mb)
I don't know if it's better to put this sort of thing in as an edit, but it seems more like an answer, so here it is. Updated test functions:
n <- 1e7
full <- data.frame(A = runif(n, 1, 10000), B = floor(runif(n, 0, 1.9)))
full[sample(n, 100000), 'A'] <- NA
fdf <- full
fma <- as.matrix(full)
fdt <- as.data.table(full)
setnames(fdt, colnames(fdt)[1], 'values')
# DATA FRAME
ord_df <- function() { fdf[order(fdf[1]), ] }
sl_df <- function() { fdf[sort.list(fdf[[1]]), ] }
# DATA TABLE
require(data.table)
ord_dt <- function() { fdt[order(values)] }
key_dt <- function() {
setkey(fdt, values) # sorts fdt by reference; after the first call the table is already sorted, so repeated timings flatter this approach
fdt
}
# MATRIX
ord_mat <- function() { fma[order(fma[, 1]), ] }
sl_mat <- function() { fma[sort.list(fma[, 1]), ] }
Results (using a different computer, R 2.13.1 and data.table 1.8.2):
ord_df sl_df ord_dt key_dt ord_mat sl_mat
Min. 37.56 20.86 2.946 2.249 20.22 20.21
1st Qu. 37.73 21.15 2.962 2.255 20.54 20.59
Median 38.43 21.74 3.002 2.280 21.05 20.82
Mean 38.76 21.75 3.074 2.395 21.09 20.95
3rd Qu. 39.85 22.18 3.151 2.445 21.48 21.42
Max. 40.36 23.08 3.330 2.797 22.41 21.84
So data.table is the clear winner. Using a key is faster than ordering, and it has nicer syntax as well, I'd argue. Thanks for the help everyone.
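As an aside (my addition, not part of the original timings): newer versions of data.table also provide setorder(), which sorts a table by reference without building a key, so it may be worth benchmarking too if you don't need the key afterwards:
setorder_dt <- function() {
setorder(fdt, values) # in-place sort, ascending by 'values'
fdt
}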

Resources