I can achieve this task, but I feel like there must be a "best" (slickest, most compact, clearest-code, fastest?) way of doing it, and I haven't figured it out so far ...
For a specified set of categorical factors I want to construct a table of means and variances by group.
generate data:
set.seed(1001)
d <- expand.grid(f1=LETTERS[1:3],f2=letters[1:3],
f3=factor(as.character(as.roman(1:3))),rep=1:4)
d$y <- runif(nrow(d))
d$z <- rnorm(nrow(d))
desired output:
f1 f2 f3 y.mean y.var
1 A a I 0.6502307 0.09537958
2 A a II 0.4876630 0.11079670
3 A a III 0.3102926 0.20280568
4 A b I 0.3914084 0.05869310
5 A b II 0.5257355 0.21863126
6 A b III 0.3356860 0.07943314
... etc. ...
using aggregate/merge:
library(reshape)
m1 <- aggregate(y~f1*f2*f3,data=d,FUN=mean)
m2 <- aggregate(y~f1*f2*f3,data=d,FUN=var)
mvtab <- merge(rename(m1,c(y="y.mean")),
rename(m2,c(y="y.var")))
using ddply/summarise (possibly best but haven't been able to make it work):
mvtab2 <- ddply(subset(d,select=-c(z,rep)),
.(f1,f2,f3),
summarise,numcolwise(mean),numcolwise(var))
results in
Error in output[[var]][rng] <- df[[var]] :
incompatible types (from closure to logical) in subassignment type fix
using melt/cast (maybe best?)
mvtab3 <- cast(melt(subset(d,select=-c(z,rep)),
id.vars=1:3),
...~.,fun.aggregate=c(mean,var))
## now have to drop "variable"
mvtab3 <- subset(mvtab3,select=-variable)
## also should rename response variables
Won't (?) work in reshape2. Explaining ...~. to someone could be tricky!
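(A sketch of my own, not from the original question: reshape2's dcast() accepts only one aggregation function, so one way to get a similar table there is to cast each statistic separately and merge, renaming the columns afterwards.)
library(reshape2)
dm <- melt(subset(d, select = -c(z, rep)), id.vars = c("f1", "f2", "f3"))
m_mean <- dcast(dm, f1 + f2 + f3 ~ variable, fun.aggregate = mean, value.var = "value")
m_var  <- dcast(dm, f1 + f2 + f3 ~ variable, fun.aggregate = var,  value.var = "value")
names(m_mean)[4] <- "y.mean"
names(m_var)[4]  <- "y.var"
mvtab_r2 <- merge(m_mean, m_var)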
Here is a solution using data.table
library(data.table)
d2 = data.table(d)
ans = d2[,list(avg_y = mean(y), var_y = var(y)), 'f1, f2, f3']
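(A hedged aside: with more recent data.table versions the grouping is usually written with by = .(...); the result is the same, only the column names below are my own choice.)
library(data.table)
d2 <- data.table(d)
ans <- d2[, .(y.mean = mean(y), y.var = var(y)), by = .(f1, f2, f3)]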
I'm a bit puzzled. Does this not work:
mvtab2 <- ddply(d,.(f1,f2,f3),
summarise,y.mean = mean(y),y.var = var(y))
This gives me something like this:
f1 f2 f3 y.mean y.var
1 A a I 0.6502307 0.095379578
2 A a II 0.4876630 0.110796695
3 A a III 0.3102926 0.202805677
4 A b I 0.3914084 0.058693103
5 A b II 0.5257355 0.218631264
Which is in the right form, but it looks like the values are different from what you specified.
Edit
Here's how to make your version with numcolwise work:
mvtab2 <- ddply(subset(d,select=-c(z,rep)),.(f1,f2,f3),summarise,
y.mean = numcolwise(mean)(piece),
y.var = numcolwise(var)(piece))
You forgot to pass the actual data to numcolwise. And then there's the little ddply trick that each piece is called piece internally. (Which Hadley points out in the comments shouldn't be relied upon as it may change in future versions of plyr.)
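A hedged alternative sketch that avoids relying on the internal piece object: numcolwise() already returns a function over all numeric columns, so it can be handed to ddply() directly and the two results merged (the renaming mirrors the aggregate/merge version above; object names are my own).
library(plyr)
dd <- subset(d, select = -c(z, rep))
m_tab <- ddply(dd, .(f1, f2, f3), numcolwise(mean))   # columns f1, f2, f3, y
v_tab <- ddply(dd, .(f1, f2, f3), numcolwise(var))
mvtab2b <- merge(rename(m_tab, c(y = "y.mean")),
                 rename(v_tab, c(y = "y.var")))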
(I voted for Joshua's.) Here's an Hmisc::summary.formula solution. The advantage of this for me is that it is well integrated with the Hmisc::latex output "channel".
summary(y ~ interaction(f3,f2,f1), data=d, method="response",
fun=function(y) c(mean.y=mean(y) ,var.y=var(y) ))
#-----output----------
y N=108
+-----------------------+-------+---+---------+-----------+
| | |N |mean.y |var.y |
+-----------------------+-------+---+---------+-----------+
|interaction(f3, f2, f1)|I.a.A | 4|0.6502307|0.095379578|
| |II.a.A | 4|0.4876630|0.110796695|
(Remaining output snipped; the latex -> PDF -> png rendering was shown as an image and is not reproduced here.)
@joran is spot-on with the ddply answer. Here's how I would do it with aggregate. Note that I avoid the formula interface (it is slower).
aggregate(d$y, d[,c("f1","f2","f3")], FUN=function(x) c(mean=mean(x),var=var(x)))
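One caveat worth adding (my note, not part of the answer): with a function that returns a vector, aggregate() packs both statistics into a single matrix column named x. If you want flat y.mean / y.var columns, one way is:
agg <- aggregate(d$y, d[, c("f1", "f2", "f3")],
                 FUN = function(x) c(mean = mean(x), var = var(x)))
agg_flat <- do.call(data.frame, agg)           # splits the matrix column into x.mean, x.var
names(agg_flat)[4:5] <- c("y.mean", "y.var")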
I'm slightly addicted to speed comparisons even though they're largely irrelevant for me in this situation ...
joran_ddply <- function(d) ddply(d,.(f1,f2,f3),
summarise,y.mean = mean(y),y.var = var(y))
joshulrich_aggregate <- function(d) {
aggregate(d$y, d[,c("f1","f2","f3")],
FUN=function(x) c(mean=mean(x),var=var(x)))
}
formula_aggregate <- function(d) {
aggregate(y~f1*f2*f3,data=d,
FUN=function(x) c(mean=mean(x),var=var(x)))
}
library(data.table)
d2 <- data.table(d)
ramnath_datatable <- function(d) {
d[,list(avg_y = mean(y), var_y = var(y)), 'f1, f2, f3']
}
library(Hmisc)
dwin_hmisc <- function(d) {summary(y ~ interaction(f3,f2,f1),
data=d, method="response",
fun=function(y) c(mean.y=mean(y) ,var.y=var(y) ))
}
library(rbenchmark)
benchmark(joran_ddply(d),
joshulrich_aggregate(d),
ramnath_datatable(d2),
formula_aggregate(d),
dwin_hmisc(d))
aggregate is fastest (even faster than data.table, which is a surprise to me, although things might be different with a bigger table to aggregate), even using the formula interface ...
test replications elapsed relative user.self sys.self
5 dwin_hmisc(d) 100 1.235 2.125645 1.168 0.044
4 formula_aggregate(d) 100 0.703 1.209983 0.656 0.036
1 joran_ddply(d) 100 3.345 5.757315 3.152 0.144
2 joshulrich_aggregate(d) 100 0.581 1.000000 0.596 0.000
3 ramnath_datatable(d2) 100 0.750 1.290878 0.708 0.000
(Now I just need Dirk to step up and post an Rcpp solution that is 1000 times faster than anything else ...)
I find the doBy package has some very convenient functions for things like this. For example, the function ?summaryBy is quite handy. Consider:
> summaryBy(y~f1+f2+f3, data=d, FUN=c(mean, var))
f1 f2 f3 y.mean y.var
1 A a I 0.6502307 0.095379578
2 A a II 0.4876630 0.110796695
3 A a III 0.3102926 0.202805677
4 A b I 0.3914084 0.058693103
5 A b II 0.5257355 0.218631264
6 A b III 0.3356860 0.079433136
7 A c I 0.3367841 0.079487973
8 A c II 0.6273320 0.041373836
9 A c III 0.4532720 0.022779672
10 B a I 0.6688221 0.044184575
11 B a II 0.5514724 0.020359289
12 B a III 0.6389354 0.104056229
13 B b I 0.5052346 0.138379070
14 B b II 0.3933283 0.050261804
15 B b III 0.5953874 0.161943989
16 B c I 0.3490460 0.079286849
17 B c II 0.5534569 0.207381592
18 B c III 0.4652424 0.187463143
19 C a I 0.3340988 0.004994589
20 C a II 0.3970315 0.126967554
21 C a III 0.3580250 0.066769484
22 C b I 0.7676858 0.124945402
23 C b II 0.3613772 0.182689385
24 C b III 0.4175562 0.095933470
25 C c I 0.3592491 0.039832864
26 C c II 0.7882591 0.084271963
27 C c III 0.3936949 0.085758343
So the function call is simple, easy to use, and I would say, elegant.
Now, if your primary concern is speed, summaryBy still seems to hold up reasonably well--at least with smaller-sized tasks (note that I couldn't get the ramnath_datatable function to work, for whatever reason):
test replications elapsed relative user.self
4 dwin_hmisc(d) 100 0.50 2.778 0.50
3 formula_aggregate(d) 100 0.23 1.278 0.24
5 gung_summaryBy(d) 100 0.34 1.889 0.35
1 joran_ddply(d) 100 1.34 7.444 1.32
2 joshulrich_aggregate(d) 100 0.18 1.000 0.19
I came across this question and found that the benchmarks were done with small tables, so it's hard to tell which method is better with only 100 rows.
I've also modified the data a bit to make it "unsorted"; this is a more common case, for example when the data come from a database.
I've added a few more data.table trials to see whether setting a key beforehand is faster. It seems that setting the key beforehand doesn't improve performance much here, so Ramnath's solution seems to be the fastest.
set.seed(1001)
d <- data.frame(f1 = sample(LETTERS[1:3], 30e5, replace = T),
                f2 = sample(letters[1:3], 30e5, replace = T),
                f3 = sample(factor(as.character(as.roman(1:3))), 30e5, replace = T),
                rep = sample(1:4, 30e5, replace = T))
d$y <- runif(nrow(d))
d$z <- rnorm(nrow(d))
str(d)
require(Hmisc)
require(plyr)
require(data.table)
d2 = data.table(d)
d3 = data.table(d)
# Set key of d3 to compare how fast it is if the DT is already keyed
setkey(d3,f1,f2,f3)
joran_ddply <- function(d) ddply(d,.(f1,f2,f3),
summarise,y.mean = mean(y),y.var = var(y))
formula_aggregate <- function(d) {
aggregate(y~f1*f2*f3,data=d,
FUN=function(x) c(mean=mean(x),var=var(x)))
}
ramnath_datatable <- function(d) {
d[,list(avg_y = mean(y), var_y = var(y)), 'f1,f2,f3']
}
key_agg_datatable <- function(d) {
  # set the full three-column key inside the timed call, then aggregate
  setkey(d, f1, f2, f3)
  d[,list(avg_y = mean(y), var_y = var(y)), 'f1,f2,f3']
}
one_key_datatable <- function(d) {
  # set only the first key column inside the timed call, then aggregate
  setkey(d, f1)
  d[,list(avg_y = mean(y), var_y = var(y)), 'f1,f2,f3']
}
including_3key_datatable <- function(d) {
d[,list(avg_y = mean(y), var_y = var(y)), 'f1,f2,f3']
}
dwin_hmisc <- function(d) {summary(y ~ interaction(f3,f2,f1),
data=d, method="response",
fun=function(y) c(mean.y=mean(y) ,var.y=var(y) ))
}
require(rbenchmark)
benchmark(joran_ddply(d),
joshulrich_aggregate(d),
ramnath_datatable(d2),
including_3key_datatable(d3),
one_key_datatable(d2),
key_agg_datatable(d2),
formula_aggregate(d),
dwin_hmisc(d)
)
# test replications elapsed relative user.self sys.self
# dwin_hmisc(d) 100 1757.28 252.121 1590.89 165.65
# formula_aggregate(d) 100 433.56 62.204 390.83 42.50
# including_3key_datatable(d3) 100 7.00 1.004 6.02 0.98
# joran_ddply(d) 100 173.39 24.877 119.35 53.95
# joshulrich_aggregate(d) 100 328.51 47.132 307.14 21.22
# key_agg_datatable(d2) 100 24.62 3.532 19.13 5.50
# one_key_datatable(d2) 100 29.66 4.255 22.28 7.34
# ramnath_datatable(d2) 100 6.97 1.000 5.96 1.01
And here is a solution using Hadley Wickham's new dplyr library.
library(dplyr)
d %>% group_by(f1, f2, f3) %>%
summarise(y.mean = mean(y), z.mean = mean(z))
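(A hedged aside: to match the mean/variance table the question asked for, the same pipeline works with y.var in place of z.mean.)
library(dplyr)
d %>% group_by(f1, f2, f3) %>%
  summarise(y.mean = mean(y), y.var = var(y))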
Related
I have the following function that uses nested loops, and honestly I'm not sure how to make the code run more efficiently. It runs fine for 100 sims in my opinion, but when I ran 2000 sims it took almost 12 seconds.
This code will generate any n Brownian motion simulations and works well; the issue is that once the simulation size is increased to, say, 500+ it starts to bog down, and when it hits 2k it's pretty slow, i.e. around 12 seconds.
Here is the function:
ts_brownian_motion <- function(.time = 100, .num_sims = 10, .delta_time = 1,
.initial_value = 0) {
# TidyEval ----
T <- as.numeric(.time)
N <- as.numeric(.num_sims)
delta_t <- as.numeric(.delta_time)
initial_value <- as.numeric(.initial_value)
# Checks ----
if (!is.numeric(T) | !is.numeric(N) | !is.numeric(delta_t) | !is.numeric(initial_value)){
rlang::abort(
message = "All parameters must be numeric values.",
use_cli_format = TRUE
)
}
# Initialize empty data.frame to store the simulations
sim_data <- data.frame()
# Generate N simulations
for (i in 1:N) {
# Initialize the current simulation with a starting value of 0
sim <- c(initial_value)
# Generate the brownian motion values for each time step
for (t in 1:(T / delta_t)) {
sim <- c(sim, sim[t] + rnorm(1, mean = 0, sd = sqrt(delta_t)))
}
# Bind the time steps, simulation values, and simulation number together in a data.frame and add it to the result
sim_data <- rbind(
sim_data,
data.frame(
t = seq(0, T, delta_t),
y = sim,
sim_number = i
)
)
}
# Clean up
sim_data <- sim_data %>%
dplyr::as_tibble() %>%
dplyr::mutate(sim_number = forcats::as_factor(sim_number)) %>%
dplyr::select(sim_number, t, y)
# Return ----
attr(sim_data, ".time") <- .time
attr(sim_data, ".num_sims") <- .num_sims
attr(sim_data, ".delta_time") <- .delta_time
attr(sim_data, ".initial_value") <- .initial_value
return(sim_data)
}
Here is some output of the function:
> ts_brownian_motion(.time = 10, .num_sims = 25)
# A tibble: 275 × 3
sim_number t y
<fct> <dbl> <dbl>
1 1 0 0
2 1 1 -2.13
3 1 2 -1.08
4 1 3 0.0728
5 1 4 0.562
6 1 5 0.255
7 1 6 -1.28
8 1 7 -1.76
9 1 8 -0.770
10 1 9 -0.536
# … with 265 more rows
# ℹ Use `print(n = ...)` to see more rows
As suggested in the comments, if you want speed, you should use cumsum. You need to be clear what type of Brownian motion you want (arithmetic, geometric). For geometric Brownian motion, you'll need to correct the approximation error by adjusting the mean. As an example, the NMOF package (which I maintain) contains a function gbm that implements geometric Brownian motion through cumsum. Here is an example call for 2000 paths with 100 timesteps each.
library("NMOF")
library("zoo") ## for plotting
timesteps <- 100
system.time(b <- NMOF::gbm(2000, tau = 1, timesteps = 100, r = 0, v = 1))
## user system elapsed
## 0.013 0.000 0.013
dim(b) ## each column is one path, starting at time zero
## [1] 101 2000
plot(zoo(b[, 1:5], 0:timesteps), plot.type = "single")
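For completeness, here is a minimal sketch of my own (not part of the NMOF example above) of arithmetic Brownian motion via cumsum, producing long-format output similar to ts_brownian_motion(); the function and column names are my own choice.
abm_cumsum <- function(.time = 100, .num_sims = 10, .delta_time = 1,
                       .initial_value = 0) {
  n_steps <- .time / .delta_time
  # draw all increments at once, one column per simulation, then cumulate
  incr <- matrix(rnorm(n_steps * .num_sims, mean = 0, sd = sqrt(.delta_time)),
                 nrow = n_steps, ncol = .num_sims)
  paths <- rbind(.initial_value, .initial_value + apply(incr, 2, cumsum))
  data.frame(
    sim_number = factor(rep(seq_len(.num_sims), each = n_steps + 1)),
    t          = rep(seq(0, .time, by = .delta_time), times = .num_sims),
    y          = as.vector(paths)
  )
}
system.time(b2 <- abm_cumsum(.time = 100, .num_sims = 2000))  # no per-step rnorm calls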
So I'm trying to find the roots for specific values of Y with uniroot(). I have them all in a column in a data frame, and I want to create a new column with the root found for each of the Ys in the original column via lapply().
The way I create the function that uniroot uses to find its roots is by subtracting the Y value from the constant coefficient of the polynomial; that Y value is passed as an extra argument to uniroot (following the uniroot help page).
After a couple of hours trying to figure out what was happening, I came to think that the value lapply() feeds to the function is the Y, but that it is being read as the "interval" argument inside uniroot, hence the errors about that argument.
I think I could implement this another way, but it'd be much better and simpler if this way has a solution.
pol_mod <- lm(abs_p ~ poly(patron, 5, raw = TRUE), data = bradford)
a <- as.numeric(coefficients(pol_mod)[6])
b <- as.numeric(coefficients(pol_mod)[5])
c <- as.numeric(coefficients(pol_mod)[4])
d <- as.numeric(coefficients(pol_mod)[3])
e <- as.numeric(coefficients(pol_mod)[2])
f <- as.numeric(coefficients(pol_mod)[1])
fs <- function (x,y) {a*x^5 + b*x^4 + c*x^3 + d*x^2 + e*x + f - y}
interpol <- function (y, fs) {
return(uniroot(fs,y=y, interval=c(0,2000)))
}
bradford$concentracion <- lapply(bradford$abs_m, interpol, fs=fs)
The error I'm getting:
Error in uniroot(fs, y = y, interval = c(0, 2000)) :
f.lower = f(lower) is NA
Needless to say, everything works when applied outside of lapply()
I'd be really happy If someone could lend a hand! Thanks in advance!
EDIT: This is what the data frame looks like.
bradford
# A tibble: 9 x 3
patron abs_p abs_m
<dbl> <dbl> <dbl>
1 0 0 1.57
2 25 0.041 1.27
3 125 0.215 1.59
4 250 0.405 1.61
5 500 0.675 0.447
6 750 0.97 0.441
7 1000 1.23 NA
8 1500 1.71 NA
9 2000 2.04 NA
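(Not part of the original question, but a hedged observation: given the NAs visible in abs_m, fs() would return NA at both interval endpoints for those rows, which is exactly what the "f.lower = f(lower) is NA" error complains about. A minimal sketch of a guard, which also extracts $root so the new column is numeric rather than a list of uniroot results:)
interpol_safe <- function(y, fs) {
  if (is.na(y)) return(NA_real_)                  # skip missing absorbances
  uniroot(fs, y = y, interval = c(0, 2000))$root  # keep only the root itself
}
bradford$concentracion <- vapply(bradford$abs_m, interpol_safe, numeric(1), fs = fs)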
I'm just trying to calculate the relative angle between each x,y,z row of my data frame and a reference vector. So far, I use dplyr to group things and apply my angle function to get the relative angle. However, things are quite slow, even for the dummy data I provide here.
set.seed(12345)
x <- replicate(1,c(replicate(1000,rnorm(50,0,0.01))))
y <- replicate(1,c(replicate(1000,rnorm(50,0,0.01))))
z <- replicate(1,c(replicate(1000,rnorm(50,0.9,0.01))))
ref_vector <- data.frame(ref_x=rep(0,100),ref_y=rep(0,100),ref_z=rep(1,100))
set <- rep(seq(1,1000),each=50)
data_rep <- data.frame(x,y,z,ref_vector,set)
> head(data_rep)
# x y z ref_x ref_y ref_z set
# 1 0.005855288 -0.015472796 0.9059337 0 0 1 1
# 2 0.007094660 -0.013354359 0.9040137 0 0 1 1
# 3 -0.001093033 -0.014661486 0.9047502 0 0 1 1
# 4 -0.004534972 -0.002764655 0.9070553 0 0 1 1
# 5 0.006058875 -0.008339952 0.8926551 0 0 1 1
# 6 -0.018179560 -0.008412400 0.9055541 0 0 1 1
I define the angle between two vectors with this angle function,
angle <- function(x,y){
dot.prod <- x%*%y
norm.x <- norm(x,type="2")
norm.y <- norm(y,type="2")
theta <- acos(dot.prod / (norm.x * norm.y))
as.numeric(theta)
}
Then let's apply this to our data_rep:
library(dplyr)
system.time(df_angle <- data_rep%>%
rowwise()%>%
do(data.frame(.,angle_rad=angle(unlist(.[1:3]),unlist(.[4:6]))))%>%
group_by(set)%>%
mutate(angle=angle_rad*180/pi, mean_angle=mean(angle)))
# user system elapsed
# 64.22 0.08 64.81
# Warning message:
# Grouping rowwise data frame strips rowwise nature
As you can see, the process took around 1 minute, and I haven't even used my real data set, which has 350,000 rows and takes about 10 minutes to calculate the relative angles.
I wonder whether there is any way to speed up this process.
Thanks!
Just discover linear algebra for yourself:
m1 = as.matrix(data_rep[, 1:3])  # the x, y, z vectors, one per row
m2 = as.matrix(data_rep[, 4:6])  # the reference vectors, one per row
system.time( {
m1 = m1 / sqrt(rowSums(m1 ^ 2))  # normalise each row to unit length
m2 = m2 / sqrt(rowSums(m2 ^ 2))
RESULT <- acos(rowSums(m1 * m2)) # row-wise dot products give the angles
})
# user system elapsed
# 0.004 0.001 0.006
all.equal(df_angle$angle_rad, RESULT)
# TRUE
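A short follow-up sketch (my addition) to put the vectorised angles back into the data frame and reproduce the degrees / per-set mean step of the original pipeline:
data_rep$angle_rad  <- RESULT
data_rep$angle      <- data_rep$angle_rad * 180 / pi
data_rep$mean_angle <- ave(data_rep$angle, data_rep$set, FUN = mean)  # mean angle per set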
Just make a simple mutate statement instead of your do(data.frame()) part. This improves the performance quite a bit, because you no longer have to convert each row into a data.frame.
system.time(df_angle2 <- data_rep%>%
rowwise() %>%
mutate(angle_rad=angle(x = c(x,y,z),y = c(ref_x,ref_y,ref_z))) %>%
group_by(set)%>%
mutate(angle=angle_rad*180/pi, mean_angle=mean(angle)))
## user system elapsed
## 3.72 0.00 3.71
all.equal(df_angle,df_angle2)
## TRUE
I have this script:
library(plyr)
library(gstat)
library(sp)
library(dplyr)
library(ggplot2)
library(scales)
a<-c(10,20,30,40,50,60,70,80,90,100)
b<-c(15,25,35,45,55,65,75,85,95,105)
x<-rep(a,3)
y<-rep(b,3)
E<-sample(30)
freq<-rep(c(100,200,300),10)
data<-data.frame(x,y,freq,E)
data<-arrange(data,x,y,freq)
df <- ddply(data, "freq", function(h) {
  dim_h <- length(h$x)
  perc_max <- 0.9
  perc_min <- 0.8
  u <- round(seq(perc_max, perc_min, by = -0.1) * dim_h)
  dim_u <- length(u)
  perc_punti <- percent(seq(perc_max, perc_min, by = -0.1))
  # pre-allocate the result vectors (otherwise time[i], sqm[j] and sqmm[i] fail)
  time <- numeric(dim_u)
  sqmm <- numeric(dim_u)
  sqm  <- numeric(2)
  for (i in 1:dim_u) {
    t <- u[i]
    time[i] <- system.time(
      for (j in 1:2) {
        df_tass <- sample_n(h, t)
        df_residuo <- slice(h, -as.numeric(rownames(df_tass)))
        coordinates(df_tass) <- ~x + y
        x.range <- range(h$x)
        y.range <- range(h$y)
        grid <- expand.grid(x = seq(from = x.range[1], to = x.range[2], by = 1),
                            y = seq(from = y.range[1], to = y.range[2], by = 1))
        coordinates(grid) <- ~x + y
        gridded(grid) <- TRUE
        nearest <- krige(E ~ 1, df_tass, grid, nmax = 1)
        nearest_df <- as.data.frame(nearest)
        names(nearest_df) <- c("x", "y", "E")
        # Error of prediction
        df_pred <- inner_join(nearest_df[1:3], select(df_residuo, x, y, E), by = c("x", "y"))
        names(df_pred) <- c("x", "y", "E_pred", "E")
        sqm[j] <- mean((df_pred[, 4] - df_pred[, 3])^2)
      })[3]
    sqmm[i] <- mean(sqm)
  }
  data.frame(sqmm, time, perc_punti)
})
df
I measured the value of the electromagnetic field (the E value) at several points with coordinates (x, y), at different frequencies (the freq value). For each frequency value, I use once 90% of the points and once 80% (the outer for loop over u) to interpolate the value of the electromagnetic field (E) inside grid with nearest-neighbour interpolation (the krige function), and I repeat this 2 times. The remaining points are then used to calculate the prediction error. I hope it's clear.
This script above is a simplified case. Unfortunately, in my case the script takes too long for the two for-loops implemented.
I want to ask if it's possible to simplify the code in some way, for instance by using the apply function family. Thanks.
Reply to @clemlaflemme: OK, it works, thanks! Now I have a little problem with the final data frame; it looks like this:
freq 1 2
1 100 121.00 338.00
2 100 0.47 0.85
3 200 81.00 462.50
4 200 0.74 0.73
5 300 36.00 234.00
6 300 0.82 0.76
but I want something like this:
freq sqmm time
1 100 121.0 0.47
2 100 338.0 0.85
3 200 81.0 0.74
4 200 462.5 0.73
5 300 36.0 0.82
6 300 234.0 0.76
How can I do that?
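(Not part of the original exchange, but a hedged reshaping sketch, assuming, as the printed output suggests, that odd rows hold sqmm, even rows hold time, and columns 1 and 2 correspond to the two sampling fractions; df_wide below is a hypothetical stand-in for that printed result.)
df_wide <- data.frame(freq = c(100, 100, 200, 200, 300, 300),
                      v1   = c(121.00, 0.47, 81.00, 0.74, 36.00, 0.82),
                      v2   = c(338.00, 0.85, 462.50, 0.73, 234.00, 0.76))
sqmm_rows <- df_wide[c(TRUE, FALSE), ]   # odd rows: sqmm
time_rows <- df_wide[c(FALSE, TRUE), ]   # even rows: time
df_long <- data.frame(freq = rep(sqmm_rows$freq, times = 2),
                      sqmm = c(sqmm_rows$v1, sqmm_rows$v2),
                      time = c(time_rows$v1, time_rows$v2))
df_long <- df_long[order(df_long$freq), ]   # same ordering as the desired table
df_long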
I have a data frame full from which I want to take the last column and a column v. I then want to sort both columns on v in the fastest way possible. full is read in from a CSV, but this can be used for testing (I've included some NAs for realism):
n <- 200000
full <- data.frame(A = runif(n, 1, 10000), B = floor(runif(n, 0, 1.9)))
full[sample(n, 10000), 'A'] <- NA
v <- 1
I have v as one here, but in reality it could change, and full has many columns.
I have tried sorting data frames, data tables and matrices each with order and sort.list (some ideas taken from this thread). The code for all these:
# DATA FRAME
ord_df <- function() {
a <- full[c(v, length(full))]
a[with(a, order(a[1])), ]
}
sl_df <- function() {
a <- full[c(v, length(full))]
a[sort.list(a[[1]]), ]
}
# DATA TABLE
require(data.table)
ord_dt <- function() {
a <- as.data.table(full[c(v, length(full))])
colnames(a)[1] <- 'values'
a[order(values)]
}
sl_dt <- function() {
a <- as.data.table(full[c(v, length(full))])
colnames(a)[1] <- 'values'
a[sort.list(values)]
}
# MATRIX
ord_mat <- function() {
a <- as.matrix(full[c(v, length(full))])
a[order(a[, 1]), ]
}
sl_mat <- function() {
a <- as.matrix(full[c(v, length(full))])
a[sort.list(a[, 1]), ]
}
Time results:
ord_df sl_df ord_dt sl_dt ord_mat sl_mat
Min. 0.230 0.1500 0.1300 0.120 0.140 0.1400
Median 0.250 0.1600 0.1400 0.140 0.140 0.1400
Mean 0.244 0.1610 0.1430 0.136 0.142 0.1450
Max. 0.250 0.1700 0.1600 0.140 0.160 0.1600
Or using microbenchmark (results are in milliseconds):
min lq median uq max
1 ord_df() 243.0647 248.2768 254.0544 265.2589 352.3984
2 ord_dt() 133.8159 140.0111 143.8202 148.4957 181.2647
3 ord_mat() 140.5198 146.8131 149.9876 154.6649 191.6897
4 sl_df() 152.6985 161.5591 166.5147 171.2891 194.7155
5 sl_dt() 132.1414 139.7655 144.1281 149.6844 188.8592
6 sl_mat() 139.2420 146.8578 151.6760 156.6174 186.5416
Seems like ordering the data table wins. There isn't all that much difference between order and sort.list except when using data frames where sort.list is much faster.
In the data table versions I also tried setting v as the key (since it is then sorted, according to the documentation), but I couldn't get it to work since the contents of column v are not integer.
I would ideally like to speed this up as much as possible since I have to do it many times for different v values. Does anyone know how I might be able to speed this process up even further? Also might it be worth trying an Rcpp implementation? Thanks.
Here's the code I used for timing if it's useful to anyone:
sortMethods <- list(ord_df, sl_df, ord_dt, sl_dt, ord_mat, sl_mat)
require(plyr)
timings <- raply(10, sapply(sortMethods, function(x) system.time(x())[[3]]))
colnames(timings) <- c('ord_df', 'sl_df', 'ord_dt', 'sl_dt', 'ord_mat', 'sl_mat')
apply(timings, 2, summary)
require(microbenchmark)
mb <- microbenchmark(ord_df(), sl_df(), ord_dt(), sl_dt(), ord_mat(), sl_mat())
plot(mb)
I don't know if it's better to put this sort of thing in as an edit, but it seems more like an answer, so here will do. Updated test functions:
n <- 1e7
full <- data.frame(A = runif(n, 1, 10000), B = floor(runif(n, 0, 1.9)))
full[sample(n, 100000), 'A'] <- NA
fdf <- full
fma <- as.matrix(full)
fdt <- as.data.table(full)
setnames(fdt, colnames(fdt)[1], 'values')
# DATA FRAME
ord_df <- function() { fdf[order(fdf[1]), ] }
sl_df <- function() { fdf[sort.list(fdf[[1]]), ] }
# DATA TABLE
require(data.table)
ord_dt <- function() { fdt[order(values)] }
key_dt <- function() {
setkey(fdt, values)
fdt
}
# MATRIX
ord_mat <- function() { fma[order(fma[, 1]), ] }
sl_mat <- function() { fma[sort.list(fma[, 1]), ] }
Results (using a different computer, R 2.13.1 and data.table 1.8.2):
ord_df sl_df ord_dt key_dt ord_mat sl_mat
Min. 37.56 20.86 2.946 2.249 20.22 20.21
1st Qu. 37.73 21.15 2.962 2.255 20.54 20.59
Median 38.43 21.74 3.002 2.280 21.05 20.82
Mean 38.76 21.75 3.074 2.395 21.09 20.95
3rd Qu. 39.85 22.18 3.151 2.445 21.48 21.42
Max. 40.36 23.08 3.330 2.797 22.41 21.84
So data.table is the clear winner. Using a key is faster than ordering, and has a nicer syntax as well I'd argue. Thanks for the help everyone.
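(A hedged aside of mine, not benchmarked above: newer data.table versions also provide setorder(), which reorders the table by reference and so avoids the copy made by fdt[order(values)]; note that its default places NAs first rather than last.)
so_dt <- function() {
  setorder(fdt, values)   # reorder fdt in place by the values column
  fdt
}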