Why is method 1 slower than method 2? - r

My question: I was debugging some code at work, running it block by block, when I realized a small block was taking an unusually long time. I killed it, made a minor (but logically equivalent) tweak, and it ran almost instantly. I would like to understand why. The following code is in R; however, I imagine the answer may not be specific to R and may apply to most languages of a similar paradigm or way of compiling.
The code & information:
Using R version 3.6.1
Libraries loaded: dplyr, knitr, DataExplorer, glue, zoo
old_df is a data frame of 5,653,380 obs. of 91 variables.
field1 is a column of policy numbers with class "character". It is not unique; each value occurs many times.
date_col1 and date_col2 are columns with class "Date".
Method 1:
new_df <- old_df %>%
  group_by(field1) %>%
  mutate(checkfield = date_col1 - date_col2) %>%
  filter(checkfield < 0) %>%
  filter(row_number() == 1)
old_df$filter <- ifelse(old_df$field1 %in% new_df$field1, 1, 0)
Method 2:
new_df <- old_df %>%
  group_by(field1) %>%
  filter(date_col1 < date_col2) %>%
  filter(row_number() == 1)
old_df$filter <- ifelse(old_df$field1 %in% new_df$field1, 1, 0)
As you can probably see, the intended output of both methods is to add a flag, "1", in the column "filter" for policy numbers where date_col1 < date_col2. I did not write method 1, and my goal in writing method 2 was to change it as little as possible while also making it faster, so please avoid spending too much time on problems with method 1 that are not related to why it is unbearably slower than method 2. Feel free to mention such things, but I would like the crux to be why method 1 was taking 20 or 30 minutes. For example, I believe that in method 1 the first filter call could be moved above the group_by call; this might increase speed by an unnoticeable amount, and I am not too concerned about it.
My thoughts: Clearly method 2 might be a little faster because it avoids making the column "checkfield", but I don't think this is the issue: I ran method 1 line by line, and it appears to be the line filter(checkfield < 0) where things went awry. For testing, I defined two dates x, y and checked class(x - y), which returned "difftime". So in this filter call we are comparing a "difftime" to a "numeric". Perhaps this requires some type-juggling to make the comparison, whereas method 2 compares a date object to a date object?
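A quick way to reproduce the class check I mention (a minimal sketch; x and y here are just two arbitrary Date values, not my real columns):
x <- as.Date("2020-01-01")
y <- as.Date("2020-06-01")
class(x - y)   # "difftime"
class(x < y)   # "logical": the direct comparison never creates a difftime
(x - y) < 0    # comparing a difftime to a plain number still works; it dispatches through Ops.difftime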
Let me know what you think! I am very curious about this.

My explorations so far, with a simplified example and a slightly smaller data set (only a million rows and a minimal subset of columns), have the individual tests (test_cf for filtering on the checkfield variable, test_lt for filtering on the date comparison) taking about the same time as each other, and about the same time as creating the checkfield column. Doing both at once (comb, creating and then comparing) takes about 2.5x longer; I'm not sure why.
Perhaps you can use this as a starting point for bisecting/benchmarking to find the culprit.
     test elapsed relative
2    comb   5.557    2.860
1 make_cf   1.943    1.000
4 test_cf   2.122    1.092
3 test_lt   2.109    1.085
I used rbenchmark::benchmark() because I prefer the output format: microbenchmark::microbenchmark() might be more accurate (but I would be surprised if it made a big difference).
Code:
library(dplyr)
n <- 1e6  ## 5653380 in orig; reduce size for laziness
set.seed(101)
## sample random dates, following
## https://stackoverflow.com/questions/21502332/generating-random-dates
f <- function(n)
  sample(seq(as.Date('1999/01/01'), as.Date('2000/01/01'), by = "day"),
         replace = TRUE,
         size = n)
dd <- tibble(
  date_col1 = f(n),
  date_col2 = f(n)
  ## set up checkfield so we can use it without creating it
) %>% mutate(cf = date_col1 - date_col2)
Benchmark:
library(rbenchmark)
benchmark(
  make_cf = dd %>% mutate(checkfield = date_col1 - date_col2),
  comb    = dd %>% mutate(checkfield = date_col1 - date_col2) %>% filter(checkfield < 0),
  test_lt = dd %>% filter(date_col1 < date_col2),
  test_cf = dd %>% filter(cf < 0),
  columns = c("test", "elapsed", "relative")
)
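The same four expressions could also be timed with microbenchmark if you prefer that output (a sketch, reusing the dd tibble built above):
library(microbenchmark)
microbenchmark(
  make_cf = dd %>% mutate(checkfield = date_col1 - date_col2),
  comb    = dd %>% mutate(checkfield = date_col1 - date_col2) %>% filter(checkfield < 0),
  test_lt = dd %>% filter(date_col1 < date_col2),
  test_cf = dd %>% filter(cf < 0),
  times = 20
)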

I believe most of the increased time is due to creating the new column. As you can see, M1 and M3 have similar times. Of course, that difference of ~2 milliseconds between M1 and M3 will grow with the size of the data.
library(tidyverse)
library(microbenchmark)
set.seed(42)
n = 1e5
d = seq.Date(Sys.Date() - 10000, Sys.Date(), 1)
x = sample(d, n, TRUE)
y = sample(d, n, TRUE)
df1 = data.frame(x, y, id = sample(LETTERS, n, TRUE))

microbenchmark(M1 = {
  df1 %>%
    group_by(id) %>%
    mutate(chk = x - y) %>%
    filter(chk < 0) %>%
    filter(row_number() == 1)
},
M2 = {
  df1 %>%
    group_by(id) %>%
    filter(x < y) %>%
    filter(row_number() == 1)
},
M3 = {
  df1 %>%
    group_by(id) %>%
    mutate(chk = x - y) %>%
    filter(x < y) %>%
    filter(row_number() == 1)
})
# Unit: milliseconds
#  expr       min        lq      mean    median       uq       max neval
#    M1 13.130673 13.405151 15.088266 14.096772 15.56080 22.636533   100
#    M2  5.931192  6.208457  6.564363  6.402879  6.71973  9.354252   100
#    M3 11.360640 11.607993 12.449220 12.001383 12.57732 18.260131   100
For the point about comparing difftime to numeric, there doesn't seem to be a lot of difference:
library(microbenchmark)
set.seed(42)
n = 1e7
x = sample(d, n, TRUE)
y = sample(d, n, TRUE)
df1 = data.frame(x, y)
df1$difference = df1$x - df1$y
class(df1$difference)
#[1] "difftime"

microbenchmark(date_vs_date = {
  df1 %>% filter(x < y)
},
date_vs_numeric = {
  df1 %>% filter(difference < 0)
})
# Unit: milliseconds
#             expr      min       lq     mean   median       uq      max neval
#     date_vs_date 177.1789 222.4112 243.6617 233.7221 244.2765 430.4273   100
#  date_vs_numeric 181.6222 217.1121 251.6127 232.7213 249.8218 455.8285   100
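If you want to rule out the difftime class dispatch entirely, you can also time the same filter on a plain numeric copy of the column (a sketch, reusing df1 from above, in case anyone wants to check):
df1$difference_num = as.numeric(df1$difference)
microbenchmark(numeric_vs_numeric = {
  df1 %>% filter(difference_num < 0)
})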

Related

Is it possible to do your own efficient descriptive statistics function? - R

Usually, I find myself using a few summary functions or making my own computations to get some additional initial information from the data. For example, I wanted to see the count and percentage per variable given a limit of distinct values:
library(dplyr)     # for %>%
library(magrittr)  # for %T>%

table_transposed <- function(vector){
  merge(as.data.frame(table(vector, dnn = "values")),
        as.data.frame(round(prop.table(table(vector, dnn = "values")), 2)),
        by = "values",
        all.x = TRUE) %>%
    data.table::transpose(keep.names = "values",
                          make.names = names(.)[1]) %T>%
    {.[, c("values")] <- c("Count", "Percentage")}
}

table_transposed_filter <- function(dataframe, max_number_categories) {
  (lapply(dataframe, function(x) NROW(unique(x))) <= max_number_categories) %>%
    as.vector() %>%
    {dataframe[, .]} %>%
    lapply(table_transposed)
}
So, you give the dataframe and the threshold of distinct values per variable.
table_transposed_filter(mtcars, 10)
However, it's SUPER slow (maybe because of using merge() instead of left_join() from dplyr). Now, I'm trying to figure out an efficient, fast, and simple way to do a combination of psych::describe(), Hmisc::describe(), others, and my own, for numeric and categorical variables (one descriptive function for each). Something like this (for numeric variables):
| Variable | dtype | mean | mode | variance | skew | percentile 25 | ...
If I create this table mainly with sapply(), for example, is that better (more efficient, faster, simpler code) than actually learning to create an R package and developing it in there?
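For what it's worth, a minimal sketch of the sapply() route I have in mind for the numeric part (only a few of the columns from the table above; the names are illustrative, not a fixed API):
num_summary <- function(df) {
  num <- df[sapply(df, is.numeric)]
  t(sapply(num, function(x) c(
    mean     = mean(x, na.rm = TRUE),
    variance = var(x, na.rm = TRUE),
    p25      = quantile(x, 0.25, na.rm = TRUE, names = FALSE),
    median   = median(x, na.rm = TRUE)
  )))
}
num_summary(mtcars)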
PS: I thought about putting this question on Meta Stack Exchange or Cross Validated, but neither seems to fit it.
Here's a somewhat faster version. It's about 2x faster on small data (like mtcars), but the difference narrows a little on larger data.
This makes sense, as the most expensive operation you do is table(): your version calls it twice, my version calls it once. I didn't profile the code, but my guess is that table() is the bottleneck by more than an order of magnitude on any sizeable data, so it's a waste to try to optimize any other parts of the code.
t_transp = function(x, digits = 2) {
  tab = table(x)
  prop_tab = prop.table(tab)
  df = data.frame(values = c("Count", "Percentage"))
  df = cbind(df, rbind(tab, round(prop_tab, digits = digits)))
  row.names(df) = NULL
  df
}

t_transp_filter = function(data, n_max, ...) {
  lapply(Filter(function(x) NROW(unique(x)) <= n_max, data), t_transp, ...)
}
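For a quick look at the output shape on a single column (a usage sketch, not part of the benchmark):
t_transp(mtcars$cyl)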
Benchmarking:
microbenchmark::microbenchmark(
  gregor = t_transp_filter(mtcars, n_max = 4),
  OP     = table_transposed_filter(mtcars, 4),
  times  = 20
)
# Unit: milliseconds
#    expr    min     lq     mean  median      uq    max neval cld
#  gregor 1.6483 1.7093 2.253425 1.74765 1.84680 7.5394    20  a
#      OP 5.6988 5.7627 6.316295 6.08545 6.57965 8.1048    20   b
set.seed(47)
df = as.data.frame(matrix(
  c(sample(letters[1:5], size = 1e5 * 20, replace = T)),
  ncol = 20))

microbenchmark::microbenchmark(
  gregor = t_transp_filter(df, n_max = 5),
  OP     = table_transposed_filter(df, 5),
  times  = 20
)
# Unit: milliseconds
#    expr      min        lq     mean    median       uq      max neval cld
#  gregor  59.5466  59.95545  63.6825  61.14075  67.2167  75.4270    20  a
#      OP 110.3265 117.35585 123.8782 118.91005 133.7795 149.0651    20   b
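To confirm the guess that table() dominates, one could profile the original function directly with base R's Rprof (a rough sketch on the df built above; profvis would give a nicer display of the same information):
Rprof(tmp <- tempfile())
for (i in 1:20) table_transposed_filter(df, 5)
Rprof(NULL)
head(summaryRprof(tmp)$by.total)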

Why is dplyr code which precomputes the maximum of a column slower than dplyr code which computes it inside the mutate call?

Sample data frame:
ngroups <- 100
nsamples <- 1000
foo <- data.frame(engine = rep(seq(1, ngroups), each = nsamples),
                  cycles = runif(ngroups * nsamples, 0, nsamples))
I want to find the max of cycles for each engine group, and create a new variable tte = max(cycles) - cycles with mutate. I thought that if I precomputed the column of maximum cycles, instead of recomputing it inside the mutate command for each row, the code would be faster. It turns out I'm wrong:
library(microbenchmark)
library(dplyr)
library(magrittr)

add_tte <- function(dataset){
  dataset %<>% group_by(engine) %>% mutate(max_cycles = max(cycles)) %>%
    mutate(tte = max_cycles - cycles) %>% select(-max_cycles) %>% ungroup
}

add_tte_old <- function(dataset){
  dataset %<>% group_by(engine) %>% mutate(tte = max(cycles) - cycles) %>% ungroup
}

microbenchmark(add_tte(foo), add_tte_old(foo), times = 500)
# Unit: milliseconds
#              expr      min        lq     mean   median       uq       max neval
#      add_tte(foo) 17.45324 21.107264 26.50535 24.52625 28.75208 113.98433   500
#  add_tte_old(foo)  8.10376  9.949188 13.35830 12.18336 14.52474  77.64578   500
Why is this happening? Is the reason that dplyr computes the maximum just once per group, instead of once per row?
EDIT: even if I use a single mutate statement in add_tte and create a bigger example, add_tte_old is still faster:
# these are the only lines of code modified, the rest is as before
nsamples <- 10000
foo <- data.frame(engine = rep(seq(1, ngroups), each = nsamples),
                  cycles = runif(ngroups * nsamples, 0, nsamples))

add_tte <- function(dataset){
  dataset %<>% group_by(engine) %>%
    mutate(max_cycles = max(cycles), tte = max_cycles - cycles) %>%
    select(-max_cycles) %>% ungroup
}
# the new results are:
microbenchmark(add_tte(foo), add_tte_old(foo), times = 500)
# Unit: milliseconds
#              expr      min        lq      mean    median        uq      max neval
#      add_tte(foo) 90.46658 107.14015 139.13570 131.83689 158.24358 411.3272   500
#  add_tte_old(foo) 39.38357  46.13531  62.57386  52.00782  69.26815 176.1512   500
You have made some wrong assumptions, but besides that, and more importantly, you are not comparing like with like.
It would make more sense to look at the two variants below:
add_tte <- function(dataset) {
  dataset %<>% group_by(engine) %>%
    mutate(max_cycles = rep(max(cycles), times = n()), tte = max_cycles - cycles) %>%
    select(-max_cycles) %>% ungroup
}

add_tte_old <- function(dataset) {
  dataset %<>% group_by(engine) %>%
    mutate(extra = rep(1, times = n()), tte = max(cycles) - cycles) %>%
    select(-extra) %>% ungroup
}
microbenchmark(add_tte(foo), add_tte_old(foo), times = 100)
On my machine, these two are pretty similar.
It is kind of ironic that with your way of attempting to pre-compute the max(cycles), you probably did what you were trying to avoid :)
In the case here, you should really use the explicit rep() to fill up the column, whereas in the subtraction max(cycles) - cycles the auto-recycling is alright.
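A quick way to see the length difference this is pointing at (a toy sketch, independent of the benchmark data):
cycles <- runif(5)
length(max(cycles))           # 1: dplyr has to recycle this up to the group size
length(max(cycles) - cycles)  # 5: already full length, nothing left to fill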

plyr outperforms dplyr and data.table - What's wrong?

I have to apply a function to every row of a large table (~2M rows). I used to use plyr for that, but the table is growing continuously and the current solution is starting to approach unacceptable runtimes. I thought I could just switch to data.table or dplyr and all would be fine, but that's not the case.
Here's an example:
library(data.table)
library(plyr)
library(dplyr)

dt = data.table("ID_1" = c(1:1000),                        # unique ID
                "ID_2" = ceiling(runif(1000, 0, 100)),     # other ID, duplicates possible
                "group" = sample(LETTERS[1:10], 1000, replace = T),
                "value" = runif(1000),
                "ballast1" = "X",                          # keeps unchanged in derive
                "ballast2" = "Y",                          # keeps unchanged in derive
                "ballast3" = "Z",                          # keeps unchanged in derive
                "value_derived" = 0)
setkey(dt, ID_1)
extra_arg = c("A", "F", "G", "H")
ID_1 is guaranteed to contain no duplicates. Now I define a function to apply to every row/ID_1:
derive = function(tmprow, extra_arg){
  if(tmprow$group %in% extra_arg){return(NULL)}   # exclude entries occurring in extra_arg
  group_index = which(LETTERS == tmprow$group)
  group_index = ((group_index + sample(1:26, 1)) %% 25) + 1
  new_group = LETTERS[group_index]
  if(new_group %in% unique(dt$group)){return(NULL)}
  new_value = runif(1)
  row_derived = tmprow
  row_derived$group = new_group
  row_derived$value = runif(1)
  row_derived$value_derived = 1
  return(row_derived)
}
This one doesn't do anything useful (the actual one does). The point is that the function takes one row and computes a new row of the same format.
Now the comparison:
set.seed(42)
system.time(result_dt <- dt[, derive(.SD, extra_arg), by = ID_1])

set.seed(42)
system.time(result_dplyr <- dt %>% group_by(ID_1) %>% do(derive(., extra_arg)))

set.seed(42)
system.time(results_plyr <- ddply(dt, .variables = "ID_1", .fun = derive, extra_arg))
plyr is about 8x faster than both data.table and dplyr. Obviously I'm doing something wrong here, but what?
EDIT
Thanks to eddi's answer, I could reduce the runtimes for data.table and dplyr to ~0.6 and ~0.8 of the plyr version, respectively. I initialized row_derived as a data.frame: row_derived = as.data.frame(tmprow). That's cool, but I still expected a higher performance increase from these packages... any further suggestions?
The issue is that the assignment you use has a very high overhead in data.table; plyr converts the row to a data.frame before passing it to your derive function, and thus avoids that overhead:
library(microbenchmark)
df = as.data.frame(dt)
microbenchmark({dt$group = dt$group}, {df$group = df$group})
# Unit: microseconds
#                     expr      min       lq       mean    median       uq      max neval
#  { dt$group = dt$group } 1895.865 2667.499 3092.38903 3080.3620 3389.049 4984.406   100
#  { df$group = df$group }   26.045   45.244   64.13909   61.6045   79.635  157.266   100
I can't suggest a good fix, since you say your example is not the real problem, so there is no point in solving it better. Some basic suggestions to look at: vectorize the code, and use := or set() instead (depending on what exactly you end up doing).
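For reference, a minimal sketch of what "using := or set instead" looks like (toy data, not the real problem):
library(data.table)
dt2 <- data.table(ID = 1:5, group = LETTERS[1:5], value_derived = 0L)
set(dt2, j = "group", value = tolower(dt2$group))  # modifies the column by reference, no copy
dt2[, value_derived := 1L]                         # same idea using :=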

Going from a for loop to a function in R

I'm curious how I could convert a for loop that I've written into a function in R. I have no experience with writing my own functions in R. I looked here and here, but these did not seem to offer much help. I am aware that for loops are not necessary, and overall I'm trying to do something similar to this blog post.
The for loop with reproducible data is here:
library(caTools)   # for combs()

P <- c(1:50)
y <- length(P)
D <- as.data.frame(combs(P, 2))
Z <- choose(y, 2)
Num = NULL
Denom = NULL
Diff = NULL

for(n in 1:Z)
{
  Num[n] = abs(D$V1[n] - D$V2[n])
  Denom[n] = max(D$V1[n], D$V2[n])
  Diff[n] = Num[n] / Denom[n]
}

PV = mean(Diff)
PV
But, I'm interested in calculating PV based on levels such as in this data:
DATA <- c(1:500)
NAME <- c("a", "b", "c", "d", "e")
mydf <- as.data.frame(cbind(DATA, NAME))
Therefore, my final code I would like to use would be:
ANSWER <- tapply(mydf$DATA, mydf$NAME, MY.FUNCTION)
So, if I could turn the above for loop into a working function I could run the tapply function to get PV based on levels.
Any help would be appreciated, as would any other suggestions besides the approach I offer.
Thanks!
Once you have your library loaded:
library(caTools)
Here's a function you can run on your data:
mymeandiff <- function(values){
  df <- as.data.frame(combs(values, 2))
  diff <- abs(df$V1 - df$V2) / pmax(df$V1, df$V2)
  mean(diff)
}
mymeandiff(1:50)
Then we can use dplyr to run on each group (after correcting the data):
mydf$DATA <- as.numeric(as.character(mydf$DATA))

library(dplyr)
mydf %>% group_by(NAME) %>%
  summarise(mymeandiff(DATA))
For apply, rather than dplyr:
tapply(mydf$DATA, mydf$NAME, FUN = mymeandiff)
Let's time it:
microbenchmark::microbenchmark(
  tapply = tapply(mydf$DATA, mydf$NAME, FUN = mymeandiff),
  dplyr  = mydf %>% group_by(NAME) %>% summarise(mymeandiff(DATA))
)
Unit: milliseconds
   expr      min       lq     mean   median       uq       max neval
 tapply 60.36543 61.08658 63.81995 62.61182 66.13671  80.37819   100
  dplyr 61.84766 62.53751 67.33161 63.61270 67.58688 287.78364   100
tapply is slightly faster.
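Another base R route along the same lines (a sketch, reusing mymeandiff and the corrected mydf from above):
vapply(split(mydf$DATA, mydf$NAME), mymeandiff, numeric(1))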

Why is sum() working in this dplyr expression while quantile() isn't?

I want to calculate the quantiles of each row of a data frame and return the result as a matrix. Since I want to calculate an arbitrary number of quantiles (and I imagine it is faster to calculate them all at once rather than re-running the function), I tried using a formula I found in this question:
library(dplyr)
df <- as.data.frame(matrix(rbinom(1000, 10, 0.5), nrow = 2))

interim_res <- df %>%
  rowwise() %>%
  do(out = sapply(min(df):max(df), function(i) sum(i == .)))
interim_res <- interim_res[[1]] %>% do.call(rbind, .) %>% as.data.frame(.)
This makes sense, but when I try to apply the same framework to the quantile() function, as coded here,
interim_res <- df %>%
  rowwise() %>%
  do(out = quantile(., probs = c(0.1, 0.5, 0.9)))
interim_res <- interim_res[[1]] %>% do.call(rbind, .) %>% as.data.frame(.)
I get this error message:
Error in sort.int(x, na.last = na.last, decreasing = decreasing, ...) :
'x' must be atomic
Why am I getting an error with quantile and not sum? How should I fix this issue?
The . inside do() is a data frame, which is why you get the error. This works:
df %>%
  rowwise() %>%
  do(data.frame(as.list(quantile(unlist(.), probs = c(0.1, 0.5, 0.9)))))
but risks being horrendously slow. Why not just:
apply(df, 1, quantile, probs = c(0.1,0.5,0.9))
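To see directly why sum() tolerated the data-frame . while quantile() did not, here is a small illustration (toy data, unrelated to the df above):
d <- data.frame(a = 1:3, b = 4:6)
sum(d)               # works: the Summary group generic handles data frames column by column
# quantile(d)        # errors: quantile() ultimately needs a single atomic vector
quantile(unlist(d))  # fine once the columns are flattened into one atomic vector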
Here are some timings with larger data:
df <- as.data.frame(matrix(rbinom(100000, 10, 0.5), nrow = 1000))

library(microbenchmark)
microbenchmark(
  dplyr = df %>% rowwise() %>% do(data.frame(as.list(quantile(unlist(.), probs = c(0.1, 0.5, 0.9))))),
  apply = apply(df, 1, quantile, probs = c(0.1, 0.5, 0.9)),
  times = 5
)
Produces:
  expr       min        lq      mean    median        uq       max neval
 dplyr 2375.2319 2376.6658 2446.4070 2419.4561 2454.6017 2606.0794     5
 apply  224.7869  231.7193  246.7137  233.4757  245.0718  298.5144     5
If you go the apply route you should probably stick with a matrix from the get go.
