I'm trying to divide each row of a dataframe by a number stored in a second mapping dataframe.
for(g in rownames(data_table)){
print(g)
data_table[g,] <- data_table[g,]/mapping[g,2]
}
However, this is incredibly slow: each row takes 1-2 seconds to run. I know iteration is usually not the best way to do things in R, but I don't know how else to do it. Is there any way I can speed up the runtime?
Try this:
sweep(data_table, 1, mapping[[2]], "/")
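Here sweep() divides each row of data_table (MARGIN = 1) by the corresponding element of mapping's second column. A tiny toy illustration of the idea (made-up data, not your tables):
dt <- data.frame(a = c(2, 4, 6), b = c(10, 20, 30))
m <- c(2, 4, 6)
sweep(dt, 1, m, "/")
#   a b
# 1 1 5
# 2 1 5
# 3 1 5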
In terms of speed, here is a benchmark of the possibilities using the iris dataset, including your version:
# assumed setup (not shown in the original): the numeric columns of iris
test <- iris[, 1:4]

microbenchmark::microbenchmark(
  A = {
    for (g in rownames(test)) {
      # print(g)
      test[g, ] <- test[g, ] / test[g, 2]
    }
  },
  B = sweep(test, 1, test[[2]], "/"),
  C = test / test[[2]],
  times = 100
)
#Unit: microseconds
#expr min lq mean median uq max neval
#A 82374.693 83722.023 101688.1254 84582.052 147280.057 157507.892 100
#B 453.652 484.393 514.4094 513.850 539.480 623.688 100
#C 404.506 423.794 456.0063 446.101 470.675 729.205 100
You can vectorize this operation if the two data frames have the same number of rows:
dt <- data.frame(a = rnorm(100), b = rnorm(100))
mapping <- data.frame(x = rnorm(100), y = rnorm(100))
dt / mapping[,2]
Is using comment() from base to assign information to an R object slowing the code down? That is, should it be used with care?
Context: I have a function that creates several tibbles/dataframes that are saved in a list, and I'm thinking of attaching a comment to each dataframe (or just one comment to the entire list).
From the comment documentation it seems that the method is just an interface to get/set a comment attribute on any R object.
I can't see it becoming a burden in the vast majority of real-world use cases.
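For reference, comment() does nothing more than store and retrieve a "comment" attribute on the object, so each annotation is a small, fixed-size addition; a quick illustration:
x <- data.frame(a = 1:3)
comment(x) <- "created for the demo"   # sets the "comment" attribute
comment(x)                             # "created for the demo"
attr(x, "comment")                     # same value, retrieved via attr()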
To get an idea of how the function behaves under load, I've written a simple function that generates n dataframes (2000 rows, 3 columns) and optionally annotates them. The results are appended to a list:
df_and_comment <- function(n, add_comment = FALSE) {
  res_list <- list()
  for (i in seq_len(n)) {
    x <- data.frame(
      x = rnorm(2000),
      y = rnorm(2000),
      z = rnorm(2000)
    )
    if (add_comment) {
      comment(x) <- sprintf("this is df no: %d", i)
    }
    res_list[[i]] <- x
  }
  res_list
}
Normal load - creating 50 dataframes
library(microbenchmark)
microbenchmark(
  df_and_comment(n = 50),
  df_and_comment(n = 50, add_comment = TRUE),
  times = 10
)
Unit: milliseconds
expr min lq mean median uq max neval
df_and_comment(n = 50) 25.34398 25.51473 26.70731 25.74472 25.97483 33.81251 10
df_and_comment(n = 50, add_comment = TRUE) 26.32009 26.39826 27.49835 26.60218 27.80038 32.47273 10
Heavy load - creating 15,000 dataframes
microbenchmark(
  df_and_comment(n = 15000),
  df_and_comment(n = 15000, add_comment = TRUE),
  times = 10
)
Unit: seconds
expr min lq mean median uq max neval
df_and_comment(n = 15000) 8.218535 8.254919 8.324075 8.317126 8.354637 8.469191 10
df_and_comment(n = 15000, add_comment = TRUE) 8.414405 8.561279 8.687380 8.571137 8.685309 9.591972 10
In both cases, the performance difference is completely negligible. I wouldn't be worried about the performance implications of annotating dataframes/regression results iteratively.
Usually, I find myself using a few summary functions or making my own computations to get some additional information about the data. For example, I wanted to see the count and percentage per variable, given a limit on the number of distinct values:
table_transposed <- function(vector){
  merge(as.data.frame(table(vector, dnn = "values")),
        as.data.frame(round(prop.table(table(vector, dnn = "values")), 2)),
        by = "values",
        all.x = TRUE) %>%
    data.table::transpose(keep.names = "values",
                          make.names = names(.)[1]) %T>%
    {.[, c("values")] <- c("Count", "Percentage")}
}

table_transposed_filter <- function(dataframe, max_number_categories) {
  (lapply(dataframe, function(x) NROW(unique(x))) <= max_number_categories) %>%
    as.vector() %>%
    {dataframe[, .]} %>%
    lapply(table_transposed)
}
So, you give the dataframe and the threshold of distinct values per variable.
table_transposed_filter(mtcars, 10)
However, it's SUPER slow (maybe because of using merge() instead of left_join() from dplyr). Now, I'm trying to figure out an efficient, fast, and simple way to combine psych::describe(), Hmisc::describe(), others, and my own computations, for numeric and categorical variables (one descriptive function for each). Something like (for numerical variables):
| Variable | dtype | mean | mode | variance | skew | percentile 25 | ...
If I build this table mainly with sapply(), for example, is that better (more efficient, faster, simpler code) than actually learning to create an R package and developing it there?
PS: I thought about putting this question on StackMetaExchange or Crossvalidation, but neither seems to fit it.
Here's a somewhat faster version. It's about 2x faster on small data (like mtcars), but the difference narrows a little bit on larger data.
This makes sense, as the most expensive operation you do is table() - your version does it twice, my version does it once. I didn't profile the code, but my guess is that table() is the bottleneck by more than an order of magnitude on any sizeable data, so it's a waste to try to optimize any other parts of the code.
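If you want to check that guess rather than trust it, base R's sampling profiler will show where the time goes; a minimal sketch (assuming your table_transposed_filter() and its dependencies are loaded):
Rprof(prof_file <- tempfile())                                # start the sampling profiler
invisible(replicate(200, table_transposed_filter(mtcars, 4))) # exercise the slow version
Rprof(NULL)                                                   # stop profiling
head(summaryRprof(prof_file)$by.self)                         # time spent per function
The faster version: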
t_transp = function(x, digits = 2) {
  tab = table(x)
  prop_tab = prop.table(tab)
  df = data.frame(values = c("Count", "Percentage"))
  df = cbind(df, rbind(tab, round(prop_tab, digits = digits)))
  row.names(df) = NULL
  df
}

t_transp_filter = function(data, n_max, ...) {
  lapply(Filter(function(x) NROW(unique(x)) <= n_max, data), t_transp, ...)
}
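A quick usage check (not part of the original code above), mirroring the call from the question:
t_transp(mtcars$cyl)               # one two-row table: Count / Percentage per value
t_transp_filter(mtcars, n_max = 4) # the same, for every column with at most 4 distinct values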
Benchmarking:
microbenchmark::microbenchmark(
  gregor = t_transp_filter(mtcars, n_max = 4),
  OP = table_transposed_filter(mtcars, 4),
  times = 20
)
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# gregor 1.6483 1.7093 2.253425 1.74765 1.84680 7.5394 20 a
# OP 5.6988 5.7627 6.316295 6.08545 6.57965 8.1048 20 b
set.seed(47)
df = as.data.frame(matrix(
  c(sample(letters[1:5], size = 1e5 * 20, replace = TRUE)),
  ncol = 20))
microbenchmark::microbenchmark(
  gregor = t_transp_filter(df, n_max = 5),
  OP = table_transposed_filter(df, 5),
  times = 20
)
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# gregor 59.5466 59.95545 63.6825 61.14075 67.2167 75.4270 20 a
# OP 110.3265 117.35585 123.8782 118.91005 133.7795 149.0651 20 b
Like most people, I'm impressed by Hadley Wickham and what he's done for R -- so I figured I'd move some functions toward his tidyverse ... having done so, I'm left wondering what the point of it all is.
My new dplyr functions are much slower than their base equivalents -- I hope I'm doing something wrong. I'd particularly like some payoff for the effort required to understand non-standard evaluation.
So, what am I doing wrong? Why is dplyr so slow?
An example:
require(microbenchmark)
require(dplyr)
df <- tibble(
  a = 1:10,
  b = c(1:5, 4:0),
  c = 10:1)

addSpread_base <- function() {
  df[['spread']] <- df[['a']] - df[['b']]
  df
}
addSpread_dplyr <- function() df %>% mutate(spread := a - b)
all.equal(addSpread_base(), addSpread_dplyr())
microbenchmark(addSpread_base(), addSpread_dplyr(), times = 1e4)
Timing results:
Unit: microseconds
expr min lq mean median uq max neval
addSpread_base() 12.058 15.769 22.07805 24.58 26.435 2003.481 10000
addSpread_dplyr() 607.537 624.697 666.08964 631.19 636.291 41143.691 10000
So using dplyr functions to transform the data takes about 30x longer -- surely this isn't the intention?
I figured that perhaps this is too easy a case -- and that dplyr would really shine if we had a more realistic case where we are adding a column and sub-setting the data -- but this was worse. As you can see from the timings below, this is ~70x slower than the base approach.
# mutate and substitute
addSpreadSub_base <- function(df, col1, col2) {
  df[['spread']] <- df[['a']] - df[['b']]
  df[, c(col1, col2, 'spread')]
}

addSpreadSub_dplyr <- function(df, col1, col2) {
  var1 <- as.name(col1)
  var2 <- as.name(col2)
  qq <- quo(!!var1 - !!var2)
  df %>%
    mutate(spread := !!qq) %>%
    select(!!var1, !!var2, spread)
}
all.equal(addSpreadSub_base(df, col1 = 'a', col2 = 'b'),
          addSpreadSub_dplyr(df, col1 = 'a', col2 = 'b'))

microbenchmark(addSpreadSub_base(df, col1 = 'a', col2 = 'b'),
               addSpreadSub_dplyr(df, col1 = 'a', col2 = 'b'),
               times = 1e4)
Results:
Unit: microseconds
expr min lq mean median uq max neval
addSpreadSub_base(df, col1 = "a", col2 = "b") 22.725 30.610 44.3874 45.450 53.798 2024.35 10000
addSpreadSub_dplyr(df, col1 = "a", col2 = "b") 2748.757 2837.337 3011.1982 2859.598 2904.583 44207.81 10000
These are microseconds, and your dataset has 10 rows. Unless you plan on looping over millions of 10-row datasets, your benchmark is pretty much irrelevant (and in that case I can't imagine a situation where it wouldn't be wise to bind them together as a first step).
Let's do it with a bigger dataset, like 1 million times bigger:
df <- tibble(
  a = 1:10,
  b = c(1:5, 4:0),
  c = 10:1)
df2 <- bind_rows(replicate(1000000, df, simplify = FALSE))

addSpread_base <- function(df) {
  df[['spread']] <- df[['a']] - df[['b']]
  df
}
addSpread_dplyr <- function(df) df %>% mutate(spread = a - b)

microbenchmark::microbenchmark(
  addSpread_base(df2),
  addSpread_dplyr(df2),
  times = 100)
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# addSpread_base(df2) 25.85584 26.93562 37.77010 32.33633 35.67604 170.6507 100 a
# addSpread_dplyr(df2) 26.91690 27.57090 38.98758 33.39769 39.79501 182.2847 100 a
Still quite fast and not much difference.
As for the "whys" of the result that you got: you're calling a much more complex function, so it carries more overhead.
Commenters have pointed out that dplyr doesn't try too hard to be fast, and maybe that's true when you compare it to data.table and the interface is the first concern, but the authors have been working hard on speed as well. Hybrid evaluation, for example, allows (if I understand it correctly) C code to be executed directly on grouped data when aggregating with common functions, which can be much faster than base code; but simple code will always run faster with simple functions.
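As a rough illustration of the kind of grouped aggregation where that machinery matters (a sketch only, no timings claimed; the data here is made up):
library(dplyr)
big <- tibble(g = sample(1000, 1e6, replace = TRUE), x = rnorm(1e6))
microbenchmark::microbenchmark(
  dplyr = big %>% group_by(g) %>% summarise(m = mean(x)),
  base = tapply(big$x, big$g, mean),
  times = 10
)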
I was hoping someone could point me in the right direction for creating a vector in R containing a defined number of randomly generated values. I am a complete newbie to R, and I have learned that the concatenate function is used for creating vectors. However, I wish to populate the vector with 50 random numbers. I do not wish to specify a range or any other conditions for the numbers.
MyVectorObject <- c(...)
Any suggestions would be greatly appreciated!
It depends on which numbers you want to generate. These are some options.
x1 <- rpois(n = 50, lambda = 10)
x2 <- runif(n = 50, min = 1, max = 10)
x3 <- sample(x = c(1, 3, 5), size = 50, replace = TRUE)
If we are talking about integers, you want to generate numbers in the interval [-base::.Machine$integer.max, base::.Machine$integer.max], which on my computer is [-2147483647, 2147483647].
Implementation
You can use base::sample to generate positive numbers from 1 to base::.Machine$integer.max:
random.pos <- function(N) {
  int.max <- .Machine$integer.max
  return(sample(int.max, N, replace = TRUE))
}
If you also want negative numbers, you can use:
random.posneg <- function(N) {
  int.max <- .Machine$integer.max
  random.numbers <- sample(int.max, N, replace = TRUE)
  random.signs <- sample(c(1, -1), N, replace = TRUE)
  return(random.numbers * random.signs)
}
NOTE: none of the functions above ever generates 0 (zero).
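(If you do need zero from sample, one hypothetical workaround, not part of the original answer, is to shift the sample by one; sample accepts a double upper bound, so int.max + 1 does not overflow:)
random.pos0 <- function(N) {
  int.max <- .Machine$integer.max
  sample(int.max + 1, N, replace = TRUE) - 1   # values in 0..int.max (returned as doubles)
}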
The best approach (in my opinion) is to use the base::runif function.
random.runif <- function(N) {
  int.max <- .Machine$integer.max
  int.min <- -int.max
  return(as.integer(runif(N, int.min, int.max)))
}
This will be a little bit slower than using base::sample, but you get uniformly distributed random numbers that can include zero.
Benchmark
library(microbenchmark)
require(compiler)
random.runif <- cmpfun(random.runif)
random.pos <- cmpfun(random.pos)
random.posneg <- cmpfun(random.posneg)
N <- 500
op <- microbenchmark(
  RUNIF = random.runif(N),
  POS = random.pos(N),
  POSNEG = random.posneg(N),
  times = 10000
)
print(op)
## library(ggplot2)
## boxplot(op)
## qplot(y=time, data=op, colour=expr) + scale_y_log10()
And the results of the benchmark above:
Unit: microseconds
expr min lq mean median uq max neval
RUNIF 13.423 14.251 15.197122 14.482 14.694 2425.290 10000
POS 4.174 5.043 5.613292 5.317 5.645 2436.909 10000
POSNEG 11.673 12.845 13.383194 13.285 13.800 60.304 10000
I have a problem similar to the one explained in this question. Like that question, I have a data frame with 3 columns (id, group, value), and I want to take n samples with replacement from each group to produce a smaller data frame with n samples per group.
However, I am taking hundreds of subsamples in a simulation, and the ddply-based solution is too slow for my code. I tried to write a simple version myself to see if I could get better performance, but it is still slow (no better than the ddply solution, if not worse). Below is my code; I am wondering if it can be improved for performance.
#Producing example DataFrame
dfsize <- 10
groupsize <- 7
test.frame.1 <- data.frame(id = 1:dfsize, group = rep(1:groupsize,each = ceiling(dfsize/groupsize))[1:dfsize], junkdata = sample(1:10000, size =dfsize))
#Main function for subsampling
sample.from.group <- function(df, dfgroup, size, replace){
  outputsize <- 1
  newdf <- df # assuming a sample cannot be larger than the original
  uniquegroups <- unique(dfgroup)
  for (uniquegroup in uniquegroups){
    dataforgroup <- which(dfgroup == uniquegroup)
    mysubsample <- df[sample(dataforgroup, size, replace), ]
    sizeofsample <- nrow(mysubsample)
    newdf[outputsize:(outputsize + sizeofsample - 1), ] <- mysubsample
    outputsize <- outputsize + sizeofsample
  }
  return(newdf[1:(outputsize - 1), ])
}
#Using the function
sample.from.group(test.frame.1, test.frame.1$group, 100, replace = TRUE)
Here are two plyr-based solutions:
library(plyr)
dfsize <- 1e4
groupsize <- 7
testdf <- data.frame(
  id = seq_len(dfsize),
  group = rep(1:groupsize, length = dfsize),
  junkdata = sample(1:10000, size = dfsize))

sample_by_group_1 <- function(df, dfgroup, size, replace) {
  ddply(df, dfgroup, function(x) {
    x[sample(nrow(x), size = size, replace = replace), , drop = FALSE]
  })
}

sample_by_group_2 <- function(df, dfgroup, size, replace) {
  idx <- split_indices(df[[dfgroup]])
  subs <- lapply(idx, sample, size = size, replace = replace)
  df[unlist(subs, use.names = FALSE), , drop = FALSE]
}

library(microbenchmark)
microbenchmark(
  ddply = sample_by_group_1(testdf, "group", 100, replace = TRUE),
  plyr = sample_by_group_2(testdf, "group", 100, replace = TRUE)
)
# Unit: microseconds
# expr min lq median uq max neval
# ddply 4488 4723 5059 5360 36606 100
# plyr 443 487 507 536 31343 100
The second approach is much faster because it does the subsetting in a single step - if you can figure out how to do something in one step, that's usually an easy way to get better performance.
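The same single-step idea also works with base R only, in case you want to avoid the plyr dependency (a sketch, not part of the original answer):
sample_by_group_base <- function(df, dfgroup, size, replace) {
  # split the row indices by group, sample per group, then subset once
  idx <- split(seq_len(nrow(df)), df[[dfgroup]])
  subs <- lapply(idx, sample, size = size, replace = replace)
  df[unlist(subs, use.names = FALSE), , drop = FALSE]
}
sample_by_group_base(testdf, "group", 100, replace = TRUE)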
I think this is cleaner and possibly faster:
z <- sapply(unique(test.frame.1$group), FUN = function(x) {
  sample(which(test.frame.1$group == x), 100, TRUE)
})
out <- test.frame.1[z, ]
out