Usually, I find myself using a few summary functions or writing my own computations to get some additional initial information from the data. For example, I wanted to see the count and percentage per variable, given a limit of distinct values:
table_transposed <- function(vector){
  merge(as.data.frame(table(vector, dnn = "values")),
        as.data.frame(round(prop.table(table(vector, dnn = "values")), 2)),
        by = "values",
        all.x = TRUE) %>%
    data.table::transpose(keep.names = "values",
                          make.names = names(.)[1]) %T>%
    {.[, c("values")] <- c("Count", "Percentage")}
}
table_transposed_filter <- function(dataframe, max_number_categories) {
  (lapply(dataframe, function(x) NROW(unique(x))) <= max_number_categories) %>%
    as.vector() %>%
    {dataframe[, .]} %>%
    lapply(table_transposed)
}
So, you give it the data frame and the threshold of distinct values per variable:
table_transposed_filter(mtcars, 10)
However, it's SUPER slow (maybe because it uses merge() instead of left_join() from dplyr). Now I'm trying to figure out an efficient, fast, and simple way to combine psych::describe(), Hmisc::describe(), other packages, and my own functions, for numeric and categorical variables (one descriptive function for each). Something like this for the numeric variables:
| Variable | dtype | mean | mode | variance | skew | percentile 25 | ...
If I create this table mainly with sapply(), for example, is that better (more efficient, faster, simpler code) than actually learning to create an R package and developing it there?
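For illustration, here is a rough sketch of the kind of numeric summary I have in mind, built with sapply(); the function name and the chosen statistics are placeholders, not a final design:
describe_numeric <- function(df) {
  # keep only the numeric columns
  num <- df[sapply(df, is.numeric)]
  # one named vector of statistics per column, then variables as rows
  t(sapply(num, function(x) c(
    mean     = mean(x, na.rm = TRUE),
    variance = var(x, na.rm = TRUE),
    p25      = quantile(x, 0.25, na.rm = TRUE, names = FALSE),
    median   = median(x, na.rm = TRUE)
  )))
}

describe_numeric(mtcars)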
PS: I thought about posting this question on Meta Stack Exchange or Cross Validated, but neither seems to fit it.
Here's a somewhat faster version. It's about 2x faster on small data (like mtcars), but the difference narrows a little bit on larger data.
This makes sense, as the most expensive operation you do is table() -- your version calls it twice, my version calls it once. I didn't profile the code, but my guess is that table() is the bottleneck by more than one order of magnitude on any sizeable data, so it's a waste to try to optimize any other parts of the code.
t_transp = function(x, digits = 2) {
  tab = table(x)
  prop_tab = prop.table(tab)
  df = data.frame(values = c("Count", "Percentage"))
  df = cbind(df, rbind(tab, round(prop_tab, digits = digits)))
  row.names(df) = NULL
  df
}

t_transp_filter = function(data, n_max, ...) {
  lapply(Filter(function(x) NROW(unique(x)) <= n_max, data), t_transp, ...)
}
Benchmarking:
microbenchmark::microbenchmark(
  gregor = t_transp_filter(mtcars, n_max = 4),
  OP = table_transposed_filter(mtcars, 4),
  times = 20
)
# Unit: milliseconds
#    expr    min     lq     mean  median      uq    max neval cld
#  gregor 1.6483 1.7093 2.253425 1.74765 1.84680 7.5394    20   a
#      OP 5.6988 5.7627 6.316295 6.08545 6.57965 8.1048    20   b
set.seed(47)
df = as.data.frame(matrix(
  c(sample(letters[1:5], size = 1e5 * 20, replace = T)),
  ncol = 20))

microbenchmark::microbenchmark(
  gregor = t_transp_filter(df, n_max = 5),
  OP = table_transposed_filter(df, 5),
  times = 20
)
# Unit: milliseconds
#    expr      min        lq     mean    median       uq      max neval cld
#  gregor  59.5466  59.95545  63.6825  61.14075  67.2167  75.4270    20   a
#      OP 110.3265 117.35585 123.8782 118.91005 133.7795 149.0651    20   b
I have a problem where the idea is to find unique elements of an extremely large matrix, apply some function (in this example, it is a sine function, but it can be an arbitrary function, even a trained neural net) to each unique element, and then replace it back in the original matrix.
I have the following R code snippet:
nrows <- 28000
ncols <- 3000
x <- matrix(round(runif(nrows*ncols, 1, 5), 5), nrow=nrows, ncol=ncols)
u <- unique(as.vector(x))
uindex <- seq(from=1, to=length(u), by=1)
ut <- sin(u)
for (hh in uindex) {
  x[x == u[hh]] <- ut[hh]
}
In the above code snippet, the portion
for (hh in uindex) {
  x[x == u[hh]] <- ut[hh]
}
takes forever to finish for a matrix with dimensions on the order of 1e4 x 1e3. How can I optimize the for loop?
Please note that this is just a minimal working example for Stack Overflow. Hence, please refrain from telling me that I can just do x = sin(x); my point is the for loop.
To expand on my comment regarding avoiding the for loop entirely...
As OP knows, using loops in R is generally slow and if there's an alternative, it's likely to be faster. One answer that avoids the for loop has already been provided. Here is another, with benchmarking.
To start, turn the OP's code into a function.
withForLoop <- function(nrows=28000, ncols=3000) {
  x <- matrix(round(runif(nrows*ncols, 1, 5), 5), nrow=nrows, ncol=ncols)
  u <- unique(as.vector(x))
  uindex <- seq(from=1, to=length(u), by=1)
  ut <- sin(u)
  for (hh in uindex) {
    x[x == u[hh]] <- ut[hh]
  }
}
and benchmark it
library(microbenchmark)
microbenchmark(withForLoop, times=100)

Unit: nanoseconds
        expr min lq  mean median uq  max neval
 withForLoop  47 49 83.46     49 55 3226   100
So we are trying to beat a median time of 49 nanoseconds. (You could, of course, pick any other summary statistic as your target metric.)
Now rewrite the code using the tidyverse. OP's code starts with a matrix, converts it to a vector and then manipulates the vector. It's not clear if the matrix needs to be recovered. Assume it does, so provide the means to recover it, but - for consistency with OP's code - don't do the recovery.
library(tidyverse)

withTidyverse <- function(nrows=28000, ncols=3000) {
  x <- tibble() %>%
    expand(
      Row = 1:nrows,
      Col = 1:ncols
    ) %>%
    mutate(
      Random = round(runif(nrow(.), 1, 5), 5),
      Sin = sin(Random)
    )
}
microbenchmark(withTidyverse, times=100)

Unit: nanoseconds
           expr min lq  mean median uq max neval
 withTidyverse1  41 42 52.21   42.5 43 964   100
So that's reduced the median execution time to 42.5 nanoseconds on my machine. That's a saving of just over 13%.
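Since the Row and Col columns are kept, the matrix form can be recovered from the result if it is needed; a minimal sketch, outside the timed code and assuming x is the tibble built above:
# Recover the matrix layout from the Row/Col columns (not part of the timing)
m <- matrix(NA_real_, nrow = 28000, ncol = 3000)
m[cbind(x$Row, x$Col)] <- x$Sin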
Because sin is a relatively quick function I've not bothered to search for unique values and replace each unique value in a batch. I've just taken a blunderbuss approach and recalculated each value in the vector as it arises. Here's a way of taking the more sophisticated, "replace unique values in batches" approach:
withTidyverse2 <- function(nrows=28000, ncols=3000) {
  x <- tibble() %>%
    expand(
      Row = 1:nrows,
      Col = 1:ncols
    ) %>%
    mutate(
      Random = round(runif(nrow(.), 1, 5), 5)
    )
  y <- x %>%
    distinct(Random) %>%
    mutate(Sin = sin(Random))
  x <- x %>%
    left_join(y, by = "Random")
}
microbenchmark(withTidyverse2, times=100)

Unit: nanoseconds
           expr min lq  mean median uq  max neval
 withTidyverse2  44 45 82.31   45.5 51 2543   100
So, in this specific case, the overhead of extracting the unique values and updating in batches is not worthwhile, although it is still quicker than the for loop. OP will have to investigate their actual use case.
There are, of course, lots of other ways to address the actual problem. Which one is optimal is impossible to say based on the information provided.
Here is a data.table based solution:
library(data.table)

nrows <- 28000
ncols <- 3000

x <- round(runif(nrows*ncols, 1, 5), 5)
u <- unique(as.vector(x))
uindex = seq(from = 1, to = length(u), by = 1)

dt.x <- data.table(x)
dt.u <- data.table(u)
dt.u[, ut := sin(u)]

dt.res <- merge(dt.x, dt.u, by.x = "x", by.y = "u", all.x = TRUE)
ut <- dt.res[, ut]
output <- matrix(ut, nrow = nrows, ncol = ncols)
The main idea here is to work with vectors and data.tables; you can convert back to a matrix at the very end.
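One caveat worth adding (not in the original answer): merge() sorts the result by the join key, so the rows of dt.res are no longer in the order of x, and filling the matrix from them would scramble positions. An update join keeps dt.x in its original row order; a minimal sketch using the same objects:
# Update join: adds ut to dt.x by reference without changing its row order
dt.x[dt.u, on = .(x = u), ut := i.ut]
output <- matrix(dt.x$ut, nrow = nrows, ncol = ncols)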
You could use match:
nrows <- 28000
ncols <- 3000
x <- matrix(round(runif(nrows*ncols, 1, 5), 5), nrow=nrows, ncol=ncols)
u <- unique(as.vector(x))
ut <- sin(u)
i <- match(x, u)
ut[i]
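ut[i] is a plain vector in the same element order as x, so if the matrix shape is needed it can be restored in one step (sin_x is just an illustrative name):
# Restore the original dimensions of x
sin_x <- matrix(ut[i], nrow = nrows, ncol = ncols)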
Is using comment() from base R to assign information to an R object slowing the code down?
That is, should it be used with care?
Context: I have a function that creates several tibbles/dataframes that are saved in a list, and I'm thinking of saving a comment to each dataframe (or just one comment for the entire list).
From the comment() documentation, it seems the function is just an interface to get/set a comment attribute on any R object. I can't see it becoming a burden in the vast majority of real-world use cases.
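For example, a quick illustration of what comment() does (just a toy data frame):
# comment() stores and retrieves a character attribute on any R object
df <- data.frame(x = 1:3)
comment(df) <- "toy data"
comment(df)             # "toy data"
attributes(df)$comment  # same value, stored as the "comment" attribute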
To get an idea of how the function behaves under load, I've written a simple function that generates n dataframes (2000 rows, 3 columns each) and annotates them on request. Results are appended to a list:
df_and_comment <- function(n, add_comment = FALSE) {
  res_list <- list()
  for (i in seq(1:n)) {
    x <- data.frame(
      x = rnorm(2000),
      y = rnorm(2000),
      z = rnorm(2000)
    )
    if (add_comment) {
      comment(x) <- sprintf("this is df no: %d", i)
    }
    res_list[[i]] <- x
  }
  res_list
}
Normal load - creating 50 dataframes
library(microbenchmark)
microbenchmark(
  df_and_comment(n = 50),
  df_and_comment(n = 50, add_comment = TRUE),
  times = 10
)
Unit: milliseconds
                                       expr      min       lq     mean   median       uq      max neval
                     df_and_comment(n = 50) 25.34398 25.51473 26.70731 25.74472 25.97483 33.81251    10
 df_and_comment(n = 50, add_comment = TRUE) 26.32009 26.39826 27.49835 26.60218 27.80038 32.47273    10
Heavy load - creating 15,000 dataframes
microbenchmark(
  df_and_comment(n = 15000),
  df_and_comment(n = 15000, add_comment = TRUE),
  times = 10
)
Unit: seconds
                                          expr      min       lq     mean   median       uq      max neval
                     df_and_comment(n = 15000) 8.218535 8.254919 8.324075 8.317126 8.354637 8.469191    10
 df_and_comment(n = 15000, add_comment = TRUE) 8.414405 8.561279 8.687380 8.571137 8.685309 9.591972    10
In both cases, the performance differences are completely negligible. I wouldn't worry about the performance implications of annotating dataframes/regression results iteratively.
I am trying to learn how to use parallel processing in R. A snapshot of the data and the code is provided below.
Creating a rough dataset
library(truncnorm)

# Creating a mock dataframe
Market = c('City1','City2','City3','City4','City5','City2','City4','City1','City3','City5')
Car_type = c('A','A','A','A','A','B','B','B','B','B')
Variable1 = c(.34,.19,.85,.27,.32,.43,.22,.56,.17,.11)
Car_purchased = c(1,0,0,1,0,1,0,0,1,1)
Market_data = data.frame(Market, Car_type, Variable1, Car_purchased)

# Create a bigger dataset
Market_data2 = do.call("rbind", replicate(100, Market_data, simplify = FALSE))

Market_data2$Final_value = 0 # create a column for future calculation
empty_list = list()
Writing a function and running the function
Car_Value = function(data){
  market_list = unique(Market_data2$Market)
  for (m in market_list) {
    market_subset = Market_data2[which(Market_data2$Market == m), ]
    for (i in 1:nrow(market_subset)) {
      if (market_subset[i, 'Car_purchased'] == 1) {
        market_subset[i, 'Final_value'] = rtruncnorm(1, a = -10, b = 0, mean = max(market_subset$Variable1), sd = 1)
      } else {
        market_subset[i, 'Final_value'] = rtruncnorm(1, a = -10, b = 0, mean = market_subset[i, 'Variable1'], sd = 1)
      }
    }
    empty_list = rbind(empty_list, market_subset)
  }
  return(empty_list)
}
get_value = Car_Value(data=Market_data2)
In the above example, there are a total of 5 markets ("Market") for cars and 2 car types ("Car_type"). Consumers may have bought cars in any market. I have to calculate a value ("Final_value") from a given truncated normal distribution. This value depends only on Variable1 of the given market, which is why I use the outer for loop. The mean of the truncated normal distribution depends on Variable1 (max(Variable1) in a market if Car_purchased == 1, or the row's own value if Car_purchased == 0). This version of the code runs perfectly fine (although it is not optimized for speed).
Problem
Next, I would like to use parallel processing for the outer for loop, i.e. the loop across markets, since the Final_value of a market depends only on the observations within that market.
Unfortunately, I only know how to implement parallel processing for each row of the dataset. For example, my code (provided below) assigns the 1st row to the 1st core, the 2nd row to the 2nd core, and so on. This is inefficient and takes a long time, since each row has to create the subset and then find the max of that subset.
My inefficient version
library(parallel)
library(foreach)
library(doParallel)
library(iterators)
library(utils)
library(truncnorm)

cl = parallel::makeCluster(4, type = "PSOCK")
registerDoParallel(cl)
clusterEvalQ(cl, {library(truncnorm)})

Car_Value_Parallel <- function(market_data){
  output <- foreach(x = iter(market_data, by = "row"), .combine = rbind) %dopar% {
    market_subset = market_data[which(market_data$Market == x$Market), ]
    if (x['Car_purchased'] == 1) {
      x['Final_value'] = rtruncnorm(1, a = -10, b = 0, mean = max(market_subset$Variable1), sd = 1)
    } else {
      x['Final_value'] = rtruncnorm(1, a = -10, b = 0, mean = x['Variable1'], sd = 1)
    }
    return(x)
  }
  output
}

get_value_parallel = Car_Value_Parallel(market_data = Market_data2)
stopCluster(cl)
This is highly inefficient on a dataset of size > 100K (my actual dataset is about 1.2 million rows). However, I could not implement the parallelization at the market level, where the parallel computation would be: run the computation for City1 on the 1st core, City2 on the 2nd core, and so on. Can someone please help? Any help is appreciated. Thanks.
P.S. My apologies for the long question. I just wanted to show all versions of the code that I have used.
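For reference, a minimal sketch of the market-level split described above; this is a hypothetical rewrite using the same foreach/doParallel setup and the question's column names, not code from the original post:
# Hypothetical: one market per task instead of one row per task
Car_Value_ByMarket <- function(market_data) {
  foreach(m = unique(market_data$Market), .combine = rbind,
          .packages = "truncnorm") %dopar% {
    market_subset <- market_data[market_data$Market == m, ]
    mx <- max(market_subset$Variable1)
    # draw one value per row, choosing the mean by Car_purchased
    market_subset$Final_value <- ifelse(
      market_subset$Car_purchased == 1,
      rtruncnorm(nrow(market_subset), a = -10, b = 0, mean = mx, sd = 1),
      rtruncnorm(nrow(market_subset), a = -10, b = 0, mean = market_subset$Variable1, sd = 1)
    )
    market_subset
  }
}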
I see no reason to pursue parallel processing with your data set. Instead, look into packages like dplyr or data.table for a more efficient solution.
From my understanding of your problem, for each Market you want to apply rtruncnorm to create the variable Final_value, where the mean argument of rtruncnorm depends on the variable Car_purchased.
We can accomplish this without the need for a for loop, using dplyr.
library(truncnorm)
library(dplyr)

# Creating a mock dataframe
Market <- c("City1", "City2", "City3", "City4", "City5", "City2", "City4", "City1", "City3", "City5")
Car_type <- c("A", "A", "A", "A", "A", "B", "B", "B", "B", "B")
Variable1 <- c(.34, .19, .85, .27, .32, .43, .22, .56, .17, .11)
Car_purchased <- c(1, 0, 0, 1, 0, 1, 0, 0, 1, 1)
Market_data <- data.frame(Market, Car_type, Variable1, Car_purchased)

# Create a bigger dataset
Market_data2 <- replicate(100, Market_data, simplify = FALSE) %>% bind_rows()

Market_data2$Final_value <- 0 # create a column for future calculation
empty_list <- list()
Car_Value2 <- function(data) {
  data %>%
    group_by(Market) %>%
    mutate(
      Final_value = if_else(
        Car_purchased == 1,
        rtruncnorm(1, a = -10, b = 0, mean = max(Variable1), sd = 1),
        rtruncnorm(1, a = -10, b = 0, mean = Variable1, sd = 1)
      )
    )
}
microbenchmark::microbenchmark(
  Car_Value(Market_data2),
  Car_Value2(Market_data2),
  times = 100
)
#> Unit: milliseconds
#>                      expr       min        lq      mean   median        uq       max neval cld
#>   Car_Value(Market_data2) 66.109304 68.043575 69.030763 68.56569 69.681255 89.497035   100   b
#>  Car_Value2(Market_data2)  1.073318  1.101578  1.204737  1.17583  1.230687  3.465425   100   a
# Even bigger dataframe
Market_data3 <- replicate(120000, Market_data, simplify = FALSE) %>% bind_rows()
microbenchmark::microbenchmark(
  Car_Value2(data = Market_data3),
  times = 100
)
#> Unit: milliseconds
#>                             expr      min       lq     mean   median       uq      max neval
#>  Car_Value2(data = Market_data3) 338.4615 341.7134 375.8769 397.7133 399.8733 412.5134   100
Created on 2019-03-10 by the reprex package (v0.2.1)
Like most people, I'm impressed by Hadley Wickham and what he's done for R -- so I figured I'd move some functions toward his tidyverse... Having done so, I'm left wondering what the point of it all is.
My new dplyr functions are much slower than their base equivalents -- I hope I'm doing something wrong. I'd particularly like some payoff for the effort required to understand non-standard evaluation.
So, what am I doing wrong? Why is dplyr so slow?
An example:
require(microbenchmark)
require(dplyr)

df <- tibble(
  a = 1:10,
  b = c(1:5, 4:0),
  c = 10:1)

addSpread_base <- function() {
  df[['spread']] <- df[['a']] - df[['b']]
  df
}
addSpread_dplyr <- function() df %>% mutate(spread := a - b)

all.equal(addSpread_base(), addSpread_dplyr())
microbenchmark(addSpread_base(), addSpread_dplyr(), times = 1e4)
Timing results:
Unit: microseconds
              expr     min      lq      mean median      uq       max neval
  addSpread_base()  12.058  15.769  22.07805  24.58  26.435  2003.481 10000
 addSpread_dplyr() 607.537 624.697 666.08964 631.19 636.291 41143.691 10000
So using dplyr functions to transform the data takes about 30x longer -- surely this isn't the intention?
I figured that perhaps this is too easy a case -- and that dplyr would really shine if we had a more realistic case where we are adding a column and sub-setting the data -- but this was worse. As you can see from the timings below, this is ~70x slower than the base approach.
# mutate and substitute
addSpreadSub_base <- function(df, col1, col2) {
  df[['spread']] <- df[['a']] - df[['b']]
  df[, c(col1, col2, 'spread')]
}

addSpreadSub_dplyr <- function(df, col1, col2) {
  var1 <- as.name(col1)
  var2 <- as.name(col2)
  qq <- quo(!!var1 - !!var2)
  df %>%
    mutate(spread := !!qq) %>%
    select(!!var1, !!var2, spread)
}

all.equal(addSpreadSub_base(df, col1 = 'a', col2 = 'b'),
          addSpreadSub_dplyr(df, col1 = 'a', col2 = 'b'))

microbenchmark(addSpreadSub_base(df, col1 = 'a', col2 = 'b'),
               addSpreadSub_dplyr(df, col1 = 'a', col2 = 'b'),
               times = 1e4)
Results:
Unit: microseconds
                                           expr      min       lq      mean   median       uq      max neval
  addSpreadSub_base(df, col1 = "a", col2 = "b")   22.725   30.610   44.3874   45.450   53.798  2024.35 10000
 addSpreadSub_dplyr(df, col1 = "a", col2 = "b") 2748.757 2837.337 3011.1982 2859.598 2904.583 44207.81 10000
These are microseconds, and your dataset has 10 rows. Unless you plan on looping over millions of datasets of 10 rows, your benchmark is pretty much irrelevant (and in that case I can't imagine a situation where it wouldn't be wise to bind them together as a first step).
Let's do it with a bigger dataset, like 1 million times bigger:
df <- tibble(
  a = 1:10,
  b = c(1:5, 4:0),
  c = 10:1)

df2 <- bind_rows(replicate(1000000, df, F))

addSpread_base <- function(df) {
  df[['spread']] <- df[['a']] - df[['b']]
  df
}
addSpread_dplyr <- function(df) df %>% mutate(spread = a - b)
microbenchmark::microbenchmark(
  addSpread_base(df2),
  addSpread_dplyr(df2),
  times = 100)
# Unit: milliseconds
#                  expr      min       lq     mean   median       uq      max neval cld
#   addSpread_base(df2) 25.85584 26.93562 37.77010 32.33633 35.67604 170.6507   100   a
#  addSpread_dplyr(df2) 26.91690 27.57090 38.98758 33.39769 39.79501 182.2847   100   a
Still quite fast and not much difference.
As for the "whys" of the result that you got, it's because you're using a much more complex function, so it has overheads.
Commenters have pointed that dplyr doesn't try too hard to be fast and maybe it's true when you compare to data.table, and interface is the first concern, but the authors have been working hard on speed as well. Hybrid evaluation for example allows (if I got it right) to execute C code directly on grouped data when aggregating with common functions, which can be much faster than base code, but simple code will always run faster with simple functions.
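To make that concrete, the kind of call hybrid evaluation targets is a grouped aggregation built from simple, common functions (illustrative only; actual timings depend on the dplyr version and the data):
# Grouped aggregation with common functions (mean, n) - the case where
# dplyr's internal optimizations can kick in; df2 is the big tibble above
df2 %>%
  group_by(b) %>%
  summarise(mean_a = mean(a), n = n())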
I'm trying to divide each row of a dataframe by a number stored in a second mapping dataframe.
for (g in rownames(data_table)) {
  print(g)
  data_table[g, ] <- data_table[g, ] / mapping[g, 2]
}
However, this is incredibly slow, each row takes almost 1-2 seconds to run. I know iteration is usually not the best way to do things in R, but I don't know how else to do it. Is there any way I can speed up the runtime?
Try this:
sweep(data_table, 1, mapping[[2]], "/")
In terms of speed, here is a benchmark of the possibilities using the iris dataset (numeric columns) and including your version:
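The benchmark below assumes test holds the numeric columns of iris; the original post doesn't show this setup line, so it is added here as an assumption:
# Assumed setup: the numeric columns of iris, as referenced by the benchmark
test <- iris[, 1:4]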
microbenchmark::microbenchmark(
  A = {
    for (g in rownames(test)) {
      # print(g)
      test[g, ] <- test[g, ] / test[g, 2]
    }
  },
  B = sweep(test, 1, test[[2]], "/"),
  C = test / test[[2]],
  times = 100
)
# Unit: microseconds
# expr       min        lq        mean    median         uq        max neval
#    A 82374.693 83722.023 101688.1254 84582.052 147280.057 157507.892   100
#    B   453.652   484.393    514.4094   513.850    539.480    623.688   100
#    C   404.506   423.794    456.0063   446.101    470.675    729.205   100
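The fastest option, C, relies on a base R behaviour: dividing a data frame by a vector whose length equals nrow() recycles the vector down each column, so every row is divided elementwise by its corresponding value, which is what sweep() with MARGIN = 1 does. A quick check, using the same test object as above:
# Both divide row i by test[[2]][i]; the results should match
all.equal(test / test[[2]], sweep(test, 1, test[[2]], "/"))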
You can vectorize this operation if the two data frames have the same number of rows:
dt <- data.frame(a = rnorm(100), b = rnorm(100))
mapping <- data.frame(x = rnorm(100), y = rnorm(100))
dt / mapping[,2]