Meaning of apply = TRUE/FALSE in collapse::settransformv - r

I am struggling to understand what apply = TRUE/FALSE means in collapse::settransformv. Could somebody give me a hand with this? Below is an example where the code fails when I specify apply = TRUE.
library(collapse)
library(data.table)
lagamount <- 1
testdf_1 <- data.table(group   = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2),
                       counter = as.integer(c(1, 2, 3, 5, 6, 7, 8, 9, 11, 12)),
                       xval    = seq(100, 1000, 100))
testdf_2 <- copy(testdf_1)
settransformv(testdf_1, "xval", flag, 1:3, group, counter, apply = FALSE)
settransformv(testdf_2, "xval", flag, 1:3, group, counter, apply = TRUE)

So the apply argument is there mainly because all .FAST_FUN in the package have a data.frame method, which is more efficient than applying the function to multiple columns using lapply, especially if some grouping/indexing is done in the call. This example should clarify it.
library(collapse)
library(magrittr)
library(microbenchmark)
# These two are equivalent
settransformv(wlddev, PCGDP:POP, flag, 1, iso3c, year)
fselect(wlddev, PCGDP:POP) %<>% lapply(flag, 1, wlddev$iso3c, wlddev$year)
# -> so here we redo the indexing for each column
# These two are also equivalent
settransformv(wlddev, PCGDP:POP, flag, 1, iso3c, year, apply = FALSE)
fselect(wlddev, PCGDP:POP) %<>% flag(1, wlddev$iso3c, wlddev$year)
# -> we only index once, so this is efficient
# To prove the point:
microbenchmark(A = settransformv(wlddev, PCGDP:POP, flag, 1, iso3c, year),
               B = settransformv(wlddev, PCGDP:POP, flag, 1, iso3c, year, apply = FALSE))
#> Unit: microseconds
#> expr min lq mean median uq max neval
#> A 485.276 578.5715 1354.6314 610.572 653.7245 31025.151 100
#> B 211.232 243.4785 264.8313 262.031 284.4375 320.497 100
Created on 2022-09-08 by the reprex package (v2.0.1)
Note that for a single column there are no benefits to setting apply = FALSE, unless you have a function that only operates on data frames. For the single-column case you also don't need the (v) version of the function; you can simply do
settransform(wlddev, PCGDP_lag = flag(PCGDP, 1, iso3c, year))
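To make the role of apply concrete, here is a small sketch of my own (not part of the original answer; show_input is a hypothetical helper): with apply = TRUE (the default) the function is applied to each selected column vector in turn, while with apply = FALSE it receives the selected columns as a single data frame.
show_input <- function(x) { print(class(x)); x }  # reports what it is given, returns it unchanged
settransformv(wlddev, PCGDP:POP, show_input)                # prints "numeric" once per selected column
settransformv(wlddev, PCGDP:POP, show_input, apply = FALSE) # prints "data.frame" once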

Related

Is it possible to do your own efficient descriptive statistics function? - R

Usually, I find myself using a few summary functions or making my own computations to get some additional initial information from the data. For example, I wanted to see the count and percentage per variable given a limit of distinct values:
library(magrittr)  # for %>% and %T>%

table_transposed <- function(vector){
  merge(as.data.frame(table(vector, dnn = "values")),
        as.data.frame(round(prop.table(table(vector, dnn = "values")), 2)),
        by = "values",
        all.x = TRUE) %>%
    data.table::transpose(keep.names = "values",
                          make.names = names(.)[1]) %T>%
    {.[, c("values")] <- c("Count", "Percentage")}
}
table_transposed_filter <- function(dataframe, max_number_categories) {
  (lapply(dataframe, function(x) NROW(unique(x))) <= max_number_categories) %>%
    as.vector() %>%
    {dataframe[, .]} %>%
    lapply(table_transposed)
}
So, you give the dataframe and the threshold of distinct values per variable.
table_transposed_filter(mtcars, 10)
However, it's SUPER slow (maybe because of using merge() instead of left_join() from dplyr). Now, I'm trying to figure out an efficient, fast, and simple way to do a combination of psych::describe(), Hmisc::describe(), others, and my own, for numeric and categorical variables (one descriptive function for each). Something like (for numerical):
| Variable | dtype | mean | mode | variance | skew | percentile 25 | ...
If I create this table mainly with sapply(), for example, is it better (more efficient, faster, simpler code) than actually learning to create an R package and developing it there?
PS: I thought about putting this question on Meta Stack Exchange or Cross Validated, but neither seems to fit it.
Here's a somewhat faster version. It's about 2x faster on small data (like mtcars), but the difference narrows a little bit on larger data.
This makes sense as the most expensive operation you do is table - your version does it twice, my version does it once. I didn't profile the code, but my guess is table is the bottleneck by more than one order of magnitude on any sizeable data, so it's a waste to try to optimize any other parts of the code.
t_transp = function(x, digits = 2) {
  tab = table(x)
  prop_tab = prop.table(tab)
  df = data.frame(values = c("Count", "Percentage"))
  df = cbind(df, rbind(tab, round(prop_tab, digits = digits)))
  row.names(df) = NULL
  df
}

t_transp_filter = function(data, n_max, ...) {
  lapply(Filter(function(x) NROW(unique(x)) <= n_max, data), t_transp, ...)
}
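Before benchmarking, a quick usage sketch (mine, not from the original answer) to show the shape of the result:
res <- t_transp_filter(mtcars, n_max = 4)
names(res)  # the columns of mtcars with at most 4 distinct values
res$cyl     # a 2-row data frame: a Count row and a Percentage row, one column per level of cyl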
Benchmarking:
microbenchmark::microbenchmark(
  gregor = t_transp_filter(mtcars, n_max = 4),
  OP = table_transposed_filter(mtcars, 4),
  times = 20
)
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# gregor 1.6483 1.7093 2.253425 1.74765 1.84680 7.5394 20 a
# OP 5.6988 5.7627 6.316295 6.08545 6.57965 8.1048 20 b
set.seed(47)
df = as.data.frame(matrix(
  c(sample(letters[1:5], size = 1e5 * 20, replace = T)),
  ncol = 20))

microbenchmark::microbenchmark(
  gregor = t_transp_filter(df, n_max = 5),
  OP = table_transposed_filter(df, 5),
  times = 20
)
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# gregor 59.5466 59.95545 63.6825 61.14075 67.2167 75.4270 20 a
# OP 110.3265 117.35585 123.8782 118.91005 133.7795 149.0651 20 b

Parallelize for loop in R

I am trying to learn how to use parallel processing in R. A snapshot of the data and the code is provided below.
Creating a rough dataset
library(truncnorm)
#Creating a mock dataframe
Market =c('City1','City2','City3','City4','City5','City2','City4','City1','City3','City5')
Car_type = c('A','A','A','A','A','B','B','B','B','B')
Variable1=c(.34,.19,.85,.27,.32,.43,.22,.56,.17,.11)
Car_purchased = c(1,0,0,1,0,1,0,0,1,1)
Market_data = data.frame(Market,Car_type,Variable1,Car_purchased)
Market_data2=do.call("rbind", replicate(100, Market_data, simplify = FALSE))
#Create a bigger dataset
Market_data2$Final_value = 0 #create a column for future calculation
empty_list = list()
Writing a function and running the function
Car_Value = function(data){
  market_list = unique(Market_data2$Market)
  for (m in market_list){
    market_subset = Market_data2[which(Market_data2$Market == m), ]
    for (i in 1:nrow(market_subset)){
      if (market_subset[i, 'Car_purchased'] == 1){
        market_subset[i, 'Final_value'] = rtruncnorm(1, a = -10, b = 0, mean = max(market_subset$Variable1), sd = 1)
      } else {
        market_subset[i, 'Final_value'] = rtruncnorm(1, a = -10, b = 0, mean = market_subset[i, 'Variable1'], sd = 1)
      }
    }
    empty_list = rbind(empty_list, market_subset)
  }
  return(empty_list)
}
get_value = Car_Value(data = Market_data2)
In the above example, there are a total of 5 "Market" values for cars and 2 "Car_type" values. Consumers may have bought the cars in any of the markets. I have to calculate a value ("Final_value") from a given truncated normal distribution. This value depends only on the value of Variable1 in the given market. That is why I use the outer for loop. The mean of the truncated normal distribution depends on the value of Variable1 (max(Variable1) in a market if Car_purchased==1, or the given value if Car_purchased==0). This version of the code runs perfectly fine (although it is not optimized for speed).
Problem
Next, what I would like to do is use parallel processing for the outer for loop, i.e. the loop across markets, since the Final_value of a market depends only on the observations within that market.
Unfortunately, I only know how to implement parallel processing for each line of the dataset. For example, my code (provided below) assigns the 1st row to the 1st core, the 2nd row to the 2nd core, and so on. This is inefficient and takes a long time since each row has to create the subset and then find the max of the subset.
My inefficient version
library(parallel)
library(foreach)
library(doParallel)
library(iterators)
library(utils)
library(truncnorm)
cl=parallel::makeCluster(4,type="PSOCK")
registerDoParallel(cl)
clusterEvalQ(cl, {library(truncnorm)})
Car_Value_Parallel <- function(market_data){
  output <- foreach(x = iter(market_data, by = "row"), .combine = rbind) %dopar% {
    market_subset = market_data[which(market_data$Market == x$Market), ]
    if (x['Car_purchased'] == 1){
      x['Final_value'] = rtruncnorm(1, a = -10, b = 0, mean = max(market_subset$Variable1), sd = 1)
    } else {
      x['Final_value'] = rtruncnorm(1, a = -10, b = 0, mean = x['Variable1'], sd = 1)
    }
    return(x)
  }
  output
}
get_value_parallel = Car_Value_Parallel(market_data = Market_data2)
stopCluster(cl)
This is highly inefficient if I run it on a dataset of size > 100K (my actual dataset is about 1.2 million rows). However, I could not implement the parallelization at the market level, where the parallel computation would be as follows: run the computation for City1 on the 1st core, City2 on the 2nd core, and so on. Can someone please help? Any help is appreciated. Thanks.
P.S. My apologies for the long question. I just wanted to show all versions of the code that I have used.
I see no reason to pursue parallel processing with your data set. Instead, look into packages like dplyr or data.table for a more efficient solution.
From my understanding of your problem, for each Market you want to apply rtruncnorm to create the variable Final_value, where the mean argument of rtruncnorm depends on the variable Car_purchased.
We can accomplish this without the need for a for loop, using dplyr.
library(truncnorm)
library(dplyr)

# Creating a mock dataframe
Market <- c("City1", "City2", "City3", "City4", "City5", "City2", "City4", "City1", "City3", "City5")
Car_type <- c("A", "A", "A", "A", "A", "B", "B", "B", "B", "B")
Variable1 <- c(.34, .19, .85, .27, .32, .43, .22, .56, .17, .11)
Car_purchased <- c(1, 0, 0, 1, 0, 1, 0, 0, 1, 1)
Market_data <- data.frame(Market, Car_type, Variable1, Car_purchased)
Market_data2 <- replicate(100, Market_data, simplify = FALSE) %>% bind_rows()

# Create a bigger dataset
Market_data2$Final_value = 0 # create a column for future calculation
empty_list = list()
Car_Value2 <- function(data) {
  data %>%
    group_by(Market) %>%
    mutate(
      Final_value = if_else(
        Car_purchased == 1,
        rtruncnorm(1, a = -10, b = 0, mean = max(Variable1), sd = 1),
        rtruncnorm(1, a = -10, b = 0, mean = Variable1, sd = 1)
      )
    )
}
microbenchmark::microbenchmark(
  Car_Value(Market_data2),
  Car_Value2(Market_data2),
  times = 100
)
#> Unit: milliseconds
#>                      expr       min        lq      mean   median        uq       max neval cld
#>   Car_Value(Market_data2) 66.109304 68.043575 69.030763 68.56569 69.681255 89.497035   100   b
#>  Car_Value2(Market_data2)  1.073318  1.101578  1.204737  1.17583  1.230687  3.465425   100  a
# Even bigger dataframe
Market_data3 <- replicate(120000, Market_data, simplify = FALSE) %>% bind_rows()
microbenchmark::microbenchmark(
  Car_Value2(data = Market_data3),
  times = 100
)
#> Unit: milliseconds
#>                             expr      min       lq     mean   median       uq      max neval
#>  Car_Value2(data = Market_data3) 338.4615 341.7134 375.8769 397.7133 399.8733 412.5134   100
Created on 2019-03-10 by the reprex package (v0.2.1)
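Since the answer also points to data.table, here is a rough data.table equivalent of the same grouped logic (a sketch of mine, not benchmarked; like the original loop it draws one value per row):
library(data.table)
library(truncnorm)
Car_Value_dt <- function(data) {
  dt <- as.data.table(data)
  dt[, Final_value := fifelse(
        Car_purchased == 1,
        rtruncnorm(.N, a = -10, b = 0, mean = max(Variable1), sd = 1),
        rtruncnorm(.N, a = -10, b = 0, mean = Variable1, sd = 1)),
     by = Market]
  dt[]
}
get_value_dt <- Car_Value_dt(Market_data2)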

roll applying multiple quantiles in data table to multiple columns

Background:
I can get multiple moments from my data using data.table (see appended), but it is taking a very long time. I was thinking that, since getting a particular percentile already requires sorting the data, finding several percentiles from the same sort should be more efficient.
A once-through statistic like median is taking 1.79 ms while the non-median quantile is taking 68x longer at 122.8 ms. There has to be a way to reduce the compute time.
Questions:
Is there a way to call several quantiles from the same data in a more efficient manner?
Can I pull the "lapply" out of the data.table and compose it like I do the name lists?
My example code with tiny synthetic data:
#libraries
library(data.table)     #data.table
library(zoo)            #rollapply
library(microbenchmark) #timings below

#reproducibility
set.seed(45L)

#make data
DT <- data.table(V1 = c(1L, 2L),
                 V2 = LETTERS[1:3],
                 V3 = round(rnorm(300), 4),
                 V4 = round(runif(150), 4),
                 V5 = 1:1200)
DT

#get names
my_col_list <- names(DT)[c(3, 4)]

#make new variable names
my_name_list1 <- paste0(my_col_list, "_", "33rd_pctile")
my_name_list2 <- paste0(my_col_list, "_", "77th_pctile")
#compute values
for (i in 1:length(my_col_list)){
  #first
  DT[, (my_name_list1[i]) := unlist(lapply(.SD,
                                           function(x) rollapply(x,
                                                                 7,
                                                                 quantile,
                                                                 fill = NA,
                                                                 probs = 1/3)),
                                    recursive = F),
     .SDcols = my_col_list[i]]
  #second
  DT[, (my_name_list2[i]) := unlist(lapply(.SD,
                                           function(x) rollapply(x,
                                                                 7,
                                                                 quantile,
                                                                 fill = NA,
                                                                 probs = 7/9)),
                                    recursive = F),
     .SDcols = my_col_list[i]]
}

#display it
head(DT, 10)
Microbenchmarking a once-through statistic vs. the many-through statistics says that the quantiles are expensive.
res2 <- microbenchmark(DT[, (my_name_list1[i]) := unlist(lapply(.SD,
                                                                function(x) rollapply(x,
                                                                                      7,
                                                                                      mean,
                                                                                      fill = NA)),
                                                         recursive = F),
                          .SDcols = my_col_list[i]],
                       times = 5)
This says it takes about 1.75 milliseconds for a mean (a median takes about 1.79 ms):
> res2
Unit: milliseconds
expr
DT[, `:=`((my_name_list1[i]), unlist(lapply(.SD, function(x) rollapply(x, 7, mean, fill = NA)), recursive = F)), .SDcols = my_col_list[i]]
min lq mean median uq max neval
1.465779 1.509114 1.754145 1.618591 1.712103 2.46514 5
but it takes roughly 70x that to compute a quantile
res3 <- microbenchmark(DT[, (my_name_list1[i]) := unlist(lapply(.SD,
                                                                function(x) rollapply(x,
                                                                                      7,
                                                                                      quantile,
                                                                                      fill = NA,
                                                                                      probs = 1/3)),
                                                         recursive = F),
                          .SDcols = my_col_list[i]],
                       times = 5)
res3
and
> res3
Unit: milliseconds
expr
DT[, `:=`((my_name_list1[i]), unlist(lapply(.SD, function(x) rollapply(x, 7, quantile, fill = NA, probs = 1/3)), recursive = F)), .SDcols = my_col_list[i]]
min lq mean median uq max neval
118.5833 119.2896 122.8432 124.0168 124.4183 127.9082 5
UPDATES:

- The median from "quantile" takes ~128 ms while "median" takes much less. They are not the same thing.
- Iterating through the 9 options for "type" of "quantile" gives mean times between 129 ms and 157 ms. There is no "easy win" here.
- The package "WGCNA" requires "GO.db" from Bioconductor, which is not installed with the "install.packages" command. It also requires the package "impute", which is not installed with "WGCNA" or "GO.db". Also "preprocessCore".
- Using the (eventually working) WGCNA package reduced the mean time for the rolling quantile to 68.1 ms. That is about half the time, but it is not about 1/70th the time.
- Using "RollingMedian" from the "RollingWindow" package I get 169.8 microseconds (aka 0.1698 milliseconds), which is a LOT faster, but it is not an arbitrary quantile.
- Using "perccal" seems to drop computing a quantile down to ~145 microseconds. In the rollapply this drops compute time down to 15.3 milliseconds, which is an 8.5x boost. I'm not sure how much more blood there is to squeeze out of this stone.
Thoughts:

- The "perccal" approach seems to only be using a single core. There may be some "parallel" process that allows me to split the summary for different variables across different cores. This might give some speedup.
- As we add more terms to the data, the speedup starts reducing. Increasing to 9600 rows reduces the speedup from ~8.5x to below 1x. This suggests that the rollapply function may also have some issues.
Data table optimization
You are correct that the median is especially fast in this case. This is because it runs specialized C code, unlike the quantile function, which is pure R code.
We can read about this optimization in the data.table documentation under
?datatable.optimize
There we have:
When expressions in j which contains only these functions min, max,
mean, median, var, sd, prod, for example, dt[, list(mean(x),
median(x), min(y), max(y)), by=z], they are very effectively optimised
using, what we call, GForce. These functions are replaced with gmean,
gmedian, gmin, gmax instead
And they give an example for the speed improvement for the median case:
# Generate a big data.table with a relatively many columns
set.seed(1L)
dt = lapply(1:20, function(x) sample(c(-100:100), 5e6L, TRUE))
setDT(dt)[, id := sample(1e5, 5e6, TRUE)]
print(object.size(dt), units="Mb") # 400MB, not huge, but will do
# GForce
options(datatable.optimize = 2L) # optimisation 'on'
system.time(ans1 <- dt[, lapply(.SD, median), by=id])
system.time(ans2 <- dt[, lapply(.SD, function(x) as.numeric(stats::median(x))), by=id])
identical(ans1, ans2)
On my system the R internal version is about 44x slower than the data.table version.
Speeding up quantile
We can still try to improve the speed of the quantile function in R. For this, my approach is basically "Use the source, Luke": look at the quantile function itself. Looking at the source, we see a standard generic function:
>> quantile
function (x, ...)
UseMethod("quantile")
<bytecode: 0x0000000009154c78>
<environment: namespace:stats>
We can trace this a bit more:
>> methods(quantile)
[1] quantile.default* quantile.ecdf* quantile.POSIXt* quantile.zoo
see '?methods' for accessing help and source code
and have a look at the default function.
>> stats:::quantile.default
function (x, probs = seq(0, 1, 0.25), na.rm = FALSE, names = TRUE,
type = 7, ...)
{
...
}
Now we have the entire source, which is quite long. We can compare it to the R median source in median.default. With the source, we can copy it as a user-defined function and profile it (with the small addition of supplying the namespace for format_perc). From this it is clear that only two parts are relevant, namely the sorting and the output formatting. The sorting is very similar to the median function and likely hard to improve. The formatting, however, can be skipped entirely by commenting it out.
fast.quant <- function (x, probs = seq(0, 1, 0.25), na.rm = FALSE, names = TRUE,
                        type = 7, ...)
{
  if (is.factor(x)) {
    ...
  ...
  if (names && np > 0L) {
    #names(qs) <- stats:::format_perc(probs)
  }
  ...
}
All in all, this fix cuts the run time in half. It is still not the optimized median, but most likely it is hard to get better performance without leaving R.
It is possible, and maybe even probable, that the optimizations in data.table can be leveraged to help with the quantile calculations as well, since data.table implements sorting in C too. One would, however, still like to leverage the fact that only partial sorting is required. Otherwise the Rcpp package can be used to perform a similar optimization.
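To illustrate the partial-sorting idea, here is a sketch of mine of a single-probability type-7 quantile that only places the two required order statistics (not part of the original answer; no NA handling):
quant7_partial <- function(x, p) {
  n  <- length(x)
  h  <- (n - 1) * p + 1                        # type-7 target position
  lo <- floor(h)
  hi <- ceiling(h)
  xs <- sort(x, partial = unique(c(lo, hi)))   # partial sort is cheaper than a full sort
  xs[lo] + (h - lo) * (xs[hi] - xs[lo])        # interpolate between the two order statistics
}
x <- rnorm(1e5)
all.equal(quant7_partial(x, 1/3), unname(quantile(x, probs = 1/3)))  # should be TRUE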

dplyr mutate use standard evaluation [duplicate]

I'm trying to get my head around non-standard evaluation as used by dplyr, but without success. I'd like a short function that returns summary statistics (N, mean, sd, median, IQR, min, max) for a specified set of variables.
Simplified version of my function...
my_summarise <- function(df = temp,
                         to.sum = 'eg1',
                         ...){
  ## Summarise
  results <- summarise_(df,
                        n = ~n(),
                        mean = mean(~to.sum, na.rm = TRUE))
  return(results)
}
And running it with some dummy data...
library(dplyr)

set.seed(43290)
temp <- cbind(rnorm(n = 100, mean = 2, sd = 4),
              rnorm(n = 100, mean = 3, sd = 6)) %>% as.data.frame()
names(temp) <- c('eg1', 'eg2')
mean(temp$eg1)
[1] 1.881721
mean(temp$eg2)
[1] 3.575819
my_summarise(df = temp, to.sum = 'eg1')
n mean
1 100 NA
N is calculated, but the mean is not, and I can't figure out why.
Ultimately I'd like my function to be more general, along the lines of...
my_summarise <- function(df = temp,
                         group.by = 'group',
                         to.sum = c('eg1', 'eg2'),
                         ...){
  results <- list()
  ## Select columns
  df <- dplyr::select_(df, .dots = c(group.by, to.sum))
  ## Summarise overall
  results$all <- summarise_each(df,
                                funs(n = ~n(),
                                     mean = mean(~to.sum, na.rm = TRUE)))
  ## Summarise by specified group
  results$by.group <- group_by_(df, ~to.group) %>%
    summarise_each(df,
                   funs(n = ~n(),
                        mean = mean(~to.sum, na.rm = TRUE)))
  return(results)
}
...but before I move on to this more complex version (for which I was using this example as guidance), I need to get the evaluation working in the simple version first, as that's the stumbling block; the call to dplyr::select_() works OK.
Appreciate any advice as to where I'm going wrong.
Thanks in advance
The basic idea is that you have to actually build the appropriate call yourself, most easily done with the lazyeval package.
In this case you want to programmatically create a call that looks like ~mean(eg1, na.rm = TRUE). This is how:
my_summarise <- function(df = temp,
                         to.sum = 'eg1',
                         ...){
  ## Summarise
  results <- summarise_(df,
                        n = ~n(),
                        mean = lazyeval::interp(~mean(x, na.rm = TRUE),
                                                x = as.name(to.sum)))
  return(results)
}
Here is what I do when I struggle to get things working:

- Remember that, just like the ~n() you already have, the call will have to start with a ~.
- Write the correct call with the actual variable and see if it works (~mean(eg1, na.rm = TRUE)).
- Use lazyeval::interp to recreate that call, and check this by running only the interp to visually see what it is doing.
In this case I would probably often write interp(~mean(x, na.rm = TRUE), x = to.sum). But running that will give us ~mean("eg1", na.rm = TRUE) which is treating eg1 as a character instead of a variable name. So we use as.name, as is taught to us in vignette("nse").
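Putting both checks together with the question's data (the mean matches mean(temp$eg1) reported above; formatting approximate):
lazyeval::interp(~mean(x, na.rm = TRUE), x = as.name('eg1'))
#> ~mean(eg1, na.rm = TRUE)
my_summarise(df = temp, to.sum = 'eg1')
#>     n     mean
#> 1 100 1.881721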

plyr outperforms dplyr and data.table - What's wrong?

I have to apply a function to every row of a large table (~ 2M rows). I used to use plyr for that, but the table is growing continuously and the current solution starts to approach unacceptable runtimes. I thought I could just switch to data.table or dplyr and all is fine, but that's not the case.
Here's an example:
library(data.table)
library(plyr)
library(dplyr)

dt = data.table("ID_1" = c(1:1000),                    # unique ID
                "ID_2" = ceiling(runif(1000, 0, 100)), # other ID, duplicates possible
                "group" = sample(LETTERS[1:10], 1000, replace = T),
                "value" = runif(1000),
                "ballast1" = "X",                      # keeps unchanged in derive_dt
                "ballast2" = "Y",                      # keeps unchanged in derive_dt
                "ballast3" = "Z",                      # keeps unchanged in derive_dt
                "value_derived" = 0)
setkey(dt, ID_1)

extra_arg = c("A", "F", "G", "H")
ID_1 is guaranteed to contain no duplicates. Now I define a function to apply to every row/ID_1:
derive = function(tmprow, extra_arg){
  if(tmprow$group %in% extra_arg){return(NULL)} # exclude entries occurring in extra_arg
  group_index = which(LETTERS == tmprow$group)
  group_index = ((group_index + sample(1:26, 1)) %% 25) + 1
  new_group = LETTERS[group_index]
  if(new_group %in% unique(dt$group)){return(NULL)}
  new_value = runif(1)
  row_derived = tmprow
  row_derived$group = new_group
  row_derived$value = runif(1)
  row_derived$value_derived = 1
  return(row_derived)
}
This one doesn't do anything useful (the actual one does). The point is that the function takes one row and computes a new row of the same format.
Now the comparison:
set.seed(42)
system.time(result_dt <- dt[, derive(.SD, extra_arg), by = ID_1])
set.seed(42)
system.time(result_dplyr <- dt %>% group_by(ID_1) %>% do(derive(., extra_arg)))
set.seed(42)
system.time(results_plyr <- x <- ddply(dt, .variable = "ID_1", .fun = derive, extra_arg))
plyr is about 8x faster than both data.table and dplyr. Obviously I'm doing something wrong here, but what?
EDIT
Thanks to eddi's answer I could reduce the runtimes for data.table and dplyr to ~ 0.6 and 0.8 of the plyr version, respectively. I initialized row_derived as a data.frame: row_derived = as.data.frame(tmprow). That's cool, but I still expected a higher performance increase from these packages... any further suggestions?
The issue is that the assignment you use has a very high overhead in data.table; plyr converts the row to a data.frame before passing it to your derive function and thus avoids that overhead:
library(microbenchmark)

df = as.data.frame(dt)

microbenchmark({dt$group = dt$group}, {df$group = df$group})
#Unit: microseconds
#                     expr      min       lq       mean    median       uq      max neval
#  { dt$group = dt$group } 1895.865 2667.499 3092.38903 3080.3620 3389.049 4984.406   100
#  { df$group = df$group }   26.045   45.244   64.13909   61.6045   79.635  157.266   100
I can't suggest a good fix, since you say your example is not the real problem, so there is no point in solving it better. Some basic suggestions to look at are: vectorizing the code, and using := or set instead (depending on what exactly you end up doing).
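For reference, a minimal sketch of the := / set() suggestion (my illustration, not a rewrite of derive(); the data here is made up):
library(data.table)
dt2 <- data.table(group = sample(LETTERS[1:10], 1e5, TRUE), value = runif(1e5))
# whole-column update by reference with := (none of the data.table `$<-` overhead)
dt2[, value_derived := as.integer(!group %in% c("A", "F", "G", "H"))]
# element-wise updates by reference with set(), e.g. inside a loop
for (i in 1:100) set(dt2, i = i, j = "value", value = runif(1))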
