How to speed up parallel foreach in R

I want to calculate a series of approximately 1,000,000 wilcox.tests in R:
result <- foreach(i = 1:ncol(data), .combine = bind_rows, .multicombine = TRUE, .maxcombine = 1000) %do% {
  w = wilcox.test(data[, i] ~ as.factor(groups), exact = FALSE)
  df <- data.frame(Characters = character(),
                   Doubles = double(),
                   Doubles = double(),
                   stringsAsFactors = FALSE)
  df[1, ] = c(colnames(data)[i], w$statistic, w$p.value)
  rownames(df) = colnames(beta_t1)[i]
  colnames(df) = c("cg", "statistic", "p.value")
  return(df)
}
If I do it with %dopar% and 15 cores, it is slower than with single-core %do%.
I suspect it is a memory access problem. My processors are hardly used to capacity either. Is it possible to split the data dataframe into chunks, have each processor calculate the tests for 100K columns, and then combine the results? How can I speed up this foreach loop?

One thing that’s immediately striking is that you use eight lines to create and return a data.frame where a single expression is sufficient:
data.frame(
  cg = colnames(data)[i],
  statistic = w$statistic,
  p.value = w$p.value,
  row.names = colnames(beta_t1)[i],
  stringsAsFactors = FALSE
)
More importantly, after the loop is run, foreach has to row-bind all these data.frames, and that operation is slow. It's more efficient to return a list of the p-values and statistics and forget about the row and column names (these can be added afterwards, and then don't require subsetting and re-concatenation).
That is, change your code to
result = foreach(col = data) %do% {
  w = wilcox.test(col ~ as.factor(groups), exact = FALSE)
  list(w$statistic, w$p.value)
}
# Combine result and transform it into a data.frame:
results = data.frame(
  cg = colnames(data),
  statistic = vapply(result, `[[`, double(1L), 1L),
  p.value = vapply(result, `[[`, double(1L), 2L),
  row.names = colnames(beta_t1),
  stringsAsFactors = FALSE # only necessary for R < 4.0!
)
(I never use foreach so I'm not exactly sure how to use it here, but the above should roughly work; otherwise try mclapply from the 'parallel' package, which does the same thing using the familiar syntax of lapply.)
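For reference, here is a rough sketch of what that mclapply variant could look like (untested; mclapply relies on forking, so it will not actually run in parallel on Windows). It assumes the same data and groups objects as in the question.
library(parallel)
# run one wilcox.test per column, in parallel on 15 cores
res <- mclapply(seq_len(ncol(data)), function(i) {
  w <- wilcox.test(data[, i] ~ as.factor(groups), exact = FALSE)
  c(statistic = unname(w$statistic), p.value = w$p.value)
}, mc.cores = 15)
# combine the per-column results into a data frame afterwards
result <- data.frame(
  cg = colnames(data),
  statistic = vapply(res, `[[`, double(1L), "statistic"),
  p.value = vapply(res, `[[`, double(1L), "p.value"),
  stringsAsFactors = FALSE
)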

Related

Is there an easy way to simplify this code using a loop in r?

I am working in RStudio and have a series of code blocks just like these. There are 34 in total, and I am wondering if there is an easy way to write it once and have it loop through the defined variables rsqRow.a{#} and combineddfs.a{#}, as well as the internally used variables s_{## 'State'}.
# s_WA.train.lr.Summary
rsqRow.a32 = summary(s_WA.train.lr)$r.squared
# rsqRow.a32
Coef = summary(s_WA.train.lr)$coef[,1] # row, column
CoefRows = data.frame(Coef)
Pval = summary(s_WA.train.lr)$coef[,4]
PvalRows = data.frame(Pval)
combineddfs.a32 <- merge(CoefRows, PvalRows, by=0, all=TRUE)
# combineddfs.a32
# s_WI.train.lr.Summary
rsqRow.a33 = summary(s_WI.train.lr)$r.squared
# rsqRow.a33
Coef = summary(s_WI.train.lr)$coef[,1] # row, column
CoefRows = data.frame(Coef)
Pval = summary(s_WI.train.lr)$coef[,4]
PvalRows = data.frame(Pval)
combineddfs.a33 <- merge(CoefRows, PvalRows, by=0, all=TRUE)
# combineddfs.a33
# s_WY.train.lr.Summary
rsqRow.a34 = summary(s_WY.train.lr)$r.squared
# rsqRow.a34
Coef = summary(s_WY.train.lr)$coef[,1] # row, column
CoefRows = data.frame(Coef)
Pval = summary(s_WY.train.lr)$coef[,4]
PvalRows = data.frame(Pval)
combineddfs.a34 <- merge(CoefRows, PvalRows, by=0, all=TRUE)
# combineddfs.a34
As already mentioned in the comments, you should get the data into a list to avoid such repetitive code.
Find a common pattern that matches all the relevant object names in the global environment and use it as the pattern argument of ls to get a character vector of those names. You can then use mget to collect the objects in a list.
list_data <- mget(ls(pattern = 's_W.*train\\.lr'))
Once you have the list of models, you can use lapply to iterate over it and return the values you want from the function. Note that there might be a simpler way to write what you have in your attempt; however, as I don't have the data, I am not going to take the risk of shortening your code. Here I am returning rsqRow and combineddfs for each model. You can add or remove objects according to your preference.
all_values <- lapply(list_data, function(x) {
  rsqRow = summary(x)$r.squared
  Coef = summary(x)$coef[, 1]
  CoefRows = data.frame(Coef)
  Pval = summary(x)$coef[, 4]
  PvalRows = data.frame(Pval)
  combineddfs <- merge(CoefRows, PvalRows, by = 0, all = TRUE)
  list(rsqRow, combineddfs)
})
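As a small follow-up sketch, the individual pieces can be pulled back out of all_values with sapply/lapply; the names of list_data carry through automatically:
# R-squared values as a named numeric vector (first element of each result)
rsq_all <- sapply(all_values, `[[`, 1)
# the merged coefficient/p-value data frames as a named list (second element)
combined_all <- lapply(all_values, `[[`, 2)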

Threshold for using join over merge

My understanding regarding the difference between the merge() function (in base R) and the join() functions of plyr and dplyr is that join() is faster and more efficient when working with "large" data sets.
Is there some way to determine a threshold for when to use join() over merge(), without resorting to a heuristic approach?
I am sure you will be hard-pressed to find a hard-and-fast rule for when to switch from one function to another. As others have mentioned, there is a set of tools in R to help you measure performance. object.size and system.time are two such functions, looking at memory usage and execution time, respectively. One general approach is to measure both directly over an arbitrarily growing data set. Below is one attempt at this. We will create a data frame with an 'id' column and a random set of numeric values, allow the data frame to grow, and measure how the timings change. I'll use inner_join here as you mentioned dplyr. We will measure time as "elapsed" time.
library(tidyverse)
set.seed(424)
# number of rows in each cycle
growth <- c(100, 1000, 10000, 100000, 1000000, 5000000)
# empty vectors to collect results
n <- 1
l1 <- c()
l2 <- c()
# test for inner_join from dplyr
for (i in growth) {
  x <- data.frame("id" = 1:i, "value" = rnorm(i, 0, 1))
  y <- data.frame("id" = 1:i, "value" = rnorm(i, 0, 1))
  test <- inner_join(x, y, by = c('id' = 'id'))
  l1[[n]] <- object.size(test)
  print(system.time(test <- inner_join(x, y, by = c('id' = 'id')))[3])
  l2[[n]] <- system.time(test <- inner_join(x, y, by = c('id' = 'id')))[3]
  n <- n + 1
}
# empty vectors to collect results
n <- 1
l3 <- c()
l4 <- c()
# test for merge
for (i in growth) {
  x <- data.frame("id" = 1:i, "value" = rnorm(i, 0, 1))
  y <- data.frame("id" = 1:i, "value" = rnorm(i, 0, 1))
  test <- merge(x, y, by = c('id'))
  l3[[n]] <- object.size(test)
  # print(object.size(test))
  print(system.time(test <- merge(x, y, by = c('id')))[3])
  l4[[n]] <- system.time(test <- merge(x, y, by = c('id')))[3]
  n <- n + 1
}
# plotting output (some coercion may happen, so be it)
plot <- bind_rows(data.frame("size_bytes" = l3, "time_sec" = l4, "id" = "merge"),
                  data.frame("size_bytes" = l1, "time_sec" = l2, "id" = "inner_join"))
plot$size_MB <- plot$size_bytes / 1000000
ggplot(plot, aes(x = size_MB, y = time_sec, color = id)) + geom_line()
merge seems to perform worse out of the gate, and the gap really widens around ~20 MB. Is this the final word on the matter? No. But such testing can give you an idea of how to choose a function.
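If you want timings that are less sensitive to one-off noise than a single system.time call, one alternative (a sketch, assuming the microbenchmark package is installed) is to repeat each join several times at a fixed size and compare the distributions:
library(microbenchmark)
# repeat each join ten times on the same pair of data frames
i <- 100000
x <- data.frame(id = 1:i, value = rnorm(i))
y <- data.frame(id = 1:i, value = rnorm(i))
microbenchmark(
  merge      = merge(x, y, by = "id"),
  inner_join = inner_join(x, y, by = "id"),
  times = 10
)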

Iterating through a list of parameters using tidyquant R

I have a dataset which I want to process using tq_mutate and rollapply with different parameter values.
Currently I'm using a for loop to go over all the parameter values, but I'm sure this is not the most efficient or fastest way to do this (especially when I will be looking at a large number of parameter values). How could the for loop be improved or removed? I suspect it means using purrr::map or some other means (multithreading/multicore etc.), but I've not been able to find useful examples online.
Below is some sample code. Please ignore the simplicity of the dataset and the outputs of the function; it is for illustrative purposes only. What I want to do is iterate over many different V0 values.
library(dplyr)
library(tidyverse)
library(broom)
library(tidyquant)
my_bogus_function <- function(df, V0=1925) {
# WILL HAVE SOMETHING MORE SOPHISTICATED IN HERE BUT KEEPING IT SIMPLE
# FOR THE PURPOSES OF THE QUESTION
c(V0, V0*2)
}
window_size <- 7 * 24
cnames = c("foo", "bar")
df <- c("FB") %>%
tq_get(get = "stock.prices", from = "2016-01-01", to = "2017-01-01") %>%
dplyr::select("date", "open")
# CAN THIS LOOP BE DONE IN A MORE EFFICIENT MANNER?
for (i in 1825:1830) {
  df <- df %>%
    tq_mutate(mutate_fun = rollapply,
              width = window_size,
              by.column = FALSE,
              FUN = my_bogus_function,
              col_rename = gsub("$", sprintf(".%d", i), cnames),
              V0 = i)
}
# END OF THE FOR LOOP I WANT FASTER
Given that R uses only one core by default, I found an improvement by using the packages parallel, doSNOW and foreach, which allow multiple cores to be used. (Note that I'm on a Windows machine, so some other packages are not available.)
I'm sure there are other ways out there to multithread/parallelise/vectorise this code.
Here is the code for anyone interested.
library(dplyr)
library(tidyverse)
library(tidyquant)
library(parallel)
library(doSNOW)
library(foreach)
window_size <- 7 * 24
cnames = c("foo", "bar")
df <- c("FB") %>%
tq_get(get = "stock.prices", from = "2016-01-01", to = "2017-01-01") %>%
dplyr::select("date", "open")
my_bogus_function <- function(df, V0=1925) {
# WILL HAVE SOMETHING MORE SOPHISTICATED IN HERE BUT KEEPING IT SIMPLE
# FOR THE PURPOSES OF THE QUESTION
c(V0, V0*2)
}
# CAN THIS LOOP BE DONE IN A MORE EFFICIENT/FASTER MANNER? YES
numCores <- detectCores() # get the number of cores available
cl <- makeCluster(numCores, type = "SOCK")
registerDoSNOW(cl)
# Function to combine the outputs
mycombinefunc <- function(a, b) { merge(a, b, by = c("date", "open")) }
# Run the loop over multiple cores
# (.packages makes tidyquant and the pipe available on the worker processes)
meh <- foreach(i = 1825:1830, .combine = "mycombinefunc",
               .packages = c("dplyr", "tidyquant")) %dopar% {
  message(i)
  df %>%
    # Adjust everything
    tq_mutate(mutate_fun = rollapply,
              width = window_size,
              by.column = FALSE,
              FUN = my_bogus_function,
              col_rename = gsub("$", sprintf(".%d", i), cnames),
              V0 = i)
}
stopCluster(cl)
# END OF THE FOR LOOP I WANTED FASTER
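Since the question also mentioned purrr::map, here is a rough single-core sketch (untested) of the same idea without an explicit loop: build one mutated copy of df per V0 value, then merge the copies, mirroring mycombinefunc above. A parallel analogue would swap map for furrr::future_map.
library(purrr)
meh <- map(1825:1830, function(i) {
  df %>%
    tq_mutate(mutate_fun = rollapply,
              width = window_size,
              by.column = FALSE,
              FUN = my_bogus_function,
              col_rename = paste0(cnames, ".", i),
              V0 = i)
}) %>%
  reduce(merge, by = c("date", "open"))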

Scoping issue when using doParallel

I am trying to estimate multiple nonparametric models using the doParallel package. My problem though seems to be related to the np package.
Take a look at this reproducible example:
library(np)
library(doParallel)
df <- data.frame(Y = runif(100, 0, 10), X = rnorm(100))
models <- list(as.formula(Y ~ X))
npestimate <- function(m, data) {
  LCLS <- npregbw(m, data = data, regtype = "lc", bwmethod = "cv.ls")
  LLLS <- npregbw(m, data = data, regtype = "ll", bwmethod = "cv.ls")
  # sigt <- npsigtest(LCLS, boot.method = "wild", boot.type = "I")
  return(list(LCLS = LCLS, LLLS = LLLS))
}
cl <- makeCluster(length(models))
registerDoParallel(cl)
results <- foreach(m = models, .packages = "np", .verbose = T) %dopar%
npestimate(m, data = df)
stopCluster(cl)
As you can see, I created a function called npestimate() in order to compute different things for each model. I commented out the line where I want to run significance tests using npsigtest. Usually, npsigtest gets the data by looking in the environment in which npregbw was called.
But that does not work here. I am not sure why, but npsigtest just cannot find the data that was used in the two lines of code right above it.
The data is automatically exported to the nodes, so using .export in foreach is redundant.
Any suggestions how to make this work?
npsigtest pretty much copies the approach used within lm and the functions for lm objects. It thus has the same potential scoping pitfalls. The issue is the environment associated with the formula:
environment(models[[1]])
#<environment: R_GlobalEnv>
It's easy to fix:
npestimate <- function(m, data) {
  environment(m) <- environment()
  LCLS <- npregbw(m, data = data, regtype = "lc", bwmethod = "cv.ls")
  LLLS <- npregbw(m, data = data, regtype = "ll", bwmethod = "cv.ls")
  sigt <- npsigtest(LCLS, boot.method = "wild", boot.type = "I")
  return(list(LCLS = LCLS, LLLS = LLLS))
}
I actually often prefer eval(bquote()) constructs because of such issues.
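For what it's worth, one possible reading of that idiom here (a sketch only, not tested against np) is to splice the formula and the data into the call with .(), so that the call stored in the fitted object can be re-evaluated later without depending on the calling environment:
npestimate <- function(m, data) {
  # .(m) and .(data) are replaced by the actual objects before evaluation
  LCLS <- eval(bquote(npregbw(.(m), data = .(data), regtype = "lc", bwmethod = "cv.ls")))
  LLLS <- eval(bquote(npregbw(.(m), data = .(data), regtype = "ll", bwmethod = "cv.ls")))
  sigt <- npsigtest(LCLS, boot.method = "wild", boot.type = "I")
  return(list(LCLS = LCLS, LLLS = LLLS, sigt = sigt))
}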

How do I iterate over variables in R from a Python perspective?

Let's say I have these variables in R, all under the common prefix data:
data.orange.lm = lm(...)
data.orange.avg = mean(...)
data.orange.sd = sd(...)
data.pear.lm = lm(...)
data.pear.sd = sd(...)
...
data.plum.sd = sd(...)
data.plum.summary = summary(lm(...))
How can I programmatically iterate over data? In Python, iteritems for a dictionary provides the keys and their respective values. Is there an R equivalent?
Store everything in a nested list:
data[["orange"]][["lm"]] = lm(...)
data[["orange"]][["avg"]] = mean(...)
data[["orange"]][["sd"]] = sd(...)
data[["pear"]][["lm"]] = lm(...)
data[["pear"]][["sd"]] = sd(...)
Then use the apply family of commands.
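A minimal sketch of iterating over that nested list by name, roughly the equivalent of Python's dict.items():
for (fruit in names(data)) {
  for (stat in names(data[[fruit]])) {
    value <- data[[fruit]][[stat]]
    cat(fruit, stat, "\n")  # do something with value here
  }
}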
For data that is a bit like this:
data <- data.frame(
fruit = rep(c("orange", "pear", "plum"), each = 10),
value = rnorm(30)
)
You can use tapply to get stats by fruit, and sapply to loop over functions.
fns <- c("mean", "sd")
stats <- sapply(fns, function(f) with(data, tapply(value, fruit, f)))
(If some of your functions don't return single numbers, then use lapply rather than sapply.)
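For example (a small sketch): range() returns two numbers per fruit, so the results no longer collapse into a neat matrix, and lapply keeps them as a list:
fns <- c("mean", "sd", "range")
stats_list <- lapply(fns, function(f) with(data, tapply(value, fruit, f)))
names(stats_list) <- fns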
