I have a dataset that I want to process using tq_mutate and rollapply with different parameter values.
Currently I'm using a for loop to go over all the parameter values, but I'm sure this is not the most efficient or fastest way to do this, especially once I start looking at large numbers of parameter values. How could the for loop be improved or removed? I suspect it means using purrr::map or some other means (multithreading/multicore, etc.), but I've not been able to find useful examples online.
Below is some sample code. Please ignore the simplicity of the dataset and the outputs of the example function; it is for illustrative purposes only. What I want to do is iterate over many different V0 values.
library(dplyr)
library(tidyverse)
library(broom)
library(tidyquant)
my_bogus_function <- function(df, V0=1925) {
# WILL HAVE SOMETHING MORE SOPHISTICATED IN HERE BUT KEEPING IT SIMPLE
# FOR THE PURPOSES OF THE QUESTION
c(V0, V0*2)
}
window_size <- 7 * 24
cnames = c("foo", "bar")
df <- c("FB") %>%
tq_get(get = "stock.prices", from = "2016-01-01", to = "2017-01-01") %>%
dplyr::select("date", "open")
# CAN THIS LOOP BE DONE IN A MORE EFFICIENT MANNER?
for (i in 1825:1830) {
  df <- df %>%
    tq_mutate(mutate_fun = rollapply,
              width = window_size,
              by.column = FALSE,
              FUN = my_bogus_function,
              col_rename = gsub("$", sprintf(".%d", i), cnames),
              V0 = i)
}
# END OF THE FOR LOOP I WANT FASTER
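For reference, a non-parallel purrr-based version of the same accumulation (as hinted at in the question) could look roughly like this; purrr::reduce() removes the explicit loop but still does the same sequential work, so it tidies rather than speeds up the code:
library(purrr)  # already attached via the tidyverse above

# reduce() threads df through one tq_mutate() call per V0 value,
# exactly mirroring what the for loop does.
df <- reduce(1825:1830, function(acc, i) {
  tq_mutate(acc,
            mutate_fun = rollapply,
            width = window_size,
            by.column = FALSE,
            FUN = my_bogus_function,
            col_rename = gsub("$", sprintf(".%d", i), cnames),
            V0 = i)
}, .init = df)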
Given that base R uses a single core, I have found an improvement by using the packages parallel, doSNOW and foreach, which allow multiple cores to be used (note that I'm on a Windows machine, so some other packages are not available).
I'm sure there are other ways out there to multithread/parallelise/vectorise this code.
Here is the code for anyone interested.
library(dplyr)
library(tidyverse)
library(tidyquant)
library(parallel)
library(doSNOW)
library(foreach)
window_size <- 7 * 24
cnames = c("foo", "bar")
df <- c("FB") %>%
tq_get(get = "stock.prices", from = "2016-01-01", to = "2017-01-01") %>%
dplyr::select("date", "open")
my_bogus_function <- function(df, V0=1925) {
# WILL HAVE SOMETHING MORE SOPHISTICATED IN HERE BUT KEEPING IT SIMPLE
# FOR THE PURPOSES OF THE QUESTION
c(V0, V0*2)
}
# CAN THIS LOOP BE DONE IN A MORE EFFICIENT/FASTER MANNER? YES
numCores <- detectCores() # get the number of cores available
cl <- makeCluster(numCores, type = "SOCK")
registerDoSNOW(cl)
# Function to combine the outputs
mycombinefunc <- function(a,b){merge(a, b, by = c("date","open"))}
# Run the loop over multiple cores
meh <- foreach(i = 1825:1830, .combine = "mycombinefunc") %dopar% {
  message(i)
  df %>%
    # Adjust everything
    tq_mutate(mutate_fun = rollapply,
              width = window_size,
              by.column = FALSE,
              FUN = my_bogus_function,
              col_rename = gsub("$", sprintf(".%d", i), cnames),
              V0 = i)
}
stopCluster(cl)
# END OF THE FOR LOOP I WANTED FASTER
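An alternative sketch in the same spirit (hedged, not tested against this exact dataset) uses the furrr package, which parallelises purrr-style mapping via the future framework and also runs fine on Windows:
library(furrr)

plan(multisession)  # background R sessions; works on Windows

# One worker per V0 value; the per-parameter results are merged afterwards,
# analogous to the foreach/.combine approach above.
results <- future_map(
  1825:1830,
  function(i) {
    tq_mutate(df,
              mutate_fun = rollapply,
              width = window_size,
              by.column = FALSE,
              FUN = my_bogus_function,
              col_rename = gsub("$", sprintf(".%d", i), cnames),
              V0 = i)
  },
  .options = furrr_options(packages = "tidyquant")
)

meh <- Reduce(function(a, b) merge(a, b, by = c("date", "open")), results)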
I am working with the R programming language.
I have this dataset that records exam results (1 = pass, 0 = fail) for a set of students at different times:
library(data.table)
library(doParallel)
# Generate some sample data
id = sample.int(10000, 100000, replace = TRUE)
res = c(1,0)
results = sample(res, 100000, replace = TRUE)
date_exam_taken = sample(seq(as.Date('1999/01/01'), as.Date('2020/01/01'), by="day"), 100000, replace = TRUE)
# Create a data frame from the sample data
my_data = data.frame(id, results, date_exam_taken)
my_data <- my_data[order(my_data$id, my_data$date_exam_taken),]
# Generate some additional columns for each record
my_data$general_id = 1:nrow(my_data)
my_data$exam_number = ave(my_data$general_id, my_data$id, FUN = seq_along)
my_data$general_id = NULL
# Convert the data frame to a data.table
my_data = setDT(my_data)
# Create a cluster with 4 workers
cl = makeCluster(4)
I have this function that tracks the number of times each student failed an exam given the student failed the previous exam, passed an exam given that the student passed the previous exam, passed an exam given that the student failed the previous exam and failed an exam given that the student passed the previous exam. Here is the function:
my_function <- function(i) {
  # Use tryCatch to handle the case where there are no rows in the start_i data frame
  tryCatch({
    start_i = my_data[my_data$id == i,]
    pairs_i = data.frame(first = head(start_i$results, -1), second = tail(start_i$results, -1))
    frame_i = as.data.frame(table(pairs_i))
    frame_i$i = i
    return(frame_i)
  }, error = function(err) {
    # Return an empty data frame if there are no rows in the start_i data frame
    return(data.frame())
  })
}
Now, I would like to try to run this function on my data in parallel; that is, I would like to assign data belonging to different students to different cores within my computer, in an effort to reduce the time this function takes. Here is my attempt:
# Export the data frames and the my_function to the workers on the cluster
clusterExport(cl, c("my_data", "my_function", "data.table"))
# Assign each worker a different subset of the data to work on
clusterSetRNGStream(cl)
n = nrow(my_data)
chunks = rep(1:4, each = n / 4)
my_data = my_data[chunks == 1,]
# Evaluate the code on the cluster (final_out is the final result)
final_out = parLapply(cl, unique(my_data$id), my_function)
# alternate version
final_out = clusterApply(cl, unique(my_data$id), my_function)
# Stop the cluster when finished
stopCluster(cl)
The code seems to have run without errors, but I am not sure if I have done everything correctly.
Can someone please comment on this?
Thanks!
So far as I can tell, the approach you've taken does what you expect. However, I doubt the cluster is giving you any real speed improvement over simpler alternatives. For example, with a dplyr pipeline you could do it pretty easily:
out <- my_data %>%
  arrange(id, exam_number) %>%
  group_by(id) %>%
  mutate(prev_exam = lag(results)) %>%
  group_by(id, results, prev_exam) %>%
  tally() %>%
  na.omit()
On my machine (macOS 12.6, 3.6 GHz Intel i9, 128 GB RAM), the dplyr pipeline is about 3.5 times faster than the parallel approach. As #jblood94 said in his comment, the communication overhead makes the cluster solution pretty inefficient. Maybe there is an even better data.table solution, roughly along the lines sketched below.
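For completeness, that data.table variant might look something like this (a hedged sketch; column names follow the question's my_data, and shift() gives the previous result within each student):
library(data.table)

dt <- copy(as.data.table(my_data))        # copy so the original is untouched
setorder(dt, id, exam_number)
dt[, prev_exam := shift(results), by = id]                 # previous result per student
out_dt <- dt[!is.na(prev_exam), .N, by = .(id, results, prev_exam)]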
I have a function that I'm applying to different sets of coordinates to create four new columns in my tibble. This function has a pretty long start-up time (it loads the genome into RAM, converts the tibble to GRanges, and retrieves sequences) but is otherwise relatively fast, so there's not much difference between 100 and 1,000,000 sequences. Is there any way to send each column in the mutate to a different core so they can be processed at the same time? I thought about using pivot_longer and then group + partition, but this got me thinking about whether there is a different way to accomplish this. A multi_mutate of sorts?
(I don't actually expect the multidplyr partition/collect to be that time-saving in my case, given the small cost of additional coordinates, but if I could avoid the time cost of pivoting, which is still relatively small, and the mess in my code, that'd be cool.)
I know you were looking for an existing package, but I couldn't find anything on that. Other similar questions (like here or here) appear not to provide a package either.
However, what about hacking it out yourself? Look at this example with furrr.
### libraries
library(dplyr)
library(furrr)
### data consistent with your example
d <- replicate(8, rnorm(100))
colnames(d) <- apply(expand.grid(letters[1:2], 1:4), 1, paste0, collapse = "")
d <- as_tibble(d)
### a function that takes more than a second to finish
long_f <- function(x1, x2){
  Sys.sleep(1)
  x1 + x2
}
### multimutate!
multimutate <- function(.data, ..., .options = future_options()){
  dots <- enquos(..., .named = TRUE)
  .data[names(dots)] <- future_map(dots, ~rlang::eval_tidy(., data = .data, env = parent.frame()), .options = .options)
  .data
}
# no future strategy implemented
tictoc::tic()
d %>%
multimutate(c1 = long_f(a1,b1),
c2 = long_f(a2,b2),
c3 = long_f(a3,b3),
c4 = long_f(a4,b4))
tictoc::toc()
# 4.34 sec elapsed
# future strategy
plan(multiprocess)
tictoc::tic()
d %>%
multimutate(c1 = long_f(a1,b1),
c2 = long_f(a2,b2),
c3 = long_f(a3,b3),
c4 = long_f(a4,b4),
.options = future_options(globals = "long_f"))
tictoc::toc()
# 1.59 sec elapsed
It needs some testing, I guess, and it would need to be improved, for example by supporting the same methods available for mutate. But it's a start.
Notice that I needed to use future_options.
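A side note (hedged, based on later furrr releases): future_options() has since been superseded by furrr_options(), and plan(multiprocess) by plan(multisession), so with a current furrr the parallel call would look roughly like this (the default in multimutate() would need the same swap):
# same call as above, just with the newer option constructor and plan
plan(multisession)
tictoc::tic()
d %>%
  multimutate(c1 = long_f(a1, b1),
              c2 = long_f(a2, b2),
              c3 = long_f(a3, b3),
              c4 = long_f(a4, b4),
              .options = furrr_options(globals = "long_f"))
tictoc::toc()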
I want to calculate a series of approximately 1,000,000 wilcox.tests in R:
result <- foreach(i = 1:ncol(data), .combine = bind_rows, .multicombine = TRUE, .maxcombine = 1000) %do% {
  w = wilcox.test(data[,i] ~ as.factor(groups), exact = FALSE)
  df <- data.frame(Characters = character(),
                   Doubles = double(),
                   Doubles = double(),
                   stringsAsFactors = FALSE)
  df[1,] = c(colnames(data)[i], w$statistic, w$p.value)
  rownames(df) = colnames(beta_t1)[i]
  colnames(df) = c("cg", "statistic", "p.value")
  return(df)
}
If I do it with %dopar% and 15 cores, it is slower than with a single core using %do%.
I suspect it is a memory access problem; my processors are hardly used to capacity either. Is it possible to split the data data frame into chunks, have each processor calculate 100K tests, and then combine the results? How can I speed up this foreach loop?
One thing that’s immediately striking is that you use eight lines to create and return a data.frame where a single expression is sufficient:
data.frame(
  cg = colnames(data)[i],
  statistic = w$statistic,
  p.value = w$p.value,
  row.names = colnames(beta_t1)[i],
  stringsAsFactors = FALSE
)
However, even with that fix, foreach still has to row-bind all these data.frames after the loop has run, and that operation is slow. It's more efficient to return a list of the p-values and statistics and forget about the row and column names (these can be provided afterwards, and then don't require subsetting and re-concatenation).
That is, change your code to:
result = foreach(col = data) %do% {
  w = wilcox.test(col ~ as.factor(groups), exact = FALSE)
  list(w$statistic, w$p.value)
}
# Combine the results and transform them into a data.frame:
results = data.frame(
  cg = colnames(data),
  statistic = vapply(result, `[[`, double(1L), 1L),
  p.value = vapply(result, `[[`, double(1L), 2L),
  row.names = colnames(beta_t1),
  stringsAsFactors = FALSE # only necessary for R < 4.0!
)
(I never use foreach so I'm not exactly sure how to use it here, but the above should roughly work; otherwise try mclapply from the 'parallel' package, which does the same thing using the familiar syntax of lapply.)
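A hedged sketch of that mclapply route (mclapply relies on forking, so it only parallelises on Linux/macOS; the object names data, groups and beta_t1 are taken from the question):
library(parallel)

# Each worker returns a small named vector instead of a data.frame.
res_list <- mclapply(seq_len(ncol(data)), function(i) {
  w <- wilcox.test(data[, i] ~ as.factor(groups), exact = FALSE)
  c(statistic = unname(w$statistic), p.value = w$p.value)
}, mc.cores = 15)

# Assemble the names and values once, at the end.
results <- data.frame(
  cg = colnames(data),
  do.call(rbind, res_list),
  row.names = colnames(beta_t1)
)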
My understanding regarding the difference between the merge() function (in base R) and the join() functions of plyr and dplyr is that join() is faster and more efficient when working with "large" data sets.
Is there some way to determine a threshold for when to use join() over merge(), without using a heuristic approach?
I am sure you will be hard pressed to find a "hard and fast" rule for when to switch from one function to another. As others have mentioned, there is a set of tools in R to help you measure performance. object.size and system.time are two such functions, looking at memory usage and execution time, respectively. One general approach is to measure both directly over an arbitrarily expanding data set. Below is one attempt at this. We will create a data frame with an 'id' column and a random set of numeric values, allowing the data frame to grow and measuring how the timings change. I'll use inner_join here as you mentioned dplyr. We will measure time as "elapsed" time.
library(tidyverse)
set.seed(424)
#number of rows in a cycle
growth <- c(100,1000,10000,100000,1000000,5000000)
#empty vectors to hold the results
n <- 1
l1 <- c()
l2 <- c()
#test for inner join in dplyr
for(i in growth){
  x <- data.frame("id" = 1:i, "value" = rnorm(i,0,1))
  y <- data.frame("id" = 1:i, "value" = rnorm(i,0,1))
  test <- inner_join(x,y, by = c('id' = 'id'))
  l1[[n]] <- object.size(test)
  print(system.time(test <- inner_join(x,y, by = c('id' = 'id')))[3])
  l2[[n]] <- system.time(test <- inner_join(x,y, by = c('id' = 'id')))[3]
  n <- n+1
}
#empty vectors to hold the results
n <- 1
l3 <- c()
l4 <- c()
#test for merge
for(i in growth){
  x <- data.frame("id" = 1:i, "value" = rnorm(i,0,1))
  y <- data.frame("id" = 1:i, "value" = rnorm(i,0,1))
  test <- merge(x,y, by = c('id'))
  l3[[n]] <- object.size(test)
  # print(object.size(test))
  print(system.time(test <- merge(x,y, by = c('id')))[3])
  l4[[n]] <- system.time(test <- merge(x,y, by = c('id')))[3]
  n <- n+1
}
#plotting output (some coercion may happen, so be it)
plot <- bind_rows(data.frame("size_bytes" = l3, "time_sec" = l4, "id" = "merge"),
                  data.frame("size_bytes" = l1, "time_sec" = l2, "id" = "inner_join"))
plot$size_MB <- plot$size_bytes/1000000
ggplot(plot, aes(x = size_MB, y = time_sec, color = id)) + geom_line()
merge seems to perform worse out of the gate, but its runtime really takes off around ~20 MB. Is this the final word on the matter? No. But testing like this can give you an idea of how to choose a function.
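If you want timings that are less noisy than a single system.time() call, a benchmarking package can repeat each join; here is a minimal sketch, assuming the microbenchmark package is installed:
library(microbenchmark)
library(dplyr)

# one fixed data size; microbenchmark repeats each expression 20 times
x <- data.frame(id = 1:1e5, value = rnorm(1e5))
y <- data.frame(id = 1:1e5, value = rnorm(1e5))

microbenchmark(
  merge      = merge(x, y, by = "id"),
  inner_join = inner_join(x, y, by = "id"),
  times = 20
)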
I have some code:
test <- therapyDF %>% group_by(therapyDF$prodcode) %>% summarize(count = n_distinct(therapyDF$patid))
that is designed to group all prodcode entries (drug) and then count how many patients (patid) have an instance of each drug. For example, the raw data is held in a dataframe similar to:
patid prodcode
1 A
1 B
2 C
3 A
3 A
3 B
Thus, output will be:
A 2
B 2
C 1
Is there any way of parallelising this code? The real data is in excess of 100 million records, and it has been running in serial for more than 8 hours.
I'm struggling to adopt the *apply methodologies of R and the numerous R parallel packages. Splitting the original data frame would be a bit tricky, as the data is not very well organised (just how I got it) and it would require grouping and extracting by $patid. I am running this on an 8-core Intel Linux box.
Thanks
It is possible to parallelize a group_by-summarize with
summarize_par <- function(grouped_df, ...) {
  sizes <- attr(grouped_df, "group_sizes")
  ord <- order(sizes, decreasing = TRUE)
  one_group <- function(gdf, i, size) {
    size_i <- sizes[i]
    structure(
      gdf[attr(gdf, "indices")[[i]] + 1, ],
      indices = list(0:(size_i - 1)),
      group_sizes = size_i,
      biggest_group_sizes = size_i,
      labels = attr(gdf, "labels")[i, , drop = FALSE]
    )
  }
  dots <- dplyr:::named_quos(...)
  res <- foreach(ic = ord) %dopar% {
    dplyr::summarise(one_group(grouped_df, ic), !!!dots)
  }
  do.call(rbind, res[match(seq_along(ord), ord)])
}
Test:
N <- 2e7
therapyDF <- data.frame(patid = sample.int(N/2, size = N, replace = TRUE),
prodcode = sample(LETTERS, size = N, replace = TRUE))
library(dplyr)
system.time(true <- therapyDF %>%
group_by(prodcode) %>%
summarize(count=n_distinct(patid)))
library(foreach)
library(doParallel)
registerDoParallel(cl <- makeForkCluster(detectCores() / 2))
system.time(test <- therapyDF %>%
group_by(prodcode) %>%
summarize_par(count=n_distinct(patid)))
all.equal(true$count, test$count)
stopCluster(cl)
I get:
sequential: 8.228
parallel with 2 cores: 6.525
Here, my computer has neither many cores nor much memory; you can expect better results on a more powerful machine.
There is a performance issue with n_distinct at the moment (see https://github.com/tidyverse/dplyr/issues/977). You should use length(unique(patid)) instead of n_distinct(patid) to speed things up, for example:
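Applied to the dplyr call above, that is:
# same grouped count, avoiding n_distinct()
therapyDF %>%
  group_by(prodcode) %>%
  summarize(count = length(unique(patid)))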
You should give data.table a try:
N <- 1e8
therapyDF <- data.frame(patid = sample.int(N/2, size = N, replace = TRUE),
prodcode = sample(LETTERS, size = N, replace = TRUE))
library(dplyr)
system.time(therapyDF %>%
group_by(prodcode) %>%
summarize(count=n_distinct(patid)))
#>    user  system elapsed
#>  36.939   1.196  38.136
library(data.table)
setDT(therapyDF)
system.time(therapyDF[, .(count = uniqueN(patid)), by = prodcode])
#>    user  system elapsed
#>   5.727   0.116   5.843
It uses OpenMP for parallel processing.
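The number of OpenMP threads data.table uses can be checked and adjusted explicitly:
getDTthreads()   # threads data.table will use for its parallel operations
setDTthreads(4)  # optionally cap it, e.g. at 4 threads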