Speeding up stringdist in R using parallel processing

I have a vector of 300 sentences, and I am trying to find elementwise JW distance using the stringdist package. The execution time for the naive implementation is too high, leading me to look for ways to reduce the runtime. I am trying to leverage the doParallel and foreach packages, but I'm not getting any significant speedup. This is how I am going about it.
library(foreach)
library(doParallel)

cl <- makeCluster(detectCores())
registerDoParallel(cl)

sentence = # vector containing sentences

jw_dist <- foreach(i = 1:length(sentence)) %dopar% {
  temp <- sentence[sentence != sentence[i]]
  return(mean(1 - stringdist::stringdist(sentence[i], temp, method = "jw", nthread = 3)))
}
stopCluster(cl)
I would really appreciate it if someone could point out ways to speed up this chunk of code.

So it seems you're fighting extreme overhead.
Instead of parallelizing over single sentences, split the task into sizable chunks and let apply do the rest. I've chosen 10 chunks of 100 pairwise comparisons each; there may be a faster combination, but this one runs much faster (at least for me) than what you asked for:
library(doParallel)
library(foreach)
library(stringdist)

# generate fake sentences
txt <- readLines(url('https://baconipsum.com/api/?type=all-meat&sentences=300&start-with-lorem=1&format=text'))
sentences <- strsplit(txt, '\\.\\s')[[1]]
sentences <- rep(sentences[sample(1:100, 100)], 10)

# pairwise combinations of sentences
cbn <- combn(1:length(sentences), 2)

# simple timing
st <- Sys.time()

# Since you work on Linux, you can use FORK
cl <- makeCluster(detectCores(), type = 'FORK')
registerDoParallel(cl)

res <- foreach(ii = seq(1, 1000, 100), .combine = 'c') %dopar% {
  apply(cbn[, ii:(ii + 99)], 2,
        function(x) stringdist(sentences[x[1]], sentences[x[2]], method = "jw"))
}

stopCluster(cl)
Sys.time() - st
On my Ubuntu VM, this code runs in ~ 1.8 seconds.
Specs:
Ubuntu 64 bit
R version 3.4
8 CPU cores
32 GB RAM
HTH
Edit:
Maybe avoiding parallel processing would be a good alternative in this case.
Using this lapply version, I can calculate the mean for each sentence in ~17 seconds:
res <- do.call(rbind, lapply(1:1000, function(ii) {
  c(ii, 1 - mean(stringdist(sentences[ii], sentences[-ii], method = "jw")))
}))
This gives you a two-column matrix with the index of each sentence and 1 minus the mean of all distances to that sentence (i.e., the mean JW similarity).
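As a quick usage sketch (my own addition, assuming the res matrix produced by the lapply call above), you can label the columns and sort to find the sentences that are, on average, most similar to the rest:

res <- as.data.frame(res)
colnames(res) <- c("index", "mean_similarity")
head(res[order(res$mean_similarity, decreasing = TRUE), ])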

Related

Way to parallelize QCA minimize() in R?

I'm using the R package QCA (https://cran.r-project.org/web/packages/QCA/index.html) for Qualitative Comparative Analysis. I want to be able to try out many different combinations, which is taking a very long time. On my faster CPU, trying all the options that I am interested in takes a little over 24 hours. R seems to be using only one of the cores available on my CPU and requires relatively little memory (just under 100MB). I am hoping someone has a good idea on how to speed up this process, perhaps through parallelization?
Here's what I'm doing:
Loading my data set (data), which is a CSV file with the outcome condition and all the options for my causal conditions. The causal conditions are in 4 groups A, B, C, and D. There are approx. 200 observations, i.e., rows in the data set.
Starting a log file with sink()
Creating a series of nested loops to generate each combination of causal conditions I want to examine.
Running the minimize() function within the nested loops. Specifically this looks like this:
for (a in causal_condition_group_A) {
  for (b in causal_condition_group_B) {
    for (c in causal_condition_group_C) {
      for (d in causal_condition_group_D) {
        minimize(data, outcome = my_outcome,
                 conditions = paste0(a, ", ", b, ", ", c, ", ", d), ...)
      }
    }
  }
}
The minimize function's conditions argument essentially takes a character vector as input and this is all my nested loops are creating. For example, a random conditions argument might read:
conditions = "causal_condition_A_87, causal_condition_B_2, causal_condition_C_42, causal_condition_D_219"
I tried several different things in an attempt to parallelize this approach, but so far I have not been successful. I tried experimenting with both parSapply and foreach %dopar%, but I am running into various problems. I either can't get the actual parallelization process to work properly or - and this is in some of my toy experiments - I am having trouble logging all the output, which is essential.
Please let me know if I can provide additional information to help clear things up! Thanks for your help!
EDIT:
I was able to create a working foreach() loop based on @HenrikB's advice, but I'm running into a different problem now.
Here's my test solution so far. It includes one less nested loop than I want in the final solution, but that's not important for now:
# SET QCA OUTCOME CONDITION
outcome = "c_outcome"
# LOAD LIBRARIES
library(doParallel)
library(QCA)
# CREATE CLUSTER FOR PARALLELIZATION
cores <- detectCores()
cl <- makeCluster(cores[1]-1, type = "PSOCK", outfile="")
registerDoParallel(cl)
# LOAD AND SET UP DATA
outcomecond <- read.csv("outcomes.csv", header=TRUE, row.names="ID")
causalcond <- read.csv("causal_conditions.csv", header=TRUE, row.names="ID")
data <- cbind(outcomecond[outcome], causalcond)
temp <- data[!is.na(data[outcome]), ] #keep only rows where outcome is not NA
# EXPORT CURRENT DATASET TO CLUSTERS
clusterExport(cl, "temp")
# CREATE CAUSAL CONDITION LISTS
causal_condition_group_A <- colnames(causalcond[, 1:99])
causal_condition_group_B <- colnames(causalcond[, 100:141])
causal_condition_group_C <- colnames(causalcond[, 142:183])
# EXPORT LIBRARIES TO CLUSTERS
clusterEvalQ(cl, library(doParallel))
clusterEvalQ(cl, library(QCA))
# START TIMER
start.time <- Sys.time()
# THREE NESTED FOREACH LOOPS (ONE FOR EACH CAUSAL CONDITION GROUP)
x <-
  foreach(c = causal_condition_group_C, .combine = 'cbind') %:%
  foreach(b = causal_condition_group_B, .combine = 'cbind') %:%
  foreach(a = causal_condition_group_A, .combine = 'cbind') %dopar% {
    tryCatch({
      minimize(temp, outcome = outcome, conditions = paste0(a, ",", b, ",", c),
               n.cut = 1, incl.cut = 0.400, include = "?", details = TRUE,
               use.letters = TRUE)
    }, error = function(e) { cat("ERROR :", conditionMessage(e), "\n") })
  }
# END TIMER
end.time <- Sys.time()
# PROGRAM RUNNING TIME:
print(end.time - start.time)
# END CLUSTER
stopCluster(cl)
When I run this as a sequential %do% loop, I need something like 3 GB of memory to store x. However, when I run it in parallel with %dopar%, memory use climbs steeply. Here's a screenshot from shortly before I gave up:
[Screenshot of task manager]
Does someone know why %dopar% is using so much more memory and is there a way to avoid this?
Could I, for example, write x to a file once in a while and purge memory after doing so? (A sketch of that idea follows the listing below.) x is a list of dimension 14 by the number of iterations in the foreach() loops. Here is what one of the "columns" of x looks like:
result.98
tt List,11
options List,10
negatives Numeric,3
initials "~A*B"
PIchart TRUE
primes Integer,2
solution List,1
essential "~A*B"
inputcases "205,245,253,306,425,468,490,511,514,585,587,657,684,696,739,740,784,796"
pims List,1
IC List,4
numbers Numeric,4
SA List,1
call Expression
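A minimal sketch of that file-based idea (not from the original thread): each iteration saves its minimize() result to disk and returns only the file path, so the object combined by foreach stays tiny. It reuses temp, outcome, and the condition groups defined above; out_dir is a hypothetical output folder.

out_dir <- file.path(getwd(), "qca_results")  # hypothetical output folder
dir.create(out_dir, showWarnings = FALSE)

x_files <-
  foreach(c = causal_condition_group_C, .combine = 'c') %:%
  foreach(b = causal_condition_group_B, .combine = 'c') %:%
  foreach(a = causal_condition_group_A, .combine = 'c') %dopar% {
    res <- tryCatch(
      minimize(temp, outcome = outcome, conditions = paste0(a, ",", b, ",", c),
               n.cut = 1, incl.cut = 0.400, include = "?", details = TRUE,
               use.letters = TRUE),
      error = function(e) conditionMessage(e))
    f <- file.path(out_dir, paste0("qca_", a, "_", b, "_", c, ".rds"))
    saveRDS(res, f)  # full result goes to disk, not into the combined object
    f                # only the file path is returned to the master
  }

Individual results can then be read back one at a time with readRDS() as needed.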

R: poor performance of doParallel to manipulate large dataframes

first of all thank you for taking the time to read my question.
More than anything, I need conceptual help, since I do not understand what is wrong with my interpretation. Some time ago I started refactoring some of the algorithms I use so that they run in parallel and take advantage of all the CPUs I have (something like 40; my processes always use one at a time).
Looking for examples and literature, I found that the package that might best serve me is doParallel, and I was reading this:
https://cran.r-project.org/web/packages/doParallel/vignettes/gettingstartedParallel.pdf
run a for loop in parallel in R
However, when I implemented it in my code, it took more time than before. To see where the problem was, I reduced my code to a simple task, which shows that using doParallel takes longer than the ordinary loop I always use. Here is the code I used to evaluate it, along with the output, where you can see which version takes more time:
library(doParallel)
proteins_names <- c("TCSYLVIO_005590","TcCLB.503947.20","TcCLB.504249.111","TcCLB.511081.60","TCSYLVIO_009736","TcCLB.507071.100","TcCLB.507801.60","TcCLB.509103.10","TCSYLVIO_003504","TcCLB.503645.40","TcCLB.508221.490","TCSYLVIO_005223","TcCLB.505949.10","TcCLB.505949.120","TcCLB.506459.219","TcCLB.506763.340","TcCLB.506767.360","TcCLB.506955.250","TcCLB.506965.190","TcCLB.506965.90")
merged_total_test<-data.frame(matrix(nrow =100,ncol = 22, rnorm(n = 2200,sd = 2,mean=10)))
merged_total_test$protein<-proteins_names[sample(1:20,100,replace = T)]
merged_total_test$signal<-rnorm(n = 100,sd = 2,mean=1000)
cores=detectCores()
cl <- makeCluster(cores[1]-4)
registerDoParallel(cl)
init_time_parallel<-Sys.time()
dt_plot_total_parallel <- foreach(prot = 1:20, .combine = rbind) %dopar% {
  temp_protein_c <- merged_total_test[merged_total_test$protein == proteins_names[prot] &
                                        !is.na(merged_total_test$signal), ]
  temp_protein_c
}
final_time_parallel<-Sys.time()
total_time_parallel<-final_time_parallel - init_time_parallel
stopCluster(cl)
init_time<-Sys.time()
dt_plot_total <- merged_total_test[0,]
for (prot in 1:20) {
  print(prot)
  temp_protein_c <- merged_total_test[merged_total_test$protein == proteins_names[prot] &
                                        !is.na(merged_total_test$signal), ]
  dt_plot_total <- rbind(dt_plot_total, temp_protein_c)
}
final_time<-Sys.time()
total_time<-final_time - init_time
total_time
total_time_parallel
identical(dt_plot_total,dt_plot_total_parallel)#should be true
output:
> total_time
Time difference of 0.3065186 secs
> total_time_parallel
Time difference of 1.939842 secs
> identical(dt_plot_total,dt_plot_total_parallel)#should be true
[1] TRUE

Dealing with multidimensional output in parallel programming

I am currently working on a program to evaluate the out-of-sample performance of several forecasting models on simulated data. For those who are familiar with finance, it works exactly like backtesting a trading strategy, except that I would evaluate forecasts and not transactions.
Some of the objects I currently manipulate using for loops for this type of task are 7-dimensional arrays (the dimensions stand for Monte Carlo replications, data generating processes, forecast horizons, 3 dimensions for model parameter selection, and one dimension for all the periods covered in the out-of-sample analysis). Obviously, it is painfully slow, so parallel computing has become a must for me.
My problem is: how do I keep track of more than 2 dimensions in R? Let me show you what I mean using for loops and only 3 dimensions:
x <- array(dim = c(2, 2, 2))
for (i in 1:2) {
  for (j in 1:2) {
    for (k in 1:2) {
      x[i, j, k] <- i + j + k
    }
  }
}
If I use something like 'foreach', I am very annoyed by the fact that, to my knowledge, available combining functionalities will return lists, matrices or vectors -- but not arbitrarily large multidimensional arrays. For instance:
library(doParallel)
library(foreach)
# Get the number of cores to use
no_cores <- max(1, detectCores()-1)
# Make cluster object using no_cores
cl <- makeCluster(no_cores)
# Initialize cluster for parallel computing
registerDoParallel(cl)
x <- foreach(i = 1:2, .combine = rbind) %:%
  foreach(j = 1:2, .combine = cbind) %:%
  foreach(k = 1:2, .combine = c) %dopar% {
    i + j + k
  }
Here, I basically combine results into vectors, then matrices and, finally, I pile up matrices by rows. Another option would be to use lists, or pile matrices through columns, but you can imagine the mess when you have 7 dimensions and millions of iterations to track.
I suppose I could also write my own 'combine' function and get the kind of output I want, but I suspect that I am not the first person to encounter this problem. Either there is a way to do exactly what I want, or someone here can point out a way to think differently about storing my results. It wouldn't be surprising that I am taking an absurdly inefficient path toward solving this problem -- I am an economist, not a data scientist, after all!
Any help would be greatly appreciated. Thanks in advance.
There is one available solution that I finally stumbled upon tonight: I can create an appropriate combine function along the dimension of my choice using the abind() function from the 'abind' package:
library(abind)
library(doParallel)

# Get the number of cores to use
no_cores <- max(1, detectCores() - 1)
# Make cluster object using no_cores
cl <- makeCluster(no_cores)
# Initialize cluster for parallel computing
registerDoParallel(cl)

mbind <- function(...) abind(..., along = 3)

x <- foreach(i = 1:2, .combine = mbind) %:%
  foreach(j = 1:2, .combine = cbind) %:%
  foreach(k = 1:2, .combine = c) %dopar% {
    i + j + k
  }
I would still like to see whether someone has other ways of doing this, however. There are probably many approaches, and I am new to R, but this solution is a workable one.
Here is what I would do, and what I already use in one of my packages, bigstatsr.
Take only one dimension and cut it into no_cores blocks. It should have sufficiently many iterations (e.g. 20 for 4 cores). For each block, construct the part of the array you want and store it in a temporary file. Then, use the contents of these files to fill the whole array. By doing so, you fill only preallocated objects, which should be faster and easier.
Example:
x.all <- array(dim=c(20,2,2))
no_cores <- 3
tmpfile <- tempfile()
range.parts <- bigstatsr:::CutBySize(nrow(x.all), nb = no_cores)
library(foreach)
cl <- parallel::makeCluster(no_cores)
doParallel::registerDoParallel(cl)
foreach(ic = 1:no_cores) %dopar% {
  ind <- bigstatsr:::seq2(range.parts[ic, ])
  x <- array(dim = c(length(ind), 2, 2))
  for (i in seq_along(ind)) {
    for (j in 1:2) {
      for (k in 1:2) {
        x[i, j, k] <- ind[i] + j + k
      }
    }
  }
  saveRDS(x, file = paste0(tmpfile, "_", ic, ".rds"))
}
parallel::stopCluster(cl)

for (ic in 1:no_cores) {
  ind <- bigstatsr:::seq2(range.parts[ic, ])
  x.all[ind, , ] <- readRDS(paste0(tmpfile, "_", ic, ".rds"))
}
print(x.all)
Instead of writing files, you could also directly return the no_cores parts of the array in foreach and combine them with the right abind.
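A minimal sketch of that variant, my own illustration rather than code from the answer: each worker returns its block of the array, and abind() along the first dimension does the combining (the index blocks come from a plain split() here instead of bigstatsr's helpers):

library(abind)
library(doParallel)

no_cores <- 3
cl <- makeCluster(no_cores)
registerDoParallel(cl)

# contiguous index blocks over the first dimension (length 20)
parts <- split(1:20, cut(1:20, no_cores, labels = FALSE))

x.all <- foreach(ic = 1:no_cores,
                 .combine = function(...) abind(..., along = 1)) %dopar% {
  ind <- parts[[ic]]
  x <- array(dim = c(length(ind), 2, 2))
  for (i in seq_along(ind))
    for (j in 1:2)
      for (k in 1:2)
        x[i, j, k] <- ind[i] + j + k
  x  # each worker returns its part; .combine stacks them along dim 1
}

stopCluster(cl)

The combine step runs on the master process, so abind only needs to be attached there.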

doParallel (package) foreach does not work for big iterations in R

I'm running the following code (extracted from the doParallel vignette) on a PC (running Linux) with 4 physical and 8 logical cores.
Running the code with iter=1e+6 or less, everything is fine, and I can see from the CPU usage that all cores are employed for this computation. However, with a larger number of iterations (e.g. iter=4e+6), it seems parallel computing does not work in that case. When I monitor the CPU usage, just one core is involved in the computations (100% usage).
Example1
require("doParallel")
require("foreach")
registerDoParallel(cores=8)
x <- iris[which(iris[,5] != "setosa"), c(1,5)]
iter=4e+6
ptime <- system.time({
r <- foreach(i=1:iter, .combine=rbind) %dopar% {
ind <- sample(100, 100, replace=TRUE)
result1 <- glm(x[ind,2]~x[ind,1], family=binomial(logit))
coefficients(result1)
}
})[3]
Do you have any idea what could be the reason? Could memory be the cause?
I googled around and found THIS question, which seems relevant to mine, but the point is that I'm not getting any kind of error, and the OP there seemingly came up with a solution by providing the necessary packages inside the foreach loop. No packages are used inside my loop, as can be seen.
UPDATE1
My problem is still not solved. Based on my experiments, I don't think memory could be the reason. I have 8 GB of memory on the system, on which I run the following simple parallel iteration (over all 8 logical cores):
Example2
require("doParallel")
require("foreach")
registerDoParallel(cores=8)
iter=4e+6
ptime <- system.time({
r <- foreach(i=1:iter, .combine=rbind) %dopar% {
i
}
})[3]
I have no problem running this code, but when I monitor the CPU usage, just one core (out of 8) is at 100%.
UPDATE2
As for Example2, @SteveWeston (thanks for pointing this out) stated in the comments: "The example in your update is suffering from having tiny tasks. Only the master has any real work to do, which consists of sending tasks and processing results. That's fundamentally different than the problem with the original example which did use multiple cores on a smaller number of iterations."
However, Example1 still remains unsolved. When I run it and I monitor the processes with htop, here is what happens in more detail:
Let's name all 8 created processes p1 through p8. The status (column S in htop) for p1 is R meaning that it's running and remains unchanged. However, for p2 up to p8, after some minutes, the status changes to D (i.e. uninterruptible sleep) and, after some minutes, again changes to Z (i.e. terminated but not reaped by its parent). Do you have any idea why this happens?
I think you're running low on memory. Here's a modified version of that example that should work better when you have many tasks. It uses doSNOW rather than doParallel because doSNOW allows you to process the results with the combine function as they're returned by the workers. This example writes those results to a file in order to use less memory, however it reads the results back into memory at the end using a ".final" function, but you could skip that if you don't have enough memory.
library(doSNOW)
library(tcltk)

nw <- 4  # number of workers
cl <- makeSOCKcluster(nw)
registerDoSNOW(cl)

x <- iris[which(iris[, 5] != 'setosa'), c(1, 5)]
niter <- 15e+6
chunksize <- 4000   # may require tuning for your machine
maxcomb <- nw + 1   # this count includes fobj argument
totaltasks <- ceiling(niter / chunksize)

comb <- function(fobj, ...) {
  for (r in list(...))
    writeBin(r, fobj)
  fobj
}

final <- function(fobj) {
  close(fobj)
  t(matrix(readBin('temp.bin', what = 'double', n = niter * 2), nrow = 2))
}

mkprogress <- function(total) {
  pb <- tkProgressBar(max = total,
                      label = sprintf('total tasks: %d', total))
  function(n, tag) {
    setTkProgressBar(pb, n,
                     label = sprintf('last completed task: %d of %d', tag, total))
  }
}

opts <- list(progress = mkprogress(totaltasks))
resultFile <- file('temp.bin', open = 'wb')

r <-
  foreach(n = idiv(niter, chunkSize = chunksize), .combine = 'comb',
          .maxcombine = maxcomb, .init = resultFile, .final = final,
          .inorder = FALSE, .options.snow = opts) %dopar% {
    do.call('c', lapply(seq_len(n), function(i) {
      ind <- sample(100, 100, replace = TRUE)
      result1 <- glm(x[ind, 2] ~ x[ind, 1], family = binomial(logit))
      coefficients(result1)
    }))
  }
I included a progress bar since this example takes several hours to execute.
Note that this example also uses the idiv function from the iterators package to increase the amount of work in each of the tasks. This technique is called chunking, and often improves the parallel performance. However, using idiv messes up the task indices, since the variable i is now a per-task index rather than a global index. For a global index, you can write a custom iterator that wraps idiv:
idivix <- function(n, chunkSize) {
  i <- 1
  it <- idiv(n, chunkSize = chunkSize)
  nextEl <- function() {
    m <- nextElem(it)  # may throw 'StopIteration'
    value <- list(i = i, m = m)
    i <<- i + m
    value
  }
  obj <- list(nextElem = nextEl)
  class(obj) <- c('abstractiter', 'iter')
  obj
}
The values emitted by this iterator are lists, each containing a starting index and a count. Here's a simple foreach loop that uses this custom iterator:
r <-
  foreach(a = idivix(10, chunkSize = 3), .combine = 'c') %dopar% {
    do.call('c', lapply(seq(a$i, length.out = a$m), function(i) {
      i
    }))
  }
Of course, if the tasks are compute intensive enough, you may not need chunking and can use a simple foreach loop as in the original example.
At first I thought you were running into memory problems, because submitting many tasks does use more memory, and that can eventually cause the master process to get bogged down, so my original answer shows several techniques for using less memory. However, now it sounds like there's a startup and shutdown phase where only the master process is busy, while the workers are busy for some period of time in the middle. I think the issue is that the tasks in this example aren't really very compute intensive, so when you have a lot of tasks you start to really notice the startup and shutdown times. I timed the actual computations and found that each task only takes about 3 milliseconds. In the past, you wouldn't get any benefit from parallel computing with tasks that small; now, depending on your machine, you can get some benefit, but the overhead is significant, so with a great many tasks you really notice that overhead.
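To estimate that per-task cost yourself (a sketch of my own, not part of the answer), you can time a batch of the glm fits sequentially and divide by the batch size:

x <- iris[which(iris[, 5] != "setosa"), c(1, 5)]

n <- 1000  # number of fits to time
elapsed <- system.time({
  for (i in seq_len(n)) {
    ind <- sample(100, 100, replace = TRUE)
    glm(x[ind, 2] ~ x[ind, 1], family = binomial(logit))
  }
})["elapsed"]

elapsed / n  # approximate seconds per task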
I still think that my other answer works well for this problem, but since you have enough memory, it's overkill. The most important technique is to use chunking. Here is an example that uses chunking, with minimal changes to the original example:
require("doParallel")
nw <- 8
registerDoParallel(nw)
x <- iris[which(iris[,5] != "setosa"), c(1,5)]
niter <- 4e+6
r <- foreach(n=idiv(niter, chunks=nw), .combine='rbind') %dopar% {
do.call('rbind', lapply(seq_len(n), function(i) {
ind <- sample(100, 100, replace=TRUE)
result1 <- glm(x[ind,2]~x[ind,1], family=binomial(logit))
coefficients(result1)
}))
}
Note that this does the chunking slightly differently than my other answer. It only uses one task per worker by using the idiv chunks option, rather than the chunkSize option. This reduces the amount of work done by the master and is a good strategy if you have enough memory.
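As a small illustration of the difference, here is a sketch of my own that prints the piece sizes each variant emits, using a sequential %do% loop:

library(foreach)
library(iterators)

# chunks = 3: exactly three pieces, one per worker
foreach(m = idiv(10, chunks = 3), .combine = c) %do% m

# chunkSize = 3: as many pieces as needed, each of at most 3 iterations
foreach(m = idiv(10, chunkSize = 3), .combine = c) %do% m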

Writing data back to dataframe after parallel computing in R

I am new to parallel computing in R.
I have gone through various links on Stack Overflow on the topic and wrote some initial code:
library(doParallel)
library(foreach)
detectCores()
## [1] 4
# Create cluster with desired number of cores
cl <- makeCluster(3)
# Register cluster
registerDoParallel(cl)
# Find out how many cores are being used
getDoParWorkers()
My objective is to do a repetitive calculation on each row; my function looks something like this:
func2 <- function(i) {
  msgbody <- tolower(as.character(purchase$msg_body[i]))
  purchase$category[i] <- category_fun(i, msgbody)
}
For this purpose I have written a foreach loop:
foreach(i = 1:nrow(purchase)) %dopar% func2(i)
But the issue is that func2 is supposed to write back to the data frame, yet it is not writing anything back; all the entries are the same as before.
I would appreciate your help.
I believe this would work better in the scenario you're describing. Instead of assigning inside the workers, write a function that returns the category for a given row, collect the results from foreach, and then write them back to the data frame on the master:
func2 <- function(i) {
  msgbody <- tolower(as.character(purchase$msg_body[i]))
  category_fun(i, msgbody)
}

result <- foreach(i = 1:nrow(purchase), .combine = c) %dopar% func2(i)
purchase$category <- result
I do think you'll be better off using apply() to solve this though.
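For instance, a minimal non-parallel sketch along those lines, my own illustration, assuming category_fun is called per row exactly as in the question:

purchase$category <- sapply(seq_len(nrow(purchase)), function(i) {
  msgbody <- tolower(as.character(purchase$msg_body[i]))
  category_fun(i, msgbody)
})

This keeps all writes to purchase on the master process, which avoids the problem of workers modifying their own copies of the data frame.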
