`doParallel` vs `future` while using the `Seurat` package

Here is the story.
According to the Seurat vignette, FindMarkers() can be accelerated with the future package via future::plan("multiprocess", workers = 4).
However, I am running a simulation in which I need to call FindAllMarkers() inside a foreach() loop after registering a doParallel backend with registerDoParallel(cores = 10).
What parallelization actually happens behind the scenes?
How can I get the most out of the HPC under this setup?
How many CPUs should I allocate to this job to maximize parallelization?
Any ideas are welcome.
Below is a minimal example. pbmc.rds is here.
library(Seurat)
# Enable parallelization for `FindAllMarkers()`
library(future)
plan("multiprocess", workers = 4)
# Enable parallelization for `foreach()` loop
library(doParallel)
registerDoParallel(cores = 10)
pbmc <- readRDS("pbmc.rds")
rst <- foreach(i = 1:10/10, .combine = "cbind") %dopar% {
pbmc <- FindClusters(pbmc, resolution = i)
# should put future command here instead?
# plan("multiprocess", workers = 4)
DEgenes <- FindAllMarkers(pbmc)
write.csv(DEgenes, paste0("DEgenes_resolution_", i, ".csv"))
pbmc$seurat_clusters
}
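For reference, nesting the two backends multiplies the process count: 10 foreach workers, each spawning 4 future workers, can mean up to 40 concurrent R processes, so the CPU request should match whichever level of parallelism you keep. Below is a minimal sketch of one option, assuming you parallelize only the outer loop over resolutions and let FindAllMarkers() run sequentially inside each worker (the core counts are placeholders to adjust to your allocation):

library(Seurat)
library(doParallel)
library(future)

registerDoParallel(cores = 10)   # one worker per resolution
pbmc <- readRDS("pbmc.rds")

rst <- foreach(i = 1:10/10, .combine = "cbind",
               .packages = c("Seurat", "future")) %dopar% {
  plan("sequential")             # keep FindAllMarkers() single-threaded within each worker
  pbmc <- FindClusters(pbmc, resolution = i)
  DEgenes <- FindAllMarkers(pbmc)
  write.csv(DEgenes, paste0("DEgenes_resolution_", i, ".csv"))
  pbmc$seurat_clusters
}

The alternative, setting plan("multiprocess", workers = 4) inside the loop body, is also possible, but then a 10-resolution run can occupy up to 40 CPUs at once.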

Related

How to fix C function R_nc4_get_vara_double returned error in ncdf4 parallel processing in R

I want to download .nc data through OPeNDAP from remote storage. I use a parallel backend with a foreach %dopar% loop as follows:
# INPUTS
inputs=commandArgs(trailingOnly = T)
interimpath=as.character(inputs[1])
gcm=as.character(inputs[2])
period=as.character(inputs[3])
var=as.character(inputs[4])
source='MACAV2'
cat('\n\n EXTRACTING DATA FOR',var, gcm, period, '\n\n')
# CHANGING LIBRARY PATHS
.libPaths("/storage/home/htn5098/local_lib/R40") # local library for packages
setwd('/storage/work/h/htn5098/DataAnalysis')
source('./src/Rcodes/CWD_function_package.R') # Calling the function Rscript
# CALLING PACKAGES
library(foreach)
library(doParallel)
library(parallel)
library(filematrix)
# REGISTERING CORES FOR PARALLEL PROCESSING
no_cores <- detectCores()
cl <- makeCluster(no_cores)
registerDoParallel(cl)
invisible(clusterEvalQ(cl,.libPaths("/storage/home/htn5098/local_lib/R40"))) # Really have to import library paths into the workers
invisible(clusterEvalQ(cl, c(library(ncdf4))))
# EXTRACTING DATA FROM THE .NC FILES TO MATRIX FORM
url <- readLines('./data/external/MACAV2_OPENDAP_allvar_allgcm_allperiod.txt')
links <- grep(x = url,pattern = paste0('.*',var,'.*',gcm,'_.*',period), value = T)
start=c(659,93,1) # lon, lat, time
count=c(527,307,-1)
spfile <- read.csv('./data/external/SERC_MACAV2_Elev.csv',header = T)
grids <- sort(unique(spfile$Grid))
clusterExport(cl,list('ncarray2matrix','start','count','grids')) #exporting data into clusters for parallel processing
cat('\nChecking when downloading all grids\n')
# k <- foreach(x = links,.packages = c('ncdf4')) %dopar% {
# nc <- nc_open(x)
# nc.var=ncvar_get(nc,varid=names(nc$var),start=start,count=count)
# return(nc.var)
# nc_close(nc)
# }
k <- foreach(x = links,.packages = c('ncdf4'),.errorhandling = 'pass') %dopar% {
nc <- nc_open(x)
print(nc)
nc.var=ncvar_get(nc,varid=names(nc$var),start=c(659,93,1),count=c(527,307,-1))
nc_close(nc)
return(dim(nc.var))
Sys.sleep(10)
}
# k <- parSapply(cl,links,function(x) {
# nc <- nc_open(x)
# nc.var=ncvar_get(nc,varid=names(nc$var),start=start,count=count)
# nc_close(nc)
# return(nc.var)
# })
print(k)
However, I keep getting this error:
<simpleError in ncvar_get_inner(ncid2use, varid2use, nc$var[[li]]$missval, addOffset, scaleFact, start = start, count = count, verbose = verbose, signedbyte = signedbyte, collapse_degen = collapse_degen): C function R_nc4_get_vara_double returned error>
What could be the reason for this problem? Can you recommend a solution for this that is time-efficient (I have to repeat this for about 20 files)?
Thank you.
I had the same error in my code. The problem was not the code itself; it was one of the files I wanted to read. Something was wrong with it, so R couldn't open it. I identified the file and downloaded it again, and the same code worked perfectly.
I also encountered the same error. For me, restarting the R session did the trick.
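If you need to pinpoint which link is the bad one, here is a minimal sketch in the spirit of the answers above, probing each file sequentially with tryCatch (links, varid, start and count are taken from the question; only one time step is read to keep the probe fast):

library(ncdf4)

check_link <- function(x) {
  tryCatch({
    nc <- nc_open(x)
    on.exit(nc_close(nc), add = TRUE)
    ncvar_get(nc, varid = names(nc$var), start = c(659, 93, 1), count = c(527, 307, 1))
    NA_character_                      # no error message for this link
  }, error = function(e) conditionMessage(e))
}

msgs <- vapply(links, check_link, character(1))
bad  <- names(msgs)[!is.na(msgs)]      # links that failed; re-download or skip these
print(bad)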

R foreach loop runs out of memory in HPC environment

I am using the foreach package in R to process raster files.
The R code below works fine locally (on Windows) when adapted to an 8-core processor, but runs out of memory in an HPC environment with 48 cores. The HPC environment has far more memory available (2 TB across all 48 cores) than my local box (32 GB), so that is not the limiting factor.
The memory creep occurs as the foreach loop proceeds. It is slow, but enough that memory eventually runs out.
I have tried switching parallel packages (to doMC, doSNOW), adding numerous garbage collection calls and rm() of large objects at the end of every iteration, fiddling with the number of cores used, as well as removing any temporary files immediately.
Any ideas on what may be causing my memory issues?
# Set Java memory maximum
options(java.parameters = "-Xmx39g")
library(sp)
library(raster)
library(dismo)
library(foreach)
library(doParallel)
library(rgdal)
library(rJava)
# Set directories
relPath <- "E:/BIEN_Cactaceae/"
bufferDir <- "Data/Buffers"
climDir <- "Data/FutureClimate/"
outDir <- "Analyses/FutureRanges/"
modelDir <- "Analyses/MaxEnt/"
outfileDir <- "OutFiles/"
tempDir <- "E:/Tmp/"
# Set directory for raster temporary files
rasterOptions(tmpdir = tempDir)
# Search for models
models <- list.files(path = paste0(relPath, modelDir), pattern = "rda$")
# Set up cluster
cl <- makeCluster(48, type = "FORK", outfile = paste0(relPath, outfileDir, "predictFuture.txt"))
registerDoParallel(cl)
# Loop through species and predict current ranges
foreach(i = 1:length(models),
.packages = c("sp", "raster", "dismo", "rgdal", "rJava"),
.inorder = F) %dopar% {
# Get taxon
taxon <- strsplit(models[i], ".", fixed = T)[[1]][1]
# Get buffer
tmpBuffer <- readOGR(dsn = paste0(relPath, bufferDir), layer = paste0(taxon, "_buff"), verbose = F)
# Get scenarios
scenarios <- list.files(path = paste0(relPath, climDir), pattern = "tif$")
# Get model
load(paste0(relPath, modelDir, models[i]))
# Loop over scenarios
for (j in scenarios) {
# Get scenario name
tmpScenarioName <- strsplit(j, ".", fixed = T)[[1]][1]
# Skip scenario if already processed
if (!file.exists(paste0(relPath, outDir, taxon, "_", tmpScenarioName, ".tif"))) {
# Read, crop, mask predictors
print(paste0(taxon, " - ", tmpScenarioName, ": processing"))
tmpScenarioStack <- raster::stack(paste0(relPath, climDir, j))
preds <- raster::crop(tmpScenarioStack, tmpBuffer)
preds <- raster::mask(preds, tmpBuffer)
# Rename predictors
tmpNames <- paste0(taxon, ".", 1:20)
tmpNames <- gsub("-", ".", tmpNames, fixed = T)
tmpNames <- gsub(" ", "_", tmpNames, fixed = T)
names(preds) <- tmpNames
# Predict with model
prediction <- dismo::predict(model_all, preds, progress = "")
# Export predictions
writeRaster(prediction, paste0(relPath, outDir, taxon, "_", tmpScenarioName, ".tif"))
removeTmpFiles(h = 2)
}
}
}
stopCluster(cl)
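One mitigation often suggested for slow memory creep in long fork-based loops, sketched below under the assumption that the per-model iterations are independent, is to process the models in batches and restart the workers between batches so each batch starts from a clean slate (the batch size of 200 is a placeholder):

# Sketch: run the same per-model body in batches, recreating the FORK workers each time
batchSize <- 200
batches <- split(seq_along(models), ceiling(seq_along(models) / batchSize))

for (b in batches) {
  cl <- makeCluster(48, type = "FORK",
                    outfile = paste0(relPath, outfileDir, "predictFuture.txt"))
  registerDoParallel(cl)
  foreach(i = b, .inorder = FALSE) %dopar% {
    # ... identical per-model body as in the loop above ...
    NULL
  }
  stopCluster(cl)
  gc()
}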

parallel processing with R on Windows on a single-socket, multi-core CPU

I've been trying to run a parallel process on Windows. doSNOW states that it makes "SOCK"-type clusters, also referred to as socket clusters. I have a single CPU with 8 physical and 8 logical cores, which I often use under Linux. I've been trying to run the code below, but whenever I try the %dopar% option it gives me a NULL result, so I'm wondering whether Windows only runs in parallel on multi-socket machines (which I don't have).
library(jsonlite)
library(doSNOW)
library(foreach)
library(dplyr)
claves <- c("I11D67", "I11D66", "I11D65", "I11D64", "I11D63", "I11D62",
"I11D61", "I11C69", "I11D79", "I11D78", "I11D77", "I11D76", "I11D75",
"I11D74", "I11D73", "I11D72", "I11D71", "I11C79", "I12C82", "I12C81",
"I11D89", "I11D88", "I11D87", "I11D86", "I11D85")
cl <- makeCluster(8)
registerDoSNOW(cl)
foreach(i = claves, .combine=rbind, .errorhandling = "remove") %do% {
url <- sprintf("http://www.beta.inegi.org.mx/app/api/buscador/busquedaTodos/%s_A/RANKING/es", i)
ea <- fromJSON(url)
ea %>% select(TITULO, CVE_UNICA, EDICION)
} -> results
stopCluster(cl)
It's not a problem with your hardware. It's one or both of these two things:
SNOW has problems running on Windows. You'd want to substitute in doParallel or something similar.
Use :: to make sure your code hits dplyr::select and not, for instance, MASS::select.
This works (with some unrelated warnings):
if(!require(doParallel)) install.packages("doParallel")
library(foreach)
library(doParallel)
workers <- makeCluster(2) # start small to test
registerDoParallel(workers)
foreach(i = claves, .combine=rbind, .errorhandling = "remove") %do% {
url <- sprintf("http://www.beta.inegi.org.mx/app/api/buscador/busquedaTodos/%s_A/RANKING/es", i)
ea <- fromJSON(url)
ea %>% dplyr::select(TITULO, CVE_UNICA, EDICION)
} -> results
stopCluster(workers)
dim(results)
[1] 50 39
results$CVE_UNICA
[1] "889463489610" "702825634490" "702825271480" "702825637149" "702825267483" "889463489603" "702825213497" "702825709792"
"702825277420" "702825711115"
[11] "702825267476" "702825637101" "702825272128" "889463489597" "702825266967" "702825709327" "702825215576" "702825634582"
"702825214098" "702825709754"
[21] "702825271510" "702825003959" "702825271503" "702825208332" "702825213510" "702825709747" "702825213503" "702825709815"
"702825267506" "889463489627"
[31] "702825267490" "702825637170" "702825271497" "702825711269" "702825268114" "889463489580" "702825219352" "702825708375"
"702825215590" "702825708382"
[41] "702825215583" "702825637330" "702825214104" "702825649418" "702825203894" "702825003962" "702825203887" "702825712617"
"702825212421" "702825637293"
Please see this source for several good examples.
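For completeness, a minimal sketch of the same loop actually running in parallel rather than with %do% (the only assumptions beyond the answer above are the %dopar% operator and the .packages argument, so each worker loads jsonlite and dplyr):

library(foreach)
library(doParallel)

workers <- makeCluster(2)
registerDoParallel(workers)

results <- foreach(i = claves, .combine = rbind, .errorhandling = "remove",
                   .packages = c("jsonlite", "dplyr")) %dopar% {
  url <- sprintf("http://www.beta.inegi.org.mx/app/api/buscador/busquedaTodos/%s_A/RANKING/es", i)
  ea  <- fromJSON(url)
  ea %>% dplyr::select(TITULO, CVE_UNICA, EDICION)
}

stopCluster(workers)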

How could this custom process be done in parallel, or on multiple cores?

I am trying to figure out how I could use any of the parallel processing packages, like foreach or doParallel, in this random forest loop I have created:
ModelInfo <- data.frame ( model=as.numeric()
,Nodesize=as.numeric()
,Mrty=as.numeric()
,Maxdepth=as.numeric()
,Cp=as.numeric()
,Accuracy_Training=as.numeric()
,AUC_Training=as.numeric())
w=1
set.seed(1809)
NumberOfSamples=1
# Number of iterations
rfPred=list()
pred=list()
roundpred=list()
cTab=list()
Acc=list()
pred.to.roc=list()
pred.rocr=list()
perf.rocr=list()
AUC=list()
Var_imp=list()
rf_model_tr = list()
length(rf_model_tr) <- NumberOfSamples
for (i in 1:NumberOfSamples)
{
rf_model_tr[[i]] = list()
rfPred[[i]]=list()
pred[[i]]=list()
roundpred[[i]]=list()
cTab[[i]]=list()
Acc[[i]]=list()
pred.to.roc[[i]]=list()
pred.rocr[[i]]=list()
perf.rocr[[i]]=list()
AUC[[i]]=list()
Var_imp[[i]]=list()
## Tune nodesize
nodesize =c(10,20,50,80,100,200)
n=length(nodesize)
length(rf_model_tr[[i]]) <- n
for ( j in 1: length (nodesize))
{
rf_model_tr[[i]][[j]] = list()
rfPred[[i]][[j]]=list()
pred[[i]][[j]]=list()
roundpred[[i]][[j]]=list()
cTab[[i]][[j]]=list()
Acc[[i]][[j]]=list()
pred.to.roc[[i]][[j]]=list()
pred.rocr[[i]][[j]]=list()
perf.rocr[[i]][[j]]=list()
AUC[[i]][[j]]=list()
Var_imp[[i]][[j]]=list()
## Tune mrty
mrtysize =c(2,3,4)
m=length(mrtysize)
length(rf_model_tr[[i]][[j]]) <- m
for ( k in 1: length (mrtysize))
{
rf_model_tr[[i]][[j]][[k]] = list()
rfPred[[i]][[j]][[k]]=list()
pred[[i]][[j]][[k]]=list()
roundpred[[i]][[j]][[k]]=list()
cTab[[i]][[j]][[k]]=list()
Acc[[i]][[j]][[k]]=list()
pred.to.roc[[i]][[j]][[k]]=list()
pred.rocr[[i]][[j]][[k]]=list()
perf.rocr[[i]][[j]][[k]]=list()
AUC[[i]][[j]][[k]]=list()
Var_imp[[i]][[j]][[k]]=list()
## Tune maxdepth
maxdep =c(10,20,30)
z=length(maxdep)
length(rf_model_tr[[i]][[j]][[k]]) <- z
for (l in 1:length (maxdep))
{
rf_model_tr[[i]][[j]][[k]][[l]] = list()
rfPred[[i]][[j]][[k]][[l]]=list()
pred[[i]][[j]][[k]][[l]]=list()
roundpred[[i]][[j]][[k]][[l]]=list()
cTab[[i]][[j]][[k]][[l]]=list()
Acc[[i]][[j]][[k]][[l]]=list()
pred.to.roc[[i]][[j]][[k]][[l]]=list()
pred.rocr[[i]][[j]][[k]][[l]]=list()
perf.rocr[[i]][[j]][[k]][[l]]=list()
AUC[[i]][[j]][[k]][[l]]=list()
Var_imp[[i]][[j]][[k]][[l]]=list()
## Tune cp
cp =c(0,0.01,0.001)
p=length(cp)
length(rf_model_tr[[i]][[j]][[k]][[l]]) <- p
for (m in 1:length (cp))
{
rf_model_tr[[i]][[j]][[k]][[l]][[m]]= randomForest (as.factor(class) ~.
, data=train,mtry=mrtysize[[k]],maxDepth = maxdep[[l]], replace=F, importance=T, do.trace=10, ntree=200,nodesize=nodesize[j],cp=cp[[m]])
#Accuracy
rfPred[[i]][[j]][[k]][[l]][[m]] <- predict(rf_model_tr[[i]][[j]][[k]][[l]][[m]], train, type = "prob")
pred[[i]][[j]][[k]][[l]][[m]] <- colnames(rfPred[[i]][[j]][[k]][[l]][[m]] )[apply(rfPred[[i]][[j]][[k]][[l]][[m]] ,1,which.max)]
cTab[[i]][[j]][[k]][[l]][[m]] = table(pred[[i]][[j]][[k]][[l]][[m]],train$class)
Acc[[i]][[j]][[k]][[l]][[m]]<- sum(diag(cTab[[i]][[j]][[k]][[l]][[m]])) / sum(cTab[[i]][[j]][[k]][[l]][[m]])
#AUC
pred.to.roc[[i]][[j]][[k]][[l]][[m]]<-rfPred[[i]][[j]][[k]][[l]][[m]][,2]
pred.rocr[[i]][[j]][[k]][[l]][[m]]<-prediction(pred.to.roc[[i]][[j]][[k]][[l]][[m]],as.factor(train$class))
perf.rocr[[i]][[j]][[k]][[l]][[m]]<-performance(pred.rocr[[i]][[j]][[k]][[l]][[m]],measure="auc",x.measure="cutoff")
AUC[[i]][[j]][[k]][[l]][[m]]<-as.numeric(perf.rocr[[i]][[j]][[k]][[l]][[m]]@y.values)
#Variable Importance
Var_imp[[i]][[j]][[k]][[l]][[m]]<-(importance(rf_model_tr[[i]][[j]][[k]][[l]][[m]],type=2))
ModelInfo[w,1]<-w
ModelInfo[w,2]<-nodesize[[j]]
ModelInfo[w,3]<-mrtysize[[k]]
ModelInfo[w,4]<-maxdep[[l]]
ModelInfo[w,5]<-cp[[m]]
ModelInfo[w,6]<-Acc[[i]][[j]][[k]][[l]][[m]]
ModelInfo[w,7]<-AUC[[i]][[j]][[k]][[l]][[m]]
w=w+1
}
}
}
}
}
Basically, what I am doing is creating all possible model variations from one dataset, based on the available tuning parameters for a random forest (nodesize, cp, etc.), and storing that information in the ModelInfo table as each iteration goes by. In addition, I add measures like accuracy and AUC so that I can compare the different models at the end and make a pick.
The reason I am looking for an alternative is that the caret package only lets me tune mtry. It does offer parRF, which could solve my problem, but I would prefer to incorporate something here. How would that be possible?
I have read about the foreach and doParallel packages, but I don't quite see how the syntax would work here.
If the initial data is needed, please let me know; at this point I just wanted to show the part that needs to be computed in parallel.
Thank you in advance
Hi, I normally just code everything manually. On Linux/Mac I use the parallel package and mclapply(), which can use memory forking. Forked processes use less memory and start up faster. Windows does not support forking, so there I use the doParallel package (other packages would also do). The foreach() function is a user-friendly parallel mapper. I find myself spending more time setting up single-PC parallel computing than I save from the speed-up. Still fun :)
If you work at a university, you may have access to a large cluster. The BatchJobs package is another mapper that can use many different backends, e.g. a Torque/PBS queue system. I can borrow 80 nodes with 4 CPUs each, giving a potential 320-fold speedup (more like 150-fold in practice). I learned about BatchJobs from this great introduction. I like that BatchJobs can also run single- or multi-core locally, which is much easier to debug.
The code below shows how to create and run a list of jobs with either foreach or BatchJobs. Each job is a set of arguments; the job arguments are merged with the standard arguments and a model is trained. Some statistics are returned, and all results and arguments are combined into a data.frame.
useForeach = FALSE #If FALSE, will run as batchjobs. Only faster for cluster computing.
library(randomForest)
#load a data set
url = "http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv"
download.file(url,destfile="winequality-white.csv",mode="w")
wwq = read.csv(file="winequality-white.csv",header=T,sep=";")
X = wwq[,names(wwq) != "quality"]
y = wwq[,"quality"]
#2 - make jobs
pars = expand.grid(
mtry = c(1:3),
sampsize = floor(seq(1000,1898,length.out = 3)),
nodesize = c(1,3)
)
jobs = lapply(1:dim(pars)[1], function(i) pars[i,])
#3 - make a node function that will execute a number of jobs
test.pars = function(someJobs,useForeach=TRUE) {
#if running cluster, global environment imported manually
if(!useForeach) load(file="thisState.rda")
do.call(rbind,lapply(someJobs,function(aJob){ #do jobs and bind results by rows
print(aJob)
merged.args = c(alist(x=X,y=y),as.list(aJob)) #merge std. and job args
run.time = system.time({rfo = do.call(randomForest,merged.args)}) #run a job
data.frame(accuracy=tail(rfo$rsq,1),run.time=run.time[3],mse=tail(rfo$mse,1))
}))
}
##test function single core
jobsToTest = 1:5
out = test.pars(jobs[jobsToTest])
print(cbind(out,do.call(rbind,jobs[jobsToTest])))
#4a execute jobs with foreach package:
if(useForeach) {
library(foreach)
library(doParallel)
CPUs=4
cl = makeCluster(CPUs)#works both for windows and linux, otherwise forking is better
registerDoParallel(cl)
nodes=min(CPUs,length(jobs)) #how many splits of jobList, not so important for foreach...
job.array = suppressWarnings(split(jobs,1:nodes)) #split warns if each core cannot get same amount of jobs
out = foreach(i=job.array,.combine=rbind,.packages="randomForest") %dopar% test.pars(i)
stopCluster(cl)
} else {
library(BatchJobs)
#4b - execute jobs with BatchJobs package (read manual how to set up on cluster)
nodes=min(80,length(jobs)) # how many nodes to split job onto
job.array = split(jobs,1:nodes)
save(list=ls(),file="thisState.rda") #export this state(global environment) to every node
#initiate run
reg = makeRegistry(id ="myFirstBatchJob",packages="randomForest")
batchMap(reg,fun=test.pars,someJobs = job.array,more.args=list(useForeach=FALSE))
submitJobs(reg)
waitForJobs(reg)
out = loadResults(reg)
#6 - wrap up, save final results to file
finalResult = cbind(do.call(rbind,jobs),do.call(rbind,out))
save(out,file="finalResult.rda")
removeRegistry(reg,ask="no")
}
#7- print final result
print(cbind(do.call(rbind,jobs),out))
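To tie this back to the grid in the question, here is a minimal sketch of the same flattened-grid idea using foreach only, assuming `train` (with a factor column `class`) exists in the global environment; note that randomForest() itself has no maxdepth or cp arguments (cp belongs to rpart), so only nodesize and mtry are varied here:

library(randomForest)
library(foreach)
library(doParallel)
library(iterators)

# One row per parameter combination, iterated over in parallel
grid <- expand.grid(nodesize = c(10, 20, 50, 80, 100, 200),
                    mtry     = c(2, 3, 4))

cl <- makeCluster(4)
registerDoParallel(cl)

ModelInfo <- foreach(g = iter(grid, by = "row"), .combine = rbind,
                     .packages = "randomForest") %dopar% {
  fit  <- randomForest(as.factor(class) ~ ., data = train, ntree = 200,
                       nodesize = g$nodesize, mtry = g$mtry)
  prob <- predict(fit, train, type = "prob")
  pred <- colnames(prob)[apply(prob, 1, which.max)]
  data.frame(g, Accuracy_Training = mean(pred == train$class))
}

stopCluster(cl)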

using package snow's parRapply: argument missing error

I want to find documents whose similarity with other documents is larger than a given value (0.1), by splitting the documents into blocks.
library(tm)
data("crude")
sample.dtm <- DocumentTermMatrix(
crude, control=list(
weighting=function(x) weightTfIdf(x, normalize=FALSE),
stopwords=TRUE
)
)
step = 5
n = nrow(sample.dtm)
block = n %/% step
start = (c(1:block)-1)*step+1
end = start+step-1
j = unlist(lapply(1:(block-1),function(x) rep(((x+1):block),times=1)))
i = unlist(lapply(1:block,function(x) rep(x,times=(block-x))))
ij <- cbind(i,j)
library(skmeans)
getdocs <- function(k){
ci <- c(start[k[[1]]]:end[k[[1]]])
cj <- c(start[k[[2]]]:end[k[[2]]])
combi <- sample.dtm[ci]
combj <- sample.dtm[cj]
rownames(combi)<-ci
rownames(combj)<-cj
comb<-c(combi,combj)
sim<-1-skmeans_xdist(comb)
cat("Block", k[[1]], "with Block", k[[2]], "\n")
flush.console()
tri.sim<-upper.tri(sim,diag=F)
results<-tri.sim & sim>0.1
docs<-apply(results,1,function(x) length(x[x==TRUE]))
docnames<-names(docs)[docs>0]
gc()
return (docnames)
}
It works well when using apply
system.time(rmdocs<-apply(ij,1,getdocs))
When using parRapply
library(snow)
library(skmeans)
cl<-makeCluster(2)
clusterExport(cl,list("getdocs","sample.dtm","start","end"))
system.time(rmdocs<-parRapply(cl,ij,getdocs))
Error:
Error in checkForRemoteErrors(val) :
2 nodes produced errors; first error: attempt to set 'rownames' on an object with no dimensions
Timing stopped at: 0.01 0 0.04
It seems that sample.dtm couldn't be used in parRapply. I'm confused. Can anyone help me? Thanks!
In addition to exporting objects, you need to load the necessary packages on the cluster workers. In your case, the result of not doing so is that there isn't a dimnames method defined for "DocumentTermMatrix" objects, causing rownames<- to fail.
You can load packages on the cluster workers with the clusterEvalQ function:
clusterEvalQ(cl, { library(tm); library(skmeans) })
After doing that, rownames(combi)<-ci will work correctly.
Also, if you want to see the output from cat, you should use the makeCluster outfile argument:
cl <- makeCluster(2, outfile='')
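Putting both points together, a minimal sketch of the worker setup (object names as in the question):

library(snow)

cl <- makeCluster(2, outfile = "")                    # outfile = '' forwards cat() output to the console
clusterEvalQ(cl, { library(tm); library(skmeans) })   # load the needed packages on every worker
clusterExport(cl, list("getdocs", "sample.dtm", "start", "end"))

system.time(rmdocs <- parRapply(cl, ij, getdocs))
stopCluster(cl)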
