Parallel computing when using 'CORElearn' in R

I'm using ReliefF for feature selection (via the package "CORElearn"). It worked very well before, but now I want to speed up my code. Since my code uses bootstrapping (each loop does exactly the same thing, including running ReliefF), I'm using the 'parallel' package for parallel computing. But I've realized that every time the code reaches the ReliefF part, it just gets stuck there.
The relevant code is as follows:
library(parallel)
library(CORElearn)

num.round <- 10                   # number of bootstrap rounds
rounds.btsp <- seq(1, num.round)  # round indices, used for parallel computing

boot.strap <- function(round.btsp) {
  ## some code using other feature selection methods
  print('Finish feature selection using other methods')  # I can get this output

  # use ReliefF to rank the features
  data.ref <- data.frame(t(x.train.resample), y.train.resample,
                         check.names = F)  # check.names = F keeps '-' from being changed to '.'
  print('Start using attrEval')  # I'll get this output, but then I get stuck here
  estReliefF <- attrEval('y.train.resample', data.ref,
                         estimator = 'ReliefFexpRank', ReliefIterations = 30)
  names(estReliefF) <- fea.name  # needed because 'attrEval' annoyingly changes '-' in the names to '.'
  print('Start using estReliefF')  # I never get here

  fea.rank.ref <- estReliefF[order(abs(estReliefF), decreasing = T)]
  fea.rank.ref <- data.frame(importance = fea.rank.ref)
  fea.rank.name.ref <- rownames(fea.rank.ref)  # the ranked feature list for this round
  return(fea.rank.name.ref)
}

results.btsp <- mclapply(rounds.btsp, boot.strap, mc.cores = num.round)
What I'm thinking now is that the function 'attrEval' itself uses multiple cores for parallel computing (I read that in the documentation: https://cran.r-project.org/web/packages/CORElearn/CORElearn.pdf), and that this somehow conflicts with the 'parallel' package that I'm using. When I change 'num.round' to 1, the code runs without problems (but even if I set it to just 2, it won't work).
The server that I'm using has 80 cores.
Is there a way to solve this? I'm thinking that shutting down the parallel computing inside the function 'attrEval' may be a solution, even though I don't know how to do that.

Having multiple levels of parallelism can be tricky. Unfortunately, the CORElearn package does not let you directly control the number of threads it uses. Since it uses OpenMP for parallel execution, you can try setting the environment variable OMP_NUM_THREADS appropriately, e.g.
Sys.setenv(OMP_NUM_THREADS = 8)
num.round <- 10
This way there should be 10 groups of 8 cores, each group handling one bootstrap round.
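A minimal sketch of how this fits with the code from the question (assuming the same boot.strap() function as above; the environment variable should be set before the workers are forked):
library(parallel)

Sys.setenv(OMP_NUM_THREADS = 8)  # cap CORElearn's OpenMP threads in each worker
num.round <- 10                  # 10 workers x 8 OpenMP threads = 80 cores on this server
results.btsp <- mclapply(seq_len(num.round), boot.strap, mc.cores = num.round)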

I got a solution from Marko, a contributor to the package: disable multithreading in CORElearn by using the parameter maxThreads = 1 in 'attrEval'.
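Applied to the call in the question, that would look roughly like this:
estReliefF <- attrEval('y.train.resample', data.ref,
                       estimator = 'ReliefFexpRank', ReliefIterations = 30,
                       maxThreads = 1)  # let the 'parallel' package handle the parallelism instead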


'mc.cores' > 1 is not supported on Windows

I am new to programming and I am trying to use parallel processing in R on Windows, starting from existing code.
Here is a snippet of my code:
library(parallel)

if (length(grep("linux", R.version$os)) == 1) {
  num_cores <- detectCores()
  impact_list <- mclapply(len_a, impact_func, mc.cores = (num_cores - 1))
}
# else if (length(grep("mingw32", R.version$os)) == 1) {
#   num_cores <- detectCores()
#   impact_list <- mclapply(len_a, impact_func, mc.cores = (num_cores - 1))
# }
else {
  impact_list <- lapply(len_a, impact_func)
}
return(sum(unlist(impact_list, use.names = F)))
This works fine. I am using R on Windows, so the code enters the 'else' branch and runs lapply() rather than parallel processing.
I added the 'else if' block to make it work on Windows, but when I uncomment it and run the code, I get the error "'mc.cores' > 1 is not supported on Windows".
Please suggest how I can use parallel processing on Windows so that the code takes less time to run.
Any help will be appreciated.
(disclaimer: I'm the author of the future framework)
The future.apply package provides parallel versions of R's built-in "apply" functions. It is cross-platform, i.e. it works on Linux, macOS, and Windows. The package often lets you simply replace an existing lapply() call with future_lapply(), e.g.
library(future.apply)
plan(multisession)

your_fcn <- function(len_a) {
  impact_list <- future_lapply(len_a, impact_func)
  sum(unlist(impact_list, use.names = FALSE))
}
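For example, with a toy impact_func standing in for the real one (hypothetical, just to show the call):
impact_func <- function(x) { Sys.sleep(0.1); x^2 }  # hypothetical stand-in for the real function
your_fcn(1:100)  # the 100 calls are spread across the multisession workers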
Regarding mclapply() per se: if you use parallel::mclapply() in your code, make sure there is always an option not to use it. The reason is that it is not guaranteed to work in all environments; that is, it might be unstable and crash R. In the R-devel thread 'mclapply returns NULLs on MacOS when running GAM' (https://stat.ethz.ch/pipermail/r-devel/2020-April/079384.html), the author of mclapply() wrote on 2020-04-28:
Do NOT use mcparallel() in packages except as a non-default option that user can set for the reasons Henrik explained. Multicore is intended for HPC applications that need to use many cores for computing-heavy jobs, but it does not play well with RStudio and more importantly you don't know the resource available so only the user can tell you when it's safe to use. Multi-core machines are often shared so using all detected cores is a very bad idea. The user should be able to explicitly enable it, but it should not be enabled by default.
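One way to follow that advice is to fall back to a sequential lapply() unless the user explicitly opts in; a minimal sketch reusing len_a and impact_func from the question (the option name "my.cores" is made up):
n_cores <- getOption("my.cores", 1L)  # the user must opt in explicitly
if (.Platform$OS.type != "windows" && n_cores > 1L) {
  impact_list <- parallel::mclapply(len_a, impact_func, mc.cores = n_cores)
} else {
  impact_list <- lapply(len_a, impact_func)
}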

ViSEAGO tutorial: visualising topGO object

Earlier, I posted a question and, after some help, was able to load in my data successfully and create a topGO object. I'm trying to visualise GO terms that are significantly associated with the list of differentially expressed genes that I have from mouse RNA-seq data.
Now, I'd like to raise a concern about ViSEAGO's tutorial. The tutorial initially specifies loading two files, 'selection.txt' and 'background.txt', whose origin is not clearly stated. After a lot of digging into topGO's documentation, I was able to work out the data type of each file. But even after following this, I have a problem running the following code. Does anyone have any insights to share?
WORKING CODE:
library(topGO)
library(org.Mm.eg.db)

mysampleGOdata <- new("topGOdata",
                      description = "my Simple session",
                      ontology = "BP",
                      allGenes = geneList_new,
                      nodeSize = 1,
                      annot = annFUN.org,
                      mapping = "org.Mm.eg.db",
                      ID = "SYMBOL")
resultFisher <- runTest(mysampleGOdata, algorithm = "classic", statistic = "fisher")
head(GenTable(mysampleGOdata, fisher = resultFisher), 20)
myNewBP <- GenTable(mysampleGOdata, fisher = resultFisher)
PROBLEMS:
> head(myNewBP,2)
GO.ID Term Annotated Significant Expected fisher
1 GO:0006006 glucose metabolic process 194 12 0.19 1.0e-19
2 GO:0019318 hexose metabolic process 223 12 0.22 5.7e-19
> ###################
> # merge results
> myBP_sResults<-ViSEAGO::merge_enrich_terms(
+ Input=list(
+ condition=c("mysampleGOdata","resultFisher")
+ )
+ )
Error in setnames(x, value) :
Can't assign 3 names to a 2 column data.table
> myNewBP<-GenTable(mysampleGOdata,fisher=resultFisher)
> ###################
> # display the merged table
> ViSEAGO::show_table(myNewBP)
Error in ViSEAGO::show_table(myNewBP) :
object must be enrich_GO_terms, GO_SS, or GO_clusters class objects
According to the tutorial, the printed table should contain, for each enriched GO term, additional columns including the list of significant genes and the frequency (ratio of the number of significant genes to the number of background genes) evaluated in the comparison. I think I have that, but it's definitely not working.
Can someone see why? I'm not very clear on this.
Thanks!
I think you are trying to work around an error you made at the beginning. You receive the error because you did not use the wrapper function from the ViSEAGO package. As you stated in your last question, you had initial problems formatting your data.
Here are some tips:
The "selection" file is a character vector with the names or IDs of your DEGs. I recommend using Entrez IDs.
The "background" file is a character vector of known genes. I recommend using Entrez IDs here as well. You can easily generate this character vector with:
background <- keys(org.Hs.eg.db, keytype = 'ENTREZID')
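For the mouse data in the question, a minimal sketch of how the two vectors might be built (assuming the org.Mm.eg.db annotation package and a DEG result table called deg_results, neither of which is part of the original answer):
library(org.Mm.eg.db)

# hypothetical: Entrez IDs of your differentially expressed genes
selection <- as.character(deg_results$entrez_id)

# background: all known mouse Entrez IDs
background <- keys(org.Mm.eg.db, keytype = "ENTREZID")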
With these two files, you can easily proceed to the next steps of the package as described in the vignette.
library(ViSEAGO)
library(topGO)

# connect to EntrezGene
EntrezGene <- ViSEAGO::EntrezGene2GO()

# load GO annotations from EntrezGene,
# with the addition of GO annotations from ortholog genes (see above)
# id = "9606" = Homo sapiens
myGENE2GO <- ViSEAGO::annotate(id = "9606", EntrezGene)

BP <- ViSEAGO::create_topGOdata(
  geneSel = selection,    # your DEG vector
  allGenes = background,  # the background vector created above
  gene2GO = myGENE2GO,
  ont = "BP",
  nodeSize = 5
)

classic <- topGO::runTest(
  BP,
  algorithm = "classic",
  statistic = "fisher"
)

# merge results
BP_sResults <- ViSEAGO::merge_enrich_terms(
  Input = list(
    condition = c("BP", "classic")
  )
)
You should get a merged list of your enriched GO terms with the corresponding statistical tests you prefer.
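From there, displaying the merged table as attempted in the question should work, because show_table() now receives an enrich_GO_terms object (a sketch, not part of the original answer):
# display the merged enrichment table
ViSEAGO::show_table(BP_sResults)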
I faced this problem recently and it was very frustrating. In my case the whole issue seemed to be related to the package version I was using.
I used conda to install ViSEAGO. However, the R version in my conda environment was a bit old (3.6.1, to be specific), so conda installed version 1.0.0 of the package. Please note that the most recent version of ViSEAGO is 1.4.0.
I therefore created a conda environment with R version 4.0.3 and repeated the procedure to install ViSEAGO with conda. This time ViSEAGO 1.4.0 was installed, and everything went fine.
I've tried to backtrack the error and found only one thing: in the older ViSEAGO version, the function Custom2GO loaded tables with 4 columns; the most recent version accepts 5 columns (the new one being 'gene_symbol'). I think this disagreement might be part of the issue, as the source code of merge_enrich_terms seems to deal with the columns 'gene_id' and 'gene_symbol' at some point, but I'm not sure.
Hope you find my comment helpful!
Cheers,
Mauricio

How can I run multiple independent and unrelated functions in parallel without larger code do-over?

I've been searching around the internet, trying to understand parallel processing.
The examples all seem to assume I have some kind of loop operating on, e.g., every Nth row of a data set, divided among N cores and combined afterwards, and I'm pointed towards a lot of parallelized apply() functions.
(Warning, ugly code below)
My situation, though, is that I have calls of the form
tempJob <- myFunction(filepath, string.arg1, string.arg2)
where the path is a file location, and the string arguments are various ways of sorting my data.
My current workflow is simply amassing a lot of
tempjob1 <- myFunction(args)
tempjob2 <- myFunction(other args)
...
tempjobN <- myFunction(some other args here)

# Make a list of all temporary outputs in the global environment
temp.list <- lapply(ls(pattern = "temp"), get)
# Stack them all (rbindlist() is from data.table)
df <- rbindlist(temp.list)
# Remove all variables from workspace matching "temp"
rm(list = ls(pattern = "temp"))
These jobs are entirely independent and could in principle be run in 8 separate instances of R (although that would be a bother to manage, I guess). How can I farm the first 8 jobs out to 8 cores, so that whenever a core finishes its job and returns the treated dataset to the global environment, it simply takes whichever job is next in line?
With the future package (I'm the author) you can achieve what you want with a minor modification to your code: use "future" assignments %<-% instead of regular assignments <- for the code you want to run asynchronously.
library("future")
plan(multisession)
tempjob1 %<-% myFunction(args)
tempjob2 %<-% myFunction(other args)
...
tempjobN %<-% myFunction(some other args here)
temp.list <- lapply(ls(pattern = "temp"), get)
EDIT 2022-01-04: plan(multiprocess) -> plan(multisession) since multiprocess is deprecated and will eventually be removed.
Unless you are unfortunate enough to be using Windows, you could maybe try with GNU Parallel like this:
parallel Rscript ::: script1.R script2.R JOB86*.R
That would keep 8 scripts running at a time if your CPU has 8 cores. You can change it with -j 4 if you just want 4 at a time. The JOB86 part is just a random name I made up.
You can also add switches for a progress bar, for how to handle errors, for adding parameters and distributing jobs across multiple machines.
If you are on a Mac, you can install GNU Parallel with homebrew:
brew install parallel
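Each scriptN.R would then be a self-contained job that saves its own result, roughly like this (the file names and arguments are made up):
# script1.R -- one independent job: run myFunction once and save the result
source("myFunction.R")                                            # assumption: myFunction is defined here
tempjob1 <- myFunction("data/file1.csv", "someArg1", "someArg2")  # hypothetical arguments
saveRDS(tempjob1, "results/job1.rds")
Afterwards the results can be stacked in a fresh R session with something like df <- data.table::rbindlist(lapply(list.files("results", full.names = TRUE), readRDS)).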
I think the easiest way is to use one of the parallelized apply functions. Those will do all the fiddly work of separating out the jobs, taking whichever job is next in line, etc.
Put all your arguments into a list:
args <- list(
  list(filePath1, stringArgs11, stringArgs21),
  list(filePath2, stringArgs12, stringArgs22),
  ...
  list(filePath8, stringArgs18, stringArgs28)
)
Then do something like
library(parallel)
cl <- makeCluster(detectCores())
df <- parSapply(cl, args, myFunction)
stopCluster(cl)
I'm not sure about parSapply, and I can't check because R isn't working on my machine just now. If that doesn't work, use parLapply and then manipulate the result.
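Since each element of args is itself a list of three arguments, one way to spread them into myFunction is a do.call() wrapper with parLapply(), then stack the results as in the question (a sketch, untested against the real myFunction):
library(parallel)
library(data.table)

cl <- makeCluster(detectCores())
clusterExport(cl, "myFunction")  # needed if myFunction lives only in the global environment
res.list <- parLapply(cl, args, function(a) do.call(myFunction, a))
stopCluster(cl)

df <- rbindlist(res.list)  # same stacking step as in the question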

SparkR doubts and Broken pipe exception

Hi, I'm working with SparkR in distributed mode on a YARN cluster.
I have two questions:
1) If I write a script that contains both plain R code and SparkR code, will it distribute just the SparkR code, or the plain R too?
This is the script. I read a CSV and take just the first 100k records.
I clean it (with R functions), deleting NA values, and create a SparkR DataFrame.
This is what it does: for each LINESET, take each TimeInterval where that LINESET appears, sum some numeric attributes, and put them all into a matrix.
This is the script with R and SparkR code. It takes 7 hours in standalone mode and about 60 hours in distributed mode (killed by java.net.SocketException: Broken pipe).
library(data.table)  # for fread()
library(SparkR)      # sqlContext is assumed to be initialised elsewhere

LineSmsInt <- fread("/home/sentiment/Scrivania/LineSmsInt.csv")
Short <- LineSmsInt[1:100000, ]
Short[is.na(Short)] <- 0
Short$TimeInterval <- Short$TimeInterval / 1000
ShortDF <- createDataFrame(sqlContext, Short)

UniqueLineSet <- unique(Short$LINESET)
UniqueTime <- unique(Short$TimeInterval)
UniqueTime <- as.numeric(UniqueTime)

Row <- length(UniqueLineSet) * length(UniqueTime)
IntTemp <- matrix(nrow = Row, ncol = 7)
k <- 1
colnames(IntTemp) <- c("LINESET", "TimeInterval", "SmsIN", "SmsOut", "CallIn", "CallOut", "Internet")
Sys.time()

for (i in 1:length(UniqueLineSet)) {
  SubSetID <- filter(ShortDF, ShortDF$LINESET == UniqueLineSet[i])
  for (j in 1:length(UniqueTime)) {
    SubTime <- filter(SubSetID, SubSetID$TimeInterval == UniqueTime[j])
    IntTemp[k, 1] <- UniqueLineSet[i]
    IntTemp[k, 2] <- as.numeric(UniqueTime[j])
    k3 <- collect(select(SubTime, sum(SubTime$SmsIn)))
    IntTemp[k, 3] <- k3[1, 1]
    k4 <- collect(select(SubTime, sum(SubTime$SmsOut)))
    IntTemp[k, 4] <- k4[1, 1]
    k5 <- collect(select(SubTime, sum(SubTime$CallIn)))
    IntTemp[k, 5] <- k5[1, 1]
    k6 <- collect(select(SubTime, sum(SubTime$CallOut)))
    IntTemp[k, 6] <- k6[1, 1]
    k7 <- collect(select(SubTime, sum(SubTime$Internet)))
    IntTemp[k, 7] <- k7[1, 1]
    k <- k + 1
  }
  print(UniqueLineSet[i])
  print(i)
}
This is the plain R script; the only things that change are the subset function and, of course, that it uses a normal R data.frame, not a SparkR DataFrame.
It takes 1.30 minutes in standalone mode.
Why is it so fast in plain R and so slow in SparkR?
for (i in 1:length(UniqueLineSet)) {
  SubSetID <- subset.data.frame(LineSmsInt, LINESET == UniqueLineSet[i])
  for (j in 1:length(UniqueTime)) {
    SubTime <- subset.data.frame(SubSetID, TimeInterval == UniqueTime[j])
    IntTemp[k, 1] <- UniqueLineSet[i]
    IntTemp[k, 2] <- as.numeric(UniqueTime[j])
    IntTemp[k, 3] <- sum(SubTime$SmsIn, na.rm = TRUE)
    IntTemp[k, 4] <- sum(SubTime$SmsOut, na.rm = TRUE)
    IntTemp[k, 5] <- sum(SubTime$CallIn, na.rm = TRUE)
    IntTemp[k, 6] <- sum(SubTime$CallOut, na.rm = TRUE)
    IntTemp[k, 7] <- sum(SubTime$Internet, na.rm = TRUE)
    k <- k + 1
  }
  print(UniqueLineSet[i])
  print(i)
}
2) The first script, in distributed mode, was killed by:
java.net.SocketException: Broken pipe
and sometimes this appears too:
java.net.SocketTimeoutException: Accept timed out
Could it come from a bad configuration? Any suggestions?
Thanks.
Don't take this the wrong way, but it is not a particularly well written piece of code. It is already inefficient using core R, and adding SparkR to the equation makes it even worse.
If I write a script that contains both plain R code and SparkR code, will it distribute just the SparkR code, or the plain R too?
Unless you're using distributed data structures, and functions which operate on those structures, it is just plain R code executed in a single thread on the master.
Why is it so fast in plain R and so slow in SparkR?
For starters, you execute a single job for each combination of LINESET, UniqueTime and column. Each time, Spark has to scan all the records and fetch data to the driver.
Moreover, using Spark to handle data that can easily be handled in the memory of a single machine simply doesn't make sense. The cost of running the job in a case like this is usually much higher than the cost of the actual processing.
Any suggestions?
If you really want to use SparkR, just use groupBy and agg on the Spark DataFrame:
group_by(ShortDF, ShortDF$LINESET, ShortDF$TimeInterval) %>% agg(
  sum(ShortDF$SmsIn), sum(ShortDF$SmsOut), sum(ShortDF$CallIn),
  sum(ShortDF$CallOut), sum(ShortDF$Internet))
If you care about missing (LINESET, TimeInterval) pairs, fill them in using either join or unionAll.
In practice I would simply stick with data.table and aggregate locally:
Short[, lapply(.SD, sum, na.rm=TRUE), by=.(LINESET, TimeInterval)]
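A slightly fuller sketch of that local route, restricted to the numeric columns and with the optional step of filling in missing (LINESET, TimeInterval) pairs via a cross join (column names taken from the question, untested against the real data):
library(data.table)

# aggregate the numeric columns per (LINESET, TimeInterval)
agg <- Short[, lapply(.SD, sum, na.rm = TRUE),
             by = .(LINESET, TimeInterval),
             .SDcols = c("SmsIn", "SmsOut", "CallIn", "CallOut", "Internet")]

# optionally add rows for (LINESET, TimeInterval) pairs that never occur together
full <- agg[CJ(LINESET = unique(Short$LINESET),
               TimeInterval = unique(Short$TimeInterval)),
            on = .(LINESET, TimeInterval)]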

Options for caching / memoization / hashing in R

I am trying to find a simple way to use something like Perl's hash functions in R (essentially caching), as I intend to do both Perl-style hashing and write my own memoisation of calculations. However, others have beaten me to the punch and have packages for memoisation. The more I dig, the more I find, e.g. memoise and R.cache, but the differences aren't readily clear. In addition, it's not clear how else one can get Perl-style hashes (or Python-style dictionaries) and write one's own memoization, other than to use the hash package, which doesn't seem to underpin the two memoization packages.
Since I can find no information on CRAN or elsewhere to distinguish between the options, perhaps this should be a community wiki question on SO: What are the options for memoization and caching in R, and what are their differences?
As a basis for comparison, here is a list of the options I've found. Also, it seems to me that all of them depend on hashing, so I'll note the hashing options as well. Key/value storage is somewhat related, but opens a huge can of worms regarding DB systems (e.g. BerkeleyDB, Redis, MemcacheDB and scores of others).
It looks like the options are:
Hashing
digest - provides hashing for arbitrary R objects.
Memoization
memoise - a very simple tool for memoization of functions.
R.cache - offers more functionality for memoization, though it seems some of the functions lack examples.
Caching
hash - Provides caching functionality akin to Perl's hashes and Python dictionaries.
Key/value storage
These are basic options for external storage of R objects.
stashR
filehash
Checkpointing
cacher - this seems to be more akin to checkpointing.
CodeDepends - An OmegaHat project that underpins cacher and provides some useful functionality.
DMTCP (not an R package) - appears to support checkpointing in a bunch of languages, and a developer recently sought assistance testing DMTCP checkpointing in R.
Other
Base R supports: named vectors and lists, row and column names of data frames, and names of items in environments. It seems to me that using a list is a bit of a kludge. (There's also pairlist, but it is deprecated.)
The data.table package supports rapid lookups of elements in a data table.
Use case
Although I'm mostly interested in knowing the options, I have two basic use cases that arise:
Caching: Simple counting of strings. [Note: This isn't for NLP, but general use, so NLP libraries are overkill; tables are inadequate because I prefer not to wait until the entire set of strings are loaded into memory. Perl-style hashes are at the right level of utility.]
Memoization of monstrous calculations.
These really arise because I'm digging into the profiling of some slooooow code and I'd really like to just count simple strings and see if I can speed up some calculations via memoization. Being able to hash the input values, even if I don't memoize, would let me see if memoization can help.
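For the memoisation use case, a minimal sketch with the memoise package (the slow function is hypothetical, purely to illustrate the pattern); digest is shown for hashing arbitrary inputs:
library(memoise)
library(digest)

slow_calc <- function(x) { Sys.sleep(1); x^2 }  # hypothetical monstrous calculation
fast_calc <- memoise(slow_calc)

fast_calc(4)  # takes ~1 s the first time
fast_calc(4)  # returned instantly from the in-memory cache

digest(list(4, "some", "inputs"))  # hash of an arbitrary R object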
Note 1: The CRAN Task View on Reproducible Research lists a couple of the packages (cacher and R.cache), but there is no elaboration on usage options.
Note 2: To aid others looking for related code, here a few notes on some of the authors or packages. Some of the authors use SO. :)
Dirk Eddelbuettel: digest - a lot of other packages depend on this.
Roger Peng: cacher, filehash, stashR - these address different problems in different ways; see Roger's site for more packages.
Christopher Brown: hash - Seems to be a useful package, but the links to ODG are down, unfortunately.
Henrik Bengtsson: R.cache & Hadley Wickham: memoise -- it's not yet clear when to prefer one package over the other.
Note 3: Some people use memoise/memoisation others use memoize/memoization. Just a note if you're searching around. Henrik uses "z" and Hadley uses "s".
I did not have luck with memoise because it caused a 'too deep recursion' problem with some functions of a package I tried it with. With R.cache I had better luck. Below is annotated code I adapted from the R.cache documentation. The code shows different options for caching:
# Workaround to avoid question when loading R.cache library
dir.create(path="~/.Rcache", showWarnings=F)
library("R.cache")
setCacheRootPath(path="./.Rcache") # Create .Rcache at current working dir
# In case we need the cache path, but not used in this example.
cache.root = getCacheRootPath()

simulate <- function(mean, sd) {
  # 1. Try to load cached data, if already generated
  key <- list(mean, sd)
  data <- loadCache(key)
  if (!is.null(data)) {
    cat("Loaded cached data\n")
    return(data)
  }
  # 2. If not available, generate it.
  cat("Generating data from scratch...")
  data <- rnorm(1000, mean=mean, sd=sd)
  Sys.sleep(1) # Emulate slow algorithm
  cat("ok\n")
  saveCache(data, key=key, comment="simulate()")
  data
}

data <- simulate(2.3, 3.0)
data <- simulate(2.3, 3.5)
a = 2.3
b = 3.0
data <- simulate(a, b) # Will load cached data, params are checked by value

# Clean up
file.remove(findCache(key=list(2.3,3.0)))
file.remove(findCache(key=list(2.3,3.5)))

simulate2 <- function(mean, sd) {
  data <- rnorm(1000, mean=mean, sd=sd)
  Sys.sleep(1) # Emulate slow algorithm
  cat("Done generating data from scratch\n")
  data
}
# Easy step to memoize a function
# (also possible to reassign the function name).
# This would work with any functions from external packages.
mzs <- addMemoization(simulate2)

data <- mzs(2.3, 3.0)
data <- mzs(2.3, 3.5)
data <- mzs(2.3, 3.0) # Will load cached data

# It is also possible to reassign the function name,
# but different memoizations of the same
# function will return the same cached result
# if the input params are the same
simulate2 <- addMemoization(simulate2)
data <- simulate2(2.3, 3.0)
# If the expression being evaluated depends on
# "input" objects, then these must be specified
# explicitly as "key" objects.
for (ii in 1:2) {
  for (kk in 1:3) {
    cat(sprintf("Iteration #%d:\n", kk))
    res <- evalWithMemoization({
      cat("Evaluating expression...")
      a <- kk
      Sys.sleep(1)
      cat("done\n")
      a
    }, key=list(kk=kk))
    # expressions inside 'res' are skipped on the repeated run
    print(res)
    # Sanity checks
    stopifnot(a == kk)
    # Clean up
    rm(a)
  } # for (kk ...)
} # for (ii ...)
For simple counting of strings (and not using table or similar), a multiset data structure seems like a good fit. The environment object can be used to emulate this.
# Define the insert function for a multiset
msetInsert <- function(mset, s) {
  if (exists(s, mset, inherits=FALSE)) {
    mset[[s]] <- mset[[s]] + 1L
  } else {
    mset[[s]] <- 1L
  }
}
# First we generate a bunch of strings
n <- 1e5L # Total number of strings
nus <- 1e3L # Number of unique strings
ustrs <- paste("Str", seq_len(nus))
set.seed(42)
strs <- sample(ustrs, n, replace=TRUE)
# Now we use an environment as our multiset
mset <- new.env(TRUE, emptyenv()) # Ensure hashing is enabled
# ...and insert the strings one by one...
for (s in strs) {
  msetInsert(mset, s)
}
# Now we should have nus unique strings in the multiset
identical(nus, length(mset))
# And the names should be correct
identical(sort(ustrs), sort(names(as.list(mset))))
# ...And an example of getting the count for a specific string
mset[["Str 3"]] # "Str 3" instance count (97)
Related to @biocyperman's solution: R.cache has a wrapper function that handles the loading, evaluation and saving of the cache for you, so you can simplify the code like this:
simulate <- function(mean, sd) {
  key <- list(mean, sd)
  data <- evalWithMemoization(key = key, expr = {
    cat("Generating data from scratch...")
    data <- rnorm(1000, mean=mean, sd=sd)
    Sys.sleep(1) # Emulate slow algorithm
    cat("ok\n")
    data
  })
}
