The bartMachine package for R is supposed to rely on parallel processing to reduce computing time, but I can't figure out how to make that work: the package's documentation repeatedly states that it supports parallel processing, yet gives no instructions on how to enable it, and I can see that only one of my PC's logical cores is being used.
I use Ubuntu 16.04.4 and I tried installing bartMachine via compilation from source, as recommended by its GitHub page, though I'm not sure I did everything correctly.
What can I do to make bartMachine finally work in parallel?
Have you tried running set_bart_machine_num_cores(num_cores) in R before running bartMachine? This did the trick for me.
See https://rdrr.io/cran/bartMachine/man/set_bart_machine_num_cores.html
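For example, a minimal sketch (the core count of 4 is just an illustration, and X and y stand in for your own training data; the cores must be set before building the model):
library(bartMachine)
set_bart_machine_num_cores(4)        # use 4 logical cores for model building
bart_machine <- bartMachine(X, y)    # this fit now runs in parallel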
I'm trying to create (and activate & use) a Conda environment using a .yml file (in fact, I'm following the instructions on this GitHub page: https://github.com/RajLabMSSM/echolocatoR). I'm working on a cluster computing system running Linux.
conda env create -f https://github.com/RajLabMSSM/echolocatoR/raw/master/inst/conda/echoR.yml
After running the above line of code, I'm trying to activate the environment:
conda activate echoR
However, this returns the following message:
Could not find conda environment: echoR
You can list all discoverable environments with conda info --envs.
When checking the list of environments in .conda/environments.txt, the echoR environment is indeed not listed.
I'm hoping for some suggestions of what might be the issue here.
Likely Cause: Out of Memory
Given the HPC context, the solver is likely exceeding the allocated memory and getting killed. The Python-based Conda solver is not very resource efficient and can struggle with large environments. This particular environment is quite large, since it mixes both Python and R, and it doesn't give exact specifications for the R and Python versions (only lower bounds), which makes the SAT search space enormous.
Profiling Memory
I attempted to use a GitHub Workflow to profile the memory usage. Using Mamba, it solved without issue; using Conda, the job was killed because the GitHub runner ran out of memory (7GB max). The breakdown was:
Tool  | Memory (MB) | User Time (s)
Mamba | 745         | 195.45
Conda | > 6,434     | > 453.34
Workarounds
Use Mamba
Mamba is a compiled drop-in replacement for Conda and is much more resource efficient. It has also seen welcome adoption in the bioinformatics community (e.g., it is the default frontend for Snakemake).
As the GitHub workflow demonstrates, the Mamba-based creation works perfectly fine with the YAML as is.
Request more memory
Ask SLURM/SGE for more memory for your interactive session. Conda seems to need more than 6.5 GB (maybe try 16 GB?).
Create a better YAML
The first thing one could do to get a faster solve is to provide exact versions for Python and R. The Mamba solution resolved to python=3.9 r-base=4.0.
There's also a bunch of development-level stuff in the environment that is completely unnecessary for end-users. But that's more something to bother the developers about.
When I run best_model = compare_models(), there is a huge load on the CPU and memory, while my GPU is unutilized. How do I run setup() or compare_models() on the GPU?
Is there an in-built method in PyCaret?
Only some models can run on GPU, and they must be properly installed to use GPU. For example, for xgboost, you must install it with pip and have CUDA 10+ installed (or install a GPU xgboost version from anaconda, etc). Here is the list of estimators that can use GPU and their requirements: https://pycaret.readthedocs.io/en/latest/installation.html?highlight=gpu#pycaret-on-gpu
As Yatin said, you need to use use_gpu=True in setup(). Or you can specify it when creating an individual model, like xgboost_gpu = create_model('xgboost', fold=3, tree_method='gpu_hist', gpu_id=0).
For installing CUDA, I like using Anaconda since it makes it easy, like conda install -c anaconda cudatoolkit. It looks like for the non-boosted methods, you need to install cuML for GPU use.
Oh, and it looks like PyCaret can't use tune-sklearn with GPU (see the warnings here at the bottom of the tune_model doc section).
To use the GPU in PyCaret you simply have to pass use_gpu=True as a parameter in the setup function.
Example:
model = setup(data, target_variable, use_gpu=True)
I write almost all my R code in packages at work (and use git). I make heavy use of devtools, in particular shortcuts for load_all, etc., as I update functions used in a package. I have a rough understanding of devtools, in that load_all makes a temporary copy of the package, and I really like this workflow for testing function updates in packages.
Is there a nice, easy way/workflow for running simulations that depend on the package while developing it at the same time, without "breaking" those simulations?
I suspect there is an easy solution that I've overlooked.
Right now what I do is:
get the package "mypackage" up to a point ready for running simulations. copy the whole folder containing the project. Run the simulations in the copied folder using a new package name "mypackage2"). Run simulation scripts which include library(mypackage2) but NOT library(mypackage). This annoyingly means I need to update library(mypackage) calls to library(mypackage2) calls. If I run simulations using library(mypackage) and avoid using library(mypackage2), then I need to make sure the current built version of mypackage is the 'old' one that doesn't reflect updates in 2. below (but 2. below requires rebuilding the package too!). Handling all this gets messy.
While the simulations are running in the copied folder I can update the functions in "mypackage", by either using load_all or rebuilding the package. I often need to Rebuild the package (i.e. using load_all without rebuilding the package when testing updates to the package isn't a workable solution) because I want to test functions that run small parallel simulations with doParallel and foreach, etc (on windows), and any functions I modify and want to test need the latest built "mypackage" in the children processes which spawn new R processes calling "mypackage". I understand that when a package is built in R, it gets stored in ..\R\R-3.6.1\library, and when future R sessions call library(mypackage) they will use that version of the package.
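For illustration, a minimal sketch (hypothetical; "mypackage" stands in for your own package) showing that each foreach worker attaches whatever copy of the package is installed in the library path:
library(doParallel)
cl <- makeCluster(2)
registerDoParallel(cl)
# Each worker starts a fresh R session, attaches the installed copy of the
# package, and reports where on disk that copy was found.
foreach(i = 1:2, .packages = "mypackage") %dopar% find.package("mypackage")
stopCluster(cl)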
What I'd ideally like to be able to do is, in the same original folder, run simulations with one version of mypackage, and then update the code in the package while simulations are being stopped/started, confident that my development changes won't break the simulations, which are running a specific version of the package.
Is there a simple way of doing the above, without having to recopy folders (and make things like "mypackage2")?
Thanks
The issue described here is sort of similar to what I am facing: Specify package location in foreach.
The problem is that if I run a simulation that takes several days using "mypackage", with many calls to foreach, and update and rebuild "mypackage" when testing changes, future foreach calls from the simulation may pick up the new updated version of the package, which would be a disaster.
I think the answers in the other question do apply, but you need to do some extra steps. Let's say you have a version of the package you want to test. You'd still create a specific folder for that version, but you leave it empty. Here I'll use /tmp/mypkg2 as an example. While having your project open in RStudio, you execute:
withr::with_libpaths(c("/tmp/mypkg2", .libPaths()), devtools::install())
That will install that version of the package to the provided folder.
You could then have a wrapper script, say wrapper.R, with something like:
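# wrapper.R
# First command-line argument: the library folder that holds the package
# version to test (e.g. /tmp/mypkg2 from the install step above).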
pkg_path <- commandArgs(trailingOnly = TRUE)[1L]
cat("Using package at", pkg_path, "\n")
.libPaths(c(pkg_path, .libPaths()))
library(doParallel)
workers <- makeCluster(detectCores())
registerDoParallel(workers)
# We need to modify the lib path in each worker too
parallel::clusterExport(workers, "pkg_path")
parallel::clusterEvalQ(workers, .libPaths(c(pkg_path, .libPaths())))
# ... Your code calling your package and doing stuff
parallel::stopCluster(workers)
Afterwards, from the command line (outside of R/RStudio), you could type (assuming Rscript is in your path):
Rscript path/to/wrapper.R /tmp/mypkg2
This way, the actual testing code can stay the same (including calls to library), and R will automatically search first in pkg_path, loading your specific package version, and then search the standard locations for any dependencies you may have.
I don't fully understand your use case (as to why you want to do this), but what I normally do when testing two versions of a package is to push the most recent version to my dev branch on GitHub and then use devtools::load_all() to test what I'm currently working on. Then, by using remotes::install_github() and specifying the dev branch, you can run the GitHub version with mypackage::func and the devtools version with func.
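For example, a minimal sketch (the "youruser/mypackage" repository and the dev branch name are placeholders for your own setup):
remotes::install_github("youruser/mypackage@dev")  # installed copy from the dev branch on GitHub
devtools::load_all()                               # local working copy you are currently editing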
I use RStudio; it's fantastic. Recent adjustments to my BLAS installation, however, require me to start R with the command
taskset 0xffff R
to get parallel processing to work. Is there a way to tell RStudio to start its R session with this command so I can use RStudio with parallel?
(I know parallel and GUI's don't play that nice)
Thanks.
EDIT: I'm running Kubuntu.
You should be able to do this by customizing your Rprofile.site file, as described here:
http://www.statmethods.net/interface/customizing.html
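One possible sketch (an assumption about how the taskset call could be applied from Rprofile.site, not something the linked page spells out): re-apply the affinity mask to the current R process at startup, so RStudio's session behaves as if it had been launched with taskset.
# Hypothetical addition to Rprofile.site (Linux only): reset the CPU affinity
# of the running R process, mirroring `taskset 0xffff R`.
if (.Platform$OS.type == "unix") {
  system(sprintf("taskset -p 0xffff %d", Sys.getpid()))
}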
When I require(gWidgetstcltk), I get an infinite loop, with a seemingly endless stream of error messages that look like this:
error reading package index file /Library/Frameworks/R.framework/Versions/2.14/Resources/library/tcltk2/tklibs/ttktheme_clearlooks/pkgIndex.tcl: can't find package tile
error reading package index file /Library/Frameworks/R.framework/Versions/2.14/Resources/library/tcltk2/tklibs/ttktheme_clearlooks/pkgIndex.tcl: too many nested evaluations (infinite loop?)
(On each iteration the path is different.) The ends of these messages seem to be the important parts: can't find package tile and too many nested evaluations (infinite loop?).
I installed the packages as usual using install.packages() and the files referred to seem to be present. gWidgets seems to load just fine. I'm running R 2.14.1 via RStudio 0.96.231 on OSX 10.7.4. What is going wrong here?
Update: I now see that the problem is coming from the tcltk2 package.
That shouldn't happen. First off, I'd say try uninstalling the package and then re-installing it. There might have been an error during the process. Another thing you should do is select "Install All Dependencies" when you do this (or install.packages(______, dependencies = TRUE)). Have you installed all of the package's relevant dependencies? Perhaps this library requires a different library which you don't have.
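For example, a minimal sketch of that reinstall for tcltk2 (the package the errors point to; adjust the name as needed):
remove.packages("tcltk2")                        # drop the possibly broken install
install.packages("tcltk2", dependencies = TRUE)  # reinstall along with its dependencies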