Unable to locate conda environment after creating it using .yml file - r

I'm trying to create (and activate & use) a Conda environment using a .yml file (in fact, I'm following instructions on this GitHub page: https://github.com/RajLabMSSM/echolocatoR). I'm working in a cluster computing system running Linux.
conda env create -f https://github.com/RajLabMSSM/echolocatoR/raw/master/inst/conda/echoR.yml
After running the above line of code, I'm trying to activate the environment:
conda activate echoR
However, this returns the following message:
Could not find conda environment: echoR
You can list all discoverable environments with conda info --envs.
When checking the list of environments in .conda/environments.txt, the echoR environment is indeed not listed.
I'm hoping for some suggestions of what might be the issue here.

Likely Cause: Out of Memory
Given the HPC context, the solver is likely exceeding the allocated memory and getting killed. The Python-based Conda solver is not very resource efficient and can struggle with large environments. This particular environment is quite large, since it mixes both Python and R, and it doesn't give exact specifications for the R and Python versions (only lower bounds), which makes the SAT search space enormous.
Profiling Memory
I attempted to use a GitHub Workflow to profile the memory usage. Using Mamba, it solved without issue; using Conda, the job was killed because the GitHub runner ran out of memory (7GB max). The breakdown was:
Tool     Memory (MB)    User Time (s)
Mamba    745            195.45
Conda    > 6,434        > 453.34
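If you want to reproduce this kind of measurement on your own machine or an interactive cluster node, one option is GNU time, which reports peak resident memory (a sketch; the URL is the YAML from the question, and /usr/bin/time -v assumes GNU time rather than the shell built-in):
/usr/bin/time -v conda env create -f https://github.com/RajLabMSSM/echolocatoR/raw/master/inst/conda/echoR.yml
# look for "Maximum resident set size (kbytes)" in the report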
Workarounds
Use Mamba
Mamba is a compiled drop-in replacement for Conda and is much more resource efficient. It has also seen welcome adoption in the bioinformatics community (e.g., it is the default frontend for Snakemake).
As the GitHub workflow demonstrates, the Mamba-based creation works perfectly fine with the YAML as is.
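Roughly, that looks like the following (a sketch; on a shared cluster you may need to install Mamba into a user-level Conda or via Mambaforge rather than the base environment):
conda install -n base -c conda-forge mamba
mamba env create -f https://github.com/RajLabMSSM/echolocatoR/raw/master/inst/conda/echoR.yml
conda activate echoR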
Request more memory
Ask SLURM/SGE for more memory for your interactive session. Conda seems to need more than 6.5 GB (maybe try 16 GB?).
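For example, on a SLURM cluster something like the following would give the solver more headroom (a sketch; partition, account, and time-limit flags depend on your site's configuration):
srun --mem=16G --pty bash        # request a 16 GB interactive session
# then, inside that session:
conda env create -f https://github.com/RajLabMSSM/echolocatoR/raw/master/inst/conda/echoR.yml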
Create a better YAML
The first thing one could do to get a faster solve is to provide exact versions for Python and R. The Mamba solution resolved to python=3.9 and r-base=4.0.
There's also a bunch of development-level stuff in the environment that is completely unnecessary for end-users. But that's more something to bother the developers about.

Related

How to check which dependency has errored while building a package?

I am a new Julia user building my own package. When I build it, I get this:
(VFitApproximation) pkg> build VFitApproximation
Building MathLink → `~/.julia/scratchspaces/44cfe95a-1eb2-52ea-b672-e2afdf69b78f/653c35640db592dfddd8c46dbb43d561cbef7862/build.log`
Building Conda ───→ `~/.julia/scratchspaces/44cfe95a-1eb2-52ea-b672-e2afdf69b78f/299304989a5e6473d985212c28928899c74e9421/build.log`
Building PyCall ──→ `~/.julia/scratchspaces/44cfe95a-1eb2-52ea-b672-e2afdf69b78f/169bb8ea6b1b143c5cf57df6d34d022a7b60c6db/build.log`
Progress [========================================>] 4/4
✗ VFitApproximation
3 dependencies successfully precompiled in 9 seconds (38 already precompiled)
1 dependency errored
I don't know which dependency has errored and how to fix that. How do I ask Julia what has errored?
It's VFitApproximation itself that has the error (it's the only one with an ✗ next to it). You should try starting a session and typing using VFitApproximation; if that causes an error, the message will tell you much more about the origin than build does. If that doesn't directly trigger an error, then you can try the verbose mode of build as suggested by @sundar above. Julia's package manager runs a lot of its system-wide operations in parallel, which is wonderful when you have to build dozens or hundreds of packages, but under those circumstances you only get general summaries rather than the level of detail you get from operations focused on a single package.
More generally, most packages don't need manual build: it's typically used for packages that require special configuration at the time of installation. Examples might include downloading a data set from the internet (though this is now better handled through Artifacts), or saving configuration data about the user's hardware to a file, etc. For reference, on my system, out of 418 packages I have dev'ed, only 20 have deps/build.jl scripts, and many of those only because they haven't yet been updated to use Artifacts.
Bottom line: for most code, you never need Pkg.build, and you should just use Pkg.precompile or using directly.
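A sketch of both suggestions in one script (the package name is the one from the question):
using Pkg
Pkg.build("VFitApproximation"; verbose = true)  # re-run only this package's build with full output
using VFitApproximation                         # loading usually gives a clearer stack trace than build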

Use valgrind with `R CMD check`

I want to run valgrind on the tests, examples and vignettes of my package. Various sources insinuate that the way to do this should be:
R CMD build my-pkg
R CMD check --use-valgrind my-pkg_0.0.tar.gz
R CMD check seems to run fine, but shows no evidence of valgrind output, even after setting the environment variable VALGRIND_OPTS: --memcheck:leak-check=full. I've found sources that hint that R needs to run interactively for valgrind to show output, but R -d CMD check (or R -d "CMD check") seems to be the wrong format.
R -d "valgrind --tool=memcheck --leak-check=full" --vanilla < my-pkg.Rcheck/my-pkg-Ex.R does work, but only on the example files; I can't see a simple way to run this against my vignettes and testthat tests.
What is the best way to run all relevant scripts through valgrind? For what it's worth, the goal is to integrate this in a GitHub actions script.
Edit Mar 2022: the R CMD check case is actually simpler: running R CMD check --use-valgrind [other options you may want] will run the tests and examples under valgrind and then append the standard valgrind summary at the end of the examples output (i.e., pkg.Rcheck/pkg-Ex.Rout) and the test output (i.e., pkg.Rcheck/tinytest.Rout, as I use tinytest). What puzzles me now is that an error detected by valgrind does not seem to fail the test.
Original answer below the separator.
There is a bit more to this: it helps to ensure that the R build is instrumented for it. See Section 4.3.2 of Writing R Extensions:
On platforms where valgrind is installed you can build a version of R with extra instrumentation to help valgrind detect errors in the use of memory allocated from the R heap. The configure option is --with-valgrind-instrumentation=level, where level is 0, 1 or 2. Level 0 is the default and does not add anything. Level 1 will detect some uses of uninitialised memory and has little impact on speed (compared to level 0). Level 2 will detect many other memory-use bugs but make R much slower when running under valgrind. Using this in conjunction with gctorture can be even more effective (and even slower).
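In practice that means configuring your own R build with that flag (a sketch, assuming you are building R from a source tarball; the version and install prefix are illustrative):
tar xf R-x.y.z.tar.gz && cd R-x.y.z
./configure --with-valgrind-instrumentation=2
make && sudo make install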
So you probably want to build yourself a Docker-ized version of R to call from your GitHub Action. I think Winston's excellent 'sumo' container has a valgrind build too, so you could try that. It's huge, at over 4 GB:
edd@rob:~$ docker images | grep wch    # some whitespace edited out
wch1/r-debug    latest    a88fabe8ec81    8 days ago    4.49GB
edd@rob:~$
And of course if you test dependent packages you have to get them into the valgrind session too...
Unfortunately, I found the learning curve involved in dockerizing R in GitHub Actions, per @dirk-eddelbuettel's suggestion, too steep.
I came up with the hacky approach of adding a file memcheck.R to the root of the package with the contents
devtools::load_all()
devtools::run_examples()
devtools::build_vignettes()
devtools::test()
(remembering to add it to .Rbuildignore).
Running R -d "valgrind --tool=memcheck --leak-check=full" --vanilla < memcheck.R then seems to work, albeit with the reporting of some issues that appear to be false positives, or at least issues that are not identified by CRAN's valgrind implementation.
Here's an example of this in action in a GitHub actions script.
(Readers who know what they are doing are invited to suggest shortcomings of this approach in the comments!)

Virtual environment in R?

I've found several posts about best practice, reproducibility and workflow in R, for example:
How to increase longer term reproducibility of research (particularly using R and Sweave)
Complete substantive examples of reproducible research using R
One of the major preoccupations is ensuring portability of code, in the sense that moving it to a new machine (possibly running a different OS) is relatively straightforward and gives the same results.
Coming from a Python background, I'm used to the concept of a virtual environment. When coupled with a simple list of required packages, this goes some way to ensuring that the installed packages and libraries are available on any machine without too much fuss. Sure, it's no guarantee - different OSes have their own foibles and peculiarities - but it gets you 95% of the way there.
Does such a thing exist within R? Even if it's not as sophisticated. For example simply maintaining a plain text list of required packages and a script that will install any that are missing?
I'm about to start using R in earnest for the first time, probably in conjunction with Sweave, and would ideally like to start in the best way possible! Thanks for your thoughts.
I'm going to use the comment posted by @cboettig to resolve this question.
Packrat
Packrat is a dependency management system for R. It gives you three important advantages (all of them focused on your portability needs):
Isolated: Installing a new or updated package for one project won't break your other projects, and vice versa. That's because Packrat gives each project its own private package library.
Portable: Easily transport your projects from one computer to another, even across different platforms. Packrat makes it easy to install the packages your project depends on.
Reproducible: Packrat records the exact package versions you depend on, and ensures those exact versions are the ones that get installed wherever you go.
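A minimal sketch of the basic workflow (function names as in the Packrat documentation; the installed package is just an example):
packrat::init()             # give this project its own private library
install.packages("dplyr")   # installs into the project library, not the user library
packrat::snapshot()         # record exact versions in packrat/packrat.lock
packrat::restore()          # on another machine: reinstall exactly those versions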
What's next?
Walkthrough guide: http://rstudio.github.io/packrat/walkthrough.html
Most common commands: http://rstudio.github.io/packrat/commands.html
Using Packrat with RStudio: http://rstudio.github.io/packrat/rstudio.html
Limitations and caveats: http://rstudio.github.io/packrat/limitations.html
Update: Packrat has been soft-deprecated and is now superseded by renv, so you might want to check this package instead.
The Anaconda package manager conda supports creating R environments.
conda create -n r-environment r-essentials r-base
conda activate r-environment
I have had a great experience using conda to maintain different Python installations, both user-specific and several versions for the same user. I have tested R with conda and the Jupyter notebook, and it works great. At least for my needs, which include RNA-sequencing analyses using DESeq2 and related packages, as well as data.table and dplyr. There are many Bioconductor packages available in conda via bioconda, and according to the comments on this SO question, it seems like install.packages() might work as well.
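For example, the packages mentioned above can be installed into that environment from the conda-forge and bioconda channels (a sketch; the package names follow current bioconda/conda-forge naming conventions):
conda activate r-environment
conda install -c conda-forge -c bioconda bioconductor-deseq2 r-data.table r-dplyr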
It looks like there is another option from RStudio devs, renv. It's available on CRAN and supersedes Packrat.
In short, you use renv::init() to initialize your project library, and use renv::snapshot() / renv::restore() to save and load the state of your library.
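A minimal sketch of that workflow (function names as in the renv documentation; the installed package is just an example):
renv::init()                     # create a project library and an initial renv.lock
install.packages("data.table")   # installs into the project library
renv::snapshot()                 # record current versions in renv.lock
renv::restore()                  # on a fresh clone: reinstall from renv.lock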
I prefer this option to conda r-environments because here everything is stored in the file renv.lock, which can be committed to a Git repo and distributed to the team.
To add to this:
Note:
1. Have Anaconda installed already
2. Assume your working directory is "C:\"
To create the desired environment -> "r_environment_name"
C:\>conda create -n "r_environment_name" r-essentials r-base
To see available environments
C:\>conda info --envs
.
..
...
To activate environment
C:\>conda activate "r_environment_name"
(r_environment_name) C:\>
Launch Jupyter Notebook and let the party begin
(r_environment_name) C:\> jupyter notebook
For a similar "requirements.txt", perhaps this link will help -> Is there something like requirements.txt for R?
Check out roveR, the R container management solution. For details, see https://www.slideshare.net/DavidKunFF/ownr-technical-introduction, in particular slide 12.
To install roveR, execute the following command in R:
install.packages("rover", repos = c("https://lair.functionalfinances.com/repos/shared", "https://lair.functionalfinances.com/repos/cran"))
To make full use of the power of roveR (including installing specific versions of packages for reproducibility), you will need access to a laiR - for CRAN, you can use our laiR instance at https://lair.ownr.io, for uploading your own packages and sharing them with your organization you will need a laiR license. You can contact us on the email address in the presentation linked above.

How can I ensure a consistent R environment among different users on the same server?

I am writing a protocol for a reproducible analysis using an in-house package "MyPKG". Each user will supply their own input files; other than the inputs, the analyses should be run under the same conditions. (e.g. so that we can infer that different results are due to different input files).
MyPKG is under development, so library(MyPKG) will load whichever was the last version that the user compiled in their local library. It will also load any dependencies found in their local libraries.
But I want everyone to use a specific version (MyPKG_3.14) for this analysis while still allowing development of more recent versions. If I understand correctly, "R --vanilla" will load the same dependencies for everyone.
Once we are done, we will save the working environment as a VM to maintain a stable reproducible environment. So a temporary (6 month) solution will suffice.
I have come up with two potential solutions, but am not sure if either is sufficient.
ask the server admin to install MyPKG_3.14 into the default R path and then provide the following code in the protocol:
R --vanilla
library(MyPKG)
....
or
compile MyPKG_3.14 in a specific library, e.g. lib.loc = "/home/share/lib/R/MyPKG_3.14", and then provide
R --vanilla
library(MyPKG)
Are both of these approaches sufficient to ensure that everyone is running the same version?
Is one preferable to the other?
Are there other unforeseen issues that may arise?
Is there a preferred option for standardising the multiple analyses?
Should I include a test of the output of sessionInfo()?
Would it be better to create a single account on the server for everyone to use?
A couple of points:
Use system-wide installations of packages, e.g. the Debian / Ubuntu binary for R (incl. the CRAN ports) will try to use /usr/local/lib/R/site-library (which users can install to as well if they are added to the group owning the directory). That way everybody gets the same version.
Use system-wide configuration, e.g. prefer $R_HOME/etc/ over the dotfiles below ~/. For the same reason, the Debian / Ubuntu package offers softlinks in /etc/R/.
Use R's facilities to query its packages (e.g. installed.packages()) to report packages and versions (see the sketch at the end of this answer).
Use, where available, OS-level facilities to query OS release and version. This, however, is less standardized.
Regarding the last point my box at home says
> edd@max:~$ lsb_release -a | tail -4
> Distributor ID: Ubuntu
> Description: Ubuntu 12.04.1 LTS
> Release: 12.04
> Codename: precise
> edd@max:~$
which is a start.
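For the package-reporting point, and for pinning the version from the question, a minimal R sketch (MyPKG and the library path are the question's own examples):
library(MyPKG, lib.loc = "/home/share/lib/R/MyPKG_3.14")   # load the shared, pinned build
stopifnot(packageVersion("MyPKG") == "3.14")               # fail loudly if a different version is picked up
sessionInfo()                                              # record the R version and attached packages
installed.packages()[, c("Package", "Version")]            # full inventory of installed packages and versions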

R: combining multiple library locations with most up-to-date packages

Question: How do I move all of the most up-to-date R packages into one simple location that R (and everything else) will use from now and forever for my packages?
I have been playing around with R on Ubuntu 10.04 using variously RGedit, RCmdr, R shell, and RStudio. Meanwhile, I have installed packages, updated packages, and re-updated packages via apt, synaptic, install.packages(), etc... which apparently means these packages get placed everywhere, and (with the occasional sudo tossed in) with different permissions.
Currently I have different versions of different (and repeated) packages in:
/home/me/R/i486-pc-linux-gnu-library/2.10
/home/me/R/i486-pc-linux-gnu-library/2.14
/home/me/R/i486-pc-linux-gnu-library/
/usr/local/lib/R/site-library
/usr/lib/R/site-library
/usr/lib/R/library
First - I'm a single user, on a single machine - I don't want multiple library locations, I just want it to work.
Second - I am on an extremely slow connection, and can't keep just downloading packages repeatedly.
So - is there an easy way to merge all these library locations into one simple location? Can I just copy the folders over?
How do I set it in concrete that this is and always will be where anything R related looks for and updates packages?
This is maddening.
Thanks for your help.
Yes, it should almost work to just copy the folders over. But pre-2.14 packages WITHOUT a NAMESPACE file probably won't work in R 2.14 where all packages must have a namespace...
And you'd want to manually ensure you only copy the latest version of each package if you have multiple versions...
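One way to hedge against the old-build issue after copying is to let R reinstall anything that was built under an older version (a sketch; run it as a user with write access to the target library, e.g. via sudo R for system paths):
update.packages(checkBuilt = TRUE, ask = FALSE)   # treat packages built under an older R as outdated and reinstall them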
If you type .libPaths(), it will tell you where R looks for packages. The first in the list is where new packages are typically installed. I suspect that .libPaths() might return different things from RStudio vs. Rcmd etc.
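A minimal sketch of checking, and then pinning, the library location (the path is illustrative):
.libPaths()                    # where R looks for packages, in search order
# To force one personal library for every interface, set it once in ~/.Renviron:
#   R_LIBS_USER=/home/me/R/library
# then restart R and confirm:
Sys.getenv("R_LIBS_USER")
.libPaths()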
After piecing together various bits of info here goes: A complete moron's guide to the R packages directory organization:
NB1 - this is my experience with Ubuntu - your mileage may vary
NB2 - I'm a single user on a single machine, and I like things simple.
Ubuntu puts anything installed via apt, or synaptic in:
/usr/lib/R/site-library
/usr/lib/R/library
directories. The default vanilla R install will try to install packages here:
/usr/local/lib/R/site-library
Since these are system directories the user does not have write privileges to, depending on what method you are interacting with R you might be prompted with a friendly - "Hey buddy - we can't write there, you want us to put your packages in your home directory?" which seems innocent and reasonable enough... assuming you never change your GUI, or your working environment. If you do, the new GUI / environment might not be looking in the directory where the packages were placed, so won't find them. (Most interfaces have a way for you to point where your personal library of packages is, but who wants to muck about in config files?)
What seems to be the best practice for me (and feel free to correct me if I'm wrong) with a default install setup on Ubuntu is to do any package management from a basic R shell as sudo: > sudo R, and from there do your install.packages() voodoo. This seems to put packages in the /usr/local/lib/R/site-library directory.
At the same time, update.packages() will update the files in the /usr/lib/R/site-library and /usr/lib/R/library directories, as well as /usr/local/lib/R/site-library.
(As for the /usr/lib/R/ division, it looks like /library/ has the core packages, while /site-library/ holds anything added, assuming they were installed by apt...)
Any packages previously installed and in the wrong place can be moved to the /usr/local/lib/R/site-library directory (assuming you are sudoing it) just by moving the directories (thanks @Tommy), but as /usr/lib/R/ is controlled by apt, it's best not to add or subtract anything from there...
Whew. Anyway - simple enough, and in simple language. Thanks everyone for the help.

Resources