cvxopt uses just one core - how to run it on all / some cores (GLPK)?

I call cvxopt.glpk.ilp in Python 3.6.6, cvxopt==1.2.3 for a boolean optimization problem with about 500k boolean variables. It is solved in 1.5 hours, but it seems to run on just one core! How can I make it run on all or a specific set of cores?
The server runs Ubuntu Linux (x86_64) and has 16 or 32 physical cores. My process affinity is 64 cores (I assume due to hyperthreading).
> grep ^cpu\\scores /proc/cpuinfo | uniq
16
> grep -c ^processor /proc/cpuinfo
64
> taskset -cp <PID>
pid <PID> current affinity list: 0-63
However, top shows only 100% CPU for my process, and htop shows that only one core is 100% busy (some others are slightly loaded, presumably by other users).
I set OMP_NUM_THREADS=32 and started my program again, but it still uses just one core. Restarting the server itself is a bit difficult, and I don't have root access to it.
I installed cvxopt from a company's internal repo which should be a mirror of PyPI. The following libs are installed in /usr/lib: liblapack, liblapack_atlas, libopenblas, libblas, libcblas, libatlas.

Another SO user writes here that GLPK is not multithreaded. GLPK is the solver used by default, as cvxopt has no MIP solver of its own.
As GLPK is the only open-source mixed-integer programming solver cvxopt supports, you are out of luck.
Alternatively you can use COIN-OR's Cbc, which is usually a much better solver than GLPK while still being open source, and which can be compiled with parallelization enabled. There are also benchmarks indicating that GLPK really has no parallel support.
But as there is no support for Cbc in cvxopt, you will need some alternative access point (a short pulp-based sketch is given at the end of this answer):
your own C/C++-based wrapper
pulp (binary install available)
python-mip (binary install available)
Google's ortools (binary install available)
cylp
cvxpy + cylp (binary install available for cvxpy; without a cylp build)
These options:
have very different modelling styles (from completely low-level, cylp, to very high-level, cvxpy)
I'm not sure if all of those builds are compiled with parallelization enabled (which is needed when compiling Cbc)
Furthermore: don't expect too much gain from multithreading. The speedup is usually far from linear (as for all combinatorial-optimization problems which are not based on brute force).
(Imho the GIL does not matter here, as all of these are C extensions where the GIL is not in the way.)
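If you do go the pulp route, a minimal sketch could look like the following (the toy model, its size and the names are made up for illustration; the threads option only helps if the bundled CBC binary was built with parallel support):
import pulp

n = 1000  # illustrative number of boolean variables
prob = pulp.LpProblem("bool_opt", pulp.LpMaximize)
x = [pulp.LpVariable(f"x{i}", cat="Binary") for i in range(n)]
prob += pulp.lpSum(x)              # toy objective
prob += pulp.lpSum(x) <= n // 2    # toy constraint
prob.solve(pulp.PULP_CBC_CMD(threads=8))
print(pulp.LpStatus[prob.status])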

Related

Unable to locate conda environment after creating it using .yml file

I'm trying to create (and activate & use) a Conda environment using a .yml file (in fact, I'm following instructions on this GitHub page: https://github.com/RajLabMSSM/echolocatoR). I'm working in a cluster computing system running Linux.
conda env create -f https://github.com/RajLabMSSM/echolocatoR/raw/master/inst/conda/echoR.yml
After running the above line of code, I'm trying to activate the environment:
conda activate echoR
However, this returns the following message:
Could not find conda environment: echoR
You can list all discoverable environments with conda info --envs.
When checking the list of environments in .conda/environments.txt, the echoR environment is indeed not listed.
I'm hoping for some suggestions of what might be the issue here.
Likely Cause: Out of Memory
Given the HPC context, the solver is likely exceeding the allocated memory and getting killed. The Python-based Conda solver is not very resource efficient and can struggle with large environments. This particular environment is quite large, since it mixes both Python and R, and it doesn't give exact specifications for the R and Python versions - only lower bounds - which makes the SAT search space enormous.
Profiling Memory
I attempted to use a GitHub Workflow to profile the memory usage. Using Mamba, it solved without issue; using Conda, the job was killed because the GitHub runner ran out of memory (7GB max). The breakdown was:
Tool  | Memory (MB) | User Time (s)
Mamba | 745         | 195.45
Conda | > 6,434     | > 453.34
Workarounds
Use Mamba
As a drop-in replacement for Conda that is compiled, Mamba is much more resource efficient. It has also seen welcome adoption in the bioinformatics community (e.g., it is the default frontend for Snakemake).
As the GitHub workflow demonstrates, the Mamba-based creation works perfectly fine with the YAML as is.
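For example (assuming mamba is already installed, e.g. in your base environment), the creation step is a direct swap of the command:
mamba env create -f https://github.com/RajLabMSSM/echolocatoR/raw/master/inst/conda/echoR.yml
conda activate echoR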
Request more memory
Ask SLURM/SGE for more memory for your interactive session. Conda seems to need more than 6.5 GB (maybe try 16 GB?).
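For example, on SLURM an interactive session with more memory could be requested along these lines (a sketch; the exact flags and limits depend on your cluster):
srun --mem=16G --pty bash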
Create a better YAML
The first thing one could do to get a faster solve is to provide exact versions for Python and R. The Mamba solution resolved to python=3.9 and r-base=4.0.
There's also a bunch of development-level stuff in the environment that is completely unnecessary for end-users. But that's more something to bother the developers about.

fast install package during development with multiarch

I'm working on a package "xyz" that uses Rcpp with several cpp files.
When I'm only updating the R code, I would like to run R CMD INSTALL xyz on the package directory without having to recompile all the shared libraries that haven't changed. That works fine if I specify the --no-multiarch flag: the source directory src gets populated the first time with the compiled objects, and if the sources don't change they are re-used the next time. With multiarch on, however, R decides to make two copies of src, src-i386 and src-x86_64. It seems to confuse R CMD INSTALL which always re-runs all the compilation. Is there any workaround?
(I'm aware that there are alternative ways, e.g. devtools::load_all, but I'd rather stick to R CMD INSTALL if possible).
The platform is MacOS 10.7, and I have the latest version of R.
I have a partial answer for you. One really easy speed-up is provided by ccache, which you can enable for all R compilation (e.g. via R CMD whatever, thereby also covering inline, attributes, RStudio use, ...) globally through ~/.R/Makevars:
edd@max:~$ tail -10 .R/Makevars
VER=4.6
CC=ccache gcc-$(VER)
CXX=ccache g++-$(VER)
SHLIB_CXXLD=g++-$(VER)
FC=ccache gfortran
F77=ccache gfortran
MAKE=make -j8
edd@max:~$
It takes care of all caching of compilation units.
Now, that does not "explicitly" address the --no-multiarch aspect, which I don't play much with as we are still mostly 'single arch' on Linux. This will change eventually, but hasn't yet. Still, I suspect that by letting the compiler decide the caching you too will get the net effect.
Other aspects can be controlled too, e.g. ~/.R/check.Renviron can be used to turn certain tests on or off. I tend to keep 'em all on -- better to waste a few seconds here than to get a nastygram from Vienna.

How can I ensure a consistent R environment among different users on the same server?

I am writing a protocol for a reproducible analysis using an in-house package "MyPKG". Each user will supply their own input files; other than the inputs, the analyses should be run under the same conditions. (e.g. so that we can infer that different results are due to different input files).
MyPKG is under development, so library(MyPKG) will load whichever was the last version that the user compiled in their local library. It will also load any dependencies found in their local libraries.
But I want everyone to use a specific version (MyPKG_3.14) for this analysis while still allowing development of more recent versions. If I understand correctly, "R --vanilla" will load the same dependencies for everyone.
Once we are done, we will save the working environment as a VM to maintain a stable reproducible environment. So a temporary (6 month) solution will suffice.
I have come up with two potential solutions, but am not sure if either is sufficient.
ask the server admin to install MyPKG_3.14 into the default R path and then provide the following code in the protocol:
R --vanilla
library(MyPKG)
....
or
compile MyPKG_3.14 in a specific library, e.g. lib.loc = "/home/share/lib/R/MyPKG_3.14", and then provide
R --vanilla
library(MyPKG)
Are both of these approaches sufficient to ensure that everyone is running the same version?
Is one preferable to the other?
Are there other unforeseen issues that may arise?
Is there a preferred option for standardising the multiple analyses?
Should I include a test of the output of sessionInfo()?
Would it be better to create a single account on the server for everyone to use?
Couple of points:
Use system-wide installations of packages, e.g. the Debian / Ubuntu binaries for R (incl. the CRAN ports) will try to use /usr/local/lib/R/site-library (to which users can also install if they are added to the group owning the directory). That way everybody gets the same version.
Use system-wide configuration, e.g. prefer $R_HOME/etc/ over the dotfiles below ~/. For the same reason, the Debian / Ubuntu package offers softlinks in /etc/R/
Use R's facilities to query its packages (e.g. installed.packages()) to report packages and versions (a short sketch of this is given below).
Use, where available, OS-level facilities to query OS release and version. This, however, is less standardized.
Regarding the last point my box at home says
edd@max:~$ lsb_release -a | tail -4
Distributor ID: Ubuntu
Description:    Ubuntu 12.04.1 LTS
Release:        12.04
Codename:       precise
edd@max:~$
which is a start.
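As a sketch of that reporting step (plain base R; the output file names are just examples), each user could record exactly what they ran with:
writeLines(capture.output(sessionInfo()), "sessionInfo.txt")
write.csv(installed.packages()[, c("Package", "Version")],
          "package_versions.csv", row.names = FALSE)
Comparing these files across users quickly shows whether everyone really ran MyPKG_3.14 with the same dependencies.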

Rcpp on Solaris

I am trying to compile Rcpp_0.9.7 from source on sparc-sun-solaris2.10. I am getting the following error when I try to use install.packages:
sh: make: not found
ERROR: compilation failed for package 'Rcpp'
From research on the internet, it appears others have had similar problems with solaris. Unfortunately I do not know very much about which compilers I should or should not be using. One thing I am beginning to realize, however, is that solaris seems to be a sub-optimal environment for running R (in terms of performance as well as convenience).
Solaris can mean different things: it could be Solaris on x86, or Solaris on Sparc.
According to the Rcpp build results page on CRAN, Rcpp does now build on x86 Solaris (thanks to a recent patch by Martyn Plummer) but not Sparc Solaris. We were just discussing that this week on the rcpp-devel list.
As for your error, you are lacking critical components, namely the make tool. You likely lack more. Your conclusion is correct, though. Depending on your level of Unix knowledge, you may be best off simply installing Ubuntu and enjoying tens of thousands of pre-built packages, including R and well over a hundred related packages.
Not really a solution but too long for a comment.
First of all, get a decent environment for testing your build of Rcpp on Solaris. Personally I use VirtualBox on my Windows workstation. This way I have an environment that I can control myself and do not depend on any grumpy SysAdmin. Best of all: there's no cost involved! When you are confident with your build you can either (1) move the binaries over to your target host or (2) replicate the build setup on your target host.
Secondly you can use these instructions to set up a proper build host on Solaris. (you seem to be lacking some crucial tools!). Remember to use gmake when building as per the instructions in the posting.
As Dirk mentioned, you're lacking the make command. If you're running Solaris 10 or earlier, then you need to find your installation media and pkgadd SUNWsprot.
If you're running Solaris 11 or later, then
pkg install developer/build/make
will get you that utility. You probably need the system headers as well, which are in pkg://solaris/system/header for Solaris 11 and later, or SUNWhea in earlier releases.
I see that you mention sparc-sun-solaris2.10 in your question - is there any opportunity for you to update to Solaris 11 or later? The developer environment is much, much nicer in the newer releases. Certainly easier to get a copy of a compiler...

Setting up "configure" for openMP in R

I have an R package which is easily sped up by using OpenMP. If your compiler supports it then you get the win, if it doesn't then the pragmas are ignored and you get one core.
My problem is how to get the package build system to use the right compiler options and libraries. Currently I have:
PKG_CPPFLAGS=-fopenmp
PKG_LIBS=-fopenmp
hardcoded into src/Makevars on my machine, and this builds it with OpenMP support. But it produces a warning about non-standard compiler flags on check, and will probably fail hard on a machine with no OpenMP capabilities.
The solution seems to be to use configure and autoconf. There's some information around here:
http://cran.r-project.org/doc/manuals/R-exts.html#Using-Makevars
including a complex example to compile in odbc functionality. But I can't see how to begin tweaking that to check for openmp and libgomp.
None of the R packages I've looked at that talk about using openMP seem to have this set up either.
So does anyone have a walkthrough for setting up an R package with OpenMP?
[EDIT]
I may have cracked this now. I have a configure.ac script and a Makevars.in with @FOO@ substitutions for the compiler options. But now I'm not sure of the workflow. Is it:
Run "autoconf configure.in > configure; chmod 755 configure" if I change the configure.in file.
Do a package build.
On package install, the system runs ./configure for me and creates Makevars from Makevars.in
But just to be clear, "autoconf configure.in > configure" doesn't run on package install - it's purely a developer process to create the configure script that is distributed - amirite?
Methinks you have the library option wrong, please try
## -- compiling for OpenMP
PKG_CXXFLAGS=-fopenmp
##
## -- linking for OpenMP
PKG_LIBS= -fopenmp -lgomp
In other words, -lgomp gets you the OpenMP library linked. And I presume you know that this library is not part of the popular Rtools kit for Windows. On a modern Linux you should be fine.
In an unreleased test package I have here I also add the following to PKG_LIBS, but that is mostly due to my use of Rcpp:
$(shell $(R_HOME)/bin/Rscript -e "Rcpp:::LdFlags()") \
$(LAPACK_LIBS) $(BLAS_LIBS) $(FLIBS)
Lastly, I think the autoconf business is not really needed unless you feel you need to test for OpenMP via configure.
Edit: SpacedMan is correct. Per the beginning of the libgomp-4.4 manual:
1 Enabling OpenMP
To activate the OpenMP extensions for C/C++ and Fortran, the compile-time flag `-fopenmp' must be specified. This enables the OpenMP directive [...] The flag also arranges for automatic linking of the OpenMP runtime library.
So I stand corrected. Seems that it doesn't hurt to manually add what would get added anyway, just for clarity...
Just addressing your question regarding the usage of autoconf--no, you do not want to run autoconf with any arguments, nor should you redirect its output. You are correct that running autoconf to build the configure script is something that the package maintainer does, and the resulting configure script is distributed. Normally, to generate the configure script from configure.ac (older packages use the name configure.in, but that name has been discouraged for several years), the developer simply runs autoconf with no arguments. Before running autoconf, it is necessary to run aclocal, autoheader, libtoolize, etc... There is also a tool (autoreconf) which simplifies the process and invokes all the required programs in the correct order. It is now more typical to run autoreconf instead of autoconf.
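For the OpenMP case specifically, a minimal sketch of the two files could look like this (assuming autoconf >= 2.62 for the AC_OPENMP macro and a C++ codebase; in a real package you would usually also query R CMD config CXX so that configure tests the same compiler R will use):
configure.ac:
AC_INIT([xyz],[0.1])
AC_PROG_CXX
AC_LANG([C++])
AC_OPENMP
AC_SUBST([OPENMP_CXXFLAGS])
AC_CONFIG_FILES([src/Makevars])
AC_OUTPUT
src/Makevars.in:
PKG_CXXFLAGS = @OPENMP_CXXFLAGS@
PKG_LIBS = @OPENMP_CXXFLAGS@
Running autoreconf in the package root regenerates configure; at install time R runs ./configure, which writes src/Makevars, and on a toolchain without OpenMP the substituted flags simply expand to nothing.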
