R library path for two Linux clusters

I am working with two Linux clusters that share the same file system.
Because of that, when I install libraries on one of the clusters, they
get installed in the same folder (/home/R), shared by both clusters,
which causes conflicts if I later work on the other cluster.
Do you know of any environment variable, or even any hidden R config,
I could use so that, upon starting R (or RStudio) on a cluster, it
detects which cluster it is running on and uses the corresponding path
for the libraries (for instance /home/R/cluster1 and /home/R/cluster2)?
Thanks.

Yes, it should be pretty straightforward. Create an Rprofile.site file (see the Initialization at Startup docs for where this goes). In that file, you can write R code to detect which cluster you're on.
Once you know which cluster you're on, use the .libPaths() function (see libPaths docs) to change the library path.
R will run the Rprofile.site file every time a new session starts up, so each session should get its library path adjusted appropriately for the cluster it's on.
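For instance, a minimal Rprofile.site sketch along those lines might look like the following (the hostname patterns are placeholders; adjust them to however your two clusters can actually be told apart, e.g. via Sys.info() or a cluster-specific environment variable):
host <- Sys.info()[["nodename"]]
if (grepl("^cluster1", host)) {
  .libPaths("/home/R/cluster1")   # the directory must already exist, or it is silently dropped
} else if (grepl("^cluster2", host)) {
  .libPaths("/home/R/cluster2")
}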

Related

How can I permanently set an environment variable using Autotools?

I'm adapting an existing program to use Autotools for its build, but the resulting process depends on an environment variable. Is there a way to permanently set this environment variable during the build or installation process?
The program is intended for Unix users, and I could simply append an export command to the .bashrc file and warn the user if that fails, since most of them will actually just run it on Ubuntu (it's a relatively simple program aimed at students), but I'd like to know if there's a more portable way to do this.
This is what I would like to avoid:
echo 'export VAR=/my/totally/not/hardcoded/path' >> $HOME/.bashrc
Sorry to come to this late, but all of the answers to date are shockingly ... incomplete.
Building and installing software are both core use cases for the Autotools, and the installation part can certainly involve adding or modifying files that affect user environments. If the software is installed by a user with sufficient privilege, such effects can be applied to all system users, though the details vary a bit from system to system (and the Autotools can help with that, too!).
For example, on Red Hat-family Linuxes such as Red Hat Enterprise Linux, Fedora, Oracle Linux, and various others, you can drop an appropriately named file into /etc/profile.d, and the commands in it will automatically be read and executed by every login shell. Setting environment variables for all users is one of the common uses of this feature. I'm uncertain whether Debian-family Linuxes such as Ubuntu support the same mechanism, but you can always modify /etc/profile instead to the same effect, and you can write an Automake install hook to do that.
Or, for an altogether different approach, you can always provide a wrapper script around your program that sets the needed environment variables (assuming the point is not simply to add a directory to the PATH so that the program can be found in the first place). In that case, you can even install the main program in a location that is not ordinarily in the PATH, so that users don't accidentally run it directly. This mechanism has the advantage that the environment variables are scoped to a run of the program rather than a whole login session, but the disadvantage that users cannot override them.
I guess not.
The Autotools are about building your program, not about setting up the environment the program runs in. That's what users/admins are supposed to do. (Well, I can imagine doing this, but I really don't want to try to figure it out, because the idea itself seems broken to me.)
If your program really needs some environment variable at run time, then you should patch your application's sources to test whether the variable exists and, if it doesn't, fall back to a sensible default value. Another idea is to require a mandatory command-line switch to pass the value in.
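To illustrate that fallback pattern, here is the same idea in R syntax (R being the language used elsewhere on this page; the variable name and default path are made up for the example):
data_dir <- Sys.getenv("MYPROG_DATA_DIR", unset = "/usr/local/share/myprog")  # use the default when the variable is unset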
It's not clear what this has to do with Autotools (or any other build system). No build system, by itself, can arrange for an environment variable to be present when the program it builds is run at a later time.
One solution is for your program to have a hardcoded default value for the variable, used whenever the environment variable isn't present when the program starts running. Another frequently used solution is to name your binary something like myprog.bin and install a shell script named myprog which sets up the environment before doing exec myprog.bin.
I'm adapting an existing program to use Autotools for its build, but the resulting process depends on an environment variable. Is there a way to permanently set this environment variable during the build or installation process?
You've not been very concrete about what the program is (e.g. is it a daemon? a user program?) or the nature of the environment variable dependency (e.g. is it another program? a mount point? a URL? a DB connection string?). Being more specific might get you a better answer.
Anyway, Autotools is not likely to offer any feature to help: it's a build system. Depending on the nature of your environment variable dependency, you're likely going to need package management (if you package it) or system-administration-level setup.
Since you think your primary user base is on Ubuntu this help page might give you some ideas.

How to share R package locally without using Github/miniCRAN/other internet sites

I created a local R-package (gmad) on one computer. I want to copy this package to a server and continue developing it.
I have the "project folder" in my current working directory where I wrote the R code etc. (includes the R functions explicitly saved in separate files).
There is also a folder with the same name in the directory containing all the installed packages - which has a sub-dolder named "R" but unlike the previous folder, it does not contain separate files for different functions. I'm guessing this is the "installed" version of the package and that I won't be able to modify the R code in it.
What is the best way to copy this to the new computer. On the original computer, .libPaths() shows two locations for installed packages. I don't really care about these on this machine however I want to follow the correct procedure for the server. What is the best practice - should there be only one location and if so, what should it be?
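One hedged sketch of such a transfer (the package name comes from the question; the version number and library path are placeholders): copy the source "project folder" to the server, build it, and install the resulting tarball into whichever library you want .libPaths() to point at.
# in a shell on the server, next to the copied gmad/ source folder:
#   R CMD build gmad                # produces e.g. gmad_0.1.0.tar.gz
# then, inside R:
install.packages("gmad_0.1.0.tar.gz", repos = NULL, type = "source",
                 lib = "/home/user/R/library")
.libPaths("/home/user/R/library")   # put that library first on the search path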

Schedule a function that belongs to an R package

I'm trying to build an R package whose goal is to run a series of analyses by taking input data and writing output data to an external database (PostgreSQL).
Specifically, I need a set of operations to be scheduled to run on a daily basis. Therefore, I have written some executable R scripts (with the header #!/usr/bin/env Rscript) and saved them in the exec/ folder of the R package. The scripts make several calls to the package's core functions in the R/ folder.
At this point, once the package is installed on a Linux server, how do I set up a crontab entry that can directly access the scripts in the exec/ folder?
Is this way of proceeding correct, or is there a different best practice for such operations?
We do this all the bleeping time at work. Here at home I also have a couple of recurring cron jobs, e.g. for CRANberries. The exec/ folder you reference works, but my preferred solution is to use, say, inst/scripts/someScript.R.
Then, one initial time, you need to create a symlink from your package library, say /usr/local/lib/R/site-library/myPackage/scripts/someScript.R, to a directory in the $PATH, say /usr/local/bin.
The key aspect is that the symlink persists even as you update the package. So now you are golden: all you need is a crontab entry referencing someScript.R. We use a mix of Rscript and littler scripts.
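A hedged sketch of the pieces involved, reusing the illustrative names above (the schedule and log path are made up):
# from R: find where the installed copy of the script lives
system.file("scripts", "someScript.R", package = "myPackage")
# -> "/usr/local/lib/R/site-library/myPackage/scripts/someScript.R" (or similar)
# one-time shell setup, then the crontab entry (shown here as comments):
#   ln -s /usr/local/lib/R/site-library/myPackage/scripts/someScript.R /usr/local/bin/
#   0 2 * * * /usr/local/bin/someScript.R >> /var/log/someScript.log 2>&1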

How do I setup and run SparkR projects and scripts (like a jar file)?

We have successfully gone through all the SparkR tutorials about setting it up and running basic programs in RStudio on an EC2 instance.
What we can't figure out now is how to then create a project with SparkR as a dependency, compile/jar it, and run any of the various R programs within it.
We're coming from Scala and Java, so we may be approaching this with the wrong mindset. Is this even possible in R, or is it done differently than with Java's build files and jars, or do you just have to run each R script individually without a packaged jar?
do you just have to run each R script individually without a packaged jar?
More or less. While you can create an R package (or packages) to store reusable parts of your code (see for example devtools::create or R packages) and optionally distribute it over the cluster (since the current public API is limited to high-level interactions with the JVM backend, it shouldn't be required), what you pass to spark-submit is simply a single R script (sketched below) which:
creates a SparkContext - SparkR::sparkR.init
creates a SQLContext / HiveContext - SparkR::sparkRSQL.init / SparkR::sparkRHive.init
executes the rest of your code
stops the SparkContext - SparkR::sparkR.stop
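Put together, such a script might look roughly like this (a sketch against the old SparkR 1.x API named above, not a definitive file; you would then pass it to spark-submit as a single R file):
#!/usr/bin/env Rscript
library(SparkR)
sc <- sparkR.init(appName = "my_sparkr_job")   # create the SparkContext
sqlContext <- sparkRSQL.init(sc)               # create the SQLContext
df <- createDataFrame(sqlContext, faithful)    # ... the rest of your code goes here ...
print(head(df))
sparkR.stop()                                  # stop the SparkContext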
Assuming that external dependencies are present on the workers, missing packages can be installed at runtime using an if-not-require pattern, for example:
if(!require("some_package")) install.packages("some_package")
or
if(!require("some_package")) devtools::install_github("some_user/some_package")

Update of an R package which is lazy loaded

I have several Unix servers using an R package which is installed in a shared R library folder. The packages are lazy loaded (that's the default) from this shared folder.
Now I want to update the package:
1) is it possible (and clean) to do that without closing all R instances?
2) More precisely, I am concerned about the following:
2)a) the warning I get from the user interface when I try to install a package that is already loaded (the one recommending an R restart), and
2)b)
From https://cran.r-project.org/doc/manuals/r-release/R-ints.html#Lazy-loading,
When a package/namespace which uses it is loaded, the package/namespace environment is populated with promises for all the named objects: when these promises are evaluated they load the actual code from a database.
Does that mean that the R instance will read from the library folder again when it actually evaluates each object? If so, I would need either to deactivate lazy loading or to close all R instances before updating the package.
3) is there an alternative way to maintain R packages on a network of servers that are running scripts all the time, without having to take each server offline one by one?
Thanks for your input
You asked
1) is it possible (and clean) to do that without closing all R instances?
and I can assure you that yes, that is how it works and how it is done everywhere.
As for
2) More precisely, I am concerned about the following:
you are reading it wrong. An R restart is simply recommended to ensure the new package is loaded as you cannot insert it into a running session.
Further
3) is there an alternative way to maintain R packages on a network of servers that are running scripts all the time, without having to take each server offline one by one?
you never have to take a server off-line just to update a user-space package. E.g. we don't even take them off-line when we, say, upgrade the entire Ubuntu release twice a year.
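For what it's worth, a minimal sketch of such an in-place update (the shared library path and package name are illustrative):
install.packages("myPackage", lib = "/shared/R/library")   # update the package in the shared library
# sessions started (or restarted) after this point pick up the new version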
