Sharing config variables between bash and R

I need to share config variables between various R and bash programs. They all share various resources, especially a GRASS database.
I started by creating a bash script that sets up shell variables and then runs the R program, but this way R is blind to the shell variables:
$ cat testVars.R
Sys.getenv(c("WDIR","GDIR"))
$ cat testVars.sh
#!/bin/sh
WDIR="/Work/Project/"
GDIR=$WDIR"GRASSDATA"
Rscript testVars.R
$ ./testVars.sh
WDIR GDIR
"" ""
I then tried using the readRenviron function in R, thinking it could be used to source a bash file that sets up variables. However, this runs into a different problem: R does not substitute and concatenate variables the way bash does:
$ cat testVars.R
readRenviron("./testVars.sh")
Sys.getenv(c("WDIR","GDIR"))
$ cat testVars.sh
#!/bin/sh
WDIR="/Work/Project/"
GDIR=$WDIR"GRASSDATA"
$ Rscript testVars.R
WDIR GDIR
"/Work/Project/" "$WDIRGRASSDATA"
YAML is supported to some extent in both languages, but it suffers from the same lack of substitution and concatenation facilities. For instance, with YAML I would need to repeat the working directory countless times in the configuration file.
So here is what I am looking for: a configuration format that can be used by both R and bash and also allows variable concatenation.

I think all you need is to export variables in bash to make them accessible to R.
$ export TEST_VAR=42
$ Rscript -e "Sys.getenv('TEST_VAR')"
[1] "42"
Then concatenation can be handled with paste() or paste0().
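For example, the shared settings from the question could live in a single sh file of export statements that every bash script sources before starting R (a minimal sketch; config.sh is an invented name):
$ cat config.sh
export WDIR="/Work/Project/"
export GDIR="${WDIR}GRASSDATA"
$ cat testVars.sh
#!/bin/sh
. ./config.sh
Rscript testVars.R
$ ./testVars.sh
WDIR GDIR
"/Work/Project/" "/Work/Project/GRASSDATA"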

Related

Running multiple R scripts sequentially from shell in the same R session

Is it possible to run multiple .R files from the shell or a bash script, in sequence, in the same R session (so without having to write intermediate results to disk)?
E.g. if file1.R contains a=1 and file2.R contains print(a+1),
then do something like
$ Rscript file1.R file2.R
[1] 2
(of course a workaround would be to stitch the scripts together or have a master script sourcing 1 and 2)
You could write a wrapper script that sources each script in turn:
source("file1.R")
source("file2.R")
Call this source_files.R and then run Rscript source_files.R. Of course, with something this simple you can also just pass the statements on the command line:
Rscript -e 'source("file1.R"); source("file2.R")'
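If the file names vary, a small variant (a sketch, not from the original answer) is to loop over commandArgs(), so any number of scripts can be passed and sourced in order within the same session:
$ Rscript -e 'for (f in commandArgs(TRUE)) source(f)' file1.R file2.R
[1] 2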

Initializing variables in bash

I'm migrating a Windows CMD script to /bin/bash on Unix.
The goal of the initial script was to set up some variables so that anything subsequently run from that CMD window would use those variables.
How can I do the same on Unix? A simple
MyVar="value"
doesn't work: the variable is visible only inside the script itself, not from the terminal where the script was run.
You can initialize shell variables with simple assignments:
$ foo="fooval"
$ echo $foo
fooval
These variables won't automatically propagate to child processes:
$ foo=fooval
$ sh -c 'printf "\"%s\"" $foo'
""
To make them propagate, you need to export them into the shell's environment, turning them into "environment variables" (these are conventionally capitalized, e.g., FOO instead of foo):
$ export foo
$ sh -c 'echo $foo'
fooval
You can assign and export in one step:
$ export foo=fooval
Environment variables propagate only down the process hierarchy: to children, never to parents or to completely unrelated processes.
Therefore, if you have a script with variable assignments, you need to source it, not execute it:
$ ./envvars #won't affect the parent shell
$ . ./envvars #this will
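Putting it together, a quick demonstration of the difference (a small sketch; envvars here contains just the one export line):
$ unset foo
$ echo 'export foo=fooval' > envvars
$ chmod +x envvars
$ ./envvars; echo "<$foo>"
<>
$ . ./envvars; echo "<$foo>"
<fooval>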
There are no per-terminal variables (though there are per-terminal settings with fixed keys that can be inspected and changed with the stty tool).
Create a file test.sh and add the following line:
export b="key"
Now go to the terminal and do the following:
source ./test.sh
echo $b
Output:
key

Can I force Rscript to pass wildcard in command line arguments?

As best as I can sort out (since I haven't had much luck finding documentation of it), when one runs Rscript with a command argument that includes a wildcard *, the argument is expanded to a character vector of the matching file paths, or passed through unchanged if there are no matches. Is there a way to pass the wildcard through all the time, so I can handle it myself within the script (using things like Sys.glob, for example)?
Here's a minimal example, run from the terminal:
ls
## foo.csv bar.csv baz.txt
Rscript -e "print(commandArgs(T))" *.csv
## [1] "foo.csv" "bar.csv"
Rscript -e "print(commandArgs(T))" *.txt
## [1] "baz.txt"
Rscript -e "print(commandArgs(T))" *.rds
## [1] "*.rds"
EDIT: I have learned that this behavior comes from bash, not Rscript. Is there some way to work around this behavior from within R, or to suppress wildcard expansion for a particular R script but not the Rscript command? In my particular case, I want to run a script with two arguments, Rscript collapse.R *.rds out.rds, that concatenates the contents of many individual RDS files into a list and saves the result in out.rds. But since the wildcard gets expanded before being passed to R, I have no way of checking whether the second argument has been supplied.
If I understand correctly, you don't want bash to glob the wildcard for you; you want to pass the expression itself, e.g. *.csv. Some options include:
Pass the expression as a quoted string and process it within R, for example by using it directly in another command:
Rscript -e "list.files(pattern = commandArgs(T))" "*\.csv$"
Pass just the extension and build the rest of the pattern within R:
Rscript -e "list.files(pattern = paste0('*\\\\.', commandArgs(T)))" "csv$"
Disable globbing for that command, through complicated and usually unnecessary means; see: Stop shell wildcard character expansion?
Note: I've changed the argument to a regex to prevent it matching too greedily.
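For the collapse.R use case in the EDIT, one workaround is to quote the wildcard on the command line so bash passes it through unexpanded, and expand it with Sys.glob inside the script (a minimal sketch of what collapse.R might look like):
$ cat collapse.R
args <- commandArgs(trailingOnly = TRUE)
if (length(args) != 2) stop("usage: Rscript collapse.R <glob> <out.rds>")
files <- Sys.glob(args[1])                 # expand the wildcard inside R
saveRDS(lapply(files, readRDS), args[2])   # collect the objects into a list
$ Rscript collapse.R "*.rds" out.rds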

How To Run Different, Multiple Rscripts on SGE Cluster

I am trying to run different R scripts on an SGE cluster, where each script changes only by one variable (e.g. cancer <- "UVM" or "ACC", etc.).
I have attempted two approaches: either run a single R script that takes the 30 different cancer names as command line arguments,
OR
run a separate R script for each (i.e. UVM.r, ACC.r, etc.).
Either way, I am having a lot of difficulty figuring out how to submit these jobs so that I can either run one R script 30 times with a different argument each time, or run multiple R scripts with no command line arguments.
You can use a while loop in bash for this.
Set up an input file of arguments, e.g. args.txt:
UVM
ACC
Run qsub in a while loop to submit the script once for each argument:
while read arg
do
    echo "Rscript script.R ${arg}" | qsub <options>
done < args.txt
The above uses echo to pass the code to run to qsub.
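On the R side, script.R would pick the cancer name up from its arguments (a minimal sketch; the analysis itself is stood in for by a cat() call):
$ cat script.R
cancer <- commandArgs(trailingOnly = TRUE)[1]  # e.g. "UVM" or "ACC"
cat("Running analysis for", cancer, "\n")      # stand-in for the real analysis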
Alternatively, you can use an SGE array job with a script like this:
#!/bin/bash
#$ -t 1-30
# Each task discards the first SGE_TASK_ID arguments, so $1 becomes
# the argument belonging to this task.
shift ${SGE_TASK_ID}
exec Rscript script.R $1
Submit it like this: qsub job_script dummy UVM ACC ... (the leading dummy argument is needed because SGE_TASK_ID starts at 1, so shifting by the task ID lands $1 on the right name for each task).

Converting this code from R to Shell script?

So I'm running a program that works, but the issue is that my computer is not powerful enough to handle the task. I have code written in R, but I have access to a supercomputer that runs a Unix system (as one would expect).
The program is designed to read a .csv file, find every row with the unit ft3(monthly total) in the "Units" column, and select the value in the column before it. The files are charts that list things in multiple units.
To convert this program in R:
getwd()
setwd("/Users/youruserName/Desktop")
myData= read.table("yourFileName.csv", header=T, sep=",")
funData= subset(myData, units == "ft3(monthly total)", select=units:value)
write.csv(funData, file="funData.csv")
To a program in Shell Script, I tried:
pwd
cd /Users/yourusername/Desktop
touch RunThisProgram
nano RunThisProgram
(((In nano, I wrote)))
if
grep -r yourFileName.csv ft3(monthly total)
cat > funData.csv
else
cat > nofun.csv
fi
control+x (((used control x to close nano)))
chmod -x RunThisProgram
./RunThisProgram
(((It runs for a while)))
We get a funData.csv file output but that file is empty
What am I doing wrong?
It isn't actually running, because there are a couple of problems with your script:
grep needs the pattern first, and quoted; -r is for recursing a directory
if is missing a then
cat is called wrong, so it is actually reading from stdin
You really only need one line:
grep -F "ft3(monthly total)" yourFileName.csv > funData.csv
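If you also want to narrow the output to the value and unit columns (as the select= in the R version does), a hedged awk sketch, assuming comma-separated fields with no embedded commas and an exact match on the unit field:
$ awk -F, '{ for (i = 2; i <= NF; i++) if ($i == "ft3(monthly total)") print $(i-1) "," $i }' yourFileName.csv > funData.csv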
