How to pass a bash variable into an R script

I have a couple of R scripts that process data in a particular input folder. I have a few folders I need to run these scripts on, so I started writing a bash script to loop through the folders and run the R scripts on each one.
I'm not familiar with R at all (the script was written by a previous worker and is basically a black box to me), and I'm inexperienced with passing variables between scripts, especially across languages. There's also an issue when I call source("$SWS_output/Step_1_Setup.R") here - R isn't reading my $SWS_output as a variable, but as a literal string.
Here's my bash script:
#!/bin/bash
# Inputs
workspace="`pwd`"
preprocessed="$workspace/6_preprocessed"
# Output
SWS_output="$workspace/7_SKSattempt4_results/"
# create output directory
mkdir -p $SWS_output
# Copy data from preprocessed to SWS_output
cp -a $preprocessed/* $SWS_output
# Loop through folders in the output and run the R code on each folder
for qdir in $SWS_output/*/; do
qdir_name=`basename $qdir`
echo -e 'source("$SWS_output/Step_1_Setup.R") \n source("$SWS_output/Step_2_data.R") \n q()' | R --no-save
done
I need to pass the variable "qdir" into the second R script (Step_2_data.R) to tell it which folder to process.
Thanks!

My previous answer was incomplete. Here is a better effort to explain command line parsing.
It is pretty easy to use R's commandArgs function to process command line arguments. I wrote a small tutorial at https://gitlab.crmda.ku.edu/crmda/hpcexample/tree/master/Ex51-R-ManySerialJobs. In cluster computing this works very well for us. The whole hpcexample repo is open source/free.
The basic idea is that in the command line you can run R with command line arguments, as in:
R --vanilla -f r-clargs-3.R --args runI=13 parmsC="params.csv" xN=33.45
In this case, my R program is the file r-clargs-3.R, and the arguments the file will import are three space-separated elements: runI, parmsC, xN. You can add as many of these space-separated parameters as you like. What they are called is entirely at your discretion, but it is required that they are separated by spaces and that there is NO SPACE around the equals signs. Character-string values should be quoted.
My habit is to name the arguments with suffix "I" to hint that it is an integer, "C" is for character, and "N" is for floating point numbers.
In the file r-clargs-3.R, include some code to read the arguments and sort through them. For example, my tutorial's example
cli <- commandArgs(trailingOnly = TRUE)
args <- strsplit(cli, "=", fixed = TRUE)
The rest of the work is sorting through the args. This is my most evolved stanza for doing that (it looks for the suffixes "I", "N", "C", and "L" (for logical)) and then coerces the inputs to the correct types (all input values arrive as character strings unless we coerce with as.integer(), etc.):
for (e in args) {
  argname <- e[1]
  if (!is.na(e[2])) {
    argval <- e[2]
    ## regular expression to delete initial \" and trailing \"
    argval <- gsub("(^\\\"|\\\"$)", "", argval)
  } else {
    ## If arg specified without value, assume it is bool type and TRUE
    argval <- TRUE
  }
  ## Infer type from last character of argname, cast val
  type <- substring(argname, nchar(argname), nchar(argname))
  if (type == "I") {
    argval <- as.integer(argval)
  }
  if (type == "N") {
    argval <- as.numeric(argval)
  }
  if (type == "L") {
    argval <- as.logical(argval)
  }
  assign(argname, argval)
  cat("Assigned", argname, "=", argval, "\n")
}
That will create variables in the R session named parmsC, runI, and xN.
The convenience of this approach is that the same base R code can be run with 100s or 1000s of command parameter variations. Good for Monte Carlo simulation, etc.

Thanks for all the answers; they were very helpful. I was able to get a solution that works. Here's my completed script.
#!/bin/bash
# Inputs
workspace="$(pwd)"
preprocessed="$workspace/6_preprocessed"
# Output
SWS_output="$workspace/7_SKSattempt4_results"
# create output directory
mkdir -p "$SWS_output"
# Copy data from preprocessed to SWS_output
cp -a "$preprocessed"/* "$SWS_output"
cd "$SWS_output"
# Loop through folders in the output and run the R code on each folder
for qdir in "$SWS_output"/*/; do
    qdir_name=$(basename "$qdir")
    echo "$qdir_name"
    export VARIABLENAME="$qdir"
    echo -e 'source("Step_1_Setup.R") \n source("Step_2_Data.R") \n q()' | R --no-save --slave
done
And then the R script looks like this:
qdir <- Sys.getenv("VARIABLENAME")
pathname <- qdir[1]
As a couple of comments have pointed out, this isn't best practice, but this worked exactly as I wanted it to. Thanks!
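As an aside, the environment-variable handoff itself can be checked without R at all. This is a minimal sketch (the path and variable name are just placeholders) showing that a per-command assignment is visible inside the child process, and that quoting keeps paths with spaces intact:

```shell
# Minimal sketch: pass a folder path to a child process via the environment.
# VARIABLENAME and the path are placeholders, not names from the original scripts.
qdir="/tmp/demo dir/"
VARIABLENAME="$qdir" sh -c 'printf "child sees: %s\n" "$VARIABLENAME"'
# prints: child sees: /tmp/demo dir/
```

The same mechanism is what makes Sys.getenv("VARIABLENAME") work in the R script above.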

Related

How can I pass the names of a list of files from bash to an R program?

I have a long list of files with names like: file-typeX-sectorY.tsv, where X and Y get values from 0-100. I process each of those files with an R program, but read them one by one like this:
data <- read.table(file='my_info.tsv', sep = '\t', header = TRUE, fill = TRUE)
This is impractical. I want to build a bash program that does something like
#!/bin/bash
for i in {0..100..1}
do
for j in {1..100..1)
do
Rscript program.R < file-type$i-sector$j.tsv
done
done
My problem is not with the bash script but with the R program. How can I receive the files one by one? I have googled and tried instructions like:
args <- commandArgs(TRUE)
or
data <- commandArgs(trailingOnly = TRUE)
but I can't find the way. Could you please help me?
At the simplest level, your problem may be the (possibly accidental?) redirect you have -- so remove the <.
Then a minimal R 'program' to take a command-line argument and do something with it would be
#!/usr/bin/env Rscript
args <- commandArgs(trailingOnly = TRUE)
stopifnot("require at least one arg" = length(args) > 0)
cat("We were called with '", args[1], "'\n", sep="")
We use a 'shebang' line and make the script runnable with chmod 0755 basicScript.R. Then your shell double loop, reduced here (and correcting one typo), becomes
#!/bin/bash
for i in {0..2..1}; do
    for j in {1..2..1}; do
        ./basicScript.R file-type${i}-sector${j}.tsv
    done
done
and this works as we hope with the inner program reflecting the argument:
$ ./basicCaller.sh
We were called with 'file-type0-sector1.tsv'
We were called with 'file-type0-sector2.tsv'
We were called with 'file-type1-sector1.tsv'
We were called with 'file-type1-sector2.tsv'
We were called with 'file-type2-sector1.tsv'
We were called with 'file-type2-sector2.tsv'
$
Of course, this is horribly inefficient as you have N x M external processes. The two outer loops could be written in R, and instead of calling the script you would call your script-turned-function.
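For illustration, moving the two loops into a single R session might look like this sketch, assuming the per-file work has been refactored into a function (process_file here is a hypothetical name, not from the original program):

```r
## Sketch: loop over the files inside one R session instead of spawning N x M processes.
## process_file() is a hypothetical function refactored from the original script.
for (i in 0:100) {
  for (j in 1:100) {
    fname <- sprintf("file-type%d-sector%d.tsv", i, j)
    if (file.exists(fname)) process_file(fname)
  }
}
```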

How to specify input arguments to Rscript by name from command line?

I am new to command line usage and don't think this question has been asked elsewhere. I'm trying to adapt an Rscript to be run from the command line in a shell script. Basically, I'm using some tools in the immcantation framework to read and annotate some antibody NGS data, and then to group sequences into their clonal families. To set the similarity threshold, the creators recommend using a function in their shazam package to set an appropriate threshold.
I've made the simple script below to read and validate the arguments:
#!/usr/bin/env Rscript
params <- commandArgs(trailingOnly=TRUE)
### read and validate mode argument
mode <- params[1]
modeAllowed <- c("ham","aa","hh_s1f","hh_s5f")
if (!(mode %in% modeAllowed)) {
  stop(paste("illegal mode argument supplied. acceptable values are",
             paste(paste(modeAllowed, collapse = ", "), ".", sep = ""),
             "\nmode should be supplied first", sep = " "))
}
### execute function
cat(threshold)
The script works. However, since each parameter has only a finite number of options, I was wondering if there is a way of passing in the arguments by name, like --mode aa (for example), from the terminal? All the information I've seen online seems to use code like my mode <- params[1] above, which I guess only works if the mode argument comes first.
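There is no built-in --name value parsing in commandArgs(), though add-on packages such as optparse and argparse provide it. As a dependency-free sketch, flags can be paired with the values that follow them by position (assuming every flag is followed by exactly one value):

```r
## Minimal sketch: pair "--name value" arguments by position.
## Assumes every flag is immediately followed by exactly one value.
args <- commandArgs(trailingOnly = TRUE)
flag_pos <- grep("^--", args)
opts <- setNames(as.list(args[flag_pos + 1]), sub("^--", "", args[flag_pos]))
mode <- opts[["mode"]]   # e.g. "aa" when called with: Rscript script.R --mode aa
```

With this, the arguments can be given in any order, since they are looked up by name rather than position.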

R - Connect Scripts via Pipes

I have a number of R scripts that I would like to chain together using a UNIX-style pipeline. Each script would take as input a data frame and provide a data frame as output. For example, I am imagining something like this that would run in R's batch mode.
cat raw-input.Rds | step1.R | step2.R | step3.R | step4.R > result.Rds
Any thoughts on how this could be done?
Writing executable scripts is not the hard part; what is tricky is making the scripts read from files and/or pipes. I wrote a somewhat general function here: https://stackoverflow.com/a/15785789/1201032
Here is an example where the I/O takes the form of csv files:
Your step?.R files should look like this:
#!/usr/bin/Rscript
OpenRead <- function(arg) {
  if (arg %in% c("-", "/dev/stdin")) {
    file("stdin", open = "r")
  } else if (grepl("^/dev/fd/", arg)) {
    fifo(arg, open = "r")
  } else {
    file(arg, open = "r")
  }
}
args <- commandArgs(TRUE)
file <- args[1]
fh.in <- OpenRead(file)
df.in <- read.csv(fh.in)
close(fh.in)
# do something
df.out <- df.in
# print output
write.csv(df.out, file = stdout(), row.names = FALSE, quote = FALSE)
and your csv input file should look like:
col1,col2
a,1
b,2
Now this should work:
cat in.csv | ./step1.R - | ./step2.R -
The - are annoying but necessary. Also make sure to run something like chmod +x ./step?.R to make your scripts executable. Finally, you could store them (without extension) inside a directory that you add to your PATH, so you will be able to run them like this:
cat in.csv | step1 - | step2 -
Why on earth you want to cram your workflow into pipes when you have the whole R environment available is beyond me.
Make a main.r containing the following:
source("step1.r")
source("step2.r")
source("step3.r")
source("step4.r")
That's it. You don't have to convert the output of each step into a serialised format; instead you can just leave all your R objects (datasets, fitted models, predicted values, lattice/ggplot graphics, etc) as they are, ready for the next step to process. If memory is a problem, you can rm any unneeded objects at the end of each step; alternatively, each step can work with an environment which it deletes when done, first exporting any required objects to the global environment.
If modular code is desired, you can recast your workflow as follows. Encapsulate the work done by each file into one or more functions. Then call these functions in your main.r with the appropriate arguments.
source("step1.r") # defines step1_read_input, step1_f2
source("step2.r") # defines step2_f2
source("step3.r") # defines step3_f1, step3_f2, step3_f3
source("step4.r") # defines step4_write_output
step1_read_input(...)
step1_f2(...)
....
step4_write_output(...)
You'll need to add a line at the top of each script to read in from stdin. Via this answer:
in_data <- readLines(file("stdin"), 1)
You'll also need to write the output of each script to stdout().
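Put together, each step then reduces to a small filter. A minimal sketch of one such step, reading a CSV data frame from stdin and writing the result to stdout (the transformation is left as a placeholder):

```r
#!/usr/bin/env Rscript
## Minimal filter sketch: read CSV from stdin, pass it through, write CSV to stdout.
df <- read.csv(file("stdin"))
## ... transform df here ...
write.csv(df, stdout(), row.names = FALSE, quote = FALSE)
```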

Detect number of running R instances in Windows within R

I am creating an R script and would like it to detect the number of running instances of R in Windows, so the script can choose whether or not to run a particular set of scripts (i.e., if there are already >2 instances of R running, do X; else do Y).
Is there a way to do this within R?
EDIT:
Here is some info on the purpose as requested:
I have a very long set of scripts for applying a Bayesian network model using the catnet package to thousands of cases. This code processes and outputs results in a CSV file for each case. Most of the parallel-computing alternatives I have tried have not been ideal, as they suppress a lot of the built-in notification of progress, so I have been running subsets of the cases on different instances of R. I know this is somewhat antiquated, but it works for me, so I wanted a way to have the code subset the cases automatically based on the number of instances running.
I do this right now by hand, opening multiple instances of Rscript in CMD on slightly differently configured R files, to get something like this:
cd "Y:\code\BN_code"
START "" "C:\Program Files\R\R-3.0.0\bin\x64\Rscript.exe" "process spp data3.r" /b
START "" "C:\Program Files\R\R-3.0.0\bin\x64\Rscript.exe" "process spp data3_T1.r" /b
START "" "C:\Program Files\R\R-3.0.0\bin\x64\Rscript.exe" "process spp data3_T2.r" /b
START "" "C:\Program Files\R\R-3.0.0\bin\x64\Rscript.exe" "process spp data3_T3.r" /b
EDIT2:
Thanks to the answers below, here is my implementation of what I call 'poor man's parallel computing' in R:
So if you have any long script that has to be applied to a long list of cases, use the code below to break the list into a number of sublists, one to be fed to each instance of Rscript:
#the cases that I need to apply my code to:
splist=c("sp01", "sp02", "sp03", "sp04", "sp05", "sp06", "sp07", "sp08", "sp09", "sp010", "sp11", "sp12",
"sp013", "sp014", "sp015", "sp16", "sp17", "sp018", "sp19", "sp20", "sp21", "sp22", "sp23", "sp24")
###automatic subsetting of cases based on number of running instances of r script:
cpucores=as.integer(Sys.getenv('NUMBER_OF_PROCESSORS'))
n_instances=length(system('tasklist /FI "IMAGENAME eq Rscript.exe" ', intern = TRUE))-3
jnk=length(system('tasklist /FI "IMAGENAME eq rstudio.exe" ', intern = TRUE))-3
if (jnk>0)rstudiorun=TRUE else rstudiorun=FALSE
if (!rstudiorun & n_instances>0 & cpucores>1){ #if code is being run from rscript and
#not from rstudio and there is more than one core available
jnkn=length(splist)
jnk=seq(1,jnkn,round(jnkn/cpucores,0))
jnk=c(jnk,jnkn)
splist=splist[jnk[n_instances]:jnk[n_instances+1]]
}
###end automatic subsetting of cases
#perform your script on subset of list of cases:
for(sp in splist){
ptm0 <- proc.time()
Sys.sleep(6)
ptm1=proc.time() - ptm0
jnk=as.numeric(ptm1[3])
cat('\n','It took ', jnk, "seconds to do species", sp)
}
To make this code run on multiple instances of r automatically in windows, just create a .bat file:
cd "C:\Users\lfortini\code\misc code\misc r code"
START "" "C:\Program Files\R\R-3.0.0\bin\x64\Rscript.exe" "rscript_multiple_instances.r" /b
timeout 10
START "" "C:\Program Files\R\R-3.0.0\bin\x64\Rscript.exe" "rscript_multiple_instances.r" /b
timeout 10
START "" "C:\Program Files\R\R-3.0.0\bin\x64\Rscript.exe" "rscript_multiple_instances.r" /b
timeout 10
START "" "C:\Program Files\R\R-3.0.0\bin\x64\Rscript.exe" "rscript_multiple_instances.r" /b
exit
The timeout is there to give enough time for R to detect its own number of instances.
Clicking on this .bat file will automatically open up numerous instances of Rscript, each taking on a particular subset of the cases you want to analyse, while still showing all of the script's progress in each window. The nice thing about this approach is that you pretty much just have to slap the automated list-subsetting code in front of whichever iteration mechanism you are using (loops, apply functions, etc.). Then fire the code with Rscript via the .bat file, or manually, and you are set.
Actually it is easier than expected, as Windows comes with the handy tasklist command.
With it you can get all running processes, from which you simply need to count the number of Rscript.exe instances (I use stringr here for string manipulation).
require(stringr)
progs <- system("tasklist", intern = TRUE)
progs <- vapply(str_split(progs, "[[:space:]]"), "[[", "", i = 1)
sum(progs == "Rscript.exe")
That should do the trick. (I only tried it with counting instances of Rgui.exe but that works fine.)
You can make it even shorter, as below:
length(grep("rstudio\\.exe", system("tasklist", intern = TRUE)))
Replace rstudio with Rscript or any other process name.
Or even shorter
length(system('tasklist /FI "IMAGENAME eq Rscript.exe" ', intern = TRUE))-3
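The magic -3 in that one-liner assumes tasklist always prints exactly three header lines. Counting the lines that actually contain the image name is a bit more robust; a sketch (Windows only, and the helper name is mine, not from the answers above):

```r
## Sketch (Windows only): count running instances of a process by image name.
## Counting matched lines avoids relying on tasklist's header being exactly 3 lines.
count_instances <- function(img = "Rscript.exe") {
  out <- system(sprintf('tasklist /FI "IMAGENAME eq %s"', img), intern = TRUE)
  length(grep(img, out, fixed = TRUE))
}
```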

R, passing variables to a system command

Using R, I am looking to create a QR code and embed it into an Excel spreadsheet (hundreds of codes and spreadsheets). The obvious way seems to be to create the QR code from the command line, using the "system" command in R. Does anyone know how to pass R variables through the "system" command? Google is not too helpful, as "system" is a bit generic, and ?system does not contain any examples of this.
Note - I am actually using data matrices rather than QR codes, but using the term "data matrix" in an R question will lead to havoc, so let's talk QR codes instead. :-)
system("dmtxwrite my_r_variable -o image.png")
fails, as do the variants I have tried with "paste". Any suggestions gratefully received.
Let's say we have a variable x that we want to pass to dmtxwrite; you can pass it like:
x = 10
system(sprintf("dmtxwrite %s -o image.png", x))
or alternatively using paste:
system(paste("dmtxwrite", x, "-o image.png"))
but I prefer sprintf in this case.
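One caveat either way: if the value can contain spaces or shell metacharacters, quote it with base R's shQuote() before interpolating it into the command string. A small sketch (dmtxwrite as in the question; the value is a made-up example):

```r
## Sketch: shell-quote the value so paths with spaces survive interpolation.
x <- "my file"
cmd <- sprintf("dmtxwrite %s -o image.png", shQuote(x))
## cmd is now: dmtxwrite 'my file' -o image.png
```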
Making use of base::system2 may also be worth considering, as system2 provides an args argument that can be used for this purpose. In your example:
my_r_variable <- "a"
system2(
  'echo',
  args = c(my_r_variable, '-o image.png')
)
would return:
a -o image.png
which is equivalent to running echo in the terminal. You may also want to redirect output to text files:
system2(
  'echo',
  args = c(my_r_variable, '-o image.png'),
  stdout = 'stdout.txt',
  stderr = 'stderr.txt'
)
