R - Connect Scripts via Pipes - r

I have a number of R scripts that I would like to chain together using a UNIX-style pipeline. Each script would take as input a data frame and provide a data frame as output. For example, I am imagining something like this that would run in R's batch mode.
cat raw-input.Rds | step1.R | step2.R | step3.R | step4.R > result.Rds
Any thoughts on how this could be done?

Writing executable scripts is not the hard part, what is tricky is how to make the scripts read from files and/or pipes. I wrote a somewhat general function here: https://stackoverflow.com/a/15785789/1201032
Here is an example where the I/O takes the form of csv files:
Your step?.R files should look like this:
#!/usr/bin/Rscript
OpenRead <- function(arg) {
if (arg %in% c("-", "/dev/stdin")) {
file("stdin", open = "r")
} else if (grepl("^/dev/fd/", arg)) {
fifo(arg, open = "r")
} else {
file(arg, open = "r")
}
}
args <- commandArgs(TRUE)
file <- args[1]
fh.in <- OpenRead(file)
df.in <- read.csv(fh.in)
close(fh.in)
# do something
df.out <- df.in
# print output
write.csv(df.out, file = stdout(), row.names = FALSE, quote = FALSE)
and your csv input file should look like:
col1,col2
a,1
b,2
Now this should work:
cat in.csv | ./step1.R - | ./step2.R -
The - are annoying but necessary. Also make sure to run something like chmod +x ./step?.R to make your scripts executables. Finally, you could store them (and without extension) inside a directory that you add to your PATH, so you will be able to run it like this:
cat in.csv | step1 - | step2 -

Why on earth you want to cram your workflow into pipes when you have the whole R environment available is beyond me.
Make a main.r containing the following:
source("step1.r")
source("step2.r")
source("step3.r")
source("step4.r")
That's it. You don't have to convert the output of each step into a serialised format; instead you can just leave all your R objects (datasets, fitted models, predicted values, lattice/ggplot graphics, etc) as they are, ready for the next step to process. If memory is a problem, you can rm any unneeded objects at the end of each step; alternatively, each step can work with an environment which it deletes when done, first exporting any required objects to the global environment.
If modular code is desired, you can recast your workflow as follows. Encapsulate the work done by each file into one or more functions. Then call these functions in your main.r with the appropriate arguments.
source("step1.r") # defines step1_read_input, step1_f2
source("step2.r") # defines step2_f2
source("step3.r") # defines step3_f1, step3_f2, step3_f3
source("step4.r") # defines step4_write_output
step1_read_input(...)
step1_f2(...)
....
step4write_output(...)

You'll need to add a line at the top of each script to read in from stdin. Via this answer:
in_data <- readLines(file("stdin"),1)
You'll also need to write the output of each script to stdout().

Related

How can I pass the names of a list of files from bash to an R program?

I have a long list of files with names like: file-typeX-sectorY.tsv, where X and Y get values from 0-100. I process each of those files with an R program, but read them one by one like this:
data <- read.table(file='my_info.tsv', sep = '\t', header = TRUE, fill = TRUE)
it is impractical. I want to build a bash program that does something like
#!/bin/bash
for i in {0..100..1}
do
for j in {1..100..1)
do
Rscript program.R < file-type$i-sector$j.tsv
done
done
My problem is not with the bash script but with the R program. How can I receive the files one by one? I have googled and tried instructions like:
args <- commandArgs(TRUE)
either
data <- commandArgs(trailingOnly = TRUE)
but I can't find the way. Could you please help me?
At the simplest level your problem may be the (possible accidental ?) redirect you have -- so remove the <.
Then a mininmal R 'program' to take a command-line argument and do something with it would be
#!/usr/bin/env Rscript
args <- commandArgs(trailingOnly = TRUE)
stopifnot("require at least one arg" = length(args) > 0)
cat("We were called with '", args[1], "'\n", sep="")
We use a 'shebang' line and make it chmod 0755 basicScript.R to be runnable. The your shell double loop, reduced here (and correcting one typo) becomes
#!/bin/bash
for i in {0..2..1}; do
for j in {1..2..1}; do
./basicScript.R file-type${i}-sector${j}.tsv
done
done
and this works as we hope with the inner program reflecting the argument:
$ ./basicCaller.sh
We were called with 'file-type0-sector1.tsv'
We were called with 'file-type0-sector2.tsv'
We were called with 'file-type1-sector1.tsv'
We were called with 'file-type1-sector2.tsv'
We were called with 'file-type2-sector1.tsv'
We were called with 'file-type2-sector2.tsv'
$
Of course, this is horribly inefficient as you have N x M external processes. The two outer loops could be written in R, and instead of calling the script you would call your script-turned-function.

How to pass bash variable into R script

I have a couple of R scripts that processes data in a particular input folder. I have a few folders I need to run this script on, so I started writing a bash script to loop through these folders and run those R scripts.
I'm not familiar with R at all (the script was written by a previous worker and it's basically a black box for me), and I'm inexperienced with passing variables through scripts, especially involving multiple languages. There's also an issue present when I call source("$SWS_output/Step_1_Setup.R") here - R isn't reading my $SWS_output as a variable, but rather a string.
Here's my bash script:
#!/bin/bash
# Inputs
workspace="`pwd`"
preprocessed="$workspace/6_preprocessed"
# Output
SWS_output="$workspace/7_SKSattempt4_results/"
# create output directory
mkdir -p $SWS_output
# Copy data from preprocessed to SWS_output
cp -a $preprocessed/* $SWS_output
# Loop through folders in the output and run the R code on each folder
for qdir in $SWS_output/*/; do
qdir_name=`basename $qdir`
echo -e 'source("$SWS_output/Step_1_Setup.R") \n source("$SWS_output/(Step_2_data.R") \n q()' | R --no-save
done
I need to pass the variable "qdir" into the second R script (Step_2_data.R) to tell it which folder to process.
Thanks!
My previous answer was incomplete. Here is a better effort to explain command line parsing.
It is pretty easy to use R's commandArgs function to process command line arguments. I wrote a small tutorial https://gitlab.crmda.ku.edu/crmda/hpcexample/tree/master/Ex51-R-ManySerialJobs. In cluster computing this works very well for us. The whole hpcexample repo is open source/free.
The basic idea is that in the command line you can run R with command line arguments, as in:
R --vanilla -f r-clargs-3.R --args runI=13 parmsC="params.csv" xN=33.45
In this case, my R program is a file r-clargs-3.R and the arguments that the file will import are three space separated elements, runI, parmsC, xN. You can add as many of these space separated parameters as you like. It is completely at your discretion what these are called, but it is required they are separated by spaces and there is NO SPACE around the equal signs. Character string variables should be quoted.
My habit is to name the arguments with suffix "I" to hint that it is an integer, "C" is for character, and "N" is for floating point numbers.
In the file r-clargs-3.R, include some code to read the arguments and sort through them. For example, my tutorial's example
cli <- commandArgs(trailingOnly = TRUE)
args <- strsplit(cli, "=", fixed = TRUE)
The rest of the work is sorting through the args, and this is my most evolved stanza to sort through arguments (because it looks for suffixes "I", "N", "C", and "L" (for logical)), and then it coerces the inputs to the correct variable types (all input variables are characters, unless we coerce with as.integer(), etc):
for (e in args) {
argname <- e[1]
if (! is.na(e[2])) {
argval <- e[2]
## regular expression to delete initial \" and trailing \"
argval <- gsub("(^\\\"|\\\"$)", "", argval)
}
else {
# If arg specified without value, assume it is bool type and TRUE
argval <- TRUE
}
# Infer type from last character of argname, cast val
type <- substring(argname, nchar(argname), nchar(argname))
if (type == "I") {
argval <- as.integer(argval)
}
if (type == "N") {
argval <- as.numeric(argval)
}
if (type == "L") {
argval <- as.logical(argval)
}
assign(argname, argval)
cat("Assigned", argname, "=", argval, "\n")
}
That will create variables in the R session named paramsC, runI, and xN.
The convenience of this approach is that the same base R code can be run with 100s or 1000s of command parameter variations. Good for Monte Carlo simulation, etc.
Thanks for all the answers they were very helpful. I was able to get a solution that works. Here's my completed script.
#!/bin/bash
# Inputs
workspace="`pwd`"
preprocessed="$workspace/6_preprocessed"
# Output
SWS_output="$workspace/7_SKSattempt4_results"
# create output directory
mkdir -p $SWS_output
# Copy data from preprocessed to SWS_output
cp -a $preprocessed/* $SWS_output
cd $SWS_output
# Loop through folders in the output and run the R code on each folder
for qdir in $SWS_output/*/; do
qdir_name=`basename $qdir`
echo $qdir_name
export VARIABLENAME=$qdir
echo -e 'source("Step_1_Setup.R") \n source("Step_2_Data.R") \n q()' | R --no-save --slave
done
And then the R script looks like this:
qdir<-Sys.getenv("VARIABLENAME")
pathname<-qdir[1]
As a couple of comments have pointed out, this isn't best practice, but this worked exactly as I wanted it to. Thanks!

how to pass an argument as a variable name in R

I need help with passing an argument as a variable name in an R script from the terminal. I'll run the script as follows:
R < script.R --args "hello"
And, in the script there should be something like this:
args <- commandArgs(trailingOnly = TRUE)
assign(args[1],24)
save(args[1], file="output.RData")
But, I need to take the argument as the variable name. What I mean is the following: If I run the script with "numbers" argument, the variable name inside the script should be numbers.
assign(args[1], 24)
does the trick. But, inside the save function, args[1] does not work. How can I pass it as a variable name?
Does it work if you try
saveRDS(get(args[1]),file="output.rds")
?
You won't get a text file with the save function. If you want its text version you would need to use `dump". This would "work" despirte the extention. The file ois still an .Rdata file event without the extension:
arg=1
argname="reports"
assign(argname, arg)
reports
#[1] 1
save(reports, file="test.txt")
rm(reports)
rm(argname)
rm(arg)
load("test.txt")
To use dump:
dump('reports', file="test2.txt")
This would appear in that file. It should be parse-able (and readable to humans) R code:
reports <-
1

Tensorflow: How to convert .meta, .data and .index model files into one graph.pb file

In tensorflow the training from the scratch produced following 6 files:
events.out.tfevents.1503494436.06L7-BRM738
model.ckpt-22480.meta
checkpoint
model.ckpt-22480.data-00000-of-00001
model.ckpt-22480.index
graph.pbtxt
I would like to convert them (or only the needed ones) into one file graph.pb to be able to transfer it to my Android application.
I tried the script freeze_graph.py but it requires as an input already the input.pb file which I do not have. (I have only these 6 files mentioned before). How to proceed to get this one freezed_graph.pb file? I saw several threads but none was working for me.
You can use this simple script to do that. But you must specify the names of the output nodes.
import tensorflow as tf
meta_path = 'model.ckpt-22480.meta' # Your .meta file
output_node_names = ['output:0'] # Output nodes
with tf.Session() as sess:
# Restore the graph
saver = tf.train.import_meta_graph(meta_path)
# Load weights
saver.restore(sess,tf.train.latest_checkpoint('path/of/your/.meta/file'))
# Freeze the graph
frozen_graph_def = tf.graph_util.convert_variables_to_constants(
sess,
sess.graph_def,
output_node_names)
# Save the frozen graph
with open('output_graph.pb', 'wb') as f:
f.write(frozen_graph_def.SerializeToString())
If you don't know the name of the output node or nodes, there are two ways
You can explore the graph and find the name with Netron or with console summarize_graph utility.
You can use all the nodes as output ones as shown below.
output_node_names = [n.name for n in tf.get_default_graph().as_graph_def().node]
(Note that you have to put this line just before convert_variables_to_constants call.)
But I think it's unusual situation, because if you don't know the output node, you cannot use the graph actually.
As it may be helpful for others, I also answer here after the answer on github ;-).
I think you can try something like this (with the freeze_graph script in tensorflow/python/tools) :
python freeze_graph.py --input_graph=/path/to/graph.pbtxt --input_checkpoint=/path/to/model.ckpt-22480 --input_binary=false --output_graph=/path/to/frozen_graph.pb --output_node_names="the nodes that you want to output e.g. InceptionV3/Predictions/Reshape_1 for Inception V3 "
The important flag here is --input_binary=false as the file graph.pbtxt is in text format. I think it corresponds to the required graph.pb which is the equivalent in binary format.
Concerning the output_node_names, that's really confusing for me as I still have some problems on this part but you can use the summarize_graph script in tensorflow which can take the pb or the pbtxt as an input.
Regards,
Steph
I tried the freezed_graph.py script, but the output_node_name parameter is totally confusing. Job failed.
So I tried the other one: export_inference_graph.py.
And it worked as expected!
python -u /tfPath/models/object_detection/export_inference_graph.py \
--input_type=image_tensor \
--pipeline_config_path=/your/config/path/ssd_mobilenet_v1_pets.config \
--trained_checkpoint_prefix=/your/checkpoint/path/model.ckpt-50000 \
--output_directory=/output/path
The tensorflow installation package I used is from here:
https://github.com/tensorflow/models
First, use the following code to generate the graph.pb file.
with tf.Session() as sess:
# Restore the graph
_ = tf.train.import_meta_graph(args.input)
# save graph file
g = sess.graph
gdef = g.as_graph_def()
tf.train.write_graph(gdef, ".", args.output, True)
then, use summarize graph get the output node name.
Finally, use
python freeze_graph.py --input_graph=/path/to/graph.pbtxt --input_checkpoint=/path/to/model.ckpt-22480 --input_binary=false --output_graph=/path/to/frozen_graph.pb --output_node_names="the nodes that you want to output e.g. InceptionV3/Predictions/Reshape_1 for Inception V3 "
to generate the freeze graph.

Retrieving Variable Declaration

How can I find how did I first declare a certain variable when I am a few hundred
lines down from where I first declared it. For example, I have declared the following:
a <- c(vectorA,vectorB,vectorC)
and now I want to see how I declared it. How can I do that?
Thanks.
You could try using the history command:
history(pattern = "a <-")
to try to find lines in your history where you assigned something to the variable a. I think this matches exactly, though, so you may have to watch out for spaces.
Indeed, if you type history at the command line, it doesn't appear to be doing anything fancier than saving the current history in a tempfile, loading it back in using readLines and then searching it using grep. It ought to be fairly simple to modify that function to include more functionality...for example, this modification will cause it to return the matching lines so you can store it in a variable:
myHistory <- function (max.show = 25, reverse = FALSE, pattern, ...)
{
file1 <- tempfile("Rrawhist")
savehistory(file1)
rawhist <- readLines(file1)
unlink(file1)
if (!missing(pattern))
rawhist <- unique(grep(pattern, rawhist, value = TRUE,
...))
nlines <- length(rawhist)
if (nlines) {
inds <- max(1, nlines - max.show):nlines
if (reverse)
inds <- rev(inds)
}
else inds <- integer()
#file2 <- tempfile("hist")
#writeLines(rawhist[inds], file2)
#file.show(file2, title = "R History", delete.file = TRUE)
rawhist[inds]
}
I will assume you're using the default R console. If you're on Windows, you can File -> Save history and open the file in your fav text browser, or you can use function savehistory() (see help(savehistory)).
What you need to do is get a (good) IDE, or at least a decent text editor. You will benevit from code folding, syntax coloring and much more. There's a plethora of choices, from Tinn-R, VIM, ESS, Eclipse+StatET, RStudio or RevolutionR among others.
You can run grep 'a<-' .Rhistory from terminal (assuming that you've cdd to your working directory). ESS has several very useful history-searching functions, like (comint-history-isearch-backward-regexp) - binded to M-r by default.
For further info, consult ESS manual: http://ess.r-project.org/Manual/ess.html
When you define a function, R stores the source code of the function (preserving formatting and comments) in an attribute named "source". When you type the name of the function, you will get this content printed.
But it doesn't do this with variables. You can deparse a variable, which generates an expression that will produce the variable's value but this doesn't need to be the original expression. For example when you have b <- c(17, 5, 21), deparse(b) will produce the string "c(17, 5, 21)".
In your example, however, the result wouldn't be "c(vectorA,vectorB,vectorC)", it would be an expression that produces the combined result of your three vectors.

Resources