R or bash command line length limit

I'm developing a bash program that executes an R one-liner to convert an R Markdown template into an HTML document.
This R one-liner looks like:
R -e 'library(rmarkdown) ; rmarkdown::render( "template.Rmd", "html_document", output_file = "report.html", output_dir = "'${OUTDIR}'", params = list( param1 = "'${PARAM1}'", param2 = "'${PARAM2}'", ... ) )'
I have a long list of parameters (say 10, to illustrate the problem), and it seems that R or bash has a command-line length limit.
When I execute the R one-liner with 10 parameters I get an error message like this:
WARNING: '-e library(rmarkdown)~+~;~+~rmarkdown::render(~+~"template.Rmd",~+~"html_document",~+~output_file~+~=~+~"report.html",~+~output_dir~+~=~+~"output/",~+~params~+~=~+~list(~+~param1~+~=~+~"param2", ...
Fatal error: you must specify '--save', '--no-save' or '--vanilla'
When I execute the R one-liner with 9 parameters it's fine (I tried different combinations to verify that the problem was not the last parameter).
When I execute the R one-liner with 10 parameters but with all the spaces removed, it's fine too, so I guess that R or bash enforces a command-line length limit.
R -e 'library(rmarkdown);rmarkdown::render("template.Rmd","html_document",output_file="report.html",output_dir="'${OUTDIR}'",params=list(param1="'${PARAM1}'",param2="'${PARAM2}'",...))'
Is it possible to increase this limit?
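As background on the literal question (this aside is mine, not part of the original post): on Unix-like systems the per-command argument limit is set by the kernel, and neither bash nor R can raise it from user space, though you can inspect it from the shell. Note also that R documents its own, much smaller cap on -e input, on the order of 10,000 bytes, which is a likelier culprit for a mangled one-liner than the OS limit.

```shell
# Print the OS limit (in bytes) on the combined length of a command's
# argument list and environment. This is fixed by the kernel; neither
# bash nor R can increase it from user space.
getconf ARG_MAX
```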

This will break in a number of ways, including when your arguments contain spaces or quotes.
Instead, try passing the values as arguments. Something like this should give you an idea how it works:
# create a script file
tee arguments.r << 'EOF'
argv <- commandArgs(trailingOnly=TRUE)
arg1 <- argv[1]
print(paste("Argument 1 was", arg1))
EOF
# set some values
param1="foo bar"
param2="baz"
# run the script with arguments
Rscript arguments.r "$param1" "$param2"
Expected output:
[1] "Argument 1 was foo bar"
Always quote your variables and always use lowercase variable names to avoid conflicts with system or application variables.
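A minimal sketch of why unquoted variables break (count_args is a throwaway helper of mine, not from the answer above): an unquoted expansion undergoes word splitting, so a single value containing a space arrives as two separate arguments.

```shell
# count_args reports how many positional arguments it received.
count_args() { echo "$#"; }

param="foo bar"
count_args $param     # unquoted: word splitting yields 2 arguments
count_args "$param"   # quoted: the value stays intact as 1 argument
```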


Parallelizing an Rscript using a job array in Slurm

I want to run an R script as a Slurm array job with tasks 1-10, where the task id from the job is passed to the script, which writes a file named "<task id>.out" containing the task id in its body. However, this has proven to be more challenging than I anticipated. I am trying the following:
~/bash_test.sh looks like:
#!/bin/bash -l
#SBATCH --time=00:01:00
#SBATCH --array=1-10
conda activate R
cd ~/test
R CMD BATCH --no-save --no-restore ~/Rscript_test.R $SLURM_ARRAY_TASK_ID
~/Rscript_test.R looks like:
#!/usr/bin/env Rscript
taskid = commandArgs(trailingOnly=TRUE)
# taskid <- Sys.getenv('SLURM_ARRAY_TASK_ID')
taskid <- as.data.frame(taskid)
# print task number
print(paste0("the number processed was... ", taskid))
write.table(taskid, paste0("~/test/",taskid,".out"),quote=FALSE, row.names=FALSE, col.names=FALSE)
After I submit my job (sbatch bash_test.sh), it looks like R is not really seeing SLURM_ARRAY_TASK_ID. The script is generating 10 files (1, 2, ..., 10 - just numbers - probably corresponding to the task ids), but it's not writing the files with the extension ".out": the script wrote an empty "integer(0).out" file.
What I wanted, was to populate the folder ~/test/ with 10 files, 1.out, 2.out, ..., 10.out, and each file has to contain the task id inside (simply the number 1, 2, ..., or 10, respectively).
P.S.: Note that I tried playing with Sys.getenv() too, but I don't think I was able to set it up properly: that option generates 10 files, plus one 1.out file containing the number 10.
P.S.2: This is Slurm 19.05.5. I am running R within a conda environment.
You should avoid using R CMD BATCH: it doesn't handle arguments the way most programs do, and Rscript has been the recommended option for a while now. By calling R CMD BATCH you are effectively ignoring the #!/usr/bin/env Rscript line of your script.
So change your script file to
#!/bin/bash -l
#SBATCH --time=00:01:00
#SBATCH --array=1-10
conda activate R
cd ~/test
Rscript ~/Rscript_test.R $SLURM_ARRAY_TASK_ID
And then be careful in your script that you aren't using the same variable as both a string and a data.frame. You can't easily paste a data.frame into a file path, for example. So
taskid <- commandArgs(trailingOnly=TRUE)
# taskid <- Sys.getenv('SLURM_ARRAY_TASK_ID') # This should also work
print(paste0("the number processed was... ", taskid))
outdata <- as.data.frame(taskid)
outfile <- paste0("~/test/", taskid, ".out")
write.table(outdata, outfile, quote=FALSE, row.names=FALSE, col.names=FALSE)
The extra files with just the array number were created because the usage of R CMD BATCH is
R CMD BATCH [options] infile [outfile]
So the $SLURM_ARRAY_TASK_ID value you were passing at the command line was treated as the outfile name. Instead that value needed to be passed as options. But again, it's better to use Rscript which has more standard argument conventions.
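As an aside (my suggestion, not part of the original answer): you can dry-run the array logic locally, without Slurm, by exporting SLURM_ARRAY_TASK_ID yourself before invoking the script. Here the Rscript call is stubbed out with echo so the sketch runs anywhere.

```shell
# Emulate a small Slurm array locally: each iteration exports the task
# id exactly as Slurm would before handing it to the R script.
for taskid in 1 2 3; do
  export SLURM_ARRAY_TASK_ID="$taskid"
  # Rscript ~/Rscript_test.R "$SLURM_ARRAY_TASK_ID"   # real invocation
  echo "task ${SLURM_ARRAY_TASK_ID} -> ${SLURM_ARRAY_TASK_ID}.out"
done
```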

What does the `input` argument do in the system() function in R?

What does the input argument do in the system() function in R? For example in the code below
authentication_test <- "authentication_test aws s3 ls s3://test-bucket/ > /dev/null"
system(authentication_test, input = "q")
I don't understand what purpose the letter q serves.
Looking at the help file, input is described as
input: if a character vector is supplied, this is copied one string per line to a temporary file, and the standard input of command is redirected to the file.
but I still have trouble understanding what exactly it is doing.
input creates a temporary file which is used as STDIN for the system shell command.
Take for example the cat command:
system("cat", input = "Line1\nLine2")
#Line1
#Line2
In your bash shell this would be the same as
echo -e "Line1\nLine2" > file
cat < file
#Line1
#Line2
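Mechanically, the redirection that input performs can be sketched in plain shell (the temporary file name is managed by R internally; mktemp here is just a stand-in): each string becomes one line of a temporary file, which is then wired up as the command's standard input.

```shell
# Rough shell equivalent of system("cat", input = c("Line1", "Line2")):
# write the strings one per line to a temp file and feed it to stdin.
tmp=$(mktemp)
printf '%s\n' "Line1" "Line2" > "$tmp"
cat < "$tmp"   # prints Line1 then Line2
rm -f "$tmp"
```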

R - How to execute PowerShell cmds with system() or system2()

I'm working in R (on Windows), attempting to count the number of words in a text file without loading the file into memory. The idea is to get some stats on the file: size, line count, word count, etc. A call to R's system() function that uses find for the line count is not hard to come by:
How do I do a "word count" command in Windows Command Prompt
lineCount <- system(paste0('find /c /v "" ', path), intern = T)
The command that I'm trying to work with for the word count is a PowerShell command: Measure-Object. I can get the following code to run without throwing an error but it returns an incorrect count.
print(system2("Measure-Object", args = c('count_words.txt', '-Word')))
[1] 127
The file, count_words.txt has on the order of millions of words. I also tested it on a .txt file with far fewer words.
"There are seven words in this file."
But the count again is returned as 127.
print(system2("Measure-Object", args = c('seven_words.txt', '-Word')))
[1] 127
Does system2() recognize PowerShell commands? What is the correct syntax for a call to the function when using Measure-Object? Why is it returning the same value regardless of actual word count?
The issues -- overview
So, you have two issues going on here:
You aren't telling system2() to use powershell
You aren't using the right powershell syntax
The solution
command <- "Get-Content C:/Users/User/Documents/test1.txt | Measure-Object -Word"
system2("powershell", args = command)
where you replace C:/Users/User/Documents/test1.txt with whatever the path to your file is. I created two .txt files, one with the text "There are seven words in this file." and the other with the text "But there are eight words in this file." I then ran the following in R:
command <- "Get-Content C:/Users/User/Documents/test1.txt | Measure-Object -Word"
system2("powershell", args = command)
Lines Words Characters Property
----- ----- ---------- --------
          7
command <- "Get-Content C:/Users/User/Documents/test2.txt | Measure-Object -Word"
system2("powershell", args = command)
Lines Words Characters Property
----- ----- ---------- --------
          8
More explanation
From help("system2"):
system2 invokes the OS command specified by command.
One main issue is that Measure-Object isn't a system command -- it's a PowerShell command. The system command for PowerShell is powershell, which is what you need to invoke.
Then, further, you didn't quite have the right PowerShell syntax. If you take a look at the docs, you'll see the PowerShell command you really want is
Get-Content C:/Users/User/Documents/count_words.txt | Measure-Object -Word
(check out example three on the linked documentation).
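For contrast (this comparison is mine, not part of the original answer): on a Unix shell the same count is a one-liner with wc, whose -w switch plays the role of Measure-Object -Word.

```shell
# Create the seven-word sample file and count its words with wc -w,
# the Unix counterpart of PowerShell's Measure-Object -Word.
printf 'There are seven words in this file.\n' > seven_words.txt
wc -w < seven_words.txt   # prints 7
rm -f seven_words.txt
```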

Piping Rscript gives error after output

I wrote a small R script to read JSON, which works fine but upon piping with
Rscript myscript.R | head
the (full, expected) output comes back with an error
Error: ignoring SIGPIPE signal
Execution halted
Oddly I can't remove this by piping STDERR to /dev/null using:
Rscript myscript.R | head 2>/dev/null
The same error is given... presumably because the error arises within the Rscript command? The suggestion to me is that the output of the head command is all STDOUT.
Piping STDOUT to /dev/null returns only the error message
Piping STDERR to /dev/null returns only the error message...!
Piping the output to cat seems to be 'invisible' - this doesn't cause an error.
Rscript myscript.R | cat | head
Further pipe chaining is possible after the cat command but it feels like I may be ignoring something important by not addressing the error.
Is there a setting I need to use within the script to permit piping without the error? I'd like to have R scripts at the ready for small tasks as is done with the likes of Python and Perl, and it'd get annoying to always have to add a useless cat.
There is discussion of handling this error in C here, but it's not immediately clear to me how this would relate to an R script.
Edit: In response to @lll's answer, the full script in use (called 'myscript.R' above) is
library(RJSONIO)
note.list <- c('abcdefg.json','hijklmn.json')
# unique IDs for markdown notes stored in JSON by Laverna, http://laverna.cc
for (laverna.note in note.list) {
# note.file <- path.expand(file.path('~/Dropbox/Apps/Laverna/notes',
# laverna.note))
# For the purpose of this example run the script in the same
# directory as the JSON files
note.file <- path.expand(file.path(getwd(),laverna.note))
file.conn <- file(note.file)
suppressWarnings( # warnings re: no terminating newline
cat(paste0(substr(readLines(file.conn), 2, 15)),'\n') # add said newline
)
close(file.conn)
}
Rscript myscript.R outputs
"id":"abcdefg"
"id":"hijklmn"
Rscript myscript.R | head -1 outputs
"id":"abcdefg"
Error: ignoring SIGPIPE signal
Execution halted
It's not clear to me what would be terminating 'early' here.
Edit 2: It's replicable with readLines, so I've removed the JSON-library-specific details in the example above. Script and dummy JSON gisted here.
Edit 3: It seems it may be possible to take command-line arguments, including pipes, and pass them to pipe() - I'll try this when I can and resolve the question.
The error is caused by an attempt to write to the pipe when no process is connected to the other end. In other words, head has already read what it needed and exited by the time your script performs its next write.
The command itself might not be the issue; it could be something within the script causing an early termination or race condition before reaching the pipe. Since you're getting full output it may not be that much of a concern, however, masking the error with other CLI commands as mentioned probably isn't the best approach.
The command line solution:
R does have a couple of useful commands for dealing with instances in which you might want the interpreter to wait, or perhaps suppress any errors that would normally be output to stderr.
For command-line R, error messages written to ‘stderr’ will be sent to
the terminal unless ignore.stderr = TRUE. They can be captured (in the
most likely shells) by:
system("some command 2>&1", intern = TRUE)
There is also the wait argument which could help with keeping the process alive.
wait — logical (not NA) indicating whether the R interpreter should
wait for the command to finish, or run it asynchronously. This will be
ignored (and the interpreter will always wait)
if intern = TRUE.
system("Rscript myscript.R 2>&1 | head", intern = TRUE)
The above would wait, and output errors, if any are thrown.
system("Rscript myscript.R | head", intern = FALSE, ignore.stderr = TRUE)
The above won't wait, but would suppress errors, if any.
I encountered the same annoying error. It appears to be generated from within R by a function writing to STDOUT, if the R function is still running (outputting data to the pipe) when the pipe stops 'listening'.
So, errors can be suppressed by simply wrapping the R output function into try(...,silent=TRUE), or specifically this error can be handled by wrapping the R output function into a more-involved tryCatch(...,error=...) function.
Example:
Here's a script that generates an error when piped:
#!/Library/Frameworks/R.framework/Resources/bin/Rscript
random_matrix=matrix(rnorm(2000),1000)
write.table(x=random_matrix,file="",sep=",",row.names=FALSE,col.names=FALSE)
Output when called from bash and piped to head:
./myScript.r | head -n 1
-1.69669866833626,-0.463199773124574
Error in write.table(x = random_matrix, file = "", sep = ",", row.names = FALSE, :
ignoring SIGPIPE signal
Execution halted
So: wrap the write.table output function into try to suppress all errors that occur during output:
try(write.table(x=random_matrix,file="",sep=",",row.names=FALSE,col.names=FALSE),silent=TRUE)
Or, more-specific, just suppress the "ignoring SIGPIPE signal" error:
tryCatch(write.table(x=random_matrix,file="",sep=",",row.names=FALSE,col.names=FALSE),
error=function(e) if(!grepl("ignoring SIGPIPE signal",e$message))stop(e) )
I could overcome this problem by using littler instead of Rscript:
r myscript.R | head
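The same mechanism is visible in a pure shell pipeline with no R involved: yes writes forever, head closes the pipe after two lines, and the next write by yes is killed by SIGPIPE, which the shell discards silently. Rscript turns that same signal into the "ignoring SIGPIPE signal" error.

```shell
# `yes` streams "y" endlessly; `head` exits after two lines, so the
# subsequent write by `yes` raises SIGPIPE and the process ends quietly.
yes | head -n 2
```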

How can I suppress the line numbers output using R CMD BATCH?

If I have an R script:
print("hi")
commandArgs()
And I run it using:
R CMD BATCH --slave --no-timing test.r output.txt
The output will contain:
[1] "hi"
[1] "/Library/Frameworks/R.framework/Resources/bin/exec/x86_64/R"
[2] "-f"
[3] "test.r"
[4] "--restore"
[5] "--save"
[6] "--no-readline"
[7] "--slave"
How can I suppress the line numbers ([1]..[7]) in the output so only the output of the script appears?
Use cat instead of print if you want to suppress the line numbers ([1], [2], ...) in the output.
I think you are also going to want to pass command-line arguments; the easiest way to do that is to create a file with the Rscript shebang:
For example, create a file called args.r:
#!/usr/bin/env Rscript
args <- commandArgs(TRUE)
cat(args, sep = "\n")
Make it executable with chmod +x args.r and then you can run it with ./args.r ARG1 ARG2
FWIW, passing command line parameters with the R CMD BATCH ... syntax is a pain. Here is how you do it: R CMD BATCH "--args ARG1 ARG2" args.r Note the quotes. More discussion here
UPDATE: changed shebang line above from #!/usr/bin/Rscript to #!/usr/bin/env Rscript in response to @mbq's comment (thanks!)
Yes, mbq is right -- use Rscript, or, if it floats your boat, littler:
$ cat /tmp/tommy.r
#!/usr/bin/r
cat("hello world\n")
print(argv[])
$ /tmp/tommy.r a b c
hello world
[1] "a" "b" "c"
$
You probably want to look at the CRAN packages getopt and optparse for argument parsing as you'd do in other scripting languages.
Use commandArgs(TRUE) and run your script with Rscript.
EDIT: Ok, I've misread your question. David has it right.

Stop Rscript from command-numbering the output from print

By default, R's print(...) prepends a vector index like [1] to everything it writes to stdout:
print("we get signal")
Produces:
[1] "we get signal"
Rscript lets you redefine built-in functions like print, so a simple rebinding serves our purpose:
print = cat
print("we get signal")
Produces:
we get signal
Notice that the index prefix and the double quoting are gone.
Get more control of print by using R first class functions:
my_print <- function(x, ...){
#extra shenanigans for when the wind blows from the east on tuesdays, go here.
cat(x)
}
print = my_print
print("we get signal")
Prints:
we get signal
If you're using print as a poor man's debugger... We're not laughing at you, we're laughing with you.
