Simultaneously save and print R system call output? - r

Within my R script I am calling a shell script. I would like to both print the output to the console in real time and save the output for debugging. For example:
system("script.sh")
prints to console in real time,
out <- system("script.sh", intern = TRUE)
saves the output to a variable for debugging,
and
(out <- system("script.sh", intern = TRUE))
will only print the contents of out after the script has finished. Is there any way to both print to console in real time and store the output as a variable?

R has to wait for the script to finish either way, so to see its stdout in real time you generally need to start it as a background process and poll that process for output. (Depending on the script, you may also want to poll stderr.)
Here's a quick example using processx.
First, I'll create a slow-output shell script; replace this with the real reason you're calling system. I've named this myscript.sh.
#!/bin/bash
for i in `seq 1 5` ; do
  sleep 3
  echo 'hello world: '$i
done
Now let's (1) start a process in the background, then (2) poll its output every second.
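proc <- processx::process$new("bash", c("-c", "./myscript.sh"), stdout = "|")
output <- character(0)
while (proc$is_alive()) {
  Sys.sleep(1)
  now <- Sys.time()
  tmstmp <- sprintf("# [%s]", format(now, format = "%T"))
  thisout <- proc$read_output_lines()
  if (length(thisout)) {
    output <- c(output, thisout)
    # collapse so that several lines arriving at once print one per line
    message(tmstmp, " New output!\n", paste("#>", thisout, collapse = "\n"))
  } else message(tmstmp)
}
# pick up any lines written between the last poll and the process exiting
output <- c(output, proc$read_output_lines())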
# [13:09:29]
# [13:09:30]
# [13:09:31]
# [13:09:32] New output!
#> hello world: 1
# [13:09:33]
# [13:09:34]
# [13:09:35] New output!
#> hello world: 2
# [13:09:36]
# [13:09:37]
# [13:09:38] New output!
#> hello world: 3
# [13:09:39]
# [13:09:40]
# [13:09:41] New output!
#> hello world: 4
# [13:09:42]
# [13:09:43]
# [13:09:44] New output!
#> hello world: 5
And its output is stored:
output
# [1] "hello world: 1" "hello world: 2" "hello world: 3" "hello world: 4" "hello world: 5"
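As an aside that is not part of the approach above: on Unix-like systems, base R's pipe() connection can give a similar print-and-keep effect without extra packages, because readLines(con, n = 1) blocks until the next line arrives. A minimal sketch, assuming a POSIX shell; the echo command is only a stand-in for ./myscript.sh, and how promptly lines show up still depends on how the child process buffers its own output.

```r
# Stand-in command; in practice this would be something like "./myscript.sh"
cmd <- "echo one && echo two"

con <- pipe(cmd, open = "r")      # start the command and read its stdout
output <- character(0)
repeat {
  line <- readLines(con, n = 1)   # blocks until a full line (or EOF) arrives
  if (length(line) == 0) break    # EOF: the command has finished
  cat(line, "\n", sep = "")       # print immediately ...
  flush.console()
  output <- c(output, line)       # ... and keep a copy
}
close(con)
```

Unlike the processx loop, this blocks R between lines and does not capture stderr.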
Ways that this can be extended:
- Add/store a timestamp with each message, so you know when it came in. The accuracy and utility of this depends on how frequently you want R to poll the process stdout pipe, and really how much you need this information.
- Run the process in the background, and even poll for it in the background cycles. I use the later package and set up a self-recurring function that polls, appends, and re-submits itself into the later process queue. The benefit of this is that you can continue to use R; the drawback is that if you're running long code, then you will not see output until your current code exits and lets R breathe and do something idly. (To understand this bullet, one really must play with the later package, a bit beyond this answer.)
- Depending on your intentions, it might be more appropriate for the output to go to a file and "permanently" store it there instead of relying on the R process to keep tabs. There are disadvantages to this, in that now you need to manage polling a file for changes, and R isn't making that easy (it does not have, for instance, direct/easy access to inotify, so now it gets even more complicated).
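The file-based variant could be sketched roughly as follows. This is a minimal illustration, not a robust implementation: read_new_lines() is a hypothetical helper, the background writer is simulated with cat(), and re-reading the whole file on every poll only suits small logs. A line that is still being written may also be picked up incomplete.

```r
# Hypothetical helper: return the lines in `path` beyond the first `n_seen`.
read_new_lines <- function(path, n_seen = 0L) {
  if (!file.exists(path)) return(character(0))
  all_lines <- readLines(path, warn = FALSE)
  if (length(all_lines) <= n_seen) return(character(0))
  all_lines[(n_seen + 1L):length(all_lines)]
}

log_file <- tempfile(fileext = ".log")
output <- character(0)

# Simulate a background writer appending to the log. In real use the file
# would be filled by the external process, for example (assumption):
#   system2("bash", "myscript.sh", stdout = log_file, wait = FALSE)
for (i in 1:3) {
  cat(sprintf("hello world: %d\n", i), file = log_file, append = TRUE)
  new_lines <- read_new_lines(log_file, length(output))
  if (length(new_lines)) {
    message(format(Sys.time(), "# [%H:%M:%S] "),
            paste(new_lines, collapse = "\n"))
    output <- c(output, new_lines)
  }
}
```

Because the log stays on disk, it survives even if the R session that started the script goes away.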

Related

Run command with `system2(..., wait = FALSE)` in R and kill it later

Assume I want to start a local server from R, access it some times for calculations and kill it again at the end. So:
## start the server
system2("sleep", "100", wait = FALSE)
## do some work
# Here I want to kill the process started at the beginning
This should be cross platform, as it is part of a package (Mac, Linux, Windows, ...).
How can I achieve this in R?
EDIT 1
The command I have to run is a java jar,
system2("java", "... plantuml.jar ...")
Use the processx package.
proc <- processx::process$new("sleep", c("100"))
proc
# PROCESS 'sleep', running, pid 431792.
### ... pause ...
proc
# PROCESS 'sleep', running, pid 431792.
proc$kill()
# [1] TRUE
proc
# PROCESS 'sleep', finished.
proc$get_exit_status()
# [1] 2
The finished output is just an indication that the process has exited, not whether it was successful or erred. Use the exit status, where 0 indicates a good exit.
proc <- processx::process$new("sleep", c("1"))
Sys.sleep(1)
proc
# PROCESS 'sleep', finished.
proc$get_exit_status()
# [1] 0
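For the "start it, do some work, kill it at the end" pattern from the question, one option is to tie the cleanup to on.exit() so the process is killed even if the work fails. A minimal sketch, assuming processx is installed; with_background() is a hypothetical helper, and Rscript running Sys.sleep(100) stands in for the real server command so the example does not depend on a sleep binary.

```r
# Hypothetical helper: run `work(proc)` while a background process is alive,
# and make sure the process is killed afterwards, even if `work` fails.
with_background <- function(command, args, work) {
  proc <- processx::process$new(command, args)
  on.exit(if (proc$is_alive()) proc$kill(), add = TRUE)
  work(proc)
}

if (requireNamespace("processx", quietly = TRUE)) {
  # Rscript stands in for the real server command (e.g. java and its arguments)
  rscript <- file.path(
    R.home("bin"),
    if (.Platform$OS.type == "windows") "Rscript.exe" else "Rscript"
  )
  was_alive <- with_background(
    rscript, c("-e", "Sys.sleep(100)"),
    function(proc) proc$is_alive()   # "do some work" goes here
  )
}
```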
FYI, base R's system and system2 work fine when you don't need to kill the process easily and there are no embedded spaces in the command or its arguments. system2 looks like it should handle embedded spaces better (since it accepts a vector of arguments), but under the hood all it's doing is
command <- paste(c(shQuote(command), env, args), collapse = " ")
and then
rval <- .Internal(system(command, flag, f, stdout, stderr, timeout))
which does nothing to protect the arguments.
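To illustrate the quoting point, the snippet below mimics that paste() call with a hypothetical argument containing a space. It only builds the command strings and does not run anything; type = "sh" is fixed so the result does not depend on the platform.

```r
command <- "echo"
args <- "two words"   # hypothetical argument with an embedded space

# What the paste() above produces: the argument is not quoted ...
unquoted <- paste(c(shQuote(command, type = "sh"), args), collapse = " ")
# ... so the caller has to quote each argument explicitly
quoted <- paste(c(shQuote(command, type = "sh"), shQuote(args, type = "sh")),
                collapse = " ")

unquoted  # "'echo' two words"   -> the shell sees two arguments
quoted    # "'echo' 'two words'" -> the shell sees one argument
```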
You said "do some work", which suggests that you need to pass something to/from it periodically. processx does support writing to the standard input of the background process, as well as capturing its output. Its documentation at https://processx.r-lib.org/ is rich with great examples on this, including one-time calls for output or a callback function.

R parallel package parSapply(): cannot print message [duplicate]

I found that if there is more than one print call during a parallel computation, only the last one is displayed on the console. So I set the outfile option, hoping to get the result of every print. Here is the R code:
cl <- makeCluster(3, type = "SOCK", outfile = "log.txt")
abc <<- 123
clusterExport(cl, "abc")
clusterApplyLB(cl, 1:6,
  function(y) {
    print(paste("before:", abc))
    abc <<- y
    print(paste("after:", abc))
  }
)
stopCluster(cl)
But I just get three records:
starting worker for localhost:11888
Type: EXEC
Type: EXEC
[1] "index: 3"
[1] "before: 123"
[1] "after: 2"
Type: EXEC
[1] "index: 6"
[1] "before: 2"
[1] "after: 6"
Type: DONE
It looks like you're only getting the output from one worker in log.txt. I've often wondered if that could happen, because when you specify outfile="log.txt", each of the workers will open log.txt for appending and then call sink. Here is the code that is executed by the worker processes when outfile is not an empty string:
## all the workers log to the same file.
outcon <- file(outfile, open = "a")
sink(outcon)
sink(outcon, type = "message")
This makes me nervous because I'm not certain what might happen with all of the workers opening the same file for appending at the same time. It may be OS or file system dependent, and it might explain why you're only getting the output from one worker.
For this reason, I tend to use outfile="", in which case this code isn't executed, allowing the output operations to happen normally without redirecting them with the sink function. However, on Windows, you won't see the output if you're using Rgui, so use Rterm instead.
There shouldn't be a problem with multiple print statements in a task, but if you didn't set outfile, you shouldn't see any output since all output is redirected to /dev/null in that case.
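If a shared log file is the concern, one way around it is to let each worker append to its own file. The following is only a sketch under assumptions: it uses the parallel package with a default local cluster, and the per-PID file naming and the log_dir argument are made up for illustration.

```r
library(parallel)

log_dir <- tempfile("worker-logs-")
dir.create(log_dir)

cl <- makeCluster(2)
res <- clusterApplyLB(cl, 1:6, function(y, log_dir) {
  # one file per worker process, keyed by its PID (hypothetical naming scheme)
  log_file <- file.path(log_dir, sprintf("worker-%d.log", Sys.getpid()))
  cat(sprintf("task %d started\n", y), file = log_file, append = TRUE)
  y^2
}, log_dir = log_dir)
stopCluster(cl)

# Collect the per-worker logs afterwards
logs <- unlist(lapply(list.files(log_dir, full.names = TRUE), readLines))
```

Since no two processes write to the same file, every task's message ends up in exactly one log.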

Implementation of simple polling of results file

For one of my dissertation's data collection modules, I have implemented a simple polling mechanism. This is needed because I make each data collection request (one of many) as an SQL query, submitted via a Web form, which is simulated by RCurl code. The server processes each request and generates a text file with results at a specific URL (RESULTS_URL in the code below). Regardless of the request, the URL and file name are the same (I cannot change that). Since processing time differs between data requests and some requests may take a significant amount of time, my R code needs to "know" when the results are ready (the file is re-generated), so that it can retrieve them. The following is my solution to this problem.
POLL_TIME <- 5 # polling interval in seconds
In function srdaRequestData(), before making data request:
# check and save 'last modified' date and time of the results file
# before submitting data request, to compare with the same after one
# for simple polling of results file in srdaGetData() function
beforeDate <- url.exists(RESULTS_URL, .header=TRUE)["Last-Modified"]
beforeDate <<- strptime(beforeDate, "%a, %d %b %Y %X", tz="GMT")
<making data request is here>
In function srdaGetData(), called after srdaRequestData()
# simple polling of the results file
repeat {
  if (DEBUG) message("Waiting for results ...", appendLF = FALSE)
  afterDate <- url.exists(RESULTS_URL, .header=TRUE)["Last-Modified"]
  afterDate <- strptime(afterDate, "%a, %d %b %Y %X", tz="GMT")
  delta <- difftime(afterDate, beforeDate, units = "secs")
  if (as.numeric(delta) != 0) { # file modified, results are ready
    if (DEBUG) message(" Ready!")
    break
  } else { # no results yet, wait the timeout and check again
    if (DEBUG) message(".", appendLF = FALSE)
    Sys.sleep(POLL_TIME)
  }
}
<retrieving request's results is here>
The module's main flow/sequence of events is linear, as follows:
Read/update configuration file
Authenticate with the system
Loop through data requests, specified in configuration file (via lapply()),
where for each request perform the following:
{
...
Make request: srdaRequestData()
...
Retrieve results: srdaGetData()
...
}
The issue with the code above is that it doesn't seem to work as expected. Upon making a data request, the code should print "Waiting for results ...", then periodically check whether the results file has been modified (re-generated), printing progress dots until the results are ready, at which point it prints a confirmation. However, the actual behavior is that the code waits a long time (I intentionally made one request long-running) without printing anything, then apparently retrieves the results and prints both "Waiting for results ..." and " Ready!" at the same time.
It seems to me that it's some kind of synchronization issue, but I can't figure out what exactly. Or, maybe it's something else and I'm somehow missing it. Your advice and help will be much appreciated!
In a comment on the question, I believe MrFlick solved the issue: the polling logic appears to be functional, but the progress messages are out of sync with current events on the system.
By default, the R console output is buffered. This is by design: it speeds things up and avoids the distracting flicker that may be associated with frequent messages. We tend to forget this fact, particularly after we've been using R in a very interactive fashion, running various ad-hoc statements at the console (the console buffer is automatically flushed just before returning to the > prompt).
It is however possible to get message() and, more generally, console output in "real time" either by explicitly flushing the console after each critical output statement, using the flush.console() function, or by disabling buffering at the level of the R GUI (right-click on the console and see the "Buffered output Ctrl W" item; this is also available in the Misc menu).
Here's a toy example of the explicit use of flush.console(). Note the use of cat() rather than message(), as the former doesn't automatically add a CR/LF to the output. The latter is nevertheless useful because its messages can be suppressed with suppressMessages() and the like. Also, as shown in the comment, you can cat the "\b" (backspace) character to make the numbers overwrite one another.
CountDown <- function() {
  for (i in 9:1) {
    cat(i)
    # alternatively to cat(i) use: message(i)
    flush.console() # <<<<<<< immediate output to console.
    Sys.sleep(1)
    cat(" ") # also try cat("\b") instead ;-)
  }
  cat("... Blast-off\n")
}
The output is shown below. What is not evident in this printout is that it took about 9 seconds overall, with one number printed every second before the final "Blast-off". Remove the flush.console() statement and the output will come all at once, after those 9 seconds, i.e. when the function terminates (unless buffering is already disabled at the level of the GUI).
CountDown()
9 8 7 6 5 4 3 2 1 ... Blast-off
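Applied to a polling loop like the one in the question, the same idea might look as follows. This is a sketch, not the original module's code: poll_until() and its is_ready argument are hypothetical names, and a timeout is added so the loop cannot wait forever.

```r
poll_until <- function(is_ready, interval = 1, timeout = 30) {
  deadline <- Sys.time() + timeout
  cat("Waiting for results ...")
  flush.console()
  repeat {
    if (is_ready()) {
      cat(" Ready!\n")
      flush.console()
      return(TRUE)
    }
    if (Sys.time() >= deadline) {
      cat(" timed out\n")
      flush.console()
      return(FALSE)
    }
    cat(".")
    flush.console()   # show each progress dot as soon as it is written
    Sys.sleep(interval)
  }
}

# Usage with a stand-in check that succeeds on the third attempt
attempts <- 0
ok <- poll_until(function() { attempts <<- attempts + 1; attempts >= 3 },
                 interval = 0.1)
```

In the question's setting, is_ready would wrap the comparison of the "Last-Modified" header against beforeDate.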

is it possible to redirect console output to a variable?

In R, I'm wondering if it's possible to temporarily redirect the output of the console to a variable?
p.s. There are a few examples on the web on how to use sink() to redirect the output into a filename, but none that I could find showing how to redirect into a variable.
p.p.s. The reason this is useful, in practice, is that I need to print out a portion of the default console output from some of the built-in functions in R.
I believe results <- capture.output(...) is what you need (i.e. using the default file=NULL argument). sink(textConnection("results")); ...; sink() should work as well, but as ?capture.output says, capture.output() is:
Related to ‘sink’ in the same way that ‘with’ is related to ‘attach’.
... which suggests that capture.output() will generally be better since it is more contained (i.e. you don't have to remember to terminate the sink()).
If you want to send the output of multiple statements to a variable you can wrap them in curly brackets {}, but if the block is sufficiently complex it might be better to use sink() (or make your code more modular by wrapping it in functions).
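A short illustration of both forms, using print() calls as stand-ins for whatever built-in function produces the console output of interest:

```r
# Single expression
results <- capture.output(print(1:3))

# Several statements wrapped in curly brackets
block <- capture.output({
  print("first")
  print("second")
})

results  # "[1] 1 2 3"
block    # one element per printed line
```

Each element of the returned character vector is one line of what would have appeared on the console.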
For the record, it's indeed possible to store stdout in a variable with the help of a temporary connection without calling capture.output -- e.g. when you want to save both the results and stdout. Example:
Prepare the variable for the diverted R output:
> stdout <- vector('character')
> con <- textConnection('stdout', 'wr', local = TRUE)
Divert the output:
> sink(con)
Do some stuff:
> 1:10
End the diversion:
> sink()
Close the temporary connection:
> close(con)
Check results:
> stdout
[1] " [1]  1  2  3  4  5  6  7  8  9 10"
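The same steps can be wrapped in a function so that the diversion is always ended, even if the code fails. This is a sketch under assumptions: capture_both() is a hypothetical helper name, and it uses the documented "w" open mode for the text connection.

```r
# Hypothetical helper: evaluate `expr`, returning both its value and
# whatever it printed to stdout.
capture_both <- function(expr) {
  captured <- character(0)
  con <- textConnection("captured", open = "w", local = TRUE)
  sink(con)
  # `expr` is evaluated here; the diversion is ended and the connection
  # closed whether or not the evaluation succeeds
  value <- tryCatch(expr, finally = {
    sink()
    close(con)
  })
  list(value = value, output = captured)
}

res <- capture_both({
  print("working")
  42
})
res$value   # 42
res$output  # the printed line(s) as a character vector
```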
