I found that if there is more than one print call during a parallel computation, only the last one is displayed on the console. So I set the outfile option, hoping to capture the result of every print. Here is the R code:
cl <- makeCluster(3, type = "SOCK", outfile = "log.txt")
abc <<- 123
clusterExport(cl, "abc")
clusterApplyLB(cl, 1:6,
  function(y) {
    print(paste("before:", abc))
    abc <<- y
    print(paste("after:", abc))
  }
)
stopCluster(cl)
But I just get three records:
starting worker for localhost:11888
Type: EXEC
Type: EXEC
[1] "index: 3"
[1] "before: 123"
[1] "after: 2"
Type: EXEC
[1] "index: 6"
[1] "before: 2"
[1] "after: 6"
Type: DONE
It looks like you're only getting the output from one worker in log.txt. I've often wondered if that could happen, because when you specify outfile="log.txt", each of the workers will open log.txt for appending and then call sink. Here is the code that is executed by the worker processes when outfile is not an empty string:
## all the workers log to the same file.
outcon <- file(outfile, open = "a")
sink(outcon)
sink(outcon, type = "message")
This makes me nervous because I'm not certain what might happen with all of the workers opening the same file for appending at the same time. It may be OS or file system dependent, and it might explain why you're only getting the output from one worker.
For this reason, I tend to use outfile="", in which case this code isn't executed, allowing the output operations to happen normally without redirecting them with the sink function. However, on Windows, you won't see the output if you're using Rgui, so use Rterm instead.
There shouldn't be a problem with multiple print statements in a task, but if you didn't set outfile, you shouldn't see any output since all output is redirected to /dev/null in that case.
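If you do want every worker's output in a file, one workaround is to give each worker its own log so there is no contention on a shared append. A minimal sketch, assuming the parallel package; the file names are illustrative:
library(parallel)
cl <- makeCluster(3)
# each worker sinks to its own file, keyed by its PID, so nothing is shared
clusterEvalQ(cl, {
  con <- file(sprintf("worker_%d.log", Sys.getpid()), open = "a")
  sink(con)
  sink(con, type = "message")
})
# ... run clusterApplyLB(...) as before; print output lands in worker_<pid>.log
stopCluster(cl)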
Assume I want to start a local server from R, access it a few times for calculations, and kill it again at the end. So:
## start the server
system2("sleep", "100", wait = FALSE)
## do some work
# Here I want to kill the process started at the beginning
This should be cross-platform (Mac, Linux, Windows, ...), as it is part of a package.
How can I achieve this in R?
EDIT 1
The command I actually have to run is a Java jar:
system2("java", "... plantuml.jar ...")
Use the processx package.
proc <- processx::process$new("sleep", c("100"))
proc
# PROCESS 'sleep', running, pid 431792.
### ... pause ...
proc
# PROCESS 'sleep', running, pid 431792.
proc$kill()
# [1] TRUE
proc
# PROCESS 'sleep', finished.
proc$get_exit_status()
# [1] 2
The finished output is just an indication that the process has exited, not whether it exited successfully or with an error. Use the exit status, where 0 indicates a clean exit.
proc <- processx::process$new("sleep", c("1"))
Sys.sleep(1)
proc
# PROCESS 'sleep', finished.
proc$get_exit_status()
# [1] 0
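Applied to the question's start/work/kill pattern, a minimal sketch (the jar path and arguments are placeholders from the question):
do_work_with_server <- function() {
  # start the background server; the jar path/arguments are illustrative
  proc <- processx::process$new("java", c("-jar", "plantuml.jar"))
  # guarantee the server is killed even if the work below errors
  on.exit(proc$kill(), add = TRUE)
  ## do some work with the server here
}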
FYI, base R's system and system2 work fine when you don't need to kill the process easily and don't have any embedded spaces in the command or its arguments. system2 looks like it should be better with embedded spaces (since it accepts a vector of arguments), but under the hood all it does is
command <- paste(c(shQuote(command), env, args), collapse = " ")
and then
rval <- .Internal(system(command, flag, f, stdout, stderr, timeout))
which does nothing to protect the arguments.
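So if an argument may contain spaces, quote it yourself before passing it along; a small illustration (the file name is made up):
# shQuote wraps the argument in quotes so the shell sees it as one token
system2("ls", shQuote("my file.txt"))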
You said "do some work", which suggests that you need to pass something to/from it periodically. processx does support writing to the standard input of the background process, as well as capturing its output. Its documentation at https://processx.r-lib.org/ is rich with great examples on this, including one-time calls for output or a callback function.
Within my R script I am calling a shell script. I would like to both print the output to the console in real time and save the output for debugging. For example:
system("script.sh")
prints to console in real time,
out <- system("script.sh", intern = TRUE)
saves the output to a variable for debugging,
and
(out <- system("script.sh", intern = TRUE))
will only print the contents of out after the script has finished. Is there any way to both print to console in real time and store the output as a variable?
Since R is waiting for this to complete anyway, generally to see the stdout in real time, you need to poll the process for output. (One can/should also poll for stderr, depending.)
Here's a quick example using processx.
First, I'll create a slow-output shell script; replace this with the real reason you're calling system. I've named this myscript.sh.
#!/bin/bash
for i in `seq 1 5` ; do
    sleep 3
    echo 'hello world: '$i
done
Now let's (1) start a process in the background, then (2) poll its output every second.
proc <- processx::process$new("bash", c("-c", "./myscript.sh"), stdout = "|")
output <- character(0)
while (proc$is_alive()) {
  Sys.sleep(1)
  now <- Sys.time()
  tmstmp <- sprintf("# [%s]", format(now, format = "%T"))
  thisout <- proc$read_output_lines()
  if (length(thisout)) {
    output <- c(output, thisout)
    # collapse so multiple new lines print one per line
    message(tmstmp, " New output!\n", paste("#>", thisout, collapse = "\n"))
  } else message(tmstmp)
}
# [13:09:29]
# [13:09:30]
# [13:09:31]
# [13:09:32]New output!
#> hello world: 1
# [13:09:33]
# [13:09:34]
# [13:09:35]New output!
#> hello world: 2
# [13:09:36]
# [13:09:37]
# [13:09:38]New output!
#> hello world: 3
# [13:09:39]
# [13:09:40]
# [13:09:41]New output!
#> hello world: 4
# [13:09:42]
# [13:09:43]
# [13:09:44]New output!
#> hello world: 5
And its output is stored:
output
# [1] "hello world: 1" "hello world: 2" "hello world: 3" "hello world: 4" "hello world: 5"
Ways that this can be extended:
Add/store a timestamp with each message, so you know when it came in. The accuracy and utility of this depend on how frequently you want R to poll the process's stdout pipe, and really how much you need this information.
Run the process in the background, and even poll for it in background cycles. I use the later package and set up a self-recurring function that polls, appends, and re-submits itself into the later process queue (a sketch follows this list). The benefit of this is that you can continue to use R; the drawback is that if you're running long code, you will not see output until your current code exits and lets R breathe and do something idly. (To understand this bullet, one really must play with the later package, a bit beyond this answer.)
Depending on your intentions, it might be more appropriate for the output to go to a file and "permanently" store it there instead of relying on the R process to keep tabs. There are disadvantages to this, in that now you need to manage polling a file for changes, and R isn't making that easy (it does not have, for instance, direct/easy access to inotify, so now it gets even more complicated).
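A minimal sketch of the self-recurring later idea from the second point above (assuming proc is the processx process from earlier; the 1-second delay is arbitrary):
library(later)
output <- character(0)
poll <- function() {
  thisout <- proc$read_output_lines()
  if (length(thisout)) output <<- c(output, thisout)
  # re-queue ourselves until the process exits
  if (proc$is_alive()) later::later(poll, delay = 1)
}
later::later(poll, delay = 1)
# the console stays usable; 'output' accumulates whenever R is idle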
I am trying to pause my code for a little while, long enough for me to observe the plots.
I tried:
print('A')
something = readline("Press Enter")
print('B')
print('C')
With that, there is no pause: the line print('B') is fed to readline and stored in something, so only A and C get printed to the screen. Note that if I add an empty line between something = readline("Press Enter") and print('B'), then print('B') gets printed to the screen, but the console still doesn't wait for the user to press Enter before continuing.
And I tried:
print('A')
Sys.sleep(3)
print('B')
print('C')
The program waits 3 seconds before starting and then runs "normally" without pausing between print('A') and print('B').
What am I misunderstanding?
Here is my R version: R 3.1.1 GUI 1.65 Snow Leopard build (6784)
The problem with readline is that if you paste your script into an R console, or execute it from e.g. RStudio, the readline call is read and then the next line of the script is consumed as the console entry, which in your case sets the value of something to print('B').
An easy way to get around this is to stick your entire code in a function, then call the function to run it. So, in your case:
myscript = function() {
  print('A')
  something = readline(prompt = "Press Enter")
  print('B')
  print('C')
}
myscript()
The output of this for me (in RStudio, with R version 3.1.1):
[1] "A"
Press Enter
[1] "B"
[1] "C"
This has always felt like a bit of a hack to me, but it's essentially what the readline documentation recommends in its example.
I've never used Sys.sleep in my code, so I can't help you there.
Edit to clarify based on comments: this will only work if myscript() is the very last line of your script, or if it is entered manually into the console after running the script to define the function. Otherwise, you will run into the same problem as before: the next line of code will be automatically entered.
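If the script may also be run non-interactively with Rscript, note that readline only works in interactive sessions; a hedged sketch of a fallback (the pause helper name is made up) reads a line from stdin instead:
pause <- function(msg = "Press Enter to continue") {
  if (interactive()) {
    invisible(readline(prompt = msg))
  } else {
    cat(msg, "\n")
    con <- file("stdin")
    on.exit(close(con))
    invisible(readLines(con, n = 1))
  }
}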
I have an R script I am creating and I would like to detect the number of running instances of R on Windows, so the script can choose whether or not to run a particular set of scripts (i.e., if there are already more than 2 instances of R running, do X; else do Y).
Is there a way to do this within R?
EDIT:
Here is some info on the purpose as requested:
I have a very long set of scripts for applying a Bayesian network model using the catnet library to thousands of cases. This code processes and outputs results to a CSV file for each case. Most of the parallel computing alternatives I have tried have not been ideal, as they suppress a lot of the built-in progress notifications, hence I have been running subsets of the cases in different instances of R. I know this is somewhat antiquated, but it works for me, so I wanted a way to have the code subset the cases automatically based on the number of instances running.
I do this right now by hand, opening multiple instances of Rscript from CMD with slightly differently configured R files, to get something like this:
cd "Y:\code\BN_code"
START "" "C:\Program Files\R\R-3.0.0\bin\x64\Rscript.exe" "process spp data3.r" /b
START "" "C:\Program Files\R\R-3.0.0\bin\x64\Rscript.exe" "process spp data3_T1.r" /b
START "" "C:\Program Files\R\R-3.0.0\bin\x64\Rscript.exe" "process spp data3_T2.r" /b
START "" "C:\Program Files\R\R-3.0.0\bin\x64\Rscript.exe" "process spp data3_T3.r" /b
EDIT2:
Thanks to the answers below, here is my implementation of what I call 'poor man's parallel computing' in R:
So if you have any long script that has to be applied to a long list of cases, use the code below to break the long list into a number of sublists, one fed to each instance of Rscript:
#the cases that I need to apply my code to:
splist=c("sp01", "sp02", "sp03", "sp04", "sp05", "sp06", "sp07", "sp08", "sp09", "sp010", "sp11", "sp12",
"sp013", "sp014", "sp015", "sp16", "sp17", "sp018", "sp19", "sp20", "sp21", "sp22", "sp23", "sp24")
###automatic subsetting of cases based on number of running instances of r script:
cpucores <- as.integer(Sys.getenv('NUMBER_OF_PROCESSORS'))
# count running Rscript.exe instances (the first 3 lines of tasklist output are headers)
n_instances <- length(system('tasklist /FI "IMAGENAME eq Rscript.exe" ', intern = TRUE)) - 3
jnk <- length(system('tasklist /FI "IMAGENAME eq rstudio.exe" ', intern = TRUE)) - 3
rstudiorun <- jnk > 0
if (!rstudiorun & n_instances > 0 & cpucores > 1) { # if code is being run from Rscript,
  # not from RStudio, and there is more than one core available
  jnkn <- length(splist)
  jnk <- seq(1, jnkn, round(jnkn / cpucores, 0))
  jnk <- c(jnk, jnkn)
  splist <- splist[jnk[n_instances]:jnk[n_instances + 1]]
}
###end automatic subsetting of cases
#perform your script on subset of list of cases:
for (sp in splist) {
  ptm0 <- proc.time()
  Sys.sleep(6)  # stand-in for the real per-case work
  ptm1 <- proc.time() - ptm0
  jnk <- as.numeric(ptm1[3])
  cat('\n', 'It took ', jnk, "seconds to do species", sp)
}
To make this code run on multiple instances of r automatically in windows, just create a .bat file:
cd "C:\Users\lfortini\code\misc code\misc r code"
START "" "C:\Program Files\R\R-3.0.0\bin\x64\Rscript.exe" "rscript_multiple_instances.r" /b
timeout 10
START "" "C:\Program Files\R\R-3.0.0\bin\x64\Rscript.exe" "rscript_multiple_instances.r" /b
timeout 10
START "" "C:\Program Files\R\R-3.0.0\bin\x64\Rscript.exe" "rscript_multiple_instances.r" /b
timeout 10
START "" "C:\Program Files\R\R-3.0.0\bin\x64\Rscript.exe" "rscript_multiple_instances.r" /b
exit
The timeout is there to give R enough time to detect its own number of instances.
Clicking on this .bat file will automatically open numerous instances of Rscript, with each one taking on a particular subset of the cases you want to analyse, while still showing all of the progress of the running script in each window. The nice thing about this approach is that you pretty much just have to slap the automated list-subsetting code in front of whichever iteration mechanism you are using (loops, apply functions, etc.). Then fire off the code with Rscript, using the .bat file or manually, and you are set.
Actually it is easier than expected, as Windows comes with the handy tasklist command.
With it you can list all running processes, from which you simply need to count the number of Rscript.exe instances (I use stringr here for the string manipulation).
require(stringr)
progs <- system("tasklist", intern = TRUE)
# take the first whitespace-delimited field (the image name) from each line
progs <- vapply(str_split(progs, "[[:space:]]"), "[[", "", i = 1)
sum(progs == "Rscript.exe")
That should do the trick. (I only tried it by counting instances of Rgui.exe, but that works fine.)
You can do it even more succinctly, as below:
length(grep("rstudio\\.exe", system("tasklist", intern = TRUE)))
Replace rstudio with Rscript or any other process name.
Or even shorter:
length(system('tasklist /FI "IMAGENAME eq Rscript.exe" ', intern = TRUE))-3
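The -3 accounts for tasklist's header lines; note, though, that tasklist prints a single INFO line when nothing matches the filter. A slightly more robust sketch (the count_instances name is made up) counts only the rows that actually mention the image:
count_instances <- function(image = "Rscript.exe") {
  out <- system(sprintf('tasklist /FI "IMAGENAME eq %s"', image), intern = TRUE)
  # count only lines naming the image, so header/INFO lines don't skew the result
  sum(grepl(image, out, fixed = TRUE))
}
count_instances("Rscript.exe")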