How to write out a log during parallel computation? How to debug parallel computation? - r

I found that if there is more than one print call during a parallel computation, only the last one is displayed on the console. So I set the outfile option, hoping to capture the result of every print. Here is the R code:
library(snow)  # snow provides SOCK clusters (parallel's equivalent is type = "PSOCK")
cl <- makeCluster(3, type = "SOCK", outfile = "log.txt")
abc <<- 123
clusterExport(cl, "abc")
clusterApplyLB(cl, 1:6,
  function(y) {
    print(paste("before:", abc))
    abc <<- y
    print(paste("after:", abc))
  }
)
stopCluster(cl)
But I just get three records:
starting worker for localhost:11888
Type: EXEC
Type: EXEC
[1] "index: 3"
[1] "before: 123"
[1] "after: 2"
Type: EXEC
[1] "index: 6"
[1] "before: 2"
[1] "after: 6"
Type: DONE

It looks like you're only getting the output from one worker in log.txt. I've often wondered if that could happen, because when you specify outfile="log.txt", each of the workers will open log.txt for appending and then call sink. Here is the code that is executed by the worker processes when outfile is not an empty string:
## all the workers log to the same file.
outcon <- file(outfile, open = "a")
sink(outcon)
sink(outcon, type = "message")
This makes me nervous because I'm not certain what might happen with all of the workers opening the same file for appending at the same time. It may be OS or file system dependent, and it might explain why you're only getting the output from one worker.
For this reason, I tend to use outfile="", in which case this code isn't executed, allowing the output operations to happen normally without redirecting them with the sink function. However, on Windows, you won't see the output if you're using Rgui, so use Rterm instead.
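For instance, a minimal sketch of that approach, using a snow SOCK cluster like the one in the question (run it from Rterm or a plain terminal so the worker output is visible):
cl <- makeCluster(3, type = "SOCK", outfile = "")  # "" leaves worker output unredirected
res <- clusterApplyLB(cl, 1:6, function(y) print(paste("task:", y)))
stopCluster(cl)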
There shouldn't be a problem with multiple print statements in a task, but if you didn't set outfile, you shouldn't see any output since all output is redirected to /dev/null in that case.
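Alternatively, a hedged workaround that is not part of this answer: have each worker sink to its own log file, keyed by its process ID, so no two workers ever share a file handle. A minimal sketch, reusing the cl from the question:
clusterEvalQ(cl, {
  logcon <- file(sprintf("worker_%d.log", Sys.getpid()), open = "a")
  sink(logcon)                    # redirect standard output on this worker
  sink(logcon, type = "message")  # redirect messages/warnings on this worker
})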

Related

Run command with `system2(..., wait = FALSE)` in R and kill it later

Assume I want to start a local server from R, access it a few times for calculations, and kill it again at the end. So:
## start the server
system2("sleep", "100", wait = FALSE)
## do some work
# Here I want to kill the process started at the beginning
This should be cross-platform (Mac, Linux, Windows, ...), as it is part of a package.
How can I achieve this in R?
EDIT 1
The command I actually have to run is a Java jar:
system2("java", "... plantuml.jar ...")
Use the processx package.
proc <- processx::process$new("sleep", c("100"))
proc
# PROCESS 'sleep', running, pid 431792.
### ... pause ...
proc
# PROCESS 'sleep', running, pid 431792.
proc$kill()
# [1] TRUE
proc
# PROCESS 'sleep', finished.
proc$get_exit_status()
# [1] 2
The finished label is just an indication that the process has exited, not whether it succeeded or failed. Use the exit status, where 0 indicates a clean exit.
proc <- processx::process$new("sleep", c("1"))
Sys.sleep(1)
proc
# PROCESS 'sleep', finished.
proc$get_exit_status()
# [1] 0
FYI, base R's system and system2 work fine when you don't need to kill the process easily and there are no embedded spaces in the command or its arguments. system2 looks like it should handle embedded spaces better (since it accepts a vector of arguments), but under the hood all it does is
command <- paste(c(shQuote(command), env, args), collapse = " ")
and then
rval <- .Internal(system(command, flag, f, stdout, stderr, timeout))
which does nothing to protect the arguments.
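As a hedged illustration (on a Unix-alike, with a hypothetical file named file with spaces.txt), the caller has to do the quoting, e.g. with shQuote():
system2("ls", "file with spaces.txt")           # the shell sees three separate arguments
system2("ls", shQuote("file with spaces.txt"))  # quoted, the shell sees one argument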
You said "do some work", which suggests that you need to pass something to/from it periodically. processx does support writing to the standard input of the background process, as well as capturing its output. Its documentation at https://processx.r-lib.org/ is rich with great examples on this, including one-time calls for output or a callback function.
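As a minimal sketch of that (assuming a cat binary is on the PATH, e.g. on Linux/macOS), you can write to the child's standard input, poll, and read the echoed output back:
proc <- processx::process$new("cat", stdin = "|", stdout = "|")
proc$write_input("hello\n")  # send a line to the child's stdin
proc$poll_io(2000)           # wait up to 2 seconds for output to become available
proc$read_output_lines()
proc$kill()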

Simultaneously save and print R system call output?

Within my R script I am calling a shell script. I would like to both print the output to the console in real time and save the output for debugging. For example:
system("script.sh")
prints to console in real time,
out <- system("script.sh", intern = TRUE)
saves the output to a variable for debugging,
and
(out <- system("script.sh", intern = TRUE))
will only print the contents of out after the script has finished. Is there any way to both print to console in real time and store the output as a variable?
Since R is waiting for this to complete anyway, to see the stdout in real time you generally need to poll the process for output. (One can/should also poll stderr, depending on your needs.)
Here's a quick example using processx.
First, I'll create a slow-output shell script; replace this with the real reason you're calling system. I've named this myscript.sh.
#!/bin/bash
for i in `seq 1 5` ; do
  sleep 3
  echo 'hello world: '$i
done
Now let's (1) start a process in the background, then (2) poll its output every second.
proc <- processx::process$new("bash", c("-c", "./myscript.sh"), stdout = "|")
output <- character(0)
while (proc$is_alive()) {
  Sys.sleep(1)
  now <- Sys.time()
  tmstmp <- sprintf("# [%s]", format(now, format = "%T"))
  thisout <- proc$read_output_lines()
  if (length(thisout)) {
    output <- c(output, thisout)
    message(tmstmp, " New output!\n", paste("#>", thisout))
  } else message(tmstmp)
}
# [13:09:29]
# [13:09:30]
# [13:09:31]
# [13:09:32]New output!
#> hello world: 1
# [13:09:33]
# [13:09:34]
# [13:09:35]New output!
#> hello world: 2
# [13:09:36]
# [13:09:37]
# [13:09:38]New output!
#> hello world: 3
# [13:09:39]
# [13:09:40]
# [13:09:41]New output!
#> hello world: 4
# [13:09:42]
# [13:09:43]
# [13:09:44]New output!
#> hello world: 5
And its output is stored:
output
# [1] "hello world: 1" "hello world: 2" "hello world: 3" "hello world: 4" "hello world: 5"
Ways that this can be extended:
Add/store a timestamp with each message, so you know when it came in. The accuracy and utility of this depends on how frequently you want R to poll the process stdout pipe, and really how much you need this information.
Run the process in the background, and even poll for it in background cycles. I use the later package and set up a self-recurring function that polls, appends, and re-submits itself into the later queue (a minimal sketch follows this list). The benefit is that you can continue to use R; the drawback is that if you're running long code, you will not see output until your current code finishes and lets R idle. (To understand this bullet, one really must play with the later package, which is a bit beyond this answer.)
Depending on your intentions, it might be more appropriate for the output to go to a file and "permanently" store it there instead of relying on the R process to keep tabs. There are disadvantages to this, in that now you need to manage polling a file for changes, and R isn't making that easy (it does not have, for instance, direct/easy access to inotify, so now it gets even more complicated).
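Here is a minimal sketch of that later-based idea (assuming proc and output from the example above, and that the later package is installed): a function that polls once, appends any new lines, and reschedules itself while the process is alive.
poll_in_background <- function() {
  thisout <- proc$read_output_lines()
  if (length(thisout)) output <<- c(output, thisout)  # append to the 'output' vector above
  if (proc$is_alive()) later::later(poll_in_background, delay = 1)
}
later::later(poll_in_background, delay = 1)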

R parallel package parSapply(): cannot print message [duplicate]


Speed up API calls in R

I am querying Freebase to get the genre information for some 10000 movies.
After reading How to optimise scraping with getURL() in R, I tried to execute the requests in parallel. However, I failed - see below. Besides parallelization, I also read that httr might be a better alternative to RCurl.
My questions are:
Is it possible to speed up the API calls by using
a) a parallel version of the loop below (using a WINDOWS machine)?
b) alternatives to getURL such as GET in the httr-package?
library(RCurl)
library(jsonlite)
library(foreach)
library(doSNOW)
df <- data.frame(film=c("Terminator", "Die Hard", "Philadelphia", "A Perfect World", "The Parade", "ParaNorman", "Passengers", "Pink Cadillac", "Pleasantville", "Police Academy", "The Polar Express", "Platoon"), genre=NA)
f_query_freebase <- function(film.title){
  request <- paste0("https://www.googleapis.com/freebase/v1/search?",
                    "filter=", paste0("(all alias{full}:", "\"", film.title, "\"", " type:\"/film/film\")"),
                    "&indent=TRUE",
                    "&limit=1",
                    "&output=(/film/film/genre)")
  temp <- getURL(URLencode(request), ssl.verifypeer = FALSE)
  data <- fromJSON(temp, simplifyVector = FALSE)
  genre <- paste(sapply(data$result[[1]]$output$`/film/film/genre`[[1]], function(x){as.character(x$name)}), collapse = " | ")
  return(genre)
}
# Non-parallel version
# ----------------------------------
for (i in df$film){
  df$genre[which(df$film==i)] <- f_query_freebase(i)
}
# Parallel version - Does not work
# ----------------------------------
# Set up parallel computing
cl<-makeCluster(2)
registerDoSNOW(cl)
foreach(i=df$film) %dopar% {
  df$genre[which(df$film==i)] <- f_query_freebase(i)
}
stopCluster(cl)
# --> I get the following error: "Error in { : task 1 failed", further saying that it cannot find the function "getURL".
This doesn't achieve parallel requests within a single R session; however, it's something I've used to make more than one simultaneous request (i.e., in parallel) across multiple R sessions, so it may be useful.
At a high level
You'll want to break the process into a few parts:
Get a list of the URLs/API calls you need to make and store as a csv/text file
Use the code below as a template for starting multiple R processes and dividing the work among them
Note: this happened to run on Windows, so I used PowerShell. On a Mac this could be written in bash.
Powershell/bash script
Use a single PowerShell script to start multiple R processes (here we divide the work between 3 processes):
E.g. save a plain text file with a .ps1 file extension; you can double-click it to run it, or schedule it with Task Scheduler/cron:
start powershell { cd C:\Users\Administrator\Desktop; Rscript extract.R 1; TIMEOUT 20000 }
start powershell { cd C:\Users\Administrator\Desktop; Rscript extract.R 2; TIMEOUT 20000 }
start powershell { cd C:\Users\Administrator\Desktop; Rscript extract.R 3; TIMEOUT 20000 }
What's it doing? It will:
Go to the Desktop, start the script it finds there called extract.R, and provide an argument to the R script (1, 2, and 3, respectively).
The R processes
Each R process can look like this
# Get the command line argument
arguments <- commandArgs(trailingOnly = TRUE)
process_number <- as.numeric(arguments[1])
api_calls <- read.csv("api_calls.csv")
# Work out which API calls this R process should make
# (e.g. process 1 takes rows 1, 4, 7, ...; process 2 takes rows 2, 5, 8, ...)
indices <- seq(process_number, nrow(api_calls), 3)
api_calls_for_this_process_only <- api_calls[indices, ] # this subsets 1/3 of the API calls
# (the other two processes will take care of the remaining calls)
# Now, make the API calls as usual using rvest/jsonlite or whatever you use for that
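From there, a minimal sketch of the calls themselves with httr (point b of the question); the url column name in api_calls.csv is hypothetical, so swap in your real request-building logic:
library(httr)
library(jsonlite)
results <- lapply(api_calls_for_this_process_only$url, function(u) {
  resp <- GET(u)
  fromJSON(content(resp, as = "text", encoding = "UTF-8"), simplifyVector = FALSE)
})
# then write this process's results out, e.g. to its own csv file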

Detect number of running R instances in Windows within R

I am writing some R code and would like it to detect the number of running instances of R on Windows, so the script can choose whether or not to run a particular set of scripts (i.e., if there are already more than 2 instances of R running, do X; otherwise do Y).
Is there a way to do this within R?
EDIT:
Here is some info on the purpose as requested:
I have a very long set of scripts that apply a Bayesian network model, using the catnet package, to thousands of cases. The code processes each case and outputs its results to a csv file. Most of the parallel computing alternatives I have tried have not been ideal, as they suppress a lot of the built-in progress reporting, so I have been running subsets of the cases in different instances of R. I know this is somewhat antiquated, but it works for me, so I wanted a way for the code to subset the cases automatically based on the number of instances running.
I currently do this by hand, opening multiple instances of Rscript from CMD, each running a slightly differently configured R file, like this:
cd "Y:\code\BN_code"
START "" "C:\Program Files\R\R-3.0.0\bin\x64\Rscript.exe" "process spp data3.r" /b
START "" "C:\Program Files\R\R-3.0.0\bin\x64\Rscript.exe" "process spp data3_T1.r" /b
START "" "C:\Program Files\R\R-3.0.0\bin\x64\Rscript.exe" "process spp data3_T2.r" /b
START "" "C:\Program Files\R\R-3.0.0\bin\x64\Rscript.exe" "process spp data3_T3.r" /b
EDIT2:
Thanks to the answers below, here is my implementation of what I call 'poor man's parallel computing' in R:
If you have a long script that has to be applied to a long list of cases, use the code below to break the long list into sublists, one to be fed to each instance of Rscript:
# The cases that I need to apply my code to:
splist=c("sp01", "sp02", "sp03", "sp04", "sp05", "sp06", "sp07", "sp08", "sp09", "sp010", "sp11", "sp12",
         "sp013", "sp014", "sp015", "sp16", "sp17", "sp018", "sp19", "sp20", "sp21", "sp22", "sp23", "sp24")
### Automatic subsetting of cases based on the number of running Rscript instances:
cpucores=as.integer(Sys.getenv('NUMBER_OF_PROCESSORS'))
n_instances=length(system('tasklist /FI "IMAGENAME eq Rscript.exe" ', intern = TRUE))-3
jnk=length(system('tasklist /FI "IMAGENAME eq rstudio.exe" ', intern = TRUE))-3
if (jnk>0) rstudiorun=TRUE else rstudiorun=FALSE
if (!rstudiorun & n_instances>0 & cpucores>1){ # if the code is being run from Rscript, not
  # from RStudio, and there is more than one core available
  jnkn=length(splist)
  jnk=seq(1,jnkn,round(jnkn/cpucores,0))
  jnk=c(jnk,jnkn)
  splist=splist[jnk[n_instances]:jnk[n_instances+1]]
}
### End automatic subsetting of cases
# Perform your script on the subset of cases:
for(sp in splist){
  ptm0 <- proc.time()
  Sys.sleep(6)  # stand-in for the real per-case work
  ptm1=proc.time() - ptm0
  jnk=as.numeric(ptm1[3])
  cat('\n','It took ', jnk, "seconds to do species", sp)
}
To run this code on multiple instances of R automatically in Windows, just create a .bat file:
cd "C:\Users\lfortini\code\misc code\misc r code"
START "" "C:\Program Files\R\R-3.0.0\bin\x64\Rscript.exe" "rscript_multiple_instances.r" /b
timeout 10
START "" "C:\Program Files\R\R-3.0.0\bin\x64\Rscript.exe" "rscript_multiple_instances.r" /b
timeout 10
START "" "C:\Program Files\R\R-3.0.0\bin\x64\Rscript.exe" "rscript_multiple_instances.r" /b
timeout 10
START "" "C:\Program Files\R\R-3.0.0\bin\x64\Rscript.exe" "rscript_multiple_instances.r" /b
exit
The timeout is there to give R enough time to detect its own number of instances.
Double-clicking this .bat file will automatically open several instances of Rscript, each taking on a particular subset of the cases you want to analyse, while still showing the script's progress in each window. The nice thing about this approach is that you pretty much just have to slap the automated list-subsetting code in front of whichever iteration mechanism your code uses (loops, apply functions, etc.). Then launch the code with Rscript, via the .bat file or manually, and you are set.
Actually it is easier than expected, as Windows comes with the handy tasklist command.
With it you can list all running processes, from which you simply need to count the number of Rscript.exe instances (I use stringr here for the string manipulation).
require(stringr)
progs <- system("tasklist", intern = TRUE)
progs <- vapply(str_split(progs, "[[:space:]]"), "[[", "", i = 1)
sum(progs == "Rscript.exe")
That should do the trick. (I only tried it with counting instances of Rgui.exe but that works fine.)
You can do it even more concisely, as below:
length(grep("rstudio\\.exe", system("tasklist", intern = TRUE)))
Replace rstudio with Rscript or any other process name.
Or even shorter
length(system('tasklist /FI "IMAGENAME eq Rscript.exe" ', intern = TRUE))-3
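To close the loop on the original question ("if there are already more than 2 instances of R running, do X, else Y"), a minimal sketch wrapping that count in the decision (X and Y are placeholders):
n_rscript <- length(system('tasklist /FI "IMAGENAME eq Rscript.exe" ', intern = TRUE)) - 3  # minus tasklist's 3 header lines
if (n_rscript > 2) {
  # do X
} else {
  # do Y
}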
