Detect number of running R instances in Windows from within R

I am writing an R script and would like to detect the number of running instances of R on Windows, so the script can choose whether or not to run a particular set of scripts (i.e., if there are already more than 2 instances of R running, do X, else Y).
Is there a way to do this within R?
EDIT:
Here is some info on the purpose as requested:
I have a very long set of scripts for applying a Bayesian network model, using the catnet library, to thousands of cases. The code processes each case and writes the results to a CSV file. Most of the parallel computing alternatives I have tried have not been ideal because they suppress much of the built-in progress reporting, so I have been running subsets of the cases in separate instances of R. I know this is somewhat antiquated, but it works for me, so I wanted the code to subset the cases automatically based on the number of instances running.
I currently do this by hand, opening multiple instances of Rscript from CMD, each running a slightly differently configured R file, something like this:
cd "Y:\code\BN_code"
START "" "C:\Program Files\R\R-3.0.0\bin\x64\Rscript.exe" "process spp data3.r" /b
START "" "C:\Program Files\R\R-3.0.0\bin\x64\Rscript.exe" "process spp data3_T1.r" /b
START "" "C:\Program Files\R\R-3.0.0\bin\x64\Rscript.exe" "process spp data3_T2.r" /b
START "" "C:\Program Files\R\R-3.0.0\bin\x64\Rscript.exe" "process spp data3_T3.r" /b
EDIT2:
Thanks to the answers below, here is my implementation of what I call 'poor man's parallel computing' in R.
If you have a long script that has to be applied to a long list of cases, use the code below to break the list into sublists, one for each instance of Rscript:
#the cases that I need to apply my code to:
splist=c("sp01", "sp02", "sp03", "sp04", "sp05", "sp06", "sp07", "sp08", "sp09", "sp010", "sp11", "sp12",
"sp013", "sp014", "sp015", "sp16", "sp17", "sp018", "sp19", "sp20", "sp21", "sp22", "sp23", "sp24")
###automatic subsetting of cases based on number of running instances of r script:
cpucores=as.integer(Sys.getenv('NUMBER_OF_PROCESSORS'))
n_instances=length(system('tasklist /FI "IMAGENAME eq Rscript.exe" ', intern = TRUE))-3
jnk=length(system('tasklist /FI "IMAGENAME eq rstudio.exe" ', intern = TRUE))-3
rstudiorun = (jnk > 0)  # TRUE if an RStudio instance is running
if (!rstudiorun & n_instances > 0 & cpucores > 1) { # code is being run via Rscript,
  # not from RStudio, and more than one core is available
  jnkn = length(splist)
  jnk = seq(1, jnkn, round(jnkn/cpucores, 0))
  jnk = c(jnk, jnkn)
  splist = splist[jnk[n_instances]:jnk[n_instances + 1]]
}
###end automatic subsetting of cases
#perform your script on subset of list of cases:
for (sp in splist) {
  ptm0 <- proc.time()
  Sys.sleep(6)
  ptm1 = proc.time() - ptm0
  jnk = as.numeric(ptm1[3])
  cat('\n', 'It took ', jnk, "seconds to do species", sp)
}
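As an aside, the same partitioning could also be done with base R's split(); this is just a sketch (not part of the original post) and assumes the rstudiorun, n_instances, and cpucores variables computed above:
# Alternative subsetting sketch: cut the case list into cpucores chunks of
# roughly equal size and keep the chunk matching this instance's position.
if (!rstudiorun & n_instances > 0 & cpucores > 1) {
  chunks <- split(splist, cut(seq_along(splist), cpucores, labels = FALSE))
  splist <- chunks[[min(n_instances, length(chunks))]]
}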
To run this code in multiple instances of R automatically on Windows, just create a .bat file:
cd "C:\Users\lfortini\code\misc code\misc r code"
START "" "C:\Program Files\R\R-3.0.0\bin\x64\Rscript.exe" "rscript_multiple_instances.r" /b
timeout 10
START "" "C:\Program Files\R\R-3.0.0\bin\x64\Rscript.exe" "rscript_multiple_instances.r" /b
timeout 10
START "" "C:\Program Files\R\R-3.0.0\bin\x64\Rscript.exe" "rscript_multiple_instances.r" /b
timeout 10
START "" "C:\Program Files\R\R-3.0.0\bin\x64\Rscript.exe" "rscript_multiple_instances.r" /b
exit
The timeout gives each new instance enough time to count the instances already running before the next one starts.
Double-clicking this .bat file automatically opens several instances of Rscript, each taking on a particular subset of the cases you want to analyse, while still showing all of the script's progress in its own window. The nice thing about this approach is that you pretty much just have to slap the automated list-subsetting code in front of whichever iteration mechanism you use (loops, apply functions, etc.). Then launch the code with Rscript, via the .bat file or manually, and you are set.

Actually it is easier than expected, as Windows comes with the handy tasklist command.
With it you can list all running processes, from which you simply need to count the Rscript.exe instances (I use stringr here for the string manipulation).
require(stringr)
progs <- system("tasklist", intern = TRUE)
progs <- vapply(str_split(progs, "[[:space:]]"), "[[", "", i = 1)
sum(progs == "Rscript.exe")
That should do the trick. (I only tried it with counting instances of Rgui.exe but that works fine.)

You can do it even more concisely, as below:
length(grep("rstudio\\.exe", system("tasklist", intern = TRUE)))
Replace rstudio with Rscript or any other process name. Or even shorter:
length(system('tasklist /FI "IMAGENAME eq Rscript.exe" ', intern = TRUE))-3
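The "- 3" in these one-liners assumes tasklist prints three header lines (a blank line, the column names, and a separator) above the matching rows; when nothing matches the filter, tasklist prints a single INFO line instead, so the count can go negative. A small helper that guards against that (just a sketch, assuming an English-language Windows, since it greps the INFO message):
count_instances <- function(image = "Rscript.exe") {
  out <- system(sprintf('tasklist /FI "IMAGENAME eq %s"', image), intern = TRUE)
  if (any(grepl("No tasks", out))) return(0L)  # nothing matched the filter
  length(out) - 3L  # drop the blank line, header row, and separator
}
count_instances("Rscript.exe")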

Related

Julia Distributed, redundant iterations appearing

I ran
mpiexec -n $nprocs julia --project myfile.jl
on a cluster, where myfile.jl has the following form
using Distributed; using Dates; using JLD2; using LaTeXStrings
@everywhere begin
using SharedArrays; using QuantumOptics; using LinearAlgebra; using Plots; using Statistics; using DifferentialEquations; using StaticArrays
#Defining some other functions and SharedArrays to be used later e.g.
MySharedArray=SharedArray{SVector{Nt,Float64}}(Np,Np)
end
@sync @distributed for pp in 1:Np^2
for jj in 1:Nj
#do some stuff with local variables
for tt in 1:Nt
#do some stuff with local variables
end
end
MySharedArray[pp]=... #using linear indexing
println("$pp finished")
end
timestr=Dates.format(Dates.now(), "yyyy-mm-dd-HH:MM:SS")
filename="MyName"*timestr
@save filename*".jld2"
#later on, some other small stuff like making and saving a figure. (This does give an error "no method matching heatmap_edges(::Surface{Array{Float64,2}}, ::Symbol)" but I think that this is a technical thing about Plots so not very related to the bigger issue here)
However, when looking at the output, there are a few issues that make me conclude that something is wrong:
The "$pp finished" output is repeated many times for each value of pp. This count actually seems to be equal to 32 = $nprocs.
Despite the code not being finished, "MyName" files are generated. There should be one, but I get a dozen of them with different timestr components.
EDIT: two more things that I can add
the output of the different "MyName" files is not identical, but this is expected since random numbers are used in the inner loops. There are 28 of them, a number that I don't easily recognize except that it's again close to the 32 of $nprocs
earlier, I wrote that the walltime was exceeded, but this turns out not to be true. The .o file ends with "BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES ... EXIT CODE :9", pretty shortly after the last output file.
$nprocs is obtained in the PBS script through
#PBS -l select=1:ncpus=32:mpiprocs=32
nprocs=`cat $PBS_NODEFILE | wc -l`
As pointed out by adamslc on the Julia discourse, the proper way to use Julia on a cluster is to either
start a session with one core from the job script and add more workers with addprocs() in the Julia script itself, or
use more specialized Julia packages.
https://discourse.julialang.org/t/julia-distributed-redundant-iterations-appearing/57682/3

How to pass bash variable into R script

I have a couple of R scripts that process data in a particular input folder. I have a few folders I need to run these scripts on, so I started writing a bash script to loop through the folders and run the R scripts on each.
I'm not familiar with R at all (the scripts were written by a previous worker and are basically a black box for me), and I'm inexperienced with passing variables between scripts, especially across languages. There's also an issue when I call source("$SWS_output/Step_1_Setup.R") here: R isn't reading $SWS_output as a variable, but rather as a literal string.
Here's my bash script:
#!/bin/bash
# Inputs
workspace="`pwd`"
preprocessed="$workspace/6_preprocessed"
# Output
SWS_output="$workspace/7_SKSattempt4_results/"
# create output directory
mkdir -p $SWS_output
# Copy data from preprocessed to SWS_output
cp -a $preprocessed/* $SWS_output
# Loop through folders in the output and run the R code on each folder
for qdir in $SWS_output/*/; do
qdir_name=`basename $qdir`
echo -e 'source("$SWS_output/Step_1_Setup.R") \n source("$SWS_output/Step_2_data.R") \n q()' | R --no-save
done
I need to pass the variable "qdir" into the second R script (Step_2_data.R) to tell it which folder to process.
Thanks!
My previous answer was incomplete. Here is a better effort to explain command line parsing.
It is pretty easy to use R's commandArgs function to process command line arguments. I wrote a small tutorial https://gitlab.crmda.ku.edu/crmda/hpcexample/tree/master/Ex51-R-ManySerialJobs. In cluster computing this works very well for us. The whole hpcexample repo is open source/free.
The basic idea is that in the command line you can run R with command line arguments, as in:
R --vanilla -f r-clargs-3.R --args runI=13 parmsC="params.csv" xN=33.45
In this case, my R program is the file r-clargs-3.R and the arguments the file will import are three space-separated elements: runI, parmsC, xN. You can add as many of these space-separated parameters as you like. It is completely at your discretion what they are called, but they must be separated by spaces and there must be NO SPACE around the equal signs. Character string values should be quoted.
My habit is to name the arguments with the suffix "I" to hint that the value is an integer, "C" for character, and "N" for floating point numbers.
In the file r-clargs-3.R, include some code to read the arguments and sort through them. For example, from my tutorial:
cli <- commandArgs(trailingOnly = TRUE)
args <- strsplit(cli, "=", fixed = TRUE)
The rest of the work is sorting through args. This is my most evolved stanza for doing so: it looks for the suffixes "I", "N", "C", and "L" (for logical) and coerces the inputs to the corresponding types, since all inputs arrive as character strings unless we coerce with as.integer(), etc.:
for (e in args) {
  argname <- e[1]
  if (!is.na(e[2])) {
    argval <- e[2]
    ## regular expression to delete initial \" and trailing \"
    argval <- gsub("(^\\\"|\\\"$)", "", argval)
  } else {
    # If arg specified without value, assume it is bool type and TRUE
    argval <- TRUE
  }
  # Infer type from last character of argname, cast val
  type <- substring(argname, nchar(argname), nchar(argname))
  if (type == "I") {
    argval <- as.integer(argval)
  }
  if (type == "N") {
    argval <- as.numeric(argval)
  }
  if (type == "L") {
    argval <- as.logical(argval)
  }
  assign(argname, argval)
  cat("Assigned", argname, "=", argval, "\n")
}
That will create variables in the R session named paramsC, runI, and xN.
The convenience of this approach is that the same base R code can be run with 100s or 1000s of command parameter variations. Good for Monte Carlo simulation, etc.
Thanks for all the answers; they were very helpful. I was able to get a solution that works. Here's my completed script.
#!/bin/bash
# Inputs
workspace="`pwd`"
preprocessed="$workspace/6_preprocessed"
# Output
SWS_output="$workspace/7_SKSattempt4_results"
# create output directory
mkdir -p $SWS_output
# Copy data from preprocessed to SWS_output
cp -a $preprocessed/* $SWS_output
cd $SWS_output
# Loop through folders in the output and run the R code on each folder
for qdir in $SWS_output/*/; do
qdir_name=`basename $qdir`
echo $qdir_name
export VARIABLENAME=$qdir
echo -e 'source("Step_1_Setup.R") \n source("Step_2_Data.R") \n q()' | R --no-save --slave
done
And then the R script looks like this:
qdir<-Sys.getenv("VARIABLENAME")
pathname<-qdir[1]
As a couple of comments have pointed out, this isn't best practice, but this worked exactly as I wanted it to. Thanks!
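For reference, a variant of this (just a sketch, not what was used above) passes qdir as a command line argument instead of an environment variable, invoking the script from the loop as Rscript Step_2_Data.R "$qdir" and reading it with commandArgs:
# Step_2_Data.R, invoked from bash as: Rscript Step_2_Data.R "$qdir"
args <- commandArgs(trailingOnly = TRUE)
pathname <- args[1]  # first positional argument: the folder to process
cat("Processing folder:", pathname, "\n")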

R system() error when using brace expansion to match folders in linux

I have a series of sequential directories to gather files from on a linux server that I am logging into remotely and processing from an R terminal.
/r18_060, /r18_061, ... /r18_118, /r18_119
Each directory corresponds to the day of year the data was logged on, and it contains a series of files with a standard prefix, such as "fl.060.gz".
I have to supply a function that contains multiple system() commands with a Linux glob for the day. I want to divide the year into 60-day intervals to make the QA/QC more manageable. Since I'm crossing from 099 to 100 in the glob, I have to use brace expansion to match the correct sequence of days.
ls -d /root_directory/r18_{0[6-9]?,1[0-1]?}
ls -d /root_directory/r18_{060..119}
All of these work fine when I manually input these globs into my bash shell, but I get an error when the system() function provides a similar command through R.
day_glob <- "{060..119}"
system(paste("zcat /root_directory/r18_", day_glob, "/fl.???.gz > tmpfile", sep = ""))
>gzip: cannot access '/root_directory/r18_{060..119}': No such file or directory
I know that this could be an error in the shell that the system() function operates in, but when I query that it gives the correct environment and user name
system("env | grep ^SHELL=")
>SHELL=/bin/bash
system("echo $USER")
>tgw
Does anyone know why this fails when it is passed through R's system() command? What can I do to get around this problem without removing the system call altogether? There are many scripts that rely on these functions, and re-writing the entire family of R scripts would be time prohibitive.
Previously I had been using 50-day intervals, which avoids this problem, but I thought this should be an easy change and would save one iteration of my QA/QC scripts per year. I'm new to the Linux OS, so I figured I might just be missing something obvious.
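Worth noting (an editorial sketch, not from the original post): R's system() hands the command to /bin/sh, which on many Linux systems is not bash and does not perform brace expansion, so one workaround is to expand the day sequence in R itself, or to run the command through bash explicitly:
# Build the day sequence in R instead of relying on shell brace expansion
days <- sprintf("%03d", 60:119)
paths <- paste0("/root_directory/r18_", days, "/fl.???.gz")
system(paste("zcat", paste(paths, collapse = " "), "> tmpfile"))
# Or force the command through bash, which does support brace expansion
system2("bash", c("-c", shQuote("zcat /root_directory/r18_{060..119}/fl.???.gz > tmpfile")))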

Speed up API calls in R

I am querying Freebase to get the genre information for some 10000 movies.
After reading How to optimise scraping with getURL() in R, I tried to execute the requests in parallel. However, I failed - see below. Besides parallelization, I also read that httr might be a better alternative to RCurl.
My questions are:
Is it possible to speed up the API calls by using
a) a parallel version of the loop below (using a WINDOWS machine)?
b) alternatives to getURL such as GET in the httr-package?
library(RCurl)
library(jsonlite)
library(foreach)
library(doSNOW)
df <- data.frame(film=c("Terminator", "Die Hard", "Philadelphia", "A Perfect World", "The Parade", "ParaNorman", "Passengers", "Pink Cadillac", "Pleasantville", "Police Academy", "The Polar Express", "Platoon"), genre=NA)
f_query_freebase <- function(film.title){
request <- paste0("https://www.googleapis.com/freebase/v1/search?",
"filter=", paste0("(all alias{full}:", "\"", film.title, "\"", " type:\"/film/film\")"),
"&indent=TRUE",
"&limit=1",
"&output=(/film/film/genre)")
temp <- getURL(URLencode(request), ssl.verifypeer = FALSE)
data <- fromJSON(temp, simplifyVector=FALSE)
genre <- paste(sapply(data$result[[1]]$output$`/film/film/genre`[[1]], function(x){as.character(x$name)}), collapse=" | ")
return(genre)
}
# Non-parallel version
# ----------------------------------
for (i in df$film){
df$genre[which(df$film==i)] <- f_query_freebase(i)
}
# Parallel version - Does not work
# ----------------------------------
# Set up parallel computing
cl<-makeCluster(2)
registerDoSNOW(cl)
foreach(i=df$film) %dopar% {
df$genre[which(df$film==i)] <- f_query_freebase(i)
}
stopCluster(cl)
# --> I get the following error: "Error in { : task 1 failed", further saying that it cannot find the function "getURL".
This doesn't achieve parallel requests within a single R session; however, it's something I've used to run multiple simultaneous requests (i.e., in parallel) across multiple R sessions, so it may be useful.
At a high level
You'll want to break the process into a few parts:
Get a list of the URLs/API calls you need to make and store as a csv/text file
Use the code below as a template for starting multiple R processes and dividing the work among them
Note: this happened to run on Windows, so I used PowerShell. On a Mac this could be written in bash.
Powershell/bash script
Use a single PowerShell script to start multiple R processes (here we divide the work between 3 processes):
e.g. save it as a plain text file with a .ps1 extension; you can double-click it to run it, or schedule it with Task Scheduler/cron:
start powershell { cd C:\Users\Administrator\Desktop; Rscript extract.R 1; TIMEOUT 20000 }
start powershell { cd C:\Users\Administrator\Desktop; Rscript extract.R 2; TIMEOUT 20000 }
start powershell { cd C:\Users\Administrator\Desktop; Rscript extract.R 3; TIMEOUT 20000 }
What's it doing? It will:
Go to the Desktop, start the script it finds there called extract.R, and provide an argument to the R script (1, 2, and 3, respectively).
The R processes
Each R process can look like this
# Get command line argument
arguments <- commandArgs(trailingOnly = TRUE)
process_number <- as.numeric(arguments[1])
api_calls <- read.csv("api_calls.csv")
# work out which API calls this R process should make (e.g. every 3rd row)
indices <- seq(process_number, nrow(api_calls), 3)
api_calls_for_this_process_only <- api_calls[indices, ] # this subsets 1/3 of the API calls
# (the other two processes will take care of the remaining calls)
# Now, make API calls as usual using rvest/jsonlite or whatever you use for that
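Coming back to the original questions: the error in the posted foreach attempt ("cannot find the function getURL") is typically fixed by loading the needed packages on the workers via foreach's .packages argument, and httr::GET can stand in for RCurl::getURL (question b). A rough sketch under those assumptions; f_query_freebase_httr is simply a renamed adaptation of the question's function, and the Freebase API itself has since been retired:
library(httr)
library(jsonlite)
library(foreach)
library(doSNOW)
f_query_freebase_httr <- function(film.title) {
  request <- paste0("https://www.googleapis.com/freebase/v1/search?",
                    "filter=(all alias{full}:\"", film.title, "\" type:\"/film/film\")",
                    "&limit=1&output=(/film/film/genre)")
  resp <- GET(URLencode(request))
  data <- fromJSON(content(resp, as = "text"), simplifyVector = FALSE)
  paste(sapply(data$result[[1]]$output$`/film/film/genre`[[1]],
               function(x) as.character(x$name)), collapse = " | ")
}
cl <- makeCluster(2)
registerDoSNOW(cl)
df$genre <- foreach(i = df$film, .combine = c,
                    .packages = c("httr", "jsonlite"),
                    .export = "f_query_freebase_httr") %dopar% {
  f_query_freebase_httr(i)
}
stopCluster(cl)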

Running a Windows executable file from within R with command line options

I am trying to call a Windows program called AMDIS from within R using the call
system("C:/NIST08/AMDIS32/AMDIS_32.exe /S C:/Users/Ento/Documents/GCMS/test_cataglyphis_iberica/queens/CI23_Q_120828_01.CDF")
in order to carry out an analysis (specified using the /S switch) on a file called CI23_Q_120828_01.CDF, but it seems that no matter what I try the file is not loaded correctly, presumably because the options are not passed along. Does anyone have a clue what I might be doing wrong?
Right now this command either
doesn't do anything,
makes AMDIS pop up, but it doesn't load the file I specify
gives me the error
Warning message:
running command 'C:/NIST08/AMDIS32/AMDIS_32.exe /S
C:/Users/Ento/Documents/GCMS/test_cataglyphis_iberica/queens/CI23_Q_120828_01.CDF'
had status 65535
(I have no idea what results in these different outcomes of the same command)
(the AMDIS command line options are described here, on page 8)
Cheers,
Tom
EDIT:
Found it had to do with forward vs backslashes - running
system("C:\\NIST08\\AMDIS32\\AMDIS_32.EXE C:\\Users\\Ento\\Documents\\GCMS\\test_cataglyphis_iberica\\queens\\CI23_Q_120828_01.CDF /S /E")
seems to work - thank you all for the suggestions!
You've heard of bquote, noquote, sQuote, dQuote, quote, enquote, and Quotes; well, now meet shQuote! :-)
This little function formats a string to be passed to an operating system shell. Personally I find that I can get embroiled in backslash-escaping hell, and shQuote saves me. Simply type the character string as you would on the command line of your choice ('sh' for Unix-alikes like bash, 'csh' for the C shell, and 'cmd' for the Windows shell) within shQuote and it will format it for a call from R using system():
shQuote("C:/NIST08/AMDIS32/AMDIS_32.exe /S C:/Users/Ento/Documents/GCMS/test_cataglyphis_iberica/queens/CI23_Q_120828_01.CDF" , type = "cmd" )
#[1] "\"C:/NIST08/AMDIS32/AMDIS_32.exe /S C:/Users/Ento/Documents/GCMS/test_cataglyphis_iberica/queens/CI23_Q_120828_01.CDF\""
More generally, you can use shQuote like this:
system( shQuote( "mystring" , type = c("cmd","sh") ) , ... )
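For a command with switches, one related pattern (a sketch, not from the original answer) is to quote only the paths and keep the switches separate, or to let system2() assemble the command from its parts:
exe <- "C:\\NIST08\\AMDIS32\\AMDIS_32.EXE"
cdf <- "C:\\Users\\Ento\\Documents\\GCMS\\test_cataglyphis_iberica\\queens\\CI23_Q_120828_01.CDF"
# Quote each path individually and append the /S and /E switches unquoted
system(paste(shQuote(exe, type = "cmd"), shQuote(cdf, type = "cmd"), "/S /E"))
# Or let system2() separate the executable from its arguments
system2(exe, args = c(cdf, "/S", "/E"))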
