Speed up API calls in R - r

I am querying Freebase to get the genre information for some 10000 movies.
After reading How to optimise scraping with getURL() in R, I tried to execute the requests in parallel. However, I failed - see below. Besides parallelization, I also read that httr might be a better alternative to RCurl.
My questions are:
Is it possible to speed up the API calls by using
a) a parallel version of the loop below (using a WINDOWS machine)?
b) alternatives to getURL such as GET in the httr-package?
library(RCurl)
library(jsonlite)
library(foreach)
library(doSNOW)
df <- data.frame(film=c("Terminator", "Die Hard", "Philadelphia", "A Perfect World", "The Parade", "ParaNorman", "Passengers", "Pink Cadillac", "Pleasantville", "Police Academy", "The Polar Express", "Platoon"), genre=NA)
f_query_freebase <- function(film.title){
request <- paste0("https://www.googleapis.com/freebase/v1/search?",
"filter=", paste0("(all alias{full}:", "\"", film.title, "\"", " type:\"/film/film\")"),
"&indent=TRUE",
"&limit=1",
"&output=(/film/film/genre)")
temp <- getURL(URLencode(request), ssl.verifypeer = FALSE)
data <- fromJSON(temp, simplifyVector=FALSE)
genre <- paste(sapply(data$result[[1]]$output$`/film/film/genre`[[1]], function(x){as.character(x$name)}), collapse=" | ")
return(genre)
}
# Non-parallel version
# ----------------------------------
for (i in df$film){
df$genre[which(df$film==i)] <- f_query_freebase(i)
}
# Parallel version - Does not work
# ----------------------------------
# Set up parallel computing
cl<-makeCluster(2)
registerDoSNOW(cl)
foreach(i=df$film) %dopar% {
df$genre[which(df$film==i)] <- f_query_freebase(i)
}
stopCluster(cl)
# --> I get the following error: "Error in { : task 1 failed", further saying that it cannot find the function "getURL".

This doesn't achieve parallel requests within a single R session, however, it's something I've used to achieve >1 simultaneous requests (e.g. in parallel) across multiple R sessions, so it may be useful.
At a high level
You'll want to break the process into a few parts:
Get a list of the URLs/API calls you need to make and store as a csv/text file
Use the code below as a template for starting multiple R processes and dividing the work among them
Note: this happened to run on windows, so I used powershell. On mac this could be written in bash.
Powershell/bash script
Use a single powershell script to start off multiple instances R processes (here we divide the work between 3 processes):
e.g. save a plain text file with .ps1 file extension, you can double click on it to run it, or schedule it with task scheduler/cron:
start powershell { cd C:\Users\Administrator\Desktop; Rscript extract.R 1; TIMEOUT 20000 }
start powershell { cd C:\Users\Administrator\Desktop; Rscript extract.R 2; TIMEOUT 20000 }
start powershell { cd C:\Users\Administrator\Desktop; Rscript extract.R 3; TIMEOUT 20000 }
What's it doing? It will:
Go the the Desktop, start a script it finds called extract.R, and provide an argument to the R script (1, 2, and 3).
The R processes
Each R process can look like this
# Get command line argument
arguments <- commandArgs(trailingOnly = TRUE)
process_number <- as.numeric(arguments[1])
api_calls <- read.csv("api_calls.csv")
# work out which API calls each R script should make (e.g.
indicies <- seq(process_number, nrow(api_calls), 3)
api_calls_for_this_process_only <- api_calls[indicies, ] # this subsets for 1/3 of the API calls
# (the other two processes will take care of the remaining calls)
# Now, make API calls as usual using rvest/jsonlite or whatever you use for that

Related

Run command with `system2(..., wait = FALSE)` in R and kill it later

Assume I want to start a local server from R, access it some times for calculations and kill it again at the end. So:
## start the server
system2("sleep", "100", wait = FALSE)
## do some work
# Here I want to kill the process started at the beginning
This should be cross platform, as it is part of a package (Mac, Linux, Windows, ...).
How can I achieve this in R?
EDIT 1
The command I have to run is a java jar,
system2("java", "... plantuml.jar ...")
Use the processx package.
proc <- processx::process$new("sleep", c("100"))
proc
# PROCESS 'sleep', running, pid 431792.
### ... pause ...
proc
# PROCESS 'sleep', running, pid 431792.
proc$kill()
# [1] TRUE
proc
# PROCESS 'sleep', finished.
proc$get_exit_status()
# [1] 2
The finished output is just an indication that the process has exited, not whether it was successful or erred. Use the exit status, where 0 indicates a good exit.
proc <- processx::process$new("sleep", c("1"))
Sys.sleep(1)
proc
# PROCESS 'sleep', finished.
proc$get_exit_status()
# [1] 0
FYI, base R's system and system2 work fine when you don't need to easily kill it and don't have any embedded spaces in the command or its arguments. system2 appears like it should be better at embedded spaces (since it accepts a vector of arguments), but under the hood all it's doing is
command <- paste(c(shQuote(command), env, args), collapse = " ")
and then
rval <- .Internal(system(command, flag, f, stdout, stderr, timeout))
which does nothing to protect the arguments.
You said "do some work", which suggests that you need to pass something to/from it periodically. processx does support writing to the standard input of the background process, as well as capturing its output. Its documentation at https://processx.r-lib.org/ is rich with great examples on this, including one-time calls for output or a callback function.

downloading data and saving data to a folder in batches

I have 200,000 links that I am trying to download, I have tried downloading it all in one go but I ran into memory issues.
I am trying to create a function which will download 1000 links at a time and save them in a folder.
Packages:
library(dplyr)
library(purrr)
library(edgarWebR)
A small sample of the data is as follows:
Data 1:
urls_to_parse <- c("https://www.sec.gov/Archives/edgar/data/1750/000104746918004978/a2236183z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000104746917004528/a2232622z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000104746916014299/a2228768z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000104746915006136/a2225345z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000104746914006243/a2220733z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000104746913007797/a2216052z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000104746912007300/a2210166z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000104746911006302/a2204709z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000104746910006500/a2199382z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000104746909006783/a2193700z10-k.htm"
)
I then apply the following function to download these 10 links
parsed_files <- map(urls_to_parse, possibly(parse_filing, otherwise = NA))
Which stores it as a nice list, I can then apply names(parsed_files) <- urls_to_parse to name the lists as the links from where they were downloading them from. I can also use output <- plyr::ldply(parsed_files, data.frame) to store everything in a nice data frame.
Using the below data, how could I create batches to download the data in say batches of 10?
What I have currently:
start = 1
end = 100
output <- NULL
output_fin <- NULL
for(i in start:end){
output[[i]] <- map(urls_to_parse[[i]], possibly(parse_filing, otherwise = NA))
names(output) <- urls_to_parse[start:end]
save(output_fin, file = paste0("C:/Users/Downloads/data/",i, "output.RData"))
}
I am sure there is a better way using a function, since this code breaks for some of the results.
More data: - 100 links
urls_to_parse <- c("https://www.sec.gov/Archives/edgar/data/1750/000104746918004978/a2236183z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000104746917004528/a2232622z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000104746916014299/a2228768z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000104746915006136/a2225345z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000104746914006243/a2220733z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000104746913007797/a2216052z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000104746912007300/a2210166z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000104746911006302/a2204709z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000104746910006500/a2199382z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000104746909006783/a2193700z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000104746908008126/a2186742z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000110465907055173/a07-18543_110k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000110465906047248/a06-15961_110k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000110465905033688/a05-12324_110k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000104746904023905/a2140220z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000104746903028005/a2116671z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000091205702033450/a2087919z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/61478/000095012310108231/c61492e10vk.htm",
"https://www.sec.gov/Archives/edgar/data/61478/000095015208010514/n48172e10vk.htm",
"https://www.sec.gov/Archives/edgar/data/61478/000095013707018659/c22309e10vk.htm",
"https://www.sec.gov/Archives/edgar/data/61478/000095013707000193/c11187e10vk.htm",
"https://www.sec.gov/Archives/edgar/data/61478/000095013406000594/c01109e10vk.htm",
"https://www.sec.gov/Archives/edgar/data/61478/000120677405000032/d16006.htm",
"https://www.sec.gov/Archives/edgar/data/61478/000120677404000013/d13773.htm",
"https://www.sec.gov/Archives/edgar/data/61478/000104746903001075/a2097401z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/61478/000091205702001614/a2067550z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/319126/000115752308008030/a5800571.htm",
"https://www.sec.gov/Archives/edgar/data/319126/000115752307009801/a5515869.htm",
"https://www.sec.gov/Archives/edgar/data/319126/000115752306009238/a5227919.htm",
"https://www.sec.gov/Archives/edgar/data/730469/000073046908000102/alpharmainc_10k.htm",
"https://www.sec.gov/Archives/edgar/data/730469/000073046907000017/alo10k2006.htm",
"https://www.sec.gov/Archives/edgar/data/730469/000073046906000027/alo10k2005.htm",
"https://www.sec.gov/Archives/edgar/data/730469/000073046905000021/alo10k2004final.htm",
"https://www.sec.gov/Archives/edgar/data/730469/000073046904000058/alo10k2003master.htm",
"https://www.sec.gov/Archives/edgar/data/730469/000073046903000001/alo10k.htm",
"https://www.sec.gov/Archives/edgar/data/730469/000073046902000004/alo10k2001.htm",
"https://www.sec.gov/Archives/edgar/data/730469/000073046901500003/alo.htm",
"https://www.sec.gov/Archives/edgar/data/4515/000000620118000009/a10k123117.htm",
"https://www.sec.gov/Archives/edgar/data/4515/000119312517051216/d286458d10k.htm",
"https://www.sec.gov/Archives/edgar/data/4515/000119312516474605/d78287d10k.htm",
"https://www.sec.gov/Archives/edgar/data/4515/000119312515061145/d829913d10k.htm",
"https://www.sec.gov/Archives/edgar/data/4515/000000620114000004/aagaa10k-20131231.htm",
"https://www.sec.gov/Archives/edgar/data/6201/000000620113000023/amr-10kx20121231.htm",
"https://www.sec.gov/Archives/edgar/data/6201/000119312512063516/d259681d10k.htm",
"https://www.sec.gov/Archives/edgar/data/6201/000095012311014726/d78201e10vk.htm",
"https://www.sec.gov/Archives/edgar/data/6201/000000620110000006/ar123109.htm",
"https://www.sec.gov/Archives/edgar/data/6201/000000620109000009/ar120810k.htm",
"https://www.sec.gov/Archives/edgar/data/6201/000000451508000014/ar022010k.htm",
"https://www.sec.gov/Archives/edgar/data/6201/000095013407003888/d43815e10vk.htm",
"https://www.sec.gov/Archives/edgar/data/6201/000095013406003715/d33303e10vk.htm",
"https://www.sec.gov/Archives/edgar/data/6201/000095013405003726/d22731e10vk.htm",
"https://www.sec.gov/Archives/edgar/data/6201/000095013404002668/d12953e10vk.htm",
"https://www.sec.gov/Archives/edgar/data/6201/000104746903013301/a2108197z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/65695/000095013407003823/h42902e10vk.htm",
"https://www.sec.gov/Archives/edgar/data/65695/000095012906002343/h31028e10vk.htm",
"https://www.sec.gov/Archives/edgar/data/65695/000095012905002955/h22337e10vk.htm",
"https://www.sec.gov/Archives/edgar/data/3197/000156459018005085/cece-10k_20171231.htm",
"https://www.sec.gov/Archives/edgar/data/3197/000156459017004264/cece-10k_20161231.htm",
"https://www.sec.gov/Archives/edgar/data/3197/000156459016015157/cece-10k_20151231.htm",
"https://www.sec.gov/Archives/edgar/data/3197/000119312515095828/d864880d10k.htm",
"https://www.sec.gov/Archives/edgar/data/3197/000119312514098407/d661608d10k.htm",
"https://www.sec.gov/Archives/edgar/data/3197/000119312513109153/d444138d10k.htm",
"https://www.sec.gov/Archives/edgar/data/3197/000119312512119293/d293768d10k.htm",
"https://www.sec.gov/Archives/edgar/data/3197/000119312511067373/d10k.htm",
"https://www.sec.gov/Archives/edgar/data/3197/000119312510069639/d10k.htm",
"https://www.sec.gov/Archives/edgar/data/3197/000119312509055504/d10k.htm",
"https://www.sec.gov/Archives/edgar/data/3197/000119312508058939/d10k.htm",
"https://www.sec.gov/Archives/edgar/data/3197/000119312507071909/d10k.htm",
"https://www.sec.gov/Archives/edgar/data/3197/000119312506068031/d10k.htm",
"https://www.sec.gov/Archives/edgar/data/3197/000119312505077739/d10k.htm",
"https://www.sec.gov/Archives/edgar/data/3197/000119312504052176/d10k.htm",
"https://www.sec.gov/Archives/edgar/data/2601/000110465910047121/a10-16705_110k.htm",
"https://www.sec.gov/Archives/edgar/data/2601/000114420409046933/v159572_10k.htm",
"https://www.sec.gov/Archives/edgar/data/2601/000110465906060737/a06-19311_110k.htm",
"https://www.sec.gov/Archives/edgar/data/2601/000104746905022854/a2162888z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/2601/000104746904028585/a2143353z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/2601/000104746903031974/a2119476z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/859163/000143774918010388/avx20180331_10k.htm",
"https://www.sec.gov/Archives/edgar/data/859163/000085916317000028/avx-20170331x10k.htm",
"https://www.sec.gov/Archives/edgar/data/859163/000085916316000079/avx-20160331x10k.htm",
"https://www.sec.gov/Archives/edgar/data/859163/000085916315000024/avx-20150331x10k.htm",
"https://www.sec.gov/Archives/edgar/data/859163/000085916314000035/avx-20140331x10k.htm",
"https://www.sec.gov/Archives/edgar/data/859163/000085916313000022/avx-20130331x10k.htm",
"https://www.sec.gov/Archives/edgar/data/859163/000085916312000024/avxform10kfy12.htm",
"https://www.sec.gov/Archives/edgar/data/859163/000085916311000013/avxform10kfy11.htm",
"https://www.sec.gov/Archives/edgar/data/859163/000085916310000020/avxform10kfy10.htm",
"https://www.sec.gov/Archives/edgar/data/859163/000085916309000117/form10kfy09.htm",
"https://www.sec.gov/Archives/edgar/data/859163/000085916308000192/form10qq1fy09.htm",
"https://www.sec.gov/Archives/edgar/data/859163/000085916308000101/form10kfy08.htm",
"https://www.sec.gov/Archives/edgar/data/859163/000085916307000122/form10kfy07.htm",
"https://www.sec.gov/Archives/edgar/data/859163/000085916306000102/avxfy06form10-k.htm",
"https://www.sec.gov/Archives/edgar/data/859163/000085916305000094/fy0510k.htm",
"https://www.sec.gov/Archives/edgar/data/859163/000085916304000091/fy0410k.htm",
"https://www.sec.gov/Archives/edgar/data/859163/000085916303000020/fy0310k.htm",
"https://www.sec.gov/Archives/edgar/data/859163/000085916302000007/r10k-0302.htm",
"https://www.sec.gov/Archives/edgar/data/7286/000076462218000018/pnw2017123110-k.htm",
"https://www.sec.gov/Archives/edgar/data/7286/000076462217000010/pnw2016123110-k.htm",
"https://www.sec.gov/Archives/edgar/data/7286/000076462216000087/pnw2015123110-k.htm",
"https://www.sec.gov/Archives/edgar/data/7286/000076462215000013/pnw12311410-k.htm",
"https://www.sec.gov/Archives/edgar/data/7286/000110465914012068/a13-25897_110k.htm"
)
Looping over to do batch job as you showed is a bad idea. If you have a 1000s of files to be downloaded, how do you recover from errors?
The performance is not solely depend on your computer's configuration, but the network performance is crucial.
Here are couple of suggestions.
Option 1
partition all URLs in to batches to be able to download them parallelly. The number of files to be downloaded could be equal to number of cores in your computer. Look at this question; reading multiple files quickly in R
store these batches in a queue objects - For ex: using a package like https://cran.r-project.org/web/packages/dequer/dequer.pdf
pop the queue and use the batch of URLs in your parallel file download function.
Use a retryable file download function like in -- HTTP error 400 in R, error handling, How to retry instead of forcing to stop?
Once the queue is completed, move to the next partition.
wrap the whole operation in a retryable loop. For example; How to retry a statement on error?
Why do I use a queue? Because you could retry on error easily.
A pseudo code
file_url_partitions <- partion_as_batches(all_urls, batch_size)
attempts = 3
while( file_url_partitions is not empty && attempt <= 3 ) {
batch = file_url_partitions.pop()
tryCatch({
download_parallel(batch)
}, some_exception = function(se) {
file_url_partitions.push(batch)
attemp = attempt+1
})
}
Note: I don't have access to R studio/environment now hence no way to try.
Option 2
Download files separately using a download manager/similar and use downloaded files.
Some useful resources:
https://www.r-bloggers.com/r-with-parallel-computing-from-user-perspectives/
http://adv-r.had.co.nz/beyond-exception-handling.html

Loop to wait for result or timeout in r

I've written a very quick blast script in r to enable interfacing with the NCBI blast API. Sometimes however, the result url takes a while to load and my script throws an error until the url is ready. Is there an elegant way (i.e. a tryCatch option) to handle the error until the result is returned or timeout after a specified time?
library(rvest)
## Definitive set of blast API instructions can be found here: https://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/new/BLAST_URLAPI.html
## Generate query URL
query_url <-
function(QUERY,
PROGRAM = "blastp",
DATABASE = "nr",
...) {
put_url_stem <-
'https://www.ncbi.nlm.nih.gov/blast/Blast.cgi?CMD=Put'
arguments = list(...)
paste0(
put_url_stem,
"&QUERY=",
QUERY,
"&PROGRAM=",
PROGRAM,
"&DATABASE=",
DATABASE,
arguments
)
}
blast_url <- query_url(QUERY = "NP_001117.2") ## test query
blast_session <- html_session(blast_url) ## create session
blast_form <- html_form(blast_session)[[1]] ## pull form from session
RID <- blast_form$fields$RID$value ## extract RID identifier
get_url <- function(RID, ...) {
get_url_stem <-
"https://www.ncbi.nlm.nih.gov/blast/Blast.cgi?CMD=Get"
arguments = list(...)
paste0(get_url_stem, "&RID=", RID, "&FORMAT_TYPE=XML", arguments)
}
hits_xml <- read_xml(get_url(RID)) ## this is the sticky part
Sometimes it takes several minutes for the get_url to go live so what I would like is to do is to keep trying let's say every 20-30 seconds until it either produces the url or times out after a pre-specified time.
I think you may find this answer about the use of tryCatch useful
Regarding the 'keep trying until timeout' part. I imagine you can work on top of this other answer about a tryCatch loop on error
Hope it helps.

Kill a calculation programme after user defined time in R

Say my executable is c:\my irectory\myfile.exe and my R script calls on this executeable with system(myfile.exe)
The R script gives parameters to the executable programme which uses them to do numerical calculations. From the ouput of the executable, the R script then tests whether the parameters are good ore not. If they are not good, the parameters are changed and the executable rerun with updated parameters.
Now, as this executable carries out mathematical calculations and solutions may converge only slowly I wish to be able to kill the executable once it has takes to long to carry out the calculations (say 5 seconds)
How do I do this time dependant kill?
PS:
My question is a little related to this one: (time non dependant kill)
how to run an executable file and then later kill or terminate the same process with R in Windows
You can add code to your R function which issued the executable call:
setTimeLimit(elapse=5, trans=T)
This will kill the calling function, returning control to the parent environment (which could well be a function as well). Then use the examples in the question you linked to for further work.
Alternatively, set up a loop which examines Sys.time and if the expected update to the parameter set has not taken place after 5 seconds, break the loop and issue the system kill command to terminate myfile.exe .
There might possibly be nicer ways but it is a solution.
The assumption here is, that myfile.exe successfully does its calculation within 5 seconds
try.wtl <- function(timeout = 5)
{
y <- evalWithTimeout(system(myfile.exe), timeout = timeout, onTimeout= "warning")
if(inherits(y, "try-error")) NA else y
}
case 1 (myfile.exe is closed after successfull calculation)
g <- try.wtl(5)
case 2 (myfile.exe is not closed after successfull calculation)
g <- try.wtl(0.1)
MSDOS taskkill required for case 2 to recommence from the beginnging
if (class(g) == "NULL") {system('taskkill /im "myfile.exe" /f',show.output.on.console = FALSE)}
PS: inspiration came from Time out an R command via something like try()

Detect number of running r instances in windows within r

I have an r code I am creating that I would like to detect the number of running instances of R in windows so the script can choose whether or not to run a particular set of scripts (i.e., if there is already >2 instances of R running do X, else Y).
Is there a way to do this within R?
EDIT:
Here is some info on the purpose as requested:
I have a very long set of scripts for applying a bayesian network model using the catnet library for thousands of cases. This code processes and outputs results in a csv file for each case. Most of the parallel computing alternatives I have tried have not been ideal as they suppress a lot of the built-in notification of progress, hence I have been running a subset of the cases on different instances of R. I know this is somewhat antiquated, but it works for me, so I wanted a way to have the code subset the number of cases automatically based on the number of instances running.
I do this right now by hand by opening multiple instances of Rscript in CMD opening slightly differently configured r files to get something like this:
cd "Y:\code\BN_code"
START "" "C:\Program Files\R\R-3.0.0\bin\x64\Rscript.exe" "process spp data3.r" /b
START "" "C:\Program Files\R\R-3.0.0\bin\x64\Rscript.exe" "process spp data3_T1.r" /b
START "" "C:\Program Files\R\R-3.0.0\bin\x64\Rscript.exe" "process spp data3_T2.r" /b
START "" "C:\Program Files\R\R-3.0.0\bin\x64\Rscript.exe" "process spp data3_T3.r" /b
EDIT2:
Thanks to the answers below, here is my implementation of what I call 'poorman's parallel computing in R:
So if you have any long script that has to be applied to a long list of cases use the code below to break the long list into a number of sublists to be fed to each instance of rscript:
#the cases that I need to apply my code to:
splist=c("sp01", "sp02", "sp03", "sp04", "sp05", "sp06", "sp07", "sp08", "sp09", "sp010", "sp11", "sp12",
"sp013", "sp014", "sp015", "sp16", "sp17", "sp018", "sp19", "sp20", "sp21", "sp22", "sp23", "sp24")
###automatic subsetting of cases based on number of running instances of r script:
cpucores=as.integer(Sys.getenv('NUMBER_OF_PROCESSORS'))
n_instances=length(system('tasklist /FI "IMAGENAME eq Rscript.exe" ', intern = TRUE))-3
jnk=length(system('tasklist /FI "IMAGENAME eq rstudio.exe" ', intern = TRUE))-3
if (jnk>0)rstudiorun=TRUE else rstudiorun=FALSE
if (!rstudiorun & n_instances>0 & cpucores>1){ #if code is being run from rscript and
#not from rstudio and there is more than one core available
jnkn=length(splist)
jnk=seq(1,jnkn,round(jnkn/cpucores,0))
jnk=c(jnk,jnkn)
splist=splist[jnk[n_instances]:jnk[n_instances+1]]
}
###end automatic subsetting of cases
#perform your script on subset of list of cases:
for(sp in splist){
ptm0 <- proc.time()
Sys.sleep(6)
ptm1=proc.time() - ptm0
jnk=as.numeric(ptm1[3])
cat('\n','It took ', jnk, "seconds to do species", sp)
}
To make this code run on multiple instances of r automatically in windows, just create a .bat file:
cd "C:\Users\lfortini\code\misc code\misc r code"
START "" "C:\Program Files\R\R-3.0.0\bin\x64\Rscript.exe" "rscript_multiple_instances.r" /b
timeout 10
START "" "C:\Program Files\R\R-3.0.0\bin\x64\Rscript.exe" "rscript_multiple_instances.r" /b
timeout 10
START "" "C:\Program Files\R\R-3.0.0\bin\x64\Rscript.exe" "rscript_multiple_instances.r" /b
timeout 10
START "" "C:\Program Files\R\R-3.0.0\bin\x64\Rscript.exe" "rscript_multiple_instances.r" /b
exit
The timeout is there to give enough time for r to detect its own number of instances.
Clicking on this .bat file will automatically open up numerous instances of r script, with each one taking on a particular subset of the cases you want to analyse, while still providing all of the progress of the running of the script in each window, like the image above. The cool thing about this approach is that you pretty much just have to slap on the automated list subsetting code before whichever iteration mechanism you are using in your code (loops, apply fx, etc). Then just fire the code using rcript using the .bat or manually and you are set.
Actually it is easier than expected, as Windows comes with the nice function tasklist found here.
With it you can get all running processes from which you simply need to count the number of Rscript.exe instances (I use stringr here for string manipulations).
require(stringr)
progs <- system("tasklist", intern = TRUE)
progs <- vapply(str_split(progs, "[[:space:]]"), "[[", "", i = 1)
sum(progs == "Rscript.exe")
That should do the trick. (I only tried it with counting instances of Rgui.exe but that works fine.)
You can do even shorter as below
length(grep("rstudio\\.exe", system("tasklist", intern = TRUE)))
Replace rstudio with any other Rscript or any other process name
Or even shorter
length(system('tasklist /FI "IMAGENAME eq Rscript.exe" ', intern = TRUE))-3

Resources