Using Sys.sleep() to delay API call - r

I'm using R to make an API call to a weather data provider to download some weather forecasts. I'm using a free key that allows me to make no more than 10 calls per minute. I've tried using Sys.sleep() to ensure I don't go over the threshold but the API resource monitor tells me that I've exceeded the number of calls.
For example, if I'm making 6 calls, a time interval of 10 seconds between the calls ought to be sufficient (not taking into account the time R would need).
dat <- list()
for (i in 1:6) {
  dat[[i]] <- getWeatherData(web_url, api_key, history_date, data_format)
  Sys.sleep(10)
  web_url <- gsub(i - 1, i, url)
}
The getWeatherData function does the following (a rough sketch is given after the input list below):
makes the API call (only one API call is made each time the function is invoked. Uses httr::GET() to get the data),
parses the XML output to get the desired variables (regular expressions),
performs some clean-up (for missing/garbage values),
converts strings to R date-time objects (POSIXct), and
rounds values to the nearest hour (lubridate::round_date()).
Function inputs:
web_url is a custom url,
api_key is my personal key,
history_date is a string (formatted as "%d/%m/%Y %H:%M:%S"), and
data_format specifies if I want an .XML or .json file as output.
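For concreteness, here is a rough, hypothetical sketch of what such a function might look like; the query parameters, XML tags, and regular expression are placeholders, not my actual code.
# hypothetical sketch of getWeatherData(); field names and patterns are made up
getWeatherData <- function(web_url, api_key, history_date, data_format) {
  # 1. make the API call (one call per invocation)
  resp <- httr::GET(web_url, query = list(key = api_key,
                                          date = history_date,
                                          format = data_format))
  txt <- httr::content(resp, as = "text", encoding = "UTF-8")
  # 2. parse the XML output with regular expressions (placeholder pattern)
  temps <- regmatches(txt, gregexpr("<temp>[^<]+</temp>", txt))[[1]]
  temps <- as.numeric(gsub("</?temp>", "", temps))
  # 3. clean up missing/garbage values
  temps[temps < -90] <- NA
  # 4. convert the timestamp string to POSIXct and 5. round to the nearest hour
  obs_time <- as.POSIXct(history_date, format = "%d/%m/%Y %H:%M:%S")
  obs_time <- lubridate::round_date(obs_time, unit = "hour")
  data.frame(time = obs_time, temp = temps)
}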
I cannot share the url/key for obvious reasons. As soon as I run this, I get a notification from the data provider that I've exceeded the allowable calls per minute (10). I don't get a notification every time - not sure why that is either.
Any help is appreciated!

This solution should be helpful for you if Sys.sleep doesn't do the trick.
Basically, this replaces the use of Sys.sleep with while logic.
dat <- list()
delay_seconds <- 10
for (i in 1:6) {
  dat[[i]] <- getWeatherData(web_url, api_key, history_date, data_format)
  date_time <- Sys.time()
  while ((as.numeric(Sys.time()) - as.numeric(date_time)) < delay_seconds) {}
  web_url <- gsub(i - 1, i, url)
}
Here, we are:
defining a number of seconds to wait (delay_seconds <- 10),
defining a start time for comparison (date_time <- Sys.time()),
using a while loop that compares the present time to that start time and checks whether the difference is still less than our chosen delay interval ((as.numeric(Sys.time()) - as.numeric(date_time)) < delay_seconds), and
doing nothing until the wait time is over ({}).
Not sure whether you need this, but if you're hoping to get your data out of the list and into a single combined data frame, I recommend the dplyr function bind_rows().
dat2 <- bind_rows(dat)
Thanks to an answer by rbtj to this question: How to make execution pause, sleep, wait for X seconds in R?

Related

How to rate-limit parallel API requests in R/future/furrr

I have to retrieve a large dataset from a web API (NCBI entrez) that limits me to a certain number of requests per second, say 10 (the example code will limit you to three without an API key). I'm using furrr's future_* functions to parallelize the requests to get them as quickly as possible, like this:
library(tidyverse)
library(rentrez)
library(furrr)
plan(multiprocess)
api_key <- "<api key>"
# this will return a crap-ton of results
srch <- entrez_search("nuccore", "Homo sapiens", use_history=T, api_key=api_key)
total <- srch$count
per_request <- 500 # get 500 records per parallel request
nrequest <- total %/% per_request + as.logical(total %% per_request)
result <- future_map(seq(nrequest), function(x) {
  rstart <- (x - 1) * per_request
  return(entrez_fetch(
    "nuccore",
    web_history = srch$web_history,
    rettype = "fasta",
    retmode = "xml",
    retstart = rstart,
    retmax = per_request,
    api_key = api_key
  ))
})
Obviously for cases where nrequest > 10 (or whatever the limit is), we will immediately run afoul of the rate limiting.
I see two seemingly obvious simple solutions to this, both of which seem to work.
One is to introduce a random short delay before making the request, like so:
future_map(seq(nrequest), function(x) {
  Sys.sleep(runif(1, 0, 5))
  # ...do the request...
})
The second is to limit the number of concurrent requests to the rate limit, either by plan(multiprocess,workers=<max_concurrent_requests>) or by using the semaphore package with a semaphore set to the rate limit, like this:
# this sort of assumes individual requests take long enough that
# the wait for the semaphore is long enough;
# for this case, they do
rate_limit <- 10
lock <- semaphore(rate_limit)
result <- future_map(seq(nrequest), function(x) {
  rstart <- (x - 1) * per_request
  acquire(lock)
  s <- entrez_fetch(
    "nuccore",
    web_history = srch$web_history,
    rettype = "fasta",
    retmode = "xml",
    retstart = rstart,
    retmax = per_request,
    api_key = api_key
  )
  release(lock)
  return(s)
})
However, what I would really like to be able to do is limit the request rate rather than the number of concurrent requests. There's a great post by Quentin Pradet on how to do this using async io http requests in python. I attempted to adapt this to R, but ran into the problem that any variable shared across threads/processes in the future_* function is copied rather than actually shared, so modifications (even if protected by a semaphore lock) are not visible to the other threads/processes. That makes it impossible to implement the counter bucket this method relies on!
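To make the copying behaviour concrete, here is a toy demo (not part of my actual code; plan(multisession) stands in for whatever backend is used): a counter incremented inside the workers never changes in the calling session.
library(furrr)
plan(multisession)
counter <- 0
res <- future_map(1:4, function(x) {
  counter <<- counter + 1  # increments the worker's private copy only
  counter
})
counter  # still 0 in the calling session: globals are copied to workers, not shared back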
Is there a clever way to rate-limit parallel requests without necessarily capping the number of simultaneous requests? Or am I overthinking this and should just stick to limiting the number?

R - plumber, how to add a stop/quit condition to the run?

I have a small API that I need to stop running if a certain time of day is reached, say midnight for example, and then continue on to the rest of the program, independently of any incoming request. One could include a function that breaks the process by checking Sys.time(), but that would only be executed if a request comes through.
My guess is that I have to modify some attribute of the $run, but can't find much about it on the internet.
I know that one could get the pid and kill it with a system command, but I don't know if that solution makes sure that the rest of the program runs.
Does anyone have an idea?
Thanks in advance.
The file containing the function looks like this:
(my_file.R)
#* @param x My argument
#* @get /lag_lead
function(x){
  x <- as.numeric(x)
  c(x - 1, x + 1)
}
and the running script :
library(plumber)
mon_api <- plumb('my_file.R')
mon_api$run(port = 8000)
print('hello')
One way to do so would be to set a timeout with withTimeout, which is a wrapper for setTimeLimit:
library(R.utils)
withTimeout(mon_api$run(port = 8000), timeout = 30)
# timeout is in seconds
# calculate timeout, using difftime between now and midnight
tmo <- as.numeric(difftime(as.POSIXct("2018-12-28 00:00:00"), Sys.time(), units = "secs"))
withTimeout(mon_api$run(port = 8000), timeout = tmo)
Some other timeout solutions, one using parallel, can be found here:
Time out an R command via something like try()
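If you prefer the parallel route, a rough sketch (untested, and Unix-only since mcparallel() relies on forking) would be to run the blocking server in a child process, wait up to the timeout, then kill the child so the rest of the script continues:
library(parallel)
job <- mcparallel(mon_api$run(port = 8000))  # the server blocks inside the forked child
res <- mccollect(job, wait = FALSE, timeout = tmo)  # NULL if the server is still running after tmo seconds
tools::pskill(job$pid)  # stop the child process
print('hello')  # the rest of the program continues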

How do I cache vectorized calls that take user input in R?

I am trying to calculate a field for all rows of a large dataset. The function to calculate it is from the package taxize, and uses an HTTP request to query an external site for the right ID number. It is searching by scientific name, and often there are multiple results, in which case this function asks for user input. I would like the function to cache my selection and return that ID number every time the same call is made from then on. I have tried with my own caching function and with memoizedCall() from the package R.cache but every time it hits the second entry of the same scientific name it still prompts me for user input. I feel like I am misunderstanding something basic about how vectorization works. Sorry for my ignorance but any advice is appreciated.
Here is the code I used as a custom caching function.
check_tsn <- function(data,tsn_list){
print(data)
print(tsn_list)
if (is.null(tsn_list$data)){
tsn_list$data = taxize::get_tsn(data)
print('added to tsn_list')
}
return(tsn_list$data)
}
tsn_list <- vector(mode = "list", nrow(wanglang))
Genus.Species <- c('Tamiops swinhoei','Bos taurus','Tamiops swinhoei')
IUCN.ID <- c('21382','','21382')
species <- data.frame(Genus.Species,IUCN.ID)
species$TSN.ID = check_tsn(species$Genus.Species,tsn_list)
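A rough, untested sketch of the behaviour being described, using a cache keyed on the scientific name and looked up one element at a time (rather than passing the whole vector to the function at once):
tsn_cache <- list()
get_tsn_cached <- function(name) {
  if (is.null(tsn_cache[[name]])) {
    tsn_cache[[name]] <<- taxize::get_tsn(name)  # prompt at most once per distinct name
  }
  tsn_cache[[name]]
}
# sapply() walks the names one at a time, so the second 'Tamiops swinhoei'
# is served from the cache instead of prompting again
species$TSN.ID <- sapply(as.character(species$Genus.Species), get_tsn_cached)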

Loop to wait for result or timeout in r

I've written a very quick blast script in r to enable interfacing with the NCBI blast API. Sometimes however, the result url takes a while to load and my script throws an error until the url is ready. Is there an elegant way (i.e. a tryCatch option) to handle the error until the result is returned or timeout after a specified time?
library(rvest)
library(xml2) # for read_xml()
## Definitive set of blast API instructions can be found here: https://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/new/BLAST_URLAPI.html
## Generate query URL
query_url <- function(QUERY,
                      PROGRAM = "blastp",
                      DATABASE = "nr",
                      ...) {
  put_url_stem <- 'https://www.ncbi.nlm.nih.gov/blast/Blast.cgi?CMD=Put'
  arguments = list(...)
  paste0(
    put_url_stem,
    "&QUERY=", QUERY,
    "&PROGRAM=", PROGRAM,
    "&DATABASE=", DATABASE,
    arguments
  )
}
blast_url <- query_url(QUERY = "NP_001117.2") ## test query
blast_session <- html_session(blast_url) ## create session
blast_form <- html_form(blast_session)[[1]] ## pull form from session
RID <- blast_form$fields$RID$value ## extract RID identifier
get_url <- function(RID, ...) {
  get_url_stem <- "https://www.ncbi.nlm.nih.gov/blast/Blast.cgi?CMD=Get"
  arguments = list(...)
  paste0(get_url_stem, "&RID=", RID, "&FORMAT_TYPE=XML", arguments)
}
hits_xml <- read_xml(get_url(RID)) ## this is the sticky part
Sometimes it takes several minutes for the get_url to go live, so what I would like to do is keep trying, say every 20-30 seconds, until it either produces the url or times out after a pre-specified time.
I think you may find this answer about the use of tryCatch useful.
Regarding the 'keep trying until timeout' part, I imagine you can build on this other answer about a tryCatch loop on error, along the lines of the sketch below.
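A rough, untested sketch of that idea (the 30-second interval and 10-minute cap are arbitrary choices):
# keep retrying read_xml() until it succeeds or the total wait exceeds max_wait
fetch_with_retry <- function(url, interval = 30, max_wait = 600) {
  start <- Sys.time()
  repeat {
    result <- tryCatch(xml2::read_xml(url), error = function(e) NULL)
    if (!is.null(result)) return(result)
    if (difftime(Sys.time(), start, units = "secs") > max_wait) {
      stop("Timed out waiting for the BLAST results")
    }
    Sys.sleep(interval)  # wait before the next attempt
  }
}
hits_xml <- fetch_with_retry(get_url(RID))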
Hope it helps.

Implementation of simple polling of results file

For one of my dissertation's data collection modules, I have implemented a simple polling mechanism. This is needed because I make each data collection request (one of many) as an SQL query, submitted via a Web form, which is simulated by RCurl code. The server processes each request and generates a text file with results at a specific URL (RESULTS_URL in the code below). Regardless of the request, the URL and file name are the same (I cannot change that). Since processing time differs between data requests and some requests may take a significant amount of time, my R code needs to "know" when the results are ready (the file is re-generated), so that it can retrieve them. The following is my solution for this problem.
POLL_TIME <- 5 # polling interval in seconds
In function srdaRequestData(), before making data request:
# check and save 'last modified' date and time of the results file
# before submitting data request, to compare with the same after one
# for simple polling of results file in srdaGetData() function
beforeDate <- url.exists(RESULTS_URL, .header=TRUE)["Last-Modified"]
beforeDate <<- strptime(beforeDate, "%a, %d %b %Y %X", tz="GMT")
<making data request is here>
In function srdaGetData(), called after srdaRequestData()
# simple polling of the results file
repeat {
  if (DEBUG) message("Waiting for results ...", appendLF = FALSE)
  afterDate <- url.exists(RESULTS_URL, .header=TRUE)["Last-Modified"]
  afterDate <- strptime(afterDate, "%a, %d %b %Y %X", tz="GMT")
  delta <- difftime(afterDate, beforeDate, units = "secs")
  if (as.numeric(delta) != 0) { # file modified, results are ready
    if (DEBUG) message(" Ready!")
    break
  }
  else { # no results yet, wait the timeout and check again
    if (DEBUG) message(".", appendLF = FALSE)
    Sys.sleep(POLL_TIME)
  }
}
<retrieving request's results is here>
The module's main flow/sequence of events is linear, as follows:
Read/update configuration file
Authenticate with the system
Loop through data requests, specified in configuration file (via lapply()),
where for each request perform the following:
{
...
Make request: srdaRequestData()
...
Retrieve results: srdaGetData()
...
}
The issue with the code above is that it doesn't seem to work as expected: upon making a data request, the code should print "Waiting for results ..." and then, while periodically checking whether the results file has been modified (re-generated), print progress dots until the results are ready, at which point it prints the confirmation. However, the actual behavior is that the code waits a long time (I intentionally made one request long-running) without printing anything, but then apparently retrieves the results and prints both "Waiting for results ..." and " Ready!" at the same time.
It seems to me that it's some kind of synchronization issue, but I can't figure out what exactly. Or, maybe it's something else and I'm somehow missing it. Your advice and help will be much appreciated!
In a comment to the question, I believe MrFlick solved the issue: the polling logic appears to be functional, but the problem is that the progress messages are out of synch with current events on the system.
By default, the R console output is buffered. This is by design: to speed things up and avoid the distracting flicker that may be associated with frequent messages, etc. We tend to forget this fact, particularly after we've been using R in a very interactive fashion, running various ad-hoc statements at the console (the console buffer is automatically flushed just before returning the > prompt).
It is however possible to get message() and, more generally, console output in "real time", either by explicitly flushing the console after each critical output statement with the flush.console() function, or by disabling buffering at the level of the R GUI (right-click on the console and see the Buffered output Ctrl+W item; this is also available in the Misc menu).
Here's a toy example of the explicit use of flush.console(). Note the use of cat() rather than message(), as the former doesn't automatically add a CR/LF to the output; message() is still useful, however, because its output can be suppressed with suppressMessages() and the like. Also, as shown in the comment, you can cat the "\b" (backspace) character to make the numbers overwrite one another.
CountDown <- function() {
  for (i in 9:1) {
    cat(i)
    # alternatively to cat(i) use: message(i)
    flush.console() # <<<<<<< immediate output to console
    Sys.sleep(1)
    cat(" ") # also try cat("\b") instead ;-)
  }
  cat("... Blast-off\n")
}
The output is the following. What is of course not evident in this print-out is that it took roughly 10 seconds overall, with one number printed every second, before the final "Blast-off"; remove the flush.console() statement and the output will come all at once, after 10 seconds, i.e. when the function terminates (unless the console is not buffered at the level of the GUI).
CountDown()
9 8 7 6 5 4 3 2 1 ... Blast-off
