How to rate-limit parallel API requests in R/future/furrr - r

I have to retrieve a large dataset from a web API (NCBI entrez) that limits me to a certain number of requests per second, say 10 (the example code will limit you to three without an API key). I'm using furrr's future_* functions to parallelize the requests to get them as quickly as possible, like this:
library(tidyverse)
library(rentrez)
library(furrr)
plan(multiprocess)
api_key <- "<api key>"
# this will return a crap-ton of results
srch <- entrez_search("nuccore", "Homo sapiens", use_history=T, api_key=api_key)
total <- srch$count
per_request <- 500 # get 500 records per parallel request
nrequest <- total %/% per_request + as.logical(total %% per_request)
result <- future_map(seq(nrequest),function(x) {
rstart <- (x - 1) * per_request
return(entrez_fetch(
"nuccore",
web_history = srch$web_history,
rettype="fasta",
retmode="xml",
retstart=rstart,
retmax=per_request,
api_key=api_key
))
}
Obviously for cases where nrequest > 10 (or whatever the limit is), we will immediately run afoul of the rate limiting.
I see two seemingly obvious simple solutions to this, both of which seem to work.
One is to introduce a random short delay before making the request, like so:
future_map(seq(nrequest),function(x) {
Sys.sleep(runif(1,0,5))
# ...do the request...
}
The second is to limit the number of concurrent requests to the rate limit, either by plan(multiprocess,workers=<max_concurrent_requests>) or by using the semaphore package with a semaphore set to the rate limit, like this:
# this sort of assumes individual requests take long enough to cause
# a wait for the semaphore to be long enough
# for this case, they do
rate_limit <- 10
lock = semaphore(rate_limit)
result <- future_map(seq(nrequest),function(x) {
rstart <- (x - 1) * per_request
acquire(lock)
s <- entrez_fetch(
"nuccore",
web_history = srch$web_history,
rettype="fasta",
retmode="xml",
retstart=rstart,
retmax=per_request,
api_key=api_key
)
release(lock)
return(s)
}
However, what I would really like to be able to do is limit the request rate rather than the number of concurrent requests. There's a great post by Quentin Pradet on how to do this using async io http requests in python. I made an attempt to adapt this to R, but ran into the problem that any variable shared across threads/processes in the future_* function is copied rather than actually shared, and thus modifications (even if protected by semaphore lock) are not shared among threads/processes, so it's not possible to implement the counter bucket we rely on for this method to work!
Is there a clever way to rate-limit parallel requests without necessarily capping the number of simultaneous requests? Or am I overthinking this and should just stick to limiting the number?

Related

Scraping string from a large number of URLs with Julia

Happy New Year!
I have just started to learn Julia and my first mini challenge I have set myself is to scrape data from a large list of URLs.
I have ca 50k URLs (which I successfully parsed from a JSON with Julia using Regex) in a CSV file. I want to scrape each one and return a matched string ("/page/12345/view" - where 12345 is any integer).
I managed to do so using HTTP and Queryverse (although had started with CSV and CSVFiles but looking at packages for learning purposes) but the script seems to stop after just under 2k. I can't see an error such as a timeout.
May I ask if anyone can advise what I'm doing wrong or how I can approach it differently? Explanations/links to learning resources would also be great!
using HTTP, Queryverse
URLs = load("urls.csv") |> DataFrame
patternid = r"\/page\/[0-9]+\/view"
touch("ids.txt")
f = open("ids.txt", "a")
for row in eachrow(URLs)
urlResponse = HTTP.get(row[:url])
if Int(urlResponse.status) == 404
continue
end
urlHTML = String(urlResponse.body)
urlIDmatch = match(patternid, urlHTML)
write(f, urlIDmatch.match, "\n")
end
close(f)
There can be always a server that detects your scraper and intentionally takes a very long time to respond.
Basically, since scraping is an IO intensive operations you should do it using a big number of asynchronous tasks. Moreover this should be combined with the readtimeout parameter of the get function. Hence your code will look more or less like this:
asyncmap(1:nrow(URLs);ntasks=50) do n
row = URLs[n, :]
urlResponse = HTTP.get(row[:url], readtimeout=10)
# the rest of your code comes here
end
Even one some servers are delaying transmission, always many connections will be working.

downloading data and saving data to a folder in batches

I have 200,000 links that I am trying to download, I have tried downloading it all in one go but I ran into memory issues.
I am trying to create a function which will download 1000 links at a time and save them in a folder.
Packages:
library(dplyr)
library(purrr)
library(edgarWebR)
A small sample of the data is as follows:
Data 1:
urls_to_parse <- c("https://www.sec.gov/Archives/edgar/data/1750/000104746918004978/a2236183z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000104746917004528/a2232622z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000104746916014299/a2228768z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000104746915006136/a2225345z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000104746914006243/a2220733z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000104746913007797/a2216052z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000104746912007300/a2210166z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000104746911006302/a2204709z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000104746910006500/a2199382z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000104746909006783/a2193700z10-k.htm"
)
I then apply the following function to download these 10 links
parsed_files <- map(urls_to_parse, possibly(parse_filing, otherwise = NA))
Which stores it as a nice list, I can then apply names(parsed_files) <- urls_to_parse to name the lists as the links from where they were downloading them from. I can also use output <- plyr::ldply(parsed_files, data.frame) to store everything in a nice data frame.
Using the below data, how could I create batches to download the data in say batches of 10?
What I have currently:
start = 1
end = 100
output <- NULL
output_fin <- NULL
for(i in start:end){
output[[i]] <- map(urls_to_parse[[i]], possibly(parse_filing, otherwise = NA))
names(output) <- urls_to_parse[start:end]
save(output_fin, file = paste0("C:/Users/Downloads/data/",i, "output.RData"))
}
I am sure there is a better way using a function, since this code breaks for some of the results.
More data: - 100 links
urls_to_parse <- c("https://www.sec.gov/Archives/edgar/data/1750/000104746918004978/a2236183z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000104746917004528/a2232622z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000104746916014299/a2228768z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000104746915006136/a2225345z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000104746914006243/a2220733z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000104746913007797/a2216052z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000104746912007300/a2210166z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000104746911006302/a2204709z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000104746910006500/a2199382z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000104746909006783/a2193700z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000104746908008126/a2186742z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000110465907055173/a07-18543_110k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000110465906047248/a06-15961_110k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000110465905033688/a05-12324_110k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000104746904023905/a2140220z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000104746903028005/a2116671z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000091205702033450/a2087919z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/61478/000095012310108231/c61492e10vk.htm",
"https://www.sec.gov/Archives/edgar/data/61478/000095015208010514/n48172e10vk.htm",
"https://www.sec.gov/Archives/edgar/data/61478/000095013707018659/c22309e10vk.htm",
"https://www.sec.gov/Archives/edgar/data/61478/000095013707000193/c11187e10vk.htm",
"https://www.sec.gov/Archives/edgar/data/61478/000095013406000594/c01109e10vk.htm",
"https://www.sec.gov/Archives/edgar/data/61478/000120677405000032/d16006.htm",
"https://www.sec.gov/Archives/edgar/data/61478/000120677404000013/d13773.htm",
"https://www.sec.gov/Archives/edgar/data/61478/000104746903001075/a2097401z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/61478/000091205702001614/a2067550z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/319126/000115752308008030/a5800571.htm",
"https://www.sec.gov/Archives/edgar/data/319126/000115752307009801/a5515869.htm",
"https://www.sec.gov/Archives/edgar/data/319126/000115752306009238/a5227919.htm",
"https://www.sec.gov/Archives/edgar/data/730469/000073046908000102/alpharmainc_10k.htm",
"https://www.sec.gov/Archives/edgar/data/730469/000073046907000017/alo10k2006.htm",
"https://www.sec.gov/Archives/edgar/data/730469/000073046906000027/alo10k2005.htm",
"https://www.sec.gov/Archives/edgar/data/730469/000073046905000021/alo10k2004final.htm",
"https://www.sec.gov/Archives/edgar/data/730469/000073046904000058/alo10k2003master.htm",
"https://www.sec.gov/Archives/edgar/data/730469/000073046903000001/alo10k.htm",
"https://www.sec.gov/Archives/edgar/data/730469/000073046902000004/alo10k2001.htm",
"https://www.sec.gov/Archives/edgar/data/730469/000073046901500003/alo.htm",
"https://www.sec.gov/Archives/edgar/data/4515/000000620118000009/a10k123117.htm",
"https://www.sec.gov/Archives/edgar/data/4515/000119312517051216/d286458d10k.htm",
"https://www.sec.gov/Archives/edgar/data/4515/000119312516474605/d78287d10k.htm",
"https://www.sec.gov/Archives/edgar/data/4515/000119312515061145/d829913d10k.htm",
"https://www.sec.gov/Archives/edgar/data/4515/000000620114000004/aagaa10k-20131231.htm",
"https://www.sec.gov/Archives/edgar/data/6201/000000620113000023/amr-10kx20121231.htm",
"https://www.sec.gov/Archives/edgar/data/6201/000119312512063516/d259681d10k.htm",
"https://www.sec.gov/Archives/edgar/data/6201/000095012311014726/d78201e10vk.htm",
"https://www.sec.gov/Archives/edgar/data/6201/000000620110000006/ar123109.htm",
"https://www.sec.gov/Archives/edgar/data/6201/000000620109000009/ar120810k.htm",
"https://www.sec.gov/Archives/edgar/data/6201/000000451508000014/ar022010k.htm",
"https://www.sec.gov/Archives/edgar/data/6201/000095013407003888/d43815e10vk.htm",
"https://www.sec.gov/Archives/edgar/data/6201/000095013406003715/d33303e10vk.htm",
"https://www.sec.gov/Archives/edgar/data/6201/000095013405003726/d22731e10vk.htm",
"https://www.sec.gov/Archives/edgar/data/6201/000095013404002668/d12953e10vk.htm",
"https://www.sec.gov/Archives/edgar/data/6201/000104746903013301/a2108197z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/65695/000095013407003823/h42902e10vk.htm",
"https://www.sec.gov/Archives/edgar/data/65695/000095012906002343/h31028e10vk.htm",
"https://www.sec.gov/Archives/edgar/data/65695/000095012905002955/h22337e10vk.htm",
"https://www.sec.gov/Archives/edgar/data/3197/000156459018005085/cece-10k_20171231.htm",
"https://www.sec.gov/Archives/edgar/data/3197/000156459017004264/cece-10k_20161231.htm",
"https://www.sec.gov/Archives/edgar/data/3197/000156459016015157/cece-10k_20151231.htm",
"https://www.sec.gov/Archives/edgar/data/3197/000119312515095828/d864880d10k.htm",
"https://www.sec.gov/Archives/edgar/data/3197/000119312514098407/d661608d10k.htm",
"https://www.sec.gov/Archives/edgar/data/3197/000119312513109153/d444138d10k.htm",
"https://www.sec.gov/Archives/edgar/data/3197/000119312512119293/d293768d10k.htm",
"https://www.sec.gov/Archives/edgar/data/3197/000119312511067373/d10k.htm",
"https://www.sec.gov/Archives/edgar/data/3197/000119312510069639/d10k.htm",
"https://www.sec.gov/Archives/edgar/data/3197/000119312509055504/d10k.htm",
"https://www.sec.gov/Archives/edgar/data/3197/000119312508058939/d10k.htm",
"https://www.sec.gov/Archives/edgar/data/3197/000119312507071909/d10k.htm",
"https://www.sec.gov/Archives/edgar/data/3197/000119312506068031/d10k.htm",
"https://www.sec.gov/Archives/edgar/data/3197/000119312505077739/d10k.htm",
"https://www.sec.gov/Archives/edgar/data/3197/000119312504052176/d10k.htm",
"https://www.sec.gov/Archives/edgar/data/2601/000110465910047121/a10-16705_110k.htm",
"https://www.sec.gov/Archives/edgar/data/2601/000114420409046933/v159572_10k.htm",
"https://www.sec.gov/Archives/edgar/data/2601/000110465906060737/a06-19311_110k.htm",
"https://www.sec.gov/Archives/edgar/data/2601/000104746905022854/a2162888z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/2601/000104746904028585/a2143353z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/2601/000104746903031974/a2119476z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/859163/000143774918010388/avx20180331_10k.htm",
"https://www.sec.gov/Archives/edgar/data/859163/000085916317000028/avx-20170331x10k.htm",
"https://www.sec.gov/Archives/edgar/data/859163/000085916316000079/avx-20160331x10k.htm",
"https://www.sec.gov/Archives/edgar/data/859163/000085916315000024/avx-20150331x10k.htm",
"https://www.sec.gov/Archives/edgar/data/859163/000085916314000035/avx-20140331x10k.htm",
"https://www.sec.gov/Archives/edgar/data/859163/000085916313000022/avx-20130331x10k.htm",
"https://www.sec.gov/Archives/edgar/data/859163/000085916312000024/avxform10kfy12.htm",
"https://www.sec.gov/Archives/edgar/data/859163/000085916311000013/avxform10kfy11.htm",
"https://www.sec.gov/Archives/edgar/data/859163/000085916310000020/avxform10kfy10.htm",
"https://www.sec.gov/Archives/edgar/data/859163/000085916309000117/form10kfy09.htm",
"https://www.sec.gov/Archives/edgar/data/859163/000085916308000192/form10qq1fy09.htm",
"https://www.sec.gov/Archives/edgar/data/859163/000085916308000101/form10kfy08.htm",
"https://www.sec.gov/Archives/edgar/data/859163/000085916307000122/form10kfy07.htm",
"https://www.sec.gov/Archives/edgar/data/859163/000085916306000102/avxfy06form10-k.htm",
"https://www.sec.gov/Archives/edgar/data/859163/000085916305000094/fy0510k.htm",
"https://www.sec.gov/Archives/edgar/data/859163/000085916304000091/fy0410k.htm",
"https://www.sec.gov/Archives/edgar/data/859163/000085916303000020/fy0310k.htm",
"https://www.sec.gov/Archives/edgar/data/859163/000085916302000007/r10k-0302.htm",
"https://www.sec.gov/Archives/edgar/data/7286/000076462218000018/pnw2017123110-k.htm",
"https://www.sec.gov/Archives/edgar/data/7286/000076462217000010/pnw2016123110-k.htm",
"https://www.sec.gov/Archives/edgar/data/7286/000076462216000087/pnw2015123110-k.htm",
"https://www.sec.gov/Archives/edgar/data/7286/000076462215000013/pnw12311410-k.htm",
"https://www.sec.gov/Archives/edgar/data/7286/000110465914012068/a13-25897_110k.htm"
)
Looping over to do batch job as you showed is a bad idea. If you have a 1000s of files to be downloaded, how do you recover from errors?
The performance is not solely depend on your computer's configuration, but the network performance is crucial.
Here are couple of suggestions.
Option 1
partition all URLs in to batches to be able to download them parallelly. The number of files to be downloaded could be equal to number of cores in your computer. Look at this question; reading multiple files quickly in R
store these batches in a queue objects - For ex: using a package like https://cran.r-project.org/web/packages/dequer/dequer.pdf
pop the queue and use the batch of URLs in your parallel file download function.
Use a retryable file download function like in -- HTTP error 400 in R, error handling, How to retry instead of forcing to stop?
Once the queue is completed, move to the next partition.
wrap the whole operation in a retryable loop. For example; How to retry a statement on error?
Why do I use a queue? Because you could retry on error easily.
A pseudo code
file_url_partitions <- partion_as_batches(all_urls, batch_size)
attempts = 3
while( file_url_partitions is not empty && attempt <= 3 ) {
batch = file_url_partitions.pop()
tryCatch({
download_parallel(batch)
}, some_exception = function(se) {
file_url_partitions.push(batch)
attemp = attempt+1
})
}
Note: I don't have access to R studio/environment now hence no way to try.
Option 2
Download files separately using a download manager/similar and use downloaded files.
Some useful resources:
https://www.r-bloggers.com/r-with-parallel-computing-from-user-perspectives/
http://adv-r.had.co.nz/beyond-exception-handling.html

R - plumber, how to add a stop/quit condition to the run?

I have a small API, that I need to stop running if a certain time in the day is reached, say midnight for example, and then continue on to the rest of the program. And that independant from a call from a request. Beacause one could include a function that breaks the process by checking the Sy.time(), but that will only be executed if a request come through.
My guess is that I have to modify some attribute of the $run, but can't find much about it on the internet.
I know that one could get the pid and kill it with a system command, but I don't know if that solution makes sure that the rest of the program runs.
Does anyone have an idea?
Thanks in advance.
The file containing the function looks like :
(my_file.R)
#* #param x My argument
#* #get /lag_lead
function(x){
x <- as.numeric(x)
c(x-1, x+1)
}
and the running script :
library(plumber)
mon_api <- plumb('my_file.R')
mon_api$run(port = 8000)
print('hello')
One way to do so, would be to set a timeout with withTimeout which is a wrapper for setTimeLimit
library(R-utils)
withTimeout(mon_api$run(port = 8000), timeout = 30)
# timeout is in seconds
# calculate timeout, using difftime between now and midnight
tmo <- as.numeric(difftime(as.POSIXct("2018-12-28 00:00:00"), Sys.time(), units = "secs"))
withTimeout(mon_api$run(port = 8000), timeout = tmo)
Some other timeout solutions, one using parallel
Time out an R command via something like try()

Using Sys.sleep() to delay API call

I'm using R to make an API call to a weather data provider to download some weather forecasts. I'm using a free key that allows me to make no more than 10 calls per minute. I've tried using Sys.sleep() to ensure I don't go over the threshold but the API resource monitor tells me that I've exceeded the number of calls.
For example, if I'm making 6 calls, a time interval of 10 seconds between the calls ought to be sufficient (not taking into account the time R would need).
dat <- list()
for(i in 1:6){
dat[[i]] <- getWeatherData(web_url, api_key, history_date, data_format)
Sys.sleep(10)
web_url <- gsub(i-1, i, url)
}
The getWeatherData function does the following:
makes the API call (only one API call is made each time the function is invoked. Uses httr::GET() to get the data),
parses the XML output to get desired variables (regulat expressions),
performs some clean-up (for missing/garbage values),
converts strings to R date-time objects (POSIXct), and
rounds values to the nearest hour (lubridate::round_date()).
Function inputs:
web_url is a custom url,
api_key is my personal key,
history_date is a string (formatted as "%d/%m/%Y %H:%M:%S"), and
data_format specifies if I want an .XML or .json file as output.
I can not share the url/key for obvious reasons. As soon as I run this, I get a notification from the data provider that I've exceeded the allowable calls per minute (10). I don't get a notification every time - not sure why that is either.
Any help is appreciated!
This solution should be helpful for you if Sys.sleep doesn't do the trick.
Basically, this replaces the use of Sys.sleep with while logic.
dat <- list()
delay_seconds<-10
for(i in 1:6){
dat[[i]] <- getWeatherData(web_url, api_key, history_date, data_format)
date_time<-Sys.time()
while((as.numeric(Sys.time()) - as.numeric(date_time))<delay_seconds){}
web_url <- gsub(i-1, i, url)
}
Here, we are:
defining a number of seconds to wait ( delay_seconds<-10 )
defining a start time for comparison ( date_time<-Sys.time() )
using a while loop that checks the present time in comparison to our comparison time and seeing if this is less than our chosen delay interval ( (as.numeric(Sys.time()) - as.numeric(date_time)<delay_seconds )
doing nothing until the wait time is over( {} )
Not knowing if you need/want to, but in the case that you're hoping to get your data out of the lists and into a longer combined form, I recommend the dplyr function bind_rows().
dat2<-bind_rows(dat)
Thanks to an answer by rbtj to this question: How to make execution pause, sleep, wait for X seconds in R?

Manual API rate limiting

I am trying to write a manual rate-limiting function for the rgithub package. So far this is what I have:
library(rgithub)
pull <- function(i){
commits <- get.pull.request.commits(owner = owner, repo = repo, id = i, ctx = get.github.context(), per_page=100)
links <- digest_header_links(commits)
number_of_pages <- links[2,]$page
if (number_of_pages != 0)
try_default(for (n in 1:number_of_pages){
if (as.integer(commits$headers$`x-ratelimit-remaining`) < 5)
Sys.sleep(as.integer(commits$headers$`x-ratelimit-reset`)-as.POSIXct(Sys.time()) %>% as.integer())
else
get.pull.request.commits(owner = owner, repo = repo, id = i, ctx = get.github.context(), per_page=100, page = n)
}, default = NULL)
else
return(commits)
}
list <- c(500, 501, 502)
pull_lists <- lapply(list, pull)
The intention i that if the x-ratelimit-remaining variable goes below a certain threshold the script should wait until the time specified in x-ratelimit-reset has passed, and then continue the script. However, I'm not sure if this is the actual behavior of the if else set up that I have here.
The function runs fine, but I have some doubts about whether it actually does the rate limiting or whether it somehow skips that steps. Hence I ask: a) how can I find out if it actually does rate-limiting, and b) if not, how can I rewrite it so that it actually does rate limiting? Would a while condition/loop perhaps be better?
You can test if it does the rate limiting changing 5 to a large enough number and adding a display of the timing of Sys.sleep using:
print(system.time(Sys.sleep(...)))
That said, the function seems ok to me, unfortunately I cannot test it easily as rgithub is not available for my version of R (3.1.3).
Not a canonical answer, but some working example.
You should add some logging in your script, even kind of write.csv(append=TRUE).
I've implemented automatic antiddos process which prevent your ip to be banned by the exchange market. You can find it jangorecki/Rbitcoin/R/utils.R.
Rbitcoin.last_api_call is env object stored in package namespace, kind of session package cache.
This can help you with setting it in your package.
You should also consider a optional parallel supported version. Linking to database with concurrency read. My function can be easy modified to queue call and recheck timing every X seconds.
Edit
I forget to add that mentioned function support multiple source systems. That allows for example to extend your rgithub for bitbucket, etc. and still effectively manage API rate limiting.

Resources