Happy New Year!
I have just started to learn Julia and my first mini challenge I have set myself is to scrape data from a large list of URLs.
I have ca 50k URLs (which I successfully parsed from a JSON with Julia using Regex) in a CSV file. I want to scrape each one and return a matched string ("/page/12345/view" - where 12345 is any integer).
I managed to do so using HTTP and Queryverse (although had started with CSV and CSVFiles but looking at packages for learning purposes) but the script seems to stop after just under 2k. I can't see an error such as a timeout.
May I ask if anyone can advise what I'm doing wrong or how I can approach it differently? Explanations/links to learning resources would also be great!
using HTTP, Queryverse
URLs = load("urls.csv") |> DataFrame
patternid = r"\/page\/[0-9]+\/view"
touch("ids.txt")
f = open("ids.txt", "a")
for row in eachrow(URLs)
urlResponse = HTTP.get(row[:url])
if Int(urlResponse.status) == 404
continue
end
urlHTML = String(urlResponse.body)
urlIDmatch = match(patternid, urlHTML)
write(f, urlIDmatch.match, "\n")
end
close(f)
There can be always a server that detects your scraper and intentionally takes a very long time to respond.
Basically, since scraping is an IO intensive operations you should do it using a big number of asynchronous tasks. Moreover this should be combined with the readtimeout parameter of the get function. Hence your code will look more or less like this:
asyncmap(1:nrow(URLs);ntasks=50) do n
row = URLs[n, :]
urlResponse = HTTP.get(row[:url], readtimeout=10)
# the rest of your code comes here
end
Even one some servers are delaying transmission, always many connections will be working.
I have 200,000 links that I am trying to download, I have tried downloading it all in one go but I ran into memory issues.
I am trying to create a function which will download 1000 links at a time and save them in a folder.
Packages:
library(dplyr)
library(purrr)
library(edgarWebR)
A small sample of the data is as follows:
Data 1:
urls_to_parse <- c("https://www.sec.gov/Archives/edgar/data/1750/000104746918004978/a2236183z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000104746917004528/a2232622z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000104746916014299/a2228768z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000104746915006136/a2225345z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000104746914006243/a2220733z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000104746913007797/a2216052z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000104746912007300/a2210166z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000104746911006302/a2204709z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000104746910006500/a2199382z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000104746909006783/a2193700z10-k.htm"
)
I then apply the following function to download these 10 links
parsed_files <- map(urls_to_parse, possibly(parse_filing, otherwise = NA))
Which stores it as a nice list, I can then apply names(parsed_files) <- urls_to_parse to name the lists as the links from where they were downloading them from. I can also use output <- plyr::ldply(parsed_files, data.frame) to store everything in a nice data frame.
Using the below data, how could I create batches to download the data in say batches of 10?
What I have currently:
start = 1
end = 100
output <- NULL
output_fin <- NULL
for(i in start:end){
output[[i]] <- map(urls_to_parse[[i]], possibly(parse_filing, otherwise = NA))
names(output) <- urls_to_parse[start:end]
save(output_fin, file = paste0("C:/Users/Downloads/data/",i, "output.RData"))
}
I am sure there is a better way using a function, since this code breaks for some of the results.
More data: - 100 links
urls_to_parse <- c("https://www.sec.gov/Archives/edgar/data/1750/000104746918004978/a2236183z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000104746917004528/a2232622z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000104746916014299/a2228768z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000104746915006136/a2225345z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000104746914006243/a2220733z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000104746913007797/a2216052z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000104746912007300/a2210166z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000104746911006302/a2204709z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000104746910006500/a2199382z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000104746909006783/a2193700z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000104746908008126/a2186742z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000110465907055173/a07-18543_110k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000110465906047248/a06-15961_110k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000110465905033688/a05-12324_110k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000104746904023905/a2140220z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000104746903028005/a2116671z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000091205702033450/a2087919z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/61478/000095012310108231/c61492e10vk.htm",
"https://www.sec.gov/Archives/edgar/data/61478/000095015208010514/n48172e10vk.htm",
"https://www.sec.gov/Archives/edgar/data/61478/000095013707018659/c22309e10vk.htm",
"https://www.sec.gov/Archives/edgar/data/61478/000095013707000193/c11187e10vk.htm",
"https://www.sec.gov/Archives/edgar/data/61478/000095013406000594/c01109e10vk.htm",
"https://www.sec.gov/Archives/edgar/data/61478/000120677405000032/d16006.htm",
"https://www.sec.gov/Archives/edgar/data/61478/000120677404000013/d13773.htm",
"https://www.sec.gov/Archives/edgar/data/61478/000104746903001075/a2097401z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/61478/000091205702001614/a2067550z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/319126/000115752308008030/a5800571.htm",
"https://www.sec.gov/Archives/edgar/data/319126/000115752307009801/a5515869.htm",
"https://www.sec.gov/Archives/edgar/data/319126/000115752306009238/a5227919.htm",
"https://www.sec.gov/Archives/edgar/data/730469/000073046908000102/alpharmainc_10k.htm",
"https://www.sec.gov/Archives/edgar/data/730469/000073046907000017/alo10k2006.htm",
"https://www.sec.gov/Archives/edgar/data/730469/000073046906000027/alo10k2005.htm",
"https://www.sec.gov/Archives/edgar/data/730469/000073046905000021/alo10k2004final.htm",
"https://www.sec.gov/Archives/edgar/data/730469/000073046904000058/alo10k2003master.htm",
"https://www.sec.gov/Archives/edgar/data/730469/000073046903000001/alo10k.htm",
"https://www.sec.gov/Archives/edgar/data/730469/000073046902000004/alo10k2001.htm",
"https://www.sec.gov/Archives/edgar/data/730469/000073046901500003/alo.htm",
"https://www.sec.gov/Archives/edgar/data/4515/000000620118000009/a10k123117.htm",
"https://www.sec.gov/Archives/edgar/data/4515/000119312517051216/d286458d10k.htm",
"https://www.sec.gov/Archives/edgar/data/4515/000119312516474605/d78287d10k.htm",
"https://www.sec.gov/Archives/edgar/data/4515/000119312515061145/d829913d10k.htm",
"https://www.sec.gov/Archives/edgar/data/4515/000000620114000004/aagaa10k-20131231.htm",
"https://www.sec.gov/Archives/edgar/data/6201/000000620113000023/amr-10kx20121231.htm",
"https://www.sec.gov/Archives/edgar/data/6201/000119312512063516/d259681d10k.htm",
"https://www.sec.gov/Archives/edgar/data/6201/000095012311014726/d78201e10vk.htm",
"https://www.sec.gov/Archives/edgar/data/6201/000000620110000006/ar123109.htm",
"https://www.sec.gov/Archives/edgar/data/6201/000000620109000009/ar120810k.htm",
"https://www.sec.gov/Archives/edgar/data/6201/000000451508000014/ar022010k.htm",
"https://www.sec.gov/Archives/edgar/data/6201/000095013407003888/d43815e10vk.htm",
"https://www.sec.gov/Archives/edgar/data/6201/000095013406003715/d33303e10vk.htm",
"https://www.sec.gov/Archives/edgar/data/6201/000095013405003726/d22731e10vk.htm",
"https://www.sec.gov/Archives/edgar/data/6201/000095013404002668/d12953e10vk.htm",
"https://www.sec.gov/Archives/edgar/data/6201/000104746903013301/a2108197z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/65695/000095013407003823/h42902e10vk.htm",
"https://www.sec.gov/Archives/edgar/data/65695/000095012906002343/h31028e10vk.htm",
"https://www.sec.gov/Archives/edgar/data/65695/000095012905002955/h22337e10vk.htm",
"https://www.sec.gov/Archives/edgar/data/3197/000156459018005085/cece-10k_20171231.htm",
"https://www.sec.gov/Archives/edgar/data/3197/000156459017004264/cece-10k_20161231.htm",
"https://www.sec.gov/Archives/edgar/data/3197/000156459016015157/cece-10k_20151231.htm",
"https://www.sec.gov/Archives/edgar/data/3197/000119312515095828/d864880d10k.htm",
"https://www.sec.gov/Archives/edgar/data/3197/000119312514098407/d661608d10k.htm",
"https://www.sec.gov/Archives/edgar/data/3197/000119312513109153/d444138d10k.htm",
"https://www.sec.gov/Archives/edgar/data/3197/000119312512119293/d293768d10k.htm",
"https://www.sec.gov/Archives/edgar/data/3197/000119312511067373/d10k.htm",
"https://www.sec.gov/Archives/edgar/data/3197/000119312510069639/d10k.htm",
"https://www.sec.gov/Archives/edgar/data/3197/000119312509055504/d10k.htm",
"https://www.sec.gov/Archives/edgar/data/3197/000119312508058939/d10k.htm",
"https://www.sec.gov/Archives/edgar/data/3197/000119312507071909/d10k.htm",
"https://www.sec.gov/Archives/edgar/data/3197/000119312506068031/d10k.htm",
"https://www.sec.gov/Archives/edgar/data/3197/000119312505077739/d10k.htm",
"https://www.sec.gov/Archives/edgar/data/3197/000119312504052176/d10k.htm",
"https://www.sec.gov/Archives/edgar/data/2601/000110465910047121/a10-16705_110k.htm",
"https://www.sec.gov/Archives/edgar/data/2601/000114420409046933/v159572_10k.htm",
"https://www.sec.gov/Archives/edgar/data/2601/000110465906060737/a06-19311_110k.htm",
"https://www.sec.gov/Archives/edgar/data/2601/000104746905022854/a2162888z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/2601/000104746904028585/a2143353z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/2601/000104746903031974/a2119476z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/859163/000143774918010388/avx20180331_10k.htm",
"https://www.sec.gov/Archives/edgar/data/859163/000085916317000028/avx-20170331x10k.htm",
"https://www.sec.gov/Archives/edgar/data/859163/000085916316000079/avx-20160331x10k.htm",
"https://www.sec.gov/Archives/edgar/data/859163/000085916315000024/avx-20150331x10k.htm",
"https://www.sec.gov/Archives/edgar/data/859163/000085916314000035/avx-20140331x10k.htm",
"https://www.sec.gov/Archives/edgar/data/859163/000085916313000022/avx-20130331x10k.htm",
"https://www.sec.gov/Archives/edgar/data/859163/000085916312000024/avxform10kfy12.htm",
"https://www.sec.gov/Archives/edgar/data/859163/000085916311000013/avxform10kfy11.htm",
"https://www.sec.gov/Archives/edgar/data/859163/000085916310000020/avxform10kfy10.htm",
"https://www.sec.gov/Archives/edgar/data/859163/000085916309000117/form10kfy09.htm",
"https://www.sec.gov/Archives/edgar/data/859163/000085916308000192/form10qq1fy09.htm",
"https://www.sec.gov/Archives/edgar/data/859163/000085916308000101/form10kfy08.htm",
"https://www.sec.gov/Archives/edgar/data/859163/000085916307000122/form10kfy07.htm",
"https://www.sec.gov/Archives/edgar/data/859163/000085916306000102/avxfy06form10-k.htm",
"https://www.sec.gov/Archives/edgar/data/859163/000085916305000094/fy0510k.htm",
"https://www.sec.gov/Archives/edgar/data/859163/000085916304000091/fy0410k.htm",
"https://www.sec.gov/Archives/edgar/data/859163/000085916303000020/fy0310k.htm",
"https://www.sec.gov/Archives/edgar/data/859163/000085916302000007/r10k-0302.htm",
"https://www.sec.gov/Archives/edgar/data/7286/000076462218000018/pnw2017123110-k.htm",
"https://www.sec.gov/Archives/edgar/data/7286/000076462217000010/pnw2016123110-k.htm",
"https://www.sec.gov/Archives/edgar/data/7286/000076462216000087/pnw2015123110-k.htm",
"https://www.sec.gov/Archives/edgar/data/7286/000076462215000013/pnw12311410-k.htm",
"https://www.sec.gov/Archives/edgar/data/7286/000110465914012068/a13-25897_110k.htm"
)
Looping over to do batch job as you showed is a bad idea. If you have a 1000s of files to be downloaded, how do you recover from errors?
The performance is not solely depend on your computer's configuration, but the network performance is crucial.
Here are couple of suggestions.
Option 1
partition all URLs in to batches to be able to download them parallelly. The number of files to be downloaded could be equal to number of cores in your computer. Look at this question; reading multiple files quickly in R
store these batches in a queue objects - For ex: using a package like https://cran.r-project.org/web/packages/dequer/dequer.pdf
pop the queue and use the batch of URLs in your parallel file download function.
Use a retryable file download function like in -- HTTP error 400 in R, error handling, How to retry instead of forcing to stop?
Once the queue is completed, move to the next partition.
wrap the whole operation in a retryable loop. For example; How to retry a statement on error?
Why do I use a queue? Because you could retry on error easily.
A pseudo code
file_url_partitions <- partion_as_batches(all_urls, batch_size)
attempts = 3
while( file_url_partitions is not empty && attempt <= 3 ) {
batch = file_url_partitions.pop()
tryCatch({
download_parallel(batch)
}, some_exception = function(se) {
file_url_partitions.push(batch)
attemp = attempt+1
})
}
Note: I don't have access to R studio/environment now hence no way to try.
Option 2
Download files separately using a download manager/similar and use downloaded files.
Some useful resources:
https://www.r-bloggers.com/r-with-parallel-computing-from-user-perspectives/
http://adv-r.had.co.nz/beyond-exception-handling.html
I have a small API, that I need to stop running if a certain time in the day is reached, say midnight for example, and then continue on to the rest of the program. And that independant from a call from a request. Beacause one could include a function that breaks the process by checking the Sy.time(), but that will only be executed if a request come through.
My guess is that I have to modify some attribute of the $run, but can't find much about it on the internet.
I know that one could get the pid and kill it with a system command, but I don't know if that solution makes sure that the rest of the program runs.
Does anyone have an idea?
Thanks in advance.
The file containing the function looks like :
(my_file.R)
#* #param x My argument
#* #get /lag_lead
function(x){
x <- as.numeric(x)
c(x-1, x+1)
}
and the running script :
library(plumber)
mon_api <- plumb('my_file.R')
mon_api$run(port = 8000)
print('hello')
One way to do so, would be to set a timeout with withTimeout which is a wrapper for setTimeLimit
library(R-utils)
withTimeout(mon_api$run(port = 8000), timeout = 30)
# timeout is in seconds
# calculate timeout, using difftime between now and midnight
tmo <- as.numeric(difftime(as.POSIXct("2018-12-28 00:00:00"), Sys.time(), units = "secs"))
withTimeout(mon_api$run(port = 8000), timeout = tmo)
Some other timeout solutions, one using parallel
Time out an R command via something like try()
I'm using R to make an API call to a weather data provider to download some weather forecasts. I'm using a free key that allows me to make no more than 10 calls per minute. I've tried using Sys.sleep() to ensure I don't go over the threshold but the API resource monitor tells me that I've exceeded the number of calls.
For example, if I'm making 6 calls, a time interval of 10 seconds between the calls ought to be sufficient (not taking into account the time R would need).
dat <- list()
for(i in 1:6){
dat[[i]] <- getWeatherData(web_url, api_key, history_date, data_format)
Sys.sleep(10)
web_url <- gsub(i-1, i, url)
}
The getWeatherData function does the following:
makes the API call (only one API call is made each time the function is invoked. Uses httr::GET() to get the data),
parses the XML output to get desired variables (regulat expressions),
performs some clean-up (for missing/garbage values),
converts strings to R date-time objects (POSIXct), and
rounds values to the nearest hour (lubridate::round_date()).
Function inputs:
web_url is a custom url,
api_key is my personal key,
history_date is a string (formatted as "%d/%m/%Y %H:%M:%S"), and
data_format specifies if I want an .XML or .json file as output.
I can not share the url/key for obvious reasons. As soon as I run this, I get a notification from the data provider that I've exceeded the allowable calls per minute (10). I don't get a notification every time - not sure why that is either.
Any help is appreciated!
This solution should be helpful for you if Sys.sleep doesn't do the trick.
Basically, this replaces the use of Sys.sleep with while logic.
dat <- list()
delay_seconds<-10
for(i in 1:6){
dat[[i]] <- getWeatherData(web_url, api_key, history_date, data_format)
date_time<-Sys.time()
while((as.numeric(Sys.time()) - as.numeric(date_time))<delay_seconds){}
web_url <- gsub(i-1, i, url)
}
Here, we are:
defining a number of seconds to wait ( delay_seconds<-10 )
defining a start time for comparison ( date_time<-Sys.time() )
using a while loop that checks the present time in comparison to our comparison time and seeing if this is less than our chosen delay interval ( (as.numeric(Sys.time()) - as.numeric(date_time)<delay_seconds )
doing nothing until the wait time is over( {} )
Not knowing if you need/want to, but in the case that you're hoping to get your data out of the lists and into a longer combined form, I recommend the dplyr function bind_rows().
dat2<-bind_rows(dat)
Thanks to an answer by rbtj to this question: How to make execution pause, sleep, wait for X seconds in R?
I am trying to write a manual rate-limiting function for the rgithub package. So far this is what I have:
library(rgithub)
pull <- function(i){
commits <- get.pull.request.commits(owner = owner, repo = repo, id = i, ctx = get.github.context(), per_page=100)
links <- digest_header_links(commits)
number_of_pages <- links[2,]$page
if (number_of_pages != 0)
try_default(for (n in 1:number_of_pages){
if (as.integer(commits$headers$`x-ratelimit-remaining`) < 5)
Sys.sleep(as.integer(commits$headers$`x-ratelimit-reset`)-as.POSIXct(Sys.time()) %>% as.integer())
else
get.pull.request.commits(owner = owner, repo = repo, id = i, ctx = get.github.context(), per_page=100, page = n)
}, default = NULL)
else
return(commits)
}
list <- c(500, 501, 502)
pull_lists <- lapply(list, pull)
The intention i that if the x-ratelimit-remaining variable goes below a certain threshold the script should wait until the time specified in x-ratelimit-reset has passed, and then continue the script. However, I'm not sure if this is the actual behavior of the if else set up that I have here.
The function runs fine, but I have some doubts about whether it actually does the rate limiting or whether it somehow skips that steps. Hence I ask: a) how can I find out if it actually does rate-limiting, and b) if not, how can I rewrite it so that it actually does rate limiting? Would a while condition/loop perhaps be better?
You can test if it does the rate limiting changing 5 to a large enough number and adding a display of the timing of Sys.sleep using:
print(system.time(Sys.sleep(...)))
That said, the function seems ok to me, unfortunately I cannot test it easily as rgithub is not available for my version of R (3.1.3).
Not a canonical answer, but some working example.
You should add some logging in your script, even kind of write.csv(append=TRUE).
I've implemented automatic antiddos process which prevent your ip to be banned by the exchange market. You can find it jangorecki/Rbitcoin/R/utils.R.
Rbitcoin.last_api_call is env object stored in package namespace, kind of session package cache.
This can help you with setting it in your package.
You should also consider a optional parallel supported version. Linking to database with concurrency read. My function can be easy modified to queue call and recheck timing every X seconds.
Edit
I forget to add that mentioned function support multiple source systems. That allows for example to extend your rgithub for bitbucket, etc. and still effectively manage API rate limiting.