How can I stop url.exists()? (R)

I have a list of PDF URLs, and I want to download these PDFs. However, not all of the URLs still exist, which is why I first check them with the RCurl function url.exists(). With some URLs, however, this function runs forever without returning a result, and I can't even stop it with withTimeout().
I wrapped url.exists() in withTimeout(), but the timeout does not work:
library(RCurl)
library(R.utils)
url <- "http://www.shangri-la.com/uploadedFiles/corporate/about_us/csr_2011/Shangri-La%20Asia%202010%20Sustainability%20Report.pdf"
withTimeout(url.exists(url), timeout = 15, onTimeout = "warning")
The function runs forever; the timeout is ignored.
Thus my questions:
Is there any check that would sort out this URL before it even gets to url.exists()?
Or is there a way to prevent url.exists() from running forever?
Other checks I tried (but which do not sort out this URL) are:
try(length(getBinaryURL(url)) > 0) == T
http_status(GET(url))
!class(try(GET(url))) == "try-error"
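
One way to keep the check from hanging is to hand the timeout to libcurl itself: url.exists() forwards extra arguments as curl options (as most RCurl helpers do), so a sketch like the following ought to give up after a few seconds instead of running forever (the 10-second values are only illustrative, and I have not verified it against this particular URL):

library(RCurl)

## sketch only: 'timeout' and 'connecttimeout' are standard libcurl options,
## so libcurl itself aborts the check after 10 seconds
url.exists(url, timeout = 10, connecttimeout = 10)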

library(httr)
urls <- c(
'https://www.deakin.edu.au/current-students/unitguides/UnitGuide.php?year=2015&semester=TRI-1&unit=SLE010',
'https://www.deakin.edu.au/current-students/unitguides/UnitGuide.php?year=2015&semester=TRI-2&unit=HMM202',
'https://www.deakin.edu.au/current-students/unitguides/UnitGuide.php?year=2015&semester=TRI-2&unit=SLE339'
)
sapply(urls, url_success, config(followlocation = 0L), USE.NAMES = FALSE)
This function is analogous to file.exists() and determines whether a request for a specific URL responds without error. We make the request but ask the server not to return the body; we just process the header.
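
Since the underlying problem is a request that never returns, it can also help to put a hard cap on the request time. A rough sketch with httr (url_alive is a made-up helper name and the 10-second limit is only illustrative):

library(httr)

## HEAD request with an explicit timeout; any error (including a timeout)
## is treated as "URL not reachable"
url_alive <- function(url, seconds = 10) {
  resp <- tryCatch(HEAD(url, timeout(seconds)), error = function(e) NULL)
  !is.null(resp) && status_code(resp) < 400
}

sapply(urls, url_alive, USE.NAMES = FALSE)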

Related

requests.get(url) is hanging after 5 iterations

I am attempting to run a web-scraping algorithm on Indeed using BeautifulSoup and to loop through the different pages. However, after 2-6 iterations, requests.get(url) hangs and stops finding the next page. I have read that it might have something to do with the server blocking requests, but that would have blocked the original requests too, and it is also said online that Indeed allows web scraping. I have also heard that I should set a header, but I am unsure how to do that. I am running the latest version of Safari on macOS 12.4.
A solution I came up with, though it does not answer the question specifically, is to use a try/except statement and set a timeout value on the request. Once the timeout is reached, execution enters the except branch, sets a boolean flag, and the loop then continues and tries again. The code is inserted below.
import requests

i = 0
while i < 10:
    url = get_url('software intern', '', i)  # get_url builds the Indeed search URL (defined elsewhere)
    print("Parsing Page Number:" + str(i + 1))
    error = False
    try:
        response = requests.get(url, timeout=10)  # give up on the request after 10 seconds
    except requests.exceptions.Timeout as err:
        error = True
    if error:
        print("Trying to connect to webpage again")
        continue  # retry the same page
    i += 1
I am leaving the question as unanswered for now, however, as I still don't know the root cause of this issue and this solution is just a workaround.

Lua - Handle a 301 Moved Permanently error and then save generated image from resulting URL

I'm trying to make an http.request to have a graph created, and then save the resulting .png graph image. The problem is that I want to do this with Lua, and I'm struggling with two parts. (If you open the URL below in a standard browser, you'll see that it works fine.)
Handling the 301 error: I have looked through SO and could see a few references to this and to the need to use luasec, which I believe I have installed.
301 moved permanently with socket.http
Here is the script, with the URL I'm trying to call via HTTP; I then (eventually) want to save the resulting graph image (.png file) that's created:
local http = require "socket.http"
--local https = require("ssl.https")
local ltn12 = require "ltn12"

r = {} -- init empty table
local result, code, headers, status = http.request{
    url = "http://www.chartgo.com/create.do?charttype=line&width=650&height=650&chrtbkgndcolor=white&gridlines=1&labelorientation=horizontal&title=Fdsfsdfdsfsdfsdfsdf&subtitle=Qrqwrwqrqwrqwr&xtitle=Cbnmcbnm&ytitle=Ghjghj&source=Hgjghj&fonttypetitle=bold&fonttypelabel=normal&gradient=1&max_yaxis=&min_yaxis=&threshold=&labels=1&xaxis1=Jan%0D%0AFeb%0D%0AMar%0D%0AApr%0D%0AMay%0D%0AJun%0D%0AJul%0D%0AAug%0D%0ASep%0D%0AOct%0D%0ANov%0D%0ADec&yaxis1=20%0D%0A30%0D%0A80%0D%0A90%0D%0A50%0D%0A30%0D%0A60%0D%0A50%0D%0A40%0D%0A50%0D%0A10%0D%0A20&group1=Group+1&viewsource=mainView&language=en&sectionSetting=&sectionSpecific=&sectionData=",
    sink = ltn12.sink.table(r)
}

print("code=" .. tostring(code))
print("status=" .. tostring(status))
print("headers=" .. tostring(headers))
print("result=" .. tostring(result))
print("sink= " .. table.concat(r, ""))
print(result, code, headers, status)

for i, v in pairs(headers) do
    print("\t", i, v)
end
This returns the 301 Moved Permanently error, and via a viewer it also provides me with a link to another URL (this time an https one).
So, to try to get to the https site, I tried adding in the ssl.https element with the following, but that does not return anything at all, just nil values.
local https = require("ssl.https")
local ltn12 = require "ltn12"

r = {} -- init empty table
local result, code, headers, status = https.request{
    url = "https://www.chartgo.com/create.do?charttype=line&width=650&height=650&chrtbkgndcolor=white&gridlines=1&labelorientation=horizontal&title=Fdsfsdfdsfsdfsdfsdf&subtitle=Qrqwrwqrqwrqwr&xtitle=Cbnmcbnm&ytitle=Ghjghj&source=Hgjghj&fonttypetitle=bold&fonttypelabel=normal&gradient=1&max_yaxis=&min_yaxis=&threshold=&labels=1&xaxis1=Jan%0D%0AFeb%0D%0AMar%0D%0AApr%0D%0AMay%0D%0AJun%0D%0AJul%0D%0AAug%0D%0ASep%0D%0AOct%0D%0ANov%0D%0ADec&yaxis1=20%0D%0A30%0D%0A80%0D%0A90%0D%0A50%0D%0A30%0D%0A60%0D%0A50%0D%0A40%0D%0A50%0D%0A10%0D%0A20&group1=Group+1&viewsource=mainView&language=en&sectionSetting=&sectionSpecific=&sectionData=",
    sink = ltn12.sink.table(r)
}

print("code=" .. tostring(code))
print("status=" .. tostring(status))
print("headers=" .. tostring(headers))
print("result=" .. tostring(result))
print("sink= " .. table.concat(r, ""))
print(result, code, headers, status)
And then…
Assuming I can eventually make the http.request work, the web page returns a .png image of the resulting graph. I'd love to be able to extract/copy that for further use within this piece of code.
As always, any help/advice would be appreciated.

withTimeout not working inside functions?

I am having some issues with R.utils::withTimeout(). It doesn't seem to take the timeout option into account at all, or only sometimes. Below is the function I want to use:
scrape_player <- function(url, time){
  raw_html <- tryCatch({
    R.utils::withTimeout({
      RCurl::getURL(url)
    },
    timeout = time, onTimeout = "warning")
  })
  html_page <- xml2::read_html(raw_html)
}
Now when I use it:
scrape_player("http://nhlnumbers.com/player_stats/1", 1)
it either works fine and I get the html page I want, or I get an error message telling me that the elapsed time limit was reached, or, and this is my problem, it takes a very long time, way more than 1 second, to finally return an html page with an error 500.
Shouldn't RCurl::getURL() try for only 1 second (in the example) to get the html page and if not, simply return a warning? What am I missing?
OK, what I did as a workaround: instead of returning the page, I write it to disk. This doesn't solve the issue that withTimeout doesn't seem to work, but at least I can see that pages are being written to disk, slowly but surely.
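
For what it's worth, withTimeout() works via R's interrupt mechanism (setTimeLimit) and generally cannot interrupt code running inside compiled code such as libcurl, so the timeout only fires once control returns to R. A more robust option may be to hand the timeout to curl itself. A rough, untested sketch along the lines of the original function (the option values and the NA/NULL error handling are assumptions):

scrape_player <- function(url, time){
  raw_html <- tryCatch(
    ## 'timeout' and 'connecttimeout' are standard libcurl options, so the
    ## transfer itself is aborted after 'time' seconds
    RCurl::getURL(url, .opts = list(timeout = time, connecttimeout = time)),
    error = function(e) NA_character_
  )
  if (is.na(raw_html)) return(NULL)  ## timed out or failed: nothing to parse
  xml2::read_html(raw_html)
}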

Loop to wait for result or timeout in r

I've written a very quick BLAST script in R to enable interfacing with the NCBI BLAST API. Sometimes, however, the result URL takes a while to go live, and my script throws an error until the URL is ready. Is there an elegant way (e.g. a tryCatch option) to handle the error until the result is returned, or to time out after a specified time?
library(rvest)
library(xml2)  ## for read_xml()

## Definitive set of BLAST API instructions can be found here: https://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/new/BLAST_URLAPI.html

## Generate query URL
query_url <- function(QUERY,
                      PROGRAM = "blastp",
                      DATABASE = "nr",
                      ...) {
  put_url_stem <- 'https://www.ncbi.nlm.nih.gov/blast/Blast.cgi?CMD=Put'
  arguments = list(...)
  paste0(put_url_stem,
         "&QUERY=", QUERY,
         "&PROGRAM=", PROGRAM,
         "&DATABASE=", DATABASE,
         arguments)
}

blast_url <- query_url(QUERY = "NP_001117.2")  ## test query
blast_session <- html_session(blast_url)       ## create session
blast_form <- html_form(blast_session)[[1]]    ## pull form from session
RID <- blast_form$fields$RID$value             ## extract RID identifier

get_url <- function(RID, ...) {
  get_url_stem <- "https://www.ncbi.nlm.nih.gov/blast/Blast.cgi?CMD=Get"
  arguments = list(...)
  paste0(get_url_stem, "&RID=", RID, "&FORMAT_TYPE=XML", arguments)
}

hits_xml <- read_xml(get_url(RID))  ## this is the sticky part
Sometimes it takes several minutes for the get_url to go live, so what I would like to do is keep trying, say every 20-30 seconds, until it either produces the URL or times out after a pre-specified time.
I think you may find this answer about the use of tryCatch useful.
Regarding the 'keep trying until timeout' part, I imagine you can build on top of this other answer about a tryCatch loop on error.
Hope it helps.
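Edit: to make the 'retry until ready or give up' part concrete, here is a rough, untested sketch building on the get_url() helper from the question (the 20-second interval, the 10-minute cap and the fetch_blast_result name are only placeholders):

fetch_blast_result <- function(RID, wait = 20, max_wait = 600) {
  started <- Sys.time()
  repeat {
    hits <- tryCatch(xml2::read_xml(get_url(RID)),
                     error = function(e) NULL)  ## not ready yet: read_xml() errors
    if (!is.null(hits)) return(hits)            ## result came back, we're done
    if (difftime(Sys.time(), started, units = "secs") > max_wait)
      stop("Timed out waiting for BLAST result ", RID)
    Sys.sleep(wait)                             ## wait before the next attempt
  }
}

hits_xml <- fetch_blast_result(RID)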

Manual API rate limiting

I am trying to write a manual rate-limiting function for the rgithub package. So far this is what I have:
library(rgithub)
pull <- function(i){
  commits <- get.pull.request.commits(owner = owner, repo = repo, id = i, ctx = get.github.context(), per_page = 100)
  links <- digest_header_links(commits)
  number_of_pages <- links[2, ]$page
  if (number_of_pages != 0)
    try_default(for (n in 1:number_of_pages){
      if (as.integer(commits$headers$`x-ratelimit-remaining`) < 5)
        Sys.sleep(as.integer(commits$headers$`x-ratelimit-reset`) - as.POSIXct(Sys.time()) %>% as.integer())
      else
        get.pull.request.commits(owner = owner, repo = repo, id = i, ctx = get.github.context(), per_page = 100, page = n)
    }, default = NULL)
  else
    return(commits)
}
list <- c(500, 501, 502)
pull_lists <- lapply(list, pull)
The intention is that if the x-ratelimit-remaining variable goes below a certain threshold, the script should wait until the time specified in x-ratelimit-reset has passed, and then continue. However, I'm not sure that this is the actual behavior of the if/else setup that I have here.
The function runs fine, but I have some doubts about whether it actually does the rate limiting or whether it somehow skips that step. Hence I ask: a) how can I find out if it actually does rate limiting, and b) if not, how can I rewrite it so that it does? Would a while condition/loop perhaps be better?
You can test whether it does the rate limiting by changing 5 to a large enough number and adding a display of the timing of Sys.sleep using:
print(system.time(Sys.sleep(...)))
That said, the function seems OK to me; unfortunately I cannot test it easily, as rgithub is not available for my version of R (3.1.3).
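
If you do want an explicit waiting step, something along these lines could be pulled out into its own function (untested, since I cannot run rgithub either; wait_for_rate_limit is a made-up name and 5 is just the threshold from your code):

wait_for_rate_limit <- function(headers, threshold = 5) {
  remaining <- as.integer(headers$`x-ratelimit-remaining`)
  if (!is.na(remaining) && remaining < threshold) {
    ## x-ratelimit-reset is an epoch timestamp, so subtracting the current
    ## epoch time gives the number of seconds left until the quota resets
    seconds_to_wait <- as.integer(headers$`x-ratelimit-reset`) - as.integer(Sys.time())
    if (seconds_to_wait > 0) {
      message("Rate limit nearly exhausted, sleeping for ", seconds_to_wait, " seconds")
      print(system.time(Sys.sleep(seconds_to_wait)))  ## shows how long it actually slept
    }
  }
}

Calling it as wait_for_rate_limit(commits$headers) right before each get.pull.request.commits() call would both log and enforce the pause.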
Not a canonical answer, but a working example.
You should add some logging to your script, even something as simple as write.csv(append = TRUE).
I've implemented an automatic anti-DDoS process which prevents your IP from being banned by the exchange market. You can find it in jangorecki/Rbitcoin/R/utils.R.
Rbitcoin.last_api_call is an environment object stored in the package namespace, a kind of session-level package cache.
This can help you with setting it up in your own package.
You should also consider an optional version with parallel support, linking to a database with concurrent reads. My function can easily be modified to queue calls and recheck the timing every X seconds.
Edit
I forgot to add that the mentioned function supports multiple source systems. That allows you, for example, to extend your rgithub approach to Bitbucket, etc., and still effectively manage API rate limiting.
