I am having some issues with R.utils::withTimeout(). It doesn't seem to take the timeout option into account at all, or only sometimes. Below is the function I want to use:
scrape_player <- function(url, time) {
  raw_html <- tryCatch({
    R.utils::withTimeout({
      RCurl::getURL(url)
    },
    timeout = time, onTimeout = "warning")
  })
  html_page <- xml2::read_html(raw_html)
}
Now when I use it:
scrape_player("http://nhlnumbers.com/player_stats/1", 1)
it either works fine and I get the html page I want, or I get an error message telling me that the elapsed time limit was reached, or (and this is my problem) it takes a very long time, way more than 1 second, to finally return an html page with an error 500.
Shouldn't RCurl::getURL() try for only 1 second (in the example) to get the html page and if not, simply return a warning? What am I missing?
OK, what I did as a workaround: instead of returning the page, I write it to disk. It doesn't solve the issue that withTimeout() doesn't seem to work, but at least I can see that I'm getting pages written to disk, slowly but surely.
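For reference, a minimal sketch of that write-to-disk workaround might look like the following; the output file name, the NA fallback, and the is.na() check are my own illustrative additions, not part of the original code:

scrape_player_to_disk <- function(url, time, outfile) {
  raw_html <- tryCatch(
    R.utils::withTimeout(
      RCurl::getURL(url),
      timeout = time, onTimeout = "warning"   # returns NULL on timeout
    ),
    error = function(e) NA_character_          # fall back to NA on other errors
  )
  if (is.character(raw_html) && !is.na(raw_html[1])) {
    writeLines(raw_html, outfile)              # keep the raw HTML on disk instead of returning it
  }
  invisible(raw_html)
}

# e.g. scrape_player_to_disk("http://nhlnumbers.com/player_stats/1", 1, "player_1.html")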
I am attempting to run a web-scraping algorithm on Indeed using BeautifulSoup and loop through the different pages. However, after 2-6 iterations, requests.get(url) hangs and stops finding the next page. I have read that it might have something to do with the server blocking me, but that would have blocked the original requests, and it also says online that Indeed allows web scraping. I have also heard that I should set a header, but I am unsure how to do that. I am running the latest version of Safari on macOS 12.4.
A solution I came up with, though it does not answer the question specifically, is to use a try/except statement and set a timeout value on the request. Once the timeout value is reached, it enters the except block, sets a boolean value, and then continues the loop to try again. Code is inserted below.
import requests

i = 0  # page counter, assumed to start at 0; get_url is the helper from the question
while i < 10:
    url = get_url('software intern', '', i)
    print("Parsing Page Number:" + str(i + 1))
    error = False
    try:
        response = requests.get(url, timeout=10)
    except requests.exceptions.Timeout as err:
        error = True
    if error:
        print("Trying to connect to webpage again")
        continue  # retry the same page without advancing the counter
    i += 1
I am leaving the question as unanswered for now, however, as I still don't know the root cause of this issue and this solution is just a workaround.
With this code, when I use a for loop or the function lapply, I get the following error:

Error in get_entrypoint (debug_port): Cannot connect R to Chrome. Please retry.
library(rvest)
library(xml2)     # pull html data
library(selectr)  # for xpath elements

url_stackoverflow_rmarkdown <-
  'https://stackoverflow.com/questions/tagged/r-markdown?tab=votes&pagesize=50'

web_page <- read_html(url_stackoverflow_rmarkdown)

questions_per_page <- html_text(html_nodes(web_page, ".page-numbers.current"))[1]

link_questions <- html_attr(html_nodes(web_page, ".question-hyperlink")[1:questions_per_page],
                            "href")

setwd("~/WebScraping_chrome_print_to_pdf")

for (i in 1:length(link_questions)) {
  question_to_pdf <- paste0("https://stackoverflow.com",
                            link_questions[i])
  pagedown::chrome_print(question_to_pdf)
}
Is it possible to build a for loop or use lapply to resume the code from where it breaks? That is, from the last i value, without stopping the script?
Many thanks
I edited @Rui Barradas's idea of using tryCatch().
You can try something like the code below.
IsValues will hold either the path of the converted PDF or the bad i (the index of a question that failed to convert).
IsValues <- list()

for (i in 1:length(link_questions)) {
  question_to_pdf <- paste0("https://stackoverflow.com",
                            link_questions[i])
  IsValues[[i]] <- tryCatch(
    {
      message(paste("Converting", i))
      pagedown::chrome_print(question_to_pdf)
    },
    error = function(cond) {
      message(paste("Cannot convert", i))
      # Choose a return value in case of error
      return(i)
    })
}
Then, you can rbind your values and extract the bad i's:
do.call(rbind, IsValues)[!grepl("\\.pdf$", do.call(rbind, IsValues))]
[1] "3" "5" "19" "31"
You can read more about tryCatch() in this answer.
Based on your example, it looks like you have two errors to contend with. The first error is the one you mention in your question. It is also the most frequent error:
Error in get_entrypoint (debug_port): Cannot connect R to Chrome. Please retry.
The second error arises when there are links in the HTML that return 404:
Failed to generate output. Reason: Failed to open https://lh3.googleusercontent.com/-bwcos_zylKg/AAAAAAAAAAI/AAAAAAAAAAA/AAnnY7o18NuEdWnDEck_qPpn-lu21VTdfw/mo/photo.jpg?sz=32 (HTTP status code: 404)
The key phrase in the first error is "Please retry". As far as I can tell, chrome_print sometimes has issues connecting to Chrome. It seems to be fairly random, i.e. failed connections in one run will be fine in the next, and vice versa. The easiest way to get around this issue is to just keep trying until it connects.
I can't come up with any fix for the second error. However, it doesn't seem to come up very often, so it might make sense to just record it and skip to the next URL.
Using the following code I'm able to print 48 of 50 pages. The only two I can't get to work have the 404 issue I describe above. Note that I use purrr::safely to catch errors. Base R's tryCatch will also work fine, but I find safely to be a little more convenient. That said, in the end it's really just a matter of preference.
Also note that I've dealt with the connection error by using repeat within the for loop. R will keep trying to connect to Chrome and print until it is either successful or some other error pops up. I didn't need it, but you might want to include a counter to set an upper threshold for the number of connection attempts (see the sketch after the code below):
quest_urls <- paste0("https://stackoverflow.com", link_questions)
errors <- NULL
safe_print <- purrr::safely(pagedown::chrome_print)

for (qurl in quest_urls) {
  repeat {
    output <- safe_print(qurl)
    if (is.null(output$error)) break
    else if (grepl("retry", output$error$message)) next
    else {
      errors <- c(errors, `names<-`(output$error$message, qurl))
      break
    }
  }
}
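If you do want that upper threshold, a minimal sketch might look like the following; max_attempts and its value of 5 are my own illustrative choices, not part of the original code:

max_attempts <- 5

for (qurl in quest_urls) {
  attempts <- 0
  repeat {
    output <- safe_print(qurl)
    attempts <- attempts + 1
    if (is.null(output$error)) break   # success: move on to the next URL
    if (grepl("retry", output$error$message) && attempts < max_attempts) next
    errors <- c(errors, `names<-`(output$error$message, qurl))  # give up on this URL
    break
  }
}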
I apologize that I cannot tell you what these functions are from the start.
I have a function CheckOutCell. It takes one argument, and that is the number 764. So every time I run the function it looks like this in its entirety: CheckOutCell(764).
Now many times the function will give me an error:
Error in checkInCell(764) :
The function is currently locked; try again in a minute.
Which is a custom error message and the details are not important to this question.
Now this function could be locked for anywhere from 30 seconds to an hour. I want to be able to automatically run CheckOutCell(764) until it goes through, and then stop running it. That is, run it until I do not get an error, then stop.
I think a start would be using
while (capture.output(checkInCell(764)) ==
       "Error in checkInCell(764) : The function is currently locked; try again in a minute.") {
  # do something
}
However this just produces
Error in checkInCell(764) :
The function is currently locked; try again in a minute.
because the function is still locked, so no output can be captured.
How would I test for something like while(error == TRUE)?
Assume the source code of the function cannot be modified.
Even is.error(CheckInCell(764)) will just produce the same error message.
So it seems that this code works, in a way:
wrapcheck <- function(x){
  repeatCheck = tryCatch(checkOutCell(764),
                         error = function(cond) "skip")
  SudoCheck = ifelse(repeatCheck == "skip", repeatCheck, checkOutCell(764))
  while (SudoCheck == "skip") {
    repeatCheck
  }
}

wrapcheck(764)
Basically this checks for an error and then keeps running the function until the error is not produced. In fact I am fairly confident that this would work with any function you wanted to put in place of CheckOutCell.
The main problem is that when the function is locked, that is not really an error; it is locked. Therefore the block above will not work. It will work when errors other than a lock are produced.
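For what it's worth, a minimal retry-until-success sketch could look like the one below. It assumes the lock really does surface as an R error that tryCatch() can see, which, as noted above, may not be the case; checkOutCell and 764 come from the question, and the 60-second wait is an arbitrary choice:

retry_checkout <- function(cell = 764, wait = 60) {
  repeat {
    result <- tryCatch(
      checkOutCell(cell),
      error = function(cond) cond                    # return the condition object instead of stopping
    )
    if (!inherits(result, "error")) return(result)   # success: stop retrying
    message("Still locked (or some other error); retrying in ", wait, " seconds")
    Sys.sleep(wait)
  }
}

# e.g. retry_checkout()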
(first question here, sorry if I am breaking a piece of etiquette)
My site is running on an eCommerce back end provider that I subscribe to. They have everything in classic ASP. They have a black box function called import_products that I use to import a given text file into my site's database.
The problem is that if I call the function more than once, something breaks. Here is my example code:
For blah = 1 To 20
    thisfilename = "fullcatalog_" & blah & ".csv"
    Response.Write thisfilename & "<br>"
    Response.Flush
    Call Import_Products(3, thisfilename, 1)
Next

Response.End
The first execution of the Import_Products function works fine. The second time I get:
Microsoft VBScript runtime error '800a0009'
Subscript out of range: 'i'
The filenames all exist. That part is fine. There are no bugs in my calling code. I have tried checking the value of "i" before each execution. The first time the value is blank, and before the second execution the value is "2". So I tried setting it to null during each loop iteration, but that didn't change the results at all.
I assume that the function is setting a variable or opening a connection during its execution, but not cleaning it up, and then not expecting it to already be set the second time. Is there any way to find out what this would be? Or somehow reset the condition back to nothing so that the function will be 'fresh'?
The function is in an unreadable include file, so I can't see the code. Obviously a better solution would be to go through the company's support, and I have a ticket in with them, but it is like pulling teeth to get them to even acknowledge that there is a problem, let alone solve it.
Thanks!
EDIT: Here is a further simplified example of calling the function. The first call works. The second call fails with the same error as above.
thisfilename = "fullcatalog_testfile.csv"
Call Import_Products(3,thisfilename,1)
Call Import_Products(3,thisfilename,1)
Response.End
The likely cause of the error is the two numeric parameters of the Import_Products subroutine.
Import_Products(???, FileName, ???)
The values are 3 and 1 in your example but you never explain what they do or what they are documented to do.
EDIT: Since correcting the vendor subroutine is impossible, but it always works the first time it's called, let's use an HTTP redirect instead of a for loop so that it technically only gets called once per page execution.
www.mysite.tld/import.asp?current=1&end=20

curr = CInt(Request.QueryString("current"))
last = CInt(Request.QueryString("end"))    ' "End" is a reserved word in VBScript, so use a different variable name

If curr <= last Then
    thisfilename = "fullcatalog_" & curr & ".csv"
    Call Import_Products(3, thisfilename, 1)
    Response.Redirect("import.asp?current=" & (curr + 1) & "&end=" & last)
End If
Note: the above was written in my browser and is untested, so syntax errors may exist.
I would like to know how I can check whether an HTML page is available. If it is not, I would like to control the return value to avoid stopping the script with an error.
Ex:
arq <- readLines("www.pageerror.com.br")
print(arq)
An alternative is try() - it is simpler to work with than tryCatch() but isn't as featureful. You might also need to suppress warnings, as R will report that it can't resolve the address.
You want something like this in your script:
URL <- "http://www.pageerror.com.br"

arq <- try(suppressWarnings(readLines(con <- url(URL))), silent = TRUE)
close(con)  ## close the connection

if (inherits(arq, "try-error")) {
  writeLines(strwrap(paste("Page", URL, "is not available")))
} else {
  print(arq)
}
The silent = TRUE bit suppresses the reporting of errors (if you leave this at the default FALSE, then R will report the error but not abort the script). We wrap the potentially error-raising function call in try(...., silent = TRUE), with suppressWarnings() being used to suppress the warnings. Then we test the class of the returned object arq and if it inherits from class "try-error" we know the page could not be retrieved and issue a message indicating so. Otherwise we can print arq.
?tryCatch 'Nuff said. <-- Except, apparently not, because the pageweaver demands more characters in an answer. So, "If a chicken and a half lays an egg and a half in a day and a half, how many eggs do nine chickens lay in nine days?"
OK, long enough.
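For completeness, a minimal sketch of the same availability check written with tryCatch() instead of try() might look like this (the URL is the one from the question; the NULL fallback is my own choice):

URL <- "http://www.pageerror.com.br"

arq <- tryCatch(
  suppressWarnings(readLines(URL)),
  error = function(e) NULL   # return NULL instead of letting the error stop the script
)

if (is.null(arq)) {
  message("Page ", URL, " is not available")
} else {
  print(arq)
}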