How to force a for loop or lapply() to keep running after an error in R

When I run the code below, either with a for loop or with lapply(), I get the following error:
"Error in get_entrypoint (debug_port):
Cannot connect R to Chrome. Please retry. "
library(rvest)
library(xml2)    # pull html data
library(selectr) # for xpath element

url_stackoverflow_rmarkdown <-
  'https://stackoverflow.com/questions/tagged/r-markdown?tab=votes&pagesize=50'

web_page <- read_html(url_stackoverflow_rmarkdown)

questions_per_page <- html_text(html_nodes(web_page, ".page-numbers.current"))[1]

link_questions <- html_attr(html_nodes(web_page, ".question-hyperlink")[1:questions_per_page],
                            "href")
setwd("~/WebScraping_chrome_print_to_pdf")
for (i in 1:length(link_questions)) {
  question_to_pdf <- paste0("https://stackoverflow.com",
                            link_questions[i])
  pagedown::chrome_print(question_to_pdf)
}
Is it possible to build a for loop or use lapply() so that the code resumes from where it breaks, i.e. from the last i value, instead of stopping completely?
Many thanks

I adapted @Rui Barradas's idea of using tryCatch().
You can try something like the code below.
IsValues will hold either the returned PDF path or, on failure, the bad i.
IsValues <- list()
for (i in 1:length(link_questions)) {
  question_to_pdf <- paste0("https://stackoverflow.com",
                            link_questions[i])
  IsValues[[i]] <- tryCatch(
    {
      message(paste("Converting", i))
      pagedown::chrome_print(question_to_pdf)
    },
    error = function(cond) {
      message(paste("Cannot convert", i))
      # Choose a return value in case of error
      return(i)
    })
}
Then you can rbind your values and extract the bad i's:
do.call(rbind, IsValues)[!grepl("\\.pdf$", do.call(rbind, IsValues))]
[1] "3" "5" "19" "31"
You can read more about tryCatch() in this answer.
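If you want to re-run only the failed conversions, here is a small follow-up sketch (my addition, reusing link_questions and IsValues from above):
# hypothetical follow-up: retry just the failed indices
bad_i <- as.integer(do.call(rbind, IsValues)[!grepl("\\.pdf$", do.call(rbind, IsValues))])
for (i in bad_i) {
  question_to_pdf <- paste0("https://stackoverflow.com", link_questions[i])
  try(pagedown::chrome_print(question_to_pdf)) # try() keeps the loop going if a retry fails again
}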

Based on your example, it looks like you have two errors to contend with. The first error is the one you mention in your question. It is also the most frequent error:
Error in get_entrypoint (debug_port): Cannot connect R to Chrome. Please retry.
The second error arises when there are links in the HTML that return 404:
Failed to generate output. Reason: Failed to open https://lh3.googleusercontent.com/-bwcos_zylKg/AAAAAAAAAAI/AAAAAAAAAAA/AAnnY7o18NuEdWnDEck_qPpn-lu21VTdfw/mo/photo.jpg?sz=32 (HTTP status code: 404)
The key phrase in the first error is "Please retry". As far as I can tell, chrome_print sometimes has issues connecting to Chrome. It seems to be fairly random, i.e. failed connections in one run will be fine in the next, and vice versa. The easiest way to get around this issue is to just keep trying until it connects.
I can't come up with any fix for the second error. However, it doesn't seem to come up very often, so it might make sense to just record it and skip to the next URL.
Using the following code I'm able to print 48 of 50 pages. The only two I can't get to work have the 404 issue I describe above. Note that I use purrr::safely to catch errors. Base R's tryCatch will also work fine, but I find safely to be a little more convenient. That said, in the end it's really just a matter of preference.
Also note that I've dealt with the connection error by utilizing repeat within the for loop. R will keep trying to connect to Chrome and print until it is either successful, or some other error pops up. I didn't need it, but you might want to include a counter to set an upper threshold for the number of connection attempts (a sketch of that variant follows the loop below):
quest_urls <- paste0("https://stackoverflow.com", link_questions)
errors <- NULL
safe_print <- purrr::safely(pagedown::chrome_print)

for (qurl in quest_urls) {
  repeat {
    output <- safe_print(qurl)
    if (is.null(output$error)) break
    else if (grepl("retry", output$error$message)) next
    else {
      errors <- c(errors, `names<-`(output$error$message, qurl))
      break
    }
  }
}
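For reference, here is a variant of the same loop with such a counter (my sketch; max_attempts and its value of 5 are arbitrary):
max_attempts <- 5 # arbitrary cap on connection attempts per URL
for (qurl in quest_urls) {
  attempts <- 0
  repeat {
    attempts <- attempts + 1
    output <- safe_print(qurl)
    if (is.null(output$error)) break
    if (attempts >= max_attempts || !grepl("retry", output$error$message)) {
      # give up on this URL: record the error and move on
      errors <- c(errors, `names<-`(output$error$message, qurl))
      break
    }
  }
}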

Related

withTimeout not working inside functions?

I am having some issues with R.utils::withTimeout(). It doesn't seem to take the timeout option into account at all, or only sometimes. Below is the function I want to use:
scrape_player <- function(url, time) {
  raw_html <- tryCatch({
    R.utils::withTimeout({
      RCurl::getURL(url)
    },
    timeout = time, onTimeout = "warning")
  })
  html_page <- xml2::read_html(raw_html)
}
Now when I use it:
scrape_player("http://nhlnumbers.com/player_stats/1", 1)
it either works fine and I get the html page I want, or I get an error message telling me that the elapsed time limit was reached, or, and this is my problem, it takes a very long time, way more than 1 second, to finally return an html page with an error 500.
Shouldn't RCurl::getURL() try for only 1 second (in the example) to get the html page and if not, simply return a warning? What am I missing?
Ok, here is what I did as a workaround: instead of returning the page, I write it to disk. It doesn't solve the issue that withTimeout doesn't seem to work, but at least I can see that pages are being written to disk, slowly but surely.
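A note on why this happens: withTimeout() relies on setTimeLimit(), which can only interrupt code at R-level interrupt points, so a long-running call into compiled code like RCurl::getURL() may not be stopped in time. One alternative sketch, assuming libcurl's own timeout option is passed through .opts (values are illustrative):
library(RCurl)
library(xml2)

scrape_player <- function(url, time) {
  raw_html <- tryCatch(
    # let libcurl itself give up after 'time' seconds
    RCurl::getURL(url, .opts = list(timeout = time)),
    error = function(e) NA_character_
  )
  if (is.na(raw_html)) return(NULL) # fetch failed or timed out
  xml2::read_html(raw_html)
}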

Loop to wait for result or timeout in R

I've written a very quick blast script in r to enable interfacing with the NCBI blast API. Sometimes however, the result url takes a while to load and my script throws an error until the url is ready. Is there an elegant way (i.e. a tryCatch option) to handle the error until the result is returned or timeout after a specified time?
library(rvest)
## Definitive set of blast API instructions can be found here: https://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/new/BLAST_URLAPI.html
## Generate query URL
query_url <-
  function(QUERY,
           PROGRAM = "blastp",
           DATABASE = "nr",
           ...) {
    put_url_stem <-
      'https://www.ncbi.nlm.nih.gov/blast/Blast.cgi?CMD=Put'
    arguments <- list(...)
    paste0(
      put_url_stem,
      "&QUERY=", QUERY,
      "&PROGRAM=", PROGRAM,
      "&DATABASE=", DATABASE,
      arguments
    )
  }
blast_url <- query_url(QUERY = "NP_001117.2") ## test query
blast_session <- html_session(blast_url) ## create session
blast_form <- html_form(blast_session)[[1]] ## pull form from session
RID <- blast_form$fields$RID$value ## extract RID identifier
get_url <- function(RID, ...) {
  get_url_stem <-
    "https://www.ncbi.nlm.nih.gov/blast/Blast.cgi?CMD=Get"
  arguments <- list(...)
  paste0(get_url_stem, "&RID=", RID, "&FORMAT_TYPE=XML", arguments)
}
hits_xml <- read_xml(get_url(RID)) ## this is the sticky part
Sometimes it takes several minutes for the get_url to go live, so what I would like to do is keep trying, say every 20-30 seconds, until it either produces the url or times out after a pre-specified time.
I think you may find this answer about the use of tryCatch useful.
Regarding the 'keep trying until timeout' part, I imagine you can build on this other answer about a tryCatch loop on error.
Hope it helps.
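To make the 'retry every 20-30 seconds until timeout' idea concrete, here is a minimal sketch (retry_read_xml is a hypothetical helper; the interval and timeout values are placeholders):
library(xml2)

retry_read_xml <- function(url, interval = 20, timeout = 600) {
  deadline <- Sys.time() + timeout # give up after 'timeout' seconds
  repeat {
    hits <- tryCatch(read_xml(url), error = function(e) NULL)
    if (!is.null(hits)) return(hits) # result is ready
    if (Sys.time() >= deadline) stop("Timed out waiting for: ", url)
    Sys.sleep(interval) # wait before trying again
  }
}

hits_xml <- retry_read_xml(get_url(RID))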

R / RStudio: is there a watch function to continuously monitor some values?

I have a lot of different operations running on quite a big dataframe. Maintenance is starting to be a pain, especially with some data being improperly formatted, and I'm looking at options to make my life easier.
The problem is that at one point in the flow of operations NAs are introduced in several lines, including in the id column (certainly due to some bad subsetting). Now I cannot find the culprit easily, because each time I have to str() it or View() it in RStudio... This takes time, and I already went through it once without finding the bad operation...
So I'm curious if there is some package answering to this problem or a way to program something "daemon-like", to pop up a warning message when a specific value appears.
A while loop doesn't help, because it evaluates all the statements, and of course at some point the condition is no longer true, and it doesn't tell me where it stopped...
while (nrow(df[is.na(df$id), ]) > 0) {
  # statements that are OK
  # the breaking statement
  # other OK statements
}
I'll look for other options but I wanted to ask before...
EDIT: thanks for the useful comments, I'll definitely look more into those functions. However, I also tried to build a watch function myself (see my answer).
Ok, I guess I have finally built something quite like it:
This is a function to source a file line by line until a given condition is met:
watchIt <- function(file, watchexpression, startwatchline) {
  line <- 1
  # note: use the 'file' argument rather than a hard-coded file name
  sourceList <- scan(file = file, what = "character", sep = "\n", blank.lines.skip = FALSE)
  maxLines <- length(sourceList)
  # run the lines before the watch region without checking the condition
  while (startwatchline > line && maxLines >= line) {
    cat("l")
    eval(parse(text = sourceList[line]))
    line <- line + 1
    cat(line)
    cat(" ")
  }
  # from startwatchline on, check the watch expression before each line
  while (eval(parse(text = watchexpression)) == FALSE && maxLines >= line) {
    cat(" L")
    eval(parse(text = sourceList[line]))
    line <- line + 1
    cat(line)
    cat(" ")
  }
  if (eval(parse(text = watchexpression))) {
    # the condition became TRUE after executing the previous line
    cat("Condition evaluated to TRUE after line:")
    cat(line - 1)
    cat("\n")
    cat(sourceList[line - 1])
  } else {
    cat("End of file reached without the condition becoming TRUE")
  }
}
So this is how I use it:
watchIt("source_test.R", "nrow(df[is.na(df$id),]) > 0", 10)
This puts "source_test.R" into a list, one line per list item, and, starting from line 10, tests whether the resulting dataframe has NAs in the id field. Execution stops either when the condition evaluates to TRUE or when the end of the list is reached.
Still, I'm waiting for other/better answers... Also, this is only about the fourth function I've managed to create in R, so I guess there are improvements to be made to it...

Running an R script: readline() and scan() do not pause for user input

I have looked at other posts that appeared similar to this question, but they have not helped me. This may just be my ignorance of R. Thus I decided to sign up and make my first post on Stack Overflow.
I am running an R script and would like the user to decide which of the two following loops to use. The code to capture the user's choice looks like this:
# Define the function
method.choice <- function() {
  Method.to.use <- readline("Please enter 'New' for new method and 'Old' for old method: ")
  # Make sure the selection is one of the two valid inputs
  while (Method.to.use != "New" && Method.to.use != "Old") {
    cat("You have not entered a valid input, try again", "\n")
    Method.to.use <- readline("Please enter 'New' for new method and 'Old' for old method: ")
    cat("You have selected", Method.to.use, "\n")
  }
  return(Method.to.use)
}

# Run the function and keep its return value
Method.to.use <- method.choice()
Then below this I have the two possible choices:
if(Method.to.use=="New") {
for(i in 1:nrow(linelist)){...}
}
if(Method.to.use=="Old"){
for(i in 1:nrow(linelist)){...}
}
My issue, and what I have read from other posts, is that whether I use readline, scan or ask, R does not wait for my input. Instead R uses the following lines of the script as the input.
The only way I found to make R pause for input is to put all the code on the same line, or to run it line by line (instead of selecting all the code at once). See this example from gtools using ask():
silly <- function() {
  age <- ask("How old are you? ")
  age <- as.numeric(age)
  cat("In 10 years you will be", age + 10, "years old!\n")
}
This runs with a pause:
silly(); paste("this is quite silly")
This does not wait for input:
silly()
paste("this is quite silly")
Any guidance would be appreciated to ensure I can run my entire script and have it pause at readline() without continuing. I am using RStudio and I have checked that interactive() is TRUE.
The only other work-around I found is wrapping my entire script into one main function, which is not ideal for me. This may require me to use <<- to write to my environment.
Thank you in advance.
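One workaround worth knowing (a sketch, and it only helps when the script runs non-interactively, e.g. via Rscript; when sending a whole selection to the RStudio console, readline() will still consume the next line): fall back to reading from stdin when the session is not interactive.
read_answer <- function(prompt) {
  if (interactive()) {
    return(readline(prompt))
  }
  con <- file("stdin")
  on.exit(close(con))
  cat(prompt)
  readLines(con, n = 1) # blocks until the user enters a line
}
Method.to.use <- read_answer("Please enter 'New' for new method and 'Old' for old method: ")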

Dynamically changing the sequence of a loop

I am trying to scrape data from a website which is unfortunately located on a very unreliable server with very volatile response times. The first idea is of course to loop over the list of (thousands of) URLs and save the downloaded results by populating a list.
The problem, however, is that the server randomly responds very slowly, which results in a timeout error. This alone would not be a problem, as I can use the tryCatch() function and jump to the next iteration. Doing so, however, I am missing some files in each run. I know that each of the URLs in the list exists, and I need all of the data.
My idea thus would have been to use tryCatch() to evaluate whether the getURL() request yields an error. If so, the loop jumps to the next iteration and the erroneous URL is appended at the end of the URL list over which the loop runs. My intuitive solution would look something like this:
dwl <- list()
for (i in seq_along(urs)) {
  temp <- tryCatch(getURL(url = urs[[i]]), error = function(e) e)
  if (inherits(temp, "OPERATION_TIMEDOUT")) { # check for a timeout error
    # append the erroneous url at the end of the sequence
    urs[[length(urs) + 1]] <- urs[[i]]
    next
  } else {
    dwl[[i]] <- temp # if there is no error the data is saved in the list
  }
}
If it "would" work I would eventually be able to download all the URLs in the list. It however doesn't work, because as the help page for the next function states: "seq in a for loop is evaluated at the start of the loop; changing it subsequently does not affect the loop". Is there a workaround for this or a trick with which I could achieve my envisaged goal? I am grateful for every comment!
I would do something like this (explanation in the comments):
## RES is a list that will contain the final result
## Always try to pre-allocate your results
RES <- vector("list", length(urs))
## Safe getURL returns NA on error; the NA is useful to filter results
get_url <- function(x) tryCatch(getURL(x), error = function(e) NA)
## the parser!
parse_doc <- function(x) {
  ## some code to parse the doc
}
## indices of the urls still to be scraped
todo <- seq_along(urs)
## loop while we still have some unscraped urls
while (length(todo) > 0) {
  ## get the doc for all remaining urls
  l_doc <- lapply(urs[todo], get_url)
  ok <- !is.na(l_doc)
  ## parse each fetched document and store it at its original position in RES
  RES[todo[ok]] <- lapply(l_doc[ok], parse_doc)
  ## keep only the failed urls for the next pass
  todo <- todo[!ok]
}
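One caveat worth adding (my note, not part of the original answer): if some url never succeeds, the while loop never terminates. A simple pass counter caps the retries:
passes <- 0
max_passes <- 10 # arbitrary cap on full passes over the failed urls
while (length(todo) > 0 && passes < max_passes) {
  l_doc <- lapply(urs[todo], get_url)
  ok <- !is.na(l_doc)
  RES[todo[ok]] <- lapply(l_doc[ok], parse_doc)
  todo <- todo[!ok]
  passes <- passes + 1
}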
