I would like to know how I can check whether an HTML page is available. If it is not, I would like to control the return value so the script does not stop with an error.
Ex:
arq <- readLines("www.pageerror.com.br")
print(arq)
An alternative is try() - it is simpler to work with than tryCatch() but isn't as featureful. You might also need to suppress warnings, as R will report that it can't resolve the address.
You want something like this in your script:
URL <- "http://www.pageerror.com.br"
arq <- try(suppressWarnings(readLines(con <- url(URL))), silent = TRUE)
close(con) ## close the connection
if (inherits(arq, "try-error")) {
  writeLines(strwrap(paste("Page", URL, "is not available")))
} else {
  print(arq)
}
The silent = TRUE bit suppresses the reporting of errors (if you leave this at the default, FALSE, then R will report the error but not abort the script). We wrap the potentially error-raising function call in try(..., silent = TRUE), with suppressWarnings() being used to suppress the warnings. Then we test the class of the returned object arq: if it inherits from class "try-error", we know the page could not be retrieved, and we issue a message saying so. Otherwise we can print arq.
See ?tryCatch; its help page covers exactly this kind of error handling.
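For completeness, here is a minimal sketch of the same availability check written with tryCatch() rather than try(); the URL is just the asker's example, and the error handler simply returns NULL on failure:
URL <- "http://www.pageerror.com.br"
arq <- tryCatch(
  suppressWarnings(readLines(con <- url(URL))),
  error = function(e) NULL  # swallow the error and return NULL instead
)
close(con)  # close the connection, as in the try() version
if (is.null(arq)) {
  message("Page ", URL, " is not available")
} else {
  print(arq)
}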
When I run the code below with a for loop (or with lapply) I get the following error:
Error in get_entrypoint (debug_port): Cannot connect R to Chrome. Please retry.
library(rvest)
library(xml2)    # pull html data
library(selectr) # for xpath elements
url_stackoverflow_rmarkdown <-
  'https://stackoverflow.com/questions/tagged/r-markdown?tab=votes&pagesize=50'
web_page <- read_html(url_stackoverflow_rmarkdown)
questions_per_page <- html_text(html_nodes(web_page, ".page-numbers.current"))[1]
link_questions <- html_attr(html_nodes(web_page, ".question-hyperlink")[1:questions_per_page],
                            "href")
setwd("~/WebScraping_chrome_print_to_pdf")
for (i in 1:length(link_questions)) {
  question_to_pdf <- paste0("https://stackoverflow.com",
                            link_questions[i])
  pagedown::chrome_print(question_to_pdf)
}
Is it possible to build the for loop, or use lapply, so that the code resumes from where it broke, i.e. from the last value of i, without stopping the script?
Many thanks
I adapted @Rui Barradas's tryCatch() idea.
You can try something like the code below.
IsValues will hold either the returned link value (the path of the printed PDF) or, for failed conversions, the bad index i.
IsValues <- list()
for (i in 1:length(link_questions)) {
  question_to_pdf <- paste0("https://stackoverflow.com",
                            link_questions[i])
  IsValues[[i]] <- tryCatch(
    {
      message(paste("Converting", i))
      pagedown::chrome_print(question_to_pdf)
    },
    error = function(cond) {
      message(paste("Cannot convert", i))
      # Choose a return value in case of error
      return(i)
    })
}
Then you can rbind your values and extract the bad i's:
do.call(rbind, IsValues)[!grepl("\\.pdf$", do.call(rbind, IsValues))]
[1] "3" "5" "19" "31"
You can read more about tryCatch() in this answer.
Based on your example, it looks like you have two errors to contend with. The first error is the one you mention in your question. It is also the most frequent error:
Error in get_entrypoint (debug_port): Cannot connect R to Chrome. Please retry.
The second error arises when there are links in the HTML that return 404:
Failed to generate output. Reason: Failed to open https://lh3.googleusercontent.com/-bwcos_zylKg/AAAAAAAAAAI/AAAAAAAAAAA/AAnnY7o18NuEdWnDEck_qPpn-lu21VTdfw/mo/photo.jpg?sz=32 (HTTP status code: 404)
The key phrase in the first error is "Please retry". As far as I can tell, chrome_print sometimes has issues connecting to Chrome. It seems to be fairly random, i.e. failed connections in one run will be fine in the next, and vice versa. The easiest way to get around this issue is to just keep trying until it connects.
I can't come up with any fix for the second error. However, it doesn't seem to come up very often, so it might make sense to just record it and skip to the next URL.
Using the following code I'm able to print 48 of 50 pages. The only two I can't get to work have the 404 issue I describe above. Note that I use purrr::safely to catch errors. Base R's tryCatch will also work fine, but I find safely a little more convenient. That said, in the end it's really just a matter of preference.
Also note that I've dealt with the connection error by utilizing repeat within the for loop. R will keep trying to connect to Chrome and print until it is either successful, or some other error pops up. I didn't need it, but you might want to include a counter to set an upper threshold for the number of connection attempts:
quest_urls <- paste0("https://stackoverflow.com", link_questions)
errors <- NULL
safe_print <- purrr::safely(pagedown::chrome_print)
for (qurl in quest_urls){
  repeat {
    output <- safe_print(qurl)
    if (is.null(output$error)) break
    else if (grepl("retry", output$error$message)) next
    else {errors <- c(errors, `names<-`(output$error$message, qurl)); break}
  }
}
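For the counter mentioned above, a minimal sketch might look like this (max_tries is my own name, not from the original answer):
max_tries <- 5  # hypothetical cap on connection attempts per URL
for (qurl in quest_urls){
  tries <- 0
  repeat {
    output <- safe_print(qurl)
    tries <- tries + 1
    if (is.null(output$error)) break
    else if (grepl("retry", output$error$message) && tries < max_tries) next
    else {errors <- c(errors, `names<-`(output$error$message, qurl)); break}
  }
}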
I wonder if there is a way to display the current time in the R command line, like in MS DOS, where we can use
Prompt $T $P$G
to include the time clock in every prompt line.
Something like
options(prompt=paste(format(Sys.time(), "%H:%M:%S"),"> "))
will do it, but then it is fixed at the time it was set. I'm not sure how to make it update automatically.
Chase points the right way, as options("prompt" = ...) can be used for this. But his solution adds a constant time expression, which is not what we want.
The documentation for the function taskCallbackManager has the rest:
R> h <- taskCallbackManager()
R> h$add(function(expr, value, ok, visible) {
+ options("prompt"=format(Sys.time(), "%H:%M:%S> "));
+ return(TRUE) },
+ name = "simpleHandler")
[1] "simpleHandler"
07:25:42> a <- 2
07:25:48>
We register a callback that gets evaluated after each command completes. That does the trick. More fancy documentation is in this document from the R developer site.
None of the other methods, which are based on callbacks, will update the prompt unless a top-level command is executed. So, pressing return in the console will not create a change. Such is the nature of R's standard callback handling.
If you install the tcltk2 package, you can set up a task scheduler that changes the option() as follows:
library(tcltk2)
tclTaskSchedule(1000, {options(prompt=paste(Sys.time(),"> "))}, id = "ticktock", redo = TRUE)
Voila, something like the MS DOS prompt.
NB: Inspiration came from this answer.
Note: the wait time (1000 in this case) is in milliseconds, not seconds. You might adjust it downward if sub-second resolution is ever useful.
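If you later want to stop the clock, tcltk2 also provides tclTaskDelete() for removing a scheduled task by its id (check the docs for your package version):
tclTaskDelete("ticktock")  # stop the scheduled prompt updates
options(prompt = "> ")     # restore the default prompt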
Here is an alternative callback solution:
updatePrompt <- function(...) {options(prompt=paste(Sys.time(),"> ")); return(TRUE)}
addTaskCallback(updatePrompt)
This works the same as Dirk's method, but the syntax is a bit simpler to me.
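If you capture the id when registering, base R also lets you undo the callback later; a quick sketch:
id <- addTaskCallback(updatePrompt)  # keep the id so it can be removed
removeTaskCallback(id)               # stop updating the prompt
options(prompt = "> ")               # reset the prompt itself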
You can change the default character that is displayed through the options() command. You may want to try something like this:
options(prompt = paste(Sys.time(), ">"))
Check out the help page for ?options for a full list of things you can set. It is a very useful thing to know about!
Assuming this is something you want to do for every R session, consider moving that to your .Rprofile. Several other good nuggets of programming happiness can be found hither on that topic.
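For instance, a minimal .Rprofile sketch (the callback is what keeps the time current, per the answers above):
# In ~/.Rprofile -- runs at the start of every session
.First <- function() {
  if (interactive()) {
    addTaskCallback(function(expr, value, ok, visible) {
      options(prompt = format(Sys.time(), "%H:%M:%S> "))
      TRUE  # keep the callback registered
    }, name = "promptClock")
  }
}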
I don't know of a native R function for doing this, but I know R has interfaces with other languages that do have system time commands. Maybe this is an option?
Thierry mentioned system.time() and there is also proc.time(), depending on what you need it for, although neither of these gives you the current time.
I am having some issues with R.utils::withTimeout(). It doesn't seem to take the timeout option into account at all, or only sometimes. Below is the function I want to use:
scrape_player <- function(url, time){
  raw_html <- tryCatch({
    R.utils::withTimeout({
      RCurl::getURL(url)
    },
    timeout = time, onTimeout = "warning")}
  )
  html_page <- xml2::read_html(raw_html)
}
Now when I use it:
scrape_player("http://nhlnumbers.com/player_stats/1", 1)
it either works fine and I get the html page I want, or I get an error message telling me that the elapsed time limit was reached, or, and this is my problem, it takes a very long time, way more than 1 second, to finally return an html page with an error 500.
Shouldn't RCurl::getURL() try for only 1 second (in the example) to get the html page and if not, simply return a warning? What am I missing?
OK, what I did as a workaround: instead of returning the page, I write it to disk. This doesn't solve the issue that withTimeout doesn't seem to work, but at least I can see that pages are being written to disk, slowly but surely.
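One more thing worth trying (my suggestion, not something from this thread): instead of interrupting the download from the R side, hand the timeout to libcurl itself via RCurl's curl options and catch the resulting error. A sketch, assuming the standard libcurl timeout option (in seconds):
scrape_player <- function(url, time){
  raw_html <- tryCatch(
    RCurl::getURL(url, timeout = time),  # libcurl aborts the transfer itself
    error = function(e) NA_character_
  )
  if (is.na(raw_html)) return(NULL)  # timed out or otherwise failed
  xml2::read_html(raw_html)
}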
I have a lot of different operations running on quite a big dataframe. It is starting to be a pain to maintain, especially with some data being improperly formatted, and I'm looking at some options to make my life easier.
The problem is that at one point in the flow of operations NAs are introduced in several rows, including in the id column (certainly due to some bad subsetting). Now I cannot find the culprit easily, because each time I have to str() it or View() it in RStudio... This takes time, and I already did it once without finding the bad operation...
So I'm curious whether there is a package that addresses this problem, or a way to program something "daemon-like" that pops up a warning message when a specific value appears.
A while loop doesn't help, because it evaluates all the statements in its body before rechecking the condition, so it doesn't tell me which statement is the culprit...
while(nrow(df[is.na(df$id),]) > 0){
  # statements that are OK
  # the breaking statement
  # other OK statements
}
I'll look for other options but I wanted to ask before...
EDIT: thanks for the useful comments, I'll definitely look more into those functions. In the meantime I also tried to build a watch function myself (see my answer).
OK, I guess I have finally built something quite like it: a function that sources a file line by line until a given condition is met:
watchIt <- function(file, watchexpression, startwatchline){
  line <- 1
  # read the script into a list, one line per element (note: use the
  # 'file' argument rather than a hard-coded file name)
  sourceList <- scan(file = file, what = "character", sep = "\n",
                     blank.lines.skip = FALSE)
  maxLines <- length(sourceList)
  # run without watching until the start line is reached
  while(startwatchline > line && maxLines >= line){
    cat("l")
    eval(parse(text = sourceList[line]))
    line <- line + 1
    cat(line)
    cat(" ")
  }
  # from here on, check the watch expression before evaluating each line
  while(eval(parse(text = watchexpression)) == FALSE && maxLines >= line){
    cat(" L")
    eval(parse(text = sourceList[line]))
    line <- line + 1
    cat(line)
    cat(" ")
  }
  if(maxLines <= line) {
    cat("End of file reached without the condition becoming TRUE")
  }
  else{
    cat("Condition evaluated to TRUE on line:")
    cat(line)
    cat("\n")
    cat(sourceList[line])
  }
}
So this is how I use it :
watchIt("source_test.R","nrow(df[is.na(df$id),]) > 0",10)
This puts "source_test.R" in a list, each line a new list item, and, starting from line 10, I test if the resultant dataframe as NAs in the id field. The execution stops either when the condition evaluates TRUE or when the end of the list items is reached.
Still I'm waiting for some other/better answers... Also, this is kind of my fourth function I managed to create in R, so I guess there might be ameliorations to be made to it...
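For comparison, here is a much lighter-weight sketch of the same idea (check_ids is a hypothetical helper name of my own): call an assertion between the suspect operations so the script warns, with a step label, as soon as NAs appear in id.
check_ids <- function(df, step = "unknown step") {
  # warn (or swap in stop() to halt) the moment NAs show up in the id column
  if (anyNA(df$id)) {
    warning(sprintf("NAs in 'id' after: %s", step), call. = FALSE)
  }
  invisible(df)
}
# usage: df <- check_ids(some_operation(df), step = "some_operation")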
I've been trying to run some species deletion simulations using the cheddar package and have come across an error:
Error in RemoveNodes(new.community, new.remove, title = title, method = "cascade") :
Removing these nodes would result in an empty community
You can recreate the error like this:
library(cheddar)
data(SkipwithPond)
a<-RemoveNodes(SkipwithPond,c('Detritus','Corixidae nymphs','Agabus / Ilybius larvae'),method='cascade')
I was wondering if it is possible to disable this check so as to allow the removal to occur. If not, would there be a way to return a certain value (the number of nodes in the web, in this case) when this error occurs?
I don't know much about the cheddar package, but the second option you mention amounts to "catching" the error after trying to evaluate the expression. Enter tryCatch. See the documentation for this function, but generally, when you save the result of tryCatch to a variable, you can redirect your flow to accommodate the error. Something along the lines of
# spaces possibly make code easier to read
a <- tryCatch(RemoveNodes(SkipwithPond, c('Detritus','Corixidae nymphs','Agabus / Ilybius larvae'), method='cascade'), error = function(e) e)
# str(a) to see what the error is (message, class...) and act on that message
# or if you want a custom flag value to check against
a <- tryCatch(RemoveNodes(SkipwithPond, c('Detritus','Corixidae nymphs','Agabus / Ilybius larvae'), method='cascade'), error = function(e) "empty community?")
if (identical(a, "empty community?")) {
  # ...do something
}