I am looking for a tryCatch function in R that would retry n times instead of just once. One of my web requests occasionally fails to return a value when the server is busy, but after one or two retries it usually works fine.
The excellent page How to write trycatch in R does not touch on this topic. I found the function TryRetry in C (originally discussed in TryRetry - Try, Catch, then Retry), which accomplishes what I was looking for, and I thought a similar function might exist in some R package too.
Unfortunately, I don't have the skills to abstract an R code structure from the C example. I could just re-call my function in the error-handling portion of the tryCatch, but somehow this seems the wrong way to go, especially once you deal with more than one retry.
Any suggestions on how to approach a tryRetry-code structure in R would be appreciated.
You can implement retry logic by relying on the RETRY() function from the httr package and parsing the response in a second step.
To apply it to a file download, I would go down the following path (using this hosted .csv file as an example):
library(httr)
library(dplyr)
df <- RETRY(
"GET",
url = "https://www.stats.govt.nz/assets/Uploads/Business-operations-survey/Business-operations-survey-2018/Download-data/business-operations-survey-2018-business-finance-csv.csv",
times = 3) %>% # max retry attempts
content(., "parsed")
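If the server needs more breathing room between attempts, RETRY() also exposes backoff controls (see ?httr::RETRY); here is a sketch with arbitrary values:
# Same request, with explicit control over the exponential backoff between attempts
resp <- RETRY(
  "GET",
  url = "https://www.stats.govt.nz/assets/Uploads/Business-operations-survey/Business-operations-survey-2018/Download-data/business-operations-survey-2018-business-finance-csv.csv",
  times = 5,       # maximum number of attempts
  pause_base = 2,  # backoff starts at roughly 2 seconds
  pause_cap = 30,  # and is capped at 30 seconds
  quiet = FALSE)   # report each retry as it happens
df <- content(resp, "parsed")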
Here is a way of having a web read request tried several times before failing. It's an adaptation of the post linked to in the question, called in a loop a number of times chosen by the user. Between each try there is a Sys.sleep() defaulting to 3 seconds.
I repost the function readUrl with changes, and with many of the comments deleted; they are in the original code.
readUrl <- function(url) {
out <- tryCatch(
{
message("This is the 'try' part")
text <- readLines(con=url, warn=FALSE)
return(list(ok = TRUE, contents = text))
},
error=function(cond) {
message(paste("URL does not seem to exist:", url))
message("Here's the original error message:")
message(paste(cond, "\n"))
# Choose a return value in case of error
return(list(ok = FALSE, contents = cond))
},
warning=function(cond) {
message(paste("URL caused a warning:", url))
message("Here's the original warning message:")
message(paste(cond, "\n"))
# Choose a return value in case of warning
return(list(ok = FALSE, contents = cond))
},
finally={
message(paste("Processed URL:", url))
message("Some other message at the end")
}
)
return(out)
}
readUrlRetry <- function(url, times = 1, secs = 3){
count <- 0L
while(count < times){
res <- readUrl(url)
count <- count + 1L
OK <- res$ok
if(OK) break
Sys.sleep(time = secs)
}
res
}
url <- c(
"http://stat.ethz.ch/R-manual/R-devel/library/base/html/connections.html",
"http://en.wikipedia.org/wiki/Xz",
"xxxxx")
res <- lapply(url, readUrlRetry, times = 3)
res[[3]]
inherits(res[[3]]$contents, "warning")
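For a structure that is not tied to readUrl(), a minimal generic wrapper along the same lines could look like this (just a sketch; retry() is a made-up name here, not a function from a package). It evaluates a zero-argument function, retries on error up to times attempts with a pause in between, and re-raises the last error if every attempt fails.
retry <- function(f, times = 3, secs = 3) {
  for (attempt in seq_len(times)) {
    res <- tryCatch(list(ok = TRUE, value = f()),
                    error = function(e) list(ok = FALSE, error = e))
    if (res$ok) return(res$value)
    if (attempt < times) Sys.sleep(secs)
  }
  stop(res$error)
}
# Usage, e.g.:
# retry(function() readLines("http://en.wikipedia.org/wiki/Xz", warn = FALSE))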
Related
I have a list of URLs (more than 4000) from a specific domain (pixilink.com), and what I want to do is figure out whether each provided URL points to a picture or a video. To do this, I used the solutions provided here: How to write trycatch in R and Check whether a website provides photo or video based on a pattern in its URL, and wrote the code shown below:
#Function to get the value of initial_mode from the URL
urlmode <- function(x){
mycontent <- readLines(x)
mypos <- grep("initial_mode = ", mycontent)
if(grepl("0", mycontent[mypos])){
return("picture")
} else if(grepl("tour", mycontent[mypos])){
return("video")
} else{
return(NA)
}
}
Also, to prevent errors for URLs that don't exist, I used the code below:
readUrl <- function(url) {
out <- tryCatch(
{
readLines(con=url, warn=FALSE)
return(1)
},
error=function(cond) {
return(NA)
},
warning=function(cond) {
return(NA)
},
finally={
message( url)
}
)
return(out)
}
Finally, I split the list of URLs and passed it to the functions described above (here, for instance, I used the first 1000 values from the URL list):
a <- subset(new_df, new_df$host=="www.pixilink.com")
vec <- a[['V']]
vec <- vec[1:1000] # only chose first 1000 rows
tt <- numeric(length(vec)) # checking validity of url
for (i in 1:length(vec)){
tt[i] <- readUrl(vec[i])
print(i)
}
g <- data.frame(vec,tt)
g2 <- g[which(!is.na(g$tt)),] #only valid url
dd <- numeric(nrow(g2))
for (j in 1:nrow(g2)){
dd[j] <- urlmode(g2[j,1])
}
Final <- cbind(g2,dd)
Final <- dplyr::left_join(g, Final, by = c("vec" = "vec"))
I ran this code on a sample list of 100 URLs and it worked; however, after I ran it on the whole list of URLs, it returned an error. Here is the error: Error in textConnection("rval", "w", local = TRUE) : all connections are in use
And after this, even for the sample URLs (the 100 samples I tested before), running the code gave this error message: Error in file(con, "r") : all connections are in use
I also tried closeAllConnections() after each call of the functions in the loop, but it didn't work.
Can anyone explain what this error is about? Is it related to the number of requests we can make to the website? What's the solution?
So, my guess as to why this is happening is that you're not closing the connections that you're opening via tryCatch() and via urlmode() through the use of readLines(). I was unsure of how urlmode() was going to be used in your previous post, so I had made it as simple as I could (and in hindsight, that was badly done, my apologies). So I took the liberty of rewriting urlmode() to try and make it a little bit more robust for what appears to be a more expansive task at hand.
I think the comments in the code should help, so take a look below:
#Updated URL mode function with better
#URL checking, connection handling,
#and "mode" investigation
urlmode <- function(x){
#Check if URL is good to go
if(!httr::http_error(x)){
#Test cases
#x <- "www.pixilink.com/3"
#x <- "https://www.pixilink.com/93320"
#x <- "https://www.pixilink.com/93313"
#Then since there are redirect shenanigans
#Get the actual URL the input points to
#It should just be the input URL if there is
#no redirection
#This is important as this also takes care of
#checking whether http or https need to be prefixed
#in case the input URL is supplied without those
#(this can cause problems for url() below)
myx <- httr::HEAD(x)$url
#Then check for what the default mode is
mycon <- url(myx)
open(mycon, "r")
mycontent <- readLines(mycon)
mypos <- grep("initial_mode = ", mycontent)
#Close the connection since it's no longer
#necessary
close(mycon)
#Some URLs with weird formats can return
#empty on this one since they don't
#follow the expected format.
#See for example: "https://www.pixilink.com/clients/899/#3"
#which is actually
#redirected from "https://www.pixilink.com/3"
#After that, evaluate what's at mypos, and always
#return the actual URL
#along with the result
if(!purrr::is_empty(mypos)){
#mystr<- stringr::str_extract(mycontent[mypos], "(?<=initial_mode\\s\\=).*")
mystr <- stringr::str_extract(mycontent[mypos], "(?<=\').*(?=\')")
return(c(myx, mystr))
#return(mystr)
#So once all that is done, check if the line at mypos
#contains a 0 (picture), tour (video)
#if(grepl("0", mycontent[mypos])){
# return(c(myx, "picture"))
#return("picture")
#} else if(grepl("tour", mycontent[mypos])){
# return(c(myx, "video"))
#return("video")
#}
} else{
#Valid URL but not interpretable
return(c(myx, "uninterpretable"))
#return("uninterpretable")
}
} else{
#Straight up invalid URL
#No myx variable to return here
#Just x
return(c(x, "invalid"))
#return("invalid")
}
}
#--------
#Sample code execution
library(purrr)
library(parallel)
library(future.apply)
library(httr)
library(stringr)
library(progressr)
library(progress)
#All future + progressr related stuff
#learned courtesy
#https://stackoverflow.com/a/62946400/9494044
#Setting up parallelized execution
no_cores <- parallel::detectCores()
#The above setup will ensure ALL cores
#are put to use
clust <- parallel::makeCluster(no_cores)
future::plan(cluster, workers = clust)
#Progress bar for sanity checking
progressr::handlers(progressr::handler_progress(format="[:bar] :percent :eta :message"))
#Website's base URL
baseurl <- "https://www.pixilink.com"
#Using future_lapply() to apply urlmode()
#to a sequence of the URLs on pixilink in parallel
#and storing the results in sitetype
#Using a future chunk size of 10
#Everything is wrapped in with_progress() to enable the
#progress bar
#
range <- 93310:93350
#range <- 1:10000
progressr::with_progress({
myprog <- progressr::progressor(along = range)
sitetype <- do.call(rbind, future_lapply(range, function(b, x){
myprog() ##Progress bar signaller
myurl <- paste0(b, "/", x)
cat("\n", myurl, " ")
myret <- urlmode(myurl)
cat(myret, "\n")
return(c(myurl, myret))
}, b = baseurl, future.chunk.size = 10))
})
#Converting into a proper data.frame
#and assigning column names
sitetype <- data.frame(sitetype)
names(sitetype) <- c("given_url", "actual_url", "mode")
#A bit of wrangling to tidy up the mode column
sitetype$mode <- stringr::str_replace(sitetype$mode, "0", "picture")
head(sitetype)
# given_url actual_url mode
# 1 https://www.pixilink.com/93310 https://www.pixilink.com/93310 invalid
# 2 https://www.pixilink.com/93311 https://www.pixilink.com/93311 invalid
# 3 https://www.pixilink.com/93312 https://www.pixilink.com/93312 floorplan2d
# 4 https://www.pixilink.com/93313 https://www.pixilink.com/93313 picture
# 5 https://www.pixilink.com/93314 https://www.pixilink.com/93314 floorplan2d
# 6 https://www.pixilink.com/93315 https://www.pixilink.com/93315 tour
unique(sitetype$mode)
# [1] "invalid" "floorplan2d" "picture" "tour"
#--------
Basically, urlmode() now opens and closes connections only when necessary, checks for URL validity and URL redirection, and also "intelligently" extracts the value assigned to initial_mode. With the help of future_lapply() and the progress bar from the progressr package, this can now be applied quite conveniently in parallel to as many pixilink.com/<integer> URLs as desired. With a bit of wrangling thereafter, the results can be presented very tidily as a data.frame, as shown.
As an example, I've demonstrated this for a small range in the code above. Note the commented-out 1:10000 range in the code in this context: I let this code run for a couple of hours over this (hopefully sufficiently large) range of URLs to check for errors and problems. I can attest that I encountered no errors (only the regular warnings In readLines(mycon) : incomplete final line found on 'https://www.pixilink.com/93334'). For proof, I have the data from all 10000 URLs written to a CSV file that I can provide upon request (I don't fancy uploading that to pastebin or elsewhere unnecessarily). Due to an oversight on my part, I forgot to benchmark that run, but I suppose I could do that later if performance metrics are desired or would be considered interesting.
For your purposes, I believe you can simply take the entire code snippet above and run it verbatim (or with modifications) by just changing the range assignment right before the with_progress() step to a range of your liking. I believe this approach is simpler and does away with having to deal with multiple functions and such (and no tryCatch() messes to deal with).
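As a side note on the connection handling itself: if you do keep explicit url() connections anywhere, wrapping the close in on.exit() guarantees the connection is released even when readLines() errors, so repeated failures cannot exhaust the connection pool. A small sketch (read_lines_safely is just an illustrative name):
read_lines_safely <- function(myx) {
  mycon <- url(myx)
  on.exit(close(mycon), add = TRUE)  # always close, even on error
  open(mycon, "r")
  readLines(mycon, warn = FALSE)
}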
I'm using the googlesheets package to work with some spreadsheets, and I'm facing the following small problem:
Firstly, I'm downloading the document:
spreadsheet <- gs_title("Spreadsheet")
OK, then I'm getting (or trying to get) each one of the worksheets:
a <- gs_read(spreadsheet, ws = "a")
b <- gs_read(spreadsheet, ws = "b")
c <- gs_read(spreadsheet, ws = "c")
d <- gs_read(spreadsheet, ws = "d")
e <- gs_read(spreadsheet, ws = "e")
When I try to do this, the following happens recurrently:
no problems reading the first worksheets (normally "a" and "b")
when it's time to read "c", it returns the following error:
Accessing worksheet titled 'c'.
Downloading: 1.1 kB Error in function_list[k] :
Too Many Requests (RFC 6585) (HTTP 429).
For now I'm overcoming this in the simplest way: retrying as many times as needed until it reads the troublesome worksheets.
I've been wondering whether it's possible to create a loop that makes R try and retry the gs_read() call until I get my desired outcome, instead of doing the same thing manually as is currently happening.
You could make use of tryCatch
read_spreadsheet <- function(spreadsheet, ws) {
tryCatch(
gs_read(spreadsheet, ws = ws),
warning = function(war) {
message(war)
return(NULL)
},
error = function(err) {
message(err)
return(NULL)
}
)
}
Then write your loop, e.g.
repeat {
file <- read_spreadsheet(spreadsheet, ws = "a")
if(!is.null(file)) break
}
(Though you need to be very sure that it will eventually succeed, otherwise the repeat loop won't stop.)
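If you are worried about that, a bounded variant is straightforward (a sketch; max_tries and the 2-second pause are arbitrary choices):
max_tries <- 5
tries <- 0
file <- NULL
while (is.null(file) && tries < max_tries) {
  tries <- tries + 1
  file <- read_spreadsheet(spreadsheet, ws = "a")
  if (is.null(file)) Sys.sleep(2)  # short pause before trying again
}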
Edit: Warning handler
To obtain the output despite warnings, repeat the function in the warning section:
(If you want, you can also add suppressWarnings(); the warnings are still surfaced through message() here.)
read_spreadsheet <- function(spreadsheet, ws) {
tryCatch(
gs_read(spreadsheet, ws = ws),
warning = function(war) {
message(war)
return(
suppressWarnings(gs_read(spreadsheet, ws = ws))
)
},
error = function(err) {
message(err)
return(NULL)
}
)
}
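To read all five worksheets from the question with this helper, you could wrap the retry loop in lapply() (a sketch; the 5-second pause is an arbitrary choice to give the API some room after an HTTP 429):
ws_names <- c("a", "b", "c", "d", "e")
sheets <- lapply(ws_names, function(ws) {
  repeat {
    out <- read_spreadsheet(spreadsheet, ws = ws)
    if (!is.null(out)) return(out)
    Sys.sleep(5)  # back off before hitting the API again
  }
})
names(sheets) <- ws_names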
FYI, based on some comments I added more information.
I created the following function that is making a call to an API:
keyword_checker <- function(keyword, domain, loc, lang){
  keyword_to_check <- as.character(keyword)
  api_request <- paste("https://script.google.com/blabalbalba",
                       "?kw=", keyword,
                       "&domain=", domain,
                       "&loc=", loc,
                       "&lang=", lang, sep = "")
  api_request <- URLencode(api_request, repeated = TRUE)
  source <- fromJSON(file = api_request) # JSON file into data frame
  return(data.frame(do.call("rbind", source$data$result))) ## in order to extract only the "results" data
}
I am using the foreach package with %dopar% and doSNOW to make many API calls (more than 120k).
Unfortunately, some errors occur (usually connection timeouts), which makes the script stop. To avoid this problem I used .errorhandling = 'pass'. Now the script doesn't stop, but I would like to know if there is a way to retry the API call until I get an answer.
Here is my script:
cl <- makeCluster(9)
registerDoSNOW(cl)
final_urls_checker <- foreach(i = 1:length(mes_urls_to_check), .combine=rbind, .errorhandling = 'pass', .packages='rjson') %dopar% {
test_keyword <- as.character(mes_urls_to_check[i])
results <- indexed_url(test_keyword)} ##name of my function
##Stop cluster
stopCluster(cl)
I basically want my script to continue (without stopping the whole process) until I get the answer from the API call.
Do I need to incorporate tryCatch within the foreach, or is it better to "upgrade" the function I created by adding something like "if the API doesn't give an answer, then wait until it does"?
I hope this is clearer.
Try using tryCatch inside the foreach loop to catch the expected error messages (here, a failed API call due to a timeout). Below is a sample code snippet for the given function keyword_checker, based on my understanding.
library(foreach)
library(doSNOW)
cl <- makeCluster(9)
registerDoSNOW(cl)
final_urls_checker <- foreach(i = 1:length(mes_urls_to_check), .combine = rbind, .errorhandling = 'pass',
                              .packages = 'rjson') %dopar% {
  test_keyword <- as.character(mes_urls_to_check[i])
  results <- tryCatch(
    {
      keyword_checker(test_keyword)  ## name of your function
    },
    error = function(cond) {
      message("Timeout error! Calling again...")
      dmy2 <- keyword_checker(test_keyword)
      return(dmy2)
    },
    warning = function(cond) {
      message("Warning message:")
      message(cond)
      return(NULL)
    },
    finally = {
      message(paste("Successfully called API", test_keyword))
    }
  )
  results
}
##Stop cluster
stopCluster(cl)
Here's a link which explains how to write tryCatch. Note that this snippet may not work exactly as written, since I didn't run the code block, but calling the API again when the call fails should do the job.
Check this link for a discussion of a similar issue.
Here is an updated script including the tryCatch directly in the function.
indexed_url <- function(url){
url_to_check <- as.character(url)
api_request <- paste("https://script.google.com/macros/blablabalbalbaexec",
"?page=",url_to_check,sep="")
api_request <- URLencode(api_request, repeated = TRUE)
source <- tryCatch({
fromJSON(file = api_request) # convert the JSON file into a data frame
}, error = function(e) {
cat(paste0("Une erreur a eu lieu :",e))
Sys.sleep(1)
indexed_url(url)
})
return(data.frame(do.call("rbind", source)))
}
Then running the foreach just the way it was works perfectly. No more errors, and I have the full analysis.
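One caveat with the recursive retry is that a permanently failing URL would recurse forever. A capped variant could look like this (a sketch; indexed_url_capped and the limit of 5 tries are my own additions, not part of the original script):
indexed_url_capped <- function(url, tries_left = 5) {
  url_to_check <- as.character(url)
  api_request <- paste("https://script.google.com/macros/blablabalbalbaexec",
                       "?page=", url_to_check, sep = "")
  api_request <- URLencode(api_request, repeated = TRUE)
  source <- tryCatch({
    fromJSON(file = api_request)  # convert the JSON response into an R object
  }, error = function(e) {
    if (tries_left <= 1) stop(e)  # give up after the last allowed attempt
    cat(paste0("An error occurred: ", conditionMessage(e), "\n"))
    Sys.sleep(1)
    indexed_url_capped(url, tries_left - 1)
  })
  return(data.frame(do.call("rbind", source)))
}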
I am trying to write a function that cleans spreadsheets. However, some of the spreadsheets are corrupted and will not open. I want the function to recognize this, print an error message, skip execution of the rest of the function, and continue to the next file (since I am using lapply() to iterate across files). My current attempt looks like this:
candidate.cleaner <- function(filename){
#this function cleans candidate data spreadsheets into an R dataframe
#dependency check
library(readxl)
#read in
cand_df <- tryCatch(read_xls(filename, col_names = F),
error = function (e){
warning(paste(filename, "cannot be opened; corrupted or does not exist"))
})
print(filename)
#rest of function
cand_df[1,1]
}
test_vec <- c("test.xls", "test2.xls", "test3.xls")
lapply(FUN = candidate.cleaner, X = test_vec)
However, this still executes the lines of the function after the tryCatch statement when given a .xls file that does not exist, which throws an error since I'm attempting to index a data frame that doesn't exist. This exits the lapply call. How can I write the tryCatch call so that it skips execution of the rest of the function without exiting lapply?
One could set a semaphore at the start of the tryCatch() indicating that things have gone OK so far, then handle the error and signal that things have gone wrong, and finally check the semaphore and return from the function with an appropriate value.
lapply(1:5, function(i) {
value <- tryCatch({
OK <- TRUE
if (i == 2)
stop("stopping...")
i
}, error = function(e) {
warning("oops: ", conditionMessage(e))
OK <<- FALSE # assign in parent environment
}, finally = {
## return NA on error
OK || return(NA)
})
## proceed
value * value
})
This allows one to continue using the tryCatch() infrastructure, e.g., to translate warnings into errors. The tryCatch() block encapsulates all the relevant code.
It turns out this can be accomplished in a simple way with try() and an additional helper function.
candidate.cleaner <- function(filename){
#this function cleans candidate data spreadsheets into an R dataframe
#dependency check
library(readxl)
#read in
cand_df <- try(read_xls(filename, col_names = F))
if(is.error(cand_df)){
return(list("Corrupted: rescrape", filename))
} else {
#storing election name for later matching
election_name <- cand_df[1,1]
}
}
Where is.error() is taken from Hadley Wickham's Advanced R chapter on debugging. It's defined as:
is.error <- function(x) inherits(x, "try-error")
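For completeness, corrupted or missing files then simply come back as the "Corrupted: rescrape" list entries instead of aborting the loop, e.g. (same test_vec as above):
results <- lapply(FUN = candidate.cleaner, X = test_vec)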
What I want to do is read output files and extract some values from them. Some of the files don't exist, so I use tryCatch() in my program to catch those "errors"; my program then returns an NA value and continues reading the next file. But when I execute my program, it reports the error "all connections are in use". I have tried to find answers online, but there is no good answer for my question. If you can solve my problem, please give your advice! Thank you very much!
prop.protec <- 0.5
num.rep <- 500  # moved up: needed before tpw is allocated
pat <- read.csv("~/par.csv", header = FALSE) # parameter file
pat <- as.matrix(pat)
tpw <- matrix(NA, nrow = 36, ncol = num.rep) # used to store p-values
for (i in 1:36){
# following are 4 parameters
dis.mod <- pat[i,1]
herit.tot <- pat[i,2]
bin <- pat[i,3]
op <- pat[i,4]
for(reps in 1:num.rep){
tryCatch(
{
res.file <- paste("~/z-out-",op, "-", dis.mod, "-",bin,"-",herit.tot,"-",prop.protec,"-",reps, ".extended.qls.res", sep = "")
res.dat <- read.table(file = res.file, header = TRUE)
tpw[i,reps] <- res.dat$P_MFQLS
},
warning=function(cond) {
message("Here's the original warning message:")
message(cond)
},
error = function(e){
message("Here's the original error message:")
message(e)} )
}
}
I have found the answer, a very simple and wonderful one: we just need to add closeAllConnections() at the end of the inner loop. That works very well! I spent a lot of time on this, so when I found this answer and it worked, I believe you can understand my feeling.
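For clarity, the placement looks like this, a sketch of the inner loop only:
for (reps in 1:num.rep) {
  tryCatch(
    {
      res.file <- paste("~/z-out-", op, "-", dis.mod, "-", bin, "-", herit.tot, "-",
                        prop.protec, "-", reps, ".extended.qls.res", sep = "")
      res.dat <- read.table(file = res.file, header = TRUE)
      tpw[i, reps] <- res.dat$P_MFQLS
    },
    warning = function(cond) message(cond),
    error = function(e) message(e)
  )
  closeAllConnections()  # release any connections left open by a failed read
}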