Using R tryCatch to download PDFs off the web - r

I'm trying to download PDFs using a table containing links. Due to inconsistent formatting of the links, I have created different versions of the same links residing in different columns. For privacy reasons I can't disclose the links, but here is what I did.
links <- data.frame(links1, links2, links3, links4)
filenames <- str_c(format(seq.Date(from = as.Date("2015-04-01"),
                                   to = Sys.Date(), by = "day"),
                          "%Y_%m_%d"), ".pdf")
After creating all the link versions and the file names, I try to write a loop wrapped in tryCatch so it keeps going even when a link is not correct. My goal is that when the link in, say, links$links1[3] doesn't work, the loop looks at the other columns in the same row to find a working link.
Here is my try:
for (i in seq_along(links[,1])) {
  # using tryCatch to bypass the error when the url doesn't exist
  tryCatch({
    if (!file.exists(str_c(folder, "/", filenames[i]))) {
      download.file(links[i,1], filenames[i], mode = "wb")
      print(paste0("Downloading: ", filenames[i]))
    }
  }, error = function(e) {
    for (j in seq_along(links[i,])) {
      tryCatch({
        download.file(links[i,j], filenames[i], mode = "wb")
      }, error = function(e) {})
    }
  })
}
For some reason it's not picking up PDFs uploaded on April 9th 2015, and possibly other dates too.

The line for (j in seq_along(links[i,])){ is causing the inner loop to retry the already failed link. If the link fails, it will therefore fail again in the inner loop. Your program continues happily, never having tried the other links.
You should skip over j = 1 in your inner for loop.
Here's a slightly modified version of your program showing what is happening.
links1 <- c('a','b','c')
links2 <- c('x','y','z')
links <- data.frame(links1, links2)
for (i in seq_along(links[,1])) {
  # using tryCatch to bypass the error when the url doesn't exist
  tryCatch({
    print(sprintf("trying: %s", links[i,1]))
    if (i == 2) {
      stop(simpleError("error"))
    }
  },
  error = function(e) {
    for (j in seq_along(links[i,])) {
      tryCatch({
        print(sprintf("falling back to %s", links[i,j]))
      },
      error = function(e) {})
    }
  })
}
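Putting the two pieces together, a minimal sketch of the corrected loop, assuming the folder, links and filenames objects and the stringr str_c() call from the question; since column 1 has already failed by the time the handler runs, the fallback starts at column 2 and stops at the first link that works:
for (i in seq_along(links[,1])) {
  tryCatch({
    if (!file.exists(str_c(folder, "/", filenames[i]))) {
      download.file(links[i,1], filenames[i], mode = "wb")
      print(paste0("Downloading: ", filenames[i]))
    }
  }, error = function(e) {
    # column 1 already failed above, so begin the fallback at column 2
    for (j in 2:ncol(links)) {
      ok <- tryCatch({
        download.file(links[i,j], filenames[i], mode = "wb")
        TRUE
      }, error = function(e) FALSE)
      if (ok) break  # keep the first link that downloads successfully
    }
  })
}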

Related

R Language scraper - returns Error in value[[3L]](cond) : no loop for break/next, jumping to top level - loop issue

I'm trying to iterate through a list of proxies - for now just a few in a vector. I put this code together with some help from the internet to switch proxies whenever one hits error 429, but it throws the error in the title and I can't figure out why. I have seen some similar questions and answers here, but unfortunately I can't map them onto my problem, as R is still new to me. There's an issue with the loop, but I don't know where. Thanks in advance!
build_oem_table <- function(...) {
  proxies <- c("203.24.108.170:80", "172.67.182.165:80", "45.12.30.84:80", "203.28.8.207:80")
  for (i in seq_along(proxies)) {
    proxy <- proxies[i]
    response <- tryCatch({
      response <- GET("https://www.gsmarena.com/", config(proxy = paste0("http://", proxy)))
      if (status_code(response) != 429) {
        break
      }
    }, error = function(e) {
      next
    })
    if (status_code(response) != 429) {
      break
    }
  }
  if (status_code(response) == 429) {
    stop("All proxies failed with HTML error 429")
  }
  sesh <- session("https://www.gsmarena.com/makers.php3")
  makers <- read_html(sesh)
  makers <- read_html("C:\\Users\\dex\\Downloads\\gsm\\List of all mobile phone brands - GSMArena.com.html")
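The error in the question's title usually comes from the next call inside the tryCatch error handler: the handler is its own function, so from R's point of view there is no enclosing loop to jump to. A hedged sketch of one way the proxy loop might be restructured, assuming the httr calls from the question and treating a connection error the same as a 429 response:
library(httr)
proxies <- c("203.24.108.170:80", "172.67.182.165:80", "45.12.30.84:80", "203.28.8.207:80")
response <- NULL
for (proxy in proxies) {
  # return NULL from the handler instead of calling next inside it
  response <- tryCatch(
    GET("https://www.gsmarena.com/", config(proxy = paste0("http://", proxy))),
    error = function(e) NULL
  )
  # decide about break out here, where the loop is visible to R
  if (!is.null(response) && status_code(response) != 429) {
    break
  }
}
if (is.null(response) || status_code(response) == 429) {
  stop("All proxies failed with HTTP error 429")
}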

How to redo tryCatch after error in for loop

I am trying to implement tryCatch in a for loop.
The loop is built to download data from a remote server. Sometimes the server stops responding (when the query is big).
I have implemented tryCatch to keep the loop going.
I have also added a Sys.sleep() pause when an error occurs, in order to wait a few minutes before sending the next query to the remote server (that part works).
The problem is that I can't figure out how to make the loop redo the query that failed and triggered the tryCatch error (and the Sys.sleep()).
for (i in 1:1000) {
  tmp <- tryCatch({download_data(list$tool[i])},
                  error = function(e) {Sys.sleep(800)})
}
Could you give me some hints?
You can do something like this:
for (i in 1:1000) {
  download_finished <- FALSE
  while (!download_finished) {
    tmp <- tryCatch({
      res <- download_data(list$tool[i])
      download_finished <- TRUE
      res  # return the data, not the flag, as the value of the tryCatch
    },
    error = function(e) {Sys.sleep(800)})
  }
}
If you are certain that waiting for 800 seconds always fixes the issue, this change should do it.
for (i in 1:1000) {
  tmp <- tryCatch({
    download_data(list$tool[i])
  },
  error = function(e) {
    Sys.sleep(800)
    download_data(list$tool[i])  # retry once after the pause
  })
}
A more sophisticated approach could be, to collect the information of which request failed and then rerun the script until all requests succeed.
One way to do this is to use the possibly() function from the purrr package. It would look something like this:
library(purrr)
todo <- rep(TRUE, length(list$tool))
res <- list()
while (any(todo)) {
  res[todo] <- map(list$tool[todo],
                   possibly(download_data, otherwise = NA))
  # a failed call leaves the scalar NA placeholder from possibly()
  todo <- map_lgl(res, ~ identical(.x, NA))
}
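Along the same lines, purrr also provides insistently(), which retries a function for you; a minimal sketch, assuming the download_data() function and list$tool from the question:
library(purrr)
# retry each download up to 5 times, pausing 800 seconds between attempts
insistent_download <- insistently(download_data,
                                  rate = rate_delay(pause = 800, max_times = 5),
                                  quiet = FALSE)
# possibly() keeps the loop alive if a download still fails after all retries
res <- map(list$tool, possibly(insistent_download, otherwise = NA))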

How to check if a URL object is reachable or not using tryCatch in R

I have the following URL objects and need to check if they are reachable before downloading and processing the CSV files. I can't hard-code the URLs, as they keep changing based on previous steps.
My requirement is: read the link if it is reachable, else throw an error and go on to the next link.
url1 <- "https://s3.mydata.csv"
url2 <- "https://s4.mydata.csv"
url3 <- "https://s5.mydata.csv"
(The code below will be repeated for the other two URLs as well.)
readUrl <- function(url1) {
  out <- tryCatch(
    {
      readLines(con = url1, warn = FALSE)
    },
    error = function(cond) {
      message(cond)
      return(NA)
    },
    finally = {
      dataread <- data.table::fread(url1, sep = ",", header = TRUE, verbose = TRUE,
                                    fill = TRUE, skip = 2)
    }
  )
  return(out)
}
y <- lapply(urls, readUrl)
Why not use the function url.exists directly from the RCurl package?
From the documentation:
This function is analogous to file.exists and determines whether a request for a specific URL responds without error.
See the url.exists help page in RCurl for details.
Using the boolean result of this function you can easily adapt your starting code without tryCatch.
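A minimal sketch along those lines, assuming the three URLs and the fread() call from the question:
library(RCurl)
urls <- c(url1, url2, url3)
read_if_reachable <- function(u) {
  if (url.exists(u)) {
    data.table::fread(u, sep = ",", header = TRUE, fill = TRUE, skip = 2)
  } else {
    message("Not reachable, skipping: ", u)
    NA
  }
}
y <- lapply(urls, read_if_reachable)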

TryCatch with parLapply (Parallel package) in R

I am trying to run something on a very large dataset. Basically, I want to loop through all files in a folder and run the function fromJSON on each of them. However, I want it to skip over files that produce an error. I have built a function using tryCatch; however, that only works when I use lapply and not parLapply.
Here is my code for my exception handling function:
readJson <- function(file) {
  require(jsonlite)
  dat <- tryCatch(
    {
      fromJSON(file, flatten = TRUE)
    },
    error = function(cond) {
      message(cond)
      return(NA)
    },
    warning = function(cond) {
      message(cond)
      return(NULL)
    }
  )
  return(dat)
}
and then I call parLapply on a character vector files which contains the full paths to the JSON files:
dat <- parLapply(cl, files, readJson)
That produces an error when it reaches a file that doesn't end properly, and it does not create the list dat by skipping over the problematic file, which is what the readJson function was supposed to do.
When I use regular lapply, however, it works perfectly fine: it prints the errors, but it still creates the list by skipping over the erroneous files.
Any ideas on how I could use exception handling with parLapply from the parallel package so that it skips over the problematic files and still generates the list?
In your error handler function cond is an error condition. message(cond) signals this condition, which is caught on the workers and transmitted as an error to the master. Either remove the message calls or replace them with something like
message(conditionMessage(cond))
You won't see anything on the master though, so removing is probably best.
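For reference, a sketch of the question's readJson() with that one change applied (keeping the original NA/NULL return values):
readJson <- function(file) {
  require(jsonlite)
  tryCatch(
    fromJSON(file, flatten = TRUE),
    error = function(cond) {
      # conditionMessage() only extracts the text; message(cond) would re-signal the condition
      message(conditionMessage(cond))
      NA
    },
    warning = function(cond) {
      message(conditionMessage(cond))
      NULL
    }
  )
}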
What you could do is something like this (with another example, reproducible):
test1 <- function(i) {
  dat <- NA
  try({
    if (runif(1) < 0.8) {
      dat <- rnorm(i)
    } else {
      stop("Error!")
    }
  })
  return(dat)
}

cl <- parallel::makeCluster(3)
dat <- parallel::parLapply(cl, 1:100, test1)
See this related question for other solutions. I think using foreach with .errorhandling = "pass" would be another good solution.
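A minimal sketch of that foreach variant, assuming the cluster cl and the files vector from the question; with .errorhandling = "pass" a failing task stores its error condition in the result instead of aborting the whole call:
library(foreach)
library(doParallel)
registerDoParallel(cl)
dat <- foreach(f = files,
               .packages = "jsonlite",
               .errorhandling = "pass") %dopar% {
  fromJSON(f, flatten = TRUE)
}
# elements that failed hold an error condition; inspect or drop them afterwards
failed <- vapply(dat, inherits, logical(1), what = "error")
dat[failed]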

jsonlite working in plain code, but not as a part of a function

I stumbled upon an error:
> getBDLsearch("czas")
Error in file(con, "r") : cannot open the connection
...so I started a teardown to find where the problem is in the function. It's very simple, so I'll just paste it:
require(htmltools)
getBDLsearch <- function(query = "", debug = 0, raw = FALSE) {
  url <- paste0('https://api.mojepanstwo.pl/bdl/search?q=', htmlEscape(query))
  if (raw) {
    document <- jsonlite::fromJSON(txt = url, simplifyVector = FALSE)
    return(document)
  } else {
    document <- jsonlite::fromJSON(txt = url, simplifyDataFrame = TRUE)
    return(document)
  }
}
( https://github.com/pbiecek/SmarterPoland )
The thing is, when I run the same lines manually one by one, it works like a charm and the variable document gets filled in nicely. I'm curious: why is that?
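In keeping with the tryCatch theme of this page, one way to see what is actually failing inside the function is to wrap the fromJSON() call and print the full condition; a minimal sketch, reusing the URL construction from getBDLsearch():
library(htmltools)
query <- "czas"
url <- paste0('https://api.mojepanstwo.pl/bdl/search?q=', htmlEscape(query))
document <- tryCatch(
  jsonlite::fromJSON(txt = url, simplifyDataFrame = TRUE),
  error = function(e) {
    # show the full condition message instead of the bare "cannot open the connection"
    message("fromJSON failed for ", url, ": ", conditionMessage(e))
    NULL
  }
)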
