Check if URL exists in R

I want to loop over a list of URLs and I want to find out if these URLs exist or not.
RCurl provides the url.exists() function. However, its output doesn't seem to be right: for example, it says that amazon.com is not registered. It does so because url.exists() doesn't see a status code in the 200 range; in the case of amazon.com it gets 405 ("method not allowed").
I also tried HEAD() and GET() from the httr package, but sometimes these throw errors, for example on timeouts or when the host cannot be resolved. The error messages look like this:
Error in curl::curl_fetch_memory(url, handle = handle) :
Timeout was reached: Connection timed out after 10000 milliseconds
Error in curl::curl_fetch_memory(url, handle = handle) :
Could not resolve host: afsadadssadasf.com
When I get such an error, the whole for loop stops. Is it possible to continue the for loop? I tried tryCatch(), but to my knowledge that can only help when the problem is in the data frame itself.

pingr::ping() only uses ICMP, which is blocked on sane organizational networks since attackers have used ICMP as a way to exfiltrate data and communicate with command-and-control servers.
pingr::ping_port() doesn't send the HTTP Host: header, so the IP address may be responding while the target virtual web host is not running on it, and it definitely doesn't validate that the path exists at the target URL.
You should clarify what you want to happen when the response has a non-200:299 HTTP status code. The following makes an assumption.
NOTE: You used Amazon as an example and I'm hoping that's the first site that just "came to mind" since it's unethical and a crime to scrape Amazon and I would appreciate my code not being brought into your universe if you are in fact just a brazen content thief. If you are stealing content, it's unlikely you'd be up front here about that, but on the outside chance you are both stealing and have a conscience, please let me know so I can delete this answer so at least other content thieves can't use it.
Here's a self-contained function for checking URLs:
#' @param x a single URL
#' @param non_2xx_return_value what to do if the site exists but the
#'        HTTP status code is not in the `2xx` range. Default is to return `FALSE`.
#' @param quiet if not `FALSE`, then every time the `non_2xx_return_value` condition
#'        arises a warning message will be displayed. Default is `FALSE`.
#' @param ... other params (`timeout()` would be a good one) passed directly
#'        to `httr::HEAD()` and/or `httr::GET()`
url_exists <- function(x, non_2xx_return_value = FALSE, quiet = FALSE, ...) {

  suppressPackageStartupMessages({
    require("httr", quietly = FALSE, warn.conflicts = FALSE)
  })

  # you don't need these two functions if you're already using `purrr`,
  # but `purrr` is a heavyweight compiled package that introduces
  # many other "tidyverse" dependencies and this doesn't.

  capture_error <- function(code, otherwise = NULL, quiet = TRUE) {
    tryCatch(
      list(result = code, error = NULL),
      error = function(e) {
        if (!quiet)
          message("Error: ", e$message)

        list(result = otherwise, error = e)
      },
      interrupt = function(e) {
        stop("Terminated by user", call. = FALSE)
      }
    )
  }

  safely <- function(.f, otherwise = NULL, quiet = TRUE) {
    function(...) capture_error(.f(...), otherwise, quiet)
  }

  sHEAD <- safely(httr::HEAD)
  sGET <- safely(httr::GET)

  # Try HEAD first since it's lightweight
  res <- sHEAD(x, ...)

  if (is.null(res$result) ||
      ((httr::status_code(res$result) %/% 100) != 2)) { # not a 2xx status

    res <- sGET(x, ...)

    if (is.null(res$result)) return(NA) # or whatever you want to return on "hard" errors

    if ((httr::status_code(res$result) %/% 100) != 2) {
      if (!quiet) warning(sprintf("Requests for [%s] responded but without an HTTP status code in the 200-299 range", x))
      return(non_2xx_return_value)
    }

    return(TRUE)

  } else {
    return(TRUE)
  }

}
Give it a go:
c(
  "http://content.thief/",
  "http://rud.is/this/path/does/not_exist",
  "https://www.amazon.com/s/ref=nb_sb_noss_2?url=search-alias%3Daps&field-keywords=content+theft",
  "https://www.google.com/search?num=100&source=hp&ei=xGzMW5TZK6G8ggegv5_QAw&q=don%27t+be+a+content+thief&btnK=Google+Search&oq=don%27t+be+a+content+thief&gs_l=psy-ab.3...934.6243..7114...2.0..0.134.2747.26j6....2..0....1..gws-wiz.....0..0j35i39j0i131j0i20i264j0i131i20i264j0i22i30j0i22i10i30j33i22i29i30j33i160.mY7wCTYy-v0",
  "https://rud.is/b/2018/10/10/geojson-version-of-cbc-quebec-ridings-hex-cartograms-with-example-usage-in-r/"
) -> some_urls

data.frame(
  exists = sapply(some_urls, url_exists, USE.NAMES = FALSE),
  some_urls,
  stringsAsFactors = FALSE
) %>% dplyr::tbl_df() %>% print()
## A tibble: 5 x 2
## exists some_urls
## <lgl> <chr>
## 1 NA http://content.thief/
## 2 FALSE http://rud.is/this/path/does/not_exist
## 3 TRUE https://www.amazon.com/s/ref=nb_sb_noss_2?url=search-alias%3Daps&field-keywords=con…
## 4 TRUE https://www.google.com/search?num=100&source=hp&ei=xGzMW5TZK6G8ggegv5_QAw&q=don%27t…
## 5 TRUE https://rud.is/b/2018/10/10/geojson-version-of-cbc-quebec-ridings-hex-cartograms-wi…
## Warning message:
## In FUN(X[[i]], ...) :
## Requests for [http://rud.is/this/path/does/not_exist] responded but without an HTTP status code in the 200-299 range
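If purrr is already a dependency in your project, the hand-rolled capture_error()/safely() helpers above collapse to purrr::safely(). A minimal sketch of the same guard, assuming httr and purrr are installed:
library(httr)
library(purrr)

sHEAD <- purrr::safely(httr::HEAD)
sGET  <- purrr::safely(httr::GET)

res <- sHEAD("https://rud.is/", timeout(5))
if (is.null(res$result)) {
  NA                                           # hard failure (DNS, timeout, ...); details in res$error
} else {
  httr::status_code(res$result) %/% 100 == 2   # TRUE only for a 2xx response
}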

Here is a simple solution to the problem.
urls <- c("http://www.amazon.com",
          "http://this.isafakelink.biz",
          "https://stackoverflow.com")

valid_url <- function(url_in, t = 2) {
  con <- url(url_in)
  check <- suppressWarnings(try(open.connection(con, open = "rt", timeout = t), silent = TRUE)[1])
  suppressWarnings(try(close.connection(con), silent = TRUE))
  ifelse(is.null(check), TRUE, FALSE)
}

sapply(urls, valid_url)
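A small follow-up sketch (base R only) to line the results up against the URLs, using the urls vector defined above:
data.frame(
  url    = urls,
  exists = vapply(urls, valid_url, logical(1)),
  row.names = NULL,
  stringsAsFactors = FALSE
)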

Try the ping() function in the pingr package. It returns the ping timings, so an unreachable site shows up as NA.
library(pingr)
ping("amazon.com") # good site
## [1] 45 46 45
ping("xxxyyyzzz.com") # bad site
## [1] NA NA NA
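If ICMP is blocked on your network (see the caveat in the first answer above), pingr::ping_port() can probe a TCP port instead; note that this only tells you the host accepts connections on that port, not that a given URL path exists. A sketch assuming port 443:
library(pingr)
ping_port("amazon.com", port = 443, count = 3)     # connect times in ms for a reachable host
ping_port("xxxyyyzzz.com", port = 443, count = 3)  # NA for an unreachable one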

Here's a function that evaluates an expression and returns TRUE if it works and FALSE if it doesn't. You can also assign variables inside the expression.
try_catch <- function(exprs) {!inherits(try(eval(exprs)), "try-error")}
try_catch(out <- log("a")) # returns FALSE
out # Error: object 'out' not found
try_catch(out <- log(1)) # returns TRUE
out # out = 0
You can use the expression to check for success.
done <- try_catch({
  # try something
})

if (!done) {
  done <- try_catch({
    # try something else
  })
}

if (!done) {
  # default expression
}
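Applied to the URL question at the top, the same chaining idea might look like the sketch below (it assumes httr is installed and reuses the try_catch() helper defined above; url_ok is a hypothetical name):
library(httr)

url_ok <- function(u) {
  ok <- try_catch(resp <- HEAD(u, timeout(5)))      # lightweight attempt first
  if (!ok) ok <- try_catch(resp <- GET(u, timeout(5)))
  if (!ok) return(NA)                               # unreachable: DNS failure, timeout, ...
  status_code(resp) >= 200 && status_code(resp) < 300
}

sapply(c("http://www.amazon.com", "http://afsadadssadasf.com"), url_ok)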

Related

Choose command order in a function based on an error [R]

I have three files in a folder with the following names:
./multiqc_data$ ls
file1.json
file2.json
file3.json
When I open the files with the TidyMultiqc package, existing NaN values in the files can lead to the following error:
files <- dir(path, pattern = "*.json") # locate files

files %>%
  map(~ load_multiqc(file.path(path, .))) # parse them
## the error
Error in parse_con(txt, bigint_as_char) :
lexical error: invalid char in json text.
"mapped_failed_pct": NaN, "paired in
(right here) ------^
I want to create a function to handle this error: every time it pops up, I want to apply this sed command to all files in the folder.
system(paste("gsed -i 's/NaN/null/g'",paste0(path,"*.json")))
Any ideas how I can achieve this?
You could use this wrapper:
safe_load_multiqc <- function(path, file) {
  tryCatch(load_multiqc(file.path(path, file)), error = function(e) {
    system(paste("gsed -i 's/NaN/null/g'", paste0(path, "*.json")))
    # retry
    load_multiqc(file.path(path, file))
  })
}
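A usage sketch over the folder from the question (assumes path points at ./multiqc_data and that the wrapper above is defined):
library(purrr)

files   <- dir(path, pattern = "*.json")
reports <- map(files, ~ safe_load_multiqc(path, .))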
A good way to handle errors in work pipelines like this is to use restarts, via withCallingHandlers() and withRestarts().
You establish the condition handlers and the recovery protocols (restarts), then you choose which protocols to use and in which order. Calling handlers allow much finer control over error conditions than a common try/catch.
In the example, I wrote two handlers: removeNaNs (works at folder level) and skipFile (works at file level); if the first fails, the second is executed (simply skipping the file). Of course, this is just an example.
I think in your case you could simply run sed every time; nevertheless, I hope this answer meets your request for a canonical way.
Inspiration and further reading: Beyond Exception Handling: Conditions and Restarts
library(purrr) # for map() and %>%

path <- "../your_path"

# function that does the error-prone task
do_task <- function(path) {
  files <- dir(path, pattern = "*.json") # locate files
  files %>%
    map(~ withRestart(                    # set an alternative restart
      load_multiqc(file.path(path, .)),   # parsing
      skipFile = function() {             # if it fails, skip only this file
        message(paste("skipping ", file.path(path, .)))
        return(NULL)
      }))
}

# error handler that invokes "removeNaN"
removeNaNHandler <- function(e) tryInvokeRestart("removeNaN")

# error handler that invokes "skipFile"
skipFileHandler <- function(e) tryInvokeRestart("skipFile")

# run the task with handlers in case of error
withCallingHandlers(
  condition = removeNaNHandler,  # calling handler (on generic error)
  # condition = skipFileHandler, # if the previous one fails, skip the file
  {
    # run with recovery protocols (can define more than one)
    withRestarts({
      do_task(path)},
      removeNaN = function() {   # protocol "removeNaN"
        system(paste("gsed -i 's/NaN/null/g'", paste0(path, "*.json")))
        do_task(path)            # try again
      }
    )
  }
)
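For readers new to the condition system, here is a minimal, self-contained illustration of the same handler/restart wiring, independent of TidyMultiqc (base R only; tryInvokeRestart() needs R >= 4.0):
read_number <- function(x) {
  withRestarts({
    n <- suppressWarnings(as.numeric(x))
    if (is.na(n)) stop("not a number: ", x)
    n
  },
  useZero = function() 0   # recovery protocol: substitute 0
  )
}

# calling handler: on error, transfer control to the "useZero" restart
handler <- function(e) tryInvokeRestart("useZero")

withCallingHandlers(
  error = handler,
  sapply(c("1", "oops", "3"), read_number)
)
# expected: 1 0 3 (named by the input strings)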
Based on this open github issue, a potential solution provided by Peter Diakumis is to use RJSONIO::fromJSON() in place of jsonlite::read_json(). You could adapt this solution to your use-case by e.g. creating your own load_multiqc() function:
library(RJSONIO)

load_multiqc_bugfix <- function(paths,
                                plots = NULL,
                                find_metadata = function(...) {
                                  list()
                                },
                                plot_parsers = list(),
                                sections = "general") {
  assertthat::assert_that(all(sections %in% c(
    "general", "plot", "raw"
  )), msg = "Only 'general', 'plot' and 'raw' (and combinations of those) are valid items for the sections parameter")

  # Vectorised over paths
  paths %>%
    purrr::map_dfr(function(path) {
      parsed <- RJSONIO::fromJSON(path)

      # The main data is plots/general/raw
      main_data <- sections %>%
        purrr::map(~ switch(.,
          general = parse_general(parsed),
          raw = parse_raw(parsed),
          plot = parse_plots(parsed, plots = plots, plot_parsers = plot_parsers)
        )) %>%
        purrr::reduce(~ purrr::list_merge(.x, !!!.y), .init = list()) %>%
        purrr::imap(~ purrr::list_merge(.x, metadata.sample_id = .y))

      # Metadata is defined by a user function
      metadata <- parse_metadata(parsed = parsed, samples = names(main_data), find_metadata = find_metadata)

      purrr::list_merge(metadata, !!!main_data) %>%
        dplyr::bind_rows()
    }) %>%
    # Only arrange the columns if we have at least 1 column
    `if`(
      # Move the columns into the order: metadata, general, plot, raw
      ncol(.) > 0,
      (.) %>%
        dplyr::relocate(dplyr::starts_with("raw")) %>%
        dplyr::relocate(dplyr::starts_with("plot")) %>%
        dplyr::relocate(dplyr::starts_with("general")) %>%
        dplyr::relocate(dplyr::starts_with("metadata")) %>%
        # Always put the sample ID at the start
        dplyr::relocate(metadata.sample_id),
      .
    )
}
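A usage sketch over the folder from the question (the function is vectorised over paths):
files   <- dir(path, pattern = "*.json", full.names = TRUE)
reports <- load_multiqc_bugfix(files)
Note that, as written, load_multiqc_bugfix() still calls TidyMultiqc's internal parsers (parse_general(), parse_raw(), parse_plots(), parse_metadata()), so it is meant as a patched copy living alongside that package rather than a fully standalone script.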

How to use tryCatch to skip errors in data downloading in R

I am trying to download data from the USGS website using the dataRetrieval package of R.
For that purpose, I have written a function called getstreamflow in R that works fine when I run it, for example:
siteNumber <- c("094985005","09498501","09489500","09489499","09498502")
Streamflow = getstreamflow(siteNumber)
The output of the function is a list of data frames
I could run the function when there is no issue downloading the data, but for some stations, I got the following error:
Request failed [404]. Retrying in 1.1 seconds...
Request failed [404]. Retrying in 3.3 seconds...
For: https://waterservices.usgs.gov/nwis/site/?siteOutput=Expanded&format=rdb&site=0946666666
To avoid that the function stops when encounters an error, I am trying to use tryCatch as in the following code:
Streamflow = tryCatch(
  expr = {
    getstreamflow(siteNumber)
  },
  error = function(e) {
    message(paste(siteNumber, " there was an error"))
  })
I want the function to skip a station and go on to the next one when it encounters an error. Currently, the output I get is shown below, which is obviously wrong because it says there was an error for every station:
094985005 there was an error09498501 there was an error09489500 there was an error09489499 there was an error09498502 there was an error09511300 there was an error09498400 there was an error09498500 there was an error09489700 there was an error09500500 there was an error09489082 there was an error09510200 there was an error09489100 there was an error09490500 there was an error09510180 there was an error09494000 there was an error09490000 there was an error09489086 there was an error09489089 there was an error09489200 there was an error09489078 there was an error09510170 there was an error09493500 there was an error09493000 there was an error09498503 there was an error09497500 there was an error09510000 there was an error09509502 there was an error09509500 there was an error09492400 there was an error09492500 there was an error09497980 there was an error09497850 there was an error09492000 there was an error09497800 there was an error09510150 there was an error09499500 there was an error... <truncated>
What am I doing wrong with tryCatch?
Answer
You wrote the tryCatch outside of getstreamflow. Hence, if one site fails, getstreamflow returns an error and nothing else. You should either supply one site at a time, or put the tryCatch inside getstreamflow.
Example
x <- 1:5

fun <- function(x) {
  for (i in x) if (i == 5) stop("ERROR")
  return(x^2)
}

tryCatch(fun(x), error = function(e) paste0("wrong", x))
This returns:
[1] "wrong1" "wrong2" "wrong3" "wrong4" "wrong5"
Multiple arguments
You indicated that you have both siteNumber and datatype to iterate over.
Using Map, we can define a function that takes two inputs:
Map(function(x, y) tryCatch(fun(x, y),
                            error = function(e) message(paste(x, " there was an error"))),
    x = siteNumber,
    y = datatype)
Using a for-loop, we can just iterate over them:
Streamflow <- vector(mode = "list", length = length(siteNumber))

for (i in seq_along(siteNumber)) {
  Streamflow[[i]] <- tryCatch(getstreamflow(siteNumber[i], datatype),
                              error = function(e) message(paste(siteNumber[i], " there was an error")))
}
Or, as suggested, just modify getstreamflow.
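Since getstreamflow() itself isn't shown in the question, the following is only a sketch of what "put the tryCatch inside" could look like, substituting dataRetrieval::readNWISdv() for the actual download step ("00060", daily discharge, is an assumed parameter code):
library(dataRetrieval)

getstreamflow_safe <- function(siteNumbers, parameterCd = "00060") {
  lapply(siteNumbers, function(site) {
    tryCatch(
      readNWISdv(site, parameterCd),
      error = function(e) {
        message(site, ": skipped (", conditionMessage(e), ")")
        NULL   # placeholder so the result list keeps one element per site
      }
    )
  })
}

Streamflow <- getstreamflow_safe(siteNumber)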

RPlumber API - returning data as CSV instead of JSON - works locally on mac, but not on ubuntu-16.04

We are using RPlumber to host an API, and our developers asked that the API endpoints provide data in a CSV format, rather than JSON. To handle this, we have the following:
r_endpoints.R
#* @get /test-endpoint-1
testEndpoint <- function(res) {
  mydata <- data.frame(a = c(1,2,3), b = c(3,4,5))
  print('mydata')
  print(mydata)
  con <- textConnection("val", "w")
  print(paste0('con: ', con))
  write.csv(x = mydata, con, row.names = FALSE)
  close(con)
  print('res and res.body')
  print(res)
  res$body <- paste(val, collapse = "\n")
  print(res$body)
  return(res)
}

#* @get /test-endpoint-2
testEndpoint2 <- function() {
  mydata <- data.frame(a = c(1,2,3), b = c(3,4,5))
  return(mydata)
}
run_api.r
library(plumber)
pr <- plumber::plumb("r_endpoints.R")
pr$run(host = "0.0.0.0", port = 8004)
test-endpoint-2 returns the data in JSON format, whereas test-endpoint-1 returns it in CSV format. When I hit the endpoints locally on my Mac, I receive the correct output.
To host the API, we've installed R, the libraries, and pm2 on a Linode Ubuntu 16.04 server, and installed all (I think all) of the dependencies. When we hit the endpoints hosted on the server, the CSV endpoint fails.
Here are the print statements that I've added to test-endpoint-1 to help with debugging:
[1] "mydata"
a b
1 1 3
2 2 4
3 3 5
[1] "con: 3"
[1] "res and res.body"
<PlumberResponse>
Public:
body: NULL
clone: function (deep = FALSE)
headers: list
initialize: function (serializer = serializer_json())
removeCookie: function (name, path, http = FALSE, secure = FALSE, same_site = FALSE,
serializer: function (val, req, res, errorHandler)
setCookie: function (name, value, path, expiration = FALSE, http = FALSE,
setHeader: function (name, value)
status: 200
toResponse: function ()
[1] "\"a\",\"b\"\n1,3\n2,4\n3,5"
These print statements are correct, the same ones we get locally. For some reason the server will not return the CSV the way my local machine does, and I have no idea why this is the case or how to fix it.
Edit
After updating the plumber library on my local machine, I now receive the error An exception occurred. locally as well. It seems that, in the newer version of plumber, the snippet of code I use to convert the API endpoint output to a CSV:
...
con <- textConnection("val","w")
write.csv(x = mydata, con, row.names = FALSE)
close(con)
res$body <- paste(val, collapse="\n")
return(res)
no longer works.
Edit 2
Here's my own stackoverflow post from nearly 3 years ago on how to return the data as a CSV... seems to no longer work.
Edit 3
Using @serializer csv does "work", but when I hit the endpoint the data is downloaded as a CSV file onto my local machine, whereas it would be better for the data simply to be returned in CSV format from the API without being automatically downloaded.
Maybe look into this for inspiration; here I'm modifying the response's Content-Type header to text/plain, which should display in the browser, I believe.
#* @get /json
#* @serializer unboxedJSON
function() {
  dostuff()
}

#* @get /csv
#* @serializer csv list(type="text/plain; charset=UTF-8")
function() {
  dostuff()
}

dostuff <- function() {
  mtcars
}
This ugly code works.
EDIT: added an enum spec for the Swagger UI.
library(plumber)

#* @get /iris
function(type, res) {
  if (type == "csv") {
    res$serializer <- serializer_csv(type = "text/plain; charset=UTF-8")
  }
  iris
}

#* @plumber
function(pr) {
  pr_set_api_spec(pr, function(spec) {
    spec$paths$`/iris`$get$parameters[[1]]$schema$enum = c("json", "csv")
    spec
  })
}
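A quick client-side check of the endpoint above (a sketch; assumes the API is running locally on port 8004 as in run_api.r):
library(httr)

resp <- GET("http://localhost:8004/iris", query = list(type = "csv"))
headers(resp)[["content-type"]]                      # should report text/plain; charset=UTF-8
cat(content(resp, as = "text", encoding = "UTF-8"))  # the CSV body as plain text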
The An exception occurred issue is actually from httpuv and is fixed in the latest GitHub version of the package (see https://github.com/rstudio/httpuv/pull/289). Installing httpuv from GitHub (remotes::install_github("rstudio/httpuv")) and running the API again should resolve the issue.

R - `try` in conjunction with capturing ALL console output?

Here's a piece of code I'm working with:
install.packages('BiocManager'); BiocManager::install('UniProt.ws')
requireNamespace('UniProt.ws')

uniprot_object <- UniProt.ws::UniProt.ws(
  UniProt.ws::availableUniprotSpecies(
    pattern = '^Homo sapiens$')$`taxon ID`)

query_results <- try(
  UniProt.ws::select(
    x = uniprot_object,
    keys = 'BAA08084.1',
    keytype = 'EMBL/GENBANK/DDBJ',
    columns = c('ENSEMBL','UNIPROTKB')))
This particular key/keytype combination is non-productive and produces the following output:
Getting mapping data for BAA08084.1 ... and ACC
error while trying to retrieve data in chunk 1:
no lines available in input
continuing to try
Error in `colnames<-`(`*tmp*`, value = `*vtmp*`) :
attempt to set 'colnames' on an object with less than two dimensions
Of the two [eE]rrors reported, only the second is a 'proper' R error object and, given the use of try, is accordingly captured in the variable query_results.
I am, however, desperate to capture the other error bit (no lines available in input) to inform downstream programmatic processes.
After playing with a plethora of capture.output, sink, purrr::quietly, etc. options found by startpaging (googling), I still fail to capture that bit. How can I do that?
As @Csd suggested, you could use tryCatch. The message you are after is printed by the message() function in R, not stop(), so try() will ignore it. To capture output from message(), use code like this:
query_results <- tryCatch(
  UniProt.ws::select(
    x = uniprot_object,
    keys = 'BAA08084.1',
    keytype = 'EMBL/GENBANK/DDBJ',
    columns = c('ENSEMBL','UNIPROTKB')),
  message = function(e) conditionMessage(e))
This will abort evaluation when it gets any message, and return the message in query_results. If you are doing more than debugging, you probably want the message saved, but evaluation to continue. In that case, use withCallingHandlers instead. For example,
saveMessages <- c()

query_results <- withCallingHandlers(
  UniProt.ws::select(
    x = uniprot_object,
    keys = 'BAA08084.1',
    keytype = 'EMBL/GENBANK/DDBJ',
    columns = c('ENSEMBL','UNIPROTKB')),
  message = function(e)
    saveMessages <<- c(saveMessages, conditionMessage(e)))
When I run this version, query_results is unchanged (because the later error aborted execution), but the messages are saved:
saveMessages
[1] "Getting mapping data for BAA08084.1 ... and ACC\n"
[2] "error while trying to retrieve data in chunk 1:\n no lines available in input\ncontinuing to try\n"
Based on @user2554330's most excellent answer, I constructed an ugly thing that does exactly what I want:
try to execute the statement
don't fail fatally
leave no ugly messages
allow me access to errors and messages
So here it is, in all its despicable glory:
saveMessages <- c()

query_results <- suppressMessages(
  withCallingHandlers(
    try(
      UniProt.ws::select(
        x = uniprot_object,
        keys = 'BAA08084.1',
        keytype = 'EMBL/GENBANK/DDBJ',
        columns = c('ENSEMBL','UNIPROTKB')),
      silent = TRUE),
    message = function(e)
      saveMessages <<- c(saveMessages, conditionMessage(e))))
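Downstream code can then branch on both pieces; a short sketch:
if (inherits(query_results, "try-error")) {
  # the select() call failed outright; the captured messages explain why
  if (any(grepl("no lines available in input", saveMessages))) {
    # e.g. flag this key/keytype combination as non-productive and move on
  }
} else {
  # query_results holds the usual data frame returned by UniProt.ws::select()
}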

Function decode_short_URL from twitteR package not working

I am using decode_short_url from the twitteR package to decode shortened URLs from Twitter posts, but I am not able to get the desired results; it always gives back the same result, such as:
decode_short_url(decode_short_url("http://bit.ly/23226se656"))
## http://bit.ly/23226se656
## [1] "http://bit.ly/23226se656"
UPDATE I wrapped this functionality in a package and managed to get it on CRAN same-day. Now, you can just do:
library(longurl)
expand_urls("http://bit.ly/23226se656", check=TRUE, warn=TRUE)
|++++++++++++++++++++++++++++++++++++++++++++++++++| 100%
## Source: local data frame [1 x 2]
##
## orig_url expanded_url
## 1 http://bit.ly/23226se656 NA
##
## Warning message:
## In FUN(X[[i]], ...) : client error: (404) Not Found
You can pass in a vector of URLs and get a data_frame/data.frame back in that form.
That particular bit.ly URL gives a 404 error. Here's a version of decode_short_url that has an optional check parameter that will attempt a HEAD request and throw a warning message for any HTTP status other than 200.
You can further modify it to return NA in the event the "expanded" link 404's (I have no idea what you need this to really do in the event the link is bad).
NOTE that the added HEAD request will significantly slow the process down, so you may want to do a first pass with check=FALSE into a separate column, then compare which weren't "expanded", then check those with check=TRUE.
You might also want to rename this to avoid namespace conflicts with the one from twitteR.
decode_short_url <- function(url, check = FALSE, ...) {

  require(httr)

  request_url <- paste("http://api.longurl.org/v2/expand?url=",
                       url, "&format=json", sep = "")

  response <- GET(request_url, query = list(useragent = "twitteR"), ...)
  parsed <- content(response, as = "parsed")

  ret <- NULL

  if (!("long-url" %in% names(parsed))) {
    ret <- url
  } else {
    ret <- parsed[["long-url"]]
  }

  if (check) warn_for_status(HEAD(url))

  return(ret)
}
decode_short_url("http://bit.ly/23226se656", check=TRUE)
## [1] "http://bit.ly/23226se656"
## Warning message:
## In decode_short_url("http://bit.ly/23226se656", check = TRUE) :
## client error: (404) Not Found
