Ignore error when importing JSON files in R - r

I have this for loop that downloads a JSON file from a Solr search server.
It loops over a vector that contains keywords (100, in this case):
library(jsonlite)
for (i in 1:100) {
  docs <- fromJSON(paste("http://myurl.com/solr/select?df=topic&fq=", keywords[i], "&indent=on&q=*:*&rows=1&wt=json", sep = ""))
  numFound <- docs$response$numFound
  print(numFound)
}
It works fine until it reaches a keyword that is not found on the Solr server, which returns this error:
Error in open.connection(con, "rb") : HTTP error 400.
And then the loop stops.
Is there a way to ignore the error and continue the loop?
I've read something about using tryCatch but still couldn't figure it out.

Simpler than tryCatch, you can use the function try inside your keyword loop. This will attempt to load the URL; if an error is encountered, it prints the error and continues to the next keyword.
library(jsonlite)
for (i in 1:100) {
  try({
    docs <- fromJSON(paste("http://myurl.com/solr/select?df=topic&fq=", keywords[i], "&indent=on&q=*:*&rows=1&wt=json", sep = ""))
    numFound <- docs$response$numFound
    print(numFound)
  })
}
If you also don't want to have the errors printed, specify silent = TRUE:
library(jsonlite)
for (i in 1:100) {
  try({
    docs <- fromJSON(paste("http://myurl.com/solr/select?df=topic&fq=", keywords[i], "&indent=on&q=*:*&rows=1&wt=json", sep = ""))
    numFound <- docs$response$numFound
    print(numFound)
  }, silent = TRUE)
}
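If you would rather use tryCatch after all, here is a minimal sketch of an equivalent loop (assuming the same keywords vector and URL) that records NA for keywords that trigger an HTTP error:
library(jsonlite)
for (i in 1:100) {
  numFound <- tryCatch({
    docs <- fromJSON(paste("http://myurl.com/solr/select?df=topic&fq=", keywords[i], "&indent=on&q=*:*&rows=1&wt=json", sep = ""))
    docs$response$numFound
  },
  error = function(e) NA)  # swallow the HTTP error and record NA instead
  print(numFound)
}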

I'm partial to purrr's safely for this kind of task; it works well with purrr's map functions. You can test it by getting JSON from GitHub's API:
keywords <- c("hadley", "gershomtripp", "lsjdflkaj")
url <- "https://api.github.com/users/{.}/repos"
Now we can fetch the JSON for each keyword and extract the repo IDs:
library(jsonlite)
library(purrr)
library(glue)
json_list <- map(keywords, safely(~ fromJSON(glue(url)) %>% .$id))
This will return a list of elements, each containing result and error. If there was an error, it is saved in error; otherwise the result is saved in result.
[[1]]
[[1]]$result
[1] 40423928 40544418 14984909 12241750 5154874 9324319 20228011 82348 888200 3116998
[11] 8296284 137344416 133734429 2788278 28724058 9470424 116708612 34325557 41144 41157
[21] 78543290 66588778 35225488 14507273 15718805 18562209 12522 115742443 119107571 201908
[[1]]$error
NULL
[[2]]
[[2]]$result
[1] 150995700 141743224 127107806 130802586 185857872 131488780 148619375 165221804 135417803 127116088
[11] 181662388 173351888 127131146 136896011
[[2]]$error
NULL
[[3]]
[[3]]$result
NULL
[[3]]$error
<simpleError in open.connection(con, "rb"): HTTP error 404.>
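If you then only want the successful results, one possible follow-up (a sketch using purrr's element extraction, not part of the original answer) is:
# keep only the result slots and drop the elements that errored
repo_ids <- json_list %>%
  map("result") %>%
  compact()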

Related

How to skip some lines in R

I have many URLs whose text I import into R.
I use this code:
setNames(lapply(1:1000, function(x) gettxt(get(paste0("url", x)))), paste0("url", 1:1000, "_txt")) %>%
  list2env(envir = globalenv())
However, some URLs cannot be imported and raise this error:
Error in file(con, "r") : cannot open the connection In addition:
Warning message: In file(con, "r") : InternetOpenUrl failed: 'A
connection with the server could not be established'
So, my code doesn't run and doesn't import any text from any URL.
How can I recognize the bad URLs and skip them in order to import the correct ones?
One possible approach, besides the tryCatch mentioned by @tester, is the purrr package:
library(purrr)
# declare the download function
my_gettxt <- function(x) {
  gettxt(get(paste0("url", x)))
}
# make the function fail-safe by defining the otherwise value (could be an empty
# data frame with the right column definitions, etc.) used as output if the function fails
my_gettxt <- purrr::possibly(my_gettxt, otherwise = NA)
# use map() from purrr instead of an apply function
my_data <- purrr::map(1:1000, ~ my_gettxt(.x))
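As a possible follow-up (a sketch, not part of the original answer), you can then see which URLs failed by looking for the otherwise value:
# indices of the URLs that could not be imported (possibly() returned NA)
failed <- which(vapply(my_data, function(x) length(x) == 1 && is.na(x), logical(1)))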

How to use tryCatch to skip errors when downloading data in R

I am trying to download data from the USGS website using the dataRetrieval package in R.
For that purpose, I have written a function called getstreamflow that works fine when I run, for example:
siteNumber <- c("094985005","09498501","09489500","09489499","09498502")
Streamflow = getstreamflow(siteNumber)
The output of the function is a list of data frames.
The function runs when there is no issue downloading the data, but for some stations I get the following error:
Request failed [404]. Retrying in 1.1 seconds...
Request failed [404]. Retrying in 3.3 seconds...
For: https://waterservices.usgs.gov/nwis/site/?siteOutput=Expanded&format=rdb&site=0946666666
To prevent the function from stopping when it encounters an error, I am trying to use tryCatch as in the following code:
Streamflow = tryCatch(
  expr = {
    getstreamflow(siteNumber)
  },
  error = function(e) {
    message(paste(siteNumber, " there was an error"))
  })
I want the function to skip the station and go to the next one when it encounters an error. Currently, the output I get is shown below, which is obviously wrong, because it says there was an error for every station:
094985005 there was an error09498501 there was an error09489500 there was an error09489499 there was an error09498502 there was an error09511300 there was an error09498400 there was an error09498500 there was an error09489700 there was an error09500500 there was an error09489082 there was an error09510200 there was an error09489100 there was an error09490500 there was an error09510180 there was an error09494000 there was an error09490000 there was an error09489086 there was an error09489089 there was an error09489200 there was an error09489078 there was an error09510170 there was an error09493500 there was an error09493000 there was an error09498503 there was an error09497500 there was an error09510000 there was an error09509502 there was an error09509500 there was an error09492400 there was an error09492500 there was an error09497980 there was an error09497850 there was an error09492000 there was an error09497800 there was an error09510150 there was an error09499500 there was an error... <truncated>
What am I doing wrong with the tryCatch?
Answer
You wrote the tryCatch outside of getstreamflow. Hence, if one site fails, getstreamflow returns an error and nothing else. You should either supply one site at a time, or put the tryCatch inside getstreamflow.
Example
x <- 1:5
fun <- function(x) {
  for (i in x) if (i == 5) stop("ERROR")
  return(x^2)
}
tryCatch(fun(x), error = function(e) paste0("wrong", x))
This returns:
[1] "wrong1" "wrong2" "wrong3" "wrong4" "wrong5"
Multiple arguments
You indicated that you have both siteNumber and datatype to iterate over.
Using Map, we can define a function that takes two inputs:
Map(function(x, y) tryCatch(fun(x, y),
                            error = function(e) message(paste(x, " there was an error"))),
    x = siteNumber,
    y = datatype)
Using a for-loop, we can just iterate over them:
Streamflow <- vector(mode = "list", length = length(siteNumber))
for (i in seq_along(siteNumber)) {
  Streamflow[[i]] <- tryCatch(
    getstreamflow(siteNumber[i], datatype),
    error = function(e) message(paste(siteNumber[i], " there was an error"))
  )
}
Or, as suggested, just modify getstreamflow.
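For completeness, here is a minimal sketch of what moving the tryCatch inside getstreamflow could look like; download_one_site() is a placeholder for whatever dataRetrieval call the real function makes per station:
getstreamflow <- function(siteNumber, datatype) {
  out <- vector(mode = "list", length = length(siteNumber))
  names(out) <- siteNumber
  for (i in seq_along(siteNumber)) {
    out[[i]] <- tryCatch(
      download_one_site(siteNumber[i], datatype),  # placeholder for the actual per-station download
      error = function(e) {
        message(paste(siteNumber[i], " there was an error"))
        NULL  # keep a NULL entry so the remaining stations are still returned
      }
    )
  }
  out
}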

using GET in a loop

I am using the following code. I create a list of first names and then generate links to an API for each name and then try to capture the data from each link.
mydata$NameGenderURL2 <- paste ("https://gender-api.com/get?name=",mydata$firstname, "&key=suZrzhrNJRvrkWFXAG", sep="")
mynamegenderfunction <- function(x){
GET(url= mydata$NameGenderURL2[x])
this.raw.content <- genderdata$content
this.raw.content <- rawToChar(genderdata$content)
this.content <- fromJSON(this.raw.content)
name1[x] <- this.content$name
gender1[x] <- this.content$gender}
namelist <- mydata$firstname[1:100]
genderdata <- lapply(namelist, mynamegenderfunction)
Oddly enough I receive the following message:
Error in curl::curl_fetch_memory(url, handle = handle) :
  Could not resolve host: NA
I tried another API and got the same issue. Any suggestions?
Here is a data sample:
namesurl
https://api.genderize.io/?name=kaan
https://api.genderize.io/?name=Joan
https://api.genderize.io/?name=homeblitz
https://api.genderize.io/?name=Flatmax
https://api.genderize.io/?name=BRYAN
https://api.genderize.io/?name=James
https://api.genderize.io/?name=Dion
https://api.genderize.io/?name=Flintu
https://api.genderize.io/?name=Adriana
The output that I need is the gender for each link, which would be: Male/Female, Null.
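In the spirit of the answers above, a hedged sketch of a per-URL loop that skips failed requests and returns NA instead (the gender field name and the namesurl column are assumed from the sample above):
library(httr)
library(jsonlite)

get_gender <- function(u) {
  tryCatch({
    resp <- GET(u)
    stop_for_status(resp)                     # turn HTTP errors into R errors
    fromJSON(rawToChar(resp$content))$gender  # assumed field in the JSON response
  }, error = function(e) NA)                  # unresolved hosts / bad names become NA
}

genders <- sapply(mydata$namesurl, get_gender)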

How to deal with "Warning: object 'xxx' is created by more than one data call"

When checking an R package, I got the warning
Warning: object 'xxx' is created by more than one data call
What causes this, and how can I fix it?
This warning occurs when multiple RData files in the data directory of the package store a variable with the same name.
To reproduce, we create a package and save the cars dataset twice, to different files:
library(devtools)
create("test")
dir.create("test/data")
save(cars, file = "test/data/cars1.RData")
save(cars, file = "test/data/cars2.RData")
check("test")
The output from check includes these lines:
Found the following significant warnings:
Warning: object 'cars' is created by more than one data call
If you receive this warning, you can find repeated variable names using:
rdata_files <- dir("test/data", full.names = TRUE, pattern = "\\.RData$")
var_names <- lapply(
  rdata_files,
  function(rdata_file) {
    e <- new.env()
    load(rdata_file, envir = e)
    ls(e)
  }
)
Reduce(intersect, var_names)
## [1] "cars"
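Once the duplicated name is found, the usual fix is to keep the object in a single data file (or rename one of the objects). Continuing the sketch above, for example:
# keep cars in cars1.RData only, then re-run the check
file.remove("test/data/cars2.RData")
check("test")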

Function decode_short_URL from twitteR package not working

I am using decode_short_url from the twitteR package to decode shortened URLs from Twitter posts, but I am not able to get the desired results. It always gives back the same result, such as:
decode_short_url(decode_short_url("http://bit.ly/23226se656"))
## [1] "http://bit.ly/23226se656"
UPDATE: I wrapped this functionality in a package and managed to get it on CRAN the same day. Now, you can just do:
library(longurl)
expand_urls("http://bit.ly/23226se656", check=TRUE, warn=TRUE)
|++++++++++++++++++++++++++++++++++++++++++++++++++| 100%
## Source: local data frame [1 x 2]
##
## orig_url expanded_url
## 1 http://bit.ly/23226se656 NA
##
## Warning message:
## In FUN(X[[i]], ...) : client error: (404) Not Found
You can pass in a vector of URLs and get a data_frame/data.frame back in that form.
That particular bit.ly URL gives a 404 error. Here's a version of decode_short_url that has an optional check parameter that will attempt a HEAD request and throw a warning message for any HTTP status other than 200.
You can further modify it to return NA in the event the "expanded" link 404's (I have no idea what you need this to really do in the event the link is bad).
NOTE that the added HEAD request will significantly slow the process down, so you may want to do a first pass with check=FALSE into a separate column, then compare which weren't "expanded", then check those with check=TRUE.
You might also want to rename this to avoid namespace conflicts with the one from twitteR.
decode_short_url <- function(url, check=FALSE, ...) {
  require(httr)
  request_url <- paste("http://api.longurl.org/v2/expand?url=",
                       url, "&format=json", sep="")
  response <- GET(request_url, query=list(useragent="twitteR"), ...)
  parsed <- content(response, as="parsed")
  ret <- NULL
  if (!("long-url" %in% names(parsed))) {
    ret <- url
  } else {
    ret <- parsed[["long-url"]]
  }
  if (check) warn_for_status(HEAD(url))
  return(ret)
}
decode_short_url("http://bit.ly/23226se656", check=TRUE)
## [1] "http://bit.ly/23226se656"
## Warning message:
## In decode_short_url("http://bit.ly/23226se656", check = TRUE) :
## client error: (404) Not Found
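If you want the NA-on-404 behaviour mentioned above, one possible sketch (decode_short_url_na is a new wrapper name, not part of twitteR or longurl) is:
decode_short_url_na <- function(url, ...) {
  require(httr)
  res <- HEAD(url, ...)
  if (status_code(res) == 404) return(NA_character_)  # bad link: return NA instead of the original URL
  decode_short_url(url, check = FALSE, ...)
}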
