I'm trying out roadoi to access Unpaywall from R, but no matter what I try to query for, I'm getting this response:
Error in UseMethod("http_error") : no applicable method for
'http_error' applied to an object of class "c('simpleError', 'error',
'condition')"
Running methods(http_error) gives me this:
[1] http_error.character* http_error.integer* http_error.response*
Could this be caused by me being behind an institutional firewall? (even so, it seems weird that this would be the response...)
Is there a way around it?
The http_error() function (actually from the httr library) is very simple: given a character string it loads the URL (http_error.character), retrieves the response (http_error.response), and ultimately looks at the status code (http_error.integer). If the status code is >= 400 the function returns TRUE, otherwise FALSE.
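You can see that dispatch at work directly; the integer method just checks the code:
httr::http_error(404L)
#> [1] TRUE
httr::http_error(200L)
#> [1] FALSE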
What your error says is that something in your call chain tried to call http_error() on a simpleError object. My guess is that your firewall settings block the request. Because the request is blocked, the underlying httr::RETRY() (which is called from oadoi_fetch()) returns an error object instead of a proper response object, and http_error() sees just this error object and breaks.
If I locally switch off my proxy (through which I can make requests) I also get an error:
library(roadoi)
Sys.unsetenv(c("HTTP_PROXY", "HTTPS_PROXY"))
oadoi_fetch("10.1038/nature12373", email = "name#whatever.com")
# Error in UseMethod("http_error") :
# no applicable method for 'http_error' applied to an object of class
# "c('simpleError', 'error', 'condition')"
As soon as my proxy is set properly, I get:
Sys.setenv(HTTPS_PROXY = my_proxy, HTTP_PROXY = my_proxy)
oadoi_fetch("10.1038/nature12373", email = "name#whatever.com")
# # A tibble: 1 x 16
# doi best_oa_location oa_locations data_standard is_oa genre journal_is_oa journal_is_in_d~ journal_issns journal_name publisher title year updated non_compliant authors
# <chr> <list> <list> <int> <lgl> <chr> <lgl> <lgl> <chr> <chr> <chr> <chr> <chr> <chr> <list> <list>
# 1 10.1038~ <tibble [1 x 10]> <tibble [4 x~ 2 TRUE journa~ FALSE FALSE 0028-0836,147~ Nature Springer ~ Nanometre-s~ 2013 2019-04-0~
If the problem does indeed lie with the proxy, I would try the following, which helped me on my corporate Windows machine but may depend on your local IT setup:
## get the proxy settings
system("netsh winhttp show proxy")
Sys.setenv(HTTP_PROXY = "<the proxy from netsh>", HTTPS_PROXY = "<the proxy from netsh>")
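Once the proxy is set, a quick sanity check (a minimal sketch; it assumes the public Unpaywall v2 endpoint) tells you whether requests get through at all:
# A status code (even a 4xx) means the network path works;
# a connection error means requests are still being blocked.
resp <- httr::GET("https://api.unpaywall.org/v2/10.1038/nature12373?email=name@whatever.com")
httr::status_code(resp)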
Actually, you can reproduce the error easily:
httr::http_error(simpleError("Cannot reach the page"))
# Error in UseMethod("http_error") :
# no applicable method for 'http_error' applied to an object of class
# "c('simpleError', # 'error', 'condition')"
I am pulling drive times and distances using the Distance Matrix API in R. Currently I am using the mapdist function, which returns the time and distance for an origin and destination.
However, I also found the gmapsdistance function in R. This function allows further parameters, such as eliminating ferry/toll routes, etc. (which may become important in the future).
I was curious as to whether one is better than the other, so I used the same origin and destination to see whether the results are the same, using the code below:
mapdist(from = "02081", to = "89801", mode = "driving")
gmapsdistance(origin = "02081", destination = "89801", mode = "driving",
key = "XXXXXXXXXXXXXXXXXXX")
And the following outputs were returned:
# A tibble: 1 x 9
from to m km miles seconds minutes hours mode
<chr> <chr> <int> <dbl> <dbl> <int> <dbl> <dbl> <chr>
1 02081 89801 4185463 4185. 2601. 141708 2362. 39.4 driving
> gmapsdistance(origin = "02081", destination = "89801", mode = "driving",
+ key = "XXXXXXXXXXXXXXXX")
$Time
[1] 141630
$Distance
[1] 4185370
$Status
[1] "OK"
I thought the distances and times would be the same, but they are slightly off. Has anybody encountered this, and does anyone know why they return different values?
Also: is it possible to exclude toll routes/ferry routes using the mapdist function?
Is it possible to lock the global environment and still allow .Random.seed to be set or removed? The default behavior of lockEnvironment() is too aggressive for my use case.
lockEnvironment(globalenv())
rnorm(10)
#> Error in rnorm(10) : cannot add bindings to a locked environment
rm(.Random.seed)
#> Error in rm(.Random.seed) :
#> cannot remove bindings from a locked environment
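Creating .Random.seed before locking partially works, because lockEnvironment() without bindings = TRUE only prevents adding and removing bindings, not modifying existing ones; removal still fails, though:
set.seed(0)                  # creates .Random.seed in the global environment
lockEnvironment(globalenv())
rnorm(10)                    # works: the existing binding is merely updated
rm(.Random.seed)             # still fails: removal needs an unlocked environment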
Background
drake version 7.0.0 will have a new safeguard to protect reproducibility.
plan <- drake_plan(
x = {
data(mtcars)
mtcars$mpg
},
y = mean(x)
)
plan
#> # A tibble: 2 x 2
#> target command
#> <chr> <expr>
#> 1 x { data(mtcars) mtcars$mpg }
#> 2 y mean(x)
make(plan)
#> target x
#> fail x
#> Error: Target `x` failed. Call `diagnose(x)` for details. Error message:
#> cannot add bindings to a locked environment.
#> One of your targets tried to modify your environment,
#> which could invalidate other targets
#> and undermine reproducibility (example:
#> https://github.com/ropensci/drake/issues/664#issuecomment-453163562).
#> Beware <<-, ->>, attach(), data(), and side effects in general.
#> Use make(lock_envir = FALSE) to avoid this error (not recommended).
The error comes from the call to data(mtcars). The very act of building x would have changed x's dependencies. Without guardrails, the workflow invalidates itself.
make(plan, lock_envir = FALSE)
#> target x
#> target y
make(plan, lock_envir = FALSE)
#> target x
But with guardrails, we run into edge cases like https://github.com/ropensci/drake/issues/749 and https://github.com/ropensci/drake/issues/675#issuecomment-458222414.
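For the data(mtcars) case specifically, a side-effect-free variant of the plan sidesteps the lock entirely; a sketch using data()'s envir argument so the global environment stays untouched:
plan <- drake_plan(
  x = {
    e <- new.env()
    data(mtcars, envir = e)  # load into a local environment, not globalenv()
    e$mtcars$mpg
  },
  y = mean(x)
)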
I am trying to connect to a web API with a token and read JSON files from it using jsonlite:
library(httr)
library(jsonlite)
Token="xmYeeiLGrHJNaYBWgrMfLbpZOwNgpOAh"
source="https://www.ncdc.noaa.gov/cdo-web/api/v2/%s.json"
GET(source)
POST(source,token = "xmYeeiLGrHJNaYBWgrMfLbpZOwNgpOAh")
jsonFiles=fromJSON(source)
but I keep getting this error:
Error in open.connection(con, "rb") : HTTP error 400.
What could I be doing wrong?
Have a look at the code below; I adjusted the source since you had an invalid endpoint:
library(httr)
library(jsonlite)
Token <- "xmYeeiLGrHJNaYBWgrMfLbpZOwNgpOAh"
# Fetch all available datasets - https://www.ncdc.noaa.gov/cdo-web/webservices/v2#datasets
source <- "https://www.ncdc.noaa.gov/cdo-web/api/v2/datasets"
response <- GET(source, add_headers(token = Token))
#POST(source, token = Token) # Not needed
# Not actually fetching all datasets, but rather a list of all datasets
raw <- content(response, as="text")
results <- fromJSON(raw)
# Convert to dataframe/tibble
library(tidyverse)
df <- results$results %>% as_tibble()
head(df)
#> # A tibble: 6 x 6
#> uid mindate maxdate name datacoverage id
#> <chr> <chr> <chr> <chr> <dbl> <chr>
#> 1 gov.noaa.ncdc… 1763-01-… 2018-08-… Daily Summaries 1 GHCND
#> 2 gov.noaa.ncdc… 1763-01-… 2018-07-… Global Summary … 1 GSOM
#> 3 gov.noaa.ncdc… 1763-01-… 2018-01-… Global Summary … 1 GSOY
#> 4 gov.noaa.ncdc… 1991-06-… 2018-08-… Weather Radar (… 0.95 NEXRAD2
#> 5 gov.noaa.ncdc… 1994-05-… 2018-08-… Weather Radar (… 0.95 NEXRAD3
#> 6 gov.noaa.ncdc… 2010-01-… 2010-01-… Normals Annual/… 1 NORMAL…
Created on 2018-08-18 by the reprex package (v0.2.0).
Related to your original code:
source="https://www.ncdc.noaa.gov/cdo-web/api/v2/%s.json"
If this were indeed a valid endpoint (it isn't), I think what you would want is something like this:
filename <- "dataset_name"
source <- sprintf("https://www.ncdc.noaa.gov/cdo-web/api/v2/%s.json", filename)
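From there, actually pulling observations follows the same pattern; a sketch assuming the documented /data endpoint and its datasetid/startdate/enddate parameters:
response <- GET(
  "https://www.ncdc.noaa.gov/cdo-web/api/v2/data",
  query = list(
    datasetid = "GHCND",
    startdate = "2018-01-01",
    enddate   = "2018-01-31"
  ),
  add_headers(token = Token)
)
results <- fromJSON(content(response, as = "text"))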
I'm relatively new to R (and brand spanking new to scraping with R), so apologies in advance if I'm overlooking something obvious here!
I've been trying to learn how to scrape with RSelenium by following this tutorial: https://rawgit.com/petrkeil/Blog/master/2017_08_15_Web_scraping/web_scraping.html#advanced-scraping-with-rselenium
After running the following in Terminal (docker run -d -p 4445:4444 selenium/standalone-firefox), I tried to run the R code below, pulled with only slight modifications from the tutorial hyperlinked above:
get.tree <- function(genus, species)
{
# navigate to the page
browser <- remoteDriver(port=4445L)
browser$open(silent = TRUE)
browser$navigate("http://www.bgci.org/global_tree_search.php?sec=globaltreesearch")
browser$refresh()
# create r objects from the web search input and button elements
genusElem <- browser$findElement(using = 'id', value = "genus-field")
specElem <- browser$findElement(using = 'id', value = "species-field")
buttonElem <- browser$findElement(using = 'class', value = "btn_ohoDO")
# tell R to fill in the fields
genusElem$sendKeysToElement(list(genus))
specElem$sendKeysToElement(list(species))
# tell R to click the search button
buttonElem$clickElement()
# get output
out <- browser$findElement(using = "css", value = "td.cell_1O3UaG:nth-child(4)") # the country origin
out <- out$getElementText()[[1]] # extract actual text string
out <- strsplit(out, split = "; ")[[1]] # turns into character vector
# close browser
browser$close()
return(out)
}
# Now let's try it:
get.tree("Abies", "alba")
But after doing all that, I get the following error:
Selenium message:Failed to decode response from marionette Build info:
version: '3.6.0', revision: '6fbf3ec767', time:
'2017-09-27T16:15:40.131Z' System info: host: 'd260fa60d69b', ip:
'172.17.0.2', os.name: 'Linux', os.arch: 'amd64', os.version:
'4.9.49-moby', java.version: '1.8.0_131' Driver info: driver.version:
unknown
Error: Summary: UnknownError Detail: An unknown server-side error
occurred while processing the command. class:
org.openqa.selenium.WebDriverException Further Details: run
errorDetails method
Anyone have any idea what this means and where I went wrong?
Thanks very much for your help!
Just take advantage of the XHR request it makes to retrieve the in-line results and toss RSelenium:
library(httr)
library(tidyverse)
get_tree <- function(genus, species) {
GET(
url = sprintf("https://data.bgci.org/treesearch/genus/%s/species/%s", genus, species),
add_headers(
Origin = "http://www.bgci.org",
Referer = "http://www.bgci.org/global_tree_search.php?sec=globaltreesearch"
)
) -> res
stop_for_status(res)
matches <- content(res, flatten=TRUE)$results[[1]]
flatten_df(matches[c("id", "taxon", "family", "author", "source", "problems", "distributionDone", "note", "wcsp")]) %>%
mutate(geo = list(map_chr(matches$TSGeolinks, "country"))) %>%
mutate(taxas = list(map_chr(matches$TSTaxas, "checkTaxon")))
}
xdf <- get_tree("Abies", "alba")
xdf
## # A tibble: 1 x 8
## id taxon family author source distributionDone geo taxas
## <int> <chr> <chr> <chr> <chr> <chr> <list> <list>
## 1 58373 Abies alba Pinaceae Mill. WCSP Phans yes <chr [21]> <chr [45]>
glimpse(xdf)
## Observations: 1
## Variables: 8
## $ id <int> 58373
## $ taxon <chr> "Abies alba"
## $ family <chr> "Pinaceae"
## $ author <chr> "Mill."
## $ source <chr> "WCSP Phans"
## $ distributionDone <chr> "yes"
## $ geo <list> [<"Albania", "Andorra", "Austria", "Bulgaria", "Croatia", "Czech Republic", "Fr...
## $ taxas <list> [<"Abies abies", "Abies alba f. columnaris", "Abies alba f. compacta", "Abies a...
It's highly likely you'll need to modify get_tree() at some point but it's better than having Selenium or Splash or phantomjs or Headless Chrome as a dependency.
I am trying to retrieve information from an API that returns the name of a product given its barcode.
I am using httr::GET().
The URL needed for the API contains the barcode itself, but I do not know how to automate this so that it reads the barcode from every entry and plugs it into the URL, without me copying and pasting each code manually into the script.
one_code <- GET("api.upcdatabase.org/json/aebfed7a26f24a05efd7f77749dc2fcc/…")
result <- content(one_code)
result$description
A couple of extra things to consider.
First, the site provides https for the API, so you should use it; otherwise you're exposing your API key on any network you make requests from.
Test the core HTTP status code and halt on major HTTP errors (not API errors).
You should also put your API key in something like an environment variable so it never ends up in scripts or GitHub commits. Use ~/.Renviron: add a single line UPCDATABASE_API_KEY=your_key, then restart R.
You should handle error and success conditions and consider returning a data frame so you can have all the fields in a tidy, accessible fashion.
Finally, do some basic type conversion before returning the values, to make the returned fields easier to use.
library(httr)
library(jsonlite)
library(purrr)
get_upc_code_info <- function(code, api_key=Sys.getenv("UPCDATABASE_API_KEY")) {
URL <- sprintf("https://api.upcdatabase.org/json/%s/%s", api_key, code)
res <- GET(URL)
stop_for_status(res)
res <- content(res, as="text", encoding="UTF-8")
res <- fromJSON(res, flatten=TRUE)
if (res$valid == "true") {
res <- flatten_df(res)
res$valid <- TRUE
res$avg_price <- as.numeric(res$avg_price)
res$rate_up <- as.numeric(res$rate_up)
res$rate_down <- as.numeric(res$rate_down)
return(res)
} else {
message(res$reason)
return(data.frame(number = code, valid = FALSE, stringsAsFactors=FALSE))
}
}
xdf <- get_upc_code_info("0111222333446")
dplyr::glimpse(xdf)
## Observations: 1
## Variables: 8
## $ valid <lgl> TRUE
## $ number <chr> "0111222333446"
## $ itemname <chr> "UPC Database Testing Code"
## $ alias <chr> "Testing Code"
## $ description <chr> "http://upcdatabase.org/code/0111222333446"
## $ avg_price <dbl> 123.45
## $ rate_up <dbl> 14
## $ rate_down <dbl> 3
Similar to what Aurèle suggested, you can use the function above to make it easy to look up multiple codes. Since the function returns a data frame, you can build a larger, complete data frame from the individual lookups with purrr::map_df():
codes <- c("0057000006976", "3228881010711", "0817346023170", "44xx4444444")
xdf <- map_df(codes, get_upc_code_info)
dplyr::glimpse(xdf)
## Observations: 4
## Variables: 8
## $ valid <lgl> TRUE, TRUE, TRUE, FALSE
## $ number <chr> "0057000006976", "3228881010711", "0817346023170",...
## $ itemname <chr> "Heinz Original Beans (Pork & Molasses)", "Lip...
## $ alias <chr> "", "", "", NA
## $ description <chr> "", "Boîte de 20 sachets", "", NA
## $ avg_price <dbl> NA, NA, 39.99, NA
## $ rate_up <dbl> 0, 0, 1, NA
## $ rate_down <dbl> 0, 0, 0, NA
Consider putting finishing touches on this: add a function to POST to the API, perhaps make a Shiny app so folks can submit new entries through R, and turn it into a package. You might even get extra free credits on the site if you do.
Store your barcodes in one data structure (list or vector):
barcodes <- c(
"aebfed7a26f24a05efd7f77749dc2fcc",
"xyz1234567f24a05efd7f77749dc2fcc",
"pqr9876543f24a05efd7f77749dc2fcc"
)
Write a function:
scrape <- function(barcode) {
  # build the request URL around the barcode (note https, per the advice above)
  resp <- GET(paste0("https://api.upcdatabase.org/json/", barcode, "/rest/of/the/url"))
  result <- content(resp)
  result$description
}
And apply:
res <- lapply(barcodes, scrape)
The results are stored in a single list, so that they're easier to manipulate.
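Assuming every lookup succeeds and scrape() returns exactly one string per barcode, you can also collapse the results into a named character vector:
descriptions <- vapply(barcodes, scrape, FUN.VALUE = character(1))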