I'm relatively new to R (and brand spanking new to scraping with R), so apologies in advance if I'm overlooking something obvious here!
I've been trying to learn how to scrape with RSelenium by following this tutorial: https://rawgit.com/petrkeil/Blog/master/2017_08_15_Web_scraping/web_scraping.html#advanced-scraping-with-rselenium
After running the following in Terminal (docker run -d -p 4445:4444 selenium/standalone-firefox), I tried to run the R code below, pulled with only slight modifications from the tutorial hyperlinked above:
library(RSelenium)

get.tree <- function(genus, species) {
  # navigate to the page
  browser <- remoteDriver(port = 4445L)
  browser$open(silent = TRUE)
  browser$navigate("http://www.bgci.org/global_tree_search.php?sec=globaltreesearch")
  browser$refresh()
  # create R objects from the web search input and button elements
  genusElem <- browser$findElement(using = 'id', value = "genus-field")
  specElem <- browser$findElement(using = 'id', value = "species-field")
  buttonElem <- browser$findElement(using = 'class', value = "btn_ohoDO")
  # tell R to fill in the fields
  genusElem$sendKeysToElement(list(genus))
  specElem$sendKeysToElement(list(species))
  # tell R to click the search button
  buttonElem$clickElement()
  # get output
  out <- browser$findElement(using = "css", value = "td.cell_1O3UaG:nth-child(4)") # the country of origin
  out <- out$getElementText()[[1]]        # extract the actual text string
  out <- strsplit(out, split = "; ")[[1]] # turn it into a character vector
  # close browser
  browser$close()
  return(out)
}
# Now let's try it:
get.tree("Abies", "alba")
But after doing all that, I get the following error:
Selenium message:Failed to decode response from marionette Build info:
version: '3.6.0', revision: '6fbf3ec767', time:
'2017-09-27T16:15:40.131Z' System info: host: 'd260fa60d69b', ip:
'172.17.0.2', os.name: 'Linux', os.arch: 'amd64', os.version:
'4.9.49-moby', java.version: '1.8.0_131' Driver info: driver.version:
unknown
Error: Summary: UnknownError Detail: An unknown server-side error
occurred while processing the command. class:
org.openqa.selenium.WebDriverException Further Details: run
errorDetails method
Anyone have any idea what this means and where I went wrong?
Thanks very much for your help!
Just take advantage of the XHR request the page makes to retrieve the inline results, and toss RSelenium:
library(httr)
library(tidyverse)
get_tree <- function(genus, species) {
  GET(
    url = sprintf("https://data.bgci.org/treesearch/genus/%s/species/%s", genus, species),
    add_headers(
      Origin = "http://www.bgci.org",
      Referer = "http://www.bgci.org/global_tree_search.php?sec=globaltreesearch"
    )
  ) -> res
  stop_for_status(res)
  matches <- content(res, flatten = TRUE)$results[[1]]
  flatten_df(matches[c("id", "taxon", "family", "author", "source", "problems", "distributionDone", "note", "wcsp")]) %>%
    mutate(geo = list(map_chr(matches$TSGeolinks, "country"))) %>%
    mutate(taxas = list(map_chr(matches$TSTaxas, "checkTaxon")))
}
xdf <- get_tree("Abies", "alba")
xdf
## # A tibble: 1 x 8
## id taxon family author source distributionDone geo taxas
## <int> <chr> <chr> <chr> <chr> <chr> <list> <list>
## 1 58373 Abies alba Pinaceae Mill. WCSP Phans yes <chr [21]> <chr [45]>
glimpse(xdf)
## Observations: 1
## Variables: 8
## $ id <int> 58373
## $ taxon <chr> "Abies alba"
## $ family <chr> "Pinaceae"
## $ author <chr> "Mill."
## $ source <chr> "WCSP Phans"
## $ distributionDone <chr> "yes"
## $ geo <list> [<"Albania", "Andorra", "Austria", "Bulgaria", "Croatia", "Czech Republic", "Fr...
## $ taxas <list> [<"Abies abies", "Abies alba f. columnaris", "Abies alba f. compacta", "Abies a...
It's highly likely you'll need to modify get_tree() at some point, but it's better than having Selenium, Splash, PhantomJS, or headless Chrome as a dependency.
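Since the undocumented XHR endpoint can change or disappear, it can also help to wrap calls like get_tree() so that one failed lookup doesn't halt a loop over many species. A minimal base-R sketch (the wrapper name and the default value are my own, not part of the original code):

```r
# Wrap a scraping/API function so errors return a default value
# instead of stopping a loop over many lookups.
safely_default <- function(f, default = NULL) {
  function(...) {
    tryCatch(f(...), error = function(e) {
      message("lookup failed: ", conditionMessage(e))
      default
    })
  }
}

# Usage sketch (assuming get_tree() from above is defined):
# safe_get_tree <- safely_default(get_tree, default = NA_character_)
# safe_get_tree("Abies", "alba")
```

purrr::possibly() does the same job if you're already using the tidyverse.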
I have an ssh tunnel connection executed like:
target <- paste0("host:3306")
ProxySev <- "name@1.1.1.1"
keyfile <- "path/to/key"
port <- 3307
cmd <- paste0('ssh::ssh_tunnel(ssh::ssh_connect(host = "',
              ProxySev, '", keyfile = "', keyfile, '"), port = ', port, ', target = "', target, '")')
pid <- sys::r_background(
  args = c("-e", cmd),
  std_out = TRUE,
  std_err = TRUE
)
To close it I use tools::pskill(pid), but how can I do that without knowing the pid? E.g., if I don't properly close the previous connection and try to run again
pid <- sys::r_background(
  args = c("-e", cmd),
  std_out = TRUE,
  std_err = TRUE
)
I get the message:
Error: System failure for: bind() (Only one usage of each socket address (protocol/network address/port) is normally permitted)
Execution halted
Warning message:
Disconnecting from unused ssh session. Please use ssh_disconnect()
How to use ssh_disconnect() in such case? Or how to get pid of this background process?
Thanks!
I think you may be able to infer which process it is by using tools from the ps package. I'll demonstrate with Sys.sleep instead of an ssh connection, but the steps should be translatable.
pid <- sys::r_background(args=c("-e", "Sys.sleep(120)"), std_out = FALSE, std_err = FALSE)
procs <- ps::ps()
subset(procs, name == "Rterm.exe")
# # A tibble: 3 x 11
# pid ppid name username status user system rss vms created ps_handle
# <int> <int> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dttm> <I<list>>
# 1 22284 4224 Rterm.exe "myhost\\r2" running 0.0938 0.0156 47214592 68915200 2022-09-21 12:43:32.263 <ps_handl>
# 2 8472 14464 Rterm.exe "myhost\\r2" running 6.5 0.719 267501568 281235456 2022-09-20 21:31:30.642 <ps_handl>
# 3 4224 14464 Rterm.exe "myhost\\r2" running 4768. 272. 1433354240 31551180800 2022-09-17 22:46:15.056 <ps_handl>
pids <- subset(procs, name == "Rterm.exe")$pid
lapply(setNames(nm = pids), function(pid) ps::ps_cmdline(ps::ps_handle(pid)))
# $`22284`
# [1] "c:\\R\\R-4.1.2\\bin\\x64\\Rterm" "-e" "Sys.sleep(120)"
# $`8472`
# [1] "c:\\R\\R-4.1.2\\bin\\x64\\Rterm.exe" "--ess" "--no-save"
# $`4224`
# [1] "c:\\R\\R-4.1.2\\bin\\x64\\Rterm.exe" "--ess" "--no-save"
(tools::pskill(22284))
# [1] TRUE
I wrapped pskill in parens: since its return value (a logical indicating whether the process was found and killed) is invisible, the parens cause it to be printed.
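To combine the lookup and the kill into one step, the command-line filtering can be factored into a small helper. This is a base-R sketch; match_pids() is my own name, and it assumes the tunnel process's command line contains a distinctive string such as "ssh_tunnel":

```r
# Given a named list of command-line vectors (as returned by the
# ps::ps_cmdline() loop above, with pids as names), return the pids
# whose command line contains the given pattern.
match_pids <- function(cmdlines, pattern) {
  hits <- vapply(cmdlines, function(args) any(grepl(pattern, args, fixed = TRUE)), logical(1))
  as.integer(names(cmdlines)[hits])
}

# Usage sketch, building on the lapply() above:
# cmdlines <- lapply(setNames(nm = pids), function(p) ps::ps_cmdline(ps::ps_handle(p)))
# for (p in match_pids(cmdlines, "ssh_tunnel")) tools::pskill(p)
```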
I really like using the code_to_plan function when constructing drake plans. I also really like using target(..., format = "fst") for big files. However, I am struggling to combine these two workflows. For example, if I have this _drake.R file:
# Data --------------------------------------------------------------------
data_plan = code_to_plan("code/01-data/data.R")
join_plan = code_to_plan("code/01-data/merging.R")
# Cleaning ----------------------------------------------------------------
cleaning_plan = code_to_plan("code/02-cleaning/remove_na.R")
# Model -------------------------------------------------------------------
model_plan = code_to_plan("code/03-model/model.R")
# Combine Plans
dplan = bind_plans(
  data_plan,
  join_plan,
  cleaning_plan,
  model_plan
)
config <- drake_config(dplan)
This works fine when called with r_make(r_args = list(show = TRUE))
As I understand it, though, target can only be used within a drake_plan. If I try something like this:
dplan2 <- drake_plan(full_plan = target(dplan, format = "fst"))
config <- drake_config(dplan2)
I get an r_make error like this:
target full_plan
Error in fst::write_fst(x = value$value, path = tmp) :
Unknown type found in column.
In addition: Warning message:
You selected fst format for target full_plan, so drake will convert it from class c("drake_plan", "tbl_df", "tbl", "data.frame") to a plain data frame.
Error:
-->
in process 18712
See .Last.error.trace for a stack trace.
So ultimately my question is where does one specify special data formats for targets when you are using code_to_plan?
Edit
Using @landau's helpful suggestion, I defined this function:
add_target_format <- function(plan) {
  # Get a list of named commands.
  commands <- plan$command
  names(commands) <- plan$target
  # Turn it into a good plan.
  do.call(drake_plan, commands)
}
So that this would work:
dplan = bind_plans(
  data_plan,
  join_plan,
  cleaning_plan,
  model_plan
) %>%
  add_target_format()
It is possible, but not convenient. Here is a workaround.
writeLines(
  c(
    "x <- small_data()",
    "y <- target(large_data(), format = \"fst\")"
  ),
  "script.R"
)
cat(readLines("script.R"), sep = "\n")
#> x <- small_data()
#> y <- target(large_data(), format = "fst")
library(drake)
# Produces a plan, but does not process target().
bad_plan <- code_to_plan("script.R")
bad_plan
#> # A tibble: 2 x 2
#> target command
#> <chr> <expr>
#> 1 x small_data()
#> 2 y target(large_data(), format = "fst")
# Get a list of named commands.
commands <- bad_plan$command
names(commands) <- bad_plan$target
# Turn it into a good plan.
good_plan <- do.call(drake_plan, commands)
good_plan
#> # A tibble: 2 x 3
#> target command format
#> <chr> <expr> <chr>
#> 1 x small_data() <NA>
#> 2 y large_data() fst
Created on 2019-12-18 by the reprex package (v0.3.0)
I'm trying out roadoi to access Unpaywall from R, but no matter what I try to query for, I'm getting this response:
Error in UseMethod("http_error") : no applicable method for
'http_error' applied to an object of class "c('simpleError', 'error',
'condition')"
Running methods(http_error) gives me this:
[1] http_error.character* http_error.integer* http_error.response*
Could this be caused by me being behind an institutional firewall? (even so, it seems weird that this would be the response...)
Is there a way around it?
The http_error function (actually from the httr library) is a very simple generic: it loads a URL given as a character string (http_error.character), inspects a response object (http_error.response), and ultimately looks at the response code (http_error.integer). If the response code is >= 400, the function returns TRUE, otherwise FALSE.
What your error says is that you (or some function in your call chain) are trying to call http_error on a simpleError object. My guess is that your firewall settings block the request. Because the request is blocked, the underlying httr::RETRY (which is called from oadoi_fetch) returns an error object instead of a proper response, and http_error sees just this error object and breaks.
If I locally switch off my proxy (through which I can make requests) I also get an error:
library(roadoi)
Sys.unsetenv(c("HTTP_PROXY", "HTTPS_PROXY"))
oadoi_fetch("10.1038/nature12373", email = "name@whatever.com")
# Error in UseMethod("http_error") :
# no applicable method for 'http_error' applied to an object of class
# "c('simpleError', 'error', 'condition')"
As soon as my proxy is set properly I get
Sys.setenv(HTTPS_PROXY = my_proxy, HTTP_PROXY = my_proxy)
oadoi_fetch("10.1038/nature12373", email = "name@whatever.com")
# # A tibble: 1 x 16
# doi best_oa_location oa_locations data_standard is_oa genre journal_is_oa journal_is_in_d~ journal_issns journal_name publisher title year updated non_compliant authors
# <chr> <list> <list> <int> <lgl> <chr> <lgl> <lgl> <chr> <chr> <chr> <chr> <chr> <chr> <list> <list>
# 1 10.1038~ <tibble [1 x 10]> <tibble [4 x~ 2 TRUE journa~ FALSE FALSE 0028-0836,147~ Nature Springer ~ Nanometre-s~ 2013 2019-04-0~
If the problem lies indeed with the proxy, I would try the following, which helped me on my corporate Windows machine, but may be dependent on your local IT setting:
## get the proxy settings
system("netsh winhttp show proxy")
Sys.setenv(HTTP_PROXY = <the proxy from netsh>, HTTPS_PROXY = <the proxy from netsh>)
Actually, you can reproduce the error easily:
httr::http_error(simpleError("Cannot reach the page"))
# Error in UseMethod("http_error") :
# no applicable method for 'http_error' applied to an object of class
# "c('simpleError', 'error', 'condition')"
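If you want a clearer failure in your own wrapper code, you can guard against exactly this situation by checking for an error object before any http_error-style handling. A base-R sketch (check_response() is a name I made up; it doesn't touch roadoi internals):

```r
# Guard: only pass genuine responses onward; surface connection-level
# failures (e.g. a request blocked by a proxy/firewall) with a clear message.
check_response <- function(res) {
  if (inherits(res, "error")) {
    stop("request failed before reaching the server: ", conditionMessage(res))
  }
  res
}

# Usage sketch: capture the error instead of letting it propagate, then guard.
# res <- tryCatch(httr::RETRY("GET", some_url), error = function(e) e)
# check_response(res)
```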
I am trying to connect to a Web API with a token and read JSON files from it using jsonlite:
library(httr)
library(jsonlite)
Token="xmYeeiLGrHJNaYBWgrMfLbpZOwNgpOAh"
source="https://www.ncdc.noaa.gov/cdo-web/api/v2/%s.json"
GET(source)
POST(source,token = "xmYeeiLGrHJNaYBWgrMfLbpZOwNgpOAh")
jsonFiles=fromJSON(source)
but I keep getting this error:
Error in open.connection(con, "rb") : HTTP error 400.
What could I be doing wrong?
Have a look here. I adjusted the source since you had an invalid endpoint:
library(httr)
library(jsonlite)
Token <- "xmYeeiLGrHJNaYBWgrMfLbpZOwNgpOAh"
# Fetch all available datasets - https://www.ncdc.noaa.gov/cdo-web/webservices/v2#datasets
source <- "https://www.ncdc.noaa.gov/cdo-web/api/v2/datasets"
response <- GET(source, add_headers(token = Token))
#POST(source, token = Token) # Not needed
# Not actually fetching all datasets, but rather a list of all datasets
raw <- content(response, as="text")
results <- fromJSON(raw)
# Convert to dataframe/tibble
library(tidyverse)
df <- results$results %>% as_tibble()
head(df)
#> # A tibble: 6 x 6
#> uid mindate maxdate name datacoverage id
#> <chr> <chr> <chr> <chr> <dbl> <chr>
#> 1 gov.noaa.ncdc… 1763-01-… 2018-08-… Daily Summaries 1 GHCND
#> 2 gov.noaa.ncdc… 1763-01-… 2018-07-… Global Summary … 1 GSOM
#> 3 gov.noaa.ncdc… 1763-01-… 2018-01-… Global Summary … 1 GSOY
#> 4 gov.noaa.ncdc… 1991-06-… 2018-08-… Weather Radar (… 0.95 NEXRAD2
#> 5 gov.noaa.ncdc… 1994-05-… 2018-08-… Weather Radar (… 0.95 NEXRAD3
#> 6 gov.noaa.ncdc… 2010-01-… 2010-01-… Normals Annual/… 1 NORMAL…
Created on 2018-08-18 by the reprex package (v0.2.0).
Related to your original code:
source="https://www.ncdc.noaa.gov/cdo-web/api/v2/%s.json"
If this was indeed a valid endpoint (which it isn't), I think what you would want is something like this:
filename <- "dataset_name"
source <- sprintf("https://www.ncdc.noaa.gov/cdo-web/api/v2/%s.json", filename)
I am trying to retrieve information from an API, which gives the name of the product from the barcode, through the API.
I am using the httr::GET().
The URL needed for the API contains the barcode itself, but I do not know how to automate the system so it can read the barcode contained in every entry, and plugging it into the url without me copying and pasting the code manually in the script.
one_code <- GET("api.upcdatabase.org/json/aebfed7a26f24a05efd7f77749dc2fcc/…")
result <- content(one_code)
result$description
A couple of extra things to consider.
First, the site provides https for the API, so you should use it; otherwise you're exposing your API key on any network you make requests from.
Test the core HTTP status code and halt on major HTTP errors (not API errors).
You should also put your API key in something like an environment variable so it never ends up in scripts or GitHub repo commits. Use ~/.Renviron (make a single line entry for UPCDATABASE_API_KEY=your_key and then restart R).
You should handle error and success conditions and consider returning a data frame so you can have all the fields in a tidy, accessible fashion.
Finally, do some basic type conversion prior to returning the values to make return field values easier to use.
library(httr)
library(jsonlite)
library(purrr)
get_upc_code_info <- function(code, api_key = Sys.getenv("UPCDATABASE_API_KEY")) {
  URL <- sprintf("https://api.upcdatabase.org/json/%s/%s", api_key, code)
  res <- GET(URL)
  stop_for_status(res)
  res <- content(res, as = "text", encoding = "UTF-8")
  res <- fromJSON(res, flatten = TRUE)
  if (res$valid == "true") {
    res <- flatten_df(res)
    res$valid <- TRUE
    res$avg_price <- as.numeric(res$avg_price)
    res$rate_up <- as.numeric(res$rate_up)
    res$rate_down <- as.numeric(res$rate_down)
    return(res)
  } else {
    message(res$reason)
    return(data.frame(number = code, valid = FALSE, stringsAsFactors = FALSE))
  }
}
xdf <- get_upc_code_info("0111222333446")
dplyr::glimpse(xdf)
## Observations: 1
## Variables: 8
## $ valid <lgl> TRUE
## $ number <chr> "0111222333446"
## $ itemname <chr> "UPC Database Testing Code"
## $ alias <chr> "Testing Code"
## $ description <chr> "http://upcdatabase.org/code/0111222333446"
## $ avg_price <dbl> 123.45
## $ rate_up <dbl> 14
## $ rate_down <dbl> 3
Similar to what Aurèle suggested, you can use the function to make it easier to get multiple codes. Since this function returns a data frame, you can easily get a larger, complete data frame from individual lookups with purrr::map_df():
codes <- c("0057000006976", "3228881010711", "0817346023170", "44xx4444444")
xdf <- map_df(codes, get_upc_code_info)
dplyr::glimpse(xdf)
## Observations: 4
## Variables: 8
## $ valid <lgl> TRUE, TRUE, TRUE, FALSE
## $ number <chr> "0057000006976", "3228881010711", "0817346023170",...
## $ itemname <chr> "Heinz Original Beans (Pork & Molasses)", "Lip...
## $ alias <chr> "", "", "", NA
## $ description <chr> "", "Boîte de 20 sachets", "", NA
## $ avg_price <dbl> NA, NA, 39.99, NA
## $ rate_up <dbl> 0, 0, 1, NA
## $ rate_down <dbl> 0, 0, 0, NA
Consider putting finishing touches on this: adding a function to POST to the API, possibly making a Shiny app so folks can submit new entries through R, and turning it into a package. You might even get extra free credits on the site if you do so.
Store your barcodes in one data structure (list or vector):
barcodes <- c(
"aebfed7a26f24a05efd7f77749dc2fcc",
"xyz1234567f24a05efd7f77749dc2fcc",
"pqr9876543f24a05efd7f77749dc2fcc"
)
Write a function:
scrape <- function(barcode) {
  res <- GET(paste0("https://api.upcdatabase.org/json/", barcode, "/rest/of/the/url"))
  result <- content(res)
  result$description
}
And apply:
res <- lapply(barcodes, scrape)
The results are stored in a single list, so that they're easier to manipulate.
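For example, since scrape() returns a single description string (or NULL when a lookup fails), the list can be collapsed into a tidy data frame with base R alone; codes_to_df() here is a hypothetical helper name, not part of the answer above:

```r
# Collapse a list of per-barcode results into a data frame,
# keeping NULL (failed) lookups as NA.
codes_to_df <- function(barcodes, results) {
  data.frame(
    barcode = barcodes,
    description = vapply(
      results,
      function(x) if (is.null(x)) NA_character_ else as.character(x),
      character(1)
    ),
    stringsAsFactors = FALSE
  )
}

# Usage sketch:
# res <- lapply(barcodes, scrape)
# codes_to_df(barcodes, res)
```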