Trouble with rvest - R

I am trying to scrape an HTML or JSON file from a site that references economists from around the world.
Here is an example of the page I am trying to exploit:
https://ideas.repec.org/f/pan296.html
More precisely, I am trying to scrape the data shown when clicking the "Export references" button, as JSON, HTML or whatever format is available.
Here is what I do:
test <- rvest::html_session("https://ideas.repec.org/f/pan296.html") %>% jump_to("https://ideas.repec.org/cgi-bin/refs.cgi")
test$response
The connection works, but the response body is empty:
Response [https://ideas.repec.org/cgi-bin/refs.cgi]
Date: 2020-07-13 08:50
Status: 200
Content-Type: text/plain; charset=utf-8
<EMPTY BODY>
Any ideas?

As Aziz said, you have to observe the POST request so you can reconstruct it. In this situation that can be tricky, because the request opens in a new tab. Follow this topic to see how to observe a request that opens in a new tab: Chrome Dev Tools: How to trace network for a link that opens a new tab?
The code to get the export content:
library(rvest)
url <- "https://ideas.repec.org/f/pan296.html"
pg <- html_session(url)
handle_value <- pg %>% html_node(xpath = "//form/input[@name='handle']") %>% html_attr("value")
pg <- pg %>% rvest:::request_POST(
  url = "https://ideas.repec.org/cgi-bin/refs.cgi",
  body = list("handle" = handle_value,
              "ref" = "Export references ",
              "output" = "0")
)
pg$response
(Change the "output" value to get a different output format; 0 is HTML.)
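To actually read the exported references out of the response, a minimal follow-up sketch (httr is the package rvest uses under the hood; nothing here is specific to the site beyond the POST above):
library(httr)
# Pull the exported references out of the response body as text
refs <- httr::content(pg$response, as = "text", encoding = "UTF-8")
cat(substr(refs, 1, 200))  # peek at the beginning
# If one of the other "output" values yields JSON, jsonlite::fromJSON(refs) could parse it.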

Related

R - curl (not httr) POST request w/ JSON body

Let me start by saying that I understand how to do a POST request using the "httr" and "crul" packages. I am working on developing an asynchronous method for sending multiple POST requests with unique JSON bodies using the basic "curl" package. I have legitimate reasons for trying this with this package, but more importantly I'm just determined to get it to work. This may not be possible, or I may even be trying the wrong functions in "curl"... but I wanted to see if anyone had any ideas.
I am trying to send a POST request using curl_fetch_multi() with a JSON body like this...
{
  "configuration": {
    "Id": 4507
  },
  "age": 0,
  "zip": 32411,
  "Date": "2020-12-23"
}
I have succeeded in at least getting error messages back from the API indicating an invalid body input, using something along the following lines, starting with an object containing each body I need to submit:
library(curl)
library(jsonlite)
library(magrittr)
pool <- new_pool()
# results are only available through the callback function
cb <- function(req) {
  cat("done:", req$url, ": HTTP:", req$status, "\n",
      "content:", rawToChar(req$content), "\n")
}
# Create request for each body
for (i in 1:nrow(df)) {
  curl_fetch_multi(
    "http://api.com/values?api_key=1234",
    done = cb,
    pool = pool,
    handle = new_handle() %>%
      handle_setopt(post = TRUE) %>%
      handle_setheaders("Content-Type" = "application/vnd.v1+json") %>%
      handle_setform(body = df$body[[i]])  # df$body[[i]] is a JSON string
  )
}
# This actually performs requests
out <- multi_run(pool = pool)
done: http://api.com/values?api_key=1234 : HTTP: 400
content: {"errors":[{"code":"Service.input.invalid","message":"Invalid input"}]}
done: http://api.com/values?api_key=1234 : HTTP: 400
content: {"errors":[{"code":"Service.input.invalid","message":"Invalid input"}]}
....
I'm 90% positive it has to do with how the JSON is being passed via handle_setform() on the handle. This is about where I am in over my head, and documentation is scarce.
Also, I am pretty sure the JSON is structured properly, as I can use them in other packages with no problem.
Any assistance would be greatly appreciated.
Found the solution!!
I needed to use the following settings with handle_setopt():
for (i in 1:nrow(df)) {
  curl_fetch_multi(
    "http://api.com/values?api_key=1234",
    done = cb,
    pool = pool,
    handle = new_handle() %>%
      handle_setheaders("Content-Type" = "application/v1+json") %>%
      handle_setopt(customrequest = "POST") %>%
      handle_setopt(postfields = df$body[[i]])  # df$body is a list of JSON strings
  )
}
out <- multi_run(pool = pool)
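For completeness, here is one way the df$body column of JSON strings could be prepared beforehand with jsonlite; the column names (Id, age, zip, Date) are hypothetical, chosen only to match the example body shown above:
library(jsonlite)
# Build one JSON string per row of df (hypothetical columns Id, age, zip, Date)
df$body <- vapply(seq_len(nrow(df)), function(i) {
  as.character(toJSON(
    list(
      configuration = list(Id = df$Id[i]),
      age  = df$age[i],
      zip  = df$zip[i],
      Date = df$Date[i]
    ),
    auto_unbox = TRUE
  ))
}, character(1))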

Using rvest to get a table

I am trying to scrape the table at: WEB TABLE
I have tried copying the XPath, but it does not return anything:
require("rvest")
url = "https://www.barchart.com/options/stocks-by-sector?page=1"
pg = read_html(url)
pg %>% html_nodes(xpath = "//*[@id='main-content-column']/div/div[4]/div/div[2]/div")
EDIT
I found the following link and feel I am getting closer....
By using the same process, I found the updated endpoint by watching the XHR requests:
url = paste0("https://www.barchart.com?access_token=",token,"/proxies/core-api/v1/quotes/",
"get?lists=stocks.optionable.by_sector.all.us&fields=symbol%2CsymbolName",
"%2ClastPrice%2CpriceChange%2CpercentChange%2ChighPrice%2ClowPrice%2Cvolume",
"%2CtradeTime%2CsymbolCode%2CsymbolType%2ChasOptions&orderBy=symbol&orderDir=",
"asc&meta=field.shortName%2Cfield.type%2Cfield.description&hasOptions=true&page=1&limit=100&raw=1")
Where the token is found within the scope:
token = "eyJpdiI6IjJZMDZNOGYwUDk4dE1OcVc4ekdnUGc9PSIsInZhbHVlIjoib2lYcWtzRi9VN3ovbzdER2NhQlg0KzJQL1ZId2ZOeWpwSTF5YThlclN1SW9YSEtJbG9kR0FLbmRmWmtNcmd1eCIsIm1hYyI6ImU4ODA3YzZkZGUwZjFhNmM1NTE4ZjEzNmZkNThmZDY4ODE1NmM0YTM1Yjc2Y2E2OWVkNjZiZTE3ZDcxOGFlZjMifQ"
However, I do not know if I am placing the token where I should in the URL, but when I ran:
fixture <- jsonlite::read_json(url,simplifyVector = TRUE)
I received the following error:
Error in parse_con(txt, bigint_as_char) :
lexical error: invalid char in json text.
<!doctype html> <html itemscope
(right here) ------^
The token needs to be sent as a request header named x-xsrf-token, not passed as a URL parameter.
Also, the token value may change between sessions, so you need to read it from the cookie. After that, convert the data to a data frame to get the result:
library(rvest)
pg <- html_session("https://www.barchart.com/options/stocks-by-sector?page=1")
cookies <- pg$response$cookies
token <- URLdecode(dplyr::recode("XSRF-TOKEN", !!!setNames(cookies$value, cookies$name)))
pg <- pg %>% rvest:::request_GET(
  "https://www.barchart.com/proxies/core-api/v1/quotes/get?lists=stocks.optionable.by_sector.all.us&fields=symbol%2CsymbolName%2ClastPrice%2CpriceChange%2CpercentChange%2ChighPrice%2ClowPrice%2Cvolume%2CtradeTime%2CsymbolCode%2CsymbolType%2ChasOptions&orderBy=symbol&orderDir=asc&meta=field.shortName%2Cfield.type%2Cfield.description&hasOptions=true&page=1&limit=1000000&raw=1",
  config = httr::add_headers(`x-xsrf-token` = token)
)
data_raw <- httr::content(pg$response)
data <-
  purrr::map_dfr(
    data_raw$data,
    function(x) {
      as.data.frame(x$raw)
    }
  )
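If you prefer base R over the dplyr::recode() trick, the same token can be read from the cookie data frame with plain subsetting (equivalent sketch):
token <- URLdecode(cookies$value[cookies$name == "XSRF-TOKEN"])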

R: fetching pdf documents from Companies House API

I'm trying to fetch documents from the API using R. I appreciate the clarification of the process in this post. I've been following the steps above with partial success, but I still fail at the last step, getting access to the documents' content:
Find the document filing you're interested in (e.g. make a filing history request for the company). Parse the response for the link to the document in the field "links" : { "document_metadata" : "link URI fragment here" }.
No problem:
library(httr)
library(jsonlite)
library(openssl)
library(stringr)  # for str_to_upper()
### retrieving filing history ####
company_num = 'FC013908'
key = 'my_key'
fh_path = paste0('/company/', str_to_upper(company_num), "/filing-history")
fh_url <- modify_url("https://api.companieshouse.gov.uk/", path = fh_path)
fh_test <- GET(fh_url, authenticate(key, "")) #status_code = 200
fh_parsed <- jsonlite::fromJSON(content(fh_test, "text",encoding = "utf-8"), flatten = TRUE)
docs <- fh_parsed$items
Done.
For a given document, request the document metadata via the CH Document API. Parse the response to get the document (mime) types available and the link to the actual document data (the document URI fragment).
No problems here:
md_meta_url = docs$links.document_metadata[1]
key_pass <- paste0(key,":")
decoded_auth <- paste0('Basic ', base64_encode(key_pass))
md_test <- GET(md_meta_url,
               add_headers(Authorization = decoded_auth))
md_test #status_code = 200!
md_parsed <- jsonlite::fromJSON(content(md_test, "text",encoding = "utf-8"), flatten = TRUE)
This way I can obtain the content URL:
cont_url = md_parsed$links$document
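As a quick sanity check before requesting the content, the metadata should also list the available mime types; assuming the parsed response carries the documented resources field, something like this shows them:
names(md_parsed$resources)  # e.g. "application/pdf"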
Request the actual document, specifying the mime type (e.g. "application/pdf").
I do this while NOT following the redirect and, as expected, I get a 302 status code with the Location header:
accept = 'application/pdf'
cont_test <- GET(cont_url,
                 add_headers(Authorization = decoded_auth,
                             Accept = accept),
                 config(followlocation = FALSE))
final_url <- cont_test$headers$location
> final_url
[1] "https://s3-eu-west-1.amazonaws.com/document-api-images-prod/docs/LjBouRHeXXpIYAvqYIPWL06iXaliPz6Pucp1OXCXQhI/application-pdf?AWSAccessKeyId=ASIAJX7TVURFXZTY5DNQ&Expires=1529483765&Signature=uUQx6RTW7XBLqx4L6pYr5tOUySg%3D&x-amz-security-token=FQoDYXdzEP%2F%2F%2F%2F%2F%2F%2F%2F%2F%2F%2FwEaDGxe7meYGe3OYhNwcSK3AwcVYJUXaUMf19oVO9s4qNPWN8AHjNNd5rrZhgE9YTkF1OmzyZSL5xHbls664kDP%2Bxd7dz9PIU5O1D%2BVxoDyoYcFiS6acDnO28KpfFE56lUZNfedf1jys%2FP0SJ8f%2F50Cbn93bfOlm0MZA9%2BQ2DYQvPfkWSvrDjMyCXHbu57gpZHjQKPNRTgzGXzUUCvFwREytGMM4eThhn4Glvvx%2FA8IiLbnsvgmEKw9iAj7KWIenhoJq3cTRytUpVeipLnQoBVLau8dFYkKdAHZaYM2Tlx0z6ObRb%2BGdm7W7eOVA1bFXuUXmUmnAHruDIwwLlgOVN2IJ9CxmJU22lY8jrEm%2BUivtrdp2oofn32PryBEJ8jJOg9cIpLbBBx%2FeOkng9zJwnZbute7Nmh%2BnaY2btsId6JjraFNsTvR%2B1qEZX9uuznUdJdqgVfTMj2gGrAmntwk0JAkILlvamzjWC%2F9vAqK7Xvt8aC6hlIMB2vdzTCU9Jf%2FrIMTClTJkk0BzBuvJ86t1l%2BXb4rF5Pab%2FegFpJ6nvZKqde%2F77wMMiTyG35EndmYx4AWqTIh9EofYwKZa9uciNvRT0E2%2BYnT5jZMo%2BdWn2QU%3D"
However, when I try to
Request this URI from Amazon again passing the content type you want again.
I get a 400 error:
final_test <- GET(final_url,
                  add_headers(Authorization = decoded_auth,
                              Accept = accept))
> final_test
Response [https://s3-eu-west-1.amazonaws.com/document-api-images-prod/docs/LjBouRHeXXpIYAvqYIPWL06iXaliPz6Pucp1OXCXQhI/application-pdf?AWSAccessKeyId=ASIAJX7TVURFXZTY5DNQ&Expires=1529483765&Signature=uUQx6RTW7XBLqx4L6pYr5tOUySg%3D&x-amz-security-token=FQoDYXdzEP%2F%2F%2F%2F%2F%2F%2F%2F%2F%2F%2FwEaDGxe7meYGe3OYhNwcSK3AwcVYJUXaUMf19oVO9s4qNPWN8AHjNNd5rrZhgE9YTkF1OmzyZSL5xHbls664kDP%2Bxd7dz9PIU5O1D%2BVxoDyoYcFiS6acDnO28KpfFE56lUZNfedf1jys%2FP0SJ8f%2F50Cbn93bfOlm0MZA9%2BQ2DYQvPfkWSvrDjMyCXHbu57gpZHjQKPNRTgzGXzUUCvFwREytGMM4eThhn4Glvvx%2FA8IiLbnsvgmEKw9iAj7KWIenhoJq3cTRytUpVeipLnQoBVLau8dFYkKdAHZaYM2Tlx0z6ObRb%2BGdm7W7eOVA1bFXuUXmUmnAHruDIwwLlgOVN2IJ9CxmJU22lY8jrEm%2BUivtrdp2oofn32PryBEJ8jJOg9cIpLbBBx%2FeOkng9zJwnZbute7Nmh%2BnaY2btsId6JjraFNsTvR%2B1qEZX9uuznUdJdqgVfTMj2gGrAmntwk0JAkILlvamzjWC%2F9vAqK7Xvt8aC6hlIMB2vdzTCU9Jf%2FrIMTClTJkk0BzBuvJ86t1l%2BXb4rF5Pab%2FegFpJ6nvZKqde%2F77wMMiTyG35EndmYx4AWqTIh9EofYwKZa9uciNvRT0E2%2BYnT5jZMo%2BdWn2QU%3D]
Date: 2018-06-20 08:37
Status: 400
Content-Type: application/xml
Size: 523 B
<BINARY BODY>
Needless to say, executing
browseURL(final_test$url)
returns an Access Denied error. I suspect it may have something to do with Amazon authorization problems similar to those described here. Any ideas how to clear this final hurdle?
Thanks!
The answer was provided by @voracityemail in response to my question on the Companies House Developers Hub. Basically, the final call doesn't require the Authorization header, so if you run the following for final_test:
final_test <- GET(final_url, add_headers(Accept = accept))
it will return a 200 status code:
> final_test
Response [https://s3-eu-west-1.amazonaws.com/document-api-images-prod/docs/Rl1qKy2kNqdskHUIsqU9u0bGzH2goTfJfnCrNg4S0lg/application-pdf?AWSAccessKeyId=ASIAJMG7NTZHYC4NH3MA&Expires=1530093768&Signature=EteMSmwXS%2FqqdOFRmYY%2Fgf187Aw%3D&x-amz-security-token=FQoDYXdzELf%2F%2F%2F%2F%2F%2F%2F%2F%2F%2FwEaDOMKrcNPR6jb5bnzGSK3A1yzaoVZWhgAeXYCN9WJnxx8b%2BTKCEEZyZui3aR5j0WoNWIQhW9GIQ8R4xTGVkRjwQIhzgDp%2BRCfXGQ0CfPCOfseaQri5m%2BWTEWBgjfToL7%2FMdcC1IINMTFRrih1APE%2FmmTcQaW7SvyZWv3Q4bVQB%2FtOsiX5k8rWVsT7%2FecfQmnJMljcKF0%2F3vDRTtLRURTCtrdegfnIFrSqXkelLxVVypKY9UeURBgxAgngOgoP7YhYt3wD%2BEz5rBdNfMvF1Zuv91hLGDyBaKuV4fRKMRXlymDHCwNgNZl3JeyuAmnX8pexK6PJzH7MerM8QX8LoPfge1yutvqEj0%2FjRSYEShOWUebecQ2tJqWIEOZly0Ji8fc%2BMtFDO1FWZBrMl6lXgkwTMpELnTH5%2BP4ULMdFfEz30bWSnAuTGXcAxsoFWsFTIE2uO35zgkOsAUT2un4UNGnL2S8XexWbgwq%2B%2Bhtxo9ruP9WA8mTpjBkup2Qe5EpvUiNwGX9APjThi7QFTllVWWvpKgzKTSBh%2Btua9xK8RgiNAYDgEa5k%2BH%2FmWIP56WglBE6r3HGsXgbi%2Bff8Rg8z2lVFLo8f9hVv%2BCYoptXM2QU%3D]
Date: 2018-06-27 10:02
Status: 200
Content-Type: application/pdf
Size: 21.7 kB
<BINARY BODY>
and then
browseURL(final_test$url)
will open the specified document in the browser. Victory!
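If you want to keep the PDF rather than just view it in the browser, a minimal sketch (the file name is arbitrary):
writeBin(httr::content(final_test, as = "raw"), "filing_document.pdf")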

Failing HTTP request to Google Sheets API using googleAuthR, httr, and jsonlite

I'm attempting to request data from a Google spreadsheet using googleAuthR. I need to use this library instead of Jenny Bryan's googlesheets because the request is part of a Shiny app with multiple-user authentication. When the request range does not contain spaces (e.g. "Sheet1!A:B"), the request succeeds. However, when the tab name contains spaces (e.g. "'Sheet 1'!A:B" or "\'Sheet 1\'!A:B"), the request fails and throws this error:
Request Status Code: 400
Error : lexical error: invalid char in json text.
<!DOCTYPE html> <html lang=en>
(right here) ------^
Mark Edmondson's googleAuthR uses jsonlite for parsing JSON. I assume this error is coming from jsonlite, but I'm at a loss for how to fix it. Here is a minimal example to recreate the issue:
library(googleAuthR)
# scopes
options("googleAuthR.scopes.selected" = "https://www.googleapis.com/auth/spreadsheets.readonly")
# client id and secret
options("googleAuthR.client_id" = "XXXX")
options("googleAuthR.client_secret" = "XXXX")
# request
get_data <- function(spreadsheetId, range) {
  l <- googleAuthR::gar_api_generator(
    baseURI = "https://sheets.googleapis.com/v4/",
    http_header = 'GET',
    path_args = list(spreadsheets = spreadsheetId,
                     values = range),
    pars_args = list(majorDimension = 'ROWS',
                     valueRenderOption = 'UNFORMATTED_VALUE'),
    data_parse_function = function(x) x)
  req <- l()
  req
}
# authenticate
gar_auth(new_user = TRUE)
# input
spreadsheet_id <- "XXXX"
range <- "'Sheet 1'!A:B"
# get data
df <- get_data(spreadsheet_id, range)
How should I format range variable for the request to work? Thanks in advance for the help.
Use URLencode() to percent-encode spaces.
Details:
Using options(googleAuthR.verbose = 1) shows that the GET request was of the form:
GET /v4/spreadsheets/.../values/'Sheet 1'!A:B?majorDimension=ROWS&valueRenderOption=UNFORMATTED_VALUE HTTP/1.1
I had assumed the space would be encoded, but I guess not. In this GitHub issue from August 2016, Mark states that URLencode() was going to become the default in later versions of googleAuthR. Not sure if that will still happen, but it's an easy fix in the meantime.
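For example, a minimal sketch; by default URLencode() leaves the quotes and the exclamation mark alone and only turns the space into %20:
range <- URLencode("'Sheet 1'!A:B")   # -> "'Sheet%201'!A:B"
df <- get_data(spreadsheet_id, range)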

How to scrape with rvest?

I need to get three different numbers (in yellow, see picture) from this page:
https://www.scopus.com/authid/detail.uri?authorId=7006040753
I used this code using rvest and inspectorgadget:
site=read_html("https://www.scopus.com/authid/detail.uri?authorId=7006040753")
hindex=site %>% html_node(".row3 .valueColumn span")%>% html_text()
documents=site %>% html_node("#docCntLnk")%>% html_text()
citations=site %>% html_node("#totalCiteCount")%>% html_text()
print(citations)
I can get the h-index and documents, but the citations do not work.
Can you help me?
Now I've found a solution - I noticed the value took some time to load, so I included a little timeout in the PhantomJS script. Now it works on my machine with the following R code:
setwd("path/to/phantomjs/bin")
system('phantomjs readexample.js') # call PhantomJS script (stored in phantomjs/bin)
totalCiteCount <- "rendered_page.html" %>% # "rendered_page.html" is created by PhantomJS
read_html() %>%
html_nodes("#totalCiteCount") %>%
html_text()
## totalCiteCount
## [1] "52018"
The corresponding PhantomJS script file "readexample.js" looks like the following (kudos to https://www.r-bloggers.com/web-scraping-javascript-rendered-sites/):
var webPage = require('webpage');
var url ='https://www.scopus.com/authid/detail.uri?authorId=7006040753';
var fs = require('fs');
var page = webPage.create();
var system = require('system');
page.settings.userAgent = 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)';
page.open(url, function (status) {
  setTimeout(function() {
    fs.write('rendered_page.html', page.content, 'w');
    phantom.exit();
  }, 2500);
});
The code throws the following errors in R, but at least the value is scraped correctly.
> system('phantomjs readexample.js')
TypeError: undefined is not a constructor (evaluating 'mutation.addedNodes.forEach')
  https://www.scopus.com/gzip_N1846232499/bundles/SiteCatalystTop.js:73
  :0 in forEach
  https://www.scopus.com/gzip_N1846232499/bundles/SiteCatalystTop.js:73
ReferenceError: Can't find variable: SDM
  https://www.scopus.com/gzip_N1729184664/bundles/AuthorProfileTop.js:73 in sendIndex
  https://www.scopus.com/gzip_N1729184664/bundles/AuthorProfileTop.js:67 in loadEvents
Using PhantomJS is quite convenient, as you don't have to install anything (so it also works if you don't have admin rights on your machine). Simply download the .zip file and unpack it to any folder. Afterwards, set the working directory in R (setwd()) to the "phantomjs/bin" folder and it works.
You can also modify the PhantomJS script (iteratively if desired) from R, e.g. to pass different URLs to the script in a loop. Example:
for (i in 1:n_urls) {
  url_var <- urls[i]  # assuming you have created a var "urls" with multiple URLs before
  lines <- readLines("readexample.js")
  lines[2] <- paste0("var url ='", url_var, "';")  # exchange the code line with the new URL
  writeLines(lines, "readexample.js")  # the new url is stored in the PhantomJS script
  system('phantomjs readexample.js')
  # <any code> #
}
Hope this brings you one step further.
