How to scrape with rvest? - web-scraping

I need to get three different numbers (in yellow, see picture) from this page:
https://www.scopus.com/authid/detail.uri?authorId=7006040753
I used the following code with rvest and SelectorGadget:
library(rvest)

site <- read_html("https://www.scopus.com/authid/detail.uri?authorId=7006040753")
hindex    <- site %>% html_node(".row3 .valueColumn span") %>% html_text()
documents <- site %>% html_node("#docCntLnk") %>% html_text()
citations <- site %>% html_node("#totalCiteCount") %>% html_text()
print(citations)
I can get the h-index and the documents, but the citations do not work. Can you help me?

Now I've found a solution: I noticed the value takes some time to load, so I added a short timeout to the PhantomJS script. It now works on my machine with the following R code:
setwd("path/to/phantomjs/bin")
system('phantomjs readexample.js')  # call the PhantomJS script (stored in phantomjs/bin)

totalCiteCount <- "rendered_page.html" %>%  # "rendered_page.html" is created by PhantomJS
  read_html() %>%
  html_nodes("#totalCiteCount") %>%
  html_text()

totalCiteCount
## [1] "52018"
The corresponding PhantomJS script file "readexample.js" looks like the following (kudos to https://www.r-bloggers.com/web-scraping-javascript-rendered-sites/):
var webPage = require('webpage');
var url = 'https://www.scopus.com/authid/detail.uri?authorId=7006040753';
var fs = require('fs');
var page = webPage.create();
var system = require('system');

page.settings.userAgent = 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)';

page.open(url, function (status) {
  setTimeout(function () {
    fs.write('rendered_page.html', page.content, 'w');
    phantom.exit();
  }, 2500);
});
The code throws the following errors in R, but at least the value is scraped correctly.
> system('phantomjs readexample.js')
TypeError: undefined is not a constructor (evaluating 'mutation.addedNodes.forEach')
  https://www.scopus.com/gzip_N1846232499/bundles/SiteCatalystTop.js:73
  :0 in forEach
  https://www.scopus.com/gzip_N1846232499/bundles/SiteCatalystTop.js:73
ReferenceError: Can't find variable: SDM
  https://www.scopus.com/gzip_N1729184664/bundles/AuthorProfileTop.js:73 in sendIndex
  https://www.scopus.com/gzip_N1729184664/bundles/AuthorProfileTop.js:67 in loadEvents
Using PhantomJS is quite convenient because you don't have to install anything (so it also works if you don't have admin rights on your machine). Simply download the .zip file and unpack it to any folder. Afterwards, set the working directory in R (setwd()) to the "phantomjs/bin" folder and it works.
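If you prefer not to change the working directory, you can also call the binary via its full path; a minimal sketch (the path below is only a placeholder for wherever you unpacked the .zip):
phantom_bin <- "C:/tools/phantomjs-2.1.1/bin/phantomjs"  # placeholder path -- point it at your own phantomjs/bin folder
# readexample.js (and the rendered_page.html it writes) are resolved relative to the current R working directory
system(paste(shQuote(phantom_bin), "readexample.js"))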
You can also modify the PhantomJS script from within R (iteratively if desired), e.g. to pass a different URL to the script on each pass of a loop. Example:
for (i in 1:n_urls) {
  url_var <- urls[i]  # assuming you have created a vector "urls" with multiple URLs before
  lines <- readLines("readexample.js")
  lines[2] <- paste0("var url ='", url_var, "';")  # exchange the code line holding the URL
  writeLines(lines, "readexample.js")              # the new URL is now stored in the PhantomJS script
  system('phantomjs readexample.js')
  # <any code> #
}
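For instance, the # <any code> # placeholder could parse the page PhantomJS just rendered and keep the citation count; a small sketch (assuming a result vector is created once before the loop):
cite_counts <- character(n_urls)  # create once, before the loop
# then, inside the loop, in place of "# <any code> #":
cite_counts[i] <- "rendered_page.html" %>%
  read_html() %>%
  html_node("#totalCiteCount") %>%
  html_text()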
Hope this brings you one step further!

Related

R - curl (not httr) POST request w/ JSON body

Let me start by saying that I understand how to do a POST request using the "httr" and "crul" packages. I am working on an asynchronous method for sending multiple POST requests, each with a unique JSON body, using the basic "curl" package. I have legitimate reasons for trying this with this package, but more importantly I'm just determined to get it to work. This may not be possible, or I may be trying the wrong functions in "curl"... but I wanted to see if anyone had any ideas.
I am trying to send a POST request using curl_fetch_multi() with a JSON body like this:
{
  "configuration": {
    "Id": 4507
  },
  "age": 0,
  "zip": 32411,
  "Date": "2020-12-23"
}
I have at least succeeded in getting error messages back from the API indicating an invalid body input, using something along the following lines (starting from an object containing each body I need to submit):
library(curl)
library(jsonlite)
library(magrittr)

pool <- new_pool()

# results are only available through the callback function
cb <- function(req) {
  cat("done:", req$url, ": HTTP:", req$status, "\n",
      "content:", rawToChar(req$content), "\n")
}

# create a request for each body
for (i in 1:nrow(df)) {
  curl_fetch_multi(
    "http://api.com/values?api_key=1234",
    done = cb,
    pool = pool,
    handle = new_handle() %>%
      handle_setopt(post = TRUE) %>%
      handle_setheaders("Content-Type" = "application/vnd.v1+json") %>%
      handle_setform(body = df$body[[i]])  # df$body[[i]] is a JSON string
  )
}

# this actually performs the requests
out <- multi_run(pool = pool)
done: http://api.com/values?api_key=1234 : HTTP: 400
content: {"errors":[{"code":"Service.input.invalid","message":"Invalid input"}]}
done: http://api.com/values?api_key=1234 : HTTP: 400
content: {"errors":[{"code":"Service.input.invalid","message":"Invalid input"}]}
....
I'm 90% positive it has to do with how the JSON is being passed in the handle_setform() call when setting up the handle. This is about where I'm in over my head, and documentation is scarce.
Also, I am pretty sure the JSON bodies are structured properly, as I can use them in other packages with no problem.
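(If you want to double-check that, jsonlite ships a validator; a quick sketch:)
# TRUE if the string parses as JSON, FALSE (with the parse error attached) otherwise
jsonlite::validate(df$body[[1]])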
Any assistance would be greatly appreciated.
Found the solution! I needed to use the following settings with handle_setopt():
for (i in 1:nrow(df)) {
  curl_fetch_multi(
    "http://api.com/values?api_key=1234",
    done = cb,
    pool = pool,
    handle = new_handle() %>%
      handle_setheaders("Content-Type" = "application/v1+json") %>%
      handle_setopt(customrequest = "POST") %>%
      handle_setopt(postfields = df$body[[i]])  # df$body is a list of JSON strings
  )
}

out <- multi_run(pool = pool)
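The key difference is that handle_setform() builds a multipart form body, whereas postfields sends the string as the raw request body, which is what a JSON API expects. If you still need to build the JSON strings themselves, here is a minimal jsonlite sketch (assuming df has columns matching the example body above):
library(jsonlite)

# one JSON string per row; auto_unbox = TRUE keeps length-one values as scalars
df$body <- lapply(seq_len(nrow(df)), function(i) {
  toJSON(list(
    configuration = list(Id = df$Id[i]),
    age  = df$age[i],
    zip  = df$zip[i],
    Date = df$Date[i]
  ), auto_unbox = TRUE)
})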

Using rvest to get a table

I am trying to scrape the table on this page: https://www.barchart.com/options/stocks-by-sector?page=1
I have tried copying the XPath, but it does not return anything:
require("rvest")
url = "https://www.barchart.com/options/stocks-by-sector?page=1"
pg = read_html(url)
pg %>% html_nodes(xpath="//*[#id=main-content-column]/div/div[4]/div/div[2]/div")
EDIT
I found the following link and feel I am getting closer. Using the same process, I found the updated link by watching the XHR requests:
url <- paste0("https://www.barchart.com?access_token=", token, "/proxies/core-api/v1/quotes/",
              "get?lists=stocks.optionable.by_sector.all.us&fields=symbol%2CsymbolName",
              "%2ClastPrice%2CpriceChange%2CpercentChange%2ChighPrice%2ClowPrice%2Cvolume",
              "%2CtradeTime%2CsymbolCode%2CsymbolType%2ChasOptions&orderBy=symbol&orderDir=",
              "asc&meta=field.shortName%2Cfield.type%2Cfield.description&hasOptions=true&page=1&limit=100&raw=1")
Where the token is found within the scope:
token = "eyJpdiI6IjJZMDZNOGYwUDk4dE1OcVc4ekdnUGc9PSIsInZhbHVlIjoib2lYcWtzRi9VN3ovbzdER2NhQlg0KzJQL1ZId2ZOeWpwSTF5YThlclN1SW9YSEtJbG9kR0FLbmRmWmtNcmd1eCIsIm1hYyI6ImU4ODA3YzZkZGUwZjFhNmM1NTE4ZjEzNmZkNThmZDY4ODE1NmM0YTM1Yjc2Y2E2OWVkNjZiZTE3ZDcxOGFlZjMifQ"
However, I do not know if I am placing the token where I should in the URL. When I ran:
fixture <- jsonlite::read_json(url,simplifyVector = TRUE)
I received the following error:
Error in parse_con(txt, bigint_as_char) :
lexical error: invalid char in json text.
<!doctype html> <html itemscope
(right here) ------^
The token needs to be sent as a request header named x-xsrf-token, not passed in the URL parameters.
Also, the token value can change between sessions, so you need to read it from the cookie. After that, convert the data to a data frame and get the result:
library(rvest)

pg <- html_session("https://www.barchart.com/options/stocks-by-sector?page=1")
cookies <- pg$response$cookies
token <- URLdecode(dplyr::recode("XSRF-TOKEN", !!!setNames(cookies$value, cookies$name)))

pg <- pg %>% rvest:::request_GET(
  "https://www.barchart.com/proxies/core-api/v1/quotes/get?lists=stocks.optionable.by_sector.all.us&fields=symbol%2CsymbolName%2ClastPrice%2CpriceChange%2CpercentChange%2ChighPrice%2ClowPrice%2Cvolume%2CtradeTime%2CsymbolCode%2CsymbolType%2ChasOptions&orderBy=symbol&orderDir=asc&meta=field.shortName%2Cfield.type%2Cfield.description&hasOptions=true&page=1&limit=1000000&raw=1",
  config = httr::add_headers(`x-xsrf-token` = token)
)

data_raw <- httr::content(pg$response)

# flatten the "raw" field of each record into one data frame
data <- purrr::map_dfr(
  data_raw$data,
  function(x) as.data.frame(x$raw)
)

In trouble with rvest

I am trying to scrape an HTML or JSON file from a site that references economists from around the world.
Here is an example of the page I am trying to exploit:
https://ideas.repec.org/f/pan296.html
More precisely, I am trying to scrape the data shown when clicking on the "Export references" button, in JSON, HTML or whatever format.
Here is what I do:
library(rvest)

test <- html_session("https://ideas.repec.org/f/pan296.html") %>%
  jump_to("https://ideas.repec.org/cgi-bin/refs.cgi")
test$response
The connection works fine, but the output is empty:
Response [https://ideas.repec.org/cgi-bin/refs.cgi]
Date: 2020-07-13 08:50
Status: 200
Content-Type: text/plain; charset=utf-8
<EMPTY BODY>
Any idea?
As Aziz said, you have to observe the POST request in order to reconstruct it. But in this situation that can be tricky, since the request opens in a new tab. See this topic for how to observe a request that opens in a new tab: Chrome Dev Tools: How to trace network for a link that opens a new tab?
The code to get the export content:
library(rvest)

url <- "https://ideas.repec.org/f/pan296.html"
pg <- html_session(url)

handle_value <- pg %>%
  html_node(xpath = "//form/input[@name='handle']") %>%
  html_attr("value")

pg <- pg %>% rvest:::request_POST(
  url = "https://ideas.repec.org/cgi-bin/refs.cgi",
  body = list("handle" = handle_value,
              "ref"    = "Export references ",
              "output" = "0")
)
pg$response
(Change the output value to get a different output format; 0 is HTML.)
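To pull the individual references out of the HTML that comes back, you can parse the response body; a rough sketch (the "li" selector is just a guess, so inspect the returned document to pick the right node):
refs_html <- httr::content(pg$response, as = "text", encoding = "UTF-8")
refs <- read_html(refs_html) %>% html_nodes("li") %>% html_text()
head(refs)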

Failing HTTP request to Google Sheets API using googleAuthR, httr, and jsonlite

I'm attempting to request data from a Google spreadsheet using googleAuthR. I need to use this library instead of Jenny Bryan's googlesheets because the request is part of a Shiny app with multiple-user authentication. When the request range does not contain spaces (e.g. "Sheet1!A:B"), the request succeeds. However, when the tab name contains spaces (e.g. "'Sheet 1'!A:B" or "\'Sheet 1\'!A:B"), the request fails and throws this error:
Request Status Code: 400
Error : lexical error: invalid char in json text.
<!DOCTYPE html> <html lang=en>
(right here) ------^
Mark Edmondson's googleAuthR uses jsonlite for parsing JSON. I assume this error is coming from jsonlite, but I'm at a loss for how to fix it. Here is a minimal example to recreate the issue:
library(googleAuthR)

# scopes
options("googleAuthR.scopes.selected" = "https://www.googleapis.com/auth/spreadsheets.readonly")

# client id and secret
options("googleAuthR.client_id" = "XXXX")
options("googleAuthR.client_secret" = "XXXX")

# request
get_data <- function(spreadsheetId, range) {
  l <- googleAuthR::gar_api_generator(
    baseURI = "https://sheets.googleapis.com/v4/",
    http_header = 'GET',
    path_args = list(spreadsheets = spreadsheetId,
                     values = range),
    pars_args = list(majorDimension = 'ROWS',
                     valueRenderOption = 'UNFORMATTED_VALUE'),
    data_parse_function = function(x) x)
  req <- l()
  req
}

# authenticate
gar_auth(new_user = TRUE)

# input
spreadsheet_id <- "XXXX"
range <- "'Sheet 1'!A:B"

# get data
df <- get_data(spreadsheet_id, range)
How should I format the range variable for the request to work? Thanks in advance for the help.
Use URLencode() to percent-encode spaces.
Details:
Using options(googleAuthR.verbose = 1) shows that the GET request was of the form:
GET /v4/spreadsheets/.../values/'Sheet 1'!A:B?majorDimension=ROWS&valueRenderOption=UNFORMATTED_VALUE HTTP/1.1
I had assumed the space would be encoded, but I guess not. In this github issue from August 2016, Mark states URLencode() was going to be the default for later versions of googleAuthR. Not sure if that will still happen, but it's an easy fix in the meantime.
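For the example above, that just means encoding the range before passing it to get_data(); a minimal sketch:
# URLencode() percent-encodes the space but leaves the quotes, "!" and ":" alone by default
range <- utils::URLencode("'Sheet 1'!A:B")   # "'Sheet%201'!A:B"
df <- get_data(spreadsheet_id, range)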

Trying to execute R web scraping code but getting an error

I am trying to execute the code below and get an error during the joblocations step. The pages are loaded into ulrs, but the locations are not extracted from the web page.
library(data.table)
library(XML)

pages <- c(1:12)

ulrs <- rbindlist(lapply(pages, function(x) {
  url <- paste("http://www.r-users.com/jobs/page/", x, "/", sep = " ")
  data.frame(url)
}), fill = TRUE)

joblocations <- rbindlist(apply(ulrs, 1, function(url) {
  doc1 <- htmlParse(url)
  locations <- getNodeSet(doc1, '//*[@id="mainContent"]/div[2]/ol/li/dl/dd[3]/span/text()')
  data.frame(sapply(locations, function(x) { xmlValue(x) }))
}), fill = TRUE)
Error: failed to load external entity "http://www.r-users.com/jobs/page/%201%20/"
First, changing http to https will get you past that error with XML, but it then leads to another: WARNING: XML content does not seem to be XML: 'https://www.r-users.com/jobs/page/1/', and it still doesn't work.
I tried again, swapping the XML package out for rvest, and got it working.
library(data.table)
library(rvest)

pages <- c(1:12)

ulrs <- rbindlist(lapply(pages, function(x) {
  url <- paste0("http://www.r-users.com/jobs/page/", x, "/")
  data.frame(url)
}), fill = TRUE)

joblocations <- rbindlist(apply(ulrs, 1, function(url) {
  doc1 <- read_html(url)
  locations <- html_nodes(doc1, xpath = '//*[@id="mainContent"]/div[2]/ol/li/dl/dd[3]/span/text()')
  data.frame(sapply(locations, function(x) { html_text(x) }))
}))
rvest seems to work whether http or https is specified, but that is something that has tripped me up before.
