using Rvest to get table - web-scraping

I am trying to scrape the table at: WEB TABLE
I have tried copying the XPath, but it does not return anything:
require("rvest")
url = "https://www.barchart.com/options/stocks-by-sector?page=1"
pg = read_html(url)
pg %>% html_nodes(xpath = '//*[@id="main-content-column"]/div/div[4]/div/div[2]/div')
EDIT
I found the following link and feel I am getting closer. By watching the XHR requests in the browser's network tab, I found the updated request URL:
url = paste0("https://www.barchart.com?access_token=",token,"/proxies/core-api/v1/quotes/",
"get?lists=stocks.optionable.by_sector.all.us&fields=symbol%2CsymbolName",
"%2ClastPrice%2CpriceChange%2CpercentChange%2ChighPrice%2ClowPrice%2Cvolume",
"%2CtradeTime%2CsymbolCode%2CsymbolType%2ChasOptions&orderBy=symbol&orderDir=",
"asc&meta=field.shortName%2Cfield.type%2Cfield.description&hasOptions=true&page=1&limit=100&raw=1")
Where the token is found within the scope:
token = "eyJpdiI6IjJZMDZNOGYwUDk4dE1OcVc4ekdnUGc9PSIsInZhbHVlIjoib2lYcWtzRi9VN3ovbzdER2NhQlg0KzJQL1ZId2ZOeWpwSTF5YThlclN1SW9YSEtJbG9kR0FLbmRmWmtNcmd1eCIsIm1hYyI6ImU4ODA3YzZkZGUwZjFhNmM1NTE4ZjEzNmZkNThmZDY4ODE1NmM0YTM1Yjc2Y2E2OWVkNjZiZTE3ZDcxOGFlZjMifQ"
However, I do not know if I am placing the token where I should in the URL, but when I ran:
fixture <- jsonlite::read_json(url,simplifyVector = TRUE)
I received the following error:
Error in parse_con(txt, bigint_as_char) :
lexical error: invalid char in json text.
<!doctype html> <html itemscope
(right here) ------^

The token needs to be sent as a request header named x-xsrf-token, not passed as a URL parameter. Also, the token value can change between sessions, so you need to read it from the XSRF-TOKEN cookie. After that, convert the data to a data frame and get the result:
library(rvest)
# open a session so the server sets the XSRF-TOKEN cookie
pg <- html_session("https://www.barchart.com/options/stocks-by-sector?page=1")
cookies <- pg$response$cookies
token <- URLdecode(dplyr::recode("XSRF-TOKEN", !!!setNames(cookies$value, cookies$name)))
# call the JSON endpoint with the token sent as a request header
pg <- pg %>% rvest:::request_GET(
  "https://www.barchart.com/proxies/core-api/v1/quotes/get?lists=stocks.optionable.by_sector.all.us&fields=symbol%2CsymbolName%2ClastPrice%2CpriceChange%2CpercentChange%2ChighPrice%2ClowPrice%2Cvolume%2CtradeTime%2CsymbolCode%2CsymbolType%2ChasOptions&orderBy=symbol&orderDir=asc&meta=field.shortName%2Cfield.type%2Cfield.description&hasOptions=true&page=1&limit=1000000&raw=1",
  config = httr::add_headers(`x-xsrf-token` = token)
)
data_raw <- httr::content(pg$response)
# each element's $raw field holds one row; bind them all into a data frame
data <- purrr::map_dfr(data_raw$data, function(x) as.data.frame(x$raw))
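The final flattening step can be illustrated with a small mock of the response (the field names here are illustrative, matching the fields requested above):

```r
library(purrr)

# mock of data_raw$data: each element carries a $raw list of fields
mock <- list(
  list(raw = list(symbol = "AAPL", lastPrice = 170.5, volume = 1000)),
  list(raw = list(symbol = "MSFT", lastPrice = 330.1, volume = 2000))
)
# each $raw list becomes a one-row data frame; map_dfr row-binds them
data <- purrr::map_dfr(mock, function(x) as.data.frame(x$raw))
data
#>   symbol lastPrice volume
#> 1   AAPL     170.5   1000
#> 2   MSFT     330.1   2000
```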

Get API response after logging in using rvest

Using a browser, I can navigate to https://myaccount.draftkings.com/login and login using a username and password. Then if I navigate to https://api.draftkings.com/scores/v1/leaderboards/9000?format=json&embed=leaderboard in another tab, I can see the API response.
I want to do this programmatically.
There is a python package that I previously used successfully for this purpose. But for some reason, I can't get that package to work now.
I believe the python package hits an API endpoint with a post request to authenticate and saves the cookies to a file that are later used to make requests.
I tried sending a post request to the login endpoint the python package uses with httr2:
req <- httr2::request("https://api.draftkings.com/users/v3/providers/draftkings/logins?format=json") %>%
  httr2::req_method("POST") %>%
  httr2::req_headers(
    "Content-Type" = "application/json",
    "Accept" = "*/*",
    "Accept-Encoding" = "gzip, deflate, br"
  ) %>%
  httr2::req_body_json(
    list(
      "login" = "email",
      "password" = "password",
      "host" = "api.draftkings.com",
      "challengeResponse" = list("solution" = "", "type" = "Recaptcha")
    )
  )
req %>%
  httr2::req_perform()
req %>%
httr2::req_perform()
but I get a 403 error.
I also tried logging in using rvest:
library(rvest)
url <- "https://myaccount.draftkings.com/login"
session <- session(url)
form <- session %>%
  html_form() %>%
  magrittr::extract2(1)
form$action <- url
filled_form <- form %>%
  html_form_set(!!!list(EmailOrUsername = "user",
                        Password = "password"))
html_form_submit(filled_form, submit = 3)
session_jump_to(session, "https://www.draftkings.com/lobby")
but that didn't seem to do what I want either. Note that the form object returned an empty action element. I'm not sure what to replace the action element with, but if I leave it null I get an error.
I also tried saving the cookies from a browser session after logging into draftkings.com and passing all the cookies to a GET request with httr:
cook <- jsonlite::read_json("cookies.json")
clean_cook <- unlist(lapply(cook, function(x) stats::setNames(x$value, x$name)))
resp <- httr::GET(
  "https://api.draftkings.com/scores/v1/leaderboards/9000?format=json&embed=leaderboard",
  httr::set_cookies(.cookies = clean_cook)
)
httr::content(resp)
but this returns a 400 error code and the message "Invalid userKey.". This is the same error message I get if I clear my cache in the browser and then visit https://api.draftkings.com/scores/v1/leaderboards/9000?format=json&embed=leaderboard. I don't think the issue is related to URL encoding of the cookie values. I also tried restarting my RStudio session.
Update
I figured out how to successfully perform the GET request using the cookies saved from my browser and httr2:
cook <- jsonlite::read_json("cookies.json")
# collapse the cookies into a single "name=value;name=value" header string
clean_cook <- paste0(unlist(lapply(cook, function(x) paste0(x$name, "=", x$value))), collapse = ";")
req <- httr2::request(
  "https://api.draftkings.com/scores/v1/leaderboards/9000?format=json&embed=leaderboard"
)
req <- httr2::req_headers(req, "cookie" = clean_cook)
resp <- httr2::req_perform(req)
str(httr2::resp_body_json(resp))
For some reason, I was unable to get httr::set_cookies() to work. Passing cookies in the header (using httr::add_headers) was also not reliably successful using httr.

Updating Google Tag Manager container via API using R: invalidArgument error

I am trying to write an R script to programmatically update a Google Tag Manager container via the API, but I have hit a wall: the call keeps returning an invalidArgument error, and I can't quite figure out why.
The documentation for the API call is here:
https://developers.google.com/tag-manager/api/v2/reference/accounts/containers/update
Here's the code:
library(httr)
url_base <- 'https://www.googleapis.com/tagmanager/v2'
url_path <- paste('accounts', account_id, 'containers', container_id, sep = '/')
api_url <- paste(url_base, url_path, sep = '/')
# since the docs indicate the request body parameters are all optional, just send a new name
call <- PUT(api_url,
            add_headers(Authorization = paste("Bearer", gtm_token$credentials$access_token)),
            encode = 'json',
            body = list(name = 'new name'))
call_content <- content(call, 'parsed')
This is a pretty standard API call to the GTM API, and in fact I have written a bunch of functions for other GTM API methods that work in the same way, so I am a bit perplexed as to why this one keeps failing:
$error
$error$errors
$error$errors[[1]]
$error$errors[[1]]$domain
[1] "global"
$error$errors[[1]]$reason
[1] "invalidArgument"
$error$errors[[1]]$message
[1] "Bad Request"
$error$code
[1] 400
$error$message
[1] "Bad Request"
It seems like the issue is in the message body, but it's not clear whether the API expects different or additional parameters, given that the documentation suggests all of them are optional.
OK, so the documentation is lacking here. This works if you include at least a name. Here's a working function:
gtm_containers_update <- function(account_id, container_id, container_name, usage_context, domain_name, notes, token) {
  require(httr)
  token$refresh()
  # build the request url
  api_url <- paste('https://www.googleapis.com/tagmanager/v2', 'accounts', account_id, 'containers', container_id, sep = '/')
  # the request body with the required components
  call_body <- list(name = container_name,
                    usageContext = list(usage_context),
                    notes = notes,
                    domainName = domain_name)
  call <- PUT(api_url,
              add_headers(Authorization = paste("Bearer", token$credentials$access_token)),
              encode = 'json',
              body = call_body)
  print(paste('Status code:', call$status_code))
}
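A call might then look like this (all IDs and values below are placeholders; gtm_token is the OAuth token object from your existing auth flow):

```r
# hypothetical values: substitute your own account/container IDs and token
gtm_containers_update(account_id = "1234567",
                      container_id = "7654321",
                      container_name = "My Container",
                      usage_context = "web",
                      domain_name = "example.com",
                      notes = "Renamed via API",
                      token = gtm_token)
```

This requires a live token and real IDs, so it is not runnable as-is.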

Failing HTTP request to Google Sheets API using googleAuthR, httr, and jsonlite

I'm attempting to request data from a Google spreadsheet using googleAuthR. I need to use this library instead of Jenny Bryan's googlesheets because the request is part of a Shiny app with multiple-user authentication. When the request range does not contain spaces (e.g. "Sheet1!A:B"), the request succeeds. However, when the tab name contains spaces (e.g. "'Sheet 1'!A:B" or "\'Sheet 1\'!A:B"), the request fails and throws this error:
Request Status Code: 400
Error : lexical error: invalid char in json text.
<!DOCTYPE html> <html lang=en>
(right here) ------^
Mark Edmondson's googleAuthR uses jsonlite for parsing JSON. I assume this error is coming from jsonlite, but I'm at a loss for how to fix it. Here is a minimal example to recreate the issue:
library(googleAuthR)
# scopes
options("googleAuthR.scopes.selected" = "https://www.googleapis.com/auth/spreadsheets.readonly")
# client id and secret
options("googleAuthR.client_id" = "XXXX")
options("googleAuthR.client_secret" = "XXXX")
# request
get_data <- function(spreadsheetId, range) {
  l <- googleAuthR::gar_api_generator(
    baseURI = "https://sheets.googleapis.com/v4/",
    http_header = 'GET',
    path_args = list(spreadsheets = spreadsheetId,
                     values = range),
    pars_args = list(majorDimension = 'ROWS',
                     valueRenderOption = 'UNFORMATTED_VALUE'),
    data_parse_function = function(x) x)
  req <- l()
  req
}
# authenticate
gar_auth(new_user = TRUE)
# input
spreadsheet_id <- "XXXX"
range <- "'Sheet 1'!A:B"
# get data
df <- get_data(spreadsheet_id, range)
How should I format the range variable for the request to work? Thanks in advance for the help.
Use URLencode() to percent-encode spaces.
Details:
Using options(googleAuthR.verbose = 1) shows that the GET request was of the form:
GET /v4/spreadsheets/.../values/'Sheet 1'!A:B?majorDimension=ROWS&valueRenderOption=UNFORMATTED_VALUE HTTP/1.1
I had assumed the space would be encoded automatically, but apparently not. In this GitHub issue from August 2016, Mark states that URLencode() was going to become the default in later versions of googleAuthR. Not sure if that will still happen, but it's an easy fix in the meantime.
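For example, with the range from the question (by default URLencode() leaves the quotes, the exclamation mark, and the colon alone and only percent-encodes the space):

```r
range <- "'Sheet 1'!A:B"
# percent-encode the space before building the request path
utils::URLencode(range)
#> [1] "'Sheet%201'!A:B"
```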

trying to execute r web scraping code but giving an error

I am trying to execute the code below and get an error when joblocations is computed. The page URLs are loaded into ulrs, but the locations are not extracted from the web pages.
library(data.table)
library(XML)
pages <- c(1:12)
ulrs <- rbindlist(lapply(pages, function(x) {
  url <- paste("http://www.r-users.com/jobs/page/", x, "/", sep = " ")
  data.frame(url)
}), fill = TRUE)
joblocations <- rbindlist(apply(ulrs, 1, function(url) {
  doc1 <- htmlParse(url)
  locations <- getNodeSet(doc1, '//*[@id="mainContent"]/div[2]/ol/li/dl/dd[3]/span/text()')
  data.frame(sapply(locations, function(x) xmlValue(x)))
}), fill = TRUE)
Error: failed to load external entity "http://www.r-users.com/jobs/page/%201%20/"
First, changing http to https will get you past that error with XML, but it then leads to another: WARNING: XML content does not seem to be XML: 'https://www.r-users.com/jobs/page/1/', and it still doesn't work.
I tried again, swapping out use of the XML package for rvest and got it working.
library(data.table)
library(rvest)
pages <- c(1:12)
ulrs <- rbindlist(lapply(pages, function(x) {
  url <- paste0("http://www.r-users.com/jobs/page/", x, "/")
  data.frame(url)
}), fill = TRUE)
joblocations <- rbindlist(apply(ulrs, 1, function(url) {
  doc1 <- read_html(url)
  locations <- html_nodes(doc1, xpath = '//*[@id="mainContent"]/div[2]/ol/li/dl/dd[3]/span/text()')
  data.frame(sapply(locations, function(x) html_text(x)))
}))
rvest seems to work whether http or https is specified, but that is something that has tripped me up before.
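The root cause of the original "failed to load external entity" error is also visible in the URL itself: paste() with sep = " " inserts spaces around the page number, which get percent-encoded into a nonexistent path, while paste0() joins the pieces cleanly:

```r
# paste() with sep = " " puts spaces around every piece
paste("http://www.r-users.com/jobs/page/", 1, "/", sep = " ")
#> [1] "http://www.r-users.com/jobs/page/ 1 /"

# paste0() concatenates with no separator
paste0("http://www.r-users.com/jobs/page/", 1, "/")
#> [1] "http://www.r-users.com/jobs/page/1/"
```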

Importing data to google analytics with R

I would like to automate data uploads to Google Analytics with R, but cannot find a way to do it.
So far I have done this:
Get a Google auth token using the googleAuthR package:
token <- Authentication$public_fields$token
Generate url to the upload endpoint:
url.template <- "https://www.googleapis.com/upload/analytics/v3/management/accounts/%1$i/webproperties/%2$s/customDataSources/%3$s/uploads"
url <- sprintf(url.template, account.id, web.property.id, data.source)
Call POST using httr package:
httr::content_type("text/csv")
httr::POST(url = url,
           body = list(y = httr::upload_file("ga-product-import.csv")),
           config = token,
           encode = "multipart")
So far I am getting a 400 response.
I also tried this:
f <- gar_api_generator(url,
                       "POST",
                       data_parse_function = function(x) x)
f(the_body = list(y = httr::upload_file("ga-product-import.csv")))
but getting this error:
Error : No method asJSON S3 class: form_file
Error in the_request$status_code : $ operator is invalid for atomic vectors
The googleAnalyticsR library depends on googleAuthR and has a cost data upload function; see ?ga_custom_upload for help.
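A minimal sketch of that route (the upload helper is ga_custom_upload_file() in recent versions of googleAnalyticsR; the IDs below are placeholders, so this is not runnable as-is):

```r
library(googleAnalyticsR)

# authenticate (googleAnalyticsR uses googleAuthR under the hood)
ga_auth()

# hypothetical IDs: substitute your own account, property and data source
ga_custom_upload_file(accountId = "12345678",
                      webPropertyId = "UA-12345678-1",
                      customDataSourceId = "abcdefGHIJ",
                      upload = "ga-product-import.csv")
```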
