Trying to execute R web scraping code but getting an error - r

I am trying to execute the code below and get an error when joblocations is evaluated. The pages are loaded into ulrs, but the locations are not extracted from the web page.
library(data.table)
library(XML)

pages <- 1:12
ulrs <- rbindlist(lapply(pages, function(x) {
  url <- paste("http://www.r-users.com/jobs/page/", x, "/", sep = " ")
  data.frame(url)
}), fill = TRUE)

joblocations <- rbindlist(apply(ulrs, 1, function(url) {
  doc1 <- htmlParse(url)
  locations <- getNodeSet(doc1, '//*[@id="mainContent"]/div[2]/ol/li/dl/dd[3]/span/text()')
  data.frame(sapply(locations, function(x) xmlValue(x)))
}), fill = TRUE)
Error: failed to load external entity "http://www.r-users.com/jobs/page/%201%20/"

The %201%20 in the failing URL comes from the sep = " " in paste(), which inserts spaces around the page number. Changing the http to https will get you past that error with XML, but it then leads to another: WARNING: XML content does not seem to be XML: 'https://www.r-users.com/jobs/page/1/', and it still doesn't work.
I tried again, swapping the XML package out for rvest, and got it working.
library(data.table)
library(rvest)

pages <- 1:12
ulrs <- rbindlist(lapply(pages, function(x) {
  url <- paste0("http://www.r-users.com/jobs/page/", x, "/")
  data.frame(url)
}), fill = TRUE)

joblocations <- rbindlist(apply(ulrs, 1, function(url) {
  doc1 <- read_html(url)
  locations <- html_nodes(doc1, xpath = '//*[@id="mainContent"]/div[2]/ol/li/dl/dd[3]/span/text()')
  data.frame(sapply(locations, function(x) html_text(x)))
}))
rvest seems to work whether http or https is specified, but that is something that has tripped me up before.
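For reference, a slightly more compact sketch of the same rvest approach (assuming the same XPath still matches the page); html_text() is vectorised over the node set, so the inner sapply() is not needed:

library(rvest)
library(data.table)

pages <- 1:12
urls <- paste0("https://www.r-users.com/jobs/page/", pages, "/")

joblocations <- rbindlist(lapply(urls, function(u) {
  # parse one results page and pull the location nodes
  doc <- read_html(u)
  nodes <- html_nodes(doc, xpath = '//*[@id="mainContent"]/div[2]/ol/li/dl/dd[3]/span')
  data.frame(location = html_text(nodes), stringsAsFactors = FALSE)
}), fill = TRUE)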

Related

R AzureGraph | How to get callRecords from Graph

With the caveat that this is my first time working with the Graph API: I can't seem to get a handle on how to retrieve callRecords with the AzureGraph package. I'm using an existing app registration.
I based my code on the package's vignette (whose examples use delegated authentication), but I'm not sure how to proceed...
This is what works:
library(AzureAuth)
library(AzureGraph)
library(tidyverse)
library(dplyr)
library(rjson)
library(jsonlite)
library(RCurl)
# Create graph login
gr <- create_graph_login()
# My info
me <- gr$get_user()
# App Authentication
AppID <- "xAppID"
TenantID <- "xTenantId"
Secret <- "xSecret"
App <- gr$get_app(AppID)
service <- App$get_service_principal()
# Graph API Information
Version <- getOption("azure_graph_api_version")
Endpoint <- "https://graph.microsoft.com"
Tok <- AzureAuth::get_azure_token(Endpoint, tenant = TenantID, app = AppID, password = Secret)
And this is where I get stuck:
firstpage <-call_graph_endpoint(Tok, "me/memberOf")
pager <- ms_graph_pager$new(Tok, firstpage)
pager$has_data()
pager$value
isFALSE(pager$has_data())
is.null(pager$value)
This gives me an error:
> firstpage <-call_graph_endpoint(Tok, "me/memberOf")
Error in process_response(res, match.arg(http_status_handler), simplify) :
Bad Request (HTTP 400). Failed to complete operation. Message:
/me request is only valid with delegated authentication flow.
> pager <- ms_graph_pager$new(Tok, firstpage)
> pager$has_data()
[1] FALSE
> pager$value
list()
> isFALSE(pager$has_data())
[1] TRUE
> is.null(pager$value)
[1] FALSE
I then tried to change the operation parameter, but I cannot understand what I'm supposed to fetch. This is one random attempt:
firstpage <-call_graph_endpoint(Tok, "$metadata#calls")
pager <- ms_graph_pager$new(Tok, firstpage)
pager$has_data()
pager$value
isFALSE(pager$has_data())
is.null(pager$value)
This doesn't give me any errors, but I don't know what to do with it:
Second way:
firstdf <- call_graph_endpoint(Tok, operation = "$metadata#calls",
                               options = list('$top' = 1), simplify = TRUE)
pager <- ms_graph_pager$new(Tok, firstdf)
df <- NULL
while (pager$has_data()) {
  df <- vctrs::vec_rbind(df, pager$value)
}
This throws me a bad request error.
So... my first problem is that I don't know what my operation string should look like.
Thanks in advance for throwing some light on this...
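Not a full answer, but a hedged sketch of the two pieces that usually matter here: with an app-only (client credentials) token there is no "me", so a request has to name a user explicitly (given suitable application permissions), and call records live under communications/callRecords, which requires the CallRecords.Read.All application permission. The user principal name and call record id below are placeholders:

library(AzureGraph)

# App-only equivalent of "me/memberOf": address a specific user instead of "me"
groups <- call_graph_endpoint(Tok, "users/someone@contoso.com/memberOf")

# Fetch a single call record by id; listing every call record in one request is
# normally not possible, they are usually discovered via change notifications
call_id <- "xCallRecordId"
rec <- call_graph_endpoint(Tok, paste0("communications/callRecords/", call_id))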

Why am I getting a Service Unavailable error when using lapply?

I am using the spotifyr library to find audio features for multiple tracks. For example, I can do this to find the audio features of a specific song using its id:
analysis2 <- get_track_audio_features("2xLMifQCjDGFmkHkpNLD9h",
                                      authorization = get_spotify_access_token())
Yesterday I wrote the function below, which takes all the tracks in a data frame, finds the audio features for each of them, and stores them in a list. It was working fine.
get_analysis <- function(track_id) {
  analysis <- get_track_audio_features(track_id,
                                       authorization = get_spotify_access_token())
}
tracks_list <- lapply(all_tracks$track.id, get_analysis)
Now I am getting an error saying Request failed [503] and Error in get_track_audio_features(track_id, authorization = get_spotify_access_token()) : Service Unavailable (HTTP 503).
I am still able to find the audio features of a specific song so I am not sure which service is unavailable.
I suspect you are hitting a track in your data for which Spotify denies the response. You could try adding an error-catching mechanism to see which one it is:
get_analysis <- function(track_id){
  tryCatch(
    expr = {
      get_track_audio_features(track_id, authorization = get_spotify_access_token())
    },
    error = function(e){
      print(track_id)
    }) -> analysis
  return(analysis)
}
tracks_list <- lapply(all_tracks$track.id, get_analysis)
I looked at the package's source code and didn't see any sneaky rate-limiting issues, and the Web API page lists error 503 as a generic error that resolves after waiting (https://developer.spotify.com/documentation/web-api/). So you could also try adding a 10-minute wait (I couldn't find the exact duration on Spotify's website):
get_analysis <- function(track_id){
  tryCatch(
    expr = {
      get_track_audio_features(track_id, authorization = get_spotify_access_token()) -> output
      return(output)
    },
    error = function(e){
      print(track_id)
      return(e)
    }) -> output
}

wait.function <- function(){
  Sys.sleep(600)
}

get_analysis_master <- function(all_tracks){
  k <- 1
  tracks_list <- list()
  for(track.id in all_tracks$track.id){
    get_analysis(track.id) -> output
    if(!inherits(output, "error")){
      tracks_list[[k]] <- output
      k <- k + 1
    } else {
      # on an error, wait before moving on to the next track
      wait.function()
    }
  }
  return(tracks_list)
}
get_analysis_master(all_tracks) -> tracks_list
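Since get_track_audio_features() returns a one-row data frame per track on success, the resulting list can then be collapsed into a single data frame, for example:

# combine the per-track feature rows into one data frame
features_df <- dplyr::bind_rows(tracks_list)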

Using rvest to get a table

I am trying to scrape the table at https://www.barchart.com/options/stocks-by-sector?page=1.
I have tried copying the XPath, but it does not return anything:
require("rvest")
url = "https://www.barchart.com/options/stocks-by-sector?page=1"
pg = read_html(url)
pg %>% html_nodes(xpath="//*[#id=main-content-column]/div/div[4]/div/div[2]/div")
EDIT
I feel I am getting closer. By watching the XHR requests, I found the updated link:
url = paste0("https://www.barchart.com?access_token=", token, "/proxies/core-api/v1/quotes/",
             "get?lists=stocks.optionable.by_sector.all.us&fields=symbol%2CsymbolName",
             "%2ClastPrice%2CpriceChange%2CpercentChange%2ChighPrice%2ClowPrice%2Cvolume",
             "%2CtradeTime%2CsymbolCode%2CsymbolType%2ChasOptions&orderBy=symbol&orderDir=",
             "asc&meta=field.shortName%2Cfield.type%2Cfield.description&hasOptions=true&page=1&limit=100&raw=1")
Where the token is found within the scope:
token = "eyJpdiI6IjJZMDZNOGYwUDk4dE1OcVc4ekdnUGc9PSIsInZhbHVlIjoib2lYcWtzRi9VN3ovbzdER2NhQlg0KzJQL1ZId2ZOeWpwSTF5YThlclN1SW9YSEtJbG9kR0FLbmRmWmtNcmd1eCIsIm1hYyI6ImU4ODA3YzZkZGUwZjFhNmM1NTE4ZjEzNmZkNThmZDY4ODE1NmM0YTM1Yjc2Y2E2OWVkNjZiZTE3ZDcxOGFlZjMifQ"
However, I do not know whether I am placing the token in the right spot in the URL. When I ran:
fixture <- jsonlite::read_json(url,simplifyVector = TRUE)
I received the following error:
Error in parse_con(txt, bigint_as_char) :
lexical error: invalid char in json text.
<!doctype html> <html itemscope
(right here) ------^
The token needs to be sent as a request header named x-xsrf-token, not passed as a URL parameter.
Also, the token value can change between sessions, so you need to read it from the cookie. After that, convert the returned data to a data frame to get the result:
library(rvest)

pg <- html_session("https://www.barchart.com/options/stocks-by-sector?page=1")
cookies <- pg$response$cookies
token <- URLdecode(dplyr::recode("XSRF-TOKEN", !!!setNames(cookies$value, cookies$name)))

pg <- pg %>% rvest:::request_GET(
  "https://www.barchart.com/proxies/core-api/v1/quotes/get?lists=stocks.optionable.by_sector.all.us&fields=symbol%2CsymbolName%2ClastPrice%2CpriceChange%2CpercentChange%2ChighPrice%2ClowPrice%2Cvolume%2CtradeTime%2CsymbolCode%2CsymbolType%2ChasOptions&orderBy=symbol&orderDir=asc&meta=field.shortName%2Cfield.type%2Cfield.description&hasOptions=true&page=1&limit=1000000&raw=1",
  config = httr::add_headers(`x-xsrf-token` = token)
)

data_raw <- httr::content(pg$response)

# build one row per quote from the "raw" field of each record
data <- purrr::map_dfr(
  data_raw$data,
  function(x){
    as.data.frame(x$raw)
  }
)
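As a quick sanity check of the result (purely illustrative; the column names come from the fields list in the request URL, so adjust them if that list is changed):

# peek at the assembled table
str(data)
head(data[, c("symbol", "symbolName", "lastPrice")])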

Failing HTTP request to Google Sheets API using googleAuthR, httr, and jsonlite

I'm attempting to request data from a Google spreadsheet using googleAuthR. I need to use this library instead of Jenny Bryan's googlesheets because the request is part of a Shiny app with multiple-user authentication. When the request range does not contain spaces (e.g. "Sheet1!A:B"), the request succeeds. However, when the tab name contains spaces (e.g. "'Sheet 1'!A:B" or "\'Sheet 1\'!A:B"), the request fails and throws this error:
Request Status Code: 400
Error : lexical error: invalid char in json text.
<!DOCTYPE html> <html lang=en>
(right here) ------^
Mark Edmondson's googleAuthR uses jsonlite for parsing JSON. I assume this error is coming from jsonlite, but I'm at a loss for how to fix it. Here is a minimal example to recreate the issue:
library(googleAuthR)
# scopes
options("googleAuthR.scopes.selected" = "https://www.googleapis.com/auth/spreadsheets.readonly")
# client id and secret
options("googleAuthR.client_id" = "XXXX")
options("googleAuthR.client_secret" = "XXXX")
# request
get_data <- function(spreadsheetId, range) {
  l <- googleAuthR::gar_api_generator(
    baseURI = "https://sheets.googleapis.com/v4/",
    http_header = 'GET',
    path_args = list(spreadsheets = spreadsheetId,
                     values = range),
    pars_args = list(majorDimension = 'ROWS',
                     valueRenderOption = 'UNFORMATTED_VALUE'),
    data_parse_function = function(x) x)
  req <- l()
  req
}
# authenticate
gar_auth(new_user = TRUE)
# input
spreadsheet_id <- "XXXX"
range <- "'Sheet 1'!A:B"
# get data
df <- get_data(spreadsheet_id, range)
How should I format the range variable for the request to work? Thanks in advance for the help.
Use URLencode() to percent-encode spaces.
Details:
Using options(googleAuthR.verbose = 1) shows that the GET request was of the form:
GET /v4/spreadsheets/.../values/'Sheet 1'!A:B?majorDimension=ROWS&valueRenderOption=UNFORMATTED_VALUE HTTP/1.1
I had assumed the space would be encoded, but apparently not. In a GitHub issue from August 2016, Mark states that URLencode() was going to become the default in later versions of googleAuthR. Not sure whether that will still happen, but it's an easy fix in the meantime.
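A minimal sketch of the fix: percent-encode the range before handing it to get_data(), so the space in the sheet name becomes %20 in the request path:

# "'Sheet 1'!A:B" becomes "'Sheet%201'!A:B"
range <- URLencode("'Sheet 1'!A:B", reserved = FALSE)
df <- get_data(spreadsheet_id, range)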

Importing data to Google Analytics with R

I would like to automate the data upload to Google Analytics with R, but cannot find a way to do it.
So far I have done this:
Get a Google auth token using the googleAuthR package:
token <- Authentication$public_fields$token
Generate the URL for the upload endpoint:
url.template <- "https://www.googleapis.com/upload/analytics/v3/management/accounts/%1$i/webproperties/%2$s/customDataSources/%3$s/uploads"
url <- sprintf(url.template, account.id, web.property.id, data.source)
Call POST using the httr package:
httr::content_type("text/csv")
httr::POST(url = url,
           body = list(y = httr::upload_file("ga-product-import.csv")),
           config = token,
           encode = "multipart"
)
So far I am getting a 400 response.
I also tried this:
f <- gar_api_generator(url,
                       "POST",
                       data_parse_function = function(x) x)
f(the_body = list(y = httr::upload_file("ga-product-import.csv")))
but getting this error:
Error : No method asJSON S3 class: form_file
Error in the_request$status_code : $ operator is invalid for atomic vectors
The googleAnalyticsR library depends on googleAuthR and includes a cost data upload function; see the help at ?ga_custom_upload.
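A hedged sketch of what that could look like (the function and argument names are from googleAnalyticsR's documentation as I remember it, so verify them against ?ga_custom_upload_file in your installed version):

library(googleAnalyticsR)

# authenticate (googleAnalyticsR uses googleAuthR under the hood)
ga_auth()

# upload the CSV to the custom data source
upload <- ga_custom_upload_file(accountId = account.id,
                                webPropertyId = web.property.id,
                                customDataSourceId = data.source,
                                filename = "ga-product-import.csv")
upload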
