Web Scraping XHR Dynamic pages with rvest and R - r

I am attempting to scrape a dynamic website, Morningstar.com, via XHR requests.
The exact site I am scraping is: http://performance.morningstar.com/funds/etf/total-returns.action?t=SPY&region=USA&culture=en_US
What I am trying to scrape is the Quarterly performance number (1-month). The result should be 0.64 as of today.
try(res <- GET(url = "http://performance.morningstar.com/fund/performance-return.action",
query = list(
t="SPY",
region="usa",
culture="en-US"
)
))
tryCatch(x <- content(res) %>%
html_nodes(xpath = '//*[@id="tab-quar-end-content"]/table/tbody/tr[1]/td[1]') %>%
html_text() %>%
trimws() %>%
as.numeric()
, error = function(e) x <-NA)
However, the result is numeric(0)
Any idea what I am doing wrong?
Sody
Update:
I was able to get the html data with the following code:
try(res <- GET(url = "http://performance.morningstar.com/fund/performance-return.action",
query = list(
t = "SPY",
region = "usa",
culture = "en-US",
ops = "clear",
s = "0P0000J533",
ndec = "2",
ep = "true",
align = "q",
annlz = "true",
comparisonRemove = "false"
)
))
But I am still having problems pointing to the data using either the CSS selector or the xpath with rvest.
What do you use to find those data points? Is SelectorGadget still the go-to?
Cheers, Aaron

library(httr)
GET(
url = "http://performance.morningstar.com/perform/Performance/cef/trailing-total-returns.action",
add_headers(
Referer = "http://performance.morningstar.com/funds/etf/total-returns.action?t=SPY&region=USA&culture=en_US",
`X-Requested-With` = "XMLHttpRequest"
),
query = list(
t = "ARCX:SPY", region = "usa", culture = "en-US",
cur = "", ops = "clear", s = "0P00001MK8", ndec = "2", ep = "true",
align = "q", annlz = "true", comparisonRemove = "false",
benchmarkSecId = "", benchmarktype = ""
),
verbose()
) -> res
You have to target the XHR directly.
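Once the XHR response is in hand, the returned HTML fragment still has to be parsed. Below is a minimal sketch, assuming the fragment contains a table inside an element with id "tab-quar-end-content" as in the question. The html_fragment string is a hypothetical stand-in for content(res, as = "text"), not real Morningstar output. Note that XPaths copied from browser developer tools often include tbody, which the browser may insert itself and which may be absent from the raw HTML, so the sketch avoids it.

```r
library(rvest)

# Hypothetical stand-in for the HTML returned by the XHR request
html_fragment <- '<div id="tab-quar-end-content"><table>
  <tr><td> 0.64 </td><td>1.23</td></tr>
  <tr><td>2.10</td><td>3.45</td></tr>
</table></div>'

doc <- read_html(html_fragment)

# Select the first cell of the first row without relying on a tbody element
value <- doc %>%
  html_nodes(xpath = '//*[@id="tab-quar-end-content"]//tr[1]/td[1]') %>%
  html_text() %>%
  trimws() %>%
  as.numeric()

print(value)
```

If the selector returns numeric(0) on the real response, printing the raw text of the response first can help confirm that the table is actually present in it.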

The table is embedded using JavaScript, not hard-coded. You won't be able to scrape this data.

Related

Implement call retries with httr::RETRY() function in API call (R)

I use the UN Comtrade data API with R.
library(rjson)
get.Comtrade <- function(url="http://comtrade.un.org/api/get?"
,maxrec=50000
,type="C"
,freq="A"
,px="HS"
,ps="now"
,r
,p
,rg="all"
,cc="TOTAL"
,fmt="json"
)
{
string<- paste(url
,"max=",maxrec,"&" #maximum no. of records returned
,"type=",type,"&" #type of trade (c=commodities)
,"freq=",freq,"&" #frequency
,"px=",px,"&" #classification
,"ps=",ps,"&" #time period
,"r=",r,"&" #reporting area
,"p=",p,"&" #partner country
,"rg=",rg,"&" #trade flow
,"cc=",cc,"&" #classification code
,"fmt=",fmt #Format
,sep = ""
)
if(fmt == "csv") {
raw.data<- read.csv(string,header=TRUE)
return(list(validation=NULL, data=raw.data))
} else {
if(fmt == "json" ) {
raw.data<- fromJSON(file=string)
data<- raw.data$dataset
validation<- unlist(raw.data$validation, recursive=TRUE)
ndata<- NULL
if(length(data)> 0) {
var.names<- names(data[[1]])
data<- as.data.frame(t( sapply(data,rbind)))
ndata<- NULL
for(i in 1:ncol(data)){
data[sapply(data[,i],is.null),i]<- NA
ndata<- cbind(ndata, unlist(data[,i]))
}
ndata<- as.data.frame(ndata)
colnames(ndata)<- var.names
}
return(list(validation=validation,data =ndata))
}
}
}
However, it sometimes fails to connect to the server, and I need to run the code several times before it works. The solution given here, to use the RETRY() function, which retries a request until it succeeds, seems attractive.
However, I am having difficulty implementing this function in the code given above. Has anybody used it before and knows how to recode it?
An API call using httr::RETRY could look like the following:
library(httr)
library(jsonlite)
res <- RETRY(
verb = "GET",
url = "http://comtrade.un.org/",
path = "api/get",
encode = "json",
times = 3,
query = list(
max = 50000,
type = "C",
freq = "A",
px = "HS",
ps = "now",
r = 842,
p = "124,484",
rg = "all",
cc = "TOTAL",
fmt = "json"
)
)
# alternative: returns dataset as a `list`:
# parsed_content <- content(res, as = "parsed")
# returns dataset as a `data.frame`:
json_content <- content(res, as = "text")
parsed_content <- parse_json(json_content, simplifyVector = TRUE)
parsed_content$validation
parsed_content$dataset
I'd suggest rewriting the get.Comtrade function using httr:
get.Comtrade <- function(verb = "GET",
url = "http://comtrade.un.org/",
path = "api/get",
encode = "json",
times = 3,
max = 50000,
type = "C",
freq = "A",
px = "HS",
ps = "now",
r,
p,
rg = "all",
cc = "TOTAL",
fmt = "json") {
res <- httr::RETRY(
verb = verb,
url = url,
path = path,
encode = encode,
times = times,
query = list(
max = max,
type = type,
freq = freq,
px = px,
ps = ps,
r = r,
p = p,
rg = rg,
cc = cc,
fmt = fmt
)
)
jsonlite::parse_json(httr::content(res, as = "text"), simplifyVector = TRUE)
}
s1 <- get.Comtrade(r = "842", p = "124,484", times = 5)
print(s1)
Please see this and this for more information on library(httr).
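For readers wondering what retrying amounts to, here is a rough base R sketch of the idea: call a function, and on error wait and try again, up to a fixed number of attempts. The names with_retries, pause_base and flaky are hypothetical and only for illustration; httr::RETRY() has its own backoff logic and also looks at HTTP status codes, so this is not a description of its internals.

```r
# Call fn() up to `times` times, pausing longer after each failure
with_retries <- function(fn, times = 3, pause_base = 1) {
  for (attempt in seq_len(times)) {
    result <- tryCatch(fn(), error = function(e) e)
    if (!inherits(result, "error")) {
      return(result)
    }
    if (attempt < times) {
      # Exponential backoff: pause_base, 2 * pause_base, 4 * pause_base, ...
      Sys.sleep(pause_base * 2^(attempt - 1))
    }
  }
  stop("All ", times, " attempts failed: ", conditionMessage(result))
}

# Hypothetical flaky function that fails twice and then succeeds
calls <- 0
flaky <- function() {
  calls <<- calls + 1
  if (calls < 3) stop("connection failed")
  "ok"
}

result <- with_retries(flaky, times = 5, pause_base = 0)
print(result)
```

In practice httr::RETRY() is the simpler choice for HTTP requests; a helper like this is mainly useful for wrapping calls that do not go through httr, such as read.csv() on a URL.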

How do I use the rvcontinue parameter in the Mediawiki API using R?

I'm trying to extract the Wikipedia revision history of several hundred pages. However, the MediaWiki API limits each request to 500 revisions per page (https://www.mediawiki.org/wiki/API:Revisions).
The "rvcontinue" parameter allows you to extract the next 500 and so on, but I'm not sure how to automate this in R. (I've seen some examples of Python code (Why does the Wikipedia API Call in Python throw up a Type Error?), but I don't know how to replicate it in R).
A sample GET request code for one page is appended below, any help is appreciated!
base_url <- "http://en.wikipedia.org/w/api.php"
query_param <- list(action = "query",
pageids = "8091",
format = "json",
prop = "revisions",
rvprop = "timestamp|ids|user|userid|size",
rvlimit = "max",
rvstart = "2014-05-01T12:00:00Z",
rvend = "2021-12-30T23:59:00Z",
rvdir = "newer",
rvcontinue = #the continue value returned from the original request goes here
)
revision_hist <- GET(base_url, query_param)
Ideally my GET request would automatically update the rvcontinue parameter every 500 values until there are none left.
Thanks!
Edit 1
In your first response, you need to extract the value of rvcontinue to feed it into the second query. I'm still tinkering with the loop but here's the basics:
# Query 1
base_url <- "http://en.wikipedia.org/w/api.php"
query_param <- list(action = "query",
pageids = "8091",
format = "json",
prop = "revisions",
rvprop = "timestamp|ids|user|userid|size",
rvlimit = "max",
rvstart = "2014-05-01T12:00:00Z",
rvend = "2021-12-30T23:59:00Z",
rvdir = "newer"
)
r <- httr::GET(base_url, query = query_param)
parsed <- jsonlite::fromJSON(httr::content(r, as = "text"))
# Query 2
query_param2 <- list(action = "query",
pageids = "8091",
format = "json",
prop = "revisions",
rvprop = "timestamp|ids|user|userid|size",
rvlimit = "max",
rvstart = "2014-05-01T12:00:00Z",
rvend = "2021-12-30T23:59:00Z",
rvdir = "newer",
rvcontinue = parsed[["continue"]][["rvcontinue"]]
)
r2 <- httr::GET(base_url, query = query_param2)
parsed2 <- jsonlite::fromJSON(httr::content(r2, as = "text"))
Original answer
I haven't solved it completely, but I noticed that you're probably missing query = query_param in httr::GET(). Here I tried using rvcontinue = "rvcontinue", but that doesn't seem to work for now.
base_url <- "http://en.wikipedia.org/w/api.php"
query_param <- list(action = "query",
pageids = "8091",
format = "json",
prop = "revisions",
rvprop = "timestamp|ids|user|userid|size",
rvlimit = "max",
rvstart = "2014-05-01T12:00:00Z",
rvend = "2021-12-30T23:59:00Z",
rvdir = "newer",
rvcontinue = "rvcontinue"
)
response <- httr::GET(base_url, query = query_param)
parsed <- jsonlite::fromJSON(httr::content(response, as = "text"))
Here's my error message:
> print(parsed)
$error
$error$code
[1] "badcontinue"
$error$info
[1] "Invalid continue param. You should pass the original value returned by the previous query."
$error$`*`
[1] "See https://en.wikipedia.org/w/api.php for API usage. Subscribe to the mediawiki-api-announce mailing list at <https://lists.wikimedia.org/postorius/lists/mediawiki-api-announce.lists.wikimedia.org/> for notice of API deprecations and breaking changes."
$servedby
[1] "mw1398"
Posting a solution here if someone in the future is ever stuck with a similar problem:
library(httr)
library(jsonlite)
getRevHist <- function(titles) {
base_url= "http://en.wikipedia.org/w/api.php"
query_param = list ( action = "query",
titles = titles,
format = "json",
prop = "revisions",
rvprop = "timestamp|ids|user|userid",
rvlimit = "max",
rvstart = "2008-01-01T12:00:00Z",
rvend = "2021-12-31T23:59:00Z",
rvdir = "newer")
x <- GET(url = base_url, query = query_param)
y <- fromJSON(content(x, "text"))
revision_df <- y[[2]][[1]][[1]][[4]][c("revid", "parentid", "user", "userid", "timestamp")]
page_title <- y[[2]][[1]][[1]][[3]]
revision_df_title <- cbind(page_title, revision_df)
while ("||" %in% y[[1]]) {
continue_param <- y[[1]][[1]]
continue_query_param <- list ( action = "query",
titles = titles,
format = "json",
prop = "revisions",
rvprop = "timestamp|ids|user|userid",
rvlimit = "max",
rvstart = "2008-01-01T12:00:00Z",
rvend = "2021-12-31T23:59:00Z",
rvdir = "newer",
rvcontinue = continue_param)
x <- GET(url = base_url, query = continue_query_param)
y <- fromJSON(content(x, "text"))
continue_revision_df <- y[[2]][[1]][[1]][[4]][c("revid", "parentid", "user", "userid", "timestamp")]
revision_df <- rbind(revision_df, continue_revision_df)
revision_df_title <- cbind(page_title, revision_df)
}
return(revision_df_title)
}
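The same loop can be written against element names instead of positional indices, which may be less fragile if the response layout shifts. This is a sketch under assumptions: collect_revisions and fetch_wiki are hypothetical names, the response is assumed to have the continue and query$pages[[1]]$revisions elements seen in the answers above, and only a single page is handled per call. The fetcher is passed in as an argument so the loop can be tried without a network connection.

```r
# Repeatedly call fetch_page(params) until the response has no `continue` element
collect_revisions <- function(params, fetch_page) {
  chunks <- list()
  repeat {
    parsed <- fetch_page(params)
    chunks[[length(chunks) + 1]] <- parsed$query$pages[[1]]$revisions
    cont <- parsed[["continue"]]
    if (is.null(cont)) break
    # Feed every continuation field (e.g. rvcontinue, continue) into the next request
    params[names(cont)] <- cont
  }
  do.call(rbind, chunks)
}

# Fetcher for the real API (performs a network request when called)
fetch_wiki <- function(params) {
  resp <- httr::GET("https://en.wikipedia.org/w/api.php", query = params)
  httr::stop_for_status(resp)
  jsonlite::fromJSON(httr::content(resp, as = "text", encoding = "UTF-8"))
}

# Offline stand-in that returns two batches, for trying out the loop
fake_fetch <- function(params) {
  if (is.null(params$rvcontinue)) {
    list(continue = list(rvcontinue = "token-1", continue = "||"),
         query = list(pages = list(`8091` = list(revisions = data.frame(revid = 1:2)))))
  } else {
    list(query = list(pages = list(`8091` = list(revisions = data.frame(revid = 3L)))))
  }
}

all_revisions <- collect_revisions(list(action = "query", pageids = "8091"), fake_fetch)
print(all_revisions)
```

Replacing fake_fetch with fetch_wiki and supplying the full query_param list from the question would run the same loop against Wikipedia.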

Connecting to Admob with r

Is there smooth solution to query data from Google Admob API into R environment? (for e.g. similar to googleAnalyticsR package).
In case anyone is looking for an answer: the solution I worked out uses library(httr) and library(jsonlite), which are general-purpose packages for working with APIs.
First, set up your Google app for AdMob and generate OAuth credentials,
store them as shown below, and authorize a token.
Obtain Token:
options(admob.client_id = client.id)
options(admob.client_secret = key.secret)
# GETTING OAUTH TOKEN
token = oauth2.0_token(endpoint = oauth_endpoints("google"), # "google" is the standard endpoint name
app = oauth_app(appname = "google",
key = getOption('admob.client_id'),
secret = getOption("admob.client_secret")
),
scope = r"{https://www.googleapis.com/auth/admob.report}",
use_oob = TRUE,
cache = TRUE)
You can build your request body using the Google documentation available here:
https://developers.google.com/admob/api/v1/reference/rest/v1/accounts.networkReport/generate
I am passing mine as a list:
`json body` = toJSON(list(`reportSpec` = list(
`dateRange` = list(
`startDate` = list(
`year` = 2021,
`month` = 10,
`day` = 20),
`endDate` = list(
`year` = 2021,
`month` = 10,
`day` = 22
)),
`dimensions` = list("DATE"),
`metrics` = list("IMPRESSIONS")
)), auto_unbox = TRUE)
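If the report dates change often, the year/month/day lists in the body above can be generated from Date values rather than typed by hand. A small sketch, assuming the same reportSpec layout as above; date_to_list is a hypothetical helper name, not part of any package.

```r
# Convert a date into the year/month/day list used in the request body
date_to_list <- function(d) {
  d <- as.Date(d)
  list(
    year = as.integer(format(d, "%Y")),
    month = as.integer(format(d, "%m")),
    day = as.integer(format(d, "%d"))
  )
}

# Same structure as the hand-written body, built from two dates
report_spec <- list(reportSpec = list(
  dateRange = list(
    startDate = date_to_list("2021-10-20"),
    endDate = date_to_list("2021-10-22")
  ),
  dimensions = list("DATE"),
  metrics = list("IMPRESSIONS")
))

str(report_spec)
```

The resulting list can be passed to toJSON(..., auto_unbox = TRUE) in the same way as the hand-written version.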
Query your request:
test = POST(url = 'https://admob.googleapis.com/v1/accounts/YOURPUBIDGOESINHERE/networkReport:generate',
add_headers(`authorization` = paste("Bearer",
token$credentials$access_token)),
add_headers(`content-type` = "application/json"),
body = `json body`
)
Finalize with some data cleansing and you are done.
jsonText = content(test, as = 'text')
df = fromJSON(jsonText, flatten = T)
df = na.omit(df[c('row.metricValues.IMPRESSIONS.integerValue',
'row.dimensionValues.DATE.value')]) # select only needed columns
row.names(df) = NULL # reindex rows
df

Call Amadeus flight-offers-pricing API from R?

Update:
Here is code that shows how to get an access token. I also use the test api here which is free (no credit card required).
The first api call to test.api.amadeus.com/v2/shopping/flight-offers is shown.
It is the second api call to test.api.amadeus.com/v1/shopping/flight-offers/pricing api that I don't know how to format.
My question remains, what is the correct way to call the second API using R?
R Script
library("tidyverse")
library("httr")
library("rjson")
amadeus_api_key_prod <- Sys.getenv("AMADEUS_API_KEY")
amadeus_api_secret_prod <- Sys.getenv("AMADEUS_SECRET")
# Initialize variables
tmp_origin <- NULL
tmp_dest <- NULL
tmp_avg_total_fare <- NULL
# Get Token
response <- POST("https://test.api.amadeus.com/v1/security/oauth2/token",
add_headers("Content-Type" = "application/x-www-form-urlencoded"),
body = list(
"grant_type" = "client_credentials",
"client_id" = amadeus_api_key_prod,
"client_secret" = amadeus_api_secret_prod),
encode = "form")
response
rsp_content <- content(response, as = "parsed", type = "application/json")
access_token <- paste0("Bearer ", rsp_content$access_token)
origin <- "JFK"
dest <- "LHR"
dep_date <- "2021-12-01"
return_date <- "2021-12-18"
max_num_flights <- 1
url <- paste0("https://test.api.amadeus.com/v2/shopping/flight-offers?originLocationCode=",
origin,
"&destinationLocationCode=",
dest,
"&departureDate=",
dep_date,
"&returnDate=",
return_date,
"&max=",
max_num_flights,
"&adults=1&nonStop=false&travelClass=ECONOMY&max=1&currencyCode=CAD")
# Get flight info
response <- GET(url,
add_headers("Content-Type" = "application/x-www-form-urlencoded",
"Authorization" = access_token),
encode = "form")
response
rsp_content <- content(response, as = "parsed", type = "application/json")
rsp_content
# Get current, more detailed flight info
# This is the part I do not know how to do
url2 <- "https://test.api.amadeus.com/v1/shopping/flight-offers/pricing"
flt_info <- toJSON(rsp_content[["data"]])
response2 <- GET(url2,
add_headers("Authorization" = access_token),
body = list(
"priceFlightOffersBody" = flt_info
),
encode = "form")
response2
rsp_content2 <- content(response2, as = "parsed", type = "application/json")
rsp_content2
Original Question
I am using the Amadeus flight info api to retrieve prices for flights.
My understanding is that getting complete price info requires two steps:
1. Call to https://api.amadeus.com/v2/shopping/flight-offers api
2. Call to https://api.amadeus.com/v1/shopping/flight-offers/pricing api
I can successfully perform the first api call with origin, destination, dates, etc.
The second call confirms the pricing is still available and has a more detailed breakdown of the fare than the first api call does. It is this detailed breakdown that I am most interested in.
I am having trouble understanding what info that gets returned from the first api call needs to be passed to the second api call and in what format.
Below, I have included the data structure that gets returned from the first api call and my failed attempt to call the second api.
What is the correct way to call the second API using R?
Links to what I believe is the relevant documentation:
https://developers.amadeus.com/self-service/category/air/api-doc/flight-offers-price/api-reference
https://github.com/amadeus4dev/amadeus-code-examples/blob/master/flight_offers_price/v1/post/curl/flight_offers_price.sh
# Data structure returned from call to https://api.amadeus.com/v2/shopping/flight-offers
# For YYZ to YOW return, Dec 1-18 economy
rsp_content <- list(meta = list(count = 1L, links = list(self = "https://api.amadeus.com/v2/shopping/flight-offers?originLocationCode=YYZ&destinationLocationCode=YOW&departureDate=2021-12-01&returnDate=2021-12-18&max=1&adults=1&nonStop=false&travelClass=ECONOMY&max=1&currencyCode=CAD")),
data = list(list(type = "flight-offer", id = "1", source = "GDS",
instantTicketingRequired = FALSE, nonHomogeneous = FALSE,
oneWay = FALSE, lastTicketingDate = "2021-08-07", numberOfBookableSeats = 7L,
itineraries = list(list(duration = "PT1H10M", segments = list(
list(departure = list(iataCode = "YYZ", terminal = "3",
at = "2021-12-01T11:00:00"), arrival = list(iataCode = "YOW",
at = "2021-12-01T12:10:00"), carrierCode = "WS",
number = "3462", aircraft = list(code = "DH4"),
duration = "PT1H10M", id = "1", numberOfStops = 0L,
blacklistedInEU = FALSE))), list(duration = "PT1H21M",
segments = list(list(departure = list(iataCode = "YOW",
at = "2021-12-18T10:45:00"), arrival = list(iataCode = "YYZ",
terminal = "3", at = "2021-12-18T12:06:00"),
carrierCode = "WS", number = "3463", aircraft = list(
code = "DH4"), duration = "PT1H21M", id = "2",
numberOfStops = 0L, blacklistedInEU = FALSE)))),
price = list(currency = "CAD", total = "232.78", base = "125.00",
fees = list(list(amount = "0.00", type = "SUPPLIER"),
list(amount = "0.00", type = "TICKETING")), grandTotal = "232.78"),
pricingOptions = list(fareType = list("PUBLISHED"), includedCheckedBagsOnly = FALSE),
validatingAirlineCodes = list("WS"), travelerPricings = list(
list(travelerId = "1", fareOption = "STANDARD", travelerType = "ADULT",
price = list(currency = "CAD", total = "232.78",
base = "125.00"), fareDetailsBySegment = list(
list(segmentId = "1", cabin = "ECONOMY", fareBasis = "LAVD0TBJ",
brandedFare = "BASIC", class = "E", includedCheckedBags = list(
quantity = 0L)), list(segmentId = "2",
cabin = "ECONOMY", fareBasis = "LAVD0ZBI",
brandedFare = "BASIC", class = "E", includedCheckedBags = list(
quantity = 0L))))))), dictionaries = list(
locations = list(YOW = list(cityCode = "YOW", countryCode = "CA"),
YYZ = list(cityCode = "YTO", countryCode = "CA")),
aircraft = list(DH4 = "DE HAVILLAND DHC-8 400 SERIES"),
currencies = list(CAD = "CANADIAN DOLLAR"), carriers = list(
WS = "WESTJET")))
# Get full pricing info
url2 <- "https://api.amadeus.com/v1/shopping/flight-offers/pricing"
# Get pricing info
response2 <- GET(url2,
add_headers("Authorization" = access_token),
body = list(
"priceFlightOffersBody" = rsp_content[["data"]][[1]]
),
encode = "form")
response2
rsp_content2 <- content(response2, as = "parsed", type = "application/json")
rsp_content2
You can take a look at this blog article, it explains how the data needs to be passed between the 3 endpoints and it has a video showing it on Postman.
You can check some of the code samples the Amadeus for Developers team has built. Here for Flight Offers Price (in different programming languages but not R) and here for Flight Create Orders (that includes the previous steps of search and price).
They have a couple of demo applications as well that show you how to combine these endpoints to build a flight booking engine, one of them in Python that you can find here.
It seems to work with the following, for the second request.
Changes:
Use POST, not GET
While the documentation names the body priceFlightOffersBody, it should actually be just the data part, no need to encapsulate in another list.
The data part is not exactly the same as the result of the first request.
Pass the data part as an R list and use POST argument encode = "json".
url2 <- "https://test.api.amadeus.com/v1/shopping/flight-offers/pricing"
flt_info <- list(data = list(type = "flight-offers-pricing",
flightOffers = rsp_content$data))
response2 <- POST(url2,
add_headers(Authorization = access_token),
body = flt_info,
encode = "json")
rsp_content2 <- content(response2, as = "parsed", type = "application/json")
rsp_content2
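To check what will actually be sent, it can help to preview the JSON for the wrapped body before posting it. The sketch below uses jsonlite directly, on the assumption that this is close to what encode = "json" produces in httr. The offers list is a heavily trimmed, hypothetical stand-in for rsp_content$data from the first call.

```r
library(jsonlite)

# Trimmed stand-in for rsp_content$data (a list of flight offers)
offers <- list(
  list(type = "flight-offer", id = "1", source = "GDS")
)

# Wrap the offers the way the pricing endpoint expects
flt_info <- list(data = list(
  type = "flight-offers-pricing",
  flightOffers = offers
))

# Preview the request body as JSON text
json_preview <- toJSON(flt_info, auto_unbox = TRUE, pretty = TRUE)
cat(json_preview)
```

If the preview shows flightOffers as a JSON array of objects under data, the structure matches the shape described in the list of changes above.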

Twitter GET not working with since_id

Working in R, but that shouldn't really matter.
I want to gather all tweets after : https://twitter.com/ChrisChristie/status/663046613779156996
So Tweet ID : 663046613779156996
base = "https://api.twitter.com/1.1/statuses/user_timeline.json?"
## count
count = "count=200"
## contributor_details
contributor_details = "contributor_details=true"
## include_rts
include_rts = "include_rts=true"
## exclude_replies
exclude_replies = "exclude_replies=false"
queryName = "chrischristie"
query = paste("q=", queryName, sep="")
secondary_url = paste(query, count, contributor_details,include_rts,exclude_replies, sep="&")
final_url = paste(base, secondary_url, sep="")
timeline = GET(final_url, sig)
This (the above) works. There is no since_id. The URL comes out to be
"https://api.twitter.com/1.1/statuses/user_timeline.json?q=chrischristie&count=200&contributor_details=true&include_rts=true&exclude_replies=false"
The below does not, just by adding in the following
cur_since_id_url = "since_id=663046613779156996"
secondary_url = paste(query, count,
contributor_details,include_rts,exclude_replies,cur_since_id_url, sep="&")
final_url = paste(base, secondary_url, sep="")
timeline = GET(final_url, sig)
The URL for the above is
"https://api.twitter.com/1.1/statuses/user_timeline.json?q=chrischristie&count=200&contributor_details=true&include_rts=true&exclude_replies=false&since_id=663046613779156992"
This seems to work:
require(httr)
myapp <- oauth_app(
"twitter",
key = "......",
secret = ".......")
twitter_token <- oauth1.0_token(oauth_endpoints("twitter"), myapp)
req <- GET("https://api.twitter.com/1.1/statuses/user_timeline.json",
query = list(
screen_name="chrischristie",
count=10,
contributor_details=TRUE,
include_rts=TRUE,
exclude_replies=FALSE,
since_id=663046613779156992),
config(token = twitter_token))
content(req)
Have a look at GET statuses/user_timeline
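One thing worth checking with large tweet IDs: R stores numeric literals as doubles, which cannot represent every integer of this size exactly. The tweet ID in the question ends in ...996 while the ID in the printed URL ends in ...992, which is consistent with that kind of rounding. A cautious sketch, assuming the API accepts the ID as text in the query string, is to keep since_id as a character string throughout:

```r
# The tweet ID from the question, kept as a character string
since_id <- "663046613779156996"

# Round-tripping through a double changes the ID
as_double <- sprintf("%.0f", as.numeric(since_id))
ids_match <- identical(as_double, since_id)
print(ids_match)  # FALSE: the double cannot hold ...996 exactly

# Query list with the ID passed as a string, so no rounding can occur
query <- list(
  screen_name = "chrischristie",
  count = 10,
  since_id = since_id
)
str(query)
```

This list can be supplied as the query argument of GET() in place of the one shown above.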
