R Web scrape - Error

Okay, so I am stuck on what seems like it would be a simple web scrape. My goal is to scrape Morningstar.com to retrieve a fund name based on the entered URL. Here is an example of my code:
library(rvest)
url <- html("http://www.morningstar.com/funds/xnas/fbalx/quote.html")
url %>%
  read_html() %>%
  html_node('r_title')
I would expect it to return the name Fidelity Balanced Fund, but instead I get the following error: {xml_missing}
Suggestions?
Aaron
edit:
I also tried scraping via an XHR request, but I think my issue is not knowing what CSS selector or XPath to use to find the appropriate data.
XHR code:
library(httr)
library(rvest)

# htmlnode is a CSS selector; the data elements in this response are
# tagged with a "vkey" attribute
get.morningstar.Table1 <- function(Symbol.i, htmlnode){
  try(res <- GET(url = "http://quotes.morningstar.com/fundq/c-header",
                 query = list(
                   t = Symbol.i,
                   region = "usa",
                   culture = "en-US",
                   version = "RET",
                   test = "QuoteiFrame"
                 )))
  # assign via tryCatch so the fallback NA actually reaches x
  x <- tryCatch(
    content(res) %>%
      html_nodes(htmlnode) %>%
      html_text() %>%
      trimws(),
    error = function(e) NA
  )
  return(x)
} # the HTML node in this case is a vkey
Still, the same question stands: am I using the correct CSS selector/XPath to look things up? The XHR code works great for requests that have a clear CSS selector.
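For reference, here is a hypothetical call to the function above; the vkey selector is an assumption about the c-header response and would need to be verified against what the endpoint actually returns:
# Hypothetical usage; 'span[vkey="NAV"]' assumes the response tags its
# data elements with vkey attributes, so check the actual HTML first
get.morningstar.Table1("FBALX", 'span[vkey="NAV"]')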

OK, so it looks like the page dynamically loads the section you are targeting, so it doesn't actually get pulled in by read_html(). Interestingly, this part of the page also doesn't load using an RSelenium headless browser.
I was able to get this to work by scraping the page title (which is actually hidden on the page) and doing some regex to get rid of the junk:
library(rvest)

url <- 'http://www.morningstar.com/funds/xnas/fbalx/quote.html'
page <- read_html(url)

# the <title> tag is present in the static HTML even though it is hidden
title <- page %>%
  html_node('title') %>%
  html_text()

symbol <- 'FBALX'
regex <- paste0(symbol, " (.*) ", symbol, ".*")
cleanTitle <- gsub(regex, '\\1', title)
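To make the regex concrete, here is what it does to a hypothetical title string (the real title text may differ, so treat this as an illustration only):
# Hypothetical title; the actual <title> text on the page may differ
exampleTitle <- "FBALX Fidelity Balanced Fund FBALX Quote Price News"
gsub(regex, '\\1', exampleTitle)
# [1] "Fidelity Balanced Fund"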
As a side note, and for your future use, your first call to html_node() should include a "." before the class name you are targeting:
mypage %>%
  html_node('.myClass')
Again, this doesn't help in this specific case, since the page is failing to load the section we are trying to scrape.
A final note: other sites contain the same info and are easier to scrape (like Yahoo Finance).
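For instance, a minimal sketch against Yahoo Finance, assuming its quote page still carries the fund name at the start of the <title> (verify against the live page before relying on it):
library(rvest)
# Assumption: <title> looks like "Fidelity Balanced Fund (FBALX) ..."
yurl <- "https://finance.yahoo.com/quote/FBALX"
ytitle <- read_html(yurl) %>%
  html_node('title') %>%
  html_text()
sub(" \\(.*$", "", ytitle)  # drop everything from the first " (" onward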

Related

How do I extract certain html nodes using rvest?

I'm new to web scraping, so I may not be doing all the proper checks here. I'm attempting to scrape information from a URL; however, I'm not able to extract the nodes I need. See the sample code below. In this example, I want to get the product name (Madone SLR 9 eTap Gen 7), which appears to be stored in the buying-zone__title class.
library(tidyverse)
library(rvest)

url <- "https://www.trekbikes.com//us/en_US/bikes/road-bikes/performance-road-bikes/madone/madone-slr/madone-slr-9-etap-gen-7/p/37420"
read_html(url) %>%
  html_nodes(".buying-zone__title") %>%
  html_text()
When I run the code above, I get {xml_nodeset (0)}. How can I fix this? I would also like to scrape the year, price, available colors and specs from that page. Any help will be appreciated.
There is a lot of dynamic content on that page, which you can confirm by disabling JavaScript in the browser or by comparing the rendered page against the page source.
You can view the page source with Ctrl+U, then use Ctrl+F to search for where the product name exists within the non-rendered content.
The title info you want is present in lots of places, and there are numerous ways to obtain it. I will offer a "does what it says on the tin" option, as the code gives clear indications of what is being selected.
I've updated the syntax and reduced the volume of imported external dependencies.
library(magrittr)
library(rvest)

url <- "https://www.trekbikes.com//us/en_US/bikes/road-bikes/performance-road-bikes/madone/madone-slr/madone-slr-9-etap-gen-7/p/37420"
page <- read_html(url)

# product name is exposed as a product-name attribute in the static HTML
name <- page %>% html_element("[product-name]") %>% html_attr("product-name")
# spec tables carry the class "sprocket__table spec"
specs <- page %>% html_elements('[class="sprocket__table spec"]') %>% html_table()
# price sits in the value attribute of the #gtm-product-display-price element
price <- page %>% html_element('#gtm-product-display-price') %>% html_attr('value') %>% as.numeric()
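A quick sanity check of what came back (the actual output depends on the live page, so treat these as expectations rather than guarantees):
name           # expected: the product name string
price          # expected: a single numeric price
length(specs)  # specs is a list of tables parsed by html_table()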

Scraping frames in R without RSelenium?

I need to scrape the “manuscript received date” that is visible in the right-hand frame once you click “Information” at this page: https://onlinelibrary.wiley.com/doi/10.1002/jcc.26717 . I tried to use the rvest script listed below, which worked fine in similar situations. However, it does not work in this case, perhaps because of the click required to get to the publication history. I tried solving this issue by adding #pane-pcw-details to the URL (https://onlinelibrary.wiley.com/doi/10.1002/jcc.26717#pane-pcw-details), but to no avail. Another option would be to use RSelenium, but perhaps there is a simpler workaround?
library(rvest)

link <- c("https://onlinelibrary.wiley.com/doi/10.1002/jcc.26717#pane-pcw-details")
wiley_output <- data.frame()
page <- read_html(link)
revhist <- page %>%
  html_node(".publication-history li:nth-child(5)") %>%
  html_text()
wiley_output <- rbind(wiley_output, data.frame(link, revhist, stringsAsFactors = FALSE))
That data comes from an ajax call you can find in the network tab. It has a lot of querystring params, but you actually only need the document identifier, along with ajax=true, to ensure the return of data associated with the specified ajax action:
https://onlinelibrary.wiley.com/action/ajaxShowPubInfo?ajax=true&doi=10.1002/jcc.26717
library(rvest)
library(magrittr)

link <- 'https://onlinelibrary.wiley.com/action/ajaxShowPubInfo?ajax=true&doi=10.1002/jcc.26717'
page <- read_html(link)
page %>%
  html_node(".publication-history li:nth-child(5)") %>%
  html_text()
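If you need this for a batch of papers, here is a sketch that generalizes the same call over a vector of DOIs, reusing the rbind pattern from the question (continuing from the block above, with rvest and magrittr loaded):
# build the ajax URL from each DOI and collect the revision-history text
dois <- c("10.1002/jcc.26717")  # add further DOIs here
wiley_output <- data.frame()
for (doi in dois) {
  link <- paste0("https://onlinelibrary.wiley.com/action/ajaxShowPubInfo?ajax=true&doi=", doi)
  revhist <- read_html(link) %>%
    html_node(".publication-history li:nth-child(5)") %>%
    html_text()
  wiley_output <- rbind(wiley_output, data.frame(doi, revhist, stringsAsFactors = FALSE))
}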

How do I find html_node on search form?

I have a list of names (first name, last name, and date-of-birth) that I need to search the Fulton County Georgia (USA) Jail website to determine if a person is in or released from jail.
The website is http://justice.fultoncountyga.gov/PAJailManager/JailingSearch.aspx?ID=400
The site requires you enter a last name and first name, then it gives you a list of results.
I have found some Stack Overflow posts that have given me some direction, but I'm still struggling to figure this out. I'm using this post as an example to follow. I am using SelectorGadget to help figure out the CSS selectors.
Here is the code I have so far. Right now I can't figure out what html_node to use.
library(rvest)
# Specify URL
fc.url <- "http://justice.fultoncountyga.gov/PAJailManager/JailingSearch.aspx?ID=400"
# start session
jail <- html_session(fc.url)
# Grab initial form
form.unfilled <- jail %>% html_node("form")
form.unfilled
The result I get from form.unfilled is {xml_missing} <NA>, which I know isn't right.
I think if I can figure out the html_node value, I can proceed to using set_values and submit_form.
Thanks.
It appears that, on the initial call, the website opens onto "http://justice.fultoncountyga.gov/PAJailManager/default.aspx". Once the session is started, you should be able to jump to the search page:
library(rvest)
# Specify URL
fc.url <- "http://justice.fultoncountyga.gov/PAJailManager/JailingSearch.aspx?ID=400"
# start session
jail <- html_session("http://justice.fultoncountyga.gov/PAJailManager/default.aspx")
#jump to search page
jail2 <- jail %>% jump_to("http://justice.fultoncountyga.gov/PAJailManager/JailingSearch.aspx?ID=400")
#list the form's fields
html_form(jail2)[[1]]
# Grab initial form
form.unfilled <- jail2 %>% html_node("form")
Note: Verify that your actions are within the terms of service for the website. Many sites have policies against scraping.
The website relies heavily on JavaScript to render itself. When opening the link you provided in a fresh browser instance, you get redirected to http://justice.fultoncountyga.gov/PAJailManager/default.aspx, where you have to click the "Jail Records" link. This executes a bit of JavaScript to send you to the page with the form.
rvest is unable to execute arbitrary JavaScript. You might have to look at RSelenium. Selenium basically remote-controls a browser (for example, Firefox or Chrome), which executes the JavaScript as intended.
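A minimal sketch of that route (the port and browser choice are assumptions about your local setup, so adjust as needed):
library(RSelenium)
# start a local browser; rsDriver fetches the needed driver binaries
driver <- rsDriver(browser = "chrome", port = 4567L)
rd <- driver[["client"]]
rd$navigate("http://justice.fultoncountyga.gov/PAJailManager/default.aspx")
# click the "Jail Records" link so its JavaScript loads the search form
link <- rd$findElement(using = "link text", "Jail Records")
link$clickElement()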
Thanks to Dave2e.
Here is the code that works. This question is answered (but I'll post another one because I'm not getting a table of data as a result).
Note: I cannot find any Terms of Service on the site that I'm querying.
library(rvest)
# start session
jail <- html_session("http://justice.fultoncountyga.gov/PAJailManager/default.aspx")
#jump to search page
jail2 <- jail %>% jump_to("http://justice.fultoncountyga.gov/PAJailManager/JailingSearch.aspx?ID=400")
#list the form's fields
html_form(jail2)[[1]]
# Grab initial form
form.unfilled <- jail2 %>% html_node("form") %>% html_form()
form.unfilled
#name values
lname <- "DOE"
fname <- "JOHN"
# Fill the form with the name values
form.filled <- form.unfilled %>%
  set_values("LastName" = lname,
             "FirstName" = fname)
#Submit form
r <- submit_form(jail2, form.filled,
                 submit = "SearchSubmit")
#grab tables from submitted form
table <- r %>% html_nodes("table")
#grab a table with some data
table[[5]] %>% html_table()
# resulting text in this table:
# " An error occurred while processing your request.Please contact your system administrator."

How to pull a product link from customer profile page on Amazon

I'm trying to get the product link from a customer's profile page using R's rvest package.
I've referenced various questions on Stack Overflow, including here (could not read webpage with read_html using rvest package from r), but each time I try something, I'm not able to return the correct result.
For example on this profile page:
https://www.amazon.com/gp/profile/amzn1.account.AETT6GZORFV55BFNOAVFDIJ75QYQ/ref=cm_cr_dp_d_gw_tr?ie=UTF8
I'd like to be able to return this link, with the end goal to extract the product id: B01A51S9Y2
https://www.amazon.com/Amagabeli-Stainless-Chainmail-Scrubber-Pre-Seasoned/dp/B01A51S9Y2?ref=pf_vv_at_pdctrvw_dp
library(dplyr)
library(rvest)
library(stringr)
library(httr)
# get url
url <- 'https://www.amazon.com/gp/profile/amzn1.account.AETT6GZORFV55BFNOAVFDIJ75QYQ/ref=cm_cr_dp_d_gw_tr?ie=UTF8'
x <- GET(url, add_headers('user-agent' = 'test'))
page <- read_html(x)

page %>%
  html_nodes("[class='a-link-normal profile-at-product-box-link a-text-normal']") %>%
  html_text()

# I did a test to see if I could even find the href, with no luck
test <- page %>%
  html_nodes("#a-page") %>%
  html_text()
grepl("B01A51S9Y2", test)
Thanks for the tip, @Qharr, on RSelenium. That is helpful, but I'm still unsure how to extract the link or ASIN.
library(RSelenium)
driver <- rsDriver(browser=c("chrome"), port = 4574L, chromever = "77.0.3865.40")
rd <- driver[["client"]]
rd$open()
rd$navigate("https://www.amazon.com/gp/profile/amzn1.account.AETT6GZORFV55BFNOAVFDIJ75QYQ/ref=cm_cr_arp_d_gw_btm?ie=UTF8")
prod <- rd$findElement(using = "css", '.profile-at-product-box-link')
prod$getElementText
This doesn't really return anything.
I added the getElementAttribute('href') call and was able to get the link:
prod <- rd$findElements(using = "css selector", '.profile-at-product-box-link')
for (link in seq_along(prod)) {
  print(prod[[link]]$getElementAttribute('href'))
}
That info is pulled in dynamically from a POST request the page makes, which your initial rvest request doesn't capture. This subsequent request returns, in JSON format, the content governing the ASINs, the product links, etc.
You can find it in the network tab of dev tools (F12). Press F5 to refresh the page, then examine the network traffic.
It is not a simple POST request to mimic, so I would just go with RSelenium to let the page render and then use the CSS selector
.profile-at-product-box-link
to gather a webElements collection you can loop over, extracting the href attribute from each element.
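Once you have an href, pulling the ASIN out is a one-liner; here is a sketch using the example link from the question, relying on the /dp/<ASIN> structure of Amazon product URLs:
# Amazon product URLs embed the 10-character ASIN after /dp/
href <- "https://www.amazon.com/Amagabeli-Stainless-Chainmail-Scrubber-Pre-Seasoned/dp/B01A51S9Y2?ref=pf_vv_at_pdctrvw_dp"
sub(".*/dp/([A-Z0-9]{10}).*", "\\1", href)
# [1] "B01A51S9Y2"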

Rvest is unable to find the node specified by css selector, how do I fix it?

I am scraping data from this website and, for some reason, I'm unable to get the name of the seller, even though I use the exact node returned by SelectorGadget. I have, however, managed to get all the other data with rvest.
I managed to scrape the seller's name with RSelenium, but that takes too much time. Anyway, here's the link of the page I'm scraping:
https://www.kijiji.ca/v-fitness-personal-trainer/bedford/swimming-lessons/1421292946
Here's the code I've used
SellerName <-
  read_html("https://kijiji.ca/v-fitness-personal-trainer/bedford/swimming-lessons/1421292946") %>%
  html_nodes(".link-4200870613") %>%
  html_text()
You can easily regex out the seller name from the return, as it is contained in a script tag (presumably loaded from there when the browser is able to run JavaScript, which rvest does not).
library(rvest)
library(magrittr)
library(stringr)

p <- read_html('https://www.kijiji.ca/v-fitness-personal-trainer/bedford/swimming-lessons/1421292946') %>%
  html_text()
# capture the value of the "sellerName" key embedded in the page's script JSON
seller_name <- str_match_all(p, '"sellerName":"(.*?)"')[[1]][, 2][1]
print(seller_name)
Regex: "sellerName":"(.*?)" matches the sellerName key and lazily captures the quoted value that follows it.
