Pasting the URL address into a data frame when scraping a table in R

I'm trying to scrape a table, but I can only get it to paste the value of the hyperlink. I want the URL to be pasted instead of the value in the table. I've worked out how to do this for a single hyperlink, but I would need to go through and acquire every XPath. Is there a quicker way of doing this?
This is the code I've been working with:
library(rvest)
url <- read_html("https://coinmarketcap.com/coins/views/all/")
cryptocurrencies <- url %>%
  html_nodes(xpath = '//*[@id="currencies-all"]') %>%
  html_table(fill = TRUE)
cryptocurrencies <- cryptocurrencies[[1]]
I suspect there is an argument in the html_nodes function that would allow me to paste the 'href', but I can't seem to work out how to do it. Thanks

First, you need to use html_attr() to get the attribute of each node; in your case, the attribute is href:
page <- read_html("https://coinmarketcap.com/coins/views/all/")
relative_paths <- page %>%
  html_nodes(".currency-name-container") %>%
  html_attr("href")  # note: these are relative paths
relative_paths[1:3]
[1] "/currencies/bitcoin/"  "/currencies/ethereum/" "/currencies/ripple/"
Once you have the relative paths, you can use the jump_to() or follow_link() functions to scrape each page.
# only the first three results are shown below
for (path in relative_paths) {
  current_session <- html_session("https://coinmarketcap.com/coins/views/all/") %>%
    jump_to(path)
  # do something here
  print(current_session$url)
}
[1] "https://coinmarketcap.com/currencies/bitcoin/"
[1] "https://coinmarketcap.com/currencies/ethereum/"
[1] "https://coinmarketcap.com/currencies/ripple/
Alternatively, you can build the absolute path:
# or get the absolute path
absolute_path <- paste0("https://coinmarketcap.com", relative_paths)
absolute_path[1:3]
[1] "https://coinmarketcap.com/currencies/bitcoin/" "https://coinmarketcap.com/currencies/ethereum/" "https://coinmarketcap.com/currencies/ripple/"
Finally, you can merge it into your data frame.
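For instance, a minimal sketch of that final step, assuming the table rows and the scraped links line up one-to-one (one link per currency row):
library(rvest)
page <- read_html("https://coinmarketcap.com/coins/views/all/")
# the table the question already scrapes
cryptocurrencies <- page %>%
  html_nodes(xpath = '//*[@id="currencies-all"]') %>%
  html_table(fill = TRUE) %>%
  .[[1]]
# one href per currency row
links <- page %>%
  html_nodes(".currency-name-container") %>%
  html_attr("href")
# add the absolute URL as a new column
cryptocurrencies$url <- paste0("https://coinmarketcap.com", links)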

Related

Confusion Regarding HTML Code For Web Scraping With R

I am struggling with the rvest package in R, most likely due to my lack of knowledge about CSS or HTML. Here is an example (my guess is that ".quote-header-info" is what is wrong; I also tried ".Trsdu ..." but no luck either):
library(rvest)
url <- "https://finance.yahoo.com/quote/SPY"
website <- read_html(url) %>%
  html_nodes(".quote-header-info") %>%
  html_text() %>%
  toString()
website
I am specifically looking to grab the value "416.74" from the quote header of the page. I took a peek at the rvest documentation (https://cran.r-project.org/web/packages/rvest/rvest.pdf), but I think the issue is that I don't understand the structure of the webpage I am looking at.
The tricky part is determining the correct set of attributes to select just this one html node. In this case it is the span tag with the classes Trsdu(0.3s) and Fz(36px):
library(rvest)
url <- "https://finance.yahoo.com/quote/SPY"
# read the page once
page <- read_html(url)
# now extract information from the page
price <- page %>%
  html_nodes("span.Trsdu\\(0\\.3s\\).Fz\\(36px\\)") %>%
  html_text()
price
Note: "(", ")", and "." are all special characters thus the need to double escape "\\" them.
Those classes are dynamic and change much more frequently than other parts of the html, so they should be avoided. You have at least two more robust options:
1. Extract the javascript object housing that data (plus a lot more) from a script tag, then parse it with jsonlite.
2. Use positional matching against other, more stable, html elements.
I show both below. The advantage of the first is that you can extract lots of other page data from the json object it generates.
library(magrittr)
library(rvest)
library(stringr)
library(jsonlite)

page <- read_html('https://finance.yahoo.com/quote/SPY')

# Option 1: pull the JSON assigned to root.App.main out of the raw page source
data <- page %>%
  toString() %>%
  stringr::str_match('root\\.App\\.main = (.*?[\\s\\S]+)(?=;[\\s\\S]+\\(th)') %>%
  .[2]
json <- jsonlite::parse_json(data)
print(json$context$dispatcher$stores$StreamDataStore$quoteData$SPY$regularMarketPrice$raw)

# Option 2: positional matching against more stable elements
print(
  page %>%
    html_node('#quote-header-info div:nth-of-type(2) ~ div div:nth-child(1) span') %>%
    html_text() %>%
    as.numeric()
)

How to download a pdf file from the web with R (encoding issue)

I am trying to download a pdf file from a website using R. When I tried to use the function browseURL, it only worked with the argument encodeIfNeeded = T. As a result, if I pass the same url to the function download.file, it returns "cannot open destfile 'downloaded/teste.pdf', reason 'No such file or directory'", i.e., it can't find the correct url.
How do I correct the encoding so that I can download the file programmatically?
I need to automate this, because there are more than a thousand files to download.
Here's a minimal reproducible example:
library(tidyverse)
library(rvest)
url <- "http://www.ouvidoriageral.sp.gov.br/decisoesLAI.html"
webpage <- read_html(url)
# scraping hyperlinks
links_decisoes <- html_nodes(webpage, ".borderTD a") %>%
  html_attr("href")
# creating the full/correct url
full_links <- paste("http://www.ouvidoriageral.sp.gov.br/", links_decisoes, sep = "")
# browseURL only works with encodeIfNeeded = T
browseURL(full_links[1], encodeIfNeeded = T,
          browser = "C://Program Files//Mozilla Firefox//firefox.exe")
# returns an error
download.file(full_links[1], "downloaded/teste.pdf")
There are a couple of problems here. Firstly, the links to some of the files are not properly formatted as urls: they contain spaces and other special characters. To convert them you can use url_escape(), which is available to you because loading rvest also loads xml2, where url_escape() lives.
Secondly, the path you are saving to is relative to your R home directory, but you are not telling R this. You either need the full path, like "C://Users/Manoel/Documents/downloaded/testes.pdf", or an expanded path, like path.expand("~/downloaded/testes.pdf").
This code should do what you need:
library(tidyverse)
library(rvest)
# scraping hyperlinks
full_links <- "http://www.ouvidoriageral.sp.gov.br/decisoesLAI.html" %>%
read_html() %>%
html_nodes(".borderTD a") %>%
html_attr("href") %>%
url_escape() %>%
{paste0("http://www.ouvidoriageral.sp.gov.br/", .)}
# Looks at page in firefox
browseURL(full_links[1], encodeIfNeeded = T, browser = "firefox.exe")
# Saves pdf to "downloaded" folder if it exists
download.file(full_links[1], path.expand("~/downloaded/teste.pdf"))
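Since there are more than a thousand files, the last call generalizes to a simple loop over all the links (a sketch; the "downloaded" folder under the home directory and the one-second pause are just example choices):
# download every file, pausing briefly between requests
dir.create(path.expand("~/downloaded"), showWarnings = FALSE)
for (link in full_links) {
  dest <- file.path(path.expand("~/downloaded"), basename(link))
  try(download.file(link, dest, mode = "wb"))  # mode = "wb" keeps the pdf intact on Windows
  Sys.sleep(1)  # small courtesy pause between requests
}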

What makes table web scraping with rvest package sometimes fail?

I'm playing with the rvest package and trying to figure out why it sometimes fails to scrape objects that definitely seem to be tables.
Consider for instance a script like this:
require(rvest)
url <- "http://bigcharts.marketwatch.com/quickchart/options.asp?symb=SPY"
population <- url %>%
  xml2::read_html() %>%
  html_nodes(xpath = '//*[@id="options"]/table/tbody/tr/td/table[2]/tbody') %>%
  html_table()
population
If I inspect population, it's an empty list:
> population
list()
Another example:
require(rvest)
url <- "https://finance.yahoo.com/quote/SPY/options?straddle=false"
population <- url %>%
  xml2::read_html() %>%
  html_nodes(xpath = '//*[@id="Col1-1-OptionContracts-Proxy"]/section/section[1]/div[2]') %>%
  html_table()
population
I was wondering if the use of PhantomJS is mandatory - as explained here - or if the problem is elsewhere.
Neither of your current xpaths actually selects just the table. In both cases I think you need to pass a table node to html_table(), as under the hood there is a check along the lines of:
html_table.xml_node(.) : html_name(x) == "table"
Also, long xpaths are too fragile, especially when applying a path that is valid for browser-rendered content to the html that rvest returns, since javascript doesn't run with rvest. Personally, I prefer nice short CSS selectors. You can use a class selector (the second fastest selector type) and only need to specify a single class:
require(rvest)
url <- "http://bigcharts.marketwatch.com/quickchart/options.asp?symb=SPY"
population <- url %>%
  xml2::read_html() %>%
  html_node('.optionchain') %>%
  html_table()
The table needs cleaning, of course, due to "merged" cells in the source, but you get the idea.
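What that clean-up looks like depends on how the merged header cells come through; a purely hypothetical sketch, assuming the real column labels land in the first data row, might be:
# hypothetical clean-up: promote the first data row to column names and drop it
names(population) <- as.character(unlist(population[1, ]))
population <- population[-1, ]
rownames(population) <- NULL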
With xpath you could do:
require(rvest)
url <- "http://bigcharts.marketwatch.com/quickchart/options.asp?symb=SPY"
population <- url %>%
  xml2::read_html() %>%
  html_node(xpath = '//table[2]') %>%
  html_table()
Note: I reduce the xpath and work with a single node which represents a table.
For your second example: again, your xpath is not selecting a table element. The table's class attribute is multi-valued, but a single correctly chosen class will suffice in xpath, i.e. //*[contains(@class,"calls")]. Select a single table node.
require(rvest)
url <- "https://finance.yahoo.com/quote/SPY/options?straddle=false"
population <- url %>%
  xml2::read_html() %>%
  html_node(xpath = '//*[contains(@class,"calls")]') %>%
  html_table()
Once again, my preference is for a css selector (less typing!)
require(rvest)
url <- "https://finance.yahoo.com/quote/SPY/options?straddle=false"
population <- url %>%
  xml2::read_html() %>%
  html_node('.calls') %>%
  html_table()

Parse CDATA with R

I'm scraping and analyzing data from a car auction website. My goal is to develop date-time and sentiment analysis skills, and I like old cars. The website is Bring A Trailer-- they do not offer API access (I asked), but robots.txt is OK.
SO user '42' pointed out that this is not permitted by BAT's terms, so I have removed their base url. I will likely remove the question. After thinking about it, I can do what I want by saving a couple of webpages from my browser and analyzing that data. I don't need ALL the auctions; I just followed a tutorial that did, and here I am reading TOS instead of doing what I wanted in the first place...
Some of the data is easily accessed, but the best parts are hard, and I'm stuck with that. I'm really looking for advice on my approach.
My first steps work: I can find and locally cache the webpages:
library(tidyverse)
library(rvest)
data_dir <- "bat_data-html/"
# Step 1: Create list of links to listings ----------------------------
base_url <- "https://"
pages <- read_html(file.path(base_url, "/auctions/")) %>%
  html_nodes(".auctions-item-title a") %>%
  html_attr("href") %>%
  file.path()
pages <- head(pages, 3) # use a subset for testing code
# Step 2 : Save auction pages locally ---------------------------------
dir.create(data_dir, showWarnings = FALSE)
p <- progress_estimated(length(pages))
# Download each auction page
walk(pages, function(url) {
  download.file(url, destfile = file.path(data_dir, basename(url)), quiet = TRUE)
  p$tick()$print()
})
I can also process metadata about the auction from these cached pages, identifying the css selectors with SelectorGadget and specifying them to rvest:
# Step 3: Process each auction info into df ----------------------------
files <- dir(data_dir, pattern = "*", full.names = TRUE)
# Function: get_auction_details, to be applied to each auction page
get_auction_details <- function(file) {
  pagename <- basename(file) # the filename of the page (trailing index for multiples)
  page <- read_html(file)    # read the html into R (consider options = "NOCDATA")
  # Grab the title of the auction stored in the ".listing-post-title" tag on the page
  title <- page %>% html_nodes(".listing-post-title") %>% html_text()
  # Grab the "BAT essentials" of the auction stored in the ".listing-essentials-item" tag on the page
  essence <- page %>% html_nodes(".listing-essentials-item") %>% html_text()
  # Assemble into a data frame
  info_tbl0 <- as_tibble(essence)
  info_tbl <- add_row(info_tbl0, value = title, .before = 1)
  names(info_tbl)[1] <- pagename
  return(info_tbl)
}
# Apply the get_auction_details function to each element of files
bat0 <- map_df(files, get_auction_details)     # run the function
bat <- gather(bat0) %>% subset(value != "NA")  # reshape to long form and drop NAs
# Save as csv
write_csv(bat, path = "data-csv/bat04.csv")
This table contains the expected metadata:
key,value
1931-ford-model-a-12,Modified 1931 Ford Model A Pickup
1931-ford-model-a-12,Lot #8576
1931-ford-model-a-12,Seller: TargaEng
But the auction data (bids, comments) is inside a CDATA section:
<script type='text/javascript'>
/* <![CDATA[ */
var BAT_VMS = { ...bids, comments, results
/* ]]> */
</script>
I've tried selecting elements within this section using the path I found with SelectorGadget, but they are not found; this gives an empty list:
tmp <- page %>% html_nodes(".comments-list") %>% html_text()
Looking at the text within this CDATA section, I see some xml tags but it is not structured in the cached file like it is when I inspect the auction section of the live webpage.
To extract this information, should I try to parse the information "as-is" from within this CDATA section, or can I transform it so that it can be parsed like XML? Or am I barking up the wrong tree?
I appreciate any advice!
It's buried in the xml2 documentation, but you can use this option to keep the CDATA intact:
# Instead of rvest::read_html()
page <- xml2::read_xml(file, options = "NOCDATA")
After reading the feed in this way, you'll be able to access the comments list the way you wanted.
tmp <- page %>% html_nodes(".comments-list") %>% html_text()

R Web scrape - Error

Okay, so I am stuck on what seems like it would be a simple web scrape. My goal is to scrape Morningstar.com to retrieve a fund name based on the entered url. Here is an example of my code:
library(rvest)
url <- html("http://www.morningstar.com/funds/xnas/fbalx/quote.html")
url %>%
read_html() %>%
html_node('r_title')
I would expect it to return the name Fidelity Balanced Fund, but instead I get the following error: {xml_missing}
Suggestions?
Aaron
edit:
I also tried scraping via XHR request, but I think my issue is not knowing what css selector or xpath to select to find the appropriate data.
XHR code:
library(httr)
library(rvest)

get.morningstar.Table1 <- function(Symbol.i, htmlnode) {
  try(res <- GET(url = "http://quotes.morningstar.com/fundq/c-header",
                 query = list(
                   t = Symbol.i,
                   region = "usa",
                   culture = "en-US",
                   version = "RET",
                   test = "QuoteiFrame"
                 )))
  tryCatch(x <- content(res) %>%
             html_nodes(htmlnode) %>%
             html_text() %>%
             trimws(),
           error = function(e) x <- NA)
  return(x)
}
# the html node in this case is a vkey
Still, the same question remains: am I using the correct css/xpath to look up? The XHR code works great for requests that have a clear css selector.
OK, so it looks like the page dynamically loads the section you are targeting, so it doesn't actually get pulled in by read_html(). Interestingly, this part of the page also doesn't load using an RSelenium headless browser.
I was able to get this to work by scraping the page title (which is actually hidden on the page) and doing some regex to get rid of the junk:
library(rvest)
url <- 'http://www.morningstar.com/funds/xnas/fbalx/quote.html'
page <- read_html(url)
title <- page %>%
html_node('title') %>%
html_text()
symbol <- 'FBALX'
regex <- paste0(symbol, " (.*) ", symbol, ".*")
cleanTitle <- gsub(regex, '\\1', title)
As a side note, and for your future use, your first call to html_node() should include a "." before the class name you are targeting:
mypage %>%
html_node('.myClass')
Again, this doesn't help in this specific case, since the page is failing to load the section we are trying to scrape.
A final note: other sites contain the same info and are easier to scrape (like yahoo finance).
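For example, a hedged sketch of that idea: the Yahoo Finance quote page for the same fund exposes its name in the page <title> (the exact title text is an assumption and may change over time):
library(rvest)
fund_title <- read_html("https://finance.yahoo.com/quote/FBALX") %>%
  html_node("title") %>%
  html_text()
fund_title
# something like "Fidelity Balanced Fund (FBALX) ..." which you could trim with gsub() as above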
