I'm trying to grab a table of data using read_html from the R package rvest.
I've tried the code below:
library(rvest)
raw <- read_html("https://demanda.ree.es/movil/peninsula/demanda/tablas/2016-01-02/2")
I don't believe the above pulled the data from the table, since I see 'raw' is a list of 2:
'node:<externalptr>' and 'doc:<externalptr>'
I've tried grabbing the xpath too:
html_nodes(raw, xpath = '//*[(@id = "tabla_generacion")]//*[contains(concat( " ", @class, " " ), concat( " ", "ng-scope", " " ))]')
Any advice on what to try next?
Thanks.
This website uses Angular to make a call that fetches the data. You can just use that call to get the raw JSON. The response is not pure JSON, so you can't simply run fromJSON(url); you have to download the data and strip the non-JSON wrapper before you parse it.
library(jsonlite)
library(httr)
url <- "https://demanda.ree.es/WSvisionaMovilesPeninsulaRest/resources/demandaGeneracionPeninsula?callback=angular.callbacks._2&curva=DEMANDA&fecha=2016-01-02"
a <- GET(url)
a <- content(a, as="text")
# get rid of the non-JSON stuff...
a <- gsub("^angular.callbacks._2\\(", "", a)
a <- gsub("\\);$", "", a)
df <- fromJSON(a, simplifyDataFrame = TRUE)
I found this by pressing F12 in Chrome and looking at the "Sources" tab. The data to fill the table had to come from somewhere... so it's just a matter of figuring out where. I was unable to use rvest to scrape the table; the call that fetches the data is presumably not executed when the page is read in R the way it is in Chrome, so there may have been no data for rvest to scrape.
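For reuse, the same steps can be wrapped in a small function that takes the date as an argument. This is just a sketch based on the request above, and assumes the callback name (angular.callbacks._2) and URL structure stay the same:
library(httr)
library(jsonlite)

get_demanda <- function(fecha) {  # fecha as "yyyy-mm-dd", e.g. "2016-01-02"
  url <- paste0("https://demanda.ree.es/WSvisionaMovilesPeninsulaRest/resources/",
                "demandaGeneracionPeninsula?callback=angular.callbacks._2",
                "&curva=DEMANDA&fecha=", fecha)
  txt <- content(GET(url), as = "text")
  # strip the JSONP callback wrapper before parsing
  txt <- gsub("^angular\\.callbacks\\._2\\(", "", txt)
  txt <- gsub("\\);$", "", txt)
  fromJSON(txt, simplifyDataFrame = TRUE)
}

# df <- get_demanda("2016-01-02")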
I am attempting to scrape (dynamic?) content from a webpage using the rvest package. I understand that dynamic content should require the use of tools such as Selenium or PhantomJS.
However, my experimentation leads me to believe I should still be able to find the content I want using only standard web-scraping R packages (rvest, httr, xml2).
For this example I will be using a google maps webpage.
Here is the example url...
https://www.google.com/maps/dir/920+nc-16-br,+denver,+nc,+28037/2114+hwy+16,+denver,+nc,+28037/
If you follow the hyperlink above it will take you to an example webpage. The content I would want in this example are the addresses "920 NC-16, Crumpler, NC 28617" and "2114 NC-16, Newton, NC 28658" in the top left corner of the webpage.
Standard techniques using CSS selectors or XPath did not work, which initially made sense, as I thought this content was dynamic.
url<-"https://www.google.com/maps/dir/920+nc-16-br,+denver,+nc,+28037/2114+hwy+16,+denver,+nc,+28037/"
page<-read_html(url)
# The commands below all return {xml nodeset 0}
html_nodes(page,css=".tactile-searchbox-input")
html_nodes(page,css="#sb_ifc50 > input")
html_nodes(page, xpath = '//*[contains(concat( " ", @class, " " ), concat( " ", "tactile-searchbox-input", " " ))]')
The commands above all return "{xml nodeset 0}", which I thought was a result of this content being generated dynamically. But here's where my confusion lies: if I convert the whole page to text using html_text(), I can find the addresses in the value returned.
x <- html_text(read_html(url))
substring <- substr(x, 33561 - 100, 33561 + 300)
Executing the commands above results in a substring with the following value,
"null,null,null,null,[null,null,null,null,null,null,null,[[[\"920 NC-16, Crumpler, NC 28617\",null,null,null,null,null,null,null,null,null,null,\"Nzm5FTtId895YoaYC4wZqUnMsBJ2rlGI\"]\n,[\"2114 NC-16, Newton, NC 28658\",null,null,null,null,null,null,null,null,null,null,\"RIU-FSdWnM8f-IiOQhDwLoMoaMWYNVGI\"]\n]\n,null,null,0,null,[[null,null,null,null,null,null,null,3]\n,[null,null,null,null,[null,null,null,null,nu"
The substring is very messy but contains the content I need. I've heard parsing webpages using regex is frowned upon but I cannot think of any other way of obtaining this content which would also avoid the use of dynamic scraping tools.
If anyone has any suggestions for parsing the html returned or can explain why I am unable to find the content using xpath or css selectors, but can find it by simply parsing the raw html text, it would be greatly appreciated.
Thanks for your time.
The reason you can't find the text with XPath or CSS selectors is that the string you have found is within the contents of a JavaScript array object. You were right to assume that the text elements you can see on the screen are loaded dynamically; they are not where you are reading the strings from.
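A tiny self-contained illustration of that point (with a made-up page, not the Google Maps one): CSS/XPath selectors only match element nodes in the parsed DOM, while html_text() flattens every text node, including the contents of <script> tags:
library(rvest)

doc <- read_html('<html><body><script>var data = [["920 NC-16, Crumpler, NC 28617"]];</script></body></html>')
html_nodes(doc, ".tactile-searchbox-input")  # {xml_nodeset (0)} -- no such element node
html_text(doc)                               # the script's text, address included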
I don't think there's anything wrong with parsing specific html with regex. I would ensure that I get the full html rather than just the html_text() output, in this case by using the httr package. You can grab the address from the page like this:
library(httr)
library(magrittr)  # provides the extract() and extract2() aliases used below

GetAddressFromGoogleMaps <- function(url)
{
  GET(url) %>%
    content("text") %>%
    strsplit("spotlight") %>%
    extract2(1) %>%
    extract(-1) %>%
    strsplit("[[]{3}(\")*") %>%
    extract2(1) %>%
    extract(2) %>%
    strsplit("\"") %>%
    extract2(1) %>%
    extract(1)
}
Now:
GetAddressFromGoogleMaps(url)
#[1] "920 NC-16, Crumpler, NC 28617, USA"
There are two parts to my question, as I explored two methods in this exercise, but I succeeded with neither. It would be greatly appreciated if someone could help me out.
[PART 1:]
I am attempting to scrape data from a webpage on the Singapore Stock Exchange, https://www2.sgx.com/derivatives/negotiated-large-trade, containing data stored in a table. I have some basic knowledge of scraping data using rvest. However, using the Inspector in Chrome, the HTML hierarchy is much more complex than I expected. I can see that the data I want is hidden under <div class="table-container">, and here's what I've tried:
library(rvest)
library(httr)
library(XML)
SGXurl <- "https://www2.sgx.com/derivatives/negotiated-large-trade"
SGXdata <- read_html(SGXurl, stringsASfactors = FALSE)
html_nodes(SGXdata,".table-container")
However, nothing is picked up by the code, and I doubt I'm using it correctly.
[PART 2:]
I realized that there's a small "Download" button on the page which downloads exactly the data file I want in .csv format, so I thought about writing some code to mimic the download button. I found this question, Using R to "click" a download file button on a webpage, but I'm unable to get it to work even with some modifications to that code.
There are a few filters on the webpage; mostly I'm interested in downloading data for a particular business day while leaving the other filters blank, so I've tried writing the following function:
library(httr)
library(rvest)
library(purrr)
library(dplyr)
crawlSGXdata = function(date){
  POST("https://www2.sgx.com/derivatives/negotiated-large-trade",
       body = NULL,
       encode = "form",
       write_disk("SGXdata.csv")) -> resfile
  res = read.csv(resfile)
  return(res)
}
I intended to put the function input "date" into the "body" argument, but I was unable to figure out how to do that, so I started off with "body = NULL", assuming it does no filtering. However, the result is still unsatisfactory: the downloaded file is basically empty apart from the following error:
Request Rejected
The requested URL was rejected. Please consult with your administrator.
Your support ID is: 16783946804070790400
The content is loaded dynamically from an API call returning JSON. You can find this in the Network tab via dev tools.
The following returns that content. I find the total number of pages of results and loop over them, combining the data frame returned from each call into one final data frame containing all results.
library(jsonlite)
url <- 'https://api.sgx.com/negotiatedlargetrades/v1.0?order=asc&orderby=contractcode&category=futures&businessdatestart=20190708&businessdateend=20190708&pagestart=0&pageSize=250'
r <- jsonlite::fromJSON(url)
num_pages <- r$meta$totalPages
df <- r$data
url2 <- 'https://api.sgx.com/negotiatedlargetrades/v1.0?order=asc&orderby=contractcode&category=futures&businessdatestart=20190708&businessdateend=20190708&pagestart=placeholder&pageSize=250'
if(num_pages > 1){
  # pages are 0-indexed and page 0 was already retrieved above
  for(i in seq_len(num_pages - 1)){
    newUrl <- gsub("placeholder", i, url2)
    newdf <- jsonlite::fromJSON(newUrl)$data
    df <- rbind(df, newdf)
  }
}
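Wrapping that in a function like the crawlSGXdata() you sketched in Part 2 is then mostly a matter of pasting the date into the query string. A rough sketch, assuming the API parameters stay as above:
library(jsonlite)

crawlSGXdata <- function(date) {  # date as "yyyymmdd", e.g. "20190708"
  base <- paste0("https://api.sgx.com/negotiatedlargetrades/v1.0?order=asc",
                 "&orderby=contractcode&category=futures",
                 "&businessdatestart=", date, "&businessdateend=", date,
                 "&pagestart=%d&pageSize=250")
  r <- fromJSON(sprintf(base, 0))
  df <- r$data
  if (r$meta$totalPages > 1) {
    # pages are 0-indexed and page 0 was already retrieved above
    for (i in seq_len(r$meta$totalPages - 1)) {
      df <- rbind(df, fromJSON(sprintf(base, i))$data)
    }
  }
  df
}

# sgx <- crawlSGXdata("20190708")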
Okay, so I am stuck on what seems like it would be a simple web scrape. My goal is to scrape Morningstar.com to retrieve a fund name based on the entered URL. Here is an example of my code:
library(rvest)
url <- html("http://www.morningstar.com/funds/xnas/fbalx/quote.html")
url %>%
read_html() %>%
html_node('r_title')
I would expect it to return the name Fidelity Balanced Fund, but instead I get the following result: {xml_missing}
Suggestions?
Aaron
edit:
I also tried scraping via an XHR request, but I think my issue is not knowing what CSS selector or XPath to use to find the appropriate data.
XHR code:
get.morningstar.Table1 <- function(Symbol.i, htmlnode){
  try(res <- GET(url = "http://quotes.morningstar.com/fundq/c-header",
                 query = list(
                   t = Symbol.i,
                   region = "usa",
                   culture = "en-US",
                   version = "RET",
                   test = "QuoteiFrame"
                 )))
  tryCatch(x <- content(res) %>%
             html_nodes(htmlnode) %>%
             html_text() %>%
             trimws(),
           error = function(e) x <- NA)
  return(x)
}  # the HTML node in this case is a vkey
Still, the same question remains: am I using the correct CSS/XPath to look it up? The XHR code works great for requests that have a clear CSS selector.
OK, so it looks like the page dynamically loads the section you are targeting, so it doesn't actually get pulled in by read_html(). Interestingly, this part of the page also doesn't load using an RSelenium headless browser.
I was able to get this to work by scraping the page title (which is actually hidden on the page) and doing some regex to get rid of the junk:
library(rvest)
url <- 'http://www.morningstar.com/funds/xnas/fbalx/quote.html'
page <- read_html(url)
title <- page %>%
  html_node('title') %>%
  html_text()
symbol <- 'FBALX'
regex <- paste0(symbol, " (.*) ", symbol, ".*")
cleanTitle <- gsub(regex, '\\1', title)
As a side note, and for your future use, your first call to html_node() should include a "." before the class name you are targeting:
mypage %>%
  html_node('.myClass')
Again, this doesn't help in this specific case, since the page is failing to load the section we are trying to scrape.
A final note: other sites contain the same info and are easier to scrape (like Yahoo Finance).
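For example, a rough sketch against Yahoo Finance (the h1 selector is an assumption about that page's current layout and may need adjusting):
library(rvest)

read_html("https://finance.yahoo.com/quote/FBALX") %>%
  html_node("h1") %>%
  html_text()
# something like "Fidelity Balanced Fund (FBALX)"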
I'm stuck on this one after much searching....
I started with scraping the contents of a table from:
http://www.skatepress.com/skates-top-10000/artworks/
Which is easy:
library(XML)  # readHTMLTable() comes from the XML package

data <- data.frame()
for (i in 1:100){
  print(paste("page", i, "of 100"))
  url <- paste("http://www.skatepress.com/skates-top-10000/artworks/", i, "/", sep = "")
  temp <- readHTMLTable(stringsAsFactors = FALSE, url, which = 1, encoding = "UTF-8")
  data <- rbind(data, temp)
} # end of scraping loop
However, I need to additionally scrape the detail that is contained in a pop-up box when you click on each name (and on the artwork title) in the list on the site.
I can't for the life of me figure out how to pass the breadcrumb (or artist-id or painting-id) through in order to make this happen. Since straight up using rvest to access the contents of the nodes doesn't work, I've tried the following:
I tried passing the painting id through in the url like this:
url <- ("http://www.skatepress.com/skates-top-10000/artworks/?painting_id=576")
site <- html(url)
But it still gives an empty result when scraping:
node1 <- "bread-crumb > ul > li.activebc"
site %>% html_nodes(node1) %>% html_text(trim = TRUE)
character(0)
I'm (clearly) not a scraping expert so any and all assistance would be greatly appreciated! I need a way to capture this additional information for each of the 10,000 items on the list...hence why I'm not interested in doing this manually!
Hoping this is an easy one and I'm just overlooking something simple.
This will be a more efficient base scraper and you can get progress bars for free with the pbapply package:
library(xml2)
library(httr)
library(rvest)
library(dplyr)
library(pbapply)
library(jsonlite)
base_url <- "http://www.skatepress.com/skates-top-10000/artworks/%d/"
n <- 100
bind_rows(pblapply(1:n, function(i) {
  mutate(html_table(html_nodes(read_html(sprintf(base_url, i)), "table"))[[1]],
         `Sale Date` = as.Date(`Sale Date`, format = "%m.%d.%Y"),
         `Premium Price USD` = as.numeric(gsub(",", "", `Premium Price USD`)))
})) -> skatepress
I added trivial date & numeric conversions.
I believe your main issue is that the site requires a login to get the additional data. You should give that (i.e. logging in) a shot using httr and grab the wordpress_logged_inXXXXXXX… cookie from that endeavour. I just grabbed it by inspecting the session with Developer Tools in Chrome, and that will also work for you (but it's worth the time to learn how to do it via httr).
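If you do want to go the httr route rather than copying the cookie by hand, a rough sketch of the login step looks like this (the wp-login.php endpoint and the log/pwd field names are assumptions based on a standard WordPress install):
library(httr)

login <- POST("http://www.skatepress.com/wp-login.php",
              body = list(log = "your_username", pwd = "your_password"),
              encode = "form")
cookies(login)  # the wordpress_logged_in_* cookie should show up here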
You'll need to scrape two additional <a … tags from each table row. The one for "artist" looks like:
Pablo Picasso
You can scrape the contents with:
POST("http://www.skatepress.com/wp-content/themes/skatepress/scripts/query_artist.php",
set_cookies(wordpress_logged_in_XXX="userid%XXXXXreallylongvalueXXXXX…"),
encode="form",
body=list(id="pab_pica_1881"),
verbose()) -> artist_response
fromJSON(content(artist_response, as="text"))
(The return value is too large to post here)
The one for "artwork" looks like:
Les femmes d′Alger (Version ′O′)
and you can get that in similar fashion:
POST("http://www.skatepress.com/wp-content/themes/skatepress/scripts/query_artwork.php",
set_cookies(wordpress_logged_in_XXX="userid%XXXXXreallylongvalueXXXXX…"),
encode="form",
body=list(id=576),
verbose()) -> artwork_response
fromJSON(content(artwork_response, as="text"))
That's not huge but I won't clutter the response with it.
NOTE that you can also use rvest's html_session() to do the login (which will get you the cookies for free) and then continue to use that session in the scraping (vs read_html()), which means you don't have to make the httr GET/POST calls yourself.
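A sketch of that html_session() approach (again assuming a standard WordPress login form; the field names may differ):
library(rvest)

sess <- html_session("http://www.skatepress.com/wp-login.php")
login_form <- set_values(html_form(sess)[[1]],
                         log = "your_username", pwd = "your_password")
sess <- submit_form(sess, login_form)
# the session now carries the login cookies, so keep using it for scraping
page <- jump_to(sess, "http://www.skatepress.com/skates-top-10000/artworks/1/")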
You'll have to figure out how you want to incorporate that data into the data frame or associate it with it via various id's in the data frame (or some other strategy).
You can see it call those two PHP scripts via Developer Tools, which also shows the data it passes in. I'm also really shocked that the site doesn't have any anti-scraping clauses in its ToS, but it doesn't.
I want to scrape the statistics from this page:
url <- "http://www.pgatour.com/players/player.20098.stuart-appleby.html/statistics"
Specifically, I want to grab the data in the table that's underneath Stuart's headshot. It's headlined by "Stuart Appleby - 2015 STATS PGA TOUR"
I attempted to use rvest in combination with SelectorGadget (http://selectorgadget.com/).
url_html <- url %>% html()
url_html %>%
html_nodes(xpath = '//*[(@id = "playerStats")]//td')
'Should' get me the table without, for example, the row on top that says "Recap -- Rank -- Additional Stats"
url_html <- url %>% html()
url_html %>%
html_nodes(xpath = '//*[(@id = "playerStats")] | //th//*[(@id = "playerStats")]//td')
'Should' get me the table with that "Recap -- Rank -- Add'l Stats" line.
Neither works.
Obvs I'm a complete newb when it comes to web scraping. When I click on 'view source' for that webpage, the data contained in the table isn't there.
In the source code, where I think the table should be starting, is this bit of code:
<script id="playerStatsTourTemplate" type="text/x-jquery-tmpl">
{{each(t, tour) tours}}
{{if pgatour.players.shouldProcessTour(tour.tourCodeLC)}}
<div class="statistics-head">
<h2 class="title">Stuart Appleby - <b>${year} STATS
.
.
.
So, it appears the table is stored somewhere (JSON? jQuery? JavaScript? Are those terms applicable here?) that isn't accessible to the html() function. Is there any way to use rvest to grab this data? Is there an rvest equivalent for grabbing data that is stored in this manner?
Thanks.
I'd probably use the GET request that the page is making to get the raw data from their API and work on parsing that...
content(a) gives you a list representation... basically the output from fromJSON()
or
as(a, "character") gives you the raw JSON
library("httr")
a <- GET("http://www.pgatour.com/data/players/20098/2014stat.json")
content(a)
as(a, "character")
Check this out.
Open source project on GitHub scraping PGA data: https://github.com/zachwill/golf/blob/master/pga.py