I am attempting to scrape (dynamic?) content from a webpage using the rvest package. I understand that dynamic content usually requires tools such as Selenium or PhantomJS.
However, my experimentation leads me to believe I should still be able to find the content I want using only standard web-scraping R packages (rvest, httr, xml2).
For this example I will be using a google maps webpage.
Here is the example url...
https://www.google.com/maps/dir/920+nc-16-br,+denver,+nc,+28037/2114+hwy+16,+denver,+nc,+28037/
If you follow the hyperlink above it will take you to an example webpage. The content I would want in this example are the addresses "920 NC-16, Crumpler, NC 28617" and "2114 NC-16, Newton, NC 28658" in the top left corner of the webpage.
Standard techniques using the css selector or xpath did not work, which initially made sense, as I thought this content was dynamic.
url<-"https://www.google.com/maps/dir/920+nc-16-br,+denver,+nc,+28037/2114+hwy+16,+denver,+nc,+28037/"
page<-read_html(url)
# The commands below all return {xml nodeset 0}
html_nodes(page,css=".tactile-searchbox-input")
html_nodes(page,css="#sb_ifc50 > input")
html_nodes(page,xpath='//*[contains(concat( " ", @class, " " ), concat( " ", "tactile-searchbox-input", " " ))]')
The commands above all return "{xml nodeset 0}", which I thought was a result of this content being generated dynamically. But here's where my confusion lies: if I convert the whole page to text using html_text(), I can find the addresses in the value returned.
x <- html_text(read_html(url))
substring <- substr(x, 33561 - 100, 33561 + 300)
Executing the commands above results in a substring with the following value,
"null,null,null,null,[null,null,null,null,null,null,null,[[[\"920 NC-16, Crumpler, NC 28617\",null,null,null,null,null,null,null,null,null,null,\"Nzm5FTtId895YoaYC4wZqUnMsBJ2rlGI\"]\n,[\"2114 NC-16, Newton, NC 28658\",null,null,null,null,null,null,null,null,null,null,\"RIU-FSdWnM8f-IiOQhDwLoMoaMWYNVGI\"]\n]\n,null,null,0,null,[[null,null,null,null,null,null,null,3]\n,[null,null,null,null,[null,null,null,null,nu"
The substring is very messy but contains the content I need. I've heard that parsing web pages with regex is frowned upon, but I cannot think of any other way of obtaining this content that also avoids dynamic scraping tools.
If anyone has suggestions for parsing the returned HTML, or can explain why I am unable to find the content using XPath or CSS selectors yet can find it by simply searching the raw HTML text, it would be greatly appreciated.
Thanks for your time.
The reason you can't find the text with XPath or CSS selectors is that the string you found is inside the contents of a JavaScript array object. You were right to assume that the text elements you can see on the screen are loaded dynamically; they are just not where you are reading the strings from.
I don't think there's anything wrong with parsing specific HTML with regex. I would make sure to get the full HTML source rather than just the html_text() output, in this case by using the httr package. You can grab the address from the page like this:
library(httr)
library(magrittr)  # for extract() and extract2()

GetAddressFromGoogleMaps <- function(url)
{
  GET(url) %>%
    content("text") %>%            # full raw page source as a single string
    strsplit("spotlight") %>%      # the addresses sit after the "spotlight" marker
    extract2(1) %>%
    extract(-1) %>%                # drop everything before the first marker
    strsplit("[[]{3}(\")*") %>%    # split at the opening [[[" of the address array
    extract2(1) %>%
    extract(2) %>%
    strsplit("\"") %>%             # keep the text up to the closing quote
    extract2(1) %>%
    extract(1)
}
Now:
GetAddressFromGoogleMaps(url)
#[1] "920 NC-16, Crumpler, NC 28617, USA"
I want to automatically download all the whitepapers from this website: https://icobench.com/ico. When you enter each ICO's page, there is a whitepaper tab to click, which takes you to a PDF preview screen. I want to retrieve the PDF URL from the page with rvest, but nothing comes back no matter what I pass to html_nodes.
An example of one ICO's element in the browser inspector:
embed id="plugin" type="application/x-google-chrome-pdf"
src="https://www.ideafex.com/docs/IdeaFeX_twp_v1.1.pdf"
stream-url="chrome-extension://mhjfbmdgcfjbbpaeojofohoefgiehjai/9ca6571a-509f-4924-83ef-5ac83e431a37"
headers="content-length: 2629762
content-type: application/pdf
I've tried something like the following:
library(rvest)
library(stringr)  # for str_c()

url <- "https://icobench.com/ico"
url <- str_c(url, '/hygh')
webpage <- read_html(url)
Item_html <- html_nodes(webpage, "content embed#plugin")
Item <- html_attr(Item_html, "src")
or
Item <- html_text(Item_html)
Item
But nothing comes back. Can anybody help?
From the example above, I'm expecting to retrieve the embedded URL pointing to the ICO's official website for the PDF whitepaper, e.g. https://www.ideafex.com/docs/IdeaFeX_twp_v1.1.pdf
But since the PDF is rendered by a Google Chrome plugin, it isn't being retrieved by the rvest package. Any ideas?
A possible solution:
Using your example, I would change the selector to combine an id selector with an attribute = value selector, joined by a descendant combinator. This targets the whitepaper tab by id and the child link by its href attribute value, using the $ (ends-with) operator to match the pdf.
library(rvest)
library(magrittr)
url <- "https://icobench.com/ico/hygh"
pdf_link <- read_html(url) %>% html_node(., "#whitepaper [href$=pdf]") %>% html_attr(., "href")
Faster option?
You could also target the object tag and its data attribute
pdf_link <- read_html(url) %>% html_node(., "#whitepaper object") %>% html_attr(., "data")
Explore which is fit for purpose across pages.
The latter is likely faster and seems to be used across the few sites I checked.
Solution for all icos:
You could put this in a function that receives a URL as input (the URL of each ICO) and returns the pdf URL, or some other specified value if no URL is found or the CSS selector fails to match; you'd need to add some handling for that scenario. Then call that function over a loop of all the ICO URLs, as sketched below.
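A minimal sketch of that idea; the helper name and the ico_urls vector are placeholders you would fill with the real list of ICO pages, and possibly() from purrr simply turns failed requests into NA so the loop keeps going:

library(rvest)
library(purrr)
library(magrittr)

# Sketch: return the pdf url for one ICO page, or NA if the selector doesn't match
get_whitepaper_url <- function(ico_url) {
  read_html(ico_url) %>%
    html_node("#whitepaper object") %>%
    html_attr("data")    # html_attr() returns NA when the node is missing
}

ico_urls <- c("https://icobench.com/ico/hygh")  # extend with all the ICO page urls
pdf_links <- map_chr(ico_urls, possibly(get_whitepaper_url, otherwise = NA_character_))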
I am trying to scrape off the number amounts listed in a set of donation websites. So in this example, I would like to get
$3, $10, $25, $100, $250, $1500, $2800
The xpath indicates that one of them should be
/html/body/div[1]/div[3]/div[2]/div/div[1]/div/div/
form/div/div[1]/div/div/ul/li[2]/label
and the css selector
li.btn--wrapper:nth-child(2) > label:nth-child(1)
Up to the following, I see something in the xml_nodeset:
library(rvest)
url <- "https://secure.actblue.com/donate/pete-buttigieg-announcement-day"
read_html(url) %>% html_nodes(
  xpath = '//*[@id="cf-app-target"]/div[3]/div[2]/div/div[1]/div/div'
)
Then when I add the second part of the xpath, it shows up blank. The same happens with
X %>% html_nodes("li")
which returns a bunch of nodes, but all of the StyledButton__StyledAnchorButton-a7s38j-0 kEcVlT ones come back blank.
I have worked with rvest for a fair bit now, but this one is baffling. And I am not quite sure how RSelenium would help here, although I do know how to use it for screenshots and clicks. If it helps, the website also refuses to be captured in the Wayback Machine: there's only the background and nothing else.
I have even tried taking a screenshot with RSelenium and attempting OCR with tesseract and magick, but while other pages worked, this particular example fails spectacularly, because the text is white and in a rather nonstandard font. Yes, I've tried image_negate and image_resize to see if they helped, but they only showed that relying on OCR is a bad idea, since the result depends on the screenshot size.
Any advice on how to best extract what I want in this situation? Thanks.
You can use regex to extract the numbers from the script tag. You get back a comma-separated character vector:
library(rvest)
library(stringr)
con <- url('https://secure.actblue.com/donate/pete-buttigieg-announcement-day?refcode=website', "rb")
page = read_html(con)
res <- page %>%
html_nodes(xpath=".//script[contains(., 'preloadedState')]")%>%
html_text() %>% as.character %>%
str_match_all(.,'(?<="amounts":\\[)(\\d+,?)+')
print(res[[1]][,1])
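If you then want actual numbers rather than one comma-separated string, a small follow-up sketch (assuming the first match is the one you want):

# split the matched "3,10,25,..." string and convert to numeric
amounts <- as.numeric(strsplit(res[[1]][, 1], ",")[[1]])
amounts
# e.g. 3 10 25 100 250 1500 2800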
I tried to scrape Kickstarter. However I do not get a result when I try to get the URLs that refer to the projects.
This should be one of the results:
https://www.kickstarter.com/projects/1534822242/david-bowie-hunger-city-photo-story?ref=category_ending_soon
and this is my code:
main.page1 <- read_html(x = "https://www.kickstarter.com/discover/advanced?category_id=1&sort=end_date&seed=2498921&page=1")
urls1 <- main.page1 %>% # feed `main.page` to the next step
html_nodes(".block.img-placeholder.w100p") %>% # get the CSS nodes
html_attr("href") # extract the URLs
Does anyone see where I'm going wrong?
First declare all the packages you use - I had to go search to realise I needed rvest:
> library(rvest)
> library(dplyr)
Get your HTML:
> main.page1 <- read_html(x ="https://www.kickstarter.com/discover/advanced?category_id=1&sort=end_date&seed=2498921&page=1")
As it stands, the data for each project is stashed in a data-project attribute on a bunch of divs. Some JavaScript (I suspect built with the React framework) normally runs in the browser to fill in the other divs, fetch the images, format the links and so on. But you have just grabbed the raw HTML, so none of that has happened. The raw data is still there, though. So...
The relevant divs appear to be class "react-disc-landing" so this gets the data as text strings:
> data = main.page1 %>%
html_nodes("div.react-disc-landing") %>%
html_attr("data-project")
These things appear to be JSON strings:
> substr(data[[1]],1,80)
[1] "{\"id\":208460273,\"photo\":{\"key\":\"assets/017/007/465/9b725fdf5ba1ee63e8987e26a1d33"
So let's use the rjson package to decode the first one:
> library(rjson)
> jdata = fromJSON(data[[1]])
jdata is now a very complex nested list. Use str(jdata) to see what is in it. I'm not sure what bit of it you want, but maybe this URL:
> jdata$urls$web$project
[1] "https://www.kickstarter.com/projects/1513052868/sense-of-place-by-jose-davila"
If not, the URL you want must be in that structure somewhere.
Repeat over data[[i]] to get all links.
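For example, a minimal sketch of that loop, assuming every blob has the same urls$web$project field as the first one:

> all_links <- sapply(data, function(d) fromJSON(d)$urls$web$project, USE.NAMES = FALSE)

all_links should then be a character vector with one project URL per div.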
Note that you should check the site's T&Cs to make sure you are allowed to do this, and also see if there's an API you should really be using.
I am trying to scrape data from this page:
http://www.atpworldtour.com/en/tournaments/brisbane-international-presented-by-suncorp/339/2016/match-stats/r975/f324/match-stats?
If I try to scrape the name of the players using the css selector and the usual rvest syntax:
names <- read_html("http://www.atpworldtour.com/en/tournaments/brisbane-international-presented-by-suncorp/339/2016/match-stats/r975/f324/match-stats?") %>%
html_nodes(".scoring-player-name") %>% sapply(html_text)
everything goes well.
Unfortunately, if I try to scrape the statistics below (first serve pts won, ...)
using the selector .stat-breakdown span, I am not able to retrieve any data.
I know rvest is generally not recommended for scraping dynamically created pages; however, I don't understand why some data are scraped and some are not.
I don't use rvest. If you follow the code below you should end up with a single string that you can transform into a data frame by splitting on the separators : and ,.
This tag also contains more information than is displayed in the UI of the webpage.
I could also try RSelenium, but I'd need to get my other PC, so I'll let you know if RSelenium works for me.
library(XML)
library(RCurl)
library(stringr)
url<-"http://www.atpworldtour.com/en/tournaments/brisbane-international-presented-by-suncorp/339/2016/match-stats/r975/f324/match-stats?"
url2<-getURL(url)
parsed<-htmlParse(url2)
# get the match stats data from the script tag
step1<-xpathSApply(parsed,"//script[@id='matchStatsData']",xmlValue)
# removing some unwanted characters
step2<-str_replace_all(step1,"\r\n","")
step3<-str_replace_all(step2,"\t","")
step4<-str_replace_all(step3,"[[{}]\"]","")
The output is then one long string containing all the match statistics.
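As a rough sketch of the final data-frame step mentioned above (this assumes the cleaned string is a flat label:value,label:value,... sequence, which may not hold for every nested part of the JSON):

# split on "," to get label:value chunks, then on ":" (re-joining any extra colons into the value)
pairs <- strsplit(step4, ",")[[1]]
kv <- strsplit(pairs, ":")
stats <- data.frame(
  stat  = sapply(kv, `[`, 1),
  value = sapply(kv, function(x) paste(x[-1], collapse = ":")),
  stringsAsFactors = FALSE
)
head(stats)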