R - Using rvest to scrape Google+ reviews

As part of a project, I am trying to scrape complete reviews from Google+ (in previous attempts on other websites, my reviews were truncated by a "More" link that hides the full review unless you click on it).
I have chosen the rvest package for this. However, I do not seem to be getting the results I want.
Here are my steps:
library(rvest)
library(xml2)
library(RSelenium)

queens <- read_html("https://www.google.co.uk/search?q=queen%27s+hospital+romford&oq=queen%27s+hospitql+&aqs=chrome.1.69i57j0l5.5843j0j4&sourceid=chrome&ie=UTF-8#lrd=0x47d8a4ce4aaaba81:0xf1185c71ae14d00,1,,,")

# Here I use the SelectorGadget tool to identify the user-review part that I wish to scrape
reviews <- queens %>%
  html_nodes(".review-snippet") %>%
  html_text()
However, this doesn't seem to be working: I get no output at all.
I am quite new to this package and to web scraping, so any input on this would be greatly appreciated.

Here is the workflow with RSelenium and rvest:
1. Scroll down as many times as needed to load as much content as you want, pausing once in a while to let the content load.
2. Click all the "More" buttons to expand the full reviews.
3. Get the page source and use rvest to collect all reviews in a list.
What you want to scrape is not static, so you need the help of RSelenium. This should work:
library(rvest)
library(xml2)
library(RSelenium)

rmDr <- rsDriver(browser = "chrome", chromever = "73.0.3683.68")
myclient <- rmDr$client
myclient$navigate("https://www.google.co.uk/search?q=queen%27s+hospital+romford&oq=queen%27s+hospitql+&aqs=chrome.1.69i57j0l5.5843j0j4&sourceid=chrome&ie=UTF-8#lrd=0x47d8a4ce4aaaba81:0xf1185c71ae14d00,1,,,")

# Click on the snippet to switch focus ----------
webEle <- myclient$findElement(using = "css", value = ".review-snippet")
webEle$clickElement()

# Simulate scrolling down several times -------------
scroll_down_times <- 20
for (i in 1:scroll_down_times) {
  webEle$sendKeysToActiveElement(sendKeys = list(key = "page_down"))
  # The content needs time to load: wait 1 second every 5 scrolls
  if (i %% 5 == 0) {
    Sys.sleep(1)
  }
}

# Loop over all "More" elements and simulate clicking each one -------------
webEles <- myclient$findElements(using = "css", value = ".review-more-link")
for (webEle in webEles) {
  # tryCatch prevents a single failed click from stopping the loop
  tryCatch(webEle$clickElement(), error = function(e) print(e))
}

pagesource <- myclient$getPageSource()[[1]]

# This should get you the full reviews, including translation and original text -------------
reviews <- read_html(pagesource) %>%
  html_nodes(".review-full-text") %>%
  html_text()

# Number of stars
stars <- read_html(pagesource) %>%
  html_node(".review-dialog-list") %>%
  html_nodes("g-review-stars > span") %>%
  html_attr("aria-label")

# Time posted
post_time <- read_html(pagesource) %>%
  html_node(".review-dialog-list") %>%
  html_nodes(".dehysf") %>%
  html_text()
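Once the three vectors have been collected, a natural follow-up is to combine them and shut the browser down. The sketch below is not part of the original answer; it assumes the three vectors are the same length, which may not hold if some reviews lack a rating or a date.

library(rvest)

# Follow-up sketch: combine the scraped vectors into one data frame
review_df <- data.frame(
  review = reviews,
  stars = stars,
  posted = post_time,
  stringsAsFactors = FALSE
)

# Always close the client and stop the Selenium server when finished
myclient$close()
rmDr$server$stop()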

Related

How do I extract certain html nodes using rvest?

I'm new to web-scraping so I may not be doing all the proper checks here. I'm attempting to scrape information from a url, however I'm not able to extract the nodes I need. See sample code below. In this example, I want to get the product name (Madone SLR 9 eTap Gen) which appears to be stored in the buying-zone__title class.
library(tidyverse)
library(rvest)

url <- "https://www.trekbikes.com//us/en_US/bikes/road-bikes/performance-road-bikes/madone/madone-slr/madone-slr-9-etap-gen-7/p/37420"
read_html(url) %>%
  html_nodes(".buying-zone__title") %>%
  html_text()
When I run the code above, I get {xml_nodeset (0)}. How can I fix this? I would also like to scrape the year, price, available colors and specs from that page. Any help will be appreciated.
There is a lot of dynamic content on that page, which you can confirm by disabling JavaScript in the browser or by comparing the rendered page against the page source.
You can view the page source with Ctrl+U, then Ctrl+F to search for where the product name exists within the non-rendered content.
The title info you want is present in lots of places, and there are numerous ways to obtain it. I will offer a "does what it says on the tin" option, as the code gives clear indications of what is being selected.
I've updated the syntax and reduced the number of imported external dependencies.
library(magrittr)
library(rvest)
url <- "https://www.trekbikes.com//us/en_US/bikes/road-bikes/performance-road-bikes/madone/madone-slr/madone-slr-9-etap-gen-7/p/37420"
page <- read_html(url)
name <- page %>% html_element("[product-name]") %>% html_attr("product-name")
specs <- page %>% html_elements('[class="sprocket__table spec"]') %>% html_table()
price <- page %>% html_element('#gtm-product-display-price') %>% html_attr('value') %>% as.numeric()
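As a quick check of what those objects hold (a sketch only; the exact output depends on the live page at the time of scraping), note that html_table() returns a list of tables, so the spec table usually needs to be pulled out with [[1]]:

# Inspect the scraped values (illustrative only)
name    # product title read from the product-name attribute
price   # numeric price parsed from the element's value attribute

# html_table() returns a list; the spec table is typically the first element
spec_table <- specs[[1]]
head(spec_table)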

Scraping frames in R without RSelenium?

I need to scrape the "manuscript received date" that is visible in the right-hand frame once you click "Information" on this page: https://onlinelibrary.wiley.com/doi/10.1002/jcc.26717 . I tried the rvest script listed below, which has worked fine in similar situations. However, it does not work in this case, perhaps because of the click required to get to the publication history. I tried solving this by adding #pane-pcw-details to the URL (https://onlinelibrary.wiley.com/doi/10.1002/jcc.26717#pane-pcw-details), but to no avail. Another option would be to use RSelenium, but perhaps there is a simpler workaround?
library(rvest)

link <- c("https://onlinelibrary.wiley.com/doi/10.1002/jcc.26717#pane-pcw-details")
wiley_output <- data.frame()

page <- read_html(link)
revhist <- page %>% html_node(".publication-history li:nth-child(5)") %>% html_text()
wiley_output <- rbind(wiley_output, data.frame(link, revhist, stringsAsFactors = FALSE))
That data comes from an AJAX call you can find in the network tab. It has a lot of query-string parameters, but you actually only need the document identifier (the DOI), along with ajax=true to ensure the data associated with the specified AJAX action is returned:
https://onlinelibrary.wiley.com/action/ajaxShowPubInfo?ajax=true&doi=10.1002/jcc.26717
library(rvest)
library(magrittr)
link <- 'https://onlinelibrary.wiley.com/action/ajaxShowPubInfo?ajax=true&doi=10.1002/jcc.26717'
page <- read_html(link)
page %>% html_node(".publication-history li:nth-child(5)") %>% html_text()
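Since the original script built a data frame over links, here is a hedged sketch of how the same AJAX endpoint could be looped over several DOIs; the vector of DOIs is illustrative, not from the thread.

library(rvest)
library(magrittr)

# Hypothetical list of DOIs to look up (illustrative only)
dois <- c("10.1002/jcc.26717")

wiley_output <- do.call(rbind, lapply(dois, function(doi) {
  link <- paste0("https://onlinelibrary.wiley.com/action/ajaxShowPubInfo?ajax=true&doi=", doi)
  revhist <- read_html(link) %>%
    html_node(".publication-history li:nth-child(5)") %>%
    html_text()
  Sys.sleep(1)  # be polite between requests
  data.frame(doi, revhist, stringsAsFactors = FALSE)
}))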

How to log scrape paths rvest used?

Background: Using rvest I'd like to scrape all the details of all art pieces by the painter Paolo Uccello on wikiart.org. The end result will look something like this:
> names(uccello_dt)
[1] title year style genre media imgSRC infoSRC
Problem: When a scraping attempt doesn't go as planned, I get back character(0). That isn't helpful for understanding exactly what path the scrape took to end up with character(0). I'd like my scrape attempts to output the path they specifically took so that I can better troubleshoot my failures.
What I've tried:
I use Firefox, so after each failed attempt I go back to the web inspector tool to make sure that I am using the correct CSS selector / element tag. I've been keeping the rvest documentation by my side to better understand its functions. It's been a trial-and-error process that's taking much longer than I think it should. Here's a cleaned-up source of one of many failures:
library(tidyverse)
library(data.table)
library(rvest)

sample_url <- read_html(
  "https://www.wikiart.org/en/paolo-uccello/all-works#!#filterName:all-paintings-chronologically,resultType:detailed"
)

imgSrc <- sample_url %>%
  html_nodes(".wiki-detailed-item-container") %>%
  html_nodes(".masonry-detailed-artwork-item") %>%
  html_nodes("aside") %>%
  html_nodes(".wiki-layout-artist-image-wrapper") %>%
  html_nodes("img") %>%
  html_attr("src") %>%
  as.character()

title <- sample_url %>%
  html_nodes(".masonry-detailed-artwork-title") %>%
  html_text() %>%
  as.character()
Thank you in advance.
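No answer was recorded for this question in the thread. One option, offered here purely as a sketch (the helper name html_nodes_logged is made up), is to wrap html_nodes() so that every step in the pipe reports which selector it tried and how many nodes it matched; the first step that drops to 0 is the one to investigate. Note that much of the wikiart listing is loaded with JavaScript, so some selectors will legitimately match nothing in the static HTML.

library(rvest)

# Hypothetical helper (not from the original thread): behaves like html_nodes()
# but also logs the selector and the number of nodes it matched.
html_nodes_logged <- function(x, css) {
  nodes <- html_nodes(x, css)
  message(sprintf("selector '%s' matched %d node(s)", css, length(nodes)))
  nodes
}

title <- sample_url %>%
  html_nodes_logged(".wiki-detailed-item-container") %>%
  html_nodes_logged(".masonry-detailed-artwork-title") %>%
  html_text()
# Each step prints its match count, so you can see exactly where the
# scrape "path" breaks down instead of just getting character(0) at the end.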

Using rvest to Scrape Multiple Job Listing pages

I have read through multiple other similar questions and can't seem to find one that gives me the right answer.
I am trying to scrape all the current job titles on TeamWorkOnline.com.
This is the specific URL: https://www.teamworkonline.com/jobs-in-sports?employment_opportunity_search%5Bexclude_united_states_opportunities%5D=0&commit=Search
I have no problem starting the scraping process with this code:
library(rvest)
library(stringr)

listings <- data.frame(title = character(),
                       stringsAsFactors = FALSE)
{
  url_ds <- paste0('https://www.teamworkonline.com/jobs-in-sports?employment_opportunity_search%5Bexclude_united_states_opportunities%5D=0&commit=Search', i)
  var <- read_html(url_ds)
  # job title
  title <- var %>%
    html_nodes('.margin-none') %>%
    html_text() %>%
    str_extract("(\\w+.+)+")
  listings <- rbind(listings, as.data.frame(cbind(title)))
}
However, if you look at the site, there is 'numbered navigation' at the bottom to continue to other pages where more jobs are listed.
I cannot seem to figure out how to add the correct code to get rvest to automatically navigate to the other pages and scrape those jobs as well.
Any help would be greatly appreciated.
Try this:
library(rvest)
library(stringr)

listings <- character()
for (i in 1:25) {
  url_ds <- paste0("https://www.teamworkonline.com/jobs-in-sports?employment_opportunity_search%5Bexclude_united_states_opportunities%5D=0&page=", i)
  # job title
  title <- read_html(url_ds) %>%
    html_nodes('.margin-none') %>%
    html_text() %>%
    str_extract("(\\w+.+)+")
  listings <- c(listings, title)
}
Simply loop through all pages to scrape and combine them.
There are 25 pages in the search results; the last one is
https://www.teamworkonline.com/jobs-in-sports?employment_opportunity_search%5Bexclude_united_states_opportunities%5D=0&page=25
Whenever you click the next button, the number at the end of the URL changes to match the page number, so if the code above works for the first page, you only need to iterate from 1 to 25, append each page number to the URL, and extract that page.
I hope this works.
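Hard-coding 25 pages will break whenever the number of results changes. As a rough sketch (the ".pagination a" selector is an assumption about the site's markup, not something confirmed in the thread), you could read the pagination links from the first page and take the largest page number before looping:

library(rvest)

base_url <- "https://www.teamworkonline.com/jobs-in-sports?employment_opportunity_search%5Bexclude_united_states_opportunities%5D=0&page="

# Guess the last page from the pagination links on page 1
# (".pagination a" is an assumed selector; adjust it after inspecting the page)
first_page <- read_html(paste0(base_url, 1))
page_numbers <- first_page %>%
  html_nodes(".pagination a") %>%
  html_text() %>%
  as.integer()
last_page <- max(page_numbers, na.rm = TRUE)

# Then loop for (i in 1:last_page) exactly as in the answer above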

rvest sometimes works, sometimes returns 0 nodes

I am relatively new to web scraping, and I have recently been using rvest.
I am scraping news headlines, paragraphs, and links from a Yahoo News page (about 10 of each at a time). The code I am using to do it is below:
headlines <- read_html(url) %>%
  html_nodes("#web a") %>%
  html_text()

paragraphs <- read_html(url) %>%
  html_nodes("#web p") %>%
  html_text()

links <- read_html(url) %>%
  html_nodes("#web a") %>%
  html_attr("href")
My issue is that sometimes my code works perfectly and I get what I need (three vectors of info, each of length 10), and then a second later on another test it returns nothing:
> headlines <- read_html(url) %>%
+ html_nodes("#web a") %>%
+ html_text()
> headlines
character(0)
Does anyone know why this is, or how to make it more reliable? I am putting the code into a dashboard and want to be able to reliably check the top news articles every day. Do rvest or Yahoo News perhaps have rate limits that are blocking me? I am currently unaware of any. For context, I am testing the dashboard constantly (100 times a day at a minimum); is it possible that this could be overworking it?
Thank you in advance for any guidance.
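No answer was captured here for this question. Intermittent character(0) results from a high-traffic site often mean the server occasionally returned a different page (for example a consent or throttled response), so one defensive pattern, sketched below purely as an assumption rather than a confirmed fix, is to retry the request a few times with a pause and a descriptive user agent (this uses httr alongside rvest; the helper name and user-agent string are hypothetical):

library(rvest)
library(httr)

# Hypothetical helper (not from the thread): retry the request a few times,
# identify the scraper with a user agent, and give up gracefully.
read_html_retry <- function(url, tries = 3, pause = 2) {
  for (attempt in seq_len(tries)) {
    resp <- GET(url, user_agent("my-dashboard (contact: me@example.com)"))
    if (status_code(resp) == 200) {
      page <- read_html(content(resp, as = "text", encoding = "UTF-8"))
      headlines <- page %>% html_nodes("#web a") %>% html_text()
      if (length(headlines) > 0) return(page)  # got a usable result
    }
    Sys.sleep(pause)  # back off before retrying
  }
  warning("No usable page after ", tries, " attempts")
  NULL
}

# page <- read_html_retry(url)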
