I am trying to scrape reviews from a webpage to determine word frequency. However, when a review is long, only part of it is shown: you have to click "More" to make the page display the full review. Here is the code I am using to extract the review text. How can I "click" on More to get the full review?
library(rvest)

tripAdvisorURL <- "https://www.tripadvisor.com/Hotel_Review-g33657-d85704-Reviews-Hotel_Bristol-Steamboat_Springs_Colorado.html#REVIEWS"
webpage <- read_html(tripAdvisorURL)

# SelectorGadget-style XPath for the truncated review text
reviewData <- xml_nodes(webpage, xpath = '//*[contains(concat( " ", @class, " " ), concat( " ", "partial_entry", " " ))]')
head(reviewData)
xml_text(reviewData[[1]])
[1] "The rooms were clean and we slept so good we had room 10 and 12 we
didn’t use 12 but it joins 10 .kind of strange but loved the hotel ..me
personally I would take the hot tub out it was kinda old..the lady
that...More"
As mentioned in the comment, you can use RSelenium together with rvest for this kind of interactivity:
library(RSelenium)
library(rvest)

rmDr <- rsDriver(browser = "chrome")
myclient <- rmDr$client

tripAdvisorURL <- "https://www.tripadvisor.com/Hotel_Review-g33657-d85704-Reviews-Hotel_Bristol-Steamboat_Springs_Colorado.html#REVIEWS"
myclient$navigate(tripAdvisorURL)

# select all "More" buttons and loop to click them
webEles <- myclient$findElements(using = "css", value = ".ulBlueLinks")
for (webEle in webEles) {
  webEle$clickElement()
}

# the page source now contains the expanded reviews
mypagesource <- myclient$getPageSource()
read_html(mypagesource[[1]]) %>%
  html_nodes(".partial_entry") %>%
  html_text()
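One refinement worth considering (a sketch, assuming the same session objects as above): pause briefly after each click so the expanded text is in the DOM before you grab the page source, and shut the browser and server down when you are done.

for (webEle in webEles) {
  webEle$clickElement()
  Sys.sleep(0.5)  # crude fixed delay; an explicit wait condition would be more robust
}

# clean up the session when finished
myclient$close()
rmDr$server$stop()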
I want to extract all download links for txt-files from this page: https://www.bundestag.de/dokumente/protokolle/plenarprotokolle/plenarprotokolle. Using SelectorGadget, I selected the document links and got the selector .bt-link-dokument. With this information I wrote this code:
library(rvest)
library(tidyverse)
protokolle <- read_html("https://www.bundestag.de/dokumente/protokolle/plenarprotokolle/plenarprotokolle")

txts <- protokolle %>%
  html_nodes(".bt-link-dokument")
The result is the same if I try
txts <- protokolle %>%
  html_nodes(xpath = '//*[contains(concat( " ", @class, " " ), concat( " ", "bt-link-dokument", " " ))]')
For a reason I do not understand, txts is an empty {xml_nodeset (0)}. Any ideas on what went wrong?
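A quick diagnostic (a sketch, not specific to this site): list every link href present in the static HTML that read_html actually receives. If no document links show up here either, the list is filled in by JavaScript after the page loads, so rvest never sees it, and an RSelenium session like the one in the answer above is one way around that.

library(rvest)

protokolle <- read_html("https://www.bundestag.de/dokumente/protokolle/plenarprotokolle/plenarprotokolle")

# all hrefs in the raw HTML, before any JavaScript runs
hrefs <- protokolle %>%
  html_nodes("a") %>%
  html_attr("href")

# any .txt or .pdf document links at all?
grep("\\.txt$|\\.pdf$", hrefs, value = TRUE)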
I am looking to pull a table from http://www.nfl.com/inactives?week=5 in order to process active and inactive players. I am very familiar with rvest and have tried using this code:
library(rvest)

url <- "http://www.nfl.com/inactives?week=5"
Table <- url %>%
  read_html() %>%
  html_nodes(xpath = '//*[contains(concat( " ", @class, " " ), concat( " ", "yui3-datatable-cell", " " ))]') %>%
  html_table()
TableNew <- Table[[1]]
TableNew
Nothing comes up correctly, though. Ideally, I would like to put all the players and their team names into one table. I appreciate your insights.
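One way to check what rvest can actually reach (a diagnostic sketch; the yui3-datatable-cell class suggests a YUI DataTable widget that builds the table with JavaScript at run time, in which case the static HTML contains no such cells and an RSelenium session is again the usual workaround):

library(rvest)

page <- read_html("http://www.nfl.com/inactives?week=5")

# how many tables exist in the static HTML?
length(html_nodes(page, "table"))

# do any YUI datatable cells appear in the raw document?
length(html_nodes(page, ".yui3-datatable-cell"))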
I want to grab some data from the Pro Football Reference website using the rvest package. First, let's grab results for all games played in 2015 from this URL: http://www.pro-football-reference.com/years/2015/games.htm
library("rvest")
library("dplyr")
#grab table info
url <- "http://www.pro-football-reference.com/years/2015/games.htm"
urlHtml <- url %>% read_html()
dat <- urlHtml %>% html_table(header=TRUE) %>% .[[1]] %>% as_data_frame()
Is that how you would have done it? :)
dat could be cleaned up a bit: two of the variables have blank names, and the header row is repeated between each week.
colnames(dat) <- c("week", "day", "date", "winner", "at", "loser",
                   "box", "ptsW", "ptsL", "ydsW", "toW", "ydsL", "toL")

# drop the repeated header rows, whose "box" cell is blank
dat2 <- dat %>% filter(box != "")
head(dat2)
Looks good!
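One more touch-up worth making (a sketch, assuming dplyr is still loaded and the column names set above): because of the repeated header rows, html_table read the score and yardage columns in as character, so convert them back to numeric after filtering.

dat2 <- dat2 %>%
  mutate_at(vars(ptsW, ptsL, ydsW, toW, ydsL, toL), as.numeric)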
Now let's look at an individual game. At the webpage above, click on "Boxscore" in the very first row of the table: the Sept 10th game played between New England and Pittsburgh. That takes us here: http://www.pro-football-reference.com/boxscores/201509100nwe.htm.
I want to grab the individual snap counts for each player (about halfway down the page). I'm pretty sure these will be our first two lines of code:
gameUrl <- "http://www.pro-football-reference.com/boxscores/201509100nwe.htm"
gameHtml <- gameUrl %>% read_html()
But now I can't figure out how to grab the specific table I want. I used SelectorGadget to highlight the table of Patriots snap counts, clicking on the table in several places and then 'unclicking' the other tables that were highlighted. I end up with this selector:
#home_snap_counts .right , #home_snap_counts .left, #home_snap_counts .left, #home_snap_counts .tooltip, #home_snap_counts .left
Each of these attempts returns {xml_nodeset (0)}:
gameHtml %>% html_nodes("#home_snap_counts .right , #home_snap_counts .left, #home_snap_counts .left, #home_snap_counts .tooltip, #home_snap_counts .left")
gameHtml %>% html_nodes("#home_snap_counts .right , #home_snap_counts .left")
gameHtml %>% html_nodes("#home_snap_counts .right")
gameHtml %>% html_nodes("#home_snap_counts")
Maybe let's try using XPath. All of these attempts also return {xml_nodeset (0)}:
gameHtml %>% html_nodes(xpath = '//*[(@id = "home_snap_counts")]//*[contains(concat( " ", @class, " " ), concat( " ", "right", " " ))] | //*[(@id = "home_snap_counts")]//*[contains(concat( " ", @class, " " ), concat( " ", "left", " " ))] | //*[(@id = "home_snap_counts")]//*[contains(concat( " ", @class, " " ), concat( " ", "tooltip", " " ))]')
gameHtml %>% html_nodes(xpath = '//*[(@id = "home_snap_counts")]//*[contains(concat( " ", @class, " " ), " ")]')
gameHtml %>% html_nodes(xpath = '//*[(@id = "home_snap_counts")]')
How can I grab that table? I'll also point out that when I do "View Page Source" in Google Chrome, the tables I want appear to be commented out: they're shown in green instead of the usual red/black/blue color scheme. That is not the case for the table of game results we pulled first; its page source uses the usual colors. Is the green indicative of what's preventing me from grabbing this snap count table?
Thanks!
The information you are looking for is generated programmatically at run time. One solution is to use RSelenium.
Looking at the page's source, the table data is present in the code but hidden, because the tables are wrapped in HTML comments. Here is my solution, in which I remove the comment markers and reprocess the page normally.
I saved the file to the working directory and then read it back in using the readLines function.
I then search for the HTML begin and end comment markers and remove those lines. I save the file a second time (minus the comment markers) in order to reread and process the file for the selected tables.
library(rvest)
library(xml2)

gameUrl <- "http://www.pro-football-reference.com/boxscores/201509100nwe.htm"
gameHtml <- gameUrl %>% read_html()

# only a few tables are visible in the parsed document
gameHtml %>% html_nodes("tbody")

# only save and work with the body
body <- html_node(gameHtml, "body")
write_xml(body, "nfl.xml")

# find and remove the lines holding the comment markers
lines <- readLines("nfl.xml")
lines <- lines[-grep("<!--", lines)]
lines <- lines[-grep("-->", lines)]
writeLines(lines, "nfl2.xml")

# read the file back in and process normally, e.g. grabbing a table by position
body <- read_html("nfl2.xml")
html_table(html_nodes(body, "table")[29])

# extract the attributes of every table
a <- html_attrs(html_nodes(body, "table"))

# find the tables of interest by their id attribute
homesnap <- which(sapply(a, function(x) {x["id"]}) == "home_snap_counts")
html_table(html_nodes(body, "table")[homesnap])

visitsnap <- which(sapply(a, function(x) {x["id"]}) == "vis_snap_counts")
html_table(html_nodes(body, "table")[visitsnap])
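A file-free variant of the same idea (a sketch: strip the comment markers from the raw HTML in memory instead of writing intermediate files, then parse as usual and select the snap count table by its id):

library(rvest)

gameUrl <- "http://www.pro-football-reference.com/boxscores/201509100nwe.htm"

# fetch the raw HTML as text and remove the comment markers
raw <- paste(readLines(gameUrl, warn = FALSE), collapse = "\n")
raw <- gsub("<!--|-->", "", raw)

# parse the uncommented document and pull the table directly
page <- read_html(raw)
html_table(html_nodes(page, "table#home_snap_counts"))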
I want to get the line-up data from this site, http://festileaks.com/pinkpop-festival/pinkpop-programma-2015/, as input for a function.
To use the data for my function, I need it to be in a list. For this to work, all the artists need to be between quotation marks, which is the case in the vector of character strings. As soon as I collapse the vector into one string, "artist" changes into artist (without the quotes). Is there a way to get the artists into one string without losing the quotation marks?
This is the code I use (R package rvest):
Pinkpop_2015 <- read_html("http://festileaks.com/pinkpop-festival/pinkpop-programma-2015/")
Pinkpophtml <- html_nodes(Pinkpop_2015, ".g7-one_third > a")
Pinkpoplineup <- html_text(Pinkpophtml)

# strip stray encoding characters, hyphens, and leading spaces
Pinkpoplineup <- gsub("Â", "", Pinkpoplineup)
Pinkpoplineup <- gsub("-", "", Pinkpoplineup)
Pinkpoplineup <- gsub("^ ", "", Pinkpoplineup)

Plineup_string <- paste(Pinkpoplineup, collapse = " ,")
OK, so based on your new character vector (I took just the first three entries):
Pinkpoplineup <- "Vrijdag 12 juni\n Muse\n Elbow"
You can try
Pinkpoplineup <- unlist(strsplit(Pinkpoplineup, split="\n "))
Pinkpoplineup <- paste0('\"', Pinkpoplineup, '\"', collapse=",")
Each artist is now individually quoted (the quotes display escaped in the console), and you can see the actual result using
cat(Pinkpoplineup)
or writing the object to a file
write(Pinkpoplineup, file = "Pinkpoplineup.txt")
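For the three-entry example above, cat(Pinkpoplineup) prints the comma-separated, quoted list:

"Vrijdag 12 juni","Muse","Elbow"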