I am trying to scrape reviews from a webpage to determine word frequency. However, when a review is long, the page shows only a partial review; you have to click "More" to get the webpage to show the full review. Here is the code I am using to extract the text of the review. How can I "click" on "More" to get the full review?
library(rvest)
tripAdvisorURL <- "https://www.tripadvisor.com/Hotel_Review-g33657-d85704-Reviews-Hotel_Bristol-Steamboat_Springs_Colorado.html#REVIEWS"
webpage <- read_html(tripAdvisorURL)
reviewData <- xml_nodes(webpage, xpath = '//*[contains(concat( " ", @class, " " ), concat( " ", "partial_entry", " " ))]')
head(reviewData)
xml_text(reviewData[[1]])
[1] "The rooms were clean and we slept so good we had room 10 and 12 we
didn’t use 12 but it joins 10 .kind of strange but loved the hotel ..me
personally I would take the hot tub out it was kinda old..the lady
that...More"
As mentioned in the comment, you can use RSelenium together with rvest for more interactivity:
library(RSelenium)
rmDr <- rsDriver(browser = "chrome")
myclient <- rmDr$client
tripAdvisorURL <- "https://www.tripadvisor.com/Hotel_Review-g33657-d85704-Reviews-Hotel_Bristol-Steamboat_Springs_Colorado.html#REVIEWS"
myclient$navigate(tripAdvisorURL)
# select all "More" buttons and click each one in a loop
webEles <- myclient$findElements(using = "css", value = ".ulBlueLinks")
for (webEle in webEles) {
  webEle$clickElement()
}
mypagesource <- myclient$getPageSource()
read_html(mypagesource[[1]]) %>%
  html_nodes(".partial_entry") %>%
  html_text()
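When you are done, it is good practice to close the browser and stop the Selenium server so no orphaned processes linger. A minimal cleanup sketch using the objects created above:
# close the browser window and stop the background Selenium server
myclient$close()
rmDr$server$stop()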
I am attempting to scrape this URL to get the names of the top 50 SoundCloud artists in Canada.
Using SelectorGadget, I selected the artists' names and it told me the path is '.sc-link-light'.
My first attempt was as follows:
library(rvest)
library(stringr)
library(reshape2)
soundcloud <- read_html("https://soundcloud.com/charts/top?genre=all-music&country=CA")
artist_name <- soundcloud %>% html_nodes('.sc-link-light') %>% html_text()
which yielded artist_name as a character vector of length 0.
My second attempt I changed the last line to:
artist_name <- soundcloud %>% html_node(xpath='//*[contains(concat( " ", @class, " " ), concat( " ", ".sc-link-light", " " ))]') %>% html_text()
which again yielded the same result.
What exactly am I doing wrong? I believe this should give me the artists' names in a list.
Any help is appreciated, thank you.
The webpage you are attempting to scrape is dynamic. As a result, you will need to use a library such as RSelenium. A sample script is below:
library(tidyverse)
library(RSelenium)
library(rvest)
library(stringr)
url <- "https://soundcloud.com/charts/top?genre=all-music&country=CA"
rD <- rsDriver(browser = "chrome")
remDr <- rD[["client"]]
remDr$navigate(url)
pg <- read_html(remDr$getPageSource()[[1]])
artist_name <- pg %>% html_nodes('.sc-link-light') %>% html_text()
#### clean up ####
remDr$close()
rD$server$stop()
rm(rD, remDr)
gc()
# kill any Java process left behind by the Selenium server (Windows only)
system("taskkill /im java.exe /f", intern = FALSE, ignore.stdout = FALSE)
I'm trying to collect information using the rvest package in R.
While collecting the data with a for loop, I found that some of the pages do not contain information, so the loop stops with an error: Error in open.connection(x, "rb") : HTTP error 404.
Here is my R code. Pages 15138 and 15140 do have information, whereas 15139 does not. How can I skip 15139 in the for loop?
library(rvest)
library(dplyr)
library(tidyr)
library(stringr)
library(stringi)
source_url <- "https://go2senkyo.com/local/senkyo/"
senkyo <- data.frame()
for (i in 15138:15140) {
  Sys.sleep(0.5)
  target_page <- paste0(source_url, i)
  recall_html <- read_html(target_page, encoding = "UTF-8")

  prefecture <- recall_html %>%
    html_nodes(xpath = '//*[contains(concat( " ", @class, " " ), concat( " ", "column_ttl_small", " " ))]') %>%
    html_text()

  city <- recall_html %>%
    html_nodes(xpath = '//*[contains(concat( " ", @class, " " ), concat( " ", "column_ttl", " " ))]') %>%
    html_text()
  city <- trimws(gsub("[\r\n]", "", city))

  senkyo2 <- cbind(prefecture, city)
  senkyo <- rbind(senkyo, senkyo2)
}
I'm looking forward to your answer!
You can handle exceptions a few different ways. I'm a noob when it comes to scraping, but here are a few options for your situation.
Tailor Your Loop Range
If you know that you don't want the value 15139, you can remove it from the vector of values the loop iterates over, like:
for (i in c(15138,15140)) {
which will completely ignore 15139 when running your loop.
Add Control Flow
This is basically the same thing as tailoring your loop range, but handles the exception within the loop itself, like:
for (i in 15138:15140) {
  Sys.sleep(0.5)
  # control statement
  if (i == 15139) {
    next # moves to the next iteration of the loop, in this case 15140
  }
  target_page <- paste0(source_url, i) # not run if i == 15139, since the loop skipped to the next iteration
Condition Handling Tools
This is where I get out of my depth, and constantly reference Advanced-R. Essentially, you can wrap functions like try() around your potentially buggy code, which can insulate your loop from errors and keep it from breaking, and gives you flexibility about what to do if your code breaks in specific ways.
My usual approach would be to add something to your code like:
# wrap the part of your code that can break in try()
recall_html <- try(read_html(target_page, encoding = "UTF-8"))
# you'll still see your error, but it won't stop your code, unless you set silent = TRUE
# you'll need to add control flow to keep your loop from breaking at the next function, however
if (class(recall_html) == 'try-error') {
  next
} else {
  prefecture <- recall_html %>%
    html_nodes(xpath = '//*[contains(concat( " ", @class, " " ), concat( " ", "column_ttl_small", " " ))]') %>%
    html_text()
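If you'd rather not inspect the class of the result, tryCatch() lets you return a sentinel value from an error handler directly. Below is a minimal sketch of the whole loop using that pattern (the CSS selectors are just shorthand for the XPath expressions in the question):
library(rvest)

senkyo <- data.frame()
for (i in 15138:15140) {
  Sys.sleep(0.5)
  target_page <- paste0("https://go2senkyo.com/local/senkyo/", i)

  # return NULL instead of raising an error when a page 404s
  recall_html <- tryCatch(
    read_html(target_page, encoding = "UTF-8"),
    error = function(e) NULL
  )
  if (is.null(recall_html)) next  # skip pages that failed to load

  prefecture <- recall_html %>%
    html_nodes(".column_ttl_small") %>%
    html_text()
  city <- recall_html %>%
    html_nodes(".column_ttl") %>%
    html_text()
  city <- trimws(gsub("[\r\n]", "", city))

  senkyo <- rbind(senkyo, cbind(prefecture, city))
}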
I want to scrape the number of products sold from a marketplace webpage using rvest.
I used this code, but it returned no value.
library(rvest)
doc <- read_html("https://www.tokopedia.com/berasprimasari/beras-bunga-25kg")
sold <- html_nodes(doc, ".rvm-product-info--item_value.mt-5.item-sold-count") %>%
  html_text()
sold
RESULT:
[1] " "
EXPECTED:
[1] " 378 "
How can I adjust my code to extract that number?
Many thanks in advance!
The number is retrieved dynamically from a product stats endpoint you can find in the network tab. You could split the string or simply regex out the part giving the number sold. You need to pass the product id, which you can grab from a request to the original url.
library(rvest)
library(stringr)
library(magrittr)
library(httr)

get_product_id <- function(url){
  headers = c('User-Agent' = 'Mozilla/5.0')
  # the product id is embedded in the page source as "product_id = <digits>;"
  s <- read_html(httr::GET(url, httr::add_headers(.headers = headers))) %>% html_text()
  id <- str_match_all(s, 'product_id\\s+=\\s+(\\d+);')[[1]][,2]
  return(id)
}

url = 'https://www.tokopedia.com/berasprimasari/beras-bunga-25kg'
# query the product stats endpoint and regex out the item_sold count
p <- read_html(paste0('https://js.tokopedia.com/productstats/check?pid=', get_product_id(url), '&callback=show_product_stats&_=')) %>%
  html_text()
number_sold <- str_match_all(p, 'item_sold\":(\\d+)')[[1]][,2]
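For reuse, the endpoint lookup can be wrapped into a small helper; get_number_sold below is a hypothetical name combining the two steps above:
# hypothetical helper built from get_product_id() and the stats request above
get_number_sold <- function(url){
  p <- read_html(paste0('https://js.tokopedia.com/productstats/check?pid=',
                        get_product_id(url), '&callback=show_product_stats&_=')) %>%
    html_text()
  as.integer(str_match_all(p, 'item_sold\":(\\d+)')[[1]][,2])
}

get_number_sold('https://www.tokopedia.com/berasprimasari/beras-bunga-25kg')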
I am trying to scrape a website but it does not give me any data.
#Get the Data
require(tidyverse)
require(rvest)
#specify the url
url <- 'https://www.travsport.se/sresultat?kommando=tevlingsdagVisa&tevdagId=570243&loppId=0&valdManad&valdLoppnr&source=S'
#get data
url %>%
  read_html() %>%
  html_nodes(".green div:nth-child(1)") %>%
  html_text()
character(0)
I have also tried the xpath '//*[contains(concat( " ", @class, " " ), concat( " ", "green", " " ))]//div[(((count(preceding-sibling::*) + 1) = 1) and parent::*)]//a', but this gives me the same result: no data.
I am expecting horse names. Shouldn't I at least get some JavaScript code, even if the data on the page is rendered by JavaScript?
I can't see what other CSS selector I should use here.
You can simply use the RSelenium package to scrape dynamic pages:
library(RSelenium)
library(rvest)
#specify the url
url <- 'https://www.travsport.se/sresultat?kommando=tevlingsdagVisa&tevdagId=570243&loppId=0&valdManad&valdLoppnr&source=S'
#Create the remote driver / navigator
rsd <- rsDriver(browser = "chrome")
remDr <- rsd$client
#Go to your url
remDr$navigate(url)
page <- read_html(remDr$getPageSource()[[1]])
#get your horses data by parsing Selenium page with Rvest as you know to do
page %>% html_nodes(".green div:nth-child(1)") %>% html_text()
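One caveat: the JavaScript needs a moment to render before you grab the page source. A short pause after navigating can help (the one-second value below is an arbitrary choice):
remDr$navigate(url)
Sys.sleep(1)  # arbitrary pause so the JavaScript can finish rendering; adjust as needed
page <- read_html(remDr$getPageSource()[[1]])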
Hope that helps.
Gottavianoni
I am looking to pull in the table at http://www.nfl.com/inactives?week=5 in order to process active and inactive players. I am very familiar with rvest and have tried using the code:
library(rvest)
url <- paste0("http://www.nfl.com/inactives?week=5")
Table <- url %>%
  read_html() %>%
  html_nodes(xpath = '//*[contains(concat( " ", @class, " " ), concat( " ", "yui3-datatable-cell", " " ))]') %>%
  html_table()
TableNew <- Table[[1]]
TableNew
Nothing is coming up correctly, though. Ideally, I would like to be able to put all the players and their team names into one single table. I appreciate your insights.
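A note on why this likely fails: the yui3-datatable-cell class comes from the YUI JavaScript library, which suggests the table is rendered client-side, so read_html() alone never sees the filled-in cells. A hedged sketch reusing the RSelenium pattern from the answers above (untested against the current page; nfl.com may have changed):
library(RSelenium)
library(rvest)

rD <- rsDriver(browser = "chrome")
remDr <- rD$client
remDr$navigate("http://www.nfl.com/inactives?week=5")

# parse the rendered page, then let html_table() work on whole <table> nodes
pg <- read_html(remDr$getPageSource()[[1]])
tables <- pg %>% html_nodes("table") %>% html_table(fill = TRUE)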