Scrape Product Information from an Ecommerce Page - r

I need to scrape product information from an ecommerce page, but the page has infinite scrolling. Currently I am only able to scrape the products shown without scrolling down. Below is the code for it.
require(RCurl)
require(XML)
require(dplyr)
require(stringr)

# download the raw page html
webpage <- getURL("http://www.jabong.com/kids/clothing/girls-clothing/kids-tops-t-shirts/?source=topnav_kids")

# pull every href attribute, keep only product links (they contain "?pos="), and de-duplicate
linklist <- str_extract_all(webpage, '(?<=href=")[^"]+')[[1]]
linklist <- as.data.frame(linklist)
linklist <- filter(linklist, grepl("\\?pos=", linklist))
linklist <- unique(linklist)

# build full links prefixed with the site name
a <- as.data.frame(linklist)
a[2] <- "Jabong.com"
a <- add_rownames(a, "ID")
a$V3 <- gsub(" ", "", paste(a$V2, a$linklist))
a <- a[, -(1:3)]
colnames(a) <- "Links"

Well, if scrolling is truly infinite, then it is impossible to get ALL of the links... If you wanted to settle for a finite number, you can indeed fruitfully use RSelenium here.
library(RSelenium)
library(magrittr)  # for the %>% pipe used below

# start RSelenium
checkForServer()
startServer()
remDr <- remoteDriver()
remDr$open()

# load your page
remDr$navigate("http://www.jabong.com/kids/clothing/girls-clothing/kids-tops-t-shirts/?source=topnav_kids")

# scroll down 5 times, allowing 3 seconds for the page to load every time
for (i in 1:5) {
  remDr$executeScript(paste("scroll(0,", i * 10000, ");"))
  Sys.sleep(3)
}

# get the page html
page_source <- remDr$getPageSource()

# extract the URLs you are looking for
pp <- xml2::read_html(page_source[[1]]) %>%
  rvest::html_nodes("a") %>%
  rvest::html_attr("data-original-href") %>%
  {.[!is.na(.)]}
The result is 312 links (in my browser). The more you have RSelenium scroll down, the more links you'll get.
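If you would rather not hard-code the number of scrolls, one option (not part of the original answer, and untested against the live site) is to keep scrolling until the page height stops growing:
# Assumes remDr is the open remoteDriver session from the snippet above.
last_height <- 0
repeat {
  remDr$executeScript("window.scrollTo(0, document.body.scrollHeight);")
  Sys.sleep(3)  # give newly loaded products time to render
  new_height <- remDr$executeScript("return document.body.scrollHeight;")[[1]]
  if (new_height == last_height) break  # nothing new loaded, so stop
  last_height <- new_height
}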

Related

Using Sys.sleep breaks rvest scrape

I am trying to scrape a website that has hundreds of pages. I have been using the following code to get through all pages, but in order to not overwhelm the website, there must be a pause between scrapes. I have been trying to induce this pause using Sys.sleep(15), but this causes the final dataframe to come out empty. Any ideas why this is happening?
Version one:
a <- lapply(paste0("https://website.com/page/", 1:500),
            function(url){
              url %>% read_html() %>%
                html_nodes(".text") %>%
                html_text()
              Sys.sleep(15)
            })
raw_posts <- unlist(a)
a <- data.frame(raw_posts)
This simply returns an empty data frame.
Version two:
url_base <- "https://website.com/page/"
map_df(1:500, function(i) {
  Sys.sleep(15)
  cat(" bababooeey ")
  pg <- read_html(sprintf(url_base, i))
  data.frame(text = html_text(html_nodes(pg, ".text")),
             date = html_text(html_nodes(pg, "time")),
             stringsAsFactors = FALSE)
}) -> b
This just pastes the same set of results found on the same page over and over.
Does anything stand out as being wrongly coded?
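Two things stand out (this note is not part of the original thread): in version one, Sys.sleep(15) is the last expression in the anonymous function, so each lapply element is Sys.sleep's invisible NULL rather than the scraped text; in version two, url_base contains no sprintf format specifier, so sprintf(url_base, i) returns the same URL on every iteration. A minimal sketch of a corrected loop, keeping the placeholder URL and the .text selector from the question:
library(rvest)
library(purrr)

url_base <- "https://website.com/page/%d"   # note the %d placeholder

b <- map_df(1:500, function(i) {
  pg <- read_html(sprintf(url_base, i))
  out <- data.frame(text = html_text(html_nodes(pg, ".text")),
                    stringsAsFactors = FALSE)
  Sys.sleep(15)  # pause between requests, then return the scraped rows
  out
})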

R webscraping doesn't work when url is in variable

I'm having trouble scraping in R. I want to scrape genre information for several titles on goodreads.
If I do this, it works completely fine and gives me what I need:
library(polite)
library(rvest)
library(dplyr)
session <- bow("https://www.goodreads.com/book/show/29991718-royally-matched",
delay = 5)
genres <- scrape(session) %>%
html_elements(".bookPageGenreLink") %>%
html_text()
genres
However, since I'd like to loop over several pages, I need this to work, but it always returns character(0).
host <- "https://www.goodreads.com/book/show/29991718-royally-matched"
session <- bow(host,
               delay = 5)
genres <- scrape(session) %>%
  html_elements(".bookPageGenreLink") %>%
  html_text()
genres
Something like this would also be fine for me, but it doesn't work either:
link = "29991718-royally-matched"
session <- bow(paste0("https://www.goodreads.com/book/show/29991718-royally-matched", link),
               delay = 5)
genres <- scrape(session) %>%
  html_elements(".bookPageGenreLink") %>%
  html_text()
genres
If I open the website with JavaScript disabled, it still works completely fine, so I don't think Selenium is necessary, and I really can't figure out why this doesn't work, which drives me crazy.
Thank you so much for your support!
Solution (kind of)
So I noticed that the success of my scrapings was kind of dependent on the random moods of the scraping gods.
So I did the following:
links <- c("31752345-black-mad-wheel", "00045101-The-Mad-Ship", "2767052-the-hunger-games", "18619684-the-time-traveler-s-wife", "29991718-royally-matched")
data <- data.frame(links)
for (link in links) {
print(link)
genres <- character(0)
url <- paste0('https://www.goodreads.com/book/show/',link)
#I don't know why, but resaving it kinda helped
host <- url
#I had the theory that repeating the scraping would eventually lead to a result. For me that didn't work though
try <- 0
while (identical(genres, character(0)) & (try < 10)) {
try <- try+1
print(paste0(try, ": ", link))
session <- bow(host,
delay = 5)
scraping <- scrape(session)
genres <- scraping %>% html_elements(".bookPageGenreLink") %>%
html_text()
}
if(identical(genres, character(0))){
print("Scraping unsuccessfull.. :( ")
}
else{
print("scraping success!!")
genres.df <- data.frame(genres)
data <- left_join(data,
genres.df, by = c("link"))
}
}
## then I created a list of the missing titles
missing_titles <- data %>%
filter(is.na(genre_1))
missing_links <- unique(missing_titles$link)
So the next step(s) were closing R (while saving the workspace, of course), restarting it, and re-feeding the loop with missing_titles instead of links. It took me about 7 iterations of that to get everything I needed, and on the last run I had to insert the last remaining link directly into example 1, since it did not work inside the loop. No idea why.
I hope the code kind of works, since I wanted to spare you pages of wild data formatting.
If someone has an explanation for why I needed to go through this hassle, I would still very much appreciate it.
You can consider using the R package RSelenium as follows:
library(RSelenium)
library(rvest)
url <- "https://www.goodreads.com/book/show/29991718-royally-matched"
shell('docker run -d -p 4445:4444 selenium/standalone-firefox')
remDr <- remoteDriver(remoteServerAddr = "localhost", port = 4445L, browserName = "firefox")
remDr$open()
remDr$navigate(url)
page_Content <- remDr$getPageSource()[[1]]
read_html(page_Content) %>% html_elements(".bookPageGenreLink") %>% html_text()
Afterwards, you can loop over the URLs you want.
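For example, a rough sketch of such a loop (assuming remDr is the open session from the snippet above, and reusing book slugs from the question):
links <- c("2767052-the-hunger-games", "29991718-royally-matched")  # slugs taken from the question
genres_by_book <- lapply(links, function(link) {
  remDr$navigate(paste0("https://www.goodreads.com/book/show/", link))
  Sys.sleep(5)  # give the page time to render
  page_Content <- remDr$getPageSource()[[1]]
  read_html(page_Content) %>%
    html_elements(".bookPageGenreLink") %>%
    html_text()
})
names(genres_by_book) <- links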

Issue with scrolling down to scrape Google Reviews

I am trying to scrape data from Google Reviews (stars, commentary, date, etc.).
I tried to adapt code that I found online but am having problems making it work. Apparently, R is not managing to scroll down the Google reviews and only returns the first ten reviews (the ones Google displays without scrolling).
Has someone come across the same issue? Thanks!
#install.packages("rvest")
#install.packages("xml2")
#install.packages("RSelenium")
library(rvest)
library(xml2)
library(RSelenium)
rmDr=rsDriver(port = 4444L, browser=c("firefox"))
myclient= rmDr$client
myclient$navigate("https://www.google.com/search?client=firefox-b-d&q=emporio+santa+maria#lrd=0x94ce576a4e45ed99:0xa36a342d3ceb06c3,1,,,")
#click on the snippet to switch focus----------
webEle <- myclient$findElement(using = "css",value = ".review-snippet")
webEle$clickElement()
#simulate scroll down for several times-------------
scroll_down_times=1000
for(i in 1 :scroll_down_times){
webEle$sendKeysToActiveElement(sendKeys = list(key="page_down"))
#the content needs time to load,wait 1 second every 5 scroll downs
if(i%%5==0){
Sys.sleep(3)
}
}
#loop and simulate clicking on all "click on more" elements-------------
webEles <- myclient$findElements(using = "css",value = ".review-more-link")
for(webEle in webEles){
tryCatch(webEle$clickElement(),error=function(e){print(e)}) # trycatch to prevent any error from stopping the loop
}
pagesource= myclient$getPageSource()[[1]]
#this should get you the full review, including translation and original text-------------
reviews=read_html(pagesource) %>%
html_nodes(".review-full-text") %>%
html_text()
#number of stars
stars <- read_html(pagesource) %>%
html_node(".review-dialog-list") %>%
html_nodes("g-review-stars > span") %>%
html_attr("aria-label")
#time posted
post_time <- read_html(pagesource) %>%
html_node(".review-dialog-list") %>%
html_nodes(".dehysf") %>%
html_text()`enter code here`
The code is all correct, but you didn't target the right element; use .review-dialog-list in the CSS selector instead. That element is where the scroll bar resides.
library(RSelenium)

rmDr <- rsDriver(browser = "firefox")
driver <- rmDr$client
driver$navigate("https://www.google.com/search?client=firefox-b-d&q=emporio+santa+maria#lrd=0x94ce576a4e45ed99:0xa36a342d3ceb06c3,1,,,")
Sys.sleep(3) # wait a couple of seconds to let the browser render the review window
webEle <- driver$findElement(using = "css", value = ".review-dialog-list")
for (i in 1:5) {
  webEle$sendKeysToElement(sendKeys = list(key = "page_down"))
  Sys.sleep(1)
}
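Once the loop has finished, you can grab the scrolled page source and reuse the extraction code from the question, for example (a sketch, assuming the same selectors still match):
library(rvest)

pagesource <- driver$getPageSource()[[1]]
reviews <- read_html(pagesource) %>%
  html_nodes(".review-full-text") %>%
  html_text()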

Extracting data from more than one page of TripAdvisor results

I'm trying to scrape data from TripAdvisor search results that span several pages using rvest.
Here's my code:
library(rvest)
starturl <- 'https://www.tripadvisor.co.uk/Search?q=swim+with&uiOrigin=trip_search_Attractions&searchSessionId=CA54193AF19658CB1D983934FB5C86F41511875967385ssid#&ssrc=A&o=0'
swimwith <- read_html(starturl)
swdf <- swimwith %>%
  html_nodes('.title span') %>%
  html_text()
It works fine for the first page of results, but I can't figure out how to get results from the subsequent pages. I noticed that the end of the url denotes the start position of the results, so I changed it from '0' to '30' as follows:
url <- sub('A&o=0', paste0('A&o=', '30'), starturl)
webpage <- html_session(url)
swimwith <- read_html(webpage)
swdf2 <- swimwith %>%
  html_nodes('.title span') %>%
  html_text()
However, the results for swdf2 are the same as swdf even though the url loads the second page of results in a web browser.
Any idea how I can get the results from these subsequent pages?
I think you want something like this.
jump <- seq(0, 300, by = 30)
site <- paste('https://www.tripadvisor.co.uk/Search?q=swim+with&uiOrigin=trip_search_Attractions&searchSessionId=CA54193AF19658CB1D983934FB5C86F41511875967385ssid#&ssrc=A&o=', jump, sep = "")

dfList <- lapply(site, function(i) {
  swimwith <- read_html(i)
  swdf <- swimwith %>%
    html_nodes('.title span') %>%
    html_text()
})

finaldf <- do.call(rbind, dfList)
It doesn't work in my office because the firewall is blocking it, but I think that should work for you.
Also, take a look at the links below.
https://rpubs.com/ryanthomas/webscraping-with-rvest
loop across multiple urls in r with rvest
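One caveat about the lapply above (not from the original answer): because each list element is a plain character vector, do.call(rbind, dfList) will recycle the shorter vectors with a warning whenever pages return different numbers of titles. A sketch of an alternative that keeps one row per title and tags it with the page offset:
# site and jump come from the snippet above
dfList <- lapply(seq_along(site), function(k) {
  titles <- read_html(site[k]) %>%
    html_nodes('.title span') %>%
    html_text()
  if (length(titles) == 0) return(NULL)  # skip pages with no results
  data.frame(offset = jump[k], title = titles, stringsAsFactors = FALSE)
})
finaldf <- do.call(rbind, dfList)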
Approach 1) Here is an approach based on the R package RSelenium:
library(RSelenium)

# Note: you have to install chromedriver first
rd <- rsDriver(chromever = "96.0.4664.45", browser = "chrome", port = 4450L)
remDr <- rd$client
remDr$open()
remDr$navigate("https://www.tripadvisor.co.uk/Search?q=swim+with&uiOrigin=trip_search_Attractions&searchSessionId=CA54193AF19658CB1D983934FB5C86F41511875967385ssid#&ssrc=A&o=0")
remDr$screenshot(display = TRUE, useViewer = TRUE)

list_Text <- list()
for (i in 1:30) {
  print(i)
  web_Obj <- remDr$findElement("xpath", paste0("//*[@id='BODY_BLOCK_JQUERY_REFLOW']/div[2]/div/div[2]/div/div/div/div/div[1]/div/div[1]/div/div[3]/div/div[1]/div/div[2]/div/div/div[", i, "]"))
  list_Text[[i]] <- web_Obj$getElementText()
}
Approach 2) If you are looking to extract the titles only, you can print the webpage to PDF and extract the text from the PDF afterwards. Here is an example:
library(pagedown)
library(pdftools)
chrome_print("https://www.tripadvisor.co.uk/Search?q=swim+with&uiOrigin=trip_search_Attractions&searchSessionId=CA54193AF19658CB1D983934FB5C86F41511875967385ssid#&ssrc=A&o=0",
"C:\\...\\trip_advisor.pdf")
text <- pdf_text("C:\\...\\trip_advisor.pdf")
text <- strsplit(text, split = "\r\n")
# The titles are in the variable text ...

select textbox with RSelenium

Start RSelenium
library(RSelenium)
RSelenium::startServer()
pJS <- phantom()
Sys.sleep(5) # give the binary a moment
remDr <- remoteDriver(browserName = 'phantomjs')
remDr$open()
Go to the site and wait a bit
remDr$navigate("http://ideal-scope.com/online-holloway-cut-adviser/")
Sys.sleep(5)
Now when I try to find the text box elements
depthElem <- remDr$findElements("name","depth_textbox")
tableElem <- remDr$findElements("name","table_textbox")
crownElem <- remDr$findElements("name","crown_textbox")
pavilionElem <- remDr$findElements("name","pavilion_textbox")
...just gives me a bunch of objects that are list()
If I do findElement instead of findElements I get
Error: Summary: NoSuchElement
Detail: An element could not be located on the page using the given search parameters.
How can I select these textboxes? Why can't I select them by searching name?
The findElements method returns an empty list when no elements are present. The page has the content you require in an iframe. You will need to switch to the iframe first before you can search for the elements:
remDr$navigate("http://ideal-scope.com/online-holloway-cut-adviser/")
# get iframes
webElems <- remDr$findElements("css", "iframe")
# there is only one
remDr$switchToFrame(webElems[[1]])
depthElem <- remDr$findElement("name","depth_textbox")
# > depthElem$getElementAttribute("name")
# [[1]]
# [1] "depth_textbox"
remDr$findElement("name","table_textbox")
crownElem <- remDr$findElement("name","crown_textbox")
pavilionElem <- remDr$findElement("name","pavilion_textbox")
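From there, a possible next step (a sketch only; the example value is made up) is to clear a box, type into it, and read another box back:
# "61.5" is just an illustrative value for the depth field
depthElem$clearElement()
depthElem$sendKeysToElement(list("61.5"))
# read back what another box currently contains
tableElem <- remDr$findElement("name", "table_textbox")
tableElem$getElementAttribute("value")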
