Interact with all elements in a list in R

I'm trying to interact with all elements of a list in R.
Specifically, I'm trying to click on every element in the list with RSelenium's clickElement() function.
I'm scraping data from this webpage:
https://play.google.com/store/apps/details?id=hr.mireo.arthur&hl=en&fbclid=IwAR3c-PkUXOea8KrKLp9Q3JUjCidGmgO2jYX_Qb7O8VuWlHXPIS5nDOfKRKI&showAllReviews=true
Here is my code so far:
url <- 'https://play.google.com/store/apps/details?id=hr.mireo.arthur&hl=en&fbclid=IwAR3c-PkUXOea8KrKLp9Q3JUjCidGmgO2jYX_Qb7O8VuWlHXPIS5nDOfKRKI&showAllReviews=true'
#Open webpage using RSelenium
rD <- rsDriver(port = 4445L, browser=c("chrome"), chromever="80.0.3987.106")
remDr <- rD[["client"]]
remDr$navigate(url)
#-----------------------------------------Load whole page by scrolling and showing more
xp_show_more <- "//*[@id='fcxH9b']/div[4]/c-wiz/div/div[2]/div/div[1]/div/div/div[1]/div[2]/div[2]/div"
replicate(5, {
  replicate(5, {
    # scroll down
    webElem <- remDr$findElement("css", "body")
    webElem$sendKeysToElement(list(key = "end"))
    # wait
    Sys.sleep(1)
  })
  # find the "Show more" button
  morereviews <- remDr$findElement(using = 'xpath', xp_show_more)
  # click it; tryCatch prevents any error from stopping the loop
  tryCatch(morereviews$clickElement(), error = function(e) print(e))
  # wait
  Sys.sleep(3)
})
This code loads the whole page and shows all comments, but some comments are long and have "Full review" buttons which have to be clicked in order to show the full length of the comment. I have managed to find all of those buttons (there are 36 of them) using the findElements function with the following code:
fullreviews <- remDr$findElements(using = 'css selector', value = ".OzU4dc")
This code results in a list of 36 elements, and when I want to click on them using this code:
fullreviews$clickElement()
I get this error:
Error: attempt to apply non-function
I can click on all 36 elements using these 36 lines of code:
fullreviews[[1]]$clickElement()
fullreviews[[2]]$clickElement()
fullreviews[[3]]$clickElement()
...and so on up to 36.
But I would like to do this with a single function, in case I have to scrape some bigger webpage with thousands of elements to click.
I have tried this code, but it doesn't work:
fullreviews[[1:36]]$clickElement()
I guess some sort of lapply function is needed, but I can't manage to create one that works.
Is there a way to do this in a single loop or function?
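For what it's worth, here is a minimal sketch of the kind of loop being asked for, assuming fullreviews is the list returned by findElements() above; each entry is a webElement object, so clickElement() has to be called on the entries one at a time, for example with lapply:
invisible(lapply(fullreviews, function(el) {
  # tryCatch keeps one failed click from stopping the loop
  tryCatch(el$clickElement(), error = function(e) print(e))
  # brief pause so the page can keep up
  Sys.sleep(0.3)
}))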

Related

Click on many of the same buttons when scraping data

I'm trying to scrape reviews from this Google Play site using R (mainly "RSelenium" and "rvest"):
https://play.google.com/store/apps/details?id=hr.mireo.arthur&hl=en&fbclid=IwAR3c-PkUXOea8KrKLp9Q3JUjCidGmgO2jYX_Qb7O8VuWlHXPIS5nDOfKRKI&showAllReviews=true
I've managed to load page using RSelenium and make a loop which scrolls down the page and clicks on all "Show more" buttons. Here is the code I've used:
#Load packages
library(rvest)
library(dplyr)
library(wdman)
library(RSelenium)
#Open website using RSelenium
url <- 'https://play.google.com/store/apps/details?id=hr.mireo.arthur&hl=en&fbclid=IwAR3c-PkUXOea8KrKLp9Q3JUjCidGmgO2jYX_Qb7O8VuWlHXPIS5nDOfKRKI&showAllReviews=true'
rD <- rsDriver(port = 4567L, browser=c("chrome"), chromever="80.0.3987.106")
remDr <- rD[["client"]]
remDr$open()
remDr$navigate(url)
#Load whole page by scrolling and showing more
xp_show_more <- "//*[@id='fcxH9b']/div[4]/c-wiz/div/div[2]/div/div[1]/div/div/div[1]/div[2]/div[2]/div"
replicate(5, {
  replicate(5, {
    # scroll down
    webElem <- remDr$findElement("css", "body")
    webElem$sendKeysToElement(list(key = "end"))
    # wait
    Sys.sleep(1)
  })
  # find the "Show more" button
  morereviews <- remDr$findElement(using = 'xpath', xp_show_more)
  # click it; tryCatch prevents any error from stopping the loop
  tryCatch(morereviews$clickElement(), error = function(e) print(e))
  # wait
  Sys.sleep(3)
})
This results in loading all 573 comments, but several comments have "Full review" buttons which have to be clicked in order to see the rest of the text. I'm trying to make a script which clicks on all "Full review" buttons (I believe there are just over 30 of them), but I can't manage to do so. My current script only clicks on the first "Full review" button:
#Click on "Full Review"
xp_full_review <- '//*[@id="fcxH9b"]/div[4]/c-wiz/div/div[2]/div/div[1]/div/div/div[1]/div[2]/div/div[2]/div/div[2]/div[2]/span[1]/div/button'
replicate(35, {
  fullreviews <- remDr$findElement(using = 'xpath', xp_full_review)
  fullreviews$clickElement()
  Sys.sleep(0.5)
})
Can someone spot a mistake and find a way to click on all "Full review" buttons?
Thank you
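One alternative to locating the button over and over is to click every matching button from a single executeScript() call. This is only a sketch; the .OzU4dc selector is borrowed from the question above and may change whenever Google updates the page markup:
# click all "Full review" buttons in one JavaScript call
# (.OzU4dc is an assumption taken from the question above)
remDr$executeScript("document.querySelectorAll('.OzU4dc').forEach(function(b) { b.click(); });")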

Why is JavaScript click not working in Rselenium?

I am trying to scrape a web page with a JavaScript drop down menu in R. I can follow the directions listed here, but nothing happens and no errors are shown. Instead, it gives an empty list:
dropdown <- remDr$findElement(using = "id", "s2id_autogen4_search")
remDr$executeScript("arguments[0].setAttribute('class','select2-input select2-focused');", list(dropdown))
> list()
Also, nothing happens (and no console output) with dropdown$clickElement().
This is somewhat related to this post, but I need to click first to activate the drop down.
There was a mask over it, so I needed to find the mask, click on that, then supply arguments to the dropdown itself:
dropdown <- remDr$findElement(using = "id", "s2id_autogen4_search")
mask <- remDr$findElement(using = "xpath", "//*[#id='select2-drop-mask']")
mask$clickElement()
dropdown$sendKeysToElement(list("l"))

R: Getting links from Google search results beyond the first page

I have this RSelenium setup (using selenium really shouldn't impact the answer to this question):
library(tidyverse)
library(rvest)
library(httr)
library(RSelenium) # running through docker
## RSelenium setup
remDr <- remoteDriver(port = 4445L, browserName = "chrome")
remDr$open()
## Navigate to Google Books
remDr$navigate("https://books.google.com/")
books <- remDr$findElement(using = "css", "[name = 'q']")
## Search for whatever, the Civil War, for example
books$sendKeysToElement(list("the civil war", key = "enter"))
## Getting Google web elements (10 per page)
bookElem <- remDr$findElements(using = "xpath", "//h3[@class = 'LC20lb']//parent::a")
## Click on each book link
links <- sapply(bookElem, function(bookElem){
bookElem$getElementAttribute("href")
})
This works great - and compiles all of the links from the first page of results (Google automatically limits it to 10 results, so ten links). What I would like is to have that same links vector compile every result link from the first, say, 12 pages (to keep it manageable). So:
goog_pgs <- seq(1:12) # to set the limit
Where I'm lost: how do I feed that into my links vector? The links from each page are too different and aren't simple enough to just feed the number to its end. I've tried inserting the following:
nextButton <- remDr$findElements("xpath", "//*[@id = 'pnnext']")
next_page <- sapply(nextButton, function(nextButton) {
next_elements$clickElement()
})
And that does not work. What's the solution here?
You can use the sequence 1:12 as something to iterate over, using a for loop, lapply, or another looping mechanism. I have a terrible time with the apply functions, so I swapped in map() from purrr (loaded with the tidyverse). The steps that need to be done repeatedly are finding books, getting the href of each book, and clicking the "next" button. With some modification, you can use:
books_12 <- map(1:12, function(pg) {
  bookElem <- remDr$findElements(using = "xpath", "//h3[@class = 'LC20lb']//parent::a")
  links <- map_chr(bookElem, ~.$getElementAttribute("href")[[1]])
  nextButton <- remDr$findElement("xpath", "//*[@id='pnnext']")
  nextButton$clickElement()
  links
})
Note that getElementAttribute returns a list; since each element only has one href, I kept the first (only) one with [[1]]. This yields a list of 12 vectors of 10 URLs each.
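If a single character vector is handier than a list of 12 vectors, the per-page results can simply be collapsed afterwards:
# collapse the 12 per-page vectors into one character vector of links
all_links <- unlist(books_12)
length(all_links) # roughly 120 links, 10 per results page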

NoSuchElementException scraping ESPN with RSelenium

I'm using R (and RSelenium) to scrape data from ESPN. It's not the first time I've used it, but in this case I'm getting an error and I can't sort it out.
Consider this page: http://en.espn.co.uk/premiership-2011-12/rugby/match/142562.html
Let's try to scrape the timeline. If I inspect the page I get the css selector
#liveLeft
As usual, I go with
checkForServer()
remDr <- remoteDriver()
remDr$open()
matchId <- "142562"
leagueString <- "premiership"
seasonString <- "2011-12"
url <- paste0("http://en.espn.co.uk/",leagueString,"-",seasonString,"/rugby/match/",matchId,".html")
remDr$navigate(url)
and the page correctly loads. So far so good. Now when I try to get the nodes with
div <- remDr$findElement(using = 'css selector','#liveLeft')
I get back
Error: Summary: NoSuchElement
Detail: An element could not be located on the page using the given search parameters.
I'm puzzled. I tried XPath as well and it doesn't work. I also tried to get different elements of the page, with no luck. The only selector that gives something back is
#scrumContent
From the comments.
The element resides in an iframe, and as such the element isn't available to select. This is shown by using JS in the Chrome console with document.getElementById('liveLeft'): on the full page it will return null, i.e. the element doesn't exist, even though it is clearly visible. To get around this, simply load the iframe instead.
If you inspect the page you will see the src of the iframe is /premiership-2011-12/rugby/current/match/142562.html?view=scorecard, from the example provided. Navigating to this page instead of the 'full' page will allow the element to be 'visible' and as such selectable by RSelenium.
checkForServer()
remDr <- remoteDriver()
remDr$open()
matchId <- "142562"
leagueString <- "premiership"
seasonString <- "2011-12"
url <- paste0("http://en.espn.co.uk/",leagueString,"-",seasonString,"/rugby/current/match/",matchId,".html?view=scorecard")
# Amend url to return iframe
remDr$navigate(url)
div <- remDr$findElement(using = 'css selector','#liveLeft')
UPDATE
If it is more appropriate to load the iframe contents into a variable and then traverse through that, the following example shows how:
document.getElementById('liveLeft') // Will return null as the iframe has a separate DOM
var doc = document.getElementById('win_old').contentDocument // Loads the iframe's DOM elements into the variable doc
doc.getElementById('liveLeft') // Will now return the desired element
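The same idea can be driven from R through executeScript(); a sketch, assuming the iframe keeps the id 'win_old' shown in the console example above:
# read #liveLeft out of the iframe's DOM without switching frames
live_left_text <- remDr$executeScript(
  "return document.getElementById('win_old').contentDocument.getElementById('liveLeft').innerText;"
)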
Generally with Selenium when you have a webpage with frames/iframes you need to use the switchToFrame method of the remoteDriver class:
library(RSelenium)
library(XML) # needed below for htmlParse(), xmlGetAttr() and readHTMLTable()
selServ <- startServer()
remDr <- remoteDriver()
remDr$open()
matchId <- "142562"
leagueString <- "premiership"
seasonString <- "2011-12"
url <- paste0("http://en.espn.co.uk/",leagueString,"-",seasonString,"/rugby/match/",matchId,".html")
remDr$navigate(url)
# check the iframes
iframes <- htmlParse(remDr$getPageSource()[[1]])["//iframe", fun = function(x){xmlGetAttr(x, "id")}]
# iframes[[3]] == "win_old" contains the data switch to this frame
remDr$switchToFrame(iframes[[3]])
# check you can access the element
div <- remDr$findElement(using = 'css selector','#liveLeft')
div$highlightElement()
# get data
ifSource <- htmlParse(remDr$getPageSource()[[1]])
out <- readHTMLTable(ifSource["//div[@id = 'liveLeft']"][[1]], header = TRUE)

How do I click a javascript "link"? Is it my xpath or my relenium/selenium usage?

As with any problem before I post it on Stack Overflow, I think I have tried everything. This is a learning experience for me in working with JavaScript and XML, so I'm guessing my problem is there.
My question is: how do I get the results of clicking on the parcel number links, which are JavaScript links? I've tried getting the xpath of the link and using the $click method, which followed my intuition, but this wasn't right or at least isn't working for me.
Firefox 26.0
R 3.0.2
require(relenium)
library(XML)
library(stringr)
initializing_parcel_number <- "00000000000"
firefox <- firefoxClass$new()
firefox$get("http://www.muni.org/pw/public.html")
inputElement <- firefox$findElementByXPath("/html/body/form[2]/table/tbody/tr[2]/td/table[1]/tbody/tr[3]/td[4]/input[1]")
inputElement$sendKeys(initializing_parcel_number)
inputElement$sendKeys(key = "ENTER")
##xpath to the first link. Or is it?
first_link <- "/html/body/table/tbody/tr[2]/td/table[5]/tbody/tr[2]/td[1]/a"
##How I'm trying to click the thing.
linkElement <- firefox$findElementByXPath("/html/body/table/tbody/tr[2]/td/table[5]/tbody/tr[2]/td[1]/a")
linkElement$click()
You can do this using RSelenium. See http://johndharrison.github.io/RSelenium/. DISCLAIMER: I am the author of the RSelenium package. A basic vignette on operation can be viewed at "RSelenium basics" and "RSelenium: Testing Shiny apps".
If you are unsure of what element is selected you can use the highlightElement utility method in the webElement class see the commented out code.
The element click event won't work in this case. You need to simulate a click using JavaScript:
require(RSelenium)
# RSelenium::startServer # if needed
initializing_parcel_number <- "00000000000"
remDr <- remoteDriver()
remDr$open()
remDr$navigate("http://www.muni.org/pw/public.html")
webElem <- remDr$findElement(using = "name", "PAR1")
# webElem$highlightElement() # to visually check what element is selected
webElem$sendKeysToElement(list(initializing_parcel_number, key = "enter"))
# get first link containing javascript:getParcel
webElem <- remDr$findElement(using = "css selector", '[href*="javascript:getParcel"]')
# webElem$highlightElement() # to visually check what element is selected
# send a webElement as an argument.
remDr$executeScript("arguments[0].click();", list(webElem))
#
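To then read whatever the click produces, the updated DOM can be pulled back into R and parsed, much as in the ESPN answer above; a brief sketch using the XML package already loaded in the question:
# after the click, parse the resulting page for further scraping
page_src <- remDr$getPageSource()[[1]]
doc <- XML::htmlParse(page_src)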
