R: Getting links from Google search results beyond the first page

I have this RSelenium setup (using selenium really shouldn't impact the answer to this question):
library(tidyverse)
library(rvest)
library(httr)
library(RSelenium) # running through docker
## RSelenium setup
remDr <- remoteDriver(port = 4445L, browserName = "chrome")
remDr$open()
## Navigate to Google Books
remDr$navigate("https://books.google.com/")
books <- remDr$findElement(using = "css", "[name = 'q']")
## Search for whatever, the Civil War, for example
books$sendKeysToElement(list("the civil war", key = "enter"))
## Getting Google web elements (10 per page)
bookElem <- remDr$findElements(using = "xpath", "//h3[@class = 'LC20lb']//parent::a")
## Click on each book link
links <- sapply(bookElem, function(bookElem){
bookElem$getElementAttribute("href")
})
This works great and compiles all of the links from the first page of results (Google limits each page to 10 results, so ten links). What I would like is to have that same links vector compile every result link from, say, the first 12 pages (to keep it manageable). So:
goog_pgs <- seq(1:12) # to set the limit
Where I'm lost: how do I feed that into my links vector? The page URLs differ too much to simply append the page number to the end. I've tried inserting the following:
nextButton <- remDr$findElements("xpath", "//*[@id = 'pnnext']")
next_page <- sapply(nextButton, function(nextButton) {
next_elements$clickElement()
})
And that does not work. What's the solution here?

You can use the sequence 1:12 as something to iterate over with a for loop, lapply, or another looping mechanism. I have a terrible time with the apply functions, so I swapped in purrr::map. The steps that need to be repeated are finding the books, getting the href of each book, and clicking the "next" button. With some modification, you can use:
books_12 <- map(1:12, function(pg) {
  # collect the result links on the current page
  bookElem <- remDr$findElements(using = "xpath", "//h3[@class = 'LC20lb']//parent::a")
  links <- map_chr(bookElem, ~.$getElementAttribute("href")[[1]])
  # advance to the next page of results
  nextButton <- remDr$findElement("xpath", "//*[@id='pnnext']")
  nextButton$clickElement()
  links
})
Note that getElementAttribute returns a list; since each element only has one href, I kept the first (only) one with [[1]]. This yields a list of 12 vectors of 10 URLs each.
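If you want a single flat vector of URLs instead, the 12 character vectors can be combined afterwards (a small follow-up sketch, not part of the original answer; all_links is just an illustrative name):
all_links <- unlist(books_12)  # one character vector of roughly 120 URLs
Note that the "next" click also runs on the final iteration; if a search has fewer result pages than requested, wrapping that click in tryCatch() would avoid an error.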

Related

Interact with all elements in a list in R

I'm trying to interact with all elements of a list in R.
Specifically, I'm trying to click on every element in the list with RSelenium's clickElement() function.
I'm scraping data from this webpage:
https://play.google.com/store/apps/details?id=hr.mireo.arthur&hl=en&fbclid=IwAR3c-PkUXOea8KrKLp9Q3JUjCidGmgO2jYX_Qb7O8VuWlHXPIS5nDOfKRKI&showAllReviews=true
Here is my code so far:
url <- 'https://play.google.com/store/apps/details?id=hr.mireo.arthur&hl=en&fbclid=IwAR3c-PkUXOea8KrKLp9Q3JUjCidGmgO2jYX_Qb7O8VuWlHXPIS5nDOfKRKI&showAllReviews=true'
#Open webpage using RSelenium
rD <- rsDriver(port = 4445L, browser=c("chrome"), chromever="80.0.3987.106")
remDr <- rD[["client"]]
remDr$navigate(url)
#-----------------------------------------Load whole page by scrolling and showing more
xp_show_more <- "//*[@id='fcxH9b']/div[4]/c-wiz/div/div[2]/div/div[1]/div/div/div[1]/div[2]/div[2]/div"
replicate(5,
{
replicate(5,
{
# scroll down
webElem <- remDr$findElement("css", "body")
webElem$sendKeysToElement(list(key = "end"))
# wait
Sys.sleep(1)
})
# find button
morereviews <- remDr$findElement(using = 'xpath', xp_show_more)
# click button
tryCatch(morereviews$clickElement(),error=function(e){print(e)}) # trycatch to prevent any error from stopping the loop
# wait
Sys.sleep(3)
})
This code will load the whole page and show all comments, but some comments are long and have "Full review" buttons which have to be clicked in order to show the full length of the comment. I have managed to find all of those buttons (there are 36 of them) using the findElements function with the following code:
fullreviews <- remDr$findElements(using = 'css selector', value = ".OzU4dc")
This code results in a list of 36 elements and when I want to click on them using this code:
fullreviews$clickElement()
I get this error:
Error: attempt to apply non-function
I can click on all 36 elements using these 36 lines of code:
fullreviews[[1]]$clickElement()
fullreviews[[2]]$clickElement()
fullreviews[[3]]$clickElement()
...and so on up to 36.
But I would like to make this possible with one function, in case I have to scrape some bigger webpage and have thousands of elements to click.
I have tried this code, but it doesn't work
fullreviews[[1:36]]$clickElement()
I guess some sort of lapply function is needed, but I can't manage to create one that works.
Is there a way to do this in a single loop or function?
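For what it's worth, a minimal sketch (not from the original thread) of looping over the elements found above with lapply; the tryCatch and the short pause are precautionary assumptions rather than requirements:
invisible(lapply(fullreviews, function(el) {
  # click each "Full review" button; skip any element that fails to click
  tryCatch(el$clickElement(), error = function(e) NULL)
  Sys.sleep(0.2)  # brief pause so the page can react
}))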

R webscraping a slow/overburdened (?) website

I am webscraping a website to collect data for research purposes using RSelenium, docker and rvest.
I've built a script that automatically 'clicks' through the pages whose content I want to download. My problem is that when I run this script, the results change: the number of observations of the variable I'm interested in changes. It concerns about 50,000 observations, and when I run the script several times, the total number of observations differs by a few hundred.
I'm thinking it has something to do with the internet connection being too slow, or with the website not being able to load quickly enough... Or something... When I change Sys.sleep(2) the results change too, but with no clear pattern as to whether higher values make it better or worse.
In the R terminal I run:
docker run -d -p 4445:4444 selenium/standalone-chrome
Then my code looks something like this:
remDr <- RSelenium::remoteDriver(remoteServerAddr = "localhost",
port = 4445L,
browserName = "chrome")
remDr$open()
remDr$navigate("url of website")
pages <- 100 # for example, I want information from the first hundred pages
variable <- vector("list", pages)
i <- 1
while (i <= pages) {
variable[[i]] <- remDr$getPageSource()[[1]] %>%
read_html(encoding = "UTF-8") %>%
html_nodes("node that indicates the information I want") %>% # select the information I want
html_text()
element_next_page <- remDr$findElement(using = 'css selector', "node that indicates the 'next page' button") # select button with which I can go to the next page
element_next_page$sendKeysToElement(list(key="enter")) # go to the next page
Sys.sleep(2) # I believe this is done to not overload the website I'm scraping
i <- i + 1
}
variable <- unlist(variable)
Somehow, running this multiple times keeps returning different results in terms of the number of observations that remain when I unlist variable.
Does anyone have similar experiences, or tips on what to do?
Thanks.
You could consider including the following code before extracting the text:
for(i in 1 : 100)
{
print(i)
remDr$executeScript(paste0("scroll(0, ", i * 2000, ")"))
}
This code forces the browser to scroll through almost the entire web page, which can help load sections that have not loaded yet. This approach is used in the following post: How to webscrape texts that are contained into sublinks of a link in R?
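If the counts still fluctuate, another option (a sketch under my own assumptions, not part of the original answer) is to keep scrolling until the page height stops changing before extracting, instead of relying on a fixed number of scrolls:
old_height <- 0
repeat {
  remDr$executeScript("window.scrollTo(0, document.body.scrollHeight);")
  Sys.sleep(2)  # give the page time to load new content
  new_height <- remDr$executeScript("return document.body.scrollHeight;")[[1]]
  if (new_height == old_height) break  # nothing new loaded, stop scrolling
  old_height <- new_height
}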

Scraping with RSelenium and understanding html tags

I've been learning how to scrape the web using RSelenium by trying to gather sports data. I have a difficult time understanding how to get certain elements based on tags. In the following example, I get the player names that I want, but I only get the top 28. I don't understand why, since when I inspect lower elements, they have similar xpaths. Example:
library(rvest)
library(RCurl)
library(RSelenium)
library(XML)
library(dplyr)
# URL
rotoURL = as.character("https://rotogrinders.com/lineuphq/nba?site=draftkings")
# Start remote driver
remDrall <- rsDriver(browser = "chrome", verbose = F)
remDr <- remDrall$client
remDr$open(silent = TRUE)
Sys.sleep(1)
# Go to URL
remDr$navigate(rotoURL)
Sys.sleep(3)
# Get player names and clean
plyrNms <- remDr$findElement(using = "xpath", "//*[@id='primary-pane']/div/div[3]/div/div/div/div")
plyrNmsText <- plyrNms$getElementAttribute("outerHTML")[[1]]
plyrNmsClean <- htmlTreeParse(plyrNmsText, useInternalNodes=T)
plyrNmsCleaner <- trimws(unlist(xpathApply(plyrNmsClean, '//a', xmlValue)))
plyrNmsCleaner <- plyrNmsCleaner[!plyrNmsCleaner=='']
If you run this, you'll see that the list stops at Ben McLemore, even though there are 50+ names below. I tried this code yesterday as well and it still limited me to 28 names, which tells me the 28 isn't arbitrary.
What part of my code is preventing me from grabbing all the names? I'm assuming it has to do with findElement, but I've tried a hundred different XPaths with the InspectorGadget selector tool and nothing seems to work. Any help would be much appreciated!
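One thing worth trying (a sketch under my own assumptions, not a verified fix for this page) is to parse the full rendered page source with rvest instead of the outerHTML of a single container element, which can miss rows that have not been rendered into that container yet; the CSS selector below is only an illustrative guess:
src <- remDr$getPageSource()[[1]]
allNms <- read_html(src) %>%
  html_nodes("#primary-pane a") %>%  # selector is an assumption, adjust from the page
  html_text(trim = TRUE)
allNms <- allNms[allNms != ""]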

NoSuchElementException scraping ESPN with RSelenium

I'm using R (and RSelenium) to scrape data from ESPN. It's not the first time I've used it, but in this case I'm getting an error and I can't sort it out.
Consider this page: http://en.espn.co.uk/premiership-2011-12/rugby/match/142562.html
Let's try to scrape the timeline. If I inspect the page I get the css selector
#liveLeft
As usual, I go with
checkForServer()
remDr <- remoteDriver()
remDr$open()
matchId <- "142562"
leagueString <- "premiership"
seasonString <- "2011-12"
url <- paste0("http://en.espn.co.uk/",leagueString,"-",seasonString,"/rugby/match/",matchId,".html")
remDr$navigate(url)
and the page correctly loads. So far so good. Now when I try to get the nodes with
div<- remDr$findElement(using = 'css selector','#liveLeft')
I get back
Error: Summary: NoSuchElement
Detail: An element could not be located on the page using the given search parameters.
I'm puzzled. I also tried with XPath and it doesn't work. I also tried to get different elements of the page, with no luck. The only selector that gives something back is
#scrumContent
From the comments.
The element resides in an iframe, and as such it isn't available to select. This can be shown by running document.getElementById('liveLeft') in the Chrome JavaScript console: on the full page it returns null, i.e. the element doesn't exist there, even though it is clearly visible. To get around this, simply load the iframe instead.
If you inspect the page you will see the src of the iframe is /premiership-2011-12/rugby/current/match/142562.html?view=scorecard, from the example provided. Navigating to this page instead of the 'full' page will make the element 'visible' and as such selectable by RSelenium.
checkForServer()
remDr <- remoteDriver()
remDr$open()
matchId <- "142562"
leagueString <- "premiership"
seasonString <- "2011-12"
url <- paste0("http://en.espn.co.uk/",leagueString,"-",seasonString,"/rugby/current/match/",matchId,".html?view=scorecard")
# Amend url to return iframe
remDr$navigate(url)
div<- remDr$findElement(using = 'css selector','#liveLeft')
UPDATE
If it would be more applicable to load the iframe contents into a variable and then traverse through that, the following example shows how.
document.getElementById('liveLeft') # Will return null as iframe has seperate DOM
var doc = document.getElementById('win_old').contentDocument # Loads iframe DOM elements in the variable doc
doc.getElementById('liveLeft') # Will now return the desired element.
Generally with Selenium when you have a webpage with frames/iframes you need to use the switchToFrame method of the remoteDriver class:
library(RSelenium)
library(XML) # for htmlParse and readHTMLTable
selServ <- startServer()
remDr <- remoteDriver()
remDr$open()
matchId <- "142562"
leagueString <- "premiership"
seasonString <- "2011-12"
url <- paste0("http://en.espn.co.uk/",leagueString,"-",seasonString,"/rugby/match/",matchId,".html")
remDr$navigate(url)
# check the iframes
iframes <- htmlParse(remDr$getPageSource()[[1]])["//iframe", fun = function(x){xmlGetAttr(x, "id")}]
# iframes[[3]] == "win_old" contains the data switch to this frame
remDr$switchToFrame(iframes[[3]])
# check you can access the element
div<- remDr$findElement(using = 'css selector','#liveLeft')
div$highlightElement()
# get data
ifSource <- htmlParse(remDr$getPageSource()[[1]])
out <- readHTMLTable(ifSource["//div[@id = 'liveLeft']"][[1]], header = TRUE)
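If you need to select elements outside the iframe again afterwards, switching back to the top-level document should work (a short follow-up note, not part of the original answer):
remDr$switchToFrame(NULL)  # return focus to the page's default content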

RSelenium - Extracting data from tables (and non-tables)

I have had my first go at using RSelenium today to scrape data from websites. I can navigate to the data I require via the tabs and drop-down menus (the hard bit?) but am now stuck at the point of extracting the actual data I need (the easy bit!)
My code so far is:
library(RSelenium)
checkForServer()
startServer()
remDr <- remoteDriver$new()
remDr$open()
remDr$navigate("https://www.whoscored.com/Teams/31")
webElem1 <- remDr$findElement(value = '//a[@href = "#team-squad-stats-detailed"]')
webElem1$clickElement()
webElem2 <- remDr$findElement("id", "category")
webElem2$clickElement()
webElem2$sendKeysToElement(list(key="down_arrow", key="down_arrow", key="down_arrow",
key="down_arrow", key="down_arrow", key="enter"))
webElem3 <- remDr$findElement("id", "subcategory")
webElem3$clickElement()
webElem3$sendKeysToElement(list(key="down_arrow", key="enter"))
webElem4 <- remDr$findElement("id", "statsAccumulationType")
webElem4$clickElement()
webElem4$sendKeysToElement(list(key="down_arrow", key="down_arrow", key="down_arrow",
key="enter"))
webElem5 <- remDr$findElement("id", "player-table-statistics-body")
Can someone advise the simplest way to now extract the data in this player table into CSV form please? I am used to using the XML package and readHTMLTable to scrape other (static) websites, but I am stuck on how to combine this with my RSelenium steps above.
Thank you
EDIT - having come back to this with fresh eyes the answer I have found is below:
webElem5 <- remDr$findElement(using = "id", value = "statistics-table-detailed")
webElem5txt <- webElem5$getElementAttribute("outerHTML")[[1]]
table <- readHTMLTable(webElem5txt, header=TRUE, as.data.frame=TRUE)[[1]]
This allows me to proceed with what I need on this part of the website.
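To finish the CSV step asked about in the question, the parsed data frame can simply be written out (a minimal follow-up, assuming the table object created above; the file name is illustrative):
write.csv(table, "player-stats.csv", row.names = FALSE)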
If I may, I would like to ask for help with another part of the same site. I navigate to the data I need as follows:
remDr$navigate("https://www.whoscored.com/Matches/959894")
webElem1 <- remDr$findElement(using = "link text", value = "Match Centre")
webElem1$clickElement()
webElem2 <- remDr$findElement(value = '//a[@href = "#chalkboard"]')
webElem2$clickElement()
The data I would like to extract is in these boxes, but since the HTML doesn't present them as tables, I don't really know how to proceed.
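For non-table content like this, one generic pattern (a sketch, not an answer from the thread; the CSS selector is a placeholder you would fill in from the page) is to find the elements and read their text directly:
boxes <- remDr$findElements(using = "css selector", "css selector for the boxes")
boxText <- sapply(boxes, function(el) el$getElementText()[[1]])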
