RSelenium web scraping problems

I'm trying to parse this HTML with R in order to extract some currency exchange rates. The rates only become visible after clicking buttons in the center of the page (sorry, it's in Russian).
So far I've tried both RSelenium and rvest, but neither of them gets me to this CSS selector: "tr:nth-child(2) td".
And if I try this:
library("RSelenium")
startServer()
mybrowser <- remoteDriver(browserName = "chrome")
mybrowser$open()
mybrowser$navigate("https://www.tinkoff.ru/about/documents/exchange/")
dol <- mybrowser$findElement(using = 'partial link text', "USD")
it returns a "NoSuchElement" error.
I've highlighted the place in the HTML where the value I need lives.

txt<- ".documents-exchange-vertical-list__menu:nth-child(2) .documents-exchange-vertical-list__item+ .documents-exchange-vertical-list__item .Currency-Rate-Trigger";
dol <- mybrowser$findElement(using = 'css selector', txt)
dol$clickElement()
# this may or may not work, depending on when the table renders
dol<-mybrowser$findElement(using = 'css selector', "tr:nth-child(2) td:nth-child(1)")$getElementText()
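getElementText() hands back the cell contents as a character string, and on this page the rate is formatted for the Russian locale (decimal comma). A small base-R helper can finish the conversion; this is a sketch, and parse_rate is a made-up name that assumes the cell holds a single rate value:

```r
# Convert a rate string as displayed on the page (decimal comma,
# possible stray currency symbols) into a numeric value.
parse_rate <- function(txt) {
  txt <- gsub(",", ".", txt, fixed = TRUE)   # decimal comma -> decimal point
  as.numeric(gsub("[^0-9.]", "", txt))       # drop anything non-numeric
}
```

So after the click, something like `parse_rate(dol[[1]])` should yield the rate as a number.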

Get current url in RSelenium

I have R code that uses Selenium to get URLs from Google Maps, but it doesn't work:
for (i in 1:dim(Coord_t)[1]) {
  remDr$navigate("https://www.google.ro/maps")  # navigate to the page
  # select the search box
  option <- remDr$findElement(using = 'xpath', "//*[@id='searchboxinput']")
  option$highlightElement()
  option$clickElement()
  option$sendKeysToElement(list(Coord_t$Conc[i]))
  # press the search button
  webElem <- remDr$findElement(using = 'css', "#searchbox-searchbutton")
  webElem$highlightElement()
  webElem$clickElement()
  remDr$setTimeout(type = "page load", milliseconds = 10000)
  Coord_t$Url[i] <- remDr$getCurrentUrl()
}
Coord_t$Conc[i] is simply the name of a restaurant, hotel, etc.
I get the first URL, and after that I only get "https://www.google.ro/maps".
I also tried webElem$getCurrentUrl() and option$getCurrentUrl(), and neither works.
I just can't figure out what is wrong here.
Thank you!
EDIT: I found that the problem is the page load. It seems I have to wait for the page to finish loading before reading the URL. I tried setTimeout, but it doesn't help.
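setTimeout(type = "page load") only caps how long navigation may take; it does not block until the new URL is ready. RSelenium has no built-in explicit wait, so a common workaround is a small polling helper. This is a sketch under that assumption, and wait_for is a made-up name:

```r
# Poll a predicate function until it returns TRUE or the timeout expires.
# Returns TRUE on success, FALSE if the deadline passes first.
wait_for <- function(pred, timeout = 10, interval = 0.5) {
  deadline <- Sys.time() + timeout
  while (Sys.time() < deadline) {
    if (isTRUE(pred())) return(TRUE)
    Sys.sleep(interval)
  }
  FALSE
}
```

Inside the loop, after clicking the search button, something like `wait_for(function() remDr$getCurrentUrl()[[1]] != "https://www.google.ro/maps")` should delay the read until Google Maps has actually navigated.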

RSelenium: clicking on subsequent links in for loop from a Google search

I'm using RSelenium to do some simple Google searches. Setup:
library(tidyverse)
library(RSelenium) # running docker to do this
library(rvest)
library(httr)
remDr <- remoteDriver(port = 4445L, browserName = "chrome")
remDr$open()
remDr$navigate("https://books.google.com/")
books <- remDr$findElement(using = "css", "[name = 'q']")
books$sendKeysToElement(list("NHL teams", key = "enter"))
bookElem <- remDr$findElements(using = "css", "h3.LC20lb")
That's the easy part. Now, there are 10 links on that first page, and I want to click on every link, back out, and then click the next link. What's the most efficient way to do that? I've tried the following:
bookElem$clickElement()
This returns Error: attempt to apply non-function. I expected it to click the first link, but no good. (It works if I take the s off of findElements() for the snippet above, not the lapply below.)
clack <- lapply(bookElem, function(y) {
  y$clickElement()
  y$goBack()
})
Begets an error, kind of like this question:
Error: Summary: StaleElementReference
Detail: An element command failed because the referenced element is no longer attached to the DOM.
Further Details: run errorDetails method
Would it be easier to use rvest, within RSelenium?
I think you could consider grabbing the links and looping through them without going back to the main page.
In order to achieve that, you will have to grab the link elements ("a tag").
bookElems <- remDr$findElements(using = "xpath",
                                "//h3[@class = 'LC20lb']//parent::a")
Then extract the "href" attribute and navigate to it:
links <- sapply(bookElems, function(bookElem) {
  bookElem$getElementAttribute("href")
})
for (link in links) {
  remDr$navigate(link)
  # DO SOMETHING
}
Full code would read:
remDr$open()
remDr$navigate("https://books.google.com/")
books <- remDr$findElement(using = "css", "[name = 'q']")
books$sendKeysToElement(list("NHL teams", key = "enter"))
bookElems <- remDr$findElements(using = "xpath",
                                "//h3[@class = 'LC20lb']//parent::a")
links <- sapply(bookElems, function(bookElem) {
  bookElem$getElementAttribute("href")
})
for (link in links) {
  remDr$navigate(link)
  # DO SOMETHING
}
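Since any one of the pages can fail to load, it may also help to wrap the per-link work in tryCatch so one bad link doesn't abort the whole loop. A minimal sketch; visit_all is a made-up name:

```r
# Apply `visit` to each link, converting errors into NA results
# so the loop survives individual failures.
visit_all <- function(links, visit) {
  lapply(links, function(link) {
    tryCatch(visit(link), error = function(e) NA)
  })
}
```

For example: `visit_all(links, function(link) { remDr$navigate(link); remDr$getPageSource()[[1]] })`.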

Enter text in popup box using Rselenium

I'm trying to pull data from the Glassdoor website using RSelenium, but I'm unable to enter the email id and password in the popup window.
This is my code; I'm not sure where I'm going wrong. When I try to highlight the email box, the sign-in button is highlighted instead.
remDr$navigate("https://www.glassdoor.com")
webElem <- remDr$findElement("class", "sign-in")
webElem$highlightElement()
webElem$clickElement()
email <- webElem$findElement(using = "name", "username")
email$highlightElement()
email$sendKeysToElement(list("EMAIL ID"))  # this line throws an error
The following works with the latest Chrome:
library(RSelenium)
rD <- rsDriver()
remDr <- rD$client
remDr$navigate("https://www.glassdoor.com")
webElem <- remDr$findElement("class", "sign-in")
webElem$highlightElement()
webElem$clickElement()
remDr$setImplicitWaitTimeout()
email <- remDr$findElement(using = "id", "signInUsername")
email$sendKeysToElement(list("EMAIL ID"))
....
rm(rD)
gc()

Web scraping in R: how to scrape the data hidden behind "+ More" buttons?

Suppose I want to get the Amenities information from this page (https://www.airbnb.com/rooms/6676364). It works fine for the visible part only.
But how do I extract the rest, hidden behind the "+ More" button?
I tried the node from the page source with the help of xpathSApply, but it just returns "+more".
Do you know a solution to this problem?
My RSelenium approach:
url <- "https://www.airbnb.com/rooms/12344760"
library('RSelenium')
pJS <- phantom()
library('XML')
shell.exec(paste0("C:\\Users\\Daniil\\Desktop\\R-language,Python\\file.bat"))
Sys.sleep(10)
checkForServer()
startServer()
remDr <- remoteDriver(browserName="chrome", port=4444)
remDr$open(silent=T)
remDr$navigate(url)
var <- remDr$findElement('id', 'details')  # extracting the whole table
vartxt <- var$getElementAttribute("outerHTML")[[1]]
varxml <- htmlParse(vartxt, useInternalNodes = TRUE)
Amenities <- xpathSApply(varxml, "//div[@class = 'expandable-content expandable-content-full']", xmlValue)
This also doesn't work.
After you navigate the RSelenium driver to the target URL, use the following XPath to find the <a> element whose inner text equals '+ More' within the amenities <div>:
remDr$navigate(url)
link <- remDr$findElement(using = 'xpath', "//div[@class='row amenities']//a[.='+ More']")
Then click the link to get the complete list of amenities:
link$clickElement()
Lastly, pass current page HTML source to whatever R function you want to use for further processing :
doc <- htmlParse(remDr$getPageSource()[[1]])
....
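The node text that comes back from xpathSApply typically still carries padding whitespace, empty strings, and the "+ More" trigger label itself. A small base-R cleanup step can tidy the result; this is a sketch, and clean_amenities is a made-up name:

```r
# Trim whitespace and drop empty strings and the "+ More" trigger label
# from a character vector of scraped amenity names.
clean_amenities <- function(x) {
  x <- trimws(x)
  x[nzchar(x) & tolower(x) != "+ more"]
}
```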

Press the next button in Google search results

I'm trying to click through to the next page of Google search results using the following code:
library("RSelenium")
startServer()
checkForServer()
remDr <- remoteDriver()
remDr$open()
remDr$navigate("https://www.google.com/")
webElem <- remDr$findElement(using = "xpath", "//*/input[@id = 'lst-ib']")
webElem$sendKeysToElement(list("R Cran", "\uE007"))
webElem <- remDr$findElement(using = 'css selector', "#pnnext")
click <- webElem$getElementAttribute("href")
remDr$clickElement(click)
However I receive the following error:
Error in envRefInferField(x, what, getClass(class(x)), selfEnv) :
‘clickElement’ is not a valid field or method name for reference class “remoteDriver”
Does clicking the next button in Google search results require different code?
Using inspect I can see that the source code for the button is:
<a id="pnnext" class="pn" style="text-align:left" href="/search?q=R+Cran&biw=1366&bih=657&ei=szhxVv_NMaHMygPW4pLQDg&start=10&sa=N">
Finally what was worked for me:
library("RSelenium")
startServer()
checkForServer()
remDr <- remoteDriver()
remDr$open()
remDr$navigate("https://www.google.com/")
webElem <- remDr$findElement(using = "xpath", "//*/input[@id = 'lst-ib']")
Sys.sleep(5)
webElem$sendKeysToElement(list("R Cran", "\uE007"))
Sys.sleep(5)
link <- remDr$executeScript("return document.getElementById('pnnext').href;")
remDr$navigate(link[[1]])
You are trying to "click" an attribute/string, which does not work the way you are attempting it.
On this line you grab the link as a string (which is not a WebElement as far as Selenium is concerned):
click <- webElem$getElementAttribute("href")
and then on the next line you try to click that string via a method that actually needs a WebElement:
remDr$clickElement(click)
So here is what you can try:
1) Click your last WebElement directly (skipping getElementAttribute):
webElem$clickElement()
or
2) Navigate to the link you just got through getElementAttribute:
click <- webElem$getElementAttribute("href")
# change your last line to this
remDr$navigate(click)
I'm not sure which client you are using, but it might be that you need to wait until the AJAX request finishes, e.g. until the #pnnext element becomes visible (visibilityOfElementLocated).
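An alternative to clicking at all: the href shown earlier suggests Google paginates via a start query parameter in steps of 10, so the next page's URL can be built directly and passed to remDr$navigate(). This is a sketch under that assumption; next_start_url is a made-up name:

```r
# Build the URL of the next results page by advancing the `start` parameter.
# Assumes Google-style pagination with `step` results per page.
next_start_url <- function(url, step = 10) {
  if (grepl("start=[0-9]+", url)) {
    cur <- as.integer(sub(".*start=([0-9]+).*", "\\1", url))
    sub("start=[0-9]+", paste0("start=", cur + step), url)
  } else {
    # no start parameter yet: this is the first page, so ask for results from `step` on
    sep <- if (grepl("\\?", url)) "&" else "?"
    paste0(url, sep, "start=", step)
  }
}
```

For example, `remDr$navigate(next_start_url(remDr$getCurrentUrl()[[1]]))` would move to the next results page without locating #pnnext.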
