Scrape website with load more button with R (rvest) - r

I'm trying to scrape a website with a load more button. I've set up a Selenium server from a Windows command prompt. The Selenium server is running, but I keep getting the following error when I run the script in R. I've read a lot of blog posts and tried to find the answer, but I lack the technical knowledge to figure this out, so I hope someone is willing to help me.
Error
[1] "Connecting to remote server"
Selenium message:The path to the driver executable must be set by the
webdriver.gecko.driver system property; for more information, see
https://github.com/mozilla/geckodriver. The latest version can be
downloaded from https://github.com/mozilla/geckodriver/releases
Error: Summary: UnknownError Detail: An unknown server-side error
occurred while processing the command. class:
java.lang.IllegalStateException Further Details: run errorDetails
method
Windows prompt
cd c:\selenium
java -Dwebdriver.chrome.driver=c:\geckodriver\chromedriver.exe -Dwebdriver.gecko.driver.driver=c:\geckodriver\geckodriver.exe -jar selenium-server-standalone-3.4.0.jar
R SCRIPT
library(rvest)
library(RSelenium)
library(stringr)
library(xml2)
library(wdman)
url <- "https://www.social-enterprise.nl/wie-doen-het/"
remDr <- remoteDriver()
# Open the browser webpage
remDr$open()
#navigate to your page
remDr$navigate(url)
# Locate the load more button
loadmorebutton <- remDr$findElement(using = 'css selector', "#morenews")
for (i in 1:2){
print(i)
loadmorebutton$clickElement()
Sys.sleep(30)
}
page_source<-remDr$getPageSource()
Merken <- read_html(page_source[[1]]) %>%
html_nodes("#membershipCntr span") %>%
html_text()
remDr$close()

You are missing some options when instantiating the remote web driver. Try the following code:
remDr <- remoteDriver(remoteServerAddr = "localhost"
, port = 4444L
, browserName = "firefox"
)
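Note also that the error message names the system property webdriver.gecko.driver, while the server command in the question sets webdriver.gecko.driver.driver, so the server probably never receives the geckodriver path. Below is a minimal sketch of assembling the launch arguments with the property names spelled as the error message expects. The helper name selenium_args is hypothetical; the paths and jar name are taken from the question.

```r
# Hypothetical helper: build the argument vector for launching the standalone
# Selenium server with the driver system properties named correctly.
selenium_args <- function(jar, gecko = NULL, chrome = NULL) {
  args <- character(0)
  if (!is.null(chrome)) {
    args <- c(args, paste0("-Dwebdriver.chrome.driver=", chrome))
  }
  if (!is.null(gecko)) {
    # The property is "webdriver.gecko.driver", with no extra ".driver" suffix
    args <- c(args, paste0("-Dwebdriver.gecko.driver=", gecko))
  }
  c(args, "-jar", jar)
}

args <- selenium_args(
  jar    = "selenium-server-standalone-3.4.0.jar",
  gecko  = "c:\\geckodriver\\geckodriver.exe",
  chrome = "c:\\geckodriver\\chromedriver.exe"
)

# Print the full command so it can be pasted into the Windows prompt
cat("java", args, "\n")

# To launch from R instead (with c:\selenium as the working directory):
# system2("java", args, wait = FALSE)
```

The printed command can be run from c:\selenium exactly as in the question; only the gecko property name differs.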

Related

RSelenium Error: Can't Connect to Host; Selenium Server is not running

I am getting the following error: "Error in checkError(res) :
Couldnt connect to host on http://localhost:4444/wd/hub.
Please ensure a Selenium server is running."
I'm using a Mac running version 10.9.5, and I've downloaded the latest versions of all packages and Java. My code is:
library(rvest)
library(RSelenium)
library(wdman)
setwd("path/to/selenium/standalone/file")  # placeholder for the actual directory
pJS <- phantomjs(pjs_cmd = "/phantomjs-2.1.1-macosx/bin/phantomjs")
remDr <- remoteDriver(browserName = "phantomjs")
Sys.sleep(5)
remDr$open(silent = FALSE)
And then I get the error mentioned above. I've tried running "java -jar selenium-server-standalone.jar" in the terminal (after using the cd command to navigate to the correct directory). I've tried changing the port in the remoteDriver() function (to 4444 and 5556). I've tried various Sys.sleep() times (up to 20 seconds). When I googled this error, most of the fixes were for Firefox or Windows, and not applicable to PhantomJS.
What else can I try?
The RSelenium::phantom function is deprecated. It had a pjs_cmd argument, which I think is what you refer to above. You can use the rsDriver function from the RSelenium package or the phantomjs function from the wdman package:
library(RSelenium)
rD <- rsDriver(browser = "phantomjs")
remDr <- rD[["client"]]
# no need for remDr$open(); a PhantomJS browser is already initialised
remDr$navigate("http://www.google.com/ncr")
....
....
# clean up
rm(rD)
gc()
Alternatively, using the wdman package:
library(RSelenium)
library(wdman)
pDrv <- phantomjs(port = 4567L)
remDr <- remoteDriver(browserName = "phantomjs", port = 4567L)
remDr$open()
remDr$navigate("http://www.google.com/ncr")
...
...
# clean up
remDr$close()
pDrv$stop()
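Rather than guessing a fixed Sys.sleep() time as the question does, one option is to retry the connection a few times until the driver is ready. This is a minimal sketch under that assumption; retry_connect is a hypothetical helper, not part of RSelenium or wdman.

```r
# Hypothetical helper: call `connect` until it succeeds or `tries` runs out,
# pausing `wait` seconds between attempts. Returns the attempt that succeeded.
retry_connect <- function(connect, tries = 5, wait = 2) {
  for (attempt in seq_len(tries)) {
    ok <- tryCatch({
      connect()
      TRUE
    }, error = function(e) {
      message("Attempt ", attempt, " failed: ", conditionMessage(e))
      FALSE
    })
    if (ok) {
      return(invisible(attempt))
    }
    if (attempt < tries) {
      Sys.sleep(wait)
    }
  }
  stop("Could not connect after ", tries, " attempts")
}

# Usage with the wdman example above (needs RSelenium and wdman installed):
# pDrv  <- wdman::phantomjs(port = 4567L)
# remDr <- RSelenium::remoteDriver(browserName = "phantomjs", port = 4567L)
# retry_connect(function() remDr$open(silent = TRUE))
```

The helper only wraps whatever function it is given, so the same loop works for either the rsDriver or the wdman route.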

RSelenium through docker

My OS is Windows 8.1 and I have R version 3.3.3.
I have installed the RSelenium package and I try to run it like this:
library("RSelenium")
#start RSelenium server
startServer()
checkForServer()
and I receive this error:
Error: checkForServer is now defunct. Users in future can find the function in
file.path(find.package("RSelenium"), "examples/serverUtils"). The
recommended way to run a selenium server is via Docker. Alternatively
see the RSelenium::rsDriver function.
Has something changed in the way RSelenium starts? I searched for the error and found only this, but it doesn't help me. What can I do?
As an alternative, I also tried downloading chromedriver from 'https://sites.google.com/a/chromium.org/chromedriver/downloads'
and using this script:
require(RSelenium)
cprof <- getChromeProfile("C:/Users/Peri/Desktop/chromedriver/chromedriver.exe", "Profile 1")
remDr <- remoteDriver(remoteServerAddr = "localhost"
, port = 4444
, browserName = "chrome", extraCapabilities = cprof
)
remDr$open()
and I receive this error:
Error in checkError(res) :
Couldnt connect to host on http://localhost:4444/wd/hub.
Please ensure a Selenium server is running.
What can I do to run Chrome instead of the default browser, Firefox?
You need to use the rsDriver function. The error message recommends Docker (which I would also recommend), but if you are not familiar with it you can go this way.
rsDriver manages the binaries needed for running a Selenium server. It is a wrapper around the wdman::selenium function.
Here is what you have to do to start a Chrome browser:
driver <- rsDriver()
remDr <- driver[["client"]]
And then you can work with it:
remDr$navigate("http://www.google.de")
remDr$navigate("http://www.spiegel.de")
And stop it:
remDr$close()
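The call above relies on rsDriver's default browser. A minimal sketch that names Chrome explicitly and also stops the server, since remDr$close() closes the browser session but, as far as I know, leaves the Selenium server process running. The function scrape_title is a hypothetical wrapper; it needs Java and a local Chrome installation, so the call itself is left commented out.

```r
# Hypothetical wrapper: start Chrome via rsDriver, read a page title,
# and shut down both the browser and the Selenium server on exit.
scrape_title <- function(url, port = 4567L) {
  driver <- RSelenium::rsDriver(browser = "chrome", port = port, verbose = FALSE)
  remDr  <- driver[["client"]]
  on.exit({
    remDr$close()              # close the browser session
    driver[["server"]]$stop()  # stop the Selenium server process
  }, add = TRUE)

  remDr$navigate(url)
  remDr$getTitle()[[1]]
}

# title <- scrape_title("http://www.google.de")
```

Registering the cleanup with on.exit means the server is stopped even if navigation fails partway through.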

RSelenium, Can't start server

I'm trying to use RSelenium for web-scraping purposes behind a login and I can't get the server to run.
Current result:
library(RSelenium)
startServer()
remDr <- remoteDriver(port = 4444,
browserName = "firefox")
remDr$open()
# [1] "Connecting to remote server"
Error: Summary: UnknownError
Detail: An unknown server-side error occurred while processing the command.
class: org.openqa.selenium.firefox.NotConnectedException
I've tried running the server myself by downloading and trying to open it (nothing happens).
This was a tough one and held me up for a couple of days while I searched for a fix. In the end I uninstalled Firefox and installed version 37.0, and also disabled the update service. That fixed it for me and RSelenium works fine again.
Run the following code first; then it should work:
RSelenium::checkForServer()
This line of code downloads the Selenium server, which you need for running RSelenium commands. (In newer RSelenium versions checkForServer is defunct.)
Try the following:
rD <- rsDriver(port=4444L,browser="firefox")
mybrowser <- remoteDriver(browserName = "firefox")
mybrowser$open()
RSelenium has trouble establishing the server on the given port at the beginning, so rsDriver starts it first. We then tell remoteDriver which browser should be used.

Rselenium not working

I'm trying to install RSelenium and I get this:
[1] "Connecting to remote server"
Error: Summary: UnknownError
Detail: An unknown server-side error occurred while processing the command.
class: java.net.ConnectException
The code I have tried is:
install.packages("RSelenium")
library(RSelenium)
startServer()
checkForServer()
mybrowser <- remoteDriver(browserName = "chrome")
mybrowser$open()
mybrowser$navigate("http://www.weather.gov")
I sometimes run into a connection problem too. Here is what I do:
library("RSelenium")
startServer()
checkForServer()
remDr <- remoteDriver()
remDr$open()
remDr$navigate("http://www.weather.gov")
Sys.sleep(5)
I leave remoteDriver() at its default, which is Firefox. However, sometimes when I run the code I get a connection problem. As a workaround, I open Firefox myself and then run the code, and it runs successfully.
I've had problems with this in the past (see here). It has to do with starting the server: if you manually launch the server file (something like selenium-server-standalone-2.48.2.jar) and then run your code again, it should work.

RSelenium is not working

I'm trying to install the RSelenium package and run a simple example using this:
install.packages("RSelenium")
library("RSelenium")
startServer()
checkForServer()
startServer()
remDr <- remoteDriver(browserName = "Chrome")
remDr$open()
On the last line I receive this:
[1] "Connecting to remote server"
Error: Summary: UnknownError
Detail: An unknown server-side error occurred while processing the command.
class: org.openqa.selenium.WebDriverException
I tried some workarounds I found on Google, but nothing worked. What can I do?
From comments:
Click start
Select Control Panel > System
Select Advanced system settings
Click Environment Variables...
Under System Variables
Scroll to Path and double click
At the end of Variable value: add ;C:\path\to\directory that holds the chromedriver.exe file. Note the ; that separates the paths
Restart your R session and you should now be able to run:
require(RSelenium)
RSelenium::startServer()
remDr <- remoteDriver(browserName = "chrome")
remDr$open()
EDIT
For RSelenium to operate with Chrome, you first need to download chromedriver.exe, which is available from https://sites.google.com/a/chromium.org/chromedriver/downloads. Once downloaded, unzip the folder and place chromedriver.exe wherever you would like to store it.
The directory where you store chromedriver.exe, and which you add to your system PATH, can be anywhere you choose. As stated in the comments, mine currently resides in C:\Python27\Scripts, for example.
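After editing PATH and restarting R, it can help to confirm that R actually sees the driver before starting the server. A minimal sketch using base R's Sys.which; driver_on_path is a hypothetical helper name.

```r
# Hypothetical helper: TRUE if an executable with this name is found on PATH.
driver_on_path <- function(exe = "chromedriver") {
  path  <- Sys.which(exe)   # returns "" when the executable is not found
  found <- nzchar(path)
  if (found) {
    message(exe, " found at ", path)
  } else {
    message(exe, " not found on PATH; check the PATH entry and restart R")
  }
  unname(found)
}

driver_on_path("chromedriver")
```

If this returns FALSE, the PATH entry is either misspelled or the R session was started before the change took effect.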
