RSelenium - downloading a file with phantom.js

Using RSelenium, I can download a file from a webpage through a Firefox browser with the following code:
csv = remDr$findElement(using = 'css selector', "a[ng-click*=download]")
remDr$executeScript("arguments[0].click();", list(csv))
When I try to replicate the process with the phantomjs browser, nothing happens. Guessing that perhaps no download directory is set, I've tried:
remDr$extraCapabilities = makeFirefoxProfile(list(browser.download.dir = "/download/path"))
Still nothing happens. Grateful for an idea what needs to happen to get this to work.
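For reference, when a download directory does need to be configured for Firefox, the profile is normally passed in when the driver is created rather than assigned to remDr afterwards. A minimal sketch, assuming the file is served as text/csv (the MIME type in browser.helperApps.neverAsk.saveToDisk would need to match the actual download):
library(RSelenium)
# profile that saves CSV files to a fixed directory without a prompt
fprof <- makeFirefoxProfile(list(
  browser.download.folderList = 2L,                    # 2 = use a custom download directory
  browser.download.dir = "/download/path",
  browser.helperApps.neverAsk.saveToDisk = "text/csv"  # assumed MIME type of the file
))
remDr <- remoteDriver(browserName = "firefox", extraCapabilities = fprof)
remDr$open()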
Edit.
I should add that the following error is reported during setup, which may or may not be relevant, although it doesn't appear to prevent connecting to the page or selecting elements:
> pJS = phantom()
[ERROR - 2016-03-17T17:54:08.914Z] GhostDriver - main.fail - {"line":85,"sourceURL":"phantomjs://code/main.js","stack":"global code#phantomjs://code/main.js:85:56"}
phantomjs://platform/console++.js:263 in error

phantomjs://platform/console++.js:263 in error
This error commonly happens when you run the Selenium server and phantomjs on the same port.
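If that is what is happening, one way to avoid the clash is to start GhostDriver on its own port and point the remote driver there. A minimal sketch (port 4445 is just an example, chosen to stay clear of a Selenium server on 4444):
library(RSelenium)
pJS <- phantom(port = 4445L)   # start GhostDriver away from the Selenium server's port
Sys.sleep(5)                   # give the process a moment to come up
remDr <- remoteDriver(browserName = "phantomjs", port = 4445L)
remDr$open()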

Hmm, it seems phantom.js doesn't support file downloads.
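If so, a common workaround is to let RSelenium locate the link but do the actual download from R. A rough sketch, assuming the anchor exposes the file URL in its href (rather than building it purely in JavaScript) and that no session cookies are required:
csvElem <- remDr$findElement(using = 'css selector', "a[ng-click*=download]")
csvUrl  <- unlist(csvElem$getElementAttribute("href"))   # URL behind the link
download.file(csvUrl, destfile = "/download/path/data.csv")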

Related

Problems with RSelenium and ChromeDriver - "Could not open chrome browser"

I have been using RSelenium for years and have never had this issue. I recently updated Google Chrome to the latest available version, 110.0.5481.78. I am now getting the following error when I use rsDriver:
require(RSelenium)
rD <- rsDriver(browser = "chrome",port = 9537L, chromever = "110.0.5481.77")
"> Could not open chrome browser.
> Client error message:
> Undefined error in httr call. httr output: Failed to connect to localhost port 9537: Connection refused
> Check server log for further details.
> Warning message:
> In rsDriver(browser = "chrome", port = 9537L, chromever = "110.0.5481.77") :
> Could not determine server status.
I have tried different versions of chromever from binman::list_versions("chromedriver"), as well as leaving the rsDriver arguments blank altogether. In the past, when Chrome has updated, it has been a very simple change to chromever and everything worked perfectly. I'm not sure what, if anything, has changed with this latest update.
Thanks in advance.
I just fixed this same problem by removing the file LICENSE.chromedriver, as per this thread: https://github.com/ropensci/RSelenium/issues/264
Use
wdman::selenium(retcommand=T)
to find the file location of binman_chromedriver files.
Navigate to this file location, go to the driver version you're using, and delete the LICENSE.chromedriver file. Mine worked immediately after this action, but note that I also tried downgrading the wdman version to 0.2.5 (I was on 0.2.6) first:
remotes::install_version('wdman', version = '0.2.5')
I'm not sure if it was both actions that fixed it or just the file delete!
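For anyone who wants to do the cleanup from R rather than in a file manager, a sketch along these lines should work, assuming binman::app_dir() reports the chromedriver cache location (otherwise use the path printed by wdman::selenium(retcommand = TRUE)):
# locate the chromedriver cache managed by binman (path varies by OS)
drv_dir <- binman::app_dir("chromedriver")
# find and delete any LICENSE.chromedriver files under it
lic <- list.files(drv_dir, pattern = "LICENSE\\.chromedriver",
                  recursive = TRUE, full.names = TRUE)
file.remove(lic)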

RSelenium: Can't see Browser as I run my Code

macOS Sierra 10.12.4. Chrome 63 (most recent). RStudio 1.1.383.
I'm using RSelenium to scrape web data. I'm able to pull data using the remote driver, but the actual browser window doesn't pop up for me to view. This makes it difficult to debug some of my trickier web pulls. This is an example video of what I want to happen: the user can visually see the changes he's making in the browser. The goal of this post is to find out why I cannot see the browser as I run the code.
Here's an example of my process to pull from RSelenium.
From the Terminal:
(name)$ docker run -d -p 4567:4444 selenium/standalone-chrome
(name)$ docker ps
output:
CONTAINER ID   IMAGE                        COMMAND                  CREATED         STATUS         PORTS                    NAMES
8de3a1cbc777   selenium/standalone-chrome   "/opt/bin/entry_po..."   5 minutes ago   Up 5 minutes   0.0.0.0:4567->4444/tcp   wizardly_einstein
In R
library(RSelenium)
library(magrittr)
library(stringr)
library(stringi)
library(XML)
remDr <- rsDriver(port = 4567L, browser = "chrome")
remDr$client$open()
remDr$client$navigate("https://shiny.rstudio.com/gallery/datatables-options.html")
# the app is embedded in an iframe, so switch into it before querying elements
webElems <- remDr$client$findElements("css selector", "iframe")
remDr$client$switchToFrame(webElems[[1]])
elems <- remDr$client$findElements("css selector", "#showcase-app-container > nav > div > ul li")
unlist(lapply(elems, function(x) x$getElementText()))
[1] "Display length" "Length menu" "No pagination" "No filtering" "Function callback"
This is my confirmation that RSelenium is working properly. However, this is all happening "blindly" - I can't see what is going on. In a complicated web pull I'm trying to perform (hidden behind credentials so I can't give an example), certain elements cannot be found after iterations even though I know they are on the page. Being able to see the browser would allow me to easily debug the code.
Not sure if this means anything, but it doesn't look like the driver is attached to an IP address:
(name)$ docker-machine ip
Error: No machine name(s) specified and no "default" machine exists
Is there something else I need to download to be able to visually see the webdriving process? Thanks in advance.
I'm not sure about the exact behavior in that video, but I always use a phantomjs headless browser and look at screenshots as I go. This code would produce what I'm talking about:
library(RSelenium)
#this sets up the phantomjs driver
pjs <- wdman::phantomjs()
#open a connection to it
dr <- rsDriver(browser = 'phantomjs')
remdr <- dr[['client']]
#go to the site
remdr$navigate("https://stackoverflow.com/")
#show browser screenshot in viewer
remdr$screenshot(TRUE)

Having trouble connecting RSelenium to Server

I've been learning R programming for the last few months and really enjoying the language. I wanted to start using it to automate a few things at work. However, for the life of me, no matter how much I Google or experiment, I can't seem to start the browser.
I followed the steps from this article
https://www.r-bloggers.com/rselenium-a-wonderful-tool-for-web-scraping/
and got the server started from the command line. This is the code I ran in the console and the warning message I'm getting:
> library(RSelenium)
> checkForServer()
Warning message:
checkForServer is deprecated.
Users in future can find the function in
file.path(find.package("RSelenium"), "example/serverUtils").
The sourcing/starting of a Selenium Server is a users responsiblity.
Options include manually starting a server see
vignette("RSelenium-basics", package = "RSelenium")
and running a docker container see
vignette("RSelenium-docker", package = "RSelenium")
I'm running on Windows 10 64-bit and have installed the latest Firefox.
Any help or pointers on this would be greatly appreciated.
Thanks,
Shan
Okay, I just went through this. So you can skip the whole Selenium Server entirely by just using phantomjs, which RSelenium can call directly.
Steps:
Download phantomjs for your platform here
Put this binary somewhere on the system PATH, or anywhere else you can access from R
Now try this:
library(RSelenium)
pJS <- phantom(pjs_cmd = "<YOUR BINARY LOCATION>") # no arg if it's in PATH
Sys.sleep(5)
remDr <- remoteDriver(browserName = "phantomjs")
remDr$open(silent = T)
url <- "http://www.google.com"
remDr$navigate(url)
remDr$screenshot(display = TRUE)
NOTE: When I run this I get an error after the first step, but it still works and pulls up the page. Not sure why that happens.

phantomjs unable to find element on page

Recently, I've been having trouble driving phantomjs under RSelenium. It seems that the browser is unable to locate anything on the page using findElement(). If I run something as simple as:
library("RSelenium")
RSelenium::checkForServer()
RSelenium::startServer()
rd <- remoteDriver(browserName = "phantomjs")
rd$open()
Sys.sleep(5)
rd$navigate("https://www.Facebook.com")
searchBar <- rd$findElement(using = "id", "email")
I get the error below:
Error: Summary: NoSuchElement
Detail: An element could not be located on the page using the given search parameters.
class: org.openqa.selenium.NoSuchElementException
Any thoughts on what is causing this? It doesn't seem to matter what page I navigate to; it simply fails anytime I try to locate an element on the webpage. This issue started recently and I noticed it when my cron jobs began failing.
I'm working on Ubuntu 14.04 LTS with R 3.3.1 and phantomjs 2.1.1. I don't suspect a compatibility issue, as this has worked very recently and I haven't updated anything.
The version of phantomjs you installed may be limited. See here
Disabled Ghostdriver due to pre-built source-less Selenium blobs.
Added README.Debian explaining differences from upstream "phantomjs".
If you installed recently using apt-get, then this is most likely the case. You can download from the phantomjs website and place the bin location on your PATH.
Alternatively, use npm to install a version for you:
npm install phantomjs-prebuilt
This will then put a link to the bin in node_modules/.bin/phantomjs.
For the reasons behind the limitations of the apt-get version, you can read the README.Debian file contained here.
Limitations
Unlike original "phantomjs" binary that is statically linked with
modified QT+WebKit, Debian package is built with system libqt5webkit5.
Unfortunately the latter do not have webSecurity extensions therefore
"--web-security=no" is expected to fail.
https://github.com/ariya/phantomjs/issues/13727#issuecomment-155609276
Ghostdriver is crippled due to removed source-less pre-built blobs:
src/ghostdriver/third_party/webdriver-atoms/*
Therefore all PDF functionality is broken.
PhantomJS cannot run in headless mode (if there is no X server
available).
Unfortunately it can not be fixed in Debian. To achieve headless-ness
upstream statically link with customised QT + Webkit. We don't want to
ship forks of those projects. It would be great to eventually convince
upstream to use standard libraries. Meanwhile one can use "xvfb-run"
from "xvfb" package:
xvfb-run --server-args="-screen 0 640x480x16" phantomjs
If you don't want to set your PATH for phantomjs, then you can add it as an extra capability:
library(RSelenium)
selServ <- startServer()
pBin <- list(phantomjs.binary.path = "/home/john/node_modules/phantomjs-prebuilt/lib/phantom/bin/phantomjs")
rd <- remoteDriver(browserName = "phantomjs"
, extraCapabilities = pBin)
Sys.sleep(5)
rd$open()
rd$navigate("https://www.Facebook.com")
searchBar <- rd$findElement(using = "id", "email")
rd$close()
selServ$stop()

Can a browser called from RSelenium run in the background

I am working on a Windows 7 machine. Is it possible to run remoteDriver()$open() from the RSelenium library and have the browser run in the background (i.e. not visible)?
Thanks
Yes, that is possible. The default browser for RSelenium is Firefox. However, RSelenium also supports headless browsing using PhantomJS, which is described in detail in the respective vignette.
In general, to leverage PhantomJS under Windows 7 you just need to:
download PhantomJS and add the folder containing phantomjs.exe as an additional entry to the user or system PATH variable in the Environment Variables menu on your system (e.g. C:\Program Files\phantomjs-1.9.7-windows). Note: phantomjs.exe itself is not part of the path specification.
replace the code snippets at the beginning and at the end of your script as described next.
Default browsing:
checkForServer()
startServer()
remDrv <- remoteDriver()
remDrv$open()
...
remDrv$quit()
remDrv$closeServer()
Headless browsing:
pJS <- phantom()
remDrv <- remoteDriver(browserName = 'phantomjs')
remDrv$open()
...
remDrv$close()
pJS$stop()
Additional advice
Command line arguments and POODLE
Pay attention to the command-line arguments you can pass to phantom.
For instance, PhantomJS uses SSLv3 by default, which most servers have disabled since the POODLE vulnerability.
The workaround is to call phantom with --ssl-protocol=tlsv1:
pJS <- phantom(extras = c('--ssl-protocol=tlsv1'))
Timing issues
One thing that often happens with PhantomJS is timing issues. Code that works with browsers such as Firefox and Chrome breaks with PhantomJS because PhantomJS moves on before the page has finished loading.
You can solve this issue by placing Sys.sleep calls between the different remoteDriver calls.
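For example (remDrv is the headless driver from the snippet above; waitFor is a hypothetical helper, not part of RSelenium):
remDrv$navigate("http://www.google.com")
Sys.sleep(2)   # give PhantomJS time to render before querying the page

# or poll for an element instead of relying on a fixed pause
waitFor <- function(remDrv, using, value, timeout = 10) {
  for (i in seq_len(timeout * 2)) {
    elem <- tryCatch(remDrv$findElement(using = using, value = value),
                     error = function(e) NULL)
    if (!is.null(elem)) return(elem)
    Sys.sleep(0.5)
  }
  stop("Element not found within the timeout: ", value)
}
elem <- waitFor(remDrv, "name", "q")   # e.g. Google's search box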
