Download a file with RSelenium and Docker Toolbox - r

I'm trying to download files with RSelenium, but I can't get it to work, even with a simple example:
1) I installed Docker Toolbox (https://cran.r-project.org/web/packages/RSelenium/vignettes/RSelenium-docker.html).
2) I ran the Firefox standalone image 3.1.0, and I am now testing the older 2.52.0.
3) I installed the RSelenium package on R x64 3.3.2 and read all the related questions and answers on Stack Overflow.
4) I tried the following code. Incidentally, when I inspect the Firefox options in about:config, I cannot find the "browser.download.dir" option:
require(RSelenium)
fprof <- makeFirefoxProfile(list(browser.download.dir = "C:/temp"
, browser.download.folderList = 2L
, browser.download.manager.showWhenStarting = FALSE
, browser.helperApps.neverAsk.saveToDisk = "application/zip"))
remDr <- remoteDriver(browserName = "firefox",remoteServerAddr = "192.168.99.100",port = 4445L,extraCapabilities = fprof)
remDr$open(silent = TRUE)
remDr$navigate("https://www.chicagofed.org/applications/bhc/bhc-home")
# click year 2012
webElem <- remDr$findElement("name", "SelectedYear")
webElems <- webElem$findChildElements("css selector", "option")
webElems[[which(sapply(webElems, function(x){x$getElementText()}) == "2012" )]]$clickElement()
# click required quarter
webElem <- remDr$findElement("name", "SelectedQuarter")
Sys.sleep(1)
webElems <- webElem$findChildElements("css selector", "option")
webElems[[which(sapply(webElems, function(x){x$getElementText()}) == "4th Quarter" )]]$clickElement()
# click button
webElem <- remDr$findElement("id", "downloadDataFile")
webElem$clickElement()
5) I get no error, but no file is downloaded.
6) Ultimately, I would like to use RSelenium to download the Excel file on this page:
https://app2.msci.com/products/indexes/performance/country_chart.html?asOf=Feb%2028,%202010&size=30&scope=C&style=C&currency=15&priceLevel=0&indexId=83#

If you are using Docker Toolbox on Windows, you may have issues mapping volumes; see "Docker: Sharing a volume on Windows with Docker toolbox":
If you are using Docker Machine on Mac or Windows, your Docker daemon has only limited access to your OS X or Windows filesystem. Docker Machine tries to auto-share your /Users (OS X) or C:\Users (Windows) directory.
I did a clean install of Docker Toolbox on a Windows 10 machine and ran the following image:
$ docker stop $(docker ps -aq)
$ docker rm $(docker ps -aq)
$ docker run -d -v //c/Users/john/test/://home/seluser/Downloads -p 4445:4444 -p 5901:5900 selenium/standalone-firefox-debug:2.53.1
NOTE: the container's download directory is mapped to a directory under Users/john. User john is the one running Docker Toolbox.
Running the code below:
require(RSelenium)
fprof <- makeFirefoxProfile(list(browser.download.dir = "/home/seluser/Downloads"
, browser.download.folderList = 2L
, browser.download.manager.showWhenStarting = FALSE
, browser.helperApps.neverAsk.saveToDisk = "application/zip"))
remDr <- remoteDriver(browserName = "firefox",remoteServerAddr = "192.168.99.100",port = 4445L,extraCapabilities = fprof)
remDr$open(silent = TRUE)
remDr$navigate("https://www.chicagofed.org/applications/bhc/bhc-home")
# click year 2012
webElem <- remDr$findElement("name", "SelectedYear")
webElems <- webElem$findChildElements("css selector", "option")
webElems[[which(sapply(webElems, function(x){x$getElementText()}) == "2012" )]]$clickElement()
# click required quarter
webElem <- remDr$findElement("name", "SelectedQuarter")
Sys.sleep(1)
webElems <- webElem$findChildElements("css selector", "option")
webElems[[which(sapply(webElems, function(x){x$getElementText()}) == "4th Quarter" )]]$clickElement()
# click button
webElem <- remDr$findElement("id", "downloadDataFile")
webElem$clickElement()
And checking the mapped download folder
> list.files("C://Users/john/test")
[1] "bhcf1212.zip"
>
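Because clickElement() returns before the download has finished, checking the folder immediately can miss the file. A minimal sketch of a polling helper follows, assuming Firefox writes a temporary ".part" file while a download is in progress. The function name wait_for_download and the temporary demo directory are illustrative and not part of the original answer.

```r
# Poll a directory until a file matching `pattern` exists and no Firefox
# ".part" temporary file remains, or until `timeout` seconds have passed.
# Returns the full path(s) found, or character(0) on timeout.
wait_for_download <- function(dir, pattern, timeout = 30, interval = 0.5) {
  deadline <- Sys.time() + timeout
  repeat {
    files <- list.files(dir)
    done <- grep(pattern, files, value = TRUE)
    in_progress <- any(grepl("\\.part$", files))
    if (length(done) > 0 && !in_progress) {
      return(file.path(dir, done))
    }
    if (Sys.time() >= deadline) {
      return(character(0))
    }
    Sys.sleep(interval)
  }
}

# Self-contained demonstration: a temporary directory stands in for the
# mapped folder (in the answer above that would be "C:/Users/john/test").
demo_dir <- file.path(tempdir(), "rselenium-downloads")
dir.create(demo_dir, showWarnings = FALSE)
file.create(file.path(demo_dir, "bhcf1212.zip"))

found <- wait_for_download(demo_dir, "\\.zip$", timeout = 5)
print(basename(found))
```

In the real session one would call wait_for_download("C:/Users/john/test", "\\.zip$") right after webElem$clickElement().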

In the end I decided to do a clean install of Docker for Windows (17.03.0, stable).
I had to reduce the number of CPUs available to Docker (to 1) and the available RAM (to 1 GB).
I also shared my C: drive (by the way, your Windows session must have a password, otherwise you cannot share the drive).
After that I restarted my computer.
On the R side, do not forget to remove the argument:
remoteServerAddr = "192.168.99.100"
With that, I got the file.
My remaining concern is Docker's stability: sometimes it runs, sometimes it does not.
Many thanks, John, for your help.
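The difference between the two setups can be sketched as a small helper that builds the remoteDriver() arguments. This is only an illustration: the function name selenium_args and the setup labels are hypothetical, and it assumes the container still publishes port 4445 as in the docker run command earlier in the thread.

```r
# Build the argument list for RSelenium::remoteDriver() depending on how
# Docker is installed. With Docker Toolbox the server lives in a VM at
# 192.168.99.100; with Docker for Windows the published port is reachable
# on localhost, which is remoteDriver()'s default address.
selenium_args <- function(setup = c("docker-for-windows", "docker-toolbox"),
                          port = 4445L) {
  setup <- match.arg(setup)
  args <- list(browserName = "firefox", port = port)
  if (setup == "docker-toolbox") {
    args$remoteServerAddr <- "192.168.99.100"
  }
  args
}

toolbox_args <- selenium_args("docker-toolbox")
native_args <- selenium_args("docker-for-windows")
str(native_args)
```

With RSelenium loaded, the driver could then be created with do.call(remoteDriver, native_args).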

Related

Extract text from body of HTML page with RSelenium

I need to extract the text from a bunch of web pages that use JavaScript to render.
The code below usually works for me, resulting in just text and line returns which is fine.
However on some pages it doesn't work.
How can I use RSelenium to extract the text of the body of the "URL Fails" indicated webpage?
library("tidyverse")
library("rvest")
library("RSelenium")
remDr <- remoteDriver(port = 4445L)
remDr$open()
# URL Works
url <- "https://www.td.com/ca/en/personal-banking/products/credit-cards/travel-rewards/rewards-visa-card/"
# URL Fails
# url <- "https://www.bmo.com/main/personal/credit-cards/bmo-cashback-mastercard/"
remDr$navigate(url)
pg <-
remDr$getPageSource()[[1]] %>%
read_html(encoding = "UTF-8") %>%
html_node(xpath = "//body") %>%
as.character() %>%
htm2txt::htm2txt()
remDr$close()
Proposed solution by @NadPat:
url <- "https://www.bmo.com/main/personal/credit-cards/bmo-cashback-mastercard/"
remDr$navigate(url)
text <- remDr$findElement(using = 'xpath', value = '/html')
text$getElementText()
Result for me:
Selenium message:a is null
Build info: version: '2.53.1', revision: 'a36b8b1', time: '2016-06-30 17:37:03'
System info: host: 'fe72a1de69e7', ip: '172.17.0.2', os.name: 'Linux', os.arch: 'amd64', os.version: '5.4.0-84-generic', java.version: '1.8.0_91'
Driver info: driver.version: unknown
Error: Summary: UnknownError
Detail: An unknown server-side error occurred while processing the command.
class: org.openqa.selenium.WebDriverException
Further Details: run errorDetails method
For the failing URL something is being read because
remDr$getPageSource()[[1]]
returns:
[1] "<html xmlns=\"http://www.w3.org/1999/xhtml\"><head><script>\n\nsitePrefix = 'BMO';\nvar pageNameMapping = {};\n\n//channelDemo\npageNameMapping[\"atm_en\"]=\"channelDemo\";\npageNameMapping[\"atm_fr\"]=\"channelDemo\";\n\n//Every Day Banking\npageNameMapping[\"Personal\"]=\"PERS\";\npageNameMapping[\"Bank Accounts\"]=\"Bank-Accounts\";\npageNameMapping[\"Daily savings account\"]=\"Premium-Rate-Savings\";\npageNameMapping[\"High Interest Savings Account\"]=\"Smart-Saver\";\npageNameMapping[\"Chequing account\"]=\"Primary-Chequing\";\npageNameMapping[\"Business Premium Rate Savings\"]=\"Business Premium Rate Account\";\n\n//Cards\npageNameMapping[\"Credit Cards\"]=\"CC\";\n\n\n//Mortgages\npageNameMapping[\"Mortgages\"]=\"MTG\";\npageNameMapping[\"Special Offers\"]=\"Special-Offers\";\n\n//Wealth Management\npageNameMapping[\"Wealth Management\"]=\"Wealth\";\npageNameMapping[\"AdviceDirect\"]=\"Advicedirect\";\n\n//Online Investing\npageNameMapping[\"Online Investing\"]=\"ONL-INVS\";\npageNameMapping...
Is there something wrong with how I have setup RSelenium with Docker?
=======================
UPDATE:
I pulled the latest version of standalone-firefox from Docker, and now @NadPat's solutions work for me.
docker pull selenium/standalone-firefox:latest
Launching the browser,
library(RSelenium)
driver = rsDriver(
port = 4841L,
browser = c("firefox"))
remDr <- driver[["client"]]
url <- "https://www.bmo.com/main/personal/credit-cards/bmo-cashback-mastercard/"
First method,
remDr$navigate(url)
text <- remDr$findElement(using = 'xpath', value = '/html')
text$getElementText()
[[1]]
[1] "Skip navigation\nPersonal\nPrivate Wealth\nBusiness\nCommercial\nCapital Markets\nSearch\nFind us\nSupport\nEN\nLogin\nBank Accounts\nCredit Cards\nMortgages\nLoans & Lines of Credit\nInvestments\nFinancial Planning\nInsurance\nWays to Bank\nAbout BMO\nPersonal\nCredit Cards\nBMO CashBack Mastercard\nBMO CashBack® Mastercard®*\nEnjoy the most cash back on groceries in Canada without paying an annual fee\nfootnote\n*\nFootnote\n* Based on a comparison of the non-promotional groce
Second Method,
text <- remDr$findElement(using = 'xpath', value = '//*[@id="main"]')
text$getElementText()
[[1]]
[1] "Personal\nCredit Cards\nBMO CashBack Mastercard\nBMO CashBack® Mastercard®*\nEnjoy the most cash back on groceries in Canada without paying an annual fee\nfootnote\n*\nFootnote\n* Based on a comparison of the non-promotional grocery rewards earn rate on cash back credit cards with no annual fee as of June 1, 2021.\nWelcome offer\nGet up to 5% cash back in your first 3 months‡‡ and a 1.99% introductory interest rate on balance transfers for 9 months with a 1% transfer fee.§§\nAPPL
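Since getPageSource() did return HTML for the failing URL, another option is to extract the visible text from that source locally instead of asking the browser for it. The sketch below assumes the xml2 package is installed. The function name page_source_to_text and the sample HTML are illustrative, and the approach is not the one used in the thread.

```r
library(xml2)

# Turn a page-source string into plain text: drop nodes that hold code
# rather than visible content, then join the remaining text nodes of
# <body>, one per line.
page_source_to_text <- function(html) {
  doc <- read_html(html)
  xml_remove(xml_find_all(doc, "//script | //style | //noscript"))
  body <- xml_find_first(doc, "//body")
  pieces <- xml_text(xml_find_all(body, ".//text()[normalize-space()]"))
  paste(trimws(pieces), collapse = "\n")
}

# Small stand-in for remDr$getPageSource()[[1]]
sample_html <- paste0(
  "<html><head><script>var a = 1;</script></head>",
  "<body><h1>BMO CashBack</h1><script>var b = 2;</script>",
  "<p>No annual fee</p></body></html>"
)

txt <- page_source_to_text(sample_html)
cat(txt, "\n")
```

In a live session the call would be page_source_to_text(remDr$getPageSource()[[1]]).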

How to install a package from GitHub?

I need to install a package from GitHub.
This is my code:
devtools::install_github("jgalgarra/kcorebip")
I already have devtools installed, but it gives me the following error:
Error: Failed to install 'unknown package' from GitHub:
Timeout was reached: Connection timed out after 10015 milliseconds
This is what I have configured in my Rprofile.site:
# Things you might want to change
# options(papersize = "a4")
# options(editor = "notepad")
# options(pager = "internal")
# set the default help type
# options(help_type = "text")
options(help_type = "html")
# set a site library
# .Library.site <- file.path(chartr("\\", "/", R.home()), "site-library")
# set a CRAN mirror
# local({r <- getOption("repos")
# r["CRAN"] <- "http://my.local.cran"
# r["CRAN"] <- http://cran.us.r-project.org
# options(repos = r)})
local({r <- getOption("repos")
r["Nexus"] <- "http://nexus.uo.edu.cu:8081/repository/R-repository/"
options(repos = r)
})
# Give a fortune cookie, but only to interactive sessions
# (This would need the fortunes package to be installed.)
# if (interactive())
# fortunes::fortune()
Can I change a line in this file so that the GitHub install works, or is the problem something else?
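The repos setting in Rprofile.site only affects CRAN-like repositories, so it would not change how install_github() reaches GitHub. One possible cause of the timeout, and this is only an assumption, is that the network requires an HTTP proxy. A minimal sketch of trying an install with proxy environment variables set temporarily follows. The helper name with_proxy_env and the proxy address are hypothetical.

```r
# Evaluate `code` with http_proxy/https_proxy set to `proxy`, then restore
# the previous values (or unset them if they were not set before).
with_proxy_env <- function(proxy, code) {
  vars <- c("http_proxy", "https_proxy")
  old <- Sys.getenv(vars, unset = NA)
  on.exit({
    for (v in vars) {
      if (is.na(old[[v]])) {
        Sys.unsetenv(v)
      } else {
        do.call(Sys.setenv, setNames(list(old[[v]]), v))
      }
    }
  }, add = TRUE)
  Sys.setenv(http_proxy = proxy, https_proxy = proxy)
  force(code)  # `code` is evaluated lazily, i.e. while the proxy is set
}

# Demonstration with a placeholder address (not a real proxy).
before <- Sys.getenv("https_proxy", unset = NA)
seen <- with_proxy_env("http://proxy.example.invalid:3128",
                       Sys.getenv("https_proxy"))
after <- Sys.getenv("https_proxy", unset = NA)
print(seen)
```

With a real proxy address, the install attempt would be with_proxy_env("http://<proxy-host>:<port>", devtools::install_github("jgalgarra/kcorebip")).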

Launch TorBrowser or Tor enabled Firefox browser in RSelenium

How do you launch TorBrowser in RSelenium?
I tried this to no avail:
library(RSelenium)
browserP <- "C:/Users/Administrator/Desktop/Tor Browser/Browser/firefox.exe"
jArg <- paste0("-Dwebdriver.firefox.bin=\"", browserP, "\"")
pLoc <- "C:/Users/Administrator/Desktop/Tor Browser/Browser/TorBrowser/Data/Browser/profile.meek-http-helper/"
jArg <- c(jArg, paste0("-Dwebdriver.firefox.profile=\"", pLoc, "\""))
selServ <- RSelenium::startServer(javaargs = jArg)
Error: startServer is now defunct. Users in future can find the function in
file.path(find.package("RSelenium"), "examples/serverUtils"). The
recommended way to run a selenium server is via Docker. Alternatively
see the RSelenium::rsDriver function.
rsDriver doesn't take a javaargs argument, and I can't figure out how to get this to work either:
fprof <- getFirefoxProfile("C:/Users/Administrator/Desktop/Tor Browser/Browser/TorBrowser/Data/Browser/profile.meek-http-helper/", useBase = T)
remDr <- remoteDriver(extraCapabilities = list(marionette = TRUE))
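One alternative to launching the Tor Browser binary itself is to point an ordinary Firefox at Tor's SOCKS proxy through profile preferences. The sketch below assumes Tor is already running on the same machine as the Selenium server and listening on 127.0.0.1:9150 (the port used by the Tor bundled with Tor Browser; a standalone tor daemon usually listens on 9050). The names tor_prefs and open_tor_firefox are illustrative, and the function is defined but not called here because it needs a running Selenium server.

```r
# Firefox preferences that route traffic through a local SOCKS5 proxy.
tor_prefs <- list(
  network.proxy.type = 1L,                # 1 = manual proxy configuration
  network.proxy.socks = "127.0.0.1",
  network.proxy.socks_port = 9150L,       # assumption: Tor Browser's default
  network.proxy.socks_version = 5L,
  network.proxy.socks_remote_dns = TRUE   # resolve host names via the proxy
)

# Open a Firefox session with those preferences. Extra arguments such as
# remoteServerAddr or port are passed on to remoteDriver().
open_tor_firefox <- function(prefs = tor_prefs, ...) {
  fprof <- RSelenium::makeFirefoxProfile(prefs)
  remDr <- RSelenium::remoteDriver(browserName = "firefox",
                                   extraCapabilities = fprof, ...)
  remDr$open()
  remDr
}
```

Once a server is available, remDr <- open_tor_firefox(port = 4445L) would start the session. This gives a Tor-routed Firefox, not the hardened Tor Browser configuration.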

Scrape website with R by navigating doPostBack

I want to extract a table periodically from the site below.
The price list changes when a building block name (BLOK 16 A, BLOK 16 B, BLOK 16 C, ...) is clicked. The URL doesn't change; the page changes by triggering
javascript:__doPostBack('ctl00$ContentPlaceHolder1$DataList2$ctl04$lnk_blok','')
I've tried three approaches after searching Google and Stack Overflow.
Attempt 1: this doesn't trigger the doPostBack event.
postForm( "http://www.kentkonut.com.tr/tr/modul/projeler/daire_fiyatlari.aspx?id=44", ctl00_ContentPlaceHolder1_DataList2_ctl03_lnk_blok="ctl00$ContentPlaceHolder1$DataList2$ctl03$lnk_blok")
Attempt 2: the Selenium server seems to be running (http://localhost:4444/), but remoteDriver doesn't navigate and returns this error: (Error in checkError(res) :
Undefined error in httr call. httr output: length(url) == 1 is not TRUE)
library(RSelenium)
startServer()
remDr <- remoteDriver()
remDr <- remoteDriver(remoteServerAddr = "localhost"
, port = 4444L, browserName = "firefox")
remDr$open()
remDr$getStatus()
remDr$navigate("http://www.kentkonut.com.tr/tr/modul/projeler/daire_fiyatlari.aspx?id=44")
Attempt 3: this is another way to trigger the doPostBack event. It doesn't navigate either.
library(RCurl) # provides getURL() and postForm()
base.url <- "http://www.kentkonut.com.tr/tr/modul/projeler/"
event.target <- 'ctl00$ContentPlaceHolder1$DataList2$ctl03$lnk_blok'
action <- "daire_fiyatlari.aspx?id=44"
ftarget <- paste0(base.url, action)
dum <- getURL(ftarget)
event.val <- unlist(strsplit(dum,"__EVENTVALIDATION\" value=\""))[2]
event.val <- unlist(strsplit(event.val,"\" />\r\n\r\n<script"))[1]
view.state <- unlist(strsplit(dum,"id=\"__VIEWSTATE\" value=\""))[2]
view.state <- unlist(strsplit(view.state,"\" />\r\n\r\n\r\n<script"))[1]
web.data <- postForm(ftarget, "form name" = "ctl00_ContentPlaceHolder1_DataList2_ctl03_lnk_blok",
"method" = "POST",
"action" = action,
"id" = "ctl00_ContentPlaceHolder1_DataList2_ctl03_lnk_blok",
"__EVENTTARGET"=event.target,
"__EVENTVALIDATION"=event.val,
"__VIEWSTATE"=view.state)
Thanks for your help.
library(rvest)
url<-"http://www.kentkonut.com.tr/tr/modul/projeler/daire_fiyatlari.aspx?id=44"
pgsession<-html_session(url)
t<-html_table(html_nodes(read_html(pgsession), css = "#ctl00_ContentPlaceHolder1_DataList1"), fill= TRUE)[[1]]
even_indices<-seq(2,length(t$X1),2)
t<-t[even_indices,]
t<-t[2:(length(t$X1)),]
EDITED CODE:
library(rvest)
url<-"http://www.kentkonut.com.tr/tr/modul/projeler/daire_fiyatlari.aspx?id=44"
pgsession<-html_session(url)
pgform<-html_form(pgsession)[[1]]
page<-rvest:::request_POST(pgsession,"http://www.kentkonut.com.tr/tr/modul/projeler/daire_fiyatlari.aspx?id=44",
body=list(
`__VIEWSTATE`=pgform$fields$`__VIEWSTATE`$value,
`__EVENTTARGET`="ctl00$ContentPlaceHolder1$DataList2$ctl01$lnk_blok",
`__EVENTARGUMENT`="",
`__VIEWSTATEGENERATOR`=pgform$fields$`__VIEWSTATEGENERATOR`$value,
`__VIEWSTATEENCRYPTED`=pgform$fields$`__VIEWSTATEENCRYPTED`$value,
`__EVENTVALIDATION`=pgform$fields$`__EVENTVALIDATION`$value
),
encode="form"
)
# In the example above, change __EVENTTARGET to "ctl00$ContentPlaceHolder1$DataList2$ctl02$lnk_blok" to get a different table
t<-html_table(html_nodes(read_html(page), css = "#ctl00_ContentPlaceHolder1_DataList1"), fill= TRUE)[[1]]
even_indices<-seq(2,length(t$X1),2)
t<-t[even_indices,]
t<-t[2:(length(t$X1)),]
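To fetch several blocks in a loop, the __EVENTTARGET values can be generated rather than typed by hand. This is a small sketch that assumes the control names keep the zero-padded two-digit pattern seen above (ctl01, ctl02, ...); the function name event_target is illustrative.

```r
# Build the __EVENTTARGET string for the i-th building block link,
# assuming the ctlNN index is zero-padded to two digits.
event_target <- function(i) {
  sprintf("ctl00$ContentPlaceHolder1$DataList2$ctl%02d$lnk_blok",
          as.integer(i))
}

# sprintf() is vectorised, so several targets can be built at once.
targets <- event_target(1:3)
print(targets)
```

Each element of targets can then be passed as `__EVENTTARGET` in the request body of the edited code.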

In R, PhantomJS run with RSelenium hangs after several iterations

I'm using PhantomJS to collect data from different sites. During scraping I experience a lot of crashes when parsing sites or site elements. Unfortunately, neither PhantomJS nor RSelenium provides any information or bug report in the console. The script just hangs without any warning: it appears to be executing, but nothing actually happens. The only way to stop it is to restart R manually. After several tests I found that PhantomJS usually hangs while executing remDr$findElements() commands. I reran my code using Firefox and RSelenium, and it works normally, so the problem seems to be in how PhantomJS works. Has anyone experienced something similar when running PhantomJS? Is it possible to fix this misbehavior?
I'm using:
Windows 7
Selenium 2.0
R version 3.1.3
phantomjs-2.0.0-windows
My code:
# starting phantom server driver
phantomjsdir <- paste(mywd, "/phantomjs-2.0.0-windows/bin/phantomjs.exe", sep="" )
phantomjsUserAgent <- "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.89 Safari/537.36 OPR/28.0.1750.48"
eCap <- list(phantomjs.binary.path = phantomjsdir, phantomjs.page.settings.userAgent = phantomjsUserAgent )
pJS <- phantom(pjs_cmd = phantomjsdir)
remDr <- remoteDriver(browserName = "phantomjs", extraCapabilities = eCap)
remDr$open(silent = FALSE)
mywords <- c("canon 600d", "sony 58k","nikon","nikon2","nikon 800","nikon 80","nikon 8")
timeout <- 3
#'
#' Executing script
#'
for (word in mywords) {
print(paste0("searching for: ",word))
ss.word <- word
remDr$navigate("http://google.com")
webElem <- remDr$findElement(using = "class", "gsfi")
webElem$sendKeysToElement(list(enc2utf8(ss.word),key = "enter"))
Sys.sleep(1)
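print(remDr$executeScript("return document.readyState;")[[1]])
# wait for the page to finish loading, but give up after about 10 seconds
totalwait <- 0
while (remDr$executeScript("return document.readyState;")[[1]] != "complete" && totalwait < 10) {
Sys.sleep(timeout)
totalwait <- totalwait + timeout
}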
print(paste0("search completed: ",ss.word))
elem.snippet <- remDr$findElements(using="class name",value = "rc")
for (i in 1:length(elem.snippet)) {
print(paste0("element opened: ",ss.word," pos",i))
print(elem.snippet[[i]])
ss.snippet.code <- elem.snippet[[i]]$getElementAttribute('innerHTML')
print(paste0("element element innerHTML ok"))
elemtitle <- elem.snippet[[i]]$findChildElement(using = "class name", value = "r")
print(paste0("element title ok"))
elemcode <- elemtitle$getElementAttribute('innerHTML')
print(paste0("element innerHTML ok"))
elemtext <- elem.snippet[[i]]$findChildElement(using = "class name", value = "st")
ss.text <- elemtext$getElementText()[[1]]
print(paste0("element loaded: ",ss.word," pos",i))
elemloc <- elem.snippet[[i]]$getElementLocation()
elemsize <- elem.snippet[[i]]$getElementSize()
print(paste0("element location parsed: ",ss.word," pos",i))
}
print(paste0("data collected: ",ss.word))
}
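The readiness loop in the script can also be written as a reusable, bounded polling helper so that a page that never reports "complete" cannot stall the loop. This is a minimal sketch. The name wait_until is illustrative, and the helper cannot interrupt a single Selenium call that blocks internally, so it would not by itself cure a hang inside findElements().

```r
# Call `condition()` repeatedly until it returns TRUE or `timeout` seconds
# have elapsed. Errors raised by `condition()` count as "not ready yet".
# Returns TRUE on success and FALSE on timeout.
wait_until <- function(condition, timeout = 10, interval = 0.5) {
  deadline <- Sys.time() + timeout
  repeat {
    ok <- tryCatch(isTRUE(condition()), error = function(e) FALSE)
    if (ok) return(TRUE)
    if (Sys.time() >= deadline) return(FALSE)
    Sys.sleep(interval)
  }
}

# Self-contained demonstration: one condition that becomes true after
# about a second, and one that never does.
start <- Sys.time()
ready <- wait_until(
  function() as.numeric(difftime(Sys.time(), start, units = "secs")) > 1,
  timeout = 5, interval = 0.2
)
never <- wait_until(function() FALSE, timeout = 1, interval = 0.2)
print(c(ready = ready, never = never))
```

In the scraping loop the condition would be function() remDr$executeScript("return document.readyState;")[[1]] == "complete".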
