Extract text from body of HTML page with RSelenium - r

I need to extract the text from a bunch of web pages that use JavaScript to render.
The code below usually works for me, resulting in just text and line returns which is fine.
However on some pages it doesn't work.
How can I use RSelenium to extract the text of the body of the "URL Fails" indicated webpage?
library(RSelenium)
library(rvest)   # read_html(), html_node(); also re-exports %>%
remDr <- remoteDriver(port = 4445L)
remDr$open()
# URL Works
url <- "https://www.td.com/ca/en/personal-banking/products/credit-cards/travel-rewards/rewards-visa-card/"
# URL Fails
# url <- "https://www.bmo.com/main/personal/credit-cards/bmo-cashback-mastercard/"
remDr$navigate(url)
pg <-
  remDr$getPageSource()[[1]] %>%
  read_html(encoding = "UTF-8") %>%
  html_node(xpath = "//body") %>%
  as.character()
Proposed solution by @NadPat:
url <- "https://www.bmo.com/main/personal/credit-cards/bmo-cashback-mastercard/"
remDr$navigate(url)
text <- remDr$findElement(using = 'xpath', value = '/html')
text$getElementText()
Result for me:
Selenium message:a is null
Build info: version: '2.53.1', revision: 'a36b8b1', time: '2016-06-30 17:37:03'
System info: host: 'fe72a1de69e7', ip: '', os.name: 'Linux', os.arch: 'amd64', os.version: '5.4.0-84-generic', java.version: '1.8.0_91'
Driver info: driver.version: unknown
Error: Summary: UnknownError
Detail: An unknown server-side error occurred while processing the command.
class: org.openqa.selenium.WebDriverException
Further Details: run errorDetails method
For the failing URL something is being read, because the page source is not empty:
[1] "<html xmlns=\"http://www.w3.org/1999/xhtml\"><head><script>\n\nsitePrefix = 'BMO';\nvar pageNameMapping = {};\n\n//channelDemo\npageNameMapping[\"atm_en\"]=\"channelDemo\";\npageNameMapping[\"atm_fr\"]=\"channelDemo\";\n\n//Every Day Banking\npageNameMapping[\"Personal\"]=\"PERS\";\npageNameMapping[\"Bank Accounts\"]=\"Bank-Accounts\";\npageNameMapping[\"Daily savings account\"]=\"Premium-Rate-Savings\";\npageNameMapping[\"High Interest Savings Account\"]=\"Smart-Saver\";\npageNameMapping[\"Chequing account\"]=\"Primary-Chequing\";\npageNameMapping[\"Business Premium Rate Savings\"]=\"Business Premium Rate Account\";\n\n//Cards\npageNameMapping[\"Credit Cards\"]=\"CC\";\n\n\n//Mortgages\npageNameMapping[\"Mortgages\"]=\"MTG\";\npageNameMapping[\"Special Offers\"]=\"Special-Offers\";\n\n//Wealth Management\npageNameMapping[\"Wealth Management\"]=\"Wealth\";\npageNameMapping[\"AdviceDirect\"]=\"Advicedirect\";\n\n//Online Investing\npageNameMapping[\"Online Investing\"]=\"ONL-INVS\";\npageNameMapping...
Is there something wrong with how I have set up RSelenium with Docker?
I pulled the latest version of standalone-firefox from Docker and now @NadPat's solutions work for me.
docker pull selenium/standalone-firefox:latest

Launching the browser:
library(RSelenium)
driver <- rsDriver(
  port = 4841L,
  browser = "firefox")
remDr <- driver[["client"]]
url <- "https://www.bmo.com/main/personal/credit-cards/bmo-cashback-mastercard/"
remDr$navigate(url)
First method:
text <- remDr$findElement(using = 'xpath', value = '/html')
text$getElementText()
[1] "Skip navigation\nPersonal\nPrivate Wealth\nBusiness\nCommercial\nCapital Markets\nSearch\nFind us\nSupport\nEN\nLogin\nBank Accounts\nCredit Cards\nMortgages\nLoans & Lines of Credit\nInvestments\nFinancial Planning\nInsurance\nWays to Bank\nAbout BMO\nPersonal\nCredit Cards\nBMO CashBack Mastercard\nBMO CashBack® Mastercard®*\nEnjoy the most cash back on groceries in Canada without paying an annual fee\nfootnote\n*\nFootnote\n* Based on a comparison of the non-promotional groce
Second method:
text <- remDr$findElement(using = 'xpath', value = '//*[@id="main"]')
text$getElementText()
[1] "Personal\nCredit Cards\nBMO CashBack Mastercard\nBMO CashBack® Mastercard®*\nEnjoy the most cash back on groceries in Canada without paying an annual fee\nfootnote\n*\nFootnote\n* Based on a comparison of the non-promotional grocery rewards earn rate on cash back credit cards with no annual fee as of June 1, 2021.\nWelcome offer\nGet up to 5% cash back in your first 3 months‡‡ and a 1.99% introductory interest rate on balance transfers for 9 months with a 1% transfer fee.§§\nAPPL


RSelenium - how to go to the next page by clicking on the next button?

My question is about scraping with RSelenium.
I am trying to scrape data from the following website:
"https://www.nhtsa.gov/ratings" using RSelenium.
My present difficulty lies in managing to skip between pages for a given carmaker.
This is my code so far:
library(RSelenium)
# open a connection
rD <- rsDriver()
remDr <- rD$client
# go to the page we want
url <- "https://www.nhtsa.gov/ratings"
remDr$navigate(url)
# click to open the manufacturer selection "page"
webElem <- remDr$findElement(using = 'css selector', "#vehicle a")
webElem$clickElement()
# open the options menu
option.menu <- remDr$findElement(using = 'css selector', 'select')
option.menu$clickElement()
# select one maker, loop over this later
maker.select <- remDr$findElement(using = 'xpath', "//*/option[@value = 'AUDI']")
maker.select$clickElement()
# search our selection
maker.click <- remDr$findElement(using = 'css selector', '.manufacturer-search-submit')
maker.click$clickElement()
# now we have to go through each car (10 per page), loop later
cars <- remDr$findElement(using = 'css selector', 'tbody:nth-child(6) a')
# go to the next page
next_page <- remDr$findElement(using = 'css selector', 'button.btn.link-arrow::after')
next_page$clickElement()
But I get the error:
Error: Summary: NoSuchElement
Detail: An element could not be located on the page using the given search parameters.
Further Details: run errorDetails method
As you can probably see, I am new to RSelenium. Any help that you can give me would be appreciated. Thanks in advance.
Here is another approach that might be of help.
You can access the data simply by sending a GET request to the website. From the website (on the first page), we can see
This is where we can get the data. The second page will have offset=10 then 20,30,etc.
If api_url is defined to be the above url, then we can get the data using httr
# request the data
request <- httr::GET(api_url)
# retrieve the content
request_content <- httr::content(request)
request_result <- request_content$results
# request results contains the data of interest
# A few glimpses into the data
# The first model
# [1] "A3"
# [1] 2018
Now, by varying offset, it is straightforward to build a loop and gather all pages:
out <- list()
k <- 0L
i <- 1L
while (k < 1e+3) {
  # the remaining query parameters of this URL are cut off in the post
  req_url <- paste0('https://api.nhtsa.gov/vehicles/byManufacturer?offset=', k)
  req <- httr::content(httr::GET(req_url))$results
  if (length(req) == 0) break
  out[[i]] <- req
  cat(paste0('\nAdded content for offset \t', k))
  i <- i + 1L
  k <- k + 10L
}
# [1] 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10
Note that you can also play around with manufacturerName in the url and with many more arguments to have clean and tailored data.
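The offset loop above can be separated from the HTTP call, which makes it easy to try out without hitting the API. The following is only a sketch: `collect_pages`, `fetch_page` and `fake_fetch` are hypothetical names introduced here for illustration, and the fake data merely stands in for the API response.

```r
# Generic offset pagination: keep requesting pages until one comes back empty.
# `fetch_page` is a hypothetical function(offset) returning a list of records.
collect_pages <- function(fetch_page, page_size = 10L, max_offset = 1000L) {
  out <- list()
  offset <- 0L
  while (offset < max_offset) {
    page <- fetch_page(offset)
    if (length(page) == 0) break          # an empty page means we are done
    out[[length(out) + 1L]] <- page
    offset <- offset + page_size
  }
  out
}

# Offline stand-in for the API: 25 fake records served 10 at a time.
fake_records <- as.list(seq_len(25))
fake_fetch <- function(offset) {
  idx <- seq_along(fake_records)
  fake_records[idx > offset & idx <= offset + 10L]
}

pages <- collect_pages(fake_fetch)
print(lengths(pages))
```

With httr, `fetch_page` could be a small wrapper that builds the request URL for a given offset and returns the `results` element of the parsed content.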

Download file with RSelenium & Docker Toolbox

I'm trying to download files with RSelenium, but it seems impossible. I can't get a download to work even with a simple example:
1) I installed Docker Toolbox (https://cran.r-project.org/web/packages/RSelenium/vignettes/RSelenium-docker.html)
2) I ran the standalone Firefox image, version 3.1.0, and I'm now testing the older 2.52.0
3) I installed the RSelenium package on R x64 3.3.2 and read all the related questions and answers on Stack Overflow
4) I tried the following code. Incidentally, when I inspect the Firefox options in about:config, I can't find the "browser.download.dir" option:
fprof <- makeFirefoxProfile(list(browser.download.dir = "C:/temp"
, browser.download.folderList = 2L
, browser.download.manager.showWhenStarting = FALSE
, browser.helperApps.neverAsk.saveToDisk = "application/zip"))
remDr <- remoteDriver(browserName = "firefox",remoteServerAddr = "",port = 4445L,extraCapabilities = fprof)
remDr$open(silent = TRUE)
# click year 2012
webElem <- remDr$findElement("name", "SelectedYear")
webElems <- webElem$findChildElements("css selector", "option")
webElems[[which(sapply(webElems, function(x){x$getElementText()}) == "2012" )]]$clickElement()
# click required quarter
webElem <- remDr$findElement("name", "SelectedQuarter")
webElems <- webElem$findChildElements("css selector", "option")
webElems[[which(sapply(webElems, function(x){x$getElementText()}) == "4th Quarter" )]]$clickElement()
# click button
webElem <- remDr$findElement("id", "downloadDataFile")
webElem$clickElement()
6) I get no error, but I get no file either
7) Ultimately, I would like to download the Excel file on this page with RSelenium:
If you are using Docker Toolbox with Windows, you may have issues mapping volumes; see "Docker: Sharing a volume on Windows with Docker toolbox".
If you are using Docker Machine on Mac or Windows, your Docker daemon has only limited access to your OS X or Windows filesystem. Docker Machine tries to auto-share your /Users (OS X) or C:\Users (Windows) directory.
I initiated a clean install of docker toolbox on a windows 10 box and ran the following image:
$ docker stop $(docker ps -aq)
$ docker rm $(docker ps -aq)
$ docker run -d -v //c/Users/john/test/://home/seluser/Downloads -p 4445:4444 -p 5901:5900 selenium/standalone-firefox-debug:2.53.1
NOTE: we mapped to a directory in the Users/john space. User john is running docker toolbox
Running the below code
fprof <- makeFirefoxProfile(list(browser.download.dir = "/home/seluser/Downloads"
, browser.download.folderList = 2L
, browser.download.manager.showWhenStarting = FALSE
, browser.helperApps.neverAsk.saveToDisk = "application/zip"))
remDr <- remoteDriver(browserName = "firefox",remoteServerAddr = "",port = 4445L,extraCapabilities = fprof)
remDr$open(silent = TRUE)
# click year 2012
webElem <- remDr$findElement("name", "SelectedYear")
webElems <- webElem$findChildElements("css selector", "option")
webElems[[which(sapply(webElems, function(x){x$getElementText()}) == "2012" )]]$clickElement()
# click required quarter
webElem <- remDr$findElement("name", "SelectedQuarter")
webElems <- webElem$findChildElements("css selector", "option")
webElems[[which(sapply(webElems, function(x){x$getElementText()}) == "4th Quarter" )]]$clickElement()
# click button
webElem <- remDr$findElement("id", "downloadDataFile")
webElem$clickElement()
And checking the mapped download folder
> list.files("C://Users/john/test")
[1] "bhcf1212.zip"
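Because the click returns before the browser has finished writing the file, it can help to poll the mapped folder instead of listing it once. The helper below is a minimal sketch, not part of RSelenium: `wait_for_file` is a hypothetical name, and the temporary directory only stands in for the mapped download folder.

```r
# Poll a directory until a file matching `pattern` appears or `timeout` seconds pass.
# Returns the matching paths, or character(0) if nothing showed up in time.
wait_for_file <- function(dir, pattern, timeout = 30, interval = 0.5) {
  deadline <- Sys.time() + timeout
  repeat {
    hits <- list.files(dir, pattern = pattern, full.names = TRUE)
    # Firefox writes to a ".part" file while a download is still in progress
    hits <- hits[!grepl("\\.part$", hits)]
    if (length(hits) > 0) return(hits)
    if (Sys.time() >= deadline) return(character(0))
    Sys.sleep(interval)
  }
}

# Offline demonstration: a temporary directory stands in for the mapped folder.
demo_dir <- file.path(tempdir(), "downloads_demo")
dir.create(demo_dir, showWarnings = FALSE)
file.create(file.path(demo_dir, "example.zip"))

found <- wait_for_file(demo_dir, "\\.zip$", timeout = 2)
print(basename(found))
```

In the setup above, `dir` would be the host side of the volume mapping, for example the folder passed to list.files().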
In the end I decided to do a clean install of Docker for Windows (17.03.0, stable).
I had to reduce the number of available CPUs (to 1) and the available RAM (to 1 GB).
I also shared my C: drive (by the way, your Windows session must have a password, otherwise you can't share the directory).
After that I restarted my computer.
On the R side, do not forget to remove:
remoteServerAddr = ""
And I got the file.
My concern now is the stability of Docker: sometimes it runs, sometimes it doesn't.
Many thanks, John, for your help.

Scrape website with R by navigating doPostBack

I want to extract a table periodically from the site below.
The price list changes when the building block names (BLOK 16 A, BLOK 16 B, BLOK 16 C, ...) are clicked. The URL doesn't change; the page changes by triggering a doPostBack event.
I've tried three approaches after searching Google and Stack Overflow.
What I've tried, no. 1: this doesn't trigger the doPostBack event.
postForm( "http://www.kentkonut.com.tr/tr/modul/projeler/daire_fiyatlari.aspx?id=44", ctl00_ContentPlaceHolder1_DataList2_ctl03_lnk_blok="ctl00$ContentPlaceHolder1$DataList2$ctl03$lnk_blok")
What I've tried, no. 2: the Selenium server seems to work (http://localhost:4444/), but the remote driver doesn't navigate. It returns this error: (Error in checkError(res) :
Undefined error in httr call. httr output: length(url) == 1 is not TRUE)
remDr <- remoteDriver()
remDr <- remoteDriver(remoteServerAddr = "localhost"
, port = 4444L, browserName = "firefox")
What I've tried, no. 3: this is another way to trigger the doPostBack event. It doesn't navigate.
library(RCurl)
base.url <- "http://www.kentkonut.com.tr/tr/modul/projeler/"
event.target <- 'ctl00$ContentPlaceHolder1$DataList2$ctl03$lnk_blok'
action <- "daire_fiyatlari.aspx?id=44"
ftarget <- paste0(base.url, action)
dum <- getURL(ftarget)
event.val <- unlist(strsplit(dum,"__EVENTVALIDATION\" value=\""))[2]
event.val <- unlist(strsplit(event.val,"\" />\r\n\r\n<script"))[1]
view.state <- unlist(strsplit(dum,"id=\"__VIEWSTATE\" value=\""))[2]
view.state <- unlist(strsplit(view.state,"\" />\r\n\r\n\r\n<script"))[1]
web.data <- postForm(ftarget, "form name" = "ctl00_ContentPlaceHolder1_DataList2_ctl03_lnk_blok",
                     "method" = "POST",
                     "action" = action,
                     "id" = "ctl00_ContentPlaceHolder1_DataList2_ctl03_lnk_blok",
                     "__EVENTTARGET" = event.target,
                     "__EVENTVALIDATION" = event.val,
                     "__VIEWSTATE" = view.state)
thanks for your help.
t<-html_table(html_nodes(read_html(pgsession), css = "#ctl00_ContentPlaceHolder1_DataList1"), fill= TRUE)[[1]]
# in the above example change eventtarget as "ctl00$ContentPlaceHolder1$DataList2$ctl02$lnk_blok" to get different table
t<-html_table(html_nodes(read_html(page), css = "#ctl00_ContentPlaceHolder1_DataList1"), fill= TRUE)[[1]]
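An ASP.NET `__doPostBack` call submits the page's form with the hidden fields `__EVENTTARGET`, `__EVENTARGUMENT`, `__VIEWSTATE` and `__EVENTVALIDATION`. The strsplit calls in the third attempt above depend on the exact line endings after each tag, so a regular expression may be a more forgiving way to read those values. The sketch below runs on a made-up HTML string; `hidden_value` is a hypothetical helper and the field values are invented for illustration.

```r
# Pull the value of a hidden <input> by its id, independent of line endings.
hidden_value <- function(html, id) {
  pattern <- sprintf('id="%s"[^>]*value="([^"]*)"', id)
  m <- regmatches(html, regexec(pattern, html))[[1]]
  if (length(m) < 2) NA_character_ else m[2]
}

# Invented sample markup that mimics the hidden fields of a WebForms page.
sample_html <- paste0(
  '<form method="post" action="daire_fiyatlari.aspx?id=44">',
  '<input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="abc123" />',
  '<input type="hidden" name="__EVENTVALIDATION" id="__EVENTVALIDATION" value="xyz789" />',
  '</form>'
)

# The form fields a postback for one block link would carry.
fields <- list(
  `__EVENTTARGET`     = "ctl00$ContentPlaceHolder1$DataList2$ctl03$lnk_blok",
  `__EVENTARGUMENT`   = "",
  `__VIEWSTATE`       = hidden_value(sample_html, "__VIEWSTATE"),
  `__EVENTVALIDATION` = hidden_value(sample_html, "__EVENTVALIDATION")
)
str(fields)
```

A list like `fields` could then be sent as the form body of a POST to the same URL, with the values read from the live page rather than from the sample string.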

Using R to access FTP Server and Download Files Results in Status "530 Not logged in"

What I'm Attempting to Do
I'm attempting to download several weather data files from the US National Climatic Data Centre's FTP server but am running into problems with an error message after successfully completing several file downloads.
After successfully downloading two station/year combinations I start getting an error "530 Not logged in" message. I've tried starting at the offending year and running from there and get roughly the same results. It downloads a year or two of data and then stops with the same error message about not being logged in.
Working Example
Following is a working example (or not) with the output truncated and pasted below.
options(timeout = 300)
ftp <- "ftp://ftp.ncdc.noaa.gov/pub/data/gsod/"
td <- tempdir()
station <- c("983240-99999", "983250-99999", "983270-99999", "983280-99999", "984260-41231", "984290-99999", "984300-99999", "984320-99999", "984330-99999")
years <- 1960:2016
for (i in years) {
  remote_file_list <- RCurl::getURL(
    paste0(ftp, "/", i, "/"), ftp.use.epsv = FALSE, ftplistonly = TRUE,
    crlf = TRUE, ssl.verifypeer = FALSE)
  remote_file_list <- strsplit(remote_file_list, "\r*\n")[[1]]
  file_list <- paste0(station, "-", i, ".op.gz")
  file_list <- file_list[file_list %in% remote_file_list]
  file_list <- paste0(ftp, i, "/", file_list)
  Map(function(ftp, dest) utils::download.file(url = ftp,
                                               destfile = dest, mode = "wb"),
      file_list, file.path(td, basename(file_list)))
}
trying URL 'ftp://ftp.ncdc.noaa.gov/pub/data/gsod/1960/983250-99999-1960.op.gz'
Content type 'unknown' length 7135 bytes
downloaded 7135 bytes
trying URL 'ftp://ftp.ncdc.noaa.gov/pub/data/gsod/1961/984290-99999-1961.op.gz'
Content type 'unknown' length 7649 bytes
downloaded 7649 bytes
trying URL 'ftp://ftp.ncdc.noaa.gov/pub/data/gsod/1962/983250-99999-1962.op.gz'
downloaded 0 bytes
Error in utils::download.file(url = ftp, destfile = dest, mode = "wb") :
  cannot download all files
In addition: Warning message:
In utils::download.file(url = ftp, destfile = dest, mode = "wb") :
  URL 'ftp://ftp.ncdc.noaa.gov/pub/data/gsod/1962/983250-99999-1962.op.gz': status was '530 Not logged in'
Different Methods and Ideas I've Tried but Haven't Yet Been Successful
So far I've tried to slow the requests down using Sys.sleep in a for loop and any other manner of retrieving the files more slowly by opening then closing connections, etc. It's puzzling because: i) it works for a bit then stops and it's not related to the particular year/station combination per se; ii) I can use nearly the exact same code and download much larger annual files of global weather data without any errors over a long period of years like this; and iii) it's not always stopping after 1961 going to 1962, sometimes it stops at 1960 when it starts on 1961, etc., but it does seem to be consistently between years, not within from what I've found.
The login is anonymous, but you can use userpwd "ftp:your@email.address". So far I've been unsuccessful in using that method to ensure that I was logged in to download the station files.
I think you're going to need a more defensive strategy when working with this FTP server:
library(curl) # ++gd > RCurl
library(purrr) # consistent "data first" functional & piping idioms FTW
library(dplyr) # progress bar
# We'll use this to fill in the years
ftp_base <- "ftp://ftp.ncdc.noaa.gov/pub/data/gsod/%s/"
dir_list_handle <- new_handle(ftp_use_epsv=FALSE, dirlistonly=TRUE, crlf=TRUE,
ssl_verifypeer=FALSE, ftp_response_timeout=30)
# Since you, yourself, noted the server was perhaps behaving strangely or under load
# it's prbly a much better idea (and a practice of good netizenship) to cache the
# results somewhere predictable rather than a temporary, ephemeral directory
cache_dir <- "./gsod_cache"
dir.create(cache_dir, showWarnings=FALSE)
# Given the sporadic efficacy of server connection, we'll wrap our calls
# in safe & retry functions. Change this variable if you want to have it retry
# more times.
MAX_RETRIES <- 6
# Wrapping the memory fetcher (for dir listings)
s_curl_fetch_memory <- safely(curl_fetch_memory)
retry_cfm <- function(url, handle) {
  i <- 0
  repeat {
    i <- i + 1
    res <- s_curl_fetch_memory(url, handle=handle)
    if (!is.null(res$result)) return(res$result)
    if (i==MAX_RETRIES) { stop("Too many retries...server may be under load") }
  }
}
# Wrapping the disk writer (for the actual files)
# Note the use of the cache dir. It won't waste your bandwidth or the
# server's bandwidth or CPU if the file has already been retrieved.
s_curl_fetch_disk <- safely(curl_fetch_disk)
retry_cfd <- function(url, path) {
  # you should prbly be a bit more thorough than `basename` since
  # i think there are issues with the 1971 and 1972 filenames.
  # Gotta leave some work up to the OP
  cache_file <- sprintf("%s/%s", cache_dir, basename(url))
  if (file.exists(cache_file)) return()
  i <- 0
  repeat {
    i <- i + 1
    if (i==6) { stop("Too many retries...server may be under load") }
    res <- s_curl_fetch_disk(url, cache_file)
    if (!is.null(res$result)) return()
  }
}
# the stations and years
station <- c("983240-99999", "983250-99999", "983270-99999", "983280-99999",
             "984260-41231", "984290-99999", "984300-99999", "984320-99999",
             "984330-99999")
years <- 1960:2016
# progress indicators are like bowties: cool
pb <- progress_estimated(length(years))
walk(years, function(yr) {
  # the year we're working on
  year_url <- sprintf(ftp_base, yr)
  # fetch the directory listing
  tmp <- retry_cfm(year_url, handle=dir_list_handle)
  con <- rawConnection(tmp$content)
  fils <- readLines(con)
  close(con)
  # sift out only the target stations
  map(station, ~grep(., fils, value=TRUE)) %>%
    keep(~length(.)>0) %>%
    flatten_chr() -> fils
  # grab the stations files
  walk(paste(year_url, fils, sep=""), retry_cfd)
  # tick off progress
  pb$tick()$print()
})
You may also want to set curl_interrupt to TRUE in the curl handle if you want to be able to stop/esc/interrupt the downloads.
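The retry idea in this answer does not depend on curl or purrr. As a rough base-R illustration, the wrapper below retries any function a fixed number of times; `with_retry` and `flaky` are hypothetical names, and `flaky` only simulates a server that rejects the first two requests.

```r
# Wrap `f` so that errors are retried up to `max_retries` times,
# pausing a little longer after each failed attempt.
with_retry <- function(f, max_retries = 5, wait = 0) {
  function(...) {
    for (attempt in seq_len(max_retries)) {
      res <- tryCatch(f(...), error = function(e) e)
      if (!inherits(res, "error")) return(res)
      Sys.sleep(wait * attempt)
    }
    stop("Too many retries: ", conditionMessage(res))
  }
}

# Simulated download that fails twice before succeeding.
calls <- 0
flaky <- function(x) {
  calls <<- calls + 1
  if (calls < 3) stop("530 Not logged in")
  paste("fetched", x)
}

result <- with_retry(flaky)("983250-99999-1962.op.gz")
print(result)
```

A non-zero `wait` spaces the attempts out, which is gentler on a server that may be under load.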

Download a file from HTTPS using download.file()

I would like to read online data to R using download.file() as shown below.
URL <- "https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Fss06hid.csv"
download.file(URL, destfile = "./data/data.csv", method="curl")
Someone suggested to me that I add the line setInternet2(TRUE), but it still doesn't work.
The error I get is:
Warning messages:
1: running command 'curl "https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Fss06hid.csv" -o "./data/data.csv"' had status 127
2: In download.file(URL, destfile = "./data/data.csv", method = "curl", :
download had nonzero exit status
Appreciate your help.
It might be easiest to try the RCurl package. Install the package and try the following:
# install.packages("RCurl")
library(RCurl)
URL <- "https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Fss06hid.csv"
x <- getURL(URL)
## Or
## x <- getURL(URL, ssl.verifypeer = FALSE)
out <- read.csv(textConnection(x))
# 1 H 186 8 700 4 16
# 2 H 306 8 700 4 16
# 3 H 395 8 100 4 16
# 4 H 506 8 700 4 16
# 5 H 835 8 800 4 16
# 6 H 989 8 700 4 16
# [1] 6496 188
Here's an update as of Nov 2014. I find that setting method='curl' did the trick for me (while method='auto', does not).
For example:
# does not work
# does not work. this appears to be the default anyway
destfile='localfile.zip', method='auto')
# works!
destfile='localfile.zip', method='curl')
I've succeeded with the following code:
url = "http://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Fss06hid.csv"
x = read.csv(file=url)
Note that I've changed the protocol from https to http, since the first one doesn't seem to be supported in R.
If you get an SSL error from getURL() when using RCurl, set these options before calling getURL(). This sets the CurlSSL settings globally.
The extended code:
options(RCurlOptions = list(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl")))
URL <- "https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Fss06hid.csv"
x <- getURL(URL)
Worked for me on Windows 7 64-bit using R3.1.0!
I'm offering the curl package as an alternative that I found to be reliable when extracting large files from an online database. In a recent project, I had to download 120 files from an online database and found that it halved the transfer times and was much more reliable than download.file.
library(RCurl)  # getURL()
library(curl)   # curl_download()
ptm <- proc.time()
URL <- "https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Fss06hid.csv"
x <- getURL(URL)
proc.time() - ptm
ptm1 <- proc.time()
curl_download(url =URL ,destfile="TEST.CSV",quiet=FALSE, mode="wb")
proc.time() - ptm1
ptm2 <- proc.time()
y = download.file(URL, destfile = "./data/data.csv", method="curl")
proc.time() - ptm2
In this case, rough timing on your URL showed no consistent difference in transfer times. In my application, using curl_download in a script to select and download 120 files from a website decreased my transfer times from 2000 seconds per file to 1000 seconds and increased the reliability from 50% to 2 failures in 120 files. The script is posted in my answer to a question I asked earlier.
Try the following with large files:
library(data.table)
URL <- "http://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Fss06hid.csv"
x <- fread(URL)
Exit status 127 means "command not found".
In your case, the curl command was not found.
You need to install or reinstall curl. That's all. Get the latest version for your OS from http://curl.haxx.se/download.html
Close RStudio before installation.
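Before reinstalling, it may be worth checking from R whether a curl executable is visible on the PATH at all. This is a small sketch using base R only; the variable names are made up for illustration, and falling back to "auto" is just one possible choice.

```r
# Sys.which() returns "" when the command cannot be found on the PATH.
curl_path <- Sys.which("curl")
has_curl <- nzchar(curl_path)[[1]]

# Only ask download.file() for the external curl binary if it exists.
method <- if (has_curl) "curl" else "auto"
cat("curl found:", has_curl, "- using method:", method, "\n")

# download.file(URL, destfile = "./data/data.csv", method = method)
```

If `has_curl` is FALSE, method = "curl" will fail with status 127 as in the question, because download.file() shells out to the curl command.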
I had exactly the same problem as UseR (original question), and I'm also using Windows 7. I tried all the proposed solutions and they didn't work.
I resolved the problem as follows:
Using RStudio instead of the R console.
Updating R (from 3.1.0 to 3.1.1) so that the RCurl library runs properly on it. (I'm now using R 3.1.1 32-bit although my system is 64-bit.)
Typing the URL as https (secure connection) and with / instead of backslashes \\.
Setting method = "auto".
It works for me now. You should see the message:
Content type 'text/csv; charset=utf-8' length 9294 bytes
opened URL
downloaded 9294 bytes
You can set global options and try-
download.file(URL, destfile = "./data/data.csv", method="auto")
For issue refer to link-
Downloading files through the httr package also works:
URL <- "https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Fss06hid.csv"
httr::GET(URL,
          httr::write_disk(path = basename(URL),
                           overwrite = TRUE))
