I'm using phantomJS to collect data from different sites. During scraping I experience a lot of crashes when parsing sites or site elements. Unfortunately, neither phantomJS nor RSelenium prints any information or bug report to the console. The script just hangs without any warning: I can see that it is still executing, but nothing actually happens. The only way to stop it is to restart R manually. After several tests I found that phantomJS usually hangs while executing remDr$findElements() commands. I reran my code using Firefox and RSelenium, and it works normally, so the problem seems to lie in how phantomJS works. Has anyone experienced anything similar when running phantomJS? Is it possible to fix this misbehavior?
I'm using:
Windows 7
Selenium 2.0
R version 3.1.3
phantomjs-2.0.0-windows
My code:
# starting phantom server driver
phantomjsdir <- paste(mywd, "/phantomjs-2.0.0-windows/bin/phantomjs.exe", sep="" )
phantomjsUserAgent <- "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.89 Safari/537.36 OPR/28.0.1750.48"
eCap <- list(phantomjs.binary.path = phantomjsdir, phantomjs.page.settings.userAgent = phantomjsUserAgent )
pJS <- phantom(pjs_cmd = phantomjsdir)
remDr <- remoteDriver(browserName = "phantomjs", extraCapabilities = eCap)
remDr$open(silent = FALSE)
mywords <- c("canon 600d", "sony 58k","nikon","nikon2","nikon 800","nikon 80","nikon 8")
timeout <- 3
#'
#' Executing script
#'
for (word in mywords) {
print(paste0("searching for: ",word))
ss.word <- word
remDr$navigate("http://google.com")
webElem <- remDr$findElement(using = "class", "gsfi")
webElem$sendKeysToElement(list(enc2utf8(ss.word),key = "enter"))
Sys.sleep(1)
print (remDr$executeScript("return document.readyState;")[[1]])
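# totalwait was never initialised or incremented; track the seconds waited so the loop can give up
totalwait <- 0
while (remDr$executeScript("return document.readyState;")[[1]] != "complete" && totalwait < 10) {
Sys.sleep(timeout)
totalwait <- totalwait + timeout
}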
print(paste0("search completed: ",ss.word))
elem.snippet <- remDr$findElements(using="class name",value = "rc")
# seq_along() skips the loop when no elements were found (1:length() would iterate over 1 and 0)
for (i in seq_along(elem.snippet)) {
print(paste0("element opened: ",ss.word," pos",i))
print(elem.snippet[[i]])
ss.snippet.code <- elem.snippet[[i]]$getElementAttribute('innerHTML')
print(paste0("element element innerHTML ok"))
elemtitle <- elem.snippet[[i]]$findChildElement(using = "class name", value = "r")
print(paste0("element title ok"))
elemcode <- elemtitle$getElementAttribute('innerHTML')
print(paste0("element innerHTML ok"))
elemtext <- elem.snippet[[i]]$findChildElement(using = "class name", value = "st")
ss.text <- elemtext$getElementText()[[1]]
print(paste0("element loaded: ",ss.word," pos",i))
elemloc <- elem.snippet[[i]]$getElementLocation()
elemsize <- elem.snippet[[i]]$getElementSize()
print(paste0("element location parsed: ",ss.word," pos",i))
}
print(paste0("data collected: ",ss.word))
}
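The readiness loop in the script above could be factored into a small polling helper with a hard time limit. This is only a sketch: `wait_until`, `is_ready`, `max_wait` and `interval` are hypothetical names, and the condition used here is a stand-in so that the code runs without a browser. It caps how long the script polls, but it would not rescue a call that itself blocks inside phantomJS.

```r
# Hypothetical helper: poll `is_ready()` until it returns TRUE or `max_wait`
# seconds have elapsed. Returns TRUE on success, FALSE on timeout.
wait_until <- function(is_ready, max_wait = 10, interval = 0.5) {
  deadline <- Sys.time() + max_wait
  repeat {
    if (isTRUE(is_ready())) return(TRUE)
    if (Sys.time() >= deadline) return(FALSE)
    Sys.sleep(interval)
  }
}

# Stand-in condition so the sketch runs without a browser:
# it reports "ready" on the third poll.
polls <- 0
fake_ready <- function() {
  polls <<- polls + 1
  polls >= 3
}

ok <- wait_until(fake_ready, max_wait = 5, interval = 0.1)
timed_out <- !wait_until(function() FALSE, max_wait = 0.3, interval = 0.1)
```

With RSelenium, the condition would presumably be something like function() remDr$executeScript("return document.readyState;")[[1]] == "complete".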
I was running a script for webscraping in RStudio and got the following error:
Selenium message:javascript error: this.each is not a function
(Session info: chrome=81.0.4044.129)
Build info: version: '4.0.0-alpha-2', revision: 'f148142cf8', time: '2019-07-01T21:30:10'
System info: host: 'xxxxxx', ip: 'xxx.xxx.x.xxx', os.name: 'Windows 10', os.arch: 'amd64', os.version: '10.0', java.version: '1.8.0_231'
Driver info: driver.version: unknown
Error: Summary: JavaScriptError
Detail: An error occurred while executing user supplied JavaScript.
class: org.openqa.selenium.JavascriptException
Further Details: run errorDetails method
I don't really understand what the problem is and how I might solve it.
Does anyone know how to solve this problem? I am still quite new to this, so concrete steps would be very helpful.
Thank you in advance!
Edit: This is the script I'm using. The error seems to occur just before "#end of the main loop".
library(data.table) # Required for rbindlist
library(dplyr) # Required to use the pipes %>% and some table manipulation commands
library(magrittr) # Required to use the pipes %>%
library(rvest) # Required for read_html
library(RSelenium) # Required for webscraping with javascript
library(lubridate) # Required to collect dates
library(stringr)
library(purrr)
options(stringsAsFactors = F) #needed to prevent errors when merging data frames
#Paste the GoodReads Url
url <- "https://www.goodreads.com/book/show/1885.Pride_and_Prejudice?ac=1&from_search=true&qid=VkA2NbcGBa&rank=1"
languageOnly = F #If FALSE, "all languages" is chosen
#Set your browser settings
rD <- rsDriver(port = 4585L, browser = "chrome", chromever = "81.0.4044.69")
remDr <- rD[["client"]]
remDr$setTimeout(type = "implicit", 2000)
remDr$navigate(url)
bookTitle = unlist(remDr$getTitle())
finalData = data.frame()
# Main loop going through the website pages
morePages = T
pageNumber = 1
while(morePages){
#Select reviews in correct language.
#It should also work if you only fill in the numeral language code, and leave the first one empty.
selectLanguage = if(languageOnly){
selectLanguage = remDr$findElement("xpath", "//select[@id='language_code']/option[@value='']")
} else {
selectLanguage = remDr$findElement("xpath", "//select[@id='language_code']/option[5]")
}
selectLanguage$clickElement()
Sys.sleep(3)
#Expand all reviews
expandMore <- remDr$findElements("link text", "...more")
sapply(expandMore, function(x) x$clickElement())
#Extracting the reviews from the page
reviews <- remDr$findElements("css selector", "#bookReviews .stacked")
reviews.html <- lapply(reviews, function(x){x$getElementAttribute("outerHTML")[[1]]})
reviews.list <- lapply(reviews.html, function(x){read_html(x) %>% html_text()} )
reviews.text <- unlist(reviews.list)
#Some reviews have only rating and no text, so we process them separately
onlyRating = unlist(map(1:length(reviews.text), function(i) str_detect(reviews.text[i], "^\\\n\\\n")))
#Full reviews
if(sum(!onlyRating) > 0){
filterData = reviews.text[!onlyRating]
fullReviews = purrr::map_df(seq(1, length(filterData), by=2), function(i){
review = unlist(strsplit(filterData[i], "\n"))
data.frame(
date = mdy(review[2]), #date
username = str_trim(review[5]), #user
rating = str_trim(review[9]), #overall
comment = str_trim(review[12]) #comment
)
})
#Add review text to full reviews
fullReviews$review = unlist(purrr::map(seq(2, length(filterData), by=2), function(i){
str_trim(str_remove(filterData[i], "\\s*\\n\\s*\\(less\\)"))
}))
} else {
fullReviews = data.frame()
}
#partial reviews (only rating)
if(sum(onlyRating) > 0){
filterData = reviews.text[onlyRating]
partialReviews = purrr::map_df(1:length(filterData), function(i){
review = unlist(strsplit(filterData[i], "\n"))
data.frame(
date = mdy(review[9]), #date
username = str_trim(review[4]), #user
rating = str_trim(review[8]), #overall
comment = "",
review = ""
)
})
} else {
partialReviews = data.frame()
}
finalData = rbind(finalData, fullReviews, partialReviews)
#Go to next page if possible
nextPage = remDr$findElements("xpath", "//a[@class='next_page']")
if(length(nextPage) > 0){
message(paste("PAGE", pageNumber, "Processed - Going to next"))
nextPage[[1]]$clickElement()
pageNumber = pageNumber + 1
Sys.sleep(2)
} else {
message(paste("PAGE", pageNumber, "Processed - Last page"))
morePages = FALSE
}
}
#end of the main loop
#Replace missing ratings by 'not rated'
finalData$rating = ifelse(finalData$rating == "", "not rated", finalData$rating)
#Stop server
rD[["server"]]$stop()
#set directory to where you wish the file to go
#copy your working directory and exchange all backward slashes with forward slashes
getwd()
setwd("C:/Users/ledgreve/Desktop/GoodReads_TextMining-master/Scripts/New Scripts/Test1")
#Write results
write.csv(finalData, paste0(bookTitle, ".csv"), row.names = F)
message("FINISHED!")
Just my own update: this issue was resolved after I reinstalled Java and installed rJava (https://cimentadaj.github.io/blog/2018-05-25-installing-rjava-on-windows-10/installing-rjava-on-windows-10/)
I recently noticed that the install.pandoc function in the installr package appears to be broken.
I get the following error message:
trying URL 'https://github.com/'
Content type 'text/html; charset=utf-8' length unknown
downloaded 78 KB
github.com is not compatible with the version of Windows you're running. Check your computer's system information and then contact the software publisher.
It looks like the function is not finding the appropriate file from GitHub. I have submitted a pull request to the installr package on GitHub which corrects this error.
Here is the function that was submitted as a pull request; it should install Pandoc correctly, in case you run into this error before it is fixed.
library(installr)
FixedInstall.Pandoc <- function (URL = "https://github.com/jgm/pandoc/releases", use_regex = TRUE,
to_restart, ...)
{
URL <- "https://github.com/jgm/pandoc/releases"
page_with_download_url <- URL
if (!use_regex)
warning("use_regex is no longer supported, you can stop using it from now on...")
page <- readLines(page_with_download_url, warn = FALSE)
sysArch <- Sys.getenv("R_ARCH")
sysArch <- gsub("/ |/x", "", sysArch)
pat <- paste0("jgm/pandoc/releases/download/[0-9.]+/pandoc-[0-9.-]+-windows",".*", sysArch, ".*", ".msi")
target_line <- grep("windows", page, value = TRUE)
m <- regexpr(pat, target_line)
URL <- regmatches(target_line, m)
URL <- head(URL, 1)
URL <- paste("https://github.com/", URL, sep = "")
installed <- install.URL(URL, ...)
if (!installed)
return(invisible(FALSE))
if (missing(to_restart)) {
if (is.windows()) {
you_should_restart <- "You should restart your computer\n in order for pandoc to work properly"
winDialog(type = "ok", message = you_should_restart)
choices <- c("Yes", "No")
question <- "Do you want to restart your computer now?"
the_answer <- menu(choices, graphics = "TRUE", title = question)
to_restart <- the_answer == 1L
}
else {
to_restart <- FALSE
}
}
if (to_restart)
os.restart()
}
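To illustrate what the pattern-matching step in the function does, here is a minimal sketch that applies the same pattern to a single made-up line of HTML. The version number and file name in `target_line` are hypothetical, and the architecture tag assumes that R_ARCH is "/x64"; the real releases page may be laid out differently.

```r
# Hypothetical line of HTML, standing in for one line of the releases page.
target_line <- '<a href="/jgm/pandoc/releases/download/9.9.9/pandoc-9.9.9-windows-x86_64.msi" rel="nofollow">'

# Assumed value of R_ARCH on 64-bit R; the gsub() from the function reduces it to "64".
sysArch <- gsub("/ |/x", "", "/x64")

# Same pattern as in the function above.
pat <- paste0("jgm/pandoc/releases/download/[0-9.]+/pandoc-[0-9.-]+-windows",
              ".*", sysArch, ".*", ".msi")

# Extract the relative download path and turn it into a full URL.
m <- regexpr(pat, target_line)
rel_path <- regmatches(target_line, m)
URL <- paste0("https://github.com/", head(rel_path, 1))
print(URL)
```

If the page contains no line matching the pattern, regmatches() returns an empty vector, so checking length(rel_path) before calling install.URL() would be a reasonable guard.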
How do you launch TorBrowser in RSelenium?
I tried this to no avail:
library(RSelenium)
browserP <- "C:/Users/Administrator/Desktop/Tor Browser/Browser/firefox.exe"
jArg <- paste0("-Dwebdriver.firefox.bin=\"", browserP, "\"")
pLoc <- "C:/Users/Administrator/Desktop/Tor Browser/Browser/TorBrowser/Data/Browser/profile.meek-http-helper/"
jArg <- c(jArg, paste0("-Dwebdriver.firefox.profile=\"", pLoc, "\""))
selServ <- RSelenium::startServer(javaargs = jArg)
Error: startServer is now defunct. Users in future can find the function in
file.path(find.package("RSelenium"), "examples/serverUtils"). The
recommended way to run a selenium server is via Docker. Alternatively
see the RSelenium::rsDriver function.
rsDriver doesn't take a javaargs argument, and I can't figure out how to get this to work either:
fprof <- getFirefoxProfile("C:/Users/Administrator/Desktop/Tor Browser/Browser/TorBrowser/Data/Browser/profile.meek-http-helper/", useBase = T)
remDr <- remoteDriver(extraCapabilities = list(marionette = TRUE))
I'm trying to download files with RSelenium, but it seems impossible. I can't manage a download even with a simple example:
1) I installed Docker Toolbox (https://cran.r-project.org/web/packages/RSelenium/vignettes/RSelenium-docker.html)
2) I ran the Firefox standalone image 3.1.0, and I am now testing the older 2.52.0
3) I installed the RSelenium package on my R x64 3.3.2 and read all the questions and answers on Stack Overflow
4) I tried the following code. By the way, when I inspect the Firefox options in about:config, I can't find the "browser.download.dir" option:
require(RSelenium)
fprof <- makeFirefoxProfile(list(browser.download.dir = "C:/temp"
, browser.download.folderList = 2L
, browser.download.manager.showWhenStarting = FALSE
, browser.helperApps.neverAsk.saveToDisk = "application/zip"))
remDr <- remoteDriver(browserName = "firefox",remoteServerAddr = "192.168.99.100",port = 4445L,extraCapabilities = fprof)
remDr$open(silent = TRUE)
remDr$navigate("https://www.chicagofed.org/applications/bhc/bhc-home")
# click year 2012
webElem <- remDr$findElement("name", "SelectedYear")
webElems <- webElem$findChildElements("css selector", "option")
webElems[[which(sapply(webElems, function(x){x$getElementText()}) == "2012" )]]$clickElement()
# click required quarter
webElem <- remDr$findElement("name", "SelectedQuarter")
Sys.sleep(1)
webElems <- webElem$findChildElements("css selector", "option")
webElems[[which(sapply(webElems, function(x){x$getElementText()}) == "4th Quarter" )]]$clickElement()
# click button
webElem <- remDr$findElement("id", "downloadDataFile")
webElem$clickElement()
6) I get no error, but I get no file
7) In the end, I would like to download the Excel file on this page with RSelenium:
https://app2.msci.com/products/indexes/performance/country_chart.html?asOf=Feb%2028,%202010&size=30&scope=C&style=C&currency=15&priceLevel=0&indexId=83#
If you are using Docker Toolbox with Windows, you may have issues mapping volumes; see "Docker: Sharing a volume on Windows with Docker toolbox".
If you are using Docker Machine on Mac or Windows, your Docker daemon has only limited access to your OS X or Windows filesystem. Docker Machine tries to auto-share your /Users (OS X) or C:\Users (Windows) directory.
I initiated a clean install of docker toolbox on a windows 10 box and ran the following image:
$ docker stop $(docker ps -aq)
$ docker rm $(docker ps -aq)
$ docker run -d -v //c/Users/john/test/://home/seluser/Downloads -p 4445:4444 -p 5901:5900 selenium/standalone-firefox-debug:2.53.1
NOTE: we mapped to a directory in the Users/john space. User john is running docker toolbox
Running the below code
require(RSelenium)
fprof <- makeFirefoxProfile(list(browser.download.dir = "home/seluser/Downloads"
, browser.download.folderList = 2L
, browser.download.manager.showWhenStarting = FALSE
, browser.helperApps.neverAsk.saveToDisk = "application/zip"))
remDr <- remoteDriver(browserName = "firefox",remoteServerAddr = "192.168.99.100",port = 4445L,extraCapabilities = fprof)
remDr$open(silent = TRUE)
remDr$navigate("https://www.chicagofed.org/applications/bhc/bhc-home")
# click year 2012
webElem <- remDr$findElement("name", "SelectedYear")
webElems <- webElem$findChildElements("css selector", "option")
webElems[[which(sapply(webElems, function(x){x$getElementText()}) == "2012" )]]$clickElement()
# click required quarter
webElem <- remDr$findElement("name", "SelectedQuarter")
Sys.sleep(1)
webElems <- webElem$findChildElements("css selector", "option")
webElems[[which(sapply(webElems, function(x){x$getElementText()}) == "4th Quarter" )]]$clickElement()
# click button
webElem <- remDr$findElement("id", "downloadDataFile")
webElem$clickElement()
And checking the mapped download folder
> list.files("C://Users/john/test")
[1] "bhcf1212.zip"
>
In the end I decided to do a clean install of the stable Docker for Windows (17.03.0).
I had to reduce the number of available CPUs (to 1) and the available RAM (to 1 GB).
I also shared my C: drive (by the way, your Windows session must have a password, otherwise you can't share the directory).
After that I restarted my computer.
On the R side, do not forget to remove:
remoteServerAddr = "192.168.99.100"
and I got the file.
My concern now is the stability of Docker: sometimes it runs, sometimes it doesn't.
Many thanks, John, for your help.
I'm trying to scrape a set of web pages with the new rvest package. It works for most of the web pages but when there are no tabular entries for a particular letter, an error is returned.
# install the packages you need, as appropriate
install.packages("devtools")
library(devtools)
install_github("hadley/rvest")
library(rvest)
This code works OK because there are entries for the letter E on the web page.
# works OK
url <- "https://www.propertytaxcard.com/ShopHillsborough/participants/alph/E"
pg <- html_session(url, user_agent("Mozilla/5.0 (Windows NT 5.1; rv:31.0) Gecko/20100101 Firefox/31.0"))
pg %>% html_nodes(".sponsor-info .bold") %>% html_text()
This doesn't work because there are no entries for the letter F on the web page. The error message is "Error in class(out) <- "XMLNodeSet" : attempt to set an attribute on NULL"
# yields error message
url <- "https://www.propertytaxcard.com/ShopHillsborough/participants/alph/F"
pg <- html_session(url, user_agent("Mozilla/5.0 (Windows NT 5.1; rv:31.0) Gecko/20100101 Firefox/31.0"))
pg %>% html_nodes(".sponsor-info .bold") %>% html_text()
Any suggestions? Thanks in advance.
You could always wrap the pg…html_nodes…html_text in try and test for the class afterwards:
tmp <- try(pg %>% html_nodes(".sponsor-info .bold") %>% html_text(), silent=TRUE)
if (class(tmp) == "character") {
print("do stuff")
} else {
print("do other stuff")
}
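The same idea can also be written with tryCatch, so that a failure yields an empty character vector and the caller tests length() rather than class(). This is only a sketch: `safe_text` is a hypothetical helper, and the two "page" functions are stand-ins so that the code runs without rvest or a network connection.

```r
# Hypothetical wrapper: evaluate `extract()` and fall back to an empty
# character vector if it throws an error.
safe_text <- function(extract) {
  tryCatch(extract(), error = function(e) character(0))
}

# Stand-ins for a page with entries and a page without entries.
page_with_entries <- function() c("Sponsor A", "Sponsor B")
page_without_entries <- function() stop("attempt to set an attribute on NULL")

found <- safe_text(page_with_entries)
none <- safe_text(page_without_entries)

if (length(none) == 0) {
  print("no entries on this page")
}
```

With rvest, the call would presumably look like safe_text(function() pg %>% html_nodes(".sponsor-info .bold") %>% html_text()).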
EDIT: another option is to use the XPath boolean() function and do the test that way:
html_nodes_exist <- function(rvest_session, xpath) {
xpathApply(content(rvest_session$response, as="parsed"),
sprintf("boolean(%s)", xpath))
}
pg %>% html_nodes_exist("//td[@class='sponsor-info']/span[@class='bold']")
which will return TRUE if those nodes exist and FALSE if they don't. That function needs to be generalized to accept both session and ["HTMLInternalDocument" "HTMLInternalDocument" "XMLInternalDocument" "XMLAbstractDocument"] objects and to work with CSS selectors as well as XPath, but it's a way to avoid try.