web scraping RSelenium findElement - r

I feel this is supposed to be simple but I have been struggling to get it right. I'm trying to extract the Employees number ("2,300,000") from this webpage: https://fortune.com/company/walmart/
I used Chrome's SelectorGadget extension to locate the number: "info__row--7f9lE:nth-child(13) .info__value--2AHH7"
```
library(RSelenium)
library(rvest)
library(netstat)
rs_driver_object<-rsDriver(browser='chrome',chromever='103.0.5060.53',verbose=FALSE, port=free_port())
remDr<-rs_driver_object$client
remDr$navigate('https://fortune.com/company/walmart/')
Employees<-remDr$findElement(using = 'xpath','//h3[@class="info__row--7f9lE:nth-child(13) .info__value--2AHH7"]')
Employees
```
An error says
> "Selenium message:no such element: Unable to locate element".
I have also tried:
```
Employees<-remDr$findElement(using = 'class name','info__value--2AHH7')
```
But it does not return the data in the form I want.
Can someone point out the problem? Really appreciate it!
Updated
I modified the code as suggested by Frodo in the answer below to apply it to multiple webpages and save the statistics in a data frame, but I still encounter an error.
```
library(RSelenium)
library(rvest)
library(netstat)

rs_driver_object <- rsDriver(browser = 'chrome', chromever = '103.0.5060.53', verbose = FALSE, port = netstat::free_port())
remDr <- rs_driver_object$client

Data <- data.frame("url" = c("https://fortune.com/company/walmart/",
                             "https://fortune.com/company/amazon-com/",
                             "https://fortune.com/company/apple/",
                             "https://fortune.com/company/cvs-health/",
                             "https://fortune.com/company/jpmorgan-chase/",
                             "https://fortune.com/company/verizon/",
                             "https://fortune.com/company/ford-motor/",
                             "https://fortune.com/company/general-motors/",
                             "https://fortune.com/company/anthem/",
                             "https://fortune.com/company/centene/",
                             "https://fortune.com/company/fannie-mae/",
                             "https://fortune.com/company/comcast/",
                             "https://fortune.com/company/chevron/",
                             "https://fortune.com/company/dell-technologies/",
                             "https://fortune.com/company/bank-of-america-corp/",
                             "https://fortune.com/company/target/"))

Data$numEmp <- NA_character_  # the scraped values are strings like "2,300,000"

for (i in 1:length(Data$url)) {
  remDr$navigate(url = Data$url[i])
  pgSrc <- remDr$getPageSource()
  pgCnt <- read_html(pgSrc[[1]])
  Data$numEmp[i] <- pgCnt %>%
    html_nodes(xpath = "//div[text()='Employees']/following-sibling::div") %>%
    html_text(trim = TRUE)
}

Data$numEmp
```
```
Selenium message:unknown error: unexpected command response
  (Session info: chrome=103.0.5060.114)
Build info: version: '4.0.0-alpha-2', revision: 'f148142cf8', time: '2019-07-01T21:30:10'
System info: host: 'DESKTOP-VCCIL8P', ip: '192.168.1.249', os.name: 'Windows 10', os.arch: 'amd64', os.version: '10.0', java.version: '1.8.0_311'
Driver info: driver.version: unknown

Error:   Summary: UnknownError
         Detail: An unknown server-side error occurred while processing the command.
         class: org.openqa.selenium.WebDriverException
         Further Details: run errorDetails method
```
Can someone please take another look?

Use RSelenium to load up the webpage and get the page source:
```
remDr$navigate(url = 'https://fortune.com/company/walmart/')
pgSrc <- remDr$getPageSource()
```
Use rvest to read the contents of the webpage:
```
pgCnt <- read_html(pgSrc[[1]])
```
Further, use the rvest::html_nodes and rvest::html_text functions to extract the text using relevant XPath selectors (this Chrome extension should help):
```
reqTxt <- pgCnt %>%
  html_nodes(xpath = "//div[text()='Employees']/following-sibling::div") %>%
  html_text(trim = TRUE)
```
Output of reqTxt:
```
> reqTxt
[1] "2,300,000"
```
UPDATE
The error Selenium message:unknown error: unexpected command response seems to occur specifically with version 103 of ChromeDriver. More info here. One of the answers there suggested a simple wait of 5 seconds before and after the driver navigates to the URL. I have also used tryCatch inside a while loop so the code keeps retrying until the page loads. This seems to work.
```
# Function to fetch employee count
getEmployees <- function(myURL) {
  pagestatus <<- 0
  while (pagestatus == 0) {
    tryCatch(
      expr = {
        remDr$navigate(url = myURL)
        pagestatus <<- 1
      },
      error = function(error) {
        pagestatus <<- 0
      }
    )
  }
  pgSrc <- remDr$getPageSource()
  pgCnt <- read_html(pgSrc[[1]])
  return(pgCnt %>%
           html_nodes(xpath = "//div[text()='Employees']/following-sibling::div") %>%
           html_text(trim = TRUE))
}
```
Apply this function to all of the URLs in your data frame:
```
for (i in 1:nrow(Data)) {
  Sys.sleep(5)
  Data[i, 2] <- getEmployees(Data[i, 1])
  Sys.sleep(5)
}
```
Now when we look at the output of the second column:
```
> Data[, 2]
 [1] "2,300,000" "1,608,000" "154,000"   "258,000"   "271,025"   "118,400"
 [7] "183,000"   "157,000"   "98,200"    "72,500"    "7,400"     "189,000"
[13] "42,595"    "133,000"   "208,248"   "450,000"
```

Does it have to be with RSelenium only? In my experience, the most flexible approach is to use RSelenium to navigate to the required pages (where findElement helps you find boxes to enter text into or buttons to click) and then use rvest to extract what you need from the page.
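For example, if the page required an interaction before scraping (this particular one does not), the findElement part might look something like the sketch below, where the CSS selector is purely hypothetical:
```
# hypothetical interaction step: click a button before grabbing the page source
btn <- remDr$findElement(using = "css selector", value = "button.load-more")
btn$clickElement()
```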
Start with:
```
rs_driver_object <- rsDriver(browser = 'chrome', chromever = '103.0.5060.53', verbose = FALSE, port = netstat::free_port())
remDr <- rs_driver_object$client
remDr$navigate('https://fortune.com/company/walmart/')
page_source <- remDr$getPageSource()
pg <- xml2::read_html(page_source[[1]])
```
How you then go about it depends on how specific you want the solution to be with respect to this exact page. Here is one way:
```
rvest::html_elements(pg, "div.info__row--7f9lE") |>
  rvest::html_text2()
```
or
```
rvest::html_elements(pg, "div:nth-child(13) > div.info__value--2AHH7") |>
  rvest::html_text2()
```
or
```
rvest::html_elements(pg, "div.info__row--7f9lE")[11] |>
  rvest::html_children()
```
or
```
rvest::html_elements(pg, '.info__row--7f9lE:nth-child(13) .info__value--2AHH7') |>
  rvest::html_text2()
```
et cetera. What you do in the rvest part would depend on how general you want the selection/extraction process to be.
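For instance, if you would rather not rely on the hashed class names (which may change), one more general sketch is to anchor on the visible "Employees" label, much like the other answer does:
```
rvest::html_elements(pg, xpath = "//div[text()='Employees']/following-sibling::div") |>
  rvest::html_text2()
```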

Related

Web scraping PDF files from a map

I've been trying to download PDFs embedded in a map following this code (the original can be found here). Each PDF refers to a Brazilian municipality (5,570 files).
```
library(XML)
library(RCurl)

url <- "http://simec.mec.gov.br/sase/sase_mapas.php?uf=RJ&tipoinfo=1"
page <- getURL(url)
parsed <- htmlParse(page)
links <- xpathSApply(parsed, path = "//a", xmlGetAttr, "href")
inds <- grep("*.pdf", links)
links <- links[inds]

regex_match <- regexpr("[^/]+$", links, perl = TRUE)
destination <- regmatches(links, regex_match)

for (i in seq_along(links)) {
  download.file(links[i], destfile = destination[i])
  Sys.sleep(runif(1, 1, 5))
}
```
I have already used this code in other projects a few times and it worked, but not for this specific case. In fact, I've tried many things to scrape these files and it seems impossible to me. Recently, I found the link pattern below, which makes it possible to combine uf (state) and muncod (municipal code) to download a file, but I don't know how to include this in the code.
```
http://simec.mec.gov.br/sase/sase_mapas.php?uf=MT&muncod=5100102&acao=download
```
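For a single state/municipality pair the pattern can be used on its own, e.g. (just a sketch of what I mean; only uf and muncod change):
```
# one-off download using the uf + muncod pattern above
one_url <- "http://simec.mec.gov.br/sase/sase_mapas.php?uf=MT&muncod=5100102&acao=download"
download.file(one_url, destfile = "5100102.pdf", mode = "wb")
```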
Thanks in advance!
```
devtools::install_github("ropensci/RSelenium")

library(rvest)
library(httr)
library(RSelenium)

# connect to selenium server from within r (REPLACE SERVER ADDRESS)
rem_dr <- remoteDriver(
  remoteServerAddr = "192.168.50.25", port = 4445L, browserName = "firefox"
)
rem_dr$open()

# get the two-digit state codes for brazil by scraping the below webpage
tables <- "https://en.wikipedia.org/wiki/States_of_Brazil" %>%
  read_html() %>%
  html_table(fill = T)
states <- tables[[4]]$Abbreviation

# for each state, we are going to navigate to the map of that state using
# selenium, then scrape the list of possible municipality codes from the drop
# down menu present in the map
get_munip_codes <- function(state) {
  url <- paste0("http://simec.mec.gov.br/sase/sase_mapas.php?uf=", state)
  rem_dr$navigate(url)
  # have to wait until the drop down menu loads. 8 seconds will be enough time
  # for each state
  Sys.sleep(8)
  src <- rem_dr$getPageSource()
  out <- read_html(src[[1]]) %>%
    html_nodes(xpath = "//select[@id='muncod']/option[boolean(@value)]") %>%
    xml_attrs("value") %>%
    unlist(use.names = F)
  print(state)
  out
}
state_munip <- sapply(
  states, get_munip_codes, USE.NAMES = TRUE, simplify = FALSE
)

# now you can download each pdf. first create a directory for each state, where
# the pdfs for that state will go:
lapply(names(state_munip), function(x) dir.create(file.path("brazil-pdfs", x)))

# ...then loop over each state/municipality code and download the pdf
lapply(
  names(state_munip), function(state) {
    lapply(state_munip[[state]], function(munip) {
      url <- sprintf(
        "http://simec.mec.gov.br/sase/sase_mapas.php?uf=%s&muncod=%s&acao=download",
        state, munip
      )
      file <- file.path("brazil-pdfs", state, paste0(munip, ".pdf"))
      this_one <- paste0("state ", state, ", munip ", munip)
      tryCatch({
        GET(url, write_disk(file, overwrite = TRUE))
        print(paste0(this_one, " downloaded"))
      },
      error = function(e) {
        print(paste0("couldn't download ", this_one))
        try(unlink(file, force = TRUE))
      })
    })
  }
)
```
STEPS:
1. Get the IP address of your Windows machine (see https://www.digitalcitizen.life/find-ip-address-windows).
2. Start the Selenium server docker container by running this:
```
docker run -d -p 4445:4444 selenium/standalone-firefox:2.53.1
```
3. Start the rocker/tidyverse docker container by running this:
```
docker run -v `pwd`/brazil-pdfs:/home/rstudio/brazil-pdfs -dp 8787:8787 rocker/tidyverse
```
4. Go into your preferred browser and enter this address: http://localhost:8787. This will take you to the login screen for RStudio Server. Log in using the username "rstudio" and password "rstudio".
5. Copy/paste the code shown above into a new RStudio .R document. Replace the value for remoteServerAddr with the IP address you found in step 1.
6. Run the code. This should write the PDFs to a directory "brazil-pdfs" that is both inside the container and mapped to your Windows machine (in other words, the PDFs will show up in the brazil-pdfs dir on your local machine as well). Note that it takes a while to run because there are a lot of PDFs.

Suppress but capture warnings in R

I am building a function to connect to a specific password-protected ODBC data source that will be used by many members of a team, possibly in multiple environments. In the event that the connection is rejected, I would like to display the warning messages but mask the password that's displayed. If I use suppressWarnings() nothing gets captured as far as I can tell, and if I don't, then the message is displayed in the standard output with the password. Here's the function so far:
```
connectToData <- function(uid, pswd, dsn='myDSN') {
  # Function to connect to myDSN data
  #
  # Args:
  #   uid:  The user's ID for connecting to the database
  #   pswd: The user's password for connecting to the database.
  #   dsn:  The DSN for the (already existing) ODBC connection to the 5G
  #         data. It must be set up on an individual Windows user's machine,
  #         and they could use any name for it. The default is 'myDSN'
  #
  # Returns:
  #   The 'RODBC' class object returned by the RODBC:odbcConnect() function.
  #
  # TODO: 1) See if you can specify the connection using odbcDriverConnect()
  #          so as to not rely on user's ODBC connections
  #       2) Capture warnings from odbcConnect() and print them while
  #          disguising password using gsub, as I've attempted to do below.
  library('RODBC')
  db.conn <- odbcConnect(dsn,
                         uid = uid,
                         pwd = pswd)
  if (class(db.conn) != 'RODBC') { # Error handling for connections that don't make it
    print(gsub(pswd, '******', warnings())) # This doesn't work like I want it to
    stop("ODBC connection could not be opened. See warnings()")
  } else {
    return(db.conn)
  }
}
```
When I run it with the right username/password, I get the right result but when I run it with a bad password, I get this:
```
> db.conn <- connectTo5G(uid='myID', pswd='badpassword', dsn='myDSN')
[1] "RODBC::odbcDriverConnect(\"DSN=myDSN;UID=myID;PWD=******\")"
[2] "RODBC::odbcDriverConnect(\"DSN=myDSN;UID=myID;PWD=******\")"
Error in connectTo5G(uid = "myID", pswd = "badpassword", dsn = "myDSN") :
  ODBC connection could not be opened. See warnings()
In addition: Warning messages:
1: In RODBC::odbcDriverConnect("DSN=myDSN;UID=myID;PWD=badpassword") :
  [RODBC] ERROR: state 28000, code 1017, message [Oracle][ODBC][Ora]ORA-01017: invalid username/password; logon denied
2: In RODBC::odbcDriverConnect("DSN=myDSN;UID=myID;PWD=badpassword") :
  ODBC connection failed
```
The print(gsub(...)) appears to work on the most recent warnings from before the function was invoked, and it also only prints the function call that produced the warning, not the text of the warning.
What I would like to do is capture everything after "In addition: Warning messages:" so that I can use gsub() on it, but avoid printing it before the gsub() gets a chance to work on it. I think I need to use withCallingHandlers() but I've looked through the documentation and examples and I cannot figure it out.
Some extra background: This is an Oracle database that locks users out after three attempts to connect so I want to use stop() in case someone writes code that calls this function multiple times. Different users in my group work in both Windows and Linux (sometimes going back and forth) so any solution needs to be flexible.
Catching error messages
I do not fully understand what you want to accomplish with ODBC, but in terms of converting the error message, you can use tryCatch as @joran suggested:
```
pswd = 'badpassword'

# Just as a reproducible example, a function which fails and outputs badpassword
failing <- function(){
  badpassword == 1
}

# This would be the error handling part
tryCatch(failing(),
         error = function(e) gsub(pswd, '******', e))

[1] "Error in failing(): object '******' not found\n"
```
e in this case is the error condition, and you could think of other ways to manipulate what is printed to the screen so that it is not as easy to guess the password from what was replaced. Note, for example, that 'object' would have been replaced as well if the password had happened to be 'object', and matching parts of words get replaced too. At the very least, it would make sense to include word boundaries in the gsub command:
```
pswd = 'ling'

failing <- function(){
  ling == 1
}

tryCatch(failing(),
         error = function(e) gsub(paste0("\\b", pswd, "\\b"), '******', e))

[1] "Error in failing(): object '******' not found\n"
```
For other improvements you should look closely at the specific error messages.
Warnings
tryCatch can also manipulate warnings:
```
pswd = 'ling'

failing <- function(){
  warning("ling")
  ling == 1
}

tryCatch(failing(),
         warning = function(w) gsub(paste0("\\b", pswd, "\\b"), '******', w),
         error = function(e) gsub(paste0("\\b", pswd, "\\b"), '******', e))

[1] "simpleWarning in failing(): ******\n"
```
This will not show the error then, however.
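If you still want execution to stop with a masked message, one option (just a sketch of the same idea) is to re-signal the error inside the handler:
```
tryCatch(failing(),
         error = function(e) stop(gsub(paste0("\\b", pswd, "\\b"), '******',
                                       conditionMessage(e)), call. = FALSE))
```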
withCallingHandlers
If you really want to catch all output from errors and warnings, you do indeed need withCallingHandlers, which works mostly in the same way, except that it does not terminate the rest of the evaluation.
```
pswd = 'ling'

failing <- function(pswd){
  warning(pswd)
  warning("asd")
  stop(pswd)
}

withCallingHandlers(failing(pswd),
                    warning = function(w) {
                      w <- gsub(paste0("\\b", pswd, "\\b"), '******', w)
                      warning(w)
                    },
                    error = function(e) {
                      e <- gsub(paste0("\\b", pswd, "\\b"), '******', e)
                      stop(e)
                    })
```
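Applied to the original odbcConnect() call, the same idea might look like the sketch below (untested; it assumes the standard RODBC::odbcConnect(dsn, uid, pwd) signature, and uses invokeRestart("muffleWarning") so the unmasked warning is not also printed):
```
library(RODBC)

connectToData <- function(uid, pswd, dsn = 'myDSN') {
  db.conn <- withCallingHandlers(
    odbcConnect(dsn, uid = uid, pwd = pswd),
    warning = function(w) {
      # show the warning text with the password masked, then muffle the original
      message(gsub(pswd, '******', conditionMessage(w), fixed = TRUE))
      invokeRestart("muffleWarning")
    }
  )
  if (!inherits(db.conn, 'RODBC')) {
    stop("ODBC connection could not be opened; see the masked warnings above.")
  }
  db.conn
}
```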

R: Error in file(con, "r") : cannot open the connection

I'm trying to run an API request for a number of parameters with the lapply function in R.
However, when I run this function, I get the error "Error in file(con, "r") : cannot open the connection".
Google suggests using setInternet2(TRUE) to fix this issue; however, I get the error: Error: 'setInternet2' is defunct. See help("Defunct").
```
library(jsonlite)  # assuming fromJSON() comes from jsonlite

localisedDestinationNameForGivenLang <- function (LocationId) {
  gaiaURL <- paste0("https://URL/", LocationId, "?geoVersion=rwg&lcid=", "1036",
                    "&cid=geo&apk=explorer")
  print(LocationId)
  localisation <- fromJSON(gaiaURL)
}

lapply(uniqueLocationId, localisedDestinationNameForGivenLang)
```
Can someone suggest a fix please?
Here's a sample of how you could identify which sites throw errors while still getting responses from the ones that don't:
```
library(jsonlite)  # assuming fromJSON() comes from jsonlite, as in the question

urls = c("http://citibikenyc.com/stations/test", "http://citibikenyc.com/stations/json")

grab_data <- function(url) {
  out <- tryCatch(
    {fromJSON(url)},
    error = function(x) {
      message(paste(url, x))
      error_msg = paste(url, "threw an error")
      return(error_msg)
    })
  return(out)
}

result <- lapply(urls, grab_data)
```
result will be a list that contains the API response for URLs that work and an error message for those that don't.
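If you then want to separate the successes from the failures, one possible sketch (assuming successful fromJSON() calls return parsed lists rather than character strings):
```
# split results into successes and failures
ok <- !vapply(result, is.character, logical(1))
good_responses <- result[ok]
failed_urls <- urls[!ok]
```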

Deployed R app with Shiny crash with 'could not find function "httpdPort"'

I have an application built with Shiny (a tutorial, where ui.R and server.R are taken from here: http://shiny.rstudio.com/tutorial/lesson1/).
I have these two files in shiny-frontend folder, and if I runApp("shiny-frontend") locally in RStudio - everything works great and I see the tutorial in my browser.
Now I want the same app to be put into Bluemix via cloudfoundry. I'm using this: http://www.ibm.com/developerworks/library/ba-rtwitter-app/ as a tutorial, but struggling with an error.
I have a start.r file which I run as R -f ./start.r --gui-none --no-save. I'm using the https://github.com/virtualstaticvoid/heroku-buildpack-r buildpack.
My start.r looks like this (taken from the Bluemix tutorial with very minor modifications):
```
library(shiny)

if (Sys.getenv('VCAP_APP_PORT') == "") {
  print("Running Shiny")
  runApp("shiny-frontend")
} else {
  # In case we're on Cloudfoundry, run this:
  print('running on CF')

  # Starting Rook server during CF startup phase - after 60 seconds start the actual Shiny server
  library(Rook)
  myPort <- as.numeric(Sys.getenv('VCAP_APP_PORT'))
  myInterface <- Sys.getenv('VCAP_APP_HOST')
  status <- -1

  # R 2.15.1 uses .Internal, but the next release of R will use a .Call.
  # Either way it starts the web server.
  if (as.integer(R.version[["svn rev"]]) > 59600) {
    status <- .Call(tools:::startHTTPD, myInterface, myPort)
  } else {
    status <- .Internal(startHTTPD(myInterface, myPort))
  }

  if (status == 0) {
    unlockBinding("httpdPort", environment(tools:::startDynamicHelp))
    assign("httpdPort", myPort, environment(tools:::startDynamicHelp))

    s <- Rhttpd$new()
    s$listenAddr <- myInterface
    s$listenPort <- myPort
    s$print()
    Sys.sleep(60)
    s$stop()
  }

  # run shiny server
  sink(stderr())
  options(bitmapType = 'cairo')
  getOption("bitmapType")
  print("test")
  write("prints to stderr", stderr())
  write("prints to stdout", stdout())
  write(myPort, stdout())
  runApp('shiny-frontend', port = myPort, host = "0.0.0.0", launch.browser = F)
}
```
And my init.r looks like this:
```
install.packages("shiny", clean=T)
install.packages("Rook", clean=T)
```
Then when I run it, everything deploys correctly, but when I try to access the route, I see an error in the log:
```
* ERR Calls: <Anonymous> -> startDynamicHelp
* ERR Execution halted
* ERR Error in startDynamicHelp(FALSE) : could not find function "httpdPort"
```
I also noticed that the assigned port is different every time, which is weird, and the route in the Bluemix dashboard does not mention it; I output the port to the log and use that number.
Also, the way I'm doing it seems a bit too complicated, so if anybody could suggest an easier way, I'd appreciate it.
It took me a while to understand that this error is thrown by R because it cannot find the function (not the value) httpdPort. Instead of binding httpdPort to a function, you are binding it to a value. The line s$stop() is the one causing trouble: it calls startDynamicHelp, which assumes that httpdPort is a function defined in the tools environment.
To fix this issue you can change the block if (status == 0){...} in your code to:
```
if (status == 0) {
  getSettable <- function(default){
    function(obj = NA){
      if (!is.na(obj)) {default <<- obj}
      default
    }
  }
  myHttpdPort <- getSettable(myPort)

  unlockBinding("httpdPort", environment(tools:::startDynamicHelp))
  assign("httpdPort", myHttpdPort, environment(tools:::startDynamicHelp))

  s <- Rhttpd$new()
  s$listenAddr <- myInterface
  s$listenPort <- myPort
  s$print()
  Sys.sleep(60)
  s$stop()
}
```

system open RStudio close connection

I'm attempting to use R to open a .Rproj file used in RStudio. I have succeeded with the code below (stolen from Ananda here). However, the connection to open RStudio called from R is not closed after the file is opened. How can I sever this "connection" after the .Rproj file is opened? (PS this has not been tested on Linux or Mac yet).
```
## Create dummy .Rproj
x <- c("Version: 1.0", "", "RestoreWorkspace: Default", "SaveWorkspace: Default",
       "AlwaysSaveHistory: Default", "", "EnableCodeIndexing: Yes",
       "UseSpacesForTab: No", "NumSpacesForTab: 4", "Encoding: UTF-8",
       "", "RnwWeave: knitr", "LaTeX: pdfLaTeX")

loc <- file.path(getwd(), "Bar.rproj")
cat(paste(x, collapse = "\n"), file = loc)

## wheresRstudio function to find RStudio location
wheresRstudio <- function() {
  myPaths <- c("rstudio", "~/.cabal/bin/rstudio",
               "~/Library/Haskell/bin/rstudio", "C:\\PROGRA~1\\RStudio\\bin\\rstudio.exe",
               "C:\\RStudio\\bin\\rstudio.exe")
  panloc <- Sys.which(myPaths)
  temp <- panloc[panloc != ""]
  if (identical(names(temp), character(0))) {
    ans <- readline("RStudio not installed in one of the typical locations.\n
Do you know where RStudio is installed? (y/n) ")
    if (ans == "y") {
      temp <- readline("Enter the (unquoted) path to RStudio: ")
    } else {
      if (ans == "n") {
        stop("RStudio not installed or not found.")
      }
    }
  }
  temp
}

## function to open .Rproj files
open_project <- function(Rproj.loc) {
  action <- paste(wheresRstudio(), Rproj.loc)
  message("Preparing to open project!")
  system(action)
}

## Test it (it works but does not close)
open_project(loc)
```
It's not clear what you're trying to do exactly. What you've described doesn't really sound to me like a "connection" -- it's a system call.
I think what you're getting at is that after you run open_project(loc) in your above example, you don't get your R prompt back until you close the instance of RStudio that was opened by your function. If that is the case, you should add wait = FALSE to your system call.
You might also need to add an ignore.stderr = TRUE in there to get directly back to the prompt. I got some error about "QSslSocket: cannot resolve SSLv2_server_method" on my Ubuntu system, and after I hit "enter" it took me back to the prompt. ignore.stderr can bypass that (but might also mean that the user doesn't get meaningful errors in the case of serious errors).
In other words, I would change your open_project() function to the following and see if it does what you expect:
```
open_project <- function(Rproj.loc) {
  action <- paste(wheresRstudio(), Rproj.loc)
  message("Preparing to open project!")
  system(action, wait = FALSE, ignore.stderr = TRUE)
}
```
