Use RSelenium to deal with pop-ups

How can I navigate to a link, click a link, and scrape data from there?
I tried this code, without success.
library("RSelenium")
startServer()
mybrowser <- remoteDriver()
mybrowser$open()
mybrowser$navigate("https://finance.yahoo.com/quote/SBUX/balance-sheet?p=SBUX")
# click 'Quarterly' button...
Something that comes close is the following. I tested the updated code; the results are below.
> rm(list=ls())
>
>
> library("RSelenium")
> startServer()
Error: startServer is now defunct. Users in future can find the function in
file.path(find.package("RSelenium"), "examples/serverUtils"). The
recommended way to run a selenium server is via Docker. Alternatively
see the RSelenium::rsDriver function.
> mybrowser <- remoteDriver()
> mybrowser$open()
[1] "Connecting to remote server"
Error in checkError(res) :
Undefined error in httr call. httr output: Failed to connect to localhost port 4444: Connection refused
> mybrowser$navigate("https://finance.yahoo.com/quote/SBUX/balance-sheet?p=SBUX")
Error in checkError(res) :
Undefined error in httr call. httr output: length(url) == 1 is not TRUE
> mybrowser$findElement("xpath", "//button[text() = '
+
+ OK
+ ']")$clickElement()
Error in checkError(res) :
Undefined error in httr call. httr output: length(url) == 1 is not TRUE
> mybrowser$findElement("xpath", "//span[text() = 'Quarterly']")$clickElement()
Error in checkError(res) :
Undefined error in httr call. httr output: length(url) == 1 is not TRUE
>

I think you are most likely running into a consent pop-up on the website.
You can just "click" its OK button via:
mybrowser$findElement("xpath", "//button[text() = '
OK
']")$clickElement()
And then you can click "Quarterly" via:
mybrowser$findElement("xpath", "//span[text() = 'Quarterly']")$clickElement()
(Hint: to identify these kinds of errors it can be helpful to check the current state of the browser via mybrowser$screenshot(TRUE).)
I am not sure it is up to date, but certain data is also available via the API; you could check the quantmod package for easier access (see the sketch below).
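A minimal quantmod sketch of that route might look like the following. Note that this pulls price history and a delayed quote rather than the balance-sheet table itself; whether the line items you need are reachable this way depends on what Yahoo currently exposes:
library(quantmod)

# Daily price history for SBUX from Yahoo Finance
getSymbols("SBUX", src = "yahoo")
head(SBUX)

# Delayed quote as a quick sanity check
getQuote("SBUX")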
Full RSelenium example:
library("RSelenium")
startServer()
mybrowser <- remoteDriver()
mybrowser$open()
mybrowser$navigate("https://finance.yahoo.com/quote/SBUX/balance-sheet?p=SBUX")
mybrowser$findElement("xpath", "//button[text() = '
OK
']")$clickElement()
mybrowser$findElement("xpath", "//span[text() = 'Quarterly']")$clickElement()

Related

rscopus package on R in MacBook - Invalid API key error

I am trying to use the Scopus API for the first time. I have the API key and the institution token. However, I am still getting an error when I try to use it in R on my Mac. Here is my code:
library(rscopus)
set_api_key(MY_KEY)
hdr=inst_token_header(MY_TOKEN)
key=get_api_key()
print(rscopus::get_api_key(), reveal=TRUE)
have_api_key()
auth_info = process_author_name(last_name="Muschelli", first_name="John", verbose=FALSE)
The error message is:
> library(rscopus)
>
> set_api_key(MY_KEY)
> hdr=inst_token_header(MY_TOKEN)
> key=get_api_key()
> print(rscopus::get_api_key(), reveal=TRUE)
[1] "MY_KEY"
> have_api_key()
[1] TRUE
>
> if (have_api_key()) {
+ auth = elsevier_authenticate(api_key=key)
+ }
HTTP specified is: https://api.elsevier.com/authenticate
Warning message:
In elsevier_authenticate(api_key = key) : Forbidden (HTTP 403).
> auth_info = process_author_name(last_name="Muschelli", first_name="John", verbose=FALSE)
$`service-error`
$`service-error`$status
$`service-error`$status$statusCode
[1] "AUTHENTICATION_ERROR"
$`service-error`$status$statusText
[1] "Invalid API Key: valid apikey credentials required."
Error in get_complete_author_info(...) : Service Error
I tried
if (have_api_key()) {
auth = elsevier_authenticate(api_key=key)
}
but I get the error:
HTTP specified is: https://api.elsevier.com/authenticate
Warning message:
In elsevier_authenticate(api_key = key) : Forbidden (HTTP 403).
I have tried using auth_token_header(MY_TOKEN) instead of inst_token_header(MY_TOKEN) but the code is still not working.
I have also taken the following step in my terminal:
export Elsevier_API=MY_KEY > ~/.bash_profile
source ~/.bash_profile
I am still getting the error. However, the combination of key and institution token works here: https://dev.elsevier.com/scopus.html
Can anyone please help me debug this issue?
Thank You!
I figured out the issue. So, the correct way to query would be to pass the headers argument as well:
auth_info = process_author_name(last_name="Muschelli", first_name="John", verbose=FALSE, headers=hdr)
And now the code will work! :)
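Putting it together, the full working sequence would then look roughly like this (MY_KEY and MY_TOKEN stand in for your actual credentials):
library(rscopus)

set_api_key(MY_KEY)
hdr <- inst_token_header(MY_TOKEN)

# Pass the institution-token header along with the query
auth_info <- process_author_name(last_name = "Muschelli", first_name = "John",
                                 verbose = FALSE, headers = hdr)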

R: cannot read API call - Error in open.connection(con, "rb")

I can see the contents of the API call in a web browser, but I get this error with the jsonlite package's read_json:
Error in open.connection(con, "rb") : connection cannot be opened
Also: Warning message:
In open.connection(con, "rb") :
cannot open URL 'https://www.plazavea.com.pe/api/catalog_system/pub/products/search?fq=C:/678/687/&_from=21&_to=41&O=OrderByScoreDESC&': HTTP status was '206 Partial Content'
Code:
library(rvest)
library(tidyverse)
library(jsonlite)
api_request <- "https://www.plazavea.com.pe/api/catalog_system/pub/products/search?fq=C:/678/687/&_from=21&_to=41&O=OrderByScoreDESC&"
product_data <- jsonlite::read_json(api_request)
Use httr, then extract with as = 'text' and pass the result to parse_json(), or simply specify as = 'parsed' in the content() call on the response object.
library(httr)
api_request <- "https://www.plazavea.com.pe/api/catalog_system/pub/products/search?fq=C:/678/687/&_from=21&_to=41&O=OrderByScoreDESC&"
product_data <- content(httr::GET(api_request), as = 'parsed')
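The as = 'text' route mentioned above would look roughly like this (the encoding argument is an assumption, added to avoid httr guessing the encoding):
library(httr)
library(jsonlite)

resp <- GET(api_request)
# Extract the raw JSON text, then parse it explicitly
txt <- content(resp, as = 'text', encoding = 'UTF-8')
product_data <- parse_json(txt)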

R: Error in file(con, "r") : cannot open the connection

I'm trying to run an API request for a number of parameters with the lapply function in R.
However, when I run this function, I get the error "Error in file(con, "r") : cannot open the connection".
Google suggests using setInternet2(TRUE) to fix this issue; however, I get the error: Error: 'setInternet2' is defunct.
See help("Defunct").
library(jsonlite)

localisedDestinationNameForGivenLang <- function(LocationId) {
  gaiaURL <- paste0("https://URL/", LocationId, "?geoVersion=rwg&lcid=", "1036",
                    "&cid=geo&apk=explorer")
  print(LocationId)
  localisation <- fromJSON(gaiaURL)
}
lapply(uniqueLocationId, localisedDestinationNameForGivenLang)
Can someone suggest a fix please?
Here's a sample of how you could identify which sites are throwing errors while still getting responses from the ones that don't:
library(jsonlite)

urls = c("http://citibikenyc.com/stations/test", "http://citibikenyc.com/stations/json")

grab_data <- function(url) {
  out <- tryCatch(
    {fromJSON(url)},
    error = function(x) {
      message(paste(url, x))
      error_msg = paste(url, "threw an error")
      return(error_msg)
    })
  return(out)
}
result <- lapply(urls, grab_data)
result will be a list that contains the API response for URLs that work, and an error message for those that don't.
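If you then want to separate the successes from the failures, one simple follow-up (assuming the error branch above, which returns a character string) is:
# Successful calls return parsed lists; failures return a single character string
ok <- !sapply(result, is.character)
successes <- result[ok]
failures <- result[!ok]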

rJava: failing to load awt.events

Right now, I am trying to implement some event methods from java.awt.event; however, I cannot get it to load, since I always get a ClassNotFoundException error.
> library(rJava)
> .jinit()
[1] 0
> jEvents <- .jnew("java.awt.event")
Error in .jnew("java.awt.event") : java.lang.ClassNotFoundException
Edit:
Even if I try a specific class, I get an error message:
> library(rJava)
> .jinit()
> jEvents <- .jnew("java.awt.event.ActionEvent")
Error in .jnew("java.awt.event.ActionEvent") :
java.lang.NoSuchMethodError: <init>
It seems you are trying to instantiate a package instead of a class:
https://docs.oracle.com/javase/7/docs/api/java/awt/event/package-summary.html
Maybe you are looking for:
java.awt.event.ActionEvent
You can try this one:
library(rJava)
.jinit()
> EVT <- J("java.awt.event.ActionEvent")
> aEVT <- new(EVT, "StringObject", 1001L, "Hello")
> aEVT
[1] "Java-Object{java.awt.event.ActionEvent[ACTION_PERFORMED,cmd=Hello,when=0,modifiers=] on Str}"
You have to call the constructor with the required parameters. Note that ActionEvent doesn't have a default constructor.
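If you prefer to stay with .jnew as in the question, a rough equivalent is to pass the fully qualified class name in JNI notation plus the constructor arguments; the explicit .jcast to java.lang.Object is an assumption on my part, to help rJava match the ActionEvent(Object, int, String) constructor:
library(rJava)
.jinit()

# Build the event source and cast it to Object so the constructor signature matches
src <- .jcast(.jnew("java/lang/String", "StringObject"), "java/lang/Object")
aEVT <- .jnew("java/awt/event/ActionEvent", src, 1001L, "Hello")
aEVT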
You can find a nice resource here:
http://www.deducer.org/pmwiki/pmwiki.php?n=Main.Development#wwjoir

Can't create dplyr src backed by SparkSQL in dplyr.spark.hive package

Recently I found out about the great dplyr.spark.hive package that enables dplyr frontend operations with a Spark or Hive backend.
There is information on how to install this package in the package's README:
options(repos = c("http://r.piccolboni.info", unlist(options("repos"))))
install.packages("dplyr.spark.hive")
and there are also many examples of how to work with dplyr.spark.hive when one is already connected to hiveServer2 - check this.
But I am not able to connect to hiveServer2, so I cannot benefit from the great power of this package...
I've tried the commands below, but they did not work. Does anyone have a solution or comment on what I am doing wrong?
> library(dplyr.spark.hive,
+ lib.loc = '/opt/wpusers/mkosinski/R/x86_64-redhat-linux-gnu-library/3.1')
Warning: changing locked binding for ‘over’ in ‘dplyr’ whilst loading ‘dplyr.spark.hive’
Warning: changing locked binding for ‘partial_eval’ in ‘dplyr’ whilst loading ‘dplyr.spark.hive’
Warning: changing locked binding for ‘default_op’ in ‘dplyr’ whilst loading ‘dplyr.spark.hive’
Warning messages:
1: replacing previous import by ‘purrr::%>%’ when loading ‘dplyr.spark.hive’
2: replacing previous import by ‘purrr::order_by’ when loading ‘dplyr.spark.hive’
>
> Sys.setenv(SPARK_HOME = "/opt/spark-1.5.0-bin-hadoop2.4")
> Sys.setenv(HIVE_SERVER2_THRIFT_BIND_HOST = 'tools-1.hadoop.srv')
> Sys.setenv(HIVE_SERVER2_THRIFT_PORT = '10000')
>
> my_db = src_SparkSQL()
Error in .jfindClass(as.character(driverClass)[1]) : class not found
>
> my_db = src_SparkSQL(host = 'jdbc:hive2://tools-1.hadoop.srv:10000/loghost;auth=noSasl',
+ port = 10000)
Error in .jfindClass(as.character(driverClass)[1]) : class not found
>
> my_db = src_SparkSQL(start.server = TRUE)
Error in start.server() :
Couldn't start thrift server:org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 running as process 37580. Stop it first.
In addition: Warning message:
running command 'cd /opt/tech/prj_bdc/pmozie_status/user_topics;/opt/spark-1.5.0-bin-hadoop2.4/sbin/start-thriftserver.sh ' had status 1
>
> my_db = src_SparkSQL(start.server = TRUE,
+ list(spark.num.executors='5', spark.executor.cores='5', master="yarn-client"))
Error in start.server() :
Couldn't start thrift server:org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 running as process 37580. Stop it first.
In addition: Warning message:
running command 'cd /opt/tech/prj_bdc/pmozie_status/user_topics;/opt/spark-1.5.0-bin-hadoop2.4/sbin/start-thriftserver.sh ' had status 1
EDIT 2
I have set more system variables like this, but now I receive a warning telling me that some kind of Java logging configuration is not specified, although I think it is:
> library(dplyr.spark.hive,
+ lib.loc = '/opt/wpusers/mkosinski/R/x86_64-redhat-linux-gnu-library/3.1')
Warning messages:
1: replacing previous import by ‘purrr::%>%’ when loading ‘dplyr.spark.hive’
2: replacing previous import by ‘purrr::order_by’ when loading ‘dplyr.spark.hive’
3: package ‘SparkR’ was built under R version 3.2.1
>
> Sys.setenv(SPARK_HOME = "/opt/spark-1.5.0-bin-hadoop2.4")
> Sys.setenv(HIVE_SERVER2_THRIFT_BIND_HOST = 'tools-1.hadoop.srv')
> Sys.setenv(HIVE_SERVER2_THRIFT_PORT = '10000')
> Sys.setenv(HADOOP_JAR = "/opt/spark-1.5.0-bin-hadoop2.4/lib/spark-assembly-1.5.0-hadoop2.4.0.jar")
> Sys.setenv(HADOOP_HOME="/usr/share/hadoop")
> Sys.setenv(HADOOP_CONF_DIR="/etc/hadoop")
> Sys.setenv(PATH='/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/usr/share/hadoop/bin:/opt/hive/bin')
>
>
> my_db = src_SparkSQL()
log4j:WARN No appenders could be found for logger (org.apache.hive.jdbc.Utils).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
My log properties are not empty.
-bash-4.2$ wc /etc/hadoop/log4j.properties
179 432 6581 /etc/hadoop/log4j.properties
EDIT 3
My exact call to src_SparkSQL() is
> detach("package:SparkR", unload=TRUE)
Warning message:
package ‘SparkR’ was built under R version 3.2.1
> detach("package:dplyr", unload=TRUE)
> library(dplyr.spark.hive, lib.loc = '/opt/wpusers/mkosinski/R/x86_64-redhat-linux-gnu-library/3.1')
Warning: changing locked binding for ‘over’ in ‘dplyr’ whilst loading ‘dplyr.spark.hive’
Warning: changing locked binding for ‘partial_eval’ in ‘dplyr’ whilst loading ‘dplyr.spark.hive’
Warning: changing locked binding for ‘default_op’ in ‘dplyr’ whilst loading ‘dplyr.spark.hive’
Warning messages:
1: replacing previous import by ‘purrr::%>%’ when loading ‘dplyr.spark.hive’
2: replacing previous import by ‘purrr::order_by’ when loading ‘dplyr.spark.hive’
> Sys.setenv(HADOOP_JAR = "/opt/spark-1.5.0-bin-hadoop2.4/lib/spark-assembly-1.5.0-hadoop2.4.0.jar")
> Sys.setenv(HIVE_SERVER2_THRIFT_BIND_HOST = 'tools-1.hadoop.srv')
> Sys.setenv(HIVE_SERVER2_THRIFT_PORT = '10000')
> my_db = src_SparkSQL()
log4j:WARN No appenders could be found for logger (org.apache.hive.jdbc.Utils).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
And then the process never finishes.
The same settings work for beeline with these parameters:
beeline -u "jdbc:hive2://tools-1.hadoop.srv:10000/loghost;auth=noSasl" -n mkosinski --outputformat=tsv --incremental=true -f sql_statement.sql > sql_output
but I am not able to pass the user name and dbname to src_SparkSQL(),
so I have tried to manually use the code from inside that function, but I run into the same problem: the code below also does not finish.
host = 'tools-1.hadoop.srv'
port = 10000
driverclass = "org.apache.hive.jdbc.HiveDriver"
Sys.setenv(HADOOP_JAR = "/opt/spark-1.5.0-bin-hadoop2.4/lib/spark-assembly-1.5.0-hadoop2.4.0.jar")
library(RJDBC)
dr = JDBC(driverclass, Sys.getenv("HADOOP_JAR"))
url = paste0("jdbc:hive2://", host, ":", port)
class = "Hive"
con.class = paste0(class, "Connection") # class = "Hive"
# dbConnect_retry =
# function(dr, url, retry){
# if(retry > 0)
# tryCatch(
# dbConnect(drv = dr, url = url),
# error =
# function(e) {
# Sys.sleep(0.1)
# dbConnect_retry(dr = dr, url = url, retry - 1)})
# else dbConnect(drv = dr, url = url)}
#################
##con = new(con.class, dbConnect_retry(dr, url, retry = 100))
#################
con = new(con.class, dbConnect(dr, url, user = "mkosinski", dbname = "loghost"))
Maybe the url should also contain /loghost, the dbname?
I now see that you tried multiple things with multiple errors. Let me comment error by error.
my_db = src_SparkSQL()
Error in .jfindClass(as.character(driverClass)[1]) : class not found
The RJDBC object could not be created. Unless we solve this, nothing else will work, workarounds or not. Have you set HADOOP_JAR with, for instance,
Sys.setenv(HADOOP_JAR = "../spark/assembly/target/scala-2.10/spark-assembly-1.5.0-hadoop2.6.0.jar")? Sorry, I seem to have skipped this in the instructions. Will fix.
my_db = src_SparkSQL(host = 'jdbc:hive2://tools-1.hadoop.srv:10000/loghost;auth=noSasl',
+ port = 10000)
Error in .jfindClass(as.character(driverClass)[1]) : class not found
Same problem. Please note the host and port arguments do not accept URL syntax, just a host name and a port number; the URL is formed internally.
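In other words, the call should look something like this (host value taken from your question):
# host and port only; not host = "jdbc:hive2://tools-1.hadoop.srv:10000/loghost;auth=noSasl"
my_db = src_SparkSQL(host = "tools-1.hadoop.srv", port = 10000)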
my_db = src_SparkSQL(start.server = TRUE)
Error in start.server() :
Couldn't start thrift server:org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 running as process 37580. Stop it first.
In addition: Warning message:
running command 'cd /opt/tech/prj_bdc/pmozie_status/user_topics;/opt/spark-1.5.0-bin-hadoop2.4/sbin/start-thriftserver.sh ' had status 1
Stop the thriftserver first or connect to the existing one, but you still have to fix the class-not-found problem.
my_db = src_SparkSQL(start.server = TRUE,
+ list(spark.num.executors='5', spark.executor.cores='5', master="yarn-client"))
Error in start.server() :
Couldn't start thrift server:org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 running as process 37580. Stop it first.
In addition: Warning message:
running command 'cd /opt/tech/prj_bdc/pmozie_status/user_topics;/opt/spark-1.5.0-bin-hadoop2.4/sbin/start-thriftserver.sh ' had status 1
Same as above.
Plan:
1. Set HADOOP_JAR. Find the host and port of the running thriftserver, if not the defaults. Try src_SparkSQL with start.server = FALSE. If you are happy, quit; else go to step 2.
2. Stop the existing thriftserver. Try src_SparkSQL again with start.server = TRUE.
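Step 1 as code might look like this (the jar path and host are taken from your question, and start.server = FALSE is assumed to be the explicit form of the default; adjust to your setup):
Sys.setenv(HADOOP_JAR = "/opt/spark-1.5.0-bin-hadoop2.4/lib/spark-assembly-1.5.0-hadoop2.4.0.jar")
my_db = src_SparkSQL(host = "tools-1.hadoop.srv", port = 10000, start.server = FALSE)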
Let me know how things go.
The problem was that I didn't specify the proper classPath needed inside the JDBC() function that creates the driver. The classPath in the dplyr.spark.hive package is passed via the HADOOP_JAR environment variable.
To use JDBC as a driver to hiveServer2 (through the Thrift protocol) one needs to add at least these 3 .jars with Java classes to create a proper driver:
hive-jdbc-1.0.0-standalone.jar
hadoop/common/lib/commons-configuration-1.6.jar
hadoop/common/hadoop-common-2.4.1.jar
The versions are arbitrary and should be compatible with the installed versions of the local Hive, Hadoop and hiveServer2.
They need to be joined with .Platform$path.sep (as described here):
classPath = c("system_path1_to_hive/hive/lib/hive-jdbc-1.0.0-standalone.jar",
              "system_path1_to_hadoop/hadoop/common/lib/commons-configuration-1.6.jar",
              "system_path1_to_hadoop/hadoop/common/hadoop-common-2.4.1.jar")
Sys.setenv(HADOOP_JAR = paste0(classPath, collapse = .Platform$path.sep))
Then, once HADOOP_JAR is set, one has to be careful with the hiveServer2 URL. In my case it had to be:
host = 'tools-1.hadoop.srv'
port = 10000
url = paste0("jdbc:hive2://", host, ":", port, "/loghost;auth=noSasl")
and finally the proper connection to hiveServer2 using the RJDBC package is:
Sys.setenv(HADOOP_HOME="/usr/share/hadoop/share/hadoop/common/")
Sys.setenv(HIVE_HOME = '/opt/hive/lib/')
host = 'tools-1.hadoop.srv'
port = 10000
url = paste0("jdbc:hive2://", host, ":", port, "/loghost;auth=noSasl")
driverclass = "org.apache.hive.jdbc.HiveDriver"
library(RJDBC)
.jinit()
dr2 = JDBC(driverclass,
classPath = c("/opt/hive/lib/hive-jdbc-1.0.0-standalone.jar",
#"/opt/hive/lib/commons-configuration-1.6.jar",
"/usr/share/hadoop/share/hadoop/common/lib/commons-configuration-1.6.jar",
"/usr/share/hadoop/share/hadoop/common/hadoop-common-2.4.1.jar"),
identifier.quote = "`")
url = paste0("jdbc:hive2://", host, ":", port, "/loghost;auth=noSasl")
dbConnect(dr2, url, username = "mkosinski") -> cont
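A quick sanity check (a sketch, relying on the DBI interface that RJDBC implements) is to run a simple query against the new connection:
library(DBI)

# List the tables visible through the connection
dbGetQuery(cont, "show tables")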
