PhantomJS access internal server without TLD via Selenium - r

I'm working on an internal web scraper at my company to pull names and IDs from a SharePoint list. Let's assume that this is the only way I can access the information. From R, I'm using RSelenium to control PhantomJS (I've successfully connected to external servers and pulled in data, so I know it's working in general). When I navigate to the site http://teams4/AllItems.aspx, PhantomJS decides to be helpful and changes the URL to teams4.com. I'm using cntlm as a local authentication proxy in front of the big corporate proxy (it lets Python and R connect to the Internet); without it I couldn't get to the internet at all. I also tried connecting directly to the corporate proxy. Is there a way to force PhantomJS to resolve the name the way it's given?
library(RSelenium)
library(rvest)
library(magrittr)
# Parse the page source returned by RSelenium into an HTML document
beautifulSoup = function(source){
  s = source %>%
    extract2(1) %>%
    read_html()  # html() is deprecated in rvest; read_html() replaces it
  return(s)
}
# Start PhantomJS behind the local cntlm proxy
pJS = phantom(pjs_cmd="C:/phantomjs2/bin/phantomjs.exe",extras="--proxy=localhost:3128")
remDr = remoteDriver(browserName = 'phantomjs')
remDr$open()
output = data.frame(Name="test",ID=0)
remDr$navigate('http://teams4/AllItems.aspx')
soup = beautifulSoup(remDr$getPageSource())
# XPath addresses attributes with @, not #
allRequestors = soup %>%
  html_nodes(xpath='//td[contains(@class,"ms-vb-user")]')

Related

RSelenium with Chrome: how to keep information?

I have been trying to do web scraping with RSelenium using the Chrome browser.
To access the page with the information I need, I first need to scan a QR code.
In a regular browser, I do it just once and can then close and reopen the browser as many times as I want. With RSelenium, I need to scan the QR code every time I open the browser.
Is there a way to keep this information, for example through the cache or cookies?
I have been trying:
rD <- RSelenium::rsDriver(browser = "chrome",
                          chromever = "108.0.5359.71",
                          port = netstat::free_port(),
                          verbose = FALSE)
remDr <- rD[["client"]]
remDr$navigate("https://google.com/")
remDr$maxWindowSize()
remDr$getAllCookies()
But I'm getting only an empty list.

How to use a proxy api in R

I'm using a proxy service IP in R, which I have configured by following:
https://support.rstudio.com/hc/en-us/articles/200488488-Configuring-R-to-Use-an-HTTP-or-HTTPS-Proxy
This service, which I'm not sure I can mention here, also offers a proxy API.
e.g. https://xxxx.com/proxy-api/7976ed1223443d907283443dffc961ff2c9bb219_993353-2033906
I'm using the proxy service for a read_html script from library(rvest).
Would it be possible, if more efficient, to use the proxy API rather than modifying the Renviron.site file?
Solved it as follows:
library(jsonlite)
json_file <- "https://xxxx.com/proxy-api/7976ed1223443d907283443dffc961ff2c9bb219_993353-2033906"
json_data <- fromJSON(json_file, flatten=TRUE)
x<-sample(1:5, 1)
Sys.setenv(http_proxy = json_data[x])
Sys.setenv(https_proxy = json_data[x])
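A slightly more defensive variant of the snippet above, as a sketch only. It assumes the API response boils down to a plain character vector of proxy URLs (the actual response format is not shown in the post), and `set_random_proxy` is a hypothetical helper name. It picks the index from the real length of the list instead of a hard-coded 1:5.

```r
# Hypothetical helper: pick one proxy at random and export it for this session.
# `proxies` is assumed to be a character vector of proxy URLs.
set_random_proxy <- function(proxies) {
  stopifnot(is.character(proxies), length(proxies) > 0)
  # sample.int avoids the surprising behaviour of sample() on length-1 input
  chosen <- proxies[sample.int(length(proxies), 1)]
  Sys.setenv(http_proxy = chosen, https_proxy = chosen)
  invisible(chosen)
}

# Placeholder addresses for illustration, not real proxies
proxies <- c("http://10.0.0.1:8080", "http://10.0.0.2:8080")
chosen <- set_random_proxy(proxies)
```

Setting the variables with Sys.setenv() only affects the current R session, which is the main difference from editing Renviron.site.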

R - using RSelenium to log into website (Captcha, and staying logged in)

I want to use RSelenium to access and scrape a website each day. Something I've noticed is that when I open the website in a regular Chrome browser, I am already logged in from the last time I visited it. However, if I use RSelenium to open a remote driver and visit the page with it, I am not logged in. Logging in is usually simple enough on most sites, but this website has a Captcha that makes it more difficult.
Is there any way the remote driver can access the website with me already logged in?
Example of my code below:
this_URL = "my_url_goes_here"
startServer()
remDr = remoteDriver$new(browserName = 'chrome')
Sys.sleep(2); remDr$open();
Sys.sleep(4); remDr$navigate(this_URL);
login_element = remDr$findElement(using = "id", "login-link")
login_element$
After clicking the login_element link, it brings me to the page where I input my username, password, and click the captcha / do what it asks.
Thanks,
It should work using Firefox and Firefox profiles as follows:
Set up Firefox access:
Open Firefox and log in as usual. Make sure that you stay logged in after closing and reopening Firefox.
Figure out the location of your default Firefox profile:
This should be something like: (source + more details)
Windows: %AppData%\Mozilla\Firefox\Profiles\xxxxxxxx.default
Mac: ~/Library/Application Support/Firefox/Profiles/xxxxxxxx.default/
Linux: ~/.mozilla/firefox/xxxxxxxx.default/
Start a new RSelenium driver and set the profile as follows:
require(RSelenium)
eCap <- list("webdriver.firefox.profile" = "MySeleniumProfile")
remDr <- remoteDriver(browserName = "firefox", extraCapabilities = eCap)
remDr$open()
The Firefox window that opens should use your chosen profile.
I did this a while ago; if I remember correctly, it works like this.
P.S.: You could also create a new Firefox profile for this. To do that, follow the steps in the link above.
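To find the profile folder from R rather than by hand, something like the following could work. This is a sketch under assumptions: `firefox_profile_root` is a made-up helper, and it relies on the usual default locations (%AppData%\Mozilla\Firefox\Profiles on Windows, ~/Library/Application Support/Firefox/Profiles on macOS, ~/.mozilla/firefox on Linux), which a custom install may not follow.

```r
# Hypothetical helper: return the directory that normally holds Firefox profiles.
# `sysname` defaults to the current OS as reported by Sys.info().
firefox_profile_root <- function(sysname = Sys.info()[["sysname"]]) {
  switch(sysname,
    Windows = file.path(Sys.getenv("APPDATA"), "Mozilla", "Firefox", "Profiles"),
    Darwin  = path.expand("~/Library/Application Support/Firefox/Profiles"),
    path.expand("~/.mozilla/firefox")  # Linux and other Unix-likes
  )
}

root <- firefox_profile_root()
# List candidate default profiles; empty if Firefox has never been started here
profiles <- list.files(root, pattern = "\\.default", full.names = TRUE)
```

`profiles` then holds the full paths of any xxxxxxxx.default folders found under that root.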

Authenticating google sheets on AWS Ubuntu without browser

I'm running RStudio on an AWS "Ubuntu Server 12.04.2" instance and accessing it via my browser.
When I try to authenticate against the Google auth API using the googlesheets package with the code:
gs_auth(token = NULL, new_user = FALSE,
        key = getOption("googlesheets.client_id"),
        secret = getOption("googlesheets.client_secret"),
        cache = getOption("googlesheets.httr_oauth_cache"), verbose = TRUE)
The problem is that it redirects me to the browser on my local (Windows) machine.
Even if I authorize it, it redirects to a URL like "http://localhost:1410/?state=blahblah&code=blahblah".
How do I authorize googlesheets in such a case?
I have even tried transferring an existing httr-oauth token from my Windows machine to the remote Ubuntu server.
The simplest way to create a gs_auth token from a server is to set the httr_oob_default option to TRUE, which tells httr to use the out-of-band method for authenticating. You will be given a URL and expected to paste back an authorization code.
library(googlesheets)
options(httr_oob_default=TRUE)
gs_auth(new_user = TRUE)
gs_ls()
One thing httr does when you set the httr_oob_default option is to redefine the redirect URI to urn:ietf:wg:oauth:2.0:oob, as seen in the code for oauth-init.
Alternatively, you can create a .httr-oauth token manually using httr commands. Use the out of band authentication mode by setting use_oob=TRUE in the oauth2.0_token command.
library(googlesheets)
library(httr)
file.remove('.httr-oauth')
oauth2.0_token(
  endpoint = oauth_endpoints("google"),
  app = oauth_app(
    "google",
    key = getOption("googlesheets.client_id"),
    secret = getOption("googlesheets.client_secret")
  ),
  scope = c(
    "https://spreadsheets.google.com/feeds",
    "https://www.googleapis.com/auth/drive"),
  use_oob = TRUE,
  cache = TRUE
)
gs_ls()
Another, less elegant, solution is to create the .httr-oauth token on your desktop and then copy the file to the server.
After a lot of head banging, I found that the "httpuv" package, which supports HTTP handling and WebSocket requests from R, was causing the problem. It was forcing R to open a web browser.
Once I uninstalled this package, "googlesheets" gave me a link that I could paste into a browser separately, and then I pasted the auth code back into R on the server.
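Before calling gs_auth(), you can check which path you are likely to end up on. The sketch below only mirrors the behaviour reported in these answers (out-of-band is used when httr_oob_default is TRUE or when httpuv is not installed); it is an assumption about how the decision plays out, not httr's actual code, and `will_use_oob` is a made-up name.

```r
# Hypothetical check: TRUE if the out-of-band (copy/paste code) flow is expected,
# either because it was requested or because no local listener is available.
will_use_oob <- function() {
  oob_requested  <- isTRUE(getOption("httr_oob_default"))
  httpuv_missing <- !requireNamespace("httpuv", quietly = TRUE)
  oob_requested || httpuv_missing
}

options(httr_oob_default = TRUE)  # request out-of-band explicitly
oob <- will_use_oob()
```

Setting the option is less invasive than uninstalling httpuv, since other packages may depend on it.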

Configuring listener_endpoint in httr when using Rstudio server

I'm struggling to connect to Google Analytics with the httr oauth2.0_token function
oauth2.0_token(oauth_endpoints("google")
, oauth_app("google", client.id, client.secret)
, scope = "https://www.googleapis.com/auth/analytics.readonly")
It works perfectly in my local RStudio, but it breaks in AWS-based RStudio Server. The error appears when I agree to share data in the browser and Google redirects me to the page
http://localhost:1410/?state=codehere
When launching authentication in local RStudio, the browser responds with the message "Authentication complete. Please close this page and return to R"; in the case of RStudio Server it's just "This webpage is not available".
I suspect I need to change the listener_endpoint configuration, but how? Should I put my RStudio Server address instead of the default 127.0.0.1? Or is it a flaw of httr + RStudio Server that I should not bother with?
Your redirect URI is part of the problem. httr's oauth2.0_token() function needs to identify the correct one. When you set up your project, Google Analytics created two redirect URIs: one that can be used in your local RStudio IDE and one that can be used in the web-based RStudio environment for out-of-band authentication: "urn:ietf:wg:oauth:2.0:oob"
Once you've authenticated, the following code should work.
library(httr)
# Replace the Your* placeholders with the values from your own project
ga_id <- YourProjectID
client_id <- YourClientID
redirect_uri <- 'urn:ietf:wg:oauth:2.0:oob'
scope <- YourScope
client_secret <- YourSecret
response_type <- 'code'
auth1 <- oauth2.0_token(
  endpoint = oauth_endpoints("google"),
  app = oauth_app(
    "google",
    key = client_id,
    secret = client_secret
  ),
  scope = scope,
  use_oob = TRUE,  # out-of-band: paste the code back instead of using localhost
  cache = TRUE
)
-- Ann
You could use out-of-band authentication -
options(httr_oob_default = TRUE)

Resources