RSelenium with Chrome: how to keep information? - r

I have been trying to do some web scraping with RSelenium using the Chrome browser.
To access the page with the information I need, I first have to scan a QR code.
In a regular browser, I do this just once and can then close and reopen the browser as many times as I want. With RSelenium, I need to scan the QR code every time I open the browser.
Is there a way to keep this information? Cache or cookies?
I have been trying:
rD <- RSelenium::rsDriver(browser = "chrome",
                          chromever = "108.0.5359.71",
                          port = netstat::free_port(),
                          verbose = FALSE)
remDr <- rD[["client"]]
remDr$navigate("https://google.com/")
remDr$maxWindowSize()
remDr$getAllCookies()
But I'm getting only an empty list.
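One approach that often works for persisting a session across runs is to point Chrome at a persistent user data directory via extraCapabilities, so cookies and local storage are written to disk and reused. A minimal sketch, assuming a scratch folder of your choosing (the path below is a placeholder, and older driver versions may want the key "chromeOptions" instead of "goog:chromeOptions"):
eCaps <- list("goog:chromeOptions" = list(
  args = list("--user-data-dir=C:/rselenium-profile",   # placeholder folder reused across runs
              "--profile-directory=Default")
))
rD <- RSelenium::rsDriver(browser = "chrome",
                          chromever = "108.0.5359.71",
                          port = netstat::free_port(),
                          verbose = FALSE,
                          extraCapabilities = eCaps)
remDr <- rD[["client"]]
remDr$navigate("https://google.com/")
remDr$getAllCookies()   # should now return cookies stored in that profile
After scanning the QR code once in this browser, later runs that reuse the same --user-data-dir should still be logged in.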

Related

RSelenium Chrome webdriver doesn't work with user profile

I have some existing RSelenium code I am trying to get working with a Chrome Profile. I am using the code below to open a browser:
cprof <- getChromeProfile("C:/Users/Paul/AppData/Local/Google/Chrome/User Data", "Profile 1")
driver <- rsDriver(browser = "chrome", chromever = "80.0.3987.106",
                   port = 4451L, extraCapabilities = cprof)
But when I run this, three (3!) new Chrome browser windows open before the following error is displayed in RStudio:
Could not open chrome browser.
Client error message:
Summary: SessionNotCreatedException
Detail: A new session could not be created.
Further Details: run errorDetails method
Check server log for further details.
The puzzling part is that it does look like it is getting the correct profile, because when I switch between "Profile 1", "Profile 2" and even "Default" in the getChromeProfile call, I see the correct user icon in the browser windows that open. And if I leave off the extraCapabilities the browser opens with no problem (using the default "empty" profile).
Any idea what I am doing wrong?
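One hedged guess, since the error text alone doesn't say: chromedriver typically raises SessionNotCreatedException when the requested profile directory is already locked by a running Chrome instance. A sketch under that assumption, working on a copy of the profile so the live one is never in use (the scratch path is hypothetical):
profile_src  <- "C:/Users/Paul/AppData/Local/Google/Chrome/User Data"
profile_copy <- "C:/Users/Paul/selenium-user-data"   # hypothetical scratch location
dir.create(profile_copy, showWarnings = FALSE)
file.copy(file.path(profile_src, "Profile 1"), profile_copy, recursive = TRUE)
cprof  <- getChromeProfile(profile_copy, "Profile 1")
driver <- rsDriver(browser = "chrome", chromever = "80.0.3987.106",
                   port = 4451L, extraCapabilities = cprof)
Alternatively, simply closing every open Chrome window before calling rsDriver() is worth trying first.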

R - using RSelenium to log into website (Captcha, and staying logged in)

I want to use RSelenium to access and scrape a website each day. Something I've noticed is that when I open the website in a regular Chrome browser, I am already logged in from the last time I visited. However, if I use RSelenium to open a remote driver and visit the page with it, I am not logged into the website. Logging in is usually straightforward for most sites; however, this website has a Captcha that makes logging in more difficult.
Is there any way the remote driver can access the website with me already logged in?
Example of my code below:
this_URL = "my_url_goes_here"
startServer()
remDr = remoteDriver$new(browserName = 'chrome')
Sys.sleep(2); remDr$open();
Sys.sleep(4); remDr$navigate(this_URL);
login_element = remDr$findElement(using = "id", "login-link")
login_element$clickElement()
After clicking the login_element link, it brings me to the page where I input my username, password, and click the captcha / do what it asks.
Thanks,
It should work using Firefox and Firefox profiles as follows:
Set up Firefox access:
Open Firefox and log in as usual. Make sure that when you close Firefox and open it again, you stay logged in.
Figure out the location of your default Firefox profile:
This should be something like (source + more details):
Windows: %AppData%\Mozilla\Firefox\Profiles\xxxxxxxx.default
Mac: ~/Library/Application Support/Firefox/Profiles/xxxxxxxx.default/
Linux: ~/.mozilla/firefox/xxxxxxxx.default/
Start a new RSelenium driver and set the profile as follows:
require(RSelenium)
eCap <- list("webdriver.firefox.profile" = "MySeleniumProfile")
remDr <- remoteDriver(browserName = "firefox", extraCapabilities = eCap)
remDr$open()
The Firefox window that opens should use your chosen profile.
I did this a while ago; if I remember correctly, it works like this.
P.S.: You could also create a separate/new Firefox profile for this. To do that, follow the steps in the link above.
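If passing the profile by name doesn't take effect (behaviour varies across Selenium and driver versions), an alternative sketch is to hand RSelenium the profile directory itself with getFirefoxProfile(); the xxxxxxxx.default path below is the placeholder from the list above:
library(RSelenium)
# Zip up and encode the existing profile so it is sent to the driver
fprof <- getFirefoxProfile("~/.mozilla/firefox/xxxxxxxx.default", useBase = TRUE)
remDr <- remoteDriver(browserName = "firefox", extraCapabilities = fprof)
remDr$open()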

Getting ftp data with RCurl

I can access the FTP site with Chrome but not with Internet Explorer, because of a company restriction I think. For that reason, maybe, I cannot download the FTP data with RCurl in R. Do you have any solution for downloading the FTP data via the Chrome setup in R?
Thanks
library(RCurl)
url <- c("myUrl")
x <- getURL(url, userpwd = "user:password", connecttimeout = 60)
writeLines(x, "Append.txt")
The package RCurl does not use a web browser to access ftp sites. It uses libcurl, as it says in the documentation. The problem you encounter should be solved within the constraints of libcurl.
Also, if one web browser on your computer can access a website, and another can't, it need not be a problem with the web browser per se. The most common problem is the way files or paths are referenced, such as whether or not one includes a trailing / with a pathname (never with a filename, of course). Perhaps this is the case for you?
Otherwise there may be a problem with your ftp settings: libcurl is pretty smart about guessing things right, but it is possible to twiddle with all sorts of settings, in case the defaults do not work, for example (from the manual):
# List the files on the FTP server first
filenames = getURL(url, ftp.use.epsv = FALSE, ftplistonly = TRUE)
# Deal with newlines as \n or \r\n. (BDR)
# Or alternatively, instruct libcurl to change \n's to \r\n's for us with crlf = TRUE
# filenames = getURL(url, ftp.use.epsv = FALSE, ftplistonly = TRUE, crlf = TRUE)
filenames = paste(url, strsplit(filenames, "\r*\n")[[1]], sep = "")
con = getCurlHandle(ftp.use.epsv = FALSE)
If this doesn't help, it might help us help you if you give us more complete information. What is this myUrl in url<-c("myUrl"), for example? Is it a filename? A pathname?
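If the company restriction turns out to be a proxy that Chrome is configured to use but libcurl is not, you can pass proxy settings to getURL() explicitly. A sketch with placeholder proxy details (host, port, and credentials are assumptions to be replaced with your own):
library(RCurl)
url <- c("myUrl")
x <- getURL(url,
            userpwd        = "user:password",
            proxy          = "proxy.mycompany.com:8080",   # hypothetical proxy host:port
            proxyuserpwd   = "proxyuser:proxypassword",    # only if the proxy requires auth
            connecttimeout = 60)
writeLines(x, "Append.txt")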

PhantomJS access internal server without TLD via Selenium

I'm working on an internal web scraper at my company to pull names and IDs from a SharePoint list. Let's assume that this is the only way I can access the information. From R, I'm using RSelenium to control PhantomJS (I've successfully connected to external servers and pulled in data, so I know it's working in general). When I navigate to the site http://teams4/AllItems.aspx, PhantomJS decides to be helpful and changes the URL to teams4.com. Now, I'm using cntlm as a local authentication proxy in front of the big corporate proxy (it lets Python and R connect to the Internet); without it I couldn't get to the Internet at all. I also tried connecting directly to the corporate proxy. Is there a way to force PhantomJS to resolve the name the way it's given?
library(RSelenium)
library(rvest)
library(magrittr)
beautifulSoup <- function(source) {
  # getPageSource() returns a list; the first element is the HTML string
  s <- source %>%
    extract2(1) %>%
    html()   # read_html() in newer versions of rvest
  return(s)
}

pJS <- phantom(pjs_cmd = "C:/phantomjs2/bin/phantomjs.exe", extras = "--proxy=localhost:3128")
remDr <- remoteDriver(browserName = 'phantomjs')
remDr$open()

output <- data.frame(Name = "test", ID = 0)
remDr$navigate('http://teams4/AllItems.aspx')
soup <- beautifulSoup(remDr$getPageSource())
allRequestors <- soup %>%
  html_nodes(xpath = '//td[contains(@class, "ms-vb-user")]')
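One thing to try, offered as an assumption since I can't test against your network: the rewrite to teams4.com may be happening at the proxy rather than inside PhantomJS, so letting the internal single-label host bypass cntlm's upstream proxy can help. If your cntlm build supports it, add the host to the NoProxy list in cntlm.ini (e.g. NoProxy localhost, 127.0.0.*, teams4) and then retry the same navigation:
library(RSelenium)
pJS <- phantom(pjs_cmd = "C:/phantomjs2/bin/phantomjs.exe",
               extras = "--proxy=localhost:3128")
remDr <- remoteDriver(browserName = "phantomjs")
remDr$open()
remDr$navigate("http://teams4/AllItems.aspx")
remDr$getCurrentUrl()   # check whether the host is still being rewritten to teams4.com
pJS$stop()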

How to avoid "too many redirects" error when using readLines(url) in R?

I am trying to mine news articles from various sources by doing
site = readLines(link)
link being the URL of the site I am trying to download. Most of the time this works, but with some specific sources I get the error:
Error in file(con, "r") : cannot open the connection
In addition: Warning message:
In file(con, "r") : too many redirects, aborting ...
This is something I'd like to avoid, but so far I have had no success in doing so.
Replicating this is quite easy, as virtually none of the New York Times links work,
e.g. http://www.nytimes.com/2014/08/01/us/politics/african-leaders-coming-to-talk-business-may-also-be-pressed-on-rights.html
It seems like the NYT site forces redirects for cookie and tracking purposes, and the built-in URL reader isn't able to deal with them correctly (I'm not sure it supports cookies, which is probably the problem).
Anyway, you might consider using the RCurl package to access the file instead. Try
library(RCurl)
link = "http://www.nytimes.com/2014/08/01/us/politics/african-leaders-coming-to-talk-business-may-also-be-pressed-on-rights.html?_r=0"
site <- getURL(link, .opts = curlOptions(
  cookiejar = "", useragent = "Mozilla/5.0", followlocation = TRUE
))
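If the rest of your pipeline expects the line-by-line vector that readLines() returns, you can split the single string that getURL() gives back; a small follow-up sketch:
# getURL() returns the whole page as one string; split on newlines to mimic readLines()
site_lines <- strsplit(site, "\r?\n")[[1]]
head(site_lines)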
