CURL handle goes stale when used inside foreach() in R

Alright, so I've recently figured out that I can query a website behind a login screen for a CSV report. Then I thought, wouldn't it be even better to do this concurrently? After all, some reports take much longer to produce than others, and if I were querying 10 different reports at once that would be far more efficient. So I'm now in over my head twice, playing with HTTPS as well as parallel processing. I think my Frankencode is almost there, but it gives me:
"Error in ( : task 1 failed - "Stale CURL handle being passed to libcurl"
Note that the "curl" is very much current as the "html" variable did login successfully. Something happens in it's parallel chunk that makes it stale.
library(RCurl)
library(doParallel)
registerDoParallel(cores=4)
agent="Firefox/23.0"
options(RCurlOptions = list(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl")))
curl = getCurlHandle()
curlSetOpt(
  cookiejar = 'cookies.txt',
  useragent = agent,
  followlocation = TRUE,
  autoreferer = TRUE,
  curl = curl
)
un="username#domain.com"
pw="password"
html = postForm(paste("https://login.salesforce.com/?un=", un, "&pw=", pw, sep=""), curl=curl)
urls = c("https://xyz123.salesforce.com/00O400000046ayd?export=1&enc=UTF-8&xf=csv",
"https://xyz123.salesforce.com/00O400000045sWu?export=1&enc=UTF-8&xf=csv",
"https://xyz123.salesforce.com/00O400000045z3Q?export=1&enc=UTF-8&xf=csv")
x <- foreach(i = seq_along(urls), .combine = rbind, .packages = "RCurl") %dopar% {
  xxx <- getURL(urls[i], curl = curl)   # fails here: the shared handle is stale inside the worker
}
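Not part of the original post, but the usual explanation is that an RCurl handle wraps an external pointer, which cannot be carried into the foreach workers; a common workaround is to give each worker its own handle and its own login. A minimal sketch under that assumption, reusing the agent, un, pw, and urls defined above:
x <- foreach(i = seq_along(urls), .combine = rbind, .packages = "RCurl") %dopar% {
  h <- getCurlHandle()                                # fresh handle inside this worker
  curlSetOpt(cookiejar = tempfile(fileext = ".txt"),  # per-worker cookie jar
             useragent = agent, followlocation = TRUE, autoreferer = TRUE, curl = h)
  postForm(paste("https://login.salesforce.com/?un=", un, "&pw=", pw, sep = ""),
           curl = h)                                  # log in from within the worker
  getURL(urls[i], curl = h)                           # fetch this worker's report
}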

Related

R: Download ZIP with RCurl after logging in

I am trying to log in to a website ("dkurl" below) and then download a zip file ("url" below). Following other answers about RCurl, I have attempted the code below, but I cannot get the file to download. Are there other parameters or commands I am missing?
url <- 'http://www.draftkings.com/contest/exportfullstandingscsv/40827113'
dkurl <- 'https://www.draftkings.com/account/sitelogin/'
pars = list(username = xxx, password = xxx)
agent = "Mozilla/5.0"
curl = getCurlHandle()
curlSetOpt(cookiejar="", useragent = agent, followlocation = TRUE, curl=curl)
html=postForm(dkurl, .params=pars, curl=curl)
html=getURL(url, curl=curl)
It's quite convenient to download files with the httr package, like this:
library(httr)
GET(fileUrl, authenticate(user, password),
write_disk(filename), timeout(60))
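For completeness, here is a fuller sketch of that approach for the question above; the user, password, and output filename are placeholders, and it assumes the standings CSV can be fetched with plain HTTP authentication:
library(httr)

fileUrl  <- 'http://www.draftkings.com/contest/exportfullstandingscsv/40827113'
user     <- 'your_username'   # placeholder
password <- 'your_password'   # placeholder
filename <- 'standings.csv'   # where the download is written

# Authenticate, download, and write straight to disk.
resp <- GET(fileUrl, authenticate(user, password),
            write_disk(filename, overwrite = TRUE), timeout(60))
stop_for_status(resp)   # raise an error if the request failed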

Using R to download a file from HTTPS with login credentials

I am trying to write code that will let me download an .xls file from a secured HTTPS website which requires a login. This is very difficult for me, as I have no experience with web coding; all my R experience comes from econometric work with readily available datasets.
I followed this thread to help write some code, but I think I'm running into trouble because the example is HTTP and I need HTTPS.
This is my code:
install.packages("RCurl")
library(RCurl)
curl = getCurlHandle()
curlSetOpt(cookiejar = 'cookies.txt', followlocation = TRUE, autoreferer = TRUE, curl = curl)
html <- getURL('https://jump.valueline.com/login.aspx', curl = curl)
viewstate <- as.character(sub('.*id="_VIEWSTATE" value="([0-9a-zA-Z+/=]*).*', '\\1', html))
params <- list(
  'ct100$ContentPlaceHolder$LoginControl$txtUserID' = 'MY USERNAME',
  'ct100$ContentPlaceHolder$LoginControl$txtUserPw' = 'MY PASSWORD',
  'ct100$ContentPlaceHolder$LoginControl$btnLogin' = 'Sign In',
  '_VIEWSTATE' = viewstate)
html <- postForm('https://jump.valueline.com/login.aspx', .params = params, curl = curl)
When I run the piece that starts with "html <- getURL(...", I get:
> html <- getURL('https://jump.valueline.com/login.aspx', curl = curl)
Error in function (type, msg, asError = TRUE) :
SSL certificate problem: unable to get local issuer certificate
Is there a workaround for this? How can I get access to the local issuer certificate?
I read that adding '.opts = list(ssl.verifypeer = FALSE)' to the curlSetOpt call would remedy this, but when I add that, the getURL runs and then the postForm line gives me
> html <- postForm('https://jump.valueline.com/login.aspx', .params = params, curl = curl)
Error: Internal Server Error
Besides that, does this code look like it will work for the website I am trying to access? I went into the inspector and changed all the params to match my page, but since I'm not well versed in web coding I'm not 100% sure I caught the right things (particularly the VIEWSTATE). Also, is there a better, more efficient way I could approach this?
Automating this process would be huge for me, so your help is greatly appreciated.
Try httr:
library(httr)
html <- content(GET('https://jump.valueline.com/login.aspx'), "text")
viewstate <- as.character(sub('.*id="_VIEWSTATE" value="([0-9a-zA-Z+/=]*).*', '\\1', html))
params <- list(
  'ct100$ContentPlaceHolder$LoginControl$txtUserID' = 'MY USERNAME',
  'ct100$ContentPlaceHolder$LoginControl$txtUserPw' = 'MY PASSWORD',
  'ct100$ContentPlaceHolder$LoginControl$btnLogin' = 'Sign In',
  '_VIEWSTATE' = viewstate
)
POST('https://jump.valueline.com/login.aspx', body = params)
That still gives me a server error, but that's probably because you're not sending the right fields in the body.
html <- getURL('https://jump.valueline.com/login.aspx', curl = curl, ssl.verifypeer = FALSE)
This should work for you. The error you're getting is probably because libcurl doesn't know where to find a CA certificate for SSL verification.
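If you would rather not disable verification, another option (the same trick used in the foreach question above) is to point libcurl at the CA bundle that ships with RCurl:
# Tell libcurl where to find a CA bundle instead of turning verification off.
options(RCurlOptions = list(cainfo = system.file("CurlSSL", "cacert.pem",
                                                 package = "RCurl")))
html <- getURL('https://jump.valueline.com/login.aspx', curl = curl)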

How can I log into this website and download files using RCurl?

I need to write a piece of code that will download data files from a website which requires a log in.
I'd have thought this would be quite easy, but I'm having difficulty doing the login programmatically.
I tried using the steps outlined in this post:
How to login and then download a file from aspx web pages with R
But when I get to the second-to-last step in the top answer, I get an error message:
Error: Internal Server Error
So I am trying to write an RCurl code to login to the site, then download the files.
Here is what I have tried:
install.packages("RCurl")
library(RCurl)
curl = getCurlHandle()
curlSetOpt(cookiejar = 'cookies.txt', .opts = list(ssl.verifypeer = FALSE), followlocation = TRUE, autoreferer = TRUE, curl= curl)
html <- getURL('https://research.valueline.com/secure/f2/export?params=[{appId:%27com_2_4%27,%20context:{%22Symbol%22:%22GT%22,%22ListId%22:%22recent%22}}]', curl = curl)
viewstate <- as.character(sub('.*id="__VIEWSTATE" value="([0-9a-zA-Z+/=]*).*', '\\1', html))
params <- list(
  'ctl00$ContentPlaceHolder$LoginControl$txtUserID' = '<myusername>',
  'ctl00$ContentPlaceHolder$LoginControl$txtUserPw' = '<mypassword>',
  'ctl00$ContentPlaceHolder$LoginControl$btnLogin' = 'Sign In',
  '__VIEWSTATE' = viewstate
)
html = postForm('https://research.valueline.com/secure/f2/export?params=[{appId:%27com_2_4%27,%20context:{%22Symbol%22:%22GT%22,%22ListId%22:%22recent%22}}]', .params = params, curl = curl)
grepl('Logout', html)
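One thing I would try (an assumption on my part, not a confirmed fix): the credentials above are posted straight to the export URL. The login form itself lives at the login.aspx page from the related question, so posting there first and then requesting the export with the same handle may be what the site expects:
loginUrl  <- 'https://jump.valueline.com/login.aspx'
exportUrl <- 'https://research.valueline.com/secure/f2/export?params=[{appId:%27com_2_4%27,%20context:{%22Symbol%22:%22GT%22,%22ListId%22:%22recent%22}}]'

html <- getURL(loginUrl, curl = curl)   # fetch the login form and its VIEWSTATE
viewstate <- as.character(sub('.*id="__VIEWSTATE" value="([0-9a-zA-Z+/=]*).*', '\\1', html))
params[['__VIEWSTATE']] <- viewstate
html <- postForm(loginUrl, .params = params, curl = curl)   # submit the login form itself
csv  <- getURL(exportUrl, curl = curl)                      # then request the export
grepl('Logout', html)                                       # rough check that login worked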

RCurl memory leak in getURL method

It looks like we have hit a bug in RCurl. The method getURL seems to be leaking memory. A simple test case to reproduce the bug is given here:
library(RCurl)
handle <- getCurlHandle()
range <- 1:100
for (r in range) { x <- getURL(url = "news.google.com.au", curl = handle) }
If I run this code, the memory allocated to the R session is never recovered.
We are using RCurl for some long-running experiments, and we are running out of memory on the test system.
The specs of our test system are as follows:
OS: Ubuntu 14.04 (64 bit)
Memory: 24 GB
RCurl version: 1.95-4.3
Any ideas about how to get around this issue?
Thanks
See if getURLContent() also exhibits the problem, i.e. replace getURL() with getURLContent().
The function getURLContent() is a richer version of getURL() and one that gets more attention.
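Applied to the reproduction above, the suggested substitution looks like this:
library(RCurl)
handle <- getCurlHandle()
for (r in 1:100) {
  x <- getURLContent(url = "news.google.com.au", curl = handle)  # getURL() swapped for getURLContent()
}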
I just hit this too, and made the following code change to work around it:
LEAK (Old code)
h <- basicHeaderGatherer()
tmp <- tryCatch(getURL(url = url,
                       headerfunction = h$update,
                       useragent = R.version.string,
                       timeout = timeout_secs),
                error = function(x) { .__curlError <<- TRUE; .__curlErrorMessage <<- x$message })
NO LEAK (New code)
method <- "GET"
h <- basicHeaderGatherer()
t <- basicTextGatherer()
tmp <- tryCatch(curlPerform(url = url,
customrequest = method,
writefunction = t$update,
headerfunction = h$update,
useragent=R.version.string,
verbose = FALSE,
timeout = timeout_secs),
error = function(x) { .__curlError <<- TRUE; .__curlErrorMessage <<- x$message })
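After curlPerform() returns, the body and headers are read back from the gatherers, e.g. (assuming the usual named vector returned by basicHeaderGatherer()):
body    <- t$value()                         # response body collected by the text gatherer
headers <- h$value()                         # parsed header fields, including status
status  <- as.integer(headers[["status"]])   # HTTP status code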

R - posting a login form using RCurl

I am new to using R to post forms and then download data from the web. It is probably very easy for someone out there to spot what I am doing wrong, so I appreciate your patience. I have a Windows 7 PC, and Firefox 23.x is my usual browser.
I am trying to post the main form that shows up on
http://www.aplia.com/
I have the following R script:
your.username <- 'username'
your.password <- 'password'
setwd( "C:/Users/Desktop/Aplia/data" )
require(SAScii)
require(RCurl)
require(XML)
agent="Firefox/23.0"
options(RCurlOptions = list(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl")))
curl = getCurlHandle()
curlSetOpt(
  cookiejar = 'cookies.txt',
  useragent = agent,
  followlocation = TRUE,
  autoreferer = TRUE,
  curl = curl
)
# list parameters to pass to the website (pulled from the source html)
params <-
  list(
    'userAgent' = agent,
    'screenWidth' = "",
    'screenHeight' = "",
    'flashMajor' = "",
    'flashMinor' = "",
    'flashBuild' = "",
    'flashPatch' = "",
    'redirect' = "",
    'referrer' = "http://www.aplia.com",
    'txtEmail' = your.username,
    'txtPassword' = your.password
  )
# logs into the form
html = postForm('https://courses.aplia.com/', .params = params, curl = curl)
html
# download a file once form is posted
html <-
  getURL(
    "http://courses.aplia.com/af/servlet/mngstudents?ctx=filename",
    curl = curl
  )
html
But from there I can tell that I am not getting the page I want: what comes back in html is a redirect message that appears to be asking me to log in again:
"\r\n\r\n<html>\r\n<head>\r\n <title>Aplia</title>\r\n\t<script language=\"JavaScript\" type=\"text/javascript\">\r\n\r\n top.location.href = \"https://courses.aplia.com/af/servlet/login?action=form&redirect=%2Fservlet%2Fmngstudents%3Fctx%3Dfilename\";\r\n \r\n\t</script>\r\n</head>\r\n<body>\r\n Click here to continue.\r\n</body>\r\n</html>\r\n"
I do believe a series of redirects occurs once the form is posted successfully (manually, in a browser). How can I tell whether the form was posted correctly?
I am quite sure that once I get the post working correctly, I won't have a problem directing R to download the files I need (online activity reports for each of my 500 students this semester). But I've spent several hours on this and got stuck. Maybe I need to set more RCurl options that have to do with cookies (the site does use cookies)?
Any help is much appreciated! I typically use R for statistical data, so I am new to these packages and functions.
The answer ends up being very simple. For some reason, I didn't see one option that needs to be included in postForm:
html = postForm('https://courses.aplia.com/', .params = params, curl = curl, style="POST")
And that's it...
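Not part of the original answer, but as a quick sanity check that the post really logged you in, you can look for the login redirect in the returned page or ask the handle where it ended up after the redirects:
# TRUE would mean we were bounced back to the login servlet.
grepl("servlet/login", html, fixed = TRUE)

# Final URL reached after followlocation redirects (field name per RCurl's getCurlInfo()).
getCurlInfo(curl)$effective.url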

Resources