I am trying to write code that will allow me to download a .xls file from a secured HTTPS website that requires a login. This is very difficult for me, as I have no experience with web programming; all my R experience comes from econometric work with readily available datasets.
I followed this thread to help write some code, but I think I'm running into trouble because the example uses HTTP and I need HTTPS.
This is my code:
install.packages("RCurl")
library(RCurl)
curl = getCurlHandle()
curlSetOpt(cookiejar = 'cookies.txt', followlocation = TRUE, autoreferer = TRUE, curl = curl)
html <- getURL('https://jump.valueline.com/login.aspx', curl = curl)
viewstate <- as.character(sub('.*id="_VIEWSTATE" value="([0-9a-zA-Z+/=]*).*', '\\1', html))
params <- list(
'ct100$ContentPlaceHolder$LoginControl$txtUserID' = 'MY USERNAME',
'ct100$ContentPlaceHolder$LoginControl$txtUserPw' = 'MY PASSWORD',
'ct100$ContentPlaceHolder$LoginControl$btnLogin' = 'Sign In',
'_VIEWSTATE' = viewstate)
html <- postForm('https://jump.valueline.com/login.aspx', .params = params, curl = curl)
When I get to running the line that starts with "html <- getURL(...", I get:
> html <- getURL('https://jump.valueline.com/login.aspx', curl = curl)
Error in function (type, msg, asError = TRUE) :
SSL certificate problem: unable to get local issuer certificate
Is there a workaround for this? How am I able to access the local issuer certificate?
I read that adding '.opts = list(ssl.verifypeer = FALSE)' to the curlSetOpt call would remedy this, but when I add that, the getURL runs and then the postForm line gives me
> html <- postForm('https://jump.valueline.com/login.aspx', .params = params, curl = curl)
Error: Internal Server Error
Besides that, does this code look like it will work for the website I am trying to access? I went into the inspector and changed all the params to match my page, but since I'm not well versed in web programming I'm not 100% sure I caught the right things (particularly the VIEWSTATE). Also, is there a better, more efficient way I could approach this?
Automating this process would be huge for me, so your help is greatly appreciated.
Try httr:
library(httr)
html <- content(GET('https://jump.valueline.com/login.aspx'), "text")
viewstate <- as.character(sub('.*id="_VIEWSTATE" value="([0-9a-zA-Z+/=]*).*', '\\1', html))
params <- list(
'ct100$ContentPlaceHolder$LoginControl$txtUserID' = 'MY USERNAME',
'ct100$ContentPlaceHolder$LoginControl$txtUserPw' = 'MY PASSWORD',
'ct100$ContentPlaceHolder$LoginControl$btnLogin' = 'Sign In',
'_VIEWSTATE' = viewstate
)
POST('https://jump.valueline.com/login.aspx', body = params)
That still gives me a server error, but that's probably because you're not sending the right fields in the body.
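Once the login succeeds, the same httr session can be reused to fetch the spreadsheet. A rough sketch of that idea, assuming the export URL is taken from the browser's network inspector (the one below is a placeholder) and that the site tracks the session with cookies:
library(httr)
h <- handle("https://jump.valueline.com")  # one handle so cookies persist across requests
html <- content(GET("https://jump.valueline.com/login.aspx", handle = h), "text")
# ... extract the viewstate and build `params` as above ...
POST("https://jump.valueline.com/login.aspx", body = params, handle = h)  # encode = "form" may be needed if the site expects a urlencoded body
# placeholder export URL: replace with the address the browser actually requests
resp <- GET("https://jump.valueline.com/path/to/export.xls", handle = h,
            write_disk("report.xls", overwrite = TRUE))
stop_for_status(resp)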
html <- getURL('https://jump.valueline.com/login.aspx', curl = curl, ssl.verifypeer = FALSE)
This should work for you. The error you're getting is probably because libcurl doesn't know where to look for a certificate to verify the SSL connection.
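Instead of switching verification off entirely, you can also point libcurl at a CA bundle explicitly. A minimal sketch, assuming the bundle that ships with RCurl (used the same way in some of the examples further down) covers the site's certificate chain:
library(RCurl)
# tell libcurl where to find CA certificates so it can verify the chain
cafile <- system.file("CurlSSL", "cacert.pem", package = "RCurl")
curl <- getCurlHandle()
curlSetOpt(cookiejar = 'cookies.txt', followlocation = TRUE, autoreferer = TRUE,
           cainfo = cafile, curl = curl)
html <- getURL('https://jump.valueline.com/login.aspx', curl = curl)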
Related
I am trying to log in with my credentials to a .NET site but am unable to get it working. My code is inspired by the thread below:
How to login and then download a file from aspx web pages with R
library(RCurl)
curl = getCurlHandle()
curlSetOpt(cookiejar = 'cookies.txt', followlocation = TRUE, autoreferer = TRUE, curl = curl)
html <- getURL('http://www.aceanalyser.com/Login.aspx', curl = curl)
viewstate <- as.character(sub('.*id="__VIEWSTATE" value="([0-9a-zA-Z+/=]*).*', '\\1', html))
viewstategenerator <- as.character(sub('.*id="__VIEWSTATEGENERATOR" value="([0-9a-zA-Z+/=]*).*', '\\1', html))
params <- list(
'txtUserID' = '********',
'txtPwd' = '*******',
'Btn_Login' = 'GO',
'__VIEWSTATE' = viewstate,
'__VIEWSTATEGENERATOR' = viewstategenerator,
'HiddenField1' = '1280',
'HiddenField2' = '700',
'Hdn_Pwd' = 'true')
html = postForm('http://www.aceanalyser.com/Login.aspx', .params = params, curl = curl)
grepl('Logout', html)
Result: FALSE
Please help me understand the issue
You may change the option
'Btn_Login' = 'GO'
to something like
'Btn_Login.x' = '22',
'Btn_Login.y' = '14'
According to this, the reason is a browser bug:
Some browsers (IIRC it is just some versions of Internet Explorer) only send the co-ordinates of the image map (in name.x and name.y) and ignore the value.
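Put together, the params list would then look something like this (a sketch; the exact coordinate values are arbitrary, the server usually only checks that the .x/.y pair is present):
params <- list(
  'txtUserID' = '********',
  'txtPwd' = '*******',
  'Btn_Login.x' = '22',  # image-button click coordinates instead of a value
  'Btn_Login.y' = '14',
  '__VIEWSTATE' = viewstate,
  '__VIEWSTATEGENERATOR' = viewstategenerator,
  'HiddenField1' = '1280',
  'HiddenField2' = '700',
  'Hdn_Pwd' = 'true')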
If you are sure your credentials are correct, you could try adding more curl setopt arguments, following this example in PHP.
It may also be worth monitoring how your credentials are transferred: maybe some extra encoding is necessary, or an unnecessary one is being added.
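One way to watch what is actually being sent is RCurl's verbose option, which logs the request and response headers (and some of the transferred data) to the console. A minimal sketch:
curlSetOpt(verbose = TRUE, curl = curl)  # print the exchange with the server
html <- postForm('http://www.aceanalyser.com/Login.aspx', .params = params, curl = curl)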
I need to write a piece of code that will download data files from a website that requires a login.
I'd have thought this would be quite easy, but I'm having difficulty doing the login programmatically.
I tried using the steps outlined in this post:
How to login and then download a file from aspx web pages with R
But when I get to the second-to-last step in the top answer, I get an error message:
Error: Internal Server Error
So I am trying to write RCurl code to log in to the site and then download the files.
Here is what I have tried:
install.packages("RCurl")
library(RCurl)
curl = getCurlHandle()
curlSetOpt(cookiejar = 'cookies.txt', .opts = list(ssl.verifypeer = FALSE), followlocation = TRUE, autoreferer = TRUE, curl= curl)
html <- getURL('https://research.valueline.com/secure/f2/export?params=[{appId:%27com_2_4%27,%20context:{%22Symbol%22:%22GT%22,%22ListId%22:%22recent%22}}]', curl = curl)
viewstate <- as.character(sub('.*id="__VIEWSTATE" value="([0-9a-zA-Z+/=]*).*', '\\1', html))
params <- list(
'ctl00$ContentPlaceHolder$LoginControl$txtUserID' = '<myusername>',
'ctl00$ContentPlaceHolder$LoginControl$txtUserPw' = '<mypassword>',
'ctl00$ContentPlaceHolder$LoginControl$btnLogin' = 'Sign In',
'__VIEWSTATE' = viewstate
)
html = postForm('https://research.valueline.com/secure/f2/export?params=[{appId:%27com_2_4%27,%20context:{%22Symbol%22:%22GT%22,%22ListId%22:%22recent%22}}]', .params = params, curl = curl)
grepl('Logout', html)
Alright, so I've recently figured out that I can query a website behind a login screen for a CSV report. Then I thought, wouldn't it be even better to do this concurrently? After all, some reports take a lot longer to produce than others, and if I were querying 10 different reports at once that would be far more efficient. So I'm now in over my head twice here, playing around with HTTPS protocols and also parallel processing. I think my frankencode is almost there, though, but it gives me a
"Error in ( : task 1 failed - "Stale CURL handle being passed to libcurl"
Note that the curl handle is very much current, as the "html" variable did log in successfully. Something happens in its parallel chunk that makes it stale.
library(RCurl)
library(doParallel)
registerDoParallel(cores=4)
agent="Firefox/23.0"
options(RCurlOptions = list(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl")))
curl = getCurlHandle()
curlSetOpt(
cookiejar = 'cookies.txt' ,
useragent = agent,
followlocation = TRUE ,
autoreferer = TRUE ,
curl = curl
)
un="username#domain.com"
pw="password"
html = postForm(paste("https://login.salesforce.com/?un=", un, "&pw=", pw, sep=""), curl=curl)
urls = c("https://xyz123.salesforce.com/00O400000046ayd?export=1&enc=UTF-8&xf=csv",
"https://xyz123.salesforce.com/00O400000045sWu?export=1&enc=UTF-8&xf=csv",
"https://xyz123.salesforce.com/00O400000045z3Q?export=1&enc=UTF-8&xf=csv")
x <- foreach(i = seq_along(urls), .combine = rbind, .packages = c("RCurl")) %dopar% {
xxx <- getURL(urls[i], curl=curl)
}
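A curl handle is an external pointer, so it cannot be serialised and shipped to the parallel workers, which is one plausible reason it arrives stale; each worker would then need to log in with its own handle. A rough sketch of that idea (assuming the login-by-URL used above keeps working):
x <- foreach(i = seq_along(urls), .combine = rbind, .packages = c("RCurl")) %dopar% {
  # build and authenticate a fresh handle inside each worker
  wcurl <- getCurlHandle()
  curlSetOpt(cookiejar = sprintf('cookies_%d.txt', i), useragent = agent,
             followlocation = TRUE, autoreferer = TRUE, curl = wcurl)
  postForm(paste("https://login.salesforce.com/?un=", un, "&pw=", pw, sep = ""), curl = wcurl)
  getURL(urls[i], curl = wcurl)
}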
I am new to using R to post forms and then download data off the web. It is probably very easy for someone out there to spot what I am doing wrong, so I appreciate your patience. I have a Win7 PC, and Firefox 23.x is my usual browser.
I am trying to post the main form that shows up on
http://www.aplia.com/
I have the following R script:
your.username <- 'username'
your.password <- 'password'
setwd( "C:/Users/Desktop/Aplia/data" )
require(SAScii)
require(RCurl)
require(XML)
agent="Firefox/23.0"
options(RCurlOptions = list(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl")))
curl = getCurlHandle()
curlSetOpt(
cookiejar = 'cookies.txt' ,
useragent = agent,
followlocation = TRUE ,
autoreferer = TRUE ,
curl = curl
)
# list parameters to pass to the website (pulled from the source html)
params <-
list(
'userAgent' = agent,
'screenWidth' = "",
'screenHeight' = "",
'flashMajor' = "",
'flashMinor' = "",
'flashBuild' = "",
'flashPatch' = "",
'redirect' = "",
'referrer' = "http://www.aplia.com",
'txtEmail' = your.username,
'txtPassword' = your.password
)
# logs into the form
html = postForm('https://courses.aplia.com/', .params = params, curl = curl)
html
# download a file once form is posted
html <-
getURL(
"http://courses.aplia.com/af/servlet/mngstudents?ctx=filename" ,
curl = curl
)
html
But from there I can tell that I am not getting the page I want: what is returned in html is a redirect message that appears to be asking me to log in again:
"\r\n\r\n<html>\r\n<head>\r\n <title>Aplia</title>\r\n\t<script language=\"JavaScript\" type=\"text/javascript\">\r\n\r\n top.location.href = \"https://courses.aplia.com/af/servlet/login?action=form&redirect=%2Fservlet%2Fmngstudents%3Fctx%3Dfilename\";\r\n \r\n\t</script>\r\n</head>\r\n<body>\r\n Click here to continue.\r\n</body>\r\n</html>\r\n"
I do believe there is a series of redirects that occurs once the form is posted successfully (manually, in a browser). How can I tell whether the form was posted correctly?
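One rough way to check (a sketch; it assumes, as in the other examples above, that the post-login page contains a "Logout" link):
grepl('Logout', html)      # TRUE suggests the login form was accepted
info <- getCurlInfo(curl)  # inspect where the redirects actually ended up
info$response.code
info$effective.url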
I am quite sure that once I get the post working correctly, I won't have a problem directing R to download the files I need (online activity reports for each of my 500 students this semester). But I have spent several hours working on this and am stuck. Maybe I need to set more cookie-related options in the RCurl package (the site does use cookies)?
Any help is much appreciated! I typically use R to handle statistical data, so I am new to these packages and functions.
The answer ends up being very simple. For some reason, I didn't see one option that needs to be included in postForm:
html = postForm('https://courses.aplia.com/', .params = params, curl = curl, style="POST")
And that's it...
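A likely explanation for why this option matters (an assumption, not stated in the original answer): postForm() defaults to a multipart/form-data submission, whereas style = "POST" sends the fields application/x-www-form-urlencoded, which is what a plain HTML login form normally produces and what many login handlers expect.
# the two submission styles, for comparison (hypothetical url plus the params built above)
postForm(url, .params = params, curl = curl, style = "HTTPPOST")  # multipart/form-data (the default)
postForm(url, .params = params, curl = curl, style = "POST")      # application/x-www-form-urlencoded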
I'd like to scrape the HTML, as seen in the browser's source code, for this URL: "https://portal.tirol.gv.at/wisPvpSrv/wisSrv/wis/wbo_wis_auszug.aspx?ATTR=Y&TREE=N&ANL_ID=T20889658R3&TYPE=0".
What I get with the code below ...
library(RCurl)
library(XML)
myurl = "https://portal.tirol.gv.at/wisPvpSrv/wisSrv/wis/wbo_wis_auszug.aspx?ATTR=Y&TREE=N&ANL_ID=T20889658R3&TYPE=0"
x = getURL(myurl, followlocation = TRUE, ssl.verifypeer = FALSE)
htmlParse(x, asText = TRUE)
... is not what I see in the browser's source code.
How can I circumvent this?
Here ya go:
library(RCurl)
library(XML)
cookie = 'cookiefile.txt'
curl = getCurlHandle ( cookiefile = cookie ,
useragent = "Mozilla/5.0 (Windows; U; Windows NT 5.1; en - US; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6",
header = FALSE,
verbose = TRUE,
netrc = TRUE,
maxredirs = as.integer(20),
followlocation = TRUE,
# userpwd = "bob:duncantl", ## enter here your username:password
ssl.verifypeer = TRUE)
myurl = "https://portal.tirol.gv.at/wisSrvPublic/wis/wbo_wis_auszug.aspx?ANL_ID=T20889658R3&TYPE=O"
x = getURL(myurl, curl = curl, cainfo = "path to R/library/RCurl/CurlSSL/ca-bundle.crt")
x2 <- gsub('\r', '', gsub('\t', '', gsub('\n', '', x))) # strip carriage returns, tabs and newlines
htmlParse(x2, asText = TRUE)
If you cannot pass the SSL verification, have a look at this post:
using Rcurl with HTTPs
If that website uses a lot of JavaScript to generate content (and it seems it does), then you are pretty much stuck for starters.
If you use Firefox and get the developer toolbar, you can disable JavaScript to see what the site looks like without it and what content might be scrapable. You may hope that the site has a usable non-JavaScript version (this is called 'graceful degradation', where JS is only used for the fancy stuff).
Otherwise, use Firebug or some other JS debugger to see how the site pulls in content if it's using AJAX, then replicate those calls in R and scrape the response.
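For example, a rough sketch of that approach; the endpoint URL and the JSON assumption are placeholders, and the real request has to be copied from the browser's network panel:
library(RCurl)
library(jsonlite)  # assumption: the endpoint returns JSON
# placeholder XHR endpoint copied from the browser's network panel
ajax_url <- "https://portal.tirol.gv.at/some/ajax/endpoint"
resp <- getURL(ajax_url, curl = curl)  # reuse an authenticated handle, e.g. the one set up above
dat <- fromJSON(resp)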
Not that I can test any of this, because if I go to that URL I get a Benutzername and Passwort (username and password) prompt, and I don't have a Benutzername. If the content is behind authentication, then you'll have to handle that in the RCurl process too, which might mean mucking about with cookies and so on.
Good luck with that.