Download an Excel file behind a login (rvest)

I am trying to download Excel files that are stored on an intranet site that requires a login. Everything works except the download itself. I am using rvest.
The example below uses dummy data:
my_intranet = "www.mywebsite_login.com"
the_excel_file = "www.mywebsite_login.com/excel.xlsx"
session = session(my_intranet)
form = html_form(session)[[1]]
fl_fm = html_form_set(form, sUserName = "XXXXXX", sPassword = "XXXXXXX")
main_page = session_submit(session, fl_fm)
b = session_jump_to(main_page, the_excel_file)
writeBin(b, basename(the_excel_file))
When I execute:
b = session_jump_to(main_page, the_excel_file)
I get the following curl error:
Error in curl::curl_fetch_memory(url, handle = handle) :
Could not resolve host: NA
Any idea about what's wrong? Thanks!
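A minimal sketch of one way to approach this, assuming rvest 1.0 or later. Two things stand out in the code above: the URLs have no scheme (http:// or https://), and writeBin() is given the session object rather than raw bytes. Whether the missing scheme is what produces "Could not resolve host: NA" is an assumption; the question does not confirm it. ensure_scheme() and download_behind_login() are hypothetical helpers, not rvest functions, and the form field names are the ones from the question.

```r
# Sketch only: assumes rvest >= 1.0 and a login form with fields named
# sUserName / sPassword, as in the question.

# Prepend a scheme when a URL lacks one (hypothetical helper, not part of rvest).
ensure_scheme <- function(url, scheme = "https") {
  if (grepl("^[A-Za-z][A-Za-z0-9+.-]*://", url)) url else paste0(scheme, "://", url)
}

# Log in, jump to the file URL, and write the raw response body to disk.
download_behind_login <- function(login_url, file_url, user, password) {
  s <- rvest::session(ensure_scheme(login_url))
  form <- rvest::html_form(s)[[1]]
  filled <- rvest::html_form_set(form, sUserName = user, sPassword = password)
  logged_in <- rvest::session_submit(s, filled)
  file_page <- rvest::session_jump_to(logged_in, ensure_scheme(file_url))
  # The session itself is not raw data; the bytes live in the httr response.
  writeBin(file_page$response$content, basename(file_url))
  invisible(file_page)
}
```

With the dummy values from the question, the call would be download_behind_login("www.mywebsite_login.com", "www.mywebsite_login.com/excel.xlsx", "XXXXXX", "XXXXXXX").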

Related

Open .ODC connection in R

I have an .odc (office data connection) that connects Excel to a Web Service (MSBI, Web PowerBI).
It's working fine. I open the odc file, Excel opens up and it is connected to the data source.
How can I open this connection directly from R?
The odc file contents are:
<odc:ConnectionString>
Provider=MSOLAP;
Integrated Security=ClaimsToken;
Identity Provider=https://login.microsoftonline.com/common,
https://analysis.windows.net/powerbi/api, xxxxxx-xx-xx-xxxxxx;
Data Source=pbiazure://api.powerbi.com;
Initial Catalog=xxxxx-xxxx-xxxx-xxxx-xxxxx;
MDX Compatibility= 1;
MDX Missing Member Mode= Error;
Safety Options= 2;
Update Isolation Level= X;
Locale Identifier= 10XX
</odc:ConnectionString>
This is what I tried so far:
library(httr); library(httpuv)
oauth_endpoints("azure")
powerbi.urls <- oauth_endpoint(access = "authorize",
authorize = "token",
base_url = "https://login.windows.net/common/oauth2")
powerbi.app <- oauth_app(
appname = "pbiazure://api.powerbi.com XXXX-XX-XXX-a611",
key = "XXXXXXXXX",
secret = "XXXXXXXXX")
powerbi.token <- oauth2.0_token(powerbi.urls, powerbi.app,
user_params = list(resource = "https://analysis.windows.net/powerbi/api"),
use_oob = FALSE)
But it is returning the following error:
AADSTS900561: The endpoint only accepts POST, OPTIONS requests. Received a GET request.
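A likely cause, offered as a sketch rather than a confirmed fix: in httr, the authorize URL is the one opened in the browser (a GET), while the access URL is the one the authorization code is POSTed to. The code above passes "token" as authorize and "authorize" as access, so the browser ends up sending a GET to the token endpoint, which matches the error text. The snippet below only swaps the two values; the base URL is the one from the question, and everything else about the app registration is untouched.

```r
# Sketch: the two endpoint names from the question, assigned the other way round.
base_url <- "https://login.windows.net/common/oauth2"
authorize_url <- paste0(base_url, "/authorize")  # opened in the browser (GET)
access_url    <- paste0(base_url, "/token")      # code-for-token exchange (POST)

# Only build the httr endpoint object when httr is installed.
if (requireNamespace("httr", quietly = TRUE)) {
  powerbi.urls <- httr::oauth_endpoint(
    authorize = authorize_url,
    access    = access_url
  )
}
```

The resulting powerbi.urls can then be passed to oauth2.0_token() exactly as in the original attempt.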

post a csv file to url

I am trying to post a .csv file to a URL, and it works when I use Python with this code:
import requests
url = 'http:...'
files = {'file': open('test.csv')}
response = requests.post(url, files=files)
Since all the other code is in R and I would like to keep everything in one place, I tried to translate it. I tried several different things:
library(httr)
POST("http:...",
body = list(name = "test.csv",
filedata = upload_file("~/test.csv", "text/csv")))
POST("http:...",
body = list(testFile = "~/test.csv"))
POST("http:...",
body = upload_file("~/test.csv"))
But I keep on running into the same error.
Error in curl::curl_fetch_memory(url, handle = handle) : Timeout was reached
Is there any other way I could try to upload the file to the url using R?
Any help or suggestions are appreciated!
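For comparison, here is a sketch of what a closer translation of the Python call might look like with httr. In requests, files = {'file': ...} sends a multipart/form-data body whose field is named file, whereas the R attempts above use other field names (filedata, testFile) or send the file as the raw body. post_csv() is a hypothetical wrapper, and the URL is left as a parameter because the original is elided. Note that a timeout can also come from the network itself (a proxy, for instance), which this sketch does not address.

```r
# Sketch: mirror requests.post(url, files = {'file': open('test.csv')}) with httr.
# `url` and `path` are placeholders supplied by the caller.
post_csv <- function(url, path) {
  httr::POST(
    url,
    # The field name "file" matches the key used in the Python dict.
    body = list(file = httr::upload_file(path, type = "text/csv")),
    encode = "multipart"
  )
}
```

It would be called as post_csv(url, "~/test.csv") with the same URL that works from Python.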

Using R to download file from https with login credentials

I am trying to write code that will let me download a .xls file from a secured https website that requires a login. This is very difficult for me, as I have no experience with web coding; all my R experience comes from econometric work with readily available datasets.
I followed this thread to help write some code, but I think I'm running into trouble because the example is http and I need https.
This is my code:
install.packages("RCurl")
library(RCurl)
curl = getCurlHandle()
curlSetOpt(cookiejar = 'cookies.txt', followlocation = TRUE, autoreferer = TRUE, curl = curl)
html <- getURL('https://jump.valueline.com/login.aspx', curl = curl)
viewstate <- as.character(sub('.*id="_VIEWSTATE" value="([0-9a-zA-Z+/=]*).*', '\\1', html))
params <- list(
'ct100$ContentPlaceHolder$LoginControl$txtUserID' = 'MY USERNAME',
'ct100$ContentPlaceHolder$LoginControl$txtUserPw' = 'MY PASSWORD',
'ct100$ContentPlaceHolder$LoginControl$btnLogin' = 'Sign In',
'_VIEWSTATE' = viewstate)
html <- postForm('https://jump.valueline.com/login.aspx', .params = params, curl = curl)
When I run the line that starts "html <- getURL(..." I get:
> html <- getURL('https://jump.valueline.com/login.aspx', curl = curl)
Error in function (type, msg, asError = TRUE) :
SSL certificate problem: unable to get local issuer certificate
Is there a workaround for this? How can I access the local issuer certificate?
I read that adding '.opts = list(ssl.verifypeer = FALSE)' to the curlSetOpt call would remedy this, but when I add that, getURL runs and then the postForm line gives me
> html <- postForm('https://jump.valueline.com/login.aspx', .params = params, curl = curl)
Error: Internal Server Error
Besides that, does this code look like it will work given the website I am trying to access? I went into the inspector and changed all the params to match my webpage, but since I'm not well versed in web coding I'm not 100% sure I caught the right things (particularly the VIEWSTATE). Also, is there a better, more efficient way I could approach this?
Automating this process would be huge for me, so your help is greatly appreciated.
Try httr:
library(httr)
html <- content(GET('https://jump.valueline.com/login.aspx'), "text")
viewstate <- as.character(sub('.*id="_VIEWSTATE" value="([0-9a-zA-Z+/=]*).*', '\\1', html))
params <- list(
'ct100$ContentPlaceHolder$LoginControl$txtUserID' = 'MY USERNAME',
'ct100$ContentPlaceHolder$LoginControl$txtUserPw' = 'MY PASSWORD',
'ct100$ContentPlaceHolder$LoginControl$btnLogin' = 'Sign In',
'_VIEWSTATE' = viewstate
)
POST('https://jump.valueline.com/login.aspx', body = params)
That still gives me a server error, but that's probably because you're not sending the right fields in the body.
html <- getURL('https://jump.valueline.com/login.aspx', curl = curl, ssl.verifypeer = FALSE)
This should work for you. The error you're getting is probably because libcurl doesn't know where to find a CA certificate bundle to verify the server's SSL certificate. Note that ssl.verifypeer = FALSE skips that verification altogether.
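On the "right fields" point, a small sketch of the hidden-field extraction may help. ASP.NET WebForms pages conventionally name the hidden field __VIEWSTATE (two underscores) and prefix control names with ctl00 (lowercase L, two zeros), whereas the code above uses _VIEWSTATE and ct100. Whether that applies to this particular page is an assumption; check the names in the page source. extract_viewstate() is a hypothetical helper, and the sample HTML is made up for illustration.

```r
# Sketch: pull the value of a hidden ASP.NET field out of an HTML string.
# Returns NA_character_ when the field is not found.
extract_viewstate <- function(html, field = "__VIEWSTATE") {
  pattern <- paste0('id="', field, '"[[:space:]]+value="([^"]*)"')
  m <- regmatches(html, regexec(pattern, html))
  m[[1]][2]  # element 1 is the full match, element 2 the captured value
}

# Made-up sample input for illustration only.
sample_html <- '<input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="dDwtMTA4=" />'
viewstate <- extract_viewstate(sample_html)
```

If the result is NA, the field name or the attribute order on the real page differs from what the pattern expects, and the login POST will carry an empty or wrong VIEWSTATE.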

How can I log in to this website and download files using RCurl

I need to write a piece of code that will download data files from a website which requires a login.
I'd have thought that this would be quite easy, but I'm having difficulty logging in programmatically.
I tried using the steps outlined in this post:
How to login and then download a file from aspx web pages with R
But when I get to the second-to-last step in the top answer I get an error message:
Error: Internal Server Error
So I am trying to write RCurl code to log in to the site and then download the files.
Here is what I have tried:
install.packages("RCurl")
library(RCurl)
curl = getCurlHandle()
curlSetOpt(cookiejar = 'cookies.txt', .opts = list(ssl.verifypeer = FALSE), followlocation = TRUE, autoreferer = TRUE, curl= curl)
html <- getURL('https://research.valueline.com/secure/f2/export?params=[{appId:%27com_2_4%27,%20context:{%22Symbol%22:%22GT%22,%22ListId%22:%22recent%22}}]', curl = curl)
viewstate <- as.character(sub('.*id="__VIEWSTATE" value="([0-9a-zA-Z+/=]*).*', '\\1', html))
params <- list(
'ctl00$ContentPlaceHolder$LoginControl$txtUserID' = '<myusername>',
'ctl00$ContentPlaceHolder$LoginControl$txtUserPw' = '<mypassword>',
'ctl00$ContentPlaceHolder$LoginControl$btnLogin' = 'Sign In',
'__VIEWSTATE' = viewstate
)
html = postForm('https://research.valueline.com/secure/f2/export?params=[{appId:%27com_2_4%27,%20context:{%22Symbol%22:%22GT%22,%22ListId%22:%22recent%22}}]', .params = params, curl = curl)
grepl('Logout', html)

R - posting a login form using RCurl

I am new to using R to post forms and then download data off the web. I have a question that is probably very easy for someone out there to spot what I am doing wrong, so I appreciate your patience. I have a Win7 PC and Firefox 23.x is my typical browser.
I am trying to post the main form that shows up on
http://www.aplia.com/
I have the following R script:
your.username <- 'username'
your.password <- 'password'
setwd( "C:/Users/Desktop/Aplia/data" )
require(SAScii)
require(RCurl)
require(XML)
agent="Firefox/23.0"
options(RCurlOptions = list(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl")))
curl = getCurlHandle()
curlSetOpt(
cookiejar = 'cookies.txt' ,
useragent = agent,
followlocation = TRUE ,
autoreferer = TRUE ,
curl = curl
)
# list parameters to pass to the website (pulled from the source html)
params <-
list(
'userAgent' = agent,
'screenWidth' = "",
'screenHeight' = "",
'flashMajor' = "",
'flashMinor' = "",
'flashBuild' = "",
'flashPatch' = "",
'redirect' = "",
'referrer' = "http://www.aplia.com",
'txtEmail' = your.username,
'txtPassword' = your.password
)
# logs into the form
html = postForm('https://courses.aplia.com/', .params = params, curl = curl)
html
# download a file once form is posted
html <-
getURL(
"http://courses.aplia.com/af/servlet/mngstudents?ctx=filename" ,
curl = curl
)
html
But I can tell that I am not getting the page I want, because what is returned into html is a redirect message that appears to be asking me to log in again (?):
"\r\n\r\n<html>\r\n<head>\r\n <title>Aplia</title>\r\n\t<script language=\"JavaScript\" type=\"text/javascript\">\r\n\r\n top.location.href = \"https://courses.aplia.com/af/servlet/login?action=form&redirect=%2Fservlet%2Fmngstudents%3Fctx%3Dfilename\";\r\n \r\n\t</script>\r\n</head>\r\n<body>\r\n Click here to continue.\r\n</body>\r\n</html>\r\n"
I do believe there is a series of redirects that occurs once the form is posted successfully (manually, in a browser). How can I tell whether the form was posted correctly?
I am quite sure that once I can get the post working correctly, I won't have a problem directing R to download the files I need (online activity reports for each of my 500 students this semester). But I have spent several hours working on this and got stuck. Maybe I need to set more options in the RCurl package that have to do with cookies (as the site does use cookies)?
Any help is much appreciated! I typically use R to handle statistical data, so I am new to these packages and functions.
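The answer ends up being very simple. For some reason, I didn't see one option that needs to be included in postForm:
html = postForm('https://courses.aplia.com/', .params = params, curl = curl, style="POST")
And that's it. With style="POST", RCurl sends the form as application/x-www-form-urlencoded instead of its default multipart/form-data, which is presumably what this login form expects.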
