How to properly set cookies to get URL content using httr - r

I need to download information from a web site that is protected by a login. I get past this protection manually in a browser and then pass the cookies to httr.
Here is a similar topic, but it does not solve my problem: (Copying cookie for httr)
library(httr)
url<-"http://smida.gov.ua/db/emitent/year/xml/showform/32153/125/templ"
cook<-"_SMIDA=9117a9eb136353bd6956651bd59acd37; __utmt=1; __utma=29983421.1729484844.1413489369.1413625619.1413627797.3; __utmb=29983421.7.10.1413627797; __utmc=29983421; __utmz=29983421.1413489369.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none)"
response <- GET(url, config(cookie= cook))
content(x = response,as = 'text', encoding = "UTF-8")
So when I use content() it returns the page saying that I am not logged in (the same as when I send no cookie at all).
How can I solve this problem?
Test credentials are login: mytest2, pass: qwerty12.

This would be the way to use set_cookies() with GET and httr:
GET("http://smida.gov.ua/db/emitent/year/xml/showform/32153/125/templ",
set_cookies(`_SMIDA` = "7cf9ea4bfadb60bbd0950e2f8f4c279d",
`__utma` = "29983421.138599299.1413649536.1413649536.1413649536.1",
`__utmb` = "29983421.5.10.1413649536",
`__utmc` = "29983421",
`__utmt` = "1",
`__utmz` = "29983421.1413649536.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none)"))
That worked for me, or at least I think it did, as I cannot read the language. A table comes back with the same structure and no prompt to log in.
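If you would rather not split the string into individual set_cookies() arguments, a minimal alternative sketch (not verified against this site) is to send the raw cookie string copied from the browser as a Cookie header:
library(httr)
url  <- "http://smida.gov.ua/db/emitent/year/xml/showform/32153/125/templ"
cook <- "_SMIDA=9117a9eb136353bd6956651bd59acd37; __utmt=1; __utma=29983421.1729484844.1413489369.1413625619.1413627797.3; __utmb=29983421.7.10.1413627797; __utmc=29983421; __utmz=29983421.1413489369.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none)"
# add_headers() sends the string verbatim, exactly as the browser would
response <- GET(url, add_headers(Cookie = cook))
content(response, as = "text", encoding = "UTF-8")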
Unfortunately the CAPTCHA on the login page prevents the use of RSelenium (or other, similar, crawling packages), so you'll have to keep grabbing those cookies manually (or use a utility to extract them from the session).
Finally, you probably want to seriously consider changing those credentials, now :-)
EDIT: #VadymB and I both found that the code didn't work until we rebooted RStudio. Your mileage may vary.

Related

In R, can I get a redirect URL returned from an httr::BROWSE statement?

First of all, I am not the best programmer, so please excuse me if I ask something stupid.
I have a question about the following code (in language R) which I have written in order to get an authentication code for the Withings API:
library(httr)
my_client_id = "..." #deleted because it is secret
my_redirect_uri = "..." #deleted because it is secret
my_scope="user.activity,user.metrics,user.info"
access_url = "https://wbsapi.withings.net/v2/oauth2"
authorize_url = "https://account.withings.com/oauth2_user/authorize2"
my_response_type = "code"
my_state = "..." #deleted because it is secret
httr::BROWSE(authorize_url, query = list(response_type = my_response_type,
                                         client_id = my_client_id,
                                         redirect_uri = my_redirect_uri,
                                         scope = my_scope,
                                         state = my_state))
This code successfully opens the URL
http://%22https://account.withings.com/oauth2_user/account_login?response_type=code&client_id=...&redirect_uri=...&scope=user.activity%2Cuser.metrics%2Cuser.info&state=...&b=authorize2%22
where I can enter my e-mail address and password. After that, it redirects me to the URL
http://.../?code=...&state=...
where the first dots are my redirect URL. This gives me the code I need for getting the access token. I have tested it, i.e. I used this code to request an access token, and I was successful.
The problem is that I have to manually copy/paste the code from the URL in my browser into the POST statement I use to get the access token, and I would like to automate that. So I would like the URL containing the code to be returned to me so that I can parse it and extract the code. I know how to extract the code once I have the URL, but I have no idea how to avoid the copying/pasting, and I am not even sure it is possible. If it is, does anyone have an idea what I could add to my existing code, or how I could change it, in order to get back the URL with the code (apart from doing it manually)?
I am very happy about any help and I want to say thank you in advance!
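One way to avoid the copy/paste step, sketched here as an untested idea rather than a confirmed solution, is to let httr run the whole flow with oauth2.0_token(): it starts a local listener (via httpuv) and captures the ?code=... parameter from the redirect automatically. This assumes http://localhost:1410/ is registered with Withings as the redirect URI, that my_client_secret holds your client secret (not shown in the original snippet), and that Withings' somewhat non-standard token endpoint accepts the default request, so it may still need tweaking:
library(httr)
# Assumed registration values -- replace with your own
app <- oauth_app(
  appname      = "withings_demo",
  key          = my_client_id,
  secret       = my_client_secret,          # client secret from the Withings dashboard
  redirect_uri = "http://localhost:1410/"   # must be registered with Withings
)
endpoint <- oauth_endpoint(
  authorize = "https://account.withings.com/oauth2_user/authorize2",
  access    = "https://wbsapi.withings.net/v2/oauth2"
)
# Opens the browser for login; the local listener then catches the redirect
# and extracts the code for you
token <- oauth2.0_token(endpoint, app,
                        scope = "user.activity,user.metrics,user.info",
                        cache = FALSE)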

Rvest doesn't recognize a form

I'm trying to parse a website that requires logging in to a session, using rvest.
I'm using this code to begin:
login<-"https://www.drugs.com/account/login/"
session<-html_session(login)
form<-html_form(session)
But even after extracting all the forms, it only recognizes the "Advanced Search" form and not the login form.
Do you have an idea why this happens? I was wondering if the login form requires JavaScript or something like that.
Thank you,
Vitruves
Depending on where you are, I believe the problem may be the EU GDPR consent. The first time I opened the website it asked me to accept cookies in order to log in. Accepting set the following cookie in my browser:
ddbab21688799cacb48f7d384642573f = "agree"
and only after that displayed the log-in form. For me the name of the cookie was always set to the same value, but if that is not always the case then you may have to accept the consent prompt within your rvest session to have the cookie set.
If I set the cookie when opening the rvest session, I get two forms returned, one of which is the log-in form.
You can set the cookie as follows:
login <- "https://www.drugs.com/account/login/"
session <- html_session(login, httr::set_cookies(ddbab21688799cacb48f7d384642573f = "agree"))
form <- html_form(session)
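From there, a rough sketch of filling in and submitting the log-in form within the same session; the index [[2]] and the field names username and password are assumptions, so inspect the output of html_form(session) to confirm them:
library(rvest)
library(httr)

login   <- "https://www.drugs.com/account/login/"
session <- html_session(login, set_cookies(ddbab21688799cacb48f7d384642573f = "agree"))

forms      <- html_form(session)
login_form <- forms[[2]]   # assuming the second form returned is the log-in form

# Fill in the credentials and submit within the same session
filled    <- set_values(login_form, username = "your_user", password = "your_password")
logged_in <- submit_form(session, filled)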

Python 3.5 requests for crawling

I have a coding problem regarding Python 3.5 web crawling.
I am trying to use 'requests.get' to extract the real link from 'http://www.baidu.com/link?url=ePp1pCIHlDpkuhgOrvIrT3XeWQ5IRp3k0P8knV3tH0QNyeA042ZtaW6DHomhrl_aUXOaQvMBu8UmDjySGFD2qCsHHtf1pBbAq-e2jpWuUd3'. An example of the code is below:
import requests
response = requests.get('http://www.baidu.com/link?url=ePp1pCIHlDpkuhgOrvIrT3XeWQ5IRp3k0P8knV3tH0QNyeA042ZtaW6DHomhrl_aUXOaQvMBu8UmDjySGFD2qCsHHtf1pBbAq-e2jpWuUd3')
c = response.url
I expected 'c' to be 'caifu.cnstock.com/fortune/sft_jj/tjj_yndt/201605/3787477.htm'. (I removed http:// from the link as I can't post two links in one question.)
However, it doesn't work; it keeps returning the same link I put in.
Can anyone help with this? Many thanks in advance.
Thanks a lot to Charlie.
I have found the solution. I first use .content.decode() to read the response body, but that is mixed up with a lot of irrelevant information. I then use re.findall() to extract the redirect URL, which should be the first URL appearing in the response body. Then I use requests.get() to retrieve the page. Below is the code:
import re
import requests
url = 'http://www.baidu.com/link?url=ePp1pCIHlDpkuhgOrvIrT3XeWQ5IRp3k0P8knV3tH0QNyeA042ZtaW6DHomhrl_aUXOaQvMBu8UmDjySGFD2qCsHHtf1pBbAq-e2jpWuUd3'
rep1 = requests.get(url)
cont = rep1.content.decode('utf-8')
extract_cont = re.findall('"([^"]*)"', cont)  # the first quoted string in the page is the redirect target
redir_url = extract_cont[0]
rep = requests.get(redir_url)
You may consider looking into the response headers for a 'location' header.
response.headers['location']
You may also consider looking at the response history, which contains a response object for each hop in a chain of redirects:
response.history
Your sample URL doesn't redirect at the HTTP level; the response is a 200, and the page then changes location with JavaScript (window.location). The requests library won't follow this type of redirect.
<script>window.location.replace("http://caifu.cnstock.com/fortune/sft_jj/tjj_yndt/201605/3787477.htm")</script>
<noscript><META http-equiv="refresh" content="0;URL='http://caifu.cnstock.com/fortune/sft_jj/tjj_yndt/201605/3787477.htm'"></noscript>
If you know you will always be using this one service, you could parse the response, maybe using regex.
If you don't know what service will always be used and also want to handle every possible situation, you might need to instantiate a WebKit instance or something and somehow try to determine when it finally finishes. I'm sure there's a page load complete event which you could use, but you still might have pages that do a window.location change after the page is loaded using a timer. This will be very heavyweight and still not cover every conceivable type of redirect.
I recommend starting by writing a special handler for each type of edge case and falling back on a default handler that just looks at response.url. As new edge cases come up, write new handlers. It's kind of a 'trial and error' approach.

Using Datazen in an iframe with external authentication

I was able to successfully use external authentication with Datazen via HttpWebRequest from code-behind in VB.NET, but I am unclear how to use this with an iframe or even a div. I'm thinking maybe the authorization cookie/token isn't following the iframe around? The Datazen content starts to load correctly, but then it redirects back to the login page as if it's not authenticated. I'm not sure how to do that part; this stuff is pretty new to me and any help would be greatly appreciated!
Web page errors include:
-OPTIONS url send # jquery.min.js:19b.extend.ajax # jquery.min.js:19Viewer.Controls.List.ajax # Scripts?page=list:35Viewer.Controls.List.load # Scripts?page=list:35h.callback # Scripts?page=list:35
VM11664 about:srcdoc:1
XMLHttpRequest cannot load http://datazenserver.com/viewer/jsondata. Response for preflight has invalid HTTP status code 405Scripts?page=list:35
load(): Failed to load JSON data. V…r.C…s.List {version: "2.0", description: "KPI & dashboard list loader & controller", url: "/viewer/jsondata", index: "/viewer/", json: null…}(anonymous function) # Scripts?page=list:35c # jquery.min.js:4p.fireWith # jquery.min.js:4k # jquery.min.js:19r # jquery.min.js:19
Scripts?page=list:35
GET http://datazenserver.com/viewer/login 403 (Forbidden)(anonymous function) # Scripts?page=list:35c # jquery.min.js:4p.fireWith # jquery.min.js:4k # jquery.min.js:19r # jquery.min.js:19
' ''//////////////////////////////////
Dim myHttpWebRequest As HttpWebRequest = CType(WebRequest.Create("http://datazenserver.com/"), HttpWebRequest)
myHttpWebRequest.CookieContainer = New System.Net.CookieContainer()
Dim authInfo As String = Session("Email")
myHttpWebRequest.AllowAutoRedirect = False
myHttpWebRequest.Headers.Add("headerkey", authInfo)
myHttpWebRequest.Headers.Add("Access-Control-Allow-Origin", "*")
myHttpWebRequest.Headers.Add("Access-Control-Allow-Headers", "Accept, Content-Type, Origin")
myHttpWebRequest.Headers.Add("Access-Control-Allow-Methods", "GET, POST, PUT, DELETE, OPTIONS")
Dim myHttpWebResponse As HttpWebResponse = CType(myHttpWebRequest.GetResponse(), HttpWebResponse)
Response.AppendHeader("Access-Control-Allow-Origin", "*")
' Create a new 'HttpWebRequest' Object to the mentioned URL.
' Assign the response object of 'HttpWebRequest' to a 'HttpWebResponse' variable.
Dim streamResponse As Stream = myHttpWebResponse.GetResponseStream()
Dim streamRead As New StreamReader(streamResponse)
frame1.Page.Response.AppendHeader("Access-Control-Allow-Origin", "*")
frame1.Page.Response.AppendHeader("headerkey", authInfo)
frame1.Attributes("srcdoc") = "<head><base href='http://datazenserver.com/viewer/' target='_blank'/></head>" & streamRead.ReadToEnd()
You might have to do more of this client-side, and I don't know whether you'll be able to because of security concerns.
External authentication in Datazen looks something like this:
User-Agent           | Proxy                 | Server
---------------------|-----------------------|------------------------------------
1. /viewer/home  --> | 2. Append header  --> | 3. Check cookie (not present)
                 <-- | 5. Forward        <-- | 4. Redirect to /viewer/login
6. /viewer/login --> | 7. Append header  --> | 8. Append cookie
                 <-- | 10. Forward       <-- | 9. Redirect to /viewer/home
11. /viewer/home --> | 12. Append header --> | 13. Check cookie (valid)
                 <-- | 15. Forward       <-- | 14. Give content
16. ................ Whatever the user wanted ..............................
So even though you're working off a proxy with a header, you're still getting a cookie back that it uses.
Now, that's just context.
My guess, from your description of the symptoms, is that myHttpWebResponse should have a cookie set (DATAZEN_AUTH_TOKEN, I believe), but it's essentially getting thrown out--you aren't using it anywhere.
You would need to tell your browser client to append that cookie to any subsequent (iframe-based) requests to the domain of your Datazen server, but I don't believe that's possible due to security restrictions. I don't know a whole lot about CORS, though, so there might be a way to permit it.
I don't know whether there's any good way to do what you're looking to do here. At best, I can maybe think of a start to a hack that would work, but I can't even find a good way to make that work, and you really wouldn't want to go there.
Essentially, if you're looking to embed Datazen in an iframe, I would shy away from external authentication. I'd shy away from it regardless, but especially there.
But, if you're absolutely sure you need it over something like ADFS, you'll need some way to get that cookie into your iframe requests.
The only way I can think to make this work would be to put everything on the same domain:
www.example.com
datazen.example.com (which is probably your proxy)
You could then set a cookie from your response that stores some encrypted (and ideally expiring) form of Session("Email"), and pass it back down in your HTML.
That makes your iframe relatively simple, because you can just tell it to load the viewer home. Something to the effect of:
<iframe src="//datazen.example.com/viewer/home"></iframe>
In your proxy, you'll detect the cookie set by your web server, decrypt the email token, ensure it isn't expired, then set a header on the subsequent request onto the Datazen server.
This could be simplified at a couple places, but this should hold as true as possible to your original implementation, as long as you can mess with DNS settings.
I suppose another version of this could involve passing a parameter to your proxy, and sharing some common encryption key. That would get you past having to be on the same domain.
So if you had something like:
var emailEncrypted = encrypt(Session("Email") + ":somesalt:" + DateTime.UtcNow.ToString("O"));
Then used whatever templating language you want to set your iframe up with:
<iframe src="//{{ customDomain }}/viewer/home?emailkey={{ emailEncrypted }}"></iframe>
Then, if your proxy detected that emailkey parameter, decrypted it, and checked it for expiration, that could work.
Now you'd have a choice to make on how to handle this, because Datazen will give you a 302 to /viewer/login to get a cookie, and you need to make sure to pass the correct emailkey on through that.
What I would do, you could accept that emailkey parameter in your proxy, set a completely new cookie yourself, then watch for that cookie on subsequent requests.
Although at that point, it would probably be reasonable to switch your external authentication mode to just use cookies. That's probably a better version of this anyway, assuming this is the only place you use Datazen, and you'd be safe to change something so fundamental. That would substantially reduce your business logic.
But, you wouldn't have to. If you didn't want to change that, you could just check for the cookie, and turn it into a header.
You should do (1), but just for good measure, one thing I'm not sure on, is whether you can pass users directly to /viewer/login to get a cookie from Datazen. Normally you wouldn't, but it seems like you should be able to.
Assuming it works as expected, you could just swap that URL out for that. As far as I know (although I'd have to double-check this), the header is actually only necessary once, to set up the cookie. So if you did that, you should get the cookie, then not need the URL parameter anymore, so the forced navigation would be no concern.
You'll, of course, want to make sure you've got a good form of encryption there, and the expiration pattern is important. But you should be able to secure that if you do it right.
I ended up just grabbing the username and password fields and entering them with JavaScript. But this piece helped me a ton. You have to make sure you set
document.domain = 'basedomain.com';
in JavaScript on both sites in order to access the iframe contents, or else you'll run into cross-domain issues.

Scraping https website using getURL

I had a nice little package to scrape Google Ngram data, but I have discovered they have switched to SSL and my package has broken. Switching from readLines to getURL gets me some of the way there, but some of the script included in the page is missing. Do I need to get fancy with user agents or something?
Here is what I have tried so far (pretty basic):
library(RCurl)
myurl <- "https://books.google.com/ngrams/graph?content=hacker&year_start=1950&year_end=2000"
getURL(myurl)
Comparing the results to viewing the source after entering the URL in a browser shows that the crucial content is missing from the results returned to R. In the browser, the source includes content looking like this:
<script type="text/javascript">
var data = [{"ngram": "hacker", "type": "NGRAM", "timeseries": [9.4930387994907051e-09,
1.1685493106483591e-08, 1.0784501440023556e-08, 1.0108472218003532e-08,
etc.
Any suggestions would be greatly appreciated!
Sorry, not a direct solution, but it doesn't seem to be a user-agent problem. When you open your URL in a browser, you can see that there is a redirect that adds a parameter to the end of the address: direct_url=t1%3B%2Chacker%3B%2Cc0.
If you use getURL() to download this new URL, complete with the new parameter, the JavaScript you mention is present in the result.
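For example, something like this (a quick sketch; the direct_url value is the one added by the browser redirect described above, and ssl.verifypeer = FALSE is only needed if certificate verification gets in the way):
library(RCurl)

myurl <- paste0(
  "https://books.google.com/ngrams/graph?content=hacker&year_start=1950&year_end=2000",
  "&direct_url=t1%3B%2Chacker%3B%2Cc0"
)
page <- getURL(myurl, ssl.verifypeer = FALSE)
grepl("var data", page)   # TRUE if the embedded timeseries script came through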
Another solution could be to try to access the data via Google BigQuery, as mentioned in this SO question:
Google N-Gram Web API
