I am trying to use RCurl along with the XML package to download and mine articles from the WSJ (Wall Street Journal). However, whenever I use getURL from RCurl, I only get the version of the article that is available to public viewers.
What I would like to be able to do is download the full version of the article, as I am a paying member. I imagine I have to pass my login credentials when I call getURL, but I am not sure how to do so.
Is this information stored in cookies?
Do I need to be "authenticated" - whatever the difference (in purpose, maybe) is?
I would appreciate it if someone could explain how websites such as the WSJ use login info to fetch data, and how I can tweak RCurl to take such information into account. A very simple example would go a long way in explaining the different concepts of setting cookies (files, jar, ...) etc.
Thank you in advance
Typically, authentication information is not stored in cookies. Instead, a "session cookie" is stored on your computer; it refers to authentication state held on the server. See the Session management article on Wikipedia for a bit more information and pointers.
So basically you will need to create a cookie jar file for this site, log in with curl (this can be painful, as the WSJ doesn't use a standard form-based POST but instead relies on JavaScript), and then you'll be able to tell curl to re-use the cookie for subsequent requests for the articles. Read this answer to see how to do it in practice.
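In RCurl terms, a hedged sketch of the cookie-jar approach might look like this. The login URL and form field names below are placeholders only; the WSJ's real login flow is JavaScript-driven and will differ, so treat this as an outline of the mechanics rather than working code for that site.

library(RCurl)

# one curl handle shared across requests so the session cookie is kept
curl <- getCurlHandle(cookiefile = "cookies.txt",  # read cookies from this file
                      cookiejar  = "cookies.txt",  # write cookies back when the handle is cleaned up
                      followlocation = TRUE)

# log in once; the URL and field names are hypothetical placeholders
postForm("https://example.com/login",
         username = "me@example.com",
         password = "secret",
         style = "POST",
         curl = curl)

# later requests on the same handle send the session cookie along
article <- getURL("https://example.com/some-article", curl = curl)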
Other than HTTP PUT and POST, what other methods can a web application designer use to allow users to upload content (either files or listbox text) from a page of his web app to a remote server?
On the same topic, I was wondering what technology/APIs does a service like Google Docs or Google Drive use? The reason I ask this is: Our Sys Admin has disabled file uploading (via Squid proxy), yet I was able to create and share a document using Google Docs / Google Drive.
Many thanks in advance,
/HS
This depends on the server in question, as the standard set of HTTP commands can be extended, and some may not be configured/allowed. One of the common commands is "OPTIONS", which asks "what can I do?".
But to answer more helpfully: you generally have two main options:
POST (the one you probably want to use, as it's nearly always available; a sketch follows the reading link below).
GET. You could use GET (but I'm NOT advocating it - just saying you could - you should not use a GET to make changes to the server). There are problems with this approach (including size of files, manually handling the encoding, etc.) but it's possible if you have to go this route.
PUT is often not enabled on servers for security reasons.
More reading: http://www.w3.org/Protocols/rfc2616/rfc2616-sec9.html
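To make the POST option concrete, here is a hedged sketch of a multipart file upload using R's RCurl package (the same toolkit used elsewhere on this page); the upload URL and form field names are made up for illustration.

library(RCurl)

# style = "HTTPPOST" sends a multipart/form-data request, i.e. a normal file upload
postForm("https://example.com/upload",
         file    = fileUpload(filename = "report.csv", contentType = "text/csv"),
         comment = "uploaded from a script",
         style   = "HTTPPOST")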
Edit: if "file uploading" is prevented by the proxy, have you tried encoding the POST? i.e. as opposed to sending a multipart POST, try encoding the files yourself into the POST string and sending that instead? Or encode the file, split it into multiple small POSTs, and piece them together at the other end?
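As a hedged sketch of that "encode it yourself" idea, again using RCurl: base64-encode the file and send it as an ordinary form field instead of a multipart upload. The receiving URL and field names are hypothetical, and the server would of course need to decode and reassemble on its end.

library(RCurl)

# read the file as raw bytes and base64-encode it
bytes   <- readBin("report.csv", what = "raw", n = file.info("report.csv")$size)
encoded <- as.character(base64Encode(bytes))

# send it as a plain urlencoded POST field rather than a multipart upload
postForm("https://example.com/receive",
         filename = "report.csv",
         data     = encoded,
         style    = "POST")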
Google Docs uses a mixture of POST and GET, with POST for the updates. I don't know about Google Drive.
I saw this previous post but I have not been able to adapt the answer to get my code to work.
I am trying to filter on the term bruins and need to reference cacert.pem for authentication on my Windows machine. Lastly, I have written a function to parse each response (my.function) and need to include this as well.
postForm("https://stream.twitter.com/1/statuses/sample.json",
userpwd="user:pass",
cainfo = "cacert.pem",
a = "bruins",
write=my.function)
I am looking to stay completely within R and unfortunately need to use Windows.
Simply, how can I include the search term(s) that I want such that the response is filtered?
Thanks in advance.
Alright, so I've looked at what you're doing, and some of what you're working on may be helped by examining the Twitter API methods, although it can be difficult to figure out how to translate some of the examples into R (via the RCurl Package).
What you're currently trying is very close to what you need to do; you just need to change a few things.
First of all, you're querying the url for the random sample of statuses. This url returns a random sample of roughly 1% of all tweets.
If you're interested in collecting only tweets about specific keywords, you want to use the filter API url: "https://stream.twitter.com/1/statuses/filter.json"
After changing that, you simply need to change your parameter from "a" to "postfields", and the parameter you'd be passing would look like: "track=bruins"
Finally, you should use the getURL function, to open a continuous stream, so all tweets with your keywords can be collected, rather than using the postForm command (which I believe is intended for HTML forms).
So your final function call should look like the following (note that libcurl's option for a write callback is "writefunction", not "write"):
getURL("https://stream.twitter.com/1/statuses/filter.json",
       userpwd = "Username:Password",
       cainfo = "cacert.pem",
       writefunction = my.function,  # called for each chunk of data as it streams in
       postfields = "track=bruins")
For manipulating twitter, use the twitteR package.
library(twitteR)
searchTwitter("bruins")
You can include other parameters (like cainfo) in the call to searchTwitter, and they should get passed to getForm underneath.
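For example (a hedged sketch, assuming cacert.pem sits in the working directory as in the question):

library(twitteR)
tweets <- searchTwitter("bruins", n = 100, cainfo = "cacert.pem")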
I don't think the Streaming API is currently included in twitteR - the search api is different (it's backward looking, whereas streaming is "current looking").
From my understanding, streaming is quite different from how lots of APIs typically work; rather than pulling data from a web service and having a defined object returned, you're setting up a "pipe" for Twitter to push data to you, and you then listen for that response.
You also need to worry about OAuth I think (which twitteR does deal with).
Is there any reason that you want to stay entirely in R? I've used Python successfully with the Streaming API and a package called tweepy to write data to a MySQL database, and then used R to query and analyse the data.
Last time I checked, twitteR did not talk to the streaming API. Moreover, as far as I know, very few publicly-available Twitter Streaming API connection libraries in any language honor Twitter's recommendations on reconnecting when Streaming disconnects / throws an error.
My recommendation is to access Streaming via a library that's actively maintained, write the re-connection protocol yourself if you have to, and persist the data into a database that handles JSON natively. I'm about to embark on a project of this nature and will be writing the collector in Perl, doing my own re-connect logic and persisting into either PostgreSQL or MongoDB. Most likely it will be MongoDB; PostgreSQL doesn't get native JSON till 9.2.
Late to the game, I know, but you'll want to use the "streamR" package for access to Twitter's streaming API.
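A minimal filterStream sketch with streamR is below. It assumes you have already created an OAuth credential with the ROAuth package and saved it as my_oauth; the file names and timeout are placeholders.

library(streamR)

load("my_oauth.Rdata")                          # a previously saved OAuth credential object
filterStream(file.name = "bruins_tweets.json",  # matching tweets are appended to this file
             track = "bruins",                  # keyword filter, as in the question
             timeout = 60,                      # collect for 60 seconds (0 keeps the stream open indefinitely)
             oauth = my_oauth)
tweets.df <- parseTweets("bruins_tweets.json")  # parse the collected JSON into a data frame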
I've a home-made PHP-based web calendar which I would like my users to import into Google Calendar, iCal, etc., so they have up-to-date information available on their calendar of choice. I understand providing a webcal link is the way to go, but I am not sure how to create it. I've downloaded an example .ics file but it did not have much info.
Where can I find more info on creating a webcal feed? Also, does webcal allow authentication? The feed will most likely be password protected.
Thanks!
A webcal feed uses the iCalendar format as defined in RFC 5545. It's a rather complex and cumbersome format. You'll find simple examples on Wikipedia which may fit your needs (a minimal one is also sketched after the list below). You could also opt to use a library to abstract the format, such as:
http://framework.zend.com/svn/framework/laboratory/Zend_Ical/
http://bennu.sourceforge.net/
http://sourceforge.net/projects/icalcreator/
http://sabre.io/vobject/
Of all of these, the last one is probably your best bet (the others seemed to be dead the last time I checked).
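For reference, a stripped-down single-event feed looks roughly like this (a sketch of the required skeleton rather than a validated feed; the UID, PRODID and dates are placeholders):

BEGIN:VCALENDAR
VERSION:2.0
PRODID:-//Example Corp//Home Calendar//EN
BEGIN:VEVENT
UID:20130615T180000-0001@example.com
DTSTAMP:20130601T120000Z
DTSTART:20130615T180000Z
DTEND:20130615T190000Z
SUMMARY:Dinner with friends
END:VEVENT
END:VCALENDAR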
As for authentication, you may use basic HTTP authentication, or use a secret token in the feed URL to identify the user (as Google Calendar does for its private feeds). Either way, you should probably serve the feed over a secure connection (SSL) so data (and passwords) are not sent in the clear.
And finally, I would recommend using the webcal:// or webcals:// scheme for ease of use for the end user, but you may run into trouble with some clients (e.g. Outlook 2007 and forced SSL). I don't have a works-for-all solution for that yet...
EDIT
I forgot to mention the ICS validator, in case you don't use a lib.
I've been trying (unsuccessfully, I might add) to scrape a website created with the Microsoft stack (ASP.NET, C#, IIS) using Python and urllib/urllib2. I'm also using cookielib to manage cookies. After spending a long time profiling the website in Chrome and examining the headers, I've been unable to come up with a working solution to log in. Currently, in an attempt to get it to work at the most basic level, I've hard-coded the encoded URL string with all of the appropriate form data (even View State, etc..). I'm also passing valid headers.
The response that I'm currently receiving reads:
29|pageRedirect||/?aspxerrorpath=/default.aspx|
I'm not sure how to interpret the above. Also, I've looked pretty extensively at the client-side code used in processing the login fields.
Here's how it works: You enter your username/pass and hit a 'Login' button. Pressing the Enter key also simulates this button press. The input fields aren't in a form. Instead, there's a few onClick events on said Login button (most of which are just for aesthetics), but one in question handles validation. It does some rudimentary checks before sending it off to the server-side. Based on the web resources, it definitely appears to be using .NET AJAX.
When logging into this website normally, you request the domain as a POST with form data containing your username and password, among other things. Then there is some sort of URL rewrite or redirect that takes you to a content page at url.com/twitter. When attempting to access url.com/twitter directly, it redirects you to the main page.
I should note that I've decided to leave the URL in question out. I'm not doing anything malicious, just automating a very monotonous check once every reasonable increment of time (I'm familiar with compassionate screen scraping). However, it would be trivial to associate my StackOverflow account with that account in the event that it didn't make the domain owners happy.
My question is: I've been able to successfully log in and automate services in the past, none of which were .NET-based. Is there anything different that I should be doing, or maybe something I'm leaving out?
For anyone else that might be in a similar predicament in the future:
I'd just like to note that I've had a lot of success with a Greasemonkey user script in Chrome to do all of my scraping and automation. I found it to be a lot easier than Python + urllib2 (at least for this particular case). The user scripts are written in 100% Javascript.
When scraping a web application, I use either:
1) WireShark ... or...
2) A logging proxy server (that logs headers as well as payload)
I then compare what the real application does (in this case, how your browser interacts with the site) with the scraper's logs. Working through the differences will bring you to a working solution.
I'm attempting to write a script to log on (with a username and password) to a website and download a file. I've tried using the WebClient and WebBrowser classes to no avail; they don't seem to work for what I need them to do.
Does anyone else have any suggestions?
I'd suggest you look at this StackOverflow thread. This is not specific to any programming language or platform (you didn't mention which language/platform you're using, so I can't offer any specific code advice) but in my answer I detail the basic approach you'll need to take for successful HTML screen-scraping.
In a nutshell, first you'll need to figure out the right combination of HTTP headers, URLs, and POST data to fool the server into thinking your client app is a "real" browser. A tool like Fiddler, which allows you to see the actual HTTP requests going over the wire and to experimentally build new requests based on browser requests, is invaluable here.
Next you'll need to figure out, in your language and platform, how to produce that set of headers, URLs, and/or POST data. Typically, handling cookies and redirects is the hardest part, and most platforms have specialized classes (e.g. CookieContainer in .NET) to help with things like this.