.NET Web Scripting - http

I'm attempting to right a script to logon (username and password) on to a website and download a file. I've tried using the webClient and webBrowser classes to no avail, they don't seem to work for what I need them to do.
Does anyone else have any suggestions?

I'd suggest you look at this StackOverflow thread. This is not specific to any programming language or platform (you didn't mention which language/platform you're using, so I can't offer any specific code advice) but in my answer I detail the basic approach you'll need to take for successful HTML screen-scraping.
In a nutshell, first you'll need to the right combination of HTTP headers, URLs, and POST data to fool the server into thinking your client app is a "real" browser. A tool like Fiddler which allows you to see actual HTTP requests going over the wire, and experimentally build new requests based on browser requests, is invaluable here.
Next you'll need to figure out, in your language and platform, how to produce that set of headers, URLs, and/or POST data. Typically, handling cookies and redirects are the hardest-- and most platforms have specialized classes (e.g. CookieContainer in .NET) to help with things like this.

Related

Best practices - When to use Server / Client code

I was searching for information about one of my doubts, but I couldn't find any. I'm working in an ASP.NET site and using AJAX to require data, since I'm currently working on my own, I don't know web programming's best practices.
I usually get all the information I need from the server and use Javascript to display / Modify it and AJAX to send it back to the server. A friend of mine uses PHP for most part of the programming, He rarelly uses any javascript and he told me it's way faster this way, since it does not consume the client's resources.
The basic question actually is:
According to the best practices, is it better for the server just to provide the data needed for the
application or is better you use the server for more than this?
That is going to depend on the expected amount of traffic for the site, the amount of content being generated, and the expectations of the end-user.
In a high-traffic site, it is actually "faster" for the end-user if you let javascript generate a portion of the content on the client side. Also, you can deliver a better user experience with long load times through client side scripting than you can if the content is loaded completely on the server.
In most cases you would need at least some backend code. E.g. when validating user input or when retrieving information from a real persistent database. Or what about when somebody has javascript disabled in his user-agent or somebody with a screenreader or searchengine crawlers?
IMHO you should at least (again in most cases) have the backend code which is able to do all the work and spit out a full webpage to the client. In addition to this you can add javascript functionality to make the user interface "smoother" by for example validating user data before submitting it to the server (remember to ALWAYS also check on the serverside) or by loading partial html (AJAX).
The point about being faster or using less resources when doing it serverside doesn't make much sense. Even if it does that it doesn't matter (but again I highly doubt this statement). If you use clientside scripting to only load parts that are needed it would rather use less resources on both the client- and the serverside.

How do I handle use 100 Continue in a REST web service?

Some background
I am planning to writing a REST service which helps facilitate collaboration between multiple client systems. Similar to how git or hg handle things I want the client to perform all merging locally and for the server to reject new changes unless they have been merged with existing changes.
How I want to handle it
I don't want clients to have to upload all of their change sets before being told they need to merge first. I would like to do this by performing a POST with the Expect 100 Continue header. The server can then verify that it can accept the change sets based on the header information (not hard for me in this case) and either reject the request or send the 100 Continue status through to the client who will then upload the changes.
My problem
As far as I have been able to figure out so far ASP.NET doesn't support this scenario, by the time you see the request in your controller actions the POST body has normally already been completely uploaded. I've had a brief look at WCF REST but I haven't been able to see a way to do it there either, their conditional PUT example has the full request body before rejecting the request.
I'm happy to use any alternative framework that runs on .net or can easily be made to run on Windows Azure.
I can't recommend WcfRestContrib enough. It's free, and it has a lot of abilities.
But I think you need to use OpenRasta instead of WCF in order to do what you're wanting. There's a lot of stuff out there on it, like wiki, blog post 1, blog post 2. It might be a lot to take in, but it's a .NET framework thats truly focused on being RESTful, and not RPC like WCF. And it has the ability work with headers, like you asked about. It even has PipelineContributors, which have access to the whole context of a call and can halt execution, handle redirections, or even render something different than what was expected.
EDIT:
As far as I can tell, this isn't possible in OpenRasta after all, because "100 continue is usually handled by the hosting environment, not by OR, so there’s no support for it as such, because we don’t get a chance to respond in the asp.net pipeline"

Scraping ASP.NET with Python and urllib2

I've been trying (unsuccessfully, I might add) to scrape a website created with the Microsoft stack (ASP.NET, C#, IIS) using Python and urllib/urllib2. I'm also using cookielib to manage cookies. After spending a long time profiling the website in Chrome and examining the headers, I've been unable to come up with a working solution to log in. Currently, in an attempt to get it to work at the most basic level, I've hard-coded the encoded URL string with all of the appropriate form data (even View State, etc..). I'm also passing valid headers.
The response that I'm currently receiving reads:
29|pageRedirect||/?aspxerrorpath=/default.aspx|
I'm not sure how to interpret the above. Also, I've looked pretty extensively at the client-side code used in processing the login fields.
Here's how it works: You enter your username/pass and hit a 'Login' button. Pressing the Enter key also simulates this button press. The input fields aren't in a form. Instead, there's a few onClick events on said Login button (most of which are just for aesthetics), but one in question handles validation. It does some rudimentary checks before sending it off to the server-side. Based on the web resources, it definitely appears to be using .NET AJAX.
When logging into this website normally, you request the domian as a POST with form-data of your username and password, among other things. Then, there is some sort of URL rewrite or redirect that takes you to a content page of url.com/twitter. When attempting to access url.com/twitter directly, it redirects you to the main page.
I should note that I've decided to leave the URL in question out. I'm not doing anything malicious, just automating a very monotonous check once every reasonable increment of time (I'm familiar with compassionate screen scraping). However, it would be trivial to associate my StackOverflow account with that account in the event that it didn't make the domain owners happy.
My question is: I've been able to successfully log in and automate services in the past, none of which were .NET-based. Is there anything different that I should be doing, or maybe something I'm leaving out?
For anyone else that might be in a similar predicament in the future:
I'd just like to note that I've had a lot of success with a Greasemonkey user script in Chrome to do all of my scraping and automation. I found it to be a lot easier than Python + urllib2 (at least for this particular case). The user scripts are written in 100% Javascript.
When scraping a web application, I use either:
1) WireShark ... or...
2) A logging proxy server (that logs headers as well as payload)
I then compare what the real application does (in this case, how your browser interacts with the site) with the scraper's logs. Working through the differences will bring you to a working solution.

When should one use GET instead of POST in a web application?

It seems that sticking to POST is the way to go because it results in clean looking URLs. GET seems to create long confusing URLs. POST is also better in terms of security. Good for protecting passwords in forms. In fact I hear that many developers only use POST for forms. I have also heard that many developers never really use GET at all.
So why and in what situation would one use GET if POST has these 2 advantages?
What benefit does GET have over POST?
you are correct, however it can be better to use gets for search pages and such. Places where you WANT the URL's to be obvious and discoverable. If you look at Google's (or any search page), it puts a www.google.com/?q=my+search at the end so people could link directly to the search.
You actually use GET much more than you think. Simply returning the web page is a GET request. There are also POST, PUT, DELETE, HEAD, OPTIONS and these are all used in RESTful programming interfaces.
GET vs. POST has no implications on security, they are both insecure unless you use HTTP/SSL.
Check the manual, I'm surprised that nobody has pointed out that GET and POST are semantically different and intended for quite different purposes.
While it may appear in a lot of cases that there is no functional difference between the 2 approaches, until you've tested every browser, proxy and server combination you won't be able to rely on that being a consistent in every case. e.g. mobile devices / proxies often cache aggressivley even where they are requested not to (but I've never come across one which incorrectly caches a POST response).
The protocol does not allow for anything other than simple, scalar datatypes as parameters in a GET - e.g. you can only send a file using POST or PUT.
There are also implementation constraints - last time I checked, the size of a URL was limited to around 2k in MSIE.
Finally, as you've noted, there's the issue of data visibility - you may not want to allow users to bookmark a URL containing their credit card number / password.
POST is the way to go because it results in clean looking URLs
That rather defeats the purpose of what a URL is all about. Read RFC 1630 - The Need For a Universal Syntax.
Sometimes you want your web application to be discoverable as in users can just about guess what a URL should be for a certain operation. It gives a nicer user experience and for this you would use GET and base your URLs on some sort of RESTful specification like http://microformats.org/wiki/rest/urls
If by 'web application' you mean 'website', as a developer you don't really have any choice. It's not you as a developer that makes the GET or POST requests, it's your user. They make the requests via their web browser.
When you request a web page by typing its URL into the address bar of the browser (or clicking a link, etc), the browser issues a GET request.
When you submit a web page using a button, you make a POST request.
In a GET request, additional data is sent in the query string. For example, the URL www.mysite.com?user=david&password=fish sends the two bits of data 'user' and 'password'.
In a POST request, the values in the form's controls (e.g. text boxes etc) are sent. This isn't visible in the address bar, but it's completely visible to anyone viewing your web traffic.
Both GET and POST are completely insecure unless SSL is used (e.g. web addresses beginning https).

Is there any reason not to use HTTP PUT and DELETE in a web application?

Looking around, I can't name a single web application (not web service) that uses anything besides GET and POST requests. Is there a specific reason for this? Do some browsers (or servers) not support any other types of requests? Or is this only for historical reasons? I'd like to make use of PUT and DELETE requests to make my life a little easier on the server-side, but I'm reluctant to because no one else does.
Actually a fair amount of people use PUT and DELETE, mostly for non-browser APIs. Some examples are the Atom Publishing Protocol and the Google Data APIs:
http://www.ietf.org/rfc/rfc5023.txt
http://code.google.com/apis/gdata/docs/2.0/basics.html
Beyond that, you don't see PUT/DELETE in common usage because most browsers don't support PUT and DELETE through Forms. HTML5 seems to be fixing this:
http://www.w3.org/TR/html5/forms.html#form-submission-0
The way it works for browser applications is: people design RESTful applications with PUT and DELETE in mind, then "tunnel" those requests through POSTs from the browser. For example, see this SO question on how Ruby on Rails accomplishes this using hidden fields:
How can I emulate PUT/DELETE for Rails and GWT?
So, you wouldn't be on your own designing your application with the larger set of HTTP verbs in mind.
EDIT: By the way, if you're curious about why PUT/DELETE are missing from browser based form posts, it turns out there's no real good technical reason. Reading around this thread on the rest-discuss mailing list, especially Roy Fielding's comments, is interesting for some context:
http://tech.groups.yahoo.com/group/rest-discuss/message/9620?threaded=1&var=1&l=1&p=13
EDIT: There are some comments on whether AJAX libraries support all the methods. It does come down to the actual browser implementation of XMLHttpRequest. I thought someone might find this link handy, which tests your browser to see how compliant the HttpRequest object is with various HTTP options.
http://www.mnot.net/javascript/xmlhttprequest/
Unfortunately, I don't know of a reference which collects these results.
Quite simply, the HTML 4.01 form element only allows the values "POST" and "GET" in its method attribute
Some proxy servers with tough security policies might drop them. I'm using PUT and DELETE anyways.
I've read that some browsers do not support other HTTP methods properly, though I can't name any specifics.
Rails, in particular, will pack your forms with a method parameter to explicitly set this even if the browser doesn't support those methods. That seems like a reasonable precaution if you're going to do this.
I say use all the features of HTTP, browsers be damned, lol. Maybe it'll inspire more complete and proper use of the HTTP protocol moving forward. There's more happening on the net than just POSTs and GETs. About time browser implementations reflected this.
This depends on your browser and Ajax library. For example jQuery supports all HTTP methods even though the browser may not. See for example the jQuery "ajax" documentation on the "type" attribute.
The Restlet Java framework lets you tunnel PUT and DELETE requests through HTML POST operations. To do this, you just add method=put or method=delete to your URI's query string, eg:
http://www.example.com/user=xyz?method=delete ...
This is the same as Ruby on Rails' approach (as described by #ars above).
Personally, I really don't see any purpose for using PUT or DELETE in a web application. All operations that an application performs are read or write, aka input output. Why do you need to distinguish the nature of the operation in the header of the HTTP request?
I could make ajax calls with the same url of form /object/object_id
and do multiple operations like delete, update, get the value, or create.
Just by looking at the URL, I have no clue which one it is.
By using GET and POST only, my urls will be:
/object/id/delete
/object/id/create
/object/id/update
/object/id --> implied GET
etc.
Based on my limited experience, this is a lot cleaner than hidden header request types in many cases.
I am not saying one should never use PUT or DELETE, just saying, use them only if absolutely needed.
Refer to "RESTful Web API" by Leonard Richardson to read more about different use cases and conventions regarding HTTP request methods in a RESTful web api.

Resources