Scraping ASP.NET with Python and urllib2

I've been trying (unsuccessfully, I might add) to scrape a website created with the Microsoft stack (ASP.NET, C#, IIS) using Python and urllib/urllib2. I'm also using cookielib to manage cookies. After spending a long time profiling the website in Chrome and examining the headers, I've been unable to come up with a working solution to log in. Currently, in an attempt to get it to work at the most basic level, I've hard-coded the URL-encoded string with all of the appropriate form data (even the ViewState, etc.). I'm also passing valid headers.
The response that I'm currently receiving reads:
29|pageRedirect||/?aspxerrorpath=/default.aspx|
I'm not sure how to interpret the above. Also, I've looked pretty extensively at the client-side code used in processing the login fields.
Here's how it works: You enter your username/pass and hit a 'Login' button. Pressing the Enter key also simulates this button press. The input fields aren't in a form. Instead, there are a few onClick events on said Login button (most of which are just for aesthetics), but the one in question handles validation. It does some rudimentary checks before sending the credentials off to the server side. Based on the web resources, it definitely appears to be using .NET AJAX.
When logging into this website normally, you request the domain as a POST with form data containing your username and password, among other things. Then, some sort of URL rewrite or redirect takes you to a content page at url.com/twitter. When attempting to access url.com/twitter directly, it redirects you to the main page.
I should note that I've decided to leave the URL in question out. I'm not doing anything malicious, just automating a very monotonous check once every reasonable increment of time (I'm familiar with compassionate screen scraping). However, it would be trivial to associate my StackOverflow account with that account in the event that it didn't make the domain owners happy.
My question is: I've been able to successfully log in and automate services in the past, none of which were .NET-based. Is there anything different that I should be doing, or maybe something I'm leaving out?

For anyone else that might be in a similar predicament in the future:
I'd just like to note that I've had a lot of success with a Greasemonkey user script in Chrome to do all of my scraping and automation. I found it to be a lot easier than Python + urllib2 (at least for this particular case). The user scripts are written in 100% JavaScript.

When scraping a web application, I use either:
1) Wireshark, or
2) a logging proxy server (one that logs headers as well as the payload).
I then compare what the real application does (in this case, how your browser interacts with the site) with the scraper's logs. Working through the differences will bring you to a working solution.

Related

Website debugger to find parameters passed on user submission

I was wondering if there is a way to see what parameters or other information are passed when submitting a form on a website for which you don't have any of the server code.
Here is the page I am trying to debug - https://umbc.t2hosted.com/cit/index.aspx.
When I put information into the fields and submit it, no data is added to the URL like there would be in a regular GET request. Is there any tool that can help me find out what parameters are actually passed, so that I may simulate user requests with a program?
Thank you in advance for your help.
You can use a debugging proxy such as Fiddler to see all the data that is sent from your machine to the website when performing the query.
This will let you see the HTTP messages sent from your browser to the website. Once you've seen and understood how the messages are sent, it should be relatively easy to reproduce them with another program.

Mixing jQuery Ajax with ASP.NET: is there any security risk?

I am using jQuery with ASP.NET in a project. Instead of using ASP.NET AJAX, I am using jQuery's Ajax functions. Is there any security risk in doing that? I mean, since I am using jQuery's Ajax calls, no ViewState information will be passed to the server for it to verify the page's authenticity (though that saves a lot of bandwidth...).
I would also like to know what the best practice is here.
Microsoft has included jQuery in their Visual Studio releases (see: http://weblogs.asp.net/scottgu/archive/2008/09/28/jquery-and-microsoft.aspx)
If there was a big security risk they probably wouldn't have done that ;)
As with all web applications, never trust the input you receive. It doesn't matter whether you're working with ASP.NET AJAX, jQuery or any other library: web requests can always be spoofed. Therefore, always sanitize the input you receive and make sure the user is authenticated (ASP.NET forms authentication uses cookies, not ViewState).
Make sure that you validate all user input. And post basic authentication information to your Web Services (jQuery.ajax has a data parameter), so that no one can use the services without being a part of the system.
Passing along a session GUID, and thus providing the Web Service with full authentication, is enough security for most applications (in addition to normal security checks such as input validation). You should decide exactly what security level your application needs.
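As a rough illustration of the session-GUID approach, here is a minimal sketch of an ASP.NET page method (the GetTasks name, the "AuthToken" session key and the LoadTasksForCurrentUser helper are all hypothetical; the client would pass sessionId through jQuery.ajax's data parameter):
[System.Web.Services.WebMethod(EnableSession = true)]
public static string GetTasks(string sessionId)
{
    // Compare the posted GUID against the one issued at login and stored in session.
    string issued = HttpContext.Current.Session["AuthToken"] as string;
    if (issued == null || !issued.Equals(sessionId, StringComparison.Ordinal))
        throw new InvalidOperationException("Not authenticated.");

    // Normal checks still apply: validate every other piece of user input here.
    return LoadTasksForCurrentUser();
}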
I use the same practice in many cases - jQuery Ajax on aspx pages.
You can check three things (pick one or two of them) and be reasonably sure that no one can cause trouble on your site:
1) Send all the post data encrypted (if you can).
2) Send a hash value with the post data - and check that the hash is correct (if you can).
3) Check that the call is coming from a page on your own host's URL.
For example, if you have a page 'http://www.yourhost.com/askforajax.aspx', check that the referring URL starts with 'http://www.yourhost.com/'.
By hash here I mean your own implementation of a hash or CRC check - call it whatever you like.
Here is a real Ajax call from my pages:
doSomeWork.aspx?plist=36&pslst=1&e=1202638085&er=12585795
The last two parameters are the check parameters (a sketch of this kind of check follows below).
Also, inside the Ajax page that does the calculations, check every parameter for correctness.
In some cases I check other things as well; for example, if a user presses a button that makes a change somewhere, that user must have cookies enabled, so I check whether the user's cookie hash is the same.
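Here is a minimal sketch of one way to implement such a check parameter. It is not the author's actual scheme: it assumes a server-side secret mixed into a SHA-256 hash so the client cannot forge the check value, and the parameter names simply mirror the example URL above.
using System.Security.Cryptography;
using System.Text;

static string MakeCheck(string value)
{
    const string secret = "server-side-secret";  // hypothetical shared secret
    using (SHA256 sha = SHA256.Create())
    {
        byte[] hash = sha.ComputeHash(Encoding.UTF8.GetBytes(value + secret));
        return BitConverter.ToString(hash, 0, 4).Replace("-", "");  // short check value
    }
}

// In the Ajax page (e.g. doSomeWork.aspx), reject mismatched requests:
string payload = Request.QueryString["e"];
string check = Request.QueryString["er"];
if (payload == null || check == null || MakeCheck(payload) != check)
{
    Response.StatusCode = 403;
    Response.End();
}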
For the URL check:
I believe that Request.ServerVariables["HTTP_REFERER"] can do the work of checking where the request came from.
HTTP_REFERER returns a string containing the URL of the page that referred the request to the current page via an <a> tag. If the page is redirected, HTTP_REFERER is empty.
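A minimal sketch of the referrer check, using Request.UrlReferrer (the parsed form of HTTP_REFERER). As noted, the header can be spoofed or empty, so treat this as a speed bump rather than real security:
Uri referrer = Request.UrlReferrer;
bool fromThisSite = referrer != null
    && referrer.AbsoluteUri.StartsWith("http://www.yourhost.com/",
                                       StringComparison.OrdinalIgnoreCase);
if (!fromThisSite)
{
    Response.StatusCode = 403;  // refuse requests arriving from elsewhere
    Response.End();
}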
Hope this helps.

Stop Direct Page Calls to Ajax Pages

Is there a "clever" way of stopping direct page calls in ASP.NET? (Page functionality, not the page itself)
By clever, I mean not having to add hashes between pages to stop AJAX pages from being called directly. In a nutshell, this means stopping users from accessing the Ajax pages without coming from one of your website's pages in a legitimate way. I understand that nothing is impossible to break; I am simply interested in seeing what other interesting methods there are.
If not, is there any way that one could do it without using sessions/cookies?
Have a look at this question: Differentiating Between an AJAX Call / Browser Request
The best answer from the above question is to check for a requested-by or custom header.
Ultimately, your web server is receiving requests (including headers) of what the client sends you - all data that can be spoofed. If a user is determined, then any request can look like an AJAX request.
I can't think of an elegant method to prevent this (there are inelegant and probably imperfect methods whereby you pass a hash of some sort of request counter between Ajax and non-Ajax requests).
Can I ask why your application is so sensitive to "ajax" pages being called directly? Could you design around this?
You can check the Request headers to see whether the call was initiated by AJAX. Usually, you should find that x-requested-with has the value XMLHttpRequest. Or, in the case of ASP.NET AJAX, check whether ScriptManager.IsInAsyncPostBack == true. However, I'm not sure about preventing the request in the first place.
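A minimal sketch of both checks in a page's code-behind (what to do with a non-Ajax request - redirect, 403, etc. - is up to you, and remember the header can be spoofed):
ScriptManager sm = ScriptManager.GetCurrent(this);
bool isAjax = Request.Headers["X-Requested-With"] == "XMLHttpRequest"
              || (sm != null && sm.IsInAsyncPostBack);
if (!isAjax)
{
    // Looks like a direct browser request rather than an Ajax call.
    Response.Redirect("~/Default.aspx");
}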
Have you looked into header authentication? If you only want your app to be able to make Ajax calls to certain pages, you can require authentication for those pages... not sure if that helps you or not.
Basic Access Authentication
or the more secure
Digest Access Authentication
Another option would be to append some sort of identifier to your URL query string in your application before requesting the page, and have some sort of authentication method on the server side.
I don't think there is a way to do it without using a session. Even if you use an HTTP header, it is trivial for someone to create a request with the exact same headers.
Using session with ASP.NET Ajax requests is easy. You may run into some problems, like session expiration, but you should be able to find a solution.
With sessions you will be able to guarantee that only logged-in users can access the Ajax services. When servicing an Ajax request simply test that there is a valid session associated with it. Of course a logged-in user will be able to access the service directly. There is nothing you can do to avoid this.
If you are concerned that a logged-in user may try to contact the service directly in order to steal data, you can add a rate limit to the service. For example, do not allow users to access the service more than once a minute (or whatever rate the application needs to work properly).
See what Google and Amazon are doing for their web services. They allow you to contact them directly (even providing APIs to do this), but they impose limits on how many requests you can make.
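A minimal sketch of such a per-session rate limit, assuming the Ajax service has session state available (the one-minute window is just the example figure from above):
object last = Session["LastServiceCall"];
if (last is DateTime && DateTime.UtcNow - (DateTime)last < TimeSpan.FromMinutes(1))
{
    Response.StatusCode = 429;  // refuse: called again too soon
    Response.End();
}
Session["LastServiceCall"] = DateTime.UtcNow;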
I do this in PHP by declaring a variable in a file that's included everywhere, and then checking whether that variable is set in the Ajax call file.
This way, you can never call the file directly, because that variable will never have been defined.
This is the "non-trivial" way, hence it's not too elegant.
The only real idea I can think of is to keep track of every link (as in, everything does a postback followed by a Response.Redirect). That way you could keep a static List<> or something similar of IP addresses (and possibly browser IDs and such) that says which pages each visitor is currently allowed to access, along with a timeout to keep them from going straight to a page three days from now.
I recommend rethinking your design to be sure this is really needed, though. Also note that IPs and such can be spoofed.
Also, if you follow this route, be sure to read up on when static variables get disposed. You wouldn't want one of those annoying "your session has expired" messages when the user has been using the site for ten minutes.
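For what it's worth, a minimal sketch of that tracking structure, simplified to a single allow flag per visitor rather than per page, and with the caveats already noted (keys based on IP and browser ID can be spoofed; static state vanishes whenever the app domain recycles; the five-minute timeout is an assumed figure):
using System.Collections.Generic;

private static readonly Dictionary<string, DateTime> Allowed =
    new Dictionary<string, DateTime>();

// Call after a legitimate postback, e.g. with IP + user agent as the key.
public static void Grant(string visitorKey)
{
    lock (Allowed) { Allowed[visitorKey] = DateTime.UtcNow; }
}

// Ask before serving the Ajax page.
public static bool IsAllowed(string visitorKey)
{
    lock (Allowed)
    {
        DateTime granted;
        return Allowed.TryGetValue(visitorKey, out granted)
            && DateTime.UtcNow - granted < TimeSpan.FromMinutes(5);
    }
}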

How do I force expiration of an ASP.Net session when a user leaves the site?

We have a scenario in which we'd like to detect when the user has left our site and immediately expire their .NET session. We're using Forms Authentication. We're not talking about a session timeout, which we already have. We would like to know when a user has browsed away from our site, whether via a link, by typing in an address or by following a bookmark. If they return to our site, even right away, they will have to log back in (I understand this is not great usability - it is a security requirement we've been given by our client).
My initial instinct is that this is either not possible, or that any solutions will be extremely unreliable. The only solutions we've come up with are:
Add a JavaScript onBlur event handler that tells the server to log out the session when the user leaves the site.
Once the user has logged in, check the HTTP referrer to ensure that the user has navigated from within the site.
Add AJAX polling back to the server to keep the session refreshed, possibly on a 10-second interval. When the call isn't received on time, the session would end.
The onBlur seems like the easiest, but possibly least reliable method - I'm not sure if it would even work. There are also issues with the referrer method, as the user could type in an address within the site and not follow a link. The AJAX method seems like it would work, but it's complicated - I'm not even sure how to handle it on the back-end. I'm thinking there might also be scenarios in which that wouldn't always work.
Any ideas would be appreciated. Thanks.
I have gone for a heartbeat-type scenario like you describe above: either Ajax polling or an IFRAME. When the user closes the browser and a certain timeout elapses (10 seconds?), you can log them out.
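A minimal sketch of the server side of such a heartbeat, as a generic handler the page would poll (the Heartbeat name and "LastBeat" key are hypothetical; the login check would then compare the stored timestamp against the allowed gap and force a re-login if it is stale):
public class Heartbeat : IHttpHandler, System.Web.SessionState.IRequiresSessionState
{
    public void ProcessRequest(HttpContext context)
    {
        // Record the last moment this session was seen alive.
        context.Session["LastBeat"] = DateTime.UtcNow;
        context.Response.ContentType = "text/plain";
        context.Response.Write("ok");
    }

    public bool IsReusable
    {
        get { return false; }
    }
}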
Another alternative would be to have the site run entirely on AJAX. Thus there is only one "URL" that a user can visit and all content is loaded dynamically. Of course you break all sorts of usability stuff this way, but at least you achieve your goal.
If the user closes their browser, or types in a different URL (including selecting a favourite) there is not much for you to detect.
For links on your site, you could create links that forward via your site (i.e. rather than linking to http://example.com/foo you link to http://mysite.com/forwarder?dest=http://example.com/foo).
Just be careful to only forward to sites you intend to, otherwise you can open up security issues with "universal forwarding" being used for phishing etc..
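A minimal sketch of such a forwarder page's code-behind, including the whitelist check that avoids the universal-forwarding problem (the AllowedHosts list is illustrative):
private static readonly string[] AllowedHosts = { "example.com" };

private void Page_Load(object sender, System.EventArgs e)
{
    string dest = Request.QueryString["dest"];
    Uri target;
    bool ok = Uri.TryCreate(dest, UriKind.Absolute, out target)
              && Array.IndexOf(AllowedHosts, target.Host) >= 0;
    if (!ok)
    {
        Response.StatusCode = 400;  // refuse to act as a universal forwarder
        Response.End();
    }
    Session.Abandon();              // expire the session on the way out, if desired
    Response.Redirect(target.AbsoluteUri);
}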
You absolutely, positively need to tell the client that this is not possible. They have a basic misunderstanding of how the Web works. Be diplomatic, obviously... hell, it's probably someone else's job... but it needs to be done.
Your suggestions, or a combination of them, may work in a simple proof of concept... but they will bring you nothing but support nightmares and will not work consistently enough. Worse, you will undoubtedly create situations where users cannot use the application at all because the security hacks misfire on them.
JavaScript has an onUnload event, which is triggered when the browser is told to leave the page. You can see this on StackOverflow when you try to press the back button or click a link while editing an answer.
You could use this event to trigger an auto-logoff for your site.
I am unsure, however, whether this will handle cases where the browser is deliberately closed or the browser process is externally terminated (I'm guessing it doesn't fire in the second case).
If all navigation within your site is done through .NET postbacks (no simple html links or javascript open statements), you can do automatic logoff and redirect to the login page if the page load is not a postback. This does not end the session on exit, but it looks like it because it enforces a login if manually navigating to your web app. To get this functionality for all pages, you can use a Master page that does this in the Page_Load.
private void Page_Load(object sender, System.EventArgs e)
{
    // Any non-postback load (a fresh navigation to the page) forces a new login.
    if (!IsPostBack)
    {
        System.Web.Security.FormsAuthentication.SignOut();
        System.Web.Security.FormsAuthentication.RedirectToLoginPage();
    }
}

How to handle IMAP requests from MSOutlook in ASP.NET page?

Brief: I am tinkering with a personal project that would serve up Task objects to MSOutlook. I would like to create a new HTTP account in MSOutlook which points at my website's *.aspx page. This page would deliver a list of Task items that do not actually reside on a mail server but are instead stored in a XML file or other simple structure.
Question: Are there any guides for handling IMAP requests in ASP.NET? I've found plenty of information on developing a web client but I want something more akin to a server/service though nothing so robust.
Background: My daughter is in high school. She is computer literate but abhors complexity and all nerdiness. She is comfortable with MSOutlook so I would like to run a little website in my house to send homework Tasks to her. If I can set up an HTTP account, the Tasks will be delivered to her without any trouble on her part. Don't get me started on the screen scraping I'm doing to retrieve assignments from her teacher's "websites" (I don't think the term could be applied any more loosely without completely falling off).
I think you'd be better off using/customizing an open-source IMAP server; there are several out there. But I am not sure the mail server idea is a good one - you'd be bringing a lot of baggage into this effort.
Why don't you just send your daughter an email, as opposed to putting the assignment on a web page and then trying to get it off of there?
If you must have the pull model (as opposed to a push model), why not put up an ASP.NET page with a "Send me the assignment" button? She can go there, click it, and receive the content by email.
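A minimal sketch of that button's click handler using System.Net.Mail (the addresses and the LoadAssignmentText helper - which would read the XML store mentioned in the question - are hypothetical, and SMTP settings are assumed to be configured in web.config):
protected void SendAssignment_Click(object sender, EventArgs e)
{
    var message = new System.Net.Mail.MailMessage(
        "homework@example.com",      // hypothetical sender
        "daughter@example.com",      // hypothetical recipient
        "Today's assignment",
        LoadAssignmentText());       // hypothetical helper over the XML file
    new System.Net.Mail.SmtpClient().Send(message);
}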
