Hampering website parsing by adding useless data inside actual data - web-scraping

I want to prevent or hamper the parsing of the classifieds website that I'm improving.
The website uses an API with JSON responses. As a countermeasure, I want to interleave useless data with my real data, since scrapers will probably parse records by ID, and give no clue about it in either the JSON response body or the headers, so they won't be able to tell the junk apart without close inspection.
To keep users from ever seeing it, I won't serve that "useless data" unless it is requested explicitly by ID. From an SEO perspective, I know that Google won't crawl a page of useless data if no internal or external link points to it.
How reliable would that technique be? And what problems/disadvantages/drawbacks do you think could occur in terms of user experience or SEO? Any ideas or suggestions would be very much appreciated.
P.S. I'm also rate-limiting large numbers of requests made in a short time, but that doesn't help, which is why I'm considering this technique.
I don't think banning scrapers would work any better, because they can change IPs and so on.
Maybe I could do better by requiring a login to access more than, say, 50 item details (and maybe that would even work against Selenium?). Registering makes it harder, and even if they do register, I can identify those users and slow down their response times.
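A minimal sketch of the interleaving idea, assuming a classic ASP.NET IHttpHandler; the Listing class, the hard-coded rows, and the reserved decoy ID range are illustrative assumptions, not the site's actual schema:

    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Web;
    using System.Web.Script.Serialization;

    public class Listing
    {
        public int Id { get; set; }
        public string Title { get; set; }
        public decimal Price { get; set; }
    }

    public class ListingsHandler : IHttpHandler
    {
        public bool IsReusable { get { return true; } }

        public void ProcessRequest(HttpContext context)
        {
            // In the real site these rows would come from the database.
            var real = new List<Listing>
            {
                new Listing { Id = 101, Title = "Used bicycle", Price = 80m },
                new Listing { Id = 102, Title = "Office chair", Price = 45m }
            };

            // Decoy rows use IDs from a reserved range that the UI never links to,
            // so only a scraper walking the JSON will ever ask for their details.
            var decoys = Enumerable.Range(900001, 3)
                .Select(id => new Listing { Id = id, Title = "Vintage lamp", Price = 25m })
                .ToList();

            // Shuffle real and decoy rows together; nothing in the payload
            // or the headers marks which is which.
            var rnd = new Random();
            var mixed = real.Concat(decoys).OrderBy(x => rnd.Next()).ToList();

            context.Response.ContentType = "application/json";
            context.Response.Write(new JavaScriptSerializer().Serialize(mixed));
        }
    }

The item-details endpoint would then return 404 for IDs in the decoy range (or log the caller), since no legitimate user ever navigates to them.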

Related

How can I tell the difference between a post from a browser, and someone trying to post programmatically

Is there a way to determine if the request coming to a handler (let's assume the handler responds to GET and POST) is being performed by a real browser versus a programmatic client?
I already know that it is easy to spoof things like the User-Agent and the Referer, but are there other headers that are more difficult to spoof? Maybe headers that are not commonly available in classes like .NET's HttpWebRequest?
The other path that I looked at is maybe using the Encrypted View State to send a value to the browser that gets validated on the server side, though couldn't that value simply be scraped from the previous response and added as a post parameter to the next request?
Any help would be much appreciated,
Cheers,
There is no easy way to differentiate, because in the end a programmatic POST looks the same to the server as a POST made by a user from a browser.
As mentioned, CAPTCHAs can be used to control posting, but they are not perfect (it is very hard, but not impossible, for a computer to solve them). They can also annoy users.
Another route is to allow only authenticated users to post, but this can still be done programmatically.
If you want to get a good feel for how people are going to try to abuse your site, then you may want to look at http://seleniumhq.org/
This is very similar to the famous Halting Problem in computer science. See some more on the proof, and Alan Turing here: http://webcache.googleusercontent.com/search?q=cache:HZ7CMq6XAGwJ:www-inst.eecs.berkeley.edu/~cs70/fa06/lectures/computability/lec30.ps+alan+turing+infinite+loop+compiler&cd=1&hl=en&ct=clnk&gl=us
The most common way is using CAPTCHAs. Of course CAPTCHAs have their own issues (users don't really care for them), but they do make it much more difficult to post data programmatically. They don't really help with GETs, though you can force users to solve a CAPTCHA before delivering content.
There are many ways to do this, such as dynamically generated XHR requests that can only be triggered by human interaction.
Here's a great article on NP-Hard problems. I can see a huge possibility here:
http://www.i-programmer.info/news/112-theory/3896-classic-nintendo-games-are-np-hard.html
One way: you could use some tricky JS to handle tokens on click. Your server issues token IDs to elements on the page during the backend render phase and logs them in a database or data file. Then, when users click around and submit, you compare the IDs sent via the onclick() handler against what was issued. There are plenty of ways around this, but you can also apply heuristics to determine whether posts are too fast to be human: even if a script hijacks the token IDs and auto-submits, you can check whether the timing between click events looks automated. Signed up for a Twitter account lately? They use passive human detection that, while not 100% foolproof, is slower and more difficult to break; many if not all of the spam accounts there had to be opened by a human.
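A bare-bones sketch of the server side of that idea, with an in-memory token store, made-up names, and an arbitrary two-second threshold:

    using System;
    using System.Collections.Concurrent;

    public static class ClickTokens
    {
        // In-memory store for demo purposes; a real site would persist tokens
        // (database, cache) so they survive restarts and work across servers.
        private static readonly ConcurrentDictionary<Guid, DateTime> Issued =
            new ConcurrentDictionary<Guid, DateTime>();

        // Called during the backend render phase; the returned value is embedded
        // in the element's onclick() handler (e.g. copied into a hidden field).
        public static Guid Issue()
        {
            Guid token = Guid.NewGuid();
            Issued[token] = DateTime.UtcNow;
            return token;
        }

        // Called when the form is submitted. Rejects unknown or replayed tokens
        // and submissions that arrive implausibly fast after the render.
        public static bool Validate(Guid token, TimeSpan minimumHumanDelay)
        {
            DateTime issuedAt;
            if (!Issued.TryRemove(token, out issuedAt))
                return false;
            return DateTime.UtcNow - issuedAt >= minimumHumanDelay;
        }
    }

    // Usage (illustrative): reject posts that come back in under two seconds.
    // bool looksHuman = ClickTokens.Validate(tokenFromForm, TimeSpan.FromSeconds(2));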
Another Way: http://areyouahuman.com/
As long as you are using encrypted methods, verifying humanity without a crappy CAPTCHA is possible. Don't ignore your headers either; these are complementary approaches.
The key is to have enough complexity that the challenge behaves like an NP-complete problem: the number of possible ways to solve the full set of problems is extraordinary. http://en.wikipedia.org/wiki/NP-complete
When the day comes that AI can solve multiple complex human problems on its own, we will have bigger things to worry about than request tampering.
http://louisville.academia.edu/RomanYampolskiy/Papers/1467394/AI-Complete_AI-Hard_or_AI-Easy_Classification_of_Problems_in_Artificial
Another company doing interesting research is http://www.vouchsafe.com/play-games: they use games designed to train the reverse Turing test (RTT) to be solvable only by humans!

Is there anything not good about using POST instead of GET?

I know the difference between POST and GET; however, if I used POST instead of GET, is there anything bad about that besides not following the W3C/HTTP standards?
Any inefficiency, insecurity, or anything else?
See the answer from deceze:
POST requests can't be bookmarked.
In all the interviews I've done, all the teaching I've done, this is the best place to start. There's a lot more, but start with this.
Ignore anything anyone says about security. A good hacker can change POST to GET easily.
If you get this far, know that POST changes data (adds a membership, or charges a credit card), whereas GET only fetches data (searches for red shirts). The makers of browsers make their browsers behave differently for the results of POST vs GET. The results of POST have side effects that you may not want to repeat (such as adding another membership or double charging a credit card).
If you understand THIS, then read about the POST-Redirect-GET pattern, and understand it well. (Then know that GET has a URL length limit, and that you may need to resort to POST in this case.)
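A minimal sketch of POST-Redirect-GET in a Web Forms code-behind; the page names and the SaveMembership helper are placeholders, not a real implementation:

    using System;
    using System.Web.UI;

    public partial class Signup : Page
    {
        protected void SubmitButton_Click(object sender, EventArgs e)
        {
            // Perform the side effect exactly once, in response to the POST.
            int membershipId = SaveMembership(Request.Form["name"]);

            // Answer the POST with a redirect; refreshing the confirmation page
            // then re-issues a harmless GET instead of repeating the insert/charge.
            Response.Redirect("MembershipConfirmed.aspx?Id=" + membershipId);
        }

        private int SaveMembership(string name)
        {
            // Placeholder for the real database write; returns the new record's key.
            return 42;
        }
    }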
Never use POST requests for normal view-only pages. POST requests can't be bookmarked, sent in an email, or otherwise reused. They screw up proper navigation using the browser's back/forward buttons. Only ever use them for sending data to the server in one unique operation, and (usually) have the server answer with a redirect.
Other than that, they're not more or less efficient or secure than GET requests, they're just for a different purpose.

Best Practices for Passing Data Between Pages

The Problem
In the stack that we re-use between projects, we are putting a little bit too much data in the session for passing data between pages. This was good in theory because it prevents tampering, replay attacks, and so on, but it creates as many problems as it solves.
Session loss itself is an issue, although it's mostly handled by implementing Session State Server (or by using SQL Server). More importantly, it's tricky to make the back button work correctly, and it's also extra work to create a situation where a user can, say, open the same screen in three tabs to work on different records.
And that's just the tip of the iceberg.
There are workarounds for most of these issues, but as I grind away, all this friction gives me the feeling that passing data between pages using session is the wrong direction.
What I really want to do here is come up with a best practice that my shop can use all the time for passing data between pages, and then, for new apps, replace key parts of our stack that currently rely on Session.
It would also be nice if the final solution did not result in mountains of boilerplate plumbing code.
Proposed Solutions
Session
As mentioned above, leaning heavily on Session seems like a good idea, but it breaks the back button and causes some other problems.
There may be ways to get around all the problems, but it seems like a lot of extra work.
One thing that's very nice about using session is the fact that tampering is just not an issue. Compared to passing everything via the unencrypted QueryString, you end up writing much less guard code.
Cross-Page Posting
In truth I've barely considered this option. I have a problem with how tightly coupled it makes the pages -- if I start doing PreviousPage.FindControl("SomeTextBox"), that seems like a maintenance problem if I ever want to get to this page from another page that maybe does not have a control called SomeTextBox.
It seems limited in other ways as well. Maybe I want to get to the page via a link, for instance.
QueryString
I'm currently leaning towards this strategy, like in the olden days. But I probably want my QueryString to be encrypted to make it harder to tamper with, and I would like to handle the problem of replay attacks as well.
On 4 Guys from Rolla, there's an article about this.
However, it should be possible to create an HttpModule that takes care of all this and removes all the encryption sausage-making from the page. Sure enough, Mads Kristensen has an article where he released one. However, the comments make it sound like it has problems with extremely common scenarios.
Other Options
Of course this is not an exhaustive look at the options, but rather the main options I'm considering. This link contains a more complete list. The ones I didn't mention, such as cookies and the Cache, are not appropriate for the purpose of passing data between pages.
In Closing...
So, how are you handling the problem of passing data between pages? What hidden gotchas did you have to work around, and are there any pre-existing tools around this that solve them all flawlessly? Do you feel like you've got a solution that you're completely happy with?
Thanks in advance!
Update: Just in case I'm not being clear enough, by 'passing data between pages' I'm talking about, for instance, passing a CustomerID key from a CustomerSearch.aspx page to Customers.aspx, where the Customer will be opened and editing can occur.
First, the problems you are dealing with relate to handling state in a stateless environment. The struggles you are having are not new, and this is probably one of the things that makes web development harder than Windows development or the development of an executable.
With respect to web development, you have five choices, as far as I'm aware, for handling user-specific state which can all be used in combination with each other. You will find that no one solution works for everything. Instead, you need to determine when to use each solution:
Query string - Query strings are good for passing pointers to data (e.g. primary key values) or state values. Query strings by themselves should not be assumed to be secure, even if encrypted, because of replay. In addition, some browsers have a limit on the length of the URL. However, query strings have some advantages, such as that they can be bookmarked and emailed to people, and they are inherently stateless if not used with anything else.
Cookies - Cookies are good for storing very tiny amounts of information for a particular user. The problem is that cookies also have a size limitation, after which the browser will simply truncate the data, so you have to be careful about putting custom data in a cookie. In addition, users can kill cookies or stop their use (although that would prevent use of standard Session as well). Similar to query strings, cookies are better, IMO, for pointers to data than for the data itself, unless the data is tiny.
Form data - Form data can carry quite a bit of information, though at the cost of post times and, in some cases, reload times. ASP.NET's ViewState uses hidden form variables to maintain information. Passing data between pages using something like ViewState has the advantage of working more nicely with the back button, but it can easily create ginormous pages which slow down the experience for the user. In general, the ASP.NET model is not built around cross-page posting (although it is possible) but around posting back to the same page and navigating to the next page from there.
Session - Session is good for information that relates to a process with which the user is progressing or for general settings. You can store quite a bit of information into session at the cost of server memory or load times from the databases. Conceptually, Session works by loading the entire wad of data for the user all at once either from memory or from a state server. That means that if you have a very large set of data you probably do not want to put it into session. Session can create some back button problems which must be weighed against what the user is actually trying to accomplish. In general you will find that the back button can be the bane of the web developer.
Database - The last solution (which again can be used in combination with others) is that you store the information in the database in its appropriate schema with a column that indicates the state of the item. For example, if you were handling the creation of an order, you could store the order in the Order table with a "state" column that determines whether it was a real order or not. You would store the order identifier in the query string or session. The web site would continue to write data into the table to update the various parts and child items until eventually the user is able to declare that they are done and the order's state is marked as being a real order. This can complicate reports and queries in that they all need to differentiate "real" items from ones that are in process.
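As a rough illustration of that last option, here is a minimal sketch; the Orders schema, the OrderStatus values (0 = draft, 1 = submitted), and the connection string name are assumptions:

    using System.Configuration;
    using System.Data.SqlClient;

    public static class OrderDraftStore
    {
        private static string ConnectionString
        {
            get { return ConfigurationManager.ConnectionStrings["Main"].ConnectionString; }
        }

        // Create the draft row up front; the returned key goes into the
        // query string or session while the user works through the process.
        public static int CreateDraft(int customerId)
        {
            using (var conn = new SqlConnection(ConnectionString))
            using (var cmd = new SqlCommand(
                "INSERT INTO Orders (CustomerId, OrderStatus) VALUES (@c, 0); " +
                "SELECT CAST(SCOPE_IDENTITY() AS int);", conn))
            {
                cmd.Parameters.AddWithValue("@c", customerId);
                conn.Open();
                return (int)cmd.ExecuteScalar();
            }
        }

        // Flip the state once the user declares the order done. Reports that
        // should only count "real" orders filter on OrderStatus = 1.
        public static void Submit(int orderId)
        {
            using (var conn = new SqlConnection(ConnectionString))
            using (var cmd = new SqlCommand(
                "UPDATE Orders SET OrderStatus = 1 WHERE OrderId = @id;", conn))
            {
                cmd.Parameters.AddWithValue("@id", orderId);
                conn.Open();
                cmd.ExecuteNonQuery();
            }
        }
    }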
One of the items mentioned in your later link was Application Cache. I wouldn't consider this to be user-specific since it is application wide. (It can obviously be shoe-horned into being user-specific but I wouldn't recommend that either). I've never played with storing data in the HttpContext outside of passing it to a handler or module but I'd be skeptical that it was any different than the above mentioned solutions.
In general, there is no one solution to rule them all. The best approach is to assume on each page that the user could have navigated to that page from anywhere (as opposed to assuming they got there by using a link on another page). If you do that, back button issues become easier to handle (although still a pain). In my development, I use the first four extensively and on occasion resort to the last solution when the need calls for it.
Alright, so I want to preface my answer with this: Thomas clearly has the most accurate and comprehensive answer so far for people starting fresh. This answer isn't in the same vein at all. My answer is coming from a "business developer's" standpoint. As we all know too well, sometimes it's just not feasible to spend money re-writing something that already exists and "works"... at least not all in one shot. Sometimes it's best to implement a solution which will let you migrate to a better alternative over time.
The only thing I'd say Thomas is missing is client-side JavaScript state. Where I work we've found customers are coming to expect "Web 2.0"-type applications more and more. We've also found these sorts of applications typically result in much higher user satisfaction. With a little practice, and the help of some really great JavaScript libraries like jQuery (we've even started using GWT and found it to be AWESOME), communicating with JSON-based REST services implemented in WCF can be trivial. This approach also provides a very nice way to start moving towards an SOA-based architecture and a clean separation of UI and business logic.
But I digress.
It sounds to me as though you already have an application, and you've already stretched the limits of ASP.NET's built-in session state management. So... here's my suggestion (assuming you've already tried ASP.NET's out-of-process session management, which scales significantly better than the in-process/on-box session management, and it sounds like you have because you mentioned it): NCache.
NCache provides you with a "drop-in" replacement for ASP.NET's session management options. It's super easy to implement, and could "band-aid" your application more than well enough to get you through - without any significant investment in refactoring your existing codebase immediately.
You can use the extra time and money to start reducing your technical debt by focusing new development on things with immediate business-value - using a new approach (such as any of the alternatives offered in the other answers, or mine).
Just my thoughts.
Several months later, I thought I would update this question with the technique I ended up going with, since it has worked out so well.
After playing with more involved session state handling (which resulted in a lot of broken back buttons and so on) I ended up rolling my own code to handle encrypted QueryStrings. It's been a huge win -- all of my problem scenarios (back button, multiple tabs open at the same time, lost session state, etc) are solved and the complexity is minimal since the usage is very familiar.
This is still not a magic bullet for everything but I think it's good for about 90% of the scenarios you run into.
Details
I built a class called CorePage that inherits from Page. It has methods called SecureRequest and SecureRedirect.
So you might call:
SecureRedirect(String.Format("Orders.aspx?ClientID={0}&OrderID={1}", ClientID, OrderID))
CorePage parses out the QueryString and encrypts it into a QueryString variable called CoreSecure. So the actual request looks like this:
Orders.aspx?CoreSecure=1IHXaPzUCYrdmWPkkkuThEes%2fIs4l6grKaznFGAeDDI%3d
If available, the currently logged in UserID is added to the encryption key, so replay attacks are not as much of a problem.
From there, you can call:
X = SecureRequest("ClientID")
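If you want to roll something similar yourself, a stripped-down sketch might look like the following. This is not the actual CorePage code: the class and method names are made up, the key is hard-coded only to keep the sample self-contained, and the per-user key derivation mentioned above is omitted.

    using System;
    using System.Security.Cryptography;
    using System.Text;
    using System.Web;

    public static class SecureQueryString
    {
        // 32 bytes = AES-256. In real code, pull this from configuration and
        // mix in the logged-in UserID as described above.
        private static readonly byte[] Key =
            Encoding.UTF8.GetBytes("0123456789ABCDEF0123456789ABCDEF");

        // Encrypts e.g. "ClientID=7&OrderID=123" into a single URL-safe value.
        public static string Protect(string plainQuery)
        {
            using (var aes = Aes.Create())
            {
                aes.Key = Key;
                aes.GenerateIV();
                using (var enc = aes.CreateEncryptor())
                {
                    byte[] plain = Encoding.UTF8.GetBytes(plainQuery);
                    byte[] cipher = enc.TransformFinalBlock(plain, 0, plain.Length);

                    // Prepend the IV so Unprotect can recover it.
                    byte[] packed = new byte[aes.IV.Length + cipher.Length];
                    Buffer.BlockCopy(aes.IV, 0, packed, 0, aes.IV.Length);
                    Buffer.BlockCopy(cipher, 0, packed, aes.IV.Length, cipher.Length);
                    return HttpUtility.UrlEncode(Convert.ToBase64String(packed));
                }
            }
        }

        // Expects the value as it comes out of Request.QueryString, which has
        // already been URL-decoded, i.e. plain Base64.
        public static string Unprotect(string coreSecure)
        {
            byte[] packed = Convert.FromBase64String(coreSecure);
            using (var aes = Aes.Create())
            {
                aes.Key = Key;
                byte[] iv = new byte[aes.BlockSize / 8];
                Buffer.BlockCopy(packed, 0, iv, 0, iv.Length);
                aes.IV = iv;
                using (var dec = aes.CreateDecryptor())
                {
                    byte[] cipher = new byte[packed.Length - iv.Length];
                    Buffer.BlockCopy(packed, iv.Length, cipher, 0, cipher.Length);
                    byte[] plain = dec.TransformFinalBlock(cipher, 0, cipher.Length);
                    return Encoding.UTF8.GetString(plain);
                }
            }
        }
    }

    // Usage (illustrative):
    // string url = "Orders.aspx?CoreSecure=" + SecureQueryString.Protect("ClientID=7&OrderID=123");
    // string query = SecureQueryString.Unprotect(Request.QueryString["CoreSecure"]);
    // string clientId = HttpUtility.ParseQueryString(query)["ClientID"];

A production version would also authenticate the ciphertext (for example, append an HMAC) so that tampering is detected outright rather than just producing garbage on decryption.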
Conclusion
Everything works seamlessly, using familiar syntax.
Over the last several months I've also adapted this code to work with edge cases, such as hyperlinks that trigger a download - sometimes you need to generate a hyperlink on the client that has a secure QueryString. That works really well.
Let me know if you would like to see this code and I will put it up somewhere.
One last thought: it's weird to accept my own answer over some of the very thoughtful posts other people put on here, but this really does seem to be the ultimate answer to my problem. Thanks to everyone who helped get me there.
After going through all the above scenarios and answers, and this link on data passing methods, my final advice would be:
COOKIES for:
ENCRYPT[userIds]
ENCRYPT[productId]
ENCRYPT[xyzIds]
ENCRYPT[etc.]
DATABASE for:
datasets BY COOKIE ID
datatables BY COOKIE ID
all other large chunks BY COOKIE ID
My advice also depends on the statistics below and the details in this link on data passing methods:
I would never do this. I have never had any issues storing all session data in the database and loading it based on the user's cookie. It's a session as far as anything is concerned, but I maintain control over it. Don't give up control of your session data to your web server...
With a little work, you can support sub sessions, and allow multi-tasking in different tabs/windows.
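A rough sketch of the idea; the cookie name is made up, and a ConcurrentDictionary stands in for the database table described above (one entry per cookie ID + sub-session ID + key):

    using System;
    using System.Collections.Concurrent;
    using System.Web;

    public static class TabScopedState
    {
        private static readonly ConcurrentDictionary<string, object> Store =
            new ConcurrentDictionary<string, object>();

        // Identifies the browser. Issued as a cookie on first contact and
        // remembered in ctx.Items so the same request sees a stable value.
        private static string CookieId(HttpContext ctx)
        {
            HttpCookie cookie = ctx.Request.Cookies["app-session"];
            if (cookie != null) return cookie.Value;

            string id = ctx.Items["app-session"] as string;
            if (id == null)
            {
                id = Guid.NewGuid().ToString("N");
                ctx.Items["app-session"] = id;
                ctx.Response.Cookies.Add(new HttpCookie("app-session", id));
            }
            return id;
        }

        // Each tab/window carries its own subId (e.g. in the query string),
        // so work in one tab never clobbers another.
        public static void Set(HttpContext ctx, string subId, string key, object value)
        {
            Store[CookieId(ctx) + "/" + subId + "/" + key] = value;
        }

        public static object Get(HttpContext ctx, string subId, string key)
        {
            object value;
            Store.TryGetValue(CookieId(ctx) + "/" + subId + "/" + key, out value);
            return value;
        }
    }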
As a starting point, I find that critical data elements, such as a Customer ID, are best put into the query string for processing. You can easily track/filter bad data coming in on these elements, and it also allows for some integration with e-mail or other related sites/applications.
In a previous application, the only way to view an employee or a request record involving them was to log into the application, do a search for the employee or do a search for recent records to find the record in question. This became problematic and a big time sink when somebody from a related department needed to do a simple view on records for auditing purposes.
In the rewrite, I made both the employee IDs and request IDs available through basic URLs of the form "ViewEmployee.aspx?Id=XXX" and "ViewRequest.aspx?Id=XXX". The application was set up to A) filter out bad IDs and B) authenticate and authorize the user before allowing them onto these pages. What this allowed the primary application users to do was send simple e-mails to the auditors with a URL in the e-mail. When the auditors were in a big hurry, during their bulk processing time, they could simply click down a list of URLs and do the appropriate processing.
Other session-related data, such as modification dates and maintaining the "state" of the user's interaction with the application, gets a little more complex, but hopefully this provides a starting point for you.

ASP.NET: Looking for solution to solve XSS

We have a long-running website where XSS lurks. The problem is that some developers retrieve Request["sth"] directly, without using HtmlEncode()/HtmlDecode(), process it, and put it on the web.
I wonder if there is any mechanism, like an HttpModule, that could help us HtmlEncode() all the items in an HTTP request to avoid XSS to some extent.
Any suggestion is appreciated.
Rgds,
Ricky
The problem is not retrieving Request data without HTML-encoding. In fact that's perfectly correct. You should not encode any text until the final output stage when you spit it into an HTML page.
Trying to blanket-encode incoming parameters, whether that's HTML-encoding or SQL-encoding, is totally the wrong thing. It may hide XSS holes in your app but it does not fix them. You will still have a hole if you output content that hasn't come from parameters, or has been processed since then. Meanwhile the automatic encoding will fill your database with multiply-escaped & crud.
You need to fix the output stage, that's where the problem lies.
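For example, a minimal sketch of encoding at the output stage (the page and field names are made up):

    using System;
    using System.Web;
    using System.Web.UI;

    public partial class Comments : Page
    {
        protected void Page_Load(object sender, EventArgs e)
        {
            // Keep the raw value as-is in memory and in the database...
            string comment = Request["comment"];

            // ...and encode exactly once, at the moment the text meets HTML.
            Response.Write("<p>" + HttpUtility.HtmlEncode(comment) + "</p>");
        }
    }

    // In ASP.NET 4+ markup the same output can be written as <%: comment %>,
    // which HTML-encodes automatically.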
Like bobince said, this is an output problem, not an input problem. If you can isolate where this data is being output on the page, you could create a Filter and add it to the Response object. This filter would isolate the areas that are common output and then HtmlEncode them.

So why should we use POST instead of GET for posting data? [duplicate]

This question already has answers here (closed 13 years ago). Possible duplicates:
How should I choose between GET and POST methods in HTML forms?
When do you use POST and when do you use GET?
Obviously, you should. But apart from doing so to fulfil the HTTP protocol, are there any reasons to do so? Less overhead? Some kind of security thing?
Because GET must not alter the state of the server, by definition.
See RFC 2616, 9.1.1 Safe Methods:
9.1.1 Safe Methods
Implementors should be aware that the software represents the user in their interactions over the Internet, and should be careful to allow the user to be aware of any actions they might take which may have an unexpected significance to themselves or others.
In particular, the convention has been established that the GET and HEAD methods SHOULD NOT have the significance of taking an action other than retrieval. These methods ought to be considered "safe". This allows user agents to represent other methods, such as POST, PUT and DELETE, in a special way, so that the user is made aware of the fact that a possibly unsafe action is being requested.
If you use GET to alter the state of the server then a search engine bot or some link prefetching extension in a web browser can wreak havoc on your site and (for example) delete all user data just by following links to your site.
There is a nice paper by the W3C about this: URIs, Addressability, and the use of HTTP GET and POST.
1.3 Quick Checklist for Choosing HTTP GET or POST
Use GET if:
The interaction is more like a question (i.e., it is a safe operation such as a query, read operation, or lookup).
Use POST if:
The interaction is more like an order, or
The interaction changes the state of the resource in a way that the user would perceive (e.g., a subscription to a service), or
The user be held accountable for the results of the interaction
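One practical way to enforce the POST side of that checklist is to make state-changing endpoints refuse GET outright. A minimal sketch, where the handler and the DeleteItem helper are placeholders:

    using System;
    using System.Web;

    public class DeleteItemHandler : IHttpHandler
    {
        public bool IsReusable { get { return true; } }

        public void ProcessRequest(HttpContext context)
        {
            // Crawlers, prefetchers and pasted links arrive as GET; refuse them.
            if (!string.Equals(context.Request.HttpMethod, "POST", StringComparison.OrdinalIgnoreCase))
            {
                context.Response.StatusCode = 405;   // Method Not Allowed
                context.Response.Write("Use POST to delete items.");
                return;
            }

            int id = int.Parse(context.Request.Form["id"]);
            DeleteItem(id);                          // the actual state change
            context.Response.StatusCode = 204;       // No Content
        }

        private void DeleteItem(int id)
        {
            // Placeholder for the real database delete.
        }
    }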
Because, if you use GET to alter state, Google can delete your stuff.
When do you use POST and when do you use GET?
How should I choose between GET and POST methods in HTML forms?
If you accept GETs to perform write operations, then a malicious hacker could inject links somewhere to perform an unauthorized operation. Your user clicks a link, and something is deleted from a database. Or maybe some money is transferred out of the user's account if they're still logged in to their online banking.
http://superbank.com/TransferMoney?amount=1000&recipient=2342524
Send a malicious email with an embedded image referencing this link, and as soon as the document is opened, something funny has happened behind the scenes.
GET is limited by the length of URL that the browser/server can handle. This used to be as short as 256 characters.
There is at least one situation where you want a GET to change data on the server: when a GET returns data and you need to record which data was given to a user and when it was given.
If you use complex data types, then the request must be a POST; it cannot be a GET. For example, testing a WCF web service in a browser can only be done when the contract uses simple data types.
Using GET and POST where it is expected helps to keep your program understandable.
When you use GET, you can see the information being sent in the address bar of the web browser. This is not the case when you use the POST method.
This article was somewhere on http://www.w3schools.com/ Once I've found the exact page it was on, I'll repost. :-)
