Query strings - use row IDs or human-readable values?

I'm building a form and I wonder if there is a significant advantage in showing values in a more human-readable format, e.g.:
index.php?user=ted&location=newyork
Rather than:
index.php?user=23423&location=34645
On the one hand, having the query string a little more readable allows the user and search engines to better understand where they are, but it also creates a little more work on the server side, as I'll have to track down the associated rows through something other than their unique ID.
For example, I'd first have to find the ID of 'newyork' before I could work on other rows that require the location_id. I always prefer to give the database as little work as possible.
Edit: I decided to go with readability. I figure I can always use the MySQL query cache to speed things up if necessary.

Use human-readable values when you can. Just be sure to sanitise the input.
Edit: Yes, this can and should still be done for SEO purposes (if it's worth it to you) even if you have lots of choices. However many choices the user has, you should know what they are (or what the limits are) so that you can properly sanitise the input. For instance, if they are choosing states, you know all 50. If they are just making up their own text, make sure on your end that it's only text.
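For instance, a minimal whitelist check might look like this (a sketch in C#, just to illustrate the idea; the class name and value list are made up):

using System;
using System.Collections.Generic;

public static class LocationFilter
{
    // If the set of valid values is known in advance (like the 50 states),
    // reject anything outside it before touching the database.
    private static readonly HashSet<string> Known =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "newyork", "texas", "ohio" };

    public static bool IsValid(string location)
    {
        return location != null && Known.Contains(location);
    }
}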

A good rule of thumb is to store data as IDs and display it as human-readable text.

Depends on your goals.
If you are talking about something like a blog where you want everyone to see everything (and find it easily), then the human/search-engine-readable format is a no-brainer.
If these pages are locked behind a login then it doesn't much matter. You can do what is easier on the database.
For most internet apps, I'd err to the side of readability since that will help with search engines as well.

You shouldn't worry about efficiency for any typically sized application on any reasonable database engine. Write your app for users, not for query optimizers; the query optimizer can take care of itself. Deal with optimization in the unlikely event you start seeing a problem.

I am going to come from a different direction. In my opinion, a URL should be readable if you want the user to be able to change their parameters by editing the URL instead of using the UI. An example would be https://www.coolreportingapp.com/accountReport.jsp?account=ABC&month=200911 . In this example, the user can "easily" change the account or month they're looking at without messing with the UI. This of course means you need to validate the URL params each and every time, which you should do anyway. If you don't want the user to alter the URL params, you need to obfuscate and hash the values and use the hash to verify they haven't.
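For the tamper-check, one common shape is to sign the value and verify the signature on each request. A rough sketch (class and key names are made up; real key management is elided):

using System;
using System.Security.Cryptography;
using System.Text;

public static class QueryStringSigner
{
    // Placeholder secret; in practice this comes from protected configuration.
    private static readonly byte[] Key = Encoding.UTF8.GetBytes("replace-with-secret");

    // Sign e.g. "account=ABC&month=200911" and append the result as an extra param.
    public static string Sign(string value)
    {
        using (var hmac = new HMACSHA256(Key))
            return Convert.ToBase64String(hmac.ComputeHash(Encoding.UTF8.GetBytes(value)));
    }

    // Recompute and compare on each request to detect edited params.
    public static bool Verify(string value, string signature)
    {
        return Sign(value) == signature; // a constant-time comparison is better in practice
    }
}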

Seriously, IMHO, in your example neither is more readable than the other. Do normal users know that "&" separates "variables" in "user=ted&location=newyork"? Do they even need to know that something like a variable exists? With this in mind, what's the difference between showing numbers and words?
If you really want readable URLs you should build SEO-friendly (human-readable) URLs. Remember that even a simple question like "dashes vs underscores" matters in the end.

Related

Converting words to hashtags when posting to Twitter

I currently use LinqToTwitter to send posts to Twitter. I'd like to convert words in the title of the post to hashtags when it gets fired off as a tweet, so a blog post titled "Firefox is cool" becomes "#Firefox is cool http://myshortu.rl/dhsgeh" on Twitter.
So far, the way I see it, I need a database table with the words I want to convert to hashtags. I'd have to parse out the title, compare the words to those in the db, and add on the pound sign. Is a db table the best way, or can I do it with an in-memory collection, or keep the words in web.config? Thanks.
The decision on whether to use a database or file (such as web.config) might depend on whether you want to write code that allows you to maintain the list. e.g. Add, Modify, Remove. If so, then a DB sounds like the easiest option. If the list is small and doesn't change, then adding a delimited list to web.config would work fine.
Since you're using ASP.NET you can't hold it in a memory variable, but you can hold the list in Cache. This can make for some very fast lookups, rather than multiple file or DB queries.
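To make the tradeoff concrete, here's a sketch of the Cache-plus-web.config variant (names like HashtagHelper and the "HashtagWords" setting are made up for illustration):

using System;
using System.Collections.Generic;
using System.Configuration;
using System.Linq;
using System.Web;

public static class HashtagHelper
{
    private const string CacheKey = "HashtagWords";

    private static HashSet<string> GetWords(HttpContext context)
    {
        // Fast path: the list is already in the ASP.NET Cache.
        var words = context.Cache[CacheKey] as HashSet<string>;
        if (words == null)
        {
            // Fallback: load the comma-delimited list from web.config.
            string raw = ConfigurationManager.AppSettings["HashtagWords"] ?? "";
            words = new HashSet<string>(
                raw.Split(new[] { ',' }, StringSplitOptions.RemoveEmptyEntries),
                StringComparer.OrdinalIgnoreCase);
            context.Cache.Insert(CacheKey, words);
        }
        return words;
    }

    // "Firefox is cool" -> "#Firefox is cool" (punctuation handling omitted).
    public static string ApplyHashtags(HttpContext context, string title)
    {
        var words = GetWords(context);
        return string.Join(" ", title.Split(' ')
            .Select(w => words.Contains(w) ? "#" + w : w));
    }
}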
Just to put this into perspective though, it's tough to recommend a proper design in a forum because there might be details that aren't known. So, it's best to take my answer as something that helps think about what the tradeoffs are, rather than a definitive recommendation on what you should do.

How to protect ASP .NET web app from XSS while preserving entered data?

My colleagues and I have been debating how to best protect ourselves
from XSS attacks but still preserve HTML characters that get entered
into fields in our software.
To me, the ideal solution is to accept the data (turn off ASP .NET
request validation) as the user enters it, throw it in the database
exactly as they entered it. Then, whenever you display the data on the
web, HTML-encode it. The problem with this approach is that there's a
high likelihood that a developer somewhere someday will forget to
HTML-encode the display of a value somewhere. Bam! XSS vulnerability.
Another solution that was proposed was to turn request validation off
and strip out any HTML users enter before it is stored in the database
using a regex. Devs will still have to HTML-encode things for display,
but since you've stripped out any HTML tags, even if a dev forgets, we
think it would be safe. The drawback to this is that users can't enter
HTML tags into descriptions and fields and things, even if they
explicitly want to, or they may accidentally paste in an email address
surrounded by < > and the regex doesn't pick it up...whatever. It
screws with the data, and it's not ideal.
The other issue we have to keep in mind is that the system has been
built in the fear of commitment to any one strategy around this. And
at one point, some devs wrote some pages to HTML encode data before it
gets entered into the database. So some data may be already HTML
encoded in the database, some data is not - it's a mess. We can't
really trust any data that comes from the database as safe for display
in a browser.
My question is: What would be the ideal solution if you were
building an ASP .NET web app from the ground up, and what would be a good
approach for us, given our situation?
Assuming you go ahead and store the HTML directly in the database, in ASP.NET/MVC Razor, HTML-encoding is done automatically, so your negligent developer would have to really go above and beyond the call of duty to introduce the XSS. With standard webforms (or the webform view engine), you can force developers to use the <%: syntax, which will accomplish the same thing. (albeit with more risk that the developer will be negligent)
Furthermore, you could consider only selectively disabling request validation. Do you really need to support it for every request? The vast majority of requests, presumably, would not need to preserve (or allow) the HTML.
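In MVC, for example, selective disabling can be as narrow as a single action (the attribute is real; the controller below is just illustrative):

using System.Web.Mvc;

public class CommentsController : Controller
{
    [HttpPost]
    [ValidateInput(false)] // allow HTML on this one action instead of site-wide
    public ActionResult Save(string body)
    {
        // Store body as-is; rely on encoded output (Razor's @ does this) for display.
        return RedirectToAction("Index");
    }
}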
Using a regex to strip html is fairly easy to defeat and very difficult to get correct. If you want to clean HTML input it's better to use an actual parser to enforce strict XML compliance.
What I would do in this situation is store two fields in the database: clean and raw for the data. When the user wants to edit their content, you send them the raw data. When they submit changes, you sanitize it and store it in the clean field. Developers then only ever use the clean field when outputting the content to the page. I would even go so far as to name the raw field dangerousRawContent so it's obvious that care must be taken when referencing that field.
The added benefit of this technique is that you can re-sanitize the raw data with improved parsers at a later date without ever losing the originally intended content.
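A minimal sketch of that two-field shape (the sanitizer is a placeholder for whatever parser-based cleaner you choose; the field names are only suggestions):

using System;

public class UserContent
{
    // Exactly as entered; never render this directly.
    public string DangerousRawContent { get; private set; }

    // Sanitized copy; the only field developers output to the page.
    public string CleanContent { get; private set; }

    public void Update(string rawInput, Func<string, string> sanitize)
    {
        DangerousRawContent = rawInput;    // keep the original for future re-sanitizing
        CleanContent = sanitize(rawInput); // e.g. a strict parser-based HTML cleaner
    }
}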

Best Practices for Passing Data Between Pages

The Problem
In the stack that we re-use between projects, we are putting a little bit too much data in the session for passing data between pages. This was good in theory because it prevents tampering, replay attacks, and so on, but it creates as many problems as it solves.
Session loss itself is an issue, although it's mostly handled by implementing Session State Server (or by using SQL Server). More importantly, it's tricky to make the back button work correctly, and it's also extra work to create a situation where a user can, say, open the same screen in three tabs to work on different records.
And that's just the tip of the iceberg.
There are workarounds for most of these issues, but as I grind away, all this friction gives me the feeling that passing data between pages using session is the wrong direction.
What I really want to do here is come up with a best practice that my shop can use all the time for passing data between pages, and then, for new apps, replace key parts of our stack that currently rely on Session.
It would also be nice if the final solution did not result in mountains of boilerplate plumbing code.
Proposed Solutions
Session
As mentioned above, leaning heavily on Session seems like a good idea, but it breaks the back button and causes some other problems.
There may be ways to get around all the problems, but it seems like a lot of extra work.
One thing that's very nice about using session is the fact that tampering is just not an issue. Compared to passing everything via the unencrypted QueryString, you end up writing much less guard code.
Cross-Page Posting
In truth I've barely considered this option. I have a problem with how tightly coupled it makes the pages -- if I start doing PreviousPage.FindControl("SomeTextBox"), that seems like a maintenance problem if I ever want to get to this page from another page that maybe does not have a control called SomeTextBox.
It seems limited in other ways as well. Maybe I want to get to the page via a link, for instance.
QueryString
I'm currently leaning towards this strategy, like in the olden days. But I probably want my QueryString to be encrypted to make it harder to tamper with, and I would like to handle the problem of replay attacks as well.
Over at 4 Guys from Rolla, there's an article about this.
However, it should be possible to create an HttpModule that takes care of all this and removes all the encryption sausage-making from the page. Sure enough, Mads Kristensen has an article where he released one. However, the comments make it sound like it has problems with extremely common scenarios.
Other Options
Of course this is not an exhaustive look at the options, but rather the main ones I'm considering. This link contains a more complete list. The ones I didn't mention, such as Cookies and the Cache, are not appropriate for the purpose of passing data between pages.
In Closing...
So, how are you handling the problem of passing data between pages? What hidden gotchas did you have to work around, and are there any pre-existing tools around this that solve them all flawlessly? Do you feel like you've got a solution that you're completely happy with?
Thanks in advance!
Update: Just in case I'm not being clear enough, by 'passing data between pages' I'm talking about, for instance, passing a CustomerID key from a CustomerSearch.aspx page to Customers.aspx, where the Customer will be opened and editing can occur.
First, the problems you are dealing with relate to handling state in a stateless environment. These struggles are not new; they are probably one of the things that make web development harder than Windows development or the development of an executable.
With respect to web development, you have five choices, as far as I'm aware, for handling user-specific state which can all be used in combination with each other. You will find that no one solution works for everything. Instead, you need to determine when to use each solution:
Query string - Query strings are good for passing pointers to data (e.g. primary key values) or state values. Query strings by themselves should not be assumed to be secure, even if encrypted, because of replay attacks. In addition, some browsers have a limit on the length of the URL. However, query strings have some advantages: they can be bookmarked and emailed to people, and they are inherently stateless if not used with anything else.
Cookies - Cookies are good for storing very tiny amounts of information for a particular user. The problem is that cookies also have a size limitation, after which the browser will simply truncate the data, so you have to be careful about putting custom data in a cookie. In addition, users can kill cookies or stop their use (although that would prevent use of standard Session as well). Similar to query strings, cookies are better, IMO, for pointers to data than for the data itself, unless the data is tiny.
Form data - Form data can carry quite a bit of information, however at the cost of post times and, in some cases, reload times. ASP.NET's ViewState uses hidden form variables to maintain information. Passing data between pages using something like ViewState has the advantage of working more nicely with the back button, but can easily create ginormous pages which slow down the experience for the user. In general, the ASP.NET model is not built around cross-page posting (although it is possible); instead it works on posts back to the same page, from which you navigate to the next page.
Session - Session is good for information that relates to a process with which the user is progressing or for general settings. You can store quite a bit of information into session at the cost of server memory or load times from the databases. Conceptually, Session works by loading the entire wad of data for the user all at once either from memory or from a state server. That means that if you have a very large set of data you probably do not want to put it into session. Session can create some back button problems which must be weighed against what the user is actually trying to accomplish. In general you will find that the back button can be the bane of the web developer.
Database - The last solution (which again can be used in combination with others) is that you store the information in the database in its appropriate schema with a column that indicates the state of the item. For example, if you were handling the creation of an order, you could store the order in the Order table with a "state" column that determines whether it was a real order or not. You would store the order identifier in the query string or session. The web site would continue to write data into the table to update the various parts and child items until eventually the user is able to declare that they are done and the order's state is marked as being a real order. This can complicate reports and queries in that they all need to differentiate "real" items from ones that are in process.
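As a sketch of that last option, the "state" column can be as simple as a draft flag that reports filter on (type and property names here are made up):

public enum OrderState { Draft = 0, Submitted = 1 }

public class Order
{
    public int OrderId { get; set; }      // the pointer you pass in the query string or session
    public OrderState State { get; set; } // Draft until the user declares they are done

    public void Submit()
    {
        // Reports and queries treat only Submitted rows as "real" orders.
        State = OrderState.Submitted;
    }
}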
One of the items mentioned in your later link was Application Cache. I wouldn't consider this to be user-specific since it is application wide. (It can obviously be shoe-horned into being user-specific but I wouldn't recommend that either). I've never played with storing data in the HttpContext outside of passing it to a handler or module but I'd be skeptical that it was any different than the above mentioned solutions.
In general, there is no one solution to rule them all. The best approach is to assume on each page that the user could have navigated to that page from anywhere (as opposed to assuming they got there by using a link on another page). If you do that, back button issues become easier to handle (although still a pain). In my development, I use the first four extensively and on occasion resort to the last solution when the need calls for it.
Alright, so I want to preface my answer with this; Thomas clearly has the most accurate and comprehensive answer so far for people starting fresh. This answer isn't in the same vein at all. My answer is coming from a "business developer's" standpoint. As we all know too well; sometimes it's just not feasible to spend money re-writing something that already exists and "works"... at least not all in one shot. Sometimes it's best to implement a solution which will let you migrate to a better alternative over time.
The only thing I'd say Thomas is missing is client-side JavaScript state. Where I work we've found customers are coming to expect "Web 2.0"-type applications more and more. We've also found these sorts of applications typically result in much higher user satisfaction. With a little practice, and the help of some really great JavaScript libraries like jQuery (we've even started using GWT and found it to be AWESOME), communicating with JSON-based REST services implemented in WCF can be trivial. This approach also provides a very nice way to start moving towards an SOA-based architecture and a clean separation of UI and business logic.
But I digress.
It sounds to me as though you already have an application, and you've already stretched the limits of ASP.NET's built-in session state management. So here's my suggestion: NCache. (I'm assuming you've already tried ASP.NET's out-of-process session management, which scales significantly better than in-process/on-box session management; it sounds like you have, since you mentioned it.)
NCache provides you with a "drop-in" replacement for ASP.NET's session management options. It's super easy to implement, and could "band-aid" your application more than well enough to get you through - without any significant investment in refactoring your existing codebase immediately.
You can use the extra time and money to start reducing your technical debt by focusing new development on things with immediate business-value - using a new approach (such as any of the alternatives offered in the other answers, or mine).
Just my thoughts.
Several months later, I thought I would update this question with the technique I ended up going with, since it has worked out so well.
After playing with more involved session state handling (which resulted in a lot of broken back buttons and so on) I ended up rolling my own code to handle encrypted QueryStrings. It's been a huge win -- all of my problem scenarios (back button, multiple tabs open at the same time, lost session state, etc) are solved and the complexity is minimal since the usage is very familiar.
This is still not a magic bullet for everything but I think it's good for about 90% of the scenarios you run into.
Details
I built a class called CorePage that inherits from Page. It has methods called SecureRequest and SecureRedirect.
So you might call:
SecureRedirect(String.Format("Orders.aspx?ClientID={0}&OrderID={1}", ClientID, OrderID))
CorePage parses out the QueryString and encrypts it into a QueryString variable called CoreSecure. So the actual request looks like this:
Orders.aspx?CoreSecure=1IHXaPzUCYrdmWPkkkuThEes%2fIs4l6grKaznFGAeDDI%3d
If available, the currently logged in UserID is added to the encryption key, so replay attacks are not as much of a problem.
From there, you can call:
X = SecureRequest("ClientID")
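A rough sketch of how a CorePage like this might be put together (not the original code; it assumes AES for the encryption step, and key management is deliberately simplified):

using System;
using System.Security.Cryptography;
using System.Text;
using System.Web;
using System.Web.UI;

public class CorePage : Page
{
    // Placeholder key/IV; the real version mixes the logged-in UserID into the key.
    private static readonly byte[] Key = new byte[32];
    private static readonly byte[] IV = new byte[16];

    protected void SecureRedirect(string url)
    {
        int i = url.IndexOf('?');
        if (i < 0) { Response.Redirect(url); return; }

        // Encrypt everything after the '?' into a single CoreSecure token.
        string token = Encrypt(url.Substring(i + 1));
        Response.Redirect(url.Substring(0, i) + "?CoreSecure=" + HttpUtility.UrlEncode(token));
    }

    protected string SecureRequest(string name)
    {
        string token = Request.QueryString["CoreSecure"];
        if (token == null) return null;

        // Decrypt the token back into a normal query string and pick out one value.
        return HttpUtility.ParseQueryString(Decrypt(token))[name];
    }

    private static string Encrypt(string plain)
    {
        using (var aes = Aes.Create())
        using (var enc = aes.CreateEncryptor(Key, IV))
        {
            byte[] input = Encoding.UTF8.GetBytes(plain);
            return Convert.ToBase64String(enc.TransformFinalBlock(input, 0, input.Length));
        }
    }

    private static string Decrypt(string token)
    {
        using (var aes = Aes.Create())
        using (var dec = aes.CreateDecryptor(Key, IV))
        {
            byte[] input = Convert.FromBase64String(token);
            return Encoding.UTF8.GetString(dec.TransformFinalBlock(input, 0, input.Length));
        }
    }
}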
Conclusion
Everything works seamlessly, using familiar syntax.
Over the last several months I've also adapted this code to work with edge cases, such as hyperlinks that trigger a download - sometimes you need to generate a hyperlink on the client that has a secure QueryString. That works really well.
Let me know if you would like to see this code and I will put it up somewhere.
One last thought: it's weird to accept my own answer over some of the very thoughtful posts other people put on here, but this really does seem to be the ultimate answer to my problem. Thanks to everyone who helped get me there.
After going through all the above scenarios and answers, and this link (Data passing methods), my final advice would be:
COOKIES for:
ENCRYPT[userId's]
ENCRYPT[productId]
ENCRYPT[xyzIds..]
ENCRYPT[etc..]
DATABASE for:
datasets BY COOKIE ID
datatables BY COOKIE ID
all other large chunks BY COOKIE ID
My advice also depends on the statistics detailed in this link (Data passing methods).
I would never do this. I have never had any issues storing all session data in the database, loading it based on the user's cookie. It's a session as far as anything is concerned, but I maintain control over it. Don't give up control of your session data to your web server...
With a little work, you can support sub sessions, and allow multi-tasking in different tabs/windows.
As a starting point, I find that critical data elements, such as a Customer ID, are best put into the query string for processing. You can easily track/filter bad data coming off of these elements, and it also allows for some integration with e-mail or other related sites/applications.
In a previous application, the only way to view an employee or a request record involving them was to log into the application, do a search for the employee or do a search for recent records to find the record in question. This became problematic and a big time sink when somebody from a related department needed to do a simple view on records for auditing purposes.
In the rewrite, I made both the employee Ids and request Ids available through basic URLs of "ViewEmployee.aspx?Id=XXX" and "ViewRequest.aspx?Id=XXX". The application was set up to A) filter out bad Ids and B) authenticate and authorize the user before allowing them onto these pages. What this allowed the primary application users to do was to send simple e-mails to the auditors with a URL in the e-mail. When the auditors were in a big hurry during their bulk processing time, they were able to simply click down a list of URLs and do the appropriate processing.
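A sketch of that guard for a page like ViewEmployee.aspx (the helper methods are hypothetical):

using System;
using System.Web.UI;

public partial class ViewEmployee : Page
{
    protected void Page_Load(object sender, EventArgs e)
    {
        int id;
        if (!int.TryParse(Request.QueryString["Id"], out id) || id <= 0)
        {
            Response.Redirect("NotFound.aspx"); // A) filter out bad Ids
            return;
        }
        if (!User.Identity.IsAuthenticated || !UserCanViewEmployee(id))
        {
            Response.Redirect("AccessDenied.aspx"); // B) authenticate and authorize
            return;
        }
        LoadEmployee(id);
    }

    // Hypothetical helpers; real authorization and data access go here.
    private bool UserCanViewEmployee(int id) { return true; }
    private void LoadEmployee(int id) { /* bind employee data to the page */ }
}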
Other session-related data, such as modification dates and maintaining the "state" of the user's interaction with the application, gets a little more complex, but hopefully this provides a starting point for you.

Interpreting Search Results

I am tasked with writing a program that, given a search term and the HTML source of a page representing search results from some unknown search engine (it could really be anything: a blog, a shop, Google, eBay, ...), needs to build a data structure of the results containing "what's in the results": a title for each result, the "details" link, the position within the results, etc. It is not known whether the results page contains any of the data at all, or whether there are any search results. The goal is to feed the data structure into another program that extracts meaning.
What I am looking for is not BeautifulSoup or a RegExp, but rather some clever ideas or algorithms on how to interpret the HTML source. What do I do to find out what part of the page constitutes a single result item? How do I filter out the markup noise to extract the important bits? What would you do? Pointers to fields of research covering what I'm trying to do are also greatly appreciated.
Thanks, Simon
I doubt that there exists a silver-bullet algorithm that, without any training, will just work on arbitrary search output.
However, this task can be solved, and is actually solved in many applications, but with a different approach. First you have to define the general structure of a single search result item based on what you are actually going to do with it (it could be name, date, link, description snippet, etc.), and then write a number of HTML parsers that extract the necessary fields from the search result output of particular web sites.
I know it is not a super sexy solution, but it is probably the only one that works. And it is not rocket science. Writing parsers is actually extremely simple; you can make a dozen per day. If you look into the HTML source of search results, you will notice that the output is typically very structured and marked up with specific div sections or class attributes, so it is very easy to find in the document. You don't even have to use a complicated HTML parsing library for that; something grep-like will be enough.
For example, on this particular page your question starts with <div class="post-text"> and ends with </div>. Everything in between is actually a post text with some HTML formatting that you may want to remove along with extra spaces and "\n". And this <div class="post-text"> appears on the page only once.
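A grep-like sketch of that extraction (illustrative only; real pages nest divs, so a proper parser is safer for anything serious):

using System;
using System.Text.RegularExpressions;

public static class PostTextExtractor
{
    public static string Extract(string html)
    {
        const string open = "<div class=\"post-text\">";
        int start = html.IndexOf(open, StringComparison.OrdinalIgnoreCase);
        if (start < 0) return null;
        start += open.Length;

        int end = html.IndexOf("</div>", start, StringComparison.OrdinalIgnoreCase);
        if (end < 0) return null;

        // Drop the remaining tags and collapse leftover whitespace and "\n".
        string text = Regex.Replace(html.Substring(start, end - start), "<[^>]+>", " ");
        return Regex.Replace(text, @"\s+", " ").Trim();
    }
}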
Once you go to large scale with your retrieval application, you will find out that there is not that big a variety of search engines across different sites, and you will be able to re-use already-created parsers for sites using similar search engines.
The only thing you have to remember is built-in self-testing. Sites tend to upgrade and change their design from time to time. If your application is going to live for a while, you will need to include in your parsers some logic that checks the validity of their results and notifies you whenever the search output has changed and is no longer compatible with your parser. Then you will have to modify the particular parser or write a new one.
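The self-test can be as blunt as checking that every parsed item still has its required fields (the SearchResult type here is made up):

using System.Collections.Generic;

public class SearchResult
{
    public string Title;
    public string Link;
}

public static class ParserSelfTest
{
    public static bool LooksValid(IList<SearchResult> results)
    {
        // An empty page or items missing core fields usually means the
        // site's markup changed and the parser needs attention.
        if (results == null || results.Count == 0) return false;
        foreach (var r in results)
            if (string.IsNullOrEmpty(r.Title) || string.IsNullOrEmpty(r.Link))
                return false;
        return true;
    }
}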
Hope this helps.

asp.net profile provider

I know there are already some questions on this topic on the site...
I am just trying to understand whether it's safe to use the ASP.NET Profile Provider on a website with huge traffic.
The way I see it, it's laid out inefficiently. You store the property name (which is a string) and the property value (which is also a string). If you are just trying to store, say, age in the profile, you are unnecessarily storing the string "age" in the database over and over, whereas with a self-created table you could just add a column titled age, with no redundancy.
(I am just trying to make sure I am not missing something about it, because I am fairly new to it.)
The profile provider uses an EAV (Entity-Attribute-Value) design deliberately, because profiles in general very commonly have a sparsely populated schema - that is, there are many potential attributes, but only a few will be used for a given single entity, and the few that are used varies widely from one entity to the next.
Let's use a totally arbitrary example - let's say only one in 10 of your users want to provide their age. Making that a column now seems more like a waste, no?
But what if your application makes age mandatory? OK, that column gets populated for everyone. But what if you need to make a note in the profile "user doesn't want to see this obscure dialog anymore". Do you really want a column for every single dialog in your application whether a user wants to see it? Probably not. When you get into the little one-off details of an application of any significant scope, EAV actually becomes the more economical choice.
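For reference, the EAV shape being described boils down to one row per (user, property) pair rather than one column per property (the names below are illustrative, not the provider's actual schema):

public class ProfileEntry
{
    public int UserId { get; set; }
    public string PropertyName { get; set; }  // e.g. "age", "HideExportDialog"
    public string PropertyValue { get; set; } // everything stored as a string
}

A sparsely used attribute then costs one row for the few users who set it, instead of a mostly-NULL column for everyone.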
In general, it scales quite well (far better than you probably think). In the specific, it doesn't matter - as always, use what works and fix performance problems when they come up. Whatever the scalability limitations of the profile provider are, you'll know when you hit them. I guarantee two things: (1) you'll have to fix a lot of other performance problems you didn't expect before you have to fix that; and (2) if your site is getting enough traffic to break the profile provider, it's a good problem to have.
I agree with Rex M, unless you have a need to do things like sort all your users by age or do other procedures with aggregate profile data. Then you could consider rolling your own. But for just storing properties that you access here and there on a user-by-user basis, Rex M is right.
I do know what you mean. Wouldn't it make sense to supplement the profile provider's table with another table that has columns for mandatory fields? Or do you think the overhead of the join would make it not worth it?
