Best way to design URIs when they are based on user generated content - http

In our system, we have URLs for pages where the content, including the title, is based on user generated content. I'm trying to figure out the best design that balances SEO, human readability and resiliency.
I've been reading a bunch of material on this, including Tim Berners-Lee's document from way back: Cool URIs don't change.
As an example, imagine I have a book review site where users are submitting content (a worded review) and the book's title and author.
So say they submitted a book review for A Tale of Two Cities (which the user unintentionally misspells) with the author Charles Dickens. The URL could be:
http://foo.com/charles-dickens/a-tale-of-two-cities
Later on, if another book by Dickens is added, it could be:
http://foo.com/charles-dickens/oliver-twist
Then http://foo.com/charles-dickens/ could be a list of all the reviewed books on the site.
However, the problem comes into play if a change is made to the book title. Imagine the user misspelled something, like A Tale of Two City, and it's later corrected. This would also change the URL, breaking any external links to that page, its PageRank, etc.
What is the recommended way to handle this type of problem? Options I see:
First commit wins: No changes to the URL are possible after it's initially established.
Last commit wins: Always change the URL. So if there's a change to the user-generated content, revise the URL. With this approach, either the old URL is dead or a trail is preserved of all the URL changes and all of them still function. Stack Overflow seems to do this.
Don't base URL on UGC: Ignore the user-generated content and just come up with URLs not based on it. So the URL could be http://foo.com/reviews/1234.
What are people's thoughts on this?

You're slightly wrong; Stack Overflow combines #2 and #3. A question has a specific id, and that's all you need to locate the question. For example, this question's id is 11011252. You can access the question with https://stackoverflow.com/questions/11011252, no need to add the portion of the URL (or would you call it a URI in this case?) generated from the question title. In fact, that will get automatically tacked on (whether by redirect or some other method) when you use the titleless address.
Even better, you can append whatever you want (within reason, I suppose) to the end of the address. https://stackoverflow.com/questions/11011252/this-text-will-be-ignored will take you to the question without any problem.
Stack Overflow isn't the only website that does this, either; many other sites I've seen focused on user-generated content follow the same protocol/whatever you call it. It seems like the best method to go with, as it combines the advantages of #3 (underlying URI remains the same) with the advantages of #2 (the URL contains some information about its target, which users will like), and best of all means you won't get any URI conflicts if two people generate content with the same non-unique identifiers.
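For illustration, here's a minimal PHP sketch of that combined approach: look the review up by its numeric id only, and redirect any stale or misspelled slug to the canonical one. The /reviews/{id}/{slug} layout and the find_review_by_id() helper are assumptions for the example, not how Stack Overflow actually does it:

    <?php
    // Hypothetical routing for /reviews/1234/whatever-slug: only the numeric id
    // is used to locate the record; the slug is cosmetic and gets corrected.
    function slugify($text) {
        $slug = strtolower(trim($text));
        $slug = preg_replace('/[^a-z0-9]+/', '-', $slug); // non-alphanumerics -> dashes
        return trim($slug, '-');
    }

    // e.g. REQUEST_URI = "/reviews/1234/a-tale-of-two-city"
    if (!preg_match('#^/reviews/(\d+)(?:/([^/?]*))?#', $_SERVER['REQUEST_URI'], $m)) {
        header('HTTP/1.1 404 Not Found');
        exit;
    }

    $review = find_review_by_id((int) $m[1]); // assumed DB helper, not a real API
    if (!$review) {
        header('HTTP/1.1 404 Not Found');
        exit;
    }

    $canonical = slugify($review['title']);   // regenerated from the *current* title
    if (!isset($m[2]) || $m[2] !== $canonical) {
        // Old or misspelled slug: 301 to the canonical URL so existing links keep working.
        header('Location: /reviews/' . $m[1] . '/' . $canonical, true, 301);
        exit;
    }

    // ...render the review page...

So even if A Tale of Two City gets corrected later, the old URL still resolves; it just redirects to the fixed slug.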

Related

Is static URL better than dynamic URL in terms of SERP?

I've been reading up on SEO and how to construct my links in terms of getting better SERP.
I'm using WordPress as the framework for my site and have custom templates retrieving data from my DB.
What makes a URL dynamic is the usage of ? and &. Nothing more, nothing less. Google recommends that I should not have too many attributes in my URL - and that's understandable.
Dynamic: www.mysite.com/?id=123&name=some+store+name&city=london
Static: www.mysite.com/london/some+store+name/123
Q1: I don't feel that adding the store ID in this static URL looks nice. But I do need it in order to fetch data from the DB, right?
Reading various blogs, I see many SEO (experts) saying different things, but I feel most of it is just talk without actually proving their statements. We can all agree that static URLs are good in terms of usability (and readability).
Q2: But many claim that static URLs prevent duplicate content. I don't agree on that as all my contents have unique ID. Can anyone comment on this?
Q3: In the end, for the Google search engine (and others) it really doesn't matter if the URL is static or dynamic. But since Google is working towards user friendly content, is that the only argument for having static URLs?
1) There's no problem using DB ids alongside static URLs. Many huge e-commerce and other commercial sites do this (Amazon, eBay... hell, everyone really.)
2) A static URL in and of itself does not prevent duplicate content. There are hundreds of ways duplicates can happen (child pages, external copy, javascript, form fields, ajax, archive sections... the list goes on.)
3) It doesn't matter if it's static or dynamic for indexing. But in terms of ranking well, static URLs that are informative (and relevant to the targeted keywords) are hugely beneficial. Multivariate testing I've done shows users are also generally reassured by clean-looking URLs in terms of usability.
If you give me some more examples, I can probably help out a bit more.
URLs without parameters are always better. It won't absolutely kill SEO - but it is better not to have them.
10 years ago, Google would ignore parameters and would penalize you for URLs with parameters. Today they are really good at figuring out these DB parameters - but not perfect. Among other things, Google has to try to figure out which URL parameters matter, which don't, and whether parameter order matters.
E.g. you may have URL parameters that store user preferences, navigation state, etc. This will just proliferate URLs that Google has to try to decode. So what you should do is:
Right before generating a URL, at least sort your parameters.
Convert parameters that matter into things that don't look like parameters. So if I had a shoe store with URLs like http://mystore.com/mypage?category=boots&brand=great&color=red I'd rewrite that to something like http://mystore.com/mypage/category/boots/brand/great/color/red or even better:
http://mystore.com/mypage/boots/great/red
Then you can add the parameters that don't matter for the page content at the end. Google will figure out they don't matter.
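A rough PHP sketch of both ideas, folding the parameters that matter into the path and keeping the leftovers as a sorted query string. The parameter names are just the shoe-store example from above, and the sketch assumes you control the rewrite rules that map the path segments back to category/brand/color:

    <?php
    // Build http://mystore.com/mypage/boots/great/red?sessionid=... from arrays.
    function build_clean_url($base, array $pathParams, array $extraParams = array()) {
        // Path segment order is meaningful (category/brand/color), so keep it as given.
        $path = implode('/', array_map('rawurlencode', array_values($pathParams)));
        $url  = rtrim($base, '/') . '/' . $path;

        if ($extraParams) {
            ksort($extraParams); // stable, predictable order for the leftover params
            $url .= '?' . http_build_query($extraParams);
        }
        return $url;
    }

    echo build_clean_url(
        'http://mystore.com/mypage',
        array('category' => 'boots', 'brand' => 'great', 'color' => 'red'),
        array('sessionid' => 'abc123') // doesn't affect content, stays a query param
    );
    // http://mystore.com/mypage/boots/great/red?sessionid=abc123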
The other reason to fix your URLs is that Google displays them to users in the SERP, and people are more likely to click on readable URLs than database URLs.
Why do big stores like Amazon use database URLs? Because they are giant, bad URLs don't hurt them, and their systems are so large and complex that it is the only way to manage them. But for smaller sites with fewer products, readable URLs are achievable and are one of the few advantages a small site can have over a big one.
If you observe Google SERP results closely, you'll find that parts of the results are highlighted in bold. Looking more closely, you can see that the search query gets highlighted or bolded in the title, description and URL of results that use that same query in their title, description and URL.
The thing is, if a website's URLs are dynamic and built from parameter IDs, they lose those keywords in the URL.
Ex:
http://www.johnzaccheofineart.com/catagory-2/?id=4
http://www.johnzaccheofineart.com/painting/johnzaccheo
Sample Search : Painting for Sale
Now we can easily understand the difference between static and dynamic URL performance. One URL contains words with no search value; the other contains the category name as well as the painter's name.
So, as a user, I would give preference to the second one, which is understandable from the URL itself.

Crawl only articles/content

I want a crawler to be able to identify which pages on, for example, a news site, are actual content (i.e. articles), as opposed to About, Contact, category listings, etc.
I've found no elegant way to do this so far, as the criteria for content seem to vary by site (no common tags/layouts/protocols, etc.). Can anyone direct me either to libraries or methods that can identify with some level of certainty whether a page is a piece of content? It's perfectly acceptable to make this distinction after I have crawled the candidate page.
Barring anything that already exists, I'd also appreciate any starting points to existing/ongoing research in this area.
You can start by checking out the Boilerpipe framework. There is an online extraction demo available from the project's page. If the extraction result is not very good for your case, you'll need to extend their algorithm.
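If you do end up extending it or rolling your own, the core idea behind Boilerpipe-style extraction is shallow text features such as text density and link density. A very naive PHP sketch of that idea follows; the 0.25 ratio and 1000-character threshold are arbitrary assumptions for illustration, not Boilerpipe's actual algorithm:

    <?php
    // Crude "is this page an article?" check: compare the amount of text inside
    // links to the total visible text. Content pages tend to have a low
    // link-text ratio; navigation and category pages a high one.
    function looks_like_article($html, $maxLinkRatio = 0.25, $minTextLength = 1000) {
        $doc = new DOMDocument();
        @$doc->loadHTML($html); // suppress warnings on messy real-world HTML

        // Drop script/style so they don't count as "text".
        foreach (array('script', 'style', 'noscript') as $tag) {
            while ($node = $doc->getElementsByTagName($tag)->item(0)) {
                $node->parentNode->removeChild($node);
            }
        }

        $totalText = strlen(trim($doc->textContent));
        if ($totalText < $minTextLength) {
            return false; // too little text to be an article
        }

        $linkText = 0;
        foreach ($doc->getElementsByTagName('a') as $a) {
            $linkText += strlen(trim($a->textContent));
        }
        return ($linkText / $totalText) <= $maxLinkRatio;
    }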

Interpreting Search Results

I am tasked with writing a program that, given a search term and the HTML source of a page representing search results of some unknown search engine (it can really be anything, a blog, a shop, Google, eBay, ...), needs to build a data structure of the results containing "what's in the results": a title for each result, the "details" link, the position within the results, etc. It is not known whether the results page contains any of the data at all, or whether there are any search results. The goal is to feed the data structure into another program that extracts meaning.
What I am looking for is not BeautifulSoup or a RegExp but rather some clever ideas or algorithms on how to interpret the HTML source. What do I do to find out what part of the page constitutes a single result item? How do I filter the markup noise to extract the important bits? What would you do? Pointers to fields of research covering what I'm trying to do are also greatly appreciated.
Thanks, Simon
I doubt there exists a silver-bullet algorithm that, without any training, will just work on any arbitrary search query output.
However, this task can be solved, and actually is solved in many applications, but with a different approach. First you have to define the general structure of a single search result item based on what you're actually going to do with it (it could be a name, date, link, description snippet, etc.), and then write a number of HTML parsers that will extract the necessary fields from the search result output of particular websites.
I know it is not a super sexy solution, but it is probably the only one that works. And it is not rocket science. Writing parsers is actually extremely simple; you can make a dozen per day. If you look into the HTML source of a search results page, you will notice that the results are typically very structured and marked with specific div sections or class attributes, so they are very easy to find in the document. You don't even have to use a complicated HTML parsing library for that; something grep-like will be enough.
For example, on this particular page your question starts with <div class="post-text"> and ends with </div>. Everything in between is the post text, with some HTML formatting that you may want to remove along with extra spaces and "\n". And this <div class="post-text"> appears on the page only once.
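As a small PHP sketch of that kind of site-specific parser, using DOMXPath rather than grep; the post-text class is just this site's markup at the time of writing, and every target site would need its own expression:

    <?php
    // Pull out every div with class "post-text" and strip the markup.
    $doc = new DOMDocument();
    @$doc->loadHTML($html); // $html = fetched page source
    $xpath = new DOMXPath($doc);

    $results = array();
    $position = 0;
    foreach ($xpath->query('//div[contains(@class, "post-text")]') as $div) {
        $position++;
        $text = preg_replace('/\s+/', ' ', $div->textContent); // collapse whitespace and "\n"
        $results[] = array('position' => $position, 'text' => trim($text));
    }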
Once you go to a larger scale with your retrieval application, you will find that there is not that big a variety of search engines across different sites, and you will be able to re-use parsers you've already created for sites that use similar search engines.
The only thing you have to remember is built-in self-testing. Sites tend to upgrade and change their design from time to time. If your application is going to live for some time, you will need to include some logic in your parsers that checks the validity of their results and notifies you every time the search output has changed and is no longer compatible with your parser. Then you will have to modify that particular parser or write a new one.
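That self-test can be as simple as a sanity check on the parser output, something along these lines (the field names and the alerting mechanism are made up for the example):

    <?php
    // Run after each parse; complain if the output no longer looks sane.
    function validate_results(array $results) {
        if (count($results) === 0) {
            return 'parser returned zero results'; // layout probably changed
        }
        foreach ($results as $r) {
            if (empty($r['title']) || empty($r['link'])) {
                return 'result missing title or link';
            }
        }
        return null; // looks OK
    }

    if ($error = validate_results($results)) {
        error_log('search parser needs attention: ' . $error); // or send an email/alert
    }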
Hope this helps.

Any justification for an IT policy that query parameters should not be used?

My company, which builds ad server, affiliate network, contact form and CRM software was acquired last year, and we are now in the process of reworking our technology to fit the IT policies and guidelines of the parent corporation.
One of these policies is a tremendous sticking point and causing all sorts of problems for us:
No query parameters are to be used in any URL visible to the end user
This includes content URLs, ad clickthrough targets, redirects, anything which will either show up in the address bar or in a mouseover status bar update. The effect would be no affiliate ID parameters, media source tracking IDs, session IDs, CMS content selection parameters, anything. Several fundamental functions of our software simply can't be accomplished without passing parameter data from one page to another. In our case, many of these links come from different sites or subdomains, so it's not possible to pass data via cookies, either.
The only justification I've been given is that query parameters prevent some proxy caches from working properly. This makes no sense to me--I've never heard of such a thing--and nobody is willing or interested in discussing it at length. I've not even been given an example of what specifically is broken or why the policy was created.
In any case, this being a global corporate IT policy, in the end the reasoning doesn't matter, only compliance. Although getting it changed is most likely out of the question, I would still like to understand what valid concerns may have prompted its institution. Understanding the mindset may be a first step towards finding a workaround.
My first thought for a workaround was to embed parameters within the path portion of the URL and extract them with an Apache mod_rewrite, but this is out of the question because:
Corollary: Every URL must present unique content available through no other URL
So making multiple URLs which actually refer to the same page but contain other parameter data in the URL is also unacceptable.
Questions:
Is there valid justification for not using query parameters?
Specifically what proxies or systems fail to work when query parameters are present?
Does it possibly have something to do with SEO? The corollary makes it appear so.
What workarounds might there be for passing data from one site to another under this restriction?
I only have an answer for the "workaround" question: use PATH_INFO.
Edit, to be more specific:
Instead of /banner.php?what=ever&any=thing, use /banner.php/what=ever/any=thing. Apache will still serve the request through /banner.php, and /what=ever/any=thing will be present in $_SERVER['PATH_INFO']. You'll have to rawurldecode() and explode() the string yourself, since the web server won't do that for you, but that's no big deal.
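A small PHP sketch of that parsing step, assuming the /key=value/key=value layout from above:

    <?php
    // /banner.php/what=ever/any=thing  ->  array('what' => 'ever', 'any' => 'thing')
    $params = array();
    if (!empty($_SERVER['PATH_INFO'])) {
        foreach (explode('/', trim($_SERVER['PATH_INFO'], '/')) as $segment) {
            $pair = explode('=', rawurldecode($segment), 2);
            if (count($pair) === 2) {
                $params[$pair[0]] = $pair[1];
            }
        }
    }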

"context" on a drupal page

This question is a general one, and I've already posted a version of it here. I'm hoping, though, that I'll have a better chance of getting a response, and of being useful to more people, by asking in this forum.
Associating content together when it all loads on a Drupal page is tricky business. In Drupal, each page, no matter the site, is basically the same: you have main content in the middle (a view, a node, or multiple nodes), with blocks surrounding that central content. To make the blocks somehow aware of what's in the middle (much less aware of each other), you either have to do some really fancy footwork in your own custom module, or you have to make "arguments" available in the URL.
I've been studying the spaces/context/features/purl suite of modules provided by developmentseed, and I've also looked into the Panels/Ctools modules made by Earl Miles (the guy who wrote views). While both provide tools to make my job easier, my understanding of each is that I'm still required to place "arguments" in the URL if I want the contents of my blocks defined by my "context" (I use that in the general sense, and not in the specific sense meant by either the context module, or the concept of context in Ctools).
Am I missing something, or is that where we're at with Drupal?
Finally, I should say in closing that I am aware of other modules that help with this kind of thing on a limited, case-by-case basis. The Views attach module and the Node reference views module, for instance, each take a stab at solving this issue for a very specific use case. They're both good modules, and there are others like them, but I'd really like to find a solution to this problem in general.
I guess I do not really understand what you're aiming at, but I'll try nonetheless:
For every non-static website, be it based on Drupal or anything else, there are two basic things providing the 'context' for the decision on what content to deliver for a given request.
The first and most important thing is obviously the request itself. This is the only information that is always guaranteed to be there. In most cases, this will simply be a GET request, and with those, the URL is implicitly the main source of 'context' available. POST requests can provide a bit more 'context' besides the URL, but for your question, one could argue that they are just a more complicated variation of a GET request, providing some more 'arguments' besides the ones from the URL (and in most cases, one could turn a POST request into a GET request with a more elaborate URL anyway).
The second 'context providing' thing is the session. Whatever mechanism session handling is based on (predominantly cookies nowadays), the goal is always the same, namely to carry some 'state' information across the boundary of inherently stateless requests. It does this by tying a given request to information from previous requests, stored on the server side. This allows the information available for deciding what content to deliver for a request to be 'enriched'. Basically, one could look at it as a way of adding some more 'arguments' to the request.
And that's it. Any other information needed for assembling a response needs to somehow be derived from the information given in the request (and one could say that session handling is already the main process of doing so, by adding 'context' based on a cookie or some other identifier coming with the request).
Drupal reflects this process pretty well, IMHO, as it first assembles the 'main' content for a response based on the URL, with additional information (e.g. about the user) 'attached' in the session. It is only after the main content has been assembled via calling $return = menu_execute_active_handler() in index.php that the other elements of a response get added (e.g. blocks, menus, etc.) by calling theme('page', $return);.
So whatever 'context' it is that you want to 'pass' to those other elements, you either have to 'reextract' it from the information already used for assembling the main content (URL, session), or you have to store it temporarily during the generation of the main content. You can do this in many ways, e.g. by adding it to the information already stored in the session, by using static caching within some functions, by setting global variables (don't ;), by passing stuff through the database, etc...
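As a concrete sketch of the 'reextract it from the URL' plus static-caching route: a tiny helper in a custom module (mymodule here is purely hypothetical, Drupal 6/7-style) can derive a page 'context' once per request from the node the menu system already loaded, and any block can then reuse it without extra URL arguments:

    <?php
    // mymodule.module (hypothetical): derive a per-request 'context' from the
    // node being viewed, cached in a static so every block sees the same thing.
    function mymodule_page_context() {
        static $context = NULL; // poor man's per-request cache
        if ($context === NULL) {
            $context = array('type' => 'none');
            $node = menu_get_object('node'); // the node loaded for node/% paths, if any
            if ($node) {
                $context = array(
                    'type' => $node->type, // e.g. 'book_review'
                    'nid'  => $node->nid,
                );
            }
        }
        return $context;
    }

    // Any block implementation can then call mymodule_page_context()
    // and decide what to render based on the result.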
So again, I do not seem to understand what you are aiming at. What is it that you are missing here?
Good answer from Henrik, but I'd like to add that there can be quite a lot of information in the request besides state maintained with cookies. Think of important HTTP headers like Accept or Accept-Language, or even X-Requested-With. Most web frameworks wrap this information into one convenient data structure. Unfortunately, from the answers given, I have to conclude that Drupal doesn't.
