Sites like cnn.com or foxnews.com.
Where do they store all the articles? In HTML files? In a database?
It seems more logical to store everything in a database, but then how do you generate a static link to something that lives inside the database?
They don't appear to use a dynamic page load like LoadArticle.aspx?ArticleID=123; every article has its own address.
Please explain how this is done.
They use a special content management library called VoodooLib.dll.
Seriously, when you write something to a database, you normally generate some kind of unique identifier, 123 for example. It gets permanently associated with that record (the article content). After that, the same ID can be embedded in a URL at any later time.
As for the static link, it is a simple matter of URL rewriting.
You generate static links to display on a page because they work much better for SEO. When a request for that static URL hits the server, it is replaced with something "server friendly" and then processed.
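A minimal sketch of the ID-in-the-URL idea described above, in Python for illustration. The /articles/ prefix and the function names are assumptions made up for this example, not anything a particular site or CMS is known to use.

```python
import re

def slugify(title):
    """Lowercase the title and join its alphanumeric runs with hyphens."""
    return "-".join(re.findall(r"[a-z0-9]+", title.lower()))

def friendly_url(article_id, title):
    """Build the public link from the record's permanent ID and its title."""
    return f"/articles/{article_id}/{slugify(title)}"

def article_id_from_url(path):
    """Recover the database ID from a friendly URL; None if it does not match."""
    match = re.fullmatch(r"/articles/(\d+)/[a-z0-9-]*", path)
    return int(match.group(1)) if match else None

url = friendly_url(123, "Where Do News Sites Store Articles?")
print(url)                       # /articles/123/where-do-news-sites-store-articles
print(article_id_from_url(url))  # 123
```

Only the ID is needed to find the record; the slug is there for readers and search engines.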
They probably use some form of Content Management System (CMS). There are many different ones out there; most store the actual content in a database or as XML (some store XML in a database). They will then either publish that content as static HTML pages or, more commonly now, as dynamic pages that are cached. Many use what are known as "friendly URLs": virtual addresses that are mapped to the actual physical file path using URL-rewriting techniques.
Note you can't tell whether a page is dynamic or static simply from the extension. It is quite possible to have dynamic pages that end in the .html extension.
Just because the URL looks "static" doesn't mean it is; they could be using something like mod_rewrite or an IIS ISAPI filter to make the URLs more search engine friendly.
For the high-volume news sites that you mention, however, they may very well generate the pages statically in order to prevent overloading the database with repeated requests for the same article.
Look at the URL of this page: it doesn't have xxx.aspx?some-query-string.
You are referring to friendly URLs.
To do something like that, one common way is to use URL rewriting and/or a custom HTTP module.
Here's a good reference: http://weblogs.asp.net/scottgu/archive/2007/02/26/tip-trick-url-rewriting-with-asp-net.aspx
Just because a page has a normal URL does not mean that it isn't serving dynamic content. With the Apache mod_rewrite module, it is possible to manipulate URLs. So, for example, a page like http://www.domain.tld/permalink/12345/message-title-slug can be converted internally to http://www.domain.tld/permalink/index.php?id=12345&slug=message-title-slug.
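The internal conversion in that example can be sketched as a single pattern match. This is a Python illustration of what a rewrite rule does, not mod_rewrite syntax; the pattern and function name are assumptions.

```python
import re
from urllib.parse import urlencode

# Matches /permalink/<numeric id>/<slug>
PERMALINK = re.compile(r"^/permalink/(?P<id>\d+)/(?P<slug>[^/?]+)$")

def rewrite(path):
    """Map the public path to the internal script URL; pass other paths through."""
    match = PERMALINK.match(path)
    if match is None:
        return path
    return "/permalink/index.php?" + urlencode(match.groupdict())

print(rewrite("/permalink/12345/message-title-slug"))
# /permalink/index.php?id=12345&slug=message-title-slug
```

The visitor never sees the rewritten form; it exists only between the web server and the script.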
I do not know exactly what cnn.com and foxnews.com use, but I would bet that they use a Content Management System (CMS) which serves all pages dynamically, with the content stored either in a database or on the filesystem, and with authoring/publishing all being performed through the particular CMS.
Just checking cnn.com, the article links contain:
Year
Location (US or WORLD/specificlocationid)
Month
Day
Article name.
All of this information together can be used to uniquely identify any article (even less of it is probably actually needed). The dynamic content loading page address could easily be hidden by some method of URL rewriting, and then the information in the requested URL is used to determine which article in the DB is to be served up.
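A sketch of how those URL parts could be turned into a lookup key. The exact path shape (/year/location/month/day/name/) is an assumption based on the parts listed above, and article_key is a hypothetical helper, not anything cnn.com is known to run.

```python
import re
from datetime import date

# Assumed shape: /<year>/<US or WORLD/locationid>/<month>/<day>/<article name>/
ARTICLE_URL = re.compile(
    r"^/(?P<year>\d{4})/(?P<location>US|WORLD/[^/]+)"
    r"/(?P<month>\d{2})/(?P<day>\d{2})/(?P<name>[^/]+)/?$"
)

def article_key(path):
    """Return (publication date, location, name), or None if the URL is not an article."""
    m = ARTICLE_URL.match(path)
    if m is None:
        return None
    try:
        published = date(int(m["year"]), int(m["month"]), int(m["day"]))
    except ValueError:  # e.g. month 13
        return None
    return (published, m["location"], m["name"])

print(article_key("/2009/WORLD/europe/04/15/some-article/"))
```

The resulting tuple could serve as a composite key for the database query.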
I don't know why all the other answerers seem to assume that some form of URL rewriting is necessary to create friendly URLs. It's not true at all.
It's perfectly possible to write web serving code that splits a URL into parameters - eg year, month, title - and pass that directly to the code that gets the content from the database, without any need to rewrite the URL. Most modern web frameworks such as Django and Rails include this functionality out of the box.
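To make the distinction concrete, here is a toy router in plain Python: the URL is never rewritten, it is matched against a pattern and its pieces are passed straight to the handler. This only imitates the general idea behind framework routing; the route decorator, dispatch function and URL pattern are invented for this sketch and are not Django's or Rails' actual API.

```python
import re

ROUTES = []  # list of (compiled pattern, handler)

def route(pattern):
    """Register a handler for URLs matching the given regular expression."""
    def register(handler):
        ROUTES.append((re.compile(pattern), handler))
        return handler
    return register

@route(r"^/(?P<year>\d{4})/(?P<month>\d{2})/(?P<title>[\w-]+)/$")
def show_article(year, month, title):
    # A real handler would query the database with these values.
    return f"article {title!r} from {year}-{month}"

def dispatch(path):
    """Call the first handler whose pattern matches, passing the URL parts as arguments."""
    for pattern, handler in ROUTES:
        match = pattern.match(path)
        if match:
            return handler(**match.groupdict())
    return "404 Not Found"

print(dispatch("/2009/04/friendly-urls/"))
# article 'friendly-urls' from 2009-04
```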
This is done through mod_rewrite techniques.
Here's an article about the mod rewriting engine: http://httpd.apache.org/docs/1.3/mod/mod_rewrite.html
And here's their "guide": http://httpd.apache.org/docs/2.0/misc/rewriteguide.html
I hope that helps. It should make for a good starting point. Good luck.
Related
I have a page as part of my IIS 7 (ASP.NET) website which serves images from a database. It uses a querystring to select the image and sets the content type header appropriately (image/jpeg) so that, for example, image.aspx?ID=1234 will be displayed in the browser as a jpeg image.
What I want to do instead is offer a URI formed in a manner such as image/1234.jpg which will produce the same result. In other words, there is no actual file on the server named 1234.jpg, it's just the contents of a database record, but from the browser's perspective, it will appear as if there is such a file.
I'm sure this is possible, but I can't figure out how it's accomplished, or where to look for answers. I'm thinking it may be done with an ISAPI filter, but I haven't found an accessible path into the docs to know if that's even the correct basis for a solution.
Possibly the best option here would be to implement a URL rewrite rule that changes image/1234.jpg to image.aspx?ID=1234
You can find more on URL rewrite for IIS here.
If, for whatever reason, URL rewrite isn't an option to you, then another possible method might be to implement a custom 404 page. When your request to image/1234.jpg doesn't result in a real file, it'll end up there.
You should be able to detect the URI at that point and serve up the image.
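The 404 fallback can be sketched like this, in Python rather than ASP.NET. The IMAGES dictionary stands in for the database table, and handle_not_found is a hypothetical name for whatever code runs when no physical file matches.

```python
import re

# Stand-in for the database table of images, keyed by ID.
IMAGES = {1234: b"\xff\xd8\xff\xe0 fake jpeg bytes"}

IMAGE_PATH = re.compile(r"^/image/(\d+)\.jpg$")

def handle_not_found(requested_path):
    """Called when no file exists; returns (status, content type, body)."""
    match = IMAGE_PATH.match(requested_path)
    if match:
        body = IMAGES.get(int(match.group(1)))
        if body is not None:
            # The path looked like an image URL and the record exists: serve it.
            return 200, "image/jpeg", body
    return 404, "text/plain", b"Not Found"

status, content_type, _ = handle_not_found("/image/1234.jpg")
print(status, content_type)  # 200 image/jpeg
```

A rewrite rule remains the cleaner choice where it is available, since it avoids routing every image request through the error path.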
I have an existing site with URL rewriting enabled using ISAPI Rewrite and IIRF files. It's causing a lot of problems, both for site development and maintenance and because of continuous errors on the server.
Because of this I'm looking to replace it with .NET 4.0's URL Routing.
I'm having two problems with this. First, the current rules are set up so that the page being routed to is simply the rewritten URL with the file extension appended to the end.
So www.site.com/page/ would become www.site.com/page.aspx
The second problem is that certain rewritten URLs actually point to physical files within sub-folders on the website, using the same logic as above.
So www.site.com/folder/page/ would redirect to www.site.com/folder/page.aspx
I've read through this article and created custom classes implementing IRouteHandler and IHttpHandler, but I'm not really sure where to go from here.
I've tried a few different things, mainly related to variables in the URL and trying to redirect to them, but I'm not sure that's the right way for me to go.
I'm not posting code because all I have at the moment is pretty much the example code from the link above.
I've implemented basic routing on another site, where a single page uses a variable in the URL to grab the relevant content out of a database, and that seemed simple enough. This, however, is making my head spin, and I'm sure it shouldn't be this complicated.
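The mapping rule described in this question (strip the trailing slash, append .aspx, keep any sub-folder) is small enough to state as a function. This is a Python sketch of the rule only, not of ASP.NET routing itself; physical_path is a hypothetical name, and leaving the site root unchanged is an assumption.

```python
def physical_path(url_path):
    """Map /page/ to /page.aspx and /folder/page/ to /folder/page.aspx."""
    trimmed = url_path.strip("/")
    if not trimmed:
        return "/"  # assumption: the site root is left unchanged
    return "/" + trimmed + ".aspx"

print(physical_path("/page/"))         # /page.aspx
print(physical_path("/folder/page/"))  # /folder/page.aspx
```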
For the past few years, if I've wanted a URL of a page on a site rewritten I've put the rewritten URL into the link on the page.
E.g. If the page is /Product.aspx?filename=ProductA and it's rewritten to /Product/ProductA.aspx then I've put the following in my link:
...
However, with outbound rules I could just put the links in to the actual file paths, and rewrite with an outbound rule.
Is this a bad method? Would it cost the server unnecessary additional resources?
I would not consider this bad practice. In fact it affords you some additional flexibility, as your mapping from friendly to real URLs is all managed in one central location. If your SEO team decides to change the URL scheme, you don't have to pick through all the links on your site updating them and risk missing one!
One important limitation of the current version of the IIS rewrite module is that you cannot use outbound rewriting in conjunction with static compression. You can, however, still use dynamic compression. Static compression is nice because it caches the compressed version of the page. See this article for instructions on getting URL rewrite working with dynamic compression: http://forums.iis.net/p/1165899/1950572.aspx
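Conceptually, an outbound rule is a search-and-replace over the generated HTML before it leaves the server. The sketch below shows that idea in Python using the Product URLs from the question; it is not the IIS rule syntax, and the pattern is an assumption about which file names are allowed.

```python
import re

# Matches links written against the real file path.
REAL_LINK = re.compile(r'href="/Product\.aspx\?filename=([A-Za-z0-9_-]+)"')

def rewrite_outbound(html):
    """Replace real paths in outgoing HTML with their friendly equivalents."""
    return REAL_LINK.sub(r'href="/Product/\1.aspx"', html)

page = '<a href="/Product.aspx?filename=ProductA">Product A</a>'
print(rewrite_outbound(page))
# <a href="/Product/ProductA.aspx">Product A</a>
```

Because the response body must be scanned on every request, this is also where the extra server cost asked about in the question would come from.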
The situation is the following: I created a site with Plone, and developed and used it behind a test URL. Now it has to be published, but the test URL is not appropriate and I don't want to move the site. I think that if I use a redirect, the new address won't appear in the URL bar except on the site's start page. Am I wrong? (The test URL should not be used, because it will be a "semi-official" site.) What do you suggest I do?
As far as I can see, Plone uses absolute URLs everywhere. I can add relative URLs, but if I create a new page, a new event, etc., they get absolute URLs on the other automatically generated inner pages. Is there any way to convert these URLs to relative paths? Is there a setting, such as a checkbox, that changes this default?
Plone does not store your URLs in the database. It uses the inbound host header (and any virtual hosting configuration set up with rewrite rules in Apache or Nginx) to calculate the correct absolute URL when rendering the page.
In other words - as soon as you actually point the relevant domain name to the server with your Plone instance, it'll just work.
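The principle can be illustrated with a few lines of Python: the absolute URL is computed per request from the Host header, so nothing needs to change in stored content when the domain changes. This is a sketch of the idea only, not Plone's or Zope's actual code; absolute_url and the example host names are made up.

```python
def absolute_url(host_header, object_path, scheme="http"):
    """Build an absolute URL from the request's Host header and the object's path."""
    return f"{scheme}://{host_header}/{object_path.lstrip('/')}"

# The same stored object, requested through two different host names:
print(absolute_url("test.example.org", "/news/item"))  # http://test.example.org/news/item
print(absolute_url("www.example.org", "/news/item"))   # http://www.example.org/news/item
```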
P.S.
You should put a bit more effort into asking your question. This is just a copy and paste of a half-finished email chain where you tried to get the answer from me in private. It's not very easy to understand what you're asking.
I think what you are looking for is URL rewriting to handle virtual hosting, i.e. to get your site to appear as if it's at the root URL of a domain.
This is normally done via the web server that sits in front of Plone. For Apache, here is a how-to:
http://plone.org/documentation/kb/plone-apache/virtualhost
for other servers
http://plone.org/documentation/manual/plone-community-developer-documentation/hosting
You can also achieve this directly in Zope (via the ZMI) using something called the Virtual Host Monster. See http://docs.zope.org/zope2/zope2book/VirtualHosting.html
PS. I don't think your question is badly worded. Plone does serve pages with a "base" tag and what appear to be absolute URLs. They aren't baked into the database, but it's also not obvious that the way to get the URL you want is the VHM URL syntax plus a proxying front-end web server. There is a reason why it doesn't use relative URLs, but it was so long ago that I can't remember it.
I've seen a lot of dynamic websites on the internet whose pages end in .html or .htm. Why is that, and how do they do it?
Just look at this website: http://www.realmadrid.com/cs/Satellite/en/Home.htm
What you see in the URL can be set at will by the people running the web site. The technique is called URL rewriting.
How
On Apache, the most popular solution to that is the mod_rewrite module.
Seeing as you've tagged ASP.NET: As far as I know, ASP.NET has only limited rewriting support out of the box. This blog entry promises a complete URL rewriting solution in ASP 2.0
Why
As for the why, there is no compelling technical reason to do this.
It's just that htm and html are the recognized standard extensions for HTML content, and many (including myself) think they simply look nicer than .php, .php5, .asp, .aspx and so on.
Also, as Adam Pope points out in his answer, this makes it less obvious which server side technology/language is used.
The .html/.htm extension has the additional effect that if you save it to disk, it is usually automatically connected with your installed browser.
Maybe (a very big maybe) there are very stupid simple client programs around that recognize that they have to parse HTML by looking at the extension. But that would be a blatant violation of rules and was hopefully last seen in 1994. Anyway, I don't think this is the case any more.
There are a number of potential reasons. These may include:
They could be trying to hide the technology they built the site with
They could be serving a cached version of a page which was written out to HTML.
They could simply perceive it to look friendlier to the user
They might be using a server-side scripting language like PHP or ASP. You can configure what file extensions get parsed by the language by editing the web server configuration files.
For example, in PHP the default extension is .php, but you could configure the server to use .html. That would mean any file with the .html extension could contain PHP code, which would be parsed before the page is sent to the client's web browser.
This is generally not recommended, as it adds overhead: .html pages that don't contain any PHP would be parsed by the PHP engine anyway, which is slower than serving pages directly to the browser.
The other way would be to use some form of URL rewriting. See URL Rewriting in ASP.NET
Another reason is SEO (search engine optimization). Many search engines like HTML pages, and some SEO specialists think the .html extension can improve the rank of their content in search engines.
One possibility is just historical reasons. Pages that started out static are now generated dynamically, but sites don't want to break existing users' favorites.
They keep some pages as html because their content is not supposed to change frequently or not at all.
But you should also keep in mind that some sites are dynamic and only change the page extension to .html in the URL, while the underlying page remains the same (e.g. PHP or ASPX). This is done using .htaccess or frameworks like CodeIgniter.