Google Sitemap HttpHandler caching - ASP.NET

I have an HttpHandler that generates a Google sitemap based on my ASP.NET web.sitemap. Fairly standard stuff, except that it does some fairly heavy database work to auto-generate additional URLs for Ajax tabs within pages.
All this means our DB gets hit fairly heavily if the bot hits sitemap.axd.
What we need, of course, is output caching. But how do you go about caching inside something that basically writes directly to an XmlTextWriter?

The simplest answer is to write the XML to a string and store it in a static field.
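A minimal sketch of that approach, assuming the existing generation code can be pointed at a StringWriter instead of the response stream; the handler name and the one-hour expiry are illustrative, not from the original code:

using System;
using System.IO;
using System.Web;
using System.Xml;

public class CachedSitemapHandler : IHttpHandler
{
    // Generated XML shared across requests; the one-hour expiry is an arbitrary example.
    private static string _cachedXml;
    private static DateTime _cachedAtUtc;
    private static readonly object _sync = new object();

    public bool IsReusable { get { return true; } }

    public void ProcessRequest(HttpContext context)
    {
        lock (_sync)
        {
            if (_cachedXml == null || DateTime.UtcNow - _cachedAtUtc > TimeSpan.FromHours(1))
            {
                _cachedXml = GenerateSitemapXml();   // the expensive web.sitemap/database walk
                _cachedAtUtc = DateTime.UtcNow;
            }
        }

        context.Response.ContentType = "text/xml";
        context.Response.Write(_cachedXml);
    }

    private static string GenerateSitemapXml()
    {
        // Point the existing XmlTextWriter code at a StringWriter instead of Response.Output.
        var buffer = new StringWriter();
        var writer = new XmlTextWriter(buffer);
        writer.WriteStartDocument();
        writer.WriteStartElement("urlset", "http://www.sitemaps.org/schemas/sitemap/0.9");
        // ... WriteStartElement("url") / WriteElementString("loc", ...) for each page ...
        writer.WriteEndElement();
        writer.WriteEndDocument();
        writer.Flush();
        return buffer.ToString();
    }
}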

Related

What different customizations are possible by using HttpHandlers in an ASP.NET application?

Digging deeper into HttpHandlers, I found they provide a nice way to customize an ASP.NET application. I am new to ASP.NET and I want to know about the different customizations that are possible using HttpHandlers. Lots of websites talk about how they are implemented, but it would be nice to know some use cases beyond what ASP.NET already provides.
An ASPX page provides a base template (so to speak) for a form-based web page. By default, it outputs text/html and makes it easy to add form elements and handle their events.
In contrast, an HttpHandler is stripped to the bone. It is like a blank slate for HTTP requests. Therefore, an HttpHandler is good for many types of requests that do not necessarily require a web form. You could use an HttpHandler to output dynamic images, JSON, or many other MIME type results.
A couple examples:
1) You have a page which needs to make an AJAX call which will return a JSON response. An HttpHandler could be set up to handle this request and output the JSON.
2) You have a page which links to PDF documents that are stored as binary blobs in a database. An HttpHandler could be set up to handle this request and output the binary blob as a byte stream with a PDF MIME type for the content type.
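A bare-bones sketch of the first example; the handler name and the JSON payload are made up:

using System;
using System.Web;

// Returns a small JSON payload for an AJAX call; register it as an .ashx
// or map it in web.config like any other handler.
public class StatusJsonHandler : IHttpHandler
{
    public bool IsReusable { get { return true; } }

    public void ProcessRequest(HttpContext context)
    {
        context.Response.ContentType = "application/json";
        context.Response.Write("{\"status\":\"ok\",\"serverTime\":\"" +
                               DateTime.UtcNow.ToString("o") + "\"}");
    }
}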
Check this page for a good example and code showing why you might want to customize them: http://dotnetslackers.com/articles/aspnet/Range-Specific-Requests-in-ASP-NET.aspx Essentially, an HttpHandler can be used when you want to serve certain files without making them accessible via a plain URL (security).

Create Google compliant dynamic XML sitemap

I want to create a dynamic (fetching data from the database) XML sitemap which I can submit to Google webmaster tools.
Surprisingly, I couldn't find any recent controls/code online to do this. The most recent code I found was this http://weblogs.asp.net/bleroy/archive/2005/12/02/432188.aspx which is for ASP.Net 2.0. I don't mind using this, but I suspect it's outdated.
Can somebody please point me in the direction of code which accomplishes this?
A couple of options include:
The ASP.NET SiteMap infrastructure. It allows you to write a custom sitemap provider, like this one which uses Microsoft Access, to generate a sitemap.
You can also find a very simple sitemap generator project on this site.
Another option (and a fun learning experience) is to write your own by just looking at the sitemap protocol, and using LINQ to SQL along with LINQ to XML to generate the format. Here is an example that uses LINQ to SQL and LINQ to XML to generate the XML.
Finally, Google also accepts RSS/Atom feeds, so you could generate one of those instead. If you go this route, you can make use of the SyndicationFeed class. There are also a couple of open source options available.
Actually, I just did this recently using LINQ to XML:
How to generate xsi:schemalocation attribute correctly when generating a dynamic sitemap.xml with LINQ to XML?
The string returned by that code is written directly to the Response object. I use an .ashx HttpHandler to deliver the content as XML, and routing to serve it under the name sitemap.xml. You should also reference it in your robots.txt file.
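For reference, a minimal sketch of that handler using LINQ to XML, including the xsi:schemaLocation attribute the linked question asks about; the handler name and the sample <loc> value are placeholders:

using System.Web;
using System.Xml.Linq;

public class SitemapXmlHandler : IHttpHandler
{
    public bool IsReusable { get { return true; } }

    public void ProcessRequest(HttpContext context)
    {
        XNamespace ns = "http://www.sitemaps.org/schemas/sitemap/0.9";
        XNamespace xsi = "http://www.w3.org/2001/XMLSchema-instance";

        var urlset = new XElement(ns + "urlset",
            new XAttribute(XNamespace.Xmlns + "xsi", xsi),
            new XAttribute(xsi + "schemaLocation",
                "http://www.sitemaps.org/schemas/sitemap/0.9 " +
                "http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd"),
            // In practice these <url> elements come from the database query.
            new XElement(ns + "url",
                new XElement(ns + "loc", "http://www.example.com/"),
                new XElement(ns + "changefreq", "daily")));

        var doc = new XDocument(new XDeclaration("1.0", "utf-8", "yes"), urlset);

        context.Response.ContentType = "text/xml";
        doc.Save(context.Response.Output);   // writes the declaration plus the urlset element
    }
}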

Updateable Google Sitemap for ASP.NET 3.5 Web App Project

I am working on an ASP.NET 3.5 Web Application project in C#. I have manually added a Google-friendly sitemap which includes entries for every page in the project - this is not a CMS.
<url>
<loc>http://www.mysite.com/events.aspx</loc>
<lastmod>2009-11-17T20:45:46Z</lastmod>
<changefreq>daily</changefreq>
<priority>0.8</priority>
</url>
The client updates events using an admin back-end. Other than that, the site is relatively static. I'm trying to decide on the best way to update the <lastmod> values for a handful of pages that are regularly updated.
In particular, I am using the QueryStringField of the DataPager control (paired with a ListView) to enhance SEO as described here:
https://web.archive.org/web/20211029044137/https://www.4guysfromrolla.com/articles/010610-1.aspx
http://gsej.wordpress.com/2009/05/31/using-a-datapager-with-both-a-querystringfield-and-renderdisabledbuttonsaslabels/
When the QueryStringField property is set, the DataPager renders the paging interface as a series of hyperlinks which the crawler can follow and index. However, suppose Google crawled my list of events two days ago and the admin has since added another dozen events; with a page size of 6, the Google SERP links would now point to the wrong pages. This is why I need to be sure that the sitemap reflects changes to the events page as soon as they happen.
I have already looked though other SO questions for info and didn't find what I needed. Can anyone offer some guidance or an alternative approach?
UPDATE:
Since this is a shared hosting environment, a directory watcher/service won't work:
How to create file watcher in shared webhosting environment
UPDATE:
Starting to realize that I may need to signal to Google that the containing page has been updated; perhaps by updating the Last-Modified HTTP header?
Rather than using a hand-coded sitemap, create a sitemap handler that will generate the sitemap on the fly. You can create a method in the handler that will grab pages from an existing navigation sitemap, from the database, or even from a hard-coded list of pages. You can create an XmlDocument from the list, and write the InnerXml of the document out to the handler response stream.
Then, create a class with a method that will automatically ping search engines with the above handler's URL (like http://www.google.com/webmasters/tools/ping?sitemap=http://www.mysite.com/sitemap.ashx).
Whenever someone adds a new event, call the above method. This will ping Google with your latest sitemap (freshly generated on request by the handler).
You want to make sure that the ping only works if the sitemap has actually been updated. You could use File.SetLastWriteTime on events.aspx in the AddNewEvent handler to signify that the containing page has been updated.
Also, be careful to make sure there have been no pings within the last hour (Google's guidelines discourage pinging more than once per hour).
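A rough sketch of such a ping helper, with the one-hour throttle; the class name is made up and the sitemap URL is the example one from above:

using System;
using System.Net;
using System.Web;

public static class SitemapPinger
{
    private static DateTime _lastPingUtc = DateTime.MinValue;
    private static readonly object _sync = new object();

    // Call this from the AddNewEvent code path after the sitemap content has changed.
    public static void PingGoogle()
    {
        lock (_sync)
        {
            // Google's guidelines discourage pinging more than once per hour.
            if (DateTime.UtcNow - _lastPingUtc < TimeSpan.FromHours(1))
                return;
            _lastPingUtc = DateTime.UtcNow;
        }

        string sitemapUrl = HttpUtility.UrlEncode("http://www.mysite.com/sitemap.ashx");
        string pingUrl = "http://www.google.com/webmasters/tools/ping?sitemap=" + sitemapUrl;

        using (var client = new WebClient())
        {
            client.DownloadString(pingUrl);   // a plain GET is all the ping endpoint needs
        }
    }
}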
I actually plan to implement this in the following OSS project: http://cyclemania.codeplex.com. I will let you know once it's done and you can have a look.
If you let your users add events to the website, you are probably using a database.
This means you can generate the XML sitemap at runtime, like this:
create a page where your sitemap will be available (this doesn't need to be sitemap.xml but can also be sitemap.aspx or even sitemap.ashx).
open a database connection
loop through all records and create an Xml Element for each record
This blog post should help you further: Build a Search Engine SiteMap in C#.
It does not use the new XElement API from .NET 3.5, but it will work fine.
You can put this in an .aspx page, but adding an HttpHandler is probably better, as described in a different post on the same blog: creating an HttpHandler for a sitemap.
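A rough sketch of those steps inside an HttpHandler, assuming an Events table with Url and LastModified columns and a connection string named "SiteDb" (all of those names are illustrative):

using System.Configuration;
using System.Data.SqlClient;
using System.Web;
using System.Xml;

public class EventsSitemapHandler : IHttpHandler
{
    public bool IsReusable { get { return true; } }

    public void ProcessRequest(HttpContext context)
    {
        context.Response.ContentType = "text/xml";

        var writer = new XmlTextWriter(context.Response.Output) { Formatting = Formatting.Indented };
        writer.WriteStartDocument();
        writer.WriteStartElement("urlset", "http://www.sitemaps.org/schemas/sitemap/0.9");

        string cs = ConfigurationManager.ConnectionStrings["SiteDb"].ConnectionString;
        using (var connection = new SqlConnection(cs))
        using (var command = new SqlCommand("SELECT Url, LastModified FROM Events", connection))
        {
            connection.Open();
            using (var reader = command.ExecuteReader())
            {
                while (reader.Read())
                {
                    writer.WriteStartElement("url");
                    writer.WriteElementString("loc", reader.GetString(0));
                    writer.WriteElementString("lastmod",
                        reader.GetDateTime(1).ToUniversalTime().ToString("yyyy-MM-ddTHH:mm:ssZ"));
                    writer.WriteEndElement();
                }
            }
        }

        writer.WriteEndElement();
        writer.WriteEndDocument();
        writer.Flush();
    }
}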

What are some ways to support multiple websites with a single code base?

I'm writing a pretty straightforward ASP.NET MVC web app: only a couple of CRUD pages, some folders where clients can browse documents, and just 3 or 4 roles. The website will be used in a B2B scenario, where every client will have their "own" website.
At this point, the only thing that will change in the website from client to client is the content (i.e. the documents and the rows of data they'll see). If this is the case, what's the best way to manage roles across all of my clients? I'm looking for the simplest possible solution because this is a proof of concept and I don't want to invest a lot of time right now.
What if it's not just the content that changes? Maybe some clients will want a few custom static pages. At this point, is my only option replicating the entire website? I'm leery of this because it'll become hard to maintain if I get a lot of clients.
I'd appreciate any help... I just don't want to shoot myself in the foot; I'm sure someone has done this before.
I create Virtual Directories in IIS for each client, all pointed back to the same folder where my ASP.NET code resides.
This allows me to support several dozen nearly-identical "web sites," each with their own database that is basically identical in form, only differs in data.
So, my site URLs look like:
http://mysite.com/clientacme/
http://mysite.com/clientbill/
http://mysite.com/clientcharlie/
There are two key implementation details I worked out for this:
I use the Virtual Directory folder name to determine which DSN my code reads from. This is accomplished by creating a simple static method that injects the folder name into a DSN string template (sketched after this list). If you want to use the same database to store everyone's data, you can use the folder name as a default filter in your queries.
I store the settings for each web site (headers and footers, options, links to custom reports, etc.) in a simple "settings" table in each database (key, value) rather than in the web.config (which is shared). This allows me to extend the code base over time to customize the experience for each client without forking the code.
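A sketch of the first point; the DSN template and class name below are invented, on the assumption that each client database follows a predictable naming pattern:

using System.Web;

public static class TenantDatabase
{
    // Hypothetical template; the virtual directory name is injected as the database suffix.
    private const string DsnTemplate = "Server=.;Database=App_{0};Integrated Security=SSPI;";

    public static string GetConnectionString()
    {
        // ApplicationPath is "/clientacme" for a request to http://mysite.com/clientacme/...
        string folder = HttpContext.Current.Request.ApplicationPath.Trim('/');
        return string.Format(DsnTemplate, folder);
    }
}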
For user authentication, I use Basic authentication, and I keep usernames, passwords, and roles in a table in each database.
The important thing is that if you use different SQL Server databases for each client's content, you need to script any changes to your database tables, indexes, etc. and apply them across all databases at the same time (after testing, of course). One simple way to do this is to maintain an Excel sheet with a table of database names and a big "SQL" cell at the top: beside each database name, create a formula that concatenates "USE databasename;" with the SQL code from that cell.
I'm not sure if this answers your question completely, but as far as maintaining custom "static" pages goes, I implemented a system on a client's MVC website where the client can create "Pages" from their admin control panel. Each Page has a collection of "PageContent" entities, which consist of a Title and an HTML content field (populated using a WYSIWYG editor). Upon creating a page, the MVC application maps http://yoursite.com/Page/Page-Url-Specified-By-The-User to that page and renders its content there. Obviously, the pages are dynamic, but as far as the client can tell they have created a brand new custom page with little or no effort.
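In ASP.NET MVC terms, that mapping is typically a catch-all route plus a controller that looks the page up by its URL slug; the names below are illustrative rather than the actual implementation:

using System.Web;
using System.Web.Mvc;

// Hypothetical entity matching the description above.
public class CmsPage
{
    public string Title { get; set; }
    public string HtmlContent { get; set; }
}

public class PageController : Controller
{
    // Route registered at application start (illustrative):
    //   routes.MapRoute("CmsPage", "Page/{*pageUrl}",
    //                   new { controller = "Page", action = "Show" });
    public ActionResult Show(string pageUrl)
    {
        CmsPage page = LoadPageByUrl(pageUrl);            // database lookup by slug
        if (page == null)
            throw new HttpException(404, "Page not found");
        return View(page);                                // the view renders Title and HtmlContent
    }

    private CmsPage LoadPageByUrl(string url)
    {
        // Placeholder for the real data access (Entity Framework, LINQ to SQL, etc.).
        return null;
    }
}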

Where content based websites store their content?

Sites like cnn.com or foxnews.com.
Where do they store all the articles? In HTML files? In a database?
It seems more logical to store everything in a database, but then how do you generate a static link to something that lives inside the database?
It's not that they have a dynamic page load like LoadArticle.aspx?ArticleID=123; every article has its own address.
Please explain how this is done.
They use a special content management library called VoodooLib.dll.
Seriously, when you write something to a database, you normally generate some kind of unique identifier - 123, for example. It gets permanently associated with that record (the article content) and can be used later to generate the same URL for it at any time.
As for the static link, it is a simple matter of URL rewriting.
You generate static-looking links to display on a page because they work much better for SEO. When a request for that static URL hits the server, it gets rewritten to something "server friendly" and then processed.
They probably use some form of Content Management System (CMS). There are many different ones out there - most store the actual content in a database or as XML (some store XML in a database). They will then either publish that content as static HTML pages or, more commonly now, serve it as dynamic pages that are cached. Many use what are known as "friendly URLs": virtual addresses that are mapped to the actual physical file path using URL-rewriting techniques.
Note you can't tell whether a page is dynamic or static simply from the extension. It is quite possible to have dynamic pages that end in the .html extension.
Just because the URL looks "static" doesn't mean it is; they could be using something like mod_rewrite or an ISAPI filter in IIS to make the URLs more search engine friendly.
For the high-volume news sites that you mention, however, they may very well generate the pages statically in order to prevent overloading the database with repeated requests for the same article.
Look at the URL of this page; it doesn't have xxx.aspx?some-query-string.
You are referring to friendly URLs.
One common way to do something like that is to use URL rewriting and/or a custom HttpModule.
Here's a good reference: http://weblogs.asp.net/scottgu/archive/2007/02/26/tip-trick-url-rewriting-with-asp-net.aspx
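As a rough illustration of the HttpModule option, a sketch that rewrites a friendly article URL to the kind of dynamic page the question mentions (the /articles/... pattern is made up):

using System.Text.RegularExpressions;
using System.Web;

// Rewrites friendly URLs such as /articles/123/some-title to the real dynamic page,
// e.g. /LoadArticle.aspx?ArticleID=123, without the browser ever seeing the rewrite.
public class ArticleUrlRewriteModule : IHttpModule
{
    private static readonly Regex ArticleUrl =
        new Regex(@"^/articles/(\d+)/[^/]+$", RegexOptions.IgnoreCase | RegexOptions.Compiled);

    public void Init(HttpApplication application)
    {
        application.BeginRequest += (sender, e) =>
        {
            HttpContext context = ((HttpApplication)sender).Context;
            Match match = ArticleUrl.Match(context.Request.Path);
            if (match.Success)
            {
                context.RewritePath("~/LoadArticle.aspx?ArticleID=" + match.Groups[1].Value);
            }
        };
    }

    public void Dispose() { }
}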
Just because a page has a normal URL does not mean that it isn't serving dynamic content. With the Apache mod_rewrite module, it is possible to manipulate URLs. So, for example, a page like http://www.domain.tld/permalink/12345/message-title-slug can be converted internally to http://www.domain.tld/permalink/index.php?id=12345&slug=message-title-slug.
I do not know exactly what cnn.com and foxnews.com use, but I would bet that they use a Content Management System (CMS) which serves all pages dynamically, with the content stored either in a database or on the filesystem, and with authoring/publishing all being performed through the particular CMS.
Just checking cnn.com, the article links contain:
Year
Location (US or WORLD/specificlocationid)
Month
Day
Article name.
All of this information together can be used to uniquely identify any article (even less of it is probably actually needed). The dynamic content loading page address could easily be hidden by some method of URL rewriting, and then the information in the requested URL is used to determine which article in the DB is to be served up.
I don't know why all the other answerers seem to assume that some form of URL rewriting is necessary to create friendly URLs. It's not true at all.
It's perfectly possible to write web serving code that splits a URL into parameters - e.g. year, month, title - and passes them directly to the code that gets the content from the database, without any need to rewrite the URL. Most modern web frameworks, such as Django and Rails, include this functionality out of the box.
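The same idea expressed with ASP.NET routing, since this thread is .NET-centric; the year/month/slug scheme and controller names are illustrative:

using System.Web.Mvc;
using System.Web.Routing;

public static class ArticleRoutes
{
    // Registered once at application start-up (Global.asax).
    public static void Register(RouteCollection routes)
    {
        routes.MapRoute(
            "Article",
            "{year}/{month}/{slug}",
            new { controller = "Articles", action = "Show" },
            new { year = @"\d{4}", month = @"\d{2}" });   // constraints keep unrelated URLs out
    }
}

public class ArticlesController : Controller
{
    // The URL parts arrive as action parameters; no rewriting step is involved.
    public ActionResult Show(int year, int month, string slug)
    {
        // Look the article up by (year, month, slug) in the database here.
        return View();
    }
}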
This is done through mod_rewrite techniques.
Here's an article about the mod rewriting engine: http://httpd.apache.org/docs/1.3/mod/mod_rewrite.html
And here's their "guide": http://httpd.apache.org/docs/2.0/misc/rewriteguide.html
I hope that helps. It should make for a good starting point. Good luck.
