I'm trying to figure out the best way to do caching for a website I'm building. It relies heavily on screen scraping the wikipedia website. Here is the process that I'm currently doing:
User requests a topic from wikipedia via my site (i.e. http://www.wikipedia.org/wiki/Kevin_Bacon would be http://www.wikipediamaze.com/wiki?topic?=Kevin_Bacon) NOTE: Because IIS can't handle requests that end in a '.' I'm forced to use the querystring parameter
Check to see if I've already stored the formatted html in my database and if it does then just display it to the user
Otherwise I perform a web request to wikipedia
Decompress the stream if needed.
Do a bunch of DOM manipulation to get rid of the stuff I don't need (and inject stuff I do need).
Store the html in my database for future requests
Return the html to the browser
Because it relies on screen scraping and DOM manipulation I am trying to keep things speedy so that I only have to do it once per topic instead of for every single request. Here are my questions:
Is there a better way of doing caching or additional things I can do to help performace?
I know asp.net has built in caching mechanism, but will it work in the way that I need it to? I don't want to have to retrieve the html (pretty heavy) from the database on every request, but I DO need to store the html so that every user get's the same page. I only ever want to get the data from Wikipedia 1 time.
Is there anything I can do with compression to get it to the browser quicker and if so can the browser handle uncmopressing and displaying the html? Or is this not even a consideration. The only reason I'm asking is that because some of the pages wikipedia sends me through the HttpWebRequest come through as a gzip stream.
Any and all suggestions, guidance, etc. are much appreciated.
Thanks!
You can try to enable the OutputCache for your page with VaryByParam=topic. That stores a copy of the page in memory if multiple clients request it. When the page is not in memory, the server can retrieve it from your database. The beauty of OutputCache is that you can even store a gzipped version of the HTML (use VaryByEncoding)
If it's a problem for you to decompress the stuff you get from Wikipedia, then don't send an Accept-Encoding header. That should force Wikipedia to send the page to you uncompressed.
Caching strategy: write the HTML to a static file and let users download from that file.
Compression strategy: check out Google's PageSpeed Best Practices.
Related
This is both a strategy and a technical question, I'm building a web posting mechanism and I will need to store a lot of HTML posts (discussions, comments etc.)
I'm thinking about saving all my HTML posts into database as a ZIP compressed stream (instead of plain text or XML) in order to save space and increase security by encrypting those ZIP data steams, so it will be saved to the database compressed (hopefully close to 90% smaller) and secure. (it does not need to be searchable, I'm going to create the search index myself out of the content of each post)
I want to deliver the ZIP object to the web page/cache and then have the client side unzip the stream and render the HTML that it represent.
This is a Microsoft based MVC web site (c#)
I'm trying to figure out reasons not to do it... other than performance, can anyone pinpoint any other issues with doing something like that?
Also, is there any recommended libraries or built-in ones that I should use for better performance - that both server side and client side can understand (zip and unzip with encryption key/password)?
Thanks in advance.
In normal operation, http allows to send the html in a gzipped stream. The webserver compresses the data and sets the corresponding header. The client then unzips transparently.
You simply have to make sure to set the correct header and not have the webserver zip again the already zipped stream.
I see a major drawbacks :
You cannot alter the data. That means you cannot add the code for your template nor link between the pages.
I don't think this is a good approach. Store your data as you like and decompress it on the server.
I was searching for information about one of my doubts, but I couldn't find any. I'm working in an ASP.NET site and using AJAX to require data, since I'm currently working on my own, I don't know web programming's best practices.
I usually get all the information I need from the server and use Javascript to display / Modify it and AJAX to send it back to the server. A friend of mine uses PHP for most part of the programming, He rarelly uses any javascript and he told me it's way faster this way, since it does not consume the client's resources.
The basic question actually is:
According to the best practices, is it better for the server just to provide the data needed for the
application or is better you use the server for more than this?
That is going to depend on the expected amount of traffic for the site, the amount of content being generated, and the expectations of the end-user.
In a high-traffic site, it is actually "faster" for the end-user if you let javascript generate a portion of the content on the client side. Also, you can deliver a better user experience with long load times through client side scripting than you can if the content is loaded completely on the server.
In most cases you would need at least some backend code. E.g. when validating user input or when retrieving information from a real persistent database. Or what about when somebody has javascript disabled in his user-agent or somebody with a screenreader or searchengine crawlers?
IMHO you should at least (again in most cases) have the backend code which is able to do all the work and spit out a full webpage to the client. In addition to this you can add javascript functionality to make the user interface "smoother" by for example validating user data before submitting it to the server (remember to ALWAYS also check on the serverside) or by loading partial html (AJAX).
The point about being faster or using less resources when doing it serverside doesn't make much sense. Even if it does that it doesn't matter (but again I highly doubt this statement). If you use clientside scripting to only load parts that are needed it would rather use less resources on both the client- and the serverside.
It seems that sticking to POST is the way to go because it results in clean looking URLs. GET seems to create long confusing URLs. POST is also better in terms of security. Good for protecting passwords in forms. In fact I hear that many developers only use POST for forms. I have also heard that many developers never really use GET at all.
So why and in what situation would one use GET if POST has these 2 advantages?
What benefit does GET have over POST?
you are correct, however it can be better to use gets for search pages and such. Places where you WANT the URL's to be obvious and discoverable. If you look at Google's (or any search page), it puts a www.google.com/?q=my+search at the end so people could link directly to the search.
You actually use GET much more than you think. Simply returning the web page is a GET request. There are also POST, PUT, DELETE, HEAD, OPTIONS and these are all used in RESTful programming interfaces.
GET vs. POST has no implications on security, they are both insecure unless you use HTTP/SSL.
Check the manual, I'm surprised that nobody has pointed out that GET and POST are semantically different and intended for quite different purposes.
While it may appear in a lot of cases that there is no functional difference between the 2 approaches, until you've tested every browser, proxy and server combination you won't be able to rely on that being a consistent in every case. e.g. mobile devices / proxies often cache aggressivley even where they are requested not to (but I've never come across one which incorrectly caches a POST response).
The protocol does not allow for anything other than simple, scalar datatypes as parameters in a GET - e.g. you can only send a file using POST or PUT.
There are also implementation constraints - last time I checked, the size of a URL was limited to around 2k in MSIE.
Finally, as you've noted, there's the issue of data visibility - you may not want to allow users to bookmark a URL containing their credit card number / password.
POST is the way to go because it results in clean looking URLs
That rather defeats the purpose of what a URL is all about. Read RFC 1630 - The Need For a Universal Syntax.
Sometimes you want your web application to be discoverable as in users can just about guess what a URL should be for a certain operation. It gives a nicer user experience and for this you would use GET and base your URLs on some sort of RESTful specification like http://microformats.org/wiki/rest/urls
If by 'web application' you mean 'website', as a developer you don't really have any choice. It's not you as a developer that makes the GET or POST requests, it's your user. They make the requests via their web browser.
When you request a web page by typing its URL into the address bar of the browser (or clicking a link, etc), the browser issues a GET request.
When you submit a web page using a button, you make a POST request.
In a GET request, additional data is sent in the query string. For example, the URL www.mysite.com?user=david&password=fish sends the two bits of data 'user' and 'password'.
In a POST request, the values in the form's controls (e.g. text boxes etc) are sent. This isn't visible in the address bar, but it's completely visible to anyone viewing your web traffic.
Both GET and POST are completely insecure unless SSL is used (e.g. web addresses beginning https).
I am trying to design a system for something like this with ASP.net/C#.
The users pay for downloading some content (files- mp3s/PDFs,doc etc).I should be able to track the number of bytes downloaded by the user. If the number of bytes downloaded match the number of bytes on the server, I should set a flag in DB (telling that the download was successful and prevent them from downloading the file again/asking them to pay for the download again). If the download was incomplete, they should be able to download the file again without paying for it again(since the flag will not be set).
Is there any way to keep track of the number of bytes successfully downloaded by the client ?
Also when I see a file size in my WinXP machine, I see two sizes(size,size on disk). Which one should I consider ? And will it differ from one OS to another ?
You can easily measure data passed to the client in ASP.NET assuming you replace a direct IIS-controlled download with your own, which would go something like this:
while (context.Response.IsClientConnected) {
bytesRead = ReadFileChunkAsByteArrayWIthOffsetOrWhatever(buffer, offset);
context.Response.OutputStream.Write(buffer, 0, bytesRead);
context.Response.Flush();
offset += bytesRead;
if (bytesRead != bufferSize)
break;
}
It's complicated to make this 100% reliable from within ASP, but it can be done. You pretty much have to account for every possible failure point and react accordingly.
The problem though is still - as someone mentioned above - that it's impossible to know that the client received the data. If money is involved in this transaction, that can get to be a problem really quickly.
For that reason, the best approach would be to use a custom downloader client, like the one Amazon uses for MP3 file purchases. That way you're not subjecting either yourself or your customers to the vagaries of moving monetized bits over something as unreliable as HTTP.
you can create an asp.net handler that serves the file ( for asp.net mvc u can do a result action instead ... this is what I'm using). Make sure it supports resumable downloads.
from the you can track the bytes served.
Ps. this incurs a performance overhead vs. letting IIS serve it
update 1: I used something pretty similar to this http://dotnetslackers.com/articles/aspnet/Range-Specific-Requests-in-ASP-NET.aspx ... and the article has a pretty clear explanation on what's inside it. You probably can use that one as is, see the example in that post.
You could try looking into HTTP reponse codes (i.e: 200, 404 etc) - the client and server will be exchanging http headers so that they know what's going on - you should be able to monitor these to see if the reponses was successful (not sure - but you should be able to).
With regards to file size - I would try experiments on files with 'known' sizes, compare what the Http Logs tell you with what file explorer tells you.
Also, I've seen tools/wodgets that report file upload progress - so you're right you should be able to to the same in reverse, I guess. You could try looking at file upload code examples and tutorials - you might get some hints. I can't think of any off the top of my head - sorry.
To do custom byte serving like this, you will need to implement your own http handler.
This handler should do the following:
Implement some kind of authentication on the http handler, so you know who you are dealing with.
Then you will need to implement some kind of logging for files requested and files allowed to be downloaded.
Implement etags and expires headers for client side caching.
Server side caching
Deflate, gzip compression
If you want to support resumable downloads, you will need to implement 206 partial responses. This is essential for any kind of streaming and serving pdfs.
So you should be handling the following http headers:
ETag
Expires
Accept-Ranges
Range
If-Range
Last-Modified
If-Match
If-None-Match
If-Modified-Since
If-Unmodified-Since
Unless-Modified-Since
If you are looking for a sample implementations of http handlers check out:
http://code.google.com/p/talifun-web/wiki
It has a static file handler that implements all the above http headers, client side and server side caching and even compression.
There is also a log module and an authorization module that should go a long way into how to implement authentication and logging.
The size you want is the size (not the size on disk). Size on disk includes extra space that is taken up by fitting into the 4K block size of the partition. The size is the exact number of bits in the file.
I don't believe there is a good way to tell that a download has been completed. Response.TransmitFile is probably the best method for sending the file securely. But I don't believe it has anything that will tell you if the user actually recieved the file.
I don't know about the business this is supporting, but I can't think of a legitimate business where users would tolerate a single download per purchase model, and with the abiguity of the standard HTTP request/response model does not lend itself to making an accurate client side reciever. Not to mention this model could be eaisyly hacked by sending a failed response on reciept of the last packet.
I think using somthing like download windows (2hrs after purchase) and then lock it to an IP after the first request would accomplish the same result and result in alot less user issues and support calls. Also unless the file has some sort of stringent DRM, allowing the user persisten access based on their loggin is most likely the appropriate business model, because once they get the file they can copy it as many times as they like.
Look at DVD or Blu-Ray, no amount of copy protection or access controls will save your files from pirates, so make things easy for legitimate users.
Is there a built-in asp.net way to conditionally serve pages, for example I want the following logic:
If there is a session data I generate
a page, if there is no session data I
serve the cached page.
I am only interested in knowing about a built-in asp.net mechanism for this. If it does not exist I am probably going to simply cache my page manually and decide whether to serve it or not for each request, based on the session data availability.
I don't think there is built-in support (like varyByParam) for generating fresh output for users with Session Data.
As you suggest, I would recommend manually caching the pages. I would probably determine the user's Session state in the PreRequestHandlerExecute event handler in the Global.asax and then maybe set:
HttpContext.Current.Response.Cache.SetCacheability(HttpCacheability.NoCache);
At the risk of karmabombing, I really don't like this approach to caching.
For me if a GET request is made, then a server should respond to that in good faith. Caching at a page level should be controlled by http headers because the primary goal is not to get the redundant request at all - you don't want to allocate server/bandwidth resources full stop.
Caching objects which are resources involved in making up a page I can totally get behind, but I can't see great arguments for caching a page wholesale.
Respect the headers.
You might want to look at the substitution control (Link) new in .NET 2.0, however it might not be exactly what you are after.