RSS reader - how to get only updates since a certain time?

RSS reader - how to get only updates since a certain time? - rss

How can I get only the updates-since-last-fetch from an RSS feed? The various code snippets I have seen seem to be polling the url periodically. But that's very dependent on:
How active the feed has been during the polling interval and,
The number of items the server returns as its policy.
This poses a risk of losing some items by infrequent polling, polling way too often and add bandwidth costs, or both.
The feeds I am considering also seem to update Last-modified to current time, which is not very helpful.
Is there some API element that will allow something like this?
Thanks

Unless the feed you're pinging allows you to send a parameter of when the cut off date is, you can't reliably work out when new items have been added since your last poll without downloading the feed again.
You're back to where you started with the periodic polling to see what new items exist, if any.
You can then hash the feed to check if it's different from last time and skip checking for new items if it's the same. Or you could check its cache headers, but then you'll have to assume the server is sending it back correctly and/or you're not also back to the cases where like the Last-Modified tag it's just not reliable at all.

Related

Caching issue with javascript and asp.net

I asked a question a while back on here regarding caching data for a calendar/scheduling web app, and got some good responses. However, I have now decided to change my approach and stat caching the data in javascript.
I am directly caching the HTML for each day's column in the calendar grid inside the $('body').data() object, which gives very fast page load times (almost unnoticable).
However, problems start to arise when the user requests data that is not yet in the cache. This data is created by the server using an ajax call, so it's asynchronous, and takes about 0.2s per week's data.
My current approach is simply to block for 0.5s when the user requests information from the server, and cache 4 weeks either side in the inital page load (and 1 extra week per page change request), however I doubt this is the optimal method.
Does anyone have a suggestion as to how to improve the situation?
To summarise:
Each week takes 0.2s to retrieve from the server, asynchronously.
Performance must be as close to real-time as possible. (however the data is not needed to be fully real-time: most appointments are added by the user and so we can re-cache after this)
Currently 4 weeks are cached on either side of the inial week loaded: this is not enough.
to cache 1 year takes ~ 21s, this is too slow for an initial load.

As I read your description, I thought of 2 things: Asynchrony and Caching.
First, Asynchrony
Why would you block for 0.5s? Why not use an ajax call, and in the callback, update the page with the retrieved info. There is no blocking for a set time, it is done asynchronously. You'd have to suppress multiple clicks though, while a request is outstanding, but that shouldn't be a problem at all.
You can also pre-load the in-page cache in the background, using setInterval or better, setTimeout. Especially makes sense if the cost to compute or generate the calendar is long and the data size is relatively small - in other words, small enough to store months in the in-page cache even if it is never used. Sounds like you may be doing this anyway and only need to block when the user jumps out of the range of cached data.
Intelligent Caching
I am imagining the callback function - the one that is called when the ajax call completes - will check if the currently selected date is on the "edge" of the cached data - either the first week in cache or the last week (or whatever). If the user is on the edge, then the callback can send out an additional request to optimistically pre-load the cache up to the 4 week limit, or whatever time range makes sense for your 80% use cases.
You may also consider caching the generated calendar data on the server side, on a per-user basis. If it is CPU- and time-intensive to generate these things, then it should be a good trade to generate once and keep it in the server-side cache, invalidating only when the user makes an update. With x64 servers and cheap memory, this is probably very feasible. Depending on the use cases, it may make for a much more usable interaction, the 2nd time a user connects to the app. You could even consider pre-loading the server-side cache on a per-user basis, before the user requests any calendar.

Should I filter a Html index with Javascript or do it Server side?

I have an index of titles that I am currently filtering by user entered keywords on the server before sending to HTML. I am wondering if it would be better to send the entire index to the page and have javascript show or hide the items in the list based on the user input.
I am concerned that the server side is going to get too many requests as users use different keyword combinations. Even if I cache the index on the server, wont the javascript solution be a better one?
EDIT: assuming a list of thousand titles or more.

You have to take into account two facts:
(a) not all users have Javascript or Javascript on
This means that it is possible that client-side filtering may not be working.
(b) some visitors could be limited in resources (slow CPU, older [slower] Javascript implementation)
By sending the entire set for processing on the client (especially if your data set is large enough) my guess is that your visitors will start complaining.
Finally, what you gain in server processing time you will be losing in bandwidth.

GET vs. POST does it really really matter?

Ok, I know the difference in purpose. GET is to get some data. Make a request and get data back. POST should be used for CRUD operations other than read I believe. But when it comes down to it, does the server really care if it's receiving a GET vs. POST in the end?

According to the HTTP RFC, GET should not have any side-effects, while POST may have side-effects.
The most basic example of this is that GET is not appropriate for anything like a purchase-transaction or posting an article to a blog, while POST is appropriate for actions-that-have-consequences.
By the RFC, you can hold a user responsible for actions done by POST (such as a purchase), but not for GET actions. 'Bots always use GET for this reason.
From the RFC 2616, 9.1.1:
9.1.1 Safe Methods
Implementors should be aware that the
software represents the user in
their interactions over the Internet,
and should be careful to allow the
user to be aware of any actions they
might take which may have an
unexpected significance to themselves
or others.
In particular, the convention has
been established that the GET and
HEAD methods SHOULD NOT have the
significance of taking an action
other than retrieval. These methods
ought to be considered "safe". This
allows user agents to represent other
methods, such as POST, PUT and
DELETE, in a special way, so that the
user is made aware of the fact that
a possibly unsafe action is being
requested.
Naturally, it is not possible to
ensure that the server does not
generate side-effects as a result of
performing a GET request; in fact,
some dynamic resources consider that a
feature. The important distinction
here is that the user did not request
the side-effects, so therefore
cannot be held accountable for them.

It does if a search engine is crawling the page, since they will be making GET requests but not POST. Say you have a link on your page:
http://www.example.com/items.aspx?id=5&mode=delete
Without some sort of authorization check performed before the delete, it's possible that Googlebot could come in and delete items from your page.

Since you're the one writing the server software (presumably), then it cares if you tell it to care. If you handle POST and GET data identically, then no, it doesn't.
However, the browser definitely cares. Refreshing or clicking back to a page you got as a response to a POST pops up the little "Are you sure you want to submit data again" prompt, for example.

GET has data limit restrictions based on the sending browser:
The spec for URL length does not dictate a minimum or maximum URL length, but implementation varies by browser. On Windows: Opera supports ~4050 characters, IE 4.0+ supports exactly 2083 characters, Netscape 3 -> 4.78 support up to 8192 characters before causing errors on shut-down, and Netscape 6 supports ~2000 before causing errors on start-up

If you use a GET request to alter back-end state, you run the risk of bad things happening if a webcrawler of some kind traverses your site. Back when wikis first became popular, there were horror stories of whole sites being deleted because the "delete page" function was implemented as a GET request, with disastrous results when the Googlebot came knocking...

"Use GET if: The interaction is more like a question (i.e., it is a safe operation such as a query, read operation, or lookup)."
"Use POST if: The interaction is more like an order, or the interaction changes the state of the resource in a way that the user would perceive (e.g., a subscription to a service), or the user be held accountable for the results of the interaction."
source

You be aware of a few subtle security differences. See my question
GET versus POST in terms of security?
Essentially the important thing to remember is that GET will go into the browser history and will be transmitted through proxies in plain text, so you don't want any sensitive information, like a password in a GET.
Obvious maybe, but worth mentioning.

By HTTP specifications, GET is safe and idempotent and POST is neither. What this means is that a GET request can be repeated multiple times without causing side effects.
Even if your server doesn't care (and this is unlikely), there may be intermediate agents between your client and the server, all of whom have this expectation. For example proxies to cache data at your ISP or other providers for improved performance. THe same expectation is true for accelerators, for example, a prefetching plugin for your browser.
Thus a GET request can be cached (based on certain parameters), and if it fails, it can be automatically repeated without any expecation of harmful effects. So, really your server should strive to fulfill this contract.
On the other hand, POST is not safe, not idempotent and every agent knows not to cache the results of a POST request, or retry a POST request automatically. So, for example, a credit card transaction would never, ever be a GET request (you don't want accounts being debited multiple times because of network errors, etc).
That's a very basic take on this. For more information, you might consider the "RESTful Web Services" book by Ruby and Richardson (O'Reilly press).
For a quick take on the topic of REST, consider this post:
http://www.25hoursaday.com/weblog/2008/08/17/ExplainingRESTToDamienKatz.aspx
The funny thing is that most people debate the merits of PUT v POST. The GET v POST issue is, and always has been, very well settled. Ignore it at your own peril.

GET has limitations on the browser side. For instance, some browsers limit the length of GET requests.

I think a more appropriate answer, is you can pretty much do the same things with both. It is not so much a matter of preference, however, but a matter of correct usage. I would recommend you use you GETs and POSTs how they were intended to be used.

Technically, no. All GET does is post the stuff in the first line of the HTTP request, and POST posts stuff in the body.
However, how the "web infrastructure" treats the differences makes a world of difference. We could write a whole book about it. However, I'll give you some "best practises":
Use "POST" for when your HTTP request would change something "concrete" inside the web server. Ie, you're editing a page, making a new record, and so on. POSTS are less likely to be cached, or treated as something that's "repeatable without side-effects"
Use "GET" for when you want to "look at an object". Now, such a look might change something "behind the scenes" in terms of caching or record keeping, but it shouldn't change anything "substantial". Ie, I could repeat my GET over and over and nothing bad would happen, except for inflated hit counts. GETs should be easily bookmarkable, so a user can go back to that same object later on.
The parameters to the GET (the stuff after the ?, traditionally) should be considered "attributes to the view" or "what to view" and so on. Again, it shouldn't actually change anything: use POST for that.
And, a final word, when you POST something (for example, you're creating a new comment), have the processing for the post issue a 302 to "redirect" the user to a new URL that views that object. Ie, a POST processes the information, then redirects the browser to a GET statement to view the new state. Displaying information as a result of a POST can also cause problems. Doing the redirection is often used, and makes things work better.

Should the user be able to bookmark the resulting page? Another thing to think about is some browsers/servers incorrectly limit the GET URI length.
Edit: corrected char length restriction note - thanks ars!

It depends on the software at the server end. Some libraries, like CGI.pm in perl handles both by default. But there are situations where you more or less have to use POST instead of GET, at least for pushing data to the server. Large amounts of data (where the corresponding GET url would become too long), binary data (to avoid lots of encoding/decoding trouble), multipart files, non-parsed headers (for continuous updates pre-AJAX style...) and similar.

The server technically couldn't care one way or the other about what kind of request it receives. It will blindly execute any request coming across the wire.
Which is the problem. If you have an action that destroys or modifies data in a GET action, Google will tear your site up as it crawls through indexing.

The server usually doesn't care. But it's mostly for following good practices, as you mentioned. The client side also matter - as mentioned you cannot bookmark a POST'd page usually, and some browsers have limits on the length of the URL for really long GET queries.

Since GET is intended for specifying resource you wanna get, depending on exact software on the server side, the web server (or the load balancer in front of it) may have a size limit on GET requests to prevent Denial Of Service attacks...

Be aware that browsers may cache GET requests but will generally not cache POST requests.

Yes, it does matter. GET and POST are quite different, really.
You are right in that normally, GET is for "getting" data from the server and displaying a page, while POST is for "posting" data back to the server. Internally, your scripts get the same data whether it's GET or POST, so no, the server doesn't really care.
The main difference is GET parameters are specified in URLs, while POST is not. This is why POST is used for signup and login forms - you don't want your password in a URL. Similarly, if you're viewing different pages or displaying a specific view of some data, you normally want a unique URL.

It really does matter. I have gathered like 11 things you should know abut them.
11 things you should know about GET vs POST

No, they shouldn't except for #jbruce2112 answer and uploading files require POST.

Ajax data update. Extjs

I need to keep certain data ( in a grid) up to date
and was gonna do a poll to the server every 15 seocnds or so to get the data and refresh the grid, however it feels a bit dirty ( the grid will have the loading icon every 15 sec..) doesnt look great...
Another option is to check if there is new data and compare the new data with the current data and only refresh the grid if there is any changes ( I would have to do this client side tho because maintaing the current state of every logged in user also seems like an overkill)
I m sure there are better solutions and would love to hear about them
I heard about COMET, but tit seems to be a bit of an overkill
BTW i m using asp.net MVC on the server side
I d like to hear what people have to say for or against continuos polling with js
Cheers

Sounds like COMET is indeed the solution you're looking for. In that scenario, you don't need to poll, nor do comparisons, as you can push out only the "relevant" changed data to your grid.
Check out WebSync, it's a nice comet server for .NET that'll let you do exactly what you've described.
Here's a demo using ExtJS and ASP.NET that pushes a continuous stream of stock ticker updates. The demo is a little more than you need, but the principal is identical.

Every time you get the answer from the server, check if something has changed.
Do a request. Do let the user know that you are working with some spinner, don't hide it. Schedule the next request in 15 seconds. The next request executes; if nothing has changed, schedule the next one in 15 + 5 seconds. The next request executes; if nothing has changed, schedule the next on in 15 +5 +5 seconds. And so on. The next request executes; if something has indeed changed, reset the interval to 15 seconds.
Prototype can do this semi-automatically with Ajax.PeriodicalUpdater but you probably need stuff that is more customized to your needs.
Anyway, just an idea.
As for continuous polling in general; it's bad only if you hit a different site (using a PHP "bridge" or something like that). If you're using your own resources you just have to make sure you don't deplete them. Set decent intervals with a decay.

I suggest Comet is not an overkill if "updates need to be constant." 15 seconds is very frequent; is your visited by many? Your server may be consumed serving these requests while starving others.

I don't know what your server-side data source looks like, or what kind of data you're serving, but one solution is to server your data with a timestamp, and send a timestamp of the last poll with every subsequent request.
Poll the server, sending the timestamp of when the service was last polled (eg: lastPollTime).
The server uses the timestamp to determine what data is new/updated and returns only that data (the delta), decreasing your transmission size and simplifying your client-side code.
It may be empty, it may be a few cells, it may be the entire grid, but the client always updates data that is returned to it because it is already known to be new.
The benefits of this method are that it simplifies your client side code (which is less code for the client to load), and decreases your transmission size for subsequent polls that have no new data for the user.
Also, this allows you to maintain state on the server side because you don't have to save a state for each individual user. You just have one state, the state of the current data, that is differentiated by access time.

I think checking if there is any new data is a good option.
I would count the number of rows in the database and compare that with the number of rows in your (HTML) table. If they're not the same, get the difference in rows.
Say you got 12 table rows and there are 14 database rows as you check: Get the latest (14 - 12) = 2 rows.

How to build large/busy RSS feed

I've been playing with RSS feeds this week, and for my next trick I want to build one for our internal application log. We have a centralized database table that our myriad batch and intranet apps use for posting log messages. I want to create an RSS feed off of this table, but I'm not sure how to handle the volume- there could be hundreds of entries per day even on a normal day. An exceptional make-you-want-to-quit kind of day might see a few thousand. Any thoughts?

I would make the feed a static file (you can easily serve thousands of these), regenerated periodically. Then you have a much broader choice, because it doesn't have to run below second, it can run even minutes. And users still get perfect download speed and reasonable update speed.

If you are building a system with notifications that must not be missed, then a pub-sub mechanism (using XMPP, one of the other protocols supported by ApacheMQ, or something similar) will be more suitable that a syndication mechanism. You need some measure of coupling between the system that is generating the notifications and ones that are consuming them, to ensure that consumers don't miss notifications.
(You can do this using RSS or Atom as a transport format, but it's probably not a common use case; you'd need to vary the notifications shown based on the consumer and which notifications it has previously seen.)

I'd split up the feeds as much as possible and let users recombine them as desired. If I were doing it I'd probably think about using Django and the syndication framework.
Django's models could probably handle representing the data structure of the tables you care about.
You could have a URL that catches everything, like: r'/rss/(?(\w*?)/)+' (I think that might work, but I can't test it now so it might not be perfect).
That way you could use URLs like (edited to cancel the auto-linking of example URLs):
http:// feedserver/rss/batch-file-output/
http:// feedserver/rss/support-tickets/
http:// feedserver/rss/batch-file-output/support-tickets/ (both of the first two combined into one)
Then in the view:
def get_batch_file_messages():
# Grab all the recent batch files messages here.
# Maybe cache the result and only regenerate every so often.
# Other feed functions here.
feed_mapping = { 'batch-file-output': get_batch_file_messages, }
def rss(request, *args):
items_to_display = []
for feed in args:
items_to_display += feed_mapping[feed]()
# Processing/returning the feed.
Having individual, chainable feeds means that users can subscribe to one feed at a time, or merge the ones they care about into one larger feed. Whatever's easier for them to read, they can do.

Without knowing your application, I can't offer specific advice.
That said, it's common in these sorts of systems to have a level of severity. You could have a query string parameter that you tack on to the end of the URL that specifies the severity. If set to "DEBUG" you would see every event, no matter how trivial. If you set it to "FATAL" you'd only see the events that that were "System Failure" in magnitude.
If there are still too many events, you may want to sub-divide your events in to some sort of category system. Again, I would have this as a query string parameter.
You can then have multiple RSS feeds for the various categories and severities. This should allow you to tune the level of alerts you get an acceptable level.

In this case, it's more of a manager's dashboard: how much work was put into support today, is there anything pressing in the log right now, and for when we first arrive in the morning as a measure of what went wrong with batch jobs overnight.

Okay, I decided how I'm gonna handle this. I'm using the timestamp field for each column and grouping by day. It takes a little bit of SQL-fu to make it happen since of course there's a full timestamp there and I need to be semi-intelligent about how I pick the log message to show from within the group, but it's not too bad. Further, I'm building it to let you select which application to monitor, and then showing every message (max 50) from a specific day.
That gets me down to something reasonable.
I'm still hoping for a good answer to the more generic question: "How do you syndicate many important messages, where missing a message could be a problem?"