Reading about RSS leads to many false-informations. I am not quite sure how RSS works. So I have some questions and I hope you dont answer using links-only. There is always another link that claims your link is wrong.
Questions:
If I subscribe to a RSS-Feed the first time, are the feeds from the last 30 years downloaded as a bulk-response may have Gigabytes of data?
Are following requests to a already subscribed RSS-Feed updates to the previous subscription? If yes, how does the server know what messages are already transported to the "client"?
How often are RSS-Feeds downloaded?
Kind regards
You get whatever is currently in the feed. How many entries and how far back that goes is up to the publisher.
No. Each request gets whatever is in the feed at the time.
As often as the client wants to download them. (The format includes options to recommend a frequency but clients may ignore it).
Related
I'm sorry if the title is too generic, but I've been browsing the Internet for one hour and I couldn't find any architectural explanation. I'm totally new both to RSS and Atom protocols, as far as I have understood until now is:
A server publishes documents
Clients subscribe to this server
Clients are notified when the server publishes new documents
Clients consume the documents
It seems like a queueing mechanism (like JMS). What is not clear to me is:
"Clients are notified" is just another way of saying "clients must poll the server to check if there are new messages"?
How does a client know that a message has already been read and that is no longer 'new'? Is this check in charge to the client or to the server?
Can anyone address me to some documentation about that? I've been googling for a while but every search sends me to sites that explain how to use libraries for parsing etc....
Thanx
I think these answer your questions:
How large RSS reader works (netvibes, Google reader...)
How RSS and ATOM inform client about updates? long polling or polling or something else?
RSS 2.0 Specification
https://en.wikipedia.org/wiki/PubSubHubbub
How does a client know that a message has already been read and that
is no longer 'new'?
I think that is specific to the implementation, but for example you could save guids of each fetched <item> and then flag them read as the user reads the items.
I think Janih's answer below is good and you should check all these links.
For more specific details to you questions:
Clients are notified" is just another way of saying "clients must poll
the server to check if there are new messages?
Yes... and no. Yes, polling is the default and yes it's cumbersome. Protocols like PubSubHubbub will help. RSS Feed API services like Superfeedr (which I built!) will do it on your behalf and send you notifications using a webhooks (so you don't have to poll at all!)
I'm writing a little application for my own use which will consume a publicly published RSS feed.
As far as I can tell, there's no subscribe/post mechanism in the protocol; I need to have my application HTTP-GET the RSS feed periodically.
If that's the case, I'd like to grab it every ten minutes or so, but I'm worried about being seen as an abuser. I'd certainly be concerned if I saw someone poking my server every ten minutes for weeks on end.
Is this a valid concern? Is there any general advice on what a "reasonable" refresh rate is? Do I even have my facts straight?
Since RSS is built on the HTTP protocol, in general, most sites should respect the If-Modified-Since HTTP header. This is fairly lightweight and most servers should be able to return this information quickly.
So for the client-side, you'll need to keep track of the last time you've sent the request and pass it to the server. If the server returns a 304 code, then you'll know that nothing has changed. But even more importantly, the server doesn't need to return the feed info, saving bytes of traffic. If the server returns a 200, then you'll need to process the results and save the response date.
Ultimately, the answer to this question depends on what type of information is at the other end of the RSS feed. If it is a blog, then probably once every 4-8 hours is sufficient. But if RSS feed is a feed of stock quote (not likely, just an example), then every 10 minutes is not sufficient.
Really simple question here:
For a PHP-driven RSS feed, am I just overwriting the same XML file every time I "publish" a new feed thing? and the syndicates it's registered with will pop in from time to time to check that it's new?
Yes. An RSS reader has the URL of the feed and regularly requests the same URL to check for new content.
that's how it works, a simple single xml rss file that gets polled for changes by rss readers
for scalability there there is FeedTree: collaborative RSS and Atom delivery but unlike another well known network program (bittorrent) it hasn't had as much support in readers by default
Essentially, yes. It isn't necessarily a "file" actually stored on disk, but your RSS (or Atom) is just changed to contain the latest items/entries and resides at a particular fixed URL. Clients will fetch it periodically. There are also technologies like PubSubHubbub and pinging for causing updates to get syndicated closer to real-time.
Yes... BUT! There are ways to make the susbcribers life better and also improve your bandwidth :) Implement the PubSubHubbub protocol. It will help any application that wants the content of the feed to be notified as soon as it's available. It'es relatively simple to implement on the publisher side as it only involves a ping.
I have a DB with user accounts information.
I've scheduled a CRON job which updates the DB with every new user data it fetches from their accounts.
I was thinking that this may cause a problem since all requests are coming from the same IP address and the server may block requests from that IP address.
Is this the case?
If so, how do I avoid being banned? should I be using a proxy?
Thanks
You get banned for suspicious (or malicious) activity.
If you are running a normal business application inside a normal company intranet you are unlikely to get banned.
Since you have access to user accounts information, you already have a lot of access to the system. The best thing to do is to ask your systems administrator, since he/she defines what constitutes suspicious/malicious activity. The systems administrator might also want to help you ensure that your database is at least as secure as the original information.
should I be using a proxy?
A proxy might disguise what you are doing - but you are still doing it. So this isn't the most ethical way of solving the problem.
Is the cron job that fetches data from this "database" on the same server? Are you fetching data for a user from a remote server using screen scraping or something?
If this is the case, you may want to set up a few different cron jobs and do it in batches. That way you reduce the amount of load on the remote server and lower the chance of wherever you are getting this data from, blocking your access.
Edit
Okay, so if you have not got permission to do scraping, obviously you are going to want to do it responsibly (no matter the site). Try gather as much data as you can from as little requests as possible, and spread them out over the course of the whole day, or even during times that a likely to be low load. I wouldn't try and use a proxy, that wouldn't really help the remote server, but it would be a pain in the ass to you.
I'm no iPhone programmer, and this might not be possible, but you could try have the individual iPhones grab the data so all the source traffic isn't from the same IP. Just an idea, otherwise just try to be a bit discrete.
Here are some tips from Jeff regarding the scraping of Stack Overflow, but I'd imagine that the rules are similar for any site.
Use GZIP requests. This is important! For example, one scraper used 120 megabytes of bandwidth in only 3,310 hits which is substantial. With basic gzip support (baked into HTTP since the 90s, and universally supported) it would have been 20 megabytes or less.
Identify yourself. Add something useful to the user-agent (ideally, a link to an URL, or something informational) so we can see your bot as something other than "generic unknown anonymous scraper."
Use the right formats. Don't scrape HTML when there is a JSON or RSS feed you could use instead. Heck, why scrape at all when you can download our cc-wiki data dump??
Be considerate. Pulling data more than every 15 minutes is questionable. If you need something more timely than that ... why not ask permission first, and make your case as to why this is a benefit to the SO community and should be allowed? Our email is linked at the bottom of every single page on every SO family site. We don't bite... hard.
Yes, you want an API. We get it. Don't rage against the machine by doing naughty things until we build it. It's in the queue.
Just learning about this via youtube but could not find answer to my question of how reader knows there is an update.
Is it like a Push in blackberry?
RSS is a file format source and doesn't actually know anything about where it gets the entries from. The answer really is: "how can an http request get only the newest results from a server" and the answer is Conditional GET source. Http also supports Conditional PUT.
This is an article about using this feature of http to specifically support rss hackers.
RSS is a pull technology. The reader re-fetches the RSS feed now and then (for example two times per hour, or more often if the reader learns that it's an often updated feed).
The feed is served through regular HTTP and consists of a simple XML file. It is always fetched from the same URL.
It just check the feed for update regularly.
Recently there is a new protocol called pubsubhubbub to make feed push to the listener. But it requires the publishers support it.
Here is a list of web services support real-time RSS pushing, including Google Reader, Blogger, FeedBurner, FriendFeed, MySpace, etc.
Let's summarize :
Usually, a client knows that an RSS feed has been updated through polling, that is regular pull (HTTP GET request on the feed URL)
Push doesn't exist on the web, at least, not with HTTP until HTML5 websocket is fixed.
However, some blog frameworks like Wordpress, Google and others, now support the pubsubhubbub convention. In this mode, you would "subscribe" to the updates of an RSS flow. The "hub" will call an URL on YOUR site (callback URL) to send you updates : that is a push.
Push or pull, in both cases you still need to write some piece of code to update the RSS list on your site, database or wherever you store/display it.
And, as a side question, it is not necessary to request the whole XML at every pull to see if the content has changed : using a standard that is not linked to RSS, but global to the whole HTTP protocol (etag and last-modified headers), you can know if the RSS page was modified after a given date, and grab the whole XML only if modified.
It's a pull. That's why you have to configure your reader how often it should refresh the feed.