I'm writing a little application for my own use which will consume a publicly published RSS feed.
As far as I can tell, there's no subscribe/post mechanism in the protocol; I need to have my application HTTP-GET the RSS feed periodically.
If that's the case, I'd like to grab it every ten minutes or so, but I'm worried about being seen as an abuser. I'd certainly be concerned if I saw someone poking my server every ten minutes for weeks on end.
Is this a valid concern? Is there any general advice on what a "reasonable" refresh rate is? Do I even have my facts straight?
Since RSS is delivered over HTTP, most sites should respect the If-Modified-Since request header. Supporting it is fairly lightweight, and most servers can answer it quickly.
On the client side, keep the Last-Modified (or Date) value from your last successful fetch and send it with each request. If the server returns a 304, you know nothing has changed; even more importantly, the server doesn't need to return the feed body at all, saving bandwidth. If it returns a 200, process the results and save the new response date.
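A minimal client-side sketch of that flow in Python, assuming the requests library; the feed URL and the poll helper are illustrative, not part of any spec:

    import requests

    FEED_URL = "https://example.com/feed.xml"    # placeholder feed URL

    def poll(last_modified=None):
        # Send If-Modified-Since when we have a timestamp from a previous fetch.
        headers = {"If-Modified-Since": last_modified} if last_modified else {}
        resp = requests.get(FEED_URL, headers=headers, timeout=10)
        if resp.status_code == 304:
            return last_modified, None           # unchanged; no body transferred
        resp.raise_for_status()
        # Keep the server's own timestamp for the next request.
        stamp = resp.headers.get("Last-Modified") or resp.headers.get("Date")
        return stamp, resp.text                  # caller parses the feed XML

Persist the returned timestamp between runs so every poll after the first can be answered with a cheap 304.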
Ultimately, the answer to this question depends on what kind of information is at the other end of the RSS feed. If it is a blog, then polling once every 4-8 hours is probably sufficient. But if the feed carries stock quotes (not likely, just an example), then even every 10 minutes may be too slow.
Related
Reading about RSS turns up a lot of misinformation, and I am not quite sure how RSS works. So I have some questions, and I hope you won't answer with links only; there is always another link that claims your link is wrong.
Questions:
1. If I subscribe to an RSS feed for the first time, are all entries from the last 30 years downloaded as one bulk response, potentially gigabytes of data?
2. Are subsequent requests to an already-subscribed feed incremental updates to the previous response? If so, how does the server know which entries have already been delivered to the "client"?
3. How often are RSS feeds downloaded?
Kind regards
1. You get whatever is currently in the feed. How many entries that includes, and how far back they go, is up to the publisher.
2. No. Each request simply gets whatever is in the feed at the time.
3. As often as the client wants to download them. (The format includes options, such as RSS 2.0's <ttl> element, to recommend a frequency, but clients may ignore them.)
I'm currently polling the YouTube RSS feed servers about 3 times per second (multiple channels, each channel's feed fetched every 3 seconds). Is this too much, and will polling this often actually make updates slower because the servers start serving me cached copies? I'm trying to pick up new videos as quickly as possible, but I can't find any guidelines on this sort of thing.
I used to fetch them through the /v3/search/ endpoint at the same rate, but the responses on my server were always late compared to what I got on my local machine (where I didn't poll this often).
Also, are there any good alternatives? (I tried push notifications, but they were highly unreliable.)
You should be using caching in your app to reduce bandwidth and latency. When caching, store the ETag so that you can send it (in the If-None-Match header) when requesting the resource again. If the resource has not changed, you will get a 304 (Not Modified) response, which means you can use your cached version. Otherwise, you will get the updated version of the resource.
More info:
https://developers.google.com/youtube/v3/getting-started#etags
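As a plain-HTTP illustration of that pattern in Python with the requests library (the function name and cache dict are placeholders, not the YouTube client):

    import requests

    def fetch_with_etag(url, cache):
        # cache is a dict like {"etag": ..., "body": ...} persisted between polls.
        headers = {"If-None-Match": cache["etag"]} if cache.get("etag") else {}
        resp = requests.get(url, headers=headers, timeout=10)
        if resp.status_code == 304:              # NOT MODIFIED: reuse the cache
            return cache["body"]
        resp.raise_for_status()
        cache["etag"] = resp.headers.get("ETag")
        cache["body"] = resp.text
        return resp.text

A 304 carries no body, so even fairly frequent polling stays cheap when nothing has changed.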
I crawl some data from the web because there is no API. Unfortunately, it's quite a lot of data from several different sites, and I quickly learned that I can't just make thousands of requests to the same site in a short while... I want to fetch the data as fast as possible, but I don't want to cause a DoS attack :)
The problem is that every server has different capabilities, and I don't know them in advance. The sites belong to my clients, so my intention is to prevent any possible downtime caused by my script. So no policy like "I'll try a million requests first, and if that fails, I'll try half a million, and if that fails..." :)
Is there any best practice for this? How does Google's crawler know how many requests it can make to the same site in a given period? Maybe they "shuffle their playlist", so there aren't as many concurrent requests to a single site. Could I detect this stuff somehow via HTTP? For example: send a single request, measure the response time, roughly estimate how loaded the server is, and then work out a maximum number of concurrent requests from that?
I use a Python script, but this doesn't matter much for the answer; it's just to let you know which language I'd prefer for any code snippets.
The Google spider is pretty damn smart. On my small site it hits me at one page per minute, to the second. They obviously have a page queue that is filled with timing and sites in mind. I also wonder whether they are smart enough not to hit multiple domains on the same server, so there's probably some recognition of IP ranges as well as URLs.
Separating the job of queueing up URLs to be spidered at specific times from the actual spidering is a good architecture for any spider. All of your spiders could call a urlToSpiderService.getNextUrl() method, which would block (if necessary) until the next URL is due to be spidered.
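A rough single-process sketch of that design in Python, where a heap keyed by due time plays the role of the hypothetical urlToSpiderService (all names here are illustrative):

    import heapq
    import threading
    import time

    class UrlToSpiderService:
        # Hands out URLs only once their scheduled fetch time has arrived.
        def __init__(self):
            self._due = []                       # heap of (due_time, url)
            self._lock = threading.Lock()

        def schedule(self, url, delay_seconds):
            with self._lock:
                heapq.heappush(self._due, (time.time() + delay_seconds, url))

        def get_next_url(self):
            # Block until the earliest URL is due, then hand it to a spider.
            while True:
                with self._lock:
                    if self._due and self._due[0][0] <= time.time():
                        return heapq.heappop(self._due)[1]
                time.sleep(0.1)                  # crude wait; a Condition would be cleaner

Several spider threads can call get_next_url() concurrently, which keeps all fetch scheduling in one place no matter how many workers you run.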
I believe Google looks at the number of pages on a site to determine the spider speed: the more pages you have to refresh in a given period, the faster they need to hit that particular server. You should certainly be able to use that as a metric, although before you've done an initial crawl it would be hard to determine.
You could start out at one page per minute and then, as the number of pages to be spidered for a particular site increases, decrease the delay. Some sort of function like the following would be needed (sketched here in Java; the queue and configuration fields are assumed):
    public Duration delayBetweenPages(String domain) {
        // Spread the domain's pending pages evenly over the desired refresh window.
        // (toDoQueue, refreshWindow and minimumDelay are assumed fields.)
        long pending = Math.max(1, toDoQueue.countFor(domain));
        Duration delay = refreshWindow.dividedBy(pending);
        if (delay.compareTo(Duration.ofMinutes(1)) > 0) return Duration.ofMinutes(1);
        if (delay.compareTo(minimumDelay) < 0) return minimumDelay;
        return delay;
    }
Could I detect this stuff somehow via HTTP?
With the modern internet, I don't see how you can. Certainly, if the server takes a couple of seconds to respond or returns 500 errors, you should throttle way back. But a typical connection and download is sub-second these days for a large percentage of servers, and I'm not sure there is much to be learned from statistics in that area.
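Still, the two signals just mentioned (multi-second responses and 500-class errors) can drive a simple reactive throttle. A hedged Python sketch; the thresholds and back-off factors are guesses to tune, not established values:

    import time
    import requests

    def crawl(urls, min_delay=1.0, max_delay=60.0):
        # Fetch URLs one at a time, adapting the delay to server behaviour.
        delay = min_delay
        for url in urls:
            start = time.time()
            try:
                resp = requests.get(url, timeout=10)
                trouble = resp.status_code >= 500 or (time.time() - start) > 2.0
                if not trouble:
                    yield url, resp                  # hand the page to the caller
            except requests.RequestException:
                trouble = True                       # timeouts count as trouble too
            if trouble:
                delay = min(delay * 2, max_delay)    # throttle way back
            else:
                delay = max(delay * 0.9, min_delay)  # ease forward gently
            time.sleep(delay)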
Quoting from the CouchDB documentation:
It is recommended that you avoid POST when possible, because proxies and other network intermediaries will occasionally resend POST requests, which can result in duplicate document creation.
To my understanding, this should not be happening on the protocol level (a confused user armed with a doubleclick is a completely different story). What is the best course of action, then?
Should we really try to avoid POST requests and replace them by PUT? I don't like that as they convey a different meaning.
Should we anticipate this and protect the requests by unique IDs where we want to avoid accidental duplication? I don't like that either: it complicates the code and prevents situations where multiple identical posts may be desired.
"Should we really try to avoid POST requests and replace them by PUT? I don't like that as they convey a different meaning."
For document creation (as mentioned in the doc you cite), that is exactly the correct meaning. For document modification, occasional resent requests are not a problem.
We should avoid POST when the request is idempotent, that is, when performing it several times must leave the server in the same state as performing it once. For document creation this means the document should only be created once, which a PUT to a client-chosen URL guarantees.
POST is suitable when the request is non-idempotent, meaning that every time it is sent, the server state changes. For an idempotent update or delete, an occasional resent request does no harm.
Of course, thanks to web-browser support, only GET and POST have really been adopted, because POST can do everything PUT does (it just isn't specified to be idempotent).
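To make the creation case concrete: CouchDB itself supports idempotent creation with PUT /db/docid, which returns 201 Created, or 409 Conflict if that ID already exists. A Python sketch (the database URL is a placeholder):

    import uuid
    import requests

    DB = "http://localhost:5984/mydb"          # placeholder CouchDB database URL

    def create_once(doc):
        # The client fixes the document ID before sending, so a proxy that
        # resends the PUT can only collide with the original (409), never
        # create a duplicate document.
        doc_id = uuid.uuid4().hex
        resp = requests.put(f"{DB}/{doc_id}", json=doc, timeout=10)
        if resp.status_code in (201, 202):     # created (202 in clustered mode)
            return doc_id
        if resp.status_code == 409:
            return doc_id                      # an earlier attempt already landed
        resp.raise_for_status()

Because the resource URL is fixed before the first attempt, a network-level resend can only collide with the original request, never create a second document.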
I must have been extremely lucky in my WebMaster days to never have seen duplicate POST requests.
TCP/IP sends numbered packets, so I wouldn't know why any web server wouldn't discard the duplicates.
Is this duplicate POST thing really an issue?
I've been thinking about batch reads and writes in a RESTful environment, and I think I've realized that I have broader questions about HTTP caching. (Below I use commas (",") to delimit multiple record IDs, but that detail is not central to the discussion.)
I started with this problem:
1. Single GET invalidated by batch update
GET /farms/123 # get info about Old MacDonald's Farm
PUT /farms/123,234,345 # update info on Old MacDonald's Farm and some others
GET /farms/123
How does a caching server in between the client and the Farms server know to invalidate its cache of /farms/123 when it sees the PUT?
Then I realized this was also a problem:
2. Batch GET invalidated by single (or batch) update
GET /farms/123,234,345 # get info about a few farms
PUT /farms/123 # update Old MacDonald's Farm
GET /farms/123,234,345
How does the cache know to invalidate the multiple-farm GET when it sees the PUT go by?
So I figured that the problem was really just with batch operations. Then I realized that any relationship could cause a similar problem. Let's say a farm has zero or one owners, and an owner can have zero or one farms.
3. Single GET invalidated by update to a related record
GET /farms/123 # get info about Old MacDonald's Farm
PUT /farmers/987 # Old MacDonald sells his farm and buys another one
GET /farms/123
How does the cache know to invalidate the single GET when it sees the PUT go by?
Even if you change the models to be more RESTful, using relationship models, you get the same problem:
GET /farms/123 # get info about Old MacDonald's Farm
DELETE /farm_ownerships/456 # Old MacDonald sells his farm...
POST /farm_ownerships # and buys another one
GET /farms/123
In both versions of #3, the first GET should return something like (in JSON):
farm: {
  id: 123,
  name: "Shady Acres",
  size: "60 acres",
  farmer_id: 987
}
And the second GET should return something like:
farm: {
  id: 123,
  name: "Shady Acres",
  size: "60 acres",
  farmer_id: null
}
But it can't! Not even if you use ETags appropriately. You can't expect the caching server to inspect the contents for ETags -- the contents could be encrypted. And you can't expect the server to notify the caches that records should be invalidated -- caches don't register themselves with servers.
So are there headers I'm missing? Things that indicate a cache should do a HEAD before any GETs for certain resources? I suppose I could live with double-requests for every resource if I can tell the caches which resources are likely to be updated frequently.
And what about the problem of one cache receiving the PUT and knowing to invalidate its cache and another not seeing it?
Cache servers are supposed to invalidate the entity referred to by the URI on receipt of a PUT (but as you've noticed, this doesn't cover all cases).
Aside from this, you could use Cache-Control headers on your responses to limit or prevent caching, and handle request headers that ask whether the URI has been modified since it was last fetched.
This is still a really complicated issue, and in fact it is still being worked on (e.g. see http://www.ietf.org/internet-drafts/draft-ietf-httpbis-p6-cache-05.txt)
Caching within proxies doesn't really apply if the content is encrypted (at least with SSL), so that shouldn't be an issue (though it may still be an issue on the client).
HTTP supports a request header called If-Modified-Since, which basically allows a caching server to ask the web server whether an item has changed. HTTP also supports Cache-Control headers in server responses, which tell cache servers what to do with the content (never cache this, assume it expires in one day, and so on).
You also mentioned encrypted responses. HTTP cache servers cannot cache SSL traffic, because doing so would require them to decrypt the pages as a "man in the middle". That would be technically challenging (decrypt the page, store it, and re-encrypt it for the client) and would violate the page's security, causing "invalid certificate" warnings on the client side. It is technically possible for a cache server to do it, but it causes more problems than it solves and is a bad idea; I doubt any cache servers actually do this.
Unfortunately HTTP caching is based on exact URIs, and you can't achieve sensible behaviour in your case without forcing clients to do cache revalidation.
If you had:
GET /farm/123
POST /farm_update/123
You could use the Content-Location header to indicate that the second request modified the resource returned by the first. AFAIK you can't do that with multiple URIs, and I haven't checked whether it works at all in popular clients.
The solution is to make pages expire quickly and to handle If-Modified-Since or ETag requests, answering with a 304 Not Modified status when nothing has changed.
You can't cache dynamic content (without drawbacks), because... it's dynamic.
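For illustration, a minimal WSGI app in Python that follows that advice: a short max-age so caches expire quickly, plus an ETag so revalidation is a cheap 304. The content function and the five-second lifetime are arbitrary stand-ins.

    import hashlib
    from wsgiref.simple_server import make_server

    def current_body():
        return b'{"farm": {"id": 123, "name": "Shady Acres"}}'  # stand-in content

    def app(environ, start_response):
        body = current_body()
        etag = '"%s"' % hashlib.sha1(body).hexdigest()
        headers = [("Content-Type", "application/json"),
                   ("ETag", etag),
                   ("Cache-Control", "max-age=5, must-revalidate")]  # expire quickly
        if environ.get("HTTP_IF_NONE_MATCH") == etag:
            start_response("304 Not Modified", headers)
            return [b""]
        start_response("200 OK", headers)
        return [body]

    if __name__ == "__main__":
        make_server("", 8000, app).serve_forever()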
In re: SoapBox's answer:
I think If-Modified-Since is the two-stage GET I suggested at the end of my question. It seems like an OK solution where the content is large, i.e. where the cost of doubling the number of requests is outweighed by the savings from not re-sending the content. That isn't true in my Farms example, since each farm's information is short.
It is perfectly reasonable to build a system that sends encrypted content over an unencrypted (HTTP) channel. Imagine a service-oriented architecture where updates are infrequent and GETs (a) are frequent, (b) need to be extremely fast, and (c) must be encrypted. You would build a server that requires a From header (or, equivalently, an API key in the request parameters) and sends back an asymmetrically encrypted version of the content for the requester. Asymmetric encryption is slow, but if properly cached it beats the combined SSL handshake (asymmetric encryption) and symmetric content encryption. Adding a cache in front of this server would dramatically speed up GETs.
A caching server could reasonably cache HTTPS GETs for a short period of time. My bank might put a cache-control of about five minutes on my account home page and recent transactions. I'm not terribly likely to spend a long time on the site, so sessions won't be very long, and I'll probably end up hitting my account's main page several times while I'm looking for that check I recently sent off to SnorgTees.