Hi, I have a page in my Java/JSP-based web application which shows a list of new products.
I want to provide an RSS feed for this. So, what is the way to create an RSS feed which others can use to subscribe?
I could find some Java-based feed creators, but the question is: how will the feed update itself as new products are added to the system?
I'm not familiar with Java, so here's a general thought.
Your feed should be accessible via some URL, like http://mydomain.com/products/feeds/rss. When a feed aggregator fetches this URL, the servlet (I believe this is what they are called in the Java world) fetches a list of recent products from the DB or wherever, builds the RSS feed, and sends it back to the requester, which happens to be the feed aggregator.
For performance reasons this particular servlet need not hit the database on every request. Rather, it can cache either the resulting feed (recommended; HTTP allows for very flexible caching) or the result of the database query somewhere in memory or on disk.
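A minimal sketch of what such a servlet could build (the Product record, URLs, and channel text are hypothetical; in a real servlet you would write this string to the response with content type application/rss+xml):

```java
import java.util.List;

// Sketch only: builds an RSS 2.0 document from a list of products.
// In a real app the list would come from your DB query, and the resulting
// string would be written to the servlet response.
public class RssFeedBuilder {
    public record Product(String name, String url, String description) {}

    public static String buildRss(List<Product> products) {
        StringBuilder sb = new StringBuilder();
        sb.append("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
        sb.append("<rss version=\"2.0\"><channel>\n");
        sb.append("<title>New Products</title>\n");
        sb.append("<link>http://mydomain.com/products</link>\n");
        sb.append("<description>Recently added products</description>\n");
        for (Product p : products) {
            sb.append("<item>");
            sb.append("<title>").append(escape(p.name())).append("</title>");
            sb.append("<link>").append(escape(p.url())).append("</link>");
            sb.append("<description>").append(escape(p.description())).append("</description>");
            sb.append("</item>\n");
        }
        sb.append("</channel></rss>");
        return sb.toString();
    }

    // Escape the XML special characters that would otherwise break the feed.
    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }
}
```

A library like ROME can do this for you, but the output is just XML at a fixed URL, which is all an aggregator needs.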
I'm using Solr 4.10.2 and Drupal 7.x. I have the Apache Solr module framework operating and sending the requests to Solr from Drupal. Currently, when we perform a search, Drupal builds the query and sends it to Solr; Solr just executes the query and returns the results without using its internal handlers, which can be configured through SolrConfig.xml.
I would like to know if there is a way to just send the searched terms (without building a query) from Drupal and let Solr use the internal handlers declared in SolrConfig.xml to handle the request, build the query and then return the data?
The reason for this is we have been working on trying to boost some results when we perform a search (we want exact match first & fuzzy search results after) by changing the "weight" of some fields.
We know that from Back Office we can use the "Bias" function to boost some fields but this is too limited for what we are trying to achieve.
We also know we can change the query sent from Drupal directly from code side by using hook_apachesolr_modify_query() but we prefer changing as little code as possible and using the SolrConfig.xml /handlers which we already have configured to return the results as we want.
Ok, we figured out how to do this:
In order to choose the handler Solr uses for a request sent from Drupal, we have to implement the "hook_apachesolr_query_alter" function and add the following code:
$query->addParam('qt', 'MyHandlerName');
We did some extra coding to allow us to change the Handler directly from back office in order to be able to switch handlers without touching the code.
I need to take a few RSS feeds, and archive all the items that get added to them. I've never consumed or created RSS before, but I know xml, so the format seems pretty intuitive.
I know how to parse the feed: How can I get started making a C# RSS Reader?
I know I can't rely on the feed server to provide a complete history: Is it possible to get RSS archive
I know I'll have to have some custom logic around duplicates: how to check uniqueness (non duplication) of a post in an rss feed
My question is: how can I ensure I don't miss any items? My initial plan is to write a parser where, for each item in the feed:
1) Check to see if it's already in the archive database
2) If not, add it to the database
If I schedule this to run once a day, can I be confident I won't be missing any items?
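For illustration, the two steps could look like this (sketched in Java rather than C#, with a Map standing in for the archive database):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of the two-step plan: the key is the item's GUID (fall back to the
// <link> if the feed omits <guid>). Returns true if the item was new and
// got archived, false if it was already present.
public class FeedArchiver {
    private final Map<String, String> archive = new LinkedHashMap<>();

    public boolean archiveItem(String guid, String itemXml) {
        if (archive.containsKey(guid)) {
            return false;           // step 1: already in the archive, skip
        }
        archive.put(guid, itemXml); // step 2: not seen before, add it
        return true;
    }

    public int size() { return archive.size(); }
}
```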
It depends on the feed: some sites publish articles very frequently and may have their RSS feed configured to show only the 10 most recent articles, while others publish rarely and keep items around much longer.
Ideally your app should 'learn' this frequency from each site and tune itself to ping them accordingly. (For example: if you see new, unique articles every time you ping, you'll need to ping more often; on the other hand, if you see the same set of articles on multiple attempts, you can back off next time.)
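A minimal sketch of that self-tuning idea, with arbitrary bounds and doubling/halving chosen purely for illustration:

```java
// Sketch: halve the polling interval when every item in a fetch was new
// (we are probably missing items between polls), and double it when a
// fetch found nothing new. The 5-minute and 24-hour bounds are assumptions.
public class PollScheduler {
    private static final long MIN_MINUTES = 5;
    private static final long MAX_MINUTES = 24 * 60;
    private long intervalMinutes;

    public PollScheduler(long startMinutes) {
        this.intervalMinutes = startMinutes;
    }

    // newItems = items not yet in the archive; totalItems = items in the feed.
    public long nextInterval(int newItems, int totalItems) {
        if (newItems == totalItems) {
            intervalMinutes = Math.max(MIN_MINUTES, intervalMinutes / 2); // all new: poll faster
        } else if (newItems == 0) {
            intervalMinutes = Math.min(MAX_MINUTES, intervalMinutes * 2); // nothing new: back off
        }
        return intervalMinutes;
    }
}
```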
If you're open to relying on a service for this... I built my own RSS archival service (https://app.pub.center). You can access an RSS feed's data via our API.
Page 1 of The Atlantic
https://pub.center/feed/02702624d8a4c825dde21af94e9169773454e0c3/articles?limit=10&page=1
Page 2 of The Atlantic
https://pub.center/feed/02702624d8a4c825dde21af94e9169773454e0c3/articles?limit=10&page=2
The REST API is free. We have pricing plans for push notifications (email, SMS, your custom API endpoint).
Use a series of decisions based on the feed and the storage limitations. For example:
Connect to the Web site, and download the XML source of the feed. The Feed Download Engine downloads feeds and enclosures via HTTP or Secure Hypertext Transfer Protocol (HTTPS) protocols only.
Transform the feed source into the Windows RSS Platform native format, which is based on RSS 2.0 with additional namespace extensions. (The native format is essentially a superset of all supported formats.) To do this, the Windows RSS Platform requires Microsoft XML (MSXML) 3.0 SP5 or later.
Merge new feed items with existing feed items in the feed store.
Purge older items from the feed store when the predetermined maximum number of items has been received.
Optionally, schedule downloads of enclosures with Background Intelligent Transfer Service (BITS).
Use HTTP to its fullest to minimize wasted bandwidth:
To limit its impact on servers, the Feed Download Engine implements HTTP conditional GET combined with delta encoding in HTTP (RFC 3229). This implementation allows the server to transfer a minimal description of changes instead of an entirely new instance of a resource already cached on the client. The engine also supports compression using the HTTP gzip support of Microsoft Win32 Internet (WinInet).
A successful synchronization means that the feed was successfully downloaded, verified, transformed into the native format, and merged into the store. A server response of HTTP 304 Not Modified in response to an HTTP conditional GET (If-Modified-Since, If-None-Match, ETag, and so on) also constitutes success.
And define criteria for removal:
The following properties directly affect the number of items that remain after a synchronization operation.
PubDate—used to determine the "age" of items. If PubDate is not set, LastDownloadTime is used. If the feed is a list, the order of items is predetermined and PubDate (if present) is ignored.
MaxItemCount—a per-feed setting that limits the number of archived items. The feed's ItemCount will never exceed the maximum, even if there are more items that could be downloaded from the feed.
ItemCountLimit—the upper limit of items for any one feed, normally defined as 2500. The value of MaxItemCount may not exceed this limit. Set MaxItemCount to ItemCountLimit to retain the highest possible number of items.
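The pruning behavior these properties describe can be sketched as follows (a simplified model, with dates as plain timestamps and LastDownloadTime used when PubDate is missing):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Sketch of the purge rule: sort items newest-first by PubDate (falling back
// to LastDownloadTime) and keep at most maxItemCount of them.
public class FeedStorePurge {
    public record Item(String guid, Long pubDate, long lastDownloadTime) {
        long effectiveDate() {
            return pubDate != null ? pubDate : lastDownloadTime;
        }
    }

    public static List<Item> purge(List<Item> items, int maxItemCount) {
        List<Item> sorted = new ArrayList<>(items);
        sorted.sort(Comparator.comparingLong(Item::effectiveDate).reversed());
        return sorted.subList(0, Math.min(maxItemCount, sorted.size()));
    }
}
```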
References
Understanding the Feed Download Engine
So, say I'm a journalist, who wants some way of easily posting links to stories I've written that are published to my newspaper's website. Alas, my newspaper's website doesn't offer user-level RSS feeds (user-level anything for journalists, really).
Running a search (e.g., http://www.calgaryherald.com/search/search.html?q=Rininsland) brings up everything I've done in reverse chronological order (albeit with some duplicates; ignore those for now, I'll deal with them later). Is there any way I can parse this into an RSS feed?
It seems like Yahoo! Pipes might be an easy way to do this, but I'm open to whatever.
Thanks!
Normally this would be a great use of Yahoo Pipes, but it appears that the search page you cited has a robots.txt file, which Pipes respects. This means that Pipes will not pull data from the page.
For more info: "How do I keep Pipes from accessing my web pages?"
http://pipes.yahoo.com/pipes/docs?doc=troubleshooting#q14
You would have to write a scraper yourself that makes an HTTP request to that URL, parses the response, and writes RSS as output. This could be done in many server-side environments such as PHP, Python, etc.
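As a rough illustration of such a scraper's core (regex-based link extraction is fragile and only a stand-in for a real HTML parser; fetching the page is omitted):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch: pull (href, text) pairs out of a search-results page and wrap
// them as RSS items. A real scraper should use a proper HTML parser.
public class SearchToRss {
    private static final Pattern LINK =
        Pattern.compile("<a\\s+href=\"([^\"]+)\"[^>]*>([^<]+)</a>");

    public static String toRss(String html) {
        StringBuilder sb = new StringBuilder(
            "<?xml version=\"1.0\"?>\n<rss version=\"2.0\"><channel>"
            + "<title>Search results</title>");
        Matcher m = LINK.matcher(html);
        while (m.find()) {
            sb.append("<item><title>").append(m.group(2))
              .append("</title><link>").append(m.group(1))
              .append("</link></item>");
        }
        return sb.append("</channel></rss>").toString();
    }
}
```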
EDIT: Feedity provides a service to scrape web pages into feeds. Here is a Feedity feed of your search url:
http://feedity.com/rss.aspx/calgaryherald-com/UFJWUVZQ
However, unless you sign up for a subscription ($3.25/mo), this feed will be subject to the following constraints:
Free feeds created without an account are limited to 5 items and a 10-hour update interval. Free feeds created without an account are automatically purged from our system after 30 days of inactivity.
Provided it's just links and a timestamp you want for each article, then the Yahoo Pipes Search module will return the latest 10 from its search index of the Herald site.
Really simple question here:
For a PHP-driven RSS feed, am I just overwriting the same XML file every time I "publish" a new item? And will the syndicates it's registered with pop in from time to time to check whether there's anything new?
Yes. An RSS reader has the URL of the feed and regularly requests the same URL to check for new content.
That's how it works: a single XML RSS file that gets polled for changes by RSS readers.
For scalability there is FeedTree (collaborative RSS and Atom delivery), but unlike another well-known network program (BitTorrent) it hasn't had much support in readers by default.
Essentially, yes. It isn't necessarily a "file" actually stored on disk, but your RSS (or Atom) is just changed to contain the latest items/entries and resides at a particular fixed URL. Clients will fetch it periodically. There are also technologies like PubSubHubbub and pinging for causing updates to get syndicated closer to real-time.
Yes... BUT! There are ways to make your subscribers' lives better and also save your bandwidth :) Implement the PubSubHubbub protocol. It lets any application that wants the feed's content be notified as soon as it's available. It's relatively simple to implement on the publisher side, as it only involves a ping.
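Per the PubSubHubbub spec, that publisher-side ping is just a form-encoded POST to the hub announcing which feed URL changed. A minimal sketch of building it (the actual send, e.g. via java.net.http.HttpClient, is left out):

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Sketch: build the body of the publisher's "ping" POST to a PubSubHubbub hub.
// The hub then fetches the feed and pushes the new entries to subscribers.
public class HubPing {
    public static String pingBody(String feedUrl) {
        return "hub.mode=publish&hub.url="
             + URLEncoder.encode(feedUrl, StandardCharsets.UTF_8);
    }
}
```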
Just learning about this via YouTube, but I could not find an answer to my question of how the reader knows there is an update.
Is it like push on a BlackBerry?
RSS is a file format and doesn't actually know anything about where the entries come from. The question really is: "how can an HTTP request get only the newest results from a server?", and the answer is a conditional GET. HTTP also supports conditional PUT.
This is an article about using this feature of HTTP specifically to support RSS hackers.
RSS is a pull technology. The reader re-fetches the RSS feed now and then (for example two times per hour, or more often if the reader learns that it's an often updated feed).
The feed is served through regular HTTP and consists of a simple XML file. It is always fetched from the same URL.
It just checks the feed for updates regularly.
Recently there is a new protocol called PubSubHubbub that makes feeds push to the listener, but it requires the publisher to support it.
Here is a list of web services supporting real-time RSS pushing, including Google Reader, Blogger, FeedBurner, FriendFeed, MySpace, etc.
Let's summarize :
Usually, a client knows that an RSS feed has been updated through polling, that is, regular pulls (HTTP GET requests on the feed URL).
Push doesn't exist on the web, at least not with HTTP, until HTML5 WebSockets are finalized.
However, some blog platforms like WordPress, Google's and others now support the PubSubHubbub convention. In this mode, you "subscribe" to the updates of an RSS feed: the "hub" will call a URL on YOUR site (a callback URL) to send you updates. That is a push.
Push or pull, in both cases you still need to write some code to update the RSS list on your site, database, or wherever you store/display it.
And, as a side note, it is not necessary to request the whole XML at every pull to see if the content has changed: using a mechanism that is not specific to RSS but part of the HTTP protocol itself (the ETag and Last-Modified headers), you can know whether the RSS page was modified after a given date, and grab the whole XML only if it was.
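The ETag / Last-Modified mechanism described above can be sketched with java.net.HttpURLConnection (the URL is hypothetical; nothing is actually fetched here, the request is only prepared):

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

// Sketch of a conditional GET: send back the ETag and Last-Modified values
// remembered from the previous fetch; a server that supports them answers
// 304 Not Modified with an empty body when the feed is unchanged.
public class ConditionalGet {
    public static HttpURLConnection prepare(String feedUrl,
                                            String lastEtag,
                                            long lastModifiedMillis) {
        try {
            HttpURLConnection conn =
                (HttpURLConnection) new URL(feedUrl).openConnection();
            if (lastEtag != null) {
                conn.setRequestProperty("If-None-Match", lastEtag); // ETag from last fetch
            }
            if (lastModifiedMillis > 0) {
                conn.setIfModifiedSince(lastModifiedMillis);        // sends If-Modified-Since
            }
            // Caller: if conn.getResponseCode() == HttpURLConnection.HTTP_NOT_MODIFIED,
            // the feed is unchanged and the cached XML can be reused.
            return conn;
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
}
```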
It's a pull. That's why you have to configure how often your reader should refresh the feed.