I have a naive question about RSS feeds.
I have a series of timed events which appear on my site and that I make available as an RSS feed for other applications to import.
Who is typically responsible for truncating this feed? Over the next year, I can see my feed growing to thousands of items. Should the URL mysite.com/rss always return all items, leaving it to readers to show only the most recent? Or is it more customary that I return only, say, the top 50, expecting readers to cache older items? (And, if so, is there a convention for readers to ask the server for the "next page"?)
What is the typical behaviour of something like FriendFeed when it pulls in an RSS stream?
You should return only the top few. Readers are supposed to save older items. Readers also usually ask for the feed many times a day, so you'll want to keep its size low to save bandwidth. If someone wants to browse your archives, they'll typically do it via your web site. RSS is mostly for syndication of new items.
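There is in fact a convention for the "next page": RFC 5005 (Feed Paging and Archiving) lets a feed carry a rel="next" link pointing at a page of older entries. A minimal sketch in Python of a truncated feed with such a link (the page size, item data, and URLs are invented for illustration):

```python
import xml.etree.ElementTree as ET

PAGE_SIZE = 50  # hypothetical cutoff; tune to your item size and traffic

def build_feed(items, page=1, base_url="https://mysite.com/rss"):
    """Return RSS XML for one page of items, newest first, with an
    RFC 5005 rel="next" link when older items remain."""
    rss = ET.Element("rss", {"version": "2.0",
                             "xmlns:atom": "http://www.w3.org/2005/Atom"})
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = "My timed events"
    ET.SubElement(channel, "link").text = base_url

    start = (page - 1) * PAGE_SIZE
    for title, link in items[start:start + PAGE_SIZE]:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = title
        ET.SubElement(item, "link").text = link

    # Older items exist, so advertise where the reader can fetch them.
    if start + PAGE_SIZE < len(items):
        ET.SubElement(channel, "atom:link",
                      rel="next", href=f"{base_url}?page={page + 1}")
    return ET.tostring(rss, encoding="unicode")
```

Most readers ignore the paging link, which is fine: the top-50 page alone already serves them the new items.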
I have a question regarding the use of the Google News RSS feed. Google News help states this:
Why Google might block an RSS feed
In some cases, Google News might block a feed. That could happen if you are:
Using Google News feeds for profit or to increase traffic to your site
Reformatting news results so they look like your own content
Changing, editing, or creating works based on content from Google News
I am looking to clarify these points:
Can't I customize the look of the feed? I want to have a separate page for news related to content on my website. Will I then violate the second rule if I customize the look of it? For example, I'd display a slideshow at the top along with a listing at the bottom, much like the FeedWind or Feedgrabber widgets.
I am surely not violating the third one. But everyone displays Google News on their website to sustain traffic, right? Isn't the first rule broken by everyone who uses the Google News RSS feed on their website?
Can't I customize the look of the feed?
Create an app or script to grab the feed, parse it, and decorate it the way you like. Now you have successfully customized the look.
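A minimal sketch of that grab-parse-decorate step in Python, using only the standard library (the sample feed and markup are invented; a real Google News feed carries more fields):

```python
import xml.etree.ElementTree as ET

def feed_to_html(rss_xml):
    """Parse an RSS 2.0 document and render its items as a simple HTML list."""
    channel = ET.fromstring(rss_xml).find("channel")
    rows = []
    for item in channel.findall("item"):
        title = item.findtext("title", default="(untitled)")
        link = item.findtext("link", default="#")
        rows.append(f'<li><a href="{link}">{title}</a></li>')
    return "<ul>\n" + "\n".join(rows) + "\n</ul>"

# Stand-in feed; in practice you would fetch the real feed URL.
sample = """<rss version="2.0"><channel><title>News</title>
<item><title>First story</title><link>https://example.com/1</link></item>
<item><title>Second story</title><link>https://example.com/2</link></item>
</channel></rss>"""
```

From here the "decoration" is just your own HTML and CSS around the list.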
I want to have a separate page for news related to content on my website. Will I then violate the second rule if I customize the look of it?
This question is a bit trickier. Let me give you a simple answer: note somewhere on the page that the content came from the Google News feed. When Google shows you ads, it puts a little AdChoices label in a corner; clicking it confirms that it was an ad from Google. Follow their strategy and give them proper credit.
I am surely not violating the third one. But everyone displays Google News on their website to sustain traffic, right? Isn't the first rule broken by everyone who uses the Google News RSS feed on their website?
When you are providing value to others, people tend to look the other way and pretend that what they see is not a promotion. For example, free medical camps are rarely run purely to help others; they attract prospective patients (clients/customers) and free media coverage. Setting aside the doctors who genuinely serve for free, the promotion is part of the point.
I am not a developer, and I have been doing my best to get a response from SoundCloud, but it's taking a very long time (despite being a "pro" member) and I need to make a decision.
I am one of the beta SoundCloud podcasters. SoundCloud recommends that we use Feedburner to generate the RSS we submit to iTunes. However, my FeedBurner RSS has an issue (involving graphics) and I can't figure out how to edit the feed without killing it and starting anew there and with Apple.
In trying to figure out how to deal with FeedBurner, I found a myriad of complaints and rumors suggesting FeedBurner will soon be on the outs. I then tried going with a third-party service (RapidFeeds), but importing the (valid) SoundCloud RSS wasn't working there, and customer support has not responded in 4 days of waiting. SoundCloud tech support says they have been having trouble with many of these third-party vendors.
So ... I'm back to either Feedburner -- which, unless I can figure out how to edit the feed -- means I'll kill it, redo it, and resubmit to Apple ... and face the uncertainty of whether it'll be around (with my subscribers) in a few months ...
OR
I could use the naked SoundCloud RSS -- which will definitely work with Apple -- but I will not be able to TAG the feed in any way, nor will I have any idea of subscriber stats. SoundCloud SAYS they plan to add tagging/other RSS functionality "in the next couple of months" ... but will they?
I could use your opinions on what to do. I need to make a decision quickly as I'm holding up a website launch for this. Thanks.
You didn't describe the actual problem you're having, or give your source and FeedBurner feed URLs, so my advice will be a little blind.
Log in to FeedBurner, click on your feed, and go to the "Optimize" tab. Your podcast settings are under "SmartCast." If you're having trouble with your podcast cover art, then the "Image" field is what you need to change.
If this isn't your exact issue, please let us know more details.
Despite the potential doom of FeedBurner (I predict its retirement will be announced in 2013), you are right to use FeedBurner if you're using SoundCloud to host your podcast.
SoundCloud is not a podcasting service and it seems that they don't give you full freedom over your RSS feed. This is the lifeblood of your podcast. So using FeedBurner gives you a lot more control than SoundCloud's feed, but you're still sacrificing the control you would have of running your own website.
I have a self-hosted WordPress blog and, almost as expected, I found there's another blog scraping my content, posting a perfect copy of my own posts (text; images not hotlinked but fetched and re-uploaded to the clone's server; the HTML layout within the posts) with a few hours of delay.
However, I must confess I'm infuriated to see that when I search Google for keywords relevant to my posts, the scraping clone always comes first.
So here I am, open to suggestions. Would you know how to prevent my site from being successfully scraped?
Technical details:
the clone blog appears to be self-hosted, and so am I; I'm on a Debian + Webmin + Virtualmin dedicated server
my RSS feed is already truncated halfway with a "read more on" link. I just thought I should publish a post backdated to 2001-01-01 and see if it appears on the clone blog; that would tell me whether my RSS feed is still used as the signal for "hey, it's scraping time!"
my logs can't isolate the scraper from legitimate traffic; either it's non-identifiable or it's lost in the flood of legitimate visitors
I have already htaccess-banned and iptables-banned the .com domain of the clone; my content is still cloned nonetheless
the clone website uses reverse proxies, so I can't trace where it is hosted or which actual IPs should be blocked (well, unless I iptables-ban half of Europe to cover the whole IP ranges of its data storage facility, but I'm slightly reluctant to do that!)
I'm confident this isn't hand-made; the cloning has been running for two years now, every day without fail
only my new posts are cloned, not the rest of my website (not the sidebars, not WordPress pages as opposed to posts, not single pages), so setting up a jail.html to log whoever opens it won't work; no honey-potting
when my posts contain internal links pointing to another page of my website, the posts on the clone aren't rewritten and still point back to my own website
I'd love help and suggestions with this issue. The problem isn't the cloning itself, but losing traffic to that bot while I'm the original publisher.
You can't really stop them in the end, but you might be able to find them and mess with them. Try hiding the request IP in an HTML comment, in white-on-white text, or just somewhere out of the way, then see which IPs show up on the copies. You can also obfuscate that text by turning it into a hex string, or make it look like an error code, so it's less obvious to someone who doesn't know what they're looking at and they don't catch on to what you're doing.
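A sketch of that watermarking idea in Python (assuming your pages are rendered per request; the "cache-id" disguise is invented):

```python
def watermark(ip):
    """Encode the requester's IP as hex so it reads like an opaque token."""
    return ip.encode().hex()

def unwatermark(token):
    """Recover the IP from a token found on a cloned page."""
    return bytes.fromhex(token).decode()

def render_page(body_html, client_ip):
    # Tuck the encoded IP into an HTML comment disguised as a cache tag.
    return f"{body_html}\n<!-- cache-id: {watermark(client_ip)} -->"
```

When the comment shows up verbatim on the clone, decoding its token tells you which of your visitors was the scraper.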
In the end, though, I'm not sure how much it will buy you. If they're really inattentive, then rather than shutting them down and calling attention to the fact that you're onto them, you can feed them gibberish whenever one of their IPs crops up. That might be fun, and it's not too hard to build a gibberish generator by feeding sample texts into a Markov chain.
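A toy version of such a gibberish generator, as a word-level Markov chain (the training text would be your own posts; everything here is a stand-in):

```python
import random
from collections import defaultdict

def build_chain(text):
    """Map each word to the words that follow it in the sample text."""
    chain = defaultdict(list)
    words = text.split()
    for cur, nxt in zip(words, words[1:]):
        chain[cur].append(nxt)
    return chain

def gibberish(chain, length=30, seed=None):
    """Walk the chain, producing text that looks locally plausible."""
    rng = random.Random(seed)
    word = rng.choice(list(chain))
    out = [word]
    for _ in range(length - 1):
        followers = chain.get(word)
        if not followers:              # dead end: restart somewhere random
            word = rng.choice(list(chain))
        else:
            word = rng.choice(followers)
        out.append(word)
    return " ".join(out)
```

Serve the output only to the scraper's IPs and the clone fills up with nonsense while your real readers see nothing unusual.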
EDIT: Oh, and if the pages aren't rewritten too much, you might be able to add some inline JS to make them link back to you, if they don't strip it. Say, a banner that only shows up when the page isn't on your site, giving the original link to your article and suggesting that people read it there.
Are you willing to shut down your RSS feed? If so, you could do something like:
function fb_disable_feed() {
    wp_die( __( 'No feed available, please visit our homepage!' ) );
}
add_action('do_feed', 'fb_disable_feed', 1);
add_action('do_feed_rdf', 'fb_disable_feed', 1);
add_action('do_feed_rss', 'fb_disable_feed', 1);
add_action('do_feed_rss2', 'fb_disable_feed', 1);
add_action('do_feed_atom', 'fb_disable_feed', 1);
This means that if you go to a feed page, it just returns the message in wp_die() on line two. We use this, wrapped in an if-statement, for "free" versions of our WP software so users can't hook into their RSS feeds to link to their main website; it's an upsell opportunity for us. My point is that it works well, haha.
Even though this is an older post, I thought it would still be helpful to weigh in, in case other people see it and have the same question. Since you've eliminated the RSS feed from the mix and you're pretty confident it isn't a manual effort, what you need to do is get better at stopping the bots they are using.
First, I would recommend banning proxy servers in your iptables. You can get a list of known proxy server addresses from MaxMind. This should limit their ability to anonymize themselves.
Second, make it harder for them to scrape. You could accomplish this in a couple of ways. You could render part, or all, of your site in JavaScript; if nothing else, you could at least render the links in JavaScript. This will make it significantly harder for them to scrape you. Alternatively, you could put your content within an iframe inside the pages. This will also make it somewhat harder to crawl and scrape.
All this said, if they really want your content they will pretty easily get past these traps. Honestly, fighting off web scrapers is an arms race. You cannot put any static trap in place to stop them; instead you have to continuously evolve your tactics.
For full disclosure, I am a co-founder of Distil Networks, and we offer an anti-scraping solution as a service.
There is a blog, powered by WordPress, which has a valid RSS feed (it opens fine in Safari) but doesn't show new posts in Google Reader. In fact, the latest article in Google Reader is from Jul 21, 2010, while the latest article on the blog dates to Aug 19, 2010.
What should I do to the RSS feed (escape characters? modify the XML? something else?) for it to work in Google Reader?
This is a reopened question: the original question I found was migrated to Super User, then closed there because it was considered a better fit for Stack Overflow, so no solution was ever provided and no chance was given to do so. Please give it a chance to get answered.
Update:
Google Reader pulls new articles in groups of exactly 10, and not the latest ones. For example, if 12 (or 13, or 11) new articles are not shown in Google Reader, then when the next one is added, the oldest 10 (exactly 10) of these articles appear in Google Reader, and the date shown is the same for every article, as if all 10 were published in the same second: the second they appeared in Google Reader. This problem doesn't manifest itself in the other aggregators I've tried.
Update 2:
Articles started showing up regularly, so the problem is solved, temporarily. Why it happened I don't know; maybe it's because more readers subscribed (for testing purposes), or because of the PubSubHubbub plugin I added recently. Until it becomes clear, and for 3 more days, this question remains open.
I just added the blog to my Google Reader and had a bit of a play. I noticed the same behaviour you observed where I was missing the 5 most recent posts and a bunch of about 10 of them all had the same date:
After doing a bit of a search on the web, I found this post which explains how you can actually view the Published date via a tooltip on the right-hand side:
Then once I click the "Refresh" button from Google Reader at the top, the new posts showed up:
I believe that high-volume blogs that are on the Google spiders' radar would be indexed every few hours, and therefore all posts would have their Received date very close to the Published date, so nobody notices/cares that it is actually displaying the Received date.
For low volume blogs however, it seems the cache is updated much less frequently. Google has some tips to try to get it to update - Feed not updating in Reader. Maybe my subscription to the blog updated the cache, but as the spider has a delay I didn't see the updates till pressing "Refresh". Or maybe the act of pressing the "Refresh" button triggered it to look for new posts immediately.
Lastly I subscribed the blog to my wife's Google Reader account and this time the 5 latest posts came up straight away with matching Received times which translated back to about the time when I pressed the "Refresh" button (or maybe it was when I added the feed).
I feel your pain - I agree that it all seems a bit cumbersome for a low volume RSS feed ...
You may also check with the blog author / hosting company and see if they have turned down the Google indexing rate. Google can create high volumes of traffic on a site. Turning down the indexing rate (crawl rate) will help with that, but it breaks Google Reader.
As other posters have mentioned, it could also be a factor of low popularity / low page rank / something else causing Googlebot to fail to crawl the blog frequently enough.
Google Reader's display depends on Google crawling the blog to pick up the latest content. Realistically, you'll want a client-side pull of the RSS feed to get the latest data, so you aren't dependent on Google crawling the website. Outlook 2010, Firefox, and many other clients exist. The client-side software will pull the updated RSS feed directly from the blog, capturing posts as they are published to the feed.
Thank you for your responses, I too have come up with some possible solutions (thanks to you).
I don't know whether it's something I did or independent of that, but as of yesterday (when you answered this question), feeds started showing up normally.
Maybe it is due to the fact that, thanks to you, the blog got more subscribers in Google Reader and the update rate bounced (just like #Bermo suggested).
Or maybe the introduction of the PubSubHubbub plugin changed something. But it's more likely the first variant (the number of subscribers). Though it is still a mystery why other extremely unpopular blogs give me regular articles in Google Reader.
For now I will only upvote good answers, until everything becomes clear (can't really determine the exact cause) or until the last day of this bounty.
For all the RSS feeds I subscribe to I use Google Reader, which I love. I do however have a couple of specific RSS feeds that I'd like to be notified of as soon as they get updated (say, for example, an RSS feed for a forum I like to monitor and respond to as quickly as possible).
Are there any tools out there for this kind of monitoring which also have some kind of alert functionality (for example, a prompt window)?
I've tried Simbolic RSS Alert but I found it a bit buggy and couldn't get it to alert me as often as I liked.
Suggestions? Or perhaps a different experience with Simbolic?
If you have access to Microsoft Outlook 2007 or Thunderbird, these email clients allow you to add RSS feeds in the same way you would add an email account.
I use Google Reader generally but when I want to keep up-to-date with something specific, I add the RSS feed to Outlook and it arrives in my inbox as if it was an email.
RSS isn't "push", which means that you need to have something that polls the website. It's much less traffic than getting the whole site or front page (for instance, you can say "Give me all articles newer than the last time I asked"), but it's traffic nonetheless.
It's generally understood that an automated client shouldn't refresh more often than once every 30 minutes. (Citation required.)
Having said that, you may find a client which allows you to set a more frequent refresh.
RSS2mail is a simple Python script which I used extensively a few years back.
As Matthew stated, you really shouldn't poll an RSS feed more often than the producer allows, but you can use HTTP headers to check for changes in a very lightweight way, which is something rss2email does quite well.
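That lightweight header check is HTTP conditional GET: the client echoes back the ETag and Last-Modified values from its previous fetch, and the server replies 304 Not Modified with an empty body when nothing has changed. A sketch with Python's standard library (the URL and cached header values are placeholders):

```python
import urllib.request

def conditional_request(url, etag=None, last_modified=None):
    """Build a GET request that lets the server reply 304 if the feed
    is unchanged since the values we cached from the last fetch."""
    req = urllib.request.Request(url)
    if etag:
        req.add_header("If-None-Match", etag)
    if last_modified:
        req.add_header("If-Modified-Since", last_modified)
    return req

# On a real fetch, catch urllib.error.HTTPError with code 304 and skip
# parsing: the previously cached copy of the feed is still current.
```

This keeps frequent polling cheap for both sides, since an unchanged feed costs only headers.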
You could always knock something up yourself... I've done it in the past and it really isn't too difficult a job to write an RSS parser.
Of course, as others have mentioned, there's an etiquette question as to how much of the website's valuable bandwidth you want to hog for yourself in RSS request traffic. That's a matter for your own conscience. ;)
Reading all the answers reminded me that I actually never looked into solving this with a Firefox add-on. I soon found Update Scanner, and I think it looks really promising!
I like an old version of Feedreader for that kind of use, where the icon in the system tray started spinning when new stuff came in (the new version goes from grey to yellow instead).
It's also possible to be alerted for each new message.
I've used Pingie to send me an SMS when a new item appears in an RSS feed. Perhaps, it will be useful for you, if you have a cellphone text messaging plan.
I use RSS Bandit (for Windows) to stay up to date with my RSS feeds/blogs.
There are lots of other RSS aggregator applications though.
If you don't want another "big" application but have Windows Vista, you can also choose to make Internet Explorer monitor the RSS feed and use the Feed sidebar application (called "Feedschlagzeilen" in the German version; not sure about the English one) that comes with Vista to show the latest headlines.
Since you mentioned a pop-up, I'll add Feed Notifier to the list. It sits in the Windows Tray (or whatever they call it now in Windows 7) and pops up a notification when there are new entries to your feeds. You can set it up with multiple feeds, each with its own polling interval. When there are new entries, it pops up a prompt which you can dismiss or click to go to the entry. You are able to go back and review recent entries later, even if you clicked to dismiss them the first time. If your PC is asleep when a new entry is added, you will be notified the next time you wake it up.