Maximum display posts on RSS? - rss

I never needed to play a lot with RSS, but now I have a project to do, and I wonder if it is possible to pull an RSS feed of all the posts for a certain blog...
I'm not talking about creating a feed generator. I'm just curious why most sites on blogspot.com and similar hosts expose only the last 5 or last 20 posts in their feed, but never the complete list. Is it for performance or security reasons? I guess it is up to the site owner to decide how many posts go in the feed, right?

How many entries you put in your RSS feed depends on what you want the feed to achieve. Usually you want to tell readers about the current articles on your site, which is why feeds usually contain only the most recent articles.
Performance is of course an issue. A popular RSS feed with many subscribers should not have to send a huge XML file on every poll. That can be addressed with enough resources, but as long as it does not really help your goal, why do it?
I do not see a real security issue. If someone wants to steal your content, they can simply iterate over the articles on your website directly. A full RSS feed would make it a little easier, but someone who wants to steal the content will steal it anyway, with or without one. If you take denial of service into consideration (which brings us back to performance), there might be a threat to availability, but that's already quite speculative.
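To make the "only the most recent articles" point concrete, here is a minimal sketch of a feed builder that caps the number of items. The channel title, link, and the shape of the `posts` dictionaries are assumptions made up for illustration; this is not how Blogger or any particular platform implements it.

```python
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime
import xml.etree.ElementTree as ET


def build_feed(posts, limit=20):
    """Return an RSS 2.0 document holding only the `limit` newest posts.

    `posts` is assumed to be a list of dicts with "title", "link" and
    "published" (a datetime) keys; that shape is hypothetical.
    """
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = "Example blog"         # placeholder
    ET.SubElement(channel, "link").text = "https://example.com/"  # placeholder
    ET.SubElement(channel, "description").text = "Most recent posts"

    # Sort newest first, then cut the list off at `limit` entries.
    newest_first = sorted(posts, key=lambda p: p["published"], reverse=True)
    for post in newest_first[:limit]:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = post["title"]
        ET.SubElement(item, "link").text = post["link"]
        ET.SubElement(item, "pubDate").text = format_datetime(post["published"])
    return ET.tostring(rss, encoding="unicode")
```

Raising `limit` to the total number of posts would give the "complete list" feed from the question; the cost is simply a larger document on every request.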

Related

Making a content feed

I am making a website that lets people discuss news topics. I want to build something like a news feed that shows the most talked-about topics and the topics each user follows, but I am not sure how to do this. I can't think of a process for it, and I don't think RSS feeds are the answer. Help would be appreciated.
Same here. I am developing a website too and learning how to develop an RSS engine of my own.
http://news.bbc.co.uk/2/hi/help/rss/default.stm#mysite
http://www.wikihow.com/Create-an-RSS-Feed
But I need more information. As far as I know, an RSS feeder itself looks for the latest content on news websites or blogs (perhaps by looking at their dates) and places the latest post on top. The problem is that I am not able to create that yet. I need to learn a lot more about RSS and especially XML.
Your problem is different, I think. You want to show the trending posts on top, so I think you will need an algorithm to rank your pages/posts. This algorithm should evaluate the real "hotness" of the content. For example, a 20-day-old post on your website might still be hotter than the latest news and so should still appear among the top items.
The question then is how this algorithm decides whether a post is hot or not. It could be based on the likes or hearts users give it, comments on the page, links in those comments, shares (you can track those), external links to your post, and so on. It's up to you which signals matter most for making a post trend. You can give more weight to external links, or to comments, or you could set thresholds for each signal that mark a post as fully successful once they are reached.
Sorry if this isn't clear. I was just thinking through possible solutions; I don't actually know a ready-made answer.
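The ranking idea above can be sketched as a weighted engagement score divided by an age penalty. The weights, the `gravity` exponent, and the field names are all assumptions chosen for illustration; this is one common shape for such a score, not the only one, and the numbers would need tuning against real data.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical weights: raise the ones you want to matter most.
WEIGHTS = {"likes": 1.0, "comments": 2.0, "shares": 3.0, "external_links": 5.0}


def hotness(post, now, gravity=1.5):
    """Weighted engagement divided by an age penalty, so older posts sink."""
    engagement = sum(weight * post.get(signal, 0) for signal, weight in WEIGHTS.items())
    age_hours = (now - post["published"]).total_seconds() / 3600
    # The "+ 2" keeps brand-new posts from dividing by (almost) zero.
    return engagement / (age_hours + 2) ** gravity


def trending(posts, now, top=10):
    """Return the `top` posts ordered from hottest to coldest."""
    return sorted(posts, key=lambda p: hotness(p, now), reverse=True)[:top]
```

With this shape a 20-day-old post can still outrank a fresh one, but only if its engagement is large enough to overcome the age penalty, which matches the behaviour described above.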

Analyzing possible WordPress hacking

I have just been checking the yearly stats for a blog I manage, and there is one post from 2008 that is getting a LOT of views, which doesn't make any sense as the info in it is outdated.
I pulled the access_log entries for this post and am finding a lot of referrers from cials-pills-online.info and sites like that. Not a lot of entries for any one of these sites, but say 20-30 a month.
I have looked around the site and can't see anything obvious amiss. Can anyone tell me where to look and what to look for to see if there's any monkey business related to this post?
Well, there is a rather simple way to see whether it is really monkey business: if you don't need the post, delete it and resubmit it. It really depends on what the post is about; maybe you just did some great SEO or are ranking for a related keyword. If they are leaving comments, maybe it's just for backlinks. This happens when a post is off-topic for your site: for example, my site is about planes and I write a post about cars, so I get people who want backlinks from a post about cars.
There are also a lot of great security plugins; just search the plugin directory.
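Before deleting anything, it can help to quantify which referrers are hitting the post. The following is a rough sketch that assumes the access_log is in the common "combined" log format (request, status, size, quoted referrer, quoted user agent); the path prefix passed in is a placeholder for the post's real URL.

```python
import re
from collections import Counter
from urllib.parse import urlparse

# Combined log format: ... "METHOD path PROTO" status size "referrer" "user-agent"
LINE = re.compile(
    r'"(?:GET|POST|HEAD) (?P<path>\S+) [^"]*" \d{3} \S+ "(?P<ref>[^"]*)"'
)


def referrer_domains(lines, path_prefix):
    """Count referrer host names for requests whose path starts with path_prefix."""
    counts = Counter()
    for line in lines:
        match = LINE.search(line)
        if not match or not match.group("path").startswith(path_prefix):
            continue
        host = urlparse(match.group("ref")).netloc
        if host:  # a "-" referrer has no host and is skipped
            counts[host] += 1
    return counts
```

Called as `referrer_domains(open("access_log"), "/2008/")` (path hypothetical), `counts.most_common(20)` then shows whether the traffic comes from a handful of spam domains or is spread thinly across many.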

Prevent automated tools from accessing the website

The data on our website can easily be scraped. How can we detect whether a human is viewing the site or a tool?
One way is to measure how long a user stays on a page, but I do not know how to implement that. Can anyone help me detect and prevent automated tools from scraping data from my website?
I used a security image in login section, but even then a human may log in and then use an automated tool. When the recaptcha image appears after a period of time the user may type the security image and again, use an automated tool to continue scraping data.
I developed a tool to scrape another site. So I only want to prevent this from happening to my site!
DON'T do it.
It's the web, you will not be able to stop someone from scraping data if they really want it. I've done it many, many times before and got around every restriction they put in place. In fact having a restriction in place motivates me further to try and get the data.
The more you restrict your system, the worse you'll make user experience for legitimate users. Just a bad idea.
It's the web. You need to assume that anything you put out there can be read by human or machine. Even if you can prevent it today, someone will figure out how to bypass it tomorrow. Captchas have been broken for some time now, and sooner or later, so will the alternatives.
However, here are some ideas for the time being.
And here are a few more.
And now for my favorite: one clever site I've run across asks a question like "On our 'About Us' page, what is the street name of our support office?" It takes a human to find the "About Us" page (the link doesn't say "about us"; it says something similar that a person would figure out). Then, to find the support office address (which is different from the main corporate office and several others listed on the page), you have to look through several matches. Current computer technology wouldn't be able to figure it out any more than it can manage true speech recognition or cognition.
A Google search for "Captcha alternatives" turns up quite a bit.
This can't be done without risking false positives (and annoying users).
How can we detect whether a human is viewing the site or a tool?
You can't. How would you handle tools that parse the page on behalf of a human, like screen readers and other accessibility tools?
For example one way is by calculating the time up to which a user stays in page from which we can detect whether human intervention is involved. I do not know how to implement that but just thinking about this method. Can anyone help how to detect and prevent automated tools from scraping data from my website?
You won't detect automated tools, only unusual behavior. And before you can define unusual behavior, you need to find out what's usual. People view pages in different orders, browser tabs let them do tasks in parallel, and so on.
I should make a note that if there's a will, then there is a way.
That being said, I thought about what you've asked previously and here are some simple things I came up with:
Simple, naive checks such as user-agent filtering. You can find a list of common crawler user agents here: http://www.useragentstring.com/pages/Crawlerlist/
You can always display your data in Flash, though I do not recommend it.
Use a captcha.
Other than that, I'm not really sure if there's anything else you can do but I would be interested in seeing the answers as well.
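As an illustration of the user-agent check in the list above, here is a deliberately naive sketch. The marker list is a short, made-up sample rather than the crawler list linked above, and since a scraper can send any User-Agent header it likes, this only filters out tools that do not bother to hide.

```python
# Hypothetical, deliberately short sample; real crawler lists are far longer.
BOT_MARKERS = ("bot", "crawler", "spider", "curl", "wget", "python-requests")


def looks_automated(user_agent):
    """Naive check: a missing User-Agent or one containing a known marker.

    Trivial to spoof, and it will also flag well-behaved search crawlers.
    """
    ua = (user_agent or "").lower()
    return not ua or any(marker in ua for marker in BOT_MARKERS)
```

Because the check also matches legitimate search-engine crawlers, treating a hit as a signal (log it, slow the client down) is probably safer than blocking outright.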
EDIT:
Google does something interesting: if you're searching for SSNs, after the 50th page or so it shows a captcha. That raises the question of whether you can intelligently time how long a user spends on your page or, if you want to bring pagination into the equation, how long a user spends on any one page.
Using the information that we previously assumed, it is possible to put a time limit before another HTTP request is sent. At that point, it might be beneficial to "randomly" generate a captcha. What I mean by this, is that maybe one HTTP request will go through fine, but the next one will require a captcha. You can switch those up as you please.
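The "time limit plus random captcha" idea above might look roughly like this. The class name, the two-second threshold, and the 50% probability are assumptions for illustration, and actually rendering and verifying the captcha is left out entirely.

```python
import random


class ChallengePolicy:
    """Decide when to show a captcha based on the gap between a client's requests."""

    def __init__(self, min_interval=2.0, challenge_probability=0.5, rng=None):
        self.min_interval = min_interval              # seconds a reader plausibly needs
        self.challenge_probability = challenge_probability
        self.rng = rng or random.Random()
        self.last_seen = {}                           # client id -> previous request time

    def should_challenge(self, client_id, now):
        previous = self.last_seen.get(client_id)
        self.last_seen[client_id] = now
        if previous is None or now - previous >= self.min_interval:
            return False  # first visit, or slow enough to look like a person
        # Too fast: challenge only some requests so a scraper cannot predict which.
        return self.rng.random() < self.challenge_probability
```

`client_id` could be a session cookie or an IP address; `now` is passed in (for example `time.monotonic()`) so the policy is easy to test. As noted elsewhere in this thread, people with many open tabs will occasionally trip a check like this too.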
Scrapers steal the data from your website by parsing URLs and reading the source code of your pages. The following steps can at least make scraping a bit more difficult, even if they cannot make it impossible.
Load data through Ajax requests. This makes the data harder to parse and requires extra effort to find the URLs that need to be parsed.
Use cookies even for normal pages that do not require any authentication: create the cookie when the user visits the home page and then require it for all the inner pages. This makes scraping a bit more difficult.
Display encrypted content on the page and then decrypt it at load time using JavaScript. I have seen this on a couple of websites.
I guess the only good solution is to limit the rate that data can be accessed. It may not completely prevent scraping but at least you can limit the speed at which automated scraping tools will work, hopefully below a level that will discourage scraping the data.
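A minimal sketch of that kind of rate limit, using a sliding window per client. The limits shown (30 requests per 60 seconds) are arbitrary placeholders, and the in-memory dictionary only works for a single process; a real deployment would more likely use the web server's or a proxy's built-in limiter.

```python
from collections import defaultdict, deque


class RateLimiter:
    """Allow at most `max_requests` per client within any `window_seconds` span."""

    def __init__(self, max_requests=30, window_seconds=60.0):
        self.max_requests = max_requests
        self.window = window_seconds
        self.hits = defaultdict(deque)  # client id -> timestamps of recent requests

    def allow(self, client_id, now):
        hits = self.hits[client_id]
        # Forget requests that have fallen out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False  # over the limit: reject or delay this request
        hits.append(now)
        return True
```

A rejected request would typically get an HTTP 429 response. This does not stop a patient scraper, but it caps how fast any single client can pull data.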

How to prevent someone from hacking API feed?

I have started developing a webpage and recently hired someone to write code to display a customized feed (powered by API) in the middle panel on http://farmball.com/. Note that this is not the RSS feed tied to the site blog. The feed ties to my account on another site. There is no RSS link for an average user to subscribe to the feed. I've taken the site out of maintenance mode to ask anyone here with scraping/hacking experience how someone would most easily go about 'taking' the feed and displaying it on their own site. More importantly, what can I do to prevent it?
^Updated for re-wording
You can't.
If you are going to expose an RSS feed that you don't want others to be able to display on their site, then you are completely missing the point of RSS. The entire reason for Really Simple Syndication (RSS) is to make your content externally consumable, whether that's in an RSS reader or by someone simply printing its content on their own website.
Why are you including an RSS feed if you do not want someone to be able to consume it?
what can I do to prevent...'taking' the feed and displaying it on their own site?
Nothing. Preventing reuse goes against the basic concept of RSS, which is to make it as easy as possible for anyone to do anything they want with it. It was designed from the ground up to be Really Simple to Syndicate, not Really Hard to Retransmit Without Permission.
You could restrict access to the feed itself to trusted users only by making them provide some credentials or pass in a key to the feed (e.g. yoursite.rss?mykey=abc123). But you cannot control use. Only access.
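One way to issue and check such keys without storing them is to derive each key from the user's id with an HMAC. This is only a sketch under assumptions: the `SECRET` value, the function names, and the idea of sending the user id alongside the key (for example `yoursite.rss?user=alice&mykey=...`) are all made up for illustration.

```python
import hashlib
import hmac

SECRET = b"change-me"  # hypothetical server-side secret; never ship it to clients


def feed_key(user_id):
    """Derive the per-user feed key, so no key table needs to be stored."""
    return hmac.new(SECRET, user_id.encode("utf-8"), hashlib.sha256).hexdigest()


def is_authorized(user_id, key):
    """Compare in constant time to avoid leaking key prefixes via timing."""
    expected = feed_key(user_id).encode("utf-8")
    return hmac.compare_digest(expected, key.encode("utf-8"))
```

As the answer says, this controls access only: once a trusted user has fetched the feed, nothing here stops them from republishing it.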
Be explicit about your license. It isn't a technology solution; as others have mentioned, RSS is an open technology, not DRM. But if you ask in each post that people who use this feed not repost it without giving credit, some people will respect the request.
Otherwise, you're better off putting your content behind a password and using a paid subscription model for distributing your content.
This is a DRM problem essentially. If you had some technique that you could put content on the web without having it redistributable, the music industry would love you.
It is possible to try to prevent redistribution. One technique you could try is embedding a signature of some sort into the feed for each user who you require to sign up. If the content is found on the web, you can identify and ban the user who redistributed your content.
This is avoidable too, by getting multiple accounts and normalizing the content to remove fingerprints. For the would-be pirate, this requires more effort than they may be willing to put in. Your signature could be a unique whitespace pattern, tiny variances in the timestamps on posts, misplaced pixels in videos, or any other thing you can vary slightly without end users noticing.
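The "tiny variances in the timestamps" variant of this fingerprinting could be sketched as follows. Everything here is hypothetical: the secret, the 30-second spread, and the `{post_id: datetime}` data shape are assumptions, and as noted above, a leaker who compares several accounts' feeds and normalizes the timestamps would defeat it.

```python
import hashlib
import hmac
from datetime import datetime, timedelta, timezone

SECRET = b"change-me"  # hypothetical server-side secret


def offset_seconds(user_id, post_id, spread=30):
    """Deterministic per-user, per-post shift in the range [0, spread)."""
    message = f"{user_id}:{post_id}".encode("utf-8")
    digest = hmac.new(SECRET, message, hashlib.sha256).digest()
    return digest[0] % spread


def watermark(posts, user_id):
    """posts maps post_id -> datetime; return this user's subtly shifted copy."""
    return {
        post_id: published + timedelta(seconds=offset_seconds(user_id, post_id))
        for post_id, published in posts.items()
    }


def identify(leaked, posts, user_ids):
    """Return the users whose watermark matches every leaked timestamp."""
    suspects = []
    for user_id in user_ids:
        marked = watermark(posts, user_id)
        if all(marked.get(post_id) == ts for post_id, ts in leaked.items()):
            suspects.append(user_id)
    return suspects
```

The more leaked posts there are to compare, the less likely it is that two users match by coincidence; with only one or two posts, `identify` may well return several suspects.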
Use .htpasswd (HTTP basic authentication).
Better yet, don't put something private in a public place where it's likely to get picked up automatically by software. Like others have said, it's a pretty odd question; if you're trying to figure something else out, you're better off being explicit about what you want to know.

Good RSS news feed for JavaFx news and resources

...Yes I've seen:
Best Resources for Learning JavaFX?
but it doesn't really answer the question. Maybe there just aren't any good resources at the moment?
UPDATE:
http://developers.sun.com/rss/javafx.xml is OK
If you have Google Reader you could use their Discover tool to find feeds, e.g. JavaFX feeds.
Technorati has a large selection
Google Blog Search also has some results.
Note that I don't even know what JavaFX is - your best bet, as with any topic, is to use the social search tools out there to find authors who write about your particular topic, and then subscribe to them if you like what you read.
Something I've taken to recently is using Google Alerts and Google Reader (any RSS reader will do) to get reports as they come in of searches for a particular topic. You get access to what people are searching for within a topic and what they eventually decide on. I've discovered a few interesting pages on PHP since I started this, it's a useful tool.
