Downloading Old RSS Contents

I need to have the news of some sites like CNN, BBC, and Reuters for my research. I want to know how I can write a program to download RSS contents of these sites from almost 10 years ago. I used GoogleReaderAPI but it seems silly.

RSS feeds normally do not contain data that old. You will likely need to subscribe to a service (AP, All Headline News, Reuters) that lets you search its archives.

Related

Wordpress form to add hotels

I am trying to create a webpage to add hotels (database for hotels) using wordpress. (name, address, pictures, reviews, ...). Ideas?
Another idea for categorizing hotels with cities?
Thanks.

You will probably need to further explain your question so that we can be more help to you. If your intention is to just add a list of hotel names to your website then I don't know of any WordPress plugins for that, but you might want to check out Expedia's EAN http://developer.ean.com/
You need to sign up for their affiliate program, which is very easy. You get immediate access to their hotel databases, plus you can make availability/booking requests with several response options, including JSON, which is more convenient and lightweight than the (unfortunately) more widespread XML. This is something you will need to hire a developer to do.
If you want to allow clients/visitors to be able to book and search for hotels by just installing a plugin try https://wordpress.org/plugins/wp-auto-hotel-finder/.

How Does an RSS Feed Work?

How's it going?
I've found a lot of detailed answers to specific problems with RSS feeds, but I can't really figure out how you USE one, basically.
Could someone explain?
I see the RSS feed icon at the top of a lot of Wordpress sites, including my own, but when I click it, it just seems to be a long XML file. I don't know what to do with it, or even why it would be there.
How do you use this? Are you meant to hit it with an API request, or is there a particular kind of software that you use?
Cheers

Before telling you what RSS is, let me describe a common problem that many people have.
Say there is a bunch of sites that you really like and it's sort of a
daily routine for you to go through them. They may be a news site, your
friend's blog, but also Craigslist because you're currently looking for
a new house and maybe a weather site to know how late you should stay
at work :)
The first thing you do when you get to work, is open your web browser
and these sites in new tabs. It's not particularly cumbersome because
there are just 4 sites. But think about it: maybe there is a new blog
that you start to like and oh, these cartoons are really funny. Maybe
there is also a bit of financial info that you're interested in and
the pictures that your brother is posting to Flickr every couple of days:
they just had a new baby! Also, as you're trying to buy a house, you'd
love a little raise and you've figured that your boss really likes it
when you tell her that you've read about your company in the news or
when you tell her about a new competing product... There is also
StackOverflow. You're desperately trying to get this "expert" badge
and boost up your reputation: this may help with your boss too or even
when you're looking for a new job.
Opening all these tabs is starting to take a toll and you keep
forgetting an important one. You're also slowly getting tired of the
different reading experience that all these sites have: small fonts,
large fonts, ads all over...etc. Now you have a problem.
Imagine there is a tool that does the following: you can tell it what sites you care about, and then, this tool will look up the new stuff for you. It will show everything in a nice looking format. It should also help you identify what's really worth seeing ASAP or maybe have some kind of "serendipity" mode that you can go into and find interesting stuff that you would have missed otherwise. The tool will obviously send you to the original sites should you need more info about any particular story or classified...
This tool exists. It's usually called a reader, mostly because it lets you read more things online. Oftentimes you'll see them called "RSS readers", because RSS is what they use to get the information from all these sites. RSS is the pipe. You as a user probably don't need to know about it, but that's what the readers depend on. In an ideal world, when you're on a site you like, you should just hit a "follow" button and then you'd be redirected to your reader of choice. Later, when new content is added, you'll get it straight in your reader.
To get a bit into more technical details, RSS (like Atom) is an XML flavor. It's a collection (mostly reverse chronological) of entries. Entries have at least a title and a link to the actual story. They should also include a unique identifier and could have other elements like a description, an image, tags, author information... etc.
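To make the structure concrete, here is a minimal sketch of what a reader does with that XML. The sample feed and the parse_items helper are made up for illustration, and a real reader would also handle Atom, namespaces and malformed feeds.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-entry RSS 2.0 feed, newest entry first.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Blog</title>
    <link>https://example.com/</link>
    <item>
      <title>Second post</title>
      <link>https://example.com/2</link>
      <guid>https://example.com/2</guid>
      <description>Newest entry comes first.</description>
    </item>
    <item>
      <title>First post</title>
      <link>https://example.com/1</link>
      <guid>https://example.com/1</guid>
    </item>
  </channel>
</rss>"""


def parse_items(rss_text):
    """Return one dict per <item> with its title, link, guid and description."""
    root = ET.fromstring(rss_text)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title"),
            "link": item.findtext("link"),
            "guid": item.findtext("guid"),
            "description": item.findtext("description"),  # None when absent
        })
    return items


for entry in parse_items(SAMPLE_RSS):
    print(entry["title"], "->", entry["link"])
```

The reader then remembers which guid values it has already shown you and only surfaces the new ones.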
RSS is great because it's content agnostic. It can be used to represent a lot of different things (as described in the little story) and decouples the publishing platform from the subscribing platform: they don't even know the other one exists. RSS is their lingua franca.

I wrote a blog post about this very question not long ago. Here's the link if you're interested in reading my personal interpretation. https://www.rss.com/whatisrss

An XML file holds all the content of a page with no presentational markup. The XML represents the data in its rawest, most descriptive form. Many readers can interpret XML sources from a variety of places, and each formats the data in its own way.

I want to create an RSS feed that is customizable

I want to create a dropdown of RSS feeds so that users can pick and choose the feeds they want and a custom feed would be created. Is this possible using straight-up HTML and JavaScript, or do I need a server technology? There are 7 separate feeds so the possible combinations are 7! - far too many for me to individually code into if statements and separate feeds. Is there a program that will generate the possible feeds for me automatically after I update one of them? Then I could just upload the updated xml files.
Right. So I set up my xml files, say I have one for birthdays, one for deaths, and one for mid life crises. So that is three xml files with three separate links for rss feeds. Now what I want is for people to be able to check off the ones to which they wish to subscribe rather than hitting each one separately. So I would have a form with three checkboxes and a submit button. I could do this with javascript by having 6 separate xml feeds, one for each possible combination. But if I have 4 feeds then I need to set up 24 feeds, and 5 would be 120 possible feed combinations.
So the question becomes: is there some software or library that will handle this computation for me and crank out RSS mixes/blends, similar to what some RSS mixing software seems to do? The problem with the services and software I have seen is that they provide blending for people subscribing to feeds but not for providers. I can see in my head how easily this could be done programmatically even though it would spit out a lot of xml and html/javascript.
I guess another way about it would be for them to sign up for multiple feeds simultaneously but I'm not sure if that can be done.
If I am making no sense I apologize. I have never seen this done so it might not be possible. I am just going to go with the page with a bunch of RSS links.
Thanks for everyone's responses. I appreciate it.

Just because there are 7 options doesn't mean you need to write 7! if statements. You only need to check if each one of the options is set, and output something appropriately.
So, yes, you need to do this server side. And it's not at all difficult.
Where are you stuck, specifically? Your question is missing a few details.
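As a rough sketch of the "check each option" idea: with n feeds there are 2^n - 1 non-empty selections (127 for seven feeds), but the server never has to enumerate them. It only loops over whatever the user ticked. The feed names, the FEEDS data and build_combined_feed below are hypothetical; in practice the items would come from your own xml files and the selection from the submitted form.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-ins for the three source feeds in the question.
FEEDS = {
    "birthdays": [{"title": "Ada turns 30", "link": "https://example.com/b/1"}],
    "deaths": [{"title": "In memoriam", "link": "https://example.com/d/1"}],
    "midlife": [{"title": "Bought a motorbike", "link": "https://example.com/m/1"}],
}


def build_combined_feed(selected, feeds=FEEDS):
    """Build one RSS document containing the items of every selected feed."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = "Custom feed: " + ", ".join(selected)
    ET.SubElement(channel, "link").text = "https://example.com/"
    ET.SubElement(channel, "description").text = "Items from the feeds you picked"
    for name in selected:
        # One lookup per ticked feed, not one branch per combination.
        for entry in feeds.get(name, []):
            item = ET.SubElement(channel, "item")
            ET.SubElement(item, "title").text = entry["title"]
            ET.SubElement(item, "link").text = entry["link"]
    return ET.tostring(rss, encoding="unicode")


print(build_combined_feed(["birthdays", "midlife"]))
```

A server-side handler could read the ticked checkboxes from the query string (say, a made-up ?feeds=birthdays,midlife), pass the list to this function and return the result as the feed, so every subscriber gets a stable URL for their own mix.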

How Do I Fetch All Old Items on an RSS Feed?

I've been experimenting with writing my own RSS reader. I can handle the "parse XML" bit. The thing I'm getting stuck on is "How do I fetch older posts?"
Most RSS feeds only list the 10-25 most recent items in their XML file. How do I get ALL the items in a feed, and not just the most recent ones?
The only solution I could find was using the "unofficial" Google Reader API, which would be something like
http://www.google.com/reader/atom/feed/http://fskrealityguide.blogspot.com/feeds/posts/default?n=1000
I don't want to make my application dependent on Google Reader.
Is there any better way? I noticed that on Blogger, I can do "?start-index=1&max-results=1000", and on WordPress I can do "?paged=5". Is there any general way to fetch an RSS feed so that it gives me everything, and not just the most recent items?

RSS/Atom feeds do not allow historic information to be retrieved. It is up to the publisher of the feed to provide it if they want to, as in the Blogger and WordPress examples you gave above.
The only reason that Google Reader has more information is that it remembered it from when it came up the first time.
There has been some discussion of something like this as an extension to the Atom protocol, but I don't know if it is actually implemented anywhere.

As the other replies here mentioned, a feed may not provide archival data but historical items may be available from another source.
Archive.org’s Wayback Machine has an API to access historical content, including RSS feeds (if their bots have downloaded it). I’ve created the web tool Backfeed that uses this API to regenerate a feed containing concatenated historical items. If you'd like to discuss the implementation in detail please get in touch.
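For anyone who wants to try the Wayback route by hand, here is a small sketch that is not Backfeed's actual implementation. It assumes the Wayback Machine's CDX index endpoint and its JSON output (a header row followed by one row per capture); the helper names are made up, and the parameters should be checked against the current CDX documentation before relying on them.

```python
import json
from urllib.parse import urlencode

# Assumed endpoint of the Wayback Machine CDX index.
CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def cdx_query_url(feed_url):
    """URL asking the CDX index for successful captures of feed_url as JSON."""
    params = {
        "url": feed_url,
        "output": "json",
        "filter": "statuscode:200",
        "collapse": "digest",  # skip consecutive captures with identical content
    }
    return CDX_ENDPOINT + "?" + urlencode(params)


def snapshot_urls(cdx_json_text):
    """Turn a CDX JSON response into raw-capture URLs, oldest first."""
    rows = json.loads(cdx_json_text)
    if not rows:
        return []
    header, captures = rows[0], rows[1:]
    ts, orig = header.index("timestamp"), header.index("original")
    # The "id_" suffix asks for the capture without the Wayback toolbar markup.
    return [
        "https://web.archive.org/web/%sid_/%s" % (row[ts], row[orig])
        for row in sorted(captures, key=lambda r: r[ts])
    ]
```

You would fetch cdx_query_url(...) with any HTTP client, pass the response body to snapshot_urls, then download and parse each snapshot as an ordinary feed.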

In my experience with RSS, the feed is compiled from the last X items, where X is a variable. Certain feeds may have the full list, but for bandwidth's sake most places are likely limiting it to just the last few items.
The likely reason Google Reader has the old info is that it stores it on its side for users to retrieve later.

Further to what David Dean said, the RSS/Atom feeds will only contain what the publisher of the feed has up at that moment, and someone would need to be actively collecting this information in order to have any historical information. Basically Google Reader was doing this for free, and when you interacted with it you could retrieve this stored information from the Google database servers.
Now that they have retired the service, to my knowledge you have two choices. You either have to start collection of this information from your feeds of interest and store the data using XML or some such, or you could pay for this data from one of the companies who sell this type of archived feed information.
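If you go the "collect it yourself" route, the core of it is small. This is only a sketch, assuming plain RSS 2.0 and using SQLite instead of XML files for storage; open_archive and store_feed are hypothetical names, and you would call store_feed from a scheduled job with the freshly downloaded feed text.

```python
import sqlite3
import xml.etree.ElementTree as ET


def open_archive(path=":memory:"):
    """Open (or create) the archive database; pass a file path to persist it."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS items ("
        "guid TEXT PRIMARY KEY, title TEXT, link TEXT, pub_date TEXT)"
    )
    return db


def store_feed(db, rss_text):
    """Insert items not seen before; return how many new rows were added."""
    added = 0
    for item in ET.fromstring(rss_text).iter("item"):
        link = item.findtext("link")
        guid = item.findtext("guid") or link  # fall back to the link as identity
        cur = db.execute(
            "INSERT OR IGNORE INTO items VALUES (?, ?, ?, ?)",
            (guid, item.findtext("title"), link, item.findtext("pubDate")),
        )
        added += cur.rowcount  # 1 for a new item, 0 when the guid already exists
    db.commit()
    return added
```

Because rows are keyed on the guid, polling the same feed every hour only ever adds the items that appeared since the last run, and nothing is lost when old items rotate out of the live feed.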
I hope this information helps somebody.
Seán

Here is another potential solution, one that might not have been available when the question was originally asked and that shouldn't require any specific service.
Find the URL of the RSS feed you want and use waybackpack to get the archived urls for that feed.
Use FeedReader or a similar library to pull down the archived RSS feed.
Take the URLs from each feed and scrape them as you wish. If you're going way back in time it's possible there might be some dead links.
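The last step above could look roughly like this once the archived copies are downloaded. It is a sketch using only the standard library rather than any particular feed library, it assumes plain RSS (Atom entries are namespaced and would need their own handling), and article_links is a made-up helper name.

```python
import xml.etree.ElementTree as ET


def article_links(snapshots):
    """Collect each distinct <item> link across snapshots, in first-seen order."""
    seen = set()
    links = []
    for rss_text in snapshots:
        try:
            root = ET.fromstring(rss_text)
        except ET.ParseError:
            continue  # some archived captures are error pages, not XML
        for item in root.iter("item"):
            link = (item.findtext("link") or "").strip()
            if link and link not in seen:
                seen.add(link)
                links.append(link)
    return links
```

Consecutive snapshots overlap heavily, so de-duplicating on the link keeps you from scraping the same article many times.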

All previous answers more or less relied on existing services to still have a copy of that feed or the feed engine to be able to provide older items dynamically.
There is, though, another, admittedly pro-active and rather theoretical way to do so: let your feedreader use a caching proxy which semantically understands RSS and/or Atom feeds and caches them on a per-item basis, up to as many items as you configure.
If the feedreader doesn't poll feeds regularly, the proxy could fetch known feeds on a schedule of its own so as not to miss an item in highly volatile feeds like the one from User Friendly, which has only one item and changes every day (or at least used to). Without that, if the feedreader crashed or lost its network connection while you were away for a few days, you might lose items. Having the proxy fetch those feeds regularly (e.g. from a data center instead of from home, or on a server instead of a laptop) allows you to run the feedreader only now and then without losing items that were posted after your feedreader last fetched the feed but rotated out again before it fetched it the next time.
I call that concept a Semantic Feed Proxy and I've implemented a proof of concept called sfp. It is, though, not much more than a proof of concept and I haven't developed it further. (So I'd be happy about hints to projects with similar ideas or purposes. :-)

Why does this problem exist?
Most RSS readers need to import feeds through a live URL, which makes things harder for sites that are unindexed on Wayback Machine.
The reason why Wayback Machine feeds can be imported is that the reader can regularly poll the server for updates according to its defined TTL configuration. The reader compares the current datetime with the RSS feed posts pubDate or lastBuildDate keys in the XML response. We can't hack the machine datetime to work around the datetime resolution because the current datetime is fetched live.
I've outlined an alternative solution without Wayback below. Unfortunately, I have not been able to find a universal solution for all feed sources.
Alternative Solution(s)
In my experience, NOT ALL feeds are partial though. The XML doesn't have to specify the datetime of each post. This means the RSS Reader doesn't have a datetime to filter the feed with. An example of this feed type can be found here.
This kind of reading experience is useful when chronological order is irrelevant, and the content doesn't need to be sorted. This approach is useful for sites where ALL the content is valuable, and the linked Essays of Paul Graham is a good example.
If the site has a generic, non-chronological feed option, subscribe to that RSS instead (the preferred option).
Download the linked timestamped .rss file, strip datetimes and host the file on your own server. Note, we can implement this via an AWS Lambda.
Set up a server that fetches the RSS from live.
Strip the pubDate tags from the XML file on fetch.
Host the modified RSS on your own server.
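The stripping step itself is only a few lines. A sketch, assuming a plain RSS 2.0 document; strip_pub_dates is a hypothetical name, and fetching and hosting are left out.

```python
import xml.etree.ElementTree as ET


def strip_pub_dates(rss_text):
    """Return the feed with every <pubDate> removed from its <item> elements."""
    root = ET.fromstring(rss_text)
    # Materialise the item list first so the tree is not modified mid-iteration.
    for item in list(root.iter("item")):
        for pub_date in item.findall("pubDate"):
            item.remove(pub_date)
    return ET.tostring(root, encoding="unicode")
```

One caveat: ElementTree may rewrite namespace prefixes when it serialises, so feeds that rely on extensions such as content:encoded are worth checking after the round trip.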
Note
These are suboptimal solutions because the ordering of the items is lost; however, I wanted to provide a potential alternative to the Wayback Machine.
In addition, some existing answers require advanced system-design workarounds and more prework, and in some cases they are outdated (Google Reader is shut down). I hope this is helpful for those who really need a complete feed list. Constructing new RSS feeds from the original RSS file is not too hard.

What are the legalities of repackaging other's RSS feeds into a new presentation? [closed]

I know that services like my.yahoo.com allow you to add content from RSS feeds to your personal page, but in general they are links which draw the user to the site which provided the feed. What are the legalities and implications of using RSS feeds as a data source for a site which repackages the data so that it is unrecognizable as coming from said source?
Does credit need to be given? Is it a copyright violation? What is ethical?
What if credit is stated? Does this change your opinion? Does permission need to be granted?

Of course it's ethical! What on earth is RSS for if not for syndication, into as many varied and wonderful forms as developers can think up?
Permission, of course, must be asked for - in the form of a "GET /feed/ HTTP/1.0". And it must be granted in the form of a "200 OK" - or denied in the form of a "403 Forbidden".
Screen scraping is at least morally ambiguous, since perhaps the author only wants humans, and not programs, to view the content (assuming you believe it's within the rights of the author to make that distinction). But RSS? Seriously? No one forces anyone to make a syndicated, easily-mungable format of their content. It's not just useful for new presentations, it's meant for it.

In my opinion it depends on the data source company as to whether they allow it in their terms and conditions.
It probably also depends on where your servers are located (i.e. Which legal framework they fall under.)
Unless it is allowed explicitly or you have written consent I don't think it's ethical.
It also depends on how big your legal department is.

I would say publishing someone else's work without giving them credit will definitely lead to lawsuits or at least strongly worded cease and desist letters (followed by lawsuits).

Well, legalities aside, it isn't ethical to not give credit to the source. The AP, for example, wants credit.

The difference between what you are proposing and services like my.yahoo.com, Netvibes, Bloglines, Google Reader, etc., is that you are the one choosing the feeds, whereas with those other services the user is specifying the feed, and is therefore aware of its original source.
Even though content is being published in feeds, and is therefore expected to be used with services like the ones I mentioned above, the publisher still retains the copyright over their content, and would usually expect it to be republished as-is. It is also customary to provide the link back to the original source of the content, and republishing content without it would be frowned upon at the very least.

I've wondered the same thing for a while and am very hesitant to republish RSS feeds. FeedForAll says there is no inherent right to reproduce content. You're asking whether it's OK to mangle the content; I'm pretty sure it's not alright to even reproduce the content. I think it would be like putting
<iframe src='www.stackoverflow.com'> </iframe>
on my website.
BTW, this is not a subjective question and it is important. I'd re-ask this question or edit the title to get more relevant feedback.

Talk to your lawyer.

From AP's RSS site...
AP provides these RSS feeds to individuals for personal, noncommercial use under the following terms and conditions. All others, including AP members or Press Association subscribers must obtain express written permission prior to use of these RSS feeds. AP provides these RSS feeds at no charge to you for your personal, noncommercial use. You agree not to associate the RSS feeds with any content that might harm the reputation of The Associated Press. AP provides this content "as is" and AP shall not be held liable for your use of the information or the feeds. TO THE FULLEST EXTENT ALLOWED, AP DISCLAIMS ALL WARRANTIES INCLUDING WARRANTIES FOR MERCHANTABILITY, NON-INFRINGEMENT AND FITNESS FOR A PARTICULAR PURPOSE. You agree to use the RSS feeds only to provide headlines, each with a functional link to the associated AP story that shall display the full content immediately (e.g., no jump pages or other intermediate or interstitial pages). You further agree not to frame or otherwise control the browser window (if any) in which the AP content opens, including limiting the size or position of such window. You agree to provide proper attribution to The Associated Press in reasonable proximity to your use of the RSS feed(s), and you agree that you will not modify the format or branding of the headlines, digests and other information provided in the RSS feeds. The RSS feeds may not be spliced into or otherwise redistributed by third-party RSS providers. No content, including any advertisements or other promotional content, shall be added to the RSS feeds. AP reserves the right to object to your presentation of the RSS feeds and the right to require you to cease using the RSS feeds at any time. AP further reserves the right to terminate its distribution of the RSS feeds or change the content or formatting of the RSS feeds at any time without notice to you. 
By accessing the RSS feeds or the XML instructions provided herein, you indicate that you understand and agree to these terms and conditions. Note: If you do not qualify to use the RSS feeds under this license or are an AP member or Press Association subscriber and wish to use these feeds, please contact AP Digital.

From Reuters RSS site...
Reuters offers RSS as a free service to any individual user or non-profit organization, subject to the following terms and conditions:
Use will be for non-commercial purposes.
Use is limited to platforms in which a functional link is made available allowing immediate display of the full article or video on the Reuters.com platform, as specified in the feed.
Use is accompanied by proper attribution to Reuters as the source.
By accessing our RSS service you are indicating your understanding and agreement that you will not use Reuters RSS in contravention of the above conditions. Reuters reserves the right to discontinue this service at any time and further reserves the right to request the immediate cessation of any specific use of its RSS service.
If you would like Reuters news for your commercial website, please visit
about.reuters.com/media.
