It seems like such a simple thing, but I can't find any obvious solutions...
I want to take two or three feeds and merge them into a single RSS feed, to be published internally on our network.
Is there a simple tool out there that will do this? Free or commercial.
Update: I should have mentioned that I'm looking for a Windows application that will run as a scheduled service on a server.
There are a whole pile of options here: http://allrss.com/rssremixers.html.
Maybe http://www.planetplanet.org/ will do what you want.
It's for creating blog aggregations like Planet Lisp.
In Google Reader, create a folder, add your feeds to it, and then share that folder as an RSS feed.
:-)
Works while you're asleep!
Yahoo Pipes could be nice. It depends on how private you want the resulting feed to be.
For a 100% offline solution, investigate Atomisator. It's a Python framework that basically does offline what Yahoo Pipes does online.
If you're using PHP, the SimplePie library will do this. Here's a tutorial.
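For a sense of how little is involved, here is a minimal Python sketch of the merge itself, using only the standard library. It assumes plain RSS 2.0 input with no namespaces (Atom feeds are not handled), and the function names and channel metadata are illustrative rather than taken from any of the tools mentioned above.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

EPOCH = datetime.fromtimestamp(0, tz=timezone.utc)


def item_date(item):
    """Return an item's pubDate as an aware datetime, or the epoch if unusable."""
    text = item.findtext("pubDate")
    if not text:
        return EPOCH
    try:
        parsed = parsedate_to_datetime(text)
    except (TypeError, ValueError):
        return EPOCH
    if parsed.tzinfo is None:
        # Dates without a usable offset are treated as UTC (an assumption).
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


def merge_feeds(feed_documents, title, link, description):
    """Combine the <item> elements of several RSS 2.0 documents, newest first."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = title
    ET.SubElement(channel, "link").text = link
    ET.SubElement(channel, "description").text = description

    items = []
    for document in feed_documents:
        items.extend(ET.fromstring(document).iter("item"))
    items.sort(key=item_date, reverse=True)
    channel.extend(items)
    return ET.tostring(rss, encoding="unicode")
```

Downloading the source feeds and writing the result somewhere your web server can serve it are left out; a scheduled task could run a script like this every so often.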
We have a couple of relatively simple websites running on Adobe CQ 5.5 that were developed by a third party. I'm pretty familiar with how CQ works, but I'm working with somebody else's code here and I need to be able to search through all components in the system for a particular string.
The issue is that I can't seem to find a way to search across all of the various .jsp files stored with the various system components. I would have figured that the query tool in CRXDE Lite would have done the trick with something like this:
/jcr:root//*[jcr:contains(., 'Find this exact string in a JSP')] order by @jcr:score
But I've had no luck.
What I am looking for is some sort of global search that includes JSP files. Is that possible? Were I using a regular Java system, any IDE worth the download would be able to do this.
Thanks.
It might not be the easiest way, but you can use the VLT tool to check out the repository into your filesystem. Then you can search it using whatever tool you prefer. It might even be faster in the long run.
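Once the repository is on disk, any search tool works. As one option, here is a small Python sketch that walks a checkout and reports every `.jsp` line containing a given string. The function name and the checkout path in the usage note are placeholders, not anything VLT defines.

```python
from pathlib import Path


def find_in_files(root, needle, pattern="*.jsp"):
    """Yield (path, line_number, line) for each line under root that contains needle."""
    for path in sorted(Path(root).rglob(pattern)):
        if not path.is_file():
            continue
        # errors="replace" keeps the scan going if a file is not valid UTF-8.
        with path.open(encoding="utf-8", errors="replace") as handle:
            for line_number, line in enumerate(handle, start=1):
                if needle in line:
                    yield path, line_number, line.rstrip("\n")
```

For example, `for path, number, line in find_in_files("my-checkout", "Find this exact string in a JSP"): print(path, number, line)` would list the matches, where `my-checkout` stands for wherever you checked the content out.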
I don't have the actual answer but I suppose the JSPs are indexed via a filter that strips out some of their content.
It should be possible to configure the repository to index them as is instead, based on the info at http://wiki.apache.org/jackrabbit/IndexingConfiguration and http://jackrabbit.apache.org/jackrabbit-text-extractors.html
Sorry about the vagueness of this answer - I know the basic principles but to provide the details I would need more time than I can afford now ;-)
I'm curious about website scraping (how it's done, etc.). Specifically, I'd like to write a script to perform the task for the site Hype Machine.
I'm actually a fourth-year Software Engineering undergraduate; however, we don't really cover any web programming, so my understanding of JavaScript, RESTful APIs, and all things web is pretty limited, as we're mainly focused on theory and client-side applications.
Any help or directions greatly appreciated.
The first thing to look for is whether the site already offers some sort of structured data, or if you need to parse through the HTML yourself. Looks like there is an RSS feed of latest songs. If that's what you're looking for, it would be good to start there.
You can use a scripting language to download the feed and parse it. I use Python, but you could pick a different scripting language if you like. Here are some docs on how you might download a URL in Python and parse XML in Python.
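As a rough illustration of those two steps, here is a sketch using only Python's standard library. It assumes an RSS 2.0 feed without namespaces; the function names are made up for this example, and the feed URL is something you would need to look up on the site yourself.

```python
import urllib.request
import xml.etree.ElementTree as ET


def fetch(url, timeout=10):
    """Download a URL and return the raw response body as bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def parse_items(feed_xml):
    """Return the title and link of every <item> in an RSS 2.0 document."""
    root = ET.fromstring(feed_xml)
    return [
        {
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
        }
        for item in root.iter("item")
    ]
```

Calling `parse_items(fetch(feed_url))` would then give you a list of dictionaries to work with.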
Another thing to be conscious of when you write a program that downloads a site or RSS feed is how often your scraping script runs. If you have it run constantly so that you'll get the new data the second it becomes available, you'll put a lot of load on the site, and there's a good chance they'll block you. Try not to run your script more often than you need to.
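One simple way to enforce that in a long-running script is to refuse to fetch again until a minimum interval has passed. The sketch below is only an illustration: the class name and the ten-minute figure in the usage note are assumptions, and a sensible interval depends on how often the feed really changes.

```python
import time


class PollThrottle:
    """Allow an action at most once per min_interval_seconds."""

    def __init__(self, min_interval_seconds, clock=time.monotonic):
        self.min_interval = min_interval_seconds
        self.clock = clock  # injectable so the logic can be tested without sleeping
        self.last_run = None

    def ready(self):
        """Return True and record the time if enough time has passed, else False."""
        now = self.clock()
        if self.last_run is not None and now - self.last_run < self.min_interval:
            return False
        self.last_run = now
        return True
```

With `throttle = PollThrottle(600)`, the script would only download the feed when `throttle.ready()` returns True, so at most once every ten minutes.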
You may want to check the following books:
"Webbots, Spiders, and Screen Scrapers: A Guide to Developing Internet Agents with PHP/CURL"
http://www.amazon.com/Webbots-Spiders-Screen-Scrapers-Developing/dp/1593271204
"HTTP Programming Recipes for C# Bots"
http://www.amazon.com/HTTP-Programming-Recipes-C-Bots/dp/0977320677
"HTTP Programming Recipes for Java Bots"
http://www.amazon.com/HTTP-Programming-Recipes-Java-Bots/dp/0977320669
I believe the most important thing to analyze is what kind of information you want to extract. If you want to extract entire websites like Google does, probably your best option is to look at tools like Nutch from Apache.org (nutch.apache.org) or Flaptor's solution (http://ww.hounder.org). If you need to extract particular areas of unstructured documents (websites, docs, PDFs), you can probably extend Nutch plugins to fit your particular needs.
On the other hand, if you need to extract particular text or clipped areas of a website, with rules set against the DOM of the page, what you probably need to check is tools like mozenda.com. With those tools you will be able to set up extraction rules to scrape particular information from a website. Keep in mind that any change to the web page can cause an error in your robot.
Finally, if you are planning to develop a website using information sources, you could purchase information from companies such as spinn3r.com, where they sell particular niches of information ready to be consumed. You will be able to save lots of money on infrastructure.
Hope it helps!
sebastian.
Python has the feedparser module, located at feedparser.org, which handles RSS and Atom in their various flavours. No reason to reinvent the wheel.
I am trying to find a list of what ISBNs are in use. I guess I could scrape a website like Amazon but that would waste a lot of bandwidth. Is there a better (free) way?
Maybe you could use the remote API for isbndb.com.
Trying to keep an enormous ISBN list up-to-date yourself is quite a huge task if you ask me.
Just for the record: note that if you actually want an ISBN for your publication, you need to go to the official agency in your country. In the US this is http://www.isbn.org/ , but it varies by country. In Australia, for example, it is here.
This might help: What is the most complete (free) ISBN API?
As the accepted answer states there is also an API to search Amazon but it's not actually supposed to be used in the way you wish to.
I ended up using a partial list from http://my.linkbaton.com/isbn/
Yes, try isbndb.com
I've been working with Pipes for a while now, and I am trying to output more than the basic structure of:
Item
title
link
description
guid
pubDate
I want to publish more data in the RSS feed under different fields but cannot figure out if this is even possible. Any ideas?
This post at the Yahoo Pipes blog goes through the basics of building a complex RSS feed with a couple examples.
http://blog.pipes.yahoo.net/2009/06/10/new-create-rss-and-rss-item-builder-modules/
I know this is not related to Yahoo Pipes, but if you are looking for ETL tools, I found Yahoo Pipes very limiting. I have had the best luck with Open Kapow, just in case you have not heard of it or used it.
What's the best library to use to generate RSS for a webserver written in Common Lisp?
Most anything will probably do. Personally, I've been using xml-emitter for my blog's Atom feed, which has worked out well so far.
Just choose whichever XML generation library you like and hack away, I'd say. As others have remarked, RSS is simple; it's little work to generate it manually.
That said, I recommend not generating plain strings directly. Having to deal with quoting data is more of a hassle than installing an XML library, and it's also insecure in case your feed contains data submitted by visitors of your website.
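The quoting problem is the same in any language. As a sketch in Python rather than Common Lisp (the function name is illustrative), letting an XML library build the element means that special characters in visitor-submitted text are escaped for you:

```python
import xml.etree.ElementTree as ET


def build_item(title, description):
    """Serialize an RSS <item>; the library escapes &, < and > in the text."""
    item = ET.Element("item")
    ET.SubElement(item, "title").text = title
    ET.SubElement(item, "description").text = description
    return ET.tostring(item, encoding="unicode")
```

Concatenating the same strings by hand would let a title containing `</title>` or a bare `&` produce a malformed or misleading feed.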
xml-emitter says it has an RSS 2.0 emitter built in.
CL-WHO can generate XML pretty easily.
I am not aware of any specific RSS library, but the format is fairly simple, so any library that can write XML will do at that level.
You could, for example, have a look at the nuclblog project (http://cyrusharmon.org/projects?project=nuclblog), as it has the capability to generate an RSS feed for the blog entries it maintains.
cl-rss-gen is a tiny library (LGPL, depends on CL-WHO) that does some boilerplate work for you (supports generating RSS entries directly from CLOS class instances by specifying which slot maps to which attribute).
Take a look at the code before using it; it may give you an idea of how it works and whether you need it or not (as other posters said, you can generate RSS yourself with CL-WHO or any XML generation library).
Oh, and sorry for resurrecting a four-year-old thread, but if anyone searches for a similar library, he/she will find the answer here.