Does any open, simply extendible web crawler exist? - web-scraping

I am looking for a web crawler that is mature enough and can be easily extended. I am interested in the following features, or the possibility of extending the crawler to meet them:
partly just to read the feeds of several sites
to scrape the content of these sites
if the site has an archive I would like to crawl and index it as well
the crawler should be able to explore part of the Web for me and decide which sites match the given criteria
it should be able to notify me if things possibly matching my interests are found
the crawler should not kill servers by hitting them with too many requests; it should be smart about crawling
the crawler should be robust against freak sites and servers
Each of these can be done one by one without much effort, but I am interested in any solution that provides a customisable, extendible crawler. I have heard of Apache Nutch, but I am very unsure about the project so far. Do you have experience with it? Can you recommend alternatives?
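To make the "should not kill the servers" requirement concrete, here is a minimal sketch in Python of the two usual politeness checks: a per-host delay and a robots.txt test. It is not taken from Nutch, Bixo, Heritrix or any other crawler; the class name `PoliteScheduler`, the helper `allowed_by_robots` and the 2-second default delay are assumptions made for illustration.

```python
import time
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser


class PoliteScheduler:
    """Tracks, per host, the earliest time the next request may be sent."""

    def __init__(self, min_delay=2.0, clock=time.monotonic):
        self.min_delay = min_delay  # assumed delay between hits on one host
        self.clock = clock          # injectable so the logic can be tested
        self._next_allowed = {}     # host -> earliest allowed request time

    def wait_time(self, url):
        """Seconds the caller should still wait before requesting this URL."""
        host = urlparse(url).netloc
        return max(0.0, self._next_allowed.get(host, 0.0) - self.clock())

    def record_request(self, url):
        """Call right after a request so the host gets its quiet period."""
        host = urlparse(url).netloc
        self._next_allowed[host] = self.clock() + self.min_delay


def allowed_by_robots(robots_txt, user_agent, url):
    """Check a URL against the text of an already downloaded robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

A crawl loop would sleep for `wait_time(url)` before each fetch, skip any URL for which `allowed_by_robots` returns false, and call `record_request(url)` afterwards. Requests to different hosts do not delay each other.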

I've used Nutch extensively, when I was building the open source project index for my Krugle startup. It's hard to customize, being a fairly monolithic design. There is a plug-in architecture, but the interaction between plug-ins and the system is tricky and fragile.
As a result of that experience, and needing something with more flexibility, I started the Bixo project - a web mining toolkit. http://openbixo.org.
Whether it's right for you depends on the weighting of factors such as:
How much flexibility you need (+)
How mature it should be (-)
Whether you need the ability to scale (+)
If you're comfortable with Java/Hadoop (+)

A quick search on GitHub turned up Anemone, a web spider framework which seems to fit your requirements, particularly extensibility. Written in Ruby.
Hope it goes well!

I heartily recommend Heritrix. It is VERY flexible and I'd argue it is the most battle-tested freely available open source crawler, as it's the one the Internet Archive uses.

You should be able to find something that fits your needs here.

Related

Should I avoid using a CMS if I want to be able to quickly make good sites with more features/options to customize than Wordpress?

I want to become a better web developer and be able to quickly make good, fast, secure websites with lots of functionality, without being limited the way I would be with Wordpress. I don't see writing lots of plug-ins to reach the same functionality as a nice substitute for doing my own programming.
I have written a few games, quizzes and other scripts I'd like to be able to recycle or easily adapt to work with the CMS.
I currently have a multi-lingual website with a /nl/ and an /en/ part, which has a few games I wrote myself in PHP.
CakePHP has a very good CMS called Croogo. It's still quite a young project (still in beta and being actively developed), but the great thing about it is that it's a Cake app, so it's coded to the well-documented Cake standards.
Whereas customizing/extending Wordpress, Joomla, Drupal et al would mean you'd have to invest a huge amount of time learning about their respective frameworks, all for the sake of one part of any given website (the CMS), if you learn CakePHP, you're learning a much more advanced and flexible framework that can pretty much be used to do anything well beyond the confines of CMSes.
If you learn Cake (or if you already know Cake) you'll find that you already understand Croogo without having to invest much additional time at all. Code you write in Cake can easily be packaged to be a Croogo plugin and even if Croogo doesn't stay around for the long term (I hope it will!), it wouldn't be difficult to re-factor all the plugins you've written to work in any other Cake-based CMS that comes along in the future, or even your own Cake apps.
Croogo is pretty basic, but quite powerful. It has a Wordpress-like feel to it, it supports nice URLs via an amazing reverse-routing system, the /en/ /nl/ language thing you mentioned works out of the box and it's very easy to get any of the huge array of Cake components and plugins working in harmony with the CMS through the use of hooks.
I'm currently working on a project using Joomla and there are a ton of custom features that I need to implement. I usually have to create a plugin or module in that case. It's a pain. I'd much prefer doing most of this from scratch instead of hacking at the code. If I had a choice, I would not use a CMS. I hate them.
I think ultimately it's about long-term support. When you build a custom CMS in Cake or another framework, it is much easier and faster for you to customize and build the way you want to. This works great if this is a project you are planning on supporting (by this I mean bug/user support for when you unleash this CMS on non-devs). This can become a headache pretty fast when things need updates and clients are looking for fixes and changes. It's completely manageable, just more of a headache than something with community support.
That being said, if you are comfortable in Wordpress, the amount of support that exists in that community is huge. So oftentimes you can leave the project knowing updates for the CMS and plugins will come in at a regular pace.
TL;DR: If it's a project you know you (or people with the same comfort and skill level as you) will be supporting long term, then I would say build it yourself for ease of build and customization. If this is a one-off or something you plan on handing off to a client with little to no support, building inside a community-supported platform is best.
It really comes down to priorities. If you want to build a site really fast, a CMS is hard to beat, but you do not have the same control over the core as you do when you write it from scratch.
But you can do almost anything with plugins/modules, so the control is there if you are willing to work for it. If you write it yourself, you will be the only set of eyes most of the time, so in most cases it will be slower to implement new standards and security fixes (because you will need to find them first), whereas with a CMS you will have many people working to make it better and safer at the same time.
If you want to be well rounded I think you need to be able to do both; you can't always control what the customer wants to use.
You can build a site very quickly with a CMS like Joomla, but the problem is that even with over 7000 extensions, sometimes you won't find one for your particular purpose, and developing an extension can be really tough: it requires comprehensive knowledge of the framework. If all you need to do is manage content, a CMS is the best choice. If it is more like a web app and requires more interaction, go for a framework that provides the basic skeleton of your app. For example, many frameworks provide a scaffolding feature for CRUD operations that makes this a piece of cake. CakePHP, CodeIgniter and Kohana are some of the best PHP frameworks you can use.
Use a Chinese CMS such as DedeCms or PHPCMS and development gets easier!
I like PHPCMS; it works with nginx, FastCGI and MySQL on Linux or Windows.
I use it to build portal sites or groups of enterprise sites. The multi-site architecture and PHPSSO work well. The template engine is also strong enough.
Take a look at my big site: xinm123.com
Most important thing: it's open source.

Developing a newspaper site in Drupal

I need to develop a newspaper site in Drupal. I've already played around with Drupal a little, and I think I know which modules would best suit my purposes. Naturally, one of the modules I'll need to use most is Views, but I have a couple of questions:
Because this is a content-intensive site, I was wondering if using 5-6 views on each page to generate node teaser + thumbnail lists would impact performance adversely?
I am a designer with significant front-end development experience. Like I said I've played around with Drupal quite a bit and other than running into a few hurdles which I eventually overcame, for the most part I was able to get it to do what I needed it to. Having said that, does one also need strong programming skills to fully develop a site in Drupal?
Thank you very much for your help!
Jane
Views offers caching and Drupal also has block caching, which should help you improve performance. The SQL that Views generates is never as good as handwritten SQL, but if you make simple Views, the SQL is actually quite good and not a performance problem (unless you have millions of page views).
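The idea behind this kind of caching can be illustrated with a small, language-neutral sketch in Python. This is not how Views or Drupal's block cache is implemented; `TTLCache`, `get_or_compute` and the key names are hypothetical. It only shows why an expensive teaser list built once can be served many times until its lifetime expires.

```python
import time


class TTLCache:
    """Keeps computed values for a fixed lifetime, then recomputes them."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock   # injectable so expiry can be tested
        self._store = {}     # key -> (expiry time, value)

    def get_or_compute(self, key, compute):
        now = self.clock()
        entry = self._store.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # still fresh: skip the expensive work
        value = compute()    # e.g. run the query and render the teaser list
        self._store[key] = (now + self.ttl, value)
        return value
```

With five or six such lists on a page, each one is rebuilt at most once per lifetime rather than on every page view, which is why the number of views matters less than whether they are cached.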
If you can create the features you need with modules from Drupal.org, you don't need strong developer skills. But you do need to know some PHP to make a Drupal theme, which is what controls the layout of the site. Understanding the Drupal theming system will also be a great help, but it is not a requirement.
First off, check out openpublishapp.com for a Drupal distro that is made for publishers from the ground up, it's pretty hot.
To answer your questions:
1) As far as performance and Views go, having 5-6 views on a page is a normal requirement for a Drupal news site. The performance issues are usually handled by the Views/panel cache, a page cache like Varnish in front of the web server, object caches like Memcached (for the DB) and opcode caches like APC. If you don't want to learn all that off the bat, you should still be fine if your traffic isn't too intense (but go sign up at getpantheon.com for awesome hosting with all of that and the kitchen sink, and check out groups.drupal.org/pantheon)
2) If the functionality exists by way of core/contrib modules, to fully develop a site for the most part one only needs to understand enough PHP to theme, and often with starter themes like Fusion, and some of the others you hardly even need that, just an understanding of how they work and are extended (which is well documented). That said, if you want functionality that doesn't exist, you'll have to code it, or have someone code it for which strong programming skills are desired, but not necessarily required :)
I also recommend using OPENPUBLISH - https://www.acquia.com/solutions/publishing
On top of this you can make efficient use of
1. APC - PHP byte-code caching
2. Drupal Caching - block/template/view level caching
3. Boost - Caching module which doesn’t need any external tools
4. Varnish - HTTP accelerator
5. Memcache - Data intensive content.
Apart from this, you will also need to think carefully about the DEPLOYMENT ARCHITECTURE of the site, preferably an Acquia or Amazon environment.
The learning curve may vary depending on your current skills in PHP or Drupal. Using an already established distribution like OPENPUBLISH may help you minimize the dependence on custom coding.

Multiblog engine for asp.net

I know different forms of this question have been asked on this site multiple times, but I haven't seen a single answer that would satisfy my need.
I need an ASP.NET based blogging engine that would use SQL Server as a back end and allow multiple independent blogs in one app instance. I'm writing a community website for a major bank and blogging is the piece I'm not sure about.
Answers to other questions cover a broad spectrum, from BlogEngine.NET (doesn't support multiple blogs) to CommunityServer (a beast! blogging is just a small piece of it). I don't want to install a full-blown CRM and just use blogging; I want a blogging engine. I don't mind buying a commercial one, but I can't find one.
I'm pretty much stuck, and any ideas are highly appreciated!
I would consider Oxite if you are confident in your markup and knowledge of HTML. Also, you can extend it with HTML editors to ensure better markup. I personally love how flexible the framework is.
Here is the Oxite website with more info. BTW, it was used to build MIX online for Microsoft.

iWeb and MobileMe for a group content management system versus open source CMS

I work at a non profit and we are looking for a web solution to do the following:
External facing web site
Internal posting board for news, updates, pictures
Entitlements around user content
One of the folks at the non profit is a Mac person and suggests using iWeb and MobileMe for this functionality. I have no experience with these tools, but it seems like the following are more appropriate:
TikiWiki: http://info.tikiwiki.org/tiki-index.php
Drupal: http://drupal.org/
Joomla: http://www.joomla.org/about-joomla.html
I am a Windows .NET guy, so I would also prefer an ASP.NET solution here, but I want to avoid getting religious, as any solution that does the job should be fine.
My question is: is there anything to be concerned about with using the iWeb and MobileMe solutions, or any brick walls we are going to run into?
Also, are there PC-based solutions that will allow you to use these tools, or does everyone need a Mac?
This is only a partial answer to a multi-part question, but:
Drupal and Joomla are platform-independent. The software itself runs on PHP (presumably on a server, rather than a workstation), but you interact with the systems via a web interface. Drupal in particular lets you choose from many different editing options, via its Wysiwyg module.
Personally, I think Drupal is an outstanding choice for nonprofit orgs (this being my own background) that have tech-skilled staff, and Joomla is an outstanding fit for nonprofits that don't have much in-house web expertise.
As for iWeb and MobileMe:
Compare them to Adobe Contribute. They're good software for what they do, but building organizational websites is not what they do.
What you've got is basically a souped-up MS Word that writes W3C compliant HTML. Things like members-only content, interactivity, etc are going to be pretty difficult to manage, and you'll be looking for another solution soon anyway if your site gets larger than a few dozen pages.
In short - avoid iWeb and MobileMe for this type of implementation. You may have a "Mac person" in the office (for now), but these products are designed more for individual/home use and not businesses/organizations. Eventually you'll run into any number of "brick walls".
A few other options (amongst many) if you don't have a web-designer on staff and want a hosted solution would be to look at Wordpress or Squarespace.
Thanks

Web application integration with Drupal

We want to build a web application, that is specific to our domain, but also includes forums, blogs, etc in this application. Some integration points to Twitter and Facebook are also required.
There will also be a desktop application that connects to our web application for uploading data and downloading configuration and reports.
The question is, can we extend Drupal to host both the regular modules and our web application? (There will be business entities and their properties and daily data uploaded from the desktop application)
Or can Drupal be integrated with external applications? As an example, users and roles need to be the same and consistent across both. We may also want data from the web application searchable in Drupal.
I know this is a bit vague, but I cannot reveal more. I am very new to content management and I just wanted to know if someone has built this kind of application.
I'll try to rephrase what you wrote, so you can check that I got your question right. You basically need to create a web application that:
Implements some of the standard functionality of Drupal
Has some custom functionality that should "blend into" the Drupal one (same users, same permissions, etc...)
Is able to upload/download content (or data) from desktop applications.
If I got you right, the short answer is: yes, you can do that with Drupal.
Now for the extensive one:
- Drupal has literally thousands of modules, so I expect you to get most of the things you want by simply installing the right combination of readily available modules.
- Of course, any custom functionality can easily be implemented in form of a module too (quite standard thing these days).
- The interaction with a desktop application is normally implemented via web services rather than by querying the DB directly. Drupal comes natively with an XML-RPC server and client, but you can scale up to SOAP, if you wish, via a couple of contrib modules.
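As a rough illustration of the web-service route, here is a minimal Python sketch of what a desktop client's XML-RPC call could look like, using only the standard library. The method name `reports.upload`, the record fields and the endpoint URL are hypothetical; the methods a given Drupal site actually exposes depend on which modules are installed.

```python
import xmlrpc.client


def encode_call(method_name, *params):
    """Build the XML-RPC request body a client would POST to the server."""
    return xmlrpc.client.dumps(params, methodname=method_name)


def decode_call(xml_body):
    """Server-side view: recover (params, method_name) from a request body."""
    return xmlrpc.client.loads(xml_body)


def upload_daily_data(endpoint_url, records):
    """Send records to a hypothetical 'reports.upload' method.

    Needs a live XML-RPC endpoint; both the URL and the method name
    are assumptions for illustration.
    """
    proxy = xmlrpc.client.ServerProxy(endpoint_url)
    return proxy.reports.upload(records)
```

The point of the design is that the desktop application only ever sees named remote methods, so the site can check users and permissions on every call and the database schema can change without breaking the client.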
Some additional thoughts:
If you choose to use Drupal and you start from scratch, be aware that you and your team will need to dedicate some time and effort to understanding how Drupal works. Although, unlike Palantir, I stuck with Drupal, I agree with her/him that Drupal gets complex right off the bat. This is the price you pay for a platform that, rest assured, is very flexible, extremely pluggable and rock-solid (otherwise it wouldn't have been used to redesign the White House site, nor would Drupal have won the "best PHP CMS" award for the second year in a row, I suppose).
The good news is: there are some excellent books out there, and I would certainly recommend "Pro Drupal Development" for an in-depth and all-around explanation of the system. Just be sure to get the 2nd edition, as the first deals with the now obsolete 5 series. That said...
A very good thing about Drupal, at least in my opinion, is that most of the tweaks you might need to do to an existing functionality can be implemented by hooking into the original code from a custom module too. This IMO is the biggest advantage of Drupal: you never have to touch other developers' code to achieve your goals, and this means - for example - that you will be able to keep your core and contrib modules up-to-date without breaking any customisation you might have done.
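The hook idea can be sketched in a few lines of Python. This is a language-neutral illustration of the pattern, not Drupal's actual PHP API; `HookRegistry`, `build_login_form` and the `form_alter` hook name are made up for the example.

```python
from collections import defaultdict


class HookRegistry:
    """Lets independent modules alter data without editing each other."""

    def __init__(self):
        self._hooks = defaultdict(list)  # hook name -> registered callbacks

    def register(self, hook_name, callback):
        self._hooks[hook_name].append(callback)

    def alter(self, hook_name, data):
        """Pass data through every callback registered for this hook."""
        for callback in self._hooks[hook_name]:
            data = callback(data)
        return data


hooks = HookRegistry()


def build_login_form():
    """Stands in for core or contrib code that you never modify."""
    form = {"fields": ["username", "password"]}
    return hooks.alter("form_alter", form)


def add_captcha(form):
    """Stands in for your custom module: it only adds to the form."""
    return {**form, "fields": form["fields"] + ["captcha"]}


hooks.register("form_alter", add_captcha)
```

Because `build_login_form` is untouched, it can be replaced by a newer version at any time and the customisation in `add_captcha` keeps applying.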
Drupal is heavy. Compared to other CMSes it takes plenty of processing power and RAM from your server, and unless you are going to have a very small site, I recommend deploying it in conjunction with nginx rather than Apache.
Drupal scales well, thanks to good caching and "throttling" mechanisms. Strange as it might sound, Drupal scales very well on high-traffic websites, so big increases in traffic do not necessarily imply big increases in resource usage.
The out-of-the-box user experience on a Drupal site is quite poor. A massive amount of work is being done on this at the moment (here and here (video)), but the improvements won't be available until D7 is released [soon, but then you will have to wait for the modules to be ported], so it is advisable to allocate some time to creating an administrative theme if the admins of your website aren't the technical type.
At the end of the day, my advice is: if your site is going to be big and complex, with complicated business logic and lots of functionality, then Drupal is probably a good candidate. If, on the other hand, your site is a small-scale one with standard functionality plus a few custom bits, maybe Wordpress / Joomla could fit your needs better [not because they are 'less powerful' but because Drupal's strengths would go unused in this case, while the simpler architecture of Wordpress/Joomla would probably be an advantage in this scenario]
Other options would certainly be frameworks like CakePHP or Django, for example, but that - IMO - is a totally different approach to the matter, I would say.
Short answer: Drupal is well suited to building something like that, especially if you are willing to integrate your app/logic into Drupal as a suite of custom modules. The other way, integrating Drupal into an external application, can also be done, but will give you more friction, as Drupal's architecture is pretty much geared towards being a framework in its own right.
Longer answer: My opinion/experience is pretty much the opposite of Palantir's. I've been working almost exclusively with Drupal for a year now, in the context of two fairly complex/'enterprisy' projects (after several years of 'on the side' usage for smaller things). While I agree that it imposes some rigid rules (but not limits!), I consider this to be an advantage, as those rules give clear guidance and provide proven ways of doing things. The three parts Palantir mentions are good examples of this:
Menu system - Provides a well structured and effective dispatching mechanism that is easy to extend with your own stuff, while giving huge flexibility to tweak/manipulate existing/default paths. (Note that 'menu system' in Drupal denotes the whole topic of managing your URL space, not just the subset of 'visible' menus that is usually associated with the term)
Forms API - A declarative approach to web forms, with a well designed processing workflow and a whole lot of built in security features that you would otherwise have to take care of yourself. Also highly extensible, with straight options to adjust/extend already existing forms on demand, add new validation rules to any field or whole forms, multi step forms, javascript based form adjustments, etc.
Translation system - This is pretty complex, simply because internationalization is fricking hard to do. But it is built in, again giving clear guidance on how to do things in order to work in a generic way (though there are problems with quite some contributed modules that are not using/supporting it the way they should).
I could give more examples for parts where I appreciate the 'rules', but this post is getting long already, and I still have to cover some downsides ;)
So to sum up the positive part - if I were given the rough specs you posted, I'd say 'no problem' and go with Drupal, being confident that it would be a solid foundation for the custom parts, while providing all the 'standards' like forum, blogs, twitter/facebook integration and many, many others in the form of already existing solutions (even though those might need some adaptation/tweaking).
Downsides: As always, there are flaws, and some of them are substantial, depending on requirements/circumstances.
Learning curve - Drupal is quite complex, and 'grokking' its concepts takes time. 'Playing with it for a week', as Palantir suggests, will certainly give you a general feeling/broad impression, but it is in no way enough to allow for a serious judgement of its pros and cons, as those will only surface while coding in/for it. So if you are already deeply familiar with an established web development framework, this might be an issue. If you have to learn one anyways, this should be less of a problem.
Database restrictions - As of Drupal 6, database support is MySQL or PostgreSQL only, using a Drupal specific 'abstraction layer' (which obviously isn't one ;)
Drupal 7 will move to PDO, which should (finally) end this questionable state.
Test/Stage/Production migrations - Part of Drupal's 'out of the box' flexibility is due to many things being configurable in the administrative backend, which implies that many important configuration settings are stored in the database. This makes migration of data and/or configuration between several instances pretty difficult/tedious once you have left the (early) stages of development where you can get away with complete dump/restore operations (see e.g. this question & answers)
These are the main ones for me, but you'll probably find more :)
I worked for over a year using Drupal extensively, but I ended up abandoning it. Drupal, and other CMSes out there, have very rigid limits and rules. I'd use Drupal for projects where you have simple requirements and few or no business rules. Drupal gets complicated almost immediately when you want to do complex things (pay particular attention to the menu system, forms, and the translation system if you need to be multilingual).
If your system will really be large, with all the things you mentioned, then I'd rather use a PHP framework to implement your business logic, and integrate external products as they fit (a forum, a blog, a twitter client, etc...).
But the advice is: don't trust anyone :) Download it, and play with it for a week. You'll be able to make up your mind and be more confident about your choice!
As Drupal is open source, you can pretty much do as you wish with it. A couple of points though:
Changing Drupal's user/role structure would be tedious and unnecessary. You would need to have your desktop application authenticate from Drupal's MySQL database.
Drupal has hundreds of plugins for just about everything, so Drupal could no doubt run the whole "web" side of things including visitor stats etc. You would just need, again, to connect your desktop application to the correct MySQL tables and show the data as desired.
Don't forget to check other content management systems such as Joomla! (and many others). Each has its pros and cons. www.opensourcecms.com allows you to easily test CMSs and I've used it extensively in the past.
Just be sure to map out all the components first. Every hour planning up front saves many hours of headaches later.
