Programming languages comparison for web data mining task - web-scraping

I need some help comparing different programming languages, such as: C++, Java, Python, Ruby and PHP, for a task which is related for web data mining (developing web crawler, string manipulations and etc.). I have a bit experience with PHP, and I think advantages that it has for this particular task are simple syntax, in-depth string parsing capabilities, networking functions, and portability, but don't know much about other languages and their advantages and disadvantages related for this particular task.

The specific language will not matter nearly as much as your familiarity. These days, all high-level languages will come with the basics. Unless you need it to be super-fast (you're probably going to be limited by download speed, not the speed that you parse the HTML) or have other constraints not listed, the language won't matter that much.
Just make sure that you use the libraries. In particular an HTML parsing library that is good with invalid markup (not an XML parser) and regular expressions where appropriate.

As a previous post implies - being familiar makes a big difference. I would also say look at what the language was originally designed to do - it gives a good idea of what its best at.
PHP - designed for server side scripting, not really ideal for this use.
Perl - Designed to pull text apart (good start) and excellent libraries - look at LWP and the modules under HTML such as HTML::Treebuilder - a good choice. Unrivalled selection of modules to plugin.
Python - A good choice, look at beautifulsoup and urllib
Ruby - also a good choice, look at hpricot a lot less mature than Perl or Python in terms of modules available.
I have written quite a bit of web spider/data mining software and have always used Perl. If I was starting from scratch today I might choose python.

Google's first crawler was written in Python 1.5
I'm no expert on other languages, but I would go with python and html5lib or Beautifulsoup.

Related

What language to use when prototyping a small game

I am currently considering writing a small game. It is essentially a map where you can zoom out and in, and in certain places click on info boxes where, at some point, I hope to integrate minigames. Granted, game might be overstating it. Think of it as an interactive map. The theme is how mathematics can be applied in peoples every day life to raise awareness on the usefullness of mathematics.
The question is how I as fast as possible can make a reasonable prototype. If I recieve enough positive response on this I might try to code "the real thing" and use the prototype to obtain funding.
However, I am at a crossroad. I want something to work rather fast and have some C++ experience coding optimization problems, mainly in c-style. I am not convienced, though, that coding it in C++ is the fast way to obtain a prototype. Though I have some experience coding in C++, but have no experience in coding any sort of GUI.
As I see it there is a number of possibilites:
C++, possibly using some library, such as boost or ???.
Start out purely webbased, using e.g. HTML 5 and java.
Python
C#/.NET
Others, such as?
I have to admit I have little experience with anything besides C++ and the STL.
So my question to this wonderful forum is basically, is there a language that provides a significant advantage? Also, any additional insight or comments is more than welcome!
Python is a simpler language than C++, and for prototyping it will help you focus on the task at hand. You can use Pygame, a game library built on the excellent cross-platform SDL library. It provides 2D graphics, input, and audio mixing features. SDL is mainly a C library (and thus compatible with C++), and there are a number of very useful libraries that integrate with it:
SDL_image for loading images in various formats
SDL_ttf for rendering text using TrueType fonts
SDL_mixer for audio mixing
SDL_net for networking
SDL_gfx for graphics drawing primitives
So if you prototype in Python using Pygame, there is a reasonable chance you’ll be able to port what you make over to C++ with minimal hassle, if and when you choose to do so.
Possible options:
Go with what you know the best. Anything else will require a learning curve, which may be weeks to months long. If you're willing to take that road in order to make your prototype, then there are some really great tools available.
BlitzBasic is a good way to go, and is basically designed to be for games
I've done little games in Java using Slick2D - but you'll need good grounding in object-oriented coding to work effectively in Java. If you've got that from C++, then you can see a tech demo I built in Slick2D called Pedestrians. It's open source, and has demo videos here.
You might also ask your question on https://gamedev.stackexchange.com/ - a Q/A site dedicated to game programming

Are there any code libraries or languages with built-in support for Metric Time?

Metric time is documented here.
I'm looking for any implementations of Metric time to Anglo-Babylonian Time (see article link) either as a library or built-into a programming language.
With all the "joke" programming languages out there, it's possible that someone has done this.
P.S. I realize that it is trivial. This question is for FUN.
The page you mentioned has a UMT clock on it, and the source is in JavaScript. This script seems like a good place to start, it is pretty short and easy to read. Translating it to another language would be cake. I somehow doubt that there are too many libraries, and almost certainly no languages with it built in.

What are the main issues in designing an interpreter for a functional language?

Suppose I want to implement an interpreter for a functional language. I would like to understand the issues involved in doing so and suitable literature that is available. This is a new language that is in early design stages, that is why the question is broad in scope.
For the purpose of this discussion we can assume that the purpose of the language is not important and that its functional features can be changed (even drastically) if it makes a significant difference in the ease of writing an interpreter.
The MIT website has an online copy of Structure and Interpretation of Computer Programs as well as videos of the MIT 6.001 lectures using Scheme, recorded at HP in 1986. These form a great introduction to language design.
I would highly recommend Structure and Interpretation of Computer Programs (SICP) as a starting point. This book will introduce the idea of what it means to write an interpreter (and a compiler), and is generally a must-read for anybody designing languages.
Implementing an interpreter for a functional language isn't likely to be too much different from implementing an interpreter for any other general purpose language. There's lexical analysis, parsing, AST construction, semantic analysis, plus execution (for a pure interpreter) or code generation and optimisation (for a compiler, even compiling to bytecode like Java/Perl/Python). SICP will introduce the difference between "applicative order" and "normal order" evaluation, which may be important for you in a pure functional context.
For just about any language interpreter or compiler, the main issues are the same, I think.
You need to decide certain basic characteristics of the language (semantics, not syntax), and the bulk of the design of the thing follows from that.
For example, does your language have
a type system? If so, what sorts of
types does it have? Is it going to be
statically typed, dynamically typed,
duck-typed?
What sort of expressions are you
planning to support? Do you need to
define an order of operations? Will
you even have operators?
What will you use as the run-time
representation of the program? Will
you convert the text to a byte-code
representation, or an AST, or a
tokenized form of the source text?
There are toolkits available to help take some of the tedium out of the actual parsing of text (ANTLR and Bison, to name two), but I don't know of anything that helps with the actual interpretation part of the task. I'm sure somebody will suggest something.
The main issue is having a semantics for the language you're implementing -- with that, the implementation becomes straightforward. Otherwise, this question is incredibly broad and hard to answer.
I'd recommend Essentials of Programming Languages as a good complement to SICP, particularly if you're interested in interpreters: Official EOPL site. You may want to check out the third edition-- the site hasn't been updated for it yet.
Edit: spam prevention is making me choose between links, so the official page is now unheated. It's easily Google-able, though.

Figuring out the right language for the job: branching out from C#

I work in a Microsoft environment, so I can use my C# hammer on any nails I come across. That being said, what languages (compiled, interpreted, scripting, functional, any types!) complement knowing C#, and for what purposes? For example, I've moved a lot of script functionality away from compiled console apps and into Powershell scripts. If you're an MS developer, have you found a niche in your world for other languages like F#, IronRuby, IronPython, or something similar, and what niche do they fill?
Note: this question is directed at the Microsoft dev people since I can't run off and start installing LAMP stacks around my company, and therefore having to support it forever. :) However, feel free to mention any other languages that you found interesting to fulfill a certain task/role in your world apart from your main language.
Python/Perl/Ruby/PowerShell are great supplements to C#/VB.NET. If your boss hands you a text file and says insert it into the database once or twice, then any of Perl/Python/Ruby (I'm not sure about powershell but I imagine it is not that much more difficult) should be fine to parse it. Either way, for your main applications you will probably be stuck in C#. You can use one of the more dynamic languages to do code generation in C#.
Since you are in a Microsoft Environment, probably your best chance at getting your solution accepted is PowerShell. Next to that I'd say IronPython or something else that integrates with the CLR. But main issue is that for someone else to maintain what you do, they would have to know whatever language you are using. MS in the future has plans to use PowerShell a lot more, so it is probably easier to justify PowerShell then say Python/Perl/Ruby.
If you are just processing a text file for one time use. Or creating a one time code generator to generate all the code and then intend to maintain the generated code, then it doesn't matter. You are the one who will consume the results and if you save time using Perl then more power to you. But if you are doing something that will be used over and over again (like an active code generator where you change the templates and run the generator instead of maintaining the generated code) then other developers working on what you did will need to know the language you used. It is much harder to argue learning Perl/Ruby/Python in a Microsoft Shop. But PowerShell seems like the easier argument. I think the MS grand plan is that eventually applications will expose more functionality for power shell through commandlets. Assuming this happens then PowerShell is even more of a no-brainer because it will expose tons of scriptable functionality that you won't get any other way.
A nice scripting language is always a good tool to have on you belt. See Ruby, or Python.
I use python for prototyping, since there's almost no turn around time between edits and actually running the new version of the code. I may even end up using it for a real project - the more I use it, the more I like it.
It will take some getting used to as a C# programmer, though - the indentation-defines-structure system it uses is a little weird at first.
Since you are in a MS shop, I would suggest PowerShell as a decent scripting language to learn. It plays well with C#.
I'm a big fan of Ruby too.
I'd like to second or third python. Specifically, IronPython (ttp://www.codeplex.com/Wiki/View.aspx?ProjectName=IronPython) lets you learn python but also gives you access to the .net framework goodies.
It's quite nice for scripting-related tasks so it'll probably be useful for your day-to-day coding life, and also a nice way to muck around in an experimental coding/prototyping way.
While it's a bit of a fringe language, I'm compelled to mention Erlang. Erlang is an excellent language to have in your toolbox since it's unusual strengths tend to compliment other programming platforms. Erlang is very useful for building distributed, concurrent, fault-tolerant systems. It's used a lot in the instant-messaging and telephony world where there's a need for distributed, yet interconnected architectures.
Maybe play around with Boo and see what you think.
Boo at Codehaus.org
Boo at Wikipedia
If your using the .Net framework the language is really not important as the compiler and interpreter create the same IL code in any case.
If you step away from the .Net world, I am of the opinion that development tools and languages are a tool box. I strive to use the right tools for the job at hand, taking into consideration what the skill base of the other developers are and what direction a company is seeking of course (I'm a consultant).
I'm with jjnguy. Try one of the scripting languages. Plus as a bonus, when you learn Ruby/PythonPearl, etc...it's a gateway drug...err language to developing for other environments.
Trying out languages outside of your normal toolbox will give you new ways of approaching things in your current favorite language. Even if you don't use them for serious projects languages like Perl (for data mangling), Lisp (functional programming), and Javascript (prototype based programming) will teach you new ways to think about problems in your current language.
As a web developer by trade, you might look into the XSLT/XPath family as for certain types of XML processing they can be very powerful tools.
Granted, in C# 3.x Linq2Xml exposes some similar functionality inline.
XSLT, however, can be a powerful way to separate data from presentation in your apps.
I am very interested in F# and some of the other new languages in the CLR/DLR. The DLR languages might be a lot better for your UI, because they don't make you cast a lot of stupid things.
However, I think that it is important to keep in mind tha learning a new language, especially in a new area, like functional programming, is always a good way to re-train your mind so that you are exposed to new concepts and you can code better in your language of choice, even if you never use those new languages.
Check out Boo - it runs on top of the .NET stack, but its syntax is more like Python.
To learn a new language that complements C#, I'd go with C++. You can use it in a 'way better than p/invoke' style to get access to unmanaged code from your C# apps. You can then start using it to write the memory-constrained apps, and/or performance-critical bits in, if you find some of your .NET applications start hogging all the RAM and/or CPU or just generally aren't as fast as you'd like.

A universal reflections API?

Some time back I was working on an algorithm that processed code, and required a reflections API. We were interested in its implementation for multiple languages, but the reflections API for a language would not work for any other language. So is there any thing like a "universal reflections API" that would work for all languages, or maybe for a few mainstream languages (.NET,Java,Ruby,Python)
If there isnt any, is it possible to build such a thing that can process classes from different languages.
How would you go about having a unified way to process OO code from multiple languages
I don't believe there is universal Reflection API. Any Reflection API depends on the metadata that the compiler generates for the language constructs and these can vary quite a lot from language to language, even though there is a common subset across multiple languages.
In .NET there is CodeDOM, which provides a way to generate a universal syntax tree and then serialize it as (C#, VB .NET etc...) code and/or compile it. Of course that's the mirror image of Reflection, but if anyone ever writes a tool to generate the AST directly from IL the functionality could start to overlap.
In any case its the closest thing I can think of.
A reflection API depends on the metadata generated for the code, so you can have a universal API for all languages on the JVM, or all languages on the CLR...but it wouldn't really be possible to make one that does Python, Java, and VB etc...
If you want a universal API, you need to step outside the language. See our DMS meta-tool for processing arbitrary languages, and answering arbitrary questions, including those you think of as reflection.
(Op asked for support for various languages: DMS has full parsers for C#, VB.net, Java, and Python. Ruby not yet in the list; we're working on it).

Resources