How can I "think OOP" when using R? - r

I am a statistics graduate student who works a lot with R. I am familiar with OOP in other programming contexts. I even see its use in various statistical packages that define new classes for storing data.
At this stage in my graduate career, I am usually coding some algorithm for some class assignment--something that takes in raw data and gives some kind of output. I would like to make it easier to reuse code, and establish good coding habits, especially before I move on to more involved research. Please offer some advice on how to "think OOP" when doing statistical programming in R.

I would argue that you shouldn't. Try to think about R in terms of a workflow. There are some useful workflow suggestions on this page:
Workflow for statistical analysis and report writing
Another important consideration is line-by-line analysis vs. reproducible research. There's a good discussion here:
writing functions vs. line-by-line interpretation in an R workflow

Two aspects of OOP are data and the generics / methods that operate on data.
The data (especially the data that is the output of an analysis) often consists of structured and inter-related data frames or other objects, and one wishes to manage these in a coordinated fashion. Hence the OOP concept of classes, as a way to organize complex data.
Generics and the methods that implement them represent the common operations performed on data. Their utility comes when a collection of generics operate consistently across conceptually related classes. Perhaps a reasonable example is the output of lm / glm as classes, and the implementation of summary, anova, predict, residuals, etc. as generics and methods.
Many analyses follow familiar workflows; here one is a user of classes and methods, and gets the benefit of coordinated data plus familiar generics. Thinking 'OOP' might lead you to explore the methods available for an object, methods(class="lm"), rather than its structure, and might help you to structure your workflows so they follow the well-defined channels of established classes and methods.
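As a minimal sketch of what being a "user of classes and methods" looks like in practice (the model formula is just a placeholder, using the built-in mtcars data):
fit <- lm(mpg ~ wt + hp, data = mtcars)   # an object of class "lm"
class(fit)                                 # "lm"
summary(fit)                               # the summary generic dispatches to summary.lm
predict(fit, newdata = head(mtcars))       # predict dispatches to predict.lm
residuals(fit)                             # residuals dispatches to residuals.lm
methods(class = "lm")                      # list every generic with a method for "lm"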
Implementing a novel statistical methodology, one might think about how to organize the results into a coherent, inter-related data structure represented as a new class, and write methods for the class that correspond to established methods on similar classes. Here one gets to represent the data internally in a way that is convenient for subsequent calculation rather than as a user might want to 'see' it (separating representation from interface). And it is easy for the user of your class (as Chambers says, frequently yourself) to use the new class in existing workflows.
It's a useful question to ask 'why OOP' before 'how OOP'.

You may want to check these links out: first one, second one.
And if you want to see some serious OO code in R, read the manual page for ReferenceClasses (the so-called R5 object orientation), and take a look at the Rook package, since it relies heavily on ReferenceClasses. BTW, Rook is a good example of reasonable usage of R5 in R coding. Previous experience with Java or C++ could be helpful, since R5 method dispatch differs from S3. Actually, S3 OO is very primitive, since the actual "class" is saved as an object attribute, so you can change it quite easily.
S3: <method>.<class>(<object>)
R5: <object>$<method>
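To make the contrast concrete, here is a small sketch (the class, field and method names are invented for illustration). In S3 the class is just an attribute and dispatch works by naming convention; with Reference Classes (R5) the methods live in the object and are called with $:
# S3: dispatch is <generic>.<class>, and the class is only an attribute
x <- list(mean = 3.2, sd = 0.4)
class(x) <- "mysummary"                     # you can set (or change) the class freely
print.mysummary <- function(x, ...) cat("mean:", x$mean, " sd:", x$sd, "\n")
print(x)                                    # dispatches to print.mysummary

# R5 / Reference Classes: <object>$<method>, with reference semantics
Account <- setRefClass("Account",
  fields  = list(balance = "numeric"),
  methods = list(
    deposit = function(amount) {
      balance <<- balance + amount          # modifies the object in place
    }
  ))
a <- Account$new(balance = 100)
a$deposit(50)
a$balance                                   # 150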
Anyway, if you can grab a copy, I recommend: "R in a Nutshell", chapter 10.

I have limited knowledge of how to use R effectively, but here is an article that allowed even me to walk through using R in an OO manner:
http://www.ibm.com/developerworks/linux/library/l-r3/index.html

I take exception to David Mertz's statement "The methods package is still somewhat tentative from what I can tell, but some moderately tweaked version of it seems certain to continue in later R versions", from the article linked in BiggsTRC's answer. In my opinion, programming with classes and methods, and using the methods package (S4), is the proper way to "think OOP" in R.
The last paragraph of chapter 9.2 "Programming with New Classes" (page 335) of John M. Chambers' "Software for Data Analysis" (2008) states:
"The amount of programming involved in using a new class may be much more than that involved in defining the class. You owe it to the users of your new classes to make that programming as effective as possible (even if you expect to be your own main user). So the fact that the programming style in this chapter and in Chapter 10 ["Methods and Generic Functions"] is somewhat different is not a coincidence. We're doing some more serious programming here."
Consider studying the methods package (S4).
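For what it's worth, here is a minimal S4 sketch in the spirit of what Chambers describes (the class, slot and generic names are invented for illustration):
setClass("bootResult",
  slots = c(estimate = "numeric", replicates = "numeric"))

setGeneric("ci", function(object, ...) standardGeneric("ci"))

setMethod("ci", "bootResult", function(object, level = 0.95) {
  alpha <- 1 - level
  quantile(object@replicates, c(alpha / 2, 1 - alpha / 2))
})

b <- new("bootResult", estimate = 1.7, replicates = rnorm(1000, 1.7, 0.2))
ci(b)    # confidence interval computed by the method for "bootResult"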

Beyond some of the other good answers here (e.g. the R in a Nutshell chapter, etc), you should take a look at the core Bioconductor packages. BioC has always had a focus on strong OOP design using S4 classes.


UML Equivalents for Functional and Iterative Paradigms [duplicate]

More specifically, how do you model a functional program, or one developed in a functional style (without classes), using a diagram rather than a textual representation?
Functional programmers generally don't have a lot of use for diagrams. Many functional programmers (but not all) find that writing down types is a good way to encapsulate the design relationships that OO programmers put into UML diagrams.
Because mutable state is rare in functional programs, there are no mutable "objects", so it is not usually useful or necessary to diagram relationships among them. And while one function might call another, this property is usually not important to the overall design of a system but only to the implementation of the function doing the calling.
If I were feeling a strong need to diagram a functional program I might use a concept map in which types or functions play the role of concepts.
UML isn't only class diagrams.
Most of the other diagram types (Use case diagrams, activity diagrams, sequence diagrams...) are perfectly applicable for a purely functional programming style. Even class diagrams could still be useful, if you simply don't use attributes and associations and interpret "class" as "collection of related functions".
Functional programmers have their own version of UML, it is called Category Theory.
(There is a certain truth to this, but it is meant to be read with a touch of humour).
UML is a compendium of different types of modeling. If you are talking about the Object Diagram (Class Diagram), well, you are not going to find anything that fits your desired use. But if you are talking about an Interaction Diagram (Activity Diagram) or Requirements Diagram (Use Case Diagram), of course they will help you, and they are part of the UML base.
To model a functional program, using a diagram, and not textual representation, you can use notation like the one used to program in Viskell or Luna
I realize this is an old thread but I'm not understanding the issue here.
A class is merely an abstraction of a concept that ties the functionality of its methods together in a more human-friendly way. For instance, the class WaveGenerator might include the methods Sine, Sawtooth and SquareWave. All three methods are clearly related to the WaveGenerator class. However, all three are also stateless. If designed correctly, they don't need to store state information outside of the method. This makes them stateless objects, which - if I understand correctly - makes them immutable functions, a core concept in the functional paradigm.
From a conceptual perspective I don't see any difference between
let Envelope Sine = ...
and
let Envelope Generator.Sine = ...
other than the fact that the latter might provide greater insight into the purpose of the function.
UML is an object-oriented approach, because at the graphical level you cannot define functional modeling. A trick is to add constraints and notes directly at the model level rather than in the diagrams. I mean that you can write full functional documentation on each model element directly in the metamodel and only display an object view using the UML editor.
This may be silly, but I found a demo, in French, on exactly this subject, using EclipseUML Omondo:
OCL and UML 2.2 (3-minute demo, in French): http://www.download-omondo.com/regle_ocl.swf
This demo explains how to add constraints directly on methods at the metamodel level. The interesting point is that using a single model for the entire project is flexible enough to extend traditional UML and avoid additional SysML, BPMN and DSL models, because all the information is built on top of the UML 2.2 metamodel. I don't know whether it will be a success, but this initiative is very interesting because it reduces modeling complexity and opens new frontiers!
I haven't actually tried modelling a large system in UML and then going for a functional implementation, but I don't see why it shouldn't work.
Assuming the implementation was going to be Haskell, I would start by defining the types and their relationships using a class diagram. Allocate functions to classes by their main argument, but bear in mind that this is just an artefact of UML. If it was easier to create a fictional singleton object just to hold all the functions, that would be fine too. If the application needs state then I would have no problem with modelling that in a state chart or sequence diagram. If I needed a custom monad for application-specific sequencing semantics then that might become a stereotype; the goal would be to describe what the application does in domain terms.
The main point is that UML could be used to model a program for functional implementation. You have to keep in mind a mapping to the implementation (and it wouldn't hurt to document it), and the fit is far from exact. But it could be done, and it might even add value.
I guess you could create a class called noclass and put in functions as methods. Also, you might want to split noclass into multiple categories of functions.

What is more in the spirit of the Julia language and philosophy?

I recently started programming in Julia for research purposes. Going through it I started loving the syntax, I positively experienced the community here in SO and now I am thinking about porting some code from other programming languages.
Working with highly computational expensive forecasting models, it would be nice to have them all in a powerful modern language as Julia.
I would like to create a project and I am wondering how I should design it. I am concerned both from a performance and a language perspective (i.e., would it be better to create modules, submodules and functions, or would something else be preferred? Is it better to use dictionaries or custom types?).
I have looked at different GitHub projects in my field, but I haven't really found a common standard. Therefore I am wondering: what is more in the spirit of the Julia language and philosophy?
EDIT:
It has been pointed out that this question might be too generic. Therefore, I would like to focus it on how it would be better structuring modules (i.e. separate modules for main functions and subroutines versus modules and submodules, etc.). I believe this would be enough for me to have a feel about what might be considered in the spirit of the Julia language and philosophy. Of course, additional examples and references are more than welcome.
The most you'll find is that there is an "official" style-guide. The rest of the "Julian" style is ill-defined, but there are some ways to heuristically define it.
First of all, it means designing the software around multiple dispatch and the type system. Software that follows a Julian design philosophy usually won't define a bunch of functions like test_pumpkin and test_pineapple; instead it will dispatch on test for the types Pumpkin and Pineapple. This allows for clean, understandable code. It will break tasks up into small type-stable functions, which allows for good performance. It will likely also be written very generically, allowing the user to pass anything that is a subtype of AbstractArray or Number, and using the power of dispatch to allow the software to work on numbers its author has never even heard of. (In this respect, custom types are recommended over dictionaries when you need performance. However, for a type you have to know all of the fields up front, which means some things still require dictionaries.)
Software that follows a Julian design philosophy may also implement a DSL (Domain-Specific Language) to offer the user a simpler interface. Instead of requiring the user to conform to archaic standards derived from C/Fortran, or to write large repetitive inputs, the package may provide macros that let the user state the problem for the software to solve more directly.
Other items that are part of the Julian design philosophy are up for much debate. Is proper Julia code devectorized? I would say no: the loop-fusing broadcast (the dot syntax) is a powerful way to write MATLAB-style "vectorized" code and have it perform like a devectorized loop. However, I have seen others prefer devectorized styles.
Also note that Julia is very different from something like Python where in Julia, you can essentially "build your own standard way of doing something". Since there's no performance penalty for functions/types declared in packages rather than Base, you can build your own Julia world if you want, using macros to define your own "function-like" objects, etc. I mean, you can re-create Java styles in Julia if you wanted.

Is there a software-engineering methodology for functional programming? [closed]

Software Engineering as it is taught today is entirely focused on object-oriented programming and the 'natural' object-oriented view of the world. There is a detailed methodology that describes how to transform a domain model into a class model with several steps and a lot of (UML) artifacts like use-case-diagrams or class-diagrams. Many programmers have internalized this approach and have a good idea about how to design an object-oriented application from scratch.
The new hype is functional programming, which is taught in many books and tutorials. But what about functional software engineering?
While reading about Lisp and Clojure, I came about two interesting statements:
Functional programs are often developed bottom up instead of top down ('On Lisp', Paul Graham)
Functional programmers use maps where OO programmers use objects/classes ('Clojure for Java Programmers', talk by Rich Hickey).
So what is the methodology for a systematic (model-based ?) design of a functional application, i.e. in Lisp or Clojure? What are the common steps, what artifacts do I use, how do I map them from the problem space to the solution space?
Thank God that the software-engineering people have not yet discovered functional programming. Here are some parallels:
Many OO "design patterns" are captured as higher-order functions. For example, the Visitor pattern is known in the functional world as a "fold" (or if you are a pointy-headed theorist, a "catamorphism"). In functional languages, data types are mostly trees or tuples, and every tree type has a natural catamorphism associated with it.
These higher-order functions often come with certain laws of programming, aka "free theorems".
Functional programmers use diagrams much less heavily than OO programmers. Much of what is expressed in OO diagrams is instead expressed in types, or in "signatures", which you should think of as "module types". Haskell also has "type classes", which are a bit like interface types.
Those functional programmers who use types generally think that "once you get the types right, the code practically writes itself."
Not all functional languages use explicit types, but the How To Design Programs book, an excellent book for learning Scheme/Lisp/Clojure, relies heavily on "data descriptions", which are closely related to types.
So what is the methodology for a systematic (model-based ?) design of a functional application, i.e. in Lisp or Clojure?
Any design method based on data abstraction works well. I happen to think that this is easier when the language has explicit types, but it works even without. A good book about design methods for abstract data types, which is easily adapted to functional programming, is Abstraction and Specification in Program Development by Barbara Liskov and John Guttag, the first edition. Liskov won the Turing award in part for that work.
Another design methodology that is unique to Lisp is to decide what language extensions would be useful in the problem domain in which you are working, and then use hygienic macros to add these constructs to your language. A good place to read about this kind of design is Matthew Flatt's article Creating Languages in Racket. The article may be behind a paywall. You can also find more general material on this kind of design by searching for the term "domain-specific embedded language"; for particular advice and examples beyond what Matthew Flatt covers, I would probably start with Graham's On Lisp or perhaps ANSI Common Lisp.
What are the common steps, what artifacts do I use?
Common steps:
Identify the data in your program and the operations on it, and define an abstract data type representing this data.
Identify common actions or patterns of computation, and express them as higher-order functions or macros. Expect to take this step as part of refactoring.
If you're using a typed functional language, use the type checker early and often. If you're using Lisp or Clojure, the best practice is to write function contracts first including unit tests—it's test-driven development to the max. And you will want to use whatever version of QuickCheck has been ported to your platform, which in your case looks like it's called ClojureCheck. It's an extremely powerful library for constructing random tests of code that uses higher-order functions.
For Clojure, I recommend going back to good old relational modeling. Out of the Tarpit is an inspirational read.
Personally I find that all the usual good practices from OO development apply in functional programming as well - just with a few minor tweaks to take account of the functional worldview. From a methodology perspective, you don't really need to do anything fundamentally different.
My experience comes from having moved from Java to Clojure in recent years.
Some examples:
Understand your business domain / data model - equally important whether you are going to design an object model or create a functional data structure with nested maps. In some ways, FP can be easier because it encourages you to think about data model separately from functions / processes but you still have to do both.
Service orientation in design - actually works very well from a FP perspective, since a typical service is really just a function with some side effects. I think that the "bottom up" view of software development sometimes espoused in the Lisp world is actually just good service-oriented API design principles in another guise.
Test Driven Development - works well in FP languages, in fact sometimes even better because pure functions lend themselves extremely well to writing clear, repeatable tests without any need for setting up a stateful environment. You might also want to build separate tests to check data integrity (e.g. does this map have all the keys in it that I expect, to balance the fact that in an OO language the class definition would enforce this for you at compile time).
Prototyping / iteration - works just as well with FP. You might even be able to prototype live with users if you get extremely good at building tools / DSLs and using them at the REPL.
OO programming tightly couples data with behavior. Functional programming separates the two. So you don't have class diagrams, but you do have data structures, and you particularly have algebraic data types. Those types can be written to very tightly match your domain, including eliminating impossible values by construction.
So there aren't books and books on it, but there is a well established approach to, as the saying goes, make impossible values unrepresentable.
In so doing, you can make a range of choices about representing certain types of data as functions instead, and conversely, representing certain functions as a union of data types instead so that you can get, e.g., serialization, tighter specification, optimization, etc.
Then, given that, you write functions over your ADTs such that you establish some sort of algebra -- i.e. there are fixed laws which hold for these functions. Some are maybe idempotent -- the same after multiple applications. Some are associative. Some are transitive, etc.
Now you have a domain over which you have functions which compose according to well behaved laws. A simple embedded DSL!
Oh, and given properties, you can of course write automated randomized tests of them (ala QuickCheck).. and that's just the beginning.
Object Oriented design isn't the same thing as software engineering. Software engineering has to do with the entire process of how we go from requirements to a working system, on time and with a low defect rate. Functional programming may be different from OO, but it does not do away with requirements, high level and detailed designs, verification and testing, software metrics, estimation, and all that other "software engineering stuff".
Furthermore, functional programs do exhibit modularity and other structure. Your detailed designs have to be expressed in terms of the concepts in that structure.
One approach is to create an internal DSL within the functional programming language of choice. The "model" then is a set of business rules expressed in the DSL.
See my answer to another post:
How does Clojure approach Separation of Concerns?
I agree more needs to be written on the subject on how to structure large applications that use an FP approach (Plus more needs to be done to document FP-driven UIs)
While this might be considered naive and simplistic, I think "design recipes" (a systematic approach to problem solving applied to programming as advocated by Felleisen et al. in their book HtDP) would be close to what you seem to be looking for.
Here, a few links:
http://www.northeastern.edu/magazine/0301/programming.html
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.86.8371
I've recently found this book:
Functional and Reactive Domain Modeling
I think is perfectly in line with your question.
From the book description:
Functional and Reactive Domain Modeling teaches you how to think of the domain model in terms of pure functions and how to compose them to build larger abstractions. You will start with the basics of functional programming and gradually progress to the advanced concepts and patterns that you need to know to implement complex domain models. The book demonstrates how advanced FP patterns like algebraic data types, typeclass based design, and isolation of side-effects can make your model compose for readability and verifiability.
There is the "program calculation" / "design by calculation" style associated with Prof. Richard Bird and the Algebra of Programming group at Oxford University (UK), I don't think its too far-fetched to consider this a methodology.
Personally while I like the work produced by the AoP group, I don't have the discipline to practice design in this way myself. However that's my shortcoming, and not one of program calculation.
I've found Behavior Driven Development to be a natural fit for rapidly developing code in both Clojure and SBCL. The real upside of leveraging BDD with a functional language is that I tend to write much finer grain unit tests than I usually do when using procedural languages because I do a much better job of decomposing the problem into smaller chunks of functionality.
Honestly if you want design recipes for functional programs, take a look at the standard function libraries such as Haskell's Prelude. In FP, patterns are usually captured by higher order procedures (functions that operate on functions) themselves. So if a pattern is seen, often a higher order function is simply created to capture that pattern.
A good example is fmap. This function takes a function as an argument and applies it to all the "elements" of the second argument. Since it is part of the Functor type class, any instance of a Functor (such as a list, graph, etc...) may be passed as a second argument to this function. It captures the general behavior of applying a function to every element of its second argument.
Well,
Generally, many functional programming languages have been used at universities for a long time for "small toy problems".
They are getting more popular now since OOP has difficulties with "parallel programming" because of "state". And sometimes a functional style is better suited to the problem at hand, as with Google MapReduce.
I am sure that when the functional guys hit the wall [try to implement systems bigger than 1,000,000 lines of code], some of them will come up with new software-engineering methodologies with buzz words :-). They should answer the old question: how do we divide the system into pieces so that we can "bite" each piece one at a time [working in an iterative, incremental and evolutionary way] using a functional style?
It is certain that the functional style will affect our object-oriented style. We have already "stolen" many concepts from functional systems and adapted them to our OOP languages.
But will functional programs be used for such big systems? Will they become mainstream? That is the question.
And nobody can come up with a realistic methodology without implementing such big systems and getting their hands dirty.
First you should get your hands dirty, then suggest solutions. Suggestions without "real pains and dirt" will be "fantasy".

Modelling / documenting functional programs

I've found UML useful for documenting various aspects of OO systems, particularly class diagrams for overall architecture and sequence diagrams to illustrate particular routines. I'd like to do the same kind of thing for my clojure applications. I'm not currently interested in Model Driven Development, simply on communicating how applications work.
Is UML a common / reasonable approach to modelling functional programming? Is there a better alternative to UML for FP?
the "many functions on a single data structure" approach of idiomatic Clojure code waters down the typical "this uses that" UML diagram because many of the functions end up pointing at map/reduce/filter.
I get the impression that because Clojure is a somewhat more data centric language a way of visualizing the flow of data could help more than a way of visualizing control flow when you take lazy evaluation into account. It would be really useful to get a "pipe line" diagram of the functions that build sequences.
map and reduce etc would turn these into trees
Most functional programmers prefer types to diagrams. (I mean types very broadly speaking, to include such things as Caml "module types", SML "signatures", and PLT Scheme "units".) To communicate how a large application works, I suggest three things:
Give the type of each module. Since you are using Clojure you may want to check out the "Units" language invented by Matthew Flatt and Matthias Felleisen. The idea is to document the types and the operations that the module depends on and that the module provides.
Give the import dependencies of the interfaces. Here a diagram can be useful; in many cases you can create a diagram automatically using dot. This has the advantage that the diagram always accurately reflects the code.
For some systems you may want to talk about important dependencies of implementations. But usually not—the point of separating interfaces from implementations is that the implementations can be understood only in terms of the interfaces they depend on.
There was recently a related question on architectural thinking in functional languages.
It's an interesting question (I've upvoted it); I expect you'll get at least as many opinions as you do responses. Here's my contribution:
What do you want to represent on your diagrams? In OO, one answer to that question might be, considering class diagrams, state (or attributes if you prefer) and methods. So I would suggest class diagrams are obviously not the right place to start, since functions have no state and each generally implements a single operation (aka method). Do any of the other UML diagrams provide a better starting point for your thinking? The answer is probably yes, but you need to consider what you want to show and find that starting point yourself.
Once you've written a (sub-)system in a functional language, then you have a (UML) component to represent on the standard sorts of diagram, but perhaps that is too high-level, too abstract, for you.
When I write functional programs, which is not a lot I admit, I tend to document functions as I would document mathematical functions (I work in scientific computing, lots of maths knocking around so this is quite natural for me). For each function I write:
an ID;
sometimes, a description;
a specification of the domain;
a specification of the co-domain;
a statement of the rule, ie the operation that the function performs;
sometimes I write post-conditions too though these are usually adequately specified by the co-domain and rule.
I use LaTeX for this, it's good for mathematical notation, but any other reasonably flexible text or word processor would do. As for diagrams, no not so much. But that's probably a reflection of the primitive state of the design of the systems I program functionally. Most of my computing is done on arrays of floating-point numbers, so most of my functions are very easy to compose ad-hoc and the structuring of a system is very loose. I imagine a diagram which showed functions as nodes and inputs/outputs as edges between nodes -- in my case there would be edges between each pair of nodes in most cases. I'm not sure drawing such a diagram would help me at all.
I seem to be coming down on the side of telling you no, UML is not a reasonable way of modelling functional systems. Whether it's common SO will tell us.
This is something I've been trying to experiment with also, and after a few years of programming in Ruby I was used to class/object modeling. In the end I think the types of designs I create for Clojure libraries are actually pretty similar to what I would do for a large C program.
Start by doing an outline of the domain model. List the main pieces of data being moved around and the primary functions being performed on this data. I write these in my notebook, and a lot of the time it will be just a name with 3-5 bullet points underneath it. This outline will probably be a good approximation of your initial namespaces, and it should point out some of the key high-level interfaces.
If it seems pretty straightforward then I'll create empty functions for the high-level interface and just start filling them in. Typically each high-level function will require a couple of support functions, and as you build up the whole interface you will find opportunities for sharing more code, so you refactor as you go.
If it seems like a more difficult problem then I'll start diagramming out the structure of the data and the flow of key functions. Often the diagram and conceptual model that make the most sense will depend on the type of abstractions you choose to use in a specific design. For example, if you use a dataflow library for a Swing GUI then using a dependency graph would make sense, but if you are writing a server to process relational database queries then you might want to diagram pools of agents and pipelines for processing tuples. I think these kinds of models and diagrams are also much more descriptive in terms of conveying to another developer how a program is architected. They show more of the functional connectivity between aspects of your system, rather than the pretty non-specific information conveyed by something like UML.

How to get started on Information Extraction?

Could you recommend a training path to start and become very good at Information Extraction? I started reading about it to do one of my hobby projects and soon realized that I would have to be good at math (algebra, stats, probability). I have read some of the introductory books on different math topics (and it's so much fun). Looking for some guidance. Please help.
Update: Just to answer one of the comments: I am more interested in text information extraction.
Depending on the nature of your project, natural language processing and computational linguistics can both come in handy - they provide tools to measure and extract features from the textual information, and to apply training, scoring, or classification.
Good introductory books include O'Reilly's Programming Collective Intelligence (the chapters on searching and ranking, document filtering, and maybe decision trees).
Suggested projects utilizing this knowledge: POS (part-of-speech) tagging, and named-entity recognition (the ability to recognize names, places, and dates in plain text). You can use Wikipedia as a training corpus since most of the target information is already extracted in infoboxes - this might provide you with a limited amount of measurement feedback.
The other big hammer in IE is search, a field not to be underestimated. Again, O'Reilly's book provides some introduction to basic ranking; once you have a large corpus of indexed text, you can do some real IE tasks with it. Check out Peter Norvig's talk "Theorizing from Data" as a starting point and a very good motivator - maybe you could reimplement some of its results as a learning exercise.
As a forewarning, I'm obligated to tell you that information extraction is hard. The first 80% of any given task is usually trivial; however, the difficulty of each additional percentage point usually grows exponentially, in both development and research time. It's also quite underdocumented - most of the high-quality info is currently in obscure white papers (Google Scholar is your friend), so do check them out once you've got your hands burned a couple of times. But most importantly, do not let these obstacles throw you off - there are certainly big opportunities to make progress in this area.
I would recommend the excellent book Introduction to Information Retrieval by Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze. It covers a broad area of issues which form a great and up-to-date (2008) basis for Information Extraction and is available online in full text (under the given link).
I would suggest you take a look at the Natural Language Toolkit (nltk) and the NLTK Book. Both are available for free and are great learning tools.
You don't need to be good at math to do IE; just understand how the algorithm works, experiment on the cases where you need optimal performance, know the scale at which you need to achieve your target accuracy level, and work with that. You are basically working with algorithms, programming, and aspects of CS/AI/machine-learning theory, not writing a PhD paper on a new machine-learning algorithm where you have to convince someone by way of mathematical principles why the algorithm works, so I totally disagree with that notion. There is a difference between practice and theory - as we all know, mathematicians are stuck more on theory than on the practicability of algorithms for producing workable business solutions. You would, however, need to do some background reading, both books on NLP and journal papers, to find out what people have learned from their results.
IE is a very context-specific domain, so you would need to define first in what context you are trying to extract information. How would you define this information? What is your structured model? Suppose you are extracting from semi-structured and unstructured data sets. You would then also want to weigh whether you want to approach your IE with a standard human approach, which involves things like regular expressions and pattern matching, or with statistical machine-learning approaches like Markov chains. You can even look at hybrid approaches.
A standard process model you can follow to do your extraction is to adapt a data/text mining approach:
pre-processing - define and standardize your data for extraction from various or specific sources, cleansing your data
segmentation/classification/clustering/association - your black box where most of your extraction work will be done
post-processing - cleansing your data back to where you want to store it or represent it as information
Also, you need to understand the difference between what is data and what is information. As you can reuse your discovered information as sources of data to build more information maps/trees/graphs. It is all very contextualized.
standard steps for: input->process->output
If you are using Java/C++ there are loads of frameworks and libraries available you can work with.
Perl would be an excellent language to do your NLP extraction work with if you want to do a lot of standard text extraction.
You may want to represent your data as XML or even as RDF graphs (Semantic Web), and for your defined contextual model you can build up relationship and association graphs that will most likely change as you make more and more extraction requests. Deploy it as a RESTful service if you want to treat it as a resource for documents. You can even link it to taxonomized data sets and faceted searching, say using Solr.
Good sources to read are:
Handbook of Computational Linguistics and Natural Language Processing
Foundations of Statistical Natural Language Processing
Information Extraction Applications in Prospect
An Introduction to Language Processing with Perl and Prolog
Speech and Language Processing (Jurafsky)
Text Mining Application Programming
The Text Mining Handbook
Taming Text
Algorithms of Intelligent Web
Building Search Applications
IEEE Journal
Make sure you do a thorough evaluation before deploying such applications/algorithms into production as they can recursively increase your data storage requirements. You could use AWS/Hadoop for clustering, Mahout for large scale classification amongst others. Store your datasets in MongoDB or unstructured dumps into jackrabbit, etc. Try experimenting with prototypes first. There are various archives you can use to base your training on say Reuters corpus, tipster, TREC, etc. You can even check out alchemy API, GATE, UIMA, OpenNLP, etc.
Building extractions from standard text is easier than from, say, a web document, so representation at the pre-processing step becomes even more crucial for defining exactly what you are trying to extract from a standardized document representation.
Standard measures include precision, recall and the F1 measure, amongst others.
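For reference, in terms of true positives (TP), false positives (FP) and false negatives (FN):
precision = TP / (TP + FP)
recall = TP / (TP + FN)
F1 = 2 * precision * recall / (precision + recall)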
I disagree with the people who recommend reading Programming Collective Intelligence. If you want to do anything of even moderate complexity, you need to be good at applied math and PCI gives you a false sense of confidence. For example, when it talks of SVM, it just says that libSVM is a good way of implementing them.
Now, libSVM is definitely a good package but who cares about packages. What you need to know is why SVM gives the terrific results that it gives and how it is fundamentally different from Bayesian way of thinking ( and how Vapnik is a legend).
IMHO, there is no one solution to it. You should have a good grip on linear algebra, probability and Bayesian theory. Bayes, I should add, is as important for this as oxygen is for human beings (it's a little exaggerated, but you get what I mean, right?). Also, get a good grip on machine learning. Just using other people's work is perfectly fine, but the moment you want to know why something was done the way it was, you will have to know something about ML.
Check these two for that :
http://pindancing.blogspot.com/2010/01/learning-about-machine-learniing.html
http://measuringmeasures.com/blog/2010/1/15/learning-about-statistical-learning.html
http://measuringmeasures.com/blog/2010/3/12/learning-about-machine-learning-2nd-ed.html
Okay, now that's three of them :)
The Wikipedia Information Extraction article is a quick introduction.
At a more academic level, you might want to skim a paper like Integrating Probabilistic Extraction Models and Data Mining to Discover Relations and Patterns in Text.
Take a look here if you need an enterprise-grade NER service. Developing an NER system (and its training sets) is a very time-consuming and highly skilled task.
This is a little off topic, but you might want to read Programming Collective Intelligence from O'Reilly. It deals indirectly with text information extraction, and it doesn't assume much of a math background.
