How to get started on Information Extraction? - math

Could you recommend a training path to start with and become very good at Information Extraction? I started reading about it for one of my hobby projects and soon realized that I would have to be good at math (algebra, statistics, probability). I have read some of the introductory books on different math topics (and it's so much fun). Looking for some guidance. Please help.
Update: Just to answer one of the comments: I am more interested in Text Information Extraction.

Depending on the nature of your project, natural language processing and computational linguistics can both come in handy - they provide tools to measure and extract features from the textual information, and to apply training, scoring, or classification.
Good introductory books include O'Reilly's Programming Collective Intelligence (the chapters on searching and ranking, document filtering, and maybe decision trees).
Suggested projects for applying this knowledge: POS (part-of-speech) tagging and named entity recognition (recognizing names, places, and dates in plain text). You can use Wikipedia as a training corpus, since most of the target information is already extracted in infoboxes - this can give you a limited amount of measurement feedback.
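If you want a feel for how far off-the-shelf tools get you on these two projects, here is a minimal sketch using NLTK (recommended in another answer below). The sentence is made up, and the one-time nltk.download() calls are left as comments so the sketch stays side-effect free.

    import nltk

    # One-time model downloads, commented out here:
    # nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")
    # nltk.download("maxent_ne_chunker"); nltk.download("words")

    sentence = "Barack Obama visited Paris on Friday."
    tokens = nltk.word_tokenize(sentence)
    tagged = nltk.pos_tag(tokens)    # part-of-speech tags, e.g. ('Paris', 'NNP')
    tree = nltk.ne_chunk(tagged)     # named-entity chunks such as PERSON and GPE
    print(tagged)
    print(tree)

Comparing the tagger's output against Wikipedia infobox data is one cheap way to get the "measurement feedback" mentioned above.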
The other big hammer in IE is search, a field not to be underestimated. Again, O'Reilly's book provides some introduction to basic ranking; once you have a large corpus of indexed text, you can do some really interesting IE tasks with it. Check out Peter Norvig's "Theorizing from Data" talk as a starting point and a very good motivator - maybe you could reimplement some of its results as a learning exercise.
As a forewarning, I think I'm obligated to tell you that information extraction is hard. The first 80% of any given task is usually trivial; however, each additional percentage point usually gets exponentially harder in development and research time. It's also quite underdocumented - most of the high-quality information is currently in obscure white papers (Google Scholar is your friend) - do check them out once you've got your hands burned a couple of times. But most importantly, don't let these obstacles throw you off - there are certainly big opportunities to make progress in this area.

I would recommend the excellent book Introduction to Information Retrieval by Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze. It covers a broad range of issues which form a great and up-to-date (2008) basis for Information Extraction, and it is available online in full text (under the given link).

I would suggest you take a look at the Natural Language Toolkit (nltk) and the NLTK Book. Both are available for free and are great learning tools.

You don't need to be good at math to do IE. You do need to understand how an algorithm works, experiment with the cases where you need optimal performance, and know the scale at which you need to hit your target accuracy, and work with that. You are basically working with algorithms, programming, and aspects of CS/AI/machine-learning theory, not writing a PhD thesis on a new machine-learning algorithm where you have to convince someone, by way of mathematical principles, why the algorithm works - so I totally disagree with that notion. There is a difference between practice and theory; as we all know, mathematicians are more concerned with theory than with the practicality of algorithms for producing workable business solutions. You would, however, need to do some background reading - both NLP books and journal papers - to find out what people have learned from their results.

IE is a very context-specific domain, so you would first need to define the context in which you are trying to extract information. How would you define this information? What is your structured model? Suppose you are extracting from semi-structured and unstructured data sets. You would then also want to weigh whether to approach your IE with a standard hand-crafted approach, which involves things like regular expressions and pattern matching, or with statistical machine-learning approaches such as Markov chains. You can even look at hybrid approaches.
A standard process model you can follow for your extraction is to adapt a data/text-mining approach (a toy sketch follows this list):
pre-processing - define and standardize your data for extraction from various or specific sources, cleansing it along the way
segmentation/classification/clustering/association - your black box where most of your extraction work will be done
post-processing - cleansing your data back to where you want to store it or represent it as information
Also, you need to understand the difference between data and information, since you can reuse your discovered information as a source of data to build further information maps/trees/graphs. It is all very contextualized.
In short, the standard steps are: input -> process -> output.
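As a toy illustration of those three steps, using nothing more than the regular-expression style of extraction mentioned above (the patterns, function names, and document are invented for the example):

    import re

    def preprocess(raw: str) -> str:
        """Step 1: cleanse and standardize the incoming text."""
        return " ".join(raw.split())

    def extract(text: str) -> dict:
        """Step 2: the 'black box' - here just two regex patterns."""
        return {
            "email": re.findall(r"[\w.+-]+@[\w-]+\.[\w.-]+", text),
            "date": re.findall(r"\b\d{4}-\d{2}-\d{2}\b", text),
        }

    def postprocess(found: dict) -> list:
        """Step 3: shape the results for storage or presentation."""
        return [{"type": kind, "value": v}
                for kind, values in found.items() for v in values]

    doc = "  Contact   alice@example.com   before 2012-06-01. "
    print(postprocess(extract(preprocess(doc))))

A statistical approach would replace the body of extract() while leaving the surrounding pipeline unchanged, which is exactly why the three-step decomposition is useful.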
If you are using Java/C++, there are loads of frameworks and libraries available that you can work with.
Perl would be an excellent language for your NLP extraction work if you want to do a lot of standard text extraction.
You may want to represent your data as XML or even as RDF graphs (Semantic Web), and for your defined contextual model you can build up relationship and association graphs that will most likely change as you make more and more extraction requests. Deploy it as a RESTful service if you want to treat it as a resource for documents. You can even link it to taxonomized data sets and faceted search, say using Solr.
Good sources to read are:
Handbook of Computational Linguistics and Natural Language Processing
Foundations of Statistical Natural Language Processing
Information Extraction Applications in Prospect
An Introduction to Language Processing with Perl and Prolog
Speech and Language Processing (Jurafsky)
Text Mining Application Programming
The Text Mining Handbook
Taming Text
Algorithms of Intelligent Web
Building Search Applications
IEEE Journal
Make sure you do a thorough evaluation before deploying such applications/algorithms into production, as they can recursively increase your data-storage requirements. You could use AWS/Hadoop for clustering and Mahout for large-scale classification, amongst others. Store your datasets in MongoDB, or unstructured dumps in Jackrabbit, etc. Try experimenting with prototypes first. There are various archives you can base your training on, say the Reuters corpus, TIPSTER, TREC, etc. You can even check out the Alchemy API, GATE, UIMA, OpenNLP, etc.
Building extractions from standard text is easier than, say, from a web document, so the representation chosen at the pre-processing step becomes even more crucial for defining what exactly you are trying to extract from a standardized document representation.
Standard measures include precision, recall, and the F1 measure, amongst others.
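For reference, with TP, FP, and FN denoting true positives, false positives, and false negatives:

    \text{precision} = \frac{TP}{TP + FP}, \qquad
    \text{recall} = \frac{TP}{TP + FN}, \qquad
    F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}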

I disagree with the people who recommend reading Programming Collective Intelligence. If you want to do anything of even moderate complexity, you need to be good at applied math, and PCI gives you a false sense of confidence. For example, when it talks about SVMs, it just says that libSVM is a good way of implementing them.
Now, libSVM is definitely a good package, but who cares about packages? What you need to know is why SVMs give the terrific results that they do and how they are fundamentally different from the Bayesian way of thinking (and how Vapnik is a legend).
IMHO, there is no single solution. You should have a good grip on linear algebra, probability, and Bayesian theory. Bayes, I should add, is as important for this as oxygen is for human beings (that's a little exaggerated, but you get what I mean, right?). Also, get a good grip on machine learning. Just using other people's work is perfectly fine, but the moment you want to know why something was done the way it was, you will have to know something about ML.
Check these two for that:
http://pindancing.blogspot.com/2010/01/learning-about-machine-learniing.html
http://measuringmeasures.com/blog/2010/1/15/learning-about-statistical-learning.html
http://measuringmeasures.com/blog/2010/3/12/learning-about-machine-learning-2nd-ed.html
Okay, now that's three of them :)

The Wikipedia Information Extraction article is a quick introduction.
At a more academic level, you might want to skim a paper like Integrating Probabilistic Extraction Models and Data Mining to Discover Relations and Patterns in Text.

Take a look here if you need an enterprise-grade NER service. Developing an NER system (and training sets) is a very time-consuming and highly skilled task.

This is a little off-topic, but you might want to read Programming Collective Intelligence from O'Reilly. It deals indirectly with text information extraction and doesn't assume much of a math background.

Related

Project idea using Eigen

I have been reading the documentation and playing with Eigen recently (docs) and would like to build something that uses it extensively so that I learn it well. I looked on their website and they mention various projects that use it, like Google's Ceres. Something like that might be too large for one or two people to undertake on the side as an Eigen learning experience, so I'm looking for something simpler but not trivial that would use it extensively and is a real, useful application.
Eigen is extensively used in computer vision, and if you are comfortable with linear algebra and matrix calculus (I assume you are, otherwise you wouldn't use Eigen), why not build a toy VSLAM (visual simultaneous localization and mapping) system? Those that are based on bundle adjustment (there's a whole chapter dedicated to that in Zisserman's book on multiple view geometry, and it is also discussed in another excellent open-source book) can be very tricky to implement efficiently and will take a lot of time, but since your goal is to learn Eigen, performance shouldn't be that much of an issue. If that seems too hard or too long for two people, and you think it demands too much energy for a side project, I recommend that you pick some computer-vision algorithms, like those that compute the essential matrix between two images or that do 3D pose estimation. Well, those are the only really fun things that come to mind right now, and they will force you to discover a lot of Eigen's functionality (and gotchas!).

Graph database vertex/edge inference from a text (i.e. an informal Graph 'schema'), using Natural Language Processing (NLP) - does this exist?

Caveat emptor - I'm neither a linguist nor a graph theorist; however, I am a [Java] developer wishing to use a graph database for persistence, and the following topic is of interest to me and, I hope, to others.
OK, the idea is to have some application or code to:
recognise the embedded relationship structures between named entities within a given piece of text
apply or expose these discovered relationships to usage within a Graph database structure.
In such a system, the text might essentially form a basic, layman-written graph schema of sorts. To better visualise this, here is some [very] basic text:
Andrew is married to Jane
Using the online CLAWS part-of-speech (POS) tagger, I'm given the following:
Andrew_NP0 is_VBZ married_AJ0 to_SENT Jane_NP0
According to 'The BNC Basic (C5) Tagset' from Oxford University, NP0 = 'Proper noun', which is a name (as you know), but these NP0-tagged entries would lend themselves to becoming graph vertex instances/nodes (the end user could be further prompted to give these entries an encompassing 'type/description'). The verbs (VBZ) and adjectives (AJ0) might highlight graph relationships.
Once the end user has confirmed their graph representation, they might export it to GraphML, for re-import into a graph database such as Titan or Neo4j.
So, the overall idea is to have a tool that allows a layman end user the ability to create Graph-theory-based database structures, using everyday language.
Does such a tool exist already?
Some of my observations above were influenced, in some way, by the following tools (amongst others):
http://www.plantuml.com <- UML diagrams defined using a simple and intuitive language
http://www.planttext.com <- See plantuml
http://www.acqualia.com/soulver <- An NLP-based calculator and currency exchange tool, using natural sentence phrases
http://nlp.stanford.edu/software/tagger.shtml <- Stanford Log-linear Part-Of-Speech Tagger
Yes, this exists in many different places. Examples include OpenCalais (which was created by Reuters) and AlchemyAPI. There are a bunch of other toolkits and APIs, like NLTK and IBM's UIMA, that don't present you with a finished solution, but rather with the tools needed to build a bespoke solution.
This is a very deep area, subject to ongoing research. I can't cover all of it here, but one thing to keep in mind is that solutions in this space are often highly specific to a certain "corpus" of documents. Software that handles arbitrary English text well doesn't really exist. Instead, what you see are solutions that do it really well for business press releases. Or intelligence reports. Or newspaper articles. Or medical alerts. But not for any arbitrary text.
The area is also rife with problems; one of the big ones is known as "Named Entity Recognition":
Andrew is married to Jane. Andrew bought eggs yesterday.
How many people are being discussed here? Is the second Andrew the same as the first? That's a very complicated and contextual question, but you had better get it right; otherwise you might end up with more or fewer "person" nodes in your resulting graph than you expect.
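To make the idea concrete, here is a deliberately naive sketch (Python with NLTK's off-the-shelf tagger rather than CLAWS; the heuristic and function name are invented purely for illustration) that turns one sentence into a candidate edge:

    import nltk

    def naive_edge(sentence):
        """Toy heuristic: first and last proper nouns become vertices,
        the verbs between them become the edge label."""
        tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
        nnp = [i for i, (_, tag) in enumerate(tagged) if tag == "NNP"]
        if len(nnp) < 2:
            return None
        verbs = [word for word, tag in tagged[nnp[0] + 1:nnp[-1]]
                 if tag.startswith("VB")]
        return (tagged[nnp[0]][0], " ".join(verbs) or "related_to",
                tagged[nnp[-1]][0])

    print(naive_edge("Andrew is married to Jane."))
    # Expected to print something like ('Andrew', 'is married', 'Jane'),
    # depending on how the tagger labels 'married'.

Note that this sketch has no answer at all to the coreference question above: a second sentence about a different Andrew would simply attach another edge to the same "Andrew" string, which is exactly the over-merging risk just described.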

Generating articles automatically

This question is to learn and understand whether a particular technology exists or not. Following is the scenario.
We are going to provide 200 English words. The software can add an additional 40 words, which is 20% of 200. Now, using these, the software should write dialogs - meaningful dialogs with no grammar mistakes.
For this, I looked into spintax and article spinning. But you know what they do: take existing articles and rewrite them. That is not the best way to do this (is it? let me know if it is, please). So, is there any technology capable of doing this? Maybe the semantic theory that Google uses? Any proven AI method?
Please help.
To begin with, a word of caution: this is at the very forefront of research in natural language generation (NLG), and the state-of-the-art research publications are not nearly good enough to replace a human teacher. The problem is especially complicated for students with English as a second language (ESL), because they tend to think in their native tongue before mentally translating the knowledge into English. If we disregard this fearful prelude, the normal way to go about this is as follows:
NLG comprises three main components:
Content Planning
Sentence Planning
Surface Realization
Content Planning: This stage breaks down the high-level goal of communication into structured atomic goals. These atomic goals are small enough to be reached with a single step of communication (e.g. in a single clause).
Sentence Planning: Here, the actual lexemes (i.e. words or word-parts that bear clear semantics) are chosen to be a part of the atomic communicative goal. The lexemes are connected through predicate-argument structures. The sentence planning stage also decides upon sentence boundaries. (e.g. should the student write "I went there, but she was already gone." or "I went there to see her. She has already left." ... notice the different sentence boundaries and different lexemes, but both answers indicate the same meaning.)
Surface Realization: The semi-formed structure attained in the sentence planning step is morphed into a proper form by incorporating function words (determiners, auxiliaries, etc.) and inflections.
In your particular scenario, most of the words are already provided, so choosing the lexemes is going to be relatively simple. The predicate-argument structures connecting the lexemes need to be learned using a suitable probabilistic model (e.g. hidden Markov models). The surface realization, which ensures the final grammatical structure is correct, should be a combination of grammar rules and statistical language models.
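To make the "statistical language models" part of surface realization more concrete, here is a toy sketch (the training corpus, smoothing constant, and candidate sentences are all made up) that ranks two candidate realizations with a bigram model:

    import math
    from collections import defaultdict

    # Tiny "training corpus" of already-correct sentences.
    corpus = ("i went there to see her . she had already left . "
              "i went there but she was gone .").split()

    unigram, bigram = defaultdict(int), defaultdict(int)
    for w1, w2 in zip(corpus, corpus[1:]):
        unigram[w1] += 1
        bigram[(w1, w2)] += 1

    def score(sentence, vocab=1000):
        """Add-one-smoothed bigram log-probability of a candidate sentence."""
        words = sentence.lower().split()
        return sum(math.log((bigram[(a, b)] + 1) / (unigram[a] + vocab))
                   for a, b in zip(words, words[1:]))

    candidates = ["i went there but she was gone .",
                  "i there went but was she gone ."]
    print(max(candidates, key=score))   # the fluent word order should score higher

In a real system the candidates would come from the sentence-planning stage and the model would be trained on a large corpus, but the ranking idea is the same.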
At a high-level, note that content planning is language-agnostic (but it is, quite possibly, culture-dependent), while the last two stages are language-dependent.
As a final note, I would like to add that the choice of the 40 extra words is something I have glossed over, but it is no less important than the other parts of this process. In my opinion, these extra words should be chosen based on their syntagmatic relation to the 200 given words.
For further details, the two following papers provide a good start (complete with process flow architectures, examples, etc.):
Natural Language Generation in Dialog Systems
Stochastic Language Generation for Spoken Dialogue Systems
To better understand the notion of syntagmatic relations, I found Sahlgren's article on the distributional hypothesis extremely helpful. The distributional approach in his work can also be used to learn the predicate-argument structures I mentioned earlier.
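A toy sketch of the distributional idea (the two "documents" and the window size are invented): words that occur in similar contexts end up with similar co-occurrence counts, which is the raw material for the kind of analysis Sahlgren describes.

    from collections import Counter

    docs = ["the cat sat on the mat", "the dog sat on the rug"]
    window = 2
    cooc = Counter()
    for doc in docs:
        words = doc.split()
        for i, w in enumerate(words):
            # count neighbours within +/- `window` positions
            for c in words[max(0, i - window):i] + words[i + 1:i + 1 + window]:
                cooc[(w, c)] += 1

    # 'cat' and 'dog' share the context 'sat', hinting that they behave alike
    print(cooc[("cat", "sat")], cooc[("dog", "sat")])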
Finally, to add a few available tools: take a look at this ACL list of NLG systems. I haven't used any of them, but I've heard good things about SPUD and OpenCCG.

How does software development compare with statistical programming/analysis? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 9 years ago.
Statistical analysis/programming is writing code. Whether the analysis is descriptive or inferential, you write code to import data, to clean it, to analyse it, and to compile a report.
Analyzing the data can involve many twists and turns of statistical procedures, and many angles from which you look at your data. At the end, you have many files with many lines of code performing tasks on your data, some of which is reusable, so you encapsulate it as a "good to have" function.
This process of "statistical analysis" feels to me like "programming", but I am not sure it feels the same to everyone.
From the Wikipedia article on Software development:
The term software development is often used to refer to the activity of computer programming, which is the process of writing and maintaining the source code, whereas the broader sense of the term includes all that is involved between the conception of the desired software through to the final manifestation of the software. Therefore, software development may include research, new development, modification, reuse, re-engineering, maintenance, or any other activities that result in software products. For larger software systems, usually developed by a team of people, some form of process is typically followed to guide the stages of production of the software.
According to this simplistic definition (and my humble opinion), this sounds very much like building a statistical analysis. But I imagine it is not that simple.
Which leads me to my question: what differences can you outline between the two activities?
It can be in terms of the technical aspects, the different strategies or work styles, and whatever else you think is relevant.
This question came to me from the following threads:
How do you combine "Revision Control" with "Workflow" for R?
How to organize large R programs?
Workflow for statistical analysis and report writing
As I said in my response to your other question, what you're describing is programming. So the short answer is: there is no difference. The slightly longer answer is that statistical and scientific computing should require even more controls around development than other programming.
A certain percentage of statistical analysis can be done using Excel, or in a point-and-click approach using SPSS, SAS, Matlab, or S-Plus (for instance). A more sophisticated analysis done using one of those programs (or R) that involves programming is clearly a form of software development. And this kind of statistical computing can benefit immensely from following all the best practices from software development: source control, documentation, a project plan, scope document, bug tracking/change control, etc.
Moreover, there are different kinds of statistical analyses that can follow different approaches, as with any programming project:
Exploratory data analysis should follow an iterative methodology, like the Agile methodology. In this case, when you don't know explicitly the steps involved up front, it's critical to use a development methodology that is adaptive and self-reflective.
A more routine kind of analysis (e.g. a government annual survey such as the Census) could follow a more traditional methodology, such as the waterfall approach, since it follows a very clear set of steps that are mostly known in advance.
I would suggest that any statistician would benefit from reading a book like "Code Complete" (look at the other top books in this post): the more organized you are with your analysis, the greater the likelihood of success.
Statistical analysis in some sense requires even more good practices around version control and documentation than other programming. If your program is just serving some business need, then the algorithm or software used is really of secondary importance so long as the program functions the way the specifications require. On the other hand, with scientific and statistical computing, accuracy and reproducibility are paramount. This is one of John Chambers' (the creator of the S language) major emphases in "Software for Data Analysis". That is another reason to add literate programming (e.g. with Sweave) as an important tool in the statistician's toolkit.
Perhaps the common denominator is "problem solving."
Beyond that, i doubt i could provide any insight, but i can at least provide a limited answer from personal experience.
This issue arises for us in hiring--i.e., do we hire a programmer and teach them statistics, or do we hire a statistics person and teach them to program? Ideally we could find someone fluent in both disciplines, and indeed, that's the third net we cast, but rarely with any success.
Here's an example. The most stable distinction between the two activities (software dev & statistical analysis) is probably their respective outputs, or project deliverables. For instance, in my group someone is conducting the statistical analysis on the results of our split-path and factorial experiments (e.g., from the t-test results, whether the difference is significant, or whether the test ought to continue). That analysis will be sent to the marketing department which they'll use to modify the web pages comprising the Site with a view towards improving conversion. A second task involves the abstraction of and partial automation of those analyses so the results can be processed in near-real time.
For the first task, we'll assign a statistician; for the second, a programmer. The business problem we are trying to solve is the same for both tasks, yet for the first, the crux is statistics, for the second, the statistics problems have been largely solved and the crux is a core programming task (I/O).
Notice also how the tools associated with the two activities have evolved so that the distinction between the two (software dev & data analysis) is further blurred: mainstream development languages are being adapted for use as domain-specific analytical tools, while at the same time frameworks continue to be developed that enable non-developers to quickly build lightweight, task-oriented applications in DSLs.
For instance, Python, a general-purpose development language, has R bindings (RPy2) which, along with its native interactive interpreter (IDLE), substantially facilitate its use in statistical analysis; at the same time, there is a clear trend in R package development toward (web) application development: R bindings for Qt, gWidgetsWWW, and RApache are all R packages directed at client or web-app development, whose initial releases were (i think) within the past 18 months. Aside from that, since at least the last quarter of last year, i've noticed an accelerating frequency of blog posts, presentations, etc. on the subject of web-app development in R.
Finally, i wonder if your question is perhaps evidence of the growing popularity of R. Here's what i mean. A decade ago, when my employer purchased a site license, i began learning and using one of the major statistical computing products (no point here in saying which one, it begins with "S"). i found it unnatural and inflexible. Unlike Perl (which i was using at the time), this tool was not an extension of my brain (which isn't an optional attribute of an analytical tool, to me it's more or less the definition of one). Interacting with this System was more like using a vending machine--i selected some statistical function i wanted and then waited for the "output", which was often an impressive set of high-impact, full-color charts and tables. Nearly always, though, what i wanted was to modify my input or use that output for the next analytical step. That seemed to require another, separate trip to the vending machine. The fact that this tool was context-aware--i.e., it knew statistics--while Perl didn't, didn't compensate for the awkward interaction. Statistical analysis done this way would never be confused with software development. (Again, this is just a summary of my own experience, i don't claim it can be abstracted. It's also not a polemic against any (or all) commercial data analysis platforms--millions use them and they've earned zillions for the people who created them, so let's assume it was my own limitations that caused the failure to bond.)
I had never heard of R until about 18 months ago, and i only discovered it while scanning PyPI (The Web Interface to Python's external package repository) for statistics libraries for python. There i came across RPy, which seemed brilliant but required a dependency called "R" (RPy of course is really just a set of Python bindings to R).
Perhaps R appeals to programmers and non-programmers equally; still, for a programmer/analyst, this was a godsend. It hit everything on my wish list for a data-analysis platform: an engine based on a full-featured general programming language (in this case a proven Scheme descendant), an underlying functional paradigm, a built-in interactive interpreter, native data types built from the ground up for data analysis, and the domain knowledge baked in. Data analysis became more like coding. Life was good.
If you are using R, then you'll likely be writing code to solve your statistical questions, so in this sense, statistical analysis is a subset of programming.
On the other hand, there are plenty of SPSS users who have never ventured beyond a bit of pointing and clicking to solve their stats problems. That feels less like programming to me.

How To: Pattern Recognition

I'm interested in learning more about pattern recognition. I know that's somewhat of a broad field, so I'll list some specific types of problems I would like to learn to deal with:
Finding patterns in a seemingly random set of bytes.
Recognizing known shapes (such as circles and squares) in images.
Noticing movement patterns given a stream of positions (Vector3)
This is a new area of experimentation for me personally, and to be honest, I simply don't know where to start :-) I'm obviously not looking for the answers to be provided to me on a silver platter, but some search terms and/or online resources where I can start to acquaint myself with the concepts of the above problem domains would be awesome.
Thanks!
PS: For extra credit, it would be grand if said resources provided code examples/discussion in C# :-) but they don't need to.
Hidden Markov Models are a great place to look, as well as Artificial Neural Networks.
Edit: You could take a look at NeuronDotNet, it's open source and you could poke around the code.
Edit 2: You can also take a look at ITK, it's also open source and implements a lot of these types of algorithms.
Edit 3: Here's a pretty good intro to neural nets. It covers a lot of the basics and includes source code (albeit in C++). He implemented an unsupervised learning algorithm; I think you may be looking for a supervised backpropagation algorithm to train your network.
Edit 4: Another good intro, avoids really heavy math, but provides references to a lot of that detail at the bottom, if you want to dig into it. Includes pseudo-code, good diagrams, and a lengthy description of backpropagation.
This is kind of like saying "I'd like to learn more about electronics... can anyone tell me where to start?" Pattern recognition is a whole field - there are hundreds, if not thousands, of books out there, and any university has at least several (probably 10 or more) graduate-level courses on this. There are numerous journals dedicated to it as well that have been publishing for decades... and conferences too.
You might start with the wikipedia.
http://en.wikipedia.org/wiki/Pattern_recognition
This is kind of an old question, but it's relevant so I figured I'd post it here :-) Stanford began offering an online Machine Learning class here - http://www.ml-class.org
OpenCV has some functions for pattern recognition in images.
You might want to look at this :http://opencv.willowgarage.com/documentation/pattern_recognition.html. (broken link: closest thing in the new doc is http://opencv.willowgarage.com/documentation/cpp/ml__machine_learning.html, although it is no longer what I'd call helpful documentation for a beginner - see other answers)
However, I also recommend starting with MATLAB, because OpenCV is not intuitive to use.
There are a lot of useful links on this page about computer-vision-related pattern recognition. Some of the links seem to be broken now, but you may still find it useful.
I am not an expert on this, but reading about Hidden Markov Models is a good way to start.
Beware false patterns! For any decently large data set you will find subsets that appear to have a pattern, even if it is a data set of coin flips. No good process for pattern recognition should be without statistical techniques to assess confidence that the detected patterns are real. When possible, run your algorithms on random data to see what patterns they detect. These experiments will give you a baseline for the strength of a pattern that can be found in random (a.k.a. "null") data. This kind of technique can help you assess the "false discovery rate" of your findings.
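A toy illustration of that advice (the "pattern" statistic here is just the longest run of identical coin flips, chosen purely for the example): run the detector on pure noise many times and use, say, the 95th percentile as the bar a real finding has to clear.

    import random

    def longest_run(flips):
        """Length of the longest run of identical consecutive outcomes."""
        best = cur = 1
        for a, b in zip(flips, flips[1:]):
            cur = cur + 1 if a == b else 1
            best = max(best, cur)
        return best

    random.seed(0)
    # "Null" baseline: how long a run shows up in pure coin-flip noise?
    null_runs = sorted(longest_run([random.randint(0, 1) for _ in range(1000)])
                       for _ in range(200))
    print("95th percentile of the null baseline:",
          null_runs[int(0.95 * len(null_runs))])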
Learning pattern recognition is easier in MATLAB.
There are several examples, and there are ready-made functions to use.
It is good for understanding the concepts and for experiments...
I would recommend starting with a MATLAB toolbox. MATLAB is an especially convenient place to start playing around with stuff like this because of its interactive console. A nice toolbox I personally used and really liked is PRTools (http://prtools.org); it has an implementation of pretty much every pattern recognition tool and also some other machine learning tools (neural networks, etc.). But the nice thing about MATLAB is that there are many other toolboxes you can try out as well (there is even a proprietary toolbox from MathWorks).
Whenever you feel comfortable enough with the different tools (and have found out which classifier performs best for your problem), you can start thinking about implementing the machine learning in a different application.
