Generating articles automatically - math

This question is to learn and understand whether a particular technology exists. The scenario is as follows.
We are going to provide 200 English words. The software can add an additional 40 words, which is 20% of 200. Using these, the software should write meaningful dialogues with no grammar mistakes.
For this, I looked into Spintax and article spinning. But you know what they do: take existing articles and rewrite them. That is not the best way for this (is it? let me know if it is, please). So, is there any technology capable of doing this? Maybe the semantic technology that Google uses? Any proven AI method?
Please help.

To begin with, a word of caution: this is at the very forefront of research in natural language generation (NLG), and the state-of-the-art research publications are not nearly good enough to replace a human teacher. The problem is especially complicated for students with English as a second language (ESL), because they tend to think in their native tongue before mentally translating the knowledge into English. Setting aside this cautionary prelude, the normal way to go about this is as follows:
NLG comprises three main components:
Content Planning
Sentence Planning
Surface Realization
Content Planning: This stage breaks down the high-level goal of communication into structured atomic goals. These atomic goals are small enough to be reached with a single step of communication (e.g. in a single clause).
Sentence Planning: Here, the actual lexemes (i.e. words or word-parts that bear clear semantics) are chosen to be a part of the atomic communicative goal. The lexemes are connected through predicate-argument structures. The sentence planning stage also decides upon sentence boundaries. (e.g. should the student write "I went there, but she was already gone." or "I went there to see her. She has already left." ... notice the different sentence boundaries and different lexemes, but both answers indicating the same meaning.)
Surface Realization: The semi-formed structure attained in the sentence planning step is morphed into a proper form by incorporating function words (determiners, auxiliaries, etc.) and inflections.
In your particular scenario, most of the words are already provided, so choosing the lexemes is going to be relatively simple. The predicate-argument structures connecting the lexemes need to be learned with a suitable probabilistic model (e.g. hidden Markov models). Surface realization, which ensures the final correct grammatical structure, should be a combination of grammar rules and statistical language models.
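As a purely illustrative sketch of the "statistical language model" part (not anything taken from the papers below), here is a tiny bigram model in Python; the toy corpus is made up:

```python
# A minimal sketch of a statistical language model: a bigram model trained on a
# tiny toy corpus, then used to emit a short word sequence. Purely illustrative.
import random
from collections import defaultdict

corpus = [
    "the student writes a short dialogue",
    "the teacher reads the short dialogue",
    "a student reads a dialogue",
]

# Count bigram transitions, with "<s>" and "</s>" marking sentence boundaries.
transitions = defaultdict(list)
for sentence in corpus:
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    for prev, nxt in zip(tokens, tokens[1:]):
        transitions[prev].append(nxt)

def generate(max_len=12):
    """Sample words according to the observed bigram transitions."""
    word, output = "<s>", []
    for _ in range(max_len):
        word = random.choice(transitions[word])
        if word == "</s>":
            break
        output.append(word)
    return " ".join(output)

print(generate())
```

A real system would train on a much larger corpus and combine this with the grammar rules mentioned above; the sketch only shows the statistical half of that combination.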
At a high-level, note that content planning is language-agnostic (but it is, quite possibly, culture-dependent), while the last two stages are language-dependent.
As a final note, I would like to add that the choice of the 40 extra words is something I have glossed over, but it is no less important than the other parts of this process. In my opinion, these extra words should be chosen based on their syntagmatic relation to the 200 given words.
For further details, the two following papers provide a good start (complete with process flow architectures, examples, etc.):
Natural Language Generation in Dialog Systems
Stochastic Language Generation for Spoken Dialogue Systems
To better understand the notion of syntagmatic relations, I found Sahlgren's article on the distributional hypothesis extremely helpful. The distributional approach in his work can also be used to learn the predicate-argument structures I mentioned earlier.
Finally, to add a few available tools: take a look at this ACL list of NLG systems. I haven't used any of them, but I've heard good things about SPUD and OpenCCG.

Related

a question about a design

My teammates and I have a very challenging new project to do, and we are supposed to submit it next week. We don't have a single clue about how to do it, and really need help. We are undergraduate students, new to Information Retrieval and AI, and really need your ideas.
The project is roughly:
When an expert is cited in a document, find an expert with an opposing opinion and find out what he/she says about that topic.
We are free to use any programming language, but we are not concerned with the programming. We would like help getting started. Please give us a rough idea of how to design such a system and how to retrieve information from the internet. How should we get the cited expert's opinion and then find an opposing one?
Simple: use Amazon's Mechanical Turk.
Without that (or an equivalent) you're in trouble. If there are no further constraints on the problem then you will need a full-blown AI, the kind that doesn't yet exist. If there are severe constraints then you might have a chance of doing this in a week. If the expert can be in any field (medicine, politics, history, fashion, science, comic books, etc.) then there will be no single, well-organized repository of essays. You'll have to use Google to find Dr. X's opinion. Once you find Dr. X's writing (and let's pray it's text, not audio) you'll have to do some kind of natural language processing to get the thrust of it, even if you're lucky enough to find a descriptive title ("Digital Photography Is Absolutely Great"). Then you have to figure out its opposite. What's the opposite of "Neil Gaiman draws on folklore for his story ideas"? Figuring out what opinion you're looking for will be a serious problem. After that, things actually get easier: you can Google for the subject and use the same magic tools to find the one you're looking for.
So what do you have a chance of solving? A search for opinions that someone else has already organised into "pro" and "con". Some online political forums are organised that way. Wikipedia cites opposing views in a special section in some of its articles. Science journals print letters of rebuttal. Look around; you might find a site that is even more cut-and-dried. Choose a small enough arena and you'll have a tractable problem.
EDIT: Damn, Ben Dunlap beat me to all my major points in a comment. Sigh
Sounds like an NLP problem to me. As for the information about documents and cites, http://citeseerx.ist.psu.edu should be a good starting point.
For each paper, there are several citations which refer to the paper. At the very minimum, you have to scan the abstract of the paper and those of the citations and run your own algorithm to figure out whether any citation takes an opposing view. Maybe your professor can give you hints on some approximate heuristic, but as far as I know it is a really hard problem.
I would be watching this thread for more interesting approaches.
Automatically submit a Google search request similar to "expert_name sucks", "expert_name wrong", or something like that. Find the first result that has "PhD" with a document link in the same sentence and return the link.
I think you might be blowing this up a little too big... as an undergraduate project, I would approach it a little more small scale.
Unless your specification says you must use actual internet resources, you would be better off creating your own database of custom short documents. Add metadata to each document stating the points they make about certain topics.
Next, I would create a list of citations which link to each document and add some metadata representing that expert's stance on the topic. When someone reads a document, I would augment the list of citations with lists of links to documents which have alternative views on that topic.
Basically it would consist of these tables:
Document (id, data)
DocumentPoints (documentId, topic, stance)
Citation (documentId, topic, stance)
And when someone loads up a document, the citations are pulled up as well. For each citation, you search DocumentPoints for the same topics with different stances. The most difficult part of this project would be creating the 5 or 6 documents you need to have data in your database. After that the solution is trivial.
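For illustration only, here is roughly how those tables and the "same topic, different stance" lookup could look with SQLite from Python; the schema follows the answer above, while the sample rows and topic strings are invented:

```python
# A rough sketch of the Document / DocumentPoints / Citation tables and the
# "find opposing views" lookup described above, using SQLite. Sample data is made up.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE Document (id INTEGER PRIMARY KEY, data TEXT);
CREATE TABLE DocumentPoints (documentId INTEGER, topic TEXT, stance TEXT);
CREATE TABLE Citation (documentId INTEGER, topic TEXT, stance TEXT);
""")
db.executemany("INSERT INTO Document VALUES (?, ?)",
               [(1, "Dr. X on digital photography"), (2, "Dr. Y's rebuttal")])
db.executemany("INSERT INTO DocumentPoints VALUES (?, ?, ?)",
               [(1, "digital photography", "pro"), (2, "digital photography", "con")])
db.execute("INSERT INTO Citation VALUES (?, ?, ?)", (1, "digital photography", "pro"))

# For each citation in the document being read, pull documents on the same topic
# whose recorded stance differs.
for doc_id, topic, stance in db.execute("SELECT * FROM Citation WHERE documentId = 1"):
    rows = db.execute(
        "SELECT d.id, d.data FROM DocumentPoints p JOIN Document d ON d.id = p.documentId "
        "WHERE p.topic = ? AND p.stance != ?", (topic, stance)).fetchall()
    print(f"Opposing views on '{topic}':", rows)
```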
On a side note, most of these other answers are telling you to use some existing solution... don't do that unless the assignment tells you to. You'll be much better off understanding the problem and the various ways to solve it (this is definitely not the only/best one) if you work through the entire problem yourself. When the teacher asks you to do something not supported by whatever product you chose to build your solution on, you won't be able to fix it. If you had written it yourself, you could just as easily implement the new spec as well.

How can I learn higher-level programming-related math without much formal training? [closed]

I haven't taken any math classes above basic college calculus. However, in the course of my programming work, I've picked up a lot of math and comp sci from blogs and reading, and I genuinely believe I have a decent mathematical mind. I enjoy and have success doing Project Euler, for example.
I want to dive in and really start learning some cool math, particularly discrete mathematics, set theory, graph theory, number theory, combinatorics, category theory, lambda calculus, etc.
My impression so far is that I'm well equipped to take these on at a conceptual level, but I'm having a really hard time with the mathematical language and symbols. I just don't "speak the language" and though I'm trying to learn it, the going is extremely slow. It can take me hours to work through even one formula- or terminology-heavy paragraph. And yeah, I can look up terms and definitions, but it's a terribly onerous process that very much obscures the theoretical simplicity of what I'm trying to learn.
I'm really afraid I'm going to have to back up to where I left off, get a mid-level math textbook, and invest some serious time in exercises to train myself in that way of thought. This sounds amazingly boring, though, so I wondered if anyone else has any ideas or experience with this.
If you don't want to attend a class, you still need to get what the class would have given you: time in the material and lots of practice.
So, grab that text book and start doing the practice problems. There really isn't any other way (unless you've figured out how osmosis can actually happen...).
There is no knowledge that can only be gained in a classroom.
Check out the MIT Courseware for Mathematics
Also their YouTube site
Project Euler is also a great way to think about math as it relates to programming
Take a class at your local community college. If you're like me you'd need the structure. There's something to be said for the pressure of being graded. I mean there's so much to learn that going solo is really impractical if you want to have more than just a passing nod-your-head-mm-hmm sort of understanding.
Sounds like you're in the same position I am. What I'm finding out about math education is that most of it is taught incorrectly. Whether a cause or result of this, I also find most math texts are written incorrectly. Exceptions are rare, but notable. For instance, anything written by Donald Knuth is a step in the right direction.
Here are a couple of articles that state the problem quite clearly:
A Gentle Introduction To Learning Calculus
Developing Your Intuition For Math
And here's an article on a simple study technique that aims at retaining knowledge:
Teaching linear algebra
Consider auditing classes in discrete mathematics and proofs at a local university. The discrete math class will teach you some really useful stuff (graph theory, combinatorics, etc.), and the proofs class will teach you more about the mathematical style of thinking and writing.
I'd agree with @John Kugelman that classes are the way to go to get it done properly, but I'd add that if you don't want to take classes, the internet has many resources to help you, including recorded lectures, which I find can be more approachable than books and papers.
I'd recommend checking out MIT Open Courseware. There's a Maths for Computer Science module there, and I'm enjoying working through Gilbert Strang's Linear Algebra course of video lectures.
Youtube and videolectures.com are also good resources for video lectures.
Finally, there's a free Maths for CS book at bookboon.
To this list I would now add The Haskell Road to Logic, Maths and Programming, and Conceptual Mathematics: A First Introduction to Categories.
--- Nov 16 '09 answer for posterity--
Two books. Diestel's Graph Theory, and Knuth's Concrete Mathematics. Once you get the hang of those try CAGES.
Find a good mentor who is an expert in the field who is willing to spend time with you on a regular basis.
There is a sort of trick to learning dense material, like math and mathematical CS. Learning unfamiliar abstract stuff is hard, and the most effective way to do it is to familiarize yourself with it in stages. First, you need to skim it: don't worry if you don't understand everything in the first pass. Then take a break; after you have rested, go through it again in more depth. Lather, rinse, repeat; meditate, and eventually you may become enlightened.
I'm not sure exactly where I'd start, to become familiar with the language of mathematics; I just ended up reading through lots of papers until I got better at it. You might look for introductory textbooks on formal mathematical logic, since a lot of math (especially in language theory) is based off of that; if you learn to hack the formal stuff a bit, the everyday notation might look a bit easier.
You should probably look through books on topics you're personally interested in; the inherent interest should help get you over the hump. Also, make sure you find texts that are actually introductory; I have become wary of slim, undecorated hardbacks labeled Elementary Foobar Theory, which tend to be elementary only to postdocs with a PhD in Foobar.
A word of warning: do not start out with category theory -- it is the most boring math I have ever encountered! Due to its relevance to language design and type theory, I would like to know more about it, but so far I have not been able to deal...
For a nice, scattershot intro to bits of many kinds of CS-ish math, I recommend Gödel, Escher, Bach by Hofstadter (if you haven't read it already, of course). It's not a formal math book, though, so it won't help you with the familiarity problem, but it is quite inspirational.
Mathematical notation is akin to several computer languages:
concise
exacting
based on many idioms
a fair amount of local variations and conventions
As with a computer language, you don't need to "wash the whole elephant at once": take it one part at a time.
A tentative plan for you could be
identify areas of mathematics that are interesting or important to you (it seems you already have a bit of a sense for that; CS has helped you develop quite a culture for it).
take (or merely audit) a few formal classes in this area. I agree with several answers in this post that an in-person course at a local college is preferable, but, at first or to get the most out of a particular class, teaching yourself in this area with MIT OCW, similar online resources, and the associated books is fine.
if an area of math demands too high a prerequisite in terms of fluency with notation, with some underlying concept, or (most often) with mechanical computation and transformation techniques: no problem! Just backtrack a bit, learn these foundations (and just these foundations!) and move forward again.
Find a "guru", someone who has a broad mathematical culture and exposure, not necessarily a mathematician; physics folks are good too, and indeed they can often articulate math in a more practical fashion. Use this guru to guide you, as he/she can show you how the big pieces fit together.
Note: there is little to be gained from learning mathematical notation for its own sake. Rather, it should be learned in context, just as, say, a C# idiom is better memorized when used and associated with a specific task than when learned in vacuo. A related SO posting, however, provides several resources to decipher and learn mathematical notation.
Project Euler takes problems out of context and drops them in for people to solve. Project Euler cannot teach you anything effectively; the fact that it is popular does not mean anything. You cannot study mathematics through Project Euler, as it contains only bits and pieces (and some pretty high-level pieces) that you're supposed to know already in order to solve the problems. Learning mathematics means picking a subject, reading a book about it, and solving exercises or reading solutions; that's how you learn math. If it so happens that through your reading you find something that is close to some Project Euler problem, you're in luck, but otherwise Project Euler is a complete waste of time. I think the time is much better invested choosing a particular branch of mathematics and studying that. Let me explain why: I solved three pretty advanced Project Euler problems, and they all drew on knowledge from number theory, which I happened to have because I had studied some part of it. I do not think I learned anything from Project Euler; it just happened that I already knew some number theory and solved the problems.
For example, if you find out you like number theory, take H. Davenport -> Hardy & Wright -> Kenneth Rosen, and study those.
If you like graph theory, take Reinhard Diestel's book, which is freely available, and study that (or check books.google.com and find whichever is more appropriate to your taste), but don't spread your attention in 999999 directions just because Project Euler has problems ranging from dynamic programming to advanced geometry or advanced number theory; that is clearly the wrong way to go and it will not bring you closer to your goal.
This sounds amazingly boring
Well ... mathematics is not boring when you find some problem that you are attached to, which you like and would like to find the solution to, and when you have sufficient time to reflect on it while not behind a computer screen. Mathematics is done with pen and paper mostly (yes, you can use computers ... but that's not really the point).
So, if you find a real-world problem, or some programming problem, that would benefit from you knowing some advanced maths, and you know what maths you have to study, it can be motivating to learn in that way.
If you feel you are not motivated it is hard to study properly.
There is also the question of what you actually mean when you say learn. Does the learning process stop after you solved the problems at the end of the chapter of a book ? Well you decide. You can consider you have finished learning that subject, or you can consider you have not finished and read more about it. There are entire books on just one equation and variations of it.
The amount of programming-related math that you can learn without formal training is limited, but it's more than enough. And maybe you can teach yourself.
It all boils down to your resources and motivation.
To know mathematics you have to do mathematics, not programming (Project Euler).
For beginning to learn category theory I recommend David Spivak's Category Theory for the Sciences (AKA Category Theory for Scientists) because it's relatively comprehensible, thanks to many examples that enable understanding by analogy, and it quickly builds a foundation for understanding more abstract concepts.
It requires the ability to reason logically and an intuitive notion of what is a set. It proceeds from sets and functions through basic category theory to adjoint functors, categories of functors, sheaves, monads and an introduction to operads. Two main threads throughout are modeling databases in terms of categories and describing categories with annotated diagrams called ologs. The bibliography provides references to more advanced and specialized topics including recent papers by Dr. Spivak.
An expected outcome from reading this book is the ability to understand category theory texts and papers written for mathematicians, such as Mac Lane's Categories for the Working Mathematician.
In PDF format it is available from http://math.mit.edu/~dspivak/teaching/sp13/ (the dynamic version is recommended since it's the most recent). The open-access HTML version is available from https://mitpress.mit.edu/books/category-theory-sciences (which is recommended since it includes additional content, including answers to some exercises).

How To: Pattern Recognition

I'm interested in learning more about pattern recognition. I know that's somewhat of a broad field, so I'll list some specific types of problems I would like to learn to deal with:
Finding patterns in a seemingly random set of bytes.
Recognizing known shapes (such as circles and squares) in images.
Noticing movement patterns given a stream of positions (Vector3)
This is a new area of experimentation for me personally, and to be honest, I simply don't know where to start :-) I'm obviously not looking for the answers to be provided to me on a silver platter, but some search terms and/or online resources where I can start to acquaint myself with the concepts of the above problem domains would be awesome.
Thanks!
PS: For extra credit, it would be grand if said resources provided code examples/discussion in C# :-) but they don't need to.
Hidden Markov Models are a great place to look, as well as Artificial Neural Networks.
Edit: You could take a look at NeuronDotNet, it's open source and you could poke around the code.
Edit 2: You can also take a look at ITK, it's also open source and implements a lot of these types of algorithms.
Edit 3: Here's a pretty good intro to neural nets. It covers a lot of the basics and includes source code (albeit in C++). He implemented an unsupervised learning algorithm; I think you may be looking for a supervised backpropagation algorithm to train your network.
Edit 4: Another good intro, avoids really heavy math, but provides references to a lot of that detail at the bottom, if you want to dig into it. Includes pseudo-code, good diagrams, and a lengthy description of backpropagation.
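If it helps to see the supervised idea in code, here is a minimal, purely illustrative numpy sketch of gradient-descent training for a single logistic neuron (the one-layer case of the backpropagation idea mentioned above); the toy AND task is just a placeholder:

```python
# A minimal sketch of supervised training for a single logistic neuron using
# gradient descent. The toy task (learning AND) is purely illustrative.
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 0, 0, 1], dtype=float)

rng = np.random.default_rng(0)
w = rng.normal(size=2)  # weights
b = 0.0                 # bias
lr = 0.5                # learning rate

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for epoch in range(2000):
    pred = sigmoid(X @ w + b)          # forward pass
    error = pred - y                   # gradient of cross-entropy loss w.r.t. pre-activation
    w -= lr * X.T @ error / len(y)     # gradient step on weights
    b -= lr * error.mean()             # gradient step on bias

print(np.round(sigmoid(X @ w + b), 2))  # predictions should move toward [0, 0, 0, 1]
```

A multi-layer network just repeats this forward/backward pattern across layers, which is what the linked articles walk through in more detail.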
This is kind of like saying "I'd like to learn more about electronics.. anyone tell me where to start?" Pattern Recognition is a whole field - there are hundreds, if not thousands of books out there, and any university has at least several (probably 10 or more) courses at the grad level on this. There are numerous journals dedicated to this as well, that have been publishing for decades ... conferences ..
You might start with the wikipedia.
http://en.wikipedia.org/wiki/Pattern_recognition
This is kind of an old question, but it's relevant so I figured I'd post it here :-) Stanford began offering an online Machine Learning class here - http://www.ml-class.org
OpenCV has some functions for pattern recognition in images.
You might want to look at this :http://opencv.willowgarage.com/documentation/pattern_recognition.html. (broken link: closest thing in the new doc is http://opencv.willowgarage.com/documentation/cpp/ml__machine_learning.html, although it is no longer what I'd call helpful documentation for a beginner - see other answers)
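As a rough sketch of the "recognize known shapes" item, OpenCV's Python bindings include Hough circle detection; the file name shapes.png below is a placeholder and the parameter values are just starting points, not recommendations:

```python
# A small sketch of detecting circles in an image with OpenCV's Hough transform.
# "shapes.png" is a placeholder file name.
import cv2
import numpy as np

img = cv2.imread("shapes.png")
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
gray = cv2.medianBlur(gray, 5)  # smooth to reduce spurious detections

circles = cv2.HoughCircles(gray, cv2.HOUGH_GRADIENT, 1.2, 30,
                           param1=100, param2=40, minRadius=5, maxRadius=0)
if circles is not None:
    for x, y, r in np.uint16(np.around(circles[0])):
        print(f"circle at ({x}, {y}) with radius {r}")
```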
However, I also recommend starting with MATLAB, because OpenCV is not intuitive to use.
There are lots of useful links on this page on computer-vision-related pattern recognition. Some of the links seem to be broken now, but you may still find it useful.
I am not an expert on this, but reading about Hidden Markov Models is a good way to start.
Beware false patterns! For any decently large data set you will find subsets that appear to have a pattern, even if it is a data set of coin flips. No good process for pattern recognition should be without statistical techniques to assess confidence that the detected patterns are real. When possible, run your algorithms on random data to see what patterns they detect. These experiments will give you a baseline for the strength of a pattern that can be found in random (a.k.a. "null") data. This kind of technique can help you assess the "false discovery rate" for your findings.
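A minimal sketch of that "null data" baseline, assuming a simple longest-run statistic and coin-flip data (both invented purely for illustration):

```python
# Compare the longest run found in an observed bit sequence against the
# distribution of longest runs in random sequences of the same length.
import random

def longest_run(bits):
    best = cur = 0
    prev = None
    for b in bits:
        cur = cur + 1 if b == prev else 1
        prev = b
        best = max(best, cur)
    return best

observed = [1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0]
obs_stat = longest_run(observed)

# Null baseline: the same statistic computed on random coin flips.
null_stats = [longest_run([random.randint(0, 1) for _ in observed]) for _ in range(10000)]
p_value = sum(s >= obs_stat for s in null_stats) / len(null_stats)
print(f"longest observed run = {obs_stat}, approximate p-value under null = {p_value:.3f}")
```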
Learning pattern recognition is easier in MATLAB.
There are several examples, and there are functions to use.
It is good for understanding the concepts and for experiments.
I would recommend starting with some MATLAB toolbox. MATLAB is an especially convenient place to start playing around with stuff like this due to its interactive console. A nice toolbox I personally used and really liked is PRTools (http://prtools.org); they have an implementation of pretty much every pattern recognition tool and also some other machine learning tools (neural networks, etc.). But the nice thing about MATLAB is that there are many other toolboxes you can try out as well (there is even a proprietary toolbox from MathWorks).
Whenever you feel comfortable enough with the different tools (and have found out which classifier is performing best for your problem), you can start thinking about implementing the machine learning in a different application.

How to get started on Information Extraction?

Could you recommend a training path to start and become very good at Information Extraction? I started reading about it for one of my hobby projects and soon realized that I would have to be good at math (algebra, stats, probability). I have read some of the introductory books on different math topics (and it's so much fun). Looking for some guidance. Please help.
Update: Just to answer one of the comments: I am more interested in Text Information Extraction.
Depending on the nature of your project, natural language processing and computational linguistics can both come in handy - they provide tools to measure and extract features from the textual information, and to apply training, scoring, or classification.
Good introductory books include O'Reilly's Programming Collective Intelligence (chapters on "searching and ranking", document filtering, and maybe decision trees).
Suggested projects utilizing this knowledge: POS (part-of-speech) tagging and named entity recognition (the ability to recognize names, places, and dates in plain text). You can use Wikipedia as a training corpus, since most of the target information is already extracted in infoboxes - this might provide you with a limited amount of measurement feedback.
The other big hammer in IE is search, a field not to be underestimated. Again, O'Reilly's book provides some introduction to basic ranking; once you have a large corpus of indexed text, you can do some real IE tasks with it. Check out Peter Norvig: Theorizing from Data as a starting point, and a very good motivator - maybe you could reimplement some of their results as a learning exercise.
As a forewarning, I think I'm obligated to tell you that information extraction is hard. The first 80% of any given task is usually trivial; however, the difficulty of each additional percentage point usually grows exponentially - in development and research time. It's also quite underdocumented - most of the high-quality info is currently in obscure white papers (Google Scholar is your friend) - do check them out once you've burned your hands a couple of times. But most importantly, do not let these obstacles throw you off - there are certainly big opportunities to make progress in this area.
I would recommend the excellent book Introduction to Information Retrieval by Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze. It covers a broad area of issues which form a great and up-to-date (2008) basis for Information Extraction and is available online in full text (under the given link).
I would suggest you take a look at the Natural Language Toolkit (nltk) and the NLTK Book. Both are available for free and are great learning tools.
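For a quick taste, here is a tiny NLTK example of the POS-tagging and named-entity steps mentioned above; the sentence is made up, and the exact model packages to download may vary by NLTK version:

```python
# A tiny NLTK example of part-of-speech tagging and named-entity chunking.
# Requires downloading the tokenizer, tagger, and chunker models first.
import nltk
for pkg in ("punkt", "averaged_perceptron_tagger", "maxent_ne_chunker", "words"):
    nltk.download(pkg, quiet=True)

sentence = "Peter Norvig joined Google in Mountain View in 2001."
tokens = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(tokens)   # part-of-speech tags
tree = nltk.ne_chunk(tagged)    # named-entity chunks (PERSON, GPE, ...)
print(tagged)
print(tree)
```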
You don't need to be good at math to do IE. Just understand how the algorithm works, experiment on the cases for which you need optimal performance and the scale at which you need to hit your target accuracy level, and work with that. You are basically working with algorithms, programming, and aspects of CS/AI/machine learning theory, not writing a PhD paper on building a new machine learning algorithm where you have to convince someone by way of mathematical principles why the algorithm works, so I totally disagree with that notion. There is a difference between practice and theory - as we all know, mathematicians are stuck more on theory than on the practicability of algorithms for producing workable business solutions. You would, however, need to do some background reading, both books on NLP and journal papers, to find out what people learned from their results. IE is a very context-specific domain, so you would need to define first in what context you are trying to extract information: How would you define this information? What is your structured model? Suppose you are extracting from semi-structured and unstructured data sets. You would then also want to weigh whether you want to approach your IE with a standard human approach, which involves things like regular expressions and pattern matching, or whether you want to do it using statistical machine learning approaches like Markov chains. You can even look at hybrid approaches.
A standard process model you can follow to do your extraction is to adapt a data/text mining approach:
pre-processing - define and standardize your data for extraction from various or specific sources, cleansing your data
segmentation/classification/clustering/association - your black box where most of your extraction work will be done
post-processing - cleansing your data back to where you want to store it or represent it as information
Also, you need to understand the difference between what is data and what is information, since you can reuse your discovered information as sources of data to build more information maps/trees/graphs. It is all very contextualized.
These are the standard steps: input -> process -> output.
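As a toy illustration of that pipeline using the regular-expression approach mentioned above (the text and patterns are invented, and real IE needs far more robust handling):

```python
# A toy pre-process -> extract -> post-process pipeline using regular expressions.
import re

raw = "  Dr. Jane Smith (jane.smith@example.org) spoke on 2010-03-12 in Boston.  "

# pre-processing: normalize whitespace
text = " ".join(raw.split())

# extraction: pull out dates and e-mail addresses with simple patterns
dates = re.findall(r"\b\d{4}-\d{2}-\d{2}\b", text)
emails = re.findall(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b", text)

# post-processing: arrange the extracted values as structured information
record = {"dates": dates, "emails": emails}
print(record)
```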
If you are using Java/C++ there are loads of frameworks and libraries available you can work with.
Perl would be an excellent language to do your NLP extraction work with if you want to do a lot of standard text extraction.
You may want to represent your data as XML or even as RDF graphs (Semantic Web), and for your defined contextual model you can build up relationship and association graphs that will most likely change as you make more and more extraction requests. Deploy it as a RESTful service if you want to treat it as a resource for documents. You can even link it to taxonomized data sets and faceted searching, say using Solr.
Good sources to read are:
Handbook of Computational Linguistics and Natural Language Processing
Foundations of Statistical Natural Language Processing
Information Extraction Applications in Prospect
An Introduction to Language Processing with Perl and Prolog
Speech and Language Processing (Jurafsky)
Text Mining Application Programming
The Text Mining Handbook
Taming Text
Algorithms of Intelligent Web
Building Search Applications
IEEE Journal
Make sure you do a thorough evaluation before deploying such applications/algorithms into production, as they can recursively increase your data storage requirements. You could use AWS/Hadoop for clustering, and Mahout for large-scale classification, among others. Store your datasets in MongoDB, or put unstructured dumps into Jackrabbit, etc. Try experimenting with prototypes first. There are various archives you can use to base your training on, say the Reuters corpus, TIPSTER, TREC, etc. You can even check out the Alchemy API, GATE, UIMA, OpenNLP, etc.
Building extractions from standard text is easier than from, say, a web document, so representation at the pre-processing step becomes even more crucial for defining what exactly it is you are trying to extract from a standardized document representation.
Standard measures include precision, recall, and F1 measure, amongst others.
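For reference, a quick sketch of how those measures are computed, using made-up extracted and gold-standard sets:

```python
# Precision, recall, and F1 computed from a hypothetical set of extracted items
# versus a gold-standard set.
extracted = {"Boston", "2010-03-12", "Paris"}
gold = {"Boston", "2010-03-12", "London"}

true_positives = len(extracted & gold)
precision = true_positives / len(extracted)  # fraction of extracted items that are correct
recall = true_positives / len(gold)          # fraction of gold items that were found
f1 = 2 * precision * recall / (precision + recall)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```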
I disagree with the people who recommend reading Programming Collective Intelligence. If you want to do anything of even moderate complexity, you need to be good at applied math, and PCI gives you a false sense of confidence. For example, when it talks about SVMs, it just says that libSVM is a good way of implementing them.
Now, libSVM is definitely a good package, but who cares about packages? What you need to know is why SVMs give the terrific results that they give and how they are fundamentally different from the Bayesian way of thinking (and how Vapnik is a legend).
IMHO, there is no single solution to it. You should have a good grip on linear algebra, probability, and Bayesian theory. Bayes, I should add, is as important for this as oxygen is for human beings (that's a little exaggerated, but you get what I mean, right?). Also, get a good grip on machine learning. Just using other people's work is perfectly fine, but the moment you want to know why something was done the way it was, you will have to know something about ML.
Check these two for that:
http://pindancing.blogspot.com/2010/01/learning-about-machine-learniing.html
http://measuringmeasures.com/blog/2010/1/15/learning-about-statistical-learning.html
http://measuringmeasures.com/blog/2010/3/12/learning-about-machine-learning-2nd-ed.html
Okay, now that's three of them :)
The Wikipedia Information Extraction article is a quick introduction.
At a more academic level, you might want to skim a paper like Integrating Probabilistic Extraction Models and Data Mining to Discover Relations and Patterns in Text.
Take a look here if you need an enterprise-grade NER service. Developing an NER system (and its training sets) is a very time-consuming and highly skilled task.
This is a little off topic, but you might want to read Programming Collective Intelligence from O'Reilly. It deals indirectly with text information extraction, and it doesn't assume much of a math background.

What are your feelings on functional specs? And Software design? [closed]

Does a functional spec help or hinder your expectations? Are programmers who practice the waterfall methodology open to functional specs? I'm a web designer/developer working with a team of 5 programmers and would like to write a functional spec to explain what we need, so that when I begin my work I know we are working towards the same goal.
I won't start any freelance project until I've got a design spec and functional spec written up and signed off. There's too much room for rogue clients to nickel and dime you to death if you don't have it. The functional spec allows you to stay on target/focused and gives you a natural check list to work to.
If there's no functional spec then you get all the "what ifs" starting to creep in and developers thinking - you know, this would be useful, and it'll only take me an hour. Sure an hour to code the prototype and get it basically working - plus the day to design all the tests and make sure all test cases are covered, then another couple of days to iron out all the bugs, then time to write the documentation. There's far too much room for what seems like a trivial addition to be inserted when there's no spec. You've no doubt heard of the infamous "scope creep". There's also far too much room for clients to say "that's not what I wanted..." when you deliver it and try and wriggle out of paying you.
If you've got the design spec and the functional spec written up ahead of the development and both you and the client have signed off that your understanding of not only the basic details but all the nuances of the language used is one and the same - only then can the real work begin.
There are a couple of anecdotes out there the first is quite true, while the other is a common misconception:
Software development is only 15% about the code, the rest is resource/people management.
It takes 20% of the time to complete the first 80% of the project and the remaining 80% of the time to complete the last 20%.
The misconception is that a working prototype is 80% of the way there - don't be fooled, it is not. So it's easy for a client to say "what's taking so long, I thought you were almost done!" and then quibble that they're paying too much for something that should've been finished months ago. Some of the design methodologies out there really lend themselves well to this popular misconception. The waterfall design methodology is one of them if it's not used correctly.
My view is make sure your understanding is the same, both sign off. Set milestones and make the client very aware at the outset that prototypes are a long way from the completion of the project and set expectations right from the outset as to what those milestones are and when the client can expect to see them delivered.
For project managers of development teams documentation and expectations are everything. You can't live without it, it's the only form of recourse you have against "That's not what I said" or "That's not what I meant" and ergo "I'm not paying you".
I learned from a lot of mistakes made by far more qualified developers than me and saw what it did to them. The two most important documents for your project are the design spec and the functional spec. Don't leave home without them or it can [or most likely will] come back and bite you in the ass.
Addendum re: User Stories:
An additional note about user stories is that they're great for what they are. They should be included in your functional specification but shouldn't be your functional specification.
A user story only describes a single task. It is very lightweight and doesn't contain excessive detail. As the common recommendations go, it should fit on a 3x5 card...if you as a project manager handed me a 3x5 card and told me to write a piece of software based on what I read, no doubt it would be handed to the user at the end and they'd tell the project manager that's not what they wanted.
You need a far greater level of detail in a functional spec. It shouldn't be limited to basic workflows. A functional spec is a whole bunch of user stories along with notes on interpretation of those user stories, improvements that can be made to them, common tasks that can be combined to improve efficiency. The list goes on.
So user stories are a good beginning, but they're not a replacement for a functional spec, they're bullet points in a functional spec.
I work with mostly the Waterfall model, and solely with functional specs. When working on my own (where I can set my own model and program any way I want) I start by writing up functional specs and then implementing them. It gives me a much better idea of the size and scope of the work, helps me estimate the time involved, and helps ensure that I don't miss anything.
Also, you can pass this document to:
Users so that they can make their requirements clear
Developers to create the functionality
Testers to make sure they are testing the right thing
Architects so that they can analyze the requirements
Using functional requirements over user stories is a matter of preference and the scope of the project. If you have a small user base, then you may be able to get away with user stories (which outline various event sequences the user might do), but for larger projects, you should use functional requirements as they have more detail and lead to fewer misunderstandings. Think of them as a means of communication with all people involved in the project.
Frankly, the Functional Specifications should already be part of your Big-M (Waterfall) methodology. Your functional specification is WHAT you are going to build; not necessarily how you are going to build it (which would be your detailed design/specification and the next step in the waterfall).
If you haven't written one yet, stop what you are doing and write one. You are just wasting time if you do otherwise. You can start here with Joel's article.
It took me more than 10 years to get it beat into my head to write a functional spec before doing any code. Now I will write one for anything taking more than a day to write. The level of detail, and level of assumptions should be as much as needed to clearly define what needs to be done and communicate it to others (or remind yourself), anything beyond that is a waste.
Others prefer User Stories ... which is fine too, as long as you do some kind of planning.
Another way to accomplish this is using user stories
I find well-written functional specs very useful. A well organized functional specification can also help organize your tests (many-to-many mapping from individual requirements to test cases).
<p style="tongue: in-cheek">They also prove useful for finger-pointing in larger organizations (The requirements were inaccurate! The implementation didn't follow the requirement! QA didn't properly test this requirement! etc.)</p>
I'll second Codeslave's reference to Painless Functional Specification. It's a good series of articles on specifications. See also this Stackoverflow post for a discussion on what content to put into functional specs.
I've done a few large projects, including one with some hundreds of person-years of total effort. As a project team gets larger, the number of informal communication channels goes up with a quadratic upper bound (n people have up to n(n-1)/2 pairwise channels, so a 10-person team already has 45). Without a spec, this informal communication mechanism is the only way things can get communicated. With a spec, the communication channels approach a hub-and-spokes arrangement, making the growth more like a linear function of the project team size.
A spec is really the only scalable way to get the team 'singing off the same hymn sheet'. There should be some review and negotiation about the spec, but ultimately someone has to take ownership of it to avoid the project becoming a free-for-all.
I think they're a lovely idea, and should be tried.
Just a few comments on some of the answers here...
First of all, I do believe that a good spec document is important for any moderately complex requirement (and definitely for highly complex ones). But make sure it is a good spec, i.e. don't just state the obvious (as one poster already mentioned), but also don't leave out those parts that may seem trivial to you (or even to the developers), since you might have more knowledge of that part of the system than some of the others involved (e.g. testers or documenters), who will appreciate the otherwise "missing bits".
And if your spec is good, it will get read - in my experience (and I've written and read lots of specs over the last few years), it's the bad specs that get dumped and the good ones that get followed.
Concerning user stories (sometimes also called use cases): I think these are important to get an idea of the scenario, but they usually can't replace the details (e.g. screen mockups; how, where, and whether a feature is configurable; the data model; migration issues, etc.), so you'll probably need both for more complex requirements.
In my experience, functional specs have a fine line between not saying enough and saying too much.
If they don't say enough, then they leave areas open to misunderstanding.
If they say too much, they don't leave enough "wiggle room" to improve the solution.
And they should always be open to a process of revision.
It depends on the functional specification. I've had functional specifications where the writer knew the system inside and out, and wrote the specification as such, and I've had other writers write it with just what they expected to see as a user.
With the former, it's easier because I know exactly what I need to write and where I need to write it, but it limits how easily I can refactor the existing code, since estimates took into account just this feature, and not refactoring existing code that it touches.
With the latter, I have freedom to refactor (so long as end functionality is the same), but if it's a system I'm unfamiliar with, the time estimate is harder to make.
Overall, I love functional specifications -- I also like to remind people that they're written on paper that bends, and as such, should be flexible.
Whether you call them functional specs, business requirements, or user stories, they are all very beneficial to the development process.
The issue with them comes when they are chiseled in stone and used as a device to pass blame around when the system ultimately doesn't fit the users' real needs. I prefer to use the functionals or requirements as a starting point for an iterative process, not as the bible for exactly how the system will look when it is complete. Things change, and users typically don't have an understanding of what they want until they have something (a working prototype, possibly) in their hands that they can work with, instead of conceptualizing on a piece of paper how it will function in the real world. The most successful projects I've implemented were ones where the development team and the users were closely aligned and were able to rapidly turn around changes, instead of holding people to what they wanted on a piece of paper six months ago.
Of course this process wouldn't work if you were in a fixed-bid type of situation as one of the earlier answers pointed out.
One interesting substitute for a func spec or user stories that I have seen advocated is to write a user manual for the software first.
Even if this is only notional (i.e. if you do not intend to ship any manual - which you probably shouldn't as nobody will read it), it can be a useful and reasonably lightweight way to reach a common understanding of what the software will do and look like. And it forces you to do the design up front.
I've found that even if you write functional specs, a lot of the detail is sometimes lost on the person you are trying to address. Now, if a functional spec is written along with some kind of UI mockup, that makes a huge difference. UI mockups not only show the user what you are envisioning, they also trigger users into remembering things they wouldn't have thought of in the first place.
I have seen and written many specs, some were very good, most weren't. The main thing that they all had in common is that they were never followed. They all had cobwebs on them by the 3rd day of coding.
Write the spec if you want; the best time to do it is at the end of the project. It will be useless for developers to follow, but it will at least be an accurate representation of what was done (I know - not really a spec). But don't expect the spec to get the code written for you. That is the real job. If you don't know what to write, talk to your stakeholders and users, get some good user stories, and then get on with it.
