Machine Learning and Natural Language Processing [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Questions asking us to recommend or find a tool, library or favorite off-site resource are off-topic for Stack Overflow as they tend to attract opinionated answers and spam. Instead, describe the problem and what has been done so far to solve it.
Closed 8 years ago.
Assume you know a student who wants to study Machine Learning and Natural Language Processing.
What specific computer science subjects should they focus on and which programming languages are specifically designed to solve these types of problems?
I am not looking for your favorite subjects and tools, but rather industry standards.
Example: I'm guessing that knowing Prolog and Matlab might help them. They also might want to study Discrete Structures*, Calculus, and Statistics.
*Graphs and trees. Functions: properties, recursive definitions, solving recurrences. Relations: properties, equivalence, partial order. Proof techniques, inductive proof. Counting techniques and discrete probability. Logic: propositional calculus, first-order predicate calculus. Formal reasoning: natural deduction, resolution. Applications to program correctness and automatic reasoning. Introduction to algebraic structures in computing.

This related Stack Overflow question has some nice answers: What are good starting points for someone interested in natural language processing?
This is a very big field. The prerequisites mostly consist of probability/statistics, linear algebra, and basic computer science, although Natural Language Processing requires a more intensive computer science background to start with (frequently covering some basic AI). Regarding specific languages: Lisp was created "as an afterthought" for doing AI research, while Prolog (with its roots in formal logic) is especially aimed at Natural Language Processing, and many courses will use Prolog, Scheme, Matlab, R, or another functional language (e.g. OCaml is used for this course at Cornell), as they are very suited to this kind of analysis.
Here are some more specific pointers:
For Machine Learning, Stanford CS 229: Machine Learning is great: it includes everything, from full videos of the lectures (also up on iTunes) to course notes and problem sets, and it was very well taught by Andrew Ng.
Note the prerequisites:
Students are expected to have the following background: knowledge of basic computer science principles and skills, at a level sufficient to write a reasonably non-trivial computer program; familiarity with basic probability theory; and familiarity with basic linear algebra.
The course uses Matlab and/or Octave. It also recommends the following readings (although the course notes themselves are very complete):
Christopher Bishop, Pattern Recognition and Machine Learning. Springer, 2006.
Richard Duda, Peter Hart and David Stork, Pattern Classification, 2nd ed. John Wiley & Sons, 2001.
Tom Mitchell, Machine Learning. McGraw-Hill, 1997.
Richard Sutton and Andrew Barto, Reinforcement Learning: An Introduction. MIT Press, 1998.
For Natural Language Processing, the NLP group at Stanford provides many good resources. The introductory course Stanford CS 224: Natural Language Processing includes all the lectures online and has the following prerequisites:
Adequate experience with programming and formal structures. Programming projects will be written in Java 1.5, so knowledge of Java (or a willingness to learn on your own) is required. Knowledge of standard concepts in artificial intelligence and/or computational linguistics. Basic familiarity with logic, vector spaces, and probability.
Some recommended texts are:
Daniel Jurafsky and James H. Martin. 2008. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition. Second Edition. Prentice Hall.
Christopher D. Manning and Hinrich Schütze. 1999. Foundations of Statistical Natural Language Processing. MIT Press.
James Allen. 1995. Natural Language Understanding. Second Edition. Benjamin/Cummings.
Gerald Gazdar and Chris Mellish. 1989. Natural Language Processing in Prolog. Addison-Wesley. (this is available online for free)
Frederick Jelinek. 1998. Statistical Methods for Speech Recognition. MIT Press.
The prerequisite computational linguistics course requires basic computer programming and data structures knowledge, and uses the same textbooks. The required artificial intelligence course is also available online along with all the lecture notes and uses:
S. Russell and P. Norvig, Artificial Intelligence: A Modern Approach. Second Edition
This is the standard Artificial Intelligence text and is also worth reading.
I use R for machine learning myself and really recommend it. For this, I would suggest looking at The Elements of Statistical Learning, for which the full text is available online for free. You may want to refer to the Machine Learning and Natural Language Processing views on CRAN for specific functionality.

My recommendation would be any or all of these (depending on their amount and area of interest):
The Oxford Handbook of Computational Linguistics
Foundations of Statistical Natural Language Processing
Introduction to Information Retrieval

String algorithms, including suffix trees. Calculus and linear algebra. Various areas of statistics. Artificial intelligence optimization algorithms. Data clustering techniques... and a million other things. This is a very active field right now, depending on what you intend to do.
It doesn't really matter what language you choose to operate in. Python, for instance, has NLTK, which is a pretty nice free package for tinkering with computational linguistics.
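To give a flavour of that tinkering, here is a minimal sketch (assuming NLTK is installed and its tokenizer/tagger models have been downloaded; the model names below may vary between NLTK versions):
import nltk

nltk.download("punkt")                       # tokenizer models (one-time download)
nltk.download("averaged_perceptron_tagger")  # part-of-speech tagger model

text = "Machine learning and natural language processing overlap heavily."
tokens = nltk.word_tokenize(text)  # split the sentence into word tokens
print(nltk.pos_tag(tokens))        # tag each token with its part of speech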

I would say probability and statistics is the most important prerequisite. In particular, Gaussian Mixture Models (GMMs) and Hidden Markov Models (HMMs) are very important in both machine learning and natural language processing (of course, these topics may be part of the course if it is introductory).
Then, I would say basic CS knowledge is also helpful, for example algorithms, formal languages and basic complexity theory.
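To make the HMM point a little more concrete, here is a minimal sketch of the forward algorithm in plain Python (the model probabilities below are made up purely for illustration):
states = ["Rainy", "Sunny"]
observations = ["walk", "shop", "walk"]
start_p = {"Rainy": 0.6, "Sunny": 0.4}
trans_p = {"Rainy": {"Rainy": 0.7, "Sunny": 0.3},
           "Sunny": {"Rainy": 0.4, "Sunny": 0.6}}
emit_p = {"Rainy": {"walk": 0.1, "shop": 0.4, "clean": 0.5},
          "Sunny": {"walk": 0.6, "shop": 0.3, "clean": 0.1}}

# forward[t][s] = P(observations[0..t], state at time t = s)
forward = [{s: start_p[s] * emit_p[s][observations[0]] for s in states}]
for obs in observations[1:]:
    prev = forward[-1]
    forward.append({s: sum(prev[r] * trans_p[r][s] for r in states) * emit_p[s][obs]
                    for s in states})

print(sum(forward[-1].values()))  # total probability of the observation sequence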

The Stanford CS 224: Natural Language Processing course that was already mentioned also includes videos online (in addition to the other course materials). The videos aren't linked from the course website, so many people may not notice them.

Jurafsky and Martin's Speech and Language Processing http://www.amazon.com/Speech-Language-Processing-Daniel-Jurafsky/dp/0131873210/ is very good. Unfortunately, the draft second edition chapters are no longer free online now that it's been published :(
Also, if you're a decent programmer, it's never too early to toy around with NLP programs. NLTK (Python) comes to mind. It has a book you can read free online that was also published (by O'Reilly, I think).

How about Markdown and an Introduction to Parsing Expression Grammars (PEG) posted by cletus on his site cforcoding?
ANTLR seems like a good place to start for natural language processing. I'm no expert though.

Broad question, but I certainly think that a knowledge of finite state automata and hidden Markov models would be useful. That requires knowledge of statistical learning, Bayesian parameter estimation, and entropy.
Latent semantic indexing is a common and relatively recent tool used in many machine learning problems. Some of the methods are rather easy to understand. Here are a few potential starter projects:
Find co-occurrences in text corpora for document/paragraph/sentence clustering.
Classify the mood of a text corpus.
Automatically annotate or summarize a document.
Find relationships among separate documents to automatically generate a "graph" among the documents.
EDIT: Nonnegative matrix factorization (NMF) is a tool that has grown considerably in popularity due to its simplicity and effectiveness. It's easy to understand. I currently research the use of NMF for music information retrieval; NMF has been shown to be useful for latent semantic indexing of text corpora as well. Here is one paper (PDF).
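If you want to see the latent-semantic-indexing idea in a few lines, here is a minimal sketch with numpy (an assumed library; the tiny term-document counts are invented for illustration):
import numpy as np

# rows = terms, columns = documents (toy counts)
A = np.array([[2, 0, 1, 0],   # "music"
              [1, 0, 2, 0],   # "audio"
              [0, 3, 0, 1],   # "text"
              [0, 1, 0, 2]],  # "corpus"
             dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
docs = (np.diag(s[:k]) @ Vt[:k]).T   # each row is a document in the k-dimensional latent space

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(docs[0], docs[2]))  # high: documents 0 and 2 share the "music"/"audio" terms
print(cosine(docs[0], docs[1]))  # low: documents 0 and 1 share nothing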

Prolog will only help them academically; it is also limited to logic constraints and semantic NLP-based work. Prolog is not yet an industry-friendly language, so it is not yet practical in the real world. Matlab is likewise an academic tool; unless they are doing a lot of scientific or quantitative work, they won't have much need for it. To start off, they might want to pick up the 'Norvig' book, enter the world of AI, and get a grounding in all the areas: basic probability, statistics, databases, operating systems, data structures, and most likely an understanding of and experience with a programming language. They need to be able to prove to themselves why AI techniques work and where they don't. Then they can look at specific areas like machine learning and NLP in further detail. In fact, the Norvig book cites references after every chapter, so they already have a lot of further reading available, and there is a lot of reference material available for them over the internet, in books, and in journal papers for guidance. Don't just read the book: try to build tools in a programming language, then extract 'meaningful' results. Did the learning algorithm actually learn as expected? If it didn't, why was that the case, and how could it be fixed?

Related

Theory of automata prerequisites

I'm interested in automata theory to improve my understanding of programming and compiler design (I would like to create some simple syntaxes in my own projects, for example: L-systems, AI, neural net structures, and intelligent object-to-object conversation, 'AI dialog'), but there are things I need to learn before I go forward.
There are a lot of new symbols and mathematical concepts I need to learn before studying automata theory. I could not copy and paste examples because of the symbols, and
I don't have the required reputation to post an image, so here's a link to a wiki article.
Context-free grammar article on Wikipedia
Under the heading "Proper CFGs" you can see some definitions. I don't understand them.
Could someone please tell me what this notation is called so I can Google it? Any other pointers or information would also be helpful, but just knowing a few key words will help. Also, if anyone knows of a comprehensive resource that can be accessed for free (e.g., an IIT video lecture on the subject of that notation), I would be eternally grateful, as I
can't afford tutoring or even textbooks at this time.
The resource I'm using at the moment for automata theory (for anyone who is interested) is the Theory of Automata IIT lectures on YouTube.
The symbols ∀ and ∃ are logical quantifiers, respectively meaning "for all" and "there exists".
Typically you are first introduced to them in a discrete mathematics course, though they're a part of predicate logic (also known as first-order logic); in my particular university's CS program, Discrete Math is a pre-requisite for Logic for Computer Science, which in turn is a pre-requisite for Formal Languages and Automata.
The star * symbol in the term (V union Sigma)* there is studied in formal languages/automata theory itself: it is the Kleene star operator. Its input is an alphabet (a set of symbols), and it produces the set of all strings of zero or more symbols over that alphabet.
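A quick way to get a feel for both of these on a computer is a small Python sketch (the alphabet and predicates below are made up for illustration):
# ∀ ("for all") and ∃ ("there exists") over a finite set, via all()/any():
xs = [2, 4, 6]
print(all(x % 2 == 0 for x in xs))   # ∀x ∈ xs. x is even  -> True
print(any(x % 2 == 1 for x in xs))   # ∃x ∈ xs. x is odd   -> False

# Kleene star: Σ* is the set of all finite strings over the alphabet Σ.
# Here are the strings of length <= 3 over Σ = {a, b}:
from itertools import product
sigma = ["a", "b"]
strings = ["".join(w) for n in range(4) for w in product(sigma, repeat=n)]
print(strings[:8])                   # ['', 'a', 'b', 'aa', 'ab', 'ba', 'bb', 'aaa']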
A useful tool for studying formal languages and automata is JFLAP.
This topic, at the level that you have referred to in your link, is really only for mathematicians or graduate-level theoretical computer science students. The symbols you are referring to are just symbolic logic. If you are really interested in automata theory, I would recommend trying to find resources that explore the topic from a conceptual level and avoid using complex logical statements. OR, if you really want to dive in, you can teach yourself symbolic logic, some set theory, probably some modern algebra, and then tackle automata theory from there.
I read many books on the subject of languages and automata, including the Dragon books on compilers (and the much more pragmatic Jack Crenshaw's Let's Write a Compiler), but none of it really clicked until I read the classic Finite and Infinite Machines by Marvin Minsky. Being an old book, it does not cover the latest research and developments in the field at all, but he explains the state of the art of the 1960s in automata, neural networks, Turing machines, functional programming and lambda calculus, and the oft-neglected third wheel of string-rewriting systems. And the writing is exceptionally good and engaging. IIRC Minsky even co-authored a robot story with Isaac Asimov, so he has some serious writing credentials.
Like I say, this book will not bring you up-to-date in any of these fields, but it's the best book I've found for explaining everything from the ground up. And it would provide a very firm basis for reading anything more recent. This book is in the bibliography of every book published since.

Which English tutorial would you advise to learn OCaml? [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 6 years ago.
I want to advertise OCaml to beginners, and I am looking for good tutorials in English; not tutorials that you have only heard of, but ones that you have actually tried and found useful...
I quite like the book Developing Applications With Objective Caml -- I guess the title should be updated to mirror the 'OCaml' naming decision. It is old and therefore slightly out of date, but only on minor aspects -- e.g., it presents the stream syntax as belonging to the core language, whereas it is now provided as a Camlp4 extension. The book is surprisingly complete, and there is a lot of meat already in chapters 2, 3 and 4.
This book covers a bit of system programming, but if that's what the reader is interested in, I would rather recommend the separate book Unix system programming in OCaml -- also translated into English by a community effort.
Finally, if one wants to discover the theoretical underpinnings of OCaml, I found the U3 book, Using, Understanding, and Unraveling the OCaml Language, to be a great resource. But it's only for readers that already know about OCaml.
PS: I have a very good opinion of Jason Hickey's introduction to Objective Caml as well, but I can't say I have read it in full, only glanced at it. That's the problem with beginners' books: you can really read at most one good one.
For me, the primary one is:
$ apt-cache show ocaml-book-en
Package: ocaml-book-en
Source: ocaml-book
Version: 1.0-5
Installed-Size: 7061
Maintainer: Debian QA Group <packages#qa.debian.org>
Architecture: all
Recommends: www-browser | pdf-viewer
Description-en: English book: "Developing applications with Objective Caml"
This is the English translation of the O'Reilly's OCaml French
book "Developpement d'applications avec Objective Caml" that can
be found in the ocaml-book-fr package.
.
This package contains both the HTML and PDF version of the book.
:)
There is also a great book for system programming in OCaml and a cookbook-style resource here.
The tutorial I used when learning, and the one I always recommend to beginners, is ocaml-tutorial.org (mirrored at OCamlCore, as the original site went down).
Here is a book that is intended for newcomers to programming and also those who know some programming but want to learn programming in the function-oriented paradigm, or those who simply want to learn OCaml.
An OCaml port of the book How to Think Like a Computer Scientist has been created by Nicolas Monje.
According to the website, the PDF version of the book should be downloaded.
From the book:
The goal of this book is to teach you to think like a computer scientist. This way of thinking combines some of the best features of mathematics, engineering, and natural science. Like mathematicians, computer scientists use formal languages to denote ideas (specifically computations). Like engineers, they design things, assembling components into systems and evaluating tradeoffs among alternatives. Like scientists, they observe the behavior of complex systems, form hypotheses, and test predictions.
The single most important skill for a computer scientist is problem solving. Problem solving means the ability to formulate problems, think creatively about solutions, and express a solution clearly and accurately. As it turns out, the process of learning to program is an excellent opportunity to practice problem-solving skills. That’s why this chapter is called, “The way of the program.”
On one level, you will be learning to program, a useful skill by itself. On another level, you will use programming as a means to an end. As we go along, that end will become clearer.
I've just started with OCaml, and these are the tutorials that I find most helpful:
Documentation and user's manual - the most useful and the official one
Introduction to Caml - this one I used in my first days (recently), and it was really helpful because of its simplicity
I thought Jason Hickey's Introduction to Objective Caml was very good (the only actual text on the language I've read, and how I started). INRIA's documentation is nice as well; and reading module signatures by themselves is quite instructive once you get the hang of it ;)
Believe it or not, OCaml was the first language I (really) learned.
There is a new book, "Real World OCaml" by Jason Hickey and co-authors, that is going to be published soon. On the website there is a public beta available for free. Despite the fact that the book is not finished yet, I didn't notice any major mistakes or irrelevancies.
It gave me a full-fledged understanding of OCaml. It contains lots of examples illustrating concepts and could easily be considered a tutorial. I also liked that it partly covers the standard modules (List, ListLabels, Map, Sys, String, maybe some others).
"The Runtime System" section in this book is very useful. It provides details about the compiler implementation, memory management, linkage with foreign code, and language cost intuition. The latter I consider very important, because many functional programming books cover concepts without saying how expensive they are in terms of memory and time. I highly recommend this book, especially since there is a free online version.

Where can I find math topics and resources for programmers?

There are a few questions that circle around this one, but I feel this is different enough.
I've decided I want to improve the breadth and depth of my maths skills, specifically in areas that are useful and/or interesting to programmers.
What topics should I study?
What resources do you recommend (blogs/books/online lectures...)?
I'm looking for easy-to-consume resources because I'll be doing this in my free time; I don't want to spend days struggling through a dense text, but I want to get deeper than the surface. I've read the Yegge article on the topic (and most of the comments), which is useful, but I think the voting system here will help me focus on the most useful/best resources and topics.
Edit:
I am looking to create myself a study course that I'll follow over the next few years. I'm not looking to solve a particular problem; I just want to learn some new skills that will interest me and may be useful in my career in the future.
Concrete Mathematics: A Foundation for Computer Science would be my suggestion for a book that covers some advanced topics.
For an introduction to Discrete Mathematics I strongly suggest this.
I feel very lucky to have been provided with this book at university.
Any programmer would do well to have a solid undergraduate-level understanding of the following math courses:
Calculus (up through multivariate calc)
Discrete Mathematics (absolutely essential)
Linear Algebra (necessary for an understanding of matrices)
Combinatorics (further development of discrete maths)
Introduction to Abstract Algebra (this will solidify your understanding of modulo number systems, in particular binary, octal, hex, etc.). It also gives a deep understanding of set theory, which is ubiquitous in practical programming and the comp sci literature.
These are the fundamentals. If you are thinking about graphics or game programming then you have a whole slew of additional courses in physics, graphic arts, and possibly fluid dynamics. Also, differential geometry is essential for any real-world modeling of motion on curved surfaces.
It's a bit off from your question, but let me suggest the Princeton Companion to Mathematics.
It gives an overview of all of mathematics, so it is more than "math useful to programmers", but its style is as easy to understand as it gets, and the important parts are in there.
Some time ago Steve Yegge wrote a dedicated article about math for programmers. His thesis is: as a programmer you should learn math, but you should do so in a different way than in school/university.
His summary is this:
Math is a lot easier to pick up after you know how to program. In fact, if you're a halfway decent programmer, you'll find it's almost a snap.
They teach math all wrong in school. Way, WAY wrong. If you teach yourself math the right way, you'll learn faster, remember it longer, and it'll be much more valuable to you as a programmer.
Knowing even a little of the right kinds of math can enable you to write some pretty interesting programs that would otherwise be too hard. In other words, math is something you can pick up a little at a time, whenever you have free time.
Nobody knows all of math, not even the best mathematicians. The field is constantly expanding, as people invent new formalisms to solve their own problems. And with any given math problem, just like in programming, there's more than one way to do it. You can pick the one you like best.
Math is... ummm, please don't tell anyone I said this; I'll never get invited to another party as long as I live. But math, well... I'd better whisper this, so listen up: (it's actually kinda fun.)
Sad note: Steve abandoned his blog because of too much aggressive feedback.
If you have any interest in game development, 3D graphics, or anything somewhat related to those, then do multivariate calculus and basic physics. This will help you understand the basic concepts much better. Also, linear algebra will help immensely with all of the matrix/vector stuff you will be doing.
If you are NOT interested in these topics, I would still say study calculus and physics. Why? Solving calculus and physics problems gives you good experience in problem solving and exercises the brain. Programmers NEED to be good problem solvers... that is our job. Concepts you pick up from these courses are things you will keep with you the rest of your life.
MIT and Stanford both have really good online courses for topics such as this. Of course you can't just jump into multivariate calculus without some more basic calc, but MIT and Stanford have resources for your basic calculus classes as well. Basic physics will be a little bit easier to pick up. Again, you can check MIT and Stanford for physics.
MIT OpenCourseWare:
Single Variable Calculus
Calculus
Multivariable Calculus
Physics I: Classical Mechanics
Mathematics for Computer Science
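To make the earlier linear algebra point concrete, here is a tiny sketch with numpy (an assumed library, not something from the course list) of the matrix/vector work that shows up in graphics:
import numpy as np

theta = np.pi / 2                       # rotate by 90 degrees
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])   # 2D rotation matrix
p = np.array([1.0, 0.0])                # a point on the x-axis
print(R @ p)                            # approximately [0., 1.]: the point is now on the y-axis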
Generally speaking, the applications of math to computer programming are pretty domain-specific - that is, you need to know whatever math the specific program you're writing requires. The only mathematical topics I can think of that are generally applicable to all kinds of programming are simple arithmetic and boolean logic, but I think if you didn't already know those you wouldn't be much of a programmer ;-)
Basically, I would just recommend learning the math as needed for whatever project you're working on. If you want to give yourself a good excuse to learn some new math, start a hobby program that does something mathematical.
As for topics, look at some of the answers here. Recommended resources are difficult for me to give, as I'm a German speaker. I would recommend starting with linear algebra and geometry, which you will find in computer graphics. Look at the undergraduate math series by Springer, for example.
Number theory doesn't have many direct applications to programming (though there are some neat tricks you can use for optimization), but there are several basic concepts that make cryptology much easier to study.
My number theory class used Silverman's Friendly Introduction to Number Theory, which is one of the best math textbooks I've ever seen. It's very easy to read (the title is entirely accurate about its friendliness), but covers a wide range of topics. Silverman is also an author on my cryptography textbook, An Introduction to Mathematical Cryptography. It's more technical, addresses most areas of cryptography, and provides plenty of references for where to find more detail.
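As a tiny illustration of why the number theory pays off in cryptology, here is a toy RSA round trip in Python (textbook-sized primes, purely for illustration; pow(e, -1, phi) needs Python 3.8+, and keys this small are never secure):
p, q = 61, 53
n = p * q                  # public modulus
phi = (p - 1) * (q - 1)    # Euler's totient of n
e = 17                     # public exponent, coprime with phi
d = pow(e, -1, phi)        # private exponent: modular inverse of e mod phi

message = 42
ciphertext = pow(message, e, n)    # encrypt: m^e mod n
plaintext = pow(ciphertext, d, n)  # decrypt: c^d mod n
print(ciphertext, plaintext)       # plaintext comes back as 42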
Consider Knuth's Art of Computer Programming series. It can get dense, but it will ground you in the math most needed for programming. I'd suggest going for the available fascicles of Volume 4 early on. These books are not for everybody, but if you find them interesting you will learn a whole lot.
They won't teach you calculus or geometry, which are important in many aspects of programming but tend to be more specialized.
I think you should dive into whatever interests you most, and in order to find out what that is you should get some books which cover the facts and offer orientation, and some books which nurture your motivation and curiosity. You really have to dive into it to find out; it's a pretty individual thing, imho.
Facts / Orientation:
Donald Knuth
Bronstein, Semendjajew
The Science of Programming
Data Structures and Algorithms
Motivation / Curiosity:
The Road to Reality
Fermat's Last Theorem
Gödel, Escher, Bach
Also for motivation on the more practical side:
projecteuler.net
What sorts of math problems do you want to solve? 'Math' is a pretty big area!
MIT has some online courses, but that's probably a big time investment.
Wolfram has some tutorials, but again, you need to know what you're looking for.

How to get started on Information Extraction?

Could you recommend a training path to start with and become very good at Information Extraction? I started reading about it to do one of my hobby projects and soon realized that I would have to be good at math (algebra, stats, probability). I have read some of the introductory books on different math topics (and it's so much fun). Looking for some guidance. Please help.
Update: Just to answer one of the comments: I am more interested in Text Information Extraction.
Depending on the nature of your project, natural language processing and computational linguistics can both come in handy -- they provide tools to measure and extract features from the textual information, and to apply training, scoring, or classification.
Good introductory books include O'Reilly's Programming Collective Intelligence (the chapters on searching and ranking, document filtering, and maybe decision trees).
Suggested projects utilizing this knowledge: POS (part-of-speech) tagging and named entity recognition (the ability to recognize names, places, and dates in plain text). You can use Wikipedia as a training corpus, since most of the target information is already extracted in infoboxes -- this might provide you with a limited amount of measurement feedback.
The other big hammer in IE is search, a field not to be underestimated. Again, O'Reilly's book provides some introduction to basic ranking; once you have a large corpus of indexed text, you can do some real IE tasks with it. Check out Peter Norvig: Theorizing from data as a starting point, and a very good motivator -- maybe you could reimplement some of their results as a learning exercise.
As a forewarning, I think I'm obligated to tell you that information extraction is hard. The first 80% of any given task is usually trivial; however, the difficulty of each additional percentage point usually grows exponentially -- in development and research time. It's also quite underdocumented: most of the high-quality info is currently in obscure white papers (Google Scholar is your friend), so do check them out once you've burned your hand a couple of times. But most importantly, do not let these obstacles throw you off -- there are certainly big opportunities to make progress in this area.
I would recommend the excellent book Introduction to Information Retrieval by Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze. It covers a broad area of issues which form a great and up-to-date (2008) basis for Information Extraction and is available online in full text (under the given link).
I would suggest you take a look at the Natural Language Toolkit (nltk) and the NLTK Book. Both are available for free and are great learning tools.
You don't need to be good at math to do IE; just understand how the algorithm works, experiment on the cases for which you need optimal performance, and work at the scale at which you need to achieve the target accuracy level. You are basically working with algorithms, programming, and aspects of CS/AI/machine learning theory, not writing a PhD paper on building a new machine-learning algorithm where you have to convince someone by way of mathematical principles why the algorithm works, so I totally disagree with that notion. There is a difference between practice and theory -- as we all know, mathematicians are stuck more on theory than on the practicability of algorithms to produce workable business solutions. You would, however, need to do some background reading, both books on NLP and journal papers, to find out what people found from their results. IE is a very context-specific domain, so you would need to define first in what context you are trying to extract information: How would you define this information? What is your structured model? Suppose you are extracting from semi-structured and unstructured data sets. You would then also want to weigh up whether you want to approach your IE with a standard human approach, which involves things like regular expressions and pattern matching, or with statistical machine-learning approaches like Markov chains. You can even look at hybrid approaches.
A standard process model you can follow to do your extraction is to adapt a data/text mining approach:
pre-processing - define and standardize your data for extraction from various or specific sources, cleansing your data
segmentation/classification/clustering/association - your black box where most of your extraction work will be done
post-processing - cleansing your data back to where you want to store it or represent it as information
Also, you need to understand the difference between data and information, as you can reuse your discovered information as a source of data to build more information maps/trees/graphs. It is all very contextualized.
standard steps for: input->process->output
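As a rough illustration of those input->process->output steps, here is a minimal sketch using nothing but regular expressions (the patterns and sample text are invented for illustration):
import re

raw = "Contact: alice@example.org, 2019-05-04. Backup: bob@example.org (2020-01-12)"

# pre-processing: normalize whitespace (and whatever cleansing your sources need)
text = " ".join(raw.split())

# process: pattern-matching extraction of email addresses and ISO dates
emails = re.findall(r"[\w.+-]+@[\w-]+\.[\w.]+", text)
dates = re.findall(r"\d{4}-\d{2}-\d{2}", text)

# post-processing: shape the results into the structure you want to store or serve
record = {"emails": sorted(set(emails)), "dates": dates}
print(record)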
If you are using Java/C++ there are loads of frameworks and libraries available you can work with.
Perl would be an excellent language to do your NLP extraction work with if you want to do a lot of standard text extraction.
You may want to represent your data as XML or even as RDF graphs (Semantic Web), and for your defined contextual model you can build up relationship and association graphs that will most likely change as you make more and more extraction requests. Deploy it as a RESTful service, as you want to treat it as a resource for documents. You can even link it to taxonomized data sets and faceted search, say using Solr.
Good sources to read are:
Handbook of Computational Linguistics and Natural Language Processing
Foundations of Statistical Natural Language Processing
Information Extraction Applications in Prospect
An Introduction to Language Processing with Perl and Prolog
Speech and Language Processing (Jurafsky)
Text Mining Application Programming
The Text Mining Handbook
Taming Text
Algorithms of Intelligent Web
Building Search Applications
IEEE Journal
Make sure you do a thorough evaluation before deploying such applications/algorithms into production, as they can recursively increase your data storage requirements. You could use AWS/Hadoop for clustering and Mahout for large-scale classification, amongst others. Store your datasets in MongoDB, or as unstructured dumps in Jackrabbit, etc. Try experimenting with prototypes first. There are various archives you can use to base your training on, say the Reuters corpus, TIPSTER, TREC, etc. You can even check out Alchemy API, GATE, UIMA, OpenNLP, etc.
Building extractions from standard text is easier than from, say, a web document, so representation at the pre-processing step becomes even more crucial for defining what exactly it is you are trying to extract from a standardized document representation.
Standard measures include precision, recall, and F1 measure, amongst others.
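For reference, a minimal sketch of how those measures are computed (the counts below are toy numbers made up for illustration):
def precision_recall_f1(true_positives, false_positives, false_negatives):
    precision = true_positives / (true_positives + false_positives)  # of what we extracted, how much was right
    recall = true_positives / (true_positives + false_negatives)     # of what was there, how much we found
    f1 = 2 * precision * recall / (precision + recall)               # harmonic mean of the two
    return precision, recall, f1

print(precision_recall_f1(true_positives=80, false_positives=20, false_negatives=40))
# -> (0.8, 0.666..., 0.727...)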
I disagree with the people who recommend reading Programming Collective Intelligence. If you want to do anything of even moderate complexity, you need to be good at applied math and PCI gives you a false sense of confidence. For example, when it talks of SVM, it just says that libSVM is a good way of implementing them.
Now, libSVM is definitely a good package, but who cares about packages? What you need to know is why SVM gives the terrific results that it gives and how it is fundamentally different from the Bayesian way of thinking (and how Vapnik is a legend).
IMHO, there is no one solution to it. You should have a good grip on linear algebra, probability, and Bayesian theory. Bayes, I should add, is as important for this as oxygen is for human beings (that's a little exaggerated, but you get what I mean, right?). Also, get a good grip on machine learning. Just using other people's work is perfectly fine, but the moment you want to know why something was done the way it was, you will have to know something about ML.
Check these two for that :
http://pindancing.blogspot.com/2010/01/learning-about-machine-learniing.html
http://measuringmeasures.com/blog/2010/1/15/learning-about-statistical-learning.html
http://measuringmeasures.com/blog/2010/3/12/learning-about-machine-learning-2nd-ed.html
Okay, now that's three of them :) / Cool
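If you want to play with the contrast in code, here is a small sketch using scikit-learn (an assumed library; the answer itself only mentions libSVM) that fits both an SVM and a naive Bayes classifier on toy sentences:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

texts = ["great movie, loved it", "terrible movie, hated it",
         "loved the acting", "hated the plot"]
labels = [1, 0, 1, 0]                       # 1 = positive, 0 = negative (toy data)

vectorizer = CountVectorizer().fit(texts)   # bag-of-words features
features = vectorizer.transform(texts)

for model in (MultinomialNB(), LinearSVC()):
    model.fit(features, labels)
    print(type(model).__name__, model.predict(vectorizer.transform(["loved the movie"])))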
The Wikipedia Information Extraction article is a quick introduction.
At a more academic level, you might want to skim a paper like Integrating Probabilistic Extraction Models and Data Mining to Discover Relations and Patterns in Text.
Take a look here if you need an enterprise-grade NER service. Developing a NER system (and training sets) is a very time-consuming and highly skilled task.
This is a little off topic, but you might want to read Programming Collective Intelligence from O'Reilly. It deals indirectly with text information extraction, and it doesn't assume much of a math background.

Is programming a subset of math? [closed]

As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance.
Closed 10 years ago.
I've heard many times that all programming is really a subset of math. Some suggest that OO, at its roots, is mathematically based, but I don't get the connection, aside from some obvious examples:
using induction to prove a recursive algorithm,
formal correctness proofs,
functional languages,
lambda calculus,
asymptotic complexity,
DFAs, NFAs, Turing Machines, and theoretical computation in general,
and the fact that everything on the box is binary.
I know math is very important to programming, but I struggle with this "subset" view. In what ways is programming a subset of math?
I'm looking for an explanation that might have relevance to enterprise/OO development, if there is a strong enough connection, that is.
It's math in the sense that it requires abstract thought about algorithms etc.
It's engineering when it involves planning schedules, deliverables, testing.
It's art when you have no idea how it's going to eventually turn out.
Programming is one of the most difficult branches of applied mathematics; the poorer mathematicians had better remain pure mathematicians.
--E. W. Dijkstra
Overall, remember that mathematics is a formal codification of logic, which is also what we do in software.
The list of topics in your question is loaded with mathematical problems. We are able to do programming at a fairly high level of abstraction, so the raw mathematics may not be staring you in the face. For example, you mentioned DFAs: you can use a regular expression in your programs without knowing any math, but you'll find more of a need for mathematics when you want to design a good regular expression engine.
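For instance, the DFA behind a regex can be simulated in a few lines; here is a minimal sketch (the machine below, which accepts binary strings with an even number of 1s, is invented for illustration):
def accepts(string):
    # transition table: (current state, input symbol) -> next state
    transitions = {("even", "0"): "even", ("even", "1"): "odd",
                   ("odd", "0"): "odd",   ("odd", "1"): "even"}
    state = "even"                     # start state
    for symbol in string:
        state = transitions[(state, symbol)]
    return state == "even"             # "even" is the only accepting state

print(accepts("1001"))   # True: two 1s
print(accepts("10"))     # False: one 1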
I think you've hit on an interesting point. Programming is an art and a science. There are a lot of "tools of the trade", and you don't necessarily sit down and do a lot of high-level mathematics in order to simply write a program. In fact, when you're programming, you may not really be doing much mathematics or computer science.
It's when we start to solve difficult problems in computer science that mathematics shows up. The deeper you go, the more it will flesh itself out, often at lower levels of abstraction.
There are also some realms of programming that you don't necessarily have to work in, but they involve more math. For example, while you can certainly learn a language and write some apps without any formal mathematics, you won't get very far in algorithm analysis without some applied math.
OK, I was a math and CS major in college. I would say that if the set A is Math and the set B is CS, then A intersects B. It's not a subset.
There's no doubt that many of the fathers and mothers of computer science were mathematicians, like Turing and Dijkstra. Most of the founders of the internet had PhDs in math, physics, or engineering. Most of the core concepts of computer science come from math, but the act of programming isn't really math. Math helps us in our daily lives, but the two aren't the same.
But there is no doubt that the original reasoning behind the computer was to well, compute things. We have come a long way from there in such a short time.
It doesn't mention programming, but the idea is still relevant.
Einstein was known in 1917 as a famous mathematician. It wasn't until Hiroshima that the general public finally came around to the realization that physics is not just applied mathematics.
When people don't understand something, they try to understand it as a type of something that they do understand. They think by analogy. Programming has been described as a field of math, engineering, science, art, craft, construction... None of these are completely false; it borrows from all of these. The real issue is that the field of programming is only about 50 years old. People have not integrated it into their mental taxonomies.
There's a lot of confusion here.
First of all, "programming" does not (currently) equal "computer science." When Dijkstra called himself a "programmer" (more or less inventing the title), he was not pumping out CRUD applications, but actually doing applied computer science. Let's not let that confuse us-- today, there is a vast difference between what most programmers in a business setting do and computer science.
Now, the argument can be made that computer science is a branch of mathematics; but, as Knuth points out (in his paper "Computer Science and its Relation to Mathematics", collected in his Selected Papers on Computer Science) it can also be argued that mathematics is a branch of computer science.
In fact, I'd strongly recommend this paper to anyone thinking about the relationship between mathematics and computer science, as Knuth lays out the territory nicely.
But, to return to your original question: to a practitioner, "enterprise/OO development" is pretty far removed from mathematics -- but that's largely because most of the serious mathematics involved at the lower levels of operation has been abstracted away (by compilers, operating systems, instruction sets, etc.). Similarly, advanced knowledge of the physics of the internal combustion engine is not required for driving a car. Naturally, if you want to design a more efficient car....
if your definition of math includes all forms of formal logic, and programming is defined only by the logic and calculations extant in the code, then programming is a subset of math QED ;-)
but this is like saying that painting is merely putting colored pigments on a surface - it completely ignores the art, the insight, the intuition, the entire creative process
one could argue that music is a subset of math by the same reasoning
so I'd have to say no, programming is not a subset of math. Programming uses a subset of math, but requires non-math skills/talent as well [much like music composition]
Disclaimer: I work as an IT consultant and develop mainly portals and architecture stuff. I have a psychology degree. I never studied maths at university. And I get my job done, and usually well. Why? Because I don't think you need to know maths (as in 'heavy' maths stuff) to write code. You need analytical thinking, problem-solving skills, and a high level of abstraction. But maths does not give you that. It's just another discipline that requires similar skills. My studies in psychology also apply to my daily work when dealing with usability issues and data storage. Linguistics and semiotics also play a part.
But wait, just don't flame me yet. I'm not saying maths is not needed at all for computers -- obviously, you need real math skills when designing encryption algorithms, hardware, etc. -- but if, like lots of programmers, you just work in a mid/low-level language (like C) or with higher-level stuff (like C# or Java), consuming mostly pre-built frameworks and APIs, you don't really need to understand the mathematical principles behind Fourier transforms or Huffman trees or Möbius strips... let someone else handle that, and let me build value on top of it. I am not stupid. I know the difference between linear and exponential algorithms and data structures, etc. I just don't have the interest to rewrite quicksort or a spiffy new video compression technique.
Well, aside from all that...!
Math is used for many aspects of programming such as
Creating efficient and smart algorithms
Understanding Big O notation
Security (such as RSA)
Many more...
I think that programming needs math to survive. But I wouldn't call it a subset. It's just like blowing glass uses properties of physics, but those artists don't call themselves physicists.
The foundation of everything we do is math.
Luckily, we don't need to be good at math itself to do it. Just like you don't need to understand physics to drive a car or even fly a plane.
The difference between programming and pure mathematics is the concept of state.
Have a look at http://en.wikipedia.org/wiki/Dynamic_logic_(modal_logic). It's a way of mathematically analyzing things changing through time. Also, Hoare triples are a way of formalizing the input-output behavior of programs. By having some axioms dealing with sequential composition of programs and how assignment works, you can perfectly well deal with state changing over time in a mathematically rigorous way.
If the math you know is insufficient, "invent" some new math to deal with what you want to analyze. Newton and Leibniz did it for analysis (aka calculus, I think). No reason to not do it for computation and programming.
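As a tiny illustration of a Hoare triple {P} S {Q} -- if precondition P holds before statement S, postcondition Q holds afterwards -- here is a sketch in Python, with assertions standing in for the pre- and postcondition:
def increment(x):
    n = x                   # "ghost" variable recording the initial value
    assert x == n           # precondition P: x = n
    x = x + 1               # statement S
    assert x == n + 1       # postcondition Q: x = n + 1
    return x

print(increment(41))        # 42; the assignment axiom justifies {x = n} x := x + 1 {x = n + 1}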
I don't believe I've heard that programming is a subset of math. Even the link you provide is simply a proposed approach to programming (not claiming it's a subset of mathematics) and the wiki page has plenty of disagreements in it as well.
Programming requires (at least some) applied mathematics. Mathematics can be used to help describe and analyze programs and program fragments. Programming has a very close relationship with math and uses it and concepts from it heavily. But subset? no.
I'd love to see someone actually claim that it is one, with some clear reasoning. I don't think I ever have.
Just because you can use mathematics to reason about something does not imply that it is, ipso facto, a mathematical object. Mathematics is used to reason about internal combustion engines, radioactive decay and juggling patterns. Using mathematics is not doing mathematics.
I would say...
It's partly math, especially at the theoretical level. Imagine designing efficient searching/sorting/clustering/allocating/fooifying algorithms, that's all math... running the gamut from number theory to statistics.
It's partly engineering. Complex systems can rarely achieve ideal levels of performance and reliability, and software is no exception. A lot of software development is about achieving robustness in the face of unreliable hardware and (ahem) humans.
And it's partly art. Creative and idiosyncratic software design often comes up with great new ideas... like assembly language, multitasking operating systems, graphical user interfaces, dynamic languages, and the web.
Just my 2¢...
Math + art + logic
You can actually argue that math, in the form of logical proofs, is analogous to programming --
Check out the Curry-Howard correspondence. It's probably more the way a mathematician would look at things, but I think this is hitting the proverbial nail on the head.
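For a very loose sketch of that correspondence (using Python type hints, which are unchecked at runtime; the correspondence proper is stated for typed lambda calculi): an implication A -> B corresponds to a function type, a conjunction of A and B to a pair, and a program of that type is a proof of the proposition.
from typing import Callable, Tuple, TypeVar

A = TypeVar("A")
B = TypeVar("B")
C = TypeVar("C")

def compose(f: Callable[[A], B], g: Callable[[B], C]) -> Callable[[A], C]:
    # composing "proofs" of A -> B and B -> C yields A -> C (transitivity of implication)
    return lambda a: g(f(a))

def first(pair: Tuple[A, B]) -> A:
    # projection "proves" (A and B) -> A
    return pair[0]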
Programming may have originally started as a quasi-subset of math, but the increasing complexity of the field over time has led to programming being the art and science of creating good abstractions for information processing and computation.
Programming does involve math, engineering, and an aesthetic sense for good design and implementation. Algorithms are an extension of mathematics, and the systems engineering side overlaps with other engineering disciplines to some degree. However, neither mathematics nor other engineering fields have the same level of need for complex, flexible, and yet understandable abstractions that can be used and adapted at so many different levels to solve new and evolving problems.
It is the need for useful, flexible, and dynamic abstractions which led first to the creation of function libraries, then class/component libraries, and in more recent years design patterns and service-oriented architectures. Although the latter have more of a design focus, they are a reaction to the increasing need to build high-level abstractional bridges between programming problems and solutions.
For all of these reasons, programming is neither a subset nor a superset of math. It is simply yet another field which uses math that has deeper roots in it than others do.
The topics you listed are topics in Theoretical Computer Science, and THAT is a branch of Pure Mathematics. Programming is an applied science which uses theoretical computer science. Programming itself isn't a branch of mathematics but the Lambda Calculus/theory of computation/formal logic/set theory etc that programming languages are based on is.
Also I completely disagree with Dijkstra. It's either self-congratulatory or Dijkstra is being misquoted/quoted out of context. Pure mathematics is a very very very difficult field. It is so enormously abstract that no branch of applied mathematics is comparable in difficulty. It is one field that requires enormous leaps of imagination. I did my first degree in computer science where I focused a lot on theoretical CS and applied areas like programming, OS, compilers. I also did a degree in Electrical Engineering - arguably the most difficult branch of engineering - and worked on difficult areas of applied mathematics like Maxwell's equations, control theory and partial differential equations in general.
I've also done research in applied and pure mathematics, and to this day I find applied far easier. As for the pure mathematicians, they're a whole different breed.
Now there's a tendency for someone to study a year or two of calculus unhinged from application and conclude that pure mathematics is easy. They have no idea what they're talking about. Studying calculus or even topology unhinged from application does not give you any inkling of what a pure mathematician does. The task of actually proving those theorems is so profoundly difficult that I will defer to a computer scientist to point out the distinction:
"If P = NP, then the world would be a profoundly different place than we usually assume it to be. There would be no special value in 'creative leaps,' no fundamental gap between solving a problem and recognizing the solution once it’s found. Everyone who could appreciate a symphony would be Mozart; everyone who could follow a step-by-step argument would be Gauss..." —Scott Aaronson, (Theoretical Computer Scientist, MIT)
I think mathematics provides a set of tools which programmers use at an abstract level to solve real-world problems.
I would say that programming is less about math than it used to be as we move up to 4th Generation Languages. Assembly is very much about math, C#, not so much. Thoughts?
If you just want the design specs handed out to you by your boss, then it's not much math, but such work isn't fun at all... However, coming up with how to do things does require mathematical ideas, at least things like abstraction, graphs, sometimes number theory stuff and, depending on the problem, calculus. Personally, the more I've been involved with programming, the more I see the mathematical side of it. However, most of the time, IMO, you can just pick up a book from the library and look up the basics of the thing you need to do, but that requires some mathematical grasp upfront.
You really can't design "good" algorithms without understanding the maths behind them. Searching in Google only takes you so far.
Programming is too wide a subject. Good software is based not only on math (logic) but also on psychology, linguistics, etc. Algorithms are part of math, but there are many other programming-related things besides algorithms.
As a mathematician, it is clear to me that Math is not equal to Programming but that the process which is used to solve problems in either discipline is extremely similar.
Solving a higher-level mathematics question requires analytical thinking, a toolbox of possible ways of solving problems, experience with the field, and some formalized ways of constructing your answer so that other mathematicians agree. If you find a particularly clever, abstract, or elegant way of solving a problem, you get kudos from your fellow mathematicians. For particularly difficult math problems, you may solve the problem in stages, and codify your stage arguments using things called conjectures and proofs.
I think programming involves the same set of skills. In programming, the same set of principles applies to the solving and presenting of solutions to problems. When you have a partial solution to a programming dilemma, you include it as part of your personal library and use it as part of another, bigger problem later. These skills seem very similar to the skills used in mathematics.
The major difference between math and programming is that the different disciplines of programming have much more in common with one another than the different fields of math do. Two fields of mathematics can be very, very different in presentation and in what is used to communicate the field. By contrast, programming structures, to me at least, look very similar in many different languages.
The difference between programming and pure mathematics is the concept of state. A program is a state machine that uses logic (maths) to transition between states. The actual logic used to transition between states is usually very simple, which is why being a math genius doesn't necessarily help you all that much as a programmer.
Part of the reason I'm a programmer is because I don't like math. I have no problem with math itself, and I'm fine with it conceptually, I just don't like doing calculations by hand. When I found I could tell a computer what the math problem is and let it do the calculating for me, a life-long passion and career was born.
To answer the question, according to my alma mater, math == programming since they allowed me to take Intro to C++ to fulfill my math requirement.
Edit: I should mention my degree is in telecommunications which, at the time, had only the standard liberal arts math requirement of one semester.
Math is the purest form of truth. Everything inherits from math.
Amen.
It's interesting to compare programming with music too. In the UK, anyway, there are computing-based undergrad university courses that will accept applicants on the basis of music qualifications as opposed to computing, due to the logic, patterns, etc. involved.
Maths is powerful and programming is powerful; if maths is a subset of programming, then it is equally true to state that programming is a subset of maths.
Maths is described using language, often written down. Therefore is maths a subset of writing too?
Historically maths came before computer programming, but then lists and processes probably preceded maths, both of which could equally be thought of as mathematical or as to do with programming.
Certainly programming can be represented using maths, so there is some basis for it being true that programming is a subset of maths. However, a computer program could also implement maths, representing information symbolically, as maths typically does when done on paper, including the infinite and the only somewhat defined, from the fundamental axioms, as well as allowing higher-level structures to be defined that use each other and other sorts of relationships beyond composition, supporting the drawing of diagrams and allowing the system to be expanded. Maths is equally a subset of programming.
While maths can represent structures such as words, maths is by design about numbers. Strings, for example, are more programmatic than mathematical.
It's half math, half man speak, duh.
