besides BM25, whats other ranking functions exists? - information-retrieval

besides BM25, what's other ranking functions exists? Where I found information on this topic?

BM25 is one of term-based ranking algorithms. Nowadays there are concept-based algorithms as well.
BM25 if state-of-the-art of term based information retrieval; however, there are some challenges that term-based cannot overcome such as, relating synonyms, matching an abbreviation or recognizing homonyms.
Here are the examples:
synonym: "buy" and "purchase"
antonym: "Professor" and "Prof."
homonym:
bow – a long wooden stick with horse hair that is used to play certain string instruments such as the violin
bow – to bend forward at the waist in respect (e.g. "bow down")
To deal with these problems, some are using concept-based models such as this article and this article.
Concept-based models are mostly using dictionaries or external terminologies to identify concepts and each have their own representation of concepts or weighting algorithms.

vanilla tf-idf is what is often used. If you want to learn about these things the best place to start is this book.

Related

What are the symbols for specializations and generalizations etc. in crow foot notation

My university is forcing me to learn from a terrible textbook on ERD's, in which they are using a notation I personally don't like because I've never seen it used before (the book is so bad it doesn't even say what notation they're using), and I like learning it using a more common notation. Therefore I chose to learn it using the crow foot notation. (Please enlighten me if you think this is a bad decision)
Now the textbook is covering is-a(n) relationships (a.k.a. specializations/generalizations) and I'd like to know how I have to represent one in a consistent way with martin/crow foot notation... I learned about it thanks to a yt video (https://www.youtube.com/watch?v=MTG1zl8PkXk) but I noticed he's not using the same notation I'm using.
So how do I represent a specialization or generalization in crow foot notation? Or is crow foot notation only specific to cardinality? In my textbook, a few pages ahead, I also see concepts like multidimensional relationships (entity A has the same relation with entity B as with entity C) and relationships that refer to the same entity itsself (so 1 employee can hire multiple other employee's). Extra much love if you can show me how I should draw those as well :)
Unfortunately I could not find much information on this using search engines...
Try searching on "EER Diagram PDF". You'll get more images than you can shake a stick at. Some of them use crow's foot notation. Others don't.
The extra "E" stands for "Enhanced". This has to do with the fact that the original ER model did not have modeling conventions for Gen-spec (superclass/subclass) or for unions.
Unlike most people, I prefer to make a sharp distinction between diagrams that depict an ER model and ones that depict a relational model. Contrary to prevailing opinion, ER modeling isn't just "relational lite". It's a different model, with different purposes. You can look up the history if you are really interested.
I tend to use crow's foot notation in ER diagrams, and I always leave out junction boxes and foreign keys. This makes the diagram more useful for stakeholders who want to see the big picture.
I like arrowhead notation for relational diagrams. Foreign keys and junction boxes must be included in relational diagrams. They are part of the model, and implement relationships.
As far as a relational table design for gen-spec, I don't think you can beat Fowler's treatment of the subject. Try searching on "Fowler Class Table Inheritance" for an entry point into this aspect of the topic.

Usage of Difference Equations(recurrence relations) in IT fields

I'm doing a IT diploma and Mathematics also be tought there.In these days, we have to learn Difference Equations(recurrence relations) and I'm confused with what are the usages of these concepts in different areas of IT like computing, algorithms and data structures, circuit analysis, etc.
Can someone please explain why we learn these concepts and usage of them. It will be helpful my learnings.
Unutbu has an excellent explanation and link to what is a recurrence relation. Here is a small window into the why.
Computer scientists need to know how algorithms compare to one another (in terms of speed or memory space). The Recurrence Relations help provide this comparison (on an order of magnitude).
Here is a general question: How many possible moves in chess are there? If there are 318 billion moves (for the first four), could you use an exhaustive algorithm to search for the best, first move? If that is impractical, what algorithms might trim that to a reasonable size? The complexity measures give us an insight into the difficulty of the problem and into the practicality of a given algorithm.
Recurrence relations come up in complexity theory. The number of steps performed by a recursive algorithm can be expressed as a recurrence relation, such as
The goal then is to express T as (or bound T by) a (hopefully simple) function of n.
The master theorem is one example of how this is done.

Generating articles automatically

This question is to learn and understand whether a particular technology exists or not. Following is the scenario.
We are going to provide 200 english words. Software can add additional 40 words, which is 20% of 200. Now, using these, the software should write dialogs, meaningful dialogs with no grammar mistake.
For this, I looked into Spintax and Article Spinning. But you know what they do, taking existing articles and rewrite it. But that is not the best way for this (is it? let me know if it is please). So, is there any technology which is capable of doing this? May be semantic theory that Google uses? Any proved AI method?
Please help.
To begin with, a word of caution: this is quite the forefront of research in natural language generation (NLG), and the state-of-the-art research publications are not nearly good enough to replace human teacher. The problem is especially complicated for students with English as a second language (ESL), because they tend to think in their native tongue before mentally translating the knowledge into English. If we disregard this fearful prelude, the normal way to go about this is as follows:
NLG comprises of three main components:
Content Planning
Sentence Planning
Surface Realization
Content Planning: This stage breaks down the high-level goal of communication into structured atomic goals. These atomic goals are small enough to be reached with a single step of communication (e.g. in a single clause).
Sentence Planning: Here, the actual lexemes (i.e. words or word-parts that bear clear semantics) are chosen to be a part of the atomic communicative goal. The lexemes are connected through predicate-argument structures. The sentence planning stage also decides upon sentence boundaries. (e.g. should the student write "I went there, but she was already gone." or "I went there to see her. She has already left." ... notice the different sentence boundaries and different lexemes, but both answers indicating the same meaning.)
Surface Realization: The semi-formed structure attained in the sentence planning step is morphed into a proper form by incorporating function words (determiners, auxiliaries, etc.) and inflections.
In your particular scenario, most of the words are already provided, so choosing the lexemes is going to be relatively simple. The predicate-argument structures connecting the lexemes needs to be learned by using a suitable probabilistic learning model (e.g. hidden Markov models). The surface realization, which ensures the final correct grammatical structure, should be a combination of grammar rules and statistical language models.
At a high-level, note that content planning is language-agnostic (but it is, quite possibly, culture-dependent), while the last two stages are language-dependent.
As a final note, I would like to add that the choice of the 40 extra words is something I have glossed over, but it is no less important than the other parts of this process. In my opinion, these extra words should be chosen based on their syntagmatic relation to the 200 given words.
For further details, the two following papers provide a good start (complete with process flow architectures, examples, etc.):
Natural Language Generation in Dialog Systems
Stochastic Language Generation for Spoken Dialogue Systems
To better understand the notion of syntagmatic relations, I had found Sahlgren's article on distributional hypothesis extremely helpful. The distributional approach in his work can also be used to learn the predicate-argument structures I mentioned earlier.
Finally, to add a few available tools: take a look at this ACL list of NLG systems. I haven't used any of them, but I've heard good things about SPUD and OpenCCG.

How can I learn higher-level programming-related math without much formal training? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 6 years ago.
Improve this question
I haven't taken any math classes above basic college calculus. However, in the course of my programming work, I've picked up a lot of math and comp sci from blogs and reading, and I genuinely believe I have a decent mathematical mind. I enjoy and have success doing Project Euler, for example.
I want to dive in and really start learning some cool math, particularly discrete mathematics, set theory, graph theory, number theory, combinatorics, category theory, lambda calculus, etc.
My impression so far is that I'm well equipped to take these on at a conceptual level, but I'm having a really hard time with the mathematical language and symbols. I just don't "speak the language" and though I'm trying to learn it, I'm the going is extremely slow. It can take me hours to work through even one formula or terminology heavy paragraph. And yeah, I can look up terms and definitions, but it's a terribly onerous process that very much obscures the theoretical simplicity of what I'm trying to learn.
I'm really afraid I'm going to have to back up to where I left off, get a mid-level math textbook, and invest some serious time in exercises to train myself in that way of thought. This sounds amazingly boring, though, so I wondered if anyone else has any ideas or experience with this.
If you don't want to attend a class, you still need to get what the class would have given you: time in the material and lots of practice.
So, grab that text book and start doing the practice problems. There really isn't any other way (unless you've figured out how osmosis can actually happen...).
There is no knowledge that can only be gained in a classroom.
Check out the MIT Courseware for Mathematics
Also their YouTube site
Project Euler is also a great way to think about math as it relates to programming
Take a class at your local community college. If you're like me you'd need the structure. There's something to be said for the pressure of being graded. I mean there's so much to learn that going solo is really impractical if you want to have more than just a passing nod-your-head-mm-hmm sort of understanding.
Sounds like you're in the same position I am. What I'm finding out about math education is that most of it is taught incorrectly. Whether a cause or result of this, I also find most math texts are written incorrectly. Exceptions are rare, but notable. For instance, anything written by Donald Knuth is a step in the right direction.
Here are a couple of articles that state the problem quite clearly:
A Gentle Introduction To Learning
Calculus
Developing Your Intuition For
Math
And here's an article on a simple study technique that aims at retaining knowledge:
Teaching linear algebra
Consider auditing classes in discrete mathematics and proofs at a local university. The discrete math class will teach you some really useful stuff (graph theory, combinatorics, etc.), and the proofs class will teach you more about the mathematical style of thinking and writing.
I'd agree with #John Kugelman, classes are the way to go to get it done properly but I'd add that if you don't want to take classes, the internet has many resources to help you, including recorded lectures which I find can be more approachable than books and papers.
I'd recommend checking out MIT Open Courseware. There's a Maths for Computer Science module there, and I'm enjoying working through Gilbert Strang's Linear Algebra course of video lectures.
Youtube and videolectures.com are also good resources for video lectures.
Finally, there's a free Maths for CS book at bookboon.
To this list I would now add The Haskel Road to Logic, Maths, and Programming, and Conceptual Mathematics: A First Introduction to Categories.
--- Nov 16 '09 answer for posterity--
Two books. Diestel's Graph Theory, and Knuth's Concrete Mathematics. Once you get the hang of those try CAGES.
Find a good mentor who is an expert in the field who is willing to spend time with you on a regular basis.
There is a sort of trick to learning dense material, like math and mathematical CS. Learning unfamiliar abstract stuff is hard, and the most effective way to do it is to familiarize yourself with it in stages. First, you need to skim it: don't worry if you don't understand everything in the first pass. Then take a break; after you have rested, go through it again in more depth. Lather, rinse, repeat; meditate, and eventually you may become enlightened.
I'm not sure exactly where I'd start, to become familiar with the language of mathematics; I just ended up reading through lots of papers until I got better at it. You might look for introductory textbooks on formal mathematical logic, since a lot of math (especially in language theory) is based off of that; if you learn to hack the formal stuff a bit, the everyday notation might look a bit easier.
You should probably look through books on topics you're personally interested in; the inherent interest should help get you over the hump. Also, make sure you find texts that are actually introductory; I have become wary of slim, undecorated hardbacks labeled Elementary Foobar Theory, which tend to be elementary only to postdocs with a PhD in Foobar.
A word of warning: do not start out with category theory -- it is the most boring math I have ever encountered! Due to its relevance to language design and type theory, I would like to know more about it, but so far I have not been able to deal...
For a nice, scattershot intro to bits of many kinds of CS-ish math, I recommend Godel, Escher, Bach by Hofstadter (if you haven't read it already, of course). It's not a formal math book, though, so it won't help you with the familiarity problem, but it is quite inspirational.
Mathematical notation is is akin to several computer languages:
concise
exacting
based on many idioms
a fair amount of local variations and conventions
As with a computer language, you don't need to "wash the whole elephant at once": take it one part a at time.
A tentative plan for you could be
identify areas of mathematics that are interesting or important to you. (seems you already have a bit of a sense for that, CS has helped you develop quite a culture for it.)
take (or merely audit) a few formal classes in this area. I agree with several answers in this post, an in-person course, at local college is preferable, but, maybe at first, or to be sure to get the most of a particular class, first self-teaching yourself in this area with MIT OCW, similar online resources and associated books is ok/fine.
if an area of math introduces too high of a pre-requisite in terms of fluency with notation or with some underlying concept or (most often mechanical computation and transformation techniques). No problem! Just backtrack a bit, learn these foundations (and just these foundations!) and move forward again.
Find a "guru", someone that has a broad mathematical culture and exposure, not necessarily a mathematician, physics folks are good too, indeed they can often articulate math in a more practical fashion. Use this guru to guide you, as he/she can show you how the big pieces fit together.
Note: There is little gain to be had of learning mathematical notation for its own sake. Rather it should be learned in context, just like say a C# idiom is better memorized when used and when associated with a specific task, rather than learned in vacuo. A related SO posting however provides several resources to decipher and learn mathematical notation
Project Euler takes problems out of context and drops them in for people to solve them. Project Euler cannot teach you anything effectively. I think you should forget about it, if it is popular it does not mean anything. You cannot study Mathematics through Project Euler as it contains only bits and pieces(and some pretty high level pieces) that you're supposed to know in order to solve the problems. Learning mathematics means to consider a subject and a read a book about it and solving exercices or reading solutions, that's how you learn math. If it so happens that through your reading you find something that is close to some project euler thing, your luck , but otherwise Project euler is a complete waste of time. I think the time is much better invested choosing a particular branch of mathematics and studying that. Let me explain why: I solved 3 pretty advanced Projec Euler problems and they were all making appeal to knowledge from Number theory which I happened to have because i studies some part of it. I do not think Iearned anything from Project Euler, it just happened that I already knew some number theory and solved the problems.
For example, if you find out you like number theory, take H. Davenport -> Hardy & Wright -> Kenneth & Rosen's , study those.
If you like Graph Theory take Reinhard Diestel's book which is freely available and study that(or check books.google.com and find whichever is more appropriate to your taste) but don't spread your attention in 999999 directions just because Project Euler has problems ranging from dynamic programming to advanced geometry or to advanced number theory, that is clearly the wrong way to go and it will not bring you closer to your goal.
This sounds amazingly boring
Well ... Mathematics is not boring when you find some problem that you are attached to, which you like and you'd like to find the solution to, and when you have the sufficient time to reflect on it while not behind a computer screen. Mathematics is done with pen and paper mostly(yes you can use computers .. but that's not really the point).
So, if you find a real-world problem, or some programming problem that would benefit from
you knowing some advanced maths, and you know what maths you have to study , it can be motivating to learn in that way.
If you feel you are not motivated it is hard to study properly.
There is also the question of what you actually mean when you say learn. Does the learning process stop after you solved the problems at the end of the chapter of a book ? Well you decide. You can consider you have finished learning that subject, or you can consider you have not finished and read more about it. There are entire books on just one equation and variations of it.
The amount of programming-related math that you can learn without formal training is limited, but it's more than enough. But maybe you can self-teach yourself.
It all boils down to your resources and motivation.
To know mathematics you have to do mathematics not programming(project euler).
For beginning to learn category theory I recommend David Spivak's Category Theory for the Sciences (AKA Category Theory for Scientists) because its relatively comprehensible due to many examples that enable understanding by analogy and which quickly builds a foundation for understanding more abstract concepts.
It requires the ability to reason logically and an intuitive notion of what is a set. It proceeds from sets and functions through basic category theory to adjoint functors, categories of functors, sheaves, monads and an introduction to operads. Two main threads throughout are modeling databases in terms of categories and describing categories with annotated diagrams called ologs. The bibliography provides references to more advanced and specialized topics including recent papers by Dr. Spivak.
An expected outcome from reading this book is the capability of understanding category theory texts and papers written for mathematicians such as Mac Lane's Category Theory for the Working Mathematician.
In PDF format it is available from http://math.mit.edu/~dspivak/teaching/sp13/ (the dynamic version is recommended since its the most recent). The open access HTML version is available from https://mitpress.mit.edu/books/category-theory-sciences (which is recommended since it includes additional content including answers to some exercises).

How to get started on Information Extraction?

Could you recommend a training path to start and become very good in Information Extraction. I started reading about it to do one of my hobby project and soon realized that I would have to be good at math (Algebra, Stats, Prob). I have read some of the introductory books on different math topics (and its so much fun). Looking for some guidance. Please help.
Update: Just to answer one of the comment. I am more interested in Text Information Extraction.
Just to answer one of the comment. I am more interested in Text Information Extraction.
Depending on the nature of your project, Natural language processing, and Computational linguistics can both come in handy -they provide tools to measure, and extract features from the textual information, and apply training, scoring, or classification.
Good introductory books include OReilly's Programming Collective Intelligence (chapters on "searching, and ranking", Document filtering, and maybe decision trees).
Suggested projects utilizing this knowledge: POS (part-of-speech) tagging, and named entity recognition (ability to recognize names, places, and dates from the plain text). You can use Wikipedia as a training corpus since most of the target information is already extracted in infoboxes -this might provide you with some limited amount of measurement feedback.
The other big hammer in IE is search, a field not to be underestimated. Again, OReilly's book provides some introduction in basic ranking; once you have a large corpus of indexed text, you can do some really IE tasks with it. Check out Peter Norvig: Theorizing from data as a starting point, and a very good motivator -maybe you could reimplement some of their results as a learning exercise.
As a fore-warning, I think I'm obligated to tell you, that information extraction is hard. The first 80% of any given task is usually trivial; however, the difficulty of each additional percentage for IE tasks are usually growing exponentially -in development, and research time. It's also quite underdocumented -most of the high-quality info is currently in obscure white papers (Google Scholar is your friend) -do check them out once you've got your hand burned a couple of times. But most importantly, do not let these obstacles throw you off -there are certainly big opportunities to make progress in this area.
I would recommend the excellent book Introduction to Information Retrieval by Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze. It covers a broad area of issues which form a great and up-to-date (2008) basis for Information Extraction and is available online in full text (under the given link).
I would suggest you take a look at the Natural Language Toolkit (nltk) and the NLTK Book. Both are available for free and are great learning tools.
You don't need to be good at math to do IE just understand how the algorithm works, experiment on the cases for which you need an optimal result performance, and the scale with which you need to achieve target accuracy level and work with that. You are basically working with algorithms and programming and aspects of CS/AI/Machine learning theory not writing a PhD paper on building a new machine-learning algorithm where you have to convince someone by way of mathematical principles why the algorithm works so I totally disagree with that notion. There is a difference between practical and theory - as we all know mathematicians are stuck more on theory then the practicability of algorithms to produce workable business solutions. You would, however, need to do some background reading both books in NLP as well as journal papers to find out what people found from their results. IE is a very context-specific domain so you would need to define first in what context you are trying to extract information - How would you define this information? What is your structured model? Supposing you are extracting from semi and unstructured data sets. You would then also want to weigh out whether you want to approach your IE from a standard human approach which involves things like regular expressions and pattern matching or would you want to do it using statistical machine learning approaches like Markov Chains. You can even look at hybrid approaches.
A standard process model you can follow to do your extraction is to adapt a data/text mining approach:
pre-processing - define and standardize your data to extraction from various or specific sources cleansing your data
segmentation/classification/clustering/association - your black box where most of your extraction work will be done
post-processing - cleansing your data back to where you want to store it or represent it as information
Also, you need to understand the difference between what is data and what is information. As you can reuse your discovered information as sources of data to build more information maps/trees/graphs. It is all very contextualized.
standard steps for: input->process->output
If you are using Java/C++ there are loads of frameworks and libraries available you can work with.
Perl would be an excellent language to do your NLP extraction work with if you want to do a lot of standard text extraction.
You may want to represent your data as XML or even as RDF graphs (Semantic Web) and for your defined contextual model you can build up relationship and association graphs that most likely will change as you make more and more extractions requests. Deploy it as a restful service as you want to treat it as a resource for documents. You can even link it to taxonomized data sets and faceted searching say using Solr.
Good sources to read are:
Handbook of Computational Linguistics and Natural Language Processing
Foundations of Statistical Natural Language Processing
Information Extraction Applications in Prospect
An Introduction to Language Processing with Perl and Prolog
Speech and Language Processing (Jurafsky)
Text Mining Application Programming
The Text Mining Handbook
Taming Text
Algorithms of Intelligent Web
Building Search Applications
IEEE Journal
Make sure you do a thorough evaluation before deploying such applications/algorithms into production as they can recursively increase your data storage requirements. You could use AWS/Hadoop for clustering, Mahout for large scale classification amongst others. Store your datasets in MongoDB or unstructured dumps into jackrabbit, etc. Try experimenting with prototypes first. There are various archives you can use to base your training on say Reuters corpus, tipster, TREC, etc. You can even check out alchemy API, GATE, UIMA, OpenNLP, etc.
Building extractions from standard text is easier than say a web document so representation at pre-processing step becomes even more crucial to define what exactly it is you are trying to extract from a standardized document representation.
Standard measures include precision, recall, f1 measure amongst others.
I disagree with the people who recommend reading Programming Collective Intelligence. If you want to do anything of even moderate complexity, you need to be good at applied math and PCI gives you a false sense of confidence. For example, when it talks of SVM, it just says that libSVM is a good way of implementing them.
Now, libSVM is definitely a good package but who cares about packages. What you need to know is why SVM gives the terrific results that it gives and how it is fundamentally different from Bayesian way of thinking ( and how Vapnik is a legend).
IMHO, there is no one solution to it. You should have a good grip on Linear Algebra and probability and Bayesian theory. Bayes, I should add, is as important for this as oxygen for human beings ( its a little exaggerated but you get what I mean, right ?). Also, get a good grip on Machine Learning. Just using other people's work is perfectly fine but the moment you want to know why something was done the way it was, you will have to know something about ML.
Check these two for that :
http://pindancing.blogspot.com/2010/01/learning-about-machine-learniing.html
http://measuringmeasures.com/blog/2010/1/15/learning-about-statistical-learning.html
http://measuringmeasures.com/blog/2010/3/12/learning-about-machine-learning-2nd-ed.html
Okay, now that's three of them :) / Cool
The Wikipedia Information Extraction article is a quick introduction.
At a more academic level, you might want to skim a paper like Integrating Probabilistic Extraction Models and Data Mining to Discover Relations and Patterns in Text.
Take a look here if you need enterprise grade NER service. Developing a NER system (and training sets) is a very time consuming and high skilled task.
This is a little off topic, but you might want to read Programming Collective Intelligence from O'Reilly. It deals indirectly with text information extraction, and it doesn't assume much of a math background.

Resources