There is a question, shown below:
//--------question start---------------------
Consider the following small corpus consisting of three sentences:
The judge struck the gavel to silence the court. Buying the cheap saw is false
economy. The nail was driven in when the hammer struck it hard.
Use distributional similarity to determine whether the word gavel is more similar in meaning to hammer or saw. To compute distributional similarity you must (1) use bag-of-words
in a ± 2 window around the target as features, (2) not alter the context words in any way
(e.g. by stemming or removing stop words) and (3) use the Dice measure to compare
the feature vectors. Make sure to show all stages of your working.
//--------question end---------------------
I don't understand what a ± 2 window is in (1). Would someone explain it to me? Thank you very much.
A ± 2 window means the 2 words to the left and the 2 words to the right of the target word. For the target word "gavel", the window would be ["struck", "the", "to", "silence"]; for "silence", it would be ["gavel", "to", "the", "court"]; and for "hammer", ["when", "the", "struck", "it"].
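To make the full exercise concrete, here is a rough Python sketch (my own addition, not part of the original question or answer) that collects the ± 2 windows for each target and compares them with a set-based Dice coefficient, 2|A ∩ B| / (|A| + |B|). Tokenisation details such as lowercasing and dropping punctuation are my assumptions.

```python
# Sketch: +-2-word context bags and a set-based Dice coefficient.
sentences = [
    "the judge struck the gavel to silence the court",
    "buying the cheap saw is false economy",
    "the nail was driven in when the hammer struck it hard",
]

def window_features(target, k=2):
    """Bag of words within k positions of every occurrence of target."""
    feats = set()
    for sent in sentences:
        tokens = sent.split()
        for i, tok in enumerate(tokens):
            if tok == target:
                feats.update(tokens[max(0, i - k):i])      # left context
                feats.update(tokens[i + 1:i + 1 + k])      # right context
    return feats

def dice(a, b):
    return 2 * len(a & b) / (len(a) + len(b))

gavel, hammer, saw = map(window_features, ("gavel", "hammer", "saw"))
print("Dice(gavel, hammer) =", dice(gavel, hammer))   # shares 'struck' and 'the'
print("Dice(gavel, saw)    =", dice(gavel, saw))      # shares only 'the'
```

With these choices, gavel shares {struck, the} with hammer but only {the} with saw, so gavel comes out as more similar to hammer.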
I have this problem of matching two strings for 'more general', 'less general', 'same meaning', 'opposite meaning' etc.
The strings can be from any domain. Assume that the strings can be from people's emails.
To give an example,
String 1 = "movies"
String 2 = "Inception"
Here I should know that Inception is less general than movies (sort of is-a relationship)
String 1 = "Inception"
String 2 = "Christopher Nolan"
Here I should know that Inception is less general than Christopher Nolan
String 1 = "service tax"
String 2 = "service tax 2015"
At a glance it appears to me that S-match will do the job. But I am not sure if S-match can be made to work on knowledge bases other than WordNet or GeoWordNet (as mentioned on their page).
If I use word2vec or dl4j, I guess they can give me similarity scores. But do they also support telling whether one string is more general or less general than the other?
I do see that word2vec can be trained on a training set or a large corpus such as Wikipedia.
Can someone throw light on the way forward?
Current machine learning methods for modelling words, such as word2vec and dl4j, are based on the distributional hypothesis: they train models of words and phrases based on their context. There is no ontological aspect to these word models. At best, a well-trained model built with these tools can say whether two words can appear in similar contexts; that is how their similarity measure works.
The Mikolov papers (a, b and c), which suggest that these models can learn "Linguistic Regularities", don't contain any ontological analysis; they only suggest that the models are capable of predicting "similarity between members of the word pairs". This kind of prediction doesn't help your task. These models are not even capable of distinguishing similarity from relatedness (e.g. see the SimLex test set).
I would say that you need an ontological database to solve your problem. More specifically, for String 1 and String 2 in your examples:
String 1 = "a"
String 2 = "b"
You are trying to check entailment relations in sentences:
(1) "c is b"
(2) "c is a"
(3) "c is related to a".
Where:
(1) entails (2)
or
(1) entails (3)
In your first two examples, you can probably use semantic knowledge bases to solve the problem. But your third example will probably need syntactic parsing before the difference between the two phrases can be understood. For example, consider these phrases:
"men"
"all men"
"tall men"
"men in black"
"men in general"
Solving that needs logical understanding. However, you can rely on the economy of language: adding more words to a phrase usually makes it less general, so longer phrases tend to be less general than shorter ones. This doesn't give you a precise tool, but it can help with phrases that don't contain special words such as all, general or every.
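To illustrate the knowledge-base route for the is-a checks, here is a minimal sketch using NLTK's WordNet interface (my own example: the function name is made up, and proper nouns such as "Inception" or "Christopher Nolan" are not in WordNet, so a richer resource would be needed for those).

```python
# Sketch: is term1 "less general" than term2, i.e. does some noun sense of
# term1 have some noun sense of term2 among its hypernyms?
# Requires: nltk.download('wordnet')
from nltk.corpus import wordnet as wn

def is_less_general(term1, term2):
    for s1 in wn.synsets(term1, pos=wn.NOUN):
        hypernyms = set(s1.closure(lambda s: s.hypernyms()))
        if any(s2 in hypernyms for s2 in wn.synsets(term2, pos=wn.NOUN)):
            return True
    return False

print(is_less_general("dog", "animal"))   # True: dog is-a animal
print(is_less_general("animal", "dog"))   # False
```

For phrases like "service tax 2015" such a lookup won't work directly; that is where the parsing and the phrase-length heuristic above come in.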
I need to demonstrate the Bayesian spam filter in school.
To do this I want to write a small Java application with GUI (this isn't a problem).
I just want to make sure that I have really grasped the concept of the filter before starting to write my code. So I will describe what I am going to build and how I will program it, and I would be really grateful if you could give a "thumbs up" or "thumbs down".
Remember: it is for a short presentation, just to demonstrate. It does not have to be performant or anything like that ;)
I imagine the program having 2 textareas.
In the first I want to enter a text, for example
"The quick brown fox jumps over the lazy dog"
I then want to have two buttons under this field with "good" or "bad".
When I hit one of the buttons, the program counts the appearances of each word for that category.
So for example, when I enter following texts:
"hello you viagra" | bad
"hello how are you" | good
"hello drugs viagra" | bad
For words I do not know I assume a probability of 0.5
My "database" then looks like this:
<word>, <# times word appeared in bad message>
hello, 2
you, 1
viagra, 2
how, 0
are, 0
drugs, 1
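To make this counting step concrete, here is a rough Python sketch of what I have in mind (the real app will be Java; the names are just placeholders):

```python
# Sketch: one word counter and one running word total per class.
from collections import Counter

counts = {"good": Counter(), "bad": Counter()}
totals = {"good": 0, "bad": 0}          # number of training words per class

def train(text, label):
    words = text.lower().split()
    counts[label].update(words)
    totals[label] += len(words)

train("hello you viagra", "bad")
train("hello how are you", "good")
train("hello drugs viagra", "bad")

print(counts["bad"]["viagra"])   # 2
print(counts["bad"]["how"])      # 0 (a Counter returns 0 for unseen words)
```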
In the second textarea I then want to enter a text to evaluate if it is "good" or "bad".
So for example:
"hello how is the viagra selling"
The algorithm then takes the whole text apart and looks up, for every word, its probability of appearing in a "bad" message.
This is now where I'm stuck:
If I calculate the probability of a word appearing in a bad message as (# times it appeared in bad messages) / (# times it appeared in all messages), the above text would have probability 0 of being in either category, because:
how never appeared in a bad message, so its probability is 0
viagra never appeared in a good message, so its probability is also 0
When I now multiply the individual probabilities, this gives 0 in both cases.
Could you please explain how I calculate the probability for a single word to be "good" or "bad"?
Best regards and many thanks in advance
me
For unseen words you want to do Laplace smoothing. What does that mean: having a zero count for some word is counterintuitive, since it implies that the probability of this word is 0, which is false for any word you can imagine :-) Thus you want to add a small but positive probability to every word.
Also, consider using logarithms. Long messages will have many words with probability < 1, and when you multiply lots of small floating-point numbers on a computer you can easily run into numerical underflow. To overcome this, note that:
log (p1 * ... * pn) = log p1 + ... + log pn
So we trade n multiplications of small numbers for n additions of relatively large (and negative) ones. You can then exponentiate the result to obtain a probability estimate.
UPD: Actually, this is an interesting subtopic for your demo. It shows a drawback of NB, namely outputting zero probabilities, and a way to fix it. And it's not an ad-hoc patch, but a result of applying the Bayesian approach (it's equivalent to adding a prior).
UPD 2: I didn't notice it the first time, but it looks like you got the Naive Bayes concept wrong, especially the Bayesian part of it.
Essentially, NB consists of 2 components:
We use Bayes' rule for the posterior distribution over class labels. This gives us p(class|X) = p(X|class) p(class) / p(X), where p(X) is the same for all classes, so it doesn't influence the ordering of the probabilities. Another way to say the same thing is that p(class|X) is proportional to p(X|class) p(class) (up to a constant). As you may have guessed already, that's where the Bayes part comes from.
The formula above does not involve any model assumptions; it's a law of probability theory. However, it's too hard to apply directly, since p(X|class) denotes the probability of encountering message X in a class, and we would never have enough data to estimate the probability of every possible message. So here comes our model assumption: we say that the words of a message are independent (which is obviously wrong, hence the method is Naive). This leads to p(X|class) = p(x1|class) * ... * p(xn|class), where n is the number of words in X.
Now we need to somehow estimate the probabilities p(x|class), where x is not a whole message but a single word. Intuitively, the probability of drawing a given word from a given class is the number of occurrences of that word in that class divided by the total size of the class: #(word, class) / #(class) (or, using the definition of conditional probability: p(x|class) = p(x, class) / p(class)).
Accordingly, since p(x|class) is a distribution over words x, we need it to sum to 1. Thus, if we apply Laplace smoothing by setting p(x|class) = (#(x, class) + a) / Z, where Z is a normalizing constant, we need to enforce the constraint sum_x p(x|class) = 1, or, equivalently, sum_x (#(x, class) + a) = Z. This gives us Z = #(class) + a * N, where N is the size of the vocabulary (the number of distinct words, not the number of their occurrences!).
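Putting all of this together (the class prior, Laplace smoothing and the log trick from above), a compact self-contained sketch could look like the following. It is my own illustration rather than code from this thread; the variable names and the choice a = 1 are arbitrary.

```python
# Sketch: Naive Bayes with Laplace smoothing and log-probabilities.
import math
from collections import Counter

train_data = [("hello you viagra", "bad"),
              ("hello how are you", "good"),
              ("hello drugs viagra", "bad")]

counts = {"good": Counter(), "bad": Counter()}   # #(x, class)
class_words = Counter()                          # #(class): words seen per class
class_msgs = Counter()                           # messages per class, for p(class)

for text, label in train_data:
    words = text.lower().split()
    counts[label].update(words)
    class_words[label] += len(words)
    class_msgs[label] += 1

vocab = set(counts["good"]) | set(counts["bad"])  # N = len(vocab) word types
a = 1.0                                           # smoothing constant

def log_posterior(text, label):
    """log p(class) + sum_i log p(x_i|class), with p(x|class) = (#(x,class)+a)/Z."""
    logp = math.log(class_msgs[label] / sum(class_msgs.values()))
    z = class_words[label] + a * len(vocab)       # Z = #(class) + a*N
    for w in text.lower().split():
        logp += math.log((counts[label][w] + a) / z)
    return logp

msg = "hello how is the viagra selling"
print({c: log_posterior(msg, c) for c in ("good", "bad")})
```

Comparing the two log-posteriors (or exponentiating and normalising them) gives the classification, and nothing collapses to zero even for words that were never seen in one of the classes.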
I have managed to evaluate the tf-idf function for a given corpus. How can I find the stopwords and the best words for each document? I understand that a low tf-idf for a given word and document means that it is not a good word for selecting that document.
Stop-words are words that appear very commonly across the documents, therefore losing their representativeness. The best way to observe this is to measure the number of documents a term appears in and filter out those that appear in more than 50% of them, or the top 500, or some other threshold that you will have to tune.
The best (as in more representative) terms in a document are those with higher tf-idf because those terms are common in the document, while being rare in the collection.
As a quick note, as @Kevin pointed out, very common terms in the collection (i.e., stop-words) produce very low tf-idf anyway. However, they will still affect some computations, and that would be wrong if you assume they are pure noise (which might not be true depending on the task). In addition, if they are included, your algorithm will be slightly slower.
edit:
As @FelipeHammel says, you can directly use the IDF (remember to invert the order), which is inversely related to df. This is completely equivalent for ranking purposes, and therefore for selecting the top k terms. However, it cannot be used directly to select based on ratios (e.g., words that appear in more than 50% of the documents), although a simple threshold fixes that (i.e., selecting terms with an idf lower than a specific value). In general, a fixed number of terms is used.
I hope this helps.
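As a rough, self-contained illustration of both ideas (document frequency for stop-word candidates, tf-idf for the most representative terms of a document), here is a small sketch; the toy corpus, the 50% threshold and the plain log(N/df) idf are my own choices.

```python
# Sketch: df-based stop-word candidates and tf-idf-ranked top terms.
import math
from collections import Counter

docs = ["the cat sat on the mat",
        "the dog chased the cat",
        "stocks fell on the news"]

tokenised = [d.split() for d in docs]
N = len(docs)
df = Counter(w for doc in tokenised for w in set(doc))   # document frequency

stopword_candidates = {w for w, f in df.items() if f / N > 0.5}
print(stopword_candidates)        # {'the', 'cat', 'on'} on this toy corpus

def top_terms(doc_index, k=3):
    """Terms of one document ranked by tf * idf."""
    tf = Counter(tokenised[doc_index])
    scores = {w: tf[w] * math.log(N / df[w]) for w in tf}
    return sorted(scores, key=scores.get, reverse=True)[:k]

print(top_terms(0))               # frequent here, rare elsewhere
```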
From "Introduction to Information Retrieval" book:
tf-idf assigns to term t a weight in document d that is
highest when t occurs many times within a small number of documents (thus lending high discriminating power to those documents);
lower when the term occurs fewer times in a document, or occurs in many documents (thus offering a less pronounced relevance signal);
lowest when the term occurs in virtually all documents.
So words with the lowest tf-idf can be considered stop words.
I have made an algorithm for Scrabble. It uses a highest-score strategy, but I do not think that is the best way to play the game.
My question is: is there any advanced math for Scrabble that suggests not the highest-scoring word but another one that will increase the probability of winning?
Or, in other words, some strategy other than highest score?
I have my own ideas about how it could work. For example, suppose there are two words with almost the same score (s1 > s2), but the second word does not open a new way to a 3W or 2W square; then, even though its score is lower than the first one's, it is better to play the second word rather than the first.
From my experience with Scrabble, you are correct that you don't necessarily always want to suggest the highest-scoring word. Rather, you want to suggest the best word. I don't think this requires a lot of advanced math to pull off.
Here are some suggestions:
In your current algorithm, rank all your letters, particularly consonants, by ease of use. For example, the letter "S" would have the highest ease of use because it is the most flexible. That is, when you play a given word and leave out the letter "S", you are essentially opening up the possibility of better word choices with the new letters that come into play on your next turn.
Balance out vowel and consonant usage in your words. As a regular Scrabble player, I don't always play the best-scoring word if it doesn't use enough vowels. For example, if I use 4 letters that contain no vowels and I have 3 vowels left in my array of letters, chances are I am going to draw at least two vowels on my next turn, which would leave me with 5 vowels and 2 consonants, and that most likely doesn't open up a lot of opportunity for high-scoring words. It is almost always better to use more vowels than consonants in your words, especially the letter I. Your algorithm should reflect some of this when selecting the best word.
I hope this gives you a good start. Once your algorithm is able to select the best scoring word, you can fine tune it with these suggestions in order to be an overall better scorer in your scrabble games. (I am assuming this is some sort of AI you are creating)
My question is: is there any advanced math for Scrabble that suggests not the highest-scoring word but another one that will increase the probability of winning?
As ROFLwTIME mentioned, you need to also account for the letters that you haven't played.
In doing that accounting, you need to account for how letters interact with one another. For example, suppose you have a Q, a U, and five other letters. Suppose the best you can score playing both the Q and the U is 30 points, but you can score more by playing the U but leaving the Q unplayed. Unless that "more" is much more than 30, either play the word with the Q or find a third word that leaves both the Q and the U unplayed.
You also need to account for the opportunities the word you play creates for your opponents. A typical game-theory strategy is to maximize your own score while minimizing your opponent's score, maximin for short. Playing a 20-point word that allows your opponent to play a 50-point word is not a good idea.
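As a toy illustration of that maximin idea (entirely made up on my side: the two candidate moves and their scores are hard-coded stand-ins for what a real move generator and board evaluator would produce):

```python
# Sketch: rank candidate moves by own score minus the opponent's best reply.
candidate_moves = {
    # move description: (my_score, best_score_the_opponent_can_reply_with)
    "45-pt word opening a 3W square": (45, 50),
    "32-pt word in a closed corner":  (32, 12),
}

def maximin_value(move):
    my_score, best_reply = candidate_moves[move]
    return my_score - best_reply

best = max(candidate_moves, key=maximin_value)
print(best)   # the safer, lower-scoring play wins here
```

The only point is that the ranking criterion is your score minus the opponent's best reply, not your score alone.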
As a programmer, I frequently need to know how to calculate the number of permutations of a set, usually for estimation purposes.
There are a lot of different ways to specify the allowable combinations, depending on the problem at hand. For example, given the set of letters A, B, C, D:
Assuming a 4-digit result, how many ways can those letters be arranged?
What if you can have 1,2,3 or 4 digits, then how many ways?
What if you are only allowed to use each letter at most once? At most twice?
What if you must avoid the same letter appearing twice in a row, but if they are not in a row, then twice is OK?
Etc. I'm sure there are many more.
Does anyone know of a web reference or book that talks about this subject in terms that a non-mathematician can understand?
Thanks!
Assuming a 4-digit result, how many ways can those letters be arranged?
When picking the 1st digit, you have 4 choices: one of A, B, C or D; it is the same when picking the 2nd, 3rd and 4th, since repetition is allowed.
So you have in total 4 * 4 * 4 * 4 = 256 choices.
What if you can have 1, 2, 3 or 4 digits, then how many ways?
It is easy to deduce from question 1: sum the counts for each length, 4 + 4^2 + 4^3 + 4^4 = 4 + 16 + 64 + 256 = 340 ways.
What if you are only allowed to use each letter at most once?
When picking the 1st digit, you have 4 choices: one of A, B, C or D; when picking the 2nd, you have 3 choices (anything except the one you picked for the 1st); then 2 choices for the 3rd and 1 choice for the 4th.
So you have in total 4 * 3 * 2 * 1 = 24 choices.
The knowledge involved here covers combinations, permutations and probability. Here is a good tutorial for understanding the difference.
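Those counts are easy to sanity-check in code; here is a quick sketch with Python's itertools (my addition):

```python
# Quick verification of the counts above.
from itertools import product, permutations

letters = "ABCD"

# Repetition allowed, length 4: 4^4
print(len(list(product(letters, repeat=4))))       # 256

# Each letter at most once, length 4: 4!
print(len(list(permutations(letters, 4))))         # 24

# Lengths 1..4 with repetition: 4 + 16 + 64 + 256
print(sum(len(list(product(letters, repeat=n))) for n in range(1, 5)))  # 340
```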
First of all, the topics you are speaking of are:
Permutations (where the order matters)
Combinations (order doesn't matter)
I would recommend Math Tutor DVD for teaching yourself math topics. The "probability and statistics" disc set will give you the formulas and skills you need to solve the problems. It's great because it's the closest thing you can get to going back to school: a teacher solves problems on a whiteboard for you.
I've found a clip on the Combinations chapter of the video for you to check out.
If you need to do more than just count the number of combinations and permutations, if you actually need to generate the sequences, then look at Donald Knuth's books Generating all combinations and partitions and Generating all tuples and permutations. He goes into great detail regarding algorithms subject to various restrictions, looking at the advantages and disadvantages of different solutions for each problem.
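If you just want to experiment before diving into Knuth, a brute-force sketch (mine, and only sensible for tiny alphabets) for one of the restricted cases from the question, namely no identical letters in a row, could be:

```python
# Sketch: length-4 arrangements of A,B,C,D with repetition allowed,
# but the same letter may not appear twice in a row.
from itertools import product

letters = "ABCD"
valid = [seq for seq in product(letters, repeat=4)
         if all(a != b for a, b in zip(seq, seq[1:]))]
print(len(valid))   # 4 * 3 * 3 * 3 = 108
```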
It all depends on how simple you need the explanation to be.
The topic you are looking for is called "Permutations and Combinations".
Here's a fairly simple introduction. There are dozens like this on the first few pages of Google results.