How to find Top n topics for a document - information-retrieval

I am using tf-idf to rank the terms in a document. When the terms are arranged in descending order of tf-idf, the top 'n' terms are the most relevant to that document.
When a document is chosen, the top 'n' terms of that document have to be displayed.
My question is: how do I decide the value of 'n'?
For example, the terms of a document arranged in descending order of tf-idf are as follows:
Document 1
president
Obama
Barak
speech
inauguration
come
the
look
again
took
Now when I want to show topics for Document 1, I need only the top 5 terms, since the others are not relevant topics for the document.
How do I decide this cut-off point for the terms in a document?
Thanks in advance

Regarding your sample data, there seems to be a problem: terms 6 to 10 are non-informative words, and some of them are even stop-words, such as 'the'.
So, a first step that you should try is to remove stop-words.
Coming back to your question, there is no best practice for choosing the value of K in top-K keyword extraction. It varies from one document to another, because some documents are more informative (often multi-topical) than others, which means those documents should have a higher value of K.
One way to decide on a stopping point is to look at the relative difference between the tf-idf values of consecutive terms and stop at the point where this relative difference becomes higher than a threshold, which indicates a big fall in the amount of key information you are outputting.
Assuming that you have computed a tf-idf score for each term and have sorted the terms in descending order of their values, compute the following relative difference before adding every new term t_{k+1} to the list:
(tfidf(t_k) - tfidf(t_{k+1})) / tfidf(t_k) < delta
If the above expression is true, where delta is a pre-defined threshold, add the new term, because its informativeness is close enough to that of the terms already in the list. Otherwise, i.e. if the relative difference is higher than delta, stop.
A note: You can play around with different term scoring functions... not just tfidf.
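As a concrete illustration, here is a minimal sketch of this stopping rule in R; the scores are made-up tf-idf values for the example terms above, and delta is a threshold you would have to tune.
# Sketch: cut the ranked term list at the first big relative drop in tf-idf.
# `scores` is a named numeric vector of tf-idf values, sorted in descending order;
# `delta` is the pre-defined threshold on the relative difference.
select_top_terms <- function(scores, delta = 0.5) {
  if (length(scores) < 2) return(names(scores))
  keep <- names(scores)[1]                 # always keep the top-ranked term
  for (i in 2:length(scores)) {
    rel_diff <- (scores[i - 1] - scores[i]) / scores[i - 1]
    if (rel_diff > delta) break            # big fall in informativeness: stop here
    keep <- c(keep, names(scores)[i])
  }
  keep
}

scores <- c(president = 0.91, Obama = 0.88, Barak = 0.85, speech = 0.80,
            inauguration = 0.78, come = 0.20, look = 0.18, again = 0.15, took = 0.12)
select_top_terms(scores, delta = 0.5)
# "president" "Obama" "Barak" "speech" "inauguration"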

Related

Does column order matter in RNN?

My question is somewhat similar to this one, but I want to ask whether the column order matters or not. I have some time series data, and for each cycle I computed some features (let's call them var1, var2, ...). I now train the model using the following column order, which of course will be consistent for the test set.
X_train = data[['var1', 'var2', 'var3', 'var4']]
After watching this video, I've concluded that the order in which the columns appear is significant, i.e. if I swapped var1 and var3 as:
X_train = data[['var3', 'var2', 'var1', 'var4']]
I would get a different loss.
If the above is true, then how does one figure out the correct feature order to minimize the loss, especially when the number of features could be in the dozens?

Need to get combinations of records from a data frame in R that satisfy a specific target

Say I have a data frame in R with 500 player records and the following columns:
PlayerID
TotalRuns
RunRate
AutionCost
Now, out of the 500 players, I want my code to give me multiple combinations of 3 players that satisfy the following criteria (something like a Moneyball problem):
The sum of the auction costs of the 3 players shouldn't exceed X
They should have a minimum of Y TotalRuns
Their RunRate must be higher than the average run rate of all the players.
Kindly help with this. Thank you.
There are choose(500, 3) ways to choose 3 players, which is 20,708,500. It's not impossible to generate all these combinations; combn might do it for you, but I couldn't be bothered waiting to find out. If you do this with player IDs and then test your three conditions, that would be one way to solve your problem. An alternative would be a Monte Carlo method: select three players that initially satisfy your conditions, then randomly select another player who doesn't belong to the current trio; if he satisfies the conditions, save the new combination and repeat. If you're optimizing (it's not clear, but your question has optimization in the tag), then the new player has to result in a trio that's better than the last, so if he doesn't improve your objective function (whatever it might be), you don't accept the trade.
choose(500,3)
shows there are almost 21,000,000 combinations of 3 players drawn from a pool of 500, which means a complete analysis of the entire search space ought to be doable in a reasonable time on a modern machine.
You can generate the indices of these combinations using iterpc() and getnext() from the iterpc package, as in:
# library(iterpc) # uncomment if not loaded
I <- iterpc(5, 3)   # toy example: combinations of 3 from 5; use iterpc(500, 3) for the full player pool
getnext(I)          # returns the next combination of indices
You can also drastically cut the search space in a number of ways: by setting up initial filtering criteria, and/or by taking the first solution found (a while loop whose condition is meeting the criteria). Or you can generate and rank-order all of them (loop through all combinations), or do something in between where you collect n solutions. Preprocessing can also help reduce the search space. For example, ordering by auction cost in ascending order first will give you the cheapest solution first, and ordering the file by descending runs will give you the highest-runs solutions first.
NOTE: While this works fine, I see that iterpc is now superseded by the arrangements package, where the relevant iterator is icombinations(); getnext() is still the method for retrieving successive combinations.
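For illustration, here is a minimal sketch of the filter-first idea combined with combn(); the data frame name players and the thresholds X and Y are placeholders to replace with your own values.
# Sketch: filter the pool first, then enumerate the remaining trios with combn().
# Assumes a data frame `players` with the columns from the question.
X <- 100    # maximum combined auction cost (placeholder)
Y <- 1500   # minimum combined total runs (placeholder)

avg_rate <- mean(players$RunRate)
pool <- players[players$RunRate > avg_rate, ]   # criterion 3, applied per player

trios <- combn(nrow(pool), 3, FUN = function(idx) {
  trio <- pool[idx, ]
  if (sum(trio$AutionCost) <= X && sum(trio$TotalRuns) >= Y) trio$PlayerID else NULL
}, simplify = FALSE)

valid <- Filter(Negate(is.null), trios)   # list of qualifying PlayerID triples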
Thanks, I used a combination of both John's and James's answers.
I filtered out all the players who don't satisfy the criteria, which boiled it down to only 90+ players.
Then I picked players at random until all the variations were exhausted.
Finally, I computed combined metrics for each variation (set) of players to arrive at the optimized set.
The code is a bit messy, so I won't post it here.

association from term document matrix

Is there a way to find associated words from a term-document matrix, other than using findAssocs() in R? My objective is to find all words above a chosen frequency (let's say all words with a frequency of more than 200) and then find the words which appear together with these words.
findAssocs() has a terms argument; e.g. feed it the words which have a frequency greater than 200. You can find those words by using findFreqTerms().
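A short sketch with the tm package, assuming tdm is your existing term-document matrix; the correlation limit of 0.8 is only an illustrative value.
library(tm)
# `tdm` stands in for your existing TermDocumentMatrix
frequent <- findFreqTerms(tdm, lowfreq = 201)                 # terms occurring more than 200 times
assocs <- findAssocs(tdm, terms = frequent, corlimit = 0.8)   # words correlated with each frequent term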

The approach to calculating 'similar' objects based on certain weighted criteria

I have a site that has multiple Project objects. Each project has (for example):
multiple tags
multiple categories
a size
multiple types
etc.
I would like to write a method to grab all 'similar' projects based on the above criteria. I can easily retrieve similar projects for each of the above individually (i.e. projects of a similar size, or projects that share a category, etc.), but I would like it to be more intelligent than just choosing projects that either have all of the above in common, or projects that have at least one of the above in common.
Ideally, I would like to weight each of the criteria, i.e. a project that has a tag in common is less 'similar' than a project that is close in size, etc. A project that has two tags in common is more similar than a project that has one tag in common, and so on.
What approach (practically and mathematically) can I take to do this?
The common way to handle this (in machine learning at least) is to create a metric which measures the similarity. A Jaccard metric seems like a good match here, given that you have types, categories, tags, etc., which are not really numbers.
Once you have a metric, you can speed up searching for similar items by using a KD tree, vp-tree or another metric tree structure, provided your metric obeys the triangle inequality (d(a,b) <= d(a,c) + d(c,b)).
The problem is that there are obviously infinitely many ways of solving this.
First of all, define a similarity measure for each of your attributes (tag similarity, category similarity, description similarity, ...)
Then try to normalize all these similarities to use a common scale, e.g. 0 to 1, with 0 being most similar, and the values having a similar distribution.
Next, assign each feature a weight. E.g. tag similarity is more important than description similarity.
Finally, compute a combined similarity as weighted sum of the individual similarities.
There are infinitely many ways because you can assign arbitrary weights, you already have various choices for the single-attribute similarities, and there are countless ways of normalizing the individual values. And so on.
There are methods for learning the weights; see ensemble methods. However, to learn the weights you need user input on what is a good result and what is not. Do you have such training data?
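For concreteness, here is a minimal sketch of such a weighted combination in R, using Jaccard similarity for the set-valued attributes as suggested above. The attribute names, weights and size normalization are placeholders to tune, and this sketch uses the convention that 1 means most similar.
# Jaccard similarity between two sets (character vectors)
jaccard <- function(a, b) {
  u <- union(a, b)
  if (length(u) == 0) return(0)
  length(intersect(a, b)) / length(u)
}

# Weighted similarity between two projects, each a list with
# $tags, $categories, $types (character vectors) and $size (numeric).
similarity <- function(p, q, max_size_diff = 1000) {
  sims <- c(
    tags       = jaccard(p$tags, q$tags),
    categories = jaccard(p$categories, q$categories),
    types      = jaccard(p$types, q$types),
    size       = max(0, 1 - abs(p$size - q$size) / max_size_diff)  # 1 = identical size
  )
  weights <- c(tags = 3, categories = 2, types = 2, size = 1)      # arbitrary placeholder weights
  sum(weights * sims) / sum(weights)                               # weighted average in [0, 1]
}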
Start with a value of 100 in each category.
Apply penalties. Like, -1 for each kB difference in size, or -2 for each tag not found in the other project. You end up with a value of 0..100 in each category.
Multiply each category's value with the "weight" of the category (i.e., similarity in size is multiplied with 1, similarity in tags with 3, similarity in types with 2).
Add up the weighted values.
Divide by the sum of weight factors (in my example, 1 + 3 + 2 = 6) to get an overall similarity of 0..100.
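A minimal sketch of this penalty-based scoring in R; the field names are placeholders, and the penalty sizes and weights are the illustrative values from the steps above.
# Penalty-based similarity between two projects p and q (lists with
# $size_kb, $tags and $types); each category starts at 100 and is penalised.
penalty_score <- function(p, q) {
  tag_diff  <- length(setdiff(p$tags, q$tags)) + length(setdiff(q$tags, p$tags))
  type_diff <- length(setdiff(p$types, q$types)) + length(setdiff(q$types, p$types))
  scores <- c(
    size  = max(0, 100 - abs(p$size_kb - q$size_kb)),  # -1 per kB of size difference
    tags  = max(0, 100 - 2 * tag_diff),                # -2 per tag not found in the other project
    types = max(0, 100 - 2 * type_diff)
  )
  weights <- c(size = 1, tags = 3, types = 2)
  sum(weights * scores) / sum(weights)                 # overall similarity in 0..100
}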
Whether you can reduce the comparison of projects below the initial O(n^2) (i.e. comparing each project with every other one) depends heavily on context. It might be the real crux of your software, or it might not be necessary at all if n is low.

n-gram sentence similarity with cosine similarity measurement

I have been working on a project about sentence similarity. I know it has been asked many times on SO, but I just want to know whether my problem can be solved with the method I am using, in the way I am doing it, or whether I should change my approach. Roughly speaking, the system is supposed to split the sentences of an article and find similar sentences among the other articles that are fed to the system.
I am using cosine similarity with tf-idf weights, and this is how I did it:
1- First, I split all the articles into sentences, then I generate trigrams for each sentence and sort them (should I?).
2- I compute the tf-idf weights of the trigrams and create vectors for all sentences.
3- I calculate the dot product and magnitudes of the original sentence and the sentence to be compared, then calculate the cosine similarity.
However, the system does not work as I expected. Here, I have some questions in my mind.
From what I have read about tf-idf weights, I guess they are more useful for finding similar "documents". Since I am working on sentences, I modified the algorithm a little by changing some variables in the formulas of the tf and idf definitions (instead of documents, I came up with sentence-based definitions):
tf = number of occurrences of trigram in sentence / number of all trigrams in sentence
idf = number of all sentences in all articles / number of sentences where trigram appears
Do you think it is ok to use such a definition for this problem?
Another thing is that I saw normalization mentioned many times when calculating the cosine similarity. I am guessing this is important because the trigram vectors might not be the same size (which they rarely are in my case). If one trigram vector has size x and the other x+1, then I treat the first vector as if it were of size x+1 with the last value being 0. Is this what is meant by normalization? If not, how do I do the normalization?
Besides this, if I have chosen the wrong algorithm, what else can be used for such a problem (preferably with an n-gram approach)?
Thank you in advance.
I am not sure why you are sorting the trigrams for every sentence. All you need to care about when computing cosine similarity is whether the same trigram occurred in the two sentences or not, and with what frequencies. Conceptually speaking, you define a fixed and common order over all possible trigrams. Remember, the order has to be the same for all sentences. If the number of possible trigrams is N, then for each sentence you obtain a vector of dimensionality N. If a certain trigram does not occur, you set the corresponding value in the vector to zero. You don't really need to store the zeros, but you have to take care of them when you define the dot product.
Having said that, trigrams are not a good choice, as the chances of a match are a lot sparser. For high k you will get better results from bags of k consecutive words rather than k-grams. Note that the ordering does not matter inside a bag; it's a set. You are using k=3, but that seems to be on the high side, especially for sentences. Either drop down to bigrams or use bags of different lengths, starting from 1. Preferably use both.
I am sure you have noticed that sentences which do not share an exact trigram have 0 similarity in your method. Bags of k words will alleviate the situation somewhat but not solve it completely, because now you need sentences to share actual words, and two sentences may be similar without using the same words. There are a couple of ways to fix this: either use LSI (Latent Semantic Indexing), or cluster the words and use the cluster labels to define your cosine similarity.
In order to compute the cosine similarity between vectors x and y, you compute the dot product and divide by the norms of x and y.
The 2-norm of the vector x can be computed as the square root of the sum of the squared components. However, you should also try your algorithm without any normalization, to compare. Usually it works fine, because you are already taking care of the relative sizes of the sentences when you compute the term frequencies (tf).
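To make the fixed-vocabulary and normalization points concrete, here is a small sketch in R using raw trigram counts (you would plug in your tf-idf weights instead of the counts); the example sentences are invented.
# Word trigrams of a sentence
trigrams <- function(sentence) {
  w <- strsplit(tolower(sentence), "\\s+")[[1]]
  if (length(w) < 3) return(character(0))
  sapply(seq_len(length(w) - 2), function(i) paste(w[i:(i + 2)], collapse = " "))
}

# Cosine similarity between the trigram count vectors of two sentences,
# built over a shared vocabulary so both vectors have the same dimensionality.
cosine_sim <- function(s1, s2) {
  t1 <- trigrams(s1); t2 <- trigrams(s2)
  vocab <- union(t1, t2)                               # common, fixed ordering of trigrams
  v1 <- as.numeric(table(factor(t1, levels = vocab)))  # zero for trigrams that do not occur
  v2 <- as.numeric(table(factor(t2, levels = vocab)))
  sum(v1 * v2) / (sqrt(sum(v1^2)) * sqrt(sum(v2^2)))   # dot product divided by the 2-norms
}

cosine_sim("the president gave a speech today",
           "the president gave an address today")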
Hope this helps.
