I have two questions:
I've made a vector from a document by counting how many times each word appears in it. Is this the right way of making the vector, or do I have to do something else as well?
Using the above method I've created vectors for 16 documents, which are of different sizes. Now I want to apply cosine similarity to find out how similar the documents are to each other. The problem I'm having is getting the dot product of two vectors, because they are of different sizes. How would I do this?
Sounds reasonable, as long as it means you have a list/map/dict/hash of (word, count) pairs as your vector representation.
You should pretend that you have zero values for the words that do not occur in some vector, without storing these zeros anywhere. Then, you can use the following algorithm to compute the dot product of these vectors (pseudocode):
algorithm dot_product(a : WordVector, b : WordVector):
    dot = 0
    for word, x in a do
        y = lookup(word, b)    // yields 0 if word does not occur in b
        dot += x * y
    return dot
The lookup part can be anything, but for speed, I'd use hashtables as the vector representation (e.g. Python's dict).
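As a rough Python sketch of the pseudocode above, assuming each document vector is a dict (here a collections.Counter) of word counts. The function names and the two sample sentences are made up for illustration:

```python
import math
from collections import Counter

def dot_product(a, b):
    # Iterate over the smaller vector; words missing from the other count as 0.
    if len(a) > len(b):
        a, b = b, a
    return sum(count * b.get(word, 0) for word, count in a.items())

def cosine_similarity(a, b):
    norm_a = math.sqrt(dot_product(a, a))
    norm_b = math.sqrt(dot_product(b, b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for empty documents
    return dot_product(a, b) / (norm_a * norm_b)

# Two documents with different vocabularies and sizes.
doc1 = Counter("the cat sat on the mat".split())
doc2 = Counter("the dog sat".split())
print(dot_product(doc1, doc2))        # 3  ("the": 2*1, "sat": 1*1)
print(cosine_similarity(doc1, doc2))  # 3 / sqrt(8 * 3)
```

The vectors never need the same length, because only words present in both dicts contribute to the sum.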
Related
I'm working on a string similarity algorithm, and was thinking on how to give a score between 0 and 1 when comparing two strings. The two variables for this function are the Levenshtein distance D: (added, removed and changed characters) and the maximum length of the two strings L (but you could also take the average).
My initial algorithm was just 1-D/L but this gave too high scores for short strings, e.g. 'tree' and 'bee' would get a score of 0.5, and too low scores for longer strings which have more in common even if half of the characters is different.
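For reference, a minimal sketch of that initial 1-D/L score, assuming the standard dynamic-programming Levenshtein distance. The names levenshtein and naive_score are illustrative only:

```python
def levenshtein(s, t):
    # Row-by-row dynamic programming over the edit-distance table.
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def naive_score(s, t):
    longest = max(len(s), len(t))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1 - levenshtein(s, t) / longest

print(naive_score("tree", "bee"))  # 0.5, since D = 2 and L = 4
```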
Now I'm looking for a mathematical function that can output a better score. I wasn't able to come up with one, so I sketched this height map of a 3D plot (L is x and D = y).
Does anyone know how to convert such a graph to an equation, whether I would be better off just creating a lookup table, or whether there is an existing solution?
I'm trying to use TFIDF (for relative frequency) to calculate cosine distance. I've selected 10 words from one document, say File 1, and selected another 10 files from my folder; I use the 10 words and their frequencies to check which of the 10 files are similar to File 1. Say the total number of files in the folder is 46. I know that DF is the number of documents the word appears in, IDF is log(total number of files (46) / DF), and TFIDF is the product of TF (the frequency of the word in one document) and IDF.
QUESTION:
Assuming what I said above is 100% correct, after getting the TFIDF for all 10 words in one document, say File 2, do I add the TFIDF values of the 10 words together to get the TFIDF for File 2?
What is the cosine distance?
Could anyone help with an example?
The problem is you are confused between cosine similarity and tf-idf. While the former is a measure of similarity between two vectors (in this case documents), the latter simply is a technique of setting the components for the vectors to be eventually used in the former.
Regarding your question in particular, it is rather inconvenient to select 10 terms from each document. I'd suggest working with all terms instead. Let V be the total number of distinct terms (the cardinality of the union of the term sets of all documents in the collection). You can then represent each document as a vector of V dimensions. The ith component of a particular document D can be set to the tf-idf weight of the corresponding term (say t_i), i.e. D_i = tf(t_i,D)*idf(t_i)
Once you represent every document in your collection in this way, you can then compute the inter-document similarities in the following way.
cosine-sim(D, D') = (1 / (|D| * |D'|)) * \sum_{i=1}^{V} D_i * D'_i
                  = (1 / (|D| * |D'|)) * \sum_{i=1}^{V} tf(t_i,D) * idf(t_i) * tf(t_i,D') * idf(t_i)
Note that the contributing terms in this summation are only those ones which occur in both documents. If a term t occurs in D but not in D' then tf(t,D')=0 which thus contributes 0 to the sum.
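A small Python sketch of this scheme, assuming raw term counts for tf and idf(t) = log(N / df(t)) as in the question. The function names and the three toy documents are invented for illustration:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs is a list of token lists; returns one {term: weight} dict per document."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    idf = {term: math.log(n / count) for term, count in df.items()}
    return [{term: tf * idf[term] for term, tf in Counter(doc).items()}
            for doc in docs]

def cosine_sim(u, v):
    # Only terms present in both documents contribute to the sum.
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

docs = ["apple banana apple".split(),
        "banana cherry".split(),
        "cherry date".split()]
vecs = tfidf_vectors(docs)
print(cosine_sim(vecs[0], vecs[1]))  # positive: the documents share "banana"
print(cosine_sim(vecs[0], vecs[2]))  # 0.0: no shared terms
```

Note that the per-term weights are never added up into a single number per document; they stay as separate vector components, and cosine distance is usually taken as 1 minus the similarity.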
I have records (rows) in a database and I want to identify similar records. I have a constraint to use cosine similarity. If the variables (attributes, columns) vary in type and come in this form:
[number] [number] [boolean] [20 words string]
how can I vectorize the records in order to apply cosine similarity? For the string I can use simple tf-idf. But what about the numbers and the boolean values? And how can these be combined? My thought is that the vector would have length 1+1+1+20. But is it semantically sound to just turn the numbers of the record into coefficients of my vector and concatenate them with the tf-idf of the string to compute the cosine similarity? Or can I treat the numbers as words and apply tf-idf to them as well? Is there another technique?
Each positional element of the vectors must measure a particular attribute/feature of the entities of interest. Frequently, when words are involved, there is a vector element for the count of each word that may appear. Thus, your vector might have the size of 1 + 1 + 1 + (vocabulary size).
Because cosine similarity calculates based on numbers, you might have to convert non-numbers to numbers. For example, you might use 0, 1 for booleans.
You don't mention whether your numeric fields represent measurements or discrete values (e.g., keys). If the numeric values are measurements, then cosine similarity is well-suited (although if there are different scales of the numbers of the different attributes, it can bias your results). However, if the numbers represent keys, then using a single attribute for each field will give poor results, because a key of 5 is no closer to 6 than it is to 200. But cosine similarity doesn't know that. In the case where a database field contains keys, you might want to have a boolean (0, 1) vector element for each possible value.
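A minimal sketch of such a vector layout, under the assumption (not stated in the question) that the first number is a measurement and the second is a key. The function vectorize, the vocabulary and the key values are hypothetical:

```python
def vectorize(record, vocabulary, key_values):
    """record is (measurement, key, flag, text); all names here are illustrative."""
    measurement, key, flag, text = record
    vec = [float(measurement)]                              # measurement used as-is
    vec += [1.0 if key == k else 0.0 for k in key_values]   # one 0/1 element per key value
    vec.append(1.0 if flag else 0.0)                        # boolean as 0/1
    words = text.lower().split()
    vec += [float(words.count(w)) for w in vocabulary]      # one count per vocabulary word
    return vec

vocabulary = ["apple", "red", "pear"]
key_values = [5, 6, 200]
print(vectorize((2.5, 6, True, "red apple red"), vocabulary, key_values))
# [2.5, 0.0, 1.0, 0.0, 1.0, 1.0, 2.0, 0.0]
```

The word counts could be swapped for tf-idf weights, and the measurement would probably need rescaling so that it does not dominate the other components.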
For a text analysis program, I would like to analyze the co-occurrence of certain words in a text. For example, I would like to see that e.g. the words "Barack" and "Obama" appear more often together (i.e. have a positive correlation) than others.
This does not seem to be that difficult. However, to be honest, I only know how to calculate the correlation between two numbers, but not between two words in a text.
How can I best approach this problem?
How can I calculate the correlation between words?
I thought of using conditional probabilities, since e.g. "Barack Obama" is much more probable than "Obama Barack"; however, the problem I'm trying to solve is much more fundamental and does not depend on the ordering of the words.
The Ngram Statistics Package (NSP) is devoted precisely to this task. They have a paper online which describes the association measures they use. I haven't used the package myself, so I cannot comment on its reliability/requirements.
A simple way to solve your question is to shape the data into a 2x2 matrix
              obama    not obama
barack          A          B
not barack      C          D
and score all occurring bigrams in the matrix. That way you can, for instance, use a simple chi-squared test.
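As a sketch of how the four cells might be filled and scored, assuming A counts bigrams (barack, obama), B counts (barack, other), C counts (other, obama) and D counts everything else. The score is the usual Pearson chi-squared statistic for a 2x2 table, N(AD - BC)^2 / ((A+B)(C+D)(A+C)(B+D)); the function name and the sample text are made up:

```python
def bigram_chi_squared(tokens, w1, w2):
    bigrams = list(zip(tokens, tokens[1:]))
    a = sum(1 for x, y in bigrams if x == w1 and y == w2)
    b = sum(1 for x, y in bigrams if x == w1 and y != w2)
    c = sum(1 for x, y in bigrams if x != w1 and y == w2)
    d = sum(1 for x, y in bigrams if x != w1 and y != w2)
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # a row or column is empty, so the statistic is undefined
    return n * (a * d - b * c) ** 2 / denom

tokens = "barack obama met barack obama today".split()
print(bigram_chi_squared(tokens, "barack", "obama"))  # 5.0 (A=2, B=0, C=0, D=3)
```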
I don't know how this is commonly done, but I can think of one crude way to define a notion of correlation that captures word adjacency.
Suppose the text has length N, say it is an array
text[0], text[1], ..., text[N-1]
Suppose the following words appear in the text
word[0], word[1], ..., word[k]
For each word word[i], define a vector of length N-1
X[i] = array(); // of length N-1
as follows: the jth entry of the vector is 1 if the word is either the jth word or the (j+1)th word of the text, and zero otherwise.
// compute the vector X[i]
for (j = 0:N-2){
    if (text[j] == word[i] OR text[j+1] == word[i])
        X[i][j] = 1;
    else
        X[i][j] = 0;
}
Then you can compute the correlation coefficient between word[a] and word[b] as the dot product of X[a] and X[b] (note that the dot product is the number of times these words are adjacent) divided by the product of their lengths (the length of X[a] is the square root of the number of ones in it, which is roughly the square root of twice the number of appearances of the word, since each appearance usually sets two entries). Call this quantity COR(X[a],X[b]). Clearly COR(X[a],X[a]) = 1, and COR(X[a],X[b]) is larger if word[a], word[b] are often adjacent.
This can be generalized from "adjacent" to other notions of near - for example we could have chosen to use 3 word (or 4, 5, etc.) blocks instead. One can also add weights, probably do many more things as well if desired. One would have to experiment to see what is useful, if any of it is of use at all.
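A short Python sketch of the adjacent-word version of this construction, assuming the text is already split into a list of tokens. The names adjacency_vector and cor, and the sample text, are illustrative:

```python
import math

def adjacency_vector(text, word):
    # Entry j is 1 if the word is the jth or the (j+1)th token of the text.
    return [1 if text[j] == word or text[j + 1] == word else 0
            for j in range(len(text) - 1)]

def cor(text, a, b):
    xa = adjacency_vector(text, a)
    xb = adjacency_vector(text, b)
    dot = sum(p * q for p, q in zip(xa, xb))
    # Entries are 0 or 1, so the squared length is just the number of ones.
    len_a = math.sqrt(sum(xa))
    len_b = math.sqrt(sum(xb))
    if len_a == 0 or len_b == 0:
        return 0.0
    return dot / (len_a * len_b)

text = "barack obama met barack obama today".split()
print(adjacency_vector(text, "barack"))  # [1, 0, 1, 1, 0]
print(cor(text, "barack", "obama"))      # 2 / (sqrt(3) * sqrt(4))
print(cor(text, "barack", "today"))      # 0.0, never adjacent
```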
This problem sounds like a bigram, a sequence of two "tokens" in a larger body of text. See this Wikipedia entry, which has additional links to the more general n-gram problem.
If you want to do a full analysis, you'd most likely take any given pair of words and do a frequency analysis. E.g., the sentence "Barack Obama is the Democratic candidate for President," has 8 words, so there are 8 choose 2 = 28 possible pairs.
You can then ask statistical questions like, "in how many pairs does 'Obama' follow 'Barack', and in how many pairs does some other word (not 'Obama') follow 'Barack'?" In this case, there are 7 pairs that include 'Barack', but in only one of them is it paired with 'Obama'.
Do the same for every possible word pair (e.g., "in how many pairs does 'candidate' follow 'the'?"), and you've got a basis for comparison.
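The counts for the example sentence can be checked with a few lines of Python, a sketch using itertools.combinations to enumerate the unordered word pairs (the variable names are illustrative):

```python
from itertools import combinations

sentence = "Barack Obama is the Democratic candidate for President".split()

# All unordered pairs of the 8 words: 8 choose 2 = 28.
pairs = list(combinations(sentence, 2))
with_barack = [p for p in pairs if "Barack" in p]
with_both = [p for p in with_barack if "Obama" in p]

print(len(pairs))        # 28
print(len(with_barack))  # 7
print(len(with_both))    # 1
```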
Integers can be used to store individual numbers, but not mathematical expressions. For example, let's say I have the expression:
6x^2 + 5x + 3
How would I store the polynomial? I could create my own object, but I don't see how I could represent the polynomial through member data. I do not want to create a function to evaluate a passed in argument because I do not only need to evaluate it, but also need to manipulate the expression.
Is a vector my only option or is there a more apt solution?
A simple yet inefficient way would be to store it as a list of coefficients. For example, the polynomial in the question would look like this:
[6, 5, 3]
If a term is missing, place a zero in its place. For instance, the polynomial 2x^3 - 4x + 7 would be represented like this:
[2, 0, -4, 7]
The degree of the polynomial is given by the length of the list minus one. This representation has one serious disadvantage: for sparse polynomials, the list will contain a lot of zeros.
A more reasonable representation of the term list of a sparse polynomial is as a list of the nonzero terms, where each term is a list containing the order of the term and the coefficient for that order; the degree of the polynomial is given by the order of the first term. For example, the polynomial x^100+2x^2+1 would be represented by this list:
[[100, 1], [2, 2], [0, 1]]
As an example of how useful this representation is, the book SICP builds a simple but very effective symbolic algebra system using the second representation for polynomials described above.
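To give a flavour of manipulating the sparse term-list form, here is a hedged Python sketch of addition. It assumes both lists are sorted by decreasing order, as in the example above; the name add_terms is made up and this is not SICP's actual code:

```python
def add_terms(p, q):
    """Add two polynomials given as [[order, coeff], ...] sorted by decreasing order."""
    result = []
    i = j = 0
    while i < len(p) and j < len(q):
        (op, cp), (oq, cq) = p[i], q[j]
        if op > oq:
            result.append([op, cp])
            i += 1
        elif op < oq:
            result.append([oq, cq])
            j += 1
        else:
            if cp + cq != 0:  # drop terms that cancel out
                result.append([op, cp + cq])
            i += 1
            j += 1
    # Copy whatever is left over from either list.
    result.extend([o, c] for o, c in p[i:])
    result.extend([o, c] for o, c in q[j:])
    return result

# (x^100 + 2x^2 + 1) + (-2x^2 + 3x + 4) = x^100 + 3x + 5
print(add_terms([[100, 1], [2, 2], [0, 1]], [[2, -2], [1, 3], [0, 4]]))
# [[100, 1], [1, 3], [0, 5]]
```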
A list is not the only option.
You can use a map (dictionary) mapping the exponent to the corresponding coefficient.
Using a map, your example would be
{2: 6, 1: 5, 0: 3}
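With the map representation, evaluation and multiplication take only a few lines. A sketch in Python, where evaluate and multiply are illustrative names rather than any library's API:

```python
def evaluate(p, x):
    # p maps exponent -> coefficient.
    return sum(coeff * x ** exp for exp, coeff in p.items())

def multiply(p, q):
    result = {}
    for ep, cp in p.items():
        for eq, cq in q.items():
            result[ep + eq] = result.get(ep + eq, 0) + cp * cq
    # Drop terms whose coefficients cancelled to zero.
    return {exp: coeff for exp, coeff in result.items() if coeff != 0}

p = {2: 6, 1: 5, 0: 3}           # 6x^2 + 5x + 3
print(evaluate(p, 2))            # 37
print(multiply(p, {1: 1, 0: -1}))  # (6x^2 + 5x + 3)(x - 1) = 6x^3 - x^2 - 2x - 3
```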
A list of (coefficient, exponent) pairs is quite standard. If you know your polynomial is dense, that is, all the exponent positions are small integers in the range 0 to some small maximum exponent, you can use the array, as I see Óscar Lopez just posted. :)
You can represent expressions as Expression Trees. See for example .NET Expression Trees.
This allows for much more complex expressions than simple polynomials and those expressions can also use multiple variables.
In .NET you can manipulate the expression tree as a tree AND you can evaluate it as a function.
using System;
using System.Linq.Expressions;
Expression<Func<double,double>> polynomial = x => (x * x + 2 * x - 1);
double result = polynomial.Compile()(23.0);
An object-oriented approach would say that a Polynomial is a collection of Monomials, and a Monomial encapsulates a coefficient and exponent together.
This approach works well when you have a polynomial like this:
y(x) = x^1000 + 1
An approach that tied a data structure to a polynomial order would be terribly wasteful for this pathological case.
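One possible shape for such classes, sketched in Python. The class layout and method names are assumptions made for illustration, not a standard design:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Monomial:
    coefficient: float
    exponent: int

    def evaluate(self, x):
        return self.coefficient * x ** self.exponent

    def derivative(self):
        return Monomial(self.coefficient * self.exponent, self.exponent - 1)

class Polynomial:
    def __init__(self, monomials):
        # Only nonzero terms are stored, so x^1000 + 1 needs two objects.
        self.monomials = [m for m in monomials if m.coefficient != 0]

    def evaluate(self, x):
        return sum(m.evaluate(x) for m in self.monomials)

    def derivative(self):
        # Constant terms vanish under differentiation.
        return Polynomial(m.derivative() for m in self.monomials if m.exponent > 0)

y = Polynomial([Monomial(1, 1000), Monomial(1, 0)])  # x^1000 + 1
print(y.evaluate(1))             # 2
print(y.derivative().monomials)  # [Monomial(coefficient=1000, exponent=999)]
```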
You need to store two things:
The degree of your polynomial (e.g. "3")
A list containing each coefficient (e.g. "{3, 0, 2}")
In standard C++, "std::vector<>" and "std::list<>" can do both.
A vector/array is the obvious choice. Depending on the type of expressions, you may consider some sort of sparse vector type (custom made, e.g. based on a dictionary, or even a linked list if your expressions have only 2-3 non-zero coefficients, as in 5x^100+x).
In either case exposing through custom class/interface would be beneficial as you can replace implementation later. You would likely want to provide standard operations (+, -, *, equals) if you plan to write a lot of expression manipulation code.
Just store the coefficients in an array or vector. For example, in C++ if you are only using integer coefficients, you could use std::vector<int>, or for real numbers, std::vector<double>. Then you just push the coefficients in order of increasing exponent and access each one by the exponent of its term.
For example (again in C++), to store 5*x^3 + 9*x - 2 you might do:
std::vector<int> poly;
poly.push_back(-2); // x^0, accessed with poly[0]
poly.push_back(9); // x^1, accessed with poly[1]
poly.push_back(0); // x^2, accessed with poly[2]
poly.push_back(5); // x^3, accessed with poly[3]
If you have large, sparse polynomials, then maybe you'd want to use a map instead of a vector. If your polynomials have a fixed length, then you'd perhaps use a fixed-length array instead of a vector.
I've used C++ for examples, but this same scheme can be used in any language.
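For instance, the same layout in Python might look like the sketch below, where index i holds the coefficient of x^i. The helper names evaluate and add are illustrative, and evaluation uses Horner's rule:

```python
def evaluate(coeffs, x):
    """coeffs[i] is the coefficient of x**i; evaluated with Horner's rule."""
    result = 0
    for c in reversed(coeffs):
        result = result * x + c
    return result

def add(p, q):
    # Pad the shorter list with zeros implicitly.
    n = max(len(p), len(q))
    return [(p[i] if i < len(p) else 0) + (q[i] if i < len(q) else 0)
            for i in range(n)]

poly = [-2, 9, 0, 5]        # 5*x^3 + 9*x - 2
print(evaluate(poly, 2))    # 56
print(add(poly, [1, 1]))    # [-1, 10, 0, 5]
```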
You can also transform it into reverse Polish notation:
6x^2 + 5x + 3 -> x 2 ^ 6 * x 5 * + 3 +
Where x and numbers are "pushed" onto a stack and operations (^,*,+) take the two top-most values from the stack and replace them with the result of the operation. In the end you get the resultant value on the stack.
In this form it's easy to calculate arbitrarily complex expressions.
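A minimal stack evaluator for this notation, sketched in Python. It assumes space-separated tokens, a single variable named x, and only the three binary operators shown above; eval_rpn is an illustrative name:

```python
import operator

OPS = {"+": operator.add, "*": operator.mul, "^": operator.pow}

def eval_rpn(tokens, x):
    stack = []
    for tok in tokens:
        if tok in OPS:
            # The top of the stack is the right-hand operand.
            right = stack.pop()
            left = stack.pop()
            stack.append(OPS[tok](left, right))
        elif tok == "x":
            stack.append(x)
        else:
            stack.append(float(tok))
    return stack.pop()

print(eval_rpn("x 2 ^ 6 * x 5 * + 3 +".split(), 2))  # 37.0
```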
This representation is also close to tree representation of expressions where non-leaf tree nodes represent operations and functions and leaf nodes are for constants and variables.
What's good about trees is that you can also easily evaluate expressions and you can also do things like symbolic differentiation on them. Both have recursive nature.
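Both operations can be sketched recursively in Python. This assumes a tree made of nested tuples (op, left, right) with only "+" and "*" as operators, numbers and the string "x" as leaves; the encoding is invented for illustration:

```python
def evaluate(node, x):
    if node == "x":
        return x
    if isinstance(node, (int, float)):
        return node
    op, left, right = node
    l, r = evaluate(left, x), evaluate(right, x)
    return l + r if op == "+" else l * r

def differentiate(node):
    if node == "x":
        return 1
    if isinstance(node, (int, float)):
        return 0
    op, left, right = node
    if op == "+":
        # Sum rule: (f + g)' = f' + g'
        return ("+", differentiate(left), differentiate(right))
    # Product rule: (f * g)' = f' * g + f * g'
    return ("+", ("*", differentiate(left), right),
                 ("*", left, differentiate(right)))

# 6x^2 + 5x + 3, with x^2 written as x * x
tree = ("+", ("+", ("*", 6, ("*", "x", "x")), ("*", 5, "x")), 3)
print(evaluate(tree, 2))                  # 37
print(evaluate(differentiate(tree), 2))   # 29, i.e. 12x + 5 at x = 2
```

The derivative tree is not simplified here; a fuller system would fold away the 0 and 1 factors.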