Go language sentiment analysis - math

I'm using the following library to perform some sentiment analysis of Facebook posts on my feed as an experiment for a bit of fun: https://github.com/cdipaolo/sentiment
But the problem I'm facing is that the Analysis object that is returned from the model.SentimentAnalysis() call doesn't have a weighted score. The score value it returns for the sentence is either 0 or 1. This makes it too vaguely defined, and I'd like to have a scale of sentiment for each Facebook post, so a float from 0.0-1.0 would be ideal where 1 is 100% positive and 0 is 100% negative.
Is there a way I can use the Words variable in the object (you can see it near the bottom of this file: https://github.com/cdipaolo/sentiment/blob/master/model.go) to loop over the word scores in the sentence and create my own weighted sentiment score? For example, would something like positive_words / total_words work? That would give me the percentage of words in the post that are positive, but the weighting problem comes back because the words aren't weighted either. Say I got back a score of 0.75: the true score could be a lot lower, because the words may have been only very slightly above the positive threshold. I can't tell, because each word is only a 0 for negative or a 1 for positive, not a weighted float value.
So my question here is, is there some way mathematically that I can create my own weight score given the data that is provided, or do I not have enough data to do this?
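As a rough illustration of the positive_words / total_words idea, here is a minimal Python sketch. The list of 0/1 labels is a hypothetical stand-in for the per-word scores and does not use the library's API; the function name `positive_fraction` is also made up for this example.

```python
def positive_fraction(word_scores):
    """Return the share of words labelled positive (1) among all scored words."""
    if not word_scores:
        return None  # no scored words, so no meaningful fraction
    return sum(word_scores) / len(word_scores)


# Hypothetical labels for a four-word post: 1 = positive, 0 = negative.
scores = [1, 1, 0, 1]
print(positive_fraction(scores))  # 0.75
```

The result is a float between 0.0 and 1.0, but it only counts binary labels, so it cannot recover how strongly positive or negative each word was.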

Related

correlation; lower values better than higher values R

I am trying to calculate the correlation between a vector of investment returns and a matching vector that has a number from 1 to 5 rating the quality of the company. It looks something like this (let's call this data returnrank):
company returns rank
at&t 0.09034 2
verizon 0.23341 1
sprint 0.03021 3
How can I make it so that when I calculate cor(returnrank$returns, returnrank$rank) it treats lower values as better and higher values as worse in the rank column?
(i.e., if a stock has high returns and what R would consider a low score (1), I want to see a high positive correlation, because I am treating 1 as better than 5.)
You probably just want:
cor(returnrank$returns, max(returnrank$rank) - returnrank$rank)
It may be better to just graph the data, since the relationship is unlikely to be linear given the nature of rank.
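The effect of flipping the rank can be checked on the three rows from the question. This is a sketch in Python rather than R, with a hand-written `pearson` helper (a hypothetical name) standing in for R's cor():

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


returns = [0.09034, 0.23341, 0.03021]  # at&t, verizon, sprint
ranks = [2, 1, 3]

# Reverse the scale so that the best rank (1) gets the largest value.
flipped = [max(ranks) - r for r in ranks]

original = pearson(returns, ranks)
reversed_cor = pearson(returns, flipped)
print(original, reversed_cor)
```

Because the flip is a linear transformation with a negative slope, the result is simply the original correlation with its sign changed.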

Mathematical function for string similarity score

I'm working on a string similarity algorithm, and was thinking about how to give a score between 0 and 1 when comparing two strings. The two variables for this function are the Levenshtein distance D (the number of added, removed and changed characters) and the maximum length of the two strings L (but you could also take the average).
My initial algorithm was just 1-D/L, but this gave scores that were too high for short strings, e.g. 'tree' and 'bee' would get a score of 0.5, and scores that were too low for longer strings which have more in common even if half of the characters are different.
Now I'm looking for a mathematical function that can output a better score. I wasn't able to come up with one, so I sketched this height map of a 3D plot (L is x and D = y).
Does anyone know how to convert such a graph to an equation, whether I would be better off just creating a lookup table, or whether there is an existing solution?
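For reference, the initial 1-D/L score described in the question can be sketched in Python as follows. The function names are hypothetical, and the Levenshtein routine is the standard dynamic-programming version, not necessarily the one the asker used:

```python
def levenshtein(a, b):
    """Edit distance counting insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def similarity(a, b):
    """Score in [0, 1]: 1 - D / L, with L the longer of the two lengths."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1 - levenshtein(a, b) / longest


print(similarity("tree", "bee"))  # 0.5
```

This reproduces the 0.5 for 'tree' and 'bee' mentioned above: two edits over a maximum length of four.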

In tf-idf why do we normalize by document frequency and not average term frequency across all documents in the corpus?

Average term frequency would be the average frequency that term appears in other documents. Intuitively I want to compare how frequently it appears in this document relative to the other documents in the corpus.
An example:
d1 has the word "set" 100 times, d2 has the word "set" 1 time, d3 has the word "set" 1 time, d4-N does not have the word set
d1 has the word "theory" 100 times, d2 has the word "theory" 100 times, d3 has the word "theory" 100 times, d4-N does not have the word theory
Document 1 has the same tf-idf for the word "set" and the word "theory" even though the word set is more important to d1 than theory.
Using average term frequency would distinguish these two examples. Is tf-iatf (inverse average term frequency) a valid approach? To me it would give me more important keywords, rather than just "rare" and "unique" keywords. If idf is "an estimate of how rare that word is" wouldn't iatf be a better estimate? It seems only marginally harder to implement (especially if the data is pre-processed).
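The example can be checked numerically. In this Python sketch the corpus size N = 10 is an assumption (the question leaves N open), and `tf_over_average_tf` is only one possible reading of "relative to average term frequency", not an established measure:

```python
from math import log

N = 10  # hypothetical corpus size; the question does not fix N

# Per-document counts for each word: d1, d2, d3, then zeros for d4..dN.
set_counts = [100, 1, 1] + [0] * (N - 3)
theory_counts = [100, 100, 100] + [0] * (N - 3)


def tf_idf(counts, doc):
    """Raw term frequency times log(N / document frequency)."""
    df = sum(1 for c in counts if c > 0)
    return counts[doc] * log(len(counts) / df)


def tf_over_average_tf(counts, doc):
    """Term frequency divided by the average frequency over all documents."""
    average = sum(counts) / len(counts)
    return counts[doc] / average


# Both words occur in 3 documents, so d1 gets the same tf-idf for each.
print(tf_idf(set_counts, 0) == tf_idf(theory_counts, 0))  # True

# The average-based ratio separates them: "set" is concentrated in d1.
print(tf_over_average_tf(set_counts, 0), tf_over_average_tf(theory_counts, 0))
```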
I am thinking of running an experiment and manually analyzing the highest ranked keywords with each measure, but wanted to pass it by some other eyes first.
A follow-up question:
Why is tf-idf used so frequently as opposed to alternative methods like this which MAY be more accurate? (If this is a valid approach, that is.)
Update:
Ran an experiment where I manually analyzed the scores and corresponding top words for a few dozen documents, and it seems like iatf and inverse collection frequency (the standard approach to what I described) have super similar results.
Tf-idf is not meant to compare the importance of a word in a document across two corpora.
It is rather meant to distinguish the importance of a word within a document in relation to the distribution of the same term in the other documents of the same collection (not across collections).
A standard approach that you can apply for your case is: collection frequency, cf(t), instead of document frequency, df(t).
cf(t) measures how many times a term t occurs in the corpus.
cf(t) divided by the total collection size would give you the probability of sampling t from the collection.
And then you can compute a linear combination of tf(t,d) and cf(t) values, which gives you the probability of sampling a term t either from a document or from the collection.
P(t,d) = \lambda P(t|d) + (1-\lambda) P(t|Collection)
This is known as the Jelinek-Mercer smoothed language model.
For your example (letting \lambda=0.5):
Corpus 1: P("set",d1) = 0.5*100/100 + 0.5*100/102
Corpus 2: P("set",d1) = 0.5*100/100 + 0.5*100/300
Clearly, P("set",d1) is lower for corpus 2 (about 0.67) than for corpus 1 (about 0.99), because the collection term drops to roughly one-third.
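The interpolation can be tried out with a small Python sketch. The function name `jelinek_mercer` is hypothetical, and the two probabilities are plugged in exactly as the worked example writes them (100/100 for the document, 100/102 or 100/300 for the collection):

```python
def jelinek_mercer(p_doc, p_collection, lam=0.5):
    """Linear interpolation: lam * P(t|d) + (1 - lam) * P(t|Collection)."""
    return lam * p_doc + (1 - lam) * p_collection


corpus_1 = jelinek_mercer(100 / 100, 100 / 102)
corpus_2 = jelinek_mercer(100 / 100, 100 / 300)

print(round(corpus_1, 3))  # 0.99
print(round(corpus_2, 3))  # 0.667
```

Raising lam gives more weight to the document itself, while lowering it leans more on the collection statistics.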

Calculate correlation coefficient between words?

For a text analysis program, I would like to analyze the co-occurrence of certain words in a text. For example, I would like to see that e.g. the words "Barack" and "Obama" appear more often together (i.e. have a positive correlation) than others.
This does not seem to be that difficult. However, to be honest, I only know how to calculate the correlation between two numbers, but not between two words in a text.
How can I best approach this problem?
How can I calculate the correlation between words?
I thought of using conditional probabilities, since e.g. Barack Obama is much more probable than Obama Barack; however, the problem I'm trying to solve is much more fundamental and does not depend on the ordering of the words.
The Ngram Statistics Package (NSP) is devoted precisely to this task. They have a paper online which describes the association measures they use. I haven't used the package myself, so I cannot comment on its reliability/requirements.
Well, a simple way to solve your question is by shaping the data into a 2x2 matrix

              obama   not obama
barack          A         B
not barack      C         D

and scoring all occurring bi-grams in the matrix. That way you can, for instance, use a simple chi-squared test.
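A minimal Python sketch of the chi-squared statistic for such a 2x2 table, using the standard shortcut formula N(AD - BC)^2 / ((A+B)(C+D)(A+C)(B+D)). The function name and the example counts are hypothetical:

```python
def chi_squared_2x2(a, b, c, d):
    """Pearson chi-squared statistic for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0  # an empty row or column carries no association signal
    return n * (a * d - b * c) ** 2 / denominator


# Hypothetical bigram counts:
# A = "barack obama", B = "barack <other>",
# C = "<other> obama", D = neither word in the bigram.
print(chi_squared_2x2(20, 5, 3, 972))
```

A larger value indicates a stronger departure from independence between the two words; a table with no association yields 0.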
I don't know how this is commonly done, but I can think of one crude way to define a notion of correlation that captures word adjacency.
Suppose the text has length N, say it is an array
text[0], text[1], ..., text[N-1]
Suppose the following words appear in the text
word[0], word[1], ..., word[k]
For each word word[i], define a vector of length N-1
X[i] = array(); // of length N-1
as follows: the jth entry of X[i] is 1 if word[i] is either the jth or the (j+1)th word of the text, and zero otherwise.
// compute the vector X[i]
for (j = 0; j <= N-2; j++) {
    if (text[j] == word[i] || text[j+1] == word[i])
        X[i][j] = 1;
    else
        X[i][j] = 0;
}
Then you can compute the correlation coefficient between word[a] and word[b] as the dot product between X[a] and X[b] (note that the dot product is the number of times these words are adjacent) divided by the product of the vectors' lengths (each length is the square root of the number of 1s in the vector, which is roughly twice the number of appearances of the word). Call this quantity COR(X[a],X[b]). Clearly COR(X[a],X[a]) = 1, and COR(X[a],X[b]) is larger if word[a] and word[b] are often adjacent.
This can be generalized from "adjacent" to other notions of near - for example we could have chosen to use 3 word (or 4, 5, etc.) blocks instead. One can also add weights, probably do many more things as well if desired. One would have to experiment to see what is useful, if any of it is of use at all.
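The adjacency-based measure described in this answer can be sketched in Python as follows. The function names and the sample sentence are made up for illustration:

```python
from math import sqrt


def adjacency_vector(text, word):
    """Entry j is 1 if `word` is at position j or j+1 of the text, else 0."""
    return [1 if text[j] == word or text[j + 1] == word else 0
            for j in range(len(text) - 1)]


def cor(text, a, b):
    """Dot product of the two adjacency vectors over the product of norms."""
    x_a = adjacency_vector(text, a)
    x_b = adjacency_vector(text, b)
    dot = sum(p * q for p, q in zip(x_a, x_b))
    # For a 0/1 vector the Euclidean norm is sqrt(number of ones).
    norm = sqrt(sum(x_a)) * sqrt(sum(x_b))
    return dot / norm if norm else 0.0


text = "barack obama met the press and barack obama spoke".split()

print(cor(text, "barack", "obama"))  # high: the words are always adjacent
print(cor(text, "barack", "press"))  # 0.0: the words are never adjacent
```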
This problem sounds like a bigram, a sequence of two "tokens" in a larger body of text. See this Wikipedia entry, which has additional links to the more general n-gram problem.
If you want to do a full analysis, you'd most likely take any given pair of words and do a frequency analysis. E.g., the sentence "Barack Obama is the Democratic candidate for President," has 8 words, so there are 8 choose 2 = 28 possible pairs.
You can then ask statistical questions like, "in how many pairs does 'Obama' follow 'Barack', and in how many pairs does some other word (not 'Obama') follow 'Barack'?" In this case, there are 7 pairs that include 'Barack', but in only one of them is it paired with 'Obama'.
Do the same for every possible word pair (e.g., "in how many pairs does 'candidate' follow 'the'?"), and you've got a basis for comparison.
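The pair counts quoted in this answer can be reproduced with a few lines of Python, using itertools.combinations to enumerate the unordered word pairs of the example sentence:

```python
from itertools import combinations

sentence = "Barack Obama is the Democratic candidate for President"
words = sentence.split()

# All unordered pairs of words in the sentence.
pairs = list(combinations(words, 2))

# Pairs that contain 'Barack', and among those, pairs that also contain 'Obama'.
with_barack = [pair for pair in pairs if "Barack" in pair]
barack_obama = [pair for pair in with_barack if "Obama" in pair]

print(len(pairs))         # 28
print(len(with_barack))   # 7
print(len(barack_obama))  # 1
```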

Cosine similarity and tf-idf

I am confused by the following comment about TF-IDF and Cosine Similarity.
I was reading up on both and then on wiki under Cosine Similarity I find this sentence "In case of information retrieval, the cosine similarity of two documents will range from 0 to 1, since the term frequencies (tf-idf weights) cannot be negative. The angle between two term frequency vectors cannot be greater than 90."
Now I'm wondering....aren't they 2 different things?
Is tf-idf already inside the cosine similarity? If yes, then what the heck - I can only see the inner dot products and euclidean lengths.
I thought tf-idf was something you could do before running cosine similarity on the texts. Did I miss something?
Tf-idf is a transformation you apply to texts to get two real-valued vectors. You can then obtain the cosine similarity of any pair of vectors by taking their dot product and dividing that by the product of their norms. That yields the cosine of the angle between the vectors.
If d2 and q are tf-idf vectors, then
\cos\theta = \frac{d_2 \cdot q}{\lVert d_2 \rVert \, \lVert q \rVert}
where θ is the angle between the vectors. As θ ranges from 0 to 90 degrees, cos θ ranges from 1 to 0. θ can only range from 0 to 90 degrees, because tf-idf vectors are non-negative.
There's no particularly deep connection between tf-idf and the cosine similarity/vector space model; tf-idf just works quite well with document-term matrices. It has uses outside of that domain, though, and in principle you could substitute another transformation in a VSM.
(Formula taken from the Wikipedia, hence the d2.)
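The calculation described in this answer (dot product divided by the product of the norms) can be sketched in plain Python. The vectors below are hypothetical tf-idf weights, not taken from any real corpus:

```python
from math import sqrt


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    # Raises ZeroDivisionError if either vector is all zeros.
    return dot / (norm_u * norm_v)


# Hypothetical non-negative tf-idf vectors over a three-term vocabulary.
d2 = [0.0, 2.4, 0.01]
q = [1.2, 0.8, 0.0]

print(cosine_similarity(d2, q))
```

Since all weights are non-negative, the dot product cannot be negative and the result stays between 0 and 1.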
TF-IDF is just a way to measure the importance of tokens in text; it's just a very common way to turn a document into a list of numbers (the term vector that provides one edge of the angle you're getting the cosine of).
To compute cosine similarity, you need two document vectors; the vectors represent each unique term with an index, and the value at that index is some measure of how important that term is to the document and to the notion of document similarity in general.
You could simply count the number of times each term occurred in the document (Term Frequency), and use that integer result for the term score in the vector, but the results wouldn't be very good. Extremely common terms (such as "is", "and", and "the") would cause lots of documents to appear similar to each other. (Those particular examples can be handled by using a stopword list, but other common terms that are not general enough to be considered a stopword cause the same sort of issue. On Stackoverflow, the word "question" might fall into this category. If you were analyzing cooking recipes, you'd probably run into issues with the word "egg".)
TF-IDF adjusts the raw term frequency by taking into account how frequently each term occurs in general (the Document Frequency). Inverse Document Frequency is usually the log of the number of documents divided by the number of documents the term occurs in (formula from Wikipedia):
\mathrm{idf}(t, D) = \log \frac{N}{|\{d \in D : t \in d\}|}
Think of the 'log' as a minor nuance that helps things work out in the long run: it grows when its argument grows, so if the term is rare, the IDF will be high (lots of documents divided by very few documents), and if the term is common, the IDF will be low (lots of documents divided by lots of documents ~= 1, and the log of a value near 1 is near 0).
Say you have 100 recipes, and all but one require eggs. Now you have three more documents that all contain the word "egg": once in the first document, twice in the second document and once in the third document. The term frequency for 'egg' in each document is 1 or 2, and the document frequency is 99 (or, arguably, 102, if you count the new documents. Let's stick with 99).
The TF-IDF of 'egg' is:
1 * log (100/99) = 0.01 # document 1
2 * log (100/99) = 0.02 # document 2
1 * log (100/99) = 0.01 # document 3
These are all pretty small numbers; in contrast, let's look at another word that only occurs in 9 of your 100 recipe corpus: 'arugula'. It occurs once in the first doc, twice in the second, and does not occur in the third document.
The TF-IDF for 'arugula' is:
1 * log (100/9) = 2.41 # document 1
2 * log (100/9) = 4.82 # document 2
0 * log (100/9) = 0 # document 3
'arugula' is really important for document 2, at least compared to 'egg'. Who cares how many times egg occurs? Everything contains egg! These term vectors are a lot more informative than simple counts, and they will result in documents 1 & 2 being much closer together (with respect to document 3) than they would be if simple term counts were used. In this case, the same result would probably arise (hey! we only have two terms here), but the difference would be smaller.
The take-home here is that TF-IDF generates more useful measures of a term in a document, so you don't focus on really common terms (stopwords, 'egg'), and lose sight of the important terms ('arugula').
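The arithmetic for 'egg' and 'arugula' can be reproduced with a short Python sketch. It assumes the natural logarithm and uses the term frequencies that appear in the worked lines (1, 2, 1 for 'egg' and 1, 2, 0 for 'arugula'); the function name `tf_idf` is hypothetical:

```python
from math import log

N_DOCS = 100  # size of the recipe corpus in the example


def tf_idf(tf, df, n_docs=N_DOCS):
    """Raw term frequency times log(number of docs / document frequency)."""
    return tf * log(n_docs / df)


egg = [tf_idf(tf, df=99) for tf in (1, 2, 1)]
arugula = [tf_idf(tf, df=9) for tf in (1, 2, 0)]

print([round(x, 2) for x in egg])      # [0.01, 0.02, 0.01]
print([round(x, 2) for x in arugula])  # [2.41, 4.82, 0.0]
```

Values are rounded to two decimals here, so they may differ in the last digit from figures that were truncated instead.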
The complete mathematical procedure for cosine similarity is explained in these tutorials
part-I
part-II
part-III
Suppose you want to calculate the cosine similarity between two documents. The first step is to calculate the tf-idf vectors of the two documents, and then find the dot product of these two vectors. Those tutorials will help you :)
tf-idf weighting fails in some cases and can produce NaN errors in code during the computation. It's very important to read this:
http://www.p-value.info/2013/02/when-tfidf-and-cosine-similarity-fail.html
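One way such a failure can arise (this scenario is an assumption for illustration, not a summary of the linked article): if every term occurs in every document, each idf is log(N/N) = 0, all tf-idf vectors are zero, and the cosine becomes 0/0. A hedged Python sketch with a hypothetical `safe_cosine` helper:

```python
from math import isnan, log, sqrt


def safe_cosine(u, v):
    """Cosine similarity that returns NaN instead of dividing by zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    if norm == 0:
        return float("nan")  # at least one vector is all zeros
    return dot / norm


n_docs = 2
idf = log(n_docs / 2)  # both terms occur in both documents, so idf is 0.0

# Raw counts (3, 1) and (1, 2) are wiped out by the zero idf.
doc_a = [3 * idf, 1 * idf]
doc_b = [1 * idf, 2 * idf]

print(isnan(safe_cosine(doc_a, doc_b)))  # True
```

In plain Python the unguarded division would raise ZeroDivisionError; array libraries typically return NaN instead, so it is worth checking for zero-length vectors before comparing documents.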
Tf-idf is just used to build vectors from the documents. It is based on tf (term frequency), which counts how many times the term occurs in the document, and idf (inverse document frequency), which reflects how many documents in the whole collection contain the term.
Then you can find the cosine similarity between the documents.
TF-IDF stands for term frequency-inverse document frequency; computing the cosine similarity of a TF-IDF vector against the document matrix returns the similar listings.
