GloVe word embeddings containing sentiment? - r

I've been researching sentiment analysis with word embeddings. I read papers stating that word embeddings ignore the sentiment information of words. One paper states that among the top 10 semantically similar words, around 30 percent have opposite polarity (e.g. happy - sad).
So I computed word embeddings on my dataset (Amazon reviews) with the GloVe algorithm in R. I then looked at the most similar words by cosine similarity and found that virtually every neighbour is also sentimentally similar (e.g. beautiful - lovely - gorgeous - pretty - nice - love). I was wondering how this is possible, since I expected the opposite from reading several papers. What could be the reason for my findings?
Two of the many papers I read:
Yu, L. C., Wang, J., Lai, K. R. & Zhang, X. (2017). Refining Word Embeddings Using Intensity Scores for Sentiment Analysis. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 26(3), 671-681.
Tang, D., Wei, F., Yang, N., Zhou, M., Liu, T. & Qin, B. (2014). Learning Sentiment-Specific Word Embedding for Twitter Sentiment Classification. Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, 1: Long Papers, 1555-1565.

Assumption:
When you say you computed GloVe embeddings, I take it to mean you used pretrained GloVe vectors.
Static word embeddings do not carry sentiment information of the input text at runtime
This means that word embedding algorithms (most of them, to my knowledge, such as GloVe and Word2Vec) are not designed or formulated to capture the sentiment of a word. Instead, they map words that are similar in meaning (based on statistical nearness and co-occurrences) to nearby points: for example, "woman" and "girl" will lie near each other in the n-dimensional embedding space. But that does not mean any sentiment-related information is captured.
Hence,
the words (beautiful - lovely - gorgeous - pretty - nice - love) being sentimentally similar to a given word is not a coincidence. We have to look at these words in terms of their meaning: they are all similar in meaning, but we cannot say that they necessarily carry the same sentiment. These words lie near each other in GloVe's vector space because the model was trained on a corpus that carried sufficient co-occurrence information to group them together.
Also, please study the actual similarity scores; that will make it clearer.
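To make the point concrete, here is a minimal sketch with hand-made toy vectors (not real GloVe output; the vectors and values are invented for illustration): words that appear in near-identical contexts end up with high cosine similarity regardless of their polarity.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 4-dimensional vectors, chosen by hand to mimic the effect:
# "good" and "bad" occur in nearly identical contexts ("the movie was ___"),
# so co-occurrence-based training pulls their vectors together even though
# their polarity is opposite. "table" has different contexts, so it lands
# far from both.
toy_vectors = {
    "good":  [0.9, 0.8, 0.1, 0.2],
    "bad":   [0.8, 0.9, 0.1, 0.3],
    "table": [0.1, 0.2, 0.9, 0.8],
}

print(cosine(toy_vectors["good"], toy_vectors["bad"]))    # high, despite opposite polarity
print(cosine(toy_vectors["good"], toy_vectors["table"]))  # low, different contexts
```

With real GloVe vectors the effect is the same in kind: the nearest neighbours of a word are words that share its contexts, which often (but not always) share its sentiment.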
The top 10 words that are semantically similar, around 30 percent of words have opposite polarity
Here, semantic similarity is less tied to context, whereas sentiment is strongly context-dependent. One word alone cannot define sentiment.
Example:
Jack: "Your dress is beautiful, Gloria"!
Gloria: "Beautiful my foot!"
In the two sentences, beautiful carries completely different sentiment, yet both occurrences get the same embedding. Now replace beautiful with lovely, gorgeous, pretty, or nice: the semantic claim holds true as described in one of the papers, and since sentiment is not captured by word embeddings, the other paper also stands true.
The point where confusion may have occurred is in treating two or more words with similar meanings as sentimentally similar. Sentiment information can be gathered at the sentence or document level, not at the word level.

Related

How to use the SentimentAnalysis package in R? What is the meaning of GI, HE, and LM?

Can anyone explain the meaning of GI, HE, LM, QDAP in the SentimentAnalysis package in R? What is the best way to identify the polarity of a sentence using this package given that it gives multiple answers if we look at each of these above-mentioned columns?
It is better to have somebody with domain knowledge deepen your understanding of NLP (Natural Language Processing) and sentiment analysis, along with all their terminology.
You can also read the docs for the SentimentAnalysis package at https://cran.r-project.org/web/packages/SentimentAnalysis/SentimentAnalysis.pdf. GI, for example, stands for General Inquirer and is a Harvard-IV dictionary, one of many dictionaries available in the SentimentAnalysis package. In the realm of sentiment analysis, these dictionaries provide lists of words that are considered positive or negative; some dictionaries even provide scores rating how positive or negative a word is. Please read this documentation first so you can ask a more specific question.
One way to identify the polarity of a sentence is to collect all sentiment-bearing words and average out their sentiments (for example, if one sentence contains 3 positive and 2 negative words, we can regard it as having an overall positive sentiment). This is just a basic method for analyzing sentiment within a sentence. Natural language is very complex, so of course you have to consider context when analyzing sentiment (e.g. "I am not sad" has positive sentiment due to the negation term "not", despite containing a word considered negative, i.e. "sad").
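The averaging-with-negation idea above can be sketched in a few lines. This is a toy, language-agnostic illustration with a hypothetical mini-lexicon (real dictionaries like GI or LM are far larger), not the SentimentAnalysis package's actual algorithm:

```python
# Hypothetical mini-lexicon for illustration only.
POSITIVE = {"good", "great", "happy"}
NEGATIVE = {"bad", "sad", "terrible"}
NEGATORS = {"not", "no", "never"}

def sentence_polarity(sentence):
    """Average word-level polarity; a negator flips the following sentiment word."""
    tokens = sentence.lower().split()
    scores = []
    for i, tok in enumerate(tokens):
        if tok in POSITIVE:
            score = 1
        elif tok in NEGATIVE:
            score = -1
        else:
            continue
        # Flip polarity if the previous token is a negator ("not sad" -> positive).
        if i > 0 and tokens[i - 1] in NEGATORS:
            score = -score
        scores.append(score)
    return sum(scores) / len(scores) if scores else 0.0

print(sentence_polarity("i am not sad"))           # 1.0
print(sentence_polarity("good movie bad ending"))  # 0.0
```

Flipping only the immediately following word is of course crude; real negation scope handling is considerably more involved.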

How to determine negative trigrams

I am trying to search through a load of medical reports. I want to determine the type of language used to express an absence of a finding, e.g. "There is no blabla" or "Neither blabla nor bleble is seen here". There are many variations.
My original idea was to divide the text into trigrams and then perform sentiment analysis on the trigrams, and then have a look at the negative trigrams and manually select the ones that denote an absence of something. Then I wanted to design some regular expressions around these absence trigrams.
I get the feeling, however, that I am not really looking for sentiment analysis but more of a negation search. I could, I suppose, just look for all sentences with 'not', 'neither', 'nor', or 'no', but I'm sure I'll fall into some kind of linguistic trap.
Does anyone have a comment on my sentiment approach? And if it is sound, could I have some guidance on sentiment analysis with trigrams (or bigrams, I suppose), as all the R tutorials I have found demonstrate unigram sentiment analysis?
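The plain negation-search fallback mentioned above can be sketched as a regular-expression filter. The cue list here is hypothetical and far from complete; dedicated clinical negation tools (e.g. the NegEx algorithm) use much larger trigger lists plus scope rules:

```python
import re

# Hypothetical negation cues for illustration; real systems need many more
# triggers and must handle scope ("no evidence of X, but Y is present").
NEGATION_PATTERN = re.compile(
    r"\b(?:no|not|neither|nor|without|absent|denies)\b",
    re.IGNORECASE,
)

def negated_sentences(report):
    """Return the sentences of a report that contain a negation cue."""
    sentences = re.split(r"(?<=[.!?])\s+", report)
    return [s for s in sentences if NEGATION_PATTERN.search(s)]

report = ("There is no pleural effusion. The heart size is normal. "
          "Neither consolidation nor pneumothorax is seen.")
print(negated_sentences(report))  # first and third sentences
```

Note the word boundaries (`\b`): without them, "no" would falsely match inside "normal" — exactly the kind of linguistic trap the question anticipates.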

word vector and paragraph vector query

I am trying to understand the relation between word2vec and doc2vec vectors in Gensim's implementation. In my application, I tag multiple documents with the same label (topic). I am training a doc2vec model on my corpus with dbow_words=1 in order to train word vectors as well. I have been able to obtain similarities between word and document vectors this way, which does make a lot of sense.
For ex. getting documents labels similar to a word-
doc2vec_model.docvecs.most_similar(positive=[doc2vec_model["management"]], topn=50)
My question, however, is about the theoretical interpretation of computing similarity between word2vec and doc2vec vectors. Would it be safe to assume that, when trained on the same corpus with the same dimensionality (d = 200), word vectors and document vectors can always be compared to find similar words for a document label, or similar document labels for a word? Any suggestions/ideas are most welcome.
Question 2: My other question is about the impact of a word's frequency on the final word2vec model. If wordA and wordB have similar contexts within a particular doc label's set of documents, but wordA has a much higher frequency than wordB, would wordB have a higher similarity score with the corresponding doc label or not? I am training multiple word2vec models by sampling the corpus in a temporal fashion, and I want to test the hypothesis that as words get more and more frequent (assuming their context stays relatively similar), their similarity score with a document label also increases. Am I wrong to make this assumption? Any suggestions/ideas are very welcome.
Thanks,
Manish
In a training mode where word-vectors and doctag-vectors are interchangeably used during training, for the same surrounding-words prediction-task, they tend to be meaningfully comparable. (Your mode, DBOW with interleaved skip-gram word-training, fits this and is the mode used by the paper 'Document Embedding with Paragraph Vectors'.)
Your second question is abstract and speculative; I think you'd have to test those ideas yourself. The Word2Vec/Doc2Vec processes train the vectors to be good at certain mechanistic word-prediction tasks, subject to the constraints of the model and tradeoffs with other vectors' quality. That the resulting spatial arrangement happens to be then useful for other purposes – ranked/absolute similarity, similarity along certain conceptual lines, classification, etc. – is then just an observed, pragmatic benefit. It's a 'trick that works', and might yield insights, but many of the ways models change in response to different parameter choices or corpus characteristics haven't been theoretically or experimentally worked-out.

How to apply topic modeling?

I have 10,000 tweets for 5 topics. Assume I know the ground truth (the actual topic of each tweet) and I group the tweets into 5 documents, where each document contains the tweets for a particular topic. Then I apply LDA to the 5 documents with the number of topics set to 5, and I get good topic words.
Now, if I don't know the ground truth of the tweets, how do I build the input documents so that LDA will still give me good topic words describing the 5 topics?
What if I create input documents by randomly selecting a sample of tweets? What if this ends up with similar topic mixtures across the input documents? Will LDA still find good topic words, as in the case described in the first paragraph?
If I understand correctly, your problem is about topic modeling on short texts (Tweets). One approach is to combine Tweets into long pseudo-documents before training LDA. Another one is to assume that there is only one topic per document/Tweet.
In the case that you don't know the ground truth labels of Tweets, you might want to try the one-topic-per-document topic model (i.e. mixture-of-unigrams). The model details are described in:
Jianhua Yin and Jianyong Wang. 2014. A Dirichlet Multinomial Mixture Model-based Approach for Short Text Clustering. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 233–242.
You can find my Java implementations of this model and of LDA at http://jldadmm.sourceforge.net/ Assuming you do know the ground-truth labels, you can also use my implementation to compare these topic models on a document clustering task.
If you'd like to evaluate topic coherence (i.e. how good the topic words are), I would suggest having a look at the Palmetto toolkit (https://github.com/AKSW/Palmetto), which implements the topic coherence calculations.
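The pseudo-document pooling approach mentioned above (combining Tweets into longer documents before training LDA) can be sketched as follows. The pooling key here is a hashtag, which is just one common heuristic; author or time-window pooling works the same way:

```python
from collections import defaultdict

# Toy tweets tagged with a pooling key (here: a hashtag).
tweets = [
    ("#soccer", "great goal in the final minute"),
    ("#soccer", "the referee missed a clear penalty"),
    ("#politics", "new bill passed the senate today"),
    ("#politics", "the debate focused on tax reform"),
]

def pool_tweets(tagged_tweets):
    """Concatenate tweets sharing a key into one pseudo-document per key."""
    pools = defaultdict(list)
    for key, text in tagged_tweets:
        pools[key].append(text)
    return {key: " ".join(texts) for key, texts in pools.items()}

pseudo_docs = pool_tweets(tweets)
print(pseudo_docs["#soccer"])
# The resulting longer pseudo-documents are what you would then feed to LDA,
# mitigating the data-sparsity problem of per-tweet documents.
```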

Sentiment Analysis Dictionaries

I was wondering if anybody knew where I could obtain dictionaries of positive and negative words. I'm looking into sentiment analysis and this is a crucial part of it.
The Sentiment Lexicon at the University of Pittsburgh might be what you are after. It's a lexicon of about 8,000 words with positive/neutral/negative sentiment. It's described in more detail in this paper and released under the GPL.
Sentiment Analysis (Opinion Mining) lexicons
MPQA Subjectivity Lexicon
Bing Liu and Minqing Hu Sentiment Lexicon
SentiWordNet (Included in NLTK)
VADER Sentiment Lexicon
SenticNet
LIWC (not free)
Harvard Inquirer
ANEW
Sources:
Keenformatics - Sentiment Analysis lexicons and datasets (my blog)
Hutto, C. J., and Eric Gilbert. "Vader: A parsimonious rule-based model for sentiment analysis of social media text." Eighth International AAAI Conference on Weblogs and Social Media. 2014.
Sentiment Symposium Tutorial by Christopher Potts
Personal experience
Arriving a bit late, I'll just note that dictionaries make only a limited contribution to sentiment analysis.
Some sentiment-bearing sentences contain no "sentiment" word at all, e.g. "read the book", which could be positive in a book review but negative in a movie review.
Similarly, the sentiment word "unpredictable" could be positive in the context of a thriller but negative when describing the brake system of a Toyota.
And there are many more such cases...
Professor Bing Liu provides an English lexicon of about 6,800 words, which you can download from this link:
Opinion Mining, Sentiment Analysis, and Opinion Spam Detection
This paper from 2002 describes an algorithm for deriving such a dictionary from text samples automatically, using only two words as a seed set.
AFINN you can find here, and you can also extend it dynamically: whenever an unknown positive word appears, add it with a score of +1. For example, if "banana" is a new positive word and appears twice, its score becomes +2.
The more articles and data you crawl, the stronger your dictionary becomes!
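The dynamic scoring idea above can be sketched in a few lines. This is a toy illustration of the incremental update, not AFINN's actual construction method; the seed words and scores are hypothetical:

```python
# AFINN-style integer scores; start from a small hypothetical seed lexicon.
lexicon = {"good": 2, "bad": -2}

def update_lexicon(word, label, lexicon):
    """Bump a word's running score each time it appears in labelled text."""
    delta = 1 if label == "positive" else -1
    lexicon[word] = lexicon.get(word, 0) + delta

# "banana" appears twice in positively labelled text -> score climbs to +2.
update_lexicon("banana", "positive", lexicon)
update_lexicon("banana", "positive", lexicon)
print(lexicon["banana"])  # 2
```

The more labelled text you process this way, the more words accumulate non-zero scores, which is the sense in which crawling more data "strengthens" the dictionary.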
The Harvard-IV dictionary directory http://www.wjh.harvard.edu/~inquirer/homecat.htm has at least two sets of ready-to-use dictionaries for positive/negative orientation.
You can use the VADER sentiment lexicon:
from nltk.sentiment.vader import SentimentIntensityAnalyzer

sentence = 'Apple is good for health'
sid = SentimentIntensityAnalyzer()
ss = sid.polarity_scores(sentence)
print(ss)
It will give you the polarity of the sentence.
output:
{'compound': 0.4404, 'neu': 0.58, 'pos': 0.42, 'neg': 0.0}
SentiWords gives 155,000 words and their polarity, that is, a score between -1 and 1 from very negative through to very positive. The lexicon is discussed here.