Sentiment Analysis Dictionaries

I was wondering if anybody knew where I could obtain dictionaries of positive and negative words. I'm looking into sentiment analysis and this is a crucial part of it.

The Sentiment Lexicon from the University of Pittsburgh might be what you are after. It's a lexicon of about 8,000 words tagged with positive/neutral/negative sentiment. It's described in more detail in this paper and is released under the GPL.

Sentiment Analysis (Opinion Mining) lexicons
MPQA Subjectivity Lexicon
Bing Liu and Minqing Hu Sentiment Lexicon
SentiWordNet (Included in NLTK)
VADER Sentiment Lexicon
SenticNet
LIWC (not free)
Harvard Inquirer
ANEW
Sources:
Keenformatics - Sentiment Analysis lexicons and datasets (my blog)
Hutto, C. J., and Eric Gilbert. "Vader: A parsimonious rule-based model for sentiment analysis of social media text." Eighth International AAAI Conference on Weblogs and Social Media. 2014.
Sentiment Symposium Tutorial by Christopher Potts
Personal experience

Arriving a bit late, I'll just note that dictionaries make only a limited contribution to sentiment analysis.
Some sentiment-bearing sentences do not contain any "sentiment" word, e.g. "read the book", which could be positive in a book review but negative in a movie review.
Similarly, the sentiment word "unpredictable" could be positive in the context of a thriller but negative when describing the braking system of a Toyota.
And there are many more such cases...

Professor Bing Liu provides an English lexicon of about 6,800 words, which you can download from this link:
Opinion Mining, Sentiment Analysis, and Opinion Spam Detection

This paper from 2002 describes an algorithm for deriving such a dictionary from text samples automatically, using only two words as a seed set.
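In case it helps, the core of that seed-based approach is scoring a candidate word by how much more strongly it co-occurs with a positive seed than with a negative seed, via pointwise mutual information. A minimal sketch in R, assuming hypothetical helpers cooc() and count() that return co-occurrence and word counts from your corpus (the classic "excellent"/"poor" seed pair is shown for illustration):
# Hedged sketch of PMI-based semantic orientation from two seed words.
# cooc(), count(), and n_total (corpus size) are hypothetical stand-ins
# for whatever corpus statistics you have available.
semantic_orientation <- function(word, n_total, cooc, count) {
  pmi <- function(a, b) log2((cooc(a, b) * n_total) / (count(a) * count(b)))
  pmi(word, "excellent") - pmi(word, "poor")  # > 0 suggests positive polarity
}
Words scored this way can then be thresholded into positive and negative dictionary entries.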

You can find AFINN here, and you can also extend it dynamically: whenever an unknown positive word appears, add it with a score of +1. For example, if "banana" is a new positive word and it appears twice, its score becomes +2.
The more articles and data you crawl, the stronger your dictionary becomes!
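A minimal sketch of that dynamic updating idea in R (hedged; the update_lexicon() helper and the seed entries are illustrative, not part of AFINN itself):
# Grow an AFINN-style named vector of word scores: each occurrence of a
# word in positively labelled text adds +1 (use delta = -1 for negative text).
lexicon <- c(good = 3, bad = -3)  # a couple of AFINN-style seed entries
update_lexicon <- function(lex, words, delta = 1) {
  for (w in words) {
    lex[w] <- if (w %in% names(lex)) lex[w] + delta else delta
  }
  lex
}
lexicon <- update_lexicon(lexicon, c("banana", "banana"))  # "banana" seen twice
lexicon["banana"]  # 2, matching the example above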

The Harvard-IV dictionary directory http://www.wjh.harvard.edu/~inquirer/homecat.htm has at least two sets of ready-to-use dictionaries for positive/negative orientation.

You can use the VADER sentiment lexicon:
import nltk
nltk.download('vader_lexicon')  # download the lexicon once before first use
from nltk.sentiment.vader import SentimentIntensityAnalyzer

sentence = 'APPle is good for health'
sid = SentimentIntensityAnalyzer()
ss = sid.polarity_scores(sentence)
print(ss)
It will give you the polarity of the sentence.
Output:
{'compound': 0.4404, 'neu': 0.58, 'pos': 0.42, 'neg': 0.0}
The compound value is a normalized aggregate score in [-1, 1]; a common convention is to treat compound >= 0.05 as positive and compound <= -0.05 as negative.

SentiWords gives 155,000 words and their polarity, that is, a score between -1 (very negative) and 1 (very positive). The lexicon is discussed here.

Related

Format for adding references in R package DESCRIPTION?

I just submitted an R package to CRAN. I got this comment back:
If there are references describing the methods in your package, please add these in the description field of your DESCRIPTION file in the form
authors (year) <doi:...>
authors (year) <arXiv:...>
authors (year, ISBN:...)
or if those are not available: <https:...>
with no space after 'doi:', 'arXiv:', 'https:' and angle brackets for auto-linking.
(If you want to add a title as well please put it in quotes: "Title")
But I thought that the description field is limited to one paragraph, which means you can't include additional text besides the single paragraph in that field. So I was unsure what the exact formatting is for including references in the description field. My guess is below, but this format returns a note stating that the description is malformed.
Description: Text describing the package, blah blah blah.
More text goes here, etc etc etc.
Foo, B., and J. Baz. (1999) <doi:23232/xxxxx.00>
Smith, C. (2021) <https://something.etc/foo>
Note returned when running R CMD check:
checking DESCRIPTION meta-information ... NOTE
Malformed Description field: should contain one or more complete sentences.
This question is related but does not have a satisfactory answer so I am asking again.
I started with Julia Silge's blog post here:
cran <- tools::CRAN_package_db()
desc_with_doi <- grep("doi:", cran$Description, value = TRUE)
Here are some examples:
Given a protein multiple sequence alignment, it is daunting task to assess the effects of substitutions along sequence length. 'aaSEA' package is intended to help researchers to rapidly analyse property changes caused by single, multiple and correlated amino acid substitutions in proteins. Methods for identification of co-evolving positions from multiple sequence alignment are as described in : Pelé et al., (2017) <doi:10.4172/2379-1764.1000250>.
or
Estimate parameters of accumulated damage (load duration) models based on failure time data under a Bayesian framework, using Approximate Bayesian Computation (ABC). Assess long-term reliability under stochastic load profiles. Yang, Zidek, and Wong (2019) <doi:10.1080/00401706.2018.1512900>.
Using a similar filter for "https" shows (unsurprisingly) a lot more generic website links than scholarly references, but e.g.:
Designed for studies where animals tagged with acoustic tags are expected to move through receiver arrays. This package combines the advantages of automatic sorting and checking of animal movements with the possibility for user intervention on tags that deviate from expected behaviour. The three analysis functions (explore(), migration() and residency()) allow the users to analyse their data in a systematic way, making it easy to compare results from different studies. CJS calculations are based on Perry et al. (2012) <https://www.researchgate.net/publication/256443823_Using_mark-recapture_models_to_estimate_survival_from_telemetry_data>.
ArXiv (there are only 24 packages with such links out of 17962 total at present):
Provides functions for model fitting and selection of generalised hypergeometric ensembles of random graphs (gHypEG). To learn how to use it, check the vignettes for a quick tutorial. Please reference its use as Casiraghi, G., Nanumyan, V. (2019) doi:10.5281/zenodo.2555300 together with those relevant references from the one listed below. The package is based on the research developed at the Chair of Systems Design, ETH Zurich. Casiraghi, G., Nanumyan, V., Scholtes, I., Schweitzer, F. (2016) <arXiv:1607.02441>. Casiraghi, G., Nanumyan, V., Scholtes, I., Schweitzer, F. (2017) <doi:10.1007/978-3-319-67256-4_11>. Casiraghi, G., (2017) <arxiv:1702.02048> Casiraghi, G., Nanumyan, V. (2018) <arXiv:1810.06495>. Brandenberger, L., Casiraghi, G., Nanumyan, V., Schweitzer, F. (2019) <doi:10.1145/3341161.3342926> Casiraghi, G. (2019) <doi:10.1007/s41109-019-0241-1>.
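Putting those CRAN examples together, a well-formed field appears to keep everything in one paragraph and phrase each reference as part of a complete sentence (a hedged guess using the placeholder references from the question):
Description: Text describing the package, blah blah blah.
    More text goes here, etc etc etc. Methods are described in
    Foo and Baz (1999) <doi:23232/xxxxx.00> and in
    Smith (2021) <https://something.etc/foo>.
Note that each sentence ends with a period; the "complete sentences" NOTE above seems to be triggered by bare reference lines that do not form sentences.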

GloVe word embeddings containing sentiment?

I've been researching sentiment analysis with word embeddings. I have read papers stating that word embeddings ignore the sentiment information of the words in the text. One paper states that among the top 10 semantically similar words, around 30 percent have opposite polarity, e.g. happy - sad.
So, I computed word embeddings on my dataset (Amazon reviews) with the GloVe algorithm in R. Then I looked at the most similar words with cosine similarity and found that actually every word is sentimentally similar (e.g. beautiful - lovely - gorgeous - pretty - nice - love). I was therefore wondering how this is possible, since I expected the opposite from reading several papers. What could be the reason for my findings?
Two of the many papers I read:
Yu, L. C., Wang, J., Lai, K. R. & Zhang, X. (2017). Refining Word Embeddings Using
Intensity Scores for Sentiment Analysis. IEEE/ACM Transactions on Audio, Speech,
and Language Processing, 26(3), 671-681.
Tang, D., Wei, F., Yang, N., Zhou, M., Liu, T. & Qin, B. (2014). Learning Sentiment-Specific Word Embedding for Twitter Sentiment Classification. Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, 1: Long Papers, 1555-1565.
Assumption:
When you say you computed GloVe embeddings, you mean you used pretrained GloVe.
Static word embeddings do not carry sentiment information about the input text at runtime.
The statement above means that word embedding algorithms (most of them, to my knowledge, such as GloVe and Word2Vec) are not designed or formulated to capture the sentiment of a word. In general, word embedding algorithms map words that are similar in meaning (based on statistical nearness and co-occurrence) close to each other. For example, "woman" and "girl" will lie near each other in the n-dimensional space of the embeddings. But that does not mean any sentiment-related information is captured there.
Hence,
the words (beautiful - lovely - gorgeous - pretty - nice - love) being sentimentally similar to a given word is not a coincidence. We have to look at these words in terms of their meaning: all of them are similar in meaning, but we cannot say that they necessarily carry the same sentiment. These words lie near each other in GloVe's vector space because the model was trained on a corpus that carried sufficient information to group similar words.
Also, please study the similarity scores; that will make it clearer (see the sketch after this answer).
Among the top 10 words that are semantically similar, around 30 percent have opposite polarity.
Here, semantic similarity is less dependent on context, whereas sentiment is strongly dependent on context. One word alone cannot define sentiment.
Example:
Jack: "Your dress is beautiful, Gloria!"
Gloria: "Beautiful my foot!"
In the two sentences, "beautiful" carries completely different sentiment, yet both of them use the same embedding for the word. Now replace "beautiful" with (lovely - gorgeous - pretty - nice): the semantic claim holds true as described in one of the papers, and since sentiment is not captured by word embeddings, the other paper also stands true.
The confusion may have occurred in taking two or more words with similar meanings to be sentimentally similar. Sentiment information can be gathered at the sentence or document level, not at the word level.
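To make the "study the similarity scores" suggestion concrete: if the embeddings were trained with the text2vec implementation of GloVe in R (an assumption; adapt to your setup), you can list a word's nearest neighbours together with their cosine scores:
library(text2vec)
# wv is assumed to be your word-vector matrix with one row per word, e.g.
# wv <- glove$fit_transform(tcm) + t(glove$components) after fitting GlobalVectors
sims <- sim2(wv, wv["beautiful", , drop = FALSE], method = "cosine")
head(sort(sims[, 1], decreasing = TRUE), 10)  # top-10 neighbours and their scores
Inspecting the actual scores, rather than just the ranked list, shows how tightly the neighbours cluster, which reflects meaning, not sentiment.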

How to use the SentimentAnalysis package in R? What is the meaning of GI, HE, and LM?

Can anyone explain the meaning of GI, HE, LM, QDAP in the SentimentAnalysis package in R? What is the best way to identify the polarity of a sentence using this package given that it gives multiple answers if we look at each of these above-mentioned columns?
It is better to have somebody with domain knowledge deepen your understanding of NLP (Natural Language Processing) and sentiment analysis, along with all their terminology.
You can also read the docs for the SentimentAnalysis package at https://cran.r-project.org/web/packages/SentimentAnalysis/SentimentAnalysis.pdf. GI, for example, is the General Inquirer, a Harvard-IV dictionary. It is one of many dictionaries available in the SentimentAnalysis package. Usually, in the realm of sentiment analysis, these dictionaries provide a list of words that are considered positive or negative. Some dictionaries even provide scores rating how positive or negative a word is. Please read through this doc first so you can ask a more specific question.
One way to identify the polarity of a sentence is to collect all sentiment-bearing words and average out their sentiments (for example, if one sentence has 3 positive and 2 negative words, we can regard the sentence as having overall positive sentiment). This is just a basic method for analyzing sentiment within a sentence. Natural language is very complex; of course, you have to consider context when analyzing sentiment (e.g. "I am not sad" has positive sentiment due to the negation term "not", despite containing a word that is considered negative, i.e. "sad").
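As a small illustration (a hedged sketch; the example sentence is made up), analyzeSentiment() returns one score column per dictionary, and convertToDirection() maps the continuous scores to positive/neutral/negative:
library(SentimentAnalysis)
sentiment <- analyzeSentiment("I am not sad, this is a great result")
sentiment$SentimentGI                        # score under the GI (General Inquirer) dictionary
sentiment$SentimentLM                        # score under the LM (Loughran-McDonald) dictionary
convertToDirection(sentiment$SentimentQDAP)  # positive / neutral / negative
When the dictionaries disagree, a common pragmatic choice is to use the dictionary built for your domain (e.g. LM for financial text) rather than averaging the columns.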

How to determine negative trigrams

I am trying to search through a load of medical reports. I want to determine the type of language used to state the absence of a finding, e.g. "There is no blabla" or "Neither blabla nor bleble is seen here". There are many variations.
My original idea was to divide the text into trigrams, perform sentiment analysis on the trigrams, then look at the negative trigrams and manually select the ones that denote an absence of something. Then I wanted to design some regular expressions around these absence trigrams.
I get the feeling, however, that I am not really looking for sentiment analysis but more of a negation search. I could, I suppose, just look for all sentences with 'not', 'neither', 'nor' or 'no', but I'm sure I'll fall into some kind of linguistic trap.
Does anyone have a comment on my sentiment approach? And if it is sound, can I have some guidance on sentiment analysis of trigrams (or bigrams, I suppose), as all the R tutorials I have found demonstrate unigram sentiment analysis?
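Not a full answer, but a sketch of the trigram-extraction step in R with tidytext (hedged; the reports data frame with a text column is an assumption). It pulls out trigrams that start with a negation cue, which you could then review manually before writing your regular expressions:
library(dplyr)
library(tidyr)
library(tidytext)
negations <- c("no", "not", "neither", "nor", "without")
reports %>%
  unnest_tokens(trigram, text, token = "ngrams", n = 3) %>%
  separate(trigram, c("w1", "w2", "w3"), sep = " ", remove = FALSE) %>%
  filter(w1 %in% negations) %>%   # keep trigrams opening with a negation cue
  count(trigram, sort = TRUE)     # most frequent candidate "absence" patterns
This sidesteps sentiment scoring entirely and treats the task as the negation search described above.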

How to apply topic modeling?

I have 10,000 tweets on 5 topics. Assume I know the ground truth (the actual topic of each tweet) and I group the tweets into 5 documents, where each document contains the tweets for a particular topic. Then I apply LDA to the 5 documents with the number of topics set to 5. In that case I get good topic words.
Now, if I don't know the ground truth of the tweets, how do I construct input documents in a way that LDA will still give me good topic words describing the 5 topics?
What if I create input documents by randomly selecting a sample of tweets? What if this ends up with similar topic mixtures across input documents? Should LDA still find good topic words, as in the case of the first paragraph?
If I understand correctly, your problem is about topic modeling on short texts (Tweets). One approach is to combine Tweets into long pseudo-documents before training LDA. Another one is to assume that there is only one topic per document/Tweet.
In the case that you don't know the ground truth labels of Tweets, you might want to try the one-topic-per-document topic model (i.e. mixture-of-unigrams). The model details are described in:
Jianhua Yin and Jianyong Wang. 2014. A Dirichlet Multinomial Mixture Model-based Approach for Short Text Clustering. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 233–242.
You can find my Java implementations of this model and LDA at http://jldadmm.sourceforge.net/. Assuming that you know the ground-truth labels, you can also use my implementations to compare these topic models on a document clustering task.
If you'd like to evaluate topic coherence (i.e. evaluate how good the topic words are), I would suggest you have a look at the Palmetto toolkit (https://github.com/AKSW/Palmetto), which implements the topic coherence calculations.
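For the pseudo-document idea in the first paragraph of this answer, here is a hedged R sketch (the tweets data frame and its pool_key column, e.g. hashtag or author, are assumptions; the pooling key stands in for the unknown ground truth):
library(tm)
library(topicmodels)
# Concatenate tweets sharing a pooling key into one long pseudo-document each
pooled <- tapply(tweets$text, tweets$pool_key, paste, collapse = " ")
dtm <- DocumentTermMatrix(VCorpus(VectorSource(pooled)))
lda <- LDA(dtm, k = 5, control = list(seed = 1))
terms(lda, 10)  # top 10 words per topic
Random pooling tends to blur the topic mixtures, which is why keys correlated with topic (hashtags, authors, time windows) usually recover better topic words.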
