I am trying to search through a load of medical reports. I want to determine the type of language used to express the absence of a finding, e.g. "There is no blabla" or "Neither blabla nor bleble is seen here". There are many variations.
My original idea was to divide the text into trigrams, perform sentiment analysis on the trigrams, and then look at the negative trigrams and manually select the ones that denote an absence of something. Then I wanted to design some regular expressions around these absence trigrams.
I get the feeling, however, that I am not really looking for sentiment analysis but more of a negation search. I could, I suppose, just look for all sentences containing 'not', 'neither', 'nor' or 'no', but I'm sure I'll fall into some kind of linguistic trap.
Does anyone have a comment on my sentiment approach? And if it is sound, could I have some guidance on sentiment analysis over trigrams (or bigrams, I suppose)? All the R tutorials I have found demonstrate unigram sentiment analysis.
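To make the regex idea concrete, here is a rough first-pass sketch (the example sentences and the cue list are made up; a real cue list would need many more terms, e.g. "without", "free of", "ruled out", plus some handling of negation scope):

library(stringr)

# Made-up example report sentences
sents <- c("There is no pleural effusion.",
           "Neither consolidation nor edema is seen here.",
           "A small nodule is present.")

# First-pass negation cues; deliberately incomplete
neg_pattern <- "\\b(no|not|neither|nor|without|absence of)\\b"
str_detect(str_to_lower(sents), neg_pattern)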
Related
I've been researching sentiment analysis with word embeddings. I have read papers stating that word embeddings ignore the sentiment information of the words in the text. One paper states that among the top 10 semantically most similar words, around 30 percent have opposite polarity (e.g. happy vs. sad).
So, I computed word embeddings on my dataset (Amazon reviews) with the GloVe algorithm in R. Then I looked at the most similar words using cosine similarity and found that, in fact, every word in the neighbourhood is sentimentally similar (e.g. beautiful - lovely - gorgeous - pretty - nice - love). I was therefore wondering how this is possible, since I expected the opposite from reading several papers. What could be the reason for my findings?
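For concreteness, the pipeline looked roughly like the following sketch with the text2vec package (the parameters are illustrative, and the GlobalVectors constructor arguments differ slightly between text2vec versions):

library(text2vec)

# `reviews` is assumed to be a character vector of Amazon review texts
tokens <- word_tokenizer(tolower(reviews))
it <- itoken(tokens)
vocab <- prune_vocabulary(create_vocabulary(it), term_count_min = 5)
vectorizer <- vocab_vectorizer(vocab)
tcm <- create_tcm(it, vectorizer, skip_grams_window = 5)

glove <- GlobalVectors$new(rank = 100, x_max = 10)
word_vectors <- glove$fit_transform(tcm, n_iter = 20) + t(glove$components)

# Nearest neighbours of "beautiful" by cosine similarity
sims <- sim2(word_vectors, word_vectors["beautiful", , drop = FALSE], method = "cosine")
head(sort(sims[, 1], decreasing = TRUE), 10)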
Two of the many papers I read:
Yu, L. C., Wang, J., Lai, K. R. & Zhang, X. (2017). Refining Word Embeddings Using Intensity Scores for Sentiment Analysis. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 26(3), 671-681.
Tang, D., Wei, F., Yang, N., Zhou, M., Liu, T. & Qin, B. (2014). Learning Sentiment-Specific Word Embedding for Twitter Sentiment Classification. Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, 1: Long Papers, 1555-1565.
Assumption:
When you say you computed GloVe embeddings, you mean you used pretrained GloVe vectors.
Static word embeddings do not carry sentiment information about the input text at runtime.
The statement above means that word embedding algorithms (most of them, to my knowledge, such as GloVe and Word2Vec) are not designed or formulated to capture the sentiment of a word. In general, word embedding algorithms map words that are similar in meaning close to each other (based on statistical nearness and co-occurrences). For example, "woman" and "girl" will lie near each other in the n-dimensional space of the embeddings. But that does not mean that any sentiment-related information is captured there.
Hence, the words (beautiful - lovely - gorgeous - pretty - nice - love) being sentimentally similar to a given word is not a coincidence. We have to look at these words in terms of their meaning: they are all similar in meaning, but we cannot say that they necessarily carry the same sentiment. These words lie near each other in GloVe's vector space because the model was trained on a corpus that carried enough information for similar words to be grouped together.
Also, please study the actual similarity scores; that will make it clearer.
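For example, assuming word_vectors is your embedding matrix with words as row names, you can inspect the scores directly; nothing in this computation involves sentiment:

# Plain cosine similarity between two word vectors
cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

cosine(word_vectors["beautiful", ], word_vectors["lovely", ])
cosine(word_vectors["happy", ], word_vectors["sad", ])   # antonyms often score high too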
"Among the top 10 words that are semantically similar, around 30 percent of words have opposite polarity"
Here, semantic similarity is less related to context, whereas sentiment is more related to context. A single word cannot define sentiment.
Example:
Jack: "Your dress is beautiful, Gloria"!
Gloria: "Beautiful my foot!"
In the two sentences, "beautiful" carries completely different sentiment, yet both use the same embedding for the word "beautiful". Now replace "beautiful" with (lovely - gorgeous - pretty - nice): the semantic similarity still holds, as described in one of the papers. And since sentiment is not captured by word embeddings, the other paper also stands true.
The confusion may have arisen from assuming that two or more words with similar meanings are also sentimentally similar. Sentiment information can be gathered at the sentence or document level, not at the word level.
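If you want sentence-level sentiment in R that accounts for some context, packages such as sentimentr (a separate package, mentioned here only as one option) handle valence shifters like "not" around polarity words:

library(sentimentr)

# Returns one sentiment score per sentence, adjusting for negators and amplifiers
sentiment(c("Your dress is beautiful, Gloria!", "I am not sad", "I am sad"))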
Using the tm package in R, how can I score a document in terms of its uniqueness? I want to somehow separate documents containing very unique words from documents that contain often-used words.
I know how to find the most and least frequently used words with e.g. findFreqTerms, but how do I score a document with regard to its uniqueness?
I am struggling to come up with a good solution.
A good starting point for assessing which words are used only in some documents is the so-called tf-idf weighting (see the tidytext package vignette). This assigns a score to each (word, document) combination, so once you have that calculated you can summarize along the 'document' margin, maybe literally just colMeans, to get a sense of how many relatively unique terms each document uses.
To separate documents, a weighting scheme like tf-idf may work better than just finding the rarest overall tokens: a rare word used once in most documents is treated quite differently from a word used several times in just a few documents.
The R packages tm, tidytext, and quanteda all have functions to calculate it.
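A minimal sketch with tidytext (the toy data frame and column names are just for illustration):

library(dplyr)
library(tidytext)

docs <- tibble::tibble(
  doc_id = c("doc1", "doc2"),
  text   = c("zeolite assay of a rare isotope", "the cat sat on the mat with the cat")
)

docs %>%
  unnest_tokens(word, text) %>%
  count(doc_id, word) %>%
  bind_tf_idf(word, doc_id, n) %>%
  group_by(doc_id) %>%
  summarise(uniqueness = mean(tf_idf))   # higher = more distinctive vocabulary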
Can anyone explain the meaning of GI, HE, LM, and QDAP in the SentimentAnalysis package in R? What is the best way to identify the polarity of a sentence using this package, given that it gives multiple answers depending on which of these columns you look at?
It is better to have somebody with domain knowledge deepen your understanding of NLP (Natural Language Processing) and sentiment analysis, along with all of their terminology.
You can also read the docs for the SentimentAnalysis package at https://cran.r-project.org/web/packages/SentimentAnalysis/SentimentAnalysis.pdf. GI, for example, is the General Inquirer, a Harvard-IV dictionary; HE is Henry's finance-specific dictionary, LM is the Loughran-McDonald financial dictionary, and QDAP is the dictionary from the qdap package. These are some of the many dictionaries available in the SentimentAnalysis package. In the realm of sentiment analysis, such dictionaries provide lists of words considered positive or negative, and some also provide scores rating how positive or negative a word is. Please read this documentation first so you can ask a more specific question.
One way to identify the polarity of a sentence is to collect all sentiment-bearing words and average their sentiments (for example, if one sentence contains 3 positive and 2 negative words, we can regard the sentence as having an overall positive sentiment). This is just a basic method of analyzing sentiment within a sentence. Natural language is very complex, so of course you have to consider the context when analyzing sentiment (e.g. "I am not sad" has positive sentiment due to the negation term "not", despite containing a word that is considered negative, i.e. "sad").
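A small example of running the package on a few sentences (the column names follow the documentation linked above):

library(SentimentAnalysis)

sentences <- c("I am not sad", "This is a terrible product", "What a lovely day")
scores <- analyzeSentiment(sentences)

# One sentiment column per dictionary
scores[, c("SentimentGI", "SentimentHE", "SentimentLM", "SentimentQDAP")]

# Map a chosen score to positive / neutral / negative
convertToDirection(scores$SentimentQDAP)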
My question concerns the proper training of the Word2Vec model for a unique and really specific use. See Word2Vec details here.
I am working on identifying noun-adjective relationships within the word embeddings.
(E.g. we have 'nice car' in a sentence of the data-set. Given the word embeddings of the corpus and the nouns and adjectives all labeled, I am trying to design a technique to find the proper vector that connects 'nice' with 'car'.)
Of course I am not trying to connect only that pair of words; the technique should work for all such relationships. A supervised approach is being taken at the moment, and then I will try to work towards designing an unsupervised method.
Now that you understand what I am trying to do, I will explain the problem. I obviously know that word2vec needs to be trained on large amounts of data, to learn the proper embeddings as accurately as possible, but I am afraid to give it more data than the data-set with labelled sentences (500-700).
I am afraid that if I give it more data to train on (e.g. the latest Wikipedia dump), it will learn better vectors overall, but the extra data will influence the positioning of my words, and then the word relationships will be biased by the extra training data (e.g. if 'nice Apple' also appears in the extra training data, the positioning of the word 'nice' could be compromised).
Hopefully this makes sense and I am not making bad assumptions, but I am just in the dilemma of having bad vectors because of not enough training data, or having good vectors, but compromised vector positioning in the word embeddings.
What would be the proper way to train the model: on as much training data as possible (billions of words), or on just the labelled data-set (500-700 sentences)?
Thank you kindly for your time, and let me know if anything that I explained does not make sense.
As always in similar situations it is best to check...
I wonder whether you have tested the difference between training on the labelled dataset and training on the Wikipedia dataset. Do the issues you are afraid of really appear?
I would just run an experiment and check if the vectors in both cases are indeed different (statistically speaking).
I suspect that you may introduce some noise with the larger corpus, but more data may be beneficial with respect to vocabulary coverage (larger corpus, more universal). It all depends on your expected use case. It is likely to be a trade-off between high precision with very low recall vs. so-so precision with relatively good recall.
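Something along these lines, as a sketch with the word2vec R package (labelled_sentences and wiki_sentences are hypothetical character vectors, and the hyperparameters are placeholders):

library(word2vec)

m_small <- word2vec(x = labelled_sentences, type = "skip-gram", dim = 100, iter = 20)
m_large <- word2vec(x = c(labelled_sentences, wiki_sentences),
                    type = "skip-gram", dim = 100, iter = 5)

# Compare the neighbourhood of a word of interest under both models
nn_small <- predict(m_small, "nice", type = "nearest", top_n = 10)[["nice"]]
nn_large <- predict(m_large, "nice", type = "nearest", top_n = 10)[["nice"]]
length(intersect(nn_small$term2, nn_large$term2))   # neighbour overlap out of 10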
I am trying to figure out what improvements in the text analysis of long documents can be obtained by using Syntaxnet, rather than something "dumb" like word counts, sentence length, etc.
The goal would be to get more accurate linguistic measures (such as "tone" or "sophistication"), for quantifying attributes of long(er) documents like newspaper articles or letters/memos.
What I am trying to figure out is what to do with the Syntaxnet output once the POS tagging is done. What do people typically do to process Syntaxnet output further?
Ideally I am looking for an example workflow that transforms Syntaxnet output into something quantitative that can be used in statistical analysis.
Also, can someone point me to sources that show how the inferences drawn from a "smart" analysis with Syntaxnet compare to those that can be attained by word counts, sentence length, etc.?
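As a rough illustration of the kind of workflow I have in mind, here is a sketch that turns CoNLL-style Syntaxnet output into simple document-level features (the file name, column positions and tag names are assumptions and should be checked against the actual output):

# One token per row, blank lines between sentences
conll <- read.delim("syntaxnet_output.conll", header = FALSE, sep = "\t",
                    quote = "", blank.lines.skip = TRUE, stringsAsFactors = FALSE)
names(conll)[c(2, 4, 8)] <- c("token", "pos", "dep_rel")   # assumed CoNLL columns

# Document-level features usable in later statistical analysis
features <- c(
  adj_ratio   = mean(conll$pos == "ADJ"),                           # share of adjectives
  noun_ratio  = mean(conll$pos == "NOUN"),
  passiveness = mean(conll$dep_rel %in% c("nsubjpass", "auxpass"))  # crude proxy
)
features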