Package `sentimentr`: how to remove emoticons and stopwords before `sentiment_by` - r

Here is a basic sentiment example. The text data is split into sentences via the get_sentences function. With sentiment_by we approximate the sentiment (polarity) of the text for each element of a list (mytext in this example). For example:
library(sentimentr)
mytext <- c(
  'do you like it? But I hate really bad dogs',
  'I am the best friend.',
  'Do you really like it? I\'m not a fan'
)
mytext <- get_sentences(mytext)
sentiment_by(mytext)
I obtained the following result:
   element_id word_count       sd ave_sentiment
1:          1         10 1.497465    -0.8088680
2:          2          5       NA     0.5813777
3:          3          9 0.284605     0.2196345
Before applying the sentiment function, I would like to remove stop words, numbers, and emoticons from mytext. I figured I could use, e.g.:
library("tm")
tm_map(mytext, removeNumbers)
tm_map(mytext, removeWords, stopwords())
but I obtained:
Error in UseMethod("tm_map", x) :
no applicable method for 'tm_map' applied to an object of class "c('get_sentences',
'get_sentences_character', 'list')"
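One possible approach (a sketch of my own, not an official sentimentr recipe): tm's removeNumbers, removeWords, and stripWhitespace also have character-vector methods, so you can clean the raw character vector first and only then call get_sentences. The emoticon regex below is only illustrative, and note that removing stopwords such as "not" also removes words sentimentr uses as valence shifters, which will change the scores:
library(sentimentr)
library(tm)
mytext <- c(
  'do you like it? But I hate really bad dogs',
  'I am the best friend.',
  'Do you really like it? I\'m not a fan'
)
cleaned <- removeNumbers(mytext)                     # drop digits
cleaned <- removeWords(cleaned, stopwords("en"))     # drop English stopwords
cleaned <- gsub("[:;=8][-~]?[)(DPpoO]", "", cleaned) # crude ASCII-emoticon pass (illustrative only)
cleaned <- stripWhitespace(cleaned)
sentiment_by(get_sentences(cleaned))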

Related

Remove a verb as a stopword

There are some words which are sometimes used as a verb and sometimes as another part of speech.
Example
A sentence where the word is used as a verb:
I blame myself for what happened
And a sentence where the word is used as a noun:
For what happened the blame is yours
The word I want to detect is known to me; in the example above it is "blame". I would like to detect it and remove it as a stopword only when it is used as a verb.
Is there an easy way to do this?
You can install TreeTagger and then use the koRpus package to call TreeTagger from R. Install it in a location such as C:\Treetagger.
I will first show how TreeTagger works so you understand what's going on in the actual solution further down in this answer:
Intro: TreeTagger
library(koRpus)
your_sentences <- c("I blame myself for what happened",
                    "For what happened the blame is yours")
text.tagged <- treetag(file=your_sentences[1],
                       format="obj", treetagger="manual", lang="en",
                       TT.options=list(path="C:\\Treetagger", preset="en"))
text.tagged@TT.res[, 1:2]
# token tag
#1 I PP
#2 blame VVP
#3 myself PP
#4 for IN
#5 what WP
#6 happened VVD
The sentences have been analysed now and the "only thing left" is to remove those occurrences of "blame" that are a verb.
Solution
I'll do this sentence by sentence, creating a function that first tags the sentence, then checks for "bad words" like "blame" that are also a verb, and finally removes them from the sentence:
remove_words <- function(sentence, badword="blame"){
  tagged.text <- treetag(file=sentence, format="obj", treetagger="manual", lang="en",
                         TT.options=list(path="C:\\Treetagger", preset="en"))
  # Check for the bad word AND a verb tag:
  cond1 <- (tagged.text@TT.res$token == badword)
  cond2 <- (substring(tagged.text@TT.res$tag, 1, 1) == "V")
  redflag <- which(cond1 & cond2)
  # If there is no such case, return the sentence as is; otherwise remove that word:
  if(length(redflag) == 0) return(sentence)
  else{
    splitsent <- strsplit(sentence, " ")[[1]]
    splitsent <- splitsent[-redflag]
    return(paste0(splitsent, collapse=" "))
  }
}
lapply(your_sentences, remove_words)
# [[1]]
# [1] "I myself for what happened"
# [[2]]
# [1] "For what happened the blame is yours"
In Python it can be done like this:
from nltk import pos_tag
s1 = "I blame myself for what happened"
pos_tag(s1.split())
It will give you the words with their tags.
You can do something like this in Python:
>>> import nltk
>>> from nltk import word_tokenize
>>> text = word_tokenize("And now for something completely different")
>>> nltk.pos_tag(text)
[('And', 'CC'), ('now', 'RB'), ('for', 'IN'), ('something', 'NN'),
('completely', 'RB'), ('different', 'JJ')]
Then add your filter to eliminate verbs, for instance.
Hope this is helpful!

Is there any method of reading a non-standard table built with organizing labels, without a loop?

As far as we know, parsing libraries like XML and xml2 can read a standard table on a web page perfectly. But there are some sorts of tables which have no table grid, only organizing labels such as “<span>” and “<div>”.
Now I am coping with a table like this.
The table is structured with “<span>” labels, and every 4 “<span>” labels organize one record. I’ve used a loop to solve this problem and succeeded, but I want to process it without a loop. I heard that the purrr library may help with this problem, but I don’t know how to use it in this situation.
I do my analysis by both “XML” and “xml2”:
Analysis with “XML” package
pg<-"http://www.irgrid.ac.cn/simple-search?fq=eperson.unique.id%3A311007%5C-000920"
library(XML)
tableNodes <- getNodeSet(htmlParse(pg), "//table[@class='miscTable2']")
itemlines <- xpathApply(tableNodes[[1]], "//tr[@class='itemLine']/td[@width='750']")
ispan <- xmlElementsByTagName(itemlines[[2]], "span")
title <- xmlValue(ispan$span)
isuedate <- xmlValue(ispan$span[1,2])
author <- xmlValue(ispan$span[3])
In this case, “XML” got a list of spans, but this list is very strange, even though its attributes met my expectations:
> attributes(ispan)
$names
[1] "span" "span" "span" "span"
It seems to have one row only, but four columns. However, it doesn’t: the 2nd to 4th “span” elements couldn’t be selected by column. The first “span” occupied 2 columns, and the other “span” elements could not be retrieved.
> val <- xmlValue(ispan$span[[1]])
> val
[1] "超高周疲劳裂纹萌生与初始扩展的特征尺度"
> isuedate <- xmlValue(ispan$span[[2]])
> isuedate
[1] " \r\n [科普文章]"
> isuedate <- xmlValue(ispan$span[[3]])
> isuedate
[1] NA
> author <- xmlValue(ispan$span[[4]])
> author
[1] NA
None of the selection methods used on a list works:
> title <- xmlValue(ispan$span[1,1])
Error in UseMethod("xmlValue") :
no applicable method for 'xmlValue' applied to an object of class "c('XMLInternalNodeList', 'XMLNodeList')"
title <- xmlValue(ispan$span[1,])
Error in UseMethod("xmlValue") :
no applicable method for 'xmlValue' applied to an object of class "c('XMLInternalNodeList', 'XMLNodeList')"
author <- xmlValue(ispan[1,3])
Error in ispan[1, 3] : incorrect number of dimensions
Analysis with “xml2”
Using “xml2”, the same “span” obstacle causes the same problem:
pg<-"http://www.irgrid.ac.cn/simple-search?fq=eperson.unique.id%3A311007%5C-000920"
library(xml2)
itemspantab <- xml_find_all(read_html(pg, encoding = "UTF-8"), "//table[@class='miscTable2']")
itemspan <- xml_child(itemspantab, "span")
It could not gather any of these “span” labels:
> itemspan
{xml_nodeset (1)}
[1] <NA>
If we go a step further to locate the “span” labels, it still gets nothing:
> itemspanl <- xml_find_all(itemspantab, '//tr[@class="itemLine"]/td/span')
> itemspan <- xml_child(itemspanl, "span")
> itemspan
{xml_nodeset (40)}
[1] <NA>
[2] <NA>
[3] <NA>
...
A suggestion was to use library(purrr) for this, but purrr processes data frames only, and the list prepared by xml2 could not be analyzed.
I would like to avoid the loop and get a result like the one below. Can we do it? I hope scholars who have experience with “XML” and “xml2” can give me some advice on how to cope with this non-standard table. Thanks a lot.
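One possible loop-free sketch (my own, untested against the live page, whose markup may have changed, and assuming every record really contributes exactly four spans): pull all the <span> nodes in document order with xml2, then use purrr to fold every four of them into one record. The column names here are only guesses based on the narrative above, not taken from the page:
library(xml2)
library(purrr)
pg  <- "http://www.irgrid.ac.cn/simple-search?fq=eperson.unique.id%3A311007%5C-000920"
doc <- read_html(pg, encoding = "UTF-8")
# all <span> nodes inside the item rows, in document order
spans <- xml_find_all(doc, "//table[@class='miscTable2']//tr[@class='itemLine']/td/span")
vals  <- xml_text(spans, trim = TRUE)
# fold the flat vector into consecutive groups of four, one group per record;
# the column names are assumptions for illustration
cols    <- c("title", "type", "issuedate", "author")
records <- map(split(vals, ceiling(seq_along(vals) / 4)),
               ~ set_names(as.list(.x), cols))
records_df <- do.call(rbind, lapply(records, as.data.frame, stringsAsFactors = FALSE))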

Give a new variable value 0 or 1 based on the distance between two words in another variable

I am new to R. In my dataset, I have a variable called Reason. I want to create a new column called Price, set to 1 if any of the following conditions is met, and 0 otherwise:
- the words "Price" and "High" are both mentioned in Reason and the distance between them is less than 6 words
- the words "Price" and "expensive" are both mentioned in Reason and the distance between them is less than 6 words
- the words "Price" and "increase" are both mentioned in Reason and the distance between them is less than 6 words
I found the following user-defined function to get the distance between 2 words:
distance <- function(string, term1, term2) {
  words <- strsplit(string, "\\s")[[1]]
  indices <- 1:length(words)
  names(indices) <- words
  abs(indices[term1] - indices[term2])
}
but I don't know how to apply it to the whole column to get the expected results. I tried the following code but it only gives me "logical(0)" as the result.
for (j in seq(Survey$Reason))
{
Survey$Price[[j]]<- distance(Survey$Reason[[j]], " price ", " high ") <=6
}
Any help is highly appreciated.
Thanks
Starting from your sample data:
survey <- structure(list(Reason = c("Their price are extremely high.", "Because my price was increased so much, I wouldn't want anyone else to have to deal with that.", "Just because the intial workings were fine, but after we realised it would affect our contract, it left a sour taste in our mouth.", "Problems with the repair", "They did not handle my complaint as well I would have liked.", "Bad service overall.")), .Names = "Reason", row.names = c(NA, 6L), class = "data.frame")
First, I updated your function to remove punctuation and directly return your position test:
distanceOK <- function(string, term1, term2, n=6) {
  words <- strsplit(gsub("[[:punct:]]", "", string), "\\s")[[1]]
  indices <- 1:length(words)
  names(indices) <- words
  dist <- abs(indices[term1] - indices[term2])
  ifelse(is.na(dist) | dist > n, 0, 1)
}
Then we apply it to the Reason column:
survey$Price <- sapply(survey$Reason, FUN=function(str) distanceOK(str, "price","high"))
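To cover all three conditions from the question (a hedged extension, not part of the original answer), you could test each trigger word and flag the row when any of them is close enough to "price". Note the matching is exact, so "increased" in the sample data will not match the bare term "increase" unless you stem or adjust the terms first:
# check each trigger word against "price" and flag the row if any pair is close enough
triggers <- c("high", "expensive", "increase")
survey$Price <- sapply(survey$Reason, FUN=function(str) {
  as.integer(any(sapply(triggers, function(w) distanceOK(str, "price", w) == 1)))
})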

R and tm package: create a term-document matrix with a dictionary of one or two words?

Purpose: I want to create a term-document matrix using a dictionary which has compound words, or bigrams, as some of the keywords.
Web Search: Being new to text-mining and the tm package in R, I went to the web to figure out how to do this. Below are some relevant links that I found:
FAQS on the tm-package website
finding 2 & 3 word phrases using r tm package
counter ngram with tm package in r
findassocs for multiple terms in r
Background: Of these, I preferred the solution that uses NGramTokenizer in the RWeka package in R, but I ran into a problem. In the example code below, I create three documents and place them in a corpus. Note that Docs 1 and 2 each contain two words. Doc 3 only contains one word. My dictionary keywords are two bigrams and a unigram.
Problem: The NGramTokenizer solution in the above links does not correctly count the unigram keyword in Doc 3.
library(tm)
library(RWeka)
my.docs = c('jedi master', 'jedi grandmaster', 'jedi')
my.corpus = Corpus(VectorSource(my.docs))
my.dict = c('jedi master', 'jedi grandmaster', 'jedi')
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 2))
inspect(DocumentTermMatrix(my.corpus, control=list(tokenize=BigramTokenizer,
                                                   dictionary=my.dict)))
# <<DocumentTermMatrix (documents: 3, terms: 3)>>
# ...
# Docs jedi jedi grandmaster jedi master
#    1    1                0           1
#    2    1                1           0
#    3    0                0           0
I was expecting the row for Doc 3 to give 1 for jedi and 0 for the other two. Is there something I am misunderstanding?
I ran into the same problem and found that token counting functions from the TM package rely on an option called wordLengths, which is a vector of two numbers -- the minimum and the maximum length of tokens to keep track of. By default, TM uses a minimum word length of 3 characters (wordLengths = c(3, Inf)). You can override this option by adding it to the control list in a call to DocumentTermMatrix like this:
DocumentTermMatrix(my.corpus,
                   control=list(
                     tokenize=BigramTokenizer,
                     wordLengths=c(1, Inf)))
However, your 'jedi' word is already more than 3 characters long. Still, you may have tweaked the option's value earlier while trying to figure out how to count ngrams, so it is worth trying. Also, look at the bounds option, which tells tm to discard words less or more frequent than the specified values.
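For reference, here is a sketch combining both options with the question's BigramTokenizer and dictionary; the bounds values shown are effectively a no-op and only illustrate the syntax:
# global bounds: keep terms appearing in between 1 and Inf documents
inspect(DocumentTermMatrix(my.corpus,
                           control=list(tokenize=BigramTokenizer,
                                        dictionary=my.dict,
                                        wordLengths=c(1, Inf),
                                        bounds=list(global=c(1, Inf)))))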
I noticed that NGramTokenizer returns character(0) when a one-word string is submitted as input and NGramTokenizer is asked to return unigrams and bigrams.
NGramTokenizer('jedi', Weka_control(min = 1, max = 2))
# character(0)
I am not sure why this is the output, but I believe this behavior is the reason why the keyword jedi was not counted in Doc 3. However, a simple if-then-else solution appears to work for my situation: both for the sample set and my actual data set.
library(tm)
library(RWeka)
my.docs = c('jedi master', 'jedi grandmaster', 'jedi')
my.corpus = Corpus(VectorSource(my.docs))
my.dict = c('jedi master', 'jedi grandmaster', 'jedi')
newBigramTokenizer = function(x) {
  tokenizer1 = NGramTokenizer(x, Weka_control(min = 1, max = 2))
  if (length(tokenizer1) != 0L) {
    return(tokenizer1)
  } else {
    return(WordTokenizer(x))  # WordTokenizer is another tokenizer in the RWeka package
  }
}
inspect(DocumentTermMatrix(my.corpus, control=list(tokenize=newBigramTokenizer,
                                                   dictionary=my.dict)))
# <<DocumentTermMatrix (documents: 3, terms: 3)>>
# ...
# Docs jedi jedi grandmaster jedi master
#    1    1                0           1
#    2    1                1           0
#    3    1                0           0
Please let me know if anyone finds a "gotcha" that I am not considering in the code above. I would also appreciate any insight into why NGramTokenizer returns character(0) in my observation above.

DocumentTermMatrix error on Corpus argument

I have the following code:
# returns string w/o leading or trailing whitespace
trim <- function (x) gsub("^\\s+|\\s+$", "", x)
news_corpus <- Corpus(VectorSource(news_raw$text)) # a column of strings.
corpus_clean <- tm_map(news_corpus, tolower)
corpus_clean <- tm_map(corpus_clean, removeNumbers)
corpus_clean <- tm_map(corpus_clean, removeWords, stopwords('english'))
corpus_clean <- tm_map(corpus_clean, removePunctuation)
corpus_clean <- tm_map(corpus_clean, stripWhitespace)
corpus_clean <- tm_map(corpus_clean, trim)
news_dtm <- DocumentTermMatrix(corpus_clean) # errors here
When I run the DocumentTermMatrix() method, it gives me this error:
Error: inherits(doc, "TextDocument") is not TRUE
Why do I get this error? Are my rows not text documents?
Here is the output upon inspecting corpus_clean:
[[153]]
[1] obama holds technical school model us
[[154]]
[1] oil boom produces jobs bonanza archaeologists
[[155]]
[1] islamic terrorist group expands territory captures tikrit
[[156]]
[1] republicans democrats feel eric cantors loss
[[157]]
[1] tea party candidates try build cantor loss
[[158]]
[1] vehicles materials stored delaware bridges
[[159]]
[1] hill testimony hagel defends bergdahl trade
[[160]]
[1] tweet selfpropagates tweetdeck
[[161]]
[1] blackwater guards face trial iraq shootings
[[162]]
[1] calif man among soldiers killed afghanistan
[[163]]
[1] stocks fall back world bank cuts growth outlook
[[164]]
[1] jabhat alnusra longer useful turkey
[[165]]
[1] catholic bishops keep focus abortion marriage
[[166]]
[1] barbra streisand visits hill heart disease
[[167]]
[1] rand paul cantors loss reason stop talking immigration
[[168]]
[1] israeli airstrike kills northern gaza
Edit: Here is my data:
type,text
neutral,The week in 32 photos
neutral,Look at me! 22 selfies of the week
neutral,Inside rebel tunnels in Homs
neutral,Voices from Ukraine
neutral,Water dries up ahead of World Cup
positive,Who's your hero? Nominate them
neutral,Anderson Cooper: Here's how
positive,"At fire scene, she rescues the pet"
neutral,Hunger in the land of plenty
positive,Helping women escape 'the life'
neutral,A tour of the sex underworld
neutral,Miss Universe Thailand steps down
neutral,China's 'naked officials' crackdown
negative,More held over Pakistan stoning
neutral,Watch landmark Cold War series
neutral,In photos: History of the Cold War
neutral,Turtle predicts World Cup winner
neutral,What devoured great white?
positive,Nun wins Italy's 'The Voice'
neutral,Bride Price app sparks debate
neutral,China to deport 'pork' artist
negative,Lightning hits moving car
neutral,Singer won't be silenced
neutral,Poland's mini desert
neutral,When monarchs retire
negative,Murder on Street View?
positive,Meet armless table tennis champ
neutral,Incredible 400 year-old globes
positive,Man saves falling baby
neutral,World's most controversial foods
Which I retrieve like:
news_raw <- read.csv('news_csv.csv', stringsAsFactors = F)
Edit: Here is the traceback():
> news_dtm <- DocumentTermMatrix(corpus_clean)
Error: inherits(doc, "TextDocument") is not TRUE
> traceback()
9: stop(sprintf(ngettext(length(r), "%s is not TRUE", "%s are not all TRUE"),
ch), call. = FALSE, domain = NA)
8: stopifnot(inherits(doc, "TextDocument"), is.list(control))
7: FUN(X[[1L]], ...)
6: lapply(X, FUN, ...)
5: mclapply(unname(content(x)), termFreq, control)
4: TermDocumentMatrix.VCorpus(x, control)
3: TermDocumentMatrix(x, control)
2: t(TermDocumentMatrix(x, control))
1: DocumentTermMatrix(corpus_clean)
When I evaluate inherits(corpus_clean, "TextDocument") it is FALSE.
It seems this would have worked just fine in tm 0.5.10, but changes in tm 0.6.0 seem to have broken it. The problem is that the functions tolower and trim won't necessarily return TextDocuments (it looks like the older version may have done the conversion automatically). They instead return characters, and DocumentTermMatrix isn't sure how to handle a corpus of characters.
So you could change to
corpus_clean <- tm_map(news_corpus, content_transformer(tolower))
Or you can run
corpus_clean <- tm_map(corpus_clean, PlainTextDocument)
after all of your non-standard transformations (those not in getTransformations()) are done and just before you create the DocumentTermMatrix. That should make sure all of your data is in PlainTextDocument and should make DocumentTermMatrix happy.
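Putting it together, a minimal sketch of the corrected pipeline, assuming tm >= 0.6.0 and the news_raw data frame from the question (trim is the question's own helper):
library(tm)
trim <- function(x) gsub("^\\s+|\\s+$", "", x)  # the question's whitespace helper

news_corpus  <- Corpus(VectorSource(news_raw$text))
corpus_clean <- tm_map(news_corpus, content_transformer(tolower))   # wrapped: tolower returns character
corpus_clean <- tm_map(corpus_clean, removeNumbers)                 # standard transformations need no wrapper
corpus_clean <- tm_map(corpus_clean, removeWords, stopwords('english'))
corpus_clean <- tm_map(corpus_clean, removePunctuation)
corpus_clean <- tm_map(corpus_clean, stripWhitespace)
corpus_clean <- tm_map(corpus_clean, content_transformer(trim))     # wrapped: trim is non-standard

news_dtm <- DocumentTermMatrix(corpus_clean)  # no longer errors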
I found a way to solve this problem in an article about tm.
An example in which the error occurs follows below:
getwd()
require(tm)
files <- DirSource(directory="texts/", encoding="latin1") # import files
corpus <- VCorpus(x=files) # load files, create corpus
summary(corpus) # get a summary
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, stripWhitespace)
matrix_terms <- DocumentTermMatrix(corpus)
Warning messages:
In TermDocumentMatrix.VCorpus(x, control) : invalid document identifiers
This error occurs because you need an object of class VectorSource to build your TermDocumentMatrix, but the previous transformations turned your corpus of texts into character vectors, i.e. into a class which is not accepted by the function.
However, if you add the function content_transformer inside the tm_map command, you may not need even one more command before using the function TermDocumentMatrix to keep going.
The code below changes the class (see the second-to-last line) and avoids the error:
getwd()
require(tm)
files <- DirSource(directory="texts/", encoding="latin1")
corpus <- VCorpus(x=files) # load files, create corpus
summary(corpus) # get a summary
corpus <- tm_map(corpus, content_transformer(removePunctuation))
corpus <- tm_map(corpus, content_transformer(stripWhitespace))
corpus <- Corpus(VectorSource(corpus)) # change class
matrix_term <- DocumentTermMatrix(corpus)
Change this:
corpus_clean <- tm_map(news_corpus, tolower)
For this:
corpus_clean <- tm_map(news_corpus, content_transformer(tolower))
This should work.
Alternatively, downgrade to tm 0.5-10, where the old behavior still works:
remove.packages("tm")
install.packages("http://cran.r-project.org/bin/windows/contrib/3.0/tm_0.5-10.zip", repos=NULL)
library(tm)
