Read full reviews from the dataset containing a specific word shown in the word cloud - count

Hi! I am trying to find the reviews in the dataset related to the specific words shown in the word cloud.
Here is my code for visualizing and counting the words:
# visualize the word cloud after removing stopwords
import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS

stopwords = set(STOPWORDS)
stopwords.update(["car", "vehicle", "time", "buy", "new", "infiniti"])
wordcloud_spam = WordCloud(stopwords=stopwords, background_color="white").generate(negitive_reviews_str)
plt.figure(figsize=(10, 10))
plt.imshow(wordcloud_spam, interpolation="bilinear")
plt.axis("off")
plt.show()
# count the 10 most common words from the cloud
import collections

filtered_word_spam = [word for word in negitive_reviews_str.split() if word not in stopwords]
counted_word_spam = collections.Counter(filtered_word_spam)
word_count_spam = {}
for letter, count in counted_word_spam.most_common(10):
    word_count_spam[letter] = count
for i, j in word_count_spam.items():
    print('Word: {0}, count: {1}'.format(i, j))

Related

Text mining: how to count the frequency of two words occurring close together

Let's say I have the following data frame, df.
speaker <- c('Lincoln','Douglas')
text <- c('The framers of the Constitution, those framers...',
'The framers of our great Constitution.')
df <- data.frame(speaker,text)
I want to find (or write) a function that can count the frequency of two words occurring close together. Say I want to count instances of "framers" occurring within three words of the word "Constitution" and vice versa.
Thus, for Lincoln, the function would return 2 because you have one instance of "framers" followed by "Constitution" and another instance of "Constitution" followed by "framers." For Douglas, the function would return 0 because "Constitution" is four words away from "framers."
I'm new to text mining, so I apologize if I'm missing an easy solution.
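One way to approach this (a minimal sketch, not from the original question: it assumes simple whitespace tokenization after stripping punctuation, and count_near() is a hypothetical helper name) is to record the token positions of each word and count the position pairs that fall within the window:

count_near <- function(text, word1, word2, window = 3) {
  # lower-case, strip punctuation, and split on whitespace
  tokens <- strsplit(tolower(gsub("[[:punct:]]", "", text)), "\\s+")[[1]]
  pos1 <- which(tokens == tolower(word1))
  pos2 <- which(tokens == tolower(word2))
  # count all position pairs no more than `window` tokens apart
  sum(outer(pos1, pos2, function(a, b) abs(a - b) <= window))
}

sapply(as.character(df$text), count_near, word1 = "framers", word2 = "Constitution")

With the sample df above this should return 2 for Lincoln and 0 for Douglas, matching the expected counts.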

Is there a way to calculate the frequency of one word and ngrams at the same time?

Right now I have two steps in my text processing using R and the package quanteda:
On the one hand, I calculate the frequency of my target words for single words like citizens.
On the other hand, I calculate the frequency of n-grams.
txt<- "for this purpose citizens and sydney siders are the same"
tok_single<- tokens_select(tokens(txt), pattern = "citizens", padding = TRUE)
textstat_frequency(dfm(tok_single))
####bigrams
toks_bigram <- tokens_select(tokens_ngrams(tokens(txt),n=2:3), pattern = c("sydney_siders"), padding = TRUE)
textstat_frequency(dfm(toks_bigram))
Finally, my problem is that the keywords in my dictionary consist of single words like citizens and multi-word expressions like sydney siders. At the moment the calculation is done in two steps, tokenizing 1-grams and n-grams separately. Is there a function to combine n-grams and 1-grams?
For example, I would like to calculate the frequency of my dictionary items in one step:
mydic <- dictionary(list(citizen = c("citizen*", "sydney siders", "public")))
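A possible one-step alternative (a sketch, not from the original question): quanteda's tokens_lookup() matches multi-word dictionary values such as "sydney siders" directly, so single words and phrases can be counted together against the same dictionary:

library(quanteda)
library(quanteda.textstats)

txt <- "for this purpose citizens and sydney siders are the same"
mydic <- dictionary(list(citizen = c("citizen*", "sydney siders", "public")))

# exclusive = TRUE keeps only tokens that match a dictionary key
toks <- tokens_lookup(tokens(txt), dictionary = mydic, exclusive = TRUE)
textstat_frequency(dfm(toks))

Here the single key citizen should pick up both "citizens" (via the glob pattern) and "sydney siders", giving one combined frequency.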

Getting the most significant words per document using R and the tm library

I have a data frame df of 1400 rows. I create a corpus from the text column as follows:
library(tm)
dc<-Corpus(VectorSource(df$text))
I then create the document-term matrix with tf-idf weighting:
tm<-DocumentTermMatrix(dc,control=list(weighting=weightTfIdf))
I now want to reduce this dataset so that, for each document, I (1) remove words with a weight of 0 and (2) choose the 10 most significant words for that document (highest weight).
However,
as.data.frame(inspect(tm))
results in the error
Error: cannot allocate vector of size 1.2 Gb
So, any suggestions as to how to manipulate the tm object without converting it?
To rephrase: for each document I want to pull out the 10 words with the highest tf-idf weighting.
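One possible approach (a sketch, not from the original question; dtm, terms, and top_terms are names introduced here, with dtm standing in for the tm object above to avoid shadowing the package name): a DocumentTermMatrix is stored as a sparse simple triplet matrix with components i, j, and v, so the highest-weighted terms per document can be extracted without ever converting it to a dense data frame:

library(tm)

dtm <- DocumentTermMatrix(dc, control = list(weighting = weightTfIdf))
terms <- Terms(dtm)                  # all terms, indexed by column

top_terms <- lapply(seq_len(nrow(dtm)), function(doc) {
  idx <- dtm$i == doc                # non-zero entries belonging to this document
  w <- dtm$v[idx]                    # their tf-idf weights
  head(terms[dtm$j[idx]][order(w, decreasing = TRUE)], 10)
})

Since only the non-zero entries are stored in i, j, and v, step (1) of dropping zero-weight words happens automatically.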

Find specific strings, count their frequency in a given text, and report it as a proportion of the number of words

Trying to write a function in R that would:
1) look through each observation's string variables
2) identify and count certain strings that the user defines
3) report the findings as a proportion of the total number of words each observation contains.
Here's a sample dataset:
df <- data.frame(essay1=c("OMG. american sign language. knee-slides in leather pants", "my face looks totally different every time. lol."),
essay2=c("cheez-its and dried cranberries. sparkling apple juice is pretty\ndamned coooooool too.<br />\nas for music, movies and books: the great american authors, mostly\nfrom the canon, fitzgerald, vonnegut, hemmingway, hawthorne, etc.\nthen of course the europeans, dostoyevski, joyce, the romantics,\netc. also, one of the best books i have read is all quiet on the\nwestern front. OMG. I really love that. lol", "at i should have for dinner\nand when; some random math puzzle, which I loooooove; what it means to be alive; if\nthe meaning of life exists in the first place; how the !##$ can the\npolitical mess be fixed; how the %^&* can the education system\nbe fixed; my current game design project; my current writing Lol"),
essay3=c("Lol. I enjoy life and then no so sure what else to say", "how about no?"))
The furthest I have managed to get is this function:
find.query <- function(char.vector, query){
  which.has.query <- grep(query, char.vector, ignore.case = TRUE)
  length(which.has.query) != 0
}

profile.has.query <- function(data.frame, query){
  query <- tolower(query)
  has.query <- apply(data.frame, 1, find.query, query = query)
  return(has.query)
}
This allows the user to detect whether a given value is in the 'essay' for a given user, but that's not enough for the three goals outlined above. What this function would ideally do is count the number of words identified, then divide that count by the total count of words in the overall essays (row sum of counts for each user).
Any advice on how to approach this?
Using the stringi package as in this post:
How do I count the number of words in a text (string) in R?
library(stringi)

words.identified.over.total.words <- function(dataframe, query){
  # make the query all lower-case
  query <- tolower(query)
  # count the total number of words
  total.words <- apply(dataframe, 2, stri_count, regex = "\\S+")
  # count the number of words matching query
  number.query <- apply(dataframe, 2, stri_count, regex = query)
  # divide the number of words identified by total words for each column
  final.result <- colSums(number.query) / colSums(total.words)
  return(final.result)
}
(The df in your question has each essay in a column, so the function sums each column. However, in the text of your question you say you want row sums. If the input data frame was meant to have one essay per row, then you can change the function to reflect that.)
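A hedged usage example with the sample df above (the query word "lol" is just an illustration):

words.identified.over.total.words(df, "lol")

Note that the function lower-cases the query but not the text, so capitalized occurrences such as "Lol" or "OMG" will not be counted unless case-insensitive matching is added (for example by prefixing the regex with the (?i) flag).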

How to find similar sentences / phrases in R?

For example, I have billions of short phrases, and I want to cluster those that are similar.
> strings.to.cluster <- c("Best Toyota dealer in bay area. Drive out with a new car today",
"Largest Selection of Furniture. Stock updated everyday" ,
" Unique selection of Handcrafted Jewelry",
"Free Shipping for orders above $60. Offer Expires soon",
"XXXX is where smart men buy anniversary gifts",
"2012 Camrys on Sale. 0% APR for select customers",
"Closing Sale on office desks. All Items must go"
)
Assume that this vector has hundreds of thousands of rows. Is there a package in R to cluster these phrases by meaning?
Or could someone suggest a way to rank phrases by similarity in meaning to a given phrase?
You can view your phrases as "bags of words", i.e., build a matrix (a "term-document" matrix), with one row per phrase, one column per word, with 1 if the word occurs in the phrase and 0 otherwise. (You can replace 1 with some weight that would account for phrase length and word frequency). You can then apply any clustering algorithm. The tm package can help you build this matrix.
library(tm)
library(Matrix)
x <- TermDocumentMatrix( Corpus( VectorSource( strings.to.cluster ) ) )
y <- sparseMatrix( i=x$i, j=x$j, x=x$v, dimnames = dimnames(x) )
plot( hclust(dist(t(y))) )
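A possible extension (not part of the original answer; the tf-idf weighting and the choice of k = 3 clusters are just illustrative assumptions): weight the matrix and cut the dendrogram into groups:

x <- TermDocumentMatrix(Corpus(VectorSource(strings.to.cluster)),
                        control = list(weighting = weightTfIdf))
y <- sparseMatrix(i = x$i, j = x$j, x = x$v, dimnames = dimnames(x))
groups <- cutree(hclust(dist(t(y))), k = 3)   # assign each phrase to one of 3 clusters
split(strings.to.cluster, groups)             # inspect the phrases in each cluster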
Maybe looking at this document:
http://www.inside-r.org/howto/mining-twitter-airline-consumer-sentiment
could help; it uses R and looks at market sentiment for airlines using Twitter.
