Get word occurrences - R

I am trying to get the number of occurrences of each word in a CSV file with R.
My dataset looks like this:
TITLE
1 My first Android app after a year
2 Unmanned drone buzzes French police car
3 Make anything editable with HTML5
4 Predictive vs Reactive control
5 What was it like to move to San Antonio and go through TechStars Cloud?
6 Health-care sector vulnerable to hackers, researchers say
And I have tried using the function used in 'Machine Learning for Hackers':
get.tdm <- function(doc.vec) {
  doc.corpus <- Corpus(VectorSource(doc.vec))
  control <- list(stopwords = TRUE, removePunctuation = TRUE, removeNumbers = TRUE, minDocFreq = 2)
  doc.dtm <- TermDocumentMatrix(doc.corpus, control)
  return(doc.dtm)
}
But I get an error I don't understand:
Error: is.Source(s) is not TRUE
In addition: Warning message:
In is.Source(s) : vectorized sources must have a positive length entry
What could possibly be the problem?

This works for me (calling your data frame df):
library(tm)
doc.corpus <- Corpus(VectorSource(df))
freq <- data.frame(count=termFreq(doc.corpus[[1]]))
freq
# count
# after 1
# and 1
# android 1
# antonio 1
# anything 1
# ...
# unmanned 1
# vulnerable 1
# was 1
# what 1
# with 1
# year 1
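If you would rather keep the get.tdm() approach from the question, the error "vectorized sources must have a positive length entry" usually means the vector handed to VectorSource() ended up with length zero, for example because the whole data frame or a misspelled column was passed instead of the title column. A minimal sketch, assuming the titles live in df$TITLE:
library(tm)
doc.tdm <- get.tdm(as.character(df$TITLE)) # pass the character vector, not the whole data frame
word.counts <- sort(rowSums(as.matrix(doc.tdm)), decreasing = TRUE) # occurrences of each word
head(word.counts)
rowSums() over the dense term-document matrix totals each term across all titles, which is the per-word occurrence count the question asks for.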

Related

How to work with scores and regex in a keywords dictionary to get a rudimentary sentiment analysis, with R?

I would like to optimize the size of a sentiment dictionary by using regular expressions. But I don't know how to match the keywords with the text to be analysed without losing the rating of each keyword.
I work with R, and I'd like to stick with a "matching words" approach.
This is what I tried
library(stringr)
library(tidytext) # tidy text analysis + unnest_tokens
library(tidyverse) # visualization + tibble
# text to be quoted
Corpus<- c("Radicals in their time, early Impressionists violated the rules of academic painting.",
"They also painted realistic scenes of modern life, and often painted outdoors.",
"The public, at first hostile, gradually came to believe that the Impressionists had captured a fresh and original vision.",
"Even if the art critics and art establishment disapproved of the new style.")
# dictionary: lists of word patterns and their ratings (quotes)
WordsList <- c("^academ.+$","^disapprov.*$","^friend.*$","^fresh.*$","^hostil.+$","^modern.*$","^new.*$","^original.*$","^outstand.*$","^radical.*$","^uncorrect.+$","^violat.+$")
QuotesList <- c(1,-2,2,2,-2,2,1,2,3,-3,-1,-3)
Lexicon <- data.frame(words=WordsList, quotes=QuotesList)
Lexicon
# words quotes
# 1 ^academ.+$ 1
# 2 ^disapprov.*$ -2
# 3 ^friend.*$ 2
# 4 ^fresh.*$ 2
# 5 ^hostil.+$ -2
# 6 ^modern.*$ 2
# 7 ^new.*$ 1
# 8 ^original.*$ 2
# 9 ^outstand.*$ 3
# 10 ^radical.*$ -3
# 11 ^uncorrect.+$ -1
# 12 ^violat.+$ -3
messag <- tibble(docidx = 1:length(Corpus), text = Corpus)
# split into words : 1 row per word per "document"
txt.by.word <- messag %>%
unnest_tokens(mots, text)
# size order instead of alphabetic order
matching<- paste(Lexicon[order(-nchar(Lexicon$words)),]$words, collapse = '|')
matching
# [1] "^disapprov.*$|^original.*$|^radical.*$|^academ.+$|^hostil.+$|^modern.*$|^violat.+$|^fresh.*$|^new.*$"
# search matchings
test <- str_extract_all(txt.by.word$mots, matching, simplify = T) # case-sensitive
# result
test
tst <- as.data.frame(test)
# drop empty matches
tst[!tst$V1 %in% "",]
# [1] "radicals" "violated" "academic" "modern" "hostile" "fresh" "original" "disapproved"
# [9] "new"
# from here I don't know how to get the expected result below: by docidx, match the words and their associated ratings
# how to extract both the keyword and its sentiment rating?
# Expected result
# docidx text quote
# 1 radicals -3
# 1 violated -3
# 1 academic 1
# 2 modern 2
# 3 hostile -2
# 3 fresh 2
# 3 original 2
# 4 disapproved -2
# 4 new 1
Thanks to Maël, who answered another post of mine (see "an equivalent of the 'match' function that works with regex"), I have found an acceptable solution, very close to my target. Here is the heart of the code, to be used instead of str_extract_all:
library(data.table)
dt.unl <- as.data.table(unlist(sapply(Lexicon$words, grep, Corpus, value = TRUE)), keep.rownames=T)
dt.unl
dt.unl[ , keywords := lapply(.SD, function(x){gsub("[0-9]$", "", x)}), .SDcols=1, by="V1"]
dt.unl
dt.scor <- merge(dt.unl[,.(V2,keywords)], Lexicon, by.x="keywords", by.y="words")
dt.scor
# keywords V2 quotes
# 1: \\bacadem.+\\b Radicals in their time, early Impressionists violated the rules of academic painting. 1
# 2: \\bdisapprov.*\\b Even if the art critics and art establishment disapproved of the new style. -2
# 3: \\bfresh.*\\b The public, at first hostile, gradually came to believe that the Impressionists had captured a fresh and original vision. 2
# 4: \\bhostil.+\\b The public, at first hostile, gradually came to believe that the Impressionists had captured a fresh and original vision. -2
# 5: \\bmodern.*\\b They also painted realistic scenes of modern life, and often painted outdoors. 2
# 6: \\bnew.*\\b Even if the art critics and art establishment disapproved of the new style. 1
# 7: \\boriginal.*\\b The public, at first hostile, gradually came to believe that the Impressionists had captured a fresh and original vision. 2
# 8: \\bviolat.+\\b Radicals in their time, early Impressionists violated the rules of academic painting. -3
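For completeness, a possible alternative sketch (my own suggestion, not from the original post) using the fuzzyjoin package: it joins the tokenized words against the regex patterns directly and yields the per-word expected result in one step.
library(fuzzyjoin)
# regex_inner_join() keeps a row for every token that matches a pattern in the lexicon
scored <- txt.by.word %>%
  regex_inner_join(Lexicon, by = c(mots = "words")) %>%
  select(docidx, text = mots, quote = quotes)
scored
Each matched keyword comes back with its docidx and its rating, matching the expected-result table above.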

tokenizing on a pdf for quantitative analysis

I ran into an issue using the unnest_tokens function on a data_frame. I am working with pdf files I want to compare.
text_path <- "c:/.../text1.pdf"
text_raw <- pdf_text("c:/.../text1.pdf")
text1df<- data_frame(Zeile = 1:25,
text_raw)
So far so good. But here comes my problem:
unnest_tokens(output = token, input = content) -> text1_long
Error: Must extract column with a single valid subscript.
x Subscript var has the wrong type function.
i It must be numeric or character.
I want to tokenize my pdf files so I can analyse the word frequencies and maybe compare multiple pdf files on wordclouds.
Here is a piece of simple code. I kept your German variable names so you can copy-paste everything.
library(pdftools)
library(dplyr)
library(stringr)
library(tidytext)
file_location <- "d:/.../my_doc.pdf"
text_raw <- pdf_text(file_location)
# Zeile 12 because I only have 12 pages
text1df <- data_frame(Zeile = 1:12,
text_raw)
text1df_long <- unnest_tokens(text1df , output = wort, input = text_raw ) %>%
filter(str_detect(wort, "[a-z]"))
text1df_long
# A tibble: 4,134 x 2
Zeile wort
<int> <chr>
1 1 training
2 1 and
3 1 development
4 1 policy
5 1 contents
6 1 policy
7 1 statement
8 1 scope
9 1 induction
10 1 training
# ... with 4,124 more rows
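From there, counting word frequencies is a one-liner, and a word cloud per PDF follows directly; a minimal sketch (the wordcloud package is my addition, not part of the original answer):
word_freq <- text1df_long %>%
  count(wort, sort = TRUE) # frequency of each token in the document
library(wordcloud)
wordcloud(words = word_freq$wort, freq = word_freq$n, max.words = 100)
Running the same pipeline on a second PDF and comparing the two word_freq tables (or clouds) gives the cross-document comparison mentioned in the question.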

How do I include stopwords(terms) in text2vec

In text2vec package, I am using create_vocabulary function. For eg:
My text is "This book is very good" and suppose I am not using stopwords and use ngrams of 1L to 3L, so the vocabulary terms will be
This, book, is, very, good, This book, ..., book is very, very good. I just want to remove the term "book is very" (and a host of other terms supplied in a vector). Since I just want to remove a phrase, I can't use stopwords. I have written the code below:
x <- read.csv('Filename') # these are all stop phrases
stp <- as.vector(x$term)
vocab <- create_vocabulary(it, ngram = c(1L, 3L))
vocab_mod <- subset(vocab, !(term %in% stp)) # where stp is the stop phrases
When I do the above step, the meta-information in the attributes gets lost in vocab_mod, so it can't be used in create_dtm.
It seems that the subset function drops some attributes. You can try:
library(text2vec)
txt = "This book is very good"
it = itoken(txt)
v = create_vocabulary(it, ngram = c(1, 3))
v = v[!(v$term %in% "is_very_good"), ]
v
# Number of docs: 1
# 0 stopwords: ...
# ngram_min = 1; ngram_max = 3
# Vocabulary:
# term term_count doc_count
# 1: good 1 1
# 2: book_is_very 1 1
# 3: This_book 1 1
# 4: This 1 1
# 5: book 1 1
# 6: very_good 1 1
# 7: is_very 1 1
# 8: book_is 1 1
# 9: This_book_is 1 1
# 10: is 1 1
# 11: very 1 1
dtm = create_dtm(it, vocab_vectorizer(v))
# @Dmitriy even this ends up dropping the attributes... So the workaround I found was to add the attributes back manually for now, using the attr function:
attr(vocab_mod, "ngram") <- c(ngram_min = 1L, ngram_max = 3L) # and so on for the other attributes; the attribute details can be read off vocab
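A slightly more general sketch of that workaround (my own variant; which attributes text2vec actually needs is an assumption here, so check against your version): copy every non-structural attribute back from the original vocabulary after subsetting.
vocab_mod <- subset(vocab, !(term %in% stp))
# restore the metadata that subset() dropped by copying it from the original vocabulary
for (a in setdiff(names(attributes(vocab)), c("names", "row.names", "class"))) {
  attr(vocab_mod, a) <- attr(vocab, a)
}
dtm <- create_dtm(it, vocab_vectorizer(vocab_mod))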

join quanteda dfm top ten 1grams with all dfm 2 thru 5grams

To conserve memory when dealing with a very large corpus sample, I'm looking to take just the top 10 1-grams and combine those with all of the 2- through 5-grams to form a single quanteda::dfmSparse object that will be used in natural language processing (NLP) predictions. Carrying around all the 1-grams would be pointless because only the top ten (or twenty) will ever get used with the simple back-off model I'm using.
I wasn't able to find a quanteda::dfm(corpusText, ...) parameter that instructs it to return only the top ## features. So based on comments from package author @KenB in other threads, I'm using the dfm_select/dfm_remove functions to extract the top ten 1-grams, and based on the "quanteda dfm join" search result "concatenate dfm matrices in 'quanteda' package" I'm using an rbind() method for dfmSparse objects to join those results.
So far everything looks right from what I can tell. I thought I'd bounce this game plan off the SO community to see if I'm overlooking a more efficient route to this result, or some flaw in the solution I've arrived at thus far.
library(data.table) # data.table() is used at the end for the frequency table
corpusObject <- quanteda::corpus(paste("some corpus text of no consequence that in practice is going to be very large\n",
"and so one might expect a very large number of ngrams but for nlp purposes only care about top ten\n",
"adding some corpus text word repeats to ensure 1gram top ten selection approaches are working\n"))
corpusObject$documents
dfm1gramsSorted <- dfm_sort(dfm(corpusObject, tolower = T, stem = F, ngrams = 1))
dfm2to5grams <- quanteda::dfm(corpusObject, tolower = T, stem = F, ngrams = 2:5)
dfm1gramsSorted; dfm2to5grams
#featnames(dfm1gramsSorted); featnames(dfm2to5grams)
#colSums(dfm1gramsSorted); colSums(dfm2to5grams)
dfm1gramsSortedLen <- length(featnames(dfm1gramsSorted))
# option1 - select top 10 features from dfm1gramsSorted
dfmTopTen1grams <- dfm_select(dfm1gramsSorted, pattern = featnames(dfm1gramsSorted)[1:10])
dfmTopTen1grams; featnames(dfmTopTen1grams)
# option2 - drop all but top 10 features from dfm1gramsSorted
dfmTopTen1grams <- dfm_remove(dfm1gramsSorted, pattern = featnames(dfm1gramsSorted)[11:dfm1gramsSortedLen])
dfmTopTen1grams; featnames(dfmTopTen1grams)
dfmTopTen1gramsAndAll2to5grams <- rbind(dfmTopTen1grams, dfm2to5grams)
dfmTopTen1gramsAndAll2to5grams;
#featnames(dfmTopTen1gramsAndAll2to5grams); colSums(dfmTopTen1gramsAndAll2to5grams)
data.table(ngram = featnames(dfmTopTen1gramsAndAll2to5grams)[1:50], frequency = colSums(dfmTopTen1gramsAndAll2to5grams)[1:50],
keep.rownames = F, stringsAsFactors = F)
For extracting the top 10 unigrams, this strategy will work just fine:
sort the dfm in the (default) decreasing order of overall feature frequency, which you have already done, then add a step to slice out the first 10 columns;
combine this with the 2- to 5-gram dfm using cbind() (not rbind()).
That should do it:
dfmCombined <- cbind(dfm1gramsSorted[, 1:10], dfm2to5grams)
head(dfmCombined, nfeat = 15)
# Document-feature matrix of: 1 document, 195 features (0% sparse).
# (showing first document and first 15 features)
# features
# docs some corpus text of to very large top ten no some_corpus corpus_text text_of of_no no_consequence
# text1 2 2 2 2 2 2 2 2 2 1 2 2 1 1 1
Your example code includes some use of data.table, although this does not appear in the question. In v0.99 we have added a new function textstat_frequency() which produces a "long"/"tidy" format of frequencies in a data.frame that might be helpful:
head(textstat_frequency(dfmCombined), 10)
# feature frequency rank docfreq
# 1 some 2 1 1
# 2 corpus 2 2 1
# 3 text 2 3 1
# 4 of 2 4 1
# 5 to 2 5 1
# 6 very 2 6 1
# 7 large 2 7 1
# 8 top 2 8 1
# 9 ten 2 9 1
# 10 some_corpus 2 10 1
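As a side note (my suggestion, not part of the original answer), topfeatures() returns the top-n feature names directly, so the same combined dfm can be built without indexing columns by position:
top10 <- names(topfeatures(dfm1gramsSorted, 10)) # top 10 unigrams by overall frequency
dfmCombined <- cbind(dfm_select(dfm1gramsSorted, pattern = top10), dfm2to5grams)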

Check n-word frequency for open-ended questions

I want to analyze answers from open-ended questions. I built a single-word word cloud first, and then hit a problem when I wanted to count the frequency of 2-3 word phrases.
Here is my code:
library('tm')
library('tau') # textcnt() used in the tokenizer below comes from the tau package
tokenize_ngrams <- function(x, n = 2) return(rownames(as.data.frame(unclass(textcnt(x, method = "string", n = n)))))
corpus <- Corpus(VectorSource(texts))
matrix <- TermDocumentMatrix(corpus, control = list(tokenize = tokenize_ngrams))
inspect(matrix[1:4, 1:3])
The results should be the 2-word phrases and their frequencies.
But I got the following results instead:
Docs
Terms 1 2 3
document 1 0 0
first 1 0 0
the 1 1 1
this 1 1 1
I don't know the answer using tm, but this will work fine:
require(quanteda)
matrix <- dfm(texts, ngrams = 2)
head(matrix)
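To read the phrase counts off that dfm, a short follow-up sketch (my addition, not part of the original answer):
topfeatures(matrix, 20) # the 20 most frequent 2-word phrases and their counts
head(textstat_frequency(matrix), 10) # or a tidy frequency table (quanteda >= 0.99)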
