How do I include stopwords (terms) in text2vec - r

In the text2vec package, I am using the create_vocabulary function. For example:
My text is "This book is very good", and suppose I am not using stopwords and an ngram range of 1L to 3L, so the vocabulary terms will be
This, book, is, very, good, This book, ..., book is very, very good. I just want to remove the term "book is very" (and a host of other terms supplied as a vector). Since I want to remove phrases rather than single words, I can't use stopwords. I have written the code below:
x <- read.csv('Filename') # these are all stop phrases
stp <- as.vector(x$term)
vocab <- create_vocabulary(it, ngram = c(1L, 3L))
vocab_mod <- subset(vocab, !(term %in% stp)) # where stp is the stop phrases
When I do the above step, the meta-information in the attributes gets lost in vocab_mod, and so it can't be used in create_dtm.

It seems that the subset function drops some attributes. You can try:
library(text2vec)
txt = "This book is very good"
it = itoken(txt)
v = create_vocabulary(it, ngram = c(1, 3))
v = v[!(v$term %in% "is_very_good"), ]
v
# Number of docs: 1
# 0 stopwords: ...
# ngram_min = 1; ngram_max = 3
# Vocabulary:
# term term_count doc_count
# 1: good 1 1
# 2: book_is_very 1 1
# 3: This_book 1 1
# 4: This 1 1
# 5: book 1 1
# 6: very_good 1 1
# 7: is_very 1 1
# 8: book_is 1 1
# 9: This_book_is 1 1
# 10: is 1 1
# 11: very 1 1
dtm = create_dtm(it, vocab_vectorizer(v))

@Dmitriy, even this ends up dropping the attributes... So the way out that I found was to add the attributes back manually, for now, using the attr function:
attr(vocab_mod, "ngram") <- c(ngram_min = 1L, ngram_max = 3L)
and so on for the other attributes as well. We can get the attribute details from vocab.
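A minimal sketch of that workaround, copying every metadata attribute over from the original vocabulary (the exact attribute names vary by text2vec version, so inspect attributes(vocab) on yours):
# filter the vocabulary, then restore the attributes that create_dtm expects
vocab_mod <- vocab[!(vocab$term %in% stp), ]
for (a in setdiff(names(attributes(vocab)), c("names", "row.names", "class"))) {
  attr(vocab_mod, a) <- attr(vocab, a)
}
dtm <- create_dtm(it, vocab_vectorizer(vocab_mod))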

Related

How to work with scores and regex in a keywords dictionary to get a rudimentary sentiment analysis, with R?

I would like to optimize the size of a sentiment dictionary by using regular expressions, but I don't know how to match the keywords against the text to be analysed without losing the rating of each keyword.
I work with R, and I'd like to stick with a "matching words" kind of solution.
This is what I tried:
library(stringr)
library(tidytext) # tidy text analysis + unnest_tokens
library(tidyverse) # visualization + tibble
# text to be quoted
Corpus<- c("Radicals in their time, early Impressionists violated the rules of academic painting.",
"They also painted realistic scenes of modern life, and often painted outdoors.",
"The public, at first hostile, gradually came to believe that the Impressionists had captured a fresh and original vision.",
"Even if the art critics and art establishment disapproved of the new style.")
# dictionary : words and quotes lists
WordsList <- c("^academ.+$","^disapprov.*$","^friend.*$","^fresh.*$","^hostil.+$","^modern.*$","^new.*$","^original.*$","^outstand.*$","^radical.*$","^uncorrect.+$","^violat.+$")
QuotesList <- c(1,-2,2,2,-2,2,1,2,3,-3,-1,-3)
Lexicon <- data.frame(words=WordsList, quotes=QuotesList)
Lexicon
# words quotes
# 1 ^academ.+$ 1
# 2 ^disapprov.*$ -2
# 3 ^friend.*$ 2
# 4 ^fresh.*$ 2
# 5 ^hostil.+$ -2
# 6 ^modern.*$ 2
# 7 ^new.*$ 1
# 8 ^original.*$ 2
# 9 ^outstand.*$ 3
# 10 ^radical.*$ -3
# 11 ^uncorrect.+$ -1
# 12 ^violat.+$ -3
messag <- tibble(docidx = 1:length(Corpus), text = Corpus)
# split into words : 1 row per word per "document"
txt.by.word <- messag %>%
unnest_tokens(mots, text)
# order patterns by length (longest first) instead of alphabetically
matching<- paste(Lexicon[order(-nchar(Lexicon$words)),]$words, collapse = '|')
matching
# [1] "^disapprov.*$|^original.*$|^radical.*$|^academ.+$|^hostil.+$|^modern.*$|^violat.+$|^fresh.*$|^new.*$"
# search for matches
test <- str_extract_all(txt.by.word$mots, matching, simplify = TRUE) # case-sensitive
# result
test
tst <- as.data.frame(test)
# except empty
tst[!tst$V1 %in% "",]
# [1] "radicals" "violated" "academic" "modern" "hostile" "fresh" "original" "disapproved"
# [9] "new"
# from here I don't know how to get the expected result below: by docidx, matching the words and their associated ratings.
# how do I extract both the keyword and the sentiment rating?
# Expected result
# docidx text quote
# 1 radicals -3
# 1 violated -3
# 1 academic 1
# 2 modern 2
# 3 hostile -2
# 3 fresh 2
# 3 original 2
# 4 disapproved -2
# 4 new 1
Thanks to Maël, who answered another post of mine; see "an equivalent of the 'match' function that works with regex".
I have found an acceptable solution, very close to my target. Here is the heart of the code, to be used instead of str_extract_all (note: the output below assumes the Lexicon patterns were rewritten in word-boundary form, e.g. "\\bacadem.+\\b", so that grep can match inside whole sentences):
library(data.table)
dt.unl <- as.data.table(unlist(sapply(Lexicon$words, grep, Corpus, value = TRUE)), keep.rownames = TRUE)
dt.unl
dt.unl[ , keywords := lapply(.SD, function(x){gsub("[0-9]$", "", x)}), .SDcols=1, by="V1"]
dt.unl
dt.scor <- merge(dt.unl[,.(V2,keywords)], Lexicon, by.x="keywords", by.y="words")
dt.scor
# keywords V2 quotes
# 1: \\bacadem.+\\b Radicals in their time, early Impressionists violated the rules of academic painting. 1
# 2: \\bdisapprov.*\\b Even if the art critics and art establishment disapproved of the new style. -2
# 3: \\bfresh.*\\b The public, at first hostile, gradually came to believe that the Impressionists had captured a fresh and original vision. 2
# 4: \\bhostil.+\\b The public, at first hostile, gradually came to believe that the Impressionists had captured a fresh and original vision. -2
# 5: \\bmodern.*\\b They also painted realistic scenes of modern life, and often painted outdoors. 2
# 6: \\bnew.*\\b Even if the art critics and art establishment disapproved of the new style. 1
# 7: \\boriginal.*\\b The public, at first hostile, gradually came to believe that the Impressionists had captured a fresh and original vision. 2
# 8: \\bviolat.+\\b Radicals in their time, early Impressionists violated the rules of academic painting. -3
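For completeness, here is a minimal sketch (an illustrative alternative, not part of the original posts) that produces exactly the expected per-token result, reusing the txt.by.word and Lexicon objects from above; it relies on the anchored patterns matching single lowercase tokens:
# match every token against every lexicon pattern, carrying the score along
matched <- map_dfr(seq_len(nrow(Lexicon)), function(i) {
  hits <- str_detect(txt.by.word$mots, as.character(Lexicon$words[i]))
  tibble(docidx = txt.by.word$docidx[hits],
         text = txt.by.word$mots[hits],
         quote = Lexicon$quotes[i])
})
arrange(matched, docidx)
# docidx text quote -- matches the expected result shown above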

Keep special characters in a word-frequency matrix

I analyze some brands in text to find out KPIs like ad recognition. However, brands which contain special characters are destroyed by my code so far.
library(qdap)
library(stringr)
test <- c("H&M", "C&A", "Zalando", "Zalando", "Amazon", "Sportscheck")
wfm(test)
This is the output:
all
a 1
amazon 1
c 1
h 1
m 1
sportscheck 1
zalando 2
Is there a package or method to achieve that "H&M" is counted as "h&m", and not as separate "h" and "m", as if it were two brands?
edit: The wfm function has a ... argument which SHOULD allow me to use the strip function:
wfm(test, ... = strip(test, char.keep = "&"))
Unfortunately, that does not work.
I am not familiar with the qdap package, but maybe substituting the & could solve your problem:
replacement <- "" # set your replacement e.g. "" (empty string) or "_"
test <- gsub("&", replacement, test, fixed = T)
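A minimal sketch of that substitution idea (assuming wfm keeps returning a matrix with the words as row names, as in the output above): swap "&" for a placeholder before counting, then restore it afterwards. The placeholder "aammpp" is an arbitrary choice.
test <- c("H&M", "C&A", "Zalando", "Zalando", "Amazon", "Sportscheck")
test2 <- gsub("&", "aammpp", test, fixed = TRUE) # placeholder survives tokenization
m <- wfm(test2)
rownames(m) <- gsub("aammpp", "&", rownames(m), fixed = TRUE)
m # h&m and c&a are now counted as single terms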
I would say something like this: in the udpipe package there is a function document_term_frequencies where you can specify the split, and it turns the data into a data.frame with the frequency counts. If there is no id column to specify, it will generate one. The object returned by document_term_frequencies is a data.table.
library(udpipe)
# data.frame without a ID column
my_data <- data.frame(text = c("H&M, C&A, Zalando, Zalando, Amazon, Sportscheck",
"H&M, C&A, Amazon, Sportscheck"),
stringsAsFactors = FALSE)
# if you have an ID column add document = my_data$id to the function
# see more examples in ?document_term_frequencies
document_term_frequencies(my_data$text, split = ",")
doc_id term freq
1: doc1 H&M 1
2: doc1 C&A 1
3: doc1 Zalando 2
4: doc1 Amazon 1
5: doc1 Sportscheck 1
6: doc2 H&M 1
7: doc2 C&A 1
8: doc2 Amazon 1
9: doc2 Sportscheck 1

Extract string within data.table

I have the following data.table called D.
ngram
1 in_the_years
2 the_years_thereafter
3 years_thereafter_most
4 he_wasn't_home
5 how_are_you
6 thereafter_most_of
I need to add a few variables.
1. queryWord (the requirement is to extract the first two words)
The following is my code:
D[,queryWord:=strsplit(ngram,"_[^_]+$")[[1]],by=ngram]
ngram queryWord
1 in_the_years in_the
2 the_years_thereafter the_years
3 years_thereafter_most years_thereafter
4 he_wasn't_home he_wasn't
5 how_are_you how_are
6 thereafter_most_of thereafter_most
2. predict (the requirement is to extract the last word).
The following is desired output
ngram queryWord predict
1 in_the_years in_the years
2 the_years_thereafter the_years thereafter
3 years_thereafter_most years_thereafter most
4 he_wasn't_home he_wasn't home
5 how_are_you how_are you
6 thereafter_most_of thereafter_most of
For this purpose I wrote the following function:
getLastTerm<-function(x){
y<-strsplit(x,"_")
y[[1]][length(y[[1]])]
}
getLastTerm("in_the_years") returns "years"; however, it does not work inside the data.table object D.
D[, predict := getLastTerm(ngram)[[1]], by = ngram]
Please, I need help.
Before addressing your actual question, you can simplify your first step to:
# option 1
D[, queryWord := strsplit(ngram,"_[^_]+$")][]
# option 2
D[, queryWord := sub('(.*)_.*$','\\1',ngram)][]
To get the predict-column, you don't need to write a special function. Using a combination of strsplit, lapply and last:
D[, predict := lapply(strsplit(D$ngram,"_"), last)][]
Or an even easier solution is using only sub:
D[, predict := sub('.*_(.*)$','\\1',ngram)][]
Both approaches give the following final result:
> D
ngram queryWord predict
1: in_the_years in_the years
2: the_years_thereafter the_years thereafter
3: years_thereafter_most years_thereafter most
4: he_wasn't_home he_wasn't home
5: how_are_you how_are you
6: thereafter_most_of thereafter_most of
Used data:
D <- fread("ngram
in_the_years
the_years_thereafter
years_thereafter_most
he_wasn't_home
how_are_you
thereafter_most_of", header = TRUE)
Your getLastTerm function only selects the first element of the list that strsplit returns. Try the version below, which loops over every element.
getLastTerm <- function(x){
  y <- strsplit(x, "_")
  for (i in seq_along(x)) {
    x[i] <- y[[i]][length(y[[i]])]
  }
  x
}
D$new <- getLastTerm(D$ngram)
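As an aside, a vectorised sketch (not from the original answers) that avoids both the loop and the by = ngram grouping: vapply visits every element of the list that strsplit returns and takes the last piece of each.
D[, predict := vapply(strsplit(ngram, "_"), function(p) p[length(p)], character(1))][]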

join quanteda dfm top ten 1grams with all dfm 2 thru 5grams

To conserve memory when dealing with a very large corpus sample, I'm looking to take just the top 10 1-grams and combine those with all of the 2- through 5-grams to form the single quanteda::dfmSparse object that will be used in natural language processing (NLP) predictions. Carrying around all the 1-grams would be pointless, because only the top ten (or twenty) will ever get used with the simple back-off model I'm using.
I wasn't able to find a quanteda::dfm(corpusText, ...) parameter that instructs it to return only the top N features. So, based on comments from package author @KenB in other threads, I'm using the dfm_select/dfm_remove functions to extract the top ten 1-grams, and based on the "quanteda dfm join" search-results hit "concatenate dfm matrices in 'quanteda' package", I'm using the rbind method for dfmSparse objects to join those results.
So far everything looks right from what I can tell. I thought I'd bounce this game plan off the SO community to see if I'm overlooking a more efficient route to this result, or some flaw in the solution I've arrived at thus far.
corpusObject <- quanteda::corpus(paste("some corpus text of no consequence that in practice is going to be very large\n",
"and so one might expect a very large number of ngrams but for nlp purposes only care about top ten\n",
"adding some corpus text word repeats to ensure 1gram top ten selection approaches are working\n"))
corpusObject$documents
dfm1gramsSorted <- dfm_sort(dfm(corpusObject, tolower = T, stem = F, ngrams = 1))
dfm2to5grams <- quanteda::dfm(corpusObject, tolower = T, stem = F, ngrams = 2:5)
dfm1gramsSorted; dfm2to5grams
#featnames(dfm1gramsSorted); featnames(dfm2to5grams)
#colSums(dfm1gramsSorted); colSums(dfm2to5grams)
dfm1gramsSortedLen <- length(featnames(dfm1gramsSorted))
# option1 - select top 10 features from dfm1gramsSorted
dfmTopTen1grams <- dfm_select(dfm1gramsSorted, pattern = featnames(dfm1gramsSorted)[1:10])
dfmTopTen1grams; featnames(dfmTopTen1grams)
# option2 - drop all but top 10 features from dfm1gramsSorted
dfmTopTen1grams <- dfm_remove(dfm1gramsSorted, pattern = featnames(dfm1gramsSorted)[11:dfm1gramsSortedLen])
dfmTopTen1grams; featnames(dfmTopTen1grams)
dfmTopTen1gramsAndAll2to5grams <- rbind(dfmTopTen1grams, dfm2to5grams)
dfmTopTen1gramsAndAll2to5grams;
#featnames(dfmTopTen1gramsAndAll2to5grams); colSums(dfmTopTen1gramsAndAll2to5grams)
data.table(ngram = featnames(dfmTopTen1gramsAndAll2to5grams)[1:50], frequency = colSums(dfmTopTen1gramsAndAll2to5grams)[1:50],
keep.rownames = F, stringsAsFactors = F)
For extracting the top 10 unigrams, this strategy will work just fine:
sort the dfm by the (default) decreasing order of overall feature frequency, which you have already done, but then add a step to slice out the first 10 columns;
combine this with the 2- to 5-gram dfm using cbind() (not rbind()).
That should do it:
dfmCombined <- cbind(dfm1gramsSorted[, 1:10], dfm2to5grams)
head(dfmCombined, nfeat = 15)
# Document-feature matrix of: 1 document, 195 features (0% sparse).
# (showing first document and first 15 features)
# features
# docs some corpus text of to very large top ten no some_corpus corpus_text text_of of_no no_consequence
# text1 2 2 2 2 2 2 2 2 2 1 2 2 1 1 1
Your example code includes some use of data.table, although that does not appear in the question. In v0.99 we added a new function, textstat_frequency(), which produces a "long"/"tidy" format of frequencies in a data.frame that might be helpful:
head(textstat_frequency(dfmCombined), 10)
# feature frequency rank docfreq
# 1 some 2 1 1
# 2 corpus 2 2 1
# 3 text 2 3 1
# 4 of 2 4 1
# 5 to 2 5 1
# 6 very 2 6 1
# 7 large 2 7 1
# 8 top 2 8 1
# 9 ten 2 9 1
# 10 some_corpus 2 10 1
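Since the question's last step built a data.table of ngram/frequency pairs, the textstat_frequency() output can feed that directly; a small sketch (column names as in the output just above):
library(data.table)
freqs <- textstat_frequency(dfmCombined)
data.table(ngram = freqs$feature, frequency = freqs$frequency)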

calculate term document matrix while looking for words within strings also

This question is related to my earlier question, Treat words separated by space in the same manner.
Posting it as a separate one since it might help other users find it easily.
The question is regarding the way the term-document matrix is currently calculated by the tm package. I want to tweak this behaviour a little, as explained below.
Currently, a term-document matrix gets created by looking for a word, say 'milky', as a separate word (and not as a substring) in a document. For example, assume two documents:
document 1: "this is a milky way galaxy"
document 2: "this is a milkyway galaxy"
As the current algorithm (tm package) works, 'milky' would get found in the first document but not in the second, since the algorithm looks for the term milky as a separate word. But if the algorithm had looked for the term milky as a substring, the way the function grepl does, it would have found the term 'milky' in the second document as well:
grepl('milky', 'this is a milkyway galaxy')
# [1] TRUE
Can someone please help me create a term-document matrix meeting my requirement (being able to find the term milky in both documents)? Please note that I don't want a solution specific to the word milky; I want a general solution that I will apply on a larger scale to take care of all such cases. Even if the solution does not use the tm package, that is fine. Ultimately, each term should be looked up as a substring (not just as a word) inside all the strings of the document in question (grepl-like functionality while calculating the term-document matrix).
Current code which I use to get term document matrix is
doc1 <- "this is a document about milkyway"
doc2 <- "milky way is huge"
library(tm)
tmp.text <- data.frame(rbind(doc1, doc2))
tmp.corpus <- Corpus(DataframeSource(tmp.text))
tmpDTM <- TermDocumentMatrix(tmp.corpus,
                             control = list(tolower = TRUE, removeNumbers = TRUE,
                                            removePunctuation = TRUE, stopwords = TRUE,
                                            wordLengths = c(2, Inf)))
tmp.df <- as.data.frame(as.matrix(tmpDTM))
tmp.df
1 2
document 1 0
huge 0 1
milky 0 1
milkyway 1 0
way 0 1
I am not sure that tm makes it easy (or possible) to select or group features based on regular expressions, but the text package quanteda does, through a thesaurus argument that groups terms according to a dictionary when constructing its document-feature matrix.
(quanteda uses the generic term "feature", since here your category is "terms containing the phrase milky" rather than the original terms.)
The valuetype argument can be the "glob" format (the default), a regular expression ("regex"), or as-is fixed matching ("fixed"). Below I show the versions with glob and with regular expressions.
require(quanteda)
myDictGlob <- dictionary(list(containsMilky = c("milky*")))
myDictRegex <- dictionary(list(containsMilky = c("^milky")))
(plainDfm <- dfm(c(doc1, doc2)))
## Creating a dfm from a character vector ...
## ... lowercasing
## ... tokenizing
## ... indexing documents: 2 documents
## ... indexing features: 9 feature types
## ... created a 2 x 9 sparse dfm
## ... complete.
## Elapsed time: 0.008 seconds.
## Document-feature matrix of: 2 documents, 9 features.
## 2 x 9 sparse Matrix of class "dfmSparse"
## features
## docs this is a document about milkyway milky way huge
## text1 1 1 1 1 1 1 0 0 0
## text2 0 1 0 0 0 0 1 1 1
dfm(c(doc1, doc2), thesaurus = myDictGlob, valuetype = "glob", verbose = FALSE)
## Document-feature matrix of: 2 documents, 8 features.
## 2 x 8 sparse Matrix of class "dfmSparse"
## this is a document about way huge CONTAINSMILKY
## text1 1 1 1 1 1 0 0 1
## text2 0 1 0 0 0 1 1 1
dfm(c(doc1, doc2), thesaurus = myDictRegex, valuetype = "regex")
## Document-feature matrix of: 2 documents, 8 features.
## 2 x 8 sparse Matrix of class "dfmSparse"
## this is a document about way huge CONTAINSMILKY
## text1 1 1 1 1 1 0 0 1
## text2 0 1 0 0 0 1 1 1
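One caveat (my addition, not from the original answer): newer quanteda releases removed the thesaurus argument from dfm(). Assuming a v2+ installation, the same idea can be sketched with tokens_lookup(), where exclusive = FALSE keeps unmatched tokens just as a thesaurus does:
toks <- tokens(c(doc1, doc2))
myDict <- dictionary(list(containsMilky = "milky*"))
# exclusive = FALSE replaces matches with the dictionary key but keeps all other tokens
dfm(tokens_lookup(toks, myDict, valuetype = "glob", exclusive = FALSE))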
