join quanteda dfm top ten 1grams with all dfm 2 thru 5grams - r

To conserve memory when dealing with a very large corpus sample, I'm looking to take just the top 10 1-grams and combine those with all of the 2- through 5-grams to form a single quanteda::dfmSparse object that will be used for natural language processing (NLP) predictions. Carrying around all the 1-grams would be pointless, because only the top ten (or twenty) will ever get used with the simple back-off model I'm using.
I wasn't able to find a quanteda::dfm(corpusText, ...) parameter that instructs it to return only the top N features. Based on comments from package author @KenB in other threads, I'm using the dfm_select/dfm_remove functions to extract the top ten 1-grams, and based on the "quanteda dfm join" search result "concatenate dfm matrices in 'quanteda' package", I'm using the rbind.dfmSparse method to join those results.
So far everything looks right as far as I can tell. I thought I'd bounce this game plan off the SO community to see if I'm overlooking a more efficient route to arrive at this result, or some flaw in the solution I've arrived at thus far.
corpusObject <- quanteda::corpus(paste("some corpus text of no consequence that in practice is going to be very large\n",
                                       "and so one might expect a very large number of ngrams but for nlp purposes only care about top ten\n",
                                       "adding some corpus text word repeats to ensure 1gram top ten selection approaches are working\n"))
corpusObject$documents
dfm1gramsSorted <- dfm_sort(dfm(corpusObject, tolower = T, stem = F, ngrams = 1))
dfm2to5grams <- quanteda::dfm(corpusObject, tolower = T, stem = F, ngrams = 2:5)
dfm1gramsSorted; dfm2to5grams
#featnames(dfm1gramsSorted); featnames(dfm2to5grams)
#colSums(dfm1gramsSorted); colSums(dfm2to5grams)
dfm1gramsSortedLen <- length(featnames(dfm1gramsSorted))
# option1 - select top 10 features from dfm1gramsSorted
dfmTopTen1grams <- dfm_select(dfm1gramsSorted, pattern = featnames(dfm1gramsSorted)[1:10])
dfmTopTen1grams; featnames(dfmTopTen1grams)
# option2 - drop all but top 10 features from dfm1gramsSorted
dfmTopTen1grams <- dfm_remove(dfm1gramsSorted, pattern = featnames(dfm1gramsSorted)[11:dfm1gramsSortedLen])
dfmTopTen1grams; featnames(dfmTopTen1grams)
dfmTopTen1gramsAndAll2to5grams <- rbind(dfmTopTen1grams, dfm2to5grams)
dfmTopTen1gramsAndAll2to5grams;
#featnames(dfmTopTen1gramsAndAll2to5grams); colSums(dfmTopTen1gramsAndAll2to5grams)
data.table(ngram = featnames(dfmTopTen1gramsAndAll2to5grams)[1:50],
           frequency = colSums(dfmTopTen1gramsAndAll2to5grams)[1:50],
           keep.rownames = F, stringsAsFactors = F)

For extracting the top 10 unigrams, this strategy will work just fine:
Sort the dfm by the (default) decreasing order of overall feature frequency, which you have already done, then add a step to slice out the first 10 columns.
Combine this with the 2- to 5-gram dfm using cbind() (not rbind()).
That should do it:
dfmCombined <- cbind(dfm1gramsSorted[, 1:10], dfm2to5grams)
head(dfmCombined, nfeat = 15)
# Document-feature matrix of: 1 document, 195 features (0% sparse).
# (showing first document and first 15 features)
# features
# docs some corpus text of to very large top ten no some_corpus corpus_text text_of of_no no_consequence
# text1 2 2 2 2 2 2 2 2 2 1 2 2 1 1 1
Your example code includes some use of data.table, although this is not otherwise mentioned in the question. In quanteda v0.99 we added a new function, textstat_frequency(), which produces a "long"/"tidy" format of frequencies as a data.frame; it might be helpful here:
head(textstat_frequency(dfmCombined), 10)
# feature frequency rank docfreq
# 1 some 2 1 1
# 2 corpus 2 2 1
# 3 text 2 3 1
# 4 of 2 4 1
# 5 to 2 5 1
# 6 very 2 6 1
# 7 large 2 7 1
# 8 top 2 8 1
# 9 ten 2 9 1
# 10 some_corpus 2 10 1
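If you only need the top features, textstat_frequency() also takes an n argument (assuming your quanteda version includes it), and since the result is a data.frame it converts straight into the data.table your example builds:

library(data.table)
# top-10 features as a data.table; n caps the result
freqDT <- as.data.table(textstat_frequency(dfmCombined, n = 10))
freqDT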

Related

How to work with scores and regex in a keywords dictionary to get a rudimentary sentiment analysis, with R?

I would like to optimize the size of a sentiment dictionary by using regular expressions, but I don't know how to match the keywords against the text to be analysed without losing the rating of each keyword.
I work with R, and I'd like to stick with a "matching words" solution.
This is what I tried:
library(stringr)
library(tidytext) # tidy text analysis + unnest_tokens
library(tidyverse) # visualization + tibble
# text to be scored
Corpus<- c("Radicals in their time, early Impressionists violated the rules of academic painting.",
"They also painted realistic scenes of modern life, and often painted outdoors.",
"The public, at first hostile, gradually came to believe that the Impressionists had captured a fresh and original vision.",
"Even if the art critics and art establishment disapproved of the new style.")
# dictionary : words and quotes lists
WordsList <- c("^academ.+$","^disapprov.*$","^friend.*$","^fresh.*$","^hostil.+$","^modern.*$","^new.*$","^original.*$","^outstand.*$","^radical.*$","^uncorrect.+$","^violat.+$")
QuotesList <- c(1,-2,2,2,-2,2,1,2,3,-3,-1,-3)
Lexicon <- data.frame(words=WordsList, quotes=QuotesList)
Lexicon
# words quotes
# 1 ^academ.+$ 1
# 2 ^disapprov.*$ -2
# 3 ^friend.*$ 2
# 4 ^fresh.*$ 2
# 5 ^hostil.+$ -2
# 6 ^modern.*$ 2
# 7 ^new.*$ 1
# 8 ^original.*$ 2
# 9 ^outstand.*$ 3
# 10 ^radical.*$ -3
# 11 ^uncorrect.+$ -1
# 12 ^violat.+$ -3
messag <- tibble(docidx = 1:length(Corpus), text = Corpus)
# split into words : 1 row per word per "document"
txt.by.word <- messag %>%
unnest_tokens(mots, text)
# size order instead of alphabetic order
matching<- paste(Lexicon[order(-nchar(Lexicon$words)),]$words, collapse = '|')
matching
# [1] "^disapprov.*$|^original.*$|^radical.*$|^academ.+$|^hostil.+$|^modern.*$|^violat.+$|^fresh.*$|^new.*$"
# search matchings
test <- str_extract_all(txt.by.word$mots, matching, simplify = T) # case-sensitive
# result
test
tst <- as.data.frame(test)
# except empty
tst[!tst$V1 %in% "",]
# [1] "radicals" "violated" "academic" "modern" "hostile" "fresh" "original" "disapproved"
# [9] "new"
# from here I don't know how to get this expected result: by docidx, matching the words and their associated ratings.
# how to extract both the keyword and the sentiment rating ?
# Expected result
# docidx text quote
# 1 radicals -3
# 1 violated -3
# 1 academic 1
# 2 modern 2
# 3 hostile -2
# 3 fresh 2
# 3 original 2
# 4 disapproved -2
# 4 new 1
Thanks to Maël, who answered another post of mine; see "an equivalent of the 'match' function that works with regex".
I have found an acceptable solution, very close to my target. Here is the heart of the code, to be used instead of str_extract_all.
dt.unl <- as.data.table(unlist(sapply(Lexicon$words, grep, Corpus, value = TRUE)), keep.rownames=T)
dt.unl
dt.unl[ , keywords := lapply(.SD, function(x){gsub("[0-9]$", "", x)}), .SDcols=1, by="V1"]
dt.unl
dt.scor <- merge(dt.unl[,.(V2,keywords)], Lexicon, by.x="keywords", by.y="words")
dt.scor
# keywords V2 quotes
# 1: \\bacadem.+\\b Radicals in their time, early Impressionists violated the rules of academic painting. 1
# 2: \\bdisapprov.*\\b Even if the art critics and art establishment disapproved of the new style. -2
# 3: \\bfresh.*\\b The public, at first hostile, gradually came to believe that the Impressionists had captured a fresh and original vision. 2
# 4: \\bhostil.+\\b The public, at first hostile, gradually came to believe that the Impressionists had captured a fresh and original vision. -2
# 5: \\bmodern.*\\b They also painted realistic scenes of modern life, and often painted outdoors. 2
# 6: \\bnew.*\\b Even if the art critics and art establishment disapproved of the new style. 1
# 7: \\boriginal.*\\b The public, at first hostile, gradually came to believe that the Impressionists had captured a fresh and original vision. 2
# 8: \\bviolat.+\\b Radicals in their time, early Impressionists violated the rules of academic painting. -3
#
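For the exact per-document layout in the expected result (docidx, matched word, rating), a purrr/stringr sketch along these lines might also work; it is untested beyond the toy data and assumes the tokenized txt.by.word from above (unnest_tokens lowercases tokens, so the case-sensitive anchored patterns still match):

library(purrr)
# for each lexicon pattern, collect the matching tokens (keeping docidx)
# and attach that pattern's rating
scored <- map2_dfr(as.character(Lexicon$words), Lexicon$quotes,
                   function(pat, q) {
                     hits <- txt.by.word[str_detect(txt.by.word$mots, pat), ]
                     if (nrow(hits) > 0) cbind(hits, quote = q) else NULL
                   })
scored[order(scored$docidx), ]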

Newbie working on Horse Racing Database using R

I'm new to the group and to the R language.
I've written some code (below) that achieves the desired result.
However, I'm aware that I'm repeating lines of the same code, which would surely be more efficient as a for loop.
Also, there will be races with large numbers of horses, so I really need a for loop that runs through each horse,
i.e. num_runners = NROW(my_new_data).
my_new_data holds data on horses previous races.
DaH is a numeric rating attached to each of a horse's previous runs, with DaH1 being the most recent and DaH6 six races back.
Code, a character, signifies the type of race the horse competed in, e.g. Flat, Fences.
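For illustration, here is a hypothetical sketch of that layout (not my real data, just the shape):

# hypothetical sketch of my_new_data's layout, two horses deep
my_new_data <- data.frame(DaH1 = c(85, 92), DaH2 = c(80, 88), DaH3 = c(78, 90),
                          DaH4 = c(70, 85), DaH5 = c(72, 83), DaH6 = c(68, 80),
                          Code1 = c("Flat", "Fences"), Code2 = c("Flat", "Fences"),
                          Code3 = c("Flat", "Flat"), Code4 = c("Fences", "Flat"),
                          Code5 = c("Flat", "Fences"), Code6 = c("Flat", "Flat"))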
I have played with for loops, e.g. for(i in 1:6), without success.
Since I am assigning to a new horse each time I would hope something such as the following would work:
horse(i) = c(my_new_data$DaH1[i],my_new_data$DaH2[i],my_new_data$DaH3[i],my_new_data$DaH4[i],my_new_data$DaH5[i],my_new_data$DaH6[i])
But I know that horse(i) is not allowed.
Would my best strategy be to pre-define a dataframe of size: 6 rows and 6 columns
and use 2 for loops to populate [row][column]? Something like:
final_data[i,j]
Here is the code I am presently using which creates the dataframe racetest:
horse1 = c(my_new_data$DaH1[1],my_new_data$DaH2[1],my_new_data$DaH3[1],my_new_data$DaH4[1],my_new_data$DaH5[1],my_new_data$DaH6[1])
horse2 = c(my_new_data$DaH1[2],my_new_data$DaH2[2],my_new_data$DaH3[2],my_new_data$DaH4[2],my_new_data$DaH5[2],my_new_data$DaH6[2])
horse3 = c(my_new_data$DaH1[3],my_new_data$DaH2[3],my_new_data$DaH3[3],my_new_data$DaH4[3],my_new_data$DaH5[3],my_new_data$DaH6[3])
horse4 = c(my_new_data$DaH1[4],my_new_data$DaH2[4],my_new_data$DaH3[4],my_new_data$DaH4[4],my_new_data$DaH5[4],my_new_data$DaH6[4])
horse5 = c(my_new_data$DaH1[5],my_new_data$DaH2[5],my_new_data$DaH3[5],my_new_data$DaH4[5],my_new_data$DaH5[5],my_new_data$DaH6[5])
horse6 = c(my_new_data$DaH1[6],my_new_data$DaH2[6],my_new_data$DaH3[6],my_new_data$DaH4[6],my_new_data$DaH5[6],my_new_data$DaH6[6])
horse1.code = c(my_new_data$Code1[1],my_new_data$Code2[1],my_new_data$Code3[1],my_new_data$Code4[1],my_new_data$Code5[1],my_new_data$Code6[1])
horse2.code = c(my_new_data$Code1[2],my_new_data$Code2[2],my_new_data$Code3[2],my_new_data$Code4[2],my_new_data$Code5[2],my_new_data$Code6[2])
horse3.code = c(my_new_data$Code1[3],my_new_data$Code2[3],my_new_data$Code3[3],my_new_data$Code4[3],my_new_data$Code5[3],my_new_data$Code6[3])
horse4.code = c(my_new_data$Code1[4],my_new_data$Code2[4],my_new_data$Code3[4],my_new_data$Code4[4],my_new_data$Code5[4],my_new_data$Code6[4])
horse5.code = c(my_new_data$Code1[5],my_new_data$Code2[5],my_new_data$Code3[5],my_new_data$Code4[5],my_new_data$Code5[5],my_new_data$Code6[5])
horse6.code = c(my_new_data$Code1[6],my_new_data$Code2[6],my_new_data$Code3[6],my_new_data$Code4[6],my_new_data$Code5[6],my_new_data$Code6[6])
racetest = data.frame(horse1,horse1.code,horse2,horse2.code, horse3, horse3.code,
horse4,horse4.code,horse5,horse5.code, horse6, horse6.code)
Thanks in advance for any help that can be offered!
Graham
Using loops in R is usually not the best approach; still, I'll give you something that might work.
There are two possible approaches I see here; I will address the simpler one:
if the columns are ordered such that columns 1:6 are named DaH1 to DaH6 and columns 7:12 are the corresponding Code1 to Code6 columns, then:
library(magrittr)
temp <- cbind(my_new_data[, 1:6] %>% t,
              my_new_data[, 7:12] %>% t)
Odd <- seq(1, 12, 2)
my_new_data[, Odd] <- temp[, 1:6]
my_new_data[, -Odd] <- temp[, 7:12]
#cleanup
rm(temp,Odd)
my_new_data should now contain your desired output. Before you run this, make sure your data is backed up inside another object as this is untested code.
Actually, we want to reshape the wide format of the data into a different wide format. But first, let's look at your desired for-loop approach to understand what's going on.
Using a loop
For the loop, we need two index variables, i and j.
## initialize matrix with dimnames
racetest <- matrix(NA, 3, 6,
                   dimnames = list(c("DaH1", "DaH2", "DaH3"),
                                   c("horse1", "horse1.code", "horse2", "horse2.code",
                                     "horse3", "horse3.code")))
## loop
for (i in 0:2) {
  for (j in 1:3) {
    # columns j and j + 3 hold DaH_j and Code_j for horse i + 1
    racetest[j, 1:2 + 2*i] <- unlist(my_new_data[i + 1, c(j, j + 3)])
  }
}
# horse1 horse1.code horse2 horse2.code horse3 horse3.code
# DaH1 1 1 2 2 3 3
# DaH2 1 1 2 2 3 3
# DaH3 1 1 2 2 3 3
For loops are often discouraged in R because they can be slow and don't use the vectorized features of the language. Moreover, they can be tricky to program.
Transposing column sets
We could also take a different approach: transpose the DaH* and Code* column sets (identifiable using grep) and bring the result into the appropriate order by sorting on the last character of the column names (substring with nchar).
rownames(my_new_data) <- paste0("horse.", seq(nrow(my_new_data)))
rr <- data.frame(DaH=t(my_new_data[, grep("DaH", names(my_new_data))]),
Code=t(my_new_data[, grep("Code", names(my_new_data))]))
rr <- rr[order(substring(names(rr), nchar(names(rr))))]
rr
# DaH.horse.1 Code.horse.1 DaH.horse.2 Code.horse.2 DaH.horse.3 Code.horse.3
# DaH1 1 1 2 2 3 3
# DaH2 1 1 2 2 3 3
# DaH3 1 1 2 2 3 3
Reshaping data
Last but not least, we actually want to reshape the data. For this we give the data set an ID variable.
my_new_data <- transform(my_new_data, horse=1:nrow(my_new_data))
First, we reshape the data into "long" format, using the new ID variable horse and putting the two varying column sets into a list.
rr1 <- reshape(my_new_data, idvar="horse", varying=list(1:3, 4:6), direction="long", sep="",
v.names=c("DaH", "Code"))
rr1
# horse time DaH Code
# 1.1 1 1 1 1
# 2.1 2 1 2 2
# 3.1 3 1 3 3
# 1.2 1 2 1 1
# 2.2 2 2 2 2
# 3.2 3 2 3 3
# 1.3 1 3 1 1
# 2.3 2 3 2 2
# 3.3 3 3 3 3
Then, in order to get the desired wide format, what we want is to swap idvar and timevar, where our new idvar is "time" and our new timevar is "horse".
reshape(rr1, timevar="horse", idvar="time", direction= "wide")
# time DaH.1 Code.1 DaH.2 Code.2 DaH.3 Code.3
# 1.1 1 1 1 2 2 3 3
# 1.2 2 1 1 2 2 3 3
# 1.3 3 1 1 2 2 3 3
Benchmark
The benchmark reveals that, of these three approaches, transposing the matrices is fastest, while the for loop is by far the slowest.
# Unit: microseconds
# expr min lq mean median uq max neval cld
# forloop 7191.038 7373.5890 8381.8036 7576.678 7980.4320 46677.324 100 c
# transpose 620.748 656.0845 707.7248 692.953 733.1365 944.773 100 a
# reshape 2791.710 2858.6830 3013.8372 2958.825 3118.4125 3871.960 100 b
Toy data:
my_new_data <- data.frame(DaH1=1:3, DaH2=1:3, DaH3=1:3, Code1=1:3, Code2=1:3, Code3=1:3)
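For reference, a sketch of how such timings might be produced with the microbenchmark package, using the toy data above and wrapping the three approaches from this answer (local copies avoid mutating my_new_data between runs):

library(microbenchmark)
microbenchmark(
  forloop = {
    racetest <- matrix(NA, 3, 6)
    for (i in 0:2) {
      for (j in 1:3) {
        racetest[j, 1:2 + 2*i] <- unlist(my_new_data[i + 1, c(j, j + 3)])
      }
    }
  },
  transpose = {
    dat <- my_new_data
    rownames(dat) <- paste0("horse.", seq(nrow(dat)))
    rr <- data.frame(DaH = t(dat[, grep("DaH", names(dat))]),
                     Code = t(dat[, grep("Code", names(dat))]))
    rr <- rr[order(substring(names(rr), nchar(names(rr))))]
  },
  reshape = {
    tmp <- transform(my_new_data, horse = 1:nrow(my_new_data))
    long <- reshape(tmp, idvar = "horse", varying = list(1:3, 4:6),
                    direction = "long", sep = "", v.names = c("DaH", "Code"))
    reshape(long, timevar = "horse", idvar = "time", direction = "wide")
  }
)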

is there a way to filter words by length in a bag of words matrix in r?

I have created a matrix in R (called bag_of_words). I need to compute the top 100 most popular words (most occurrences), but filter tokens by length (min. size = 4 and max. size = 20 characters) and report the total occurrences of each word.
I have created code to find the top 100 words without this filter, which works, but I cannot find a way of filtering the words in the matrix by length. Any help would be appreciated.
My attempt:
#view the top 100 most common words
term_f <- colSums(bag_of_words)
term_f <- sort(term_f, decreasing = T)
term_f[1:100]
Maybe I did not understand your question, but I think a vector might be easier to handle, especially if it is a column of a data.table:
library(data.table)
list_words <- data.table(x = as.character(bag_of_words)) # as.character, so nchar() below works on the words
If you only want words between 4 and 20 characters, use nchar:
list_words <- list_words[nchar(x) %between% c(4,20)]
Count the number of occurrences of each word:
list_words <- list_words[,.(n = .N), by = "x"]
Get the top 100:
list_words <- list_words[order(-n)][1:100]
I am not sure what NLP infrastructure you are using, but my recommendation is to use quanteda. If you don't have the package, just install it from CRAN with install.packages("quanteda").
Please find below a way to easily solve your issue ahead of computing token frequencies.
library(quanteda)
text = c("some short tokens, but maybe just fine.",
"thesearesomeverylongtokens.",
"v e r y s hort tokens" )
mycorp = corpus( text )
mytok = tokens( mycorp )
my_selected_tok = tokens_keep( mytok, min_nchar = 4, max_nchar = 20 )
mydfm = dfm(my_selected_tok)
frequencies = textstat_frequency( mydfm )
> frequencies
feature frequency rank docfreq group
1 tokens 2 1 2 all
2 some 1 2 1 all
3 short 1 2 1 all
4 maybe 1 2 1 all
5 just 1 2 1 all
6 fine 1 2 1 all
7 hort 1 2 1 all
> class(frequencies)
[1] "frequency" "textstat" "data.frame"

Split up ngrams in (sparse) document-feature matrix

This is a follow-up question to this one. There, I asked if it's possible to split up ngram features in a document-feature matrix (dfm class from the quanteda package) in such a way that e.g. bigrams result in two separate unigrams.
For better understanding: I got the ngrams in the dfm from translating the features from German to English. Compounds ("Emissionsminderung") are quite common in German but not in English ("emission reduction").
library(quanteda)
eg.txt <- c('increase in_the great plenary',
            'great plenary emission_reduction',
            'increase in_the emission_reduction emission_increase')
eg.corp <- corpus(eg.txt)
eg.dfm <- dfm(eg.corp)
There was a nice answer to this example, which works absolutely fine for relatively small matrices like the one above. However, as soon as the matrix gets bigger, I constantly run into the following memory error.
> #turn the dfm into a matrix
> DF <- as.data.frame(eg.dfm)
Error in asMethod(object) :
Cholmod-error 'problem too large' at file ../Core/cholmod_dense.c, line 105
Hence, is there a more memory-efficient way to solve this ngram problem, or to deal with large (sparse) matrices/data frames? Thank you in advance!
The problem here is that you are turning the sparse (dfm) matrix into a dense object when you call as.data.frame(). Since the typical document-feature matrix is 90% sparse, this means you are creating something larger than you can handle. The solution: use dfm handling functions to maintain the sparsity.
Note that this is not only a better solution than the one proposed in the linked question, but it should also work efficiently for your much larger object.
Here's a function that does that. It allows you to set the concatenator character(s), and works with ngrams of variable sizes. Most importantly, it uses dfm methods to make sure the dfm remains sparse.
# function to split and duplicate counts in features containing
# the concatenator character
dfm_splitgrams <- function(x, concatenator = "_") {
# separate the unigrams
x_unigrams <- dfm_remove(x, concatenator, valuetype = "regex")
# separate the ngrams
x_ngrams <- dfm_select(x, concatenator, valuetype = "regex")
# split into components
split_ngrams <- stringi::stri_split_regex(featnames(x_ngrams), concatenator)
# get a repeated index for the ngram feature names
index_split_ngrams <- rep(featnames(x_ngrams), lengths(split_ngrams))
# subset the ngram matrix using the (repeated) ngram feature names
x_split_ngrams <- x_ngrams[, index_split_ngrams]
# assign the ngram dfm the feature names of the split ngrams
colnames(x_split_ngrams) <- unlist(split_ngrams, use.names = FALSE)
# return the column concatenation of unigrams and split ngrams
suppressWarnings(cbind(x_unigrams, x_split_ngrams))
}
So:
dfm_splitgrams(eg.dfm)
## Document-feature matrix of: 3 documents, 9 features (40.7% sparse).
## 3 x 9 sparse Matrix of class "dfmSparse"
## features
## docs increase great plenary in the emission reduction emission increase
## text1 1 1 1 1 1 0 0 0 0
## text2 0 1 1 0 0 1 1 0 0
## text3 1 0 0 1 1 1 1 1 1
Here, splitting ngrams results in new "unigrams" of the same feature name. You can (re)combine them efficiently with dfm_compress():
dfm_compress(dfm_splitgrams(eg.dfm))
## Document-feature matrix of: 3 documents, 7 features (33.3% sparse).
## 3 x 7 sparse Matrix of class "dfmSparse"
## features
## docs increase great plenary in the emission reduction
## text1 1 1 1 1 1 0 0
## text2 0 1 1 0 0 1 1
## text3 2 0 0 1 1 2 1

Getting The Top Terms for each Topic in LDA in R

I am implementing LDA for some simple data sets and am able to do the topic modelling, but there is an issue: when I try to organise the top 6 terms according to their topics, I get some numerical values (maybe their indexes) instead.
# docs is the dataset formatted and cleaned properly
dtm <- TermDocumentMatrix(docs, control = list(removePunctuation = TRUE, stopwords = TRUE))
ldaOut <- LDA(dtm, k, method = "Gibbs",
              control = list(nstart = nstart, seed = seed, best = best,
                             burnin = burnin, iter = iter, thin = thin))
# 6 top terms in each topic
ldaOut.terms <- as.matrix(terms(ldaOut, 6))
write.csv(ldaOut.terms, file = paste("LDAGibbs", k, "TopicsToTerms.csv"))
The TopicsToTerms file is generated like this:
Topic 1 Topic 2 Topic 3
1 1 5 3
2 2 1 4
3 3 2 1
4 4 3 2
5 5 4 5
While I want the terms (top words for each topic) in the table, like the following:
Topic 1 Topic 2 Topic 3
1 Hat Cat Food
You just need one line of code to fix your problem:
> text = read.csv("~/Desktop/your_data.csv") #your initial dataset
> docs = Corpus(VectorSource(text)) #converting to corpus
> docs = tm_map(docs, content_transformer(tolower)) #cleaning
> ... #cleaning
> dtm = DocumentTermMatrix(docs) #creating a document term matrix
> rownames(dtm) = text
After adding that last line, you can proceed with the remaining code, and you'll get the terms, not their indexes. Hope that helps.
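As an alternative sketch, not part of the answer above: if ldaOut comes from topicmodels::LDA, the tidytext package can pull the top terms per topic from the topic-term (beta) matrix, assuming tidytext and dplyr (>= 1.0) are installed:

library(tidytext)
library(dplyr)
# one row per topic/term pair with its probability (beta),
# then keep the 6 most probable terms within each topic
top_terms <- tidy(ldaOut, matrix = "beta") %>%
  group_by(topic) %>%
  slice_max(beta, n = 6) %>%
  ungroup() %>%
  arrange(topic, -beta)
top_terms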
