Split up ngrams in (sparse) document-feature matrix

This is a follow-up question to this one. There, I asked whether it's possible to split up ngram features in a document-feature matrix (the dfm class from the quanteda package) in such a way that e.g. bigrams result in two separate unigrams.
For better understanding: I got the ngrams in the dfm from translating the features from German to English. Compounds ("Emissionsminderung") are quite common in German, but not in English ("emission reduction").
library(quanteda)
eg.txt <- c('increase in_the great plenary',
            'great plenary emission_reduction',
            'increase in_the emission_reduction emission_increase')
eg.corp <- corpus(eg.txt)
eg.dfm <- dfm(eg.corp)
There was a nice answer to this example, which works absolutely fine for relatively small matrices like the one above. However, as soon as the matrix gets bigger, I constantly run into the following memory error:
> # turn the dfm into a data frame
> DF <- as.data.frame(eg.dfm)
Error in asMethod(object) :
Cholmod-error 'problem too large' at file ../Core/cholmod_dense.c, line 105
Hence, is there a more memory-efficient way to solve this ngram problem, or to deal with large (sparse) matrices/data frames? Thank you in advance!

The problem here is that you are turning the sparse (dfm) matrix into a dense object when you call as.data.frame(). Since the typical document-feature matrix is 90% sparse or more, this means you are creating something far larger than you can handle in memory. The solution: use dfm handling functions to maintain the sparsity.
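For a sense of scale, here is a rough sketch comparing the two representations (reusing eg.dfm from above; on a toy object like this the sparse format's bookkeeping overhead can dominate, but on a realistically large dfm the dense copy grows with ndoc * nfeat, while the sparse one stores only the non-zero counts):
print(object.size(eg.dfm))            # sparse dfm: stores only the non-zero entries
print(object.size(as.matrix(eg.dfm))) # dense copy: one cell per document-feature pair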
Note that this is not only a better solution than the one proposed in the linked question, but it should also work efficiently for your much larger object.
Here's a function that does that. It allows you to set the concatenator character(s), and works with ngrams of variable sizes. Most importantly, it uses dfm methods to make sure the dfm remains sparse.
# function to split and duplicate counts in features containing
# the concatenator character
dfm_splitgrams <- function(x, concatenator = "_") {
    # separate the unigrams
    x_unigrams <- dfm_remove(x, concatenator, valuetype = "regex")
    # separate the ngrams
    x_ngrams <- dfm_select(x, concatenator, valuetype = "regex")
    # split into components
    split_ngrams <- stringi::stri_split_regex(featnames(x_ngrams), concatenator)
    # get a repeated index for the ngram feature names
    index_split_ngrams <- rep(featnames(x_ngrams), lengths(split_ngrams))
    # subset the ngram matrix using the (repeated) ngram feature names
    x_split_ngrams <- x_ngrams[, index_split_ngrams]
    # assign the ngram dfm the feature names of the split ngrams
    colnames(x_split_ngrams) <- unlist(split_ngrams, use.names = FALSE)
    # return the column concatenation of unigrams and split ngrams
    suppressWarnings(cbind(x_unigrams, x_split_ngrams))
}
So:
dfm_splitgrams(eg.dfm)
## Document-feature matrix of: 3 documents, 9 features (40.7% sparse).
## 3 x 9 sparse Matrix of class "dfmSparse"
##        features
## docs    increase great plenary in the emission reduction emission increase
##   text1        1     1       1  1   1        0         0        0        0
##   text2        0     1       1  0   0        1         1        0        0
##   text3        1     0       0  1   1        1         1        1        1
Here, splitting the ngrams results in new "unigrams" that can share a feature name. You can (re)combine them efficiently with dfm_compress():
dfm_compress(dfm_splitgrams(eg.dfm))
## Document-feature matrix of: 3 documents, 7 features (33.3% sparse).
## 3 x 7 sparse Matrix of class "dfmSparse"
##        features
## docs    increase great plenary in the emission reduction
##   text1        1     1       1  1   1        0         0
##   text2        0     1       1  0   0        1         1
##   text3        2     0       0  1   1        2         1

Related

Combining spacyr and quanteda to produce a lemmatized corpus or dfm

I understand how to build corpora and dfm with quanteda.
I also understand how to use spacy_parse to lemmatize a text or corpus object.
But I do not understand how to replace the original textual tokens with lemmas in my corpus.
I would expect something like:
corpus(my_txt) %>%
    dfm(lemmatize = spacy_parse)
To produce a matrix of lemmas, for example:
            be have go
first_text   2    6  6
second_text  4    4  2
third_text   6    4  3
Instead, the only solution I found is to reassemble the lemmatized texts from the "lemma" column of the spacy_parse output data frame, with some code like this:
txt_parsed %>%
    select(doc_id, lemma) %>%
    group_by(doc_id) %>%
    summarise(new_txt = str_c(lemma, collapse = " "))
Any suggestions for a better solution?
You can use quanteda::as.tokens() to convert a spacy_parsed object to tokens. Before this, you can swap the token column of the spacy_parsed object for the lemma column.
txt <- c("I like having to be going.", "Then I will be gone.", "I had him going.")
library("spacyr")
sp <- spacy_parse(txt, lemma = TRUE, entity = FALSE, pos = FALSE)
## Found 'spacy_condaenv'. spacyr will use this environment
## successfully initialized (spaCy Version: 2.3.2, language model: en_core_web_sm)
## (python options: type = "condaenv", value = "spacy_condaenv")
sp$token <- sp$lemma
library("quanteda")
## Package version: 3.0.0
## Unicode version: 10.0
## ICU version: 61.1
## Parallel computing: 12 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
as.tokens(sp) %>%
    dfm()
## Document-feature matrix of: 3 documents, 9 features (37.04% sparse) and 0 docvars.
##        features
## docs    -pron- like have to be go . then will
##   text1      1    1    1  1  1  1 1    0    0
##   text2      1    0    0  0  1  1 1    1    1
##   text3      2    0    1  0  0  1 1    0    0
Created on 2021-04-12 by the reprex package (v2.0.0)
Actually, I found an even easier solution: use the use_lemma = TRUE option in the as.tokens() function.
Example:
library(spacyr)
spacy_initialize(model = "fr_core_news_sm")
# 'macron' here is the poster's own French text object
sp1 <- spacy_parse(macron, lemma = TRUE, entity = FALSE, pos = FALSE)
dfm1 <- as.tokens(sp1, use_lemma = TRUE) %>% dfm()
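For a self-contained illustration, here is a sketch of the same use_lemma idea applied to the English example from the answer above (assuming spacyr and the en_core_web_sm model are installed):
library(spacyr)
library(quanteda)
sp <- spacy_parse("I like having to be going.", lemma = TRUE, entity = FALSE, pos = FALSE)
as.tokens(sp, use_lemma = TRUE) %>% dfm()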

Output text with both unigrams and bigrams in R

I'm trying to figure out how to identify unigrams and bigrams in a text in R, and then keep both in the final output based on a threshold. I've done this in Python with gensim's Phraser model, but haven't figured out how to do it in R.
For example:
strings <- data.frame(text = c('This is a great movie from yesterday',
                               'I went to the movies',
                               'Great movie time at the theater',
                               'I went to the theater yesterday'))
#Pseudocode below
bigs <- tokenize_uni_bi(strings, n = 1:2, threshold = 2)
print(bigs)
[['this', 'great_movie', 'yesterday'], ['went', 'movies'], ['great_movie', 'theater'], ['went', 'theater', 'yesterday']]
Thank you!
You could use the quanteda framework for this:
library(quanteda)
# tokenize, tolower, remove stopwords and create ngrams
my_toks <- tokens(strings$text)
my_toks <- tokens_tolower(my_toks)
my_toks <- tokens_remove(my_toks, stopwords("english"))
bigs <- tokens_ngrams(my_toks, n = 1:2)
# turn into document feature matrix and filter on minimum frequency of 2 and more
my_dfm <- dfm(bigs)
dfm_trim(my_dfm, min_termfreq = 2)
Document-feature matrix of: 4 documents, 6 features (50.0% sparse).
       features
docs    great movie yesterday great_movie went theater
  text1     1     1         1           1    0       0
  text2     0     0         0           0    1       0
  text3     1     1         0           1    0       1
  text4     0     0         1           0    1       1
# use convert function to turn this into a data.frame
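For example, a minimal sketch of that last step:
convert(dfm_trim(my_dfm, min_termfreq = 2), to = "data.frame")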
Alternatively, you could use the tidytext package, tm, tokenizers, etc. It all depends a bit on the output you are expecting.
An example using tidytext / dplyr looks like this:
library(tidytext)
library(dplyr)
strings %>%
    unnest_ngrams(bigs, text, n = 2, n_min = 1, ngram_delim = "_",
                  stopwords = stopwords::stopwords()) %>%
    count(bigs) %>%
    filter(n >= 2)
         bigs n
1       great 2
2 great_movie 2
3       movie 2
4     theater 2
5        went 2
6   yesterday 2
Both quanteda and tidytext have a lot of online help available. See the vignettes for both packages on CRAN.

How to add/subtract document-term matrices in quanteda?

Consider this simple example:
dfm1 <- tibble(text = c('hello world',
                        'hello quanteda')) %>%
    corpus() %>% tokens() %>% dfm()
> dfm1
Document-feature matrix of: 2 documents, 3 features (33.3% sparse).
2 x 3 sparse Matrix of class "dfm"
       features
docs    hello world quanteda
  text1     1     1        0
  text2     1     0        1
and
dfm2 <- tibble(text = c('hello world',
                        'good nigth quanteda')) %>%
    corpus() %>% tokens() %>% dfm()
> dfm2
Document-feature matrix of: 2 documents, 5 features (50.0% sparse).
2 x 5 sparse Matrix of class "dfm"
       features
docs    hello world good nigth quanteda
  text1     1     1    0     0        0
  text2     0     0    1     1        1
As you can see, we have the same text identifiers in the two dfms: text1 and text2.
I would like to "subtract" dfm2 from dfm1, so that each entry in dfm1 has its (possibly) matching entry in dfm2 (same text, same word) subtracted from it.
So, for instance, hello occurs once in text1 of dfm1, and once in text1 of dfm2. So the output should have 0 for that entry (that is: 1 - 1). Of course, entries that are not in both dfms should be kept unchanged.
How can I do that in quanteda?
You can match the feature set of a dfm to that of another dfm using dfm_match(). I've also tidied up your code since for this short example, some of your pipeline could be simplified.
library("quanteda")
## Package version: 1.4.3
## Parallel computing: 2 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
dfm1 <- dfm(c("hello world", "hello quanteda"))
dfm2 <- dfm(c("hello world", "good night quanteda"))
as.dfm(dfm1 - dfm_match(dfm2, features = featnames(dfm1)))
## Document-feature matrix of: 2 documents, 3 features (33.3% sparse).
## 2 x 3 sparse Matrix of class "dfm"
##        features
## docs    hello world quanteda
##   text1     0     0        0
##   text2     1     0        0
The as.dfm() comes from the fact that arithmetic operators such as - are defined for the parent sparse Matrix class and not specifically for a quanteda dfm, so the subtraction drops the dfm class and turns the result into a dgCMatrix. Coercing it back into a dfm using as.dfm() solves that, but it will drop the original attributes of the dfm objects, such as docvars.
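A quick sketch to check that class-dropping behaviour, reusing dfm1 and dfm2 from above:
m <- dfm1 - dfm_match(dfm2, features = featnames(dfm1))
class(m)          # "dgCMatrix" -- the dfm class has been dropped
class(as.dfm(m))  # coerced back to a dfm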

join quanteda dfm top ten 1grams with all dfm 2 thru 5grams

To conserve memory when dealing with a very large corpus sample, I'm looking to take just the top 10 1grams and combine those with all of the 2- thru 5-grams to form my single quanteda::dfmSparse object that will be used in natural language processing [nlp] predictions. Carrying around all the 1grams would be pointless, because only the top ten [or twenty] will ever get used with the simple back-off model I'm using.
I wasn't able to find a quanteda::dfm(corpusText, . . .) parameter that instructs it to only return the top ## features. So, based on comments from package author @KenB in other threads, I'm using the dfm_select/dfm_remove functions to extract the top ten 1grams, and based on the "quanteda dfm join" search result "concatenate dfm matrices in 'quanteda' package", I'm using what appears to be an rbind() method for dfmSparse objects to join those results.
So far everything looks right from what I can tell. Thought I'd bounce this game plan off the SO community to see if I'm overlooking a more efficient route to this result, or some flaw in the solution I've arrived at thus far.
corpusObject <- quanteda::corpus(paste("some corpus text of no consequence that in practice is going to be very large\n",
                                       "and so one might expect a very large number of ngrams but for nlp purposes only care about top ten\n",
                                       "adding some corpus text word repeats to ensure 1gram top ten selection approaches are working\n"))
corpusObject$documents
dfm1gramsSorted <- dfm_sort(dfm(corpusObject, tolower = T, stem = F, ngrams = 1))
dfm2to5grams <- quanteda::dfm(corpusObject, tolower = T, stem = F, ngrams = 2:5)
dfm1gramsSorted; dfm2to5grams
#featnames(dfm1gramsSorted); featnames(dfm2to5grams)
#colSums(dfm1gramsSorted); colSums(dfm2to5grams)
dfm1gramsSortedLen <- length(featnames(dfm1gramsSorted))
# option1 - select top 10 features from dfm1gramsSorted
dfmTopTen1grams <- dfm_select(dfm1gramsSorted, pattern = featnames(dfm1gramsSorted)[1:10])
dfmTopTen1grams; featnames(dfmTopTen1grams)
# option2 - drop all but top 10 features from dfm1gramsSorted
dfmTopTen1grams <- dfm_remove(dfm1gramsSorted, pattern = featnames(dfm1gramsSorted)[11:dfm1gramsSortedLen])
dfmTopTen1grams; featnames(dfmTopTen1grams)
dfmTopTen1gramsAndAll2to5grams <- rbind(dfmTopTen1grams, dfm2to5grams)
dfmTopTen1gramsAndAll2to5grams;
#featnames(dfmTopTen1gramsAndAll2to5grams); colSums(dfmTopTen1gramsAndAll2to5grams)
library(data.table)
data.table(ngram = featnames(dfmTopTen1gramsAndAll2to5grams)[1:50],
           frequency = colSums(dfmTopTen1gramsAndAll2to5grams)[1:50],
           keep.rownames = F, stringsAsFactors = F)
For extracting the top 10 unigrams, this strategy will work just fine:
sort the dfm by the (default) decreasing order of overall feature frequency, which you have already done, but then add a step to slice out the first 10 columns;
combine this with the 2- to 5-gram dfm using cbind() (not rbind()).
That should do it:
dfmCombined <- cbind(dfm1gramsSorted[, 1:10], dfm2to5grams)
head(dfmCombined, nfeat = 15)
# Document-feature matrix of: 1 document, 195 features (0% sparse).
# (showing first document and first 15 features)
#        features
# docs    some corpus text of to very large top ten no some_corpus corpus_text text_of of_no no_consequence
#   text1    2      2    2  2  2    2     2   2   2  1           2           2       1     1              1
Your example code includes some use of data.table, although data.table does not otherwise figure in the question. In v0.99 we have added a new function, textstat_frequency(), which produces a "long"/"tidy" format of frequencies in a data.frame that might be helpful:
head(textstat_frequency(dfmCombined), 10)
#         feature frequency rank docfreq
# 1          some         2    1       1
# 2        corpus         2    2       1
# 3          text         2    3       1
# 4            of         2    4       1
# 5            to         2    5       1
# 6          very         2    6       1
# 7         large         2    7       1
# 8           top         2    8       1
# 9           ten         2    9       1
# 10  some_corpus         2   10       1

calculate term-document matrix while also looking for words within strings

This question is related to my earlier question, Treat words separated by space in the same manner.
Posting it as a separate one since it might help other users find it easily.
The question is about the way the term-document matrix is currently calculated by the tm package. I want to tweak this a little bit, as explained below.
Currently, any term-document matrix gets created by looking for a word, say 'milky', as a separate word (and not as a substring) in a document. For example, let us assume 2 documents:
document 1: "this is a milky way galaxy"
document 2: "this is a milkyway galaxy"
As per the way the current algorithm works (the tm package), 'milky' would be found in the first document but not in the second, since the algorithm looks for the term milky as a separate word. But if the algorithm had looked for the term milky as a substring, the way the function grepl does, it would have found the term 'milky' in the second document as well:
grepl('milky', 'this is a milkyway galaxy')
TRUE
Can someone please help me create a term-document matrix meeting my requirement (which is to be able to find the term milky in both documents)? Please note that I don't want a solution specific to a word like milky; I want a general solution that I will apply on a larger scale to take care of all such cases. Even if the solution does not use the tm package, that is fine. In the end, I just have to get a term-document matrix in which each term is looked for as a substring (not just as a word) inside all the strings of the document in question (grepl-like functionality while calculating the term-document matrix).
The current code which I use to get the term-document matrix is:
doc1 <- "this is a document about milkyway"
doc2 <- "milky way is huge"
library(tm)
tmp.text <- data.frame(rbind(doc1, doc2))
tmp.corpus <- Corpus(DataframeSource(tmp.text))
tmpDTM <- TermDocumentMatrix(tmp.corpus,
                             control = list(tolower = T, removeNumbers = T,
                                            removePunctuation = TRUE, stopwords = TRUE,
                                            wordLengths = c(2, Inf)))
tmp.df <- as.data.frame(as.matrix(tmpDTM))
tmp.df
         1 2
document 1 0
huge     0 1
milky    0 1
milkyway 1 0
way      0 1
I am not sure that tm makes it easy (or possible) to select or group features based on regular expressions. But the text package quanteda does, through a thesaurus argument that groups terms according to a dictionary when constructing its document-feature matrix.
(quanteda uses the generic term "feature" since here, your category is terms containing the phrase milky rather than the original "terms".)
The valuetype argument can be the "glob" format (default), a regular expression ("regex"), or as-is fixed ("fixed"). Below I show the versions with glob and regular expressions.
require(quanteda)
myDictGlob <- dictionary(list(containsMilky = c("milky*")))
myDictRegex <- dictionary(list(containsMilky = c("^milky")))
(plainDfm <- dfm(c(doc1, doc2)))
## Creating a dfm from a character vector ...
## ... lowercasing
## ... tokenizing
## ... indexing documents: 2 documents
## ... indexing features: 9 feature types
## ... created a 2 x 9 sparse dfm
## ... complete.
## Elapsed time: 0.008 seconds.
## Document-feature matrix of: 2 documents, 9 features.
## 2 x 9 sparse Matrix of class "dfmSparse"
##        features
## docs    this is a document about milkyway milky way huge
##   text1    1  1 1        1     1        1     0   0    0
##   text2    0  1 0        0     0        0     1   1    1
dfm(c(doc1, doc2), thesaurus = myDictGlob, valuetype = "glob", verbose = FALSE)
## Document-feature matrix of: 2 documents, 8 features.
## 2 x 8 sparse Matrix of class "dfmSparse"
##         this is a document about way huge CONTAINSMILKY
##   text1    1  1 1        1     1   0    0             1
##   text2    0  1 0        0     0   1    1             1
dfm(c(doc1, doc2), thesaurus = myDictRegex, valuetype = "regex")
## Document-feature matrix of: 2 documents, 8 features.
## 2 x 8 sparse Matrix of class "dfmSparse"
##         this is a document about way huge CONTAINSMILKY
##   text1    1  1 1        1     1   0    0             1
##   text2    0  1 0        0     0   1    1             1
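Note that in current quanteda releases the thesaurus argument is no longer part of dfm(); a rough sketch of the equivalent in newer versions (an assumption on my part, using dfm_lookup() with exclusive = FALSE so that non-matching features are retained):
plainDfm <- dfm(tokens(c(doc1, doc2)))
dfm_lookup(plainDfm, dictionary = myDictRegex, valuetype = "regex", exclusive = FALSE)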
