Compute ngrams for each row of text data in R

I have a data column of the following format:
Text
Hello world
Hello
How are you today
I love stackoverflow
blah blah blahdy
I would like to compute the 3-grams for each row in this dataset, perhaps using the tau package's textcnt() function. However, when I tried it, it returned one numeric vector with the ngrams for the entire column. How can I apply this function to each observation in my data separately?

Is this what you're after?
library("RWeka")
library("tm")
TrigramTokenizer <- function(x) NGramTokenizer(x,
Weka_control(min = 3, max = 3))
# Using Tyler's method of making the 'Text' object here
tdm <- TermDocumentMatrix(Corpus(VectorSource(Text)),
control = list(tokenize = TrigramTokenizer))
inspect(tdm)
A term-document matrix (4 terms, 5 documents)

Non-/sparse entries: 4/16
Sparsity           : 80%
Maximal term length: 20
Weighting          : term frequency (tf)

                      Docs
Terms                  1 2 3 4 5
  are you today        0 0 1 0 0
  blah blah blahdy     0 0 0 0 1
  how are you          0 0 1 0 0
  i love stackoverflow 0 0 0 1 0
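If you then want the trigrams per document rather than the whole matrix, you can pull out the non-zero terms for each column; a small sketch using the tdm created above:
m <- as.matrix(tdm)
# trigrams present in each document (column)
apply(m, 2, function(counts) rownames(m)[counts > 0])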

Here's an n-gram approach using the qdap package:
## Text <- readLines(n=5)
## Hello world
## Hello
## How are you today
## I love stackoverflow
## blah blah blahdy
library(qdap)
ngrams(Text, seq_along(Text), 3)
It's a list and you can access the components with typical list indexing.
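For example (a small sketch; the exact nesting of qdap's output can vary by version):
out <- ngrams(Text, seq_along(Text), 3)
str(out, max.level = 1)  # inspect the top-level components
out[[1]]                 # pull out the first component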
Edit:
As for your first approach, try it like this:
library(tau)
sapply(Text, textcnt, method = "ngram")
## sapply(eta_dedup$title, textcnt, method = "ngram")

Here's how to do it using the quanteda package:
txt <- c("Hello world", "Hello", "How are you today", "I love stackoverflow", "blah blah blahdy")
require(quanteda)
dfm(txt, ngrams = 3, concatenator = " ", verbose = FALSE)
## Document-feature matrix of: 5 documents, 4 features.
## 5 x 4 sparse Matrix of class "dfmSparse"
##        features
## docs    how are you are you today i love stackoverflow blah blah blahdy
##   text1           0             0                    0                0
##   text2           0             0                    0                0
##   text3           1             1                    0                0
##   text4           0             0                    1                0
##   text5           0             0                    0                1
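Note that this uses an older quanteda interface. In recent versions (v3 and later) n-grams are formed on a tokens object before building the dfm; a rough equivalent (an assumption about the newer API, not part of the original answer) would be:
library(quanteda)
toks <- tokens(txt)
dfm(tokens_ngrams(toks, n = 3, concatenator = " "))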

The OP asked about tau, but the other answers don't use that package. Here's how to do it in tau:
data = "Hello world\nHello\nHow are you today\nI love stackoverflow\n
blah blah blahdy"
bigram_tau <- textcnt(data, n = 2L, method = "string", recursive = TRUE)
The result comes back as a trie, but you can format it as a more classic data frame with the tokens and their counts:
r <- data.frame(counts = unclass(bigram_tau), size = nchar(names(bigram_tau)))
format(r)
I highly recommend tau because it performs really well with large data; I have used it to create bigrams from about 1 GB of text, and it was both fast and smooth.
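If you want the counts per row (as the original question asks) rather than for the whole string, the same call can be applied element-wise; a hedged sketch, with rows holding the example lines:
rows <- c("Hello world", "Hello", "How are you today",
          "I love stackoverflow", "blah blah blahdy")
bigrams_by_row <- lapply(rows, textcnt, n = 2L, method = "string", recursive = TRUE)
bigrams_by_row[[3]]  # bigram counts for the third row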

Related

Combining spacyr and quanteda to produce a lemmatized corpus or dfm

I understand how to build corpora and dfm with quanteda.
I also understand how to use spacy_parse to lemmatize a text or corpus object.
But I do not understand how to replace the original textual tokens with lemmas in my corpus.
I would expect something like:
corpus(my_txt) %>%
  dfm(lemmatize = spacy_parse)
To produce a matrix of lemmas, for example:
             be have go
first_text    2    6  6
second_text   4    4  2
third_text    6    4  3
Instead, the only solution I found is to reassemble the lemmatized texts from the "lemma" column of the spacy_parse output data frame, with some code like this:
txt_parsed %>%
  select(doc_id, lemma) %>%
  group_by(doc_id) %>%
  summarise(new_txt = str_c(lemma, collapse = " "))
Any suggestions for a better solution?
You can use quanteda::as.tokens() to convert a spacy_parsed object to tokens. Before this, you can swap the token column of the spacy_parsed object for the lemma column.
txt <- c("I like having to be going.", "Then I will be gone.", "I had him going.")
library("spacyr")
sp <- spacy_parse(txt, lemma = TRUE, entity = FALSE, pos = FALSE)
## Found 'spacy_condaenv'. spacyr will use this environment
## successfully initialized (spaCy Version: 2.3.2, language model: en_core_web_sm)
## (python options: type = "condaenv", value = "spacy_condaenv")
sp$token <- sp$lemma
library("quanteda")
## Package version: 3.0.0
## Unicode version: 10.0
## ICU version: 61.1
## Parallel computing: 12 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
as.tokens(sp) %>%
  dfm()
## Document-feature matrix of: 3 documents, 9 features (37.04% sparse) and 0 docvars.
##        features
## docs    -pron- like have to be go . then will
##   text1      1    1    1  1  1  1 1    0    0
##   text2      1    0    0  0  1  1 1    1    1
##   text3      2    0    1  0  0  1 1    0    0
Created on 2021-04-12 by the reprex package (v2.0.0)
Actually, I found an even easier solution, which is to use the use_lemma = TRUE option in the as.tokens function.
Example:
library(spacyr)
spacy_initialize(model = "fr_core_news_sm")
sp1 <- spacy_parse(macron, lemma = TRUE, entity = FALSE, pos = FALSE)
dfm1 <- as.tokens(sp1, use_lemma = TRUE) %>% dfm()
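Applied to the English example from the accepted answer, this would look roughly like the following (a sketch reusing the sp object parsed above):
library(quanteda)
as.tokens(sp, use_lemma = TRUE) %>%
  dfm()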

Output text with both unigrams and bigrams in R

I'm trying to figure out how to identify unigrams and bigrams in a text in R, and then keep both in the final output based on a threshold. I've done this in Python with gensim's Phraser model, but haven't figured out how to do it in R.
For example:
strings <- data.frame(text = c('This is a great movie from yesterday', 'I went to the movies',
                               'Great movie time at the theater', 'I went to the theater yesterday'))
#Pseudocode below
bigs <- tokenize_uni_bi(strings, n = 1:2, threshold = 2)
print(bigs)
[['this', 'great_movie', 'yesterday'], ['went', 'movies'], ['great_movie', 'theater'], ['went', 'theater', 'yesterday']]
Thank you!
You could use the quanteda framework for this:
library(quanteda)
# tokenize, tolower, remove stopwords and create ngrams
my_toks <- tokens(strings$text)
my_toks <- tokens_tolower(my_toks)
my_toks <- tokens_remove(my_toks, stopwords("english"))
bigs <- tokens_ngrams(my_toks, n = 1:2)
# turn into document feature matrix and filter on minimum frequency of 2 and more
my_dfm <- dfm(bigs)
dfm_trim(my_dfm, min_termfreq = 2)
Document-feature matrix of: 4 documents, 6 features (50.0% sparse).
       features
docs    great movie yesterday great_movie went theater
  text1     1     1         1           1    0       0
  text2     0     0         0           0    1       0
  text3     1     1         0           1    0       1
  text4     0     0         1           0    1       1
# use convert function to turn this into a data.frame
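That comment refers to quanteda's convert() function; a minimal sketch:
convert(dfm_trim(my_dfm, min_termfreq = 2), to = "data.frame")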
Alternatively, you could use the tidytext package, tm, tokenizers, etc. It all depends a bit on the output you are expecting.
An example using tidytext / dplyr looks like this:
library(tidytext)
library(dplyr)
strings %>%
  unnest_ngrams(bigs, text, n = 2, n_min = 1, ngram_delim = "_",
                stopwords = stopwords::stopwords()) %>%
  count(bigs) %>%
  filter(n >= 2)
         bigs n
1       great 2
2 great_movie 2
3       movie 2
4     theater 2
5        went 2
6   yesterday 2
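If you want to keep the n-grams per document rather than counting them over the whole corpus, you can add a row identifier and count within it; a sketch (the doc_id column is introduced here for illustration):
strings %>%
  mutate(doc_id = row_number()) %>%
  unnest_ngrams(bigs, text, n = 2, n_min = 1, ngram_delim = "_",
                stopwords = stopwords::stopwords()) %>%
  count(doc_id, bigs)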
Both quanteda and tidytext have a lot of online help available; see the vignettes of both packages on CRAN.

R - text mining with the tm package

I want to use the tm package, so I have created the following code:
x <- inspect(DocumentTermMatrix(docs,
                                list(dictionary = c("survive", "survival"))))
I need to find any word beginning with "surviv" in the text, so as to include words like "survival", "survivor", "survive" and others. Is there any way to write that condition (words beginning with "surviv") in the code?
What you could do is create a general TermDocumentMatrix and then filter it, keeping only the rows whose terms start with "surviv", using startsWith.
corp <- VCorpus(VectorSource(c("survival", "survivance", "survival",
                               "random", "yes", "survive")))
tdm <- TermDocumentMatrix(corp)
inspect(tdm[startsWith(rownames(tdm), "surv"), ])
You can stem the words with stemDocument. Then you only need to look for "surviv" and "survivor", as these are the stems you are after. Using and expanding the list of words from @AshOfFire:
my_corpus <- VCorpus(VectorSource(c("survival", "survivance", "survival",
                                    "random", "yes", "survive", "survivors", "surviving")))
my_corpus <- tm_map(my_corpus, stemDocument)
my_dtm <- DocumentTermMatrix(my_corpus, control = list(dictionary = c("surviv", "survivor")))
inspect(my_dtm)
<<DocumentTermMatrix (documents: 8, terms: 2)>>
Non-/sparse entries: 6/10
Sparsity : 62%
Maximal term length: 8
Weighting : term frequency (tf)
Sample             :
    Terms
Docs surviv survivor
   1      1        0
   2      1        0
   3      1        0
   4      0        0
   5      0        0
   6      1        0
   7      0        1
   8      1        0
P.S. Only do x <- inspect(DocumentTermMatrix(docs, .....)) if you want to capture just the first 10 rows and 10 columns in your x variable.
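If you want the full matrix in a variable instead, a sketch:
x <- as.matrix(my_dtm)  # full document-term counts, not just the inspect() preview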

Split up ngrams in (sparse) document-feature matrix

This is a follow up question to this one. There, I asked if it's possible to split up ngram-features in a document-feature matrix (dfm-class from the quanteda-package) in such a way that e.g. bigrams result in two separate unigrams.
For better understanding: I got the ngrams in the dfm from translating the features from German to English. Compounds ("Emissionsminderung") are quite common in German but not in English ("emission reduction").
library(quanteda)
eg.txt <- c('increase in_the great plenary',
            'great plenary emission_reduction',
            'increase in_the emission_reduction emission_increase')
eg.corp <- corpus(eg.txt)
eg.dfm <- dfm(eg.corp)
There was a nice answer to this example, which works absolutely fine for relatively small matrices as the one above. However, as soon as the matrix is bigger, I'm constantly running into the following memory error.
> #turn the dfm into a matrix
> DF <- as.data.frame(eg.dfm)
Error in asMethod(object) :
Cholmod-error 'problem too large' at file ../Core/cholmod_dense.c, line 105
Hence, is there a more memory efficient way to solve this ngram-problem or to deal with large (sparse) matrices/data frames? Thank you in advance!
The problem here is that you are turning the sparse (dfm) matrix into a dense object when you call as.data.frame(). Since the typical document-feature matrix is 90% sparse, this means you are creating something larger than you can handle. The solution: use dfm handling functions to maintain the sparsity.
Note that this is not only a better solution than the one proposed in the linked question, but it should also work efficiently for your much larger object.
Here's a function that does that. It allows you to set the concatenator character(s), and works with ngrams of variable sizes. Most importantly, it uses dfm methods to make sure the dfm remains sparse.
# function to split and duplicate counts in features containing
# the concatenator character
dfm_splitgrams <- function(x, concatenator = "_") {
    # separate the unigrams
    x_unigrams <- dfm_remove(x, concatenator, valuetype = "regex")
    # separate the ngrams
    x_ngrams <- dfm_select(x, concatenator, valuetype = "regex")
    # split into components
    split_ngrams <- stringi::stri_split_regex(featnames(x_ngrams), concatenator)
    # get a repeated index for the ngram feature names
    index_split_ngrams <- rep(featnames(x_ngrams), lengths(split_ngrams))
    # subset the ngram matrix using the (repeated) ngram feature names
    x_split_ngrams <- x_ngrams[, index_split_ngrams]
    # assign the ngram dfm the feature names of the split ngrams
    colnames(x_split_ngrams) <- unlist(split_ngrams, use.names = FALSE)
    # return the column concatenation of unigrams and split ngrams
    suppressWarnings(cbind(x_unigrams, x_split_ngrams))
}
So:
dfm_splitgrams(eg.dfm)
## Document-feature matrix of: 3 documents, 9 features (40.7% sparse).
## 3 x 9 sparse Matrix of class "dfmSparse"
##        features
## docs    increase great plenary in the emission reduction emission increase
##   text1        1     1       1  1   1        0         0        0        0
##   text2        0     1       1  0   0        1         1        0        0
##   text3        1     0       0  1   1        1         1        1        1
Here, splitting ngrams results in new "unigrams" of the same feature name. You can (re)combine them efficiently with dfm_compress():
dfm_compress(dfm_splitgrams(eg.dfm))
## Document-feature matrix of: 3 documents, 7 features (33.3% sparse).
## 3 x 7 sparse Matrix of class "dfmSparse"
##        features
## docs    increase great plenary in the emission reduction
##   text1        1     1       1  1   1        0         0
##   text2        0     1       1  0   0        1         1
##   text3        2     0       0  1   1        2         1

Calculate a term-document matrix while also looking for words within strings

This question is related to my earlier question, Treat words separated by space in the same manner.
Posting it as a separate one since it might help other users find it easily.
The question concerns the way the term-document matrix is currently calculated by the tm package. I want to tweak this behaviour a little, as explained below.
Currently, any term-document matrix gets created by looking for a word, say 'milky', as a separate word (and not as a substring) in a document. For example, let us assume two documents:
document 1: "this is a milky way galaxy"
document 2: "this is a milkyway galaxy"
The way the current algorithm works (tm package), 'milky' would be found in the first document but not in the second, since the algorithm looks for the term milky as a separate word. But if the algorithm had looked for the term milky as a string, like the function grepl does, it would have found 'milky' in the second document as well.
grepl('milky', 'this is a milkyway galaxy')
TRUE
Can someone please help me create a term-document matrix that meets my requirement, which is to be able to find the term milky in both documents? Note that I don't want a solution specific to the word milky; I want a general approach that I can apply at a larger scale to cover all such cases. The solution does not have to use the tm package; I just need a term-document matrix in the end in which each term is looked for as a string (not just as a separate word) inside all the strings of the document in question (grepl-like functionality while calculating the term-document matrix).
The current code that I use to get the term-document matrix is:
doc1 <- "this is a document about milkyway"
doc2 <- "milky way is huge"
library(tm)
tmp.text <- data.frame(rbind(doc1, doc2))
tmp.corpus <- Corpus(DataframeSource(tmp.text))
tmpDTM <- TermDocumentMatrix(tmp.corpus,
                             control = list(tolower = TRUE, removeNumbers = TRUE,
                                            removePunctuation = TRUE, stopwords = TRUE,
                                            wordLengths = c(2, Inf)))
tmp.df <- as.data.frame(as.matrix(tmpDTM))
tmp.df
         1 2
document 1 0
huge     0 1
milky    0 1
milkyway 1 0
way      0 1
I am not sure that tm makes it easy (or possible) to select or group features based on regular expressions. But the text package quanteda does, through a thesaurus argument that groups terms according to a dictionary, when constructing its document-feature matrix.
(quanteda uses the generic term "feature" since here, your category is terms containing the phrase milky rather than original "terms".)
The valuetype argument can be the "glob" format (default), a regular expression ("regex"), or as-is fixed ("fixed"). Below I show the versions with glob and regular expressions.
require(quanteda)
myDictGlob <- dictionary(list(containsMilky = c("milky*")))
myDictRegex <- dictionary(list(containsMilky = c("^milky")))
(plainDfm <- dfm(c(doc1, doc2)))
## Creating a dfm from a character vector ...
## ... lowercasing
## ... tokenizing
## ... indexing documents: 2 documents
## ... indexing features: 9 feature types
## ... created a 2 x 9 sparse dfm
## ... complete.
## Elapsed time: 0.008 seconds.
## Document-feature matrix of: 2 documents, 9 features.
## 2 x 9 sparse Matrix of class "dfmSparse"
##        features
## docs    this is a document about milkyway milky way huge
##   text1    1  1 1        1     1        1     0   0    0
##   text2    0  1 0        0     0        0     1   1    1
dfm(c(doc1, doc2), thesaurus = myDictGlob, valuetype = "glob", verbose = FALSE)
## Document-feature matrix of: 2 documents, 8 features.
## 2 x 8 sparse Matrix of class "dfmSparse"
##         this is a document about way huge CONTAINSMILKY
##   text1    1  1 1        1     1   0    0             1
##   text2    0  1 0        0     0   1    1             1
dfm(c(doc1, doc2), thesaurus = myDictRegex, valuetype = "regex")
## Document-feature matrix of: 2 documents, 8 features.
## 2 x 8 sparse Matrix of class "dfmSparse"
##         this is a document about way huge CONTAINSMILKY
##   text1    1  1 1        1     1   0    0             1
##   text2    0  1 0        0     0   1    1             1
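Note that the thesaurus argument shown here comes from an older quanteda interface. In current releases, the equivalent is dfm_lookup() with exclusive = FALSE on a dfm built from tokens; a sketch (an assumption about the newer API, not part of the original answer):
library(quanteda)
dfmat <- dfm(tokens(c(doc1, doc2)))
dfm_lookup(dfmat, dictionary = myDictRegex, valuetype = "regex", exclusive = FALSE)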
