How to get unigram and trigram only?

I need to get unigrams and trigrams, but not bigrams.
trigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 3))
How can I edit this code to get that result?

One way is to use the dfm() function from the quanteda package as follows:
library(quanteda)
dfm('I only want uni and trigrams', ngrams = c(1,3), verbose = FALSE)
# Document-feature matrix of: 1 document, 10 features.
# 1 x 10 sparse Matrix of class "dfmSparse"
#        features
# docs    i only want uni and trigrams i_only_want only_want_uni want_uni_and uni_and_trigrams
#   text1 1    1    1   1   1        1           1             1            1                1
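If you would rather stay with RWeka, a minimal sketch (assuming the RWeka package is installed and loaded) is to call NGramTokenizer once for unigrams and once for trigrams and combine the results, so bigrams are never generated:

library(RWeka)

uniTrigramTokenizer <- function(x) {
    # unigrams only (min = max = 1), then trigrams only (min = max = 3)
    c(NGramTokenizer(x, Weka_control(min = 1, max = 1)),
      NGramTokenizer(x, Weka_control(min = 3, max = 3)))
}

The resulting function can then be passed as the tokenize option wherever the original trigramTokenizer was used.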

Related

Output text with both unigrams and bigrams in R

I'm trying to figure out how to identify unigrams and bigrams in a text in R, and then keep both in the final output based on a threshold. I've done this in Python with gensim's Phraser model, but haven't figured out how to do it in R.
For example:
strings <- data.frame(text = c('This is a great movie from yesterday',
                               'I went to the movies',
                               'Great movie time at the theater',
                               'I went to the theater yesterday'))
#Pseudocode below
bigs <- tokenize_uni_bi(strings, n = 1:2, threshold = 2)
print(bigs)
[['this', 'great_movie', 'yesterday'], ['went', 'movies'], ['great_movie', 'theater'], ['went', 'theater', 'yesterday']]
Thank you!
You could use the quanteda framework for this:
library(quanteda)
# tokenize, tolower, remove stopwords and create ngrams
my_toks <- tokens(strings$text)
my_toks <- tokens_tolower(my_toks)
my_toks <- tokens_remove(my_toks, stopwords("english"))
bigs <- tokens_ngrams(my_toks, n = 1:2)
# turn into document feature matrix and filter on minimum frequency of 2 and more
my_dfm <- dfm(bigs)
dfm_trim(my_dfm, min_termfreq = 2)
Document-feature matrix of: 4 documents, 6 features (50.0% sparse).
       features
docs    great movie yesterday great_movie went theater
  text1     1     1         1           1    0       0
  text2     0     0         0           0    1       0
  text3     1     1         0           1    0       1
  text4     0     0         1           0    1       1
# use the convert() function to turn this into a data.frame
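As a short sketch of that last step (using the trimmed dfm from above):

trimmed_dfm <- dfm_trim(my_dfm, min_termfreq = 2)
convert(trimmed_dfm, to = "data.frame")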
Alternatively you could use the tidytext package, tm, tokenizers, and so on. It all depends a bit on the output you are expecting.
An example using tidytext / dplyr looks like this:
library(tidytext)
library(dplyr)
strings %>%
  unnest_ngrams(bigs, text, n = 2, n_min = 1, ngram_delim = "_", stopwords = stopwords::stopwords()) %>%
  count(bigs) %>%
  filter(n >= 2)
         bigs n
1       great 2
2 great_movie 2
3       movie 2
4     theater 2
5        went 2
6   yesterday 2
Both quanteda and tidytext have a lot of online help available. See the vignettes for both packages on CRAN.

Is there a way to filter words by length in a bag-of-words matrix in R?

I have created a matrix in R (called bag_of_words). I need to compute the top 100 most popular words (most occurrences), but filter tokens by length (min size = 4 and max size = 20) and report the total occurrences of each word.
I have written code to find the top 100 words without this filter, which works, but I cannot find a way of filtering the words in the matrix by length. Any help would be appreciated.
My attempt:
#view the top 100 most common words
term_f <- colSums(bag_of_words)
term_f <- sort(term_f, decreasing = T)
term_f[1:100]
Maybe I did not understand your question, but I think a vector might be easier to handle, especially if it is a column of a data.table:
library(data.table)
list_words <- data.table(x = as.character(bag_of_words))
If you only want words between 4 and 20 characters, use nchar:
list_words <- list_words[nchar(x) %between% c(4,20)]
Count the number of occurrences of each word:
list_words <- list_words[,.(n = .N), by = "x"]
Get the top 100
list_words <- list_words[order(-n)][1:100]
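Alternatively, staying closer to the colSums() approach in the question (a sketch that assumes bag_of_words is a document-term matrix whose column names are the tokens), you can filter the named frequency vector by the length of its names:

term_f <- colSums(bag_of_words)
# keep only tokens between 4 and 20 characters long
term_f <- term_f[nchar(names(term_f)) >= 4 & nchar(names(term_f)) <= 20]
term_f <- sort(term_f, decreasing = TRUE)
term_f[1:100]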
I am not sure what NLP infrastructure you are using, but my recommendation is to use quanteda. If you don't have the package, just install it from CRAN with install.packages("quanteda").
Below is a way to easily solve your issue before computing token frequencies.
library(quanteda)
text = c("some short tokens, but maybe just fine.",
"thesearesomeverylongtokens.",
"v e r y s hort tokens" )
mycorp = corpus( text )
mytok = tokens( mycorp )
my_selected_tok = tokens_keep( mytok, min_nchar = 4, max_nchar = 20 )
mydfm = dfm(my_selected_tok)
frequencies = textstat_frequency( mydfm )
> frequencies
  feature frequency rank docfreq group
1  tokens         2    1       2   all
2    some         1    2       1   all
3   short         1    2       1   all
4   maybe         1    2       1   all
5    just         1    2       1   all
6    fine         1    2       1   all
7    hort         1    2       1   all
> class(frequencies)
[1] "frequency"  "textstat"   "data.frame"

Split up ngrams in (sparse) document-feature matrix

This is a follow-up question to this one. There, I asked whether it is possible to split up ngram features in a document-feature matrix (the dfm class from the quanteda package) in such a way that e.g. bigrams result in two separate unigrams.
For better understanding: the ngrams in the dfm come from translating the features from German to English. Compounds ("Emissionsminderung") are quite common in German but not in English ("emission reduction").
library(quanteda)
eg.txt <- c('increase in_the great plenary',
            'great plenary emission_reduction',
            'increase in_the emission_reduction emission_increase')
eg.corp <- corpus(eg.txt)
eg.dfm <- dfm(eg.corp)
There was a nice answer to this example, which works absolutely fine for relatively small matrices as the one above. However, as soon as the matrix is bigger, I'm constantly running into the following memory error.
> #turn the dfm into a matrix
> DF <- as.data.frame(eg.dfm)
Error in asMethod(object) :
Cholmod-error 'problem too large' at file ../Core/cholmod_dense.c, line 105
Hence, is there a more memory efficient way to solve this ngram-problem or to deal with large (sparse) matrices/data frames? Thank you in advance!
The problem here is that you are turning the sparse (dfm) matrix into a dense object when you call as.data.frame(). Since the typical document-feature matrix is 90% sparse, this means you are creating something larger than you can handle. The solution: use dfm handling functions to maintain the sparsity.
Note that this is not only a better solution than the one proposed in the linked question, but it should also work efficiently for your much larger object.
Here's a function that does that. It allows you to set the concatenator character(s), and works with ngrams of variable sizes. Most importantly, it uses dfm methods to make sure the dfm remains sparse.
# function to split and duplicate counts in features containing
# the concatenator character
dfm_splitgrams <- function(x, concatenator = "_") {
    # separate the unigrams
    x_unigrams <- dfm_remove(x, concatenator, valuetype = "regex")
    # separate the ngrams
    x_ngrams <- dfm_select(x, concatenator, valuetype = "regex")
    # split into components
    split_ngrams <- stringi::stri_split_regex(featnames(x_ngrams), concatenator)
    # get a repeated index for the ngram feature names
    index_split_ngrams <- rep(featnames(x_ngrams), lengths(split_ngrams))
    # subset the ngram matrix using the (repeated) ngram feature names
    x_split_ngrams <- x_ngrams[, index_split_ngrams]
    # assign the ngram dfm the feature names of the split ngrams
    colnames(x_split_ngrams) <- unlist(split_ngrams, use.names = FALSE)
    # return the column concatenation of unigrams and split ngrams
    suppressWarnings(cbind(x_unigrams, x_split_ngrams))
}
So:
dfm_splitgrams(eg.dfm)
## Document-feature matrix of: 3 documents, 9 features (40.7% sparse).
## 3 x 9 sparse Matrix of class "dfmSparse"
##        features
## docs    increase great plenary in the emission reduction emission increase
##   text1        1     1       1  1   1        0         0        0        0
##   text2        0     1       1  0   0        1         1        0        0
##   text3        1     0       0  1   1        1         1        1        1
Here, splitting ngrams results in new "unigrams" of the same feature name. You can (re)combine them efficiently with dfm_compress():
dfm_compress(dfm_splitgrams(eg.dfm))
## Document-feature matrix of: 3 documents, 7 features (33.3% sparse).
## 3 x 7 sparse Matrix of class "dfmSparse"
##        features
## docs    increase great plenary in the emission reduction
##   text1        1     1       1  1   1        0         0
##   text2        0     1       1  0   0        1         1
##   text3        2     0       0  1   1        2         1

"Bag of characters" n-grams in R

I would like to create a term-document matrix containing character n-grams. For example, take the following sentence:
"In this paper, we focus on a different but simple text representation."
Character 4-grams would be: |In_t|, |n_th|, |_thi|, |this|, |his_|, |is_p|, |s_pa|, |_pap|, |pape|, |aper|, etc.
I have used the RWeka package to work with "bag of words" n-grams, but I'm having difficulty adapting tokenizers such as the one below to work with characters:
BigramTokenizer <- function(x){
    NGramTokenizer(x, Weka_control(min = 2, max = 2))}

tdm_bigram <- TermDocumentMatrix(corpus,
                                 control = list(
                                     tokenize = BigramTokenizer, wordLengths = c(2, Inf)))
Any thoughts on how to use RWeka or another package to create character n-grams?
I find quanteda quite useful:
library(tm)
library(quanteda)
txts <- c("In this paper.", "In this lines this.")
tokens <- tokenize(gsub("\\s", "_", txts), "character", ngrams=4L, conc="")
dfm <- dfm(tokens)
tdm <- as.TermDocumentMatrix(t(dfm), weighting=weightTf)
as.matrix(tdm)
#       Docs
# Terms  text1 text2
#   In_t     1     1
#   n_th     1     1
#   _thi     1     2
#   this     1     2
#   his_     1     1
#   is_p     1     0
#   s_pa     1     0
#   _pap     1     0
#   pape     1     0
#   aper     1     0
#   per.     1     0
#   is_l     0     1
#   s_li     0     1
#   _lin     0     1
#   line     0     1
#   ines     0     1
#   nes_     0     1
#   es_t     0     1
#   s_th     0     1
#   his.     0     1
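As a side note, tokenize() is deprecated in recent quanteda versions; an equivalent sketch with the current tokens()/tokens_ngrams() API (assuming a recent quanteda) would be:

toks <- tokens(gsub("\\s", "_", txts), what = "character")
toks <- tokens_ngrams(toks, n = 4, concatenator = "")
dfm(toks)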
You need to use the CharacterNGramTokenizer instead.
The NGramTokenizer splits on characters like spaces.
##########
### the following lines are mainly a one-to-one copy from RWeka.
### Only the hardcoded CharacterNGramTokenizer is new
library(rJava)
library(RWeka)
CharacterNGramTokenizer <- structure(function (x, control = NULL)
{
    tokenizer <- .jnew("weka/core/tokenizers/CharacterNGramTokenizer")
    x <- Filter(nzchar, as.character(x))
    if (!length(x))
        return(character())
    .jcall("RWekaInterfaces", "[S", "tokenize",
           .jcast(tokenizer, "weka/core/tokenizers/Tokenizer"),
           .jarray(as.character(control)),
           .jarray(as.character(x)))
}, class = c("R_Weka_tokenizer_interface", "R_Weka_interface"),
   meta = structure(list(name = "weka/core/tokenizers/NGramTokenizer",
                         kind = "R_Weka_tokenizer_interface", class = "character",
                         init = NULL),
                    .Names = c("name", "kind", "class", "init")))
### copy till here
###################
BigramTokenizer <- function(x){
    CharacterNGramTokenizer(x, Weka_control(min = 2, max = 2))}
Sadly, it is not included in RWeka by default.
However, if you want to stay within Weka, this seems to be a workable, self-contained approach.
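A usage sketch (assuming corpus is the tm corpus from the question) then mirrors the original call:

tdm_char_bigram <- TermDocumentMatrix(corpus,
                                      control = list(tokenize = BigramTokenizer,
                                                     wordLengths = c(2, Inf)))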

R tm TermDocumentMatrix based on a sparse matrix

I have a collection of books in txt format and want to apply some procedures of the tm R library to them. However, I prefer to clean the texts in bash rather than in R because it is much faster.
Suppose I am able to get from bash a data.frame such as:
book term frequency
--------------------
1 the 10
1 zoo 2
2 animal 2
2 car 3
2 the 20
I know that TermDocumentMatrices are actually sparse matrices with metadata. In fact, I can create a sparse matrix from the TDM by using the TDM's i, j and v entries as the i, j and x arguments of the sparseMatrix() function. Please help me if you know how to do the inverse, or in this case, how to construct a TDM from the three columns in the data.frame above. Thanks!
You could try
library(tm)
library(reshape2)
txt <- readLines(n = 7)
book term frequency
--------------------
1 the 10
1 zoo 2
2 animal 2
2 car 3
2 the 20
df <- read.table(header=T, text=txt[-2])
dfwide <- dcast(data = df, book ~ term, value.var = "frequency", fill = 0)
mat <- as.matrix(dfwide[, -1])
dimnames(mat) <- setNames(dimnames(dfwide[-1]), names(df[, 1:2]))
(tdm <- as.TermDocumentMatrix(t(mat), weighting = weightTf))
# <<TermDocumentMatrix (terms: 4, documents: 2)>>
# Non-/sparse entries: 5/3
# Sparsity : 38%
# Maximal term length: 6
# Weighting : term frequency (tf)
as.matrix(tdm)
#         Docs
# Terms    1  2
#   animal 0  2
#   car    0  3
#   the   10 20
#   zoo    2  0
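Alternatively, since a TermDocumentMatrix is essentially a slam::simple_triplet_matrix plus metadata, you can build it sparsely and skip the dense dcast step entirely. A sketch (the data.frame is re-created inline just to keep the example self-contained):

library(tm)
library(slam)

df <- read.table(header = TRUE, text = "
book term frequency
1 the 10
1 zoo 2
2 animal 2
2 car 3
2 the 20")

terms <- factor(df$term)
docs  <- factor(df$book)

# build the sparse triplet representation directly from the three columns
stm <- simple_triplet_matrix(i = as.integer(terms),
                             j = as.integer(docs),
                             v = df$frequency,
                             nrow = nlevels(terms),
                             ncol = nlevels(docs),
                             dimnames = list(Terms = levels(terms),
                                             Docs = levels(docs)))
tdm <- as.TermDocumentMatrix(stm, weighting = weightTf)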
