R: text mining with the tm package

I want to use the tm package, so I have written the following code:
list(dictionary = c("survive", "survival"))))
I need to find any word beginning with "surviv" in the text, so as to include words like "survival", "survivor", "survive" and others. Is there any way to write that condition (words beginning with "surviv") in the code?

What you could do is create a general TermDocumentMatrix and then filter it to keep only the rows (terms) that start with surviv, using startsWith.
library(tm)
corp <- VCorpus(VectorSource(c("survival", "survivance", "survival",
                               "random", "yes", "survive")))
tdm <- TermDocumentMatrix(corp)
# keep only the terms (rows) that begin with the prefix
inspect(tdm[startsWith(rownames(tdm), "surviv"), ])
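If you want to check the prefix logic before building a corpus, here is a minimal base R sketch of the same filter. It uses a plain frequency table instead of a TermDocumentMatrix, and the names terms, counts and prefix_counts are only illustrative.

```r
# Toy term list (same words as in the answer above)
terms <- c("survival", "survivance", "survival", "random", "yes", "survive")

# Frequency table of terms; its names play the role of rownames(tdm)
counts <- table(terms)

# Keep only the entries whose name begins with the prefix
prefix_counts <- counts[startsWith(names(counts), "surviv")]
print(prefix_counts)
```

startsWith() does a literal prefix comparison, so no regex escaping is needed; grepl("^surviv", names(counts)) should select the same entries.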

You can stem the words with stemDocument. Then you only need to look for surviv and survivor, as these are the stems you are looking for. Using and expanding the list of words from @AshOfFire:
library(tm)  # stemDocument also needs the SnowballC package installed
my_corpus <- VCorpus(VectorSource(c("survival", "survivance", "survival",
                                    "random", "yes", "survive", "survivors", "surviving")))
my_corpus <- tm_map(my_corpus, stemDocument)
my_dtm <- DocumentTermMatrix(my_corpus, control = list(dictionary = c("surviv", "survivor")))
inspect(my_dtm)
<<DocumentTermMatrix (documents: 8, terms: 2)>>
Non-/sparse entries: 6/10
Sparsity : 62%
Maximal term length: 8
Weighting : term frequency (tf)
Sample :
Docs surviv survivor
   1      1        0
   2      1        0
   3      1        0
   4      0        0
   5      0        0
   6      1        0
   7      0        1
   8      1        0
P.S. Only use x <- inspect(DocumentTermMatrix(docs, .....)) if you want x to hold just the first 10 rows and 10 columns.
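To find out which stems belong in the dictionary, you can stem a few candidate words directly. This is a small sketch assuming the SnowballC package is installed (as far as I know it is what stemDocument calls internally); the word list and variable names are just for illustration.

```r
library(SnowballC)  # assumption: installed alongside tm

words <- c("survival", "survivance", "survive", "survivors", "surviving", "random")
stems <- wordStem(words, language = "english")

# Show each word next to its stem so the dictionary entries can be read off
print(data.frame(word = words, stem = stems))

# Distinct stems that begin with the prefix of interest
dictionary <- unique(stems[startsWith(stems, "surviv")])
print(dictionary)
```

The resulting vector can then be passed as control = list(dictionary = dictionary) instead of typing the stems by hand.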


Split up ngrams in (sparse) document-feature matrix

This is a follow-up question to this one. There, I asked if it's possible to split up ngram features in a document-feature matrix (dfm class from the quanteda package) in such a way that e.g. bigrams result in two separate unigrams.
For better understanding: I got the ngrams in the dfm from translating the features from German to English. Compounds ("Emissionsminderung") are quite common in German but not in English ("emission reduction").
eg.txt <- c('increase in_the great plenary',
'great plenary emission_reduction',
'increase in_the emission_reduction emission_increase')
eg.corp <- corpus(eg.txt)
eg.dfm <- dfm(eg.corp)
There was a nice answer to this example, which works absolutely fine for relatively small matrices as the one above. However, as soon as the matrix is bigger, I'm constantly running into the following memory error.
> #turn the dfm into a matrix
> DF <- as.data.frame(eg.dfm)
Error in asMethod(object) :
Cholmod-error 'problem too large' at file ../Core/cholmod_dense.c, line 105
Hence, is there a more memory efficient way to solve this ngram-problem or to deal with large (sparse) matrices/data frames? Thank you in advance!
The problem here is that you are turning the sparse (dfm) matrix into a dense object when you call as.data.frame(). Since a typical document-feature matrix is often 90% sparse or more, this means you are creating something larger than you can handle. The solution: use dfm handling functions to maintain the sparsity.
Note that this is a better solution than the one proposed in the linked question, and it should also work efficiently for your much larger object.
Here's a function that does that. It allows you to set the concatenator character(s), and works with ngrams of variable sizes. Most importantly, it uses dfm methods to make sure the dfm remains sparse.
# function to split and duplicate counts in features containing
# the concatenator character
dfm_splitgrams <- function(x, concatenator = "_") {
    # separate the unigrams
    x_unigrams <- dfm_remove(x, concatenator, valuetype = "regex")
    # separate the ngrams
    x_ngrams <- dfm_select(x, concatenator, valuetype = "regex")
    # split into components
    split_ngrams <- stringi::stri_split_regex(featnames(x_ngrams), concatenator)
    # get a repeated index for the ngram feature names
    index_split_ngrams <- rep(featnames(x_ngrams), lengths(split_ngrams))
    # subset the ngram matrix using the (repeated) ngram feature names
    x_split_ngrams <- x_ngrams[, index_split_ngrams]
    # assign the ngram dfm the feature names of the split ngrams
    colnames(x_split_ngrams) <- unlist(split_ngrams, use.names = FALSE)
    # return the column concatenation of unigrams and split ngrams
    suppressWarnings(cbind(x_unigrams, x_split_ngrams))
}
dfm_splitgrams(eg.dfm)
## Document-feature matrix of: 3 documents, 9 features (40.7% sparse).
## 3 x 9 sparse Matrix of class "dfmSparse"
##        features
## docs    increase great plenary in the emission reduction emission increase
##   text1        1     1       1  1   1        0         0        0        0
##   text2        0     1       1  0   0        1         1        0        0
##   text3        1     0       0  1   1        1         1        1        1
Here, splitting ngrams results in new "unigrams" of the same feature name. You can (re)combine them efficiently with dfm_compress():
dfm_compress(dfm_splitgrams(eg.dfm))
## Document-feature matrix of: 3 documents, 7 features (33.3% sparse).
## 3 x 7 sparse Matrix of class "dfmSparse"
##        features
## docs    increase great plenary in the emission reduction
##   text1        1     1       1  1   1        0         0
##   text2        0     1       1  0   0        1         1
##   text3        2     0       0  1   1        2         1
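To get a feel for why the as.data.frame() call runs out of memory, here is a rough back-of-the-envelope sketch using the Matrix package. The dimensions and the number of filled cells are made-up values, not taken from the question; a dense numeric matrix needs 8 bytes per cell regardless of how many cells are zero.

```r
library(Matrix)  # ships with standard R installations

set.seed(1)
n_docs  <- 2000    # hypothetical number of documents
n_feats <- 5000    # hypothetical number of features
n_cells <- 50000   # at most this many nonzero cells (duplicate positions are summed)

m <- sparseMatrix(i = sample(n_docs, n_cells, replace = TRUE),
                  j = sample(n_feats, n_cells, replace = TRUE),
                  x = 1, dims = c(n_docs, n_feats))

# Dense storage: one 8-byte double per cell, zero or not
dense_bytes  <- n_docs * n_feats * 8
# Sparse storage: only the nonzero entries plus index vectors
sparse_bytes <- as.numeric(object.size(m))

cat(sprintf("dense: %.1f MB, sparse: %.1f MB\n",
            dense_bytes / 2^20, sparse_bytes / 2^20))
```

The gap grows with the matrix, which is why staying within the dfm methods (as dfm_splitgrams does) matters for large objects.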

Keep EXACT words from R corpus

From the answer posted on "Keep document ID with R corpus" by @MrFlick.
I am trying to slightly modify what is a great example.
Question: How do I modify the content_transformer function to keep only exact words? You can see in the inspect output that wonderful is counted as wonder and rationale is counted as ratio. I do not have a strong understanding of gregexpr and regmatches.
Create data frame:
dd <- data.frame(
    id = 10:13,
    text = c("No wonderful, then, that ever",
             "So that in many cases such a ",
             "But there were still other and",
             "Not even at the rationale"),
    stringsAsFactors = FALSE
)
Now, in order to read special attributes from a data.frame, we will use the readTabular function to make our own custom data.frame reader:
myReader <- readTabular(mapping = list(content = "text", id = "id"))
This specifies the columns to use for the content and the id in the data.frame. Now we read it in with DataframeSource but use our custom reader.
tm <- VCorpus(DataframeSource(dd), readerControl = list(reader = myReader))
Now if we want to only keep a certain set of words, we can create our own content_transformer function. One way to do this is
keepOnlyWords <- content_transformer(function(x, words) {
    # blank out everything that does not match one of the words
    regmatches(x,
        gregexpr(paste0("\\b(", paste(words, collapse = "|"), "\\b)"), x)
    , invert = T) <- " "
    x
})
This will replace everything that's not in the word list with a space. Note that you probably want to run stripWhitespace after this. Thus our transformations would look like
keep <- c("wonder", "then", "that", "the")
tm <- tm_map(tm, content_transformer(tolower))
tm <- tm_map(tm, keepOnlyWords, keep)
tm <- tm_map(tm, stripWhitespace)
Inspect the dtm matrix:
> dtm <- DocumentTermMatrix(tm)
> inspect(dtm)
<<DocumentTermMatrix (documents: 4, terms: 4)>>
Non-/sparse entries: 7/9
Sparsity : 56%
Maximal term length: 6
Weighting : term frequency (tf)
Docs ratio that the wonder
  10     0    1   1      1
  11     0    1   0      0
  12     0    0   1      0
  13     1    0   1      0
Switching grammars to tidytext, your current transformation would be
library(dplyr)
library(stringr)
library(tidytext)
dd %>% unnest_tokens(word, text) %>%
    mutate(word = str_replace_all(word, setNames(keep, paste0('.*', keep, '.*')))) %>%
    inner_join(tibble(word = keep))
##   id   word
## 1 10 wonder
## 2 10    the
## 3 10   that
## 4 11   that
## 5 12    the
## 6 12    the
## 7 13    the
Keeping exact matches is easier, as you can use joins (which use ==) instead of regex:
dd %>% unnest_tokens(word, text) %>%
    inner_join(tibble(word = keep))
##   id word
## 1 10 then
## 2 10 that
## 3 11 that
## 4 13  the
To take it back to a document-term matrix,
dd %>% mutate(id = factor(id)) %>%    # to keep empty rows of DTM
    unnest_tokens(word, text) %>%
    inner_join(tibble(word = keep)) %>%
    mutate(i = 1) %>%
    cast_dtm(id, word, i) %>%
    inspect()
## <<DocumentTermMatrix (documents: 4, terms: 3)>>
## Non-/sparse entries: 4/8
## Sparsity           : 67%
## Maximal term length: 4
## Weighting          : term frequency (tf)
##     Terms
## Docs then that the
##   10    1    1   0
##   11    0    1   0
##   12    0    0   0
##   13    0    0   1
Currently, your function is matching words with a boundary before or after. To change it to before and after, change the collapse parameter to include boundaries:
tm <- VCorpus(DataframeSource(dd), readerControl = list(reader = myReader))
keepOnlyWords <- content_transformer(function(x, words) {
    # each alternative now carries its own leading and trailing boundary
    regmatches(x,
        gregexpr(paste0("(\\b", paste(words, collapse = "\\b|\\b"), "\\b)"), x)
    , invert = T) <- " "
    x
})
tm <- tm_map(tm, content_transformer(tolower))
tm <- tm_map(tm, keepOnlyWords, keep)
tm <- tm_map(tm, stripWhitespace)
inspect(DocumentTermMatrix(tm))
## <<DocumentTermMatrix (documents: 4, terms: 3)>>
## Non-/sparse entries: 4/8
## Sparsity           : 67%
## Maximal term length: 4
## Weighting          : term frequency (tf)
##     Terms
## Docs that the then
##   10    1   0    1
##   11    1   0    0
##   12    0   0    0
##   13    0   1    0
I got the same result as @alistaire with tm, using the following modified line in the keepOnlyWords content transformer first defined by @BEMR:
gregexpr(paste0("\\b(", paste(words, collapse = "|"), ")\\b"), x)
There was a misplaced ")" in the gregexpr first specified by @BEMR, i.e. it should be ")\\b", not "\\b)".
I think the above gregexpr is equivalent to the one specified by @alistaire:
gregexpr(paste0("(\\b", paste(words, collapse = "\\b|\\b"), "\\b)"), x)
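The difference between the three patterns can be checked in base R without tm. This is only a sketch on the first (lowercased) example sentence; the names pat_original, pat_moved, pat_each and the hits() helper are mine, not from the answers above.

```r
keep <- c("wonder", "then", "that", "the")
x <- "no wonderful, then, that ever"

alts <- paste(keep, collapse = "|")
# original: the trailing \\b sits inside the group, so it only applies to "the"
pat_original <- paste0("\\b(", alts, "\\b)")
# corrected: the trailing \\b applies to the whole group
pat_moved    <- paste0("\\b(", alts, ")\\b")
# alternative: every word carries its own pair of boundaries
pat_each     <- paste0("(\\b", paste(keep, collapse = "\\b|\\b"), "\\b)")

# helper returning all matches of a pattern in x
hits <- function(pattern) regmatches(x, gregexpr(pattern, x))[[1]]

hits_original <- hits(pat_original)  # also picks up "wonder" out of "wonderful"
hits_moved    <- hits(pat_moved)
hits_each     <- hits(pat_each)

print(list(original = hits_original, moved = hits_moved, each = hits_each))
```

On this sentence the two corrected patterns return the same matches, which supports the guess that they are equivalent for plain word lists.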

R tm TermDocumentMatrix based on a sparse matrix

I have a collection of books in txt format and want to apply some procedures of the tm R library to them. However, I prefer to clean the texts in bash rather than in R because it is much faster.
Suppose I am able to get from bash a data.frame such as:
book term frequency
1 the 10
1 zoo 2
2 animal 2
2 car 3
2 the 20
I know that TermDocumentMatrices are actually sparse matrices with metadata. In fact, I can create a sparse matrix from the TDM using the TDM's i, j and v entries for the i, j and x ones of the sparseMatrix function. Please help me if you know how to do the inverse, or in this case, how to construct a TDM by using the three columns in the data.frame above. Thanks!
You could try
library(tm)
library(reshape2)  # provides dcast
df <- read.table(header = TRUE, text = "
book term frequency
1 the 10
1 zoo 2
2 animal 2
2 car 3
2 the 20")
# one row per book, one column per term
dfwide <- dcast(data = df, book ~ term, value.var = "frequency", fill = 0)
mat <- as.matrix(dfwide[, -1])
dimnames(mat) <- setNames(dimnames(dfwide[-1]), names(df[, 1:2]))
(tdm <- as.TermDocumentMatrix(t(mat), weighting = weightTf))
# <<TermDocumentMatrix (terms: 4, documents: 2)>>
# Non-/sparse entries: 5/3
# Sparsity : 38%
# Maximal term length: 6
# Weighting : term frequency (tf)
#         Docs
# Terms     1  2
#   animal  0  2
#   car     0  3
#   the    10 20
#   zoo     2  0
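If the wide matrix from dcast gets too big, it may be possible to skip the dense step and build the triplet representation directly. The sketch below assumes that as.TermDocumentMatrix() accepts a slam simple_triplet_matrix (my understanding is that a TDM is one internally, but please check against your tm version); the names terms, docs and stm are illustrative.

```r
library(slam)
library(tm)

df <- data.frame(book = c(1, 1, 2, 2, 2),
                 term = c("the", "zoo", "animal", "car", "the"),
                 frequency = c(10, 2, 2, 3, 20),
                 stringsAsFactors = FALSE)

terms <- sort(unique(df$term))
docs  <- sort(unique(df$book))

# i = term index (row), j = document index (column), v = count
stm <- simple_triplet_matrix(i = match(df$term, terms),
                             j = match(df$book, docs),
                             v = df$frequency,
                             nrow = length(terms), ncol = length(docs),
                             dimnames = list(Terms = terms,
                                             Docs = as.character(docs)))

# assumption: the default as.TermDocumentMatrix method handles triplet matrices
tdm <- as.TermDocumentMatrix(stm, weighting = weightTf)
inspect(tdm)
```

This keeps only the five nonzero entries in memory, mirroring the i, j and v columns mentioned in the question.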

Corpus build with phrases

I have my documents as:
doc1 = very good, very bad, you are great
doc2 = very bad, good restaurent, nice place to visit
I want to split my corpus on commas so that my final DocumentTermMatrix becomes:
docs   very good   very bad   you are great   good restaurent   nice place to visit
doc1   tf-idf      tf-idf     tf-idf          0                 0
doc2   0           tf-idf     0               tf-idf            tf-idf
I know how to compute a DocumentTermMatrix of individual words, but I don't know how to split the corpus into phrases in R. A solution in R is preferred, but a solution in Python is also welcome.
What I have tried is:
> library(tm)
> library(RWeka)
> BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 3))
> options(mc.cores=1)
> texts <- c("very good, very bad, you are great","very bad, good restaurent, nice place to visit")
> corpus <- Corpus(VectorSource(texts))
> a <- TermDocumentMatrix(corpus, control = list(tokenize = BigramTokenizer))
> as.matrix(a)
I am getting:
Terms 1 2
bad good restaurent 0 1
bad you are 1 0
good restaurent nice 0 1
good very bad 1 0
nice place to 0 1
place to visit 0 1
restaurent nice place 0 1
very bad good 0 1
very bad you 1 0
very good very 1 0
you are great 1 0
What I want is not combination of words but only the phrases that I showed in my matrix.
Here's one approach using qdap + tm packages:
library(qdap); library(tm); library(qdapTools)
dat <- list2df(list(doc1 = "very good, very bad, you are great",
doc2 = "very bad, good restaurent, nice place to visit"), "text", "docs")
x <- sub_holder(", ", dat$text)
m <- dtm(wfm(x$unhold(gsub(" ", "~~", x$output)), dat$docs) )
## A document-term matrix (2 documents, 5 terms)
## Non-/sparse entries: 4/6
## Sparsity : 60%
## Maximal term length: 19
## Weighting : term frequency - inverse document frequency (normalized) (tf-idf)
##       Terms
## Docs   good restaurent nice place to visit very bad very good you are great
##   doc1       0.0000000           0.0000000        0 0.3333333     0.3333333
##   doc2       0.3333333           0.3333333        0 0.0000000     0.0000000
You could also do it in one fell swoop and return a DocumentTermMatrix directly, but this may be harder to understand:
x <- sub_holder(", ", dat$text)
apply_as_tm(t(wfm(x$unhold(gsub(" ", "~~", x$output)), dat$docs)),
weightTfIdf, to.qdap=FALSE)
What if you just used strsplit to split on commas and then turned your phrases into single "words" by joining them with some character? For example:
docs <- c(D1 = "very good, very bad, you are great",
D2 = "very bad, good restaurent, nice place to visit")
dd <- Corpus(VectorSource(docs))
dd <- tm_map(dd, function(x) {
# A corpus with 2 text documents
# The metadata consists of 2 tag-value pairs and a data frame
# Available tags are:
# create_date creator
# Available variables in the data frame are:
# MetaID
# $D1
# very~good
# very~bad
# you~are~great
# $D2
# very~bad
# good~restaurent
# nice~place~to~visit
dtm <- DocumentTermMatrix(dd, control = list(weighting = weightTfIdf))
This will produce
# Docs good~restaurent nice~place~to~visit very~bad very~good you~are~great
#   D1       0.0000000           0.0000000        0 0.3333333     0.3333333
#   D2       0.3333333           0.3333333        0 0.0000000     0.0000000
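The split-and-join idea itself needs nothing beyond base R. Below is a minimal sketch that produces raw phrase counts only (no tf-idf weighting and no tm objects); the names phrases, vocab and counts are illustrative.

```r
docs <- c(D1 = "very good, very bad, you are great",
          D2 = "very bad, good restaurent, nice place to visit")

# split each document on commas, then glue the words of a phrase with "~"
phrases <- lapply(strsplit(docs, ",\\s*"),
                  function(p) gsub("\\s+", "~", p))

# all distinct phrases across documents
vocab <- sort(unique(unlist(phrases)))

# count each phrase per document: rows = documents, columns = phrases
counts <- t(sapply(phrases,
                   function(p) tabulate(match(p, vocab), nbins = length(vocab))))
colnames(counts) <- vocab
print(counts)
```

Once each phrase is a single token, any of the usual tokenizers that split on whitespace should treat it as one term.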
For anyone using text2vec, this is quite a handy solution based on a custom vocabulary:
doc1 <- 'very good, very bad, you are great'
doc2 <- 'very bad, good restaurent, nice place to visit'
docs <- list(doc1, doc2)
docs <- sapply(docs, strsplit, split=', ')
vocab <- vocab_vectorizer(create_vocabulary(unique(unlist(docs))))
dtm <- create_dtm(itoken(docs), vocab)
This will result in:
2 x 5 sparse Matrix of class "dgCMatrix"
  very good very bad you are great good restaurent nice place to visit
1         1        1             1               .                   .
2         .        1             .               1                   1
Such an approach allows more customization in loading files and preparing the vocabulary.

Compute ngrams for each row of text data in R

I have a data column of the following format:
Hello world
How are you today
I love stackoverflow
blah blah blahdy
I would like to compute the 3-grams for each row in this dataset by perhaps using the tau package's textcnt() function. However, when I tried it, it gave me one numeric vector with the ngrams for the entire column. How can I apply this function to each observation in my data separately?
Is this what you're after?
library(tm)
library(RWeka)
TrigramTokenizer <- function(x) NGramTokenizer(x,
                                Weka_control(min = 3, max = 3))
# Using Tyler's method of making the 'Text' object here
tdm <- TermDocumentMatrix(Corpus(VectorSource(Text)),
                          control = list(tokenize = TrigramTokenizer))
inspect(tdm)
A term-document matrix (4 terms, 5 documents)
Non-/sparse entries: 4/16
Sparsity : 80%
Maximal term length: 20
Weighting : term frequency (tf)
Terms                1 2 3 4 5
are you today        0 0 1 0 0
blah blah blahdy     0 0 0 0 1
how are you          0 0 1 0 0
i love stackoverflow 0 0 0 1 0
Here's an ngram approach using the qdap package
## Text <- readLines(n=5)
## Hello world
## Hello
## How are you today
## I love stackoverflow
## blah blah blahdy
ngrams(Text, seq_along(Text), 3)
It's a list and you can access the components with typical list indexing.
As far as your first approach try it like this:
sapply(Text, textcnt, method = "ngram")
## sapply(eta_dedup$title, textcnt, method = "ngram")
Here's how using the quanteda package:
txt <- c("Hello world", "Hello", "How are you today", "I love stackoverflow", "blah blah blahdy")
dfm(txt, ngrams = 3, concatenator = " ", verbose = FALSE)
## Document-feature matrix of: 5 documents, 4 features.
## 5 x 4 sparse Matrix of class "dfmSparse"
##        features
## docs    how are you are you today i love stackoverflow blah blah blahdy
##   text1           0             0                    0                0
##   text2           0             0                    0                0
##   text3           1             1                    0                0
##   text4           0             0                    1                0
##   text5           0             0                    0                1
I guess the OP wanted to use tau but others didn't use that package. Here's how you do it in tau:
data = "Hello world\nHello\nHow are you today\nI love stackoverflow\n
blah blah blahdy"
bigram_tau <- textcnt(data, n = 2L, method = "string", recursive = TRUE)
The result is stored as a trie, but you can format it as a more classic data-frame type with tokens and size:
data.frame(counts = unclass(bigram_tau), size = nchar(names(bigram_tau)))
I highly suggest using tau because it performs really well with large data. I have used it for creating bigrams of 1 GB and it was both fast and smooth.
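If you would rather avoid extra packages, here is a small base R sketch of word trigrams computed row by row. The helper word_ngrams() is a hypothetical name of mine and lowercases the text, which may or may not be what you want.

```r
# Return the word n-grams of a single string (empty if it has fewer than n words)
word_ngrams <- function(text, n = 3) {
  words <- strsplit(tolower(text), "\\s+")[[1]]
  words <- words[nzchar(words)]
  if (length(words) < n) return(character(0))
  vapply(seq_len(length(words) - n + 1),
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

Text <- c("Hello world", "Hello", "How are you today",
          "I love stackoverflow", "blah blah blahdy")

# one list element per row of the input
trigrams <- lapply(Text, word_ngrams, n = 3)
print(trigrams)
```

Rows with fewer than three words simply yield an empty character vector, so the result stays aligned with the original rows.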
