Corpus build with phrases

Corpus build with phrases - r

I have my documents as:
doc1 = very good, very bad, you are great
doc2 = very bad, good restaurent, nice place to visit
I want to make my corpus separated with , so that my final DocumentTermMatrix becomes:
terms
docs very good very bad you are great good restaurent nice place to visit
doc1 tf-idf tf-idf tf-idf 0 0
doc2 0 tf-idf 0 tf-idf tf-idf
I know, how to calculate DocumentTermMatrix of individual words but don't know how to make the corpus separated for each phrase in R. A solution in R is preferred but solution in Python is also welcomed.
What I have tried is:
> library(tm)
> library(RWeka)
> BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 3))
> options(mc.cores=1)
> texts <- c("very good, very bad, you are great","very bad, good restaurent, nice place to visit")
> corpus <- Corpus(VectorSource(texts))
> a <- TermDocumentMatrix(corpus, control = list(tokenize = BigramTokenizer))
> as.matrix(a)
I am getting:
Docs
Terms 1 2
bad good restaurent 0 1
bad you are 1 0
good restaurent nice 0 1
good very bad 1 0
nice place to 0 1
place to visit 0 1
restaurent nice place 0 1
very bad good 0 1
very bad you 1 0
very good very 1 0
you are great 1 0
What I want is not combination of words but only the phrases that I showed in my matrix.

Here's one approach using qdap + tm packages:
library(qdap); library(tm); library(qdapTools)
dat <- list2df(list(doc1 = "very good, very bad, you are great",
doc2 = "very bad, good restaurent, nice place to visit"), "text", "docs")
x <- sub_holder(", ", dat$text)
m <- dtm(wfm(x$unhold(gsub(" ", "~~", x$output)), dat$docs) )
weightTfIdf(m)
inspect(weightTfIdf(m))
## A document-term matrix (2 documents, 5 terms)
##
## Non-/sparse entries: 4/6
## Sparsity : 60%
## Maximal term length: 19
## Weighting : term frequency - inverse document frequency (normalized) (tf-idf)
##
## Terms
## Docs good restaurent nice place to visit very bad very good you are great
## doc1 0.0000000 0.0000000 0 0.3333333 0.3333333
## doc2 0.3333333 0.3333333 0 0.0000000 0.0000000
You could also do one fell swoop and return a DocumentTermMatrix but this may be harder to understand:
x <- sub_holder(", ", dat$text)
apply_as_tm(t(wfm(x$unhold(gsub(" ", "~~", x$output)), dat$docs)),
weightTfIdf, to.qdap=FALSE)

What if you just used strsplit to split on commas and then turned your phrases into single "words" by combining with some character. For example
library(tm)
docs <- c(D1 = "very good, very bad, you are great",
D2 = "very bad, good restaurent, nice place to visit")
dd <- Corpus(VectorSource(docs))
dd <- tm_map(dd, function(x) {
PlainTextDocument(
gsub("\\s+","~",strsplit(x,",\\s*")[[1]]),
id=ID(x)
)
})
inspect(dd)
# A corpus with 2 text documents
#
# The metadata consists of 2 tag-value pairs and a data frame
# Available tags are:
# create_date creator
# Available variables in the data frame are:
# MetaID
# $D1
# very~good
# very~bad
# you~are~great
#
# $D2
# very~bad
# good~restaurent
# nice~place~to~visit
dtm <- DocumentTermMatrix(dd, control = list(weighting = weightTfIdf))
as.matrix(dtm)
This will produce
# Docs good~restaurent nice~place~to~visit very~bad very~good you~are~great
# D1 0.0000000 0.0000000 0 0.3333333 0.3333333
# D2 0.3333333 0.3333333 0 0.0000000 0.0000000

For anyone using text2vec this is quite handy solution based on custom vocabulary:
library(text2vec)
doc1 <- 'very good, very bad, you are great'
doc2 <- 'very bad, good restaurent, nice place to visit'
docs <- list(doc1, doc2)
docs <- sapply(docs, strsplit, split=', ')
vocab <- vocab_vectorizer(create_vocabulary(unique(unlist(docs))))
dtm <- create_dtm(itoken(docs), vocab)
dtm
This will result in:
2 x 5 sparse Matrix of class "dgCMatrix"
very good very bad you are great good restaurent nice place to visit
1 1 1 1 . .
2 . 1 . 1 1
Such approach allows for more customization in loading files and preparing vacabulary.

Related

Output text with both unigrams and bigrams in R

I'm trying to figure out how to identify unigrams and bigrams in a text in R, and then keep both in the final output based on a threshold. I've done this in Python with gensim's Phraser model, but haven't figured out how to do it in R.
For example:
strings <- data.frame(text = 'This is a great movie from yesterday', 'I went to the movies', 'Great movie time at the theater', 'I went to the theater yesterday')
#Pseudocode below
bigs <- tokenize_uni_bi(strings, n = 1:2, threshold = 2)
print(bigs)
[['this', 'great_movie', 'yesterday'], ['went', 'movies'], ['great_movie', 'theater'], ['went', 'theater', 'yesterday']]
Thank you!

You could use quanteda framework for this:
library(quanteda)
# tokenize, tolower, remove stopwords and create ngrams
my_toks <- tokens(strings$text)
my_toks <- tokens_tolower(my_toks)
my_toks <- tokens_remove(my_toks, stopwords("english"))
bigs <- tokens_ngrams(my_toks, n = 1:2)
# turn into document feature matrix and filter on minimum frequency of 2 and more
my_dfm <- dfm(bigs)
dfm_trim(my_dfm, min_termfreq = 2)
Document-feature matrix of: 4 documents, 6 features (50.0% sparse).
features
docs great movie yesterday great_movie went theater
text1 1 1 1 1 0 0
text2 0 0 0 0 1 0
text3 1 1 0 1 0 1
text4 0 0 1 0 1 1
# use convert function to turn this into a data.frame
Alternatively you could use tidytext package, tm, tokenizers etc etc. It all depends a bit on the output you are expecting.
An example using tidytext / dplyr looks like this:
library(tidytext)
library(dplyr)
strings %>%
unnest_ngrams(bigs, text, n = 2, n_min = 1, ngram_delim = "_", stopwords = stopwords::stopwords()) %>%
count(bigs) %>%
filter(n >= 2)
bigs n
1 great 2
2 great_movie 2
3 movie 2
4 theater 2
5 went 2
6 yesterday 2
Both quanteda and tidytext have a lot of online help available. See vignettes wiht both packages on cran.

R- textminig with tm packages

I want to use the tm packakes, so I have created the next code:
x<-inspect(DocumentTermMatrix(docs,
list(dictionary = c("survive", "survival"))))
I need to find any word beginning with "surviv" in the text, such as to include words like "survival" "survivor" "survive" and others. Is there any way to write that condition - words begining with "surviv"- in the code?

What you could do is creating a general DocumentTermMatrix and then filtering it to keep only rows which start with surviv using startsWith.
corp <- VCorpus(VectorSource(c("survival", "survivance", "survival",
"random", "yes", "survive")))
tdm <- TermDocumentMatrix(corp)
inspect(tdm[startsWith(rownames(tdm), "surv"),])

You can stem the words with stemDocument. Then you need to look only for surviv and survivor as these are the stem words you are looking for. Using and expanding the list of words from #AshOfFire
my_corpus <- VCorpus(VectorSource(c("survival", "survivance", "survival",
"random", "yes", "survive", "survivors", "surviving")))
my_corpus <- tm_map(my_corpus, stemDocument)
my_dtm <- DocumentTermMatrix(my_corpus, control = list(dictionary = c("surviv", "survivor")))
inspect(my_dtm)
<<DocumentTermMatrix (documents: 8, terms: 2)>>
Non-/sparse entries: 6/10
Sparsity : 62%
Maximal term length: 8
Weighting : term frequency (tf)
Sample :
Terms
Docs surviv survivor
1 1 0
2 1 0
3 1 0
4 0 0
5 0 0
6 1 0
7 0 1
8 1 0
p.s. only do x <- inspect(DocumentTermMatrix(docs, .....) if you want to get the first 10 rows and 10 columns in your x variable.

Split up ngrams in (sparse) document-feature matrix

This is a follow up question to this one. There, I asked if it's possible to split up ngram-features in a document-feature matrix (dfm-class from the quanteda-package) in such a way that e.g. bigrams result in two separate unigrams.
For better understanding: I got the ngrams in the dfm from translating the features from German to English. Compounds ("Emissionsminderung") are quiet common in German but not in English ("emission reduction").
library(quanteda)
eg.txt <- c('increase in_the great plenary',
'great plenary emission_reduction',
'increase in_the emission_reduction emission_increase')
eg.corp <- corpus(eg.txt)
eg.dfm <- dfm(eg.corp)
There was a nice answer to this example, which works absolutely fine for relatively small matrices as the one above. However, as soon as the matrix is bigger, I'm constantly running into the following memory error.
> #turn the dfm into a matrix
> DF <- as.data.frame(eg.dfm)
Error in asMethod(object) :
Cholmod-error 'problem too large' at file ../Core/cholmod_dense.c, line 105
Hence, is there a more memory efficient way to solve this ngram-problem or to deal with large (sparse) matrices/data frames? Thank you in advance!

The problem here is that you are turning the sparse (dfm) matrix into a dense object when you call as.data.frame(). Since the typical document-feature matrix is 90% sparse, this means you are creating something larger than you can handle. The solution: use dfm handling functions to maintain the sparsity.
Note that this is both a better solution than proposed in the linked question but also should work efficiently for your much larger object.
Here's a function that does that. It allows you to set the concatenator character(s), and works with ngrams of variable sizes. Most importantly, it uses dfm methods to make sure the dfm remains sparse.
# function to split and duplicate counts in features containing
# the concatenator character
dfm_splitgrams <- function(x, concatenator = "_") {
# separate the unigrams
x_unigrams <- dfm_remove(x, concatenator, valuetype = "regex")
# separate the ngrams
x_ngrams <- dfm_select(x, concatenator, valuetype = "regex")
# split into components
split_ngrams <- stringi::stri_split_regex(featnames(x_ngrams), concatenator)
# get a repeated index for the ngram feature names
index_split_ngrams <- rep(featnames(x_ngrams), lengths(split_ngrams))
# subset the ngram matrix using the (repeated) ngram feature names
x_split_ngrams <- x_ngrams[, index_split_ngrams]
# assign the ngram dfm the feature names of the split ngrams
colnames(x_split_ngrams) <- unlist(split_ngrams, use.names = FALSE)
# return the column concatenation of unigrams and split ngrams
suppressWarnings(cbind(x_unigrams, x_split_ngrams))
}
So:
dfm_splitgrams(eg.dfm)
## Document-feature matrix of: 3 documents, 9 features (40.7% sparse).
## 3 x 9 sparse Matrix of class "dfmSparse"
## features
## docs increase great plenary in the emission reduction emission increase
## text1 1 1 1 1 1 0 0 0 0
## text2 0 1 1 0 0 1 1 0 0
## text3 1 0 0 1 1 1 1 1 1
Here, splitting ngrams results in new "unigrams" of the same feature name. You can (re)combine them efficiently with dfm_compress():
dfm_compress(dfm_splitgrams(eg.dfm))
## Document-feature matrix of: 3 documents, 7 features (33.3% sparse).
## 3 x 7 sparse Matrix of class "dfmSparse"
## features
## docs increase great plenary in the emission reduction
## text1 1 1 1 1 1 0 0
## text2 0 1 1 0 0 1 1
## text3 2 0 0 1 1 2 1

Keep EXACT words from R corpus

From answer posted on: Keep document ID with R corpus by #MrFlick
I am trying to slightly modify what is a great example.
Question: How do I modify the content_transformer function to keep only exact words? You can see in the inspect output that wonderful is counted as wonder and ratio is counted as rationale. I do not have a strong understanding of gregexpr and regmatches.
Create data frame:
dd <- data.frame(
id = 10:13,
text = c("No wonderful, then, that ever",
"So that in many cases such a ",
"But there were still other and",
"Not even at the rationale")
, stringsAsFactors = F
)
Now, in order to read special attributes from a data.frame, we will use the readTabular function to make our own custom data.frame reader
library(tm)
myReader <- readTabular(mapping = list(content = "text", id = "id"))
specify the column to use for the contents and the id in the data.frame. Now we read it in with DataframeSource but use our custom reader.
tm <- VCorpus(DataframeSource(dd), readerControl = list(reader = myReader))
Now if we want to only keep a certain set of words, we can create our own content_transformer function. One way to do this is
keepOnlyWords <- content_transformer(function(x, words) {
regmatches(x,
gregexpr(paste0("\\b(", paste(words, collapse = "|"), "\\b)"), x)
, invert = T) <- " "
x
})
This will replace everything that's not in the word list with a space. Note that you probably want to run stripWhitespace after this. Thus our transformations would look like
keep <- c("wonder", "then", "that", "the")
tm <- tm_map(tm, content_transformer(tolower))
tm <- tm_map(tm, keepOnlyWords, keep)
tm <- tm_map(tm, stripWhitespace)
Inspect dtm matrix:
> inspect(dtm)
<<DocumentTermMatrix (documents: 4, terms: 4)>>
Non-/sparse entries: 7/9
Sparsity : 56%
Maximal term length: 6
Weighting : term frequency (tf)
Terms
Docs ratio that the wonder
10 0 1 1 1
11 0 1 0 0
12 0 0 1 0
13 1 0 1 0

Switching grammars to tidytext, your current transformation would be
library(tidyverse)
library(tidytext)
library(stringr)
dd %>% unnest_tokens(word, text) %>%
mutate(word = str_replace_all(word, setNames(keep, paste0('.*', keep, '.*')))) %>%
inner_join(data_frame(word = keep))
## id word
## 1 10 wonder
## 2 10 the
## 3 10 that
## 4 11 that
## 5 12 the
## 6 12 the
## 7 13 the
Keeping exact matches is easier, as you can use joins (which use ==) instead of regex:
dd %>% unnest_tokens(word, text) %>%
inner_join(data_frame(word = keep))
## id word
## 1 10 then
## 2 10 that
## 3 11 that
## 4 13 the
To take it back to a document-term matrix,
library(tm)
dd %>% mutate(id = factor(id)) %>% # to keep empty rows of DTM
unnest_tokens(word, text) %>%
inner_join(data_frame(word = keep)) %>%
mutate(i = 1) %>%
cast_dtm(id, word, i) %>%
inspect()
## <<DocumentTermMatrix (documents: 4, terms: 3)>>
## Non-/sparse entries: 4/8
## Sparsity : 67%
## Maximal term length: 4
## Weighting : term frequency (tf)
##
## Terms
## Docs then that the
## 10 1 1 0
## 11 0 1 0
## 12 0 0 0
## 13 0 0 1
Currently, your function is matching words with a boundary before or after. To change it to before and after, change the collapse parameter to include boundaries:
tm <- VCorpus(DataframeSource(dd), readerControl = list(reader = myReader))
keepOnlyWords<-content_transformer(function(x,words) {
regmatches(x,
gregexpr(paste0("(\\b", paste(words, collapse = "\\b|\\b"), "\\b)"), x)
, invert = T) <- " "
x
})
tm <- tm_map(tm, content_transformer(tolower))
tm <- tm_map(tm, keepOnlyWords, keep)
tm <- tm_map(tm, stripWhitespace)
inspect(DocumentTermMatrix(tm))
## <<DocumentTermMatrix (documents: 4, terms: 3)>>
## Non-/sparse entries: 4/8
## Sparsity : 67%
## Maximal term length: 4
## Weighting : term frequency (tf)
##
## Terms
## Docs that the then
## 10 1 0 1
## 11 1 0 0
## 12 0 0 0
## 13 0 1 0

I got same result as #alistaire with tm, with the following modified line in keepOnlyWords content transformer first defined by #BEMR:
gregexpr(paste0("\\b(", paste(words, collapse = "|"), ")\\b"), x)
There was a misplaced ")" in gregexpr first specified by #BEMR i.e. should be ")\\b" not "\\b)"
I think the above gregexpr is equivalent to that specified by #alistaire:
gregexpr(paste0("(\\b", paste(words, collapse = "\\b|\\b"), "\\b)"), x)

How to tabulate term fequency data with or without using document term matrix?

I am trying to tabulated the following data:
Input
Big Fat Apple 3
Small Fat Apple 2
Little Small Pear 1
Expected output:
Big = 3
Fat = 3+2=5
Apple = 3+2=5
Small = 2+1=3
Little = 1
Pear = 1
I was trying to get document term matrix to treat this as corpus, but I am unable to find a way to do in a way that "Big Fat Apple" would be actually appearing in the corpus: "Big Fat Apple Big Fat Apple Big Fat Apple".
Is there any methods to do such tabulation? Ideally I would love to have it in the form of input into document term matrix so that I could use other functions.

Using the sample data from #scoa's answer, you can try using cSplit from my "splitstackshape" package, like this:
> library(splitstackshape)
> cSplit(d, "text", " ", "long")[, sum(n), by = text]
text V1
1: Big 3
2: Fat 5
3: Apple 5
4: Small 3
5: Little 1
6: Pear 1

To transform such a data frame into a corpus, you will have to tell it explicitly that each text should be reproduced x times, using rep()
d <- data.frame(
text=c("Big Fat Apple",
"Small Fat Apple",
"Little Small Pear"),
n = c(3,2,1),stringsAsFactors=FALSE)
library(tm)
corpus <- Corpus(VectorSource(rep(d$text,d$n)))
dtm <- DocumentTermMatrix(corpus)
You can then compute terms frequency (see How to find term frequency within a DTM in R?).

Might I suggest the quanteda package (quantitative analysis of textual data). What you want can be approached either by tokenising and tabulating, or by creating a document-feature matrix (here, with a single document):
cat("Big Fat Apple 3
Small Fat Apple 2
Little Small Pear 1\n", file = "example.txt")
mydata <- read.table("example.txt", stringsAsFactors = FALSE)
mydata <- paste(with(mydata, rep(paste(V1, V2, V3), V4)), collapse = " ")
mydata
## [1] "Big Fat Apple Big Fat Apple Big Fat Apple Small Fat Apple Small Fat Apple Little Small Pear"
# use the quanteda package as an alternative to tm
install.packages("quanteda")
library(quanteda)
# can simply tokenize and tabulate
table(tokenize(mydata))
## apple big fat little pear small
## 5 3 5 1 1 3
# alternatively, can create a one-document document-term matrix
myDfm <- dfm(mydata)
## Creating a dfm from a character vector ...
## ... indexing 1 document
## ... tokenizing texts, found 18 total tokens
## ... cleaning the tokens, 0 removed entirely
## ... summing tokens by document
## ... indexing 6 feature types
## ... building sparse matrix
## ... created a 1 x 6 sparse dfm
## ... complete. Elapsed time: 0.011 seconds.
myDfm
## Document-feature matrix of: 1 document, 6 features.
## 1 x 6 sparse Matrix of class "dfmSparse"
## features
## docs apple big fat little pear small
## text1 5 3 5 1 1 3
Happy to help with any questions you might have about quanteda, as we are actively seeking to improve it.

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

Corpus build with phrases - r

Related

Output text with both unigrams and bigrams in R

R- textminig with tm packages

Split up ngrams in (sparse) document-feature matrix

Keep EXACT words from R corpus

How to tabulate term fequency data with or without using document term matrix?

Categories

Resources