I'm trying to obtain a document-term matrix from a book in Italian. I have the PDF file of this book and I wrote a few lines of code:
#install.packages("pdftools")
library(pdftools)
library(tm)
text <- pdf_text("IoRobot.pdf")
# collapse pdf pages into 1
text <- paste(unlist(text), collapse ="")
myCorpus <- VCorpus(VectorSource(text))
mydtm <-DocumentTermMatrix(myCorpus,control = list(removeNumbers = TRUE, removePunctuation = TRUE,
stopwords=stopwords("it"), stemming=TRUE))
inspect(mydtm)
The result I obtained after the last row is:
<<DocumentTermMatrix (documents: 1, terms: 10197)>>
Non-/sparse entries: 10197/0
Sparsity : 0%
Maximal term length: 39
Weighting : term frequency (tf)
Sample :
Terms
Docs calvin cosa donovan esser piú poi powel prima quando robot
1 201 191 254 193 288 211 287 166 184 62
I noticed that the sparsity is 0%. Is this normal?
Yes, it seems correct.
A document-term matrix has one row per document and one column per term; each cell holds the weight of that term in that document (here, its raw frequency), and it is 0 when the term does not appear in the document at all.
Sparsity is an indicator of the proportion of 0s in the document-term matrix: a term is sparse with respect to a given document when it does not occur in that document.
To understand these points, let's look at a reproducible example that creates a situation similar to yours:
library(tm)
text <- c("here some text")
corpus <- VCorpus(VectorSource(text))
DTM <- DocumentTermMatrix(corpus)
DTM
<<DocumentTermMatrix (documents: 1, terms: 3)>>
Non-/sparse entries: 3/0
Sparsity : 0%
Maximal term length: 4
Weighting : term frequency (tf)
Looking at the output, we can see that there is one document (so the DTM built from that corpus has a single row).
Having a look at it:
as.matrix(DTM)
Terms
Docs here some text
1 1 1 1
Now it could be easier to understand the output:
You have one doc with three terms:
<<DocumentTermMatrix (documents: 1, terms: 3)>>
Your non-sparse entries (i.e., cells != 0 in the DTM) number 3, and your sparse entries number 0:
Non-/sparse entries: 3/0
So your sparsity is 0%: a one-document corpus cannot contain any 0s, because every term in the vocabulary necessarily occurs in that single document, so every cell is non-zero:
Sparsity : 0%
Now look at a different example, one that does have sparse terms:
text <- c("here some text", "other text")
corpus <- VCorpus(VectorSource(text))
DTM <- DocumentTermMatrix(corpus)
DTM
<<DocumentTermMatrix (documents: 2, terms: 4)>>
Non-/sparse entries: 5/3
Sparsity : 38%
Maximal term length: 5
Weighting : term frequency (tf)
as.matrix(DTM)
Terms
Docs here other some text
1 1 0 1 1
2 0 1 0 1
Now you have 3 sparse entries (the 5/3 in "Non-/sparse entries"), and 3/8 = 0.375, i.e., the 38% sparsity reported above.
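You can verify this arithmetic directly on the small example above with nothing but base R:
m <- as.matrix(DTM)
sum(m != 0)              # 5 non-sparse entries
sum(m == 0)              # 3 sparse entries
sum(m == 0) / length(m)  # 0.375, which tm reports as 38% sparsity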
I have a document-term matrix dtm, for example:
dtm
<<DocumentTermMatrix (documents: 50, terms: 50)>>
Non-/sparse entries: 220/2497
Sparsity : 100%
Maximal term length: 7
Weighting : term frequency (tf)
Now I want to transfer it into a list of matrices, each representing one document. This is to satisfy the input format required by the STM package:
[[1]]
[,1] [,2] [,3] [,4]
[1,] 23 33 42 117
[2,] 2 1 3 1
[[2]]
[,1] [,2] [,3] [,4]
[1,] 2 19 93 168
[2,] 2 2 1 1
I am thinking of finding all the non-zero entries of dtm and turning them into matrices, one row at a time, so:
mat = matrix()
dtm.to.mat = function(x){
mat[1,] = x[x != 0]
mat[2,] = colnames(x[x != 0])
return(mat)
}
matrix = list(apply(dtm, 1, dtm.to.mat))
However,
x[x != 0]
just won't work. The error says:
$ operator is invalid for atomic vectors
I was wondering why this is the case. If I convert x to a matrix beforehand, it doesn't give me this error; however, I actually have a dtm with approximately 2,500,000 rows, so I fear this would be very inefficient.
Me again!
I wouldn't use a dtm as the input for the stm package unless your data is particularly strange. Use the function stm::textProcessor. You can pass the documents as raw (unprocessed) text in a character vector of any length, and you can specify the metadata as you wish:
Suppose you have a dataframe df with a column called df$documents which is your raw text and df$meta which is your covariate:
processed <- textProcessor(df$documents, metadata = df$meta, lowercase = TRUE,
removestopwords = TRUE, removenumbers = TRUE, removepunctuation = TRUE,
stem = TRUE, wordLengths = c(3, Inf))
stm_50 <- stm(documents = processed$documents, vocab = processed$vocab,
K = 50, prevalence = ~ meta, init.type = "Spectral", seed = 57468)
This will run a 50 topic STM.
textProcessor will deal with empty documents and their associated metadata.
Edit: stm::textProcessor is technically just a wrapper for the tm package. But it is designed to remove problem documents, while dealing with their associated covariates.
Also, the metadata argument can take a dataframe if you have multiple covariates. In that case you would also need to modify the prevalence argument in the stm() call above.
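For instance, with two hypothetical covariates age and gender in df, a sketch might look like this (the column names are my own; adapt them to your data):
# pass all covariates through as a data frame
processed <- textProcessor(df$documents, metadata = df[, c("age", "gender")])
# list each covariate in the prevalence formula and point stm at the metadata
stm_50 <- stm(documents = processed$documents, vocab = processed$vocab,
              data = processed$meta, K = 50,
              prevalence = ~ age + gender, init.type = "Spectral")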
If you have something tricky like this, I'd switch over to the quanteda package, as it has nice converters to stm. If you want to stick with tm, have you tried using stm::convertCorpus to change the object into the list structure stm needs?
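For reference, the quanteda route looks roughly like this (a sketch; raw_texts stands in for your own character vector of documents):
library(quanteda)
# build a document-feature matrix straight from the raw text
dfmat <- dfm(raw_texts)
# convert() reshapes a dfm into the documents/vocab list structure stm expects
stm_input <- convert(dfmat, to = "stm")
# stm_input$documents and stm_input$vocab can then be passed to stm()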
I would like to append two Document Term Matrices together. I have one row of data and would like to use different control functions on them (an n-gram tokenizer, removal of stopwords, and wordLength bounds for text, none of these for my non-text fields).
When I combine them with tm_combine, c(dtm_text, dtm_inputs), it adds the second set as a new row. I want to append these terms to the same row instead.
library("tm")
BigramTokenizer <-
function(x)
unlist(lapply(ngrams(words(x), 2), paste, collapse = " "),
use.names = FALSE)
# Data to be tokenized
txt_fields <- paste("i like your store","i love your products","i am happy")
# Data not to be tokenized
other_inputs <- paste("cd1_ABC","cd2_555","cd3_7654")
# NGram tokenize text data
dtm_text <- DocumentTermMatrix(Corpus(VectorSource(txt_fields)),
control = list(
tokenize = BigramTokenizer,
stopwords=TRUE,
wordLengths=c(2, Inf),
bounds=list(global = c(1,Inf))))
# Do not perform tokenization of other inputs
dtm_inputs <- DocumentTermMatrix(Corpus(VectorSource(other_inputs)),
control = list(
bounds = list(global = c(1,Inf))))
# DESIRED OUTPUT
<<DocumentTermMatrix (documents: 1, terms: 12)>>
Non-/sparse entries: 12/0
Sparsity : 0%
Maximal term length: 13
Weighting : term frequency (tf)
Terms
Docs am happy happy like like your love love your products products am store store love
1 1 1 1 1 1 1 1 1 1 1
Terms
Docs your products your store cd1_abc cd2_555 cd3_7654
1 1 1 1 1 1
I suggest using text2vec (but I'm biased, since I'm the author).
library(text2vec)
# Data to be tokenized
txt_fields <- paste("i like your store","i love your products","i am happy")
# Data not to be tokenized
other_inputs <- paste("cd1_ABC","cd2_555","cd3_7654")
stopwords = tm::stopwords("en")
# tokenize by whitespace
txt_tokens = strsplit(txt_fields, ' ', TRUE)
vocab = create_vocabulary(itoken(txt_tokens), ngram = c(1, 2), stopwords = stopwords)
# if you need word lengths:
# vocab$vocab = vocab$vocab[nchar(terms) > 1]
# but note, it will not remove "i_am", etc.
# you can add the word "i" to stopwords to remove such terms
txt_vectorizer = vocab_vectorizer(vocab)
dtm_text = create_dtm(itoken(txt_fields), vectorizer = txt_vectorizer)
# also tokenize by whitespace, but this vocabulary won't get bigrams in the next step
other_inputs_tokens = strsplit(other_inputs, ' ', TRUE)
vocab_other = create_vocabulary(itoken(other_inputs_tokens))
other_vectorizer = vocab_vectorizer(vocab_other)
dtm_other = create_dtm(itoken(other_inputs), vectorizer = other_vectorizer)
# combine the two sparse matrices column-wise
result = cbind(dtm_text, dtm_other)
dtm_combined = as.DocumentTermMatrix(result, weighting = weightTf)
inspect(dtm_combined)
# <<DocumentTermMatrix (documents: 1, terms: 8)>>
# Non-/sparse entries: 8/0
# Sparsity : 0%
# Maximal term length: 8
# Weighting : term frequency (tf)
#
# Terms
# Docs happy like love products store cd1_abc cd2_555 cd3_7654
# 1 1 1 1 1 1 1 1 1
But it will give wrong results if the same words occur in both dtm_text and dtm_other. These words won't be combined and will appear twice in dtm_combined.
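If that is a concern, one possible workaround (my own sketch, not part of the original answer) is to sum up duplicate-named columns before converting:
library(Matrix)
nm <- colnames(result)
if (anyDuplicated(nm) > 0) {
  u <- unique(nm)
  # indicator matrix mapping each original column onto its unique term
  map <- sparseMatrix(i = seq_along(nm), j = match(nm, u), x = 1)
  result <- result %*% map
  colnames(result) <- u
}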
I am using findAssocs() from the tm package on a document-term matrix to identify words which are associated with particular term(s) across the various documents in a corpus.
My problem is that I get different output when giving a vector of terms as input to the function compared to giving a single term as input.
Here is my example.
library(tm)
txt <- c("alpha bravo", "alpha charlie", "alpha charlie", "zulu")
corp <- Corpus(VectorSource(txt))
dtm <- DocumentTermMatrix(corp)
Returns the following dtm
> as.matrix(dtm)
Terms
Docs alpha bravo charlie zulu
1 1 1 0 0
2 1 0 1 0
3 1 0 1 0
4 0 0 0 1
If I would want to identify all terms associated with "alpha" I get the following output (as intended):
> findAssocs(dtm, "alpha", 0.00)
$alpha
charlie bravo
0.58 0.33
I could do the same for "bravo" and get the following output (as intended):
> findAssocs(dtm, "bravo", 0.00)
$bravo
alpha
0.33
As I would like to find those associations for a number of terms I have passed a vector to findAssocs in order to get the required output. However, if I pass a vector of terms (chr) to the function the output is different from the one I get for single inputs:
> findAssocs(dtm, c("alpha","bravo"), 0.00)
$alpha
charlie
0.58
$bravo
numeric(0)
Here the association between "alpha" and "bravo" is omitted, which is not the behavior I would have expected. The function seems to treat the individual terms independently of each other and thus does not report the correlation between "alpha" and "bravo" if they are both passed to the function in a vector.
Can anyone explain that behavior and tell me how to avoid it? As a workaround I could apply the function to each single term separately, but that is not really handy...
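For completeness, that one-term-at-a-time workaround would look like this (a sketch using the dtm from above):
terms <- c("alpha", "bravo")
# query each term separately so no pairwise association is dropped
assocs <- lapply(terms, function(x) findAssocs(dtm, x, 0.00)[[1]])
names(assocs) <- terms
assocs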
UPDATE
What I find odd is that the correlation between "alpha" and "bravo" is not omitted if we plot the associations, e.g. through the following code:
> freqTerm <- findFreqTerms(dtm, 1)
> freqTerm
[1] "alpha" "bravo" "charlie" "zulu"
plot(dtm, term=freqTerm, corThreshold=0.0, weighting=T, attrs=list(node=list(fixedsize=FALSE, shape="ellipse")))
How is plot(dtm, term = freqTerm, ...) different from findAssocs()?
tm::findAssocs() omits direct comparisons among the query terms for exactly the reasons stated in the comment by @Steven Beauport. Given that you are searching for a small set of terms likely to be highly correlated with one another, this seems more like a bug than a feature. It is illustrated by the example in that function's documentation (see ?tm::findAssocs), where the terms oil and opec are the most similar, but this is masked by the omission of each from the other's association vector.
An alternative is to use the equivalent feature from the quanteda package:
library(quanteda)
txt <- c("alpha bravo", "alpha charlie", "alpha charlie", "zulu")
corp <- corpus(txt)
dtm <- dfm(corp, verbose = FALSE)
# this also works fine if you want to go straight from text:
# dtm <- dfm(txt, verbose = FALSE)
(simlist <- similarity(dtm, c("alpha","bravo"), margin = "features"))
## similarity Matrix:
## $alpha
## charlie bravo zulu
## 0.5774 0.3333 -1.0000
##
## $bravo
## alpha zulu charlie
## 0.3333 -0.3333 -0.5774
Or if you prefer it as a matrix:
as.matrix(simlist)
## alpha bravo
## alpha 1.0000000 0.3333333
## charlie 0.5773503 -0.5773503
## bravo 0.3333333 1.0000000
## zulu -1.0000000 -0.3333333
similarity() can compute cosine similarity as well as the other similarity measures defined in the proxy package, but currently only the (Pearson) correlation and cosine methods are implemented in fully sparse computation; the others are not (yet). Furthermore, by setting margin = "documents" you can compare documents instead of terms, for instance for clustering.
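As an example of the document margin (a sketch building on the same dfm; the cosine method follows the answer's description of similarity()):
# compare documents rather than features
docsim <- similarity(dtm, margin = "documents", method = "cosine")
# turn similarities into distances and cluster the documents
hc <- hclust(as.dist(1 - as.matrix(docsim)))
plot(hc)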
I have a corpus of 11 text documents. I have found word associations using the commands:
findAssocs(dtm, c("youngster","campaign"), corlimit=0.9)
findAssocs(dtms, "corruption", corlimit=0.9)
dtm is a document term matrix.
dtm <- DocumentTermMatrix(docs)
where docs is the corpus.
dtms is the document-term matrix after removing terms with sparsity above 10% (i.e., keeping only terms that appear in at least 90% of the documents).
dtms <- removeSparseTerms(dtm, 0.1)
I would like to plot the correlated terms I got against (i) two specific words and (ii) one specific word.
I tried following this post: Plot highly correlated words against a specific word of interest
toi <- "corruption" # term of interest
corlimit <- 0.9 # lower correlation bound limit.
cor_0.9 <- data.frame(corr = findAssocs(dtm, toi, corlimit)[,1],terms=row.names(findAssocs(dtm, toi, corlimit)))
But unfortunately this code gives me an error:
Error in findAssocs(dtm, toi, corlimit)[, 1] : incorrect number of dimensions
This is the structure of the document term matrix:
dtm
<<DocumentTermMatrix (documents: 11, terms: 1847)>>
Non-/sparse entries: 8024/12293
Sparsity : 61%
Maximal term length: 23
Weighting : term frequency (tf)
and in the environment it is of the form:
dtm
List of 6
 $ i       : int [1:8024] 1 1 1 1 1 ...
 $ j       : int [1:8024] 17 29 34 43 47 ...
 $ v       : num [1:8024] 9 4 9 5 5 ...
 $ nrow    : int 11
 $ ncol    : int 1847
 $ dimnames: List of 2
  ..$ Docs : chr [1:11] "character(0)" "character(0)" "character(0)" ...
  ..$ Terms: chr [1:1847] "campaigning" "a" ... (truncated)
 - attr(*, "class")    = chr [1:2] "DocumentTermMatrix" "simple_triplet_matrix"
 - attr(*, "weighting")= chr [1:2] "term frequency" "tf"
How do I plot word correlations for a single word and multiple words? Please help.
Here is the output of
findAssocs(dtm, c("youngster","campaign"), corlimit=0.9)
$youngster
character colleges controversi expect corrupt much
1.00 1.00 1.00 1.00 0.99 0.99
okay saritha existing leads satisfi social
0.99 0.99 0.98 0.98 0.98 0.98
$campaign
basic make lack internal general method satisfied time
0.95 0.95 0.94 0.93 0.92 0.92 0.92 0.92
A slightly different approach is required for two words, here's a quick attempt:
require(tm)
data("crude")
tdm <- TermDocumentMatrix(crude)
# Compute correlations and store in data frame...
toi1 <- "oil" # term of interest
toi2 <- "winter"
corlimit <- 0.7 # lower correlation bound limit.
corr1 <- findAssocs(tdm, toi1, corlimit)[[1]]
corr1 <- cbind(read.table(text = names(corr1), stringsAsFactors = FALSE), corr1)
corr2 <- findAssocs(tdm, toi2, corlimit)[[1]]
corr2 <- cbind(read.table(text = names(corr2), stringsAsFactors = FALSE), corr2)
# join them together
library(dplyr)
two_terms_corrs <- full_join(corr1, corr2)
# gather for plotting
library(tidyr)
two_terms_corrs_gathered <- gather(two_terms_corrs, term, correlation, corr1:corr2)
# insert the actual terms of interest so they show up on the legend
two_terms_corrs_gathered$term <- ifelse(two_terms_corrs_gathered$term == "corr1", toi1, toi2)
# Draw the plot...
require(ggplot2)
ggplot(two_terms_corrs_gathered, aes(x = V1, y = correlation, colour = term ) ) +
geom_point(size = 3) +
ylab(paste0("Correlation with the terms ", "\"", toi1, "\"", " and ", "\"", toi2, "\"")) +
theme_bw() +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))
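For the single-word case, part (ii) of the question, the same pattern collapses to one findAssocs() call (a sketch reusing toi1 and corlimit from above):
corr <- findAssocs(tdm, toi1, corlimit)[[1]]
single_term <- data.frame(term = names(corr), correlation = unname(corr))
ggplot(single_term, aes(x = reorder(term, -correlation), y = correlation)) +
  geom_point(size = 3) +
  ylab(paste0("Correlation with the term \"", toi1, "\"")) +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))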
I am creating a trigram and a quadgram model using RWeka, and I notice some odd behavior.
For the trigram
TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
tdm <- TermDocumentMatrix(docs, control = list(tokenize = TrigramTokenizer))
> dim(tdm)
[1] 1540099 3
> tdm
<<TermDocumentMatrix (terms: 1540099, documents: 3)>>
Non-/sparse entries: 1548629/3071668
Sparsity : 66%
Maximal term length: 180
Weighting : term frequency (tf)
When I remove sparse terms, the roughly 1.5 million rows above shrink to 8307:
> b <- removeSparseTerms(tdm, 0.66)
> dim(b)
[1] 8307 3
For the quadgram, the removal does not affect the matrix at all:
quadgramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 4, max = 4))
tdm <- TermDocumentMatrix(docs, control = list(tokenize = QuadgramTokenizer))
<<TermDocumentMatrix (terms: 1427403, documents: 3)>>
Non-/sparse entries: 1427936/2854273
Sparsity : 67%
Maximal term length: 185
Weighting : term frequency (tf)
> dim(tdm)
[1] 1427403 3
> tdm <- removeSparseTerms(tdm, 0.67)
> dim(tdm)
[1] 1427403 3
It still has roughly 1.4 million terms after the removal of sparse terms.
This does not look right.
Please let me know if I am doing something wrong
Regards
Ganesh
This is weird. The logical behaviour would be for the removal of sparse terms to cut a lot in both cases, since trigrams and quadgrams are far less common than single grams. Do you have any other QuadgramTokenizer object in your session? Your original function is defined with a small "q", quadgramTokenizer, while the control list passes QuadgramTokenizer with a capital "Q". I am wondering why it did not throw an error, though; it might have treated the tokenizer as empty.
I think it must be something as simple as this. Check it out, and if not I'll run it with a data sample and see what could be wrong here.
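If that diagnosis is right, the fix is simply to make the two names match, e.g. (a sketch keeping the capitalized spelling in both places):
# define the tokenizer under the exact name referenced in the control list
QuadgramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 4, max = 4))
tdm <- TermDocumentMatrix(docs, control = list(tokenize = QuadgramTokenizer))
tdm <- removeSparseTerms(tdm, 0.67)
dim(tdm)  # the term count should now drop sharply, as it did for the trigrams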