I am creating trigram and quadgram models using RWeka and have noticed some odd behavior.
For the trigram:
TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
tdm <- TermDocumentMatrix(docs, control = list(tokenize = TrigramTokenizer))
> dim(tdm)
[1] 1540099 3
> tdm
<<TermDocumentMatrix (terms: 1540099, documents: 3)>>
Non-/sparse entries: 1548629/3071668
Sparsity : 66%
Maximal term length: 180
Weighting : term frequency (tf)
When I remove sparse terms, it shrinks the above ~1.5 million rows to 8307:
> b <- removeSparseTerms(tdm, 0.66)
> dim(b)
[1] 8307 3
For the quadgram, removing sparse terms does not affect it at all:
quadgramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 4, max = 4))
tdm <- TermDocumentMatrix(docs, control = list(tokenize = QuadgramTokenizer))
<<TermDocumentMatrix (terms: 1427403, documents: 3)>>
Non-/sparse entries: 1427936/2854273
Sparsity : 67%
Maximal term length: 185
Weighting : term frequency (tf)
> dim(tdm)
[1] 1427403 3
> tdm <- removeSparseTerms(tdm, 0.67)
> dim(tdm)
[1] 1427403 3
It still has ~1.4 million terms after removal of sparse terms.
This does not look right.
Please let me know if I am doing something wrong.
Regards
Ganesh
This is weird. The logical behaviour would be for removing sparse terms to remove a lot in both cases, since trigrams and quadgrams are rarer than single grams. Do you have any other QuadgramTokenizer object in your session? Your original function is defined with a small "q" (quadgramTokenizer), yet you pass QuadgramTokenizer to TermDocumentMatrix. I am wondering why it did not throw an error; it might have fallen back to a stale object or an empty tokenizer.
I think it must be something as simple as this. Check it out, and if not, I'll run it with a data sample and see what could be wrong here.
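One more thing worth checking, independent of the capitalisation issue: with only 3 documents, the cutoff arithmetic of removeSparseTerms() can explain the quadgram result on its own. As far as I know, tm keeps a term when its document frequency exceeds ncol * (1 - sparse), so a quick sketch:
# removeSparseTerms() keeps a term when docfreq > ncol * (1 - sparse).
# With 3 documents:
3 * (1 - 0.66)  # 1.02 -> a term must appear in at least 2 documents; singletons are dropped
3 * (1 - 0.67)  # 0.99 -> a term appearing in just 1 document already passes, so nothing is removed
If that reading of tm's rule is right, removeSparseTerms(tdm, 0.67) is effectively a no-op on a 3-document matrix, which matches the unchanged quadgram dimensions above.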
Related
I'm trying to obtain a document-term matrix from a book in Italian. I have the PDF file of this book, and I wrote a few lines of code:
#install.packages("pdftools")
library(pdftools)
library(tm)
text <- pdf_text("IoRobot.pdf")
# collapse pdf pages into 1
text <- paste(unlist(text), collapse ="")
myCorpus <- VCorpus(VectorSource(text))
mydtm <-DocumentTermMatrix(myCorpus,control = list(removeNumbers = TRUE, removePunctuation = TRUE,
stopwords=stopwords("it"), stemming=TRUE))
inspect(mydtm)
The result I obtained after the last row is:
<<DocumentTermMatrix (documents: 1, terms: 10197)>>
Non-/sparse entries: 10197/0
Sparsity : 0%
Maximal term length: 39
Weighting : term frequency (tf)
Sample :
Terms
Docs calvin cosa donovan esser piú poi powel prima quando robot
1 201 191 254 193 288 211 287 166 184 62
I noticed that the sparsity is 0%. Is this normal?
Yes, it seems correct.
A document-term matrix has documents as rows and terms as columns; with term-frequency weighting, each entry counts how often the term occurs in that document (0 if it does not appear at all).
Sparsity is an indicator of the proportion of zero entries in the document-term matrix: a term is sparse for a given document when it does not occur in that document.
To understand this, let's look at a reproducible example that creates a situation similar to yours:
library(tm)
text <- c("here some text")
corpus <- VCorpus(VectorSource(text))
DTM <- DocumentTermMatrix(corpus)
DTM
<<DocumentTermMatrix (documents: 1, terms: 3)>>
Non-/sparse entries: 3/0
Sparsity : 0%
Maximal term length: 4
Weighting : term frequency (tf)
Looking at the output, we can see there is one document (so the DTM built from that corpus has a single row).
Having a look at it:
as.matrix(DTM)
Terms
Docs here some text
1 1 1 1
Now it should be easier to understand the output. You have one doc with three terms:
<<DocumentTermMatrix (documents: 1, terms: 3)>>
Your non-sparse entries (i.e. != 0 in the DTM) number 3, and your sparse entries number 0:
Non-/sparse entries: 3/0
So your sparsity is 0%, because you cannot have any zeros in a one-document corpus: every term belongs to the unique document, so you'll have all ones:
Sparsity : 0%
Now let's look at a different example, one that does have sparse terms:
text <- c("here some text", "other text")
corpus <- VCorpus(VectorSource(text))
DTM <- DocumentTermMatrix(corpus)
DTM
<<DocumentTermMatrix (documents: 2, terms: 4)>>
Non-/sparse entries: 5/3
Sparsity : 38%
Maximal term length: 5
Weighting : term frequency (tf)
as.matrix(DTM)
Terms
Docs here other some text
1 1 0 1 1
2 0 1 0 1
Now you have 3 sparse entries out of 8 total, and 3/8 = 0.375, i.e. the 38% sparsity reported above.
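If you want to verify that figure by hand, here is a quick check (fine for this toy DTM, since it is small enough to densify):
# fraction of zero entries in the DTM
m <- as.matrix(DTM)
sum(m == 0) / length(m)  # 0.375, i.e. the 38% reported by tm (after rounding)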
I have a document-term matrix dtm, for example:
dtm
<<DocumentTermMatrix (documents: 50, terms: 50)>>
Non-/sparse entries: 220/2497
Sparsity : 100%
Maximal term length: 7
Weighting : term frequency (tf)
Now I want to transform it into a list of matrices, one per document. This is to satisfy the input format required by the stm package:
[[1]]
[,1] [,2] [,3] [,4]
[1,] 23 33 42 117
[2,] 2 1 3 1
[[2]]
[,1] [,2] [,3] [,4]
[1,] 2 19 93 168
[2,] 2 2 1 1
I am thinking of finding all the non-zero entries in dtm and building the matrices from them, one row at a time, so:
mat = matrix()
dtm.to.mat = function(x){
mat[1,] = x[x != 0]
mat[2,] = colnames(x[x != 0])
return(mat)
}
matrix = list(apply(dtm, 1, dtm.to.mat))
However,
x[x != 0]
just won't work. The error says:
$ operator is invalid for atomic vectors
I was wondering why this is the case. If I convert x to a matrix beforehand, it won't give me this error; however, I actually have a dtm of approximately 2,500,000 rows, and I fear the conversion will be very inefficient.
Me again!
I wouldn't use a dtm as the input for the stm package unless your data is particularly strange. Use the function stm::textProcessor instead: you can pass it raw (unprocessed) text as a character vector of any length, and you can also pass the metadata as you wish.
Suppose you have a dataframe df with a column df$documents holding your raw text and a column df$meta holding your covariate:
processed <- textProcessor(df$documents, metadata = df$meta, lowercase = TRUE,
removestopwords = TRUE, removenumbers = TRUE, removepunctuation = TRUE,
stem = TRUE, wordLengths = c(3, Inf))
stm_50 <- stm(documents = processed$documents, vocab = processed$vocab,
K = 50, prevalence = ~ meta, init.type = "Spectral", seed = 57468)
This will run a 50 topic STM.
textProcessor will deal with empty documents and their associated metadata.
Edit: stm::textProcessor is technically just a wrapper for the tm package. But it is designed to remove problem documents, while dealing with their associated covariates.
Also, the metadata argument can take a dataframe if you have multiple covariates. In that case you would also need to modify the prevalence formula in the stm call, as sketched below.
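For example, with two hypothetical covariates age and region (illustrative names only, substitute your own columns), the calls might look like this:
# `age` and `region` are placeholders for your own covariate columns
processed <- textProcessor(df$documents, metadata = df[, c("age", "region")])
stm_50 <- stm(documents = processed$documents, vocab = processed$vocab,
              K = 50, prevalence = ~ age + region,
              data = processed$meta, init.type = "Spectral")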
If you have something tricky like this, I'd switch over to the quanteda package, as it has nice converters to stm. If you want to stick with tm, have you tried using stm::readCorpus to change the object into the list structure stm needs?
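A quick sketch of that route, assuming dtm is a tm DocumentTermMatrix (which is a slam simple triplet matrix underneath):
library(stm)
# read the tm matrix into stm's format, then drop empty documents/terms
out <- readCorpus(dtm, type = "slam")
prep <- prepDocuments(out$documents, out$vocab)
# prep$documents now has the list structure stm expects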
I would like to append two document-term matrices together. I have one row of data and want to apply different control functions to each part (an n-gram tokenizer, stopword removal, and wordLengths bounds for the text fields; none of these for my non-text fields).
When I use tm's combine, c(dtm_text, dtm_inputs), it adds the second set as a new row. I want to append these attributes to the same row.
library("tm")
BigramTokenizer <-
function(x)
unlist(lapply(ngrams(words(x), 2), paste, collapse = " "),
use.names = FALSE)
# Data to be tokenized
txt_fields <- paste("i like your store","i love your products","i am happy")
# Data not to be tokenized
other_inputs <- paste("cd1_ABC","cd2_555","cd3_7654")
# NGram tokenize text data
dtm_text <- DocumentTermMatrix(Corpus(VectorSource(txt_fields)),
control = list(
tokenize = BigramTokenizer,
stopwords=TRUE,
wordLengths=c(2, Inf),
bounds=list(global = c(1,Inf))))
# Do not perform tokenization of other inputs
dtm_inputs <- DocumentTermMatrix(Corpus(VectorSource(other_inputs)),
control = list(
bounds = list(global = c(1,Inf))))
# DESIRED OUTPUT
<<DocumentTermMatrix (documents: 1, terms: 12)>>
Non-/sparse entries: 12/0
Sparsity : 0%
Maximal term length: 13
Weighting : term frequency (tf)
     Terms
Docs  am happy happy like like your love love your products products am store store love
   1         1          1         1    1         1        1           1     1          1
     Terms
Docs  your products your store cd1_abc cd2_555 cd3_7654
   1              1          1       1       1        1
I suggest using text2vec (though I'm biased, since I'm the author).
library(text2vec)
# Data to be tokenized
txt_fields <- paste("i like your store","i love your products","i am happy")
# Data not to be tokenized
other_inputs <- paste("cd1_ABC","cd2_555","cd3_7654")
stopwords = tm::stopwords("en")
# tokenize by whitespace
txt_tokens = strsplit(txt_fields, ' ', TRUE)
vocab = create_vocabulary(itoken(txt_tokens), ngram = c(1, 2), stopwords = stopwords)
# if you need word lengths:
# vocab$vocab = vocab$vocab[nchar(terms) > 1]
# but note, it will not remove "i_am", etc.
# you can add the word "i" to stopwords to remove such terms
txt_vectorizer = vocab_vectorizer(vocab)
dtm_text = create_dtm(itoken(txt_tokens), vectorizer = txt_vectorizer)
# also tokenize by whitespace, but don't create bigrams in the next step
other_tokens = strsplit(other_inputs, ' ', TRUE)
vocab_other = create_vocabulary(itoken(other_tokens))
other_vectorizer = vocab_vectorizer(vocab_other)
dtm_other = create_dtm(itoken(other_tokens), vectorizer = other_vectorizer)
# combine and convert back to a tm DocumentTermMatrix
result = cbind(dtm_text, dtm_other)
dtm_combined = as.DocumentTermMatrix(result, weighting = weightTf)
inspect(dtm_combined)
# <<DocumentTermMatrix (documents: 1, terms: 8)>>
# Non-/sparse entries: 8/0
# Sparsity : 0%
# Maximal term length: 8
# Weighting : term frequency (tf)
#
# Terms
# Docs happy like love products store cd1_abc cd2_555 cd3_7654
# 1 1 1 1 1 1 1 1 1
But it will give wrong results if you have the same words in dtm_text and in dtm_other: these words won't be combined and will appear twice in dtm_combined.
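If duplicate terms are a real risk, one possible workaround (my sketch, assuming the combined object is a sparse dgCMatrix as produced by text2vec) is to sum columns that share a name before converting:
library(Matrix)
m   <- cbind(dtm_text, dtm_other)
grp <- factor(colnames(m), levels = unique(colnames(m)))
# multiply by a term-name indicator matrix to sum duplicate columns
m2  <- m %*% t(fac2sparse(grp))
colnames(m2) <- levels(grp)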
I have a corpus of 11 text documents. I have found word associations using the commands:
findAssocs(dtm, c("youngster","campaign"), corlimit=0.9)
findAssocs(dtms, "corruption", corlimit=0.9)
dtm is a document term matrix.
dtm <- DocumentTermMatrix(docs)
where docs is the corpus.
dtms is the document term matrix after removing 10% sparse terms.
dtms <- removeSparseTerms(dtm, 0.1)
I would like to plot the correlated terms I got against (i) 2 specific words and (ii) 1 specific word
I tried following this post : Plot highly correlated words against a specific word of interest
toi <- "corruption" # term of interest
corlimit <- 0.9 # lower correlation bound limit.
cor_0.9 <- data.frame(corr = findAssocs(dtm, toi, corlimit)[,1],terms=row.names(findAssocs(dtm, toi, corlimit)))
But unfortunately that code gives me an error:
Error in findAssocs(dtm, toi, corlimit)[, 1] : incorrect number of dimensions
This is the structure of the document term matrix:
dtm
<<DocumentTermMatrix (documents: 11, terms: 1847)>>
Non-/sparse entries: 8024/12293
Sparsity : 61%
Maximal term length: 23
Weighting : term frequency (tf)
and in the environment it is of the form:
dtm  List of 6
 $ i       : int [1:8024] 1 1 1 1 1 ...
 $ j       : int [1:8024] 17 29 34 43 47 ...
 $ v       : num [1:8024] 9 4 9 5 5 ...
 $ nrow    : int 11
 $ ncol    : int 1847
 $ dimnames: List of 2
  ..$ Docs : chr [1:11] "character(0)" "character(0)" "character(0)" ...
  ..$ Terms: chr [1:1847] "campaigning" | __truncated__ "a" ...
 - attr(*, "class")= chr [1:2] "DocumentTermMatrix" "simple_triplet_matrix"
 - attr(*, "weighting")= chr [1:2] "term frequency" "tf"
How do I plot word correlations for a single word and multiple words? Please help.
Here is the output of
findAssocs(dtm, c("youngster","campaign"), corlimit=0.9)
$youngster
character colleges controversi expect corrupt much
1.00 1.00 1.00 1.00 0.99 0.99
okay saritha existing leads satisfi social
0.99 0.99 0.98 0.98 0.98 0.98
$campaign
basic make lack internal general method satisfied time
0.95 0.95 0.94 0.93 0.92 0.92 0.92 0.92
In current versions of tm, findAssocs() returns a named list with one element per term of interest (not a matrix), which is why subsetting the result with [, 1] fails with the "incorrect number of dimensions" error. A slightly different approach is required for two words; here's a quick attempt:
require(tm)
data("crude")
tdm <- TermDocumentMatrix(crude)
# Compute correlations and store in data frame...
toi1 <- "oil" # term of interest
toi2 <- "winter"
corlimit <- 0.7 # lower correlation bound limit.
corr1 <- findAssocs(tdm, toi1, corlimit)[[1]]
corr1 <- cbind(read.table(text = names(corr1), stringsAsFactors = FALSE), corr1)
corr2 <- findAssocs(tdm, toi2, corlimit)[[1]]
corr2 <- cbind(read.table(text = names(corr2), stringsAsFactors = FALSE), corr2)
# join them together
library(dplyr)
two_terms_corrs <- full_join(corr1, corr2)
# gather for plotting
library(tidyr)
two_terms_corrs_gathered <- gather(two_terms_corrs, term, correlation, corr1:corr2)
# insert the actual terms of interest so they show up on the legend
two_terms_corrs_gathered$term <- ifelse(two_terms_corrs_gathered$term == "corr1", toi1, toi2)
# Draw the plot...
require(ggplot2)
ggplot(two_terms_corrs_gathered, aes(x = V1, y = correlation, colour = term ) ) +
geom_point(size = 3) +
ylab(paste0("Correlation with the terms ", "\"", toi1, "\"", " and ", "\"", toi2, "\"")) +
theme_bw() +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))
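For a single word, the same pattern simplifies; here's a sketch along the same lines (not from the original answer), reusing dtm and corlimit from the question:
corr <- findAssocs(dtm, "corruption", corlimit)[[1]]
one_term <- data.frame(term = names(corr), correlation = unname(corr))
# order terms by correlation so the x-axis reads naturally
ggplot(one_term, aes(x = reorder(term, -correlation), y = correlation)) +
  geom_point(size = 3) +
  xlab(NULL) +
  ylab("Correlation with the term \"corruption\"") +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))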
I have a 60,000 x 60,000 matrix in a txt file and I need to compute the SVD of this matrix. I use R, but I don't know if R can handle it.
I think it's possible to compute a (partial) SVD using the irlba package together with bigmemory and bigalgebra, without using a lot of memory.
First, let's create a 20000 x 20000 matrix and save it to a file:
require(bigmemory)
require(bigalgebra)
require(irlba)
con <- file("mat.txt", open = "a")
replicate(20, {
x <- matrix(rnorm(1000 * 20000), nrow = 1000)
write.table(x, file = 'mat.txt', append = TRUE,
row.names = FALSE, col.names = FALSE)
})
file.info("mat.txt")$size
## [1] 7.264e+09 7.3 Gb
close(con)
Then you can read this matrix using bigmemory::read.big.matrix
bigm <- read.big.matrix("mat.txt", sep = " ",
type = "double",
backingfile = "mat.bk",
backingpath = "/tmp",
descriptorfile = "mat.desc")
str(bigm)
## Formal class 'big.matrix' [package "bigmemory"] with 1 slots
## ..# address:<externalptr>
dim(bigm)
## [1] 20000 20000
bigm[1:3, 1:3]
## [,1] [,2] [,3]
## [1,] -0.3623255 -0.58463 -0.23172
## [2,] -0.0011427 0.62771 0.73589
## [3,] -0.1440494 -0.59673 -1.66319
Now we can use the excellent irlba package, as explained in the package vignette.
The first step consists of defining a matrix multiplication operator that can work with big.matrix objects; then we use the irlba::irlba function.
### vignette("irlba", package = "irlba") # for more info
matmul <- function(A, B, transpose=FALSE) {
## Bigalgebra requires matrix/vector arguments
if(is.null(dim(B))) B <- cbind(B)
if(transpose)
return(cbind((t(B) %*% A)[]))
cbind((A %*% B)[])
}
dim(bigm)
system.time(
S <- irlba(bigm, nu = 2, nv = 2, matmul = matmul)
)
## user system elapsed
## 169.820 0.923 170.194
str(S)
## List of 5
## $ d : num [1:2] 283 283
## $ u : num [1:20000, 1:2] -0.00615 -0.00753 -0.00301 -0.00615 0.00734 ...
## $ v : num [1:20000, 1:2] 0.020086 0.012503 0.001065 -0.000607 -0.006009 ...
## $ iter : num 10
## $ mprod: num 310
I forgot to set the seed to make it reproducible, but I just wanted to show that it's possible to do this in R.
EDIT
If you are using a newer version of the irlba package, the above code throws an error because the matmul parameter of the irlba function has been renamed to mult. You should therefore change this part of the code:
S <- irlba(bigm, nu = 2, nv = 2, matmul = matmul)
to:
S <- irlba(bigm, nu = 2, nv = 2, mult = matmul)
I want to thank @FrankD for pointing this out.
In R 3.x+ you can construct a matrix of that size, the upper limit on vector length being 2^53 (or maybe 2^53 - 1), up from the 2^31 - 1 it was before; that older limit is why Andrie's out-of-date installation was throwing an error. A numeric element generally takes about 10 bytes once overhead is included (the raw data is 8 bytes per double). At any rate:
> 2^53 < 10*60000^2
[1] FALSE # so you are safe on that account.
It would also fit in 64GB (but not in 32GB):
> 64000000000 < 10*60000^2
[1] FALSE
Generally, to do any serious work you need at least 3 times the size of your largest object, so this seems pretty borderline even with the new expanded vectors/matrices.
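For reference, the raw-data arithmetic (8 bytes per double, ignoring R's object overhead) is easy to check directly:
n <- 60000
8 * n^2 / 2^30      # ~26.8 GiB of raw data for the dense matrix
3 * 8 * n^2 / 2^30  # ~80 GiB if you allow the 3x headroom mentioned above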