I have a document-term matrix dtm, for example:
dtm
<<DocumentTermMatrix (documents: 50, terms: 50)>>
Non-/sparse entries: 220/2497
Sparsity : 100%
Maximal term length: 7
Weighting : term frequency (tf)
Now I want to transfer it into a list of matrices, one per document, to satisfy the input format required by the stm package:
[[1]]
[,1] [,2] [,3] [,4]
[1,] 23 33 42 117
[2,] 2 1 3 1
[[2]]
[,1] [,2] [,3] [,4]
[1,] 2 19 93 168
[2,] 2 2 1 1
I am thinking of finding all the non-zero entries in dtm and building the matrices from them, one row at a time, so:
mat = matrix()
dtm.to.mat = function(x){
  mat[1,] = x[x != 0]
  mat[2,] = colnames(x[x != 0])
  return(mat)
}
matrix = list(apply(dtm, 1, dtm.to.mat))
However,
x[x != 0]
just won't work. The error says:
$ operator is invalid for atomic vectors
I was wondering why this is the case. If I convert x to a matrix beforehand, the error goes away. However, my actual dtm has approximately 2,500,000 rows, so I fear that conversion will be very inefficient.
Me again!
I wouldn't use a dtm as the input for the stm package unless your data is particularly strange. Use the function stm::textProcessor. You can pass the documents as raw (unprocessed) text in a character vector of any length. You can also specify the metadata as you wish:
Suppose you have a dataframe df with a column df$documents holding your raw text and a column df$meta holding your covariate:
processed <- textProcessor(df$documents, metadata = df$meta, lowercase = TRUE,
                           removestopwords = TRUE, removenumbers = TRUE,
                           removepunctuation = TRUE, stem = TRUE,
                           wordLengths = c(3, Inf))
stm_50 <- stm(documents = processed$documents, vocab = processed$vocab,
              K = 50, prevalence = ~ meta, data = processed$meta,
              init.type = "Spectral", seed = 57468)
This will run a 50 topic STM.
textProcessor will deal with empty documents and their associated metadata.
Edit: stm::textProcessor is technically just a wrapper for the tm package. But it is designed to remove problem documents, while dealing with their associated covariates.
Also, the metadata argument can take a dataframe if you have multiple covariates. In that case you would also need to modify the prevalence formula in the stm call.
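For instance, a minimal sketch with a metadata dataframe, where the covariate names age and gender are hypothetical:
meta_df <- df[, c("age", "gender")]  # hypothetical covariate columns
processed <- textProcessor(df$documents, metadata = meta_df)
stm_fit <- stm(documents = processed$documents, vocab = processed$vocab,
               K = 50, prevalence = ~ age + gender,
               data = processed$meta, init.type = "Spectral")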
If you have something tricky like this, I'd switch over to the quanteda package, as it has nice converters to stm. If you want to stick with tm, have you tried using stm::convertCorpus to change the object into the list structure stm needs?
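A minimal sketch of the quanteda route (assuming your quanteda version's as.dfm() accepts a tm DocumentTermMatrix; otherwise go through as.matrix() first):
library(quanteda)
dfm_obj <- as.dfm(dtm)                     # or as.dfm(as.matrix(dtm))
stm_input <- convert(dfm_obj, to = "stm")  # list with $documents, $vocab, $meta
# stm_input$documents is the list of index/count matrices stm expects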
When applying Matrix::qr() to a sparse matrix in R, the output is quite different from that of base::qr. There are V, beta, p, R, q, but no rank and pivot. Below is a small example. I want to detect the linearly dependent columns of the sparse matrix A, which requires the pivot and rank. How can I get this information?
library(Matrix)
A <- matrix(c(0, -2, 1, 0,
              0, -4, 2, 0,
              1, -2, 1, 2,
              1, -2, 1, 2,
              1, -2, 1, 2), nrow = 5, byrow = TRUE)
A.s <- as(A, "dgCMatrix")
qrA.M <- Matrix::qr(A.s)
qrA.R <- base::qr(A)
There is another related but unanswered question: Get base::qr pivoting in Matrix::qr method.
I would rearrange your example matrix A a little bit:
A <- A[, c(1,4,3,2)]
# [,1] [,2] [,3] [,4]
#[1,] 0 0 1 -2
#[2,] 0 0 2 -4
#[3,] 1 2 1 -2
#[4,] 1 2 1 -2
#[5,] 1 2 1 -2
You did not mention in your question why the rank and pivot returned by a dense QR factorization are useful to you. But I think this is what you are looking for:
dQR <- base::qr(A)
with(dQR, pivot[1:rank])
#[1] 1 3
So columns 1 and 3 of A give a basis for A's column space.
I don't really understand the logic of the sparse QR factorization. The 2nd column of A is perfectly linearly dependent on the 1st column, so I expected column pivoting to take place during the factorization. But very much to my surprise, it doesn't!
library(Matrix)
sA <- Matrix(A, sparse = TRUE)
sQR <- Matrix::qr(sA)
sQR@q + 1L
#[1] 1 2 3 4
No column pivoting is done! As a result, there isn't an obvious way to determine the rank of A.
At this moment, I could only think of performing a dense QR factorization on the R factor to get what you are looking for.
R <- as.matrix(Matrix::qrR(sQR))
QRR <- base::qr(R)
with(QRR, pivot[1:rank])
#[1] 1 3
Why does this work? Well, the Q factor has orthonormal, hence linearly independent, columns, so the columns of R inherit the linear dependence or independence of the columns of A. For a matrix with many more rows than columns, the computational cost of this second QR factorization is negligible.
I need to figure out the algorithm behind a sparse QR factorization before coming up with a better idea.
I've been looking at a similar problem, and I ended up not relying on Matrix::qr() to calculate rank and to detect linear dependency. Instead I programmed the function GaussIndependent in the package SSBtools.
The package examples include one that demonstrates a wrong conclusion from rankMatrix(x, method = "qr"); the input x is a 44*20 dummy matrix.
Starting with your example matrix, A.s:
library(SSBtools)
GaussIndependent(A.s) # List of logical vectors specifying independent rows and columns
# $rows
# [1] TRUE FALSE TRUE FALSE FALSE
#
# $columns
# [1] TRUE TRUE FALSE FALSE
GaussRank(A.s) # the rank
# [1] 2
I have received a paper whose authors included the R files for their empirical results. Nevertheless, I have some problems when trying to run their code:
data <- vni$R[198:length(vni$R)]; date <- vni$Date[198:length(vni$R)]
l <- length(data)
rw_length <- 52 # 52 weeks (~ 1 year)
bound <- vector()
avr <- vector()
for (i in (rw_length+1):l) {
  AVR.test <- AutoBoot.test(data[(i-rw_length):i], nboot = 2000, "Normal", c(0.025, 0.975))
  bound <- append(bound, AVR.test$CI.stat)
  avr <- append(avr, AVR.test$test.stat)
}
lower <- bound[seq(1, length(bound), 2)]
upper <- bound[seq(2, length(bound), 2)]
results <- matrix(c(date[(rw_length+1):l],data[(rw_length+1):l],avr,upper, lower),ncol=5, dimnames = list(c(),c("Date", "Return", "AVR", "Upper", "Lower")))
And I get the following error:
Error in as.Date.numeric(e) : 'origin' must be supplied
for the line results <- matrix(c(date[(rw_length+1):l], data[(rw_length+1):l], avr, upper, lower), ncol=5, dimnames = list(c(), c("Date", "Return", "AVR", "Upper", "Lower"))).
My dataset is:
Date P R
1 2001-03-23 259.60 0.0000000000
2 2001-03-30 269.30 0.0366840150
3 2001-04-06 284.69 0.0555748690
4 2001-04-13 300.36 0.0535808860
5 2001-04-20 317.76 0.0563146260
...
935 2019-02-15 950.89 0.0454163960
936 2019-02-22 988.91 0.0392049380
937 2019-03-01 979.63 -0.0094283770
Could you please help me with this issue? Thanks a lot!
Everything in a matrix must be the same class. This often shows up when there is a string among numbers:
m <- matrix(0, nr=2, nc=2)
m
# [,1] [,2]
# [1,] 0 0
# [2,] 0 0
m[1] <- "a"
m
# [,1] [,2]
# [1,] "a" "0"
# [2,] "0" "0"
In this case, you have Date (first column) and numeric (all others? no idea what AutoBoot is). And because R coerces from the least complex class to the most complex one (here, from numeric to Date), the non-Date values are being converted to Date, which fails without an origin:
matrix(c(Sys.Date(), 1.1))
# Error in as.Date.numeric(e) : 'origin' must be supplied
I suggest that trying to store this in a matrix is therefore fundamentally flawed. If you want to store a Date object among numbers, you have two options:
Store it in a data.frame, where each column can have its own class (see the sketch below).
Pre-convert the "Date" data to numeric and store it as a number. This means that if/when you need the dates to be of class Date again, you'll need as.Date(..., origin="1970-01-01").
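A minimal sketch of the first option, using the objects from the question:
# Each column keeps its own class: Date stays Date, the rest stay numeric
results <- data.frame(Date   = date[(rw_length + 1):l],
                      Return = data[(rw_length + 1):l],
                      AVR    = avr,
                      Upper  = upper,
                      Lower  = lower)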
I have two XML text files, and using the quanteda and tm packages I have tokenized them and transformed them into tf-idf matrices. Here is my RStudio environment:
[screenshot of the RStudio environment]
How can I calculate the similarity between these two files, for example using Jaccard?
I have tried dist(), cosine(), and text2vec; however, I encounter errors with all of them. For example:
cosine(x = pta2.tokens.tfidf, y = pta3.tokens.tfidf)
Error in cosine(x = pta2.tokens.tfidf, y = pta3.tokens.tfidf) :
argument mismatch. Either one matrix or two vectors needed as input.
simi <- sim2(pta2.tokens.tfidf, pta3.tokens.tfidf, method = "jaccard", norm = "none")
Error: ncol(x) == ncol(y) is not TRUE
The problem is that you have a data.frame with string values, and you are using distance functions that need a numeric matrix as input.
DIST
you need a numeric matrix:
?dist
Usage
dist(x, method = "euclidean", diag = FALSE, upper = FALSE, p=2)
Arguments
x a numeric matrix, data frame or "dist" object.
COSINE
you need numeric values:
?cosine
Usage
cosine(x, y, use = "everything", inverse = FALSE)
Arguments
x A numeric dataframe/matrix or vector
SIM2
Your error is due to the different number of columns in pta2.tokens.tfidf and pta3.tokens.tfidf. Here is an example of the error:
df1<-as.matrix(data.frame(a=c("a","b","c"),b=c("d","e","f")))
df2<-as.matrix(data.frame(c=c("a","b","c"),d=c("d","e","f"),e=c("g","h","i")))
sim2(df1,df2)
Error: ncol(x) == ncol(y) is not TRUE
But even if the dimensions match, this method will not work, because it needs numeric input:
sim2(df1,df1)
Error in m^2 : non-numeric argument to binary operator
You need numeric matrices with the same dimensions, like this:
df3<-as.matrix(data.frame(a=c(1,2,3),b=c(4,5,6)))
df4<-as.matrix(data.frame(a=c(3,2,3),b=c(3,3,6)))
sim2(df3,df4)
[,1] [,2] [,3]
[1,] 0.8574929 0.9417419 0.9761871
[2,] 0.9191450 0.9785498 0.9965458
[3,] 0.9486833 0.9922779 1.0000000
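For the original tf-idf use case, the usual fix is to build both document-term matrices from one shared vocabulary so their columns line up. A sketch with text2vec, where pta2_text and pta3_text are hypothetical names for the raw texts behind your two tf-idf matrices:
library(text2vec)
# Fit one vocabulary over both documents so both dtms share the same columns
it_all <- itoken(c(pta2_text, pta3_text), tolower, word_tokenizer)
vec <- vocab_vectorizer(create_vocabulary(it_all))
dtm2 <- create_dtm(itoken(pta2_text, tolower, word_tokenizer), vec)
dtm3 <- create_dtm(itoken(pta3_text, tolower, word_tokenizer), vec)
sim2(dtm2, dtm3, method = "jaccard", norm = "none")  # now ncol(x) == ncol(y)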
A possible solution
Use the function stringdist from the stringdist package. Here is a toy example with two dataframes of string values:
df1<-data.frame(a=c("abc","bav","cda"),b=c("ddd","ese","feff"))
df2<-data.frame(a=c("abc","gfb","cdd"),b=c("dsd","eeesfd","fafe"))
Function to compare string values in two big data.frames:
library(stringdist)
# Element-wise Jaro-Winkler distance between column i of df1 and df2
f <- function(i, df1, df2) {
  f2 <- function(y, list1, list2) {
    stringdist(list1[y], list2[y], method = "jw")
  }
  unlist(lapply(seq_len(nrow(df1)), f2, list1 = df1[, i], list2 = df2[, i]))
}
dist_matrix <- do.call(cbind, lapply(seq_len(ncol(df1)), f, df1 = df1, df2 = df2))
Distance matrix
dist_matrix
[,1] [,2]
[1,] 0.0000000 0.2222222
[2,] 1.0000000 0.2777778
[3,] 0.2222222 0.3333333
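If you specifically need Jaccard, stringdist also offers a q-gram-based Jaccard distance via method = "jaccard":
stringdist("abc", "abd", method = "jaccard", q = 2)
# [1] 0.6666667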
I would like to append two document-term matrices together. I have one row of data and would like to apply different control functions to its parts (an n-gram tokenizer, removal of stopwords, and wordLengths bounds for the text, none of these for my non-text fields).
When I use tm's combine method, c(dtm_text, dtm_inputs), it adds the second set as a new row. I want to append these terms to the same row.
library("tm")
BigramTokenizer <-
function(x)
unlist(lapply(ngrams(words(x), 2), paste, collapse = " "),
use.names = FALSE)
# Data to be tokenized
txt_fields <- paste("i like your store","i love your products","i am happy")
# Data not to be tokenized
other_inputs <- paste("cd1_ABC","cd2_555","cd3_7654")
# NGram tokenize text data
dtm_text <- DocumentTermMatrix(Corpus(VectorSource(txt_fields)),
control = list(
tokenize = BigramTokenizer,
stopwords=TRUE,
wordLengths=c(2, Inf),
bounds=list(global = c(1,Inf))))
# Do not perform tokenization of other inputs
dtm_inputs <- DocumentTermMatrix(Corpus(VectorSource(other_inputs)),
control = list(
bounds = list(global = c(1,Inf))))
# DESIRED OUTPUT
<<DocumentTermMatrix (documents: 1, terms: 12)>>
Non-/sparse entries: 12/0
Sparsity : 0%
Maximal term length: 13
Weighting : term frequency (tf)
Terms
Docs am happy happy like like your love love your products products am store store love
1 1 1 1 1 1 1 1 1 1 1
Terms
Docs your products your store cd1_abc cd2_555 cd3_7654
1 1 1 1 1 1
I suggest using text2vec (but I'm biased, since I'm the author).
library(text2vec)
# Data to be tokenized
txt_fields <- paste("i like your store","i love your products","i am happy")
# Data not to be tokenized
other_inputs <- paste("cd1_ABC","cd2_555","cd3_7654")
stopwords = tm::stopwords("en")
# tokenize by whitespace
txt_tokens = strsplit(txt_fields, ' ', TRUE)
vocab = create_vocabulary(itoken(txt_tokens), ngram = c(1, 2), stopwords = stopwords)
# if you need word lengths:
# vocab$vocab = vocab$vocab[nchar(terms) > 1]
# but note, it will not remove "i_am", etc.
# you can add word "i" to stopwords to remove such terms
txt_vectorizer = vocab_vectorizer(vocab)
dtm_text = create_dtm(itoken(txt_fields), vectorizer = txt_vectorizer)
# also tokenize by whitespace, but won't create bigrams in the next step
other_inputs_tokens = strsplit(other_inputs, ' ', TRUE)
vocab_other = create_vocabulary(itoken(other_inputs))
other_vectorizer = vocab_vectorizer(vocab_other)
dtm_other = create_dtm(itoken(other_inputs), vectorizer = other_vectorizer)
# combine
result = cbind(dtm_text, dtm_other)
dtm_combined = as.DocumentTermMatrix(as.matrix(result), weighting = weightTf)
inspect(dtm_combined)
# <<DocumentTermMatrix (documents: 1, terms: 8)>>
# Non-/sparse entries: 8/0
# Sparsity : 0%
# Maximal term length: 8
# Weighting : term frequency (tf)
#
# Terms
# Docs happy like love products store cd1_abc cd2_555 cd3_7654
# 1 1 1 1 1 1 1 1 1
But it will give wrong results if you have the same words in dtm_text and dtm_other. Those words won't be combined and will appear twice in dtm_combined.
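If that happens, one way to collapse the duplicated terms is to sum the columns that share a name, for example with base R's rowsum():
m <- as.matrix(result)
# rowsum() sums rows by group, so transpose to sum columns by term name
m_dedup <- t(rowsum(t(m), group = colnames(m)))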
I have a contingency table for which I would like to calculate Cohen's kappa, the level of agreement. I have tried three different packages, which all seem to fail to some degree. The package e1071 has a function specifically for a contingency table, but that too seems to fail. Below is reproducible code; you will need to install the packages concord, e1071, and irr.
# Recreate my contingency table, output with dput
conf.mat<-structure(c(810531L, 289024L, 164757L, 114316L), .Dim = c(2L,
2L), .Dimnames = structure(list(landsat_2000_bin = c("0", "1"
), MOD12_2000_binForest = c("0", "1")), .Names = c("landsat_2000_bin",
"MOD12_2000_binForest")), class = "table")
library(concord)
cohen.kappa(conf.mat)
library(e1071)
classAgreement(conf.mat, match.names=TRUE)
library(irr)
kappa2(conf.mat)
The output I get from running this is:
> cohen.kappa(conf.mat)
Kappa test for nominally classified data
4 categories - 2 methods
kappa (Cohen) = 0 , Z = NaN , p = NaN
kappa (Siegel) = -0.333333 , Z = -0.816497 , p = 0.792892
kappa (2*PA-1) = -1
> classAgreement(conf.mat, match.names=TRUE)
$diag
[1] 0.6708459
$kappa
[1] NA
$rand
[1] 0.5583764
$crand
[1] 0.0594124
Warning message:
In ni[lev] * nj[lev] : NAs produced by integer overflow
> kappa2(conf.mat)
Cohen's Kappa for 2 Raters (Weights: unweighted)
Subjects = 2
Raters = 2
Kappa = 0
z = NaN
p-value = NaN
Could anyone advise on why these might fail? I have a large dataset, but as this table is simple I didn't think that could cause such problems.
In the first function, cohen.kappa, you need to specify that you are using count data and not just an n*m matrix of n subjects and m raters.
cohen.kappa(conf.mat, 'count')
The second function is much more tricky. For some reason, your matrix is full of integers and not numerics. An integer can't store really big numbers, so when you multiply two of your big counts together, it fails. For example:
i=975288
j=1099555
class(i)
# [1] "numeric"
i*j
# 1.072383e+12
as.integer(i)*as.integer(j)
# [1] NA
# Warning message:
# In as.integer(i) * as.integer(j) : NAs produced by integer overflow
So you need to convert your matrix to numeric:
# classAgreement(conf.mat)
classAgreement(matrix(as.numeric(conf.mat),nrow=2))
Finally, take a look at the documentation for ?kappa2. It requires an n*m matrix of n subjects and m raters, as explained above. It just won't work with your (efficient) contingency-table structure.
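If you do want kappa2, a sketch of expanding the contingency table into the long subjects-by-raters format it expects (note this creates roughly 1.4 million rows):
# Each cell count becomes that many subject rows with the two raters' labels
ratings <- as.data.frame(conf.mat)
long <- ratings[rep(seq_len(nrow(ratings)), ratings$Freq), 1:2]
irr::kappa2(long)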
Do you need to know specifically why those fail? Here is a function that computes the statistic; written in a hurry, so I might clean it up later (kappa wiki):
kap <- function(x) {
  n  <- sum(x)
  a  <- (x[1, 1] + x[2, 2]) / n        # observed agreement
  p1 <- sum(x[1, ]) / n                # first rater's marginal proportion
  q1 <- sum(x[, 1]) / n                # second rater's marginal proportion
  e  <- p1 * q1 + (1 - p1) * (1 - q1)  # agreement expected by chance
  (a - e) / (1 - e)
}
Tests/output:
> (x = matrix(c(20,5,10,15), nrow=2, byrow=T))
[,1] [,2]
[1,] 20 5
[2,] 10 15
> kap(x)
[1] 0.4
> (x = matrix(c(45,15,25,15), nrow=2, byrow=T))
[,1] [,2]
[1,] 45 15
[2,] 25 15
> kap(x)
[1] 0.1304348
> (x = matrix(c(25,35,5,35), nrow=2, byrow=T))
[,1] [,2]
[1,] 25 35
[2,] 5 35
> kap(x)
[1] 0.2592593
> kap(conf.mat)
[1] 0.1258621