Text similarity between two tf-idf matrices in R

I have two XML text files. Using the quanteda and tm packages, I have tokenized them and transformed them into tf-idf matrices; here is my RStudio environment:
[screenshot of the RStudio environment]
How can I calculate the similarity between these two files, for example using Jaccard?
I have tried dist(), cosine(), and text2vec, but I run into errors every time.
For example:
cosine(x = pta2.tokens.tfidf, y = pta3.tokens.tfidf)
Error in cosine(x = pta2.tokens.tfidf, y = pta3.tokens.tfidf) :
argument mismatch. Either one matrix or two vectors needed as input.
simi <- sim2(pta2.tokens.tfidf, pta3.tokens.tfidf, method = "jaccard", norm = "none")
Error: ncol(x) == ncol(y) is not TRUE

The problem is that you have a data.frame with string values, while these distance functions need a numeric matrix as input.
DIST
You need a numeric matrix:
?dist
Usage
dist(x, method = "euclidean", diag = FALSE, upper = FALSE, p=2)
Arguments
x a numeric matrix, data frame or "dist" object.
COSINE
You need numeric values:
?cosine
Usage
cosine(x, y, use = "everything", inverse = FALSE)
Arguments
x A numeric dataframe/matrix or vector
SIM2
Your error is due to the different number of columns in pta2.tokens.tfidf and pta3.tokens.tfidf. Here is an example of the error:
df1<-as.matrix(data.frame(a=c("a","b","c"),b=c("d","e","f")))
df2<-as.matrix(data.frame(c=c("a","b","c"),d=c("d","e","f"),e=c("g","h","i")))
sim2(df1,df2)
Error: ncol(x) == ncol(y) is not TRUE
But even if the dimensions match, this method will not work, as you can see, because it needs numeric input:
sim2(df1,df1)
Error in m^2 : non-numeric argument to binary operator
You need numeric matrices with the same dimensions, like this:
df3<-as.matrix(data.frame(a=c(1,2,3),b=c(4,5,6)))
df4<-as.matrix(data.frame(a=c(3,2,3),b=c(3,3,6)))
sim2(df3,df4)
[,1] [,2] [,3]
[1,] 0.8574929 0.9417419 0.9761871
[2,] 0.9191450 0.9785498 0.9965458
[3,] 0.9486833 0.9922779 1.0000000
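Note that for the original use case (tf-idf representations of two different documents), the more direct fix is to build both documents into a single dfm so that they share the same columns. Here is a minimal sketch assuming the quanteda workflow from the question, with hypothetical stand-in texts in place of the parsed XML files:
library(quanteda)
library(text2vec)
# Hypothetical stand-ins for the text extracted from the two XML files
txts <- c(pta2 = "free trade agreement between the two parties",
          pta3 = "agreement on tariffs and trade between the parties")
dfmat <- dfm(tokens(txts))             # one dfm -> shared vocabulary
tfidf <- as.matrix(dfm_tfidf(dfmat))   # tf-idf as an ordinary numeric matrix
# Cosine similarity between the two document rows
sim2(tfidf, method = "cosine", norm = "l2")
# Jaccard similarity computed directly on term presence/absence
a <- tfidf[1, ] > 0
b <- tfidf[2, ] > 0
sum(a & b) / sum(a | b)
Recent quanteda versions also provide textstat_simil() (in the quanteda.textstats package), which supports "jaccard" and "cosine" methods directly on a dfm.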
A possible solution
Use the stringdist function from the stringdist package; here is a toy example:
Two dataframes with string values
df1<-data.frame(a=c("abc","bav","cda"),b=c("ddd","ese","feff"))
df2<-data.frame(a=c("abc","gfb","cdd"),b=c("dsd","eeesfd","fafe"))
Function to compare string values in two big data.frames:
library(stringdist)

f <- function(i, df1, df2) {
  # compare column i of the two data frames, cell by cell
  f2 <- function(y, list1, list2) {
    stringdist(list1[y], list2[y], method = "jw")
  }
  unlist(lapply(seq_along(df1[, i]), f2, list1 = df1[, i], list2 = df2[, i]))
}
dist_matrix <- do.call(cbind, lapply(seq_len(ncol(df1)), f, df1 = df1, df2 = df2))
Distance matrix
dist_matrix
[,1] [,2]
[1,] 0.0000000 0.2222222
[2,] 1.0000000 0.2777778
[3,] 0.2222222 0.3333333
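Since stringdist() is vectorized over its first two arguments, the same distance matrix can also be computed without the helper functions (same toy data frames as above):
library(stringdist)
# as.matrix() turns the data frames into character matrices; stringdist()
# then compares the cells element-wise and the result is reshaped back
dist_matrix2 <- matrix(stringdist(as.matrix(df1), as.matrix(df2), method = "jw"),
                       nrow = nrow(df1))
dist_matrix2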

Related

When I concatenate in R am I creating a row or a column?

I concatenate the following:
ExampleConCat <- c(1, 1, 1, 0)
and I have a 20x4 matrix (MatrixExample, shown below). I can do matrix multiplication in RStudio as below:
matrix.multipl <- MatrixExample %*% ExampleConCat
I get the below results:
# [,1]
# cycle_1 0.99019608
# cycle_2 0.96400149
# cycle_3 0.91064055
# cycle_4 0.83460040
# cycle_5 0.74478532
# cycle_6 0.64981877
# cycle_7 0.55637987
# cycle_8 0.46893791
# cycle_9 0.39005264
# cycle_10 0.32083829
# cycle_11 0.26141338
# cycle_12 0.21127026
# cycle_13 0.16955189
# cycle_14 0.13524509
# cycle_15 0.10730721
# cycle_16 0.08474320
# cycle_17 0.06664783
# cycle_18 0.05222437
# cycle_19 0.04078855
# cycle_20 0.03176356
My understanding is that:
To multiply an m×n matrix by an n×p matrix, the ns must be the same, and the result is an m×p matrix. https://www.mathsisfun.com/algebra/matrix-multiplying.html
So, the fact that it calculates at all indicates to me that the concatenation above creates a column, i.e. MatrixExample is a 20x4 matrix, thus ExampleConCat must be a 4x1 vector in order for the two to be multiplied by each other.
Or, are there different rules when one multiplies a vector by a matrix, and could you explain those to me simply?
I noticed that when I tried
matrix.multipl <- ExampleConCat %*% MatrixExample
I get the following:
Error in ExampleConCat %*% MatrixExample : non-conformable arguments
I would appreciate an explanation which reflects that I am new to R and newer still to matrix multiplication.
# MatrixExample:
# State A State B State C State D
# cycle_1 0.721453287 0.201845444 0.06689735 0.009803922
# cycle_2 0.520494846 0.262910628 0.18059602 0.035998510
# cycle_3 0.375512717 0.257831905 0.27729592 0.089359455
# cycle_4 0.270914884 0.225616773 0.33806874 0.165399604
# cycle_5 0.195452434 0.185784574 0.36354831 0.255214678
# cycle_6 0.141009801 0.147407084 0.36140189 0.350181229
# cycle_7 0.101731984 0.114117654 0.34053023 0.443620127
# cycle_8 0.073394875 0.086845747 0.30869729 0.531062087
# cycle_9 0.052950973 0.065278842 0.27182282 0.609947364
# cycle_10 0.038201654 0.048620213 0.23401643 0.679161707
# cycle_11 0.027560709 0.035963116 0.19788955 0.738586622
# cycle_12 0.019883764 0.026460490 0.16492601 0.788729740
# cycle_13 0.014345207 0.019389137 0.13581754 0.830448113
# cycle_14 0.010349397 0.014162175 0.11073351 0.864754914
# cycle_15 0.007466606 0.010318351 0.08952225 0.892692795
# cycle_16 0.005386808 0.007502899 0.07185350 0.915256795
# cycle_17 0.003886330 0.005447095 0.05731440 0.933352173
# cycle_18 0.002803806 0.003949642 0.04547092 0.947775632
# cycle_19 0.002022815 0.002860998 0.03590474 0.959211445
# cycle_20 0.001459366 0.002070768 0.02823342 0.968236444
If you check the help section help("%*%"), it briefly describes the rule used for matrix multiplication with vectors.
Multiplies two matrices, if they are conformable. If one argument is a vector, it will be promoted to either a row or column matrix to make the two arguments conformable. If both are vectors of the same length, it will return the inner product (as a matrix).
Doing MatrixExample %*% ExampleConCat, as you rightly pointed out, conforms to those rules: ExampleConCat is treated as a 4-by-1 matrix. But when ExampleConCat %*% MatrixExample is attempted, the dimensions don't match: ExampleConCat is 1-by-4 (or 4-by-1), whereas MatrixExample is 20-by-4.
The vector is promoted to either a row or a column matrix, whichever makes the multiplication conformable. As an example, see below:
exm = c(1,1,1,0)
exm_matrix = matrix(rnorm(16), ncol = 4)
exm_matrix%*%exm
#> [,1]
#> [1,] 2.1098758
#> [2,] -1.4432619
#> [3,] -0.2540392
#> [4,] -0.4211889
exm%*%exm_matrix
#> [,1] [,2] [,3] [,4]
#> [1,] 1.161164 -0.3602107 -0.3883783 -1.580562
Created on 2021-07-02 by the reprex package (v0.3.0)
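To multiply in the reversed order with the 20x4 matrix from the question, you have to make the dimensions conform yourself, for example by transposing the matrix; and when both arguments are plain vectors of the same length, %*% returns their inner product as a 1x1 matrix. A small sketch, with a hypothetical runif() stand-in for the 20x4 MatrixExample:
ExampleConCat <- c(1, 1, 1, 0)
MatrixExample <- matrix(runif(80), nrow = 20, ncol = 4)  # stand-in, same 20x4 shape
dim(MatrixExample %*% ExampleConCat)   # 20 x 1: the vector is promoted to a 4x1 column
# ExampleConCat %*% MatrixExample      # error: 1x4 cannot be multiplied by 20x4
ExampleConCat %*% t(MatrixExample)     # works: 1x4 times 4x20 gives a 1x20 row matrix
ExampleConCat %*% ExampleConCat        # two equal-length vectors: inner product, a 1x1 matrix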

Using the cor.test function in R

If x is an n*m matrix, cor(x) gives me an m*m correlation matrix between each pair of columns.
How can I use the cor.test function on the n*m matrix to also get an m*m p-value matrix?
There may be an existing function, but here's my version. p_cor_mat runs cor.test on each pair of columns in matrix x and records the p-value. These are then put into a square matrix and returned.
# Set seed
set.seed(42)
# Matrix of data
x <- matrix(runif(120), ncol = 4)
# Function for creating p value matrix
p_cor_mat <- function(x) {
  # All combinations of columns
  colcom <- t(combn(1:ncol(x), 2))
  # Calculate p values
  p_vals <- apply(colcom, MARGIN = 1, function(i) cor.test(x[, i[1]], x[, i[2]])$p.value)
  # Create matrix for result
  p_mat <- diag(ncol(x))
  # Fill upper & lower triangles
  p_mat[colcom] <- p_mat[colcom[, 2:1]] <- p_vals
  # Return result
  p_mat
}
# Test function
p_cor_mat(x)
#> [,1] [,2] [,3] [,4]
#> [1,] 1.0000000 0.4495713 0.9071164 0.8462530
#> [2,] 0.4495713 1.0000000 0.5960786 0.7093539
#> [3,] 0.9071164 0.5960786 1.0000000 0.7466226
#> [4,] 0.8462530 0.7093539 0.7466226 1.0000000
Created on 2019-03-06 by the reprex package (v0.2.1)
Please also see the cor.mtest() function in the corrplot package.
https://www.rdocumentation.org/packages/corrplot/versions/0.92/topics/cor.mtest
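A minimal sketch of the corrplot route, using the x defined above: cor.mtest() returns a list whose p element is the matrix of p-values (confidence-interval bounds are returned as well).
library(corrplot)
res <- cor.mtest(x, conf.level = 0.95)
res$p   # m x m matrix of p-values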

Document-term matrix to a list of matrices R

I have a document-term matrix dtm, for example:
dtm
<<DocumentTermMatrix (documents: 50, terms: 50)>>
Non-/sparse entries: 220/2497
Sparsity : 100%
Maximal term length: 7
Weighting : term frequency (tf)
Now I want to transform it into a list of matrices, each representing a document. This is to fulfil the format required by the stm package:
[[1]]
[,1] [,2] [,3] [,4]
[1,] 23 33 42 117
[2,] 2 1 3 1
[[2]]
[,1] [,2] [,3] [,4]
[1,] 2 19 93 168
[2,] 2 2 1 1
I am thinking of finding all the non-zero entries in dtm and turning them into matrices, one row at a time:
mat = matrix()
dtm.to.mat = function(x){
  mat[1,] = x[x != 0]
  mat[2,] = colnames(x[x != 0])
  return(mat)
}
matrix = list(apply(dtm, 1, dtm.to.mat))
However,
x[x != 0]
just won't work. The error says:
$ operator is invalid for atomic vectors
I was wondering why this is the case. If I convert x to a matrix beforehand, it doesn't give me this error. However, I actually have a dtm with approximately 2,500,000 rows, so I fear this will be very inefficient.
Me again!
I wouldn't use a dtm as the input for the stm package unless your data is particularly strange. Use the function stm::textProcessor. You can specify the documents as raw (unprocessed) text from a character vector of any length. You can also specify the metadata as you wish:
Suppose you have a dataframe df with a column called df$documents which is your raw text and df$meta which is your covariate:
processed <- textProcessor(df$documents, metadata = df$meta, lowercase = TRUE,
                           removestopwords = TRUE, removenumbers = TRUE,
                           removepunctuation = TRUE, stem = TRUE,
                           wordLengths = c(3, Inf))
stm_50 <- stm(documents = processed$documents, vocab = processed$vocab,
              K = 50, prevalence = ~ meta, init.type = "Spectral", seed = 57468)
This will run a 50 topic STM.
textProcessor will deal with empty documents and their associated metadata.
Edit: stm::textProcessor is technically just a wrapper for the tm package. But it is designed to remove problem documents, while dealing with their associated covariates.
Also the metadata argument can take a dataframe if you have multiple covariates. In that case you would also need to modify the prevalence argument in the second equation.
If you have something tricky like this, I'd switch over to the quanteda package, as it has nice converters to stm. If you want to stick with tm, have you tried using stm::convertCorpus to change the object into the list structure stm needs?
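A minimal sketch of the quanteda route, assuming the same hypothetical df$documents and df$meta as above: build a dfm and convert() it to the list structure stm expects.
library(quanteda)
# hypothetical df$documents (raw text) and df$meta (covariate), as in the answer above
toks  <- tokens(df$documents, remove_punct = TRUE, remove_numbers = TRUE)
toks  <- tokens_remove(toks, stopwords("en"))
dfmat <- dfm(toks)
out   <- convert(dfmat, to = "stm", docvars = data.frame(meta = df$meta))
# out$documents, out$vocab and out$meta can be passed straight to stm()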

Generate multivariate normal r.v.'s with rank-deficient covariance via Pivoted Cholesky Factorization

I'm just beating my head against the wall trying to get a Cholesky decomposition to work in order to simulate correlated price movements.
I use the following code:
cormat <- as.matrix(read.csv("http://pastebin.com/raw/qGbkfiyA"))
cormat <- cormat[,2:ncol(cormat)]
rownames(cormat) <- colnames(cormat)
cormat <- apply(cormat,c(1,2),FUN = function(x) as.numeric(x))
chol(cormat)
#Error in chol.default(cormat) :
# the leading minor of order 8 is not positive definite
cholmat <- chol(cormat, pivot=TRUE)
#Warning message:
# In chol.default(cormat, pivot = TRUE) :
# the matrix is either rank-deficient or indefinite
rands <- array(rnorm(ncol(cholmat)), dim = c(10000,ncol(cholmat)))
V <- t(t(cholmat) %*% t(rands))
#Check for similarity
cor(V) - cormat ## Not all zeros!
#Check the standard deviations
apply(V,2,sd) ## Not all ones!
I'm not really sure how to properly use the pivot = TRUE argument to generate my correlated movements. The results look totally bogus.
Even with a simple matrix, trying out "pivot" gives bogus results...
cormat <- matrix(c(1,.95,.90,.95,1,.93,.90,.93,1), ncol=3)
cholmat <- chol(cormat)
# No Error
cholmat2 <- chol(cormat, pivot=TRUE)
# No warning... pivot changes column order
rands <- array(rnorm(ncol(cholmat)), dim = c(10000,ncol(cholmat)))
V <- t(t(cholmat2) %*% t(rands))
#Check for similarity
cor(V) - cormat ## Not all zeros!
#Check the standard deviations
apply(V,2,sd) ## Not all ones!
There are two errors with your code:
You did not use the pivoting index to revert the pivoting applied to the Cholesky factor. Note that pivoted Cholesky factorization of a positive semi-definite matrix A computes:
P'AP = R'R
where P is a column pivoting matrix and R is an upper triangular matrix. To recover A from R, we need to apply the inverse of P (i.e., P'):
A = PR'RP' = (RP')'(RP')
A multivariate normal sample with covariance matrix A is then generated by:
XRP'
where X is multivariate normal with zero mean and identity covariance.
Your generation of X
X <- array(rnorm(ncol(R)), dim = c(10000,ncol(R)))
is wrong. First, X should have r columns, where r is the numerical rank of A, rather than ncol(R) columns. Second, you are recycling rnorm(ncol(R)) along columns, so the resulting matrix is not random at all; cor(X) is therefore never close to an identity matrix. The correct code is:
X <- matrix(rnorm(10000 * r), 10000, r)
As a model implementation of the above theory, consider your toy example:
A <- matrix(c(1,.95,.90,.95,1,.93,.90,.93,1), ncol=3)
We compute the upper triangular factor (suppressing possible rank-deficient warnings) and extract inverse pivoting index and rank:
R <- suppressWarnings(chol(A, pivot = TRUE))
piv <- order(attr(R, "pivot")) ## reverse pivoting index
r <- attr(R, "rank") ## numerical rank
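As a quick sanity check on the factorization (using the toy A above), reversing the pivot should reproduce A up to floating-point error:
max(abs(crossprod(R[1:r, piv, drop = FALSE]) - A))   # essentially zero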
Then we generate X. For a better result, we centre X so that its column means are 0.
X <- matrix(rnorm(10000 * r), 10000, r)
## for best effect, we centre `X`
X <- sweep(X, 2L, colMeans(X), "-")
Then we generate target multivariate normal:
## compute `V = RP'`
V <- R[1:r, piv]
## compute `Y = X %*% V`
Y <- X %*% V
We can verify that Y has target covariance A:
cor(Y)
# [,1] [,2] [,3]
#[1,] 1.0000000 0.9509181 0.9009645
#[2,] 0.9509181 1.0000000 0.9299037
#[3,] 0.9009645 0.9299037 1.0000000
A
# [,1] [,2] [,3]
#[1,] 1.00 0.95 0.90
#[2,] 0.95 1.00 0.93
#[3,] 0.90 0.93 1.00
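The recipe above can be wrapped into a small helper (a hypothetical rmvnorm_pivchol(), just the steps above in one function) and applied directly to the rank-deficient cormat from the original question:
rmvnorm_pivchol <- function(n, A) {
  # pivoted Cholesky of the (possibly rank-deficient) covariance matrix
  R   <- suppressWarnings(chol(A, pivot = TRUE))
  piv <- order(attr(R, "pivot"))   # reverse pivoting index
  r   <- attr(R, "rank")           # numerical rank
  # centred standard normal draws with r columns
  X <- matrix(rnorm(n * r), n, r)
  X <- sweep(X, 2L, colMeans(X), "-")
  # Y = X %*% R %*% P'
  X %*% R[1:r, piv, drop = FALSE]
}
V <- rmvnorm_pivchol(10000, cormat)
range(cor(V) - cormat)   # now close to zero everywhere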

Vectorized element-wise division on Sparse Matrices in R

A/B in R performs an element-wise division on the matrix.
However, if I generate a sparse matrix from the Matrix package, and try to divide A/B, I get this error:
> class(N)
[1] "dgCMatrix"
attr(,"package")
[1] "Matrix"
> N/N
Error in asMethod(object) :
Cholmod error 'problem too large' at file ../Core/cholmod_dense.c, line 105
>
Interesting. When the sparse matrix is small in total size, I get this behavior:
> m <- sparseMatrix(i=c(1,2,1,3), j=c(1,1,3,3), x=c(1,2,1,4))
> m/m
3 x 3 Matrix of class "dgeMatrix"
[,1] [,2] [,3]
[1,] 1 NaN 1
[2,] 1 NaN NaN
[3,] NaN NaN 1
>
But when it's moderately sized (~ 20000 elements), I get the Cholmod error.
Is there a workaround or a more proper way to do element-wise division on sparse matrices in R?
The problem with element-wise division is that if your matrices are both sparse, then you'll have a lot of Inf and NaN in the result, and these make it dense. That's why you get the out-of-memory errors.
If you want to replace Inf and NaN with zeros in the result, the solution is relatively easy: just get the summary() of both matrices and work with the indices and values directly.
You'll need to restrict the A and B index vectors to their intersection and perform the division on that. To get the intersection of index pairs, one can use merge().
Here is a quick and dirty implementation:
library(Matrix)

# Some example data
A <- sparseMatrix(i=c(1,1,2,3), j=c(1,3,1,3), x=c(1,1,2,3))
B <- sparseMatrix(i=c(3,2,1), j=c(3,2,1), x=c(3,2,1))

sdiv <- function(X, Y, names = dimnames(X)) {
  sX <- summary(X)
  sY <- summary(Y)
  sRes <- merge(sX, sY, by = c("i", "j"))
  sparseMatrix(i = sRes[, 1], j = sRes[, 2], x = sRes[, 3] / sRes[, 4],
               dimnames = names)
}
sdiv(A, B)
# 3 x 3 sparse Matrix of class "dgCMatrix"
#
# [1,] 1 . .
# [2,] . . .
# [3,] . . 1
Thanks to flodel for the suggestion about using summary and merge.
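Applied to the m from the earlier example, the NaN entries of m/m simply become structural zeros (a quick check, reusing sdiv() and the Matrix package loaded above):
m <- sparseMatrix(i = c(1, 2, 1, 3), j = c(1, 1, 3, 3), x = c(1, 2, 1, 4))
sdiv(m, m)
# 3 x 3 sparse Matrix of class "dgCMatrix"
# [1,] 1 . 1
# [2,] 1 . .
# [3,] . . 1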
