Vectorized element-wise division on sparse matrices in R

A/B in R performs an element-wise division on the matrix.
However, if I generate a sparse matrix from the Matrix package, and try to divide A/B, I get this error:
> class(N)
[1] "dgCMatrix"
attr(,"package")
[1] "Matrix"
> N/N
Error in asMethod(object) :
Cholmod error 'problem too large' at file ../Core/cholmod_dense.c, line 105
>
Interesting. When the sparse matrix is small in total size, I get this behavior:
> m <- sparseMatrix(i=c(1,2,1,3), j=c(1,1,3,3), x=c(1,2,1,4))
> m/m
3 x 3 Matrix of class "dgeMatrix"
[,1] [,2] [,3]
[1,] 1 NaN 1
[2,] 1 NaN NaN
[3,] NaN NaN 1
>
But when it's moderately sized (~ 20000 elements), I get the Cholmod error.
Is there a workaround or a more proper way to do element-wise division on sparse matrices in R?

The problem with element-wise division is that when both matrices are sparse, most positions hold 0/0 (NaN) or x/0 (Inf), so the result is dense. That is why you get the out-of-memory error.
If you are happy to replace Inf and NaN with zeros in the result, the solution is relatively easy: take the summary() of both matrices and work with the indices and values directly.
You need to restrict the index pairs of A and B to their intersection and perform the division only there. To get the intersection of index pairs, one can use merge().
Here is a quick and dirty implementation:
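library(Matrix)
# Some example data
A <- sparseMatrix(i=c(1,1,2,3), j=c(1,3,1,3), x=c(1,1,2,3))
B <- sparseMatrix(i=c(3,2,1), j=c(3,2,1), x=c(3,2,1))
# Divide X by Y only where both have stored entries; all other cells stay zero
sdiv <- function(X, Y, names=dimnames(X)) {
  sX <- summary(X)
  sY <- summary(Y)
  sRes <- merge(sX, sY, by=c("i", "j"))
  # dims keeps the shape of X even when trailing rows or columns are empty
  sparseMatrix(i=sRes[,1], j=sRes[,2], x=sRes[,3]/sRes[,4],
               dims=dim(X), dimnames=names)
}
sdiv(A, B)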
# 3 x 3 sparse Matrix of class "dgCMatrix"
#
# [1,] 1 . .
# [2,] . . .
# [3,] . . 1
Thanks to flodel for the suggestion about using summary and merge.
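As a sanity check on the approach above, the sparse result can be compared with the dense quotient after its Inf and NaN entries are set to zero. This is only a sketch on the small example matrices; sdiv_check is a hypothetical name for a variant of the function above, and the comparison assumes the matrices are small enough to densify.

```r
library(Matrix)

A <- sparseMatrix(i = c(1, 1, 2, 3), j = c(1, 3, 1, 3), x = c(1, 1, 2, 3))
B <- sparseMatrix(i = c(3, 2, 1), j = c(3, 2, 1), x = c(3, 2, 1))

# Hypothetical variant of sdiv(): divide only where both matrices store a value
sdiv_check <- function(X, Y) {
  sRes <- merge(summary(X), summary(Y), by = c("i", "j"))
  sparseMatrix(i = sRes$i, j = sRes$j, x = sRes$x.x / sRes$x.y, dims = dim(X))
}

# Dense reference: ordinary division, then zero out Inf and NaN
dense <- as.matrix(A) / as.matrix(B)
dense[!is.finite(dense)] <- 0

sparse_res <- as.matrix(sdiv_check(A, B))
same <- isTRUE(all.equal(sparse_res, dense, check.attributes = FALSE))
```

On large matrices only the sparse path is practical; the dense reference is there purely to confirm the semantics.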

Related

When I concatenate in R am I creating a row or a column?

I concatenate the following:
ExampleConCat <- c(1, 1, 1, 0) and I have a 20x4 matrix (MatrixExample as below).
I can do matrix multiplication in Rstudio as below:
matrix.multipl <- MatrixExample %*% ExampleConCat
I get the below results:
# [,1]
# cycle_1 0.99019608
# cycle_2 0.96400149
# cycle_3 0.91064055
# cycle_4 0.83460040
# cycle_5 0.74478532
# cycle_6 0.64981877
# cycle_7 0.55637987
# cycle_8 0.46893791
# cycle_9 0.39005264
# cycle_10 0.32083829
# cycle_11 0.26141338
# cycle_12 0.21127026
# cycle_13 0.16955189
# cycle_14 0.13524509
# cycle_15 0.10730721
# cycle_16 0.08474320
# cycle_17 0.06664783
# cycle_18 0.05222437
# cycle_19 0.04078855
# cycle_20 0.03176356
My understanding is that:
To multiply an m×n matrix by an n×p matrix, the ns must be the same, and the result is an m×p matrix. https://www.mathsisfun.com/algebra/matrix-multiplying.html
So, the fact that it calculates at all indicates to me that the concatenation above creates a column, i.e. MatrixExample is a 20x4 matrix, so ExampleConCat must be a 4x1 vector in order for the two to be multiplied by each other.
Or, are there different rules when one multiplies a vector by a matrix, and could you explain those to me simply?
I noticed that when I tried
matrix.multipl <- ExampleConCat %*% MatrixExample
I get the following:
Error in ExampleConCat %*% MatrixExample : non-conformable arguments
I would appreciate an explanation which reflects that I am new to R and newer still to matrix multiplication.
# MatrixExample:
# State A State B State C State D
# cycle_1 0.721453287 0.201845444 0.06689735 0.009803922
# cycle_2 0.520494846 0.262910628 0.18059602 0.035998510
# cycle_3 0.375512717 0.257831905 0.27729592 0.089359455
# cycle_4 0.270914884 0.225616773 0.33806874 0.165399604
# cycle_5 0.195452434 0.185784574 0.36354831 0.255214678
# cycle_6 0.141009801 0.147407084 0.36140189 0.350181229
# cycle_7 0.101731984 0.114117654 0.34053023 0.443620127
# cycle_8 0.073394875 0.086845747 0.30869729 0.531062087
# cycle_9 0.052950973 0.065278842 0.27182282 0.609947364
# cycle_10 0.038201654 0.048620213 0.23401643 0.679161707
# cycle_11 0.027560709 0.035963116 0.19788955 0.738586622
# cycle_12 0.019883764 0.026460490 0.16492601 0.788729740
# cycle_13 0.014345207 0.019389137 0.13581754 0.830448113
# cycle_14 0.010349397 0.014162175 0.11073351 0.864754914
# cycle_15 0.007466606 0.010318351 0.08952225 0.892692795
# cycle_16 0.005386808 0.007502899 0.07185350 0.915256795
# cycle_17 0.003886330 0.005447095 0.05731440 0.933352173
# cycle_18 0.002803806 0.003949642 0.04547092 0.947775632
# cycle_19 0.002022815 0.002860998 0.03590474 0.959211445
# cycle_20 0.001459366 0.002070768 0.02823342 0.968236444
If you check the help page help("%*%"), it briefly describes how the rules of matrix multiplication are applied to vectors:
Multiplies two matrices, if they are conformable. If one argument is a vector, it will be promoted to either a row or column matrix to make the two arguments conformable. If both are vectors of the same length, it will return the inner product (as a matrix).
MatrixExample %*% ExampleConCat, as you rightly pointed out, conforms to those rules: ExampleConCat is treated as a 4 by 1 matrix. With ExampleConCat %*% MatrixExample the dimensions cannot match: ExampleConCat can only become 4 by 1 or 1 by 4, whereas MatrixExample is 20 by 4, so the inner dimensions (1 or 4 versus 20) never agree.
The vector is converted to either a row or a column matrix, whichever makes the multiplication conformable. For example:
exm = c(1,1,1,0)
exm_matrix = matrix(rnorm(16),
ncol=4)
exm_matrix%*%exm
#> [,1]
#> [1,] 2.1098758
#> [2,] -1.4432619
#> [3,] -0.2540392
#> [4,] -0.4211889
exm%*%exm_matrix
#> [,1] [,2] [,3] [,4]
#> [1,] 1.161164 -0.3602107 -0.3883783 -1.580562
Created on 2021-07-02 by the reprex package (v0.3.0)
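The promotion rule can also be seen by inspecting the dimensions of the results. The sketch below uses a made-up 20 x 4 matrix m (not the MatrixExample data) and captures the error message instead of stopping.

```r
m <- matrix(1:80, nrow = 20, ncol = 4)  # stand-in for a 20 x 4 matrix
v <- c(1, 1, 1, 0)                      # plain vector: no dim attribute

is.null(dim(v))                         # TRUE: c() creates neither a row nor a column

right <- m %*% v                        # v is promoted to a 4 x 1 column
dim(right)                              # 20 1

left <- v %*% t(m)                      # v is promoted to a 1 x 4 row
dim(left)                               # 1 20

# Neither promotion fits a 20 x 4 matrix on the right, so this fails
msg <- tryCatch(v %*% m, error = function(e) conditionMessage(e))
msg
```

So c() itself creates a dimensionless vector; %*% decides per call whether to treat it as a row or a column.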

text similarity between two tfidf matrix

I have two XML text files. Using the quanteda and tm packages, I tokenized them and transformed them into tf-idf matrices. (The original post showed a screenshot of the RStudio environment here.)
How can I calculate the similarity between these two files, for example using Jaccard?
I have tried dist(), cosine(), and text2vec, but I get errors with all of them.
For example:
cosine(x = pta2.tokens.tfidf, y = pta3.tokens.tfidf)
Error in cosine(x = pta2.tokens.tfidf, y = pta3.tokens.tfidf) :
argument mismatch. Either one matrix or two vectors needed as input.
simi <- sim2(pta2.tokens.tfidf, pta3.tokens.tfidf, method = "jaccard", norm = "none")
Error: ncol(x) == ncol(y) is not TRUE
The problem is that you have a data.frame with string values, while the distance functions you are using need a numeric matrix as input.
DIST
you need a numeric matrix:
?dist
Usage
dist(x, method = "euclidean", diag = FALSE, upper = FALSE, p=2)
Arguments
x a numeric matrix, data frame or "dist" object.
COSINE
you need numeric values:
?cosine
Usage
cosine(x, y, use = "everything", inverse = FALSE)
Arguments
x A numeric dataframe/matrix or vector
SIM2
Your error is due to the different number of columns in pta2.tokens.tfidf and pta3.tokens.tfidf. Here is an example of the error:
df1<-as.matrix(data.frame(a=c("a","b","c"),b=c("d","e","f")))
df2<-as.matrix(data.frame(c=c("a","b","c"),d=c("d","e","f"),e=c("g","h","i")))
sim2(df1,df2)
Error: ncol(x) == ncol(y) is not TRUE
Even with the same dimensions, this method will not work on strings, because it needs numeric input, as you can see:
sim2(df1,df1)
Error in m^2 : non-numeric argument to binary operator
You must have numeric matrices with the same dimensions, like this:
df3<-as.matrix(data.frame(a=c(1,2,3),b=c(4,5,6)))
df4<-as.matrix(data.frame(a=c(3,2,3),b=c(3,3,6)))
sim2(df3,df4)
[,1] [,2] [,3]
[1,] 0.8574929 0.9417419 0.9761871
[2,] 0.9191450 0.9785498 0.9965458
[3,] 0.9486833 0.9922779 1.0000000
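When two numeric term matrices have different vocabularies, one way to satisfy the equal-column requirement is to pad both to the union of their column names before comparing. The sketch below is base R only and does not use text2vec; the toy matrices, pad_to_vocab and cosine_rows are hypothetical, and it assumes both matrices carry term names as column names.

```r
# Toy document-term matrices with different vocabularies (made-up values)
x <- matrix(c(1, 0, 2,
              0, 1, 1), nrow = 2, byrow = TRUE,
            dimnames = list(c("d1", "d2"), c("apple", "pear", "plum")))
y <- matrix(c(1, 1,
              0, 3), nrow = 2, byrow = TRUE,
            dimnames = list(c("e1", "e2"), c("apple", "kiwi")))

# Hypothetical helper: expand a matrix to a common vocabulary, filling with 0
pad_to_vocab <- function(m, vocab) {
  out <- matrix(0, nrow(m), length(vocab), dimnames = list(rownames(m), vocab))
  out[, colnames(m)] <- m
  out
}

# Hypothetical helper: cosine similarity between every row of a and every row of b
cosine_rows <- function(a, b) {
  (a %*% t(b)) / (sqrt(rowSums(a^2)) %o% sqrt(rowSums(b^2)))
}

vocab <- union(colnames(x), colnames(y))
xp <- pad_to_vocab(x, vocab)
yp <- pad_to_vocab(y, vocab)
sim <- cosine_rows(xp, yp)
```

Only d1 and e1 share a term here, so they are the only pair with a non-zero similarity.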
A possible solution
Use the stringdist function from the stringdist package. Here is a toy example:
Two data frames with string values:
df1<-data.frame(a=c("abc","bav","cda"),b=c("ddd","ese","feff"))
df2<-data.frame(a=c("abc","gfb","cdd"),b=c("dsd","eeesfd","fafe"))
Function to compare string values in two big data.frames:
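library(stringdist)
# Jaro-Winkler distance between matching cells of column i in df1 and df2
f <- function(i, df1, df2) {
  f2 <- function(y, list1, list2) {
    stringdist(list1[y], list2[y], method = "jw")
  }
  unlist(lapply(seq_along(df1[, i]), f2, list1 = df1[, i], list2 = df2[, i]))
}
# One column of distances per data frame column
dist_matrix <- do.call(cbind, lapply(seq_len(ncol(df1)), f, df1 = df1, df2 = df2))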
Distance matrix
dist_matrix
[,1] [,2]
[1,] 0.0000000 0.2222222
[2,] 1.0000000 0.2777778
[3,] 0.2222222 0.3333333

Can't get a positive definite variance matrix when very small eigen values

To run a canonical correspondence analysis (cca in package ade4) I need a positive definite variance matrix (which in theory is always the case),
but:
> a <- matrix(c(2,59,4,7,10,0,7,0,0,0,475,18714,4070,97,298,0,1,0,17,7,4,1,4,18,36),nrow=5)
> a
[,1] [,2] [,3] [,4] [,5]
[1,] 2 0 475 0 4
[2,] 59 7 18714 1 1
[3,] 4 0 4070 0 4
[4,] 7 0 97 17 18
[5,] 10 0 298 7 36
> eigen(var(a))
$values
[1] 6.380066e+07 1.973658e+02 3.551492e+01 1.033096e+01
[5] -1.377693e-09
The last eigenvalue is -1.377693e-09, which is < 0, but the theoretical value is > 0.
I can't run the function if one of the eigenvalues is < 0.
I really don't know how to fix this without changing the code of the function cca().
Thanks for your help.
You can change the input, just a little bit, to make the matrix positive definite.
If you have the variance matrix, you can truncate the eigenvalues:
correct_variance <- function(V, minimum_eigenvalue = 0) {
V <- ( V + t(V) ) / 2
e <- eigen(V)
e$vectors %*% diag(pmax(minimum_eigenvalue,e$values)) %*% t(e$vectors)
}
v <- correct_variance( var(a) )
eigen(v)$values
# [1] 6.380066e+07 1.973658e+02 3.551492e+01 1.033096e+01 1.326768e-08
Using the singular value decomposition, you can do the same thing directly with a.
truncate_singular_values <- function(a, minimum = 0) {
s <- svd(a)
s$u %*% diag( ifelse( s$d > minimum, s$d, minimum ) ) %*% t(s$v)
}
svd(a)$d
# [1] 1.916001e+04 4.435562e+01 1.196984e+01 8.822299e+00 1.035624e-01
eigen(var( truncate_singular_values(a,.2) ))$values
# [1] 6.380066e+07 1.973680e+02 3.551494e+01 1.033452e+01 6.079487e-09
However, this changes your matrix a by up to 0.1, which is a lot
(I suspect it is that high because the matrix a is square: as a result,
one of the eigenvalues of var(a) is exactly 0.)
b <- truncate_singular_values(a,.2)
max( abs(b-a) )
# [1] 0.09410187
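The suspicion that one eigenvalue of var(a) is exactly 0 can be made concrete: var() centers each column, the centered columns each sum to zero, and five such columns of length five can span at most four dimensions. The following sketch only checks the centering identity; centered and V_manual are names introduced here for illustration.

```r
a <- matrix(c(2,59,4,7,10,0,7,0,0,0,475,18714,4070,97,298,
              0,1,0,17,7,4,1,4,18,36), nrow = 5)

# Subtract the column means, as var() does internally
centered <- scale(a, center = TRUE, scale = FALSE)

# Every centered column sums to (numerically) zero, so all columns are
# orthogonal to the vector of ones and the rank is at most nrow(a) - 1
col_sums <- colSums(centered)

# The variance matrix is the cross-product of the centered data over n - 1
V_manual <- crossprod(centered) / (nrow(a) - 1)
same <- isTRUE(all.equal(V_manual, var(a), check.attributes = FALSE))
```

Because the rank is at most 4, the fifth eigenvalue is zero in exact arithmetic, and rounding can push it slightly below zero.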
We can actually do better simply by adding some noise.
b <- a + 1e-6*runif(length(a),-1,1) # Repeat if needed
eigen(var(b))$values
# [1] 6.380066e+07 1.973658e+02 3.551492e+01 1.033096e+01 2.492604e-09
Here are two approaches:
V <- var(a)
# 1
library(Matrix)
nearPD(V)$mat
# 2 perturb diagonals
eps <- 0.01
V + eps * diag(ncol(V))
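For the second approach, adding eps to the diagonal shifts every eigenvalue up by eps, so the effect is easy to inspect. A minimal sketch using the matrix a from the question (V_shift, ev_before and ev_after are names introduced here):

```r
a <- matrix(c(2,59,4,7,10,0,7,0,0,0,475,18714,4070,97,298,
              0,1,0,17,7,4,1,4,18,36), nrow = 5)
V <- var(a)

eps <- 0.01
V_shift <- V + eps * diag(ncol(V))   # perturb the diagonal only

ev_before <- eigen(V, symmetric = TRUE, only.values = TRUE)$values
ev_after  <- eigen(V_shift, symmetric = TRUE, only.values = TRUE)$values

min(ev_before)   # about zero, possibly slightly negative
min(ev_after)    # about eps, hence strictly positive
```

The price is a small bias of eps in every variance; whether that is acceptable depends on the scale of the data.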

Why does the calculation of Cohen's kappa fail across different packages on this contingency table?

I have a contingency table for which I would like to calculate Cohen's kappa, the level of agreement. I have tried using three different packages, which all seem to fail to some degree. The package e1071 has a function specifically for a contingency table, but that too seems to fail. Below is reproducible code. You will need to install the packages concord, e1071, and irr.
# Recreate my contingency table, output with dput
conf.mat<-structure(c(810531L, 289024L, 164757L, 114316L), .Dim = c(2L,
2L), .Dimnames = structure(list(landsat_2000_bin = c("0", "1"
), MOD12_2000_binForest = c("0", "1")), .Names = c("landsat_2000_bin",
"MOD12_2000_binForest")), class = "table")
library(concord)
cohen.kappa(conf.mat)
library(e1071)
classAgreement(conf.mat, match.names=TRUE)
library(irr)
kappa2(conf.mat)
The output I get from running this is:
> cohen.kappa(conf.mat)
Kappa test for nominally classified data
4 categories - 2 methods
kappa (Cohen) = 0 , Z = NaN , p = NaN
kappa (Siegel) = -0.333333 , Z = -0.816497 , p = 0.792892
kappa (2*PA-1) = -1
> classAgreement(conf.mat, match.names=TRUE)
$diag
[1] 0.6708459
$kappa
[1] NA
$rand
[1] 0.5583764
$crand
[1] 0.0594124
Warning message:
In ni[lev] * nj[lev] : NAs produced by integer overflow
> kappa2(conf.mat)
Cohen's Kappa for 2 Raters (Weights: unweighted)
Subjects = 2
Raters = 2
Kappa = 0
z = NaN
p-value = NaN
Could anyone advise on why these might fail? I have a large dataset, but as this table is simple I didn't think that could cause such problems.
In the first function, cohen.kappa, you need to specify that you are using count data and not just a n*m matrix of n subjects and m raters.
# cohen.kappa(conf.mat,'count')
cohen.kappa(conf.mat,'count')
The second function is trickier. Your table stores its counts as integer rather than numeric (double). R integers are 32-bit, so they cannot exceed 2147483647 (.Machine$integer.max), and the product of two of your large counts overflows. For example:
i=975288
j=1099555
class(i)
# [1] "numeric"
i*j
# 1.072383e+12
as.integer(i)*as.integer(j)
# [1] NA
# Warning message:
# In as.integer(i) * as.integer(j) : NAs produced by integer overflow
So you need to convert your matrix to numeric (double) values.
# classAgreement(conf.mat)
classAgreement(matrix(as.numeric(conf.mat),nrow=2))
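The overflow can be reproduced without any package. The sketch below uses the same two margins as the example above (975288 and 1099555, the first row and first column totals of conf.mat) and compares integer and double arithmetic.

```r
.Machine$integer.max                         # 2147483647, the largest R integer

i <- 975288L                                 # integer, like the counts in the table
j <- 1099555L

prod_int <- suppressWarnings(i * j)          # overflows: NA (with a warning)
prod_dbl <- as.numeric(i) * as.numeric(j)    # doubles hold the product exactly

is.na(prod_int)
prod_dbl > .Machine$integer.max
```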
Finally, take a look at the documentation for ?kappa2. It requires an n*m matrix of n subjects and m raters, as explained above, so it will not work with your (efficient) contingency table.
Do you need to know specifically why those fail? Here is a function that computes the statistic -- in a hurry, so I might clean it up later (kappa wiki):
kap <- function(x) {
a <- (x[1,1] + x[2,2]) / sum(x)
e <- (sum(x[1,]) / sum(x)) * (sum(x[,1]) / sum(x)) + (1 - (sum(x[1,]) / sum(x))) * (1 - (sum(x[,1]) / sum(x)))
(a-e)/(1-e)
}
Tests/output:
> (x = matrix(c(20,5,10,15), nrow=2, byrow=T))
[,1] [,2]
[1,] 20 5
[2,] 10 15
> kap(x)
[1] 0.4
> (x = matrix(c(45,15,25,15), nrow=2, byrow=T))
[,1] [,2]
[1,] 45 15
[2,] 25 15
> kap(x)
[1] 0.1304348
> (x = matrix(c(25,35,5,35), nrow=2, byrow=T))
[,1] [,2]
[1,] 25 35
[2,] 5 35
> kap(x)
[1] 0.2592593
> kap(conf.mat)
[1] 0.1258621
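The same statistic can be written for a square table of any size using row and column margins. This is a sketch rather than a replacement for the packages above; kappa_table is a hypothetical name, and it converts the counts to double first to avoid the integer overflow discussed earlier.

```r
# Hypothetical helper: unweighted Cohen's kappa from a k x k contingency table
kappa_table <- function(x) {
  x <- matrix(as.numeric(x), nrow = nrow(x))    # doubles: no integer overflow
  n <- sum(x)
  po <- sum(diag(x)) / n                        # observed agreement
  pe <- sum(rowSums(x) * colSums(x)) / n^2      # agreement expected by chance
  (po - pe) / (1 - pe)
}

k_small <- kappa_table(matrix(c(20, 5, 10, 15), nrow = 2, byrow = TRUE))

# The table from the question, stored as integers
conf.mat <- matrix(c(810531L, 289024L, 164757L, 114316L), nrow = 2)
k_conf <- kappa_table(conf.mat)
```

For 2 x 2 input this reduces to the kap() function above.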

Mystified by qr.Q(): what is an orthonormal matrix in "compact" form?

R has a qr() function, which performs QR decomposition using either LINPACK or LAPACK (in my experience, the latter is 5% faster). The main object returned is a matrix "qr" whose upper triangle, including the diagonal, contains the upper triangular matrix R (i.e. the entries qr[upper.tri(qr, diag=TRUE)]). So far so good. The lower triangular part of qr contains Q "in compact form". One can extract Q from the qr decomposition by using qr.Q(). I would like to find the inverse of qr.Q(). In other words, I have Q and R, and would like to put them in a "qr" object. R is trivial but Q is not. The goal is to apply qr.solve() to it, which is much faster than solve() on large systems.
Introduction
R uses a modified version of the LINPACK dqrdc routine (dqrdc2), by default, or the LAPACK DGEQP3 routine, when specified, for computing the QR decomposition. Both routines compute the decomposition using Householder reflections. An m x n matrix A is decomposed into an m x n economy-size orthogonal matrix (Q) and an n x n upper triangular matrix (R) as A = QR, where Q can be computed as the product of t Householder reflection matrices, with t being the lesser of m-1 and n: Q = H_1 H_2 ... H_t.
Each reflection matrix H_i can be represented by a length-(m-i+1) vector. For example, H_1 requires a length-m vector for compact storage. All but one entry of this vector is placed in the first column of the lower triangle of the input matrix (the diagonal is used by the R factor). Therefore, each reflection needs one more scalar of storage, and this is provided by an auxiliary vector (called $qraux in the result from R's qr).
The compact representation used is different between the LINPACK and LAPACK routines.
The LINPACK Way
A Householder reflection is computed as H_i = I - v_i v_i^T / p_i, where I is the identity matrix, p_i is the corresponding entry in $qraux, and v_i is as follows:
v_i[1..i-1] = 0,
v_i[i] = p_i,
v_i[i+1..m] = A[i+1..m, i] (i.e., a column of the lower triangle of A after calling qr)
LINPACK Example
Let's work through the example from the QR decomposition article at Wikipedia in R.
The matrix being decomposed is
> A <- matrix(c(12, 6, -4, -51, 167, 24, 4, -68, -41), nrow=3)
> A
[,1] [,2] [,3]
[1,] 12 -51 4
[2,] 6 167 -68
[3,] -4 24 -41
We do the decomposition, and the most relevant portions of the result are shown below:
> Aqr = qr(A)
> Aqr
$qr
[,1] [,2] [,3]
[1,] -14.0000000 -21.0000000 14
[2,] 0.4285714 -175.0000000 70
[3,] -0.2857143 0.1107692 -35
[snip...]
$qraux
[1] 1.857143 1.993846 35.000000
[snip...]
This decomposition was done (under the covers) by computing two Householder reflections and multiplying them by A to get R. We will now recreate the reflections from the information in $qr.
> p = Aqr$qraux # for convenience
> v1 <- matrix(c(p[1], Aqr$qr[2:3,1]))
> v1
[,1]
[1,] 1.8571429
[2,] 0.4285714
[3,] -0.2857143
> v2 <- matrix(c(0, p[2], Aqr$qr[3,2]))
> v2
[,1]
[1,] 0.0000000
[2,] 1.9938462
[3,] 0.1107692
> I = diag(3) # identity matrix
> H1 = I - v1 %*% t(v1)/p[1] # I - v1*v1^T/p[1]
> H2 = I - v2 %*% t(v2)/p[2] # I - v2*v2^T/p[2]
> Q = H1 %*% H2
> Q
[,1] [,2] [,3]
[1,] -0.8571429 0.3942857 0.33142857
[2,] -0.4285714 -0.9028571 -0.03428571
[3,] 0.2857143 -0.1714286 0.94285714
Now let's verify the Q computed above is correct:
> qr.Q(Aqr)
[,1] [,2] [,3]
[1,] -0.8571429 0.3942857 0.33142857
[2,] -0.4285714 -0.9028571 -0.03428571
[3,] 0.2857143 -0.1714286 0.94285714
Looks good! We can also verify QR is equal to A.
> R = qr.R(Aqr) # extract R from Aqr$qr
> Q %*% R
[,1] [,2] [,3]
[1,] 12 -51 4
[2,] 6 167 -68
[3,] -4 24 -41
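The two hand-built reflections generalize to a loop over the columns. The sketch below assumes the default LINPACK storage described above, a full-rank matrix without pivoting, and non-zero $qraux entries for the reflections used; q_from_linpack is a hypothetical helper, not part of R.

```r
# Hypothetical helper: rebuild the full m x m Q from the compact LINPACK form
q_from_linpack <- function(qrobj) {
  qrm <- qrobj$qr
  p <- qrobj$qraux
  m <- nrow(qrm)
  n <- ncol(qrm)
  Q <- diag(m)
  for (i in seq_len(min(m - 1, n))) {
    v <- numeric(m)                      # v[1..i-1] stays 0
    v[i] <- p[i]                         # diagonal slot comes from $qraux
    v[(i + 1):m] <- qrm[(i + 1):m, i]    # rest comes from the lower triangle
    Q <- Q %*% (diag(m) - tcrossprod(v) / p[i])   # Q = H_1 H_2 ... H_t
  }
  Q
}

A <- matrix(c(12, 6, -4, -51, 167, 24, 4, -68, -41), nrow = 3)
Aqr <- qr(A)
Q_manual <- q_from_linpack(Aqr)
same <- isTRUE(all.equal(Q_manual, qr.Q(Aqr)))
```

For a tall matrix qr.Q() returns only the first n columns by default, so the comparison would need qr.Q(Aqr, complete = TRUE).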
The LAPACK Way
A Householder reflection is computed as H_i = I - p_i v_i v_i^T, where I is the identity matrix, p_i is the corresponding entry in $qraux, and v_i is as follows:
v_i[1..i-1] = 0,
v_i[i] = 1,
v_i[i+1..m] = A[i+1..m, i] (i.e., a column of the lower triangle of A after calling qr)
There is another twist when using the LAPACK routine in R: column pivoting is used, so the decomposition is solving a different, related problem: AP = QR, where P is a permutation matrix.
LAPACK Example
This section does the same example as before.
> A <- matrix(c(12, 6, -4, -51, 167, 24, 4, -68, -41), nrow=3)
> Bqr = qr(A, LAPACK=TRUE)
> Bqr
$qr
[,1] [,2] [,3]
[1,] 176.2554964 -71.1694118 1.668033
[2,] -0.7348557 35.4388886 -2.180855
[3,] -0.1056080 0.6859203 -13.728129
[snip...]
$qraux
[1] 1.289353 1.360094 0.000000
$pivot
[1] 2 3 1
attr(,"useLAPACK")
[1] TRUE
[snip...]
Notice the $pivot field; we will come back to that. Now we generate Q from the information in Bqr.
> p = Bqr$qraux # for convenience
> v1 = matrix(c(1, Bqr$qr[2:3,1]))
> v1
[,1]
[1,] 1.0000000
[2,] -0.7348557
[3,] -0.1056080
> v2 = matrix(c(0, 1, Bqr$qr[3,2]))
> v2
[,1]
[1,] 0.0000000
[2,] 1.0000000
[3,] 0.6859203
> H1 = I - p[1]*v1 %*% t(v1) # I - p[1]*v1*v1^T
> H2 = I - p[2]*v2 %*% t(v2) # I - p[2]*v2*v2^T
> Q = H1 %*% H2
> Q
[,1] [,2] [,3]
[1,] -0.2893527 -0.46821615 -0.8348944
[2,] 0.9474882 -0.01602261 -0.3193891
[3,] 0.1361660 -0.88346868 0.4482655
Once again, the Q computed above agrees with the R-provided Q.
> qr.Q(Bqr)
[,1] [,2] [,3]
[1,] -0.2893527 -0.46821615 -0.8348944
[2,] 0.9474882 -0.01602261 -0.3193891
[3,] 0.1361660 -0.88346868 0.4482655
Finally, let's compute QR.
> R = qr.R(Bqr)
> Q %*% R
[,1] [,2] [,3]
[1,] -51 4 12
[2,] 167 -68 6
[3,] 24 -41 -4
Notice the difference? QR is A with its columns permuted in the order given by Bqr$pivot above.
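The permutation can be checked and undone with ordinary indexing. This sketch relies only on the identity AP = QR stated above, so it should hold whatever pivot order the LAPACK build chooses; QR and A_back are names introduced here.

```r
A <- matrix(c(12, 6, -4, -51, 167, 24, 4, -68, -41), nrow = 3)
Bqr <- qr(A, LAPACK = TRUE)

QR <- qr.Q(Bqr) %*% qr.R(Bqr)

# Q %*% R reproduces A with its columns in pivot order ...
pivoted_ok <- isTRUE(all.equal(QR, A[, Bqr$pivot]))

# ... and order(pivot) is the inverse permutation that restores A
A_back <- QR[, order(Bqr$pivot)]
restored_ok <- isTRUE(all.equal(A_back, A))
```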
I have researched the same problem the OP asks about, and I don't think it is possible. Basically, the OP's question is whether, given the explicitly computed Q, one can recover H1 H2 ... Ht. I do not think this is possible without computing the QR from scratch, but I would also be very interested to know whether such a solution exists.
I have a similar issue in a different context: my iterative algorithm needs to mutate the matrix A by adding columns and/or rows. The first time, the QR is computed using DGEQRF and is therefore in the compact LAPACK format. After A is mutated, e.g. with new rows, I can quickly build a new set of reflectors or rotators that annihilate the non-zero elements of the lowest diagonal of my existing R and build a new R. But now I have one set H1_old H2_old ... Hn_old and another set H1_new H2_new ... Hn_new (and similarly for the taus), which cannot be merged into a single compact QR representation. The two possibilities I have, and maybe the OP has the same two, are:
Always keep Q and R explicitly separated, both when first computed and after every update, at the cost of extra flops but with well-bounded memory.
Stick to the compact LAPACK format, but keep a list of all the small sets of update reflectors as each new update comes in. At the point of solving the system one would compute a big Q'*c, i.e. H1_u3*H2_u3*...*Hn_u3*H1_u2*H2_u2*...*Hn_u2*H1_u1*H2_u1...*Hn_u1*H1*H2*...*Hn*c, where ui is the QR update number. This is potentially a lot of multiplications to do and memory to keep track of, but definitely the fastest way.
The long answer from David explains what the compact QR format is, but not how to obtain it from explicitly computed Q and R.
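If the end goal is only to solve A x = b, the first possibility (explicit Q and R) does not need a "qr" object at all: since A = QR with orthonormal Q, x solves R x = Q'b, which is a single triangular back-substitution. A minimal sketch, assuming a square full-rank A and no column pivoting; it does not rebuild the compact form.

```r
A <- matrix(c(12, 6, -4, -51, 167, 24, 4, -68, -41), nrow = 3)
b <- c(1, 2, 3)

# Pretend these explicit factors are all we have
qrA <- qr(A)
Q <- qr.Q(qrA)
R <- qr.R(qrA)

# Solve R x = Q'b by back-substitution; crossprod(Q, b) computes t(Q) %*% b
x <- drop(backsolve(R, crossprod(Q, b)))

same <- isTRUE(all.equal(x, solve(A, b)))
```

With column pivoting (LAPACK = TRUE) the solution would additionally have to be un-permuted using $pivot.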
