Reproducing Singular Value Decomposition in R

I have an example word-by-document matrix (from Landauer and Dumais, 1997):
wxd <- matrix(c(1,1,1,0,0,0,0,0,0,0,0,0,
                0,0,1,1,1,1,1,0,1,0,0,0,
                0,1,0,1,1,0,0,1,0,0,0,0,
                1,0,0,0,2,0,0,1,0,0,0,0,
                0,0,0,1,0,1,1,0,0,0,0,0,
                0,0,0,0,0,0,0,0,0,1,0,0,
                0,0,0,0,0,0,0,0,0,1,1,0,
                0,0,0,0,0,0,0,0,0,1,1,1,
                0,0,0,0,0,0,0,0,1,0,1,1),
              nrow = 12, ncol = 9)
rownames(wxd) <- c("human", "interface", "computer", "user", "system",
"response", "time", "EPS", "survey", "trees", "graph", "minors")
colnames(wxd) <- c(paste0("c", 1:5), paste0("m", 1:4))
I can perform Singular Value Decomposition on this matrix using the svd() function, which gives me the three matrices U, S, and V:
SVD <- svd(wxd)
U <- SVD$u
S <- diag(SVD$d)
V <- SVD$v
I can multiply these matrices together and get my original matrix back (within some small margin of error):
U %*% S %*% t(V)
I can also take the first two columns of the U and V matrices and the first two columns and rows of the S matrix to get the best least-squares (rank-2) approximation of the original data. This fits with the results of the same procedure in the paper mentioned above:
U[ , 1:2] %*% S[1:2, 1:2] %*% t(V[ , 1:2])
I want to make sure I understand what this function is doing (as best I can), and I have been able to generate V and S matrices that match those from the svd() function:
ATA <- t(wxd) %*% wxd
V2 <- eigen(ATA)$vectors
S2 <- sqrt(diag(eigen(ATA)$values))
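For what it's worth, the S match can be confirmed up to floating-point tolerance:
all.equal(S2, S)
[1] TRUE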
But the U matrix I generate has the same absolute values in its first 9 columns, plus 3 additional columns, and some of its elements have different signs than the U matrix from the svd() function:
AAT <- wxd %*% t(wxd)
U2 <- eigen(AAT)$vectors
So my question is, why is the U matrix different than when I attempt to calculate it from scratch?

wxd has rank 9. Therefore, your AAT only has 9 non-zero eigenvalues (the rest are numerically tiny, ~1e-16). For those zero eigenvalues, the eigenvectors are arbitrary as long as they span the subspace of R^12 orthogonal to the one spanned by the other eigenvectors.
Now, by default svd only computes nu = min(n, p) left singular vectors (and similarly for the right singular vectors), where n is the number of rows and p is the number of columns of the input (see ?svd). Therefore you only get 9 left singular vectors. To generate all 12, call svd with:
svd(wxd,nu=nrow(wxd))
However, those extra 3 left singular vectors will not correspond to those found with eigen(AAT)$vectors, again because those eigenvectors are chosen somewhat arbitrarily to span that orthogonal subspace.
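The nine "real" left singular vectors can also be recovered directly from wxd, V, and the singular values, since u_i = A v_i / sigma_i column-wise. A minimal sketch, assuming all 9 singular values are non-zero (rank 9, as noted above; U3 is just an illustrative name):
U3 <- wxd %*% SVD$v %*% diag(1 / SVD$d)  # u_i = A v_i / sigma_i
all.equal(unname(U3), SVD$u)             # TRUE, up to floating point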
As for why some of the signs have changed, recall that eigenvectors are only determined up to a scale factor. Although these eigenvectors are normalized, they may still differ by a factor of -1. To check, just divide each column of U by the corresponding column of U2. You should get columns of all 1s or all -1s:
U[,1:9]/U2[,1:9]
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
## [1,] 1 -1 1 -1 1 -1 1 1 1
## [2,] 1 -1 1 -1 1 -1 1 1 1
## [3,] 1 -1 1 -1 1 -1 1 1 1
## [4,] 1 -1 1 -1 1 -1 1 1 1
## [5,] 1 -1 1 -1 1 -1 1 1 1
## [6,] 1 -1 1 -1 1 -1 1 1 1
## [7,] 1 -1 1 -1 1 -1 1 1 1
## [8,] 1 -1 1 -1 1 -1 1 1 1
## [9,] 1 -1 1 -1 1 -1 1 1 1
##[10,] 1 -1 1 -1 1 -1 1 1 1
##[11,] 1 -1 1 -1 1 -1 1 1 1
##[12,] 1 -1 1 -1 1 -1 1 1 1
Update: why an eigenvector is only determined up to a scale factor
This can be seen from the definition of the eigenvector. From Wikipedia,
In linear algebra, an eigenvector or characteristic vector of a linear transformation is a non-zero vector that does not change its direction when that linear transformation is applied to it.
In a finite-dimensional vector space, the linear transformation is given by multiplying the vector by a square matrix A, so the definition becomes (written in pseudo-notation since SO does not support LaTeX; * denotes matrix multiplication here):
A * v = lambda * v
which is known as the eigenvalue equation for the matrix A, where lambda is the eigenvalue associated with the eigenvector v. From this equation it is clear that if v is an eigenvector of A, then k * v for any non-zero scalar k is also an eigenvector of A with the same associated eigenvalue lambda.
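For illustration, this is easy to check numerically with the ATA from the question (a quick sketch; the scale factor -3 is an arbitrary choice):
e <- eigen(ATA)
v1 <- e$vectors[, 1]
all.equal(as.vector(ATA %*% v1), e$values[1] * v1)                # A * v = lambda * v
all.equal(as.vector(ATA %*% (-3 * v1)), e$values[1] * (-3 * v1))  # and so does k * v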


Create a matrix of 0s and 1s, such that each row has only one 1 and each column has at least two 1s

I want to create a 100*4 matrix of 0s and 1s, such that each row has only one 1 and each column has at least two 1s, in R.
I am thinking of rsparsematrix, but I am not sure how to apply my required conditions:
MyMat <- as.matrix(rsparsematrix(nrow = 100, ncol = 4, nnz = 100))
Edit: my other try would be dummy_cols, but I am still stuck on applying the two conditions. I guess there must be a more straightforward way of creating such a matrix.
1) A matrix consisting of 25 4x4 identity matrices stacked one on top of another satisfies these requirements:
m <- matrix(1, 25) %x% diag(4)
2) Exchanging the two arguments of %x% would also work and gives a different matrix which also satisfies this.
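For reference, a sketch of that exchanged form (m2 is just an illustrative name):
m2 <- diag(4) %x% matrix(1, 25)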
3) Any permutation of the rows and the columns of the two solution matrices in (1) and (2) would also satisfy the conditions.
m[sample(100), sample(4)]
4) If the objective is to generate a random table containing 0/1 values whose row sums are each 1 and whose column sums are each 25 then use r2dtable:
r <- r2dtable(1, rep(1, 100), rep(25, 4))[[1]]
5) Or, if it is desired to allow any column sums of at least 2:
rsums <- rep(1, 100)
csums <- rmultinom(1, 92, rep(0.25, 4)) + 2
r <- r2dtable(1, rsums, csums)[[1]]
Stochastically, with two rules:
All rows must have exactly one 1; and
All columns must have at least two 1s.
I control the first implicitly by construction; I test against the second.
nr <- 100; nc <- 4
set.seed(42)
lim <- 10000
while (lim > 0) {
  lim <- lim - 1
  M <- t(replicate(nr, sample(c(1, rep(0, nc - 1)))))
  if (all(colSums(M > 0) >= 2)) break
}
head(M)
# [,1] [,2] [,3] [,4]
# [1,] 1 0 0 0
# [2,] 0 0 0 1
# [3,] 0 0 0 1
# [4,] 0 1 0 0
# [5,] 0 0 0 1
# [6,] 0 1 0 0
colSums(M)
# [1] 25 30 21 24
lim
# [1] 9999
My use of lim is hardly needed in this example, but it is there as a mechanism to stop the loop from running indefinitely: if you change the dimensions and/or the rules, it might become highly unlikely or infeasible to meet all of them, so this keeps the execution time bounded. (10000 is completely arbitrary.)
My point in the comment is that it would be rather difficult to find a 100x4 matrix that matches rule 1 but does not match rule 2. In fact, since the odds of a 0 or a 1 in any one cell are 0.75 and 0.25 respectively, the probability that a given column (among 100 rows) contains fewer than two 1s is around 1.1e-11.
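That figure is easy to verify, since the number of 1s in a given column is binomial:
pbinom(1, size = 100, prob = 0.25)  # P(fewer than two 1s in one column), ~1.1e-11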
Here is a simple way to generate the 100 rows with the 1's randomly positioned, and then create the matrix by transposing the rows object. The matrix generation is wrapped in a while loop (thanks r2evans) to ensure each column contains at least two 1's.
minval <- 0
while (minval < 2) {
  rows <- replicate(100, sample(c(0, 0, 0, 1), 4))
  m <- t(rows)
  minval <- min(colSums(m))
}
m
[,1] [,2] [,3] [,4]
[1,] 0 0 0 1
[2,] 1 0 0 0
[3,] 0 0 0 1
[4,] 0 0 1 0
[5,] 1 0 0 0
[6,] 0 0 0 1
[7,] 1 0 0 0
[8,] 0 0 1 0
[9,] 0 1 0 0
[10,] 1 0 0 0
Code:
v <- tabulate(sample(1:4, 100-2*4, replace=TRUE), nbins=4) + 2
m <- diag(length(v))[sample(rep(seq_along(v), v)),]
Result check:
> dim(m)
[1] 100 4
> range(rowSums(m))
[1] 1 1
> range(colSums(m))
[1] 20 30
This works for any matrix size: just adjust the numbers 4 and 100. The first controls the number of columns and the second the number of rows:
v <- tabulate(sample(1:10, 200-2*10, replace=TRUE), nbins=10) + 2
m <- diag(length(v))[sample(rep(seq_along(v), v)),]
> dim(m)
[1] 200 10
> range(rowSums(m))
[1] 1 1
> range(colSums(m))
[1] 15 31
Explanation: this works backwards from the properties of the resulting matrix. If you have 100 rows and 4 columns, with each row containing exactly one 1, then the matrix will contain 100 1s in total, which means the column sums must also add up to 100. So we start with a vector of numbers (summing to 100) representing how many 1s each column will have. Say this vector is c(50,25,20,5). It tells us there will be 50 rows of the form (1,0,0,0), 25 rows of the form (0,1,0,0), and so on. The final step is to generate all these rows and shuffle them.
The trick is in this line:
v <- tabulate(sample(1:4, 100-2*4, replace=TRUE), nbins=4) + 2
It generates random column sums while making sure the minimum is at least 2: we generate values summing to 92 and then add 2 to each (which, with 4 columns, contributes the remaining 8).

Mapping edgelists to an adjacency matrix (and sum them together)

I want to map a number of (undirected) friendship networks (in edgelist format) onto an adjacency matrix covering all possible nodes (i.e., persons) using R. To begin with, I construct a smaller 4-person circle x <- c(1, 2, 3, 4), which consists of 6 unique edges (1-2, 1-3, 1-4, 2-3, 2-4, 3-4). I then collapse this set of 6 unique edges into a single list so that it can be converted into a symmetric matrix using igraph (see below).
library(igraph)  # needed for make_graph() and as_adjacency_matrix()
library(purrr)
x <- c(1, 2, 3, 4)
x_pairs <- combn(x, 2)
List <- split(x_pairs, rep(1:ncol(x_pairs), each = nrow(x_pairs)))
new_list <- purrr::flatten(List)
g <- make_graph(unlist(new_list), directed = FALSE)
m <- as_adjacency_matrix(g, sparse = FALSE)
m
[,1] [,2] [,3] [,4]
[1,] 0 1 1 1
[2,] 1 0 1 1
[3,] 1 1 0 1
[4,] 1 1 1 0
My dataset has a number of such smaller friendship circles, whose members are drawn from a total of 50 persons, and the memberships of these circles may or may not overlap. So my question is: how do I map a series of smaller matrices like m above onto a 50 by 50 adjacency matrix, in two different ways:
(1) without repeating: say, if 3 and 4 are friends in one circle and they are also linked in another circle, the edge between 3 and 4 should remain 1 (not add up to 2);
(2) cumulatively: if relationships in multiple circles indicate stronger friendship, it might be more informative to map the circles onto a weighted adjacency matrix, where each cell holds the cumulative count of the row and column ids' friendships across circles. In 3 and 4's situation, their edge value should be 1 + 1 = 2.
I've checked out this and other previous threads but can't seem to figure out how to do this; it would be really appreciated if someone could enlighten me.
There are various ways to achieve this. It looks like doing it in graph-theoretic terms in igraph is a little more tedious than dealing directly with adjacency matrices. Let
circles <- list(1:3, 2:4) # Friendship circles with identities 1, ..., n
n <- max(unlist(circles)) # Total number of people
nM <- matrix(0, n, n) # n x n matrix of zeroes
Then
adjs <- lapply(circles, function(cr) {
  nM[cr, cr] <- 1         # link everyone within the circle
  nM[cbind(cr, cr)] <- 0  # clear the diagonal (no self-loops)
  nM
})
is a list of n x n adjacency matrices for each friendship circle (mostly zeroes in each case).
Then the two types of aggregate matrices can be obtained by
(adj1 <- Reduce(`+`, adjs))
# [,1] [,2] [,3] [,4]
# [1,] 0 1 1 0
# [2,] 1 0 2 1
# [3,] 1 2 0 1
# [4,] 0 1 1 0
(adj2 <- 1 * (adj1 > 0))
# [,1] [,2] [,3] [,4]
# [1,] 0 1 1 0
# [2,] 1 0 1 1
# [3,] 1 1 0 1
# [4,] 0 1 1 0
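Note that n here is inferred from the circles actually present. If the full 50 x 50 matrix is wanted even when some of the 50 persons appear in no circle, simply fix the size instead (assuming persons are numbered 1 to 50):
n <- 50                # fixed total number of persons
nM <- matrix(0, n, n)  # the lapply/Reduce steps above then yield 50 x 50 matrices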

Assign cluster labels to data using a cluster assignment matrix

Hi, I am using R and have a cluster assignment matrix that comes out of my clustering function (I am applying a clustering algorithm to Gaussian mixture data). I want to create a vector of cluster labels from it. Here is a toy example of what I want to do.
#simulate data
dat <- c(rnorm(2, 0, 1), rnorm(2, 2, 3), rnorm(3, 0, 1), rnorm(3, 2, 3))
dat
[1] -0.5350681 1.0444655 2.9229136 8.2528266 -0.7561170 -1.0240702 -1.0012780
[8] -0.1322981 7.8525855 2.2278264
# Making up a cluster assignment matrix (actually this one comes out of my
# clustering function)
amat<-matrix(c(1,1,0,0,1,1,1,0,0,0,0,0,1,1,0,0,0,1,1,1), ncol=2, nrow=10)
amat
[,1] [,2]
[1,] 1 0
[2,] 1 0
[3,] 0 1
[4,] 0 1
[5,] 1 0
[6,] 1 0
[7,] 1 0
[8,] 0 1
[9,] 0 1
[10,] 0 1
I want to create a data frame or vector called (say) "clust" that contains cluster labels as follows, using the assignment matrix given above. Basically it uses the first and second columns of the assignment matrix, assigning label 1 to data coming from the normal distribution N(0,1) and label 2 to data coming from N(2,3). Any help is appreciated. Thanks in advance.
# clust should look like this (I have no idea how to create this using amat and dat)
clust
[1] 1 1 2 2 1 1 1 2 2 2
The assignment matrix is already binary, so we can simply add 1L to its second column:
clust <- amat[,2] + 1L
clust
[1] 1 1 2 2 1 1 1 2 2 2
(The suffix L coerces the value to integer)
Isn't this essentially
1 * column1 + 2 * column2 + 3 * column3, and so on?
That should be straightforward to write as a matrix multiplication with c(1, 2, 3, 4, ...), which has the sum operation built in.
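A minimal sketch of that idea, using the amat from the question (it generalizes to any number of clusters, provided each row contains exactly one 1):
clust <- as.vector(amat %*% seq_len(ncol(amat)))
clust
[1] 1 1 2 2 1 1 1 2 2 2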

R: Generate a list of canonical matrix

Here is my question.
I'm trying to create a list of all the symmetric canonical matrices of size n x n with a diagonal of 1,...,1 and k elements equal to 1 in each triangle.
For instance, if n=3 and k=2, I want to generate all symmetric 3x3 matrices with a 1,1,1 diagonal and 2 elements equal to 1 above and below the diagonal:
1 1 1
1 1 0
1 0 1
And
1 1 0
1 1 1
0 1 1
And
1 1 0
1 1 1
0 1 1
And
1 0 1
0 1 1
1 1 1
Can you help me?
Regards
I don't understand why you generate the second matrix twice. So the solution below just creates all unique symmetric matrices with the desired properties. If you want some of them twice, you will have to tweak the code a bit.
# load required packages
library(plyr)
# function to generate a list of "canonical matrices"
generate.canonical.matrix <- function(n, k){
  # initialize an n x n matrix of zeroes
  m <- matrix(0, nrow = n, ncol = n)
  # number of positions in the upper triangle
  K <- n * (n - 1) / 2
  if (K < k) stop("k cannot be larger than n*(n-1)/2")
  # indices of the upper triangle
  upper <- which(upper.tri(m))
  # for all combinations of k elements
  alply(combn(K, k), 2, function(index){ # CHANGE combn(K, k) TO GET NON-UNIQUE MATRICES
    # set the chosen upper-triangle positions to 1
    m[upper[index]] <- 1
    # combine upper, lower and diagonal matrices
    m + t(m) + diag(n)
  })
}
# function call
generate.canonical.matrix(3, 2)
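As a quick sanity check on the count: for n = 3 and k = 2 there are choose(3, 2) = 3 ways to place the two 1s in the upper triangle, so the call should return a list of 3 matrices:
length(generate.canonical.matrix(3, 2))
[1] 3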

How to efficiently retrieve top K-similar vectors by cosine similarity using R?

I'm working on a high-dimensional problem (~4k terms) and would like to retrieve the top k most similar vectors by cosine similarity; I can't afford a full pair-wise calculation.
My training set is a 6 million x 4k matrix, and I would like to make predictions for a 600k x 4k matrix.
What is the most efficient way to retrieve the k most similar items for each item in my 600k x 4k matrix?
Ideally, I would like to get a matrix which is 600k x 10 (i.e., the top 10 most similar items for each of the 600k items).
PS: I've researched SO and found that almost all "cosine similarity in R" questions refer to cosine_sim(vector1, vector2), whereas this question is about cosine_sim(matrix1, matrix2).
Update
The following code uses a naive method to find the cosine similarity between each row in the test set and every row in the training set.
set.seed(123)
train<-matrix(round(runif(30),0),nrow=6,ncol=5)
set.seed(987)
test<-matrix(round(runif(20),0),nrow=4,ncol=5)
train
     [,1] [,2] [,3] [,4] [,5]
[1,]    0    1    1    0    1
[2,]    1    1    1    1    1
[3,]    0    1    0    1    1
[4,]    1    0    1    1    1
[5,]    1    1    0    1    0
[6,]    0    0    0    1    0
test
     [,1] [,2] [,3] [,4] [,5]
[1,]    0    1    1    0    0
[2,]    1    0    1    0    1
[3,]    1    0    0    0    0
[4,]    1    0    0    1    1
coSim <- function(mat1, mat2, topK){
  require(plyr)
  # mat1: the training set
  # mat2: the test set; we find the cosine similarity between each row in the
  #       test set and every row in the training set
  # topK: user input; for each row in the test set we return the indices of
  #       the 'topK' most similar rows from the training set
  # set up an empty result matrix; its row count is the size of the cartesian
  # product of the rows of mat1 and mat2
  result <- matrix(NA_real_, nrow = nrow(mat1) * nrow(mat2), ncol = 3)
  k <- 1
  for (i in 1:nrow(mat2)) {
    for (j in 1:nrow(mat1)) {
      result[k, 1] <- i
      result[k, 2] <- j
      result[k, 3] <- crossprod(mat1[j, ], mat2[i, ]) /
        sqrt(crossprod(mat1[j, ]) * crossprod(mat2[i, ]))
      k <- k + 1
    }
  }
  # sort by cosine similarity within each test row and keep the topK;
  # converting to a data frame makes the per-group head() easy
  result <- as.data.frame(result)
  colnames(result) <- c("testRowId", "trainRowId", "CosineSimilarity")
  result <- ddply(result, "testRowId",
                  function(x) head(x[order(x$CosineSimilarity, decreasing = TRUE), ], topK))
  resultMat <- matrix(result$trainRowId, nrow = nrow(mat2), ncol = topK, byrow = TRUE)
  list(similarity = result, index = resultMat)
}
system.time(cosineSim<-coSim(train, test, topK=2)) #0.12 secs
cosineSim
$similarity
testRowId trainRowId CosineSimilarity
1 1 1 0.8164966
2 1 2 0.6324555
3 2 4 0.8660254
4 2 2 0.7745967
5 3 5 0.5773503
6 3 4 0.5000000
7 4 4 0.8660254
8 4 2 0.7745967
$index
[,1] [,2]
[1,] 1 2
[2,] 4 2
[3,] 5 4
[4,] 4 2
set.seed(123)
train<-matrix(round(runif(1000000),0),nrow=5000,ncol=200)
set.seed(987)
test<-matrix(round(runif(400000),0),nrow=2000,ncol=200)
system.time(cosineSim<-coSim(train, test, topK=50)) #380secs
When I run the same function with a 5000x200 training matrix and a 2000x200 test matrix, it took over 380 secs.
Ideally, I would like to see some ideas where I do not have to calculate the similarity between each and every row. If that is not possible, some pointers on how to vectorise the above code would be helpful.
No need to compute the similarity for every row. You can use this instead:
coSim2 <- function(mat1, mat2, topK){
  # similarity computation:
  xy <- tcrossprod(mat1, mat2)
  xx <- rowSums(mat1^2)
  yy <- rowSums(mat2^2)
  result <- xy / sqrt(outer(xx, yy))
  # top similar rows from train (per row in test):
  top <- apply(result, 2, order, decreasing = TRUE)[1:topK, ]
  result_df <- data.frame(testRowId = c(col(top)), trainRowId = c(top))
  result_df$CosineSimilarity <- result[as.matrix(result_df[, 2:1])]
  list(similarity = result_df, index = t(top))
}
Test data (I've reduced your train matrix)
set.seed(123)
train<-matrix(round(runif(100000),0),nrow=500,ncol=200)
set.seed(987)
test<-matrix(round(runif(400000),0),nrow=2000,ncol=200)
Result:
> system.time(cosineSim<-coSim(train, test, topK=50)) #380secs
user system elapsed
41.71 1.59 43.72
> system.time(cosineSim2<-coSim2(train, test, topK=50)) #380secs
user system elapsed
0.46 0.02 0.49
Using your full 5000 x 200 train matrix, coSim2 runs in 7.8 sec.
Also note:
> any(cosineSim$similarity != cosineSim2$similarity)
[1] FALSE
> any(cosineSim$index != cosineSim2$index)
[1] FALSE
You can't use identical because my function returns integers instead of doubles for row IDs.
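One caveat at the scale in the question: with a 6 million x 4k training set and 600k test rows, the full similarity matrix computed inside coSim2 cannot be held in memory, so the test set has to be processed in blocks. A minimal sketch building on coSim2 (the function name and default block size are arbitrary choices):
coSimChunked <- function(mat1, mat2, topK, block = 1000){
  # only an nrow(mat1) x block slice of the similarity matrix exists at a time
  blocks <- split(seq_len(nrow(mat2)), ceiling(seq_len(nrow(mat2)) / block))
  idx <- lapply(blocks, function(i) coSim2(mat1, mat2[i, , drop = FALSE], topK)$index)
  do.call(rbind, idx)  # nrow(mat2) x topK matrix of train row ids
}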
