Minimum overlap for symmetrical co-occurrence matrix - r

I have an asymmetrical occurrence matrix
OC <- matrix(c(2,0,2,
1,1,0,
0,3,3,
0,2,2,
0,0,1),
nrow=5, ncol=3, byrow = TRUE)
and I transform this matrix into a symmetrical co-occurrence matrix by multiplying its transpose with itself:
t(OC)%*%OC
However, I want to find the symmetrical co-occurrence matrix using the minimum overlap instead. I am trying to apply the proposition of Morris (2005) for a bibliometric study. The result should be:
[,1] [,2] [,3]
[1,] 3 1 2
[2,] 1 6 5
[3,] 2 5 8
I don't know how to compute this minimum-overlap matrix. Any help would be highly appreciated!
Reference:
Morris, S.A. (2005). Unified Mathematical Treatment of Complex Cascaded Bipartite Networks: The Case of Collections of Journal Papers. Unpublished PhD Thesis, Oklahoma State University. Retrieved from http://digital.library.okstate.edu/etd/umi-okstate-1334.pdf
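For what it's worth, a minimal sketch (my own reading, not from the original post) that reproduces the target matrix above, assuming the (i, j) entry of the minimum-overlap matrix is the sum over the rows of the element-wise minima of columns i and j of OC:
# Hypothetical sketch: minimum-overlap co-occurrence, assuming entry (i, j)
# is sum over rows k of min(OC[k, i], OC[k, j])
min_overlap <- function(OC) {
  p <- ncol(OC)
  M <- matrix(0, p, p)
  for (i in 1:p) {
    for (j in 1:p) {
      M[i, j] <- sum(pmin(OC[, i], OC[, j]))  # overlap of columns i and j
    }
  }
  M
}
min_overlap(OC)
#      [,1] [,2] [,3]
# [1,]    3    1    2
# [2,]    1    6    5
# [3,]    2    5    8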

Related

R principal component analysis interpreting the princomp() and eigen() functions for a non-square matrix

I'm trying to learn about and implement principal component analysis and study in particular how it relates to eigenvectors and eigenvalues and other things from linear algebra. Cross Validated has been helpful but I do have questions I haven't seen an answer for so far.
I've read online that eigenvalues and eigenvectors are for square matrices and that singular value decomposition is like an extension of that idea for non-square matrices. Here is what I find on Google when I search the question:
Note. Eigenvalues and eigenvectors are only for square matrices. Eigenvectors are by definition nonzero. Eigenvalues may be equal to zero.
But if I take, for example, a selection from the mtcars dataset, keeping only the first six columns but all of the observations, and then ask about the dimensions of this new dataset, I see that I have an m x n matrix that is a 32 x 6 matrix.
library(dplyr)  # needed for %>% (select is called with dplyr:: below)
mtcars_selection <- mtcars %>%
  dplyr::select(mpg:wt)
nrow(mtcars_selection)   # 32 rows (observations)
length(mtcars_selection) # 6 columns (a data frame's length is its number of columns)
Now turning to principal component analysis, when I run these lines of code:
prcomp_attempt = stats::prcomp(mtcars_selection, scale = FALSE)
summary(prcomp_attempt)
I get the following as part of the output.
PC1 PC2 PC3 PC4 PC5 PC6
Standard deviation 136.5265 38.11828 3.04062 0.67678 0.36761 0.3076
Proportion of Variance 0.9272 0.07228 0.00046 0.00002 0.00001 0.0000
Cumulative Proportion 0.9272 0.99951 0.99997 0.99999 1.00000 1.0000
When I change prcomp() to princomp() I get similar output.
princomp_attempt = stats::princomp(mtcars_selection, scale = FALSE)
summary(princomp_attempt)
Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6
Standard deviation 134.3763479 37.51795949 2.9927313123 0.66612093849 0.361823645756 0.302784248968
Proportion of Variance 0.9272258 0.07228002 0.0004599126 0.00002278484 0.000006722546 0.000004707674
Cumulative Proportion 0.9272258 0.99950587 0.9999657849 0.99998856978 0.999995292326 1.000000000000
From ?prcomp() I see that the computation is done using singular value decomposition.
The calculation is done by a singular value decomposition of the (centered and possibly scaled) data matrix, not by using eigen on the covariance matrix. This is generally the preferred method for numerical accuracy.
And from ?princomp() I see:
The calculation is done using eigen on the correlation or covariance matrix, as determined by cor. This is done for compatibility with the S-PLUS result.
But doesn't this all mean that one of the code chunks above should work and one of them should not work? In particular, how did princomp() work if the matrix that went into the princomp() function is a non-square matrix?
Now when I take a look at the eigen() function on the covariance matrix (which I assumed would be non-square), I get an output that looks like it only printed the first six rows.
eigen(cov(mtcars_selection))
I see in this particular output
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 0.038119604 -0.009201962 0.99282561 -0.057597550 0.094821246 0.021236444
[2,] -0.012035481 0.003373585 -0.06561936 -0.965667568 0.050583003 0.245897079
[3,] -0.899622021 -0.435427435 0.03153733 0.006753430 -0.006294596 -0.001825989
[4,] -0.434782990 0.900148911 0.02503332 0.006406853 0.004534789 -0.002722171
[5,] 0.002660336 0.003898813 0.03993024 0.187172744 -0.494914521 0.847590278
[6,] -0.006239859 -0.004861028 -0.08231475 0.170435844 0.862235306 0.469748461
In the eigen() and princomp() functions, is the data being conformed to a square matrix by slicing off the rows that exceed the number of columns, so that m = n?
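For concreteness, here is a small sketch (my own, using the mtcars_selection object from above) of the two routes the help pages describe; the different divisors (n - 1 for prcomp() versus n for princomp()) account for the small gap between the two sets of standard deviations:
X <- as.matrix(mtcars_selection)
n <- nrow(X)                                 # 32 observations, 6 variables
# prcomp(): singular value decomposition of the centred 32 x 6 data matrix
Xc <- scale(X, center = TRUE, scale = FALSE)
svd(Xc)$d / sqrt(n - 1)                      # reproduces prcomp(...)$sdev
# princomp(): eigen() on the covariance matrix, which is 6 x 6 and square
e <- eigen(cov(X))
sqrt(e$values)                               # same scale as prcomp (divisor n - 1)
sqrt(e$values * (n - 1) / n)                 # princomp scale (divisor n)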

Predict mclust cluster membership outside R

I've used mclust to find clusters in a dataset. Now I want to implement these findings in external non-R software (predict.Mclust is thus not an option, as has been suggested in previous similar questions) to classify new observations. I need to know how mclust classifies observations.
Since mclust outputs a center and a covariance matrix for each cluster, it felt reasonable to calculate the Mahalanobis distance for every observation and every cluster. Observations could then be classified to the Mahalanobis-nearest cluster. However, it does not seem to work fully.
Example code with simulated data (in this example I only use one dataset, d, and try to obtain the same classification as mclust does by the Mahalanobis approach outlined above):
library(MASS)   # for mvrnorm()
library(mclust) # for Mclust()
set.seed(123)
c1<-mvrnorm(100,mu=c(0,0),Sigma=matrix(c(2,0,0,2),ncol=2))
c2<-mvrnorm(200,mu=c(3,3),Sigma=matrix(c(3,0,0,3),ncol=2))
d<-rbind(c1,c2)
m<-Mclust(d)
int_class<-m$classification
clust1_cov<-m$parameters$variance$sigma[,,1]
clust1_center<-m$parameters$mean[,1]
clust2_cov<-m$parameters$variance$sigma[,,2]
clust2_center<-m$parameters$mean[,2]
mahal_clust1<-mahalanobis(d,cov=clust1_cov,center=clust1_center)
mahal_clust2<-mahalanobis(d,cov=clust2_cov,center=clust2_center)
mahal_clust_dist<-cbind(mahal_clust1,mahal_clust2)
mahal_classification<-apply(mahal_clust_dist,1,function(x){
match(min(x),x)
})
table(int_class,mahal_classification)
#List mahalanobis distance for miss-classified observations:
mahal_clust_dist[mahal_classification!=int_class,]
plot(m,what="classification")
#Indicate miss-classified observations:
points(d[mahal_classification!=int_class,],pch="X")
#Results:
> table(int_class,mahal_classification)
mahal_classification
int_class 1 2
1 124 0
2 5 171
> mahal_clust_dist[mahal_classification!=int_class,]
mahal_clust1 mahal_clust2
[1,] 1.340450 1.978224
[2,] 1.607045 1.717490
[3,] 3.545037 3.938316
[4,] 4.647557 5.081306
[5,] 1.570491 2.193004
Five observations are classified differently between the Mahalanobis approach and mclust. In the plots they are intermediate points between the two clusters. Could someone tell me why it does not work, and how I could mimic the internal classification of mclust and predict.Mclust?
After formulating the above question I did some additional research (thx LoBu) and found that the key was to calculate the posterior probability (pp) of an observation belonging to a certain cluster, and to classify according to the maximal pp. The following works:
denom<-rep(0,nrow(d))
pp_matrix<-matrix(rep(NA,nrow(d)*2),nrow=nrow(d))
# dmvnorm() comes from mclust (mvtnorm::dmvnorm would work as well)
# denominator: total mixture density, sum over k of pro_k * N(x; mu_k, Sigma_k)
for(i in 1:2){
denom<-denom+m$parameters$pro[i]*dmvnorm(d,m$parameters$mean[,i],m$parameters$variance$sigma[,,i])
}
# posterior probability that each observation belongs to cluster i
for(i in 1:2){
pp_matrix[,i]<-m$parameters$pro[i]*dmvnorm(d,m$parameters$mean[,i],m$parameters$variance$sigma[,,i]) / denom
}
# classify to the cluster with the largest posterior probability
pp_class<-apply(pp_matrix,1,function(x){
match(max(x),x)
})
table(pp_class,m$classification)
#Result:
pp_class 1 2
1 124 0
2 0 176
But if someone could explain in layman's terms the difference between the Mahalanobis and pp approaches, I would be grateful. What do the "mixing probabilities" (m$parameters$pro) signify?
In addition to the Mahalanobis distance, you also need to take the cluster weights into account.
These weights set the relative importance of the clusters where they overlap.
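For illustration, a minimal sketch (assuming the m and d objects from the question) of how those weights, together with the covariance determinants, adjust the plain Mahalanobis rule:
# score_k = log(pro_k) - 0.5*log(det(Sigma_k)) - 0.5*D_k^2, i.e. the log of the
# unnormalised posterior; the constant -p/2*log(2*pi) is the same for every k
score <- sapply(1:2, function(k) {
  S  <- m$parameters$variance$sigma[, , k]
  D2 <- mahalanobis(d, center = m$parameters$mean[, k], cov = S)
  log(m$parameters$pro[k]) - 0.5 * log(det(S)) - 0.5 * D2
})
weighted_class <- max.col(score)         # cluster with the largest score per row
table(weighted_class, m$classification)  # should match the pp_class result above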

Number of pairs for each lag in a variogram - r

I am using the geoR package for spatial interpolation of rainfall. I should say that I am quite new to geostatistics. Thanks to some video tutorials on YouTube, I understood (well, I think so) the theory behind the variogram. As per my understanding, the number of pairs should decrease with increasing lag distance. For example, if we consider a 100 m long stretch (say a 100 m long cross section of a river bed), the number of pairs for a 5 m lag is 20 and the number of pairs for a 10 m lag is 10, and so on. But I am kind of confused by the output from the variog function in the geoR package. An example is given below.
mydata
X Y a
[1,] 415720 432795 2.551415
[2,] 415513 432834 2.553177
[3,] 415325 432740 2.824652
[4,] 415356 432847 2.751844
[5,] 415374 432858 2.194091
[6,] 415426 432774 2.598897
[7,] 415395 432811 2.699066
[8,] 415626 432762 2.916368
this is my dataset, where a is my variable (rainfall intensity) and X, Y are the coordinates of the points. The variogram calculation is shown below
library(geoR)
geodata <- as.geodata(mydata)
variogram <- variog(geodata, coords = geodata$coords, data = geodata$data)
variogram[1:3]
$u
[1] 46.01662 107.37212 138.04987 199.40537 291.43861 352.79411
$v
[1] 0.044636453 0.025991469 0.109742986 0.029081575 0.006289056 0.041963076
$n
[1] 3 8 3 3 3 2
where
u: a vector with distances.
v: a vector with estimated variogram values at distances given in u.
n: number of pairs in each bin
According to this, the number of pairs (n) shows no clear pattern, whereas the corresponding lag distance (u) is increasing. I find it hard to understand this. Can anyone explain what is happening? Also, any suggestions/advice to improve the variogram calculation for this application (spatial interpolation of rainfall intensity) would be highly appreciated, as I am new to geostatistics. Thanks in advance.
On a linear transect of 100 m with a regular 5 m spacing between observations, if you had 20 pairs at the 5 m lag, you would have 19 pairs at the 10 m lag. This idea does not hold for your data, because the points are irregularly distributed, and they are distributed over two dimensions. For irregularly distributed data you often have very few point pairs at the very short distances. The advice for obtaining a better-looking variogram is to work with a larger data set: geostatistics starts getting interesting with 30 observations, and fun with over 100 observations.
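As a rough illustration (my own, assuming mydata is the matrix printed above; this is not an exact reproduction of variog's default binning), you can count the point pairs per distance class directly from the coordinates:
coords <- mydata[, c("X", "Y")]
pairwise <- as.vector(dist(coords))  # 8 points give choose(8, 2) = 28 pairwise distances
table(cut(pairwise, breaks = 6))     # how many pairs fall in each of 6 equal-width classes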

normalizing matrices in R

How do I normalize/scale matrices in R by column? For example, when I compute the eigenvectors of a matrix, R returns:
> eigen(matrix(c(2,-2,-2,5),2,2))$vectors
[,1] [,2]
[1,] -0.4472136 -0.8944272
[2,] 0.8944272 -0.4472136
# should be normalized to
[,1] [,2]
[1,] -1 -2
[2,] 2 -1
The function "scale" subtracts the means and divided by standard deviation by column which does not help in this case. How do I achieve this?
This produces the matrix you say you want:
> a <- eigen(matrix(c(2,-2,-2,5),2,2))$vectors
> a / min(abs(a))
[,1] [,2]
[1,] -1 -2
[2,] 2 -1
But I'm not sure I understand exactly what you want, so this may not do the right thing in general.
Wolfram Alpha gives the following result:
http://www.wolframalpha.com/input/?i=eigenvalues{{2,-2},{-2,5}}
Input: {{2,-2},{-2,5}}
Eigenvalues: 6 and 1
Eigenvectors: (-1, 2) and (2, 1) (up to sign and scaling)
I'm not sure what you're talking about with means and standard deviations. A good iterative method like QR should get you the eigenvalues and eigenvectors you need. Check out Jacobi or Householder.
You normalize any vector by dividing every component by the square root of the sum of squares of its components. A unit vector will have magnitude equal to one.
In your case this has already been done: the vectors returned by R are normalized. If you normalize the two Wolfram eigenvectors, you'll see that both have a magnitude equal to the square root of 5. Divide each column vector by this value and you'll get the ones given to you by R. Both are correct.
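A small check along these lines (using the eigen() call from the question):
v <- eigen(matrix(c(2, -2, -2, 5), 2, 2))$vectors
sqrt(colSums(v^2))                    # both columns already have length 1
w <- cbind(c(-1, 2), c(2, 1))         # integer eigenvectors, each of length sqrt(5)
sweep(w, 2, sqrt(colSums(w^2)), "/")  # equals v up to the sign of each column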

(all the) directions perpendicular to hyperplane through p data points

I have a simple question:
given p points (non-collinear) in R^p, I find the hyperplane passing through these points (to help clarify, I type everything in R):
p<-2
x<-matrix(rnorm(p^2),p,p)
b<-solve(crossprod(cbind(1,x[,-2])))%*%crossprod(cbind(1,x[,-2]),x[,2])
then, given a (p+1)-th point not collinear with the first p points, I find the direction perpendicular to b:
x2<-matrix(rnorm(p),p,1)
b2<-solve(c(-b[-1],1)%*%t(c(-b[-1],1))+x2%*%t(x2))%*%x2
That is, b2 defines a hyperplane in R^p perpendicular to b and passing through x2.
Now, my questions are:
The formula comes from my interpretation of this Wikipedia entry ("solve(A)" is the R command for A^-1). Why does this not work for p>2? What am I doing wrong?
PS: I have seen this post (on Stack Overflow; edit: sorry, I cannot post more than one link) but somehow it doesn't help me.
Thanks in advance,
I have a problem implementing/understanding Victor Liu's solution when p>2:
shouldn't the dot product between the QR vectors of the swept matrix and the direction of the hyperplane be 0? (i.e., if the QR vectors are perpendicular to the hyperplane)
i.e., when p=2 this
c(-b[2:p],1)%*%c(a1)
gives 0. When p>2 it does not.
Here is my attempt to implement Victor Liu's solution.
a) given p linearly independent observations in R^p:
p<-2;x<-matrix(rnorm(p^2),p,p);x
[,1] [,2]
[1,] -0.4634923 -0.2978151
[2,] 1.0284040 -0.3165424
b) stack them in a matrix and subtract the first row:
a0<-sweep(x,2,x[1,],FUN="-");a0
[,1] [,2]
[1,] 0.000000 0.00000000
[2,] 1.491896 -0.01872726
c) perform a QR decomposition of the matrix a0. The vector in the nullspace is the direction I'm looking for:
qr(a0)
[,1] [,2]
[1,] -1.491896 0.01872726
[2,] 1.000000 0.00000000
Indeed, this direction is the same as the one given by applying the formula from the Wikipedia entry (using x2=(0.4965321,0.6373157)):
[,1]
[1,] 2.04694853
[2,] -0.02569464
...with the advantage that it works in higher dimensions.
I have one last question: what is the meaning of the other p-1 QR vectors (i.e. (1,0) here) when p>2?
-thanks in advance,
A (p-1)-dimensional hyperplane in R^p is defined by a normal vector and a point that the plane passes through:
n.(x-x0) = 0
where n is the normal vector of length p, x0 is a point through which the hyperplane passes, . is a dot product, and the equation must be satisfied for any point x on the plane. We can also write this as
n.x = c
where c = n.x0 is just a number. This is a more compact representation of the hyperplane, which is parameterized by (n, c). To find your hyperplane, suppose your points are x1, ..., xp.
Form a matrix A with p-1 rows and p columns as follows. The rows of A are xi-x1, laid out as row vectors, for all i>1 (there are only p-1 of them). If your p points are not "collinear" as you say (they need to be affinely independent), then matrix A will have rank p-1, and a nullspace of dimension 1. The one vector in the nullspace is the normal vector of the hyperplane. Once you find it (call it n), then c = n.x1. In order to find the nullspace of a matrix, you can use a QR decomposition (see here for details).
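A minimal sketch (mine, not part of the original answer) of this construction in R for a general p; the nullspace vector is taken as the last column of the complete Q factor of t(A):
set.seed(1)
p <- 4
x <- matrix(rnorm(p * p), p, p)                    # p points in R^p, one per row
A <- sweep(x[-1, , drop = FALSE], 2, x[1, ], "-")  # rows are x_i - x_1, so A is (p-1) x p
Q <- qr.Q(qr(t(A)), complete = TRUE)               # full p x p orthogonal factor of t(A)
n <- Q[, p]                                        # spans the nullspace of A
round(A %*% n, 12)                                 # ~ 0: n is normal to the hyperplane
c_const <- sum(n * x[1, ])                         # the constant term c = n . x1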
