Numerical Programming - Matlab and R give different values

I'm trying to compute PCA scores, and part of the algorithm says: subtract the mean of the matrix and divide by the standard deviation.
I have the following 2x2 matrix: A = [1 3; 2 4]. Let's say in Matlab I do the following:
mean(A) -> this gives me back a vector of 2 values (column-wise), 1.5 and 3.5, which in this instance seems correct to me.
In R, however, mean(A) returns just one value, and the same goes for the standard deviation.
So my question is, which is right? For the purposes of this function (in the algorithm):
function(x) {(x - mean(x))/sd(x)} (http://strata.uga.edu/software/pdf/pcaTutorial.pdf)
Should I be subtracting the mean based on two values, as in Matlab, or one value, as in R?
Thanks

The R command that will do this in one swoop for matrices or dataframes is scale()
> A = matrix(c(1, 3, 2, 4), 2)
> scale(A)
[,1] [,2]
[1,] -0.7071068 -0.7071068
[2,] 0.7071068 0.7071068
attr(,"scaled:center")
[1] 2 3
attr(,"scaled:scale")
[1] 1.414214 1.414214
Scaling is done by column. When you used mean() in R you got the mean of all four numbers rather than the column means. That is not what you want if you are doing PCA calculations.
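If you want the Matlab-style column-wise statistics explicitly, colMeans() and apply() reproduce them; a small sketch (not part of the original answer):
A <- matrix(c(1, 3, 2, 4), 2)
colMeans(A)                               # column means, like Matlab's mean(A)
apply(A, 2, sd)                           # column standard deviations, like Matlab's std(A)
sweep(sweep(A, 2, colMeans(A)), 2, apply(A, 2, sd), "/")   # same numbers as scale(A)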

Related

Set time series vectors lengths equal (resize/rescale them with use of linear interpolation)

I have a huge dataset of time series, which are represented as vectors (no time labels available). Due to some errors in the measuring process their lengths (as reported by length()) vary slightly (~10%), but each of them definitely describes a time interval of exactly two minutes. I would like to rescale/resize them and then calculate some statistics between them (so I need time series of equal lengths).
I need a very fast approach, and linear interpolation is a perfectly good choice for me because speed is the priority.
Simple example, rescaling a vector of length 5 into a vector of length 10:
input <- 0:4 # should be rescaled/resized into :
output <- c(0, .444, .888, 1.333, 1.777, 2.222, 2.666, 3.111, 3.555, 4)
I think that the fastest approach is to create a weight matrix w with dimensions length(output) x length(input), so that w %*% input gives output (as a matrix object). If that is the fastest way, how do I create such matrices w efficiently?
I think this could be enough:
resize <- function (input, len) approx(seq_along(input), input, n = len)$y
For example:
> resize(0:4, 10)
[1] 0.0000000 0.4444444 0.8888889 1.3333333 1.7777778 2.2222222 2.6666667 3.1111111 3.5555556 4.0000000
> resize( c(0, 3, 2, 1), 10)
[1] 0.000000 1.000000 2.000000 3.000000 2.666667 2.333333 2.000000 1.666667 1.333333 1.000000
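As for the weight-matrix idea from the question, here is a sketch of one way to build such a matrix for plain linear interpolation (the helper name make_weights is made up for this illustration):
make_weights <- function(n_in, n_out) {
  pos  <- seq(1, n_in, length.out = n_out)   # target positions on the input index scale
  lo   <- pmin(floor(pos), n_in - 1)         # index of the left neighbour
  frac <- pos - lo                           # fractional distance from the left neighbour
  w <- matrix(0, n_out, n_in)
  w[cbind(seq_len(n_out), lo)]     <- 1 - frac
  w[cbind(seq_len(n_out), lo + 1)] <- frac
  w
}
w <- make_weights(5, 10)
drop(w %*% 0:4)   # same values as resize(0:4, 10)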

Calculation of mutual information in R

I am having problems interpreting the results of the mi.plugin() (or mi.empirical()) function from the entropy package. As far as I understand, an MI of 0 tells you that the two variables you are comparing are completely independent, and as MI increases, the association between the two variables is increasingly non-random.
Why, then, do I get a value of 0 when running the following in R (using the {entropy} package):
mi.plugin( rbind( c(1, 2, 3), c(1, 2, 3) ) )
when I'm comparing two vectors that are exactly the same?
I assume my confusion is based on a theoretical misunderstanding on my part; can someone tell me where I've gone wrong?
Thanks in advance.
Use mutinformation(x,y) from package infotheo.
> mutinformation(c(1, 2, 3), c(1, 2, 3) )
[1] 1.098612
> mutinformation(seq(1:5),seq(1:5))
[1] 1.609438
and the normalized mutual information will be 1.
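For reference, one common normalization divides by the entropies, which infotheo also provides; a quick illustration (not part of the original answer):
library(infotheo)
x <- c(1, 2, 3)
mutinformation(x, x)                                   # 1.098612, i.e. log(3)
mutinformation(x, x) / sqrt(entropy(x) * entropy(x))   # 1: normalized MI of x with itself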
The mi.plugin function works on the joint frequency matrix of the two random variables. The joint frequency matrix counts how many times X and Y take the specific outcomes x and y.
In your example, X has 3 possible outcomes (x=1, x=2, x=3) and Y also has 3 possible outcomes (y=1, y=2, y=3).
Let's go through your example and calculate the joint frequency matrix:
> X=c(1, 2, 3)
> Y=c(1, 2, 3)
> freqs=matrix(sapply(seq(max(X)*max(Y)), function(x) length(which(((X-1)*max(Y)+Y)==x))),ncol=max(X))
> freqs
[,1] [,2] [,3]
[1,] 1 0 0
[2,] 0 1 0
[3,] 0 0 1
This matrix shows the number of occurrences of X=x and Y=y. For example there was one observation for which X=1 and Y=1. There were 0 observations for which X=2 and Y=1.
You can now use the mi.plugin function:
> mi.plugin(freqs)
[1] 1.098612
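The value itself is no accident: the mutual information of a variable with itself equals its entropy, and for three equally likely outcomes that is log(3):
log(3)                                 # 1.098612
-sum(rep(1/3, 3) * log(rep(1/3, 3)))   # the same value, computed as -sum(p * log(p))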

R eigenvalues/eigenvectors

I have this correlation matrix
A
        [,1]    [,2]    [,3]    [,4]    [,5]
[1,] 1.00000 0.00975 0.97245 0.43887 0.02241
[2,] 0.00975 1.00000 0.15428 0.69141 0.86307
[3,] 0.97245 0.15428 1.00000 0.51472 0.12193
[4,] 0.43887 0.69141 0.51472 1.00000 0.77765
[5,] 0.02241 0.86307 0.12193 0.77765 1.00000
And I need to get the eigenvalues, eigenvectors and loadings in R.
When I use the princomp(A, cor=TRUE) function I get the variances (eigenvalues),
but when I use the eigen(A) function I get the eigenvalues and eigenvectors, and the eigenvalues in this case are different from the ones princomp gives.
Which function is the right one to get the eigenvalues?
I believe you are referring to a PCA analysis when you talk of eigenvalues, eigenvectors and loadings. princomp is essentially doing the following (when cor=TRUE):
### Step 1
# correlation matrix
Acs <- scale(A, center = TRUE, scale = TRUE)
COR <- (t(Acs) %*% Acs) / (nrow(Acs) - 1)
COR ; cor(Acs)  # equal
### Step 2
# decompose the matrix using eigen() to derive the PC loadings
E <- eigen(COR)
E$vectors  # loadings
E$values   # eigenvalues
### Step 3
# project the data onto the loadings to derive the new coordinates (principal components)
B <- Acs %*% E$vectors
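As a quick sanity check of those steps, here is a comparison against princomp() itself, using the built-in USArrests data purely as a stand-in data matrix (not part of the original answer):
dat <- as.matrix(USArrests)
pc  <- princomp(dat, cor = TRUE)
E2  <- eigen(cor(dat))
round(unname(pc$sdev)^2, 6)   # variances reported by princomp ...
round(E2$values, 6)           # ... equal the eigenvalues of the correlation matrix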
eigen(M) gives you the correct eigenvalues and eigenvectors of M.
princomp() is to be handed the data matrix - you are mistakenly feeding it the correlation matrix!
princomp(A) will treat A as the data, compute a correlation matrix from it, and then derive that matrix's eigenvectors and eigenvalues. So the eigenvalues of A itself (in case A held the data, as princomp supposes) are not just irrelevant; they are of course different from what princomp() comes up with at the end.
For an illustration of performing a PCA in R see here: http://www.joyofdata.de/blog/illustration-of-principal-component-analysis-pca/
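Putting that together for the matrix in the question: since A already is a correlation matrix, its PCA eigenvalues (variances) and loadings come straight from eigen(A), whereas princomp(A, cor = TRUE) would wrongly treat the five rows of A as five observations. A sketch for illustration:
A <- matrix(c(1.00000, 0.00975, 0.97245, 0.43887, 0.02241,
              0.00975, 1.00000, 0.15428, 0.69141, 0.86307,
              0.97245, 0.15428, 1.00000, 0.51472, 0.12193,
              0.43887, 0.69141, 0.51472, 1.00000, 0.77765,
              0.02241, 0.86307, 0.12193, 0.77765, 1.00000), nrow = 5)
eigen(A)$values    # the eigenvalues (PCA variances) of the correlation matrix
eigen(A)$vectors   # the corresponding loadings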

using eigenvalues to test for singularity: identifying collinear columns

I am trying to check if my matrix is singular using the eigenvalues approach (i.e. if one of the eigenvalues is zero then the matrix is singular). Here is the code:
z <- matrix(c(-3,2,1,4,-9,6,3,12,5,5,9,4),nrow=4,ncol=3)
eigen(t(z)%*%z)$values
I know the eigenvalues are sorted in descending order. Can someone please let me know if there is a way to find out which eigenvalue is associated with which column of the matrix? I need to remove the collinear columns.
It might be obvious in the example above, but it is just a toy example intended to save you the time of creating a new matrix.
Example:
z <- matrix(c(-3,2,1,4,-9,6,3,12,5,5,9,4),nrow=4,ncol=3)
m <- crossprod(z) ## slightly more efficient than t(z) %*% z
The eigenvalues below show that the third one is (numerically) zero, so the third eigenvector corresponds to the collinear combination:
ee <- eigen(m)
(evals <- zapsmall(ee$values))
## [1] 322.7585 124.2415 0.0000
Now examine the corresponding eigenvectors, which are listed as columns corresponding to their respective eigenvalues:
(evecs <- zapsmall(ee$vectors))
##            [,1]       [,2]       [,3]
## [1,] -0.2975496 -0.1070713  0.9486833
## [2,] -0.8926487 -0.3212138 -0.3162278
## [3,] -0.3385891  0.9409343  0.0000000
The third eigenvalue is zero; the first two elements of the third eigenvector (evecs[,3]) are non-zero, which tells you that columns 1 and 2 are collinear.
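To make that concrete (a quick check, not part of the original answer): the zero-eigenvalue eigenvector is proportional to c(3, -1, 0), i.e. three times column 1 minus column 2 is the combination that vanishes:
zapsmall(z %*% ee$vectors[, 3])   # all zeros: this combination of columns vanishes
3 * z[, 1] - 1 * z[, 2]           # also all zeros, so column 2 is 3 times column 1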
Here's a way to automate this test:
testcols <- function(ee) {
  ## split the eigenvector matrix into a list, by columns
  evecs <- split(zapsmall(ee$vectors), col(ee$vectors))
  ## for eigenvalues that are zero, report the non-zero eigenvector components
  ## (these index the columns involved in the collinear combination)
  mapply(function(val, vec) {
    if (val != 0) NULL else which(vec != 0)
  }, zapsmall(ee$values), evecs)
}
testcols(ee)
## [[1]]
## NULL
## [[2]]
## NULL
## [[3]]
## [1] 1 2
You can use tmp <- svd(z) to compute a singular value decomposition. The singular values are stored in the vector tmp$d (diag(tmp$d) displays them as a diagonal matrix); they are the square roots of the eigenvalues of t(z) %*% z, so a (numerically) zero singular value again signals rank deficiency. This also works for a non-square matrix.
> diag(tmp$d)
[,1] [,2] [,3]
[1,] 17.96548 0.00000 0.000000e+00
[2,] 0.00000 11.14637 0.000000e+00
[3,] 0.00000 0.00000 8.787239e-16
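The right-singular vectors of z are the eigenvectors of t(z) %*% z, so the column of tmp$v that matches the (numerically) zero singular value identifies the collinear columns, just like the eigen() approach. A sketch:
tmp <- svd(z)
zapsmall(tmp$d)                              # one singular value is numerically zero
v0  <- tmp$v[, which(zapsmall(tmp$d) == 0)]
zapsmall(v0)                                 # non-zero entries point to columns 1 and 2 (up to sign)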

Determining if a matrix is diagonalizable in the R Programming Language

I have a matrix and I would like to know if it is diagonalizable. How do I do this in the R programming language?
If you have a given matrix, m, then one way is to take the eigenvectors times the diagonal matrix of the eigenvalues times the inverse of the eigenvector matrix. That should give us back the original matrix. In R that looks like:
m <- matrix( c(1:16), nrow = 4)
p <- eigen(m)$vectors
d <- diag(eigen(m)$values)
p %*% d %*% solve(p)
m
So in that example p %*% d %*% solve(p) should be the same as m.
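A quick numerical check of that identity (just a sketch; all.equal() compares up to floating-point tolerance, and Re() is only there in case eigen() returns complex eigenvectors for a non-symmetric matrix):
# TRUE when the reconstruction recovers m; for a defective (non-diagonalizable)
# matrix, solve(p) will fail or the reconstruction will be badly off.
all.equal(Re(p %*% d %*% solve(p)), m)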
You can implement the full algorithm to check whether the matrix reduces to a Jordan form or a diagonal one (see e.g., this document). Or you can take the quick and dirty way: for an n-dimensional square matrix, use eigen(M)$values and check that they are n distinct values. For random matrices this always suffices: degeneracy has probability 0.
P.S.: based on a simple observation by JD Long below, I recalled that a necessary and sufficient condition for diagonalizability is that the eigenvectors span the original space. To check this, just verify that the eigenvector matrix has full rank (no zero eigenvalue). So here is the code:
diagflag <- function(m, tol = 1e-10) {
  x <- eigen(m)$vectors
  y <- min(abs(eigen(x)$values))
  return(y > tol)
}
# nondiagonalizable matrix
m1 = matrix(c(1,1,0,1),nrow=2)
# diagonalizable matrix
m2 = matrix(c(-1,1,0,1),nrow=2)
> m1
[,1] [,2]
[1,] 1 0
[2,] 1 1
> diagflag(m1)
[1] FALSE
> m2
[,1] [,2]
[1,] -1 0
[2,] 1 1
> diagflag(m2)
[1] TRUE
You might want to check out this page for some basic discussion and code. You'll need to search for "diagonalized" which is where the relevant portion begins.
All matrices that are symmetric across the diagonal are diagonalizable by orthogonal matrices. In fact, if you want diagonalizability specifically by orthogonal conjugation, i.e. D = PAP' where P' stands for the transpose of P, then symmetry across the diagonal, i.e. A_{ij} = A_{ji}, is exactly equivalent to that kind of diagonalizability.
If the matrix is not symmetric, then diagonalizability means not D = PAP' but merely D = PAP^{-1}, and we do not necessarily have P' = P^{-1}, which is the condition of orthogonality.
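For a symmetric matrix, eigen() does return an orthogonal P, so the transpose plays the role of the inverse; a small illustration:
S <- matrix(c(2, 1, 1, 3), nrow = 2)   # a symmetric example matrix
P <- eigen(S)$vectors
zapsmall(t(P) %*% P)                   # the identity: P is orthogonal, so t(P) equals solve(P)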
You need to do something more substantial, and there is probably a better way, but you could just compute the eigenvectors and check that their rank equals the full dimension.
See this discussion for a more detailed explanation.
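Here is a minimal sketch of that rank check (the function name is_diagonalizable is made up for this illustration; it counts the numerically non-negligible singular values of the eigenvector matrix, which also copes with complex eigenvectors):
is_diagonalizable <- function(m, tol = 1e-10) {
  v <- eigen(m)$vectors            # may be complex for a real non-symmetric matrix
  d <- svd(v)$d                    # singular values of the eigenvector matrix
  sum(d > tol * d[1]) == nrow(m)   # numerical rank equal to the dimension?
}
is_diagonalizable(matrix(c(1, 1, 0, 1), nrow = 2))    # FALSE: defective (Jordan block)
is_diagonalizable(matrix(c(-1, 1, 0, 1), nrow = 2))   # TRUE: distinct eigenvalues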
