Principal component analysis using R: automatic and manual results do not match

Two different methods of principal component analysis were used to analyze the following data (ch082.dat) with the R code in Box 1 below: https://drive.google.com/file/d/1xykl6ln-bUnXIs-jIA3n5S3XgHjQbkWB/view?usp=sharing
The first method uses the rotation matrix (see 'ans_mat' under '#rotated data' in the Box 1 code), and
the second method uses the 'prcomp' function (see 'rpca' under '#rotated data' in the Box 1 code).
However, there is a subtle discrepancy between the result of the rotation-matrix method and the result of the 'prcomp' method, and I want them to match.
My Question
What should I do so that the result of the rotation-matrix-based method matches the result of the 'prcomp' function?
As far as I can tell from trying various other data sets, the discrepancies seem to be limited to scale shifts and mirror reflections.
The results of the rotation-matrix-based method are shown in the left panel.
The results of the prcomp-based method are shown in the right panel.
Mirror inversion can be seen in the "ch082.dat" data (see Fig. 1).
It seems that, for some j, the sign of the jth eigenvector of the correlation matrix and the sign of the jth column of the prcomp output may be reversed. If some eigenvalues are (nearly) equal, the difference may be more complex than a mirror inversion.
Fig.1
There is a scale shift for the Box 2 data (see Fig. 2), despite the data having been centered and normalized.
Fig.2
Box.1
#dataload
##Use the 'setwd' function to specify the directory containing 'ch082.dat'.
##For example, if you put this file directly under the C drive of your Windows PC, you can run the following command.
setwd("C:/") #Depending on where you put the file, you may need to change the path.
getwd()
w1<-read.table("ch082.dat",header = TRUE,row.names = 1,fileEncoding = "UTF-8")
w1
#Function for standardizing data
#Thanks to https://qiita.com/ohisama2/items/5922fac0c8a6c21fcbf8
standalize <- function(data) {
  for (i in length(data[1, ])) {
    x <- as.matrix(data[, i])
    y <- (x - mean(x) / sd(x))
    data[, i] <- y
  }
  return(data)
}
#Method using rotation matrix
z_ <- standalize(w1)
B_mat <- cor(z_) #Compute correlation matrix
eigen_m <- eigen(B_mat)
sample_mat <- as.matrix(z_)
ans_mat <- sample_mat
for (j in 1:length(sample_mat[1, ])) {
  ans_mat[, j] <- sample_mat %*% eigen_m$vectors[, j]
}
#Method using "rpca" function
rpca <- prcomp(w1,center=TRUE, scale=TRUE)
#eigen vectors
eigen_m$vectors
rpca
#rotated data
ans_mat
rpca$x
#Graph Plots
par(mfrow=c(1,2))
plot(
ans_mat[,1],
ans_mat[,2],
main="Rotation using eigenvectors"
)
plot(rpca$x[,1], rpca$x[,2],
main="Principal component score")
par(mfrow=c(1,1))
#summary
summary(rpca)$importance
Box.2
sample_data <- data.frame(
X = c(2,4, 6, 5,7, 8,10),
Y = c(6,8,10,11,9,12,14)
)
X = c(2,4, 6, 5,7, 8,10)
Y = c(6,8,10,11,9,12,14)
plot(Y ~ X)
w1=sample_data
Reference
https://logics-of-blue.com/principal-components-analysis/
(Written in Japanese)

The two sets of results agree. First, we can simplify your code a bit; you don't need your standardizing function or the for loop:
z_ <- scale(w1)
B_mat <- cor(z_)
eigen_m <- eigen(B_mat)
ans_mat <- z_ %*% eigen_m$vectors
Now the prcomp version:
z_pca <- prcomp(z_)
z_pca$sdev^2 # Equals eigen_m$values
z_pca$rotation # Equals eigen_m$vectors
z_pca$x # Equals ans_mat
Your original code mislabeled ans_mat columns. They are actually the principal component scores. You can fix that with
colnames(ans_mat) <- colnames(z_pca$x)
The PC loadings (and therefore the scores) are only defined up to reflection. In other words, multiplying all of the loadings or scores in one component by -1 flips that component but does not change the relationships among the points. Multiply z_pca$x[, 1] by -1 and the plots will match:
z_pca$x[, 1] <- z_pca$x[, 1] * -1
dev.new(width=10, height=6)
par(mfrow=c(1,2))
plot(ans_mat[,1], ans_mat[,2], main="Rotation using eigenvectors")
plot(z_pca$x[,1], z_pca$x[,2], main="Principal component score")
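If you prefer not to flip columns by hand, a small sketch (my addition, reusing the objects defined above) aligns each eigenvector's sign with the corresponding prcomp column before computing the scores:
# Recompute the prcomp fit, since the manual flip above changed z_pca$x in place.
z_pca2 <- prcomp(z_)
# Flip any eigenvector column whose sign disagrees with prcomp's rotation.
signs <- sign(colSums(eigen_m$vectors * z_pca2$rotation))
aligned_scores <- z_ %*% sweep(eigen_m$vectors, 2, signs, "*")
all.equal(unname(aligned_scores), unname(z_pca2$x)) # TRUE up to rounding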

Related

Using a loop in an ODE to graphically compare different parameters R

I'm using the deSolve package to plot a couple of differential equations (background, if you are interested: http://www.maa.org/press/periodicals/loci/joma/the-sir-model-for-spread-of-disease-the-differential-equation-model).
My eventual goal is to create an iterative function or process (for loop) to plot how changes in certain parameters (beta and gamma) affect the solution. The preferred output would be a list that contains the ode solution for each specified value of beta in the loop. I'm running into issues integrating a loop into the setup that the deSolve package requires for the ode function.
In the code below, I am trying to show how a range of values for the parameter beta (1 to 2 in increments of 0.1) affects the solution of the differential equations.
for(k in seq(1,2,by=0.1)){ #range of values for beta
init <- c(S=1-1e-6, I=1e-6, R=0) #initial conditions for odes
time <- seq(0,80,by=1) #time period
parameters <- c(beta=k, gamma=0.15) #parameters in ode
SIR <- function(time,state,parameters){ #function containing equaations
with(as.list(c(state,parameters)),{
dS <- -beta*S*I
dI <- beta*S*I-gamma*I
dR <- gamma*I
return(list(c(dS,dI,dR)))
})
}
ode(y=init,times=time,func=SIR()[beta],parms=parameters[k])}
}
The first error I'm getting states that the argument parameters in the SIR function is missing:
Error in as.list(c(init, parameters)) : argument "parameters" is
missing, with no default
I don't understand why this error is reported when I've assigned parameters in the preceding lines.
You might as well define your gradient function (and the other non-changing elements) outside the loop:
SIR <- function(time, state, parameters) {
  with(as.list(c(state, parameters)), {
    dS <- -beta*S*I
    dI <- beta*S*I - gamma*I
    dR <- gamma*I
    return(list(c(dS, dI, dR)))
  })
}
init <- c(S=1-1e-6, I=1e-6, R=0) #initial conditions for odes
time <- seq(0,80,by=1) #time period
Now define the vector of values to try (not necessary but convenient):
betavec <- seq(1,2,by=0.1)
and define a list to hold the results:
res <- vector(length(betavec),mode="list")
library(deSolve)
for (k in seq_along(betavec)) { # range of values for beta
  res[[k]] <- ode(y=init, times=time, func=SIR,
                  parms=c(beta=betavec[k], gamma=0.15))
}
Now you have a list, each element of which contains the results from one run. You can sapply or lapply over this list, e.g. to get a matrix of the last states from each run:
t(sapply(res,tail,1))
Or if you want the results as one long data frame ...
names(res) <- betavec ## to get beta value incorporated in results
dd <- dplyr::bind_rows(lapply(res,as.data.frame),.id="beta")
dd$beta <- as.numeric(dd$beta)
do.call(rbind, ...) would work nearly as well as bind_rows(), but bind_rows()'s .id argument is convenient for adding the beta values to each data frame. You could also leave the results as a list and loop over them while plotting with separate lines() calls, or (e.g.) bind just the infective columns together and use matplot() to plot them all at the same time (see the sketch after the ggplot code below). This is just a matter of style and idiom.
library(ggplot2); theme_set(theme_bw())
library(viridis)
ggplot(dd,aes(x=time,y=I,colour=beta))+
geom_line(aes(group=beta))+
scale_color_viridis()+
scale_y_log10()
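For completeness, here is a minimal base-R sketch of the matplot() route mentioned above; it reuses res, time, and betavec from this answer, and "I" is the infective column returned by ode():
# Pull the infective column out of each run and plot all runs at once.
Imat <- sapply(res, function(m) m[, "I"]) # rows = time steps, columns = beta values
matplot(time, Imat, type = "l", lty = 1, log = "y", xlab = "time", ylab = "I")
legend("bottomright", legend = betavec, lty = 1, col = 1:6,
       title = "beta", ncol = 2, cex = 0.7)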

Different results when performing PCA in R with princomp() and principal ()

I tried to use princomp() and principal() to do PCA in R with the data set USArrests. However, I got two different results for the loadings/rotation and the scores.
First, I centered and normalised the original data frame so it is easier to compare the outputs.
library(psych)
trans_func <- function(x) {
  x <- (x - mean(x)) / sd(x)
  return(x)
}
A <- USArrests
USArrests <- apply(USArrests, 2, trans_func)
princompPCA <- princomp(USArrests, cor = TRUE)
principalPCA <- principal(USArrests, nfactors=4 , scores=TRUE, rotate = "none",scale=TRUE)
Then I got the results for the loadings and scores using the following commands:
princompPCA$loadings
principalPCA$loadings
Could you please help me explain why there is a difference, and how can we interpret these results?
At the very end of the help page for ?principal:
"The eigen vectors are rescaled by the sqrt of the eigen values to produce the component loadings more typical in factor analysis."
So principal returns the scaled loadings. In fact, principal produces a factor model estimated by the principal component method.
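As a quick check of that statement, here is a sketch of my own using the USArrests data from the question: the loadings returned by principal equal the eigenvectors of the correlation matrix rescaled by the square roots of the eigenvalues, up to column-wise sign flips.
library(psych)
ev <- eigen(cor(USArrests))
pr <- principal(USArrests, nfactors = 4, rotate = "none")
rescaled <- ev$vectors %*% diag(sqrt(ev$values))
# Compare absolute values because individual columns may differ only in sign.
max(abs(abs(rescaled) - abs(unclass(pr$loadings)))) # effectively zero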
Four years later, I would like to provide a more accurate answer to this question. I use the iris data as an example.
data = iris[, 1:4]
First, do PCA by eigendecomposition:
eigen_res = eigen(cov(data))
l = eigen_res$values
q = eigen_res$vectors
Then the eigenvector corresponding to the largest eigenvalue gives the loadings of the first component:
q[,1]
We can treat this as a reference, i.e. the correct answer. Now we check the results from different R functions.
First, the 'princomp' function:
res1 = princomp(data)
res1$loadings[,1]
# compare with
q[,1]
No problem: this function just returns the same results as 'eigen'. Now move on to 'principal':
library(psych)
res2 = principal(data, nfactors=4, rotate="none")
# the loadings of the first PC is
res2$loadings[,1]
# compare it with the results by eigendecomposition
sqrt(l[1])*q[,1] # re-scale the eigen vector by sqrt of eigen value
You may find they are still different. The problem is that the 'principal' function does the eigendecomposition on the correlation matrix by default. Note that PCA is not invariant under rescaling of the variables. If you modify the code as follows:
res2 = principal(data, nfactors=4, rotate="none", cor="cov")
# the loadings of the first PC is
res2$loadings[,1]
# compare it with the results by eigendecomposition
sqrt(l[1])*q[,1] # re-scale the eigen vector by sqrt of eigen value
Now, you will get the same results as 'eigen' and 'princomp'.
To summarize:
If you want to do PCA, you are better off using the 'princomp' function.
PCA is a special case of the factor model, or a simplified version of the factor model. It is just equivalent to an eigendecomposition.
We can apply PCA to get an approximation of a factor model. It does not model the specific factors, i.e. the epsilons in a factor model. So, if you change the number of factors in your model, you will get the same estimates of the loadings (see the sketch below). This is different from maximum likelihood estimation.
If you are estimating a factor model, you are better off using the 'principal' function, since it provides more features, such as rotation and calculating the scores by different methods.
Rescaling the loadings of a PCA model doesn't affect the results much, since you still project the data onto the same optimal directions, i.e. the directions that maximize the variance of the resulting PCs.
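A quick sketch (my addition) of the point about the number of factors, using the iris example and the cor = "cov" option from above:
library(psych)
data <- iris[, 1:4]
# The loadings of the first component are identical whether we keep 2 or 4 components.
l2 <- principal(data, nfactors = 2, rotate = "none", cor = "cov")$loadings[, 1]
l4 <- principal(data, nfactors = 4, rotate = "none", cor = "cov")$loadings[, 1]
all.equal(l2, l4) # TRUE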
ev <- eigen(R) # R is a correlation matrix of DATA
ev$vectors %*% diag(ev$values) %*% t(ev$vectors)
pc <- princomp(scale(DATA, center = F, scale = T),cor=TRUE)
p <-principal(DATA, rotate="none")
#eigen values
ev$values^0.5
pc$sdev
p$values^0.5
#eigen vectors - loadings
ev$vectors
pc$loadings
p$weights %*% diag(p$values^0.5)
pc$loadings %*% diag(pc$sdev)
p$loadings
#weights
ee <- diag(0, 2)
for (j in 1:2) {
  for (i in 1:2) {
    ee[i, j] <- ev$vectors[i, j] / p$values[j]^0.5
  }
}
ee
#scores
s <- as.matrix(scale(DATA, center = T, scale = T)) %*% ev$vectors
scale(s)
p$scores
scale(pc$scores)

R: backwards principal component calculation

I would like to perform a backwards principal component calculation in R, meaning: obtaining the original matrix from the PCA object itself.
This is an example case:
# Load an expression matrix
load(url("http://www.giorgilab.org/allexp_rsn.rda"))
# Calculate PCA
pca <- prcomp(t(allexp_rsn))
In order to obtain the original matrix, I thought one should multiply the rotation matrix by the principal component scores, like this:
test<-pca$rotation%*%pca$x
However, as you can check, the calculated "test" matrix is completely different from the original "allexp_rsn" matrix. What am I doing wrong? Is the prcomp function adding something else to the SVD procedure?
Thanks :-)
Using USArrests:
pca <- prcomp(t(USArrests))
out <- t(pca$x%*%t(pca$rotation))
out <- sweep(out, 1, pca$center, '+')
apply(USArrests - out, 2, sum)
Murder Assault UrbanPop Rape
1.070921e-12 -2.778222e-12 3.801404e-13 1.428191e-12
Remember that PCA works on centered data. prcomp centers the data by default (and scales it only if you set scale. = TRUE), so pca$x contains the scores of the centered data; that is why the centers are added back with sweep() above.
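If the PCA had instead been run with scale. = TRUE, the scaling would also need to be undone before adding the centers back; a small sketch (my addition) using USArrests:
pca2 <- prcomp(USArrests, center = TRUE, scale. = TRUE)
back <- pca2$x %*% t(pca2$rotation)      # recovers the centered and scaled data
back <- sweep(back, 2, pca2$scale, "*")  # undo the scaling (column standard deviations)
back <- sweep(back, 2, pca2$center, "+") # undo the centering (column means)
max(abs(back - as.matrix(USArrests)))    # effectively zero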
Here is a solution using the eigen function, applied to a black-and-white image matrix to illustrate the point. The loop below reconstructs the image with an increasing number of PCs, but you can use all of them or only some of them.
library(gplots)
library(png)
# Download an image:
download.file("http://www.giorgilab.org/pictures/monalisa.tar.gz",destfile="monalisa.tar.gz",cacheOK = FALSE)
untar("monalisa.tar.gz")
# Read image:
img <- readPNG("monalisa.png")
# Dimension
d<-1
# Rotate it:
rotate <- function(x) t(apply(x, 2, rev))
centermat<-rotate(img[,,d])
# Plot it
image(centermat,col=gray(c(0:100)/100))
# Increasing PCA
png("increasingPCA.png",width=2000,height=2000,pointsize=20)
par(mfrow=c(5,5),mar=c(0,0,0,0))
for (end in (1:25)*12) {
  for (d in 1) {
    centermat <- rotate(img[,,d])
    eig <- eigen(cov(centermat))
    n <- 1:end
    eigmat <- t(eig$vectors[,n] %*% (t(eig$vectors[,n]) %*% t(centermat)))
    image(eigmat, col=gray(c(0:100)/100))
  }
}
dev.off()

using k-NN in R with categorical values

I'm looking to perform classification on data with mostly categorical features. For that purpose, Euclidean distance (or any other distance that assumes numeric features) doesn't fit.
I'm looking for a kNN implementation for [R] where it is possible to select different distance methods, like Hamming distance.
Is there a way to use common kNN implementations like the one in {class} with different distance metric functions?
I'm using R 2.15
As long as you can calculate a distance/dissimilarity matrix (in whatever way you like) you can easily perform kNN classification without the need of any special package.
# Generate dummy data
y <- rep(1:2, each=50) # True class memberships
x <- y %*% t(rep(1, 20)) + rnorm(100*20) < 1.5 # Dataset with 20 variables
design.set <- sample(length(y), 50)
test.set <- setdiff(1:100, design.set)
# Calculate distance and nearest neighbors
library(e1071)
d <- hamming.distance(x)
NN <- t(apply(d[test.set, design.set], 1, order)) # row i = design-set indices ordered by distance to test point i
# Predict class membership of the test set
k <- 5
pred <- apply(NN[, 1:k, drop=FALSE], 1, function(nn){
  tab <- table(y[design.set][nn])
  as.integer(names(tab)[which.max(tab)]) # This is a pretty dirty line
})
# Inspect the results
table(pred, y[test.set])
If anybody knows a better way of finding the most common value in a vector than the dirty line above, I'd be happy to know.
The drop=FALSE argument is needed to keep the subset of NN a matrix in the case k=1; without it, the subset would be converted to a vector and apply would throw an error.
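As for the "dirty line": since the class labels here are small positive integers, one possible alternative (my suggestion, not part of the original answer) is to count them with tabulate() and skip the table()/names() round trip:
# Same prediction step using tabulate(); assumes integer class labels (1 and 2 here).
# Ties go to the smaller label, because which.max() returns the first maximum.
pred <- apply(NN[, 1:k, drop = FALSE], 1, function(nn) {
  which.max(tabulate(y[design.set][nn]))
})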

applying the pvclust R function to a precomputed dist object

I'm using R to perform a hierarchical clustering. As a first approach I used hclust and performed the following steps:
I imported the distance matrix
I used the as.dist function to transform it into a dist object
I ran hclust on the dist object
Here's the R code:
distm <- read.csv("distMatrix.csv")
d <- as.dist(distm)
hclust(d, "ward")
At this point I would like to do something similar with the function pvclust; however, that is not possible because pvclust does not accept a precomputed dist object. How can I proceed, considering that I'm using a distance that is not available among those provided by R's dist function?
I've tested Vincent's suggestion; you can do the following (my data set is a dissimilarity matrix):
# Import you data
distm <- read.csv("distMatrix.csv")
d <- as.dist(distm)
# Compute the eigenvalues
x <- cmdscale(d,1,eig=T)
# Plot the eigenvalues and choose the correct number of dimensions (eigenvalues close to 0)
plot(x$eig,
type="h", lwd=5, las=1,
xlab="Number of dimensions",
ylab="Eigenvalues")
# Recover the coordinates that give the same distance matrix with the correct number of dimensions
x <- cmdscale(d,nb_dimensions)
# As mentioned by Stéphane, pvclust() clusters columns
pvclust(t(x))
If the dataset is not too large, you can embed your n points in a space of dimension n-1, with the same distance matrix.
# Sample distance matrix
n <- 100
k <- 1000
d <- dist( matrix( rnorm(k*n), nc=k ), method="manhattan" )
# Recover some coordinates that give the same distance matrix
x <- cmdscale(d, n-1)
stopifnot( sum(abs(dist(x) - d)) < 1e-6 )
# You can then indifferently use x or d
r1 <- hclust(d)
r2 <- hclust(dist(x)) # identical to r1
library(pvclust)
r3 <- pvclust(x)
If the dataset is large, you may have to check how pvclust is implemented.
It's not clear to me whether you only have a distance matrix or you computed it yourself beforehand. In the former case, as already suggested by @Vincent, it would not be too difficult to tweak the R code of pvclust itself (using fix() or whatever; I provided some hints on another question on CrossValidated). In the latter case, the authors of pvclust provide an example of how to use a custom distance function, although that means you will have to install their "unofficial version".
