PCA feature selection using R

I am a biologist. The output of my experiment contains a large number of features: the data has 563 rows (observations) and 8603 columns (features), which is quite a lot.
When I try to run a PCA in R, it gives "out of memory" errors.
I have also tried running princomp in pieces, but that does not seem to work for our approach.
I tried using the script given in this link...
http://www.r-bloggers.com/introduction-to-feature-selection-for-bioinformaticians-using-r-correlation-matrix-filters-pca-backward-selection/
But it still does not work :(
I am trying to use the following code:
bumpus <- read.table("http://www.ndsu.nodak.edu/ndsu/doetkott/introsas/rawdata/bumpus.html",
skip=20, nrows=49,
col.names=c("id","total","alar","head","humerus","sternum"))
boxplot(bumpus, main="Boxplot of Bumpus' data") ## in this step it is showing the ERROR
# we first standardize the data:
bumpus.scaled <- data.frame( apply(bumpus,2,scale) )
boxplot(bumpus.scaled, main="Boxplot of standardized Bumpus' data")
pca.res <- prcomp(bumpus.scaled, retx=TRUE)
pca.res
# note:
# PC.1 is some kind of average of all the measurements
# => measure of size of the bird
# PC.2 has a negative weight for 'sternum'
# and positive weights for 'alar', 'head' and 'humerus'
# => measure of shape of the bird
# first two principal components:
pca.res$x[,1:2]
plot(pca.res$x[,1:2], pch="", main="PC.1 and PC.2 for Bumpus' data (blue=survived, red=died)")
text(pca.res$x[,1:2], labels=c(1:49), col=c(rep("blue",21),rep("red",28)))
abline(v=0, lty=2)
abline(h=0, lty=2)
# compare to segment plot:
windows()
palette(rainbow(12, s = 0.6, v = 0.75))
stars(bumpus, labels=c(1:49), nrow=6, key.loc=c(20,-1),
main="Segment plot of Bumpus' data", draw.segment=TRUE)
# compare to biplot:
windows()
biplot(pca.res, scale=0)
# what do the arrows mean?
# consider the arrow for sternum:
abline(0, pca.res$rotation[5,2]/pca.res$rotation[5,1])
# consider the arrow for head:
abline(0, pca.res$rotation[3,2]/pca.res$rotation[3,1])
But the second line
boxplot(bumpus, main="Boxplot of Bumpus' data") ## shows an error
The error is
Error: cannot allocate vector of size 1.4 Mb
In addition: There were 27 warnings (use warnings() to see them)
Please help!

In cases where the number of features is huge or exceeds the number of observations, it is advisable to calculate the principal components from the transposed dataset. This is especially true in your case, because the default approach implies computing an 8603 x 8603 covariance matrix, which by itself takes roughly 590 MB of memory in double precision (not a huge amount, but still).
Assuming that the rows of your matrix X correspond to observations and the columns to features, center your data and then perform PCA on the transpose of the centered X. There won't be more eigenpairs than observations anyway. Finally, multiply each resulting eigenvector by X^T. You do not need to do this for the eigenvalues (see the Details section further down for an explanation).
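As a rough back-of-the-envelope check of the memory argument, the sketch below compares the size of the feature-by-feature matrix with the observation-by-observation one. It assumes dense double-precision storage (8 bytes per entry) and ignores any copies R makes along the way; the variable names are only illustrative.

```r
# Dimensions taken from the question: 563 observations, 8603 features.
n <- 563
p <- 8603
bytes_per_double <- 8

# Size of the p x p matrix (features x features) in megabytes.
cov_mb <- p * p * bytes_per_double / 1e6

# Size of the n x n matrix (observations x observations) in megabytes.
gram_mb <- n * n * bytes_per_double / 1e6

cat(sprintf("p x p: %.1f MB, n x n: %.2f MB\n", cov_mb, gram_mb))
```

The n x n matrix is a couple of megabytes, which is why working on the transposed problem is so much cheaper here.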
What you want
This code demonstrates the implementation of PCA on the transposed dataset and compares the results of prcomp and the "transposed PCA":
pca.reduced <- function(X, center=TRUE, retX=TRUE) {
# Note that the data must first be centered on the *original* dimensions
# because the centering of the 'transposed covariance' is meaningless for
# the dataset. This is also why Sigma must be computed dependent on N
# instead of simply using cov().
if (center) {
mu <- colMeans(X)
X <- sweep(X, 2, mu, `-`)
}
# From now on we're looking at the transpose of X:
Xt <- t(X)
aux <- svd(Xt)
V <- Xt %*% aux$v
# Normalize the columns of V.
V <- apply(V, 2, function(x) x / sqrt(sum(x^2)))
# Done.
list(X = if (retX) X %*% V else NULL,
V = V,
sd = aux$d / sqrt(nrow(X)-1),
mean = if (center) mu else NULL)
}
# Example data (low-dimensional, but sufficient for this example):
X <- cbind(rnorm(1000), rnorm(1000) * 5, rnorm(1000) * 3)
original <- prcomp(X, scale=FALSE)
transposed <- pca.reduced(X)
# See what happens:
> print(original$sdev)
[1] 4.6468136 2.9240382 0.9681769
> print(transposed$sd)
[1] 4.6468136 2.9240382 0.9681769
>
> print(original$rotation)
PC1 PC2 PC3
[1,] -0.0055505001 0.0067322416 0.999961934
[2,] -0.9999845292 -0.0004024287 -0.005547916
[3,] 0.0003650635 -0.9999772572 0.006734371
> print(transposed$V)
[,1] [,2] [,3]
[1,] 0.0055505001 0.0067322416 -0.999961934
[2,] 0.9999845292 -0.0004024287 0.005547916
[3,] -0.0003650635 -0.9999772572 -0.006734371
Details
To see why it is possible to work on the transposed matrix consider the
following:
The general form of the eigenvalue equation is
A x = λ x (1)
Without loss of generality, let M be a centered "copy" of your original
dataset X. Substitution of M^T M for A yields
M^T M x = λ x (2)
Multiplication of this equation by M yields
M M^T M x = λ M x (3)
Subsequent substitution of y = M x yields
M M^T y = λ y (4)
One can already see that y corresponds to an eigenvector of the "covariance" matrix of the transposed dataset. (Note that M M^T is in fact not a real covariance matrix, as the dataset X was centered along its columns and not its rows. Also, scaling must be done by the number of samples (rows of M) and not the number of features (columns of M, i.e. rows of M^T).)
It can also be seen that the eigenvalues are the same for M M^T and M^T M.
Finally, one last multiplication by M^T results in
(M^T M) M^T y = λ M^T y (5)
where M^T M is the original covariance matrix.
From equation (5) it follows that M^T y is an eigenvector of M^T M with
eigenvalue λ.
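The derivation above can be checked numerically. The following is a minimal sketch on made-up random data (the sizes n = 6 and p = 20 are arbitrary assumptions): it compares the nonzero eigenvalues of M^T M and M M^T, then maps the eigenvectors y of the small matrix to M^T y and checks that they satisfy the eigenvalue equation of the large one.

```r
set.seed(1)
n <- 6    # observations (rows)
p <- 20   # features (columns), deliberately larger than n
X <- matrix(rnorm(n * p), nrow = n)

# Center along the columns, as in the derivation.
M <- sweep(X, 2, colMeans(X), `-`)

big   <- eigen(crossprod(M),  symmetric = TRUE)  # M^T M, p x p
small <- eigen(tcrossprod(M), symmetric = TRUE)  # M M^T, n x n

# Centering removes one degree of freedom, so at most n - 1
# eigenvalues are nonzero.
k <- n - 1
lambda_big   <- big$values[1:k]
lambda_small <- small$values[1:k]

# Map eigenvectors y of M M^T to M^T y and normalize the columns.
V <- crossprod(M, small$vectors[, 1:k])
V <- sweep(V, 2, sqrt(colSums(V^2)), `/`)

# Residual of (M^T M) v = lambda v for each mapped vector.
resid <- crossprod(M) %*% V - sweep(V, 2, lambda_small, `*`)
max_resid <- max(abs(resid))
```

Here `lambda_big` and `lambda_small` should agree up to rounding, and `max_resid` should be close to zero. Individual eigenvectors may differ in sign from those in `big$vectors`, as in the prcomp comparison above.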

Related

OLS estimator in R

I'm trying to compute OLS estimators manually in R for given vectors and matrices, but when I apply the formula beta=(x'x)^-1(x'y), R tells me that there is a dimension issue, and I can't figure out why.
My code is
nr = 100
nc = 1000
x=matrix(rnorm(nr * nc, mean=1, sd=1), nrow = nr)
epsilon=matrix(rnorm(nr * nc, mean=0, sd=1), nrow = nr)
k=c(1,2,4,8)
eta1=((epsilon^1-mean(epsilon^1))/(mean(epsilon^(1*2))-mean(epsilon^1)^2)^(1/2))
eta2=((epsilon^2-mean(epsilon^2))/(mean(epsilon^(2*2))-mean(epsilon^2)^2)^(1/2))
eta4=((epsilon^4-mean(epsilon^4))/(mean(epsilon^(4*2))-mean(epsilon^4)^2)^(1/2))
eta8=((epsilon^8-mean(epsilon^8))/(mean(epsilon^(8*2))-mean(epsilon^8)^2)^(1/2))
y1=x+eta1
y2=x+eta2
y4=x+eta4
y8=x+eta8
beta1=inv(t(x)*x)*(t(x)*y1)
beta2=inv(t(x)*x)*(t(x)*y2)
beta4=inv(t(x)*x)*(t(x)*y4)
beta8=inv(t(x)*x)*(t(x)*y8)
Also, I feel that there should be a way to loop through the values of k to get this automated, instead of doing each eta by hand. So, a bit of help in this area would also be appreciated.
The output I'm looking for is a vector of beta for each of the different values of k.
You have several issues. Firstly, y and epsilon should be n x 1 matrices, but you have n x m matrices for them instead. Secondly, you should use matrix multiplication, which is %*% in R, i.e. t(x)%*%y1. You used element-wise multiplication (*) instead.
For the sake of simplicity, let's create a matrix with 5 columns. My approach is to create a dependent variable that is related to the x columns (the independent variables, or feature matrix in machine learning terminology).
nr = 100
nc = 5
x=matrix(rnorm(nr * nc, mean=1, sd=1), nrow = nr)
epsilon=matrix(rnorm(nr, mean=0, sd=1), nrow = nr) # it should be nx1
k=c(1,2,4,8)
eta1=((epsilon^1-mean(epsilon^1))/(mean(epsilon^(1*2))-mean(epsilon^1)^2)^(1/2))
eta2=((epsilon^2-mean(epsilon^2))/(mean(epsilon^(2*2))-mean(epsilon^2)^2)^(1/2))
eta4=((epsilon^4-mean(epsilon^4))/(mean(epsilon^(4*2))-mean(epsilon^4)^2)^(1/2))
eta8=((epsilon^8-mean(epsilon^8))/(mean(epsilon^(8*2))-mean(epsilon^8)^2)^(1/2))
To check the output, we should create the y values deliberately. So, let's define some betas and create the y values from them. At the end, we can compare the output with the inputs we defined. Note that you need 5 betas for 5 columns.
# made up betas
beta1_real <- 1:5
beta2_real <- -4:0
beta4_real <- 7:11
beta8_real <- seq(0.25,1.25,0.25)
To create the y values,
y1= 10 + x %*% matrix(beta1_real) + eta1
y2= 20 + x %*% matrix(beta2_real) + eta2
y4= 30 + x %*% matrix(beta4_real) + eta4
y8= 40 + x %*% matrix(beta8_real) + eta8
Here, I also added a constant term to each y. To recover the constant term at the end, we should add a column of ones at the beginning of our x matrix, like this:
x <- cbind(matrix(1,nrow = nr),x)
The rest is almost the same as yours. The only differences are that I used solve instead of inv and that I used matrix multiplication (%*%):
beta1=solve(t(x)%*%x)%*%(t(x)%*%y1)
beta2=solve(t(x)%*%x)%*%(t(x)%*%y2)
beta4=solve(t(x)%*%x)%*%(t(x)%*%y4)
beta8=solve(t(x)%*%x)%*%(t(x)%*%y8)
If we compare the outputs,
beta1_real was,
# [1] 1 2 3 4 5
and the output of beta1 is,
# [,1]
# [1,] 10.0049631
# [2,] 0.9632124
# [3,] 1.8987402
# [4,] 2.9816673
# [5,] 4.2111817
# [6,] 4.9529084
The results are similar. The 10 at the beginning is the constant term I added. The differences stem from the error term that was added (the etas).
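The question also asked how to avoid writing each eta by hand. One possible sketch, under the same setup as above, is to wrap the standardization in a helper and apply it over k. The helper name `standardise`, the single shared coefficient vector `beta_real`, and the seed are assumptions made for illustration. It also uses `solve(crossprod(x), crossprod(x, y))`, which solves the normal equations directly instead of forming the inverse first.

```r
set.seed(42)
nr <- 100
nc <- 5

# Design matrix with a leading column of ones for the intercept.
x <- cbind(1, matrix(rnorm(nr * nc, mean = 1, sd = 1), nrow = nr))
epsilon <- rnorm(nr)

# Hypothetical "true" coefficients: intercept 10, slopes 1..5.
beta_real <- c(10, 1:5)

# Standardize epsilon^k to mean 0 and (population) variance 1.
standardise <- function(e, k) {
  (e^k - mean(e^k)) / sqrt(mean(e^(2 * k)) - mean(e^k)^2)
}

k <- c(1, 2, 4, 8)

# One column of OLS estimates per value of k.
betas <- sapply(k, function(kk) {
  y <- x %*% beta_real + standardise(epsilon, kk)
  solve(crossprod(x), crossprod(x, y))
})
colnames(betas) <- paste0("k", k)
betas
```

Each column of `betas` holds the six estimates (intercept first) for one value of k.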

Find limiting distribution of transition matrix and plot in R

How can I find and plot the limiting distribution (the unique stationary distribution) of a transition matrix in R? My goal is to get a unique, constant result instead of a random one.
This is the P matrix used:
P=matrix(c(0.2,0.3,0.5,0.1,0.8,0.1,0.4,0.2,0.4),nrow=3,ncol=3,byrow=TRUE)
I misspoke in my earlier answer. Either the sums of the rows or the column need to all be 1 for a transition matrix. It depends on whether you are using v'P or Pv to transition to the next step.
I'll use Pv.
For the limiting distribution to be stable, we must have:
Pv = v, or (P - I)v = 0. So the limiting distribution is an eigenvector with eigenvalue 1. Then, to make sure it is a distribution, rescale it so that sum(v) == 1.
Since your matrix has rows that sum to 1, not columns, we need to use the transpose of the matrix to calculate the eigenvector:
e <- eigen(t(P))$vectors[, 1]
e <- e / sum(e)
Gives:
e
[1] 0.1960784 0.5490196 0.2549020
To check this:
P=matrix(c(0.2,0.3,0.5,0.1,0.8,0.1,0.4,0.2,0.4),nrow=3,ncol=3, byrow = TRUE)
ans <- e
for (i in 1:1000) {
ans <- ans %*% P
}
ans
[,1] [,2] [,3]
[1,] 0.1960784 0.5490196 0.254902
Same, so it's stable.
I'm not clear as to what you wanted to plot.
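Since it is not clear what the plot should show, here is one possible reading, as a sketch: compute the stationary distribution from the eigenvector whose eigenvalue is closest to 1 (rather than assuming it is the first one), then trace how the state probabilities converge to it from a chosen starting state. The function name `stationary`, the starting state and the number of steps are assumptions for illustration.

```r
P <- matrix(c(0.2, 0.3, 0.5,
              0.1, 0.8, 0.1,
              0.4, 0.2, 0.4), nrow = 3, byrow = TRUE)

# Stationary distribution of a row-stochastic matrix: the left
# eigenvector for the eigenvalue closest to 1, rescaled to sum to 1.
stationary <- function(P) {
  e <- eigen(t(P))
  i <- which.min(abs(e$values - 1))
  v <- Re(e$vectors[, i])
  v / sum(v)
}
pi_hat <- stationary(P)

# Probabilities after 0, 1, ..., 20 steps, starting in state 1.
steps <- 0:20
path <- matrix(NA_real_, nrow = length(steps), ncol = nrow(P))
state <- c(1, 0, 0)
for (s in seq_along(steps)) {
  path[s, ] <- state
  state <- as.vector(state %*% P)
}

# Draw the plots only in an interactive session.
if (interactive()) {
  barplot(pi_hat, names.arg = 1:3, xlab = "State",
          ylab = "Stationary probability")
  matplot(steps, path, type = "l", lty = 1,
          xlab = "Step", ylab = "Probability")
  abline(h = pi_hat, lty = 3)
}
```

The result is deterministic: it depends only on P, not on any random draw.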

Faster alternative to R car::Anova for sum of square crossproduct matrix calculation for subsets of predictors

I need to compute the sum of squares crossproduct matrix (indeed the trace of this matrix) in a multivariate linear model, with Y (n x q) and X (n x p). Standard R code for doing that is:
require(MASS)
require(car)
# Example data
q <- 10
n <- 1000
p <- 10
Y <- mvrnorm(n, mu = rep(0, q), Sigma = diag(q))
X <- as.data.frame(mvrnorm(n, mu = rnorm(p), Sigma = diag(p)))
# Fit lm
fit <- lm( Y ~ ., data = X )
# Type I sums of squares
summary(manova(fit))$SS
# Type III sums of squares
type = 3 # could be also 2 (II)
car::Anova(fit, type = type)$SSP
This has to be done thousands of times, unfortunately, it gets slow when the number of predictors is relatively large. As often I am interested only in a subset of s predictors, I tried to re-implement this calculation. Although my implementation directly translating linear algebra for s = 1 (below) is faster for small sample sizes (n),
# Hat matrix (X here stands for the actual design matrix)
H <- tcrossprod(tcrossprod(X, solve(crossprod(X))), X)
# Remove predictor of interest (e.g. 2)
X.r <- X[, -2]
H1 <- tcrossprod(tcrossprod(X.r, solve(crossprod(X.r))), X.r)
# Compute e.g. type III sum of squares
SS <- crossprod(Y, H - H1) %*% Y
car is still faster for large n.
I already tried an Rcpp implementation without much success, as these matrix products in R already use very efficient code.
Any hint on how to do this faster?
UPDATE
After reading the answers, I tried the solution proposed in this post which relies on QR/SVD/Cholesky factorization for hat matrix calculation. However it seems that car::Anova is still faster to compute all p = 30 matrices than me computing just one (s = 1)!! for e.g. n = 5000, q = 10:
Unit: milliseconds
expr min lq mean median uq max neval
ME 1137.5692 1202.9888 1257.8979 1251.6834 1318.9282 1398.9343 10
QR 1005.9082 1031.9911 1084.5594 1037.5659 1095.7449 1364.9508 10
SVD 1026.8815 1065.4629 1152.6631 1087.9585 1241.4977 1446.8318 10
Chol 969.9089 1056.3093 1115.9608 1102.1169 1210.7782 1267.1274 10
CAR 205.1665 211.8523 218.6195 214.6761 222.0973 242.4617 10
UPDATE 2
The best solution for now was to go over the car::Anova code (i.e. functions car:::Anova.III.mlm and subsequently car:::linearHypothesis.mlm) and re-implement them to account for a subset of predictors, instead of all of them.
The relevant code by car is as follows (I skipped checks, and simplified a bit):
B <- coef(fit) # Model coefficients
M <- model.matrix(fit) # Model matrix M
V <- solve(crossprod(M)) # (M'M)^-1
p <- ncol(M) # Number of predictors in M
I.p <- diag(p) # Identity (p x p)
terms <- labels(terms(fit)) # terms (add intercept)
terms <- c("(Intercept)", terms)
n.terms <- length(terms)
assign <- fit$assign # assignation terms <-> p variables
SSP <- as.list(rep(0, n.terms)) # Initialize empty list for sums of squares cross-product matrices
names(SSP) <- terms
for (term in 1:n.terms){
subs <- which(assign == term - 1)
L <- I.p[subs, , drop = FALSE]
SSP[[term]] <- t(L %*% B) %*% solve(L %*% V %*% t(L)) %*% (L %*% B)
}
Then it is just a matter of selecting the subset of terms.
This line and the similar one below it for H1 could probably be improved:
H <- tcrossprod(tcrossprod(X, solve(crossprod(X))), X)
The general idea is that you should rarely use solve(Y) %*% Z, because it is the same as solve(Y, Z) but slower. I haven't fully expanded your tcrossprod calls to see what the best equivalent formulation of the expressions for H and H1 would be.
You could also look at this question https://stats.stackexchange.com/questions/139969/speeding-up-hat-matrices-like-xxx-1x-projection-matrices-and-other-as for a description of doing it via QR decomposition.
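As a sketch of the QR idea, the hat matrices do not need to be formed at all: since H is symmetric and idempotent, Y'HY equals the crossproduct of the fitted values HY, and `qr.fitted` returns those directly. The helper name `ss_qr` and the example sizes are assumptions; I have not benchmarked this against car::Anova, so treat it as a starting point rather than a guaranteed speed-up.

```r
set.seed(1)
n <- 200
q <- 3
p <- 5
X <- cbind(1, matrix(rnorm(n * (p - 1)), nrow = n))  # design matrix
Y <- matrix(rnorm(n * q), nrow = n)                  # responses

# SS for dropping column `drop`: Y'(H - H1)Y, computed from the fitted
# values of the full and the reduced model instead of H and H1.
ss_qr <- function(X, Y, drop) {
  fit_full <- qr.fitted(qr(X), Y)
  fit_red  <- qr.fitted(qr(X[, -drop, drop = FALSE]), Y)
  crossprod(fit_full) - crossprod(fit_red)
}
SS_qr <- ss_qr(X, Y, drop = 2)

# Reference: explicit hat matrices, using solve(A, B) rather than
# solve(A) %*% B.
hat <- function(Z) Z %*% solve(crossprod(Z), t(Z))
SS_hat <- crossprod(Y, hat(X) - hat(X[, -2])) %*% Y
```

This avoids building the two n x n matrices, which is where most of the memory and time goes for large n.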

How can I code this equation with double summation in R?

So I'm having a hard time coding the above equation, mainly the part that contains the double sum over i and over j.
In my case, n = 200 and p = 15. My yi are in a vector Y = (y1,y2,...yn) of length 200, and the Xij are in a matrix with 200 rows and 15 columns. The Bj are in a vector of length 15.
My own solution, which I'm fairly certain is wrong, is this:
b0 <- 1/200 * sum(Y - sum(matr*b))
And here is code which you can use to reproduce my vectors and matrix:
library(MASS) # provides mvrnorm
matr <- t(mvrnorm(15,mu= rep(0,200),diag(1,nrow = 200)))
Y <- rnorm(n = 200)
b <- rnorm(n = 15)
Use matrix multiplication:
mean(y - x %*% b)
Note that if y and x are known and b is the least squares regression estimate of the coefficients then we can write it as:
fm <- lm(y ~ x + 0)
mean(resid(fm))
and that necessarily equals 0 if there is an intercept, i.e. a constant column in x, since the residual vector must be orthogonal to the range of x and taking the mean is the same as taking the inner product of the residuals and a vector whose elements are all the same (and equal to 1/n).
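To see why the matrix form works, the sketch below writes the double sum out as explicit loops and compares it with the one-liner. It assumes the equation has the form b0 = (1/n) * sum_i (y_i - sum_j x_ij * b_j), which is how the question's attempt reads; the data are random stand-ins. The original attempt `sum(matr*b)` goes wrong because `*` recycles b element-wise down the columns of the matrix and `sum` then collapses everything to a single number before it is subtracted from Y.

```r
set.seed(1)
n <- 200
p <- 15
x <- matrix(rnorm(n * p), nrow = n)
y <- rnorm(n)
b <- rnorm(p)

# Literal translation of the double sum.
total <- 0
for (i in seq_len(n)) {
  inner <- 0
  for (j in seq_len(p)) {
    inner <- inner + x[i, j] * b[j]
  }
  total <- total + (y[i] - inner)
}
b0_loop <- total / n

# Matrix form: x %*% b computes all n inner sums at once.
b0_mat <- mean(y - x %*% b)

# Equivalent form using only the means of y and of the columns of x.
b0_means <- mean(y) - sum(colMeans(x) * b)
```

All three values should agree up to floating-point rounding.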

Calculating divergence between joint posterior distributions

I wish to calculate the distance between two 3-dimensional posterior distributions. The draws are stored in two 30,000x3 matrices.
So far I have been successful in calculating Total Variation distance between two 2-dimensional posteriors (two 30,000x2 matrices) by splitting the grid into bins. However, I am having trouble calculating the divergence between posteriors with more parameters. Some examples of related distance measures can be found here.
NOTE: I do not wish to calculate the distance between the marginals (column-wise entries), but rather to obtain an overall value from comparing the joint distributions in R.
I would really appreciate it if somebody could point out what I am missing here.
EDIT 1: Some example code for calculating Total variation distance between posterior samples stored in two matrices has been added below:
EDIT 2: This is an R question.
set.seed(123)
comparison.2D <- matrix(rnorm(40000*2,0,1),ncol=2)
ground.truth.2D <- matrix(rnorm(40000*2,0,2),ncol=2)
# Function to calculate TVD between matrices with 2 columns:
Total.Variation.Distance.2D<-function(true,
comparison,
burnin,
window.size){
# Bandwidth for theta.1.
my_bw_x<-window.size
# Bandwidth for theta.2.
my_bw_y<-window.size
range_x<-range(c(true[-c(1:burnin),1],comparison[-c(1:burnin),1]))
range_y<-range(c(true[-c(1:burnin),2],comparison[-c(1:burnin),2]))
xx <- seq(range_x[1],range_x[2],by=my_bw_x)
yy <- seq(range_y[1],range_y[2],by=my_bw_y)
true.pointidxs <- matrix( c( findInterval(true[-c(1:burnin),1], xx),
findInterval(true[-c(1:burnin),2], yy) ), ncol=2)
comparison.pointidxs <- matrix( c( findInterval(comparison[-c(1:burnin),1], xx),
findInterval(comparison[-c(1:burnin),2], yy) ), ncol=2)
# Count the frequencies in the corresponding cells:
square.mat.dims <- max(length(xx), length(yy))
frequencies.true <- frequencies.comparison <- matrix(0, ncol=square.mat.dims, nrow=square.mat.dims)
for (i in 1:dim(true.pointidxs)[1]){
frequencies.true[true.pointidxs[i,1], true.pointidxs[i,2]] <- frequencies.true[true.pointidxs[i,1],
true.pointidxs[i,2]] + 1
frequencies.comparison[comparison.pointidxs[i,1], comparison.pointidxs[i,2]] <- frequencies.comparison[comparison.pointidxs[i,1],
comparison.pointidxs[i,2]] + 1
}# End for
# Normalize frequencies matrix:
frequencies.true <- frequencies.true/dim(true.pointidxs)[1]
frequencies.comparison <- frequencies.comparison/dim(comparison.pointidxs)[1]
TVD <-0.5*sum(abs(frequencies.comparison-frequencies.true))
return(TVD)
}# End function
TVD.2D <- Total.Variation.Distance.2D(true=ground.truth.2D, comparison=comparison.2D,burnin=10000,window.size=0.05)
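One way the binning idea above could be carried over to any number of columns is to replace the 2-D frequency matrix with a lookup keyed on the cell index, so only occupied cells are stored. This is a sketch under that assumption, not an established estimator: the function name `tvd_binned` is made up, burn-in removal is left to the caller, and with three or more dimensions the cells get sparse quickly, so the value is sensitive to the bin width and tends to be inflated when the width is small relative to the sample size.

```r
# Binned total variation distance between two samples (rows = draws,
# columns = parameters), using the same bin width in every dimension.
tvd_binned <- function(a, b, bin_width) {
  stopifnot(ncol(a) == ncol(b))
  # Shared grid origin per dimension, so both samples use the same cells.
  lo <- pmin(apply(a, 2, min), apply(b, 2, min))
  # Turn each draw into a string key identifying its grid cell.
  keys <- function(m) {
    idx <- floor(sweep(m, 2, lo, `-`) / bin_width)
    do.call(paste, c(as.data.frame(idx), sep = "_"))
  }
  ka <- keys(a)
  kb <- keys(b)
  cells <- union(ka, kb)
  # Relative frequency of each occupied cell in each sample.
  pa <- tabulate(match(ka, cells), nbins = length(cells)) / length(ka)
  pb <- tabulate(match(kb, cells), nbins = length(cells)) / length(kb)
  0.5 * sum(abs(pa - pb))
}

set.seed(123)
comparison.3D   <- matrix(rnorm(30000 * 3, 0, 1), ncol = 3)
ground.truth.3D <- matrix(rnorm(30000 * 3, 0, 2), ncol = 3)
TVD.3D <- tvd_binned(ground.truth.3D, comparison.3D, bin_width = 0.5)
```

Because the joint cells are compared directly, this measures the distance between the joint distributions rather than between the marginals.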

Resources