Principal Component Analysis in R by hand

The question is about Principal Component Analysis, partly done by hand.
Disclaimer: My background is not in maths and I am using R for the first time.
Given are the following five data points in R^3, where xi1, xi2 and xi3 are the variables and x1 to x5 are the observations:
      |  x1  x2  x3  x4  x5
------+---------------------
  xi1 |  -2  -2   0   2   2
  xi2 |  -2   2   0  -2   2
  xi3 |  -4   0   0   0   4
The three principal component vectors obtained from the principal component analysis are given and look like this:
Phi1 = (0.41, 0.41, 0.82)^T
Phi2 = (-0.71, 0.71, 0.00)^T
Phi3 = (0.58, 0.58, -0.58)^T
The questions are as follows:
1) Calculate the principal component scores zi1, zi2 and zi3 for each of the 5 data points.
2) Calculate the proportion of the variance explained by each principal component.
So far I have answered question 1 with the following code, where Z represents the scores:
# Data matrix: rows are the variables xi1-xi3, columns are the observations x1-x5
A = matrix(
  c(-2, -2, 0, 2, 2,
    -2, 2, 0, -2, 2,
    -4, 0, 0, 0, 4),
  nrow = 3,
  ncol = 5,
  byrow = TRUE
)
# Loading matrix: with byrow = FALSE its rows are Phi1, Phi2 and Phi3
Phi = matrix(
  c(0.41, -0.71, 0.58,
    0.41, 0.71, 0.58,
    0.82, 0.00, -0.58),
  nrow = 3,
  ncol = 3,
  byrow = FALSE
)
# Scores: Z[m, j] is the score of observation j on component m
Z = Phi %*% A
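For reference, with these rounded loadings the scores work out to (rows are the components Phi1-Phi3, columns the observations x1-x5):
Z
# Phi1 scores: -4.92  0.00  0.00  0.00  4.92
# Phi2 scores:  0.00  2.84  0.00 -2.84  0.00
# Phi3 scores:  0.00  0.00  0.00  0.00  0.00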
Now I am stuck with question 2. I am given the formula for the proportion of variance explained by the m-th principal component:
$\mathrm{PVE}_m = \dfrac{\sum_{i=1}^{n} z_{im}^2}{\sum_{j=1}^{p} \sum_{i=1}^{n} x_{ij}^2}$
But I am not sure how I can recreate this formula with an R command. Can anyone help me?

library(magrittr)  # provides the %>% pipe used below

# Here is the numerator: the sum of squared scores for each component
(Phi %*% A)^2 %>% rowSums()
[1] 48.4128 16.1312  0.0000
# Here is the denominator: the total sum of squares of the (already centred) data
sum(A^2)
[1] 64
# So the answer is:
(Phi %*% A)^2 %>% rowSums() / sum(A^2)
[1] 0.75645 0.25205 0.00000
We can verify with prcomp + summary:
summary(prcomp(t(A)))
Importance of components:
                         PC1  PC2 PC3
Standard deviation     3.464 2.00   0
Proportion of Variance 0.750 0.25   0
Cumulative Proportion  0.750 1.00   1
This is roughly the same since your $\Phi$ is rounded to two decimals.
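If you also want to check the question 1 scores against prcomp (a quick sketch; prcomp may flip the sign of individual components, so compare them up to sign):
# Scores from prcomp: rows are observations, columns are components
prcomp(t(A))$x
# Hand-computed scores, transposed so that rows are observations too
t(Z)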

Related

VECM in R: Testing weak exogeneity and imposing restrictions

I estimated a VECM and would like to run 4 separate tests of weak exogeneity, one for each variable.
library(urca)
library(vars)
data(Canada)
               e     prod       rw    U
1980 Q1 929.6105 405.3665 386.1361 7.53
1980 Q2 929.8040 404.6398 388.1358 7.70
1980 Q3 930.3184 403.8149 390.5401 7.47
1980 Q4 931.4277 404.2158 393.9638 7.27
1981 Q1 932.6620 405.0467 396.7647 7.37
1981 Q2 933.5509 404.4167 400.0217 7.13
...
jt = ca.jo(Canada, type = "trace", ecdet = "const", K = 2, spec = "transitory")
t = cajorls(jt, r = 1)
t$rlm$coefficients
                   e.d       prod.d        rw.d         U.d
ect1      -0.005972228  0.004658649 -0.10607044 -0.02190508
e.dl1      0.812608320 -0.063226620 -0.36178542 -0.60482042
prod.dl1   0.208945048  0.275454380 -0.08418285 -0.09031236
rw.dl1    -0.045040603  0.094392696 -0.05462048 -0.01443323
U.dl1      0.218358784 -0.538972799  0.24391761 -0.16978208
t$beta
                  ect1
e.l1        1.00000000
prod.l1     0.08536852
rw.l1      -0.14261822
U.l1        4.28476955
constant -967.81673980
I guess that my equations are the standard VECM equations, one per variable, each with its own loading alpha on the error-correction term ect1:
Delta e_t = alpha_e * ect1_{t-1} + ..., Delta prod_t = alpha_prod * ect1_{t-1} + ..., Delta rw_t = alpha_rw * ect1_{t-1} + ..., Delta U_t = alpha_U * ect1_{t-1} + ...
and I would like to test whether alpha_e, alpha_prod, alpha_rw and alpha_U are zero and impose the corresponding restrictions on my model. So, my question is: how can I do it?
I guess that my estimated alphas are:
               e.d       prod.d        rw.d         U.d
ect1  -0.005972228  0.004658649 -0.10607044 -0.02190508
I guess that I should use the alrtest function from the urca library:
alrtest(z = jt, A = A1, r = 1)
and probably my A matrix for alpha_e should be like this:
A1 = matrix(c(0, 0, 0,
              1, 0, 0,
              0, 1, 0,
              0, 0, 1),
            nrow = 4, ncol = 3, byrow = TRUE)
The results of the test:
jt1 = alrtest(z = jt, A = A1, r = 1)
summary(jt1)
The value of the likelihood ratio test statistic:
0.48 distributed as chi square with 1 df.
The p-value of the test statistic is: 0.49
Eigenvectors, normalised to first column
of the restricted VAR:
[,1]
RK.e.l1 1.0000
RK.prod.l1 0.1352
RK.rw.l1 -0.1937
RK.U.l1 3.9760
RK.constant -960.2126
Weights W of the restricted VAR:
[,1]
[1,] 0.0000
[2,] 0.0084
[3,] -0.1342
[4,] -0.0315
I guess this means that I can't reject my hypothesis of weak exogeneity of e (alpha_e = 0). And my new alphas here are: 0.0000, 0.0084, -0.1342, -0.0315.
Now the question is how can I impose this restriction on my VECM model?
If I do:
t1 = cajorls(jt1, r = 1)
t1$rlm$coefficients
e.d prod.d rw.d U.d
ect1 -0.005754775 0.007717881 -0.13282970 -0.02848404
e.dl1 0.830418381 -0.049601229 -0.30644063 -0.60236338
prod.dl1 0.207857861 0.272499006 -0.06742147 -0.08561076
rw.dl1 -0.037677197 0.102991919 -0.05986655 -0.02019326
U.dl1 0.231855899 -0.530897862 0.30720652 -0.16277775
t1$beta
ect1
e.l1 1.0000000
prod.l1 0.1351633
rw.l1 -0.1936612
U.l1 3.9759842
constant -960.2126150
the new model doesn't have 0.0000, 0.0084, -0.1342, -0.0315 for the alphas. It has -0.005754775, 0.007717881, -0.13282970, -0.02848404 instead.
How can I get a re-estimated model with alpha_e = 0? I want the re-estimated model with alpha_e = 0 because I would like to use it for predictions (vecm -> vec2var -> predict, but vec2var doesn't accept jt1 directly). And in general, are the calculations I made correct or not?
Just for illustration, in EViews imposing a restriction on alpha looks like this (screenshot omitted; not from this example).
If you have 1 cointegrating relationship (r = 1), as you do in t = cajorls(jt, r = 1), your loading matrix cannot have 4 rows and 3 columns:
A1 = matrix(c(0, 0, 0,
              1, 0, 0,
              0, 1, 0,
              0, 0, 1),
            nrow = 4, ncol = 3, byrow = TRUE)
The matrix A can only have 4 rows and 1 column if you have 4 variables and 1 cointegrating relationship.
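For reference, the unrestricted forecasting pipeline the question mentions (a sketch of vecm -> vec2var -> predict; as noted in the question, the restricted object jt1 returned by alrtest cannot be passed to vec2var directly):
library(vars)
# Convert the unrestricted VECM with r = 1 to its level-VAR representation
v <- vec2var(jt, r = 1)
# Forecast, for example 8 quarters ahead
predict(v, n.ahead = 8)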

R solve.QP tracking error minimization constraints inconsistent

I am struggling with solve.QP to get a solution that minimizes tracking error. I have a benchmark consisting of 6 assets (asset_a to asset_f). For my portfolio I have upper and lower bounds (I cannot have a position in asset_f). The covariance matrix is also given. I want the portfolio weights for the 6 assets that minimize tracking error vs the benchmark (with the position in asset_f equal to zero).
benchmark:
asset_a: 0.3
asset_b: 0.3
asset_c: 0.1
asset_d: 0.1
asset_e: 0.1
asset_f: 0.1
lowerbounds:
asset_a: 0.166
asset_b: 0.133
asset_c: 0.037
asset_d: 0.035
asset_e: 0.039
asset_f: 0
upperbounds:
asset_a: 1
asset_b: 1
asset_c: 1
asset_d: 1
asset_e: 1
asset_f: 0
benchmark weights and bounds:
test.benchmark_weights = c(0.3, 0.3, 0.1, 0.1, 0.1, 0.1)
test.lowerbound = c(0.166, 0.133, 0.037, 0.035, 0.039,0)
test.upperbound = c(1, 1, 1, 1, 1, 0)
cov matrix (test.dmat):
test.dmat = matrix(c(0.0119127162, 0.010862842, 0.010266683, 0.0009550136, 0.008242322, 0.00964462,
                     0.0108628421, 0.010603072, 0.009872992, 0.0011019412, 0.007422522, 0.0092528873,
                     0.0102666826, 0.009872992, 0.010487808, 0.0012107665, 0.006489204, 0.0096216627,
                     0.0009550136, 0.001101941, 0.001210766, 0.0115527788, 0.001181745, 0.0008387247,
                     0.0082423222, 0.007422522, 0.006489204, 0.0011817453, 0.012920482, 0.005973886,
                     0.00964462, 0.009252887, 0.009621663, 0.0008387247, 0.005973886, 0.0089904809),
                   nrow = 6, ncol = 6)
dvec (test.dvec):
test.dvec = matrix(c(0, 0, 0, 0, 0, 0), nrow=6, ncol=1)
Amat constraints matrix (test.amat):
test.amat = matrix(c(1,1,1,1,1,1, 1,1,1,1,1,0, -1,0,0,0,0,0, 0,-1,0,0,0,0, 0,0,-1,0,0,0, 0,0,0,-1,0,0, 0,0,0,0,-1,0, 0,0,0,0,0,-1, 1,0,0,0,0,0, 0,1,0,0,0,0, 0,0,1,0,0,0, 0,0,0,1,0,0, 0,0,0,0,1,0, 0,0,0,0,0,0, -1,0,0,0,0,0, 0,-1,0,0,0,0, 0,0,-1,0,0,0, 0,0,0,-1,0,0, 0,0,0,0,-1,0, 0,0,0,0,0,0), nrow=6, ncol=20)
bvec (test.bvec)
test.bvec =cbind(0, 1, t(test.benchmark_weights), t(test.lowerbound), -t(test.upperbound)) %>% as.matrix()
then running the solver
solve.QP(as.matrix(test.dmat), test.dvec, test.amat, test.bvec)
gives me
constraints are inconsistent, no solution!
It seems there is something wrong with your Amat and bvec: you don't need to pass both "sum of the weights of the first 5 assets equals 1" and "sum of all 6 assets equals 1", and the benchmark weights are not constraints here; only the bounds are:
library(quadprog)

N = 6L
test.dvec = rep(0, N)
# One equality constraint (weights sum to 1), then the lower bounds, then the upper bounds
test.amat = cbind(
  rep(1, N),    # sum(w) = 1
  diag(1, N),   # w >= lowerbound
  diag(-1, N))  # -w >= -upperbound
test.bvec = c(1, test.lowerbound, -test.upperbound)
res = solve.QP(test.dmat, test.dvec, test.amat, test.bvec, meq = 1L)
round(res$solution, 2)
#[1] 0.17 0.13 0.10 0.44 0.17 0.00
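As a side note (not part of the original answer): with dvec equal to zero the problem above minimizes absolute portfolio variance. If the objective is tracking error against the benchmark, i.e. minimizing (w - b)' D (w - b), the linear term is derived from the benchmark weights, since that objective equals 1/2 w' D w - (D b)' w plus a constant. A sketch under that reading, reusing the objects defined above:
# Tracking-error objective: dvec = Dmat %*% benchmark weights
test.dvec_te = drop(test.dmat %*% test.benchmark_weights)
res_te = solve.QP(test.dmat, test.dvec_te, test.amat, test.bvec, meq = 1L)
round(res_te$solution, 2)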

R, use binomial distribution with more than two possibilities

I know this is probably elementary, but I seem to have a mental block. Let's say you want to calculate the probability of tossing a 4, 5, or 6 on a roll of one die. In R, it's easy enough:
sum(1/6, 1/6, 1/6)
This gives 1/2 which is the correct answer. However, I have in the back of my mind (where it possibly should remain) that I should be able to use the binomial distribution for this. I've tried various combinations of arguments for pbinom and dbinom, but I can't get the right answer.
With coin tosses, it works fine. Is it entirely inappropriate for situations where there are more than two possible outcomes? (I'm a programmer, not a statistician, so I'm expecting to get killed by the stat guys here.)
Question: How can I use pbinom() or dbinom() to calculate the probability of throwing a 4, 5, or 6 with one roll of a die? I'm familiar with the prob and dice packages, but I really want to use one of the built-in distributions.
Thanks.
As @Alex mentioned above, dice-throwing can be represented in terms of multinomial probabilities. The probability of rolling a 4, for example, is
dmultinom(c(0, 0, 0, 1, 0, 0), size = 1, prob = rep(1/6, 6))
# [1] 0.1666667
and the probability of rolling a 4, 5, or 6 is
X <- cbind(matrix(rep(0, 9), ncol = 3), diag(1, 3))
X
# [,1] [,2] [,3] [,4] [,5] [,6]
# [1,] 0 0 0 1 0 0
# [2,] 0 0 0 0 1 0
# [3,] 0 0 0 0 0 1
sum(apply(X, MARGIN = 1, dmultinom, size = 1, prob = rep(1/6, 6)))
# [1] 0.5
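As an aside (not from the original answers): if you specifically want one of the built-in binomial functions, you can collapse the die roll into a single success/failure trial, where "success" means rolling a 4, 5 or 6 and therefore has probability 3/6:
# One Bernoulli trial with success probability 3/6
dbinom(1, size = 1, prob = 3/6)
# [1] 0.5
# Equivalently, P(X > 0) for the same trial
pbinom(0, size = 1, prob = 3/6, lower.tail = FALSE)
# [1] 0.5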
Though it's not quite obvious, this can be done with pmultinom, implemented either in my pmultinom package on CRAN or this other pmultinom package on Github.
You can conceptualize it as the event that the roll is not a 1, 2 or 3. Then you write this probability as
P(X_1 ≤ 0, X_2 ≤ 0, X_3 ≤ 0, X_4 ≤ ∞, X_5 ≤ ∞, X_6 ≤ ∞)
where X_i is the number of occurrences of side i. All the X's together have a multinomial distribution, with a size parameter of 1, and all probabilities equal to 1/6. This probability above can be calculated (using my package) as
library(pmultinom)
pmultinom(upper = c(0, 0, 0, Inf, Inf, Inf), size = 1,
          probs = c(1/6, 1/6, 1/6, 1/6, 1/6, 1/6), method = "exact")
# [1] 0.5
Though it's a bit of an awkward reformulation, I like it because I prefer to use a "p" function rather than take a sum of "d" functions.

Generating random variables with specific correlation threshold value

I am generating random variables with a specified range and dimension. I have written the following code for this.
generateRandom <- function(size, scale) {
  result <- round(runif(size, 1, scale), 1)
  return(result)
}

flag = TRUE
x <- generateRandom(300, 6)
y <- generateRandom(300, 6)
while (flag) {
  corrXY <- cor(x, y)
  if (corrXY >= 0.2) {
    flag = FALSE
  } else {
    x <- generateRandom(300, 6)
    y <- generateRandom(300, 6)
  }
}
I want 6 such variables of size 300, all on a scale of 1 to 6 except for one variable which should be on a scale of 1 to 7, with the following correlation structure among them:
1.00  0.45 -0.35  0.46  0.25  0.30
      1.00  0.25  0.29  0.50 -0.30
            1.00 -0.30  0.10  0.40
                  1.00  0.40  0.60
                        1.00 -0.40
                              1.00
But when I try to increase the threshold value my program gets very slow. Moreover, I want more than 7 variables of size 300, and between each pair of those variables I want some specific correlation threshold. How can I do this efficiently?
This answer is directly inspired by here and there.
We would like to generate 300 samples of a 6-variate uniform distribution with correlation structure equal to
Rhos <- matrix(0, 6, 6)
Rhos[lower.tri(Rhos)] <- c(0.450, -0.35, 0.46, 0.25, 0.3,
                           0.25, 0.29, 0.5, -0.3, -0.3,
                           0.1, 0.4, 0.4, 0.6, -0.4)
Rhos <- Rhos + t(Rhos)
diag(Rhos) <- 1
From this correlation structure we first derive the correlation structure of the Gaussian copula:
Copucov <- 2 * sin(Rhos * pi/6)
This matrix is not positive definite, so we use the nearest positive definite matrix instead:
library(Matrix)
Copucov <- cov2cor(nearPD(Copucov)$mat)
This correlation structure can be used as one of the inputs of MASS::mvrnorm:
library(MASS)
G <- mvrnorm(n = 300, mu = rep(0, 6), Sigma = Copucov, empirical = TRUE)
We then transform G into a multivariate uniform sample whose values range from 1 to 6, except for the last variable which ranges from 1 to 7:
U <- matrix(NA, 300, 6)
U[, 1:5] <- 5 * pnorm(G[, 1:5]) + 1
U[, 6] <- 6 * pnorm(G[, 6]) + 1
After rounding (and taking the nearest positive definite matrix to the copula's covariance matrix, etc.), the correlation structure is not changed much:
Ur <- round(U, 1)
cor(Ur)
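As a quick sanity check (not part of the original answer), you can compare the realized correlations of the rounded sample with the target matrix; because of the nearPD adjustment and the rounding, the match is approximate rather than exact:
# Largest absolute deviation from the target correlation structure
max(abs(cor(Ur) - Rhos))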

get.basis() in lpSolveAPI

I am confused by the return value of the function get.basis(). For example:
library(lpSolveAPI)
lprec <- make.lp(0, 4)
set.objfn(lprec, c(1, 3, 6.24, 0.1))
add.constraint(lprec, c(0, 78.26, 0, 2.9), ">=", 92.3)
add.constraint(lprec, c(0.24, 0, 11.31, 0), "<=", 14.8)
add.constraint(lprec, c(12.68, 0, 0.08, 0.9), ">=", 4)
set.bounds(lprec, lower = c(28.6, 18), columns = c(1, 4))
set.bounds(lprec, upper = 48.98, columns = 4)
RowNames <- c("THISROW", "THATROW", "LASTROW")
ColNames <- c("COLONE", "COLTWO", "COLTHREE", "COLFOUR")
dimnames(lprec) <- list(RowNames, ColNames)
solve(lprec)
Then the basic variables are
> get.basis(lprec)
[1] -7 -2 -3
However, the solution is
> get.variables(lprec)
[1] 28.60000 0.00000 0.00000 31.82759
From the solution, it seems variable 1 and variable 4 are basic. So where does the vector (-7, -2, -3) come from?
I am guessing it has to do with the 3 constraints and 4 decision variables.
After I reviewed the simplex method for bounded variables, I finally understood how this happens. These two links are helpful: Example; Video.
Coming back to this problem: lpSolveAPI (the R interface for lp_solve) rewrites the constraints by adding appropriate slack variables, one per constraint. The first three columns of the resulting structure correspond to the slack variables and columns 4 to 7 to the decision variables COLONE to COLFOUR. Hence the return value of get.basis(), which is -7, -2, -3, refers to columns 7, 2 and 3, i.e. decision variable 4, slack variable 2 and slack variable 3.
With respect to this kind of LP with bounded variables, a variable can be nonbasic at either its lower or its upper bound. The return of get.basis(lprec, nonbasic = TRUE) is -1, -4, -5, -6. The minus sign means these variables are at their lower bound: slack variable 1 = 0, and columns 4, 5 and 6 (decision variables 1, 2 and 3) are at their lower bounds 28.6, 0 and 0.
Thus, the optimal solution is 28.6 (nonbasic), 0 (nonbasic), 0 (nonbasic), 31.82 (basic).
