Predict() new data into PCA space in R

After performing a principal component analysis of a first data set (a), I projected a second data set (b) into PCA space of the first data set.
From this, I want to extract the variable loadings for the projected analysis of (b). Variable loadings of the PCA of (a) are returned by prcomp(). How can I retrieve the variable loadings of (b), projected into PCA space of (a)?
# set seed and define variables
set.seed(1)
a = replicate(10, rnorm(10))
b = replicate (10, rnorm(10))
# pca of data A and project B into PCA space of A
pca.a = prcomp(a)
project.b = predict(pca.a, b)
# variable loadings
loads.a = pca.a$rotation

Here's an annotated version of your code to make it clear what is happening at each step. First, the original PCA is performed on matrix a:
pca.a = prcomp(a)
This calculates the loadings for each principal component (PC). At the next step, these loadings together with a new data set, b, are used to calculate PC scores:
project.b = predict(pca.a, b)
So, the loadings are the same, but the PC scores are different. If we look at project.b, we see that each column corresponds to a PC:
PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
[1,] -0.2922447 0.10253581 0.55873366 1.3168437 1.93686163 0.998935945 2.14832483 -1.43922296
[2,] 0.1855480 -0.97631967 -0.06419207 0.6375200 -1.63994127 0.110028191 -0.27612541 -0.37640710
[3,] -1.5924242 0.31368878 -0.63199409 -0.2535251 0.59116005 0.214116915 1.20873962 -0.64494388
[4,] 1.2117977 0.29213928 1.53928110 -0.7755299 0.16586295 0.030802395 0.63225374 -1.72053189
[5,] 0.5637298 0.13836395 -1.41236348 0.2931681 -0.64187233 1.035226594 0.67933996 -1.05234872
[6,] 0.2874210 1.18573157 0.04358772 -1.1941734 -0.04399808 -0.113752847 -0.33507195 -1.34592414
[7,] 0.5629731 -1.02835365 0.36218131 1.4117908 -0.96923175 -1.213684882 0.02221423 1.14483112
[8,] 1.2854406 0.09373952 -1.46038333 0.6885674 0.39455369 0.756654205 1.97699073 -1.17281174
[9,] 0.8573656 0.07810452 -0.06576772 -0.5200661 0.22985518 0.007571489 2.29289637 -0.79979214
[10,] 0.1650144 -0.50060018 -0.14882996 0.2065622 2.79581428 0.813803739 0.71632238 0.09845912
PC9 PC10
[1,] -0.19795112 0.7914249
[2,] 1.09531789 0.4595785
[3,] -1.50564724 0.2509829
[4,] 0.05073079 0.6066653
[5,] -1.62126318 0.1959087
[6,] 0.14899277 2.9140809
[7,] 1.81473300 0.0617095
[8,] 1.47422298 0.6670124
[9,] -0.53998583 0.7051178
[10,] 0.80919039 1.5207123
Hopefully, that makes sense, but I'm yet to finish my first coffee of the day, so no guarantees.
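To make the "same loadings, different scores" point concrete, here is a minimal sketch that reproduces predict() by hand. It assumes the PCA was fitted with the prcomp() defaults used in the question (centering on, scaling off); the name manual.b is only illustrative.

```r
set.seed(1)
a <- replicate(10, rnorm(10))
b <- replicate(10, rnorm(10))

pca.a <- prcomp(a)               # loadings are estimated from a only
project.b <- predict(pca.a, b)   # scores of b in the PC space of a

# Centre b with the column means of a, then rotate with the loadings of a.
# No new loadings are computed for b at any point.
manual.b <- scale(b, center = pca.a$center, scale = FALSE) %*% pca.a$rotation

all.equal(project.b, manual.b, check.attributes = FALSE)
```

So the only loadings involved in the projection of b are those in pca.a$rotation.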

Related

Incorrect result when multiplying matrices in R?

I'm getting some weird results when multiplying these two matrices in R:
> matrix3
[1,] 3.19747172 -2.806e-05 -0.00579284 -0.00948720 -0.01054026 0.17575719
[2,] -0.00002806 2.000e-08 0.00000057 0.00000006 -0.00000009 -0.00000358
[3,] -0.00579284 5.700e-07 0.00054269 0.00001793 -0.00002686 -0.00310465
[4,] -0.00948720 6.000e-08 0.00001793 0.00003089 0.00002527 -0.00066290
[5,] -0.01054026 -9.000e-08 -0.00002686 0.00002527 0.00023776 -0.00100898
[6,] 0.17575719 -3.580e-06 -0.00310465 -0.00066290 -0.00100898 0.03725362
> matrix4
[,1]
x0 2428.711
x1 1115178.561
x2 74411.013
x3 925700.445
x4 74727.396
x5 13342.182
> matrix3%*%matrix4
[,1]
[1,] 78.4244581753
[2,] -0.0023802299
[3,] 0.1164568885
[4,] -0.0018504732
[5,] -0.0006493249
[6,] -0.1497822396
The thing is that if you multiply these two matrices in Excel you get:
78.4824494081686
-0.0000419022486847151
0.112430295996347
-0.000379343461780479
0.000340414687578061
-0.14454024116344
Using online matrix calculators I also got Excel's result.
I would love your help in understanding how to get the same result in R.
The problem was caused by the function inv() from library(matlib).
matrix3 is the result of inverting a matrix with inv().
When I used solve() to invert the matrix instead and then continued as before, I got the correct result, though I am not sure why.
Perhaps there is some kind of rounding in the inv() function.
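The rounding guess can be illustrated with base R alone. The sketch below does not call matlib. It assumes, purely for illustration, that an inverse gets rounded to 8 decimal places (the printed matrix3 looks consistent with that) and uses a small made-up matrix A and vector v rather than the data from the question.

```r
# Hypothetical, badly scaled symmetric matrix and a vector with large entries
A <- matrix(c(4, 2e4, 2e4, 1.8e8), nrow = 2)
v <- c(1, 1e6)

A_inv <- solve(A)                  # inverse at full double precision
A_inv_rounded <- round(A_inv, 8)   # assumed rounding, for illustration only

exact <- A_inv %*% v
rounded <- A_inv_rounded %*% v

# Tiny entries of the inverse lose most of their digits when rounded,
# and multiplying by large values amplifies that loss.
cbind(exact = exact[, 1], rounded = rounded[, 1])

# Solving the linear system directly avoids forming the inverse at all
solve(A, v)
```

If only the product is needed, solve(A, v) is usually the safer choice compared with inverting first and multiplying afterwards.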

Generate n-dim random samples based on empirical distribution and copula

I am given an empirical distribution FXemp of a real-valued random variable X. Now suppose X1, ..., Xn have the same distribution as X, with dependencies given by a copula C. I would like to produce real-valued random samples of X1, ..., Xn.
E.g. I am given a vector of samples and the corresponding cdf
x <- rnorm(1000)
df <- ecdf(x)
Assume that I pick, for example, a Student t or Clayton copula C. How can I produce random samples of, say, 10 copies of x whose dependency is determined by C?
Is there an easy way?
Or are there any packages that can be used here?
You can sample from the copula (with uniform margins) by using the copula package, and then apply the inverse ecdf to each component:
library(copula)
x <- rnorm(100) # sample of X
d <- 5 # desired number of copies
copula <- claytonCopula(param = 2, dim = d)
nsims <- 25 # number of simulations
U <- rCopula(nsims, copula) # sample from the copula (with uniform margins)
# now sample the copies of X ####
Xs <- matrix(NA_real_, nrow = nsims, ncol = d)
for (i in 1:d) {
  Xs[, i] <- quantile(x, probs = U[, i], type = 1) # type = 1 is the inverse ecdf
}
Xs
# [,1] [,2] [,3] [,4] [,5]
# [1,] -0.5692185 -0.9254869 -0.6821624 -1.2148041 -0.682162391
# [2,] -0.4680407 -0.4263257 -0.3456553 -0.6132320 -0.925486872
# [3,] -1.1322063 -1.2148041 -0.8115089 -1.0074435 -1.430405604
# [4,] 0.9760268 1.2600186 1.0731551 1.2369623 0.835024471
# [5,] -1.1280825 -0.8995429 -0.5761037 -0.8115089 -0.543125426
# [6,] -0.1848303 -1.2148041 -0.5692185 0.8974921 -0.613232036
# [7,] -0.5692185 -0.3070884 -0.8995429 -0.8115089 -0.007292346
# [8,] 0.1696306 0.4072428 0.7646646 0.4910863 1.236962330
# [9,] -0.7908557 -1.1280825 -1.2970952 0.3655081 -0.633521404
# [10,] -1.3226053 -1.0074435 -1.6857615 -1.3226053 -1.685761474
# [11,] -2.5410325 -2.3604936 -2.3604936 -2.3604936 -2.360493569
# [12,] -2.3604936 -2.2530003 -1.9311289 -2.2956444 -2.360493569
# [13,] 0.4072428 -0.2150035 -0.3564803 -0.1051930 -0.166434458
# [14,] -0.4680407 -1.0729763 -0.6335214 -0.8995429 -0.899542914
# [15,] -0.9143225 -0.1522242 0.4053462 -1.0729763 -0.158375658
# [16,] -0.4998761 -0.7908557 -0.9813504 -0.1763604 -0.283013334
# [17,] -1.2148041 -0.9143225 -0.5176347 -0.9143225 -1.007443492
# [18,] -0.2150035 0.5675260 0.5214050 0.8310799 0.464151265
# [19,] -1.2148041 -0.6132320 -1.2970952 -1.1685962 -1.132206305
# [20,] 1.4456635 1.0444720 0.7850181 1.0742214 0.785018119
# [21,] 0.3172811 1.2369623 -0.1664345 0.9440006 1.260018624
# [22,] 0.5017980 1.4068250 1.9950305 1.2600186 0.976026807
# [23,] 0.5675260 -1.0729763 -1.2970952 -0.3653535 -0.426325703
# [24,] -2.5410325 -2.2956444 -2.3604936 -2.2956444 -2.253000326
# [25,] 0.4053462 -0.5431254 -0.5431254 0.8350245 0.950891450
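The step that turns uniform margins into draws from the empirical distribution can be checked in base R. The following is a small sketch in which p stands in for one column of the copula sample U; the seed and the names q_builtin and q_manual are arbitrary choices for illustration.

```r
set.seed(123)      # arbitrary seed, only for reproducibility
x <- rnorm(100)    # sample of X
p <- runif(25)     # stand-in for one column of U
n <- length(x)

# Inverse ecdf via quantile(): always returns one of the observed values
q_builtin <- quantile(x, probs = p, type = 1, names = FALSE)

# The same thing by hand: the ceiling(n * p)-th order statistic
q_manual <- sort(x)[ceiling(n * p)]

all.equal(q_builtin, q_manual)
```

Because type = 1 only ever returns observed values, the simulated copies contain repeated values from x, which is visible in the output above.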

How to calculate "terms" from predict-function manually when regression has an interaction term

Does anyone know how the predict function calculates terms when there is an interaction term in a regression model? I know how to compute the terms when the regression has no interaction terms, but when I add one I can no longer reproduce them manually. Here is some example data; I would like to see how to calculate those values manually. Thanks! -Aleksi
set.seed(2)
a <- c(4,3,2,5,3) # first I make some data
b <- c(2,1,4,3,5)
e <- rnorm(5)
y= 0.6*a+e
data <- data.frame(a,b,y)
model1 <- lm(y~a*b,data=data) # regression
predict(model1,type='terms',data) # terms
#This gives the result:
a b a:b
1 0.04870807 -0.3649011 0.2049069
2 -0.03247205 -0.7298021 0.7740928
3 -0.11365216 0.3649011 0.2049069
4 0.12988818 0.0000000 -0.5919534
5 -0.03247205 0.7298021 -0.5919534
attr(,"constant")
[1] 1.973031
Your model is technically y = b0 + b1*a + b2*b + b3*a*b + e, because a*b in an R formula expands to a + b + a:b. The term for a is calculated by multiplying the independent variable by its coefficient and centering the result. So for example, terms for a would be
cf <- coef(model1)
scale(a * cf[2], scale = FALSE)
[,1]
[1,] 0.04870807
[2,] -0.03247205
[3,] -0.11365216
[4,] 0.12988818
[5,] -0.03247205
which matches your output above.
And since the interaction term is nothing more than the product of the independent variables, this translates to
scale(a * b * cf[4], scale = FALSE)
[,1]
[1,] 0.2049069
[2,] 0.7740928
[3,] 0.2049069
[4,] -0.5919534
[5,] -0.5919534
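The same recipe can be applied to all terms at once, including the "constant" attribute. This is a sketch that assumes each term corresponds to exactly one column of the model matrix, which holds for this model with numeric predictors; the names X, manual and constant are illustrative.

```r
set.seed(2)
a <- c(4, 3, 2, 5, 3)
b <- c(2, 1, 4, 3, 5)
y <- 0.6 * a + rnorm(5)
dat <- data.frame(a, b, y)

model1 <- lm(y ~ a * b, data = dat)
cf <- coef(model1)

# Model matrix without the intercept: columns a, b and a:b (which is a * b)
X <- model.matrix(model1)[, -1]

# Centre every column, then multiply each column by its coefficient
manual <- sweep(sweep(X, 2, colMeans(X)), 2, cf[-1], `*`)

# The constant is the prediction at the column means
constant <- cf[1] + sum(colMeans(X) * cf[-1])

tm <- predict(model1, type = "terms")
all.equal(unname(manual), unname(tm), check.attributes = FALSE)
all.equal(unname(constant), attr(tm, "constant"))
```

Adding the row sums of the terms to the constant gives back the fitted values.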

Extracting gap statistic info to identify K for Kmeans clustering

I was looking at the 'cluster' library which has the function 'clusGap' to extract the number of clusters for Kmeans clustering.
This is the code:
# Compute Gap statistic (http://web.stanford.edu/~hastie/Papers/gap.pdf)
computeGapStatistic <- function(data) {
  gap <<- clusGap(shift_len_avg_data, FUN = kmeans, K.max = 8, B = 3)
  if (ENABLE_PLOTS) {
    plot(gap, main = "Gap statistic for the Nursing shift data")
  }
  print(gap)
  return(gap)
}
Which gives me the following output when 'gap' is printed out:
> print(gap)
Clustering Gap statistic ["clusGap"].
B=3 simulated reference sets, k = 1..8
--> Number of clusters (method 'firstSEmax', SE.factor=1): 2
logW E.logW gap SE.sim
[1,] 8.702334 9.238385 0.53605067 0.007945542
[2,] 7.940133 8.544323 0.60418996 0.003790244
[3,] 7.772673 8.139836 0.36716303 0.005755805
[4,] 7.325798 7.849233 0.52343473 0.002732731
[5,] 7.233667 7.629954 0.39628748 0.003496058
[6,] 7.020220 7.439709 0.41948820 0.006451708
[7,] 6.707678 7.285907 0.57822872 0.002810682
[8,] 7.166932 7.150724 -0.01620749 0.004274151
and this is what the plot looks like:
[plot: Gap statistic for the Nursing shift data]
Question:
How do I extract the number of clusters from the 'gap' variable? 'gap' seems to be a list. From the above output it seems to have found 2 clusters.
I figured this out on my own. This is what I used: with(gap, maxSE(Tab[, "gap"], Tab[, "SE.sim"]))
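For reference, here is a self-contained sketch of that extraction on made-up data with two well-separated groups, since the nursing shift data is not available here. The data, the seed and the values of K.max, B and nstart are arbitrary; method = "firstSEmax" and SE.factor = 1 are the defaults reported in the printed output above.

```r
library(cluster)

set.seed(42)  # arbitrary seed
# Hypothetical data: two groups of 50 points in two dimensions
x <- rbind(matrix(rnorm(100, mean = 0, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 3, sd = 0.3), ncol = 2))

gap <- clusGap(x, FUNcluster = kmeans, K.max = 5, B = 20,
               nstart = 10, verbose = FALSE)

# gap$Tab is a matrix with columns logW, E.logW, gap and SE.sim
k <- maxSE(gap$Tab[, "gap"], gap$Tab[, "SE.sim"],
           method = "firstSEmax", SE.factor = 1)
k
```

Changing the method argument of maxSE() applies a different selection rule to the same table without recomputing the gap statistic.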

Reuse a HoltWinters model using new data

I'm trying to reuse a HoltWinters model previously generated in R. I have found a related entry here, but it does not seem to work with HoltWinters. Basically I have tried something like this:
myModel<-HoltWinters(ts(myData),gamma=FALSE)
predict(myModel,n.ahead=10)
#time to change the data
predict(myModel,n.ahead=10,newdata=myNewData)
When I try to predict using the new data I get the same prediction.
I would appreciate any suggestion.
You can use update:
mdl <- HoltWinters(EuStockMarkets[,"FTSE"],gamma=FALSE)
predict(mdl,n.ahead=10)
Time Series:
Start = c(1998, 170)
End = c(1998, 179)
Frequency = 260
fit
[1,] 5451.093
[2,] 5447.186
[3,] 5443.279
[4,] 5439.373
[5,] 5435.466
[6,] 5431.559
[7,] 5427.652
[8,] 5423.745
[9,] 5419.838
[10,] 5415.932
predict(update(mdl,x=EuStockMarkets[,"CAC"]),n.ahead=10)
Time Series:
Start = c(1998, 170)
End = c(1998, 179)
Frequency = 260
fit
[1,] 3995.127
[2,] 3995.253
[3,] 3995.380
[4,] 3995.506
[5,] 3995.633
[6,] 3995.759
[7,] 3995.886
[8,] 3996.013
[9,] 3996.139
[10,] 3996.266
predict.HoltWinters doesn't have a newdata argument, which is why the data doesn't get replaced. This is because the prediction doesn't require any data: it is described entirely by the coefficients component of the model.
m <- HoltWinters(co2)
m$coefficients #These values describe the model completely;
#adding new data makes no difference
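One point worth checking is that update() re-runs the original call, so the smoothing parameters are estimated again on the new series. If the aim is to keep the parameters from the first fit, they can be passed in explicitly. The following is a sketch of both variants; the names refit and reused are illustrative.

```r
mdl <- HoltWinters(EuStockMarkets[, "FTSE"], gamma = FALSE)

# Variant 1: refit on the new series, alpha and beta are estimated again
refit <- update(mdl, x = EuStockMarkets[, "CAC"])

# Variant 2: keep alpha and beta from the first model and only
# recompute the level and trend coefficients on the new series
reused <- HoltWinters(EuStockMarkets[, "CAC"],
                      alpha = mdl$alpha, beta = mdl$beta, gamma = FALSE)

c(original = unname(mdl$alpha),
  refit    = unname(refit$alpha),
  reused   = unname(reused$alpha))

predict(reused, n.ahead = 10)
```

Which variant is appropriate depends on whether the two series are expected to share the same smoothing behaviour.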
