PCA - how are the principal components mapped? (R)

I have a problem with PCA in R, probably a simple one:
I have 10 vectors a, b, c, d, e, f, g, h, i, j and bind them with cbind.
With the result I perform a PCA using prcomp. I get the scores all right, and I also get the principal components in descending order.
Only: how on earth do I know which of the original variables a to j goes into the first component, which into the second, and so on?
Probably really a beginner's question - I still cannot solve it and would appreciate some help.

The rotation matrix tells you which original variables matter in each principal component. For example, the first column of the rotation matrix holds the coefficients (loadings) for PC1. A large value in the first row (relative to the other coefficients) means that the first original variable is important in the first principal component. Say the first column has large positive values for the first five rows and large negative values for the last five rows: the PC1 axis can then be interpreted as a contrast between those two groups of variables.
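As a minimal sketch (the vectors a to j below are random placeholders standing in for your own data), the row names of the rotation matrix are the original variable names, so each column shows how strongly each variable loads on that component:
set.seed(42)
a <- rnorm(100); b <- rnorm(100); c <- rnorm(100); d <- rnorm(100); e <- rnorm(100)
f <- rnorm(100); g <- rnorm(100); h <- rnorm(100); i <- rnorm(100); j <- rnorm(100)
X <- cbind(a, b, c, d, e, f, g, h, i, j)   # columns keep the vector names
pca <- prcomp(X, scale. = TRUE)
pca$rotation[, 1]        # loadings of a..j on PC1, named by original variable
round(pca$rotation, 2)   # full loading matrix: rows = a..j, columns = PC1..PC10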

It's an old question... but maybe someone will need it in the future.
library(stats)   # prcomp() is in the stats package (attached by default)
data(USArrests)
PCA.USA <- prcomp(USArrests[, c(1, 2, 4)], scale. = TRUE)
proporcionDeInfluencia <- abs(PCA.USA$rotation)   # absolute loadings
# Divide each column by its sum: the share of each variable in each PC
sweep(proporcionDeInfluencia, 2, colSums(proporcionDeInfluencia), "/")
More info in "Principal Components Analysis - how to get the contribution (%) of each parameter to a Prin.Comp.?".


Display the name of the corresponding PC when using prcomp for PCA in R

I use prcomp to run PCA in R. When I output the summary (standard deviation, proportion of variance, cumulative proportion), the results are always ordered and the actual column names are replaced by PC1, PC2, and so on. Thus, I cannot tell the exact proportion of variance for each column.
Can anyone show me, or give me a hint on, how to display the column names when outputting the summary results? (Two screenshots of the results were attached to the original question.)
It is not clear that you understand what principal components does. It reduces the dimensionality of the data. Assuming the rows are observations and the columns are variables, imagine plotting your rows in 35 dimensions (the columns). Most people have trouble visualizing more than 3 dimensions. Principal components creates a smaller set of axes that explains most of the variation in the data. The axes are orthogonal, meaning they are at right angles to one another. Your plot and the results of the summary(res.pca5) and plot(res.pca5) functions show that the first dimension explains 28% of the variation in the 35 variables. Adding a second dimension gives you almost 38%, and three gives you 44%. These new variables are combinations of your original variables, not the original variables themselves. The first two components explain more of the variability than any other combination.
For some reason you did not try res.pca5 as a command (or the equivalent print(res.pca5)), which would show you the coefficients that PCA used to create the components from the original variables, or biplot(res.pca5), which plots the rows and columns in the new two-dimensional space.
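For instance, a sketch on a built-in data set (res.pca5 from the question would be used the same way; the object name below is a placeholder):
res.pca <- prcomp(USArrests, scale. = TRUE)
summary(res.pca)   # sdev / proportion of variance per component (labelled PC1, PC2, ...)
res.pca            # same as print(res.pca): loadings of each original column on each PC
biplot(res.pca)    # observations and original variables in the space of PC1 and PC2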

PCA scores vs. Varimax-rotated PCA scores

I have performed PCA using prcomp in R on my data set of 75-76 indicator variables and 7232 companies, including NAs. Before applying the function, I centred my data but did not rescale them, because they are all indicator variables. (Is my reasoning correct?)
After that I varimax-rotated the loadings of the first 2 or 3 principal components, following the instructions by amoeba here.
Since I had centred, but not rescaled, my data, I changed the code to:
Varimax_results <- varimax(rawLoadings, normalize = FALSE)
invLoadings <- t(pracma::pinv(Varimax_results$loadings))   # pseudo-inverse of the rotated loadings
scores <- scale(DatosPCA, scale = FALSE) %*% invLoadings   # centre only, no rescaling
Now I am trying to figure out why the scores given by "prcomp" and the scores obtained using the code above are not the same.
I am probably missing some theoretical background, so I would be grateful if someone could tell me if the scores are supposed to be the same and, in that case, what I am doing wrong in my code. If they are not supposed to be the same, which ones should I use?
Thank you very much!
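For reference, a minimal sketch of the pipeline described above, under the assumptions that DatosPCA stands in for the (NA-free) indicator data, that k components are kept, and that rawLoadings is built as the prcomp loadings scaled by the component standard deviations, as in the approach the question refers to:
# Placeholder data standing in for the 7232 x ~75 indicator matrix.
DatosPCA <- matrix(rnorm(500 * 10), ncol = 10)
k <- 3
pca <- prcomp(DatosPCA, center = TRUE, scale. = FALSE)
rawLoadings <- pca$rotation[, 1:k] %*% diag(pca$sdev[1:k])   # loadings scaled by sdev
Varimax_results <- varimax(rawLoadings, normalize = FALSE)
invLoadings <- t(pracma::pinv(Varimax_results$loadings))
scores <- scale(DatosPCA, center = TRUE, scale = FALSE) %*% invLoadings
# Note: the varimax-rotated axes differ from the unrotated PCs, so these
# scores are not expected to coincide with pca$x[, 1:k].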

Cross-correlation for time series with weights?

Part 1: Previously I was using the ccf function in R to compute cross-correlations between two time series, ts_1 and ts_2. Now I want to compute the cross-correlation, but I have a vector of weights that is the same length as ts_1 and ts_2. From what I understand, the built-in ccf function I was using does not have an argument for weights. Does another function/package have this capability?
Part 2: On a more conceptual note, I understand that the Pearson cross-correlation cannot exceed 1 in absolute value. However, does this also hold true for weighted cross-correlation? If so, what does this mean / how do I interpret it?
Thank you in advance for your help!
You can use the wcc function from the ptw package to calculate cross-correlations. For example, the code below puts more weight on the tail of the series:
library(ptw)
data(gaschrom)
# Increasing weights along the series
wcc(gaschrom[1, ], gaschrom[2, ], trwdth = 20, wghts = seq_along(gaschrom[1, ]))
Output:
[1] 0.9997758
The wcc is a suitable measure of the similarity of two patterns when features may be shifted; identical patterns lead to a wcc value of 1.
This means that even if two patterns are identical but shifted along the time axis, the weighted cross-correlation will still give you a value close to 1 (strongly correlated).
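Regarding Part 2: a weighted Pearson correlation is still bounded by 1 in absolute value (Cauchy-Schwarz). A minimal sketch using stats::cov.wt, which computes an ordinary weighted correlation at a fixed alignment rather than the wcc similarity above; the series, weights, and shift here are made up for illustration:
set.seed(1)
n  <- 200
tt <- seq_len(n)
x  <- sin(tt / 10) + rnorm(n, sd = 0.2)        # a noisy sine wave
y  <- sin((tt - 5) / 10) + rnorm(n, sd = 0.2)  # a shifted, noisy copy
w  <- seq_len(n) / n                           # more weight on the tail of the series
cw <- cov.wt(cbind(x, y), wt = w, cor = TRUE)  # weighted covariance/correlation matrix
cw$cor[1, 2]                                   # stays within [-1, 1] for any non-negative weights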

Understanding `scale` in R

I'm trying to understand the definition of scale that R provides. I have data (mydata) that I want to make a heat map with, and there is a VERY strong positive skew. I've created a heatmap with a dendrogram for both scale(mydata) and log(mydata), and the dendrograms are different for the two. Why? What does it mean to scale my data versus log-transform my data? And which would be more appropriate if I want to look at the dendrogram illustrating the relationship between the columns of my data?
Thank you for any help! I've read the definitions but they are going right over my head.
log simply takes the logarithm (base e, by default) of each element of the vector.
scale, with default settings, will calculate the mean and standard deviation of the entire vector, then "scale" each element by subtracting the mean and dividing by the standard deviation. (If you use scale(x, scale=FALSE), it will only subtract the mean and not divide by the standard deviation.)
Note that these two give you the same values:
set.seed(1)
x <- runif(7)
# Manual standardization: subtract the mean, divide by the standard deviation
(x - mean(x)) / sd(x)
# scale() with default settings does the same (returned as a one-column matrix)
scale(x)
It does nothing other than standardize the data. The values it creates are known under several different names, one of them being z-scores ("Z" because the normal distribution is also known as the "Z distribution").
More can be found here:
http://en.wikipedia.org/wiki/Standard_score
This is a late addition, but I was looking for information on the scale function myself and thought it might help somebody else as well.
To modify the response from Ricardo Saporta a little bit:
When centering is turned off, scaling is not done using the standard deviation, at least not in version 3.6.1 of R; the columns are instead divided by their root mean square. I base this on the reference cited in ?scale ("Becker, R. (2018). The new S language. CRC Press.") and my own experimentation.
X <- runif(10)                                          # any numeric vector (placeholder)
X.man.scaled <- X / sqrt(sum(X^2) / (length(X) - 1))    # manual root-mean-square scaling
X.aut.scaled <- scale(X, center = FALSE)
The results of these two lines are exactly the same; I show it without centering for simplicity.
I would have responded in a comment but did not have enough reputation.
I thought I would contribute by providing a concrete example of the practical use of the scale function. Say you have three test scores (math, science, and English) that you want to compare. You may even want to generate a composite score based on the three tests for each observation. Your data could look like this:
# Ten students with raw scores on three differently scaled tests
student_id <- seq(1, 10)
math    <- c(502, 600, 412, 358, 495, 512, 410, 625, 573, 522)
science <- c(95, 99, 80, 82, 75, 85, 80, 95, 89, 86)
english <- c(25, 22, 18, 15, 20, 28, 15, 30, 27, 18)
df <- data.frame(student_id, math, science, english)
Obviously it would not make sense to compare the means of these three scores, as their scales are vastly different. By scaling them, however, you get more comparable scoring units:
z <- scale(df[, 2:4], center = TRUE, scale = TRUE)   # column-wise z-scores
You could then use these scaled results to create a composite score, for instance by averaging the values and assigning a grade based on the percentiles of this average (see the sketch below). Hope this helped!
Note: I borrowed this example from the book "R in Action". It's a great book! Would definitely recommend.
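As a small sketch of that last step (the quartile-to-grade mapping here is just an illustration, not taken from the book):
df$composite <- rowMeans(z)                      # average of the three z-scores
q <- quantile(df$composite, probs = seq(0, 1, 0.25))
df$grade <- cut(df$composite, breaks = q,        # quartile bins mapped to letter grades
                labels = c("D", "C", "B", "A"), include.lowest = TRUE)
df[order(-df$composite), c("student_id", "composite", "grade")]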
