Relationship between input variables and principal components in PCA - r

Here is the result of my PCA.
For RC1 and RC3 I can interpret which variables they are related to, but I cannot interpret RC2.
When I check the eigenvalues, the number of factors is 3.
But can there really be only two? Or which variables should be related to RC2?
There are 7 input variables, and I used the 'principal()' function.
names(mydata)
[1] "A" "B" "C" "D" "E" "F" "G"
> x<-cbind(A, B, C, D, E, F, G)
> e_value<-eigen(cor(x))
> e_value
eigen() decomposition
$values
[1] 2.3502254 1.4170606 1.2658360 0.8148231 0.5608698 0.3438629 0.2473222
$vectors
           [,1]        [,2]        [,3]        [,4]        [,5]         [,6]         [,7]
[1,]  0.2388621  0.46839043  0.37003850  0.47205027 -0.58802244 -0.133939151 -0.009233395
[2,]  0.1671739 -0.71097984 -0.14062597  0.25083439 -0.26726985 -0.502411130 -0.244983436
[3,]  0.2132841 -0.19677142  0.64662974  0.34508779  0.61416969 -0.003950736  0.036814153
[4,]  0.1697817 -0.24468987  0.55631886 -0.69016805 -0.34039757  0.039899816  0.089531675
[5,]  0.4857016  0.36681570 -0.09905329 -0.31456085  0.26225761 -0.344919726 -0.577088755
[6,] -0.5359245  0.20164924  0.17958243 -0.13144417  0.11755661 -0.748885304  0.218966481
[7,]  0.5635252  0.03619081 -0.27131854 -0.05105919  0.08439733 -0.219629096  0.741315659
> PCA<-principal(x,nfactors = 3, rotate = "varimax")
> print(PCA)
Principal Components Analysis
Call: principal(r = x, nfactors = 3, rotate = "varimax")
Standardized loadings (pattern matrix) based upon correlation matrix
RC1 RC2 RC3 h2 u2 com
A 0.24 0.69 0.29 0.62 0.38 1.6
B 0.25 -0.83 0.24 0.81 0.19 1.3
C 0.06 0.05 0.83 0.69 0.31 1.0
D 0.03 -0.04 0.74 0.54 0.46 1.0
E 0.76 0.42 -0.01 0.76 0.24 1.5
F -0.83 0.24 -0.17 0.77 0.23 1.3
G 0.92 -0.01 0.00 0.84 0.16 1.0
RC1 RC2 RC3
SS loadings 2.23 1.40 1.40
Proportion Var 0.32 0.20 0.20
Cumulative Var 0.32 0.52 0.72
Proportion Explained 0.44 0.28 0.28
Cumulative Proportion 0.44 0.72 1.00
Mean item complexity = 1.3
Test of the hypothesis that 3 components are sufficient.
The root mean square of the residuals (RMSR) is 0.11
with the empirical chi square 63.33 with prob < 1.1e-13
Fit based upon off diagonal values = 0.84

The relationship between the principal components and the 7 variables is right there in the output:
Standardized loadings (pattern matrix) based upon correlation matrix
RC1 RC2 RC3 h2 u2 com
A 0.24 0.69 0.29 0.62 0.38 1.6
B 0.25 -0.83 0.24 0.81 0.19 1.3
C 0.06 0.05 0.83 0.69 0.31 1.0
D 0.03 -0.04 0.74 0.54 0.46 1.0
E 0.76 0.42 -0.01 0.76 0.24 1.5
F -0.83 0.24 -0.17 0.77 0.23 1.3
G 0.92 -0.01 0.00 0.84 0.16 1.0
This matrix tells you which variables are correlated with which principal components. So you can see that A and B are strongly correlated with the second rotated component (labeled RC2).
PCA is a rotation of the data, so there are as many principal components as there are variables (7 here). But most people are interested in visualization, so generally only the first 2 or 3 principal components are kept for plotting.
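If it helps with interpretation, the psych print method can also sort the variables by component and hide small loadings (a minimal sketch; the 0.4 cut-off is just an assumption, adjust as needed):
# sort variables by their dominant component and suppress loadings below 0.4
print(PCA, cut = 0.4, sort = TRUE)
# or extract the loadings as a plain matrix for your own inspection
round(unclass(PCA$loadings), 2)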

Related

Exclude factor loadings from ID variable in order to create latent concept

I conducted a factor analysis and want to create the latent concepts (postmaterialism and materialism) from the correlated variables (see the fa output below). Later on I want to merge the data set I used for the fa with another data set, so I kept the ID variable in order to use it later as the key variable. My problem is that I need to exclude the ID variable from the factor loadings, because otherwise it distorts the score of the latent concept for each individual. I tried different commands like:
!("ID"), with = FALSE, - ("ID"), with = FALSE, setdiff(names(expl_fa2),("ID")), with = FALSE
but nothing worked.
This is my code for the latent variables:
data_fa_1 <- data_fa_1 %>% mutate(postmat = expl_fa2$scores[,1], mat = expl_fa2$scores[,2])
And this is the output from the factor analysis:
Standardized loadings (pattern matrix) based upon correlation matrix
MR1 MR2 h2 u2 com
import_of_new_ideas 0.48 0.06 0.233 0.77 1.0
import_of_safety 0.06 0.61 0.375 0.63 1.0
import_of_trying_things 0.66 0.03 0.435 0.57 1.0
import_of_obedience 0.01 0.49 0.240 0.76 1.0
import_of_modesty 0.01 0.44 0.197 0.80 1.0
import_of_good_time 0.62 0.01 0.382 0.62 1.0
import_of_freedom 0.43 0.16 0.208 0.79 1.3
import_of_strong_gov 0.15 0.57 0.350 0.65 1.1
import_of_adventures 0.64 -0.15 0.427 0.57 1.1
import_of_well_behav 0.03 0.64 0.412 0.59 1.0
import_of_traditions 0.03 0.50 0.253 0.75 1.0
import_of_fun 0.67 0.03 0.449 0.55 1.0
ID 0.07 0.04 0.007 0.99 1.7
Can anyone help me with the command I need to use in order to exclude the factor loadings from the ID variable (see output fa) from the creation of the latent variables "postmat" and "mat"?
Not sure if this is really your question, but assuming you just want to remove the ID column from a data.table, here is an example data.table and three ways you could exclude the ID column:
DT <- data.table(
  ID = LETTERS[1:10],
  matrix(rnorm(50), nrow = 10, dimnames = list(NULL, paste0("col", 1:5)))
)
DT[, -1]
DT[, -"ID"]
DT[, setdiff(colnames(DT), "ID"), with = FALSE]
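For the factor-analysis side of the question, the cleaner route is usually to drop ID before fitting, so it never gets a loading in the first place. A minimal sketch, assuming data_fa_1 is a data.table holding the items plus ID; the number of factors and the rotation are placeholders for whatever your original fa() call used:
library(psych)
library(data.table)
# fit the factor analysis on the item columns only, excluding ID
items <- data_fa_1[, setdiff(names(data_fa_1), "ID"), with = FALSE]
expl_fa2 <- fa(items, nfactors = 2, rotate = "varimax")
# the factor scores line up row by row with data_fa_1, ID included
data_fa_1$postmat <- expl_fa2$scores[, 1]
data_fa_1$mat <- expl_fa2$scores[, 2]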

Creating an igraph from weighted correlation matrix csv

First of all, I'd like to say that I'm completely new to R, and I'm just trying to accomplish this one task.
So, what I'm trying to do is create a network diagram from a weighted matrix. I made an example:
The CSV is a simple correlation matrix that looks like this:
,A,B,C,D,E,F,G
A,1,0.9,0.64,0.43,0.38,0.33,0.33
B,0.9,1,0.64,0.33,0.43,0.38,0.38
C,0.64,0.64,1,0.59,0.69,0.64,0.64
D,0.43,0.33,0.59,1,0.28,0.23,0.28
E,0.38,0.43,0.69,0.28,1,0.95,0.9
F,0.33,0.38,0.64,0.23,0.95,1,0.9
G,0.33,0.38,0.64,0.28,0.9,0.9,1
I tried to draw the desired result by myself and came up with this:
To be more precise, I drew the diagram first, then, using a ruler, I took note of the distances, calculated an equation to get the weights, and made the CSV table.
The higher the value, the closer the two points should be to each other.
However, whatever I do, the best result I get is this:
And this is how I'm trying to accomplish it, using this tutorial:
First of all, I import my matrix:
> matrix <- read.csv(file = 'test_dataset.csv')
But after printing the matrix with head(), the last line of the matrix already seems to be cut off:
> head(matrix)
ï.. A B C D E F G
1 A 1.00 0.90 0.64 0.43 0.38 0.33 0.33
2 B 0.90 1.00 0.64 0.33 0.43 0.38 0.38
3 C 0.64 0.64 1.00 0.59 0.69 0.64 0.64
4 D 0.43 0.33 0.59 1.00 0.28 0.23 0.28
5 E 0.38 0.43 0.69 0.28 1.00 0.95 0.90
6 F 0.33 0.38 0.64 0.23 0.95 1.00 0.90
> dim(matrix)
[1] 7 8
I then proceed with removing the first column so the matrix is square again...
> matrix <- data.matrix(matrix)[,-1]
> head(matrix)
A B C D E F G
[1,] 1.00 0.90 0.64 0.43 0.38 0.33 0.33
[2,] 0.90 1.00 0.64 0.33 0.43 0.38 0.38
[3,] 0.64 0.64 1.00 0.59 0.69 0.64 0.64
[4,] 0.43 0.33 0.59 1.00 0.28 0.23 0.28
[5,] 0.38 0.43 0.69 0.28 1.00 0.95 0.90
[6,] 0.33 0.38 0.64 0.23 0.95 1.00 0.90
> dim(matrix)
[1] 7 7
Then I create the graph and try to plot it:
> network <- graph_from_adjacency_matrix(matrix, weighted=T, mode="undirected", diag=F)
> plot(network)
And the result above appears...
So, after spending the last few hours googling and trying many more things, this is the closest I've been able to get.
I'm asking for your help, thank you very much!
This is all fine.
head() just prints out the first 6 rows of a matrix or dataframe, if you want to see all of it use print() or just the name of the matrix variable.
graph_from_adjacency_matrix produces a link between two nodes if the value is non-zero. That's why you are getting every node linked to every other node.
To get what that tutorial is doing you need to add a line like
matrix[matrix<0.5] <- 0
to remove the edges for correlations below a cut off before you create the graph.
It's still not going to produce a chart like your hand-drawn one (where closeness roughly reflects the correlation); it will just clump nodes together if they are correlated above 0.5.
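Putting those pieces together, a minimal sketch of the whole workflow could look like the following (the 0.5 cut-off and the layout are assumptions; layout_with_fr() takes edge weights into account, so more strongly correlated nodes tend to land closer together):
library(igraph)
# read the correlation matrix, using the first CSV column as row names
m <- as.matrix(read.csv("test_dataset.csv", row.names = 1))
# drop weak correlations before building the graph
m[m < 0.5] <- 0
network <- graph_from_adjacency_matrix(m, weighted = TRUE, mode = "undirected", diag = FALSE)
plot(network, layout = layout_with_fr(network), edge.width = E(network)$weight * 5)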

'x' and 'y' lengths differ in custom entropy function

I am trying to learn R and I am having problems with the way it works. I tried to write an entropy function of the variables p and 1-p from scratch, and I run into problems when I try to add some ifs to avoid the NaN produced at p = 0 and p = 1.
When I plot the custom entropy without the ifs, it just works, but it shows NaN when I print the results. When I add the ifs, it says:
Error in xy.coords(x, y, xlabel, ylabel, log) :
'x' and 'y' lengths differ
entropy <- function(p){
  cat("p = ", p)
  if (p == 0 || p == 1) {
    result = 0
  } else {
    result = -p*log2(p) - (1-p)*log2(1-p)
  }
  cat("\nresult=", result)
  return(result)
}
p <- seq(0,1,0.01)
plot(p, entropy(p), type='l', main='Funcion entropia con dos valores posibles')
I don't understand it, since I am plotting an array as x and a function of that array as y, so the lengths should be the same with and without the ifs.
Console without the ifs:
p = 0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.1 0.11 0.12 0.13 0.14 0.15 0.16 0.17 0.18 0.19 0.2 0.21 0.22 0.23 0.24 0.25 0.26 0.27 0.28 0.29 0.3 0.31 0.32 0.33 0.34 0.35 0.36 0.37 0.38 0.39 0.4 0.41 0.42 0.43 0.44 0.45 0.46 0.47 0.48 0.49 0.5 0.51 0.52 0.53 0.54 0.55 0.56 0.57 0.58 0.59 0.6 0.61 0.62 0.63 0.64 0.65 0.66 0.67 0.68 0.69 0.7 0.71 0.72 0.73 0.74 0.75 0.76 0.77 0.78 0.79 0.8 0.81 0.82 0.83 0.84 0.85 0.86 0.87 0.88 0.89 0.9 0.91 0.92 0.93 0.94 0.95 0.96 0.97 0.98 0.99 1
result= NaN 0.08079314 0.1414405 0.1943919 0.2422922 0.286397 0.3274449 0.3659237 0.4021792 0.4364698 0.4689956 0.499916 0.5293609 0.5574382 0.5842388 0.6098403 0.6343096 0.6577048 0.680077 0.7014715 0.7219281 0.7414827 0.7601675 0.7780113 0.7950403 0.8112781 0.8267464 0.8414646 0.8554508 0.8687212 0.8812909 0.8931735 0.9043815 0.9149264 0.9248187 0.9340681 0.9426832 0.9506721 0.958042 0.9647995 0.9709506 0.9765005 0.9814539 0.985815 0.9895875 0.9927745 0.9953784 0.9974016 0.9988455 0.9997114 1 0.9997114 0.9988455 0.9974016 0.9953784 0.9927745 0.9895875 0.985815 0.9814539 0.9765005 0.9709506 0.9647995 0.958042 0.9506721 0.9426832 0.9340681 0.9248187 0.9149264 0.9043815 0.8931735 0.8812909 0.8687212 0.8554508 0.8414646 0.8267464 0.8112781 0.7950403 0.7780113 0.7601675 0.7414827 0.7219281 0.7014715 0.680077 0.6577048 0.6343096 0.6098403 0.5842388 0.5574382 0.5293609 0.499916 0.4689956 0.4364698 0.4021792 0.3659237 0.3274449 0.286397 0.2422922 0.1943919 0.1414405 0.08079314 NaN
Console with the ifs:
p = 0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.1 0.11 0.12 0.13 0.14 0.15 0.16 0.17 0.18 0.19 0.2 0.21 0.22 0.23 0.24 0.25 0.26 0.27 0.28 0.29 0.3 0.31 0.32 0.33 0.34 0.35 0.36 0.37 0.38 0.39 0.4 0.41 0.42 0.43 0.44 0.45 0.46 0.47 0.48 0.49 0.5 0.51 0.52 0.53 0.54 0.55 0.56 0.57 0.58 0.59 0.6 0.61 0.62 0.63 0.64 0.65 0.66 0.67 0.68 0.69 0.7 0.71 0.72 0.73 0.74 0.75 0.76 0.77 0.78 0.79 0.8 0.81 0.82 0.83 0.84 0.85 0.86 0.87 0.88 0.89 0.9 0.91 0.92 0.93 0.94 0.95 0.96 0.97 0.98 0.99 1
result= 0Error in xy.coords(x, y, xlabel, ylabel, log) :
'x' and 'y' lengths differ
You did not create a vector but a scalar, because your if/else clause is not vectorized: if() only looks at the first element of p, so the result of your function is just one number.
This should work:
entropy <- function(p){
  # initialize a vector of the desired length with zeros
  result <- numeric(length(p))
  # subset the vector to the values the formula applies to
  x <- p[!(p %in% c(0,1))]
  # overwrite only those positions with values computed from the formula
  result[!(p %in% c(0,1))] <- -x*log2(x) - (1-x)*log2(1-x)
  #cat("\nresult=", result)
  return(result)
}
p <- seq(0,1,0.01)
plot(p, entropy(p), type='l', main='Funcion entropia con dos valores posibles')
EDIT:
Even though it was suggested that I vectorize it, I wanted to do it in a way similar to other languages I know, since I am just starting out. I was able to fix it, although I ended up using a for loop and plotting two arrays instead of the function itself.
entropy <- function(p){
  if (p == 0 || p == 1) {
    result = 0
  } else {
    result = -p*log2(p) - (1-p)*log2(1-p)
  }
  return(result)
}
x <- seq(0, 1, 0.01)
y <- numeric(length(x))  # length(x), not length(p): p is not defined yet at this point
i = 1
for (p in x) {
  y[i] = entropy(p)
  cat(x[i], "=", y[i], "\n")
  i = i + 1
}
plot(x, y, type='l', main='Funcion entropia con dos valores posibles')
I just applied your entropy function to the p vector with sapply before plotting it.
entropy <- function(p){
  cat("p = ", p)
  if (p == 0 || p == 1) {
    result = 0
  } else {
    result = -p*log2(p) - (1-p)*log2(1-p)
  }
  cat("\nresult=", result)
  return(result)
}
p <- seq(0,1,0.01)
# Apply the function over all the values of 'p'
entropy_p <- sapply(p,FUN = entropy)
plot(p, entropy_p, type='l', main='Funcion entropia con dos valores posibles')
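As a further alternative (not from the original answers), ifelse() evaluates its condition element-wise, so the function stays vectorized without an explicit loop or subsetting:
entropy <- function(p) {
  # 0 where p is 0 or 1, the usual binary entropy formula everywhere else
  ifelse(p == 0 | p == 1, 0, -p * log2(p) - (1 - p) * log2(1 - p))
}
p <- seq(0, 1, 0.01)
plot(p, entropy(p), type = 'l', main = 'Funcion entropia con dos valores posibles')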

How to find and replace min value in dataframe with text in r

I have a data frame with 20 columns and I would like to identify the minimum value in each column and replace it with text such as "min". Appreciate any help.
sample data :
a b c
-0.05 0.31 0.62
0.78 0.25 -0.01
0.68 0.33 -0.04
-0.01 0.30 0.56
0.55 0.28 -0.03
Desired output
a b c
min 0.31 0.62
0.78 min -0.01
0.68 0.33 min
-0.01 0.30 0.56
0.55 0.28 -0.03
You can apply a function to each column that replaces the minimum value with a string. This returns a matrix, which could be converted into a data frame if desired (see the snippet after the output below). As IceCreamToucan pointed out, all values will be of type character, since a matrix can only hold elements of a single type:
apply(df, 2, function(x) {
  x[x == min(x)] <- 'min'
  return(x)
})
a b c
[1,] "min" "0.31" "0.62"
[2,] "0.78" "min" "-0.01"
[3,] "0.68" "0.33" "min"
[4,] "-0.01" "0.3" "0.56"
[5,] "0.55" "0.28" "-0.03"
You can use the method below, but know that this converts all your columns to character, since vectors must have elements which all have the same type.
library(dplyr)
df %>%
  mutate_all(~ replace(.x, which.min(.x), 'min'))
# a b c
# 1 min 0.31 0.62
# 2 0.78 min -0.01
# 3 0.68 0.33 min
# 4 -0.01 0.3 0.56
# 5 0.55 0.28 -0.03
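Note that mutate_all() is superseded in recent dplyr releases; across() covers the same case (a sketch, with the same caveat about character conversion):
df %>%
  mutate(across(everything(), ~ replace(.x, which.min(.x), 'min')))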
apply(df, MARGIN = 2, FUN = function(x) {x[which.min(x)] <- 'min'; return(x)})

Obtain standardized loadings ("pattern matrix") from psych::fa object

The psych::print.psych() function produces beautiful output for the factor analysis objects produced by psych::fa(). I would like to obtain the table that follows the text "Standardized loadings (pattern matrix) based upon correlation matrix" as a data frame without cutting and pasting.
library(psych)
my.fa <- fa(Harman74.cor$cov, 4)
my.fa #Equivalent to print.psych(my.fa)
Yields the following (I'm showing the first four items here):
Factor Analysis using method = minres
Call: fa(r = Harman74.cor$cov, nfactors = 4)
Standardized loadings (pattern matrix) based upon correlation matrix
MR1 MR3 MR2 MR4 h2 u2 com
VisualPerception 0.04 0.69 0.04 0.06 0.55 0.45 1.0
Cubes 0.05 0.46 -0.02 0.01 0.23 0.77 1.0
PaperFormBoard 0.09 0.54 -0.15 0.06 0.34 0.66 1.2
Flags 0.18 0.52 -0.04 -0.02 0.35 0.65 1.2
I tried examining the source code for print.psych (using View(print.psych) in RStudio), but could only find a section for printing standardized loadings for 'Factor analysis by Groups'.
The my.fa$weights are not standardized, and the table is missing the h2, u2, and com columns. If they can be standardized, the following code could work:
library(data.table)
library(psych)
my.fa <- fa(Harman74.cor$cov,4)
my.fa.table <- data.table(dimnames(Harman74.cor$cov)[[1]],
my.fa$weights, my.fa$communalities, my.fa$uniquenesses, my.fa$complexity)
setnames(my.fa.table, old = c("V1", "V3", "V4", "V5"),
new = c("item", "h2", "u2", "com"))
Printing my.fa.table gives the following (I show the first four lines), which indicates $weights is incorrect:
item MR1 MR3 MR2 MR4 h2 u2 com
1: VisualPerception -0.021000973 0.28028576 0.006002429 -0.001855021 0.5501829 0.4498201 1.028593
2: Cubes -0.003545975 0.11022570 -0.009545919 -0.012565221 0.2298420 0.7701563 1.033828
3: PaperFormBoard 0.028562047 0.13244895 -0.019162262 0.014448449 0.3384722 0.6615293 1.224154
4: Flags 0.009187032 0.14430196 -0.025374834 -0.033737089 0.3497962 0.6502043 1.246102
Replacing $weights with $loadings gives the following error message:
Error in as.data.frame.default(x, ...) :
cannot coerce class ‘"loadings"’ to a data.frame
Update:
Adding [,] fixed the class issue:
library(data.table)
library(psych)
my.fa <- fa(Harman74.cor$cov,4)
my.fa.table <- data.table(dimnames(Harman74.cor$cov)[[1]],
my.fa$loadings[,], my.fa$communalities, my.fa$uniquenesses, my.fa$complexity)
setnames(my.fa.table, old = c("V1", "V3", "V4", "V5"),
new = c("item", "h2", "u2", "com"))
my.fa.table
item MR1 MR3 MR2 MR4 h2 u2 com
1: VisualPerception 0.04224875 0.686002901 0.041831185 0.05624303 0.5501829 0.4498201 1.028593
2: Cubes 0.05309628 0.455343417 -0.022143990 0.01372376 0.2298420 0.7701563 1.033828
3: PaperFormBoard 0.08733001 0.543848733 -0.147686005 0.05523805 0.3384722 0.6615293 1.224154
4: Flags 0.17641395 0.517235582 -0.038878915 -0.02229273 0.3497962 0.6502043 1.246102
I would still be happy to get an answer that does this more elegantly or explains why this isn't built in.
It is not built in because each person wants something slightly different. As you discovered, you can create a table by combining four objects from fa: the loadings, the communalities, the uniqueness, and the complexity.
df <- data.frame(unclass(f$loadings), h2=f$communalities, u2= f$uniqueness,com=f$complexity)
round(df,2)
so, for the Thurstone correlation matrix:
f <- fa(Thurstone,3)
df <- data.frame(unclass(f$loadings), h2=f$communalities, u2= f$uniqueness,com=f$complexity)
round(df,2)
Produces
MR1 MR2 MR3 h2 u2 com
Sentences 0.90 -0.03 0.04 0.82 0.18 1.01
Vocabulary 0.89 0.06 -0.03 0.84 0.16 1.01
Sent.Completion 0.84 0.03 0.00 0.74 0.26 1.00
First.Letters 0.00 0.85 0.00 0.73 0.27 1.00
Four.Letter.Words -0.02 0.75 0.10 0.63 0.37 1.04
Suffixes 0.18 0.63 -0.08 0.50 0.50 1.20
Letter.Series 0.03 -0.01 0.84 0.73 0.27 1.00
Pedigrees 0.38 -0.05 0.46 0.51 0.49 1.96
Letter.Group -0.06 0.21 0.63 0.52 0.48 1.25
Or you can try fa2latex for nice LaTeX-based formatting:
fa2latex(f)
which produces a LaTeX table in quasi-APA style.
