Principal components order - PCA in R

I'm trying to do PCA in R with principal() from the psych package. It runs, but I'm curious why my principal components are not ordered numerically. Why do they appear as PC2, PC3, PC1, PC4 instead of PC1, PC2, PC3, PC4? What is the reason for this ordering?
tb2 <- principal(tba, nfactors = 4)
tb2
Principal Components Analysis
Call: principal(r = tba, nfactors = 4)
Standardized loadings (pattern matrix) based upon correlation matrix
PC2 PC3 PC1 PC4 h2 u2 com
bio1 0.89 0.28 0.32 -0.05 0.98 0.0248 1.5
bio2 -0.07 -0.22 0.09 0.96 0.99 0.0091 1.1
bio3 0.63 0.21 -0.22 0.60 0.85 0.1497 2.5
bio4 -0.60 -0.40 0.34 0.44 0.83 0.1682 3.3
bio5 0.78 0.15 0.46 0.33 0.95 0.0454 2.1
bio6 0.89 0.36 0.17 -0.21 0.99 0.0088 1.5
bio7 -0.50 -0.38 0.26 0.70 0.96 0.0395 2.8
bio8 0.85 0.12 0.20 -0.19 0.81 0.1896 1.3
bio9 0.85 0.24 0.41 0.03 0.95 0.0525 1.6
bio10 0.85 0.23 0.40 0.04 0.95 0.0533 1.6
bio11 0.90 0.34 0.21 -0.13 0.99 0.0058 1.4
bio12 0.16 0.94 0.03 -0.15 0.93 0.0743 1.1
bio13 0.29 0.93 0.18 -0.09 0.99 0.0086 1.3
bio14 -0.31 -0.18 -0.89 -0.05 0.92 0.0777 1.3
bio15 0.34 0.72 0.56 -0.02 0.94 0.0577 2.4
bio16 0.27 0.93 0.22 -0.10 0.99 0.0069 1.3
bio17 -0.17 -0.16 -0.93 -0.07 0.93 0.0725 1.1
bio18 -0.40 -0.29 -0.84 -0.06 0.96 0.0440 1.7
bio19 0.26 0.93 0.22 -0.09 0.99 0.0066 1.3
PC2 PC3 PC1 PC4
SS loadings 6.84 4.99 3.81 2.26
Proportion Var 0.36 0.26 0.20 0.12
Cumulative Var 0.36 0.62 0.82 0.94
Proportion Explained 0.38 0.28 0.21 0.13
Cumulative Proportion 0.38 0.66 0.87 1.00
Mean item complexity = 1.7
Test of the hypothesis that 4 components are sufficient.
The root mean square of the residuals (RMSR) is 0.03
with the empirical chi square 96803.04 with prob < 0
Thanks in advance!

Related

Create data frame from EFA output in R

I am working on EFA and would like to customize my tables. The function print.psych can suppress factor loadings below a given value to make the table easier to read. When I run it, it prints the loadings and the summary statistics to the console (in an .Rmd document, it produces console text and a separate data frame of the factor loadings with the small loadings suppressed). However, if I try to save the result as an object, this data is not kept.
Here is an example:
library(psych)
bfi_data=bfi
bfi_data=bfi_data[complete.cases(bfi_data),]
bfi_cor <- cor(bfi_data)
factors_data <- fa(r = bfi_cor, nfactors = 6)
# print the object fitted above; sort takes a logical, not the string "TRUE"
print.psych(factors_data, cut = .32, sort = TRUE)
In an R script, it produces this:
   item   MR2   MR3   MR1   MR5   MR4   MR6    h2   u2 com
N2   17  0.83                               0.654 0.35 1.0
N1   16  0.82                               0.666 0.33 1.1
N3   18  0.69                               0.549 0.45 1.1
N5   20  0.47                               0.376 0.62 2.2
N4   19  0.44        0.43                   0.506 0.49 2.4
C4    9       -0.67                         0.555 0.45 1.3
C2    7        0.66                         0.475 0.53 1.4
C5   10       -0.56                         0.433 0.57 1.4
C3    8        0.56                         0.317 0.68 1.1
C1    6        0.54                         0.344 0.66 1.3
In R Markdown, it produces this:
How can I save that data.frame as an object?
Looking at the str() of the object, what you want does not appear to be built in. An ugly way would be to use capture.output and convert the resulting character vector to a data frame with string manipulation. Alternatively, since the data is being displayed, it must be stored somewhere in the object itself. I found vectors of the same length that can be combined to form the data frame.
loadings <- unclass(factors_data$loadings)
h2 <- factors_data$communalities
#There is also factors_data$communality which has same values
u2 <- factors_data$uniquenesses
com <- factors_data$complexity
data <- cbind(loadings, h2, u2, com)
data
This returns :
# MR2 MR3 MR1 MR5 MR4 MR6 h2 u2 com
#A1 0.11 0.07 -0.07 -0.56 -0.01 0.35 0.38 0.62 1.85
#A2 0.03 0.09 -0.08 0.64 0.01 -0.06 0.47 0.53 1.09
#A3 -0.04 0.04 -0.10 0.60 0.07 0.16 0.51 0.49 1.26
#A4 -0.07 0.19 -0.07 0.41 -0.13 0.13 0.29 0.71 2.05
#A5 -0.17 0.01 -0.16 0.47 0.10 0.22 0.47 0.53 2.11
#C1 0.05 0.54 0.08 -0.02 0.19 0.05 0.34 0.66 1.32
#C2 0.09 0.66 0.17 0.06 0.08 0.16 0.47 0.53 1.36
#C3 0.00 0.56 0.07 0.07 -0.04 0.05 0.32 0.68 1.09
#C4 0.07 -0.67 0.10 -0.01 0.02 0.25 0.55 0.45 1.35
#C5 0.15 -0.56 0.17 0.02 0.10 0.01 0.43 0.57 1.41
#E1 -0.14 0.09 0.61 -0.14 -0.08 0.09 0.41 0.59 1.34
#E2 0.06 -0.03 0.68 -0.07 -0.08 -0.01 0.56 0.44 1.07
#E3 0.02 0.01 -0.32 0.17 0.38 0.28 0.51 0.49 3.28
#E4 -0.07 0.03 -0.49 0.25 0.00 0.31 0.56 0.44 2.26
#E5 0.16 0.27 -0.39 0.07 0.24 0.04 0.41 0.59 3.01
#N1 0.82 -0.01 -0.09 -0.09 -0.03 0.02 0.67 0.33 1.05
#N2 0.83 0.02 -0.07 -0.07 0.01 -0.07 0.65 0.35 1.04
#N3 0.69 -0.03 0.13 0.09 0.02 0.06 0.55 0.45 1.12
#N4 0.44 -0.14 0.43 0.09 0.10 0.01 0.51 0.49 2.41
#N5 0.47 -0.01 0.21 0.21 -0.17 0.09 0.38 0.62 2.23
#O1 -0.05 0.07 -0.01 -0.04 0.57 0.09 0.36 0.64 1.11
#O2 0.12 -0.09 0.01 0.12 -0.43 0.28 0.30 0.70 2.20
#O3 0.01 0.00 -0.10 0.05 0.65 0.04 0.48 0.52 1.06
#O4 0.10 -0.05 0.34 0.15 0.37 -0.04 0.24 0.76 2.55
#O5 0.04 -0.04 -0.02 -0.01 -0.50 0.30 0.33 0.67 1.67
#gender 0.20 0.09 -0.12 0.33 -0.21 -0.15 0.18 0.82 3.58
#education -0.03 0.01 0.05 0.11 0.12 -0.22 0.07 0.93 2.17
#age -0.06 0.07 -0.02 0.16 0.03 -0.26 0.10 0.90 2.05
Ronak Shaw answered my question above, and I used his answer to create the following function, which nearly reproduces the data frame that print.psych shows for fa.sort output:
library(psych)
library(dplyr)
library(tibble)

fa_table <- function(x, cut) {
  # get sorted loadings from the object passed in as x
  loadings <- fa.sort(x)$loadings %>% unclass() %>% round(3)
  # blank out loadings whose absolute value is below the cut-off
  loadings[abs(loadings) < cut] <- ""
  # get additional info
  add_info <- cbind(x$communalities,
                    x$uniquenesses,
                    x$complexity) %>%
    as.data.frame() %>%
    rename("commonality" = V1,
           "uniqueness" = V2,
           "complexity" = V3) %>%
    rownames_to_column("item")
  # build table, matching the extra columns to the sorted items by name
  loadings %>%
    as.data.frame() %>%
    rownames_to_column("item") %>%
    left_join(add_info, by = "item") %>%
    mutate(across(where(is.numeric), ~ round(.x, 3)))
}
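The cut step can also be sketched in base R without turning the loadings into text. This is only an illustration: the small matrix below is made up, and using NA instead of "" is a design choice of this sketch (it keeps the columns numeric), not something print.psych does.

```r
# Hypothetical 3 x 2 loadings matrix, invented for illustration
loadings <- matrix(c(0.83, -0.67,  0.10,
                     0.05,  0.44, -0.56),
                   nrow = 3,
                   dimnames = list(c("N2", "C4", "N4"), c("MR2", "MR3")))
cut <- 0.32

suppressed <- loadings
# compare absolute values so strong negative loadings survive the cut
suppressed[abs(suppressed) < cut] <- NA

# move the row names into an "item" column, as in the table above
loading_table <- data.frame(item = rownames(suppressed),
                            suppressed,
                            row.names = NULL)
print(loading_table)
```

Because the suppressed cells are NA rather than empty strings, the result can still be rounded, sorted, or filtered numerically before it is formatted for display.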

R. Remove blocks of observations in df if they fulfill condition

I have a huge dataframe (>1,000,000 rows) like this.
term estimate st.error statistic p.value SNP
(Intercept) 7.68 0.17 44.64 0 rs1406947
GT 0.01 0.01 0.07 0.19 rs1406947
SEX 1.52 0.14 10.87 0.1 rs1406947
M 0.12 0.29 0.41 0.67 rs1406947
N -0.06 0.12 -0.48 0.63 rs1406947
GT:SEX -0.03 0.08 -0.44 0.65 rs1406947
GT:N -0.00 0.06 -0.08 0.93 rs1406947
(Intercept) 9.23 0.20 34.64 0 rs25904
GT 0.05 0.04 0.12 0.22 rs25904
SEX 1.67 0.76 10.34 0.1 rs25904
M 0.14 0.39 0.51 0.55 rs25904
N -0.08 0.05 -0.46 0.55 rs25904
GT:SEX -0.19 0.11 -0.34 0.44 rs25904
GT:N -0.22 0.33 -0.44 0.55 rs25904
(Intercept) 7.99 0.66 44.44 0 rs7133579
GT 0.01 0.3 0.04 0.33 rs7133579
SEX 1.22 0.22 10.44 0.15 rs7133579
M 0.88 0.22 0.33 0.44 rs7133579
N -0.5 0.5 -0.5 0.6 rs7133579
GT:N -0.00 0.03 -0.04 0.78 rs7133579
It is composed of blocks of 7 observations: (Intercept), GT, SEX, M, N, GT:SEX and GT:N. However, a few blocks lack one or more of the observations (e.g. the third block lacks GT:SEX). Using R, I want to remove these incomplete blocks. In this toy example I would get:
term estimate st.error statistic p.value SNP
(Intercept) 7.68 0.17 44.64 0 rs1406947
GT 0.01 0.01 0.07 0.19 rs1406947
SEX 1.52 0.14 10.87 0.1 rs1406947
M 0.12 0.29 0.41 0.67 rs1406947
N -0.06 0.12 -0.48 0.63 rs1406947
GT:SEX -0.03 0.08 -0.44 0.65 rs1406947
GT:N -0.00 0.06 -0.08 0.93 rs1406947
(Intercept) 9.23 0.20 34.64 0 rs25904
GT 0.05 0.04 0.12 0.22 rs25904
SEX 1.67 0.76 10.34 0.1 rs25904
M 0.14 0.39 0.51 0.55 rs25904
N -0.08 0.05 -0.46 0.55 rs25904
GT:SEX -0.19 0.11 -0.34 0.44 rs25904
GT:N -0.22 0.33 -0.44 0.55 rs25904
I think you'd want to group by SNP and check those blocks for whether they comply with your expectations:
library(dplyr)
expected_terms <- c("(Intercept)", "GT", "SEX", "M", "N", "GT:SEX", "GT:N")
df %>%
group_by(SNP) %>%
filter(
all(expected_terms %in% term)
)
A stricter version, if you need to make sure that each of your terms appears exactly once and that no other terms appear:
df %>%
group_by(SNP) %>%
filter(
# use `table` to count occurrence of terms, keep only if all are counted exactly once
all(table(term)[expected_terms] == 1),
# keep only if no terms remain after removing your expected set
length(setdiff(term, expected_terms)) == 0
)
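For comparison, the same strict check can be sketched in base R. The toy data frame and its SNP labels ("rs1", "rs2") are invented here, and the approach (tapply plus setequal) is one possible translation of the dplyr code above, not part of the original answer.

```r
expected_terms <- c("(Intercept)", "GT", "SEX", "M", "N", "GT:SEX", "GT:N")

# Toy data: block "rs1" is complete, block "rs2" lacks GT:SEX
df <- data.frame(
  term = c(expected_terms, expected_terms[-6]),
  SNP  = rep(c("rs1", "rs2"), times = c(7, 6))
)

# For each SNP: TRUE if its terms are exactly the expected set, with no repeats
complete <- tapply(df$term, df$SNP, function(t) {
  setequal(t, expected_terms) && anyDuplicated(t) == 0
})

# Keep only the rows that belong to complete blocks
kept <- df[df$SNP %in% names(complete)[complete], ]
print(kept)
```

Unlike a length check, this does not rely on (Intercept) opening every block, at the cost of one function call per SNP.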
Assuming that (Intercept) is present every time and starts each block, you can test whether the length of each block is 7.
x[unlist(lapply(split(seq_len(nrow(x)), cumsum(x$term == "(Intercept)")),
function(y) {if(length(y) == 7) y else NULL})), ]
# term estimate st.error statistic p.value SNP
#1 (Intercept) 7.68 0.17 44.64 0.00 rs1406947
#2 GT 0.01 0.01 0.07 0.19 rs1406947
#3 SEX 1.52 0.14 10.87 0.10 rs1406947
#4 M 0.12 0.29 0.41 0.67 rs1406947
#5 N -0.06 0.12 -0.48 0.63 rs1406947
#6 GT:SEX -0.03 0.08 -0.44 0.65 rs1406947
#7 GT:N 0.00 0.06 -0.08 0.93 rs1406947
#8 (Intercept) 9.23 0.20 34.64 0.00 rs25904
#9 GT 0.05 0.04 0.12 0.22 rs25904
#10 SEX 1.67 0.76 10.34 0.10 rs25904
#11 M 0.14 0.39 0.51 0.55 rs25904
#12 N -0.08 0.05 -0.46 0.55 rs25904
#13 GT:SEX -0.19 0.11 -0.34 0.44 rs25904
#14 GT:N -0.22 0.33 -0.44 0.55 rs25904
Data:
x <- read.table(header=TRUE, text="term estimate st.error statistic p.value SNP
(Intercept) 7.68 0.17 44.64 0 rs1406947
GT 0.01 0.01 0.07 0.19 rs1406947
SEX 1.52 0.14 10.87 0.1 rs1406947
M 0.12 0.29 0.41 0.67 rs1406947
N -0.06 0.12 -0.48 0.63 rs1406947
GT:SEX -0.03 0.08 -0.44 0.65 rs1406947
GT:N -0.00 0.06 -0.08 0.93 rs1406947
(Intercept) 9.23 0.20 34.64 0 rs25904
GT 0.05 0.04 0.12 0.22 rs25904
SEX 1.67 0.76 10.34 0.1 rs25904
M 0.14 0.39 0.51 0.55 rs25904
N -0.08 0.05 -0.46 0.55 rs25904
GT:SEX -0.19 0.11 -0.34 0.44 rs25904
GT:N -0.22 0.33 -0.44 0.55 rs25904
(Intercept) 7.99 0.66 44.44 0 rs7133579
GT 0.01 0.3 0.04 0.33 rs7133579
SEX 1.22 0.22 10.44 0.15 rs7133579
M 0.88 0.22 0.33 0.44 rs7133579
N -0.5 0.5 -0.5 0.6 rs7133579
GT:N -0.00 0.03 -0.04 0.78 rs7133579")

R. Add column to df where rows have names of element from list

I have a list of all files (dataframes) within a directory:
library("plyr")
library("dplyr")
library("broom")
library("tidyr")
# pattern is a regular expression, not a glob, so match the extension explicitly
snp_list <- list.files(pattern = "\\.txt$", all.files = TRUE, full.names = FALSE)
I also have a dataframe A obtained through the following function:
pv1= lapply(snp_list, function(x) tidy(lm(PV ~ GT*SEX + M + GT*N,read.table(x,header=TRUE)))) %>%
bind_rows()
Dataframe A has 7 rows ((Intercept), GT, SEX, M, N, GT:SEX, GT:N) for each element in list snp_list. In this toy example the list has 3 elements (rs1406947.txt rs25904.txt rs7133579.txt), but in reality there are 1,200,000 elements
A:
term estimate st.error statistic p.value
(Intercept) 7.68 0.17 44.64 0
GT 0.01 0.01 0.07 0.19
SEX 1.52 0.14 10.87 0.1
M 0.12 0.29 0.41 0.67
N -0.06 0.12 -0.48 0.63
GT:SEX -0.03 0.08 -0.44 0.65
GT:N -0.00 0.06 -0.08 0.93
(Intercept) 9.23 0.20 34.64 0
GT 0.05 0.04 0.12 0.22
SEX 1.67 0.76 10.34 0.1
M 0.14 0.39 0.51 0.55
N -0.08 0.05 -0.46 0.55
GT:SEX -0.19 0.11 -0.34 0.44
GT:N -0.22 0.33 -0.44 0.55
(Intercept) 7.99 0.66 44.44 0
GT 0.01 0.3 0.04 0.33
SEX 1.22 0.22 10.44 0.15
M 0.88 0.22 0.33 0.44
N -0.5 0.5 -0.5 0.6
GT:SEX -0.06 0.09 -0.74 0.35
GT:N -0.00 0.03 -0.04 0.78
I want to add a new column "SNP" to A, where each row has the name of the element that row belongs to (nrows = 7*1,200,000). I would get this:
term estimate st.error statistic p.value SNP
(Intercept) 7.68 0.17 44.64 0 rs1406947
GT 0.01 0.01 0.07 0.19 rs1406947
SEX 1.52 0.14 10.87 0.1 rs1406947
M 0.12 0.29 0.41 0.67 rs1406947
N -0.06 0.12 -0.48 0.63 rs1406947
GT:SEX -0.03 0.08 -0.44 0.65 rs1406947
GT:N -0.00 0.06 -0.08 0.93 rs1406947
(Intercept) 9.23 0.20 34.64 0 rs25904
GT 0.05 0.04 0.12 0.22 rs25904
SEX 1.67 0.76 10.34 0.1 rs25904
M 0.14 0.39 0.51 0.55 rs25904
N -0.08 0.05 -0.46 0.55 rs25904
GT:SEX -0.19 0.11 -0.34 0.44 rs25904
GT:N -0.22 0.33 -0.44 0.55 rs25904
(Intercept) 7.99 0.66 44.44 0 rs7133579
GT 0.01 0.3 0.04 0.33 rs7133579
SEX 1.22 0.22 10.44 0.15 rs7133579
M 0.88 0.22 0.33 0.44 rs7133579
N -0.5 0.5 -0.5 0.6 rs7133579
GT:SEX -0.06 0.09 -0.74 0.35 rs7133579
GT:N -0.00 0.03 -0.04 0.78 rs7133579
Here's how to do what you asked:
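A$SNP <- character(nrow(A))
for (i in seq_len(nrow(A))) {
# rows 1-7 map to element 1, rows 8-14 to element 2, and so on
A$SNP[i] <- sub("\\.txt$", "", snp_list[((i - 1) %/% 7) + 1])
}
Using integer division on i - 1 with a block size of 7 generates an index that maps each run of 7 rows to one element of snp_list; sub() strips the .txt extension so the labels match the desired output.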
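With over a million files, a row-by-row loop is likely to be slow. A vectorised sketch of the same mapping is shown below; the three file names come from the question, while the stand-in data frame A is invented here and the code assumes every file contributed exactly 7 rows.

```r
# Stand-ins for the real objects (A here only carries the term column)
snp_list <- c("rs1406947.txt", "rs25904.txt", "rs7133579.txt")
terms <- c("(Intercept)", "GT", "SEX", "M", "N", "GT:SEX", "GT:N")
A <- data.frame(term = rep(terms, times = length(snp_list)))

# Guard the assumption that each file produced exactly 7 rows
stopifnot(nrow(A) == 7 * length(snp_list))

# Repeat each file name (without its extension) 7 times, in file order
A$SNP <- rep(tools::file_path_sans_ext(snp_list), each = 7)
print(A[c(1, 7, 8, 15), ])
```

If some models drop a term, the fixed block size no longer holds. In that case it may be safer to name the list of tidy() results by file and let bind_rows(.id = "SNP") attach the label while the rows are combined.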

How to retrieve observation scores for each Principal Component in R using principal Function

pc_unrotate = principal(correlate1,nfactors = 4,rotate = "none")
print(pc_unrotate)
output:
Principal Components Analysis
Call: principal(r = correlate1, nfactors = 4, rotate = "none")
Standardized loadings (pattern matrix) based upon correlation matrix
PC1 PC2 PC3 PC4 h2 u2 com
ProdQual 0.25 -0.50 -0.08 0.67 0.77 0.232 2.2
Ecom 0.31 0.71 0.31 0.28 0.78 0.223 2.1
TechSup 0.29 -0.37 0.79 -0.20 0.89 0.107 1.9
CompRes 0.87 0.03 -0.27 -0.22 0.88 0.119 1.3
Advertising 0.34 0.58 0.11 0.33 0.58 0.424 2.4
ProdLine 0.72 -0.45 -0.15 0.21 0.79 0.213 2.0
SalesFImage 0.38 0.75 0.31 0.23 0.86 0.141 2.1
ComPricing -0.28 0.66 -0.07 -0.35 0.64 0.359 1.9
WartyClaim 0.39 -0.31 0.78 -0.19 0.89 0.108 2.0
OrdBilling 0.81 0.04 -0.22 -0.25 0.77 0.234 1.3
DelSpeed 0.88 0.12 -0.30 -0.21 0.91 0.086 1.4
PC1 PC2 PC3 PC4
SS loadings 3.43 2.55 1.69 1.09
Proportion Var 0.31 0.23 0.15 0.10
Cumulative Var 0.31 0.54 0.70 0.80
Proportion Explained 0.39 0.29 0.19 0.12
Cumulative Proportion 0.39 0.68 0.88 1.00
Mean item complexity = 1.9
Test of the hypothesis that 4 components are sufficient.
The root mean square of the residuals (RMSR) is 0.06
Fit based upon off diagonal values = 0.97
Now I need to get the scores. I tried pc_unrotate$scores, but it returns NULL.
I ran names(pc_unrotate) and found that the scores element is missing. What can I do to get the PCA scores?
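Add the argument scores = TRUE to the principal() function call (see https://www.rdocumentation.org/packages/psych/versions/1.9.12.31/topics/principal). Scores can only be computed from raw data, so if correlate1 is a correlation matrix, pass the original data frame instead.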
pc_unrotate = principal(correlate1,nfactors = 4,rotate = "none", scores = TRUE)
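To illustrate why raw observations matter for scores, here is a base-R sketch on the built-in mtcars data (a stand-in, since the question's data is not available). It assumes, based only on the name, that correlate1 may be a correlation matrix. The calculation uses eigen() directly and is not how psych computes or scales its scores.

```r
# Stand-in raw data: four numeric columns from a built-in data set
raw <- mtcars[, c("mpg", "disp", "hp", "wt")]

# A correlation matrix is enough to get eigenvalues and loadings ...
R <- cor(raw)
e <- eigen(R)
loadings <- e$vectors %*% diag(sqrt(e$values))

# ... but scores are per-observation values, so they need the
# standardised raw data projected onto the eigenvectors
Z <- scale(raw)
scores <- Z %*% e$vectors

dim(scores)            # one row per observation, one column per component
apply(scores, 2, var)  # equals the eigenvalues of R
```

The correlation matrix alone has no rows for individual observations, which is why no scores can be derived from it.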

`psych::alpha`- detailed interpretation of the output

I am aware that Cronbach's alpha has been extensively discussed here and elsewhere, but I cannot find a detailed interpretation of the output table.
psych::alpha(diagnostic_test)
Reliability analysis
Call: psych::alpha(x = diagnostic_test)
raw_alpha std.alpha G6(smc) average_r S/N ase mean sd median_r
0.69 0.73 1 0.14 2.7 0.026 0.6 0.18 0.12
lower alpha upper 95% confidence boundaries
0.64 0.69 0.74
Reliability if an item is dropped:
raw_alpha std.alpha G6(smc) average_r S/N alpha se var.r med.r
Score1 0.69 0.73 0.86 0.14 2.7 0.027 0.0136 0.12
Score2 0.68 0.73 0.87 0.14 2.7 0.027 0.0136 0.12
Score3 0.69 0.73 0.87 0.14 2.7 0.027 0.0136 0.12
Score4 0.67 0.72 0.86 0.14 2.5 0.028 0.0136 0.11
Score5 0.68 0.73 0.87 0.14 2.7 0.027 0.0134 0.12
Score6 0.69 0.73 0.91 0.15 2.7 0.027 0.0138 0.12
Score7 0.69 0.73 0.85 0.15 2.7 0.027 0.0135 0.12
Score8 0.68 0.72 0.86 0.14 2.6 0.028 0.0138 0.12
Score9 0.68 0.73 0.92 0.14 2.7 0.027 0.0141 0.12
Score10 0.68 0.72 0.90 0.14 2.6 0.027 0.0137 0.12
Score11 0.67 0.72 0.86 0.14 2.5 0.028 0.0134 0.11
Score12 0.67 0.71 0.87 0.13 2.5 0.029 0.0135 0.11
Score13 0.67 0.72 0.86 0.14 2.6 0.028 0.0138 0.11
Score14 0.68 0.72 0.86 0.14 2.6 0.028 0.0138 0.11
Score15 0.67 0.72 0.86 0.14 2.5 0.028 0.0134 0.11
Score16 0.68 0.72 0.88 0.14 2.6 0.028 0.0135 0.12
score 0.65 0.65 0.66 0.10 1.8 0.030 0.0041 0.11
Item statistics
n raw.r std.r r.cor r.drop mean sd
Score1 286 0.36 0.35 0.35 0.21 0.43 0.50
Score2 286 0.37 0.36 0.36 0.23 0.71 0.45
Score3 286 0.34 0.34 0.34 0.20 0.73 0.44
Score4 286 0.46 0.46 0.46 0.33 0.35 0.48
Score5 286 0.36 0.36 0.36 0.23 0.73 0.44
Score6 286 0.29 0.32 0.32 0.18 0.87 0.34
Score7 286 0.33 0.32 0.32 0.18 0.52 0.50
Score8 286 0.42 0.41 0.41 0.28 0.36 0.48
Score9 286 0.32 0.36 0.36 0.22 0.90 0.31
Score10 286 0.37 0.40 0.40 0.26 0.83 0.37
Score11 286 0.48 0.47 0.47 0.34 0.65 0.48
Score12 286 0.49 0.49 0.49 0.37 0.71 0.46
Score13 286 0.46 0.44 0.44 0.31 0.44 0.50
Score14 286 0.44 0.43 0.43 0.30 0.43 0.50
Score15 286 0.48 0.47 0.47 0.35 0.61 0.49
Score16 286 0.39 0.39 0.39 0.26 0.25 0.43
score 286 1.00 1.00 1.00 1.00 0.60 0.18
Warning messages:
1: In cor.smooth(r) : Matrix was not positive definite, smoothing was done
2: In cor.smooth(R) : Matrix was not positive definite, smoothing was done
3: In cor.smooth(R) : Matrix was not positive definite, smoothing was done
As far as I know, r.cor stands for the item-total correlation, or biserial correlation. I have seen that this is usually interpreted together with the corresponding p-value.
1. What is the exact interpretation of r.cor and r.drop?
2. How can the p-value be calculated ?
1. Although this is more of a question for Cross Validated, here is a detailed explanation of the ‘Item statistics’ section:
raw.r: correlation between the item and the total score from the scale (i.e., item-total correlations); there is a problem with raw.r, that is, the item itself is included in the total—this means we’re correlating the item with itself, so of course it will correlate (r.cor and r.drop solve this problem; see ?alpha for details)
r.drop: item-total correlation without that item itself (i.e., item-rest correlation or corrected item-total correlation); low item-total correlations indicate that that item doesn’t correlate well with the scale overall
r.cor: item-total correlation corrected for item overlap and scale reliability
mean and sd: mean and standard deviation of the item itself (for example, Score6 is a 0/1 item with mean 0.87, which gives an sd of about 0.34)
2. You should not use the p-values corresponding to these correlation coefficients to guide your decisions. I would suggest not bothering to calculate them.
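The difference between raw.r and r.drop can be sketched by hand in base R. The binary items below are simulated for illustration, and the calculation only mirrors the definitions given above; psych::alpha applies further steps (for example for r.cor) that are not reproduced here.

```r
set.seed(1)

# Simulated 0/1 items driven by a common ability, invented for illustration
n <- 200
ability <- rnorm(n)
items <- sapply(1:5, function(j) as.integer(ability + rnorm(n) > 0))
colnames(items) <- paste0("Score", 1:5)

total <- rowSums(items)

# Item-total correlation: the item is part of the total it is compared with
raw_r <- apply(items, 2, function(x) cor(x, total))

# Item-rest correlation: remove the item from the total before correlating
r_drop <- sapply(seq_len(ncol(items)),
                 function(j) cor(items[, j], total - items[, j]))
names(r_drop) <- colnames(items)

round(cbind(raw_r, r_drop), 2)
```

Each r_drop value is smaller than the matching raw_r value because the item no longer correlates with itself, which is the overlap problem described for raw.r.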
