I have a mixed dataset (comprising continuous, ordinal and nominal variables) that is high-dimensional (with more variables than rows). I want to perform a mixed data PCA using the dudi.mix() function in the R package ade4. After PCA, I want to project a new supplementary row onto the PCA space, i.e. find its coordinates in the PCA coordinate system. I tried the suprow() function in ade4 but it gives me the following error message: “Not yet implemented for 'dudi.mix'. Please use 'dudi.hillsmith'.” I don’t want to use the dudi.hillsmith() function because I think it only allows for mixed continuous and nominal variables, but my dataset comprises continuous, nominal and ordinal variables, and my understanding is that dudi.mix() is the correct function to use in this case.
Is there an alternative way to project a new row onto the PCA space generated by dudi.mix()?
Below is an example:
# load ade4 package
library(ade4)
# a high-dimensional mixed dataset with 11 rows and 13 variables
dat <- data.frame(
  a = as.numeric(c(2.5,0.5,2.2,1.9,3.1,2.3,2.0,1.0,1.5,1.1,3.4)),
  b = as.numeric(c(2.4,0.7,2.9,2.2,3.0,2.7,1.6,1.1,1.6,0.9,3.1)),
  c = as.numeric(c(1.3,1.1,2.4,3.1,2.2,1.3,1.5,1.8,1.1,0.5,3.8)),
  d = as.numeric(c(1.9,0.9,2.1,2.3,2.8,1.9,1.9,1.3,2.9,0.8,2.9)),
  e = as.numeric(c(2.2,1.2,2.5,2.9,1.9,3.1,2.1,0.9,1.8,0.9,2.8)),
  f = as.factor(c(0,0,0,0,1,0,1,1,1,0,1)),
  g = as.factor(c(0,1,0,0,1,0,1,0,1,0,0)),
  h = as.factor(c(1,1,1,1,0,1,0,0,0,1,1)),
  i = as.factor(c(1,0,0,0,0,1,0,1,0,0,0)),
  j = as.ordered(c(0,1,0,2,3,4,0,1,2,4,2)),
  k = as.ordered(c(1,2,1,3,4,4,1,2,2,3,3)),
  l = as.ordered(c(0,1,1,2,3,2,0,1,1,3,1)),
  m = as.ordered(c(0,0,1,2,1,2,2,1,0,2,1)))
# first 10 rows are used for PCA
dat.1 <- dat[1:10,]
# the 11th row should be projected onto the PCA space
dat.2 <- dat[11,]
# pca on dat.1 with 9 kept axes (i.e. number of rows - 1)
pca.res <- dudi.mix(df = dat.1, scannf = FALSE, nf = 9)
# my attempt to project dat.2 onto pca.res fails
suprow(x=pca.res, Xsup=dat.2)
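One possible workaround, sketched here under assumptions rather than offered as a drop-in replacement for suprow(): every dudi object stores the transformed table ($tab), the column weights ($cw) and the normed column scores ($c1), and in ade4's duality-diagram framework the row coordinates should be tab %*% (cw * c1). The snippet below checks that identity on a small mixed data frame (reusing the first ten values of columns a, b, f and j above). If it holds, projecting a genuinely new row reduces to building its row of $tab, i.e. applying the same per-variable preprocessing dudi.mix() used (standardizing the quantitative columns, dummy-coding the nominal factors, recoding the ordered factors), which dudi.mix() unfortunately does not export.

```r
library(ade4)

# small mixed data frame (values reused from columns a, b, f and j above)
d <- data.frame(
  num1 = c(2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2.0, 1.0, 1.5, 1.1),
  num2 = c(2.4, 0.7, 2.9, 2.2, 3.0, 2.7, 1.6, 1.1, 1.6, 0.9),
  fac  = as.factor(c(0, 0, 0, 0, 1, 0, 1, 1, 1, 0)),
  ord  = as.ordered(c(0, 1, 0, 2, 3, 4, 0, 1, 2, 4))
)
res <- dudi.mix(df = d, scannf = FALSE, nf = 3)

# duality-diagram identity: row coordinates = transformed table
# times the column-weighted normed column scores
proj <- as.matrix(res$tab) %*% (res$cw * as.matrix(res$c1))
all.equal(unname(proj), unname(as.matrix(res$li)))
```

If the identity checks out, a supplementary row transformed the same way and multiplied by (cw * c1) lands in the same space as res$li; the hard part, and presumably the reason suprow() refuses, is reproducing the mixed-type preprocessing for the new row.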
I'm having trouble computing factor scores from an exploratory factor analysis on ordered categorical data. I've managed to decide how many factors to extract and to run the factor analysis using the psych package, but I can't figure out how to get factor scores for individual participants, and I haven't found much help online. Here is where I'm stuck:
library(polycor)
library(nFactors)
library(psych)
# load data
dat <- read.csv("https://raw.githubusercontent.com/paulrconnor/datasets/master/data.csv")
# convert each column to a factor so hetcor computes polychoric correlations
for(i in 1:length(dat)){
  dat[,i] <- as.factor(dat[,i])
}
# compute polychoric correlations
pc <- hetcor(dat, ML = TRUE)
# choose the number of factors (parallel analysis + scree)
ev <- eigen(pc$correlations)
ap <- parallel(subject = nrow(dat),
               var = ncol(dat), rep = 100, cent = .05)
nS <- nScree(x = ev$values, aparallel = ap$eigen$qevpea)
dev.new(height = 4, width = 6, noRStudioGD = TRUE)
plotnScree(nS) # suggests 2 factors, maybe 1
# run FA
faPC <- fa(r=pc$correlations, nfactors = 2, rotate="varimax",fm="ml")
faPC$loadings
Edit: I've found a way to get scores using irt.fa() and scoreIrt(), but it involved converting my ordered categories to numeric so I'm not sure it's valid. Any advice would be much appreciated!
# convert the ordered factors back to numeric before building the matrix
for(i in 1:length(dat)){ dat[,i] <- as.numeric(dat[,i]) }
x <- as.matrix(dat)
fairt <- irt.fa(x = x, nfactors = 2, correct = TRUE, plot = TRUE,
                n.obs = NULL, rotate = "varimax", fm = "ml", sort = FALSE)
scoreIrt(stats = fairt, items = dat, cut = 0.2, mod = "logistic")
That's an interesting problem. Regular factor analysis assumes your input measures are interval or ratio scaled. With ordinal variables you have a few options: you could use an IRT-based approach (in which case you'd be using something like the Graded Response Model), or do as you do in your example and use the polychoric correlation matrix as the input to the factor analysis. You can see more discussion of this issue here.
Most factor analysis packages have a method for getting factor scores, but they will give you different output depending on what you use as input. For example, normally you can just use factor.scores() to get your expected factor scores, but only if you input your original raw-score data. The problem here is the requirement to use the polychoric matrix as the input.
I'm not 100% sure (and someone please correct me if I'm wrong), but I think the following should be OK in your situation:
dat <- read.csv("https://raw.githubusercontent.com/paulrconnor/datasets/master/data.csv")
dat_orig <- dat
# convert each column to a factor so hetcor computes polychoric correlations
for(i in 1:length(dat)){
  dat[,i] <- as.factor(dat[,i])
}
# compute polychoric correlations
pc <- hetcor(dat, ML = TRUE)
# run FA
faPC <- fa(r=pc$correlations, nfactors = 2, rotate="varimax",fm="ml")
factor.scores(dat_orig, faPC)
In essence what you're doing is:
1. Calculate the polychoric correlation matrix.
2. Use that matrix to run the factor analysis, extracting 2 factors and their loadings.
3. Use the loadings from the FA together with the raw (numeric) data to get factor scores.
Both this method and the method you use in your edit treat the original data as numeric rather than as factors. I think this should be OK: you're just projecting your raw data onto the factors identified by the FA, and those loadings already account for the ordinal nature of your variables, since the polychoric matrix was the input to the FA. The post linked above cautions against this approach, however, and suggests some alternatives; this is not a straightforward problem to solve.
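For intuition about what factor.scores() computes by default, the regression (Thurstone) method is just F = Z R^{-1} L: standardized data times the inverse of the correlation matrix times the loadings. A minimal base-R sketch; Z, R and L here are made up for illustration, not taken from the dataset above (in your case R would be the polychoric matrix and L the loadings from fa()):

```r
set.seed(1)
# toy raw data: 100 respondents, 4 numeric items, standardized
Z <- scale(matrix(rnorm(400), ncol = 4))
R <- cor(Z)                    # in your problem: the polychoric matrix
L <- matrix(c(.8, .7, .1, .2,  # made-up loadings, 4 items x 2 factors
              .1, .2, .8, .7), ncol = 2)
W <- solve(R, L)               # regression weights: R^{-1} %*% L
F_scores <- Z %*% W            # one score per respondent per factor
dim(F_scores)                  # 100 x 2
```

This is only the default scoring method; factor.scores() in psych supports several others via its method argument.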
I have a dataset where I've measured gene expression of 21 genes and also measured the output of 3 other assays. I have measured these for 8 different clones. I have also measured these on 5 different days.
However, I haven't measured every gene or assay on every day, or for every clone, so I have datasets of varying lengths. To combine them into one large dataset for a PCA, I melted each dataset, row-bound the results, and then standardized all the values. I now have a dataset that looks like the one below.
What I want is a PCA in which each level of "group" is treated as a variable. Then I'd like to create graphs where different colors of data points represent different clones or days. I've pasted my sad attempt to get that working below. Any help would be appreciated!
set.seed(1)
# Creates variables for a dataset
clone <- sample(c(rep(c("1A","2A","2B","3B","3C"), each=100),rep(c("1B","2C","3A"), each=200)))
day <- sample(c(rep(1,225),rep(2,25),rep(3,600),rep(4,25),rep(5,225)))
group <- sample(c(rep(paste0("gene",1:21), each=42),rep("assay1",90),rep("assay2",80),rep("assay3",48)))
value <- rnorm(1100, mean = 0, sd = 3)
# Create data frame from variables
df <- data.frame(clone,day,group,value)
df$day <- as.factor(df$day)
# Create PCA data
df_PCA <- prcomp(clone + day + group ~ value, data = df, scale = FALSE)
# Graphing results of PCA
par(mfrow=c(2,3))
plot(df_PCA$x[,1:2], col=clone)
plot(df_PCA$x[,1:2], col=day)
plot(df_PCA$x[,1:3], col=clone)
plot(df_PCA$x[,1:3], col=day)
plot(df_PCA$x[,2:3], col=clone)
plot(df_PCA$x[,2:3], col=day)
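For what it's worth, prcomp() expects a numeric matrix (rows = observations, columns = variables) rather than a formula mixing grouping factors with a single value column, so the melted data has to be cast back to wide format first. A hedged sketch along those lines, starting from the same simulated df; averaging replicates and mean-imputing the unmeasured clone/day/group cells are my assumptions, not something the data dictates:

```r
set.seed(1)
clone <- sample(c(rep(c("1A","2A","2B","3B","3C"), each = 100), rep(c("1B","2C","3A"), each = 200)))
day   <- sample(c(rep(1, 225), rep(2, 25), rep(3, 600), rep(4, 25), rep(5, 225)))
group <- sample(c(rep(paste0("gene", 1:21), each = 42), rep("assay1", 90), rep("assay2", 80), rep("assay3", 48)))
value <- rnorm(1100, mean = 0, sd = 3)
df <- data.frame(clone, day, group, value)

# average replicates, then cast: one row per clone-day, one column per gene/assay
agg  <- aggregate(value ~ clone + day + group, data = df, FUN = mean)
wide <- reshape(agg, idvar = c("clone", "day"), timevar = "group", direction = "wide")
num  <- wide[ , -(1:2)]
# unmeasured combinations come out NA; impute with the column mean (an assumption)
num[] <- lapply(num, function(v) replace(v, is.na(v), mean(v, na.rm = TRUE)))

pca <- prcomp(num, scale. = TRUE)
# color points by clone or by day
plot(pca$x[ , 1:2], col = factor(wide$clone), pch = 19)
plot(pca$x[ , 1:2], col = factor(wide$day),  pch = 19)
```

Note that col = needs a color vector or a factor (whose integer codes index the palette); passing the character vector clone directly, as in the attempt above, won't work.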
I was wondering why, when I run cor.test on a data frame containing only values of 0 to get a correlation similarity matrix, it returns NAs. The 0s in this data frame represent the actual abundance of a widget based on our measurements, so I would expect each pair of correlated vectors to have a squared correlation estimate of 1. I think I don't have a good grasp of what a correlation is and how cor.test implements it. Any advice or help in understanding correlations and cor.test would be great. Code is below.
corpij <- function(i, j, data) { (cor.test(data[,i], data[,j])$estimate)^2 }
corp <- Vectorize(corpij, vectorize.args = list("i", "j"))
n <- ncol(ready_cor)  # ready_cor is the data frame being correlated
r2_scores <- outer(1:n, 1:n, corp, ready_cor)
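A minimal example of what I'm seeing, using cor() directly: Pearson's r divides the covariance by the product of the two standard deviations, and a constant (all-zero) vector has standard deviation 0, so the estimate is 0/0, which R reports as NA (with a warning) rather than 1.

```r
x <- rep(0, 10)  # constant "abundance" vector
y <- rep(0, 10)
sd(x)                        # 0: no variability at all
suppressWarnings(cor(x, y))  # NA: the denominator sd(x) * sd(y) is 0
```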
Best