I have a mixed dataset (comprising continuous, ordinal and nominal variables) that is high-dimensional (with more variables than rows). I want to perform a mixed data PCA using the dudi.mix() function in the R package ade4. After PCA, I want to project a new supplementary row onto the PCA space, i.e. find its coordinates in the PCA coordinate system. I tried the suprow() function in ade4 but it gives me the following error message: “Not yet implemented for 'dudi.mix'. Please use 'dudi.hillsmith'.” I don’t want to use the dudi.hillsmith() function because I think it only allows for mixed continuous and nominal variables, but my dataset comprises continuous, nominal and ordinal variables, and my understanding is that dudi.mix() is the correct function to use in this case.
Is there an alternative way to project a new row onto the PCA space generated by dudi.mix()?
Below is an example:
# load ade4 package
library(ade4)
# a high-dimensional mixed dataset with 11 rows and 13 variables
dat <- data.frame(
  a = as.numeric(c(2.5,0.5,2.2,1.9,3.1,2.3,2.0,1.0,1.5,1.1,3.4)),
  b = as.numeric(c(2.4,0.7,2.9,2.2,3.0,2.7,1.6,1.1,1.6,0.9,3.1)),
  c = as.numeric(c(1.3,1.1,2.4,3.1,2.2,1.3,1.5,1.8,1.1,0.5,3.8)),
  d = as.numeric(c(1.9,0.9,2.1,2.3,2.8,1.9,1.9,1.3,2.9,0.8,2.9)),
  e = as.numeric(c(2.2,1.2,2.5,2.9,1.9,3.1,2.1,0.9,1.8,0.9,2.8)),
  f = as.factor(c(0,0,0,0,1,0,1,1,1,0,1)),
  g = as.factor(c(0,1,0,0,1,0,1,0,1,0,0)),
  h = as.factor(c(1,1,1,1,0,1,0,0,0,1,1)),
  i = as.factor(c(1,0,0,0,0,1,0,1,0,0,0)),
  j = as.ordered(c(0,1,0,2,3,4,0,1,2,4,2)),
  k = as.ordered(c(1,2,1,3,4,4,1,2,2,3,3)),
  l = as.ordered(c(0,1,1,2,3,2,0,1,1,3,1)),
  m = as.ordered(c(0,0,1,2,1,2,2,1,0,2,1)))
# first 10 rows are used for PCA
dat.1 <- dat[1:10,]
# the 11th row should be projected onto the PCA space
dat.2 <- dat[11,]
# pca on dat.1 with 9 kept axes (i.e. number of rows - 1)
pca.res <- dudi.mix(df = dat.1, scannf = FALSE, nf = 9)
# my attempt to project dat.2 onto pca.res fails
suprow(x=pca.res, Xsup=dat.2)
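One possible workaround, sketched here under assumptions rather than offered as a drop-in replacement for suprow(): every dudi object stores the transformed table ($tab), the column weights ($cw) and the normed column scores ($c1), and in ade4's duality-diagram framework the row coordinates should be tab %*% (cw * c1). The snippet below checks that identity on a small mixed data frame (reusing the first ten values of columns a, b, f and j above). If it holds, projecting a genuinely new row reduces to building its row of $tab, i.e. applying the same per-variable preprocessing dudi.mix() used (standardizing the quantitative columns, dummy-coding the nominal factors, recoding the ordered factors), which dudi.mix() unfortunately does not export.

```r
library(ade4)

# small mixed data frame (values reused from columns a, b, f and j above)
d <- data.frame(
  num1 = c(2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2.0, 1.0, 1.5, 1.1),
  num2 = c(2.4, 0.7, 2.9, 2.2, 3.0, 2.7, 1.6, 1.1, 1.6, 0.9),
  fac  = as.factor(c(0, 0, 0, 0, 1, 0, 1, 1, 1, 0)),
  ord  = as.ordered(c(0, 1, 0, 2, 3, 4, 0, 1, 2, 4))
)
res <- dudi.mix(df = d, scannf = FALSE, nf = 3)

# duality-diagram identity: row coordinates = transformed table
# times the column-weighted normed column scores
proj <- as.matrix(res$tab) %*% (res$cw * as.matrix(res$c1))
all.equal(unname(proj), unname(as.matrix(res$li)))
```

If the identity checks out, a supplementary row transformed the same way and multiplied by (cw * c1) lands in the same space as res$li; the hard part, and presumably the reason suprow() refuses, is reproducing the mixed-type preprocessing for the new row.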
I'm having trouble computing factor scores from an exploratory factor analysis on ordered categorical data. I've managed to decide how many factors to extract and to run the factor analysis using the psych package, but I can't figure out how to get factor scores for individual participants, and I haven't found much help online. Here is where I'm stuck:
library(polycor)
library(nFactors)
library(psych)
# load data
dat <- read.csv("https://raw.githubusercontent.com/paulrconnor/datasets/master/data.csv")
# convert each column to a factor so hetcor computes polychoric correlations
for(i in 1:length(dat)){
  dat[,i] <- as.factor(dat[,i])
}
# compute polychoric correlations
pc <- hetcor(dat, ML = TRUE)
# choose the number of factors (parallel analysis + scree)
ev <- eigen(pc$correlations)
ap <- parallel(subject = nrow(dat),
               var = ncol(dat), rep = 100, cent = .05)
nS <- nScree(x = ev$values, aparallel = ap$eigen$qevpea)
dev.new(height = 4, width = 6, noRStudioGD = TRUE)
plotnScree(nS) # suggests 2 factors, maybe 1
# run FA
faPC <- fa(r=pc$correlations, nfactors = 2, rotate="varimax",fm="ml")
faPC$loadings
Edit: I've found a way to get scores using irt.fa() and scoreIrt(), but it involved converting my ordered categories to numeric so I'm not sure it's valid. Any advice would be much appreciated!
# convert the ordered factors back to numeric before building the matrix
for(i in 1:length(dat)){ dat[,i] <- as.numeric(dat[,i]) }
x <- as.matrix(dat)
fairt <- irt.fa(x = x, nfactors = 2, correct = TRUE, plot = TRUE,
                n.obs = NULL, rotate = "varimax", fm = "ml", sort = FALSE)
scoreIrt(stats = fairt, items = dat, cut = 0.2, mod = "logistic")
That's an interesting problem. Regular factor analysis assumes your input measures are interval or ratio scaled. With ordinal variables you have a few options: you could use an IRT-based approach (in which case you'd be using something like the Graded Response Model), or do as you do in your example and use the polychoric correlation matrix as the input to the factor analysis. You can see more discussion of this issue here.
Most factor analysis packages have a method for getting factor scores, but they will give you different output depending on what you use as input. For example, normally you can just use factor.scores() to get your expected factor scores, but only if you input your original raw-score data. The problem here is the requirement to use the polychoric matrix as the input.
I'm not 100% sure (and someone please correct me if I'm wrong), but I think the following should be OK in your situation:
dat <- read.csv("https://raw.githubusercontent.com/paulrconnor/datasets/master/data.csv")
dat_orig <- dat
# convert each column to a factor so hetcor computes polychoric correlations
for(i in 1:length(dat)){
  dat[,i] <- as.factor(dat[,i])
}
# compute polychoric correlations
pc <- hetcor(dat, ML = TRUE)
# run FA
faPC <- fa(r=pc$correlations, nfactors = 2, rotate="varimax",fm="ml")
factor.scores(dat_orig, faPC)
In essence what you're doing is:
1. Calculate the polychoric correlation matrix.
2. Use that matrix to run the factor analysis, extracting 2 factors and their loadings.
3. Use the loadings from the FA together with the raw (numeric) data to get factor scores.
Both this method and the method you use in your edit treat the original data as numeric rather than as factors. I think this should be OK: you're just projecting your raw data onto the factors identified by the FA, and those loadings already account for the ordinal nature of your variables, since the polychoric matrix was the input to the FA. The post linked above cautions against this approach, however, and suggests some alternatives; this is not a straightforward problem to solve.
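For intuition about what factor.scores() computes by default, the regression (Thurstone) method is just F = Z R^{-1} L: standardized data times the inverse of the correlation matrix times the loadings. A minimal base-R sketch; Z, R and L here are made up for illustration, not taken from the dataset above (in your case R would be the polychoric matrix and L the loadings from fa()):

```r
set.seed(1)
# toy raw data: 100 respondents, 4 numeric items, standardized
Z <- scale(matrix(rnorm(400), ncol = 4))
R <- cor(Z)                    # in your problem: the polychoric matrix
L <- matrix(c(.8, .7, .1, .2,  # made-up loadings, 4 items x 2 factors
              .1, .2, .8, .7), ncol = 2)
W <- solve(R, L)               # regression weights: R^{-1} %*% L
F_scores <- Z %*% W            # one score per respondent per factor
dim(F_scores)                  # 100 x 2
```

This is only the default scoring method; factor.scores() in psych supports several others via its method argument.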
I have a dataset where I've measured gene expression of 21 genes and also measured the output of 3 other assays. I have measured these for 8 different clones. I have also measured these on 5 different days.
However, I haven't measured every gene or assay on every day, or for every clone, so I have datasets of varying lengths. To combine them into one large dataset for a PCA, I melted each dataset, row-bound the results, and then standardized all the values. I now have a dataset that looks like the one below.
What I want is a PCA in which each level of "group" is treated as a variable. Then I'd like to create graphs where different colors of data points represent different clones or days. I've pasted my sad attempt to get that working below. Any help would be appreciated!
set.seed(1)
# Creates variables for a dataset
clone <- sample(c(rep(c("1A","2A","2B","3B","3C"), each=100),rep(c("1B","2C","3A"), each=200)))
day <- sample(c(rep(1,225),rep(2,25),rep(3,600),rep(4,25),rep(5,225)))
group <- sample(c(rep(paste0("gene",1:21), each=42),rep("assay1",90),rep("assay2",80),rep("assay3",48)))
value <- rnorm(1100, mean = 0, sd = 3)
# Create data frame from variables
df <- data.frame(clone,day,group,value)
df$day <- as.factor(df$day)
# Create PCA data
df_PCA <- prcomp(clone + day + group ~ value, data = df, scale = FALSE)
# Graphing results of PCA
par(mfrow=c(2,3))
plot(df_PCA$x[,1:2], col=clone)
plot(df_PCA$x[,1:2], col=day)
plot(df_PCA$x[,1:3], col=clone)
plot(df_PCA$x[,1:3], col=day)
plot(df_PCA$x[,2:3], col=clone)
plot(df_PCA$x[,2:3], col=day)
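For what it's worth, prcomp() expects a numeric matrix (rows = observations, columns = variables) rather than a formula mixing grouping factors with a single value column, so the melted data has to be cast back to wide format first. A hedged sketch along those lines, starting from the same simulated df; averaging replicates and mean-imputing the unmeasured clone/day/group cells are my assumptions, not something the data dictates:

```r
set.seed(1)
clone <- sample(c(rep(c("1A","2A","2B","3B","3C"), each = 100), rep(c("1B","2C","3A"), each = 200)))
day   <- sample(c(rep(1, 225), rep(2, 25), rep(3, 600), rep(4, 25), rep(5, 225)))
group <- sample(c(rep(paste0("gene", 1:21), each = 42), rep("assay1", 90), rep("assay2", 80), rep("assay3", 48)))
value <- rnorm(1100, mean = 0, sd = 3)
df <- data.frame(clone, day, group, value)

# average replicates, then cast: one row per clone-day, one column per gene/assay
agg  <- aggregate(value ~ clone + day + group, data = df, FUN = mean)
wide <- reshape(agg, idvar = c("clone", "day"), timevar = "group", direction = "wide")
num  <- wide[ , -(1:2)]
# unmeasured combinations come out NA; impute with the column mean (an assumption)
num[] <- lapply(num, function(v) replace(v, is.na(v), mean(v, na.rm = TRUE)))

pca <- prcomp(num, scale. = TRUE)
# color points by clone or by day
plot(pca$x[ , 1:2], col = factor(wide$clone), pch = 19)
plot(pca$x[ , 1:2], col = factor(wide$day),  pch = 19)
```

Note that col = needs a color vector or a factor (whose integer codes index the palette); passing the character vector clone directly, as in the attempt above, won't work.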
I was wondering why, when I run cor.test on a data frame containing only values of 0 to get a correlation similarity matrix, it returns NAs. The 0s in this data frame represent the actual abundance of a widget based on our measurements, so I would expect each pair of correlated vectors to have a squared correlation estimate of 1. I think I don't have a good grasp of what a correlation is and how cor.test implements it. Any advice or help in understanding correlations and cor.test would be great. Code is below.
corpij <- function(i, j, data) { (cor.test(data[,i], data[,j])$estimate)^2 }
corp <- Vectorize(corpij, vectorize.args = list("i", "j"))
n <- ncol(ready_cor)  # ready_cor is the data frame being correlated
r2_scores <- outer(1:n, 1:n, corp, ready_cor)
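A minimal example of what I'm seeing, using cor() directly: Pearson's r divides the covariance by the product of the two standard deviations, and a constant (all-zero) vector has standard deviation 0, so the estimate is 0/0, which R reports as NA (with a warning) rather than 1.

```r
x <- rep(0, 10)  # constant "abundance" vector
y <- rep(0, 10)
sd(x)                        # 0: no variability at all
suppressWarnings(cor(x, y))  # NA: the denominator sd(x) * sd(y) is 0
```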
Best