Factor scores from factor analysis on ordinal categorical data in R - r

I'm having trouble computing factor scores from an exploratory factor analysis on ordered categorical data. I've managed to assess how many factors to draw, and to run the factor analysis using the psych package, but can't figure out how to get factor scores for individual participants, and haven't found much help online. Here is where I'm stuck:
library(polycor)
library(nFactors)
library(psych)
# load data
dat <- read.csv("https://raw.githubusercontent.com/paulrconnor/datasets/master/data.csv")
# convert to ordered factors
for(i in 1:length(dat)){
dat[,i] <- as.factor(dat[,i])
}
# compute polychoric correlations
pc <- hetcor(dat,ML=T)
# 2. choose number of factors
ev <- eigen(pc)
ap <- parallel(subject = nrow(dat),
var=ncol(dat),rep=100,cent=.05)
nS <- nScree(x = ev$values, aparallel = ap$eigen$qevpea)
dev.new(height=4,width=6,noRStudioGD = T)
plotnScree(nS) # 2 factors, maybe 1
# run FA
faPC <- fa(r=pc$correlations, nfactors = 2, rotate="varimax",fm="ml")
faPC$loadings
Edit: I've found a way to get scores using irt.fa() and scoreIrt(), but it involved converting my ordered categories to numeric so I'm not sure it's valid. Any advice would be much appreciated!
x = as.matrix(dat)
fairt <- irt.fa(x = x,nfactors=2,correct=TRUE,plot=TRUE,n.obs=NULL,rotate="varimax",fm="ml",sort=FALSE)
for(i in 1:length(dat)){dat[,i] <- as.numeric(dat[,i])}
scoreIrt(stats = fairt, items = dat, cut = 0.2, mod="logistic")

That's an interesting problem. Regular factor analysis assumes your input measures are ratio or interval scaled. In the case of ordinal variables, you have a few options. You could either use an IRT based approach (in which case you'd be using something like the Graded Response Model), or to do as you do in your example and use the polychoric correlation matrix as the input to factor analysis. You can see more discussion of this issue here
Most factor analysis packages have a method for getting factor scores, but will give you different output depending on what you choose to use as input. For example, normally you can just use factor.scores() to get your expected factor scores, but only if you input your original raw score data. The problem here is the requirement to use the polychoric matrix as input
I'm not 100% sure (and someone please correct me if I'm wrong), but I think the following should be OK in your situation:
dat <- read.csv("https://raw.githubusercontent.com/paulrconnor/datasets/master/data.csv")
dat_orig <- dat
#convert to ordered factors
for(i in 1:length(dat)){
dat[,i] <- as.factor(dat[,i])
}
# compute polychoric correlations
pc <- hetcor(dat,ML=T)
# run FA
faPC <- fa(r=pc$correlations, nfactors = 2, rotate="varimax",fm="ml")
factor.scores(dat_orig, faPC)
In essence what you're doing is:
Calculate the polychoric correlation matrix
Use that matrix to conduct the factor analysis and extract 2 factors and associated loadings
Use the loadings from the FA and the raw (numeric) data to get your factor scores
Both this method, and the method you use in your edit, treat the original data as numeric rather than factors. I think this should be OK because you're just taking your raw data and projecting it down on the factors identified by the FA, and the loadings there are already taking into account the ordinal nature of your variables (as you used the polychoric matrix as input into FA). The post linked above cautions against this approach, however, and suggests some alternatives, but this is not a straightforward problem to solve

Related

How to convert random forest prediction probabilities to a single classified response?

I have many large random forest classification models (~60min run time each) that are used for prediction of a raster using the type="prob" option. I am happy with the raster output (probabilities for each of x classes as a raster stack). However, I would like a simple way to covert these probabilities (a raster stack with x layers, where x is the number of classes) to a simple one layer classification (i.e. winners only, no probabilities). This would be equivalent of type="response".
Here is a simple example (which is not a raster, but still applies):
library(randomForest)
data(iris)
set.seed(111)
ind <- sample(2, nrow(iris), replace = TRUE, prob=c(0.8, 0.2))
iris.rf <- randomForest(Species ~ ., data=iris[ind == 1,])
iris.prob <- predict(iris.rf, type="prob")
iris.resp <- predict(iris.rf, type="response")
What is the most efficient way to use the iris.prob object to get the equivalent output of iris.resp without rerunning the randomforests (which, in my case with many large rasters, would take too many hours)?
Thanks in advance
If you are trying to determine the max of multiple columns, with the same general format as iris.prob I would try to find the max from each row and return the colname.
colnames(iris.prob)[max.col(iris.prob,ties.method="first")]
Got the exact usage from another thread so if this isn't working you might try another response
iris.prob should contains a classification result, with the probability that one observation is classified in one category. So you just need to extract the colname of the maximum value of each row.
Eg :
iris.resp2 = colnames(iris.prob)[apply(iris.prob,1,which.max)]
iris.resp2 == as.character(iris.resp) should return TRUE everytime

Missing Categories in Validation Data

I built a classification model in R based on training dataset with 12 categorical predictors, each variable holds tens to hundreds of categories.
The problem is that in the dataset I use for validation, some of the variables has less categories than in the training data.
For example, if I have in the training data variable v1 with 3 categories - 'a','b','c', in the validation dataset v1 has only 2 categories - 'a','b'.
In tree based methods like decision tree or random forest it makes no problem, but in logistic regression methods (I use LASSO) that require a preparation of a dummy variables matrix, the number of columns in the training data matrix and validation data matrix doesn't match. If we go back to the example of variable v1, in the training data I get three dummy variables for v1, and in the validation data I get only 2.
Any idea how to solve this?
You can try to avoid this problem by setting the levels correctly. Look at following very stupid example:
set.seed(106)
thedata <- data.frame(
y = rnorm(100),
x = factor(sample(letters[1:3],100,TRUE))
)
head(model.matrix(y~x, data = thedata))
thetrain <- thedata[1:7,]
length(unique(thetrain$x))
head(model.matrix(y~x, data = thetrain))
I make a dataset with a x and a y variable, and x is a factor with 3 levels. The training dataset only has 2 levels of x, but the model matrix is still constructed correctly. That is because R kept the level data of the original dataset:
> levels(thetrain$x)
[1] "a" "b" "c"
The problem arises when your training set is somehow constructed using eg the function data.frame() or any other method that drops the levels information of the factor.
Try the following:
thetrain$x <- factor(thetrain$x) # erases the levels
levels(thetrain$x)
head(model.matrix(y~x, data = thetrain))
You see in the second line that the level "b" has been dropped, and consequently the model matrix isn't what you want any longer. So make sure that all factors in your training dataset actually have all levels, eg:
thetrain$x <- factor(thetrain$x, levels = c("a","b","c"))
On a sidenote: if you build your model matrices yourself using either model.frame() or model.matrix(), the argument xlev might be of help:
thetrain$x <- factor(thetrain$x) # erases the levels
levels(thetrain$x)
head(model.matrix(y~x, data = thetrain,
xlev = list(x = c('a','b','c'))))
Note that this xlev argument is actually from model.frame, and model.matrix doesn't call model.frame in every case. So that solution is not guaranteed to always work, but it should for data frames.

Lasso, glmnet, preprocessing of the data

Im trying to use the glmnet package to fit a lasso (L1 penalty) on a model with a binary outcome (a logit). My predictors are all binary (they're 1/0 not ordered, ~4000) except for one continuous variable.
I need to convert the predictors into a sparse matrix, since it takes forever and a day otherwise.
My question is: it seems that people are using sparse.model.matrix rather than just converting their matrix into a sparse matrix. Why is that? and do I need to do this here? Outcome is a little different for both methods.
Also, do my factors need to be coded as factors (when it comes to both the outcome and the predictors) or is it sufficient to use the sparse matrix and specify in the glmnet model that the outcome is binomial?
Here's what im doing so far
#Create a random dataset, y is outcome, x_d is all the dummies (10 here for simplicity) and x_c is the cont variable
y<- sample(c(1:0), 200, replace = TRUE)
x_d<- matrix(data= sample(c(1:0), 2000, replace = TRUE), nrow=200, ncol=10)
x_c<- sample(60:90, 200, replace = TRUE)
#FIRST: scale that one cont variable.
scaled<-scale(x_c,center=TRUE, scale=TRUE)
#then predictors together
x<- cbind(x_d, scaled)
#HERE'S MY MAIN QUESTION: What i currently do is:
xt<-Matrix(x , sparse = TRUE)
#then run the cross validation...
cv_lasso_1<-cv.glmnet(xt, y, family="binomial", standardize=FALSE)
#which gives slightly different results from (here the outcome variable is in the x matrix too)
xt<-sparse.model.matrix(data=x, y~.)
#then run CV.
So to sum up my 2 questions are:
1-Do i need to use sparse.model.matrix even if my factors are just binary and not ordered? [and if yes what does it actually do differently from just converting the matrix to a sparse matrix]
2- Do i need to code the binary variables as factors?
the reason i ask that is my dataset is huge. it saves a lot of time to just do it without coding as factors.
I don't think you need a sparse.model.matrix, as all that it really gives you above a regular matrix is expansion of factor terms, and if you're binary already that won't give you anything. You certainly don't need to code as factors, I frequently use glmnet on a regular (non-model) sparse matrix with only 1's. At the end of the day glmnet is a numerical method, so a factor will get converted to a number in the end regardless.

multiclass.roc from predict.gbm

I am having a hard time understanding how to format and utilize the output from predict.gbm ('gbm' package) with the multiclass.roc function ('pROC' packagage).
I used a multinomial gbm to predict a validation dataset, the output of which appears to be probabilities of each datapoint of belonging to each factor level. (Correct me if I am wrong)
preds2 <- predict.gbm(density.tc5.lr005, ProxFiltered, n.trees=best.iter, type="response")
> head(as.data.frame(preds2))
1.2534 2.2534 3.2534 4.2534 5.2534
1 0.62977743 0.25756095 0.09044278 0.021497259 7.215793e-04
2 0.16992912 0.24545691 0.45540153 0.094520208 3.469224e-02
3 0.02633356 0.06540245 0.89897614 0.009223098 6.474949e-05
The factor levels are 1-5, not sure why the decimal addition
I am trying to compute the multi-class AUC as deļ¬ned by Hand and Till (2001) using multiclass.roc but I'm not sure how to supply the predicted values in the single vector it requires.
I can try to work up an example if necessary, though I assume this is routine for some and I am missing something as a novice with the procedure.
Pass in the response variable as-is, and use the most likely candidate for the predictor:
multiclass.roc(ProxFiltered$response_variable, apply(preds2, 1, function(row) which.max(row)))
An alternative is to define a custom scoring function - for instance the ratio between the probabilities of two classes and to do the averaging yourself:
names(preds2) <- 1:5
aucs <- combn(1:5, 2, function(X) {
auc(roc(ProxFiltered$response_variable, preds2[[X[1]]] / preds2[[X[2]]], levels = X))
})
mean(aucs)
Yet another (better) option is to convert your question to a non-binary one, i.e is the best prediction (or some weighted-best prediction) correlated with the true class?

Is it possible to specify a range for numbers randomly generated by mvrnorm( ) in R?

I am trying to generate a random set of numbers that exactly mirror a data set that I have (to test it). The dataset consists of 5 variables that are all correlated with different means and standard deviations as well as ranges (they are likert scales added together to form 1 variable). I have been able to get mvrnorm from the MASS package to create a dataset that replicated the correlation matrix with the observed number of observations (after 500,000+ iterations), and I can easily reassign means and std. dev. through z-score transformation, but I still have specific values within each variable vector that are far above or below the possible range of the scale whose score I wish to replicate.
Any suggestions how to fix the range appropriately?
Thank you for sharing your knowledge!
To generate a sample that does "exactly mirror" the original dataset, you need to make sure that the marginal distributions and the dependence structure of the sample matches those of the original dataset.
A simple way to achieve this is with resampling
my.data <- matrix(runif(1000, -1, 2), nrow = 200, ncol = 5) # Some dummy data
my.ind <- sample(1:nrow(my.data), nrow(my.data), replace = TRUE)
my.sample <- my.data[my.ind, ]
This will ensure that the margins and the dependence structure of the sample (closely) matches those of the original data.
An alternative is to use a parametric model for the margins and/or the dependence structure (copula). But as staded by #dickoa, this will require serious modeling effort.
Note that by using a multivariate normal distribution, you are (implicity) assuming that the dependence structure of the original data is the Gaussian copula. This is a strong assumption, and it would need to be validated beforehand.

Resources