Lasso, glmnet, and preprocessing of the data in R

I'm trying to use the glmnet package to fit a lasso (L1 penalty) on a model with a binary outcome (a logit). My predictors are all binary (1/0, not ordered; about 4,000 of them) except for one continuous variable.
I need to convert the predictors into a sparse matrix, because fitting on a dense matrix takes far too long.
My question is: people seem to use sparse.model.matrix rather than simply converting their matrix into a sparse matrix. Why is that, and do I need to do it here? The output is slightly different for the two methods.
Also, do my variables need to be coded as factors (both the outcome and the predictors), or is it sufficient to use the sparse matrix and specify family = "binomial" in the glmnet call?
Here's what I'm doing so far:
#Create a random dataset: y is the outcome, x_d holds the dummies (10 here for simplicity) and x_c is the continuous variable
library(Matrix)
library(glmnet)
y   <- sample(0:1, 200, replace = TRUE)
x_d <- matrix(sample(0:1, 2000, replace = TRUE), nrow = 200, ncol = 10)
x_c <- sample(60:90, 200, replace = TRUE)
#FIRST: scale the one continuous variable
scaled <- scale(x_c, center = TRUE, scale = TRUE)
#then bind the predictors together
x <- cbind(x_d, scaled)
#HERE'S MY MAIN QUESTION: what I currently do is
xt <- Matrix(x, sparse = TRUE)
#then run the cross-validation...
cv_lasso_1 <- cv.glmnet(xt, y, family = "binomial", standardize = FALSE)
#which gives slightly different results from the following (here the outcome has to sit in a data frame together with the predictors)
xt <- sparse.model.matrix(y ~ ., data = data.frame(y, x))
#then run the CV on that.
So to sum up, my two questions are:
1. Do I need to use sparse.model.matrix even though my factors are just binary and not ordered? If yes, what does it actually do differently from simply converting the matrix to a sparse matrix?
2. Do I need to code the binary variables as factors?
The reason I ask is that my dataset is huge, and it saves a lot of time to skip the factor coding.

I don't think you need sparse.model.matrix: all it really gives you beyond a regular sparse matrix is the expansion of factor terms into dummy columns, and if your predictors are already binary 0/1 that gains you nothing. You certainly don't need to code them as factors; I frequently use glmnet on a regular (non-model) sparse matrix of 0/1 values. At the end of the day glmnet is a numerical method, so a factor would get converted to numbers in the end regardless.
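As a quick illustration of the difference (a minimal sketch with made-up variable names, not the asker's data): on a purely 0/1 numeric matrix the two routes give you the same columns, but once a genuine factor is present, sparse.model.matrix expands it into dummy columns (and adds an intercept column), while a plain Matrix() conversion does not.
library(Matrix)
d <- data.frame(b1 = c(0, 1, 1), b2 = c(1, 0, 1), g = factor(c("a", "b", "c")))
# direct conversion keeps the numeric columns exactly as they are
m1 <- Matrix(as.matrix(d[, c("b1", "b2")]), sparse = TRUE)
# sparse.model.matrix expands the factor g into dummy columns and adds an intercept
m2 <- sparse.model.matrix(~ b1 + b2 + g, data = d)
colnames(m1)  # "b1" "b2"
colnames(m2)  # "(Intercept)" "b1" "b2" "gb" "gc"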

Related

penalty.factor with model matrix input

I am trying to fit a model using glmnet. For the data input I am converting my data to sparse.model.matrix format using a model formula. I am trying to de-regularize one of the variables I wish to include as a control, but I cannot get the penalty.factor argument to work. First, I am not sure how long the vector needs to be: the model matrix has a column for each level of my original variable, so do I need to specify a penalty.factor entry for each level? I believe I have tried both; the longer penalty vector seems to do nothing, while the shorter one results in a convergence error. The code is set up as below:
X <- sparse.model.matrix(model.formula, data)
fit <- glmnet::cv.glmnet(X, y, family = "poisson", type.multinomial = "ungrouped" , penalty.factor = reg.weights)
Yes, you're on the right track. penalty.factor should be a vector with one entry per column of your sparse model matrix (i.e. one per expanded dummy column, not one per original variable). In the example below it would need to be length 8:
dim(X)
[1] 32 8
If you're getting convergence issues, that's unfortunately a separate problem, and not necessarily related to penalty.factor.
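For concreteness, here is a minimal sketch of the mechanics using mtcars rather than your data (the variable choices are illustrative only): build a penalty.factor with one entry per expanded column, and set the entries for the control variable's dummy columns to 0 so they are never penalized.
library(glmnet)
library(Matrix)
# treat cyl as a factor so it expands into one dummy column per level
d <- transform(mtcars[, -1], cyl = factor(cyl))
X <- sparse.model.matrix(~ ., data = d)[, -1]   # drop the intercept column
y <- mtcars$mpg
# one entry per column of X; 0 = unpenalized, 1 = usual lasso penalty
reg.weights <- rep(1, ncol(X))
reg.weights[grep("^cyl", colnames(X))] <- 0     # de-regularize the cyl dummies
fit <- cv.glmnet(X, y, penalty.factor = reg.weights)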

Factor scores from factor analysis on ordinal categorical data in R

I'm having trouble computing factor scores from an exploratory factor analysis on ordered categorical data. I've managed to assess how many factors to draw, and to run the factor analysis using the psych package, but can't figure out how to get factor scores for individual participants, and haven't found much help online. Here is where I'm stuck:
library(polycor)
library(nFactors)
library(psych)
# load data
dat <- read.csv("https://raw.githubusercontent.com/paulrconnor/datasets/master/data.csv")
# convert to ordered factors
for (i in 1:length(dat)) {
  dat[, i] <- as.factor(dat[, i])
}
# compute polychoric correlations
pc <- hetcor(dat, ML = TRUE)
# choose number of factors
ev <- eigen(pc$correlations)
ap <- parallel(subject = nrow(dat), var = ncol(dat), rep = 100, cent = .05)
nS <- nScree(x = ev$values, aparallel = ap$eigen$qevpea)
dev.new(height = 4, width = 6, noRStudioGD = TRUE)
plotnScree(nS)  # 2 factors, maybe 1
# run FA
faPC <- fa(r = pc$correlations, nfactors = 2, rotate = "varimax", fm = "ml")
faPC$loadings
Edit: I've found a way to get scores using irt.fa() and scoreIrt(), but it involved converting my ordered categories to numeric so I'm not sure it's valid. Any advice would be much appreciated!
x <- as.matrix(dat)
fairt <- irt.fa(x = x, nfactors = 2, correct = TRUE, plot = TRUE, n.obs = NULL,
                rotate = "varimax", fm = "ml", sort = FALSE)
for (i in 1:length(dat)) { dat[, i] <- as.numeric(dat[, i]) }
scoreIrt(stats = fairt, items = dat, cut = 0.2, mod = "logistic")
That's an interesting problem. Regular factor analysis assumes your input measures are interval or ratio scaled. With ordinal variables you have a few options: you could use an IRT-based approach (in which case you'd be using something like the Graded Response Model), or do as you do in your example and use the polychoric correlation matrix as the input to the factor analysis. You can see more discussion of this issue here.
Most factor analysis packages have a method for getting factor scores, but they give different output depending on what you use as input. For example, normally you could just call factor.scores() to get expected factor scores, but only if you input your original raw-score data. The complication here is the requirement to use the polychoric matrix as input.
I'm not 100% sure (and someone please correct me if I'm wrong), but I think the following should be OK in your situation:
dat <- read.csv("https://raw.githubusercontent.com/paulrconnor/datasets/master/data.csv")
dat_orig <- dat
# convert to ordered factors
for (i in 1:length(dat)) {
  dat[, i] <- as.factor(dat[, i])
}
# compute polychoric correlations
pc <- hetcor(dat, ML = TRUE)
# run FA
faPC <- fa(r = pc$correlations, nfactors = 2, rotate = "varimax", fm = "ml")
factor.scores(dat_orig, faPC)
In essence what you're doing is:
1. Calculate the polychoric correlation matrix.
2. Use that matrix to run the factor analysis, extracting 2 factors and their loadings.
3. Use those loadings together with the raw (numeric) data to get the factor scores.
Both this method and the method you use in your edit treat the original data as numeric rather than as factors. I think that should be OK, because you're just projecting your raw data onto the factors identified by the FA, and the loadings already take the ordinal nature of your variables into account (since the polychoric matrix was the input to the FA). The post linked above cautions against this approach, however, and suggests some alternatives; this is not a straightforward problem to solve.
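For what it's worth, the list returned by factor.scores() keeps the per-participant scores in its scores element, so (assuming the raw CSV columns are numeric, as they appear to be) you can pull them out like this:
fs <- factor.scores(dat_orig, faPC)
head(fs$scores)   # one row per participant, one column per factor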

R Function for Rounding Imputed Binary Variables

There is an ongoing discussion about reliable methods for rounding imputed binary variables; still, the so-called Adaptive Rounding Procedure developed by Bernaards and colleagues (2007) is currently the most widely accepted solution.
The Adaptive Rounding Procedure involves a normal approximation to a binomial distribution. That is, the imputed values in a binary variable are assigned the value 0 or 1 according to the threshold derived from the formula below, where x is the vector of imputed values for that variable:
threshold <- mean(x) - qnorm(mean(x))*sqrt(mean(x)*(1-mean(x)))
To the best of my knowledge, the major R packages for imputation (such as Amelia or mice) do not yet include functions that help with rounding binary variables. This shortcoming makes things difficult especially for researchers who intend to use the imputed values in a logistic regression analysis, given that their dependent variable is coded as binary.
Therefore, it makes sense to write an R function for the Bernaards formula above:
bernaards <- function(x) {
  mean(x) - qnorm(mean(x)) * sqrt(mean(x) * (1 - mean(x)))
}
With this function, it is much easier to calculate the threshold for an imputed binary variable with a mean of, say, .623:
bernaards(.623)
[1] 0.4711302
After calculating the threshold, the usual next step is to round the imputed values in variable x.
My question is: how can the above function be extended to include that task as well?
In other words, one can do all of the above in R with three lines of code:
threshold <- mean(x) - qnorm(mean(x)) * sqrt(mean(x) * (1 - mean(x)))
df$x[df$x >= threshold] <- 1
df$x[df$x < threshold] <- 0
It would be best if the function included this recoding/rounding as well, since repeating the process for each binary variable is time-consuming, especially when working with large data sets. With such a function, one could simply run one extra line of code after imputation (as below) and continue with the analyses:
bernaards(dummy1, dummy2, dummy3)
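One way to extend the function along these lines (a minimal sketch; the data frame df and the dummy column names are taken from the example call above and are otherwise assumptions):
# round one imputed binary variable with the adaptive (Bernaards) threshold
bernaards_round <- function(x) {
  threshold <- mean(x) - qnorm(mean(x)) * sqrt(mean(x) * (1 - mean(x)))
  as.integer(x >= threshold)
}
# apply it to several imputed dummies in a data frame at once
df[c("dummy1", "dummy2", "dummy3")] <- lapply(df[c("dummy1", "dummy2", "dummy3")], bernaards_round)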

SVD with missing values in R

I am performing an SVD analysis in R, but I have a matrix with structural NA values. Is it possible to obtain an SVD decomposition in this case? Are there alternative solutions? Thanks in advance.
You might want to try the SVDmiss function in the SpatioTemporal package, which imputes the missing values and then computes the SVD on the imputed matrix. Check this link: SVDmiss Function
However, you might want to be wary of the nature of your data and whether missing-value imputation makes sense in your case.
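If you prefer to avoid a dependency, the basic idea behind this kind of approach can be sketched in a few lines of base R (a rough illustration of iterative imputation, not the SVDmiss implementation itself): fill the NAs with column means, compute a rank-k SVD, rebuild the matrix from it, and iterate.
impute_svd <- function(X, k = 2, n_iter = 50) {
  miss <- is.na(X)
  # start by filling the missing cells with column means
  col_means <- colMeans(X, na.rm = TRUE)
  X[miss] <- col_means[col(X)[miss]]
  for (i in seq_len(n_iter)) {
    s <- svd(X, nu = k, nv = k)
    X_hat <- s$u %*% diag(s$d[1:k], k, k) %*% t(s$v)
    X[miss] <- X_hat[miss]   # update only the originally missing cells
  }
  list(X_filled = X, svd = svd(X))
}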
I have tried using the SVM in R with NA values without success.
Sometimes the missing values are important in the analysis, so I usually transform my data as follows:
If you have lots of variables, try to reduce their number first (clustering, lasso, etc.).
Transform the remaining predictors like this (a rough sketch of the quantitative case is given after the list):
- for quantitative variables:
  - calculate deciles per predictor (leaving missing obs out)
  - calculate the frequency of Y per decile (assuming Y is qualitative)
  - regroup the deciles into 2/3/4 groups by the similarity of their Y frequencies (you can also do this by looking at their plot)
  - create a new binary variable for each group (e.g. X11 = 1 if X1 takes values in the interval ...)
  - calculate the Y frequency for the missing obs of that predictor
  - join the missing-obs category to the group that has the closest Y frequency
- for qualitative variables:
  - if you have variables with lots of levels, cluster the levels by the Y variable
  - for variables with fewer levels, you can calculate the Y frequency per class
  - regroup the classes as above
  - do the same for the missing obs and attach them to the most similar group of non-missing classes
  - recode the variable as in the numeric case
There, now you have a complete database of dummy variables and you can go on to fit SVMs, neural networks, LASSO, etc.
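A rough sketch of the quantitative-variable recipe for a single predictor (the names x and y are illustrative, and the final regrouping of the resulting levels into 2-4 groups is left as a manual step):
recode_by_deciles <- function(x, y) {
  # deciles computed on the non-missing observations only
  brks <- quantile(x, probs = seq(0, 1, 0.1), na.rm = TRUE)
  dec  <- cut(x, breaks = unique(brks), include.lowest = TRUE)
  # frequency of Y (assumed binary/qualitative) per decile
  freq <- tapply(y, dec, function(v) mean(v == levels(factor(y))[2]))
  # missing obs get attached to the decile with the closest Y frequency
  freq_na <- mean(y[is.na(x)] == levels(factor(y))[2])
  dec <- as.character(dec)
  dec[is.na(dec)] <- names(freq)[which.min(abs(freq - freq_na))]
  factor(dec)   # regroup these levels into 2-4 groups and create dummies from them
}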

multiclass.roc from predict.gbm

I am having a hard time understanding how to format and utilize the output from predict.gbm (gbm package) with the multiclass.roc function (pROC package).
I used a multinomial gbm to predict a validation dataset; the output appears to be the probability of each data point belonging to each factor level. (Correct me if I am wrong.)
preds2 <- predict.gbm(density.tc5.lr005, ProxFiltered, n.trees=best.iter, type="response")
> head(as.data.frame(preds2))
1.2534 2.2534 3.2534 4.2534 5.2534
1 0.62977743 0.25756095 0.09044278 0.021497259 7.215793e-04
2 0.16992912 0.24545691 0.45540153 0.094520208 3.469224e-02
3 0.02633356 0.06540245 0.89897614 0.009223098 6.474949e-05
The factor levels are 1-5; I'm not sure why the decimals are appended to the column names.
I am trying to compute the multi-class AUC as defined by Hand and Till (2001) using multiclass.roc, but I'm not sure how to supply the predicted values in the single vector it requires.
I can try to work up an example if necessary, though I assume this is routine for some and that I'm missing something as a novice with the procedure.
Pass in the response variable as-is, and use the most likely class as the predictor:
multiclass.roc(ProxFiltered$response_variable, apply(preds2, 1, which.max))
An alternative is to define a custom scoring function, for instance the ratio between the probabilities of two classes, and do the averaging yourself:
preds2 <- as.data.frame(preds2)   # one column per class, as in the head() output above
names(preds2) <- 1:5
aucs <- combn(1:5, 2, function(X) {
  auc(roc(ProxFiltered$response_variable, preds2[[X[1]]] / preds2[[X[2]]], levels = X))
})
mean(aucs)
Yet another (better) option is to reframe the question in a non-binary way, i.e. ask whether the best prediction (or some weighted-best prediction) is correlated with the true class.
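As a rough sketch of that last idea (assuming preds2 is in the data-frame form used above and ProxFiltered$response_variable holds the true classes), you could simply compare the most likely class with the truth:
pred_class <- apply(preds2, 1, which.max)                          # predicted class 1..5
actual <- as.integer(as.character(ProxFiltered$response_variable))
table(predicted = pred_class, actual = actual)                     # confusion matrix
mean(pred_class == actual)                                         # overall accuracy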
