I want to perform LASSO for a Cox PH model in R for variable selection.
I found the code below somewhere and did my analysis with it, but elsewhere I read that it is for elastic net. Could someone please confirm that I am using the right code?
lasso <- cv.glmnet(xmat, ysurv, alpha = 1, family = "cox", nfolds = 30)
The help page for cv.glmnet() (type ?cv.glmnet in R, or go through the help system in RStudio) isn't much use here because the alpha parameter is passed through to glmnet(); the relevant documentation is in ?glmnet:
alpha: The elasticnet mixing parameter, with 0 <= alpha <= 1. The penalty is defined as

    (1 - alpha)/2 ||beta||_2^2 + alpha ||beta||_1.

'alpha = 1' is the lasso penalty, and 'alpha = 0' the ridge penalty.
So alpha=1 is lasso (as described there), and alpha=0.95 is a mixture that is mostly lasso (L1) with a little bit of ridge (L2) mixed in.
I doubt there's much difference between 10-fold and 30-fold cross-validation. The reasons you might choose a different number of folds are (1) computational efficiency (computation grows with the number of folds, unless there is some trick for computing the CV score without refitting the model, as is often the case for LOOCV) and (2) the bias-variance tradeoff; see section 5.1.4 of An Introduction to Statistical Learning with Applications in R.
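For what it's worth, here is a minimal sketch of that setup (xmat and ysurv are the objects from your call; I'm assuming ysurv is a survival::Surv response, which is what family = "cox" expects):

library(glmnet)
library(survival)

# xmat: numeric predictor matrix; ysurv: Surv(time, status) response
fit_cv <- cv.glmnet(xmat, ysurv, family = "cox", alpha = 1, nfolds = 10)

plot(fit_cv)                   # cross-validated deviance vs log(lambda)
coef(fit_cv, s = "lambda.min") # nonzero rows = variables selected by the lasso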
Follow-up questions that are more statistical or data-sciencey than computational should probably go to CrossValidated.
This is my first experience using mixed models in R for statistical analysis. Since my data consist of binary outcome variables, I have built a logistic model using the glmer function from the lme4 package, and I think it works as I wanted it to.
I am now aiming to investigate the statistical significance of my model coefficients. I have read that, in general, the best approach for generalized linear mixed models is to bootstrap confidence intervals, but I haven't managed to find a good, clear explanation of how to do this in R.
Would anyone have any suggestions? Are there any packages in R that expedite this process, or do people generally build their own functions? I haven't really done any bootstrapping before, so I'd appreciate some more in-depth answers.
If you want to compute parametric bootstrap confidence intervals, the built-in functionality
confint(fitted_model, method = "boot")
should work (see ?confint.merMod).
Also see this answer (which illustrates both parametric and nonparametric bootstrapping for user-defined quantities).
If you have multiple cores, you can speed this up by adding parallel = "multicore", ncpus = parallel::detectCores()-1 (or some other appropriate number of cores to use): see ?lme4::bootMer for details.
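Putting those pieces together, a minimal sketch (fitted_model stands in for your glmer fit; nsim sets the number of parametric-bootstrap replicates):

library(lme4)
# fitted_model <- glmer(outcome ~ predictors + (1 | group), data = dat, family = binomial)  # your model

ci_boot <- confint(fitted_model,
                   method = "boot",                       # parametric bootstrap (see ?confint.merMod)
                   nsim = 500,                            # number of bootstrap replicates
                   parallel = "multicore",                # passed through to bootMer; use "snow" on Windows
                   ncpus = parallel::detectCores() - 1)
ci_boot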
I wish to confirm my understanding of the CV procedure in the glmnet package so that I can explain it to a reviewer of my paper. I would be grateful if someone could add information to clarify the answer further.
Specifically, I had a binary classification problem with 29 input variables and 106 rows. Instead of splitting into training/test data (and further shrinking the training set), I went with the lasso, choosing lambda through cross-validation as a means of minimising overfitting. After training the model with cv.glmnet, I tested its classification accuracy on the same dataset (bootstrapped x 10000 for error intervals). I acknowledge that overfitting cannot be eliminated in this setting, but the lasso, with its penalty term chosen by cross-validation, should lessen its effect.
My explanation to the reviewer (who is a doctor like me) of how cv.glmnet does this is:
In each step of 10-fold cross-validation, data were divided randomly into two groups containing 9/10th of the data for training and 1/10th for internal validation (i.e., measuring the binomial deviance/error of the model developed with that lambda). Lambda vs. deviance was plotted. When the process was repeated 9 more times, 95% confidence intervals of lambda vs. deviance were derived. The final lambda value to go into the model was the one that gave the best compromise between high lambda and low deviance. High lambda is the factor that minimises overfitting because the regression model is not allowed to improve by assigning large coefficients to the variables. The model is then trained on the entire dataset using least squares approximation that minimises model error penalized by the lambda term. Because the lambda term is chosen through cross-validation (and not from the entire dataset), the choice of lambda is somewhat independent of the data.
I suspect my explanation can be improved considerably, or that the experts reading this will point out flaws in the methodology.
Thanks in advance.
A bit late I guess, but here goes.
By default, cv.glmnet's coef() and predict() methods use lambda.1se. It is the largest λ at which the cross-validated error is within one standard error of the minimal error. Along the lines of overfitting, this usually reduces overfitting by selecting a simpler model (fewer nonzero terms) whose error is still close to that of the model with the least error. You can also check out this post. I am not entirely sure whether this is what you mean by "The final lambda value to go into the model was the one that gave the best compromise between high lambda and low deviance."
The main issue with your approach is calculating accuracy on the same data the model was trained on. This does not tell you how well the model will perform on unseen data, and bootstrapping that same data only puts error intervals around an optimistic estimate; it does not remove the optimism. For an estimate of the error, you should actually use the error from the cross-validation. If a model trained on 90% of the data does not work on the held-out 10%, I don't see why training on all of the data would work any better.
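As a minimal illustration of where these quantities live in cv.glmnet's output (x and y stand in for your predictor matrix and binary outcome):

library(glmnet)

cvfit <- cv.glmnet(x, y, family = "binomial", type.measure = "deviance", nfolds = 10)

cvfit$lambda.min                              # lambda with the lowest cross-validated deviance
cvfit$lambda.1se                              # largest lambda within 1 SE of that minimum (sparser model)
cvfit$cvm[cvfit$lambda == cvfit$lambda.1se]   # cross-validated deviance at lambda.1se, the error estimate to report
coef(cvfit, s = "lambda.1se")                 # coefficients of the selected model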
I have a collection of documents that might have latent topics associated with them. It is likely that each document relates to one or more topics. I have a master file of all possible "topics"/categories and descriptions of these topics. I am seeking to create a model that predicts the topics for each document.
I could potentially use supervised text classification with RTextTools, but that would only help me assign each document to one category or another. I am looking for a solution that would not only determine the topic proportions for each document, but also give the term-topic/category distributions.
sLDA seems like a good fit, but it appears to predict only continuous outcomes rather than categorical ones.
LDA is more of a classification method, predicting classes. Other methods include multinomial logistic regression. LDA can be harder to train than multinomial logistic regression, for a possibly small improvement in fit.
Update: LDA is a classification method in which, unlike logistic regression where you directly model Pr(Y = k|X = x) using the logit link, prediction is based on Bayes' theorem. It is often more popular than logistic regression (and its extension to multi-class prediction, namely multinomial logistic regression) for multi-class problems.
LDA assumes that the observations in each class are drawn from a Gaussian distribution with a common covariance matrix, and so can provide some improvement over logistic regression when this assumption approximately holds. In contrast, logistic regression can outperform LDA when these Gaussian assumptions do not hold. To sum up, while both are appropriate for developing linear classification models, linear discriminant analysis makes more assumptions about the underlying data than logistic regression does, which makes logistic regression the more flexible and robust method when those assumptions do not hold. So what I meant was: it is important to understand your data well and see which method fits it better. There are good sources you can read comparing classification methods:
http://www-bcf.usc.edu/~gareth/ISL/ISLR%20Seventh%20Printing.pdf
I suggest An Introduction to Statistical Learning, in particular the classification chapter. Hope this helps.
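If it helps, here is a quick sketch of the two approaches side by side; it uses the built-in iris data purely as a stand-in multi-class example, not your documents:

library(MASS)   # lda(): linear discriminant analysis
library(nnet)   # multinom(): multinomial logistic regression

fit_lda <- lda(Species ~ ., data = iris)
fit_mlr <- multinom(Species ~ ., data = iris, trace = FALSE)

# both return class predictions; compare their in-sample agreement
table(lda = predict(fit_lda)$class, multinom = predict(fit_mlr))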
I used an ML Pipeline to run logistic regression models, but for some reason I got worse results than in R. I have done some research, and the only post I found that is related to this issue is this. It seems that Spark logistic regression returns models that minimize a loss function, while R's glm function uses maximum likelihood. The Spark model only got 71.3% of the records right, while R predicts 95.55% of the cases correctly. I was wondering whether I did something wrong in the setup and whether there is a way to improve the prediction. Below are my Spark code and R code:
Spark code
Partial model_input:
label,AGE,GENDER,Q1,Q2,Q3,Q4,Q5,DET_AGE_SQ
1.0,39,0,0,1,0,0,1,31.55709342560551
1.0,54,0,0,0,0,0,0,83.38062283737028
0.0,51,0,1,1,1,0,0,35.61591695501733
// imports assumed for this snippet (Spark ML, as of the 1.x API used in the question)
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.attribute.NominalAttribute
import org.apache.spark.sql.DataFrame

def trainModel(df: DataFrame): PipelineModel = {
  val lr = new LogisticRegression().setMaxIter(100000).setTol(0.0000000000000001)
  val pipeline = new Pipeline().setStages(Array(lr))
  pipeline.fit(df)
}
val meta = NominalAttribute.defaultAttr.withName("label").withValues(Array("a", "b")).toMetadata
val assembler = new VectorAssembler().
setInputCols(Array("AGE","GENDER","DET_AGE_SQ",
"QA1","QA2","QA3","QA4","QA5")).
setOutputCol("features")
val model = trainModel(model_input)
val pred= model.transform(model_input)
pred.filter("label!=prediction").count
R code
lr <- model_input %>%
  glm(data = ., formula = label ~ AGE + GENDER + Q1 + Q2 + Q3 + Q4 + Q5 + DET_AGE_SQ,
      family = binomial)
pred <- data.frame(y = model_input$label, p = fitted(lr))
table(pred$y, pred$p > 0.5)
Feel free to let me know if you need any other information. Thank you!
Edit 9/18/2015: I have tried increasing the maximum number of iterations and decreasing the tolerance dramatically. Unfortunately, it didn't improve the prediction. It seems the model converged to a local minimum instead of the global minimum.
"It seems that Spark Logistic Regression returns models that minimize loss function while R glm function uses maximum likelihood."
Minimization of a loss function is pretty much the definition of linear models, and both glm and ml.classification.LogisticRegression are no different here. The fundamental difference between the two is the way this is achieved.
All linear models from ML/MLlib are based on some variant of gradient descent. The quality of the model generated with this approach varies on a case-by-case basis and depends on the gradient descent and regularization parameters.
R, on the other hand, computes an exact solution which, given its time complexity, is not well suited for large datasets.
As I mentioned above, the quality of the model generated using gradient descent depends on the input parameters, so the typical way to improve it is to perform hyperparameter optimization. Unfortunately, the ML version is rather limited here compared to MLlib, but for starters you can increase the number of iterations.
I have a regression model with a binary outcome. I fitted the model with glmnet and got the selected variables and their coefficients.
Since glmnet doesn't calculate variable importance, I would like to feed the exact output (the selected variables and their coefficients) to glm to get that information (standard errors, etc.).
I searched the R documentation; it seems I can use the "method" option in glm to specify a user-defined function.
But I failed to do so. Could someone help me with this?
"It is a very natural question to ask for standard errors of regression
coefficients or other estimated quantities. In principle such standard
errors can easily be calculated, e.g. using the bootstrap.
Still, this package deliberately does not provide them. The reason for
this is that standard errors are not very meaningful for strongly
biased estimates such as arise from penalized estimation methods.
Penalized estimation is a procedure that reduces the variance of
estimators by introducing substantial bias. The bias of each estimator
is therefore a major component of its mean squared error, whereas its
variance may contribute only a small part.
Unfortunately, in most applications of penalized regression it is
impossible to obtain a sufficiently precise estimate of the bias. Any
bootstrap-based calculations can only give an assessment of the
variance of the estimates. Reliable estimates of the bias are only
available if reliable unbiased estimates are available, which is
typically not the case in situations in which penalized estimates are
used.
Reporting a standard error of a penalized estimate therefore tells
only part of the story. It can give a mistaken impression of great
precision, completely ignoring the inaccuracy caused by the bias. It
is certainly a mistake to make confidence statements that are only
based on an assessment of the variance of the estimates, such as
bootstrap-based confidence intervals do."
Jelle Goeman, Ph.D., Leiden University, author of the penalized package in R.
There are CRAN packages hdi and selectiveInference which provide inference for high-dimensional models; you might want to take a look at those...
I've also seen people just run a glm using the predictors selected by glmnet, but this doesn't take into account the uncertainty produced by the selection process itself...
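For illustration only, this is roughly what that naive refit looks like (x and y are placeholder names for a predictor matrix and binary outcome; the standard errors reported here ignore the selection step, so treat them with the caution described above):

library(glmnet)

cvfit <- cv.glmnet(x, y, family = "binomial")

# predictors with nonzero coefficients at lambda.1se
cf  <- as.matrix(coef(cvfit, s = "lambda.1se"))
sel <- setdiff(rownames(cf)[cf[, 1] != 0], "(Intercept)")

# naive post-selection refit: the standard errors do not account for
# the fact that the lasso already looked at the data to pick 'sel'
refit <- glm(y ~ ., data = data.frame(y = y, x[, sel, drop = FALSE]), family = binomial)
summary(refit)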