I am trying to do feature selection using the glmnet package. I have been about to run the glmnet. However, I have a tough time understanding the output. My goal is to get the list of genes and their respective coefficients so I can rank the list of gene based on how relevant they are at separating my two group of labels.
x = manual_normalized_melt[,colnames(manual_normalized_melt) %in%
sig_0_01_ROTS$Gene]
y = cellID_reference$conditions
glmnet_l0 <- glmnet(x = as.matrix(x), y = y, family = "binomial",alpha = 1)
Any hints/instructions on how I go from here? I know that the data I want is within the glmnet_l0 but I am a bit unsure on how to extract it.
Additionally, anyone know if there is a way to use L0-norm for feature selection in R?
Thank you so much!
Here are some approaches in glmnet:
first some data because you did not post any (iris data with two levels in species):
data(iris)
x <- iris[,1:4]
y <- iris[,5]
y[y == "setosa"] <- "virginica"
y <- factor(y)
First run a cross validation model to see what is the best lambda:
library(glmnet)
model_cv <- cv.glmnet(x = as.matrix(x),
y = y,
family = "binomial",
alpha = 1,
nfolds = 5,
intercept = FALSE)
Here I chose to have 5-fold cross validation to determine the best lambda.
Too see the coefficients at best lambda:
coef(model_cv, s = "lambda.min")
#output
#5 x 1 sparse Matrix of class "dgCMatrix"
1
(Intercept) .
Sepal.Length -0.7966676
Sepal.Width 1.9291364
Petal.Length -0.9502821
Petal.Width 2.7113327
Here you can see no variables were dropped (or they would have . instead of a coefficient). If all the features are on the same scale (like gene expression data) you might consider adding standardize = FALSE as an argument to the glmnet call since it is by default set to TRUE. At least I would when modeling expression.
To see the best lambda:
model_cv$lambda[which.min(model_cv$cvm)]
Now you can make a model with all the data:
glmnet_l0 <- glmnet(x = as.matrix(x),
y = y,
family = "binomial",
alpha = 1,
intercept = FALSE)
You can plot it on the lambda scale and add a vertical line depicting best lambda:
plot(glmnet_l0, xvar = "lambda")
abline(v = log(model_cv$lambda[which.min(model_cv$cvm)]))
Here one can see coefficients were hardly shrunk at all at best lambda.
with higher dimensional data you will see many coefficient traces go towards 0 before best lambda kicks in and many . in the coef matrix.
When using predict.glmnet set s = model_cv$lambda[which.min(model_cv$cvm)] or it will generate predictions for all tested lambda.
Also check this post it contains some other relevant information.
A while back I wrapped glmnet in a package for feature selection, you can either look at the code (beginning from line 89) or you can download the package using devtools::install_github('mlampros/FeatureSelection'). I explained also how it works in a blog post.
Related
I have a linear mixed effects model and I am trying to do variable selection. The model is testing the level of forest degradation in 1000 sampled points. Most points have no degradation, and so the dependent variable is highly skewed with many zeros. Therefore, I am using the Tweedie distribution to fit the model. My main question is: can the Tweedie distribution actually be used in the glmmLasso function? My second question is: do I even need to use this distribution in glmmLasso()? Any help is much appreciated!
When I run the function with family = tweedie(var.power=1.2,link.power=0) I get the following error:
Error in logLik.glmmLasso(y = y, yhelp = yhelp, mu = mu, family = family, :
object 'loglik' not found
If I change the link.power from 0 to 1 (which I think is not correct for my model, but just for the sake of figuring out the problem), I get a different error:
Error in grad.lasso[b.is.0] <- score.beta[b.is.0] - lambda.b * sign(score.beta[b.is.0]) :
NAs are not allowed in subscripted assignments
Here tweedie comes from the statmod package. A simple example:
library(tweedie)
library(tidyverse)
library(glmmLasso)
library(statmod)
power <- 2
mu <- 1
phi <- seq(2, 8, by=0.1)
set.seed(10000)
y <- rtweedie( 100, mu=mu, power=power, phi=3)
x <- rnorm(100)
z <- c(rep(1, 50), rep(2,50))
df = as.data.frame(cbind(y,x,z))
df$z = as.factor(df$z)
f = y ~ x
varSelect = glmmLasso(fix = f, rnd = list(z=~1), data = df,
lambda = 5, family = tweedie(var.power=1.2,link.power=0))
I created a hacked version of glmmLasso that incorporates the Tweedie distribution as an option and put it on Github. I had to change two aspects of the code:
add a clause to compute the log-likelihood if family$family == "Tweedie"
in a number of places where the code was essentially if (family$family in list_of_families) ..., add "Tweedie" as an option.
remotes::install_github("bbolker/glmmLasso-bmb")
packageVersion("glmmLasso")
## [1] ‘1.6.2.9000’
Your example runs for me now, but I haven't checked at all to see if the results are sensible.
I have a bit of an issue. I am trying to develop some code that will allow me to do the following: 1) run a logistic regression analysis, 2) extract the estimates from the logistic regression analysis, and 3) use those estimates to create another logistic regression formula that I can use in a subsequent simulation of the original model. As I am, relatively new to R, I understand I can extract these coefficients 1-by-1 through indexing, but it is difficult to "scale" this to models with different numbers of coefficients. I am wondering if there is a better way to extract the coefficients and setup the formula. Then, I would have to develop the actual variables, but the development of these variables would have to be flexible enough for any number of variables and distributions. This appears to be easily done in Mplus (example 12.7 in the Mplus manual), but I haven't figured this out in R. Here is the code for as far as I have gotten:
#generating the data
set.seed(1)
gender <- sample(c(0,1), size = 100, replace = TRUE)
age <- round(runif(100, 18, 80))
xb <- -9 + 3.5*gender + 0.2*age
p <- 1/(1 + exp(-xb))
y <- rbinom(n = 100, size = 1, prob = p)
#grabbing the coefficients from the logistic regression model
matrix_coef <- summary(glm(y ~ gender + age, family = "binomial"))$coefficients
the_estimates <- matrix_coef[,1]
the_estimates
the_estimates[1]
the_estimates[2]
the_estimates[3]
I just cannot seem to figure out how to have R create the formula with the variables (x's) and the coefficients from the original model in a flexible manner to accommodate any number of variables and different distributions. This is not class assignment, but a necessary piece for the research that I am producing. Any help will be greatly appreciated, and please, treat this as a teaching moment. I really want to learn this.
I'm not 100% sure what your question is here.
If you want to simulate new data from the same model with the same predictor variables, you can use the simulate() method:
dd <- data.frame(y, gender, age)
## best practice when modeling in R: take the variables from a data frame
model <- glm(y ~ gender + age, data = dd, family = "binomial")
simulate(model)
You can create multiple replicates by specifying the nsim= argument (or you can simulate anew every time through a for() loop)
If you want to simulate new data from a different set of predictor variables, you have to do a little bit more work (some model types in R have a newdata= argument, but not GLMs alas):
## simulate new model matrix (including intercept)
simdat <- cbind(1,
gender = rbinom(100, prob = 0.5, size = 1),
age = sample(18:80, size = 100, replace = TRUE))
## extract inverse-link function
invlink <- family(model)$linkinv
## sample new values
resp <- rbinom(n = 100, size = 1, prob = invlink(simdat %*% coef(model)))
If you want to do this later from coefficients that have been stored, substitute the retrieved coefficient vector for coef(model) in the code above.
If you want to flexibly construct formulas, reformulate() is your friend — but I don't see how it fits in here.
If you want to (say) re-fit the model 1000 times to new responses simulated from the original model fit (same coefficients, same predictors: i.e. a parametric bootstrap), you can do something like this.
nsim <- 1000
res <- matrix(NA, ncol = length(coef(model)), nrow = nsim)
for (i in 1:nsim) {
## simulate returns a list (in this case, of length 1);
## extract the response vector
newresp <- simulate(model)[[1]]
newfit <- update(model, newresp ~ .)
res[i,] <- coef(newfit)
}
You don't have to store coefficients - you can extract/compute whatever model summaries you like (change the number of columns of res appropriately).
Let’s say your data matrix including age and gender, or whatever predictors, is X. Then you can use X on the right-hand side of your glm formula, get xb_hat <- X %*% the_estimates (or whatever other data matrix replacing X as long as it has same columns) and plug xb_hat into whatever link function you want.
I was wondering if it might be possible (and perhaps recommended) to obtain standardized coefficients from stan_glm() in the rstanarm package? (did not find anything specific in the documentation)
Can I just standardize all variables as in normal regression? (see below)
Example:
library("rstanarm")
fit <- stan_glm(wt ~ vs*gear, data = mtcars)
Standardization:
design <- wt ~ vs*gear
vars <- all.vars(design)
stand.vars <- lapply(mtcars[, vars], scale)
fit <- stan_glm(stand.vars, data = mtcars)
I would not say that it is affirmatively recommended, but I would recommend that you not subtract the sample mean and divide by the sample standard deviation of the outcome because the estimation uncertainty in those two statistics will not be propagated to the posterior distribution.
Standardizing the predictors is more debatable. You can do it, but it makes doing posterior prediction with new data harder because you have to remember to subtract the old means from the new data and divide by the old standard deviations.
The most computationally efficient approach is to leave the variables as they are but specify the non-default argument QR = TRUE, especially if you are not going to modify the default (normal) priors on the coefficients anyway.
You can then standardize the posterior coefficients after-the-fact if standardized coefficients are of interest. To do so, you can do
X <- model.matrix(fit)
sd_X <- apply(X, MARGIN = 2, FUN = sd)[-1]
sd_Y <- apply(posterior_predict(fit), MARGIN = 1, FUN = sd)
beta <- as.matrix(fit)[ , 2:ncol(X), drop = FALSE]
b <- sweep(sweep(beta, MARGIN = 2, STATS = sd_X, FUN = `*`),
MARGIN = 1, STATS = sd_Y, FUN = `/`)
summary(b)
However, standardizing regression coefficients just gives the illusion of comparability across variables and says nothing about how germane a one standard deviation difference is, particularly for dummy variables. If your question is really whether manipulating this predictor or that predictor is going to make a bigger difference on the outcome variable, then simply simulate those manipulations like
PPD_0 <- posterior_predict(fit)
nd <- model.frame(fit)
nd[ , 2] <- nd[ , 2] + 1 # for example
PPD_1 <- posterior_predict(fit, newdata = nd)
summary(c(PPD_1 - PPD_0))
and repeat that process for other manipulations of interest.
I am attempting to do classification prediction using glmnet, however I cannot deduce what the return object of "glmnet.predict" is supposed to represent. Using the code
mlogit_r<-glmnet(train_x, cbind(cns_label, renal_label,breast_label,nsclc_label,ovarian_label,leuk_label,colon_label, mela_label),
family="multinomial", alpha=0)
pred <- predict(mlogit_r, train_x, type="class")
with train_x being 57(n) x 6830(p), and the y object being 57(n) x 8 (num classes). The returned prediction object is a 57 x 100 matrix with labels. Which of these are the predicted labels?
It does not show in the documentation, as it just says
The object returned depends the . . . argument which is passed on to the
predict method for glmnet objects.
When you fit a glmnet model without specifying the lambda value, by default a range containing 100 lambda values is fit. When you call predict on such a model without specifying the lambda, the predictions are made for all lambda hence you receive 100 different predictions from a 100 different models.
Usually one runs cross validation to choose one lambda that is best and then predicts using it:
library(glmnet)
data(iris)
lets use 120 rows for training:
z <- sample(1:nrow(iris), 120)
now run a 5 - fold cross validation using miss classification error to chose the best lambda:
cv_fit <- cv.glmnet(as.matrix(iris[z,-5]),
iris[z,5],
nfolds = 5,
type.measure = "class",
alpha = 0,
grouped = FALSE,
family = "multinomial")
plot(cv_fit)
Here you can see the lambda.min corresponding to the dashed line on the left (lambda with lowest error in 5 fold cross validation) and lambda.1se (lambda with error of 1 se withing the lowest error near it on slightly on the right.
These values are in:
cv_fit$lambda.min
#[1] 0.05560455
cv_fit$lambda.1se
#[1] 0.09717054
Now when you know the best lambda you can either build a model on 100 lambda values:
fit <- glmnet(as.matrix(iris[z,-5]),
iris[z, 5],
alpha = 0,
family = "multinomial")
and predict on a specific one:
predict(fit, as.matrix(iris[-z,-5]), s = cv_fit$lambda.min, type = "class")
or build a model on one lambda
fit1 <- glmnet(as.matrix(iris[z,-5]),
iris[z, 5],
alpha = 0,
lambda = cv_fit$lambda.min,
family = "multinomial")
and predict without specifying lambda:
all.equal(as.vector(predict(fit, as.matrix(iris[-z,-5]), s = cv_fit$lambda.min, type = "class")),
as.vector(predict(fit1, as.matrix(iris[-z,-5]), type = "class")))
#TRUE
To see how much the coefficients were constrained you can plot the model and the lambda used:
plot(fit, xvar = "lambda")
abline(v = log(cv_fit$lambda.min), lty = 2)
I am using glmnet and for the best lambda I want to check the VIF between variables. Can anyone suggest how can I accomplish this?
Below is the code I am following and fielddfm is the data frame containing the independent variables:
x<- model.matrix(depvar ~ ., fielddfm) [,-1]
y <- depvar
lambda <- 10^seq(10, -2, length = 100)
ridge.mod <- glmnet(x, y, alpha = 0, lambda = lambda)
predict(ridge.mod, s = 0, exact = T, type = 'coefficients')
cv.out <- cv.glmnet(x, y, alpha = 0, nfolds = 3)
bestlam <- cv.out$lambda.min
ridge.pred <- predict(ridge.mod, s = bestlam, newx = x)
predict(ridge.mod, type = "coefficients", s = bestlam)'
Here, I get the coefficients for different promotion vehicles but I want to know, VIF values for the best lambda for different independent variables
Could yo please suggest how can I achieve this?
Since a) VIF is a function of your predictors rather than your model and b) a ridge regression keeps all variables irrespective of lambda, you could get the VIFs from an arbitrarily-fitted linear model. For example:
vifs = car::vif(lm(y ~ ., data = X))
where y is your response and X is your dataframe of predictors. Note that the results are independent of the values contained in y.
Given the above however, It's a little dubious whether this question makes sense in the first place...