I am attempting to do classification prediction using glmnet, but I cannot work out what the return object of predict.glmnet is supposed to represent. Using the code
mlogit_r <- glmnet(train_x,
                   cbind(cns_label, renal_label, breast_label, nsclc_label,
                         ovarian_label, leuk_label, colon_label, mela_label),
                   family = "multinomial", alpha = 0)
pred <- predict(mlogit_r, train_x, type = "class")
with train_x being 57 (n) x 6830 (p), and the y object being 57 (n) x 8 (number of classes). The returned prediction object is a 57 x 100 matrix of labels. Which of these are the predicted labels?
It is not explained in the documentation, which just says
The object returned depends on the . . . argument which is passed on to the predict method for glmnet objects.
When you fit a glmnet model without specifying lambda, a sequence of 100 lambda values is fit by default. When you then call predict on such a model without specifying lambda, predictions are made for every lambda in the sequence, hence you receive 100 different predictions from 100 different models.
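In other words, each of the 100 columns corresponds to one entry of the fitted lambda sequence. A minimal sketch using the objects from the question:
dim(pred)                # 57 x 100: one column of class labels per lambda
length(mlogit_r$lambda)  # 100
# predictions at a single lambda, e.g. the smallest one in the sequence:
predict(mlogit_r, train_x, type = "class", s = min(mlogit_r$lambda))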
Usually one runs cross-validation to choose the single best lambda and then predicts using it:
library(glmnet)
data(iris)
let's use 120 rows for training:
z <- sample(1:nrow(iris), 120)
now run a 5-fold cross-validation using misclassification error to choose the best lambda:
cv_fit <- cv.glmnet(as.matrix(iris[z,-5]),
iris[z,5],
nfolds = 5,
type.measure = "class",
alpha = 0,
grouped = FALSE,
family = "multinomial")
plot(cv_fit)
Here you can see lambda.min, corresponding to the dashed line on the left (the lambda with the lowest error in 5-fold cross-validation), and lambda.1se, the dashed line slightly to the right (the largest lambda whose error is within one standard error of the lowest).
These values are in:
cv_fit$lambda.min
#[1] 0.05560455
cv_fit$lambda.1se
#[1] 0.09717054
Now that you know the best lambda, you can either build a model on the full sequence of 100 lambda values:
fit <- glmnet(as.matrix(iris[z,-5]),
iris[z, 5],
alpha = 0,
family = "multinomial")
and predict on a specific one:
predict(fit, as.matrix(iris[-z,-5]), s = cv_fit$lambda.min, type = "class")
or build a model on that single lambda:
fit1 <- glmnet(as.matrix(iris[z,-5]),
iris[z, 5],
alpha = 0,
lambda = cv_fit$lambda.min,
family = "multinomial")
and predict without specifying lambda:
all.equal(as.vector(predict(fit, as.matrix(iris[-z,-5]), s = cv_fit$lambda.min, type = "class")),
as.vector(predict(fit1, as.matrix(iris[-z,-5]), type = "class")))
# [1] TRUE
To see how much the coefficients were constrained you can plot the model and the lambda used:
plot(fit, xvar = "lambda")
abline(v = log(cv_fit$lambda.min), lty = 2)
I am running a GLM (linear regression), then I am using predict to fit the response on my test data, but the problem is that I am getting probabilities and I don't know how to convert those probabilities to real values.
log <- glm(formula = stock_out_duration ~ lag_2_market_unres_dos +
             lag_2_percentage_bias_forecast_error + forecast,
           data = train_data_final,
           family = inverse.gaussian(link = "log"), maxit = 100)
summary(log)
predict <- predict(log, test_data, type = 'response')
table_mat <- table(test_data$stock_out_duration)
table_mat
As far as I'm aware, there isn't a magic function that does this for you given that you're using glm. As you've noted, what typically gets returned is the probabilities. You can convert the probabilities into predictions for the underlying categories by choosing the outcome with the largest probability. I agree a one-line function for this would be nice, though.
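For a binary outcome that amounts to thresholding at 0.5. A minimal sketch, assuming a fitted binomial glm called model (hypothetical name) and the test_data from the question; "yes"/"no" are placeholder labels:
probs <- predict(model, test_data, type = "response")
# choose the outcome with the larger probability (0.5 cutoff for two classes)
pred_class <- ifelse(probs > 0.5, "yes", "no")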
You can get this functionality if you use the glmnet package.
library(glmnet)
y = ifelse(rnorm(100) > 0, "red", "blue")
y = factor(y)
x = matrix(rnorm(100 * 2), ncol = 2) # glmnet requires a matrix with at least two columns
fit = glmnet(x, y, family="binomial") # use family="multinomial" if there are more than 2 categories in your factor
yhat = predict(fit, newx=x, type="class", s=0)
yhat in the above will be a vector containing either "red" or "blue".
Note, the type="class" is the bit that gets you the category outcomes returned in yhat. The s=0 means to use a lambda penalty of zero for the coefficients used to get predictions. You indicated in the question that you were just doing ordinary regression without any ridge- or lasso-style penalty factors, so s=0 ensures you get that in your predictions.
I am trying to do feature selection using the glmnet package. I have been able to run glmnet. However, I have a tough time understanding the output. My goal is to get the list of genes and their respective coefficients so I can rank the genes by how relevant they are at separating my two groups of labels.
x = manual_normalized_melt[, colnames(manual_normalized_melt) %in%
                               sig_0_01_ROTS$Gene]
y = cellID_reference$conditions
glmnet_l0 <- glmnet(x = as.matrix(x), y = y, family = "binomial", alpha = 1)
Any hints/instructions on how to go on from here? I know that the data I want is within glmnet_l0, but I am a bit unsure how to extract it.
Additionally, does anyone know if there is a way to use the L0-norm for feature selection in R?
Thank you so much!
Here are some approaches in glmnet:
First, some data, because you did not post any (the iris data recoded to two levels in Species):
data(iris)
x <- iris[,1:4]
y <- iris[,5]
y[y == "setosa"] <- "virginica"
y <- factor(y)
First run a cross-validation model to see what the best lambda is:
library(glmnet)
model_cv <- cv.glmnet(x = as.matrix(x),
y = y,
family = "binomial",
alpha = 1,
nfolds = 5,
intercept = FALSE)
Here I chose to have 5-fold cross validation to determine the best lambda.
To see the coefficients at the best lambda:
coef(model_cv, s = "lambda.min")
# output
# 5 x 1 sparse Matrix of class "dgCMatrix"
#                       1
# (Intercept)   .
# Sepal.Length -0.7966676
# Sepal.Width   1.9291364
# Petal.Length -0.9502821
# Petal.Width   2.7113327
Here you can see that no variables were dropped (they would have a . instead of a coefficient). If all the features are on the same scale (like gene expression data) you might consider adding standardize = FALSE as an argument to the glmnet call, since it is TRUE by default. At least I would when modeling expression.
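To get the ranked list the question asks for, one option is to sort the features by absolute coefficient size at the best lambda (a sketch; ranking by absolute coefficient assumes the features are on a comparable scale):
cf <- as.matrix(coef(model_cv, s = "lambda.min"))
cf <- cf[rownames(cf) != "(Intercept)", , drop = FALSE]  # drop the intercept row
ranked <- cf[order(abs(cf[, 1]), decreasing = TRUE), , drop = FALSE]
head(ranked)  # most relevant features first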
To see the best lambda:
model_cv$lambda[which.min(model_cv$cvm)]
Now you can make a model with all the data:
glmnet_l0 <- glmnet(x = as.matrix(x),
y = y,
family = "binomial",
alpha = 1,
intercept = FALSE)
You can plot it on the lambda scale and add a vertical line at the best lambda:
plot(glmnet_l0, xvar = "lambda")
abline(v = log(model_cv$lambda[which.min(model_cv$cvm)]))
Here one can see the coefficients were hardly shrunk at all at the best lambda.
With higher-dimensional data you will see many coefficient traces go towards 0 before the best lambda kicks in, and many . entries in the coefficient matrix.
When using predict.glmnet, set s = model_cv$lambda[which.min(model_cv$cvm)] or it will generate predictions for all tested lambda values.
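For example, a sketch using the objects defined above:
best_lambda <- model_cv$lambda[which.min(model_cv$cvm)]  # same as model_cv$lambda.min
predict(glmnet_l0, newx = as.matrix(x), s = best_lambda, type = "class")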
Also check this post; it contains some other relevant information.
A while back I wrapped glmnet in a package for feature selection. You can either look at the code (beginning at line 89) or download the package using devtools::install_github('mlampros/FeatureSelection'). I also explained how it works in a blog post.
I am using glmnet, and for the best lambda I want to check the VIF between variables. Can anyone suggest how I can accomplish this?
Below is the code I am following; fielddfm is the data frame containing the independent variables:
x <- model.matrix(depvar ~ ., fielddfm)[, -1]
y <- depvar
lambda <- 10^seq(10, -2, length = 100)
ridge.mod <- glmnet(x, y, alpha = 0, lambda = lambda)
predict(ridge.mod, s = 0, exact = TRUE, type = 'coefficients')
cv.out <- cv.glmnet(x, y, alpha = 0, nfolds = 3)
bestlam <- cv.out$lambda.min
ridge.pred <- predict(ridge.mod, s = bestlam, newx = x)
predict(ridge.mod, type = "coefficients", s = bestlam)
Here I get the coefficients for the different promotion vehicles, but I want to know the VIF values at the best lambda for the different independent variables.
Could you please suggest how I can achieve this?
Since a) VIF is a function of your predictors rather than your model and b) a ridge regression keeps all variables irrespective of lambda, you could get the VIFs from an arbitrarily-fitted linear model. For example:
vifs = car::vif(lm(y ~ ., data = X))
where y is your response and X is your dataframe of predictors. Note that the results are independent of the values contained in y.
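With the objects from the question, that might look like the following (a sketch; it assumes the depvar and fielddfm from the question, and that the car package is installed):
library(car)
# VIFs depend only on the predictors, so any lm fit over them will do
vifs <- vif(lm(depvar ~ ., data = fielddfm))
sort(vifs, decreasing = TRUE)  # most collinear predictors first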
Given the above, however, it's a little dubious whether this question makes sense in the first place...
I have data where the number of observations n is smaller than the number of variables p. The response variable is binary. For example:
n <- 10
p <- 100
x <- matrix(rnorm(n*p), ncol = p)
y <- rbinom(n, size = 1, prob = 0.5)
I would like to fit logistic model for this data. So I used the code:
model <- glmnet(x, y, family = "binomial", intercept = FALSE)
The function returns 100 models for different lambda values (the penalization parameter in LASSO regression). I would like to choose the biggest model which also has n - 1 parameters or fewer (so fewer than the number of observations). Let's say the chosen model is for lambda_opt.
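(A sketch of one way such a lambda_opt could be picked, using the df component of the fitted object, which holds the number of nonzero coefficients at each lambda:)
# largest model (smallest lambda) with at most n - 1 nonzero parameters
ok <- model$df <= n - 1
lambda_opt <- min(model$lambda[ok])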
model_one <- glmnet(x, y, family = "binomial", intercept = FALSE, lambda = lambda_opt)
Now I would like to do the second step: use the step function on my model to choose the submodel which is best in terms of BIC, the Bayesian Information Criterion. Unfortunately the step function doesn't work for objects of class glmnet.
step(model_one, direction = "backward", k = log(n))
How can I perform such procedure? Is there any other function for this specific class (glmnet) to do what I want?
BIC is a fine way to select a penalty parameter from the sequence returned by glmnet; it's faster than cross-validation and works quite well, at least in the settings where I've tried it.
Compute the residual sum of squares for each value of the penalty parameter in the sequence (use predict(model, x) to get the fit).
model$df gives you the degrees of freedom.
Combine those to get a BIC and pick the value of lambda corresponding to the lowest BIC.
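A minimal sketch of that recipe; since the fit here is binomial, it uses deviance() (which glmnet provides for its fit objects) in place of a hand-rolled residual sum of squares:
library(glmnet)

set.seed(1)
n <- 10; p <- 100
x <- matrix(rnorm(n * p), ncol = p)
y <- rbinom(n, size = 1, prob = 0.5)

model <- glmnet(x, y, family = "binomial", intercept = FALSE)

# BIC = deviance + log(n) * df, evaluated at every lambda in the sequence
bic <- deviance(model) + log(n) * model$df
lambda_opt <- model$lambda[which.min(bic)]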
I work on calibration of probabilities. I'm using a probability mapping approach based on generalized additive models.
The algorithm I wrote is:
probMapping = function(x, y, datax, datay) {
  if (length(x) < length(y)) stop("train smaller than test")
  if (length(datax) < length(datay)) stop("train smaller than test")
  datax$prob = x  # trainset: data and raw probabilities
  datay$prob = y  # testset: data and raw probabilities
  prob_map = gam(Target ~ prob, data = datax, family = binomial)
  prob_map_prob = predict(prob_map, newdata = datay, type = "prob")
  # return(str(datax))
  return(prob_map_prob)
}
The package I'm using is mgcv.
x - prediction on train dataset
y - prediction on test dataset
datax - traindata
datay - testdata
Problems:
The output values are not between 0 and 1
I get the following warning message:
In predict.gam(prob_map, newdata = datay, type = "prob") :
Unknown type, reset to terms.
The warning is telling you that predict.gam doesn't recognize the value you passed to the type parameter. Since it didn't understand, it reset type to "terms", just as the warning says.
Note that predict.gam with type="terms" returns information about the model terms, not probabilities. Hence the output values are not between 0 and 1.
For more information about mgcv::predict.gam, take a look here.
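To get probabilities on the 0-1 scale, the value to pass is type = "response"; in the function above, the predict line would become:
prob_map_prob = predict(prob_map, newdata = datay, type = "response")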