Validating AUC values in R's glmnet package

I am using the cv.glmnet function from the glmnet library in R. I fit the model as follows:
model = cv.glmnet(x = data, y = label, family = 'binomial', alpha = 0.1, type.measure = "auc")
From the cross-validation plot (not shown here) I expect to get an AUC between 0.62 and 0.64.
However, when I pass the same data (i.e. the training data) to the predict function, i.e.
pred = predict(model, newx = data, type = 'class', s = "lambda.min")
auc(label, pred)
I only get back an AUC of 0.56.
I understand the random nature of cross-validation, but intuitively I would expect to get something back in the range suggested by the cross-validation.
What have I missed here? Thanks in advance

I'm not sure how auc() works (it appears to be an internal function of glmnet), but you should try passing the predictions on the link or response scale instead of as a class:
pred = predict(model, newx = data, type = 'response', s = "lambda.min")
auc(label, pred)
type = 'class' gives you the predicted class at a single threshold (e.g., 0.5), while type = 'response' gives you the predicted probability, which is what an AUC calculation needs, since AUC sweeps across all possible thresholds.
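For a concrete check, here is a minimal sketch using the pROC package (my addition: pROC::roc and the simulated data/label stand in for the question's objects, which aren't shown):
library(glmnet)
library(pROC)

set.seed(1)
data  <- matrix(rnorm(500 * 10), ncol = 10)              # simulated predictors
label <- rbinom(500, 1, plogis(data[, 1] - data[, 2]))   # simulated binary outcome

model <- cv.glmnet(x = data, y = label, family = "binomial",
                   alpha = 0.1, type.measure = "auc")

# probabilities, not hard classes, are what an AUC calculation needs
prob <- predict(model, newx = data, type = "response", s = "lambda.min")
pROC::auc(pROC::roc(label, as.vector(prob)))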

Related

difference between data and newdata arguments when predicting

When predicting in R with the predict() function, the argument for the data on which we want to predict is newdata =. My question is: what happens when we pass data = instead of newdata =? It doesn't give an error, but the RMSE obtained is not the same as when using newdata =.
Here is an example:
library(MASS)
set.seed(18)
Boston_idx = sample(1:nrow(Boston), nrow(Boston) / 2)
Boston_train = Boston[Boston_idx, ]
Boston_test = Boston[-Boston_idx, ]

library(rpart)
Boston_tree <- rpart(medv ~ ., data = Boston_train)

tree.pred  <- predict(Boston_tree, data = Boston_test)     # data= instead of newdata=
tree.pred2 <- predict(Boston_tree, newdata = Boston_test)  # newdata=

rmse = function(m, o){
  sqrt(mean((m - o)^2))
}
rmse(tree.pred, Boston_test$medv)
rmse(tree.pred2, Boston_test$medv)
data is the argument used when fitting the model; newdata is the argument predict() uses for new observations. predict.rpart() has no data argument at all, so data = Boston_test is silently swallowed by ... and ignored, and the call behaves as if newdata were missing. The help page ?predict.rpart says:
newdata: data frame containing the values at which predictions are required. The predictors referred to in the right side of formula(object) must be present by name in newdata. If missing, the fitted values are returned.
So tree.pred is just the fitted values for the training data, which is why the two RMSE values differ.
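A quick check makes this concrete (my addition, assuming the code above has been run):
# tree.pred equals a predict() call with no new data at all,
# i.e. the fitted values on the training set
all.equal(as.vector(tree.pred), as.vector(predict(Boston_tree)))  # TRUE
Consequently, rmse(tree.pred, Boston_test$medv) compares training-set fits to test-set responses row by row, which is not a meaningful test RMSE.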

predict function with lasso regression

I am trying to implement lasso regression for my sales prediction problem. I am using the glmnet package and the cv.glmnet function to train the model.
library(glmnet)
set.seed(123)
model = cv.glmnet(x = as.matrix(train[, -which(names(train) %in% "Sales")]),
                  y = train$Sales,
                  alpha = 1,
                  lambda = 10^seq(4, -1, -0.1))
best_lambda = model$lambda.min
lasso_predictions_valid <- predict(model, s = best_lambda, type = "coefficients")
After reading a few articles about implementing lasso regression, I still don't know how to supply the test data I want to predict on. There is also a newx argument to the predict function that I don't understand. In most regression functions there is a newdata or data argument that we pass our test data to.
I think there is an error in your lasso_predictions_valid line: you shouldn't pass valid$Sales as your newx, as that is the actual sales column.
Once you have created the model with the train set, for newx you need to pass a matrix of the x values you want predictions for; in this case that will be your validation set. Also note that type = "coefficients" returns the coefficient vector and ignores newx, so for predictions you want type = "response".
Looking at your example code above, I think your predict line should be something like:
lasso_predictions_valid <- predict(model, s = best_lambda,
                                   newx = as.matrix(valid[, -which(names(valid) %in% "Sales")]),
                                   type = "response")
Then you should run your RMSE() line:
RMSE(lasso_predictions_valid, valid$Sales)
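For a self-contained illustration of the same workflow (my addition; mtcars stands in for the train/valid data, which aren't shown in the question):
library(glmnet)
set.seed(123)
idx   <- sample(nrow(mtcars), 0.7 * nrow(mtcars))
train <- mtcars[idx, ]
valid <- mtcars[-idx, ]

cvfit <- cv.glmnet(x = as.matrix(train[, -1]), y = train$mpg, alpha = 1)
preds <- predict(cvfit, s = cvfit$lambda.min,
                 newx = as.matrix(valid[, -1]))   # numeric predictions for valid
sqrt(mean((preds - valid$mpg)^2))                 # RMSE computed by hand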

cv.glmnet and Leave-one out CV

I'm trying to use the function cv.glmnet to find the best lambda (using ridge regression) in order to predict the class membership of some objects.
The code I have used is:
CVGLM <- cv.glmnet(x, y, nfolds = 34, type.measure = "class", alpha = 0, grouped = FALSE)
I'm not actually using k-fold cross-validation because my dataset is too small; I have only 34 rows. So I set nfolds to the number of rows, which amounts to leave-one-out CV.
Now, I have some questions:
1) Does the cv.glmnet function only tune the hyperparameter lambda, or does it also test the "final model"?
2) Once I have the best lambda, what do I do? Should I use the predict function? If so, on which data, given that I used all the data to find lambda with LOO CV?
3) How can I calculate R^2 from cv.glmnet?
Here is an attempt to answer your questions:
1) cv.glmnet tests the performance of each lambda using the cross-validation scheme you specify. Here is an example:
library(glmnet)
data(iris)

# find the best lambda for iris prediction
CVGLM <- cv.glmnet(as.matrix(iris[, -5]),
                   iris[, 5],
                   nfolds = nrow(iris),
                   type.measure = "class",
                   alpha = 0,
                   grouped = FALSE,
                   family = "multinomial")
CVGLM$cvm holds the misclassification error for every lambda; at the best lambda it is:
min(CVGLM$cvm)
#output
0.06
If you test this independently using LOOCV and the best lambda:
z <- lapply(1:nrow(iris), function(x){
  # fit on all rows except x, then predict row x
  fit <- glmnet(as.matrix(iris[-x, -5]),
                iris[-x, 5],
                alpha = 0,
                lambda = CVGLM$lambda.min,
                family = "multinomial")
  pred <- predict(fit, as.matrix(iris[x, -5]), type = "class")
  # as.vector drops the matrix class so the column is reliably named pred
  data.frame(pred = as.vector(pred), true = iris[x, 5])
})
z <- do.call(rbind, z)
and check the error rate, it is:
sum(z$pred != z$true) / nrow(iris)
#output
0.06
so it looks like there is no need to test the performance with the same method as in cv.glmnet, since it will be the same.
2) When you have the optimal lambda, you should fit a model on the whole data set using the glmnet function. What you do with the model after that is entirely up to you; most people train a model to predict something.
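For instance, a minimal sketch of that refit (my addition, reusing CVGLM from above; new_x is a hypothetical matrix of new observations to score):
final_fit <- glmnet(as.matrix(iris[, -5]), iris[, 5],
                    alpha = 0,
                    lambda = CVGLM$lambda.min,
                    family = "multinomial")
# score new data, e.g.: predict(final_fit, newx = new_x, type = "class")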
3) What is R^2 for a classification problem? If you could define that, then you could calculate it.
R^2 = explained variation / total variation
What is that in terms of classes? In any case, R^2 is not used for classification; instead one uses AUC, deviance, accuracy, balanced accuracy, kappa, Youden's J and so on. Most of these are for binary classification, but some are available for multinomial problems.
I suggest reading further on classification performance metrics.

R coefficients of glmnet::cvfit

As far as I understand, cv.glmnet does k-fold cross-validation: each time it splits the data into a training set and a validation set. For every fixed lambda, it first uses the training data to estimate a coefficient vector, then applies the fitted model to the validation set to measure the error.
Hence, for k-fold CV there are k coefficient vectors (each generated from a different training set). So what does
coef(cvfit)
return?
Here is an example:
x <- iris[1:100, 1:4]
y <- factor(iris[1:100, 5])

fit <- cv.glmnet(data.matrix(x), y, family = "binomial",
                 type.measure = "class", alpha = 1, nfolds = 3,
                 standardize = TRUE)
coef(fit, s = c(fit$lambda.min, fit$lambda.1se))

fit1 <- glmnet(data.matrix(x), y, family = "binomial",
               standardize = TRUE,
               lambda = c(fit$lambda.1se, fit$lambda.min))
coef(fit1)
In fit1 I use the whole dataset as the training set, yet the coefficients of fit1 and fit seem to be exactly the same. Why is that?
Thanks in advance.
Although cv.glmnet checks model performance by cross-validation, the actual model coefficients it returns for each lambda value are based on fitting the model with the full dataset.
The help for cv.glmnet (type ?cv.glmnet) includes a Value section that describes the object returned by cv.glmnet. The returned list object (fit in your case) includes an element called glmnet.fit. The help describes it like this:
glmnet.fit a fitted glmnet object for the full data.
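You can confirm this directly (my addition, reusing the fit object from the question): coef() on the cv.glmnet object simply delegates to this stored full-data fit.
all.equal(coef(fit, s = fit$lambda.min),
          coef(fit$glmnet.fit, s = fit$lambda.min))  # should be TRUE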

how to get predictions using gbm in r

library(gbm)
fit <- gbm(Crop_Damage ~ Estimated_Insects_Count + Crop_Type + Soil_Type +
             Pesticide_Use_Category + Number_Doses_Week + Number_Weeks_Used +
             Number_Weeks_Quit + Season,
           data = mydata, distribution = "multinomial")
gbmpred <- predict(fit, mydata, n.trees = fit$n.trees)
I tried the code above, but it gives me probabilities. I want to get class predictions.

The predict function returns predictions on the scale of f(x); for the multinomial (classification) case, the returned values (with type = "response") are class probabilities. All you have to do is translate them into class predictions by taking, for each row, the class with the highest probability.
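A sketch of that last step (my addition; it assumes the fit and mydata objects above, and that predict.gbm returns an n x n.classes x 1 array for a multinomial model, with class names on the second dimension):
probs <- predict(fit, mydata, n.trees = fit$n.trees, type = "response")
# pick the most probable class per row
classes <- dimnames(probs)[[2]][apply(probs[, , 1], 1, which.max)]
head(classes)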
