I am trying to implement lasso regression for my sales prediction problem, using the glmnet package and the cv.glmnet function to train the model.
library(glmnet)
set.seed(123)
model <- cv.glmnet(x = as.matrix(train[, -which(names(train) %in% "Sales")]),
                   y = train$Sales,
                   alpha = 1,
                   lambda = 10^seq(4, -1, -0.1))
best_lambda <- model$lambda.min
lasso_predictions_valid <- predict(model, s = best_lambda, type = "coefficients")
After reading a few articles about implementing lasso regression, I still don't know how to supply the test data I want to predict on. There is a newx argument to the predict function that I also don't understand; in most regression types there is a newdata or data argument that we pass our test data to.
I think there is an error in your lasso_predictions_valid line: you shouldn't pass valid$Sales as newx, because that is the actual sales column (the outcome), not the predictors.
Once you have fitted the model on the training set, newx should be a matrix of the predictor values you want predictions for; in this case, your validation set.
Looking at your example code above, your predict line should be something like:
lasso_predictions_valid <- predict(model, s = best_lambda,
                                   newx = as.matrix(valid[, -which(names(valid) %in% "Sales")]),
                                   type = "response")
(With type = "coefficients", predict ignores newx and returns the coefficient vector rather than predictions, so the RMSE call below would fail.)
Then you should run your RMSE() line:
RMSE(lasso_predictions_valid, valid$Sales)
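If you also want the lasso coefficients at the chosen lambda (which is what type = "coefficients" actually returns), request them in a separate call, for example:
# Coefficient vector at the selected lambda; no newx is needed (or used) here
lasso_coefs <- predict(model, s = best_lambda, type = "coefficients")
# equivalently: coef(model, s = best_lambda)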
When predicting in R using the predict() function, the argument for the data on which we want to predict is newdata =. My question is: what happens when you pass data = instead of newdata =? It doesn't give an error, but the RMSE obtained is not the same as when using newdata =.
Here is an example:
library(MASS)
set.seed(18)
Boston_idx = sample(1:nrow(Boston), nrow(Boston) / 2)
Boston_train = Boston[Boston_idx,]
Boston_test = Boston[-Boston_idx,]
library(rpart)
Boston_tree <- rpart(medv ~ ., data = Boston_train)
tree.pred <- predict(Boston_tree, data=Boston_test)
tree.pred2 <- predict(Boston_tree, newdata=Boston_test)
rmse = function(m, o){
sqrt(mean((m - o)^2))
}
rmse(tree.pred,Boston_test$medv)
rmse(tree.pred2,Boston_test$medv)
data is not an argument of predict.rpart at all: it gets silently swallowed by the ... argument and ignored, so newdata is missing and predict() returns the fitted values for the training data. That is why the two RMSE values differ. The help page of ?predict.rpart says:
newdata: data frame containing the values at which predictions are required. The predictors referred to in the right side of formula(object) must be present by name in newdata. If missing, the fitted values are returned.
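You can check this directly with the objects from the question:
# predict.rpart has no 'data' argument, so 'data = Boston_test' is absorbed
# by '...' and ignored; tree.pred is just the fitted values on Boston_train
identical(tree.pred, predict(Boston_tree))  # TRUE
# rmse(tree.pred, Boston_test$medv) therefore compares training fits against
# test outcomes row by row, which is not a meaningful error estimate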
I am using the cv.glmnet function from the glmnet library in R. I run the following function:
model <- cv.glmnet(x = data, y = label, family = "binomial", alpha = 0.1,
                   type.measure = "auc")
From the resulting cross-validation plot I expect to get an AUC between 0.62 and 0.64.
However, when I pass the same data (i.e. the training data) to the predict function i.e.
pred <- predict(model, newx = data, type = "class", s = "lambda.min")
auc(label, pred)
I only get back an AUC of 0.56.
I understand the random nature of cross-validation, but intuitively I would expect to get something back in the range it suggests.
What have I missed here? Thanks in advance
I'm not sure how auc() works (it seems to be an internal function of glmnet), but you should try passing the predictions as a link or response instead of a class:
pred <- predict(model, newx = data, type = "response", s = "lambda.min")
auc(label, pred)
class gives you the predicted class at a single threshold (0.5 by default), while response gives you the probability, which can be used to calculate the AUC across all possible thresholds.
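For example, computing the AUC explicitly with the pROC package (an assumption here, since the question doesn't say where auc() comes from):
library(pROC)
# Probabilities preserve the ranking of observations, so the ROC curve
# (and hence the AUC) is evaluated across every possible threshold
pred_prob <- predict(model, newx = data, type = "response", s = "lambda.min")
pROC::auc(response = label, predictor = as.numeric(pred_prob))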
I have been trying to establish the predictive performance (AUC ROC) of a glmer model. When I use the predict() function on a test data set, the output of this function has the length of my training data set.
library(dplyr); library(lme4); library(pROC)

folds <- 10
glmerperf <- rep(0, folds); glmperf <- glmerperf
TB_Train.glmer.subset <- TB_Train.glmer %>% select(one_of(subset.vars), IDNO)
TB_Train.glmer.fs <- TB_Train.glmer.subset[,c(1:7, 22)]
TB_Train.glmer.ns <- TB_Train.glmer.subset[, 8:21]
TB_Train.glmer.cns <- TB_Train.glmer.ns %>% scale(center=TRUE, scale=TRUE) %>% cbind(TB_Train.glmer.fs)
foldsamples = caret::createFolds(TB_Train.glmer.cns$Case.Status, k = folds, list = TRUE, returnTrain = FALSE)
for (n in 1:folds)
{
testdata = TB_Train.glmer.cns[foldsamples[[n]],]
traindata = TB_Train.glmer.cns[-foldsamples[[n]],]
GLMER <- lme4::glmer(Case.Status ~ . + (1 | IDNO), data = traindata, family="binomial", control=glmerControl(optimizer="bobyqa", optCtrl=list(maxfun=1000000)))
glmer.probs <- predict(GLMER, newdata=testdata$Non.TB.Case, type="response")
glmer.ROC <- roc(predictor=glmer.probs, response=testdata$Case.Status, levels=rev(levels(testdata$Case.Status)))
glmerperf[n] <- glmer.ROC$auc
}
prob <- predict(GLMER, newdata=TB_Test.glmer$Non.TB.Case, type="response", re.form=~(1|IDNO))
print(sprintf('Mean AUC ROC of model on test set for GLMER %f', mean(glmerperf)))
Both the prob and glmer.probs objects have the length of the traindata object, despite the newdata argument being specified. I have noticed issues with the predict function in the past, but none as specific as this one.
Also, when the model is run, I get several warnings about needing to scale my data (which I have already done) and about the model failing to converge. Any ideas on how to fix this? I have already increased the iterations and selected a new optimizer.
Figured out that the error was arising from using the "." shortcut to specify all predictors for the model; writing the fixed effects out explicitly in the formula fixed the predict() behaviour.
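A minimal sketch of the fix (the predictor names here are hypothetical, and note that newdata expects a data frame containing the model variables, not a single column like testdata$Non.TB.Case):
# Hypothetical explicit formula instead of "Case.Status ~ . + (1 | IDNO)"
GLMER <- lme4::glmer(Case.Status ~ Age + Sex + Non.TB.Case + (1 | IDNO),
                     data = traindata, family = "binomial",
                     control = lme4::glmerControl(optimizer = "bobyqa"))
# Pass the whole held-out data frame; allow.new.levels covers IDNO values
# that were not seen in the training folds
glmer.probs <- predict(GLMER, newdata = testdata, type = "response",
                       allow.new.levels = TRUE)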
I found that a predict method is currently not implemented for cumulative link mixed models fitted with the clmm function in the ordinal R package. While predict is implemented for clmm2 in the same package, I chose clmm because the latter allows for more than one random effect. Further, I fitted several clmm models and performed model averaging using the model.avg function in the MuMIn package. Ideally, I want to predict probabilities using the averaged model. However, while MuMIn supports clmm models, predict will not work with the averaged model either.
Is there a way to hack the predict function so that it can not only predict probabilities from a clmm model, but also predict using model-averaged coefficients from clmm (i.e. an object of class "averaging")? For example:
require(ordinal)
require(MuMIn)
mm1 <- clmm(SURENESS ~ PROD + (1|RESP) + (1|RESP:PROD), data = soup,
            link = "probit", threshold = "equidistant")
## same fixed and random effects, but with a logistic link:
mm2 <- clmm(SURENESS ~ PROD + (1|RESP) + (1|RESP:PROD), data = soup,
            link = "logistic", threshold = "equidistant")
## create a model selection object
mm.sel <- model.sel(mm1, mm2)
## perform a model average
mm.avg <- model.avg(mm.sel)
## create new data and predict
new.data <- soup
## predict with an individual model
predict(mm1, new.data)
I got the following error message:
In UseMethod("predict") :
no applicable method for predict applied to an object of class "clmm"
## predict with the model average
predict(mm.avg, new.data)
Another error is returned:
Error in predict.averaging(mm.avg, new.data) :
predict for models 'mm1' and 'mm2' caused errors
I've been using clmm as well and yes, I can confirm that predict.clmm is NOT (yet?) implemented. I haven't yet checked the source code of fake.predict.clmm (below); it might work. If it doesn't, you're stuck with doing things by hand or using predict.clmm2.
I found a potential solution (pasted below) but have not been able to make it work for my data.
Solution here: https://gist.github.com/mainambui/c803aaf857e54a5c9089ea05f91473bc
I think the problem is the number of coefficients I am using, but I am not experienced enough to figure it out. Hopefully this helps someone out, though.
This is the model and newdata that I am using, though it is actually a model-averaged version. Same predictors, though.
ma10 <- clmm(Location3 ~ Sex * Grass3 + Sex * Forb3 + (1|Tag_ID),
             data = IP_all_dunes)
ma_1 <- model.avg(ma10, ma8, ma5)  ## top 3 models
new_ma <- data.frame(Sex = c("m","f","m","f","m","f","m","f"),
                     Grass3 = c("1","1","1","1","0","0","0","0"),
                     Forb3 = c("0","0","1","1","0","0","1","1"))
# Arguments:
#  - modelAvg = a clmm model average (object of class "averaging"); the
#    commented-out lines show the equivalent code for a single clmm model
#  - newdata  = a data frame of new data to apply the model to
# Returns a data frame of predicted probabilities for each row and response level
fake.predict.clmm <- function(modelAvg, newdata) {
  # Cumulative-link prediction: P(Y = j) = F(theta_j - eta) - F(theta_{j-1} - eta),
  # with the outermost thresholds at -Inf/+Inf approximated by -1000/1000
  pred <- function(eta, theta, cat = 1:(length(theta) + 1), inv.link = plogis) {
    Theta <- c(-1000, theta, 1000)
    sapply(cat, function(j) inv.link(Theta[j + 1] - eta) - inv.link(Theta[j] - eta))
  }
  # Multiply each row of newdata by the coefficients
  #coefs <- c(model$beta, unlist(model$ST))  ## use for a single clmm model
  beta <- modelAvg$coefficients[2, 3:12]     ## hard-coded to this model's terms
  coefs <- c(beta, unlist(modelAvg$ST))
  xbetas <- sweep(newdata, MARGIN = 2, coefs, `*`)
  # Make predictions
  Theta <- modelAvg$coefficients[2, 1:2]     ## the two threshold parameters
  #pred.mat <- data.frame(pred(eta = rowSums(xbetas), theta = model$Theta))
  pred.mat <- data.frame(pred(eta = rowSums(xbetas), theta = Theta))
  # Label the columns with the response levels of the first averaged model
  #colnames(pred.mat) <- levels(model$model[, 1])
  a <- attr(modelAvg, "modelList")
  colnames(pred.mat) <- levels(a[[1]]$model[, 1])
  pred.mat
}
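One likely reason the sweep() step fails on data like new_ma is that it needs a numeric design matrix rather than a data frame of factors; a hedged, untested sketch of preparing the input (the formula is copied from ma10 above):
# Hypothetical: expand the factors into the same dummy coding the model used,
# dropping the intercept column so the columns line up with the coefficients
X <- model.matrix(~ Sex * Grass3 + Sex * Forb3, data = new_ma)[, -1]
fake.predict.clmm(ma_1, as.data.frame(X))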
library(gbm)
fit <- gbm(Crop_Damage ~ Estimated_Insects_Count + Crop_Type + Soil_Type +
             Pesticide_Use_Category + Number_Doses_Week + Number_Weeks_Used +
             Number_Weeks_Quit + Season,
           data = mydata, distribution = "multinomial")
gbmpred <- predict(fit, mydata, n.trees = fit$n.trees)
I tried the code above, but it gives me probabilities. I want to get class predictions.
By default, the predictions of the predict function are on the scale of f(x); for the multinomial (classification) case the returned values are the class probabilities. All you have to do is translate them into class labels by taking, for each row, the class with the highest probability.
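A minimal sketch, reusing fit and mydata from the question (type = "response" makes the probability scale explicit):
# For a multinomial gbm, predict() returns an n x n.classes x 1 array of
# class probabilities; pick the most probable class for each row
probs <- predict(fit, newdata = mydata, n.trees = fit$n.trees, type = "response")
pred_class <- colnames(probs)[apply(probs[, , 1], 1, which.max)]
head(pred_class)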