How to get predictions using gbm in R

fit <- gbm(Crop_Damage ~ Estimated_Insects_Count + Crop_Type + Soil_Type +
             Pesticide_Use_Category + Number_Doses_Week + Number_Weeks_Used +
             Number_Weeks_Quit + Season,
           data = mydata, distribution = "multinomial")
gbmpred <- predict(fit, mydata, n.trees = fit$n.trees)
I tried the above code but it gives me probabilities. I want to get the predictions (class labels).

By default, the predictions of the predict function are on the scale of f(x); hence, in the multinomial (classification) case, the returned values are class probabilities. All you have to do is translate them to responses by taking the class that has the highest probability, as in the sketch below.
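A minimal sketch of that translation, reusing fit and mydata from the question (for a multinomial gbm, predict with type="response" returns an n x n.classes x 1 array of probabilities):
probs <- predict(fit, mydata, n.trees = fit$n.trees, type = "response")
probs <- probs[, , 1]                                    # drop the trailing tree dimension
gbmpred <- colnames(probs)[apply(probs, 1, which.max)]   # most probable class per row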

Related

predict function with lasso regression

I am trying to implement lasso regression for my sales prediction problem. I am using the glmnet package and the cv.glmnet function to train the model.
library(glmnet)
set.seed(123)
model = cv.glmnet(x = as.matrix(train[, -which(names(train) %in% "Sales")]),
                  y = train$Sales,
                  alpha = 1,
                  lambda = 10^seq(4, -1, -0.1))
best_lambda = model$lambda.min
lasso_predictions_valid <- predict(model, s = best_lambda, type = "coefficients")
After reading a few articles about implementing lasso regression, I still don't know how to supply the test data I want to predict on. There is a newx argument to the predict function that I also don't understand. I mean, in most regression types we have a newdata or data argument where we pass our test data.
I think there is an error in your lasso_predictions_valid: you shouldn't put valid$Sales as your newx, as that is the actual sales number.
Once you have created the model with the train set, for newx you need to pass the matrix of x values that you want to make predictions on; in this case that will be your validation set.
Looking at your example code above, I think your predict line should be something like:
lasso_predictions_valid <- predict(model, s = best_lambda,
                                   newx = as.matrix(valid[, -which(names(valid) %in% "Sales")]),
                                   type = "response")  # "response" returns predictions; "coefficients" ignores newx
Then you should run your RMSE() line:
RMSE(lasso_predictions_valid, valid$Sales)
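RMSE() here presumably comes from a package such as caret or Metrics; a base-R equivalent, assuming the objects above:
sqrt(mean((as.numeric(lasso_predictions_valid) - valid$Sales)^2))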

Get real predicted values from GLM

I am running a GLM with linear regression, then using predict to fit the response on my test data, but the problem is that I am getting probabilities and I don't know how to convert those probabilities to real values.
log <- glm(formula = stock_out_duration ~ lag_2_market_unres_dos +
             lag_2_percentage_bias_forecast_error + forecast,
           data = train_data_final,
           family = inverse.gaussian(link = "log"), maxit = 100)
summary(log)
predict <- predict(log, test_data, type = 'response')
table_mat <- table(test_data$stock_out_duration)
table_mat
As far as I'm aware, there isn't a magic function that does this for you given that you're using glm. As you've noted, what typically gets returned is the probabilities. You can convert the probabilities into predictions for the underlying categories by choosing the outcome with the largest probability, as sketched below. I agree a one-line function for this would be nice, though.
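For a two-class outcome, that amounts to thresholding at 0.5. A minimal sketch, reusing the names from the question but assuming a binomial glm with a binary 0/1 response (note the inverse-Gaussian fit above actually returns response-scale values, not probabilities):
probs <- predict(log, test_data, type = "response")  # predicted probabilities
pred_class <- ifelse(probs > 0.5, 1, 0)              # class with the larger probability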
You can get this functionality if you use the glmnet package.
library(glmnet)
y = ifelse(rnorm(100) > 0, "red", "blue")
y = factor(y)
x = matrix(rnorm(200), nrow = 100, ncol = 2)  # glmnet requires a matrix with two or more columns
fit = glmnet(x, y, family = "binomial")  # use family="multinomial" if there are more than 2 categories in your factor
yhat = predict(fit, newx = x, type = "class", s = 0)
yhat in the above will be a vector containing either "red" or "blue".
Note that type="class" is the bit that gets you the category outcomes returned in yhat. The s=0 means to use a lambda penalty of zero for the coefficients used to get predictions. You indicated in the question that you were just doing ordinary regression without any ridge- or lasso-style penalty factors, so s=0 ensures you get that in your predictions.

glmnet multinomial logistic regression prediction result

I'm building a penalized multinomial logistic regression, but I'm having trouble coming up with an easy way to get the prediction accuracy. Here's my code:
fit.ridge.cv <- cv.glmnet(train[,-1], train[,1], type.measure = "mse",
                          alpha = 0, family = "multinomial")
fit.ridge.best <- glmnet(train[,-1], train[,1], family = "multinomial",
                         alpha = 0, lambda = fit.ridge.cv$lambda.min)
fit.ridge.pred <- predict(fit.ridge.best, test[,-1], type = "response")
The first column of my test data is the response variable, and it has 4 categories. If I look at the result (fit.ridge.pred) it looks like this:
           1            2           3            4
0.8743061353 0.0122328811  0.004798154 0.1086628297
From what I understand, these are the class probabilities. I want to know if there's an easy way to compute the model accuracy on the test data. Right now I'm taking the max of each row and comparing it with the original label. Thanks
Something like:
predicted <- colnames(fit.ridge.pred)[apply(fit.ridge.pred, 1, which.max)]
table(predicted, test[, 1])
The first line takes the class for which the model outputs the highest probability per row, after which the second line constructs a confusion matrix.
The accuracy is then basically the proportion of observations classified correctly (the sum of the diagonal divided by the total), as in the sketch below.
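A short sketch under the objects above:
tab <- table(predicted, test[, 1])  # confusion matrix
sum(diag(tab)) / sum(tab)           # accuracy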
For more details see the Glmnet Vignette.
fit.ridge.pred <- predict(fit.ridge.best, test[,-1], type = "class")  # predict classes, not probabilities
table(fit.ridge.pred, test[,1])   # confusion matrix
mean(fit.ridge.pred == test[,1])  # accuracy

Probability predictions with model averaged Cumulative Link Mixed Models fitted with clmm in ordinal package

I found that the predict function is currently not implemented for cumulative link mixed models fitted using the clmm function in the ordinal R package. While predict is implemented for clmm2 in the same package, I chose to apply clmm instead because the latter allows for more than one random effect. Furthermore, I fitted several clmm models and performed model averaging using the model.avg function in the MuMIn package. Ideally, I want to predict probabilities using the averaged model. However, while MuMIn supports clmm models, predict will also not work with the averaged model.
Is there a way to hack the predict function so that it can not only predict probabilities from a clmm model, but also predict using model-averaged coefficients from clmm (i.e. an object of class "averaging")? For example:
require(ordinal)
require(MuMIn)
mm1 <- clmm(SURENESS ~ PROD + (1|RESP) + (1|RESP:PROD), data = soup,
            link = "probit", threshold = "equidistant")
## the same model with a logistic link:
mm2 <- clmm(SURENESS ~ PROD + (1|RESP) + (1|RESP:PROD), data = soup,
            link = "logistic", threshold = "equidistant")
## create a model selection object
mm.sel <- model.sel(mm1, mm2)
## perform model averaging
mm.avg <- model.avg(mm.sel)
## create new data and predict
new.data <- soup
## predict with an individual model
predict(mm1, new.data)
I got the following error message:
Error in UseMethod("predict") :
  no applicable method for 'predict' applied to an object of class "clmm"
## predict with the model average
predict(mm.avg, new.data)
Another error is returned:
Error in predict.averaging(mm.avg, new.data) :
predict for models 'mm1' and 'mm2' caused errors
I've been using clmm as well, and yes, I can confirm that predict.clmm is NOT (yet?) implemented. I haven't yet checked the source code of fake.predict.clmm (given in the next answer); it might work. If it doesn't, you're stuck with doing things by hand or using predict.clmm2.
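A hedged sketch of the predict.clmm2 fallback, reusing the soup example from the question; note that clmm2 supports only a single random term, so the RESP:PROD effect is dropped here:
mm1b <- clmm2(SURENESS ~ PROD, random = RESP, data = soup,
              link = "probit", threshold = "equidistant", Hess = TRUE)
# with the response present in newdata, this returns the fitted
# probability of each observed response category
head(predict(mm1b, newdata = soup))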
I found a potential solution (pasted below) but have not been able to make it work for my data.
Solution here: https://gist.github.com/mainambui/c803aaf857e54a5c9089ea05f91473bc
I think the problem is the number of coefficients I am using, but I am not experienced enough to figure it out. Hopefully this helps someone out, though.
This is the model and newdata that I am using, though it is actually a model-averaged version. Same predictors, though.
ma10 <- clmm(Location3 ~ Sex * Grass3 + Sex * Forb3 + (1|Tag_ID),
             data = IP_all_dunes)
ma_1 <- model.avg(ma10, ma8, ma5)  ## top 3 models
new_ma <- data.frame(Sex = c("m","f","m","f","m","f","m","f"),
                     Grass3 = c("1","1","1","1","0","0","0","0"),
                     Forb3 = c("0","0","1","1","0","0","1","1"))
# Arguments:
#   - modelAvg = a clmm model average (object of class "averaging");
#     the commented-out lines show the equivalent for a single clmm model
#   - newdata  = a data frame of new data to apply the model to
# Returns a data frame of predicted probabilities for each row and response level
fake.predict.clmm <- function(modelAvg, newdata) {
  # Actual prediction function
  pred <- function(eta, theta, cat = 1:(length(theta) + 1), inv.link = plogis) {
    Theta <- c(-1000, theta, 1000)
    sapply(cat, function(j) inv.link(Theta[j + 1] - eta) - inv.link(Theta[j] - eta))
  }
  # Multiply each row by the coefficients
  # coefs <- c(model$beta, unlist(model$ST))  ## use this if a single model is passed
  beta <- modelAvg$coefficients[2, 3:12]  # indices specific to this model's coefficient layout
  coefs <- c(beta, unlist(modelAvg$ST))
  xbetas <- sweep(newdata, MARGIN = 2, coefs, `*`)
  # Make predictions
  Theta <- modelAvg$coefficients[2, 1:2]
  # pred.mat <- data.frame(pred(eta = rowSums(xbetas), theta = model$Theta))
  pred.mat <- data.frame(pred(eta = rowSums(xbetas), theta = Theta))
  # colnames(pred.mat) <- levels(model$model[, 1])
  a <- attr(modelAvg, "modelList")
  colnames(pred.mat) <- levels(a[[1]]$model[, 1])
  pred.mat
}

GBM cross validation

I'm trying to use R's gbm regression model.
I want to compute the coefficient of determination (R squared) between the cross-validation predicted response values and the true response values. However, the cv.fitted values of the gbm.object only provide the predicted response values for 1 - train.fraction of the data. So in order to get what I want, I need to find which observations correspond to the cv.fitted values.
Any idea how to get that information?
You can use the predict function to easily get at model predictions, if I'm understanding your question correctly.
dat <- data.frame(y = runif(1000), x = rnorm(1000))
gbmMod <- gbm::gbm(y ~ x, data = dat, n.trees = 5000, cv.folds = 0)
summary(lm(predict(gbmMod, newdata = dat, n.trees = 5000) ~ dat$y))$adj.r.squared
But shouldn't we hold data to the side and assess model accuracy on test data? This would correspond to the following, where I partition the data into a training set (70%) and testing set (30%):
inds <- sample(1:nrow(dat), 0.7*nrow(dat))
train <- dat[inds, ]
test <- dat[-inds, ]
gbmMod2 <- gbm::gbm(y~x, data=train, n.trees=5000)
preds <- predict(gbmMod2, newdata = test, n.trees=5000)
summary(lm(preds ~ test[,1]))$adj.r.squared
It's also worth noting that the number of trees in the gbm can be tuned using the gbm.perf function together with the cv.folds argument to the gbm function, as sketched below. This helps avoid overfitting.
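A minimal sketch of that tuning, reusing the train/test split above (the choice of 5 folds is illustrative):
gbmCV <- gbm::gbm(y ~ x, data = train, n.trees = 5000, cv.folds = 5)
bestTrees <- gbm::gbm.perf(gbmCV, method = "cv")  # CV-optimal number of trees
predsCV <- predict(gbmCV, newdata = test, n.trees = bestTrees)
summary(lm(predsCV ~ test$y))$adj.r.squared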
