glmnet multinomial logistic regression prediction result - r

I'm building a penalized multinomial logistic regression, but I'm having trouble coming up with an easy way to get the prediction accuracy. Here's my code:
fit.ridge.cv <- cv.glmnet(train[,-1], train[,1], type.measure="mse", alpha=0,
family="multinomial")
fit.ridge.best <- glmnet(train[,-1], train[,1], family = "multinomial", alpha = 0,
lambda = fit.ridge.cv$lambda.min)
fit.ridge.pred <- predict(fit.ridge.best, test[,-1], type = "response")
The first column of my test data is the response variable, and it has 4 categories. If I look at the result (fit.ridge.pred), it looks like this:
           1            2           3            4
0.8743061353 0.0122328811 0.004798154 0.1086628297
From what I understand these are the class probabilities. I want to know if there's an easy way to compute the model accuracy on the test data. Right now I'm taking the max for each row and comparing it with the original label. Thanks.

Something like:
predicted <- colnames(fit.ridge.pred)[apply(fit.ridge.pred,1,which.max)]
table(predicted, test[, 1])
The first line takes the class for which the model outputs the highest probability per row, after which the second line constructs a confusion matrix.
The accuracy is then basically the proportion of observations classified correctly (the sum of the diagonal divided by the total), for example:
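A minimal sketch of that calculation, reusing the predicted vector from above (it assumes the confusion matrix is square, i.e. every class gets predicted at least once):
cm <- table(predicted, test[, 1])   # confusion matrix
sum(diag(cm)) / sum(cm)             # accuracy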

For more details see Glmnet Vignette
fit.ridge.pred <- predict(fit.ridge.best, test[,-1], type = "class") # predict classes, not probability
table(fit.ridge.pred,test[,1]) # confusion matrix
mean(fit.ridge.pred==test[,1]) # accuracy

Related

Get real predicted values from GLM

I am running a GLM with linear regression, then I am using predict to fit the response on my test data, but the problem is that I am getting probabilities and I don't know how to convert those probabilities to real values.
log <- glm(stock_out_duration ~ lag_2_market_unres_dos + lag_2_percentage_bias_forecast_error + forecast,
data = train_data_final, family = inverse.gaussian(link = "log"), maxit = 100)
summary(log)
predict <- predict(log, test_data, type = 'response')
table_mat <- table(test_data$stock_out_duration)
table_mat
As far as I'm aware, there isn't a magic function that does this for you, given that you're using glm. As you've noted, what typically gets returned are the probabilities. You can convert the probabilities into predictions for the underlying categories by choosing the outcome with the largest probability. I agree a one-line function for this would be nice, though.
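For a binary classification glm, a minimal sketch of that idea would look like the following (model, test_data and the "yes"/"no" labels are placeholders, not objects from your code):
p <- predict(model, newdata = test_data, type = "response")  # predicted probabilities
pred_class <- ifelse(p > 0.5, "yes", "no")                   # pick the outcome with the larger probability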
Alternatively, you can get this functionality if you use the glmnet package.
library(glmnet)
y = ifelse(rnorm(100) > 0, "red", "blue")
y = factor(y)
x = matrix(rnorm(200), ncol = 2)  # glmnet needs a matrix of predictors with at least 2 columns
fit = glmnet(x, y, family="binomial") # use family="multinomial" if there are more than 2 categories in your factor
yhat = predict(fit, newx=x, type="class", s=0)
yhat in the above will be a vector containing either "red" or "blue".
Note that type="class" is the bit that gets you the category outcomes returned in yhat. The s=0 means to use a lambda penalty of zero for the coefficients used to get predictions. You indicated in the question that you were just doing ordinary regression without any ridge- or lasso-style penalty, so s=0 ensures you get that in your predictions.
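Continuing the toy example above, a quick check of how the class predictions line up with the true labels (accuracy will be near chance here because y was generated independently of x):
table(as.vector(yhat), y)    # confusion matrix for the toy data
mean(as.vector(yhat) == y)   # accuracy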

hurdle model prediction - count vs response

I'm working on a hurdle model and ran into a question I can't quite figure out. It was my understanding that the overall response prediction of the hurdle model is the count prediction multiplied by the probability prediction, i.e., the overall response has to be smaller than or equal to the count prediction. However, in my data the response prediction is higher than the count prediction, and I can't figure out why.
Here's a similar result for a toy model (code adapted from here):
library("pscl")
data("RecreationDemand", package = "AER")
## model
m <- hurdle(trips ~ quality | ski, data = RecreationDemand, dist = "negbin")
nd <- data.frame(quality = 0:5, ski = "no")
predict(m, newdata = nd, type = "count")
predict(m, newdata = nd, type = "response")
Why is it that the counts are higher than the responses?
Edit: added comparison to glm.nb
Also - I was under the impression that the count part of the hurdle model should give identical predictions to a count-model of only positive values. When I try that, I get completely different values. What am I missing??
library(MASS)
m.nb <- glm.nb(trips ~ quality, data = RecreationDemand[RecreationDemand$trips > 0,])
predict(m, newdata = nd, type = "count") ## hurdle
predict(m.nb, newdata = nd, type = "response") ## positive counts only
The last question is the easiest to answer. The "count" part of the hurdle model is not simply a standard count model (which includes a positive probability for zeros) but a zero-truncated count model (where zeros cannot occur).
Using the countreg package from R-Forge you can fit the model you attempted to fit with glm.nb in your example. (Alternatively, VGAM or gamlss could also be used to fit the same model.)
library("countreg")
m.truncnb <- zerotrunc(trips ~ quality, data = RecreationDemand,
subset = trips > 0, dist = "negbin")
cbind(hurdle = coef(m, model = "count"), zerotrunc = coef(m.truncnb), negbin = coef(m.nb))
## hurdle zerotrunc negbin
## (Intercept) 0.08676189 0.08674119 1.75391028
## quality 0.02482553 0.02483015 0.01671314
Up to small numerical differences the first two models are exactly equivalent. The non-truncated model, however, has to compensate the lack of zeros by increasing the intercept and dampening the slope parameter, which is clearly not appropriate here.
As for the predictions, one can distinguish three quantities:
The expectation from the untruncated count part, i.e., simply exp(x'b).
The conditional/truncated expectation from the count part, i.e., accounting for the zero truncation: exp(x'b)/(1 - f(0)) where f(0) is the probability for 0 in that count part.
The overall expectation for the complete hurdle model, i.e., the probability for crossing the hurdle times the conditional expectation from 2.: exp(x'b)/(1 - f(0)) * (1 - g(0)) where g(0) is the probability for 0 in the zero hurdle part of the model.
See also Section 2.2 and Appendix C in vignette("countreg", package = "pscl") for more details and formulas. predict(..., type = "count") computes item 1 from above, whereas predict(..., type = "response") computes item 3 for a hurdle model and item 2 for a zerotrunc model.
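A minimal sketch of how the three quantities fit together for the m and nd objects above; the use of m$theta and of the type = "prob" prediction to recover f(0) and g(0) is my assumption about the fitted pscl object, so treat it as an illustration rather than package documentation:
mu   <- predict(m, newdata = nd, type = "count")      # item 1: exp(x'b), untruncated count mean
f0   <- dnbinom(0, mu = mu, size = m$theta["count"])  # f(0): zero probability of the negbin count part
cond <- mu / (1 - f0)                                 # item 2: zero-truncated expectation
g0   <- predict(m, newdata = nd, type = "prob")[, 1]  # g(0): P(y = 0) from the zero hurdle part
cond * (1 - g0)                                       # item 3: should match type = "response"
predict(m, newdata = nd, type = "response")           # for comparison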

R: varying-coefficient GAMM models in mgcv - extracting 'by' variable coefficients?

I am creating a varying-coefficient GAMM using 'mgcv' in R with a continuous 'by' variable. However, I am having difficulty locating the parameter estimate of the effect of the 'by' variable. In this example we determine the spatially-dependent effect of temperature t on sole eggs (i.e. how the linear effect of temperature on sole eggs changes across space):
require(mgcv)
require(gamair)
data(sole)
b = gam(eggs ~ s(la,lo) + s(la,lo, by = t), data = sole)
We can then plot the predicted effects of s(la,lo, by = t) against the predictor t:
pred <- predict(b, type = "terms", se.fit =T)
by.variable.prediction <- pred[[1]][,2]
plot(x= sole$t, y = by.variable.prediction)
However, I can't find a listing/function with the parameter estimates of the 'by' variable t for each sampling location. summary(), coef(), and predict() do not give you the parameter estimates.
Any help would be appreciated!
The coefficient for the variable t is the value of the term where t is equal to 1, conditional on latitude and longitude. So one way to get the coefficient/parameter estimate for t at each latitude and longitude is to construct your own data frame with a range of latitude/longitude combinations and t = 1, and run predict.gam on that (rather than running predict.gam on the data used to fit the model, as you have done). So:
preddf <- expand.grid(list(la=seq(min(sole$la), max(sole$la), length.out=100),
lo=seq(min(sole$lo), max(sole$lo), length.out=100),
t=1))
preddf$parameter <- predict(b, preddf, type="response")
And then if you want to visualize this coefficient over space, you could graph it with ggplot2.
library(ggplot2)
ggplot(preddf) +
geom_tile(aes(x=lo, y=la, fill=parameter))

how to get predictions using gbm in r

fit <- gbm(Crop_Damage ~ Estimated_Insects_Count+Crop_Type+ Soil_Type
+Pesticide_Use_Category+Number_Doses_Week+Number_Weeks_Used
+Number_Weeks_Quit+Season,
data = mydata, distribution="multinomial")
gbmpred <- predict(fit,mydata,n.trees = fit$n.trees)
I tried the above code but it gives me probabilities. I want to get the class predictions.
By default the predictions of the predict function are on the scale of f(x); with type = "response" the multinomial (classification) case returns the class probabilities. All you have to do is translate them into class labels by taking the class that has the highest probability, as sketched below.
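A minimal sketch of that translation, reusing fit and mydata from the question (the [, , 1] indexing and the class labels as column names reflect the usual array layout predict.gbm gives for multinomial fits, so double-check against your object):
probs <- predict(fit, mydata, n.trees = fit$n.trees, type = "response")
probs <- probs[, , 1]                                     # drop the n.trees dimension
gbmpred <- colnames(probs)[apply(probs, 1, which.max)]    # most probable class per row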

Performing Anova on Bootstrapped Estimates from Quantile Regression

So I'm using the quantreg package in R to conduct quantile regression analyses to test how the effects of my predictors vary across the distribution of my outcome.
FML <- as.formula(outcome ~ VAR + c1 + c2 + c3)
quantiles <- c(0.25, 0.5, 0.75)
q.Result <- list()
for (i in quantiles){
i.no <- which(quantiles==i)
q.Result[[i.no]] <- rq(FML, tau=i, data, method="fn", na.action=na.omit)
}
Then I call anova.rq, which runs a Wald test on all the models and outputs a p-value for each covariate, telling me whether the effects of each covariate vary significantly across the distribution of my outcome.
anova.Result <- anova(q.Result[[1]], q.Result[[2]], q.Result[[3]], joint=FALSE)
That works just fine. However, for my particular data (and in general?), bootstrapping my estimates and their errors is preferable, which I conduct with a slight modification of the code above.
Q.mod <- rq(FML, tau=quantiles, data, method="fn", na.action=na.omit)
q.Summary <- summary(Q.mod, se="boot", R=10000, bsmethod="mcmb",
covariance=TRUE)
Here's where I get stuck. The quantreg package currently cannot perform the anova (Wald) test on bootstrapped estimates. The package documentation specifically states that "extensions of the methods to be used in anova.rq should be made" regarding the bootstrapping method.
Looking at the details of the anova.rq method, I can see that it requires two components that are not present in the quantile model when bootstrapping:
1) Hinv (the inverse Hessian matrix). The package documentation specifically states: "note that for se = "boot" there is no way to split the estimated covariance matrix into its sandwich constituent parts."
2) J, which, according to the documentation, is the "Unscaled Outer product of gradient matrix returned if cov=TRUE and se != "iid". The Huber sandwich is cov = tau (1-tau) Hinv %*% J %*% Hinv." As with the Hinv component, there is no J component when se == "boot". (Note that to make the Huber sandwich you need to add the tau (1-tau) mayonnaise yourself.)
Can I calculate or estimate Hinv and J from the bootstrapped estimates? If not, what is the best way to proceed?
Any help on this is much appreciated. This is my first time posting a question here, though I've greatly benefited from the answers to other people's questions in the past.
For question 2: you can use the R argument for resampling. For example:
anova(object, ..., test = "Wald", joint = TRUE, score =
"tau", se = "nid", R = 10000, trim = NULL)
where R is the number of resampling replications for the anowar form of the test, used to estimate the reference distribution for the test statistic.
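For instance, a hedged sketch of calling that on the three fits from the question (test = "anowar" is my reading of how the anova.rq documentation names this resampling form of the test, so check ?anova.rq for your quantreg version):
anova(q.Result[[1]], q.Result[[2]], q.Result[[3]],
joint = FALSE, test = "anowar", R = 10000)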
Just a heads up, you'll probably get a better response to your questions if you only include 1 question per post.
I consulted with a colleague, and he confirmed that it was unlikely that Hinv and J could be 'reverse' computed from bootstrapped estimates. However, we resolved that estimates from different taus could be compared using a Wald test as follows.
From the rqs object produced by
q.Summary <- summary(Q.mod, se="boot", R=10000, bsmethod="mcmb", covariance=TRUE)
you extract the bootstrapped beta values for the variable of interest (in this case VAR, the first covariate in FML) for each tau:
boot.Bs <- sapply(q.Summary, function (x) x[["B"]][,2])
B0 <- coef(summary(lm(FML, data)))[2,1] # extract the linear (OLS) estimate as the reference value
Then compute the Wald statistic and get the p-value, using the number of quantiles as the degrees of freedom:
Wald <- sum(apply(boot.Bs, 2, function (x) ((mean(x)-B0)^2)/var(x)))
Pvalue <- pchisq(Wald, ncol(boot.Bs), lower=FALSE)
You also want to verify that the bootstrapped betas are normally distributed, and if you're running many taus it can be cumbersome to check all those Q-Q plots, so just sum them by row:
qqnorm(apply(boot.Bs, 1, sum))
qqline(apply(boot.Bs, 1, sum), col = 2)
This seems to be working, and if anyone can think of anything wrong with my solution, please share.
