Get real predicted values from GLM - r

I am running a GLM regression, then using predict() to fit the response on my test data, but the problem is that I am getting probabilities and I don't know how to convert those probabilities to real values.
log <- glm(stock_out_duration ~ lag_2_market_unres_dos +
             lag_2_percentage_bias_forecast_error + forecast,
           data = train_data_final,
           family = inverse.gaussian(link = "log"), maxit = 100)
summary(log)
predict <- predict(log, test_data, type = "response")
table_mat <- table(test_data$stock_out_duration)
table_mat

As far as I'm aware, there isn't a magic function that does this for you given that you're using glm. As you've noted, what typically gets returned is the probabilities. You can convert the probabilities into predictions for the outcome of the underlying categories by choosing the outcome with the largest probability. I agree a one-line function for this would be nice though.
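For a two-class glm fit, though, the conversion only takes a line or two. Here is a minimal sketch on invented data (none of these variables come from the question):
set.seed(1)
toy <- data.frame(x = rnorm(100))
toy$y <- factor(ifelse(toy$x + rnorm(100) > 0, "red", "blue"))
m <- glm(y ~ x, data = toy, family = binomial)
p <- predict(m, toy, type = "response") # P(y == "red"), the second factor level
yhat <- ifelse(p > 0.5, "red", "blue")  # pick the more probable class
table(yhat, toy$y)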
Alternatively, you can get this functionality out of the box if you use the glmnet package.
library(glmnet)
set.seed(1)
y <- factor(ifelse(rnorm(100) > 0, "red", "blue"))
x <- matrix(rnorm(200), nrow = 100, ncol = 2) # glmnet requires a matrix with at least two columns
fit <- glmnet(x, y, family = "binomial") # use family = "multinomial" if there are more than 2 categories in your factor
yhat <- predict(fit, newx = x, type = "class", s = 0)
yhat in the above will be a vector containing either "red" or "blue".
Note, the type="class" is the bit that gets you the category outcomes returned in yhat. The s=0 means to use a lambda penalty of zero for the coefficients you use to get predictions. You indicated in the question that you were just doing ordinary regression without any ridge or lasso style penalty factors, so s=0 ensures you get that in your predictions.
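As a sanity check (my own addition, not part of the original answer), the s = 0 coefficients should roughly match an unpenalized glm fit on the same data; use exact = TRUE if you want glmnet to refit at exactly s = 0 rather than interpolate along its lambda path:
coef(fit, s = 0)                    # approximately unpenalized coefficients
coef(glm(y ~ x, family = binomial)) # should be close to the above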

Related

Why are the predictions from poisson lasso regression model in glmnet not integers?

I am conducting a lasso regression modeling predictors of a count outcome in glmnet.
I am wondering what to make of the predictions from this model.
Here is some toy data. It's not very good because I don't know how to simulate multivariate data but I'm mainly interested in whether I'm getting the syntax right.
set.seed(123)
df <- data.frame(count = rpois(500, lambda = 3),
                 pred1 = rnorm(500),
                 pred2 = rnorm(500),
                 pred3 = rnorm(500),
                 pred4 = rnorm(500),
                 pred5 = rnorm(500),
                 pred6 = rnorm(500),
                 pred7 = rnorm(500),
                 pred8 = rnorm(500),
                 pred9 = rnorm(500),
                 pred10 = rnorm(500))
Now run the model
library(glmnet)
x <- model.matrix(count ~ ., df)[, -1]
y <- df$count
cvg <- cv.glmnet(x, y, family = "poisson")
Now when I generate predicted outcomes:
yTest <- predict(cvg, newx = x, family = "poisson", type = "link") # note: the family argument is ignored here; predict() takes it from the fitted object
This is the output
# 1 1.094604
# 2 1.094604
# 3 1.094604
# 4 1.094604
# 5 1.094604
# 6 1.094604
# ... ........
Now obviously the model predictions are all the same and all terrible (unsurprising given the absence of any association between the predictors and the outcome), but the thing I am wondering is why they are not integers (with my real data I have the same problem).
So my questions are:
1. Am I specifying the correct arguments in the predict() function? The help for predict states that, for Poisson models, specifying type = "link" gives "the linear predictors", whereas type = "response" gives the "fitted mean" (in the case of my dumb example it generates 500 values of 2.988).
2. Shouldn't the predicted outcomes match the form of the data itself, i.e. be integers?
3. If I am specifying the correct arguments in the predict() function, what do I do with the non-integer predictions? Do I round them to the nearest integer, or just leave them alone?
Shouldn't the predicted outcomes match the form of the data itself, i.e. be integers?
When you use a regression model, you associate with each predictor configuration a (conditional) probability distribution indexed by parameters (in the Poisson case, the parameter lambda, which is the mean). A prediction of the response minimizes some expected loss function conditional on the predictor values, so it depends on which loss function you are using.
If you consider a 0-1 loss, then yes, the predicted value should be an integer: the mode of the distribution, its most probable value, which for a Poisson distribution is the floor of lambda when lambda is not an integer (https://en.wikipedia.org/wiki/Poisson_distribution).
If you consider a squared loss (y - y_prediction)^2 then your prediction is the conditional expectation (see https://en.wikipedia.org/wiki/Minimum_mean_square_error#Properties), which is not necessarily an integer, just like the result you are getting.
glmnet's type = "response" predictions are conditional means, i.e. the optimal predictions under squared loss, but you can easily get an integer prediction (the one that minimizes the 0-1 loss) by applying the floor() function to the predicted values glmnet outputs.
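To make that concrete, here is a small sketch continuing the example above (note that exp(1.094604) is about 2.988, which is why type = "response" returned that value):
mu <- predict(cvg, newx = x, type = "response") # conditional means, about 2.988 each
mu.int <- floor(mu)                             # integer prediction minimizing 0-1 loss
head(cbind(mu, mu.int))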

glmnet multinomial logistic regression prediction result

I'm building a penalized multinomial logistic regression, but I'm having trouble coming up with an easy way to get the prediction accuracy. Here's my code:
fit.ridge.cv <- cv.glmnet(train[, -1], train[, 1], type.measure = "mse",
                          alpha = 0, family = "multinomial")
fit.ridge.best <- glmnet(train[, -1], train[, 1], family = "multinomial",
                         alpha = 0, lambda = fit.ridge.cv$lambda.min)
fit.ridge.pred <- predict(fit.ridge.best, test[, -1], type = "response")
The first column of my test data is the response variable, which has 4 categories. If I look at the result (fit.ridge.pred), a row looks like this:
           1            2           3            4
0.8743061353 0.0122328811 0.004798154 0.1086628297
From what I understand these are the class probabilities. I want to know if there's an easy way to compute the model accuracy on the test data. Right now I'm taking the max of each row and comparing it with the original label. Thanks
Something like:
predicted <- colnames(fit.ridge.pred)[apply(fit.ridge.pred, 1, which.max)]
table(predicted, test[, 1])
The first line picks, for each row, the class to which the model assigns the highest probability; the second line builds a confusion matrix of those predictions against the observed labels.
The accuracy is then simply the proportion of observations classified correctly (the sum of the diagonal divided by the total).
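In code, that last step could look like this, reusing predicted and test from above:
cm <- table(predicted, test[, 1]) # confusion matrix
sum(diag(cm)) / sum(cm)           # accuracy: correct classifications / total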
For more details see the Glmnet Vignette.
fit.ridge.pred <- predict(fit.ridge.best, test[, -1], type = "class") # predict classes, not probabilities
table(fit.ridge.pred, test[, 1])  # confusion matrix
mean(fit.ridge.pred == test[, 1]) # accuracy

R coefficients of glmnet::cvfit

As I understand it, cv.glmnet does K-fold cross-validation: each time it splits the data into a training set and a validation set. For every fixed lambda, it first uses the training data to estimate a coefficient vector, then applies the fitted model to the validation set to get the error.
Hence, for K-fold CV there are K coefficient vectors (each generated from a different training set). So what does
coef(cvfit)
return?
Here is an example:
library(glmnet)
x <- iris[1:100, 1:4]
y <- factor(iris[1:100, 5])
fit <- cv.glmnet(data.matrix(x), y, family = "binomial", type.measure = "class",
                 alpha = 1, nfolds = 3, standardize = TRUE)
coef(fit, s = c(fit$lambda.min, fit$lambda.1se))
fit1 <- glmnet(data.matrix(x), y, family = "binomial",
               standardize = TRUE,
               lambda = c(fit$lambda.1se, fit$lambda.min))
coef(fit1)
In fit1 I use the whole dataset as the training set, yet the coefficients of fit1 and fit are exactly the same. Why is that?
Thanks in advance.
Although cv.glmnet checks model performance by cross-validation, the coefficients it returns for each lambda value come from fitting the model on the full dataset.
The help for cv.glmnet (type ?cv.glmnet) includes a Value section that describes the returned object. The returned list object (fit in your case) includes an element called glmnet.fit, which the help describes like this:
glmnet.fit a fitted glmnet object for the full data.
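So coef(fit, s = ...) simply reads coefficients off that full-data fit. A quick sketch, reusing fit from the question, to see this directly:
coef(fit, s = fit$lambda.min)            # coefficients via the cv.glmnet object
coef(fit$glmnet.fit, s = fit$lambda.min) # same coefficients, read from the stored full-data fit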

R: varying-coefficient GAMM models in mgcv - extracting 'by' variable coefficients?

I am creating a varying-coefficient GAMM using mgcv in R with a continuous 'by' variable. However, I am having difficulty locating the parameter estimate of the effect of the 'by' variable. In this example we estimate the spatially varying effect of temperature t on sole eggs (i.e. how the linear effect of temperature on sole eggs changes across space):
require(mgcv)
require(gamair)
data(sole)
b = gam(eggs ~ s(la,lo) + s(la,lo, by = t), data = sole)
We can then plot the predicted effects of s(la,lo, by = t) against the predictor t:
pred <- predict(b, type = "terms", se.fit = TRUE)
by.variable.prediction <- pred[[1]][, 2]
plot(x = sole$t, y = by.variable.prediction)
However, I can't find a listing/function with the parameter estimates of the 'by' variable t for each sampling location. summary(), coef(), and predict() do not give you the parameter estimates.
Any help would be appreciated!
The varying coefficient of t at a given location is the value of the s(la,lo):t term when t equals 1, since that term is f(la,lo) * t. So one way to get the coefficient/parameter estimate of t at each latitude and longitude is to construct your own data frame covering a grid of latitude/longitude combinations with t = 1, run predict.gam on that (rather than on the data used to fit the model, as you have done), and extract just the s(la,lo):t term:
preddf <- expand.grid(la = seq(min(sole$la), max(sole$la), length.out = 100),
                      lo = seq(min(sole$lo), max(sole$lo), length.out = 100),
                      t = 1)
# type = "terms" isolates each model term; the "s(la,lo):t" column is the
# varying coefficient of t (a type = "response" prediction at t = 1 would
# also include the intercept and the s(la,lo) term)
term.fit <- predict(b, preddf, type = "terms")
preddf$parameter <- term.fit[, "s(la,lo):t"]
And then if you want to visualize this coefficient over space, you could graph it with ggplot2.
library(ggplot2)
ggplot(preddf) +
  geom_tile(aes(x = lo, y = la, fill = parameter))

how to get predictions using gbm in r

library(gbm)
fit <- gbm(Crop_Damage ~ Estimated_Insects_Count + Crop_Type + Soil_Type +
             Pesticide_Use_Category + Number_Doses_Week + Number_Weeks_Used +
             Number_Weeks_Quit + Season,
           data = mydata, distribution = "multinomial")
gbmpred <- predict(fit, mydata, n.trees = fit$n.trees)
I tried the code above, but it gives me probabilities. I want to get the predicted classes.
By default the predictions of the predict function are on the scale of f(x); for the multinomial (classification) case, requesting type = "response" returns the class probabilities. All you have to do is translate them to responses by taking the class that has the highest probability.
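A sketch of that translation, assuming the fit above (predict.gbm returns an n x n.classes x 1 array for multinomial models, with the class labels as column names as far as I know):
probs <- predict(fit, mydata, n.trees = fit$n.trees, type = "response")
pred.class <- colnames(probs)[apply(probs[, , 1], 1, which.max)] # most probable class per row
head(pred.class)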
