Not able to predict with gamma distributed GLM model in H2O - r

I have just computed a gamma GLM with the h2o package in R.
When I'm trying to predict on the test set I get this error:
Illegal argument(s) for GLM model: GLM_model_R_1644680218230_95. Details: ERRR on field: _family: Response value for gamma distribution must be greater than 0.
I understand that a gamma model cannot be trained on data with a zero response, but one should be able to predict on data whose true value is 0 (this is done a lot in actuarial work).
Does anyone know an h2o solution? I know I can simply fit the model with glm() or something similar, but I'm relying on mean-encoded categorical variables (which is really convenient in h2o).
Thanks!

Based on the description in the comments, this looks like a bug in H2O, and a fix will be needed.

Related

glm with gamma family with NA / 0 Values in r

I would like to use a generalized linear mixed effect model. My data follows a gamma distribution but contains NA and 0 values. However, the gamma family does not allow me to compute these models if I have 0 values. Does anyone know of a way to go around this problem?
I heard that the glmmTMB package allows the use of gamma distributions with negative values, but I work on a Mac, and it seems that I cannot install this package.
When I try, I get an error code stating "clang: error: unsupported option '-fopenmp'".
It would be great if any of you had an idea.
The Gamma distribution has no support on the non-positive real numbers. Accordingly, you are basically asking it to model data which it could never produce and therefore the software throws an error.
Similarly, the missing data cannot be modelled because the model you specify does not itself model or marginalize out the missing data. You will need to either replace the missing values with some number (impute missing values) deterministically/probabilistically or drop the observations with missing values.
In short, you will need to employ an alternative model. You could use a zero-inflated gamma model or a gamma hurdle model. See here for an example. There is no "correct" alternative: each one is only a model, and you will need to think about their relative strengths and weaknesses (assumptions, etc.).
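As a rough illustration of the gamma hurdle idea mentioned above, here is a minimal sketch in base R. The data are simulated and the names (d, fit_zero, fit_pos) are made up for the example; it is not the code from the linked example, only one common way to set up a two-part model.

```r
# Two-part (hurdle) gamma sketch in base R; all data are simulated for illustration.
set.seed(1)
n <- 500
x <- rnorm(n)
is_pos <- rbinom(n, 1, plogis(0.5 + x))   # part 1: is the response positive at all?
mu <- exp(1 + 0.3 * x)                    # part 2: mean of the positive responses
y <- ifelse(is_pos == 1, rgamma(n, shape = 2, rate = 2 / mu), 0)
d <- data.frame(x = x, y = y)

# Part 1: logistic regression for P(y > 0), fitted on all rows.
fit_zero <- glm(I(y > 0) ~ x, family = binomial, data = d)

# Part 2: gamma GLM for the size of y, fitted on the positive rows only.
fit_pos <- glm(y ~ x, family = Gamma(link = "log"), data = subset(d, y > 0))

# Combined expectation: E[y] = P(y > 0) * E[y | y > 0].
# Prediction works for every row, including those where the observed y is 0.
p_pos <- predict(fit_zero, newdata = d, type = "response")
mu_pos <- predict(fit_pos, newdata = d, type = "response")
d$pred <- p_pos * mu_pos
```

The zeros never enter the gamma fit, so the gamma family's requirement of a strictly positive response is respected, while the logistic part accounts for how often zeros occur.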

How to run a truncated and inflated Poisson model in R?

My data doesn't contain any zeros. The minimum value for my outcome, y, is 1 and that is the value that is inflated. My objective is to run a truncated and inflated Poisson regression model using R.
I already know how to fit each regression separately, zero-truncated and zero-inflated. I want to know how to combine the two conditions into one model.
Thanks for your help.
For zero-inflated or zero-hurdle models, the standard approach is to use the pscl package. I also wrote a package that fits this kind of model here, but it is not yet mature or fully tested. Unless you have voluminous data, I still recommend pscl, which is more flexible, robust and better documented.
For zero-truncated models, you can have a look at the VGAM::vglm function. You might find useful information here.
Note that the two approaches do not make the same distributional assumption, so they do not need the same estimation data. Given the description of your dataset, I think you are looking for a zero-truncated model (since you do not observe zeros). With zero-inflated models, you decompose the observed pattern into zeros generated by a selection model and the remaining values generated by a count data model. That does not look like a pattern consistent with your dataset.

R - Testing for homo/heteroscedasticity and collinearity in a multivariate regression model

I'm trying to optimize a multivariate linear regression model lmMod=lm(depend_var~var1+var2+var3+var4....,data=df) and I'm presently working on the premises of the model: the constant variance of residuals and the absence of auto-correlation. For this I'm using:
Breusch-Pagan test for homo/heteroscedasticity: lmtest::bptest(lmMod) 
Durbin Watson test for auto-correlation: durbinWatsonTest(lmMod)
I found examples which are testing either one independent variable at a time:
example for Breusch-Pagan test – one independent variable:
https://datascienceplus.com/how-to-detect-heteroscedasticity-and-rectify-it/
example for Durbin Watson test - one independent variable:
http://math.furman.edu/~dcs/courses/math47/R/library/lmtest/html/dwtest.html
or the whole model with several independent variables at a time:
example for Durbin Watson test – multiple independent variables:
https://www.rdocumentation.org/packages/car/versions/2.1-6/topics/durbinWatsonTest
Here are the questions:
1. Can durbinWatsonTest() and bptest() be fed with a whole multivariate model?
2. If the answer to 1 is yes, how is it then possible to determine which variable is causing heteroscedasticity or auto-correlation in the model in order to fix it, given that each of those tests gives only one p-value for the entire multivariate model?
3. If the answer to 1 is no, the test should then be performed with one independent variable at a time. But homoscedasticity can only be tested AFTER a particular regression has been modelled. Hence the pattern of homo/heteroscedasticity in a univariate regression model lmMod_1=lm(depend_var~var1, data=df) will be different from the pattern in a multivariate regression model lmMod_2=lm(depend_var~var1+var2+var3+var4....,data=df)
Thank you very much in advance for your help!
Let me try to offer some initial help.
The answer to the first question: yes, you can use the Breusch-Pagan test and the Durbin Watson test for multivariate models. (However, I have always used dwtest() instead of durbinWatsonTest().)
Also note that dwtest() checks only first-order autocorrelation. Unfortunately, I do not know how to find out which variable is causing the heteroscedasticity or auto-correlation. However, if you encounter these problems, one possible solution is to use robust standard errors: Newey-West, with coeftest(regression model, vcov = NeweyWest), for autocorrelation, or coeftest(regression model, vcov = vcovHC) for heteroscedasticity. Both are available after loading the AER package (which loads lmtest and sandwich, where these functions are defined).

How are the predictions obtained?

I have been unable to find information on how exactly predict.cv.glmnet works.
Specifically, when a prediction is being made are the predictions based on a fit that uses all the available data? Or are predictions based on a fit where some data has been discarded as part of the cross validation procedure when running cv.glmnet?
I would strongly assume the former but was unable to find a sentence in the documentation that clearly states that after a cross validation is finished, the model is fitted with all available data for a new prediction.
If I have overlooked a statement along those lines, I would also appreciate a hint on where to find this.
Thanks!
In the documentation for predict.cv.glmnet :
"This function makes predictions from a cross-validated glmnet model, using the stored "glmnet.fit" object ... "
In the documentation for cv.glmnet (under value):
"glmnet.fit a fitted glmnet object for the full data."

R: Estimating model variance

In the demo for ROC, there are models that when plotted have a spread, like hiv.svm$predictions, which contains 10 estimates of the response. Can someone remind me how to calculate N estimates of a model? I'm using RPART and a neural network to estimate a single output (true/false). How can I run 10 different samplings of the training data to get 10 different model responses to the input? I think the technique is called bootstrapping, but I don't know how to implement it.
I need to do this outside of caret, because when I use caret I keep getting the message "Error in tab[1:m, 1:m] : subscript out of bounds". Is there a "simple" bootstrap function?
Obviously the answer is too late, but you could have used caret just by simply renaming the levels of your factor, because caret doesn't work if your binary response is of type logical. For example:
factor(responseWithTrueFalseLevel,
levels=c(TRUE,FALSE),
labels=c("myTrueLevel","myFalseLevel"))
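For the "simple" bootstrap asked about in the question, a minimal base R sketch could look like the following. It assumes simulated data and uses a logistic glm() as a stand-in for rpart or a neural network; the names d, new_data, n_boot and preds are made up for the example.

```r
# Bootstrap sketch in base R: refit a model on resampled training data
# and collect one set of predictions per resample.
set.seed(42)
n <- 200
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- rbinom(n, 1, plogis(d$x1 - 0.5 * d$x2))   # simulated true/false response

new_data <- data.frame(x1 = c(-1, 0, 1), x2 = c(0, 0, 0))

n_boot <- 10
preds <- sapply(seq_len(n_boot), function(b) {
  idx <- sample(nrow(d), replace = TRUE)          # resample rows with replacement
  fit <- glm(y ~ x1 + x2, family = binomial, data = d[idx, ])
  predict(fit, newdata = new_data, type = "response")
})

# preds has one row per new observation and one column per bootstrap fit.
pred_sd <- apply(preds, 1, sd)                    # spread of the estimates per observation
```

Swapping in another learner should only require changing the two lines that fit the model and call predict(); the resampling loop stays the same.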

Resources