R: Estimating model variance - r

In the ROC demo, some models show a spread when plotted, such as hiv.svm$predictions, which contains 10 estimates of the response. Can someone remind me how to calculate N estimates from a model? I'm using rpart and a neural network to estimate a single output (true/false). How can I draw 10 different samples of the training data to get 10 different model responses to the same input? I think the technique is called bootstrapping, but I don't know how to implement it.
I need to do this outside of caret, because when I use caret I keep getting the message "Error in tab[1:m, 1:m] : subscript out of bounds". Is there a "simple" bootstrap function?

Obviously this answer comes too late, but you could have used caret simply by renaming the levels of your factor, because caret doesn't work if your binary response is of type logical. For example:
factor(responseWithTrueFalseLevel,
levels=c(TRUE,FALSE),
labels=c("myTrueLevel","myFalseLevel"))
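For the bootstrap part of the question, a minimal sketch outside of caret might look like the following. It assumes the rpart package is installed; the toy data, the column names (x1, x2, y) and the choice of 10 resamples are all made up for illustration. Each resample draws training rows with replacement, refits the tree, and predicts on the same held-out rows, so the spread across columns gives a rough idea of model variance.

```r
library(rpart)  # assumed to be installed (it ships with most R distributions)

set.seed(42)

# Hypothetical toy data: a binary outcome stored as a factor with non-logical labels
n <- 200
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- factor(ifelse(dat$x1 + rnorm(n, sd = 0.5) > 0, "yes", "no"),
                levels = c("no", "yes"))
train <- dat[1:150, ]
test  <- dat[151:200, ]

n_boot <- 10

# One column of predicted P(y == "yes") per bootstrap resample
boot_preds <- sapply(seq_len(n_boot), function(i) {
  idx <- sample(nrow(train), replace = TRUE)          # resample rows with replacement
  fit <- rpart(y ~ x1 + x2, data = train[idx, ], method = "class")
  predict(fit, newdata = test, type = "prob")[, "yes"]
})

dim(boot_preds)                        # rows = test cases, columns = resamples
pred_var <- apply(boot_preds, 1, var)  # per-case variance across the resamples
```

The same loop should work for a neural network by swapping the fitting and predict calls; only the resampling step is specific to the bootstrap.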

Related

glm with gamma family with NA / 0 Values in r

I would like to use a generalized linear mixed effect model. My data follows a gamma distribution but contains NA and 0 values. However, the gamma family does not allow me to compute these models if I have 0 values. Does anyone know of a way to go around this problem?
I heard that the glmmTMB package allows the use of gamma distributions with negative values, but I work on a Mac, and it seems that I cannot install this package.
When I try, I get an error code stating "clang: error: unsupported option '-fopenmp'".
It would be great if any of you had an idea.
The Gamma distribution has no support on the non-positive real numbers. Accordingly, you are basically asking it to model data which it could never produce and therefore the software throws an error.
Similarly, the missing data cannot be modelled because the model you specify does not itself model or marginalize out the missing data. You will need to either replace the missing values with some number (impute missing values) deterministically/probabilistically or drop the observations with missing values.
In short, you will need to employ an alternative model, such as a zero-inflated gamma model or a gamma hurdle model. See here for an example. There is no single "correct" alternative: each one is a model, and you will need to weigh their relative strengths and weaknesses (assumptions, etc.).
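As a rough illustration of the hurdle idea, here is a sketch using only base R's glm(). It deliberately leaves out random effects, so it is not the mixed model asked about; the simulated data and the variable names (x, y, d) are assumptions for the example. Missing responses are dropped, one model handles zero versus positive, and a gamma model is fitted to the positive values only.

```r
set.seed(1)

# Hypothetical simulated data with exact zeros and some missing responses
n <- 500
x <- rnorm(n)
is_pos <- rbinom(n, 1, plogis(0.5 + x))
y <- ifelse(is_pos == 1,
            rgamma(n, shape = 2, rate = 2 / exp(1 + 0.3 * x)),
            0)
y[sample(n, 20)] <- NA
d <- data.frame(x = x, y = y)

# Drop observations with a missing response (imputation is the alternative)
d <- d[!is.na(d$y), ]

# Part 1: probability that the response is positive
part1 <- glm(I(y > 0) ~ x, data = d, family = binomial)

# Part 2: gamma GLM fitted to the strictly positive responses only
part2 <- glm(y ~ x, data = subset(d, y > 0), family = Gamma(link = "log"))

# Expected value combines both parts: P(y > 0) * E[y | y > 0]
ey <- predict(part1, newdata = d, type = "response") *
      predict(part2, newdata = d, type = "response")
```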

Not able to predict with gamma distributed GLM model in H2O

I have just computed a gamma GLM with the h2o package in R.
When I'm trying to predict on the test set I get this error:
Illegal argument(s) for GLM model: GLM_model_R_1644680218230_95. Details: ERRR on field: _family: Response value for gamma distribution must be greater than 0.
I understand that a gamma model cannot be trained on data with a zero response, but one should be able to predict on data where the true value is 0 (this comes up a lot in actuarial work).
Does anyone know an h2o solution? I know I can simply fit the model with glm() or something similar, but I'm relying on mean-encoded categorical variables (which are really convenient in h2o).
Thanks!
Based on the description in the comments, this seems like a bug, and a fix will be needed.

Multiple Regression in R

I have been trying to do a simple regression in R using the following syntax:
Unfortunately, R keeps giving me warnings and the summary is not possible:
I can't find out the problem. The data includes more than just the 11 predictors mentioned in the syntax.
Thank you!
Melanie
This answer partially consists of comments in the original question.
That is not an error. It's a warning message, which is different from an error. It's generated because you are using lm() with a factor-type response variable. Operations like + and - do not work on factors, hence the message "-" not meaningful for factors.
If the response is truly a categorical variable, lm() might not be the right way to go to model it. Alternatives in this situation:
glm(): Binary logistic regression, Poisson regression, negative binomial regression
MASS::polr(): Ordinal logistic regression
nnet::multinom(): Multinomial logistic regression
and many others.
Please research the corresponding methods before actually using them.
If the response is actually NOT a categorical variable, you will want to look further why it is coded as a factor, and turn it to numeric first.
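If the response turns out to be a number that was read in as a factor, the conversion needs one extra step, because as.numeric() on a factor returns the internal level codes rather than the values. A small sketch with made-up data (the names f, d, x and y are only for illustration):

```r
# A numeric column that was read in as a factor (hypothetical toy data)
f <- factor(c("10.5", "3.2", "7", "3.2"))

codes  <- as.numeric(f)                # internal level codes, NOT the values
values <- as.numeric(as.character(f))  # the actual numbers

# Convert the response first, then lm() behaves as expected
d <- data.frame(y = f, x = c(1, 2, 3, 4))
d$y <- as.numeric(as.character(d$y))
fit <- lm(y ~ x, data = d)
```

Checking str() on the data frame before modelling is a quick way to spot columns that were imported as factors.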

random forest package in R

I use the random forest package in R for regression. It gives me two kinds of information: Mean of squared residuals and % Var explained. But I want to calculate the RMSE and R^2 of the training and test sets. Can anyone help me find this information?
Sorry this is not a specific answer, but I do not have enough cred to leave a comment.
It is tough to say how you may get at what you want without a reproducible example. However, if you used the xtest= and ytest= arguments in the call to randomForest (assuming you are using the "randomForest" package), then what you are looking for should be a part of the resulting randomForest object. What you want to look in is the test part of the resulting random forest list.
An attempted example:
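rf.results <- randomForest(x = xtrain, y = ytrain, xtest = xtest, ytest = ytest) # xtrain, ytrain, xtest, ytest stand in for your own data
rf.results$test$mse # test-set MSE, one value per number of trees grown
sqrt(tail(rf.results$test$mse, 1)) # RMSE of the full forest on the test set
rf.results$test$rsq # pseudo-R2 on the test set, one value per number of trees grown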
If you have the random forest package loaded you can validate this information as well as do some exploration yourself with ?randomForest. The "Value" section of the documentation details the object that results from a call to randomForest and where you can find various performance metrics.
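If you prefer to compute the metrics yourself, RMSE and R^2 only need the observed and predicted values. The sketch below uses lm() on mtcars purely as a stand-in model so that it runs without extra packages; the helper names rmse and r2 and the train/test split are assumptions for the example. With a randomForest fit, the predictions would come from predict(fit, newdata) in the same way.

```r
# Helper functions (names are made up for this sketch)
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))
r2   <- function(obs, pred) 1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)

set.seed(7)
idx   <- sample(nrow(mtcars), 22)   # hypothetical train/test split
train <- mtcars[idx, ]
test  <- mtcars[-idx, ]

# Stand-in model; replace with your randomForest() fit
fit <- lm(mpg ~ wt + hp, data = train)

train_pred <- predict(fit, newdata = train)
test_pred  <- predict(fit, newdata = test)

metrics <- data.frame(
  set  = c("train", "test"),
  rmse = c(rmse(train$mpg, train_pred), rmse(test$mpg, test_pred)),
  r2   = c(r2(train$mpg, train_pred),   r2(test$mpg, test_pred))
)
metrics
```

Note that calling predict() on a randomForest object without newdata returns out-of-bag predictions, which are not the same as predictions on the training rows.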

R - How to get one "summary" prediction map instead for 5 when using 5-fold cross-validation in maxent model?

I hope I have come to the right forum. I'm an ecologist making species distribution models using the maxent (version 3.3.3, http://www.cs.princeton.edu/~schapire/maxent/) function in R, through the dismo package. I have used the argument "replicates = 5" which tells maxent to do a 5-fold cross-validation. When running maxent from the maxent.jar file directly (the maxent software), an html file with statistics will be made, including the prediction maps. In R, an html file is also made, but the prediction maps have to be extracted afterwards, using the function "predict" in the dismo package in r. When I do this, I get 5 maps, due to the 5-fold cross-validation setting. However, (and this is the problem) I want only one output map, one "summary" prediction map. I assume this is possible, although I don't know how maxent computes it. The maxent tutorial (see link above) says that:
"...you may want to avoid eating up disk space by turning off the “write output grids” option, which will suppress writing of output grids for the replicate runs, so that you only get the summary statistics grids (avg, stderr etc.)."
A list of arguments that can be put into R is found in this forum https://groups.google.com/forum/#!topic/maxent/yRBlvZ1_9rQ.
I have tried to use the argument "outputgrids=FALSE" both in the maxent function itself, and in the predict function, but it doesn't work. I still get 5 maps, even though I don't get any errors in R.
So my question is: How do I get one "summary" prediction map instead of the five prediction maps that result from the cross-validation?
I hope someone can help me with this, I am really stuck and haven't found any answers anywhere on the internet. Not even a discussion about this. Hope my question is clear. This is the R-script that I use:
model1<-maxent(x=predvars, p=presence_points, a=target_group_absence, path="//home//...//model1", args=c("replicates=5", "outputgrids=FALSE"))
model1map<-predict(model1, predvars, filename="//home//...//model1map.tif", outputgrids=FALSE)
Best regards,
Kristin
Sorry to be the bearer of bad news, but based on the source code, it looks like Dismo's predict function does not have the ability to generate a summary map.
Nitty-gritty details for those who care: When you call maxent with replicates set to something greater than 1, the maxent function returns a MaxEntReplicates object, rather than a normal MaxEnt object. When predict receives a MaxEntReplicates object, it just iterates through all of the models that it contains and calls predict on them individually.
So, what next? Fortunately, all is not lost! The reason that Dismo doesn't have this functionality is that for most kinds of model-building, there isn't actually a valid way to average parameters across your cross-validation models. I don't want to go so far as to say that that's definitely the case for MaxEnt specifically, but I suspect it is. As such, cross-validation is usually used more as a way of checking that your model building methodology works for your data than as a way of building your model directly (see this question for further discussion of that point). After verifying via cross-validation that models built using a given procedure seem to be accurate for the phenomenon you're modelling, it's customary to build a final model using all of your data. In theory this new model should only be better than models trained on a subset of your data.
So basically, assuming your cross-validated models look reasonable, you can run MaxEnt again with only one replicate. Your final result will be a model accuracy estimate based on the cross-validation and a map based on the second run with all of your data lumped together. Depending on what exactly your question is, there might be other useful summary statistics from the cross-validation that you want to use, but those are all things you've already seen in the html output.
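The general workflow described here (cross-validate to check the procedure, then refit once on all of the data) can be sketched in base R. This is not MaxEnt: a logistic regression on simulated data stands in for the model, and all names (d, fold, cv_acc, final_fit) are made up for the illustration.

```r
set.seed(123)

# Hypothetical simulated presence/absence data with one predictor
n <- 300
d <- data.frame(x = rnorm(n))
d$y <- rbinom(n, 1, plogis(1.5 * d$x))

# Assign every row to one of k folds
k <- 5
fold <- sample(rep(seq_len(k), length.out = n))

# Step 1: cross-validation, used only to estimate accuracy
cv_acc <- sapply(seq_len(k), function(i) {
  fit <- glm(y ~ x, data = d[fold != i, ], family = binomial)
  p   <- predict(fit, newdata = d[fold == i, ], type = "response")
  mean((p > 0.5) == (d$y[fold == i] == 1))
})
mean(cv_acc)  # cross-validated accuracy estimate

# Step 2: the final model, fitted once on all of the data
final_fit <- glm(y ~ x, data = d, family = binomial)
```

The single map would then come from predicting with the final model, while the accuracy figure reported alongside it comes from the cross-validation step.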
I may have found this a couple of years later. But you could do something like this:
xm <- maxent(predictors, pres_train) # the maxent model for one partition
# xm2 is fitted the same way on the next partition's training data
px <- predict(predictors, xm, ext=ext, progress='') # prediction from model 1
px2 <- predict(predictors, xm2, ext=ext, progress='') # prediction from model 2
models <- stack(px, px2) # stack the predictions from all the models
final_map <- mean(models) # cell-wise mean of all the predictions
plot(final_map) # plot the averaged map
xm, xm2, ... would be the maxent models for each partition in the cross-validation, and px, px2, ... would be the predicted maps.
