Prediction Intervals for R tidymodels Stacked model from stacks()

Is it possible to calculate prediction intervals from a tidymodels stacked model?
Working through the example from the stacks() package here yields the stacked frog model (which can be downloaded here for reprex) and the testing data:
data("tree_frogs")
tree_frogs <- tree_frogs %>%
filter(!is.na(latency)) %>%
select(-c(clutch, hatched))
set.seed(1)
tree_frogs_split <- initial_split(tree_frogs)
tree_frogs_train <- training(tree_frogs_split)
tree_frogs_test <- testing(tree_frogs_split)
I tried to run something like this:
pi <- predict(tree_frogs_model_st, tree_frogs_test, type = "pred_int")
but this gives an error:
Error in UseMethod("stack_predict") : no applicable method for 'stack_predict' applied to an object of class "NULL"
Reading the documentation of stacks(), I also tried passing "pred_int" in the opts list:
pi <- predict(tree_frogs_model_st, tree_frogs_test, opts = list(type = "pred_int"))
but this just gives: opts is only used with type = raw and was ignored.
For reference, I am trying to do something similar to what is done in Ch. 19 of the Tidy Modeling with R book:
lm_fit <- fit(lm_wflow, data = Chicago_train)
predict(lm_fit, Chicago_test, type = "pred_int")
which seems to work fine for a single model fit like lm_fit, but apparently not for a stacked model?
Am I missing something? Is it not possible to calculate prediction intervals for stacked models for some reason?

This is very difficult to do.
Even if glmnet produced a prediction interval, it would be a significant underestimate since it doesn’t know anything about the error in each of the ensemble members.
We would have to get the standard error of prediction from all of the models to compute it for the stacking model. A lot of these models don’t/can’t generate that standard error.
The alternative is to use bootstrapping to get the interval, but you would have to bootstrap each model a large number of times to get the overall prediction interval.
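If you did want to go the bootstrap route, a rough sketch is below. It assumes a hypothetical fit_ensemble() helper that re-runs the whole stacking pipeline (add_candidates(), blend_predictions(), fit_members()) on each resample, and the quantiles of the resampled predictions only approximate an interval around the ensemble prediction rather than a full prediction interval:
library(rsample)
library(purrr)
library(dplyr)

set.seed(1)
boots <- bootstraps(tree_frogs_train, times = 200)

boot_preds <- map_dfr(boots$splits, function(split) {
  # fit_ensemble() is hypothetical: it should refit the full stack on analysis(split)
  fit_b <- fit_ensemble(analysis(split))
  tibble(
    .row  = seq_len(nrow(tree_frogs_test)),
    .pred = predict(fit_b, tree_frogs_test)$.pred
  )
})

# 90% interval per test observation from the bootstrap distribution of predictions
boot_preds %>%
  group_by(.row) %>%
  summarise(
    .pred_lower = quantile(.pred, 0.05),
    .pred_upper = quantile(.pred, 0.95)
  )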

Related

Why does ranger predict give different numbers when re-applied to training data?

I am very new to machine learning. I am trying to explore fitting random forests with the ranger library in R. My dependent variable is continuous, so it would be a regression tree (and not just classification). Upon trying out the functions, I have noticed that there seems to be a discrepancy between the predictions stored in the ranger fit and those returned by predict(). The following lines result in different predictions in results and results_alternative:
library(ranger)
rf_reg <- ranger(formula = y ~ ., data = training_df)
results <- rf_reg$predictions
results_alternative <- predict(rf_reg, data = training_df)$predictions
Could anybody please explain why there is a discrepancy and what is causing it? Which one is correct? I have tried it with classification on iris data and that seemed to give the same results. Many thanks!

GAM residuals missing in plot

I am applying a GAM model to my data: cell abundance over time.
The model works just fine (although I am aware of a pattern in my residuals, but that is a different issue, not relevant here).
It just fails to display the partial residuals in the final plot, although I set residuals = TRUE. Here is my output:
https://i.stack.imgur.com/C1MlY.png
Also, I used the mgcv package.
Previously this code worked as I wanted, but on different data. Any ideas on why it is not working are welcome!
library(mgcv)
GAM_EA <- mgcv::gam(EUB_FISH ~ s(Day, by = Heatwave), data = HnH, method = "REML")
gam.check(GAM_EA)        # Checking the model
mgcv::anova.gam(GAM_EA)  # Retrieving the statistical results. See ?anova.gam
summary.gam(GAM_EA)
plot(GAM_EA, shift = coef(GAM_EA)[1], residuals = TRUE)
See the argument by.resids in ?plot.gam. The way these are used in plot.gam would be meaningless for factor by terms unless you were to subset the partial residuals and plot only the residuals for observations in the specific level of the by factor.
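As a rough illustration of that subsetting (this is one manual way to do it, not something plot.gam does for you), using the object names from the question:
library(mgcv)

# Partial residuals: each smooth's fitted contribution plus the working residuals
# (assumes no rows of HnH were dropped for missingness when the model was fit)
tm <- predict(GAM_EA, type = "terms")
pr <- tm + residuals(GAM_EA, type = "working")

# Keep only the observations (and the matching term column) for one Heatwave level
lev  <- levels(HnH$Heatwave)[1]
col  <- grep(paste0("Heatwave", lev), colnames(tm), value = TRUE)[1]
keep <- HnH$Heatwave == lev

plot(HnH$Day[keep], pr[keep, col],
     xlab = "Day", ylab = paste("Partial residuals:", lev))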

LASSO analysis (glmnet package). Can I loop the analysis and the results extraction?

I'm using the glmnet package. I need to run several LASSO analyses for the calibration of a large number of variables (% reflectance for each wavelength throughout the spectrum) against one dependent variable. I have a couple of doubts about the procedure and the results that I wish to resolve. I show my provisional code below:
First, I split my data into training (70% of n) and testing sets.
smp_size <- floor(0.70 * nrow(mydata))
set.seed(123)
train_ind <- sample(seq_len(nrow(mydata)), size = smp_size)
train <- mydata[train_ind, ]
test <- mydata[-train_ind, ]
Then I separate the target trait (y) and the independent variables (x) for each set as follows:
vars.train <- train[3:2153]
vars.test <- test[3:2153]
x.train <- data.matrix(vars.train)
x.test <- data.matrix(vars.test)
y.train <- train$X1
y.test <- test$X1
Afterwards, I run a cross-validated LASSO model on the training set and extract and write out the non-zero coefficients for lambda.min. This is because one of my concerns here is to note which variables (wavebands of the reflectance spectrum) are selected by the model.
install.packages("glmnet")
library(glmnet)
cv.lasso.1 <- cv.glmnet(y=y.train, x= x.train, family="gaussian", nfolds =
5, standardize=TRUE, alpha=1)
coef(cv.lasso.1,s=cv.lasso.1$lambda.min) # Using lambda min.
(cv.lasso.1)
install.packages("broom")
library(broom)
c <- tidy(coef(cv.lasso.1, s="lambda.min"))
write.csv(c, file = "results")
Finally, I use predict() to apply the model obtained previously (cv.lasso.1) to the variables of the testing set (x.test) in order to get predictions, and I run the correlation between the predicted and the actual values of y for the testing set.
predict.1.2 <- predict(cv.lasso.1, newx = x.test, type = "response",
                       s = "lambda.min")
cor.test(x = c(predict.1.2), y = c(y.test))
This is simplified code and it has worked with no problems so far. The point is that I would like to loop the whole thing (one hundred repetitions) and, for each repetition, collect the non-zero coefficients of the cross-validated model as well as the correlation coefficient between the predicted and actual values for the testing set. I've tried, but I couldn't get any clear results. Can someone give me a hint?
thanks!
In general, running repeated analyses of the same type over and over on the same data can be tricky, and in your case it may not be necessary in the way you have outlined it.
If you are trying to find the most predictive variables, you can use PCA (Principal Component Analysis) to select the components that capture the most variation within and between variables. Note that PCA does not consider your outcome at all, so with a poor model design it will pick the least correlated data in your repository even though it may not be predictive; you should be very aware of all the variables in the set. This is a way of reducing the dimensionality of your data before a linear or logistic regression of some sort.
You can read about it here
yourPCA <- prcomp(yourData,
                  center = TRUE,
                  scale. = TRUE)
Scaling and centering are essential to making these models work right: they put your variables on a common footing by setting means to 0 and standard deviations to 1. Unless you know what you are doing, I would leave those settings as they are. If you have skewed or kurtotic data, you might need to address that prior to PCA. Run this ONLY on your predictors; keep your target/outcome variable out of the data set.
If you have a classification problem you are looking to solve with a lot of data, try LDA (Linear Discriminant Analysis), which reduces variables by optimizing the variance of each predictor with respect to the OUTCOME variable; it specifically considers your outcome.
require(MASS)
yourLDA <- lda(formula = outcome ~ .,
               data = yourdata)
You can also set the prior probabilities in LDA if you know the global probability for each class, or you can leave them out and lda() will use the class proportions observed in the training set. You can read about that here:
LDA from MASS package
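For example, with two classes whose global probabilities you believe to be roughly 0.7 and 0.3 (these numbers are purely illustrative), the prior can be supplied directly:
library(MASS)

# prior must sum to 1 and follow the order of the factor levels of the outcome
yourLDA <- lda(outcome ~ ., data = yourdata, prior = c(0.7, 0.3))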
So this gets you headed in the right direction for reducing the complexity of your data via feature selection in a computationally solid way. As for building the most robust model via repeated model building, that is known as cross-validation. There is a cv.glm function in the boot package which can help you take care of this safely.
You can use the following as a rough guide:
require(boot)
yourGLM   <- glm(outcome ~ ., data = yourData, family = "gaussian")
yourCVGLM <- cv.glm(data = yourData, glmfit = yourGLM, K = 100)
Here K = 100 specifies 100-fold cross-validation: you are building 100 models from randomly sampled subsets of your current data OBSERVATIONS, not your variables.
So the process is two-fold: reduce variables using one of the two methods above, then use cross-validation to build a single model from repeated trials without cumbersome loops!
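As a rough sketch of that two-step process (using the hypothetical yourData object with its outcome in a column named outcome; the number of components kept is arbitrary):
library(boot)

# Step 1: reduce dimensionality with PCA on the predictors only
predictors <- yourData[, setdiff(names(yourData), "outcome")]
pcs        <- prcomp(predictors, center = TRUE, scale. = TRUE)

reduced         <- as.data.frame(pcs$x[, 1:10])   # keep the first 10 components
reduced$outcome <- yourData$outcome

# Step 2: fit a GLM on the components and cross-validate it
fit   <- glm(outcome ~ ., data = reduced, family = "gaussian")
cvfit <- cv.glm(data = reduced, glmfit = fit, K = 100)
cvfit$delta   # cross-validated estimate of prediction error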
Read about cv.glm here
Try starting on page 41, but look over the whole thing. The repeated sampling you are after is called bootstrapping, and it is powerful and available for many different model types.
Not as much code as you might have hoped for, but it points you in a decent direction.

Random forest evaluation in R

I am a newbie in R and I am trying to do my best to create my first model. I am working on a 2-class random forest project and so far I have programmed the model as follows:
library(randomForest)
set.seed(2015)
randomforest <- randomForest(as.factor(goodkit) ~ ., data=training1, importance=TRUE,ntree=2000)
varImpPlot(randomforest)
prediction <- predict(randomforest, test,type='prob')
print(prediction)
I am not sure why I don't get the overall prediction for my model. I must be missing something in my code. I get the OOB error and the prediction per case in the test set, but not the overall prediction of the model.
library(pROC)
auc <-roc(test$goodkit,prediction)
print(auc)
This doesn't work at all.
I have been through the pROC manual but I cannot get to understand everything. It would be very helpful if anyone can help with the code or post a link to a good practical sample.
Using the ROCR package, the following code should work for calculating the AUC:
library(ROCR)
predictedROC <- prediction(prediction[,2], as.factor(test$goodkit))
as.numeric(performance(predictedROC, "auc")@y.values)
Your problem is that predict() on a randomForest object with type = 'prob' returns two columns of predictions: each column contains the probability of belonging to one of the two classes (for binary prediction).
You have to decide which of these columns to use to build the ROC curve. Fortunately, for binary classification they carry the same information (one is just the reverse of the other):
auc1 <-roc(test$goodkit, prediction[,1])
print(auc1)
auc2 <-roc(test$goodkit, prediction[,2])
print(auc2)

Trouble getting se.fit and confidence intervals using clmm2 from ordinal package

I'm using the clmm2 function from the ordinal package in R in order to fit cumulative link mixed models to my data. It worked fine until I tried to get predicted probabilities. I can't get either SEs or confidence intervals by specifying se.fit = TRUE and interval = TRUE. It looks like this:
mod1<-clmm2(response~X0+X1+X2+X3+X4+X5+X7+X0*X2*X3+X2*X3*X4+X0:X4, random=X6,
data=df,link ="logistic", threshold ="flexible",
Hess=TRUE, nAGQ=7)
As you can see, there are a bunch of interactions in there (all important). I've tried to create a dummy dataset to make my problem reproducible, but clmm2 can't achieve convergence with a simpler dataset. I took the wine dataset included in the ordinal package and made some changes to the formula to mimic my own (I don't think it makes much sense, though):
library(ordinal)
data(wine)
fm1 <- clmm2(rating ~ temp + contact + bottle + temp:contact:bottle + temp:contact +
               temp:bottle + bottle:contact,
             random = judge, data = wine, link = "logistic", threshold = "flexible",
             Hess = TRUE, nAGQ = 7)
head(do.call("cbind", predict(fm1, se.fit=TRUE, interval=TRUE)))
And then I get this error:
Error in head(do.call("cbind", predict(fm1, se.fit = TRUE, interval = TRUE))) :
  error in evaluating the argument 'x' in selecting a method for function 'head': Error in do.call("cbind", predict(fm1, se.fit = TRUE, interval = TRUE)) : second argument must be a list
My guess is that predict doesn't even compute SEs and CIs in a case like this. Does anybody know why? Is there any way to get those values?
Thanks a lot!
The predict method for clmm2 objects does not offer standard errors. See its help page. This is in keeping with the usual practice of R package authors when dealing with mixed effects models. Because se.fit and interval are silently ignored, predict() returns a plain numeric vector rather than a list, which is why do.call("cbind", ...) complains that its second argument must be a list.
