randomForest in R: Is there a possibility of calculating casewise confidence intervals?

R package randomForest reports mean squared errors for each tree in the forest. I need, however, a measure of confidence for each case in the data. Since randomForest calculates the casewise predictions by averaging the predictions of the single trees, I guess that it should also be possible to calculate a casewise standard error and thus a confidence interval. Can this be done using the output randomForest object (if so: how?) or do I have to dig into the source code?

No need to dig into the source code. You only need to read the documentation. ?predict.randomForest states that one of its arguments is called predict.all:
predict.all Should the predictions of all trees be kept?
So setting that to TRUE will keep a prediction for each case, for each tree, which you can then use to calculate standard error for each case.
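A minimal sketch of that idea, assuming a regression forest fitted on the built-in mtcars data (the data set, the seed, and the names fit, pred, tree_sd and tree_band are only illustrative). Note that the spread of the per-tree predictions describes how much the trees disagree for a case; it is not a formal standard error of the forest prediction, so treat the resulting band as a rough indication rather than a true confidence interval.

```r
library(randomForest)

set.seed(42)
# mtcars is only a stand-in data set; fit a regression forest on it
fit <- randomForest(mpg ~ ., data = mtcars, ntree = 500)

# predict.all = TRUE returns a list with
#   $aggregate  - the forest prediction for each case
#   $individual - a matrix with one row per case and one column per tree
pred <- predict(fit, newdata = mtcars, predict.all = TRUE)

# Tree-to-tree standard deviation of the predictions for each case
tree_sd <- apply(pred$individual, 1, sd)

# Crude casewise band: empirical 2.5% and 97.5% quantiles over the trees
tree_band <- t(apply(pred$individual, 1, quantile, probs = c(0.025, 0.975)))

head(data.frame(prediction = pred$aggregate, tree_sd = tree_sd, tree_band))
```

Predicting on the training data, as done here for brevity, will understate the variability; in practice you would pass genuinely new cases as newdata.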
I have recently been made aware of this paper by Stefan Wager, Trevor Hastie and Brad Efron which investigates more rigorously the idea of standard errors for the predictions generated by random forests (and other bagged predictors).

Related

Assessing LDA predictions with textmineR in R - Calculating perplexity?

I am working on an LDA model with textmineR, have calculated coherence and log-likelihood measures, and have optimized my model.
As a last step I would like to see how well the model predicts topics on unseen data. Thus, I am using the predict() function from the textmineR package in combination with Gibbs sampling on my test-set sample.
This results in predicted "Theta" values for each document in my test-set sample.
I have read in another post that perplexity calculations are not available with the textmineR package (see this post: How do i measure perplexity scores on a LDA model made with the textmineR package in R?), so I am now wondering what the purpose of the prediction function is. Especially with a large dataset of over 100,000 documents, it is hard to assess visually whether the prediction has performed well.
I do not want to use perplexity for model selection (I am using coherence/log-likelihood instead), but as far as I understand, perplexity would help me to understand how good the prediction is and how "surprised" the model is by new, previously unseen data.
Since this does not seem to be available for textmineR, I am not sure how to assess the model prediction. Is there anything else that I could use to measure the prediction quality of my textmineR model?
Thank you!

ROCR Package - Classification algo other than logistic regression

I was referring to this link, which explains the usage of the ROCR package for plotting ROC curves and other related accuracy metrics. The author mentions logistic regression at the beginning, but do these functions (prediction and performance from ROCR) apply to other classification algorithms such as SVMs, decision trees, etc.?
I tried using the prediction() function with the results of my SVM model, but it threw a format error despite the arguments being of the same type and dimensions. I am also not sure whether ROC curves for these algorithms would have a shape like the one we generally see with logistic regression (like this).
The prediction and performance functions are model-agnostic in the sense that they only require the user to input actual and predicted values from a binary classifier. (More precisely, this is what prediction requires, and performance takes as input an object output by prediction). Therefore, you should be able to use both functions for any classification algorithm that can output this information - including both logistic regression and SVM.
Having said this, model predictions can come in different formats (e.g., propensity scores vs. classes; results stored as numeric vs. factor), and you'll need to ensure that what you feed into prediction is appropriate. The requirements can be quite specific: for example, while the predictions argument can represent continuous or class information, it can’t be a factor. See the “Details” section of the function’s help file for more information.
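As a rough illustration, the sketch below feeds simulated numeric scores into ROCR; the vectors labels and scores are made up and simply stand in for the true classes and the continuous output of any classifier (for instance SVM decision values). If your model returns a factor of predicted classes, you would need to convert it to numeric or, better, extract continuous scores first.

```r
library(ROCR)

set.seed(1)
# Simulated stand-ins: true 0/1 classes and noisy continuous classifier scores
labels <- rbinom(100, 1, 0.5)
scores <- labels + rnorm(100)

# prediction() only needs numeric predictions and the true labels
pred <- prediction(scores, labels)

# performance() works on the prediction object, whatever model produced the scores
roc <- performance(pred, measure = "tpr", x.measure = "fpr")
auc <- performance(pred, measure = "auc")@y.values[[1]]

plot(roc)
auc
```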

gbm::interact.gbm vs. dismo::gbm.interactions

Background
The reference manual for the gbm package states that the interact.gbm function computes Friedman's H-statistic to assess the strength of variable interactions. The H-statistic is on a scale of 0 to 1.
The reference manual for the dismo package does not reference any literature for how the gbm.interactions function detects and models interactions. Instead it gives a list of general procedures used to detect and model interactions. The dismo vignette "Boosted Regression Trees for ecological modeling" states that the dismo package extends functions in the gbm package.
Question
How does dismo::gbm.interactions actually detect and model interactions?
Why
I am asking this question because gbm.interactions in the dismo package yields results >1, which the gbm package reference manual says is not possible.
I checked the tar.gz for each of the packages to see if the source code was similar. It is different enough that I cannot determine if these two packages are using the same method to detect and model interactions.
To summarize, the difference between the two approaches boils down to how the "partial dependence function" of the two predictors is estimated.
The dismo package is based on code originally given in Elith et al., 2008, and you can find the original source in the supplementary material. The paper describes the procedure very briefly. Basically, the model predictions are obtained over a grid of the two predictors, with all other predictors set at their means. The model predictions are then regressed on the two grid variables, each treated as a factor. The mean squared residual of this regression is then multiplied by 1000. This statistic measures how far the model predictions depart from an additive combination of the two predictors, indicating a possible interaction.
From the dismo package, we can also obtain the relevant source code for gbm.interactions. The interaction test boils down to the following commands (copied directly from source):
interaction.test.model <- lm(prediction ~ as.factor(pred.frame[,1]) + as.factor(pred.frame[,2]))
interaction.flag <- round(mean(resid(interaction.test.model)^2) * 1000,2)
pred.frame contains a grid of the two predictors in question, and prediction is the prediction from the original gbm fitted model where all but two predictors under consideration are set at their means.
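To see why this statistic is not bounded by 1, here is a small base-R sketch that applies the same two commands to made-up predictions. The grid, the two toy prediction surfaces and the helper interaction_flag are hypothetical and not part of dismo; only the lm and rounding steps mirror the source lines quoted above.

```r
# Hypothetical 10 x 10 grid over two predictors
pred.frame <- expand.grid(x1 = seq(0, 1, length.out = 10),
                          x2 = seq(0, 1, length.out = 10))

# Toy "model predictions": one purely additive, one with a product term
additive    <- 2 * pred.frame$x1 + 3 * pred.frame$x2
interacting <- additive + 5 * pred.frame$x1 * pred.frame$x2

# Same computation as in gbm.interactions, wrapped in an illustrative helper
interaction_flag <- function(prediction, pred.frame) {
  m <- lm(prediction ~ as.factor(pred.frame[, 1]) + as.factor(pred.frame[, 2]))
  round(mean(resid(m)^2) * 1000, 2)
}

flag_add <- interaction_flag(additive, pred.frame)     # main effects explain everything
flag_int <- interaction_flag(interacting, pred.frame)  # leftover variation from x1 * x2

c(additive = flag_add, interacting = flag_int)
```

Because the statistic is a scaled mean squared residual, its size depends on the scale of the response and can be far larger than 1.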
This is different from Friedman's H statistic (Friedman & Popescu, 2005), which is estimated via formula (44) for any pair of predictors. It is essentially the departure from additivity for any two predictors, averaging over the values of the other variables, NOT setting the other variables at their means. It is expressed as a proportion of the total variance of the partial dependence function of the two variables (or the model-implied predictions), so it will always be between 0 and 1.
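For reference, the pairwise statistic is commonly written as below. The notation is an assumption on my part rather than a quotation: \hat{F}_{jk}, \hat{F}_j and \hat{F}_k denote the centered partial dependence functions of the pair and of each single predictor, evaluated at the N observed data points.

```latex
H_{jk}^{2} =
  \frac{\sum_{i=1}^{N} \left[ \hat{F}_{jk}(x_{ij}, x_{ik}) - \hat{F}_{j}(x_{ij}) - \hat{F}_{k}(x_{ik}) \right]^{2}}
       {\sum_{i=1}^{N} \hat{F}_{jk}^{2}(x_{ij}, x_{ik})}
```

The numerator is the part of the joint partial dependence that the two single-variable functions cannot explain, and the denominator normalizes it by the total variation of the joint function.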

Output posterior distribution from bayesian network in R (bnlearn)

I'm experimenting with Bayesian networks in R and have built some networks using the bnlearn package. I can use them to make predictions for new observations with predict(), however I would also like to have the posterior distribution over the possible classes. Is there a way of retrieving this information?
It seems like there is a prob-parameter that does this for the naive bayes implementation in the bnlearn package, but not for networks fitted with bn.fit.
Thankful for any help with this.
See the documentation of bnlearn.
The predict function implements the prob argument only for naive.bayes and TAN classifiers.
In short, this is because the other methods do not necessarily compute posterior probabilities.
From the bnlearn documentation: predict returns the predicted values for node given the data specified by data. Depending on the value of method, the predicted values are computed in one of two ways: (a) parents or (b) bayes-lw.
When using bayes-lw, likelihood weighting simulations are performed to make the predictions.
Hope this helps. :)

Why is it inadvisable to get statistical summary information for regression coefficients from glmnet model?

I have a regression model with binary outcome. I fitted the model with glmnet and got the selected variables and their coefficients.
Since glmnet doesn't calculate variable importance, I would like to feed the exact output (selected variables and their coefficients) to glm to get the additional information (standard errors, etc.).
I searched the R documentation, and it seems I can use the "method" option in glm to specify a user-defined function.
But I failed to do so. Could someone help me with this?
"It is a very natural question to ask for standard errors of regression
coefficients or other estimated quantities. In principle such standard
errors can easily be calculated, e.g. using the bootstrap.
Still, this package deliberately does not provide them. The reason for
this is that standard errors are not very meaningful for strongly
biased estimates such as arise from penalized estimation methods.
Penalized estimation is a procedure that reduces the variance of
estimators by introducing substantial bias. The bias of each estimator
is therefore a major component of its mean squared error, whereas its
variance may contribute only a small part.
Unfortunately, in most applications of penalized regression it is
impossible to obtain a sufficiently precise estimate of the bias. Any
bootstrap-based calculations can only give an assessment of the
variance of the estimates. Reliable estimates of the bias are only
available if reliable unbiased estimates are available, which is
typically not the case in situations in which penalized estimates are
used.
Reporting a standard error of a penalized estimate therefore tells
only part of the story. It can give a mistaken impression of great
precision, completely ignoring the inaccuracy caused by the bias. It
is certainly a mistake to make confidence statements that are only
based on an assessment of the variance of the estimates, such as
bootstrap-based confidence intervals do."
Jelle Goeman, Ph.D. Leiden University, Author of the Penalized package in R.
There are CRAN packages, hdi and selectiveInference, which provide inference for high-dimensional models; you might want to take a look at those...
I've also seen people just run a glm using the predictors selected by glmnet, but this doesn't take into account the uncertainty produced by the selection process of the best model itself...
