gbm::interact.gbm vs. dismo::gbm.interactions

Background
The reference manual for the gbm package states that the interact.gbm function computes Friedman's H-statistic to assess the strength of variable interactions. The H-statistic lies on the scale [0, 1].
The reference manual for the dismo package does not cite any literature describing how the gbm.interactions function detects and models interactions. Instead, it gives a list of general procedures used to detect and model interactions. The dismo vignette "Boosted Regression Trees for ecological modeling" states that the dismo package extends functions in the gbm package.
Question
How does dismo::gbm.interactions actually detect and model interactions?
Why
I am asking this question because gbm.interactions in the dismo package yields values greater than 1, which the gbm reference manual says should not be possible.
I checked the tar.gz for each of the packages to see if the source code was similar. It is different enough that I cannot determine if these two packages are using the same method to detect and model interactions.

To summarize, the difference between the two approaches boils down to how the "partial dependence function" of the two predictors is estimated.
The dismo package is based on code originally given in Elith et al. (2008), and you can find the original source in the paper's supplementary material. The paper describes the procedure only briefly. Basically, model predictions are obtained over a grid of the two predictors, with all other predictors set at their means. The predictions are then regressed onto the grid, and the mean squared error of this regression is multiplied by 1000. The statistic captures departures of the model predictions from a linear (additive) combination of the two predictors, indicating a possible interaction.
From the dismo package, we can also obtain the relevant source code for gbm.interactions. The interaction test boils down to the following commands (copied directly from the source):
interaction.test.model <- lm(prediction ~ as.factor(pred.frame[,1]) + as.factor(pred.frame[,2]))
interaction.flag <- round(mean(resid(interaction.test.model)^2) * 1000,2)
pred.frame contains a grid of the two predictors in question, and prediction is the prediction from the original fitted gbm model with all predictors other than the two under consideration set at their means.
This is different from Friedman's H-statistic (Friedman & Popescu, 2005), which is estimated via formula (44) of that paper for any pair of predictors. The H-statistic is essentially the departure from additivity for two predictors, averaging over the values of the other variables, NOT setting the other variables at their means. It is expressed as a proportion of the total variance of the partial dependence function of the two variables (or model-implied predictions), so it will always be between 0 and 1.
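For reference, formula (44) in that paper defines the statistic as

H^2_{jk} = sum_i [ F_jk(x_ij, x_ik) - F_j(x_ij) - F_k(x_ik) ]^2 / sum_i F_jk(x_ij, x_ik)^2

where F_jk, F_j, and F_k are the (centered) partial dependence functions of the pair and of each predictor alone. To see the difference in practice, here is a minimal sketch that runs both tests on the same fitted model. The data frame dat (response in column 1, predictors in columns 2 to 5) and the tuning values are placeholders for illustration, not a recipe:

library(gbm)
library(dismo)
# fit a boosted regression tree model with dismo's gbm.step wrapper
fit <- gbm.step(data = dat, gbm.x = 2:5, gbm.y = 1, family = "gaussian",
                tree.complexity = 3, learning.rate = 0.01, bag.fraction = 0.5)
# Friedman's H for the first two predictors: always in [0, 1]
interact.gbm(fit, dat[, 2:5], i.var = c(1, 2), n.trees = fit$n.trees)
# dismo's statistic: residual MSE of the lm above, times 1000; not bounded by 1
gbm.interactions(fit)$interactions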

Related

How to check overfitting of a point pattern on a linear network using spatstat

I have been using lppm (point pattern on a linear network) in spatstat with a bunch of covariates, fitting a log-linear model, but I cannot see how to check for over-fitting. Is there a quick way to do it?
It depends on what you want.
What tool would you use to check overfitting in (say) a linear model?
To identify whether individual observations may have been over-fitted, you could use influence.lppm (from the spatstat.linnet package).
To identify collinearity in the covariates, currently we do not provide a dedicated function in spatstat, but you could use the following trick. If fit is your fitted model of class lppm, first extract the corresponding GLM using
g <- getglmfit(as.ppm(fit))
Next install the package faraway and use the vif function to calculate the variance inflation factors
library(faraway)
vif(g)
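To make the first suggestion concrete, here is a minimal sketch using the spiders dataset that ships with spatstat (my choice of example data, purely for illustration):

library(spatstat)
# fit a log-linear point process model on a linear network
fit <- lppm(spiders ~ x + y)
# casewise influence diagnostics; dispatches to influence.lppm
infl <- influence(fit)
plot(infl)   # cases with outsized influence suggest possible over-fitting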

predicting with zoib models (MCMC / RJags)

I am using the zoib package in R to build zero-inflated beta regression models. I am looking for a simple way to use the models that zoib produces to calculate a predicted response for a new dataset. By "new dataset" I mean data not used to build the original zoib models.
I know I can just take the zoib model parameters and manually write a function in R to predict with but I want to utilise the fact that zoib models are Bayesian so I can get a posterior distribution of possible response values. My plan is to use the posterior distributions to calculate confidence intervals around each prediction.
Because zoib uses an MCMC approach within rjags, I have investigated these two solutions:
manipulating the code within RJags
appending the new data with an "NA" response variable
I do not know how to implement the first solution because zoib runs rjags internally and the zero-inflated model it runs is very complicated. I tried the second solution, but it simply ignored the rows of data that I appended with "NA" response values.
I emailed the zoib package developers and this was their response:
For now, the zoib function can only output posterior predictive samples for Y given the X in the data set where the zoib regression is applied to, but not for a new set of X's. Your suggestion can be easily incorporated into the new version of the package, which is expected to be out in about a few weeks.
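Until that updated version is available, a manual route along the lines mentioned above might look like the following sketch. It assumes you have already extracted a matrix of posterior coefficient draws for the mean (beta) component from the zoib fit's MCMC output; draws and X_new are hypothetical placeholder names, not part of the zoib API:

# draws: S x p matrix of posterior draws of the mean-model coefficients (logit link)
# X_new: n x p design matrix for the new data, including an intercept column
eta  <- X_new %*% t(draws)         # n x S matrix of linear predictors
mu   <- plogis(eta)                # posterior draws of each predicted mean
pred <- rowMeans(mu)               # posterior mean prediction per new case
ci   <- t(apply(mu, 1, quantile, probs = c(0.025, 0.975)))  # 95% credible intervals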

ROCR Package - Classification algo other than logistic regression

I was referring to this link, which explains the usage of the ROCR package for plotting ROC curves and other related accuracy metrics. The author mentions logistic regression at the beginning, but do these functions (prediction and performance from ROCR) apply to other classification algorithms such as SVMs, decision trees, etc.?
I tried using the prediction() function with the results of my SVM model, but it threw a format error despite the arguments being of the same type and dimensions. Also, I am not sure that if we come up with ROC curves for these algorithms, we would get a shape like the one we generally see with logistic regression (like this).
The prediction and performance functions are model-agnostic in the sense that they only require the user to input actual and predicted values from a binary classifier. (More precisely, this is what prediction requires, and performance takes as input an object output by prediction). Therefore, you should be able to use both functions for any classification algorithm that can output this information - including both logistic regression and SVM.
Having said this, model predictions can come in different formats (e.g., propensity scores vs. classes; results stored as numeric vs. factor), and you'll need to ensure that what you feed into prediction is appropriate. The requirements can be quite specific: for example, while the predictions argument can represent continuous or class information, it can't be a factor. See the "details" section of the function's help file for more information.
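As a concrete illustration, here is a hedged sketch of an ROC curve for an SVM fitted with e1071; the data frame dat with a binary factor response y is an assumed placeholder:

library(e1071)
library(ROCR)
fit <- svm(y ~ ., data = dat, probability = TRUE)   # request class probabilities
pr  <- predict(fit, dat, probability = TRUE)
# pull the numeric score for the positive class; column names are the factor levels
score <- attr(pr, "probabilities")[, levels(dat$y)[2]]
pred <- prediction(score, dat$y)          # numeric scores, not a factor
perf <- performance(pred, "tpr", "fpr")   # true vs. false positive rates
plot(perf)                                # the ROC curve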

Quantifying importance of variables in Canonical Correspondence Analysis using R? (x-post from researchgate)

I currently have species abundance data for multiple lakes, along with measurements of some environmental variables of those lakes. I decided to do Canonical Correspondence Analysis of the data in R, as demonstrated by ter Braak and Verdonschot (1995); see the link: http://link.springer.com/article/10.1007%2FBF00877430 (section: "Ranking environmental variables in importance")
I am not very good with R yet, and I do not have access to the software used in the article (CANOCO). My problem is that, in order to do stepwise ranking of the importance of environmental variables, I have to obtain the lambda (is this the same as Wilks' lambda?) and perform a Monte Carlo permutation test on each constrained CCA axis.
Does anybody know how I can do this in R? I would love to be able to use this analysis.
If you want to test effects in a current model, you want the anova() method that vegan provides for cca(), the function that does CCA in the package. See ?anova.cca for details, and perhaps the by = "margin" option to test marginal terms.
To do stepwise selection you have two options:
Use the standard step() function, which works with an AIC-like statistic for CCA, or
For the sort of selection done in that paper and implemented in CANOCO, use ordistep(). This does forward selection and backward elimination, testing changes to the model via permutation tests.
"Lambda" is commonly used to denote eigenvalues, and it is not Wilks' lambda here. The pseudo-F statistic is the one mentioned in the paper; it is what is computed in the test, and the permutations give its sampling distribution under the null hypothesis. This ultimately determines the significance of terms in the model, or whether a term enters or leaves the model during selection.
See ?ordistep for more details.
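A minimal sketch with vegan's built-in varespec and varechem data (my choice of example data):

library(vegan)
data(varespec)
data(varechem)
mod0 <- cca(varespec ~ 1, data = varechem)      # null (intercept-only) model
mod1 <- cca(varespec ~ ., data = varechem)      # full model with all constraints
sel  <- ordistep(mod0, scope = formula(mod1))   # permutation-based selection
anova(sel, by = "margin")                       # test marginal terms in final model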

randomForest in R: Is there a possibility of calculating casewise confidence intervals?

The R package randomForest reports mean squared errors for each tree in the forest. I need, however, a measure of confidence for each case in the data. Since randomForest calculates the casewise predictions by averaging the predictions of the single trees, I guess it should also be possible to calculate a casewise standard error and thus a confidence interval. Can this be done using the returned randomForest object (if so: how?), or do I have to dig into the source code?
No need to dig into the source code. You only need to read the documentation. ?predict.randomForest states that one of its arguments is called predict.all:
predict.all Should the predictions of all trees be kept?
So setting that to TRUE will keep a prediction for each case, for each tree, which you can then use to calculate a standard error for each case, as in the sketch below.
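For a regression forest, a minimal sketch might look like this. The Boston data from MASS is my example choice, and the +/- 2 * SE interval is the naive normal approximation treating trees as independent (which they are not; see the paper below for more careful estimators):

library(randomForest)
library(MASS)    # for the Boston housing data, used only for illustration
rf <- randomForest(medv ~ ., data = Boston, ntree = 500)
pr <- predict(rf, newdata = Boston, predict.all = TRUE)
# pr$individual is an n x ntree matrix: one prediction per case per tree
se <- apply(pr$individual, 1, sd) / sqrt(rf$ntree)   # naive casewise SE
ci <- cbind(lower = pr$aggregate - 2 * se,
            upper = pr$aggregate + 2 * se)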
I have recently been made aware of this paper by Stefan Wager, Trevor Hastie and Brad Efron which investigates more rigorously the idea of standard errors for the predictions generated by random forests (and other bagged predictors).
