Elith et al. [1] describe a method of measuring the dissimilarity between the values used in model fitting and the values used in making predictions. In the context of species distribution modelling (ecological niche modelling) that prediction is a 'projection'. The method is called 'multivariate environmental similarity surface' (MESS) analysis. There is a function in the dismo package to estimate it (as well as a function built into the MAXENT Java program).
q1: Does anyone know what units are reported by the dismo::mess function?
The dismo::mess function reports not only a MESS for each predictor (received and reported as a raster), but also a layer named 'rmess'. In the help file this layer is described as "an additional layer with the MESS values".
q2: How are the MESS values calculated?
q3: What is the rmess layer a measure of?
Thanks for your help!
[1] Elith, J., Kearney, M. & Phillips, S. 2010 The art of modelling range-shifting species. Methods in Ecology and Evolution 1, 330-342. (doi:10.1111/j.2041-210X.2010.00036.x).
You can see what dismo does by typing
dismo::mess
It calls .messi3, which you can see with
dismo:::.messi3
(I found the answer in the appendices...)
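For reference, my reading of that appendix (and of dismo:::.messi3) is that the per-variable calculation boils down to something like the sketch below; the function name messi is mine, so verify against the package source before relying on it. The values are percent-like similarity scores, which also answers the units question.

# Sketch of the per-variable MESS similarity from the Elith et al. (2010)
# appendix (my reading; verify against dismo:::.messi3)
messi <- function(p, ref) {
  rng <- range(ref)
  f <- 100 * sum(ref < p) / length(ref)  # percentile of p among reference values
  if (f == 0) {
    100 * (p - rng[1]) / diff(rng)       # below the reference range: negative
  } else if (f <= 50) {
    2 * f
  } else if (f < 100) {
    2 * (100 - f)
  } else {
    100 * (rng[2] - p) / diff(rng)       # above the reference range: negative
  }
}
# The 'rmess' layer is then the minimum of these similarities across all
# predictors, so negative values flag conditions outside the training range.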
I've recently started using the bnlearn package in R for detecting Markov Blankets of a target node in the dataset.
Based on my understanding of Bayesian inference, two nodes are connected if there is a causal relationship between them, and this is measured using conditional independence tests that check for correlation while taking potential confounders into account.
I just wanted to clarify whether bnlearn checks for both linear and non-linear correlations in these tests. I tried to find this in the package documentation but couldn't turn up anything.
It would be really helpful if someone can explain how bnlearn performs the CI tests.
Thanks a bunch <3
Correlation implies statistical dependence, but not vice versa. There are cases of statistical dependence with no (linear) correlation, e.g. in periodic signals: the correlation between sin(x) and x is very low over many periods. Statistical dependence is a more general concept than correlation, which is why the documentation is phrased in terms of (in)dependence rather than correlation.
As the sin(x) and x example shows, this is a non-linear dependency, and it is exactly the kind of dependence a Bayesian network should capture; whether a given conditional independence test detects it depends on the test, as the sketch below illustrates.
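A quick toy illustration in R: bnlearn's Gaussian tests (e.g. "cor", "zf") are linear, while the discrete mutual-information test "mi" can pick up the dependence after discretization.

set.seed(1)
x <- runif(1000, 0, 20 * pi)   # many periods
y <- sin(x)
cor(x, y)                      # near zero: no linear correlation

# A mutual-information test on a discretized version still detects the
# dependence; Gaussian tests such as "cor" or "zf" would not.
library(bnlearn)
d <- data.frame(x = cut(x, 10), y = cut(y, 10))
ci.test("y", "x", data = d, test = "mi")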
I am running boosted regression trees (BRT) in R with the package dismo, and I have included a predictor (a residual autocovariate) that, in theory, corrects for spatial autocorrelation, following Crase et al. (2012). My data units are grid cells in vector format. I have defined binary neighbours (i.e. they all have the same weight; I have no reason to consider any other type) of type 'queen' (i.e., in my case, the 8 neighbours that have any contact with each grid cell).
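For concreteness, that construction can be sketched with spdep roughly as follows (object names like cells and resid0 are placeholders; the exact weighting should match Crase et al.):

library(spdep)
# cells: polygons for the grid cells; resid0: residuals from an initial
# BRT fitted without the autocovariate
nb  <- poly2nb(cells, queen = TRUE)   # 'queen' contiguity neighbours
lw  <- nb2listw(nb, style = "W")      # equal weights per neighbour, i.e. a neighbourhood mean ("B" would give the sum)
rac <- lag.listw(lw, resid0)          # residual autocovariate
# refit the BRT with 'rac' as an extra predictor, then re-check Moran's I:
# moran.test(resid1, lw)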
I'm using these BRT to relate environmental predictors to different biodiversity metrics (responses) at the global scale.
The thing is that, even after correcting in the way I described above, the residuals still show spatial autocorrelation (measured as global Moran's I). I've used this approach before and never had this problem. So, I have two questions:
Is there any way to solve this issue?
Is it that bad to have remaining spatial autocorrelation? I know global species richness (for example) has this characteristic and, of course, every model is going to miss some predictor needed to fully explain this natural clustering of fauna.
Any thought is welcome!
Thank you.
This is a follow-up to a previous question I asked a while back that was recently answered.
I have built several gbm models with dismo::gbm.step, which relies on the gbm fitting functions found in R package gbm, as well as cross validation tools from R package splines.
As part of my analysis, I would like to use some of the graphical tools available in R (e.g. perspective plots) to visualize pairwise interactions in the data. Both the gbm and dismo packages have functions for detecting and modelling interactions.
The implementation in dismo is explained in Elith et al. (2008) and returns a statistic that indicates departures of the model predictions from a linear combination of the predictors, while holding all other predictors at their means.
The implementation in gbm uses Friedman's H statistic (Friedman & Popescu, 2005), returns a different metric, and does NOT set the other variables at their means.
The interactions modelled and plotted with dismo::gbm.interactions are great and have been very informative. However, I would also like to use gbm::interact.gbm, partly for publication strength and also to compare the results from the two methods.
If I try to run gbm::interact.gbm on a gbm.object created with dismo, an error is returned:
"Error in is.factor(data[, x$var.names[j]]) :
argument "data" is missing, with no default"
I understand dismo::gbm.step adds extra data the authors thought would be useful to the gbm model.
I also understand that the answer to my question lies somewhere in the source code.
My question is...
Is it possible to modify a gbm object created in dismo so that it can be used with gbm::interact.gbm? If so, would this be accomplished by...
a. Modifying the gbm object created in dismo::gbm.step?
b. Modifying the source code for gbm::interact.gbm?
c. Doing something else?
I will be going through the source code trying to solve this myself; if I come up with a solution before anyone answers, I will answer my own question.
The gbm::interact.gbm function requires data as an argument: interact.gbm <- function(x, data, i.var = 1, n.trees = x$n.trees).
The dismo gbm.object is essentially the same as the gbm gbm.object, but with extra information attached, so I don't imagine changing the gbm.object itself would help; the missing piece is the data argument.
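In practice, supplying the training data explicitly may be all that is needed. A hedged sketch using the Anguilla_train example data shipped with dismo (the tuning values are illustrative, and I haven't verified this across gbm versions):

library(dismo)
library(gbm)
data(Anguilla_train)
# fit a BRT as in the dismo tutorial (illustrative settings)
m <- gbm.step(data = Anguilla_train, gbm.x = 3:13, gbm.y = 2,
              family = "bernoulli", tree.complexity = 3,
              learning.rate = 0.01, bag.fraction = 0.5)
# pass the predictor columns as 'data', the argument the error reports missing:
interact.gbm(m, data = Anguilla_train[, 3:13], i.var = c(1, 2),
             n.trees = m$n.trees)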
R package randomForest reports mean squared errors for each tree in the forest. I need, however, a measure of confidence for each case in the data. Since randomForest calculates the casewise predictions by averaging the predictions of the single trees, I guess that it should also be possible to calculate a casewise standard error and thus a confidence interval. Can this be done using the output randomForest object (if so: how?) or do I have to dig into the source code?
No need to dig into the source code. You only need to read the documentation. ?predict.randomForest states that one of its arguments is called predict.all:
predict.all Should the predictions of all trees be kept?
So setting that to TRUE will keep a prediction for each case, for each tree, which you can then use to calculate standard error for each case.
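A minimal sketch of that idea for a regression forest (the per-tree spread is only a rough casewise uncertainty measure, not a formal confidence interval, because the trees are not independent):

library(randomForest)
set.seed(42)
fit  <- randomForest(mpg ~ ., data = mtcars, ntree = 500)
pred <- predict(fit, newdata = mtcars, predict.all = TRUE)
# pred$individual is an n-by-ntree matrix of per-tree predictions
spread <- apply(pred$individual, 1, sd)   # casewise spread across trees
head(cbind(mean = pred$aggregate, spread))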
I have recently been made aware of this paper by Stefan Wager, Trevor Hastie and Brad Efron which investigates more rigorously the idea of standard errors for the predictions generated by random forests (and other bagged predictors).
I'm working on a project now that's rather unlike anything I've done before. I have two tests with binary results that will be administered to the same sample, which is drawn from a clustered population (i.e., some subjects will be from the same family). I'd like to compare proportions of positive test results, but the clustering makes McNemar's test inappropriate so I've been reading up on alternative approaches. The two main routes seem to be 1) the clustering-adjusted McNemar alternatives by Rao and Scott (1992), Eliasziw and Donner (1991), and Obuchowski (1998), and 2) GEE.
Do you know of any implementations of the Rao-Obuchowski lineage in R (or, I suppose, SAS)? GEE is easy to find, but have you had a positive or negative experience with any particular packages? Is there another route to analyzing these data that I'm completely missing?
You could always just use a clustered bootstrap. Resample across families, which you believe are independent; that is, keep families together when you resample. Compute p2 - p1 for each sample. After 1000 iterations or so, take the upper and lower 2.5% quantiles; this gives you a bootstrapped 95% confidence interval. Alternatively, compute the fraction of samples above zero, or whatever your hypothesis is. The procedure should have pretty good properties unless the number of families is small.
It's probably easiest to do this by hand in R rather than relying on any package.
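A hand-rolled version might look like this (the data frame dat, with columns family, test1 and test2 holding 0/1 results and one row per subject, is a hypothetical layout):

set.seed(1)
B     <- 1000
fams  <- unique(dat$family)
diffs <- replicate(B, {
  samp <- sample(fams, replace = TRUE)   # resample whole families
  d <- do.call(rbind, lapply(samp, function(f) dat[dat$family == f, ]))
  mean(d$test2) - mean(d$test1)          # p2 - p1 for this resample
})
quantile(diffs, c(0.025, 0.975))         # percentile 95% CI
mean(diffs > 0)                          # fraction of resamples above zero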
Check out the survey package: it is designed to take into account correlations induced by clustered sampling.
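Something along these lines, assuming a long-format data frame dat_long with columns family, test and positive (the layout is hypothetical; equal sampling weights are assumed):

library(survey)
des <- svydesign(ids = ~family, data = dat_long)  # cluster on family
svymean(~positive, subset(des, test == "test1"))  # proportion for test 1
svymean(~positive, subset(des, test == "test2"))  # proportion for test 2
# or model the difference directly, with cluster-robust standard errors:
summary(svyglm(positive ~ test, des, family = quasibinomial()))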
Have you already checked the CorrBin package in R?
It is designed for the analysis of correlated binary data; there is a paper by Szabo, 'Using the CorrBin package for nonparametric analysis of correlated binary data', which covers the Rao-Scott test, stochastic ordering, and three versions of a GEE-based test.
The clust.bin.pair package for clustered binary matched-pair data was recently published to CRAN.
It contains implementations of Eliasziw and Donner (1991) and Obuchowski (1998), as well as two more recent tests in the same family, Durkalski (2003) and Yang (2010).
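A hedged usage sketch (the argument names ak..dk follow my recollection of the CRAN documentation, so check ?clust.bin.pair; they hold per-cluster counts of the four matched-pair outcomes):

library(clust.bin.pair)
# per-cluster counts of pairs that are (+,+), (+,-), (-,+), (-,-):
ak <- c(4, 2, 3); bk <- c(1, 0, 2); ck <- c(0, 1, 1); dk <- c(5, 4, 3)
clust.bin.pair(ak, bk, ck, dk, method = "obuchowski")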