Mahalanobis distance-based classifier gives seemingly wrong scores for points identical to training data - r

I have been using the mahal classifier function (dismo package in R) in several of my analyses, and I recently discovered that it seems to give apparently wrong distance results for points that are identical to points used to train the classifier. For background, my understanding of Mahalanobis-based classifiers is that they describe the similarity of an unclassified point by measuring the point's Mahalanobis distance from the center of mass (centroid) of the training set, while accounting for differences in scale, covariance, etc. The resulting score varies from -Inf to 1, where 1 indicates zero distance between the unclassified point and the centroid defined by the training set. However, I found that every point whose predictor values are identical to one of the training points still gets a score of 1, as if the routine were working as a nearest-neighbor classifier. This is very troubling behavior because it has the potential to artificially inflate the confidence of my overall classification.
Has anyone encountered this behavior? Any ideas on how to fix or avoid it?
I have written a small script below that showcases the odd behavior clearly:
rm(list = ls()) #remove all past worksheet variables
library(dismo)
logo <- stack(system.file("external/rlogo.grd", package="raster"))
#presence data (points that fall within the 'r' in the R logo)
pts <- matrix(c(48.243420, 48.243420, 47.985820, 52.880230, 49.531423, 46.182616,
54.168232, 69.624263, 83.792291, 85.337894, 74.261072, 83.792291, 95.126713,
84.565092, 66.275456, 41.803408, 25.832176, 3.936132, 18.876962, 17.331359,
7.048974, 13.648543, 26.093446, 28.544714, 39.104026, 44.572240, 51.171810,
56.262906, 46.269272, 38.161230, 30.618865, 21.945145, 34.390047, 59.656971,
69.839163, 73.233228, 63.239594, 45.892154, 43.252326, 28.356155), ncol=2)
# fit model
m <- mahal(logo, pts)
#using model, predict train data
training_vals=extract(logo, pts)
x <- predict(m, training_vals)
x #every training point gets a perfect score of 1, which seems highly unlikely
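For reference, here is a minimal sketch (using the training_vals object created above) of the quantity I expected the classifier to be based on: the squared Mahalanobis distance of each training point to the centroid of the training data, D^2 = (x - mu)' S^-1 (x - mu), computed with stats::mahalanobis. If the score is essentially 1 minus this distance, as I understand it, then a training point should only score exactly 1 if it sits exactly on the centroid.
# squared Mahalanobis distance of each training point to the training-set centroid
# (assumes the covariance matrix of the predictors is non-singular)
centroid <- colMeans(training_vals)
covmat <- cov(training_vals)
d2 <- mahalanobis(training_vals, center = centroid, cov = covmat)
d2 # these are generally non-zero, so a score of exactly 1 for every training point is surprising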
Now I try to make predictions for values that are the average of directly adjacent point pairs.
I do this because, given that:
(1) each point in each pair used to train the model has a perfect suitability, and
(2) at least some of these averaged points should be at least as close to the Mahalanobis centroid as the original points,
(3) I would expect at least a few of the averaged points to have a perfect suitability as well.
#create adjacent points by shifting each original point by +1 in the y direction
adjacent_pts=pts
adjacent_pts[,2]=adjacent_pts[,2]+1
adjacent_training_vals=extract(logo, adjacent_pts)
new_pts=rbind(pts, adjacent_pts)
plot(logo[[1]]) #plot predictor raster and response point pairs
points(new_pts[,1],new_pts[,2])
#refit the model on the combined point pairs and predict the mahalanobis score for this new training data
m <- mahal(logo, new_pts)
new_training_vals=extract(logo, new_pts)
x <- predict(m, new_training_vals)
x
As expected from the odd behavior described above, all training points have a distance score of 1. However, let's try to predict the points that are the average of each pair:
mid_vals=(adjacent_training_vals+training_vals)/2
x <- predict(m, mid_vals)
x #NONE DO!
For me this is further indication that the mahal routine gives a perfect score to any data point whose values are identical to any of the points used to train the model.
The following is unnecessary, but it is just another way to prove the point:
Here I predict the same original training data after a tiny, near-insignificant 'budge' of the values of only one of the predictors, and show that the resulting scores change quite significantly.
mod_training_vals=training_vals
mod_training_vals[,1]=mod_training_vals[,1]*1.01
x <- predict(m, mod_training_vals)
x #predictions suddenly are far from perfect predictions

Related

Accounting for Spatial Autocorrelation in Model

I am trying to account for spatial autocorrelation in a model in R. Each observation is a country for which I have the average latitude and longitude. Here's some sample data:
country <- c("IQ", "MX", "IN", "PY")
long <- c(43.94511, -94.87018, 78.10349, -59.15377)
lat <- c(33.9415073, 18.2283975, 23.8462264, -23.3900255)
Pathogen <- c(10.937891, 13.326284, 12.472374, 12.541716)
Answer.values <- c(0, 0, 1, 0)
data <- data.frame(country, long, lat, Pathogen, Answer.values)
I know spatial autocorrelation is an issue (Moran's I is significant in the whole dataset). This is the model I am testing: Answer.values (a 0/1 variable) ~ Pathogen prevalence (a continuous variable).
model <- glm(Answer.values ~ Pathogen,
na.action = na.omit,
data = data,
family = "binomial")
How would I account for spatial autocorrelation with a data structure like that?
There are a lot of potential answers to this. One easy(ish) way is to use mgcv::gam() to add a spatial smoother. Most of your model would stay the same:
library(mgcv)
gam(Answer.values ~ Pathogen + s([something]),
family="binomial",
data=data)
where s([something]) is some form of smooth spatial term. Three possible/reasonable choices would be:
a spherical spline (?mgcv::smooth.construct.sos.smooth.spec), which takes lat/long as input; this would be useful if (1) you have data over a significant fraction of the earth's surface (so that a smoother that constructs a 2D planar spatial smooth is less reasonable); (2) you want to account for distance between locations in a continuous way (a short sketch of this option is given after this list)
a Markov random field (?mgcv::smooth.construct.mrf.smooth.spec). This is essentially the spatial analogue of a discrete order-1 autoregressive structure (i.e. countries are directly correlated only with their direct neighbours, however you choose to define that). In order to do this you have to come up somehow with a neighbourhood list (i.e. a list of countries, where the elements are lists of countries that are neighbours of the original countries). You could do this however you like, e.g. by finding nearest neighbours geographically. (Check out some introductions to spatial statistics/spatial data analysis in R.) (On the other hand, if you're testing Moran's I then you've presumably already come up with some way to identify first-order neighbours ...)
if you're comfortable treating lat/long as coordinates in a 2D plane, then you have lot of choices of smoothing basis, e.g. ?mgcv::smooth.construct.gp.smooth.spec (Gaussian process smoothers, which include most of the standard spatial autocorrelation models as special cases)
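For concreteness, here is a minimal sketch of the first option (the spline-on-the-sphere basis). Note that full_data below is a hypothetical, larger version of your data frame with the same columns; the four-row toy example above has far too few locations to actually fit a spatial smooth.
library(mgcv)
# spatial smooth on the sphere: latitude is given first, then longitude
fit_sos <- gam(Answer.values ~ Pathogen + s(lat, long, bs = "sos"),
family = binomial,
data = full_data) # full_data: hypothetical larger dataset with lat/long columns
summary(fit_sos)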
A helpful link for getting up to speed with GAMs in R ...

How to get X axis on Fig 5.3 in Elements of Statistical Learning?

I am trying to reproduce figure 5.3 in The Elements of Statistical Learning using the South African Heart Disease data. I have gotten to the point where I can compute the pointwise variances and plot them against "sbp", one of the model's predictor variables. I use "sbp" in part because my pointwise variance vector has dimension 462 by 1, so the only other thing I can plot it against is a predictor variable with the same number of data points (462). With that, I get a plot that looks like this:
Eyeballing this plot, I can see knots at the 33% (123) and 66% (162) quantiles for the cubic spline model with df = 6 - 1 (-1 because there is an intercept), in agreement with figure 5.3, whose caption places the knots at 0.33 and 0.66. I think I am getting close, but my problem now is that this is not plotted against X from 0 to 1 with 50 points as the figure describes. Here is what the figure should display in principle:
The code for my figure is written in R and currently only attempts the cubic spline model. If I wanted to do the natural cubic spline, I would just replace the bs() function used for the cubic spline with the ns() function to build the required H matrix of basis functions. Please see the code showing how I am constructing the cubic spline model:
library(sqldf)
library(splines)
library(gam)
library(mgcv)
SAheart <- read.table("SAheart.data",
sep = ",", head=T,row.names = 1)
SAheart.var<-sqldf("select sbp,tobacco,ldl,famhist,obesity,alcohol,age,chd from SAheart")
attach(SAheart.var)
sbp<-SAheart.var[,1]
tobacco<-SAheart.var[,2]
ldl.bsf<-SAheart.var[,3]
famhist<-SAheart.var[,4]
obesity<-SAheart.var[,5]
alcohol<-SAheart.var[,6]
age<-SAheart.var[,7]
chd<-SAheart.var[,8]
#Ignore these two models since they are simply dummy models for the natural cubic spline and global linear
SAheartGlobalLinear<-gam(chd~ sbp,data=SAheart)
SAheartNaturalCubicSpline<-gam(chd~ns(sbp,df=5),method="REML",data=SAheart)
#SAheartCubicSpline
sbp.bs <- bs(sbp,df=5)
tobacco.bs<-bs(tobacco,df=5)
ldl.bsf.bs<-bs(ldl.bsf,df=5)
famhist<-as.numeric(famhist)-1
obesity.bs<-bs(obesity,df=5)
alcohol.bs<-bs(alcohol,df=5)
age.bs<-bs(age,df=5)
chd.bs<-bs(chd,df=5)
#build required H matrix of basis functions using df=6-1 degrees of freedom
H <-cbind(sbp.bs,tobacco.bs,ldl.bsf.bs,famhist,obesity.bs,age.bs)
#centering the columns of H, intercept column is not centered
#producing another basis of the column space
H<-cbind(rep(1,dim(SAheart)[1]),scale(H,scale=FALSE))
#obtain coefficients with glm.fit
SAheartCubicSpline<-glm.fit(H,chd, family = binomial())
coeff<-SAheartCubicSpline$coefficients
#make the weight matrix W (462 by 462)
W= diag(SAheartCubicSpline$weights)
#construct covariance matrix. Note: I made it two different ways; the ^-1 version below is an elementwise reciprocal, not a matrix inverse, so solve() is the correct one
Sigma = solve(t(H)%*%W%*%H)
sigma = (t(H)%*%W%*%H)^-1
#Calculate pointwise variance for one single predictor "sbp"
pw.var<-diag(H[,2:6]%*%Sigma[2:6,2:6]%*%t(H[,2:6]))
#make plot
plot(sbp,pw.var)
I think I am getting close, but my problem is that this is not plotted against X from 0 to 1 with 50 points, because my pointwise variance vector has 462 points. I wonder how plotting pointwise variance against X drawn as 50 random points from U[0,1] would give the cubic spline curve seen in figure 5.3. If possible, I would also like to know how I could fit the global cubic polynomial and the global linear model. Otherwise I understand the figure; I would just love to know where I am going wrong with the x-axis. Thanks in advance!
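For what it's worth, here is a minimal sketch of how I currently read the figure caption (this is my own guess, not code from the book): simulate X as 50 random points from U[0,1], build the cubic spline basis on X itself with knots at 0.33 and 0.66, and plot the pointwise variance diag(H %*% solve(t(H)%*%H) %*% t(H)) against X, taking sigma^2 = 1.
library(splines)
set.seed(1)
x <- sort(runif(50)) # 50 random points from U[0,1]
H <- cbind(1, bs(x, knots = c(0.33, 0.66), degree = 3)) # cubic spline basis, knots at 0.33 and 0.66
pw.var <- diag(H %*% solve(crossprod(H)) %*% t(H)) # pointwise variance assuming sigma^2 = 1
plot(x, pw.var, type = "b", xlab = "X", ylab = "Pointwise variance")
The global linear and global cubic polynomial curves would then come from replacing the basis with cbind(1, x) and cbind(1, poly(x, 3)) respectively, and the natural cubic spline basis from ns().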

Converting point process model intensity predictions to probabilities at specific points spatstat

I am working on a dataset similar to the chorley dataset in the spatstat package and am following an analysis similar to the one presented in the sample book chapter of Spatial Point Patterns: Methodology and Applications with R: https://book.spatstat.org/sample-chapters/chapter09.pdf
library(spatstat)
data("chorley")
X <- split(chorley)$larynx
D <- split(chorley)$lung
Q <- quadscheme.logi(X,D)
fit <- ppm(Q ~ x + y)
locations = data.frame(x=chorley$x, y=chorley$y)
pred <- predict(fit, locations = locations, type="intensity")
summary(pred)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.09059 0.15562 0.17855 0.18452 0.20199 0.33201
data.ppm(fit)
Planar point pattern: 58 points
window: polygonal boundary
enclosing rectangle: [343.45, 366.45] x [410.41, 431.79] km
Q
Quadrature scheme (logistic)
58 data points, 978 dummy points
Total weight 315.1553
I was wondering why, when running data.ppm on the fitted model, it seems that only the positive (larynx) cases were included in the model?
There is also a warning message that comes up with both datasets (chorley and my own) and that I do not know how to interpret:
"vcov is not implemented for dummy type ‘given’ - using ‘poisson’ formula"
Any help is greatly appreciated!
We are modelling the spatial risk. Your log-linear risk in the Cartesian coordinates is odd, but I guess it is just an example. What we usually think of as the intensity of the fitted model is really the relative risk, so predicting the "intensity" really gives us the predicted risk (odds of being a case) at the given location. To convert the relative risk to a probability you can do (continuing from the middle of the original code):
rr <- predict(fit, locations=unmark(chorley))
p <- rr/(1+rr)
The warning is related to the estimate of the variance-covariance matrix of the estimator. It is somewhat technical, but in essence the methodology assumes you are using randomly generated dummy points (lung cancer cases in this example), and it needs to know which point process model generated those points. Since you supplied them directly, it just assumes they were generated from a Poisson point process. I wouldn't be too worried about this part if you have a reasonable number of controls in your data.

Sensitivity of hierarchical clustering solution in r

I'm using hierarchical clustering to pull out a set number of clusters from a dataset. My objective is to test how robust the clustering solution is when I reduce the amount of data used (and potentially the variables included). I think this means subsampling the data, then making a new distance matrix and a new dendrogram each time I adjust something. One way I can think of to measure the sensitivity of the clustering solution is to compare the cluster centroids made with the full data to those made with a subset of the data. I could do this by projecting them into PCoA space and calculating the distance between cluster centroids (in PCoA space). This is close to what the betadisper function from the vegan package does (except that it calculates the distance of the points in a cluster to the cluster centroid). However, my problem is that if I have created different distance matrices when subsampling, then the PCoA space will be different between subsample runs, and therefore non-comparable. Is it possible to simply standardise the PCoA space from different subsample runs to make them comparable?
Any pointers or alternative approaches would be greatly appreciated,
Mark
library(vegan)
# my data has categorical variables so I would use Gower distance; for this example with the iris dataset I just use dist()
mydist<-dist(iris[,1:4])
# Pull out 3 clusters
hc_av<-hclust(d=mydist, method='average')
my_cut<-cutree(hc_av, 3)
# calc distance to cluster centre
mod<-betadisper(mydist, my_cut)
mod
plot(mod)
# randomly remove 5 of the 150 rows and recalculate as above - this would be bootstrapped
mydist2<-dist(iris[sort(sample(1:150, 145)),1:4])
# Pull out 3 clusters
hc_av2<-hclust(d=mydist2, method='average')
my_cut2<-cutree(hc_av2, 3)
# calc distance to cluster centre
mod2<-betadisper(mydist2, my_cut2)
mod2
par(mfrow=c(1,2))
plot(mod, main='full model'); plot(mod2, main='subset')
# How can I calculate the distance each cluster centroid has moved when
# subsampling the data, relative to the full model?
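One idea I have been considering, sketched below, is to align the two ordinations with a Procrustes rotation (vegan::procrustes) on the rows shared by both runs before comparing centroids; I am not sure this is a valid way to standardise the spaces, so please treat it as a sketch rather than a solution.
# align the subsample PCoA onto the full-data PCoA using only the shared rows
keep <- sort(sample(1:150, 145)) # rows kept in the subsample
pcoa_full <- cmdscale(dist(iris[, 1:4]), k = 2) # full-data ordination
pcoa_sub <- cmdscale(dist(iris[keep, 1:4]), k = 2) # subsample ordination
pro <- procrustes(pcoa_full[keep, ], pcoa_sub) # rotate/rescale the subsample onto the full ordination
plot(pro)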

Correlation between 2 rasters accounting for spatial autocorrelation

I want to test the correlation in the values between 2 spatial raster data sets (that perfectly overlap).
I could just do:
cor(getValues(raster1), getValues(raster2))
but both raster datasets are spatial autocorrelated.
Instead, I am using:
modified.ttest(getValues(raster1), getValues(raster2), coordinates)
from the SpatialPack library.
This is based on Dutilleul's test, which modifies the effective sample size based on the degree of autocorrelation.
However, the modified test does not change the estimated correlation coefficient, only the p-value.
How do I also correct the estimated correlation coefficient for the extent of autocorrelation?
This is more a stats than a programming question.
I do not think you can "correct the correlation coefficient for autocorrelation". The correlation coefficient is what it is. It is not affected by "oversampling".
a <- 1:10
b <- c(1:5,1:5)
cor(a,b)
#[1] 0.492366
No "inflation" when using the same values twice
cor(c(a,a),c(b,b))
#[1] 0.492366
The p-value is affected
t.test(a,b)$p.value
#[1] 0.03554967
t.test(c(a,a), c(b,b))$p.value
#[1] 0.002042504
You can adjust the p-value for oversampling. However, a question with raster data is whether you should consider it a sample at all. That depends on context, but raster data often represent the entire population (with some local averaging, given that cells are discrete). If there is no uncertainty due to (a small) sample size, presenting a p-value is not meaningful.
