I'm trying to reproduce this example using Excel to calculate the Mahalanobis distance between two groups.
To my mind the example provides a good explanation of the concept. However, I'm not able to reproduce it in R.
The result obtained in the example using Excel is Mahalanobis(g1, g2) = 1.4104.
Following the answer given here for R, I applied it to the data from the Excel example as follows:
# dataset used in the Excel example
g1 <- matrix(c(2, 2, 2, 5, 6, 5, 7, 3, 4, 7, 6, 4, 5, 3, 4, 6, 2, 5, 1, 3), ncol = 2, byrow = TRUE)
g2 <- matrix(c(6, 5, 7, 4, 8, 7, 5, 6, 5, 4), ncol = 2, byrow = TRUE)
# function adopted from R example
D.sq <- function (g1, g2) {
  dbar <- as.vector(colMeans(g1) - colMeans(g2))
  S1 <- cov(g1)
  S2 <- cov(g2)
  n1 <- nrow(g1)
  n2 <- nrow(g2)
  V <- as.matrix((1/(n1 + n2 - 2)) * (((n1 - 1) * S1) + ((n2 - 1) * S2)))
  D.sq <- t(dbar) %*% solve(V) %*% dbar
  res <- list()
  res$D.sq <- D.sq
  res$V <- V
  res
}
D.sq(g1,g2)
and executing the function on the data returns the following output:
$D.sq
         [,1]
[1,] 1.724041

$V
          [,1]      [,2]
[1,] 3.5153846 0.3153846
[2,] 0.3153846 2.2230769
As far as I know, $D.sq represents the distance, and 1.724 is noticeably different from the 1.4104 result in the Excel example. As I'm new to the concept of the Mahalanobis distance, I was wondering whether I did something wrong and/or whether there is a better way to calculate this, e.g. using mahalanobis()?
You get a different result for two reasons.
First, the Excel algorithm differs from the R algorithm in how the pooled covariance matrix is calculated: the R version gives the unbiased estimate of the covariance matrix, while the Excel version gives the MLE. The R version computes ((n1 - 1) * cov(g1) + (n2 - 1) * cov(g2)) / (n1 + n2 - 2), whereas the Excel version computes ((n1 - 1) * cov(g1) + (n2 - 1) * cov(g2)) / (n1 + n2).
Second, the last calculation step in the Excel post you refer to is incorrect; the result should be 1.989278 instead.
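To see the effect of the divisor, here is a minimal sketch that computes the squared distance under both pooling rules, using the data from the question. The helper name pooled_d2 and its divisor argument are made up for illustration and do not come from either post. As a side observation, sqrt(1.989278) is about 1.4104, which matches the Excel figure, so the Excel example may be reporting the unsquared distance; that reading is an assumption on my part.

```r
g1 <- matrix(c(2, 2, 2, 5, 6, 5, 7, 3, 4, 7, 6, 4, 5, 3, 4, 6, 2, 5, 1, 3),
             ncol = 2, byrow = TRUE)
g2 <- matrix(c(6, 5, 7, 4, 8, 7, 5, 6, 5, 4), ncol = 2, byrow = TRUE)

# Hypothetical helper: squared Mahalanobis distance between two group means,
# with the pooled sum of squares divided by `divisor`.
pooled_d2 <- function(g1, g2, divisor) {
  n1 <- nrow(g1)
  n2 <- nrow(g2)
  dbar <- colMeans(g1) - colMeans(g2)
  V <- ((n1 - 1) * cov(g1) + (n2 - 1) * cov(g2)) / divisor
  drop(t(dbar) %*% solve(V) %*% dbar)
}

n <- nrow(g1) + nrow(g2)
d2_unbiased <- pooled_d2(g1, g2, n - 2)  # about 1.724041 (unbiased pooling)
d2_mle      <- pooled_d2(g1, g2, n)      # about 1.989278 (MLE pooling)
sqrt(d2_mle)                             # about 1.4104
```

Because the two covariance matrices differ only by the factor (n1 + n2 - 2) / (n1 + n2), the two squared distances differ by exactly the inverse of that factor.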
Edit:
The unbiased estimator of the pooled covariance matrix is the standard choice, as on the Wikipedia page: https://en.wikipedia.org/wiki/Pooled_variance . A related fact is that in R, cov and var return the unbiased estimator of the covariance matrix rather than the MLE.
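A quick way to convince yourself of the last point is to compare var() with the two hand-written divisors on a toy vector. The vector x below is arbitrary illustration data, not taken from the question.

```r
x <- c(2, 4, 4, 6)   # arbitrary toy data
n <- length(x)
ss <- sum((x - mean(x))^2)   # sum of squared deviations, here 8

var_unbiased <- ss / (n - 1)          # what var() returns
var_mle      <- var(x) * (n - 1) / n  # rescale var() to get the MLE, here 2
```

The same rescaling by (n - 1) / n applies to each element of the matrix returned by cov().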
Edit2:
The mahalanobis function in R calculates the Mahalanobis distance from points to a distribution. It does not, by itself, calculate the Mahalanobis distance between two samples.
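That said, mahalanobis() can still evaluate the final quadratic form if you supply the pieces yourself: one group's mean as the point, the other group's mean as center, and the pooled covariance as cov. A sketch with the question's data follows; note that mahalanobis() returns the squared distance.

```r
g1 <- matrix(c(2, 2, 2, 5, 6, 5, 7, 3, 4, 7, 6, 4, 5, 3, 4, 6, 2, 5, 1, 3),
             ncol = 2, byrow = TRUE)
g2 <- matrix(c(6, 5, 7, 4, 8, 7, 5, 6, 5, 4), ncol = 2, byrow = TRUE)

n1 <- nrow(g1)
n2 <- nrow(g2)
# pooled covariance with the unbiased divisor, as in the question's function
V <- ((n1 - 1) * cov(g1) + (n2 - 1) * cov(g2)) / (n1 + n2 - 2)

# treat the mean of g1 as a single point and the mean of g2 as the center
d2 <- mahalanobis(colMeans(g1), center = colMeans(g2), cov = V)
d2  # about 1.724041, the same value as $D.sq in the question
```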
Conclusion: the most standard way to calculate the Mahalanobis distance between two samples is the R code in the original post, which uses the unbiased estimator of the pooled covariance matrix.
I wish to apply a custom function to each element of a matrix whilst also using elements of a different matrix as inputs to the function.
Specifically, my function generates random samples from a von Mises distribution (circular normal distribution), calling the Rfast package's rvonmises function.
I have one matrix (radians) which records the angle I wish to use for the central tendency of the random generation (similar to the mean), and another matrix (kappa) which records the concentration parameter of the von Mises I wish to use (similar to standard deviation).
I wish to use (for example) element [1, 1] of the radians matrix together with element [1, 1] of the kappa matrix in a call to the von Mises random generator. So, my call for one element would be:
rvonmises(n = 1, m = radians[1, 1], k = kappa[1, 1])
But of course I want this applied across all elements of the matrices. (The rvonmises function doesn't accept multiple m or k values, so for example I couldn't use rvonmises(4, m = c(1, 2, 3, 4), k = c(1, 1.2, 1.4, 1.6)).)
To summarise: I am basically after a more principled (and faster!) way of doing this:
for(i in 1:nrow(radians)){
  for(j in 1:ncol(radians)){
    result[i, j] <- Rfast::rvonmises(1, radians[i, j], kappa[i, j])
  }
}
What I have tried
Based on this post, I have tried to use mapply:
library(Rfast)
set.seed(42)
# random radians to use as input
radians <- matrix(data = runif(12, 0, 2 * pi),
                  ncol = 4)
# random concentration parameters of the von Mises distribution
kappa <- matrix(data = rgamma(12, 70, 30),
                ncol = 4)
# function to generate one random von Mises sample with angle m and
# concentration parameter k
my_function <- function(m, k){
  Rfast::rvonmises(1, m, k)
}
# my attempt
out <- matrix(mapply(my_function, m = as.data.frame(radians), k = kappa),
              ncol = 4, byrow = TRUE)
However, I don't think this is working. For example, if I test it by the following (where the central tendency in test_radians increases steadily and I use large values for kappa which leads to precise estimates):
test_radians <- matrix(data = seq(from = 1, to = 2 * pi, length.out = 12),
                       ncol = 4)
test_kappa <- matrix(data = rep(20, times = 12),
                     ncol = 4)
test <- matrix(mapply(my_function, m = as.data.frame(test_radians),
                      k = test_kappa),
               ncol = 4, byrow = TRUE)
test[1, 1] should be the smallest (on average), and test[3, 4] should be the largest. (I know that due to random variability this won't always be the case, but I've tried it with many replications.)
So, the mapping and matching between matrices isn't working as I had anticipated.
Any guidance welcomed.
You cannot compute the mean of circular observations by simply calling "mean". This is wrong. The correct way is to compute the mean of the cosine and the mean of the sine of the angles and then take the arc tangent. See the packages for directional or circular data for this.
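As a minimal base-R sketch of that recipe (the function name circular_mean is my own, not from any package), take two angles just either side of 0 radians:

```r
# Hypothetical helper: mean direction of angles given in radians.
circular_mean <- function(theta) {
  # average the unit vectors, then convert back to an angle in (-pi, pi]
  atan2(mean(sin(theta)), mean(cos(theta)))
}

theta <- c(0.1, 2 * pi - 0.1)  # two angles either side of 0

naive <- mean(theta)           # pi, i.e. pointing the opposite way
circ  <- circular_mean(theta)  # approximately 0, the sensible answer
```

If you need the result on [0, 2 * pi) rather than (-pi, pi], apply %% (2 * pi) to the output.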
Secondly, your question gave us the idea of returning a matrix of von Mises generated data. But since brms already does this job for you, for the moment I would go there.
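On the mapping itself: as.data.frame(radians) is a list of columns, so mapply steps through whole columns rather than single elements. Passing the matrices directly makes mapply walk both element by element in column-major order, and the result should then be reshaped without byrow = TRUE. Below is a sketch under one loud assumption: draw_one is a hypothetical stand-in that uses rnorm so the example runs without Rfast; replace its body with Rfast::rvonmises(1, m, k) for the real thing.

```r
# Hypothetical stand-in for Rfast::rvonmises(1, m, k); swap in the real call.
draw_one <- function(m, k) rnorm(1, mean = m, sd = 1 / sqrt(k))

set.seed(42)
test_radians <- matrix(seq(from = 1, to = 2 * pi, length.out = 12), ncol = 4)
test_kappa   <- matrix(rep(20, times = 12), ncol = 4)

# mapply() pairs element i of test_radians with element i of test_kappa,
# visiting the elements column by column
draws <- mapply(draw_one, m = test_radians, k = test_kappa)

# matrix() also fills column by column, so the original positions are restored
out <- matrix(draws, nrow = nrow(test_radians), ncol = ncol(test_radians))
```

With a deterministic function in place of draw_one, for example function(m, k) m + k, the reshaped result equals test_radians + test_kappa, which is an easy way to check that positions line up.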
I wrote a function to calculate the log-likelihood of a set of observations sampled from a mixture of two normal distributions. This function is not giving me the correct answer.
I will not know which of the two distributions any given sample is from, so the function needs to sum over possibilities.
This function takes a vector of five model parameters as its first argument (μ1, σ1, μ2, σ2 and p) where μi and σi are the mean and standard deviation of the ith distribution and p is the probability a sample is from the first distribution. For the second argument, the function takes a vector of observations.
I have written the following function:
mixloglik <- function(p, v) {
  sum(log(dnorm(v, p[1], p[2])*p[5] + dnorm(v,p[3],p[4]))*p[5])
}
I can create test data, for which I know the solution should be ~ -854.6359:
set.seed(42)
v <- c(rnorm(100), rnorm(200, 8, 2))
p <- c(0, 1, 6, 2, 0.5)
When I test this function on the test data I do not get the correct solution
> mixloglik(p, v)
[1] -356.7194
I know the solution should be ~ -854.6359. Where am I going wrong in my function?
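The correct expression for the log-likelihood is the following. In your version the second density is not weighted by 1 - p[5], and a misplaced parenthesis multiplies the logarithm by p[5].
mixloglik <- function(p, v) {
  sum(log(p[5]*dnorm(v, p[1], p[2]) + (1 - p[5])*dnorm(v, p[3], p[4])))
}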
Now try it:
set.seed(42)
v <- c(rnorm(100), rnorm(200, 8, 2))
p <- c(0, 1, 6, 2, 0.5)
mixloglik(p, v)
#[1] -854.6359
In cases like this, the best way to find the error is to start over: rewrite the expression on paper, then recode it.
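Written out, each observation contributes the log of a weighted sum of two normal densities, with weights p[5] and 1 - p[5]. If observations can fall far from both components, that sum may underflow to zero before the log is taken. A hedged variant that works on the log scale throughout is sketched below; the name mixloglik_stable is my own, and the log-sum-exp trick is a swapped-in technique, not part of the answer above.

```r
mixloglik_stable <- function(p, v) {
  # log of each weighted component density
  la <- log(p[5]) + dnorm(v, p[1], p[2], log = TRUE)
  lb <- log1p(-p[5]) + dnorm(v, p[3], p[4], log = TRUE)
  # log-sum-exp: factor out the larger term before exponentiating
  m <- pmax(la, lb)
  sum(m + log(exp(la - m) + exp(lb - m)))
}

set.seed(42)
v <- c(rnorm(100), rnorm(200, 8, 2))
p <- c(0, 1, 6, 2, 0.5)
stable <- mixloglik_stable(p, v)
```

On the test data from the question this should agree with the direct formula; the difference only matters for extreme observations.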
Consider a matrix A and an array b. I would like to calculate the distance between b and each row of A. For instance consider the following data:
A <- matrix(c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15), 3, 5, byrow=TRUE)
b <- c(1, 2, 3, 4, 5)
I would expect as output some array of the form:
distance_array = c(0, 11.18, 22.36)
where the value 11.18 comes from the Euclidean distance between A[2,] and b:
sqrt(sum((A[2,]-b)^2))
This seems pretty basic, but so far all the R functions I have found compute distance matrices between all pairs of rows of a matrix, not this array-to-matrix calculation.
I would recommend putting the rows of A in a list instead of a matrix, as it might allow for faster processing. But here's how I would do it for your example:
A <- matrix(c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15), 3, 5, byrow=TRUE)
b <- c(1, 2, 3, 4, 5)
apply(A, 1, function(x) sqrt(sum((x - b)^2)))
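If you prefer to avoid the per-row function call, a vectorised alternative is to subtract b from every row with sweep() and then sum the squares by row. This is a sketch of one option, not a claim that it is faster for every input size.

```r
A <- matrix(c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15), 3, 5, byrow = TRUE)
b <- c(1, 2, 3, 4, 5)

# sweep() subtracts b[j] from column j, i.e. b from each row of A
distance_array <- sqrt(rowSums(sweep(A, 2, b)^2))
distance_array  # 0.00000 11.18034 22.36068
```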
I have a joint probability mass function of two variables X,Y like here
How can I calculate the covariance in R?
I created two vectors x,y and fed them into cov(), but I get the wrong result.
How can I do this right?
Thanks in advance and happy coding!
Since SO is a coding forum, I'll leave working out the math/stats details up to you. Here is an implementation in R.
We start by noting the sample spaces of the two variables, which I call G and R here (your X and Y)
# For G
G <- 0:3
# For R
R <- 0:2
The joint probability mass function is given by the following matrix
joint_pmf <- matrix(
  c(4/84, 12/84, 4/84,
    18/84, 24/84, 3/84,
    12/84, 6/84, 0,
    1/84, 0, 0),
  ncol = 3, byrow = TRUE)
We calculate the population means
# For G
mu_G <- rowSums(joint_pmf) %*% G
# For R
mu_R <- colSums(joint_pmf) %*% R
We can make use of the theorem Cov(X, Y) = E[XY] - E[X]E[Y] to calculate the covariance
cov_GR <- G %*% joint_pmf %*% R - mu_G * mu_R
#            [,1]
# [1,] -0.1666667
where we have used the fact that E[G] = mu_G and E[R] = mu_R are the respective population means.
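As for why cov() on two vectors gives the wrong answer: it computes a sample covariance that treats every pair as equally likely, so it never sees the probabilities. Here is a sketch that spells out the weighted sums explicitly and cross-checks them with cov.wt(), which accepts weights. The variable names e_gr, cells, cov_direct and cov_wt are my own.

```r
G <- 0:3
R <- 0:2
joint_pmf <- matrix(c( 4, 12, 4,
                      18, 24, 3,
                      12,  6, 0,
                       1,  0, 0) / 84,
                    ncol = 3, byrow = TRUE)

# E[GR] as a probability-weighted sum over every (g, r) cell
e_gr <- sum(outer(G, R) * joint_pmf)
mu_G <- sum(G * rowSums(joint_pmf))
mu_R <- sum(R * colSums(joint_pmf))
cov_direct <- e_gr - mu_G * mu_R

# Cross-check: one row per cell, weighted by that cell's probability.
# expand.grid() varies G fastest, matching the column-major order of as.vector().
cells <- as.matrix(expand.grid(G = G, R = R))
cov_wt <- cov.wt(cells, wt = as.vector(joint_pmf), method = "ML")$cov["G", "R"]
```

Both values come out as -1/6. The method = "ML" argument matters here: the default "unbiased" method applies a sample correction that is not wanted for a population covariance.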
I am not familiar with applying functions over a vector in R.
I would like a vector of cumulative probabilities of a bivariate normal distribution as several parameters change simultaneously, each according to a different function. For example:
library(mvtnorm)
m<-2
corr<-diag(2)
corr[2,1]<-0
vapply(2*1:3, function(x)
  pmvnorm(mean=c(2,x), corr, lower=c(-Inf,-Inf), upper=c(1,2)), 1)
[1] 7.932763e-02 3.609428e-03 5.024809e-06
This gives the cumulative probability when the mean of the second distribution takes the values 2, 4 and 6.
My problem is that I also want to change the mean of the first distribution at the same time. I can't work out how to write the vapply command with more than one varying argument. What can I do?
Thank you very much
You will need to use mapply for this task, since it iterates over several vectors in parallel:
library(mvtnorm)
corr <- diag(2)
m1 <- c(3, 5, 7)
m2 <- c(2, 4, 6)
mapply(function(x, y)
  pmvnorm(mean = c(x, y), corr = corr, lower = c(-Inf, -Inf), upper = c(1, 2)),
  m1, m2)
## [1] 1.1375e-02 7.2052e-07 3.1246e-14
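Since corr is the identity matrix in this example, the two components are independent and the joint probability factorises into a product of univariate normal probabilities. That gives a quick base-R sanity check of the numbers above; it is only valid for zero correlation, which is an assumption taken from this example and not a general replacement for pmvnorm.

```r
m1 <- c(3, 5, 7)
m2 <- c(2, 4, 6)

# With identity correlation: P(X1 <= 1, X2 <= 2) = P(X1 <= 1) * P(X2 <= 2)
p_indep <- pnorm(1, mean = m1) * pnorm(2, mean = m2)
p_indep  # approximately 1.1375e-02 7.2052e-07 3.1246e-14
```

pnorm is vectorised over mean, so no mapply is needed for this special case.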