I am trying to compute the Mahalanobis distance between each pair of observations of a dataset dat, where each row is an observation and each column is a variable. The distance between observations x_i and x_j is defined as d(x_i, x_j) = sqrt((x_i - x_j)' S^(-1) (x_i - x_j)), where S is the covariance matrix of dat.
I wrote a function that does it, but I feel like it is slow. Is there any better way to compute this in R?
To generate some data to test the function:
generateData <- function(nObs, nVar){
  library(MASS)
  mvrnorm(n = nObs, rep(0, nVar), diag(nVar))
}
This is the function I have written so far. Both methods work, and for my data (800 observations and 90 variables) they take approximately 30 and 33 seconds for method = "forLoop" and method = "apply", respectively.
mhbd_calc2 <- function(dat, method) { # method is either "forLoop" or "apply"
  dat <- as.matrix(na.omit(dat))
  nObs <- nrow(dat)
  mhbd <- matrix(nrow = nObs, ncol = nObs)
  cv_mat_inv <- solve(var(dat))
  distMH <- function(x) { # Mahalanobis distance between rows x[1] and x[2]
    diff <- dat[x[1], ] - dat[x[2], ]
    diff %*% cv_mat_inv %*% diff
  }
  if (method == "forLoop") {
    for (i in 1:nObs) {
      for (j in 1:i) {
        mhbd[i, j] <- distMH(c(i, j))
      }
    }
  }
  if (method == "apply") {
    mhbd[lower.tri(mhbd)] <- apply(combn(nrow(dat), 2), 2, distMH)
  }
  result <- sqrt(mhbd)
  colnames(result) <- rownames(dat)
  rownames(result) <- rownames(dat)
  return(as.dist(result))
}
NB: I tried using outer(), but it was even slower (60 seconds).
You need a bit of mathematical insight here.
Do a Cholesky factorization of the empirical covariance, then standardize your observations; use dist to compute the Euclidean distance on the transformed observations.
This works because if V = LL' is the Cholesky factorization of the covariance, then (x_i - x_j)' V^(-1) (x_i - x_j) = ||L^(-1)(x_i - x_j)||^2, i.e. the squared Euclidean distance between the standardized observations.
dist.maha <- function (dat) {
  X <- as.matrix(na.omit(dat))      ## ensure a valid matrix
  V <- cov(X)                       ## empirical covariance; positive definite
  L <- t(chol(V))                   ## lower triangular factor
  stdX <- t(forwardsolve(L, t(X)))  ## standardization
  dist(stdX)                        ## use `dist`
}
Example
set.seed(0)
x <- matrix(rnorm(6 * 3), 6, 3)
dist.maha(x)
# 1 2 3 4 5
#2 2.362109
#3 1.725084 1.495655
#4 2.959946 2.715641 2.690788
#5 3.044610 1.218184 1.531026 2.717390
#6 2.740958 1.694767 2.877993 2.978265 2.794879
The result agrees with your mhbd_calc2.
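For what it's worth, here is a quick consistency check against the original function (a sketch that reuses the x generated above):

# Both results are "dist" objects with the same pair ordering,
# so the flattened values can be compared directly.
d1 <- dist.maha(x)
d2 <- mhbd_calc2(x, method = "apply")
all.equal(as.numeric(d1), as.numeric(d2))  # should be TRUE up to floating-point error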
I would like to perform a Sobol sensitivity analysis in R.
The package "sensitivity" should allow me to do so, but I don't understand how to generate the sampling matrices (X1, X2). I have a model that runs outside of R, and 6 parameters with uniform distributions.
My textbook gives N = (2k+2)*M, with M = 2^b and b in [8, 12] (the new sampling method of Wu et al. 2012).
My feeling is that I should create two sampling matrices, X1_{M,k} and X2_{M,k}, and feed both to the sobol function.
The final sampling matrix x$X then has dimension (k+2)*M, because:
X <- rbind(X1, X2)
for (i in 1:k) {
  Xb <- X1
  Xb[, i] <- X2[, i]
  X <- rbind(X, Xb)
}
How should I conduct my sampling to get the right number of runs, i.e. (2*k+2)*M?
This script uses the old method; does anyone know whether the new method is already implemented in the sensitivity package? Feel free to comment on this procedure.
name <- c("a", "b", "c", "d", "e", "f")
vals <- list(list(var = "a", dist = "unif", params = list(min = 0.1,    max = 1.5)),
             list(var = "b", dist = "unif", params = list(min = -0.3,   max = 0.4)),
             list(var = "c", dist = "unif", params = list(min = -0.3,   max = 0.3)),
             list(var = "d", dist = "unif", params = list(min = 0,      max = 0.5)),
             list(var = "e", dist = "unif", params = list(min = 2.4E-5, max = 2.4E-3)),
             list(var = "f", dist = "unif", params = list(min = 3E-5,   max = 3E-3)))
k = 6
b = 8
M = 2^b
n <- 2*M
X1 <- makeMCSample(n,vals, p = 1)
X2 <- makeMCSample(n,vals, p = 2)
x <- sobol2007(model = NULL, X1, X2, nboot = 200)
If I understand correctly, I should provide a y for each row of the x$X sampling design, and then I can use the function "tell", which will generate the Sobol' first-order indices as well as the total indices:
tell(x,y)
ggplot(x)
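For completeness, a rough sketch of that last step, assuming a hypothetical wrapper run_my_model() that stands in for the external model (neither the wrapper nor its body is part of the original setup):

run_my_model <- function(params) {
  # Placeholder: in practice, write params out, run the external model,
  # and read its scalar output back into R.
  sum(params)
}
y <- apply(x$X, 1, run_my_model)  # one model evaluation per sampled parameter set
tell(x, y)                        # fills in the first-order and total Sobol' indices
print(x)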
Supplemental R function (makeMCSample)
makeMCSample <- function(n, vals, p = 1) {
  # Packages to generate quasi-random sequences
  # and rearrange the data
  require(randtoolbox)
  require(plyr)
  # Generate a Sobol' sequence; p selects which of the two seeds to use
  if (p == 2) {
    sob <- sobol(n, length(vals), seed = 4321, scrambling = 1)
  } else {
    sob <- sobol(n, length(vals), seed = 1234, scrambling = 1)
  }
  # Fill a matrix with the values, inverted from uniform values
  # to the distributions of choice
  samp <- matrix(rep(0, n*(length(vals)+1)), nrow = n)
  samp[, 1] <- 1:n
  for (i in 1:length(vals)) {
    l <- vals[[i]]
    dist <- l$dist
    params <- l$params
    fname <- paste("q", dist, sep = "")
    samp[, i+1] <- do.call(fname, c(list(p = sob[, i]), params))
  }
  # Convert matrix to data frame and add labels
  samp <- as.data.frame(samp)
  names(samp) <- c("n", laply(vals, function(l) l$var))
  return(samp)
}
Reference: Qiong-Li Wu, Paul-Henry Cournède, Amélie Mathieu (2012), "Efficient computational method for global sensitivity analysis and its application to tree growth modelling".
I'm trying to compute a kind of Gini index using a generated dataset, but I ran into a problem with the last integrate call.
When I try to integrate the function named f1, R says:
Error in integrate(Q, 0, p) : length(upper) == 1 is not TRUE
My code is:
# set up parameters b>a>1 and the number of observations n
n <- 1000
a <- 2
b <- 4
# generate x and y
# where x follows beta distribution
# y = 10x+3
x <- rbeta(n,a,b)
y <- 10*x+3
# the starting point of the integration having problem
Q <- function(q) {
  quantile(y, q)
}
# integrate the function Q from 0 to p
G <- function(p) {
  integrate(Q, 0, p)
}
# compute a function
L <- function(p) {
  numer <- G(p)$value
  dino <- G(1)$value
  numer/dino
}
# the part having problem
d <- 3
f1 <- function(p) {
  ((1-p)^(d-2))*L(p)
}
integrate(f1, 0, 1) # In this integration, the aforementioned error appears
I think the nested integrate calls might be causing the problem, but I have no idea what the exact issue is.
Please help me!
As mentioned by @John Coleman, integrate needs a vectorized function and a suitable subdivisions setting to perform the integration. Even when you have provided a vectorized integrand, it is sometimes tricky to set the subdivisions properly in integrate(..., subdivisions = ).
To address your problem, I recommend integral from the package pracma, where you still need a vectorized integrand (see what I have done to the functions G and L), but there is no need to set subdivisions manually, i.e.,
library(pracma)
# set up parameters b>a>1 and the number of observations n
n <- 1000
a <- 2
b <- 4
# generate x and y
# where x follows beta distribution
# y = 10x+3
x <- rbeta(n,a,b)
y <- 10*x+3
# the starting point of the integration having problem
Q <- function(q) {
  quantile(y, q)
}
# integrate the function Q from 0 to p
G <- function(p) {
  integral(Q, 0, p)
}
# compute a function
L <- function(p) {
  numer <- Vectorize(G)(p)
  dino <- G(1)
  numer/dino
}
# the part having problem
d <- 3
f1 <- function(p) {
  ((1-p)^(d-2))*L(p)
}
res <- integral(f1, 0, 1)
then you will get
> res
[1] 0.1283569
The error you reported is due to the fact that the function passed to integrate must be vectorized, while integrate itself isn't vectorized.
From the help (?integrate):
f must accept a vector of inputs and produce a vector of function
evaluations at those points. The Vectorize function may be helpful to
convert f to this form.
Thus one "fix" is to replace your definition of f1 by:
f1 <- Vectorize(function(p) {
  ((1-p)^(d-2))*L(p)
})
But when I run the resulting code I always get:
Error in integrate(Q, 0, p) : maximum number of subdivisions reached
A solution might be to assemble a large number of quantiles and then smooth it out and use that rather than your Q, although the error here strikes me as odd.
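For instance, one way to realize that idea (my own sketch, not code from the original thread) is to evaluate the quantile function on a fine grid, integrate it cumulatively with the trapezoidal rule, and interpolate the result, so that no nested integrate call is needed:

p_grid <- seq(0, 1, length.out = 2001)
Qp <- quantile(y, p_grid, names = FALSE)
# cumulative trapezoidal integral of the quantile function, approximating G(p)
Gp <- c(0, cumsum((Qp[-1] + Qp[-length(Qp)]) / 2 * diff(p_grid)))
G_approx <- approxfun(p_grid, Gp)
L_approx <- function(p) G_approx(p) / G_approx(1)
f1_approx <- function(p) ((1 - p)^(d - 2)) * L_approx(p)  # d as defined above
integrate(f1_approx, 0, 1)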
How can I count the number of interactions poly will return?
If I have two variables, then the number of interactions poly returns as a function of the degree is given by:
degree <- 2
dim(poly(rnorm(10), rnorm(10), degree = degree))[2]
That is the same as:
(degree^2+3*degree)/2
Is there any way to count the number of interactions depending on the degree and the number of variables (in case I use more than two)?
Math result from combinations
Suppose you have p variables; then the number of interactions associated with degree d is computed by:
fd <- function (p, d) {
  k <- choose(p, d)
  if (d > 1) k <- k + p * sum(choose(p-1, 0:(d-2)))
  return(k)
}
The function poly (actually polym in this case), with p input variables and a degree = D, will construct interactions from degree = 1 up to degree = D. So the following function counts it:
fD <- function (p, D) {
  if (D < 1) return(0)
  component <- sapply(1:D, fd, p = p)
  list(component = component, ncol = sum(component))
}
The entry component gives the number of interactions for each degree from 1 to D, and the entry ncol gives the total number of interactions.
A quick test:
a <- runif(50)
b <- runif(50)
c <- runif(50)
d <- runif(50)
X <- poly(a, b, c, d, degree = 3)
ncol(X)
# 34
fD(4, 3)
# $component
# [1]  4 10 20
#
# $ncol
# [1] 34
How does R do this?
The first few lines of the source code for polym explain how R addresses this problem. expand.grid is first called to enumerate all possible exponent combinations, then rowSums computes the total degree of each combination, and finally a filter retains only the terms with degree between 1 and D, as illustrated below.
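A rough illustration of that logic (a sketch of the idea, not the actual polym source):

z <- expand.grid(rep(list(0:3), 4))  # all exponent combinations for 4 variables, degree 3
s <- rowSums(z)
sum(s > 0 & s <= 3)                  # 34, matching ncol(X) above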
More than three years later I had to work with degree >= 3 polynomials. Unfortunately, @李哲源's solution fails for degrees larger than 3. I could, however, build two solutions:
Expand Grid Solution
This method emulates polym's original behavior, which is not very elegant for our purposes but is a natural benchmark.
expand_grid_solution <- function(nd, degree) {
  z <- do.call(expand.grid, c(rep.int(list(0:degree), nd),
                              KEEP.OUT.ATTRS = FALSE))
  s <- rowSums(z)
  ind <- 0 < s & s <= degree
  z <- z[ind, , drop = FALSE]
  s <- s[ind]
  return(length(s))
}
Combination with repetition solution
combination_with_repetition <- function(n, r) {
  factorial(r+n-1)/(factorial(n-1)*factorial(r))
}

poly_elements <- function(n, d) {
  x <- sapply(1:d, combination_with_repetition, n = n)
  return(sum(x))
}
A quick test:
mapply(expand_grid_solution, c(2,2,2,3,3,3,4), c(2,3,4,2,3,4,4))
#[1] 5 9 14 9 19 34 69
mapply(poly_elements, c(2,2,2,3,3,3,4), c(2,3,4,2,3,4,4))
#[1] 5 9 14 9 19 34 69
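As a small follow-up (my own observation, not part of the original answers): the sum of combinations with repetition has a closed form via the hockey-stick identity, so the count can also be written in one line:

poly_elements_closed <- function(n, d) choose(n + d, d) - 1  # sum of C(n+r-1, r) for r = 1..d
mapply(poly_elements_closed, c(2,2,2,3,3,3,4), c(2,3,4,2,3,4,4))
#[1] 5 9 14 9 19 34 69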
I need to compare two probability matrices to measure how close the chains are, so I would use the resulting p-value of the test.
I tried to use the markovchain R package, more specifically the divergenceTest function. The problem is that the function is not properly implemented. It is based on the test from the book "Statistical Inference Based on Divergence Measures" (page 139). I contacted the package developers, but they have not corrected it yet, so I tried to implement it myself, but I'm having trouble. Could anyone help me find the error?
Parameters: freq_matrix is a frequency matrix used to estimate the probability matrix; hypothetic is the matrix used to compare with the estimated matrix.
divergenceTest3 <- function(freq_matrix, hypothetic) {
  n <- sum(freq_matrix)
  empirical <- freq_matrix
  for (i in 1:length(hypothetic)) {
    empirical[i, ] <- freq_matrix[i, ]/rowSums(freq_matrix)[i]
  }
  M <- nrow(empirical)
  v <- numeric()
  out <- 2 * n / .phi2(1)
  sum <- 0
  c <- 0
  for (i in 1:M) {
    sum2 <- 0
    sum3 <- 0
    for (j in 1:M) {
      if (hypothetic[i, j] > 0) {
        c <- c + 1
      }
      sum2 <- sum2 + hypothetic[i, j] * .phi(empirical[i, j] / hypothetic[i, j])
    }
    v[i] <- rowSums(freq_matrix)[i]
    sum <- sum + ((v[i] / n) * sum2)
  }
  TStat <- out * sum
  pvalue <- 1 - pchisq(TStat, c - M)
  cat("The Divergence test statistic is: ", TStat, " the Chi-Square d.f. are: ", c - M, " the p-value is: ", pvalue, "\n")
  out <- list(statistic = TStat, p.value = pvalue)
  return(out)
}

# phi function for divergence test
.phi <- function(x) {
  out <- x*log(x) - x + 1
  return(out)
}

# another phi function for divergence test
.phi2 <- function(x) {
  out <- 1/x
  return(out)
}
The divergence test has been replaced by the verifyHomogeneity function. It requires an input list of elements that can be coerced to a raw transition matrix (as in createSequenceMatrix). It then tests whether they belong to the same unknown DTMC.
See the example below:
library(markovchain)
myMatr1 <- matrix(c(0.2, .8, .5, .5), byrow = TRUE, nrow = 2)
myMatr2 <- matrix(c(0.5, .5, .4, .6), byrow = TRUE, nrow = 2)
mc1 <- as(myMatr1, "markovchain")
mc2 <- as(myMatr2, "markovchain")
mc1
mc2
sample1 <- rmarkovchain(n = 100, object = mc1)
sample2 <- rmarkovchain(n = 200, object = mc2)
# should reject
verifyHomogeneity(inputList = list(sample1, sample2))
# should accept
sample2 <- rmarkovchain(n = 200, object = mc1)
verifyHomogeneity(inputList = list(sample1, sample2))
I have a time series problem that I could easily work out manually, only it would take quite a long time since I have 4 different AR(2) processes and want to calculate at least 20 lags for each.
What I want to do is use the Yule-Walker equations for rho, as follows:
I have an autoregressive process of second order, AR(2), with phi(1) = 0.6 and phi(2) = 0.4.
I want to calculate the correlation coefficients rho(k) for all lags up to k = 20.
So rho(0) would naturally be 1 and rho(-1) = rho(1). Therefore
rho(1) = phi(1) + phi(2)*rho(1)
rho(k) = phi(1)*rho(k-1) + phi(2)*rho(k-2)
Now I want to solve this in R, but I have no idea how to start, can anyone help me out here?
You can try my program in R.
In an R script:
AR2 <- function(Zt, tetha0, phi1, phi2, nlag)
{
  n <- length(Zt)
  Zbar <- mean(Zt)
  Zt1 <- rep(Zbar, n)
  for (i in 2:n) { Zt1[i] <- Zt[i-1] }  # first lag of Zt
  Zt2 <- rep(Zbar, n)
  for (i in 3:n) { Zt2[i] <- Zt[i-2] }  # second lag of Zt
  Zhat <- tetha0 + phi1*Zt1 + phi2*Zt2
  error <- Zt - Zhat
  ACF(error, nlag)
}
ACF <- function(error, nlag)
{
  n <- length(error)
  rho <- rep(0, nlag)
  for (k in 1:nlag)
  {
    a <- 0
    b <- 0
    for (t in 1:(n-k)) { a <- a + (error[t]*error[t+k]) }
    for (t in 1:n)     { b <- b + (error[t]^2) }
    rho[k] <- a/b
  }
  return(rho)
}
In the R console, suppose you have a Zt series, tetha(0) = 0, phi(1) = 0.6, phi(2) = 0.4, and a number of lags of 20:
AR2(Zt, 0, 0.6, 0.4, 20)
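For the theoretical autocorrelations rho(k) described in the question, one can also implement the Yule-Walker recursion directly. A minimal sketch (the function name is mine):

yw_acf_ar2 <- function(phi1, phi2, nlag) {
  rho <- numeric(nlag + 1)
  rho[1] <- 1                      # rho(0) = 1
  rho[2] <- phi1 / (1 - phi2)      # from rho(1) = phi1 + phi2*rho(1)
  for (k in 3:(nlag + 1)) {
    rho[k] <- phi1 * rho[k - 1] + phi2 * rho[k - 2]
  }
  rho[-1]                          # rho(1), ..., rho(nlag)
}

yw_acf_ar2(0.6, 0.4, 20)
# Note: with phi1 + phi2 = 1 the AR(2) process is not stationary and the
# recursion gives rho(k) = 1 for every k; for stationary coefficients the
# result can be cross-checked against ARMAacf(ar = c(phi1, phi2), lag.max = 20).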