Euclidean distance between data sets in R using rdist from "fields" package

I am using rdist to compute Euclidean distances between a matrix and itself:
> m = matrix(c(1,1,1,2,2,2,3,4,3),nrow=3, ncol=3)
> m
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 1 2 4
[3,] 1 2 3
library(fields)
> rdist(m)
[,1] [,2] [,3]
[1,] 1e-10 1e+00 1e-10
[2,] 1e+00 1e-10 1e+00
[3,] 1e-10 1e+00 1e-10
What confuses me is that I think it should have 0s on the diagonal (surely the distance of a vector to itself is 0?), and for the same reason it should have 0s where it compares the first and third rows. The value that I see instead (1e-10) looks way too big to be numerical noise. What's going wrong?
EDIT: rdist is from the package fields.

First of all, 1e-10 is simply 1*10^-10, which is 0.0000000001, so numerically very close to 0 (it is the result of a square root, so the actual error in the computation is of the order of magnitude 1e-20). Is it "too big"? Well, the library is written in Fortran and is focused on speed, so it is quite acceptable. If you analyze the exact code, you will find out how it is computed:
# fields, Tools for spatial data
# Copyright 2004-2011, Institute for Mathematics Applied Geosciences
# University Corporation for Atmospheric Research
# Licensed under the GPL -- www.gpl.org/licenses/gpl.html
"rdist" <- function(x1, x2) {
if (!is.matrix(x1))
x1 <- as.matrix(x1)
if (missing(x2))
x2 <- x1
if (!is.matrix(x2))
x2 <- as.matrix(x2)
d <- ncol(x1)
n1 <- nrow(x1)
n2 <- nrow(x2)
par <- c(1/2, 0)
temp <- .Fortran("radbas", nd = as.integer(d), x1 = as.double(x1),
n1 = as.integer(n1), x2 = as.double(x2), n2 = as.integer(n2),
par = as.double(par), k = as.double(rep(0, n1 * n2)))$k
return(matrix(temp, ncol = n2, nrow = n1))
}
The exact answer is hidden in the Fortran files (in radfun.f, called from radbas.f), where you can find the line
if( dtemp.lt.1e-20) dtemp =1e-20
which treats small (even 0) values as 1e-20; taking the square root then yields 1e-10. It seems the motivation was to speed up the computation by working with the logarithm of the value (so that square rooting becomes a simple division by 2), which of course is not defined for 0.
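If you simply need exact zeros, one workaround (a sketch, assuming any value at or below 1e-10 returned by rdist can only arise from this 1e-20 floor and never from a genuine distance) is to floor the artifacts yourself, or to fall back on base R's dist(), which returns exact zeros:
library(fields)
d <- rdist(m)
d[d <= 1e-10] <- 0   # assumption: such tiny values are artifacts of the floor
d
## base R computes exact zeros, at some cost in speed:
as.matrix(dist(m))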

Finding all solutions of a non-square linear system with infinitely many solutions

In this question, a way was found to compute a particular solution of a non-square linear system that has infinitely many solutions. This leads to another question:
How to find all the solutions for a non-square linear system with infinitely many solutions, with R? (see below for a possible description of the infinite set of solutions)
Example: the linear system
x+y+z=1
x-y-2z=2
is equivalent to A X = B with:
A=matrix(c(1,1,1,1,-1,-2),2,3,T)
B=matrix(c(1,2),2,1,T)
A
[,1] [,2] [,3]
[1,] 1 1 1
[2,] 1 -1 -2
B
[,1]
[1,] 1
[2,] 2
We can describe the infinite set of solutions with:
x = 3/2 + (1/2) z
y = -1/2 + (-3/2) z
z in R
Thus, R could describe the set of solutions this way:
> solve2(A,B)
$principal
[1] 1 2 # this means that x and y will be described
$free
[1] 3 # this means that the 3rd variable (i.e. z) is free in the set of real numbers
$P
[1] 1.5 -0.5
$Q
[1] 0.5 -1.5
This means that every solution can be created with:
z = 236782 # any value would be ok
solve2(A,B)$P + z * solve2(A,B)$Q # this gives x and y
About the maths: such a decomposition always exists when the linear system has infinitely many solutions, so that part is fine. The question is: is there something to do this in R?
You can solve equations like these using the generalized inverse of A.
library(MASS)
ginv(A) %*% B
# 1.2857143
# 0.1428571
#-0.4285714
A %*% ginv(A) %*% B
# 1
# 2
So, with help from @Bhas,
gen_soln <- function(vec) {
    G <- ginv(A)
    W <- diag(3) - G %*% A  # W projects any vector onto the kernel of A
    G %*% B + W %*% vec     # particular solution plus a kernel component
}
You can now find many solutions by providing a vector of length 3 to the gen_soln function. For example,
one_from_inf <- gen_soln(1:3)
one_from_inf
#[1,] 1.35714286
#[2,] -0.07142857
#[3,] -0.2857142
# Test the solution.
A %*% one_from_inf
# [,1]
#[1,] 1
#[2,] 2
# Using random number generator
A %*% gen_soln(rnorm(3))
# [,1]
#[1,] 1
#[2,] 2
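Why this works (a short check, not part of the original answer): W = diag(3) - G %*% A projects any vector onto the kernel of A, so A %*% W is numerically zero and A %*% gen_soln(vec) always reduces to A %*% G %*% B = B.
max(abs(A %*% (diag(3) - ginv(A) %*% A)))
# ~ 0, up to floating-point rounding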
The general solution to
A*x = b
is
x = x0 + z
where x0 is any solution and z is in the kernel of A
As pointed out above you can find a particular solution x0 by using the generalised inverse. You can also use the SVD to find a basis for the kernel of A:
A = U*S*V'
where U and V are orthogonal and S diagonal, with, say, the last k entries on the diagonal 0 (and the others non-zero).
It follows that the last k columns of V form a basis for the kernel of A, and if we call these z1,...,zk then the solutions of the original equation are
x = x0 + c1*z1 + .. ck*zk
for any real c1..ck
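Here is a minimal sketch of that recipe in R, for the A and B above (assumptions: the tolerance rule for the numerical rank is a common heuristic, not part of the original answer, and A is assumed to have a nontrivial kernel, i.e. r < ncol(A)):
library(MASS)
s <- svd(A, nv = ncol(A))                    # request all right singular vectors
tol <- max(dim(A)) * max(s$d) * .Machine$double.eps
r <- sum(s$d > tol)                          # numerical rank of A
Z <- s$v[, (r + 1):ncol(A), drop = FALSE]    # last columns span the kernel
x0 <- ginv(A) %*% B                          # a particular solution
A %*% (x0 + Z %*% rnorm(ncol(Z)))            # ~ B for any choice of coefficients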

R: How to sample from the geometric space outside a 4D hyper ellipse?

Original title (imprecise): How to find all vectors that satisfy a set of 4-unknown non-linear inequalities?
The question is to find a space C, so that any 4x1 vector x in C satisfies:
t(x) %*% t(Q) %*% Q %*% x > a,
in which Q is a 100 x 4 matrix I already know, and a is a positive constant.
I tried to find the solution from packages such as ellipse, rootSolve, and bvpSolve. But I can't reach a suitable solution.
Any idea or solution will be sincerely appreciated.
Remark: the algorithm below can be used to sample from the surface / manifold of a 4D hyper ellipse, or its inner space, too.
I have changed your question title. It is impossible to list all solutions, although such a space has a simple mathematical representation; at best we can sample from it.
Transformation from Ellipse to Sphere
Here is some mathematics based on Cholesky factorization. Alternatively, consider symmetric Eigen decomposition, and I have a demonstration / comparison between these two at Obtain vertices of the ellipse on an ellipse covariance plot (created by car::ellipse), with nice figures for the geometry.
Let R be the upper triangular Cholesky factor of t(Q) %*% Q, so that t(x) %*% t(Q) %*% Q %*% x = t(x) %*% t(R) %*% R %*% x = t(y) %*% y, with y = R %*% x. Since Q is known, R is known. The following R code gets R:
R <- chol(crossprod(Q))
The constraint t(y) %*% y > a means y lies outside the hypersphere of radius sqrt(a). If we can sample y from such a space, we can then map it back to x by solving a triangular system:
x <- backsolve(R, y)
Sampling of y
We can use n-sphere coordinates to parametrize such a space. For 4D space, the coordinates are (matching the sampling code below):
y1 = r * cos(phi1)
y2 = r * sin(phi1) * cos(phi2)
y3 = r * sin(phi1) * sin(phi2) * cos(phi3)
y4 = r * sin(phi1) * sin(phi2) * sin(phi3)
The following R function samples n y-vectors from such a space. Because of the finite representation of floating point numbers we can't have an infinite radius, only .Machine$double.xmax at best. There is also an optional argument rmax in case we want a more restricted radius.
ry <- function (n, rmin, rmax = NA) {
    if (is.na(rmax)) rmax <- .Machine$double.xmax
    if (rmin > rmax) stop("larger `rmax` expected!")
    r <- runif(n, rmin, rmax)
    phi1 <- runif(n, 0, pi)
    phi2 <- runif(n, 0, 2 * pi)
    phi3 <- runif(n, 0, 2 * pi)
    matrix(c(r * cos(phi1),
             r * sin(phi1) * cos(phi2),
             r * sin(phi1) * sin(phi2) * cos(phi3),
             r * sin(phi1) * sin(phi2) * sin(phi3)),
           nrow = 4L, byrow = TRUE, dimnames = list(paste0("y", 1:4), NULL))
}
Try some examples:
## radius between 4 and 10
set.seed(0); ry(5, 4, 10)
# [,1] [,2] [,3] [,4] [,5]
#y1 7.5594886 -5.31049687 -6.1388372 -3.5991830 -3.728597
#y2 5.1402945 0.47936481 0.4799181 -2.5085948 -6.480402
#y3 0.2614002 -1.68833263 -0.1950092 -5.9975328 -4.213166
#y4 -2.0859078 0.02440839 -0.9452077 0.3052708 3.954674
## radius between 4 and "inf"
set.seed(0); ry(5, 4)
# [,1] [,2] [,3] [,4] [,5]
#y1 1.299100e+308 -4.531902e+307 -6.588856e+307 -4.983772e+307 -6.442420e+307
#y2 8.833607e+307 4.090831e+306 5.150993e+306 -3.473640e+307 -1.119710e+308
#y3 4.492167e+306 -1.440799e+307 -2.093047e+306 -8.304756e+307 -7.279678e+307
#y4 -3.584637e+307 2.082977e+305 -1.014498e+307 4.227070e+306 6.833046e+307
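As a quick sanity check (a sketch; exact values vary with the seed), the radii of the sampled points should all fall inside the requested interval:
set.seed(0)
range(sqrt(colSums(ry(1000, 4, 10) ^ 2)))
# both endpoints lie within [4, 10]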
I have chosen to use each column, rather than each row, as a sample, to make the matrix computations easier later.
Transforming y to x
Now assume we have
set.seed(0); Q <- matrix(runif(10 * 4), 10L, 4L)
We get R
R <- chol(crossprod(Q))
# [,1] [,2] [,3] [,4]
#[1,] 2.176848 1.420882 1.2517326 1.4481875
#[2,] 0.000000 1.077816 0.1045581 0.4646328
#[3,] 0.000000 0.000000 1.2284251 0.3961126
#[4,] 0.000000 0.000000 0.0000000 0.9019771
Suppose you have a = 4, then we map y to x:
a <- 4
set.seed(0); y <- ry(5, sqrt(a), 10) ## we set an `rmax` here
x <- backsolve(R, y) ## each column is a sample of `x`
# [,1] [,2] [,3] [,4] [,5]
#[1,] 0.7403534 -1.49866534 -2.2359350 2.0269516 2.948561
#[2,] 5.5481682 0.41827601 0.7024109 -1.7606720 -7.288339
#[3,] 0.9373905 -1.01984708 0.1430660 -4.4180688 -4.749419
#[4,] -2.2616584 0.01995357 -0.8367956 0.2995693 4.299268
Checking
We can check that the above x does satisfy our requirement.
z <- Q %*% x
ax <- colSums(z ^ 2) ## value of `diag(x'Q'Qx)`
#[1] 84.15453 17.00795 24.77044 43.33361 85.85250
They are all greater than 4.
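An equivalent, more direct check of the quadratic form for each sampled column (it should reproduce the values above):
apply(x, 2, function(xi) c(t(xi) %*% crossprod(Q) %*% xi))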

Generate multivariate normal r.v.'s with rank-deficient covariance via Pivoted Cholesky Factorization

I'm just beating my head against the wall trying to get a Cholesky decomposition to work in order to simulate correlated price movements.
I use the following code:
cormat <- as.matrix(read.csv("http://pastebin.com/raw/qGbkfiyA"))
cormat <- cormat[,2:ncol(cormat)]
rownames(cormat) <- colnames(cormat)
cormat <- apply(cormat,c(1,2),FUN = function(x) as.numeric(x))
chol(cormat)
#Error in chol.default(cormat) :
# the leading minor of order 8 is not positive definite
cholmat <- chol(cormat, pivot=TRUE)
#Warning message:
# In chol.default(cormat, pivot = TRUE) :
# the matrix is either rank-deficient or indefinite
rands <- array(rnorm(ncol(cholmat)), dim = c(10000,ncol(cholmat)))
V <- t(t(cholmat) %*% t(rands))
#Check for similarity
cor(V) - cormat ## Not all zeros!
#Check the standard deviations
apply(V,2,sd) ## Not all ones!
I'm not really sure how to properly use the pivot = TRUE argument to generate my correlated movements. The results look totally bogus.
Even if I have a simple matrix and I try out "pivot" then I get bogus results...
cormat <- matrix(c(1,.95,.90,.95,1,.93,.90,.93,1), ncol=3)
cholmat <- chol(cormat)
# No Error
cholmat2 <- chol(cormat, pivot=TRUE)
# No warning... pivot changes column order
rands <- array(rnorm(ncol(cholmat)), dim = c(10000,ncol(cholmat)))
V <- t(t(cholmat2) %*% t(rands))
#Check for similarity
cor(V) - cormat ## Not all zeros!
#Check the standard deviations
apply(V,2,sd) ## Not all ones!
There are two errors with your code:
You did not use the pivoting index to revert the pivoting applied to the Cholesky factor. Note that pivoted Cholesky factorization of a positive semi-definite matrix A computes:
P'AP = R'R
where P is a column pivoting (permutation) matrix and R is an upper triangular matrix. To recover A from R, we need to apply the inverse of P (i.e., P'):
A = PR'RP' = (RP')'(RP')
A multivariate normal sample with covariance matrix A is then generated by:
X R P'
where X is multivariate normal with zero mean and identity covariance.
Your generation of X
X <- array(rnorm(ncol(R)), dim = c(10000, ncol(R)))
is wrong. First, the number of columns should not be ncol(R) but the numerical rank of A, denoted by r (extracted below from attr(R, "rank")). Second, you are recycling the single vector rnorm(ncol(R)) along columns, so the resulting matrix is not random at all; as a result, cor(X) is never close to an identity matrix. The correct code is:
X <- matrix(rnorm(10000 * r), 10000, r)
As a model implementation of the above theory, consider your toy example:
A <- matrix(c(1,.95,.90,.95,1,.93,.90,.93,1), ncol=3)
We compute the upper triangular factor (suppressing possible rank-deficient warnings) and extract inverse pivoting index and rank:
R <- suppressWarnings(chol(A, pivot = TRUE))
piv <- order(attr(R, "pivot")) ## reverse pivoting index
r <- attr(R, "rank") ## numerical rank
Then we generate X. For a better result we centre X so that its column means are 0.
X <- matrix(rnorm(10000 * r), 10000, r)
## for best effect, we centre `X`
X <- sweep(X, 2L, colMeans(X), "-")
Then we generate target multivariate normal:
## compute `V = RP'`
V <- R[1:r, piv]
## compute `Y = X %*% V`
Y <- X %*% V
We can verify that Y has target covariance A:
cor(Y)
# [,1] [,2] [,3]
#[1,] 1.0000000 0.9509181 0.9009645
#[2,] 0.9509181 1.0000000 0.9299037
#[3,] 0.9009645 0.9299037 1.0000000
A
# [,1] [,2] [,3]
#[1,] 1.00 0.95 0.90
#[2,] 0.95 1.00 0.93
#[3,] 0.90 0.93 1.00
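Putting these steps together, here is a minimal helper (the name rmvn_pivchol is hypothetical, used only for this sketch; it assumes A is positive semi-definite):
rmvn_pivchol <- function(n, A) {
    R <- suppressWarnings(chol(A, pivot = TRUE))
    piv <- order(attr(R, "pivot"))       # reverse pivoting index
    r <- attr(R, "rank")                 # numerical rank
    X <- matrix(rnorm(n * r), n, r)
    X <- sweep(X, 2L, colMeans(X), "-")  # centre the columns
    X %*% R[1:r, piv]                    # Y = X R P'
}
Y <- rmvn_pivchol(10000, A)  # cor(Y) should again be close to A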

Generation of random variables

I have a problem with the generation of random variables in R.
I have to generate random variables $X_{ij}$ ($i = 1, \ldots, 25$; $j = 1, \ldots, 5$), knowing that each $X_{ij}$ follows a binomial distribution, $X_{ij} \sim \mathrm{Bin}(n_{ij}, p_{ij})$, and I already know $n_{ij}$ and $p_{ij}$ for each index. How do I generate these random variables?
I don't know if it is useful, but I generated the $p_{ij}$ knowing that they are themselves random variables following a beta distribution (hence each $X_{ij}$ actually follows a beta-binomial distribution).
Let's say you had the following matrices for n and p:
(n <- matrix(4:7, nrow=2))
# [,1] [,2]
# [1,] 4 6
# [2,] 5 7
set.seed(144)
(p <- matrix(rbeta(4, 1, 2), nrow=2))
# [,1] [,2]
# [1,] 0.1582904 0.2794913
# [2,] 0.5176909 0.2889718
Now you can draw samples X_{ij} with something like:
set.seed(144)
matrix(apply(cbind(as.vector(n), as.vector(p)), 1, function(x) rbinom(1, x[1], x[2])), nrow=2)
# [,1] [,2]
# [1,] 0 2
# [2,] 2 2
The cbind part of this expression builds a 2-column matrix containing each (n, p) pairing, the apply part draws a single binomially distributed sample for each (n, p) pair, and the matrix part converts the resulting vector back to a matrix.
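Since rbinom is vectorized over both its size and prob arguments, an equivalent and more compact way to draw such a matrix is (a sketch, using the n and p defined above):
set.seed(144)
matrix(rbinom(length(n), as.vector(n), as.vector(p)), nrow = nrow(n))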

Mystified by qr.Q(): what is an orthonormal matrix in "compact" form?

R has a qr() function, which performs QR decomposition using either LINPACK or LAPACK (in my experience, the latter is 5% faster). The main object returned is a matrix "qr" that contains the upper triangular matrix R in its upper triangle (i.e. R = qr[upper.tri(qr)]). So far so good. The lower triangular part of qr contains Q "in compact form". One can extract Q from the qr decomposition by using qr.Q(). I would like to find the inverse of qr.Q(). In other words, I have Q and R and would like to put them back into a "qr" object. R is trivial but Q is not. The goal is to then apply qr.solve() to it, which is much faster than solve() on large systems.
Introduction
R uses the LINPACK dqrdc routine, by default, or the LAPACK DGEQP3 routine, when specified, for computing the QR decomposition. Both routines compute the decomposition using Householder reflections. An m x n matrix A is decomposed into an m x n economy-size orthogonal matrix (Q) and an n x n upper triangular matrix (R) as A = QR, where Q can be computed by the product of t Householder reflection matrices, with t being the lesser of m-1 and n: Q = H1H2...Ht.
Each reflection matrix Hi can be represented by a length-(m-i+1) vector. For example, H1 requires a length-m vector for compact storage. All but one entry of this vector is placed in the first column of the lower triangle of the input matrix (the diagonal is used by the R factor). Therefore, each reflection needs one more scalar of storage, and this is provided by an auxiliary vector (called $qraux in the result from R's qr).
The compact representation used is different between the LINPACK and LAPACK routines.
The LINPACK Way
A Householder reflection is computed as Hi = I - vi vi^T / pi, where I is the identity matrix, pi is the corresponding entry in $qraux, and vi is as follows:
vi[1..(i-1)] = 0,
vi[i] = pi,
vi[(i+1)..m] = A[(i+1)..m, i] (i.e., a column of the lower triangle of A after calling qr)
LINPACK Example
Let's work through the example from the QR decomposition article at Wikipedia in R.
The matrix being decomposed is
> A <- matrix(c(12, 6, -4, -51, 167, 24, 4, -68, -41), nrow=3)
> A
[,1] [,2] [,3]
[1,] 12 -51 4
[2,] 6 167 -68
[3,] -4 24 -41
We do the decomposition, and the most relevant portions of the result are shown below:
> Aqr = qr(A)
> Aqr
$qr
[,1] [,2] [,3]
[1,] -14.0000000 -21.0000000 14
[2,] 0.4285714 -175.0000000 70
[3,] -0.2857143 0.1107692 -35
[snip...]
$qraux
[1] 1.857143 1.993846 35.000000
[snip...]
This decomposition was done (under the covers) by computing two Householder reflections and multiplying them by A to get R. We will now recreate the reflections from the information in $qr.
> p = Aqr$qraux # for convenience
> v1 <- matrix(c(p[1], Aqr$qr[2:3,1]))
> v1
[,1]
[1,] 1.8571429
[2,] 0.4285714
[3,] -0.2857143
> v2 <- matrix(c(0, p[2], Aqr$qr[3,2]))
> v2
[,1]
[1,] 0.0000000
[2,] 1.9938462
[3,] 0.1107692
> I = diag(3) # identity matrix
> H1 = I - v1 %*% t(v1)/p[1] # I - v1*v1^T/p[1]
> H2 = I - v2 %*% t(v2)/p[2] # I - v2*v2^T/p[2]
> Q = H1 %*% H2
> Q
[,1] [,2] [,3]
[1,] -0.8571429 0.3942857 0.33142857
[2,] -0.4285714 -0.9028571 -0.03428571
[3,] 0.2857143 -0.1714286 0.94285714
Now let's verify the Q computed above is correct:
> qr.Q(Aqr)
[,1] [,2] [,3]
[1,] -0.8571429 0.3942857 0.33142857
[2,] -0.4285714 -0.9028571 -0.03428571
[3,] 0.2857143 -0.1714286 0.94285714
Looks good! We can also verify QR is equal to A.
> R = qr.R(Aqr) # extract R from Aqr$qr
> Q %*% R
[,1] [,2] [,3]
[1,] 12 -51 4
[2,] 6 167 -68
[3,] -4 24 -41
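As one more sanity check (not in the original answer), Q should be orthogonal, so t(Q) %*% Q should be the identity up to rounding:
> round(t(Q) %*% Q, 10)  # expect the 3 x 3 identity matrix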
The LAPACK Way
A Householder reflection is computed as Hi = I - pi vi vi^T, where I is the identity matrix, pi is the corresponding entry in $qraux, and vi is as follows:
vi[1..(i-1)] = 0,
vi[i] = 1,
vi[(i+1)..m] = A[(i+1)..m, i] (i.e., a column of the lower triangle of A after calling qr)
There is another twist when using the LAPACK routine in R: column pivoting is used, so the decomposition is solving a different, related problem: AP = QR, where P is a permutation matrix.
LAPACK Example
This section does the same example as before.
> A <- matrix(c(12, 6, -4, -51, 167, 24, 4, -68, -41), nrow=3)
> Bqr = qr(A, LAPACK=TRUE)
> Bqr
$qr
[,1] [,2] [,3]
[1,] 176.2554964 -71.1694118 1.668033
[2,] -0.7348557 35.4388886 -2.180855
[3,] -0.1056080 0.6859203 -13.728129
[snip...]
$qraux
[1] 1.289353 1.360094 0.000000
$pivot
[1] 2 3 1
attr(,"useLAPACK")
[1] TRUE
[snip...]
Notice the $pivot field; we will come back to that. Now we generate Q from the information in Bqr.
> p = Bqr$qraux # for convenience
> v1 = matrix(c(1, Bqr$qr[2:3,1]))
> v1
[,1]
[1,] 1.0000000
[2,] -0.7348557
[3,] -0.1056080
> v2 = matrix(c(0, 1, Bqr$qr[3,2]))
> v2
[,1]
[1,] 0.0000000
[2,] 1.0000000
[3,] 0.6859203
> H1 = I - p[1]*v1 %*% t(v1) # I - p[1]*v1*v1^T
> H2 = I - p[2]*v2 %*% t(v2) # I - p[2]*v2*v2^T
> Q = H1 %*% H2
> Q
[,1] [,2] [,3]
[1,] -0.2893527 -0.46821615 -0.8348944
[2,] 0.9474882 -0.01602261 -0.3193891
[3,] 0.1361660 -0.88346868 0.4482655
Once again, the Q computed above agrees with the R-provided Q.
> qr.Q(Bqr)
[,1] [,2] [,3]
[1,] -0.2893527 -0.46821615 -0.8348944
[2,] 0.9474882 -0.01602261 -0.3193891
[3,] 0.1361660 -0.88346868 0.4482655
Finally, let's compute QR.
> R = qr.R(Bqr)
> Q %*% R
[,1] [,2] [,3]
[1,] -51 4 12
[2,] 167 -68 6
[3,] 24 -41 -4
Notice the difference? QR is A with its columns permuted given the order in Bqr$pivot above.
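To recover A itself, one can undo the pivoting by reordering the columns with order() (a small check; the result should equal A up to floating-point rounding):
> (Q %*% R)[, order(Bqr$pivot)]  # same as A, up to rounding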
I have researched this same problem the OP asks about, and I don't think it is possible. Basically the OP's question is whether, given the explicitly computed Q, one can recover H1, H2, ..., Ht. I do not think this is possible without computing the QR from scratch, but I would also be very interested to know whether such a solution exists.
I have a similar issue to the OP's, but in a different context: my iterative algorithm needs to mutate the matrix A by adding columns and/or rows. The first time, the QR is computed using DGEQRF, and thus stored in the compact LAPACK format. After the matrix A is mutated, e.g. with new rows, I can quickly build a new set of reflectors or rotators that annihilate the non-zero elements of the lowest diagonal of my existing R and build a new R. But now I have a set of H1_old H2_old ... Hn_old and H1_new H2_new ... Hn_new (and similarly for the taus) which cannot be merged into a single compact QR representation. The two possibilities I see (and maybe the OP faces the same two) are:
Always maintain Q and R explicitly separated, whether computed the first time or after every update, at the cost of extra flops but keeping the required memory well bounded.
Stick to the compact LAPACK format, but then every time a new update comes in, keep a list of all these mini sets of update reflectors. At the point of solving the system one would compute a big Q'*c, i.e. H1_u3*H2_u3*...*Hn_u3*H1_u2*H2_u2*...*Hn_u2*H1_u1*H2_u1*...*Hn_u1*H1*H2*...*Hn*c, where ui is the QR update number. This is potentially a lot of multiplications to do and memory to keep track of, but definitely the fastest way.
The long answer from David basically explains what the compact QR format is, but not how to get to this compact QR format having the explicitly computed Q and R as input.
