I am trying to understand the function stats::mahalanobis. Here is its source, but please just focus on last line, or more specifically, the rowSums(x %*% cov * x) part.
> mahalanobis
function (x, center, cov, inverted = FALSE, ...)
{
x <- if (is.vector(x))
matrix(x, ncol = length(x))
else as.matrix(x)
if (!isFALSE(center))
x <- sweep(x, 2L, center)
if (!inverted)
cov <- solve(cov, ...)
setNames(rowSums(x %*% cov * x), rownames(x))
}
Here x is a n-by-p matrix, whereas cov is a p-by-p matrix. Their content doesn't matter for the purpose of this question.
According to the document, mahalanobis calculates the squared Mahalanobis distance of all rows in x. I took this as a hint and found a counterpart of rowSums(X %*% C * X) with apply. (it's perfectly fine if you have no idea what I'm talking about; this paragraph merely serves as an explanation of how I came up with the apply form)
> X = matrix(rnorm(1:6), nrow = 3)
> C = matrix(rnorm(1:4), nrow = 2)
> rowSums(X %*% C * X)
[1] -0.03377298 0.49306538 -0.16615078
> apply(X, 1, function(row) {
+ t(row) %*% C %*% row
+ })
[1] -0.03377298 0.49306538 -0.16615078
Now the question becomes why are they equivalent? I suppose one needs to do some clever matrix partition to understand the rationale behind the equivalence, but I'm not enlightened enough to see it.
Just like instead of
sapply(1:5, `*`, 2)
# [1] 2 4 6 8 10
or a loop we prefer
1:5 * 2
# [1] 2 4 6 8 10
as it's a vectorized solution doing exactly the same - element-wise multiplication,
rowSums(X %*% C * X)
# [1] 0.2484329 0.5583787 0.2303054
can be seen to be better than
apply(X, 1, function(row) t(row) %*% C %*% row)
# [1] 0.2484329 0.5583787 0.2303054
with both of them again doing exactly the same, just with the former being more concise.
In particular, in my first example we went from scalars to vectors, and now we go from vectors to matrices. First,
X %*% C
# [,1] [,2]
# [1,] 0.7611212 0.6519212
# [2,] -0.4293461 0.6905117
# [3,] 1.2917590 -1.2970376
corresponds to
apply(X, 1, function(row) t(row) %*% C)
# [,1] [,2] [,3]
# [1,] 0.7611212 -0.4293461 1.291759
# [2,] 0.6519212 0.6905117 -1.297038
Now the second product in t(row) %*% C %*% row does two things: 1) element-wise multiplication of t(row) %*% C and row, 2) summing. In the same way, * in X %*% C * X does 1) and rowSums does the summing, 2).
So, in this case there are no significant tricks of changing the order of operations, partitioning, or anything else; it's just taking advantage of existing matrix operations that repeat the same actions with each row/column for us.
Extra:
library(microbenchmark)
microbenchmark(rowSums = rowSums(X %*% C * X),
apply = apply(X, 1, function(row) t(row) %*% C %*% row),
times = 100000)
# Unit: microseconds
# expr min lq mean median uq max neval cld
# rowSums 3.565 4.488 5.995129 5.117 5.589 4940.691 1e+05 a
# apply 24.126 26.402 32.539559 27.191 28.615 129234.613 1e+05 b
If A and B are any two conformable matrices and a and b are any two vectors of the same length we will use these facts. The first says the first row of A * B equals the first row of A times the first row of B. The second says that the the first row of A %*% B equals the first row of A all times B. The third says matrix multiplication of two vectors can be expressed as the sum of multiplying them elementwise.
(A * B)[i, ] = A[i, ] * B[i, ] by the defintion of elementwise multiplication [1]
(A %*% B)[i, ] = A[i, ] %*% B as taking ith row is same as premultplying by ei [2]
a %*% b = sum(a * b) by definition of %*% [3]
Thus we get:
rowSums(X %*% C * X)[i]
= sum((X %*% C * X)[i, ])
= sum((X %*% C)[i, ] * X[i, ]) by [1]
= (X %*% C)[i, ] %*% X[i, ] by [3]
= X[i, ] %*% C %*% X[i, ] by [2]
= apply(X, 1, function(row) t(row) %*% C %*% row)[i]
Related
I am starting to learn R and trying to multiply a 5X3 matrix with a 3X1 column vector in R; However while creating a new variable to perform the operation, R throws the error "non-conformable arrays". Can someone please point out my mistake in the code below -
*#5X3 Matrix*
X <- matrix(c(1,25.5,1.23,1,40.8,1.89,1,30.2,1.55,1,4.3,1.18,1,10.7,1.68),nrow=5,ncol=3,byrow=TRUE)
*3X1 Column vector*
b1 <- matrix(c(23,0.1,-8), nrow = 3, ncol = 1, byrow = TRUE)
v1 <- X * b1
v1
Appreciate your help :)
You need the matrix-multiplication operator %*%:
X <- matrix(c(1,25.5,1.23,1,40.8,1.89,1,30.2,1.55,1,4.3,1.18,1,10.7,1.68),nrow=5,ncol=3,byrow=TRUE)
b1 <- matrix(c(23,0.1,-8), nrow = 3, ncol = 1, byrow = TRUE)
v1 <- X %*% b1
v1
#> [,1]
#> [1,] 15.71
#> [2,] 11.96
#> [3,] 13.62
#> [4,] 13.99
#> [5,] 10.63
Normally one would use the first alternative below but the others are possible too. The first four alternatives below give a column vector as the result while the others give a plain vector without dimensions. The first three work even if b1 has more than one column. The remainder assume b1 has one column but could be generalized.
X %*% b1
crossprod(t(X), b1)
library(einsum)
einsum("ij,jk -> ik", X, b1)
out <- matrix(0, nrow(X), ncol(b1))
for(i in 1:nrow(X)) {
for(k in 1:ncol(X)) out[i] <- out[i] + X[i, k] * b1[k, 1]
}
out
colSums(t(X) * c(b1))
apply(X, 1, crossprod, b1)
sapply(1:nrow(X), function(i) sum(X[i, ] * b1))
rowSums(mapply(`*`, as.data.frame(X), b1))
rowSums(sapply(1:ncol(X), function(j) X[, j] * b1[j]))
X[, 1] * b1[1, 1] + X[, 2] * b1[2, 1] + X[, 3] * b1[3, 1]
Note
The input shown in the question is:
X <- matrix(c(1,25.5,1.23,1,40.8,1.89,1,30.2,1.55,1,4.3,1.18,1,10.7,1.68),nrow=5,ncol=3,byrow=TRUE)
b1 <- matrix(c(23,0.1,-8), nrow = 3, ncol = 1, byrow = TRUE)
Assume a correlation matrix P with diagonal of zero. I want to determine the order n where the sum of all the correlation matrices orders would converge i.e. diag(3)+ P + P%^%2 + P%^%3 + ... + P%^%n would converge meaning the L1 norm drops below a tol. I looked into How to find when a matrix converges with a loop but this doens't do it for me, since it doesn't keep the orders, nor it sums them up. I can do it in a really lengthy and lousy way with for loops and all but I don't want to, since I have a big df with many time windows, so I'm looking for something efficient. Thanks!
P <- matrix(c(0,0.1,0.8,0.1,0,-0.7,0.8,-0.7,0), nrow = 3, ncol = 3, byrow = TRUE)
Some notes: The %^% operator is from expm package. To sum the matrices I used matrix(mapply(sum, diag(3), P, P%^%2, P%^%3, MoreArgs=list(na.rm=T)), ncol=3).
x %^% n computes the nth power of x efficiently, but it is inefficient to compute x %^% i for all i from 0 to n, because each x %^% i requires O(log(i)) matrix multiplications.
In general, the most efficient way to compute all of the powers of x up to the nth is recursive multiplication by x, possibly taking advantage of the diagonalizability of x.
The difference is nontrivial for large n: whereas
x2 <- x %^% 2
x3 <- x %^% 3
x4 <- x %^% 4
## and so on
requires O(log(n!)) = O(n * log(n)) matrix multiplications,
x2 <- x %*% x
x3 <- x2 %*% x
x4 <- x3 %*% x
## and so on
requires just O(n).
Here is a function that recursively computes the powers of a matrix x and their sum until it encounters a power whose 1-norm is less than tol. It begins by checking that the spectral radius of x is less than 1, which is a necessary and sufficient condition for convergence of the norm of x %^% n to 0 and thus a necessary condition for convergence of the power series. It does not attempt to diagonalize x, which would simplify computation of the power series but complicate computation of norms.
f <- function(x, tol = 1e-06, nmax = 1e+03) {
stopifnot(max(abs(eigen(x, only.values = TRUE)$values)) < 1)
pow <- sum <- diag(nrow(x))
nrm <- rep.int(NA_real_, nmax + 1)
i <- 1
while ((nrm[i] <- norm(pow, "1")) >= tol && i <= nmax) {
pow <- pow %*% x
sum <- sum + pow
i <- i + 1
}
list(x = x, tol = tol, nmax = nmax, n = i - 1, sum = sum,
norm = nrm[seq_len(i)], converged = nrm[i] < tol)
}
Your matrix P has spectral radius greater than 1, hence:
P <- matrix(c(0, 0.1, 0.8, 0.1, 0, -0.7, 0.8, -0.7, 0), 3L, 3L, byrow = TRUE)
f(P)
Error in f(P) :
max(abs(eigen(x, only.values = TRUE)$values)) < 1 is not TRUE
We can always construct a matrix P whose spectral radius is less than 1, for the purpose of testing f:
set.seed(1L)
m <- 3L
V <- matrix(rnorm(m * m), m, m)
D <- diag(runif(m, -0.9, 0.9))
P <- V %*% D %*% solve(V)
all.equal(sort(eigen(P)$values), sort(diag(D))) # [1] TRUE
(fP <- f(P))
$x
[,1] [,2] [,3]
[1,] 0.26445172 0.5317116 -0.2432849
[2,] 0.04932194 0.6332122 0.1496390
[3,] -0.31174920 0.6847937 0.1682702
$tol
[1] 1e-06
$nmax
[1] 1000
$n
[1] 60
$sum
[,1] [,2] [,3]
[1,] 1.53006915 2.081717 -0.07302465
[2,] -0.04249899 4.047528 0.74063387
[3,] -0.60849191 2.552208 1.83947562
$norm
[1] 1.000000e+00 1.849717e+00 1.223442e+00 1.008928e+00 7.799426e-01
[6] 6.131516e-01 4.795602e-01 3.754905e-01 2.938577e-01 2.299751e-01
[11] 1.799651e-01 1.408263e-01 1.101966e-01 8.622768e-02 6.747162e-02
[16] 5.279503e-02 4.131077e-02 3.232455e-02 2.529304e-02 1.979107e-02
[21] 1.548592e-02 1.211727e-02 9.481396e-03 7.418905e-03 5.805067e-03
[26] 4.542288e-03 3.554202e-03 2.781054e-03 2.176090e-03 1.702724e-03
[31] 1.332329e-03 1.042507e-03 8.157298e-04 6.382837e-04 4.994374e-04
[36] 3.907945e-04 3.057848e-04 2.392672e-04 1.872193e-04 1.464934e-04
[41] 1.146266e-04 8.969179e-05 7.018108e-05 5.491455e-05 4.296896e-05
[46] 3.362189e-05 2.630810e-05 2.058529e-05 1.610736e-05 1.260351e-05
[51] 9.861865e-06 7.716607e-06 6.038009e-06 4.724558e-06 3.696822e-06
[56] 2.892650e-06 2.263410e-06 1.771049e-06 1.385792e-06 1.084340e-06
[61] 8.484627e-07
$converged
[1] TRUE
Hence convergence is attained at n = 60. You can check that the reported sum is correct by comparing against the directly (but inefficiently) calculated value:
library("expm")
all.equal(Reduce(`+`, lapply(0:fP$n, function(i) P %^% i)), fP$sum) # [1] TRUE
And just for fun:
plot(0:fP$n, fP$norm)
Suppose I have a positive semi-definite matrix S and I would like to get the inverse of square root of this matrix, that is S^(-1/2).
May I do like this?
ei <- eigen(S)
V <- ei$vectors
res <- V %*% diag(1 / sqrt(ei$values)) %*% t(V)
Is res equal to S^(-1/2)?
I just do inverse of square root for eigenvalue of S, is this correct?
I know that: if one wants to get S^(1/2), then res <- V %*% diag(sqrt(ei$values)) %*% t(V), which is res = S^(1/2).
How about for S^(-1/2)?
Thanks.
Yes. We can easily take an example S and check that S times res times res is the identity matrix:
set.seed(123)
S <- crossprod(matrix(rnorm(9), 3))
ei <- eigen(S)
V <- ei$vectors
res <- V %*% diag(1 / sqrt(ei$values)) %*% t(V)
S %*% res %*% res
## [,1] [,2] [,3]
## [1,] 1.0000e+00 -2.3731e-15 -1.6653e-16
## [2,] 3.3346e-15 1.0000e+00 -6.6613e-16
## [3,] -1.0235e-16 8.3267e-16 1.0000e+00
Also note that S and res commute, i.e. all.equal(S %*% res, res %*% S) is TRUE, so the 3 factors in the last line of code could have been written in any order.
If, as per the OP's enquiry, the input matrix is positive semi-definite (and not positive definite), its eigenvalues will not all be positive, and the method given in the answer above will fail, since
diag(1 / sqrt(ei$values))
will not be a finite matrix.
I tried norm, but I think it gives the wrong result. (the norm of c(1, 2, 3) is sqrt(1*1+2*2+3*3), but it returns 6..
x1 <- 1:3
norm(x1)
# Error in norm(x1) : 'A' must be a numeric matrix
norm(as.matrix(x1))
# [1] 6
as.matrix(x1)
# [,1]
# [1,] 1
# [2,] 2
# [3,] 3
norm(as.matrix(x1))
# [1] 6
Does anyone know what's the function to calculate the norm of a vector in R?
norm(c(1,1), type="2") # 1.414214
norm(c(1, 1, 1), type="2") # 1.732051
This is a trivial function to write yourself:
norm_vec <- function(x) sqrt(sum(x^2))
I was surprised that nobody had tried profiling the results for the above suggested methods, so I did that. I've used a random uniform function to generate a list and used that for repetition (Just a simple back of the envelop type of benchmark):
> uut <- lapply(1:100000, function(x) {runif(1000, min=-10^10, max=10^10)})
> norm_vec <- function(x) sqrt(sum(x^2))
> norm_vec2 <- function(x){sqrt(crossprod(x))}
>
> system.time(lapply(uut, norm_vec))
user system elapsed
0.58 0.00 0.58
> system.time(lapply(uut, norm_vec2))
user system elapsed
0.35 0.00 0.34
> system.time(lapply(uut, norm, type="2"))
user system elapsed
6.75 0.00 6.78
> system.time(lapply(lapply(uut, as.matrix), norm))
user system elapsed
2.70 0.00 2.73
It seems that taking the power and then sqrt manually is faster than the builtin norm for real values vectors at least. This is probably because norm internally does an SVD:
> norm
function (x, type = c("O", "I", "F", "M", "2"))
{
if (identical("2", type)) {
svd(x, nu = 0L, nv = 0L)$d[1L]
}
else .Internal(La_dlange(x, type))
}
and the SVD function internally converts the vector into a matrix, and does more complicated stuff:
> svd
function (x, nu = min(n, p), nv = min(n, p), LINPACK = FALSE)
{
x <- as.matrix(x)
...
EDIT (20 Oct 2019):
There have been some comments to point out the correctness issue which the above test case doesn't bring out:
> norm_vec(c(10^155))
[1] Inf
> norm(c(10^155), type="2")
[1] 1e+155
This happens because large numbers are considered as infinity in R:
> 10^309
[1] Inf
So, it looks like:
It seems that taking the power and then sqrt manually is faster than the builtin norm for real values vectors for small numbers.
How small? So that the sum of squares doesn't overflow.
norm(x, type = c("O", "I", "F", "M", "2"))
The default is "O".
"O", "o" or "1" specifies the one norm, (maximum absolute column sum);
"F" or "f" specifies the Frobenius norm (the Euclidean norm of x treated as if it were a vector);
norm(as.matrix(x1),"o")
The result is 6, same as norm(as.matrix(x1))
norm(as.matrix(x1),"f")
The result is sqrt(1*1+2*2+3*3)
So, norm(as.matrix(x1),"f") is answer.
We can also find the norm as :
Result<-sum(abs(x)^2)^(1/2)
OR Even You can also try as:
Result<-sqrt(t(x)%*%x)
Both will give the same answer
I'mma throw this out there too as an equivalent R expression
norm_vec(x) <- function(x){sqrt(crossprod(x))}
Don't confuse R's crossprod with a similarly named vector/cross product. That naming is known to cause confusion especially for those with a physics/mechanics background.
Answer for Euclidean length of a vector (k-norm) with scaling to avoid destructive underflow and overflow is
norm <- function(x, k) { max(abs(x))*(sum((abs(x)/max(abs(x)))^k))^(1/k) }
See below for explanation.
1. Euclidean length of a vector with no scaling:
norm() is a vector-valued function which computes the length of the vector. It takes two arguments such as the vector x of class matrix and the type of norm k of class integer.
norm <- function(x, k) {
# x = matrix with column vector and with dimensions mx1 or mxn
# k = type of norm with integer from 1 to +Inf
stopifnot(k >= 1) # check for the integer value of k greater than 0
stopifnot(length(k) == 1) # check for length of k to be 1. The variable k is not vectorized.
if(k == Inf) {
# infinity norm
return(apply(x, 2, function(vec) max(abs(vec)) ))
} else {
# k-norm
return(apply(x, 2, function(vec) (sum((abs(vec))^k))^(1/k) ))
}
}
x <- matrix(c(1,-2,3,-4)) # column matrix
sapply(c(1:4, Inf), function(k) norm(x = x, k = k))
# [1] 10.000000 5.477226 4.641589 4.337613 4.000000
1-norm (10.0) converges to infinity-norm (4.0).
k-norm is also called as "Euclidean norm in Euclidean n-dimensional space".
Note:
In the norm() function definition, for vectors with real components, the absolute values can be dropped in norm-2k or even indexed norms, where k >= 1.
If you are confused with the norm function definition, you can read each one individually as given below.
norm_1 <- function(x) sum(abs(x))
norm_2 <- function(x) (sum((abs(x))^2))^(1/2)
norm_3 <- function(x) (sum((abs(x))^3))^(1/3)
norm_4 <- function(x) (sum((abs(x))^4))^(1/4)
norm_k <- function(x) (sum((abs(x))^k))^(1/k)
norm_inf <- max(abs(x))
2. Euclidean length of a vector with scaling to avoid destructive overflow and underflow issues:
Note-2:
The only problem with this solution norm() is that it does not guard against overflow or underflow problems as alluded here and here.
Fortunately, someone had already solved this problem for 2-norm (euclidean length) in the blas (basic linear algebra subroutines) fortran library. A description of this problem can be found in the textbook of "Numerical Methods and Software by Kahaner, Moler and Nash" - Chapter-1, Section 1.3, page - 7-9.
The name of the fortran subroutine is dnrm2.f, which handles destructive overflow and underflow issues in the norm() by scaling with the maximum of the vector components. The destructive overflow and underflow problem arise due to radical operation in the norm() function.
I will show how to implement dnrm2.f in R below.
#1. find the maximum among components of vector-x
max_x <- max(x)
#2. scale or divide the components of vector by max_x
scaled_x <- x/max_x
#3. take square of the scaled vector-x
sq_scaled_x <- (scaled_x)^2
#4. sum the square of scaled vector-x
sum_sq_scaled_x <- sum(sq_scaled_x)
#5. take square root of sum_sq_scaled_x
rt_sum_sq_scaled_x <- sqrt(sum_sq_scaled_x)
#6. multiply the maximum of vector x with rt_sum_sq_scaled_x
max_x*rt_sum_sq_scaled_x
one-liner of the above 6-steps of dnrm2.f in R is:
# Euclidean length of vector - 2norm
max(x)*sqrt(sum((x/max(x))^2))
Lets try example vectors to compute 2-norm (see other solutions in this thread) for this problem.
x = c(-8e+299, -6e+299, 5e+299, -8e+298, -5e+299)
max(x)*sqrt(sum((x/max(x))^2))
# [1] 1.227355e+300
x <- (c(1,-2,3,-4))
max(x)*sqrt(sum((x/max(x))^2))
# [1] 5.477226
Therefore, the recommended way to implement a generalized solution for k-norm in R is that single line, which guard against the destructive overflow or underflow problems. To improve this one-liner, you can use a combination of norm() without scaling for a vector containing not-too-small or not-too-large components and knorm() with scaling for a vector with too-small or too-large components. Implementing scaling for all vectors results in too many calculations. I did not implement this improvement in knorm() given below.
# one-liner for k-norm - generalized form for all norms including infinity-norm:
max(abs(x))*(sum((abs(x)/max(abs(x)))^k))^(1/k)
# knorm() function using the above one-liner.
knorm <- function(x, k) {
# x = matrix with column vector and with dimensions mx1 or mxn
# k = type of norm with integer from 1 to +Inf
stopifnot(k >= 1) # check for the integer value of k greater than 0
stopifnot(length(k) == 1) # check for length of k to be 1. The variable k is not vectorized.
# covert elements of matrix to its absolute values
x <- abs(x)
if(k == Inf) { # infinity-norm
return(apply(x, 2, function(vec) max(vec)))
} else { # k-norm
return(apply(x, 2, function(vec) {
max_vec <- max(vec)
return(max_vec*(sum((vec/max_vec)^k))^(1/k))
}))
}
}
# 2-norm
x <- matrix(c(-8e+299, -6e+299, 5e+299, -8e+298, -5e+299))
sapply(2, function(k) knorm(x = x, k = k))
# [1] 1.227355e+300
# 1-norm, 2-norm, 3-norm, 4-norm, and infinity-norm
sapply(c(1:4, Inf), function(k) knorm(x = x, k = k))
# [1] 2.480000e+300 1.227355e+300 9.927854e+299 9.027789e+299 8.000000e+299
x <- matrix(c(1,-2,3,-4))
sapply(c(1:4, Inf), function(k) knorm(x = x, k = k))
# [1] 10.000000 5.477226 4.641589 4.337613 4.000000
x <- matrix(c(1,-2,3,-4, 0, -8e+299, -6e+299, 5e+299, -8e+298, -5e+299), nc = 2)
sapply(c(1:4, Inf), function(k) knorm(x = x, k = k))
# [,1] [,2] [,3] [,4] [,5]
# [1,] 1.00e+01 5.477226e+00 4.641589e+00 4.337613e+00 4e+00
# [2,] 2.48e+300 1.227355e+300 9.927854e+299 9.027789e+299 8e+299
If you have a data.frame or a data.table 'DT', and want to compute the Euclidian norm (norm 2) across each row, the apply function can be used.
apply(X = DT, MARGIN = 1, FUN = norm, '2')
Example:
>DT
accx accy accz
1: 9.576807 -0.1629486 -0.2587167
2: 9.576807 -0.1722938 -0.2681506
3: 9.576807 -0.1634264 -0.2681506
4: 9.576807 -0.1545590 -0.2681506
5: 9.576807 -0.1621254 -0.2681506
6: 9.576807 -0.1723825 -0.2682434
7: 9.576807 -0.1723825 -0.2728810
8: 9.576807 -0.1723825 -0.2775187
> apply(X = DT, MARGIN = 1, FUN = norm, '2')
[1] 9.581687 9.582109 9.581954 9.581807 9.581932 9.582114 9.582245 9.582378
Following AbdealiJK's answer,
I experimented further to gain some insight.
Here's one.
x = c(-8e+299, -6e+299, 5e+299, -8e+298, -5e+299)
sqrt(sum(x^2))
norm(x, type='2')
The first result is Inf and the second one is 1.227355e+300 which is quite correct as I show you in the code below.
library(Rmpfr)
y <- mpfr(x, 120)
sqrt(sum(y*y))
The result is 1227354879.... I didn't count the number of trailing numbers but it looks all right. I know there another way around this OVERFLOW problem which is first applying log function to all numbers and summing up, which I do not have time to implement!
Create your matrix as column vise using cbind then the norm function works well with Frobenius norm (the Euclidean norm) as an argument.
x1<-cbind(1:3)
norm(x1,"f")
[1] 3.741657
sqrt(1*1+2*2+3*3)
[1] 3.741657
I have a 2396x34 double matrix named y wherein each row (2396) represents a separate situation consisting of 34 consecutive time segments.
I also have a numeric[34] named x that represents a single situation of 34 consecutive time segments.
Currently I am calculating the correlation between each row in y and x like this:
crs[,2] <- cor(t(y),x)
What I need now is to replace the cor function in the above statement with a weighted correlation. The weight vector xy.wt is 34 elements long so that a different weight can be assigned to each of the 34 consecutive time segments.
I found the Weighted Covariance Matrix function cov.wt and thought that if I first scale the data it should work just like the cor function. In fact you can specify for the function to return a correlation matrix as well. Unfortunately it does not seem like I can use it in the same manner because I cannot supply my two variables (x and y) separately.
Does anyone know of a way I can get a weighted correlation in the manner I described without sacrificing much speed?
Edit: Perhaps some mathematical function could be applied to y prior to the cor function in order to get the same results that I'm looking for. Maybe if I multiply each element by xy.wt/sum(xy.wt)?
Edit #2 I found another function corr in the boot package.
corr(d, w = rep(1, nrow(d))/nrow(d))
d
A matrix with two columns corresponding to the two variables whose correlation we wish to calculate.
w
A vector of weights to be applied to each pair of observations. The default is equal weights for each pair. Normalization takes place within the function so sum(w) need not equal 1.
This also is not what I need but it is closer.
Edit #3
Here is some code to generate the type of data I am working with:
x<-cumsum(rnorm(34))
y<- t(sapply(1:2396,function(u) cumsum(rnorm(34))))
xy.wt<-1/(34:1)
crs<-cor(t(y),x) #this works but I want to use xy.wt as weight
Unfortunately the accepted answer is wrong when y is a matrix of more than one row. The error is in the line
vy <- rowSums( w * y * y )
We want to multiply the columns of y by w, but this will multiply the rows by the elements of w, recycled as necessary. Thus
> f(x, y[1, , drop = FALSE], xy.wt)
[1] 0.103021
is correct, because in this case the multiplication is performed element-wise, which is equivalent to column-wise multiplication here, but
> f(x, y, xy.wt)[1]
[1] 0.05463575
gives a wrong answer due to the row-wise multiplication.
We can correct the function as follows
f2 <- function( x, y, w = rep(1,length(x))) {
stopifnot(length(x) == dim(y)[2] )
w <- w / sum(w)
# Center x and y, using the weighted means
x <- x - sum(x * w)
ty <- t(y - colSums(t(y) * w))
# Compute the variance
vx <- sum(w * x * x)
vy <- colSums(w * ty * ty)
# Compute the covariance
vxy <- colSums(ty * x * w)
# Compute the correlation
vxy / sqrt(vx * vy)
}
and check the results against those produced by corr from the boot package:
> res1 <- f2(x, y, xy.wt)
> res2 <- sapply(1:nrow(y),
+ function(i, x, y, w) corr(cbind(x, y[i,]), w = w),
+ x = x, y = y, w = xy.wt)
> all.equal(res1, res2)
[1] TRUE
which in itself gives another way that this problem could be solved.
You can go back to the definition of the correlation.
f <- function( x, y, w = rep(1,length(x))) {
stopifnot( length(x) == dim(y)[2] )
w <- w / sum(w)
# Center x and y, using the weighted means
x <- x - sum(x*w)
y <- y - apply( t(y) * w, 2, sum )
# Compute the variance
vx <- sum( w * x * x )
vy <- rowSums( w * y * y ) # Incorrect: see Heather's remark, in the other answer
# Compute the covariance
vxy <- colSums( t(y) * x * w )
# Compute the correlation
vxy / sqrt(vx * vy)
}
f(x,y)[1]
cor(x,y[1,]) # Identical
f(x, y, xy.wt)
Here is a generalization to compute the weighted Pearson correlation between two matrices (instead of a vector and a matrix, as in the original question):
matrix.corr <- function (a, b, w = rep(1, nrow(a))/nrow(a))
{
# normalize weights
w <- w / sum(w)
# center matrices
a <- sweep(a, 2, colSums(a * w))
b <- sweep(b, 2, colSums(b * w))
# compute weighted correlation
t(w*a) %*% b / sqrt( colSums(w * a**2) %*% t(colSums(w * b**2)) )
}
Using the above example and the correlation function from Heather, we can verify it:
> sum(matrix.corr(as.matrix(x, nrow=34),t(y),xy.wt) - f2(x,y,xy.wt))
[1] 1.537507e-15
In terms of calling syntax, this resembles the unweighted cor:
> a <- matrix( c(1,2,3,1,3,2), nrow=3)
> b <- matrix( c(2,3,1,1,7,3,5,2,8,1,10,12), nrow=3)
> matrix.corr(a,b)
[,1] [,2] [,3] [,4]
[1,] -0.5 0.3273268 0.5 0.9386522
[2,] 0.5 0.9819805 -0.5 0.7679882
> cor(a, b)
[,1] [,2] [,3] [,4]
[1,] -0.5 0.3273268 0.5 0.9386522
[2,] 0.5 0.9819805 -0.5 0.7679882