Cross product of two matrices with missing values in R

I would like to vectorize a distance matrix calculation by using the Law of Cosines, i.e. the identity ||a - b||^2 = ||a||^2 + ||b||^2 - 2*(a.b). For a matrix with no missing values, this calculation is even faster than dist for very big matrices.
The function goes like this:
vectorizedDistMat <- function(x, y) {
  an = rowSums(x^2)  # squared row norms of x
  bn = rowSums(y^2)  # squared row norms of y
  m = nrow(x)
  n = nrow(y)
  tmp = matrix(rep(an, n), nrow = m)                      # ||a||^2 term
  tmp = tmp + matrix(rep(bn, m), nrow = m, byrow = TRUE)  # + ||b||^2 term
  # abs() guards against small negative values from floating-point error
  sqrt(abs(tmp - 2 * tcrossprod(x, y)))
}
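As a quick sanity check on complete data, the result agrees with dist (X here is just an illustrative matrix, not from the original post):
X <- matrix(rnorm(20), nrow = 4)
max(abs(vectorizedDistMat(X, X) - as.matrix(dist(X))))  # ~ 0, up to rounding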
Now, missing values complicate things, especially in the last line of the above function, where the two matrices are multiplied. There is a way of accounting for missing values retrospectively, e.g. here. But how do I efficiently prevent missing values from propagating when the matrices are multiplied?
See for example M1 and M2 below:
set.seed(42)
M1 = matrix(rnorm(50), nrow = 10, ncol = 5)
M2 = matrix(rnorm(50), nrow = 10, ncol = 5)
M1[1,2] = NA
M2[3,4] = NA
Here, tcrossprod(M1, M2) yields NAs in the entire first row and third column. How do I account for them and get rid of them in advance (like the built-in function dist does)?
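For reference, a minimal sketch of one possible direction, assuming it is acceptable to zero-fill the NAs and rescale each cross product by the number of coordinates observed in both rows, in the spirit of dist's NA handling (M1z, M2z, and npairs are names introduced here, not from the original post):
# zero-filled copies, so NAs no longer propagate through tcrossprod()
M1z <- M1; M1z[is.na(M1z)] <- 0
M2z <- M2; M2z[is.na(M2z)] <- 0
# for each pair of rows, the number of coordinates observed in both
npairs <- tcrossprod((!is.na(M1)) * 1, (!is.na(M2)) * 1)
# cross products over the observed coordinates, scaled up to the full
# column count (a zero in npairs would give Inf and needs extra handling)
cp <- tcrossprod(M1z, M2z) * ncol(M1) / npairs
The an and bn terms in vectorizedDistMat would need an analogous na.rm/rescaling treatment for the distances to come out consistent.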

Related

Optimize the weight of vectors given the similarity matrix of mean vectors

I want to solve an optimization problem to find the best weights for groups of vectors. Could you give some suggestions on how to solve it in R? Thanks very much.
The problem is as follows.
Given N groups, we know the similarity matrix S among these N groups; the dimension of S is N*N.
In each group there are K vectors, and each vector has M elements whose values are 0 or 1.
From these K vectors we can fit a (weighted) average vector for each group.
Based on these average vectors, we can calculate the correlation among the groups' average vectors.
The objective is to minimize the difference between the correlation matrix C and the known similarity matrix S.
Because you didn't provide any data, I will generate random data and demonstrate a way you can approach your problem.
Similarity matrix:
N <- 6
S <- matrix(runif(N^2, -1, 1), ncol = N, nrow = N)
similarity_matrix <- (S + t(S)) / 2
N is the number of groups. Each value of the similarity matrix is between -1 and 1, and the matrix is symmetric (this makes sense because you want to compare it to a covariance matrix).
Group vectors:
M <- 10
K <- 8
group_vectors <- replicate(N, replicate(K, sample(c(0, 1), M, TRUE)), simplify = FALSE)
M is the dimension of each vector and K is the number of binary vectors in each group.
Fitness function:
fitness <- function(W, group_vectors, similarity_matrix){
  # K and N are taken from the enclosing environment
  W <- as.data.frame(matrix(W, nrow = K, ncol = N))
  # weighted sum of the K vectors in each group, then the covariance
  # matrix of those N group vectors
  SS <- cov(
    mapply(function(x, y) rowSums(sweep(x, 2, y, "*")), group_vectors, W)
  )
  # distance between the achieved covariance and the target similarity
  sum(abs(SS - similarity_matrix))
}
For given weights, fitness calculates the described covariance matrix and its distance from similarity_matrix.
Differential evolution approach:
res <- DEoptim::DEoptim(
  fn = fitness,
  lower = rep(-1, K*N),
  upper = rep(1, K*N),
  group_vectors = group_vectors,
  similarity_matrix = similarity_matrix,
  control = DEoptim::DEoptim.control(VTR = 0, itermax = 1000, trace = 50, NP = 100)
)
W <- matrix(res$optim$bestmem, nrow = K, ncol = N)
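As a quick sanity check, the objective value at the fitted weights should match DEoptim's reported best value:
fitness(res$optim$bestmem, group_vectors, similarity_matrix)  # equals res$optim$bestval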
Genetic algorithm approach:
res <- GA::ga(
  type = "real-valued",
  fitness = function(W, ...) -fitness(W, ...),  # ga() maximises, so negate
  lower = rep(-1, K*N),
  upper = rep(1, K*N),
  group_vectors = group_vectors,
  similarity_matrix = similarity_matrix,
  maxiter = 10000,
  run = 200
)
# ga() returns an S4 object, so the solution slot is accessed with @ (not #)
W <- matrix(res@solution[1, ], nrow = K, ncol = N)

R: Stuck on a "simple" problem: calculating the total sum of squares in an n*m matrix

Given a data matrix with n rows and m columns, I would like to calculate the total sum of squares in R.
For this I've tried a loop that iterates through the rows of each column and saves the results in a vector. These are then added to the "TSS" vector where each value is the SS of one column. The sum of this vector should be the TSS.
set.seed(2020)
m <- matrix(c(sample(1:100, 80)), nrow = 40, ncol = 2)
tss <- c()
for(j in 1:ncol(m)){
  tssVec <- c()
  for(i in 1:nrow(m)){
    b <- sum(((m[i,]) - mean(m[,j]))^2)
    tssVec <- c(tssVec, b)
  }
  tss <- c(tss, sum(tssVec))
}
sum(tss)
The output is 136705.6, which can't be right. As a novice coder, I am unfortunately stuck.
Any help is appreciated!
There are many methods to evaluate the TSS; of course they will give you the same result. I would do something like:
Method 1, which uses ANOVA:
n <- as.data.frame(m)
mylm <- lm(n$V1 ~ n$V2)
SSTotal <- sum(anova(mylm)[, 2])
Method 2:
SSTotal <- var(m[, 1]) * (nrow(m) - 1)
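If by TSS you mean the squared deviations from each column's own mean, summed over all columns (which is what the loop above appears to aim for, with m[i, j] in place of m[i, ]), a direct vectorized form is:
# subtract each column's mean, square, and sum over the whole matrix
sum(sweep(m, 2, colMeans(m))^2)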

R - Given a matrix and a power, produce multiple matrices containing all combinations of matrix columns

Given a matrix mat (of size N by M) and a power p (e.g., 4), produce p matrices, where the i-th matrix contains all possible combinations of the columns of mat at degree i.
In my current approach, I generate the p-th matrix and then use it in the next call to produce the (p+1)-th matrix. Can this be 'automated' for a given power p, rather than done manually?
I am a novice when it comes to R and understand that there is likely a more efficient and elegant way to achieve this solution than the following attempt...
N = 5
M = 3
p = 4
mat = matrix(1:(N*M),N,M)
mat_1 = mat
mat_2 = t(sapply(1:N, function(i) tcrossprod(mat_1[i, ], mat[i, ])))
mat_3 = t(sapply(1:N, function(i) tcrossprod(mat_2[i, ], mat[i, ])))
mat_4 = t(sapply(1:N, function(i) tcrossprod(mat_3[i, ], mat[i, ])))
Can anyone provide some suggestions? My goal is to create a function for a given matrix mat and power p that outputs the p different matrices in a more 'automated' fashion.
Related question that got me started: How to multiply columns of two matrices with all combinations
This solves your problem.
N = 5
M = 3
p = 4
mat = matrix(1:(N*M),N,M)
# f() multiplies every column of its argument by every column of mat,
# turning an N x C matrix into an N x (C*M) matrix of all products
f = function(x) matrix(apply(x, 2, "*", mat), nrow(x))
# apply f repeatedly (p-1 times) starting from mat, accumulating the results;
# right = TRUE with accumulate = TRUE returns them in reverse order, hence rev()
rev(Reduce(function(f, x) f(x), rep(c(f), p - 1), mat, TRUE, TRUE))
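Under my reading of the code, a quick check of what this returns (out is a name introduced here):
out <- rev(Reduce(function(f, x) f(x), rep(c(f), p - 1), mat, TRUE, TRUE))
length(out)    # p matrices: out[[1]] is mat itself
dim(out[[p]])  # N rows and M^p columns; out[[2]] holds the same products as
               # mat_2 in the question, though in a different column order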
You can do something like this
N = 5
M = 3
p = 4
mat = matrix(1:(N*M),N,M)
res_mat <- list()
res_mat[[1]] <- mat
for(i in 2:p) {
  res_mat[[i]] <- t(sapply(1:N, function(j) tcrossprod(res_mat[[i-1]][j, ], res_mat[[1]][j, ])))
}
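If you want this packaged as the function the question asks for, a minimal wrapper over the same loop could look like this (comboPowers is a name introduced here, not an established function):
comboPowers <- function(mat, p) {
  res <- vector("list", p)
  res[[1]] <- mat
  for (i in seq_len(p)[-1]) {
    # row-wise outer product of the previous degree with the base matrix
    res[[i]] <- t(sapply(seq_len(nrow(mat)),
                         function(j) tcrossprod(res[[i - 1]][j, ], mat[j, ])))
  }
  res
}
all_mats <- comboPowers(mat, p)  # all_mats[[4]] corresponds to mat_4 above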

R: How to add jitter only on singular matrices within a function?

I have the following function that I need to (m)apply to a list of more than 1500 large matrices (Z) and a list of vectors (p) of the same length. However, I get the error that some matrices are singular, as I already posted here. Here is my function:
kastner <- function(item, p) { print(item)
  imp <- rowSums(Z[[item]])
  exp <- colSums(Z[[item]])
  x = p + imp
  ac = p + imp - exp
  einsdurchx = 1/as.vector(x)
  einsdurchx[is.infinite(einsdurchx)] <- 0
  A = Z[[item]] %*% diag(einsdurchx)
  R = solve(diag(length(p)) - A) %*% diag(p)
  C = ac * einsdurchx
  R_bar = diag(as.vector(C)) %*% R
  rR_bar = round(R_bar)
  return(rR_bar)
}
and my mapply command that also prints the names of the running matrix:
KASTNER <- mapply(kastner, names(Z), p, SIMPLIFY = FALSE)
In order to overcome the singularity problem, I want to add a small amount of jitter to the singular matrices. The problem starts in line 9 of the function, R = solve(diag(length(p))-A) %*% diag(p), as the term (diag(length(p))-A) becomes singular and can't be solved. I tried to add jitter to all Z matrices in the first line of the function using Z <- lapply(Z, function(x) jitter(x, factor = 0.0001, amount = NULL)), but this factor is very small and still produces errors.
Therefore my idea is to check, with if/else or something similar, whether the matrix diag(length(p))-A is singular (maybe using eigenvalues to check collinearity) and apply jitter only to those matrices; otherwise the solve command should be performed as it is. Any ideas how to implement this in the function? Thanks
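For reference, a minimal sketch of such a check inside the function, assuming base R's rcond (the reciprocal condition number) is an acceptable singularity test; B and the threshold are choices introduced here:
B <- diag(length(p)) - A
# rcond() near zero means B is numerically singular and solve(B) will likely fail
if (rcond(B) < .Machine$double.eps) {
  Z[[item]] <- jitter(Z[[item]])  # jitter only the problematic matrix, then rebuild A
} else {
  R <- solve(B) %*% diag(p)
}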
Here is some example data, although it does not exhibit the singularity problem, as I was not able to rebuild this error for line 9:
Z <- list("111.2012"= matrix(c(0,0,100,200,0,0,0,0,50,350,0,50,50,200,200,0),
nrow = 4, ncol = 4, byrow = T),
"112.2012"= matrix(c(10,90,0,30,10,90,0,10,200,50,10,350,150,100,200,10),
nrow = 4, ncol = 4, byrow = T))
p <- list("111.2012"=c(200, 1000, 100, 10), "112.2012"=c(300, 900, 50, 100))
Edit: a small amount of jitter shouldn't be problematic in my data, as probably more than 80% of my matrix entries are zeros and the rest are large values. I am only interested in those large values, but the large share of zeros is probably the reason for the singularity; the zeros are needed, though.
Since you didn't provide a working example I couldn't test this easily, so the burden of proof is on you. :) In any case, it should be a starting point for further tinkering. Comments in the code.
kastner <- function(item, p) {
  print(item)
  imp <- rowSums(Z[[item]])
  exp <- colSums(Z[[item]])
  x = p + imp
  ac = p + imp - exp
  einsdurchx = 1/as.vector(x)
  einsdurchx[is.infinite(einsdurchx)] <- 0
  # start a chunk that repeats until you get a valid result
  do.jitter <- TRUE # bureaucracy
  while (do.jitter == TRUE) {
    # run the code as usual
    A = Z[[item]] %*% diag(einsdurchx)
    # catch any possible errors; you can even catch the "singularity" error
    # here by specifying error = function(e) e
    R <- tryCatch(solve(diag(length(p)) - A) %*% diag(p), error = function(e) "jitterme")
    # if you were able to solve(), and the result is a matrix (careful if it's a vector!)...
    if (is.matrix(R)) {
      # ... turn the while loop off
      do.jitter <- FALSE
    } else {
      # ... else apply some jitter and repeat by constructing A from a jittered Z[[item]]
      Z[[item]] <- jitter(Z[[item]])
    }
  }
  C = ac * einsdurchx
  R_bar = diag(as.vector(C)) %*% R
  rR_bar = round(R_bar)
  return(rR_bar)
}

Fastest way to find which combination of two lists maximises a function in R

I have a data set dat and two lists x and y. I would like to calculate different combinations of x and y for different values of k. I wrote the following code to find the value of the function fun for these combinations, but how can I get the value of k which maximises fun? In each iteration I have different lists x and y, and at the end I want to find the k which maximises fun.
dat = c(9, 2, 7)
k = seq(0, 1, length = 10)
x = list(a = 1, b = 8, c = 4)
y = list(a = .5, b = 5, c = 5)
matrix = cbind(unlist(x), unlist(y)) %*% rbind(1-k, k)
z = apply(matrix, 2, as.list)
fun = function(dat, vec) sum(vec$a * dat - vec$b * dat + vec$c * dat)
res = rep(0, length(k))
for (i in 1:(length(k))){
  v = split(unlist(z[[i]]), sub("\\d+$", "", names(z[[i]])))
  res[i] = fun(dat, v)
}
> res
[1] -54 -47 -40 -33 -26 -19 -12 -5 2 9
In this example, length(k) = 10, but how can I find the maximising k for different lists without a loop?
I still can't make heads or tails of what you are trying to do, but your code seems to boil down to this:
# left factor: dat in every column; right factor: the k-specific (a, b, c)
# columns with the b row negated, so the column sums reproduce
# sum(a*dat - b*dat + c*dat) for every k at once
colSums(matrix(rep(dat, nrow(matrix)), ncol = nrow(matrix)) %*% (matrix * c(1, -1, 1)))
That will work for any size of k. It also does not require any of your names.
Some advice: Don't use a list when a simple vector will do. You seem to understand how the %*% multiplication works; you just need to get your matrices into the right form.
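To get the maximising k itself, which the question ultimately asks for, you can then apply which.max to the result:
res <- colSums(matrix(rep(dat, nrow(matrix)), ncol = nrow(matrix)) %*% (matrix * c(1, -1, 1)))
k[which.max(res)]    # the k that maximises fun
res[which.max(res)]  # the maximum value itself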
