R: Convert upper triangular part of a matrix to symmetric matrix

I have the upper triangular part of a matrix in R (without the diagonal) and want to generate a symmetric matrix from it (with 1 on the diagonal, but that can be adjusted later). I usually do that like this:
res.upper <- rnorm(4950)
res <- matrix(0, 100, 100)
res[upper.tri(res)] <- res.upper
rm(res.upper)
diag(res) <- 1
res[lower.tri(res)] <- t(res)[lower.tri(res)]
This works fine but now I want to work with very large matrices. Thus, I would want to avoid having to store res.upper and res (filled with 0) at the same time. Is there any way I can directly convert res.upper to a symmetric matrix without having to initialize the matrix res first?

I think there are two issues here.
now I want to work with very large matrices
Then do not use R code to do this job. R will use much more memory than you expect. Try the following code:
res.upper <- rnorm(4950)
res <- matrix(0, 100, 100)
tracemem(res) ## trace memory copies of `res`
res[upper.tri(res)] <- res.upper
rm(res.upper)
diag(res) <- 1
res[lower.tri(res)] <- t(res)[lower.tri(res)]
This is what you will get:
> res.upper <- rnorm(4950) ## allocation of length 4950 vector
> res <- matrix(0, 100, 100) ## allocation of 100 * 100 matrix
> tracemem(res)
[1] "<0xc9e6c10>"
> res[upper.tri(res)] <- res.upper
tracemem[0xc9e6c10 -> 0xdb7bcf8]: ## allocation of 100 * 100 matrix
> rm(res.upper)
> diag(res) <- 1
tracemem[0xdb7bcf8 -> 0xdace438]: diag<- ## allocation of 100 * 100 matrix
> res[lower.tri(res)] <- t(res)[lower.tri(res)]
tracemem[0xdace438 -> 0xdb261d0]: ## allocation of 100 * 100 matrix
tracemem[0xdb261d0 -> 0xccc34d0]: ## allocation of 100 * 100 matrix
In R, you end up using 5 * (100 * 100) + 4950 doubles to finish these operations, while in C you need at most 4950 + 100 * 100 doubles (in fact, 100 * 100 is all that is needed; more on that below). It is difficult to overwrite an object in place in R without extra memory allocation.
Is there any way I can directly convert res.upper to a symmetric matrix without having to initialize the matrix res first?
You do have to allocate memory for res, because that is what you end up with; but there is no need to allocate memory for res.upper. You can initialize the upper triangle while filling in the lower triangle at the same time. Consider the following template:
#include <Rmath.h>      // use: double rnorm(double mean, double sd)
#include <R.h>          // use: GetRNGstate() and PutRNGstate() for randomness
#include <Rinternals.h> // SEXP data type, asInteger, allocMatrix, REAL
// N is the matrix dimension, a length-1 integer vector in R
// this function returns the matrix you want
SEXP foo(SEXP N) {
  int i, j, n = asInteger(N);
  SEXP R_res = PROTECT(allocMatrix(REALSXP, n, n)); // allocate the n x n result
  double *res = REAL(R_res);
  double tmp; // a local variable for register reuse
  GetRNGstate();
  for (i = 0; i < n; i++) {
    res[i * n + i] = 1.0; // diagonal is 1, as you want
    for (j = i + 1; j < n; j++) {
      tmp = rnorm(0, 1);
      res[j * n + i] = tmp; // initialize the upper-triangle element (i, j)
      res[i * n + j] = tmp; // fill the lower-triangle element (j, i)
    }
  }
  PutRNGstate();
  UNPROTECT(1);
  return R_res;
}
The code has not been optimized: the integer multiplication j * n + i used for addressing in the innermost loop incurs a performance penalty. But you can hoist the multiplication out of the inner loop and leave only addition inside.
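Assuming the template above is saved as foo.c (the file name here is just for illustration), a minimal sketch of compiling and calling it through the .Call interface:
# in the shell: R CMD SHLIB foo.c
dyn.load("foo.so")         # "foo.dll" on Windows
res <- .Call("foo", 100L)  # a 100 x 100 symmetric matrix with unit diagonal
dyn.unload("foo.so")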

To get a symmetric matrix from an upper or lower triangular matrix U, you can add the matrix to its transpose and subtract the diagonal elements: S = U + t(U) - diag(diag(U)), where diag(diag(U)) is a diagonal matrix containing the diagonal elements of U.
ultosymmetric <- function(m) {
  m + t(m) - diag(diag(m))
}
If you want the diagonal elements to be 1 you can do this.
ultosymmetric_diagonalone <- function(m) {
  m + t(m) - 2 * diag(diag(m)) + diag(1, nrow = dim(m)[1])
}
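As a quick check against the construction in the question (a sketch; here the triangular input has a zero diagonal, so the -2*diag(diag(m)) term vanishes):
res.upper <- rnorm(4950)
m <- matrix(0, 100, 100)
m[upper.tri(m)] <- res.upper
res <- ultosymmetric_diagonalone(m)
isSymmetric(res)     # TRUE
all(diag(res) == 1)  # TRUE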

Related

Fast cosine similarity of two sparse matrices in Rcpp with Armadillo

I'm trying to port a very fast R function for calculating cosine similarity into Rcpp with Armadillo and sparse matrix operations.
Here's the R function:
#' Compute cosine similarities between columns in x and y
#'
#' @description adapted from qlcMatrix::cosSparse
#'
#' @param x dgCMatrix with samples as columns
#' @param y dgCMatrix with samples as columns
#' @return dgCMatrix of cosine similarities for pairs of columns in "x" and "y"
sparse.cos <- function(x, y) {
  s <- rep(1, nrow(x))
  nx <- Matrix::Diagonal(x = drop(Matrix::crossprod(x ^ 2, s)) ^ -0.5)
  x <- x %*% nx
  ny <- Matrix::Diagonal(x = drop(Matrix::crossprod(y ^ 2, s)) ^ -0.5)
  y <- y %*% ny
  return(Matrix::crossprod(x, y))
}
Here's an example usage of the R function:
library(Matrix)
m1 <- rsparsematrix(1000, 10000, density = 0.1)
m2 <- rsparsematrix(1000, 100, density = 0.2)
res <- sparse.cos(m1, m2)
And here's my best stab so far at an Rcpp function (not working):
// [[Rcpp::depends(RcppArmadillo)]]
#include <RcppArmadillo.h>
using namespace arma;

// [[Rcpp::export]]
arma::SpMat<double> sparse_cos(const arma::SpMat<double> &x, const arma::SpMat<double> &y) {
  arma::vec s(x.n_rows);
  s = s.fill(1);
  arma::vec nx = arma::vec(1 / sqrt(square(x) * s));
  arma::vec ny = arma::vec(1 / sqrt(square(y) * s));
  // apply column-wise Euclidean norm to x and y
  for (sp_mat::const_iterator it_x = x.begin(); it_x != x.end(); it_x++)
    x.at(it_x.row(), it_x.col()) = *it_x * nx(it_x.col());
  for (sp_mat::const_iterator it_y = y.begin(); it_y != y.end(); it_y++)
    y.at(it_y.row(), it_y.col()) = *it_y * ny(it_y.col());
  // return cross-product of x and y as cosine distance
  return(x * y);
}
Questions:
What is the fastest way to multiply all non-zero values in each column of SpMat x by corresponding values in a vector of length ncol(x)?
How do I fix the issues in the Rcpp function? Specifically: lvalue required as left operand of assignment in line x.at(it_x.row(), it_x.col()) = *it_x * nx(it_x.col());.
The result is inherently dense, and ideally would be returned as a dense matrix. Is there a fast method for taking the cross-product of two sparse matrices while explicitly filling in a dense matrix with the result?
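For reference, the R implementation above can be sanity-checked entry by entry against the textbook cosine formula (a sketch, assuming only the Matrix package):
library(Matrix)
set.seed(1)
m1 <- rsparsematrix(1000, 10000, density = 0.1)
m2 <- rsparsematrix(1000, 100, density = 0.2)
res <- sparse.cos(m1, m2)
a <- as.numeric(m1[, 1]); b <- as.numeric(m2[, 1])
all.equal(res[1, 1], sum(a * b) / sqrt(sum(a^2) * sum(b^2)))  # TRUE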

How to improve processing time for euclidean distance calculation

I'm trying to calculate the weighted euclidean distance (squared) between two data frames that have the same number of columns (variables) and different numbers of rows (observations).
The calculation follows the formula:
DIST[m,i] <- sum(((DATA1[m,] - DATA2[i,]) ^ 2) * lambda[1,])
I specifically need to multiply each term of the summation by a specific weight (lambda).
The code provided below runs correctly, but if I use it in hundreds of iterations it takes a lot of processing time. Yesterday it took me 18 hours to create a graphic using multiple iterations of a function that contains this calculation. Using library(profvis); profvis({ my code }) I saw that this specific part of the code is taking up about 80% of the processing time.
I read a lot about how to reduce processing time using parallel and vectorized operations, but I don't know how to implement them in this particular case, because of the weight lambda.
Can someone help me reduce the processing time of this code?
More information about the code and the structure of the data can be found as comments in the code provided below.
# Data frames used to calculate the euclidean distances between each observation
# from DATA1 and each observation from DATA2.
# The euclidean distance is between a [600x50] and a [8X50] dataframes, resulting
# in a [600X8] dataframe.
DATA1 <- matrix(rexp(30000, rate=.1), ncol=50) #[600x50]
DATA2 <- matrix(rexp(400, rate=.1), ncol=50) #[8X50]
# Weights used for each of the 50 variables to calculate the weighted
# euclidean distance.
# Can be a vector of different weights or a scalar of the same weight
# for all variables.
lambda <- runif(n=50, min=0, max=10) ## length(lambda) > 1
# lambda=1 ## length(lambda) == 1
if (length(lambda) > 1) {
  as.numeric(unlist(lambda))
  lambda <- as.matrix(lambda)
  lambda <- t(lambda)
}
nrows1 <- nrow(DATA1)
nrows2 <- nrow(DATA2)
# Euclidean Distance calculation
DIST <- matrix(NA, nrow=nrows1, ncol=nrows2)
for (m in 1:nrows1) {
  for (i in 1:nrows2) {
    if (length(lambda) == 1) {
      DIST[m, i] <- sum((DATA1[m, ] - DATA2[i, ])^2)
    }
    if (length(lambda) > 1) {
      DIST[m, i] <- sum(((DATA1[m, ] - DATA2[i, ])^2) * lambda[1, ])
    }
  }
}
After all the suggestions, combining the answers from @MDWITT (for length(lambda) > 1) and @F. Privé (for length(lambda) == 1), the final solution took only one minute to run, whereas the original took an hour and a half, inside a bigger program containing this calculation. The final code for this problem, for those interested, is:
#Data frames used to calculate the euclidean distances between each observation from DATA1 and each observation from DATA2.
#The euclidean distance is between a [600x50] and a [8X50] dataframes, resulting in a [600X8] dataframe.
DATA1 <- matrix(rexp(30000, rate=.1), ncol=50) #[600x50]
DATA2 <- matrix(rexp(400, rate=.1), ncol=50) #[8X50]
#Weights used for each of the 50 variables to calculate the weighted euclidean distance.
#Can be a vector of different weights or a scalar of the same weight for all variables.
#lambda <- runif(n = 50, min = 0, max = 10) ##length(lambda) > 1
lambda = 1 ##length(lambda) == 1
nrows1 <- nrow(DATA1)
nrows2 <- nrow(DATA2)
#Euclidean Distance calculation
DIST <- matrix(NA, nrow = nrows1, ncol = nrows2)
if (length(lambda) > 1) {
  as.numeric(unlist(lambda))
  lambda <- as.matrix(lambda)
  lambda <- t(lambda)
  library(Rcpp)
  cppFunction('NumericMatrix weighted_distance (NumericMatrix x, NumericMatrix y, NumericVector lambda){
    int n_x = x.nrow();
    int n_y = y.nrow();
    NumericMatrix DIST(n_x, n_y);
    // begin the loop
    for (int i = 0; i < n_x; i++){
      for (int j = 0; j < n_y; j++) {
        double d = sum(pow(x.row(i) - y.row(j), 2) * lambda);
        DIST(i, j) = d;
      }
    }
    return DIST;
  }')
  DIST <- weighted_distance(DATA1, DATA2, lambda = lambda)
}
if (length(lambda) == 1) {
  DIST <- outer(rowSums(DATA1^2), rowSums(DATA2^2), '+') - tcrossprod(DATA1, 2 * DATA2)
}
Rewrite the problem to use linear algebra and vectorization, which is much faster than loops. The key step is the identity sum((a - b)^2 * lambda) = sum(a^2 * lambda) + sum(b^2 * lambda) - 2 * sum(a * b * lambda), which lets the whole distance matrix be built from matrix products.
If you don't have lambda, this is just
outer(rowSums(DATA1^2), rowSums(DATA2^2), '+') - tcrossprod(DATA1, 2 * DATA2)
With lambda, it becomes
outer(drop(DATA1^2 %*% lambda), drop(DATA2^2 %*% lambda), '+') -
tcrossprod(DATA1, sweep(DATA2, 2, 2 * lambda, '*'))
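As a sanity check, a sketch comparing the vectorized formula to the double loop on small random data:
set.seed(42)
DATA1 <- matrix(rexp(30000, rate = .1), ncol = 50)
DATA2 <- matrix(rexp(400, rate = .1), ncol = 50)
lambda <- runif(50, 0, 10)
D1 <- outer(drop(DATA1^2 %*% lambda), drop(DATA2^2 %*% lambda), '+') -
  tcrossprod(DATA1, sweep(DATA2, 2, 2 * lambda, '*'))
D2 <- matrix(NA, nrow(DATA1), nrow(DATA2))
for (m in seq_len(nrow(DATA1)))
  for (i in seq_len(nrow(DATA2)))
    D2[m, i] <- sum((DATA1[m, ] - DATA2[i, ])^2 * lambda)
all.equal(D1, D2)  # TRUE up to floating-point tolerance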
Here is an alternative way using Rcpp, just to have this approach documented. In a file called euclidean.cpp I have:
#include <Rcpp.h>
#include <cmath>
using namespace Rcpp;

// [[Rcpp::export]]
NumericMatrix weighted_distance (NumericMatrix x, NumericMatrix y, NumericVector lambda){
  int n_x = x.nrow();
  int n_y = y.nrow();
  NumericMatrix out(n_x, n_y);
  // begin the loop
  for (int i = 0; i < n_x; i++){
    for (int j = 0; j < n_y; j++) {
      double d = sum(pow(x.row(i) - y.row(j), 2) * lambda);
      out(i, j) = d;
    }
  }
  return out;
}
Then, in R, I have:
library(Rcpp)
sourceCpp("libs/euclidean.cpp")
# Generate Data
DATA1 <- matrix(rexp(30000, rate=.1), ncol=50) #[600x50]
DATA2 <- matrix(rexp(400, rate=.1), ncol=50) #[8X50]
lambda <- runif(n=50, min=0, max=10)
# Run the program
out <- weighted_distance(DATA1, DATA2, lambda = lambda)
When I test the speed using:
library(microbenchmark)
microbenchmark(
  Rcpp_way = weighted_distance(DATA1, DATA2, lambda = lambda),
  other = {
    DIST <- matrix(NA, nrow = nrows1, ncol = nrows2)
    for (m in 1:nrows1) {
      for (i in 1:nrows2) {
        if (length(lambda) == 1) {
          DIST[m, i] <- sum((DATA1[m, ] - DATA2[i, ])^2)
        }
        if (length(lambda) > 1) {
          DIST[m, i] <- sum(((DATA1[m, ] - DATA2[i, ])^2) * lambda[1, ])
        }
      }
    }
  }, times = 100)
You can see that it is a good clip faster:
Unit: microseconds
expr min lq mean median uq max neval
Rcpp_way 446.769 492.308 656.9849 562.667 846.9745 1169.231 100
other 24688.821 30681.641 44153.5264 37511.385 50878.3585 200843.898 100

Non-conformable arguments in R

I am re-writing in R, for practice, an algorithm called the Finite Difference Method that I originally wrote in C++. I am pretty new to R, so I don't know all the rules regarding vector/matrix multiplication. For some reason I am getting a non-conformable arguments error when I do this:
ST_u <- matrix(0, M, 1)
ST_l <- matrix(0, M, 1)
for (i in 1:M) {
  Z <- matrix(gaussian_box_muller(i), M, 1)
  ST_u[i] <- (S0 + delta_S)*exp((r - (sigma*sigma)/(2.0))*T + sigma*sqrt(T)%*%Z)
  ST_l[i] <- (S0 - delta_S)*exp((r - (sigma*sigma)/(2.0))*T + sigma*sqrt(T)%*%Z)
}
I get this error:
Error in sqrt(T) %*% Z : non-conformable arguments
Here is my whole code:
gaussian_box_muller <- function(n){
  theta <- runif(n, 0, 2 * pi)
  rsq <- rexp(n, 0.5)
  x <- sqrt(rsq) * cos(theta)
  return(x)
}
d_j <- function(j, S, K, r, v, T) {
  return((log(S/K) + (r + (-1^(j-1))*0.5*v*v)*T)/(v*(T^0.5)))
}
call_delta <- function(S, K, r, v, T){
  return(S * dnorm(d_j(1, S, K, r, v, T)) - K*exp(-r*T) * dnorm(d_j(2, S, K, r, v, T)))
}
Finite_Difference <- function(S0, K, r, sigma, T, M, delta_S){
  ST_u <- matrix(0, M, 1)
  ST_l <- matrix(0, M, 1)
  for(i in 1:M){
    Z <- matrix(gaussian_box_muller(i), M, 1)
    ST_u[i] <- (S0 + delta_S)*exp((r - (sigma*sigma)/(2.0))*T + sigma*sqrt(T)%*%Z)
    ST_l[i] <- (S0 - delta_S)*exp((r - (sigma*sigma)/(2.0))*T + sigma*sqrt(T)%*%Z)
  }
  Delta <- matrix(0, M, 1)
  totDelta <- 0
  for(i in 1:M){
    if(ST_u[i] - K > 0 && ST_l[i] - K > 0){
      Delta[i] <- ((ST_u[i] - K) - (ST_l[i] - K))/(2*delta_S)
    } else {
      Delta <- 0
    }
    totDelta <- totDelta + exp(-r*T)*Delta[i]
  }
  totDelta <- totDelta * 1/M
  Var <- 0
  for(i in 1:M){
    Var <- Var + (Delta[i] - totDelta)^2
  }
  Var <- Var*1/M
  cat("The Finite Difference Delta is : ", totDelta)
  call_Delta_a <- call_delta(S, K, r, sigma, T)
  bias <- abs(call_Delta_a - totDelta)
  cat("The bias is: ", bias)
  cat("The Variance of the Finite Difference method is: ", Var)
  MSE <- bias*bias + Var
  cat("The marginal squared error is thus: ", MSE)
}
S0 <- 100.0
delta_S <- 0.001
K <- 100.0
r <- 0.05
sigma <- 0.2
T <- 1.0
M <- 10
result1 <- Finite_Difference(S0,K,r,sigma,T,M,delta_S)
I can't seem to figure out the problem, any suggestions would be greatly appreciated.
In R, the %*% operator is reserved for multiplying two conformable matrices. As one special case, you can also use it to multiply a vector by a matrix (or vice versa), if the vector can be treated as a row or column vector that conforms to the matrix; as a second special case, it can be used to multiply two vectors to calculate their inner product.
However, one thing it cannot do is perform scalar multiplication. Scalar multiplication of vectors or matrices always uses the plain * operator. Specifically, in the expression sqrt(T) %*% Z, the first term sqrt(T) is a scalar and the second Z is a matrix. If what you intend to do here is multiply the matrix Z by the scalar sqrt(T), then this should just be written sqrt(T) * Z.
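A minimal sketch of the distinction:
Z <- matrix(rnorm(10), 10, 1)
sqrt(1.0) * Z      # fine: elementwise scaling by a scalar
# sqrt(1.0) %*% Z  # error: a length-1 vector cannot conform to a 10 x 1 matrix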
When I made this change, your program still didn't work because of another bug -- S is used but never defined -- but I don't understand your algorithm well enough to attempt a fix.
A few other comments on the program not directly related to your original question:
The first loop in Finite_Difference looks suspicious: gaussian_box_muller(i) generates a vector of length i as i varies in the loop from 1 up to M, and forcing these vectors into a column matrix of length M to generate Z is probably not doing what you want. It will "reuse" the values in a cycle to populate the matrix. Try these to see what I mean:
matrix(gaussian_box_muller(1),10,1) # all one value
matrix(gaussian_box_muller(3),10,1) # cycle of three values
You also use loops in many places where R's vector operations would be easier to read and (typically) faster to execute. For example, your definition of Var is equivalent to:
Var <- sum((Delta - totDelta)^2)/M
and the definitions of Delta and totDelta could also be written in this simplified fashion.
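For instance, a sketch of those vectorized forms (assuming ST_u, ST_l, K, r, T, delta_S and M from the original function, and avoiding the branch that overwrote the whole of Delta with 0):
in_money <- (ST_u > K) & (ST_l > K)
Delta <- ifelse(in_money, ((ST_u - K) - (ST_l - K)) / (2 * delta_S), 0)
totDelta <- sum(exp(-r * T) * Delta) / M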
I'd suggest Googling for "vector and matrix operations in r" or something similar and reading some tutorials. Vector arithmetic in particular is idiomatic R, and you'll want to learn it early and use it often.
You might find it helpful to consider the rnorm function to generate random Gaussians.
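For example, a one-line sketch that replaces the Box-Muller helper:
Z <- matrix(rnorm(M), M, 1)  # M standard normal draws as a column matrix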
Happy R-ing!

Allocating space for a sparse matrix in R

I construct a large, sparse matrix, for which I know the number of non-zero elements in advance. Is it possible in R to allocate space for this matrix, instead of having its space automatically increased every time I add an element? Something like spalloc does in Matlab.
As a simplified code-example of what I want, consider the construction of the following block-wise diagonal matrix.
library("Matrix")
n = 1000;
p = 14000;
q = 7;
x_i = Matrix(rnorm(n*p), n, p);
x = Matrix(0, n*q, p*q, sparse=TRUE);
for(i in 1:q) {
  x[((i-1)*n+1):(i*n),((i-1)*p+1):(i*p)] = x_i;
}
I think this process would be much faster if I could tell R in advance that the matrix will contain n*p*q non-zero elements.
Thanks in advance!
Edit: I now see that for the blockwise matrix I should use bdiag()
library("Matrix")
n = 1000;
p = 14000;
q = 7;
x_i = Matrix(rnorm(n*p), n, p);
lst = list();
for(i in 1:q) {
  lst[[i]] = x_i;
}
x = bdiag(lst);
This is much faster.
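Since bdiag() accepts a list directly, the loop can be simplified further with rep() (a sketch using the same x_i):
lst <- rep(list(x_i), q)
x <- bdiag(lst)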

Recursively Inverting a linear system - getting huge errors (precision)

I have an Ax = b type linear system, where A is an upper-triangular matrix. The structure of A is defined as follows:
comp.Amat <- function(i,j,prob) ifelse(i > j, 0, dbinom(x=i, size=j, prob=prob))
prob <- 1/4
A <- outer(1:50, 1:50 , FUN=function(r,c) comp.Amat(r,c,prob) )
The entries in A are binomial probabilities, and the issue is that the diagonal entries quickly approach 0 as the size of A grows.
If we define the vector b as follows as well:
b <- seq(1,50,1);
Then solve(a=A, b=b) gives an error:
" system is computationally singular: reciprocal condition number = 1.07584e-64"
That makes sense: since the diagonal entries are almost 0, the matrix becomes numerically singular.
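The decay is easy to see, since A[i, i] = dbinom(i, i, prob) = prob^i (a quick check):
prob <- 1/4
prob^c(1, 25, 50)
# [1] 2.500000e-01 8.881784e-16 7.888609e-31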
As a work-around, I have written the following recursive function, which starts by computing the value of the last diagonal entry and then substitutes that value into the previous rows. Since each entry in the matrix is dbinom(x=i, size=j, prob=prob) for i <= j, I can get a solution this way.
solve.for.x.custom <- function(A, b, prob)
{
  n = length(A[1, ])
  m = length(A[, 1])
  x = seq(1, n, 1);
  x[x > 0] = -1000;
  calc.inv.Aii <- function(i, j, prob)
  {
    res = (1 / (prob*(1-prob)))^i;
    return(res);
  }
  for (i in m:1)
  {
    if (i == m)
    {
      rhs = 0;
    } else
    {
      rhs = 0;
      for (j in m:(i+1))
      {
        rhs = dbinom(x=i, size=j, prob=prob)*x[j] + rhs;
      }
    }
    x[i] = (b[i] - rhs)*calc.inv.Aii(i, i, prob);
  }
  print(x)
  return(x)
}
My problem is that when I multiply this solution x' by the matrix A, the errors (Ax' - b) are huge. Since I have an analytical solution (each entry x_i can be described in terms of binomial probabilities multiplied by previous values), the error I should get is 0 in each row.
I see that (1 / (1/a)) may not be exactly equal to a because of floating-point issues. However, the current errors are really big (-1.13817489781529e+168).
x_prime=solve.for.x.custom(A, b, prob)
A%*%x_prime - b
#output
[,1]
[1,] -1.13817489781529e+168
[2,] 2.11872209742428e+167
[3,] -1.58403954589004e+166
[4,] 6.52328959209082e+164
[5,] -1.69562573261261e+163
[6,] 3.00614551450976e+161
***
[49,] -7.58010305220250e+08
[50,] 9.65162608741321e+03
I would really appreciate it if you could recommend any suggestions or efficient methods. I gave the size of A and b as 50, but I intend to grow them as well; in that case the error will increase too.
If your matrix A is upper triangular you probably want to use backsolve(A, b) rather than solve(A, b).
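A minimal sketch on the 50 x 50 example above (backsolve performs plain back-substitution and skips the reciprocal-condition check that made solve() refuse, though in double precision accuracy still degrades as the diagonal entries shrink):
x <- backsolve(A, b)
range(A %*% x - b)  # residuals; may still be large for big A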
You can do arbitrary precision in R with Rmpfr, which will require writing a compatible version of backsolve. With the code below we can get:
> print(max(abs(b - .b)), digits=5)
1 'mpfr' number of precision 1024 bits
[1] 2.9686e-267
There is one important caveat though: the values in A may not be accurate enough, since they come from dbinom rather than being computed with mpfr objects. Depending on your end goal, you may need to write your own version of dbinom using Rmpfr.
library(Rmpfr)

logcomp.Amat <- function(i, j, prob) ifelse(i > j, -Inf, dbinom(x=i, size=j, prob=prob, log=TRUE))

nbits <- 1024

.backsolve <- function(A, b) {
  n <- length(b)
  x <- mpfr(numeric(n), nbits)
  for (i in rev(seq_len(n))) {
    known <- i + seq_len(n - i)
    z <- if (length(known) > 0) sum(A[i, known] * x[known]) else 0
    x[i] <- (b[i] - z) / A[i, i]
  }
  return(x)
}

logA <- outer(1:50, 1:50, logcomp.Amat, prob=1/4)
b <- 1:50
A <- exp(mpfr(logA, nbits))
b <- mpfr(b, nbits)
x <- .backsolve(A, b)
.b <- as.vector(A %*% x)
