Paste elements together pairwise - r

Let's say I have two vectors and a distance matrix like below
v1 = sample(c(0,1),5,replace=TRUE)
v2 = sample(c(0,1),5,replace=TRUE)
d = matrix(rep(1,5*5),ncol=5)
diag(d) <- 0
Using the function below I'm calculating a distance between them; how can I make this faster?

Here is a first attempt: remove the inner loop and use vectorization for the vector2 multiplication and sum:
f_d_categorical2 <- function(vector1, vector2, dist.matrix) {
  ptm <- proc.time()
  dist <- 0
  for (i in 1:length(vector1)) {
    dist <- dist + sum(vector1[i] * vector2 * dist.matrix[i, ])
  }
  print(proc.time() - ptm)
  return(dist)
}
Process time went from 1.8 to 0.03 sec. I am sure there is room for improvement and additional test cases.
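The remaining loop can also be dropped entirely: the double sum is just a quadratic form, so a single matrix product should give the same result. This is only a sketch, reusing v1, v2 and d from above; the name f_d_categorical3 is mine.
f_d_categorical3 <- function(vector1, vector2, dist.matrix) {
  # sum_i sum_j vector1[i] * vector2[j] * dist.matrix[i, j], written as a quadratic form
  as.numeric(t(vector1) %*% dist.matrix %*% vector2)
}
# f_d_categorical3(v1, v2, d) should match f_d_categorical2(v1, v2, d)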

Related

R: fastest way to set up matrix of integrals?

I have a three-parameter function f(x, y, z) and two limits L, U.
Given a vector v, I want to set up a matrix with elements M[i, j] = INTEGRAL( f(x, v[i], v[j]) ), where the integral's limits go from x = L to x = U.
So the problem has two elements:
We need to be able to calculate the integrals. I don't care how this is done, as long as it's FAST and reasonably accurate. Fast, fast, fast!! What's the fastest way?
We need to set up the matrix M[i, j]. What's the fastest way?
Please don't make this an issue of "dO yOu WaNt GauSsIan QuaDraTure oR SimPsoNs ruLe?". I don't care. Speed is the only thing relevant here. Whatever's faster, I'll take it, as long as the integrals are accurate to at least 1-2 digits or so.
A potentially fast solution is given below:
library(pracma)
M <- matrix(0,nrow = length(v),ncol = length(v))
p <- sapply(seq(length(v)-1), function(k) integral(f,v[k],v[k+1]))
u <- unlist(sapply(rev(seq_along(p)), function(k) cumsum(tail(p,k))))
M[lower.tri(M)] <- u
M <- t(M-t(M))
Regarding the two elements requested by OP
I guess integral from package pracma is fast enough
To build the matrix M, I did not use a nested for loop. The idea is explained in the Principle section below, and I believe it speeds up the computation remarkably.
Benchmark
I wrote down some of the possible solutions and you can compare their performance (my "fastest" solution is in method1()).
set.seed(1)
library(pracma)
# dummy data: function f and vector v
f <- function(x) x**3 + cos(x**2)
v <- rnorm(500)
# my "fastest" solution
method1 <- function() {
  m1 <- matrix(0, nrow = length(v), ncol = length(v))
  p <- sapply(seq(length(v)-1), function(k) integral(f, v[k], v[k+1]))
  u <- unlist(sapply(rev(seq_along(p)), function(k) cumsum(tail(p, k))))
  m1[lower.tri(m1)] <- u
  t(m1 - t(m1))
}
# faster than brute-force solution
method2 <- function() {
  m2 <- matrix(0, nrow = length(v), ncol = length(v))
  for (i in 1:(length(v)-1)) {
    for (j in i:length(v)) {
      m2[i,j] <- integral(f, v[i], v[j])
    }
  }
  m2 + t(m2)
}
# slowest, brute-force solution
method3 <- function() {
  m3 <- matrix(0, nrow = length(v), ncol = length(v))
  for (i in 1:length(v)) {
    for (j in 1:length(v)) {
      m3[i,j] <- integral(f, v[i], v[j])
    }
  }
  m3
}
# timing for compare
system.time(method1())
system.time(method2())
system.time(method3())
such that
> system.time(method1())
user system elapsed
0.17 0.01 0.19
> system.time(method2())
user system elapsed
25.72 0.07 25.81
> system.time(method3())
user system elapsed
41.84 0.03 41.89
Principle
The idea in method1() is that you only need to calculate the integrals over the intervals between adjacent points in v. Note the following integral properties:
integral(f,v[i],v[j]) is equal to integral(f,v[i],v[i+1]) + integral(f,v[i+1],v[i+2]) + ... + integral(f,v[j-1],v[j])
integral(f,v[j],v[i]) is equal to -integral(f,v[i],v[j])
In this sense, given n <- length(v), you only need to run the integral operation (which is rather computationally expensive compared to a matrix transpose or a vector cumulative sum) n-1 times, far fewer than the choose(n,2) times in method2() or the n**2 times in method3(), particularly when n is large.
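As a minimal illustration of this telescoping property (a sketch with a small, hand-picked v; the toy values are mine), summing the integrals over adjacent sub-intervals recovers the integral over the whole range:
library(pracma)
f <- function(x) x**3 + cos(x**2)
v <- c(0.2, 0.7, 1.1, 1.9)
p <- sapply(seq(length(v)-1), function(k) integral(f, v[k], v[k+1]))
all.equal(sum(p), integral(f, v[1], v[4]))   # should be TRUE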

how to modify my R code to accelerate computational speed

Here is my R code. Could you please give me some advice on how to accelerate it? :)
First, the function myfun() generates a complex number.
Second, I compute the elements of matrix M using myfun().
myfun <- function(a,b,nq,ul,uk)
{
  m <- seq(1,(nq/2)+1,length=(nq/2)+1);
  k <- m;
  D <- matrix(NA,nrow = length(k),ncol = length(k));
  for (i in 1:length(k))     # row
    for (j in 1:length(m))   # column
    {
      D[i,j] <- (2/nq)*cos(((j-1)*(i-1)*pi)/(nq*0.5))
    }
  D[,1] <- D[,1]*0.5;
  D[,ncol(D)] <- D[,ncol(D)]*0.5;
  # compute the vector v
  vseq <- seq(2,nq-2,by=2);
  vr <- 2/(1-vseq^2);
  vr <- c(1,vr,1/(1-nq*nq));
  v <- matrix(vr,ncol=1);   # v is an N-by-1 matrix
  # compute the vector w, length(w) = nq/2+1
  h <- function(x,ul,uk)
  {
    ((b-a)/2)*(exp((b-a)/2*x+(a+b)/2)+1)^(1i*uk)*cos(((b-a)/2*x+(a+b)/2-a)*ul)
  }
  w <- matrix(rep(NA,length(v)),ncol=1);
  for (i in 1:length(w))
  {
    w[i] <- h((cos((i-1)*pi/nq)),ul,uk)+h((-cos((i-1)*pi/nq)),ul,uk)
  }
  res <- t(t(D)%*%v)%*%w;   # each element of matrix M
  return(res)
}
Next, compute each element of matrix M. The N-th column and N-th row are zeros.
matrix.M <- matrix(0,ncol = N,nrow = N);
for (i in 1:(N-1))     # note the parentheses: 1:N-1 would evaluate to 0:(N-1) in R
  for (j in 1:(N-1))
  {
    matrix.M[i,j] <- myfun(a,b,nq,i-1,j-1)
  }
We can set parameters as
a<--173.2;
b<-78;
alpha<-0.24;
Dt<-0.1;
M<-1000;
N<-150;
u<-seq(1,150,by=1)*pi/(b-a);
nq<-3000;
I appreciate your help!
Here are some suggestions for speeding the function up. I use three "tricks":
Vectorize as many functions as possible
Use the outer function instead of a double loop
Use the hidden gem crossprod for the final matrix products
myfun <- function(a,b,nq,ul,uk) {
  m <- seq(1,(nq/2)+1,length=(nq/2)+1);
  k <- m;
  ## Use outer to compute the elements of the matrix
  D <- outer(1:length(k), 1:length(m), function(i, j) { (2/nq)*cos(((j-1)*(i-1)*pi)/(nq*0.5)) })
  D[,1] <- D[,1]*0.5;
  D[,ncol(D)] <- D[,ncol(D)]*0.5;
  # compute the vector v
  vseq <- seq(2,nq-2,by=2);
  vr <- 2/(1-vseq^2);
  vr <- c(1,vr,1/(1-nq*nq));
  v <- matrix(vr,ncol=1);   # v is an N-by-1 matrix
  h <- function(x,ul,uk) {
    ((b-a)/2)*(exp((b-a)/2*x+(a+b)/2)+1)^(1i*uk)*cos(((b-a)/2*x+(a+b)/2-a)*ul)
  }
  ## Compute the full w vector in one go
  vect <- seq_along(v)-1
  w <- h((cos(vect*pi/nq)),ul,uk) + h((-cos(vect*pi/nq)),ul,uk)
  ## Compute the cross products.
  res <- crossprod(crossprod(D, v), w)
  return(res)
}
I think this should save around 80% of the time compared to the original function. The time hog was the initial computation of D. Hope this helps.
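If you also want to get rid of the explicit double loop that fills matrix.M, one option is outer() combined with Vectorize(), since myfun() only accepts scalar ul and uk. This is just a sketch of the idea and is not benchmarked; the per-call cost of myfun() still dominates, so expect tidier code rather than a large extra speed-up:
# sketch: fill the (N-1) x (N-1) block of matrix.M without a double loop
matrix.M <- matrix(0+0i, nrow = N, ncol = N)   # complex storage, since myfun() returns complex values
idx <- 0:(N-2)                                 # the ul and uk values used in the loop above
matrix.M[1:(N-1), 1:(N-1)] <-
  outer(idx, idx, Vectorize(function(ul, uk) myfun(a, b, nq, ul, uk)))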

Nested rolling sum in vector

I am struggling to produce efficient code to compute the result vector r from an input vector v using this function:
r(i) = \sum_{j=i-N}^{i} [v(i)-v(j)] * exp(v(i)-v(j))
where i loops (from N to M) over the vector v. The length of v is M, with M >> N.
Of course this is feasible with two nested for loops, but it is too slow for computational purposes (and probably an out-of-fashion, deprecated style...).
A MWE:
for (i in (N+1):length(v)) {
  csum <- 0
  for (j in i:(i-N)) {
    csum <- csum + (v[i]-v[j])*exp(v[i]-v[j])
  }
  r[i] <- csum
}
In my real application, M > 10^5, and there are in fact several such v vectors.
I have been trying with nested applications of lapply and rollapply without success.
Any suggestion is welcome.
Thanks!
I don't know if it is any more efficient but something you can try:
r[N:M] <- sapply(N:M, function(i) tail(cumsum((v[i]-v[1:N])*exp(v[i]-v[1:N])), 1))
Checking that both computations give the same results: I computed r with your approach and r2 with mine (initializing r2 to rep(NA, M)), then assessed their similarity:
all((r-r2)<1e-12, na.rm=TRUE)
# [1] TRUE
NOTE: as in @lmo's answer, tail(cumsum(...), 1) can be replaced more efficiently by just using sum(...):
r[N:M] <- sapply(N:M, function(i) sum((v[i]-v[1:N])*exp(v[i]-v[1:N])))
Here is a method with a single for loop.
# create new blank vector
rr <- rep(NA,M)
for (i in N:length(v)) {
  rr[i] <- sum((v[i] - v[seq_len(N)]) * exp(v[i] - v[seq_len(N)]))
}
check for equality
all.equal(r, rr)
[1] TRUE
You could avoid computing the difference twice by storing it, which should add a little speed-up.
for (i in N:length(v)) {
  x <- v[i] - v[seq_len(N)]
  rr[i] <- sum(x * exp(x))
}
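For completeness, the outer loop over i can also be vectorised with outer(), at the cost of materialising an M-by-N matrix of differences. This is a sketch following the same j in 1:N convention used in the answers above (r3 is just an illustrative name):
d  <- outer(v, v[seq_len(N)], "-")   # d[i, j] = v[i] - v[j]
r3 <- rowSums(d * exp(d))            # r3[i] = sum_j (v[i] - v[j]) * exp(v[i] - v[j])
r3[seq_len(N - 1)] <- NA             # entries below i = N stay undefined, as in rr
# all.equal(r3, rr) should be TRUE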

Mahalanobis distance of each pair of observations

I am trying to compute the Mahalanobis distance between each pair of observations in a dataset dat, where each row is an observation and each column is a variable. Such a distance is defined as
d(x_i, x_j) = \sqrt{(x_i - x_j)^T S^{-1} (x_i - x_j)}
where S is the sample covariance matrix of the data.
I wrote a function that does it, but I feel like it is slow. Is there any better way to compute this in R ?
To generate some data to test the function:
generateData <- function(nObs, nVar){
library(MASS)
mvrnorm(n=nObs, rep(0,nVar), diag(nVar))
}
This is the function I have written so far. Both methods work, and for my data (800 observations and 90 variables) they take approximately 30 and 33 seconds for method = "forLoop" and method = "apply", respectively.
mhbd_calc2 <- function(dat, method) {   # method is either "forLoop" or "apply"
  dat <- as.matrix(na.omit(dat))
  nObs <- nrow(dat)
  mhbd <- matrix(nrow = nObs, ncol = nObs)
  cv_mat_inv <- solve(var(dat))
  distMH <- function(x) {   # Mahalanobis distance function
    diff <- dat[x[1],] - dat[x[2],]
    diff %*% cv_mat_inv %*% diff
  }
  if (method == "forLoop") {
    for (i in 1:nObs) {
      for (j in 1:i) {
        mhbd[i,j] <- distMH(c(i,j))
      }
    }
  }
  if (method == "apply") {
    mhbd[lower.tri(mhbd)] <- apply(combn(nrow(dat), 2), 2, distMH)
  }
  result <- sqrt(mhbd)
  colnames(result) <- rownames(dat)
  rownames(result) <- rownames(dat)
  return(as.dist(result))
}
NB: I tried using outer(), but it was even slower (60 seconds).
You need some mathematical knowledge.
Do a Cholesky factorization of the empirical covariance, then standardize your observations;
use dist to compute the Euclidean distance on the transformed observations.
This works because if V = L L^T is the (lower-triangular) Cholesky factorization, then (x_i - x_j)^T V^{-1} (x_i - x_j) = ||L^{-1}(x_i - x_j)||^2, so the ordinary Euclidean distances between the standardized rows are exactly the Mahalanobis distances.
dist.maha <- function (dat) {
  X <- as.matrix(na.omit(dat))        ## ensure a valid matrix
  V <- cov(X)                         ## empirical covariance; positive definite
  L <- t(chol(V))                     ## lower triangular factor
  stdX <- t(forwardsolve(L, t(X)))    ## standardization
  dist(stdX)                          ## use `dist`
}
Example
set.seed(0)
x <- matrix(rnorm(6 * 3), 6, 3)
dist.maha(x)
# 1 2 3 4 5
#2 2.362109
#3 1.725084 1.495655
#4 2.959946 2.715641 2.690788
#5 3.044610 1.218184 1.531026 2.717390
#6 2.740958 1.694767 2.877993 2.978265 2.794879
The result agrees with your mhbd_calc2.
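As a quick sanity check (a sketch, assuming mhbd_calc2() from the question is still defined in the session), the two functions can be compared directly on the same toy data:
set.seed(0)
x <- matrix(rnorm(6 * 3), 6, 3)
# both return "dist" objects; compare their underlying values
all.equal(c(dist.maha(x)), c(mhbd_calc2(x, method = "forLoop")))   # should be TRUE up to numerical tolerance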

Downsample matrix in R?

My question is about how to improve the performance of a function that downsamples from the columns of a matrix without replacement (a.k.a. "rarefaction" of a matrix). I know this has been mentioned here before, but I could not find a clear answer that a) does what I need and b) does it quickly.
Here is my function:
downsampled <- function(data, samplerate=0.8) {
  data.test <- apply(data, 2, function(q) {
    names(q) <- rownames(data)
    samplepool <- character()
    for (i in names(q)) {
      samplepool <- append(samplepool, rep(i, times=q[i]))
    }
    sampled <- sample(samplepool, size=samplerate*length(samplepool), replace=FALSE)
    tab <- table(sampled)
    mat <- match(names(tab), names(q))
    toret <- numeric(length = length(q))
    names(toret) <- names(q)
    toret[mat] <- tab
    return(toret)
  })
  return(data.test)
}
I need to be downsampling matrices with millions of entries. I find this is quite slow (here I'm using a 1000x1000 matrix, which is about 20-100x smaller than my typical data size):
mat <- matrix(sample(0:40,1000*1000,replace=T),ncol=1000,nrow=1000)
colnames(mat) <- paste0("C",1:1000)
rownames(mat) <- paste0("R",1:1000)
system.time(matd <- downsampled(mat,0.8))
## user system elapsed
## 69.322 21.791 92.512
Is there a faster/easier way to perform this operation that I haven't thought of?
I think you can make this dramatically faster. If I understand what you are trying to do correctly, you want to down-sample each cell of the matrix, such that if samplerate = 0.5 and the cell of the matrix is mat[i,j] = 5, then you want to sample up to 5 things where each thing has a 0.5 chance of being sampled.
To speed things up, rather than doing all these operations on the columns of the matrix, you can loop through each cell of the matrix, draw the samples for that cell with runif (e.g., if mat[i,j] = 5, generate 5 random numbers between 0 and 1 and count how many are < samplerate), and write that count into a new matrix. I think this effectively achieves the same down-sampling scheme, but much more efficiently, both in terms of running time and lines of code.
# Sample matrix
set.seed(23)
n <- 1000
mat <- matrix(sample(0:10,n*n,replace=T),ncol=n,nrow=n)
colnames(mat) <- paste0("C",1:n)
rownames(mat) <- paste0("R",1:n)
# Old function
downsampled <- function(data, samplerate=0.8) {
  data.test <- apply(data, 2, function(q) {
    names(q) <- rownames(data)
    samplepool <- character()
    for (i in names(q)) {
      samplepool <- append(samplepool, rep(i, times=q[i]))
    }
    sampled <- sample(samplepool, size=samplerate*length(samplepool), replace=FALSE)
    tab <- table(sampled)
    mat <- match(names(tab), names(q))
    toret <- numeric(length = length(q))
    names(toret) <- names(q)
    toret[mat] <- tab
    return(toret)
  })
  return(data.test)
}
# New function
downsampled2 <- function(mat, samplerate=0.8) {
  new <- matrix(0, nrow(mat), ncol(mat))
  colnames(new) <- colnames(mat)
  rownames(new) <- rownames(mat)
  for (i in 1:nrow(mat)) {
    for (j in 1:ncol(mat)) {
      new[i,j] <- sum(runif(mat[i,j], 0, 1) < samplerate)
    }
  }
  return(new)
}
# Compare times
system.time(downsampled(mat,0.8))
## user system elapsed
## 26.840 3.249 29.902
system.time(downsampled2(mat,0.8))
## user system elapsed
## 4.704 0.247 4.918
Using an example 1000 X 1000 matrix, the new function I provided runs about 6 times faster.
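A further refinement, if this sampling scheme is acceptable: each cell of downsampled2 is effectively a draw from Binomial(mat[i,j], samplerate), so the double loop can be collapsed into a single vectorised rbinom() call. This is only a sketch (the name downsampled3 is mine) and, like downsampled2, it approximates the original column-wise without-replacement scheme rather than reproducing it exactly:
downsampled3 <- function(mat, samplerate=0.8) {
  # one Binomial(mat[i,j], samplerate) draw per cell, all at once
  matrix(rbinom(length(mat), size = as.vector(mat), prob = samplerate),
         nrow = nrow(mat), ncol = ncol(mat), dimnames = dimnames(mat))
}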
One source of savings would be to replace the for loop that builds samplepool by appending with a single call to rep. Here is a reproducible example:
myRows <- 1:5
names(myRows) <- letters[1:5]
# get the repeated values for sampling
samplepool <- rep(names(myRows), myRows)
Within your function, this would be
samplepool <- rep(names(q), q)

Resources