How to speed up this simple function in R

I am trying to do Conway–Maxwell–Poisson regression using COMPoissonReg in R. However, it is extremely slow for large datasets, so I tried to profile it and check the source code.
The majority of the time (>95%) is spent in the function COMPoissonReg:::computez, which is equivalent to:
test <- function(lambda, nu, max = 100) {
    forans <- matrix(0, ncol = max + 1, nrow = length(lambda))
    for (j in 1:max) {
        temp <- matrix(0, ncol = j, nrow = length(lambda))
        for (i in 1:j) {
            temp[, i] <- lambda / (i^nu)
        }
        for (k in 1:length(lambda)) {
            forans[k, j + 1] <- prod(temp[k, ])
        }
    }
    forans[, 1] <- rep(1, length(lambda))
    ans <- rowSums(forans)
    return(ans)
}
Here nu corresponds to ν in the formula, lambda is a vector, and max is the upper limit of the summation index s (set to 100 as an approximation of infinity).
The question doesn't really need any special statistics background, but link and link2 are here just in case.
Here is a simple script to test performance: it takes 8 seconds, and if I lazily byte-compile it with cmpfun, it takes 4 seconds. I believe it has the potential to be improved further (without rewriting in C; I am aiming for around 0.05 seconds so that I don't have to refactor the package code that calls this function iteratively).
lambda <- rnorm(10000, 1.5, 0.3)
Rprof(tmp <- tempfile())
sum(log(test(lambda, 1.2)))
Rprof()
summaryRprof(tmp)
Update
I realized another issue: floating-point arithmetic limitations. Evaluating the power series directly is dangerous; it can overflow very quickly, especially when vectorized. E.g. lambda^100 will certainly overflow if lambda > 10000. Maybe I would use reduce if I were programming in another language, but I fear Reduce is slow in R.
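One way around this that I sketched myself (not part of COMPoissonReg, and assuming lambda > 0): compute the terms on the log scale and combine them with a log-sum-exp, which also gives log(Z) directly, so the final sum(log(test(...))) step disappears:
logZ <- function(lambda, nu, max = 100) {
    # log of term j is j*log(lambda) - nu*log(j!), for j = 0..max
    lt <- outer(0:max, log(lambda)) - nu * lfactorial(0:max)
    m <- apply(lt, 2, max)                  # per-column shift to avoid overflow
    m + log(colSums(exp(sweep(lt, 2, m))))  # log-sum-exp, one value per lambda
}
# sum(logZ(lambda, 1.2)) should match sum(log(test(lambda, 1.2))) wherever test() stays finite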

You can make it much faster than the function you are using by avoiding loops. For example:
test2 <- function(lambda, nu, max = 100) {
    len <- length(lambda)
    mm <- matrix(rep(lambda, each = max + 1), max + 1, len)
    mm <- mm^(0:max)
    mm <- mm / factorial(0:max)^nu
    colSums(mm)
}
This runs about 50 times faster with lambda of length 100:
> require(microbenchmark)
> lam<-rnorm(100)
> max(abs(test(lam,1.2)-test2(lam,1.2)))
[1] 4.510281e-16
> microbenchmark(test(lam,1.2),test2(lam,1.2),times=10)
Unit: milliseconds
expr min lq median uq max neval
test(lam, 1.2) 77.124705 77.422619 78.241945 79.635746 81.260280 10
test2(lam, 1.2) 1.335716 1.373116 1.401411 1.507765 1.562447 10
You can probably optimize it a little more, but this should get most of the gains, unless there is some kind of builtin function you can exploit rather than doing the sum explicitly.
On input of length 10000, it takes 0.148 seconds on my machine, versus 6.850 seconds for test:
> lam<-rnorm(10000)
> max(abs(test(lam,1.2)-test2(lam,1.2)))
[1] 3.552714e-15
> system.time(test2(lam,1.2))
user system elapsed
0.132 0.016 0.148
> system.time(test(lam,1.2))
user system elapsed
6.780 0.056 6.850
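Regarding squeezing out a little more: one small variant I tried (a sketch, not part of the answer above, and not guaranteed to beat test2) precomputes the denominators once and builds the power matrix with outer():
test2b <- function(lambda, nu, max = 100) {
    denom <- factorial(0:max)^nu                        # length max+1, reused for every lambda
    colSums(outer(0:max, lambda, function(j, l) l^j) / denom)
}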

OK, here's an Rcpp answer. As expected, it's a lot faster than either of the others.
require(Rcpp)

rcppfun <- "
Rcpp::NumericVector myfun(Rcpp::NumericVector lambda,
                          Rcpp::NumericVector weights)
{
    int num = lambda.size();
    int max = weights.size();
    std::vector<double> r(num);
    for (int i = 0; i < num; i++) {
        double total = 0;
        double prod = 1;
        for (int j = 0; j < max; j++) {
            total += prod / weights[j];
            prod *= lambda[i];
        }
        r[i] = total;
    }
    return Rcpp::wrap(r);
}
"
testRcpp <- cppFunction(rcppfun)

test5 <- function(lambda, nu, max = 100) {
    wts <- factorial(0:max)^nu
    testRcpp(lambda, wts)
}
This is about 40x faster than my original test2 and about 12x faster than @NealFultz's improved logarithmic implementation.
> lam<-abs(rnorm(10000))
> max(abs(test5(lam,1.2)-test2(lam,1.2)))
[1] 7.105427e-15
> microbenchmark(test2(lam,1.2),test3(lam,1.2),test4(lam,1.2),test5(lam,1.2))
Unit: milliseconds
expr min lq median uq max neval
test2(lam, 1.2) 125.601616 126.790516 127.700099 135.182263 222.340179 100
test3(lam, 1.2) 125.523424 126.666410 126.921035 131.316254 178.633839 100
test4(lam, 1.2) 41.734015 42.640340 43.190553 50.932952 97.765219 100
test5(lam, 1.2) 3.432029 3.501046 3.519007 3.532603 3.754232 100
On edit, here's one more Rcpp version that should at least partially address the overflow issue, by computing each term incrementally, rather than the numerator and denominator separately.
rcppfun2 <- "
Rcpp::NumericVector myfun2(Rcpp::NumericVector lambda, Rcpp::NumericVector nu)
{
    int num = lambda.size();
    int max = nu.size();
    std::vector<double> r(num);
    for (int i = 0; i < num; i++) {
        double term = 1;
        double total = 1;
        for (int j = 0; j < max; j++) {
            term *= (lambda[i] / nu[j]);
            total += term;
        }
        r[i] = total;
    }
    return Rcpp::wrap(r);
}
"
testRcpp2 <- cppFunction(rcppfun2)

test6 <- function(lambda, nu, max = 100) {
    testRcpp2(lambda, (1:max)^nu)
}
> lam<-abs(rnorm(10000))
> max(abs(test2(lam,1.2)-test6(lam,1.2)))
[1] 1.065814e-14
> microbenchmark(test5(lam,1.2),test6(lam,1.2))
Unit: milliseconds
expr min lq median uq max neval
test5(lam, 1.2) 3.416786 3.426013 3.435492 3.444196 3.604486 100
test6(lam, 1.2) 3.554147 3.572285 3.580865 3.588030 3.840713 100

I slept on it and came up with another big improvement, if you can use the gsl package. All you're doing is evaluating a polynomial:
require(gsl)

test5 <- function(lambda, nu, max = 100) {
    gsl_poly(factorial(0:max)^-nu, lambda)
}
R> microbenchmark(test2(1:50, 5.1), test4(1:50, 5.1), test5(1:50, 5.1))
Unit: microseconds
expr min lq median uq max neval
test2(1:50, 5.1) 4518.957 4838.5185 5318.5040 5617.6330 19978.039 100
test4(1:50, 5.1) 2043.422 2268.3490 2472.0430 2727.1045 10328.376 100
test5(1:50, 5.1) 311.144 407.2465 476.0755 540.6095 1138.766 100
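If gsl is not available, here is a pure-R Horner-scheme sketch of the same polynomial evaluation (my own addition; it avoids forming the power matrix explicitly, though it is likely slower than gsl_poly):
horner <- function(coef, x) {
    # evaluates coef[1] + coef[2]*x + ... + coef[n]*x^(n-1), vectorised over x
    acc <- rep(0, length(x))
    for (j in rev(seq_along(coef))) acc <- acc * x + coef[j]
    acc
}
# e.g. horner(factorial(0:100)^-1.2, lambda) should agree with the gsl-based test5 above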

Following up on @mrip: working on the log scale can sometimes be a bit faster, because you can multiply instead of exponentiating:
test4 <- function(lambda, nu, max = 100) {
    mm <- matrix(log(lambda), max, length(lambda), byrow = TRUE)
    mm <- mm * (1:max) - nu * lfactorial(1:max)
    1 + colSums(exp(mm))
}
I've also factored out the special case where s = 0. Here are my timings:
R> microbenchmark(test2(1:50, 5), test4(1:50, 5))
Unit: microseconds
expr min lq median uq max neval
test2(1:50, 5) 952.360 1432.6600 1436.4525 1440.1860 3467.981 100
test4(1:50, 5) 695.189 1041.4785 1042.8315 1045.6525 2970.441 100

Related

Vectorizing double summations using R

I am struggling to translate this double summation into R using vectorization: the quantity is sum_i sum_j |z_i - z_j| * c[i, j], divided by sum(c). All I have been able to do so far is this:
c <- matrix(1:9, 3)
z <- 1:3
sum(abs(outer(z, z,"-")) * c)/sum(c)
But I don't think it's necessarily correct. I tried a for-loop version, but that is too long and my answer is likely wrong anyway. Is anyone keen on this? What am I missing (or doing wrong)? Any help would be appreciated.
Here's a double-loop version:
q = function(z, c) {
    num = 0
    for (i in 1:length(z)) {
        for (j in 1:length(z)) {
            num = num + abs(z[i] - z[j]) * c[i, j]
        }
    }
    num / sum(c)
}
Here's your vectorised version, functionised:
q2 = function(z, c) {
    sum(c * abs(outer(z, z, '-')) / sum(c))
}
Not a great difference in timing between them really for a small matrix:
> microbenchmark::microbenchmark(q(z,c), q2(z,c))
Unit: microseconds
expr min lq mean median uq max neval cld
q(z, c) 15.368 15.7505 16.59644 16.0225 16.6290 30.346 100 b
q2(z, c) 12.232 12.8885 13.79178 13.2225 13.6585 44.085 100 a
But for a larger test it's a big win:
> c2 = matrix(runif(100*100),100,100)
> z2 = runif(100)
> microbenchmark::microbenchmark(q(z2,c2), q2(z2,c2))
Unit: microseconds
expr min lq mean median uq max neval cld
q(z2, c2) 7437.031 7588.131 8046.92272 7794.927 8332.104 10729.799 100 b
q2(z2, c2) 74.742 78.647 94.20153 86.113 100.125 188.428 100 a
>
Numeric difference is within floating point tolerance:
> q(z2,c2) - q2(z2,c2)
[1] 6.661338e-16
So unless anyone has faster code, I'd stick with what you've got.
As perfectly explained by @Spacedman, your approach is very efficient, but if you still want to go faster you could try Rcpp:
library(Rcpp)

sourceCpp(code = '
#include <Rcpp.h>

// [[Rcpp::export]]
double qRcpp(const Rcpp::NumericVector z, const Rcpp::NumericMatrix cm) {
    int zlen = z.length();
    if (!(zlen == cm.nrow() && cm.nrow() == cm.ncol()))
        Rcpp::stop("Invalid sizes");
    double num = 0;
    for (int i = 0; i < zlen; i++) {
        for (int j = 0; j < zlen; j++) {
            num = num + std::abs(z[i] - z[j]) * cm(i, j);
        }
    }
    return num / Rcpp::sum(cm);
}
')
Benchmark:
c2 = matrix(runif(100*100),100,100)
z2 = runif(100)
microbenchmark::microbenchmark(q(z2,c2), q2(z2,c2),qRcpp(z2,c2))
# Unit: microseconds
# expr min lq mean median uq max neval
# q(z2, c2) 10273.035 10976.3050 11680.85554 11348.763 11765.2010 44115.632 100
# q2(z2, c2) 64.292 67.9455 80.56427 75.543 86.3565 244.019 100
# qRcpp(z2, c2) 21.042 21.9180 25.30515 24.256 26.8860 56.403 100
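A quick sanity check I would add (assuming the objects above are still in scope):
all.equal(q2(z2, c2), qRcpp(z2, c2))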

Simplify and improve for loop

I wondered if there exists a simpler (and possibly more efficient) way to write the following loop
l = leg(u[1],k)
for (i in 2:length(u)){l=rbind(l,leg(u[i],k))}
where leg is a user-defined function that returns a vector of size k. Basically, I want to return a length(u) x k matrix.
My issue is that the computation is very slow when length(u) is big, so maybe using some smarter functions (like apply or its variants) would make it faster.
Contrary to popular belief, sapply is not necessarily faster than a for loop.
Let's test a few different methods with microbenchmark:
library(microbenchmark)
u<-1:10
k<-3
I invent here a function for the sake of testing:
leg<-function(u,k){u**(1:k)}
Let's test different methods:
method1 <- function(u, k) {
    l <- leg(u[1], k)
    for (i in 2:length(u)) { l <- rbind(l, leg(u[i], k)) }
    l
}
method2 <- function(u, k) {
    l <- matrix(nrow = length(u), ncol = k)
    for (i in 1:length(u)) { l[i, ] <- leg(u[i], k) }
    l
}
method3 <- function(u, k) {
    do.call(rbind, lapply(1:length(u), function(i) leg(u[i], k)))
}
Now:
microbenchmark(times = 100, method1(u, k), method2(u, k), method3(u, k))
Unit: microseconds
expr min lq mean median uq max neval
method1(u, k) 30.031 32.6920 36.88837 34.2125 41.4350 53.219 100
method2(u, k) 21.668 25.8490 29.60131 27.1800 32.6915 70.705 100
method3(u, k) 21.667 26.2295 29.42637 27.3700 33.0715 51.699 100
If we vectorize our function:
leg2 <- function(u, k) {
    result <- matrix(nrow = length(u), ncol = k)
    for (i in 1:k) { result[, i] <- u**i }
    result
}
microbenchmark(times = 100, method1(u,k),method2(u,k),method3(u,k),leg2(u,k))
Unit: microseconds
expr min lq mean median uq max neval
method1(u, k) 28.891 31.1710 34.79391 32.692 37.8235 64.243 100
method2(u, k) 20.527 24.7085 29.20205 26.229 31.3610 79.068 100
method3(u, k) 22.428 24.7090 28.49507 26.610 31.1710 71.465 100
leg2(u, k) 6.462 7.6030 9.03213 8.363 8.7430 19.768 100
So about 3 times faster!
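One more apply-family option, as a sketch of my own (assuming, like the test above, that leg() returns a numeric vector of length k for a scalar input): vapply checks that each result has the declared length and type, and the transpose gives the desired length(u) x k matrix.
method4 <- function(u, k) {
    t(vapply(u, leg, FUN.VALUE = numeric(k), k = k))
}
# microbenchmark(times = 100, method3(u, k), method4(u, k))  # timings not shown here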

Make cumulative sum faster

I'm trying to take cumulative sums for each column of a matrix. Here's my code in R:
testMatrix = matrix(1:65536, ncol=256);
microbenchmark(apply(testMatrix, 2, cumsum), times=100L);
Unit: milliseconds
expr min lq mean median uq max neval
apply(testMatrix, 2, cumsum) 1.599051 1.766112 2.329932 2.15326 2.221538 93.84911 10000
I used Rcpp for comparison:
cppFunction('NumericMatrix apply_cumsum_col(NumericMatrix m) {
    for (int j = 0; j < m.ncol(); ++j) {
        for (int i = 1; i < m.nrow(); ++i) {
            m(i, j) += m(i - 1, j);
        }
    }
    return m;
}');
microbenchmark(apply_cumsum_col(testMatrix), times=10000L);
Unit: microseconds
expr min lq mean median uq max neval
apply_cumsum_col(testMatrix) 205.833 257.719 309.9949 265.986 276.534 96398.93 10000
So the C++ code is 7.5 times as fast. Is it possible to do better than apply(testMatrix, 2, cumsum) in pure R? It feels like I have an order of magnitude overhead for no reason.
It is difficult to beat C++ with just R code. The fastest way I can think of doing it is if you are willing to split your matrix into a list. That way, R is using primitive functions and doesn't copy the object with each iteration (apply is essentially a pretty loop). You can see that C++ still wins out, but there is a significant speedup with the list approach if you really just want to use R code.
fun1 <- function() {
    apply(testMatrix, 2, cumsum)
}

testList <- split(testMatrix, col(testMatrix))

fun2 <- function() {
    lapply(testList, cumsum)
}

microbenchmark(fun1(),
               fun2(),
               apply_cumsum_col(testMatrix),
               times = 100L)
Unit: microseconds
expr min lq mean median uq max neval
fun1() 3298.534 3411.9910 4376.4544 3477.608 3699.2485 9249.919 100
fun2() 558.800 596.0605 766.2377 630.841 659.3015 5153.100 100
apply_cumsum_col(testMatrix) 219.651 282.8570 576.9958 311.562 339.5680 4915.290 100
EDIT: Please note that this method is slower than fun1 if you include the time to split the matrix into a list.
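To see that point in numbers, a sketch (my addition) that folds the split into the timed expression:
fun2b <- function() lapply(split(testMatrix, col(testMatrix)), cumsum)
microbenchmark(fun1(), fun2(), fun2b(), times = 100L)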
Using a byte-compiled for loop is slightly faster than the apply call on my system. I expected it to be faster because it does less work than apply. As expected, the R loop is still slower than the simple C++ function you wrote.
colCumsum <- compiler::cmpfun(function(x) {
    for (i in 1:ncol(x))
        x[, i] <- cumsum(x[, i])
    x
})
testMatrix <- matrix(1:65536, ncol=256)
m <- testMatrix
require(microbenchmark)
microbenchmark(colCumsum(m), apply_cumsum_col(m), apply(m, 2, cumsum), times=100L)
# Unit: microseconds
# expr min lq median uq max neval
# colCumsum(m) 1478.671 1540.5945 1586.1185 2199.9530 37377.114 100
# apply_cumsum_col(m) 178.214 192.4375 204.3905 234.8245 1616.030 100
# apply(m, 2, cumsum) 1879.850 1940.1615 1991.3125 2745.8975 4346.802 100
all.equal(colCumsum(m), apply(m, 2, cumsum))
# [1] TRUE
Maybe it is too late, but I will write my answer so anyone else can see it.
First of all, in your C++ code you need to clone your matrix; otherwise you are writing into R's memory, which is forbidden by CRAN. So your code becomes:
rcpp_apply <- cppFunction('NumericMatrix apply_cumsum_col(NumericMatrix m) {
    NumericMatrix g = clone(m);
    for (int j = 0; j < m.ncol(); ++j) {
        for (int i = 1; i < m.nrow(); ++i) {
            g(i, j) += g(i - 1, j);
        }
    }
    return g;
}');
Since your matrix is of integer type, you can change your C++ function's arguments to IntegerMatrix:
rcpp_apply_integer <- cppFunction('IntegerMatrix apply_cumsum_col(IntegerMatrix m) {
    IntegerMatrix g = clone(m);
    for (int j = 0; j < m.ncol(); ++j) {
        for (int i = 1; i < m.nrow(); ++i) {
            g(i, j) += g(i - 1, j);
        }
    }
    return g;
}');
This improved the speed by about 2 times. Here is a benchmark:
microbenchmark::microbenchmark(R=apply(testMatrix, 2, cumsum),Rcpp=rcpp_apply(testMatrix),Rcpp_integer=rcpp_apply_integer(testMatrix), times=10)
Unit: microseconds
expr min lq mean median uq max neval
R 1552.217 1706.165 1770.1264 1740.0345 1897.884 1940.989 10
Rcpp 502.900 523.838 637.7188 665.0605 699.134 743.471 10
Rcpp_integer 220.455 274.645 274.9327 275.8770 277.930 316.109 10
all.equal(rcpp_apply(testMatrix),rcpp_apply_integer(testMatrix))
[1] TRUE
If your matrix has large values then you have to use NumericMatrix.
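If a package dependency is acceptable, a further option (my note, assuming the matrixStats package is installed) is matrixStats::colCumsums(), which does the column-wise cumulative sum in compiled code:
library(matrixStats)
all.equal(colCumsums(testMatrix), apply(testMatrix, 2, cumsum),
          check.attributes = FALSE)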

Optimize R code

I want to optimize my R function for calculating the Gini mean difference:
gini.md <- function(x)
{
    n = length(x)
    nm = n + 1
    x = sort(x)
    return(2 / n^2 * sum((2 * (1:n) - nm) * x))
}
Do you have any idea how to make it faster? Generating sequences with seq was slow, and bitwShiftL((1:n), 1) is slower than 2 * (1:n). How is that possible?
Moreover, I found out that mean(x) is slower than sum(x)/length(x). Again, why? mean is an internal function; it should be faster.
Ignoring my own advice, I guessed that the most likely source of any speed problem is unnecessary creation of long vectors. The following C implementation avoids creating four vectors (1:n, 2 * (1:n), 2 * (1:n) - nm, and finally (2*(1:n)-nm)*x).
library(inline)
gini <- cfunction(signature(x = "REALSXP"), "
    double n = Rf_length(x), nm = n + 1, ans = 0;
    const double *xp = REAL(x);
    for (int i = 0; i < n; ++i)
        ans += (2 * (i + 1) - nm) * xp[i];
    return ScalarReal(2 * ans / (n * n));
")
but this doesn't seem to help much. I realized after the fact that evaluation time is dominated by sort().
> library(microbenchmark)
> x <- rnorm(100000)
> all.equal(gini.md(x), gini(sort(x)))
[1] TRUE
> microbenchmark(gini.md(x), gini(sort(x)), sort(x), times=10)
Unit: milliseconds
expr min lq mean median uq max neval
gini.md(x) 10.668591 10.98063 11.09274 11.03377 11.20588 11.62714 10
gini(sort(x)) 10.439458 10.64972 10.78242 10.70099 10.93015 11.36177 10
sort(x) 9.995886 10.18180 10.31508 10.27024 10.46160 10.66006 10
Maybe there's more speed to be had, but it will be similarly marginal.
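On the mean(x) versus sum(x)/length(x) aside from the question: my understanding is that mean() is an S3 generic, so every call pays method dispatch, and the internal code makes a second pass over the data to refine the result for floating-point accuracy, while sum()/length() is a single pass with no dispatch. A quick way to see the gap, reusing x from above:
microbenchmark(mean(x), sum(x) / length(x), times = 100)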

Set an alpha trim in colMeans

Is there an implementation of colMeans in R that includes an alpha trimmed mean parameter?
If not, how could I make one?
The original question has been answered in the comments below.
apply(x, 2, mean, trim=.05) is not as optimized as colMeans;
what is an implementation of equal efficiency?
Here are some examples of different ways to calculate trimmed colMeans, with a comparison of their performance.
m <- matrix(runif(1000000), nc=1000)
trim <- 0.1
Using apply:
out1 <- apply(m, 2, mean, trim=trim)
Using sapply:
out2 <- sapply(seq_len(ncol(m)), function(i) mean(m[, i], trim=trim))
Using Rcpp:
library(inline)
library(RcppArmadillo)
f <- 'using namespace arma;
      mat x = sort(as<mat>(x_));
      double trim = as<double>(trim_);
      int low;
      if (x.n_rows % 2 == 1) {
          low = ceil(trim * x.n_rows) - 1;
      } else {
          low = ceil(trim * x.n_rows);
      }
      int high = ceil((1 - trim) * x.n_rows) - 1;
      return(wrap(mean(x.rows(low, high))));'

trim.colMeans <- cxxfunction(signature(x_ = 'matrix', trim_ = 'numeric'),
                             f, plugin = "RcppArmadillo")
out3 <- trim.colMeans(m, trim)
Comparison
identical(out1, out2)
[1] TRUE
identical(out1, c(out3))
[1] TRUE
library(microbenchmark)
microbenchmark(apply=apply(m, 2, mean, trim=trim),
sapply=sapply(seq_len(ncol(m)), function(i) mean(m[, i], trim=trim)),
Rcpp=trim.colMeans(m, trim),
colMeans=colMeans(m))
Unit: microseconds
expr min lq median uq max neval
apply 68907.162 100439.4775 102555.396 109044.4025 136034.067 100
sapply 64675.928 66383.6010 66937.615 68152.1115 98680.906 100
Rcpp 43614.629 44297.6980 44761.360 45164.4850 46883.602 100
colMeans 782.458 805.7995 828.538 988.4625 1452.877 100
I'm sure my Rcpp implementation is sub-optimal, so feel free to chime in with improvements. As you can see, none of these methods is as efficient as an untrimmed calculation of colMeans, yet I suspect equivalent efficiency is impossible, since additional calculations must be made, including sorting and subsetting of the matrix. This penalty for trimming data is evident when benchmarking the mean of a vector vs. the trimmed counterpart:
v <- runif(1000)
microbenchmark(mean(v), mean(v, trim=0.1))
Unit: microseconds
expr min lq median uq max neval
mean(v) 5.722 6.325 6.927 7.229 124.989 100
mean(v, trim = 0.1) 42.165 43.671 44.574 44.876 84.630 100
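A pure-R variant I would also try (my own sketch; the row cut-offs below are meant to mirror what mean() does with trim, but treat that as an assumption and verify with all.equal): sort each column once, drop the trimmed rows, and let colMeans average the rest.
trim_colMeans_R <- function(m, trim) {
    n  <- nrow(m)
    lo <- floor(n * trim) + 1          # first row kept
    hi <- n + 1 - lo                   # last row kept
    s  <- apply(m, 2, sort)            # sorting still dominates the cost
    colMeans(s[lo:hi, , drop = FALSE])
}
all.equal(trim_colMeans_R(m, trim), apply(m, 2, mean, trim = trim))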
