I am trying to code in R a(centered) weighted moving average function that returns me a vector of the same size than the input vector.
The following code almost gives me what I want but it does not work for the first and last values of my vector
set.seed(0)
len=10
x=floor(l*runif(l))
weights=c(1,3,0,3,1)
weights=weights/sum(weights)
rollapply(x,width=length(weights), function(x) sum(x*weights),align="center")
na.omit(filter(x,sides=2,weights))
Setting partial=TRUE in the rollapply function is sort of what I want to do. Anyway it does not work since my function does not support an x of changing sizes.
I could the latter and manually add the sides computations with a loop. It would work but I would like to find a nicer (computationally faster) way to do it.
For a more rigorous description of my needs here is a mathematical version
r is the vector my function would return
x and the weights w as inputs :
With Rcpp, you can do:
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
NumericVector roll_mean(const NumericVector& x,
const NumericVector& w) {
int n = x.size();
int w_size = w.size();
int size = (w_size - 1) / 2;
NumericVector res(n);
int i, ind_x, ind_w;
double w_sum = Rcpp::sum(w), tmp_wsum, tmp_xwsum, tmp_w;
// beginning
for (i = 0; i < size; i++) {
tmp_xwsum = tmp_wsum = 0;
for (ind_x = i + size, ind_w = w_size - 1; ind_x >= 0; ind_x--, ind_w--) {
tmp_w = w[ind_w];
tmp_wsum += tmp_w;
tmp_xwsum += x[ind_x] * tmp_w;
}
res[i] = tmp_xwsum / tmp_wsum;
}
// middle
int lim2 = n - size;
for (; i < lim2; i++) {
tmp_xwsum = 0;
for (ind_x = i - size, ind_w = 0; ind_w < w_size; ind_x++, ind_w++) {
tmp_xwsum += x[ind_x] * w[ind_w];
}
res[i] = tmp_xwsum / w_sum;
}
// end
for (; i < n; i++) {
tmp_xwsum = tmp_wsum = 0;
for (ind_x = i - size, ind_w = 0; ind_x < n; ind_x++, ind_w++) {
tmp_w = w[ind_w];
tmp_wsum += tmp_w;
tmp_xwsum += x[ind_x] * tmp_w;
}
res[i] = tmp_xwsum / tmp_wsum;
}
return res;
}
I use this function in one of my packages.
Just put that in a .cpp file and source it with Rcpp::sourceCpp.
Related
I am probably being greedy for performance, but I've observed significant performance gains when combining Rcpp and OpenMP in possibly illict ways. I understand that "Calling any of the R API from threaded code is ‘for experts only’ and strongly discouraged." but I'm don't fully understand when C++ code may be implicitly doing this on Rcpp vectors. I understand RcppParallel has the RVector class but I understand this may involve taking a copy of the vector, which may wash away any performance gains.
I note the Rcpp gallery has (https://gallery.rcpp.org/articles/gerber-statistic/) which appears to access a NumericMatrix HIST_RETURN_RAW inside a parallel loop, so it seems "some" access is allowed, but I know/believe that some wrappers (like Rcpp::List) would not be permitted. Is atomicity the distinguishing characteristic?
Concretely, are any of the following acceptable uses of OpenMP? (i.e. are they all threadsafe, compliant with R's memory management, and free from undefined behaviour?)
#include <Rcpp.h>
#ifdef _OPENMP
#include <omp.h>
#endif
using namespace Rcpp;
// 1. No 'outside' R vector, but IntegerVector created outside omp region
// [[Rcpp::export]]
IntegerVector fastInitalize(int n, int nThread = 1) {
IntegerVector out = no_init(n);
#pragma omp parallel for num_threads(nThread)
for (int i = 0; i < n; ++i) {
out[i] = 0;
}
return out;
}
// 2. Simple access
// [[Rcpp::export]]
IntegerVector AddOMP(IntegerVector x, IntegerVector y, int nThread = 1) {
R_xlen_t N = x.length();
if (N != y.length()) {
stop("Lengths differ");
}
IntegerVector out = no_init(N);
#pragma omp parallel for num_threads(nThread)
for (R_xlen_t i = 0; i < N; ++i) {
out[i] = x[i] + y[i];
}
return out;
}
// 3. Access, copy inside Rcpp
// [[Rcpp::export]]
IntegerVector pmax0xy(IntegerVector x, IntegerVector y, int nThread = 1) {
R_xlen_t N = x.length();
if (N != y.length()) {
stop("Lengths differ");
}
IntegerVector out = clone(y);
#pragma omp parallel for num_threads(nThread)
for (R_xlen_t i = 0; i < N; ++i) {
if (x[i] > 0) {
out[i] = 0;
}
}
return out;
}
// 4. Access with omp + reduction
// [[Rcpp::export]]
int firstNonzero(IntegerVector x, int nThread = 1) {
R_xlen_t N = x.length();
int out = N;
#pragma omp parallel for num_threads(nThread) reduction(min : out)
for (R_xlen_t i = 0; i < N; ++i) {
if (x[i] != 0) {
out = (i < out) ? i : out;
}
}
return out;
}
// 5. Access with omp array reduction
// [[Rcpp::export]]
IntegerVector count_one_to_ten(IntegerVector x, int nThread = 1) {
R_xlen_t N = x.length();
if (N >= INT_MAX) {
stop("Possibly too many numbers.");
}
const int TEN = 10;
int numbers[TEN] = {}; // what if 10 was large?
#if defined _OPENMP && _OPENMP >= 201511
#pragma omp parallel for num_threads(nThread) reduction(+:numbers[:TEN])
#endif
for (R_xlen_t i = 0; i < N; ++i) {
int xi = x[i];
if (xi < 1 || xi > 10) {
continue;
}
numbers[xi - 1] += 1;
}
IntegerVector out(TEN);
for (int n = 0; n < TEN; ++n) {
out[n] = numbers[n];
}
return out;
}
// You can include R code blocks in C++ files processed with sourceCpp
// (useful for testing and development). The R code will be automatically
// run after the compilation.
//
/*** R
x <- sample(1:1200, size = 1e6, replace = TRUE)
y <- sample(1:1200, size = 1e6, replace = TRUE)
fastInitalize(1e6)[1]
head(AddOMP(x, y))
head(AddOMP(x, y, 2))
head(pmax0xy(x, y))
head(pmax0xy(x, y, 2))
firstNonzero(x)
firstNonzero(x, 2)
count_one_to_ten(x, 2)
*/
I have a following R code which is not efficient. I would like to make this efficient using Rcpp. Particularly, I am not used to dealing with array in Rcpp. Any help would be appreciated.
myfunc <- function(n=1600,
m=400,
p = 3,
time = runif(n,min=0.05,max=4),
qi21 = rnorm(n),
s0c = rnorm(n),
zc_min_ecox_multi = array(rnorm(n*n*p),dim=c(n,n,p)),
qi=matrix(0,n,n),
qi11 = rnorm(p),
iIc_mat = matrix(rnorm(p*p),p,p)){
for (j in 1:n){
u<-time[j]
ind<-1*(u<=time)
locu<-which(time==u)
qi2<- sum(qi21*ind) /s0c[locu]
for (i in 1:n){
qi1<- qi11%*%iIc_mat%*%matrix(zc_min_ecox_multi[i,j,],p,1)
qi[i,j]<- -(qi1+qi2)/m
}
}
}
Computing time is about 7.35 secs. I need to call this function over and over again, maybe 20 times.
system.time(myfunc())
user system elapsed
7.34 0.00 7.35
First thing to do would be to profile your code: profvis::profvis({myfunc()}).
What you can do is precompute qi11 %*% iIc_mat once.
You get (with minor improvements):
precomp <- qi11 %*% iIc_mat
for (j in 1:n) {
u <- time[j]
qi2 <- sum(qi21[u <= time]) / s0c[time == u]
for (i in 1:n) {
qi1 <- precomp %*% zc_min_ecox_multi[i, j, ]
qi[i, j] <- -(qi1 + qi2) / m
}
}
that is twice as fast (8 sec -> 4 sec).
Vectorizing the i loop then seems straightforward:
q1_all_i <- tcrossprod(precomp, zc_min_ecox_multi[, j, ])
qi[, j] <- -(q1_all_i + qi2) / m
(12 times as fast now)
And if you want to try it in Rcpp, you will first need a function to multiply the matrices...
#include<Rcpp.h>
#include<numeric>
// [[Rcpp::plugins("cpp11")]]
Rcpp::NumericMatrix mult(const Rcpp::NumericMatrix& lhs,
const Rcpp::NumericMatrix& rhs)
{
if (lhs.ncol() != rhs.nrow())
Rcpp::stop ("Incompatible matrices");
Rcpp::NumericMatrix out(lhs.nrow(),rhs.ncol());
Rcpp::NumericVector rowvec, colvec;
for (int i = 0; i < lhs.nrow(); ++i)
{
rowvec = lhs(i,Rcpp::_);
for (int j = 0; j < rhs.ncol(); ++j)
{
colvec = rhs(Rcpp::_,j);
out(i, j) = std::inner_product(rowvec.begin(), rowvec.end(),
colvec.begin(), 0.);
}
}
return out;
}
Then port your function...
// [[Rcpp::export]]
Rcpp::NumericMatrix myfunc_rcpp( int n, int m, int p,
const Rcpp::NumericVector& time,
const Rcpp::NumericVector& qi21,
const Rcpp::NumericVector& s0c,
const Rcpp::NumericVector& zc_min_ecox_multi,
const Rcpp::NumericMatrix& qi11,
const Rcpp::NumericMatrix& iIc_mat)
{
Rcpp::NumericMatrix qi(n, n);
Rcpp::NumericMatrix outermat = mult(qi11, iIc_mat);
for (int j = 0; j < n; ++j)
{
double qi2 = 0;
for(int k = 0; k < n; ++k)
{
if(time[j] <= time[k]) qi2 += qi21[k];
}
qi2 /= s0c[j];
for (int i = 0; i < n; ++i)
{
Rcpp::NumericMatrix tmpmat(p, 1);
for(int z = 0; z < p; ++z)
{
tmpmat(z, 0) = zc_min_ecox_multi[i + n*j + z*n*n];
}
Rcpp::NumericMatrix qi1 = mult(outermat, tmpmat);
qi(i,j) -= (qi1(0,0) + qi2)/m;
}
}
return qi;
}
Then in R:
my_rcpp_func <- function(n=1600,
m=400,
p = 3,
time = runif(n,min=0.05,max=4),
qi21 = rnorm(n),
s0c = rnorm(n),
zc_min_ecox_multi = array(rnorm(n*n*p),dim=c(n,n,p)),
qi11 = rnorm(p),
iIc_mat = matrix(rnorm(p*p),p,p))
{
myfunc_rcpp(n, m, p, time, qi21, s0c, as.vector(zc_min_ecox_multi),
matrix(qi11,1,p), iIc_mat)
}
This is certainly faster, and gives the same results as your own function, but it's no quicker than the in-R optimizations suggested by F Privé. Maybe optimizing the C++ code could get things even faster, but ultimately you are multiplying 2 reasonably large matrices together over 2.5 million times, so it's never going to be all that fast. R is optimized pretty well for this kind of calculation after all...
I am trying to get some results by using Rcpp and this is the code.
#include <Rcpp.h>
#include <math.h>
using namespace Rcpp;
enter code here
// [[Rcpp::export]]
double compssr(NumericMatrix dist, NumericVector x, int n, int p) {
double ssr = 0; double del_sq = 0; double del_ij = 0;
int i, j, ip;
for (i = 0; i < n; i++) {
for (j = 0; j < i; j++) {
for (ip = 0; ip < p; ip++) {
del_sq = del_sq + (x(i, ip) - x(j, ip))*(x(i, ip) - x(j, ip));
if (i == j) del_sq = 0;
}
del_ij = sqrt(del_sq);
ssr = ssr + (dist(i, j) - del_ij)*(dist(i, j) - del_ij);
}}
return ssr;
}
NumericMatrix Scaling_X(NumericVector xbar, NumericMatrix x, double n, double p) {
NumericMatrix Sig_x(p, p);
int i, ii, ip, ip2;
for (ii = 0; ii < n; ii++) {
for (i = 0; i < p; i++) {
x(ii, i) = x(ii, i) - xbar(i);
}}
for (i = 0; i < n; i++) {
for (ip = 0; ip < p; ip++) {
for (ip2 = 0; ip2 < p; ip2++) {
Sig_x(ip, ip2) = Sig_x(ip, ip2) + x(i, ip)*x(i, ip2);
}}}
for (i = 0; i < Sig_x.ncol(); i++) {
for (ii = 0; ii < Sig_x.nrow(); ii++) {
Sig_x(i, ii) = Sig_x(i, ii) / n;
}}
return Sig_x;
}
In fact there are some more functions and the file name of this code is "test.cpp"
And I called this file in R by using
sourceCpp("test.cpp")
There was no error and I could use the function "compssr" the first function(return type: double)
But I couldn't call the function Scaling_X
Is there any error in my code?
I made other functions and I could use the function with return type double, but I couldn't use others(NumericMatrix, NumericVector, List)
You are missing the
// [[Rcpp::export]]
in front of function Scaling_X so the compileAttributes() function does as it has been told: compile both functions, make just one available.
I have the following RccpArmadillo function that runs fine if I execute it on one cpu core. But if I use several cores, then R will crash. All the other Rcpp functions I created so far run fine on several cores (with foreach), only RccpArmadillo seems to be problematic. Any ideas how to fix that?
cppFunction('double augmentedDickeyFullerCpp(NumericVector a, NumericVector b, double gamma, double mu, int lags) {
if (gamma < 0) {
return 0;
}
int n = a.size()-1;
int lags2 = lags + 1;
// first rows, then columns
NumericMatrix x(n-lags2,lags2);
NumericMatrix zdifflag(n-lags2+1,lags2);
NumericVector diff(n);
NumericVector zdiff(n-lags2+1);
NumericVector residuals(n+1);
residuals[0] = a[0] - gamma * b[0] - mu;
// residuals a is y and b is x
for(int i = 1; i < n+1; i++) {
residuals[i] = a[i] - gamma * b[i] - mu;
diff[i-1] = residuals[i] - residuals[i-1];
}
for(int i = 0; i < n-lags2+1; i++) {
zdifflag[0,i] = residuals[i+lags2-1];
}
for(int j = 0; j < n-lags2+1; j++) {
for(int i = 0; i < lags2; i++) {
x(j,i) = diff[j+lags2-1-i];
if (i > 0) {
zdifflag(j,i) = x(j,i);
}
}
zdiff[j] = x(j,0);
}
int length = zdifflag.nrow(), k = zdifflag.ncol();
arma::mat X(zdifflag.begin(), length, k, false); // reuses memory and avoids extra copy
arma::colvec y(zdiff.begin(), zdiff.size(), false);
arma::colvec coef = arma::solve(X, y); // fit model y ~ X
arma::colvec res = y - X*coef; // residuals
// std.errors of coefficients
//arma::colvec res = y - X*coef[0];
// sqrt(sum(residuals^2)/(length - k))
double s2 = std::inner_product(res.begin(), res.end(), res.begin(), 0.0)/(length - k);
arma::colvec std_err = arma::sqrt(s2 * arma::diagvec(arma::pinv(arma::trans(X)*X)));
return coef[0]/std_err[0];
}',depends = "RcppArmadillo", includes="#include <RcppArmadillo.h>")
I generally recommend putting the code into a small package, and having each parallel worker load the package. That is known to work, both in serial and parallel, whereas relying on cppFunction() for an ad-hoc function may be too fragile for parallel execution.
I had implemented a function in R which was long to run. I have succeeded in improving it in R but now I would like to speed it up more by using Rcpp package.
I have created the following Rcpp code. Unfortunately, it takes approximately the same time to run as the R code. I would like thus to improve it. Has anyone an idea on how to improve this piece of code?
Thanks a lot!
#include <math.h>
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
double kernelcpp(NumericVector a, NumericVector b, int N){
int i;
double sum=0.0;
for (i=0;i<N;i++){
if (a[i] > b[i])
sum+= a[i] - b[i];
else
sum+= b[i] - a[i];
}
return(exp( - sum));
}
// [[Rcpp::export]]
NumericVector testFromontcpp(NumericMatrix z1, NumericMatrix z2, int Nbootstrap){
// first element of TKeps = TK
int i,j,k,t;
int dim1 = z1.nrow();
int dim2 = z2.nrow();
double n1 = (double) dim1;
double n2 = (double) dim2;
int dimension = z1.ncol();
int N = dim1 + dim2;
NumericVector TKeps(Nbootstrap+1);
Rcpp::NumericMatrix bb(N,N);
double cc = 1 / (n1*n2*(n1+n2-2));
double a = sqrt(1/(n1*n1-n1)-cc);
double b = - sqrt(1/(n2*n2-n2)-cc);
for (i=0 ; i<N ; i++){
for (j=0 ; j<N ; j++){
if (i != j){
if (i < dim1) {
if (j < dim1){
bb(i,j) = kernelcpp(z1(i,_),z1(j,_),dimension);
} else {
bb(i,j) = kernelcpp(z1(i,_),z2(j-dim1,_),dimension);
}
}
else{
if (j < dim1){
bb(i,j) = kernelcpp(z2(i-dim1,_),z1(j,_),dimension);
} else {
bb(i,j) = kernelcpp(z2(i-dim1,_),z2(j-dim1,_),dimension);
}
}
}
}
}
TKeps(0)=0.0;
for (i=0 ; i<N ; i++){
for (j=0 ; j<N ; j++){
if (i != j){
if (i < dim1) {
if (j < dim1){
TKeps(0) += bb(i,j)* (a*a + cc);
} else {
TKeps(0) += bb(i,j) * (a*b + cc);
}
}
else{
if (j < dim1){
TKeps(0) += bb(i,j) * (a*b + cc);
} else {
TKeps(0) += bb(i,j) * (b*b + cc);
}
}
}
}
}
for (k=1 ; k<=Nbootstrap ; k++){
TKeps(k)=0;
int R[N];
for (i = 0 ; i < N ; i++)
R[i] = i;
for (i = 0; i < N - 1 ; i++) {
int j = i + rand() / (RAND_MAX / (N - i) + 1);
t = R[j];
R[j] = R[i];
R[i] = t;
}
for (i=0 ; i<N ; i++){
for (j=0 ; j<N ; j++){
if (i != j){
if (R[i] < n1) {
if (R[j] < n1){
TKeps(k) += bb(i,j) * (a*a + cc);
} else {
TKeps(k) += bb(i,j) * (a*b + cc);
}
} else{
if (R[j] < n1){
TKeps(k) += bb(i,j) * (b*a + cc);
} else {
TKeps(k) += bb(i,j) * (b*b + cc);
}
}
}
}
}
}
return(TKeps);
}
Since I do not exactly know what your code does, I can see two things from the scratch:
The function you call from your R environment is testFromontcpp(...). I suggest that this function should have SEXP values as parameters. Those S-Expressions are pointer to the memory of R. If you don't use SEXP, then both matrices will be copied:
Consider a 1000x1000 matrix, this means you have 1 million entries saved in R, which are copied to C++. To do so write:
testFromontcpp(SEXP x, SEXP y, SEXP z) {
NumericMatrix z1(x), z2(y);
int *Nbootstrap = INTEGER(z);
...
}
Be careful: In the for-loop you cannot use i<Nbootstrap. You have to write i<*Nbootstrap!!!
Secondly...and more important: Since R's matrices are saved as pointer to column and from the column pointer to the row, C's matrices are saved the other way round. What I want to say is that it costs a lot to jump into memory and jump back the whole time instead of following the memory path. My suggestion for this is: Switch the for-loops, so first iterate over the row of a specific column and not the other way round.
To the last point: In a task at university I had the problem with iterating over matrices, too. In my case it was way cheaper to transpose the matrix and then do calculations.
I hope I could help you.
Best,
Michael
PS: Referring to point 1...I just benchmarked your code with your implementation and with using SEXP. With SEXP it is slightly quicker for a 100x100 matrix with random numbers between 1 to 10.