How do I draw multinomial distributed samples with RcppArmadillo?

I have a variable arma::mat prob_vec and want something equivalent to rmultinom(1, 1, prob_vec) in R.
The rmultinom function I found through RcppArmadillo takes different arguments from the one in R, so my call does not compile.
I just want to know how to draw the desired sample in RcppArmadillo, or equivalently in Armadillo. If I need to get a pointer or convert my prob_vec variable, please tell me how.
Many thanks!

Your friendly neighbourhood co-author of RcppArmadillo here: I can assure you that it does not provide rmultinom, but Rcpp does. In fact, it simply passes through to R itself as a quick grep would have told you:
inline void rmultinom(int n, double* prob, int k, int* rn)
{ return ::rmultinom(n, prob, k, rn); }
So I would suggest you first write a five-line C program against the R API to make sure you know how to have rmultinom do what you want it to do, and then use Rcpp and RcppArmadillo to do the same thing on the data in your vector.
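To make the target behaviour concrete, here is a minimal sketch of what rmultinom(1, 1, prob) produces: a one-hot vector with a single 1 placed according to the probabilities. It is an illustration only and rests on assumptions: it uses the C++ standard library's std::discrete_distribution rather than R's rmultinom, so it does not share R's RNG stream or respect set.seed(), and the name draw_one_hot is hypothetical. With the pass-through signature shown above you would instead hand over raw pointers; for an arma::mat, memptr() should give you the double* for prob.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Draw one multinomial sample of size 1: a one-hot vector whose single 1
// lands at index i with probability prob[i] / sum(prob).
// NOTE: standard-library RNG, not R's RNG (assumption for illustration).
std::vector<int> draw_one_hot(const std::vector<double>& prob, std::mt19937& rng) {
    assert(!prob.empty());  // at least one category is required
    std::discrete_distribution<int> pick(prob.begin(), prob.end());
    std::vector<int> out(prob.size(), 0);
    out[static_cast<std::size_t>(pick(rng))] = 1;
    return out;
}
```

The weights do not need to sum to one; std::discrete_distribution normalises them, much as R's rmultinom does with its prob argument.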

Related

More efficient way to compute the rowNorms in R?

I wrote a program using an unsupervised K-means algorithm to try and compress images. It now works, but compared with Python it is incredibly slow! Specifically, it's computing the rowNorms that's slow. The array X has 350000+ elements.
This is the particular function:
find_closest_centroids <- function(X, centroids) {
  m <- nrow(X)
  c <- integer(m)
  for(i in 1:m){
    distances = rowNorms(sweep(centroids,2,X[i,]))
    c[i] = which.min(distances)
  }
  return(c)
}
In Python I am able to do it like this:
import numpy as np

def find_closest_centroids(X, centroids):
    m = len(X)
    c = np.zeros(m)
    for i in range(m):
        # Euclidean distance from row i to every centroid
        distances = np.linalg.norm(X[i] - centroids, axis=1)
        c[i] = np.argmin(distances)
    return c
Any recommendations?
Thanks.
As dvd280 has noted in his comment, R tends to do worse than many other languages in terms of performance. If you are content with the performance of your code in Python but need the function available in R, you might want to look into the reticulate package, which provides an interface to Python, much as the Rcpp package mentioned by dvd280 does for C++.
If you still want to implement this natively in R, be mindful of the data structures you use. For rowwise operations, data frames are a poor choice as they are lists of columns. I'm not sure about the data structures in your code, but rowNorms() seems to be a matrix method. You might get more mileage out of a list of rows structure.
If you feel like getting into dplyr, you could find this vignette on row-wise operations helpful. Make sure you have the latest version of the package, as the vignette is based on dplyr 1.0.
The data.table package tends to yield the best performance for large data sets in R, but I'm not familiar with it, so I can't give you any further directions on that.
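If you go the C++ route mentioned above, the inner computation might look roughly like the following sketch. It is plain C++ with no Rcpp glue, the Matrix alias and the row-major layout are assumptions made for illustration, and it returns 0-based indices where which.min() in R is 1-based. It compares squared distances, which gives the same nearest centroid as the norm without taking a square root.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Row-major matrix: one inner vector per row (assumed layout for this sketch).
using Matrix = std::vector<std::vector<double>>;

// For each row of X, return the 0-based index of the nearest centroid.
// Ties go to the first centroid, as with which.min() in R.
std::vector<std::size_t> find_closest_centroids(const Matrix& X, const Matrix& centroids) {
    assert(!centroids.empty());
    std::vector<std::size_t> closest(X.size(), 0);
    for (std::size_t i = 0; i < X.size(); ++i) {
        double best = std::numeric_limits<double>::infinity();
        for (std::size_t k = 0; k < centroids.size(); ++k) {
            assert(centroids[k].size() == X[i].size());
            double d2 = 0.0;  // squared Euclidean distance
            for (std::size_t j = 0; j < X[i].size(); ++j) {
                const double diff = X[i][j] - centroids[k][j];
                d2 += diff * diff;
            }
            if (d2 < best) {
                best = d2;
                closest[i] = k;
            }
        }
    }
    return closest;
}
```

The point of the sketch is that no temporary matrix is allocated per row, which is what the sweep() call in the R loop does on every iteration.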

integer64 and Rcpp compatibility

I will need 64-bit integers in my package in the near future. I'm studying the feasibility based on the bit64 package. Basically I plan to have one or more columns in a data.table with an integer64 S3 class and I plan to pass this table to C++ functions using Rcpp.
The following nanotime example from the Rcpp gallery explains clearly how a vector of 64-bit ints is built upon a vector of doubles and explains how to create an integer64 object from C++ to R.
I'm now wondering how to deal with an integer64 from R to C++. I guess I can invert the principle.
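#include <Rcpp.h>
#include <cstdint>
#include <cstring>
#include <vector>

void useInt64(Rcpp::NumericVector v)
{
    // element count; use an integer type rather than double
    const std::size_t len = static_cast<std::size_t>(v.size());
    std::vector<int64_t> n(len);
    // transfers values 'keeping bits' but changing type
    // using reinterpret_cast would get us a warning
    if (len > 0) std::memcpy(n.data(), &(v[0]), len * sizeof(double));
    // use n in further computations
}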
Is that correct? Is there another way to do that? Can we use a wrapper as<std::vector<int64_t>>(v)? For this last question I guess the conversion is not based on a bit to bit copy.
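One way to check the idea outside R is a round trip in plain C++. The sketch below is an assumption-laden illustration, not the bit64 or Rcpp implementation: the helper names bits_to_int64 and int64_to_bits are hypothetical, and std::vector<double> stands in for the storage of a NumericVector. It copies the bytes in both directions with std::memcpy, and the value 2^53 + 1 shows why a value conversion (a cast) would not be equivalent.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

static_assert(sizeof(double) == sizeof(std::int64_t),
              "a bit-for-bit copy needs equal element widths");

// Reinterpret the bytes of each double as a 64-bit integer (no value conversion).
std::vector<std::int64_t> bits_to_int64(const std::vector<double>& v) {
    std::vector<std::int64_t> n(v.size());
    if (!v.empty()) {
        std::memcpy(n.data(), v.data(), v.size() * sizeof(double));
        // sanity check: the two buffers hold identical bytes
        assert(std::memcmp(n.data(), v.data(), v.size() * sizeof(double)) == 0);
    }
    return n;
}

// The reverse direction: store 64-bit integers in double-typed storage.
std::vector<double> int64_to_bits(const std::vector<std::int64_t>& n) {
    std::vector<double> v(n.size());
    if (!n.empty()) {
        std::memcpy(v.data(), n.data(), n.size() * sizeof(std::int64_t));
    }
    return v;
}
```

A cast such as static_cast<double>(x) converts the value and rounds anything beyond 2^53, whereas the byte copy keeps every bit. That suggests a value-converting wrapper would not give the same result as the memcpy approach.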

Efficient programming to overcome memory limit in R

I have a function that calculates an index in R for a matrix of binary data. The goal of this function is to calculate a person-fit index for binary response data called HT. It divides the covariance between the response vectors of two respondents (e.g. person i & j) by the maximum possible covariance between the two response patterns, which can be calculated using the mean of the response vectors (e.g. Bi). The function is:
fit <- function(Data){
  N <- dim(Data)[1]
  L <- dim(Data)[2]
  r <- rowSums(Data)
  p.cor.n <- (r/L) #proportion correct for each response pattern
  sig.ij <- var(t(Data),t(Data)) #covariance of response patterns
  diag(sig.ij) <- 0
  H.num <- apply(sig.ij,1,sum)
  H.denom1 <- matrix(p.cor.n,N,1) %*% matrix(1-p.cor.n,1,N) #Bi(1-Bj)
  H.denom2 <- matrix(1-p.cor.n,N,1) %*% matrix(p.cor.n,1,N) #(1-Bi)Bj
  H.denomm <- ifelse(H.denom1>H.denom2,H.denom2,H.denom1)
  diag(H.denomm) <- 0
  H.denom <- apply(H.denomm,1,sum)
  HT <- H.num / H.denom
  return(HT)
}
This function works fine with small matrices (e.g. 1000 by 20) but when I increased the number of rows (e.g. to 10000) I ran into a memory limit. The source of the problem is this line in the function:
H.denomm <- ifelse(H.denom1>H.denom2,H.denom2,H.denom1)
which selects the denominator for each response pattern. Is there any other way to rewrite this line so that it needs less memory?
P.S.: you can try data<-matrix(rbinom(200000,1,.7),10000,20).
Thanks.
Well here is one way you could shave a little time off. Overall I still think there might be a better theoretical answer in terms of the approach you take....But here goes. I wrote up an Rcpp function that specifically implements ifelse in the sense you use it in above. It only works for square matrices like in your example. BTW I wasn't really trying to optimize R ifelse because I'm pretty sure it already calls internal C functions. I was just curious if a C++ function designed to do exactly what you are trying to do and nothing more would be faster. I shaved 11 seconds off. (This selects the larger value).
C++ Function:
library(Rcpp)
library(inline)
code <-"
Rcpp::NumericMatrix x(xs);
Rcpp::NumericMatrix y(ys);
Rcpp::NumericMatrix ans (x.nrow(), y.ncol());
int ii, jj;
for (ii=0; ii < x.nrow(); ii++){
  for (jj=0; jj < x.ncol(); jj++){
    if(x(ii,jj) < y(ii,jj)){
      ans(ii,jj) = y(ii,jj);
    } else {
      ans(ii,jj) = x(ii,jj);
    }
  }
}
return(ans);"
matIfelse <- cxxfunction(signature(xs="numeric",ys="numeric"),
                         plugin="Rcpp",
                         body=code)
Now if you replace ifelse in your function above with matIfelse you can give it a try. For example:
H.denomm <- matIfelse(H.denom1,H.denom2)
# Time for old version to run with the matrix you suggested above matrix(rbinom(200000,1,.7),10000,20)
# user system elapsed
# 37.78 3.36 41.30
# Time to run with dedicated Rcpp function
# user system elapsed
# 28.25 0.96 30.22
Not bad: roughly 36% faster. Again, though, I don't claim that this is generally faster than ifelse, just in this very specific instance. Cheers
P.s. I forgot to mention that to use Rcpp you need to have Rtools installed and during the install make sure environment path variables are added for Rtools and gcc. On my machine those would look like: c:\Rtools\bin;c:\Rtools\gcc-4.6.3\bin
Edit:
I just noticed that you were running into memory problems... So I'm not sure if you are running a 32 or 64 bit machine, but you probably just need to allow R to increase the amount of RAM it can use. I'll assume you are running on 32 bit to be safe. So you should be able to let R take at least 2gigs of RAM. Give this a try: memory.limit(size=1900) size is in megabytes so I just went for 1.9 gigs just to be safe. I'd imagine this is plenty of memory for what you need.
Do you actually intend to do N x N independent ifelse(H.denom1>H.denom2,... operations?
H.denomm <- ifelse(H.denom1>H.denom2,H.denom2,H.denom1)
If you really do, look for a library or alternatively, a better decomposition.
If you told us in general terms what this code is trying to do, it would help us answer it.
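One possible decomposition, sketched here under assumptions, avoids the N x N matrices altogether. The function only ever uses the row sums of H.denomm, and each entry is the smaller of Bi(1-Bj) and (1-Bi)Bj, so the sums can be accumulated pair by pair. The sketch is plain C++ without Rcpp glue, the name denom_row_sums is hypothetical, and p stands for p.cor.n from the question. Memory use is O(N); the running time is still O(N^2).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// p[i] is the proportion correct for respondent i (p.cor.n in the question).
// Returns out[i] = sum over j != i of min(p[i] * (1 - p[j]), (1 - p[i]) * p[j]),
// i.e. the row sums of H.denomm with a zero diagonal, without building the matrix.
std::vector<double> denom_row_sums(const std::vector<double>& p) {
    const std::size_t n = p.size();
    std::vector<double> out(n, 0.0);
    for (std::size_t i = 0; i < n; ++i) {
        assert(p[i] >= 0.0 && p[i] <= 1.0);  // proportions are expected in [0, 1]
        double acc = 0.0;
        for (std::size_t j = 0; j < n; ++j) {
            if (j == i) continue;  // the diagonal is set to zero in the R code
            acc += std::min(p[i] * (1.0 - p[j]), (1.0 - p[i]) * p[j]);
        }
        out[i] = acc;
    }
    return out;
}
```

For two respondents with proportions 0.5 and 0.25, both row sums come out as 0.125, which matches taking the element-wise minimum of the two outer products by hand.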

How can I protect a matrix in R from being altered by Rcpp?

I am making a package containing two Rcpp functions. The first function is used for creating a matrix that will be used several times by the second function. The matrix is stored in R's global environment between calls to the two functions.
M <- myFirstRcpp(X)
P <- mySecondRcpp(M)
Depending on input parameters the second function will make changes to the input matrix (created by the first function) before calculating a vector from it (aFunction is the C++ inside mySecondRcpp()):
IntegerVector aFunction( SEXP Qin, SEXP param ) {
    NumericMatrix Q(Qin);
    // Some changes made to Q
    ...
    // return a vector generated from Q
}
My problem is that the changes done to the Q matrix inside the second Rcpp function also affect the copy of the matrix (M) residing in R's global environment.
How can I prevent Rcpp from altering the global environment of R without too much overhead?
Notes: The M matrix is ~2000x65000 in size. The problem occurs with R 3.0.2 and Rcpp 0.10.6 on Windows and Linux in 32 and 64 bit R.
That is a known and documented feature. We are being called from R via the interface
SEXP somefunction(SEXP a, SEXP b, ...)
so a pointer is being passed and changes to Q affect the outer object. That is a good thing as it makes the calls very fast -- no copies.
If you want distinct instances, use the clone() method as in
NumericMatrix Q = clone(Qin);
Another thing you can do from within R (e.g., when you cannot easily edit the Rcpp code) is to call a [ method on the R object reference. This forces R to pass a copy. For example,
M <- myFirstRcpp(X)
P <- mySecondRcpp(M[])
Now, M will not get altered by side-effects from mySecondRcpp().
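The aliasing behaviour described in this thread can be mimicked in plain C++. The sketch below is an analogy only: Handle is a hypothetical stand-in for an Rcpp vector, modelled as a thin wrapper around shared storage, and its clone() plays the role of Rcpp's clone(). It is not how Rcpp is implemented.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for an Rcpp vector: a thin handle onto shared storage.
class Handle {
public:
    explicit Handle(std::vector<double> values)
        : data_(std::make_shared<std::vector<double>>(std::move(values))) {}

    double& operator[](std::size_t i) {
        assert(i < data_->size());
        return (*data_)[i];
    }

    // Deep copy: the new handle owns separate storage.
    Handle clone() const { return Handle(*data_); }

private:
    std::shared_ptr<std::vector<double>> data_;
};

// Copying the handle copies only the pointer, so the caller's data changes.
void zero_first_shared(Handle q) { q[0] = 0.0; }

// Cloning first leaves the caller's data untouched.
void zero_first_cloned(Handle q) {
    Handle copy = q.clone();
    copy[0] = 0.0;
}
```

The trade-off is the same as in the answers above: sharing is fast because nothing is copied, while cloning a matrix of roughly 2000x65000 doubles costs a full copy on every call.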

Where can I learn how to write C code to speed up slow R functions? [closed]

What's the best resource for learning how to write C code for use with R? I know about the system and foreign language interfaces section of R extensions, but I find it pretty hard going. What are good resources (both online and offline) for writing C code for use with R?
To clarify, I don't want to learn how to write C code, I want to learn how to better integrate R and C. For example, how do I convert from a C integer vector to a R integer vector (or vice versa) or from a C scalar to an R vector?
Well there is the good old Use the source, Luke! --- R itself has plenty of (very efficient) C code one can study, and CRAN has hundreds of packages, some from authors you trust. That provides real, tested examples to study and adapt.
But as Josh suspected, I lean more towards C++ and hence Rcpp. It also has plenty of examples.
Edit: There were two books I found helpful:
The first one is Venables and Ripley's "S Programming" even though it is getting long in the tooth (and there have been rumours of a 2nd edition for years). At the time there was simply nothing else.
The second is Chambers' "Software for Data Analysis" which is much more recent and has a much nicer R-centric feel -- and two chapters on extending R. Both C and C++ get mentioned. Plus, John shreds me for what I did with digest so that alone is worth the price of admission.
That said, John is growing fond of Rcpp (and contributing) as he finds the match between R objects and C++ objects (via Rcpp) to be very natural -- and ReferenceClasses help there.
Edit 2: With Hadley's refocussed question, I very strongly urge you to consider C++. There is so much boilerplate nonsense you have to do with C---very tedious and very avoidable. Have a look at the Rcpp-introduction vignette. Another simple example is this blog post where I show that instead of worrying about 10% differences (in one of the Radford Neal examples) we can get eightyfold increases with C++ (on what is of course a contrived example).
Edit 3: There is complexity in that you may run into C++ errors that are, to put it mildly, hard to grok. But to just use Rcpp rather than to extend it, you should hardly ever need it. And while this cost is undeniable, it is far eclipsed by the benefit of simpler code, less boilerplate, no PROTECT/UNPROTECT, no memory management etc pp. Doug Bates just yesterday stated that he finds C++ and Rcpp to be much more like writing R than writing C++. YMMV and all that.
Hadley,
You can definitely write C++ code that is similar to C code.
I understand what you say about C++ being more complicated than C. This is if you want to master everything : objects, templates, STL, template meta programming, etc ... most people don't need these things and can just rely on others to do it. The implementation of Rcpp is very complicated, but just because you don't know how your fridge works, it does not mean you cannot open the door and grab fresh milk ...
From your many contributions to R, what strikes me is that you find R somewhat tedious (data manipulation, graphics, string manipulation, etc ...). Well, get prepared for many more surprises with the internal C API of R. This is very tedious.
From time to time, I read the R-exts or R-ints manuals. This helps. But most of the time, when I really want to find out about something, I go into the R source, and also in the source of packages written by e.g. Simon (there is usually lots to learn there).
Rcpp is designed to make these tedious aspects of the API go away.
You can judge for yourself what you find more complicated, obfuscated, etc ... based on a few examples. This function creates a character vector using the C API:
SEXP foobar(){
    SEXP ab;
    PROTECT(ab = allocVector(STRSXP, 2));
    SET_STRING_ELT( ab, 0, mkChar("foo") );
    SET_STRING_ELT( ab, 1, mkChar("bar") );
    UNPROTECT(1);
    return ab;  // the function is declared to return a SEXP
}
Using Rcpp, you can write the same function as:
SEXP foobar(){
    return Rcpp::CharacterVector::create( "foo", "bar" ) ;
}
or:
SEXP foobar(){
    Rcpp::CharacterVector res(2) ;
    res[0] = "foo" ;
    res[1] = "bar" ;
    return res ;
}
As Dirk said, there are other examples on the several vignettes. We also usually point people towards our unit tests because each of them test a very specific part of the code and are somewhat self explanatory.
I'm obviously biased here, but I would recommend getting familiar with Rcpp instead of learning the C API of R, and then coming to the mailing list if something is unclear or does not seem doable with Rcpp.
Anyway, end of the sales pitch.
I guess it all depends what sort of code you want to write eventually.
Romain
@hadley: unfortunately, I don't have specific resources in mind to help you get started on C++. I picked it up from Scott Meyers's books (Effective C++, More Effective C++, etc ...) but these are not really what one could call introductory.
We almost exclusively use the .Call interface to call C++ code. The rule is easy enough :
The C++ function must return an R object. All R objects are SEXP.
The C++ function takes between 0 and 65 R objects as input (again SEXP)
It must (not really, but we can save this for later) be declared with C linkage, either with extern "C" or the RcppExport alias that Rcpp defines.
So a .Call function gets declared like this in some header file:
#include <Rcpp.h>
RcppExport SEXP foo( SEXP x1, SEXP x2 ) ;
and implemented like this in a .cpp file :
SEXP foo( SEXP x1, SEXP x2 ){
...
}
There is not much more to know about the R API to be using Rcpp.
Most people only want to deal with numeric vectors in Rcpp. You do this with the NumericVector class. There are several ways to create a numeric vector :
From an existing object that you pass down from R:
SEXP foo( SEXP x_) {
Rcpp::NumericVector x( x_ ) ;
...
}
With given values using the ::create static function:
Rcpp::NumericVector x = Rcpp::NumericVector::create( 1.0, 2.0, 3.0 ) ;
Rcpp::NumericVector x = Rcpp::NumericVector::create(
_["a"] = 1.0,
_["b"] = 2.0,
_["c"] = 3
) ;
Of a given size:
Rcpp::NumericVector x( 10 ) ; // filled with 0.0
Rcpp::NumericVector x( 10, 2.0 ) ; // filled with 2.0
Then once you have a vector, the most useful thing is to extract one element from it. This is done with the operator[], with 0-based indexing, so for example summing values of a numeric vector goes something like this:
SEXP sum( SEXP x_ ){
    Rcpp::NumericVector x(x_) ;
    double res = 0.0 ;
    for( int i=0; i<x.size(); i++){
        res += x[i] ;
    }
    return Rcpp::wrap( res ) ;
}
But with Rcpp sugar we can do this much more nicely now:
using namespace Rcpp ;
SEXP sum( SEXP x_ ){
NumericVector x(x_) ;
double res = sum( x ) ;
return wrap( res ) ;
}
As I said before, it all depends on what sort of code you want to write. Look into what people do in packages that rely on Rcpp, check the vignettes, the unit tests, come back to us on the mailing list. We are always happy to help.
@jbremnant: That's right. Rcpp classes implement something close to the RAII pattern. When an Rcpp object is created, the constructor takes appropriate measures to ensure the underlying R object (SEXP) is protected from the garbage collector. The destructor withdraws the protection. This is explained in the Rcpp-introduction vignette. The underlying implementation relies on the R API functions R_PreserveObject and R_ReleaseObject.
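The shape of that pattern can be sketched in plain C++. This is an illustration under assumptions, not Rcpp's actual code: preserve_stub and release_stub are hypothetical stand-ins for R_PreserveObject and R_ReleaseObject that only count live protections, and Guard is a hypothetical class name.

```cpp
#include <cassert>

// Hypothetical stand-ins for R_PreserveObject / R_ReleaseObject:
// they only count how many protections are currently live.
static int g_protected = 0;

void preserve_stub() { ++g_protected; }

void release_stub() {
    assert(g_protected > 0);  // never release more than was preserved
    --g_protected;
}

int live_protections() { return g_protected; }

// RAII guard: protection is acquired in the constructor and withdrawn in the
// destructor, so it cannot be forgotten on any exit path, including exceptions.
class Guard {
public:
    Guard() { preserve_stub(); }
    ~Guard() { release_stub(); }
    Guard(const Guard&) = delete;             // one guard, one protection
    Guard& operator=(const Guard&) = delete;
};
```

Declaring a Guard in a block raises the count by one for the lifetime of that block and drops it again when the block ends, which is the behaviour that replaces the manual PROTECT/UNPROTECT bookkeeping of the C API.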
There is indeed a performance penalty due to C++ encapsulation. We try to keep this at a minimum with inlining, etc ... The penalty is small, and when you take into account the gain in terms of time it takes to write and maintain code, it is not that relevant.
Calling R functions from the Rcpp class Function is slower than directly calling eval with the C api. This is because we take precautions and wrap the function call into a tryCatch block so that we capture R errors and promote them to C++ exceptions so that they can be dealt with using the standard try/catch in C++.
Most people want to use vectors (especially NumericVector), and the penalty is very small with this class. The examples/ConvolveBenchmarks directory contains several variants of the notorious convolution function from R-exts and the vignette has benchmark results. It turns out that Rcpp makes it faster than the benchmark code that uses the R API.

Resources