Lasso solution with Rcpp: A self study

I'm very new to Rcpp. I'm trying to write a coordinate descent algorithm for the lasso in Rcpp as a self-study exercise. The code returns an error:
Mat::init(): requested size is not compatible with column vector layout
I also ran into some problems while writing the code:
(1) no matching function for call to 'sign'
(2) no matching function for call to 'ifelse'
(3) pow(X.col(j),2) : no viable conversion
As workarounds, I wrote
(S1>0)-(S1<0) for (1) to compute the sign of S1,
an if(){}else{} statement for (2), and
X.col(j)%X.col(j) for (3).
Any suggestions, please?
Here is the code.
// [[Rcpp::depends(RcppArmadillo)]]
#include <RcppArmadillo.h>
using namespace Rcpp;
using namespace arma;
// [[Rcpp::export]]
mat betamat(NumericMatrix Xr, NumericVector yr, NumericVector lambdar, double tol=0.0000001) {
int N = Xr.nrow(), p = Xr.ncol(), l = lambdar.size();
mat X(Xr.begin(), N, p, false);
colvec y(yr.begin(), yr.size(), false);
vec lambda(lambdar.begin(), lambdar.size(),false);
colvec ols = solve(X,y);
mat betas = zeros<mat>(p,l);
//
bool converged = false;
for (int i = 0; i < l; ++i) {
colvec b = zeros<vec>(p);
colvec r = y-X*b;
while(converged == false){
colvec beta_old = betas;
for(int j = 0; j < p; ++j){
r = r + X.col(j)*b(j);
double xr = dot(X.col(j),r);
double S1 = xr/N;
double xx = sum(X.col(j)%X.col(j))/N;
b(j) =((S1>0)-(S1<0))*(abs(S1)-lambda(i))/xx;
if(b(j)>0){
b(j)=b(j);
}else{
b(j)=0;
}
r = r - X.col(j)*b(j);
}
converged = (sum(abs(betas - beta_old)) < tol);
}
betas.col(i) = b;
}
return betas;
}
In R, I'm calling this with
library(Rcpp)
sourceCpp("filename.cpp")
set.seed(1)
X <- matrix(rnorm(100*3),100)
y <- rnorm(100)
coefficients <- betamat(X,y,seq(0,1,0.0005))
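A note on the code above, offered as a reading rather than a confirmed diagnosis: the Mat::init() message is consistent with the line colvec beta_old = betas;, which assigns the p-by-l matrix betas to a column vector (the current iterate b was probably intended). In addition, the if/else after the update keeps only positive coefficients, whereas the usual lasso update is the soft-thresholding operator sign(z) * max(|z| - lambda, 0). The sketch below shows that update in plain C++ without Armadillo, assuming the objective (1/2N)||y - Xb||^2 + lambda*||b||_1. The names soft_threshold and lasso_cd and the max_iter parameter are hypothetical and not part of the original code.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Soft-thresholding operator: sign(z) * max(|z| - lambda, 0).
double soft_threshold(double z, double lambda) {
    assert(lambda >= 0.0);
    if (z > lambda)  return z - lambda;
    if (z < -lambda) return z + lambda;
    return 0.0;
}

// Coordinate descent for a single lambda (hypothetical helper).
// X is stored column by column: X[j] is the j-th predictor, of length N.
std::vector<double> lasso_cd(const std::vector<std::vector<double>>& X,
                             const std::vector<double>& y,
                             double lambda, double tol = 1e-7,
                             int max_iter = 1000) {
    const std::size_t p = X.size(), N = y.size();
    std::vector<double> b(p, 0.0);
    std::vector<double> r(y);  // residual y - X*b, starting from b = 0
    for (int it = 0; it < max_iter; ++it) {
        double change = 0.0;   // total absolute change in this sweep
        for (std::size_t j = 0; j < p; ++j) {
            double xr = 0.0, xx = 0.0;
            for (std::size_t i = 0; i < N; ++i) {
                // dot product of column j with the partial residual r + x_j * b_j
                xr += X[j][i] * (r[i] + X[j][i] * b[j]);
                xx += X[j][i] * X[j][i];
            }
            if (xx == 0.0) continue;  // skip an all-zero column
            const double b_new = soft_threshold(xr / N, lambda) / (xx / N);
            for (std::size_t i = 0; i < N; ++i)
                r[i] -= X[j][i] * (b_new - b[j]);  // keep the residual in sync
            change += std::fabs(b_new - b[j]);
            b[j] = b_new;
        }
        if (change < tol) break;  // convergence is tested afresh for each lambda
    }
    return b;
}
```

Calling lasso_cd once per value of lambda and copying the result into column i of the output matrix would mirror the outer loop of the original function. The convergence flag is local to each call, so it cannot stay true from one lambda to the next.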

Related

Rcpp: how to combine the R function and Rcpp function together to make a package

Suppose I have the following c++ code in a file named test.cpp
#include <Rcpp.h>
//[[Rcpp::export]]
Rcpp::NumericMatrix MyAbar (const Rcpp::NumericMatrix & x, int T){
unsigned int outrows = x.nrow(), i = 0, j = 0;
double d;
Rcpp::NumericMatrix out(outrows,outrows);
// Rcpp::LogicalVector comp;
for (i = 0; i < outrows - 1; i++){
Rcpp::NumericVector v1 = x.row(i);
Rcpp::NumericVector ans(outrows);
for (j = i + 1; j < outrows ; j ++){
d = mean(Rcpp::runif( T ) < x(i,j));
out(j,i)=d;
out(i,j)=d;
}
}
return out;
}
I know with the following command, I can have my own package
Rcpp.package.skeleton("test",cpp_files = "~/Desktop/test.cpp")
However, what if I want to combine the following R function, which calls the Rcpp function, into the package?
random = function(A, T){
if (!is.matrix(A)){
A = Reduce("+",A)/T
}
# global constant and threshold
n = nrow(A)
B_0 = 3
w = min(sqrt(n),sqrt(T * log(n)))
q = B_0 * log(n) / (sqrt(n) * w)
A2 = MyAbar(A)
diag(A2) <- NA
K = A2 <= rowQuantiles(A2, probs=q, na.rm =TRUE)
diag(K) = FALSE
P = K %*% A * ( 1/(rowSums(K) + 1e-10))
return( (P + t(P))*0.5 )
}
How can I do this?
So you are asking how to make an R package? There are many good tutorials.
To a first approximation:
copy your file into, say, file R/random.R
deal with a help file for your function, either manually by writing man/random.Rd or by learning package roxygen2
make sure you know what NAMESPACE is for and that DESCRIPTION is right

Rcpp function complaining about uninitialized variables

In a very first attempt at creating a C++ function which can be called from R using Rcpp, I have a simple function to compute a minimum spanning tree from a distance matrix using Prim's algorithm. This function has been converted into C++ from a former version in ANSI C (which works fine).
Here it is:
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
DataFrame primlm(const int n, NumericMatrix d)
{
double const din = 9999999.e0;
long int i1, nc, nc1;
double dlarge, dtot;
NumericVector is, l, lp, dist;
l(1) = 1;
is(1) = 1;
for (int i=2; i <= n; i++) {
is(i) = 0;
}
for (int i=2; i <= n; i++) {
dlarge = din;
i1 = i - 1;
for (int j=1; j <= i1; j++) {
for (int k=1; k <= n; k++) {
if (l(j) == k)
continue;
if (d[l(j), k] > dlarge)
continue;
if (is(k) == 1)
continue;
nc = k;
nc1 = l(j);
dlarge = d(nc1, nc);
}
}
is(nc) = 1;
l(i) = nc;
lp(i) = nc1;
dist(i) = dlarge;
}
dtot = 0.e0;
for (int i=2; i <= n; i++){
dtot += dist(i);
}
return DataFrame::create(Named("l") = l,
Named("lp") = lp,
Named("dist") = dist,
Named("dtot") = dtot);
}
When I compile this function using Rcpp under RStudio, I get two warnings, complaining that variables 'nc' and 'nc1' have not been initialized. Frankly, I could not understand that, as it seems to me that both variables are being initialized inside the third loop. Also, why is there no similar complaint about variable 'i1'?
Perhaps it comes as no surprise that, when attempting to call this function from R, using the below code, what I get is a crash of the R system!
# Read test data
df <- read.csv("zygo.csv", header=TRUE)
lonlat <- data.frame(df$Longitude, df$Latitude)
colnames(lonlat) <- c("lon", "lat")
# Compute distance matrix using geosphere library
library(geosphere)
d <- distm(lonlat, lonlat, fun=distVincentyEllipsoid)
# Calls Prim minimum spanning tree routine via Rcpp
library(Rcpp)
sourceCpp("Prim.cpp")
n <- nrow(df)
p <- primlm(n, d)
Here is the dataset I use for testing purposes:
"Scientific name",Locality,Longitude,Latitude
Zygodontmys,Bush Bush Forest,-61.05,10.4
Zygodontmys,Cerro Azul,-79.4333333333,9.15
Zygodontmys,Dividive,-70.6666666667,9.53333333333
Zygodontmys,Hato El Frio,-63.1166666667,7.91666666667
Zygodontmys,Finca Vuelta Larga,-63.1166666667,10.55
Zygodontmys,Isla Cebaco,-81.1833333333,7.51666666667
Zygodontmys,Kayserberg Airstrip,-56.4833333333,3.1
Zygodontmys,Limao,-60.5,3.93333333333
Zygodontmys,Montijo Bay,-81.0166666667,7.66666666667
Zygodontmys,Parcela 200,-67.4333333333,8.93333333333
Zygodontmys,Rio Chico,-65.9666666667,10.3166666667
Zygodontmys,San Miguel Island,-78.9333333333,8.38333333333
Zygodontmys,Tukuko,-72.8666666667,9.83333333333
Zygodontmys,Urama,-68.4,10.6166666667
Zygodontmys,Valledup,-72.9833333333,10.6166666667
Could anyone give me a hint?
The initializations of nc and nc1 are never reached if, on every pass through the loops, one of the three if statements is true. It might be that this is not possible with your data, but the compiler has no way of knowing that.
However, this is not the reason for the crash. When I run your code I get:
Index out of bounds: [index=1; extent=0].
This comes from here:
NumericVector is, l, lp, dist;
l(1) = 1;
is(1) = 1;
When declaring a NumericVector you have to tell the required size if you want to assign values by index. In your case
NumericVector is(n), l(n), lp(n), dist(n);
might work. You have to analyze the C code carefully w.r.t. memory allocation and array boundaries.
Alternatively you could use the C code as is and use Rcpp to build a wrapper function, e.g.
#include <Rcpp.h>
using namespace Rcpp;
// One possibility for the function signature ...
double prim(const int n, double *d, double *l, double *lp, double *dist) {
....
}
// [[Rcpp::export]]
List primlm(NumericMatrix d) {
int n = d.nrow();
// std::array needs a compile-time size, so use vectors sized at run time
NumericVector l(n), lp(n), dist(n); // adjust sizes as needed!
double dtot = prim(n, d.begin(), l.begin(), lp.begin(), dist.begin());
return List::create(Named("l") = l,
Named("lp") = lp,
Named("dist") = dist,
Named("dtot") = dtot);
}
Notes:
I am returning a List instead of a DataFrame since dtot is a scalar value.
The above code is meant to illustrate the idea. Most likely it will not work without adjustments!
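To illustrate the advice about sizing vectors up front and watching array bounds, here is a minimal sketch of Prim's algorithm in plain C++ with 0-based indexing. It is the textbook O(n^2) formulation for a dense distance matrix, not a translation of the loops in the question. The struct MstResult and the function prim_mst are hypothetical names, and the matrix is assumed to be symmetric and stored row-major in a flat vector.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

struct MstResult {
    std::vector<int> parent;   // parent[v]: tree vertex that v attaches to (-1 for the root)
    std::vector<double> dist;  // dist[v]: length of the edge parent[v] - v (0 for the root)
    double total;              // total length of the tree
};

// d is a dense, symmetric n-by-n distance matrix in row-major order.
// Vertices are numbered 0..n-1 and vertex 0 is used as the root.
MstResult prim_mst(const std::vector<double>& d, std::size_t n) {
    assert(n > 0 && d.size() == n * n);
    const double inf = std::numeric_limits<double>::infinity();
    std::vector<bool> in_tree(n, false);
    std::vector<double> best(n, inf);  // cheapest known edge into the tree
    // Every vector is allocated with its final size before any indexing.
    MstResult res{std::vector<int>(n, -1), std::vector<double>(n, 0.0), 0.0};
    best[0] = 0.0;
    for (std::size_t step = 0; step < n; ++step) {
        // Pick the vertex outside the tree with the cheapest connecting edge.
        std::size_t u = n;
        for (std::size_t v = 0; v < n; ++v)
            if (!in_tree[v] && (u == n || best[v] < best[u])) u = v;
        in_tree[u] = true;
        res.dist[u] = best[u];
        res.total += best[u];
        // Relax the edges from u to the vertices still outside the tree.
        for (std::size_t v = 0; v < n; ++v) {
            if (!in_tree[v] && d[u * n + v] < best[v]) {
                best[v] = d[u * n + v];
                res.parent[v] = static_cast<int>(u);
            }
        }
    }
    return res;
}
```

Because u is chosen from the vertices still outside the tree, and at least one such vertex exists at every step, it is always assigned before use, so the compiler has nothing to warn about. Inside an Rcpp function, the same loops could run over NumericVector objects created with an explicit length n.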

Euclidean distance matrix performance between two shapes

The problem I am having is that I have to calculate a Euclidean distance matrix between shapes that can range from 20,000 up to 60,000 points, which produces 10-20 GB of data. I have to run this calculation thousands of times, so roughly 20 GB x 7,000 in total (each calculation uses a different point cloud). The shapes can be either 2D or 3D.
EDITED (Updated questions)
Is there a more efficient way to calculate the forward and backward distances without using two separate nested loops?
I know I could save the data matrix and calculate the minimum
distances in each direction, but then there is a huge memory issue
with large point clouds.
Is there a way to speed up this calculation and/or clean up the code to trim off time?
The irony is that I only need the matrix to calculate a very simple metric, but it requires the entire matrix to find that metric (Average Hausdorff distance).
Data example where each column represents a dimension of the shape and each row is a point in the shape:
first_configuration <- matrix(1:6,2,3)
second_configuration <- matrix(6:11,2,3)
colnames(first_configuration) <- c("x","y","z")
colnames(second_configuration) <- c("x","y","z")
This code calculates the Euclidean distances between coordinates:
m <- nrow(first_configuration)
n <- nrow(second_configuration)
D <- sqrt(pmax(matrix(rep(apply(first_configuration * first_configuration, 1, sum), n), m, n, byrow = F) + matrix(rep(apply(second_configuration * second_configuration, 1, sum), m), m, n, byrow = T) - 2 * first_configuration %*% t(second_configuration), 0))
D
Output:
[,1] [,2]
[1,] 8.660254 10.392305
[2,] 6.928203 8.660254
EDIT: included hausdorff average code
d1 <- mean(apply(D, 1, min))
d2 <- mean(apply(D, 2, min))
average_hausdorff <- mean(c(d1, d2)) # mean(d1, d2) would treat d2 as the trim argument
EDIT (Rcpp solution):
Here is my attempt to implement it in Rcpp so the matrix is never saved to memory. Working now but very slow.
sourceCpp(code='
#include <Rcpp.h>
#include <limits>
using namespace Rcpp;
// [[Rcpp::export]]
double edist_rcpp(NumericVector x, NumericVector y){
double d = sqrt( sum( pow(x - y, 2) ) );
return d;
}
// [[Rcpp::export]]
double avg_hausdorff_rcpp(NumericMatrix x, NumericMatrix y){
int nrowx = x.nrow();
int nrowy = y.nrow();
double new_low_x = std::numeric_limits<int>::max();
double new_low_y = std::numeric_limits<int>::max();
double mean_forward = 0;
double mean_backward = 0;
double mean_hd;
double td;
//forward
for(int i = 0; i < nrowx; i++) {
for(int j = 0; j < nrowy; j++) {
NumericVector v1 = x.row(i);
NumericVector v2 = y.row(j);
td = edist_rcpp(v1, v2);
if(td < new_low_x) {
new_low_x = td;
}
}
mean_forward = mean_forward + new_low_x;
new_low_x = std::numeric_limits<int>::max();
}
//backward
for(int i = 0; i < nrowy; i++) {
for(int j = 0; j < nrowx; j++) {
NumericVector v1 = y.row(i);
NumericVector v2 = x.row(j);
td = edist_rcpp(v1, v2);
if(td < new_low_y) {
new_low_y = td;
}
}
mean_backward = mean_backward + new_low_y;
new_low_y = std::numeric_limits<int>::max();
}
//hausdorff mean
mean_hd = (mean_forward / nrowx + mean_backward / nrowy) / 2;
return mean_hd;
}
')
EDIT (RcppParallel solution):
Definitely faster than the serial Rcpp solution and most certainly the R solution. If anyone has tips on how to improve my RcppParallel code to trim off some extra time it would be much appreciated!
sourceCpp(code='
#include <Rcpp.h>
#include <RcppParallel.h>
#include <limits>
// [[Rcpp::depends(RcppParallel)]]
struct minimum_euclidean_distances : public RcppParallel::Worker {
//Input
const RcppParallel::RMatrix<double> a;
const RcppParallel::RMatrix<double> b;
//Output
RcppParallel::RVector<double> medm;
minimum_euclidean_distances(const Rcpp::NumericMatrix a, const Rcpp::NumericMatrix b, Rcpp::NumericVector medm) : a(a), b(b), medm(medm) {}
void operator() (std::size_t begin, std::size_t end) {
for(std::size_t i = begin; i < end; i++) {
double new_low = std::numeric_limits<double>::max();
for(std::size_t j = 0; j < b.nrow(); j++) {
double dsum = 0;
for(std::size_t z = 0; z < b.ncol(); z++) {
dsum = dsum + pow(a(i,z) - b(j,z), 2);
}
dsum = pow(dsum, 0.5);
if(dsum < new_low) {
new_low = dsum;
}
}
medm[i] = new_low;
}
}
};
// [[Rcpp::export]]
double mean_directional_hausdorff_rcpp(Rcpp::NumericMatrix a, Rcpp::NumericMatrix b){
Rcpp::NumericVector medm(a.nrow());
minimum_euclidean_distances minimum_euclidean_distances(a, b, medm);
RcppParallel::parallelFor(0, a.nrow(), minimum_euclidean_distances);
double results = Rcpp::sum(medm);
results = results / a.nrow();
return results;
}
// [[Rcpp::export]]
double max_directional_hausdorff_rcpp(Rcpp::NumericMatrix a, Rcpp::NumericMatrix b){
Rcpp::NumericVector medm(a.nrow());
minimum_euclidean_distances minimum_euclidean_distances(a, b, medm);
RcppParallel::parallelFor(0, a.nrow(), minimum_euclidean_distances);
double results = Rcpp::max(medm);
return results;
}
')
Benchmarks using large point clouds of sizes 37,775 and 36,659:
# Rcpp serial solution
system.time(avg_hausdorff_rcpp(ll,rr))
user system elapsed
409.143 0.000 409.105
# RcppParallel solution
system.time(mean(c(mean_directional_hausdorff_rcpp(ll,rr), mean_directional_hausdorff_rcpp(rr,ll))))
user system elapsed
260.712 0.000 33.265
I tried using JuliaCall to calculate the average Hausdorff distance.
JuliaCall embeds Julia in R.
I only tried a serial solution in JuliaCall. It seems to be faster than both the RcppParallel and the serial Rcpp solutions in the question, but I don't have benchmark data. Since parallel computation is built into Julia, a parallel version should be possible to write without much difficulty. I will update my answer after finding that out.
Below is the julia file I wrote:
# Calculate the min distance from the k-th point in as to the points in bs
function min_dist(k, as, bs)
n = size(bs, 1)
p = size(bs, 2)
dist = Inf
for i in 1:n
r = 0.0
for j in 1:p
r += (as[k, j] - bs[i, j]) ^ 2
## if r is already greater than the upper bound,
## then there is no need to continue doing the calculation
if r > dist
break
end
end
if r < dist
dist = r
end
end
sqrt(dist)
end
function avg_min_dist_from(as, bs)
distsum = 0.0
n1 = size(as, 1)
for k in 1:n1
distsum += min_dist(k, as, bs)
end
distsum / n1
end
function hausdorff_avg_dist(as, bs)
(avg_min_dist_from(as, bs) + avg_min_dist_from(bs, as)) / 2
end
And this is the R code to use the julia function:
first_configuration <- matrix(1:6,2,3)
second_configuration <- matrix(6:11,2,3)
colnames(first_configuration) <- c("x","y","z")
colnames(second_configuration) <- c("x","y","z")
m <- nrow(first_configuration)
n <- nrow(second_configuration)
D <- sqrt(matrix(rep(apply(first_configuration * first_configuration, 1, sum), n), m, n, byrow = F) + matrix(rep(apply(second_configuration * second_configuration, 1, sum), m), m, n, byrow = T) - 2 * first_configuration %*% t(second_configuration))
D
d1 <- mean(apply(D, 1, min))
d2 <- mean(apply(D, 2, min))
average_hausdorff <- mean(c(d1, d2))
library(JuliaCall)
## the first time of julia_setup could be quite time consuming
julia_setup()
## source the julia file which has our hausdorff_avg_dist function
julia_source("hausdorff.jl")
## check if the julia function is correct with the example
average_hausdorff_julia <- julia_call("hausdorff_avg_dist",
first_configuration,
second_configuration)
## generate some large random point clouds
n1 <- 37775
n2 <- 36659
as <- matrix(rnorm(n1 * 3), n1, 3)
bs <- matrix(rnorm(n2 * 3), n2, 3)
system.time(julia_call("hausdorff_avg_dist", as, bs))
The time on my laptop was less than 20 seconds; note that this is the performance of the serial Julia version called through JuliaCall! I used the same data to test the serial Rcpp solution in the question, which took more than 10 minutes to run. I don't have RcppParallel on my laptop now, so I can't try that. And as I said, Julia has the built-in ability to do parallel computation.
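On the question of whether the forward and backward distances need two separate nested loops: one possible approach, sketched here in plain C++ and not benchmarked, is to visit each pair of points once and update both running minima. This needs only one extra vector of length nrow(b) instead of the full distance matrix. The function name avg_hausdorff and the flat row-major layout are assumptions for illustration. Squared distances are compared, and the square root is taken only once per point.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// a and b hold points in row-major order, dim coordinates per point.
// Returns the average of the two mean directional minimum distances.
double avg_hausdorff(const std::vector<double>& a,
                     const std::vector<double>& b,
                     std::size_t dim) {
    assert(dim > 0 && a.size() % dim == 0 && b.size() % dim == 0);
    const std::size_t na = a.size() / dim, nb = b.size() / dim;
    assert(na > 0 && nb > 0);
    const double inf = std::numeric_limits<double>::infinity();
    // min_b[j]: smallest squared distance from point j of b to any point of a.
    std::vector<double> min_b(nb, inf);
    double sum_a = 0.0;
    for (std::size_t i = 0; i < na; ++i) {
        double min_a = inf;  // smallest squared distance from point i of a to b
        for (std::size_t j = 0; j < nb; ++j) {
            double s = 0.0;
            for (std::size_t k = 0; k < dim; ++k) {
                const double diff = a[i * dim + k] - b[j * dim + k];
                s += diff * diff;
            }
            min_a = std::min(min_a, s);        // forward direction
            min_b[j] = std::min(min_b[j], s);  // backward direction, same pass
        }
        sum_a += std::sqrt(min_a);
    }
    double sum_b = 0.0;
    for (double m : min_b) sum_b += std::sqrt(m);
    return 0.5 * (sum_a / na + sum_b / nb);
}
```

This halves the number of distance evaluations compared with two separate passes. A parallel version would need a private copy of min_b per thread that is merged at the end, which is why it is shown serially here.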

Rcpp returns large negative number when 2 large positives are multiplied

I am creating a function that calculates area under the curve and when I take the 2 partials and multiply them for the numerator I exceed 2^31 and then a value like -2013386137 is used in the calculation.
Here are the cpp chunks
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
NumericVector sort_rcpp(NumericVector x) {
std::vector<double> tmp = Rcpp::as< std::vector<double> > (x);
std::sort(tmp.begin(), tmp.end());
return wrap(tmp);
}
// [[Rcpp::export]]
IntegerVector rank(NumericVector x) {
return match(x, sort_rcpp(x));
}
// [[Rcpp::export]]
double auc_(NumericVector actual, NumericVector predicted) {
double n = actual.size();
IntegerVector Ranks = rank(predicted);
int NPos = sum(actual == 1);
int NNeg = (actual.size() - NPos);
int sumranks = 0;
for(int i = 0; i < n; ++i) {
if (actual[i] == 1){
sumranks = sumranks + Ranks[i];
}
}
double p1 = (sumranks - NPos*( NPos + 1 ) / 2);
long double p2 = NPos*NNeg;
double auc = p1 / p2;
return auc ;
}
and then the test example that has the issue
N = 100000
Actual = as.numeric(runif(N) > .65)
Predicted = as.numeric(runif(N))
actual = Actual
predicted = Predicted
auc_(Actual, Predicted)
I am also putting this in an R package
devtools::install_github("JackStat/ModelMetrics")
N = 100000
Actual = as.numeric(runif(N) > .65)
Predicted = as.numeric(runif(N))
actual = Actual
predicted = Predicted
ModelMetrics::auc(Actual, Predicted)
You use int internally in your function which leads to overflow. Use a double and things look sunnier:
R> sourceCpp("/tmp/jackstat.cpp")
R> N <- 100000
R> Actual <- as.numeric(runif(N) > .65)
R> Predicted <- as.numeric(runif(N))
R> auc1(Actual, Predicted) # your function
[1] -0.558932
R> auc2(Actual, Predicted) # my variant using double
[1] 0.499922
R>
The complete corrected file is below:
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
NumericVector sort_rcpp(NumericVector x) {
std::vector<double> tmp = Rcpp::as< std::vector<double> > (x);
std::sort(tmp.begin(), tmp.end());
return wrap(tmp);
}
// [[Rcpp::export]]
IntegerVector rank(NumericVector x) {
return match(x, sort_rcpp(x));
}
// [[Rcpp::export]]
double auc1(NumericVector actual, NumericVector predicted) {
double n = actual.size();
IntegerVector Ranks = rank(predicted);
int NPos = sum(actual == 1);
int NNeg = (actual.size() - NPos);
int sumranks = 0;
for(int i = 0; i < n; ++i) {
if (actual[i] == 1){
sumranks = sumranks + Ranks[i];
}
}
double p1 = (sumranks - NPos*( NPos + 1 ) / 2);
long double p2 = NPos*NNeg;
double auc = p1 / p2;
return auc ;
}
// [[Rcpp::export]]
double auc2(NumericVector actual, NumericVector predicted) {
double n = actual.size();
IntegerVector Ranks = rank(predicted);
double NPos = sum(actual == 1);
double NNeg = (actual.size() - NPos);
double sumranks = 0;
for(int i = 0; i < n; ++i) {
if (actual[i] == 1){
sumranks = sumranks + Ranks[i];
}
}
double p1 = (sumranks - NPos*( NPos + 1 ) / 2);
long double p2 = NPos*NNeg;
double auc = p1 / p2;
return auc ;
}
/*** R
N <- 100000
Actual <- as.numeric(runif(N) > .65)
Predicted <- as.numeric(runif(N))
auc1(Actual, Predicted)
auc2(Actual, Predicted)
*/
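For reference, with N = 100000 and about 35% positives, NPos * NNeg is roughly 35000 * 65000 = 2,275,000,000, which is above the largest 32-bit int (2,147,483,647). In long double p2 = NPos*NNeg; the product is evaluated in int arithmetic first and only converted afterwards, so declaring p2 wider does not help. Besides switching to double, one can widen one operand before multiplying. The sketch below shows this in plain C++, assuming a 32-bit int as on common platforms. The helper names count_product and product_overflows_int are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Product of two non-negative counts, computed in 64-bit arithmetic.
// Casting one operand first makes the whole multiplication 64-bit.
std::int64_t count_product(int n_pos, int n_neg) {
    assert(n_pos >= 0 && n_neg >= 0);
    return static_cast<std::int64_t>(n_pos) * n_neg;
}

// Reports whether a * b would fall outside the range of int.
bool product_overflows_int(int a, int b) {
    const std::int64_t wide = static_cast<std::int64_t>(a) * b;
    return wide > std::numeric_limits<int>::max() ||
           wide < std::numeric_limits<int>::min();
}
```

In auc1, writing static_cast<double>(NPos) * NNeg for p2 would apply the same idea while keeping the counters as int. sumranks is a separate concern: it can also come close to the int limit for large N, so a wider type is the safer choice there too.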

Parallel computation of a quadratic term in Rcpp

Let Y and K be an n-dimensional (column) vector and n by n matrix, respectively. Think of Y and K as a sample vector and its covariance matrix.
Corresponding to each entry of Y (say $Y_i$) there is a row vector $s_i$ (of size 2) encoding the location of the sample in two-dimensional space. Construct the n by 2 matrix S by stacking all the $s_i$ vectors. The ij-th entry of K is of the form
$K_{ij} = f(|s_i - s_j|, b)$
in which |.| denotes the usual Euclidean norm, f is the covariance function and b represents the covariance parameters. For instance, for the powered exponential covariance we have $f(x) = \exp(-(|x|/r)^q)$ and b = (r, q).
The goal is to compute the following quantity in Rcpp, in parallel ($Y^T$ stands for Y transpose and $\|K\|^2$ denotes the sum of the squared entries of K):
$Y^T K Y / \|K\|^2$
Here is the piece of code I've written to do the job. While running, RStudio runs out of memory after a few seconds and the following message is displayed: "R encountered a fatal error. The session was terminated". I've only recently started using OpenMP in Rcpp and I have no idea why this happens! Can anybody tell me what I have done wrong here?
#include <Rcpp.h>
#include<math.h>
#include<omp.h>
// [[Rcpp::plugins(openmp)]]
using namespace Rcpp;
// [[Rcpp::export]]
double InnerProd(NumericVector x, NumericVector y) {
int n = x.size();
double total = 0;
for(int i = 0; i < n; ++i) {
total += x[i]*y[i];
}
return total;
}
// [[Rcpp::export]]
double CorFunc(double r, double range_param, double beta) {
double q,x;
x = r/range_param;
q = exp( -pow(x,beta) );
return(q);
}
// [[Rcpp::export]]
double VarianceComp( double range, NumericVector Y, NumericMatrix s, double beta, int t ){
int n,i,j;
double Numer = 0, Denom = 0, dist, CorVal, ObjVal;
NumericVector DistVec;
n = Y.size();
omp_set_num_threads(t);
# pragma omp parallel for private(DistVec,CorVal,dist,j) \
reduction(+:Numer,Denom)
for( i = 0; i < n; ++i) {
for( j = 0; j < n; ++j){
DistVec = ( s(i,_)-s(j,_) );
dist = sqrt( InnerProd(DistVec,DistVec) );
CorVal = CorFunc(dist,range,beta);
Numer += Y[i]*Y[j]*CorVal/n;
Denom += pow( CorVal, 2 )/n;
}
}
ObjVal = Numer/Denom;
return( ObjVal );
}
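One plausible cause, stated as an assumption and not a confirmed diagnosis: Rcpp objects and the R API are generally not safe to use from several threads at once, and the loop body above creates and assigns NumericVector objects (DistVec, the row views s(i,_), and the sugar expression) inside the parallel region. A common way around this is to touch only plain doubles inside the loop. The sketch below does that with std::vector inputs. The name variance_comp and the interleaved layout of s are hypothetical choices for illustration, and the division by n is dropped because it cancels in the ratio.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Y has length n. s stores the n two-dimensional locations as
// (x0, y0, x1, y1, ...). Returns Y' K Y divided by the sum of the squared
// entries of K, for the powered exponential covariance
// K_ij = exp(-(|s_i - s_j| / range)^beta).
double variance_comp(const std::vector<double>& Y,
                     const std::vector<double>& s,
                     double range, double beta) {
    const long n = static_cast<long>(Y.size());
    assert(n > 0 && s.size() == 2 * Y.size() && range > 0.0);
    double numer = 0.0, denom = 0.0;
    // Only scalars and read-only arrays are used inside the parallel region.
    // Without OpenMP support the pragma is ignored and the loop runs serially.
    #pragma omp parallel for reduction(+:numer, denom)
    for (long i = 0; i < n; ++i) {
        for (long j = 0; j < n; ++j) {
            const double dx = s[2 * i] - s[2 * j];
            const double dy = s[2 * i + 1] - s[2 * j + 1];
            const double dist = std::sqrt(dx * dx + dy * dy);
            const double k = std::exp(-std::pow(dist / range, beta));
            numer += Y[i] * Y[j] * k;
            denom += k * k;
        }
    }
    return numer / denom;
}
```

In an Rcpp function, the same loop could read from raw pointers obtained before the parallel region (for example via begin() on the NumericVector and NumericMatrix), keeping in mind that R matrices are stored column-major, so the indexing of s would have to change accordingly.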

Resources