Using columns of big.matrix in fisher.test in Rcpp

I have a very large binary big.matrix and also a vector of class assignment (same length as number of rows of big.matrix).
I want to be able to loop through each column of the big.matrix and output the p value for each fisher.test.
With a normal matrix object I can do the following, but converting my big.matrix into a matrix takes up over 5 GB of RAM.
p.value <- unlist(
lapply(
lapply(as.data.table(binarymatrix),
fisher.test,
y = class
), function(x) x$p.value
)
)
How can I do this without converting into a matrix object? As I understand it, accessing elements of a big.matrix requires C++ code, but I am not familiar with this at all.
The question "Rcpp: Is there an implementation fisher.test() in Rcpp" shows how to call fisher.test from Rcpp, but I am not sure how to pass each column of a matrix to it.
An example big.matrix would look like
library(bigmemory)
matrix <- matrix(sample(0:1, 100 * 10000, replace = TRUE), 100 , 10000)
bigmatrix <- as.big.matrix(matrix)
And my class variable looks like:
class <- sample( LETTERS[1:2], 100, replace=TRUE)
Thanks!
EDIT:
Here is the Rcpp code I have right now. If someone could help me figure out the issue I would really appreciate it.
// [[Rcpp::depends(RcppEigen, RcppArmadillo, bigmemory, BH)]]
#include <RcppArmadillo.h>
#include <RcppEigen.h>
#include <bigmemory/BigMatrix.h>
#include <bigmemory/MatrixAccessor.hpp>
using namespace Rcpp;
using namespace arma;
using namespace Eigen;
using namespace std;
// [[Rcpp::plugins(cpp11)]]
// [[Rcpp::export]]
ListOf<IntegerVector> AccessVector(SEXP pBigMat, int j, vector<int> status) {
XPtr<BigMatrix> xpMat(pBigMat);
MatrixAccessor<int> macc(*xpMat);
int n = xpMat->nrow();
// Bigmemory
cout << "Bigmemory:";
for (int i = 0; i < n; i++) {
cout << macc[j][i] << ' ';
}
cout << endl;
// STD VECTOR
vector<int> stdvec(macc[j], macc[j] + n);
// Obtain environment containing function
Rcpp::Environment base("package:stats");
// Make function callable from C++
Rcpp::Function fisher_test = base["fisher.test"];
// Call the function and receive its list output
Rcpp::List test_out = fisher_test(Rcpp::_["x"] = stdvec, Rcpp::_["y"] = status);
// Return test object in list structure
return test_out;
}
Ideally I want to be able to loop through each of the columns in C++ itself, and just output the p-values to R.
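A minimal sketch of that column loop, written in plain C++ rather than Rcpp so it stands alone. It assumes the data is a column-major 0/1 matrix and that the two class labels have been recoded to 0/1 (the name is_first_class is hypothetical). It computes a two-sided Fisher exact p-value for the 2x2 table of each column against the class, which is meant to mirror what fisher.test does for two binary vectors but is not a drop-in replacement for it. With bigmemory, the column pointer macc[j] from the code above could presumably play the role of col here.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Log of the binomial coefficient C(n, k), via lgamma to avoid overflow.
static double log_choose(int n, int k) {
    return std::lgamma(n + 1.0) - std::lgamma(k + 1.0) - std::lgamma(n - k + 1.0);
}

// Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]:
// sum the hypergeometric probabilities of every table with the same
// margins that is no more likely than the observed table.
double fisher_2x2(int a, int b, int c, int d) {
    const int r1 = a + b, r2 = c + d, c1 = a + c, n = r1 + r2;
    const double log_denom = log_choose(n, c1);
    auto log_prob = [&](int x) {
        return log_choose(r1, x) + log_choose(r2, c1 - x) - log_denom;
    };
    const int lo = std::max(0, c1 - r2), hi = std::min(r1, c1);
    const double observed = std::exp(log_prob(a));
    double p = 0.0;
    for (int x = lo; x <= hi; ++x) {
        const double px = std::exp(log_prob(x));
        if (px <= observed * (1.0 + 1e-7)) p += px;  // small tolerance for ties
    }
    return std::min(1.0, p);
}

// One p-value per column of a column-major 0/1 matrix (nrow x ncol).
// is_first_class is a hypothetical 0/1 recoding of the class labels.
std::vector<double> column_p_values(const std::vector<int>& data,
                                    std::size_t nrow, std::size_t ncol,
                                    const std::vector<int>& is_first_class) {
    assert(data.size() == nrow * ncol && is_first_class.size() == nrow);
    std::vector<double> out(ncol);
    for (std::size_t j = 0; j < ncol; ++j) {
        int t[2][2] = {{0, 0}, {0, 0}};
        const int* col = data.data() + j * nrow;  // start of column j
        for (std::size_t i = 0; i < nrow; ++i)
            ++t[col[i] != 0][is_first_class[i] != 0];
        out[j] = fisher_2x2(t[0][0], t[0][1], t[1][0], t[1][1]);
    }
    return out;
}
```

Only one column is touched at a time, so nothing the size of the full matrix is ever copied. Unlike fisher.test, this sketch returns 1 for a constant column instead of raising an error.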

Related

How do you convert object of class Eigen::MatrixXd to class Rcpp::NumericMatrix

I'm working on a package that requires some very fast matrix multiplication so looking to use RcppEigen. For a variety of reasons though having to do with the need for multidimensional arrays, I need to convert a created object of class Eigen::MatrixXd to class Rcpp::NumericMatrix.
I tried reversing the steps listed in RcppEigen::FastLm.cpp, but that doesn't seem to work
e.g. instead of using
const Map<MatrixXd> X(as<Map<MatrixXd> >(Xs));
I tried
Rcpp::NumericMatrix X(as<Rcpp::NumericMatrix>(Xs));
where Xs is a matrix of class Eigen::MatrixXd, but that didn't work:
error: no matching function for call to 'as'
return Rcpp::as<Rcpp::NumericMatrix>(z);
If this isn't at all possible I can try another direction.
Basically what I need to do in R speak is
a = matrix(1, nrow = 10, ncol = 10)
b = array(0, c(10,10,10))
b[,,1] = a
To give a clearer starting example
How would I go about storing an object of class MatrixXd in an object of class NumericMatrix?
#include <Rcpp.h>
#include <RcppEigen.h>
using namespace Rcpp;
using namespace Eigen;
// [[Rcpp::export]]
NumericMatrix sample_problem() {
Eigen::MatrixXd x(2, 2); x << 1,1,2,2;
Eigen::MatrixXd z(2, 2);
Eigen::MatrixXd y(2,2); y << 3,3,4,4;
z = x * y; // do some eigen matrix multiplication
Rcpp::NumericMatrix w(2,2);
// what I'd like to be able to do somehow:
// store the results of the eigen object z in
// a NumericMatrix w
// w = z;
return w;
}
Thanks for posting code! It makes everything easier. I just rearranged your code the tiniest bit.
The key change is to "explicitly" go back from the Eigen representation via an RcppEigen helper to a SEXP, and to then instantiate the matrix. Sometimes ... the compiler needs a little nudge.
Code
#include <Rcpp.h>
#include <RcppEigen.h>
// [[Rcpp::depends(RcppEigen)]]
// [[Rcpp::export]]
Rcpp::NumericMatrix sample_problem() {
Eigen::MatrixXd x(2, 2), y(2, 2);
x << 1,1,2,2;
y << 3,3,4,4;
// do some eigen matrix multiplication
Eigen::MatrixXd z = x * y;
// what I'd like to be able to do somehow:
// store the results of the eigen object z in
// a NumericMatrix w
// w = z;
SEXP s = Rcpp::wrap(z);
Rcpp::NumericMatrix w(s);
return w;
}
/*** R
sample_problem()
*/
Demo
R> sourceCpp("demo.cpp")
R> sample_problem()
     [,1] [,2]
[1,]    7    7
[2,]   14   14
R>
With g++-9 I need -Wno-ignored-attributes or I get screens and screens of warnings...
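For the array assignment the question ultimately wants (b[,,1] = a), it may help to see how a column-major 3-D array is laid out: element (i, j, k) sits at offset i + nrow * (j + ncol * k), so one slice is a contiguous block. The sketch below is plain C++ with hypothetical names (Array3, set_slice) and a 0-based slice index; it is an illustration of the layout, not RcppEigen API.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Column-major 3-D array: element (i, j, k) lives at i + nrow * (j + ncol * k).
struct Array3 {
    std::size_t nrow, ncol, nslice;
    std::vector<double> data;
    Array3(std::size_t r, std::size_t c, std::size_t s)
        : nrow(r), ncol(c), nslice(s), data(r * c * s, 0.0) {}
};

// Rough equivalent of b[, , k + 1] <- a in R, where a is a column-major
// nrow x ncol matrix and k is 0-based.
void set_slice(Array3& b, std::size_t k, const std::vector<double>& a) {
    assert(k < b.nslice && a.size() == b.nrow * b.ncol);
    const std::ptrdiff_t offset = static_cast<std::ptrdiff_t>(k * b.nrow * b.ncol);
    std::copy(a.begin(), a.end(), b.data.begin() + offset);
}
```

Eigen::MatrixXd is column-major by default, so the same block copy should apply to the storage behind z.data() in the answer above.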

efficiently sample from modified arma::vec object

I am using Rcpp to speed up some R code. However, I'm really struggling with types - since these are foreign in R. Here's a simplified version of what I'm trying to do:
#include <RcppArmadillo.h>
#include <algorithm>
//[[Rcpp::depends(RcppArmadillo)]]
using namespace Rcpp;
using namespace arma;
// [[Rcpp::export]]
NumericVector fun(SEXP Pk, int k, int i, const vec& a, const mat& D) {
// this is dummy version of my actual function - with actual arguments.;
// I'm guessing SEXP is going to need to be replaced with something else when it's called from C++ not R.;
return D.col(i);
}
// [[Rcpp::export]]
NumericVector f(const arma::vec& assignment, char k, int B, const mat& D) {
uvec k_ind = find(assignment == k);
NumericVector output(assignment.size()); // for dummy output.
uvec::iterator k_itr = k_ind.begin();
for(; k_itr != k_ind.end(); ++k_itr) {
// this is R code, as I don't know the best way to do this in C++;
k_rep = sample(c(assignment[assignment != k], -1), size = B, replace = TRUE);
output = fun(k_rep, k, *k_itr, assignment, D);
// do something with output;
}
// compile result, ultimately return a List (after I figure out how to do that. For right now, I'll cheat and return the last output);
return output;
}
The part I'm struggling with is the random sampling of assignment. I know that sample has been implemented in RcppArmadillo. However, I can see two approaches to this, and I'm not sure which is more efficient/doable.
Approach 1:
Make a table of the assignment values. Replace assignment == k with -1 and set its "count" equal to 1.
Sample from the table "names" with probability proportional to the count.
Approach 2:
Copy the relevant subset of the assignment vector into a new vector with an extra spot for -1.
Sample from the copied vector with equal probabilities.
I want to say that approach 1 would be more efficient, except that assignment is currently of type arma::vec, and I'm not sure how to make the table from that - or how much of a cost there is to converting it to a more-compatible format. I think I could implement Approach 2, but I'm hoping to avoid the expensive copy.
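Approach 1 can be sketched in plain C++ as below. This is an illustration under assumptions: assignment is taken to hold integer labels, the function name sample_weighted is hypothetical, and std::mt19937 with std::discrete_distribution stands in for R's random number generator, so the draws will not match R's sample.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <random>
#include <vector>

// Approach 1: tabulate the labels other than k, add the sentinel -1 with
// count 1, then draw B values with probability proportional to the counts.
std::vector<int> sample_weighted(const std::vector<int>& assignment, int k,
                                 std::size_t B, std::mt19937& rng) {
    std::map<int, std::size_t> counts;
    for (int a : assignment)
        if (a != k) ++counts[a];
    counts[-1] += 1;  // the extra -1 entry from c(assignment[assignment != k], -1)

    std::vector<int> values;
    std::vector<double> weights;
    for (const auto& kv : counts) {
        values.push_back(kv.first);
        weights.push_back(static_cast<double>(kv.second));
    }

    std::discrete_distribution<std::size_t> pick(weights.begin(), weights.end());
    std::vector<int> out(B);
    for (int& o : out) o = values[pick(rng)];
    return out;
}
```

The table has one entry per distinct label, so the only pass over assignment is the counting loop; no copy of the filtered vector is made.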
Thanks for any insights you can provide.
Many of your variable declarations are inconsistent with how you use them. For example, assignment == k cannot be compared meaningfully, because assignment holds real values and k is a char. Since the question is poorly written, I took the liberty of changing the variable types. This compiles:
// [[Rcpp::depends(RcppArmadillo)]]
#include <RcppArmadillo.h>
#include <RcppArmadilloExtensions/sample.h>
// [[Rcpp::export]]
arma::vec fun(const Rcpp::NumericVector& Pk, int k, unsigned int i, const arma::ivec& a, const arma::mat& D)
{
return D.col(i);
}
// [[Rcpp::export]]
Rcpp::NumericMatrix f(const arma::ivec& assignment, int k, unsigned int B, const arma::mat& D)
{
arma::uvec k_ind = find(assignment == k);
arma::ivec KK = assignment(find(assignment != k));
// These two lines build KK = c(assignment[assignment != k], -1).
// I don't know what the -1 is for; maybe you don't need it.
KK.insert_rows(KK.n_rows, 1);
KK(KK.n_rows - 1) = -1;
arma::uvec k_ind_not = find(assignment != k);
Rcpp::NumericVector k_rep(B);
arma::mat output(D.n_rows,k_ind.n_rows); // for dummy output.
for(unsigned int i =0; i < k_ind.n_rows ; i++)
{
k_rep = Rcpp::RcppArmadillo::sample(KK, B, true);
output(arma::span::all, i) = fun(k_rep, k, k_ind(i), assignment, D); // pass the matched index, as *k_itr did in the question
// do something with output;
}
// compile result, ultimately return a List (after I figure out how to do that. For right now, I'll cheat and return the last output);
return Rcpp::wrap(output);
}
This is not optimized (the question is too vague for that) and it is badly written, because I think R would be fast enough at finding the indices of a vector (so do that in R and implement only fun in Rcpp). It is not worth spending time here; there are other problems that need a solver implemented in Rcpp, not this index searching.
That said, this is not a useful question, as you are asking for an algorithm rather than for, say, a function signature.

Calling igraph from within Rcpp

As a part of utilizing network data drawn at random before further processing, I am trying to call a couple of functions from the igraph package at the beginning of each iteration. The code I use is as follows:
#define ARMA_64BIT_WORD
#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]
// [[Rcpp::plugins(cpp11)]]
using namespace Rcpp;
using arma::sp_mat;
// [[Rcpp::export]]
sp_mat adj_mat(int n, double p) {
Environment igraph("package:igraph");
Function game_er = igraph["erdos.renyi.game"];
Function get_adjacency = igraph["get.adjacency"];
List g = game_er(Named("n", n), Named("p", p));
NumericMatrix A_m = get_adjacency(Named("g", g));
sp_mat A = as<sp_mat>(A_m);
return A;
}
/*** R
set.seed(20130810)
library(igraph)
adj_mat(100, 0.5)
*/
So, while the C++ compiles without warnings, the following error is thrown:
> sourceCpp("Hooking-R-in-cpp.cpp")
> set.seed(20130810)
> library(igraph)
> adj_mat(100, 0.5)
Error in adj_mat(100, 0.5) :
Not compatible with requested type: [type=S4; target=double].
From the error it looks like I am passing a S4 class to a double? Where is the error?
You were imposing types in the middle of your C++ functions that did not correspond to the representation, so you got run-time errors trying to instantiate them.
The version below works. I don't know igraph well enough to suggest what else you could use to store the first return value; for the second, you could use the dgCMatrix matrix class, but S4 is an OK superset.
#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]
// [[Rcpp::plugins(cpp11)]]
using namespace Rcpp;
using arma::sp_mat;
// [[Rcpp::export]]
sp_mat adj_mat(int n, double p) {
Environment igraph("package:igraph");
Function game_er = igraph["erdos.renyi.game"];
Function get_adjacency = igraph["get.adjacency"];
SEXP g = game_er(Named("n", n), Named("p", p));
S4 A_m = get_adjacency(Named("g", g));
sp_mat A = as<sp_mat>(A_m);
return A;
}
/*** R
set.seed(20130810)
library(igraph)
adj_mat(100, 0.5)
*/
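To see why the sparse result cannot be read as a NumericMatrix directly: as far as I know, a dgCMatrix keeps its data in compressed-column form, with slots i (0-based row indices), p (column start offsets, length ncol + 1) and x (the stored values), rather than as one dense block of doubles. The plain C++ sketch below expands such data into a dense column-major matrix; the function name csc_to_dense is hypothetical and the code only illustrates the layout.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Expand compressed-sparse-column data into a dense column-major matrix.
//   i: 0-based row index of each stored value
//   p: start offset of each column in i and x (length ncol + 1)
//   x: the stored values
std::vector<double> csc_to_dense(std::size_t nrow, std::size_t ncol,
                                 const std::vector<int>& i,
                                 const std::vector<int>& p,
                                 const std::vector<double>& x) {
    assert(p.size() == ncol + 1 && i.size() == x.size());
    std::vector<double> dense(nrow * ncol, 0.0);
    for (std::size_t j = 0; j < ncol; ++j) {
        for (int s = p[j]; s < p[j + 1]; ++s) {
            const std::size_t idx = static_cast<std::size_t>(s);
            dense[static_cast<std::size_t>(i[idx]) + j * nrow] = x[idx];
        }
    }
    return dense;
}
```

For a 100-node graph the dense form is small, but converting with as<sp_mat> as in the answer keeps the result sparse.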

Rcpp gamma integral

I am trying to rewrite in Rcpp an original R function that uses the gamma function (with double input). The original source is below. When compiling with sourceCpp, the following error is raised: "no matching function for call to 'gamma(Rcpp::traits::storage_type<14>::type)'"
The gamma function should have been included in sugar (like mean, used below), so I expected it to be easy to call.
#include <Rcpp.h>
#include <math.h>
using namespace Rcpp;
// original R function
// function (y_pred, y_true)
// {
// eps <- 1e-15
// y_pred <- pmax(y_pred, eps)
// Poisson_LogLoss <- mean(log(gamma(y_true + 1)) + y_pred -
// log(y_pred) * y_true)
// return(Poisson_LogLoss)
// }
// [[Rcpp::export]]
double poissonLogLoss(NumericVector predicted, NumericVector actual) {
NumericVector temp, y_pred_new;
double out;
const double eps=1e-15;
y_pred_new=pmax(predicted,eps);
long n = predicted.size();
for (long i = 0; i < n; ++i) {
temp[i] = log( gamma(actual[i]+1)+y_pred_new[i]-log(y_pred_new[i])*actual[i]);
}
out=mean(temp); // using sugar implementation
return out;
}
You are making this too complicated, as the point of Rcpp Sugar is to work vectorized. The following compiles as well:
#include <Rcpp.h>
#include <math.h>
using namespace Rcpp;
// [[Rcpp::export]]
double poissonLogLoss(NumericVector predicted, NumericVector actual) {
NumericVector temp, y_pred_new;
double out;
const double eps=1e-15;
y_pred_new=pmax(predicted,eps);
temp = log(gamma(actual + 1)) + y_pred_new - log(y_pred_new)*actual;
out=mean(temp); // using sugar implementation
return out;
}
Now, you didn't supply any test data so I do not know if this computes correctly or not. Also, because your R expression is already vectorized, this will not be much faster.
Lastly, your compile error is likely due to the Sugar function gamma() expecting an Rcpp object whereas you provided a double.
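If an element-wise loop is still wanted, one option is to work on plain doubles. The sketch below is plain C++ rather than Rcpp and swaps in a different technique: it uses std::lgamma(y + 1) in place of log(gamma(y + 1)), which is the same quantity mathematically but avoids overflow for large counts. The function name poisson_log_loss is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Mean Poisson log loss: mean(lgamma(y + 1) + mu - log(mu) * y),
// with predictions clamped below at eps as in the R original.
double poisson_log_loss(const std::vector<double>& predicted,
                        const std::vector<double>& actual) {
    assert(!predicted.empty() && predicted.size() == actual.size());
    const double eps = 1e-15;
    double total = 0.0;
    for (std::size_t i = 0; i < predicted.size(); ++i) {
        const double mu = std::max(predicted[i], eps);
        total += std::lgamma(actual[i] + 1.0) + mu - std::log(mu) * actual[i];
    }
    return total / static_cast<double>(predicted.size());
}
```

For predictions {1, 2} and actual counts {1, 0} the two terms are 1 and 2, so the mean is 1.5.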

Error: could not convert using R function : as.data.frame

I'm trying to read a text file in C++ and return it as a DataFrame. I have created a skeleton method for reading the file and returning it:
// [[Rcpp::export]]
DataFrame rcpp_hello_world(String fileName) {
int vsize = get_number_records(fileName);
CharacterVector field1 = CharacterVector(vsize+1);
std::ifstream in(fileName);
int i = 0;
string tmp;
while (!in.eof()) {
getline(in, tmp, '\n');
field1[i] = tmp;
tmp.clear( );
i++;
}
DataFrame df(field1);
return df;
}
I am running in R using:
> df <- rcpp_hello_world( "my_haproxy_logfile" )
However, R returns the following error:
Error: could not convert using R function : as.data.frame
What am I doing wrong?
Many thanks.
DataFrame objects are "special". Our preferred usage is via return Rcpp::DataFrame::create ..., which you will see in many of the posted examples, including in the many answers here.
Here is one from a Rcpp Gallery post:
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
DataFrame modifyDataFrame(DataFrame df) {
// access the columns
Rcpp::IntegerVector a = df["a"];
Rcpp::CharacterVector b = df["b"];
// make some changes
a[2] = 42;
b[1] = "foo";
// return a new data frame
return DataFrame::create(_["a"]= a, _["b"]= b);
}
While focussed on modifying a DataFrame, it shows you in passing how to create one. The _["a"] shortcut can also be written as Named("a") which I prefer.
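Separately from the DataFrame issue, the read loop in the question tests in.eof() before reading, which typically yields one extra, empty record when the file ends with a newline. A minimal sketch of the usual idiom, in plain C++ with a hypothetical helper name read_lines, is to let std::getline itself control the loop:

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Read every line from a stream; the loop stops as soon as getline fails,
// so no trailing empty record is added and no line count is needed up front.
std::vector<std::string> read_lines(std::istream& in) {
    std::vector<std::string> lines;
    std::string line;
    while (std::getline(in, line)) {
        lines.push_back(line);
    }
    return lines;
}
```

Taking a std::istream& means the same function works with std::ifstream or std::istringstream. In Rcpp, the resulting vector could then be passed as a named column to DataFrame::create, as in the answer above.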
