I'm trying to read a text file in C++ and return it as a DataFrame. I have created a skeleton method for reading the file and returning it:
// [[Rcpp::export]]
DataFrame rcpp_hello_world(String fileName) {
int vsize = get_number_records(fileName);
CharacterVector field1 = CharacterVector(vsize+1);
std::ifstream in(fileName);
int i = 0;
string tmp;
while (!in.eof()) {
getline(in, tmp, '\n');
field1[i] = tmp;
tmp.clear( );
i++;
}
DataFrame df(field1);
return df;
}
I am running in R using:
> df <- rcpp_hello_world( "my_haproxy_logfile" )
However, R returns the following error:
Error: could not convert using R function : as.data.frame
What am I doing wrong?
Many thanks.
DataFrame objects are "special". Our preferred usage is via return Rcpp::DataFrame::create ... which you will see in many of the posted examples, including in the many answers here.
Here is one from a Rcpp Gallery post:
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
DataFrame modifyDataFrame(DataFrame df) {
// access the columns
Rcpp::IntegerVector a = df["a"];
Rcpp::CharacterVector b = df["b"];
// make some changes
a[2] = 42;
b[1] = "foo";
// return a new data frame
return DataFrame::create(_["a"]= a, _["b"]= b);
}
While focussed on modifying a DataFrame, it shows you in passing how to create one. The _["a"] shortcut can also be written as Named("a") which I prefer.
Related
I am using Rcpp to speed up some R code. However, I'm really struggling with types - since these are foreign in R. Here's a simplified version of what I'm trying to do:
#include <RcppArmadillo.h>
#include <algorithm>
//[[Rcpp::depends(RcppArmadillo)]]
using namespace Rcpp;
using namespace arma;
// [[Rcpp::export]]
NumericVector fun(SEXP Pk, int k, int i, const vec& a, const mat& D) {
// this is dummy version of my actual function - with actual arguments.;
// I'm guessing SEXP is going to need to be replaced with something else when it's called from C++ not R.;
return D.col(i);
}
// [[Rcpp::export]]
NumericVector f(const arma::vec& assignment, char k, int B, const mat& D) {
uvec k_ind = find(assignment == k);
NumericVector output(assignment.size()); // for dummy output.
uvec::iterator k_itr = k_ind.begin();
for(; k_itr != k_ind.end(); ++k_itr) {
// this is R code, as I don't know the best way to do this in C++;
k_rep = sample(c(assignment[assignment != k], -1), size = B, replace = TRUE);
output = fun(k_rep, k, *k_itr, assignment, D);
// do something with output;
}
// compile result, ultimately return a List (after I figure out how to do that. For right now, I'll cheat and return the last output);
return output;
}
The part I'm struggling with is the random sampling of assignment. I know that sample has been implemented in RcppArmadillo. However, I can see two approaches to this, and I'm not sure which is more efficient/doable.
Approach 1:
Make a table of the assignment values. Replace assignment == k with -1 and set its "count" equal to 1.
sample from the table "names" with probability proportional to the count.
Approach 2:
Copy the relevant subset of the assignment vector into a new vector with an extra spot for -1.
Sample from the copied vector with equal probabilities.
I want to say that approach 1 would be more efficient, except that assignment is currently of type arma::vec, and I'm not sure how to make the table from that - or how much of a cost there is to converting it to a more-compatible format. I think I could implement Approach 2, but I'm hoping to avoid the expensive copy.
Thanks for any insights you can provide.
Several of your variable declarations are inconsistent with how the variables are used. For example, assignment == k cannot be compared sensibly, because assignment holds real values and k is a char. Since the question is loosely written, I took the liberty of changing the variable types. This compiles:
// [[Rcpp::depends(RcppArmadillo)]]
#include <RcppArmadillo.h>
#include <RcppArmadilloExtensions/sample.h>
// [[Rcpp::export]]
arma::vec fun(const Rcpp::NumericVector& Pk, int k, unsigned int i, const arma::ivec& a, const arma::mat& D)
{
return D.col(i);
}
// [[Rcpp::export]]
Rcpp::NumericMatrix f(const arma::ivec& assignment, int k, unsigned int B, const arma::mat& D)
{
arma::uvec k_ind = find(assignment == k);
arma::ivec KK = assignment(find(assignment != k));
// the next two statements build KK = c(assignment[assignment != k], -1)
// I don't know what this -1 is for; maybe you don't need it.
KK.insert_rows(KK.n_rows, 1);
KK(KK.n_rows - 1) = -1;
arma::uvec k_ind_not = find(assignment != k);
Rcpp::NumericVector k_rep(B);
arma::mat output(D.n_rows,k_ind.n_rows); // for dummy output.
for(unsigned int i =0; i < k_ind.n_rows ; i++)
{
k_rep = Rcpp::RcppArmadillo::sample(KK, B, true);
// pass the matching index k_ind(i), as the question did with *k_itr, not the loop counter
output(arma::span::all, i) = fun(k_rep, k, k_ind(i), assignment, D);
// do something with output;
}
// compile result, ultimately return a List (after I figure out how to do that. For right now, I'll cheat and return the last output);
return Rcpp::wrap(output);
}
This is not optimized (the question is ill-posed) and it is not well written. I suspect R is fast enough at finding the indices of a vector, so you could do that part in R and implement only fun in Rcpp. It is not worth spending much time here; there are other problems that really need a solver implemented in Rcpp, unlike this index searching.
That said, this is not a very useful question as posed, because you are asking for an algorithm rather than for something concrete such as a function signature.
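For comparison, Approach 1 from the question (sampling from a table of counts rather than from a copied vector) can be sketched in plain C++. This is only an illustration under stated assumptions: the name sample_labels is hypothetical, the labels are taken to be integers, and it draws from the standard <random> engine rather than R's generator, so set.seed() would not control it.

```cpp
#include <cstddef>
#include <map>
#include <random>
#include <vector>

// Hypothetical sketch of Approach 1: tabulate the labels other than k,
// add one extra -1 entry with count 1, then draw B labels with
// probability proportional to the counts.
std::vector<int> sample_labels(const std::vector<int>& assignment, int k,
                               std::size_t B, std::mt19937& rng) {
    std::map<int, double> counts;
    for (int a : assignment) {
        if (a != k) counts[a] += 1.0;
    }
    counts[-1] += 1.0;  // mirrors c(assignment[assignment != k], -1)

    std::vector<int> labels;
    std::vector<double> weights;
    for (const auto& kv : counts) {
        labels.push_back(kv.first);
        weights.push_back(kv.second);
    }

    // discrete_distribution returns an index into `weights`
    std::discrete_distribution<int> pick(weights.begin(), weights.end());
    std::vector<int> out(B);
    for (int& o : out) {
        o = labels[static_cast<std::size_t>(pick(rng))];
    }
    return out;
}
```

The table has one entry per distinct label, so it avoids copying the whole assignment vector, at the cost of one pass to build the counts.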
I have a very large binary big.matrix and also a vector of class assignment (same length as number of rows of big.matrix).
I want to be able to loop through each column of the big.matrix and output the p value for each fisher.test.
With a normal matrix object, I can do the following, but converting my big.matrix into a matrix takes up over 5 GB of RAM.
p.value <- unlist(
lapply(
lapply(as.data.table(binarymatrix),
fisher.test,
y = class
), function(x) x$p.value
)
)
How can I do this without converting into a matrix object? As I understand it, accessing elements of a big.matrix requires C++ code, but I am not familiar with this at all.
The question "Rcpp: Is there an implementation fisher.test() in Rcpp" shows how to call fisher.test from Rcpp, but I am not sure how to pass each column of a matrix to it.
An example big.matrix would look like
library(bigmemory)
matrix <- matrix(sample(0:1, 100 * 10000, replace = TRUE), 100 , 10000)
bigmatrix <- as.big.matrix(matrix)
And my class variable looks like:
class <- sample( LETTERS[1:2], 100, replace=TRUE)
Thanks!
EDIT:
Here is the Rcpp code I have right now. If someone could help me figure out the issue I would really appreciate it.
// [[Rcpp::depends(RcppEigen, RcppArmadillo, bigmemory, BH)]]
#include <RcppArmadillo.h>
#include <RcppEigen.h>
#include <bigmemory/BigMatrix.h>
#include <bigmemory/MatrixAccessor.hpp>
using namespace Rcpp;
using namespace arma;
using namespace Eigen;
using namespace std;
// [[Rcpp::plugins(cpp11)]]
// [[Rcpp::export]]
ListOf<IntegerVector> AccessVector(SEXP pBigMat, int j, vector<int> status) {
XPtr<BigMatrix> xpMat(pBigMat);
MatrixAccessor<int> macc(*xpMat);
int n = xpMat->nrow();
// Bigmemory
cout << "Bigmemory:";
for (int i = 0; i < n; i++) {
cout << macc[j][i] << ' ';
}
cout << endl;
// STD VECTOR
vector<int> stdvec(macc[j], macc[j] + n);
// Obtain environment containing function
Rcpp::Environment base("package:stats");
// Make function callable from C++
Rcpp::Function fisher_test = base["fisher.test"];
// Call the function and receive its list output
Rcpp::List test_out = fisher_test(Rcpp::_["x"] = stdvec, Rcpp::_["y"] = status);
// Return test object in list structure
return test_out;
}
Ideally I want to be able to loop through each of the columns in C++ itself, and just output the p-values to R.
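One possible direction, not taken from this thread, is to skip the call back into R and compute the 2x2 Fisher exact p-value per column directly in C++. The sketch below assumes a 0/1 column, exactly two classes, and a matrix stored column-major in a flat vector (in the code above, macc[j] would play the role of the column pointer). The names fisher_2x2 and column_p_values are hypothetical, the tie tolerance is an assumption, and the results should be checked against fisher.test on a small sample before relying on them.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// log of the binomial coefficient C(n, k), computed via lgamma
static double log_choose(int n, int k) {
    return std::lgamma(n + 1.0) - std::lgamma(k + 1.0) - std::lgamma(n - k + 1.0);
}

// Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]:
// sum the hypergeometric probabilities of all tables with the same
// margins that are no more likely than the observed table.
double fisher_2x2(int a, int b, int c, int d) {
    const int r1 = a + b, r2 = c + d, c1 = a + c, n = r1 + r2;
    const int lo = std::max(0, c1 - r2), hi = std::min(r1, c1);
    const double log_denom = log_choose(n, c1);
    auto prob = [&](int x) {
        return std::exp(log_choose(r1, x) + log_choose(r2, c1 - x) - log_denom);
    };
    const double p_obs = prob(a);
    double p = 0.0;
    for (int x = lo; x <= hi; ++x) {
        const double px = prob(x);
        if (px <= p_obs * (1.0 + 1e-7)) p += px;  // assumed tolerance for ties
    }
    return std::min(1.0, p);
}

// One p-value per column of a column-major 0/1 matrix.
// is_class_a[i] marks the rows that belong to the first class.
std::vector<double> column_p_values(const std::vector<int>& data,
                                    std::size_t nrow, std::size_t ncol,
                                    const std::vector<bool>& is_class_a) {
    std::vector<double> out(ncol);
    for (std::size_t j = 0; j < ncol; ++j) {
        int t[2][2] = {{0, 0}, {0, 0}};  // t[value][class]
        for (std::size_t i = 0; i < nrow; ++i) {
            ++t[data[j * nrow + i] != 0 ? 1 : 0][is_class_a[i] ? 0 : 1];
        }
        out[j] = fisher_2x2(t[0][0], t[0][1], t[1][0], t[1][1]);
    }
    return out;
}
```

Only one 2x2 table is held in memory per column, so the big.matrix never has to be converted to an ordinary matrix.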
I am trying to convert some character data to numeric as below. The data will come with special characters, so I have to strip them out. I convert the data to std::string to search for the special characters. Does it create a new variable in memory? I want to know if there is a better way to do it.
NumericVector converter_ra_(Rcpp::RObject x){
if(x.sexp_type() == STRSXP){
CharacterVector y(x);
NumericVector resultado(y.size());
for(unsigned int i = 0; i < y.size(); i++){
std::string ra_string = Rcpp::as<std::string>(y[i]);
//std::cout << ra_string << std::endl;
double t = 0;
int base = 0;
for(int j = (int)ra_string.size(); j >= 0; j--){
if(ra_string[j] >= 48 && ra_string[j] <= 57){
t += ((ra_string[j] - '0') * base_m[base]);
base++;
}
}
//std::cout << t << std::endl;
resultado[i] = t;
}
return resultado;
}else if(x.sexp_type() == REALSXP){
return NumericVector(x);
}
return NumericVector();
}
Does it create a new variable in memory?
If the input object actually is a numeric vector (REALSXP) and you are simply returning, e.g. as<NumericVector>(input), then no additional variables are created. In any other case new memory will, of course, need to be allocated for the returned object. For example,
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
NumericVector demo(RObject x) {
if (x.sexp_type() == REALSXP) {
return as<NumericVector>(x);
}
return NumericVector::create();
}
/*** R
y <- rnorm(3)
z <- letters[1:3]
data.table::address(y)
# [1] "0x6828398"
data.table::address(demo(y))
# [1] "0x6828398"
data.table::address(z)
# [1] "0x68286f8"
data.table::address(demo(z))
# [1] "0x5c7eea0"
*/
I want to know if there is a better way to do it.
First you need to define "better":
Faster?
Uses less memory?
Fewer lines of code?
More idiomatic?
Personally, I would start with the last definition since it often entails one or more of the others. For example, in this approach we
Define a function object Predicate that relies on the standard library function isdigit rather than trying to implement this locally
Define another function object that uses the erase-remove idiom to eliminate characters as determined by Predicate and then uses std::atof to convert what remains into a double (again, instead of trying to implement this ourselves); an empty remainder becomes NA
Use an Rcpp idiom -- the as converter -- to convert the STRSXP to a std::vector<std::string>
Call std::transform to convert this into the result vector
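#include <Rcpp.h>
#include <algorithm> // std::remove_if, std::transform
#include <cctype> // std::isdigit
#include <cstdlib> // std::atof
#include <string>
#include <vector>
using namespace Rcpp;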
struct Predicate {
bool operator()(char c) const
{ return !(c == '.' || std::isdigit(static_cast<unsigned char>(c))); } // the cast avoids undefined behaviour for negative char values
};
struct Converter {
double operator()(std::string s) const {
s.erase(
std::remove_if(s.begin(), s.end(), Predicate()),
s.end()
);
return s.empty() ? NA_REAL : std::atof(s.c_str());
}
};
// [[Rcpp::export]]
NumericVector convert(RObject obj) {
if (obj.sexp_type() == REALSXP) {
return as<NumericVector>(obj);
}
if (obj.sexp_type() != STRSXP) {
return NumericVector::create();
}
std::vector<std::string> x = as<std::vector<std::string> >(obj);
NumericVector res(x.size(), NA_REAL);
std::transform(x.begin(), x.end(), res.begin(), Converter());
return res;
}
Testing this for minimal functionality,
x <- c("123 4", "abc 1567.35 def", "abcdef", "")
convert(x)
# [1] 1234.00 1567.35 NA NA
(y <- rnorm(3))
# [1] 1.04201552 -0.08965042 -0.88236960
convert(y)
# [1] 1.04201552 -0.08965042 -0.88236960
convert(list())
# numeric(0)
Will this be as performant as something hand-written by a seasoned C or C++ programmer? Almost certainly not. However, since we used library functions and common idioms, it is reasonably concise, likely to be bug-free, and the intention is fairly evident even at a quick glance. If you need something faster then there are probably a handful of optimizations to be made, but there's no need to begin on that premise without benchmarking and profiling first.
I want to get the most frequent value (i.e. the mode) from the IntegerVector. I can use only the Rcpp sugar functions.
How do I convert the output from String to int?
My code:
// [[Rcpp::export]]
String pier(NumericVector x) {
IntegerVector wyniki;
int max;
wyniki = Rcpp::table(x);
max = which_max(wyniki);
CharacterVector wynik_nazwy = wyniki.attr("names");
String wynik = wynik_nazwy[max];
return wynik;
}
/***R
pier(c(3,2,2,2,2,4,4,5))
*/
Result:
> pier(c(3,2,2,2,2,4,4,5))
[1] "2"
It is correct, but I need the numeric value 2 instead of the string value "2" that I am presently receiving. Furthermore, I need to convert it in Rcpp and not after exporting the function to R.
If you are using C++98, which looks like it is the case since // [[Rcpp::plugins(cpp11)]] was not defined, then to convert a string to an integer use the atoi() function and the string's .c_str() function.
e.g.
std::string ex = "1";
int res = atoi(ex.c_str());
To simplify matters, as pointed out by @nrussell, the explicit .c_str() call is not needed in this case: the element returned by indexing the CharacterVector can be passed to atoi() directly, which saves us from creating an intermediate std::string.
Therefore, having said this, we end up with the following:
// [[Rcpp::export]]
int pier(NumericVector x) {
IntegerVector wyniki;
int max;
wyniki = Rcpp::table(x);
max = which_max(wyniki);
CharacterVector wynik_nazwy = wyniki.attr("names");
return atoi( wynik_nazwy[max] );
}
Test:
pier(c(3,2,2,2,2,4,4,5))
# [1] 2
class(pier(c(3,2,2,2,2,4,4,5)))
# [1] "integer"
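One caveat with atoi() is that it reports no errors: it returns 0 when no conversion can be performed and silently stops at the first non-digit, so a name such as "2.5" would come back as 2. If that matters, a checked conversion with std::strtol is one option. The sketch below is plain C++98-compatible code and the helper name parse_int is hypothetical.

```cpp
#include <cerrno>
#include <climits>
#include <cstddef>
#include <cstdlib>

// Hypothetical helper: parse an entire C string as a base-10 int.
// Returns false (leaving `out` untouched) if the text is empty, has
// trailing characters, or does not fit in an int.
bool parse_int(const char* text, int& out) {
    char* end = NULL;
    errno = 0;
    const long value = std::strtol(text, &end, 10);
    if (end == text || *end != '\0') return false;  // nothing parsed, or leftovers
    if (errno == ERANGE || value < INT_MIN || value > INT_MAX) return false;
    out = static_cast<int>(value);
    return true;
}
```

With this helper, a caller can decide what to return (for example NA_INTEGER in Rcpp) when the table name is not a clean integer.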
I am puzzled.
The following compiles and works fine:
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
List test(){
List l;
IntegerVector v(5, NA_INTEGER);
l.push_back(v);
return l;
}
In R:
R) test()
[[1]]
[1] NA NA NA NA NA
But when I try to set the IntegerVector in the list:
// [[Rcpp::export]]
List test(){
List l;
IntegerVector v(5, NA_INTEGER);
l.push_back(v);
l[0][1] = 1;
return l;
}
It does not compile:
test.cpp:121:8: error: invalid use of incomplete type 'struct SEXPREC'
C:/PROGRA~1/R/R-30~1.0/include/Rinternals.h:393:16: error: forward declaration of 'struct SEXPREC'
It is because of this line:
l[0][1] = 1;
The compiler has no idea that l is a list of integer vectors. In essence l[0] gives you a SEXP (the generic type for all R objects), and SEXP is an opaque pointer to SEXPREC, whose definition we don't have access to (hence opaque). So when you apply [1], you attempt to get the second SEXPREC, which the opacity makes impossible, and which is not what you wanted anyway.
You have to be specific that you are extracting an IntegerVector, so you can do something like this:
as<IntegerVector>(l[0])[1] = 1;
or
v[1] = 1 ;
or
IntegerVector x = l[0] ; x[1] = 1 ;
All of these options work on the same underlying data structure.
Alternatively, if you really wanted the syntax l[0][1] you could define your own data structure expressing "list of integer vectors". Here is a sketch:
template <class T>
class ListOf {
public:
ListOf( List data_) : data(data_){}
T operator[](int i){
return as<T>( data[i] ) ;
}
operator List(){ return data ; }
private:
List data ;
} ;
Which you can use, e.g. like this:
// [[Rcpp::export]]
List test2(){
ListOf<IntegerVector> l = List::create( IntegerVector(5, NA_INTEGER) ) ;
l[0][1] = 1 ;
return l;
}
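It may look surprising that l[0][1] = 1 changes the list even though operator[] returns T by value. This works because an Rcpp vector is a lightweight handle to the underlying R data, so copying it does not copy the elements. Here is a plain C++ analogy using std::shared_ptr; IntHandle and HandleList are hypothetical stand-ins, not Rcpp classes.

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical stand-in for an Rcpp vector: copying an IntHandle
// copies the handle, not the elements it points to.
class IntHandle {
public:
    IntHandle(std::size_t n, int fill)
        : data_(std::make_shared<std::vector<int> >(n, fill)) {}
    int& operator[](std::size_t i) { return (*data_)[i]; }
    int operator[](std::size_t i) const { return (*data_)[i]; }
private:
    std::shared_ptr<std::vector<int> > data_;
};

// Hypothetical stand-in for the typed list sketched above.
class HandleList {
public:
    void add(const IntHandle& h) { items_.push_back(h); }
    // Returns by value, but the copy shares storage with the stored item.
    IntHandle operator[](std::size_t i) const { return items_[i]; }
private:
    std::vector<IntHandle> items_;
};
```

Writing through the temporary returned by the list's operator[] therefore updates the same storage that the original handle sees.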
Also note that using .push_back on Rcpp vectors (including lists) requires a complete copy of the list data, which can slow you down. Only use resizing functions when you don't have a choice.