Convert Rcpp::String from table to int for a mode function

I want to get the most frequent value (i.e. the mode) from an IntegerVector. I can use only the Rcpp sugar functions.
How do I convert the output from String to int?
My code:
// [[Rcpp::export]]
String pier(NumericVector x) {
IntegerVector wyniki;
int max;
wyniki = Rcpp::table(x);
max = which_max(wyniki);
CharacterVector wynik_nazwy = wyniki.attr("names");
String wynik = wynik_nazwy[max];
return wynik;
}
/***R
pier(c(3,2,2,2,2,4,4,5))
*/
Result:
> pier(c(3,2,2,2,2,4,4,5))
[1] "2"
It is correct, but I need the numeric value 2 instead of the string value "2" that I am currently receiving. Furthermore, I need to convert it in Rcpp and not after exporting the function to R.

If you are compiling as C++98, which appears to be the case since // [[Rcpp::plugins(cpp11)]] is not set, convert a string to an integer by passing the string's .c_str() to atoi(). (With C++11 enabled, std::stoi is also available.)
e.g.
std::string ex = "1";
int res = atoi(ex.c_str());
To simplify matters, the call to .c_str() does not need to be explicit in this case, as pointed out by @nrussell. This saves us from creating an intermediate std::string: we can pass the element of the CharacterVector straight to atoi().
Putting this together, we end up with the following:
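#include <Rcpp.h>
#include <cstdlib>  // atoi
using namespace Rcpp;
// [[Rcpp::export]]
int pier(NumericVector x) {
    // Frequency table; its names attribute holds the values as strings
    IntegerVector wyniki = Rcpp::table(x);
    // Position of the highest count
    int max = which_max(wyniki);
    CharacterVector wynik_nazwy = wyniki.attr("names");
    // Convert the winning name back to an int
    return atoi( wynik_nazwy[max] );
}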
Test:
pier(c(3,2,2,2,2,4,4,5))
# [1] 2
class(pier(c(3,2,2,2,2,4,4,5)))
# [1] "integer"
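One caveat with atoi() is that it returns 0 when the text is not a number, so a failed conversion cannot be told apart from a genuine "0". A minimal sketch of a stricter conversion in plain C++98, assuming you want to reject anything that is not entirely an integer; parse_int is a hypothetical helper name, not part of Rcpp:

```cpp
#include <cassert>
#include <cerrno>
#include <climits>
#include <cstdlib>

// Hypothetical helper (not part of Rcpp): parse a whole C string as an int.
// Returns true and stores the value in `out` only if every character was
// consumed and the value fits in an int.
bool parse_int(const char* text, int& out) {
    assert(text != 0);                        // precondition: a valid C string
    if (*text == '\0') return false;          // empty input is not a number
    char* end = 0;
    errno = 0;
    long value = std::strtol(text, &end, 10);
    if (*end != '\0') return false;           // unparsed characters, e.g. "2a"
    if (errno == ERANGE) return false;        // did not fit in a long
    if (value < INT_MIN || value > INT_MAX) return false;
    out = static_cast<int>(value);
    return true;
}
```

Inside the function above, the same check could be applied to the selected name before returning it, for instance to signal an error instead of silently returning 0.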

Related

Hash to string with given character set

The usual hash functions, e.g. from digest, create hex output. I want to create a hash with characters from a given set, e.g. [a-z,0-9]; no strong cryptographic security is required.
Using base64encode on a hashed string comes close, but the character set is fixed in that function.
An arbitrary character table requires ugly div/mod manipulation, so I decided to use a 32-character table that leaves out the easily confused characters (the table below has no l, o, 0 or 1):
#include <Rcpp.h>
using namespace Rcpp;
static const std::string base32_chars = "abcdefghijkmnpqrstuvwxyz23456789";
// [[Rcpp::export]]
String encode32(uint32_t hash_int, int length = 7)
{
String res;
std::ostringstream oss;
if (length > 7 || length < 1)
length = 7;
for (int i = 0; i < length; i++) {
oss << base32_chars[hash_int & 31];
hash_int = hash_int >> 5;
}
res = oss.str();
return res;
}
/*** R
print(encode32(digest::digest2int("Hellod")))
*/
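For comparison, the div/mod variant for an arbitrary character table is not much longer. This is only a sketch in plain C++: encode_with_alphabet is a hypothetical name, and it assumes the same "least significant digit first, fixed length" layout as encode32 above. With a 32-character table it produces the same digits as the mask-and-shift loop, since value % 32 equals value & 31 and value / 32 equals value >> 5.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper: write `value` in base alphabet.size(), using the
// characters of `alphabet` as digits, least significant digit first.
// The result always has exactly `length` characters.
std::string encode_with_alphabet(std::uint32_t value,
                                 const std::string& alphabet,
                                 int length) {
    assert(alphabet.size() >= 2 && length >= 1);
    const std::uint32_t base = static_cast<std::uint32_t>(alphabet.size());
    std::string out;
    out.reserve(static_cast<std::string::size_type>(length));
    for (int i = 0; i < length; ++i) {
        out += alphabet[value % base];   // current lowest digit
        value /= base;                   // drop that digit
    }
    return out;
}
```

Note that a short length truncates the value, so distinct hashes can map to the same string; that is the same trade-off the fixed length of 7 makes in encode32.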

Simple Rcpp function with try catch returning 'memory not mapped' error

Background
The function has the simple task of iterating over the elements of a factor and attempting to convert each element to a double, then an integer, and finally leaving it as a character. For each outcome the respective counter is increased. At the end, the string corresponding to the biggest counter is returned.
Rationale
This is mostly a learning example. I've come across a messy data.frame with some data I want to use saved as factors. The variables are in effect doubles, integers or strings. I want to bring them to those types. There are better ways to do this in base R, but this problem looks like a nice opportunity to learn more Rcpp.
Code
#include <Rcpp.h>
// [[Rcpp::plugins(cpp11)]]
//' @title Guess Vector Type
//'
//' @description Function analyses content of a factor vector and attempts to
//' guess the correct type.
//'
//' @param x A vector of factor class.
//'
//' @return A scalar string with class name.
//'
//' @export
//'
// [[Rcpp::export]]
Rcpp::String guess_vector_type(Rcpp::IntegerVector x) {
// Define counters for all types
int num_doubles = 0;
int num_integers = 0;
int num_strings = 0;
// Converted strings
double converted_double;
int converted_integer;
// Get character vector with levels
Rcpp::StringVector levels = x.attr("levels");
// Get integer vector with values
// Rcpp::String type = x.sexp_type();
// Returns integer vector type
// Use iterator: https://teuder.github.io/rcpp4everyone_en/280_iterator.html
for(Rcpp::IntegerVector::iterator it = x.begin(); it != x.end(); ++it) {
// Get [] for vector element
int index = std::distance(x.begin(), it);
// Get value of a specific vector element
int element = x[index];
// Convert to normal string
std::string temp = Rcpp::as<std::string>(levels[element]);
// Try converting to an integer
try
{
converted_integer = std::stoi(temp);
}
catch(...)
{
// Try converting to a double
try
{
// Convert to double
converted_double = std::stod(temp);
}
catch(...)
{
++num_integers;
}
++num_doubles;
}
++num_strings;
}
// Get max value of three variables
// https://stackoverflow.com/a/2233412/1655567
int max_val;
max_val = num_doubles > num_integers? (num_doubles > num_strings? num_doubles: num_strings): (num_integers > num_strings? num_integers: num_strings);
// Create results storage
Rcpp::String res;
// Check which value is matching max val
if (max_val == num_doubles) {
// Most converted to doubles
res = "double";
} else if (max_val == num_integers) {
res = "integer";
} else {
res = "character";
}
// Return results vector
return res;
}
Tests
test_factor <- as.factor(rep(letters, 3))
This should return the scalar string "character".
Error
guess_vector_type(test_factor)
*** caught segfault ***
address 0xe1000013, cause 'memory not mapped'
I understand this is similar to the problem discussed here, but it's not clear to me where the mistake is.
Updates
Following the comments, I've updated the function:
Rcpp::String guess_vector_type(Rcpp::IntegerVector x) {
// Define counters for all types
int num_doubles = 0;
int num_integers = 0;
int num_strings = 0;
// Converted strings
double converted_double;
// Flag for running more tests
bool is_number;
// Get character vector with levels
Rcpp::StringVector levels = x.attr("levels");
// Get integer vector with values
// Rcpp::String type = x.sexp_type();
// Returns integer vector type
// Use iterator: https://teuder.github.io/rcpp4everyone_en/280_iterator.html
for(Rcpp::IntegerVector::iterator it = x.begin(); it != x.end(); ++it) {
// Get [] for vector element
int index = std::distance(x.begin(), it);
// Get value of a specific vector element
int element = x[index];
// Convert to normal string
std::string temp = Rcpp::as<std::string>(levels[element - 1]);
// Reset number checking flag
is_number = 1;
// Attempt conversion to double
try {
converted_double = std::stod(temp);
} catch(...) {
// Conversion failed, increase string count
++num_strings;
// Do not run more test
is_number = 0;
}
// If number run more tests
if (is_number == 1) {
// Check if converted string is an integer
if(floor(converted_double) == converted_double) {
// Increase counter for integer
++num_integers;
} else {
// Increase count for doubles
++num_doubles;
}
}
}
// Get max value of three variables
// https://stackoverflow.com/a/2233412/1655567
int max_val;
max_val = num_doubles > num_integers? (num_doubles > num_strings? num_doubles: num_strings): (num_integers > num_strings? num_integers: num_strings);
// Create results storage
Rcpp::String res;
// Check which value is matching max val
if (max_val == num_doubles) {
// Most converted to doubles
res = "double";
} else if (max_val == num_integers) {
res = "integer";
} else {
res = "character";
}
// Return results vector
return res;
}
Tests
>> guess_vector_type(x = as.factor(letters))
[1] "character"
>> guess_vector_type(as.factor(1:10))
[1] "integer"
>> guess_vector_type(as.factor(runif(n = 1e3)))
[1] "double"
The problem causing your segfault is with this line
std::string temp = Rcpp::as<std::string>(levels[element]);
Since R is 1-indexed, you need
std::string temp = Rcpp::as<std::string>(levels[element - 1]);
However, I also noticed that your counters are incremented in the wrong places: the string counter belongs in the innermost catch, and the integer counter belongs after the outer try/catch. You also need a continue statement after each increment inside the catch blocks; otherwise a single element bumps several counters instead of only the one you want. Once you fix those things, the code runs as expected on the test case (but see the updates at the end regarding doubles vs. integers).
guess_vector_type(test_factor)
# [1] "character"
Full working code is
#include <Rcpp.h>
// [[Rcpp::plugins(cpp11)]]
//' @title Guess Vector Type
//'
//' @description Function analyses content of a factor vector and attempts to
//' guess the correct type.
//'
//' @param x A vector of factor class.
//'
//' @return A scalar string with class name.
//'
//' @export
//'
// [[Rcpp::export]]
Rcpp::String guess_vector_type(Rcpp::IntegerVector x) {
// Define counters for all types
int num_doubles = 0;
int num_integers = 0;
int num_strings = 0;
// Converted strings
double converted_double;
int converted_integer;
// Get character vector with levels
Rcpp::StringVector levels = x.attr("levels");
// Get integer vector with values
// Rcpp::String type = x.sexp_type();
// Returns integer vector type
// Use iterator: https://teuder.github.io/rcpp4everyone_en/280_iterator.html
for(Rcpp::IntegerVector::iterator it = x.begin(); it != x.end(); ++it) {
// Get [] for vector element
int index = std::distance(x.begin(), it);
// Get value of a specific vector element
int element = x[index];
// Convert to normal string
std::string temp = Rcpp::as<std::string>(levels[element - 1]);
// Try converting to an integer
try
{
converted_integer = std::stoi(temp);
}
catch(...)
{
// Try converting to a double
try
{
// Convert to double
converted_double = std::stod(temp);
}
catch(...)
{
++num_strings;
continue;
}
++num_doubles;
continue;
}
++num_integers;
}
// Get max value of three variables
// https://stackoverflow.com/a/2233412/1655567
int max_val;
max_val = num_doubles > num_integers? (num_doubles > num_strings? num_doubles: num_strings): (num_integers > num_strings? num_integers: num_strings);
// Create results storage
Rcpp::String res;
// Check which value is matching max val
if (max_val == num_doubles) {
// Most converted to doubles
res = "double";
} else if (max_val == num_integers) {
res = "integer";
} else {
res = "character";
}
// Return results vector
return res;
}
Updates
I tried it on some more examples and found that it doesn't work quite as expected for doubles: std::stoi stops at the first character it cannot parse, so it converts "42.18" to the integer 42 instead of throwing. It does cleanly discern between numbers (integers/doubles) and characters, though:
test_factor <- as.factor(rep(letters, 3))
guess_vector_type(test_factor)
# [1] "character"
test_factor <- as.factor(1:3)
guess_vector_type(test_factor)
# [1] "integer"
test_factor <- as.factor(c(letters, 1))
guess_vector_type(test_factor)
# [1] "character"
test_factor <- as.factor(c(1.234, 42.1138, "a"))
guess_vector_type(test_factor)
# [1] "integer"
In any event, that's an entirely separate issue from the one presented in the question, for which you may want to consult this Stack Overflow post, for example.
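For what it's worth, one way to separate integers from doubles is to use the optional second argument of std::stoi and std::stod, which reports how many characters were consumed, and to accept a conversion only if the whole string was used. A minimal sketch in plain C++11; classify and TokenKind are hypothetical names, and the per-element result would still have to be counted inside the loop as before:

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

enum TokenKind { kInteger, kDouble, kText };

// Hypothetical helper: classify one string, accepting a numeric conversion
// only when it consumes every character of the string.
TokenKind classify(const std::string& s) {
    std::size_t used = 0;
    try {
        (void)std::stoi(s, &used);
        if (used == s.size()) return kInteger;   // "42" yes, "42.18" no
    } catch (const std::exception&) {
        // std::invalid_argument (not a number) or std::out_of_range
    }
    try {
        (void)std::stod(s, &used);
        if (used == s.size()) return kDouble;    // "42.18", "1e3"
    } catch (const std::exception&) {
        // not a floating-point number either
    }
    return kText;
}
```

With this, "42.18" is a double, "7" is an integer and "12abc" is text. Be aware that std::stod also accepts strings such as "nan" and "inf", and that an integer too large for int ends up classified as a double.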

Rcpp memory management

I am trying to convert some character data to numeric as below. The data will come with special characters, so I have to strip them out. I convert the data to std::string to search for the special characters. Does that create a new variable in memory? I want to know if there is a better way to do it.
NumericVector converter_ra_(Rcpp::RObject x){
if(x.sexp_type() == STRSXP){
CharacterVector y(x);
NumericVector resultado(y.size());
for(unsigned int i = 0; i < y.size(); i++){
std::string ra_string = Rcpp::as<std::string>(y[i]);
//std::cout << ra_string << std::endl;
double t = 0;
int base = 0;
for(int j = (int)ra_string.size(); j >= 0; j--){
if(ra_string[j] >= 48 && ra_string[j] <= 57){
t += ((ra_string[j] - '0') * base_m[base]);
base++;
}
}
//std::cout << t << std::endl;
resultado[i] = t;
}
return resultado;
}else if(x.sexp_type() == REALSXP){
return NumericVector(x);
}
return NumericVector();
}
Does it create a new variable in memory?
If the input object actually is a numeric vector (REALSXP) and you are simply returning, e.g. as<NumericVector>(input), then no additional variables are created. In any other case new memory will, of course, need to be allocated for the returned object. For example,
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
NumericVector demo(RObject x) {
if (x.sexp_type() == REALSXP) {
return as<NumericVector>(x);
}
return NumericVector::create();
}
/*** R
y <- rnorm(3)
z <- letters[1:3]
data.table::address(y)
# [1] "0x6828398"
data.table::address(demo(y))
# [1] "0x6828398"
data.table::address(z)
# [1] "0x68286f8"
data.table::address(demo(z))
# [1] "0x5c7eea0"
*/
I want to know if there is a better way to do it.
First you need to define "better":
Faster?
Uses less memory?
Fewer lines of code?
More idiomatic?
Personally, I would start with the last definition since it often entails one or more of the others. For example, in this approach we
Define a function object Predicate that relies on the standard library function isdigit rather than trying to implement this locally
Define another function object, Converter, that uses the erase-remove idiom to eliminate characters as determined by Predicate and, if anything remains, uses std::atof to convert it into a double (again, instead of trying to implement this ourselves)
Use an Rcpp idiom -- the as converter -- to convert the STRSXP to a std::vector<std::string>
Call std::transform to convert this into the result vector
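#include <Rcpp.h>
#include <algorithm>  // std::remove_if, std::transform
#include <cctype>     // std::isdigit
#include <cstdlib>    // std::atof
#include <string>
#include <vector>
using namespace Rcpp;
struct Predicate {
// True for characters to drop; the cast keeps std::isdigit well-defined for negative char values
bool operator()(char c) const
{ return !(c == '.' || std::isdigit(static_cast<unsigned char>(c))); }
};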
struct Converter {
double operator()(std::string s) const {
s.erase(
std::remove_if(s.begin(), s.end(), Predicate()),
s.end()
);
return s.empty() ? NA_REAL : std::atof(s.c_str());
}
};
// [[Rcpp::export]]
NumericVector convert(RObject obj) {
if (obj.sexp_type() == REALSXP) {
return as<NumericVector>(obj);
}
if (obj.sexp_type() != STRSXP) {
return NumericVector::create();
}
std::vector<std::string> x = as<std::vector<std::string> >(obj);
NumericVector res(x.size(), NA_REAL);
std::transform(x.begin(), x.end(), res.begin(), Converter());
return res;
}
Testing this for minimal functionality,
x <- c("123 4", "abc 1567.35 def", "abcdef", "")
convert(x)
# [1] 1234.00 1567.35 NA NA
(y <- rnorm(3))
# [1] 1.04201552 -0.08965042 -0.88236960
convert(y)
# [1] 1.04201552 -0.08965042 -0.88236960
convert(list())
# numeric(0)
Will this be as performant as something hand-written by a seasoned C or C++ programmer? Almost certainly not. However, since we used library functions and common idioms, it is reasonably concise, likely to be bug-free, and the intention is fairly evident even at a quick glance. If you need something faster then there are probably a handful of optimizations to be made, but there's no need to begin on that premise without benchmarking and profiling first.
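If you would rather stay close to the hand-rolled digit loop from the question, here is a minimal sketch in plain C++. It assumes the intent is to fold every decimal digit into one number and ignore all other characters; digits_to_number is a hypothetical name. Walking the string left to right and multiplying the running total by ten replaces the base_m lookup table (which is not shown in the question) and avoids starting the loop at index size().

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: fold the decimal digits of `s` into one number,
// skipping every character that is not a digit.
double digits_to_number(const std::string& s) {
    double total = 0.0;
    for (std::string::size_type i = 0; i < s.size(); ++i) {
        const char c = s[i];
        if (c >= '0' && c <= '9') {
            // Shift the digits seen so far one decimal place to the left,
            // then add the new digit.
            total = total * 10.0 + (c - '0');
        }
    }
    return total;
}
```

Unlike the Converter above, this treats '.' like any other separator and returns 0 for input without digits, so pick whichever behaviour matches your data.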

Error: could not convert using R function : as.data.frame

I'm trying to read a text file in C++ and return it as a DataFrame. I have created a skeleton method for reading the file and returning it:
// [[Rcpp::export]]
DataFrame rcpp_hello_world(String fileName) {
int vsize = get_number_records(fileName);
CharacterVector field1 = CharacterVector(vsize+1);
std::ifstream in(fileName);
int i = 0;
string tmp;
while (!in.eof()) {
getline(in, tmp, '\n');
field1[i] = tmp;
tmp.clear( );
i++;
}
DataFrame df(field1);
return df;
}
I am running in R using:
> df <- rcpp_hello_world( "my_haproxy_logfile" )
However, R returns the following error:
Error: could not convert using R function : as.data.frame
What am I doing wrong?
Many thanks.
DataFrame objects are "special". Our preferred usage is via return Rcpp::DataFrame::create ..., which you will see in many of the posted examples, including in the many answers here.
Here is one from a Rcpp Gallery post:
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
DataFrame modifyDataFrame(DataFrame df) {
// access the columns
Rcpp::IntegerVector a = df["a"];
Rcpp::CharacterVector b = df["b"];
// make some changes
a[2] = 42;
b[1] = "foo";
// return a new data frame
return DataFrame::create(_["a"]= a, _["b"]= b);
}
While focussed on modifying a DataFrame, it shows you in passing how to create one. The _["a"] shortcut can also be written as Named("a") which I prefer.
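Separately from the DataFrame issue, the reading loop in the question tests in.eof() before each read, which typically yields one extra empty entry after the last line and relies on the record count being computed correctly in advance. A minimal sketch of the more common pattern in plain C++; read_lines is a hypothetical helper, and it takes any std::istream so it can be tried without a file:

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: collect every line of a stream. The loop condition
// is the read itself, so it stops cleanly once no further line is available.
std::vector<std::string> read_lines(std::istream& in) {
    std::vector<std::string> lines;
    std::string line;
    while (std::getline(in, line)) {
        lines.push_back(line);
    }
    return lines;
}
```

In the Rcpp function, the resulting std::vector<std::string> could then be converted with Rcpp::wrap and passed as a named column to DataFrame::create, as in the Gallery example above.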

How to access elements of a vector in a Rcpp::List

I am puzzled.
The following compiles and works fine:
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
List test(){
List l;
IntegerVector v(5, NA_INTEGER);
l.push_back(v);
return l;
}
In R:
R) test()
[[1]]
[1] NA NA NA NA NA
But when I try to set the IntegerVector in the list:
// [[Rcpp::export]]
List test(){
List l;
IntegerVector v(5, NA_INTEGER);
l.push_back(v);
l[0][1] = 1;
return l;
}
It does not compile:
test.cpp:121:8: error: invalid use of incomplete type 'struct SEXPREC'
C:/PROGRA~1/R/R-30~1.0/include/Rinternals.h:393:16: error: forward declaration of 'struct SEXPREC'
It is because of this line:
l[0][1] = 1;
The compiler has no idea that l is a list of integer vectors. In essence, l[0] gives you a SEXP (the generic type for all R objects), and SEXP is an opaque pointer to a SEXPREC, whose definition we don't have access to (hence opaque). So when you apply [1], you are attempting to get the second SEXPREC; the opacity makes that impossible, and it is not what you wanted anyway.
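The opaque-pointer situation can be reproduced outside of R. The following is only an analogy in plain C++, not a description of R's internals: OpaqueRec, Handle and IntPayload are hypothetical stand-ins for SEXPREC, SEXP and an integer vector.

```cpp
#include <cassert>
#include <vector>

struct OpaqueRec;              // declared but never defined: an incomplete type
typedef OpaqueRec* Handle;     // hypothetical stand-in for SEXP

struct IntPayload {            // what a handle may really point at
    std::vector<int> values;
};

// Hide the concrete type behind the opaque handle.
Handle to_handle(IntPayload* p) {
    return reinterpret_cast<Handle>(p);
}

// The caller has to say which concrete type it expects to find.
IntPayload& as_int_payload(Handle h) {
    assert(h != 0);
    return *reinterpret_cast<IntPayload*>(h);
}

// Given `Handle h`, the expression h[1] does not compile: indexing needs
// the size of OpaqueRec, which the compiler does not know.
```

Only after the explicit conversion does element access make sense, e.g. as_int_payload(h).values[1] = 1.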
You have to be specific that you are extracting an IntegerVector, so you can do something like this:
as<IntegerVector>(l[0])[1] = 1;
or
v[1] = 1 ;
or
IntegerVector x = l[0] ; x[1] = 1 ;
All of these options work on the same underlying data structure.
Alternatively, if you really wanted the syntax l[0][1] you could define your own data structure expressing "list of integer vectors". Here is a sketch:
template <class T>
class ListOf {
public:
ListOf( List data_) : data(data_){}
T operator[](int i){
return as<T>( data[i] ) ;
}
operator List(){ return data ; }
private:
List data ;
} ;
Which you can use, e.g. like this:
// [[Rcpp::export]]
List test2(){
ListOf<IntegerVector> l = List::create( IntegerVector(5, NA_INTEGER) ) ;
l[0][1] = 1 ;
return l;
}
Also note that using .push_back on Rcpp vectors (including lists) requires a complete copy of the list data, which can slow you down. Only use resizing functions when you don't have a choice.
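To get a feel for why that matters, here is a small model in plain C++ of "append by full copy". It is a sketch of the cost pattern described above, not of Rcpp's actual implementation; copies_when_appending is a hypothetical name.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Model of "append by full copy": each append allocates a buffer that is
// one element larger and copies all existing elements into it.
// Returns the number of element copies made while appending n values.
std::size_t copies_when_appending(std::size_t n) {
    std::vector<int> data;
    std::size_t copied = 0;
    for (std::size_t i = 0; i < n; ++i) {
        std::vector<int> bigger(data.size() + 1);
        std::copy(data.begin(), data.end(), bigger.begin());
        copied += data.size();               // everything so far was copied
        bigger.back() = static_cast<int>(i);
        data.swap(bigger);
    }
    return copied;
}
```

Appending n elements this way costs n * (n - 1) / 2 copies, so 1000 appends already mean 499500 element copies. When the final length is known, creating the container at that size once (for example List l(n) in Rcpp) and filling it by index avoids this.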
