Iterate over R object using SEXP in Rcpp - r

I'd like to subset an xts object in an Rcpp function and return the subset.
If the xts object has an index of class Date extracting the index via Rcpp corrupts the xts object -- see dirk's answer to this question, where he demonstrates that getting a pointer to the Date indices from the xts (what i call the SEXP approach) doesn't lead to corruption.
Say that i have a pointer s to the SEXP in Rcpp -- how do i iterate over the underlying object using that SEXP? Can it be done?
I'd like to iterate over the underlying object, and return a subset of that object.
The below R code does what I require:
set.seed(1)
require(xts)
xx_date <- xts(round(runif(100, min = 0, max = 20), 0),
order.by = seq.Date(Sys.Date(), by = "day", length.out = 100))
subXts_r <- function(Xts) {
i = 2
while( as.numeric(Xts[i, ]) != as.numeric(Xts[i-1, ])) {
if (i == nrow(Xts)) break else i = i+1
}
Xts[1:i,]
}
subXts_r(xx_date)
This Rcpp code also does what I want, but it uses a clone of the index (second line) to prevent corruption. My idea is to replace the second line with SEXP s = X.attr(\"index\") -- but I don't know how to iterate over s once I have it.
cppFunction("NumericVector subXts_cpp(NumericMatrix X) {
DatetimeVector v = clone(NumericVector(X.attr(\"index\"))); // need to clone else xx_date is corrupted
double * p_dt = v.begin() +1;
double * p_value = X.begin() +1;
while( (*p_value != *(p_value -1)) & (p_value < X.end())) {
p_value++;
p_dt++;
}
Rcpp::NumericVector toDoubleValue(X.begin(), p_value);
Rcpp::NumericVector toDoubleDate(v.begin(), p_dt);
int rows = toDoubleValue.size(); // find length of xts object
toDoubleDate.attr(\"tzone\") = \"UTC\"; // the index has attributes
CharacterVector t_class = CharacterVector::create(\"POSIXct\", \"POSIXt\");
toDoubleDate.attr(\"tclass\") = t_class;
// now modify dataVec to make into an xts
toDoubleValue.attr(\"dim\") = IntegerVector::create(rows,1);
toDoubleValue.attr(\"index\") = toDoubleDate;
CharacterVector d_class = CharacterVector::create(\"xts\", \"zoo\");
toDoubleValue.attr(\"class\") = d_class;
toDoubleValue.attr(\".indexCLASS\") = t_class;
toDoubleValue.attr(\"tclass\") = t_class;
toDoubleValue.attr(\".indexTZ\") = \"UTC\";
toDoubleValue.attr(\"tzone\") = \"UTC\";
return toDoubleValue;}")

Related

C5.0 package: Error in paste(apply(x, 1, paste, collapse = ","), collapse = "\n") : result would exceed 2^31-1 bytes

When trying to train a model with a dataset of around 3 million rows and 600 columns using the C5.0 CRAN package I get the following error:
Error in paste(apply(x, 1, paste, collapse = ","), collapse = "\n") : result would exceed 2^31-1 bytes
From what the owner of the repository answered to a similar issue, it is due to an R limitation in the number of bytes in a character string, which is limited to 2^31 - 1.
Long answer ahead:
So, as stated in the question, the error occurs in the last line of the makeDataFile function from the Cubist package, used in C5.0, which concatenates all rows into one string. As this string is needed to pass the data to the C5.0 function in C, but is not needed to make any operations in R, and C has no memory limitation aside from those of the machine itself, the approach I have taken is to create such string in C instead. In order to do this, the R code will pass the information in a character vector containing various strings that don’t surpass the length limit, instead of one, so that once in C these elements can be concatenated.
However, instead of leaving all rows as separate elements in the character vector to be concatenated in C using strcat in a loop, I have found that the strcat function is quite slow, so I have chosen to create another R function (create_max_len_strings) in order to concatenate the rows into the longest (~or close~) strings possible without reaching the memory limit so that strcat only needs to be applied a few times to concatenate these longer strings.
So, the last line of the original makeDataFile() function will be replaced so that each row is left separately as an element of a character vector, only adding a line break at the end of each string row so that when concatenating some of these elements into longer strings, using create_max_len_strings(), they will be differentiated:
makeDataFile.R:
create_max_len_strings <- function(original_vector) {
vector_length = length(original_vector)
nchars = sum(nchar(original_vector, type = "chars"))
## Check if the length of the string would reach 1900000000, which is close to the memory limitation
if(nchars >= 1900000000){
## Calculate how many strings we could create of the maximum length
nchunks = 0
while(nchars > 0){
nchars = nchars - 1900000000
nchunks = nchunks + 1
}
## Get the number of rows that would be contained in each string
chunk_size = vector_length/nchunks
## Get the rounded number of rows in each string
chunk_size = floor(chunk_size)
index = chunk_size
## Create a vector with the indexes of the rows that delimit each string
indexes_vector = c()
indexes_vector = append(indexes_vector, 0)
n = nchunks
while(n > 0){
indexes_vector = append(indexes_vector, index)
index = index + chunk_size
n = n - 1
}
## Get the last few rows if the division had remainder
remainder = vector_length %% nchunks
if (remainder != 0){
indexes_vector = append(indexes_vector, vector_length)
nchunks = nchunks + 1
}
## Create the strings pasting together the rows from the indexes in the indexes vector
strings_vector = c()
i = 2
while (i <= length(indexes_vector)){
## Sum 1 to the index_init so that the next string does not contain the last row of the previous string
index_init = indexes_vector[i-1] + 1
index_end = indexes_vector[i]
## Paste the rows from the vector from index_init to index_end
string <- paste0(original_vector[index_init:index_end], collapse="")
## Create vector containing the strings that were created
strings_vector <- append(strings_vector, string)
i = i + 1
}
}else {
strings_vector = paste0(original_vector, collapse="")
}
strings_vector
}
makeDataFile <- function(x, y, w = NULL) {
## Previous code stays the same
...
x = apply(x, 1, paste, collapse = ",")
x = paste(x, "\n", sep="")
char_vec = create_max_len_strings(x)
}
CALLING C5.0
Now, in order to create the final string to pass to the c50() function in C, an intermediate function is created and called instead. In order to do this, the .C() statement that calls c50() in R is replaced with a .Call() statement calling this function, as .Call() allows for complex objects such as vectors to be passed to C. Also, it allows for the result to be returned in the variable result instead of having to pass back the variables tree, rules and output by reference. The result of calling C5.0 will be received in the character vector result containing the strings corresponding to the tree, rules and output in the first three positions:
C5.0.R:
C5.0.default <- function(x,
y,
trials = 1,
rules = FALSE,
weights = NULL,
control = C5.0Control(),
costs = NULL,
...) {
## Previous code stays the same
...
dataString <- makeDataFile(x, y, weights)
num_chars = sum(nchar(dataString, type = "chars"))
result <- .Call(
"call_C50",
as.character(namesString),
dataString,
as.character(num_chars), ## The length of the resulting string is passed as character because it is too long for an integer
as.character(costString),
as.logical(control$subset),
# -s "use the Subset option" var name: SUBSET
as.logical(rules),
# -r "use the Ruleset option" var name: RULES
## for the bands option, I'm not sure what the default should be.
as.integer(control$bands),
# -u "sort rules by their utility into bands" var name: UTILITY
## The documentation has two options for boosting:
## -b use the Boosting option with 10 trials
## -t trials ditto with specified number of trial
## I think we should use -t
as.integer(trials),
# -t : " ditto with specified number of trial", var name: TRIALS
as.logical(control$winnow),
# -w "winnow attributes before constructing a classifier" var name: WINNOW
as.double(control$sample),
# -S : use a sample of x% for training
# and a disjoint sample for testing var name: SAMPLE
as.integer(control$seed),
# -I : set the sampling seed value
as.integer(control$noGlobalPruning),
# -g: "turn off the global tree pruning stage" var name: GLOBAL
as.double(control$CF),
# -c: "set the Pruning CF value" var name: CF
## Also, for the number of minimum cases, I'm not sure what the
## default should be. The code looks like it dynamically sets the
## value (as opposed to a static, universal integer
as.integer(control$minCases),
# -m : "set the Minimum cases" var name: MINITEMS
as.logical(control$fuzzyThreshold),
# -p "use the Fuzzy thresholds option" var name: PROBTHRESH
as.logical(control$earlyStopping)
)
## Get the first three positions of the character vector that contain the tree, rules and output returned by C5.0 in C
result_tree = result[1]
result_rules = result[2]
result_output = result[3]
modelContent <- strsplit(
if (rules)
result_rules
else
result_tree, "\n"
)[[1]]
entries <- grep("^entries", modelContent, value = TRUE)
if (length(entries) > 0) {
actual <- as.numeric(substring(entries, 10, nchar(entries) - 1))
} else
actual <- trials
if (trials > 1) {
boostResults <- getBoostResults(result_output)
## This next line is here to avoid a false positive warning in R
## CMD check:
## * checking R code for possible problems ... NOTE
## C5.0.default: no visible binding for global variable 'Data'
Data <- NULL
size <-
if (!is.null(boostResults))
subset(boostResults, Data == "Training Set")$Size
else
NA
} else {
boostResults <- NULL
size <- length(grep("[0-9])$", strsplit(result_output, "\n")[[1]]))
}
out <- list(
names = namesString,
cost = costString,
costMatrix = costs,
caseWeights = !is.null(weights),
control = control,
trials = c(Requested = trials, Actual = actual),
rbm = rules,
boostResults = boostResults,
size = size,
dims = dim(x),
call = funcCall,
levels = levels(y),
output = result_output,
tree = result_tree,
predictors = colnames(x),
rules = result_rules
)
class(out) <- "C5.0"
out
}
Now onto the C code, the function call_c50() basically acts as an intermediate between the R code and the C code, concatenating the elements in the dataString array to obtain the string needed by the C function c50(), by accessing each position of the array using CHAR(STRING_ELT(x, i)) and concatenating (strcat) them together. Then the rest of the variables are casted to their respective types and the c50() function in file top.c (where this function should also be placed) is called. The result of calling c50() will be returned to the R routine by creating a character vector and placing the strings corresponding to the tree, rules and output in each position.
Lastly, the c50() function is basically left as is, except for the variables treev, rulesv and outputv, as these are the values that are going to be returned by .Call() instead of being passed by reference, they no longer need to be in the arguments of the function. As they are all strings they can be returned in a single array, by setting each string to a position in the array c50_return.
top.c:
SEXP call_C50(SEXP namesString, SEXP data_vec, SEXP datavec_len, SEXP costString, SEXP subset, SEXP rules, SEXP bands, SEXP trials, SEXP winnow, SEXP sample,
SEXP seed, SEXP noGlobalPruning, SEXP CF, SEXP minCases, SEXP fuzzyThreshold, SEXP earlyStopping){
char* string;
char* concat;
long n = 0;
long size;
int i;
char* eptr;
// Get the length of the data vector
n = length(data_vec);
// Get the string indicating the length of the final string
char* size_str = malloc((strlen(CHAR(STRING_ELT(datavec_len, 0)))+1)*sizeof(char));
strcpy(size_str, CHAR(STRING_ELT(datavec_len, 0)));
// Turn the string to long
size = strtol(size_str, &eptr, 10);
// Allocate memory for the number of characters indicated by datavec_len
string = malloc((size+1)*sizeof(char));
// Copy the first element of data_vec into the string variable
strcpy(string, CHAR(STRING_ELT(data_vec, 0)));
// Loop over the data vector until all elements are concatenated in the string variable
for (i = 1; i < n; i++) {
strcat(string, CHAR(STRING_ELT(data_vec, i)));
}
// Copy the value of namesString into a char*
char* namesv = malloc((strlen(CHAR(STRING_ELT(namesString, 0)))+1)*sizeof(char));
strcpy(namesv, CHAR(STRING_ELT(namesString, 0)));
// Copy the value of costString into a char*
char* costv = malloc((strlen(CHAR(STRING_ELT(costString, 0)))+1)*sizeof(char));
strcpy(costv, CHAR(STRING_ELT(costString, 0)));
// Call c50() function casting the rest of arguments into their respective C types
char** c50_return = c50(namesv, string, costv, asLogical(subset), asLogical(rules), asInteger(bands), asInteger(trials), asLogical(winnow), asReal(sample), asInteger(seed), asInteger(noGlobalPruning), asReal(CF), asInteger(minCases), asLogical(fuzzyThreshold), asLogical(earlyStopping));
free(string);
free(namesv);
free(costv);
// Create a character vector to be returned to the C5.0 R function
SEXP out = PROTECT(allocVector(STRSXP, 3));
SET_STRING_ELT(out, 0, mkChar(c50_return[0]));
SET_STRING_ELT(out, 1, mkChar(c50_return[1]));
SET_STRING_ELT(out, 2, mkChar(c50_return[2]));
UNPROTECT(1);
return out;
}
static char** c50(char *namesv, char *datav, char *costv, int subset,
int rules, int utility, int trials, int winnow,
double sample, int seed, int noGlobalPruning, double CF,
int minCases, int fuzzyThreshold, int earlyStopping) {
int val; /* Used by setjmp/longjmp for implementing rbm_exit */
char ** c50_return = malloc(3 * sizeof(char*));
// Initialize the globals to the values that the c50
// program would have at the start of execution
initglobals();
// Set globals based on the arguments. This is analogous
// to parsing the command line in the c50 program.
setglobals(subset, rules, utility, trials, winnow, sample, seed,
noGlobalPruning, CF, minCases, fuzzyThreshold, earlyStopping,
costv);
// Handles the strbufv data structure
rbm_removeall();
// Deallocates memory allocated by NewCase.
// Not necessary since it's also called at the end of this function,
// but it doesn't hurt, and I'm feeling paranoid.
FreeCases();
// XXX Should this be controlled via an option?
// Rprintf("Calling setOf\n");
setOf();
// Create a strbuf using *namesv as the buffer.
// Note that this is a readonly strbuf since we can't
// extend *namesv.
STRBUF *sb_names = strbuf_create_full(namesv, strlen(namesv))
// Register this strbuf using the name "undefined.names"
if (rbm_register(sb_names, "undefined.names", 0) < 0) {
error("undefined.names already exists");
}
// Create a strbuf using *datav and register it as "undefined.data"
STRBUF *sb_datav = strbuf_create_full(datav, strlen(datav));
// XXX why is sb_datav copied? was that part of my debugging?
// XXX or is this the cause of the leak?
if (rbm_register(strbuf_copy(sb_datav), "undefined.data", 0) < 0) {
error("undefined data already exists");
}
// Create a strbuf using *costv and register it as "undefined.costs"
if (strlen(costv) > 0) {
// Rprintf("registering cost matrix: %s", *costv);
STRBUF *sb_costv = strbuf_create_full(costv, strlen(costv));
// XXX should sb_costv be copied?
if (rbm_register(sb_costv, "undefined.costs", 0) < 0) {
error("undefined.cost already exists");
}
} else {
// Rprintf("no cost matrix to register\n");
}
/*
* We need to initialize rbm_buf before calling any code that
* might call exit/rbm_exit.
*/
if ((val = setjmp(rbm_buf)) == 0) {
// Real work is done here
c50main();
if (rules == 0) {
// Get the contents of the the tree file
STRBUF *treebuf = rbm_lookup("undefined.tree");
if (treebuf != NULL) {
char *treeString = strbuf_getall(treebuf);
c50_return[0] = R_alloc(strlen(treeString) + 1, 1);
strcpy(c50_return[0], treeString);
c50_return[1] = "";
} else {
// XXX Should *treev be assigned something in this case?
// XXX Throw an error?
}
} else {
// Get the contents of the the rules file
STRBUF *rulesbuf = rbm_lookup("undefined.rules");
if (rulesbuf != NULL) {
char *rulesString = strbuf_getall(rulesbuf);
c50_return[1] = R_alloc(strlen(rulesString) + 1, 1);
strcpy(c50_return[1], rulesString);
c50_return[0] = "";
} else {
// XXX Should *rulesv be assigned something in this case?
// XXX Throw an error?
}
}
} else {
Rprintf("c50 code called exit with value %d\n", val - JMP_OFFSET);
}
// Close file object "Of", and return its contents via argument outputv
char *outputString = closeOf();
c50_return[2] = R_alloc(strlen(outputString) + 1, 1);
strcpy(c50_return[2], outputString);
// Deallocates memory allocated by NewCase
FreeCases();
// We reinitialize the globals on exit out of general paranoia
initglobals();
return c50_return;
}
***IMPORTANT: if the string created is longer than 2147483647, you also will need to change the definition of the variables i and j in the function strbuf_gets() in strbuf.c. This function basically iterates through each position of the string, so trying to increase their value above the INT limit to access those positions in the array will cause a segmentation fault. I suggest changing the declaration type to long in order to avoid this issue.
C5.0 PREDICTIONS
However, as the makeDataFile function is not only used to create the model but also to pass the data to the predictions() function, this function will also have to be modified. Just like previously, the .C() statement in predict.C5.0() used to call predictions() will be replaced with a .Call() statement in order to be able to pass the character vector to C, and the result will be returned in the result variable instead of being passed by reference:
predict.C5.0.R:
predict.C5.0 <- function (object,
newdata = NULL,
trials = object$trials["Actual"],
type = "class",
na.action = na.pass,
...) {
## Previous code stays the same
...
caseString <- makeDataFile(x = newdata, y = NULL)
num_chars = sum(nchar(caseString, type = "chars"))
## When passing trials to the C code, convert to
## zero if the original version of trials is used
if (trials <= 0)
stop("'trials should be a positive integer", call. = FALSE)
if (trials == object$trials["Actual"])
trials <- 0
## Add trials (not object$trials) as an argument
results <- .Call(
"call_predictions",
caseString,
as.character(num_chars),
as.character(object$names),
as.character(object$tree),
as.character(object$rules),
as.character(object$cost),
pred = integer(nrow(newdata)),
confidence = double(length(object$levels) * nrow(newdata)),
trials = as.integer(trials)
)
predictions = as.numeric(unlist(results[1]))
confidence = as.numeric(unlist(results[2]))
output = as.character(results[3])
if(any(grepl("Error limit exceeded", output)))
stop(output, call. = FALSE)
if (type == "class") {
out <- factor(object$levels[predictions], levels = object$levels)
} else {
out <-
matrix(confidence,
ncol = length(object$levels),
byrow = TRUE)
if (!is.null(rownames(newdata)))
rownames(out) <- rownames(newdata)
colnames(out) <- object$levels
}
out
}
In the file top.c, the predictions() function will be modified to receive the variables passed by the .Call() statement, so that just like previously, the caseString array will be concatenated into a single string and the rest of the variables casted to their respective types. In this case the variables pred and confidence will be also received as vectors of integer and double types and so they will need to be casted to int* and double*. The rest of the function is left as it was in order to create the predictions and the resulting variables predv, confidencev and output variables will be placed in the first three positions of a vector respectively.
top.c:
SEXP call_predictions(SEXP caseString, SEXP case_len, SEXP names, SEXP tree, SEXP rules, SEXP cost, SEXP pred, SEXP confidence, SEXP trials){
char* casev;
char* outputv = "";
char* eptr;
char* size_str = malloc((strlen(CHAR(STRING_ELT(case_len, 0)))+1)*sizeof(char));
strcpy(size_str, CHAR(STRING_ELT(case_len, 0)));
long size = strtol(size_str, &eptr, 10);
casev = malloc((size+1)*sizeof(char));
strcpy(casev, CHAR(STRING_ELT(caseString, 0)));
int n = length(caseString);
for (int i = 1; i < n; i++) {
strcat(casev, CHAR(STRING_ELT(caseString, i)));
}
char* namesv = malloc((strlen(CHAR(STRING_ELT(names, 0)))+1)*sizeof(char));
strcpy(namesv, CHAR(STRING_ELT(names, 0)));
char* treev = malloc((strlen(CHAR(STRING_ELT(tree, 0)))+1)*sizeof(char));
strcpy(treev, CHAR(STRING_ELT(tree, 0)));
char* rulesv = malloc((strlen(CHAR(STRING_ELT(rules, 0)))+1)*sizeof(char));
strcpy(rulesv, CHAR(STRING_ELT(rules, 0)));
char* costv = malloc((strlen(CHAR(STRING_ELT(cost, 0)))+1)*sizeof(char));
strcpy(costv, CHAR(STRING_ELT(cost, 0)));
int variable;
int* predv = &variable;
int npred = length(pred);
predv = malloc((npred+1)*sizeof(int));
for (int i = 0; i < npred; i++) {
predv[i] = INTEGER(pred)[i];
}
double variable1;
double* confidencev = &variable1;
int nconf = length(confidence);
confidencev = malloc((nconf+1)*sizeof(double));
for (int i = 0; i < nconf; i++) {
confidencev[i] = REAL(confidence)[i];
}
int* trialsv = &variable;
*trialsv = asInteger(trials);
/* Original code for predictions starts */
int val;
// Announce ourselves for testing
// Rprintf("predictions called\n");
// Initialize the globals
initglobals();
// Handles the strbufv data structure
rbm_removeall();
// XXX Should this be controlled via an option?
// Rprintf("Calling setOf\n");
setOf();
STRBUF *sb_cases = strbuf_create_full(casev, strlen(casev));
if (rbm_register(sb_cases, "undefined.cases", 0) < 0) {
error("undefined.cases already exists");
}
STRBUF *sb_names = strbuf_create_full(namesv, strlen(namesv));
if (rbm_register(sb_names, "undefined.names", 0) < 0) {
error("undefined.names already exists");
}
if (strlen(treev)) {
STRBUF *sb_treev = strbuf_create_full(treev, strlen(treev));
if (rbm_register(sb_treev, "undefined.tree", 0) < 0) {
error("undefined.tree already exists");
}
} else if (strlen(rulesv)) {
STRBUF *sb_rulesv = strbuf_create_full(rulesv, strlen(rulesv));
if (rbm_register(sb_rulesv, "undefined.rules", 0) < 0) {
error("undefined.rules already exists");
}
setrules(1);
} else {
error("either a tree or rules must be provided");
}
// Create a strbuf using *costv and register it as "undefined.costs"
if (strlen(costv) > 0) {
// Rprintf("registering cost matrix: %s", *costv);
STRBUF *sb_costv = strbuf_create_full(costv, strlen(costv));
// XXX should sb_costv be copied?
if (rbm_register(sb_costv, "undefined.costs", 0) < 0) {
error("undefined.cost already exists");
}
} else {
// Rprintf("no cost matrix to register\n");
}
if ((val = setjmp(rbm_buf)) == 0) {
// Real work is done here
// Rprintf("\n\nCalling rpredictmain\n");
rpredictmain(trialsv, predv, confidencev);
// Rprintf("predict finished\n\n");
} else {
// Rprintf("predict code called exit with value %d\n\n", val - JMP_OFFSET);
}
// Close file object "Of", and return its contents via argument outputv
char *outputString = closeOf();
char *output = R_alloc(strlen(outputString) + 1, 1);
strcpy(output, outputString);
// We reinitialize the globals on exit out of general paranoia
initglobals();
/* Original code for predictions ends */
free(namesv);
free(treev);
free(rulesv);
free(costv);
SEXP predx = PROTECT(allocVector(INTSXP, npred));
for (int i = 0; i < npred; i++) {
INTEGER(predx)[i] = predv[i];
}
SEXP confidencex = PROTECT(allocVector(REALSXP, nconf));
for (int i = 0; i < npred; i++) {
REAL(confidencex)[i] = confidencev[i];
}
SEXP outputx = PROTECT(allocVector(STRSXP, 1));
SET_STRING_ELT(outputx, 0, mkChar(output));
SEXP vector = PROTECT(allocVector(VECSXP, 3));
SET_VECTOR_ELT(vector, 0, predx);
SET_VECTOR_ELT(vector, 1, confidencex);
SET_VECTOR_ELT(vector, 2, outputx);
UNPROTECT(4);
free(predv);
free(confidencev);
return vector;
}

Rcpp forbidden conversion when using any() in an if() statement

I am trying to convert my R function to C++ using Rcpp, but I came around errors that I don't understand quite well.
The following code gives my R function, my (poor) attempt to translate it and some examples of uses at the end (testing that the two function return the same thing...)
My R Code function:
intersect_rectangles <- function(x_min, x_max, y_min, y_max) {
rez <- list()
rez$min <- pmax(x_min, y_min)
rez$max <- pmin(x_max, y_max)
if (any(rez$min > rez$max)) {
return(list(NULL))
}
return(rez)
}
My attempt to create the same function with Rcpp.
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
List Cpp_intersect_rectangles(NumericVector x_min,NumericVector
x_max,NumericVector y_min,NumericVector y_max) {
// Create a list :
NumericVector min = pmax(x_min,y_min);
NumericVector max = pmin(x_max,y_max);
List L = List::create(R_NilValue);
if (! any(min > max)) {
L = List::create(Named("min") = min , _["max"] = max);
}
return(L);
}
I receive the following error messages:
/Library/Frameworks/R.framework/Versions/3.5/Resources/library/Rcpp/include/Rcpp/sugar/logical/SingleLogicalResult.h:36:2: error: implicit instantiation of undefined template 'Rcpp::sugar::forbidden_conversion<false>'
forbidden_conversion<x>{
^
/Library/Frameworks/R.framework/Versions/3.5/Resources/library/Rcpp/include/Rcpp/sugar/logical/SingleLogicalResult.h:74:40: note: in instantiation of template class 'Rcpp::sugar::conversion_to_bool_is_forbidden<false>' requested here
conversion_to_bool_is_forbidden<!NA> x ;
^
file637e53281965.cpp:13:9: note: in instantiation of member function 'Rcpp::sugar::SingleLogicalResult<true, Rcpp::sugar::Negate_SingleLogicalResult<true, Rcpp::sugar::Any<true, Rcpp::sugar::Comparator<14, Rcpp::sugar::greater<14>, true, Rcpp::Vector<14, PreserveStorage>, true, Rcpp::Vector<14, PreserveStorage> > > > >::operator bool' requested here
if (! any(min > max))
If the Rcpp function is implemented correctly, then the following should work:
u = rep(0,4)
v = rep(1,4)
w = rep(0.3,4)
x = c(0.8,0.8,3,3)
all.equal(intersect_rectangles(u,v,w,x), Cpp_intersect_rectangles(u,v,w,x))
all.equal(intersect_rectangles(u,v,w,w), Cpp_intersect_rectangles(u,v,w,w))
What's wrong with my cpp code?
The reason the code isn't translating correctly is due to how the any() Rcpp sugar implementation was created. In particular, we have that:
The actual return type of any(X) is an instance of the
SingleLogicalResult template class, but the functions is_true
and is_false may be used to convert the return value to bool.
Per https://thecoatlessprofessor.com/programming/unofficial-rcpp-api-documentation/#any
Therefore, the solution is to add .is_true() to the any() function call, e.g. !any(condition).is_true().
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
List Cpp_intersect_rectangles(NumericVector x_min, NumericVector x_max,
NumericVector y_min, NumericVector y_max) {
// Create a list :
NumericVector min = pmax(x_min, y_min);
NumericVector max = pmin(x_max, y_max);
List L = List::create(R_NilValue);
if (! any(min > max).is_true()) {
// ^^^^^^^^^ Added
L = List::create(Named("min") = min , _["max"] = max);
}
return(L);
}
Then, through testing we get:
u = rep(0,4)
v = rep(1,4)
w = rep(0.3,4)
x = c(0.8,0.8,3,3)
all.equal(intersect_rectangles(u,v,w,x), Cpp_intersect_rectangles(u,v,w,x))
# [1] TRUE
all.equal(intersect_rectangles(u,v,w,w), Cpp_intersect_rectangles(u,v,w,w))
# [1] TRUE

Subsetting a DateVector

I'm trying to extract and subset a vector containing date information from a data.frame. I'm able to successfully extract the DateVector from the DataFrame; however, I receive an error when trying to subset the data.
The below works fine given the /* */ around the DateVector subsets.
Rcpp::cppFunction('
Rcpp::DataFrame test(DataFrame x, StringVector y ) {
StringVector New = x["string_1"];
std::string KEY = Rcpp::as<std::string>(y[0]);
Rcpp::LogicalVector ind(New.size());
for(int i = 0; i < New.size(); i++){
ind[i] = (New[i] == KEY);
}
Rcpp::StringVector st1 = x["string_1"];
Rcpp::StringVector Id = x["ID"];
Rcpp::StringVector NameId = x["NameID"];
Rcpp::DateVector StDate = x["StartDate"];
Rcpp::DateVector EtDate = x["EndDate"];
/*
Rcpp::DateVector StDate_sub = StDate[ind];
Rcpp::DateVector EtDate_sub = EtDate[ind];
*/
return Rcpp::DataFrame::create(Rcpp::Named("string_1") = st1[ind],
Rcpp::Named("ID") = Id[ind],
Rcpp::Named("NameID") = NameId[ind]/*,
Rcpp::Named("StartDate") = StDate_sub,
Rcpp::Named("EndDate") = EtDate_sub*/
);
}')
There are two notable errors I receive:
error: invalid user-defined conversion from 'Rcpp::LogicalVector {aka Rcpp::Vector<10, Rcpp::PreserveStorage>}' to 'int' [-fpermissive]
Rcpp::DateVector StDate_sub = StDate[ind]
The second is:
no known conversion from 'SEXP' to 'int'
file585c1863151c.cpp:23:53: error: conversion from 'Rcpp::Date' to non-scalar type 'Rcpp::DateVector {aka Rcpp::oldDateVector}' requested
Rcpp::DateVector EtDate_sub = EtDate[ind];
I looked at the docs, but couldn't find a way. Sorry, if I missed it. I have a couple of date variables in data.frame. I am using the Rcpp to subset the data set in a nested for loop. Currently, it is taking too much time. I cannot implement it in data.table or dplyr as the subset data set is required fro some processing.
First off, your example is not minimally reproducible as there is no defined data set.
Second, you are making the (heroic?) assumption that assignment by index vector be defined for Date vectors. Appears it may not be.
Third, just looping is trivial. Amended code below. Builds without a hitch, no idea if it run as you supplied no reference data.
#define RCPP_NEW_DATE_DATETIME_VECTORS 1
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
Rcpp::DataFrame dftest(DataFrame x, StringVector y ) {
StringVector New = x["string_1"];
std::string KEY = Rcpp::as<std::string>(y[0]);
Rcpp::LogicalVector ind(New.size());
for(int i = 0; i < New.size(); i++){
ind[i] = (New[i] == KEY);
}
Rcpp::StringVector st1 = x["string_1"];
Rcpp::StringVector Id = x["ID"];
Rcpp::StringVector NameId = x["NameID"];
Rcpp::DateVector StDate = x["StartDate"];
Rcpp::DateVector EtDate = x["EndDate"];
int n = sum(ind);
Rcpp::DateVector StDate_sub = StDate(n);
Rcpp::DateVector EtDate_sub = EtDate(n);
for (int i=0; i<n; i++) {
StDate_sub[i] = StDate( ind[i] );
EtDate_sub[i] = EtDate( ind[i] );
}
return Rcpp::DataFrame::create(Rcpp::Named("string_1") = st1[ind],
Rcpp::Named("ID") = Id[ind],
Rcpp::Named("NameID") = NameId[ind],
Rcpp::Named("StartDate") = StDate_sub,
Rcpp::Named("EndDate") = EtDate_sub);
}

Rcpp function for adding elements of a vector

I have a very long vector of parameters (approximately 4^10 elements) and a vector of indices. My aim is to add together all of the values of the parameters that are indexed in the indices vector.
For instance, if I had paras = [1,2,3,4,5,5,5] and indices = [3,3,1,6] then I would want to find the cumulative sum of the third value (3) twice, the first value (1) and the sixth (5), to get 12. There is additionally the option of warping the parameter values according to their location.
I am trying to speed up an R implementation, as I am calling it millions of times.
My current code always returns NA, and I can't see where it is going wrong
Here's the Rcpp function:
double dot_prod_c(NumericVector indices, NumericVector paras,
NumericVector warp = NA_REAL) {
int len = indices.size();
LogicalVector indices_ok;
for (int i = 0; i < len; i++){
indices_ok.push_back(R_IsNA(indices[i]));
}
if(is_true(any(indices_ok))){
return NA_REAL;
}
double counter = 0;
if(NumericVector::is_na(warp[1])){
for (int i = 0; i < len; i++){
counter += paras[indices[i]];
}
} else {
for (int i = 0; i < len; i++){
counter += paras[indices[i]] * warp[i];
}
}
return counter;
}
And here is the working R version:
dot_prod <- function(indices, paras, warp = NA){
if(is.na(warp[1])){
return(sum(sapply(indices, function(ind) paras[ind + 1])))
} else {
return(sum(sapply(1:length(indices), function(i){
ind <- indices[i]
paras[ind + 1] * warp[i]
})))
}
}
Here is some code for testing, and benchmarking using the microbenchmark package:
# testing
library(Rcpp)
library(microbenchmark)
parameters <- list()
indices <- list()
indices_trad <- list()
set.seed(2)
for (i in 4:12){
size <- 4^i
window_size <- 100
parameters[[i-3]] <- runif(size)
indices[[i-3]] <- floor(runif(window_size)*size)
temp <- rep(0, size)
for (j in 1:window_size){
temp[indices[[i-3]][j] + 1] <- temp[indices[[i-3]][j] + 1] + 1
}
indices_trad[[i-3]] <- temp
}
microbenchmark(
x <- sapply(1:9, function(i) dot_prod(indices[[i]], parameters[[i]])),
x_c <- sapply(1:9, function(i) dot_prod_c(indices[[i]], parameters[[i]])),
x_base <- sapply(1:9, function(i) indices_trad[[i]] %*% parameters[[i]])
)
all.equal(x, x_base) # is true, does work
all.equal(x_c, x_base) # not true - C++ version returns only NAs
I was having a little trouble trying to interpret your overall goal through your code, so I'm just going to go with this explanation
For instance, if I had paras = [1,2,3,4,5,5,5] and indices = [3,3,1,6]
then I would want to find the cumulative sum of the third value (3)
twice, the first value (1) and the sixth (5), to get 12. There is
additionally the option of warping the parameter values according to
their location.
since it was most clear to me.
There are some issues with your C++ code. To start, instead of doing this - NumericVector warp = NA_REAL - use the Rcpp::Nullable<> template (shown below). This will solve a few problems:
It's more readable. If you're not familiar with the Nullable class, it's pretty much exactly what it sounds like - an object that may or may not be null.
You won't have to make any awkward initializations, such as NumericVector warp = NA_REAL. Frankly I was surprised that the compiler accepted this.
You won't have to worry about accidentally forgetting that C++ uses zero-based indexing, unlike R, as in this line: if(NumericVector::is_na(warp[1])){. That has undefined behavior written all over it.
Here's a revised version, going off of your quoted description of the problem above:
#include <Rcpp.h>
typedef Rcpp::Nullable<Rcpp::NumericVector> nullable_t;
// [[Rcpp::export]]
double DotProd(Rcpp::NumericVector indices, Rcpp::NumericVector params, nullable_t warp_ = R_NilValue) {
R_xlen_t i = 0, n = indices.size();
double result = 0.0;
if (warp_.isNull()) {
for ( ; i < n; i++) {
result += params[indices[i]];
}
} else {
Rcpp::NumericVector warp(warp_);
for ( ; i < n; i++) {
result += params[indices[i]] * warp[i];
}
}
return result;
}
You had some elaborate code to generate sample data. I didn't take the time to go through this because it wasn't necessary, nor was the benchmarking. You stated yourself that the C++ version wasn't producing the correct results. Your first priority should be to get your code working on simple data. Then feed it some more complex data. Then benchmark. The revised version above works on simple data:
args <- list(
indices = c(3, 3, 1, 6),
params = c(1, 2, 3, 4, 5, 5, 5),
warp = c(.25, .75, 1.25, 1.75)
)
all.equal(
DotProd(args[[1]], args[[2]]),
dot_prod(args[[1]], args[[2]]))
#[1] TRUE
all.equal(
DotProd(args[[1]], args[[2]], args[[3]]),
dot_prod(args[[1]], args[[2]], args[[3]]))
#[1] TRUE
It's also faster than the R version on this sample data. I have no reason to believe it wouldn't be for larger, more complex data either - there's nothing magical or particularly efficient about the *apply functions; they are just more idiomatic / readable R.
microbenchmark::microbenchmark(
"Rcpp" = DotProd(args[[1]], args[[2]]),
"R" = dot_prod(args[[1]], args[[2]]))
#Unit: microseconds
#expr min lq mean median uq max neval
#Rcpp 2.463 2.8815 3.52907 3.3265 3.8445 18.823 100
#R 18.869 20.0285 21.60490 20.4400 21.0745 66.531 100
#
microbenchmark::microbenchmark(
"Rcpp" = DotProd(args[[1]], args[[2]], args[[3]]),
"R" = dot_prod(args[[1]], args[[2]], args[[3]]))
#Unit: microseconds
#expr min lq mean median uq max neval
#Rcpp 2.680 3.0430 3.84796 3.701 4.1360 12.304 100
#R 21.587 22.6855 23.79194 23.342 23.8565 68.473 100
I omitted the NA checks from the example above, but that too can be revised into something more idiomatic by using a little Rcpp sugar. Previously, you were doing this:
LogicalVector indices_ok;
for (int i = 0; i < len; i++){
indices_ok.push_back(R_IsNA(indices[i]));
}
if(is_true(any(indices_ok))){
return NA_REAL;
}
It's a little aggressive - you are testing a whole vector of values (with R_IsNA), and then applying is_true(any(indices_ok)) - when you could just break prematurely and return NA_REAL on the first instance of R_IsNA(indices[i]) resulting in true. Also, the use of push_back will slow down your function quite a bit - you would have been better off initializing indices_ok to the known size and filling it by index access in your loop. Nevertheless, here's one way to condense the operation:
if (Rcpp::na_omit(indices).size() != indices.size()) return NA_REAL;
For completeness, here's a fully sugar-ized version which allows you to avoid loops entirely:
#include <Rcpp.h>
typedef Rcpp::Nullable<Rcpp::NumericVector> nullable_t;
// [[Rcpp::export]]
double DotProd3(Rcpp::NumericVector indices, Rcpp::NumericVector params, nullable_t warp_ = R_NilValue) {
if (Rcpp::na_omit(indices).size() != indices.size()) return NA_REAL;
if (warp_.isNull()) {
Rcpp::NumericVector tmp = params[indices];
return Rcpp::sum(tmp);
} else {
Rcpp::NumericVector warp(warp_), tmp = params[indices];
return Rcpp::sum(tmp * warp);
}
}
/*** R
all.equal(
DotProd3(args[[1]], args[[2]]),
dot_prod(args[[1]], args[[2]]))
#[1] TRUE
all.equal(
DotProd3(args[[1]], args[[2]], args[[3]]),
dot_prod(args[[1]], args[[2]], args[[3]]))
#[1] TRUE
*/

How to modify drawdown functions in PerformanceAnalytics package for value

I am calculating the average drawdown, average length, recovery length, etc. in R for a PnL data series rather than return data. This is data frame like this
PNL
2008-11-03 3941434
2008-11-04 4494446
2008-11-05 2829608
2008-11-06 2272070
2008-11-07 -2734941
2008-11-10 -2513580
I used the maxDrawDown function from fTrading package and it worked. How could I get the other drawdown functions? If I directly run AverageDrawdown(quantbook) function, it will give out error message like this
Error in if (thisSign == priorSign) { : missing value where TRUE/FALSE needed
I checked the documentation for AverageDrawdown and it is as below:
findDrawdowns(R, geometric = TRUE, ...)
R an xts, vector, matrix, data frame, timeSeries or zoo object of asset returns
My quantbook is a data frame but doesn't work for this function.
Or do you have anything other packages to get the same funciton, please advise.
I've modified the package's functions. Here is one solution in PnL case (or any other case you want to get the value rather than the return) and hope you find it useful. The parameter x is a dataframe and the row.names for x are dates so you don't bother to convert amongst different data types (which I actually suffer a lot). With the function findPnLDrawdown, you could perform a lot other functions to calculate averageDrawDown, averageLength, recovery, etc.
PnLDrawdown <- function(x) {
ts = as.vector(x[,1])
cumsum = cumsum(c(0, ts))
cmaxx = cumsum - cummax(cumsum)
cmaxx = cmaxx[-1]
cmaxx = as.matrix(cmaxx)
row.names(cmaxx) = row.names(x)
cmaxx = timeSeries(cmaxx)
cmaxx
}
findPnLDrawdown <- function(R) {
drawdowns = PnLDrawdown(R)
draw = c()
begin = c()
end = c()
length = c(0)
trough = c(0)
index = 1
if (drawdowns[1] >= 0) {
priorSign = 1
} else {
priorSign = 0
}
from = 1
sofar = as.numeric(drawdowns[1])
to = 1
dmin = 1
for (i in 1:length(drawdowns)) {
thisSign =ifelse(drawdowns[i] < 0, 0, 1)
if (thisSign == priorSign) {
if (as.numeric(drawdowns[i]) < as.numeric(sofar)) {
sofar = drawdowns[i]
dmin = i
}
to = i+ 1
}
else {
draw[index] = sofar
begin[index] = from
trough[index] = dmin
end[index] = to
from = i
sofar = drawdowns[i]
to = i + 1
dmin = i
index = index + 1
priorSign = thisSign
}
}
draw[index] = sofar
begin[index] = from
trough[index] = dmin
end[index] = to
list(pnl = draw, from = begin, trough = trough, to = end,
length = (end - begin + 1),
peaktotrough = (trough - begin + 1),
recovery = (end - trough))
}

Resources