Related
UPDATE3: Problem is solved but I'm leaving the code here as-is for future reference--I've posted an answer below with the final state of the code in case people wanted to see the final product.
UPDATE2: Refactored to use R_alloc instead of calloc for automated cleanup. Unfortunately the problem persists.
UPDATE: If I add this line right before UNPROTECT(1):
Rprintf("%p %p %p", (void *)rans, (void *)fm, (void *)corrs);
then the function executes with no corrupted heap error. Maybe there's a background garbage collection call that corrupts one of the pointers prior to execution finishing, resulting in a write to a garbage pointer? Important to note here that if I don't print out all three of the pointer addresses, the error comes back.
Also I'm running this on an M1 Mac and compiling with clang via R CMD SHLIB, in case Apple silicon is to blame.
I'm at my wits end trying to debug this issue, and I figured I'd turn to SO for help. I'm writing a function in C to optimize some parts of my R code, and I'm getting a Heap Corruption Error when running the function many times. The function trimCovar() is called from R using the .Call("trimCovar", ...) interface.
I'm having a lot of difficulty debugging this for a few reasons:
I'm on OSX, so I can't use Valgrind
C function depends on inputs from R, so I can't debug the C code on its own
Heap corruption only occurs when calling the function many times within an R function
(just running .Call directly a bunch of times has no errors)
Error point is inconsistent
I start with two sets of vectors, and I condense them into a frequency matrix, where each column is a position in the vector set, and each row is a particular character that appears. I concatenate them into one matrix prior to passing in because it makes pre-processing easier. An toy example of the frequency matrix would be:
INPUT:
v1_1 = 101
v1_2 = 011
v2_1 = 111
v2_2 = 110
Frequency Matrix:
position: | 1_1 | 1_2 | 1_3 | 2_1 | 2_2 | 2_3 |
0: 0.5 0.5 0.0 0.0 0.0 0.5
1: 0.5 0.5 1.0 1.0 1.0 0.5
The goal is to find the NV highest correlated positions across the vector sets, which I do by calculating pairwise KL divergence of positions. These are stored in a linked list sorted in ascending order, and at the end I take the positions corresponding to the first NV entries. The R code I have can deparse everything else, so I really just need a vector of positions at the end (duplicates are allowed).
The function takes in 5 arguments:
fMAT: a frequency matrix (RObject, so gets read in as a flat vector)
fSP : columns in matrix corresponding to positions from the first vector set
sSP : same as fSP but for second vector set
NV : Number of values to return
NR : Number of columns in fMAT
The error returned is:
R(95564,0x104858580) malloc: Heap corruption detected, free list is damaged at 0x600000f10040
*** Incorrect guard value: 4626885667169763328
R(95564,0x104858580) malloc: *** set a breakpoint in malloc_error_break to debug
This only happens when I run an R function that calls this 10+ times, so I'm assuming that I'm just missing one or two small hanging pointers corrupting a memory reference. I've tried running this with gc() called in R immediately after each call, but it doesn't fix the problem. I'm not really sure what else to do at this point, I've tried using lldb but I'm not really sure how to use that program. From running lots of print statements I've determined that it usually crashes in the main loop (identified in code below), but it's inconsistent on when it crashes. I've also tried saving off erroneous inputs--I can rerun them individually with no issues, so it must be something relatively small that only appears over many runs.
Happy to provide more details if it would help. Code is listed at the bottom.
The only thing being allocated here are linked list nodes, and I thought I had free()'d them all prior to returning. I've also double checked the input values, so I'm 99.99% sure that I'm never referencing out of bounds on firstSeqPos, secondSeqPos, ans, or fm. I've also triple checked the R code surrounding this and can confidently say it is not the source of this error.
I haven't coded in C in a long time so I feel like I'm missing something obvious. If I really have to I can try to get ahold of a Linux box to run valgrind, but if there's another option I'd prefer it. Thanks in advance!
Code:
#include <R.h>
#include <Rdefines.h>
#include <Rinternals.h>
#include <math.h>
#include <stdlib.h>
#include <stdbool.h>
typedef struct node {
double data;
int i1;
int i2;
struct node *next;
} node;
// Linked list
// data is the correlation value,
// i1 the position from first vector set,
// i2 the position from second vector set
node *makeNewNode(double data, int i1, int i2){
node *newNode;
newNode = (node *)R_alloc(1, sizeof(node));
newNode->data = data;
newNode->i1 = i1;
newNode->i2 = i2;
newNode->next = NULL;
return(newNode);
}
//insert link in sorted order (ascending)
void insertSorted(node **head, node *toInsert, int maxSize) {
int ctr = 0;
if ((*head) == NULL || (*head)->data >= toInsert->data){
toInsert->next = *head;
*head = toInsert;
} else {
node *temp = *head;
while (temp->next != NULL && temp->next->data < toInsert->data){
temp = temp->next;
if (ctr == maxSize){
// Performance optimization, if we aren't inserting in the first NR
// positions then we can just skip since we only care about the NR
// lowest scores overall
return;
}
ctr += 1;
}
toInsert->next = temp->next;
temp->next = toInsert;
}
}
// MAIN FUNCTION CALLED FROM R
// (This is the one that crashes)
SEXP trimCovar(SEXP fMAT, SEXP fSP, SEXP sSP, SEXP NV, SEXP NR){
// Converting input SEXPs into C-compatible values
int nv = asInteger(NV);
int nr = asInteger(NR);
int sp1l = length(fSP);
int sp2l = length(sSP);
int *firstSeqPos = INTEGER(coerceVector(fSP, INTSXP));
int *secondSeqPos = INTEGER(coerceVector(sSP, INTSXP));
double *fm = REAL(fMAT);
int colv1, colv2;
// Using a linked list for efficient insert
node *corrs = NULL;
int cv1, cv2;
double p1, p2, score=0;
// USUALLY FAILS IN THIS LOOP
for ( int i=0; i<sp1l; i++ ){
cv1 = firstSeqPos[i];
colv1 = (cv1 - 1) * nr;
for ( int j=0; j<sp2l; j++ ){
cv2 = secondSeqPos[j];
colv2 = (cv2 - 1) * nr;
// KL Divergence
score = 0;
for ( int k=0; k<nr; k++){
p1 = fm[colv1 + k];
p2 = fm[colv2 + k];
if (p1 != 0 && p2 != 0){
score += p1 * log(p1 / p2);
}
}
// Add result into LL
node *newNode = makeNewNode(score, cv1, cv2);
insertSorted(&corrs, newNode, nv);
}
R_CheckUserInterrupt();
}
SEXP ans;
PROTECT(ans = allocVector(INTSXP, 2*nv));
int *rans = INTEGER(ans);
int ctr=0;
int pos1, pos2;
node *ptr = corrs;
for ( int i=0; i<nv; i++){
rans[2*i] = ptr->i1;
rans[2*i+1] = ptr->i2;
ptr = ptr->next;
}
UNPROTECT(1);
return(ans);
}
int *firstSeqPos = INTEGER(coerceVector(fSP, INTSXP));
int *secondSeqPos = INTEGER(coerceVector(sSP, INTSXP));
This is not good. The SEXPs returned by the 2 calls to coerceVector() need to be protected. However it's usually considered better practice to do this coercion at the R level right before entering the .Call entry point. Note that if fSP and sSP are integer matrices, there's no need to coerce them to integer as they are already seen as integer vectors at the C level. This also avoids a possibly expensive copy (as.integer() in R and coerceVector() in C both trigger a full copy of the matrix data).
The question was answered above, but I received a couple messages from people asking for the final code, so I'm going to include it as an answer to preserve the original question. There's a couple optimizations here (thanks to #hpages for help and troubleshooting regarding these):
Original code fails because the output of coerceVector() wasn't protected with PROTECT(). I've refactored the R code to check for integer inputs prior to calling this C function to avoid this function call and be more efficient with memory (see the accepted answer for more details).
Original code uses R_alloc(), which gives responsibility to R to clean up memory at the end of the function call. However, this introduces substantial memory overhead during the runtime of the function, since memory allocated to nodes not inserted into the linked list aren't cleared until the end of the function call.
Allocation with calloc() isn't as simple as switching over and calling free() at the end of the function, since we have to guard the case where the user interrupts execution of the program. If an interrupt signal is thrown prior to the end of the function, we'll never free the memory.
Final C Code:
#include <R.h>
#include <Rdefines.h>
#include <Rinternals.h>
#include <math.h>
#include <stdlib.h>
#include <stdbool.h>
typedef struct node {
double data;
int i1;
int i2;
struct node *next;
} node;
// Defining the head as a static so that we can access it globally
// Important for ensuring clean up in case of interrupt
static node *corrs = NULL;
// Function to clean up memory allocations in case of interrupt
void cleanupFxn(){
node *ptr = corrs;
// Free allocated memory in linked list
while (corrs != NULL){
ptr = corrs;
corrs = corrs->next;
free(ptr);
}
}
node *makeNewNode(double data, int i1, int i2){
node *newNode;
// very important to use calloc here so we have control of when we free it
// R_alloc() memory won't be freed until after function finishes execution
newNode = (node *)calloc(1, sizeof(node));
newNode->data = data;
newNode->i1 = i1;
newNode->i2 = i2;
newNode->next = NULL;
return(newNode);
}
// insert link in sorted order
// returns a bool corresponding to if we inserted
bool insertSorted(node **head, node *toInsert, int maxSize) {
int ctr = 0;
if ((*head) == NULL || (*head)->data >= toInsert->data){
toInsert->next = *head;
*head = toInsert;
return(true);
} else {
node *temp = *head;
while (temp->next != NULL && temp->next->data < toInsert->data){
temp = temp->next;
if (ctr == maxSize){
// Performance optimization, if we aren't inserting in the first NR
// positions then we can just skip since we only care about the NR
// lowest scores overall. Saves a huge amount of time and memory.
return(false);
}
ctr += 1;
}
toInsert->next = temp->next;
temp->next = toInsert;
return(true);
}
}
SEXP trimCovar(SEXP fMAT, SEXP fSP, SEXP sSP, SEXP NV, SEXP NR){
// Converting inputs into C-compatible forms
int nv = asInteger(NV);
int nr = asInteger(NR);
int sp1l = length(fSP);
int sp2l = length(sSP);
// Note here we're not using coerceVector() anymore
// typechecking done on R side
int *firstSeqPos = INTEGER(fSP);
int *secondSeqPos = INTEGER(sSP);
double *fm = REAL(fMAT);
int colv1, colv2;
// Using a linked list for efficient insert
corrs = NULL;
int cv1, cv2;
double p1, p2, score=0;
bool success;
for ( int i=0; i<sp1l; i++ ){
cv1 = firstSeqPos[i];
colv1 = (cv1 - 1) * nr;
for ( int j=0; j<sp2l; j++ ){
cv2 = secondSeqPos[j];
colv2 = (cv2 - 1) * nr;
score = 0;
for ( int k=0; k<nr; k++){
p1 = fm[colv1 + k];
p2 = fm[colv2 + k];
if (p1 != 0 && p2 != 0){
score += p1 * log(p1 / p2);
}
}
node *newNode = makeNewNode(score, cv1, cv2);
success = insertSorted(&corrs, newNode, nv);
// If we don't insert, free the associated memory
// I'm checking for NULL here just out of an abundance of caution
if (!success && newNode != NULL){
free(newNode);
newNode = NULL;
}
}
R_CheckUserInterrupt();
}
SEXP ans;
PROTECT(ans = allocVector(INTSXP, 2*nv));
int *rans = INTEGER(ans);
node *ptr=corrs;
for ( int i=0; i<nv; i++){
rans[2*i] = ptr->i1;
rans[2*i+1] = ptr->i2;
ptr = ptr->next;
}
// Free allocated memory in linked list
cleanupFxn();
UNPROTECT(1);
return(ans);
}
Assuming the C file is named trimCovar.c, we'd compile with R CMD SHLIB trimCovar.c.
R Code to run this function:
dyn.load("trimCovar.so")
# Wrapped into a function with on.exit(...) to ensure cleanup
# in the event the user or system interrupts execution early
CorrComp_C <- function(fm, fsp, ssp, nv, nr){
# type checking to ensure input to C is integer vector
# (could probably do more type checking here, mainly for illustration)
stopifnot(is(fsp, 'integer'))
stopifnot(is(ssp, 'integer'))
on.exit(.C("cleanupFxn"))
a <- .Call('trimCovar', fm, fsp, ssp, nv, nr)
return(a)
}
When trying to train a model with a dataset of around 3 million rows and 600 columns using the C5.0 CRAN package I get the following error:
Error in paste(apply(x, 1, paste, collapse = ","), collapse = "\n") : result would exceed 2^31-1 bytes
From what the owner of the repository answered to a similar issue, it is due to an R limitation in the number of bytes in a character string, which is limited to 2^31 - 1.
Long answer ahead:
So, as stated in the question, the error occurs in the last line of the makeDataFile function from the Cubist package, used in C5.0, which concatenates all rows into one string. As this string is needed to pass the data to the C5.0 function in C, but is not needed to make any operations in R, and C has no memory limitation aside from those of the machine itself, the approach I have taken is to create such string in C instead. In order to do this, the R code will pass the information in a character vector containing various strings that don’t surpass the length limit, instead of one, so that once in C these elements can be concatenated.
However, instead of leaving all rows as separate elements in the character vector to be concatenated in C using strcat in a loop, I have found that the strcat function is quite slow, so I have chosen to create another R function (create_max_len_strings) in order to concatenate the rows into the longest (~or close~) strings possible without reaching the memory limit so that strcat only needs to be applied a few times to concatenate these longer strings.
So, the last line of the original makeDataFile() function will be replaced so that each row is left separately as an element of a character vector, only adding a line break at the end of each string row so that when concatenating some of these elements into longer strings, using create_max_len_strings(), they will be differentiated:
makeDataFile.R:
create_max_len_strings <- function(original_vector) {
vector_length = length(original_vector)
nchars = sum(nchar(original_vector, type = "chars"))
## Check if the length of the string would reach 1900000000, which is close to the memory limitation
if(nchars >= 1900000000){
## Calculate how many strings we could create of the maximum length
nchunks = 0
while(nchars > 0){
nchars = nchars - 1900000000
nchunks = nchunks + 1
}
## Get the number of rows that would be contained in each string
chunk_size = vector_length/nchunks
## Get the rounded number of rows in each string
chunk_size = floor(chunk_size)
index = chunk_size
## Create a vector with the indexes of the rows that delimit each string
indexes_vector = c()
indexes_vector = append(indexes_vector, 0)
n = nchunks
while(n > 0){
indexes_vector = append(indexes_vector, index)
index = index + chunk_size
n = n - 1
}
## Get the last few rows if the division had remainder
remainder = vector_length %% nchunks
if (remainder != 0){
indexes_vector = append(indexes_vector, vector_length)
nchunks = nchunks + 1
}
## Create the strings pasting together the rows from the indexes in the indexes vector
strings_vector = c()
i = 2
while (i <= length(indexes_vector)){
## Sum 1 to the index_init so that the next string does not contain the last row of the previous string
index_init = indexes_vector[i-1] + 1
index_end = indexes_vector[i]
## Paste the rows from the vector from index_init to index_end
string <- paste0(original_vector[index_init:index_end], collapse="")
## Create vector containing the strings that were created
strings_vector <- append(strings_vector, string)
i = i + 1
}
}else {
strings_vector = paste0(original_vector, collapse="")
}
strings_vector
}
makeDataFile <- function(x, y, w = NULL) {
## Previous code stays the same
...
x = apply(x, 1, paste, collapse = ",")
x = paste(x, "\n", sep="")
char_vec = create_max_len_strings(x)
}
CALLING C5.0
Now, in order to create the final string to pass to the c50() function in C, an intermediate function is created and called instead. In order to do this, the .C() statement that calls c50() in R is replaced with a .Call() statement calling this function, as .Call() allows for complex objects such as vectors to be passed to C. Also, it allows for the result to be returned in the variable result instead of having to pass back the variables tree, rules and output by reference. The result of calling C5.0 will be received in the character vector result containing the strings corresponding to the tree, rules and output in the first three positions:
C5.0.R:
C5.0.default <- function(x,
y,
trials = 1,
rules = FALSE,
weights = NULL,
control = C5.0Control(),
costs = NULL,
...) {
## Previous code stays the same
...
dataString <- makeDataFile(x, y, weights)
num_chars = sum(nchar(dataString, type = "chars"))
result <- .Call(
"call_C50",
as.character(namesString),
dataString,
as.character(num_chars), ## The length of the resulting string is passed as character because it is too long for an integer
as.character(costString),
as.logical(control$subset),
# -s "use the Subset option" var name: SUBSET
as.logical(rules),
# -r "use the Ruleset option" var name: RULES
## for the bands option, I'm not sure what the default should be.
as.integer(control$bands),
# -u "sort rules by their utility into bands" var name: UTILITY
## The documentation has two options for boosting:
## -b use the Boosting option with 10 trials
## -t trials ditto with specified number of trial
## I think we should use -t
as.integer(trials),
# -t : " ditto with specified number of trial", var name: TRIALS
as.logical(control$winnow),
# -w "winnow attributes before constructing a classifier" var name: WINNOW
as.double(control$sample),
# -S : use a sample of x% for training
# and a disjoint sample for testing var name: SAMPLE
as.integer(control$seed),
# -I : set the sampling seed value
as.integer(control$noGlobalPruning),
# -g: "turn off the global tree pruning stage" var name: GLOBAL
as.double(control$CF),
# -c: "set the Pruning CF value" var name: CF
## Also, for the number of minimum cases, I'm not sure what the
## default should be. The code looks like it dynamically sets the
## value (as opposed to a static, universal integer
as.integer(control$minCases),
# -m : "set the Minimum cases" var name: MINITEMS
as.logical(control$fuzzyThreshold),
# -p "use the Fuzzy thresholds option" var name: PROBTHRESH
as.logical(control$earlyStopping)
)
## Get the first three positions of the character vector that contain the tree, rules and output returned by C5.0 in C
result_tree = result[1]
result_rules = result[2]
result_output = result[3]
modelContent <- strsplit(
if (rules)
result_rules
else
result_tree, "\n"
)[[1]]
entries <- grep("^entries", modelContent, value = TRUE)
if (length(entries) > 0) {
actual <- as.numeric(substring(entries, 10, nchar(entries) - 1))
} else
actual <- trials
if (trials > 1) {
boostResults <- getBoostResults(result_output)
## This next line is here to avoid a false positive warning in R
## CMD check:
## * checking R code for possible problems ... NOTE
## C5.0.default: no visible binding for global variable 'Data'
Data <- NULL
size <-
if (!is.null(boostResults))
subset(boostResults, Data == "Training Set")$Size
else
NA
} else {
boostResults <- NULL
size <- length(grep("[0-9])$", strsplit(result_output, "\n")[[1]]))
}
out <- list(
names = namesString,
cost = costString,
costMatrix = costs,
caseWeights = !is.null(weights),
control = control,
trials = c(Requested = trials, Actual = actual),
rbm = rules,
boostResults = boostResults,
size = size,
dims = dim(x),
call = funcCall,
levels = levels(y),
output = result_output,
tree = result_tree,
predictors = colnames(x),
rules = result_rules
)
class(out) <- "C5.0"
out
}
Now onto the C code, the function call_c50() basically acts as an intermediate between the R code and the C code, concatenating the elements in the dataString array to obtain the string needed by the C function c50(), by accessing each position of the array using CHAR(STRING_ELT(x, i)) and concatenating (strcat) them together. Then the rest of the variables are casted to their respective types and the c50() function in file top.c (where this function should also be placed) is called. The result of calling c50() will be returned to the R routine by creating a character vector and placing the strings corresponding to the tree, rules and output in each position.
Lastly, the c50() function is basically left as is, except for the variables treev, rulesv and outputv, as these are the values that are going to be returned by .Call() instead of being passed by reference, they no longer need to be in the arguments of the function. As they are all strings they can be returned in a single array, by setting each string to a position in the array c50_return.
top.c:
SEXP call_C50(SEXP namesString, SEXP data_vec, SEXP datavec_len, SEXP costString, SEXP subset, SEXP rules, SEXP bands, SEXP trials, SEXP winnow, SEXP sample,
SEXP seed, SEXP noGlobalPruning, SEXP CF, SEXP minCases, SEXP fuzzyThreshold, SEXP earlyStopping){
char* string;
char* concat;
long n = 0;
long size;
int i;
char* eptr;
// Get the length of the data vector
n = length(data_vec);
// Get the string indicating the length of the final string
char* size_str = malloc((strlen(CHAR(STRING_ELT(datavec_len, 0)))+1)*sizeof(char));
strcpy(size_str, CHAR(STRING_ELT(datavec_len, 0)));
// Turn the string to long
size = strtol(size_str, &eptr, 10);
// Allocate memory for the number of characters indicated by datavec_len
string = malloc((size+1)*sizeof(char));
// Copy the first element of data_vec into the string variable
strcpy(string, CHAR(STRING_ELT(data_vec, 0)));
// Loop over the data vector until all elements are concatenated in the string variable
for (i = 1; i < n; i++) {
strcat(string, CHAR(STRING_ELT(data_vec, i)));
}
// Copy the value of namesString into a char*
char* namesv = malloc((strlen(CHAR(STRING_ELT(namesString, 0)))+1)*sizeof(char));
strcpy(namesv, CHAR(STRING_ELT(namesString, 0)));
// Copy the value of costString into a char*
char* costv = malloc((strlen(CHAR(STRING_ELT(costString, 0)))+1)*sizeof(char));
strcpy(costv, CHAR(STRING_ELT(costString, 0)));
// Call c50() function casting the rest of arguments into their respective C types
char** c50_return = c50(namesv, string, costv, asLogical(subset), asLogical(rules), asInteger(bands), asInteger(trials), asLogical(winnow), asReal(sample), asInteger(seed), asInteger(noGlobalPruning), asReal(CF), asInteger(minCases), asLogical(fuzzyThreshold), asLogical(earlyStopping));
free(string);
free(namesv);
free(costv);
// Create a character vector to be returned to the C5.0 R function
SEXP out = PROTECT(allocVector(STRSXP, 3));
SET_STRING_ELT(out, 0, mkChar(c50_return[0]));
SET_STRING_ELT(out, 1, mkChar(c50_return[1]));
SET_STRING_ELT(out, 2, mkChar(c50_return[2]));
UNPROTECT(1);
return out;
}
static char** c50(char *namesv, char *datav, char *costv, int subset,
int rules, int utility, int trials, int winnow,
double sample, int seed, int noGlobalPruning, double CF,
int minCases, int fuzzyThreshold, int earlyStopping) {
int val; /* Used by setjmp/longjmp for implementing rbm_exit */
char ** c50_return = malloc(3 * sizeof(char*));
// Initialize the globals to the values that the c50
// program would have at the start of execution
initglobals();
// Set globals based on the arguments. This is analogous
// to parsing the command line in the c50 program.
setglobals(subset, rules, utility, trials, winnow, sample, seed,
noGlobalPruning, CF, minCases, fuzzyThreshold, earlyStopping,
costv);
// Handles the strbufv data structure
rbm_removeall();
// Deallocates memory allocated by NewCase.
// Not necessary since it's also called at the end of this function,
// but it doesn't hurt, and I'm feeling paranoid.
FreeCases();
// XXX Should this be controlled via an option?
// Rprintf("Calling setOf\n");
setOf();
// Create a strbuf using *namesv as the buffer.
// Note that this is a readonly strbuf since we can't
// extend *namesv.
STRBUF *sb_names = strbuf_create_full(namesv, strlen(namesv))
// Register this strbuf using the name "undefined.names"
if (rbm_register(sb_names, "undefined.names", 0) < 0) {
error("undefined.names already exists");
}
// Create a strbuf using *datav and register it as "undefined.data"
STRBUF *sb_datav = strbuf_create_full(datav, strlen(datav));
// XXX why is sb_datav copied? was that part of my debugging?
// XXX or is this the cause of the leak?
if (rbm_register(strbuf_copy(sb_datav), "undefined.data", 0) < 0) {
error("undefined data already exists");
}
// Create a strbuf using *costv and register it as "undefined.costs"
if (strlen(costv) > 0) {
// Rprintf("registering cost matrix: %s", *costv);
STRBUF *sb_costv = strbuf_create_full(costv, strlen(costv));
// XXX should sb_costv be copied?
if (rbm_register(sb_costv, "undefined.costs", 0) < 0) {
error("undefined.cost already exists");
}
} else {
// Rprintf("no cost matrix to register\n");
}
/*
* We need to initialize rbm_buf before calling any code that
* might call exit/rbm_exit.
*/
if ((val = setjmp(rbm_buf)) == 0) {
// Real work is done here
c50main();
if (rules == 0) {
// Get the contents of the the tree file
STRBUF *treebuf = rbm_lookup("undefined.tree");
if (treebuf != NULL) {
char *treeString = strbuf_getall(treebuf);
c50_return[0] = R_alloc(strlen(treeString) + 1, 1);
strcpy(c50_return[0], treeString);
c50_return[1] = "";
} else {
// XXX Should *treev be assigned something in this case?
// XXX Throw an error?
}
} else {
// Get the contents of the the rules file
STRBUF *rulesbuf = rbm_lookup("undefined.rules");
if (rulesbuf != NULL) {
char *rulesString = strbuf_getall(rulesbuf);
c50_return[1] = R_alloc(strlen(rulesString) + 1, 1);
strcpy(c50_return[1], rulesString);
c50_return[0] = "";
} else {
// XXX Should *rulesv be assigned something in this case?
// XXX Throw an error?
}
}
} else {
Rprintf("c50 code called exit with value %d\n", val - JMP_OFFSET);
}
// Close file object "Of", and return its contents via argument outputv
char *outputString = closeOf();
c50_return[2] = R_alloc(strlen(outputString) + 1, 1);
strcpy(c50_return[2], outputString);
// Deallocates memory allocated by NewCase
FreeCases();
// We reinitialize the globals on exit out of general paranoia
initglobals();
return c50_return;
}
***IMPORTANT: if the string created is longer than 2147483647, you also will need to change the definition of the variables i and j in the function strbuf_gets() in strbuf.c. This function basically iterates through each position of the string, so trying to increase their value above the INT limit to access those positions in the array will cause a segmentation fault. I suggest changing the declaration type to long in order to avoid this issue.
C5.0 PREDICTIONS
However, as the makeDataFile function is not only used to create the model but also to pass the data to the predictions() function, this function will also have to be modified. Just like previously, the .C() statement in predict.C5.0() used to call predictions() will be replaced with a .Call() statement in order to be able to pass the character vector to C, and the result will be returned in the result variable instead of being passed by reference:
predict.C5.0.R:
predict.C5.0 <- function (object,
newdata = NULL,
trials = object$trials["Actual"],
type = "class",
na.action = na.pass,
...) {
## Previous code stays the same
...
caseString <- makeDataFile(x = newdata, y = NULL)
num_chars = sum(nchar(caseString, type = "chars"))
## When passing trials to the C code, convert to
## zero if the original version of trials is used
if (trials <= 0)
stop("'trials should be a positive integer", call. = FALSE)
if (trials == object$trials["Actual"])
trials <- 0
## Add trials (not object$trials) as an argument
results <- .Call(
"call_predictions",
caseString,
as.character(num_chars),
as.character(object$names),
as.character(object$tree),
as.character(object$rules),
as.character(object$cost),
pred = integer(nrow(newdata)),
confidence = double(length(object$levels) * nrow(newdata)),
trials = as.integer(trials)
)
predictions = as.numeric(unlist(results[1]))
confidence = as.numeric(unlist(results[2]))
output = as.character(results[3])
if(any(grepl("Error limit exceeded", output)))
stop(output, call. = FALSE)
if (type == "class") {
out <- factor(object$levels[predictions], levels = object$levels)
} else {
out <-
matrix(confidence,
ncol = length(object$levels),
byrow = TRUE)
if (!is.null(rownames(newdata)))
rownames(out) <- rownames(newdata)
colnames(out) <- object$levels
}
out
}
In the file top.c, the predictions() function will be modified to receive the variables passed by the .Call() statement, so that just like previously, the caseString array will be concatenated into a single string and the rest of the variables casted to their respective types. In this case the variables pred and confidence will be also received as vectors of integer and double types and so they will need to be casted to int* and double*. The rest of the function is left as it was in order to create the predictions and the resulting variables predv, confidencev and output variables will be placed in the first three positions of a vector respectively.
top.c:
SEXP call_predictions(SEXP caseString, SEXP case_len, SEXP names, SEXP tree, SEXP rules, SEXP cost, SEXP pred, SEXP confidence, SEXP trials){
char* casev;
char* outputv = "";
char* eptr;
char* size_str = malloc((strlen(CHAR(STRING_ELT(case_len, 0)))+1)*sizeof(char));
strcpy(size_str, CHAR(STRING_ELT(case_len, 0)));
long size = strtol(size_str, &eptr, 10);
casev = malloc((size+1)*sizeof(char));
strcpy(casev, CHAR(STRING_ELT(caseString, 0)));
int n = length(caseString);
for (int i = 1; i < n; i++) {
strcat(casev, CHAR(STRING_ELT(caseString, i)));
}
char* namesv = malloc((strlen(CHAR(STRING_ELT(names, 0)))+1)*sizeof(char));
strcpy(namesv, CHAR(STRING_ELT(names, 0)));
char* treev = malloc((strlen(CHAR(STRING_ELT(tree, 0)))+1)*sizeof(char));
strcpy(treev, CHAR(STRING_ELT(tree, 0)));
char* rulesv = malloc((strlen(CHAR(STRING_ELT(rules, 0)))+1)*sizeof(char));
strcpy(rulesv, CHAR(STRING_ELT(rules, 0)));
char* costv = malloc((strlen(CHAR(STRING_ELT(cost, 0)))+1)*sizeof(char));
strcpy(costv, CHAR(STRING_ELT(cost, 0)));
int variable;
int* predv = &variable;
int npred = length(pred);
predv = malloc((npred+1)*sizeof(int));
for (int i = 0; i < npred; i++) {
predv[i] = INTEGER(pred)[i];
}
double variable1;
double* confidencev = &variable1;
int nconf = length(confidence);
confidencev = malloc((nconf+1)*sizeof(double));
for (int i = 0; i < nconf; i++) {
confidencev[i] = REAL(confidence)[i];
}
int* trialsv = &variable;
*trialsv = asInteger(trials);
/* Original code for predictions starts */
int val;
// Announce ourselves for testing
// Rprintf("predictions called\n");
// Initialize the globals
initglobals();
// Handles the strbufv data structure
rbm_removeall();
// XXX Should this be controlled via an option?
// Rprintf("Calling setOf\n");
setOf();
STRBUF *sb_cases = strbuf_create_full(casev, strlen(casev));
if (rbm_register(sb_cases, "undefined.cases", 0) < 0) {
error("undefined.cases already exists");
}
STRBUF *sb_names = strbuf_create_full(namesv, strlen(namesv));
if (rbm_register(sb_names, "undefined.names", 0) < 0) {
error("undefined.names already exists");
}
if (strlen(treev)) {
STRBUF *sb_treev = strbuf_create_full(treev, strlen(treev));
if (rbm_register(sb_treev, "undefined.tree", 0) < 0) {
error("undefined.tree already exists");
}
} else if (strlen(rulesv)) {
STRBUF *sb_rulesv = strbuf_create_full(rulesv, strlen(rulesv));
if (rbm_register(sb_rulesv, "undefined.rules", 0) < 0) {
error("undefined.rules already exists");
}
setrules(1);
} else {
error("either a tree or rules must be provided");
}
// Create a strbuf using *costv and register it as "undefined.costs"
if (strlen(costv) > 0) {
// Rprintf("registering cost matrix: %s", *costv);
STRBUF *sb_costv = strbuf_create_full(costv, strlen(costv));
// XXX should sb_costv be copied?
if (rbm_register(sb_costv, "undefined.costs", 0) < 0) {
error("undefined.cost already exists");
}
} else {
// Rprintf("no cost matrix to register\n");
}
if ((val = setjmp(rbm_buf)) == 0) {
// Real work is done here
// Rprintf("\n\nCalling rpredictmain\n");
rpredictmain(trialsv, predv, confidencev);
// Rprintf("predict finished\n\n");
} else {
// Rprintf("predict code called exit with value %d\n\n", val - JMP_OFFSET);
}
// Close file object "Of", and return its contents via argument outputv
char *outputString = closeOf();
char *output = R_alloc(strlen(outputString) + 1, 1);
strcpy(output, outputString);
// We reinitialize the globals on exit out of general paranoia
initglobals();
/* Original code for predictions ends */
free(namesv);
free(treev);
free(rulesv);
free(costv);
SEXP predx = PROTECT(allocVector(INTSXP, npred));
for (int i = 0; i < npred; i++) {
INTEGER(predx)[i] = predv[i];
}
SEXP confidencex = PROTECT(allocVector(REALSXP, nconf));
for (int i = 0; i < npred; i++) {
REAL(confidencex)[i] = confidencev[i];
}
SEXP outputx = PROTECT(allocVector(STRSXP, 1));
SET_STRING_ELT(outputx, 0, mkChar(output));
SEXP vector = PROTECT(allocVector(VECSXP, 3));
SET_VECTOR_ELT(vector, 0, predx);
SET_VECTOR_ELT(vector, 1, confidencex);
SET_VECTOR_ELT(vector, 2, outputx);
UNPROTECT(4);
free(predv);
free(confidencev);
return vector;
}
https://play.golang.org/p/DOhYaiH53Ek
I do not understand the *&p operation nor how a pointer *p which is an int() value NOT a memory address is able to modify its value i.e. valueX = valueY. I know that I must be misunderstanding something, but this code literally appears to be self-conflicting.
//point
package main
import "reflect"
var pt int = 27
func main() {
println(reflect.TypeOf(pt))
println("pt = ", pt) //value of pt
println("&pt = ", &pt) //memory address of pt
updatePointer(&pt)
println("pt = ", pt) //value of pt
}
func updatePointer(p *int) { //requires memory address of an int
println("&p = ", &p) //memory address of p
println("p = ", p) //memory address of pt
println("*p before = ", *p) //value of pt
println(*p == 27) //true
*p = 14 //27 = 14??????????
println("*p =", *p) //value of pt
println(reflect.TypeOf(&pt) == reflect.TypeOf(*&p)) //true!!!?????
println("*&p = ", *&p) //memory address which p's memory address evals to???? 0x800 (p) -> 0x900 (pt) = 0x800 (p)?
}
/*
Why can't I do the following?
func updatePointer(p *int){
p = 14
//OR
&p = 14
//OR
*&p = 14
}
*/
Considering *&p firstly lets simplify the example (removing all of the irrelevant stuff)
var pt int = 27
p := &pt // because &pt is passed as the argument into updatePointer
*p = 14 // This assigns 14 to whatever p points to (i.e. pt)
println(reflect.TypeOf(&pt) == reflect.TypeOf(*&p))
*& does nothing (and I can see no reason to use it in a real application); the statement gets the address of p (the &p bit) and then gets what the result points at (p). So this an be rewritten
println(reflect.TypeOf(&pt) == reflect.TypeOf(p))
p is pointing to pt so p == &pt (by definition this means that *p == pt). This means they are the same thing so will, of course, have the same type.
So why does *p = 14 work? You said that "*p which is an int() value NOT a memory address is able to modify its value" but that is not quite what the spec says:
For an operand x of pointer type *T, the pointer indirection *x
denotes the variable of type T pointed to by x. If x is nil, an
attempt to evaluate *x will cause a run-time panic.
So what *p = 14 is saying is set the variable that p points to to 14.
Now lets look at your second question:
// Why can't I do the following?
func updatePointer(p *int){
p = 14
//OR
&p = 14
//OR
*&p = 14
}
So p is a pointer to an integer (thats what *int means). Saying p = 14 is an attempt to set the pointer (not the int) to the value 14. 14 is an int not a pointer hence the compiler error cannot use 14 (type int) as type *int in assignment.
&p = 14 is saying set the address of p to 14. p is a *int so getting the address will give you a **int (a pointer to a pointer to an int). The compiler error you will get is cannot assign to &p and this is because the result of &p is not addressable. You could get around that by saying
x := &p
x = 14
and this will give you the error you might expect based upon what I said above: cannot use 14 (type int) as type **int in assignment.
*&p = 14 is basically the same as saying p=14 (it gets the address of p and then gets whatever the result points at whis will be p).
Pointers can get quite confusing (especially in contrived examples) and this article may help you understand.
Simplify. For example,
package main
import "fmt"
func f(q *int) {
fmt.Println(*q, q, &q, "f")
*q = 14
fmt.Println(*q, q, &q, "f")
}
func main() {
var i int = 27
var p *int = &i
fmt.Println(*p, p, &p, i, "main")
f(p)
fmt.Println(*p, p, &p, i, "main")
}
Output:
27 0x40e020 0x40c138 27 main
27 0x40e020 0x40c148 f
14 0x40e020 0x40c148 f
14 0x40e020 0x40c138 14 main
Errors:
/*
func g(r *int) {
// cannot use 14 (type int) as type *int in assignment
r = 14
// cannot assign to &r
&r = 14
// cannot use 14 (type int) as type *int in assignment
*(&r) = 14
}
*/
Playground: https://play.golang.org/p/Hwe3anFBTfD
In Go, all arguments are passed by value, as if by assignment. A pointer p to an int i is passed by value (q = p) to function f. The pointer q, a copy of p, is used to modify the value of i, *q = 14, *q dereferences type *int to type int.
The function g compiler error messages explain why the statements are illegal. For example, *&r = 14 is *(&r) = 14 is r = 14, cannot use 14 (type int) as type *int in assignment, r is type *int.
References:
A Tour of Go
The Go Programming Language Specification
Is there a way to get a non-deterministic output from the CryptoAPI? In other words, a different string output when encrypting a string.
For example, using CALG_AES_256 when deriving a crypt key with password of 'password' and string to encrypt of 'a', it always returns "SnÆwÞ¢L\x1e?6FÏLþw"
I'm somewhat of a n00b in using CryptoAPI, so any assistance is appreciated.
Edited:
Here is the cryptography code from Microsoft's example code decrypte and encrypt This is the same code, just shortened/compacted. This code was compiled in VS 2017 as a Win32 Console app. pszSource and pszDest are two files in the C:\temp folder. source.txt has the letter we're trying to encrypt in it.
The problem I'm having is that this crypt/decrypt code from the CryptoAPI does not allow certain strings to be encrypted and then decrypted (i.e. n, t, L, p, aa, ab, ac, ad, ae, etc). If someone can tell me why, that would be very helpful.
#include <windows.h>
#include <tchar.h>
#include <wincrypt.h>
#define KEYLENGTH 0x00800000
#define ENCRYPT_BLOCK_SIZE 8
bool MyDecryptFile(LPTSTR szSource,LPTSTR szDestination,LPTSTR szPassword);
bool MyEncryptFile(LPTSTR szSource,LPTSTR szDestination,LPTSTR szPassword);
int _tmain(int argc, _TCHAR* argv[])
{
LPTSTR pszSource = L"c:\\temp\\source.txt";
LPTSTR pszDestination = L"c:\\temp\\dest.txt";
LPTSTR pszPassword = L"t";
if (MyEncryptFile(pszSource, pszDestination, pszPassword))
{
_tprintf(TEXT("Encryption of the file %s was successful. \n"),pszSource);
_tprintf(TEXT("The encrypted data is in file %s.\n"),pszDestination);
}
if (MyDecryptFile(pszSource, pszDestination, pszPassword))
{
_tprintf(TEXT("Encryption of the file %s was successful. \n"),pszSource);
_tprintf(TEXT("The encrypted data is in file %s.\n"),pszDestination);
}
return 0;
}
bool MyEncryptFile(LPTSTR pszSourceFile,LPTSTR pszDestinationFile,LPTSTR pszPassword)
{
bool fReturn = false;
HANDLE hSourceFile = INVALID_HANDLE_VALUE, hDestinationFile = INVALID_HANDLE_VALUE;
HCRYPTPROV hCryptProv = NULL;
HCRYPTKEY hKey = NULL, hXchgKey = NULL;
HCRYPTHASH hHash = NULL;
PBYTE pbBuffer = NULL;
DWORD dwBlockLen, dwBufferLen, dwCount;
hSourceFile = CreateFile(pszSourceFile,FILE_READ_DATA,FILE_SHARE_READ,NULL,OPEN_EXISTING,FILE_ATTRIBUTE_NORMAL,NULL);
if (INVALID_HANDLE_VALUE == hSourceFile)
goto Exit_MyEncryptFile;
hDestinationFile = CreateFile(pszDestinationFile,FILE_WRITE_DATA,FILE_SHARE_READ,NULL,OPEN_ALWAYS,FILE_ATTRIBUTE_NORMAL,NULL);
if (INVALID_HANDLE_VALUE == hDestinationFile)
goto Exit_MyEncryptFile;
CryptAcquireContext(&hCryptProv,NULL,MS_ENH_RSA_AES_PROV,PROV_RSA_AES,0);
CryptCreateHash(hCryptProv,CALG_SHA_256,0,0,&hHash);
CryptHashData(hHash,(BYTE *)pszPassword,lstrlen(pszPassword),0);
CryptDeriveKey(hCryptProv,CALG_AES_256,hHash,CRYPT_EXPORTABLE,&hKey);
dwBlockLen = 1000 - 1000 % ENCRYPT_BLOCK_SIZE;
if (ENCRYPT_BLOCK_SIZE > 1)
dwBufferLen = dwBlockLen + ENCRYPT_BLOCK_SIZE;
else
dwBufferLen = dwBlockLen;
pbBuffer = (BYTE *)malloc(dwBufferLen);
bool fEOF = FALSE;
do
{
if (ReadFile(hSourceFile,pbBuffer,dwBlockLen,&dwCount,NULL))
{
if (dwCount < dwBlockLen)
fEOF = TRUE;
if (CryptEncrypt(hKey,NULL,fEOF,0,pbBuffer,&dwCount,dwBufferLen))
WriteFile(hDestinationFile,pbBuffer,dwCount,&dwCount,NULL);
}
}
while (!fEOF);
fReturn = true;
Exit_MyEncryptFile:
if (hSourceFile) CloseHandle(hSourceFile);
if (hDestinationFile) CloseHandle(hDestinationFile);
if (pbBuffer) free(pbBuffer);
if (hHash) {CryptDestroyHash(hHash);hHash = NULL;}
if (hKey) CryptDestroyKey(hKey);
if (hCryptProv) CryptReleaseContext(hCryptProv, 0);
return fReturn;
}
bool MyDecryptFile(LPTSTR pszSourceFile,LPTSTR pszDestinationFile,LPTSTR pszPassword)
{
bool fReturn = false;
HANDLE hSourceFile = INVALID_HANDLE_VALUE, hDestinationFile = INVALID_HANDLE_VALUE;
HCRYPTKEY hKey = NULL;
HCRYPTHASH hHash = NULL;
HCRYPTPROV hCryptProv = NULL;
PBYTE pbBuffer = NULL;
DWORD dwCount, dwBlockLen, dwBufferLen;
hSourceFile = CreateFile(pszDestinationFile,FILE_READ_DATA,FILE_SHARE_READ,NULL,OPEN_ALWAYS,FILE_ATTRIBUTE_NORMAL,NULL);
if (INVALID_HANDLE_VALUE == hSourceFile)
goto Exit_MyDecryptFile;
hDestinationFile = CreateFile(pszSourceFile,FILE_WRITE_DATA,FILE_SHARE_READ,NULL,OPEN_EXISTING,FILE_ATTRIBUTE_NORMAL,NULL);
if (INVALID_HANDLE_VALUE == hDestinationFile)
goto Exit_MyDecryptFile;
CryptAcquireContext(&hCryptProv,NULL,MS_ENH_RSA_AES_PROV,PROV_RSA_AES,0);
CryptCreateHash(hCryptProv,CALG_SHA_256,0,0,&hHash);
CryptHashData(hHash,(BYTE *)pszPassword,lstrlen(pszPassword),0);
CryptDeriveKey(hCryptProv,CALG_AES_256,hHash,CRYPT_EXPORTABLE,&hKey);
dwBlockLen = 1000 - 1000 % ENCRYPT_BLOCK_SIZE;
dwBufferLen = dwBlockLen;
pbBuffer = (PBYTE)malloc(dwBufferLen);
bool fEOF = false;
do
{
if (!ReadFile(hSourceFile,pbBuffer,dwBlockLen,&dwCount,NULL))
goto Exit_MyDecryptFile;
if (dwCount <= dwBlockLen)
fEOF = TRUE;
LONG rv = CryptDecrypt(hKey,0,fEOF,0,pbBuffer,&dwCount);
if (rv==0)
{
DWORD dwErr = GetLastError(); // <--- fails if password and string are n, t, L, p, aa, ab, ac, ad , ae
goto Exit_MyDecryptFile;
}
if (!WriteFile(hDestinationFile,pbBuffer,dwCount,&dwCount,NULL))
goto Exit_MyDecryptFile;
}
while (!fEOF);
fReturn = true;
Exit_MyDecryptFile:
if (pbBuffer) free(pbBuffer);
if (hSourceFile) CloseHandle(hSourceFile);
if (hDestinationFile) CloseHandle(hDestinationFile);
if (hHash) {CryptDestroyHash(hHash);hHash = NULL;}
if (hKey) CryptDestroyKey(hKey);
if (hCryptProv) CryptReleaseContext(hCryptProv, 0);
return fReturn;
}
What about using this to get the KP_IV option?
BOOL bRV;
bRV = CryptAcquireContextW(&hCryptProv, NULL, MS_ENH_RSA_AES_PROV, PROV_RSA_AES, 0);
bRV = CryptGenKey(hCryptProv, CALG_AES_256,0,&hKey);
DWORD dwMode = CRYPT_MODE_CBC;
bRV = CryptSetKeyParam(hKey,KP_MODE,(BYTE*)&dwMode,0);
BYTE pbData[16];
memcpy(pbData,"n",sizeof("n")); // <--- Hard coded password
bRV = CryptSetKeyParam(hKey,KP_IV,pbData,0);
enter code here
If you want to obtain different cypertext when encrypting the same plaintext with the same key, you have to use the CBC mode of operation:
https://en.wikipedia.org/wiki/Block_cipher_mode_of_operation
In order to correctly encrypt with CBC, you need to generate a different random Initialization Vector (IV) every time.
In order to decrypt, you need to know the IV used during encryption.
So, the IV must be associated (in clear) to the cyphertext.
In reference to your example, when calling the CryptDeriveKey function, the CBC is the default mode but it uses an IV set to zero and this invalidates the utility of the CBC operating mode:
https://msdn.microsoft.com/en-us/library/windows/desktop/aa379916(v=vs.85).aspx
In order to set the random IV you need to call the CryptSetKeyParam function, which accept the KP_IV param:
https://msdn.microsoft.com/en-us/library/windows/desktop/aa380272(v=vs.85).aspx
Bye
Giovanni
Recently, I have been trying to educate myself on how to encrypt and decrypt using the Vigenere Cipher.
I have successfully encrypted the message and these are the steps I undertook to achieve encryption:
Encryption Key: Set
Message: Top secret
Step 1: Numerical representation of key is 18, 4, 19 (Using the table below)
Working Out:
Reminder:
P is the set of plaintext units
C is the set of ciphertext units
K is the set of keys
E: P x K -> C is the encryption function
D: C x K -> P is the decryption function
Plaintext: top secret
Ciphertext: ISIKIVJIM
Although I have managed to encrypt the message "top secret" I am struggling to decrypt messages using the Vigenere Cipher method using the numerical technique I used above. Can someone explain to me how I can decrypt lets say: ISIKIVJIM (the ciphertext from above) to its original plain text message which is "top secret".
Thanks.
As pointed out in the comments the decryption formula is : p = c - k mod 26, also note that we have to perform modular arithmetic so our answer for any input should belong in the range of 0 - 25, i.e if we get a negative number we can add 26(i.e the number we are taking modulo by) until we are within this range you can read more about this here :
https://en.wikipedia.org/wiki/Modular_arithmetic
So the decryption will be like :
L = 11 - 18 = -7 mod 26 = -7 + 26 = 19 = T
S = 18 - 4 = 14 mod 26 = 14 = O
I = 8 - 19 = -11 mod 26 = -11 + 26 = 15 = P
ans so on...
I have also written a c++ code : http://ideone.com/M3BAmq
I recently wrote a java program that encrypts and decrypts in Vigenere using bytes. You need to convert the plain text/ crypt text into byte array and pass it in.
public static byte [] encryptVigenere (byte [] pt, String key)
{
byte [] c_text = new byte [pt.length];
byte [] key_text = key.getBytes();
byte tmp;
int shift;
for (int i = 0, j = 0; i < pt.length; i++)
{
if (j >= key_text.length)
j = 0;
shift = key_text[j] - 65; //index of alphabet
tmp = (byte) (pt[i] + shift);
if (tmp > 'Z')
tmp = (byte) (pt[i] - (26-shift));
c_text[i] = tmp;
j++;
}
return c_text;
}
public static byte [] decryptVigenere (byte [] ct, String key)
{
byte [] p_text = new byte [ct.length];
byte [] key_text = key.getBytes();
byte tmp;
int shift;
for (int i = 0, j = 0; i < ct.length; i++)
{
if (j >= key_text.length)
j = 0;
shift = key_text[j] - 65; //index of alphabet
tmp = (byte) (ct[i] - shift);
if (tmp < 'A')
tmp = (byte) (ct[i] + (26-shift));
p_text[i] = tmp;
j++;
}
return p_text;
}