R-C50: Definition of minCases in C5.0Control

The parameter minCases of the function C5.0Control in the C50 R package is defined as:
an integer for the smallest number of samples that must be put in at least two of the splits.
How is this implemented? I assume that split in this context refers to the nodes resulting from the split operation. minCases does not seem to represent the smallest number of cases that must be put in at least one node, as I would have expected.
I have tried to find the implementation in the C source code. The variable minCases seems to be defined in extern.h in line 33:
extern CaseCount MINITEMS, LEAFRATIO;
It is used, for instance, in prune.c, lines 249 and 250:
if (BranchCases[v] < MINITEMS) {
ForEach(i, Bp, Ep) { SmallBranches[Class(Case[i])] += Weight(Case[i]); }
What does minCases really do?

Related

How do I represent sparse arrays in Pari/GP?

I have a function that maps integer inputs to integer outputs. The output values are relatively sparse; the function only returns around 2^14 unique outputs for the input values 1..2^16. I want to create a dataset that lets me quickly find the inputs that produce any given output.
At present, I'm storing my dataset in a Map of Lists, with each output value serving as the key for a List of input values. This seems slow and appears to use a whole lot of stack space. Is there a more efficient way to create/store/access my dataset?
Added:
It turns out the time taken by my sparsearray() function varies hugely with the ratio of output values (i.e., keys) to input values (the values stored in the lists). Here's the time taken for a function that requires many lists, each with only a few values:
? sparsearray(2^16,x->x\7);
time = 126 ms.
Here's the time taken for a function that requires only a few lists, each with many values:
? sparsearray(2^12,x->x%7);
time = 218 ms.
? sparsearray(2^13,x->x%7);
time = 892 ms.
? sparsearray(2^14,x->x%7);
time = 3,609 ms.
As you can see, the time grows far faster than linearly; it roughly quadruples each time n doubles!
Here's my code:
\\ sparsearray takes two arguments, an integer "n" and a closure "myfun",
\\ and returns a Map() in which each key is a number, and each key is associated
\\ with a List() of the input numbers for which the closure produces that output.
\\ E.g.:
\\ ? sparsearray(10,x->x%3)
\\ %1 = Map([0, List([3, 6, 9]); 1, List([1, 4, 7, 10]); 2, List([2, 5, 8])])
sparsearray(n,myfun=(x)->x)=
{
  my(m=Map(), output, oldvalue=List());
  for(loop=1, n,
    output=myfun(loop);
    if(!mapisdefined(m,output),
      /* then */
      oldvalue=List(),
      /* else */
      oldvalue=mapget(m,output));
    listput(oldvalue,loop);
    mapput(m,output,oldvalue));
  m
}
To some extent, the behavior you are seeing is to be expected. PARI appears to pass lists and maps by value rather than by reference, except to the special built-in functions for manipulating them. This can be seen by creating a wrapper function like mylistput(list,item)=listput(list,item);. When you try to use this function you will discover that it doesn't work, because it operates on a copy of the list (see the short demonstration below). Arguably this is a bug in PARI, but perhaps they have their reasons. The upshot of this behavior is that each time you add an element to one of the lists stored in the map, the entire list is copied, possibly twice.
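For example, a minimal sketch of the copy behaviour (mylistput is just the illustrative wrapper named above):
? mylistput(list, item) = listput(list, item);   \\ list arrives as a copy
? L = List();
? mylistput(L, 7);
? L    \\ still prints List([]): the element went into the copy, not into L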
The following is a solution that avoids this issue.
sparsearray(n,myfun=(x)->x)=
{
  my(vi=vector(n, i, i));             \\ input values
  my(vo=vector(n, i, myfun(vi[i])));  \\ output values
  my(perm=vecsort(vo,,1));            \\ obtain order of output values as a permutation
  my(list=List(), bucket=List(), key);
  for(loop=1, #perm,
    if(loop==1 || vo[perm[loop]]<>key,
      if(#bucket, listput(list,[key,Vec(bucket)]); bucket=List());
      key=vo[perm[loop]]);
    listput(bucket, vi[perm[loop]])
  );
  if(#bucket, listput(list,[key,Vec(bucket)]));
  Mat(Col(list))
}
The output is a matrix in the same format as a map; if you would rather have a map, it can be converted with Map(...), but you probably want a matrix for processing, since there is no built-in function on a map to get the list of keys.
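If you do want a Map, the conversion is a one-liner; a quick sketch reusing the example from the comments above:
? M = Map(sparsearray(10, x->x%3));
? mapget(M, 1)    \\ the inputs that map to 1, here [1, 4, 7, 10]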
I did a little bit of reworking of the above to try to make something more akin to GroupBy in C# (a function that could be useful for many things).
VecGroupBy(v, f)={
  my(g=vector(#v, i, f(v[i])));  \\ groups
  my(perm=vecsort(g,,1));
  my(list=List(), bucket=List(), key);
  for(loop=1, #perm,
    if(loop==1 || g[perm[loop]]<>key,
      if(#bucket, listput(list,[key,Vec(bucket)]); bucket=List());
      key=g[perm[loop]]);
    listput(bucket, v[perm[loop]])
  );
  if(#bucket, listput(list,[key,Vec(bucket)]));
  Mat(Col(list))
}
You would use this like VecGroupBy([1..300],i->i%7).
There is no good native GP solution: because of the way garbage collection occurs, passing arguments by reference has to be restricted in GP's memory model (from version 2.13 on, it is supported for function arguments using the ~ modifier, but not for map components).
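For what it's worth, a minimal sketch of the ~ modifier (assumes GP >= 2.13; mylistput is the illustrative wrapper from the earlier answer):
? mylistput(~list, item) = listput(~list, item);   \\ ~ passes the list by reference
? L = List();
? mylistput(~L, 7);
? L    \\ prints List([7]): the caller's list really was modified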
Here is a solution using the libpari function vec_equiv(), which returns the equivalence classes of identical objects in a vector.
install(vec_equiv,G);
sparsearray(n, f=x->x)=
{
my(v = vector(n, x, f(x)), e = vec_equiv(v));
[vector(#e, i, v[e[i][1]]), e];
}
? sparsearray(10, x->x%3)
%1 = [[0, 1, 2], [Vecsmall([3, 6, 9]), Vecsmall([1, 4, 7, 10]), Vecsmall([2, 5, 8])]]
(you have 3 values corresponding to the 3 given sets of indices)
The behaviour is linear, as expected:
? sparsearray(2^20,x->x%7);
time = 307 ms.
? sparsearray(2^21,x->x%7);
time = 670 ms.
? sparsearray(2^22,x->x%7);
time = 1,353 ms.
Use the mapput, mapget and mapisdefined functions on a map created with Map(). If multiple dimensions are required, use a polynomial or vector key (sketched below).
I guess that is what you are already doing, and I'm not sure there is a better way. Do you have some code? From personal experience, 2^16 values with 2^14 keys should not be an issue with regard to speed or memory; there may be some unnecessary copying going on in your implementation.
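For illustration, a minimal sketch of a Map with a vector key (the key [2, 3] and the value 42 are arbitrary placeholders):
? m = Map();
? mapput(m, [2, 3], 42);             \\ the key is the vector [2, 3]
? mapget(m, [2, 3])                  \\ returns 42
? mapisdefined(m, [2, 3], &val)      \\ returns 1 and sets val to 42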

What does this extra '+' represent in this code? Recursive function

Problem:
A digital root is the recursive sum of all the digits in a number. Given n, take the sum of the digits of n. If that value has more than one digit, continue reducing in this way until a single-digit number is produced. This is only applicable to the natural numbers.
example:
digital_root(16)
=> 1 + 6
=> 7
This is a function that was coded:
function digital_root(n) {
  if (n < 10) {
    return n;
  }
  return digital_root(n.toString().split('').reduce(function (a, b) {
    return a + +b;
  }, 0));
}
Can someone clarify what the extra + is doing in this line of code? return a + +b;
It's probably a sneaky way of converting a string to an integer. You don't say what language this is, but many dynamic languages allow variables to hold any type without declaration and use + for both addition and string concatenation, with implicit conversions between strings and numbers. Such languages make it easy to accidentally get the wrong operation (concatenating when you intend to add, or vice versa).
However, a unary + is (usually) a numeric identity: it converts its argument to a number if it happens to be a string, and does nothing if the argument is already a number. The binary + will then add rather than concatenate.
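For instance, a minimal sketch of the difference in a JavaScript console (the variable names are purely illustrative):
var a = 1;
var b = "6";          // a string digit, as produced by split('')
console.log(a + b);   // "16": binary + concatenates when one operand is a string
console.log(a + +b);  // 7: unary + converts "6" to 6, then binary + adds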

Add my custom loss function to torch

I want to add a loss function to torch that calculates the edit distance between predicted and target values.
Is there an easy way to implement this idea?
Or do I have to write my own class with backward and forward functions?
If your criterion can be represented as a composition of existing modules and criteria, it's a good idea to simply construct such a composition using containers. The only problem is that the standard containers are designed to work with modules only, not criteria. The difference is in the :forward method signature:
module:forward(input)
criterion:forward(input, target)
Luckily, we are free to define our own container which is able to work with criteria too. For example, a generalized Sequential:
local GeneralizedSequential, _ = torch.class('nn.GeneralizedSequential', 'nn.Sequential')

function GeneralizedSequential:forward(input, target)
  return self:updateOutput(input, target)
end

function GeneralizedSequential:updateOutput(input, target)
  local currentOutput = input
  for i = 1, #self.modules do
    currentOutput = self.modules[i]:updateOutput(currentOutput, target)
  end
  self.output = currentOutput
  return currentOutput
end
Below is an illustration of how to implement nn.CrossEntropyCriterion using this generalized sequential container:
function MyCrossEntropyCriterion(weights)
  criterion = nn.GeneralizedSequential()
  criterion:add(nn.LogSoftMax())
  criterion:add(nn.ClassNLLCriterion(weights))
  return criterion
end
Check whether everything is correct:
output = torch.rand(3,3)
target = torch.Tensor({1, 2, 3})
mycrit = MyCrossEntropyCriterion()
-- print(mycrit)
print(mycrit:forward(output, target))
print(mycrit:backward(output, target))
crit = nn.CrossEntropyCriterion()
-- print(crit)
print(crit:forward(output, target))
print(crit:backward(output, target))
Just to add to the accepted answer, you have to be careful that the loss function you define (edit distance in your case) is differentiable with respect to the network parameters.

Static keyword in binary tree searching

Given a binary tree, find the maximum sum path from a leaf to the root. For example, in the following tree there are three leaf-to-root paths: 8->-2->10, -4->-2->10 and 7->10. The sums of these three paths are 16, 4 and 17 respectively. The maximum of them is 17, and the path for that maximum is 7->10.
    10
   /  \
 -2    7
 / \
8  -4
This is a function to calculate the maximum sum from the root to any leaf node in the given binary tree. This problem is asked many times in interviews by various companies. I tried declaring ls and rs as static, but it produces wrong output. When I remove the static keyword it produces the correct output.
Can you please explain what the problem with the static keyword is here?
int roottoleafmaxsum(struct node *temp)      //root will be passed as an argument
{
    static int ls, rs;                       //left sum and right sum
    if (temp)                                //if temp!=NULL then block will be executed
    {
        ls = roottoleafmaxsum(temp->left);   //recursive call to left
        rs = roottoleafmaxsum(temp->right);  //recursive call to right
        if (ls > rs)
            return (ls + temp->data);        //it will return sum of ls and data
        else
            return (rs + temp->data);        //it will return sum of rs and data
    }
}
static means the variables retain their values across function calls: there is only one ls and one rs shared by every invocation of the function.
Since you are using recursion, each recursive call overwrites those shared values, and the overwritten values are then used in the parent call, producing wrong results.
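For comparison, here is a minimal sketch with ls and rs as ordinary automatic (local) variables, so every call gets its own pair. The explicit base case returning 0 for a NULL node is an added assumption; the original code falls off the end of the function in that case, which is undefined behaviour.
int roottoleafmaxsum(struct node *temp)
{
    int ls, rs;                          /* one pair of locals per call */
    if (temp == NULL)
        return 0;                        /* assumed base case: empty subtree contributes 0 */
    ls = roottoleafmaxsum(temp->left);   /* recurse into the left subtree */
    rs = roottoleafmaxsum(temp->right);  /* recurse into the right subtree */
    if (ls > rs)
        return ls + temp->data;
    else
        return rs + temp->data;
}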

Method vector::clear doesn't clear a two dimensional vector

Thank you very much for the answer.
What makes me want to check the contents of the vector is that even after I overwrite it, the same (strange) values remain. My purpose is to generate some random variables iteratively and put them in the two-dimensional vector. Here is my code; maybe I am doing something wrong:
while (nbre_TTI_tempo != 0)
{
    srand(time(NULL));
    while (nbre_UE_tempo != 0)
    {
        vect.clear();
        nbre_PRB_tempo = nbre_PRB;
        while (nbre_PRB_tempo != 0)
        {
            value = rand() % 15 + 1; // generating random variables between 1 and 15
            vect.push_back(value);
            nbre_PRB_tempo--;
        }
        All_CQI.push_back(vect);
        nbre_UE_tempo--;
    }
    // Do business
    All_CQI.clear();
} // end while
On the first round everything goes well, but on the second one, this is what I find in the vector after calling clear():
158429184
14
15
158429264
10
9
158429440
5
1
And when I try to overwrite it, I find:
158429184
14
15
158429264
10
9
158429440
5
1
These are the same values as before the push_back calls.
Do you think I'm doing something wrong in my code?
Thank you very much in advance for your help.
For your purposes, if empty() returns true you should trust that the vector is empty and should NOT inspect individual elements. Accessing the contents of an empty vector is illegal and can cause memory access errors.
The reason you still see values is that the underlying memory is not overwritten immediately; the elements are only marked as destroyed. Until something else is allocated in the same memory, the old data may remain as it is, but there is no way to be sure: the behaviour is left to the implementation.
It looks like you are using operator[] to access the elements. To be safer, iterate with iterators or, better, use the .at() method: .at() performs bounds checking and throws std::out_of_range rather than letting you read beyond the last valid element.
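As a small self-contained sketch (not your code), the following shows the difference: after clear() the vector reports size 0, operator[] on it would be undefined behaviour, and at() reports the error by throwing.
#include <iostream>
#include <stdexcept>
#include <vector>

int main()
{
    std::vector<int> v{1, 2, 3};
    v.clear();                            // destroys the elements; size() becomes 0
    std::cout << v.size() << '\n';        // prints 0
    // std::cout << v[0] << '\n';         // undefined behaviour: might print stale data
    try {
        std::cout << v.at(0) << '\n';     // bounds-checked access
    } catch (const std::out_of_range& e) {
        std::cout << "out of range: " << e.what() << '\n';
    }
    return 0;
}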
