Can someone explain to me the difference between the ID3 and CART algorithms? - r

I have to create decision trees with the R software and the rpart Package.
In my paper I should first define the ID3 algorithm and then implement various decision trees.
I found out that the rpart package does not work with the ID3 algorithm; it uses the CART algorithm. I would like to understand the difference and perhaps explain it in my paper, but I have not found any literature that compares the two.
Can you help me? Do you know a paper where both are compared or can you explain the difference to me?

I don't have access to the original texts 1,2 but using some secondary sources, key differences between these recursive ("greedy") partitioning ("tree") algorithms seem to be:
Type of learning:
ID3, the "Iterative Dichotomiser," is a classification-only algorithm: it learns trees for categorical targets and does not handle regression.
CART, or "Classification And Regression Trees," is a family of algorithms (including, but not limited to, binary classification tree learning). With rpart(), you can specify method='class' or method='anova', but rpart can infer this from the type of dependent variable (i.e., factor or numeric).
Loss functions used for split selection.
ID3, as other comments have mentioned, selects its splits based on Information Gain, which is the reduction in entropy between the parent node and (weighted sum of) children nodes.
CART, when used for classification, selects its splits to achieve the subsets that minimize Gini impurity (both criteria are made concrete in the sketch below).
Anecdotally, as a practitioner, I hardly ever hear the term ID3 used, whereas CART is often used as a catch-all term for decision trees. CART has a very popular implementation in R's rpart package. ?rpart notes that "In most details it follows Breiman et. al (1984) quite closely."
However, you can pass rpart(..., parms=list(split='information')) to override the default behavior and split on information gain instead.
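A minimal sketch with a built-in dataset (iris, not the question's data): the same rpart() call with the default Gini criterion and with the information-gain criterion, plus toy impurity functions to make the two split criteria concrete.

    library(rpart)

    # CART-style classification tree, default split criterion (Gini impurity)
    fit_gini <- rpart(Species ~ ., data = iris, method = "class")

    # same call, but splitting on information gain (entropy) instead
    fit_info <- rpart(Species ~ ., data = iris, method = "class",
                      parms = list(split = "information"))

    # toy impurity functions for a vector of class proportions p
    entropy <- function(p) -sum(p[p > 0] * log2(p[p > 0]))  # basis of information gain (ID3)
    gini    <- function(p) 1 - sum(p^2)                      # CART's default criterion
    entropy(c(0.5, 0.5))  # 1 bit
    gini(c(0.5, 0.5))     # 0.5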
1 Quinlan, J. R. (1986). Induction of decision trees. Machine Learning, 1(1), 81–106.
2 Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1984). Classification and Regression Trees. Monterey, CA: Wadsworth & Brooks/Cole Advanced Books & Software.

http://www.cs.umd.edu/~samir/498/10Algorithms-08.pdf
Read section 1, "C4.5 and beyond", of the paper; it will clarify all your doubts, as it did mine.
Don't be discouraged by the title: it's about the differences between various tree algorithms.
In any case, it's a good paper to read through.

The ID3 algorithm handles categorical features and a categorical label.
CART additionally supports continuous features and a continuous label (i.e., regression).

Related

No correlation after phylogenetic independent contrast

I am testing the correlation between two physiological parameters in plants. I am a bit stuck on the interpretation of the phylogenetic independent contrasts (PICs). Without considering the PICs, I got a significant correlation, but there is no correlation between the PICs. What does the absence of correlation between the PICs mean:
the correlation without PICs is an effect of phylogeny,
OR
there is no phylogenetic effect on the correlation?
Looks like you've run into the classic case where an apparent correlation between two traits disappears when you analyze the independent contrasts. This could be due to relatedness: closely related species will tend to have similar trait values, and that might look like a correlation between the traits, but it is simply a byproduct of the bifurcating nature of phylogenies and the statistical non-independence of species' trait values. I would recommend going back to the early papers (1980s and 1990s) by Felsenstein, Garland, etc., and the book by Harvey and Pagel (1991), where these concepts feature prominently.
Otherwise, there are many similar threads on the r-sig-phylo mailing list (highly recommended), websites of various phylogenetic workshops (e.g., Bodega, Woods Hole, ...), and blogs (e.g., Liam Revell's phytools blog) that might be of help.
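For what it's worth, a minimal sketch of the contrast calculation in R, assuming the ape package and simulated traits rather than your plant data:

    library(ape)

    set.seed(1)
    tree <- rtree(30)        # random phylogeny with 30 tips
    x <- rTraitCont(tree)    # trait 1, simulated under Brownian motion
    y <- rTraitCont(tree)    # trait 2, evolving independently of trait 1

    cor.test(x, y)           # "raw" cross-species correlation, ignores phylogeny

    pic_x <- pic(x, tree)    # phylogenetically independent contrasts
    pic_y <- pic(y, tree)
    summary(lm(pic_y ~ pic_x - 1))  # contrasts are compared through the origin

If the regression through the origin on the contrasts is not significant while the raw correlation is, that is exactly the pattern described above.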

Maximal Information Coefficient vs Hierarchical Agglomerative Clustering

What is the difference between the Maximal Information Coefficient and Hierarchical Agglomerative Clustering in identifying functional and non-functional dependencies?
Which of them can identify duplicates better?
This question doesn't make a lot of sense, sorry.
The MIC and HAC have close to zero in common.
The MIC is a crippled form of "correlation" with a very crude heuristic search, plenty of promotional videos and news announcements, and some pretty harsh reviews from statisticians. You can file it in the category "if it had been submitted to an appropriate journal (rather than the quite unspecific and overrated Science, which probably shouldn't publish such topics at all, or at least should get better reviewers from the subject domains; it's not the first Science article of this quality...), it would have been rejected as-is; better expert reviewers would have demanded major changes". See, e.g.,
Noah Simon and Robert Tibshirani, Comment on “Detecting Novel Associations in Large Data Sets” by Reshef et al., Science Dec. 16, 2011
"As one can see from the Figure, MIC has lower power than dcor, in every case except the somewhat pathological high-frequency sine wave. MIC is sometimes less powerful than Pearson correlation as well, the linear case being particularly worrisome."
And "tibs" is a highly respected author. And this is just one of many surprised that such things get accepted in such a high reputation journal. IIRC, the MIC authors even failed to compare to "ancient" alternatives such as Spearman, to modern alternatives like dCor, or to properly conduct a test of statistical power of their method.
MIC works much worse than advertised when studied with statistical scrunity:
Gorfine, M., Heller, R., & Heller, Y. (2012). Comment on "detecting novel associations in large data sets"
"under the majority of the noisy functionals and non-functional settings, the HHG and dCor tests hold very large power advantages over the MIC test, under practical sample sizes; "
As a matter of fact, MIC gives wildly inappropriate results on some trivial data sets, such as a checkerboard uniform distribution ▄▀, which it considers maximally correlated (as correlated as y=x), by design. Its grid-based design is overfitted to the rather special scenario of the sine curve. It has some interesting properties, but these are IMHO captured better by earlier approaches such as Spearman and dCor.
The failure of the MIC authors to compare to Spearman is IMHO a severe omission, because their own method is also purely rank-based, if I recall correctly. Spearman is Pearson-on-ranks, yet they compare only to Pearson. The favorite example of MIC (another questionable choice) is the sine wave, which after rank transformation is actually a zigzag curve, not a sine anymore. I consider this to be "cheating" to make Pearson look bad, by not applying the rank transformation to Pearson, too. Good reviewers would have demanded such a comparison.
Now all of these complaints are essentially unrelated to HAC. HAC is not trying to define any form of "correlation", but it can be used with any distance or similarity (including correlation similarity).
HAC is something completely different: a clustering algorithm. It analyzes many rows, not just two (!) columns.
You could even combine them: if you compute the MIC for every pair of variables (though I'd rather use Pearson correlation, Spearman correlation, or distance correlation dCor instead), you can use HAC to cluster the variables.
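A minimal sketch of that combination in base R, using Spearman correlation (rather than MIC, for the reasons above) on a built-in dataset:

    X <- as.matrix(mtcars)                     # any numeric data, variables in columns
    assoc <- abs(cor(X, method = "spearman"))  # pairwise association between variables
    d <- as.dist(1 - assoc)                    # turn similarity into dissimilarity
    hc <- hclust(d, method = "average")        # hierarchical agglomerative clustering
    plot(hc)                                   # dendrogram of the variables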
For finding actual duplicates, neither is a good choice. Just sort your data, and duplicates will follow each other (or, if you sort columns, they will end up next to each other).
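To illustrate, in R exact duplicate rows can be flagged directly, no clustering needed:

    df <- mtcars[c(1, 2, 1, 3), ]    # toy data with one repeated row
    duplicated(df)                   # TRUE for the second occurrence of the repeated row
    df[order(do.call(paste, df)), ]  # or sort the rows, so duplicates end up adjacent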

Decision Tree algorithms in R packages

Is there any way to specify the algorithm used in any of the R packages for decision tree formation? I know that CART and C5.0 models are available. I want to find out about other decision tree algorithms such as ID3, C4.5 and OneRule algorithms.
EDIT: Due to the ambiguous nature of my question, I would like to clarify it. Is there some function (say fun()) which creates and trains a decision tree wherein we can specify the algorithm as a parameter of the function fun()?
For example, to find the correlation between two vectors we have cor(), where we can specify the method used as pearson, spearman or kendall.
Is there such a function for decision trees as well so we can use different algorithms like ID3, C4.5, etc?
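To make the analogy concrete, cor() is one function where the algorithm is chosen by an argument:

    x <- rnorm(50)
    y <- x + rnorm(50)
    cor(x, y, method = "pearson")
    cor(x, y, method = "spearman")
    cor(x, y, method = "kendall")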

Genetic Algorithms Introduction

Starting off, let me clarify that I have seen this Genetic Algorithm Resource question and it does not answer my question.
I am doing a project in bioinformatics. I have to take data about the NMR spectrum of a cell (E. coli) and find out which molecules (metabolites) are present in the cell.
To do this I am going to be using genetic algorithms in the R language. I DO NOT have the time to go through huge books on genetic algorithms. Heck! I don't even have time to go through little books. (That is what the linked question does not answer.)
So I need to know of resources which will help me quickly understand what genetic algorithms do and how they do it. I have read the Wikipedia entry, this webpage, and also a couple of IEEE papers on the subject.
Any working code in R(even in C) or pointers to which R modules(if any) to be used would be helpful.
A brief (and opinionated) introduction to genetic algorithms is at http://www.burns-stat.com/pages/Tutor/genetic.html
A simple GA written in R is available at http://www.burns-stat.com/pages/Freecode/genopt.R The "documentation" is in 'S Poetry' http://www.burns-stat.com/pages/Spoetry/Spoetry.pdf and the code.
I assume from your question that you have some function F(metabolites) which yields a spectrum, but you do not have the inverse function F⁻¹(spectrum) to get back the metabolites. The search space of metabolites is large, so rather than brute-force it you wish to try an approximate method (such as a genetic algorithm) that performs a more efficient random search.
In order to apply any such approximate method you will have to define a score function which compares the similarity between the target spectrum and a trial spectrum. The smoother this function is, the better the search will work. If it can only yield true/false, it will be a purely random search and you'd be better off with brute force.
Given F and your score (aka fitness) function, all you need to do is construct a population of possible metabolite combinations, run them all through F, score all the resulting spectra, and then use crossover and mutation to produce a new population that combines the best candidates. Choosing how to do the crossover and mutation is generally domain-specific, because you can greatly speed up the process by avoiding the creation of nonsense genomes. The best mutation rate is going to be very small, but will also require tuning for your domain.
Without knowing about your domain I can't say what a single member of your population should look like, but it could simply be a list of metabolites (which allows for ordering and duplicates, if that's interesting) or a string of boolean values over all possible metabolites (which has the advantage of being order invariant and yielding obvious possibilities for crossover and mutation). The string has the disadvantage that it may be more costly to filter out nonsense genes (for example it may not make sense to have only 1 metabolite or over 1000). It's faster to avoid creating nonsense rather than merely assigning it low fitness.
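A minimal sketch of that loop in base R, using the boolean-string representation just described. score_spectrum is a hypothetical placeholder: in the real problem it would run the candidate set through F and compare the simulated spectrum with the measured one; the fixed target set here only exists to make the toy example runnable.

    n_metabolites <- 100
    target <- sample(c(TRUE, FALSE), n_metabolites, replace = TRUE)

    # HYPOTHETICAL placeholder fitness: fraction of metabolites that agree with a
    # fixed target set; replace with a comparison of simulated vs. measured spectra
    score_spectrum <- function(genome) mean(genome == target)

    run_ga <- function(fitness, n_genes, pop_size = 50, generations = 200,
                       mutation_rate = 0.01) {
      # random initial population: one boolean genome per row
      pop <- matrix(sample(c(TRUE, FALSE), pop_size * n_genes, replace = TRUE),
                    nrow = pop_size)
      for (g in seq_len(generations)) {
        scores <- apply(pop, 1, fitness)
        # truncation selection: keep the better half as parents
        parents <- pop[order(scores, decreasing = TRUE)[1:(pop_size / 2)], , drop = FALSE]
        # one-point crossover between random parent pairs
        children <- t(replicate(pop_size - nrow(parents), {
          pair <- parents[sample(nrow(parents), 2), ]
          cut  <- sample(n_genes - 1, 1)
          c(pair[1, 1:cut], pair[2, (cut + 1):n_genes])
        }))
        # mutation: flip each gene with a small probability
        flip <- matrix(runif(length(children)) < mutation_rate, nrow = nrow(children))
        children[flip] <- !children[flip]
        pop <- rbind(parents, children)
      }
      pop[which.max(apply(pop, 1, fitness)), ]
    }

    best <- run_ga(score_spectrum, n_metabolites)
    score_spectrum(best)  # should approach 1 for this toy fitness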
There are other approximate methods if you have F and your scoring function. The simplest is probably Simulated Annealing. Another I haven't tried is the Bees Algorithm, which appears to be multi-start simulated annealing with effort weighted by fitness (sort of a cross between SA and GA).
I found the article "The science of computing: genetic algorithms" by Peter J. Denning (American Scientist, vol. 80, no. 1, pp. 12–14). That article is simple and useful if you want to understand what genetic algorithms do, and it is only 3 pages to read!

MCA or several CAs (multivariate analysis)

I am going to analyze some data about my company.
I thought of using a CA to represent the association between two variables. I have 3 variables: Category, Tag, Valoration. My idea is to run 2 analyses, one to view the association between Category and Valoration, and a second one between Tag and Valoration.
But I think that this representation is also possible with an MCA.
What do you recommend to me?
Thank You
Various classification or association rule mining algorithms could be of much help too. You could check the Weka toolbench for machine learning and data mining.
Assuming that all variables are categorical, you can use multiple classification analysis to gain an understanding of the associations between the variables. There was a good article on the topic from the European Consortium for Political Research back in 2007, but I can't find it on my drive; I'm sure Google will have it somewhere. I can't "see" your data, so I can't say with any certainty that MCA will be better than regression or a GLM, but the article I'm referring to has a discussion on this topic, specifically MCA vs. GLM vs. regression.
Alternatively, you could use Pearson product-moment correlations to quantify the pairwise relationships: close to 1 = positive linear relationship, close to -1 = negative linear relationship, close to 0 = no linear relationship.
I came across the VGAM package for categorical data analysis. You could check this too.
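For reference, a minimal sketch of the two options from the question, assuming the FactoMineR package and a data frame df with the three factors Category, Tag and Valoration (names taken from the question; the actual data is not available):

    library(FactoMineR)

    # option 1: two simple correspondence analyses of the cross-tabulations
    ca_category <- CA(table(df$Category, df$Valoration))
    ca_tag      <- CA(table(df$Tag, df$Valoration))

    # option 2: one multiple correspondence analysis of all three variables at once
    mca_all <- MCA(df[, c("Category", "Tag", "Valoration")])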
