I'm trying out a machine learning task (binary classification) using caret and was wondering if there is a way to incorporate information about an "uncertain" class, or to weight the classes differently.
As an illustration, I've cut and pasted some of the code from the caret homepage working with the Sonar dataset (placeholder code - it could be anything):
library(mlbench)
data(Sonar)
testdat <- Sonar
set.seed(946)
testdat$Source <- as.factor(sample(c(LETTERS[1:6], LETTERS[1:3]), nrow(testdat), replace = TRUE))
yielding:
summary(testdat$Source)
A B C D E F
49 51 44 17 28 19
after which I would continue with a typical train, tune, and test routine once I decide on a model.
What I've added here is another factor column giving the source, i.e. where the corresponding "Class" came from. As an arbitrary example, say these were 6 different people who made their designation of "Class" using slightly different methods, and I want to put greater importance on A's classification method than on B's, but less than on C's, and so forth.
The actual data look something like this, with class imbalances both in the outcome (true/false, M/R, or whatever the class happens to be) and among these Sources. From the vignettes and examples I have found, the former I would address by using a metric like ROC during tuning, but I'm not sure how to even incorporate the latter.
The two (possibly far-fetched) ideas I had were:
1) separating the original data by Source and cycling through the factor levels one at a time, using the current level to build a model and the rest of the data to test it
2) instead of classification, turning it into a hybrid classification/regression problem, where I use the ranks of the sources as what I want to model. If A is considered best, then an "A positive" would get a score of +6, an "A negative" a score of -6, and so on. Then perform a regression fit on these values, ignoring the Class column (a rough sketch of this follows below).
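A rough sketch of what idea 2 would look like on the testdat example above (the ranking is purely hypothetical, with A treated as the best source and "M" arbitrarily taken as the positive class):
src_rank <- c(A = 6, B = 5, C = 4, D = 3, E = 2, F = 1)   # hypothetical ranks
testdat$rank_score <- ifelse(testdat$Class == "M", 1, -1) * src_rank[as.character(testdat$Source)]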
Any thoughts? Every search I conduct on classes and weights seems to reference the class-imbalance issue, but assumes that the classification itself is perfect (or a gold standard to model against). Is it inappropriate to even try to incorporate that information, so that I should just include everything and ignore the source? A potential issue with the first plan is that the smaller sources account for only around a few hundred instances, versus over 10,000 for the larger sources, so I might also be concerned that a model built on a smaller set wouldn't generalize as well as one based on more data. Any thoughts would be appreciated.
There is no difference between weighting "because of importance" and weighting "because of imbalance". These are exactly the same setting; both answer the question "how strongly should I penalize the model for misclassifying a sample from a particular class?". Thus you do not need any regression (and should not use one! This is a perfectly well-stated classification problem, and you are simply overthinking it); just provide sample weights, that's all. Many models in caret accept this kind of setting, including glmnet, glm, cforest, etc. If you want to use an SVM you should change packages (as ksvm does not support such things), for example to https://cran.r-project.org/web/packages/gmum.r/gmum.r.pdf (for sample or class weighting) or https://cran.r-project.org/web/packages/e1071/e1071.pdf (for class weighting).
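For example, here is a minimal sketch of passing per-observation weights through caret's train, continuing from the testdat example above (the Source-to-weight mapping is entirely hypothetical, and glm is just one of the weight-accepting models):
library(caret)
src_wt <- c(A = 5, B = 3, C = 6, D = 2, E = 4, F = 1)   # hypothetical importance per Source
w <- src_wt[as.character(testdat$Source)]
fit <- train(Class ~ . - Source, data = testdat,
             method = "glm",      # accepts case weights; glmnet, gbm, cforest do too
             weights = w,
             trControl = trainControl(method = "cv", number = 5))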
I have a dataset in which individuals, each belonging to a particular group, repeatedly chose between multiple discrete outcomes.
subID group choice
1 Big A
1 Big B
2 Small B
2 Small B
2 Small C
3 Big A
3 Big B
. . .
. . .
I want to test how group membership influences choice, and want to account for non-independence of observations due to repeated choices being made by the same individuals. In turn, I planned to implement a mixed multinomial regression treating group as a fixed effect and subID as a random effect. It seems that there are a few options for multinomial logits in R, and I'm hoping for some guidance on which may be most easily implemented for this mixed model:
1) multinom - The nnet package provides the multinom function. This appears to be a nice, clear, straightforward option... for fixed-effects models (a quick baseline sketch follows after this list). However, is there a manner to implement random effects with multinom? A previous CV post suggests that multinom is able to handle a mixed-effects GLM with a Poisson distribution and a log link. However, I don't understand (a) why this is the case or (b) the required syntax. Can anyone clarify?
2) mlogit - A fantastic package, with incredibly helpful vignettes. However, the "mixed logit" documentation refers to models that have random effects related to alternative-specific covariates (implemented via the rpar argument). My model has no alternative-specific variables; I simply want to account for random intercepts across participants (see the reshaping sketch after this list). Is this possible with mlogit? Is that variance automatically accounted for by setting subID as the id.var when shaping the data to long form with mlogit.data? EDIT: I just found an example of "tricking" mlogit into providing random coefficients for variables that vary across individuals (very bottom here), but I don't quite understand the syntax involved.
3) MCMCglmm is evidently another option. However, as a relative novice with R and someone completely unfamiliar with Bayesian stats, I'm not personally comfortable parsing example syntax of mixed logits with this package, or, even following the syntax, making guesses at priors or other needed arguments.
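For concreteness, here is a sketch of the fixed-effects-only baseline with multinom, plus my attempt at the mlogit reshaping ('dat' is the data frame sketched above; neither version includes the random intercept for subID yet):
library(nnet)
dat$choice <- factor(dat$choice)
fit_fixed <- multinom(choice ~ group, data = dat)   # fixed effects only
summary(fit_fixed)

library(mlogit)
# reshape to mlogit's format; there are no alternative-specific variables here
m <- mlogit.data(dat, choice = "choice", shape = "wide", id.var = "subID")
# alternative-specific intercepts plus individual-specific 'group' effects
fit_mnl <- mlogit(choice ~ 1 | group, data = m)
summary(fit_mnl)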
Any guidance toward the most straightforward approach and its syntax implementation would be thoroughly appreciated. I'm also wondering if the random effect of subID needs to be nested within group (as individuals are members of groups), but that may be a question for CV instead. In any case, many thanks for any insights.
I would recommend the Apollo package by Hess & Palma. It comes with great documentation and a quite helpful user group.
I'm having a few issues running a simple decision tree within R using rpart.
I can't post my actual data for an example because of confidentiality, but here's the structure. I've blanked out a load of bits just because I've got my tin foil hat on today.
I've run the most basic model to predict MIX based on MIX_BEFORE and LIFESTAGE and I don't get a tree out of the end of it. I've tried using rpart.control and specifying minsplit, but it makes no difference.
Even when I add in a few more variables I still don't get a tree:
Yet... the second I remove the factor variables and attempt to build a tree using an integer, it works fine:
Any ideas at all?
Your data has a fairly strong class imbalance: 99% one class, 1% the other. So rpart can get 99% accuracy just by saying that everything is the majority class (which is what it is doing). Most variables will not be able to discriminate better than that, which is why you get trees with no branches, as you did with the factor variables. Your _NBR variable happens to be more predictive for the small number of points with _NBR >= 7, but even your model that uses _NBR predicts that almost all points are the majority class. You may be able to get some help from this Cross Validated post on how to deal with class imbalance.
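For example, one common remedy is to make errors on the rare class costlier via rpart's loss matrix. A rough sketch using the variable names from your question ('mydata' is a placeholder, and it assumes the rare class is the second factor level):
library(rpart)
# loss matrix: rows = true class, columns = predicted class, zeros on the diagonal;
# here misclassifying a true rare-class case costs 10x more than the reverse
fit <- rpart(MIX ~ MIX_BEFORE + LIFESTAGE, data = mydata, method = "class",
             parms = list(loss = matrix(c(0, 1,
                                          10, 0), nrow = 2, byrow = TRUE)))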
I've built a toy Random Forest model in R (using the German Credit dataset from the caret package), exported it in PMML 4.0 and deployed onto Hadoop, using the Cascading Pattern library.
I've run into an issue where Cascading Pattern scores the same data differently (in a binary classification problem) than the same model in R. Out of 200 observations, 2 are scored differently.
Why is this? Could it be due to a difference in the implementation of Random Forests?
The German Credit dataset represents a classification-type problem. The winning score of a classification-type RF model is simply the class label that was the most frequent among member decision trees.
Suppose you have an RF model with 100 decision trees, where 50 decision trees predict "good credit" and the other 50 decision trees predict "bad credit". It is possible that R and Cascading Pattern resolve such tie situations differently - one picks the score that is seen first and the other picks the score that is seen last. You could try re-training your RF model with an odd number of member decision trees (i.e. use a value that is not divisible by two, such as 99 or 101).
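For instance, a minimal sketch of re-training with an odd tree count, using the German Credit data from caret as in the question (the seed and tree count are arbitrary):
library(caret)          # provides the GermanCredit data
library(randomForest)
data(GermanCredit)
set.seed(42)
rf_model <- randomForest(Class ~ ., data = GermanCredit, ntree = 501)   # odd ntree: no 50/50 ties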
The PMML specification says to return the score that was seen first. I'm not sure if Cascading Pattern pays attention to such details. You may want to try out an alternative solution called JPMML-Cascading.
Score matching is a big deal. When a model is moved from the scientist's desktop to the production IT deployment environment, the scores need to match. For a classification task, that also includes the probabilities of all target categories. There is sometimes a precision problem between different implementations/platforms which can result in minimal differences (really minimal). In any case, these also need to be checked.
Obviously, it could also be the case that the model was not represented correctly in PMML ... unlikely with the R pmml package. The other option is that the model is not deployed correctly, that is, the scoring engine Cascading is using is not interpreting the PMML file properly.
PMML itself has a model element called ModelVerification that allows a PMML file to contain scored data which can then be used for score matching. This is useful but not strictly necessary, since you should be able to score an already scored dataset and compare computed with expected results, which you are already doing.
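As a rough sanity-check sketch (all names here - rf_model, credit_test, cascading_scores.csv and its predicted column - are placeholders):
# score the same rows locally, read back the engine's exported scores, count disagreements
r_pred <- predict(rf_model, newdata = credit_test, type = "response")
engine_pred <- read.csv("cascading_scores.csv")$predicted
mismatches <- which(as.character(r_pred) != as.character(engine_pred))
length(mismatches)   # e.g. the 2 differing observations mentioned in the question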
For more on model verification and score matching as well as error handling in PMML, check:
https://support.zementis.com/entries/21207918-Verifying-your-model-in-ADAPA-did-it-upload-correctly-
I have just done oversampling on my dataset using SMOTE, included in the DMwR package.
My dataset is formed by two classes. The original distribution is 12 vs 62. So, I have coded this oversampling:
library(DMwR)
newData <- SMOTE(Score ~ ., data, k = 3, perc.over = 400, perc.under = 150)
Now, the distribution is 60 vs 72.
However, when I display the 'newData' dataset I discover that SMOTE's oversampling has produced some repeated samples.
For example, the sample number 24 appears as 24.1, 24.2 and 24.3.
Is this correct? This directly affects classification, because the classifier will learn a model from data that will also be present in the test set, which is not legitimate.
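A quick way to quantify the duplication (assuming 'Score' is the class column, as in the call above):
feat_cols <- setdiff(names(newData), "Score")
sum(duplicated(newData[, feat_cols]))   # number of exactly repeated feature rows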
Edit:
I don't think I explained my issue correctly:
As you know, SMOTE is a technique for oversampling. It creates new samples from the original ones by modifying the feature values. However, when I display my new data generated by SMOTE, I obtain this:
(these are the feature values)
Sample 50:   1.8787547 0.19847987 -0.0105946940 4.420207 4.660536 1.0936388 0.5312777 0.07171645 0.008043167
Sample 50.1: 1.8787547 0.19847987 -0.0105946940 4.420207 4.660536 1.0936388 0.5312777 0.07171645
Sample 50 belongs to the original dataset. Sample 50.1 is the 'artificial' sample generated by SMOTE. However (and this is my issue), SMOTE has created a repeated sample instead of creating an artificial one by modifying the feature values 'a bit'.
I hope you can understand me.
Thanks!
SMOTE is an algorithm that generates synthetic examples of a given class (the minority class) to handle imbalanced distributions. This strategy for generating new data is then combined with random under-sampling of the majority class. When you use SMOTE in the DMwR package you need to specify an over-sampling percentage and also an under-sampling percentage. These values must be set carefully because the obtained distribution of the data may otherwise remain imbalanced.
In your case, given the parameters set, namely the percentages of under- and over-sampling, SMOTE will introduce replicas of some of your examples.
Your initial class distribution is 12 to 62 and after applying SMOTE you end up with 60 to 72. This means that the minority class was oversampled with SMOTE and new synthetic examples of this class were produced.
However, your majority class, which had 62 examples, now contains 72! The under-sampling percentage was applied to this class but it actually increased the number of examples. Since the number of examples to select from the majority class is determined based on the examples of the minority class, the number of examples sampled from this class was larger than the number already existing.
Therefore, you had 62 examples and the algorithm tried to randomly select 72! This means that some replicas of the examples of the majority class were introduced.
So, to explain the over-sampling and under-sampling you selected:
12 examples from the minority class with 400% oversampling gives: 12*400/100 = 48. So 48 new synthetic examples were added to the minority class (12+48 = 60, the final number of examples for the minority class).
The number of examples to select from the majority class are: 48*150/100=72.
But the majority class only has 62, so replicas are necessarily introduced.
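Spelled out as a small calculation:
n_min      <- 12
perc.over  <- 400
perc.under <- 150
n_synth    <- n_min * perc.over / 100       # 48 synthetic minority examples
n_min + n_synth                             # 60 minority examples in newData
n_synth * perc.under / 100                  # 72 majority examples drawn from only 62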
I'm not sure about the implementation of SMOTE in DMwR, but it should be safe for you to round the new data to the nearest integer value. One guess is that this is left for you to do on the off chance that you want to do regression instead of classification. Otherwise, if you wanted regression and SMOTE returned integers, you would have unintentionally lost information by going in the opposite direction (SMOTE -> integers -> reals).
If you are not familiar with what SMOTE does: it creates 'new data' by looking at nearest neighbors to establish a neighborhood and then sampling from within that neighborhood. It is usually used when there is insufficient data for a given class in a classification problem. It operates on the assumption that data near your data is similar because of proximity.
Alternately you can use Weka's implementation of SMOTE which does not make you do this additional work.
SMOTE is a very simple algorithm for generating synthetic samples. However, before you go ahead and start to use it, you have to understand your features. For example, should each of your features vary to the same extent, etc.
Simply put: before you do, try to understand your data!
I'm trying to do some machine learning stuff that involves a lot of factor-type variables (words, descriptions, times, basically non-numeric stuff). I usually rely on randomForest but it doesn't work w/factors that have >32 levels.
Can anyone suggest some good alternatives?
Tree methods won't work, because the number of possible splits increases exponentially with the number of levels. However, with words this is typically addressed by creating indicator variables for each word (of the description etc.) - that way splits can use a word at a time (yes/no) instead of picking all possible combinations. In general you can always expand levels into indicators (and some models do that implicitly, such as glm). The same is true in ML for handling text with other methods such as SVM etc. So the answer may be that you need to think about your input data structure, not as much the methods. Alternatively, if you have some kind of order on the levels, you can linearize it (so there are only c-1 splits).
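For example, a minimal sketch of expanding a factor into indicator columns with base R ('d' and 'word' are placeholder names):
ind <- model.matrix(~ word - 1, data = d)            # one 0/1 column per level
d2  <- data.frame(d[setdiff(names(d), "word")], ind) # replace the factor with its indicators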
In general, the best package I've found for situations with lots of factor levels is gbm.
It can handle up to 1024 factor levels.
If there are more than 1024 levels I usually change the data by keeping the 1023 most frequently occurring factor levels and then coding the remaining levels as a single 'other' level.
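A rough sketch of that recoding ('d' and 'f' are placeholder names):
top <- names(sort(table(d$f), decreasing = TRUE))[1:1023]                      # 1023 most frequent levels
d$f <- factor(ifelse(as.character(d$f) %in% top, as.character(d$f), "other"))  # lump the rest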
There is nothing wrong in theory with using randomForest's method on categorical variables that have more than 32 levels - it's computationally expensive, but not impossible to handle any number of levels with the randomForest methodology. The standard R randomForest package simply sets 32 as the maximum number of levels for a given categorical variable and thus prohibits the user from running randomForest on anything with more than 32 levels in any categorical variable.
Linearizing the variable is a very good suggestion - I've used the method of ranking the levels, then breaking them up evenly into 32 meta-classes. So if there are actually 64 distinct levels, meta-class 1 consists of all things in levels 1 and 2, etc. The only problem here is figuring out a sensible way of doing the ranking - and if you're working with, say, words, it's very difficult to know how each word should be ranked against every other word.
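A rough sketch of the meta-class collapsing ('d' and 'f' are placeholders; ranking by frequency is only one illustrative choice of ordering):
lvls <- names(sort(table(d$f), decreasing = TRUE))   # levels in rank order
size <- ceiling(length(lvls) / 32)                   # levels per meta-class
meta <- setNames(ceiling(seq_along(lvls) / size), lvls)
d$f32 <- factor(meta[as.character(d$f)])             # at most 32 meta-classes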
A way around the ranking problem is to make n different prediction sets, where each set contains all instances with any particular subset of 31 of the levels of each variable with more than 32 levels. You can make a prediction using all sets, then, using the variable importance measures that come with the package, find the implementation in which the levels used were most predictive. Once you've uncovered the 31 most predictive levels, implement a new version of RF using all the data that designates these most predictive levels as 1 through 31 and lumps everything else into an 'other' level, giving you the maximum of 32 levels for the categorical variable while hopefully preserving much of the predictive power.
Good luck!