Differences in scoring from PMML model on different platforms - r

I've built a toy Random Forest model in R (using the German Credit dataset from the caret package), exported it as PMML 4.0, and deployed it on Hadoop using the Cascading Pattern library.
I've run into an issue where Cascading Pattern scores the same data differently (in a binary classification problem) than the same model in R. Out of 200 observations, 2 are scored differently.
Why is this? Could it be due to a difference in the implementation of Random Forests?

The German Credit dataset represents a classification-type problem. The winning score of a classification-type RF model is simply the class label that was most frequent among the member decision trees.
Suppose you have an RF model with 100 decision trees, where 50 trees predict "good credit" and the other 50 predict "bad credit". It is possible that R and Cascading Pattern resolve such ties differently: one picks the score that is seen first and the other picks the score that is seen last. You could try re-training your RF model with an odd number of member decision trees (i.e., a value that is not divisible by two, such as 99 or 101).
The PMML specification says to return the score that was seen first. I'm not sure whether Cascading Pattern pays attention to such details. You may want to try an alternative solution called JPMML-Cascading.
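To make the tie situation concrete, here is a minimal base R sketch of majority voting with two tie-breaking rules. The function name majority_vote and its tie argument are made up for illustration; this is not how R's randomForest or Cascading Pattern is actually implemented.

```r
# Majority vote over per-tree predictions with an explicit tie-breaking rule.
# "first": among tied labels, return the one that was seen first in `votes`.
# "last":  among tied labels, return the one that was seen last.
majority_vote <- function(votes, tie = c("first", "last")) {
  tie <- match.arg(tie)
  labels <- unique(votes)  # labels in order of first appearance
  counts <- tabulate(match(votes, labels), nbins = length(labels))
  winners <- labels[counts == max(counts)]
  if (tie == "first") winners[1] else winners[length(winners)]
}

# 100 trees, split 50/50: the two rules disagree
votes_tied <- rep(c("good", "bad"), each = 50)
tied_first <- majority_vote(votes_tied, "first")
tied_last  <- majority_vote(votes_tied, "last")

# 101 trees: no tie is possible in a two-class problem, so the rules agree
votes_odd <- c(rep("good", 50), rep("bad", 51))
odd_first <- majority_vote(votes_odd, "first")
odd_last  <- majority_vote(votes_odd, "last")
```

With an even split the result depends entirely on the rule, which is the kind of discrepancy that could explain a handful of mismatched observations.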

Score matching is a big deal. When a model is moved from the scientist's desktop to the production IT deployment environment, the scores need to match. For a classification task, that also includes the probabilities of all target categories. Precision differences between implementations or platforms can sometimes produce very small discrepancies. In any case, these also need to be checked.
Of course, it could also be that the model was not represented correctly in PMML, although that is unlikely with the R PMML package. The other option is that the model is not deployed correctly, that is, the scoring engine Cascading is using is not interpreting the PMML file properly.
PMML itself has a model element called ModelVerification that allows a PMML file to contain scored data, which can then be used for score matching. This is useful but not necessary, since you should be able to score an already scored dataset and compare the computed results with the expected ones, which you are already doing.
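As a rough sketch of such a score-matching check in base R: the column names label and prob_good, the function name compare_scores, and the tolerance are assumptions chosen for illustration, not part of PMML or any scoring engine.

```r
# Compare expected scores (e.g. from R) with computed scores (e.g. from the
# deployed engine). Labels must match exactly; probabilities may differ by
# at most `tol` to allow for floating-point precision differences.
compare_scores <- function(expected, computed, tol = 1e-6) {
  stopifnot(nrow(expected) == nrow(computed))
  label_mismatch <- which(as.character(expected$label) !=
                            as.character(computed$label))
  prob_mismatch <- which(abs(expected$prob_good - computed$prob_good) > tol)
  list(label_mismatch = label_mismatch, prob_mismatch = prob_mismatch)
}

# Toy data: row 2 differs only by a tiny precision error, row 3 has a
# different label despite an identical (tied) probability.
expected <- data.frame(label = c("good", "bad", "good"),
                       prob_good = c(0.81, 0.35, 0.50))
computed <- data.frame(label = c("good", "bad", "bad"),
                       prob_good = c(0.81, 0.35 + 1e-9, 0.50))

res <- compare_scores(expected, computed)
```

Rows reported in label_mismatch but not in prob_mismatch point toward a tie-breaking or thresholding difference rather than a different model.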
For more on model verification and score matching as well as error handling in PMML, check:


Merging Tree Models from two random forest models into one random forest model at H2O in R

I am relatively new to the machine learning ocean, please excuse me if some of my questions are really basic.
Current situation: The overall goal is to improve some code that uses the h2o package in R on a supercomputer cluster. The data is so large that training on a single node with h2o takes more than a day, so we decided to use multiple nodes to build the model. I came up with this idea:
(1) Have each node build (nTree/num_node) trees and save them as a model;
(2) Run on the cluster, with each node growing its (nTree/num_node) trees of the forest;
(3) Merge the trees back together to re-form the original forest, and average the measurement results.
I later realized this could be risky, but I cannot find a clear statement for or against it, since I am not a machine-learning-focused programmer.
If handling a random forest this way carries some risk, please point me to a reference so I can get a basic idea of why it is not right.
If this is actually an "ok" way to do it, what should I do to merge the trees? Is there a package or method I can borrow from?
If this is already a solved problem, please point me to a link; I may have searched with the wrong keywords. Thank you!
A concrete example with real numbers:
I have a random forest task with 80k rows and 2k columns, and I want 64 trees. What I have done is run 16 trees on each node using the whole dataset, so each of the four nodes produces an RF model. I am now trying to merge the trees from each model into one big RF model and average the measurements from those four models.
There is no need to merge the models. Unlike with boosting methods, every tree in a Random Forest is grown independently (just don't set the same seed prior to kicking off RF on each node!).
You are basically doing what Random Forest does on its own, which is to grow X independent trees and then average across the votes. Many packages provide an option to specify the number of cores or threads, in order to take advantage of this feature of RF.
In your case, since you have the same number of trees per node, you'll get 4 "models" back, but those are really just collections of 16 trees. To use it, I'd just keep the 4 models separate and when you want a prediction, average the prediction from each of the 4 models. Assuming you're going to be doing that more than once, you could write a small wrapper function to predict with the 4 models and average the output.
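A minimal sketch of such a wrapper in base R follows. The name predict_ensemble and its arguments are hypothetical, and the lm fits are only stand-ins for the four per-node forests; with H2O you would pass your own predict_fun that calls h2o.predict and converts the result to a numeric vector.

```r
# Average predictions from several independently trained models.
# n_trees lets models with different tree counts be weighted accordingly;
# with 16 trees per node, all weights are equal.
predict_ensemble <- function(models, newdata, predict_fun = predict,
                             n_trees = rep(1, length(models))) {
  preds <- lapply(models, function(m) predict_fun(m, newdata))
  weights <- n_trees / sum(n_trees)
  Reduce(`+`, Map(`*`, preds, weights))
}

# Stand-in "node models": four linear models fitted on bootstrap samples
set.seed(123)
models <- lapply(1:4, function(i) {
  idx <- sample(nrow(mtcars), replace = TRUE)
  lm(mpg ~ wt + hp, data = mtcars[idx, ])
})

avg_pred <- predict_ensemble(models, mtcars, n_trees = rep(16, 4))

# Reference: plain mean of the four prediction vectors
manual <- rowMeans(sapply(models, predict, newdata = mtcars))
```

For classification you would average the predicted class probabilities in the same way and then pick the class with the highest average.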
10,000 rows by 1,000 columns is not overly large and should not take that long to train an RF model.
It sounds like something unexpected is happening.
While you can try to average models if you know what you are doing, I don't think it should be necessary in this case.

r rpart only working for integers and not factors? getting a tree with no depth

I'm having a few issues running a simple decision tree within R using rpart.
I can't post my actual data for an example because of confidentiality, but here's the structure. I've blanked out a load of bits just because I've got my tin foil hat on today.
I've run the most basic model to predict MIX based on MIX_BEFORE and LIFESTAGE and I don't get a tree out of the end of it. I've tried using rpart.control and specifying the minsplit, it makes no difference.
Even when I add in a few more variables I still don't get a tree:
Yet... the second I remove the factor variables and attempt to build a tree using an integer, it works fine:
Any ideas at all?
Your data has a fairly strong class imbalance: 99% one class, 1% the other. So rpart can get 99% accuracy just by saying that everything is the majority class (which is what it is doing). Most variables will not be able to discriminate better than that, so you get trees with no branches, as you did with the factor variables. Your _NBR variable happens to be more predictive for the small number of points with _NBR >= 7. But even your model that uses _NBR predicts that almost all points are the majority class. You may be able to get some help from this Cross Validated post on how to deal with class imbalance.
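To see what the tree has to beat, here is a small base R sketch of the "always predict the majority class" baseline. The data are simulated with an assumed 99/1 split, since the original data were not posted.

```r
# Simulated outcome with roughly 99% / 1% class imbalance (assumed numbers)
set.seed(42)
y <- factor(sample(c("majority", "minority"), 10000, replace = TRUE,
                   prob = c(0.99, 0.01)))

# Share of each class, and the accuracy of a model with no splits at all
class_share <- prop.table(table(y))
majority_label <- names(which.max(class_share))
baseline_accuracy <- mean(y == majority_label)
```

A split is only worth making if it improves on this baseline, which is why most predictors produce a tree with a single root node.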

Weighting class in machine learning task

I'm trying out a machine learning task (binary classification) using caret and was wondering if there is a way to incorporate information about "uncertain" class, or to weight the classes differently.
As an illustration, I've cut and paste some of the code from the caret homepage working with the Sonar dataset (placeholder code - could be anything):
library(mlbench)  # provides the Sonar dataset
testdat <- get(data(Sonar))
testdat$Source <- as.factor(sample(c(LETTERS[1:6], LETTERS[1:3]), nrow(testdat), replace = TRUE))
# resulting counts for Source levels A to F:
# 49 51 44 17 28 19
after which I would continue with a typical train, tune, and test routine once I decide on a model.
What I've added here is another factor column of a source, or where the corresponding "Class" came from. As an arbitrary example, say these were 6 different people who made their designation of "Class" using slightly different methods and I want to put greater importance on A's classification method than B's but less than C's and so forth.
The actual data are something like this, where there are class imbalances, both among the true/false, M/R, or whatever class, and among these Sources. From the vignettes and examples I have found, at least the former I would address by using a metric like ROC during tuning, but as to how to even incorporate the latter, I'm not sure.
Two approaches I have considered:
1. Separating the original data by Source and cycling through the factor levels one at a time, using the current level to build a model and the rest of the data to test it.
2. Instead of classification, turning it into a hybrid classification/regression problem, where I use the ranks of the sources as what I want to model. If A is considered best, then an "A positive" would get a score of +6, an "A negative" a score of -6, and so on. Then perform a regression fit on these values, ignoring the Class column.
Any thoughts? Every search I conduct on classes and weights seems to reference the class imbalance issue, but assumes that the classification itself is perfect (or a standard on which to model). Is it even inappropriate to try to incorporate that information and I should just include everything and ignore the source? A potential issue with the first plan is that the smaller sources account for around a few hundred instances, versus over 10,000 for the larger sources, so I might also be concerned that a model built on a smaller set wouldn't generalize as well as one based on more data. Any thoughts would be appreciated.
There is no difference between weighting "because of importance" and weighting "because of imbalance". These are exactly the same setting: both answer the question "how strongly should I penalize the model for misclassifying a sample from a particular class?". So you do not need any regression (and should not use one; this is a perfectly well-stated classification problem, and you are simply overthinking it). You just need to provide sample weights, that's all. Many models in caret accept this kind of setting, including glmnet, glm, cforest, etc. If you want to use an SVM, you should change packages (as ksvm does not support such things), for example to https://cran.r-project.org/web/packages/gmum.r/gmum.r.pdf (for sample or class weighting) or https://cran.r-project.org/web/packages/e1071/e1071.pdf (for class weighting).
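As a minimal sketch of per-sample weights derived from the source, here is a base R example using glm and its weights argument. The data are simulated and the trust values in source_weight are made up for illustration; in practice you would choose them to reflect how much you trust each source.

```r
# Simulated data: one predictor, a binary class, and a labelling source
set.seed(1)
n <- 200
dat <- data.frame(
  x = rnorm(n),
  Source = factor(sample(LETTERS[1:3], n, replace = TRUE))
)
dat$Class <- factor(ifelse(dat$x + rnorm(n) > 0, "M", "R"))

# Hypothetical trust level per source (higher = more trusted)
source_weight <- c(A = 2, B = 1, C = 3)

# One weight per row, looked up from that row's source
w <- unname(source_weight[as.character(dat$Source)])

# Weighted logistic regression: rows from trusted sources count more.
# Source is used only for weighting, not as a predictor.
fit <- glm(Class ~ x, data = dat, family = binomial, weights = w)
```

The same per-row weight vector can be combined with class-imbalance weights by multiplying the two before fitting.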

How to use the `vcconv` command in lme4 for serial correlation?

I'm working with a large longitudinal dataset of firm-year observations. For some time now I have been using lme4 to implement crossed (non-nested) effects for year and firm-ID groups.
My goal is now to correct for the serial correlation in the firm-group dimension. Based on chl's and fabians' answers to this question (as well as Ben Bolker's comment on the latter), I've assumed this is impossible with lmer(), but is feasible with nlme::lme().
I have been able to implement crossed effects in nlme based on the discussion in Pinheiro & Bates (2000, sec. 4.2.2, pp. 163-6). In principle then, I believe I can use the correlation = corAR1() specification in lme() to control for autocorrelation.
My strong preference, however, would be to implement such a correlation specification in lmer() because:
lme4 is much, much (much) faster
nlme requires crossed effects to be nested in some higher group -- without such a higher level grouping I'm forced to create an arbitrary dummy for groupedData to which all observations belong (e.g., here). This creates issues interpreting the relative levels of variation between the two crossed groups and the residual variance because some of the variation appears to be captured by the higher-level dummy group.
I got excited when I found the feature request #224 on GitHub, but alas it doesn't seem like there's much movement on the flexLambda front (please let me know if I'm wrong!).
lme4 v1.1-10
I've just noticed that the latest (Oct. 2015) version of lme4 contains a vcconv command that can
Convert between representation of (co-)variance structures (EXPERIMENTAL.)
Based on the source code, it seems that maybe the sdcor2cov option could allow one to specify a correlation structure such as AR(1).
So my questions are:
Is this interpretation of the vcconv function correct?
If so, does the user supply the correlation (e.g., AR(1)) parameters or are they determined internally in lmer()?
How does one implement this function properly?

Random Forest optimization with tuning and cross-validation

I'm working with a large data set, so hope to remove extraneous variables and tune for an optimal m variables per branch. In R, there are two methods, rfcv and tuneRF, that help with these two tasks. I'm attempting to combine them to optimize parameters.
rfcv works roughly as follows:
create random forest and extract each variable's importance;
while (nvar > 1) {
    remove the k (or k%) least important variables;
    run random forest with remaining variables, reporting cverror and predictions;
}
Presently, I've recoded rfcv to work as follows:
create random forest and extract each variable's importance;
while (nvar > 1) {
    remove the k (or k%) least important variables;
    tune for the best m for reduced variable set;
    run random forest with remaining variables, reporting cverror and predictions;
}
This, of course, increases the run time by an order of magnitude. My question is how necessary this is (it's been hard to get an idea using toy datasets), and whether any other way could be expected to work roughly as well in far less time.
As always, the answer is it depends on the data. On one hand, if there aren't any irrelevant features, then you can just totally skip feature elimination. The tree building process in the random forest implementation already tries to select predictive features, which gives you some protection against irrelevant ones.
Leo Breiman gave a talk where he introduced 1000 irrelevant features into some medical prediction task that had only a handful of real features from the input domain. When he eliminated 90% of the features using a single filter on variable importance, the next iteration of random forest didn't pick any irrelevant features as predictors in its trees.
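A single-pass filter of this kind can be sketched in base R as below. This is not Breiman's experiment: the data are simulated, and absolute correlation with the outcome is used as a stand-in importance score so the example runs without extra packages. With the randomForest package you would take the scores from importance() on a fitted forest instead.

```r
# Simulated data: 5 informative features and 95 pure-noise features
set.seed(7)
n <- 300
real  <- matrix(rnorm(n * 5),  n, 5,  dimnames = list(NULL, paste0("real", 1:5)))
noise <- matrix(rnorm(n * 95), n, 95, dimnames = list(NULL, paste0("noise", 1:95)))
X <- cbind(real, noise)
y <- as.vector(real %*% rep(1, 5)) + rnorm(n, sd = 0.5)

# Stand-in importance score (assumption): absolute correlation with y
importance_score <- abs(cor(X, y))[, 1]

# Single filter: keep the top 10% of features, drop the other 90%
keep_n <- ceiling(0.10 * ncol(X))
kept <- names(sort(importance_score, decreasing = TRUE))[seq_len(keep_n)]
X_reduced <- X[, kept, drop = FALSE]
```

One filtering pass followed by one tuning run on the reduced set costs far less than re-tuning inside every elimination step, and may be a reasonable first thing to try.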
