How does h2o.randomForest handle missing values? - R

After my research on h2o, I have found that h2o.randomForest can handle missing values in variables, unlike the R randomForest package.
See http://h2o.ai/blog/2014/04/sjsu-tutorial-h2o-random-forest/
But, after looking everywhere, I cannot seem to find how exactly missing values are handled by h2o.randomForest. How similar is it to the handling of missing values by the R gbm() package?
Any help regarding the above two questions will be greatly appreciated.
Thanks,

You can refer to the H2O documentation to see how the DRF algorithm handles missing values in various situations:
http://h2o-release.s3.amazonaws.com/h2o/rel-slater/5/docs-website/h2o-docs/index.html#Data%20Science%20Algorithms-DRF-FAQ
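For example, DRF trains directly on a frame that already contains NAs, with no manual imputation beforehand (a rough sketch using the built-in airquality data, whose Ozone and Solar.R columns contain NAs):
library(h2o)
h2o.init()
df  <- as.h2o(airquality)                          # Ozone and Solar.R contain NAs
fit <- h2o.randomForest(x = c("Ozone", "Solar.R", "Wind"), y = "Temp",
                        training_frame = df, ntrees = 50)
h2o.performance(fit)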
As for R's GBM: it builds trees that handle NAs explicitly as a special case, creating a dedicated branch for them, so every split has three possible outcomes: left, right, or NA.
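A quick way to see this in practice (an illustrative sketch with made-up data; pretty.gbm.tree prints the per-split structure, including the node used for missing values):
library(gbm)
set.seed(42)
dat <- data.frame(y = rnorm(200), x1 = rnorm(200), x2 = rnorm(200))
dat$x1[sample(200, 20)] <- NA                      # introduce NAs in a predictor
fit <- gbm(y ~ x1 + x2, data = dat, distribution = "gaussian",
           n.trees = 100, interaction.depth = 2)
pretty.gbm.tree(fit, i.tree = 1)                   # the MissingNode column is the NA branch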
Hope this helps!
Avni

Related

ARTool package in R - multiple within factors

I have recently discovered the ARTool package for R (https://cran.r-project.org/web/packages/ARTool/) when looking for a non-parametric alternative for a repeated measures ANOVA.
I have used ARTool and find it really very useful, but I came across a problem that I am not sure how to deal with. Specifically, Df.res seems to be strongly inflated as soon as I have more than one within factor. I have not come across this with two between factors, a between and a within factor, or two between factors and one within factor, but whenever I add a second within factor, Df.res becomes inflated.
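For concreteness, the kind of model I mean looks roughly like this (a sketch with made-up variable names):
library(ARTool)
# two within-subject factors A and B, one observation per cell per subject
m <- art(response ~ A * B + (1 | subject), data = dat)
anova(m)  # the Df.res column in this table is what looks inflated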
I just wondered whether I am misunderstanding something or maybe there is an explanation that I am not aware of.
Any response would be greatly appreciated.
Many thanks!

Not losing observations when faced with missing data

I have a dataset where I've fitted a linear model and I've tried to use the step function on this linear model. I get an error message saying "number of rows in use has changed: remove missing values?".
I noticed that a few of the observations (not many) in my dataset had NA values for one variable. I've seen similar questions which suggest using na.omit(), but when I do this I lose the observations. I want to keep the observations however, because they contain useful information for the other variables. Is there a way to use step and avoid losing the observations?
You can call the nobs function to check that the number of observations is unchanged; its use.fallback argument controls whether R falls back to guessing that number when the fit does not record it. The R documentation, however, recommends removing the missing values before running step.
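For example (a sketch with a hypothetical model and data frame):
fit <- lm(y ~ x1 + x2, data = dat)              # model fitted with the default na.action
nobs(fit)                                       # rows the fit actually used
nrow(dat)                                       # rows in the original data
sum(complete.cases(dat[, c("y", "x1", "x2")]))  # complete rows for the model variables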
I would discourage you from simply omitting the missing values if they are indeed really missing. You can use multiple imputation via Amelia to impute the data such that you have a full dataset.
See here: https://cran.r-project.org/web/packages/Amelia/Amelia.pdf
I would also recommend reviewing the book "Statistical Analysis with Missing Data" by R.J.A. Little and D.B. Rubin.
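For what it's worth, the core Amelia workflow is roughly this (a sketch; dat and the formula are placeholders, and purely numeric columns are assumed, otherwise see the noms/idvars arguments):
library(Amelia)
imp  <- amelia(dat, m = 5)                       # five imputed datasets
fits <- lapply(imp$imputations, function(d) lm(y ~ x1 + x2, data = d))
lapply(fits, coef)                               # pool/combine these estimates afterwards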

Sparse data clustering for extremely large dataset

I have tried using:
kmeansparse from the sparcl package (lack-of-memory error),
bigkmeans from the biganalytics package (a weird error I couldn't find anything about online: "Error in duplicated.default(centers[[length(centers)]]) : duplicated() applies only to vectors"),
skmeans from the skmeans package (similar results to kmeans),
but I am still not able to get proper clustering for my sparse data. The clusters are not well defined and have overlapping membership for the most part. Am I missing something in terms of handling sparse data?
What kind of pre-processing is suggested for such data? Should missing values be marked as -1 instead of 0 for a clearer distinction? Please feel free to ask for more details if you have any ideas that may help.

Princomp error in R: covariance matrix is not non-negative definite

I have this script which does a simple PCA on a number of variables and at the end attaches the two coordinates and two other columns (presence, NZ_Field) to the output file. I have done this many times before, but now it is giving me this error:
covariance matrix is not non-negative definite
I understand that it means there are negative eigenvalues. I looked at similar posts which suggest using na.omit, but it didn't work.
I have uploaded the "biodata.Rdata" file here:
https://www.dropbox.com/s/1ex2z72lilxe16l/biodata.rdata?dl=0
I am pretty sure it is not because of missing values in the data, because I have used the same data with different "presence" and "NZ_Field" columns.
Any help is highly appreciated.
load("biodata.rdata")
#save data separately
coords=biodata[,1:2]
biovars=biodata[,3:21]
presence=biodata[,22]
NZ_Field=biodata[,23]
#Do PCA
bpc=princomp(biovars ,cor=TRUE)
#re-attach data with auxiliary data..coordinates, presence and NZ location data
PCresults=cbind(coords, bpc$scores[,1:3], presence, NZ_Field)
write.table(PCresults,file= "hlb_pca_all.txt", sep= ",",row.names=FALSE)
This does appear to be an issue with missing data, so there are a few ways to deal with it. One way is to manually do listwise deletion on the data before running the PCA, which in your case would be:
biovars <- biovars[complete.cases(biovars), ]
The other option is to use another package; specifically, psych seems to work well here, and you can use principal(biovars). While the output is a bit different, it does work using pairwise deletion, so it basically comes down to whether you want pairwise or listwise deletion. Thanks!
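For instance, something along these lines should work (a sketch; psych computes the correlations with pairwise deletion when given raw data, and keeping 3 components mirrors the original script):
library(psych)
bpc2 <- principal(biovars, nfactors = 3, rotate = "none")
head(bpc2$scores)  # analogous to bpc$scores[, 1:3]; rows with NAs may need
                   # the missing/impute arguments to get scores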

R: Winsorizing (robust HD) not compatible with NAs?

I want to use the winsorize function provided in the robustHD package, but it does not seem to work with NAs, as can be seen in the example below:
library("robustHD")
## generate data
set.seed(1234)     # for reproducibility
x <- rnorm(10)     # standard normal
x[1] <- x[1] * 10  # introduce outlier
x[11] <- NA        # adding NA
## winsorize data
x
winsorize(x)
I googled the problem but didn't find a solution or even anyone with a similar problem. Is winsorizing perhaps considered a "bad" technique, or how else can you explain this lack of information?
If you only have a vector to winsorize, the winsor2 function defined here can be easily modified by setting na.rm = TRUE for the median and mad functions in the code. That provides the same functionality as winsorize{robustHD} with one difference: winsorize calls robStandardize, which includes some adjustment for very small values. I don't understand what it's doing, so caveat emptor if you forgo it.
If you want to winsorize the individual columns of a matrix (as opposed to the multivariate winsorization using a tolerance ellipse available as another option in winsorize) you should be able to poach the necessary code from winsorize.default and standardize. They do the same thing as winsor2 but in matrix form. Again, you need to add your own na.rm = TRUE settings into the functions as needed.
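As an illustration, a minimal median/MAD winsorization that simply lets NAs pass through might look like this (a sketch only, not the robustHD implementation, and without the robStandardize adjustment mentioned above):
winsor_na <- function(x, const = 2) {
  med <- median(x, na.rm = TRUE)   # robust centre
  s   <- mad(x, na.rm = TRUE)      # robust scale
  pmin(pmax(x, med - const * s), med + const * s)  # clamp the tails; NAs stay NA
}
winsor_na(x)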
Some maybe useful thoughts:
Stack Overflow is a programming board, where programming-related questions are asked and answered. For questions about whether certain statistical procedures are appropriate or considered "bad", you are more likely to find knowledgeable people on Cross Validated.
A statistical method and the implementation of that method in a certain software environment are often rather independent. That is to say, if the developer of a package has not included certain features (e.g. NA handling) in the package, this does not necessarily mean much for the method per se. Having said that, of course it can. The only way to be sure whether the omission of a feature was intentional is to ask the developer of the package. If the question is more geared towards statistics and the validity of the method in the presence of missing values, Cross Validated is likely to be more helpful.
I don't know why you can't find any information on this topic. I can confidently say though that this is the very first time I have heard the term "winsorized". I actually had to look it up, and I can surely say that I have never encountered this approach, and I would personally never use it.
A simple solution to your problem from a computational point of view would be to omit all incomplete cases before you start working with the function. It also makes intuitive sense that cases with missing values cannot be easily winsorized. First, the computation of the mean and standard deviation would have to be done on the complete cases anyway, and then it is unclear which value to assign to those with missing values since they may not necessarily be outliers, even though they could be.
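For example, one way to do that while keeping the original vector length (a sketch, reusing x from the question):
ok <- !is.na(x)
x[ok] <- winsorize(x[ok])  # winsorize only the observed values; NAs stay in place
x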
If omitting incomplete cases is not an option for you, you may want to look for imputation methods (on CV).
