R h2o server CURL error, kind of repeatable

At first I thought it was a random issue, but it happens again when I re-run the script.
Error in .h2o.doSafeREST(h2oRestApiVersion = h2oRestApiVersion, urlSuffix = urlSuffix, :
Unexpected CURL error: Recv failure: Connection reset by peer
I'm doing a grid search on a medium-sized dataset (roughly 40000 x 30) with a Gradient Boosting Machine model. The largest ntrees value in the grid is 1000. The error usually happens after running for a couple of hours. I set max_mem_size to 30GB.
for (k in 1:nrow(par.grid)) {
  hg = h2o.gbm(training_frame = Xtr.hf,
               validation_frame = Xt.hf,
               distribution = "huber",
               huber_alpha = HuberAlpha,
               x = 2:ncol(Xtr.hf),
               y = 1,
               ntrees = par.grid[k, "ntree"],
               max_depth = depth,
               learn_rate = par.grid[k, "shrink"],
               min_rows = par.grid[k, "min_leaf"],
               sample_rate = samp_rate,
               col_sample_rate = c_samp_rate,
               nfolds = 5,
               model_id = p(iname, "_gbm_CV")
  )
  cv_result[k, 1] = h2o.mse(hg, train = TRUE)
  cv_result[k, 2] = h2o.mse(hg, valid = TRUE)
}

Try adding gc() in your innermost loop. Even better would be to explicitly use h2o.rm().
So, it would become something like:
for (k in 1:nrow(par.grid)) {
  hg = h2o.gbm(...stuff...,
               model_id = p(iname, "_gbm_CV")
  )
  cv_result[k, 1] = h2o.mse(hg, train = TRUE)
  cv_result[k, 2] = h2o.mse(hg, valid = TRUE)
  h2o.rm(hg); rm(hg); gc()   # remove the model from H2O, drop the R reference, then collect
}
Theoretically this shouldn't matter, but if R holds on to the reference, then H2O will too.
If you think you might want to investigate any models further, and you have plenty of local disk space, you could do h2o.saveModel() before your h2o.mse() calls. (You'll need to specify a filename that somehow summarizes all your parameters, of course...)
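For example, a minimal sketch (the path and the parameter-encoding scheme are illustrative, not from the original code; paste0 stands in for the p() helper above):

# illustrative: encode the grid row's parameters in the model_id, then save before scoring;
# h2o.saveModel() names the saved file after the model_id
mid <- paste0(iname, "_gbm_CV_nt", par.grid[k, "ntree"], "_lr", par.grid[k, "shrink"])
hg  <- h2o.gbm(...stuff..., model_id = mid)
h2o.saveModel(hg, path = "h2o_models")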
UPDATE based on comment: If you do not need to keep any models or data, then using h2o.removeAll() is another way to rapidly reclaim all the memory. (This approach is also worth considering if any data or models you do need preserved are quick and easy to re-load.)
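A minimal sketch of that pattern (Xtr and Xt here are hypothetical local copies of the training and validation data):

h2o.removeAll()           # frees every frame and model on the H2O cluster
Xtr.hf <- as.h2o(Xtr)     # re-import only what the next run needs
Xt.hf  <- as.h2o(Xt)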

Related

extract_inner_fselect_results is NULL with mlr3 Nested Resampling

This question is an extension of the following question: No Model Stored with Mlr3.
I have been performing nested resampling to get an unbiased metric of model performance. If I don't specify store_models=TRUE, then I get "Error: No model stored" at the end of the run. However, if I specify store_models=TRUE in both the at and resample calls, then RStudio crashes due to RAM consumption.
I have now tried the following code in which I specified store_models=TRUE for just the at call:
MSvCon <- read.csv("MS v Control Proteomics Final.csv", row.names = 1)
MSvCon$Status <- as.factor(MSvCon$Status)
MSvCon[, 2:4399] <- scale(MSvCon[, 2:4399], center = TRUE, scale = TRUE)
set.seed(123, "L'Ecuyer")
task = as_task_classif(MSvCon, target = "Status")
learner = lrn("classif.ranger", importance = "impurity", num.trees = 10000)
set_threads(learner, n = 8)
measure = msr("classif.fbeta", beta = 1, average = "micro")
terminator = trm("none")
resampling_inner = rsmp("repeated_cv", folds = 10, repeats = 10)
at = AutoFSelector$new(
  learner = learner,
  resampling = resampling_inner,
  measure = measure,
  terminator = terminator,
  fselect = fs("rfe", n_features = 1, feature_fraction = 0.5, recursive = FALSE),
  store_models = TRUE)
resampling_outer = rsmp("repeated_cv", folds = 10, repeats = 10)
rr = resample(task, at, resampling_outer)
After finishing, I am able to extract performance measures successfully. However, when I tried to use extract_inner_fselect_results and extract_inner_fselect_archives to check which features were selected and their importance measures, I received a NULL result.
Do you have any suggestions on what I would need to adjust in my code to see this information? I anticipate that adding store_models=TRUE to the resample call would fix it, but the RAM consumption issue (even with 128GB on RStudio Workbench) prevents that. Is there a way around this?
The archives of the inner resampling are stored in the model slot of the AutoFSelectors, i.e. without store_models = TRUE in resample() you cannot access the inner results and archives. I will write a workaround for you and answer it in the other question.
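For reference, a minimal sketch of the mechanism described above, RAM permitting (same task, at, and resampling_outer as in the question):

# with the models stored by resample(), the inner extractors return results
rr = resample(task, at, resampling_outer, store_models = TRUE)
extract_inner_fselect_results(rr)    # inner feature-selection results per outer iteration
extract_inner_fselect_archives(rr)   # full inner archives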

R bootnet case-dropping bootstrap stops running with no specific error message

I'm running a network analysis in R using qgraph and bootnet. When running the case-dropping bootstrap to estimate correlation-stability coefficients of centrality indices, the algorithm simply "gets stuck" with no specific error message.
I know there are issues with my data (e.g., the sample is quite small, N = 112; a subset of variables are highly correlated with each other, while a few others share little to no correlation with the rest). However, it has never happened to me that the analysis simply stops mid-way, and I'm trying to figure out what exactly is causing this.
This is the current code:
netdt23 <- select(dati,
                  ToM,
                  "MT" = Mentalization,
                  "PpR" = Popular_Response,
                  "CM" = Complete_Meaning,
                  "PE" = Problem_Elaboration,
                  "PS" = Problem_Solving,
                  "NE" = Negative_Emotions
)
npn23 <- huge.npn(netdt23)
net23 <- estimateNetwork(
  npn23,
  default = "EBICglasso",
  corMethod = "cor_auto",
  lambda.min.ratio = 1e-15,
  threshold = TRUE
)
btfun <- function(x){
  nt <- estimateNetwork(x,
                        default = "EBICglasso",
                        corMethod = "cor_auto",
                        lambda.min.ratio = 1e-15,
                        threshold = TRUE
  )
  return(nt$graph)
}
btntCen23 <- bootnet(
  npn23, fun = btfun, type = "case",
  statistics = c("edge", "strength", "betweenness"), # "closeness",
  nBoots = 2000,
  caseMin = .05, caseMax = .95, caseN = 19)
While running, bootnet occasionally gives warnings and/or errors, but these are never the last output before it stops (apparently the analysis keeps running past these issues and halts later). For example:
Error in lav_samplestats_icov(COV = cov[[g]], ridge = ridge.eps, x.idx = x.idx[[g]], : lavaan ERROR: sample covariance matrix is not positive-definite
An empty network was selected to be the best fitting network. Possibly set 'lambda.min.ratio' higher to search more sparse networks. You can also change the 'gamma' parameter to improve sensitivity (at the cost of specificity).
Any help would be greatly appreciated. If I left out necessary information, please let me know and I'll edit the question.

A question about the parallelism in h2o.grid() function

I'm trying to use the h2o.grid() function from the h2o package to do some tuning in R. When I set the parameter parallelism larger than 1, it always shows the warning:
Some models were not built due to a failure, for more details run `summary(grid_object, show_stack_traces = TRUE)`
Also, the model_ids in the final grid object include many models ending with _cv_1, _cv_2, etc., and the number of models does not match the max_models setting in my search_criteria list; I think these are just the models from the CV process, not the final models.
When I leave parallelism at its default or set it to 1, the result is normal: all models end with _model_1, _model_2, etc.
Here is my code:
# set the grid
rf_h2o_grid <- list(mtries = seq(3, ncol(train_h2o), 4),
                    max_depth = c(5, 10, 15, 20))
# set the search_criteria
sc <- list(strategy = "RandomDiscrete",
           seed = 100,
           max_models = 5)
# random grid tuning
rf_h2o_grid_tune_random <- h2o.grid(
  algorithm = "randomForest",
  x = x,
  y = y,
  training_frame = train_h2o,
  nfolds = 5,                      # use CV to validate the parameters
  fold_assignment = "Stratified",
  ntrees = 100,
  seed = 100,
  hyper_params = rf_h2o_grid,
  search_criteria = sc
  # parallelism = 6                # when I set it larger than 1, the result always includes some "cv_" models
)
So how can I use parallelism correctly in h2o.grid()? Thanks for helping!
This is an actual issue with parallelism in grid search; it was noticed previously but not reported correctly.
Thanks for raising this, we'll try to fix it soon: see https://h2oai.atlassian.net/browse/PUBDEV-7886 if you want to track progress.
Until the proper fix, you must avoid using both CV and parallelism in your grids.
Regarding the following error:
Some models were not built due to a failure, for more details run `summary(grid_object, show_stack_traces = TRUE)`
if the error is reproducible, you should get more details by running the grid with verbose = TRUE.
Adding the entire error message to the ticket above might also help.
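Until then, a minimal sketch of the workaround (same grid and search criteria as in the question, with CV kept and parallelism forced to sequential):

# sketch: keep nfolds but build the grid sequentially until PUBDEV-7886 is fixed
rf_h2o_grid_tune_random <- h2o.grid(
  algorithm = "randomForest",
  x = x, y = y,
  training_frame = train_h2o,
  nfolds = 5,
  ntrees = 100,
  seed = 100,
  hyper_params = rf_h2o_grid,
  search_criteria = sc,
  parallelism = 1   # sequential model building
)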
This is because you set max_models = 5: your grid will only build 5 models and then stop.
There are three ways to set up early stopping criteria (see the sketch after this list):
"max_models": the maximum number of models created
"max_runtime_secs": the maximum running time in seconds
metric-based early stopping, by setting "stopping_rounds", "stopping_metric", and "stopping_tolerance"
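A sketch of all three options inside search_criteria (the values are illustrative, not recommendations):

sc <- list(strategy = "RandomDiscrete",
           max_models = 5,             # stop after 5 models
           max_runtime_secs = 3600,    # or stop after one hour
           stopping_rounds = 3,        # or metric-based: stop when the metric
           stopping_metric = "AUTO",   #   has not improved over 3 rounds
           stopping_tolerance = 1e-3)  #   by at least this relative amount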

Reducing NbClust memory usage

I need some help with the massive memory usage of the NbClust function.
On my data, memory balloons to 56GB, at which point R crashes with a fatal error. Using debug(), I was able to trace the error to these lines:
if (any(indice == 23) || (indice == 32)) {
  res[nc - min_nc + 1, 23] <- Index.sPlussMoins(cl1 = cl1,
                                                md = md)$gamma
Debugging of Index.sPlussMoins revealed that the crash happens during a for loop. The iteration at which it crashes varies, and during the loop memory usage varies between 41 and 57GB (I have 64 in total):
for (k in 1:nwithin1) {
  s.plus <- s.plus + (colSums(outer(between.dist1,
                                    within.dist1[k], ">")))
  s.moins <- s.moins + (colSums(outer(between.dist1,
                                      within.dist1[k], "<")))
  print(s.moins)
}
I'm guessing that the memory usage comes from the outer() function.
Can I modify NbClust to be more memory efficient (perhaps using the bigmemory package)?
At the very least, it would be nice to get R to exit the function with a "cannot allocate vector of size..." error instead of crashing. That way I would have an idea of just how much more memory I need to handle the matrix causing the crash.
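(Worth noting: within.dist1[k] is a scalar, so each outer() call above materializes a temporary matrix as long as between.dist1 just to count comparisons. A sketch of an equivalent loop that avoids the temporary, assuming the same variables as the NbClust source:)

# sketch: count comparisons directly instead of allocating outer()'s matrix
for (k in 1:nwithin1) {
  s.plus  <- s.plus  + sum(between.dist1 > within.dist1[k])
  s.moins <- s.moins + sum(between.dist1 < within.dist1[k])
}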
Edit: I created a minimal example with a matrix of approximately the size of the one I am using, although now it crashes at a different point, when the hclust function is called:
set.seed(123)
cluster_means = sample(1:25, 10)
mlist = list()
for (cm in cluster_means) {
  name = as.character(cm)
  m = data.frame(matrix(rnorm(60000*60, mean = cm, sd = runif(1, 0.5, 3.5)), 60000, 60))
  mlist[[name]] = m
}
test_data = do.call(cbind, mlist)   # bind the list of data frames column-wise
library(NbClust)
debug(fun = "NbClust")
nbc = NbClust(data = test_data, diss = NULL, distance = "euclidean", min.nc = 2, max.nc = 30,
              method = "ward.D2", index = "alllong", alphaBeale = 0.1)
debug: hc <- hclust(md, method = "ward.D2")
It seems to crash before using up the available memory (according to my system monitor, 34GB is in use when it crashes, out of 64 total).
So is there any way I can do this without sub-sampling to manageably sized matrices? And if I did, how would I know how much memory I need for a matrix of a given size? I would have thought my 64GB would be enough.
Edit:
I tried altering NbClust to use fastcluster instead of the stats version. It didn't crash, but did exit with a memory error:
Browse[2]>
exiting from: fastcluster::hclust(md, method = "ward.D2")
Error: cannot allocate vector of size 9.3 Gb
If you check the source code of NbClust, you'll see that it is anything but optimized for speed or memory efficiency.
The crash you're reporting is not even during clustering - it's in the evaluation afterwards, specifically in the "Gamma, Gplus and Tau" index code. Disable these indexes and you may get further, but most likely you'll just hit the same problem again in another index. Maybe you can pick only a few indices to run, specifically indices that do not need a lot of memory?
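A sketch of that suggestion (the index name is illustrative; NbClust's index argument accepts a single index such as "silhouette" instead of "alllong"):

# sketch: request one cheap index instead of index = "alllong"
nbc <- NbClust(data = test_data, distance = "euclidean",
               min.nc = 2, max.nc = 30, method = "ward.D2",
               index = "silhouette")   # sidesteps the Gamma/Gplus/Tau code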
I forked NbClust and made some changes that seem to let it run longer without crashing on bigger matrices. I changed some of the functions to use Rfast, propagate, and fastcluster. However, there are still problems.
I haven't run all my data yet, and have only run a few tests on dummy data with the gap index, so there is still time for it to fail. But any suggestions/criticisms would be welcome.
My (in progress) fork of NbCluster:
https://github.com/jbhanks/NbClust

Weird error message when tuning svm with polynomial kernel: "WARNING: reaching max number of iterations"

It is my first time working with support vector machines. I am trying to solve a homework problem, but am receiving the above-mentioned error... The code works for the linear kernel and the radial kernel, but not for the polynomial kernel. Here is my code:
library(e1071)
# test_data = ...  (upload test data here)
training_data = read.table('Digits_training.csv', sep = ',', header = TRUE)
y = training_data$y
chosen_svm = function(y, training_data, kernel_name){
  obj <- tune.svm(y ~ ., data = training_data, gamma = 10^(-3:1), cost = 10^(-3:1), kernel = kernel_name)
  gamma = obj$best.parameters$gamma
  cost = obj$best.parameters$cost
  model = svm(y ~ ., data = training_data, gamma = gamma, cost = cost, kernel = kernel_name)
  return(model)
}
radial_svm = chosen_svm(y, training_data, 'radial')
lin_svm = chosen_svm(y, training_data, 'linear')
pol_svm = chosen_svm(y, training_data, 'polynomial')
I tried changing the gamma and cost ranges a bit, and tried a second-degree polynomial, but I am still getting the same message.
Any idea why this is happening?
This is not an error. It is just a warning, meaning that your optimizer did not converge within the given number of iterations. Unfortunately, e1071 internally has a hard-coded limit... and you cannot change it:
int max_iter = max(10000000, l>INT_MAX/100 ? INT_MAX : 100*l);
What can you do? Simply change the library; for example, http://r.gmum.net ships the very same library (libsvm) with this limitation dropped:
https://github.com/gmum/gmum.r/blob/master/src/svm/svm.cpp (line 553)
[...]
int iter = 0;
// int max_iter = max(10000000, l>INT_MAX/100 ? INT_MAX : 100*l);
int counter = min(l,1000)+1;
while(1)
[...]
I am pretty sure that many others have dropped it too. For example, in Python's scikit-learn you can explicitly set the maximum number of iterations (and pass -1 for no limit).
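As a stopgap within e1071 itself (a suggestion, not part of the original answer): svm() exposes a tolerance parameter for the termination criterion (default 0.001), and loosening it sometimes lets the optimizer stop before hitting the iteration cap, at some cost in solution quality. A sketch with illustrative gamma and cost values:

# sketch: loosen the termination tolerance so the optimizer stops earlier
pol_model <- svm(y ~ ., data = training_data, kernel = 'polynomial',
                 gamma = 0.1, cost = 1, tolerance = 0.01)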
