Improve run time for GBM package in R

I am building a GBM model with rather large datasets. data.table is great for processing the data, but when I run the GBM model it takes forever to finish. Looking at Activity Monitor (on a Mac), I can see the process doesn't use all the memory and doesn't max out the processor.
Since GBM is single-core, and I can't modify it to run on multiple cores, what are my options to improve my run time? Right now I am using a MacBook Air with 4GB RAM and a 1.7GHz i5.
I am not sure which of the following options would help performance the most: (i) buying a computer with more memory; (ii) getting a more powerful chip (i7); or (iii) using Amazon AWS and installing R there. How would each of these help?
Sample code added per Brandson's request:
library(gbm)

GBM_NTREES    = 100
GBM_SHRINKAGE = 0.05
GBM_DEPTH     = 4
GBM_MINOBS    = 50

GBM_model <- gbm.fit(
  x = data[, -target],
  y = data[, target],
  # var.monotone = TRUE,  # NN added
  distribution = "gaussian",
  n.trees = GBM_NTREES,
  shrinkage = GBM_SHRINKAGE,
  interaction.depth = GBM_DEPTH,
  n.minobsinnode = GBM_MINOBS,
  verbose = TRUE)

Maybe something worth considering is using the XGBoost library. According to the GitHub repo:
"XGBoost provides a parallel tree boosting (also known as GBDT, GBM) that solve many data science problems in a fast and accurate way."
I also realize the original question is quite old, but maybe this will help someone down the road.
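As a rough illustration only, here is how the same model might be expressed in xgboost; the data/target objects and the parameter mapping are assumptions based on the gbm.fit() call above, not tested code:
library(xgboost)

X <- as.matrix(data[, -target])   # xgboost expects a numeric matrix
y <- data[, target]

xgb_model <- xgboost(
  data      = X,
  label     = y,
  objective = "reg:squarederror",  # squared-error loss, analogous to distribution = "gaussian"
  nrounds   = 100,                 # ~ GBM_NTREES
  eta       = 0.05,                # ~ GBM_SHRINKAGE
  max_depth = 4,                   # ~ GBM_DEPTH
  min_child_weight = 50,           # rough analogue of n.minobsinnode
  nthread   = 4,                   # number of CPU cores to use
  verbose   = 1)
The nthread parameter is what provides the multicore speedup that plain gbm lacks.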

This seems to be more about parallel computing in R in general than a specific question about gbm. I would start with the CRAN High-Performance Computing task view: http://cran.r-project.org/web/views/HighPerformanceComputing.html.
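For context, a minimal sketch of the general pattern from that task view, using the parallel/doParallel packages (a generic example, not tied to gbm):
library(doParallel)

cl <- makeCluster(detectCores() - 1)  # leave one core free for the OS
registerDoParallel(cl)

# foreach loops (and packages such as caret that use foreach) now run in parallel
res <- foreach(i = 1:4, .combine = c) %dopar% sqrt(i)

stopCluster(cl)
Note this only helps code written to use foreach or a parallel backend; it does not make single-threaded gbm itself run on multiple cores.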

Related

Computation time GAM

I am fitting the GAM below in mgcv:
m3.2 <- bam(pt10 ~
              s(year, by = org.type) +
              s(year, by = region) +
              s(org.name, bs = 're') +
              s(org.name, year, bs = 're'),
            data = dat,
            method = "fREML",
            family = betar(link = "logit"),
            select = TRUE,
            discrete = TRUE)
My dat has ~58,000 observations, and the factor org.name has ~2,500 levels, meaning there are a lot of random intercepts and slopes to fit. Hence, in an attempt to reduce computation time, I have used bam() with the discrete = TRUE option. However, my model has now been running for ~36 hours and has neither finished fitting nor failed with an error message. I am unsure how long such a model might reasonably take to fit, and therefore how or when to decide whether to kill the command; I don't want to stop the model if this is normal behaviour/computation time for such a model, but I also don't want to waste my time if bam() is stuck going around in circles and will never fit.
Question: How long might such a model reasonably take to fit, i.e. what would a reasonable computation time be? Is there a way to tell whether bam() is still making progress, or whether the command should just be killed to avoid wasting time?
My computer has 16GB RAM and an Intel(R) Core(TM) i7-8565U processor (1.80GHz). In Windows Task Manager I can see that RStudio is using 20-30% CPU and 20-50% memory, and that these usage values keep changing rather than staying static.
To see what bam() is doing, you should have set trace = TRUE in the list passed to the control argument:
ctrl <- gam.control(trace = TRUE)

m3.2 <- bam(pt10 ~
              org.type + region +   # include parametric terms for group means
              s(year, by = org.type) +
              s(year, by = region) +
              s(org.name, bs = 're') +
              s(org.name, year, bs = 're'),
            data = dat,
            method = "fREML",
            family = betar(link = "logit"),
            select = TRUE,
            discrete = TRUE,
            control = ctrl)
That way you'd get printed statements in the console while bam() is doing its thing. I would check on a much smaller set of data that this actually works within RStudio on Windows (run the example in ?bam, say); I've never used that setup, and you wouldn't want the trace output to arrive only once the function had finished, in a single torrent.
Your problem is the random effects, not really the size of the data; you are estimating 2500 + 2500 coefficients for your two random effects and another 10 * nlevels(org.type) + 10 * nlevels(region) for the two factor by smooths. This is a lot for 58,000 observations.
As you didn't set nthreads, the model is being fitted on a single CPU core. I wouldn't necessarily change that, though, as using 2+ threads might just make the memory situation worse. 16GB of RAM on Windows isn't very much these days: RStudio is using roughly half the available RAM, with your other software and Windows accounting for the remaining ~36% that Task Manager reports as in use.
You should also check whether the OS is having to swap memory out to disk; if that's happening, give up, as retrieving data from disk on each iteration of the fit is going to be excruciating, even with a reasonably fast SSD.
The random effects could be handled more efficiently in dedicated mixed-model software, but then you have the problem of writing the GAM bits (the two factor by smooths) in that form; you would write the random effects in the required notation for {brms} or {lme4} as (1 + year | org.name) in the relevant part of the formula (or in the random argument of gamm4()).
{brms} and {gamm4} can do this, but for the former you need to know how to drive {brms} and Stan (which does HMC sampling of the posterior), while {lme4}, which is what {gamm4} uses for the fitting, doesn't have a beta response family. The {bamlss} package has options for this too, but it's quite a complex package, so be sure you understand how to specify the model estimation method.
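Purely to show where that lme4-style notation would go, here is a rough, hypothetical gamm4() sketch; it would not work as-is for this model, because gamm4/lme4 has no beta response family, and the object names are simply taken from the question:
library(gamm4)

m_gamm4 <- gamm4(pt10 ~ org.type + region +
                   s(year, by = org.type) +
                   s(year, by = region),
                 random = ~ (1 + year | org.name),  # random intercept and slope per organisation
                 data = dat)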
So perhaps revisit your model structure; why do you want a smooth of year for the regions and organisation types, but a linear trend for individual organisations?

h2o.GBM taking too long on a small sized dataset

I've got a rather small dataset (162,000 observations with 13 attributes) that I'm trying to use for modelling with h2o.gbm. The response variable is categorical with a large number of levels (~20,000).
The model doesn't run out of memory or give any errors, but it had been going for nearly 24 hours without any progress (the H2O GBM reporting stayed at 0%).
I finally gave in and stopped it.
I'm wondering if there's anything wrong with my hyperparameters, as the data is not particularly large.
Here's my code:
library(h2o)
localH2O <- h2o.init(nthreads = -1, max_mem_size = "12g")
train.h20 <- as.h2o(analdata_train)
gbm1 <- h2o.gbm(
y = response_var
, x = independ_vars
, training_frame = train.h20
, ntrees = 3
, max_depth = 5
, min_rows = 10
, stopping_tolerance = 0.001
, learn_rate = 0.1
, distribution = "multinomial"
)
The way H2O GBM multinomial classification works is that when you ask for 1 tree as a parameter, it actually builds a tree for each level of the response column under the hood.
So 1 tree really means 20,000 trees in your case, 2 trees really mean 40,000, and so on...
(Note the binomial classification case takes a shortcut and builds only one tree for both classes.)
So... it will probably finish but it could take quite a long time!
It's probably not a good idea to train a classifier with 20,000 classes -- most GBM implementations won't even let you do that. Can you group/cluster the classes into a smaller number of groups so that you can train a model with a smaller number of classes? If so, then you could perform your training in a two-stage process -- the first model would have K classes (assuming you clustered your classes into K groups). Then you can train secondary models that further classify the observations into your original classes.
This type of two-stage process may make sense if your classes represent groups that naturally cluster into a hierarchy, such as zip codes or ICD-10 medical diagnostic codes, for example.
If your use case really demands that you train a 20,000-class GBM (and there's no way around it), then you should get a bigger cluster of machines for your H2O cluster (it's unclear how many CPUs you are using currently). H2O GBM should be able to finish training, assuming it has enough memory and CPUs, but it may take a while.
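A hypothetical sketch of the two-stage idea above. The grouping rule (first three characters of the label, as one might use for zip codes) and the new column name are illustrative assumptions, not the original poster's code:
library(h2o)
h2o.init(nthreads = -1, max_mem_size = "12g")
h2o.clusterInfo()   # check how many CPUs and how much memory the H2O cluster actually has

# Stage 1: collapse the ~20,000 classes into a smaller number of coarse groups
analdata_train$response_group <- substr(analdata_train[[response_var]], 1, 3)
train.h20 <- as.h2o(analdata_train)

gbm_stage1 <- h2o.gbm(
  y = "response_group",          # K coarse groups instead of ~20,000 classes
  x = independ_vars,
  training_frame = train.h20,
  ntrees = 50,
  max_depth = 5,
  distribution = "multinomial")

# Stage 2 (not shown): train one model per coarse group to recover the original labels.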

Parallelism in XGBoost machine learning technique

This has to do with the parallelism implementation of XGBoost.
I am trying to optimize XGBoost execution by giving it the parameter nthread = 16, where my system has 24 cores. But when I train my model, it doesn't seem to exceed roughly 20% CPU utilization at any point during training.
The code snippet is as follows:
param_30 <- list("objective" = "reg:linear",              # linear regression
                 "subsample" = subsample_30,
                 "colsample_bytree" = colsample_bytree_30,
                 "max_depth" = max_depth_30,               # maximum depth of tree
                 "min_child_weight" = min_child_weight_30,
                 "max_delta_step" = max_delta_step_30,
                 "eta" = eta_30,                           # step size shrinkage
                 "gamma" = gamma_30,                       # minimum loss reduction
                 "nthread" = nthreads_30,                  # number of threads to be used
                 "scale_pos_weight" = 1.0)

model <- xgboost(data = training.matrix[, -5],
                 label = training.matrix[, 5],
                 verbose = 1, nrounds = nrounds_30, params = param_30,
                 maximize = FALSE,
                 early_stopping_rounds = searchGrid$early_stopping_rounds_30[x])
Please explain (if possible) how I can increase CPU utilization and speed up model training. R code would be helpful for further understanding.
Assumption: this question is about execution in the R package of XGBoost.
This is a guess... but I have had this happen to me...
You are spending too much time communicating during the parallelism and never becoming CPU bound: https://en.wikipedia.org/wiki/CPU-bound
The bottom line is that your data isn't large enough (rows and columns), and/or your trees aren't deep enough (max_depth), to warrant that many cores; there is too much overhead. xgboost parallelizes split evaluations, so deep trees on big data can keep the CPU humming at max.
I have trained many models where single-threaded outperforms 8/16 cores. Too much time switching and not enough work.
**MORE DATA, DEEPER TREES OR FEWER CORES :)**
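A minimal sketch of how one might check where the diminishing returns kick in, by timing the same fit with different nthread values (the parameter values are illustrative, not the poster's):
library(xgboost)

for (nt in c(1, 4, 8, 16)) {
  elapsed <- system.time(
    xgboost(data = training.matrix[, -5],
            label = training.matrix[, 5],
            nrounds = 50,
            params = list(objective = "reg:linear",
                          max_depth = 6,
                          nthread = nt),
            verbose = 0)
  )["elapsed"]
  cat("nthread =", nt, "-> elapsed:", elapsed, "seconds\n")
}
If the timings stop improving (or get worse) beyond a few threads, that supports the overhead explanation above.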
I tried to answer this question, but my post was deleted by a moderator. Please see https://stackoverflow.com/a/67188355/5452057, which I believe could also help you; it relates to missing MPI support in the xgboost R package for Windows available from CRAN.

support vector machine in r

I've been working on a Support Vector Machine algorithm using RStudio. However, I'm ending up with a low accuracy rate and I don't know how to fix it. I'm expecting an accuracy rate higher than 90%.
Here is my code:
install.packages("caTools")
install.packages("class")
library(caTools)
library(class)
install.packages("ISLR")
library(ISLR)
Collegedata<-College[,-1]
Collegedata[,-17]<-scale(Collegedata[,-17])
med<-median(Collegedata$Grad.Rate)
Grad.Rate<-Collegedata$Grad.Rate>=med
Grad.Rate1<-as.numeric(Grad.Rate)
Collegedata<-data.frame(Collegedata,Grad.Rate1)
corcollege<-cor(Collegedata)
Collegedata<-Collegedata[,-2:-3]
Collegedata<-Collegedata[,-4]
Collegedata<-Collegedata[,-7]
Collegedata<-Collegedata[,-13]
SVM
install.packages("e1071")
library("e1071")
collegesplit=sample.split(Collegedata, SplitRatio=0.8)
collegetrain<-subset(Collegedata, collegesplit==1)
collegetest<-subset(Collegedata, collegesplit==0)
collegetrain<-data.frame(collegetrain)
svm.model.college <- svm(Grad.Rate1 ~ ., data = collegetrain, type = "C-classification", cost = 1,gamma = 0.125, cross =10)
svm.pred.college <- predict(svm.model.college, collegetest[,-13])
table(pred = svm.pred.college, true = collegetest[,13])
install.packages('ROCR')
library(ROCR)
ROC=predict(svm.model.college,newdata=collegetest)
ROC<-as.vector(ROC)
ROC<-as.numeric(ROC)
pred=prediction(ROC,collegetest$Grad.Rate1)
perf=performance(pred,'tpr','fpr')
plot(perf)
as.numeric(performance(pred,'auc')#y.values)
It's been a while since I've coded SVMs, so I won't be able to provide you with any code; however, I can say that SVMs are incredibly sensitive to your choice of hyperparameters, e.g. cost and gamma. You'll want to perform a grid search over some sequence of values to determine the optimal ones. I recommend using the tune.svm() or best.tune() functions in the e1071 package to do this. Further, while the default Gaussian (RBF) kernel is often the optimal kernel, it is not guaranteed to be the best, so you could perhaps try the linear kernel instead.
You may find this paper useful in helping develop a framework for building your model.
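A minimal sketch of the grid search suggested above, using e1071::tune.svm(). The collegetrain object and the conversion of Grad.Rate1 to a factor follow the question's code; the cost/gamma grids are illustrative assumptions:
library(e1071)

collegetrain$Grad.Rate1 <- as.factor(collegetrain$Grad.Rate1)  # classification target

set.seed(1)
svm.tuned <- tune.svm(Grad.Rate1 ~ .,
                      data  = collegetrain,
                      cost  = 2^(-2:6),   # grid over the cost parameter
                      gamma = 2^(-6:2))   # grid over the RBF kernel width

summary(svm.tuned)                 # cross-validated error for each cost/gamma pair
best.model <- svm.tuned$best.model # the model fitted with the best parameters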

Questions about parallelizing Caret nnet in R

I have a training set that looks like this:
Name Day Area X Y Month Night
ATTACK Monday LA -122.41 37.78 8 0
VEHICLE Saturday CHICAGO -1.67 3.15 2 0
MOUSE Monday TAIPEI -12.5 3.1 9 1
Name is the outcome/dependent variable.
Here is what my code looks like so far, in case it helps:
ynn <- model.matrix(~Name , data = trainDF)
mnn <- model.matrix(~ Day+Area +X + Y + Month + Night, data = trainDF)
yCat<-make.names(trainDF$Name, unique=FALSE, allow_=TRUE)
I then set up the parameter tuning:
nnTrControl <- trainControl(method = "repeatedcv", number = 3, repeats = 5,
                            verboseIter = TRUE, returnData = FALSE,
                            returnResamp = "all", classProbs = TRUE,
                            summaryFunction = multiClassSummary,
                            allowParallel = TRUE)
nnGrid <- expand.grid(.size = c(1, 4, 7), .decay = c(0, 0.001, 0.1))
model <- train(y = yCat, x = mnn, method = 'nnet', linout = TRUE, trace = FALSE,
               trControl = nnTrControl, metric = "logLoss", tuneGrid = nnGrid)
When I ran this, it was still running more than 20 hours later, so I had to stop it.
I read in the link below that it's possible to parallelize caret's resampling using registerDoMC: R caret nnet package in Multicore
However, that only seems to work for cores. My machine has 2 cores with 2 threads each. Is there a way to get a speedup by using the threads in addition to the 2 cores and registerDoMC(2)?
I also see in the link below that the user had to set up seeds for each resample: Fully reproducible parallel models using caret
Do I also have to do that for my code? Why was this not done in the former link? And what if I used xgboost instead of nnet?
If you want to reproduce your results, you will have to set a seed for every thread that you spawn (a sketch of per-resample seeds with caret is shown below). This is required because every thread will draw different random numbers each time an instance is spawned. Depending on the OS you are working on, each thread will most likely be scheduled on a separate core of your CPU; this depends on the OS job scheduler. Regarding xgboost versus nnet, I think the most important aspect is whether you are interested in the model's properties; if you are starting out with machine learning, xgboost may be a bit easier to work with than nnet. If computational performance is your biggest concern, try running your problem on a smaller subset first.
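A minimal sketch (not the original poster's code) of reproducible parallel resampling with caret and doMC, using one seed vector per resample plus one extra for the final model, as described in ?trainControl. The sizes match the repeatedcv and tuneGrid settings above:
library(caret)
library(doMC)
registerDoMC(2)                 # the 2 physical cores mentioned in the question

n_resamples <- 3 * 5            # repeatedcv: number * repeats
n_tune      <- 3 * 3            # size x decay combinations tried per resample

set.seed(123)
seeds <- vector(mode = "list", length = n_resamples + 1)
for (i in seq_len(n_resamples)) seeds[[i]] <- sample.int(10000, n_tune)
seeds[[n_resamples + 1]] <- sample.int(10000, 1)   # seed for the final model fit

nnTrControl <- trainControl(method = "repeatedcv", number = 3, repeats = 5,
                            classProbs = TRUE, summaryFunction = multiClassSummary,
                            allowParallel = TRUE, seeds = seeds)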
One thing I would do first is run an MCA (multiple correspondence analysis), which can be found in FactoMineR. This will show you the amount of variance carried by each of your variables; you could drop variables with too little variance and thereby speed up your learning task.
