Computation time GAM - r

I am fitting the below GAM in mgcv
library(mgcv)

m3.2 <- bam(pt10 ~
              s(year, by = org.type) +      # separate smooth trend in year per organisation type
              s(year, by = region) +        # separate smooth trend in year per region
              s(org.name, bs = 're') +      # random intercepts for organisation
              s(org.name, year, bs = 're'), # random slopes of year per organisation
            data = dat,
            method = "fREML",
            family = betar(link = "logit"),
            select = TRUE,
            discrete = TRUE)
My dat has ~58,000 observations, and the factor org.name has ~2,500 levels, meaning a lot of random intercepts and slopes need to be estimated. In an attempt to reduce computation time, I have used bam() with the discrete = TRUE option. However, the model has now been running for ~36 hours and has neither finished nor failed with an error message. I am unsure how long is reasonable for such a model to take, and therefore when to decide to kill the command: I don't want to stop it if this is normal behaviour/computation time for such a model, but I also don't want to waste my time if bam() is stuck going around in circles and will never finish.
Question: How long might it reasonably take to fit such a model, and is there a way to tell whether bam() is still making progress or whether the command should just be killed to avoid wasting time?
My computer has 16GB of RAM and an Intel(R) Core(TM) i7-8565U processor (CPU @ 1.80GHz). In Windows Task Manager I can see that RStudio is using 20-30% CPU and 20-50% memory, and that these usage values are changing rather than static.

To see what bam() is doing, you should have set trace = TRUE in the list passed to the control argument:
ctrl <- gam.control(trace = TRUE)
m3.2 <- bam(pt10 ~
              org.type + region +           # parametric terms for the group means
              s(year, by = org.type) +
              s(year, by = region) +
              s(org.name, bs = 're') +
              s(org.name, year, bs = 're'),
            data = dat,
            method = "fREML",
            family = betar(link = "logit"),
            select = TRUE,
            discrete = TRUE,
            control = ctrl)
That way you'd get progress statements printed in the console while bam() is doing its thing. I would first check on a much smaller subset of the data that this actually works within RStudio on Windows (run the example in ?bam, say); I've never used that setup, and you wouldn't want the trace output to arrive only once the function has finished, in a single torrent.
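For example, a quick sanity check of the trace output on a small random subset might look something like this (a sketch only; the subset size and the cut-down formula are arbitrary):
set.seed(1)
dat_small <- dat[sample(nrow(dat), 5000), ]   # arbitrary 5,000-row subset
m_test <- bam(pt10 ~ s(year, by = org.type) + s(org.name, bs = 're'),
              data = dat_small,
              family = betar(link = "logit"),
              method = "fREML",
              discrete = TRUE,
              control = ctrl)                 # ctrl from above, with trace = TRUE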
Your problem is the random effects, not really the size of the data; you are estimating 2500 + 2500 coefficients for your two random effects and another 10 * nlevels(org.type) + 10 * nlevels(region) for the two factor by smooths. This is a lot for 58,000 observations.
As you didn't set nthreads, the model is being fitted on a single CPU core. I wouldn't necessarily change that, though, as using 2+ threads might just make the memory situation worse. 16GB on Windows isn't much RAM these days; RStudio is using up to ~half the available RAM, and your other software plus Windows account for the further ~36% that Task Manager reports as in use.
You should also check whether the OS is having to swap memory out to disk; if that's happening then give up, as retrieving data from disk on each iteration of model fitting is going to be excruciating, even with a reasonably fast SSD.
The random effects could be handled more efficiently in dedicated mixed-model software, but then you have the problem of writing the GAM bits (the two factor by smooths) in that form. In {brms} or {lme4} notation you would write the random intercepts and slopes as (1 + year | org.name) in the relevant part of the formula (or in the random argument of gamm4()).
{brms} and {gamm4} can handle this, but for the former you need to know how to drive {brms} and Stan (which samples the posterior by HMC), while {lme4}, which is what {gamm4} uses for the fitting, doesn't have a beta response family. The {bamlss} package has options for this too, but it's quite a complex package, so be sure you understand how to specify the model estimation method.
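For illustration, the {brms} version might be specified roughly as follows (a sketch only, untested on these data; note that sampling 2,500 organisation-level intercepts and slopes by HMC will itself be slow):
library(brms)
m3.2_brms <- brm(
  pt10 ~ org.type + region +      # parametric group means
    s(year, by = org.type) +
    s(year, by = region) +
    (1 + year | org.name),        # random intercepts and slopes in lme4-style notation
  data = dat,
  family = Beta(link = "logit"),
  chains = 4, cores = 4)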
So perhaps revisit your model structure; why do you want a smooth of year for the regions and organisation types, but a linear trend for individual organisations?

Related

Release H2O Grid From Memory?

I am struggling to find the correct API for releasing memory for an object created by the H2O grid. This code was pre-written by someone else and I am currently maintaining it.
#train grid search
gbm_grid1 <- h2o.grid(algorithm = "gbm" #specifies gbm algorithm is used
,grid_id = paste("gbm_grid1",current_date,sep="_") #defines a grid identification
,x = predictors #defines column variables to use as predictors
,y = y #specifies the response variable
,training_frame = train1 #specifies the training frame
#gbm parameters to remain fixed
,nfolds = 5 #specify 5 folds for cross-validation (this is acceptable here in order to reduce training time)
,distribution = "bernoulli" #specify that we are predicting a binary dependent variable
,ntrees = 1000 #specify the number of trees to build (1000 as essentially the maximum number of trees that can be built. Early stopping parameters defined later will make it unlikely our model will reach 1000 trees)
,learn_rate = 0.1 #specify the learn rate used for gradient descent optimization (goal is to use as small a learn rate as possible)
,learn_rate_annealing = 0.995 #specifies that the learn rate will decrease by a factor of 0.995 after each tree (this can help speed up training for our grid search)
,max_depth = tuned_max_depth
,min_rows = tuned_min_rows
,sample_rate = 0.8 #specify the fraction of rows sampled for each tree
,col_sample_rate = 0.8 #specify the fraction of columns sampled for each split
,stopping_metric = "logloss" #specify the metric used for early stopping
,stopping_tolerance = 0.001 #specify the minimum improvement in the stopping metric required for training to continue
,stopping_rounds = 5 #specify the number of rounds over which the stopping metric must improve by more than the tolerance
#specifies the hyperparameters to vary during the grid search
,hyper_params = gbm_hp2
#specifies the search criteria, which include early-stopping settings to speed up model building
,search_criteria = search_criteria2
#sets a reproducible seed
,seed = 123456
)
h2o.rm(gbm_grid1)
The problem is that I believe this code was written a while ago and has since been deprecated. h2o.rm(gbm_grid1) fails, and RStudio tells me that I require a hex identifier. So I assigned my object an identifier and tried h2o.rm(gbm_grid1, "identifier.hex"), and it tells me I cannot release this type of object.
The issue is I run out of memory if I move onto the next steps of the script. What should I do?
This is what I get with h2o.ls():
Yes, you can remove objects with h2o.rm(). You can use the variable name or key.
h2o.rm(your_object)
h2o.rm('your_key')
You can use h2o.ls() to check what objects are in memory. You can also add the argument cascade = TRUE to h2o.rm() to remove the sub-models as well.
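Putting that together for the grid above, something like the following should work (a sketch; this assumes a reasonably recent {h2o} version in which h2o.rm() accepts the cascade argument):
h2o.ls()                            # list the keys currently held by the H2O cluster
h2o.rm(gbm_grid1, cascade = TRUE)   # remove the grid and its sub-models
# or remove by key, using whatever key h2o.ls() reports for the grid:
# h2o.rm("gbm_grid1_<current_date>", cascade = TRUE)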
See the h2o.rm() documentation for more details.

Computational speed of a complex Hierarchical GAM

I have a large dataset (3.5+ million observations) with a binary response variable, for which I am trying to fit a hierarchical GAM with a global smoother plus group-level smoothers that share a penalty (model 'GS' in Pedersen et al. 2019). Specifically, I am trying to estimate the following nesting structure: Global > Geographic Zone (N = 2) > Bioregion (N = 20) > Season (N varies by bioregion). In total I am trying to estimate 36 different nested parameters.
Here is the code I am currently using:
modGS <- bam(
outbreak ~
te(days_diff,NDVI_mean,bs=c("tp","tp"),k=c(5,5)) +
t2(days_diff, NDVI_mean, Zone, Bioregion, Season, bs=c("tp", "tp","re","re","re"),k=c(5, 5), m=2, full=TRUE) +
s(Latitude,Longitude,k=50),
family=binomial(),select = TRUE,data=dat)
My main issue is that it is taking a very long time (5+ days) to fit the model. This nesting structure cannot be discretized, so I cannot compute it in parallel. I have also tried gamm4, but I ran into memory limit issues. Here is the gamm4 code:
modGS <- gamm4(
outbreak ~
t2(days_diff,NDVI_mean,bs=c("tp","tp"),k=c(5,5)) +
t2(days_diff, NDVI_mean, Zone, Bioregion, Season, bs=c("tp", "tp","re","re","re"),k=c(5, 5), m=2, full=TRUE) +
s(Latitude,Longitude,k=50),
family=binomial(),select = TRUE,data=dat)
What is the best/most computationally feasible way to run this model?
I cut down the computation time by reducing the number of bioregion levels and randomly sampling ca. 60% of the data. This actually allowed me to calculate out-of-bag (OOB) error for the model.
There is an article I read recently that has a specific section on decreasing computation time. The main things the authors highlight are:
Using the bam() function with its fast fREML estimation, which re-factorizes the model matrix to speed up the calculations. It seems you have already done that.
Adding the discrete = TRUE argument, which discretizes the covariates into a smaller, finite set of unique values for estimation.
Setting nthreads in the same call so the fit runs on more than one core of your computer in parallel (see the sketch below).
As the authors caution, the second option can reduce the accuracy of your estimates. I fit some large models recently with these settings and found the results were not always identical to a default bam() fit, so it's best to treat this as a quick inspection rather than the final result you are looking for.
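A sketch of those options applied to the model above (the nthreads value is illustrative and should match your machine):
library(mgcv)
modGS <- bam(
  outbreak ~
    te(days_diff, NDVI_mean, bs = c("tp", "tp"), k = c(5, 5)) +
    t2(days_diff, NDVI_mean, Zone, Bioregion, Season,
       bs = c("tp", "tp", "re", "re", "re"), k = c(5, 5), m = 2, full = TRUE) +
    s(Latitude, Longitude, k = 50),
  family = binomial(), select = TRUE, data = dat,
  method = "fREML",   # fast REML estimation
  discrete = TRUE,    # discretize covariates: faster, slightly less exact
  nthreads = 4)       # use several cores where supported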

h2o.GBM taking too long on a small sized dataset

I've got a rather small dataset (162,000 observations with 13 attributes) that I'm trying to use for modelling with h2o.gbm. The response variable is categorical with a large number of levels (~20,000).
The model doesn't run out of memory or give any errors, but it's been going for nearly 24 hours without any progress (the h2o.gbm progress reporting stays at 0%).
I finally gave in and stopped it.
I'm wondering if there's anything wrong with my hyperparameters, as the data is not particularly large.
Here's my code:
library(h2o)
localH2O <- h2o.init(nthreads = -1, max_mem_size = "12g")
train.h20 <- as.h2o(analdata_train)
gbm1 <- h2o.gbm(
y = response_var
, x = independ_vars
, training_frame = train.h20
, ntrees = 3
, max_depth = 5
, min_rows = 10
, stopping_tolerance = 0.001
, learn_rate = 0.1
, distribution = "multinomial"
)
The way H2O GBM multinomial classification works is that when you ask for 1 tree as a parameter, it actually builds a tree for each level of the response column under the hood.
So 1 tree really means 20,000 trees in your case.
2 trees would really mean 40,000, and so on...
(Note the binomial classification case takes a shortcut and builds only one tree for both classes.)
So... it will probably finish but it could take quite a long time!
It's probably not a good idea to train a classifier with 20,000 classes -- most GBM implementations won't even let you do that. Can you group/cluster the classes into a smaller number of groups so that you can train a model with a smaller number of classes? If so, then you could perform your training in a two-stage process -- the first model would have K classes (assuming you clustered your classes into K groups). Then you can train secondary models that further classify the observations into your original classes.
This type of two-stage process may make sense if your classes represent groups that naturally cluster into a hierarchy, such as zip codes or ICD-10 medical diagnostic codes, for example.
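As a rough illustration of the two-stage idea (a sketch only; it assumes you have already added a class_group column with K coarse groups to the training frame, e.g. from clustering or domain knowledge, and it reuses independ_vars, response_var and train.h20 from your code):
# stage 1: predict the coarse group
m_group <- h2o.gbm(x = independ_vars, y = "class_group", training_frame = train.h20,
                   distribution = "multinomial")

# stage 2: within each coarse group, predict the original fine-grained class
groups <- h2o.levels(train.h20[["class_group"]])
stage2_models <- lapply(groups, function(g) {
  sub <- train.h20[train.h20[["class_group"]] == g, ]
  h2o.gbm(x = independ_vars, y = response_var, training_frame = sub,
          distribution = "multinomial")
})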
If your use-case really demands that you train a 20,000 class GBM (and there's no way around it), then you should get a bigger cluster of machines to use in your H2O cluster (it's unclear how many CPUs you are using currently). H2O GBM should be able to finish training, assuming it has enough memory and CPUs, but it may take a while.

Why is my glmer model in R taking so long to run?

I had previously been using simple stats in Statistica, but R is required for my masters research. I am trying to run the following code to test for any significant interactions, and it just runs forever. If I simplify the model by taking month out, then it runs, but biologically it makes sense that month is important, so I would really like the model to include month as a factor. Once I run the model, the stop sign in RStudio just stays present for hours; what could be the reason for this? As I said, I'm very new to this and it has been really difficult to learn on my own. I am working with presence/absence data (as %), which I cbind as my dependent variable. So far this is what my code looks like:
library(car)
library(languageR)
library(AICcmodavg)
library(lme4)
Scat <- read.csv("Scat2.csv", header=T)
attach(Scat)
names(Scat)
y <- cbind(Present,Absent)
ScatData <- glmer(y ~ Estate * Species * Month * Content * (1|Site) + Min + Max,family=binomial)
summary(ScatData)
Once I get to running the actual model, I don't even get to do the summary because R is not done computing the results of the actual model. I ran the model for approximately 4 hours, and when I clicked on the stop sign, I received this message:
Warning message:
In (function (fn, par, lower = rep.int(-Inf, n), upper = rep.int(Inf, :
failure to converge in 10000 evaluations
I would really appreciate some input on this matter.
You have a few problems with your model specification. Your model
y ~ Estate * Species * Month * Content * (1|Site) + Min + Max
is asking for all the main effects and all the interactions of estate, species, month, content, and site, which is an incredibly complex model.
Also, you have specified Site as a random effect and then asked for its interactions with the fixed effects. I'm not sure whether that is even possible, but it certainly seems wrong. You should decide whether you want Site to be a fixed effect or a random effect.
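For instance, a much simpler starting point with Site as a random intercept might look like this (a sketch only; which interactions, if any, to keep is a substantive decision for you to make):
library(lme4)
ScatData <- glmer(cbind(Present, Absent) ~ Estate + Species + Month + Content + Min + Max +
                    (1 | Site),             # random intercept for Site
                  family = binomial, data = Scat)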
If you post a minimal reproducible example, I can give more specific advice.

Questions about parallelizing Caret nnet in R

I have a training set that looks like
Name     Day       Area     X        Y      Month  Night
ATTACK   Monday    LA       -122.41  37.78  8      0
VEHICLE  Saturday  CHICAGO  -1.67    3.15   2      0
MOUSE    Monday    TAIPEI   -12.5    3.1    9      1
Name is the outcome/dependent variable.
Here is what my code looks like so far in case it helps
ynn <- model.matrix(~Name , data = trainDF)
mnn <- model.matrix(~ Day+Area +X + Y + Month + Night, data = trainDF)
yCat<-make.names(trainDF$Name, unique=FALSE, allow_=TRUE)
I then set up the tuning parameters:
nnTrControl <- trainControl(method = "repeatedcv", number = 3, repeats = 5,
                            verboseIter = TRUE, returnData = FALSE, returnResamp = "all",
                            classProbs = TRUE, summaryFunction = multiClassSummary,
                            allowParallel = TRUE)
nnGrid <- expand.grid(.size = c(1, 4, 7), .decay = c(0, 0.001, 0.1))
model <- train(y = yCat, x = mnn, method = 'nnet', linout = TRUE, trace = FALSE,
               trControl = nnTrControl, metric = "logLoss", tuneGrid = nnGrid)
When I ran this, it was still running over 20 hours later, so I had to stop it.
I read in the link below that it's possible to parallelize the resampling in caret using registerDoMC: R caret nnet package in Multicore
However, that only seems to work for cores. My machine has 2 cores with 2 threads per core. Is there a way to get a speedup from the threads in addition to using the 2 cores with registerDoMC(2)?
I also see in the link below that the user had to set up seeds for each resample: Fully reproducible parallel models using caret
Do I also have to do that for my code? Why was that not done in the first link? And what if I used xgboost instead of nnet?
If you want to reproduce your results, you will have to set a seed for every thread that you spawn. This is required because each thread draws its own random numbers when it is spawned. Depending on your OS, each thread will most likely be scheduled on a separate core of your CPU; that is down to the OS job scheduler. Regarding xgboost versus nnet, the most important consideration should be whether you are interested in the model's properties; if you are starting out with machine learning, xgboost may be a bit easier to work with than nnet. If computational performance is your biggest concern, try running your problem on a smaller subset first.
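For example, on Windows a reproducible parallel setup for caret could be sketched like this (the worker count and seed values are illustrative; registerDoMC() is Unix-only, so {doParallel} is used here instead):
library(caret)
library(doParallel)

cl <- makePSOCKcluster(2)   # one worker per physical core
registerDoParallel(cl)

# one seed vector per resample (3 folds x 5 repeats = 15), each as long as the
# number of tuning combinations (3 sizes x 3 decays = 9), plus one final seed
set.seed(123)
seeds <- c(lapply(1:15, function(i) sample.int(10000, 9)), list(sample.int(10000, 1)))

nnTrControl <- trainControl(method = "repeatedcv", number = 3, repeats = 5,
                            seeds = seeds, classProbs = TRUE,
                            summaryFunction = multiClassSummary, allowParallel = TRUE)
# then call train() as before with trControl = nnTrControl, and stopCluster(cl) when done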
One thing I would do first is run an MCA (multiple correspondence analysis), which can be found in the FactoMineR package. This will show you how much variance there is in each of your variables; you could drop variables with too little variance and thereby speed up your learning task.
