I am trying to run a binary logistic regression in R on a very large dataset and I keep running into memory problems. I have tried many different packages to work around the issue, including caret and biglm, but I am still stuck: they give me the same memory error. Why is it that when I start with a dataset of 300,000 rows and 300 columns and subset it to 50,000 rows and 120 columns, it still requires the same amount of memory? It makes no sense. I have no way of providing the data since it is sensitive information, but most of the variables are factors. Below are some examples I have tried:
model = bigglm(f, data = reg, na.action = na.pass, family = binomial(link=logit), chunksize = 5000)
But I get:
Error: cannot allocate vector of size 128.7 Gb
MyControl <- trainControl(method = "repeatedcv", index = MyFolds, summaryFunction = twoClassSummary, classProbs = TRUE)
fit = train(f, data = reg, family = binomial, trControl = MyControl)
The error message "Error: cannot allocate vector of size 128.7 Gb" doesn't meant that R cannot allocate a total memory of 128.7 Gb.
Quoting Patrick Burns:
"It is because R has already allocated a lot of memory successfully.
The error message is about how much memory R was going after at the
point where it failed".
So it is your interpretation of the error that is wrong. Even though the full dataset and the subset differ a lot in size, both are probably just too big for your computer, and the amount of memory displayed in the error message is not a measure of the total size of your problem.
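Since most of your variables are factors, the model matrix that glm-style fitting functions build expands every factor into one dummy column per level, so even the 50,000 x 120 subset can blow up to something enormous. A rough back-of-the-envelope sketch (on made-up toy data, not your data) for estimating the size of one dense model matrix before fitting:

set.seed(1)
reg_sub <- data.frame(
  y  = rbinom(50000, 1, 0.5),
  f1 = factor(sample(letters, 50000, replace = TRUE)),   # 26 levels
  f2 = factor(sample(1:500,  50000, replace = TRUE))     # ~500 levels
)

# columns of the dense model matrix: intercept + (levels - 1) per factor
p <- 1 + sum(vapply(reg_sub[-1],
                    function(x) if (is.factor(x)) nlevels(x) - 1L else 1L,
                    integer(1)))
n <- nrow(reg_sub)
n * p * 8 / 1024^3   # approximate size in GB of one double-precision copy

If the formula includes interactions between many-level factors, p grows multiplicatively, which is the usual source of these multi-gigabyte allocation requests.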
I am fitting the GAM below in mgcv:
m3.2 <- bam(pt10 ~
              s(year, by = org.type) +
              s(year, by = region) +
              s(org.name, bs = 're') +          # random intercepts
              s(org.name, year, bs = 're'),     # random slopes
            data = dat,
            method = "fREML",
            family = betar(link = "logit"),
            select = TRUE,
            discrete = TRUE)
My dat has ~58,000 observations, and the factor org.name has ~2,500 levels, meaning there are a lot of random intercepts and slopes to fit. Hence, in an attempt to reduce computation time, I have used bam() with the discrete = TRUE option. However, the model has now been running for ~36 hours and has neither finished fitting nor failed with an error message. I am unsure how long such a model can reasonably take to fit, and therefore when to decide whether to kill the command; I don't want to stop the model if this is normal behaviour/computation time for such a model, but I also don't want to waste my time if bam() is stuck going around in circles and will never fit.
Question: What would a reasonable computation time for such a model be? Is there a way to tell whether bam() is still making progress, or whether the command should just be killed to avoid wasting time?
My computer has 16GB RAM and an Intel(R) Core(TM) i7-8565U CPU @ 1.80GHz. In the Windows Task Manager I can see that RStudio is using 20-30% CPU and 20-50% memory, and that these usage values keep changing rather than staying static.
To see what bam() is doing, you should have set trace = TRUE in the list passed to the control argument:
ctrl <- gam.control(trace = TRUE)
m3.2 <- bam(pt10 ~
              org.type + region +               # parametric terms for group means
              s(year, by = org.type) +
              s(year, by = region) +
              s(org.name, bs = 're') +
              s(org.name, year, bs = 're'),
            data = dat,
            method = "fREML",
            family = betar(link = "logit"),
            select = TRUE,
            discrete = TRUE,
            control = ctrl)
That way you'd get progress printed to the console while bam() is doing its thing. I would check on a much smaller set of data that this actually works within RStudio on Windows (run the example in ?bam, say); I've never used that setup, and you wouldn't want the trace output to arrive only once the function has finished, in a single torrent.
Your problem is the random effects, not really the size of the data; you are estimating 2,500 + 2,500 coefficients for your two random effects and another 10 * nlevels(org.type) + 10 * nlevels(region) for the two factor by smooths. That is a lot of coefficients for 58,000 observations.
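As a quick back-of-the-envelope count (assuming the default k = 10 basis functions per by-smooth; adjust if you set k yourself):

2 * nlevels(dat$org.name) +       # ~2,500 random intercepts + ~2,500 random slopes
  10 * nlevels(dat$org.type) +    # one year smooth per org.type level
  10 * nlevels(dat$region)        # one year smooth per region level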
As you didn't set nthreads, the model is being fitted on a single CPU core. I wouldn't necessarily change that, though, as using 2+ threads might just make the memory situation worse. 16GB on Windows isn't very much RAM these days: RStudio is using roughly half of the available RAM, with your other software and Windows accounting for the remaining 36% that Task Manager reports as in use.
You should also check if the OS is having to swap memory out to disk; if that's happening then give up as retrieving data from disk for each iteration of model fitting is going to be excruciating, even with a reasonably fast SSD.
The random effects can be done more efficiently in dedicated mixed-model software, but then you have the problem of writing the GAM bits (the two factor by smooths) in that form; the random effects would be written in the notation required by {brms} or {lme4} as (1 + year | org.name) in the relevant part of the formula (or in the random argument of gamm4()).
{brms} and {gamm4} can both handle this bit, but for the former you need to know how to drive {brms} and Stan (which does HMC sampling of the posterior), while {lme4}, which is what {gamm4} uses to do the fitting, doesn't have a beta response family. The {bamlss} package has options for this too, but it's quite a complex package, so be sure you understand how to specify the model estimation method.
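To give a flavour, here is a rough, untested sketch of what the {brms} version might look like, keeping the by-smooths and writing the random effects in lme4-style notation (note that HMC sampling on 58,000 rows will itself be slow):

library(brms)
m_brms <- brm(pt10 ~ org.type + region +
                s(year, by = org.type) +
                s(year, by = region) +
                (1 + year | org.name),          # replaces the two 're' smooths
              data = dat,
              family = Beta(link = "logit"),
              chains = 4, cores = 4)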
So perhaps revisit your model structure; why do you want a smooth of year for the regions and organisation types, but a linear trend for individual organisations?
This case is different from a previous question about this error, which is why I am asking. I have an already cleaned dataset containing 120,000 observations of 25 variables, and I am supposed to analyze it with logistic regression and random forest. However, I get the error "cannot allocate vector of size 98 Gb", whereas my friend doesn't.
The summary says most of it. I even tried reducing the number of observations to 50,000 and the number of variables in the dataset to 15 (using 5 of them in the regression), and it still failed. However, when I sent the script with the shortened dataset to a friend, she could run it. This is odd because I have a 64-bit system and 8 GB of RAM, while she has only 4 GB. So it appears that the problem lies with me.
pd_data <- read.csv2("pd_data_v2.csv")
split <- rsample::initial_split(pd_data, prop = 0.7)
train <- rsample::training(split)
test <- rsample::testing(split)
log_model <- glm(default ~ profit_margin + EBITDA_margin + payment_reminders, data = pd_data, family = "binomial")
log_model
The result should be a logistic model where I can see the coefficients, measure its accuracy, and make adjustments.
It has to do with the parallelism implementation of XGBoost.
I am trying to optimize XGBoost execution by giving it the parameter nthread = 16, where my system has 24 cores. But when I train my model, CPU utilization never seems to exceed roughly 20% at any point during training.
The code snippet is as follows:
param_30 <- list("objective" = "reg:linear", # linear
"subsample"= subsample_30,
"colsample_bytree" = colsample_bytree_30,
"max_depth" = max_depth_30, # maximum depth of tree
"min_child_weight" = min_child_weight_30,
"max_delta_step" = max_delta_step_30,
"eta" = eta_30, # step size shrinkage
"gamma" = gamma_30, # minimum loss reduction
"nthread" = nthreads_30, # number of threads to be used
"scale_pos_weight" = 1.0
)
model <- xgboost(data = training.matrix[,-5],
label = training.matrix[,5],
verbose = 1, nrounds=nrounds_30, params = param_30,
maximize = FALSE, early_stopping_rounds = searchGrid$early_stopping_rounds_30[x])
Please explain (if possible) how I can increase CPU utilization and speed up model training. Example code in R would be helpful for further understanding.
Assumption: this is about the execution of the XGBoost R package.
This is a guess... but I have had this happen to me ...
You are spending too much time communicating between threads and never becoming CPU-bound: https://en.wikipedia.org/wiki/CPU-bound
The bottom line is that your data isn't large enough (rows and columns), and/or your trees aren't deep enough (max_depth), to warrant that many cores; there is too much overhead. xgboost parallelizes split evaluations, so deep trees on big data can keep the CPU humming at the maximum.
I have trained many models where single-threaded outperforms 8/16 cores: too much time spent switching threads and not enough work.
**MORE DATA, DEEPER TREES, OR FEWER CORES :)**
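One way to check this empirically (a minimal sketch on simulated data, not your dataset) is to time the same fit with different nthread values and see where the extra cores stop paying off:

library(xgboost)

set.seed(1)
X <- matrix(rnorm(50000 * 20), ncol = 20)    # toy data; swap in your own matrix
y <- rnorm(50000)
dtrain <- xgb.DMatrix(X, label = y)

for (nt in c(1, 2, 4, 8, 16)) {
  t <- system.time(
    xgb.train(params = list(objective = "reg:squarederror",   # "reg:linear" in older versions
                            max_depth = 6, nthread = nt),
              data = dtrain, nrounds = 50, verbose = 0)
  )
  cat("nthread =", nt, " elapsed =", round(t[["elapsed"]], 1), "s\n")
}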
I tried to answer this question but my post was deleted by a moderator. Please see https://stackoverflow.com/a/67188355/5452057, which I believe could help you too; it relates to missing MPI support in the xgboost R package for Windows available from CRAN.
I'm trying to speed up my random forest approach with parallel computing. My dataset contains 20,000 rows and 10 columns. The dependent variable to be predicted is numeric, and there are two factors among the independent variables (one has 2 levels and the other has 504 levels).
I think the train function encodes all the factor variables as dummy variables, so no manual recoding is needed in this case.
Could you please give me some advice on how to speed up the following code? I would appreciate any suggestions. The code below never finishes. Thanks a lot in advance.
library(doParallel); library(caret)
set.seed(975)
forTraining <- createDataPartition(DATA$NumVar,
                                   p = 3/4)[[1]]
trainingSet <- DATA[forTraining,]
testSet <- DATA[-forTraining,]
controlObject <- trainControl(method = "repeatedcv",
                              repeats = 5,
                              number = 10)
#run model in parallel
cl <- makeCluster(detectCores())
registerDoParallel(cl)
set.seed(669)
rfModel <- train(NumVar ~ .,
                 data = trainingSet,
                 method = "rf",
                 tuneLength = 10,
                 ntree = 1000,
                 importance = TRUE,
                 trControl = controlObject)
stopCluster(cl)
My response is too verbose for a comment so hopefully I can help guide you. Here is a summary of the points in the comments above.
- Our primary concern: computation time.
- One major limitation on random forest computation: the number of trees.
The general idea is that as you increase the number of trees, your random forest model will improve (i.e. lower error). However, the gain in performance diminishes as you continue to add trees, whereas computation time keeps increasing, so you reach a point of diminishing returns. How, then, do we determine how many trees to use?
Well, naively we could simply fit the randomForest model with the call you provide. Another option is to cross-validate over ntree, but that isn't implemented by default in caret, and Max Kuhn (the author) really knows his stuff when it comes to predictive models. So, to get started, you are on the right track with the call you provided above:
randomForest(dependentVariable ~ ., data = dataSet, mtry = 522, ntree=3000, importance=TRUE, do.trace=100)
But let's make this reproducible and use the mlbench Sonar dataset.
library(mlbench)
library(randomForest)
data(Sonar)
We don't care about variable importance at the moment, so let's drop that. Also, your ntree is way too high to start with; I would be surprised if you need it that high in the end. Starting at a lower level, we have the following:
set.seed(825)
rf1 <- randomForest(Class~., data=Sonar, mtry=3, ntree=200, do.trace=25)
ntree      OOB      1      2
   25:  16.83% 11.71% 22.68%
   50:  18.27% 12.61% 24.74%
   75:  17.31% 17.12% 17.53%
  100:  15.38% 12.61% 18.56%
  125:  15.38% 10.81% 20.62%
  150:  16.35% 13.51% 19.59%
  175:  15.87% 10.81% 21.65%
  200:  14.42%  8.11% 21.65%
As you can see, the OOB error bottoms out at approximately 100 trees. However, if I were uncertain, or if the OOB error were still dropping significantly, I could run the call again with a larger number of trees. Once you have a working number of trees, you can tune mtry with caret::train.
If you do end up needing lots of trees (i.e. thousands), then your code is likely going to be slow. If you have access to machines with many processors and large amounts of RAM, your parallel implementation can help, but on more common machines it will be slow going.
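For completeness, here is a small sketch of what that mtry tuning might look like with caret::train and a parallel backend, still on Sonar and with an arbitrary grid of my own:

library(caret); library(doParallel)
library(mlbench); library(randomForest)
data(Sonar)

cl <- makeCluster(4)      # adjust to your machine
registerDoParallel(cl)

set.seed(669)
rfTuned <- train(Class ~ ., data = Sonar,
                 method = "rf",
                 ntree = 100,                          # from the OOB trace above
                 tuneGrid = expand.grid(mtry = c(2, 3, 5, 8)),
                 trControl = trainControl(method = "repeatedcv",
                                          number = 10, repeats = 5))
stopCluster(cl)
rfTuned$bestTune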
I am training a random forest on my training data, which has 114,954 rows and 135 columns (predictors), and I am getting the following error:
model <- randomForest(u_b_stars~. ,data=traindata,importance=TRUE,do.trace=100, keep.forest=TRUE, mtry=30)
Error: cannot allocate vector of size 877.0 Mb
In addition: Warning messages:
1: In randomForest.default(m, y, ...) :
The response has five or fewer unique values. Are you sure you want to do regression?
2: In matrix(double(nrnodes * nt), ncol = nt) :
Reached total allocation of 3958Mb: see help(memory.size)
3: In matrix(double(nrnodes * nt), ncol = nt) :
Reached total allocation of 3958Mb: see help(memory.size)
4: In matrix(double(nrnodes * nt), ncol = nt) :
Reached total allocation of 3958Mb: see help(memory.size)
5: In matrix(double(nrnodes * nt), ncol = nt) :
Reached total allocation of 3958Mb: see help(memory.size)
I want to know what I can do to avoid this error. Should I train it on less data? That wouldn't be good, of course. Can somebody suggest an alternative in which I don't have to take less of the training data? I want to use the complete training data.
As was stated in an answer to a previous question (which I can't find now), increasing the sample size affects the memory requirements of RF in a nonlinear way. Not only is the model matrix larger, but the default size of each tree, based on the number of points per leaf, is also larger.
To fit the model given your memory constraints, you can do the following:
- Increase the nodesize parameter to something bigger than the default, which is 5 for a regression RF. With 114k observations, you should be able to increase this significantly without hurting performance.
- Reduce the number of trees per RF with the ntree parameter: fit several small RFs, then combine them with combine() to produce the entire forest (see the sketch below).
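Here is a minimal sketch of that second point on toy data (not your data): grow several smaller forests with a larger nodesize and merge them with randomForest::combine().

library(randomForest)
set.seed(1)
toy <- data.frame(y = rnorm(1000), matrix(rnorm(1000 * 10), ncol = 10))  # stand-in data

# four forests of 250 trees each, grown with a larger nodesize
fits <- lapply(1:4, function(i)
  randomForest(y ~ ., data = toy, ntree = 250, nodesize = 20))

big_rf <- do.call(randomForest::combine, fits)
big_rf$ntree   # 1000 trees in the merged forest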
One alternative you could try, if you can't use a machine with more memory, is to train separate models on subsets of the data (say 10 separate subsets) and then combine the output of each model in a sensible way (the easiest way is to average the predictions of the 10 models, but there are other ways to ensemble models: http://en.wikipedia.org/wiki/Ensemble_learning).
Technically you would be using all your data without hitting the memory restriction, but depending on the size of the resulting subsets, the resulting models might be too weak to be of any use.
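A rough sketch of that idea, using the objects from your question and a hypothetical newdata to predict on:

library(randomForest)

# 'traindata' and 'u_b_stars' are from the question; 'newdata' stands in for
# whatever data you want predictions on.
chunks <- split(seq_len(nrow(traindata)),
                rep(1:10, length.out = nrow(traindata)))
models <- lapply(chunks, function(idx)
  randomForest(u_b_stars ~ ., data = traindata[idx, ], ntree = 100))

# average the 10 sets of predictions
pred_avg <- rowMeans(sapply(models, predict, newdata = newdata))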