Unexpected long execution time for zeroinfl() - r

I am fitting a zero-inflated regression model to a data frame with 498,501 rows on a Linux machine with 32 GB of RAM and a 2.6 GHz CPU. I used the following command:
library(pscl)
zeroinfl(response ~ predictor1 + predictor2, data=dataframe, dist = "negbin", EM = TRUE)
R has now been computing for more than 48 hours with no warning or error message.
I am a bit puzzled, since a traditional lm() on the same data returns almost instantaneously. Should I suspect a problem with my command, or is it just a very slow function?

Related

Computation time GAM

I am fitting the below GAM in mgcv
m3.2 <- bam(pt10 ~
s(year, by = org.type) +
s(year, by = region) +
s(org.name, bs = 're') +
s(org.name, year, bs = 're'),
data = dat,
method = "fREML",
family = betar(link="logit"),
select = T,
discrete = T)
My dat has ~58,000 observations, and the factor org.name has ~2,500 levels - meaning there are a lot of random intercepts and slopes to be fit. Hence, in an attempt to reduce computation time, I have used bam() and the discrete = T option. However, my model has now been running for ~36 hours and still has not fit nor failed and provided me with an error message. I am unsure how long might be reasonable for such a model to take to fit, and therefore how/when to decide to kill the command or not; I don't want to stop the model running if this is normal behaviour/computational time for such a model, but also don't want to waste my time if bam() is stuck going around in circles and will never fit.
Question: How long might be reasonable for such a model to take to fit / what would reasonable computation time for such a model be? Is there a way I can determine if bam() is still making progress or if the command should just be killed to avoid wasting time?
My computer has 16 GB of RAM and an Intel(R) Core(TM) i7-8565U processor (CPU @ 1.80GHz). In Windows Task Manager I can see that RStudio is using 20-30% CPU and 20-50% memory, and that these usage values are changing rather than static.
To see what bam() is doing, you should have set trace = TRUE in the list passed to the control argument:
ctrl <- gam.control(trace = TRUE)
m3.2 <- bam(pt10 ~
org.type + region + # include parametric terms for group means
s(year, by = org.type) +
s(year, by = region) +
s(org.name, bs = 're') +
s(org.name, year, bs = 're'),
data = dat,
method = "fREML",
family = betar(link="logit"),
select = TRUE,
discrete = TRUE,
control = ctrl)
That way you'd get printed statements in the console while bam() is doing its thing. I would check on a much smaller set of data that this actually works (run the example in ?bam, say) within RStudio on Windows. (I've never used it, and you wouldn't want the trace output to arrive only once the function had finished, in a single torrent.)
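A minimal sketch of that small-scale check, assuming {mgcv} is installed. The simulated data from gamSim() and the two-smooth model are stand-ins chosen for illustration; they are not the model from the question.

```r
library(mgcv)

set.seed(1)
# Small simulated data set (a stand-in for the real `dat`)
dat_small <- gamSim(1, n = 500, dist = "normal", scale = 2, verbose = FALSE)

# Same fitting options as in the question, plus trace = TRUE
ctrl <- gam.control(trace = TRUE)
fit_small <- bam(y ~ s(x0) + s(x1),
                 data = dat_small,
                 method = "fREML",
                 discrete = TRUE,
                 control = ctrl)
```

If progress messages show up while this is fitting rather than only at the end, the same control object should be informative on the full model.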
Your problem is the random effects, not really the size of the data; you are estimating 2500 + 2500 coefficients for your two random effects and another 10 * nlevels(org.type) + 10 * nlevels(region) for the two factor by smooths. This is a lot for 58,000 observations.
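The arithmetic above can be sketched as follows. The numbers of levels for org.type and region are hypothetical placeholders, since the question does not state them; the 2,500 levels, the 58,000 observations, and the figure of 10 coefficients per smooth come from the text.

```r
n_org      <- 2500   # levels of org.name (from the question)
n_obs      <- 58000  # observations (from the question)
k          <- 10     # coefficients per smooth, as used in the answer
n_org_type <- 4      # hypothetical: not given in the question
n_region   <- 6      # hypothetical: not given in the question

n_random <- n_org + n_org                      # random intercepts + random slopes
n_smooth <- k * n_org_type + k * n_region      # the two factor-by smooths
n_total  <- n_random + n_smooth

n_total          # rough number of coefficients to estimate
n_obs / n_total  # observations available per coefficient
```

With real data, nlevels(dat$org.type) and nlevels(dat$region) would replace the two placeholders.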
As you didn't set nthreads, it's fitting the model on a single CPU core. I wouldn't necessarily change that, though, as using 2+ threads might just make the memory situation worse. 16 GB on Windows isn't very much RAM these days: RStudio is using about half the available RAM, with your other software and Windows accounting for the remaining 36% that Task Manager reports as in use.
You should also check if the OS is having to swap memory out to disk; if that's happening then give up as retrieving data from disk for each iteration of model fitting is going to be excruciating, even with a reasonably fast SSD.
The random effects can be done more efficiently in dedicated mixed model software but then you have the problem of writing the GAM bits (the two factor by smooths) in that form - you would write the random effects in the required notation for {brms} or {lme4} respectively as (1 + year | org.name) in the relevant part of the formula (or the random argument in gamm4()).
{brms} and {gamm4} can do this bit, but for the former you need to know how to drive {brms} and Stan (which is doing HMC sampling of the posterior), while {lme4}, which is what {gamm4} uses to do the fitting, doesn't have a beta response family. The {bamlss} package has options for this too, but it's quite a complex package so be sure to understand how to specify the model estimation method.
So perhaps revisit your model structure; why do you want a smooth of year for the regions and organisation types, but a linear trend for individual organisations?

CPU requirement to run logistic regression with pairwise interactions

I am trying to fit a logistic regression model with pairwise interactions for 7 variables, but I have let the code run for as long as 12 hours without getting any results. My dataset is not terribly large, about 3,000 rows. One of my variables has 82 degrees of freedom, and I am wondering whether that is the problem. I have no problem running the code with main effects only, but I expect interactions between my variables, so I would like pairwise interactions included. I am using the glmulti package to fit the model, and I included the method = "g" and conseq = 5 arguments in an attempt to make the code run faster, but I still can't get results even after 12 hours. Is there anything else I can do to speed it up? Should I use different code or a different package, or is a basic laptop just not enough to run it?
This is the code I used:
detectmodel <- glmulti::glmulti(outcome ~ bird + year + season + sex + numobs + obsname + season + month,
data = detect, level = 2, fitfunction = glm, crit = "aicc", family = binomial,
confsetsize = 10, method = "g")
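For a sense of scale, here is a rough count of the candidate models implied by 7 main effects plus all pairwise interactions. This is an upper bound that ignores any marginality rules glmulti applies, so the number it actually considers may be smaller. The 12-level month used in the last line is an assumption for illustration only.

```r
n_main  <- 7
n_pairs <- choose(n_main, 2)   # pairwise interactions among 7 variables
n_terms <- n_main + n_pairs    # terms that can each be in or out of a model

n_models_upper <- 2^n_terms    # upper bound on distinct candidate models
n_models_upper

# A single interaction involving the 82-df variable is already large:
# with a hypothetical 12-level month (11 df) it adds 82 * 11 columns.
cols_one_interaction <- 82 * 11
cols_one_interaction
```

With about 3,000 rows, an interaction that adds several hundred columns is likely to make each individual glm() fit slow or unstable, before the size of the search is even considered.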

Error during wrapup: long vectors not supported yet: in glm() function

I found several questions on Stack Overflow regarding this topic (some of them without any answer), but nothing so far related to this error in regression.
I'm running a probit model in R with (I'm guessing) too many fixed effects (years and places):
myprobit <- glm(factor(Y) ~ factor(T) + factor(X1) + factor(X2) + factor(X3) +
factor(YEAR) + factor(PLACE),
family = binomial(link = "probit"),
data = DT)
The PLACE variable has about 1000 unique values and YEAR 8 values. The dataset DT has 13,099,225 obs and 79 columns.
The error I got is:
Error: cannot allocate vector of size 59.3 Gb
Error during wrapup: long vectors not supported yet: ../include/Rinlinedfuns.h:519
The machine I'm using has 128 GB of RAM.
So I don't know what I can do without changing the function. Does anyone know how to deal with this issue? Thanks!
To close this question, I have to mention that @Axeman's answer is the only feasible approach for my problem. The whole issue is that there is not enough memory to handle such a huge design matrix.
Therefore, running a probit regression with the bigglm() function from the biglm package is the only solution I have found so far.
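The memory problem can be checked with back-of-the-envelope arithmetic. This sketch assumes a dense double-precision design matrix with roughly one column per PLACE and YEAR level; the other factors are ignored, so the true column count in the question is unknown. dense_matrix_gib() is a helper made up for this illustration.

```r
# Hypothetical helper: size of a dense numeric matrix in GiB
dense_matrix_gib <- function(n_rows, n_cols, bytes_per_cell = 8) {
  n_rows * n_cols * bytes_per_cell / 2^30
}

n_rows <- 13099225   # observations (from the question)
n_cols <- 1000 + 8   # assumption: about one column per PLACE and YEAR level

cells <- n_rows * n_cols
cells > .Machine$integer.max      # more elements than a non-long vector can hold
dense_matrix_gib(n_rows, n_cols)  # approximate size of one copy of the matrix
```

A matrix with more than 2^31 - 1 elements needs R's long-vector support, which is a plausible source of the "long vectors not supported yet" message, and glm() typically needs more than one copy of the matrix while fitting.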
Nevertheless, I realized that, because the biglm package works iteratively on chunks of the data, using factor() variables on the right-hand side is problematic whenever a factor level is not represented in a chunk. In other words, if a factor variable has 5 levels but only 4 of them appear in a given chunk, the estimation fails with an error.
There are several questions and comments about this on Stack Overflow.
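The chunk problem can be reproduced in base R without biglm. The sketch below uses a made-up five-level variable g and shows one commonly suggested workaround, declaring the full set of levels up front so that every chunk yields the same design-matrix columns. Whether this is sufficient for a particular bigglm() call is an assumption to verify.

```r
# Toy data: a variable with 5 levels, and a chunk that only contains 2 of them
full  <- data.frame(g = c("a", "b", "c", "d", "e", "a", "b"),
                    stringsAsFactors = FALSE)
chunk <- full[c(1, 2, 6, 7), , drop = FALSE]

# factor() inside the formula only sees the levels present in the chunk
cols_chunk <- ncol(model.matrix(~ factor(g), chunk))

# Fixing the levels in advance keeps the columns consistent across chunks
all_levels <- sort(unique(full$g))
cols_fixed <- ncol(model.matrix(~ factor(g, levels = all_levels), chunk))

c(chunk_only = cols_chunk, fixed_levels = cols_fixed)
```

In practice this means converting the variable to a factor with explicit levels in the data before chunking, rather than calling factor() in the formula.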

Is there a way to handle "cannot allocate vector of size" issue without dropping data?

This case differs from a previous question on the topic, which is why I am asking. I have an already cleaned dataset containing 120,000 observations of 25 variables, and I am supposed to analyze all of it with logistic regression and random forest. However, I get the error "cannot allocate vector of size 98 GB", whereas my friend doesn't.
The summary says most of it. I even tried reducing the number of observations to 50,000 and the number of variables in the dataset to 15 (5 of them used in the regression), and it still failed. However, when I sent the script with the shortened dataset to a friend, she could run it. This is odd because I have a 64-bit system and 8 GB of RAM, while she has only 4 GB. So it appears that the problem lies on my side.
pd_data <- read.csv2("pd_data_v2.csv")
split <- rsample::initial_split(pd_data, prop = 0.7)
train <- rsample::training(split)
test <- rsample::testing(split)
log_model <- glm(default ~ profit_margin + EBITDA_margin + payment_reminders, data = pd_data, family = "binomial")
log_model
The result should be a logistic model where I can see the coefficients, measure its accuracy, and make adjustments.
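One possible cause worth ruling out (an assumption; nothing in the question confirms it) is that read.csv2() parsed the numeric predictors as text on one machine but not the other, for example because of a decimal-separator mismatch. glm() turns a text column into a factor with one level per distinct value, which inflates the design matrix. A toy illustration with made-up column names:

```r
set.seed(1)
# Toy data: the same values stored as numbers and as comma-decimal text
toy <- data.frame(margin_num = round(runif(200), 3))
toy$margin_chr <- sub(".", ",", format(toy$margin_num, nsmall = 3), fixed = TRUE)

sapply(toy, class)   # check how each column was stored

# Design-matrix columns produced by each version of the predictor
cols_num <- ncol(model.matrix(~ margin_num, toy))
cols_chr <- ncol(model.matrix(~ margin_chr, toy))
c(numeric = cols_num, text = cols_chr)
```

Running sapply(pd_data, class) on both machines would show whether the predictors are stored the same way.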

Elastic net with Cox regression

I am trying to perform elastic net with cox regression on 120 samples with ~100k features.
I tried R with the glmnet package, but R does not seem to support big matrices (it seems R is not designed for 64-bit). Furthermore, glmnet does support sparse matrices, but for whatever reason sparse matrices combined with Cox regression are not implemented.
I am not pushing for R, but it is the only tool I have found so far. Does anyone know what program I can use to fit elastic net with Cox regression on big models? I did read that I could use a Support Vector Machine, but I need to fit the model first, and I cannot do that in R due to the above restriction.
Edit:
A bit of clarification. I am not reporting an error in R as apparently it is normal for R to be limited by how many elements its matrix can hold (as for glmnet not supporting sparse matrix + cox I have no idea). I am not pushing for a tool but it would be easier if there is another package or a stand alone program that can perform what I am looking for.
If someone has an idea or has done this before please share your method (R, Matlab, something else).
Edit 2:
Here is what I used to test:
I made a matrix of 100x100000. Added labels and tried to create the model using model.matrix.
data <- matrix(rnorm(100*100000), 100, 100000)
formula <- as.formula(class ~ .)
x = c(rep('A', 40), rep('B', 30), rep('C', 30))
y = sample(x=1:100, size=100)
class = x[y]
data <- cbind(data, class)
X <- model.matrix(formula, data)
The error I got:
Error: cannot allocate vector of size 37.3 Gb
In addition: Warning messages:
1: In terms.formula(object, data = data) :
Reached total allocation of 12211Mb: see help(memory.size)
2: In terms.formula(object, data = data) :
Reached total allocation of 12211Mb: see help(memory.size)
3: In terms.formula(object, data = data) :
Reached total allocation of 12211Mb: see help(memory.size)
4: In terms.formula(object, data = data) :
Reached total allocation of 12211Mb: see help(memory.size)
Thank you in advance! :)
Edit 3:
Thanks to @marbel I was able to construct a test model that works and does not become too big. It seems my problem came from using cbind in my test.
A few pointers:
That's a rather small dataset; R should be more than enough. All you need is a modern computer, meaning a decent amount of RAM. I guess 4 GB should be enough for such a small dataset.
The glmnet package is also available for Julia and Python, but I'm not sure whether that model is available there.
Here and here you have examples of the Cox model with the glmnet package. There is also a package called survival.
There are at least two problems with your code:
This is not something you would want to do in R: data <- cbind(data, class). It's just not memory efficient. If you need to do this type of operation, use the data.table package. It allows assignment by reference; check out the := operator.
If all your data is numeric, you don't need to use model.matrix; just use data.matrix(X).
If you have categorical variables, use model.matrix on them only, then add them to the X matrix, perhaps using data.table, one column at a time with ?data.table::set or the := operator.
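A small base-R sketch of the points above. It uses 50 columns rather than 100,000, and the categorical predictor site is made up for illustration (in the question, class is the response, not a predictor). It shows that cbind() with a character vector turns the whole matrix into character, and that expanding only the categorical variable keeps the matrix numeric.

```r
set.seed(42)
n <- 100
p <- 50                              # small p for illustration only
X <- matrix(rnorm(n * p), n, p)      # numeric predictors
site <- sample(rep(c("A", "B", "C"), times = c(40, 30, 30)))  # hypothetical

# cbind() with a character vector coerces every cell to character
bad <- cbind(X, site)
typeof(bad)

# Expand only the categorical variable, then bind the numeric pieces
dummies <- model.matrix(~ site)[, -1, drop = FALSE]  # drop the intercept column
X_full  <- cbind(X, dummies)
typeof(X_full)
dim(X_full)
```

A character matrix passed through a formula is treated as categorical, so every column is expanded into levels, which is a likely reason the test in the question asked for tens of gigabytes.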
Hopefully this can help you debug the code. Good luck!
