How to get stepwise logistic regression to run faster in R

I'm using the standard glm function together with step on 100k rows and 107 variables. A plain glm finishes within a minute or two, but when I wrap it in step(glm(...)) it runs for hours.
I also tried running it on a model matrix, but that has now been running for about half an hour and I'm not sure it will ever finish. With only 9 variables it returned results within a few seconds, but with 9 copies of the same warning: "glm.fit: fitted probabilities numerically 0 or 1 occurred".
The line of code I used is below. Is it wrong, and what should I change to get a better running time?
logit1back <- step(glm(IsChurn ~ var1 + var2 + var3 + var4 + var5 +
                         var6 + var7 + var8 + var9,
                       data = tdata, family = "binomial"))
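For reference, a minimal sketch of the matrix-based fit mentioned above, assuming tdata with IsChurn coded 0/1 and the variable names from the question. A single call to glm.fit avoids the formula machinery, but step() still has to refit a full glm for every candidate model, which is where the hours go:

# Hypothetical sketch: one matrix-based fit via glm.fit
# (assumes a data frame 'tdata' with a 0/1 column IsChurn and predictors var1..var9).
X <- model.matrix(~ var1 + var2 + var3 + var4 + var5 +
                    var6 + var7 + var8 + var9, data = tdata)
y <- tdata$IsChurn
fit <- glm.fit(X, y, family = binomial())
# Note: glm.fit returns a bare fit without a formula or terms component,
# so it is not a drop-in replacement for the glm object that step() expects.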

Related

How can I debug a BLAS/LAPACK error in the mgcv::gam function

In R I'm fitting a model with mgcv::gam() on 12,000 observations and about 130 parameters (mostly factors, plus 3 splines). I get the following error message:
Error in magic(G$y, G$X, msp, G$S, G$off, L = G$L, lsp0 = G$lsp0, G$rank, :
BLAS/LAPACK routine 'DLASCLMOIEDDLASRTDSP' gave error code -4
I have run this function about 100 times with anywhere from 8,000 to 100,000 observations and it runs fine, except in this one instance. There are no NAs.
I've tried to extract the model matrix from the formula, but for gams this seems to require a fitted model object, which I can't get because of the error.
I'm thinking this might be a collinearity issue, but I don't know how to check without a model matrix.
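One way to check this without a fitted gam is to build the design matrix for the parametric (non-spline) terms directly and test it for rank deficiency; a minimal sketch, assuming a data frame dat, with placeholder term names rather than the ones from the question:

# Hypothetical sketch: design matrix for the parametric terms only,
# checked for rank deficiency and exact linear combinations.
X <- model.matrix(~ factor1 + factor2 + x1 + x2, data = dat)   # placeholder terms
qr(X)$rank < ncol(X)                # TRUE means the design matrix is rank deficient
caret::findLinearCombos(X)$remove   # columns involved in exact linear dependencies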

CPU requirement to run logistic regression with pairwise interactions

I am trying to fit a logistic regression model with pairwise interactions for 7 variables; however, I have let the code run for as long as 12 hours and still get no results. My dataset is not terribly large, about 3,000 rows. One of my variables has 82 degrees of freedom, and I am wondering if that is the problem. I have no trouble running the code with main effects only, but I expect interactions between my variables, so I would like pairwise interactions included. I have tried adding arguments to speed up the process, but I still can't get the code to return results even after 12 hours of running. I am using the glmulti package to fit the model, and I included the method = "g" argument and the conseq = 5 argument in an attempt to make it run faster. Is there anything else I can do to speed it up, another approach or a different package, or is a basic laptop just not enough to run it?
This is the code I used:
detectmodel <- glmulti::glmulti(outcome ~ bird + year + season + sex + numobs + obsname + month,
                                data = detect, level = 2, fitfunction = glm,
                                crit = "aicc", family = binomial,
                                confsetsize = 10, method = "g")
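One way to see why the 82-df variable is the likely culprit is to count the columns of the full pairwise-interaction design matrix, since every candidate model glmulti fits at level = 2 is built from subsets of these terms; a minimal sketch, assuming the detect data frame from the question:

# Hypothetical sketch: size of the design matrix with all pairwise interactions.
# Every interaction involving the 82-df factor adds at least 82 further columns,
# so each candidate model becomes very expensive to fit.
X <- model.matrix(~ (bird + year + season + sex + numobs + obsname + month)^2,
                  data = detect)
ncol(X)   # number of coefficients glm would have to estimate for the full level-2 model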

Is there a way to handle "cannot allocate vector of size" issue without dropping data?

This case is different from a previous question about the same error, which is why I'm asking. I have an already cleaned dataset containing 120,000 observations of 25 variables, and I am supposed to analyze all of it with logistic regression and a random forest. However, I get the error "cannot allocate vector of size 98 GB", whereas my friend doesn't.
The summary says most of it. I even tried reducing the number of observations to 50,000 and the number of variables to 15 (using 5 of them in the regression) and it still failed. However, when I sent the script with the shortened dataset to a friend, she could run it. This is odd because I have a 64-bit system with 8 GB of RAM while she has only 4 GB, so it appears the problem lies on my side.
pd_data <- read.csv2("pd_data_v2.csv")
split <- rsample::initial_split(pd_data, prop = 0.7)
train <- rsample::training(split)
test <- rsample::testing(split)
log_model <- glm(default ~ profit_margin + EBITDA_margin + payment_reminders,
                 data = pd_data, family = "binomial")
log_model
The result should be a logistic model whose coefficients I can inspect and whose accuracy I can measure, so that I can then make adjustments.
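A 120,000 x 25 data frame should only take a few tens of megabytes on its own, so a first diagnostic step is to check how large the data and the expanded design matrix actually are; a minimal sketch, assuming pd_data and the formula from the question:

# Hypothetical sketch: check in-memory sizes before fitting.
print(object.size(pd_data), units = "auto")
X <- model.matrix(default ~ profit_margin + EBITDA_margin + payment_reminders,
                  data = pd_data)
dim(X)                                  # an unexpectedly large column count would point
print(object.size(X), units = "auto")   # to a high-cardinality factor being dummy-coded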

Doing imputation in R when mice returns the error "system is computationally singular"

I am trying to impute a medium-sized data frame (~100,000 rows) in which 5 columns out of 30 have NAs (a large proportion, around 60%).
I tried mice with the following code:
library(mice)
data_3 = complete(mice(data_2))
After the first iteration I got the following exception:
iter imp variable
1 1 Existing_EMI Loan_Amount Loan_Period
Error in solve.default(xtx + diag(pen)): system is computationally singular: reciprocal condition number = 1.08007e-16
Is there some other package that is more robust in this kind of situation? How can I deal with this problem?
Your 5 columns might contain a number of unbalanced factors. When these are turned into dummy variables, there is a high probability that one column is a linear combination of another. The default imputation methods of mice involve linear regression; this produces an X matrix that cannot be inverted, which gives exactly your error.
Change the method to something else, such as cart: mice(data_2, method = "cart"). Also set a seed before or during imputation so that your results are reproducible.
My advice is to go through the seven mice vignettes; among other things, they show how to change the imputation method for individual columns instead of for the whole dataset.
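A minimal sketch of both suggestions, assuming data_2 and the column names from the error output above (the seed value is arbitrary):

library(mice)

# Hypothetical sketch: CART-based imputation for the whole dataset, with a fixed seed.
imp <- mice(data_2, method = "cart", seed = 123)
data_3 <- complete(imp)

# Or override the default method only for the problematic columns:
meth <- mice(data_2, maxit = 0)$method   # default method per column, no iterations run
meth[c("Existing_EMI", "Loan_Amount", "Loan_Period")] <- "cart"
imp <- mice(data_2, method = meth, seed = 123)
data_3 <- complete(imp)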

Model runs with glm but not bigglm

I was trying to run a logistic regression on 320,000 rows of data (6 variables). Stepwise model selection on a sample of the data (10,000 rows) gives a rather complex model with 5 interaction terms: Y ~ X1 + X2*X3 + X2*X4 + X2*X5 + X3*X6 + X4*X5. The glm() function could fit this model with the 10,000-row sample, but not with the whole dataset (320,000 rows).
Using bigglm to read data chunk by chunk from a SQL server resulted in an error, and I couldn't make sense of the results from traceback():
fit <- bigglm(Y ~ X1 + X2*X3 + X2*X4 + X2*X5 + X3*X6 + X4*X5,
              data = sqlQuery(myconn, train_dat),
              family = binomial(link = "logit"),
              chunksize = 1000, maxit = 10)
Error in coef.bigqr(object$qr) :
NA/NaN/Inf in foreign function call (arg 3)
> traceback()
11: .Fortran("regcf", as.integer(p), as.integer(p * p/2), bigQR$D,
bigQR$rbar, bigQR$thetab, bigQR$tol, beta = numeric(p), nreq = as.integer(nvar),
ier = integer(1), DUP = FALSE)
10: coef.bigqr(object$qr)
9: coef(object$qr)
8: coef.biglm(iwlm)
7: coef(iwlm)
6: bigglm.function(formula = formula, data = datafun, ...)
5: bigglm(formula = formula, data = datafun, ...)
4: bigglm(formula = formula, data = datafun, ...)
bigglm was able to fit a smaller model with fewer interaction terms, but it was not able to fit this model even with the small dataset (10,000 rows).
Has anyone run into this problem before? Any other approach to run a complex logistic model with big data?
I've run into this problem many times, and it was always caused by the chunks processed by bigglm not containing all the levels of a categorical (factor) variable.
bigglm processes the data in chunks, and the default chunk size is 5000. If your categorical variable has, say, 5 levels (a, b, c, d, e) and your first chunk (rows 1:5000) contains only (a, b, c, d) but no "e", you will get this error.
What you can do is increase the "chunksize" argument and/or reorder your data frame so that each chunk contains all of the levels.
Hope this helps (at least somebody).
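A minimal sketch of that fix for an in-memory data frame (the question reads from SQL, so the data frame train_dat and the variable names here are placeholders):

library(biglm)

# Hypothetical sketch: fix the factor levels once, over the whole data frame,
# so every chunk carries the complete set of levels, and enlarge the chunks.
train_dat$X2 <- factor(train_dat$X2)
fit <- bigglm(Y ~ X1 + X2*X3 + X2*X4 + X2*X5 + X3*X6 + X4*X5,
              data = train_dat,
              family = binomial(link = "logit"),
              chunksize = 20000, maxit = 10)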
OK, so we were able to find the cause of this problem:
For one category in one of the interaction terms there are no observations. The glm function was still able to run and reported NA as the estimated coefficient, but bigglm doesn't tolerate that. bigglm was able to run the model once I dropped this interaction term.
I'll do more research on how to deal with this kind of situation.
I have met this error before, although it came from randomForest rather than biglm. The reason could be that the function cannot handle character variables, so you need to convert characters to factors. Hope this helps.
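A minimal sketch of that conversion, assuming a generic data frame dat:

# Hypothetical sketch: turn every character column into a factor, in place.
dat[] <- lapply(dat, function(col) if (is.character(col)) factor(col) else col)
str(dat)   # check that the former character columns are now factors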
