maximum number of covariates in R glm package is 128? - r

I have a dataset with 146 covariates, and am training a logistic regression.
logit = glm(Y ~ .,
            data = pred.dataset[1:1000, ],
            family = binomial)
The model trains very quickly, but when I then try to view the betas with
logit
every coefficient after the 128th variable is NA.
I first noticed this when exporting the model as PMML: the output stops listing betas after 128 predictors.
I've gone through the documentation and can't find any reference to a maximum number of covariates. I have also trained on 60k rows and still see NAs after the 128th predictor.
Is this a limitation of glm, or of my system? I am running R 3.1.2, 64-bit. How can I increase the number of predictors?

This is a question I actually just asked on Stack Exchange, which is where this question belongs. See this link:
https://stats.stackexchange.com/questions/159316/logistic-regression-in-r-with-many-predictors?noredirect=1#comment303422_159316 and the subsequent links in that thread. To answer your question, though: that is simply too many predictors for logistic regression. OLS can be used in this case, and even though it does not yield the best results for a binary outcome, the results are still valid and usable.
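A minimal sketch of that OLS suggestion (Y and pred.dataset are the objects from the question; the 0/1 recoding is only needed if Y is stored as a factor). Note that columns collinear with earlier ones will still come back as NA, just as in the glm fit.
dat <- pred.dataset[1:1000, ]
dat$Y <- as.numeric(dat$Y == levels(factor(dat$Y))[2])  # recode the response as 0/1
lpm <- lm(Y ~ ., data = dat)                            # linear probability model via OLS
head(coef(lpm))                                         # inspect the first few estimates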

You didn't provide reproducible data, so it's hard to tell exactly what is going on. Is there an issue with how some of the variables are coded? Are variables that look uniform actually not uniform at all? These are the kinds of situations that could be ruled out with a reproducible example.
However, I'm answering because I think you may have a legitimate concern. What can you say about these other variables? What type are they? I have been running some logits that seem to drop factor levels beyond 48.
What worked for me (at least to get the model to run in full) was going into the glm() function and changing
mf$drop.unused.levels <- TRUE
to
mf$drop.unused.levels <- FALSE
then saving the function under a different name and using that to run my analyses. (I was inspired by this answer.)
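A rough sketch of that edit done programmatically rather than by hand (this assumes the line mf$drop.unused.levels <- TRUE appears in your R version's glm source; inspect body(glm) first to confirm):
# copy glm, flip the drop.unused.levels line in its body, and reinstall the edited body
myglm <- glm
src <- deparse(body(myglm))
src <- sub("mf$drop.unused.levels <- TRUE",
           "mf$drop.unused.levels <- FALSE", src, fixed = TRUE)
body(myglm) <- parse(text = paste(src, collapse = "\n"))[[1]]
fit <- myglm(Y ~ ., data = pred.dataset[1:1000, ], family = binomial)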
Be warned, though! It gave me some warning messages:
Warning messages:
1: In predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
prediction from a rank-deficient fit may be misleading
2: In predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
prediction from a rank-deficient fit may be misleading
3: In predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
prediction from a rank-deficient fit may be misleading
I know that there are frequency issues in certain groups in the data; I have to analyze those separately and I will do so. But for the time being, I get predictions for all the levels I wanted.
The first step would be to check your data, though. Part of why this happens with my data is almost certainly due to issues in the data itself, but this approach let me override it. This may or may not be an appropriate solution for you.
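The data checks mentioned above could look something like this (a hedged sketch; pred.dataset is the data frame from the original question):
# constant columns have a single unique value
sapply(pred.dataset, function(x) length(unique(x)))
# sparse or unused factor levels show up as zero or tiny counts
lapply(Filter(is.factor, pred.dataset), table)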

Related

Running the same glm model with caret provides different accuracy and errors

As stated in the title, running the same glm model with caret returns different accuracies and errors: sometimes no error, sometimes "glm.fit: fitted probabilities numerically 0 or 1 occurred", and sometimes "In predict.lm(object, newdata, se.fit, scale = 1, type = ifelse(type == : prediction from a rank-deficient fit may be misleading". If I set the seed and always run the seed and then the model, I predictably get the same accuracy and the same error (or no error) every time.
When running the same model with the glm() function, the coefficients are always the same (as with caret), but I never get any of these errors. Should I interpret this as an issue with resampling, or might the errors reported by caret's glm carry some important meaning even though they depend on the seed?
I've searched for this, and although I assume it has something to do with resampling, I am not quite able to understand how it works and would like help understanding it. I'm also trying to use the caret package for all my modelling, so I would like to know whether I should instead start my process by always running glm() rather than going through caret, since glm() gives me the same error message (or none) straight away, no matter the seed.
The data is from a client, so I'd prefer not to share it. The formula I'm using is (for example) simply train(Y ~ X + Z + A, data = df, method = "glm") for the caret version and glm(Y ~ X + Z + A, data = df, family = binomial()) for the glm() version.
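A hedged sketch of where the seed dependence likely comes from (Y, X, Z, A and df are the placeholders from the question): by default train() resamples the data (bootstrap), so the separation and rank-deficiency warnings come from the resampled fits and therefore vary with the seed; turning resampling off makes caret fit the model once, like a plain glm() call.
library(caret)
set.seed(123)
fit_resampled <- train(Y ~ X + Z + A, data = df, method = "glm")  # default bootstrap resampling; warnings depend on the seed
fit_once <- train(Y ~ X + Z + A, data = df, method = "glm",
                  trControl = trainControl(method = "none"))      # single fit, no resampling
fit_glm <- glm(Y ~ X + Z + A, data = df, family = binomial())     # plain glm for comparison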

R: glm (multiple linear regression) ignores/removes some predictor variables

I have posted this question before, but I believe I had not explained the problem well and had over-complicated it, so I deleted my previous post and am posting this one instead. I am sorry if this caused any inconvenience.
I also apologize in advance for not being able to provide example data: I am using very large tables, and what I am trying to do works fine with simpler examples, so example data would not reproduce the problem. It has always worked for me until now, so I am just trying to get your ideas on what the issue might be. If there is any way I can provide more information, do let me know.
So, I have a vector corresponding to a response variable and a table of predictor variables. The response vector is numeric; the predictor variables (the columns of the table) are binary (0s and 1s).
I am running the glm function (multiple linear regression) on the response vector and the table of predictors:
fit <- glm(response ~ as.matrix(predictors), na.action = na.exclude)
coeff <- as.vector(coef(summary(fit))[, 4])[-1]  # note: column 4 of the summary table holds the p-values; column 1 holds the coefficient estimates
In the past I would extract this vector from the fit and use it for further analysis.
The problem is that the regression now returns a vector that is missing some values: some predictor variables are not assigned a coefficient at all by glm, yet there are no error messages.
The summary of the model looks normal, except that some predictor variables are missing, as mentioned; most of the other predictors have the usual output (coefficient, p-value, etc.).
About 30 predictors out of over 200 are missing from the model.
I have tried different response vectors, but I get the same issue, although which predictors go missing varies with the response vector.
Any ideas on what might be going on? I know this can happen if some variables have zero variance, but I have checked for that. There are also no NA or missing values in the tables.
What could cause glm to ignore/remove some predictor variables?
Any suggestion is welcome!
EDIT: I found out that the predictors that were removed have values identical to another predictor. There should still be a way to keep them; they would simply get the same regression coefficient, for example.
Your edit explains why you are not getting those variables; that was going to be my first question. (This question would be better posed on Cross Validated, because it is not an R error, it is a problem with your model.)
They would not get the same coefficients. Say you have a 1:1 relationship, Y = X + e, and you then fit the model Y ~ X + X. Each copy of X can be assigned ANY coefficient such that the two sum to 1, so there is no unique solution: Y = 0.5X + 0.5X may be the most obvious to us, but Y = 100X - 99X is just as valid.
For the same reason, you also cannot have any predictor that is a linear combination of other predictors.
If you really want those values you can reconstruct them from what you have, but I do not recommend it, because the assumptions would be on very thin ice.
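A small made-up demonstration of the aliasing described above: when a predictor duplicates another, glm reports NA for the second copy rather than fitting it.
set.seed(1)
x  <- rnorm(100)
x2 <- x                          # exact duplicate of x
y  <- 1 + 2 * x + rnorm(100)
fit <- glm(y ~ x + x2)
coef(fit)                        # the coefficient for x2 is NA
alias(lm(y ~ x + x2))            # shows that x2 is a linear combination of x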

GLMNet convergence issue for penalized regression

I am working on network models for political networks. One of the things I am doing is penalized inference, using an adaptive lasso approach in which I set a penalty factor for glmnet. My model has two kinds of parameters, alphas and phis. The alphas are fixed effects, so I want to keep them in the model unpenalized, while the phis are penalized.
I use the MLE coefficients from glm() as starting values to compute the adaptive weights, which are passed through the penalty.factor argument of glmnet().
This is the code:
library(glmnet)
# Generate Generalized Linear Model
GenLinMod = glm(y ~ X, family = "poisson")
# Set coefficients
coefficients = coef(GenLinMod)
# Set penalty
penalty = 1/(coefficients[-1])^2
# Protect alphas
penalty[1:(n-1)] = 0
# Generate Generalized Linear Model with adaptive lasso procedure
GenLinModNet = glmnet(XS, y, family = "poisson", penalty.factor = penalty, standardize = FALSE)
For some networks this code executes just fine, however I have certain networks for which I get these errors:
Error: Matrices must have same number of columns in rbind2(.Call(dense_to_Csparse, x), y)
In addition: Warning messages:
1: from glmnet Fortran code (error code -1); Convergence for 1th lambda value not reached after maxit=100000 iterations; solutions for larger lambdas returned
2: In getcoef(fit, nvars, nx, vnames) :
an empty model has been returned; probably a convergence issue
The odd thing is that they all use the same code, so I am wondering if it is a data problem.
Additional information:
+In one case I have over 500 alphas and 21 phis and these errors appear, in another case that does not work I have 200 alphas and 28 phis. But on the other hand I have a case with over 600 alphas and 28 phis and it converges nicely.
+I have tried settings for lambda.min.ratio and nlambda to no avail.
Additional question: is the first entry of penalty associated with the intercept, or is the intercept handled automatically by glmnet()? I did not find this stated clearly in the glmnet vignette. My thought is that I shouldn't include a term for the intercept, since the penalty factors are said to be internally rescaled to sum to nvars, and I assume the intercept isn't one of my variables.
I'm not 100% sure about this, but I think I have found the root of the problem.
I've tried all kinds of manual lambda sequences, even very large starting lambdas (in the 1000s), and none of it helped. However, when I tried without forcing the alphas to be unpenalized, everything converged nicely. So it probably has something to do with the number of unpenalized variables: maybe keeping all the alphas unpenalized forces glmnet into some divergent state, or maybe there is some sort of collinearity going on. My "solution", which is really just doing something different, is to penalize the alphas with the same weight that is used for one of the phis. This rests on the assumption that some phis are significant and the alphas can be just as significant, rather than being fixed in the model (which effectively treats them as infinitely significant). I'm not completely satisfied, because this is a different approach, but it is worth noting that the problem most likely has to do with the number of unpenalized variables.
Also, to answer my additional question: the glmnet vignette says the penalty factors are internally rescaled to sum to nvars. Since the intercept is not one of the variables, my guess is that it should not be included in the penalty vector. I have tried both including and excluding the extra term, and the results seem to be the same, so perhaps glmnet automatically drops it when it detects that the vector is one element longer than it should be.
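A hedged sketch of the workaround described above (GenLinMod, XS, y and n are the objects from the question's code): instead of forcing the alpha penalties to exactly zero, give the alphas the same adaptive weight as the first phi, so that no coefficient is left completely unpenalized.
library(glmnet)
penalty <- 1 / coef(GenLinMod)[-1]^2   # adaptive-lasso weights, intercept dropped
penalty[1:(n - 1)] <- penalty[n]       # alphas borrow the first phi's weight instead of 0
GenLinModNet <- glmnet(XS, y, family = "poisson",
                       penalty.factor = penalty, standardize = FALSE)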

R multinom (nnet) logistic regression prediction

I am going through "Applied Logistic Regression" by Hosmer, Lemeshow and Sturdivant, specifically chapter 8 on multinomial logistic regression. I've built a model:
library(nnet)
library(aplore3); data(aps)
fit <- multinom(place3 ~ danger, data = aps)
I got coefficients that match those in the book. My problem is when I try to make predictions: it just stacks everything in the level that is the most frequent in the data, and zero in other levels. I used code:
preds <- predict(fit, newdata = data.frame(danger = aps$danger), type = "class")
table(preds)
OutDay Int Res
508 0 0
I searched the web and came across a nice R-bloggers post, and the function described there more or less sorted things out; with the same fit I got
predictions
Int OutDay Res
132 262 114
But I kept searching because I wanted to read more about using predict with multinom, and I ran into examples, like this one on GitHub, where the poster uses predict in the straightforward way. I replicated that example and everything worked fine.
My final question is whether someone can explain when to use predict with a multinom object in the complicated (R-bloggers) way and when to use it in the straightforward way. Is the difference that in my example I am trying to get fitted values for the same data that was used to fit the model, or is there more to it?
Thanks!

glmer predict: warning message about outcome variable not being a factor

I have a mixed-effects model with a binomial outcome fitted with glmer. For plotting purposes I would like to predict population-level values for a small dataset.
Below is an example illustrating my approach:
library(lme4)
data("Orthodont", package = "nlme")
silly <- glmer(Sex ~ distance + age + (1 | Subject), data = Orthodont, family = binomial)
sillypred <- expand.grid(distance = c(20, 25), age = unique(Orthodont$age))
sillypred$fitted <- predict(silly, sillypred, re.form = NA, type = "response")
I get the following warning message:
Warning message:
In model.frame.default(delete.response(Terms), newdata, na.action = na.action, :
variable 'Sex' is not a factor
However, when I check, it looks like it is:
str(Orthodont["Sex"])
The fitted variable is still created and the values make sense, but I'm curious about this message. Is there something I should be concerned about? If not, what is the purpose of this message?
It might seem like a trivial question (after all, everything seems to work), in which case I apologize, but I want to make sure I don't overlook something important.
This appears to be a harmless bug that will always occur when predicting from a model with a factor response (the problem is that we use the xlev argument to model.frame including the levels of the response variable). I've posted this at https://github.com/lme4/lme4/issues/205 . Thanks for the report!
You could either ignore the warning or coerce the response variable to a binary value (which will give identical results).
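A sketch of the second suggestion, using the same Orthodont example as above (the 0/1 recoding assumes "Female" is the second factor level, as in nlme's Orthodont data):
Orthodont$SexNum <- as.numeric(Orthodont$Sex == "Female")   # 1 = Female, 0 = Male
silly2 <- glmer(SexNum ~ distance + age + (1 | Subject),
                data = Orthodont, family = binomial)
sillypred$fitted2 <- predict(silly2, sillypred, re.form = NA, type = "response")  # no warning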
