R: glm (multiple linear regression) ignores/removes some predictor variables

I have posted this question before, but I believe that I had not explained the problem well and that it was over-complicated, so I deleted my previous post and I am posting this one instead. I am sorry if this caused any inconvenience.
I also apologize in advance for not being able to provide example data: I am using very large tables, and what I am trying to do works fine with simpler examples, so example data would not help reproduce the issue. The approach has always worked for me until now, so I am just trying to get your ideas on what the issue might be. If there is any way I could provide more information, do let me know.
So, I have a vector corresponding to a response variable and a table of predictor variables. The response vector is numeric, the predictor variables (columns of the table) are in the binary format (0s and 1s).
I am running the glm function (multiple linear regression) using the response vector and the table of predictors:
fit <- glm(response ~ as.matrix(predictors), na.action = na.exclude)
coeff <- as.vector(coef(summary(fit))[, 1])[-1]  # column 1 holds the estimates; [-1] drops the intercept
In the past, I would extract the vector of regression coefficients and use it for further analysis.
The problem is that now the regression returns a vector of coefficients with some values missing: some predictor variables are not assigned a coefficient at all by glm, yet there are no error messages.
The summary of the model looks normal, except that those predictor variables are absent; most other predictors have the expected output (coefficient, p-value, etc.).
About 30 of the 200+ predictors are missing from the model.
I have tried using different response variables (vectors), but I am getting the same issue, although the missing predictors vary depending on the response vector...
Any ideas on what might be going on? I think this can happen if some variables have 0 variance, but I have checked that. There are also no NA values and no missing values in the tables.
What could cause glm to ignore/remove some predictor variables?
Any suggestion is welcome!
EDIT: I found out that the predictors that were removed have values identical to those of another predictor. There should still be a way to keep them; they would just get the same regression coefficient, for example.

Your edit explains why you are not getting those variables; that was going to be my first question. (This question would be better posed on Cross Validated, because it is not an R error, it is a problem with your model.)
They would not get the same coefficients. Say you have a 1:1 relationship, Y = X + e, and then fit the model Y ~ X + X. Each X can be assigned ANY value such that the coefficients sum to 1, so there is no unique solution: Y = 0.5X + 0.5X may be the most obvious to us, but Y = 100X - 99X is just as valid.
You also cannot have any predictors that are linear sums of other predictors, for the same reason.
If you really want those values you can generate them from what you have, but I do not recommend it, because the assumptions would be on very thin ice.
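For illustration, a minimal sketch (invented data, not the asker's) of glm() silently returning NA for a duplicated predictor, and one way to spot the aliasing before fitting:
set.seed(1)
x1 <- rbinom(50, 1, 0.5)
x2 <- x1                       # exact duplicate of x1
y  <- 2 * x1 + rnorm(50)
fit <- glm(y ~ x1 + x2)
coef(fit)                      # x2 comes back NA: it is aliased with x1
alias(lm(y ~ x1 + x2))         # reports that x2 is a linear combination of x1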

Related

Regression model with missing data in dependant variable

modelo <- lm(P3J_IOP ~ PräOP_IOP + OPTyp + P3J_Med, data = na.omit(df))
summary(modelo)
Calling step() for backward selection then fails with:
Error in step(modelo, direction = "backward") :
  number of rows in use has changed: remove missing values?
I have a lot of missing values in my dependent variable P3J_IOP.
Has anyone any idea how to create the model?
tl;dr unfortunately, this is going to be hard.
It is fairly difficult to make linear regression work smoothly with missing values in the predictors/dependent variables (this is true of most statistical modeling approaches, with the exception of random forests). In case it's not clear, the problem with stepwise approaches with missing data in the predictors is:
incomplete cases (i.e., observations with missing data for any of the current set of predictors) must be dropped in order to fit a linear model;
models with different predictor sets will typically have different sets of incomplete cases, leading to the models being fitted on different subsets of the data;
models fitted to different data sets aren't easily comparable.
You basically have the following choices:
drop any predictors with large numbers of missing values, then drop all cases that have missing values in any of the remaining predictors;
use some form of imputation, e.g. with the mice package, to fill in your missing data (in order to do proper statistical inference, you need to do multiple imputation, which may be hard to combine with stepwise regression); a minimal mice sketch follows the next paragraph.
There are some advanced statistical techniques that will allow you to simultaneously do the imputation and the modeling, such as the brms package (here is some documentation on imputation with brms), but it's a pretty big hammer/jump in statistical sophistication if all you want to do is fit a linear model to your data ...
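For option 2, a minimal sketch of multiple imputation with mice, assuming df holds the variables from the question:
library(mice)
imp  <- mice(df, m = 5, printFlag = FALSE)                    # 5 imputed data sets
fits <- with(imp, lm(P3J_IOP ~ PräOP_IOP + OPTyp + P3J_Med))  # refit on each
summary(pool(fits))                                           # pooled estimates (Rubin's rules)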

R Lmer output - single model summary output for factors with multiple levels

When using lmer, if the predictors are factors with multiple levels, how can you obtain the coefficients, standard error and t-values across the entire factor, as opposed to for the individual factor levels?
A specific example of this is in this paper (see Table 1). The authors report that they use lmer on this model:
lmer(DeterminerRT ~ AgeGroup * Condition * TrialType * Plurality + (1|participant) + (1|target))
Where AgeGroup, Condition, TrialType and Plurality all have 2 levels. However, in their Table of results they report a single B, SE and t value across each factor (rather than for each level).
When I run lmer with predictors with multiple levels, I seem to only be able to obtain model information for each individual level of each factor.
I have looked at a number of questions that have already addressed similar themes. Specifically, this, this (which seems like almost the opposite problem) and this. However, I do not seem to have found the answer so far...
What I am after seems to more closely resemble the output of anova() on the lmer fit, but I need it to include the model summary information (B, SE, t) rather than the analysis-of-variance information.
Any help would be greatly appreciated.
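For reference, a minimal sketch (using lme4's built-in sleepstudy data rather than the paper's) contrasting the per-level rows of summary() with the per-term rows of anova():
library(lme4)
sleepstudy$DayHalf <- factor(sleepstudy$Days > 4, labels = c("early", "late"))  # a 2-level factor
m <- lmer(Reaction ~ DayHalf + (1 | Subject), data = sleepstudy)
summary(m)$coefficients   # one row per level contrast: estimate, SE, t value
anova(m)                  # one row per factor term, analysis-of-variance style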

R coxph() warning: Loglik converged before variable - valid results for other variables?

I fit Cox regressions and I am interested in the effect of a predictor x, which is the last variable in the model (variable 7). I include some variables like sex and age in the model, because I want to adjust for them.
Using the R function coxph() gives me the warning "Loglik converged before variable 3". In fact, I am not interested in variable 3; it is just one of the variables I adjust for. But the word "before" makes me wonder whether the results of all variables following variable 3 (which includes my predictor x) are invalid, or whether only the results of variable 3 are affected.
More information: Actually, I am running multiple Cox regressions, and the described warning occurs only in some of the models for variable 3. I do want to adjust for variable 3 and thus keep it in the model.
There is some discussion about this warning (1), but I have not found an answer to my question so far. Thank you.
(1) For example, here or here
Thanks for doing a search on prior answers. (And thanks for citing one of my answers :-) The warning message only applies to the control3 variable. The key to investigating the validity of the statistical inference about predictorx lies in the part of Therneau's answer that you cited. You are interested in one of the other variables, and fortunately only one of your variables "exploded". That means you can do a likelihood-ratio test (LRT) comparing the models with and without the variable(s) of interest to get a proper statistical result. The warning essentially says there were no events in the subset of cases with a positive value of control3. I'm guessing that control3 is a 0/1 variable, and that if you looked at the table:
with( your_data, table(control3, your_status_variable))
.... that you would find zero events in cases with control3 == 1.
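A hedged sketch of that LRT, with placeholder data and variable names modeled on the question's description (not the asker's actual code):
library(survival)
full    <- coxph(Surv(time, status) ~ age + sex + control3 + predictorx, data = your_data)
reduced <- coxph(Surv(time, status) ~ age + sex + control3, data = your_data)
anova(reduced, full)   # likelihood-ratio test for predictorx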

R package multgee -- initial values

I'm working on estimating a generalized estimating equation in R. I have a multinomial (ordinal) outcome, and so I have been attempting to use the package multgee since, as far as I know, packages like geepack or gee don't allow for the estimation of multinomial outcomes.
However, I'm running into some issues. The documentation seems good, but the function seems to require initial (starting) values for the model; if I try to run it without them, I get a message requesting starting values. Here's the model:
formula <- PAINAD_recode ~ Age + Gender + group_2part + Cornell + stim_intensity_scale
fitmod <- ordLORgee(formula, data = data, bstart = c(1, 0, 1, 0, 1),
                    id = data$subject, repeated = data$trial)
I just threw in some ones and zeroes for the starting values there. However, when I enter starting values (even plausible ones), it claims that:
Starting values and parameters vector differ in length
I thought that with five predictors I would need five starting values, and I can't find more information about this particular argument. Does anyone have any thoughts on this? The outcome here has five levels (ordinal) and the repeated component has 20 levels. Any suggestions would be appreciated.
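If it helps, my reading of ordLORgee (an assumption, not verified against this data) is that the marginal cumulative-logit model also estimates cutpoint intercepts, so with a 5-level outcome bstart would need 4 intercepts plus one value per coefficient column:
fitmod <- ordLORgee(formula, data = data, id = data$subject, repeated = data$trial,
                    bstart = c(rep(0, 4),   # 4 cutpoints for a 5-level outcome (assumption)
                               rep(0, 5)))  # one per predictor, if all 5 enter as numeric columns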

Running predict() after tobit() in package AER

I am doing a tobit analysis on a dataset where the dependent variable (let's call it y) is left-censored at 0. So this is what I do:
library(AER)
fit <- tobit(y ~ a + b + c, data = mydata)
This is fine. Now I want to run the "predict" function to get the fitted values. Ideally I am interested in the predicted values of the unobserved latent variable "y*" and the observed censored variable "y" [See Reference 1].
I checked the documentation for predict.survreg [Reference 2] and I don't think I understood which option gives me the predicted censored variables (or the latent variable).
Most examples I found online advise the following :
predict(fit, type = "response")
Again, it's not clear what kind of predictions these are.
My guess is that the "type" option in the predict function is the key here, with type="response" meant for the censored variable predictions and type="linear" meant for latent variable predictions.
Can someone with experience here shed some light on this for me, please?
Many thanks!
References:
[1] http://en.wikipedia.org/wiki/Tobit_model
[2] http://astrostatistics.psu.edu/datasets/2006tutorial/html/survival/html/predict.survreg.html
Generally, predict type = "response" results have been back-transformed to the original scale of the data from whatever modeling transformations were used in the regression, whereas the "linear" predictions are the linear predictors on the link-transformed scale. In the case of tobit, which has an identity link, they should be the same.
You can check my meta-prediction easily enough. I just checked it with the example on the ?tobit page:
plot(predict(fm.tobit2, type="response"), predict(fm.tobit2,type="linear"))
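For a self-contained version of that check, here is a sketch modeled on the ?tobit help-page example with the Affairs data shipped in AER (fm.tobit is my name for the fit):
library(AER)
data("Affairs")
fm.tobit <- tobit(affairs ~ age + yearsmarried + religiousness + occupation + rating,
                  data = Affairs)
all.equal(predict(fm.tobit, type = "response"),
          predict(fm.tobit, type = "linear"))   # identity link, so the two coincide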
I posted a similar question on stats.stackexchange and I got an answer that could be useful for you:
https://stats.stackexchange.com/questions/149091/censored-regression-in-r
There, one of the authors of the package shows how to calculate the mean (i.e., the prediction) of Y, where Y = max(Y*, 0). Using the package AER, this has to be done somewhat "by hand".
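A sketch of that "by hand" calculation, under the standard left-censored-at-zero tobit formula E[Y] = pnorm(mu/s)*mu + s*dnorm(mu/s) (my rendering of the approach in the linked answer, not code copied from it):
mu <- predict(fit, type = "lp")                 # latent-scale predictions X*beta
s  <- fit$scale                                 # estimated sigma from the tobit/survreg fit
ey <- pnorm(mu / s) * mu + s * dnorm(mu / s)    # E[max(Y*, 0)] for each observation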
