Regression model with missing data in dependent variable - r

modelo <- lm( P3J_IOP~ PräOP_IOP +OPTyp + P3J_Med, data = na.omit(df))
summary(modelo)
Error:
Error in step(modelo, direction = "backward") :
  number of rows in use has changed: remove missing values?
I have a lot of missing values in my dependent variable P3J_IOP.
Has anyone any idea how to create the model?

tl;dr unfortunately, this is going to be hard.
It is fairly difficult to make linear regression work smoothly with missing values in the predictors or the dependent variable (this is true of most statistical modeling approaches, with the exception of random forests). In case it's not clear, the problem with stepwise approaches when there is missing data in the predictors is:
incomplete cases (i.e., observations with missing data for any of the current set of predictors) must be dropped in order to fit a linear model;
models with different predictor sets will typically have different sets of incomplete cases, leading to the models being fitted on different subsets of the data;
models fitted to different data sets aren't easily comparable.
You basically have the following choices:
drop any predictors with large numbers of missing values, then drop all cases that have missing values in any of the remaining predictors;
use some form of imputation, e.g. with the mice package, to fill in your missing data (in order to do proper statistical inference, you need to do multiple imputation, which may be hard to combine with stepwise regression).
There are some advanced statistical techniques that will allow you to simultaneously do the imputation and the modeling, such as the brms package (here is some documentation on imputation with brms), but it's a pretty big hammer/jump in statistical sophistication if all you want to do is fit a linear model to your data ...
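For the imputation route (the second option above), a minimal sketch using the mice package might look like the following; the data frame df and the variable names are taken from the question, and mice's default imputation methods are only a starting point that should be reviewed:
library(mice)
# create several completed data sets by multiple imputation
imp <- mice(df, m = 5, seed = 1)
# fit the same linear model to each completed data set
fits <- with(imp, lm(P3J_IOP ~ PräOP_IOP + OPTyp + P3J_Med))
# pool the estimates across imputations (Rubin's rules)
summary(pool(fits))
As noted above, combining this with stepwise selection is not straightforward; the pooled model here uses a fixed set of predictors.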

Related

How does fixest handle negative values of the demeaned dependent variable in poisson estimations?

I need to perform GLM (Poisson) estimations with fixed effects (say, just unit FE) and several regressors (RHS variables). I have an unbalanced panel dataset where most (~90%) observations have missing values (NA) for some, but not all, regressors.
fixest::feglm() can handle this and returns my fitted model.
However, to do so, it (and fixest::demean too) removes observations that have at least one regressor missing, before constructing the fixed-effect means.
In my case, I am afraid this implies not using a significant share of available information in the data.
Therefore, I would like to demean my variables by hand, so that I can include as much information as possible in each fixed-effect dimension's mean, and then run feglm on the demeaned data. However, this produces negative dependent-variable values, which are not compatible with Poisson. If I run feglm with the "poisson" family on my manually demeaned data, I (unsurprisingly) get: "Negative values of the dependent variable are not allowed for the "poisson" family.". The same error is returned with data demeaned via the fixest::demean function.
Question:
How does feglm handle negative values of the demeaned dependent variable? Is there a way (like some data transformation) to reproduce fepois on a fixed-effect in the formula with fepois on demeaned data and a no fixed-effect formula?
To use the example from fixest::demean documentation (with two-way fixed-effects):
data(trade)
base = trade
base$ln_dist = log(base$dist_km)
base$ln_euros = log(base$Euros)
# We center the two variables ln_dist and ln_euros
# on the factors Origin and Destination
X_demean = demean(X = base[, c("ln_dist", "ln_euros")],
                  fe = base[, c("Origin", "Destination")])
base[, c("ln_dist_dm", "ln_euros_dm")] = X_demean
and I would like to reproduce
est_fe = fepois(ln_euros ~ ln_dist | Origin + Destination, base)
with
est = fepois(ln_euros_dm ~ ln_dist_dm, base)
I think there are two main problems.
Modelling strategy
In general, it is important to be able to formally describe the estimated model.
In this case, it wouldn't be possible to write the model down as a single equation in which the fixed-effects are estimated on all the data while the other variables are estimated only on the non-missing observations. And if the model is not clear, then... maybe it's not a good model.
On the other hand, if your model is well defined, then removing observations at random should not change the expectation of the coefficients, only their variance. So again, if your model is well specified, you shouldn't worry too much.
By suggesting that the observations with missing values are relevant to estimating the fixed-effects coefficients (or, stated differently, that they should be used to demean some variables), you are implying that the missingness is not random. And now you should worry.
Just using these observations to demean the variables wouldn't remove the bias in the estimated coefficients caused by selection into non-missingness. That's a deeper problem that cannot be fixed by technical tricks, only by a profound understanding of the data.
GLM
There is a misunderstanding about GLM. GLM is a super smart trick to estimate maximum likelihood models with OLS (there's a nice description here). It was developed and used at a time when general-purpose optimization was computationally very expensive, and it was a way to instead employ well-developed and fast OLS techniques to perform equivalent estimations.
GLM is an iterative process in which a regular OLS estimation is performed at each step; the only things that change from one iteration to the next are the weights attached to each observation (and the associated working response). Since each step is a regular OLS fit, techniques for fast OLS estimation with multiple fixed-effects can be leveraged (as is done in the fixest package).
So actually, you could do what you want... but only within the OLS step of the GLM algorithm. By no means should you demean the data before running the GLM, because, well, it makes no sense (the FWL theorem has absolutely no hold here).
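To make the last point concrete, here is a toy sketch of Poisson IRLS with a log link (an illustration of the idea, not the fixest internals): each iteration is just a weighted OLS fit on a working response, and that inner OLS step is the only place where a within-transformation could legitimately enter.
# toy Poisson IRLS: y is the response vector, X the design matrix
irls_poisson <- function(y, X, tol = 1e-8, max_iter = 25) {
  beta <- rep(0, ncol(X))
  for (i in seq_len(max_iter)) {
    eta <- drop(X %*% beta)                     # linear predictor
    mu  <- exp(eta)                             # Poisson mean (log link)
    w   <- mu                                   # IRLS weights
    z   <- eta + (y - mu) / mu                  # working response
    beta_new <- lm.wfit(X, z, w)$coefficients   # the weighted OLS step
    if (max(abs(beta_new - beta)) < tol) break
    beta <- beta_new
  }
  beta
}
Demeaning the raw data beforehand changes the model being estimated, hence the error about negative responses.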

Use of multiple imputation in R for two-level binary logistic regression model

I am using the glmer function of the R package lme4 to fit generalized linear mixed models (GLMMs) using the Laplace approximation, with a binary response variable, BINARY_r, say. I have one level-two fixed-effects variable ('FIXED', say) and two level-two cross-classified random-effects variables ('RANDOM1' and 'RANDOM2', say). The level-one binary response variable, BINARY_r, is nested separately within each of the two level-two variables. The logit link function is used to represent the non-Gaussian nature of the response variable. The interaction between the two random-effects variables is represented by 'RANDOM1:RANDOM2'. All three independent variables are categorical. The model takes the form,
BINARY_r ~ FIXED + RANDOM1 + RANDOM2 + RANDOM1:RANDOM2.
There are missing data for ‘FIXED’ and ‘BINARY_r’ and I wish to explore the improvement in the model through applying multiple imputation for each of these two variables.
I am very unclear, however, as to how to use MI in R to refit the glmer model so that it is identical to the original one but now uses the imputed data for FIXED and BINARY_r. Can you help, please?
Many thanks in advance
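A hedged sketch of the usual workflow (impute with mice, refit the glmer model on each completed data set, pool with Rubin's rules) is shown below; the data frame name dat is assumed, and the random-effects terms are one plausible glmer translation of the model described above:
library(mice)
library(lme4)
# impute FIXED and BINARY_r (and any other incomplete variables)
imp <- mice(dat, m = 5, seed = 1)
# refit the cross-classified logistic GLMM on each completed data set
fits <- with(imp, glmer(BINARY_r ~ FIXED + (1 | RANDOM1) + (1 | RANDOM2) +
                          (1 | RANDOM1:RANDOM2),
                        family = binomial))
# pool the fixed-effect estimates (pooling merMod fits requires broom.mixed)
summary(pool(fits))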

Heteroskedasticity and autocorrelation problems in multiple regression analysis

I am running a multiple linear regression model using the lm function in R to study the impact of some characteristics on gene expression levels.
My data matrix contains one continuous dependent variable (gene expression level) and 50 explanatory variables, which are the counts of these characteristics for each gene; many of these counts are zero.
I checked all of the regression assumptions and found two issues: the first is heteroscedasticity and the second is autocorrelation. The latter is not serious. I wonder whether using multiple linear regression is correct here, and whether there are other regression techniques that could address these problems.
I used a stepwise method and got just 11 significant variables among the 50. But when I checked the heteroscedasticity, I found it still appears, as shown below. The sample size is 15,000 genes (15,000 rows and 50 columns).
[Updated residual plot, with weights added to the lm call, per the comments.]
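One common route, sketched here under assumed names (the outcome column is called expression and the 50 count covariates live in genes), is to test formally for heteroscedasticity and to report heteroscedasticity-robust (sandwich) standard errors instead of the default ones:
library(lmtest)
library(sandwich)
# hypothetical fit: expression regressed on the 50 count covariates
fit <- lm(expression ~ ., data = genes)
bptest(fit)                                       # Breusch-Pagan test for heteroscedasticity
coeftest(fit, vcov = vcovHC(fit, type = "HC3"))   # robust (HC3) standard errors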

Can/Should I use the output of a log-linear model as the predictors in a logistic regression model?

I have a data set with both continuous and categorical variables. In the end I want to build a logistic regression model to calculate the probability of a response dichotomous variable.
Is it acceptable, or even a good idea, to apply a log linear model to the categorical variables in the model to test their interactions, and then use the indicated interactions as predictors in the logistic model?
Example in R:
Columns in df: CategoricalA, CategoricalB, CategoricalC, CategoricalD, CategoricalE, ContinuousA, ContinuousB, ResponseA
library(MASS)
#Isolate categorical variables in new data frame
catdf <- df[,c("CategoricalA","CategoricalB","CategoricalC", "CategoricalD", "CategoricalE")]
#Create cross table
crosstable <- table(catdf)
#build log-lin model
model <- loglm(formula = ~ CategoricalA * CategoricalB * CategoricalC * CategoricalD * CategoricalE, data = crosstable)
#Use step() to build better model
automodel <- step(object = model, direction = "backward")
Then build a logistic regression using the output of automodel and the values of ContinuousA and ContinuousB in order to predict ResponseA (which is binary).
My hunch is that this is not OK, but I can't find the answer definitively one way or the other.
Short answer: yes. You can use any information in the model that will be available in an out-of-time or 'production' run of the model. Whether this information is good, powerful, significant, etc. is a different question.
The logic is that a model can have any type of RHS variable, be it categorical, continuous, logical, etc. Furthermore, you can combine RHS variables into a single RHS variable and also apply transformations. The log-linear model of the categorical variables is nothing but a transformed linear combination of the raw variables (which happen to be categorical). This method does not violate any particular modeling framework.
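For instance, as a hypothetical continuation of the example above: if the backward-selected log-linear model retained the CategoricalA:CategoricalB interaction, that term can simply be included in the logistic regression alongside the continuous predictors:
# hypothetical: carry the interaction suggested by the log-linear model into the logistic model
logit_fit <- glm(ResponseA ~ CategoricalA * CategoricalB + CategoricalC +
                   CategoricalD + CategoricalE + ContinuousA + ContinuousB,
                 family = binomial, data = df)
summary(logit_fit)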

Inputting a whole data frame as independent variables in a logistic regression [duplicate]

Possible Duplicate:
short formula call for many variables when building a model
I have a biggish data frame (112 variables) that I'd like to do a stepwise logistic regression on using R. I know how to set up the glm model and the stepAIC model, but I'd rather not type in all the column headings to specify the independent variables. Is there a fast way to give the glm model an entire data frame as independent variables, such that it recognizes each column as an x variable to be included in the model? I tried:
ft<-glm(MFDUdep~MFDUind, family=binomial)
But it didn't work (wrong data types). MFDUdep and MFDUind are both data frames, with MFDUind containing 111 'x' variables and MFDUdep containing a single 'y'.
You want the . special symbol in the formula notation. Also, it is probably better to have the response and predictors in a single data frame.
Try:
MFDU <- cbind(MFDUdep, MFDUind)
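# replace 'y' with the name of the single response column held in MFDUdep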
ft <- glm(y ~ ., data = MFDU, family = binomial)
Now that I have given you the rope, I am obliged to at least warn you about the potential for hanging...
The approach you are taking is usually not the recommended one, unless perhaps prediction is the purpose of the model. Regression coefficients for the selected variables may be strongly biased, so if you are using this model for enlightenment (explanation rather than prediction), rethink your approach.
You will also need a lot of observations to allow 100+ terms in a model.
Better alternatives exist; e.g. see the glmnet package for one approach that allows ridge, lasso, or elastic-net (both) penalties on the set of coefficients, minimising model error at the expense of a small amount of additional bias.
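A hedged sketch of the glmnet route, reusing the MFDU data frame built above (the response column is again assumed to be called y):
library(glmnet)
x <- model.matrix(y ~ ., data = MFDU)[, -1]                      # predictor matrix, intercept dropped
cvfit <- cv.glmnet(x, MFDU$y, family = "binomial", alpha = 0.5)  # elastic net (alpha = 0.5)
coef(cvfit, s = "lambda.min")                                    # coefficients at the CV-chosen penalty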
