Locate missingness in a fitted model in R

Consider fitting a coxph model with, say, 100 data points. Only 95 are included in the analysis, while 5 are excluded due to being NA (i.e. missingness). I extract the residuals on the fitted data so I have a residual vector with 95 observations. I would like to include the residuals back into the original data frame, but I can't do this since the lengths are different.
How do I identify which observations from the original data frame were not included in the model, so I can exclude/delete them to make the two lengths the same?
(The original data is much larger so it's hard to locate where data are missing...)

Re-fit your model, setting the na.action argument to na.exclude. This pads the residuals and fitted values that are part of the fitted object with NAs. If your original model is zn50:
zn50_na <- update(zn50, na.action=na.exclude)
This should give you residuals(zn50_na) and fitted(zn50_na) of the appropriate length. See ?na.omit for more info.
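For illustration, here is a minimal sketch with simulated data (the variable names are made up, not from your model) showing how na.exclude pads the residuals back to the full data length:

library(survival)

# Simulated example: 5 of 100 rows have a missing covariate
set.seed(1)
df <- data.frame(time   = rexp(100),
                 status = rbinom(100, 1, 0.7),
                 age    = c(rnorm(95), rep(NA, 5)))

fit <- coxph(Surv(time, status) ~ age, data = df, na.action = na.exclude)

res <- residuals(fit)  # padded with NA for the 5 excluded rows
length(res)            # 100 -- same as nrow(df)
df$resid <- res        # lengths now match, so this just works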

Related

Fit multiple linear regression without an intercept with the function lm() in R

Can you please help with this question in R? I need to get more than one predictor:
Fit a multiple linear regression without an intercept with the function lm() to the training data,
using the variable (y.train) as the response and the variables (X.mat.train) as
predictors. Look at the vector of estimated coefficients of the model and compare it with
the vector of 'true' values beta.vec graphically
(Tip: build a plot of the differences of the absolute values of estimated and true values).
I have already tried it with the code posted at the end, but it gives me only one predictor, and in this example I need to get more than one.
I think the wrong part is the first line, but I couldn't find a way to fix it.
I can't put the data set here since it's large, but I have a variable that stores 190 observations from a vector (y.train) and another that stores 190 observations from a matrix (X.mat.train). It should give more than one predictor, but for me it's giving only one.
simple.fit <- lm(y.train ~ 0 + X.mat.train)  # response, no intercept, predictors
summary(simple.fit)  # show the linear regression output
plot(simple.fit)
abline(simple.fit)

n <- summary(simple.fit)$coefficients
estimated_coeff <- n[, 1]  # first column holds the estimates
estimated_coeff
plot(estimated_coeff)
# Coefficients: X.mat.train 0.5018

v <- sum(beta.vec)
# 0.5369
plot(beta.vec)
plot(beta.vec, simple.fit)
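A likely cause (an assumption, since the data aren't posted) is that X.mat.train is being passed as a single vector or one-column object rather than as a numeric matrix: lm() returns one coefficient per column of a matrix predictor. A minimal sketch with simulated data (all names are made up to mirror the question):

# Simulate a 190 x 5 predictor matrix and a response
set.seed(1)
n_obs <- 190; p <- 5
X.mat.train <- matrix(rnorm(n_obs * p), nrow = n_obs,
                      dimnames = list(NULL, paste0("x", 1:p)))
beta.vec <- runif(p)                                   # 'true' coefficients
y.train  <- drop(X.mat.train %*% beta.vec + rnorm(n_obs))

# No-intercept fit; as.matrix() guards against X.mat.train being a data frame
fit <- lm(y.train ~ 0 + as.matrix(X.mat.train))
coef(fit)  # one coefficient per column, not just one

# Tip from the assignment: plot the differences of the absolute values
plot(abs(coef(fit)) - abs(beta.vec),
     xlab = "coefficient index", ylab = "abs(estimated) - abs(true)")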

How to handle missing values (NA's) in a column in lmer

I would like to use na.pass for na.action when working with lmer. There are NA values in some columns of some observations in the data set. I just want to control for the variables that contain the NAs. It is very important that the size of the data set stays the same after controlling for the fixed effects.
I think I have to work with na.action in lmer(). I am using the following model:
baseline_model_0 <- lmer(formula = log_life_time_income_child ~ nationality_dummy +
    sex_dummy + region_dummy + political_position_dummy + (1 | Family),
  data = baseline_df)
This throws:
Error in qr.default(X, tol = tol, LAPACK = FALSE) :
  NA/NaN/Inf in foreign function call (arg 1)
My data: as you see below, there are quite a lot of NAs in all the control variables, so "throwing away" all of these observations is not an option!
One example:
nat_dummy
1 : 335
2 : 19
NA's: 252
My questions:
1.) How can I include all of my control variables (expressed in multiple columns) in the model without kicking out observations (rows)?
2.) How does lmer handle the missing values in all these columns?
To answer your second question: lmer fits by maximum likelihood, but by default (na.action = na.omit) it simply drops every row that has a missing value in the response or in any predictor. To avoid this, as others have suggested, you can use multiple imputation instead. Below I demonstrate an example with the airquality dataset native to R, since you don't have your data included in your question. First, load the necessary libraries: lmerTest for fitting the regression, mice for the imputation, and broom.mixed for summarizing the results.
#### Load Libraries ####
library(lmerTest)
library(mice)
library(broom.mixed)
We can inspect the missingness patterns with the following code:
#### Missing Patterns ####
md.pattern(airquality)
Which gives us this nice plot of all the missing data patterns. For example, you may notice that we have two observations that are missing both Ozone and Solar.R.
To fill in the gaps, we can impute the data with 5 imputations (the default, so you don't have to include the m=5 part, but I specify it explicitly for your understanding).
#### Impute ####
imp <- mice(airquality, m = 5)
Afterwards, you run your model on the imputations as below. The with() function takes your imputed data and fits the regression model to each imputed data set. This model is a bit dubious and comes back singular, but I use it because it's the quickest dataset with missing values included that I could remember.
#### Fit With Imputations ####
fit <- with(imp, lmer(Solar.R ~ Ozone + (1 | Month)))
From there you can pool and summarize your results like so:
#### Pool and Summarise ####
pool <- pool(fit)
summary(pool)
Obviously with the model being singular this would be meaningless, but with a properly fitted model your summary should look something like this:
         term    estimate  std.error statistic       df     p.value
1 (Intercept) 151.9805678 12.1533295 12.505262 138.8303 0.000000000
2       Ozone   0.8051218  0.2190679  3.675216 135.4051 0.000341446
As Ben already mentioned, you need to also determine why your data is missing. If there are non-random reasons for their missingness, this would require some consideration, as this can bias your imputations/model. I really recommend the mice vignettes here as a gentle introduction to the topic:
https://www.gerkovink.com/miceVignettes/
Edit
You asked in the comments about adding in random effects estimates. I'm not sure why this isn't something already ported into the respective packages, but the mitml package can help fill that gap. Here is the code:
#### Load Library and Get All Estimates ####
library(mitml)
testEstimates(as.mitml.result(fit), extra.pars = T)
Which gives you both fixed and random effects for imputed lmer objects:
Call:
testEstimates(model = as.mitml.result(fit), extra.pars = T)
Final parameter estimates and inferences obtained from 5 imputed data sets.
            Estimate Std.Error t.value     df P(>|t|)   RIV   FMI
(Intercept)  146.575    14.528  10.089 68.161   0.000 0.320 0.264
Ozone          0.921     0.254   3.630 90.569   0.000 0.266 0.227

                            Estimate
Intercept~~Intercept|Month   112.587
Residual~~Residual          7274.260
ICC|Month                      0.015
Unadjusted hypothesis test as appropriate in larger samples.
And if you just want to pull the random effects, you can use testEstimates(as.mitml.result(fit), extra.pars = T)$extra.pars instead, which gives you just the random effects:
                               Estimate
Intercept~~Intercept|Month 1.125872e+02
Residual~~Residual         7.274260e+03
ICC|Month                  1.522285e-02
Unfortunately there is no easy answer to your question; using na.pass doesn't do anything smart: it just lets the NA values go forward into the mixed-model machinery, where (as you have seen) they screw things up.
For most analysis types, in order to deal with missing values you need to use some form of imputation (using a model of some kind to fill in plausible values). If you only care about prediction without confidence intervals, you can use some simple single imputation method such as replacing NA values with means. If you want to do inference (compute p-values/confidence intervals), you need multiple imputation, i.e. generating multiple data sets with imputed values drawn differently in each one, fitting the model to each data set, then pooling estimates and confidence intervals appropriately across the fits.
mice is the standard/state-of-the-art R package for multiple imputation: there is an example of its use with lmer here.
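If you only need rough point predictions, a single-imputation sketch could be as simple as the following (made-up data; this understates uncertainty, so don't use it for p-values or confidence intervals):

# Replace each NA in a predictor with the column mean (single imputation)
set.seed(101)
dat <- data.frame(y = rnorm(10),
                  x = c(rnorm(8), NA, NA))
dat$x_imputed <- ifelse(is.na(dat$x), mean(dat$x, na.rm = TRUE), dat$x)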
There are a bunch of questions you should ask (and understand the answers to) before you embark on any kind of analysis with missing data:
what kind of missingness do I have ("completely at random" [MCAR], "at random" [MAR], "not at random" [MNAR])? Can my missing-data strategy lead to bias if the data are missing not-at-random?
have I explored the pattern of missingness? Are there subsets of rows/columns that I can drop without much loss of information (e.g., if some columns or rows have mostly missing values, imputation won't help very much)?
mice has a variety of imputation methods to choose from. It won't hurt to try out the default methods when you're getting started (as in @ShawnHemelstrand's answer), but before you go too far you should at least make sure you understand which methods mice is using on your data, and that the defaults make sense for your case.
I would strongly recommend the relevant chapter of Frank Harrell's Regression Modeling Strategies, if you can get ahold of a copy.

Why is R removing some residuals and how to avoid it?

I am creating linear models in R and testing them for model assumptions.
I noticed that when I create my models, R removes some residuals, giving this:
(2 observations deleted due to missingness)
This prevents me from checking the relationship between the independent variable and the residuals, and from doing any further analysis, because x and y have different lengths.
edit:
Do you have any ideas on how to fix this?
R isn't removing residuals when you run lm(). Rather, it cannot create residuals for samples that have any missing data in the model (nor actually use them in the analysis). Therefore, the summary(model_5) output notifies you that some samples (observations) cannot be used (i.e., are deleted).
To run a correlation between the residuals and the independent variable when their lengths differ, and when for some reason we cannot locate the missing data to remove from the dataset (e.g., if dataset[!complete.cases(dataset), ] isn't working), we first need another way to identify which observations were kept or removed by the model. We can rely on the observation ID or the data set's row names for this.
Example
# Sample data: one NA in each variable, in different rows
set.seed(12345)
dataset <- data.frame(indep_var = c(NA, rnorm(9)), dep_var = c(rnorm(9), NA))
dataset$index <- rownames(dataset)

# Model residuals; their names are the row names of the rows actually used
resid <- lm(dep_var ~ indep_var, data = dataset)$residuals
dataset.resid <- data.frame(index = names(resid), resid)

# Join (or match) the residuals and the variables by their observation identifier
cor.data <- dplyr::inner_join(dataset.resid, dataset, by = "index")

# Correlation analysis
cor.test(~ resid + indep_var, cor.data)
Note that names(resid) are the row names of the observations used in the model from dataset. Any unused rownames(dataset) or observations/rows from dataset (due to missingness) will not be included in names(resid).
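Alternatively, in line with the na.exclude advice elsewhere on this page, refitting with na.action = na.exclude pads the residuals with NA so they line up with the original rows. Continuing with the same simulated dataset:

# residuals() is now NA-padded to nrow(dataset), so no matching is needed
fit <- lm(dep_var ~ indep_var, data = dataset, na.action = na.exclude)
dataset$resid <- residuals(fit)
cor.test(~ resid + indep_var, dataset)  # incomplete rows are dropped internally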

Using predict for linear model with NA values in R

I have a dataset of ~32,000 observations, for which I have created a linear model. ~12,000 observations were deleted due to missingness.
I am trying to use the predict function to backtest the expected value for each of my 32,000 data points, but [as expected] this gives the error 'replacement has 20000 rows, data has 32000'.
Is there any way I can use that model, fitted on the 20,000 complete rows, to predict for all 32,000? I am happy to have 'zero' for observations that don't have results for every column used in the model.
If not, how can I at least subset the 32,000 dataset correctly such that it only includes the 20,000 whole observations? If my model was lm(a ~ x+y+Z, data=data), for example, how would I filter data to only include observations with full data in x, y and z?
The best thing to do is to use na.action = na.exclude when you fit the model in the first place; from ?na.exclude:
    when 'na.exclude' is used the residuals and predictions are padded to the
    correct length by inserting NAs for cases omitted by 'na.exclude'.
The problem with using a 0 instead of a missing value is that the linear model will interpret the value as actually having been 0 rather than missing. For instance, if your variable x had a range of 10-100, the model would interpret your imputed 0s as observations below the training data's range and give you artificially low predictions. If you want to make predictions for the rows with missing values, you will have to do some value imputation (i.e., replace the NAs with the mean or the median, or use k-nearest neighbors).
Using
data[complete.cases(data),]
gives you only observations without NAs. Perhaps that's what you are looking for.
Another way is
na.omit(data)
which additionally records the indices of the removed observations (in its na.action attribute).
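If you only need completeness in the variables the model actually uses (x, y and Z from the question) rather than in every column, restrict complete.cases() to those columns. A sketch, assuming those column names exist in data:

# Keep rows that are complete in the model variables only
data_sub <- data[complete.cases(data[, c("x", "y", "Z")]), ]
fit <- lm(a ~ x + y + Z, data = data_sub)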

Can I use AICc to rank models based on nested data?

I have a database with 136 species and 6 variables. For 4 of the variables there are data for all species; for the other 2, there are data for only 88 species. When all 6 variables are considered together, only 78 species have data for everything.
So I ran models using these variables.
Note that the models have different species sample sizes, varying according to the data available in the database.
I need to know if AICc is a good way to compare these models.
The model.avg function in the MuMIn package returns an error when I try to run a list including all my models:
mods <- list(mod1, mod2, ..., mod14)
aicc <- summary(model.avg(mods))
Error in model.avg.default(mods) :
  models are not all fitted to the same data
This error makes me think that it is not possible to rank models fitted to different sample sizes using AICc. I need help with this question!
Thanks in advance!
Basically, all information criteria (AIC among them) are based on the likelihood function of the model, which is influenced by sample size: more observations mean a lower likelihood and hence a larger information criterion.
This means that you cannot compare models fitted to different sample sizes using AIC or any other information criterion.
That's also why your model.avg call is failing.
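One standard workaround, sketched below with hypothetical names (your data aren't shown), is to fit every candidate model to the same complete-case subset, so the AICc values are computed on identical data and become comparable:

# v1..v6 stand in for the six variables, 'response' for the outcome
vars <- c("v1", "v2", "v3", "v4", "v5", "v6")
dat_cc <- dat[complete.cases(dat[, vars]), ]  # e.g. the 78 species with all data

mod1 <- lm(response ~ v1 + v2, data = dat_cc)
mod2 <- lm(response ~ v1 + v2 + v5, data = dat_cc)
# ... fit all 14 models on dat_cc ...

library(MuMIn)
summary(model.avg(list(mod1, mod2)))  # no longer errors: all models share the same data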
