Can I use AICc to rank models based on nested data?

I have a dataset with 136 species and 6 variables. For 4 of the variables there are data for all species; for the other 2 variables there are data for only 88 species. When all 6 variables are considered together, only 78 species have complete data.
So I ran models using these variables.
Note that the models have different species sample sizes, varying according to which species have data for the variables involved.
I need to know whether AICc is a valid way to compare these models.
The model.avg function in the MuMIn package returns an error when I try to run it on a list containing all my models:
mods <- list(mod1, mod2, ..., mod14)
aicc <- summary(model.avg(mods))
Error in model.avg.default(mods) :
models are not all fitted to the same data
This error makes me think it is not possible to rank models fitted to different sample sizes using AICc. I need help with this question!
Thanks in advance!

Basically, all information criteria (AIC included) are based on the likelihood of the model, which depends on the sample size: with more observations the summed log-likelihood decreases, so the information criterion increases.
This means that you cannot compare models fitted to different sample sizes using AIC, AICc, or any other information criterion.
That is also why your model.avg call is failing: the models must all be fitted to exactly the same data.
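The usual workaround is to refit every candidate model on the same set of complete cases before ranking. A minimal sketch, assuming your data frame is called dat, the six variables are v1 to v6 (hypothetical names), and the lm() formulas are placeholders for your actual models:
# Restrict to the 78 species with data on all six variables (hypothetical column names)
vars <- c("v1", "v2", "v3", "v4", "v5", "v6")
dat_cc <- dat[complete.cases(dat[, vars]), ]

# Refit all candidate models on the same complete-case data
mod1 <- lm(v1 ~ v2, data = dat_cc)        # placeholder formulas
mod2 <- lm(v1 ~ v2 + v3, data = dat_cc)

# Now AICc ranking and model averaging are valid
library(MuMIn)
model.sel(mod1, mod2)                     # ranks models by AICc
summary(model.avg(list(mod1, mod2)))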

Related

Higher order IRT model in R

I am trying to estimate a higher order model with the mirt package in R.
I have already conducted analysis using a bi-factor structure.
I have 60 items, the first 30 forming one dimension and the second 30 a second dimension of my model. I know that these two dimensions are strongly correlated (r = .8), which was the motivation to estimate a bi-factor model. However, I would like to compare the bi-factor structure to a model in which both subdimensions are explained by a general factor G.
I am having trouble finding the correct model specification to estimate this model.
specific <- c(rep(1, 30), rep(2, 30))
model2 <- '
F1 = 1-60 '
mod <- bfactor(all_data, specific, model2 = model2, itemtype = "Rasch")
However, this only gives me a model with a general factor and two specific factors, essentially a bi-factor model. What I am trying to specify is (a) a model in which the two lower-tier factors both load on a general factor while remaining uncorrelated, and (b) the same model but allowing the lower-tier factors to correlate.
I would be happy if someone could help me out here.
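Not a full answer, but for part (b), the correlated lower-tier factors, a specification can be written directly in mirt.model syntax. A sketch, assuming all_data is the 60-item response matrix from the bfactor call above:
library(mirt)

# Two correlated first-order factors (no specific factors)
model_corr <- mirt.model('
  F1 = 1-30
  F2 = 31-60
  COV = F1*F2')   # the COV line frees the F1-F2 covariance

mod_corr <- mirt(all_data, model_corr, itemtype = "Rasch")

# Compare with the bi-factor fit, e.g. by information criteria
anova(mod, mod_corr)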

How to handle missing values (NA's) in a column in lmer

I would like to use na.pass for na.action when working with lmer. There are NA values in some columns of the data set. I just want to control for the variables that contain the NA's. It is very important that the size of the data set stays the same after controlling for these fixed effects.
I think I have to work with na.action in lmer(). I am using the following model:
baseline_model_0 <- lmer(formula = log_life_time_income_child ~ nationality_dummy +
    sex_dummy + region_dummy + political_position_dummy + (1|Family),
    data = baseline_df)
Error in qr.default(X, tol = tol, LAPACK = FALSE) :
NA/NaN/Inf in foreign function call (arg 1)
My data: as you can see below, there are quite a lot of NA's in all the control variables, so "throwing away" all of these observations is not an option!
One example:
nat_dummy
1 : 335
2 : 19
NA's: 252
My questions:
1.) How can I include all of my control variables (expressed in multiple columns) to the model without kicking out observations (expressed in rows)?
2.) How does lmer handle the missing values in all the columns?
To answer your second question: by default lmer simply drops (listwise-deletes) any row that has a missing value in the response or in any of the predictors used in the model. To avoid this, as others have suggested, you can use multiple imputation instead. I demonstrate an example below with the airquality dataset native to R, since you don't have your data included in your question. First, load the necessary libraries: lmerTest for fitting the regression, mice for imputation, and broom.mixed for summarizing the results.
#### Load Libraries ####
library(lmerTest)
library(mice)
library(broom.mixed)
We can inspect the missing-data patterns with the following code:
#### Missing Patterns ####
md.pattern(airquality)
Which gives us this nice plot of all the missing data patterns. For example, you may notice that we have two observations that are missing both Ozone and Solar.R.
To fill in the gaps, we can impute the data with 5 imputations (the default, so you don't have to include the m=5 part, but I specify it explicitly for your understanding).
#### Impute ####
imp <- mice(airquality,
m=5)
After that, you run your model on the imputations as below. The with() function takes your imputed data and fits the regression model to each imputed dataset. This particular model is a bit contrived and comes back singular, but I use it only because airquality is the quickest dataset with missing values that I could think of.
#### Fit With Imputations ####
fit <- with(imp,
lmer(Solar.R ~ Ozone + (1|Month)))
From there you can pool and summarize your results like so:
#### Pool and Summarise ####
pool <- pool(fit)
summary(pool)
Obviously, with the model being singular this would be meaningless, but with a properly fitted model your summary should look something like this:
term estimate std.error statistic df p.value
1 (Intercept) 151.9805678 12.1533295 12.505262 138.8303 0.000000000
2 Ozone 0.8051218 0.2190679 3.675216 135.4051 0.000341446
As Ben already mentioned, you also need to determine why your data are missing. If there are non-random reasons for the missingness, this requires some consideration, as it can bias your imputations/model. I really recommend the mice vignettes as a gentle introduction to the topic:
https://www.gerkovink.com/miceVignettes/
Edit
You asked in the comments about adding in the random-effects estimates. I'm not sure why this isn't something already ported into the respective packages, but the mitml package can help fill that gap. Here is the code:
#### Load Library and Get All Estimates ####
library(mitml)
testEstimates(as.mitml.result(fit),
extra.pars = T)
Which gives you both fixed and random effects for imputed lmer objects:
Call:
testEstimates(model = as.mitml.result(fit), extra.pars = T)
Final parameter estimates and inferences obtained from 5 imputed data sets.
Estimate Std.Error t.value df P(>|t|) RIV FMI
(Intercept) 146.575 14.528 10.089 68.161 0.000 0.320 0.264
Ozone 0.921 0.254 3.630 90.569 0.000 0.266 0.227
Estimate
Intercept~~Intercept|Month 112.587
Residual~~Residual 7274.260
ICC|Month 0.015
Unadjusted hypothesis test as appropriate in larger samples.
And if you just want to pull the random effects, you can use testEstimates(as.mitml.result(fit), extra.pars = T)$extra.pars instead, which returns only that part of the output:
Estimate
Intercept~~Intercept|Month 1.125872e+02
Residual~~Residual 7.274260e+03
ICC|Month 1.522285e-02
Unfortunately there is no easy answer to your question; using na.pass doesn't do anything smart, it just lets the NA values go forward into the mixed-model machinery, where (as you have seen) they screw things up.
For most analysis types, in order to deal with missing values you need to use some form of imputation (using a model of some kind to fill in plausible values). If you only care about prediction without confidence intervals, you can use some simple single imputation method such as replacing NA values with means. If you want to do inference (compute p-values/confidence intervals), you need multiple imputation, i.e. generating multiple data sets with imputed values drawn differently in each one, fitting the model to each data set, then pooling estimates and confidence intervals appropriately across the fits.
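For illustration, the mean-replacement version of single imputation is a one-liner per column (a sketch assuming the baseline_df from the question; remember this is only defensible for prediction, not inference):
# Single imputation by column means (prediction only; hypothetical helper)
impute_mean <- function(x) replace(x, is.na(x), mean(x, na.rm = TRUE))
num_cols <- sapply(baseline_df, is.numeric)
baseline_df[num_cols] <- lapply(baseline_df[num_cols], impute_mean)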
mice is the standard/state-of-the-art R package for multiple imputation: there is an example of its use with lmer here.
There are a bunch of questions you should ask, and understand the answers to, before you embark on any kind of analysis with missing data:
what kind of missingness do I have ("completely at random" [MCAR], "at random" [MAR], "not at random" [MNAR])? Can my missing-data strategy lead to bias if the data are missing not-at-random?
have I explored the pattern of missingness? Are there subsets of rows/columns that I can drop without much loss of information (e.g. if some columns or rows have mostly missing information, imputation won't help very much)?
mice has a variety of imputation methods to choose from. It won't hurt to try out the default methods when you're getting started (as in @ShawnHemelstrand's answer), but before you go too far you should at least make sure you understand what methods mice is using on your data, and that the defaults make sense for your case (a quick way to check this is sketched below).
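For example, a quick way to inspect (and override) what mice chose, assuming the imp object from the answer above:
# Which univariate imputation method mice picked for each column
imp$method

# Override the defaults explicitly, e.g. predictive mean matching everywhere
# (hypothetical choice; pick methods that suit your variable types)
imp2 <- mice(airquality, m = 5, method = "pmm", seed = 123)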
I would strongly recommend the relevant chapter of Frank Harrell's Regression Modeling Strategies, if you can get ahold of a copy.

Error in glsEstimate(object, control = control) : computed "gls" fit is singular, rank 19

This is my first time asking in the forums; I couldn't find a solution in other answers.
I'm just starting to learn to use R, so I can't help but think this has a simple solution I'm failing to see.
I'm analyzing the relationship between insect species (SP) and temperature (T), the explanatory variables, and the femur area of the resulting adult (Femur.area), the response variable.
This is my linear model:
ModeloP <- lm(Femur.area ~ T * SP, data=Datos)
No error, but when I want to model variance with gls,
modelo_varPower <- gls(Femur.area ~ T*SP,
weights = varPower(),
data = Datos
)
I get the following error:
Error in glsEstimate(object, control = control) :
computed "gls" fit is singular, rank 19
The linear model barely passes the Shapiro-Wilk test of normality; could this be the issue?
Shapiro-Wilk normality test
data: re
W = 0.98269, p-value = 0.05936
Strangely, I've run this model using another explanatory variable and had no errors. Everything I can find in the forums has to do with multiple samplings over a period of time, and that's not my case.
Since the only difference is the response variable, I'm uploading an image of what the table looks like, in case it helps.
You have some missing cells in your SP:T interaction. lm() tolerates these (if you look at coef(lm(Femur.area~SP*T,data=Datos)) you'll see some NA values for the missing interactions). gls() does not. One way to deal with this is to create an interaction variable and drop the missing levels, then fit the model as (effectively) a one-way rather than a two-way ANOVA. (I called the data dd rather than datos.)
# Combine SP and T into a single factor and drop the empty (unobserved) combinations
dd3 <- transform(na.omit(dd), SPT = droplevels(interaction(SP, T)))
library(nlme)
gls(Femur.area ~ SPT, weights = varPower(form = ~ fitted(.)), data = dd3)
If you want the main effects and the interaction term and the power-law variance that's possible, but it's harder.
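To see which cells are empty in the first place, a quick diagnostic (assuming the Datos data frame from the question):
# Count observations per species-by-temperature cell; zeros are the missing cells
with(Datos, table(SP, T))

# The interaction coefficients that lm() silently dropped show up here as NA
coef(lm(Femur.area ~ T * SP, data = Datos))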

Unequal Data for Model Compare in R

I'm fairly new to R and am trying to compare two models with the modelCompare function. However, the data set that I am working with is a bit large and has unevenly distributed missing values. When I try the following code for example:
Model_A <- lm(DV~var1*var2 + cont.var, data=df)
Model_C <- lm(DV~ cont.var, data=df)
modelCompare(Model_C,Model_A)
I get an error that the models have different N values and cannot be compared because data is differentially omitted between the two models. Is there an easy way to remove this variation, as I will be running a number of regression analyses with this data set?
What are you looking to compare? If you want to compare the intercepts between the models, just print them:
Model_A
Model_C
If you want to compare the predictive accuracy of the models, use a training and testing dataset!
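As for the differing-N error itself, the usual fix is to fit both models to the same complete cases. A sketch, assuming the df and variable names from the question:
# Keep only rows complete on every variable used by the larger model,
# so both fits have the same N and modelCompare() can compare them
vars <- c("DV", "var1", "var2", "cont.var")
df_cc <- df[complete.cases(df[, vars]), ]

Model_A <- lm(DV ~ var1 * var2 + cont.var, data = df_cc)
Model_C <- lm(DV ~ cont.var, data = df_cc)
modelCompare(Model_C, Model_A)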

Linear Discriminant Analysis in R - Training and validation samples

I am working with the lda command to analyze a 2-column, 234-row dataset (x): column X1 contains the predictor variable (metric) and column X2 the grouping variable (categorical, 4 categories). I would like to build a linear discriminant model using 150 observations and then use the other 84 observations for validation. After a random partitioning of the data I get x.build and x.validation, with 150 and 84 observations respectively. I run the following:
fit = lda(x.build$X2~x.build$X1, data=x.build, na.action="na.omit")
Then I run predict command like this:
pred = predict(fit, newdata=x.validation)
From reading the command descriptions I thought that in pred$class I would get the classification of the validation data according to the model built, but instead I get the classification of the 150 training observations rather than the 84 I intended to use as validation data. I don't really know what is happening; can someone please give me an example of how I should be conducting this analysis?
Thank you very much in advance.
Try this instead:
fit = lda(X2~X1, data=x.build, na.action="na.omit")
pred = predict(fit, newdata=x.validation)
If you use the formula x.build$X2 ~ x.build$X1 when you build the model, predict() looks for a column called x.build$X1 in the validation data. Obviously there isn't one, so you get predictions for the training data instead.
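For completeness, a sketch of the whole workflow ending in a confusion table (assuming x has columns X1 and X2 as described; the seed and the 150/84 split are illustrative):
library(MASS)                              # lda() comes from MASS

set.seed(1)                                # illustrative seed for the random split
idx <- sample(nrow(x), 150)
x.build      <- x[idx, ]
x.validation <- x[-idx, ]

fit  <- lda(X2 ~ X1, data = x.build, na.action = "na.omit")
pred <- predict(fit, newdata = x.validation)

# 84 validation predictions cross-tabulated against the observed classes
table(predicted = pred$class, observed = x.validation$X2)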
