Odd behavior with step() - r

step() and stepAIC() produce a "remove missing values error" when running the code on data with missing values.
Error in step(mod1, direction = "backward") :
number of rows in use has changed: remove missing values?
According to ?step:
The model fitting must apply the models to the same dataset. This may be
a problem if there are missing values and R's default of na.action = na.omit
is used. We suggest you remove the missing values first.
I have a data frame with one variable which has four na values. However, when I run step on the lm object, I don't get the "missing values" error even though it has missing values. Can anyone tell me what could be going on?
> d1$Impressions
[1] NA NA NA 6924180 9313226 27888455
18213812 54557205 13495553
...
This does not produce an error message:
mod1 = lm(Leads ~ G + Con + GOO + DAY + Res + SD + ED +
ME + Impressions + Inc + Sea, data=d1)
step(mod1, direction="backward")
stepAIC(mod1)
Even with a variable which has missing values, it's not generating an error message. Any ideas on what's going on?

One reason for the stated behaviour is this. step() fits the full model and hence drops 3 (as stated) observations due to presence of NAs. As long as the variables for which there are NAs remain in the model, the lm() function will remove those observations at each step. If stepping stops before it removes a variable that would result in one of the previously removed observations remaining in the model, then no error will be raised, because the numbers of rows in the model matrix will not have changed.
As an aside, stepwise selection like this is considered to be of somewhat dubious validity. Not least, in using it you a making a fairly bold statement that the effects of the eliminated variables are exactly equal to zero. This also has the effect of biasing the effect (estimated coefficients) of the variables retained in the model to have larger (absolute) value.
Alternatives to this stepwise selection include shrinkage methods such as the Lasso and the Elastic Net.

Related

lmer failing with na.pass

When I run a lmer model with lme4 using na.pass as the na.action, I get the following error:
R: NA/NaN/Inf in foreign function call (arg 1)
I run the model like this:
model1 <- lme4::lmer(agg_dv_singing ~ GMS.Musical.Training +
JAJ.ability + MDT.ability + MPT.ability + PDCT.ability +
PIAT.ability + agg_dv_long_note + demographics.age +
aggiv_entropy + aggiv_interval_complexity +
aggiv_rhythmic_complexity + aggiv_tonal_complexity +
log.freq + length + (1|p_id),
data = dat, na.action = na.pass)
summary(dat) indicates that there are no Inf or NaN values, although yes, there are many NA values.
Running na.pass outside of lmer on the same data set does not give an error:
na.pass(dat)
So what could be going wrong within lmer?
Comments to a previous question of yours attempted to explain that, in general, mixed model machinery cannot handle estimation from cases when there are missing values in the predictors; it just doesn't work that way. If you want to fit mixed models with missing data you need to do some form of imputation, i.e. filling in values for missing predictors (e.g. see the mice package, which is more or less the state of the art at least as far as the R ecosystem is concerned). Here is what the four different standard na.* actions do in the context of mixed models:
na.fail(): fail immediately if there are missing values in the data (predictors or response). This is frustrating, but alerts you immediately to the fact that you have missing data, and lets you decide what to do about it.
na.omit(): drop non-complete cases from the data before fitting.
na.exclude(): like na.omit(), but keep track of the locations of the excluded cases. When using predict() or residuals() (or any function that produces results per observation), reconstitute a complete data set with NA values for the non-complete cases in the original data set. (I usually find this setting to be the most useful default.)
na.pass: do not remove NA values, but attempt to continue with the fitting procedure. As you found out, this usually doesn't work at all! It will just pass the NA values down through the code until something goes wrong. Typically one of two things happens at this point:
if the entire estimation procedure is written using R functions that can handle and propagate missing values, then you'll usually get a fitted model object with NA/NaN for all coefficients, likelihoods, etc. etc. (because the missing values contaminate the entire fitting procedure);
if some step of the estimation procedure can't handle NA/NaN values (as in this case), you get an inscrutable error from the first point in the procedure that fails.
If you look at the source code of na.pass() (by typing na.pass at the R prompt), you'll see that in fact all it does is return the same object, unchanged. To be honest, I'm not really sure why na.pass even exists, except for completeness ... (or compatibility with S)
Your NA value was not in a parameter that is used in a random-effects term: if it had, you would have gotten a more interpretable error message:
library(lme4)
ss <- sleepstudy
ss[1,"Days"] <- NA
lmer(Reaction ~ Days + (Days|Subject), ss, na.action=na.pass)
Error in lme4::lFormula(formula = Reaction ~ Days + (Days | Subject), :
NA in Z (random-effects model matrix): please use "na.action='na.omit'" or "na.action='na.exclude'"
If I fit a model with (1|Subject), so that the NA value only affects the fixed effects
lmer(Reaction ~ Days + (1|Subject), ss, na.action=na.pass)
then we get your error message.
Error in qr.default(X, tol = tol, LAPACK = FALSE) :
NA/NaN/Inf in foreign function call (arg 1)
traceback() tells me that this happens in the internal chkRank.drop.cols() function, where R is trying to figure out if any of your fixed-effect columns are collinear. There should probably be a check for missing values there ...

TukeyHSD or glht in R, ANCOVA

I'm wondering if i can use the function "TukeyHSD" to perform the all pairwise comparisons of a "aov()" model with one factor (e.g., GROUP) and one continuous covariate (e.g., AGE). I did for example:
library(multcomp)
data('litter', package = 'multcomp')
litter.aov <- aov(weight ~ gesttime + dose, data = litter)
TukeyHSD(litter.aov, which = 'dose')
and i get a warning message like this:
Warning message:
In replications(paste("~", xx), data = mf): non-factor ignored: gesttime
Is this process above correct? What's the meaning of the warning message? And does "TukeyHSD" apply to badly unbalanced designs?
In addition, is there any difference between the processes above and below?
litter.mc <- glht(litter.aov, linfct = mcp(dose = 'Tukey'))
summary(litter.mc)
Best, Sue
There's no difference. TukeyHSD() is just a bit more eager to tell you about potential problems. Notice that it's a warning message, not an error, meaning that the results might not be what you expect, but they'll still be returned so you can judge for yourself.
As for what it means, it means what it says: non-factor variables are ignored. Remember that you are comparing the differences between groups, and grouping is done using factors, so factors are all TukeyHSD() care about. In your case you explicitly tell the function to only care about dose, which is factor, so the warning might be seen as overly cautious.
One way of avoiding the warning would be to convert gesttime into a factor, and as it consist of only four levels it makes some sense to do so.
data('litter', package = 'multcomp')
litter$gesttime <- as.factor(litter$gesttime)
litter.aov <- aov(weight ~ gesttime + dose, data = litter)
TukeyHSD(litter.aov, which = 'dose')
I know this is an old thread but I'm not sure the existing answers are quite right...
I've been trying both functions with my own data and have a similar situation to Sue, where TukeyHSD gives a warning message about ignoring non-factor covariates, while glht() does not.
It does not appear that they are doing the same thing contrary to the other answer. The results are different and it appears that TukeyHSD is not marginalizing over the non-factor covariates (as the warning states). It appears that glht() correctly uses the mean value of non-factor covariates to compute the marginal mean of the groups of interest since the point estimates are the same as those obtained from lsmeans().
So it does not seem that TukeyHSD is overly cautious, it just seems that it can't handle non-factor covariates while glht is able to. So glht seems to be the correct function to use in this case, to me.

residuals() function error: replacement has x rows and data has y rows

I have a data set that has reading time for each word that numerous individuals read.
I am trying to calculate reading time residuals for each individual in my data. Word lengths and the order of presentation (of a particular word) are factors in calculating a regression for each individual.
The reading time was log-transformed (logRT) and word lengths were calculated by nchar(). The order of presentation is also log-transformed.
model1<-lmer(logRT~wlen+log(order)+(1|subject), data=mydata)
Then, I try to get a residual column for every data point by doing the following,
mydata$logResid<-residuals(model1)
Then, I get this error.
Error in `$<-.data.frame`(`*tmp*`, "LogResid", value = c(0.145113408056189, :
replacement has 30509 rows, data has 30800
Does anyone have any advice? I am totally confused. Since this is an analysis I've been doing every day with no such error so far. It is even more confusing.
I would say you should try
model1 <- lmer(logRT~wlen+log(order)+(1|subject), data=mydata,
na.action=na.exclude)
and see if that helps; it should fill in NA values in the appropriate places.
From ?na.exclude:
... when ‘na.exclude’ is used the residuals and
predictions are padded to the correct length by inserting ‘NA’s
for cases omitted by ‘na.exclude’.

randomForest() machine learning in R

I am exploring with the function randomforest() in R and several articles I found all suggest using a similar logic as below, where the response variable is column 30 and independent variables include everthing else except for column 30:
dat.rf <- randomForest(dat[,-30],
dat[,30],
proximity=TRUE,
mtry=3,
importance=TRUE,
do.trace=100,
na.action = na.omit)
When I try this, I got the following error messages:
Error in randomForest.default(dat[, -30], dat[, 30], proximity = TRUE, :
NA not permitted in predictors
In addition: Warning message:
In randomForest.default(dat[, -30], dat[, 30], proximity = TRUE, :
The response has five or fewer unique values. Are you sure you want to do regression?
However, I was able to get it to work when I listed the independent variables one by one while keeping all the other parameters the same.
dat.rf <- randomForest(as.factor(Y) ~X1+ X2+ X3+ X4+ X5+ X6+ X7+ X8+ X9+ X10+......,
data=dat
proximity=TRUE,
mtry=3,
importance=TRUE,
do.trace=100,
na.action = na.omit)
Could someone help me debug the simplier command where I don't have to list each predictor one by one?
The error message gives you a clue to two problems:
First, you need to remove any row that has a NA anywhere. Removing NA should be easy enough and I'll leave you that one as an exercise.
It looks like you need to do classification (which predicts a response which only has one of a few discrete levels), rather than regression (which predicts a continuous response). If the response is continuous, randomForest() will automatically apply regression.
So, how do you force randomForest() to use classification?As you noticed in your first try, randomForest allows you to give data as predictors and response data, not just using the formula style. To force randomForest() to apply classification, make sure that the value you are trying to predict (the response, or dat[,30]) is a factor. Remember to explicitly identify the $x$ and $y$ arguments. This is easy to do:
randomForest(x = dat[,-30],
y = factor(dat[,30]),
...)
This way your output can only take one of the levels given in y.
This is all buried in the description of the arguments $x$ and $y$: see ?help.

Explanation of the formula object used in the coxph function in R

I am a complete novice when it comes to survival analysis. I am working on a project that requires I use the coxph function in the "survival" package, but I am running into trouble because I do not understand what is required by the formula object.
Most descriptions I can find about the function are as follows:
"a formula object, with the response on the left of a ~ operator, and the terms on the right. The response must be a survival object as returned by the Surv function. "
I know what needs to be on the left of the operator, the issue is what the function expects from the right-hand side.
Here is a link of what my data looks like (The actual data set is much larger, I'm only displaying the first 20 data points for brevity):
Short explanation of data:
-Row 1 is the header
-Each row after that is a separate patient
-The first column is the age of the patient at the time of the study
-columns 2 through 14 (headed by x2-x13), and 19 (x18) and 20 (x19) are covariates such as race, relationship status, medical conditions that take on either true (1) or false (0) values.
-columns 15 (x14) through 18 (x17) are covariates such as tumor size, which take on whole number values greater than 0.
-The second to last column "sur" is the number of months survived, and "index" is whether or not that is a right-censored time (1 for true, 0 for false).
Given this data I need to plot a Cox Proportional hazard curve, but I end up with an incorrect plot because the right hand side of the formula object is wrong.
Here is my code, "temp4" is the name I gave to the data table:
library("survival")
temp4 <- read.table("~/data.txt", header=TRUE)
seerCox <- coxph(Surv(sur, index)~ temp4$x1 + temp4$x2 + temp4$x3 + temp4$x4 + temp4$x5 + temp4$x6 + temp4$x7 + temp4$x8 + temp4$x9 + temp4$x10 + temp4$x11 + temp4$x12 + temp4$x13 + temp4$x14 + temp4$x15 + temp4$x16 + temp4$x17 + temp4$x18 + temp4$x19, data=temp4, singular.ok=TRUE)
plot(survfit(seerCox), main= "Cox Estimate", mark.time=FALSE, ylab="Probability", xlab="Survival Time in Months", col=c("blue", "red", "green"))
I should also note that I have tried replacing the right hand side that you're seeing with the number 1, a period, leaving it blank. These methods produce a kaplan-meier curve.
The following is the console output:
Each new line is an example of the error produced depending on how I filter the data. (ie if I only include patients with ages greater than 85, etc.)
If someone could explain how it works, it would be greatly appreciated.
PS- I have searched for over a week to my solution, and I am asking for help here as a last resort.
You should not be using the prefix temp$ if you are also using a data argument. The whole purpose of supplying a data argument is to allow dropping those in the formula.
seerCox <- coxph( Surv(sur, index) ~ . , data=temp4, singular.ok=TRUE)
The above would use all of the x-variables in your temp data.frame. This will use just the first 3:
seerCox <- coxph( Surv(sur, index) ~ x1+x2+x3 , data=temp4)
Exactly what the warnings signify depends on the data (as you have in one sense already exemplified by producing different sorts of collinearity with different subsets.) If you have collinear columns, then you get singularities in the inversion of the model matrix and the software will attempt to drop aliased columns with a warning. This is really telling you that you do not have enough data to build the large models you are attempting. Exploring that possibility with table calls is often informative.
Bottom line: This is not a problem with your formula construction, so much as it is a problem of not understanding the limitations of the chosen method with the dataset you have assembled. You need to be more careful about defining your goals. What is the highest priority in this research? Do you really need every variable? Is it possible to aggregate some of these anonymous variables into clinically meaningful categories such as diagnostic categories or comorbities?

Resources