TukeyHSD or glht in R, ANCOVA - r

I'm wondering if I can use the function TukeyHSD() to perform all pairwise comparisons for an aov() model with one factor (e.g., GROUP) and one continuous covariate (e.g., AGE). For example, I ran:
library(multcomp)
data('litter', package = 'multcomp')
litter.aov <- aov(weight ~ gesttime + dose, data = litter)
TukeyHSD(litter.aov, which = 'dose')
and I get a warning message like this:
Warning message:
In replications(paste("~", xx), data = mf): non-factor ignored: gesttime
Is this process above correct? What's the meaning of the warning message? And does "TukeyHSD" apply to badly unbalanced designs?
In addition, is there any difference between the processes above and below?
litter.mc <- glht(litter.aov, linfct = mcp(dose = 'Tukey'))
summary(litter.mc)
Best, Sue

There's no difference. TukeyHSD() is just a bit more eager to tell you about potential problems. Notice that it's a warning message, not an error, meaning that the results might not be what you expect, but they'll still be returned so you can judge for yourself.
As for what it means, it means what it says: non-factor variables are ignored. Remember that you are comparing the differences between groups, and grouping is done using factors, so factors are all TukeyHSD() cares about. In your case you explicitly tell the function to only consider dose, which is a factor, so the warning might be seen as overly cautious.
One way of avoiding the warning would be to convert gesttime into a factor, and as it has only four distinct values it makes some sense to do so.
data('litter', package = 'multcomp')
litter$gesttime <- as.factor(litter$gesttime)
litter.aov <- aov(weight ~ gesttime + dose, data = litter)
TukeyHSD(litter.aov, which = 'dose')

I know this is an old thread but I'm not sure the existing answers are quite right...
I've been trying both functions with my own data and have a similar situation to Sue, where TukeyHSD gives a warning message about ignoring non-factor covariates, while glht() does not.
It does not appear that they are doing the same thing contrary to the other answer. The results are different and it appears that TukeyHSD is not marginalizing over the non-factor covariates (as the warning states). It appears that glht() correctly uses the mean value of non-factor covariates to compute the marginal mean of the groups of interest since the point estimates are the same as those obtained from lsmeans().
So it does not seem that TukeyHSD is overly cautious, it just seems that it can't handle non-factor covariates while glht is able to. So glht seems to be the correct function to use in this case, to me.
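For anyone who wants to check this on their own data, here is a minimal sketch of the comparison. It uses the built-in mtcars data as a stand-in (cyl as the factor, wt as the covariate), which is my assumption and not Sue's data. It prints the differences reported by TukeyHSD() next to the covariate-adjusted differences taken from the model coefficients, and, if multcomp happens to be installed, the glht() estimates as well.

```r
# Stand-in data: mtcars with cyl as the grouping factor and wt as the covariate
d <- transform(mtcars, cyl = factor(cyl))
fit <- aov(mpg ~ wt + cyl, data = d)

# TukeyHSD() warns that the non-factor wt is ignored; keep only the estimates
tukey_diff <- suppressWarnings(TukeyHSD(fit, which = "cyl"))$cyl[, "diff"]

# Covariate-adjusted differences, read directly from the model coefficients
# (with the default treatment contrasts, cyl = 4 is the reference level)
b <- coef(fit)
adj_diff <- c("6-4" = unname(b["cyl6"]),
              "8-4" = unname(b["cyl8"]),
              "8-6" = unname(b["cyl8"] - b["cyl6"]))

print(cbind(TukeyHSD = tukey_diff, adjusted = adj_diff))

# Optional: glht() estimates, only if multcomp is available
if (requireNamespace("multcomp", quietly = TRUE)) {
  mc <- multcomp::glht(fit, linfct = multcomp::mcp(cyl = "Tukey"))
  print(coef(mc))
}
```

If the two columns disagree for your model, that is the same discrepancy described above.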

Related

glmmLasso error and warning

I am trying to perform variable selection in a generalized linear mixed model using glmmLasso, but I am getting an error and a warning that I cannot resolve. The dataset is unbalanced, with some participants (PTNO) having more samples than others; there is no missing data. My dependent variable is binary; all other variables (besides the ID variable PTNO) are continuous.
I suspect something very generic is happening, but obviously fail to see it and have not found any solution in the documentation or on the web.
The code, which is basically just adapted from the glmmLasso soccer example is:
glm8 <- glmmLasso(Group~NDUFV2_dCTABL+GPER1_dCTABL+ ESR1_dCTABL+ESR2_dCTABL+KLF12_dCTABL+SP4_dCTABL+SP1_dCTABL+ PGAM1_dCTABL+ANK3_dCTABL+RASGRP1_dCTABL+AKT1_dCTABL+NUDT1_dCTABL+ POLG_dCTABL+ ADARB1_dCTABL+OGG_dCTABL+ PDE4B_dCTABL+ GSK3B_dCTABL+ APOE_dCTABL+ MAPK6_dCTABL, rnd = list(PTNO=~1),
family = poisson(link = log), data = stackdata, lambda=100,
control = list(print.iter=TRUE,start=c(1,rep(0,29)),q.start=0.7))
The error message is displayed below. Specifically, I do not believe there are any NAs in the dataset, and I am unsure about the meaning of the warning regarding the factor variable.
Iteration 1
Error in grad.lasso[b.is.0] <- score.beta[b.is.0] - lambda.b * sign(score.beta[b.is.0]) :
NAs are not allowed in subscripted assignments
In addition: Warning message:
In Ops.factor(y, Mu) : ‘-’ not meaningful for factors
An abbreviated dataset containing the necessary variables is available in R format and can be downloaded here.
I hope I can be guided a bit as to how to go on with the analysis. Please let me know if there is anything wrong with the dataset or you cannot download it. ANY help is much appreciated.
Just to follow up on @Kristofersen's comment above: it is indeed the start vector that messes up your analysis.
If I run
glm8 <- glmmLasso(Group~NDUFV2_dCTABL+GPER1_dCTABL+ ESR1_dCTABL+ESR2_dCTABL+KLF12_dCTABL+SP4_dCTABL+SP1_dCTABL+ PGAM1_dCTABL+ANK3_dCTABL+RASGRP1_dCTABL+AKT1_dCTABL+NUDT1_dCTABL+ POLG_dCTABL+ ADARB1_dCTABL+OGG_dCTABL+ PDE4B_dCTABL+ GSK3B_dCTABL+ APOE_dCTABL+ MAPK6_dCTABL,
rnd = list(PTNO=~1),
family = binomial(),
data = stackdata,
lambda=100,
control = list(print.iter=TRUE))
then everything is fine and dandy (i.e., it converges and produces a solution). You have copied the example with poisson regression and you need to tweak the code to your situation. I have no idea about whether the output makes sense.
Quick note: I ran with the binomial distribution in the code above since your outcome is binary. If it makes sense to estimate relative risks then Poisson may be reasonable (and it also converges), but you need to recode your outcome first: the two groups are coded as 1 and 2 rather than 0 and 1, and that will certainly mess up the Poisson regression.
In other words do a
stackdata$Group <- stackdata$Group-1
before you run the analysis.
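One caveat: the warning in the question ('-' not meaningful for factors) suggests Group may be stored as a factor, in which case subtracting 1 directly would give NA. A small sketch of a recode that works either way, using a made-up vector in place of stackdata$Group (I have not checked this against the actual dataset):

```r
# Hypothetical stand-in for stackdata$Group, coded 1/2 and stored as a factor
group <- factor(c(1, 2, 2, 1, 2))

# as.integer() on a factor returns the level index (1, 2), so subtracting 1
# yields 0/1; the same line also works if group is already numeric 1/2
group01 <- as.integer(as.factor(group)) - 1L

table(original = group, recoded = group01)
```

Applied to the real data this would be stackdata$Group <- as.integer(as.factor(stackdata$Group)) - 1L, assuming Group really has exactly two distinct values.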

Error in bn.fit predict function in bnlearn R

I have learned and fitted a Bayesian network with the bnlearn R package and I wish to predict the value of its "event" node.
fl="data/discrete_kdd_10.txt"
h=TRUE
dtbl1 = read.csv(file=fl, head=h, sep=",")
net=hc(dtbl1)
fitted=bn.fit(net,dtbl1)
I want to predict the value of "event" node based on the evidence stored in another file with the same structure as the file used for learning.
fileName="data/dcmp.txt"
dtbl2 = read.csv(file=fileName, head=h, sep=",")
predict(fitted,"event",dtbl2)
However, predict fails with
Error in check.data(data) : variable duration must have at least two levels.
I don't understand why there should be any restriction on number of levels of variables in the evidence data.frame.
The dtbl2 data.frame contains only few rows, one for each scenario in which I want to predict the "event" value.
I know I can use cpquery, but I wish to use the predict function also for networks with mixed variables (both discrete and continuous). I haven't found out how to make use of evidence on a continuous variable in cpquery.
Can someone please explain what I'm doing wrong with the predict function and how should I do it right?
Thanks in advance!
The problem was that reading the evidence data.frame in
fileName="data/dcmp.txt"
dtbl2 = read.csv(file=fileName, head=h, sep=",")
predict(fitted,"event",dtbl2)
caused the categorical variables to become factors with fewer levels than in the training data (each one a subset of the levels in the original training set).
I used the following code to solve this issue.
for (i in 1:dim(dtbl2)[2]) {
  dtbl2[[i]] = factor(dtbl2[[i]], levels = levels(dtbl1[[i]]))
}
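To illustrate what goes wrong and what the loop fixes, here is a self-contained sketch with two tiny made-up data frames standing in for dtbl1 and dtbl2 (the column values are hypothetical). It matches columns by name rather than by position, which is a small variation on the loop above and a bit safer if the column order differs between files.

```r
# Hypothetical training and evidence frames standing in for dtbl1 and dtbl2
train <- data.frame(duration = factor(c("low", "high", "mid")),
                    event    = factor(c("yes", "no", "yes")))
evid  <- data.frame(duration = factor("low"),
                    event    = factor("no"))

nlevels(evid$duration)   # 1: only the values present in the small file

# Re-create each column with the training levels, matching columns by name
for (nm in names(evid)) {
  evid[[nm]] <- factor(evid[[nm]], levels = levels(train[[nm]]))
}

nlevels(evid$duration)   # 3: same level set as the training data
```

The values themselves are unchanged; only the set of allowed levels is widened to match the training data.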
By the way bnlearn package does fit models with mixed variables and also provides functions for predictions in them.

randomForest() machine learning in R

I am exploring the function randomForest() in R, and several articles I found all suggest using logic similar to the code below, where the response variable is column 30 and the independent variables are everything except column 30:
dat.rf <- randomForest(dat[,-30],
dat[,30],
proximity=TRUE,
mtry=3,
importance=TRUE,
do.trace=100,
na.action = na.omit)
When I try this, I get the following error messages:
Error in randomForest.default(dat[, -30], dat[, 30], proximity = TRUE, :
NA not permitted in predictors
In addition: Warning message:
In randomForest.default(dat[, -30], dat[, 30], proximity = TRUE, :
The response has five or fewer unique values. Are you sure you want to do regression?
However, I was able to get it to work when I listed the independent variables one by one while keeping all the other parameters the same.
dat.rf <- randomForest(as.factor(Y) ~X1+ X2+ X3+ X4+ X5+ X6+ X7+ X8+ X9+ X10+......,
data=dat,
proximity=TRUE,
mtry=3,
importance=TRUE,
do.trace=100,
na.action = na.omit)
Could someone help me debug the simpler command, where I don't have to list each predictor one by one?
The error and warning messages give you a clue to two problems:
First, you need to remove any row that has an NA anywhere. Removing NAs should be easy enough and I'll leave you that one as an exercise.
Second, it looks like you need classification (predicting a response that takes one of a few discrete levels) rather than regression (predicting a continuous response). If the response is numeric, randomForest() automatically applies regression, which is what the warning is telling you.
So, how do you force randomForest() to use classification? As you noticed in your first try, randomForest() accepts the predictors and the response as separate arguments, not only as a formula. To make it apply classification, make sure that the value you are trying to predict (the response, dat[,30]) is a factor. Remember to name the x and y arguments explicitly. This is easy to do:
randomForest(x = dat[,-30],
y = factor(dat[,30]),
...)
This way your output can only take one of the levels given in y.
This is all buried in the description of the arguments x and y: see ?randomForest.
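Putting both points together, a minimal sketch might look like the following. The data frame is made up for illustration (a 0/1 response Y and two predictors, one with a missing value); it is not the asker's data. The error in the question suggests that na.action is not applied in the x/y interface, so the rows are filtered by hand first. The last call also shows the formula shorthand Y ~ ., which avoids listing every predictor.

```r
# Hypothetical data standing in for dat: a 0/1 response Y, one NA in X1
dat <- data.frame(X1 = c(1.2, NA, 3.1, 4.8, 5.0, 2.2),
                  X2 = c(0.5, 1.5, 2.5, 3.5, 4.5, 5.5),
                  Y  = c(0, 1, 0, 1, 1, 0))

# Keep only rows with no missing values in any column
dat_cc <- dat[complete.cases(dat), ]

# A factor response makes randomForest() do classification
dat_cc$Y <- factor(dat_cc$Y)

x <- dat_cc[, names(dat_cc) != "Y"]   # predictors only
y <- dat_cc$Y                         # response

if (requireNamespace("randomForest", quietly = TRUE)) {
  # x/y interface
  rf1 <- randomForest::randomForest(x = x, y = y)
  # Formula interface: the dot stands for every column except the response
  rf2 <- randomForest::randomForest(Y ~ ., data = dat_cc)
  print(rf1$type)
}
```

With the real data, x would be dat[, -30] and y would be factor(dat[, 30]) after the incomplete rows have been dropped.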

Odd behavior with step()

step() and stepAIC() produce a "remove missing values error" when running the code on data with missing values.
Error in step(mod1, direction = "backward") :
number of rows in use has changed: remove missing values?
According to ?step:
The model fitting must apply the models to the same dataset. This may be
a problem if there are missing values and R's default of na.action = na.omit
is used. We suggest you remove the missing values first.
I have a data frame with one variable that has four NA values. However, when I run step() on the lm object, I don't get the "missing values" error even though the data has missing values. Can anyone tell me what could be going on?
> d1$Impressions
[1] NA NA NA 6924180 9313226 27888455
18213812 54557205 13495553
...
This does not produce an error message:
mod1 = lm(Leads ~ G + Con + GOO + DAY + Res + SD + ED +
ME + Impressions + Inc + Sea, data=d1)
step(mod1, direction="backward")
stepAIC(mod1)
Even with a variable which has missing values, it's not generating an error message. Any ideas on what's going on?
One reason for the stated behaviour is this. step() fits the full model and hence drops 3 (as stated) observations due to presence of NAs. As long as the variables for which there are NAs remain in the model, the lm() function will remove those observations at each step. If stepping stops before it removes a variable that would result in one of the previously removed observations remaining in the model, then no error will be raised, because the numbers of rows in the model matrix will not have changed.
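The row-count mechanism can be seen in a small simulated example. The data below are made up (x2 is missing in its first three rows); the point is only that the number of rows used by lm() changes once the variable with NAs leaves the model, which is the condition step() checks for.

```r
# Hypothetical data: x2 has missing values in its first three rows
set.seed(1)
d <- data.frame(x1 = rnorm(20), x2 = c(NA, NA, NA, rnorm(17)))
d$y <- 1 + 2 * d$x1 + rnorm(20)

full    <- lm(y ~ x1 + x2, data = d)   # rows with NA in x2 are dropped
reduced <- lm(y ~ x1, data = d)        # x2 gone, so all rows are used again

nobs(full)      # 17
nobs(reduced)   # 20

# Dropping incomplete rows first keeps every candidate model on the same rows
d_cc <- na.omit(d)
full_cc <- lm(y ~ x1 + x2, data = d_cc)
step(full_cc, direction = "backward", trace = 0)
```

Whether step() on the original fit raises the error depends on whether the search ever tries to keep a model without x2, which is why the error can fail to appear.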
As an aside, stepwise selection like this is considered to be of somewhat dubious validity. Not least, in using it you are making a fairly bold statement that the effects of the eliminated variables are exactly equal to zero. This also biases the estimated coefficients of the variables retained in the model towards larger absolute values.
Alternatives to this stepwise selection include shrinkage methods such as the Lasso and the Elastic Net.

Regression coefficients by group in dataframe R

I have data of various companies' financial information organized by company ticker. I'd like to regress one of the columns' values against the others while keeping the company constant. Is there an easy way to write this out in lm() notation?
I've tried using:
reg <- lmList(lead2.dDA ~ paudit1 + abs.d.GINDEX + logcapx + logmkvalt +
logmkvalt2|pp, data=reg.df)
where pp is a vector of company names, but this returns coefficients as though I regressed all the data at once (and did not separate by company name).
A convenient and apparently little-known syntax for estimating separate regression coefficients by group in lm() involves using the nesting operator, /. In this case it would look like:
reg <- lm(lead2.dDA ~ 0 + pp/(paudit1 + abs.d.GINDEX + logcapx +
logmkvalt + logmkvalt2), data=reg.df)
Make sure that pp is a factor and not a numeric. Also notice that the overall intercept must be suppressed for this to work; in the new formulation, we have a different "intercept" for each group.
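As a quick check of the nesting syntax, here is a sketch on the built-in iris data, with Species standing in for pp (my choice of stand-in, since the original data are not available). It compares the coefficients from the single nested lm() fit with those from separate per-group regressions.

```r
# Stand-in data: iris, with Species playing the role of pp
pooled <- lm(Sepal.Length ~ 0 + Species / Petal.Length, data = iris)

# Separate fits, one per group; result is a 2 x 3 matrix
# (rows: intercept and slope, columns: groups)
by_group <- sapply(split(iris, iris$Species),
                   function(d) coef(lm(Sepal.Length ~ Petal.Length, data = d)))

# The pooled coefficients come as three group intercepts followed by
# three group slopes; rearrange them into the same 2 x 3 layout
pooled_mat <- matrix(coef(pooled), nrow = 2, byrow = TRUE,
                     dimnames = dimnames(by_group))

all.equal(pooled_mat, by_group)
```

The point estimates agree; what differs between the two approaches is the residual variance used for standard errors.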
A couple comments:
Although the regression coefficients obtained this way will match those given by lmList(), it should be noted that with lm() we estimate only a single residual variance across all the groups, whereas lmList() would estimate separate residual variances for each group.
Like I mentioned in my earlier comment, the lmList() syntax that you gave looks like it should have worked. Since you say it didn't, this leads me to expect that really the problem is something else (although it's hard to tell what without a reproducible example), and so it seems likely that the solution I posted will fail for you as well, for the same unknown reasons. If you want more detailed guidance, please provide more information; help us help you.
