ANOVA on ranks vs. Kruskal-Wallis: how different is it? - r

I'm not sure this is the perfect place for such a question, but maybe you can help me.
I want to check for differences in a quantitative variable between 3 treatments, i.e. perform an ANOVA.
Unfortunately, the residuals of my model aren't normally distributed.
I usually have two solutions here: transform my data or use a non-parametric equivalent of the test (here a Kruskal-Wallis rank test).
None of the transformations I tried managed to satisfy normality (log, 1/x, square root, Tukey and Box-Cox power), so I wanted to use a Kruskal-Wallis test and move on.
However, my project manager insisted on having only ANOVAs and talked about ANOVA on ranks as a magic solution.
Working in R, I looked for examples and found the function art from the ARTool package, which performs an ANOVA on ranks.
library(ARTool)
model <- art(variable~treatment,data)
anova(model)
Basically, it takes your variable and replaces it by its rank (handling ties by averaging), so that:
model2 <- lm(rank(variable, ties.method = "average")~treatment,data)
anova(model2)
gives exactly the same output.
I'm not an expert statistician, and I wonder how valid this solution/transformation is.
It seems quite brutal to me, and not that far from the logic of the Kruskal-Wallis test,
even though the statistic is not computed on the ranks in the same way.
I find it very confusing to have an 'ANOVA on ranks' test that is different from the Kruskal-Wallis test (also known as one-way ANOVA on ranks), and I don't know how to choose between the two.
I don't know if I've been very clear, but I hope someone can help me. Anyway,
thanks for your attention and comments!
PS: here is an example on dummy data
library(ARTool)
# note that the dummy data are random (no seed set), so you won't get exactly the same results
treatment <- as.factor(c(rep("A",100),rep("B",100),rep("C",100)))
variable <- as.numeric(c(sample(c(0:30),100,replace=T),sample(c(10:40),100,replace=T),sample(c(5:35),100,replace=T)))
dummy <- data.frame(treatment,variable)
model <- art(variable ~ treatment, data = dummy)
anova(model) #f.value = 30.746 and p = 7.312e-13
model2 <- lm(rank(variable, ties.method = "average")~treatment,dummy)
anova(model2) #f.value = 30.746 and p = 7.312e-13
kruskal.test(variable~treatment,dummy)
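As a side note, the two are tightly linked: with mid-ranks, the Kruskal-Wallis H equals (N - 1) times the ratio of the between-group to the total sum of squares of the ranks, so both tests work off the same between-group information. A quick check on the dummy data above (a sketch, not part of the original output):
r <- rank(dummy$variable, ties.method = "average")
ss_between <- sum(tapply(r, dummy$treatment,
                         function(x) length(x) * (mean(x) - mean(r))^2))
ss_total <- sum((r - mean(r))^2)
H <- (length(r) - 1) * ss_between / ss_total    # Kruskal-Wallis H (ties handled by mid-ranks)
H
kruskal.test(variable ~ treatment, data = dummy)$statistic  # should match H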

Related

Model Syntax for Simple Moderation Model in Lavaan (with bootstrapping)

I am a social scientist currently running a simple moderation model in R, in the form of y ~ x + m + m * x. My moderator is a binary categorical variable (two separate groups).
I started out with lm(), bootstrapped estimates with boot(), and obtained BCa confidence intervals with boot.ci(). Since there is no automated way of doing this for all parameters (at my coding level, at least), this is a bit tedious. However, I now saw that the lavaan package offers bootstrapping as part of the regular sem() function, and also BCa CIs as part of parameterEstimates(). So I was wondering (since I am using lavaan in other analyses) whether I could just replace lm() with lavaan for the sake of keeping my work more consistent.
Doing this, I was wondering about what the equivalent model for lavaan would be to test for moderation in the same way. I saw this post where Jeremy Miles proposes the code below, which I follow mostly.
mod.1 <- "
y ~ c(a, b) * x
y ~~ c(v1, v1) * y # This step needed for exact equivalence
y ~ c(int1, int2) * 1
modEff := a - b
mEff := int1 - int2"
But it would be great if you could help me figure out some final things.
1) What does the y ~~ c(v1, v1) * y part mean, and why is it needed for "exact equivalence" to the lm model? From the output, it seems this constrains the variances of the outcome to the same value in both groups?
2) From the post, am I right to understand that either including the interaction effect as calculated above OR constraining (only) the slope between models and comparing model fit with anova() would be the same test for moderation?
3) The lavaan page says that adding test = "bootstrap" to the sem() function allows for bootstrap-adjusted p-values. However, I have read a lot about p-values conflicting with the BCa CIs at times, and this has happened to me. Searching around, I understand that this conflict comes from the assumptions about the distribution of the data under H0 made for p-values but not for CIs (which just give the range of most likely values). I was therefore wondering what it exactly means that the p-values given here are "bootstrap-adjusted"? Is it technically more sound to report these for my SEM models rather than the CIs?
Many questions, but I would be very grateful for any help you can provide.
Best,
Alex
I think I can answer at least questions 1 and 2, but it is probably easier not to use SEM and instead to write a function that conveniently gives you CIs for all the coefficients of your model.
So first, to answer your questions:
What is proposed in the code you gave is called a multigroup comparison. Essentially this means that you fit the same SEM to two different groups of cases in your dataset. It is equivalent to a moderated regression with a binary moderator because in both cases you get two slopes (often called "simple slopes") for the scalar predictor, one slope per group of the moderator.
Now, in your lavaan code you only see the scalar predictor x. The binary moderator is implied by group="m" when you fit the model with fit.1 <- sem(mod.1, data = df, group = "m") (took this from the page you linked).
The two-element vectors (c( , )) in the lavaan code specify named parameters for the first and second group, respectively. By y ~~ c(v1, v1) * y, the residual variances of y are set equal in both groups because they share the same label. In contrast, the slopes c(a, b) and the intercepts c(int1, int2) are allowed to vary between groups.
Yes. If you use the SEM, you would fit the model a second time adding the constraint a == b and compare this model to the first version, in which the slopes can differ. This is the same as comparing lm() models with and without a:b (or a*b) in the formula.
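A minimal sketch of that comparison in lavaan (assuming a data frame df with y, x, and the binary moderator m, as in the post you linked; lavTestLRT() or anova() can compare the two fits):
library(lavaan)
mod.free <- "
y ~ c(a, b) * x
y ~~ c(v1, v1) * y
y ~ c(int1, int2) * 1
"
# Same model, but with the slopes constrained to be equal across groups
mod.equal <- paste(mod.free, "a == b", sep = "\n")
fit.free  <- sem(mod.free,  data = df, group = "m")
fit.equal <- sem(mod.equal, data = df, group = "m")
# Likelihood-ratio test of the slope constraint = test of moderation
lavTestLRT(fit.equal, fit.free)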
Here I cannot provide a direct answer. I suspect that if you want BCa CIs like those you get from applying boot.ci to an lm fit, this may not be implemented. In the lavaan documentation, BCa confidence intervals are only mentioned once: in the section about the parameterEstimates function, which can also perform bootstrapping (see p. 89). However, it does not produce actual BCa (bias-corrected and accelerated) CIs, only bias-corrected ones.
As mentioned above, I guess the simplest solution would be to use lm() and either repeat the boot.ci procedure for each coefficient or write a wrapper function that does this for you. I suggest this also because a reviewer may be quite puzzled to see you do multigroup SEM instead of a simple moderated regression, which is much more common.
You probably did something like this already:
lm_fit <- function(dat, idx) coef( lm(y ~ x*m, data=dat[idx, ]) )
bs_out <- boot::boot(mydata, statistic=lm_fit, R=1000)
ci_out <- boot::boot.ci(bs_out, conf=.95, type="bca", index=1)
Now, either you repeat the last line for each coefficient, i.e., varying index from 1 to 4. Or you get fancy and let R do the repeating with a function like this:
all_ci <- function(bs) {
  est   <- bs$t0
  lower <- vector("numeric", length(bs$t0))
  upper <- lower
  for (i in seq_along(bs$t0)) {
    ci <- tail(boot::boot.ci(bs, type = "bca", index = i)$bca[1, ], 2)
    lower[i] <- ci[1]
    upper[i] <- ci[2]
  }
  cbind(est, lower, upper)
}
all_ci(bs_out)
I am sure this could be written more concisely but it should work fine for bootstraps of simple lm() models.
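For example, a slightly more compact variant of the same idea (just a sketch):
all_ci2 <- function(bs) {
  ci <- t(sapply(seq_along(bs$t0), function(i)
    tail(boot::boot.ci(bs, type = "bca", index = i)$bca[1, ], 2)))
  cbind(est = bs$t0, lower = ci[, 1], upper = ci[, 2])
}
all_ci2(bs_out)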

R language, how to use bootstraps to generate maximum likelihood and AICc?

Sorry for quite a basic question. I am doing multiple comparisons of morphological traits through correlations of bootstrapped data. I'm curious whether such multiple comparisons affect my level of inference, and also about the effect of potential multicollinearity in my data. Perhaps a reasonable option would be to use my bootstraps to generate maximum likelihood estimates and then AICc values to compare all of my parameters, to see what comes out as most important... the problem is that, although the approach is (more or less) clear to me, I don't know how to implement it in R. Can anybody be so kind as to throw some light on this for me?
So far, here is an example (using R, but not my data):
library(boot)
data(iris)
head(iris)
# Statistic function for boot(): Pearson correlation plus the two medians
pearson <- function(data, indices){
  dt <- data[indices, ]
  c(
    cor(dt[, 1], dt[, 2], method = 'p'),
    median(dt[, 1]),
    median(dt[, 2])
  )
}
# One example: iris$Sepal.Length ~ iris$Sepal.Width
# I calculate the correlation coefficient with 1000 bootstrap replications
set.seed(12345)
dat <- iris[,c(1,2)]
dat <- na.omit(dat)
results <- boot(dat, statistic=pearson, R=1000)
# 95% CIs
boot.ci(results, type="bca")
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1000 bootstrap replicates
CALL :
boot.ci(boot.out = results, type = "bca")
Intervals :
Level BCa
95% (-0.2490, 0.0423 )
Calculations and Intervals on Original Scale
plot(results)
I have several more pairs of comparisons.
More of a Cross Validated question.
Multicollinearity shouldn't be a problem if you're just assessing the relationship between two variables (in your case correlation). Multicollinearity only becomes an issue when you fit a model, e.g. multiple regression, with several highly correlated predictors.
Multiple comparisons are always a problem, though, because they inflate your type-I error rate. The way to address that is to apply a multiple-comparison correction, e.g. Bonferroni-Holm or the less conservative FDR. That can have its downsides, though, especially if you have a lot of predictors and few observations: it may lower your power so much that you won't be able to find any effect, no matter how big it is.
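For example, with a vector of p-values (made-up numbers, just to illustrate p.adjust):
# Hypothetical p-values from several pairwise correlation tests
pvals <- c(0.003, 0.020, 0.041, 0.180)
p.adjust(pvals, method = "holm")  # Bonferroni-Holm
p.adjust(pvals, method = "BH")    # Benjamini-Hochberg FDR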
In a high-dimensional setting like this, your best bet may be some sort of regularized regression method. With regularization, you put all predictors into your model at once, similarly to multiple regression; the trick is that you constrain the model so that all of the regression slopes are pulled towards zero, and only the ones with big effects "survive". The machine-learning versions of regularized regression are called ridge, LASSO, and elastic net, and they can be fitted using the glmnet package. There are also Bayesian equivalents, the so-called shrinkage priors, such as the horseshoe (see e.g. https://avehtari.github.io/modelselection/regularizedhorseshoe_slides.pdf). You can fit Bayesian regularized regression using the brms package.
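A minimal glmnet sketch (iris is used here only as a stand-in for your data; alpha = 1 gives the LASSO, alpha = 0 ridge, values in between elastic net):
library(glmnet)
X <- as.matrix(iris[, 1:3])   # predictors
y <- iris$Petal.Width         # outcome
set.seed(1)
cv_fit <- cv.glmnet(X, y, alpha = 1)   # LASSO with cross-validated lambda
coef(cv_fit, s = "lambda.min")         # slopes not shrunk to zero "survive"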

Levene post hoc test in R

I have a problem regarding my data analysis in R. One of my hypotheses is basically that my groups will differ in terms of the spread of the scores, indicating a difference in extremity between the groups.
I decided to check this hypothesis with Levene's test, which turned out significant and should thus indicate that the standard deviations differ significantly between the groups.
But I do not know of any post hoc tests for Levene's test, and after reading up on possible post hoc analyses I decided to conduct an ANOVA on the residuals and then do a post hoc test on that ANOVA.
This is the code I've tried so far:
leveneTest(SS_mean~RA01, DF)
DF$residuals <- abs(DF$SS_mean - DF$SS_mean_big) #SS_mean = Participants score,
#SS_mean_big = mean for each group.
My test and post hoc test look like this:
levene.anova<-aov(residuals~RA01, DF) #RA01 is the groups. Four in total
summary(levene.anova)
TukeyHSD(levene.anova)
The ANOVA on the residuals turned out significant as well, but the p-value changed from 0.04 (Levene's test) to 0.01 (ANOVA on the residuals).
When reading about it, it seemed like Levene's test is just an ANOVA on the residuals, so it should give me the same result. I am also unsure which post hoc test I should use. I thought about Dunnett's test as well, since it includes a baseline, which corresponds to one of my groups.
Lastly, I ran leveneTest on the residuals as well, leveneTest(residuals~RA01), which turned out significant. Would it be better for me to use a non-parametric test, e.g. a Kruskal-Wallis H-test, and conduct a post hoc test on that instead? And if so, what would be the appropriate test? Should I use pairwise Mann-Whitney U-tests or Dunn's test?
As this is the first time I'm doing something like this, I'm unsure whether this is a legitimate analysis. I would really appreciate your help or input!
Levene's test should indeed give the same p-value as an ANOVA on the residuals.
See for example this code:
data("mtcars")
mtcars$cyl <- as.factor(mtcars$cyl)
# Calculate means and add them to data
cyl_means <- aggregate(disp ~ cyl, data = mtcars, FUN = mean)
colnames(cyl_means)[2] <- "disp_mean"
mtcars2 <- merge(mtcars, cyl_means, by = "cyl")
# Residuals and anova
mtcars2$residuals <- abs(mtcars2$disp - mtcars2$disp_mean)
res.aov <- aov(residuals ~ cyl, data = mtcars2)
summary(res.aov)
# Levene's test
lawstat::levene.test(mtcars$disp, mtcars$cyl, location = "mean")
Maybe you accidentally ran the Brown-Forsythe test instead, which is the default in lawstat::levene.test (and also in car::leveneTest, which you used), and which uses the median instead of the mean to calculate the residuals.
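For comparison, a quick sketch of the median-based versions (assuming both packages are installed):
lawstat::levene.test(mtcars$disp, mtcars$cyl)               # default location = "median" (Brown-Forsythe)
car::leveneTest(disp ~ cyl, data = mtcars, center = median) # car's default behaviour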
Use Dunnett's if you are only interested in comparing the groups to one baseline group.
Use TukeyHSD if you want all pairwise comparisons among groups.
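A rough sketch of both options, continuing the mtcars2 example above (the Dunnett version assumes the multcomp package is installed):
# All pairwise comparisons of the absolute residuals
res.aov <- aov(residuals ~ cyl, data = mtcars2)
TukeyHSD(res.aov)
# Comparisons against a single baseline group (Dunnett-type contrasts)
library(multcomp)
summary(glht(res.aov, linfct = mcp(cyl = "Dunnett")))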

LASSO analysis (glmnet package). Can I loop the analysis and the results extraction?

I'm using the glmnet package. I need to run several LASSO analyses to calibrate a large number of variables (% reflectance at each wavelength across the spectrum) against one dependent variable. I have a couple of doubts about the procedure and the results that I'd like to resolve. My provisional code is below:
First I split my data in training (70% of n) and testing sets.
smp_size <- floor(0.70 * nrow(mydata))
set.seed(123)
train_ind <- sample(seq_len(nrow(mydata)), size = smp_size)
train <- mydata[train_ind, ]
test <- mydata[-train_ind, ]
Then I separate the target trait (y) and the independent variables (x) for each set as follows:
vars.train <- train[3:2153]
vars.test <- test[3:2153]
x.train <- data.matrix(vars.train)
x.test <- data.matrix(vars.test)
y.train <- train$X1
y.test <- test$X1
Afterwards, I run a cross-validated LASSO model on the training set and extract and write out the non-zero coefficients at lambda.min, because one of my aims is to see which variables (wavebands of the reflectance spectrum) the model selects.
install.packages("glmnet")
library(glmnet)
cv.lasso.1 <- cv.glmnet(y = y.train, x = x.train, family = "gaussian",
                        nfolds = 5, standardize = TRUE, alpha = 1)
coef(cv.lasso.1,s=cv.lasso.1$lambda.min) # Using lambda min.
(cv.lasso.1)
install.packages("broom")
library(broom)
c <- tidy(coef(cv.lasso.1, s="lambda.min"))
write.csv(c, file = "results")
Finally, I use predict() to apply the object cv.lasso.1 (the model obtained previously) to the predictors of the testing set (x.test) in order to get predictions, and I run the correlation between the predicted and the actual values of y for the testing set.
predict.1.2 <- predict(cv.lasso.1, newx = x.test, type = "response",
                       s = "lambda.min")
cor.test(x = c(predict.1.2), y = c(y.test))
This is simplified code, and it has worked without problems so far. The point is that I would like to loop the whole procedure (one hundred repetitions) and, for each repetition, collect the non-zero coefficients of the cross-validated model as well as the correlation between predicted and actual values on the testing set. I've tried but couldn't get any clear results. Can someone give me a hint?
thanks!
In general, running repeated analyses of the same type over and over on the same data can be tricky, and in your case it may not be necessary in the way you have outlined it.
If you are trying to find the most predictive variables, you can use PCA (Principal Component Analysis) to select the directions with the most variation within and between variables. However, PCA does not consider your outcome at all, so with a poor design it will pick the least correlated structure in your data, which may not be predictive; you should be very aware of all the variables in the set. This is a way of reducing the dimensionality of your data before a linear or logistic regression of some sort.
You can read about it here
# Run PCA on the predictors only (see the note below)
yourPCA <- prcomp(yourData,
                  center = TRUE,
                  scale. = TRUE)
Scaling and centering are essential to making these models work right: they remove scale differences between your variables by setting means to 0 and standard deviations to 1. Unless you know what you are doing, I would leave those settings as they are. If you have skewed or kurtotic data, you might need to address that prior to PCA. Run this ONLY on your predictors...keep your target/outcome variable out of the data set.
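To inspect the result (a sketch, continuing with the hypothetical yourPCA object above and assuming at least three predictors):
summary(yourPCA)         # proportion of variance explained by each component
head(yourPCA$x[, 1:3])   # scores on the first three components, usable as reduced predictors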
If you have a classification problem you are looking to resolve with a lot of data, try LDA (Linear Discriminant Analysis), which reduces dimensionality by finding combinations of predictors that best separate the classes of the OUTCOME variable...it specifically considers your outcome.
require(MASS)
yourLDA <- lda(formula = outcome ~ .,
               data = yourdata)
You can also set the prior probabilities in LDA if you know the overall probability of each class; otherwise you can leave them out, and lda will use the class proportions observed in the training set. You can read about that here:
LDA from MASS package
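For example (a sketch, assuming a two-class outcome and the hypothetical names above):
# Explicit 50/50 prior instead of the class proportions in the training set
yourLDA <- MASS::lda(outcome ~ ., data = yourdata, prior = c(0.5, 0.5))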
So this gets you headed in the right direction for reducing the complexity of your data via feature reduction in a computationally solid way. As for building the most robust model via repeated model fitting, this is known as cross-validation. There is a cv.glm function in the boot package that can help you take care of this in a safe way.
You can use the following as a rough guide:
require(boot)
yourGLM   <- glm(outcomeVariable ~ ., data = yourData, family = "gaussian")
yourCVGLM <- cv.glm(data = yourData, glmfit = yourGLM, K = 100)
Here K = 100 specifies 100-fold cross-validation: your OBSERVATIONS (not your variables) are randomly split into 100 groups, and the model is refit with each group held out in turn.
So the process is twofold: reduce the variables using one of the two methods above, then use cross-validation to assess a single model through repeated trials without cumbersome loops!
Read about cv.glm here
Try starting on page 41, but look over the whole thing. The repeated sampling you are after is called bootstrapping, and it is powerful and available for many different model types.
Not as much code as you might hope for, but it should point you in a decent direction.
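For reference, a minimal sketch of the kind of loop the question asks for, reusing the objects from the question (mydata with the outcome in X1 and the predictors in columns 3:2153); treat it as an illustration rather than a recommendation over the approaches above:
library(glmnet)
n_rep    <- 100
sel_vars <- vector("list", n_rep)   # non-zero coefficients per repetition
test_cor <- numeric(n_rep)          # test-set correlation per repetition
for (i in seq_len(n_rep)) {
  train_ind <- sample(seq_len(nrow(mydata)), size = floor(0.70 * nrow(mydata)))
  x.train <- data.matrix(mydata[train_ind, 3:2153])
  x.test  <- data.matrix(mydata[-train_ind, 3:2153])
  y.train <- mydata$X1[train_ind]
  y.test  <- mydata$X1[-train_ind]
  cv.lasso <- cv.glmnet(x = x.train, y = y.train, family = "gaussian",
                        nfolds = 5, standardize = TRUE, alpha = 1)
  cf <- coef(cv.lasso, s = "lambda.min")
  sel_vars[[i]] <- rownames(cf)[as.vector(cf != 0)]
  pred <- predict(cv.lasso, newx = x.test, s = "lambda.min")
  test_cor[i] <- cor(as.vector(pred), y.test)
}
sort(table(unlist(sel_vars)), decreasing = TRUE)  # how often each waveband is selected
summary(test_cor)                                 # spread of the test-set correlations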

How to do a Tukey HSD test with the Anova command (car package)

I'm dealing with an unbalanced design/sample and originally learned aov(). I now know that for my ANOVA tests I need to use Type III sums of squares, which involves fitting with lm() rather than aov().
The problem is getting post-hoc tests (specifically Tukey's HSD) with lm(). All the research I've done says that simint in the multcomp package would work, but now that the package has been updated that command no longer seems to be available. It also seems to rely on going through aov() for the calculation.
Essentially all of the Tukey HSD procedures I've found for R assume that you use aov() for the comparison rather than lm(). To get the Type III sums of squares I need for the unbalanced design, I have to use:
mod<-lm(Snavg~StudentEthnicity*StudentGender)
Anova(mod, type="III")
How do I use a Tukey HSD test with my mod using lm()? Or conversely, calculate my ANOVA using Type III and still be able to run a Tukey HSD test?
Thanks!
Try HSD.test in agricolae
library(agricolae)
data(sweetpotato)
model<-lm(yield~virus, data=sweetpotato)
comparison <- HSD.test(model,"virus", group=TRUE,
main="Yield of sweetpotato\nDealt with different virus")
Output
Study: Yield of sweetpotato
Dealt with different virus
HSD Test for yield
Mean Square Error: 22.48917
virus, means
yield std.err replication
cc 24.40000 2.084067 3
fc 12.86667 1.246774 3
ff 36.33333 4.233727 3
oo 36.90000 2.482606 3
alpha: 0.05 ; Df Error: 8
Critical Value of Studentized Range: 4.52881
Honestly Significant Difference: 12.39967
Means with the same letter are not significantly different.
Groups, Treatments and means
a oo 36.9
ab ff 36.33333
bc cc 24.4
c fc 12.86667
As an initial note, unless it's been changed, to get correct results for Type III sums of squares you need to set the contrast coding for the factor variables. This can be done inside the lm call or with options(). The example below uses options().
I would be cautious about using HSD.test and similar functions with unbalanced designs unless the documentation addresses their use in these situations. The documentation for TukeyHSD mentions that it adjusts for "mildly unbalanced" designs. I don't know if HSD.test handles things differently. You'd have to check additional documentation for the package or the original reference cited for the function.
As a side note, enclosing the whole HSD.test function in parentheses will cause it to print the results. See example below.
In general, I would recommend using the flexible emmeans (née lsmeans) or multcomp packages for all your post-hoc comparison needs. emmeans is particularly useful for doing mean separations on interactions or for examining contrasts among treatments. [EDIT: Caveat that I am the author of these pages.]
With an unbalanced design, you may want to report the E.M. (or L.S.) means instead of the arithmetic means. See SAEPER: What are least square means?. [EDIT: Caveat that I am the author of this page.] Note in the example below that the marginal means reported by emmeans are different than those reported by HSD.test.
Also note that the "Tukey" in glht has nothing to do with Tukey HSD or Tukey-adjusted comparisons; it just sets up the contrasts for all pairwise tests, as the output says.
However, the adjust="tukey" in emmeans functions does mean to use Tukey-adjusted comparisons, as the output says.
The following example is partially adapted from ARCHBS: One-way Anova.
### EDIT: Some code changed to reflect changes to some functions
### in the emmeans package
if(!require(car)){install.packages("car")}
library(car)
data(mtcars)
mtcars$cyl.f = factor(mtcars$cyl)
mtcars$carb.f = factor(mtcars$carb)
options(contrasts = c("contr.sum", "contr.poly"))
model = lm(mpg ~ cyl.f + carb.f, data=mtcars)
library(car)
Anova(model, type="III")
if(!require(agricolae)){install.packages("agricolae")}
library(agricolae)
(HSD.test(model, "cyl.f")$groups)
if(!require(emmeans)){install.packages("emmeans")}
library(emmeans)
marginal = emmeans(model, ~ cyl.f)
pairs(marginal, adjust="tukey")
if(!require(multcomp)){install.packages("multcomp")}
library(multcomp)
cld(marginal, adjust="tukey", Letters=letters)
if(!require(multcomp)){install.packages("multcomp")}
library(multcomp)
mc = glht(model, mcp(cyl.f = "Tukey"))
summary(mc, test=adjusted("single-step"))
cld(mc)
I also found HSD.test() to be very particular about the way you have built the lm() or aov() model that you're using with it.
There was no output from HSD.test() with my data when I had used the following style of coding for lm():
model<-lm(sweetpotato$yield ~ sweetpotato$virus)
out <- HSD.test(model,"virus", group=TRUE, console=TRUE)
Output was only:
Name: virus
sweetpotato$virus
The output was equally bad when using the same logic for aov()
model<-aov(sweetpotato$yield ~ sweetpotato$virus)
To get output from HSD.test(), the lm() (or aov()) model must be constructed strictly using the logic presented in the MYaseen208 answer:
model <- lm(yield~virus, data=sweetpotato)
Hope this helps someone who's not getting a proper output from HSD.test().
I was stuck with the same problem of HSD.test() printing nothing. You need to pass console=TRUE inside the function call so that it prints the results automatically.
For example:
HSD.test(alturacrit.anova, "fator", console=TRUE).
Hope it helps!
