How to generate a compact letter display for pairwise TukeyHSD - r

I'm having trouble generating a compact letter display for my results.
I've run an ANOVA followed by Tukey's HSD to generate the p values for each pair, but I do not know how (or whether it is possible) to assign letters to these p values to show which pairs differ significantly from each other.
csa.anova<-aov(rate~temp*light,data=csa.per.chl)
summary(csa.anova)
TukeyHSD(csa.anova)
This runs the tests I need, but I don't know how to turn the pairwise p values into letters that show which groups differ significantly.

Find more details here.
mod <- lm(Sepal.Width ~ Species, data = iris)
mod_means_contr <- emmeans::emmeans(object = mod,
pairwise ~ "Species",
adjust = "tukey")
mod_means <- multcomp::cld(object = mod_means_contr$emmeans,
Letters = letters)
library(ggplot2)
ggplot(data = mod_means,
aes(x = Species, y = emmean)) +
geom_errorbar(aes(ymin = lower.CL,
ymax = upper.CL),
width = 0.2) +
geom_point() +
geom_text(aes(label = gsub(" ", "", .group)),
position = position_nudge(x = 0.2)) +
labs(caption = "Means followed by a common letter are\nnot significantly different according to the Tukey-test")
Created on 2021-06-03 by the reprex package (v2.0.0)

You need to install the multcomp package first. It can compute the Tukey HSD Test and returns an object that has summary and plot methods. The package also has a function (cld) to print the "compact letter display." As an example we can use the iris data set that comes with R:
library(multcomp)
data(iris)
iris.aov <- aov(Petal.Length~Species, iris)
iris.tukey <- glht(iris.aov, linfct=mcp(Species="Tukey"))
cld(iris.tukey)
# setosa versicolor virginica
# "a" "b" "c"
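Since the question starts from TukeyHSD() rather than glht(), here is a minimal sketch of getting letters straight from that output. It assumes the multcompView package is installed (it is not used in the answer above), and it reuses the iris model so that it runs as is.

```r
library(multcompView)

# Same one-way model as above, fitted with aov() so that TukeyHSD() applies
iris.aov <- aov(Petal.Length ~ Species, data = iris)
tuk <- TukeyHSD(iris.aov)

# The "p adj" column holds the adjusted p values;
# its names look like "versicolor-setosa"
p_adj <- tuk$Species[, "p adj"]

# Groups that share a letter do not differ at the 5% level
tukey_letters <- multcompLetters(p_adj, threshold = 0.05)$Letters
print(tukey_letters)
```

For a two-way model such as rate ~ temp * light, the interaction comparisons should sit in the `temp:light` element of the TukeyHSD() result instead of `Species`. Factor levels that themselves contain a hyphen would confuse the name parsing, so check the names first.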

Related

Plot the impact for each variable in linear regression?

I want to create a plot like below for a lm model calculated using R.
Is there a simple way of doing it?
The plot above was taken from this page.
Package {caret} offers a convenient method varImp:
Example:
library(caret)
my_model <- lm(mpg ~ disp + cyl, data = mtcars)
varImp(my_model)
##       Overall
## disp 2.006696
## cyl  2.229809
For different measures of variable importance see ?varImp. Feed values into your plotting library of choice.
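As a rough sketch of the "feed values into your plotting library" step: for lm objects the values reported by varImp() are the absolute t statistics of the coefficients (the numbers above match), so the example below computes them with base R and draws a horizontal bar chart with base graphics. The object names are only illustrative.

```r
# Same model as in the caret example
my_model <- lm(mpg ~ disp + cyl, data = mtcars)

# Absolute t statistics of the predictors, dropping the intercept
t_values <- summary(my_model)$coefficients[, "t value"]
importance <- abs(t_values[names(t_values) != "(Intercept)"])
importance <- sort(importance)

# Horizontal bar chart, largest value on top
barplot(importance, horiz = TRUE, las = 1,
        xlab = "|t statistic|", main = "Variable importance")
```

With caret loaded, the same vector could be taken from varImp(my_model)$Overall instead.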
Extra: {ggstatsplot} calculates and plots a host of model statistics for a wide range of model objects. This includes hypotheses about regression coefficients, for which the function ggcoefstats() might serve your purpose (remember to scale the predictor variables for a meaningful comparison of coefficients, though).
Following the method in the linked article (relative marginal increase in r squared), you could write your own function that takes a formula, and the data frame, then plots the relative importance:
library(ggplot2)
plot_importance <- function(formula, data) {
lhs <- as.character(as.list(formula)[[2]])
rhs <- as.list(as.list(formula)[[3]])
vars <- grep("[+\\*]", rapply(rhs, as.character), invert = TRUE, value = TRUE)
df <- do.call(rbind, lapply(seq_along(vars), function(i) {
f1 <- as.formula(paste(lhs, paste(vars[-i], collapse = "+"), sep = "~"))
f2 <- as.formula(paste(lhs, paste(c(vars[-i], vars[i]), collapse = "+"),
sep = "~"))
r1 <- summary(lm(f1, data = data))$r.squared
r2 <- summary(lm(f2, data = data))$r.squared
data.frame(variable = vars[i], importance = r2 - r1)
}))
df$importance <- df$importance / sum(df$importance)
df$variable <- reorder(factor(df$variable), -df$importance)
ggplot(df, aes(x = variable, y = importance)) +
geom_col(fill = "deepskyblue4") +
scale_y_continuous(labels = scales::percent) +
coord_flip() +
labs(title = "Relative importance of variables",
subtitle = deparse(formula)) +
theme_classic(base_size = 16)
}
We can test this out with the sample data provided in the linked article:
IV <- read.csv(paste0("https://statisticsbyjim.com/wp-content/uploads/",
"2017/07/ImportantVariables.csv"))
plot_importance(Strength ~ Time + Pressure + Temperature, data = IV)
And we see that the plot is the same.
We can also test it out on some built-in datasets to demonstrate that its use is generalized:
plot_importance(mpg ~ disp + wt + gear, data = mtcars)
plot_importance(Petal.Length ~ Species + Petal.Width, data = iris)
Created on 2022-05-01 by the reprex package (v2.0.1)
I ended up using the relaimpo package and plotting the result with the ggplot code from @Allan Cameron's answer:
library(relaimpo)
library(ggplot2)
# mymodel is a fitted lm object
relative_importance <- calc.relimp(mymodel, type = "lmg")$lmg
df = data.frame(
variable=names(relative_importance),
importance=round(c(relative_importance) * 100,2)
)
ggplot(df, aes(x = reorder(variable, -importance), y = importance)) +
geom_col(fill = "deepskyblue4") +
geom_text(aes(label=importance), vjust=.3, hjust=1.2, size=3, color="white")+
coord_flip() +
labs(title = "Relative importance of variables") +
theme_classic(base_size = 16)

`data` must be a data frame, or other object coercible by `fortify()`, not an S3 object with class ranger

I am working with R. Using a tutorial, I was able to create a statistical model and produce visual plots for some of the outputs:
#load libraries
library(survival)
library(dplyr)
library(ranger)
library(data.table)
library(ggplot2)
#use the built in "lung" data set
#remove missing values (dataset is called "a")
a <- na.omit(lung)
#create id variable
a$ID <- seq_along(a[,1])
#create test set with only the first 3 rows
new <- a[1:3,]
#create a training set by removing first three rows
a <- a[-c(1:3),]
#fit survival model (random survival forest)
r_fit <- ranger(Surv(time,status) ~ age + sex + ph.ecog + ph.karno + pat.karno + meal.cal + wt.loss, data = a, mtry = 4, importance = "permutation", splitrule = "extratrees", verbose = TRUE)
#create new intermediate variables required for the survival curves
death_times <- r_fit$unique.death.times
surv_prob <- data.frame(r_fit$survival)
avg_prob <- sapply(surv_prob, mean)
#use survival model to produce estimated survival curves for the first three observations
pred <- predict(r_fit, new, type = 'response')$survival
pred <- data.table(pred)
colnames(pred) <- as.character(r_fit$unique.death.times)
#plot the results for these 3 patients
plot(r_fit$unique.death.times, pred[1,], type = "l", col = "red")
lines(r_fit$unique.death.times, pred[2,], type = "l", col = "green")
lines(r_fit$unique.death.times, pred[3,], type = "l", col = "blue")
Now, I am trying to convert the above plot into ggplot format (and add 95% confidence intervals):
ggplot(r_fit) + geom_line(aes(x = r_fit$unique.death.times, y = pred[1,], group = 1), color = red) + geom_ribbon(aes(ymin = 0.95 * pred[1,], ymax = - 0.95 * pred[1,]), fill = "red") + geom_line(aes(x = r_fit$unique.death.times, y = pred[2,], group = 1), color = blue) + geom_ribbon(aes(ymin = 0.95 * pred[2,], ymax = - 0.95 * pred[2,]), fill = "blue") + geom_line(aes(x = r_fit$unique.death.times, y = pred[3,], group = 1), color = green) + geom_ribbon(aes(ymin = 0.95 * pred[3,], ymax = - 0.95 * pred[3,]), fill = "green") + theme(axis.text.x = element_text(angle = 90)) + ggtitle("sample graph")
But this produces the following error:
Error: `data` must be a data frame, or other object coercible by `fortify()`, not an S3 object with class ranger
Run `rlang::last_error()` to see where the error occurred.
What is the reason for this error? Can someone please show me how to fix this problem?
Thanks
As per the ggplot2 documentation, you need to provide a data.frame() or object that can be converted (coerced) to a data.frame(). In this case, if you want to reproduce the plot above in ggplot2, you will need to manually set up the data frame yourself.
Below is an example of how you could set up the data to display the plot in ggplot2.
Data Frame
First we create a data.frame() with the variables that we want to plot. The easiest way to do this is to just group them all in as separate columns. Note that I have used the as.numeric() function to first coerce the predicted values to a vector, because they were previously a data.table row, and if you don't convert them they are maintained as rows.
ggplot_data <- data.frame(unique.death.times = r_fit$unique.death.times,
pred1 = as.numeric(pred[1,]),
pred2 = as.numeric(pred[2,]),
pred3 = as.numeric(pred[3,]))
head(ggplot_data)
## unique.death.times pred1 pred2 pred3
## 1 5 0.9986676 1.0000000 0.9973369
## 2 11 0.9984678 1.0000000 0.9824642
## 3 12 0.9984678 0.9998182 0.9764154
## 4 13 0.9984678 0.9998182 0.9627118
## 5 15 0.9731656 0.9959416 0.9527424
## 6 26 0.9731656 0.9959416 0.9093876
Pivot the data
This format is still not ideal, because in order to plot the data and colour by the correct column (variable), we need to 'pivot' the data. We need to load the tidyr package for this.
library(tidyr)
ggplot_data <- ggplot_data %>%
pivot_longer(cols = !unique.death.times,
names_to = "category", values_to = "predicted.value")
Plotting
Now the data is in a form that makes it really easy to plot in ggplot2.
plot <- ggplot(ggplot_data, aes(x = unique.death.times, y = predicted.value, colour = category)) +
geom_line()
plot
If you really want to match the look of the base plot, you can add theme_classic():
plot + theme_classic()
Additional notes
Note that this doesn't include 95% confidence intervals, so they would have to be calculated separately. Be aware though, that a 95% confidence interval is not just 95% of the y value at a given x value. There are calculations that will give you the correct values of the confidence interval, including functions built into R.
For a quick view of a trend line with an interval around it, you can use the geom_smooth() function in ggplot2. Note that in this case it fits a loess curve by default, and the band it draws is the confidence band of that loess fit, not an interval derived from the survival model.
plot + theme_classic() + geom_smooth()
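To illustrate the point that a 95% confidence interval is not 95% of the y value, here is a small sketch using a function built into R. It uses an ordinary linear model on mtcars purely as an illustration; it is not a method for the ranger survival forest, which would need its own approach.

```r
# Simple linear model, used only to illustrate the idea
fit <- lm(mpg ~ wt, data = mtcars)
new_wt <- data.frame(wt = seq(min(mtcars$wt), max(mtcars$wt), length.out = 5))

# predict() returns the fit plus lower/upper 95% confidence limits
ci <- predict(fit, newdata = new_wt, interval = "confidence", level = 0.95)

# The naive "95% of y" value, for comparison with the real lower limit
naive <- 0.95 * ci[, "fit"]
print(cbind(new_wt, ci, naive_lower = naive))
```

The lwr column depends on the standard error of the fit at each x, so it differs from the naive column.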

Plotting output of GAM model

Edit: following interactions in the responses below, I believe there may be some issues with the plot() or plot.gam() functions when dealing with gam outputs. See responses below.
I am fitting a nonparametric regression: model <- gam(y ~ s(x, bs = "cs"), data = data).
My data looks like what follows, where x is in logs. I have 273 observations
y x
[1,] 0.010234756 10.87952
[2,] 0.009165001 10.98407
[3,] 0.001330975 11.26850
[4,] 0.008000957 10.97803
[5,] 0.008579472 10.94924
[6,] 0.009746714 11.01823
I would like to plot the output of the model, basically the fitted curve. When I do
# graph
plot(model)
or
ggplot(data = data, mapping = aes(x = x, y = y)) +
geom_point(size = 0.5, alpha = 0.5) +
geom_smooth(method="gam", formula= y~s(x, bs = "cs") )
I get the desired output graphs (apologies for the original labels):
However, the two plotted curves are not exactly the same and I did not manage to find the parameters to tweak to remove the differences. Therefore I would like to plot the curve manually.
Here is my current attempt.
model <- gam(y ~ s(x, bs = "cs"), data = data) # bs belongs inside s(), not in gam()
names(model)
# summary(model)
model_fit <- as.data.frame(cbind(model$y, model$fitted.values,
model$linear.predictors, data$x,
model$residuals))
names(model_fit) <- c("y", "y_fit", "linear_pred", "x", "res")
### here the plotting
ggplot(model_fit) +
geom_point(aes(x = x, y = y_fit), size = 0.5, alpha = 0.5) +
geom_line(aes(x = x, y = y_fit))
However I get the following warning
geom_path: Each group consists of only one observation. Do you need to adjust the group aesthetic?
and wrong output graph
I do not seem to be able to fix the last graph (it seems the error is in geom_point() ) and add the confidence intervals, nor to find where to tweak the first two to make them exactly the same.
The difference is likely due to you using different fitting algorithms. The default in gam() is (currently) method = "GCV.Cp" even though the recommended option is to use method = "REML". stat_smooth() uses method = "REML". GCV-based smoothness selection is known to undersmooth in some circumstances, and this seems to be the case here, with the REML solution being a much smoother curve.
If you change to method = "REML" in your gam() call, the differences should disappear.
That said, you really shouldn't be ripping things out of model objects like that. For a start, $residuals is not what you think it is: it's not useful in this context because those are the working residuals for the PIRLS algorithm. Use the extractor functions such as fitted(), residuals() etc.
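A quick way to see why the extractor functions matter is sketched below. It uses a Poisson glm() from base R rather than a gam(), as an analogous case: there too the $residuals component stores working residuals from the iterative fit, which are not the same as observed minus fitted. The data are made up for illustration.

```r
# Small made-up count data set
counts <- c(18, 17, 15, 20, 10, 20, 25, 13, 12)
treatment <- gl(3, 3)
fit <- glm(counts ~ treatment, family = poisson())

# What is stored in the object: working residuals
working <- fit$residuals

# What most people expect: observed minus fitted
response <- residuals(fit, type = "response")

print(cbind(working, response))
```

The two columns differ, which is why residuals() with an explicit type is the safer choice.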
The easiest way to plot your own version of that drawn by plot.gam() is to capture the object returned by plot.gam() and then use that object to draw what you need.
Via plot.gam()
library(mgcv)
library(gratia) # provides data_sim()
library(ggplot2)
df <- data_sim("eg1", seed = 2)
m <- gam(y ~ s(x2), data = df, method = "REML")
p_obj <- plot(m, residuals = TRUE)
p_obj <- p_obj[[1]] # just one smooth so select the first component
sm_df <- as.data.frame(p_obj[c("x", "se", "fit")])
data_df <- as.data.frame(p_obj[c("raw", "p.resid")])
## plot
ggplot(sm_df, aes(x = x, y = fit)) +
geom_rug(data = data_df, mapping = aes(x = raw, y = NULL),
sides = "b") +
geom_point(data = data_df, mapping = aes(x = raw, y = p.resid)) +
geom_ribbon(aes(ymin = fit - se, ymax = fit + se, y = NULL),
alpha = 0.3) +
geom_line() +
labs(x = p_obj$xlab, y = p_obj$ylab)
Which produces
Alternatively, you might look at my {gratia} package or the {mgcViz} package of Matteo Fasiolo as options that will do this all for you.
{gratia} example
For example with {gratia}
library('gratia')
draw(m, residuals = TRUE)
which produces
The solution provided by @Gavin Simpson here partially solves the issue: to make the two curves equal, one needs to add method = "REML". The two curves then have the same shape.
However, for some reason, when plotting the output of a gam() model using either plot() or plot.gam(), the curve does not fit properly the original data as it should. The same happens by manually plotting the graph by extracting the elements from the object returned by plot.gam(). I am not sure why this happens. In my case, the fitted curve is shifted downwards, clearly "missing" the data points it is supposed to fit. Below the code and the corresponding output graph, the latter being the same you get in plot() or plot.gam() with the addition of the original data points to the plots.
b <- plot(model_1) # plot.gam() invisibly returns the data it draws
# or b <- plot.gam(model_1)
data.plot = as.data.frame(cbind(b[[1]]$x, b[[1]]$fit, b[[1]]$se))
ggplot(data = data.plot, mapping = aes(x = V1, y = V2)) +
geom_line(aes(x = V1, y = V2)) +
geom_line(aes(x= V1, y = V2 + V3 ), linetype="dashed") +
geom_line(aes(x= V1, y = V2 - V3 ), linetype ="dashed") +
geom_point(data= df_abs, aes(x= log(prd_l_1999), y=prd_gr), size = 0.5, alpha = 0.5)
Misplaced graphs
Note that the ggplot version plots the curve properly. Therefore, my uninformed guess is that this may be an issue with the plotting method.
Working solution
I am not able to prove that the issue is with the plotting functions, but it turns out that this is the same issue as in this question and the partial solution provided by the OP fixes the plotting while still using the gam() function. Below (his) code adapted to my case and the corresponding output graph. As you can see, the graph is plotted properly and the curve fits the data as it is supposed to do. I'd say this may corroborate my hypothesis even though I cannot prove it as I am not knowledgeable enough.
library(mgcv)
library(ggplot2)
# bs = "cs" moved inside s(); passed to gam() directly it has no effect on the smooth
model_1 <- gam(prd_gr ~ s(log(prd_l_1999), bs = "cs"), data = df_abs, method = "REML")
# predict() includes the model intercept (link scale, which is the response scale for the default gaussian family)
preds <- predict(model_1, se.fit = TRUE)
my_data <- data.frame(x = log(df_abs$prd_l_1999),
                      mu = preds$fit,
                      low = preds$fit - 1.96 * preds$se.fit,
                      high = preds$fit + 1.96 * preds$se.fit)
ggplot(my_data, aes(x = x, y = mu)) +
  geom_line(linewidth = 1, col = "blue") +
  geom_smooth(aes(ymin = low, ymax = high), stat = "identity", col = "green")
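A possible explanation for the downward shift, which the thread itself does not confirm: plot.gam() draws each smooth as a centred partial effect, so the model intercept is not part of the curve, whereas predict() returns intercept plus smooth. The sketch below checks that relationship on simulated data (the data and object names are made up for illustration).

```r
library(mgcv)

# Simulated data, roughly on the scale of the question's example
set.seed(1)
dat <- data.frame(x = runif(200, 10, 12))
dat$y <- 0.01 + 0.002 * sin(3 * dat$x) + rnorm(200, sd = 0.001)

m <- gam(y ~ s(x, bs = "cs"), data = dat, method = "REML")

# Centred contribution of the smooth, i.e. what plot.gam() draws
smooth_part <- predict(m, type = "terms")[, "s(x)"]

# Full prediction, i.e. what the predict()-based plot draws
full_fit <- predict(m)

# The two differ by the intercept
intercept <- coef(m)[["(Intercept)"]]
print(all.equal(as.numeric(full_fit), as.numeric(smooth_part) + intercept))
```

If that is what is happening, adding the intercept to the fit column taken from the plot.gam() object (or passing shift = coef(model_1)[1] to plot()) should line the curve up with the data.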

Compact letter display from pairwise test

I would like to create a compact letter display from a post-hoc test I did on a linear mixed effect model (lmer)
Here is an example of what I would like when I do a pairwise t.test
df <- read.table("https://pastebin.com/raw/Dzfh7b2f", header=T,sep="")
library(rcompanion)
library(multcompView)
PT <- pairwise.t.test(df$fit, df$treatment, p.adjust.method = "bonferroni")
PT = PT$p.value
PT1 = fullPTable(PT)
multcompLetters(PT1,
compare="<",
threshold=0.05,
Letters=letters,
reversed = FALSE)
This works out great, because with pairwise.t.test it is easy to extract the p values and create the table I would like.
Now let's say I run a linear model, do a pairwise comparison, and would like to create the same kind of compact letter display table from the extracted p values:
library(multcomp)
mult<- summary(glht(model, linfct = mcp(treatment = "Tukey")), test = adjusted("holm"))
mult
I can see the p values, but have spent the last 2-3 hours trying to figure out how to just extract those values (as I did above with the pairwise.t.test), and subsequently, create a compact letter display table.
Any help is much appreciated. All the best
Find more details here.
mod <- lm(Sepal.Width ~ Species, data = iris)
mod_means_contr <- emmeans::emmeans(object = mod,
pairwise ~ "Species",
adjust = "tukey")
mod_means <- multcomp::cld(object = mod_means_contr$emmeans,
Letters = letters)
### Bonus plot
library(ggplot2)
ggplot(data = mod_means,
aes(x = Species, y = emmean)) +
geom_point() +
geom_errorbar(aes(ymin = lower.CL,
ymax = upper.CL),
width = 0.2) +
geom_text(aes(label = gsub(" ", "", .group)),
position = position_nudge(x = 0.2)) +
labs(caption = "Means followed by a common letter are\nnot significantly different according to the Tukey-test")
Created on 2021-06-03 by the reprex package (v2.0.0)
Thanks to the suggestion by @roland, the answer is simply:
mult<- summary(glht(model, linfct = mcp(treatment = "Tukey")), test = adjusted("holm"))
letter_display <- cld(mult)
letter_display
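If the p values themselves are needed, as asked in the question, a minimal sketch follows. It assumes the summary object stores them in its test component, and it uses the iris data because the original model object is not shown; the object names are only illustrative.

```r
library(multcomp)
library(multcompView)

# Stand-in model, since `model` from the question is not available
fit <- aov(Sepal.Width ~ Species, data = iris)
s <- summary(glht(fit, linfct = mcp(Species = "Tukey")),
             test = adjusted("holm"))

# Adjusted p values, named after the comparisons ("versicolor - setosa", ...)
p <- as.numeric(s$test$pvalues)
names(p) <- gsub(" ", "", names(s$test$coefficients))
print(p)

# The same letter display as in the pairwise.t.test example
glht_letters <- multcompLetters(p, threshold = 0.05)$Letters
print(glht_letters)
```

The gsub() call removes the spaces around the hyphen so that multcompLetters() can split the comparison names into the two group names.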

How do I plot a single numerical covariate using emmeans (or other package) from a model?

After variable selection I usually end up with a model containing a numerical covariate (2nd or 3rd degree polynomial). I would like to plot its effect, preferably using the emmeans package. Is there a way of doing that?
I can do it using predict:
m1 <- lm(mpg ~ poly(disp,2), data = mtcars)
df <- cbind(disp = mtcars$disp, predict.lm(m1, interval = "confidence"))
df <- as.data.frame(df)
ggplot(data = df, aes(x = disp, y = fit)) +
geom_line() +
geom_ribbon(aes(ymin = lwr, ymax = upr, x = disp, y = fit),alpha = 0.2)
I haven't figured out a way of doing it using emmip or emtrends.
For illustration purposes, how could I do it using mixed models via lme?
m1 <- lme(mpg ~ poly(disp,2), random = ~1|factor(am), data = mtcars)
I suspect that your issue is due to the fact that, by default, covariates are reduced to their means in emmeans. You can use the at or cov.reduce arguments to specify a larger number of values. See the documentation for ref_grid and vignette("basics", "emmeans"), or the index of vignette topics.
Using sjPlot:
plot_model(m1, terms = "disp [all]", type = "pred")
gives the same graphic.
Using emmeans:
em1 <- ref_grid(m1, at = list(disp = seq(min(mtcars$disp), max(mtcars$disp), 1)))
emmip(em1, ~disp, CIs = T)
returns a graphic with a small difference in layout. An alternative is to store the result in an object and plot it the way I want:
d1 <- emmip(em1, ~disp, CIs = T, plotit = F)
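A sketch of the "plot it the way I want" step follows. Instead of relying on the column names returned by emmip(), it converts an emmeans() result to a data frame, where the estimate and its limits are named emmean, lower.CL and upper.CL for an lm fit. The grid of 50 values is an arbitrary choice.

```r
library(emmeans)
library(ggplot2)

m1 <- lm(mpg ~ poly(disp, 2), data = mtcars)

# Evaluate the model over a grid of disp values instead of only at its mean
disp_grid <- seq(min(mtcars$disp), max(mtcars$disp), length.out = 50)
em_df <- as.data.frame(emmeans(m1, ~ disp, at = list(disp = disp_grid)))

# Fitted curve with its 95% confidence band
p <- ggplot(em_df, aes(x = disp, y = emmean)) +
  geom_ribbon(aes(ymin = lower.CL, ymax = upper.CL), alpha = 0.2) +
  geom_line()
print(p)
```

For the lme fit the same idea should apply, although the names of the limit columns may differ, so check names(em_df) first.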
