R ggplot2 boxplots - ggpubr stat_compare_means not working properly

I am trying to add significance levels to my boxplots in the form of asterisks using ggplot2 and the ggpubr package, but I have many comparisons and I only want to show the significant ones.
I tried the option hide.ns=TRUE in stat_compare_means, but it does not work; it might be a bug in the ggpubr package.
Besides, you can see that I leave group "PGMC4" out of the pairwise wilcox.test comparisons; how can I also leave this group out of the kruskal.test?
My last question is how the significance levels work. Is * significant below 0.05, ** below 0.025, *** below 0.01? What convention does ggpubr use? Does it show p-values or adjusted p-values? If the latter, what is the adjustment method? BH?
Please check my MWE below and this link and this other one for reference
##############################
##MWE
set.seed(5)
#test df
mydf <- data.frame(ID=paste(sample(LETTERS, 163, replace=TRUE), sample(1:1000, 163, replace=FALSE), sep=''),
Group=c(rep('C',10),rep('FH',10),rep('I',19),rep('IF',42),rep('NA',14),rep('NF',42),rep('NI',15),rep('NS',10),rep('PGMC4',1)),
Value=rnorm(n=163))
#I don't want to compare PGMC4 because I have only one sample
groups <- as.character(unique(mydf$Group[which(mydf$Group!="PGMC4")]))
#function to make combinations of groups without repeating pairs, and avoiding self-combinations
expand.grid.unique <- function(x, y, include.equals=FALSE){
x <- unique(x)
y <- unique(y)
g <- function(i){
z <- setdiff(y, x[seq_len(i-include.equals)])
if(length(z)) cbind(x[i], z, deparse.level=0)
}
do.call(rbind, lapply(seq_along(x), g))
}
#all pairs I want to compare
combs <- as.data.frame(expand.grid.unique(groups, groups), stringsAsFactors=FALSE)
head(combs)
my.comps <- as.data.frame(t(combs), stringsAsFactors=FALSE)
colnames(my.comps) <- NULL
rownames(my.comps) <- NULL
#pairs I want to compare in list format for stat_compare_means
my.comps <- as.list(my.comps)
head(my.comps)
pdf(file="test.pdf", height=20, width=25)
print(#or ggsave()
ggplot(mydf, aes(x=Group, y=Value, fill=Group)) + geom_boxplot() +
stat_summary(fun.y=mean, geom="point", shape=5, size=4) +
scale_fill_manual(values=myPal) +
ggtitle("TEST TITLE") +
theme(plot.title = element_text(size=30),
axis.text=element_text(size=12),
axis.text.x = element_text(angle=45, hjust=1),
axis.ticks = element_blank(),
axis.title=element_text(size=20,face="bold"),
legend.text=element_text(size=16)) +
stat_compare_means(comparisons=my.comps, method="wilcox.test", label="p.signif", size=14) + #WHY DOES hide.ns=TRUE NOT WORK??? WHY DOES size=14 NOT WORK???
stat_compare_means(method="kruskal.test", size=14) #GLOBAL COMPARISON ACROSS GROUPS (HOW TO LEAVE PGMC4 OUT OF THIS??)
)
dev.off()
##############################
The MWE will produce the following boxplots:
The questions would be:
1- How to make hide.ns=TRUE work?
2- How to increase the size of the *?
3- How to exclude a group from the kruskal.test comparison?
4- What is the * convention used by ggpubr, and are the p-values shown adjusted or not?
Many thanks!!
EDIT
Besides, when doing
stat_compare_means(comparisons=my.comps, method="wilcox.test", p.adjust.method="BH")
I do not obtain the same p-values as when doing
wilcox.test(Value ~ Group, data=mydf.sub)$p.value
where mydf.sub is a subset() of mydf for a given comparison of 2 groups.
What is ggpubr doing here? How does it calculate the p.values?
EDIT 2
Please help. The solution does not have to use ggpubr (but it has to use ggplot2); I just need to be able to hide the NS comparisons and make the asterisks bigger, and to get a p-value calculation identical to wilcox.test() + p.adjust(method="BH").
Thanks!

Edit: Since I discovered the rstatix package I would do:
set.seed(123)
#test df
mydf <- data.frame(ID=paste(sample(LETTERS, 163, replace=TRUE), sample(1:1000, 163, replace=FALSE), sep=''),
Group=c(rep('C',10),rep('FH',10),rep('I',19),rep('IF',42),rep('NA',14),rep('NF',42),rep('NI',15),rep('NS',10),rep('PGMC4',1)),
Value=c(runif(n=100), runif(63,max= 0.5)))
library(tidyverse)
stat_pvalue <- mydf %>%
rstatix::wilcox_test(Value ~ Group) %>%
filter(p < 0.05) %>%
rstatix::add_significance("p") %>%
rstatix::add_y_position() %>%
mutate(y.position = seq(min(y.position), max(y.position),length.out = n()))
ggplot(mydf, aes(x=Group, y=Value)) + geom_boxplot() +
ggpubr::stat_pvalue_manual(stat_pvalue, label = "p.signif") +
theme_bw(base_size = 16)
Old Answer:
You can try the following. The idea is to calculate the statistics yourself using pairwise.wilcox.test, and then use geom_signif from the ggsignif package to add the precalculated p-values. With y_position you can place the brackets so they don't overlap.
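library(tidyverse)
library(ggsignif)
library(ggpubr) # needed for stat_compare_means() below
library(broom)
# your list of combinations you want to compare
# (assumes Group is a factor, which was the data.frame() default before R 4.0; otherwise run mydf$Group <- factor(mydf$Group) first)
CN <- combn(levels(mydf$Group)[-9], 2, simplify = FALSE)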
# the pvalues. I use broom and tidy to get a nice formatted dataframe. Note, I turned off the adjustment of the pvalues.
pv <- tidy(with(mydf[ mydf$Group != "PGMC4", ], pairwise.wilcox.test(Value, Group, p.adjust.method = "none")))
# data preparation
CN2 <- do.call(rbind.data.frame, CN)
colnames(CN2) <- colnames(pv)[-3]
# subset the pvalues, by merging the CN list
pv_final <- merge(CN2, pv, by.x = c("group2", "group1"), by.y = c("group1", "group2"))
# fix ordering
pv_final <- pv_final[order(pv_final$group1), ]
# set signif level
pv_final$map_signif <- ifelse(pv_final$p.value > 0.05, "", ifelse(pv_final$p.value > 0.01,"*", "**"))
# the plot
ggplot(mydf, aes(x=Group, y=Value, fill=Group)) + geom_boxplot() +
stat_compare_means(data=mydf[ mydf$Group != "PGMC4", ], aes(x=Group, y=Value, fill=Group), size=5) +
ylim(-4,30)+
geom_signif(comparisons=CN,
y_position = 3:30, annotation= pv_final$map_signif) +
theme_bw(base_size = 16)
The arguments vjust, textsize, and size are not properly working. Seems to be a bug in the latest version ggsignif_0.3.0.
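If the goal is p-values identical to wilcox.test() followed by p.adjust(method = "BH"), base R's pairwise.wilcox.test() should already give that. Below is a minimal sketch with made-up data; the data frame `toy`, its group sizes and the object names are assumptions for illustration, not the data from the question.

```r
# Toy data: three groups plus a singleton group we want to leave out
set.seed(42)
toy <- data.frame(
  Group = rep(c("C", "FH", "I", "PGMC4"), times = c(10, 10, 19, 1)),
  Value = rnorm(40)
)

# Drop the singleton group before testing
toy_sub <- toy[toy$Group != "PGMC4", ]

# BH-adjusted pairwise Wilcoxon tests in one call
pw <- pairwise.wilcox.test(toy_sub$Value, toy_sub$Group, p.adjust.method = "BH")

# The same thing by hand: one wilcox.test() per pair, then p.adjust()
pairs <- combn(sort(unique(toy_sub$Group)), 2, simplify = FALSE)
raw <- sapply(pairs, function(pr) {
  wilcox.test(Value ~ Group, data = toy_sub[toy_sub$Group %in% pr, ])$p.value
})
manual <- p.adjust(raw, method = "BH")
names(manual) <- sapply(pairs, paste, collapse = " vs ")

print(pw$p.value)
print(manual)
```

The two printouts should contain the same adjusted p-values, so the matrix from pairwise.wilcox.test() can be used directly as the source of the annotations.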
Edit: When you want to show only the significant comparisons, you can easily subset the dataset CN. Since I updated to ggsignif_0.4.0 and R version 3.4.1, vjust and textsize are working now as expected. Instead of y_position you can try step_increase.
# subset
gr <- pv_final$p.value <= 0.05
CN[gr]
ggplot(mydf, aes(x=Group, y=Value, fill=Group)) +
geom_boxplot() +
stat_compare_means(data=mydf[ mydf$Group != "PGMC4", ], aes(x=Group, y=Value, fill=Group), size=5) +
geom_signif(comparisons=CN[gr], textsize = 12, vjust = 0.7,
step_increase=0.12, annotation= pv_final$map_signif[gr]) +
theme_bw(base_size = 16)
You can use ggpubr as well. Add:
stat_compare_means(comparisons=CN[gr], method="wilcox.test", label="p.signif", color="red")
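On the asterisk convention: the map_signif column above uses its own two-level scheme. As far as I know, ggpubr's documented default (its symnum.args argument) uses the cutpoints 0, 0.0001, 0.001, 0.01, 0.05 and the symbols "****", "***", "**", "*", "ns"; please treat that as an assumption and check ?stat_compare_means for your version. A small base R sketch of that mapping, where `p_to_stars` is a hypothetical helper name:

```r
# Hypothetical helper: map p-values to significance symbols.
# Cutpoints are ASSUMED to match ggpubr's default symnum.args.
p_to_stars <- function(p) {
  as.character(cut(
    p,
    breaks = c(0, 1e-4, 1e-3, 1e-2, 0.05, Inf),
    labels = c("****", "***", "**", "*", "ns"),
    include.lowest = TRUE,  # so that p = 0 is labelled too
    right = TRUE            # intervals are (a, b], e.g. p = 0.05 gives "*"
  ))
}

p <- c(0.2, 0.05, 0.03, 0.005, 0.0005, 0.00001)
stars <- p_to_stars(p)
print(data.frame(p = p, stars = stars))

# Logical index of the comparisons worth drawing
keep <- stars != "ns"
print(keep)
```

A vector like `keep` can be used in the same way as `gr` above to subset both the comparisons and the annotations.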

Related

ggplot in R to add significance asterisk vs control group over multiple variables

I have barplots, but I would like to run a wilcox.test within each "grp1", comparing the bars to the control for that group, and then add an asterisk if the difference is significant.
I've seen "compare_means" to get the comparisons, but I'm trying to make it automated and not so manual. Would "geom_signif" or "stat_compare_means" do this? Can someone help with this? Thank you very much.
I need the comparison to be made using the full dataset, not just the means (which is only one value per bar). I added a line at the end of the code running one of the comparisons so you can see where I need the p-values from.
y <- c(runif(100,0,4.5),runif(100,3,6),runif(100,4,7))
grp1 <- sample(c("A","B","C","D"),size = 300, replace = TRUE)
grp2 <- rep(c("High","Med","Contrl"),each=100)
dataset <- data.frame(y,grp1,grp2)
means <- aggregate(y~grp1+grp2,data=dataset,mean)
sd <- aggregate(y~grp1+grp2,data=dataset,function(x){sd(x)})
means.all <- merge(sd,means,by=c("grp1","grp2"))
names(means.all)[3:4] <- c("sd","y.mean")
library(ggplot2)
p<- ggplot(means.all, aes(x=grp1, y=y.mean, fill=grp2))+
geom_bar(stat="identity", color="black",
position=position_dodge()) +
geom_errorbar(aes(ymin=y.mean-sd, ymax=y.mean+sd), width=.2,
position=position_dodge(.9))
p
compare_means(y~grp2,data = dataset[dataset$grp1=="A",],method="wilcox.test")
Maybe this is not the optimal way but you can create a list splitting the data and applying the stat_compare_means() function individually at each level of your data. After that you can arrange the plots in one using patchwork:
library(ggplot2)
library(ggpubr)
library(patchwork)
#Split data
List <- split(means.all,means.all$grp1)
#Function for plot
myfun <- function(x)
{
#Ref group
rg <- paste0(unique(x$grp1),'.','Contrl')
#Plot
G <- ggplot(x, aes(x=interaction(grp1,grp2), y=y.mean, fill=grp2))+
geom_bar(stat="identity", color="black",
position=position_dodge()) +
geom_errorbar(aes(ymin=y.mean-sd, ymax=y.mean+sd), width=.2,
position=position_dodge(.9))+
stat_compare_means(ref.group = rg,label = "p.signif",method = "wilcox.test",label.y = 7)+
theme(axis.text.x = element_blank())+
xlab(unique(x$grp1))
return(G)
}
#Apply
Lplot <- lapply(List, myfun)
#Wrap plots
wrap_plots(Lplot,nrow = 1)+plot_layout(guides = 'collect')
Output:
Consider this update that takes the values for asterisks stored in a new dataframe:
#Create p-vals dataset
List2 <- split(dataset,dataset$grp1)
#p-val function
mypval <- function(x)
{
y <- compare_means(y~grp2,data = x,method="wilcox.test")
y <- y[,c('group2', 'group1','p.signif')]
names(y)<-c('grp2','grp1','p.signif')
y <- y[y$grp2=='Contrl',]
y$grp2 <- y$grp1
y <- rbind(y,data.frame(grp2='Contrl',grp1='',p.signif=''))
y$grp1 <- unique(x$grp1)
y$y.mean=7
return(y)
}
#Apply
dfpvals <- lapply(List2, mypval)
df <- do.call(rbind,dfpvals)
#Plot
ggplot(means.all, aes(x=grp1, y=y.mean, fill=grp2,group=grp2))+
geom_bar(stat="identity", color="black",
position=position_dodge()) +
geom_errorbar(aes(ymin=y.mean-sd, ymax=y.mean+sd), width=.2,
position=position_dodge(.9))+
geom_text(data=df,aes(x=grp1, y=y.mean,group=grp2,label=p.signif),
position=position_dodge(0.9))
Output:
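For reference, the per-group comparison against the control can also be computed with base R only, which makes it easy to check where each asterisk comes from. This is a sketch under the assumption that the control level is called "Contrl" as in the question; `vs_control` is a hypothetical helper name and the 0.05 cutoff is an arbitrary choice.

```r
# Same simulated data as in the question, with a seed for reproducibility
set.seed(1)
y <- c(runif(100, 0, 4.5), runif(100, 3, 6), runif(100, 4, 7))
grp1 <- sample(c("A", "B", "C", "D"), size = 300, replace = TRUE)
grp2 <- rep(c("High", "Med", "Contrl"), each = 100)
dataset <- data.frame(y, grp1, grp2)

# Hypothetical helper: within one grp1 level, test every non-control
# grp2 level against the control using the full data (not the means)
vs_control <- function(d, control = "Contrl") {
  others <- setdiff(unique(d$grp2), control)
  p <- sapply(others, function(g) {
    wilcox.test(d$y[d$grp2 == g], d$y[d$grp2 == control])$p.value
  })
  data.frame(grp1 = d$grp1[1], grp2 = others, p.value = unname(p))
}

res <- do.call(rbind, lapply(split(dataset, dataset$grp1), vs_control))
rownames(res) <- NULL
res$label <- ifelse(res$p.value < 0.05, "*", "")
print(res)
```

The resulting data frame has one row per bar that is compared to the control, so it can be handed to geom_text() in the same way as `df` above.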

Remove outliers and reduce yLim appropriately for each facet in ggplot2

I am currently making a facet multi box plot using ggplot2, where I have cleared the outliers and set the yLim to 5000.
However, not all of the boxplots (the ones at the beginning of the image below) go anywhere near 5000. How can I reduce the y axis for only a select few of these boxplots in the image? I've tried multiple answers from the community, but they seem to be outdated.
Here is the code I am using:
require(reshape2)
require(ggplot2)
data_frame <- read.csv("results.csv", header=T)
p <- ggplot(data=data_frame, aes(x='', y=value)) + geom_boxplot(outlier.shape=NA, aes(fill=policy))
p <- p + facet_wrap( ~ level, scales="free") + coord_cartesian(ylim = c(0, 5000))
p <- p + xlab("") + ylab("Authorisation Time (ms)") + ggtitle("Title")
ggsave("bplots.png", plot=p, device="png")
As noted above, you pretty much have to filter before plotting, but this doesn't need to be done by editing any files, or even by creating new dataframes. Using dplyr you can just chain this into the processing of your data. I've done a hopefully reproducible example below with some made-up data (as I don't have yours). I created a function to filter by the same procedures as the boxplot is using. It's a bit hacky, but hopefully works as one potential solution:
require(ggplot2)
require(dplyr)
data_frame <- data.frame(value = c(rnorm(2000, mean = 100, sd = 20), rnorm(2000, mean = 1000, sd = 500)),
level = c(rep(1,2000), rep(2, 2000)),
policy = factor(c(rep(c(rep(1, 500), rep(2, 500), rep(3, 500), rep(4, 500)), 2))))
# filtering function - turns outliers into NAs to be removed
filter_lims <- function(x){
s <- boxplot.stats(x)$stats # s[1] and s[5] are the whisker ends
# vectorized; >= and <= keep the whisker end points, which are not outliers
x <- ifelse(x >= s[1] & x <= s[5], x, NA)
return(x)
}
data_frame %>%
group_by(level, policy) %>% # do the same calcs for each box
mutate(value2 = filter_lims(value)) %>% # new variable (value2) so as not to displace first one)
ggplot(aes(x='', y=value2, fill = policy)) +
geom_boxplot(na.rm = TRUE, coef = 5) + # remove NAs, and set the whisker length to all included points
facet_wrap( ~ level, scales="free") +
xlab("") + ylab("Authorisation Time (ms)") + ggtitle("Title")
Resulting in the following (simplified) plot:
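To see what boxplot.stats() returns and why it can be used as an outlier filter, here is a tiny base R sketch on a made-up vector. The name `drop_outliers` is hypothetical; it is a vectorized variant of the filtering idea, not the function from the answer.

```r
x <- c(1:10, 100)        # 100 is an obvious outlier
s <- boxplot.stats(x)
print(s$stats)           # lower whisker, lower hinge, median, upper hinge, upper whisker
print(s$out)             # points lying beyond the whiskers

# Hypothetical helper: keep everything within the whiskers, end points included
drop_outliers <- function(v) {
  lim <- boxplot.stats(v)$stats[c(1, 5)]
  ifelse(v >= lim[1] & v <= lim[2], v, NA)
}

cleaned <- drop_outliers(x)
print(cleaned)
```

Note that the whisker ends are themselves data points, so a strict comparison (> and <) would also turn the smallest and largest non-outlying values into NA.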

Boxplots with Wilcoxon significance levels, and facets, show only significant comparisons with asterisks

Following up on this question and for the sake of completeness, I modified the accepted answer and customized the resulting plot, but I am still facing some important problems.
To sum up, I am doing boxplots reflecting significance of Kruskal-Wallis and pairwise Wilcoxon test comparisons.
I want to replace the p-value numbers with asterisks, and show only the significant comparisons, keeping the vertical spacing to a minimum.
Basically I want to do this, but with the added problem of facets, that messes everything up.
So far I have worked on a very decent MWE, but it still shows problems...
library(reshape2)
library(ggplot2)
library(gridExtra)
library(tidyverse)
library(data.table)
library(ggsignif)
library(RColorBrewer)
data(iris)
iris$treatment <- rep(c("A","B"), length(iris$Species)/2)
mydf <- melt(iris, measure.vars=names(iris)[1:4])
mydf$treatment <- as.factor(mydf$treatment)
mydf$variable <- factor(mydf$variable, levels=sort(levels(mydf$variable)))
mydf$both <- factor(paste(mydf$treatment, mydf$variable), levels=(unique(paste(mydf$treatment, mydf$variable))))
# Change data to reduce number of statistically significant differences
set.seed(2)
mydf <- mydf %>% mutate(value=rnorm(nrow(mydf)))
##
##FIRST TEST BOTH
#Kruskal-Wallis
addkw <- as.data.frame(mydf %>% group_by(Species) %>%
summarize(p.value = kruskal.test(value ~ both)$p.value))
#addkw$p.adjust <- p.adjust(addkw$p.value, "BH")
a <- combn(levels(mydf$both), 2, simplify = FALSE)
#new p.values
pv.final <- data.frame()
for (gr in unique(mydf$Species)){
for (i in 1:length(a)){
tis <- a[[i]] #variable pair to test
as <- subset(mydf, Species==gr & both %in% tis)
pv <- wilcox.test(value ~ both, data=as)$p.value
ddd <- data.table(as)
asm <- as.data.frame(ddd[, list(value=mean(value)), by=list(both=both)])
asm2 <- dcast(asm, .~both, value.var="value")[,-1]
pf <- data.frame(group1=paste(tis[1], gr), group2=paste(tis[2], gr), mean.group1=asm2[,1], mean.group2=asm2[,2], FC.1over2=asm2[,1]/asm2[,2], p.value=pv)
pv.final <- rbind(pv.final, pf)
}
}
#pv.final$p.adjust <- p.adjust(pv.final$p.value, method="BH")
pv.final$map.signif <- ifelse(pv.final$p.value > 0.05, "", ifelse(pv.final$p.value > 0.01,"*", "**"))
cols <- colorRampPalette(brewer.pal(length(unique(mydf$Species)), "Set1"))
myPal <- cols(length(unique(mydf$Species)))
#Function to get a list of plots to use as "facets" with grid.arrange
plot.list=function(mydf, pv.final, addkw, a, myPal){
mylist <- list()
i <- 0
for (sp in unique(mydf$Species)){
i <- i+1
mydf0 <- subset(mydf, Species==sp)
addkw0 <- subset(addkw, Species==sp)
pv.final0 <- pv.final[grep(sp, pv.final$group1), ]
num.signif <- sum(pv.final0$p.value <= 0.05)
P <- ggplot(mydf0,aes(x=both, y=value)) +
geom_boxplot(aes(fill=Species)) +
stat_summary(fun.y=mean, geom="point", shape=5, size=4) +
facet_grid(~Species, scales="free", space="free_x") +
scale_fill_manual(values=myPal[i]) + #WHY IS COLOR IGNORED?
geom_text(data=addkw0, hjust=0, size=4.5, aes(x=0, y=round(max(mydf0$value, na.rm=TRUE)+0.5), label=paste0("KW p=",p.value))) +
geom_signif(test="wilcox.test", comparisons = a[which(pv.final0$p.value<=0.05)],#I can use "a"here
map_signif_level = F,
vjust=0,
textsize=4,
size=0.5,
step_increase = 0.05)
if (i==1){
P <- P + theme(legend.position="none",
axis.text.x=element_text(size=20, angle=90, hjust=1),
axis.text.y=element_text(size=20),
axis.title=element_blank(),
strip.text.x=element_text(size=20,face="bold"),
strip.text.y=element_text(size=20,face="bold"))
} else{
P <- P + theme(legend.position="none",
axis.text.x=element_text(size=20, angle=90, hjust=1),
axis.text.y=element_blank(),
axis.ticks.y=element_blank(),
axis.title=element_blank(),
strip.text.x=element_text(size=20,face="bold"),
strip.text.y=element_text(size=20,face="bold"))
}
#WHY USING THE CODE BELOW TO CHANGE NUMBERS TO ASTERISKS I GET ERRORS?
#P2 <- ggplot_build(P)
#P2$data[[3]]$annotation <- rep(subset(pv.final0, p.value<=0.05)$map.signif, each=3)
#P <- plot(ggplot_gtable(P2))
mylist[[sp]] <- list(num.signif, P)
}
return(mylist)
}
p.list <- plot.list(mydf, pv.final, addkw, a, myPal)
y.rng <- range(mydf$value)
# Get the highest number of significant p-values across all three "facets"
height.factor <- 0.3
max.signif <- max(sapply(p.list, function(x) x[[1]]))
# Lay out the three plots as facets (one for each Species), but adjust so that y-range is same for each facet. Top of y-range is adjusted using max_signif.
png(filename="test.png", height=800, width=1200)
grid.arrange(grobs=lapply(p.list, function(x) x[[2]] +
scale_y_continuous(limits=c(y.rng[1], y.rng[2] + height.factor*max.signif))),
ncol=length(unique(mydf$Species)), top="Random title", left="Value") #HOW TO CHANGE THE SIZE OF THE TITLE AND THE Y AXIS TEXT?
#HOW TO ADD A COMMON LEGEND?
dev.off()
It produces the following plot:
As you can see there are some problems, most obviously:
1- Coloring does not work for some reason
2- I do not seem to be able to change the annotation with the asterisks
I want something more like this (mockup):
So we need to:
1- Make coloring work
2- Show asterisks instead of numbers
...and for the win:
3- Make a common legend
4- Place Kruskal-Wallis line on top
5- Change the size (and alignment) of the title and y axis text
IMPORTANT NOTES
I would appreciate my code is left as intact as possible even if it isn't the prettiest, cause I still have to make use of intermediate objects like "CNb" or "pv.final".
The solution should be easily transferable to other cases; please consider testing "variable" alone, instead of "both"... In this case we have 6 "facets" (vertically and horizontally) and everything gets even more screwed up...
I made this other MWE:
##NOW TEST MEASURE, TO GET VERTICAL AND HORIZONTAL FACETS
addkw <- as.data.frame(mydf %>% group_by(treatment, Species) %>%
summarize(p.value = kruskal.test(value ~ variable)$p.value))
#addkw$p.adjust <- p.adjust(addkw$p.value, "BH")
a <- combn(levels(mydf$variable), 2, simplify = FALSE)
#new p.values
pv.final <- data.frame()
for (tr in levels(mydf$treatment)){
for (gr in levels(mydf$Species)){
for (i in 1:length(a)){
tis <- a[[i]] #variable pair to test
as <- subset(mydf, treatment==tr & Species==gr & variable %in% tis)
pv <- wilcox.test(value ~ variable, data=as)$p.value
ddd <- data.table(as)
asm <- as.data.frame(ddd[, list(value=mean(value, na.rm=T)), by=list(variable=variable)])
asm2 <- dcast(asm, .~variable, value.var="value")[,-1]
pf <- data.frame(group1=paste(tis[1], gr, tr), group2=paste(tis[2], gr, tr), mean.group1=asm2[,1], mean.group2=asm2[,2], FC.1over2=asm2[,1]/asm2[,2], p.value=pv)
pv.final <- rbind(pv.final, pf)
}
}
}
#pv.final$p.adjust <- p.adjust(pv.final$p.value, method="BH")
# set signif level
pv.final$map.signif <- ifelse(pv.final$p.value > 0.05, "", ifelse(pv.final$p.value > 0.01,"*", "**"))
plot.list2=function(mydf, pv.final, addkw, a, myPal){
mylist <- list()
i <- 0
for (sp in unique(mydf$Species)){
for (tr in unique(mydf$treatment)){
i <- i+1
mydf0 <- subset(mydf, Species==sp & treatment==tr)
addkw0 <- subset(addkw, Species==sp & treatment==tr)
pv.final0 <- pv.final[grep(paste(sp,tr), pv.final$group1), ]
num.signif <- sum(pv.final0$p.value <= 0.05)
P <- ggplot(mydf0,aes(x=variable, y=value)) +
geom_boxplot(aes(fill=Species)) +
stat_summary(fun.y=mean, geom="point", shape=5, size=4) +
facet_grid(treatment~Species, scales="free", space="free_x") +
scale_fill_manual(values=myPal[i]) + #WHY IS COLOR IGNORED?
geom_text(data=addkw0, hjust=0, size=4.5, aes(x=0, y=round(max(mydf0$value, na.rm=TRUE)+0.5), label=paste0("KW p=",p.value))) +
geom_signif(test="wilcox.test", comparisons = a[which(pv.final0$p.value<=0.05)],#I can use "a"here
map_signif_level = F,
vjust=0,
textsize=4,
size=0.5,
step_increase = 0.05)
if (i==1){
P <- P + theme(legend.position="none",
axis.text.x=element_blank(),
axis.text.y=element_text(size=20),
axis.title=element_blank(),
axis.ticks.x=element_blank(),
strip.text.x=element_text(size=20,face="bold"),
strip.text.y=element_text(size=20,face="bold"))
}
if (i==4){
P <- P + theme(legend.position="none",
axis.text.x=element_text(size=20, angle=90, hjust=1),
axis.text.y=element_text(size=20),
axis.title=element_blank(),
strip.text.x=element_text(size=20,face="bold"),
strip.text.y=element_text(size=20,face="bold"))
}
if ((i==2)|(i==3)){
P <- P + theme(legend.position="none",
axis.text.x=element_blank(),
axis.text.y=element_blank(),
axis.title=element_blank(),
axis.ticks.x=element_blank(),
axis.ticks.y=element_blank(),
strip.text.x=element_text(size=20,face="bold"),
strip.text.y=element_text(size=20,face="bold"))
}
if ((i==5)|(i==6)){
P <- P + theme(legend.position="none",
axis.text.x=element_text(size=20, angle=90, hjust=1),
axis.text.y=element_blank(),
#axis.ticks.y=element_blank(), #WHY SPECIFYING THIS GIVES ERROR?
axis.title=element_blank(),
axis.ticks.y=element_blank(),
strip.text.x=element_text(size=20,face="bold"),
strip.text.y=element_text(size=20,face="bold"))
}
#WHY USING THE CODE BELOW TO CHANGE NUMBERS TO ASTERISKS I GET ERRORS?
#P2 <- ggplot_build(P)
#P2$data[[3]]$annotation <- rep(subset(pv.final0, p.value<=0.05)$map.signif, each=3)
#P <- plot(ggplot_gtable(P2))
sptr <- paste(sp,tr)
mylist[[sptr]] <- list(num.signif, P)
}
}
return(mylist)
}
p.list2 <- plot.list2(mydf, pv.final, addkw, a, myPal)
y.rng <- range(mydf$value)
# Get the highest number of significant p-values across all three "facets"
height.factor <- 0.5
max.signif <- max(sapply(p.list2, function(x) x[[1]]))
# Lay out the three plots as facets (one for each Species), but adjust so that y-range is same for each facet. Top of y-range is adjusted using max_signif.
png(filename="test2.png", height=800, width=1200)
grid.arrange(grobs=lapply(p.list2, function(x) x[[2]] +
scale_y_continuous(limits=c(y.rng[1], y.rng[2] + height.factor*max.signif))),
ncol=length(unique(mydf$Species)), top="Random title", left="Value") #HOW TO CHANGE THE SIZE OF THE TITLE AND THE Y AXIS TEXT?
#HOW TO ADD A COMMON LEGEND?
dev.off()
That produces the following plot:
Now the color problem becomes more striking, the facet heights are uneven, and something should be done with the redundant facet strip texts too.
I am stuck at this point, so would appreciate any help. Sorry for the long question, but I think it is almost there! Thanks!!
You can try the following. As your code is quite busy and hard for me to follow, I suggest a different approach. I tried to avoid loops and to use the tidyverse as much as possible. First I recreate your data. Then I calculate the Kruskal-Wallis tests separately, as this is not possible within ggsignif. Next I plot all p-values using geom_signif. Finally, the non-significant ones are removed and a step increase is added.
1- Make coloring work: done
2- Show asterisks instead of numbers: done
...and for the win:
3- Make a common legend: done
4- Place Kruskal-Wallis line on top: done, although I placed the values at the bottom
5- Change the size (and alignment) of the title and y axis text: done
library(tidyverse)
library(ggsignif)
# 1. your data
set.seed(2)
df <- as_tibble(iris) %>% # as_tibble() replaces as.tbl(), deprecated since dplyr 1.0.0
mutate(treatment=rep(c("A","B"), length(iris$Species)/2)) %>%
gather(key, value, -Species, -treatment) %>%
mutate(value=rnorm(n())) %>%
mutate(key=factor(key, levels=unique(key))) %>%
mutate(both=interaction(treatment, key, sep = " "))
# 2. Kruskal test
KW <- df %>%
group_by(Species) %>%
summarise(p=round(kruskal.test(value ~ both)$p.value,2),
y=min(value),
x=1) %>%
mutate(y=min(y))
# 3. Plot
P <- df %>%
ggplot(aes(x=both, y=value)) +
geom_boxplot(aes(fill=Species)) +
facet_grid(~Species) +
ylim(-3,7)+
theme(axis.text.x = element_text(angle=45, hjust=1)) +
geom_signif(comparisons = combn(levels(df$both),2,simplify = F),
map_signif_level = T) +
stat_summary(fun.y=mean, geom="point", shape=5, size=4) +
xlab("") +
geom_text(data=KW,aes(x, y=y, label=paste0("KW p=",p)),hjust=0) +
ggtitle("Plot") + ylab("This is my own y-lab")
# 4. remove not significant values and add step increase
P_new <- ggplot_build(P)
P_new$data[[2]] <- P_new$data[[2]] %>%
filter(annotation != "NS.") %>%
group_by(PANEL) %>%
mutate(index=(as.numeric(group[drop=T])-1)*0.5) %>%
mutate(y=y+index,
yend=yend+index) %>%
select(-index) %>%
as.data.frame()
# the final plot
plot(ggplot_gtable(P_new))
And a similar approach using two facets:
# --------------------
# 5. Kruskal
KW <- df %>%
group_by(Species, treatment) %>%
summarise(p=round(kruskal.test(value ~ both)$p.value,2),
y=min(value),
x=1) %>%
ungroup() %>%
mutate(y=min(y))
# 6. Plot with two facets
P <- df %>%
ggplot(aes(x=key, y=value)) +
geom_boxplot(aes(fill=Species)) +
facet_grid(treatment~Species) +
ylim(-5,7)+
theme(axis.text.x = element_text(angle=45, hjust=1)) +
geom_signif(comparisons = combn(levels(df$key),2,simplify = F),
map_signif_level = T) +
stat_summary(fun.y=mean, geom="point", shape=5, size=4) +
xlab("") +
geom_text(data=KW,aes(x, y=y, label=paste0("KW p=",p)),hjust=0) +
ggtitle("Plot") + ylab("This is my own y-lab")
# 7. remove not significant values and add step increase
P_new <- ggplot_build(P)
P_new$data[[2]] <- P_new$data[[2]] %>%
filter(annotation != "NS.") %>%
group_by(PANEL) %>%
mutate(index=(as.numeric(group[drop=T])-1)*0.5) %>%
mutate(y=y+index,
yend=yend+index) %>%
select(-index) %>%
as.data.frame()
# the final plot
plot(ggplot_gtable(P_new))
Edit.
Regarding your p.adjust needs, you can define your own function and call it directly within geom_signif().
wilcox.test.BH.adjusted <- function(x,y,n){
tmp <- wilcox.test(x,y)
tmp$p.value <- p.adjust(tmp$p.value, n = n,method = "BH")
tmp
}
geom_signif(comparisons = combn(levels(df$both),2,simplify = F),
map_signif_level = T, test = "wilcox.test.BH.adjusted",
test.args = list(n=8))
The challenge is to know how many independent tests you will have in the end; then you can set n yourself. Here I used 8, but this may be wrong.
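One caveat worth checking before relying on this: when p.adjust() receives a single p-value together with n, the BH formula reduces to multiplying by n, which is not the same as running BH over the whole set of p-values. A small base R sketch with made-up p-values:

```r
p <- c(0.01, 0.02, 0.03, 0.04)   # made-up raw p-values

# Adjusting each p-value on its own with n = 4 (what a per-test wrapper does)
one_by_one <- sapply(p, p.adjust, method = "BH", n = 4)

# Adjusting the whole vector at once: the usual BH step-up procedure
all_at_once <- p.adjust(p, method = "BH")

print(one_by_one)
print(all_at_once)
```

If the two results differ for your data, it is probably safer to compute all raw p-values first, adjust them together, and pass the adjusted values as annotations.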
Constructing ggplots in a loop has always been known to produce confusing results, and for the explanation of point 1 I'll refer to this question and many others. There's also a hint there about evaluating the ggplot object on the spot, e.g. via print.
Re point 2, you were close, a bit of debugging with trial and error helped. Here's the complete code for plot.list:
plot.list=function(mydf, pv.final, addkw, a, myPal){
mylist <- list()
i <- 0
for (sp in unique(mydf$Species)){
i <- i+1
mydf0 <- subset(mydf, Species==sp)
addkw0 <- subset(addkw, Species==sp)
pv.final0 <- pv.final[grep(sp, pv.final$group1), ]
num.signif <- sum(pv.final0$p.value <= 0.05)
P <- ggplot(mydf0,aes(x=both, y=value)) +
geom_boxplot(aes(fill=Species)) +
stat_summary(fun.y=mean, geom="point", shape=5, size=4) +
facet_grid(~Species, scales="free", space="free_x") +
scale_fill_manual(values=myPal[i]) +
geom_text(data=addkw0, hjust=0, size=4.5, aes(x=0, y=round(max(mydf0$value, na.rm=TRUE)+0.5), label=paste0("KW p=",p.value))) +
geom_signif(test="wilcox.test", comparisons = a[which(pv.final0$p.value<=0.05)],#I can use "a"here
map_signif_level = F,
vjust=0,
textsize=4,
size=0.5,
step_increase = 0.05)
if (i==1){
P <- P + theme(legend.position="none",
axis.text.x=element_text(size=20, angle=90, hjust=1),
axis.text.y=element_text(size=20),
axis.title=element_blank(),
strip.text.x=element_text(size=20,face="bold"),
strip.text.y=element_text(size=20,face="bold"))
} else{
P <- P + theme(legend.position="none",
axis.text.x=element_text(size=20, angle=90, hjust=1),
axis.text.y=element_blank(),
axis.ticks.y=element_blank(),
axis.title=element_blank(),
strip.text.x=element_text(size=20,face="bold"),
strip.text.y=element_text(size=20,face="bold"))
}
P2 <- ggplot_build(P)
P2$data[[4]]$annotation <- rep(subset(pv.final0, p.value<=0.05)$map.signif, each=3)
P <- ggplot_gtable(P2)
mylist[[sp]] <- list(num.signif, P)
}
return(mylist)
}
Note that we can no longer modify the plot via ggplot semantics, since we already applied ggplot_build/ggplot_gtable, so scale modification is no longer possible. If you want to preserve it, move it inside the plot.list function. So, changing to
grid.arrange(grobs=lapply(p.list, function(x) x[[2]]),
ncol=length(unique(mydf$Species)), top="Random title", left="Value")
yields
That's not a complete solution, of course, but I hope that helps.

Presentation of two curves in ggplot2

Is there a more efficient way to present these data in ggplot2? Ideally, I would like them both in one plot. I know this can be achieved in Python with matplotlib, but I like the visuals of ggplot2 better.
R code used to generate the plots:
#load libraries
library(ggplot2)
library (gridExtra)
library(scales)
#generate some data plot 1
var_iter <- c(seq(0, 4000, 20))
x <- runif(201,0.877813, 2.283210)
var_loss <- c(sort(x, decreasing = TRUE))
rndm1 <- data.frame(var_iter, var_loss)
#generate some data plot 2
var_iter2 <- c(seq(0, 3500, 500))
x2 <- runif(8,0.1821, 0.6675)
var_acc <- c(sort(x2, decreasing = FALSE))
rndm2 <- data.frame(var_iter2, var_acc)
#plot loss
c <- ggplot(data=rndm1, aes(x=var_iter, y=var_loss)) + geom_line(aes(colour="Log Loss")) +
scale_colour_manual(name='', values=c('Log Loss'='#00BFC4')) + #theme_bw() +
xlab("iterations") + ylab("log loss") + theme(legend.position=c(1,1),legend.justification=c(1,1),
legend.direction="horizontal",
legend.box="horizontal",
legend.box.just = c("top"),
legend.background = element_rect(fill=alpha('white', 0.3)))
#plot accuracy
d <- ggplot(data=rndm2, aes(x=var_iter2, y=var_acc)) + geom_line(aes(colour="Accuracy")) +
scale_colour_manual(name='', values=c('Accuracy'='#F8766D')) + #theme_bw() +
xlab("iterations") + ylab("accuracy") + theme(legend.position=c(0.80, 1),legend.justification=c(1,1),
legend.direction="horizontal",
legend.box="horizontal",
legend.box.just = c("top"),
legend.background = element_rect(fill=alpha('white', 0.3)))
grid.arrange(c, d, ncol=2)
You still can use the same concept of adding a layer on another layer.
ggplot(rndm1, aes(x=var_iter)) +
geom_line(aes(y=var_loss, color="var_loss")) +
geom_line(data=rndm2, aes(x=var_iter2, y=var_acc, color="var_acc"))
Or combine the two data frames and create another variable for color.
# Change the column name, so they can combine together
names(rndm1) <- c("x", "y")
names(rndm2) <- c("x", "y")
rndm <- rbind(rndm1, rndm2)
# Create a variable for color
rndm$group <- rep(c("Log Loss", "Accuracy"), c(dim(rndm1)[1], dim(rndm2)[1]))
ggplot(rndm, aes(x=x, y=y, color=group)) + geom_line()
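A small variation on the combining step, as a sketch: attaching the label to each data frame before binding means the labels cannot get out of step with the row counts. The object names `loss`, `acc` and `curves` are made up for this example.

```r
set.seed(1)
# Two curves with different numbers of points, both using columns x and y
loss <- data.frame(x = seq(0, 4000, 20),
                   y = sort(runif(201, 0.877813, 2.283210), decreasing = TRUE))
acc  <- data.frame(x = seq(0, 3500, 500),
                   y = sort(runif(8, 0.1821, 0.6675)))

# Tag each data frame first, then stack them
curves <- rbind(cbind(loss, group = "Log Loss"),
                cbind(acc,  group = "Accuracy"))

print(table(curves$group))
print(head(curves))
```

The stacked data frame can then go into the same ggplot(..., aes(x=x, y=y, color=group)) + geom_line() call as above.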
I wanted to suggest the same idea as JasonWang, but he was faster. I think it is the way to go (hence I upvoted it).
ggplot2 doesn't allow two independent y axes (a secondary axis has to be a one-to-one transformation of the primary one), for a reason: Plot with 2 y axes, one y axis on the left, and another y axis on the right
It is misleading.
But if you still want to do it, you can use base plot or dygraphs (for example):
library(dygraphs)
rndm2$var_iter <- rndm2$var_iter2
rndm2$var_iter2 <- NULL
merged.rndm <- merge(rndm1, rndm2, all = TRUE)
dygraph(merged.rndm) %>% dySeries("var_acc", axis = "y2")
But this will give you isolated points for var_acc, as it has far fewer observations.
You could fill in the gaps:
merged.rndm1 <- as.data.frame(zoo::na.approx(merged.rndm))
dygraph(merged.rndm1) %>% dySeries("var_acc", axis = "y2")
Note: this has approximated values, which might not be something you want to do.
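To make it clearer what "approximated" means here: the gaps are filled by linear interpolation between the known points. The sketch below shows the same idea with base R's approx() on made-up accuracy values (the numbers and object names are assumptions for illustration only).

```r
# Coarse series: one accuracy value every 500 iterations (made-up numbers)
iter <- seq(0, 3500, 500)
acc  <- c(0.20, 0.25, 0.31, 0.38, 0.44, 0.52, 0.60, 0.66)

# Finer grid matching a series sampled every 20 iterations
fine_grid <- seq(0, 3500, 20)

# Linear interpolation of acc onto the finer grid
filled <- approx(x = iter, y = acc, xout = fine_grid)

print(head(data.frame(iter = filled$x, acc = filled$y)))
```

Every value between two measured points lies on the straight line connecting them, so the filled curve carries no information beyond the original 8 observations.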

Align multiple ggplot graphs with and without legends [duplicate]

This question already has answers here:
Align multiple plots in ggplot2 when some have legends and others don't
I'm trying to use ggplot to draw a graph comparing the absolute values of two variables, and also show the ratio between them. Since the ratio is unitless and the values are not, I can't show them on the same y-axis, so I'd like to stack vertically as two separate graphs with aligned x-axes.
Here's what I've got so far:
library(ggplot2)
library(dplyr)
library(gridExtra)
# Prepare some sample data.
results <- data.frame(index=(1:20))
results$control <- 50 * results$index
results$value <- results$index * 50 + 2.5*results$index^2 - results$index^3 / 8
results$ratio <- results$value / results$control
# Plot absolute values
plot_values <- ggplot(results, aes(x=index)) +
geom_point(aes(y=value, color="value")) +
geom_point(aes(y=control, color="control"))
# Plot ratios between values
plot_ratios <- ggplot(results, aes(x=index, y=ratio)) +
geom_point()
# Arrange the two plots above each other
grid.arrange(plot_values, plot_ratios, ncol=1, nrow=2)
The big problem is that the legend on the right of the first plot makes it a different size. A minor problem is that I'd rather not show the x-axis name and tick marks on the top plot, to avoid clutter and make it clear that they share the same axis.
I've looked at this question and its answers:
Align plot areas in ggplot
Unfortunately, neither answer there works well for me. Faceting doesn't seem a good fit, since I want to have completely different y scales for my two graphs. Manipulating the dimensions returned by ggplot_gtable seems more promising, but I don't know how to get around the fact that the two graphs have a different number of cells. Naively copying that code doesn't seem to change the resulting graph dimensions for my case.
Here's another similar question:
The perils of aligning plots in ggplot
The question itself seems to suggest a good option, but rbind.gtable complains if the tables have different numbers of columns, which is the case here due to the legend. Perhaps there's a way to slot in an extra empty column in the second table? Or a way to suppress the legend in the first graph and then re-add it to the combined graph?
Here's a solution that doesn't require explicit use of grid graphics. It uses facets, and hides the legend entry for "ratio" (using a technique from https://stackoverflow.com/a/21802022).
library(reshape2)
results_long <- melt(results, id.vars="index")
results_long$facet <- ifelse(results_long$variable=="ratio", "ratio", "values")
results_long$facet <- factor(results_long$facet, levels=c("values", "ratio"))
ggplot(results_long, aes(x=index, y=value, colour=variable)) +
geom_point() +
facet_grid(facet ~ ., scales="free_y") +
scale_colour_manual(breaks=c("control","value"),
values=c("#1B9E77", "#D95F02", "#7570B3")) +
theme(legend.justification=c(0,1), legend.position=c(0,1)) +
guides(colour=guide_legend(title=NULL)) +
theme(axis.title.y = element_blank())
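For readers who want to see what the long format used above looks like without installing reshape2, here is a rough base R sketch of the same reshape. It rebuilds the sample data from the question; the helper name measure_vars is an assumption introduced only for this illustration.

```r
# Rebuild the sample data from the question
results <- data.frame(index = 1:20)
results$control <- 50 * results$index
results$value <- results$index * 50 + 2.5 * results$index^2 - results$index^3 / 8
results$ratio <- results$value / results$control

# Columns to stack into one long "value" column (hypothetical helper name)
measure_vars <- c("control", "value", "ratio")

results_long <- data.frame(
  index    = rep(results$index, times = length(measure_vars)),
  variable = factor(rep(measure_vars, each = nrow(results)), levels = measure_vars),
  value    = unlist(results[measure_vars], use.names = FALSE)
)

# One facet for the ratio, one shared facet for the two absolute series
results_long$facet <- factor(
  ifelse(results_long$variable == "ratio", "ratio", "values"),
  levels = c("values", "ratio")
)
str(results_long)
```

Each original row becomes three rows, one per measured variable, which is what lets facet_grid() split the ratio from the absolute values while keeping a shared x-axis.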
Try this:
library(ggplot2)
library(gtable)
library(gridExtra)
AlignPlots <- function(...) {
LegendWidth <- function(x) x$grobs[[8]]$grobs[[1]]$widths[[4]]
plots.grobs <- lapply(list(...), ggplotGrob)
max.widths <- do.call(unit.pmax, lapply(plots.grobs, "[[", "widths"))
plots.grobs.eq.widths <- lapply(plots.grobs, function(x) {
x$widths <- max.widths
x
})
legends.widths <- lapply(plots.grobs, LegendWidth)
max.legends.width <- do.call(max, legends.widths)
plots.grobs.eq.widths.aligned <- lapply(plots.grobs.eq.widths, function(x) {
if (is.gtable(x$grobs[[8]])) {
x$grobs[[8]] <- gtable_add_cols(x$grobs[[8]],
unit(abs(diff(c(LegendWidth(x),
max.legends.width))),
"mm"))
}
x
})
plots.grobs.eq.widths.aligned
}
df <- data.frame(x = c(1:5, 1:5),
y = c(1:5, seq.int(5,1)),
type = factor(c(rep_len("t1", 5), rep_len("t2", 5))))
p1.1 <- ggplot(diamonds, aes(clarity, fill = cut)) + geom_bar()
p1.2 <- ggplot(df, aes(x = x, y = y, colour = type)) + geom_line()
plots1 <- AlignPlots(p1.1, p1.2)
do.call(grid.arrange, plots1)
p2.1 <- ggplot(diamonds, aes(clarity, fill = cut)) + geom_bar()
p2.2 <- ggplot(df, aes(x = x, y = y)) + geom_line()
plots2 <- AlignPlots(p2.1, p2.2)
do.call(grid.arrange, plots2)
This is based on several of baptiste's answers.
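The core step in AlignPlots is taking the element-wise maximum of the column widths of all plots and assigning it back to each one. Here is a minimal sketch of that step alone, using only the grid package. The width values are invented stand-ins for the columns of two plot gtables and do not come from real ggplot output.

```r
library(grid)

# Hypothetical column widths of two plot gtables:
# (margin, axis, panel, legend); the second plot has no legend
widths_with_legend    <- unit(c(0.5, 1.0, 8.0, 2.5), "cm")
widths_without_legend <- unit(c(0.5, 1.4, 8.0, 0.0), "cm")

# Element-wise maximum: each column becomes as wide as the widest plot needs
max_widths <- unit.pmax(widths_with_legend, widths_without_legend)

# Resolve the result to plain numbers for inspection
print(convertWidth(max_widths, "cm", valueOnly = TRUE))
```

Once both gtables carry the same widths vector, their panels start and end at the same horizontal positions when stacked, which is the alignment the question asks for.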
Encouraged by baptiste's comment, here's what I did in the end:
library(ggplot2)
library(dplyr)
library(gridExtra)
library(gtable)  # for gtable_add_cols() and gtable_add_rows()
library(grid)    # for grid.draw()
# Prepare some sample data.
results <- data.frame(index=(1:20))
results$control <- 50 * results$index
results$value <- results$index * 50 + 2.5*results$index^2 - results$index^3 / 8
results$ratio <- results$value / results$control
# Plot ratios between values
plot_ratios <- ggplot(results, aes(x=index, y=ratio)) +
geom_point()
# Plot absolute values
remove_x_axis =
theme(
axis.ticks.x = element_blank(),
axis.text.x = element_blank(),
axis.title.x = element_blank())
plot_values <- ggplot(results, aes(x=index)) +
geom_point(aes(y=value, color="value")) +
geom_point(aes(y=control, color="control")) +
remove_x_axis
# Arrange the two plots above each other
grob_ratios <- ggplotGrob(plot_ratios)
grob_values <- ggplotGrob(plot_values)
legend_column <- 5
legend_width <- grob_values$widths[legend_column]
grob_ratios <- gtable_add_cols(grob_ratios, legend_width, legend_column-1)
grob_combined <- gtable:::rbind_gtable(grob_values, grob_ratios, "first")
grob_combined <- gtable_add_rows(
grob_combined,unit(-1.2,"cm"), pos=nrow(grob_values))
grid.draw(grob_combined)
(I later realised I didn't even need to extract the legend width, since the size="first" argument to rbind tells it just to have that one override the other.)
It feels a bit messy, but it is exactly the layout I was hoping for.
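For comparison, helper packages can do the gtable bookkeeping for you. The sketch below assumes the cowplot package, which none of the answers here use, and relies on its plot_grid() function with align = "v" and axis = "lr" to line up the left and right panel edges. Treat it as one possible approach rather than a drop-in replacement for the code above.

```r
library(ggplot2)
library(cowplot)  # assumption: not part of the original answers

# Same sample data as in the question
results <- data.frame(index = 1:20)
results$control <- 50 * results$index
results$value <- results$index * 50 + 2.5 * results$index^2 - results$index^3 / 8
results$ratio <- results$value / results$control

# Top plot: absolute values, with a legend and without x-axis decorations
plot_values <- ggplot(results, aes(x = index)) +
  geom_point(aes(y = value, color = "value")) +
  geom_point(aes(y = control, color = "control")) +
  theme(axis.title.x = element_blank(),
        axis.text.x  = element_blank(),
        axis.ticks.x = element_blank())

# Bottom plot: ratio, no legend
plot_ratios <- ggplot(results, aes(x = index, y = ratio)) +
  geom_point()

# Stack vertically and align the left and right panel edges
combined <- plot_grid(plot_values, plot_ratios, ncol = 1, align = "v", axis = "lr")
print(combined)
```

The trade-off is an extra dependency in exchange for not having to know which gtable column holds the legend.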
An alternative and fairly simple solution is as follows:
# loading needed packages
library(ggplot2)
library(dplyr)
library(tidyr)
# Prepare some sample data
results <- data.frame(index=(1:20))
results$control <- 50 * results$index
results$value <- results$index * 50 + 2.5*results$index^2 - results$index^3 / 8
results$ratio <- results$value / results$control
# reshape into long format
long <- results %>%
gather(variable, value, -index) %>%
mutate(facet = ifelse(variable=="ratio", "ratio", "values"))
long$facet <- factor(long$facet, levels=c("values", "ratio"))
# create the plot & remove facet labels with theme() elements
ggplot(long, aes(x=index, y=value, colour=variable)) +
geom_point() +
facet_grid(facet ~ ., scales="free_y") +
scale_colour_manual(breaks=c("control","value"), values=c("green", "red", "blue")) +
theme(axis.title.y=element_blank(), strip.text=element_blank(), strip.background=element_blank())

Resources