The significance letters from my Tukey test (LABELS) and the matching colours are not drawn over each sample's box in the box plot. Why?
It looks as if the labels are all plotted at different y positions above the same sample, s1, on the x-axis.
Reproducible dataset here:
library(multcompView)
df <- data.frame('Sample'=c("s1","s1","s1","s1","s1","s2","s2","s2","s2","s2","s3","s3","s3","s3","s4","s4","s5","s5"), 'value'=c(-0.1098,-0.1435,-0.1046,-0.1308,-0.1523,-0.1219,-0.1114,-0.1328,-0.1589,-0.1567,-0.1395,-0.1181,-0.1448,-0.124,-0.1929,-0.1996,-0.1981,-0.1917))
anova_df <- aov(df$value ~ df$Sample )
tukey_df <- TukeyHSD(anova_df, 'df$Sample', conf.level=0.95)
# I need to group together the treatments that are not different from each other.
TUKEY <- tukey_df
generate_label_df <- function(TUKEY, variable){
# Extract labels and factor levels from Tukey post-hoc
Tukey.levels <- TUKEY[[variable]][,4]
Tukey.labels <- data.frame(multcompLetters(Tukey.levels)['Letters'])
#I need to put the labels in the same order as in the boxplot :
Tukey.labels$Sample=rownames(Tukey.labels)
Tukey.labels=Tukey.labels[order(Tukey.labels$Sample) , ]
return(Tukey.labels)
}
# Apply the function on my dataset
LABELS <- generate_label_df(TUKEY , "df$Sample")
# A panel of colors to draw each group with the same color :
my_colors <- c(
rgb(143,199,74,maxColorValue = 255),
rgb(242,104,34,maxColorValue = 255),
rgb(111,145,202,maxColorValue = 255))
# Draw the basic boxplot
a <- boxplot(df$value ~ df$Sample , ylim=c(min(df$value) , 1.1*max(df$value)) , col=my_colors[as.numeric(LABELS[,1])] , ylab="Value" , main="")
# I want to write the letter over each box. Over is how high I want to write it.
over <- 0.1*max(a$stats[nrow(a$stats),] )
#Add the labels
text(c(1:nlevels(df$Sample)), a$stats[nrow(a$stats),]+over, LABELS[,1] , col=my_colors[as.numeric(LABELS[,1])] )
Current output:
Desired plot-like (colours and LABELS):
First, LABELS$Letters is a character vector. You can get as.numeric(LABELS[,1]) to work if you make it a factor first.
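Second, your y-limit needs some work for negative values: multiplying a negative maximum by 1.1 lowers the upper limit instead of raising it. There is a function you might find useful called extendrange, which is used in many plotting functions.
Third, c(1:nlevels(df$Sample)) would only work if df$Sample were a factor, which it is not: nlevels() returns 0 for a character vector, so the expression evaluates to c(1, 0).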
Also, if you are plotting text at a specific location, you can adjust the text using either text(..., pos = ) or text(..., adj = ) to shift the position.
LABELS$Letters <- factor(LABELS$Letters)
a <- boxplot(df$value ~ df$Sample , ylim = extendrange(df$value), col=my_colors[as.numeric(LABELS[,1])] , ylab="Value" , main="")
text(seq_along(a$names), apply(a$stats, 2, max), LABELS[,1], col=my_colors[as.numeric(LABELS[,1])], pos = 3)
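To see why these fixes matter, here is a small base R sketch. The letters, sample names and values are made up for illustration (they are not the actual Tukey output). It shows that as.numeric() on character letters yields NA while a factor yields integer codes usable as colour indices, that 1:nlevels() misbehaves on a character vector, and that extendrange() pads a negative range where 1.1 * max() shrinks it.

```r
# Hypothetical compact-letter labels, similar in shape to LABELS$Letters
letters_chr <- c("a", "a", "ab", "b", "b")

# Character strings cannot be coerced to numbers, so the colour index is all NA
idx_chr <- suppressWarnings(as.numeric(letters_chr))

# A factor stores integer codes (1 = "a", 2 = "ab", 3 = "b")
idx_fct <- as.numeric(factor(letters_chr))

# nlevels() is 0 for a character vector, so 1:nlevels() counts down from 1 to 0
sample_chr <- c("s1", "s2", "s3")
x_bad <- 1:nlevels(sample_chr)
x_good <- seq_along(unique(sample_chr))

# With negative data, 1.1 * max() moves the upper limit down, not up
value <- c(-0.10, -0.15, -0.20)
naive_upper <- 1.1 * max(value)
padded <- extendrange(value)  # widens the range by 5% on each side by default

print(idx_chr)
print(idx_fct)
print(x_bad)
print(x_good)
print(c(naive_upper = naive_upper, padded_lower = padded[1], padded_upper = padded[2]))
```

The x_bad result explains the symptom in the question: the letters are recycled over only two x positions, one of which is 0 and lies outside the plotting region, so they pile up at s1.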
If you don't mind changing your workflow to use the tidyverse, this is how you could achieve your goal:
library(tidyverse)
# join df and LABELS into one data table
inner_join(df, LABELS, by = "Sample") %>%
# calculate max value for each Sample group (it will be used to place the labels)
group_by(Sample) %>%
mutate(placement = max(value)) %>%
ungroup() %>%
# make a plot
ggplot(aes(Sample, value, fill = Letters))+
geom_boxplot()+
geom_text(aes(y = placement, label = Letters, col = Letters), nudge_y = 0.01, size = 6)+
theme_minimal()+
theme(legend.position = "none")
I have created a qqplot (with quantiles of beta distribution) from a dataset including two groups. To visualize, which points belong to which group, I would like to color them. I have tried the following:
res <- beta.mle(data$values) #estimate parameters of beta distribution
qqplot(qbeta(ppoints(500),res$param[1], res$param[2]),data$values,
col = data$group,
ylab = "Quantiles of data",
xlab = "Quantiles of Beta Distribution")
the result is shown here:
I have seen solutions that specify a "col" vector for qqnorm; however, this does not seem to work with qqplot: half of the points are simply drawn in each colour, regardless of group. Is there a way to fix this?
I simulated some data just to show how to add color in ggplot.
Libraries
library(tidyverse)
# install.packages("Rfast")
Data
#Simulating data from beta distribution
x <- rbeta(n = 1000,shape1 = .5,shape2 = .5)
#Estimating parameters
res <- Rfast::beta.mle(x)
data <-
tibble(
simulated_data = sort(x),
quantile_data = qbeta(ppoints(length(x)),res$param[1], res$param[2])
) %>%
#Creating a group variable using quartiles
mutate(group = cut(x = simulated_data,
quantile(simulated_data,seq(0,1,.25)),
include.lowest = T))
Code
data %>%
# Adding group variable as color
ggplot(aes( x = quantile_data, y = simulated_data, col = group))+
geom_point()
Output
For those who are wondering, how to work with pre-defined groups, this is the code that worked for me:
library(tidyverse)
library(Rfast)
res <- beta.mle(x)
# make sure groups are not numerical
# (else the color scale might turn out continuous)
g <- plyr::mapvalues(g, c("1", "2"), c("Group1", "Group2"))
data <-
tibble(
my_data = sort(x),
quantile_data = qbeta(ppoints(length(x)),res$param[1], res$param[2]),
group = g[order(x)]
)
data %>%
# Adding group variable as color
ggplot(aes( x = quantile_data, y = my_data, col = group))+
geom_point()
result
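For context on why col = data$group fails in base qqplot(): the function sorts both samples before plotting, so a colour vector given in the original row order no longer lines up with the points. Below is a minimal base R sketch of the same reordering idea as above. The values and the two-level g are simulated stand-ins (assumptions, not the asker's data), and the beta parameters are fixed by hand rather than estimated with beta.mle.

```r
set.seed(1)
# Simulated stand-ins for data$values and data$group
x <- rbeta(200, shape1 = 2, shape2 = 5)
g <- factor(sample(c("Group1", "Group2"), 200, replace = TRUE))

ord <- order(x)                          # permutation that sorts the data
theo <- qbeta(ppoints(length(x)), 2, 5)  # theoretical quantiles (parameters assumed known here)

# Plot the sorted data against the theoretical quantiles,
# reordering the groups with the same permutation so colours stay attached
plot(theo, x[ord], col = as.integer(g[ord]), pch = 19,
     xlab = "Quantiles of Beta Distribution", ylab = "Quantiles of data")
legend("topleft", legend = levels(g), col = seq_along(levels(g)), pch = 19)
```

The key step is indexing both the values and the groups with ord, which is what group = g[order(x)] does in the tibble above.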
I am trying to add Kruskal-Wallis and pairwise Wilcoxon test results to a figure to show which groups are significantly different, but I have multiple groups/subgroups within each group and facet, which makes it complicated.
Here is the R code using the iris dataset as an example. The idea is to run kruskal.test across the treatments (A, B, C) for each variable (Sepal.Length, Sepal.Width, Petal.Length, Petal.Width) per species, and also to run wilcox.test pairwise between the treatments:
rm(list=ls(all=TRUE)); cat('\014') # clear workspace
library(tidyverse)
library(ggplot2)
library(viridis)
library(rstatix)
library(ggpubr) # provides compare_means(), stat_pvalue_manual() and ggpar()
data(iris)
iris$treatment <- rep(c("A","B","C"), length(iris$Species)/3)
mydf <- gather(iris,Variable,value,Sepal.Length:Petal.Width)
# change number to create more difference
mydf[mydf$treatment=="B",]$value <- mydf[mydf$treatment=="B",]$value*1.2
#mydf[mydf$treatment=="C",]$value <- mydf[mydf$treatment=="C",]$value+0.3
# do pairwise Wilcoxon test for pairwise comparisons between groups
df_wilcox <- mydf %>%
group_by(Species,Variable) %>%
pairwise_wilcox_test(value ~ treatment) %>%
add_y_position(step.increase = 0.02)
# do Kruskal Wallis test to see whether or not there is statistically significant difference between three or more groups
df_kw <- compare_means(value ~ treatment, mydf, group.by = c("Species","Variable"), method="kruskal")
# plot boxplot with wilcoxon and kruskal test results
P <- ggplot(data=mydf,
aes(x=treatment, y=value, fill=Variable))+
stat_boxplot(geom = "errorbar")+geom_boxplot(outlier.shape = NA)+
facet_wrap(~Species,nrow=1)+
theme_bw()+
theme(axis.text=element_text(size=12),axis.title=element_text(size=16),plot.title=element_text(size=20)) +
theme(strip.text = element_text(size=14))+
scale_fill_viridis(discrete = TRUE) +
guides(fill=guide_legend(title="Variable"))+
stat_pvalue_manual(df_wilcox,color ="Variable",step.group.by="Variable",tip.length = 0,step.increase = 0.02)
#stat_pvalue_manual(df_wilcox,color ="Variable",step.group.by="Variable",tip.length = 0,step.increase = 0.02,hide.ns=T) #hide non-significant
# change legend title and wilcoxon test color
ggpar(P,legend.title = "Wilcoxon test",palette = c("#440154FF","#3B528BFF","#21908CFF","#FDE725FF"))
This produces the following plot:
To improve the figure, I want to:
automatically add the Kruskal test result from 'df_kw' as text to the figure, showing only significant p-values (e.g. KW(petal.length) p = 0.003)
make the Wilcoxon lines between treatments (e.g. "A", "B", "C") for the different variables (e.g. Petal/Sepal Length/Width) look neat (e.g. all on top of the boxplots with consistent line spacing)
make the colour of each Wilcoxon test line the same as the colour of its boxplot (at the moment 'ggpar' doesn't always work if I hide non-significant results, because the Wilcoxon test then covers fewer variables than the plot)
I am stuck here and wondering whether anyone has a solution. Thank you very much!
I can answer the first part of your question, regarding how to add the p-value labels to the plot automatically. One way to do that is to combine mydf and df_kw so that df_kw includes all of the same columns as mydf. Here I do that using the data.table package like this:
setDT(mydf); setDT(df_kw) # convert to data.tables by reference
df_kw <- mydf[df_kw, mult = "first", on = c("Variable", "Species"), nomatch=0L] #creates data table with the same columns as mydf
df_kw <- df_kw[df_kw$p < 0.05,] #removes non-significant values
Then you can add the labels automatically using geom_text. I would first generate a numeric vector of y values to position the labels, like this:
y_lab_placement <- c(sort(rep(seq(max(mydf$value)*1.25, by = -0.35, length.out = length(unique(mydf$Variable))),
length(unique(mydf$Species))), decreasing = T)) # creates y values of where to place the labels
y_lab_placement <- y_lab_placement[1:nrow(df_kw)] # adjusts length of placements to the length of significant values
Then I would add this line to your ggplot to add the labels:
geom_text(data = df_kw, aes(x = 2 , y = y_lab_placement, label = c(paste(Variable, "KW p ~" , round(p, 5)))))+ #adds labels to the plot based on your data
Here is your entire code block including these additions.
rm(list=ls(all=TRUE)); cat('\014') # clear workspace
library(tidyverse)
library(ggplot2)
library(viridis)
library(rstatix)
library(ggpubr) # provides compare_means(), stat_pvalue_manual() and ggpar()
library(data.table) # used in creating combined data table
data(iris)
iris$treatment <- rep(c("A","B","C"), length(iris$Species)/3)
mydf <- gather(iris,Variable,value,Sepal.Length:Petal.Width)
# change number to create more difference
mydf[mydf$treatment=="B",]$value <- mydf[mydf$treatment=="B",]$value*1.2
#mydf[mydf$treatment=="C",]$value <- mydf[mydf$treatment=="C",]$value+0.3
# do pairwise Wilcoxon test for pairwise comparisons between groups
df_wilcox <- mydf %>%
group_by(Species,Variable) %>%
pairwise_wilcox_test(value ~ treatment) %>%
add_y_position(step.increase = 0.02)
# do Kruskal Wallis test to see whether or not there is statistically significant difference between three or more groups
df_kw <- compare_means(value ~ treatment, mydf, group.by = c("Species","Variable"), method="kruskal")
setDT(mydf); setDT(df_kw) # convert to data.tables by reference
df_kw <- mydf[df_kw, mult = "first", on = c("Variable", "Species"), nomatch=0L] #creates data table with the same columns as mydf
df_kw <- df_kw[df_kw$p < 0.05,] #removes non-significant values
# plot boxplot with wilcoxon and kruskal test results
y_lab_placement <- c(sort(rep(seq(max(mydf$value)*1.25, by = -0.35, length.out = length(unique(mydf$Variable))),
length(unique(mydf$Species))), decreasing = T)) # creates y values of where to place the labels
y_lab_placement <- y_lab_placement[1:nrow(df_kw)] # adjusts length of placements to the length of significant values
P <- ggplot(data=mydf,
aes(x=treatment, y=value, fill=Variable))+
stat_boxplot(geom = "errorbar")+geom_boxplot(outlier.shape = NA)+
facet_wrap(~Species,nrow=1)+
theme_bw()+
theme(axis.text=element_text(size=12),axis.title=element_text(size=16),plot.title=element_text(size=20)) +
theme(strip.text = element_text(size=14))+
scale_fill_viridis(discrete = TRUE) +
guides(fill=guide_legend(title="Variable"))+
geom_text(data = df_kw, aes(x = 2 , y = y_lab_placement, label = c(paste(Variable, "KW p ~" , round(p, 5)))))+ #adds labels to the plot based on your data
stat_pvalue_manual(df_wilcox,color ="Variable",step.group.by="Variable",tip.length = 0,step.increase = 0.02)
#stat_pvalue_manual(df_wilcox,color ="Variable",step.group.by="Variable",tip.length = 0,step.increase = 0.02,hide.ns=T) #hide non-significant
# change legend title and wilcoxon test color
ggpar(P,legend.title = "Wilcoxon test",palette = c("#440154FF","#3B528BFF","#21908CFF","#FDE725FF"))
I'm using the svars package to generate some IRF plots. The plots are rendered using ggplot2, however I need some help with changing some of the aesthetics.
Is there any way I can change the fill and alpha of the shaded confidence bands, as well as the color of the solid line? I know in ggplot2 you can pass fill and alpha arguments to geom_ribbon (and col to geom_line), just unsure of how to do the same within the plot function of this package's source code.
# Load Dataset and packages
library(tidyverse)
library(svars)
data(USA)
# Create SVAR Model
var.model <- vars::VAR(USA, lag.max = 10, ic = "AIC" )
svar.model <- id.chol(var.model)
# Wild Bootstrap
cores <- parallel::detectCores() - 1
boot.svar <- wild.boot(svar.model, n.ahead = 30, nboot = 500, nc = cores)
# Plot the IRFs
plot(boot.svar)
I'm also looking at the command for a historical decomposition plot (see below). Is there any way I could omit the first two facets and plot only the bottom three lines on the same facet?
hist.decomp <- hd(svar.model, series = 1)
plot(hist.decomp)
Your first desired result is easily achieved by resetting the aes_params after calling plot. For your second goal, there is probably a way to manipulate the ggplot object, but my approach below instead constructs the plot from scratch. Basically, I copied and pasted the data-wrangling code from svars:::plot.hd and filtered the prepared dataset for the desired series:
# Plot the IRFs
p <- plot(boot.svar)
p$layers[[1]]$aes_params$fill <- "pink"
p$layers[[1]]$aes_params$alpha <- .5
p$layers[[2]]$aes_params$colour <- "green"
p
# Helper to convert to long dataframe. Source: svars:::plot.hd
hd2PlotData <- function(x) {
PlotData <- as.data.frame(x$hidec)
if (inherits(x$hidec, "ts")) {
tsStructure = attr(x$hidec, which = "tsp")
PlotData$Index <- seq(from = tsStructure[1], to = tsStructure[2],
by = 1/tsStructure[3])
PlotData$Index <- as.Date(zoo::as.yearmon(PlotData$Index)) # the yearmon helpers live in zoo, which is not attached here
}
else {
PlotData$Index <- 1:nrow(PlotData)
PlotData$V1 <- NULL
}
dat <- reshape2::melt(PlotData, id = "Index")
dat
}
hist.decomp <- hd(svar.model, series = 1)
dat <- hd2PlotData(hist.decomp)
dat %>%
filter(grepl("^Cum", variable)) %>%
ggplot(aes(x = Index, y = value, color = variable)) +
geom_line() +
xlab("Time") +
theme_bw()
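The layer edits used for the IRF plot can be tried out on a toy plot first. The sketch below does not use svars at all: the data frame and its column names are invented, and the ribbon-then-line layer order is an assumption that mirrors the indices used above. It also shows how to check which geom each layer holds, which may help if a plot returned by a package orders its layers differently.

```r
library(ggplot2)

# Toy data standing in for an impulse response with a confidence band
toy <- data.frame(h = 0:10)
toy$irf <- exp(-toy$h / 3)
toy$lower <- toy$irf - 0.1
toy$upper <- toy$irf + 0.1

p_toy <- ggplot(toy, aes(x = h, y = irf)) +
  geom_ribbon(aes(ymin = lower, ymax = upper), fill = "grey70", alpha = 0.3) +
  geom_line(colour = "black")

# Check which geom each layer uses before editing it
layer_geoms <- vapply(p_toy$layers, function(l) class(l$geom)[1], character(1))
print(layer_geoms)

# Overwrite the fixed (non-mapped) aesthetics stored on each layer
ribbon <- which(layer_geoms == "GeomRibbon")
line <- which(layer_geoms == "GeomLine")
p_toy$layers[[ribbon]]$aes_params$fill <- "pink"
p_toy$layers[[ribbon]]$aes_params$alpha <- 0.5
p_toy$layers[[line]]$aes_params$colour <- "green"
p_toy
```

This only changes aesthetics that were set as constants on the layer; aesthetics mapped inside aes() would need a scale instead.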
EDIT: One approach to change the facet labels is via a custom labeller function (a different approach changes the facet labels via the data):
my_labels <- LETTERS[1:9]
# labeller factory: returns a labeller that swaps in my_labels, one per panel in panel order
my_labeller <- function(my_labels) {
function(labels, multi_line = TRUE) {
data.frame(variable = my_labels)
}
}
p + facet_wrap(~variable, labeller = my_labeller(my_labels))
I have a dataframe df that I am plotting with ggplot. Can we also highlight outliers? Below is the sample code:
df <- data.frame(col=runif(100, min=0, max=100000))
df$D <- c(1:100)
ggplot(df,aes(x=D,y=col))+geom_line()
Is there a way to highlight outliers here?
We can define a function for this. line_outlier_plot has four arguments. df has the same format as your example data frame. outlier_color and normal_color specify the colors of the points. drop indicates whether unused categories should be dropped from the legend.
We have to define how to determine an outlier. Here, I decided that an outlier is a value more than 3 standard deviations above or below the mean. You can define your own approach to determine the outlier by modifying the condition in the ifelse statement.
library(ggplot2)
line_outlier_plot <- function(df, outlier_color = "red", normal_color = "black", drop = FALSE){
# Assign a label to show if it is an outlier or not
df$label <- ifelse(df$col > mean(df$col) + 3 * sd(df$col) |
df$col < mean(df$col) - 3 * sd(df$col), "Outlier", "Normal")
df$label <- factor(df$label, levels = c("Normal", "Outlier"))
# Set the color palette
pal <- c("Outlier" = outlier_color, "Normal" = normal_color)
p <- ggplot(df, aes(x = D, y = col)) +
geom_line() +
geom_point(aes(color = label)) +
scale_color_manual(values = pal, drop = drop)
return(p)
}
Below is an example of the plot using this function.
set.seed(155)
df <- data.frame(col=rnorm(1000))
df$D <- c(1:1000)
line_outlier_plot(df)
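As an illustration of swapping in a different rule, here is a sketch of the 1.5 * IQR fence used by box plots. The helper name is_outlier_iqr and its k argument are hypothetical and not part of the function above; the data are simulated with one planted extreme value.

```r
# Hypothetical helper: flag values outside the k * IQR fences (the box plot rule)
is_outlier_iqr <- function(x, k = 1.5) {
  q <- unname(quantile(x, c(0.25, 0.75), na.rm = TRUE))
  fence <- k * (q[2] - q[1])
  x < q[1] - fence | x > q[2] + fence
}

set.seed(155)
y <- c(rnorm(50), 8)  # 50 standard normal draws plus one planted extreme value
flag <- is_outlier_iqr(y)

# Same labelling step as in line_outlier_plot, with the new condition
label <- factor(ifelse(flag, "Outlier", "Normal"), levels = c("Normal", "Outlier"))
print(table(label))
```

To use it in line_outlier_plot, the condition inside ifelse would become is_outlier_iqr(df$col); the rest of the function can stay as it is.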
I need to create a qq plot of -log10 p-values in ggplot2 where a subset of 137 points ("targets") are highlighted in gold using a colorblind-friendly palette I'm using called cbbPalette. I cannot do this in an alternate package because I eventually need to combine multiple qq plots into a grid using grid.arrange from the gridExtra package that works with ggplot2.
Setup:
library(ggplot2)
library(reshape2)
cbbPalette <- c("#E69F00", "#000000") #part of my palette; gold & black
set.seed(100)
The data consists of 100,137 p-values, 137 of which are targets:
p_values = c(
runif(100000, min = 0, max = 1),
runif(132, min = 1e-7, max = 1),
c(6e-20, 6e-19, 7e-9, 7.5e-9, 4e-8)
)
#labels for the p-values
names_letters <-
do.call(paste0, replicate(2, sample(LETTERS, 100137, TRUE), FALSE))
names = paste0(names_letters, sprintf("%04d", sample(9999, 100137, TRUE)))
targets = names[100001:100137] #last 137 are targets
df = as.data.frame(p_values)
df$names = names
df <-
df[sample(nrow(df)), ] #shuffles the df to place targets randomly w/in
df$Category = ifelse(df$names %in% targets, "Target", "Non-Target")
Appearance of Data:
head(df, 4)
p_values names Category
89863 0.4821147 NZ3385 Non-Target
20209 0.3998835 SQ3793 Non-Target
29200 0.7893478 ZT5497 Non-Target
71623 0.3459360 QF5311 Non-Target
Melted df Using reshape2 with Observed (o) & Expected (e) -log10 p-values:
df.m = melt(df)
df.m$o = -log10(sort(df.m$value, decreasing = F))
df.m$e = -log10(1:nrow(df.m) / nrow(df.m))
Appearance of Melted df:
head(df.m,4)
names Category variable value o e
1 NZ3385 Non-Target p_values 0.4821147 19.221849 5.000595
2 SQ3793 Non-Target p_values 0.3998835 18.221849 4.699565
3 ZT5497 Non-Target p_values 0.7893478 8.154902 4.523473
4 QF5311 Non-Target p_values 0.3459360 8.124939 4.398535
QQ-plot
df_qq = ggplot(df.m, aes(e, o)) +
geom_point(aes(color = Category)) +
scale_colour_manual(values = cbbPalette) +
geom_abline(intercept = 0, slope = 1) +
ylab("Observed -log[10](p)") +
xlab("Theoretical -log[10](p)")
I then get a QQ plot with no highlighting of my 137 targets.
You can draw the targets in a separate geom_point() call after the non-targets. Geoms are plotted in order, so the targets end up on top:
cbbPalette <- c(Target = "#E69F00", `Non-Target` = "#000000")
df_qq = ggplot(df.m, aes(e, o)) +
geom_abline(intercept = 0, slope = 1) +
geom_point(aes(color = Category), data = df.m[df.m$Category == "Non-Target", ]) +
geom_point(aes(color = Category), data = df.m[df.m$Category == "Target", ]) +
scale_colour_manual(values = cbbPalette) +
ylab("Observed -log[10](p)") +
xlab("Theoretical -log[10](p)")
I've also added names to your palette to make sure the right colours are attached to each category; otherwise this can get mixed up when changing the order of the geom_point() calls.
Result:
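One thing that may be worth checking separately (an observation about the question's code, not part of the answer): sort(df.m$value) reorders the p-values but leaves the names and Category columns in their shuffled order, so the Category in a row no longer belongs to the o in that row. In the head(df.m, 4) printout, for instance, the first row pairs value 0.4821147 with o 19.22, which is -log10(6e-20). Assuming the intent is to colour each point by the category of its own p-value, one way to keep the columns aligned is to sort the whole data frame first. The tiny data frame here is invented for illustration.

```r
# Small made-up example: two targets with tiny p-values among non-targets
toy <- data.frame(
  p_values = c(0.50, 1e-8, 0.20, 0.90, 1e-6, 0.70),
  Category = c("Non-Target", "Target", "Non-Target",
               "Non-Target", "Target", "Non-Target")
)

# Sort the rows by p-value so that every column moves together
toy <- toy[order(toy$p_values), ]
n <- nrow(toy)
toy$o <- -log10(toy$p_values)    # observed -log10(p)
toy$e <- -log10(seq_len(n) / n)  # expected -log10(p), computed as in the question

print(toy)
```

After this step the two smallest p-values still carry the Target label, so colouring by Category highlights the intended points.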
If you want to avoid having to split your dataframe into two calls to geom_point, you can order the data by the Category column first, then pipe it into ggplot. For just these two category values, you could arrange pretty simply:
df.m %>%
arrange(Category) %>%
ggplot(...)
which will put your data in alphabetical order with Non-Target observations, then Target ones. Points get drawn in order, so this will put points in the target category on top.
To have more control over the ordering, you can make Category a factor, and set the levels explicitly, then arrange by the factor order:
df.m %>%
mutate(Category = as.factor(Category) %>% fct_relevel("Target")) %>%
arrange(desc(Category)) %>%
ggplot(...)
I'm using fct_relevel from the forcats package, just because it's a really easy way to manipulate factor levels; you could order levels with base R as well. fct_relevel puts the Target level first, so when I arrange by Category, I'm doing it in reverse, so that again Target gets drawn last.
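For comparison, the base R version of the level ordering mentioned above could look like the sketch below. The short category vector is made up; with the real data the same idea would apply to df.m$Category.

```r
category <- c("Target", "Non-Target", "Non-Target", "Target", "Non-Target")

# Put "Non-Target" first in the level order so it sorts (and is drawn) first
category_f <- factor(category, levels = c("Non-Target", "Target"))
draw_order <- order(category_f)

print(category[draw_order])
```

Indexing the data frame rows with draw_order before passing it to ggplot has the same effect as the arrange() call: rows in the last level are drawn last and end up on top.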
Hope that makes sense!