data points misaligned when using a third value with position jitterdodge - r

Edited with sample data:
When I try to plot a grouped boxplot together with jittered points using position=position_jitterdodge(), and add an additional group indicated by e.g. shape, I end up with a graph where the jittered points are misaligned within the individual groups:
n <- 16
data <- data.frame(
age = factor(rep(c('young', 'old'), each=8)),
group=rep(LETTERS[1:2], n/2),
yval=rnorm(n)
)
ggplot(data, aes(x=group, y=yval))+
geom_boxplot(aes(color=group), outlier.shape = NA)+
geom_point(aes(color=group, shape=age, fill=group),size = 1.5, position=position_jitterdodge())+
scale_shape_manual(values = c(21,24))+
scale_color_manual(values=c("black", "#015393"))+
scale_fill_manual(values=c("white", "#015393"))+
theme_classic()
Is there a way to suppress that additional separation?
Thank you!

OP, I think I see the problem. The points are being dodged by age within each group, rather than all points in a group being treated as one set. This happens because you have not specified what to group together. To jitter-dodge the points, ggplot2 first splits them into groups and then applies the dodge and jitter per group. If you don't specify the grouping, ggplot2 guesses it from the discrete variables mapped in the layer.
In this case, it groups by both age and group, since both are used in the aesthetics (x=, fill=, and color= are mapped to group, and shape= is mapped to age).
To group the points by the column group only, use the group= aesthetic. (I'm reposting your data with a seed so that you see the same thing.)
library(ggplot2)
set.seed(8675309)
n <- 16
data <- data.frame(
age = factor(rep(c('young', 'old'), each=8)),
group=rep(LETTERS[1:2], n/2),
yval=rnorm(n)
)
ggplot(data, aes(x=group, y=yval))+
geom_boxplot(aes(color=group), outlier.shape = NA)+
geom_point(aes(color=group, shape=age, fill=group, group=group),size = 1.5, position=position_jitterdodge())+
scale_shape_manual(values = c(21,24))+
scale_color_manual(values=c("black", "#015393"))+
scale_fill_manual(values=c("white", "#015393"))+
theme_classic()
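One way to check how the grouping changes is to inspect the group column that ggplot2 computes for the point layer. The following is a minimal sketch, assuming ggplot2 is installed; it uses layer_data() on a plot that contains only the point layer, and the helper name count_point_groups is made up for illustration.

```r
library(ggplot2)

set.seed(8675309)
n <- 16
data <- data.frame(
  age   = factor(rep(c("young", "old"), each = 8)),
  group = rep(LETTERS[1:2], n / 2),
  yval  = rnorm(n)
)

# Hypothetical helper: build a points-only plot with the given aesthetics
# and count the distinct groups ggplot2 assigns to that layer.
count_point_groups <- function(point_aes) {
  p <- ggplot(data, aes(x = group, y = yval)) +
    geom_point(point_aes, position = position_jitterdodge())
  length(unique(layer_data(p, 1)$group))
}

# Default grouping: derived from the discrete variables mapped in the layer
n_default <- count_point_groups(aes(color = group, shape = age, fill = group))

# Explicit grouping: only the column "group" defines the groups
n_explicit <- count_point_groups(
  aes(color = group, shape = age, fill = group, group = group)
)

print(c(default = n_default, explicit = n_explicit))
```

With this data the default should come out as 4 groups (every combination of group and age) and the explicit version as 2, which is why the points stop separating by age once group= is set.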

Related

overlay the grand mean and se into a scatter dot

I have a dot plot created by ggplot, in which I plot every subject's individual responses. The subjects are organized into 3 groups in the plot, and I have also estimated and plotted the mean and SE for each subject. Now I want to add the grand mean and SE for each group to the same plot.
This is how I created the first plot:
mazeSRDataS1_Errorplot<-ggplot(mazeSRDataS1, aes(Errorfixed, GroupSub,
colour=as.factor(Group)))+geom_point() +
mytheme3+ ggtitle("mazeSR-S1 Error plot")+ labs(y="Subject ID", x = "Error (degrees)", colour =
"Group")+ scale_colour_manual(values = c("brown4", "slategray3", "tan1"))
mazeSRDataS1_Errorplot + stat_summary(fun = mean, position = 'dodge', shape=1, size=0.5,
colour='black') + stat_summary(fun.data = mean_cl_normal, geom = 'errorbar', colour='black')
This is how I plotted the grand mean and SE for each group (I first aggregated the data and computed the mean and SE for each group).
ggplot(meanSEErrorMazeSR1, aes(x=Error, y=Group, colour=Group)) +
geom_errorbar(aes(xmin=Error-se, xmax=Error+se), width=.1, position='dodge') +
geom_line(position='dodge') + geom_point(position='dodge')
But, how do I merge these plots and overlay the one over the other?
Thank you in advance!!
You can add y-axis positions to the aggregated data you've made to specify where on the first plot you want them plotted, and then add another geom_errorbar(data = ...) where you specify to use the aggregated data e.g.:
library(dplyr) # provides %>% and mutate()
meanSEErrorMazeSR1 <-
meanSEErrorMazeSR1 %>%
mutate(y_position = c(30, 90, 150)) # since you didn't provide a reproducible example, you'll need to figure out the best positions yourself here
mazeSRDataS1_Errorplot +
geom_errorbar(data = meanSEErrorMazeSR1, aes(y = y_position, xmin=Error-se, xmax=Error+se), width=.1)
You can toy around with different y-values to use for the positioning of the error bars. In your case, because the y-axis is discrete due to being based on Subject IDs, the y-values will correspond to the order of the subject on the plot - the y_position = c(30, 90, 150) above corresponds to the 30th, 90th, and 150th subject, respectively.
Note also that the argument position='dodge' is not needed because you're not using a group aesthetic!
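Since the question has no reproducible data, here is a small sketch of the same idea on made-up data. The data frame subjects, the function se_fun, and the choice of placing each error bar at the middle of its group's block of subjects are all assumptions for illustration; only the column names follow the question.

```r
library(ggplot2)

set.seed(1)
# Hypothetical data: 6 subjects in 3 groups, 4 trials per subject
subjects <- data.frame(
  GroupSub   = factor(rep(sprintf("S%02d", 1:6), each = 4)),
  Group      = factor(rep(c("g1", "g2", "g3"), each = 8)),
  Errorfixed = rnorm(24, mean = 10, sd = 3)
)

# Standard error of the mean
se_fun <- function(x) sd(x) / sqrt(length(x))

# Grand mean and SE per group (tapply returns values in factor-level order)
grand <- data.frame(
  Group = levels(subjects$Group),
  Error = as.numeric(tapply(subjects$Errorfixed, subjects$Group, mean)),
  se    = as.numeric(tapply(subjects$Errorfixed, subjects$Group, se_fun))
)

# On a discrete axis, subject k sits at position k, so the middle of a
# group's block is the midpoint of its smallest and largest subject index.
subject_index <- as.integer(subjects$GroupSub)
grand$y_position <- as.numeric(
  tapply(subject_index, subjects$Group, function(i) mean(range(i)))
)

# Overlay the aggregated layer on the per-subject plot
p <- ggplot(subjects, aes(Errorfixed, GroupSub, colour = Group)) +
  geom_point() +
  geom_errorbar(
    data = grand,
    aes(y = y_position, xmin = Error - se, xmax = Error + se),
    width = 0.3, colour = "black", inherit.aes = FALSE
  ) +
  geom_point(
    data = grand, aes(x = Error, y = y_position),
    colour = "black", size = 3, inherit.aes = FALSE
  )
```

Computing y_position from the subject order avoids hand-tuning values like c(30, 90, 150), but it only makes sense if each group's subjects are contiguous on the axis.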

Ordering geom_col NOT by fill value

Please resist the instinct to jump straight to defining factor levels. I am trying to make a bar plot with text annotations. I'm using geom_col with a y value aesthetic, and I'm using geom_text with a separate dataframe where the value has been converted into a cumulative sum. The order matters here: I want to plot in the same order in which the cumulative sum is calculated.
Example
library(ggplot2)
library(data.table)
example_df <- data.frame(gender = c('M', 'F', 'F', 'M'), month = c('1', '1', '2', '2'),
value = c(10, 20, 30, 40), name = c('Jack', 'Kate', 'Nassrin', 'Malik'))
setDT(example_df)
text_df <- example_df[, .(value=cumsum(value), name=name), by='month']
ggplot(example_df) + geom_col(aes(x=month, y=value, fill=gender)) +
geom_text(data=text_df, aes(x=month, y=value, label=name), vjust=1)
If you can see here, the left side is exactly what I want. Jack is labeled at 10 over the M color, Kate labeled 20 above that over the F color. The right side though is wrong. Nassrin is labeled at 30, but over the M color that is of height 40. This is because geom_col by default orders by fill, which is converted to a factor in alphabetic order. What I want here is for the left plot to be ordered M, F but the right one F, M. Is this possible? Or is my best solution to reorder my cumulative sum (which would lead to a different plot than I intend).
Set group and fill separately. The order of stacking (i.e. the position) is controlled by group, and when you don't define that it gets set automatically (in this case the definition of fill is used). So:
library(forcats) # provides fct_rev() and fct_inorder()
ggplot(example_df) +
geom_col(aes(x=month, y=value, group = fct_rev(fct_inorder(name)), fill = gender)) +
geom_text(data=text_df, aes(x=month, y=value, label=name), vjust=1)
Note that we can also let ggplot do the cumulative sums for us. Then we can use just the original data.frame, simplifying your plot to:
ggplot(example_df, aes(month, value, group = fct_rev(fct_inorder(name)))) +
geom_col(aes(fill = gender)) +
geom_text(aes(label = name), position = 'stack', vjust = 1)
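To see that the stacking order really follows group rather than fill, the bar tops can be compared with the cumulative sums used for the labels. This is a sketch that assumes ggplot2 is installed; it swaps fct_rev(fct_inorder(name)) for a base-R factor() call so it needs no extra packages, and the column name stack_order is made up for illustration.

```r
library(ggplot2)

example_df <- data.frame(
  gender = c("M", "F", "F", "M"),
  month  = c("1", "1", "2", "2"),
  value  = c(10, 20, 30, 40),
  name   = c("Jack", "Kate", "Nassrin", "Malik")
)

# Base-R stand-in for fct_rev(fct_inorder(name)):
# factor levels in reverse order of first appearance
example_df$stack_order <- factor(
  example_df$name, levels = rev(unique(example_df$name))
)

p_default <- ggplot(example_df) +
  geom_col(aes(x = month, y = value, fill = gender))
p_grouped <- ggplot(example_df) +
  geom_col(aes(x = month, y = value, group = stack_order, fill = gender))

# Top edge of every bar segment after stacking
tops_default <- sort(layer_data(p_default, 1)$ymax)
tops_grouped <- sort(layer_data(p_grouped, 1)$ymax)

# Label heights as computed in the question (cumulative sum within month)
label_heights <- sort(ave(example_df$value, example_df$month, FUN = cumsum))

print(tops_default)
print(tops_grouped)
print(label_heights)
```

With group set, the segment tops should match the label heights; with the default grouping the month-2 bars stack in the other order, which is the mismatch described in the question.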

Boxplot ggplot2: Show mean value and number of observations in grouped boxplot

I wish to add the number of observations to this boxplot, not by group but separated by factor. I would also like to display the number of observations as part of the x-axis label, so that it looks something like this: ("PF (N=12)").
Furthermore, I would like to display the mean value of each box inside of the box, displayed in millions in order not to have a giant number for each box.
Here is what I have got:
give.n <- function(x){
return(c(y = median(x)*1.05, label = length(x)))
}
mean.n <- function(x){x <- x/1000000
return(c(y = median(x)*0.97, label = round(mean(x),2)))
}
ggplot(Soils_noctrl) +
geom_boxplot(aes(x=Slope,y=Events.g_Bacteria, fill = Detergent),
varwidth = TRUE) +
stat_summary(aes(x = Slope, y = Events.g_Bacteria), fun.data = give.n, geom = "text",
fun = median,
position = position_dodge(width = 0.75))+
ggtitle("Cell Abundance")+
stat_summary(aes(x = Slope, y = Events.g_Bacteria),
fun.data = mean.n, geom = "text", fun = mean, colour = "red")+
facet_wrap(~ Location, scale = "free_x")+
scale_y_continuous(name = "Cell Counts per Gram (Millions)",
breaks = round (seq(min(0),
max(100000000), by = 5000000),1),
labels = function(y) y / 1000000)+
xlab("Sample")
And so far it looks like this:
As you can see, the mean value is at the bottom of the plot, and the numbers of observations are in the boxes but not separated.
Thank you for your help! Cheers
TL;DR - you need to supply a group= aesthetic, since ggplot2 does not know on which column data it is supposed to dodge the text geom.
Unfortunately, we don't have your data, but here's an example set that can showcase the rationale here and the function/need for group=.
set.seed(1234)
df1 <- data.frame(detergent=c(rep('EDTA',15),rep('Tween',15)), cells=c(rnorm(15,10,1),rnorm(15,10,3)))
df2 <- data.frame(detergent=c(rep('EDTA',20),rep('Tween',20)), cells=c(rnorm(20,1.3,1),rnorm(20,4,2)))
df3 <- data.frame(detergent=c(rep('EDTA',30),rep('Tween',30)), cells=c(rnorm(30,5,0.8),rnorm(30,3.3,1)))
df1$smp='Sample1'
df2$smp='Sample2'
df3$smp='Sample3'
df <- rbind(df1,df2,df3)
Instead of using stat_summary(), I'm just going to create a separate data frame to hold the mean values I want to include as text on my plot:
library(dplyr) # provides %>%, group_by() and summarize()
summary_df <- df %>% group_by(smp, detergent) %>% summarize(m=mean(cells))
Now, here's the plot and use of geom_text() with dodging:
p <- ggplot(df, aes(x=smp, y=cells)) +
geom_boxplot(aes(fill=detergent))
p + geom_text(data=summary_df,
aes(y=m, label=round(m,2)),
color='blue', position=position_dodge(0.8)
)
You'll notice the numbers are all separated along y= just fine, but the "dodging" is not working. This is because we have not supplied any information on how to do the dodging. In this case, the group= aesthetic can be supplied to tell ggplot2 which column to dodge by:
p + geom_text(data=summary_df,
aes(y=m, label=round(m,2), group=detergent),
color='blue', position=position_dodge(0.8)
)
You don't have to supply the group= aesthetic if you supply another aesthetic such as color= or fill=. In cases where you give both a color= and group= aesthetic, the group= aesthetic will override any of the others for dodging purposes. Here's an example of the same, but where you don't need a group= aesthetic because I've moved color= up into the aes() (changing fill to greyscale so that you can see the text):
p + geom_text(data=summary_df,
aes(y=m, label=round(m,2), color=detergent),
position=position_dodge(0.8)
) + scale_fill_grey()
FUN FACT: Dodging still works even if you supply geom_text() with a nonsensical aesthetic that would normally work for dodging, such as fill=. You get a warning message Ignoring unknown aesthetics: fill, but the dodging still works:
p + geom_text(data=summary_df,
aes(y=m, label=round(m,2), fill=detergent),
position=position_dodge(0.8)
)
# gives you the same plot as if you just supplied group=detergent, but with black text
In your case, changing your stat_summary() line to this should work:
stat_summary(aes(x = Slope, y = Events.g_Bacteria, group = Detergent),...
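The effect of group= on dodging can also be checked numerically by looking at the x positions of the text layer. The sketch below assumes ggplot2 is installed and uses made-up data in the same shape as the example above; aggregate() stands in for the dplyr summary so that no other package is needed.

```r
library(ggplot2)

set.seed(1234)
# Hypothetical data: 3 samples x 2 detergents, 10 values each
df <- data.frame(
  smp       = rep(c("Sample1", "Sample2", "Sample3"), each = 20),
  detergent = rep(rep(c("EDTA", "Tween"), each = 10), times = 3),
  cells     = rnorm(60, mean = 5, sd = 1)
)

# Mean per sample and detergent, renamed to m as in the example above
summary_df <- aggregate(cells ~ smp + detergent, data = df, FUN = mean)
names(summary_df)[names(summary_df) == "cells"] <- "m"

p <- ggplot(df, aes(x = smp, y = cells)) +
  geom_boxplot(aes(fill = detergent))

no_group <- p + geom_text(
  data = summary_df, aes(y = m, label = round(m, 2)),
  position = position_dodge(0.8)
)
grouped <- p + geom_text(
  data = summary_df, aes(y = m, label = round(m, 2), group = detergent),
  position = position_dodge(0.8)
)

# Layer 2 is the text layer; collect its distinct x positions
x_no_group <- unique(as.numeric(layer_data(no_group, 2)$x))
x_grouped  <- unique(as.numeric(layer_data(grouped, 2)$x))

print(x_no_group)
print(x_grouped)
```

Without group= the labels should share the three integer positions of the samples; with group=detergent each sample gets two positions, one per detergent.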

How do I set discrete axis limit on ggplot2 with reordered data

I have a dataset consisting of counts for 700+ categories(discrete data) grouped by sex. I would like to display the top 50 categories ranked by ascending or descending order (I know the code for ranking). The categories can also be eliminated by setting a cut-off for the count (in this case I used 50,000 counts) The problem here is that I cannot set the discrete axis limits based on the categories that have already been reordered by ggplot2.
I have already tried to arrange with dplyr, but it's not letting me arrange by aggregated data from only a particular layer of groups within the dataset.
I have tried coord_cartesian and scale_y_continuous.
Ideally, I would like a code that just allows me to cut off the last 600 of the re-ordered data.
library(ggplot2)
library(scales)
ggplot(df, aes(species, counts)) +
geom_linerange(
aes(x = reorder (species, counts), ymin = 0, ymax = counts, group = sex),
color = "lightgray", size = 1.5,
position = position_dodge(0.3)
)+
geom_point(
aes(colour = sex),
position = position_dodge(0.3), size = 3
)+
theme(axis.text.x= element_text(angle=90))+
scale_y_continuous(limits=c(50000,500000),labels = comma)+
scale_color_manual(values = c("#0080FF", "#FA1212"))
scale_y_continuous only removed the 600+ data points I did not want, but the axis labels and axis size still remained.
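One possible approach, sketched here on made-up data, is to pick the categories to keep before plotting, so that the discrete axis only ever sees those categories. Everything below is an assumption for illustration: the stand-in data frame df, the variable top_n_species, and the decision to rank species by their total count across both sexes.

```r
set.seed(42)
# Hypothetical stand-in for the real data: 20 species x 2 sexes
df <- data.frame(
  species = rep(sprintf("sp%02d", 1:20), each = 2),
  sex     = rep(c("F", "M"), times = 20),
  counts  = round(runif(40, min = 1000, max = 500000))
)

top_n_species <- 5  # would be 50 for the real data

# Total count per species, then the names of the largest ones
totals <- tapply(df$counts, df$species, sum)
keep   <- names(sort(totals, decreasing = TRUE))[seq_len(top_n_species)]

# Keep only those species and fix the axis order (ascending total)
top_df <- df[df$species %in% keep, ]
top_df$species <- factor(top_df$species, levels = rev(keep))

print(levels(top_df$species))
```

top_df can then be passed to the original ggplot() call in place of df, with aes(x = species) instead of reorder(). Because the dropped species are no longer in the data or in the factor levels, they should not leave empty slots or labels on the axis.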

How to specify ggplot2 boxplot fill colour for continuous data?

I want to plot a ggplot2 boxplot using all columns of a data.frame, and I want to reorder the columns by the median for each column, rotate the x-axis labels, and fill each box with the colour corresponding to the same median. I can't figure out how to do the last part. There are plenty of examples where the fill colour corresponds to a factor variable, but I haven't seen a clear example of using a continuous variable to control fill colour. (The reason I'm trying to do this is that the resultant plot will provide context for a force-directed network graph with nodes that will be colour-coded in the same way as the boxplot -- the colour will then provide a mapping between the two plots.) It would be nice if I could re-use the value-to-colour mapping for later plots so that colours are consistent between plots. So, for example, the box corresponding to the column variable with a high median value will have a colour that denotes this mapping and matches perfectly the colour for the same column variable in other plots (such as the corresponding node in a force-directed network graph).
So far, I have something like this:
# Melt the data.frame:
DT.m <- melt(results, id.vars = NULL) # using reshape2
# I can now make a boxplot for every column in the data.frame:
g <- ggplot(DT.m, aes(x = reorder(variable, value, FUN=median), y = value)) +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
stat_summary(fun.y=mean, colour="darkred", geom="point") +
geom_boxplot(???, alpha=0.5)
The colour fill information is what I'm stuck on. "value" is a continuous variable in the range [0,1] and there are 55 columns in my data.frame. Various approaches I've tried seem to result in the boxes being split vertically down the middle, and I haven't got any further. Any ideas?
You can do this by adding the median-by-group to your data frame and then mapping the new median variable to the fill aesthetic. Here's an example with the built-in mtcars data frame. By using this same mapping across different plots, you should get the same colors:
library(ggplot2)
library(dplyr)
ggplot(mtcars %>% group_by(carb) %>%
mutate(medMPG = median(mpg)),
aes(x = reorder(carb, mpg, FUN=median), y = mpg)) +
geom_boxplot(aes(fill=medMPG)) +
stat_summary(fun=mean, colour="darkred", geom="point") + # fun= replaces fun.y=, which was deprecated in ggplot2 3.3.0
scale_fill_gradient(low=hcl(15,100,75), high=hcl(195,100,75))
If you have various data frames with different ranges of medians, you can still use the method above, but to get a consistent mapping of color to median across all your plots, you'll need to also set the same limits for scale_fill_gradient in each plot. In this example, the median of mpg (by carb grouping) varies from 15.0 to 22.8. But let's say across all my data sets, it varies from 13.3 to 39.8. Then I could add this to all my plots:
scale_fill_gradient(limits=c(13.3, 39.8),
low=hcl(15,100,75), high=hcl(195,100,75))
This is just for illustration. For ease of maintenance if your data might change, you'll want to set the actual limits programmatically.
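Setting the limits programmatically could look roughly like this. It is only a sketch: the list datasets, the helper group_medians, and the scaled copy of mtcars are made up to stand in for "all my data sets".

```r
library(ggplot2)

# Hypothetical collection of data sets that should share one colour scale
datasets <- list(
  a = mtcars,
  b = transform(mtcars, mpg = mpg * 1.5)
)

# Median of mpg within each carb group, for one data set
group_medians <- function(d) tapply(d$mpg, d$carb, median)

# Smallest and largest group median across all data sets
fill_limits <- range(unlist(lapply(datasets, group_medians)))

# One scale object, reused in every plot so colours stay comparable
shared_fill <- scale_fill_gradient(
  limits = fill_limits,
  low = hcl(15, 100, 75), high = hcl(195, 100, 75)
)

print(fill_limits)
```

Adding shared_fill to each plot in place of a hand-written scale_fill_gradient() call keeps the limits in one place, so they update automatically when the data change.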
I built on eipi10's solution and obtained the following code which does what I want:
# "results" is a 55-column data.frame containing
# bootstrapped estimates of the Gini impurity for each column variable
# (But can synthesize fake data for testing with a bunch of rnorms)
DT.m <- melt(results, id.vars = NULL) # using reshape2
g <- ggplot(DT.m %>% group_by(variable) %>%
mutate(median.gini = median(value)),
aes(x = reorder(variable, value, FUN=median), y = value)) +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
geom_boxplot(aes(fill=median.gini)) +
stat_summary(fun=mean, colour="darkred", geom="point") + # fun= replaces fun.y=, which was deprecated in ggplot2 3.3.0
scale_fill_gradientn(colours = heat.colors(9)) +
ylab("Gini impurity") +
xlab("Feature") +
guides(fill=guide_colourbar(title="Median\nGini\nimpurity"))
plot(g)
Later, for the second plot:
medians <- lapply(results, median)
color <- colorRampPalette(colors =
heat.colors(9))(1000)[cut(unlist(medians),1000,labels = FALSE)]
color is then a character vector containing the colours of the nodes in my subsequent network graph, and these colours match those in the boxplot. Job done!
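For anyone who wants to try the colour mapping without the original data, here is a self-contained sketch on fake data. The object names palette_1000, bin and node_colours are made up; the logic follows the two lines above.

```r
set.seed(1)
# Fake stand-in for "results": 5 columns of values in [0, 1]
results <- as.data.frame(replicate(5, runif(50)))

# One median per column, as a named numeric vector
medians <- vapply(results, median, numeric(1))

# Interpolate the 9 heat colours into a 1000-step palette
palette_1000 <- colorRampPalette(heat.colors(9))(1000)

# cut() splits the observed range of the medians into 1000 equal-width
# bins and returns the bin index of each median
bin <- cut(medians, breaks = 1000, labels = FALSE)

# One colour per column, named after the column
node_colours <- setNames(palette_1000[bin], names(medians))

print(node_colours)
```

Because cut() works on the observed range of the medians, the smallest median gets the first palette colour and the largest gets the last. That lines up with the boxplot as long as the fill scale there also spans exactly the range of the medians, which is the default when no limits are set.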
