ggplot2 - box plot questions on misalignment - r

I am having the following challenges making a plot:
I am trying to make a 'grouped' box plot, but the boxes are not drawn near their corresponding x-axis group, so it's not easy to see which group each box belongs to.
I am trying to add a marker for the mean value, which right now is a triangle. However, the triangles aren't moving with the grouped box plots.
I don't want the triangle marker for the mean value to show up in the legend, but I can't figure out how to remove it.
Whenever I add text, I want just one value, either the median or the mean, not a label repeated 50 times.
Boxplot
library(ggplot2)
library(ggthemes)      # theme_base()
library(RColorBrewer)
library(reshape2)      # provides the tips dataset
ggplot(tips, aes(x = day, y = total_bill, fill = sex)) +  # grouping factor, y variable
  geom_boxplot(position = position_dodge(width = 1.2)) +  # how to color
  labs(title = 'Barchart Plot', x = ' xaxis label', y = 'ylabel') +
  scale_fill_brewer(palette = "Dark2") +  # use Dark2, Paired, Set1
  theme(axis.text.x = element_text(colour = "black", size = 14, angle = 45,
                                   hjust = .5, vjust = .5, face = "bold"),
        axis.text.y = element_text(colour = "grey20", size = 16, angle = 45,
                                   hjust = 1, vjust = 0, face = "plain"),
        axis.title.x = element_text(colour = "grey20", size = 12, angle = 45,
                                    hjust = .5, vjust = 0, face = "plain"),
        axis.title.y = element_text(colour = "blue", size = 16, angle = 90,
                                    hjust = 0.5, vjust = .5, face = 'bold')) +
  stat_summary(fun.y = mean, geom = "point", shape = 17, size = 4) +
  theme_base() +
  geom_text(label = 'just the mean or median please - number only')
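One possible way to approach the four points above is sketched here; it is not a definitive fix. It assumes ggplot2 3.3.0 or later (for `fun` and `after_stat()`) and that `tips` comes from reshape2. The shared `dodge` object and the 0.75 width are choices made for this sketch: discrete x positions are one unit apart, so a dodge width above 1 pushes boxes toward the neighboring group. Using the same dodge for every layer keeps the triangles and labels on their boxes, `show.legend = FALSE` keeps the triangle out of the legend, and a `stat_summary()` text layer prints one number per box.

```r
library(ggplot2)
data(tips, package = "reshape2")

# One dodge object shared by all layers so boxes, triangles and labels line up
dodge <- position_dodge(width = 0.75)

p <- ggplot(tips, aes(x = day, y = total_bill, fill = sex)) +
  geom_boxplot(position = dodge) +
  # Mean marker: dodged with the boxes and hidden from the legend
  stat_summary(fun = mean, geom = "point", shape = 17, size = 4,
               position = dodge, show.legend = FALSE) +
  # One text label per box: the mean, rounded to one decimal place
  stat_summary(fun = mean, geom = "text",
               aes(label = round(after_stat(y), 1)),
               position = dodge, vjust = -1, show.legend = FALSE) +
  scale_fill_brewer(palette = "Dark2")

print(p)
```

Swapping `fun = mean` for `fun = median` in the text layer would label the median instead.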

Related

Is it possible to add a 'break' to an axis that plots a categorical factor?

I am using ggplot2 to plot a mixed-design dataset in a violin plot.
The data was collected over three sessions: Baseline (collected on Day 1), Post-training (collected on Day 3) and Follow-up (collected on Day 30) and two groups: (1) Active and (2) Sham. For the sessions I have a categorical factor called 'Session' with the labels: Baseline, Post-training and Follow-up which are plotted on the x-axis. (Please ignore the rough state of the draft plot and dummy data for demonstration purposes).
library(dplyr)    # provides %>%
library(ggplot2)  # mean_cl_boot also requires the Hmisc package to be installed
level_order <- factor(tidied_data$Session,
                      levels = c('Baseline (Day 1)', 'Post-training (Day 3)', 'Follow-up (Day 30)'))
tidied_data %>%
  ggplot(aes(x = level_order, y = Amplitude, fill = Group)) +
  geom_violin(position = position_dodge(1), trim = FALSE) +
  # binaxis and stackdir are geom_dotplot arguments, so they are dropped here;
  # position_jitterdodge() jitters the points within each dodged group
  geom_point(position = position_jitterdodge(dodge.width = 1)) +
  stat_summary(fun = "mean", geom = "point",
               size = 3, position = position_dodge(1), color = "white") +
  stat_summary(fun.data = mean_cl_boot, geom = "errorbar", width = 0.3,
               position = position_dodge(1), color = "white") +
  theme_bw() +  # removes background colour
  theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank()) +  # removes grid lines
  theme(panel.border = element_blank()) +  # removes border lines
  theme(axis.line = element_line(colour = "black")) +  # adds axis lines
  labs(title = "Group x Session",
       x = "Session",
       y = "Amplitude")
I want to demonstrate to the viewer that there is a different time-course between Baseline (Day 1), Post-training (Day 3) and follow-up (Day 30), it's a 30-day scale essentially.
From previous threads I've seen that this isn't something that ggplot2 will handle well, since broken axes are generally considered questionable.
I've come across the package 'ggbreak', where you can use the function 'scale_x_break' or 'scale_y_break' to set an axis break on a continuous variable. This doesn't work for the three time-points, presumably because the session is a categorical factor.
Can anyone recommend a way that I can 'break' the axis to demonstrate the different length of time between the three sessions, or alternatively another way I could demonstrate this to the viewer? I've thought about adding custom spacing between bars, but I can only manage to set this to the same width for each bar, not different widths between different bars.
Any help would be greatly appreciated! Thanks in advance!
I can't recommend using a discontinuous axis for this, but you can use facets to visually separate the categories into small multiples. Example with a standard dataset below:
library(ggplot2)
ggplot(mtcars, aes(factor(cyl), mpg, fill = factor(am))) +
  geom_violin() +
  facet_grid(~ cyl, scales = "free_x") +
  theme_classic() +
  theme(strip.text = element_blank())  # Hide strip text
Created on 2021-08-20 by the reprex package (v1.0.0)
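Another option, separate from the facet approach, is to put the sessions on a continuous day axis so the uneven spacing (Day 1, 3 and 30) is shown directly, with no axis break. The sketch below is only an illustration with made-up data: the `dummy` data frame, its `Day` column, and the violin width of 1.6 are all assumptions, not part of the original dataset.

```r
library(ggplot2)

set.seed(1)
# Hypothetical stand-in for tidied_data, with the session stored as a numeric day
dummy <- data.frame(
  Day       = rep(c(1, 3, 30), each = 40),
  Group     = rep(c("Active", "Sham"), times = 60),
  Amplitude = rnorm(120)
)

p <- ggplot(dummy, aes(x = Day, y = Amplitude, fill = Group,
                       group = interaction(Day, Group))) +
  # On a continuous axis the width is in days; 1.6 keeps Day 1 and Day 3 apart
  geom_violin(width = 1.6, position = position_dodge(width = 1.6), trim = FALSE) +
  # Label the three sessions at their true positions on the 30-day scale
  scale_x_continuous(breaks = c(1, 3, 30),
                     labels = c("Baseline\n(Day 1)", "Post-training\n(Day 3)",
                                "Follow-up\n(Day 30)")) +
  theme_classic()

print(p)
```

The trade-off is that the first two sessions sit close together, so this may suit plots where the time course matters more than the detail of each violin.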

How can you put tick marks on the very edge of the graph border? (ggplot)

I want my tick marks/gridlines to be on the very edge of the plot border.
What I want:
Drawn graph with leftmost and rightmost tick marks on the borders
What it looks like:
GGPlot graph with leftmost and rightmost tick marks NOT on the borders
Code used to create the above:
data <- data.frame(
year = seq(1998,2018),
soc = runif(21,1,90)
)
ggplot(data, aes(x = year, y=soc)) +
geom_line(data=data) +
xlim(c(1995,2020)) +
theme_bw() +
theme(panel.grid.major = element_line(colour="gray", size=0.5))
This question doesn't correctly address the problem - I am not looking to fit the axis limits to the minimum and maximum of my data points, but rather to make sure that the tick marks land perfectly on the border of the graph.
You can remove the extra plot area by setting expand=FALSE in coord_cartesian.
I then modified the plot margin so that the rightmost number doesn't get cut off.
ggplot(data, aes(x = year, y = soc)) +
  geom_line() +
  scale_x_continuous(limits = c(1995, 2020)) +
  theme_bw() +
  theme(panel.grid.major = element_line(colour = "gray", size = 0.5)) +
  coord_cartesian(expand = FALSE) +
  theme(plot.margin = margin(6, 10, 6, 6))
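Note that `coord_cartesian(expand = FALSE)` removes the padding on both axes. If only the x axis should sit flush with the border, one alternative is to set `expand` on the x scale alone. This is a sketch under that assumption; the seed is added only to make the random data reproducible.

```r
library(ggplot2)

set.seed(42)
data <- data.frame(
  year = seq(1998, 2018),
  soc  = runif(21, 1, 90)
)

p <- ggplot(data, aes(x = year, y = soc)) +
  geom_line() +
  # expand = c(0, 0) drops the padding on x only; y keeps its default padding
  scale_x_continuous(limits = c(1995, 2020), expand = c(0, 0)) +
  theme_bw() +
  # extra right margin so the rightmost tick label is not cut off
  theme(plot.margin = margin(6, 10, 6, 6))

print(p)
```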

Adding different secondary x axis for each facet in ggplot2

I would like to add a different secondary axis to each facet. Here is my working example:
library(ggplot2)
library(data.table)
#Create the data:
data<-data.table(cohort=sample(c(1946,1947,1948),10000,replace=TRUE),
works=sample(c(0,1),10000,replace=TRUE),
year=sample(seq(2006,2013),10000,replace=TRUE))
data[,age_cohort:=year-cohort]
data[,prop_works:=mean(works),by=c("cohort","year")]
#Prepare data for plotting:
data_to_plot<-unique(data,by=c("cohort","year"))
#Plot what I want:
ggplot(data_to_plot,aes(x=age_cohort,y=prop_works))+geom_point()+geom_line()+
facet_wrap(~ cohort)
The plot shows how many people of a particular cohort work at a given age. I would like to add a secondary x axis showing which year corresponds to a particular age for different cohorts.
Since you have the actual values you want to use in your dataset, one workaround is to plot them as an additional geom_text layer:
ggplot(data_to_plot,
aes(x = age_cohort, y = prop_works, label = year))+
geom_point() +
geom_line() +
geom_text(aes(y = min(prop_works)),
hjust = 1.5, angle = 90) + # rotate to save space
expand_limits(y = 0.44) +
scale_x_continuous(breaks = seq(58, 70, 1)) + # ensure x-axis breaks are at whole numbers
scale_y_continuous(labels = scales::percent) +
facet_wrap(~ cohort, scales = "free_x") + # show only relevant age cohorts in each facet
theme(panel.grid.minor.x = element_blank()) # hide minor grid lines for cleaner look
You can adjust the hjust value in geom_text() and y value in expand_limits() for a reasonable look, depending on your desired output's dimensions.
(More data wrangling would be required if there are missing years in the data, but I assume that isn't the case here.)
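Because age_cohort was defined as year - cohort, the year label can also be computed inside the plot as cohort + age_cohort, without relying on the year column. The sketch below assumes that relationship holds for every row. The data frame `d`, its random `prop_works` values, and the fixed label position at y = 0.44 are made up for illustration.

```r
library(ggplot2)

set.seed(1)
# Hypothetical summary table: one row per cohort and year
d <- expand.grid(cohort = c(1946, 1947, 1948), year = 2006:2013)
d$age_cohort <- d$year - d$cohort
d$prop_works <- runif(nrow(d), 0.45, 0.55)

p <- ggplot(d, aes(x = age_cohort, y = prop_works)) +
  geom_point() +
  geom_line() +
  # Year recovered from cohort + age, drawn just above the x axis
  geom_text(aes(y = 0.44, label = cohort + age_cohort),
            angle = 90, hjust = 1) +
  expand_limits(y = 0.42) +
  facet_wrap(~ cohort, scales = "free_x")

print(p)
```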

How can I add the legend for a line made with stat_summary in ggplot2?

Say I am working with the following (fake) data:
var1 <- runif(20, 0, 30)
var2 <- runif(20, 0, 40)
year <- c(1900:1919)
data_gg <- cbind.data.frame(var1, var2, year)
I melt the data for ggplot:
data_melt <- melt(data_gg, id.vars='year')
and I make a grouped barplot for var1 and var2:
plot1 <- ggplot(data_melt, aes(as.factor(year), value)) +
geom_bar(aes(fill = variable), position = "dodge", stat="identity")+
xlab('Year')+
ylab('Density')+
theme_light()+
theme(panel.grid.major.x=element_blank())+
scale_fill_manual(values=c('goldenrod2', 'firebrick2'), labels=c("Var1",
"Var2"))+
theme(axis.title = element_text(size=15),
axis.text = element_text(size=12),
legend.title = element_text(size=13),
legend.text = element_text(size=12))+
theme(legend.title=element_blank())
Finally, I want to add a line showing the cumulative sum (Var1 + Var2) for each year. I manage to make it using stat_summary, but it does not show up in the legend.
plot1 + stat_summary(fun.y = sum, aes(as.factor(year), value, colour="sum"),
group=1, color='steelblue', geom = 'line', size=1.5)+
scale_colour_manual(values=c("sum"="blue"))+
labs(colour="")
How can I make it so that it appears in the legend?
To be precise (and without being a ggplot2 expert): the thing you need to change in your code is to remove the color argument that sits outside aes() in the stat_summary call.
stat_summary(fun.y = sum, aes(as.factor(year), value, col="sum"), group=1, geom = 'line', size=1.5)
(In ggplot2 3.3.0 and later, fun.y is deprecated in favor of fun.)
Apparently, a color set as a fixed argument outside aes() overrides the aesthetic mapping, so ggplot2 has no mapping left to show in the legend.
As far as the group argument is concerned, it is used to connect the points when drawing the line. You can read the details here: ggplot2 line chart gives "geom_path: Each group consist of only one observation. Do you need to adjust the group aesthetic?"
But it is not necessary to add it inside the aes call. In fact, if you leave it outside, the chart will not change.
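Putting this together, here is a minimal self-contained sketch of the idea, assuming ggplot2 3.3.0 or later (hence `fun` rather than `fun.y`). The long-format data frame is built by hand here instead of with melt(), and the seed is added only for reproducibility. The colour is mapped inside aes() and its actual value is supplied through scale_colour_manual(), so the line gets a legend entry.

```r
library(ggplot2)

set.seed(1)
data_gg <- data.frame(
  year = 1900:1919,
  var1 = runif(20, 0, 30),
  var2 = runif(20, 0, 40)
)

# Long format built by hand (equivalent in shape to melting on 'year')
data_melt <- data.frame(
  year     = rep(data_gg$year, 2),
  variable = rep(c("var1", "var2"), each = 20),
  value    = c(data_gg$var1, data_gg$var2)
)

p <- ggplot(data_melt, aes(as.factor(year), value)) +
  geom_col(aes(fill = variable), position = "dodge") +
  # colour is mapped (not set), so it appears in the legend
  stat_summary(aes(colour = "sum", group = 1), fun = sum, geom = "line") +
  scale_colour_manual(values = c(sum = "steelblue")) +
  labs(x = "Year", y = "Density", fill = NULL, colour = NULL)

print(p)
```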

How to calculate the percentages per bar in stacked bar plot?

I have produced a stacked bar plot using ggplot2 with the code below:
library(ggplot2)
g <- ggplot(data=df.m, aes(x=Type_Checklist,fill=Status)) +
geom_bar(stat="bin", position="stack", size=.3) +
scale_fill_hue(name="Status") +
xlab("Checklists") + ylab("Quantity") +
ggtitle("Status of Assessment Checklists") +
theme_bw() +
theme(axis.text.x = element_text(angle=90, vjust=0.5, size=10))+
stat_bin(geom = "text",aes(label = paste(round((..count..)/sum(..count..)*100), "%")),vjust = 1)
print(g)
The code manages to show the percentage as labels and maintained the actual quantities on the y axis.
However, I wanted the percentages to be calculated per bar (shown on the x axis), not for the entire set of checks (bars in the plot).
I managed to do exactly that with the following code:
# count how many checklists per status
qty.checklist.by.status.this.week = df.m %>% count(Type_Checklist,Status)
# Add percentage
qty.checklist.by.status.this.week$percentage <- (qty.checklist.by.status.this.week$n/ nrNAs * 100)
#add column position to calculate where to place the percentage in the plot
qty.checklist.by.status.this.week <- qty.checklist.by.status.this.week %>% group_by(Type_Checklist) %>% mutate(pos = (cumsum(n) - 0.5 * n))
g <- ggplot(data=qty.checklist.by.status.this.week, aes(x=Type_Checklist,y=n,fill=Status)) +
geom_bar(stat="identity",position="stack", size=.3) +
scale_fill_hue(name="Status") +
xlab("Checklists") + ylab("Quantity") +
ggtitle("Status of Assessment Checklists") +
theme_bw() +
theme(axis.text.x = element_text(angle=90, vjust=0.5, size=10))+
geom_text(aes(y = pos,label = paste(round(percentage), "%")),size=5)
print(g)
But that solution seems quite manual, since I need to calculate the position of each label, different from the first plot that positions the labels automatically using stat_bin.
Existing answers have been very useful, such as Showing percentage in bars in ggplot
and Show % instead of counts in charts of categorical variables
and How to draw stacked bars in ggplot2 that show percentages based on group?
and R stacked percentage bar plot with percentage of binary factor and labels (with ggplot)
but they don't address exactly this situation.
To reiterate, how could I use the first solution, but calculate the percentages per bar, not for the entire set of bars in the plot?
Could anyone give me some help please? Thanks!
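One way this might be done without precomputing label positions is sketched below; it is an assumption-laden sketch and not a tested answer for the original data. It assumes ggplot2 3.3.0 or later for `after_stat()`. The small `df.m` here is made up. The counts come from `stat = "count"`, the per-bar total is obtained with `ave()` grouped by the computed `x`, and `position_stack(vjust = 0.5)` centers each label in its segment.

```r
library(ggplot2)

# Hypothetical stand-in for df.m
df.m <- data.frame(
  Type_Checklist = c("A", "A", "A", "B", "B", "B", "B", "C"),
  Status         = c("Done", "Done", "Open", "Done", "Open", "Open", "Open", "Done")
)

p <- ggplot(df.m, aes(x = Type_Checklist, fill = Status)) +
  geom_bar(position = "stack") +
  geom_text(
    stat = "count",
    # share of each segment within its own bar: count / total count at that x
    aes(label = paste0(
      round(after_stat(count / ave(count, x, FUN = sum)) * 100), "%"
    )),
    position = position_stack(vjust = 0.5)
  ) +
  xlab("Checklists") + ylab("Quantity") +
  theme_bw()

print(p)
```

The y axis still shows the raw quantities, since only the label text is converted to a percentage.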
