R-Programming - ggplot2 - boxplot issues (varwidth & position_dodge / stat_summary & position_dodge) - r

I am currently using ggplot2 to display some feature distributions with boxplots.
I can produce some simple boxplots, changing color, form, etc. but I cannot achieve the ones that combine several options.
1°)
My goal is to display side-by-side boxplots for men and for women, which can be done with position = position_dodge(width=0.9).
I also want the width of each boxplot to be proportional to the size of the sample, which can be done with varwidth=TRUE.
First problem: when I combine the two options, it does not work and I get the following message:
position_dodge requires non-overlapping x intervals
Boxplot when using varwidth=TRUE and position_dodge together:
I have tried changing the size of the plot, but it did not help. If I drop varwidth=TRUE, the boxplots are dodged correctly.
Is there a way around this, or is this a limitation of ggplot2?
2°)
I also want to display the size of the sample underlying each boxplot.
I can compute it with stat_summary(fun.data = give.n, ...), but I have not found a way to stop the numbers from overlapping when the boxplots sit at similar positions.
I tried using hjust and vjust to move the numbers, but they seem to share the same origin, so that does not help.
Overlapping numbers produced by stat_summary when boxplots are dodged:
As there are no labels in the data, I could not use geom_text, or at least I have not found a way to pass the computed stat to geom_text.
So the second problem is: how can I display each number neatly on its own boxplot?
Here is my code:
library(ggplot2)

# Helper for stat_summary(): draws the label at the group median
# and uses the group size (n) as the label text
give.n <- function(x) {
  return(c(y = median(x), label = length(x)))
}

plot_boxes <- function(mydf, mycolumn1, mycolumn2) {
  # Recover the argument expressions (e.g. "df_client$Client.Class") for the labels
  mylegendx <- deparse(substitute(mycolumn1))
  mylegendy <- deparse(substitute(mycolumn2))
  g2 <- ggplot(mydf, aes(x = as.factor(mycolumn1), y = mycolumn2,
                         color = Gender, fill = Gender)) +
    geom_boxplot(position = position_dodge(width = 0.9), alpha = 0.3) +
    stat_summary(fun.data = give.n, geom = "text", size = 3, vjust = 1) +
    theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
    scale_x_discrete(name = mylegendx) +
    # substring(..., 11) drops the 10-character "df_client$" prefix from the title
    labs(title = paste("Boxplot ", substring(mylegendy, 11), " by ",
                       substring(mylegendx, 11)),
         x = mylegendx, y = mylegendy)
  print(g2)
}

#setwd("~/data")
filename <- "df_stackoverflow.csv"
df_client <- read.csv(file = filename, header = TRUE, sep = ";", dec = ".")
plot_boxes(df_client, df_client$Client.Class, df_client$nbyears_client)
And the data looks like this (small sample from the dataset - 20,000 lines):
Client.Id;Client.Status;Client.Class;Gender;nbyears_client
3;Active;Middle Class;Male;1.38
4;Active;Middle Class;Male;0.9
5;Active;Retiree;Female;0.21
6;Active;Middle Class;Male;0.9
7;Active;Middle Class;Male;3.55
8;Active;Subprime;Male;1.16
9;Active;Middle Class;Male;1.21
10;Active;Part-time;Male;3.38
17;Active;Middle Class;Male;1.83
19;Active;Subprime;Female;5.81
20;Active;Farming;Male;8.99
21;Active;Subprime;Female;6.49
22;Active;Middle Class;Male;1.54
23;Active;Middle Class;Female;2.74
24;Active;Subprime;Male;0.46
25;Active;Executive;Female;0.49
26;Active;Middle Class;Female;3.55
27;Active;Middle Class;Male;3.83
29;Active;Subprime;Female;2.66
30;Active;Middle Class;Male;2.72
31;Active;Middle Class;Female;4.88
32;Active;Subprime;Male;1.46
34;Active;Middle Class;Female;7.16
41;Active;Middle Class;Male;0.65
44;Active;Middle Class;Male;2
45;Active;Subprime;Male;1.13
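For the second problem, one possible sketch (not part of the original post) is to give stat_summary the same position_dodge as the boxplots, so that each count is shifted sideways together with its box. The toy data frame below is made up for illustration and only borrows the column names from the sample above; the helper give_n is a hypothetical variant of give.n that returns a data frame.

```r
library(ggplot2)

# Label helper: y is where the text is drawn, label is the group size
give_n <- function(x) {
  data.frame(y = median(x), label = length(x))
}

# Hypothetical toy data standing in for the client file
set.seed(1)
toy <- data.frame(
  Client.Class   = rep(c("Middle Class", "Subprime"), each = 40),
  Gender         = rep(c("Male", "Female"), times = 40),
  nbyears_client = rexp(80, rate = 0.5)
)

# One shared position object keeps boxes and labels aligned
dodge <- position_dodge(width = 0.9)

p <- ggplot(toy, aes(x = Client.Class, y = nbyears_client,
                     color = Gender, fill = Gender)) +
  geom_boxplot(position = dodge, alpha = 0.3) +
  stat_summary(fun.data = give_n, geom = "text", size = 3,
               vjust = -0.5, position = dodge, show.legend = FALSE)

# x positions and labels of the text layer after dodging
label_layer <- layer_data(p, 2)
label_x <- as.numeric(label_layer$x)
```

Calling print(p) draws the plot; label_x should then hold one distinct x position per class and gender combination instead of one per class.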

Related

How to turn my legend horizontal as opposed to vertical with ggplot2?

I am struggling to understand why legend.direction = "horizontal" is not rotating my legend, so that it is still displayed vertically. Any help would be massively appreciated.
library(phyloseq)
library(ggplot2)
##phylum level
ps_tmp <- get_top_taxa(physeq_obj = ps.phyl, n = 10, relative = TRUE, discard_other = FALSE, other_label = "Other")
ps_tmp <- name_taxa(ps_tmp, label = "Unkown", species = T, other_label = "Other")
phyl <- fantaxtic_bar(ps_tmp, color_by = "phylum", label_by = "phylum",facet_by = "TREATMENT", other_label = "Other", order_alg = "as.is")
phyl + theme(legend.direction = "horizontal", legend.position = "bottom")
Legends for discrete values don't have a formal direction per se and are positioned however ggplot2 decides it can best fit with your data. This is why things like legend.direction won't work here. I don't have the phyloseq package or access to your particular data, so I'll show you how this works and how you can mess with the legend using a reproducible example dataset.
library(ggplot2)
set.seed(8675309)
df <- data.frame(x=LETTERS[1:8], y=sample(1:100, 8))
p <- ggplot(df, aes(x, y, fill=x)) + geom_col()
p
By default, ggplot is putting our legend to the right and organizes it vertically as one column. Here's what happens when we move the legend to the bottom:
p + theme(legend.position="bottom")
Now ggplot thinks it's best to put that legend into 4 columns, 2 rows each. As u/Tech Commodities mentioned, you can use the guides() function to specify how the legend looks. In this case, we will specify 2 columns instead of 4. We only need to supply the number of columns (or rows), and ggplot figures out the rest.
p + theme(legend.position="bottom") +
guides(fill=guide_legend(ncol=2))
So, to get a "horizontally-arranged" legend, you just need to specify that there should be only one row:
p + theme(legend.position="bottom") +
guides(fill=guide_legend(nrow=1))

ggplot: why is the y-scale larger than the actual values for each response?

Likely a dumb question, but I cannot seem to find a solution: I am trying to graph a categorical variable on the x-axis (3 groups) and a continuous variable (a percentage from 0 to 100) on the y-axis. When I do so, I have to set stat = "identity" in geom_bar or use geom_col.
However, the values still show up at 4000 on the y-axis, even after following the comments from Y-scale issue in ggplot and from Why is the value of y bar larger than the actual range of y in stacked bar plot?.
Here is how the graph keeps coming out:
I also double checked that the x variable is a factor and the y variable is numeric. Why would this still be coming out at 4000 instead of 100, like a percentage?
EDIT:
The y-values are simply responses from participants. I have a large dataset (N = 600) and the y-values are percentages from 0 to 100 given by each participant. So, in each group (N = 200 per group), I have one percentage per participant. I wanted to visually compare the three groups based on the percentages they gave.
This is the code I used to plot the graph.
df$group <- as.factor(df$group)
df$confid<- as.numeric(df$confid)
library(ggplot2)
plot <-ggplot(df, aes(group, confid))+
geom_col()+
ylab("confid %") +
xlab("group")
Are you perhaps trying to plot the mean percentage in each group? Otherwise, it is not clear how a bar plot could easily represent what you are looking for. You could perhaps add error bars to give an idea of the spread of responses.
Suppose your data looks like this:
set.seed(4)
df <- data.frame(group = factor(rep(1:3, each = 200)),
confid = sample(40, 600, TRUE))
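Using your plotting code, we get very similar results to yours. geom_col() stacks one segment per row, so each bar's height is the sum of all 200 responses in that group rather than a single percentage: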
library(ggplot2)
plot <-ggplot(df, aes(group, confid))+
geom_col()+
ylab("confid %") +
xlab("group")
plot
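The bar heights in that plot can be checked without drawing anything. The sketch below reuses the simulated df from above and compares the per-group sum (what the stacked columns add up to) with the per-group mean (what was probably intended); the names bar_heights and group_means are just illustrative.

```r
set.seed(4)
df <- data.frame(group = factor(rep(1:3, each = 200)),
                 confid = sample(40, 600, TRUE))

# Height geom_col() ends up drawing: every row is stacked,
# so the bar reaches the sum of the group's responses
bar_heights <- tapply(df$confid, df$group, sum)

# Height that was probably intended: the group mean
group_means <- tapply(df$confid, df$group, mean)

print(bar_heights)
print(group_means)
```

With 200 responses per group, each stacked bar is 200 times the group mean, which is how responses capped at 100 can produce a y-axis in the thousands.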
However, if we use stat_summary, we can instead plot the mean and standard error for each group:
ggplot(df, aes(group, confid)) +
stat_summary(geom = "bar", fun = mean, width = 0.6,
fill = "deepskyblue", color = "gray50") +
geom_errorbar(stat = "summary", width = 0.5) +
geom_point(stat = "summary") +
ylab("confid %") +
xlab("group")

ggplot2 geom_points won't colour or dodge

So I'm using ggplot2 to plot both a bar graph and points. I'm currently getting this:
As you can see, the bars are nicely separated and colored in the desired colors. However, my points are all uncolored and stacked on top of each other. I would like each point to sit above its designated bar and have the same color.
#Add bars
A <- A + geom_col(aes(y = w1, fill = factor(Species1)),
position = position_dodge(preserve = 'single'))
#Add colors
A <- A + scale_fill_manual(values = c("A. pelagicus"= "skyblue1","A. superciliosus"="dodgerblue","A. vulpinus"="midnightblue","Alopias sp."="black"))
#Add points
A <- A + geom_point(aes(y = f1/2.5),
shape= 24,
size = 3,
fill = factor(Species1),
position = position_dodge(preserve = 'single'))
#change x and y axis range
A <- A + scale_x_continuous(breaks = c(2000:2020), limits = c(2016,2019))
A <- A + expand_limits(y=c(0,150))
# now adding the secondary axis, following the example in the help file ?scale_y_continuous
# and, very important, reverting the above transformation
A <- A + scale_y_continuous(sec.axis = sec_axis(~.*2.5, name = " "))
# modifying axis and title
A <- A + labs(y = " ",
x = " ")
A <- A + theme(plot.title = element_text(size = rel(4)))
A <- A + theme(axis.text.x = element_text(face="bold", size=14, angle=45),
axis.text.y = element_text(face="bold", size=14))
#A <- A + theme(legend.title = element_blank(),legend.position = "none")
#Print plot
A
When I run this code I get the following error:
Error: Unknown colour name: A. pelagicus
In addition: Warning messages:
1: Width not defined. Set with position_dodge(width = ?)
2: In max(table(panel$xmin)) : no non-missing arguments to max; returning -Inf
I've tried a couple of things, but I can't figure out why it works for geom_col and not for geom_point.
Thanks in advance
You have two basic problems: the color error and the points not dodging. The color error comes from setting fill = factor(Species1) outside aes() in geom_point, so ggplot2 treats the species names as literal color names instead of mapping them through scale_fill_manual; the dodging is solved by applying the group= aesthetic.
You'll see the answer to both questions using an example:
# dummy dataset
year <- c(rep(2017, 4), rep(2018, 4))
species <- rep(c('things', 'things1', 'wee beasties', 'ew'), 2)
values <- c(10, 5, 5, 4, 60, 10, 25, 7)
pt.value <- c(8, 7, 10, 2, 43, 12, 20, 10)
df <-data.frame(year, species, values, pt.value)
I made the "values" set for my column heights and I wanted to use a different y aesthetic for points for illustrative purposes, called "pt.value". Otherwise, the data setup is similar to your own. Note that df$year will be set as numeric, so it's best to change that into either Date format (kinda more trouble than it's worth here), or just as a factor, since "2017.5" isn't gonna make too much sense here :). The point is, I need "year" to be discrete, not continuous.
Solve the color error
For the plot, I'll try to create it similar to yours. Note that the values= argument of scale_fill_manual takes the colors keyed by name (name1=color1, name2=color2, ...); a named character vector built with c(), as in your code, is the usual form. The "Unknown colour name" error does not come from the scale but from the fill = factor(Species1) set outside aes() in geom_point, so that argument is left out here.
ggplot(df, aes(x=as.factor(year), y=values)) +
geom_col(aes(fill=species), position=position_dodge(width=0.62), width=0.6) +
scale_fill_manual(values=
c('ew' = 'skyblue1', 'things' = 'dodgerblue',
'things1'='midnightblue', 'wee beasties' = 'gray')) +
geom_point(aes(y=pt.value), shape=24, position=position_dodge(width=0.62)) +
theme_bw() + labs(x='Year')
So the colors are applied correctly and my axis is discrete, and the y values of the points are mapped to pt.value like I wanted, but why don't the points dodge?!
Solve the dodging issue
Dodging is a funny thing in ggplot2. The best reasoning here I can give you is that for columns and barplots, dodging is sort of "built-in" to the geom, since the default position is "stack" and "dodge" represents an alternative method to draw the geom. For points, text, labels, and others, the default position is "identity" and you have to be more explicit in how they are going to dodge or they just don't dodge at all.
Basically, we need to let the points know what they are dodging based on. Is it "species"? With geom_col, the fill=species mapping already defines the groups, but geom_point has no such mapping, so you need to specify it. We do that by using a group= aesthetic, which lets geom_point know what to use as the criterion for dodging. When you add that, it works!
ggplot(df, aes(x=as.factor(year), y=values, group=species)) +
geom_col(aes(fill=species), position=position_dodge(width=0.62), width=0.6) +
scale_fill_manual(values=
c('ew' = 'skyblue1', 'things' = 'dodgerblue',
'things1'='midnightblue', 'wee beasties' = 'gray')) +
# fill=species inside aes() colors each triangle like its bar
geom_point(aes(y=pt.value, fill=species), shape=24, position=position_dodge(width=0.62)) +
theme_bw() + labs(x='Year')
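As a cross-check, the sketch below rebuilds the dummy dataset and inspects the point layer with layer_data() instead of looking at the picture. It assumes fill = species is mapped inside aes() for the points; a discrete mapping like that also defines the groups, so the points are both colored and dodged. The names dodge, point_x and point_fill are illustrative only.

```r
library(ggplot2)

# Same dummy data as in the answer above
df <- data.frame(
  year     = rep(c(2017, 2018), each = 4),
  species  = rep(c("things", "things1", "wee beasties", "ew"), 2),
  values   = c(10, 5, 5, 4, 60, 10, 25, 7),
  pt.value = c(8, 7, 10, 2, 43, 12, 20, 10)
)

# Shared dodge so bars and points land on the same x positions
dodge <- position_dodge(width = 0.62)

p <- ggplot(df, aes(x = as.factor(year), y = values)) +
  geom_col(aes(fill = species), position = dodge, width = 0.6) +
  geom_point(aes(y = pt.value, fill = species), shape = 24, position = dodge) +
  scale_fill_manual(values = c("ew" = "skyblue1", "things" = "dodgerblue",
                               "things1" = "midnightblue",
                               "wee beasties" = "gray")) +
  theme_bw() + labs(x = "Year")

# Layer 2 is the point layer: one x position and one fill per point
point_layer <- layer_data(p, 2)
point_x     <- as.numeric(point_layer$x)
point_fill  <- point_layer$fill
```

If the points were not dodged, point_x would contain only two distinct values (one per year) rather than one per bar.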

changing ggplot legend unit scale

This question is motivated by a previous post illustrating various ways to change how axis scales are plotted in a ggplot figure, from the default exponential notation to the full integer value (when one's axis values are very large). While I am able to convert the axis scales from exponential notation to full values, I am unclear how one would achieve the same goal for the values appearing in the legend.
While I understand that one can manually change the length of the legend scale with "scale_color..." or "scale_fill..." followed by the "limits" argument, this does not appear to be a solution to getting my legend values to show "6000000000" rather than "6e+09" (or "0" rather than "0e+00" for that matter).
The following example should suffice. My hope is someone can point out how to implement the 'scales' package to apply for legend scales rather than axes scales.
Thanks very much.
library(ggplot2)
library(scales)
Data <- data.frame(
pi = c(2,71,828,1828,45904,523536,2874713,52662497,757247093,6999595749),
e = c(3,14,159,2653,58979,311599,7963468,54418516,1590576171, 99),
face = 1:10)
p <- ggplot(data = Data, aes(x=face, y=e, colour = pi))
myplot <- p + geom_point() +
scale_y_continuous(labels = comma) +
scale_color_gradientn(colours = rainbow(2), limits=c(0,7000000000))
myplot
Use the comma formatter from the scales package in scale_color_gradientn by setting labels = comma, e.g.:
p <- ggplot(data = Data, aes(x=face, y=e, colour = pi))
myplot <- p + geom_point() +
scale_y_continuous(labels = comma) +
scale_color_gradientn(colours = rainbow(2), limits=c(0,7000000000), labels = comma)
myplot
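To see what the formatter does to the legend breaks, it can be called directly on a few numbers. This is a small sketch using scales::comma and scales::label_comma; the variable names are illustrative.

```r
library(scales)

# comma() turns numbers into strings with thousands separators
big_label  <- comma(6000000000)
zero_label <- comma(0)

# label_comma() builds an equivalent formatter function,
# which can also be passed to the labels argument of a scale
fmt <- label_comma()
legend_labels <- fmt(c(0, 6e9))

print(big_label)
print(zero_label)
print(legend_labels)
```

Any function that maps a numeric vector of breaks to a character vector can be supplied to labels in the same way.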

Overlay raw data onto geom_bar

I have a data-frame arranged as follows:
condition,treatment,value
A , one , 2
A , one , 1
A , two , 4
A , two , 2
...
D , two , 3
I have used ggplot2 to make a grouped bar plot that looks like this:
The bars are grouped by "condition" and the colours indicate "treatment." The bar heights are the mean of the values for each condition/treatment pair. I achieved this by creating a new data frame containing the mean and standard error (for the error bars) for all the points that will make up each group.
What I would like to do is superimpose the raw jittered data to produce a bar-chart version of this box plot: http://docs.ggplot2.org/0.9.3.1/geom_boxplot-6.png [I realise that a box plot would probably be better, but my hands are tied because the client is pathologically attached to bar charts]
I have tried adding a geom_point object to my plot and feeding it the raw data (rather than the aggregated means which were used to make the bars). This sort of works, but it plots the raw values at the wrong x axis locations. They appear at the points at which the red and grey bars join, rather than at the centres of the appropriate bar. So my plot looks like this:
I can not figure out how to shift the points by a fixed amount and then jitter them in order to get them centered over the correct bar. Anyone know? Is there, perhaps, a better way of achieving what I'm trying to do?
What follows is a minimal example that shows the problem I have:
#Make some fake data
ex=data.frame(cond=rep(c('a','b','c','d'),each=8),
treat=rep(rep(c('one','two'),4),each=4),
value=rnorm(32) + rep(c(3,1,4,2),each=4) )
#Calculate the mean and SD of each condition/treatment pair
agg=aggregate(value~cond*treat, data=ex, FUN="mean") #mean
agg$sd=aggregate(value~cond*treat, data=ex, FUN="sd")$value #add the SD
dodge <- position_dodge(width=0.9)
limits <- aes(ymax=value+sd, ymin=value-sd) #Set up the error bars
p <- ggplot(agg, aes(fill=treat, y=value, x=cond))
#Plot, attempting to overlay the raw data
print(
p + geom_bar(position=dodge, stat="identity") +
geom_errorbar(limits, position=dodge, width=0.25) +
geom_point(data= ex[ex$treat=='one',], colour="green", size=3) +
geom_point(data= ex[ex$treat=='two',], colour="pink", size=3)
)
I found it is unnecessary to create separate dataframes. The plot can be created by providing ggplot with the raw data.
ex <- data.frame(cond=rep(c('a','b','c','d'),each=8),
treat=rep(rep(c('one','two'),4),each=4),
value=rnorm(32) + rep(c(3,1,4,2),each=4) )
p <- ggplot(ex, aes(cond,value,fill = treat))
p + geom_bar(position = 'dodge', stat = 'summary', fun = mean) +
geom_errorbar(stat = 'summary', position = 'dodge', width = 0.9) +
geom_point(aes(x = cond), shape = 21, position = position_dodge(width = 1))
You need just one call to geom_point(), where you use data frame ex and set the x values to cond, the y values to value, and color=treat (inside aes()). Then add position=dodge to ensure that the points are dodged. With scale_color_manual() and its values= argument you can set the colors you need.
p+geom_bar(position=dodge, stat="identity") +
geom_errorbar(limits, position=dodge, width=0.25)+
geom_point(data=ex,aes(cond,value,color=treat),position=dodge)+
scale_color_manual(values=c("green","pink"))
UPDATE - jittering of points
You can't directly combine the dodge and jitter positions, but there are some workarounds. If you save the whole plot as an object, ggplot_build() shows the x positions of the bars; in this case they are 0.775, 1.225, 1.775... Those positions correspond to the combinations of the factors cond and treat. As data frame ex has 4 values for each combination, add a new column that contains those x positions, each repeated 4 times.
ex$xcord<-rep(c(0.775,1.225,1.775,2.225,2.775,3.225,3.775,4.225),each=4)
Now in geom_point() use this new column as x values and set position to jitter.
p+geom_bar(position=dodge, stat="identity") +
geom_errorbar(limits, position=dodge, width=0.25)+
geom_point(data=ex,aes(xcord,value,color=treat),position=position_jitter(width =.15))+
scale_color_manual(values=c("green","pink"))
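The bar positions do not have to be typed by hand. The sketch below reads them from the built bar layer with layer_data(), which is a convenience wrapper around ggplot_build(). It assumes, as in the example data, that ex is ordered by cond and then treat with 4 rows per combination; the seed and the names bars, bar_pos are illustrative.

```r
library(ggplot2)

set.seed(42)  # assumed seed, only so the sketch is reproducible
ex <- data.frame(cond  = rep(c("a", "b", "c", "d"), each = 8),
                 treat = rep(rep(c("one", "two"), 4), each = 4),
                 value = rnorm(32) + rep(c(3, 1, 4, 2), each = 4))
agg <- aggregate(value ~ cond * treat, data = ex, FUN = mean)

# Only the dodged bars are needed to recover their x centres
bars <- ggplot(agg, aes(x = cond, y = value, fill = treat)) +
  geom_bar(position = position_dodge(width = 0.9), stat = "identity")

# One row per bar; x is the bar centre after dodging
bar_pos <- sort(as.numeric(layer_data(bars, 1)$x))

# Repeat each centre for the 4 raw values that belong to that bar
ex$xcord <- rep(bar_pos, each = 4)
```

If the rows of ex were in a different order, the positions would need to be matched by cond and treat rather than repeated in sequence.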
As illustrated by holmrenser above, referencing a single dataframe and updating the stat instruction to "summary" in the geom_bar function is more efficient than creating additional dataframes and retaining the stat instruction as "identity" in the code.
To both jitter and dodge the data points with the bar charts per the OP's original question, this can also be accomplished by updating the position instruction in the code with position_jitterdodge. This positioning scheme allows widths for jitter and dodge terms to be customized independently, as follows:
p <- ggplot(ex, aes(cond,value,fill = treat))
p + geom_bar(position = 'dodge', stat = 'summary', fun = mean) +
geom_errorbar(stat = 'summary', position = 'dodge', width = 0.9) +
geom_point(aes(x = cond), shape = 21, position =
position_jitterdodge(jitter.width = 0.5, jitter.height=0.4,
dodge.width=0.9))
