ggplot Error: similar data graphs, why not anymore? - r

I am graphing some data with ggplot. However, I don't understand the error I'm getting with slightly different data than data that I can graph successfully. For example, this data graphs successfully:
to_graph <- structure(list(Teacher = c("BS", "BS", "FA"
), Level = structure(c(2L, 1L, 1L), .Label = c("BE", "AE", "ME",
"EE"), class = "factor"), Count = c(2L, 25L, 28L)), .Names = c("Teacher",
"Level", "Count"), row.names = c(NA, 3L), class = "data.frame")
ggplot(data=to_graph, aes(x=Teacher, y=Count, fill=Level), ordered=TRUE) +
geom_bar(aes(fill = Level), position = 'fill') +
scale_y_continuous("",formatter="percent") +
scale_fill_manual(values = c("#FF0000", "#FFFF00","#00CC00", "#0000FF")) +
opts(axis.text.x=theme_text(angle=45)) +
opts(title = "Score Distribution")
But this does not:
to_graph <- structure(list(School = c(84351L, 84384L, 84385L, 84386L, 84387L,
84388L, 84389L, 84397L, 84398L, 84351L, 84384L, 84385L, 84386L,
84387L, 84388L, 84389L, 84397L, 84398L, 84351L, 84386L), Level = structure(c(2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 3L, 3L), .Label = c("BE", "AE", "ME", "EE"), class = "factor"),
Count = c(3L, 7L, 5L, 4L, 3L, 4L, 4L, 6L, 2L, 116L, 138L,
147L, 83L, 76L, 81L, 83L, 85L, 53L, 1L, 1L)), .Names = c("School",
"Level", "Count"), row.names = c(NA, 20L), class = "data.frame")
ggplot(data=to_graph, aes(x=School, y=Count, fill=Level), ordered=TRUE) +
geom_bar(aes(fill = Level), position = 'fill') +
scale_y_continuous("",formatter="percent") +
scale_fill_manual(values = c("#FF0000", "#FFFF00","#00CC00", "#0000FF")) +
opts(axis.text.x=theme_text(angle=90)) +
opts(title = "Score Distribution")
With the latter code, I get this error:
stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust
this. Error in if (!all(data$ymin == 0)) warning("Filling not well
defined when ymin != 0") : missing value where TRUE/FALSE needed
Anyone know what's going on here? Thank you!

The error occurs because your x variable has numerical values, when in reality you want them to be discrete, i.e. use x=factor(School).
The reason is that stat_bin, the default stat for geom_bar in this version of ggplot2, summarises the data for each distinct value of x. When x is numeric, it instead divides the range into bins (hence the "binwidth defaulted to range/30" message), and most of those bins contain no school at all. This is clearly not what you need.
ggplot(data=to_graph, aes(x=factor(School), y=Count, fill=Level), ordered=TRUE) +
geom_bar(aes(fill = Level), position='fill') +
opts(axis.text.x=theme_text(angle=90)) +
scale_y_continuous("",formatter="percent") +
opts(title = "Score Distribution") +
scale_fill_manual(values = c("#FF0000", "#FFFF00","#00CC00", "#0000FF"))
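The calls opts(), theme_text() and the formatter argument come from an old ggplot2 release and are not available in current versions. As a hedged sketch only, assuming a recent ggplot2 with the scales package installed, the same plot could be written roughly as follows. The data frame is a cut-down copy of the question's data (two schools), geom_col() is swapped in for geom_bar() because the counts are already computed, and the colours are named by level as an extra safeguard that is not in the original.

```r
library(ggplot2)

# Cut-down copy of the question's data: two schools only
to_graph <- data.frame(
  School = c(84351L, 84384L, 84351L, 84384L, 84351L),
  Level  = factor(c("AE", "AE", "BE", "BE", "ME"),
                  levels = c("BE", "AE", "ME", "EE")),
  Count  = c(3L, 7L, 116L, 138L, 1L)
)

# factor(School) makes the x axis discrete, as in the answer above
p <- ggplot(to_graph, aes(x = factor(School), y = Count, fill = Level)) +
  geom_col(position = "fill") +                         # bars use the Count values as given
  scale_y_continuous(name = NULL, labels = scales::percent) +
  scale_fill_manual(values = c(BE = "#FF0000", AE = "#FFFF00",
                               ME = "#00CC00", EE = "#0000FF")) +
  theme(axis.text.x = element_text(angle = 90)) +
  ggtitle("Score Distribution")

print(p)
```

The key step is unchanged from the answer: School is converted to a factor so that each school gets its own bar.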

Related

How to add comparison bars to a plot to denote which comparison a p value corresponds to

I'm using the following data frame:
df1 <- structure(list(Genotype = structure(c(1L, 1L, 1L, 1L, 1L,
2L,2L,2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L,
1L,1L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L),
.Label= c("miR-15/16 FL", "miR-15/16 cKO"), class = "factor"),
Tissue = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L,
3L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L,
4L), .Label = c("iLN", "Spleen", "Skin", "Colon"), class = "factor"),
`Cells/SC/Live/CD8—,, CD4+/Foxp3+,Median,<BV421-A>,CD127` = c(518L,
715L, 572L, 599L, 614L, 881L, 743L, 722L, 779L, 843L, 494L,
610L, 613L, 624L, 631L, 925L, 880L, 932L, 876L, 926L, 1786L,
2079L, 2199L, 2345L, 2360L, 2408L, 2509L, 3129L, 3263L, 3714L,
917L, NA, 1066L, 1059L, 939L, 1269L, 1047L, 974L, 1048L,
1084L)),
.Names = c("Genotype", "Tissue", "Cells/SC/Live/CD8—,,CD4+/Foxp3+,Median,<BV421-A>,CD127"),
row.names = c(NA, -40L), class = c("tbl_df", "tbl", "data.frame"))
and trying to make a plot using ggplot2 where box plots and points are displayed grouped by "Tissue" and interleaved by "Genotype". The significance values are displaying properly but I would like to add lines to denote the comparisons being made and have them start at the center of each "miR-15/16 FL" box plot and end at the center of each "miR-15/16 cKO" box plot and sit directly below the significance values. Below is the code I am using to generate the plot:
library(ggplot2)
library(ggpubr)
color.groups <- c("black","red")
names(color.groups) <- unique(df1$Genotype)
shape.groups <- c(16, 1)
names(shape.groups) <- unique(df1$Genotype)
ggplot(df1, aes(x = Tissue, y = df1[3], color = Genotype, shape = Genotype)) +
geom_boxplot(position = position_dodge(), outlier.shape = NA) +
geom_point(position=position_dodge(width=0.75)) +
ylim(0,1.2*max(df1[3], na.rm = TRUE)) +
ylab('MFI CD127 (of CD4+ Foxp3+ T cells)') +
scale_color_manual(values=color.groups) +
scale_shape_manual(values=shape.groups) +
theme_bw() + theme(panel.border = element_blank(), panel.grid.major = element_blank(),
panel.grid.minor = element_blank(), axis.line = element_line(colour = "black"),
axis.title.x=element_blank(), aspect.ratio = 1,
text = element_text(size = 9)) +
stat_compare_means(show.legend = FALSE, label = 'p.format', method = 't.test',
label.y = c(0.1*max(df1[3], na.rm = TRUE) + max(df1[3][c(1:10),], na.rm = TRUE),
0.1*max(df1[3], na.rm = TRUE) + max(df1[3][c(11:20),], na.rm = TRUE),
0.1*max(df1[3], na.rm = TRUE) + max(df1[3][c(21:30),], na.rm = TRUE),
0.1*max(df1[3], na.rm = TRUE) + max(df1[3][c(31:40),], na.rm = TRUE)))
Thanks for any help!
I've created the brackets with three calls to geom_segment. These calls use a new dmax data frame created to provide the reference y-values for positioning the brackets and the p-value labels. The values e and r are for tweaking these positions.
I've made a few other changes to your code.
Change the name of the third column to temp and use this name y=temp in the call to ggplot. Your original code uses y=df1[3], which essentially reaches outside the plot environment to the df1 object in the parent environment, which can cause problems. Also, having a short name to refer to makes it easier to generate the dmax data frame and refer to its columns.
Use the dmax data frame for label.y positions in stat_compare_means, which reduces the amount of code needed. (Incidentally, stat_compare_means seems to require hard-coded label.y positions, rather than getting them from an aes mapping of the data.)
Position the p-value labels an absolute distance above each pair of box plots (using the value e), rather than a multiplicative distance. This makes it easier to keep spacing consistent between p-value labels, brackets, and box plots.
library(dplyr)  # provides %>%, group_by() and summarise()
# Use a short column name for the third column
names(df1)[3] = "temp"
# Generate data frame of reference y-values for p-value labels and bracket positions
dmax = df1 %>% group_by(Tissue) %>%
summarise(temp=max(temp, na.rm=TRUE),
Genotype=NA)
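# For tweaking position of brackets
e = 350          # distance of the p-value label above the tallest value in each tissue
r = 0.6          # the bracket sits at this fraction of e above the tallest value
w = 0.19         # half-width of the bracket, in x-axis units
bcol = "grey30"  # bracket colour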
ggplot(df1, aes(x = Tissue, y = temp, color = Genotype, shape = Genotype)) +
geom_boxplot(position = position_dodge(), outlier.shape = NA) +
geom_point(position=position_dodge(width=0.75)) +
ylim(0,1.2*max(df1$temp, na.rm = TRUE)) +
ylab('MFI CD127 (of CD4+ Foxp3+ T cells)') +
scale_color_manual(values=color.groups) +
scale_shape_manual(values=shape.groups) +
theme_bw() + theme(panel.border = element_blank(), panel.grid.major = element_blank(),
panel.grid.minor = element_blank(), axis.line = element_line(colour = "black"),
axis.title.x=element_blank(), aspect.ratio = 1,
text = element_text(size = 9)) +
stat_compare_means(show.legend = FALSE, label = 'p.format', method = 't.test',
label.y = e + dmax$temp) +
geom_segment(data=dmax,
aes(x=as.numeric(Tissue)-w, xend=as.numeric(Tissue)+w,
y=temp + r*e, yend=temp + r*e), size=0.3, color=bcol, inherit.aes=FALSE) +
geom_segment(data=dmax,
aes(x=as.numeric(Tissue) + w, xend=as.numeric(Tissue) + w,
y=temp + r*e, yend=temp + r*e - 60), size=0.3, color=bcol, inherit.aes=FALSE) +
geom_segment(data=dmax,
aes(x=as.numeric(Tissue) - w, xend=as.numeric(Tissue) - w,
y=temp + r*e, yend=temp + r*e - 60), size=0.3, color=bcol, inherit.aes=FALSE)
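The bracket geometry in the code above can be checked numerically. The following is a minimal base R sketch, not part of the original answer: the per-tissue maxima are typed in from the question's data, and the names dodge_width, n_groups and brackets are introduced here purely for illustration. It shows why w = 0.19 puts the bracket ends at the box centres: with two dodged groups and a dodge width of 0.75, the centres sit 0.75/4 = 0.1875 either side of the tissue position.

```r
# Per-tissue maxima of the measured column (computed by hand from the question's data)
dmax <- data.frame(
  Tissue = factor(c("iLN", "Spleen", "Skin", "Colon"),
                  levels = c("iLN", "Spleen", "Skin", "Colon")),
  temp   = c(881, 932, 3714, 1269)
)

e <- 350   # label offset, as in the answer
r <- 0.6   # bracket height as a fraction of e, as in the answer

# Illustrative names (assumptions, not in the original answer)
dodge_width <- 0.75
n_groups    <- 2

# Horizontal offset of the outermost dodged group from the category centre
w <- dodge_width * (n_groups - 1) / (2 * n_groups)

brackets <- data.frame(
  Tissue  = dmax$Tissue,
  x_left  = as.numeric(dmax$Tissue) - w,   # centre of the first genotype's box
  x_right = as.numeric(dmax$Tissue) + w,   # centre of the second genotype's box
  y_bar   = dmax$temp + r * e,             # height of the horizontal bracket line
  y_label = dmax$temp + e                  # height of the p-value label
)

print(brackets)
```

If the number of genotypes or the dodge width changes, w would need to change with it, and a single bracket per tissue would no longer describe all pairwise comparisons.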
To address your comment, here's an example to show that the method above inherently adjusts to any number of x-categories.
Let's begin by adding two new tissue categories:
library(forcats)
df1$Tissue = fct_expand(df1$Tissue, "Tissue 5", "Tissue 6")
df1$Tissue[seq(1,20,4)] = "Tissue 5"
df1$Tissue[seq(21,40,4)] = "Tissue 6"
dmax = df1 %>% group_by(Tissue) %>%
summarise(temp=max(temp, na.rm=TRUE),
Genotype=NA)
Now run exactly the same plot code listed above to get the following plot:

r - ggplot: decrease the number of axis intervals or space out the axis ticks

I've made a group plot of time series with ggplot with this syntax:
ggplot(Tur_flow, aes(x=time, group=parameter, colour=parameter)) +
geom_point(aes(y=value), size=1) +
stat_smooth(aes(y=value), method=lm) +
facet_grid(parameter ~ Section, scales="free_y") +
theme_minimal() +
theme(text = element_text(size=16))
dput(head(Tur_flow))
structure(list(Section = structure(c(2L, 2L, 2L, 2L, 2L, 2L), .Label = c("S-5", "S-50", "S+5", "S+50"), class = "factor"), parameter = structure(c(3L,
3L, 3L, 3L, 3L, 3L), .Label = c("Discharge", "Mean_Velocity",
"T_15", "T_25", "T_65", "Water_Depth"), class = "factor"), time = structure(c(6L, 13L, 20L, 27L, 34L, 41L), .Label = c("11:59:55", "11:59:56",
"11:59:58", "11:59:59", "12:00:00", "12:00:02", "12:00:05", "12:00:55",
"12:00:56", "12:00:58", "12:00:59", "12:01:00", "12:01:01", "12:01:05",
"12:01:55", "12:01:56............. "8.30", "8.31", "8.41", "8.54", "8.94", "800.31", "822.01", "828.77", "839.30", "846.11", "847.60", "8497.25", "894.21", "91.66", "91.67", "91.68", "91.90", "92.08", "92.23", "92.54", "93.23", "974.50", "N/A"), class = "factor")), .Names = c("Section", "parameter",
"time", "value"), row.names = c(NA, 6L), class = "data.frame")
How can I reduce the number of intervals on both the x and y axes, that is, space the axis ticks further apart? The x-axis data is time.
On the y-axis, how can I reduce the number of decimal places shown?

With both stacked and dodged bars, how can you remove dodge-bar elements from legend?

Thanks to the question "combine stacked bars and dodged bars", I created the plot below using the data frame shown. But now, since the axis labels already name the bars, how can I remove the legend elements other than those for the one stacked bar? That is, can the legend show only the segments of the Big8 bar?
> dput(combo)
structure(list(firm = structure(c(12L, 1L, 11L, 13L, 2L, 3L,
4L, 5L, 6L, 7L, 8L, 9L, 10L), .Label = c("Avg.", "Co", "Firm1",
"Firm2", "Firm3", "Firm4", "Firm5", "Firm6", "Firm7", "Firm8",
"Median", "Q1", "Q3"), class = "factor"), metric = structure(c(5L,
1L, 4L, 6L, 3L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("Avg.",
"Big8", "Co", "Median", "Q1", "Q3"), class = "factor"), value = c(0.0012,
0.0065, 0.002, 0.0036, 0.0065, 0.000847004466666667, 0.000658907411111111,
0.0002466389, 8.41422555555556e-05, 8.19149222222222e-05, 7.97185555555556e-05,
7.82742555555556e-05, 7.56679888888889e-05), grp = structure(c(1L,
2L, 3L, 6L, 4L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L), .Label = c("Q1",
"Avg.", "Median", "Co", "Big8", "Q3"), class = "factor")), .Names = c("firm",
"metric", "value", "grp"), row.names = c(NA, -13L), class = "data.frame")
Here is the plotting code.
ggplot(combo, aes(x=grp, y=value, fill=firm)) +
geom_bar(stat="identity") +
labs(x = "", y = "") +
theme(legend.position = "bottom") +
guides(fill = guide_legend(nrow = 2))
The plot, which ideally would have a smaller set of elements in the legend.
You can manually set the breaks for scale_fill_discrete:
library(ggplot2)
ggplot(combo, aes(x=grp, y=value, fill=firm)) +
geom_bar(stat="identity") +
labs(x = "", y = "") +
theme(legend.position = "bottom") +
guides(fill = guide_legend(nrow = 2)) +
scale_fill_discrete(breaks = combo$firm[combo$metric=="Big8"])
I'm not 100% sure which labels you want to keep, but the breaks can be any vector of firm names: one typed in by hand, or one built by subsetting combo$firm on combo$metric as above.
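To see which legend entries that breaks expression keeps, it can be evaluated on its own. This is a small base R sketch on a reduced copy of combo (only the firm and metric columns, rebuilt from the dput above); big8_breaks is a name introduced here for illustration.

```r
# Reduced copy of combo: firm and metric columns only, same row order as the dput
combo <- data.frame(
  firm   = c("Q1", "Avg.", "Median", "Q3", "Co", paste0("Firm", 1:8)),
  metric = c("Q1", "Avg.", "Median", "Q3", "Co", rep("Big8", 8)),
  stringsAsFactors = TRUE
)

# The firms that make up the stacked Big8 bar; these become the legend entries
big8_breaks <- as.character(combo$firm[combo$metric == "Big8"])

print(big8_breaks)
```

Only the eight firms in the stacked bar remain, so the summary bars (Q1, Avg., Median, Q3, Co) would drop out of the legend while still being drawn.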

placing linear line based on the aggregate data in ggplot2

dput(x)
structure(list(Date = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L,
2L, 3L, 3L, 3L, 3L), .Label = c("1/1/2012", "2/1/2012", "3/1/2012"
), class = "factor"), Server = structure(c(1L, 2L, 3L, 4L, 1L,
2L, 3L, 4L, 1L, 2L, 3L, 4L), .Label = c("A", "B", "C", "D"), class = "factor"),
Storage = c(10000L, 20000L, 30000L, 15000L, 15000L, 25000L,
35000L, 15700L, 16000L, 27000L, 37000L, 16700L)), .Names = c("Date",
"Server", "Storage"), class = "data.frame", row.names = c(NA,
-12L))
I would like to create a stacked bar chart with x=Date and y=Storage, and also add a linear trend line based on the total storage.
I have come up with this ggplot line:
ggplot(x, aes(x=Date, y=Storage)) + geom_bar(aes(x=Date,y=Storage,fill=Server), stat="identity", position="stack") + geom_smooth(aes(group=1),method="lm", size=2, color="red")
It kind of works, but the fitted line is not based on the total storage for each Date in the data frame x. Is there an easy way to do this?
Often the easiest way is just to calculate the values outside of ggplot2. So calculate the totals:
dd = as.data.frame(tapply(x$Storage, x$Date, sum))
dd$Date = rownames(dd)
colnames(dd)[1] = "Storage"
then add a geom_smooth call but specify the data:
ggplot(x, aes(x=Date, y=Storage)) +
geom_bar(aes(x=Date,y=Storage, fill=Server), stat="identity", position="stack") +
geom_smooth(data = dd, aes(x=Date, y=Storage, group=1),method="lm")
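As an alternative to the tapply() steps, the per-date totals can also be computed in one call with base R's aggregate(). This is only a sketch: the data frame is rebuilt from the dput in the question, and totals is a name chosen here for illustration.

```r
# Rebuild the question's data frame (values copied from the dput above)
x <- data.frame(
  Date    = factor(rep(c("1/1/2012", "2/1/2012", "3/1/2012"), each = 4)),
  Server  = factor(rep(c("A", "B", "C", "D"), times = 3)),
  Storage = c(10000L, 20000L, 30000L, 15000L, 15000L, 25000L,
              35000L, 15700L, 16000L, 27000L, 37000L, 16700L)
)

# One row per Date, holding the summed Storage across all servers
totals <- aggregate(Storage ~ Date, data = x, FUN = sum)

print(totals)
```

The resulting totals data frame already has the Date and Storage columns, so it could be passed as the data argument of geom_smooth() in the same way as dd.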

set minimum limit for violin plot ggplot

I'd like to set the minimum bounds for a violin plot, similar to this question: set only lower bound of a limit for ggplot
For this:
p <- ggplot(somedf, aes(factor(user1), pq)) + aes(ymin = -50)
p + geom_violin(aes(fill = user1))+ aes(ymin=-50)
I've tried adding
+ expand_limits(y=-50)
and
+ aes(ymin = -50)
to set lower bounds with no effect.
Here's a sample dataframe that results in the same problem:
structure(list(pq = c(-20L, -12L, 10L, -13L, 11L, -16L), time = c(1214.1333,
1214.1833, 1214.2667, 1214.2833, 1214.35, 1214.5167), pq.1 = c(-20L,
-12L, 10L, -13L, 11L, -16L), time.1 = c(1214.1333, 1214.1833,
1214.2667, 1214.2833, 1214.35, 1214.5167), time.2 = c(1214.1333,
1214.1833, 1214.2667, 1214.2833, 1214.35, 1214.5167), pq.2 = c(-20L,
-12L, 10L, -13L, 11L, -16L), user1 = structure(c(1L, 1L, 2L,
1L, 2L, 1L), .Label = c("someguy3", "someguy4", "someguy6", "someguy4",
"someguy5", "someguy6"), class = "factor"), pq.3 = c(-20L, -12L, 10L,
-13L, 11L, -16L), time.3 = c(1214.1333, 1214.1833, 1214.2667,
1214.2833, 1214.35, 1214.5167), user1.1 = structure(c(1L, 1L,
2L, 1L, 2L, 1L), .Label = c("someguy3", "someguy4", "someguy6",
"someguy4", "someguy5", "someguy6"), class = "factor")), .Names = c("pq",
"time", "pq.1", "time.1", "time.2", "pq.2", "user1", "pq.3",
"time.3", "user1.1"), row.names = c(565L, 566L, 568L, 569L, 570L,
574L), class = "data.frame")
ggplot will pay attention to the aes() directive if you add a call to geom_blank().
## A reproducible example
library(ggplot2)
p <- ggplot(mtcars, aes(factor(cyl), mpg))
## This doesn't work:
p + aes(ymin = -10) + geom_violin()
## But this does:
p + aes(ymin = -10) + geom_violin() + geom_blank()
(Note: For this example at least, expand_limits(y = -10) works with or without an accompanying call to geom_blank().)
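One way to confirm that a lower bound has taken effect, without inspecting the picture, is to ask the built plot for its y limits. The sketch below assumes a reasonably recent ggplot2 that provides layer_scales(), and uses the expand_limits() variant from the note; y_limits is a name introduced here.

```r
library(ggplot2)

# Violin plot with the lower y limit pushed down to -10
p <- ggplot(mtcars, aes(factor(cyl), mpg)) +
  geom_violin() +
  expand_limits(y = -10)

# Limits of the trained y scale for the first layer
y_limits <- layer_scales(p)$y$get_limits()

print(y_limits)
```

The same check can be applied to the aes(ymin = -10) variants to compare their behaviour in whichever ggplot2 version is installed.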
