ggplot2 scatter plot with overlay of means and bidirectional SD bars - r

This question is a direct successor to a previous question asked here called “ggplot scatter plot of two groups with superimposed means with X and Y error bars”. That question's answer appears to do exactly what I am trying to accomplish; however, the code provided results in an error which I can’t get around. I will use my data as an example here, but I have tried the original question's code as well, with the same result.
I have a data frame which looks like this:
structure(list(Meta_ID = structure(c(15L, 22L, 31L, 17L), .Label = c("NM*624-46",
"NM*624-54", "NM*624-56", "NM*624-61", "NM*624-70", "NM624-36",
"NM624-38", "NM624-39", "NM624-40", "NM624-41", "NM624-43", "NM624-46",
"NM624-47", "NM624-51", "NM624-54 ", "NM624-56", "NM624-57",
"NM624-59", "NM624-61", "NM624-64", "NM624-70", "NM624-73", "NM624-75",
"NM624-77", "NM624-81", "NM624-82", "NM624-83", "NM624-84", "NM625-02",
"NM625-10", "NM625-11", "SM621-43", "SM621-44", "SM621-46", "SM621-47",
"SM621-48", "SM621-52", "SM621-53", "SM621-55", "SM621-56", "SM621-96",
"SM621-97", "SM622-51", "SM622-52", "SM623-14", "SM623-23", "SM623-26",
"SM623-27", "SM623-32", "SM623-33", "SM623-34", "SM623-55", "SM623-56",
"SM623-57", "SM623-58", "SM623-59", "SM623-61", "SM623-62", "SM623-64",
"SM623-65", "SM623-66", "SM623-67", "SM680-74", "SM681-16"), class = "factor"),
Region = structure(c(1L, 1L, 1L, 1L), .Label = c("N", "S"
), class = "factor"), Tissue = structure(c(1L, 2L, 1L, 1L
), .Label = c("M", "M*"), class = "factor"), Tag_Num = structure(c(41L,
48L, 57L, 43L), .Label = c("621-43", "621-44", "621-46",
"621-47", "621-48", "621-52", "621-53", "621-55", "621-56",
"621-96", "621-97", "622-51", "622-52", "623-14", "623-23",
"623-26", "623-27", "623-32", "623-33", "623-34", "623-55",
"623-56", "623-57", "623-58", "623-59", "623-61", "623-62",
"623-64", "623-65", "623-66", "623-67", "624-36", "624-38",
"624-39", "624-40", "624-41", "624-43", "624-46", "624-47",
"624-51", "624-54", "624-56", "624-57", "624-59", "624-61",
"624-64", "624-70", "624-73", "624-75", "624-77", "624-81",
"624-82", "624-83", "624-84", "625-02", "625-10", "625-11",
"680-74", "681-16"), class = "factor"), Lab_Num = structure(1:4, .Label = c("C4683",
"C4684", "C4685", "C4686", "C4687", "C4688", "C4689", "C4690",
"C4691", "C4692", "C4693", "C4694", "C4695", "C4696", "C4697",
"C4698", "C4699", "C4700", "C4701", "C4702", "C4703", "C4704",
"C4705", "C4706", "C4707", "C4708", "C4709", "C4710", "C4711",
"C4712", "C4713", "C4714", "C4715", "C4716", "C4717", "C4718",
"C4719", "C4720", "C4721", "C4722", "C4723", "C4724", "C4725",
"C4726", "C4727", "C4728", "C4729", "C4730", "C4731", "C4732",
"C4733", "C4734", "C4735", "C4736", "C4737", "C4738", "C4739",
"C4740", "C4741", "C4742", "C4743", "C4744", "C4745", "C4746",
"C4747", "C4748"), class = "factor"), C = c(46.5, 46.7, 45,
43.6), N = c(12.9, 13.7, 14.5, 13.4), C.N = c(3.6, 3.4, 3.1,
3.3), d13C = c(-19.7, -19.5, -19.4, -19.2), d15N = c(13.3,
12.4, 11.7, 11.9)), .Names = c("Meta_ID", "Region", "Tissue",
"Tag_Num", "Lab_Num", "C", "N", "C.N", "d13C", "d15N"), row.names = c(NA,
4L), class = "data.frame")
What I want to produce is a scatter plot of the raw data with an overlay of the data means for each “Region” with bidirectional error bars. To accomplish that I use plyr to summarize my data and generate the means and SDs. Then I use ggplot2:
library(plyr)
Basic <- ddply(First.run,.(Region),summarise,
N = length(d13C),
d13C.mean = mean(d13C),
d15N.mean = mean(d15N),
d13C.SD = sd(d13C),
d15N.SD = sd(d15N))
ggplot(data=First.run, aes(x = First.run$d13C, y = First.run$d15N))+
geom_point(aes(colour = Region))+
geom_point(data = Basic,aes(colour = Region))+
geom_errorbarh(data = Basic, aes(xmin = d13C.mean + d13C.SD, xmax = d13C.mean - d13C.SD,
y = d15N.mean, colour = Region, height = 0.01))+
geom_errorbar(data = Basic, aes(ymin = d15N.mean - d15N.SD, ymax = d15N.mean + d15N.SD,
x = d13C.mean,colour = Region))
But each time I run this code I get the same error and can’t figure out what the problem is.
Error: Aesthetics must either be length one, or the same length as the dataProblems:Region
Any help would be much appreciated.
Edit: Since my example data is taken from the head of my full dataset it only includes samples from the "N" Region. With only this one region the code works fine but if you use fix() to change the provided dataset so that at least one other Region is included (in my data the other Region is "S") then the error I get shows up. My mistake in not including some data from each Region.

I ended up changing two of the "N" Regions to "S" so I could calculate standard deviation for both groups.
I think the problem was that you were missing required aesthetics in some of your geoms (geom_point was missing x and y, for example). At least getting all the required aesthetics into each geom seemed to get everything working. I cleaned up a few other things while I was at it to shorten the code up a bit.
ggplot(data = First.run, aes(x = d13C, y = d15N, colour = Region)) +
geom_point() +
geom_point(data = Basic,aes(x = d13C.mean, y = d15N.mean)) +
geom_errorbarh(data = Basic, aes(xmin = d13C.mean - d13C.SD,
xmax = d13C.mean + d13C.SD, y = d15N.mean, x = d13C.mean), height = .5) +
geom_errorbar(data = Basic, aes(ymin = d15N.mean - d15N.SD,
ymax = d15N.mean + d15N.SD, x = d13C.mean, y = d15N.mean), width = .01)
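For anyone hitting the same message: the original call mapped x and y to First.run$d13C and First.run$d15N in the top-level aes(), and those full-length vectors are inherited by every layer, including the ones drawn from the much shorter Basic. Below is a minimal sketch of the same plot, assuming a small made-up First.run with two regions (the values are illustrative, not the real data) and using base R instead of plyr for the summary. Setting inherit.aes = FALSE on the summary layers is my own choice here, so that each layer only needs the aesthetics it actually uses.

```r
library(ggplot2)

# Hypothetical stand-in for First.run with two regions (illustrative values only)
First.run <- data.frame(
  Region = factor(c("N", "N", "N", "S", "S", "S")),
  d13C   = c(-19.7, -19.5, -19.4, -18.9, -18.6, -18.4),
  d15N   = c(13.3, 12.4, 11.7, 14.1, 13.8, 14.6)
)

# Per-region means and SDs, one row per Region
Basic <- do.call(rbind, lapply(split(First.run, First.run$Region), function(d) {
  data.frame(
    Region    = d$Region[1],
    d13C.mean = mean(d$d13C), d13C.SD = sd(d$d13C),
    d15N.mean = mean(d$d15N), d15N.SD = sd(d$d15N)
  )
}))

# Column names only inside aes(); no First.run$... vectors
p <- ggplot(First.run, aes(x = d13C, y = d15N, colour = Region)) +
  geom_point(alpha = 0.5) +
  geom_point(data = Basic, inherit.aes = FALSE, size = 3,
             aes(x = d13C.mean, y = d15N.mean, colour = Region)) +
  geom_errorbar(data = Basic, inherit.aes = FALSE, width = 0.05,
                aes(x = d13C.mean, colour = Region,
                    ymin = d15N.mean - d15N.SD, ymax = d15N.mean + d15N.SD)) +
  geom_errorbarh(data = Basic, inherit.aes = FALSE, height = 0.1,
                 aes(y = d15N.mean, colour = Region,
                     xmin = d13C.mean - d13C.SD, xmax = d13C.mean + d13C.SD))
```

Printing p draws the plot. The width and height values only control the size of the end caps and will probably need tuning for the real data.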

Related

Why geom_bracket is not allowing me to plot a bracket?

I would like to add a bracket using geom_bracket for my first two groups of countries the United Kingdom (UK) and France (FR). I use the following code and it plots the three estimates:
library(ggpubr)
library(ggplot2)
df %>%
ggplot(aes(estimate, cntry)) +
geom_point()
However, whenever I add the geom_bracket as below, I get an error. I tried to get around it in different ways but it is still not working. Could someone let me know what I am doing wrong?
df %>%
ggplot(aes(estimate, cntry)) +
geom_point() +
geom_bracket(ymin = "UK", ymax = "FR", x.position = -.75, label.size = 7,
label = "group 1")
Here is a reproducible example:
structure(list(cntry = structure(1:3, .Label = c("BE", "FR",
"UK"), class = "factor"), estimate = c(-0.748, 0.436,
-0.640)), row.names = c(NA, -3L), groups = structure(list(
cntry = structure(1:3, .Label = c("BE", "FR", "UK"), class = "factor"),
.rows = structure(list(1L, 2L, 3L), ptype = integer(0), class = c("vctrs_list_of",
"vctrs_vctr", "list"))), row.names = c(NA, 3L), class = c("tbl_df",
"tbl", "data.frame"), .drop = TRUE), class = c("grouped_df",
"tbl_df", "tbl", "data.frame"))
Well, it's pretty damn late at that, but I figured out a workaround for this. I thought that I might as well post it here in case anyone finds it useful.
Firstly, as Basti mentioned, ymin, ymax, and x.position aren't arguments that can be used - you have to use xmin, xmax, and y.position. Now, won't this only work for a flipped graph (i.e. x = cntry, y = estimate)? Yes, it will. However you can easily get around this by using coord_flip().
Secondly, it turns out that geom_bracket doesn't inherit the data description (df) and won't run without it being defined inside it. Why? No idea. But this is what was causing the error. Additionally, for some reason, merely defining the data isn't enough, a label must also be added. Not a problem here, just thought I might mention it for dumb people like me who decided to use geom_bracket to add brackets to stat_compare_means.
Here's an example of the OP that should work, along with data generation:
library(ggplot2)
library(ggpubr)
library(tibble) #I like tibbles
df <- tibble(cntry = factor(c("BE", "FR", "UK")),
estimate = c(-0.748,0.436,-0.64)) #dataframe generation
df %>%
ggplot(aes(cntry, estimate)) +
geom_point() +
coord_flip() + #necessary if you want to keep this weird x/y orientation
geom_bracket(data = df, xmin = "UK", xmax = "FR", y.position = -.75,
label.size = 7, label = "group 1", coord.flip = TRUE)
#coord.flip = TRUE reflects the added coord_flip()
You can then play around with y coordinates, size, etc. You can also expand the graph using expand_limits().

Heatmap coloring and references with ggplot in R

I have the following code, that generates the following heatmap in R.
ggplot(data = hminput, color=category, aes(x = Poblaciones, y = Variantes)) +
geom_tile(aes(fill = Frecuencias)) + scale_colour_gradient(name = "Frecuencias",low = "blue", high = "white",guide="colourbar")
hminput is a data frame with three columns: Poblaciones, Variantes and Frecuencias, where the first two are the x and y axis and the third one is the color reference.
My desired output is a heatmap with a continuous colour bar as the legend instead of those blocks, and with a white-blue gradient instead of that multicolor one.
To achieve that, I tried what's in my code, but I'm not achieving what I want (I'm getting the graph you see in the picture). Any thoughts? Thanks!
As some people asked, here is the dput of the data frame :
> dput(hminput)
structure(list(Variantes = structure(c(1L, 2L, 3L, 4L,...), .Label =
c("rs10498633", "rs10792832", "rs10838725",
"rs10948363", ..., "SNP"), class = "factor"),
Poblaciones = c("AFR", "AFR", ...), Frecuencias = structure(c(12L,
10L,...), .Label = c("0.01135", "0.0121",
"0.01286", "0.01513", "0.02194", "0.05144", "0.05825", "0.059",
"0.07716", "0.0938", "0.1051", "0.1225", "0.1346", "0.1407",
"0.1566", "0.1604", "0.1619", "0.1838", "0.1914", "0.1929",
...,
"0.45", "0.5", "0.4"), class = "factor")), .Names = c("Variantes",
"Poblaciones", "Frecuencias"), row.names = c("frqAFR.33", "frqAFR.31",
"frqAFR.27", "frqAFR.14", "frqAFR.24",...
), class = "data.frame")
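One thing the dput() above shows is that Frecuencias is stored as a factor, and a factor mapped to fill gets a discrete palette with a block legend. Also, tiles are coloured through the fill aesthetic, so a scale_colour_* call does not affect them. Here is a minimal sketch of one possible fix, assuming a tiny made-up version of hminput (the "EUR" rows and the particular value pairings are invented for illustration):

```r
library(ggplot2)

# Hypothetical miniature of hminput; Frecuencias arrives as a factor, as in the dput() output
hminput <- data.frame(
  Variantes   = factor(rep(c("rs10498633", "rs10792832"), times = 2)),
  Poblaciones = rep(c("AFR", "EUR"), each = 2),
  Frecuencias = factor(c("0.1225", "0.0938", "0.45", "0.05144"))
)

# Convert the factor labels to numbers; as.numeric() alone would return the level codes
hminput$Frecuencias <- as.numeric(as.character(hminput$Frecuencias))

# Map fill (not colour) and use a continuous fill scale, which gets a colour bar legend
p <- ggplot(hminput, aes(x = Poblaciones, y = Variantes, fill = Frecuencias)) +
  geom_tile() +
  scale_fill_gradient(name = "Frecuencias", low = "blue", high = "white")
```

The low and high colours are the ones from the question; swap them if white should stand for the low frequencies.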

geom_errorbar behaving strangely, ggplot2

I have an unusual problem when using geom_errorbar in ggplot2.
The error bars are not within range but that is of no concern here.
My problem is that geom_errorbar is plotting the confidence intervals for the same data differently depending on what other data is plotted with it.
In the code below, the uncommented SE and AggBar come from filtering the data to rows where Audio1 is equal to "300SW" OR "3500MFL".
SE<-c(0.0861829641865964, 0.0296894376485468, 0.0323219002250762,
0.0937013798013447)
AggBar <- structure(list(Report = structure(c(1L, 2L, 1L, 2L, 1L, 2L, 1L,
2L), .Label = c("One Flash", "Two Flashes"), class = "factor"),
Visual = structure(c(1L, 1L, 2L, 2L, 1L, 1L, 2L, 2L), .Label = c("one",
"two"), class = "factor"), Audio = c("300SW", "300SW", "300SW",
"300SW", "3500MFL3500CL", "3500MFL3500CL", "3500MFL3500CL",
"3500MFL3500CL"), Prob = c(0.938828282828283, 0.0611717171717172,
0.754141414141414, 0.245858585858586, 0.534484848484848,
0.465515151515151, 0.0830909090909091, 0.916909090909091)), .Names = c("Report",
"Visual", "Audio", "Prob"), row.names = c(NA, -8L), class = "data.frame")
#SE<-c(0.0310069159026252, 0.113219880555153, 0.0861829641865964, 0.0296894376485468)
#AggBar <- structure(list(Report = structure(c(1L, 2L, 1L, 2L, 1L, 2L, 1L,
#2L), .Label = c("One Flash", "Two Flashes"), class = "factor"),
#Visual = structure(c(1L, 1L, 2L, 2L, 1L, 1L, 2L, 2L), .Label = c("one",
#"two"), class = "factor"), Audio = c("300MFL300CL", "300MFL300CL",
#"300MFL300CL", "300MFL300CL", "300SW", "300SW", "300SW",
#"300SW"), Prob = c(0.562242424242424, 0.437757575757576,
#0.0921010101010101, 0.90789898989899, 0.938828282828283,
#0.0611717171717172, 0.754141414141414, 0.245858585858586)), .Names = c("Report",
#"Visual", "Audio", "Prob"), row.names = c(NA, -8L), class = "data.frame")
prob.bar = ggplot(AggBar, aes(x = Report, y = Prob, fill = Report)) + theme_bw() #+ facet_grid(Audio~Visual)
prob.bar + #This changes all panels' colour
geom_bar(position=position_dodge(.9), stat="identity", colour="black", width=0.8)+
theme(legend.position = "none") + labs(x="Report", y="Probability of Report", title = expression("Visual Condition")) + scale_fill_grey() +
scale_fill_grey(start=.4) +
scale_y_continuous(limits = c(0, 1), breaks = (seq(0,1,by = .25)))+
facet_grid(Audio ~ Visual)+
geom_errorbar(aes(ymin=Prob-SE, ymax=Prob+SE),
width=.1, # Width of the error bars
position=position_dodge(.09))
This results in the following output:
The Audio1 variables are seen on the rightmost vertical labels.
However, if I filter so that it only passes rows where Audio1 is equal to "300SW" OR "300MFL" (the commented SE and AggBar), the error bars for "300SW" change:
The Audio1 variables are seen on the rightmost vertical labels with "300SW" on the bottom this time.
This change is the incorrect one because when I plot just the Audio1 "300SW" the error bars match the original plot.
I have tried plotting the Audio1 "300SW" with other variables not presented here and it is only when presenting with "300MFL" that this change occurs.
If you look at the SE variable contents you will see that there is no change in the values therein for "300SW" in both versions of the code. Yet the outputs differ.
I cannot fathom what is happening here. Any ideas or suggestions are welcome.
Thanks very much for your time.
@Antonios K below has highlighted that when "300SW" is on top of the grid the error bars are correctly drawn. I'm guessing that the error bars are being incorrectly matched to the bars, although I don't know why this is the case.
The problem is that SE is not stored inside the data frame: it's just floating around in the global environment. When the data is facetted (which involves rearranging the order), it no longer lines up with the correct records. Fix the problem by storing SE in the data frame:
# SE has 4 values and AggBar has 8 rows, so R recycles the vector when filling the column
AggBar$SE <- c(0.0310069159026252, 0.113219880555153, 0.0861829641865964, 0.0296894376485468)
ggplot(AggBar, aes(Report, Prob)) +
geom_bar(stat = "identity", fill = "grey50") +
geom_errorbar(aes(ymin = Prob - SE, ymax = Prob + SE), width = 0.4) +
facet_grid(Audio ~ Visual)
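To make the point about row order concrete, here is a small sketch assuming made-up probabilities and standard errors (none of these numbers come from the question). It stores one SE per row as a column, then reorders the rows to mimic what faceting does; SE_loose is a hypothetical free-floating copy kept only for comparison.

```r
library(ggplot2)

# Hypothetical data: eight bars, each with its own standard error
AggBar <- data.frame(
  Report = rep(c("One Flash", "Two Flashes"), times = 4),
  Visual = rep(rep(c("one", "two"), each = 2), times = 2),
  Audio  = rep(c("300SW", "300MFL300CL"), each = 4),
  Prob   = c(0.94, 0.06, 0.75, 0.25, 0.56, 0.44, 0.09, 0.91)
)
AggBar$SE <- c(0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08)

# A free-floating copy, like the SE vector in the question
SE_loose <- AggBar$SE

# Reordering the rows moves the SE column with them; SE_loose stays in its old order
reordered <- AggBar[order(AggBar$Audio), ]

# With SE inside the data frame, every layer and facet sees matching Prob/SE pairs
p <- ggplot(AggBar, aes(x = Report, y = Prob)) +
  geom_col(fill = "grey50", colour = "black", width = 0.8) +
  geom_errorbar(aes(ymin = Prob - SE, ymax = Prob + SE), width = 0.1) +
  facet_grid(Audio ~ Visual)
```

After the reordering, the first row of reordered is a "300MFL300CL" row carrying its own SE, while the first element of SE_loose still belongs to a "300SW" row. That is the kind of mismatch described above.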
The bit of code that plots the error bars is:
geom_errorbar(aes(ymin=Prob-SE, ymax=Prob+SE),
width=.1, # Width of the error bars
position=position_dodge(.09))
So, I guess it's something there.
As you said the SE variable is the same in both cases, but what you plot there is Prob-SE and Prob+SE. And if you do AggBar$Prob-SE and AggBar$Prob+SE you'll get different values for 300SW for each case.
It might have to do with the order of your Audio1 values. Did the other cases that worked also have 300SW in the top part of the plots?
Try
sort(unique(DataRearrange$Audio1) )
[1] "300MFL" "300SW" "3500MFL"
Combining first two will give you 300SW on the bottom part of the plots.
Combining last two will give you 300SW on the top part.
To check this assumption, in your second case, where you combine 300MFL and 300SW, try replacing 300SW with 1_300SW (so that it sorts first and is plotted on top) and see what happens:
DataRearrange$Audio1[DataRearrange$Audio1=="300SW"] = "1_300SW"
# Below is the alternative coupling..
ErrorBarsDF <- DataRearrange[(DataRearrange$Audio1=="1_300SW" | DataRearrange$Audio1=="300MFL"), c("correct","Visual1", "Audio1", "Audio2","correct_response", "response", "subject_nr")]
DataRearrange <- DataRearrange[(DataRearrange$Audio1=="1_300SW" | DataRearrange$Audio1=="300MFL"), c("correct","Visual1", "Audio1", "Audio2","correct_response", "response", "subject_nr")]

plotting simulation coverage of a "known" point in ggplot

I have the results of a simulation that involved removing data and refitting a model, and generated the mean and CIs for 5 beta coefficients (AAA:EEE). The sample data are reproducible through dput().
data <- structure(list(PercentData = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L), .Label = c("90Percent", "80Percent", "70Percent", "60Percent", "50Percent", "40Percent", "30Percent", "20Percent"), class = "factor"), Beta = c("AAA", "BBB", "CCC", "DDD", "EEE", "AAA", "BBB", "CCC", "DDD", "EEE", "AAA", "BBB", "CCC", "DDD", "EEE"), Mean = c(-0.0184798128725727, 0.577389832570274, 0.307079889066798, -1.04434737355186, 0.765444299971639, -0.0342811658086197, 0.571119844203796, 0.307904693724208, -1.05833526491829, 0.772586633692223, -0.0287982339992084, 0.567559187110271, 0.300408471488675, -1.05392763762688, 0.768956684863523), UpperCI = c(0.011382484714714, 0.592146704143253, 0.334772268551607, -0.997865978815953, 0.787196643647358, 0.0270716705899447, 0.595047291677895, 0.363220155550484, -1.01101175408862, 0.82142109640807, 0.0501543137571774, 0.597455743424951, 0.351903162023205, -1.00408187639287, 0.805740012899328), LowerCI = c(-0.0483421104598594, 0.562632960997295, 0.279387509581988, -1.09082876828776, 0.743691956295919, -0.0956340022071842, 0.547192396729696, 0.252589231897933, -1.10565877574796, 0.723752170976376, -0.107750781755594, 0.537662630795591, 0.248913780954145, -1.10377339886088, 0.732173356827717)), .Names = c("PercentData", "Beta", "Mean", "UpperCI", "LowerCI"), row.names = c("X1", "X2", "X3", "X4", "X5", "X1.1", "X2.1", "X3.1", "X4.1", "X5.1", "X1.2", "X2.2", "X3.2", "X4.2", "X5.2"), class = "data.frame")
head(data)
# PercentData Beta Mean UpperCI LowerCI
# X1 90Percent AAA -0.01847981 0.01138248 -0.04834211
# X2 90Percent BBB 0.57738983 0.59214670 0.56263296
# X3 90Percent CCC 0.30707989 0.33477227 0.27938751
# X4 90Percent DDD -1.04434737 -0.99786598 -1.09082877
# X5 90Percent EEE 0.76544430 0.78719664 0.74369196
# X1.1 80Percent AAA -0.03428117 0.02707167 -0.09563400
I can plot the simulation data using this code
require(ggplot2)
ggplot(data, aes(x = Beta)) +
geom_point(aes(y = Mean, color = PercentData),
position = position_dodge(0.5),
size=2.5) +
geom_errorbar(aes(ymin = LowerCI,
ymax = UpperCI,
color = PercentData),
cex = 1.25,
width = .75,
position = position_dodge(0.5))
I want to add the "truth" to the above figure. Currently, I have the truth data in a different DF, which is below.
truth <- structure(list(Est = c(-0.0178489366139546, 0.575347417798796, 0.299445933484525, -1.02862600141036, 0.767365594695577), UpperCI = c(0.486793276079609, 0.647987076085212, 0.380433141441644, -0.937511307956846, 0.837682594951183 ), LowerCI = c(-0.522491149307518, 0.502707759512379, 0.218458725527406, -1.11974069486387, 0.697048594439971), Beta = c("AAA", "BBB", "CCC", "DDD", "EEE")), .Names = c("Est", "UpperCI", "LowerCI", "Beta"), row.names = c(NA, 5L), class = "data.frame")
head(truth)
# Est UpperCI LowerCI Beta
# 1 -0.01784894 0.4867933 -0.5224911 AAA
# 2 0.57534742 0.6479871 0.5027078 BBB
# 3 0.29944593 0.3804331 0.2184587 CCC
# 4 -1.02862600 -0.9375113 -1.1197407 DDD
# 5 0.76736559 0.8376826 0.6970486 EEE
I would like to add the truth data as a line to the above figure and have provided a schematic below where the added black lines are the truth$Est values - although they are not drawn to represent the actual values.
If possible, it would be nice to also include the truth Upper and Lower CIs. Is it possible to draw two lines - one at each CI value?
I have left the truth data as a separate DF as I am not sure on the best way to format the data for the intended result. I can reformat based on comments or suggestions to have the data in a single melt() data frame.
Thanks in advance.
With a little bit of data restructuring, this becomes simple with the use of geom_segment:
all.data <- merge(data, truth, by = "Beta")
all.data$xposition <- as.numeric(factor(all.data$Beta))
ggplot(all.data, aes(x = Beta)) +
geom_point(aes(y = Mean, color = PercentData),
position = position_dodge(0.5),
size=2.5) +
geom_errorbar(aes(ymin = LowerCI.x,
ymax = UpperCI.x,
color = PercentData),
cex = 1.25,
width = .75,
position = position_dodge(0.5)) +
geom_segment(aes(y = UpperCI.y,
yend = UpperCI.y,
x = xposition - 0.5,
xend = xposition + 0.5)) +
geom_segment(aes(y = LowerCI.y,
yend = LowerCI.y,
x = xposition - 0.5,
xend = xposition + 0.5))
A few things to note:
The easiest way to add additional data with an additional geom to your plot is to include it as a separate column in your dataframe. This is no different from including the confidence interval columns for drawing errorbars.
To determine the horizontal position of the segments, you can use the numeric values of the factor of your categorical x variable. As explained by Hadley, categorical variables still have numeric position on a plot.
You can change the width of your bars by changing the value added and subtracted to x and xend (currently 0.5)
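The segments above mark the truth CIs; the question also asked for a line at truth$Est. Here is a sketch of one variation, assuming shortened, made-up versions of the two data frames (sim stands in for data, and only two betas are kept). Instead of merging, it passes truth to the segment layers directly, which draws each line once rather than once per PercentData level. It relies on factor() and the discrete x scale both ordering the Beta labels alphabetically.

```r
library(ggplot2)

# Hypothetical, shortened stand-ins for `data` and `truth`
sim <- data.frame(
  PercentData = factor(rep(c("90Percent", "80Percent"), each = 2),
                       levels = c("90Percent", "80Percent")),
  Beta    = rep(c("AAA", "BBB"), times = 2),
  Mean    = c(-0.018, 0.577, -0.034, 0.571),
  UpperCI = c(0.011, 0.592, 0.027, 0.595),
  LowerCI = c(-0.048, 0.563, -0.096, 0.547)
)
truth <- data.frame(
  Beta    = c("AAA", "BBB"),
  Est     = c(-0.018, 0.575),
  UpperCI = c(0.487, 0.648),
  LowerCI = c(-0.522, 0.503)
)

# Numeric x position of each category, matching the alphabetical order of the axis
truth$xposition <- as.numeric(factor(truth$Beta))

p <- ggplot(sim, aes(x = Beta)) +
  geom_point(aes(y = Mean, colour = PercentData),
             position = position_dodge(0.5), size = 2.5) +
  geom_errorbar(aes(ymin = LowerCI, ymax = UpperCI, colour = PercentData),
                width = 0.4, position = position_dodge(0.5)) +
  # Solid line at the true estimate
  geom_segment(data = truth, inherit.aes = FALSE,
               aes(x = xposition - 0.5, xend = xposition + 0.5, y = Est, yend = Est)) +
  # Dashed lines at the true upper and lower CI
  geom_segment(data = truth, inherit.aes = FALSE, linetype = "dashed",
               aes(x = xposition - 0.5, xend = xposition + 0.5,
                   y = UpperCI, yend = UpperCI)) +
  geom_segment(data = truth, inherit.aes = FALSE, linetype = "dashed",
               aes(x = xposition - 0.5, xend = xposition + 0.5,
                   y = LowerCI, yend = LowerCI))
```

The discrete layers come first so that the x scale is set up as discrete before the numeric segment positions are added, the same ordering as in the code above.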

Error in making a boxplot with ggplot2

I tried to make a boxplot today using ggplot2, and I encountered an error I haven't been able to solve yet. I have used a similar approach (which I actually took from an answer by user @joran) before without incident, but I must be doing something incorrectly this time.
Here is my data:
myboxplot<-structure(list(gap = structure(1:2, .Label = c("Jib", "NoJib"), class = "factor"),
Location = structure(c(4L, 4L), .Label = c("A", "B", "C",
"D"), class = "factor"), min = c(21.809, 21.081), q1 = c(25.582,
25.375), med = c(28.082, 27), q3 = c(30.142, 28.622), max = c(37.166,
39.808), lab = c(2342L, 119681L)), .Names = c("JibStat", "Location",
"min", "q1", "med", "q3", "max", "lab"), row.names = c(2L, 7L
), class = "data.frame")
The code that I have been attempting to use is as follows:
ggplot(myboxplot + aes(x=JibStat, fill=JibStat)) +
geom_boxplot(aes(lower = q1, upper = q3, middle = med, ymin = min, ymax = max), stat = "identity")
and I get the following error message:
Error in Ops.data.frame(myboxplot, aes(x = JibStat, fill = JibStat)) : list of length 2 not meaningful
I have worked on resolving the issue, but I have not been able to find much on resolving the error. My Google skills must be lacking today, but I can't think of what to search for to get help on this problem. What is it I am doing wrong here?
Additional info: R version 3.0.1, 64-bit Windows 8.
Try changing the first line to:
ggplot(myboxplot, aes(x=JibStat)) +
geom_boxplot(aes(lower = q1, upper = q3, middle = med,
ymin = min, ymax = max), stat = "identity")
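I think you typed a + where a comma should be: ggplot() takes the data frame and the aes() mapping as separate arguments, so myboxplot + aes(...) tries to add the two together and fails in Ops.data.frame.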
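If you also want to keep the fill from the original attempt, a sketch along these lines should do it. It uses the summary values from the question's dput() and leaves out the Location and lab columns for brevity.

```r
library(ggplot2)

# Precomputed five-number summaries, copied from the question's data
myboxplot <- data.frame(
  JibStat = factor(c("Jib", "NoJib")),
  min = c(21.809, 21.081),
  q1  = c(25.582, 25.375),
  med = c(28.082, 27),
  q3  = c(30.142, 28.622),
  max = c(37.166, 39.808)
)

# Data and mapping are separate arguments to ggplot(), joined by a comma;
# stat = "identity" tells geom_boxplot to use the supplied statistics as-is
p <- ggplot(myboxplot, aes(x = JibStat, fill = JibStat)) +
  geom_boxplot(aes(ymin = min, lower = q1, middle = med, upper = q3, ymax = max),
               stat = "identity")
```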
