Normalization of data within ggplot - r

I have my data as
melted.df <- structure(list(organisms = structure(c(1L, 1L, 1L, 2L, 3L, 3L,
3L, 3L, 4L, 4L, 4L, 1L, 1L, 1L, 2L, 3L, 3L, 3L, 3L, 4L, 4L, 4L,
1L, 1L, 1L, 2L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 1L, 1L, 1L, 2L, 3L,
3L, 3L, 3L, 4L, 4L, 4L), .Label = c("Botrytis cinerea", "Fusarium graminearum",
"Human", "Mus musculus"), class = "factor"), types = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L), .Label = c("AllMismatches",
"mismatchType2", "MismatchesType1", "totalDNA"), class = "factor"),
mutations = c(30501L, 12256L, 58357L, 366531L, 3475L, 186907L,
253453L, 222L, 24906L, 2775L, 247990L, 12324L, 4395L, 25324L,
77862L, 1862L, 112217L, 163117L, 100L, 17549L, 1057L, 20331L,
18177L, 7861L, 33033L, 288669L, 1613L, 74690L, 90336L, 122L,
7357L, 1718L, 227659L, 635951L, 229493L, 868052L, 2418724L,
65833L, 1081903L, 1339758L, 4318L, 59387L, 15199L, 2134229L
)), row.names = c(NA, -44L), class = "data.frame")
The value totalDNA in the types column indicates the total DNA in the data, whereas the other values are mismatch (mutation) counts. I would like to normalize this data based on the totalDNA values and plot it. The way I am plotting right now doesn't give an accurate picture of the data, because totalDNA inflates the whole y-axis and the other three types (mismatchType2, MismatchesType1 and AllMismatches) are not properly visible relative to totalDNA. What would be a better way to plot this? Should I first calculate percentages, or perhaps use log scaling? Thanks for helping me out.
ggplot(melted.df, aes(x = types, y = mutations, color = types)) +
  geom_point() +
  facet_grid(. ~ organisms) +
  xlab("Types") +
  ylab("Mismatches") +
  theme(axis.title.x = element_blank(),
        axis.text.x = element_blank(),
        axis.ticks.x = element_blank())

Try a log scale?
ggplot(melted.df, aes(x = types, y = mutations, color = types)) +
  geom_point() +
  facet_grid(. ~ organisms) +
  xlab("Types") +
  ylab("Mismatches") +
  # ylim(c(90, 130)) +
  scale_y_log10() +  # add log scale
  theme(axis.title.x = element_blank(),
        axis.text.x = element_blank(),
        axis.ticks.x = element_blank())
How would you normalise on total DNA? Would you use the (geometric) mean?
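One possible normalisation (a sketch, not a confirmed recipe from the asker): express every mismatch count as a fraction of that organism's total DNA and plot the proportions, so totalDNA no longer dominates the y-axis. The dplyr pipeline below assumes that summing the totalDNA rows within each organism is an acceptable denominator.
library(dplyr)
library(ggplot2)

normalized.df <- melted.df %>%
  group_by(organisms) %>%
  mutate(proportion = mutations / sum(mutations[types == "totalDNA"])) %>%
  ungroup() %>%
  filter(types != "totalDNA")   # drop the reference rows themselves

ggplot(normalized.df, aes(x = types, y = proportion, color = types)) +
  geom_point() +
  facet_grid(. ~ organisms) +
  ylab("Mismatches / total DNA") +
  theme(axis.title.x = element_blank(),
        axis.text.x = element_blank(),
        axis.ticks.x = element_blank())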

Related

ggplot: how to add an outline around each fill in a stacked barchart but only partly

I have the following stacked barchart produced with geom_bar.
Question: how do I add an outline around each fill, matching the corresponding color in cols? The tricky part is that the outline should not run between the individual fills, but only around the outer borders and the top of each bar (expected output below).
My current plot (image not shown) was written with:
library(ggplot2)
cols <- c("#E1B930", "#2C77BF", "#E38072", "#6DBCC3", "grey40", "black")
ggplot(i, aes(fill = uicc, x = n)) + theme_bw() +
  geom_bar(position = "stack", stat = "count") +
  scale_fill_manual(values = alpha(cols, 0.5))
Expected output (image not shown)
My data i
i <- structure(list(uicc = structure(c(4L, 4L, 4L, 4L, 3L, 4L, 4L,
4L, 4L, 3L, 4L, 3L, 4L, 2L, 4L, 4L, 4L, 1L, 4L, 4L, 2L, 4L, 4L,
3L, 4L, 4L, 4L, 4L, 3L, 2L, 4L, 4L, 3L, 3L, 3L, 3L, 1L, 3L, 4L,
4L, 2L, 4L, 4L, 4L, 4L, 4L, 4L, 1L, 4L, 4L, 4L, 4L, 4L, 4L, 3L,
4L, 1L, 3L, 1L, 4L, 4L, 3L, 1L, 2L, 1L, 3L, 3L, 3L, 4L, 3L, 4L,
4L, 3L, 4L, 3L, 3L, 3L, 2L, 2L, 4L, 3L, 4L, 2L, 1L, 1L, 4L, 4L,
4L, 4L, 1L, 4L, 4L, 4L, 4L, 4L, 4L, 2L, 3L, 3L, 4L), .Label = c("1",
"2", "3", "4"), class = "factor"), n = structure(c(4L, 4L, 4L,
4L, 2L, 1L, 4L, 1L, 4L, 2L, 4L, 2L, 4L, 1L, 4L, 5L, 2L, 1L, 1L,
5L, 1L, 1L, 2L, 2L, 1L, 4L, 3L, 4L, 2L, 1L, 5L, 2L, 2L, 2L, 2L,
2L, 1L, 2L, 3L, 2L, 1L, 4L, 2L, 1L, 4L, 1L, 4L, 1L, 2L, 2L, 2L,
4L, 1L, 4L, 2L, 4L, 1L, 1L, 1L, 4L, 1L, 2L, 1L, 1L, 1L, 2L, 2L,
1L, 1L, 2L, 4L, 1L, 1L, 4L, 2L, 2L, 2L, 1L, 1L, 3L, 2L, 5L, 1L,
1L, 1L, 4L, 4L, 4L, 5L, 1L, 4L, 4L, 1L, 4L, 2L, 1L, 1L, 2L, 2L,
4L), .Label = c("0", "1", "2", "3", "4", "5"), class = "factor")), row.names = c(NA,
100L), class = "data.frame")
Well, I found a way.
There's no easy way to draw just the "outside" lines, so the approach I used was to go ahead and draw all of them with the geom_bar call. The inner lines are then "erased" by drawing white rectangles on top of the initial geom_bar call, and finally the fill is drawn back in with a colorless color= aesthetic.
In order to draw the rectangles over the initial geom_bar call, I created a summary dataframe of i which sets the y values.
library(dplyr)
library(ggplot2)

# tally() names the count column "nn" because a column called "n" already exists
i.sum <- i %>% group_by(n) %>% tally()

ggplot(i, aes(x = n)) + theme_bw() +
  # draw the outlines
  geom_bar(position = 'stack', stat = 'count',
           aes(color = uicc), fill = NA, size = 1.5) +
  # cover over the inner lines
  geom_col(data = i.sum, aes(y = nn), fill = 'white') +
  # put the fill back in
  geom_bar(position = 'stack', stat = 'count',
           aes(fill = uicc), color = NA) +
  scale_fill_manual(values = alpha(cols, 0.5)) +
  scale_color_manual(values = cols)
Note that the size= of the color= aesthetic needs to be much higher than normal, since the white rectangle ends up covering about half the line.

Plot linear regression analysis with error bar for variability

I wanted to make plots that look like figure 1 (source: link)
In figure 1, they plot the regression analysis together with one-year yield variability. In my case, I would like to plot the variability between two locations and 4 blocks for each treatment group. So the plot I want would have three facets for the levels of variable (B.glucosidase, Protein, POX.C) and four colors for the treatment factor. Also, my current plot has a legend for both block and treatment; it should only show treatment, because block should be used to build the error bars for variability.
I tried with this code, which obviously doesn't work for what I want. (Data for df.melted included below.)
ggplot(df.melted, aes(x = value, y = yield, color = as.factor(treatment))) +
  geom_point(aes(shape = as.factor(block))) +
  stat_smooth(method = "lm", formula = y ~ x, col = "darkslategrey", se = F) +
  stat_poly_eq(formula = y ~ x,
               # aes(label = paste(..eq.label.., ..rr.label.., sep = "~~~")),
               aes(label = ..rr.label..),
               parse = TRUE) +
  theme_classic() +
  geom_errorbar(aes(ymax = df.melted$yield + sd(df.melted$yield),
                    ymin = df.melted$yield - sd(df.melted$yield)),
                width = 0.05) +
  facet_wrap(~variable)
Data:
df.melted <- structure(list(Location = structure(c(1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("M", "U"), class = "factor"),
treatment = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L,
3L, 3L, 3L, 4L, 4L, 4L, 4L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L,
3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 1L, 1L, 1L, 1L, 2L, 2L, 2L,
2L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 1L, 1L, 1L, 1L, 2L, 2L,
2L, 2L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 1L, 1L, 1L, 1L, 2L,
2L, 2L, 2L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 1L, 1L, 1L, 1L,
2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L), .Label = c("CC",
"CCS", "CS", "SCS"), class = "factor"), block = c(1L, 2L,
3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L,
2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L,
1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L,
4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L,
3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L,
2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L,
1L, 2L, 3L, 4L), yield = c(5156L, 5157L, 5551L, 5156L, 4804L,
4720L, 4757L, 5021L, 4826L, 4807L, 4475L, 4596L, 4669L, 4588L,
4542L, 4592L, 5583L, 5442L, 5693L, 5739L, 5045L, 4902L, 5006L,
5086L, 4639L, 4781L, 4934L, 4857L, 4537L, 4890L, 4842L, 4608L,
5156L, 5157L, 5551L, 5156L, 4804L, 4720L, 4757L, 5021L, 4826L,
4807L, 4475L, 4596L, 4669L, 4588L, 4542L, 4592L, 5583L, 5442L,
5693L, 5739L, 5045L, 4902L, 5006L, 5086L, 4639L, 4781L, 4934L,
4857L, 4537L, 4890L, 4842L, 4608L, 5156L, 5157L, 5551L, 5156L,
4804L, 4720L, 4757L, 5021L, 4826L, 4807L, 4475L, 4596L, 4669L,
4588L, 4542L, 4592L, 5583L, 5442L, 5693L, 5739L, 5045L, 4902L,
5006L, 5086L, 4639L, 4781L, 4934L, 4857L, 4537L, 4890L, 4842L,
4608L), variable = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), .Label = c("B.glucosidase",
"Protein", "POX.C"), class = "factor"), value = c(1.600946,
1.474084, 1.433078, 1.532492, 1.198667, 1.193193, 1.214941,
1.360981, 1.853056, 1.690117, 1.544357, 1.825132, 1.695409,
1.764123, 1.903743, 1.538684, 0.845077, 1.011463, 0.857032,
0.989803, 0.859022, 0.919467, 1.01717, 0.861689, 0.972332,
0.952922, 0.804431, 0.742634, 1.195837, 1.267285, 1.08571,
1.20097, 6212.631579, 5641.403509, 4392.280702, 7120.701754,
5305.964912, 4936.842105, 5383.157895, 6077.894737, 5769.122807,
5016.842105, 5060.350877, 5967.017544, 5576.842105, 5174.035088,
5655.438596, 5468.77193, 7933.333333, 7000, 6352.982456,
8153.684211, 6077.894737, 4939.649123, 5002.807018, 6489.122807,
4694.035088, 5901.052632, 4303.859649, 6768.421053, 6159.298246,
6090.526316, 4939.649123, 5262.45614, 810.3024, 835.5242,
856.206, 759.8589, 726.2298, 792.6472, 724.7165, 699.3266,
500.9153, 634.8698, 637.9536, 648.8814, 641.0357, 623.3822,
555.2834, 520.8119, 683.3528, 595.9173, 635.4315, 672.4234,
847.2944, 745.5665, 778.3548, 735.8141, 395.2647, 570.4148,
458.0383, 535.3851, 678.0293, 670.7419, 335.2923, 562.5674
)), row.names = c(NA, -96L), class = "data.frame")
library(dplyr)
library(ggplot2)
library(ggpmisc)
Summarize the data frame (this could also be done with stat_summary(), but it's often clearer and more transparent to do it explicitly up front). Because your data set is balanced, you could also collapse/average over the block structure first and then build the whole plot from the reduced data set; it shouldn't change the outcome of the linear regressions at all, at least not the mean values, and any statistical comparisons should probably be done on block-level summaries anyway.
df.sum <- (df.melted
%>% group_by(Location,treatment,variable)
%>% summarise(value=mean(value),yield_sd=sd(yield),
## collapse yield to mean *after* computing sd!
yield=mean(yield))
)
Plot:
(ggplot(df.melted,
aes(x = value, y = yield, color = treatment))
+ stat_smooth(method = "lm", col = "darkslategrey", se=FALSE)
+ stat_poly_eq(
formula = y ~ x,
## aes(label = paste(..eq.label.., ..rr.label.., sep = "~~~")),
aes(group=1, label = ..rr.label..),
parse = TRUE)
+ theme_classic()
+ scale_shape(guide=FALSE)
+ geom_point(data=df.sum)
+ geom_errorbar(data=df.sum,
aes(ymax = yield+yield_sd, ymin = yield-yield_sd),
width = 0.05)
+ facet_wrap(~variable,scale="free_x")
)
(adding group=1 to the stat_poly_eq() aesthetics means we only plot a single R^2 value per facet)
Since you're no longer using the shape aesthetic for anything, you could consider using it to show the Location variable ...
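For example, a minimal sketch (reusing df.sum and df.melted from above) that maps Location to shape:
ggplot(df.melted, aes(x = value, y = yield, color = treatment)) +
  stat_smooth(method = "lm", formula = y ~ x, col = "darkslategrey", se = FALSE) +
  geom_point(data = df.sum, aes(shape = Location)) +   # Location now shown by shape
  geom_errorbar(data = df.sum,
                aes(ymax = yield + yield_sd, ymin = yield - yield_sd),
                width = 0.05) +
  theme_classic() +
  facet_wrap(~variable, scale = "free_x")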

interpret estimated marginal means (emmeans, aka lsmeans): negative response values

I am working on a model with lmer, and I would like to get estimated marginal means with the emmeans library. This is my dataframe:
df <- structure(list(treatment = structure(c(1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L), .Label = c("CCF", "UN"), class = "factor"), level = structure(c(2L,
3L, 4L, 2L, 3L, 4L, 2L, 3L, 4L, 2L, 3L, 4L, 2L, 3L, 4L, 2L, 3L,
4L, 2L, 3L, 4L, 2L, 3L, 4L, 2L, 3L, 4L, 2L, 3L, 4L, 2L, 3L, 4L,
2L, 3L, 4L, 2L, 3L, 4L, 2L, 3L, 4L, 2L, 3L, 4L, 2L, 3L, 4L, 2L,
3L, 4L, 2L, 3L, 4L, 2L, 3L, 4L, 2L, 3L, 4L, 2L, 3L, 4L, 2L, 3L,
4L, 2L, 3L, 4L, 2L, 3L, 4L, 2L, 3L, 4L, 2L, 3L, 4L, 2L, 3L, 4L,
2L, 3L, 4L, 2L, 3L, 4L, 2L, 3L, 4L, 2L, 3L, 4L, 2L, 3L, 4L, 2L,
3L, 4L, 2L, 3L, 4L, 2L, 3L, 4L), .Label = c("A", "F", "H", "L"
), class = "factor"), random = structure(c(3L, 3L, 3L, 1L, 1L,
1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 3L, 3L, 3L, 4L,
4L, 4L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L,
5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L,
5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L,
5L, 5L, 5L, 5L), .Label = c("1.6", "2", "3.2", "5", NA), class = "factor"),
continuous = c(72.7951770264767, 149.373765810534, 1.64153094886205,
54.6697408615215, 25.5801835808851, 1.45794117443253, 25.3660934894788,
91.2321704897132, 2.75353217433675, 44.1995276851725, 33.1854545470435,
5.36536076058866, 29.6807620242672, 80.6077496067764, 0.833434180091457,
13.6789475327185, 77.4930412025109, 3.65998714174906, 25.2848344605563,
136.632099849828, 2.56715261161435, 28.6733878840584, 66.800616194317,
1.37475468782539, 23.007491380183, 84.980285774607, 1.13569710795522,
33.8610875632139, 56.1234827517798, 1.32327007970416, 60.0843812879313,
43.4487832450889, 1.14942423621912, 53.6673704529947, 146.746167255051,
3.91593723271292, 27.0321687961004, 89.5925729244878, 1.47707078226047,
44.0523211310831, 115.087908243373, 1.94039630728038, 86.4074806697431,
43.3266206881612, 2.81456503996437, 66.868588961071, 229.797526052566,
1.07971524769264, 30.3390107111747, 116.680801084036, 1.67711446647817,
69.0961010697534, 78.5454363192614, 1.92137892126384, 53.5708546850303,
37.7175476710608, 1.96087397451467, 25.5166981770257, 37.3755071788757,
2.21602000526086, 10.3266195584378, 38.1458490762217, 2.7508022340832,
44.5864920143771, 8.45382647692274, 2.63204944520792, 87.5376946978593,
27.2354119098268, 3.38134648323956, 26.8815471706502, 14.5539972194568,
2.0556994322415, 27.4619977737491, 32.8546665896602, 2.66809379088059,
42.3815445857533, 21.3359802201685, 2.19167325121191, 53.3189825439001,
13.5708790223439, 2.22274607227071, 88.297423835906, 8.50554349658773,
3.5764241495006, 29.284865737912, 21.1213079519954, 2.3070166819956,
10.7659615128225, 33.4813413290485, 2.49896565066211, 59.0935696616465,
13.2863515051715, 4.36424795471221, 72.1627847396763, 9.09326343200557,
2.13701784901259, 27.5824079679471, 8.84486812842272, 1.98293342019671,
17.5321126287485, 19.1806349705231, 5.03952187899644, 58.3473975730234,
9.17287686145614, 2.99575072457674)), class = "data.frame", row.names = c(NA,
105L))
This is my model:
library(lme4)
model <- lmer((continuous) ~ treatment + level + (1|random), data= df, REML = TRUE)
The data as it is does not meet the model assumptions, but I am still wondering why I get a negative estimated marginal mean (response) for treatment "UN" at level "L" (see the lettering table) when there are no negative values in df$continuous.
library(multcompView)
library(emmeans)
lsm.mixed_C <- emmeans::emmeans(model, pairwise ~ treatment * level, type = "response")
lettering <- CLD(lsm.mixed_C, alpha = 0.05, Letters = letters,
                 adjust = "tukey")
The short answer is because you badly need to include the interaction in your model. Compare:
model2 <- lmer((continuous) ~ treatment * level + (1|random),
data= df, REML = TRUE)
emmip(model2, treatment ~ level)
with:
emmip(model, treatment ~ level)
In model2, both EMMs at level L are close to zero. If you remove the interaction from the model, you force those two profiles to be parallel, while maintaining a sizeable positive difference between treatments CCF and UN, forcing the estimate for UN to go negative. In actual fact, though, all six estimates for treatment x level combinations are seriously distorted.
I can't repeat it enough. emmeans() summarizes a model. If you give it a bad model, you get dumb results. Thanks for the great illustration of this point.
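To see the numbers behind those interaction plots, a minimal sketch (assuming model and model2 from above) is to print the EMMs from both fits:
library(emmeans)

# additive model: the EMM for UN at level L is forced negative
emmeans(model, ~ treatment * level)

# interaction model: the EMMs at level L stay near zero instead
emmeans(model2, ~ treatment * level)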

How to use facet_grid correctly in ggplot2?

I'm trying to generate one chart per profile with the following code, but I keep getting "At least one layer must contain all variables used for facetting." errors. I spent the last few hours trying to make it work but I couldn't.
I believe the answer must be simple; can anyone help?
d = structure(list(category = structure(c(2L, 2L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 3L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 3L, 3L, 3L, 3L,
3L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L), .Label = c("4X4",
"HATCH", "SEDAN"), class = "factor"), profile = structure(c(1L,
1L, 1L, 1L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L,
1L), .Label = c("FIXED", "FREE", "MOBILE"), class = "factor"),
value = c(6440.32, 6287.22, 9324, 7532, 7287.63, 6827.27,
6880.48, 7795.15, 7042.51, 2708.41, 1373.69, 6742.87, 7692.65,
7692.65, 8116.56, 7692.65, 7692.65, 7692.65, 7962.65, 8116.56,
5691.12, 2434, 8343, 7727.73, 7692.65, 7721.15, 1944.38,
6044.23, 8633.65, 7692.65, 7692.65, 8151.65, 7692.65, 7692.65,
2708.41, 3271.45, 3333.82, 1257.48, 6223.13, 7692.65, 6955.46,
7115.46, 7115.46, 7115.46, 7115.46, 6955.46, 7615.46, 2621.21,
2621.21, 445.61)), .Names = c("category", "profile", "value"
), class = "data.frame", row.names = c(NA, -50L))
library(ggplot2)
p = ggplot(d, aes(x=d$value, fill=d$category)) + geom_density(alpha=.3)
p + facet_grid(d$profile ~ .)
Your problem comes from referring to the variables explicitly (i.e. d$profile) rather than letting them be looked up in the data argument passed to ggplot. There is no need for d$ anywhere.
When faceting with facet_grid or facet_wrap, you need to refer to the columns by their bare names; it is also good practice to do so in calls to aes:
p <- ggplot(d, aes(x = value, fill = category)) + geom_density(alpha = .3)
p + facet_grid(profile ~ .)

Automatically adjusting ylim with stat_summary

ggplot2 adjusts the ylim automatically for the data points. Is there any way to have the ylim adjusted for the stat_summary values too?
df <- structure(list(Varieties = structure(c(2L, 3L, 4L, 1L, 2L, 3L,
4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L,
4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L,
4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L,
4L, 1L, 2L, 3L, 4L, 1L), .Label = c("F9917", "Hegari", "JS263",
"JS2002"), class = "factor"), Priming = structure(c(2L, 2L, 2L,
2L, 5L, 5L, 5L, 5L, 3L, 3L, 3L, 3L, 1L, 1L, 1L, 1L, 4L, 4L, 4L,
4L, 2L, 2L, 2L, 2L, 5L, 5L, 5L, 5L, 3L, 3L, 3L, 3L, 1L, 1L, 1L,
1L, 4L, 4L, 4L, 4L, 2L, 2L, 2L, 2L, 5L, 5L, 5L, 5L, 3L, 3L, 3L,
3L, 1L, 1L, 1L, 1L, 4L, 4L, 4L, 4L), .Label = c("CaCl2", "Dry",
"Hydropriming", "KNO3", "OnFarmpriming"), class = "factor"),
PH = c(225.8, 224.26, 228.9, 215.82, 230.3, 227.7, 232.8,
221.1, 260.2, 230.8, 236.75, 230.5, 250.56, 230.74, 240.64,
226.7, 268.4, 233.4, 243.33, 232.7, 252.04, 233.1, 237.14,
220.6, 265.55, 234.93, 240.04, 218.21, 300.55, 245, 243.5,
234.65, 253.3, 233.5, 238.62, 225.93, 255.74, 233.64, 238.1,
230.93, 246, 240.33, 246.08, 221.7, 250.54, 242.87, 251,
225.32, 251.47, 245.4, 266.74, 227.73, 290.62, 246.68, 256.4,
225.83, 282.67, 240.58, 258.35, 235.87)), .Names = c("Varieties",
"Priming", "PH"), class = "data.frame", row.names = c(NA, 60L
))
p1 <- ggplot(data = df, aes(x = Varieties, y = PH, group = Priming, shape = Priming, colour = Priming)) +
  stat_summary(fun.y = mean, geom = "point", size = 2,
               aes(group = Priming, shape = Priming, colour = Priming)) +
  theme_bw()
p1 <- p1 + stat_summary(fun.y = mean, geom = "line",
                        aes(group = Priming, shape = Priming, colour = Priming))
print(p1)
See extra space in ylim for stat_summary values. Thanks in advance for your help and time.
Here is one approach, using plyr to prep the data before plotting:
library(plyr)
library(ggplot2)

# add a per-group mean column, then plot the means instead of the raw PH values
df <- ddply(df, .(Varieties, Priming), transform, meanPH = mean(PH))

ggplot(df, aes(Varieties, meanPH)) +
  geom_point() +
  geom_line(aes(group = Priming, color = Priming))
The current "official" answer for 0.8.9 is, I believe, that you can't, at least not automatically, and not without preprocessing the data as Ramnath indicates. Most people asking this question, or some variant of it, are pointed towards setting the limits manually using coord_cartesian.
The reason stat_summary behaves this way is that it sort of assumes that you aren't going to just plot the summaries, but at least some of the underlying data as well, so it sets up the plotting area using the underlying data frame.
However, I found this thread on the ggplot2 list that suggests this behavior might change in the upcoming 0.9.0 release. (The thread is a little vague, but I read it as implying that in the next version, if the only layer you add is from stat_summary, then the plot limits will be calculated based on the summaries, not the original data. I could be wrong, though.)
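In the meantime, a hedged sketch of the manual workaround on 0.8.9: compute the range of the group means yourself and hand it to coord_cartesian().
# zoom the y-axis to the range of the per-group means (plus a little padding)
mean_range <- range(tapply(df$PH, list(df$Varieties, df$Priming), mean))
p1 + coord_cartesian(ylim = mean_range + c(-5, 5))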
