Automatically adjusting ylim with stat_summary


ggplot2 adjusts the ylim automatically to fit the data points. Is there any way to make it adjust the ylim to the values computed by stat_summary as well?
df <- structure(list(Varieties = structure(c(2L, 3L, 4L, 1L, 2L, 3L,
4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L,
4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L,
4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L,
4L, 1L, 2L, 3L, 4L, 1L), .Label = c("F9917", "Hegari", "JS263",
"JS2002"), class = "factor"), Priming = structure(c(2L, 2L, 2L,
2L, 5L, 5L, 5L, 5L, 3L, 3L, 3L, 3L, 1L, 1L, 1L, 1L, 4L, 4L, 4L,
4L, 2L, 2L, 2L, 2L, 5L, 5L, 5L, 5L, 3L, 3L, 3L, 3L, 1L, 1L, 1L,
1L, 4L, 4L, 4L, 4L, 2L, 2L, 2L, 2L, 5L, 5L, 5L, 5L, 3L, 3L, 3L,
3L, 1L, 1L, 1L, 1L, 4L, 4L, 4L, 4L), .Label = c("CaCl2", "Dry",
"Hydropriming", "KNO3", "OnFarmpriming"), class = "factor"),
PH = c(225.8, 224.26, 228.9, 215.82, 230.3, 227.7, 232.8,
221.1, 260.2, 230.8, 236.75, 230.5, 250.56, 230.74, 240.64,
226.7, 268.4, 233.4, 243.33, 232.7, 252.04, 233.1, 237.14,
220.6, 265.55, 234.93, 240.04, 218.21, 300.55, 245, 243.5,
234.65, 253.3, 233.5, 238.62, 225.93, 255.74, 233.64, 238.1,
230.93, 246, 240.33, 246.08, 221.7, 250.54, 242.87, 251,
225.32, 251.47, 245.4, 266.74, 227.73, 290.62, 246.68, 256.4,
225.83, 282.67, 240.58, 258.35, 235.87)), .Names = c("Varieties",
"Priming", "PH"), class = "data.frame", row.names = c(NA, 60L
))
library(ggplot2)
p1 <- ggplot(data=df, aes(x=Varieties, y=PH, group=Priming, shape=Priming, colour=Priming))+
  stat_summary(fun.y=mean, geom="point", size=2, aes(group=Priming, shape=Priming, colour=Priming))+
  theme_bw()
p1 <- p1 + stat_summary(fun.y=mean, geom="line", aes(group=Priming, shape=Priming, colour=Priming))
print(p1)
Note the extra space in the y-axis range above and below the stat_summary values. Thanks in advance for your help and time.

Here is one approach, using plyr to prep the data before plotting:
library(plyr)
# add the per-group mean as a new column, then plot that column directly
df <- ddply(df, .(Varieties, Priming), transform, meanPH = mean(PH))
ggplot(df, aes(Varieties, meanPH)) +
  geom_point() +
  geom_line(aes(group = Priming, color = Priming))

The current "official" answer for 0.8.9 is, I believe, that you can't, at least not automatically, and not without preprocessing the data as Ramnath indicates. Most people asking this question, or some variant of it, are pointed towards setting the limits manually using coord_cartesian.
The reason stat_summary behaves this way is that it sort of assumes that you aren't going to just plot the summaries, but at least some of the underlying data as well, so it sets up the plotting area using the underlying data frame.
However, I found this thread on the ggplot2 list that suggests this behavior might change in the upcoming 0.9.0 release. (The thread is a little vague, but I read it as implying that in the next version, if the only layer you add is from stat_summary, then the plot limits will be calculated based on the summaries, not the original data. I could be wrong though.)
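As a rough sketch of the manual route mentioned above, the limits can be derived from the summaries themselves and handed to coord_cartesian. The toy data frame and the object names (toy, means) are made up for illustration, and fun = mean assumes ggplot2 3.3.0 or later, where fun replaced fun.y.

```r
library(ggplot2)

# Hypothetical toy data standing in for the question's df
set.seed(1)
toy <- data.frame(
  Varieties = rep(c("A", "B", "C"), each = 10),
  Priming   = rep(c("Dry", "KNO3"), times = 15),
  PH        = c(rnorm(10, 220, 15), rnorm(10, 240, 15), rnorm(10, 260, 15))
)

# Group means computed up front with base R, used only to find the y range
means <- aggregate(PH ~ Varieties + Priming, data = toy, FUN = mean)

# Same stat_summary layers as in the question, zoomed to the range of the means
p <- ggplot(toy, aes(Varieties, PH, group = Priming, colour = Priming)) +
  stat_summary(fun = mean, geom = "point", size = 2) +
  stat_summary(fun = mean, geom = "line") +
  coord_cartesian(ylim = range(means$PH)) +
  theme_bw()
```

coord_cartesian only zooms the view, so the summaries are still computed from all of the raw data.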

Related

ggplot: how to add an outline around each fill in a stacked barchart but only partly

I have the following stacked barchart produced with geom_bar.
Question: how can I add an outline around each fill, in the matching color from cols? The tricky part is that the outline should not run between the fills, only around the outer borders and the top (expected output below).
I have
Written with
library(ggplot2)
cols = c("#E1B930", "#2C77BF","#E38072","#6DBCC3", "grey40","black")
ggplot(i, aes(fill=uicc, x=n)) + theme_bw() +
geom_bar(position="stack", stat="count") +
scale_fill_manual(values=alpha(cols,0.5))
Expected output
My data i
i <- structure(list(uicc = structure(c(4L, 4L, 4L, 4L, 3L, 4L, 4L,
4L, 4L, 3L, 4L, 3L, 4L, 2L, 4L, 4L, 4L, 1L, 4L, 4L, 2L, 4L, 4L,
3L, 4L, 4L, 4L, 4L, 3L, 2L, 4L, 4L, 3L, 3L, 3L, 3L, 1L, 3L, 4L,
4L, 2L, 4L, 4L, 4L, 4L, 4L, 4L, 1L, 4L, 4L, 4L, 4L, 4L, 4L, 3L,
4L, 1L, 3L, 1L, 4L, 4L, 3L, 1L, 2L, 1L, 3L, 3L, 3L, 4L, 3L, 4L,
4L, 3L, 4L, 3L, 3L, 3L, 2L, 2L, 4L, 3L, 4L, 2L, 1L, 1L, 4L, 4L,
4L, 4L, 1L, 4L, 4L, 4L, 4L, 4L, 4L, 2L, 3L, 3L, 4L), .Label = c("1",
"2", "3", "4"), class = "factor"), n = structure(c(4L, 4L, 4L,
4L, 2L, 1L, 4L, 1L, 4L, 2L, 4L, 2L, 4L, 1L, 4L, 5L, 2L, 1L, 1L,
5L, 1L, 1L, 2L, 2L, 1L, 4L, 3L, 4L, 2L, 1L, 5L, 2L, 2L, 2L, 2L,
2L, 1L, 2L, 3L, 2L, 1L, 4L, 2L, 1L, 4L, 1L, 4L, 1L, 2L, 2L, 2L,
4L, 1L, 4L, 2L, 4L, 1L, 1L, 1L, 4L, 1L, 2L, 1L, 1L, 1L, 2L, 2L,
1L, 1L, 2L, 4L, 1L, 1L, 4L, 2L, 2L, 2L, 1L, 1L, 3L, 2L, 5L, 1L,
1L, 1L, 4L, 4L, 4L, 5L, 1L, 4L, 4L, 1L, 4L, 2L, 1L, 1L, 2L, 2L,
4L), .Label = c("0", "1", "2", "3", "4", "5"), class = "factor")), row.names = c(NA,
100L), class = "data.frame")

Well, I found a way.
There's no easy way to draw just the "outside" lines, so the approach I used was to go ahead and draw all of them with the geom_bar call. The inner lines are "erased" by drawing white rectangles on top of the initial geom_bar call, and then the fill is drawn back in with color=NA so that no outline is added.
In order to draw the rectangles over the initial geom_bar call, I created a summary dataframe of i which sets the y values.
library(dplyr)
# tally() names the count column nn here because n is already a column in i
i.sum <- i %>% group_by(n) %>% tally()
ggplot(i, aes(x=n)) + theme_bw() +
# draw lines
geom_bar(position='stack', stat='count',
aes(color=uicc), fill=NA, size=1.5) +
# cover over those inner lines
geom_col(data=i.sum, aes(y=nn), fill='white') +
# put back in the fill
geom_bar(position='stack', stat='count',
aes(fill=uicc), color=NA) +
scale_fill_manual(values=alpha(cols,0.5)) +
scale_color_manual(values=cols)
Note that the size= of the color= aesthetic needs to be much higher than normal, since the white rectangle ends up covering about half the line.
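The white "cover" layer only needs one total per bar. A minimal base R sketch of such a summary frame is shown here on a made-up toy data frame; the names toy_i, bar_totals and total are hypothetical, and naming the count column explicitly avoids depending on the automatic nn name.

```r
# Hypothetical toy data with the same two factor columns as i
toy_i <- data.frame(
  uicc = factor(c("1", "2", "2", "3", "4", "4")),
  n    = factor(c("0", "0", "1", "1", "1", "2"))
)

# One row per x position, with the total bar height in an explicitly named column
bar_totals <- as.data.frame(table(n = toy_i$n), responseName = "total")
print(bar_totals)
```

A frame like this could then be passed to the cover layer as geom_col(data = bar_totals, aes(x = n, y = total), fill = 'white').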

Visualising Categorical Data across a Time Frame

I'm still fairly new to R and have stepped away for a while, so please bear with me.
I have a set of data which describes the degree of mobility (categorical data) after an operation across 3 days. I have been looking for a way to demonstrate the flow across those 3 days.
I've tried using geom_jitter with x and y being Day 1 and 2, and aes(colour) being Day 3, but this doesn't really convey what I want to show. I've done some reading about Sankey diagrams and parallel coordinates, but I don't understand them well enough to adapt the examples posted by others to my data.
This is what I've tried:
test %>% filter(!is.na(Mob_D1.factor) & !is.na(Mob_D2.factor) & !is.na(Mob_D3.factor)) %>%
ggplot(aes(x = Mob_D1.factor, y = Mob_D2.factor, colour = Mob_D3.factor)) +
geom_jitter(size = 5, alpha = 0.25, height = 0.25, width = 0.2) +
scale_colour_brewer(palette = "Dark2", name = "Mobilisation on Day 3") +
xlab("Mobilisation on Day 1") +
ylab("Mobilisation on Day 2") + theme_minimal()
As I said, not quite what I want.
This is a sample of the data:
structure(list(Mob_D1.factor = structure(c(2L, 2L, 2L, 2L, 4L,
1L, 2L, 2L, 1L, 4L, 2L, 4L, 2L, 1L, 2L, 4L, 4L, 2L, 4L, 4L, 2L,
4L, 2L, 2L, 4L, 2L, 1L, 4L, 4L, 3L, 4L, 2L, 3L, 2L, 2L, 2L, 2L,
2L, 4L, 4L, 2L, 4L, 4L, 2L, 2L, 4L, 2L, 4L, 4L, 4L), .Label = c("None",
"Bed", "Stand", "Assisted Walk"), class = "factor"), Mob_D2.factor = structure(c(2L,
3L, 2L, 4L, 4L, 1L, 3L, 4L, 4L, 4L, 3L, 4L, 2L, 2L, 2L, 4L, 4L,
4L, 4L, 4L, 1L, 4L, 2L, 2L, 4L, 2L, 1L, 4L, 4L, 4L, 4L, 2L, 3L,
2L, 2L, 2L, 4L, 4L, 2L, 4L, 3L, 4L, 4L, 2L, 2L, 4L, 4L, 4L, 4L,
4L), .Label = c("None", "Bed", "Stand", "Assisted Walk"), class = "factor"),
Mob_D3.factor = structure(c(2L, 3L, 2L, 4L, 4L, 1L, 4L, 4L,
4L, 4L, 4L, 4L, 4L, 2L, 4L, 4L, 4L, 4L, 4L, 4L, 2L, 4L, 2L,
2L, 4L, 4L, 1L, 4L, 4L, 4L, 4L, 2L, 4L, 4L, 2L, 2L, 4L, 4L,
3L, 4L, 4L, 4L, 4L, 2L, 4L, 4L, 4L, 4L, 4L, 4L), .Label = c("None",
"Bed", "Stand", "Assisted Walk"), class = "factor")), row.names = c(NA,
-50L), class = c("tbl_df", "tbl", "data.frame"))
Thanks in advance to anyone who takes the time to reply. Any extended explanation would be appreciated as I am still learning.
Larry

I am not entirely sure what the expected result should be, but could a barplot be helpful?
Edit
I now think I understand what you need and I found the package ggalluvial that can help you with this.
Hope this helps.
library(tidyverse)
library(ggalluvial)
# Some data wrangling first. df is the data frame from the question (called test there).
# Add row_number to give a unique ID for each patient
d <- df %>% mutate(Patient = row_number()) %>%
  # transform it to longer format
  pivot_longer(cols = -Patient, values_to = "Stage", names_to = "Day")
# Make the plot
ggplot(d,
aes(x = Day, stratum = Stage, alluvium = Patient,
fill = Stage, label = Stage)) +
scale_fill_brewer(type = "qual", palette = "Set2") +
geom_flow(stat = "alluvium", lode.guidance = "frontback",
color = "darkgray") +
geom_stratum()
Created on 2020-02-24 by the reprex package (v0.3.0)
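To make the reshaping step easier to follow, here is a small sketch of what it produces on a three-patient toy table. The toy values and the names levs, toy and long are invented for illustration; the column names mirror those in the question.

```r
library(dplyr)
library(tidyr)

levs <- c("None", "Bed", "Stand", "Assisted Walk")

# Hypothetical three-patient stand-in for the question's data (one column per day)
toy <- tibble(
  Mob_D1.factor = factor(c("Bed", "None", "Stand"), levels = levs),
  Mob_D2.factor = factor(c("Stand", "Bed", "Assisted Walk"), levels = levs),
  Mob_D3.factor = factor(c("Assisted Walk", "Bed", "Assisted Walk"), levels = levs)
)

# One row per patient and day: the long layout that alluvium/stratum aesthetics expect
long <- toy %>%
  mutate(Patient = row_number()) %>%
  pivot_longer(cols = -Patient, names_to = "Day", values_to = "Stage")

print(long)
```

Each patient now appears once per day, so a single Patient id can be followed from Day 1 through Day 3.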

Normalization of data within ggplot

I have my data as
melted.df <- structure(list(organisms = structure(c(1L, 1L, 1L, 2L, 3L, 3L,
3L, 3L, 4L, 4L, 4L, 1L, 1L, 1L, 2L, 3L, 3L, 3L, 3L, 4L, 4L, 4L,
1L, 1L, 1L, 2L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 1L, 1L, 1L, 2L, 3L,
3L, 3L, 3L, 4L, 4L, 4L), .Label = c("Botrytis cinerea", "Fusarium graminearum",
"Human", "Mus musculus"), class = "factor"), types = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L), .Label = c("AllMismatches",
"mismatchType2", "MismatchesType1", "totalDNA"), class = "factor"),
mutations = c(30501L, 12256L, 58357L, 366531L, 3475L, 186907L,
253453L, 222L, 24906L, 2775L, 247990L, 12324L, 4395L, 25324L,
77862L, 1862L, 112217L, 163117L, 100L, 17549L, 1057L, 20331L,
18177L, 7861L, 33033L, 288669L, 1613L, 74690L, 90336L, 122L,
7357L, 1718L, 227659L, 635951L, 229493L, 868052L, 2418724L,
65833L, 1081903L, 1339758L, 4318L, 59387L, 15199L, 2134229L
)), row.names = c(NA, -44L), class = "data.frame")
The value totalDNA in the types column indicates the total DNA in the data, whereas the mismatch types are the mutations. I would like to normalize this data based on the totalDNA values and plot it. The way I am plotting right now doesn't give an accurate picture of the data, because totalDNA inflates the whole y-axis and the other three types (mismatchType2, MismatchesType1 and AllMismatches) are not properly visible next to totalDNA. What would be a better way to plot this? Should I first calculate the percentage? Or perhaps use log scaling? Thanks for helping me out.
ggplot(melted.df, aes(x = types, y = mutations, color=types)) +
geom_point()+
facet_grid(.~organisms)+
xlab("Types")+
ylab("Mismatches")+
theme(axis.title.x=element_blank(),
axis.text.x=element_blank(),
axis.ticks.x=element_blank())

Try a log scale?
ggplot(melted.df, aes(x = types, y = mutations, color=types)) +
geom_point()+
facet_grid(.~organisms)+
xlab("Types")+
ylab("Mismatches")+
# ylim(c(90,130))+
scale_y_log10()+ #add log scale
theme(axis.title.x=element_blank(),
axis.text.x=element_blank(),
axis.ticks.x=element_blank())
How would you normalise on total DNA? Would you use the (geometric) mean?
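If a percentage turns out to be the right normalisation, one possible sketch is to divide each mismatch count by the totalDNA value of the same sample. This assumes a sample identifier is available to pair the rows; the question's data has no such column, so the sample column and all values here are hypothetical.

```r
# Hypothetical toy data: two samples, each with two mismatch types and a totalDNA row
toy <- data.frame(
  sample    = rep(c("s1", "s2"), each = 3),
  types     = rep(c("AllMismatches", "mismatchType2", "totalDNA"), times = 2),
  mutations = c(30, 12, 600, 50, 25, 1000)
)

# Pull out the totalDNA rows as a per-sample denominator
totals <- toy[toy$types == "totalDNA", c("sample", "mutations")]
names(totals)[2] <- "total"

# Attach the denominator to the mismatch rows and express each as a percentage
norm <- merge(toy[toy$types != "totalDNA", ], totals, by = "sample")
norm$percent <- 100 * norm$mutations / norm$total
print(norm)
```

Plotting percent instead of mutations keeps the three mismatch types on a comparable scale without the totalDNA rows stretching the axis.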

interpret estimated marginal means (emmeans aka lsmeans): negative response values

I am working on a model with lmer where I would like to get estimated marginal means with the emmeans library. This is my dataframe:
df <- structure(list(treatment = structure(c(1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L), .Label = c("CCF", "UN"), class = "factor"), level = structure(c(2L,
3L, 4L, 2L, 3L, 4L, 2L, 3L, 4L, 2L, 3L, 4L, 2L, 3L, 4L, 2L, 3L,
4L, 2L, 3L, 4L, 2L, 3L, 4L, 2L, 3L, 4L, 2L, 3L, 4L, 2L, 3L, 4L,
2L, 3L, 4L, 2L, 3L, 4L, 2L, 3L, 4L, 2L, 3L, 4L, 2L, 3L, 4L, 2L,
3L, 4L, 2L, 3L, 4L, 2L, 3L, 4L, 2L, 3L, 4L, 2L, 3L, 4L, 2L, 3L,
4L, 2L, 3L, 4L, 2L, 3L, 4L, 2L, 3L, 4L, 2L, 3L, 4L, 2L, 3L, 4L,
2L, 3L, 4L, 2L, 3L, 4L, 2L, 3L, 4L, 2L, 3L, 4L, 2L, 3L, 4L, 2L,
3L, 4L, 2L, 3L, 4L, 2L, 3L, 4L), .Label = c("A", "F", "H", "L"
), class = "factor"), random = structure(c(3L, 3L, 3L, 1L, 1L,
1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 3L, 3L, 3L, 4L,
4L, 4L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L,
5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L,
5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L,
5L, 5L, 5L, 5L), .Label = c("1.6", "2", "3.2", "5", NA), class = "factor"),
continuous = c(72.7951770264767, 149.373765810534, 1.64153094886205,
54.6697408615215, 25.5801835808851, 1.45794117443253, 25.3660934894788,
91.2321704897132, 2.75353217433675, 44.1995276851725, 33.1854545470435,
5.36536076058866, 29.6807620242672, 80.6077496067764, 0.833434180091457,
13.6789475327185, 77.4930412025109, 3.65998714174906, 25.2848344605563,
136.632099849828, 2.56715261161435, 28.6733878840584, 66.800616194317,
1.37475468782539, 23.007491380183, 84.980285774607, 1.13569710795522,
33.8610875632139, 56.1234827517798, 1.32327007970416, 60.0843812879313,
43.4487832450889, 1.14942423621912, 53.6673704529947, 146.746167255051,
3.91593723271292, 27.0321687961004, 89.5925729244878, 1.47707078226047,
44.0523211310831, 115.087908243373, 1.94039630728038, 86.4074806697431,
43.3266206881612, 2.81456503996437, 66.868588961071, 229.797526052566,
1.07971524769264, 30.3390107111747, 116.680801084036, 1.67711446647817,
69.0961010697534, 78.5454363192614, 1.92137892126384, 53.5708546850303,
37.7175476710608, 1.96087397451467, 25.5166981770257, 37.3755071788757,
2.21602000526086, 10.3266195584378, 38.1458490762217, 2.7508022340832,
44.5864920143771, 8.45382647692274, 2.63204944520792, 87.5376946978593,
27.2354119098268, 3.38134648323956, 26.8815471706502, 14.5539972194568,
2.0556994322415, 27.4619977737491, 32.8546665896602, 2.66809379088059,
42.3815445857533, 21.3359802201685, 2.19167325121191, 53.3189825439001,
13.5708790223439, 2.22274607227071, 88.297423835906, 8.50554349658773,
3.5764241495006, 29.284865737912, 21.1213079519954, 2.3070166819956,
10.7659615128225, 33.4813413290485, 2.49896565066211, 59.0935696616465,
13.2863515051715, 4.36424795471221, 72.1627847396763, 9.09326343200557,
2.13701784901259, 27.5824079679471, 8.84486812842272, 1.98293342019671,
17.5321126287485, 19.1806349705231, 5.03952187899644, 58.3473975730234,
9.17287686145614, 2.99575072457674)), class = "data.frame", row.names = c(NA,
105L))
This is my model:
library(lme4)
model <- lmer((continuous) ~ treatment + level + (1|random), data= df, REML = TRUE)
The data as it is does not meet the model assumptions, but still I am wondering why I get a negative estimated marginal mean (response) on treatment "UN" level "L" (see lettering table) when I don't have any negative numbers in df$continuous?
library(multcompView)
library(emmeans)
lsm.mixed_C <- emmeans::emmeans(model, pairwise ~ treatment * level, type="response")
lettering <- CLD(lsm.mixed_C,alpha=0.05,Letters=letters,
adjust= "tukey")

The short answer is because you badly need to include the interaction in your model. Compare:
model2 <- lmer((continuous) ~ treatment * level + (1|random),
data= df, REML = TRUE)
emmip(model2, treatment ~ level)
with:
emmip(model, treatment ~ level)
In model2, both EMMs at level L are close to zero. If you remove the interaction from the model, you force those two profiles to be parallel, while maintaining a sizeable positive difference between treatments CCF and UN, forcing the estimate for UN to go negative. In actual fact, though, all six estimates for treatment x level combinations are seriously distorted.
I can't repeat it enough. emmeans() summarizes a model. If you give it a bad model, you get dumb results. Thanks for the great illustration of this point.
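The effect can be reproduced with a plain lm on a tiny balanced toy data set, without any random effect. The numbers below are invented to mimic the pattern (a large treatment gap at one level, none at the other); they are not taken from the question's data.

```r
# Hypothetical balanced toy data: big CCF/UN gap at level H, no gap at level L
toy <- data.frame(
  treatment = factor(rep(c("CCF", "UN"), each = 4)),
  level     = factor(rep(rep(c("H", "L"), each = 2), times = 2)),
  y         = c(98, 102, 1, 3, 19, 21, 1, 3)
)

additive <- lm(y ~ treatment + level, data = toy)   # forces parallel profiles
interact <- lm(y ~ treatment * level, data = toy)   # lets each cell have its own mean

# Predicted cell means under both models
cells <- expand.grid(treatment = levels(toy$treatment), level = levels(toy$level))
cells$additive    <- predict(additive, newdata = cells)
cells$interaction <- predict(interact, newdata = cells)
print(cells)
# UN at level L: additive model predicts -18, interaction model predicts 2
```

All responses are positive, yet the additive model has to carry the average treatment gap down to level L, which pushes the UN estimate below zero.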

How to plot the amount of basket sizes in each day?

Here's the top 50 records of my data:
structure(list(Day = structure(c(2L, 2L, 5L, 7L, 7L, 6L, 1L, 3L, 7L, 3L, 7L, 5L, 5L, 3L, 7L, 1L, 1L, 3L, 6L, 2L, 6L, 2L, 3L, 4L, 7L, 6L, 3L, 7L, 6L, 7L, 2L, 6L, 7L, 7L, 2L, 3L, 6L, 4L, 3L, 2L, 5L, 6L, 7L, 7L, 3L, 6L, 3L, 4L, 6L, 4L), .Label = c("1", "2", "3", "4", "5", "6", "7"), class = "factor"), BASKET_SIZE = structure(c(1L, 3L, 3L, 2L, 3L, 2L, 2L, 2L, 2L, 1L, 2L, 3L, 1L, 2L, 2L, 1L, 1L, 2L, 2L, 1L, 2L, 1L, 2L, 1L, 1L, 3L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 3L, 3L, 1L, 1L, 2L, 1L, 2L, 1L, 2L, 2L, 1L, 2L, 3L, 1L), .Label = c("L", "M", "S"), class = "factor")), .Names = c("Day", "BASKET_SIZE"), row.names = c(NA, 50L), class = "data.frame")
Basically I have 3 basket sizes (S,M,L) and 7 days of the week (1-7). I plotted the data with plot(e), and that gave me this:
So this would be good if I wanted to know the basket size distribution of each day, but I'm more interested in the total count of each basket size on each day.
Here's what I've tried:
barchart(Day~BASKET_SIZE,data=e,groups=BASKET_SIZE) based on this post: Simplest way to do grouped barplot. But I can't seem to get the correct axis or distributions:
Also, I'd like it to be vertical, say the sum of each basket size, and have a legend showing the color of each basket size. This chart that I have seems to convert my S, M, L to numbers somehow... I know it's not right because I have 3.8k rows of data.

How about
tt <- t(table(dd))
barplot(as.matrix(tt),beside=TRUE)
?
You'd have to add the legend manually (?legend).
You could also do this with ggplot2, e.g.
library(ggplot2)
ggplot(dd,aes(Day,fill=BASKET_SIZE))+
geom_bar(position="dodge")
ggplot will give you legends automatically. The example here has some empty categories (e.g. no large baskets on day 1); if you want to handle that case properly, it looks like you'll have to pre-tabulate the data (but if your real data set is large, that might not be a problem).
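A minimal sketch of the pre-tabulation idea, on a made-up toy data frame (the names toy and counts are hypothetical): table() keeps every Day and BASKET_SIZE combination, including the empty ones, and barplot can draw the legend itself through its legend.text argument.

```r
# Hypothetical toy data with the same two factor columns as the question
toy <- data.frame(
  Day         = factor(c(1, 1, 2, 2, 2, 3), levels = 1:3),
  BASKET_SIZE = factor(c("M", "M", "S", "L", "M", "S"), levels = c("S", "M", "L"))
)

# Pre-tabulated counts: one row per Day x BASKET_SIZE cell, zeros included
counts <- as.data.frame(table(toy), responseName = "count")
print(counts)

# Grouped vertical bars, one group per day, with an automatic legend
barplot(t(table(toy)), beside = TRUE, legend.text = TRUE,
        xlab = "Day", ylab = "Number of baskets")
```

The counts data frame could equally be passed to ggplot with geom_col(position = "dodge") if the ggplot2 version is preferred.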
