ggplot: geom_boxplot and geom_jitter - r

Align the data points with the box plot.
DATA:
data<-structure(list(score = c(0.058, 0.21, -0.111, -0.103, 0.051,
0.624, -0.023, 0.01, 0.033, -0.815, -0.505, -0.863, -0.736, -0.971,
-0.137, -0.654, -0.689, -0.126), clin = structure(c(1L, 1L, 1L,
1L, 1L, 2L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 1L), .Label =
c("Non-Sensitive",
"Sensitive "), class = "factor"), culture = structure(c(1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L
), .Label = c("Co-culture", "Mono-culture"), class = "factor"),
status = structure(c(2L, 2L, 1L, 2L, 2L, 1L, 2L, 2L, 1L,
2L, 2L, 1L, 2L, 2L, 1L, 2L, 1L, 2L), .Label = c("new", "old"
), class = "factor")), .Names = c("score", "clin", "culture",
"status"), class = "data.frame", row.names = c(NA, -18L))
CODES:
p<-ggplot(data, aes(culture, as.numeric(score),fill=status))
p+geom_boxplot(outlier.shape = NA)+
theme_bw()+scale_fill_grey(start = 0.8, end = 1)+
labs(title="title", x="", y="score",fill="", colour="")+
geom_jitter(aes(colour = clin), alpha=0.9,
position=position_jitter(w=0.1,h=0.1))
As you can see, the data points plotted using geom_jitter do not align with the boxplot. I know that I need to provide aes elements to geom_jitter as well, but I am not sure how to do it correctly.

I don't think you can do this because the positions of the boxplots are being driven by the dodge algorithm as opposed to an explicit aesthetic, though I'd be curious if someone else figures out a way of doing it. Here is a workaround:
p<-ggplot(data, aes(status, as.numeric(score),fill=status))
p+geom_boxplot(outlier.shape = NA)+
theme_bw()+scale_fill_grey(start = 0.8, end = 1)+
labs(title="title", x="", y="score",fill="", colour="")+
geom_jitter(aes(colour = clin), alpha=0.9,
position=position_jitter(w=0.1,h=0.1)) +
facet_wrap(~ culture)
By using the facets for culture, we can assign an explicit aesthetic to status, which then allows us to line up the geom_jitter points with the geom_boxplot boxes. Hopefully this is close enough for your purposes.
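For what it's worth, more recent ggplot2 releases provide position_jitterdodge(), which jitters points inside each dodged group, so faceting is not the only option. The following is a minimal sketch, assuming a ggplot2 version that includes that function; the data frame `toy` and its values are made up for illustration and only reuse the column names from the question.

```r
library(ggplot2)

# Toy data with the same column names as the question (values are made up).
set.seed(1)
toy <- data.frame(
  score   = rnorm(24),
  culture = factor(rep(c("Co-culture", "Mono-culture"), each = 12)),
  status  = factor(rep(c("new", "old"), times = 12)),
  clin    = factor(rep(c("Non-Sensitive", "Sensitive"), each = 2, times = 6))
)

p <- ggplot(toy, aes(culture, score, fill = status)) +
  geom_boxplot(outlier.shape = NA) +
  # group = status makes the points dodge by the same variable as the boxes;
  # dodge.width = 0.75 matches the default dodge width of geom_boxplot().
  geom_point(
    aes(colour = clin, group = status),
    alpha = 0.9,
    position = position_jitterdodge(
      jitter.width = 0.1, jitter.height = 0, dodge.width = 0.75
    )
  ) +
  scale_fill_grey(start = 0.8, end = 1) +
  theme_bw()

print(p)
```

The explicit group aesthetic matters here: without it, the points would be grouped by the combination of status and clin and dodged into more positions than there are boxes.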

Related

Boxplot troubleshooting, adding another variable factor

I have constructed a nice looking boxplot in R for data looking at the production of methane under different incubation temperatures. The plot looks at the production of CH4 by the patch from which the sample was collected.
However, there is also a temperature variable. Samples were split, with 50% incubated at 10 °C and 50% at 26 °C.
This is my current plot:
Methanogenesis_Data=read.csv("CO2-CH4 Rates.csv")
attach(Methanogenesis_Data)
summary(Methanogenesis_Data)
str(Methanogenesis_Data)
boxplot(CH4rate~Patch, data = Methanogenesis_Data, xlab="Patch",
ylab="CH4 µmol g-1 hr-1 ",
col=c("lightblue","firebrick1"), main = "CH4 Production After
Incubation", frame.plot=FALSE)
This was my previous plot:
boxplot(CH4rate~Patch+Temperature, data = Methanogenesis_Data,
xlab="Patch", ylab="CH4 µmol g-1 hr-1 ",
col=c("lightblue","firebrick1"), main = "CH4 Production After
Incubation", frame.plot=FALSE)
Here is the data:
structure(list(Patch = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("Gravel", "Macrophytes",
"Marginal"), class = "factor"), Temperature = structure(c(2L,
2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 1L, 1L,
1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L), .Label = c("Cold",
"Warm"), class = "factor"), CH4rate = c(0.001262595, 0.00138508,
0.001675944, 0.001592354, 0.002169233, 0.001772964, 0.002156633,
0.002864403, 0.002301383, 0.002561042, 0.005189598, 0.004557227,
0.008484851, 0.006867866, 0.007438633, 0.005405327, 0.006381582,
0.008860084, 0.007615417, 0.007705906, 0.009198508, 0.00705233,
0.007943024, 0.008319768, 0.010362114, 0.007822153, 0.010339339,
0.009252302, 0.008249555, 0.008197657), CO2rate = c(0.002274825,
0.002484866, 0.003020209, 0.00289133, 0.003927232, 0.003219346,
0.003922613, 0.005217026, 0.00418674, 0.00466427, 0.009427322,
0.008236453, 0.015339532, 0.012494729, 0.013531303, 0.009839847,
0.011624428, 0.016136746, 0.0138831, 0.014051034, 0.016753211,
0.012780956, 0.01445912, 0.01515584, 0.01883252, 0.014249452,
0.018849478, 0.016863299, 0.015045964, 0.014941168)), .Names =
c("Patch",
"Temperature", "CH4rate", "CO2rate"), class = "data.frame", row.names =
c(NA,
-30L))
What I am attempting to do is have my current plot, but with boxes in the boxplot representing both warm and cold temperatures within the 3 Patch areas.
Boxplot of CH4 production by Patch inc. Temp <--- This is what I want to do!
Thank You for any assistance!!
You could try it using ggplot2:
library(tidyverse)
Methanogenesis_Data %>%
ggplot(aes(x = Patch, y = CH4rate, fill = Temperature)) +
geom_boxplot() +
scale_fill_manual(values = c("lightblue","firebrick1")) +
scale_x_discrete(drop = FALSE) +
theme_minimal()+
labs(y = 'CH4 µmol g-1 hr-1', title = "CH4 Production After Incubation")
Or, if you so wish, try it with base-R:
boxplot(CH4rate~Temperature + Patch, data = Methanogenesis_Data, xlab="Patch",
ylab="CH4 µmol g-1 hr-1 ",
col=c("lightblue","firebrick1"), main = "CH4 Production After
Incubation", frame.plot=FALSE,xaxt = 'n')
legend('topleft', legend = c('cold', 'warm'), fill = c("lightblue","firebrick1"))
axis(1,at = c(1.5,3.5,5.5), labels = levels(Methanogenesis_Data$Patch))

Creating a box and whisker plot with ggplot() troubleshooting

UPDATED:
Data has now been updated to full chemistry values as opposed to mean values.
I am attempting to create a box and whisker plot in R on a very small dataset. Either my data is not behaving itself or I am missing some glaringly obvious error.
This is the code I have for making said plot:
library(ggplot2)
Methanogenesis_Data=read.csv("CO2-CH4 Rates.csv")
attach(Methanogenesis_Data)
summary(Methanogenesis_Data)
str(Methanogenesis_Data)
boxplot(CH4rate~Patch+Temperature, data = Methanogenesis_Data,
xlab="Patch", ylab="CH4 Production")
cols<-c("red", "blue")
From this small dataset.
structure(list(Patch = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("Gravel", "Macrophytes",
"Marginal"), class = "factor"), Temperature = structure(c(2L,
2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 1L, 1L,
1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L), .Label = c("Cold",
"Warm"), class = "factor"), CH4rate = c(0.001262595, 0.00138508,
0.001675944, 0.001592354, 0.002169233, 0.001772964, 0.002156633,
0.002864403, 0.002301383, 0.002561042, 0.005189598, 0.004557227,
0.008484851, 0.006867866, 0.007438633, 0.005405327, 0.006381582,
0.008860084, 0.007615417, 0.007705906, 0.009198508, 0.00705233,
0.007943024, 0.008319768, 0.010362114, 0.007822153, 0.010339339,
0.009252302, 0.008249555, 0.008197657), CO2rate = c(0.002274825,
0.002484866, 0.003020209, 0.00289133, 0.003927232, 0.003219346,
0.003922613, 0.005217026, 0.00418674, 0.00466427, 0.009427322,
0.008236453, 0.015339532, 0.012494729, 0.013531303, 0.009839847,
0.011624428, 0.016136746, 0.0138831, 0.014051034, 0.016753211,
0.012780956, 0.01445912, 0.01515584, 0.01883252, 0.014249452,
0.018849478, 0.016863299, 0.015045964, 0.014941168)), .Names = c("Patch",
"Temperature", "CH4rate", "CO2rate"), class = "data.frame", row.names =
c(NA,
-30L))
The plot I get as output is good, however I would like the Variables on the X axis to simply display "Gravel" "Macrophytes" "Marginal" as opposed to each of those variables with Warm and Cold. Thanks for any assistance
THIS IS WHAT I AM TRYING TO ACHIEVE -----> Exact Boxplot I want to create
Following your update with an example graph:
I have also included the formatting for the legend position. If you want to edit the y axis label to include subscript I would suggest you read over this. I have included a blank title for relabelling.
test <- structure(list(Patch = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("Gravel", "Macrophytes",
"Marginal"), class = "factor"), Temperature = structure(c(2L,
2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 1L, 1L,
1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L), .Label = c("Cold",
"Warm"), class = "factor"), CH4rate = c(0.001262595, 0.00138508,
0.001675944, 0.001592354, 0.002169233, 0.001772964, 0.002156633,
0.002864403, 0.002301383, 0.002561042, 0.005189598, 0.004557227,
0.008484851, 0.006867866, 0.007438633, 0.005405327, 0.006381582,
0.008860084, 0.007615417, 0.007705906, 0.009198508, 0.00705233,
0.007943024, 0.008319768, 0.010362114, 0.007822153, 0.010339339,
0.009252302, 0.008249555, 0.008197657), CO2rate = c(0.002274825,
0.002484866, 0.003020209, 0.00289133, 0.003927232, 0.003219346,
0.003922613, 0.005217026, 0.00418674, 0.00466427, 0.009427322,
0.008236453, 0.015339532, 0.012494729, 0.013531303, 0.009839847,
0.011624428, 0.016136746, 0.0138831, 0.014051034, 0.016753211,
0.012780956, 0.01445912, 0.01515584, 0.01883252, 0.014249452,
0.018849478, 0.016863299, 0.015045964, 0.014941168)), .Names = c("Patch",
"Temperature", "CH4rate", "CO2rate"), class = "data.frame", row.names =
c(NA,
-30L))
Now I will create two data sets, one for each graph. This is just for simplicity: you could leave them combined and facet, but for formatting purposes separate data sets might be easier.
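library(tidyverse)  # provides %>%, gather(), filter() and ggplot()
# Reshape the two rate columns (3 and 4) to long format, then keep one rate per data set
CH4rate <- test %>%
  gather("id", "value", 3:4) %>%
  filter(id == "CH4rate")
CO2rate <- test %>%
  gather("id", "value", 3:4) %>%
  filter(id == "CO2rate")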
First plot:
ggplot(CH4rate) +
geom_boxplot(mapping = aes(x = Patch, y = value, fill=factor(Temperature, levels = c("Warm", "Cold")))) +
theme(legend.position = c(0.15, 0.9), panel.background = element_rect(fill = "white", colour = "grey50")) +
labs(title = "Title of graph", x="Patch Type", y = "CH4rate") +
scale_fill_manual(name = "", values = c("orange", "light blue")
, labels = c("Cold" = "Incubated at 10°C", "Warm" = "Incubated at 26°C"))
Second plot:
ggplot(CO2rate) +
geom_boxplot(mapping = aes(x = Patch, y = value, fill=factor(Temperature, levels = c("Warm", "Cold")))) +
theme(legend.position = c(0.15, 0.9), panel.background = element_rect(fill = "white", colour = "grey50")) +
labs(title = "Title of graph", x="Patch Type", y = "CO2rate") +
scale_fill_manual(name = "", values = c("orange", "light blue")
, labels = c("Cold" = "Incubated at 10°C", "Warm" = "Incubated at 26°C"))
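The combined-and-facet alternative mentioned earlier in this answer could look roughly like the sketch below. It assumes tidyr's pivot_longer() is available; the data frame `toy` is a made-up stand-in for `test` with the same column names, and the axis label "Rate" is a placeholder.

```r
library(ggplot2)
library(tidyr)

# Toy stand-in for the `test` data frame above (values are made up).
set.seed(1)
toy <- data.frame(
  Patch       = factor(rep(c("Gravel", "Macrophytes", "Marginal"), each = 10)),
  Temperature = factor(rep(rep(c("Warm", "Cold"), each = 5), times = 3)),
  CH4rate     = runif(30, 0.001, 0.010),
  CO2rate     = runif(30, 0.002, 0.019)
)

# Reshape both rate columns into one long column, keeping the source name in `id`.
long <- pivot_longer(toy, cols = c(CH4rate, CO2rate),
                     names_to = "id", values_to = "value")

p <- ggplot(long, aes(x = Patch, y = value, fill = Temperature)) +
  geom_boxplot() +
  # One panel per rate; free y scales because the two rates differ in magnitude.
  facet_wrap(~ id, scales = "free_y") +
  scale_fill_manual(name = "", values = c(Cold = "lightblue", Warm = "orange")) +
  labs(x = "Patch Type", y = "Rate")

print(p)
```

The trade-off is the one the answer hints at: a single faceted plot shares one title, legend and y-axis label, so per-panel formatting is harder than with two separate plots.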

Connecting points on a graph within nested groups of data with ggplot2

I'm having a problem working out how to draw lines between points on a ggplot that are in a nested structure.
What I have is a set of data that is broken down by 3 different nested groups.
Which are then plotted, the first group is used with facet to pair the subgroups (Mutation), the second group then splits the data into the initial experiment (HiSeq) and the replication experiment (MiSeq), while the third group (Grouping) colors and shapes the points by the sample type they are from.
Where I have gotten stuck, though, is that I'd like to link the 2 points (HiSeq/MiSeq) within a pair (Mutation) via a line to make it easy to work out which two are linked. I've made a mock up which can be seen:
However, I'm unable to work out how to do this across the two groups (HiSeq/MiSeq) while staying within the top level group (Mutation).
Does anyone have a solution to this? A fragment of the data and the code I'm using to build the current graph can be seen below. It may end up being too messy to be presentable, but it would be useful to solve.
ggplot(test,aes(y=AR,x=Type,fill=Grouping,colour=Grouping,shape=Grouping)) +
geom_point(binaxis='y',stackdir='center',position=position_dodge(width = 0.2),size=7) +
facet_wrap(~ Mutation,nrow=1) +
xlab("") +
ylab("Allelic Ratio") +
theme_minimal(base_size=20)
example data:
structure(list(Mutation = structure(c(1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("chr1:51910329",
"chr1:72951069"), class = "factor"), Type = structure(c(1L, 2L,
1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L,
1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L,
1L, 2L), .Label = c("HiSeq", "MiSeq"), class = "factor"), Grouping = structure(c(3L,
3L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 2L,
2L, 2L, 1L, 3L, 3L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L,
1L, 1L, 1L), .Label = c("Offspring (M)", "Offspring (P)", "Proband"
), class = "factor"), Name = c(288458773L, 288458773L, 423125012L,
423125012L, 344991226L, 344991226L, 422977809L, 422977809L, 420753074L,
420753074L, 351142406L, 351142406L, 422743921L, 422743921L, 425596544L,
425596544L, 422595517L, 422595517L, 477342393L, 477342393L, 288458773L,
288458773L, 423125012L, 423125012L, 344991226L, 344991226L, 422977809L,
422977809L, 420753074L, 420753074L, 351142406L, 351142406L, 477342393L,
477342393L, 480773638L, 480773638L), AR = c(0.38, 0.3, 0, 0,
0.375, 0.545, 0.41, 0.388, 0.35, 0.42, 0, 0, NA, 0.59, NA, 0,
0, 0.05, 0, 0, 0.1875, 0.078379734, 0.4, 0.505582473, 0, 0.002394493,
0, 0.002023547, 0, 0.001600569, 0.6, 0.510240797, 0.6, 0.490997813,
0, 0.001785424)), .Names = c("Mutation", "Type", "Grouping",
"Name", "AR"), class = "data.frame", row.names = c(NA, -36L))
I think this may be what you want -- look into geom_line and understanding its group aesthetic:
ggplot(df, aes(x = Type, y = AR, fill = Grouping, color = Grouping, shape = Grouping)) +
geom_point(size = 5) +
geom_line(aes(group = Name)) +
facet_wrap(~ Mutation)
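To make the role of the group aesthetic concrete, here is a minimal sketch on made-up data (the data frame `toy`, the sample names and the AR values are all hypothetical). Each sample has one HiSeq and one MiSeq measurement per mutation, and group = Name tells geom_line() to connect exactly those two points.

```r
library(ggplot2)

# Toy data: each sample (Name) is measured once per platform (Type).
toy <- data.frame(
  Mutation = rep(c("mutA", "mutB"), each = 4),
  Type     = rep(c("HiSeq", "MiSeq"), times = 4),
  Name     = rep(c("s1", "s2", "s1", "s2"), each = 2),
  Grouping = rep(c("Proband", "Offspring", "Proband", "Offspring"), each = 2),
  AR       = c(0.38, 0.30, 0.41, 0.39, 0.19, 0.08, 0.40, 0.51)
)

p <- ggplot(toy, aes(x = Type, y = AR, colour = Grouping)) +
  geom_point(aes(shape = Grouping), size = 5) +
  # group = Name draws one line per sample; faceting keeps each line
  # inside its own Mutation panel.
  geom_line(aes(group = Name)) +
  facet_wrap(~ Mutation, nrow = 1)

print(p)
```

Without the group mapping, geom_line() would group by the discrete variables in the plot (here Grouping), which connects samples of the same type rather than the two measurements of one sample.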

ggplot2 error: Aesthetics must be either length 1 or the same as the data (24)

I am trying to create a plot in ggplot showing the mean home range size of an animal according to different sexes, treatments, time periods and seasons. I get an error in R saying
Error: Aesthetics must be either length 1 or the same as the data (24): x, y, colour, shape
I have read similar posts about this error but I haven't been able to figure it out yet. There are no NA's in these columns and my numerical variables are being treated as such. I am not sure if the error has to do with a need to subset the data, but I don't understand how I should do that. My code runs fine up until the ggplot part and it is the following:
library("ggplot2")
library("dplyr")
lion_HR_size <- read.csv(file = "https://dl.dropboxusercontent.com/u/23723553/lion_sample_data.csv",
header= TRUE, row.names=1)
# Mean of home range size by season, treatment, sex and time
Mean_HR <- lion_HR_size %>%
group_by(season, treatment, sex, time) %>%
summarize(
mean_HR = mean(Area_HR_km),
se_HR = sd(Area_HR_km)/sqrt(n()),
lwrHR = mean_HR - se_HR,
uprHR = mean_HR + se_HR)
limitsHR <- aes(ymin = lwrHR, ymax= uprHR)
ggplot(Mean_HR,
aes(x=season,
y= Mean_HR,
colour=season,
shape= season)) +
geom_point( size = 6, alpha = 0.5)+
facet_grid(sex ~ treatment+time)+
geom_errorbar(limitsHR, width = 0.1, col = 'red', alpha = 0.8)+
theme_bw()
As requested, the dput(Mean_HR) output is the following:
dput(Mean_HR)
structure(list(season = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
3L), .Label = c("Early_dry", "Late_dry", "Wet"), class = "factor"),
treatment = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 1L,
1L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L
), .Label = c("C", "E"), class = "factor"), sex = structure(c(1L,
1L, 2L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 2L,
1L, 1L, 2L, 2L, 1L, 1L, 2L, 2L), .Label = c("F", "M"), class = "factor"),
time = structure(c(1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L,
1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L), .Label = c("A",
"B"), class = "factor"), mean_HR = c(141.594090181, 138.327188493,
509.287443507692, 345.296845642381, 157.634028930833, 184.202160663125,
252.464096340667, 255.078012825, 59.8485325981818, 143.158189516522,
439.990400912593, 175.410885601333, 221.338774452381, 100.942251723636,
127.961533612727, 167.199563142143, 120.60363022375, 142.351764574211,
249.03854219, 330.018734301176, 123.992902995714, 219.886321226667,
307.869373359167, 296.019550844286), se_HR = c(18.6245437612391,
29.2548378154774, 127.987824704623, 78.9236194797204, 20.8897993194466,
43.1314245224751, 57.6327505533691, 32.1129054260719, 9.383853530199,
38.7678333459788, 130.348285186224, 31.707304307485, 29.1561478797825,
15.4038723326613, 18.1932127432015, 37.791782522185, 32.7089231722616,
33.2629181623941, 46.1500408067739, 88.8736578370159, 15.8046627788777,
36.9665360444972, 70.1560303348504, 87.1340476758794), lwrHR = c(122.969546419761,
109.072350677523, 381.29961880307, 266.373226162661, 136.744229611387,
141.07073614065, 194.831345787298, 222.965107398928, 50.4646790679828,
104.390356170543, 309.642115726369, 143.703581293848, 192.182626572598,
85.5383793909751, 109.768320869526, 129.407780619958, 87.8947070514884,
109.088846411816, 202.888501383226, 241.145076464161, 108.188240216837,
182.91978518217, 237.713343024316, 208.885503168406), uprHR = c(160.218633942239,
167.582026308477, 637.275268212315, 424.220465122101, 178.52382825028,
227.3335851856, 310.096846894036, 287.190918251072, 69.2323861283808,
181.9260228625, 570.338686098816, 207.118189908818, 250.494922332163,
116.346124056298, 146.154746355929, 204.991345664328, 153.312553396012,
175.614682736605, 295.188582996774, 418.892392138192, 139.797565774592,
256.852857271164, 378.025403694017, 383.153598520165)), class = c("grouped_df",
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -24L), vars = list(
season, treatment, sex), drop = TRUE, .Names = c("season",
"treatment", "sex", "time", "mean_HR", "se_HR", "lwrHR", "uprHR"
))
Could someone help me understand this error and how to fix it in my code? Many thanks!
Not entirely sure myself why/how the limitsHR <- ... statement works. I would have expected it to stop on not being able to find the lwrHR and uprHR objects in the workspace.
Anyhow, ggplot has a nice function mean_se() that will help you tremendously.
ggplot(data = lion_HR_size, mapping = aes(x = season, y = Area_HR_km,
colour=season, shape= season)) +
stat_summary(fun.data = mean_se) +
facet_grid(sex ~ treatment+time)+
theme_bw()
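As for the error itself, it appears to come from the y mapping in the question: R is case-sensitive, so y = Mean_HR refers to the summary data frame rather than to its column mean_HR. The limitsHR line works because aes() stores its arguments unevaluated; they are only looked up in the layer's data when the plot is built. Below is a minimal sketch that maps the column instead, using a made-up summary table (`toy_summary` and its values are assumptions, and the facets are reduced to sex only to keep it short).

```r
library(ggplot2)

# Toy summary table with the same column names as Mean_HR (values are made up).
toy_summary <- data.frame(
  season  = factor(rep(c("Early_dry", "Late_dry", "Wet"), each = 2)),
  sex     = factor(rep(c("F", "M"), times = 3)),
  mean_HR = c(140, 500, 60, 440, 120, 250),
  se_HR   = c(19, 128, 9, 130, 33, 46)
)
toy_summary$lwrHR <- toy_summary$mean_HR - toy_summary$se_HR
toy_summary$uprHR <- toy_summary$mean_HR + toy_summary$se_HR

p <- ggplot(toy_summary,
            aes(x = season, y = mean_HR,   # the column, with a lower-case m
                colour = season, shape = season)) +
  geom_point(size = 6, alpha = 0.5) +
  # Error-bar limits are mapped inline; they are resolved inside toy_summary.
  geom_errorbar(aes(ymin = lwrHR, ymax = uprHR),
                width = 0.1, colour = "red", alpha = 0.8) +
  facet_grid(sex ~ .) +
  theme_bw()

print(p)
```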

Bar chart with several non mutually exclusive characteristics

I have a data set with a variable that has several other characteristics, which are not mutually exclusive. Here's the data.
df <- structure(list(cont1 = structure(c(2L, 2L, 4L, 1L, 2L, 3L, 2L, 4L, 4L, 1L, 2L, 2L, 4L, 1L, 1L, 2L, 2L), .Label = c("Africa", "Asia", "Europe", "LAC"), class = "factor"), SIDS = structure(c(2L, 1L, 2L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("No", "SIDS"), class = "factor"), LDC = structure(c(2L, 1L, 2L, 1L, 2L, 2L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("LDC", "No"), class = "factor")), .Names = c("cont1",
"SIDS", "LDC"), class = "data.frame", row.names = c(NA, -17L))
So when I put it into long format df.m <- melt(df, id.vars = c("cont1")) I can build the plot with ggplot2 but get all the NAs in the plot. If I exclude them the proportions are distorted because there are more NAs in one of the categories.
ggplot(df.m, aes(x = cont1, fill = value)) + geom_bar()
ggplot(df.m[df.m$value != "No",], aes(x = cont1, fill = value)) + geom_bar()
Is there a way to have a bar plot of the variable cont1 with the value as a fill without the NAs distorting the proportion? That is, can I use a different length for the fill in ggplot2?
