Scaling the y-axis in ggplot in R continuously with outliers - r

I have a question regarding the visualization of data using ggplot in R. Specifically, regarding the scaling of the y-axis in case of outliers.
Let's start with a sample dataset with observations from 31 IDs. 30 IDs are in an expected range and there is one outlier:
# Load libraries
library(tidyverse)
library(ggbeeswarm)
library(data.table)
# Set seed
set.seed(123)
# Create dataset
ID <- sprintf("ID-%s",seq(1:30))
baseline <- rnorm(30, mean = 50, sd = 3)
df <- data.frame(ID, baseline) %>%
mutate(`1` = baseline - rnorm(1, mean = 5, sd = 4),
`2` = `1` - rnorm(1, mean = 3, sd = 5),
`3` = `2` - rnorm(1, mean = 1, sd = 3))
# Add outlier
df <- as.data.frame(rbindlist(list(df, list("ID-31", 0.01, 0.02, 0.03 ,1))))
df <- df %>%
pivot_longer(-ID) %>%
rename(time = name) %>%
mutate(time = as.factor(time))
#Plot
ggplot(data = df, aes(x=time, y = value)) +
geom_quasirandom() +
theme_classic() +
scale_x_discrete(limits = c("baseline", "1", "2", "3") ) +
labs(x = "Time", y = "Value")
Expected output
Since the variation in the upper part of the graph is not well visible, I would like to scale the x-axis in a way that shows all values but focusses on a certain part of the plot (in this case values between 20 and 50).
Question
Is it possible to scale the x-axis in such a way?
Additional info
I am specifically not looking for a data transformation solution. Furthermore, I am aware of the scale_y_continuous function in ggplot and it limits argument, but this omits a part of the data.

I don''t know anything about having a broken y-axis with ggplot, but this achieves something similar if you can specify in advance which ID is going to be the outlier.
library(tidyverse)
library(ggbeeswarm)
library(data.table)
# Set seed
set.seed(123)
# Create dataset
ID <- sprintf("ID-%s",seq(1:30))
baseline <- rnorm(30, mean = 50, sd = 3)
df <- data.frame(ID, baseline) %>%
mutate(`1` = baseline - rnorm(1, mean = 5, sd = 4),
`2` = `1` - rnorm(1, mean = 3, sd = 5),
`3` = `2` - rnorm(1, mean = 1, sd = 3))
# Add outlier
df <- as.data.frame(rbindlist(list(df, list("ID-31", 0.01, 0.02, 0.03 ,1))))
df <- df %>%
pivot_longer(-ID) %>%
rename(time = name) %>%
mutate(time = as.factor(time),
is_outlier = (as.character(ID) == "ID-31"))
ggplot(data = df, aes(x=time, y = value)) +
geom_point() +
facet_grid(rows = vars(is_outlier),
scales = "free_y",
switch = "y") +
theme_classic() +
scale_x_discrete(limits = c("baseline", "1", "2", "3") ) +
labs(x = "Time", y = "Value")

Related

How to specify multiple xlims for facetted data in ggplot2 R?

The data is facetted by two variables (see graph). Each variable has a different range. I want to specify the range so that all plots in var1 and vae2 are bound by the min and max values of those variables. See sample code attached. I don't want to use setscales = "free" on facet_wrap.
var1 <- rnorm(100, 6, 2)
var2 <- rnorm(100,15,2)
spp.val <- rnorm(100,10,2)
spp <- rep(c("A","B","C","D"), 25)
df <- data.frame(var1, var2,spp, spp.val)
df <- gather(df,
key = "var",
value = "var.val",
var1,var2)
df$var <- as.factor(as.character(df$var))
df$spp <- as.factor(as.character(df$spp))
ggplot(aes(x = var.val, y = spp.val), data = df) +
geom_point() +
facet_grid(spp~var)
#I want the limits for each facet_grid to be set as follows
xlim(min(df[df$var == "var1",]), max(df[df$var == "var1",])
xlim(min(df[df$var == "var2",]), max(df[df$var == "var2",])
Is this what you want?
library(tidyverse)
tibble(
var1 = rnorm(100, 6, 2),
var2 = rnorm(100, 15, 2),
spp.val = rnorm(100, 10, 2),
spp = rep(c("A", "B", "C", "D"), 25)
) |>
pivot_longer(starts_with("var"), names_to = "var", values_to = "var.val") |>
mutate(across(c(spp, var), factor)) |>
ggplot(aes(var.val, spp.val)) +
geom_point() +
facet_grid(spp ~var, scales = "free_x")
Created on 2022-04-23 by the reprex package (v2.0.1)

Issue on boxplot in R language

May I ask how can I distribute each of these four to two boxplots which contain the pulse meter of male and female.
islands = read.csv('Data.csv')
boxplot(islands$Pulse.meter.First..0m, islands$Pulse.meter.25m, islands$Pulse.meter.Second..0m, islands$Pulse.meter.25m.1)
Things like
boxplot(islands$Pulse.meter.25m ~ islands$Sex)
can distinguish them, but not working for four of them in the same time
before
Wanna boxplot like this
Here is an example using random data, since you hadn't provided data to download. The key is to first transform the data from the 'wide' format as you currently have the data, with a column per value, to a 'long' format, where all values are in the same column with an additional label column. Then the interaction function can be used to create an interaction between the pulse meter type and sex.
# example data with random values
islands <- data.frame(Sex = rep(c('Male', 'Female'), 15),
Pulse.meter.First..0m = rnorm(30, mean = 2),
Pulse.meter.25m = rnorm(30, mean = 1),
Pulse.meter.Second..0m = rnorm(30, mean = 3),
Pulse.meter.25m.1 = rnorm(30, mean = 4))
# reshape from wide to long
islands_long <- reshape(islands,
direction = "long",
varying = 2:5,
v.names = "value",
times = names(islands)[2:5],
timevar = 'measurement')
# plot the boxplot, 'cex.axis' decrease the font size so all the x-axis labels are visible
boxplot(value ~ interaction(Sex, measurement), data = islands_long, pars=list(cex.axis=0.5))
This generates:
library(ggplot2)
library(dplyr)
library(tidyverse)
df <- data.frame(
Gender = sample(c("Male", "Female"), 20, replace = TRUE),
Pulse.meter.First..0m = sample(10:60, 20, replace = FALSE),
Pulse.meter.25m = sample(30:60, 20, replace = FALSE),
Pulse.meter.Second..0m = sample(30:60, 20, replace = FALSE),
Pulse.meter.25m.1 = sample(10:60, 20, replace = FALSE)
)
df <- df %>%
group_by(Gender) %>%
pivot_longer(cols = Pulse.meter.First..0m:Pulse.meter.25m.1, names_to = "Pulse_meter", values_to = "Count") %>%
unite("Groups", Gender:Pulse_meter)
df$Groups <- factor(df$Groups, levels=c("Female_Pulse.meter.First..0m", "Male_Pulse.meter.First..0m",
"Female_Pulse.meter.25m","Male_Pulse.meter.25m",
"Female_Pulse.meter.Second..0m","Male_Pulse.meter.Second..0m",
"Female_Pulse.meter.25m.1","Male_Pulse.meter.25m.1"))
ggplot(data = df, aes(x= Groups, y = Count)) +
geom_boxplot() +
scale_x_discrete(labels=c("(F,0m)","(M,0m)","(F,25m)","(M,25m)", "(F,second_0m)", "(M,second_0m)",
"(F,25m.1)","(M,25m.1)")) +
labs(y="Counts") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))

How to plot line graph of normalized differences from binned data with ggplot?

I have several sets of data that I calculate binned normalized differences for. The results I want to plot within a single line plot using ggplot. The lines representing different combinations of the paired differences are supposed to be distinguished by colors and line types.
I am stuck on taking the computed values from the bins (would be y-axis values now), and plotting these onto an x-axis.
Below is the code I use for importing the data and calculating the normalized differences.
# Read data from column 3 as data table for different number of rows
# you could use replicate here for test
# dat1 <- data.frame(replicate(1,sample(25:50,10000,rep=TRUE)))
# dat2 <- data.frame(replicate(1,sample(25:50,9500,rep=TRUE)))
dat1 <- fread("/dir01/a/dat01.txt", header = FALSE, data.table=FALSE, select=c(3))
dat2 <- fread("/dir02/c/dat02.txt", header = FALSE, data.table=FALSE, select=c(3))
# Change column names
colnames(dat1) <- c("Dat1")
colnames(dat2) <- c("Dat2")
# Perhaps there is a better way to compute the following as all-in-one? I have broken these down step by step.
# 1) Sum for each bin
bin1 = cut(dat1$Dat1, breaks = seq(25, 50, by = 2))
sum1 = tapply(dat1$Dat1, bin1, sum)
bin2 = cut(dat2$Dat2, breaks = seq(25, 50, by = 2))
sum2 = tapply(dat2$Dat2, bin2, sum)
# 2) Total sum of all bins
sumt1 = sum(sum1)
sumt2 = sum(sum2)
# 3) Divide each bin by total sum of all bins
sumn1 = lapply(sum1, `/`, sumt1)
sumn2 = lapply(sum2, `/`, sumt2)
# 4) Convert to data frame as I'm not sure how to difference otherwise
df_sumn1 = data.frame(sumn1)
df_sumn2 = data.frame(sumn2)
# 5) Difference between the two as percentage
dbin = (df_sumn1 - df_sumn2)*100
How can I plot those results using ggplot() and geom_line()?
I want
dbin values on the x-axis ranging from 25-50
different colors and line types for the lines
Here is what I tried:
p1 <- ggplot(dbin, aes(x = ?, color=Data, linetype=Data)) +
geom_line() +
scale_linetype_manual(values=c("solid")) +
scale_x_continuous(limits = c(25, 50)) +
scale_color_manual(values = c("#000000"))
dput(dbin) outputs:
structure(list(X.25.27. = -0.0729132928804117, X.27.29. = -0.119044772581772,
X.29.31. = 0.316016473225017, X.31.33. = -0.292812782147632,
X.33.35. = 0.0776336591308158, X.35.37. = 0.0205584754637611,
X.37.39. = -0.300768421159599, X.39.41. = -0.403235174844081,
X.41.43. = 0.392510458816457, X.43.45. = 0.686758883448307,
X.45.47. = -0.25387105113263, X.47.49. = -0.0508324553382303), class = "data.frame", row.names = c(NA,
-1L))
Edit
The final piece of code that works, using only the dbin and plots multiple dbins:
dat1 <- data.frame(a = replicate(1,sample(25:50,10000,rep=TRUE, prob = 25:0/100)))
dat2 <- data.frame(a = replicate(1,sample(25:50,9500,rep=TRUE, prob = 0:25/100)))
dat3 <- data.frame(a = replicate(1,sample(25:50,9500,rep=TRUE, prob = 12:37/100)))
dat4 <- data.frame(a = replicate(1,sample(25:50,9500,rep=TRUE, prob = 37:12/100)))
calc_bin_props <- function(data) {
as_tibble(data) %>%
mutate(bin = cut(a, breaks = seq(25, 50, by = 2))) %>%
group_by(bin) %>%
summarise(sum = sum(a), .groups = "drop") %>%
filter(!is.na(bin)) %>%
ungroup() %>%
mutate(sum = sum / sum(sum))
}
diff_data <-
full_join(
calc_bin_props(data = dat1),
calc_bin_props(dat2),
by = "bin") %>%
separate(bin, c("trsh", "bin", "trshb", "trshc")) %>%
mutate(dbinA = (sum.x - sum.y * 100)) %>%
select(-starts_with("trsh"))
diff_data2 <-
full_join(
calc_bin_props(data = dat3),
calc_bin_props(dat4),
by = "bin") %>%
separate(bin, c("trsh", "bin", "trshb", "trshc")) %>%
mutate(dbinB = (sum.x - sum.y * 100)) %>%
select(-starts_with("trsh"))
# Combine two differences, and remove sum.x and sum.y
full_data <- cbind(diff_data, diff_data2[,4])
full_data <- full_data[,-c(2:3)]
# Melt the data to plot more than 1 variable on a plot
m <- melt(full_data, id.vars="bin")
theme_update(plot.title = element_text(hjust = 0.5))
ggplot(m, aes(as.numeric(bin), value, col=variable, linetype = variable)) +
geom_line() +
scale_linetype_manual(values=c("solid", "longdash")) +
scale_color_manual(values = c("black", "black"))
dev.off()
library(tidyverse)
Creating example data as shown in question, but adding different probabilities to the two sample() calls, to create so visible difference
between the two sets of randomized data.
dat1 <- data.frame(a = replicate(1,sample(25:50,10000,rep=TRUE, prob = 25:0/100))) %>% as_tibble()
dat2 <- data.frame(a = replicate(1,sample(25:50,9500,rep=TRUE, prob = 0:25/100))) %>% as_tibble()
Using dplyr we can handle this within data.frames (tibbles) without
the need to switch to other datatypes.
Let’s define a function that can be applied to both datasets to get
the preprocessing done.
We use base::cut() to create
a new column that pairs each value with its bin. We then group the data
by bin, calculate the sum for each bin and finally divide the bin sums
by the total sum.
calc_bin_props <- function(data) {
as_tibble(data) %>%
mutate(bin = cut(a, breaks = seq(25, 50, by = 2), labels = seq(25, 48, by = 2))) %>%
group_by(bin) %>%
summarise(sum = sum(a), .groups = "drop") %>%
filter(!is.na(bin)) %>%
ungroup() %>%
mutate(sum = sum / sum(sum))
}
Now we call calc_bin_props() on both datasets and join them by bin.
This gives us a dataframe with the columns bin, sum.x and sum.y.
The latter two are correspond to the bin sums derived from dat1 and
dat2. With the mutate() line we calculate the differences between the
two columns.
diff_data <-
full_join(
calc_bin_props(data = dat1),
calc_bin_props(dat2),
by = "bin") %>%
mutate(dbin = (sum.x - sum.y),
bin = as.numeric(as.character(bin))) %>%
select(-starts_with("trsh"))
Before we feed the data into ggplot() we convert it to the long
format using pivot_longer() this allows us to instruct ggplot() to
plot the results for sum.x, sum.y and dbin as separate lines.
diff_data %>%
pivot_longer(-bin) %>%
ggplot(aes(as.numeric(bin), value, color = name, linetype = name)) +
geom_line() +
scale_linetype_manual(values=c("longdash", "solid", "solid")) +
scale_color_manual(values = c("black", "purple", "green"))

A simple plot for many curves with different colors

I have the following data frame which contains 4 columns of data in addition to the vector of labels c.
Time <-c(1:4)
d<-data.frame(Time,
x1= rpois(n = 4, lambda = 10),
x2= runif(n = 4, min = 1, max = 10),
x3= rpois(n = 4, lambda = 5),
x4= runif(n = 4, min = 1, max = 5),
c=c(1,1,2,3))
I would like to use ggpolt to plot 4 curves"x1,..,x4" above each others where each curve is colored according to the label. So curves x1 and x2 are colored by the same color since they have the same label where as curves x3 and x4 in different colors.
I did the following
d %>% pivot_longer(-c(Time,x1,x2,x3,x4))%>%
rename(class=value) %>% select(-name) %>%
pivot_longer(-c(Time,class)) %>%
mutate(Label=ifelse(Time==max(Time,na.rm = T),name,NA),
Label=ifelse(duplicated(Label),NA,Label)) %>%
ggplot(aes(x=Time,y=value,color=factor(class),group=name))+
geom_line()+
labs(color='class')+
scale_color_manual(values=c('red','blue','green'))+
geom_label_repel(aes(label = Label),
nudge_x = 1.5,
na.rm = TRUE,show.legend = F,color='black')
but I don't get the needed plot, the resulted curves are not colored according to the label. I want x1 and x2 in red, x3 in blue and x4 in green.
To add: I would like to obtain the same plot above in the following general case, where I can't add the vector c to the data frame as length(c) is not equal to length(x1)=...=length(x4)
Time <-c(1:5)
d<-data.frame(Time,
x1= rpois(n = 5, lambda = 10),
x2= runif(n = 5, min = 1, max = 10),
x3= rpois(n = 5, lambda = 5),
x4= runif(n = 5, min = 1, max = 5))
and c=c(1,1,2,3)
As you point out in your comments, it is only possible to put the vector of colors as a column in the original data.frame because it happens to be square, but this is a dangerous way to store the information because the colors really belong to the columns rather than the rows. It's better to assign the colors separately and then join into the long format data by variable name prior to plotting.
Below is an example of how I'd do this with your data.
First, prepare the data without the color mapping for each variable, we'll do that next:
# load necessary packages
library(tidyverse)
library(ggrepel)
# set seed to make simulated data reproducible
set.seed(1)
# simulate data
Time <-c(1:4)
d <- data.frame(Time,
x1 = rpois(n = 4, lambda = 10),
x2 = runif(n = 4, min = 1, max = 10),
x3 = rpois(n = 4, lambda = 5),
x4 = runif(n = 4, min = 1, max = 5))
Next, make a separate data.frame that maps the color grouping to the variable names. At some point you'll want to make this a factor (i.e. discrete rather than continuous) to map it to color so I just do it here but it can be done later in the ggplot call if you prefer. Per your request, this solution easily scales with your dataset without needing to manually set each level, but it requires that your vector of color mappings is in the same order and the same length as the variable names in d unless you have some other way to establish that relationship.
# create separate df with color groupings for variable in d
color_grouping <- data.frame(var = names(d)[-1],
color_group = factor(c(1, 1, 2, 3)))
Then you pivot_longer and do a join to merge the color mapping with the data for plotting.
# pivot d to long and merge in color codes
d_long <- d %>%
pivot_longer(cols = -Time, names_to = "var", values_to = "value") %>%
left_join(., color_grouping)
# inspect final table prior to plotting to confirm color mappings
head(d_long, 4)
# # A tibble: 4 x 4
# Time var value color_group
# <int> <chr> <dbl> <fct>
# 1 1 x1 8 1
# 2 1 x2 1.56 1
# 3 1 x3 4 2
# 4 1 x4 4.97 3
Finally, generate line plot where color is mapped to the color_group variable. To ensure you get one line per original variable you also need to set group = var. For more info on this check the documentation on grouping.
# plot data adding labels for each line
p <- d_long %>%
ggplot(aes(x = Time, y = value, group = var, color = color_group)) +
geom_line() +
labs(color='class') +
scale_color_manual(values=c('red','blue','green')) +
geom_label_repel(aes(label = var),
data = d_long %>% slice_max(order_by = Time, n = 1),
nudge_x = 1.5,
na.rm = TRUE,
show.legend = F,
color='black')
p
This produces the this plot:
In your comment you suggested wanting to separate out and stacking the plots. I'm not sure I fully understood, but one way to accomplish this is with faceting.
For example if you wanted to facet out separate panels by color_group, you could add this line to the plot above:
p + facet_grid(rows = "color_group")
Which gives this plot:
Note that the faceting variable must be put in quotes.
You were on the right path, but you need a little bit of a different structure to use ggplot:
# delete old color column
d$c <- NULL
# reshape df
plot.d <- reshape2::melt(d, id.vars = c("Time"))
# create new, correct color column
plot.d$c <- NA
plot.d$c[plot.d$variable == "x1"] <- 1
plot.d$c[plot.d$variable == "x2"] <- 1
plot.d$c[plot.d$variable == "x3"] <- 2
plot.d$c[plot.d$variable == "x4"] <- 3
# plot
ggplot(plot.d, aes(x=Time, y=value, color=as.factor(c), group = variable))+
geom_line() +
labs(color='class')+
scale_color_manual(values=c('red','blue','green'))
Note that I omitted the labels for brevity, but you can add them back in using the same logic. The code above gives the following result:
Here is a solution for how I understood your question.
The DF is brought in the long format, the variable c is replaced with mutate / case_when with the number code you have used.
I have set a seed for better reproducibility.
library(tidyverse)
library(ggrepel)
set.seed(1)
# YOUR DATA
Time <- c(1:4)
d <- data.frame(Time,
x1 = rpois(n = 4, lambda = 10),
x2 = runif(n = 4, min = 1, max = 10),
x3 = rpois(n = 4, lambda = 5),
x4 = runif(n = 4, min = 1, max = 5),
c = c(1, 1, 2, 3)
)
d %>%
pivot_longer(cols = x1:x4) %>% # make it long
mutate(c = as.factor(case_when( # replace consistently
name == "x1" | name == "x2" ~ 1, # according to YOUR DATA
name == "x3" ~ 2,
name == "x4" ~ 3
))) %>%
mutate(
Label = ifelse(Time == max(Time, na.rm = T), name, NA),
Label = ifelse(duplicated(Label), NA, Label)
) %>%
ggplot(aes(x = Time, y = value, color = c, group = name)) +
geom_line() +
labs(color = "class") +
scale_color_manual(values = c("red", "blue", "green")) + # YOUR CHOICE
geom_label_repel(aes(label = Label),
nudge_x = 1.5,
na.rm = TRUE, show.legend = F, color = "black"
)
ADDED
You could leave the c out and color according to name.
The color code was neccessary because you wanted 2 names with the same color. If that is not needed, the following code can do it.
d %>%
pivot_longer(cols = x1:x4) %>% # make it long
mutate(
Label = ifelse(Time == max(Time, na.rm = T), name, NA),
Label = ifelse(duplicated(Label), NA, Label)
) %>%
ggplot(aes(x = Time, y = value, color = name, group = name)) +
geom_line() +
geom_label_repel(aes(label = Label),
nudge_x = 1.5,
na.rm = TRUE, show.legend = F, color = "black"
)

ggplot not drawing boxplots as expected

Consider the MWE below. I would like to generate boxplots with these ideas in mind:
Food on the y-axix ordered according to Amot for Home, while Amt (1:40) on x-axis
show mean points overlaying the boxes
boxplots of Food to be ordered based on median of Home Site from dfsummary data
text annotations of N of observations (to be taken from dfsummary data)
MWE
df <- data.frame(
Site = sample(rep(c("Home", "Office"), size = 884)),
Food = sample(rep(c("Banana","Apple","Egg","Berry","Tomato","Potato","Bean","Pea","Nuts","Onion","Carrot","Cabbage","Eggplant"), size=884)),
Amt = sample(seq(1, 40, by = 0.25), size = 884, replace = TRUE)
)
random <- sample(seq(1, 884, by = 1), size = 100, replace = TRUE) # to randomly introduce 100 NAs to Amt vector
df$Amt[random] <- NA
Summary code
dfsummary <- df %>%
dplyr::group_by(Food, Site) %>%
dplyr::summarise(Median = round(median(Amt, na.rm=TRUE), digits=2), N = sum(!is.na(Amt))) %>%
ungroup()
ggplot code
p1 <- ggplot(df, aes(Amt, Food)) +
geom_boxplot() +
facet_grid(facets = . ~ Site)
Graph
I was expecting to see boxplots here.
Adding annotation
p2 <- p1 + geom_text(aes(y = 42, Food, label = paste("n=", N)), data = dfsummary, size = 3, nudge_x = 0.1) +
facet_grid(facets = . ~ Site)
Unfortunately, this doesn't work either.
Note
tidyverse version is 1.3.0
R version 3.6.2 (2019-12-12) -- "Dark and Stormy Night"
To work out the problem, you may want to generate a scatter plot, first:
library(ggplot2)
p1 <- ggplot(df, aes(Amt, Food)) +
geom_point() +
facet_grid(facets = . ~ Site)
p1
As you can see it is impossible to generate a boxplot.
However, if you switch x and y
ggplot(df, aes(Food, Amt)) +
geom_boxplot() +
facet_grid(facets = . ~ Site)
You get:
This works just fine in the current development version of ggplot2, to be released in January 2020.
# If your ggplot2 version is <= 3.2.1, do:
# remotes::install_github("tidyverse/ggplot2")
library(tidyverse)
df <- data.frame(
Site = sample(rep(c("Home", "Office"), size = 884)),
Food = sample(rep(c("Banana","Apple","Egg","Berry","Tomato","Potato","Bean","Pea","Nuts","Onion","Carrot","Cabbage","Eggplant"), size=884)),
Amt = sample(seq(1, 40, by = 0.25), size = 884, replace = TRUE)
)
random <- sample(seq(1, 884, by = 1), size = 100, replace = TRUE) # to randomly introduce 100 NAs to Amt vector
df$Amt[random] <- NA
ggplot(df, aes(Amt, Food)) +
geom_boxplot() +
facet_grid(facets = . ~ Site)
#> Warning: Removed 98 rows containing non-finite values (stat_boxplot).
Created on 2020-01-01 by the reprex package (v0.3.0)
With annotations:
library(tidyverse)
df <- data.frame(
Site = sample(rep(c("Home", "Office"), size = 884)),
Food = sample(rep(c("Banana","Apple","Egg","Berry","Tomato","Potato","Bean","Pea","Nuts","Onion","Carrot","Cabbage","Eggplant"), size=884)),
Amt = sample(seq(1, 40, by = 0.25), size = 884, replace = TRUE)
)
random <- sample(seq(1, 884, by = 1), size = 100, replace = TRUE) # to randomly introduce 100 NAs to Amt vector
df$Amt[random] <- NA
dfsummary <- df %>%
dplyr::group_by(Food, Site) %>%
dplyr::summarise(Median = round(median(Amt, na.rm=TRUE), digits=2), N = sum(!is.na(Amt))) %>%
ungroup()
ggplot(df, aes(Amt, Food)) +
geom_boxplot() +
geom_text(
aes(x = 42, Food, label = paste("n=", N)),
data = dfsummary,
size = 3, nudge_x = 0.1
) +
facet_grid(facets = . ~ Site)
#> Warning: Removed 95 rows containing non-finite values (stat_boxplot).
Created on 2020-01-01 by the reprex package (v0.3.0)

Resources