ggplot2 geom_smooth didn't work - r

I'm plotting two different variables on the same plot.
sex_female is chr, including 0 and 1.
epoch_36:epoch_144 are num, time variables.
Here is my code:
total %>%
select(sex_female, epoch_36:epoch_144)%>%
gather(key = time, value = ac, epoch_36:epoch_144) %>%
group_by(sex_female,time) %>%
mutate(mean = mean(ac)) %>%
ggplot(aes(x = time, y = mean,color = sex_female)) +
geom_point(alpha = .3)+
geom_smooth(method = "lm")+
theme(axis.text.x = element_text(angle = 90,hjust = 1))
After the mutation, I got the tibble:
A tibble: 45,780 x 4
# Groups: sex_female, time [218]
sex_female time ac mean
<chr> <chr> <dbl> <dbl>
1 1 epoch_36 49.8 54.96406
2 0 epoch_36 34.7 55.43448
3 0 epoch_36 70.9 55.43448
4 0 epoch_36 12.3 55.43448
5 1 epoch_36 102.7 54.96406
6 1 epoch_36 77.9 54.96406
7 0 epoch_36 1.1 55.43448
8 1 epoch_36 140.0 54.96406
9 1 epoch_36 51.3 54.96406
10 0 epoch_36 0.0 55.43448
# ... with 45,770 more rows
I've tried using the solution suggested in a similar question: Plot dashed regression line with geom_smooth in ggplot2, but no lines showed up. How do I fix my code to produce lines?

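Your time column is categorical (character), so geom_smooth() has no numeric x to fit a line against. Convert it to numeric first. Because the values look like "epoch_36", strip the prefix before converting; calling as.numeric() on the raw strings would only produce NA.
mutTibble$time <- as.numeric(sub("^epoch_", "", mutTibble$time))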
And for plotting you can use this:
library(ggplot2)
ggplot(mutTibble,
aes(time, mean, color = factor(sex_female))) +
geom_point(alpha = 0.3)+
geom_smooth(method = "lm")+
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
labs(x = "Time",
y = "Mean",
color = "Gender (female)")
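As a minimal sketch of the conversion step, assuming the time labels all follow the "epoch_<number>" pattern shown in the question (the toy data frame and its values below are made up for illustration and are not the asker's data):

```r
library(ggplot2)

# Toy stand-in for the gathered data; the numbers are invented
toy <- data.frame(
  sex_female = rep(c("0", "1"), each = 4),
  time       = rep(c("epoch_36", "epoch_37", "epoch_38", "epoch_39"), times = 2),
  mean       = c(55.4, 56.1, 57.0, 58.2, 55.0, 55.9, 56.3, 57.5),
  stringsAsFactors = FALSE
)

# as.numeric("epoch_36") is NA, so drop the prefix before converting
toy$time_num <- as.numeric(sub("^epoch_", "", toy$time))

# With a numeric x, geom_smooth() can fit one line per color group
p <- ggplot(toy, aes(time_num, mean, color = sex_female)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", se = FALSE)
```

Printing p should draw the points plus one fitted line per sex_female group.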

Related

R - Plot 2-way Anova Results on longitudinal data using GGplot2

I am looking to display the results of a two-way Anova analysis over several time points. This is preliminary data, and I am interested in getting an understanding of the potential relationship between time and sex on tumor burden.
My data:
ID Sex Tumor.Burden Time.Point
Cage3 female 1270800 1
Cage3 female 1237600 2
Cage3 female 1288760 3
Cage3 female 775220 4
Cage4 female 1768400 1
Cage4 female 1630200 2
Cage4 female 1606900 3
Cage4 female 1134220 4
Cage5 male 1441500 1
Cage5 male 3000750 2
Cage5 male 5930500 3
Cage5 male 6944225 4
Cage6 male 2063640 1
Cage6 male 7067600 2
Cage6 male 10460400 3
Cage6 male 18764800 4
This is the plot I am using. I'd like to point out that this wasn't made with the data I just listed, but rather with similar data. However, I plan on using the same approach here.
ggplot(Data, aes(x = Time.Point, y = Tumor.Burden, color = Sex)) +
geom_line() +
theme_minimal() +
labs(title = "Weekly Follow-up of Tumor-Bearing Mice", x = "Time points (weeks)", y="Log(Tumor Burden)") +
theme(plot.title = element_text(size = 10, hjust = 0.5))
What is the best approach to add the significance for each time point above the corresponding time point on the plot? I.e., is there a statistically significant difference between males and females at each of time points 1:5?
Currently, I am following this: https://www.datanovia.com/en/lessons/repeated-measures-anova-in-r/#two-way-repeated-measures-anova. However, I am getting an error at the end and it seems to be related to my ID variable getting flagged as NA when I run
Data %>%
group_by(Time.Point) %>%
anova_test(dv = Tumor.Burden, wid = ID, within = Sex)
Thanks!
To calculate the p values you can use anova_test(Tumor.Burden ~ Sex) and use these output p values in a geom_text to show them in your plot like this:
library(ggplot2)
library(ggpubr)
library(rstatix)
library(dplyr)
p_values <- Data %>%
group_by(Time.Point) %>%
anova_test(Tumor.Burden ~ Sex)
#> Coefficient covariances computed by hccm()
#> Coefficient covariances computed by hccm()
#> Coefficient covariances computed by hccm()
#> Coefficient covariances computed by hccm()
p_values
#> # A tibble: 4 × 8
#> Time.Point Effect DFn DFd F p `p<.05` ges
#> * <int> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <dbl>
#> 1 1 Sex 1 2 0.342 0.618 "" 0.146
#> 2 2 Sex 1 2 3.11 0.22 "" 0.608
#> 3 3 Sex 1 2 8.83 0.097 "" 0.815
#> 4 4 Sex 1 2 4.05 0.182 "" 0.669
ggplot() +
geom_line(Data, mapping = aes(x = Time.Point, y = Tumor.Burden, color = Sex)) +
geom_text(data = p_values, mapping = aes(x = Time.Point, y = 15000000, label = p), size = 3) +
theme_minimal() +
labs(title = "Weekly Follow-up of Tumor-Bearing Mice", x = "Time points (weeks)", y="Log(Tumor Burden)") +
theme(plot.title = element_text(size = 10, hjust = 0.5))
Created on 2022-11-16 with reprex v2.0.2
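As a cross-check that needs no extra packages, the same per-time-point comparison can be sketched with base R's aov(). The data frame below is copied from the question; the names p_by_time and p_labels are only illustrative:

```r
# Data copied from the question
Data <- data.frame(
  ID = rep(c("Cage3", "Cage4", "Cage5", "Cage6"), each = 4),
  Sex = rep(c("female", "male"), each = 8),
  Tumor.Burden = c(1270800, 1237600, 1288760, 775220,
                   1768400, 1630200, 1606900, 1134220,
                   1441500, 3000750, 5930500, 6944225,
                   2063640, 7067600, 10460400, 18764800),
  Time.Point = rep(1:4, times = 4)
)

# One-way ANOVA of Tumor.Burden on Sex, fitted separately at each time point
p_by_time <- sapply(split(Data, Data$Time.Point), function(d) {
  summary(aov(Tumor.Burden ~ Sex, data = d))[[1]][["Pr(>F)"]][1]
})

# Label table that could be passed to geom_text(), one row per time point
p_labels <- data.frame(
  Time.Point = as.integer(names(p_by_time)),
  label = paste0("p = ", signif(p_by_time, 2))
)
```

With only two animals per sex at each time point, these tests have very little power, so the p values are best read as descriptive.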

ggplot boxplot with mean and confidence interval by group

I'd like to make a boxplot with the mean instead of the median. Moreover, I would like the whiskers to stop at the 5% (lower) and 95% (upper) quantiles. Here is the code:
ggplot(data, aes(x=Cement, y=Mean_Gap, fill=Material)) +
geom_boxplot(fatten = NULL,aes(fill=Material), position=position_dodge(.9)) +
xlab("Cement") + ylab("Mean cement layer thickness") +
stat_summary(fun=mean, geom="point", aes(group=Material), position=position_dodge(.9),color="black")
I'd like to change geom to errorbar, but this doesn't work. I tried middle = mean(Mean_Gap), but this doesn't work either. I tried ymin = quantile(y,0.05), but nothing was changing. Can anyone help me?
The standard boxplot using ggplot. fill is Material:
Here is how you can create the boxplot using custom parameters for the box and whiskers. It's the solution shown by @lukeA in stackoverflow.com/a/34529614/6288065, but this one will also show you how to make several boxes by groups.
The R built-in data set called "ToothGrowth" is similar to your data structure so I will use that as an example. We will plot the length of tooth growth (len) for each vitamin C supplement group (supp), separated/filled by dosage level (dose).
# "ToothGrowth" at a glance
head(ToothGrowth)
# len supp dose
#1 4.2 VC 0.5
#2 11.5 VC 0.5
#3 7.3 VC 0.5
#4 5.8 VC 0.5
#5 6.4 VC 0.5
#6 10.0 VC 0.5
library(dplyr)
library(ggplot2)
# recreate the data structure with specific "len" coordinates to plot for each group
df <- ToothGrowth %>%
group_by(supp, dose) %>%
summarise(
y0 = quantile(len, 0.05),
y25 = quantile(len, 0.25),
y50 = mean(len),
y75 = quantile(len, 0.75),
y100 = quantile(len, 0.95))
df
## A tibble: 6 x 7
## Groups: supp [2]
# supp dose y0 y25 y50 y75 y100
# <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 OJ 0.5 8.74 9.7 13.2 16.2 19.7
#2 OJ 1 16.8 20.3 22.7 25.6 26.9
#3 OJ 2 22.7 24.6 26.1 27.1 30.2
#4 VC 0.5 4.65 5.95 7.98 10.9 11.4
#5 VC 1 14.0 15.3 16.8 17.3 20.8
#6 VC 2 19.8 23.4 26.1 28.8 33.3
# boxplot using the mean for the middle line and the 5%/95% quantiles for the whiskers
ggplot(df, aes(supp, fill = as.factor(dose))) +
geom_boxplot(
aes(ymin = y0, lower = y25, middle = y50, upper = y75, ymax = y100),
stat = "identity"
) +
labs(y = "len", title = "Boxplot with Mean Middle Line") +
theme(plot.title = element_text(hjust = 0.5))
In the figure above, the boxplot on the left is the standard boxplot with regular median line and regular min/max whiskers. The boxplot on the right uses the mean middle line and 5%/95% quantile whiskers.
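An alternative sketch skips the pre-summarised data frame and lets stat_summary() compute the five box values per group. The helper name mean_box is hypothetical; fun.data and geom = "boxplot" are standard ggplot2 arguments:

```r
library(ggplot2)

# Hypothetical helper: the five values geom_boxplot needs,
# with the mean as the middle line and 5%/95% quantiles as whisker ends
mean_box <- function(y) {
  data.frame(
    ymin   = unname(quantile(y, 0.05)),
    lower  = unname(quantile(y, 0.25)),
    middle = mean(y),
    upper  = unname(quantile(y, 0.75)),
    ymax   = unname(quantile(y, 0.95))
  )
}

# One box per supp/dose combination, dodged side by side
p <- ggplot(ToothGrowth, aes(supp, len, fill = factor(dose))) +
  stat_summary(fun.data = mean_box, geom = "boxplot",
               position = position_dodge(0.9))
```

Note that a mean can fall outside the 25%/75% box for strongly skewed groups, in which case the middle line is drawn outside the box.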

Combining color and linetype legends in ggplot

I'm having trouble combining color and linetype guides into a single legend in a plot produced with ggplot2. Either the linetype shows up with all of the linetypes keyed the same way, or it does not show up at all.
My plot includes both a ribbon to show the bulk of the observations, along with lines showing minimum, median, maximum, and sometimes the observations from a single year.
Example code using built in CO2 data set:
library(tidyverse)
myExample <- CO2 %>%
group_by(conc) %>%
summarise(d.min = min(uptake, na.rm= TRUE),
d.ten = quantile(uptake,probs = .1, na.rm = TRUE),
d.median = median(uptake, na.rm = TRUE),
d.ninty = quantile(uptake, probs = .9, na.rm= TRUE),
d.max = max(uptake, na.rm = TRUE))
myExample <- cbind(myExample, "Qn1"= filter(CO2, Plant == "Qn1")[,5])
plot_plant <- TRUE # Switch to plot single observation series
myExample %>%
ggplot(aes(x=conc))+
geom_ribbon(aes(ymin=d.ten, ymax= d.ninty, fill = "80% of observations"), alpha = .2)+
geom_line(aes(y=d.min, colour = "c"), linetype = 3, size = .5)+
geom_line(aes(y=d.median, colour = "e"),linetype = 2, size = .5)+
geom_line(aes(y=d.max, colour = "a"),linetype = 3, size = .5)+
{if(plot_plant)geom_line(aes(y=Qn1, color = "f"), linetype = 1,size =.5)}+
scale_fill_manual("Statistic", values = "blue")+
scale_color_brewer(palette = "Dark2",name = "",
labels = c(
a= "Maximum",
e= "Median",
c= "Minimum",
f = current_year
), breaks = c("a","e","c","f"))+
scale_linetype_manual(name = "")+
guides(fill= guide_legend(order = 1), color = guide_legend(order = 2), linetype = guide_legend(order = 2))
With plot_plant set to TRUE, the code plots a single observation series, but linetype does not show up at all in the legend:
With plot_plant set to FALSE, linetype shows up in the legend, but I cannot see the distinction between the dotted and dashed legend entries:
The plot is working as desired, but I would like the linetype distinctions to show up in the legend. Visually, it is more important when I'm plotting the single observation series because the distinction between solid and dashed or dotted is stronger.
Searching for answers, I've seen suggestions to combine the different stats (min, median, max, and the single series) into a single variable and let ggplot determine the linetypes (e.g., the post "ggplot2 manually specifying color & linetype - duplicate legend") or to make a hash that describes the linetype (e.g., "How to rename a (combined) legend in ggplot2?"), but neither of these approaches seems to play well in combination with the ribbon plot.
I tried formatting my data into a long format, which usually works well for ggplot. This worked if I plotted all of the statistics as line geometry, but couldn't get the ribbon to work like I wanted, and overlaying a single observation series seemed like it needed to be stored in a different data table.
As you noted, ggplot loves long format data. So I recommend sticking with that.
Here I generate some made up data:
library(tibble)
library(dplyr)
library(ggplot2)
library(tidyr)
set.seed(42)
tibble(x = rep(1:10, each = 10),
y = unlist(lapply(1:10, function(x) rnorm(10, x)))) -> tbl_long
which looks like this:
# A tibble: 100 x 2
x y
<int> <dbl>
1 1 2.37
2 1 0.435
3 1 1.36
4 1 1.63
5 1 1.40
6 1 0.894
7 1 2.51
8 1 0.905
9 1 3.02
10 1 0.937
# ... with 90 more rows
Then I group_by(x) and calculate quantiles of interest for y in each group:
tbl_long %>%
group_by(x) %>%
mutate(q_0.0 = quantile(y, probs = 0.0),
q_0.1 = quantile(y, probs = 0.1),
q_0.5 = quantile(y, probs = 0.5),
q_0.9 = quantile(y, probs = 0.9),
q_1.0 = quantile(y, probs = 1.0)) -> tbl_long_and_wide
and that looks like:
# A tibble: 100 x 7
# Groups: x [10]
x y q_0.0 q_0.1 q_0.5 q_0.9 q_1.0
<int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 2.37 0.435 0.848 1.38 2.56 3.02
2 1 0.435 0.435 0.848 1.38 2.56 3.02
3 1 1.36 0.435 0.848 1.38 2.56 3.02
4 1 1.63 0.435 0.848 1.38 2.56 3.02
5 1 1.40 0.435 0.848 1.38 2.56 3.02
6 1 0.894 0.435 0.848 1.38 2.56 3.02
7 1 2.51 0.435 0.848 1.38 2.56 3.02
8 1 0.905 0.435 0.848 1.38 2.56 3.02
9 1 3.02 0.435 0.848 1.38 2.56 3.02
10 1 0.937 0.435 0.848 1.38 2.56 3.02
# ... with 90 more rows
Then I gather up all the columns except for x, y, and the 10- and 90-percentile variables into two variables: key and value. The new key variable takes on the names of the old variables from which each value came. The other variables are just copied down as needed.
tbl_long_and_wide %>%
gather(key, value, -x, -y, -q_0.1, -q_0.9) -> tbl_super_long
and that looks like:
# A tibble: 300 x 6
# Groups: x [10]
x y q_0.1 q_0.9 key value
<int> <dbl> <dbl> <dbl> <chr> <dbl>
1 1 2.37 0.848 2.56 q_0.0 0.435
2 1 0.435 0.848 2.56 q_0.0 0.435
3 1 1.36 0.848 2.56 q_0.0 0.435
4 1 1.63 0.848 2.56 q_0.0 0.435
5 1 1.40 0.848 2.56 q_0.0 0.435
6 1 0.894 0.848 2.56 q_0.0 0.435
7 1 2.51 0.848 2.56 q_0.0 0.435
8 1 0.905 0.848 2.56 q_0.0 0.435
9 1 3.02 0.848 2.56 q_0.0 0.435
10 1 0.937 0.848 2.56 q_0.0 0.435
# ... with 290 more rows
This format will allow you to use both geom_ribbon() and geom_line() the way you want, because the variables for the lines are contained in value and grouped by key, whereas the variables to be mapped to ymin and ymax are separate from value and are all the same within each x group.
tbl_super_long %>%
ggplot() +
geom_ribbon(aes(x = x,
ymin = q_0.1,
ymax = q_0.9,
fill = "80% of observations"),
alpha = 0.2) +
geom_line(aes(x = x,
y = value,
color = key,
linetype = key)) +
scale_fill_manual(name = "Statistic",
guide = guide_legend(order = 1),
values = viridisLite::viridis(1)) +
scale_color_manual(name = NULL,
labels = c("Minimum", "Median", "Maximum"),
guide = guide_legend(reverse = TRUE, order = 2),
values = viridisLite::viridis(3)) +
scale_linetype_manual(name = NULL,
labels = c("Minimum", "Median", "Maximum"),
guide = guide_legend(reverse = TRUE, order = 2),
values = c("dotted", "dashed", "solid")) +
labs(x = "x", y = "y")
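This data format, with the long but grouped x and y variables plus the independent but repeated ymin and ymax variables, will allow you to use both geom_ribbon() and geom_line(). Because color and linetype are mapped to the same variable and their scales share the same name, labels, and guide order, ggplot2 merges them into one legend in which the linetypes show up properly.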

Plotting Cumulative Gains Curve Plot R

I am trying to generate a cumulative gain plot using ggplot2 in R. Basically I want to replicate following using ggplot2.
My Data is this
df
# A tibble: 10 x 6
Decile resp Cumresp Gain Cumlift
<int> <dbl> <dbl> <dbl> <dbl>
1 8301 8301 57.7 5.77
2 2449 10750 74.8 3.74
3 1337 12087 84.0 2.80
4 751 12838 89.3 2.23
5 462 13300 92.5 1.85
6 374 13674 95.1 1.58
7 252 13926 96.8 1.38
8 195 14121 98.2 1.23
9 136 14257 99.1 1.10
10 124 14381 100 1
## Cumulative Gains Plot
ggplot(df, aes(Decile, Gain)) +
geom_point() +
geom_line() +
geom_abline(intercept = 52.3 , slope = 4.77)
scale_y_continuous(breaks = seq(0, 100, by = 20)) +
scale_x_continuous(breaks = c(1:10)) +
labs(title = "Cumulative Gains Plot",
y = "Cumulative Gain %")
However, I am not able to get the diagonal line, even though I tried geom_abline, and my y-axis is not right either: I could not make it run from 0 to 100.
I would really appreciate if someone can get me the plot as in picture using ggplot2.
Thanks in advance
library(dplyr); library(ggplot2)
df2 <- df %>%
add_row(Decile = 0, Gain =0) %>%
arrange(Decile)
ggplot(df2, aes(Decile, Gain)) +
geom_point() +
geom_line() +
# This makes another geom_line that only sees the first and last row of the data
geom_line(data = df2 %>% slice(1, n())) +
scale_y_continuous(breaks = seq(0, 100, by = 20), limits = c(0,100)) +
scale_x_continuous(breaks = c(1:10)) +
labs(title = "Cumulative Gains Plot",
y = "Cumulative Gain %")
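For reference, the arithmetic behind the columns in the question can be sketched in base R. The resp values are copied from the question; treating the diagonal as a "10% of responders per decile" baseline is the usual reading of a cumulative gains chart and is an assumption here, not something stated in the post:

```r
# Responders per decile, copied from the question
resp   <- c(8301, 2449, 1337, 751, 462, 374, 252, 195, 136, 124)
decile <- seq_along(resp)

cumresp <- cumsum(resp)                 # running total of responders
gain    <- 100 * cumresp / sum(resp)    # cumulative gain in percent

# Assumed baseline: a straight line from (0, 0) to (10, 100)
baseline <- 100 * decile / length(resp)

# Cumulative lift is the ratio of gain to baseline
cumlift <- gain / baseline
```

Adding the (0, 0) row, as the answer does with add_row(), is what lets both the curve and the diagonal start at the origin.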

How can I add a line plot in a BOX plot using secondary y axis but same x axis

I am trying to add a line plot to my box plot, on a secondary y axis, but I am not able to. What should I do? Please help.
The code for my box plot is:
library(ggplot2)
mydata<-read.csv("boxplot2.csv")
mydata$Class <- factor(mydata$Class,labels = c("1", "2", "3", "4", "5", "6"))
p10 <- ggplot(mydata, aes(x = mydata$Class, y = log(mydata$erosion))) +
geom_boxplot()
p10
p10 <- p10 +
scale_x_discrete(name = "Mean Annual Precipitation(mm/yr)") +
scale_y_continuous(name = "Log Average Erosion Rate(m/My)")
p10 <- ggplot(mydata, aes(x = mydata$Class, y = log(mydata$erosion))) +
geom_boxplot(varwidth=TRUE)
p10 <- p10 +
scale_x_discrete(name = "Mean Annual Precipitation(mm/yr)") +
scale_y_continuous(name = "Log Average Erosion Rate(m/My)")
I want a similar figure, but instead of histograms, I will have box plots.
Sample data:
% Vegetation erosion Class
0 0.43 1
0 0.81 1
2 0.26 1
3 1.05 1
3 0.97 1
12.76 15.97 2
12.84 17.69 2
11.01 14.76 2
13.44 17.94 2
10.76 10.65 2
7.28 67.47 2
23 120.4 3
21 298.63 3
52 21.4 3
9 64.94 3
50 291.88 3
16 493.98 3
11 183.45 3
You just have to specify different aesthetics for the geom_line, something like this:
ggplot(iris,aes(x=Species, y=Sepal.Length, fill=Species)) +
geom_boxplot() +
geom_line(aes(x=Species, y=Petal.Length, group=1), stat = "summary", fun = "mean") +
scale_y_continuous(sec.axis = sec_axis(~.))
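Note that sec_axis(~.) only duplicates the primary axis. When the line series lives on a different scale, a common approach is to rescale it onto the primary axis and undo that rescaling in sec_axis(). The sketch below uses iris again; the scale factor k and the name line_df are illustrative choices, not part of the original answer:

```r
library(ggplot2)

# Per-species means of the variable to draw as a line
line_df <- aggregate(Petal.Length ~ Species, data = iris, FUN = mean)

# Assumed linear rescaling: stretch the line so its maximum
# matches the maximum of the boxplot variable
k <- max(iris$Sepal.Length) / max(line_df$Petal.Length)

p <- ggplot(iris, aes(Species, Sepal.Length)) +
  geom_boxplot() +
  # draw the rescaled line on the primary axis
  geom_line(data = line_df, aes(y = Petal.Length * k, group = 1)) +
  # the secondary axis divides by k to show the original units
  scale_y_continuous(sec.axis = sec_axis(~ . / k, name = "Mean Petal.Length"))
```

For the erosion data, the same pattern would apply with Class on the x axis and the vegetation column as the line series, with k chosen to suit the two ranges.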

Resources