Different colors in ggplot2 using groups - r

I have a problem trying to use different colors in my plot for two groups. I created a plot with odds ratios (including 95%CI) over a period of serveral years for 2 groups (mfin and ffin). When using the syntax below, all points and lines are black and my attempts to adjust them e.g. geom_linerange(colour=c("red","blue")) have failed (Error: Incompatible lengths for set aesthetics: colour).
Can anyone help me with this?
ggplot(rbind(data.frame(mfin, group=mfin), data.frame(ffin, group=ffin)),
aes(x = JAAR, y = ror, ymin = llror, ymax = ulror)) +
geom_linerange() +
geom_point() +
geom_hline(yintercept = 1) +
ylab("Odds ratio & 95% CI") +
xlab("") +
geom_errorbar(width=0.2)
Below are some sample data (1st group = mfin, #ND GROUP + ffin)
JAAR ror llror ulror
2008 2.00 1.49 2.51
2009 2.01 1.57 2.59
2010 2.06 1.55 2.56
2011 2.07 1.56 2.58
2012 2.19 1.70 2.69
2013 2.23 1.73 2.72
2014 2.20 1.71 2.69
2015 2.31 1.84 2.78
2016 .230 1.83 2.76
JAAR ror llror ulror
2008 1.36 0.88 1.84
2009 1.20 0.73 1.68
2010 1.16 0.68 1.64
2011 1.23 0.77 1.69
2012 1.43 1.00 1.86
2013 1.46 1.04 1.88
2014 1.49 1.07 1.90
2015 1.30 0.89 1.70
2016 1.29 0.89 1.70

You need to map the group membership variable to the color aesthetic (in the long version of the data):
library(readr)
library(dplyr)
library(ggplot2)
# simulate some data
year_min = 1985
year_max = 2016
num_years = year_max - year_min + 1
num_groups = 2
num_estimates = num_years*num_groups
df_foo = data_frame(
upper_limit = runif(n = num_estimates, min = -20, max = 20),
lower_limit = upper_limit - runif(n = num_estimates, min = 0, max = 5),
point_estimate = runif(num_estimates, min = lower_limit, max = upper_limit),
year = rep(seq(year_min, year_max), num_groups),
group = rep(c("mfin", "ffin"), each = num_years)
)
# plot the confidence intervals
df_foo %>%
ggplot(aes(x = year, y = point_estimate,
ymin = lower_limit, ymax = upper_limit,
color = group)) +
geom_point() +
geom_errorbar() +
theme_bw() +
ylab("Odds Ratio & 95% CI") +
xlab("Year") +
scale_color_discrete(name = "Group")
This produces what I think you are looking for, except the simulated data makes it look somewhat messy:

Related

Need to put asterisk on the top of ggplot barplot to flag the level of significance (pvalue)?

I have a lm model results containing R2 and pvalue, and I plotted them in a bar plot. I have then facetted them using two discrete variables.
I want to put * on the top of bars to flag statistical significance (pvlue <= 0.05), as shown on the bottom-left-most panel of the below image.
I have not found an insightful tutorial on how to do this.
Any way to do this, please?
Here is some code I used
> head(res_all_s2)
WI aggre_per Season yield_level slope Intercept r.squared
1 R IDW2 Dec Season2 Region II -7.06 6091 0.41
2 R IDW2 Dec Season2 Region I -7.29 6280 0.40
3 GDD AS OND Season2 Region II 14.23 -18270 0.34
4 GDD AS Nov Season2 Region II 36.84 -14760 0.33
5 SPI1 IDW2 Dec Season2 Region II -405.10 5358 0.31
6 SPI1 IDW2 Dec Season2 Region I -421.70 5523 0.32
adj.r.squared fstatistic.value pval pearson
1 0.36 9.58 0.01 -0.64
2 0.36 9.49 0.01 -0.64
3 0.29 7.09 0.02 0.58
4 0.28 6.97 0.02 0.58
5 0.26 6.40 0.02 -0.56
6 0.27 6.51 0.02 -0.56
> # significance (pval <= 0.05)
> signif_reg <- res_all_s2 %>% filter(pval <= 0.05)
> head(signif_reg)
WI aggre_per Season yield_level slope Intercept r.squared
1 R IDW2 Dec Season2 Region II -7.06 6091 0.41
2 R IDW2 Dec Season2 Region I -7.29 6280 0.40
3 GDD AS OND Season2 Region II 14.23 -18270 0.34
4 GDD AS Nov Season2 Region II 36.84 -14760 0.33
5 SPI1 IDW2 Dec Season2 Region II -405.10 5358 0.31
6 SPI1 IDW2 Dec Season2 Region I -421.70 5523 0.32
adj.r.squared fstatistic.value pval pearson
1 0.36 9.58 0.01 -0.64
2 0.36 9.49 0.01 -0.64
3 0.29 7.09 0.02 0.58
4 0.28 6.97 0.02 0.58
5 0.26 6.40 0.02 -0.56
6 0.27 6.51 0.02 -0.56
>
> # Plot R2
>
> r <- res_all_s2 %>% ggplot(aes(x=aggre_per,
+ y=r.squared )) +
+ geom_bar(stat="identity", width=0.8) +
+ facet_grid(yield_level ~ WI,
+ scales = "free_y",
+ switch = "y") +
+ scale_y_continuous(limits = c(0, 1)) +
+ xlab("Aggregation period") +
+ ylab(expression(paste("R-squared"))) +
+ theme_bw() +
+ theme(axis.title = element_text(size = 12), # all titles
+ axis.text = element_text(colour = "black"),
+ axis.text.x = element_text(angle = 90, vjust = 0.5,
+ hjust = 1, color = "black"),
+ strip.text.y.left = element_text(angle = 0),
+ panel.border = element_rect(color = "black",
+ size = .5))
> r
And, here is the link to my res_all_s2 dataset https://1drv.ms/u/s!Ajl_vaNPXhANgckJeqDKA0fzfFEbhg?e=VfoFaB
Technically, you can always add an appropriate geom with its independent dataset (that would be your data filtered to exclude pval > .05):
df_filtered <- res_all_s2 %>% filter(...)
## ggplot(...) +
geom_point(data = df_filtered, pch = 8)
## pch = point character, no. 8 = asterisk
or
## ... +
geom_text(data = df_filtered, aes(label = '*'), nudge_y = .05)
## nudge_y = vertical offset
or color only significant columns:
## ... +
geom_col(aes(fill = c('grey','red')[1 + pval <= .05]))
So, yes, technically that's feasible. But before throwing the results of 13 x 7 x 5 = 455 linear models at your audience, please consider the issues of p-hacking, the benefits of multivariate analysis and the viewers' ressources ;-)

several return plotting in R

Try to plot very basic data in R.
Year X1 X2 X3 X4 X5 X6 X7
2004 0.91 0.23 0.28 1.02 0.90 0.95 0.94
2005 0.57 -0.03 0.88 0.52 0.47 0.55 0.56
2006 1.30 -0.43 1.95 1.27 1.00 1.19 1.26
2007 0.44 0.63 0.60 0.34 0.60 0.50 0.46
2008 1.69 0.34 -2.81 -2.41 -1.80 -1.87 -1.83
What I am looking for is a basic line chart over time with x = year and y = value and the chart itself should include all X1-X7.
I was looking at the ggplot2 functionality, but I don't know where to start.
# Libraries
library(tidyverse)
library(streamgraph)
library(viridis)
library(plotly)
# Plot
p <- data %>%
ggplot(aes(x = year, y = n) +
geom_area() +
scale_fill_viridis(discrete = TRUE) +
theme(legend.position = "none") +
ggtitle("multiple X over time") +
theme_ipsum() +
theme(legend.position = "none")
ggplotly(p, tooltip = "text")
Would anyone give me a hand on it, please? Is there an easy way to do it in basic R?
Thanks.
It’s unclear what you intend to do with geom_area but in the following you’ll see a basic line chart of the data you’ve shown.
The key point is that ‘ggplot2’ works on tidy data, so you need to first transform your data into long form:
data %>%
pivot_longer(-Year, names_to = 'Vars', values_to = 'Values') %>%
ggplot() +
aes(x = Year, y = Values, color = Vars) +
geom_line()

Combining color and linetype legends in ggplot

I'm having trouble combining color and linetype guides into a single legend in a plot produced with ggplot2. Either the linetype shows up with all of the linetypes keyed the same way, or it does not show up at all.
My plot includes both a ribbon to show the bulk of the observations, along with lines showing minimum, median, maximum, and sometimes the observations from a single year.
Example code using built in CO2 data set:
library(tidyverse)
myExample <- CO2 %>%
group_by(conc) %>%
summarise(d.min = min(uptake, na.rm= TRUE),
d.ten = quantile(uptake,probs = .1, na.rm = TRUE),
d.median = median(uptake, na.rm = TRUE),
d.ninty = quantile(uptake, probs = .9, na.rm= TRUE),
d.max = max(uptake, na.rm = TRUE))
myExample <- cbind(myExample, "Qn1"= filter(CO2, Plant == "Qn1")[,5])
plot_plant <- TRUE # Switch to plot single observation series
myExample %>%
ggplot(aes(x=conc))+
geom_ribbon(aes(ymin=d.ten, ymax= d.ninty, fill = "80% of observations"), alpha = .2)+
geom_line(aes(y=d.min, colour = "c"), linetype = 3, size = .5)+
geom_line(aes(y=d.median, colour = "e"),linetype = 2, size = .5)+
geom_line(aes(y=d.max, colour = "a"),linetype = 3, size = .5)+
{if(plot_plant)geom_line(aes(y=Qn1, color = "f"), linetype = 1,size =.5)}+
scale_fill_manual("Statistic", values = "blue")+
scale_color_brewer(palette = "Dark2",name = "",
labels = c(
a= "Maximum",
e= "Median",
c= "Minimum",
f = current_year
), breaks = c("a","e","c","f"))+
scale_linetype_manual(name = "")+
guides(fill= guide_legend(order = 1), color = guide_legend(order = 2), linetype = guide_legend(order = 2))
With plot_plant set to TRUE, the code plots a single observation series, but linetype does not show up at all in the legend:
With plot_plant set to FALSE, linetype shows up in the legend, but I cannot see the distinction between the dotted and dashed legend entries:
The plot is working as desired, but I would like the linetype distinctions to show up in the legend. Visually, it is more important when I'm plotting the single observation series because the distinction between solid and dashed or dotted is stronger.
Searching for answers, I've seen suggestions to combine the different stats(min, median, max, and the single series) into a single variable and let ggplot determine the linetypes (ex [this post]ggplot2 manually specifying color & linetype - duplicate legend) or make a hash that describes the linetype [for example]How to rename a (combined) legend in ggplot2? but neither of these approaches seems to play well in combination with the ribbon plot.
I tried formatting my data into a long format, which usually works well for ggplot. This worked if I plotted all of the statistics as line geometry, but couldn't get the ribbon to work like I wanted, and overlaying a single observation series seemed like it needed to be stored in a different data table.
As you noted, ggplot loves long format data. So I recommend sticking with that.
Here I generate some made up data:
library(tibble)
library(dplyr)
library(ggplot2)
library(tidyr)
set.seed(42)
tibble(x = rep(1:10, each = 10),
y = unlist(lapply(1:10, function(x) rnorm(10, x)))) -> tbl_long
which looks like this:
# A tibble: 100 x 2
x y
<int> <dbl>
1 1 2.37
2 1 0.435
3 1 1.36
4 1 1.63
5 1 1.40
6 1 0.894
7 1 2.51
8 1 0.905
9 1 3.02
10 1 0.937
# ... with 90 more rows
Then I group_by(x) and calculate quantiles of interest for y in each group:
tbl_long %>%
group_by(x) %>%
mutate(q_0.0 = quantile(y, probs = 0.0),
q_0.1 = quantile(y, probs = 0.1),
q_0.5 = quantile(y, probs = 0.5),
q_0.9 = quantile(y, probs = 0.9),
q_1.0 = quantile(y, probs = 1.0)) -> tbl_long_and_wide
and that looks like:
# A tibble: 100 x 7
# Groups: x [10]
x y q_0.0 q_0.1 q_0.5 q_0.9 q_1.0
<int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 2.37 0.435 0.848 1.38 2.56 3.02
2 1 0.435 0.435 0.848 1.38 2.56 3.02
3 1 1.36 0.435 0.848 1.38 2.56 3.02
4 1 1.63 0.435 0.848 1.38 2.56 3.02
5 1 1.40 0.435 0.848 1.38 2.56 3.02
6 1 0.894 0.435 0.848 1.38 2.56 3.02
7 1 2.51 0.435 0.848 1.38 2.56 3.02
8 1 0.905 0.435 0.848 1.38 2.56 3.02
9 1 3.02 0.435 0.848 1.38 2.56 3.02
10 1 0.937 0.435 0.848 1.38 2.56 3.02
# ... with 90 more rows
Then I gather up all the columns except for x, y, and the 10- and 90-percentile variables into two variables: key and value. The new key variable takes on the names of the old variables from which each value came from. The other variables are just copied down as needed.
tbl_long_and_wide %>%
gather(key, value, -x, -y, -q_0.1, -q_0.9) -> tbl_super_long
and that looks like:
# A tibble: 300 x 6
# Groups: x [10]
x y q_0.1 q_0.9 key value
<int> <dbl> <dbl> <dbl> <chr> <dbl>
1 1 2.37 0.848 2.56 q_0.0 0.435
2 1 0.435 0.848 2.56 q_0.0 0.435
3 1 1.36 0.848 2.56 q_0.0 0.435
4 1 1.63 0.848 2.56 q_0.0 0.435
5 1 1.40 0.848 2.56 q_0.0 0.435
6 1 0.894 0.848 2.56 q_0.0 0.435
7 1 2.51 0.848 2.56 q_0.0 0.435
8 1 0.905 0.848 2.56 q_0.0 0.435
9 1 3.02 0.848 2.56 q_0.0 0.435
10 1 0.937 0.848 2.56 q_0.0 0.435
# ... with 290 more rows
This format will allow you to use both geom_ribbon() and geom_smooth() like you want to do because the variables for the lines are contained in value and grouped by key whereas the variables to be mapped to ymin and ymax are separate from value and are all the same within each x group.
tbl_super_long %>%
ggplot() +
geom_ribbon(aes(x = x,
ymin = q_0.1,
ymax = q_0.9,
fill = "80% of observations"),
alpha = 0.2) +
geom_line(aes(x = x,
y = value,
color = key,
linetype = key)) +
scale_fill_manual(name = element_text("Statistic"),
guide = guide_legend(order = 1),
values = viridisLite::viridis(1)) +
scale_color_manual(name = element_blank(),
labels = c("Minimum", "Median", "Maximum"),
guide = guide_legend(reverse = TRUE, order = 2),
values = viridisLite::viridis(3)) +
scale_linetype_manual(name = element_blank(),
labels = c("Minimum", "Median", "Maximum"),
guide = guide_legend(reverse = TRUE, order = 2),
values = c("dotted", "dashed", "solid")) +
labs(x = "x", y = "y")
This data format with the long but grouped x and y variables plus the independent but repeated ymin, and xmin variables will allow you to use both geom_ribbon() and geom_smooth() and allow the linetypes to show up properly in the legend.

How to add shaded confidence intervals to line plot with specified values

I have a small table of summary data with the odds ratio, upper and lower confidence limits for four categories, with six levels within each category. I'd like to produce a chart using ggplot2 that looks similar to the usual one created when you specify a lm and it's se, but I'd like R just to use the pre-specified values I have in my table. I've managed to create the line graph with error bars, but these overlap and make it unclear. The data look like this:
interval OR Drug lower upper
14 0.004 a 0.002 0.205
30 0.022 a 0.001 0.101
60 0.13 a 0.061 0.23
90 0.22 a 0.14 0.34
180 0.25 a 0.17 0.35
365 0.31 a 0.23 0.41
14 0.84 b 0.59 1.19
30 0.85 b 0.66 1.084
60 0.94 b 0.75 1.17
90 0.83 b 0.68 1.01
180 1.28 b 1.09 1.51
365 1.58 b 1.38 1.82
14 1.9 c 0.9 4.27
30 2.91 c 1.47 6.29
60 2.57 c 1.52 4.55
90 2.05 c 1.31 3.27
180 2.422 c 1.596 3.769
365 2.83 c 1.93 4.26
14 0.29 d 0.04 1.18
30 0.09 d 0.01 0.29
60 0.39 d 0.17 0.82
90 0.39 d 0.2 0.7
180 0.37 d 0.22 0.59
365 0.34 d 0.21 0.53
I have tried this:
limits <- aes(ymax=upper, ymin=lower)
dodge <- position_dodge(width=0.9)
ggplot(data, aes(y=OR, x=days, colour=Drug)) +
geom_line(stat="identity") +
geom_errorbar(limits, position=dodge)
and searched for a suitable answer to create a pretty plot, but I'm flummoxed!
Any help greatly appreciated!
You need the following lines:
p<-ggplot(data=data, aes(x=interval, y=OR, colour=Drug)) + geom_point() + geom_line()
p<-p+geom_ribbon(aes(ymin=data$lower, ymax=data$upper), linetype=2, alpha=0.1)
Here is a base R approach using polygon() since #jmb requested a solution in the comments. Note that I have to define two sets of x-values and associated y values for the polygon to plot. It works by plotting the outer perimeter of the polygon. I define plot type = 'n' and use points() separately to get the points on top of the polygon. My personal preference is the ggplot solutions above when possible since polygon() is pretty clunky.
library(tidyverse)
data('mtcars') #built in dataset
mean.mpg = mtcars %>%
group_by(cyl) %>%
summarise(N = n(),
avg.mpg = mean(mpg),
SE.low = avg.mpg - (sd(mpg)/sqrt(N)),
SE.high =avg.mpg + (sd(mpg)/sqrt(N)))
plot(avg.mpg ~ cyl, data = mean.mpg, ylim = c(10,30), type = 'n')
#note I have defined c(x1, x2) and c(y1, y2)
polygon(c(mean.mpg$cyl, rev(mean.mpg$cyl)),
c(mean.mpg$SE.low,rev(mean.mpg$SE.high)), density = 200, col ='grey90')
points(avg.mpg ~ cyl, data = mean.mpg, pch = 19, col = 'firebrick')

How to create odds ratio and 95 % CI plot in R

I have estimates of odds ratio with corresponding 95% CI of six pollutants overs 4 lag periods. How can I create a vertical plot similar to the attached figure in R? The figure below was created in SPSS.
Sample data that produced the figure is the following:
lag pollut or lcl ucl
0 CO 0.97 0.90 1.06
0 PM10 1.00 0.91 1.09
0 NO 0.97 0.92 1.02
0 NO2 1.01 0.89 1.15
0 SO2 0.97 0.85 1.11
0 Ozone 1.00 0.87 1.15
1 CO 1.03 0.95 1.10
1 PM10 0.93 0.86 1.01
1 NO 1.01 0.97 1.06
1 NO2 1.08 0.97 1.20
1 SO2 0.94 0.84 1.04
1 Ozone 0.94 0.84 1.04
2 CO 1.09 1.02 1.16
2 PM10 1.04 0.96 1.13
2 NO 1.04 1.00 1.08
2 NO2 1.07 0.96 1.18
2 SO2 1.05 0.95 1.17
2 Ozone 0.93 0.84 1.03
3 CO 0.98 0.91 1.06
3 PM10 1.14 1.05 1.24
3 NO 0.99 0.95 1.04
3 NO2 1.01 0.91 1.12
3 SO2 1.11 1.00 1.23
3 Ozone 1.00 0.90 1.11
You can also do this with ggplot2. The code is somewhat shorter:
dat <- read.table("clipboard", header = T)
dat$lag <- paste0("L", dat$lag)
library(ggplot2)
ggplot(dat, aes(x = pollut, y = or, ymin = lcl, ymax = ucl)) + geom_pointrange(aes(col = factor(lag)), position=position_dodge(width=0.30)) +
ylab("Odds ratio & 95% CI") + geom_hline(aes(yintercept = 1)) + scale_color_discrete(name = "Lag") + xlab("")
EDIT: Here is a version is closer to the SPSS figure:
ggplot(dat, aes(x = pollut, y = or, ymin = lcl, ymax = ucl)) + geom_linerange(aes(col = factor(lag)), position=position_dodge(width=0.30)) +
geom_point(aes(shape = factor(lag)), position=position_dodge(width=0.30)) + ylab("Odds ratio & 95% CI") + geom_hline(aes(yintercept = 1)) + xlab("")
Assuming your data are in datf...
I'd sort it first into just what you want order wise.
datf <- datf[order(datf$pollut, datf$lag), ]
You want a space before and after every lab grouping so I'd add some extra rows in that are NA. That makes it easier because then you'll automatically have blanks in your plot calls.
datfPlusNA <- lapply(split(datf, datf$pollut), function(x) rbind(NA, x, NA))
datf <- do.call(rbind, datfPlusNA)
Now that you have your data.frame sorted and with the extra NAs the plotting is easy.
nr <- nrow(datf) # find out how many rows all together
with(datf, {# this allows entering your commands more succinctly
# first you could set up the plot so you can select the order of drawing
plot(1:nr, or, ylim = c(0.8, 1.3), type = 'n', xaxt = 'n', xlab = '', ylab = 'Odds Ratio and 95% CI', frame.plot = TRUE, panel.first = grid(nx = NA, ny = NULL))
# arrows(1:nr, lcl, 1:nr, ucl, length = 0.02, angle = 90, code = 3, col = factor(lag))
# you could use arrows above but you don't want ends so segments is easier
segments(1:nr, lcl, 1:nr, ucl, col = factor(lag))
# add your points
points(1:nr, or, pch = 19, cex = 0.6)
xLabels <- na.omit(unique(pollut))
axis(1, seq(4, 34, by = 6) - 0.5, xLabels)
})
abline(h = 1.0)
There are packages that make this kind of thing easier but if you can do it like this you can start doing any graphs that you can imagine.

Resources